
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 708–717, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


Neural Naturalist: Generating Fine-Grained Image Comparisons

Maxwell Forbes* (University of Washington)
Christine Kaeser-Chen (Google Research)
Piyush Sharma (Google Research)
Serge Belongie (Cornell University and Cornell Tech)

{christinech,piyushsharma}@google.com

https://mbforbes.github.io/neural-naturalist/

Abstract

We introduce the new Birds-to-Words dataset of 41k sentences describing fine-grained differences between photographs of birds. The language collected is highly detailed, while remaining understandable to the everyday observer (e.g., "heart-shaped face," "squat body"). Paragraph-length descriptions naturally adapt to varying levels of taxonomic and visual distance—drawn from a novel stratified sampling approach—with the appropriate level of detail. We propose a new model called Neural Naturalist that uses a joint image encoding and comparative module to generate comparative language, and evaluate the results with humans who must use the descriptions to distinguish real images. Our results indicate promising potential for neural models to explain differences in visual embedding space using natural language, as well as a concrete path for machine learning to aid citizen scientists in their effort to preserve biodiversity.

1 Introduction

Humans are adept at making fine-grained comparisons, but sometimes require aid in distinguishing visually similar classes. Take, for example, a citizen science effort like iNaturalist,¹ where everyday people photograph wildlife, and the community reaches a consensus on the taxonomic label for each instance. Many species are visually similar (e.g., Figure 1, top), making them difficult for a casual observer to label correctly. This puts an undue strain on lieutenants of the citizen science community to curate and justify labels for a large number of instances. While everyone may be capable of making such distinctions visually, non-experts require training to know what to look for.

* Work done during an internship at Google.
¹ https://www.inaturalist.org

Figure 1: The Birds-to-Words dataset: comparative descriptions adapt naturally to the appropriate level of detail. A difficult distinction (TOP) is given a longer and more fine-grained comparison than an easier one (BOTTOM). Annotators organically use everyday language to refer to body parts.

TOP (perceptual difficulty: high): "Animal 2 looks smaller and has a stouter, darker bill than Animal 1. Animal 2 has black spots on its wings. Animal 2 has a black hood that extends down onto its breast, and the rest of its breast is white with orange only on its sides. In comparison, Animal 1's breast is entirely orange."

BOTTOM (perceptual difficulty: medium): "Animal 2 is brightly red-colored all over, except for a black oval around its beak. Animal 1 has more muted red and grey colors."

Field guides exist for the purpose of helping people learn how to distinguish between species. Unfortunately, field guides are costly to create because writing such a guide requires expert knowledge of class-level distinctions.

In this paper, we study the problem of explaining the differences between two images using natural language. We introduce a new dataset called Birds-to-Words of paragraph-length descriptions of the differences between pairs of bird photographs. We find several benefits from eliciting comparisons: (a) without a guide, annotators naturally break down the subject of the image (e.g., a bird) into pieces understood by the everyday observer (e.g., head, wings, legs); (b) by sampling comparisons from varying visual and taxonomic distances, the language exhibits naturally adaptive granularity of detail based on the distinctions required (e.g., "red body" vs "tiny stripe above its eye"); (c) in contrast to requiring comparisons between categories (e.g., comparing one species vs. another), non-experts can provide high-quality annotations without needing domain expertise.

We also propose the Neural Naturalist model architecture for generating comparisons given two images as input. After embedding images into a latent space with a CNN, the model combines the two image representations with a joint encoding and comparative module before passing them to a Transformer decoder. We find that introducing a comparative module—an additional Transformer encoder—over the combined latent image representations yields better generations.

Our results suggest that these classes of neural models can assist in fine-grained visual domains when humans require aid to distinguish closely related instances. Non-experts—such as amateur naturalists trying to tell apart two species—stand to benefit from comparative explanations. Our work approaches this sweet spot of visual expertise, where any two in-domain images can be compared, and the language is detailed, adaptive to the types of differences observed, and still understandable by laypeople.

Recent work has made impressive progress on context-sensitive image captioning. One direction of work uses class labels as context, with the objective of generating captions that distinguish why the image belongs to one class over others (Hendricks et al., 2016; Vedantam et al., 2017). Another choice is to use a second image as context, and generate a caption that distinguishes one image from another. Previous work has studied ways to generalize single-image captions into comparative language (Vedantam et al., 2017), as well as comparing two images with high pixel overlap (e.g., surveillance footage) (Jhamtani and Berg-Kirkpatrick, 2018). Our work complements these efforts by studying directly comparative, everyday language on image pairs with no pixel overlap.

Our approach outlines a new way for models to aid humans in making visual distinctions. The Neural Naturalist model requires two instances as input; these could be, for example, a query image and an image from a candidate class. By differentiating between these two inputs, a model may help point out subtle distinctions (e.g., one animal has spots on its side), or features that indicate a good match (e.g., only a slight difference in size). These explanations can aid in understanding both differences between species, as well as variance within instances of a single species.

Figure 2: Illustration of the pivot-branch stratified sampling algorithm used to construct the Birds-to-Words dataset. The algorithm harnesses visual and taxonomic distances (increasing vertically) to create a challenging task with broad coverage.

2 Birds-to-Words Dataset

Our goal is to collect a dataset of tuples (i1, i2, t), where i1 and i2 are images, and t is a natural language comparison between the two. Given a domain D, this collection depends critically on the criteria we use to select image pairs.

If we sample image pairs uniformly at random, we will end up with comparisons encompassing a broad range of phenomena. For example, two images that are quite different will yield categorical comparisons ("One is a bird, one is a mushroom."). Alternatively, if the two images are very similar, such as two angles of the same creature, comparisons between them will focus on highly detailed nuances, such as variations in pose. These phenomena support rich lines of research, such as object classification (Deng et al., 2009) and pose estimation (Murphy-Chutorian and Trivedi, 2009).

Table 1: Comparison with recent fine-grained language-and-vision datasets. Lang values: S = scientific, E = everyday, M = mixed. Images Ctx = number of images shown; Images Cap = number of images described in the caption. Dataset citations: R = Reed et al., V = Vedantam et al., J&B = Jhamtani and Berg-Kirkpatrick.

CUB Captions (R, 2016) -- Domain: Birds; Lang: M; Images Ctx: 1; Images Cap: 1. Example: "An all black bird with a very long rectrices and relatively dull bill."
CUB-Justify (V, 2017) -- Domain: Birds; Lang: S; Images Ctx: 7; Images Cap: 1. Example: "The bird has white orbital feathers, a black crown, and yellow tertials."
Spot-the-Diff (J&B, 2018) -- Domain: Surveillance; Lang: E; Images Ctx: 2; Images Cap: 1–2. Example: "Silver car is gone. Person in a white t shirt appears. 3rd person in the group is gone."
Birds-to-Words (this work) -- Domain: Birds; Lang: E; Images Ctx: 2; Images Cap: 2. Example: "Animal1 is gray, while animal2 is white. Animal2 has a long, yellow beak, while animal1's beak is shorter and gray. Animal2 appears to be larger than animal1."

We aim to land somewhere in the middle. We wish to consider sets of distinguishable but intimately related pairs. This sweet spot of visual similarity is akin to the genre of differences studied in fine-grained visual classification (Wah et al., 2011; Krause et al., 2013). We approach this collection with a two-phase data sampling procedure. We first select pivot images by sampling from our full domain uniformly at random. We then branch from these images into a set of secondary images that emphasizes fine-grained comparisons, but yields broad coverage over the set of sensible relations. Figure 2 provides an illustration of our sampling procedure.

2.1 Domain

We sample images from iNaturalist, a citizen science effort to collect research-grade² observations of plants and animals in the wild. We restrict our domain D to instances labeled under the taxonomic CLASS³ Aves (i.e., birds). While a broader domain would yield some comparable instances (e.g., bird and dragonfly share some common body parts), choosing only Aves ensures that all instances will be similar enough structurally to be comparable, and avoids the gut-reaction comparison pointing out the differences in animal type. This choice yields 1.7M research-grade images and corresponding taxonomic labels from iNaturalist. We then perform pivot-branch sampling on this set to choose pairs for annotation.

² Research-grade observations have met or exceeded iNaturalist's guidelines for community consensus of the taxonomic label for a photograph.
³ To disambiguate class, we use CLASS to denote the taxonomic rank in scientific classification, and simply "class" to refer to the machine learning usage of the term as a label in classification.

Figure 3: Annotation lengths for compared datasets (TOP), and statistics for the proposed Birds-to-Words dataset (BOTTOM). The Birds-to-Words dataset has a large mass of long descriptions in comparison to related datasets.

Birds-to-Words dataset statistics:
Image pairs: 3,347
Paragraphs / pair: 4.8
Paragraphs: 16,067
Tokens / paragraph: 32.1 (mean)
Sentences: 40,969
Sentences / paragraph: 2.6 (mean)
Clarity rating: ≥ 4/5
Train / dev / test: 80% / 10% / 10%

2.2 Pivot Images

The Aves domain in iNaturalist contains instances of 9k distinct species, with heavy observation bias toward more common species (such as the mallard duck). We uniformly sample species from the set of 9k to help overcome this bias. In total, we select 405 species and corresponding photographs to use as i1 images.


Figure 4: The proposed Neural Naturalist model architecture. Dense features for each image (i1, i2) are extracted with a shared CNN, combined by a joint encoding (a passthrough of ⟨E1, E2⟩ and/or a mutation M via addition, subtraction, maxpool, or elementwise multiplication), transformed by a comparative module (passthrough or an N-layer Transformer encoder with multi-head attention), and decoded by an n-layer Transformer decoder trained with a per-token cross-entropy loss. The multiplicative joint encoding and Transformer-based comparative module yield the best comparisons between images.

2.3 Branching Images

We use both a visual similarity measure and taxonomy to sample a set of comparison images i2 branching off from each pivot image i1. We use a branching factor of k = 12 from each pivot image.

To capture visually similar images to i1, we employ a similarity function V(i1, i2). We use an Inception-v4 (Szegedy et al., 2017) network pretrained on ImageNet (Deng et al., 2009) and then fine-tuned to perform species classification on all research-grade observations in iNaturalist. We take the embedding for each image from the last layer of the network before the final softmax. We perform a k-nearest neighbor search by quantizing each embedding and using L2 distance (Wu et al., 2017; Guo et al., 2016), selecting the kv = 2 closest images in embedding space.
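For concreteness, a brute-force stand-in for this retrieval step: the paper quantizes embeddings for fast search, while the sketch below just computes exact L2 distances over a precomputed embedding matrix, and the tensor and function names are ours rather than the paper's.

```python
import torch

def visual_neighbors(query: torch.Tensor, index: torch.Tensor, kv: int = 2) -> torch.Tensor:
    """Return indices of the kv nearest embeddings to `query` by L2 distance.

    `query` is one image embedding of shape (f,); `index` holds all candidate
    embeddings, shape (N, f), and is assumed to contain the query itself,
    which is dropped from the results.
    """
    dists = torch.cdist(query.unsqueeze(0), index).squeeze(0)     # (N,) L2 distances
    return torch.topk(dists, kv + 1, largest=False).indices[1:]   # skip the self-match
```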

We also use the iNaturalist scientific taxonomy T(D) to sample images at varying levels of taxonomic distance from i1. We select kt = 10 taxonomically branched images by sampling two images each from the same SPECIES (ℓ = 1), GENUS, FAMILY, ORDER, and CLASS (ℓ = 5) as the pivot. This yields 4,860 raw image pairs (i1, i2).
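Putting the two branches together, here is a minimal sketch of the pivot-branch sampling under simplifying assumptions: `observations` (image id → taxonomy dict) and `nearest_fn` (a visual nearest-neighbor lookup such as the one above) are hypothetical stand-ins for the iNaturalist metadata and embedding index, and the constants follow the text (kv = 2 visual neighbors, two images per taxonomic rank).

```python
import random

TAXONOMIC_RANKS = ("species", "genus", "family", "order", "class")

def branch_from_pivot(pivot_id, observations, nearest_fn, kv=2, per_rank=2, rng=None):
    """Return (i1, i2) pairs for one pivot image: kv visually similar images
    plus per_rank images sampled at each taxonomic distance.

    Note: for brevity this sketch does not exclude lower-rank matches when
    sampling a higher rank, which the real pipeline would likely do.
    """
    rng = rng or random.Random(0)
    pairs = [(pivot_id, nb) for nb in nearest_fn(pivot_id, kv)]   # visual branch
    pivot_tax = observations[pivot_id]
    for rank in TAXONOMIC_RANKS:                                  # taxonomic branch
        pool = [img for img, tax in observations.items()
                if img != pivot_id and tax[rank] == pivot_tax[rank]]
        for other in rng.sample(pool, min(per_rank, len(pool))):
            pairs.append((pivot_id, other))
    return pairs
```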

2.4 Language Collection

For each image pair (i1, i2), we elicit five natural language paragraphs describing the differences between them.

An annotator is instructed to write a paragraph (usually 2–5 sentences) comparing and contrasting the animal appearing in each image. We instruct annotators not to explicitly mention the species (e.g., "Animal 1 is a penguin"), and to instead focus on visual details (e.g., "Animal 1 has a black body and a white belly"). They are additionally instructed to avoid mentioning aspects of the background, scenery, or pose captured in the photograph (e.g., "Animal 2 is perched on a coconut").

We discard all annotations for an image pair where either image did not have at least 4/5 positive ratings of image clarity. This yields a total of 3,347 image pairs, annotated with 16,067 paragraphs. Detailed statistics of the Birds-to-Words dataset are shown in Figure 3, and examples are provided in Figure 5. Further details of both our algorithmic approach and dataset construction are given in Appendices A and B.

3 Neural Naturalist Model

Task: Given two images (i1, i2) as input, our task is to generate a natural language paragraph t = x1 . . . xn that compares the two images.

Architecture: Recent image captioning approaches (Xu et al., 2015; Sharma et al., 2018) extract image features using a convolutional neural network (CNN), which serve as input to a language decoder, typically a recurrent neural network (RNN) (Mikolov et al., 2010) or Transformer (Vaswani et al., 2017). We extend this paradigm with a joint encoding step and comparative module to study how best to encode and transform multiple latent image embeddings. A schematic of the model is outlined in Figure 4, and its key components are described in the upcoming sections.

Figure 5: Samples from the dev split of the proposed Birds-to-Words dataset, along with Neural Naturalist model output (M) and one of five ground-truth paragraphs (G). The second row highlights failure cases in red. The model produces coherent descriptions of variable granularity, though emphasis and assignment can be improved.

3.1 Image Embedding

Both input images are first processed using CNNs with shared weights. In this work, we consider ResNet (He et al., 2016) and Inception (Szegedy et al., 2017) architectures. In both cases, we extract the representation from the deepest layer immediately before the classification layer. This yields a dense 2D grid of local image feature vectors, shaped (d, d, f). We then flatten each feature grid into a (d², f)-shaped matrix:

E1 = ⟨e1_{1,1}, …, e1_{d,d}⟩ = CNN(i1)
E2 = ⟨e2_{1,1}, …, e2_{d,d}⟩ = CNN(i2)
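As a concrete illustration, here is a minimal PyTorch sketch of this step, assuming a torchvision ResNet-50 trunk with its pooling and classification layers removed; the backbone choice and input size are placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Shared-weight CNN trunk (global pooling and classifier removed), standing in
# for the paper's ResNet/Inception feature extractor. Uses torchvision >= 0.13.
backbone = models.resnet50(weights=None)
cnn_trunk = nn.Sequential(*list(backbone.children())[:-2])

def embed_image(images: torch.Tensor) -> torch.Tensor:
    """Map a batch of images (B, 3, H, W) to a flattened grid of local
    feature vectors shaped (B, d*d, f)."""
    grid = cnn_trunk(images)                  # (B, f, d, d)
    return grid.flatten(2).transpose(1, 2)    # (B, d*d, f)

# E1 and E2 come from the same network, i.e., shared weights.
i1, i2 = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
E1, E2 = embed_image(i1), embed_image(i2)
print(E1.shape)  # torch.Size([1, 49, 2048]) for 224x224 inputs
```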

3.2 Joint Encoding

We define a joint encoding J of the images which contains both embedded images (E1, E2), a mutated combination (M), or both. We consider as possible mutations M ∈ {E1 + E2, E1 − E2, max(E1, E2), E1 ⊙ E2}. We try these encoding variants to explore whether simple mutations can effectively combine the image representations.
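The mutations themselves are simple tensor operations; the sketch below (the mode names are ours, not the paper's) shows one way to produce the variants of J used in the ablations.

```python
import torch

def joint_encoding(E1: torch.Tensor, E2: torch.Tensor, mode: str) -> torch.Tensor:
    """Combine two flattened feature grids (B, L, f) into a joint encoding J."""
    mutations = {
        "add": E1 + E2,
        "sub": E1 - E2,
        "max": torch.maximum(E1, E2),
        "mul": E1 * E2,                       # elementwise multiplication (⊙)
    }
    if mode == "images":                      # J = <E1, E2>
        return torch.cat([E1, E2], dim=1)
    if mode in mutations:                     # J = M
        return mutations[mode]
    if mode.startswith("images+"):            # J = <E1, E2, M>
        return torch.cat([E1, E2, mutations[mode.split("+", 1)[1]]], dim=1)
    raise ValueError(f"unknown joint encoding mode: {mode}")

J = joint_encoding(torch.randn(1, 49, 2048), torch.randn(1, 49, 2048), "mul")
```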


3.3 Comparative Module

Given the joint encoding of the images (J), we would like to represent the differences in feature space (C) in order to generate comparative descriptions. We explore two variants at this stage. The first is a direct passthrough of the joint encoding (C = J). This is analogous to "standard" CNN+LSTM architectures, which embed images and pass them directly to an LSTM for decoding. Because we try different joint encodings, a passthrough here also allows us to study their effects in isolation.

Our second variant is an N-layer Transformer encoder. This provides additional self-attentive mutations over the latent representations J. Each layer contains a multi-headed attention mechanism (ATTN_MH). The intent is that self-attention in Transformer encoder layers will guide comparisons across the joint image encoding.

Denoting LN as Layer Norm and FF as Feed Forward, with C_i as the output of the i-th layer of the Transformer encoder, C_0 = J, and C = C_N:

C^H_i = LN(C_{i−1} + ATTN_MH(C_{i−1}))
C_i = LN(C^H_i + FF(C^H_i))
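PyTorch's stock Transformer encoder implements this post-norm residual attention/feed-forward structure, so the comparative module can be sketched directly with it; the dimensions and layer counts below are illustrative, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

f, n_heads, n_layers = 512, 8, 6   # illustrative sizes, not the paper's settings

encoder_layer = nn.TransformerEncoderLayer(
    d_model=f, nhead=n_heads, dim_feedforward=2048, batch_first=True)
comparative_module = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

J = torch.randn(2, 98, f)   # a joint encoding, e.g. <E1, E2> projected to f dims
C = comparative_module(J)   # same shape as J; the passthrough variant is simply C = J
```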

3.4 Decoder

We use an N-layer Transformer decoder architecture to produce distributions over output tokens. The Transformer decoder is similar to an encoder, but it contains an intermediary multi-headed attention which has access to the encoder's output C at every time step.

D^{H1}_i = LN(X + ATTN_{MASK,MH}(X))
D^{H2}_i = LN(D^{H1}_i + ATTN_MH(D^{H1}_i, C))
D_i = LN(D^{H2}_i + FF(D^{H2}_i))

Here we denote the text observed during training as X, which is modulated with a position-based encoding and masked in the first multi-headed attention.
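A sketch of the decoder under teacher forcing with the per-token cross-entropy loss, again built from stock PyTorch modules; the vocabulary size, learned (rather than sinusoidal) position embeddings, and other dimensions are placeholder choices rather than the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, f, n_heads, n_layers, max_len = 8000, 512, 8, 6, 64   # placeholders

token_embed = nn.Embedding(vocab, f)
pos_embed = nn.Embedding(max_len, f)
dec_layer = nn.TransformerDecoderLayer(d_model=f, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
to_logits = nn.Linear(f, vocab)

def decode_step(tokens: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Teacher-forced pass: tokens (B, T) attend to the comparative encoding C."""
    T = tokens.shape[1]
    X = token_embed(tokens) + pos_embed(torch.arange(T, device=tokens.device))
    causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
    D = decoder(tgt=X, memory=C, tgt_mask=causal_mask)   # masked self-attn + cross-attn to C
    return to_logits(D)                                  # (B, T, vocab)

tokens = torch.randint(0, vocab, (2, 10))
logits = decode_step(tokens, torch.randn(2, 98, f))
# Targets would normally be the inputs shifted by one; we reuse `tokens` only
# to keep the example short.
loss = F.cross_entropy(logits.transpose(1, 2), tokens)   # per-token cross-entropy
```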

4 Experiments

We train the Neural Naturalist model to produce descriptions of the differences between images in the Birds-to-Words dataset. We partition the dataset into train (80%), val (10%), and test (10%) sections by splitting based on the pivot images i1. This ensures i1 species are unique across the different splits.
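A minimal sketch of this species-disjoint split, assuming a hypothetical `pivot_species` lookup from pivot image id to species label:

```python
import random

def split_by_pivot_species(examples, pivot_species, seed=0):
    """Split (i1, i2, text) examples 80/10/10 so that each pivot species lands
    in exactly one of train/val/test."""
    species = sorted({pivot_species[i1] for i1, _, _ in examples})
    random.Random(seed).shuffle(species)
    n = len(species)
    cut1, cut2 = int(0.8 * n), int(0.9 * n)
    assign = {s: "train" if i < cut1 else "val" if i < cut2 else "test"
              for i, s in enumerate(species)}
    splits = {"train": [], "val": [], "test": []}
    for ex in examples:
        splits[assign[pivot_species[ex[0]]]].append(ex)
    return splits
```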

We provide model hyperparameters and optimization details in Appendix C.

4.1 Baselines and Variants

The most frequent paragraph baseline produces only the most observed description in the training data, which is that the two animals appear to be exactly the same. Text-Only samples captions from the training data according to their empirical distribution. Nearest Neighbor embeds both images and computes the lowest total L2 distance to a training set pair, sampling a caption from it. We include two standard neural baselines, CNN (+ Attention) + LSTM, which concatenate the image embeddings, optionally perform attention, and decode with an LSTM. The main model variants we consider are a simple joint encoding (J = ⟨E1, E2⟩), no comparative module (C = J), a small (1-layer) decoder, and our full Neural Naturalist model. We also try several other ablations and model variants, which we describe later.

4.2 Quantitative Results

Automatic Metrics: We evaluate our model using three machine-graded text metrics: BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and CIDEr-D (Vedantam et al., 2015). Each generated paragraph is compared to all five reference paragraphs.

For human performance, we use a one-vs-rest scheme to hold one reference paragraph out and compute its metric using the other four. We average this score across twenty-five runs over the entire split in question.
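In code, the one-vs-rest scheme looks roughly like the following; `corpus_metric` is a stand-in for whichever metric implementation (BLEU-4, ROUGE-L, or CIDEr-D) is being averaged.

```python
import random
import statistics

def human_one_vs_rest(reference_sets, corpus_metric, runs=25, seed=0):
    """Estimate a human score for a split: for every image pair, hold out one
    of its (five) reference paragraphs as the hypothesis, score it against the
    remaining references, and average the corpus-level metric over runs.

    `corpus_metric(hypotheses, references_per_example)` is any corpus metric.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        hyps, refs = [], []
        for paragraphs in reference_sets:            # five paragraphs per image pair
            held = rng.randrange(len(paragraphs))
            hyps.append(paragraphs[held])
            refs.append([p for j, p in enumerate(paragraphs) if j != held])
        scores.append(corpus_metric(hyps, refs))
    return statistics.mean(scores), statistics.stdev(scores)
```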

Results using these metrics are given in Table 2 for the main baselines and model variants. We observe improvement across BLEU-4 and ROUGE-L scores compared to baselines. Curiously, we observe that the CIDEr-D metric is susceptible to common patterns in the data; our model, when stopped at its highest CIDEr-D score, outputs a variant of "these animals appear exactly the same" for 95% of paragraphs, nearly mimicking the behavior of the most frequent paragraph (Freq.) baseline. The corpus-level behavior of CIDEr-D gives these outputs a higher score. We observed anecdotally that higher-quality outputs correlated with ROUGE-L score, which we verify using a human evaluation (described below).


                                            Dev                        Test
                                  BLEU-4  ROUGE-L  CIDEr-D   BLEU-4  ROUGE-L  CIDEr-D
Most Frequent                       0.20    0.31     0.42      0.20    0.30     0.43
Text-Only                           0.14    0.36     0.05      0.14    0.36     0.07
Nearest Neighbor                    0.18    0.40     0.15      0.14    0.36     0.06
CNN + LSTM (Vinyals et al., 2015)   0.22    0.40     0.13      0.20    0.37     0.07
CNN + Attn. + LSTM (Xu et al., 2015) 0.21   0.40     0.14      0.19    0.38     0.11
Neural Naturalist – Simple Joint Encoding   0.23  0.44  0.23     -       -       -
Neural Naturalist – No Comparative Module   0.09  0.27  0.09     -       -       -
Neural Naturalist – Small Decoder           0.22  0.42  0.25     -       -       -
Neural Naturalist – Full                    0.24  0.46  0.28    0.22    0.43    0.25
Human                    0.26±0.02  0.47±0.01  0.39±0.04    0.27±0.01  0.47±0.01  0.42±0.03

Table 2: Experimental results for comparative paragraph generation on the proposed dataset. For human captions, mean and standard deviation are given for a one-vs-rest scheme across twenty-five runs. We observed that CIDEr-D scores had little correlation with description quality. The Neural Naturalist model benefits from a strong joint encoding and Transformer-based comparative module, achieving the highest BLEU-4 and ROUGE-L scores.

Ablations and Model Variants: We ablate and vary each of the main model components, running the automatic metrics to study coarse changes in the model's behavior. Results for these experiments are given in Table 3. For the joint encoding, we try combinations of four element-wise operations with and without both encoded images. To study the comparative module in greater detail, we examine its effect on the top three joint encodings: ⟨i1, i2, −⟩, −, and ⊙. After fixing the best joint encoding and comparative module, we also try variations of the decoder (Transformer depth), as well as decoding algorithms (greedy decoding, multinomial sampling, and beamsearch).

Overall, we see that the choice of joint encoding requires a balance with the choice of comparative module. More disruptive joint encodings (like element-wise multiplication ⊙) appear too destructive when passed directly to a decoder, but yield the best performance when paired with a deep comparative module. Others (like subtraction) function moderately well on their own, and are further improved when a comparative module is introduced.

Human Evaluation: To verify our observations about model quality and automatic metrics, we also perform a human evaluation of the generated paragraphs. We sample 120 instances from the test set, taking twenty each from the six categories for choosing comparative images (visual similarity in embedding space, plus five taxonomic distances). We provide annotators with the two images in a random order, along with the output from the model at hand. Annotators must decide which image contains Animal 1 and which contains Animal 2, or they may say that there is no way to tell (e.g., for a description like "both look exactly the same").

We collect three annotations per datum, and score a decision only if ≥ 2/3 annotators made that choice. A model receives +1 point if annotators decide correctly, 0 if they cannot decide or agree there is no way to tell, and -1 point if they decide incorrectly (label the images backwards). This scheme penalizes a model for confidently writing incorrect descriptions. The total score is then normalized to the range [−1, 1]. Note that Human uses one of the five gold paragraphs sampled at random.
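A sketch of this scoring rule; the vote labels are hypothetical names for the three annotator choices.

```python
from collections import Counter

def score_instance(votes, correct_label):
    """Score one test instance from three annotator votes. Each vote is one of
    "order_a", "order_b", or "cant_tell"; `correct_label` is the true ordering.
    Returns +1 (majority correct), 0 (no majority or can't tell), or -1."""
    label, count = Counter(votes).most_common(1)[0]
    if count < 2 or label == "cant_tell":
        return 0
    return 1 if label == correct_label else -1

def human_eval_score(all_votes, all_labels):
    """Average of per-instance points; already lies in [-1, 1]."""
    points = [score_instance(v, y) for v, y in zip(all_votes, all_labels)]
    return sum(points) / len(points)

# e.g. human_eval_score([["order_a", "order_a", "cant_tell"]], ["order_a"]) == 1.0
```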

Results for this experiment are shown in Table 4. In this measure, we see the frequency and text-only baselines now fall flat, as expected. The frequency baseline never receives any points, and the text-only baseline is often penalized for incorrectly guessing. Our model is successful at making distinctions between visually distinct species (GENUS column and ones further right), which is near the challenge level of current fine-grained visual classification tasks. However, it struggles on the two data subsets with highest visual similarity (VISUAL, SPECIES). The significant gap between all methods and human performance in these columns indicates ultra fine-grained distinctions are still possible for humans to describe, but pose a challenge for current models to capture.

4.3 Qualitative Analysis

In Figure 5, we present several examples of the model output for pairs of images in the dev set, along with one of the five reference paragraphs. In the following section, we split an analysis of the model into two parts: largely positive findings, as well as common error cases.

Table 3: Variants and ablations for the Neural Naturalist model (dev split; scores reported as BLEU-4 / ROUGE-L / CIDEr-D). We find the best performing combination is an elementwise multiplication (⊙) for the joint encoding, a 6-layer Transformer comparative module, a 6-layer Transformer decoder, and using beamsearch to perform inference.

Joint encoding variants (6-layer Transformer comparative module, 6-layer Transformer decoder, beamsearch):
⟨i1, i2⟩                 0.23 / 0.44 / 0.23
−                        0.23 / 0.45 / 0.27
+                        0.24 / 0.43 / 0.28
max                      0.23 / 0.43 / 0.24
⊙                        0.24 / 0.46 / 0.28
⟨i1, i2, −⟩              0.22 / 0.44 / 0.22
⟨i1, i2, +⟩              0.22 / 0.42 / 0.25
⟨i1, i2, max⟩            0.21 / 0.42 / 0.22
⟨i1, i2, ⊙⟩              0.22 / 0.43 / 0.23
⟨i1, i2, −, +, max, ⊙⟩   0.21 / 0.43 / 0.20

Comparative module variants (6-layer Transformer decoder, beamsearch):
⊙ joint encoding:            Passthrough 0.00 / 0.02 / 0.00;  1-L Transformer 0.24 / 0.44 / 0.27;  3-L 0.24 / 0.44 / 0.27;  6-L 0.24 / 0.46 / 0.28
− joint encoding:            Passthrough 0.22 / 0.40 / 0.22;  1-L Transformer 0.21 / 0.41 / 0.26;  3-L 0.22 / 0.41 / 0.22;  6-L 0.23 / 0.45 / 0.27
⟨i1, i2, −⟩ joint encoding:  Passthrough 0.09 / 0.27 / 0.09;  1-L Transformer 0.24 / 0.43 / 0.24;  3-L 0.22 / 0.42 / 0.26;  6-L 0.22 / 0.44 / 0.22

Decoder depth (⊙ joint encoding, 6-layer Transformer comparative module, beamsearch):
1-L decoder 0.22 / 0.42 / 0.25;  3-L decoder 0.23 / 0.42 / 0.25;  6-L decoder 0.24 / 0.46 / 0.28

Decoding algorithm (⊙ joint encoding, 6-layer comparative module, 6-layer decoder):
Greedy 0.21 / 0.44 / 0.18;  Multinomial 0.20 / 0.42 / 0.16;  Beamsearch 0.24 / 0.46 / 0.28

Positive Findings

We find that the model exhibits dynamic granularity, by which we mean that it adjusts the magnitude of the descriptions based on the scale of differences between the two animals. If two animals are quite similar, it generates fine-grained descriptions such as, "Animal 2 has a slightly more curved beak than Animal 1," or "Animal 1 is more iridescent than Animal 2." If instead the two animals are very different, it will generate text describing larger-scale differences, like, "Animal 1 has a much longer neck than Animal 2," or "Animal 1 is mostly white with a black head. Animal 2 is almost completely yellow."

We also observe that the model is able to produce coherent paragraphs of varying linguistic structure. These include a range of comparisons set up across both single and multiple sentences. For example, it generates straightforward comparisons of the form, Animal 1 has X, while Animal 2 has Y. But it also generates contrastive expressions with longer dependencies, such as Animal 1 is X, Y, and Z. Animal 2 is very similar, except W. Furthermore, the model will mix and match different comparative structures within a single paragraph.

Finally, in addition to varying linguistic structure, we find the model is able to produce coherent semantics through a series of statements. For example, consider the following full output: "Animal 1 has a very long neck compared to Animal 2. Animal 1 has shorter legs than Animal 2. Animal 1 has a black beak, Animal 2 has a brown beak. Animal 1 has a yellow belly. Animal 2 has darker wings than Animal 1." The range of concepts in the output covers neck, legs, beak, belly, and wings without repeating any topic or getting sidetracked.

                      VISUAL  SPECIES  GENUS  FAMILY  ORDER  CLASS
Freq.                   0.00    0.00    0.00    0.00   0.00   0.00
Text-Only               0.00   -0.10   -0.05    0.00   0.15  -0.15
CNN + LSTM             -0.15    0.20    0.15    0.50   0.40   0.15
CNN + Attn. + LSTM      0.15    0.15    0.15   -0.05   0.05   0.20
Neural Naturalist       0.10   -0.10    0.35    0.40   0.45   0.55
Human                   0.55    0.55    0.85    1.00   1.00   1.00

Table 4: Human evaluation results on 120 test set samples, twenty per column. Scale: -1 (perfectly wrong) to 1 (perfectly correct). Columns are ordered left-to-right by increasing distance. Our model outperforms baselines for several distances, though highly similar comparisons still prove difficult.

Error Analysis

We also observe several patterns in the model's shortcomings. The most prominent error case is that the model will sometimes hallucinate differences (Figure 5, bottom row). These range from pointing out significant changes that are missing (e.g., "a black head" where there is none (Fig. 5, bottom left)), to clawing at subtle distinctions where there are none (e.g., "[its] colors are brighter . . . and [it] is a bit bigger" (Fig. 5, bottom right)). We suspect that the model has learned some associations between common features in animals, and will sometimes favor these associations over visual evidence.

The second common error case is missing obvious distinctions. This is observed in Fig. 5 (bottom middle), where the prominent beak of Animal 1 is ignored by the model in favor of mundane details. While outlying features make for lively descriptions, we hypothesize that the model may sometimes avoid taking them into account given its per-token cross-entropy learning objective.

Finally, we also observe the model sometimes swaps which features are attributed to which animal. This is partially observed in Fig. 5 (bottom left), where the "black head" actually belongs to Animal 1, not Animal 2. We suspect that mixing up references may be a trade-off for the representational power of attending over both images; there is no explicit bookkeeping mechanism to enforce which phrases refer to which feature comparisons in each image.

5 Related Work

Employing visual comparisons to elicit focused natural language observations was proposed by Maji (2012), and later investigated in the context of crowdsourcing by Zou et al. (2015). We take inspiration from these works.

Previous work has collected natural language captions of bird photographs: CUB Captions (Reed et al., 2016) and CUB-Justify (Vedantam et al., 2017) are both language annotations on top of the CUB-2011 dataset of bird photographs (Wah et al., 2011). In addition to describing two photos instead of one, the language in our dataset is more complex by comparison, containing a diversity of comparative structures and implied semantics. We also collect our data without an anatomical guide for annotators, yielding everyday language in place of scientific terminology.

Conceptually, our paper offers a complementary approach to works that generate single-image class-discriminative or image-discriminative captions (Hendricks et al., 2016; Vedantam et al., 2017). Rather than discriminative captioning, we focus on comparative language as a means for bridging the gap between varying granularities of visual diversity.

Methodologically, our work is most closely related to the Spot-the-diff dataset (Jhamtani and Berg-Kirkpatrick, 2018). While that dataset captions two images with only a small section of pixels that change (surveillance footage), we consider image pairs with no pixel overlap, which motivates our stratified sampling approach for drawing good comparisons.

Finally, the recently released NLVR2 dataset (Suhr et al., 2018) introduces a challenging natural language reasoning task using two images as context. Our work instead focuses on generating comparative language rather than reasoning.

6 Conclusion

We present the new Birds-to-Words dataset and Neural Naturalist model for generating comparative explanations of fine-grained visual differences. This dataset features paragraph-length, adaptively detailed descriptions written in everyday language. We hope that continued study of this area will produce models that can aid humans in critical domains like citizen science.


References

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database.

Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016. Quantization based fast inner product search. In Artificial Intelligence and Statistics, pages 482–490.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating visual explanations. In European Conference on Computer Vision, pages 3–19. Springer.

Harsh Jhamtani and Taylor Berg-Kirkpatrick. 2018. Learning to describe differences between pairs of similar images. arXiv preprint arXiv:1808.10584.

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 2013. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Subhransu Maji. 2012. Discovering a lexicon of parts and attributes. In European Conference on Computer Vision, pages 21–30. Springer.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Erik Murphy-Chutorian and Mohan Manubhai Trivedi. 2009. Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):607–626.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. 2016. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.

Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and Yoav Artzi. 2018. A corpus for reasoning about natural language grounded in photographs. CoRR, abs/1811.00491.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. 2017. Context-aware captions from context-agnostic supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 251–260.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. 2011. The Caltech-UCSD Birds-200-2011 dataset.

Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N. Holtmann-Rice, David Simcha, and Felix Yu. 2017. Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems, pages 5745–5755.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

James Y. Zou, Kamalika Chaudhuri, and Adam Tauman Kalai. 2015. Crowdsourcing feature discovery via adaptively chosen comparisons. arXiv preprint arXiv:1504.00064.