

Appendix

0.1 Data augmentation

Fig. 1 shows some examples of augmented MSCOCO images and captions. We perform image-level augmentation on the input images of a Faster R-CNN (pretrained on Visual Genome¹) and apply RoI-level augmentation to the bounding boxes/RoIs detected by the Faster R-CNN. For sentence augmentation, we use the transformer-based neural machine translation models [1] pretrained on WMT'19² to perform back-translation. For ground-truth dependency trees, we parse each sentence with the dependency parser provided by Stanza³.
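As a rough sketch of these augmentation steps (the helper names `jitter_rois`, `hflip_rois`, and `back_translate` are ours, and the `torch.hub` arguments for the WMT'19 checkpoints follow the fairseq README rather than our exact pipeline), the RoI-level and sentence-level augmentations could look like this:

```python
import random

import torch


def jitter_rois(rois, scale_range=(0.9, 1.1)):
    """Randomly rescale each RoI (x1, y1, x2, y2) about its center.

    The scale ranges correspond to the RoI Jitter [0.9, 1.1] and
    [0.8, 1.2] settings shown in Fig. 1.
    """
    out = []
    for x1, y1, x2, y2 in rois:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        w, h = x2 - x1, y2 - y1
        s = random.uniform(*scale_range)
        out.append((cx - s * w / 2, cy - s * h / 2,
                    cx + s * w / 2, cy + s * h / 2))
    return out


def hflip_rois(rois, image_width):
    """Mirror RoIs to match a horizontally flipped image."""
    return [(image_width - x2, y1, image_width - x1, y2)
            for x1, y1, x2, y2 in rois]


# Back-translation (En->De->En) with the WMT'19 checkpoints via fairseq's
# torch.hub interface; arguments follow the fairseq README.
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                       checkpoint_file='model1.pt', tokenizer='moses', bpe='fastbpe')
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en',
                       checkpoint_file='model1.pt', tokenizer='moses', bpe='fastbpe')


def back_translate(caption):
    return de2en.translate(en2de.translate(caption))


# Dependency parse of an (augmented) caption with Stanza.
import stanza
stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
doc = nlp(back_translate('A dog standing in the grass near a flying frisbee.'))
for word in doc.sentences[0].words:
    print(word.id, word.text, word.head, word.deprel)
```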

[Figure 1 panels: for two MSCOCO images (COCO_val2014_000000037871 and COCO_val2014_000000057703), the original and horizontally flipped images with RoI-level augmentations (RoI Jitter [0.9, 1.1], RoI Jitter [0.8, 1.2], RoI Horizontal Flip, RoI Rotate 90/180/270), together with the original captions and their En-De-En / En-Ru-En back-translations, e.g. "A dog standing in the grass near a flying frisbee." → "A dog stands in the grass next to a flying Frisbee." / "A dog standing in the grass next to a flying Frisbee."]

Figure 1: Examples of augmented MSCOCO images and captions. For each augmented image, we show the object labels at the centers of the respective bounding boxes for better visualization. We apply per-category non-maximum suppression to the raw bounding boxes detected by Faster R-CNN.

¹ https://github.com/airsplay/py-bottom-up-attention
² https://github.com/pytorch/fairseq
³ https://github.com/stanfordnlp/stanza


0.2 Implementation details

0.2.1 Details of pretraining

Fig. 2 illustrates the visual-textual alignment mechanisms of the three variants of our proposed SSRP. For SSRP-Cross, we take the final hidden state of [CLS] to predict whether the sentence semantically matches the image. For SSRP-Share and SSRP-Visual, since they do not have the bidirectional cross-attention of SSRP-Cross, we take $\sum_i \mathbf{v}_i / N_v$ as an additional input and concatenate it with $\mathbf{w}_{\mathrm{CLS}}$ to generate the visual-textual alignment prediction using $g_{\mathrm{align}}(\cdot)$.
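As a minimal sketch of this alignment prediction for SSRP-Share and SSRP-Visual (the module name, hidden size, and two-layer form of $g_{\mathrm{align}}$ are assumptions, not the exact architecture):

```python
import torch
import torch.nn as nn


class AlignmentHead(nn.Module):
    """Sketch of g_align(.): image-sentence alignment prediction."""

    def __init__(self, hidden_dim=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),  # aligned / not aligned
        )

    def forward(self, v, w_cls):
        # v:     (batch, N_v, hidden_dim) contextualized region features
        # w_cls: (batch, hidden_dim)      final hidden state of [CLS]
        v_pooled = v.mean(dim=1)  # sum_i v_i / N_v
        return self.classifier(torch.cat([v_pooled, w_cls], dim=-1))
```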

[Figure 2 panels: for each SSRP variant, masked caption tokens ([CLS] A [MASK] man standing [MASK] ... [SEP]) and masked region features pass through Intra-Modality Encoders and an Inter-Modality Encoder, and an Aligned/Not Aligned classifier produces the alignment prediction.]

Figure 2: An illustration of the mechanism used for obtaining the visual-textual alignment representation $\mathbf{f}_{\mathrm{align}}$ for each of the three SSRP variants. These three SSRP variants can be used to facilitate the fine-tuning for different downstream tasks. Note that SSRP-Cross can only support visual-textual multi-modal downstream tasks such as VQA, while SSRP-Share and SSRP-Visual can support not only multi-modal downstream tasks but also single-modal visual tasks such as image captioning.

0.2.2 Details of NLVR2 fine-tuning

NLVR2 is a challenging visual reasoning task. It requires the model to determine whether the natural language statement $S$ is true about an image pair $\langle I_i, I_j \rangle$. During both fine-tuning and testing, we feed the alignment representations of the two images and the probed relationships to a binary classifier. The predicted probability is computed as:

$$p(I_i, I_j, S) = \sigma\big(f_{\mathrm{FC}}(f_p([\mathbf{q}_i; \mathbf{q}_j]))\big) \tag{A.1}$$
$$\mathbf{q}_k = f_q\big([\mathbf{f}^k_{\mathrm{align}}; f_{vw}([\mathbf{R}^v_k; \mathbf{R}^w_k])]\big), \quad k \in \{i, j\} \tag{A.2}$$

where $\mathbf{f}^k_{\mathrm{align}}$, $\mathbf{R}^v_k$, and $\mathbf{R}^w_k$ are the outputs of $\mathrm{SSRP}(I_k, S)$, and $\sigma$ denotes the sigmoid activation function. The nonlinear transformation functions $f_{vw}$, $f_q$, $f_p$ and the linear FC layer $f_{\mathrm{FC}}$ have learnable weights.

For baseline models that do not consider relationships, the predicted probability is computed as:

$$p(I_i, I_j, S) = \sigma\big(f_{\mathrm{FC}}(f_p([\mathbf{f}^i_{\mathrm{align}}; \mathbf{f}^j_{\mathrm{align}}]))\big) \tag{A.3}$$

We fine-tune all models (including SSRP) with sigmoid binary cross-entropy loss.
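A sketch of this NLVR2 head follows, assuming the probed relationship outputs $\mathbf{R}^v_k$ and $\mathbf{R}^w_k$ have already been flattened or pooled into vectors of the shared hidden size, and that $f_{vw}$, $f_q$, $f_p$ are single-layer MLPs (their widths and activations are not specified in the text):

```python
import torch
import torch.nn as nn


def mlp(in_dim, out_dim):
    """Assumed form of the nonlinear transformations f_vw, f_q, f_p."""
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU())


class NLVR2Head(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.f_vw = mlp(2 * dim, dim)  # fuses R^v_k and R^w_k
        self.f_q = mlp(2 * dim, dim)   # fuses f^k_align with the fused relationships
        self.f_p = mlp(2 * dim, dim)   # fuses q_i and q_j
        self.f_fc = nn.Linear(dim, 1)  # linear FC layer

    def encode(self, f_align_k, r_v_k, r_w_k):
        # Eq. (A.2): q_k = f_q([f^k_align ; f_vw([R^v_k ; R^w_k])])
        rel = self.f_vw(torch.cat([r_v_k, r_w_k], dim=-1))
        return self.f_q(torch.cat([f_align_k, rel], dim=-1))

    def forward(self, f_align_i, r_v_i, r_w_i, f_align_j, r_v_j, r_w_j):
        q_i = self.encode(f_align_i, r_v_i, r_w_i)
        q_j = self.encode(f_align_j, r_v_j, r_w_j)
        # Eq. (A.1): p(I_i, I_j, S) = sigmoid(f_FC(f_p([q_i ; q_j])))
        return torch.sigmoid(self.f_fc(self.f_p(torch.cat([q_i, q_j], dim=-1))))


# Fine-tuning uses sigmoid binary cross-entropy, e.g.
# loss = nn.functional.binary_cross_entropy(head(...).squeeze(-1), labels.float())
```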

0.2.3 Details of VQA/GQA fine-tuning

VQA requires the model to answer a natural language question $Q$ related to an image $I$. We conduct experiments on the VQA v2.0 dataset. We fine-tune our model on the train split using sigmoid binary cross-entropy loss and evaluate it on the test-standard split. Note that VQA is based on the MSCOCO image corpus, but the questions have never been seen by the model during training. During fine-tuning, we feed the region features and the given question into SSRP-Cross, and then output the alignment representation and the probed relationships that are fed to a classifier for answer prediction:

$$p(I, Q) = \sigma\big(f_{\mathrm{FC}}(f_p(\mathbf{q}))\big) \tag{A.4}$$
$$\mathbf{q} = f_q\big([\mathbf{f}_{\mathrm{align}}; f_{vw}([\mathbf{R}^v; \mathbf{R}^w])]\big) \tag{A.5}$$

where $\mathbf{f}_{\mathrm{align}}$, $\mathbf{R}^v$, and $\mathbf{R}^w$ are the outputs of SSRP-Cross, and $\sigma$ denotes the sigmoid activation function. The nonlinear transformation functions $f_{vw}$, $f_q$, $f_p$ and the linear FC layer $f_{\mathrm{FC}}$ have learnable weights.
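The VQA/GQA head of Eqs. (A.4)-(A.5) can be sketched in the same way; here the linear layer $f_{\mathrm{FC}}$ scores an answer vocabulary, and the vocabulary size and MLP widths below are assumptions:

```python
import torch
import torch.nn as nn


class VQAHead(nn.Module):
    def __init__(self, dim=768, num_answers=3129):  # answer-vocabulary size is assumed
        super().__init__()
        self.f_vw = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())
        self.f_q = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())
        self.f_p = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.f_fc = nn.Linear(dim, num_answers)

    def forward(self, f_align, r_v, r_w):
        # Eq. (A.5): q = f_q([f_align ; f_vw([R^v ; R^w])])
        q = self.f_q(torch.cat([f_align, self.f_vw(torch.cat([r_v, r_w], dim=-1))], dim=-1))
        # Eq. (A.4): p(I, Q) = sigmoid(f_FC(f_p(q)))
        return torch.sigmoid(self.f_fc(self.f_p(q)))


# The per-answer scores are trained with sigmoid binary cross-entropy, as above.
```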


0.2.4 Details of image captioning

For image captioning, we use only the image branch of SSRP-Visual, and feed the unmasked image features into SSRP-Visual. For each input image, we first extract the contextualized visual representations $\mathbf{v}_{1:N_v}$ and the implicit visual relationships $\mathbf{R}^v$ from the pretrained SSRP-Visual. The inputs to the image captioning model are the refined object features $\mathbf{v}_{1:N_v}$ and the probed relationships $\mathbf{R}^v$. We treat $\mathbf{R}^v$ as a global representation for the image. We set the number of hidden units of each LSTM to 1000 and the number of hidden units in the attention layer to 512. We first optimize the model on one Tesla V100 GPU using cross-entropy loss, with an initial learning rate of 5e-4, a momentum parameter of 0.9, and a batch size of 100 for 40 epochs. After that, we further train the model to optimize it directly for the CIDEr score [2] for another 100 epochs. During testing, we adopt beam search with a beam size of 5. We apply the same training and testing settings to Up-Down (Our Impl.) and SSRP-Visual.
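For reference, these settings can be collected into a single configuration sketch (the field names are ours; only the values come from the text):

```python
# Hypothetical configuration mirroring the captioning settings described above.
CAPTIONING_CONFIG = {
    "decoder": {
        "lstm_hidden_units": 1000,       # per LSTM
        "attention_hidden_units": 512,   # attention layer
    },
    "stage1_cross_entropy": {
        "initial_learning_rate": 5e-4,
        "momentum": 0.9,
        "batch_size": 100,
        "epochs": 40,
    },
    "stage2_cider_optimization": {       # self-critical training for CIDEr [2]
        "epochs": 100,
    },
    "inference": {
        "beam_size": 5,
    },
    "hardware": "1x Tesla V100",
}
```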

0.2.5 Details of image retrieval

For image retrieval, we also feed the unmasked image features into SSRP-Visual and obtain the refined contextualized visual representations along with the implicit visual relationships. We conduct the retrieval experiment on the MSCOCO validation set. We randomly sample the query images and retrieve the top images according to their cosine similarities against the queries.

We compare two kinds of methods: one that uses the contextualized visual representations $\mathbf{v}_{1:N_v}$, and another that uses both the contextualized visual representations $\mathbf{v}_{1:N_v}$ and the implicit visual relationships $\mathbf{R}^v$. For the 'Obj. + Rel.' approach, we use the relationship-enhanced visual features obtained with $\frac{1}{N_v}\sum_i \frac{1}{N_v}\sum_k \mathbf{v}_i \, d_{B^v}(\mathbf{v}_i, \mathbf{v}_k)^2$. For the 'Obj.' approach, we simply average the contextualized object features with $\frac{1}{N_v}\sum_i \mathbf{v}_i$. Fig. 3 shows the pipeline for the image retrieval task.
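A sketch of the two pooling schemes and the cosine-similarity ranking (assuming $\mathbf{R}^v$ is available as a matrix of squared probe distances $d_{B^v}(\mathbf{v}_i, \mathbf{v}_k)^2$; the function names are ours):

```python
import torch
import torch.nn.functional as F


def pool_obj(v):
    """'Obj.': average of contextualized object features, (1/N_v) sum_i v_i."""
    return v.mean(dim=0)  # v: (N_v, dim)


def pool_obj_rel(v, rel_dist_sq):
    """'Obj.+Rel.': (1/N_v) sum_i (1/N_v) sum_k v_i * d_{B^v}(v_i, v_k)^2.

    rel_dist_sq[i, k] holds the squared probe distance between regions i and k.
    """
    weights = rel_dist_sq.mean(dim=1, keepdim=True)  # (N_v, 1): (1/N_v) sum_k d^2
    return (v * weights).mean(dim=0)


def retrieve(query_feat, gallery_feats, top_k=5):
    """Rank gallery images by cosine similarity to the query representation."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=-1)
    return sims.topk(top_k).indices
```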

[Figure 3 panels: (a) Obj.+Rel. and (b) Obj.; region features pass through the Intra-Modality and Inter-Modality Encoders, with the Relationship Probe additionally applied in (a), before cosine similarity is computed between the resulting representations.]

    Figure 3: Illustrations of the two image retrieval methods mentioned in our paper.


0.3 Extra examples

[Figure 4 panels for COCO_val2014_000000289941: the original image and three augmentations (RoI Jitter 0.9-1.1, RoI HFlip, RoI Rotate 270), each shown with its predicted visual relationship matrix.]

Figure 4: Examples of generated relationships for different augmented images. Darker colors indicate closer visual relationships, while lighter colors indicate farther visual relationships.

[Figure 5 panels: predicted and ground-truth relationship matrices for two augmented sentences, with the minimum spanning trees extracted by SSRP shown beneath.]

Figure 5: Examples of generated relationships for different augmented sentences. The bottom row shows the minimum spanning trees. Black edges are the ground-truth parse; red edges are predicted by SSRP-Cross.

References

[1] Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. Facebook FAIR's WMT19 news translation task submission. arXiv preprint arXiv:1907.06616, 2019.

[2] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2017.

