
Learning Visually-Grounded Semantics from Contrastive Adversarial Samples
Haoyue Shi*1, Jiayuan Mao*2, Tete Xiao*1, Yuning Jiang3 and Jian Sun3
1: Peking University  2: Tsinghua University  3: Megvii, Inc

{hyshi, jasonhsiao97}@pku.edu.cn, [email protected], {jyn, sunjian}@megvii.com

INTRODUCTION

Visual-Semantic Embeddings (VSE)
• Use parallel image-caption pairs to embed texts and images into a joint space.

• Several datasets have been created for this purpose.

• However, even MS-COCO [1] is too small to cover the compositional semantic space.

VSE with Contrastive Adversarial Samples (this work)
• Show the limitations of existing datasets and frameworks through adversarial attacks.

• Close the gap with semantics-aware text augmentation.

• Evaluate the visual grounding on multiple tasks.

A SIMPLE YET EFFECTIVE APPROACH

Add the contrastive adversarial samples to the training set: use the online hard example mining (OHEM) technique to find the “contrastive” ones.

VSE [2]:

$$\min \ \ell_{\text{VSE}}(i, c) = \sum_{c'} \left[\alpha + s(i, c') - s(i, c)\right]_+ + \sum_{i'} \left[\alpha + s(i', c) - s(i, c)\right]_+$$

VSE++ [3]:

$$\min \ \ell_{\text{VSE++}}(i, c) = \max_{c' \neq c} \left[\alpha + s(i, c') - s(i, c)\right]_+ + \max_{i' \neq i} \left[\alpha + s(i', c) - s(i, c)\right]_+$$

VSE-C (ours):

$$\min \ \ell_{\text{VSE-C}}(i, c) = \ell_{\text{VSE++}}(i, c) + \max_{c'' \in C'(c)} \left[\alpha + s(i, c'') - s(i, c)\right]_+$$

$i$: image, $c$: caption, $C'(c)$: set of contrastive adversarial samples for $c$.
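As a concrete illustration, the VSE-C objective for a single image-caption pair can be sketched in plain numpy. The similarity values, margin, and helper names below are toy assumptions for this sketch, not the authors' implementation:

```python
import numpy as np

def hinge(x):
    """[x]_+ = max(x, 0)."""
    return np.maximum(x, 0.0)

def vse_c_loss(s_pos, s_neg_captions, s_neg_images, s_adv_captions, alpha=0.2):
    """Toy sketch of the VSE-C objective for one (image, caption) pair.

    s_pos: similarity s(i, c) of the matched pair.
    s_neg_captions: similarities s(i, c') to other captions in the batch.
    s_neg_images: similarities s(i', c) to other images in the batch.
    s_adv_captions: similarities s(i, c'') to contrastive adversarial captions C'(c).
    A real implementation would use a deep-learning framework and mine
    hard examples online (OHEM); here the scores are given directly.
    """
    # VSE++ part: hardest negative caption and hardest negative image.
    l_vsepp = (hinge(alpha + np.max(s_neg_captions) - s_pos)
               + hinge(alpha + np.max(s_neg_images) - s_pos))
    # VSE-C extension: hardest contrastive adversarial caption.
    l_adv = hinge(alpha + np.max(s_adv_captions) - s_pos)
    return l_vsepp + l_adv

loss = vse_c_loss(
    s_pos=0.8,
    s_neg_captions=np.array([0.1, 0.5]),
    s_neg_images=np.array([0.3, 0.2]),
    s_adv_captions=np.array([0.7, 0.75]),
)
```

The first two hinge terms are the VSE++ hard-negative loss; the third term is what VSE-C adds, penalizing adversarial captions whose score comes within the margin of the true caption.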

BEGIN WITH ADVERSARIAL ATTACKS

[Figure: an image of three giraffes and a rhino, with its original caption and contrastive adversarial samples.]

Original caption: "Three giraffes and a rhino graze from trees." (relation: graze from)

Contrastive adversarial samples:
• noun: "Three cows and a rhino graze from trees."
• numeral / indefinite article: "Three giraffes and three rhinos graze from trees."
• relation: "Three giraffes and a rhino graze on trees." / "Trees graze from three giraffes and a rhino."

Semantics-aware Text Augmentation (Adversarial Samples)

• Noun: use WordNet [4] to compare word similarity (e.g., synonyms, hypernyms).

• Numeral/Indefinite Article: singularize or pluralize corresponding nouns when necessary.

• Relation: dependency-parsing based subject and object detection.
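As a toy sketch of the numeral/indefinite-article rule, the snippet below rewrites "a <noun>" as a numeral plus a naively pluralized noun. The regex-based noun detection and the pluralizer are simplifying assumptions; the actual pipeline relies on POS and dependency information:

```python
import re

def pluralize(noun):
    """Very naive English pluralizer, for illustration only."""
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    return noun + "s"

def swap_article_to_numeral(caption, numeral="three"):
    """Toy numeral/indefinite-article augmentation: rewrite 'a <noun>'
    as '<numeral> <pluralized noun>'. Assumes the word after 'a' is a
    noun, which a real pipeline would verify with a POS tagger."""
    def repl(m):
        return f"{numeral} {pluralize(m.group(1))}"
    return re.sub(r"\ba ([a-z]+)\b", repl, caption)

augmented = swap_article_to_numeral("Three giraffes and a rhino graze from trees.")
# -> "Three giraffes and three rhinos graze from trees."
```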

Result

MS-COCO Test
Model           R@1    R@10   Med r.  Avg r.
VSE [2]         47.7   87.8   2.0     5.8
VSE++ [3]       55.7   92.4   1.0     4.3
VSE-C (+n.)     50.7   90.7   1.0     5.2
VSE-C (+num.)   53.3   90.2   1.0     5.8
VSE-C (+rel.)   52.4   89.0   1.0     5.7
VSE-C (+all)    50.2   89.8   1.0     5.2

MS-COCO Test w/ Adversarial Samples
Model           R@1    R@10   Med r.  Avg r.
VSE [2]         28.0   71.6   4.0     11.7
VSE++ [3]       35.6   72.5   3.0     11.8
VSE-C (+n.)     40.3   80.2   2.0     9.2
VSE-C (+num.)   46.9   86.3   2.0     6.9
VSE-C (+rel.)   42.3   82.5   2.0     7.2
VSE-C (+all)    47.4   88.8   2.0     5.5

GROUNDING TEST I: WORD-OBJECT CORRELATION

Task Description

Image captions:
• A table with a huge glass vase and fake flowers come out of it.
• A plant in a vase sits at the end of a table.
• A vase with flowers in it with long stems sitting on a table with candles.
• A large centerpiece that is sitting on the edge of a dining table.
• Flowers in a clear vase sitting on a table.

Positive objects: table, plant, vase.
Negative objects: screen, pickle, sandwich, toy, hill, coat, cat, etc.

Model Result

[Model diagram: an image encoder (ResNet-152) produces the image embedding f(i); a word embedding g(w) is computed for the query word (e.g., "vase"); an embedding-interaction module outputs Pr[positive | i, w].]

Model                   mAP
GloVe [5]               58.7
VSE [2]                 61.7
VSE++ [3]               61.1
VSE-C (ours, +all)      62.2
VSE-C (ours, +n.)       62.8
VSE-C (ours, +rel.)     62.3
VSE-C (ours, +num.)     62.0
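The word-object correlation head can be illustrated with a minimal bilinear interaction between the image and word embeddings. The dimensions, the random weight matrix, and the sigmoid readout are assumptions for this sketch, not the paper's exact interaction module:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_object_score(img_emb, word_emb, W):
    """Toy embedding-interaction head: score the presence of a word's
    object in the image as sigmoid of the bilinear form img^T W word.
    In the real model W would be learned; here it is random."""
    return sigmoid(img_emb @ W @ word_emb)

img = rng.normal(size=1024)   # f(i) from the image encoder (ResNet-152 on the poster)
word = rng.normal(size=300)   # g(w), e.g. a GloVe-initialized word embedding
W = rng.normal(size=(1024, 300)) * 0.01

p = word_object_score(img, word, W)  # Pr[positive | i, w], a value in (0, 1)
```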

SALIENCY VISUALIZATION

Which part of the image or caption, in particular, makes them semantically different? We compute the Jacobian (the textual saliency is normalized for visualization):

$$J = \nabla_i\, s(i, c') = \nabla_i \left[ W_i^T f(i; \theta_i) \right] \cdot W_c^T g(c'; \theta_c)$$
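For the final linear projections, this gradient has a closed form. A numpy sketch (tiny random dimensions, assumed for illustration; the paper backpropagates through the full encoders) checks it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 4                      # feature dim, joint-space dim (tiny, for illustration)
W_i = rng.normal(size=(k, d))    # image-side projection
W_c = rng.normal(size=(k, d))    # caption-side projection
f = rng.normal(size=d)           # image features f(i; theta_i)
g = rng.normal(size=d)           # caption features g(c'; theta_c)

def score(f):
    """Joint-space similarity s(i, c') = (W_i f) . (W_c g)."""
    return (W_i @ f) @ (W_c @ g)

# Closed-form gradient of the bilinear score w.r.t. the image features:
# d/df [(W_i f) . (W_c g)] = W_i^T (W_c g).
J = W_i.T @ (W_c @ g)

# Sanity check against a central finite-difference approximation.
eps = 1e-6
J_fd = np.array([(score(f + eps * e) - score(f - eps * e)) / (2 * eps)
                 for e in np.eye(d)])
```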

Textual saliency on the adversarial caption "an elephant walking against the weeds in the forest" (original caption: "an elephant walking through the weeds in the forest"):

Token   an     elephant  walking  against  the    weeds  in     the    forest
VSE++   0.039  0.176     0.101    0.087    0.051  0.248  0.060  0.057  0.181
VSE-C   0.030  0.108     0.125    0.258    0.108  0.176  0.077  0.027  0.090

VSE-C puts its highest saliency on "against", the word that was altered, while VSE++ focuses on "weeds".

[Figure: original image and image saliency map (VSE-C).]

PAPER & CODE

Paper is available at http://aclweb.org/anthology/C18-1315. Code is available at https://github.com/ExplorerFreda/VSE-C.


ACKNOWLEDGEMENTS

This work was done when HS, JM and TX were intern researchers at Megvii Inc. HS, JM and TX contributed equally to this paper.

GROUNDING TEST II: FILL-IN-THE-BLANK

Model Result

A table with a huge glass _____ and fake flowers come out of it.

[Model diagram: the word embeddings of the blanked caption are encoded by a GRU; the sentence representation is fused with the image embedding f(i) by an MLP to predict the missing word (here, "vase").]

Model           R@1    R@10

Noun Filling
GloVe [5]       23.2   58.8
VSE++ [3]       25.0   61.7
VSE-C (ours)    27.3   62.9

Prep. Filling
GloVe [5]       23.3   79.9
VSE++ [3]       34.9   84.9
VSE-C (ours)    35.2   85.2

All (Noun + Prep.)
GloVe [5]       23.3   66.6
VSE++ [3]       28.4   68.1
VSE-C (ours)    30.0   70.98

REFERENCES

[1] Lin et al. Microsoft COCO: Common Objects in Context. In ECCV, 2014.

[2] Kiros et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv preprint arXiv:1411.2539, 2014.

[3] Faghri et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In BMVC, 2018.

[4] George A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 1995.

[5] Pennington et al. GloVe: Global Vectors for Word Representation. In EMNLP, 2014.