
Learning Visually-Grounded Semantics from Contrastive Adversarial Samples
Haoyue Shi*1, Jiayuan Mao*2, Tete Xiao*1, Yuning Jiang3 and Jian Sun3
1: Peking University  2: Tsinghua University  3: Megvii, Inc

{hyshi, jasonhsiao97}@pku.edu.cn, [email protected], {jyn, sunjian}@megvii.com

INTRODUCTION

Visual-Semantic Embeddings (VSE)
• Use parallel image-caption pairs to embed texts and images into a joint space.

• Several datasets have been created for this purpose.

• However, even MS-COCO [1] is too small to cover the compositional semantic space.

VSE with Contrastive Adversarial Samples (this work)
• Show the limitations of existing datasets and frameworks through adversarial attacks.

• Close the gap with semantics-aware text augmentation.

• Evaluate the visual grounding on multiple tasks.

A SIMPLE YET EFFECTIVE APPROACH

Add the contrastive adversarial samples to the training set: use the online hard example mining (OHEM) technique to find the “contrastive” ones.

VSE [2]:

$$\min \ \ell_{\text{VSE}}(i, c) = \sum_{c'} \left[\alpha + s(i, c') - s(i, c)\right]_+ + \sum_{i'} \left[\alpha + s(i', c) - s(i, c)\right]_+$$

VSE++ [3]:

$$\min \ \ell_{\text{VSE++}}(i, c) = \max_{c' \neq c} \left[\alpha + s(i, c') - s(i, c)\right]_+ + \max_{i' \neq i} \left[\alpha + s(i', c) - s(i, c)\right]_+$$

VSE-C (ours):

$$\min \ \ell_{\text{VSE-C}}(i, c) = \ell_{\text{VSE++}}(i, c) + \max_{c'' \in C'(c)} \left[\alpha + s(i, c'') - s(i, c)\right]_+$$

$i$: image, $c$: caption, $C'(c)$: set of contrastive adversarial samples for $c$.
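As a concrete illustration, the VSE-C objective for a single image-caption pair can be sketched in plain numpy. The similarity values, margin, and helper names below are toy assumptions for this sketch, not the authors' implementation:

```python
import numpy as np

def hinge(x):
    """[x]_+ = max(x, 0)."""
    return np.maximum(x, 0.0)

def vse_c_loss(s_pos, s_neg_captions, s_neg_images, s_adv_captions, alpha=0.2):
    """Toy sketch of the VSE-C objective for one (image, caption) pair.

    s_pos: similarity s(i, c) of the matched pair.
    s_neg_captions: similarities s(i, c') to other captions in the batch.
    s_neg_images: similarities s(i', c) to other images in the batch.
    s_adv_captions: similarities s(i, c'') to contrastive adversarial captions C'(c).
    A real implementation would use a deep-learning framework and mine
    hard examples online (OHEM); here the scores are given directly.
    """
    # VSE++ part: hardest negative caption and hardest negative image.
    l_vsepp = (hinge(alpha + np.max(s_neg_captions) - s_pos)
               + hinge(alpha + np.max(s_neg_images) - s_pos))
    # VSE-C extension: hardest contrastive adversarial caption.
    l_adv = hinge(alpha + np.max(s_adv_captions) - s_pos)
    return l_vsepp + l_adv

loss = vse_c_loss(
    s_pos=0.8,
    s_neg_captions=np.array([0.1, 0.5]),
    s_neg_images=np.array([0.3, 0.2]),
    s_adv_captions=np.array([0.7, 0.75]),
)
```

The first two hinge terms are the VSE++ hard-negative loss; the third term is what VSE-C adds, penalizing adversarial captions whose score comes within the margin of the true caption.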

BEGIN WITH ADVERSARIAL ATTACKS

[Figure: an image of three giraffes and a rhino, with its original caption and contrastive adversarial samples.]

Original caption: "Three giraffes and a rhino graze from trees." (relation: graze from)

Contrastive adversarial samples:
• noun: "Three cows and a rhino graze from trees."
• numeral / indefinite article: "Three giraffes and three rhinos graze from trees."
• relation: "Three giraffes and a rhino graze on trees." / "Trees graze from three giraffes and a rhino."

Semantics-aware Text Augmentation (Adversarial Samples)

• Noun: use WordNet [4] to compare word similarity (e.g., synonyms, hypernyms).

• Numeral/Indefinite Article: singularize or pluralize corresponding nouns when necessary.

• Relation: dependency-parsing based subject and object detection.
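As a toy sketch of the numeral/indefinite-article rule, the snippet below rewrites "a <noun>" as a numeral plus a naively pluralized noun. The regex-based noun detection and the pluralizer are simplifying assumptions; the actual pipeline relies on POS and dependency information:

```python
import re

def pluralize(noun):
    """Very naive English pluralizer, for illustration only."""
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    return noun + "s"

def swap_article_to_numeral(caption, numeral="three"):
    """Toy numeral/indefinite-article augmentation: rewrite 'a <noun>'
    as '<numeral> <pluralized noun>'. Assumes the word after 'a' is a
    noun, which a real pipeline would verify with a POS tagger."""
    def repl(m):
        return f"{numeral} {pluralize(m.group(1))}"
    return re.sub(r"\ba ([a-z]+)\b", repl, caption)

augmented = swap_article_to_numeral("Three giraffes and a rhino graze from trees.")
# -> "Three giraffes and three rhinos graze from trees."
```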

Result

MS-COCO Test
Model           R@1    R@10   Med r.  Avg r.
VSE [2]         47.7   87.8   2.0     5.8
VSE++ [3]       55.7   92.4   1.0     4.3
VSE-C (+n.)     50.7   90.7   1.0     5.2
VSE-C (+num.)   53.3   90.2   1.0     5.8
VSE-C (+rel.)   52.4   89.0   1.0     5.7
VSE-C (+all)    50.2   89.8   1.0     5.2

MS-COCO Test w/ Adversarial Samples
Model           R@1    R@10   Med r.  Avg r.
VSE [2]         28.0   71.6   4.0     11.7
VSE++ [3]       35.6   72.5   3.0     11.8
VSE-C (+n.)     40.3   80.2   2.0     9.2
VSE-C (+num.)   46.9   86.3   2.0     6.9
VSE-C (+rel.)   42.3   82.5   2.0     7.2
VSE-C (+all)    47.4   88.8   2.0     5.5

GROUNDING TEST I: WORD-OBJECT CORRELATION

Task Description

Image captions:
• A table with a huge glass vase and fake flowers come out of it.
• A plant in a vase sits at the end of a table.
• A vase with flowers in it with long stems sitting on a table with candles.
• A large centerpiece that is sitting on the edge of a dining table.
• Flowers in a clear vase sitting on a table.

Positive objects: table, plant, vase.
Negative objects: screen, pickle, sandwich, toy, hill, coat, cat, etc.

Model Result

[Model diagram: an image encoder (ResNet-152) produces the image embedding f(i); a word embedding g(w) is computed for the query word (e.g., "vase"); an embedding-interaction module outputs Pr[positive | i, w].]

Model                   mAP
GloVe [5]               58.7
VSE [2]                 61.7
VSE++ [3]               61.1
VSE-C (ours, +all)      62.2
VSE-C (ours, +n.)       62.8
VSE-C (ours, +rel.)     62.3
VSE-C (ours, +num.)     62.0
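The word-object correlation head can be illustrated with a minimal bilinear interaction between the image and word embeddings. The dimensions, the random weight matrix, and the sigmoid readout are assumptions for this sketch, not the paper's exact interaction module:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_object_score(img_emb, word_emb, W):
    """Toy embedding-interaction head: score the presence of a word's
    object in the image as sigmoid of the bilinear form img^T W word.
    In the real model W would be learned; here it is random."""
    return sigmoid(img_emb @ W @ word_emb)

img = rng.normal(size=1024)   # f(i) from the image encoder (ResNet-152 on the poster)
word = rng.normal(size=300)   # g(w), e.g. a GloVe-initialized word embedding
W = rng.normal(size=(1024, 300)) * 0.01

p = word_object_score(img, word, W)  # Pr[positive | i, w], a value in (0, 1)
```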

SALIENCY VISUALIZATION

Which part of the image or caption, in particular, makes them semantically different? We compute the Jacobian (the textual saliency is normalized for visualization):

$$J = \nabla_i\, s(i, c') = \nabla_i \left[ W_i^T f(i; \theta_i) \right] \cdot W_c^T g(c'; \theta_c)$$
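For the final linear projections, this gradient has a closed form. A numpy sketch (tiny random dimensions, assumed for illustration; the paper backpropagates through the full encoders) checks it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 4                      # feature dim, joint-space dim (tiny, for illustration)
W_i = rng.normal(size=(k, d))    # image-side projection
W_c = rng.normal(size=(k, d))    # caption-side projection
f = rng.normal(size=d)           # image features f(i; theta_i)
g = rng.normal(size=d)           # caption features g(c'; theta_c)

def score(f):
    """Joint-space similarity s(i, c') = (W_i f) . (W_c g)."""
    return (W_i @ f) @ (W_c @ g)

# Closed-form gradient of the bilinear score w.r.t. the image features:
# d/df [(W_i f) . (W_c g)] = W_i^T (W_c g).
J = W_i.T @ (W_c @ g)

# Sanity check against a central finite-difference approximation.
eps = 1e-6
J_fd = np.array([(score(f + eps * e) - score(f - eps * e)) / (2 * eps)
                 for e in np.eye(d)])
```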

Textual saliency on the adversarial caption "an elephant walking against the weeds in the forest" (original caption: "an elephant walking through the weeds in the forest"):

Token   an     elephant  walking  against  the    weeds  in     the    forest
VSE++   0.039  0.176     0.101    0.087    0.051  0.248  0.060  0.057  0.181
VSE-C   0.030  0.108     0.125    0.258    0.108  0.176  0.077  0.027  0.090

VSE-C puts its highest saliency on "against", the word that was altered, while VSE++ focuses on "weeds".

[Figure: original image and image saliency map (VSE-C).]

PAPER & CODE

Paper is available at http://aclweb.org/anthology/C18-1315. Code is available at https://github.com/ExplorerFreda/VSE-C.


ACKNOWLEDGEMENTS

This work was done when HS, JM and TX were intern researchers at Megvii Inc. HS, JM and TX contributed equally to this paper.

GROUNDING TEST II: FILL-IN-THE-BLANK

Model Result

A table with a huge glass _____ and fake flowers come out of it.

[Model diagram: the word embeddings of the blanked caption are encoded by a GRU; the sentence representation is fused with the image embedding f(i) by an MLP to predict the missing word (here, "vase").]

Model           R@1    R@10

Noun Filling
GloVe [5]       23.2   58.8
VSE++ [3]       25.0   61.7
VSE-C (ours)    27.3   62.9

Prep. Filling
GloVe [5]       23.3   79.9
VSE++ [3]       34.9   84.9
VSE-C (ours)    35.2   85.2

All (Noun + Prep.)
GloVe [5]       23.3   66.6
VSE++ [3]       28.4   68.1
VSE-C (ours)    30.0   70.98

REFERENCES

[1] Lin et al. Microsoft COCO: Common Objects in Context. In ECCV, 2014.

[2] Kiros et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv preprint arXiv:1411.2539, 2014.

[3] Faghri et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In BMVC, 2018.

[4] George A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 1995.

[5] Pennington et al. GloVe: Global Vectors for Word Representation. In EMNLP, 2014.