Vision and Language Representation Learning
– Self-Supervised Pretraining and Multi-Task Learning
Jiasen Lu
April 21, 2020
Vision and Language
• Image Captioning
• Visual Question Answering
• Visual Commonsense Reasoning
• Referring Expression
[Antol et al. 2015, Vinyals et al. 2015, Zellers et al. 2018, Yu et al. 2018]
Visual Grounding
C: A bunch of red and yellow flowers on a branch.
Q: What type of plant is this?
A: Banana
[Shen et al. 2018]
Goal: a common model for visual grounding that can be leveraged across a wide array of vision-and-language tasks.
Pretrain-Transfer
Object Detection Semantic Segmentation Pose Estimation
Question Answering Commonsense Inference Sentiment Analysis
[Deng et al. 2009, Devlin et al. 2018]
Pretrain-Transfer
Alt-text: Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee.
Conceptual Captions: pop artist performs at the festival in a city.
Conceptual Captions Dataset
• Aligned image-caption pairs.
• 3.3 million images, compared to 0.12 million in COCO Captions.
• Automatically collected.
[Sharma et al. 2018]
BERT
[Figure: BERT encodes a sentence pair, <CLS> Tok 1 … Tok N <SEP> (Sentence A) and Tok 1 Tok 2 … <SEP> (Sentence B), into per-token hidden states h_L0 … h_LT.]
[Sharma et al. 2018, Devlin et al. 2018]
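As background, here is a minimal sketch of running a pretrained BERT on a sentence pair to obtain the per-token hidden states shown in the figure. It uses the HuggingFace transformers library; the checkpoint name and example text are only illustrative.

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sentence A / Sentence B are packed as <CLS> A <SEP> B <SEP> automatically.
inputs = tokenizer("pop artist performs at the festival", "in a city",
                   return_tensors="pt")
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state  # (1, T, 768): one vector h_t per input token
```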
ViLBERT
[Figure: BERT's masked language modelling. Some input tokens are replaced with <MASK>, and the model predicts the original tokens (T1, T2) from their output hidden states.]
[Sharma et al. 2018, Devlin et al. 2018]
Single-Stream Model
[Figure: a single BERT takes the concatenated image regions and sentence tokens (<CLS> … image … <SEP> sentence … <SEP>) and produces one joint sequence of hidden states; masked words and masked image regions are predicted jointly.]
[Sharma et al. 2018, Devlin et al. 2018]
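To make the single-stream idea concrete, here is a small PyTorch sketch (not any specific published model): detector region features and word embeddings are projected to a common width, concatenated, and passed through one shared transformer. All sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    """Single-stream sketch: image regions and word tokens share one transformer."""
    def __init__(self, vocab_size=30522, region_dim=2048, hidden=768, layers=12):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, region_feats, token_ids):
        # region_feats: (B, R, region_dim) from a pretrained detector
        # token_ids:    (B, T) word-piece ids
        v = self.region_proj(region_feats)   # (B, R, hidden)
        w = self.word_emb(token_ids)         # (B, T, hidden)
        x = torch.cat([v, w], dim=1)         # one joint image+text sequence
        return self.encoder(x)               # (B, R+T, hidden)
```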
ViLBERT
Problem: different modalities may require different levels of abstraction.
• Linguistic stream: word tokens (e.g. "artist") enter through only a linear embedding before the transformer.
• Visual stream: region features already come from a very deep CNN [He et al. 2015].
ViLBERT
Solution: a two-stream model that processes the visual and linguistic inputs separately.
[Figure: a linguistic stream (L-BERT, k layers) encodes <CLS> Tok1 Tok2 … <SEP>, and a visual stream (V-BERT, l layers) encodes <IMG> and the image regions; each stream produces its own hidden states.]
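A minimal PyTorch sketch of the two-stream idea (layer counts, widths, and names are illustrative, not the exact ViLBERT configuration): each modality gets its own transformer before any cross-modal interaction.

```python
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    """Two-stream sketch: a separate transformer per modality."""
    def __init__(self, vocab_size=30522, region_dim=2048,
                 txt_hidden=768, vis_hidden=1024, k=12, l=6):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, txt_hidden)
        self.region_proj = nn.Linear(region_dim, vis_hidden)
        txt_layer = nn.TransformerEncoderLayer(txt_hidden, nhead=12, batch_first=True)
        vis_layer = nn.TransformerEncoderLayer(vis_hidden, nhead=8, batch_first=True)
        self.l_bert = nn.TransformerEncoder(txt_layer, num_layers=k)   # linguistic stream
        self.v_bert = nn.TransformerEncoder(vis_layer, num_layers=l)   # visual stream

    def forward(self, token_ids, region_feats):
        txt = self.l_bert(self.word_emb(token_ids))          # (B, T, txt_hidden)
        vis = self.v_bert(self.region_proj(region_feats))    # (B, R, vis_hidden)
        return txt, vis  # fused later with co-attention (next slide)
```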
ViLBERT
Problem: how to fuse the two modalities?
Solution: use co-attention [Lu et al. 2016] to exchange information between the two streams: each stream's queries attend over the other stream's keys and values.
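A sketch of a co-attentional block in this spirit (both streams are assumed to share one width here for simplicity; the real layers also include layer norm and feed-forward sublayers):

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Co-attention sketch: queries from one stream, keys/values from the other."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (B, R, hidden) region states; txt: (B, T, hidden) token states
        vis_ctx, _ = self.vis_attends_txt(query=vis, key=txt, value=txt)
        txt_ctx, _ = self.txt_attends_vis(query=txt, key=vis, value=vis)
        # residual connections only; norm / feed-forward omitted in this sketch
        return vis + vis_ctx, txt + txt_ctx
```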
Pre-training Objectives
Masked multi-modal modelling
• Follows the masked LM objective in BERT.
• 15% of the words or image regions are selected for prediction.
• Linguistic stream:
  o 80% of the time, replace with [MASK].
  o 10% of the time, replace with a random word.
  o 10% of the time, keep the word unchanged.
• Visual stream:
  o 80% of the time, replace the region feature with a zero vector.
Multi-modal alignment prediction
• Predict whether the image and the caption are aligned or not.
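A small sketch of the two masking routines described above, operating on a single example. The [MASK] id, thresholds, and helper names are illustrative; only the percentages follow the slide.

```python
import random
import torch

MASK_ID = 103  # [MASK] id in the standard BERT vocabulary

def mask_words(token_ids, vocab_size, p=0.15):
    """Pick ~15% of positions; 80% become [MASK], 10% a random word, 10% unchanged.
    Returns corrupted ids and targets (-1 means "not predicted")."""
    token_ids = token_ids.clone()                 # (T,) word-piece ids
    targets = torch.full_like(token_ids, -1)
    for i in range(token_ids.numel()):
        if random.random() < p:
            targets[i] = token_ids[i]
            r = random.random()
            if r < 0.8:
                token_ids[i] = MASK_ID
            elif r < 0.9:
                token_ids[i] = random.randrange(vocab_size)
    return token_ids, targets

def mask_regions(region_feats, p=0.15):
    """Pick ~15% of regions; 80% of those get their feature zeroed out.
    The model is then trained to predict the masked regions' content."""
    region_feats = region_feats.clone()           # (R, D) detector features
    masked = torch.zeros(region_feats.size(0), dtype=torch.bool)
    for i in range(region_feats.size(0)):
        if random.random() < p:
            masked[i] = True
            if random.random() < 0.8:
                region_feats[i] = 0.0
    return region_feats, masked
```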
Visualizations
A boat covered in flowers near the market.
[Sharma et al. 2018]
Sentence → Image and Image → Sentence attention
[Figure: cross-modal attention maps across heads H0–H7 at layer 0 and layer 5, visualized with BertViz (https://github.com/jessevig/bertviz).]
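For reference, the usual BertViz recipe for such plots looks roughly like the sketch below. It shows a standard text-only BERT in a notebook; applying it to a two-stream vision-and-language model would require exporting that model's attention tensors yourself.

```python
from bertviz import head_view
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("a boat covered in flowers near the market", return_tensors="pt")
outputs = model(**inputs)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)  # interactive per-layer / per-head view (notebook)
```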
Pre-training and Fine-tuning
[Figure, left (pre-training): an image and text pair from Conceptual Captions ("Man shopping for …") is fed to the Vision & Language BERT with masked image regions and masked words.]
[Figure, right (fine-tuning): an image-question pair ("What is the …") is fed to the same model, which predicts the answer (e.g. "shopping") and is fine-tuned for VQA, VCR, and referring expressions.]
Fine-tuning Procedure
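As an illustration of the fine-tuning step, a sketch of a VQA head on top of the pretrained trunk: the overall image representation and the overall text representation are fused (ViLBERT uses an element-wise product) and scored against a fixed answer vocabulary. Sizes and the answer count are placeholders.

```python
import torch.nn as nn

class VQAHead(nn.Module):
    """VQA fine-tuning sketch: fuse h_IMG and h_CLS, classify over answers."""
    def __init__(self, hidden=1024, num_answers=3129):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 2 * hidden), nn.ReLU(),
            nn.Linear(2 * hidden, num_answers),
        )

    def forward(self, h_img, h_cls):
        # h_img: (B, hidden) pooled <IMG> state; h_cls: (B, hidden) pooled <CLS> state
        return self.classifier(h_img * h_cls)   # (B, num_answers) answer logits
```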
Tasks
[Antol et al. 2015, Zellers et al. 2018, Yu et al. 2016, Plummer et al. 2015]
Results
[Bar charts: VQA test-dev accuracy (70.22, 65.9, 68.85, 68.93, 70.55), VCR Q->A val accuracy (43.1, 47.27, 52.73, 49.48, 54.04), RefCOCO+ val accuracy (65.33, 65.64, 69.21, 68.61, 72.34), Image Retrieval test (48.6, 45.5, 58.2).]
Concurrent Work
[Li 2019, Tan 2019, Li 2019, Su 2019, Zhou 2019, Chen 2019]
Summary
Task-agnostic pretraining of visiolinguistic representations for visual grounding:
• Introduces pretrain-transfer to vision-and-language tasks.
• Achieves SOTA on multiple vision-and-language tasks.
Limitations
• The model can still learn inconsistent grounding through task-specific fine-tuning.
• Next step: training multiple vision-and-language tasks together (multi-task V&L).
Multi-Task V&L Learning
One model for V&L: ViLBERT. Problems:
• Inconsistent grounding from task-specific fine-tuning.
• Only four V&L tasks.
• The model is huge and prone to overfitting.
What we want:
• Test on more tasks.
• Consistent grounding across tasks.
• Explore the limits of the model.
Multi-Task V&L Learning
Twelve tasks in four groups:
• VQA: VQA, Genome QA, GQA
• Image Description: caption-based retrieval (COCO), caption-based retrieval (Flickr)
• Referring Expression: RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GuessWhat
• V&L Verification: NLVR2, Visual Entailment
Multi-Task V&L Learning
Model improvements over ViLBERT:
• Masked multi-modal modelling only for aligned image-caption pairs.
• When masking an image region, also mask regions that overlap it (IoU > 0.4), so the masked content cannot simply be read off a near-duplicate region (see the sketch below).
[Figure: the two-stream model (V-BERT over the image, L-BERT over the aligned caption) with <MASK> applied during masked multi-modal modelling.]
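A sketch of the overlap rule. The box format and helper name are illustrative; only the IoU > 0.4 threshold comes from the slide.

```python
import torch

def regions_to_mask(boxes, masked_idx, iou_thresh=0.4):
    """Given region boxes (N, 4) as (x1, y1, x2, y2) and one region chosen for
    masking, return a boolean mask that also covers every region whose IoU
    with it exceeds the threshold."""
    x1 = torch.maximum(boxes[:, 0], boxes[masked_idx, 0])
    y1 = torch.maximum(boxes[:, 1], boxes[masked_idx, 1])
    x2 = torch.minimum(boxes[:, 2], boxes[masked_idx, 2])
    y2 = torch.minimum(boxes[:, 3], boxes[masked_idx, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = areas + areas[masked_idx] - inter
    iou = inter / union.clamp(min=1e-6)
    return iou > iou_thresh   # includes masked_idx itself (IoU = 1)
```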
Multi-Task V&L Learning
Multi-task vision and language learning:
• Use a different output head per task, but let similar tasks share the same head (see the sketch below).
[Figure: the shared two-stream trunk (V-BERT over the image, L-BERT over <TSK> Tok1 … <SEP> of the aligned caption) feeds the task heads: VQA / Genome QA, GQA, Retrieval, NLVR, Visual Entailment, Referring Expression.]
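A sketch of the shared-head layout. The grouping follows the figure; output sizes and dataset keys are placeholders, not the exact published configuration.

```python
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """One head per task group on a shared trunk; datasets in a group reuse it."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.heads = nn.ModuleDict({
            "vqa":       nn.Linear(hidden, 3129),  # VQA and Genome QA share this head
            "gqa":       nn.Linear(hidden, 1500),  # GQA answer vocabulary (placeholder size)
            "retrieval": nn.Linear(hidden, 1),     # image-caption matching score
            "refer":     nn.Linear(hidden, 1),     # score per image region
            "nlvr":      nn.Linear(hidden, 2),     # true / false
            "ve":        nn.Linear(hidden, 3),     # entailment / neutral / contradiction
        })
        self.task_to_head = {
            "VQA": "vqa", "GenomeQA": "vqa", "GQA": "gqa",
            "RetrievalCOCO": "retrieval", "RetrievalFlickr": "retrieval",
            "RefCOCO": "refer", "RefCOCO+": "refer", "RefCOCOg": "refer",
            "Visual7W": "refer", "GuessWhat": "refer",
            "NLVR2": "nlvr", "SNLI-VE": "ve",
        }

    def forward(self, pooled, task):
        # pooled: (B, hidden) fused representation from the trunk
        return self.heads[self.task_to_head[task]](pooled)
```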
Multi-Task V&L Learning
• Add a <TSK> token per task for multi-task training.
• Dynamic Stop and Go: pause ("stop") tasks that have converged and resume ("go") them if their performance starts to degrade, so smaller datasets are not over-trained while larger ones keep training.
Training Procedure (a simplified loop is sketched below)
• Conceptual Captions pre-training.
• Multi-task training.
• Fine-tune on each single task.
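A deliberately simplified multi-task training loop, only to show the shape of the middle stage. Task sampling ratios, dynamic stop-and-go scheduling, and per-task optimisation details are omitted; the trunk call signature and the heads.loss helper are hypothetical.

```python
import random

def multitask_train(trunk, heads, loaders, optimizer, steps=100_000):
    """Sample a task each step, draw one of its batches, update shared trunk + head."""
    iters = {task: iter(dl) for task, dl in loaders.items()}
    for _ in range(steps):
        task = random.choice(list(loaders))            # naive uniform task sampling
        try:
            batch = next(iters[task])
        except StopIteration:                          # restart an exhausted loader
            iters[task] = iter(loaders[task])
            batch = next(iters[task])
        pooled = trunk(batch["image_regions"], batch["tokens"], task=task)
        loss = heads.loss(pooled, batch["targets"], task=task)   # hypothetical helper
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```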
Compare with SOTA
[Bar charts comparing 1-stream and 2-stream models pretrained on different data (CC, CC+wiki, CC+COCO, COCO+VG, CC+SBU, CC + MT): VQA test-dev accuracies 70.55, 70.8, 71.16, 71.24, 72.42, 72.27, 72.06, 73.08; RefCOCO+ test accuracies 70.57, 69.36, 71.12, 72.9, 73.4, 74.12.]
[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]
Ablation Study
[Bar charts over ten tasks, four model variants each: VQA test-dev 71.24 / 72.03 / 72.06 / 73.08; Genome QA test 34.1 / 36.18 / 35.05 / 36.38; GQA test 59.09 / 59.6 / 59.81 / 60.72; Retrieval COCO R1 64.8 / 65.06 / 63.02 / 67.36; Retrieval Flickr R1 61.46 / 66 / 63.19 / 66.12; Visual7W test 80.51 / 81.54 / 82.52 / 83.06; GuessWhat test 62.53 / 64.78 / 64.43 / 65.19; NLVR2 test 74.25 / 74.62 / 77.72 / 78.24; SNLI-VE test 76.53 / 76.52 / 76.63 / 77.32; RefCOCO+ test 69.47 / 72.8 / 73.4 / 74.12.]
Ablation Study
Task performance with different groups (rows: relative performance of a task group; columns: group it is trained together with):

Relative performance of   VQA    Image Retrieval   Refer Expression   V&L Verification   Average
VQA                        -          0.38              0.38              -0.2              0.19
Image Retrieval          0.46           -               0.23             -4.13             -1.15
Refer Expression         0.39         0.78               -                0.24              0.47
V&L Verification         2.29         1.47              0.67               -                1.48
Average                  1.04         0.88              0.43             -1.36               -
Summary
Exploring multi-task vision and language learning:
• Introduce several tricks to improve ViLBERT.
• Add <TSK> tokens to improve multi-task learning.
• Fine-tuning on the multi-task representation leads to new SOTA.
• Study how the different task groups affect each other.
Potential directions
• How to use the information across tasks for XAI?
• How to incorporate more modalities?
• How to make the model smaller and more efficient?
• How to combine symbolic reasoning with representation learning?
The End
Questions?