Vision and Language Representation Learning
– Self-Supervised Pretraining and Multi-Task Learning
Jiasen Lu
April 21, 2020
Vision and Language
• Image Captioning
• Visual Question Answering
• Visual Commonsense Reasoning
• Referring Expression
[Antol et al. 2015, Vinyals et al. 2015, Zellers et al. 2018, Yu et al. 2018]
Visual Grounding
C: A bunch of red and yellow flowers on a branch.
Q: What type of plant is this?
A: Banana
[Shen et al. 2018]
Goal: a common model for visual grounding that can be leveraged across a wide array of vision-and-language tasks.
Pretrain-Transfer
Object Detection Semantic Segmentation Pose Estimation
Question Answering Commonsense Inference Sentiment Analysis
[Deng et al. 2009, Devlin et al. 2018]
Pretrain-Transfer
Alt-text: Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee.
Conceptual Captions: pop artist performs at the festival in a city.
Conceptual Captions Dataset
• Aligned image-caption pairs.
• 3.3 million images, compared to 0.12 million in COCO Captions.
• Automatically collected.
[Sharma et al. 2018]
BERT
[Figure: BERT encodes a sentence pair, <CLS> Tok 1 … Tok N <SEP> (Sentence A) and Tok 1 Tok 2 … <SEP> (Sentence B), into per-token hidden states h_L0 … h_LT.]
[Sharma et al. 2018, Devlin et al. 2018]
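As background, here is a minimal sketch of running a pretrained BERT on a sentence pair to obtain the per-token hidden states shown in the figure. It uses the HuggingFace transformers library; the checkpoint name and example text are only illustrative.

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sentence A / Sentence B are packed as <CLS> A <SEP> B <SEP> automatically.
inputs = tokenizer("pop artist performs at the festival", "in a city",
                   return_tensors="pt")
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state  # (1, T, 768): one vector h_t per input token
```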
ViLBERT
[Figure: BERT's masked language modelling. Some input tokens are replaced with <MASK>, and the model predicts the original tokens (T1, T2) from their output hidden states.]
[Sharma et al. 2018, Devlin et al. 2018]
Single-Stream Model
[Figure: a single BERT takes the concatenated image regions and sentence tokens (<CLS> … image … <SEP> sentence … <SEP>) and produces one joint sequence of hidden states; masked words and masked image regions are predicted jointly.]
[Sharma et al. 2018, Devlin et al. 2018]
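To make the single-stream idea concrete, here is a small PyTorch sketch (not any specific published model): detector region features and word embeddings are projected to a common width, concatenated, and passed through one shared transformer. All sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    """Single-stream sketch: image regions and word tokens share one transformer."""
    def __init__(self, vocab_size=30522, region_dim=2048, hidden=768, layers=12):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, region_feats, token_ids):
        # region_feats: (B, R, region_dim) from a pretrained detector
        # token_ids:    (B, T) word-piece ids
        v = self.region_proj(region_feats)   # (B, R, hidden)
        w = self.word_emb(token_ids)         # (B, T, hidden)
        x = torch.cat([v, w], dim=1)         # one joint image+text sequence
        return self.encoder(x)               # (B, R+T, hidden)
```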
ViLBERT
Problem: different modalities may require different levels of abstraction.
• Linguistic stream: word tokens (e.g. "artist") enter through only a linear embedding before the transformer.
• Visual stream: region features already come from a very deep CNN [He et al. 2015].
ViLBERT
Solution: a two-stream model that processes the visual and linguistic inputs separately.
[Figure: a linguistic stream (L-BERT, k layers) encodes <CLS> Tok1 Tok2 … <SEP>, and a visual stream (V-BERT, l layers) encodes <IMG> and the image regions; each stream produces its own hidden states.]
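A minimal PyTorch sketch of the two-stream idea (layer counts, widths, and names are illustrative, not the exact ViLBERT configuration): each modality gets its own transformer before any cross-modal interaction.

```python
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    """Two-stream sketch: a separate transformer per modality."""
    def __init__(self, vocab_size=30522, region_dim=2048,
                 txt_hidden=768, vis_hidden=1024, k=12, l=6):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, txt_hidden)
        self.region_proj = nn.Linear(region_dim, vis_hidden)
        txt_layer = nn.TransformerEncoderLayer(txt_hidden, nhead=12, batch_first=True)
        vis_layer = nn.TransformerEncoderLayer(vis_hidden, nhead=8, batch_first=True)
        self.l_bert = nn.TransformerEncoder(txt_layer, num_layers=k)   # linguistic stream
        self.v_bert = nn.TransformerEncoder(vis_layer, num_layers=l)   # visual stream

    def forward(self, token_ids, region_feats):
        txt = self.l_bert(self.word_emb(token_ids))          # (B, T, txt_hidden)
        vis = self.v_bert(self.region_proj(region_feats))    # (B, R, vis_hidden)
        return txt, vis  # fused later with co-attention (next slide)
```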
ViLBERT
Problem: how to fuse the two modalities?
Solution: use co-attention [Lu et al. 2016] to exchange information between the two streams: each stream's queries attend over the other stream's keys and values.
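A sketch of a co-attentional block in this spirit (both streams are assumed to share one width here for simplicity; the real layers also include layer norm and feed-forward sublayers):

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Co-attention sketch: queries from one stream, keys/values from the other."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (B, R, hidden) region states; txt: (B, T, hidden) token states
        vis_ctx, _ = self.vis_attends_txt(query=vis, key=txt, value=txt)
        txt_ctx, _ = self.txt_attends_vis(query=txt, key=vis, value=vis)
        # residual connections only; norm / feed-forward omitted in this sketch
        return vis + vis_ctx, txt + txt_ctx
```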
Pre-training Objectives
Masked multi-modal modelling
• Follows the masked LM objective in BERT.
• 15% of the words or image regions are selected for prediction.
• Linguistic stream:
  o 80% of the time, replace with [MASK].
  o 10% of the time, replace with a random word.
  o 10% of the time, keep the word unchanged.
• Visual stream:
  o 80% of the time, replace the region feature with a zero vector.
Multi-modal alignment prediction
• Predict whether the image and the caption are aligned or not.
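A small sketch of the two masking routines described above, operating on a single example. The [MASK] id, thresholds, and helper names are illustrative; only the percentages follow the slide.

```python
import random
import torch

MASK_ID = 103  # [MASK] id in the standard BERT vocabulary

def mask_words(token_ids, vocab_size, p=0.15):
    """Pick ~15% of positions; 80% become [MASK], 10% a random word, 10% unchanged.
    Returns corrupted ids and targets (-1 means "not predicted")."""
    token_ids = token_ids.clone()                 # (T,) word-piece ids
    targets = torch.full_like(token_ids, -1)
    for i in range(token_ids.numel()):
        if random.random() < p:
            targets[i] = token_ids[i]
            r = random.random()
            if r < 0.8:
                token_ids[i] = MASK_ID
            elif r < 0.9:
                token_ids[i] = random.randrange(vocab_size)
    return token_ids, targets

def mask_regions(region_feats, p=0.15):
    """Pick ~15% of regions; 80% of those get their feature zeroed out.
    The model is then trained to predict the masked regions' content."""
    region_feats = region_feats.clone()           # (R, D) detector features
    masked = torch.zeros(region_feats.size(0), dtype=torch.bool)
    for i in range(region_feats.size(0)):
        if random.random() < p:
            masked[i] = True
            if random.random() < 0.8:
                region_feats[i] = 0.0
    return region_feats, masked
```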
Visualizations
A boat covered in flowers near the market.
[Sharma et al. 2018]
Sentence → Image and Image → Sentence attention
[Figure: cross-modal attention maps across heads H0–H7 at layer 0 and layer 5, visualized with BertViz (https://github.com/jessevig/bertviz).]
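For reference, the usual BertViz recipe for such plots looks roughly like the sketch below. It shows a standard text-only BERT in a notebook; applying it to a two-stream vision-and-language model would require exporting that model's attention tensors yourself.

```python
from bertviz import head_view
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("a boat covered in flowers near the market", return_tensors="pt")
outputs = model(**inputs)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)  # interactive per-layer / per-head view (notebook)
```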
Pre-training and Fine-tuning
[Figure, left (pre-training): an image and text pair from Conceptual Captions ("Man shopping for …") is fed to the Vision & Language BERT with masked image regions and masked words.]
[Figure, right (fine-tuning): an image-question pair ("What is the …") is fed to the same model, which predicts the answer (e.g. "shopping") and is fine-tuned for VQA, VCR, and referring expressions.]
Fine-tuning Procedure
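As an illustration of the fine-tuning step, a sketch of a VQA head on top of the pretrained trunk: the overall image representation and the overall text representation are fused (ViLBERT uses an element-wise product) and scored against a fixed answer vocabulary. Sizes and the answer count are placeholders.

```python
import torch.nn as nn

class VQAHead(nn.Module):
    """VQA fine-tuning sketch: fuse h_IMG and h_CLS, classify over answers."""
    def __init__(self, hidden=1024, num_answers=3129):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 2 * hidden), nn.ReLU(),
            nn.Linear(2 * hidden, num_answers),
        )

    def forward(self, h_img, h_cls):
        # h_img: (B, hidden) pooled <IMG> state; h_cls: (B, hidden) pooled <CLS> state
        return self.classifier(h_img * h_cls)   # (B, num_answers) answer logits
```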
Tasks
[Antol et al. 2015, Zellers et al. 2018, Yu et al. 2016, Plummer et al. 2015]
Results
[Bar charts: VQA test-dev accuracy (70.22, 65.9, 68.85, 68.93, 70.55), VCR Q->A val accuracy (43.1, 47.27, 52.73, 49.48, 54.04), RefCOCO+ val accuracy (65.33, 65.64, 69.21, 68.61, 72.34), Image Retrieval test (48.6, 45.5, 58.2).]
Concurrent Work
[Li 2019, Tan 2019, Li 2019, Su 2019, Zhou 2019, Chen 2019]
Summary
Task-agnostic pretraining of visiolinguistic representations for visual grounding:
• Introduces pretrain-transfer to vision-and-language tasks.
• Achieves SOTA on multiple vision-and-language tasks.
Limitations
• The model can still learn inconsistent grounding through task-specific fine-tuning.
• Next step: training multiple vision-and-language tasks together (multi-task V&L).
Multi-Task V&L Learning
One model for V&L: ViLBERT. Problems:
• Inconsistent grounding from task-specific fine-tuning.
• Only four V&L tasks.
• The model is huge and prone to overfitting.
What we want:
• Test on more tasks.
• Consistent grounding across tasks.
• Explore the limits of the model.
Multi-Task V&L Learning
Twelve tasks in four groups:
• VQA: VQA, Genome QA, GQA
• Image Description: caption-based retrieval (COCO), caption-based retrieval (Flickr)
• Referring Expression: RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GuessWhat
• V&L Verification: NLVR2, Visual Entailment
Multi-Task V&L Learning
Model improvements over ViLBERT:
• Masked multi-modal modelling only for aligned image-caption pairs.
• When masking an image region, also mask regions that overlap it (IoU > 0.4), so the masked content cannot simply be read off a near-duplicate region (see the sketch below).
[Figure: the two-stream model (V-BERT over the image, L-BERT over the aligned caption) with <MASK> applied during masked multi-modal modelling.]
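A sketch of the overlap rule. The box format and helper name are illustrative; only the IoU > 0.4 threshold comes from the slide.

```python
import torch

def regions_to_mask(boxes, masked_idx, iou_thresh=0.4):
    """Given region boxes (N, 4) as (x1, y1, x2, y2) and one region chosen for
    masking, return a boolean mask that also covers every region whose IoU
    with it exceeds the threshold."""
    x1 = torch.maximum(boxes[:, 0], boxes[masked_idx, 0])
    y1 = torch.maximum(boxes[:, 1], boxes[masked_idx, 1])
    x2 = torch.minimum(boxes[:, 2], boxes[masked_idx, 2])
    y2 = torch.minimum(boxes[:, 3], boxes[masked_idx, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = areas + areas[masked_idx] - inter
    iou = inter / union.clamp(min=1e-6)
    return iou > iou_thresh   # includes masked_idx itself (IoU = 1)
```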
Multi-Task V&L Learning
Multi-task vision and language learning:
• Use a different output head per task, but let similar tasks share the same head (see the sketch below).
[Figure: the shared two-stream trunk (V-BERT over the image, L-BERT over <TSK> Tok1 … <SEP> of the aligned caption) feeds the task heads: VQA / Genome QA, GQA, Retrieval, NLVR, Visual Entailment, Referring Expression.]
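A sketch of the shared-head layout. The grouping follows the figure; output sizes and dataset keys are placeholders, not the exact published configuration.

```python
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """One head per task group on a shared trunk; datasets in a group reuse it."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.heads = nn.ModuleDict({
            "vqa":       nn.Linear(hidden, 3129),  # VQA and Genome QA share this head
            "gqa":       nn.Linear(hidden, 1500),  # GQA answer vocabulary (placeholder size)
            "retrieval": nn.Linear(hidden, 1),     # image-caption matching score
            "refer":     nn.Linear(hidden, 1),     # score per image region
            "nlvr":      nn.Linear(hidden, 2),     # true / false
            "ve":        nn.Linear(hidden, 3),     # entailment / neutral / contradiction
        })
        self.task_to_head = {
            "VQA": "vqa", "GenomeQA": "vqa", "GQA": "gqa",
            "RetrievalCOCO": "retrieval", "RetrievalFlickr": "retrieval",
            "RefCOCO": "refer", "RefCOCO+": "refer", "RefCOCOg": "refer",
            "Visual7W": "refer", "GuessWhat": "refer",
            "NLVR2": "nlvr", "SNLI-VE": "ve",
        }

    def forward(self, pooled, task):
        # pooled: (B, hidden) fused representation from the trunk
        return self.heads[self.task_to_head[task]](pooled)
```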
Multi-Task V&L Learning
• Add a <TSK> token per task for multi-task training.
• Dynamic Stop and Go: pause ("stop") tasks that have converged and resume ("go") them if their performance starts to degrade, so smaller datasets are not over-trained while larger ones keep training.
Training Procedure (a simplified loop is sketched below)
• Conceptual Captions pre-training.
• Multi-task training.
• Fine-tune on each single task.
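A deliberately simplified multi-task training loop, only to show the shape of the middle stage. Task sampling ratios, dynamic stop-and-go scheduling, and per-task optimisation details are omitted; the trunk call signature and the heads.loss helper are hypothetical.

```python
import random

def multitask_train(trunk, heads, loaders, optimizer, steps=100_000):
    """Sample a task each step, draw one of its batches, update shared trunk + head."""
    iters = {task: iter(dl) for task, dl in loaders.items()}
    for _ in range(steps):
        task = random.choice(list(loaders))            # naive uniform task sampling
        try:
            batch = next(iters[task])
        except StopIteration:                          # restart an exhausted loader
            iters[task] = iter(loaders[task])
            batch = next(iters[task])
        pooled = trunk(batch["image_regions"], batch["tokens"], task=task)
        loss = heads.loss(pooled, batch["targets"], task=task)   # hypothetical helper
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```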
Compare with SOTA
[Bar charts comparing 1-stream and 2-stream models pretrained on different data (CC, CC+wiki, CC+COCO, COCO+VG, CC+SBU, CC + MT): VQA test-dev accuracies 70.55, 70.8, 71.16, 71.24, 72.42, 72.27, 72.06, 73.08; RefCOCO+ test accuracies 70.57, 69.36, 71.12, 72.9, 73.4, 74.12.]
[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]
Ablation Study
[Bar charts over ten tasks, four model variants each: VQA test-dev 71.24 / 72.03 / 72.06 / 73.08; Genome QA test 34.1 / 36.18 / 35.05 / 36.38; GQA test 59.09 / 59.6 / 59.81 / 60.72; Retrieval COCO R1 64.8 / 65.06 / 63.02 / 67.36; Retrieval Flickr R1 61.46 / 66 / 63.19 / 66.12; Visual7W test 80.51 / 81.54 / 82.52 / 83.06; GuessWhat test 62.53 / 64.78 / 64.43 / 65.19; NLVR2 test 74.25 / 74.62 / 77.72 / 78.24; SNLI-VE test 76.53 / 76.52 / 76.63 / 77.32; RefCOCO+ test 69.47 / 72.8 / 73.4 / 74.12.]
Ablation Study
Task performance with different groups (rows: relative performance of a task group; columns: group it is trained together with):

Relative performance of   VQA    Image Retrieval   Refer Expression   V&L Verification   Average
VQA                        -          0.38              0.38              -0.2              0.19
Image Retrieval          0.46           -               0.23             -4.13             -1.15
Refer Expression         0.39         0.78               -                0.24              0.47
V&L Verification         2.29         1.47              0.67               -                1.48
Average                  1.04         0.88              0.43             -1.36               -
Summary
Exploring multi-task vision and language learning:
• Introduce several tricks to improve ViLBERT.
• Add <TSK> tokens to improve multi-task learning.
• Fine-tuning on the multi-task representation leads to new SOTA.
• Study how the different task groups affect each other.
Potential directions
• How to use the information across tasks for XAI?
• How to incorporate more modalities?
• How to make the model smaller and more efficient?
• How to combine symbolic reasoning with representation learning?
The End
Questions?