Deep Reinforcement Learning-based Image Captioning with Embedding Reward
Zhou Ren¹, Xiaoyu Wang¹, Ning Zhang¹, Xutao Lv¹, Li-Jia Li²
¹Snap Research, ²Google
Image captioning
Deep Reinforcement Learning-based Image Captioning with Embedding Reward. CVPR 2017.
[Farhadi et al. ECCV 2010] [Kulkarni et al. CVPR 2011] [Yang et al. EMNLP 2011] [Fang et al. CVPR 2015] [Lebret et al. ICLR 2015] [Mao et al. ICLR 2015] [Vinyals et al. CVPR 2015] [Karpathy et al. CVPR 2015] [Chen et al. CVPR 2015] [Xu et al. ICML 2015] [Johnson et al. CVPR 2016] [You et al. CVPR 2016] …
indoor, dark, furry, sit, eat
Previous work
[Farhadi et al. 2010; Kulkarni et al. 2011; Yang et al. 2011]
example (figure): cat, bat, chair
Previous work
semantic attention [You et al. CVPR 2016]
word detection [Fang et al. CVPR 2015]
spatial attention [Xu et al. ICML 2015]
[Lebret et al. 2015; Mao et al. 2015; Vinyals et al. 2015; Karpathy et al. 2015]
Motivation
● Limitations of current mainstream framework (encoder-decoder)
○ only local information is utilized
○ prone to accumulate generation errors during inference
○ sensitive to beam sizes during beam search
● Our target
○ better at utilizing the global information
○ be able to compensate errors
○ less sensitive to beam sizes
→ Decision-Making framework with Reinforcement Learning
Why use decision-making?
Human-level gaming control [Mnih et al. Nature 2015]
Visual navigation [Zhu et al. ICRA 2017]
AlphaGo [Silver et al. Nature 2016]
[Figure: decision-making loop between Agent and Environment, labeled with goal, state, actions, and reward]
Image captioning reformulation in decision-making
[Figure: decision-making loop between Agent and Environment, labeled with goal, state, actions, and reward]
- Goal: to generate a visual description given an image
- Agent: the image captioning model to learn
- Environment: the given image I + the words predicted so far
- State: representation of the environment at t
- Action: the word to generate at t + 1
- Reward: the feedback for reinforcement learning
Overview of our approach
● We propose a decision-making framework for image captioning
❏ An agent model contains a policy network, to capture the local information, and a value network, to capture the global information
❏ Training using reinforcement learning with embedding reward
❏ Testing using lookahead inference
Our approach - agent architecture
Policy network (local guidance)
example (figure): 0.82, 0.56, 0.03
Value network (global guidance)
vθ(st) regresses the reward obtained at the end of generation
example (figure): 0.89, 0.03
Our approach - train our agent
● Pretrain policy network p with cross entropy loss
● Pretrain value network vθ with the mean squared loss
● Train p and vθ jointly using deep Reinforcement Learning
○ an Actor-Critic RL model
○ MIXER [Ranzato et al. ICLR 2016]
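The two pretraining losses named above can be written out in a few lines. This is a minimal sketch with hypothetical function names. It assumes the policy outputs one probability vector per step and that the value targets are the reward of the finished caption. The `advantage` helper shows a common actor-critic choice (reward minus the critic's estimate); the slide does not spell out the exact update, so treat it as an assumption.

```python
import math
from typing import Sequence


def cross_entropy(probs_per_step: Sequence[Sequence[float]],
                  target_ids: Sequence[int]) -> float:
    """Negative log-likelihood of the ground-truth word ids under the policy outputs."""
    return -sum(math.log(p[t]) for p, t in zip(probs_per_step, target_ids))


def mean_squared_loss(values: Sequence[float], final_reward: float) -> float:
    """Mean squared error between each value estimate and the final reward."""
    return sum((v - final_reward) ** 2 for v in values) / len(values)


def advantage(final_reward: float, value: float) -> float:
    """How much better the outcome was than the critic predicted (assumed form)."""
    return final_reward - value


# Toy numbers: a 3-word vocabulary and a 2-step caption.
policy_loss = cross_entropy([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [0, 1])
value_loss = mean_squared_loss([0.5, 0.9], final_reward=1.0)
signal = advantage(final_reward=1.0, value=0.75)
```

In an actor-critic setup the policy plays the actor and the value network plays the critic, so a positive `signal` would push up the probability of the words that were actually generated.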
Reinforcement learning - reward definition
● Literature: metric-driven [Ranzato et al. ICLR 2016]
● Limitations:
○ evaluation metrics in image captioning are not perfectly defined.
○ the model needs to be retrained for each metric in isolation.
○ it has no value network (no global guidance).
● Visual-Semantic Embedding [Frome et al. NIPS 2013; Kiros et al. TACL 2015]
example (figure): embedding space
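One way to turn a visual-semantic embedding into a reward is to score how close the generated caption lies to the image in the shared space. The sketch below assumes cosine similarity as the closeness measure and assumes both embeddings are already computed. `embedding_reward` and the toy vectors are hypothetical, and the encoders that would produce real embeddings are out of scope here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def embedding_reward(image_vec: Sequence[float],
                     caption_vec: Sequence[float]) -> float:
    """Reward in [-1, 1]: higher when the caption embedding aligns with the image."""
    return cosine_similarity(image_vec, caption_vec)


# Toy 2-D embeddings (real embeddings would have far more dimensions).
aligned = embedding_reward([3.0, 4.0], [3.0, 4.0])
orthogonal = embedding_reward([1.0, 0.0], [0.0, 1.0])
opposite = embedding_reward([1.0, 0.0], [-1.0, 0.0])
```

Because the score depends only on the two embeddings, the same reward can be reused regardless of which evaluation metric is reported later.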
Our approach - inference with our agent
Lookahead inference
[Figure: lookahead inference, combining local guidance (policy network) and global guidance (value network)]
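A single step of lookahead inference can be sketched as a beam search whose score mixes the two kinds of guidance. The weighting below, `lam * log p + (1 - lam) * v`, is an assumed form for illustration. `lookahead_step`, `lam`, and the toy policy and value functions are hypothetical names, and the policy is assumed to return strictly positive probabilities.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Caption = Tuple[str, ...]


def lookahead_step(beams: Sequence[Tuple[Caption, float]],
                   policy: Callable[[Caption], Dict[str, float]],
                   value: Callable[[Caption], float],
                   lam: float = 0.5,
                   beam_size: int = 2) -> List[Tuple[Caption, float]]:
    """Extend every beam by one word and keep the `beam_size` best candidates."""
    candidates = []
    for words, score in beams:
        for word, prob in policy(words).items():
            extended = words + (word,)
            # Local guidance: log-probability of the next word under the policy.
            # Global guidance: value estimate of the state after adding the word.
            new_score = score + lam * math.log(prob) + (1.0 - lam) * value(extended)
            candidates.append((extended, new_score))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]


def toy_policy(words: Caption) -> Dict[str, float]:
    return {"cat": 0.6, "dog": 0.4}


def toy_value(words: Caption) -> float:
    return 0.9 if words[-1] == "dog" else 0.1


start = [((), 0.0)]
mixed = lookahead_step(start, toy_policy, toy_value, lam=0.5)
policy_only = lookahead_step(start, toy_policy, toy_value, lam=1.0)
```

With `lam=1.0` the step reduces to ordinary beam search on policy log-probabilities; lowering `lam` lets the value estimate overrule a locally likely word.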
Experimental Results
Results on MS-COCO
[Table: results on MS-COCO, with four highlighted comparisons]
① simple policy net vs. simple net structure
② simple policy net vs. advanced net structure (semantic attention, spatial attention, object detection)
③ embedding-driven RL vs. metric-driven RL
④ w/o external training data vs. with external training data
Qualitative results
Take Home
● We proposed a novel decision-making framework for image captioning.
○ An agent model → a policy network + a value network
○ A training method → Reinforcement Learning with embedding reward
○ An inference method → lookahead inference
● Utilizing both global and local information is important for sequential
generation tasks.
● Embedding can capture global information and can serve as very
good global guidance.