
Deep Reinforcement Learning-based Image Captioning with Embedding Reward

Zhou Ren¹, Xiaoyu Wang¹, Ning Zhang¹, Xutao Lv¹, Li-Jia Li²

¹Snap Research, ²Google

Image captioning

Deep Reinforcement Learning-based Image Captioning with Embedding Reward. CVPR 2017.

[Farhadi et al. ECCV 2010] [Kulkarni et al. CVPR 2011] [Yang et al. EMNLP 2011] [Fang et al. CVPR 2015] [Lebret et al. ICLR 2015] [Mao et al. ICLR 2015] [Vinyals et al. CVPR 2015] [Karpathy et al. CVPR 2015] [Chen et al. CVPR 2015] [Xu et al. ICML 2015] [Johnson et al. CVPR 2016] [You et al. CVPR 2016] …

[Figure: example image labeled with the words indoor, dark, furry, sit, eat]

Previous work


[Farhadi et al. 2010; Kulkarni et al. 2011; Yang et al. 2011]

[Figure: example with the words cat, bat, chair]


semantic attention [You et al. CVPR 2016]

word detection [Fang et al. CVPR 2015]

spatial attention [Xu et al. ICML 2015]

[Lebret et al. 2015; Mao et al. 2015; Vinyals et al. 2015; Karpathy et al. 2015]

Motivation

● Limitations of current mainstream framework (encoder-decoder)

○ only local information is utilized

○ prone to accumulating generation errors during inference

○ sensitive to beam size during beam search

● Our target

○ better at utilizing the global information

○ able to compensate for errors

○ less sensitive to beam size

Decision-making framework with reinforcement learning

Why use decision-making?

Human-level gaming control [Mnih et al. Nature 2015]

Visual navigation [Zhu et al. ICRA 2017]

AlphaGo [Silver et al. Nature 2016]

[Figure: agent-environment loop. The agent, pursuing a goal, observes the state of the environment, takes actions, and receives a reward.]

Image captioning reformulation in decision-making


- Goal: to generate a visual description given an image



- Agent: the image captioning model to learn




- Environment: the given image I + the words predicted so far





- State: representation of the environment at time t, s_t = {I, w_1, ..., w_t}

- Action: the word to generate at t + 1, a_t = w_{t+1}




- Reward: the feedback for reinforcement learning
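The reformulation above can be sketched in a few lines of Python. This is only an illustration: the names `State`, `step`, and `is_terminal`, the `image_id` field, the `<eos>` token, and the length cap are all assumptions made for the example, not details from the talk.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class State:
    """s_t: the given image plus the words predicted so far."""
    image_id: str                 # hypothetical stand-in for the image I
    words: Tuple[str, ...] = ()   # w_1, ..., w_t


def step(state: State, action: str) -> State:
    """Apply an action (the next word) and return the next state."""
    return State(state.image_id, state.words + (action,))


def is_terminal(state: State, eos: str = "<eos>", max_len: int = 20) -> bool:
    """An episode ends at the end-of-sentence token or at a length cap."""
    return bool(state.words) and (
        state.words[-1] == eos or len(state.words) >= max_len
    )


# The agent emits one word per step until the caption is complete.
s = State("img_001")
for word in ["a", "cat", "<eos>"]:
    s = step(s, word)
print(s.words, is_terminal(s))
```

The reward is left out on purpose: in this setting it is only available once the caption is finished, which is what motivates a value network that estimates it from partial captions.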

Overview of our approach


● We propose a decision-making framework for image captioning

❏ An agent model containing a policy network, to capture the local information, and a value network, to capture the global information

❏ Training using reinforcement learning with embedding reward

❏ Testing using lookahead inference

Our approach - agent architecture


Policy network (local guidance)

[Figure: policy network example; values shown: 0.82, 0.56, 0.03]

Value network (global guidance)

vθ(s_t) tries to regress the final reward.

[Figure: value network example; values shown: 0.89, 0.03]

Our approach - train our agent

● Pretrain the policy network p with a cross-entropy loss

● Pretrain the value network vθ with a mean squared error loss

● Train p and vθ jointly using deep reinforcement learning

○ an Actor-Critic RL model

○ MIXER [Ranzato et al. ICLR 2016]
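The three training signals can be written out as plain functions. This is a minimal sketch under assumptions: the function names are hypothetical, the value target is taken to be the final reward (as the value-network slide suggests), and "reward minus value" is a common actor-critic weighting rather than an update rule stated on the slides.

```python
import math
from typing import Sequence


def cross_entropy(probs_per_step: Sequence[Sequence[float]],
                  targets: Sequence[int]) -> float:
    """Negative log-likelihood of the ground-truth words under the policy."""
    return -sum(math.log(p[t]) for p, t in zip(probs_per_step, targets))


def value_mse(predicted_values: Sequence[float], final_reward: float) -> float:
    """Mean squared error between v(s_t) at each step and the final reward."""
    return sum((v - final_reward) ** 2 for v in predicted_values) / len(predicted_values)


def advantage(final_reward: float, predicted_value: float) -> float:
    """Assumed actor-critic weight: final reward minus the critic's estimate."""
    return final_reward - predicted_value


# Toy numbers, made up for illustration only.
policy_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # p(word | s_t) over a 3-word vocabulary
ground_truth = [0, 1]                               # index of the correct word at each step
print(cross_entropy(policy_probs, ground_truth))
print(value_mse([0.5, 0.7], final_reward=0.9))
print(advantage(final_reward=0.9, predicted_value=0.5))
```

During joint training, a positive advantage would push the policy toward the sampled word and a negative one away from it, while the value network keeps regressing toward the observed reward.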



Reinforcement learning - reward definition

● Literature: metric-driven [Ranzato et al. ICLR 2016]



● Limitations:

○ metrics in image captioning are not perfectly defined.

○ the model needs to be retrained for each metric in isolation.

○ there is no value network (no global guidance).

● Visual-Semantic Embedding [Frome et al. NIPS 2013; Kiros et al. TACL 2015]

[Figure: example of the embedding space]
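One way to turn a visual-semantic embedding into a reward is to score a caption by how close its embedding lies to the image embedding. The sketch below assumes cosine similarity as that score; the slides do not give the formula, and the function names and the three-dimensional vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embedding_reward(image_embedding: Sequence[float],
                     caption_embedding: Sequence[float]) -> float:
    """Assumed reward: similarity of image and caption in the shared space."""
    return cosine_similarity(image_embedding, caption_embedding)


# Hypothetical embeddings: a matching caption should score higher.
image = [0.2, 0.9, 0.1]
matching_caption = [0.25, 0.8, 0.05]
unrelated_caption = [0.9, 0.1, 0.3]
print(embedding_reward(image, matching_caption))
print(embedding_reward(image, unrelated_caption))
```

Unlike a metric-driven reward, a score of this kind does not depend on any single evaluation metric, which addresses the retraining limitation listed above.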

Our approach - inference with our agent


Lookahead inference

[Figure: lookahead inference combines global guidance (value network) with local guidance (policy network)]
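A beam search that mixes both kinds of guidance can be sketched as follows. The scoring rule here is an assumption in the spirit of the slides: each extension adds `lam * log p(word | state)` from the policy (local guidance) plus `(1 - lam) * v(next state)` from the value network (global guidance). The weight `lam`, the toy policy, and the toy value function are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Caption = Tuple[str, ...]


def lookahead_beam_search(policy: Callable[[Caption], Dict[str, float]],
                          value: Callable[[Caption], float],
                          beam_size: int = 2, lam: float = 0.4,
                          max_len: int = 5, eos: str = "<eos>") -> Caption:
    """Beam search scored by policy log-probabilities and value estimates."""
    beams: List[Tuple[float, Caption]] = [(0.0, ())]
    for _ in range(max_len):
        candidates: List[Tuple[float, Caption]] = []
        for score, words in beams:
            if words and words[-1] == eos:
                candidates.append((score, words))  # finished caption is carried over
                continue
            for word, prob in policy(words).items():  # prob must be > 0
                extended = words + (word,)
                new_score = score + lam * math.log(prob) + (1 - lam) * value(extended)
                candidates.append((new_score, extended))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(w and w[-1] == eos for _, w in beams):
            break
    return beams[0][1]


def toy_policy(words: Caption) -> Dict[str, float]:
    """Made-up next-word distribution that slightly prefers 'dog'."""
    if not words:
        return {"a": 0.9, "the": 0.1}
    if len(words) == 1:
        return {"dog": 0.6, "cat": 0.4}
    return {"<eos>": 1.0}


def toy_value(words: Caption) -> float:
    """Made-up value estimate: captions mentioning 'cat' fit the image better."""
    return 0.9 if "cat" in words else 0.1


print(lookahead_beam_search(toy_policy, toy_value, lam=0.4))  # ('a', 'cat', '<eos>')
print(lookahead_beam_search(toy_policy, toy_value, lam=1.0))  # ('a', 'dog', '<eos>')
```

With `lam=1.0` the search reduces to ordinary beam search on the policy alone and follows the locally likelier word; with `lam=0.4` the value estimate overrides that choice, which is the kind of error compensation the motivation slide asks for.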

Experimental Results

Results on MS-COCO


[Table: results on MS-COCO, with four highlighted comparisons]

① simple policy net vs. simple net structure

② simple policy net vs. advanced net structure (semantic attention, spatial attention, object detection)

③ embedding-driven RL vs. metric-driven RL

④ w/o external training data vs. with external training data

Qualitative results


[Figures: qualitative captioning examples]

Take Home



● We proposed a novel decision-making framework for image captioning.

○ An agent model → a policy network + a value network

○ A training method → Reinforcement Learning with embedding reward

○ An inference method → lookahead inference

● Utilizing both global and local information is important for sequential generation tasks.

● Embedding can capture global information and can serve as a very good global guidance.


[email protected]

Thank you! Welcome to visit our poster at #9-B
