Deep Reinforcement Learning-based Image Captioning with Embedding Reward

Zhou Ren¹, Xiaoyu Wang¹, Ning Zhang¹, Xutao Lv¹, Li-Jia Li²

¹Snap Research, ²Google

Image captioning

Deep Reinforcement Learning-based Image Captioning with Embedding Reward. CVPR 2017.


[Farhadi et al. ECCV 2010][Kulkarni et al. CVPR 2011][Yang et al. EMNLP 2011][Fang et al. CVPR 2015][Lebret et al. ICLR 2015][Mao et al. ICLR 2015][Vinyals et al. CVPR 2015][Karpathy et al. CVPR 2015][Chen et al. CVPR 2015][Xu et al. ICML 2015][Johnson et al. CVPR 2016][You et al. CVPR 2016]…

indoor, dark, furry, sit, eat

Previous work


[Farhadi et al. 2010; Kulkarni et al. 2011; Yang et al. 2011]

example: cat, bat, chair


semantic attention [You et al. CVPR 2016]

word detection [Fang et al. CVPR 2015]

spatial attention [Xu et al. ICML 2015]

[Lebret et al. 2015; Mao et al. 2015; Vinyals et al. 2015; Karpathy et al. 2015]

Motivation

● Limitations of current mainstream framework (encoder-decoder)

○ only local information is utilized

○ prone to accumulating generation errors during inference

○ sensitive to beam size during beam search

● Our target

○ make better use of global information

○ be able to compensate for errors

○ be less sensitive to beam size

Decision-making framework with reinforcement learning

Why use decision-making?

Human-level gaming control [Mnih et al. Nature 2015]

Visual navigation [Zhu et al. ICRA 2017]

AlphaGo [Silver et al. Nature 2016]

[Diagram: Agent, Goal, Environment; state, actions, reward]

Image captioning reformulation in decision-making

[Diagram: Agent, Goal, Environment; state, actions, reward]

- Goal: to generate a visual description given an image

- Agent: the image captioning model to learn

- Environment: the given image I + the words predicted so far

- State: representation of the environment at t

- Action: the word to generate at t + 1

- Reward: the feedback for reinforcement learning
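The reformulation above can be sketched as a small generation loop. This is only an illustration under stated assumptions: `State`, `step`, `rollout`, `toy_policy`, the `<eos>` token, and the image id are hypothetical names that do not appear in the slides, and the policy here is a canned sequence rather than a learned model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

EOS = "<eos>"  # hypothetical end-of-sentence token


@dataclass(frozen=True)
class State:
    """State at step t: the image plus the words predicted so far."""
    image_id: str
    words: Tuple[str, ...] = ()


def step(state: State, action: str) -> State:
    """An action is the next word; taking it appends the word to the caption."""
    return State(state.image_id, state.words + (action,))


def rollout(policy: Callable[[State], str], image_id: str, max_len: int = 20) -> State:
    """Let the agent generate words until it emits EOS or reaches max_len."""
    state = State(image_id)
    for _ in range(max_len):
        action = policy(state)
        state = step(state, action)
        if action == EOS:
            break
    return state


# Toy deterministic policy, used only to exercise the loop.
CANNED = ("a", "cat", "sits", "on", "a", "chair", EOS)


def toy_policy(state: State) -> str:
    return CANNED[len(state.words)]


final = rollout(toy_policy, "img_001")
caption = " ".join(w for w in final.words if w != EOS)
print(caption)
```

In this sketch the reward would be computed once, from the finished caption and the image, after `rollout` returns.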

Overview of our approach

● We propose a decision-making framework for image captioning

❏ An agent model contains a policy network, to capture the local information, and a value network, to capture the global information

❏ Training using reinforcement learning with embedding reward

❏ Testing using lookahead inference

Our approach - agent architecture

Policy network (local guidance)

example: [figure; values shown: 0.82, 0.56, 0.03]

Value network (global guidance)

vθ(st) is trying to regress the reward in the end

example: [figure; values shown: 0.89, 0.03]
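The statement that the value network regresses the final reward can be written as a squared-error objective. The form below is a sketch, assuming a single scalar reward r obtained once the caption is complete; the symbols r and L_v are introduced here for illustration and are not taken from the slides.

```latex
% Assumed regression objective for the value network:
% at every intermediate state s_t, predict the final reward r.
L_v(\theta) = \bigl( v_\theta(s_t) - r \bigr)^2
```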

Our approach - train our agent

● Pretrain policy network p with cross-entropy loss

● Pretrain value network vθ with mean squared loss

● Train p and vθ jointly using deep reinforcement learning

○ an Actor-Critic RL model

○ MIXER [Ranzato et al. ICLR 2016]
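As a rough illustration of the joint actor-critic step, the sketch below updates a softmax policy and a scalar value estimate from one sampled action and its final reward. It assumes a generic policy-gradient update with the value estimate as a baseline; the function names, learning rates, and the tabular (single-state) setup are hypothetical simplifications, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(logits, value, action, reward, lr_pi=0.5, lr_v=0.5):
    """One update for a single (state, action, final reward) sample.

    logits : policy scores over the vocabulary for this state
    value  : current value estimate for this state
    """
    probs = softmax(logits)
    advantage = reward - value  # the critic's estimate acts as a baseline
    # Gradient of log p(action) w.r.t. the logits is onehot(action) - probs.
    new_logits = [
        z + lr_pi * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]
    # Critic: one gradient step on 0.5 * (value - reward)^2.
    new_value = value - lr_v * (value - reward)
    return new_logits, new_value


new_logits, new_value = actor_critic_step([0.0, 0.0, 0.0], 0.2, action=1, reward=1.0)
```

Because the reward (1.0) exceeds the value estimate (0.2), the advantage is positive: the probability of the sampled word goes up and the value estimate moves toward the reward.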


Reinforcement learning - reward definition

● Literature: metric-driven [Ranzato et al. ICLR 2016]


● Limitations:

○ metrics in image captioning are not perfectly defined.

○ the model needs to be retrained for each metric in isolation.

○ there is no value network (no global guidance).

Reinforcement learning - reward definition

● Visual-Semantic Embedding [Frome et al. NIPS 2013; Kiros et al. TACL 2015]

example: [figure: embedding space]
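A minimal sketch of an embedding-based reward, assuming the image and the caption have already been mapped into a shared vector space and that the reward is their cosine similarity. The slides do not give the formula, so this choice, the function names, and the toy vectors are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_reward(image_vec, caption_vec):
    """Reward is high when the caption embedding points the same way as the image embedding."""
    return cosine_similarity(image_vec, caption_vec)


# Toy embeddings (hypothetical values, not produced by any trained model).
image_vec = [1.0, 0.0, 1.0]
good_caption_vec = [2.0, 0.0, 2.0]  # same direction as the image
bad_caption_vec = [0.0, 3.0, 0.0]   # orthogonal to the image

good_reward = embedding_reward(image_vec, good_caption_vec)
bad_reward = embedding_reward(image_vec, bad_caption_vec)
```

A reward of this kind does not depend on any particular evaluation metric, which is the contrast the slide draws with metric-driven rewards.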

Our approach - inference with our agent


Lookahead inference

[Figure labels: global guidance (value network), local guidance (policy network)]
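One way to combine the two kinds of guidance is a beam search whose step score mixes the policy's log-probability (local) with the value of the resulting state (global). The sketch below assumes a linear mix controlled by a weight `lam`; that weighting, the function names, and the toy policy and value functions are illustrative assumptions, not details given in the slides.

```python
import math


def lookahead_beam_search(policy, value, start=(), beam_size=2, lam=0.4,
                          max_len=10, eos="<eos>"):
    """Beam search where each extension adds
    lam * log p(word | words) + (1 - lam) * value(words + word)."""
    beams = [(0.0, start)]  # (accumulated score, tuple of words)
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            if words and words[-1] == eos:
                candidates.append((score, words))  # finished captions carry over
                continue
            for word, prob in policy(words).items():
                nxt = words + (word,)
                step_score = lam * math.log(prob) + (1.0 - lam) * value(nxt)
                candidates.append((score + step_score, nxt))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(w[-1] == eos for _, w in beams):
            break
    return beams[0][1]


def toy_policy(words):
    """Hypothetical next-word distribution; slightly prefers the wrong word."""
    if words == ():
        return {"a": 1.0}
    if words == ("a",):
        return {"dog": 0.6, "cat": 0.4}
    return {"<eos>": 1.0}


def toy_value(words):
    """Hypothetical value estimate; rates captions mentioning 'cat' higher."""
    return 0.9 if "cat" in words else 0.1


with_lookahead = lookahead_beam_search(toy_policy, toy_value, lam=0.4)
policy_only = lookahead_beam_search(toy_policy, toy_value, lam=1.0)
```

In this toy setup the policy alone ends on "dog", while the mixed score lets the value estimate correct the choice to "cat", which mirrors the stated goal of compensating for local errors.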


Experimental Results

Results on MS-COCO

[Result tables omitted; annotations recovered from the slides:]

① simple policy net vs. simple net structure

simple policy net vs. advanced net structure (semantic attention, spatial attention, object detection)

③ embedding-driven RL vs. metric-driven RL

④ w/o external training data vs. with external training data

Qualitative results


Take Home


● We proposed a novel decision-making framework for image captioning.

○ An agent model → a policy network + a value network

○ A training method → Reinforcement Learning with embedding reward

○ An inference method → lookahead inference

● Utilizing both global and local information is important for sequential generation tasks.

● Embedding can capture global information and can serve as a very good global guidance.


zhou.ren@snap.com

Thank you! Welcome to visit our poster at #9-B
