Deep Reinforcement Learning-based Image Captioning with Embedding Reward

Zhou Ren (1), Xiaoyu Wang (1), Ning Zhang (1), Xutao Lv (1), Li-Jia Li (2)

(1) Snap Research, (2) Google

Slides: web.cs.ucla.edu/~zhou.ren/Zhou_CVPR17_talk.pdf




Image captioning

Deep Reinforcement Learning-based Image Captioning with Embedding Reward. CVPR 2017.

[Farhadi et al. ECCV 2010] [Kulkarni et al. CVPR 2011] [Yang et al. EMNLP 2011] [Fang et al. CVPR 2015] [Lebret et al. ICLR 2015] [Mao et al. ICLR 2015] [Vinyals et al. CVPR 2015] [Karpathy et al. CVPR 2015] [Chen et al. CVPR 2015] [Xu et al. ICML 2015] [Johnson et al. CVPR 2016] [You et al. CVPR 2016] …

Previous work

[Farhadi et al. 2010; Kulkarni et al. 2011; Yang et al. 2011]

[Example figure; labels only: cat, bat, chair; indoor, dark, furry, sit, eat]

Previous work

semantic attention [You et al. CVPR 2016]

word detection [Fang et al. CVPR 2015]

spatial attention [Xu et al. ICML 2015]

[Lebret et al. 2015; Mao et al. 2015; Vinyals et al. 2015; Karpathy et al. 2015]

Motivation

● Limitations of current mainstream framework (encoder-decoder)
○ only local information is utilized
○ prone to accumulating generation errors during inference
○ sensitive to beam sizes during beam search
● Our target
○ better at utilizing the global information
○ able to compensate for errors
○ less sensitive to beam sizes

Decision-Making framework with Reinforcement Learning

Why use decision-making?

Human-level gaming control [Mnih et al. Nature 2015]

Visual navigation [Zhu et al. ICRA 2017]

AlphaGo [Silver et al. Nature 2016]

[Diagram labels: Agent, Goal, Environment, state, actions, reward]

Image captioning reformulation in decision-making

[Diagram labels: Agent, Goal, Environment, state, actions, reward]

- Goal: to generate a visual description given an image
- Agent: the image captioning model to learn
- Environment: the given image I + the words predicted so far
- State: representation of the environment at t, s_t = {I, w_1, ..., w_t}
- Action: the word to generate at t + 1, a_t = w_{t+1}
- Reward: the feedback for reinforcement learning
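The reformulation can be sketched as a small environment object. This is only an illustration: the class name `CaptionEnv`, the `<eos>` token, the `max_len` cutoff, and the choice to return a non-zero reward only once the caption is finished are assumptions made for the sketch, not details stated on the slides.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

EOS = "<eos>"  # hypothetical end-of-sentence token


@dataclass
class CaptionEnv:
    """Environment: the given image I plus the words predicted so far."""
    image: object                                    # stand-in for the image I
    reward_fn: Callable[[object, List[str]], float]  # scores a finished caption
    max_len: int = 20                                # assumed length cutoff
    words: List[str] = field(default_factory=list)

    def state(self) -> Tuple[object, Tuple[str, ...]]:
        # State at step t: the image together with the words chosen so far.
        return (self.image, tuple(self.words))

    def step(self, action: str):
        # Action: the next word to generate.
        self.words.append(action)
        done = action == EOS or len(self.words) >= self.max_len
        # Assumption: the reward is only given once the caption is complete.
        reward = self.reward_fn(self.image, self.words) if done else 0.0
        return self.state(), reward, done


# Toy rollout with a placeholder reward (caption length), purely for illustration.
env = CaptionEnv(image="img-0", reward_fn=lambda img, ws: float(len(ws)))
for w in ["a", "cat", EOS]:
    s, r, done = env.step(w)
```

The agent (the captioning model) would replace the hard-coded word list by choosing each action from the current state.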

Overview of our approach

● We propose a decision-making framework for image captioning
❏ An agent model contains a policy network, to capture the local information, and a value network, to capture the global information
❏ Training using reinforcement learning with embedding reward
❏ Testing using lookahead inference

Our approach - agent architecture

Policy network (local guidance)

[Example figure; values only: 0.82, 0.56, 0.03]

Value network (global guidance)

v_θ(s_t) tries to regress the reward received at the end

[Example figure; values only: 0.89, 0.03]
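To make the difference between the two outputs concrete, here is a toy sketch: the policy head returns a probability distribution over the next word, while the value head returns a single scalar for the current state. The linear heads, the 8-dimensional state encoding `h_t`, and the four-word vocabulary are assumptions chosen for brevity; the slides do not specify the network layers.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "cat", "sitting", "<eos>"]   # hypothetical tiny vocabulary
hidden = 8                                  # assumed size of the state encoding

W_p = rng.normal(size=(hidden, len(vocab)))  # policy head weights (random, untrained)
w_v = rng.normal(size=hidden)                # value head weights (random, untrained)


def policy(h):
    """Local guidance: a distribution over the next word given state encoding h."""
    logits = h @ W_p
    logits = logits - logits.max()           # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def value(h):
    """Global guidance: one scalar estimating the reward obtained at the end."""
    return float(h @ w_v)


h_t = rng.normal(size=hidden)                # stand-in for an encoding of the state
probs = policy(h_t)
v = value(h_t)
```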

Our approach - train our agent

● Pretrain policy network p with cross entropy loss
● Pretrain value network v_θ with the mean squared loss
● Train p and v_θ jointly using deep Reinforcement Learning
○ an Actor-Critic RL model
○ MIXER [Ranzato et al. ICLR 2016]
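A minimal numeric sketch of the three training signals follows. The function names are made up for this example, and using the value prediction as a baseline (advantage = final reward minus v(s_t)) is a common actor-critic choice assumed here; the slides do not spell out the exact update.

```python
import numpy as np


def cross_entropy(probs, targets):
    """Policy pretraining: mean negative log-likelihood of the ground-truth words.

    probs: (T, V) next-word distributions; targets: (T,) ground-truth word indices.
    """
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log(picked).mean())


def value_mse(values, final_reward):
    """Value pretraining: mean squared error against the reward at the end."""
    return float(np.mean((np.asarray(values) - final_reward) ** 2))


def advantages(values, final_reward):
    """Joint training (assumed form): weight for grad log p(a_t | s_t) at each step."""
    return final_reward - np.asarray(values)


probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])
values = [0.5, 0.9]

ce = cross_entropy(probs, targets)
mse = value_mse(values, final_reward=1.0)
adv = advantages(values, final_reward=1.0)
```

A positive advantage means the sampled caption did better than the value network expected, so the corresponding word choices would be reinforced.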

Reinforcement learning - reward definition

● Literature: metric-driven [Ranzato et al. ICLR 2016]
● Limitations:
○ metrics in image captioning are not perfectly defined.
○ the model needs to be retrained for each metric in isolation.
○ it has no value network (no global guidance).

Reinforcement learning - reward definition

● Visual-Semantic Embedding [Frome et al. NIPS 2013; Kiros et al. TACL 2015]

[Example figure: embedding space]
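One plausible way to turn a visual-semantic embedding into a reward is to score a caption by the cosine similarity between the image embedding and the caption embedding in the shared space. The sketch below assumes that form; the function name `embedding_reward` and the 3-dimensional toy vectors are hypothetical, and the learned encoders that would produce the vectors are not shown.

```python
import numpy as np


def embedding_reward(image_vec, caption_vec, eps=1e-8):
    """Cosine similarity between an image embedding and a caption embedding."""
    a = np.asarray(image_vec, dtype=float)
    b = np.asarray(caption_vec, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


# Toy vectors standing in for encoder outputs (illustrative values only).
img = np.array([1.0, 0.0, 1.0])
matching_caption = np.array([0.9, 0.1, 1.1])
unrelated_caption = np.array([-1.0, 0.5, 0.0])

r_good = embedding_reward(img, matching_caption)
r_bad = embedding_reward(img, unrelated_caption)
```

Because such a reward does not depend on a particular evaluation metric, the agent would not need retraining per metric, which addresses the limitation listed for metric-driven rewards.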


Our approach - inference with our agent



Lookahead inference

[Figure labels: global guidance, local guidance]
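Lookahead inference can be sketched as a beam search whose score mixes local guidance (the policy's log-probability of the next word) with global guidance (the value of the resulting state). The weighting `lam`, the additive accumulation of both terms, and the toy `toy_policy` and `toy_value` functions are assumptions for illustration, not the exact procedure from the talk.

```python
import math

EOS = "<eos>"  # hypothetical end-of-sentence token


def lookahead_search(policy, value, beam_size=2, lam=0.4, max_len=5):
    """Beam search scored by lam * log p(word | state) + (1 - lam) * v(next state)."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == EOS:
                candidates.append((words, score))  # finished captions carry over
                continue
            for w, p in policy(words).items():
                nxt = words + (w,)
                s = score + lam * math.log(p) + (1.0 - lam) * value(nxt)
                candidates.append((nxt, s))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(ws[-1] == EOS for ws, _ in beams):
            break
    return beams[0][0]


def toy_policy(words):
    # Locally, "dog" looks slightly more likely than "cat".
    if not words:
        return {"a": 0.6, "the": 0.4}
    if len(words) == 1:
        return {"dog": 0.55, "cat": 0.45}
    return {EOS: 1.0}


def toy_value(words):
    # Globally, captions mentioning "cat" are judged to fit the image.
    return 1.0 if "cat" in words else 0.0


with_lookahead = lookahead_search(toy_policy, toy_value, lam=0.4)
policy_only = lookahead_search(toy_policy, toy_value, lam=1.0)
```

In this toy setup the policy-only search (`lam=1.0`) follows the locally likelier word, while the mixed score lets the value term correct that choice.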


Experimental Results


Results on MS-COCO

[Result tables not recovered; surviving slide annotations:]

simple policy net; simple net structure ①

simple policy net; advanced net structure; semantic attention; spatial attention; object detection

embedding-driven RL; metric-driven RL ③

w/o external training data; with external training data ④


Qualitative results


Take Home

● We proposed a novel decision-making framework for image captioning.
○ An agent model → a policy network + a value network
○ A training method → Reinforcement Learning with embedding reward
○ An inference method → lookahead inference
● Utilizing both global and local information is important for sequential generation tasks.
● Embedding can capture global information and can serve as very good global guidance.

Thank you! Welcome to visit our poster at #9-B

[email protected]