Transcript
Page 1: RL and GAN for Sentence Generation and Chat-bot

RL and GAN for Sentence Generation and Chat-bot

Hung-yi Lee

Page 2: RL and GAN for Sentence Generation and Chat-bot

Outline

• Policy Gradient

• SeqGAN

• Two techniques: Monte Carlo search, rewarding partially decoded sequences

• Experiments: SeqGAN and dialogue

• Original GAN

• MaliGAN

• Gumbel-softmax

Page 3: RL and GAN for Sentence Generation and Chat-bot

Review: Chat-bot

• Sequence-to-sequence learning

[Figure: Encoder → Generator — the input sentence and the history information go into the Encoder; the Generator produces the output sentence.]

Training data:

A: OOO

B: XXX

A: ∆ ∆ ∆

……

Page 4: RL and GAN for Sentence Generation and Chat-bot

Review: Encoder

[Figure: Hierarchical Encoder — each sentence in the dialogue history (e.g., 你好嗎 "How are you?", 我很好 "I am fine") is first encoded; the sentence-level encodings are then passed to the generator.]

Page 5: RL and GAN for Sentence Generation and Chat-bot

Review: Generator

[Figure: the generator RNN, starting from <BOS>, outputs a distribution over the tokens {A, B} at every time step and feeds the generated token back as the next input. The colored block denotes the condition for the decoder, which can be different at each step with an attention mechanism.]

Page 6: RL and GAN for Sentence Generation and Chat-bot

Review: Training Generator

Reference: "A B B"

Minimizing the cross-entropy of each component: $C = \sum_t C_t$

[Figure: starting from <BOS>, the decoder's output distribution at each time step is compared with the corresponding reference token ("A", "B", "B"), giving the per-step cross-entropy losses $C_1, C_2, C_3$. The colored block denotes the condition for the decoder.]

Page 7: RL and GAN for Sentence Generation and Chat-bot

Review: Training Generator

เทœ๐‘ฅ๐‘ก

๐ถ๐‘ก

๐ถ๐‘ก = โˆ’๐‘™๐‘œ๐‘”๐‘ƒ๐œƒ เทœ๐‘ฅ๐‘ก| เทœ๐‘ฅ1:๐‘กโˆ’1, โ„Ž

๐ถ =

๐‘ก

๐ถ๐‘ก

๐ถ = โˆ’

๐‘ก

๐‘™๐‘œ๐‘”๐‘ƒ เทœ๐‘ฅ๐‘ก| เทœ๐‘ฅ1:๐‘กโˆ’1, โ„Ž

Maximizing the likelihood of generating เทœ๐‘ฅ given h

= โˆ’๐‘™๐‘œ๐‘”๐‘ƒ เทœ๐‘ฅ1|โ„Ž ๐‘ƒ เทœ๐‘ฅ๐‘ก| เทœ๐‘ฅ1:๐‘กโˆ’1, โ„Ž

โ‹ฏ๐‘ƒ เทœ๐‘ฅ๐‘‡| เทœ๐‘ฅ1:๐‘‡โˆ’1, โ„Ž= โˆ’๐‘™๐‘œ๐‘”๐‘ƒ เทœ๐‘ฅ|โ„Ž

Training data: โ„Ž, เทœ๐‘ฅ โ„Ž: input sentence and history/context

เทœ๐‘ฅ: correct response (word sequence)

เทœ๐‘ฅ๐‘ก: t-th word, เทœ๐‘ฅ1:๐‘ก: first t words of เทœ๐‘ฅ

โ€ฆโ€ฆ โ€ฆโ€ฆ

เทœ๐‘ฅ๐‘ก+1

๐ถ๐‘ก+1

เทœ๐‘ฅ๐‘กโˆ’1

๐ถ๐‘กโˆ’1

generator output

โ€ฆโ€ฆ โ€ฆโ€ฆ
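As a concrete illustration of this maximum-likelihood (teacher-forcing) objective, here is a minimal PyTorch-style sketch; the `generator` interface and tensor shapes are assumptions for illustration, not the lecture's code.

```python
import torch.nn.functional as F

def mle_loss(generator, h, x_hat):
    """C = -sum_t log P_theta(x_hat_t | x_hat_{1:t-1}, h), averaged over a batch.

    Assumptions: `x_hat` is a (batch, T) tensor of token ids starting with <BOS>,
    and `generator(h, prefix)` returns logits of shape (batch, T-1, vocab) for the
    next token at every position (teacher forcing on the reference prefix).
    """
    logits = generator(h, x_hat[:, :-1])            # predictions for steps 1..T-1
    log_probs = F.log_softmax(logits, dim=-1)       # log P_theta(. | prefix, h)
    targets = x_hat[:, 1:]                          # reference tokens x_hat_t
    picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -picked.sum(dim=1).mean()                # C, averaged over the batch
```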

Page 8: RL and GAN for Sentence Generation and Chat-bot

RL for Sentence Generation

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, "Deep Reinforcement Learning for Dialogue Generation", EMNLP 2016

Page 9: RL and GAN for Sentence Generation and Chat-bot

Introduction

• Machine obtains feedback from user

• Chat-bot learns to maximize the expected reward

[Figure: two example exchanges — "How are you?" → "Bye bye ☺" gets reward −10; "Hello" → "Hi ☺" gets reward 3.]

https://image.freepik.com/free-vector/variety-of-human-avatars_23-2147506285.jpg

http://www.freepik.com/free-vector/variety-of-human-avatars_766615.htm

Page 10: RL and GAN for Sentence Generation and Chat-bot

Maximizing Expected Reward

Maximizing expected reward: $\theta^* = \arg\max_\theta \bar{R}_\theta$

$\bar{R}_\theta = \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h)$

where $P(h)$ is the probability that the input/history is $h$, $P_\theta(x \mid h)$ is the randomness in the generator, and $R(h, x)$ is the reward.

[Figure: Encoder → Generator (parameters $\theta$) maps $h$ to $x$; a Human returns $R(h, x)$, which is used to update $\theta$.]

Page 11: RL and GAN for Sentence Generation and Chat-bot

Maximizing Expected Reward

Maximizing expected reward: $\theta^* = \arg\max_\theta \bar{R}_\theta$

$\bar{R}_\theta = \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h)$
$\;= E_{h \sim P(h)}\big[ E_{x \sim P_\theta(x|h)}[ R(h, x) ] \big]$
$\;= E_{h \sim P(h),\, x \sim P_\theta(x|h)}[ R(h, x) ]$
$\;\approx \frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)$

Sample: $(h^1, x^1), (h^2, x^2), \cdots, (h^N, x^N)$

[Figure: Encoder → Generator ($\theta$) maps $h$ to $x$; Human returns $R(h, x)$; update.]

Where is $\theta$? After sampling, $\theta$ no longer appears explicitly in the approximation, so it cannot be differentiated directly; this motivates the policy gradient that follows.

Page 12: RL and GAN for Sentence Generation and Chat-bot

Policy Gradient

$\nabla \bar{R}_\theta = \sum_h P(h) \sum_x R(h, x)\, \nabla P_\theta(x \mid h)$

$\;= \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h)\, \frac{\nabla P_\theta(x \mid h)}{P_\theta(x \mid h)}$
(using $\frac{d \log f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$)

$\;= \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h)\, \nabla \log P_\theta(x \mid h)$

$\;= E_{h \sim P(h),\, x \sim P_\theta(x|h)}\big[ R(h, x)\, \nabla \log P_\theta(x \mid h) \big]$

$\;\approx \frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$ (sampling)

Compare with the objective itself: $\bar{R}_\theta = \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h) \approx \frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)$.

Page 13: RL and GAN for Sentence Generation and Chat-bot

Policy Gradient

• Gradient ascent:

$\theta^{new} \leftarrow \theta^{old} + \eta\, \nabla \bar{R}_{\theta^{old}}$

$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$

If $R(h^i, x^i)$ is positive: after updating $\theta$, $P_\theta(x^i \mid h^i)$ will increase.

If $R(h^i, x^i)$ is negative: after updating $\theta$, $P_\theta(x^i \mid h^i)$ will decrease.

Page 14: RL and GAN for Sentence Generation and Chat-bot

Implementation

Maximum Likelihood vs. Reinforcement Learning (see the sketch after this comparison):

• Objective function — ML: $\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(\hat{x}^i \mid h^i)$; RL: $\frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \log P_\theta(x^i \mid h^i)$

• Gradient — ML: $\frac{1}{N} \sum_{i=1}^{N} \nabla \log P_\theta(\hat{x}^i \mid h^i)$; RL: $\frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$

• Training data — ML: $\{(h^1, \hat{x}^1), \dots, (h^N, \hat{x}^N)\}$ (correct responses); RL: $\{(h^1, x^1), \dots, (h^N, x^N)\}$, obtained by sampling

In the RL case each sampled example is weighted by $R(h^i, x^i)$; maximum likelihood is the special case where the targets are the correct responses and every weight is $R(h^i, \hat{x}^i) = 1$.

[Figure: Encoder → Generator maps $h^i$ to $x^i$; Human returns $R(h^i, x^i)$.]
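The sketch below shows, under stated assumptions rather than as the lecture's implementation, how the RL objective differs from maximum likelihood only by the reward weight; `sample_response` and `get_reward` are hypothetical helpers returning a sampled response with its sequence log-probability and a scalar reward.

```python
def policy_gradient_loss(generator, histories, sample_response, get_reward):
    """Negative of (1/N) sum_i R(h^i, x^i) * log P_theta(x^i | h^i).

    Minimizing this with any gradient-based optimizer performs gradient
    ascent on the expected-reward objective above.
    """
    total = 0.0
    for h in histories:
        # x^i ~ P_theta(x | h^i); log_prob keeps the computation graph
        x, log_prob = sample_response(generator, h)
        reward = get_reward(h, x)        # R(h^i, x^i), treated as a constant
        total = total + reward * log_prob
    return -total / len(histories)
```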

Page 15: RL and GAN for Sentence Generation and Chat-bot

Implementation

๐œƒ๐‘ก

โ„Ž1, ๐‘ฅ1

โ„Ž2, ๐‘ฅ2

โ„Ž๐‘ , ๐‘ฅ๐‘

โ€ฆโ€ฆ

๐‘… โ„Ž1, ๐‘ฅ1

๐‘… โ„Ž2, ๐‘ฅ2

๐‘… โ„Ž๐‘ , ๐‘ฅ๐‘

โ€ฆโ€ฆ

1

๐‘

๐‘–=1

๐‘

๐‘… โ„Ž๐‘– , ๐‘ฅ๐‘– ๐›ป๐‘™๐‘œ๐‘”๐‘ƒ๐œƒ๐‘ก ๐‘ฅ๐‘–|โ„Ž๐‘–

๐œƒ๐‘ก+1 โ† ๐œƒ๐‘ก + ๐œ‚๐›ป เดค๐‘…๐œƒ๐‘ก

1

๐‘

๐‘–=1

๐‘

๐‘… โ„Ž๐‘– , ๐‘ฅ๐‘– ๐‘™๐‘œ๐‘”๐‘ƒ๐œƒ ๐‘ฅ๐‘–|โ„Ž๐‘–

New Objective:

๐œƒ0 can be well pre-trained from

โ„Ž1, เทœ๐‘ฅ1 , โ€ฆ , โ„Ž๐‘ , เทœ๐‘ฅ๐‘

Page 16: RL and GAN for Sentence Generation and Chat-bot

Add a Baseline

If $R(h^i, x^i)$ is always positive, every term in
$\frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$
pushes the sampled $P_\theta(x^i \mid h^i)$ up.

Ideal case: if all of $(h, x^1), (h, x^2), (h, x^3)$ were covered, their probabilities would simply be re-weighted by their rewards, and because $P_\theta(x \mid h)$ is a probability distribution the better responses still end up relatively more likely.

Due to sampling: a response that happens not to be sampled, e.g. $(h, x^1)$, has its probability pushed down after the update (the others are pushed up and the probabilities must sum to one), even if it is actually a good response.

Page 17: RL and GAN for Sentence Generation and Chat-bot

Add a Baseline

Without a baseline: $\frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$ — if $R(h^i, x^i)$ is always positive, every sampled response is pushed up and the not-sampled ones (e.g. $(h, x^1)$) are pushed down.

Add a baseline $b$: $\frac{1}{N} \sum_{i=1}^{N} \big( R(h^i, x^i) - b \big)\, \nabla \log P_\theta(x^i \mid h^i)$ — responses whose reward is below $b$ now get a negative weight, so $P_\theta(x \mid h)$ is no longer increased for every sampled response.

There are several ways to obtain the baseline b; one simple choice is sketched below.
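As one illustrative choice of baseline (an assumption for this sketch; the lecture only says there are several ways), the batch-average reward can be used:

```python
def baselined_weights(rewards):
    """Turn raw rewards R(h^i, x^i) into centered weights R(h^i, x^i) - b,
    using the batch mean as the baseline b (illustrative choice)."""
    b = sum(rewards) / len(rewards)
    return [r - b for r in rewards]

# Usage sketch: with all-positive rewards, some samples now get a negative
# weight instead of everything being pushed up.
baselined_weights([3.0, 5.0, 1.0])   # -> [0.0, 2.0, -2.0]
```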

Page 18: RL and GAN for Sentence Generation and Chat-bot

AlphaGo-style training!

• Let two agents talk to each other

How old are you?

See you.

See you.

See you.

How old are you?

I am 16.

I thought you were 12.

What makes you think so?

Using a pre-defined evaluation function to compute R(h,x)

Page 19: RL and GAN for Sentence Generation and Chat-bot

Example Reward

• The final reward R(h, x) is the weighted sum of three terms $r_1(h,x)$, $r_2(h,x)$ and $r_3(h,x)$:

$R(h, x) = \lambda_1 r_1(h, x) + \lambda_2 r_2(h, x) + \lambda_3 r_3(h, x)$

Ease of answering — "don't be a conversation killer" (不要成為句點王)

Information flow — "say something new" (說點新鮮的)

Semantic coherence — "don't contradict what was said before" (不要前言不對後語)

Page 20: RL and GAN for Sentence Generation and Chat-bot

Example Results

Page 21: RL and GAN for Sentence Generation and Chat-bot

Reinforcement learning?

[Figure: a video-game example — start with observation $s_1$, take action $a_1$ = "right" and obtain reward $r_1 = 0$; at observation $s_2$, take action $a_2$ = "fire" (kill an alien) and obtain reward $r_2 = 5$; then observation $s_3$, and so on.]

Page 22: RL and GAN for Sentence Generation and Chat-bot

Reinforcement learning?

[Figure: sentence generation viewed as RL — the tokens generated so far (starting from <BOS>) are the observation, the vocabulary {A, B} is the action set, and the action we take influences the observation in the next step. The intermediate rewards are $r = 0$; only at the end is the reward $R(\text{"BAA"}, \text{reference})$ obtained.]

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, "Sequence Level Training with Recurrent Neural Networks", ICLR, 2016

Page 23: RL and GAN for Sentence Generation and Chat-bot

Reinforcement learning?

• One can use any advanced RL technique here.

• For example, actor-critic:

• Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, Yoshua Bengio, "An Actor-Critic Algorithm for Sequence Prediction", ICLR, 2017

Page 24: RL and GAN for Sentence Generation and Chat-bot

SeqGAN

Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient", AAAI, 2017

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, "Adversarial Learning for Neural Dialogue Generation", arXiv preprint, 2017

Page 25: RL and GAN for Sentence Generation and Chat-bot

Basic Idea – Sentence Generation

[Figure: original GAN setup — a code z sampled from a prior distribution is fed to the Generator, which outputs a sentence x; the Discriminator takes a sentence x and outputs real or fake.]

Sampling from the RNN at each time step also provides randomness.

Page 26: RL and GAN for Sentence Generation and Chat-bot

Algorithm – Sentence Generation

• Initialize generator Gen and discriminator Dis

• In each iteration:

• Sample real sentences $x$ from the database

• Generate sentences $\tilde{x}$ by Gen

• Update Dis to increase $Dis(x)$ and decrease $Dis(\tilde{x})$

• Update Gen such that the generated sentences receive a large scalar score from the Discriminator (Generator → Discriminator → scalar; update). A rough sketch of one such iteration follows.
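A minimal sketch of one adversarial iteration, assuming hypothetical `gen.sample(n)` (returns n sentences with their sequence log-probabilities) and `dis(x)` (probability that x is real); this illustrates the loop above, not the paper's code.

```python
import torch

def seqgan_iteration(gen, dis, real_sentences, gen_opt, dis_opt):
    """One iteration: update Dis on real vs. generated data, then update Gen
    by policy gradient with the discriminator score as the reward."""
    # Discriminator: increase Dis(x) on real sentences, decrease Dis(x~) on fakes.
    fakes, _ = gen.sample(len(real_sentences))
    d_loss = -(sum(torch.log(dis(x)) for x in real_sentences)
               + sum(torch.log(1 - dis(x)) for x in fakes))
    dis_opt.zero_grad(); d_loss.backward(); dis_opt.step()

    # Generator: weight each sample's log-probability by (D(x~) - b).
    samples, log_probs = gen.sample(len(real_sentences))
    rewards = [dis(x).detach() for x in samples]    # rewards treated as constants
    b = sum(rewards) / len(rewards)                 # simple batch baseline
    g_loss = -sum((r - b) * lp for r, lp in zip(rewards, log_probs)) / len(samples)
    gen_opt.zero_grad(); g_loss.backward(); gen_opt.step()
    return d_loss.item(), g_loss.item()
```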

Page 27: RL and GAN for Sentence Generation and Chat-bot

Basic Idea – Chat-bot

[Figure: conditional GAN — the Chatbot (an encoder–decoder, "En–De") takes the input sentence/history h and produces the response sentence x; the Discriminator takes both the input sentence/history h and the response sentence x and outputs real or fake, with the real pairs coming from human dialogues.]

http://www.nipic.com/show/3/83/3936650kd7476069.html

Page 28: RL and GAN for Sentence Generation and Chat-bot

Algorithm – Chat-bot

• Initialize generator Gen and discriminator Dis

• In each iteration:

• Sample a real history $h$ and sentence $x$ from the database

• Sample a real history $h'$ from the database, and generate a sentence $\tilde{x}$ by Gen($h'$)

• Update Dis to increase $Dis(h, x)$ and decrease $Dis(h', \tilde{x})$

• Update Gen such that the Chatbot's (En–De) response to $h$ receives a large scalar score from the Discriminator (update); a sketch of the discriminator part follows below.

Training data:

A: OOO

B: XXX

A: ∆ ∆ ∆

……

[Figure: consecutive dialogue turns provide the history h and the response x; Chatbot (En–De) maps h to x; Discriminator(h, x) → scalar; update.]
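A brief sketch of the conditional discriminator update in the loop above; `dis(h, x)` is assumed to return the probability that the (history, response) pair is a real human exchange, and the pair lists are hypothetical placeholders.

```python
import torch

def conditional_dis_loss(dis, real_pairs, fake_pairs):
    """Loss whose minimization increases Dis(h, x) on real pairs and
    decreases Dis(h', x~) on pairs whose response was generated."""
    loss_real = -sum(torch.log(dis(h, x)) for h, x in real_pairs)
    loss_fake = -sum(torch.log(1 - dis(h, x)) for h, x in fake_pairs)
    return loss_real + loss_fake
```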

Page 29: RL and GAN for Sentence Generation and Chat-bot

[Figure: the Chatbot (En–De) generates the response word by word from <BOS>, sampling a token from the output distribution over {A, B} at each step; the sampled sentence is passed, together with the encoded condition, to the Discriminator, whose scalar output is used to update the Chatbot.]

Can we do backpropagation? Because the output words are obtained by sampling discrete tokens, tuning the generator slightly will not change the output, so the discriminator's gradient cannot be propagated back through the generated words.

Alternative: improved WGAN.

Page 30: RL and GAN for Sentence Generation and Chat-bot

Reinforcement Learning

• Consider the output of the discriminator as the reward

• Update the generator to increase the discriminator score, i.e. to get the maximum reward

• Different from typical RL: the discriminator itself is also updated

$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^{N} \big( R(h^i, x^i) - b \big)\, \nabla \log P_\theta(x^i \mid h^i)$

with the reward set to the discriminator score: $R(h^i, x^i) = D(h^i, x^i)$.

[Figure: Chatbot (En–De) → Discriminator → scalar; update.]

Page 31: RL and GAN for Sentence Generation and Chat-bot

Reward for Every Generation Step

$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^{N} \big( D(h^i, x^i) - b \big)\, \nabla \log P_\theta(x^i \mid h^i)$

$\log P_\theta(x^i \mid h^i) = \log P(x^i_1 \mid h^i) + \log P(x^i_2 \mid h^i, x^i_1) + \log P(x^i_3 \mid h^i, x^i_{1:2})$

Case 1: $h^i$ = "What is your name?", $x^i$ = "I don't know". $D(h^i, x^i) - b$ is negative, so the update decreases $\log P_\theta(x^i \mid h^i)$ — including the term $P(\text{"I"} \mid h^i)$.

Case 2: $h^i$ = "What is your name?", $x^i$ = "I am John". $D(h^i, x^i) - b$ is positive, so the update increases $\log P_\theta(x^i \mid h^i)$ — including the same term $P(\text{"I"} \mid h^i)$.

The whole-sentence reward thus pushes the probability of the first word "I" in opposite directions depending on the rest of the sentence, even though "I" itself is neither good nor bad.

Page 32: RL and GAN for Sentence Generation and Chat-bot

Reward for Every Generation Step

$\log P_\theta(x^i \mid h^i) = \log P(x^i_1 \mid h^i) + \log P(x^i_2 \mid h^i, x^i_1) + \log P(x^i_3 \mid h^i, x^i_{1:2})$

$h^i$ = "What is your name?", $x^i$ = "I don't know":
$P(\text{"I"} \mid h^i)$, $P(\text{"don't"} \mid h^i, \text{"I"})$, $P(\text{"know"} \mid h^i, \text{"I don't"})$

Instead of weighting every step by the same sentence-level score, give a reward to every generation step:

$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \big( Q(h^i, x^i_{1:t}) - b \big)\, \nabla \log P_\theta(x^i_t \mid h^i, x^i_{1:t-1})$

Method 1. Monte Carlo (MC) Search

Method 2. Discriminator for Partially Decoded Sequences

Page 33: RL and GAN for Sentence Generation and Chat-bot

Monte Carlo Search

• How to estimate $Q(h^i, x^i_{1:t})$? For example, $Q(\text{"What is your name?"}, \text{"I"})$, with $h^i$ = "What is your name?" and $x^i_1$ = "I".

A roll-out generator is needed to sample complete sentences that start with the partial sequence (a sketch follows):

$x^A$ = "I am John", $D(h^i, x^A) = 1.0$

$x^B$ = "I am happy", $D(h^i, x^B) = 0.1$

$x^C$ = "I don't know", $D(h^i, x^C) = 0.1$

$x^D$ = "I am superman", $D(h^i, x^D) = 0.8$

Averaging the discriminator scores: $Q(h^i, \text{"I"}) = 0.5$.
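A small sketch of this Monte Carlo estimate, assuming a hypothetical `rollout_gen.complete(h, prefix)` that finishes the partial response and a discriminator `dis(h, x)` returning a score in [0, 1]:

```python
def mc_search_q(h, prefix, rollout_gen, dis, n_rollouts=4):
    """Estimate Q(h, x_{1:t}) by completing the prefix several times and
    averaging the discriminator scores of the full sentences."""
    scores = []
    for _ in range(n_rollouts):
        full = rollout_gen.complete(h, prefix)   # e.g. "I am John", "I am happy", ...
        scores.append(dis(h, full))              # D(h^i, x^A), D(h^i, x^B), ...
    return sum(scores) / len(scores)

# With the four roll-outs scored 1.0, 0.1, 0.1 and 0.8 as on the slide,
# the estimate is Q(h, "I") = 0.5.
```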

Page 34: RL and GAN for Sentence Generation and Chat-bot

Rewarding Partially Decoded Sequences

• Train a discriminator that is able to assign rewards to both fully and partially decoded sequences

• Break the generated sequences into partial sequences (see the sketch below), e.g.:

h = "What is your name?", x = "I am John" → (h, "I"), (h, "I am"), (h, "I am John")

h = "What is your name?", x = "I don't know" → (h, "I"), (h, "I don't"), (h, "I don't know")

The discriminator Dis takes (h, x) and outputs a scalar, so $Q(h, x_{1:t})$ is obtained directly as Dis(h, $x_{1:t}$).

Page 35: RL and GAN for Sentence Generation and Chat-bot

Teacher Forcing

• The training of the generative model is unstable

• The reward is only used to promote or discourage the generator's own generated sequences

• Usually it knows that the generated results are bad, but it does not know which results are good

• Teacher forcing: add real data to the training set (a sketch follows below)

Training data for SeqGAN: $\{(h^1, x^1), \dots, (h^N, x^N)\}$, obtained by sampling and weighted by $D(h^i, x^i)$

Adding more data: $\{(h^1, \hat{x}^1), \dots, (h^N, \hat{x}^N)\}$ — real data, treated as if $D(h^i, \hat{x}^i) = 1$
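A minimal sketch of mixing the two data sources, assuming each training example is stored as a (history, response, weight) triple; the helper and its interface are hypothetical:

```python
def build_training_batch(sampled_pairs, real_pairs, dis):
    """Combine SeqGAN samples (weighted by the discriminator score) with
    teacher-forcing data (weight fixed to 1, i.e. D(h, x_hat) := 1)."""
    batch = [(h, x, dis(h, x)) for h, x in sampled_pairs]   # weight = D(h^i, x^i)
    batch += [(h, x_hat, 1.0) for h, x_hat in real_pairs]   # weight = 1 for real data
    return batch
```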

Page 36: RL and GAN for Sentence Generation and Chat-bot

Experiments in paper

• Sentence generation: synthetic data

• Given an LSTM

• Use the LSTM to generate a lot of sequences as "real data"

• The generator learns from the "real data" by different approaches

• The generator then generates some sequences

• Use the given LSTM to compute the negative log-likelihood (NLL) of those sequences

• Smaller is better

Page 37: RL and GAN for Sentence Generation and Chat-bot

Experiments in paper – Synthetic data

Page 38: RL and GAN for Sentence Generation and Chat-bot
Page 39: RL and GAN for Sentence Generation and Chat-bot

Experiments in paper – Real data

Page 40: RL and GAN for Sentence Generation and Chat-bot

Results - Chat-bot

Page 41: RL and GAN for Sentence Generation and Chat-bot

To Learn More …

Page 42: RL and GAN for Sentence Generation and Chat-bot

Algorithm – MaliGAN (Maximum-Likelihood Augmented Discrete GAN)

• Initialize generator Gen and discriminator Dis

• In each iteration:

• Sample real sentences $x$ from the database

• Generate sentences $\tilde{x}$ by Gen

• Update Dis to maximize $\sum_{x} \log D(x) + \sum_{\tilde{x}} \log\big(1 - D(\tilde{x})\big)$

• Update Gen by the gradient
$\frac{1}{N} \sum_{i=1}^{N} \Big( \frac{r_D(\tilde{x}^i)}{\sum_{j=1}^{N} r_D(\tilde{x}^j)} - b \Big)\, \nabla \log P_\theta(\tilde{x}^i)$, where $r_D(x) = \frac{D(x)}{1 - D(x)}$

(When conditioned on a history, the discriminator score is $D(h^i, x^i)$.) A sketch of the weight computation follows.
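A short sketch of the normalized importance weights in the MaliGAN generator update; the baseline b is omitted and the input is just a list of discriminator scores for the generated batch:

```python
def maligan_weights(d_scores):
    """Compute r_D(x^i) / sum_j r_D(x^j) with r_D(x) = D(x) / (1 - D(x));
    each generated sentence's log-probability gradient is weighted by this."""
    r = [d / (1.0 - d) for d in d_scores]
    total = sum(r)
    return [ri / total for ri in r]

maligan_weights([0.9, 0.5, 0.1])   # higher D(x) -> larger weight
```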

Page 43: RL and GAN for Sentence Generation and Chat-bot

To learn more ……

• Professor forcing

• Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio, "Professor Forcing: A New Algorithm for Training Recurrent Networks", NIPS, 2016

• Handling discrete output by methods other than policy gradient: MaliGAN, Boundary-seeking GAN

• Yizhe Zhang, Zhe Gan, Lawrence Carin, "Generating Text via Adversarial Training", Workshop on Adversarial Training, NIPS, 2016

• Matt J. Kusner, José Miguel Hernández-Lobato, "GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution", arXiv preprint, 2016

