RL and GAN for Sentence Generation and Chat-bot
Hung-yi Lee
Outline
• Policy Gradient
• SeqGAN
  • Two techniques: Monte Carlo search, rewards for partially decoded sequences
  • Experiments: SeqGAN and dialogue
• Original GAN
• MaliGAN
• Gumbel-softmax
Review: Chat-bot
• Sequence-to-sequence learning
[Figure: the encoder reads the input sentence together with the history information; the generator produces the output sentence]
Training data:
A: OOO
B: XXX
A: △△△
……
Review: Encoder
[Figure: hierarchical encoder. Each sentence in the dialogue history (e.g. "嗨" (Hi), "你好" (Hello)) is first encoded by a sentence-level encoder, and a higher-level encoder summarizes the sentence vectors into one vector passed to the generator]
Review: Generator
[Figure: RNN generator. Starting from <BOS>, each time step outputs a distribution over the vocabulary (here {A, B}); a word is sampled and fed back as the next input. The conditioning can be different at each step with an attention mechanism; shaded block: condition from the encoder]
Review: Training Generator
Reference: the correct response $\hat{x}$ (a word sequence over the vocabulary, here {A, B})
$C = \sum_t C_t$
Minimizing the cross-entropy of each component
[Figure: at each time step the generator's output distribution is compared with the corresponding reference word, giving the per-step cross-entropy losses $C_1, C_2, C_3$; shaded block: condition from the encoder]
Review: Training Generator
$C_t = -\log P(\hat{x}_t \mid \hat{x}_{1:t-1}, h)$
$C = \sum_t C_t = -\sum_t \log P(\hat{x}_t \mid \hat{x}_{1:t-1}, h)$
$\;\;= -\log \big[ P(\hat{x}_1 \mid h)\, P(\hat{x}_2 \mid \hat{x}_1, h) \cdots P(\hat{x}_T \mid \hat{x}_{1:T-1}, h) \big] = -\log P(\hat{x} \mid h)$
Minimizing $C$ therefore maximizes the likelihood of generating $\hat{x}$ given $h$.
Training data: $(h, \hat{x})$
• $h$: input sentence and history/context
• $\hat{x}$: correct response (word sequence); $\hat{x}_t$: its $t$-th word; $\hat{x}_{1:t}$: its first $t$ words
[Figure: the generator outputs $\hat{x}_{t-1}, \hat{x}_t, \hat{x}_{t+1}, \ldots$, each scored by its cross-entropy term $C_{t-1}, C_t, C_{t+1}, \ldots$]
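To make the objective concrete, here is a minimal PyTorch sketch of one maximum-likelihood step; the GRU generator, vocabulary size, and <BOS> id 0 are hypothetical placeholders, not the implementation used in the lecture.

```python
import torch
import torch.nn as nn

# Hypothetical generator: an RNN language model conditioned on an
# encoding h0 of the history/context h (a single vector for simplicity).
class Generator(nn.Module):
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h0):
        # x: (batch, T) token ids; h0: (1, batch, hidden) encoder state
        o, _ = self.rnn(self.embed(x), h0)
        return self.out(o)  # (batch, T, vocab_size) logits

def mle_step(gen, opt, h0, x_hat, bos_id=0):
    """Minimize C = sum_t C_t, the cross-entropy of each output word
    against the reference x_hat, i.e. maximize log P_theta(x_hat | h)."""
    bos = torch.full((x_hat.size(0), 1), bos_id, dtype=torch.long)
    inp = torch.cat([bos, x_hat[:, :-1]], dim=1)  # teacher forcing
    logits = gen(inp, h0)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), x_hat.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```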
RL for Sentence Generation
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky,
"Deep Reinforcement Learning for Dialogue Generation", EMNLP 2016
Introduction
• Machine obtains feedback from user
• Chat-bot learns to maximize the expected reward
[Figure: two exchanges with a user. "How are you?" answered with "Bye bye ☺" receives reward −10; "Hello" answered with "Hi ☺" receives reward +3]
Maximizing Expected Reward
$\theta^* = \arg\max_\theta \bar{R}_\theta$
$\bar{R}_\theta = \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h)$
• $P(h)$: probability that the input/history is $h$
• $P_\theta(x \mid h)$: the randomness in the generator
[Figure: the encoder and generator map $h$ to $x$; a human returns $R(h, x)$, which is used to update the generator to maximize the expected reward]
Maximizing Expected Reward
$\theta^* = \arg\max_\theta \bar{R}_\theta$
$\bar{R}_\theta = \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h) = E_{h \sim P(h),\, x \sim P_\theta(x|h)}\big[ R(h, x) \big]$
$\;\;\approx \frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)$
Sample: $(h^1, x^1), (h^2, x^2), \cdots, (h^N, x^N)$
[Figure: the encoder and generator produce $x$ from $h$; a human returns $R(h, x)$; the parameters are updated]
Where is $\theta$? After the sampling approximation, $\theta$ no longer appears in the expression, so it is not obvious how to take a gradient with respect to it.
Policy Gradient
$\nabla \bar{R}_\theta = \sum_h P(h) \sum_x R(h, x)\, \nabla P_\theta(x \mid h)$
$\;\;= \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h)\, \frac{\nabla P_\theta(x \mid h)}{P_\theta(x \mid h)}$   (using $\frac{d \log f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$)
$\;\;= \sum_h P(h) \sum_x R(h, x)\, P_\theta(x \mid h)\, \nabla \log P_\theta(x \mid h)$
$\;\;= E_{h \sim P(h),\, x \sim P_\theta(x|h)}\big[ R(h, x)\, \nabla \log P_\theta(x \mid h) \big]$
$\;\;\approx \frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$   (sampling)
Unlike $\bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)$, this sampled estimate still contains $\theta$, so it can be used for updates.
Policy Gradient
• Gradient ascent: $\theta^{new} \leftarrow \theta^{old} + \eta\, \nabla \bar{R}_{\theta^{old}}$
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$
• If $R(h^i, x^i)$ is positive: after updating $\theta$, $P_\theta(x^i \mid h^i)$ will increase.
• If $R(h^i, x^i)$ is negative: after updating $\theta$, $P_\theta(x^i \mid h^i)$ will decrease.
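A minimal sketch of this update (REINFORCE), reusing the hypothetical Generator from the earlier sketch; reward_fn stands in for the human's R(h, x) and is an assumption, as is the <BOS> id of 0.

```python
import torch

def policy_gradient_step(gen, opt, h0, reward_fn, max_len=20):
    """One policy-gradient update: sample x ~ P_theta(x|h), then
    ascend (1/N) sum_i R(h^i, x^i) * grad log P_theta(x^i | h^i)."""
    inp = torch.zeros(h0.size(1), 1, dtype=torch.long)  # <BOS> id 0
    hidden, log_prob, tokens = h0, 0.0, []
    for _ in range(max_len):
        o, hidden = gen.rnn(gen.embed(inp), hidden)
        dist = torch.distributions.Categorical(logits=gen.out(o[:, -1]))
        x_t = dist.sample()                        # (batch,) word ids
        log_prob = log_prob + dist.log_prob(x_t)   # accumulates log P_theta
        tokens.append(x_t)
        inp = x_t.unsqueeze(1)
    x = torch.stack(tokens, dim=1)                 # sampled sentences
    R = reward_fn(h0, x).detach()                  # (batch,) rewards R(h, x)
    loss = -(R * log_prob).mean()                  # minimizing ascends R_bar
    opt.zero_grad(); loss.backward(); opt.step()
    return R.mean().item()
```

A positive sampled reward scales up the gradient that raises $\log P_\theta(x^i \mid h^i)$; a negative one flips its sign, matching the two cases above.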
Implementation
                       Maximum Likelihood                                              Reinforcement Learning
Training data:         $(h^1, \hat{x}^1), \ldots, (h^N, \hat{x}^N)$                    $(h^1, x^1), \ldots, (h^N, x^N)$ (obtained by sampling)
Objective function:    $\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(\hat{x}^i \mid h^i)$  $\frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i) \log P_\theta(x^i \mid h^i)$
Gradient:              $\frac{1}{N} \sum_{i=1}^{N} \nabla \log P_\theta(\hat{x}^i \mid h^i)$  $\frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i) \nabla \log P_\theta(x^i \mid h^i)$
Maximum likelihood is the special case that treats every training pair as if $R(h^i, \hat{x}^i) = 1$; reinforcement learning uses sampled pairs as training data, weighted by $R(h^i, x^i)$.
[Figure: the encoder and generator produce $x^i$ from $h^i$; a human provides $R(h^i, x^i)$]
Implementation
In each iteration $t$, sample with the current parameters $\theta^t$ and obtain rewards:
$(h^1, x^1)$: $R(h^1, x^1)$
$(h^2, x^2)$: $R(h^2, x^2)$
……
$(h^N, x^N)$: $R(h^N, x^N)$
$\nabla \bar{R}_{\theta^t} \approx \frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_{\theta^t}(x^i \mid h^i)$
$\theta^{t+1} \leftarrow \theta^t + \eta\, \nabla \bar{R}_{\theta^t}$
New objective: $\frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i) \log P_\theta(x^i \mid h^i)$
$\theta^0$ can be well pre-trained from $(h^1, \hat{x}^1), \ldots, (h^N, \hat{x}^N)$.
Add a Baseline
$\frac{1}{N} \sum_{i=1}^{N} R(h^i, x^i)\, \nabla \log P_\theta(x^i \mid h^i)$
If $R(h^i, x^i)$ is always positive, every sampled pair has its probability pushed up:
• Ideal case: the probabilities of $(h, x^1), (h, x^2), (h, x^3)$ all rise, each in proportion to its reward.
• Due to sampling: because $P_\theta(x \mid h)$ is a probability distribution and must sum to one, a pair that happens not to be sampled, e.g. $(h, x^1)$, has its probability pushed down even if it is good.
[Figure: bar charts of $P_\theta(x \mid h)$ over $(h, x^1), (h, x^2), (h, x^3)$, comparing the ideal case with the case where $(h, x^1)$ is not sampled]
Add a Baseline
Add the baseline $b$:
$\frac{1}{N} \sum_{i=1}^{N} \big( R(h^i, x^i) - b \big)\, \nabla \log P_\theta(x^i \mid h^i)$
Now only samples with reward above $b$ have their probability increased, so an unsampled pair such as $(h, x^1)$ is no longer systematically suppressed just because $R$ is always positive.
There are several ways to obtain the baseline $b$; one simple option is sketched below.
[Figure: the same bar charts of $P_\theta(x \mid h)$ with the baseline applied]
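Shown as a drop-in change to the policy-gradient sketch above, one simple choice for b is the batch-mean reward; this particular choice is an assumption, not necessarily the one used in the paper.

```python
import torch

def advantage(R: torch.Tensor) -> torch.Tensor:
    """R(h^i, x^i) - b, using the batch-mean reward as the baseline b,
    one of the several possible choices mentioned above."""
    return R - R.mean()

# In policy_gradient_step, the loss line becomes:
#   loss = -(advantage(R) * log_prob).mean()
```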
AlphaGo-style training!
• Let two agents talk to each other
A bad dialogue gets stuck in a loop:
  "How old are you?" / "See you." / "See you." / "See you."
A good dialogue keeps going:
  "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"
• Use a pre-defined evaluation function to compute R(h, x)
Example Reward
• The final reward R(h, x) is the weighted sum of three terms r1(h, x), r2(h, x) and r3(h, x):
$R(h, x) = \lambda_1 r_1(h, x) + \lambda_2 r_2(h, x) + \lambda_3 r_3(h, x)$
• $r_1$: Ease of answering (don't be a conversation killer)
• $r_2$: Information flow (say something new)
• $r_3$: Semantic coherence (don't contradict what was said earlier)
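As a sketch, the combination is a plain weighted sum; the λ values and the three sub-reward callables below are placeholders, not the paper's exact definitions.

```python
def total_reward(h, x, r1, r2, r3, lambdas=(0.25, 0.25, 0.5)):
    """R(h, x) = l1*r1 + l2*r2 + l3*r3, where r1 scores ease of
    answering, r2 information flow, and r3 semantic coherence.
    The default weights are illustrative assumptions."""
    l1, l2, l3 = lambdas
    return l1 * r1(h, x) + l2 * r2(h, x) + l3 * r3(h, x)
```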
Example Results
Reinforcement learning?
[Figure: a video game example. Start with observation $s_1$; taking action $a_1$ ("right") obtains reward $r_1 = 0$ and leads to observation $s_2$; taking action $a_2$ ("fire", killing an alien) obtains reward $r_2 = 5$ and leads to observation $s_3$]
Reinforcement learning?
[Figure: sentence generation viewed as RL. The words generated so far are the observation; the vocabulary (here {A, B}) is the action set; the action taken (the word generated) influences the observation in the next step, starting from <BOS>]
The reward is sparse: $r = 0$ at the intermediate steps, and only when the sentence is complete do we obtain the reward R("BAA", reference).
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, "Sequence Level Training with Recurrent Neural Networks", ICLR, 2016
Reinforcement learning?
• One can use any advanced RL technique here, for example actor-critic:
  • Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, Yoshua Bengio, "An Actor-Critic Algorithm for Sequence Prediction", ICLR, 2017
SeqGAN
Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient", AAAI, 2017
Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, "Adversarial Learning for Neural Dialogue Generation", arXiv preprint, 2017
Basic Idea – Sentence Generation
[Figure: the generator produces a sentence x; the discriminator takes a sentence x and outputs real or fake]
• In the original GAN, randomness comes from a code z sampled from a prior distribution; here, sampling from the RNN at each time step also provides randomness.
Algorithm – Sentence Generation
• Initialize generator Gen and discriminator Dis
• In each iteration:
  • Sample real sentences $x$ from the database
  • Generate sentences $\tilde{x}$ by Gen
  • Update Dis to increase $Dis(x)$ and decrease $Dis(\tilde{x})$
  • Update Gen such that the scalar output of the discriminator on the generated sentences becomes larger (one iteration is sketched below)
[Figure: the generator feeds its sentence to the discriminator, which outputs a scalar used to update the generator]
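A sketch of one such iteration in PyTorch. Here sample_fn is assumed to draw sentences from Gen and return them with their summed log-probabilities (as in the policy-gradient sketch above), and dis is assumed to output a probability in (0, 1); because the sampled words are discrete, the generator update already has to use the discriminator score as a reward, as discussed on the following slides.

```python
import torch

def gan_iteration(gen, dis, gen_opt, dis_opt, real_x, sample_fn):
    fake_x, log_prob = sample_fn(gen)            # x_tilde ~ Gen
    # Update Dis: increase Dis(x) on real data, decrease Dis(x_tilde)
    d_loss = -(torch.log(dis(real_x)).mean() +
               torch.log(1 - dis(fake_x)).mean())
    dis_opt.zero_grad(); d_loss.backward(); dis_opt.step()
    # Update Gen so the discriminator scores its sentences higher;
    # the score acts as a reward in a policy-gradient step, since no
    # gradient flows through the discrete sampled words.
    reward = dis(fake_x).detach()                # (batch,) scalars
    g_loss = -(reward * log_prob).mean()
    gen_opt.zero_grad(); g_loss.backward(); gen_opt.step()
```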
Basic Idea – Chat-bot
[Figure: the discriminator takes the input sentence/history h together with the response sentence x and outputs real or fake]
• The discriminator is trained with human dialogues as the real examples.
[Figure: the chatbot (encoder En + decoder De) maps the input sentence/history h to the response sentence x; this is a conditional GAN]
Algorithm – Chat-bot
• Initialize generator Gen and discriminator Dis
• In each iteration:
  • Sample a real history $h$ and sentence $x$ from the database
  • Sample a real history $h'$ from the database, and generate a sentence $\tilde{x}$ by Gen($h'$)
  • Update Dis to increase $Dis(h, x)$ and decrease $Dis(h', \tilde{x})$
  • Update Gen such that the discriminator's scalar output on the generated responses becomes larger
Training data:
A: OOO
B: XXX
A: △△△
……
[Figure: (h, x) pairs are taken from the dialogues; the chatbot (En, De) generates responses that are fed together with h to the discriminator]
Can we do backpropagation?
[Figure: the chatbot (En, De) starts from <BOS> and samples a word (here A or B) at each step; the sampled sentence goes to the discriminator, whose scalar output is used to update the generator]
• The sampled words are discrete, so slightly tuning the generator will not change the output, and no gradient can flow from the discriminator back into the generator.
• Alternative: improved WGAN
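The obstacle is visible in a few lines of code (an illustration added here, not from the slides): a sampled token is a plain integer with no gradient history.

```python
import torch

logits = torch.randn(1, 5, requires_grad=True)   # generator output
token = torch.distributions.Categorical(logits=logits).sample()
print(token.dtype)          # torch.int64 -- a discrete word id
print(token.requires_grad)  # False -- the chain back to `logits` is
                            # cut, so a discriminator applied to
                            # `token` cannot update the generator
```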
Reinforcement Learning
• Consider the output of the discriminator as the reward
  • Update the generator to increase the discriminator's output, i.e. to obtain the maximum reward
  • The discriminator score $D(h^i, x^i)$ plays the role of the human-given reward:
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^{N} \big( D(h^i, x^i) - b \big)\, \nabla \log P_\theta(x^i \mid h^i)$
• Different from typical RL: the discriminator itself is updated during training, so the reward function keeps changing.
[Figure: the chatbot (En, De) generates a response; the discriminator's scalar output is the reward used to update the chatbot]
Reward for Every Generation Step
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^{N} \big( D(h^i, x^i) - b \big)\, \nabla \log P_\theta(x^i \mid h^i)$
$\log P_\theta(x^i \mid h^i) = \log P(x_1^i \mid h^i) + \log P(x_2^i \mid h^i, x_1^i) + \log P(x_3^i \mid h^i, x_{1:2}^i)$
• $h^i$ = "What is your name?", $x^i$ = "I don't know"
  • $D(h^i, x^i) - b$ is negative, so we update $\theta$ to decrease $\log P_\theta(x^i \mid h^i)$, including $P(\text{"I"} \mid h^i)$
• $h^i$ = "What is your name?", $x^i$ = "I am John"
  • $D(h^i, x^i) - b$ is positive, so we update $\theta$ to increase $\log P_\theta(x^i \mid h^i)$, including $P(\text{"I"} \mid h^i)$
The first word "I" is not itself good or bad, yet a sentence-level reward pushes its probability up or down depending on the rest of the sentence; with limited sampling this signal is noisy.
Reward for Every Generation Step
Assign a reward to every generation step instead of only to the whole sentence:
$\log P_\theta(x^i \mid h^i) = \log P(x_1^i \mid h^i) + \log P(x_2^i \mid h^i, x_1^i) + \log P(x_3^i \mid h^i, x_{1:2}^i)$
$h^i$ = "What is your name?", $x^i$ = "I don't know":
$\quad P(\text{"I"} \mid h^i)\; P(\text{"don't"} \mid h^i, \text{"I"})\; P(\text{"know"} \mid h^i, \text{"I don't"})$
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \big( Q(h^i, x_{1:t}^i) - b \big)\, \nabla \log P_\theta(x_t^i \mid h^i, x_{1:t-1}^i)$
• Method 1: Monte Carlo (MC) search
• Method 2: a discriminator for partially decoded sequences
Monte Carlo Search
• How to estimate $Q(h^i, x_{1:t}^i)$?
• Example: $Q(\text{"What is your name?"}, \text{"I"})$. Keep the prefix $x_1^i$ = "I" and let a roll-out generator sample several completions, then score each with the discriminator:
  $x^A$ = "I am John": $D(h^i, x^A) = 1.0$
  $x^B$ = "I am happy": $D(h^i, x^B) = 0.1$
  $x^C$ = "I don't know": $D(h^i, x^C) = 0.1$
  $x^D$ = "I am superman": $D(h^i, x^D) = 0.8$
• Average the scores: $Q(h^i, \text{"I"}) = 0.5$
• A roll-out generator for sampling is needed (sketched below).
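A sketch of the estimate, assuming a hypothetical rollout.complete(h, prefix, max_len) sampling method and a discriminator callable dis(h, x) that returns a scalar tensor; neither name comes from the paper.

```python
import torch

def mc_q_value(rollout, dis, h, prefix, M=4, max_len=20):
    """Estimate Q(h, x_{1:t}) by letting a roll-out generator complete
    the prefix M times and averaging the discriminator scores."""
    scores = [dis(h, rollout.complete(h, prefix, max_len))  # assumed API
              for _ in range(M)]
    return torch.stack(scores).mean()
```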
Rewarding Partially Decoded Sequences
• Train a discriminator that is able to assign rewards to both fully and partially decoded sequences
• Break the generated sequences into partial sequences, e.g.:
  h = "What is your name?", x = "I"
  h = "What is your name?", x = "I don't"
  h = "What is your name?", x = "I don't know"
  h = "What is your name?", x = "I"
  h = "What is your name?", x = "I am"
  h = "What is your name?", x = "I am john"
• The discriminator then directly provides $Q(h, x_{1:t}) = Dis(h, x_{1:t})$.
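Building the discriminator's training examples from prefixes is mechanical; a sketch, with token lists standing in for whatever representation the discriminator actually consumes.

```python
def partial_examples(h, x_tokens):
    """Expand one (h, x) pair into one example per prefix of x, so the
    discriminator also learns to score partially decoded sequences."""
    return [(h, x_tokens[:t]) for t in range(1, len(x_tokens) + 1)]

# partial_examples("What is your name?", ["I", "don't", "know"])
# -> [(h, ["I"]), (h, ["I", "don't"]), (h, ["I", "don't", "know"])]
```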
Teacher Forcing
• The training of the generative model is unstable
  • The reward is used to promote or discourage the generator's own generated sequences.
  • Usually it knows that the generated results are bad, but does not know what results are good.
• Teacher forcing: add more data
  • Training data for SeqGAN: $(h^1, x^1), \ldots, (h^N, x^N)$, obtained by sampling and weighted by $D(h^i, x^i)$
  • Real data: $(h^1, \hat{x}^1), \ldots, (h^N, \hat{x}^N)$, treated as having $D(h^i, \hat{x}^i) = 1$
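A sketch of assembling such a mixed batch; dis(h, x) is assumed to return the discriminator score used as the example weight.

```python
def teacher_forcing_batch(sampled, real, dis):
    """Mix generator samples, weighted by D(h, x), with real pairs
    (h, x_hat) at a fixed weight of 1, as if D(h, x_hat) = 1."""
    batch = [(h, x, dis(h, x)) for h, x in sampled]
    batch += [(h, x_hat, 1.0) for h, x_hat in real]
    return batch
```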
Experiments in paper
• Sentence generation: synthetic data
  • Given an LSTM
  • Use the LSTM to generate a lot of sequences as the "real data"
  • The generator learns from the "real data" by different approaches
  • The generator then generates some sequences
  • Use the original LSTM to compute the negative log-likelihood (NLL) of those sequences
  • Smaller is better
Experiments in paper – Synthetic data
Experiments in paper – Real data
Results – Chat-bot
To Learn More …
Algorithm – MaliGAN
(Maximum-Likelihood Augmented Discrete GAN)
• Initialize generator Gen and discriminator Dis
• In each iteration:
  • Sample real sentences $x$ from the database
  • Generate sentences $\tilde{x}$ by Gen
  • Update Dis to maximize $\sum_x \log D(x) + \sum_{\tilde{x}} \log\big(1 - D(\tilde{x})\big)$
  • Update Gen by the gradient (weights sketched below)
    $\frac{1}{N} \sum_{i=1}^{N} \left( \frac{r_D(x^i)}{\sum_{j=1}^{N} r_D(x^j)} - b \right) \nabla \log P_\theta(x^i)$, where $r_D(x^i) = \frac{D(x^i)}{1 - D(x^i)}$ and the $x^i$ are sampled from Gen
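A sketch of the generator-side weights, assuming d_scores is a tensor of discriminator outputs D(x^i) in (0, 1); the baseline handling is simplified here.

```python
import torch

def maligan_weights(d_scores: torch.Tensor, b: float = 0.0) -> torch.Tensor:
    """Normalized importance weights r_D(x^i) / sum_j r_D(x^j) - b,
    with r_D(x) = D(x) / (1 - D(x)); these multiply grad log P_theta(x^i)."""
    r = d_scores / (1 - d_scores)
    return r / r.sum() - b
```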
To learn more ……
• Professor forcing
  • Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio, "Professor Forcing: A New Algorithm for Training Recurrent Networks", NIPS, 2016
• Handling discrete output by methods other than policy gradient: MaliGAN, Boundary-seeking GAN
  • Yizhe Zhang, Zhe Gan, Lawrence Carin, "Generating Text via Adversarial Training", Workshop on Adversarial Training, NIPS, 2016
  • Matt J. Kusner, José Miguel Hernández-Lobato, "GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution", arXiv preprint, 2016