RL and GAN for Sentence Generation and Chat-bot


Hung-yi Lee

Outline

• Policy Gradient

• SeqGAN

• Two techniques: Monte Carlo search, discriminator for partially decoded sequences

• Experiments: SeqGAN and dialogue

• Original GAN

• MaliGAN

• Gumbel-softmax

Review: Chat-bot

• Sequence-to-sequence learning

[Figure: Encoder + Generator; the encoder reads the input sentence and the history information, and the generator produces the output sentence]

Training data (dialogues):

A: OOO

B: XXX

A: ∆ ∆ ∆

……

Review: Encoder

[Figure: Hierarchical Encoder; e.g., the previous turns "你好嗎" (How are you?) and "我很好" (I'm fine) are encoded hierarchically and passed to the generator]

Review: Generator

[Figure: the generator (RNN decoder) starts from <BOS> and generates one word at a time (here the vocabulary is {A, B}); each output word is fed back as the next input, and the condition from the encoder can be different at each step with an attention mechanism]

Review: Training Generator

Reference: the correct (target) word sequence

$C = \sum_t C_t$

Minimizing the cross-entropy of each component.

[Figure: starting from <BOS>, the decoder outputs a distribution at each step; $C_1, C_2, C_3$ are the cross-entropy terms between these distributions and the reference words, given the condition from the encoder]

Review: Training Generator

$C_t = -\log P_\theta(\hat{x}_t \mid \hat{x}_{1:t-1}, h)$

$C = \sum_t C_t = -\sum_t \log P_\theta(\hat{x}_t \mid \hat{x}_{1:t-1}, h)$

$\quad = -\log\big[ P_\theta(\hat{x}_1 \mid h) \cdots P_\theta(\hat{x}_t \mid \hat{x}_{1:t-1}, h) \cdots P_\theta(\hat{x}_T \mid \hat{x}_{1:T-1}, h) \big] = -\log P_\theta(\hat{x} \mid h)$

Minimizing $C$ is maximizing the likelihood of generating $\hat{x}$ given $h$.

Training data: $(h, \hat{x})$

$h$: input sentence and history/context

$\hat{x}$: correct response (word sequence)

$\hat{x}_t$: the $t$-th word of $\hat{x}$; $\hat{x}_{1:t}$: the first $t$ words of $\hat{x}$

[Figure: at each step $t$, the generator output is compared with $\hat{x}_t$ to compute $C_t$ (similarly $C_{t-1}$, $C_{t+1}$, ...)]
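To make the maximum-likelihood objective concrete, here is a minimal sketch of one training step in PyTorch, with stand-in tensors in place of a real encoder-decoder (these shapes and names are assumptions, not from the slides); the loss is the summed per-step cross-entropy, i.e. $-\log P_\theta(\hat{x} \mid h)$.

```python
import torch
import torch.nn as nn

# Minimal sketch: stand-in logits take the place of a real encoder-decoder's
# per-step outputs P_theta( . | x_hat_{1:t-1}, h).
vocab_size, T, batch = 1000, 10, 8
logits = torch.randn(batch, T, vocab_size, requires_grad=True)
x_hat = torch.randint(0, vocab_size, (batch, T))   # reference response words

# C = sum_t C_t = -sum_t log P_theta(x_hat_t | x_hat_{1:t-1}, h)
cross_entropy = nn.CrossEntropyLoss(reduction="sum")
C = cross_entropy(logits.reshape(-1, vocab_size), x_hat.reshape(-1)) / batch

C.backward()   # minimizing C == maximizing log P_theta(x_hat | h)
```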

RL for Sentence Generation

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, "Deep Reinforcement Learning for Dialogue Generation", EMNLP 2016

Introduction

• Machine obtains feedback from user

• Chat-bot learns to maximize the expected reward

[Figure: two example exchanges. "How are you?" answered with "Bye bye ☺" gets reward -10; "Hello" answered with "Hi ☺" gets reward 3]

Maximizing Expected Reward

Maximizing expected reward:

$\theta^* = \arg\max_\theta \bar{R}_\theta$

$\bar{R}_\theta = \sum_h P(h) \sum_x R(h,x)\, P_\theta(x \mid h)$

$P(h)$: probability that the input/history is $h$; $P_\theta(x \mid h)$: randomness in the generator; $R(h, x)$: reward given by the human.

[Figure: Encoder + Generator (parameters $\theta$) maps $h$ to $x$; a human returns $R(h, x)$, which is used to update the generator]

Maximizing Expected Reward

Maximizing expected reward:

$\theta^* = \arg\max_\theta \bar{R}_\theta$

$\bar{R}_\theta = \sum_h P(h) \sum_x R(h,x)\, P_\theta(x \mid h) = E_{h \sim P(h)}\, E_{x \sim P_\theta(x|h)}\big[R(h,x)\big] = E_{h \sim P(h),\, x \sim P_\theta(x|h)}\big[R(h,x)\big]$

$\quad \approx \frac{1}{N}\sum_{i=1}^N R(h^i, x^i)$

Sample: $(h^1, x^1), (h^2, x^2), \dots, (h^N, x^N)$

Where is $\theta$? After sampling, $\theta$ no longer appears explicitly in the approximation, so this form cannot be differentiated with respect to $\theta$ directly; the policy-gradient trick on the next slide is needed.

[Figure: Encoder + Generator (parameters $\theta$) maps $h$ to $x$; a human returns $R(h, x)$, which is used to update the generator]

Policy Gradient

$\nabla\bar{R}_\theta = \sum_h P(h) \sum_x R(h,x)\, \nabla P_\theta(x \mid h)$

$\quad = \sum_h P(h) \sum_x R(h,x)\, P_\theta(x \mid h)\, \frac{\nabla P_\theta(x \mid h)}{P_\theta(x \mid h)}$

(using $\frac{d\log f(x)}{dx} = \frac{1}{f(x)} \frac{df(x)}{dx}$)

$\quad = \sum_h P(h) \sum_x R(h,x)\, P_\theta(x \mid h)\, \nabla\log P_\theta(x \mid h)$

$\quad = E_{h \sim P(h),\, x \sim P_\theta(x|h)}\big[R(h,x)\, \nabla\log P_\theta(x \mid h)\big]$

$\quad \approx \frac{1}{N}\sum_{i=1}^N R(h^i, x^i)\, \nabla\log P_\theta(x^i \mid h^i)$ (sampling)

Compare with the expected reward itself: $\bar{R}_\theta = \sum_h P(h) \sum_x R(h,x)\, P_\theta(x \mid h) \approx \frac{1}{N}\sum_{i=1}^N R(h^i, x^i)$

Policy Gradient

• Gradient Ascent

$\theta^{new} \leftarrow \theta^{old} + \eta\, \nabla\bar{R}_{\theta^{old}}$

$\nabla\bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^N R(h^i, x^i)\, \nabla\log P_\theta(x^i \mid h^i)$

If $R(h^i, x^i)$ is positive: after updating $\theta$, $P_\theta(x^i \mid h^i)$ will increase.

If $R(h^i, x^i)$ is negative: after updating $\theta$, $P_\theta(x^i \mid h^i)$ will decrease.
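A minimal sketch of this estimator and the gradient-ascent step; the `rewards` and `log_probs` tensors are stand-ins for the scored samples and the model's $\log P_\theta(x^i \mid h^i)$ values (assumed, not from the slides).

```python
import torch

# Stand-ins: rewards[i] = R(h^i, x^i) from a human/scorer, log_probs[i] = log P_theta(x^i | h^i).
rewards = torch.tensor([3.0, -10.0, 1.0, 0.5])
log_probs = torch.randn(4, requires_grad=True)

# Surrogate objective (1/N) sum_i R(h^i, x^i) log P_theta(x^i | h^i);
# its gradient is exactly the policy-gradient estimate above.
objective = (rewards * log_probs).mean()
objective.backward()

eta = 1e-3
with torch.no_grad():
    log_probs += eta * log_probs.grad   # gradient ascent: theta_new = theta_old + eta * grad
```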

Implementation

Maximum Likelihood:

• Training data: $(h^1, \hat{x}^1), \dots, (h^N, \hat{x}^N)$

• Objective function: $\frac{1}{N}\sum_{i=1}^N \log P_\theta(\hat{x}^i \mid h^i)$

• Gradient: $\frac{1}{N}\sum_{i=1}^N \nabla\log P_\theta(\hat{x}^i \mid h^i)$

Reinforcement Learning:

• Training data: $(h^1, x^1), \dots, (h^N, x^N)$, where $x^i$ is sampled by the generator and scored by a human: $R(h^i, x^i)$

• Objective function: $\frac{1}{N}\sum_{i=1}^N R(h^i, x^i)\, \log P_\theta(x^i \mid h^i)$

• Gradient: $\frac{1}{N}\sum_{i=1}^N R(h^i, x^i)\, \nabla\log P_\theta(x^i \mid h^i)$

Maximum likelihood is the special case where the training pairs are the references and every weight is $R(h^i, \hat{x}^i) = 1$; reinforcement learning uses sampling as training data, weighted by $R(h^i, x^i)$.

[Figure: Encoder + Generator takes $h^i$, outputs $x^i$; a human returns $R(h^i, x^i)$]

Implementation

At iteration $t$, with parameters $\theta^t$, sample $(h^1, x^1), (h^2, x^2), \dots, (h^N, x^N)$ and obtain the rewards $R(h^1, x^1), R(h^2, x^2), \dots, R(h^N, x^N)$.

New objective: $\frac{1}{N}\sum_{i=1}^N R(h^i, x^i)\, \log P_\theta(x^i \mid h^i)$

Gradient: $\nabla\bar{R}_{\theta^t} \approx \frac{1}{N}\sum_{i=1}^N R(h^i, x^i)\, \nabla\log P_{\theta^t}(x^i \mid h^i)$

Update: $\theta^{t+1} \leftarrow \theta^t + \eta\, \nabla\bar{R}_{\theta^t}$

$\theta^0$ can be well pre-trained from $(h^1, \hat{x}^1), \dots, (h^N, \hat{x}^N)$.
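The whole sample-then-update loop can be sketched as below, assuming a hypothetical generator interface (`sample`, `log_prob`) and a `get_reward` callback standing in for human feedback; none of these names come from the slides.

```python
import torch

def rl_step(generator, optimizer, histories, get_reward):
    """One iteration: sample responses, collect rewards, reward-weighted update (sketch).

    Assumed interface: generator.sample(h) -> response x,
    generator.log_prob(h, x) -> log P_theta(x | h) as a tensor,
    get_reward(h, x) -> scalar reward (e.g., from a human).
    """
    samples = [(h, generator.sample(h)) for h in histories]          # (h^i, x^i)
    rewards = torch.tensor([get_reward(h, x) for h, x in samples])   # R(h^i, x^i)
    log_probs = torch.stack([generator.log_prob(h, x) for h, x in samples])

    loss = -(rewards * log_probs).mean()   # maximize (1/N) sum_i R * log P_theta(x^i | h^i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                       # theta^{t+1} <- theta^t + eta * grad
```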

Add a Baseline

$\frac{1}{N}\sum_{i=1}^N R(h^i, x^i)\, \nabla\log P_\theta(x^i \mid h^i)$

If $R(h^i, x^i)$ is always positive, every sampled $(h^i, x^i)$ has $P_\theta(x^i \mid h^i)$ pushed up.

[Figure: ideal case vs. due to sampling. Because $P_\theta(x \mid h)$ is a probability distribution, raising the probability of the sampled responses (e.g., $x^2$, $x^3$) implicitly lowers the probability of a response that happens not to be sampled (e.g., $x^1$), even if it is good]

$\frac{1}{N}\sum_{i=1}^N \big(R(h^i, x^i) - b\big)\, \nabla\log P_\theta(x^i \mid h^i)$

Add a Baseline

$\frac{1}{N}\sum_{i=1}^N \big(R(h^i, x^i) - b\big)\, \nabla\log P_\theta(x^i \mid h^i)$

[Figure: after adding the baseline, even when $R(h^i, x^i)$ is always positive, samples whose reward is below $b$ get a negative weight, so their probability is decreased and the responses that were not sampled are no longer unfairly suppressed]

There are several ways to obtain the baseline b.
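A minimal sketch of the baseline trick; here b is taken to be the batch-average reward, which is just one simple choice (the slides only note that several ways exist).

```python
import torch

# Stand-ins: all rewards positive, log_probs[i] = log P_theta(x^i | h^i).
rewards = torch.tensor([2.0, 0.5, 3.0, 1.0])
log_probs = torch.randn(4, requires_grad=True)

b = rewards.mean()                              # one simple baseline choice
objective = ((rewards - b) * log_probs).mean()  # (1/N) sum_i (R(h^i, x^i) - b) log P_theta(x^i | h^i)
objective.backward()
# Samples with reward below b now get a negative weight, so their probability is
# pushed down even though every raw reward is positive.
```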

AlphaGo-style training!

• Let two agents talk to each other

Dialogue 1:

How old are you?

See you.

See you.

See you.

Dialogue 2:

How old are you?

I am 16.

I thought you were 12.

What made you think so?

Using a pre-defined evaluation function to compute R(h,x)

Example Reward

• The final reward R(h,x) is the weighted sum of three terms r1(h,x), r2(h,x) and r3(h,x)

$R(h, x) = \lambda_1 r_1(h, x) + \lambda_2 r_2(h, x) + \lambda_3 r_3(h, x)$

• Ease of answering ($r_1$): don't be a conversation killer (avoid replies that leave nothing to say)

• Information flow ($r_2$): say something new

• Semantic coherence ($r_3$): don't contradict what was said before
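A small sketch of the weighted sum; the λ values and the three scoring callables here are placeholders for illustration, not the definitions used in the paper.

```python
def total_reward(h, x, r1, r2, r3, lambdas=(0.25, 0.25, 0.5)):
    """R(h, x) = lambda1*r1(h, x) + lambda2*r2(h, x) + lambda3*r3(h, x).

    r1: ease of answering, r2: information flow, r3: semantic coherence;
    each is assumed to be a callable (h, x) -> float. The weights are placeholders.
    """
    l1, l2, l3 = lambdas
    return l1 * r1(h, x) + l2 * r2(h, x) + l3 * r3(h, x)
```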

Example Results

Reinforcement learning?

[Figure: the classic game example. Start with observation $s_1$; take action $a_1$ "right" and obtain reward $r_1 = 0$, leading to observation $s_2$; take action $a_2$ "fire" (kill an alien) and obtain reward $r_2 = 5$, leading to observation $s_3$]

Reinforcement learning?

[Figure: sentence generation as RL. The observation is the words generated so far (starting from <BOS>); the action set is the vocabulary (e.g., {A, B}); the action we take influences the observation in the next step]

The intermediate rewards are $r = 0$; only when the whole sentence has been generated is a reward obtained, e.g. R("BAA", reference).

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, "Sequence Level Training with Recurrent Neural Networks", ICLR, 2016

Reinforcement learning?

• One can use any advanced RL techniques here.

• For example, actor-critic:

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, Yoshua Bengio, "An Actor-Critic Algorithm for Sequence Prediction", ICLR, 2017

SeqGAN

Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient”, AAAI, 2017

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, "Adversarial Learning for Neural Dialogue Generation", arXiv preprint, 2017

Basic Idea – Sentence Generation

[Figure: as in the original GAN, a code z sampled from a prior distribution is fed to the Generator, which outputs a sentence x; the Discriminator takes a sentence x and outputs real or fake]

Sampling from the RNN at each time step also provides randomness.

Algorithm – Sentence Generation

• Initialize generator Gen and discriminator Dis

• In each iteration:

• Sample real sentences $x$ from the database

• Generate sentences $\tilde{x}$ with Gen

• Update Dis to increase $Dis(x)$ and decrease $Dis(\tilde{x})$

• Update Gen so that the scalar output of Dis on the generated sentences increases

[Figure: Generator produces a sentence, the Discriminator outputs a scalar, and the generator is updated accordingly]
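A sketch of this alternating loop under assumed interfaces (`gen.sample`, `gen.log_prob`, `dis`); because the sampled sentences are discrete, the generator step below already uses the discriminator score as a reward with a policy-gradient update, which the following slides justify.

```python
import torch

def gan_iteration(gen, dis, real_sentences, dis_opt, gen_opt):
    """One iteration of the adversarial loop (interfaces are assumed, not from the slides)."""
    fake_sentences = [gen.sample() for _ in real_sentences]

    # Discriminator step: increase Dis(x) for real x, decrease Dis(x~) for generated x~.
    d_loss = -(torch.stack([torch.log(dis(x)) for x in real_sentences]).mean()
               + torch.stack([torch.log(1 - dis(x)) for x in fake_sentences]).mean())
    dis_opt.zero_grad()
    d_loss.backward()
    dis_opt.step()

    # Generator step: since sentences are discrete, use Dis(x~) as a reward (policy gradient).
    rewards = torch.tensor([dis(x).item() for x in fake_sentences])
    log_probs = torch.stack([gen.log_prob(x) for x in fake_sentences])
    g_loss = -((rewards - rewards.mean()) * log_probs).mean()
    gen_opt.zero_grad()
    g_loss.backward()
    gen_opt.step()
```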

Basic Idea – Chat-bot

[Figure: a conditional GAN. The Chatbot (encoder-decoder, En-De) takes the input sentence/history h and produces a response sentence x; the Discriminator takes the pair (h, x) and outputs real or fake, with human dialogues as the real examples]

Algorithm – Chat-bot

• Initialize generator Gen and discriminator Dis

• In each iteration:

• Sample a real history $h$ and sentence $x$ from the database

• Sample a real history $h'$ from the database, and generate a sentence $\tilde{x}$ with Gen($h'$)

• Update Dis to increase $Dis(h, x)$ and decrease $Dis(h', \tilde{x})$

• Update Gen so that the scalar output of Dis on the generated responses increases

[Figure: the training data is a dialogue (A: OOO / B: XXX / A: ∆ ∆ ∆ / ……), split into history h and response x; Chatbot (En-De) produces a response, the Discriminator outputs a scalar, and the chatbot is updated]

Can we do backpropagation?

[Figure: the Chatbot (En-De) decodes word by word from <BOS>; the generated sentence, together with the encoded input, goes to the Discriminator, whose scalar output is used to update the chatbot]

The output at each step is a discrete token obtained by sampling, so slightly tuning the generator will not change the output, and the discriminator's scalar cannot be backpropagated through it.

Alternative: improved WGAN.

Reinforcement Learning

• Consider the output of the discriminator as the reward

• Update the generator to increase the discriminator output, i.e., to get maximum reward

• Different from typical RL: the discriminator is also updated during training

$\nabla\bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^N \big(R(h^i, x^i) - b\big)\, \nabla\log P_\theta(x^i \mid h^i)$

The reward is the discriminator score: $R(h^i, x^i) = D(h^i, x^i)$

[Figure: Chatbot (En-De) produces a response; the Discriminator's scalar output is the reward used to update the chatbot]
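A minimal sketch of the generator update with the discriminator score as the reward, under the same assumed interfaces as before (`gen.sample`, `gen.log_prob`, `dis(h, x)` returning a scalar); the constant baseline value is only illustrative.

```python
import torch

def generator_step(gen, dis, histories, gen_opt, baseline=0.5):
    """Policy-gradient update of the chatbot with D(h, x) as the reward (sketch)."""
    pairs = [(h, gen.sample(h)) for h in histories]                 # sampled responses x^i
    rewards = torch.tensor([dis(h, x).item() for h, x in pairs])    # D(h^i, x^i)
    log_probs = torch.stack([gen.log_prob(h, x) for h, x in pairs])

    # maximize (1/N) sum_i (D(h^i, x^i) - b) * log P_theta(x^i | h^i)
    loss = -((rewards - baseline) * log_probs).mean()
    gen_opt.zero_grad()
    loss.backward()
    gen_opt.step()
```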

Reward for Every Generation Step

$\nabla\bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^N \big(D(h^i, x^i) - b\big)\, \nabla\log P_\theta(x^i \mid h^i)$

$h^i$ = "What is your name?", $x^i$ = "I don't know"

$D(h^i, x^i) - b$ is negative, so update $\theta$ to decrease $\log P_\theta(x^i \mid h^i)$

$\log P_\theta(x^i \mid h^i) = \log P(x_1^i \mid h^i) + \log P(x_2^i \mid h^i, x_1^i) + \log P(x_3^i \mid h^i, x_{1:2}^i)$, which includes $P(\text{"I"} \mid h^i)$

$h^i$ = "What is your name?", $x^i$ = "I am John"

$D(h^i, x^i) - b$ is positive, so update $\theta$ to increase $\log P_\theta(x^i \mid h^i)$

$\log P_\theta(x^i \mid h^i) = \log P(x_1^i \mid h^i) + \log P(x_2^i \mid h^i, x_1^i) + \log P(x_3^i \mid h^i, x_{1:2}^i)$, which also includes $P(\text{"I"} \mid h^i)$

With a sentence-level reward, every word of a bad response is discouraged and every word of a good one is encouraged, even though the same first word "I" appears in both; this motivates assigning a reward to every generation step.

Reward for Every Generation Step

$\log P_\theta(x^i \mid h^i) = \log P(x_1^i \mid h^i) + \log P(x_2^i \mid h^i, x_1^i) + \log P(x_3^i \mid h^i, x_{1:2}^i)$

$h^i$ = "What is your name?", $x^i$ = "I don't know"

$P(\text{"I"} \mid h^i)$, $P(\text{"don't"} \mid h^i, \text{"I"})$, $P(\text{"know"} \mid h^i, \text{"I don't"})$

With a reward for every generation step, the gradient becomes

$\nabla\bar{R}_\theta \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \big(Q(h^i, x_{1:t}^i) - b\big)\, \nabla\log P_\theta(x_t^i \mid h^i, x_{1:t-1}^i)$

Two ways to estimate the step-level reward:

Method 1. Monte Carlo (MC) Search

Method 2. Discriminator for Partially Decoded Sequences

Monte Carlo Search

• How to estimate $Q(h^i, x_{1:t}^i)$?

Example: $Q(\text{"What is your name?"}, \text{"I"})$

Roll out from the partial sequence ($h^i$, $x_1^i$ = "I") several times, e.g.:

$x^A$ = "I am John", $D(h^i, x^A) = 1.0$

$x^B$ = "I am happy", $D(h^i, x^B) = 0.1$

$x^C$ = "I don't know", $D(h^i, x^C) = 0.1$

$x^D$ = "I am superman", $D(h^i, x^D) = 0.8$

Average: $Q(h^i, \text{"I"}) = 0.5$

A roll-out generator for sampling is needed.
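A sketch of the Monte Carlo estimate of $Q(h^i, x_{1:t}^i)$, assuming a roll-out generator with a hypothetical `complete(h, prefix)` method and a discriminator callable `dis(h, x)`:

```python
def mc_q_estimate(h, prefix, rollout, dis, n_rollouts=4):
    """Estimate Q(h, x_{1:t}): complete the prefix several times and average D(h, x) (sketch)."""
    completions = [rollout.complete(h, prefix) for _ in range(n_rollouts)]  # e.g. "I am John", ...
    scores = [dis(h, x) for x in completions]                               # D(h, x^A), D(h, x^B), ...
    return sum(scores) / len(scores)                            # e.g. (1.0 + 0.1 + 0.1 + 0.8) / 4 = 0.5
```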

Rewarding Partially Decoded Sequences

• Train a discriminator that is able to assign rewards to both fully and partially decoded sequences

• Break the generated sequences into partial sequences

h = "What is your name?", x = "I am John": partial sequences x = "I", x = "I am"

h = "What is your name?", x = "I don't know": partial sequences x = "I", x = "I don't"

[Figure: the discriminator takes h and a full response x and outputs a scalar; it also takes h and a partial sequence $x_{1:t}$ and outputs $Q(h, x_{1:t})$]
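A sketch of how training examples for such a discriminator can be built by breaking a response into its prefixes (the choice of which prefixes to keep is simplified here relative to the paper):

```python
def partial_examples(h, x_tokens):
    """Break a response into (h, x_{1:t}) pairs so the discriminator can score prefixes."""
    return [(h, x_tokens[:t]) for t in range(1, len(x_tokens) + 1)]

# partial_examples("What is your name?", ["I", "don't", "know"])
# -> [(h, ["I"]), (h, ["I", "don't"]), (h, ["I", "don't", "know"])]
```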

Teacher Forcing

• The training of the generative model is unstable

• The reward is only used to promote or discourage the generator's own generated sequences

• Usually the model knows that the generated results are bad, but does not know what results are good

• Teacher forcing: add real data to the training set

Training data for SeqGAN: $(h^1, x^1), \dots, (h^N, x^N)$, obtained by sampling and weighted by $D(h^i, x^i)$

Adding more data: $(h^1, \hat{x}^1), \dots, (h^N, \hat{x}^N)$, real data, considered as having $D(h^i, \hat{x}^i) = 1$
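A sketch of the resulting update: the generator's own samples are weighted by the discriminator score, while the real pairs are mixed in with weight 1 (interfaces assumed as before).

```python
import torch

def seqgan_teacher_forcing_step(gen, dis, sampled_pairs, real_pairs, gen_opt):
    """Mix discriminator-weighted samples with real pairs treated as D = 1 (sketch)."""
    pairs = sampled_pairs + real_pairs
    weights = torch.tensor([dis(h, x).item() for h, x in sampled_pairs]   # D(h^i, x^i)
                           + [1.0] * len(real_pairs))                     # real data: D = 1
    log_probs = torch.stack([gen.log_prob(h, x) for h, x in pairs])

    loss = -(weights * log_probs).mean()   # weighted maximum likelihood
    gen_opt.zero_grad()
    loss.backward()
    gen_opt.step()
```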

Experiments in paper

• Sentence generation: Synthetic data

• Given an LSTM

• Using the LSTM to generate a lot of sequences as “real data”

• Generator learns from the “real data” by different approaches

• Generator generates some sequences

• Using the LSTM to compute the negative log-likelihood (NLL) of the sequences

• Smaller is better

Experiments in paper - Synthetic data

Experiments in paper - Real data

Results - Chat-bot

To Learn More …

Algorithm – MaliGAN

• Initialize generator Gen and discriminator Dis

• In each iteration:

• Sample real sentences $x$ from the database

• Generate sentences $\tilde{x}$ with Gen

• Update Dis to maximize $\sum_x \log D(x) + \sum_{\tilde{x}} \log\big(1 - D(\tilde{x})\big)$

• Update Gen by the gradient

$\frac{1}{N}\sum_{i=1}^N \Big( \frac{r_D(\tilde{x}^i)}{\sum_{j=1}^N r_D(\tilde{x}^j)} - b \Big)\, \nabla\log P_\theta(\tilde{x}^i)$, where $r_D(x) = \frac{D(x)}{1 - D(x)}$

Maximum-likelihood Augmented Discrete GAN: the normalized weight $r_D(\tilde{x}^i) / \sum_j r_D(\tilde{x}^j)$ plays the role that the discriminator score $D(h^i, x^i)$ plays in SeqGAN.
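A sketch of the MaliGAN weight computation, i.e. the normalized ratios $r_D(\tilde{x}^i) / \sum_j r_D(\tilde{x}^j)$ from discriminator scores; the epsilon and the usage note are illustrative additions, not from the slides.

```python
import torch

def maligan_weights(d_scores, eps=1e-8):
    """Normalized importance weights: r_D(x) = D(x) / (1 - D(x)), then r_D / sum(r_D) (sketch)."""
    d = torch.as_tensor(d_scores, dtype=torch.float32)
    r = d / (1.0 - d + eps)          # r_D(x^i)
    return r / r.sum()               # r_D(x^i) / sum_j r_D(x^j)

# Each sampled sentence's grad log P_theta(x^i) would then be weighted by (weights[i] - b).
weights = maligan_weights([0.9, 0.5, 0.2])
```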

To learn more ……

• Professor forcing:

Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio, "Professor Forcing: A New Algorithm for Training Recurrent Networks", NIPS, 2016

• Handling discrete output by methods other than policy gradient: MaliGAN, Boundary-seeking GAN

Yizhe Zhang, Zhe Gan, Lawrence Carin, "Generating Text via Adversarial Training", Workshop on Adversarial Training, NIPS, 2016

Matt J. Kusner, José Miguel Hernández-Lobato, "GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution", arXiv preprint, 2016

Recommended