Nara Institute of Science and Technology
Augmented Human Communication Laboratory
PRESTO, Japan Science and Technology Agency
Advanced Cutting Edge
Research Seminar
Dialogue System with Deep Neural
Networks
Assistant Professor
Koichiro Yoshino
2018/1/25 ©Koichiro Yoshino, AHC-Lab. NAIST, PRESTO JST
Advanced Cutting Edge Research Seminar 2
1. Basis of spoken dialogue systems
– Types and modules of spoken dialogue systems
2. Deep learning for spoken dialogue systems
– Basis of deep learning (deep neural networks)
– Recent approaches to deep learning for spoken dialogue systems
3. Dialogue management using reinforcement learning
– Basis of reinforcement learning
– Statistical dialogue management using intention dependency graph
4. Dialogue management using deep reinforcement learning
– Implementation of deep Q-network in dialogue management
Course works
• Perceptron
– Simple binary classifier
• Multi-layer perceptron
– Combination of binary classifiers
• Deep neural networks
• Ways to apply deep neural networks
– What kind of problem can be solved with DNN?
Basis of deep neural networks
• Simplest unit of neural networks
– Takes several inputs and produces one output
– The perceptron makes a single decision given several inputs
y = sign( Σ_i w_i · φ_i(x) + b )
Simple perceptron
[Figure: simple perceptron. Input x is mapped by feature functions φ_i(x), weighted by w_i, plus a bias b; the binary output y is +1 or -1.]
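The decision rule above can be sketched in a few lines of Python. The bag-of-words features, the toy data, and the classic mistake-driven learning rule below are illustrative additions, not from the slides.

```python
# A minimal simple-perceptron sketch: y = sign(sum_i w_i * phi_i(x) + b).

def sign(v):
    return 1 if v >= 0 else -1

def perceptron(x, w, b):
    # phi_i(x) is the identity feature here: phi_i(x) = x[i]
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train(data, n_features, epochs=10, lr=1.0):
    # Classic perceptron learning rule: on a mistake, move w toward the example.
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            if perceptron(x, w, b) != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Bag-of-words features over ["good", "bad", "cool"] for the slides' examples.
data = [
    ([1, 0, 0], +1),  # "This room is good"
    ([0, 0, 1], +1),  # "This building is cool"
    ([0, 1, 0], -1),  # "This room is bad"
]
w, b = train(data, 3)
assert all(perceptron(x, w, b) == y for x, y in data)
```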
• Problems the perceptron can solve
– Linearly separable binary classification problems
• Positive: “This room is good”
• Positive: “This building is cool”
• Negative: “This room is bad”
• Problems the perceptron cannot solve
– Non-linear problems
• Positive: “very good”
• Positive: “not bad”
• Negative: “very bad”
• Negative: “not good”
Properties of simple perceptron
[Figure: the linearly separable case puts "good" and "cool" (positive) on one side of a line and "bad" (negative) on the other; the non-linear case ("very good", "not bad" positive vs. "very bad", "not good" negative) cannot be split by a single line.]
• Using several classifiers
– Multi-layer perceptron (MLP)
• Feed-forward network
– Classifiers of the 1st layer take
the same input, but learn
different weights
– If we have two linear separating planes, we can classify the examples of the non-linear problem
Solutions for non-linear problems
[Figure: two separating lines over the examples x_1: very good, x_2: very bad, x_3: not bad, x_4: not good, combined into a single output y.]
• 1st layer
– Input: same as the single perceptron
– Output: features for the decision
of the 2nd layer
• Each perceptron may learn a mapping from the input feature space to a new feature space
• Kernel methods do a similar thing
• 2nd layer
– Input: features that come from
the 1st layer
– Output: classification result
Multi-layer perceptron (MLP)
[Figure: the first layer learns two separating planes φ_1(x) and φ_2(x) over the same examples x_1…x_4; combining them, φ_1(x) ∗ φ_2(x) is positive for x_1 (very good) and x_3 (not bad) and negative for x_2 (very bad) and x_4 (not good).]
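The two-plane solution can be wired by hand. The encoding x = (has_not, has_good) and all weights below are illustrative, chosen so that the XOR-like examples from the slides come out right.

```python
# A hand-wired multi-layer perceptron for the slides' non-linear example.
# Positive class ("very good", "not bad") is XOR(has_not, has_good).

def step(v):
    return 1 if v > 0 else 0

def mlp(x):
    # 1st layer: two linear separators over the same input.
    h1 = step(x[0] + x[1] - 0.5)    # OR(has_not, has_good)
    h2 = step(-x[0] - x[1] + 1.5)   # NAND(has_not, has_good)
    # 2nd layer: combine the two new features.
    return step(h1 + h2 - 1.5)      # AND(h1, h2) == XOR of the inputs

examples = {
    (0, 1): 1,  # "very good"  -> positive
    (1, 0): 1,  # "not bad"    -> positive
    (0, 0): 0,  # "very bad"   -> negative
    (1, 1): 0,  # "not good"   -> negative
}
assert all(mlp(x) == y for x, y in examples.items())
```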
• A deep neural network (DNN) is a deeply layered multi-layer perceptron
– It can learn the mapping between X and y (y = f(X)) even if it is complex
– The restricted Boltzmann machine (RBM) is a key technique to train the model
• Pre-trains the mapping of each layer (X and H^1, H^1 and H^2, …), then fine-tunes the entire network (backpropagation)
Deep neural networks
[Figure: a deep neural network with inputs x_1…x_4, hidden layers h^1, h^2, h^3, and output y.]
• A recurrent neural network is a neural network with a recursion
– h^t = tanh(W_xh x^t + W_hh h^{t-1} + b_h)
– y_p^t = softmax(W_hyp h^t + b_yp)
• tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}),  softmax(y)_i = e^{y_i} / Σ_k e^{y_k}
• This structure works well for sequential input (X^1, X^2, …, X^t)
– t is a time step
– The input h^{t-1} at time-step t acts as a memory of the previous inputs
Variation of neural networks:
Recurrent neural network (RNN)
[Figure: the RNN unrolled over time: inputs x^1…x^5, hidden states h^0…h^5, outputs y_p^1…y_p^5.]
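A minimal sketch of one recurrent step, following the equations above; the dimensions and weight values are illustrative.

```python
import math

# One RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h),
#               y_t = softmax(W_hy h_t + b_y)

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def softmax(v):
    m = max(v)
    exps = [math.exp(vi - m) for vi in v]
    z = sum(exps)
    return [e / z for e in exps]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h, W_hy, b_y):
    h_t = [math.tanh(v)
           for v in add(add(matvec(W_xh, x_t), matvec(W_hh, h_prev)), b_h)]
    y_t = softmax(add(matvec(W_hy, h_t), b_y))
    return h_t, y_t

# Run a length-3 sequence through a 2-unit RNN with 2 output classes.
W_xh = [[0.5, -0.2], [0.1, 0.3]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
W_hy = [[1.0, -1.0], [-1.0, 1.0]]
b_y = [0.0, 0.0]

h = [0.0, 0.0]  # h_0
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h, y = rnn_step(x, h, W_xh, W_hh, b_h, W_hy, b_y)
assert abs(sum(y) - 1.0) < 1e-9  # softmax output is a distribution
```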
• CNN is a state-of-the-art algorithm for classification
– c_{i,j} = Σ_{s=0}^{m−1} Σ_{t=0}^{n−1} w_{s,t} x_{i+s, j+t} + b_c
– a_{i,j} = tanh(c_{i,j}),  p_{i,j} = max(a_{i,j})
– o = tanh(W_po p + b_o),  y = softmax(W_oy o + b_y)
Variation of neural networks:
Convolutional neural network (CNN)
[Figure: the CNN pipeline: convolution → activation → max-pooling → full connection and softmax.]
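A minimal sketch of the convolution and pooling equations above in pure Python; the 4x4 input and 2x2 filter are illustrative.

```python
import math

# c_ij = sum_{s,t} w_st * x_{i+s, j+t} + b_c ; a_ij = tanh(c_ij) ; p = max(a_ij)

def conv2d(x, w, b):
    m, n = len(w), len(w[0])
    rows, cols = len(x) - m + 1, len(x[0]) - n + 1
    return [[sum(w[s][t] * x[i + s][j + t] for s in range(m) for t in range(n)) + b
             for j in range(cols)] for i in range(rows)]

x = [[1, 0, 2, 1],
     [0, 1, 1, 0],
     [1, 2, 0, 1],
     [0, 1, 1, 2]]
w = [[1, 0],
     [0, 1]]                                     # 2x2 filter
c = conv2d(x, w, b=0.0)                          # 3x3 feature map
a = [[math.tanh(v) for v in row] for row in c]   # activation
p = max(v for row in a for v in row)             # global max-pooling
assert len(c) == 3 and len(c[0]) == 3
```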
• Deep learning can learn the mapping between x and y if we have large-scale aligned data
– Speech sounds and their phonemes
– Transcribed utterances and dialogue states
– Belief and action-value function
– State and action
– Action and utterance
• Of course, it is not so simple, but successful works of deep learning have solved mapping problems that were hard to solve with existing frameworks…
Ways to apply deep learning
Tasks of spoken dialogue systems
[Figure: pipeline of a spoken dialogue system. SLU maps "I'd like to take Kintetsu-line from Ikoma stat." to the frame $FROM=Ikoma, $TO_GO=???, $LINE=Kintetsu; DM, using the model and knowledge base, ranks actions (1 ask $TO_GO, 2 inform $NEXT, …); LG realizes "Where will you go?".]
• Speech recognition
• Spoken language understanding
• Dialogue state tracking
• Action decision
• Language generation
• End-to-end dialogue
Speech recognition with DNN in early stage
• Conventional ASR architecture
argmax_W P(W|X) = argmax_W P(X|W) P(W)
where W is a word sequence and X is the speech signal
[Figure: conventional GMM-HMM vs. DNN-HMM acoustic modeling over input frames x_1, x_2, x_3 for the phoneme sequence a-r-a, combined with the language model.]
Speech recognition with DNN in early stage
• Just replaces the generative probability of the GMM for a phoneme with a discriminative probability that classifies the phoneme from speech
– The rest of the architecture (HMM-based phoneme sequence selection and n-gram language modeling) stayed the same, but it reduced speech recognition errors by 20-30%
• The language model calculates the likelihood of a word sequence W
– P(W) = P(w_1) P(w_2|w_1) … P(w_n|w_1, …, w_{n−1})
• Conventional language models are N-gram models, which approximate the conditioning history by the previous N−1 words
– P(W) ≈ Π_i P(w_i | w_{i−N+1:i−1})
• The problem can also be solved with an RNN
– h^t = tanh(W_wh w^{t−1} + W_hh h^{t−1} + b_h)
– w^t = softmax(W_hyp h^t + b_yp)
• The RNN (and its successors) became a state-of-the-art LM
Language model and recurrent neural network
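An N-gram model of the kind above can be estimated by counting; this bigram (N=2) sketch and its two-sentence corpus are illustrative.

```python
from collections import Counter

# Bigram LM: P(W) ≈ prod_i P(w_i | w_{i-1}), estimated from counts.

corpus = [
    "<s> i want to take kintetsu </s>",
    "<s> i want to go to namba </s>",
]
bigrams = Counter()
unigrams = Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words[:-1])           # histories (everything but </s>)
    bigrams.update(zip(words[:-1], words[1:]))

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(words):
    prob = 1.0
    for prev, w in zip(words[:-1], words[1:]):
        prob *= p_bigram(w, prev)
    return prob

assert p_bigram("i", "<s>") == 1.0        # "i" always follows <s> here
assert 0.0 < p_sentence("<s> i want to take kintetsu </s>".split()) <= 1.0
```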
End-to-end speech recognition system
• Early DNN-based speech recognition systems just replaced some modules with deep neural networks, but recent research tries to train the whole model of argmax_W P(W|X)
– Including the pre-processing of ASR
• Ochiai et al., "Multichannel end-to-end speech recognition." In Proc. ICML, 2017.
(Figure adapted from the paper.)
• SLU
– Convert the user utterance into machine-readable expressions
• DM
– Decide the next system action from the SLU result and dialogue history
Problem of Spoken language understanding (SLU)
and dialogue state tracking (DST)
[Figure: SLU tags "I want to take Kintetsu from Ikoma" (Kintetsu = Line, Ikoma = From-Stat) into the SLU result Train_info{$FROM=Ikoma, $LINE=Kintetsu}; dialogue state tracking combines it with the history into Train_info{$FROM=Ikoma, $TO_GO=Namba, $LINE=Kintetsu}; action decision ranks 1 inform $NEXT_TRAIN, 2 ask $TO_GO, ….]
Simple classification for SLU
[Figure: a multichannel CNN combining a Chinese word model, a Chinese character model, and a (translated) English word model.]
A Multichannel Convolutional Neural Network for Cross-language Dialog State Tracking. Shi et al., In Proc. IEEE-SLT, 2016.
CNN for classification
• A CNN requires a fixed-size matrix as input
• Two techniques are used:
– Embedding: converts each word into a fixed-length meaning vector
– 0-padding: sets the height of the matrix to the maximum sentence length in the training data and fills with 0 when a sentence is shorter than the maximum
[Figure: the words of "He doesn't have … himself" are mapped by word embedding to fixed-length vectors x_1, x_2, x_3, …; 0-padding appends all-zero rows when the sentence is shorter than the maximum sentence length.]
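The two tricks can be sketched as follows; the toy embedding table and its dimension are illustrative.

```python
# Word embedding (a toy lookup table here) plus zero-padding to the
# maximum sentence length in the training data.

EMB_DIM = 3
embedding = {
    "this": [0.1, 0.0, 0.2],
    "room": [0.0, 0.3, 0.1],
    "is": [0.2, 0.1, 0.0],
    "good": [0.5, 0.4, 0.3],
}

def to_matrix(sentence, max_len):
    rows = [embedding.get(w, [0.0] * EMB_DIM) for w in sentence.split()[:max_len]]
    # 0-padding: fill with zero vectors up to the maximum sentence length.
    rows += [[0.0] * EMB_DIM] * (max_len - len(rows))
    return rows

train = ["this room is good", "this is good"]
max_len = max(len(s.split()) for s in train)
matrices = [to_matrix(s, max_len) for s in train]
assert all(len(m) == max_len for m in matrices)   # fixed height
assert matrices[1][-1] == [0.0, 0.0, 0.0]         # padded row
```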
• SLU: find a dialogue frame F given the words of the utterance W
– argmax_F P(F|W)
• There are many works on this tagging problem
– Slot filling, domain/intent classification, dialogue act classification, …
• DST: find a dialogue state S given the sequence F_{1:t}
– argmax_S P(S|F_{1:t})
• It can be solved as a joint problem
– P(S|W_{1:t}) = P(S|F_{1:t}) P(F_{1:t}|W_{1:t})
• Can both be solved with the same (sequential) model?
– RNN, long short-term memory (LSTM) networks
Problem definition of SLU and DST
• Word-Based Dialog State Tracking with Recurrent Neural Networks. Henderson et al., In Proc. SIGDIAL, pp. 292-300, 2014.
RNN-based dialogue state tracking
LSTM-based dialogue state tracking
[Figure: the user utterance "Is there any activity in Singapore?" is fed word by word through word embeddings into an LSTM, together with other features; the output is the dialogue state Task: activity{Area: Singapore, Price range: -, …}.]
Dialogue State Tracking using Long Short Term Memory Neural Networks. Yoshino et al., In Proc. IWSDS, 2016.
• RNN: outputs the dialogue state given the sequence of words (utterances)
– h^t = tanh(W_Xh X^t + W_hh h^{t−1} + b_h)
• X^t is the sequence of words at time t
• The dialogue history is propagated as the hidden layer h^{t−1}
– y_p^t = softmax(W_hyp h^t + b_yp)
• Outputs the dialogue state y_p^t that has the highest probability
• Belief update
– b^t(s_j) ≈ P(o^t|s_j^t) Σ_{s_i} P(s_j^t | s_i^{t−1}) b^{t−1}(s_i)
Relation between belief update and RNN-
based dialogue state tracker
[Annotation: the corresponding terms in the RNN and in the belief update are the observation, the state transition, and the belief.]
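The belief update above can be computed directly; the two-state transition and observation distributions are illustrative.

```python
# b_t(s_j) ∝ P(o_t | s_j) * sum_i P(s_j | s_i) * b_{t-1}(s_i)

def belief_update(b_prev, transition, obs_likelihood):
    n = len(b_prev)
    b = [obs_likelihood[j] * sum(transition[i][j] * b_prev[i] for i in range(n))
         for j in range(n)]
    z = sum(b)                     # normalize so the belief stays a distribution
    return [bj / z for bj in b]

b_prev = [0.5, 0.5]
transition = [[0.9, 0.1],          # P(s_j | s_i), row i -> column j
              [0.2, 0.8]]
obs_likelihood = [0.1, 0.7]        # P(o_t | s_j) for the current observation
b = belief_update(b_prev, transition, obs_likelihood)
assert abs(sum(b) - 1.0) < 1e-9
assert b[1] > b[0]                 # the observation favours s_2
```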
• Decide action 𝒂𝒕 given belief 𝒃𝒕
• There are two ways:
– Find the best policy (policy gradient)
– Find the best Q-function (Q-network)
Problem of action decision
[Figure: the user says "I'd like to go to Namba with Kintetsu"; the belief b^t is a distribution over candidate dialogue states s_1…s_4 (Train_info frames over $FROM, $TO_GO, $LINE); the action decision a^t ranks 1 inform $NEXT_TRAIN, 2 ask $TO_GO, ….]
• Maximize the expected future reward (value function)
– V^{π*}(s^t) = max_π V^π(s^t)
  = max_a Σ_{s^{t+1}} P(s^{t+1}|s^t, a) [ R(s^t, a, s^{t+1}) + γ V^{π*}(s^{t+1}) ]
– Q^{π*}(s, a) = Σ_{s^{t+1}} P(s^{t+1}|s^t, a^t) [ R(s^t, a^t, s^{t+1}) + γ max_{a^{t+1}} Q^{π*}(s^{t+1}, a^{t+1}) ]
• Policy gradient directly calculates the score V^{π*}(s^t)
• Q-network calculates Q^{π*}(s, a) for each action, following the sampling manner of Q-learning
What are good actions?
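The Bellman optimality equation for V above can be exercised with value iteration on a tiny MDP; the two-state environment, rewards, and γ below are illustrative.

```python
# V*(s) = max_a sum_{s'} P(s'|s,a) [ R(s,a,s') + γ V*(s') ]

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a] = list of (s_next, prob); R[s][a][s_next] = reward
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: {0: 0.0}, 1: {1: 1.0}},
     1: {0: {0: 0.0}, 1: {1: 1.0}}}

V = {s: 0.0 for s in STATES}
for _ in range(200):  # iterate the Bellman backup to (near) convergence
    V = {s: max(sum(p * (R[s][a][s2] + GAMMA * V[s2]) for s2, p in P[s][a])
                for a in ACTIONS)
         for s in STATES}

# Moving to / staying in state 1 earns reward 1 each step: V ≈ 1/(1-γ) = 10.
assert abs(V[1] - 10.0) < 1e-3
```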
• In policy gradient, the policy is not deterministic: π(s, a) is the probability of selecting action a given state s
– J(θ) = V^{π_θ}(s)
– ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]
– If we learn the parameters θ of the policy π_θ to maximize J(θ) on existing data, we get the policy that maximizes the reward for the existing data sequences
– We can use deep learning for this parameter learning
Policy gradient
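A sketch of the gradient ∇_θ J(θ) for a softmax policy over two actions. The features (one parameter per action), the fixed Q-values, and the use of an exact expectation instead of sampling are illustrative simplifications.

```python
import math

def policy(theta):
    z = sum(math.exp(t) for t in theta)
    return [math.exp(t) / z for t in theta]

def grad_log_pi(theta, a):
    # For a softmax policy: d/d theta_k log pi(a) = 1[k == a] - pi(k)
    pi = policy(theta)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(theta))]

theta = [0.0, 0.0]
Q = [0.0, 1.0]      # action 1 has the higher action value
lr = 0.5
for _ in range(50):
    pi = policy(theta)
    grad = [0.0, 0.0]
    for a in range(2):          # exact expectation over actions (no sampling)
        g = grad_log_pi(theta, a)
        for k in range(2):
            grad[k] += pi[a] * g[k] * Q[a]
    theta = [t + lr * gk for t, gk in zip(theta, grad)]

assert policy(theta)[1] > 0.9   # the policy learned to prefer action 1
```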
• Premise: if we can calculate Q(s, a) for every pair, we should decide the action a according to max_a Q(s, a)
• Problem: we do not know P(s^{t+1}|s^t, a^t), which is needed to calculate Q(s, a)
• Solution: approximate P(s^{t+1}|s^t, a^t) by sampling
– Q(s^t, a^t) ←update← (1 − α) Q(s^t, a^t) + α [ R(s^t, a^t, s^{t+1}) + γ max_{a^{t+1}} Q(s^{t+1}, a^{t+1}) ]
– Back-propagate the reward from the end of the sample
– We will try to build a dialogue manager using this algorithm in the next class
Q-learning
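The sampling-based update above, in tabular form; the three-state chain environment and all constants are illustrative.

```python
import random

# Tabular Q-learning on a chain: reach (and stay at) state 2 for reward 1.
random.seed(0)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2
N_STATES = 3
ACTIONS = [0, 1]                          # 0 = stay, 1 = move right

def env_step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else s
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                      # episodes
    s = 0
    for _ in range(10):                   # steps per episode
        if random.random() < EPSILON:     # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s_next, r = env_step(s, a)
        # Q(s,a) <- (1-alpha) Q(s,a) + alpha [ R + gamma max_a' Q(s',a') ]
        Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * (r + GAMMA * max(Q[s_next]))
        s = s_next

assert Q[0][1] > Q[0][0]   # moving toward the rewarding state is preferred
```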
• Idea: if we can regress the Q-value at each sampling step, learning will be efficient
– L(θ_i) = E_{s,a,r,s'}[ (y − Q(s, a; θ_i))² ]
– y = R(s^t, a^t, s^{t+1}) + γ max_{a^{t+1}} Q(s^{t+1}, a^{t+1})
• Regression? Training the mapping between y and Q(s, a)?
Deep learning! (Deep Q-network; DQN)
Q-network
[Figure: a Q-network takes the belief b (e.g. s=s_1: 0.0, s=s_2: 0.9, …, s=s_n: 0.0) through tanh layers and outputs Q(b, a).]
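The regression view can be sketched with a linear Q-function trained by gradient descent on (y − Q)²; the one-hot encoding and the single repeated transition are illustrative.

```python
# Regress Q(s,a;θ) toward the target y = R + γ max_a' Q(s',a').

GAMMA = 0.9
N_SA = 4                  # 2 states x 2 actions, one weight per (s,a) pair
theta = [0.0] * N_SA

def q(sa_index):
    return theta[sa_index]  # linear Q with one-hot (s,a) features

# One repeated transition: (s=0, a=1) -> s'=1, reward 1.0
for _ in range(100):
    y = 1.0 + GAMMA * max(q(2), q(3))   # target from s' = 1
    err = y - q(1)                      # Q(s=0, a=1)
    theta[1] += 0.1 * err               # SGD step on (y - Q)^2 (grad = -err)

assert abs(q(1) - 1.0) < 1e-3  # Q(0,1) converges to R + γ·0 = 1.0
```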
Joint learning of dialogue state tracking and
action decision with deep learning
• LSTM-based DST results are used as the input of the DQN
– LSTM: calculates b^t from the observations
– DQN: optimizes Q(b^t, a^{t+1}; θ) by regression
– Fine-tuning of the entire network
Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning. Zhao et al., In Proc. SIGDIAL, 2016.
• Generate a sentence given a system action
• Conventional: statistical template-based approaches
– They are still weak for out-of-vocabulary (and out-of-template) inputs…
Language generation
[Figure: the language generation module maps the action Ask $TO_GO to the utterance "Where will you go?".]
Heat/Ac the pot/T with oil/F.
Add/Ac the celery/F, green onion/F and garlic/F.
Stir-fry/Ac for about one minute/D.
Generation of procedural texts from flow graphs. IPSJ Journal, 2015.
• An RNN can generate a sequence of words by using each generated word as its next input (decoder model)
– h^t = tanh(W_wh w^{t−1} + W_hh h^{t−1} + b_h)
– w^t = softmax(W_hyp h^t + b_yp)
RNN-language model for generation
[Figure: the RNN language model unrolled: inputs x_1=He, x_2=doesn't, x_3=have, …, x_7=in; outputs y_p^1=doesn't, y_p^2=have, …, y_p^8=himself. The dimension of the output layer is the vocabulary size.]
Decoder with condition:
semantically conditioned LSTM
[Figure: the semantically conditioned LSTM cell. The recurrent hidden layer and word embeddings model how to say it (the LM); a 1-hot dialogue act with slot values controls what to say (the contents).]
Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. Wen et al., In Proc. EMNLP, 2015.
semantically conditioned LSTM
decoding results for dialogue systems
• Sequence-to-sequence
modeling of generation
– Change the response according
to the dialogue context
Context-aware NLG
Dusek et al., A context-aware natural language generator for dialogue systems. In Proc. SIGDIAL, 2016.
QA style
[Figure: the same pipeline with the dialogue management skipped (example-based): SLU parses "I'd like to take Kintetsu-line from Ikoma stat." into $FROM=Ikoma, $TO_GO=???, $LINE=Kintetsu, and LG directly answers "Where will you go?" using the model and knowledge base.]
Encoder-decoder
• We can combine the ideas of encoder and decoder to build a neural network that remembers the input sentence and outputs a sentence in response
– The RNN may remember not only the words but also their order
[Figure: the encoder reads "where is the rest room EOS"; the decoder generates "rest room is next entrance ." token by token, feeding each generated word back as the next input until EOS.]
Vinyals, Oriol, and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
Attention model
• Gives an attentional point to be decoded
[Figure: while decoding "next to the entrance . EOS" from the encoded input "where is the rest room EOS", a softmax over attention scores decides how strongly each input state passes through.]
• a_{t,j} = attention_score(h_e^j, h_d^t)
Attention model
[Figure: at each decoding step of "next to the entrance . EOS", attention_score and softmax weight the encoder states of "where is the rest room EOS".]
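One common concrete choice for attention_score is a dot product, followed by a softmax over encoder positions and a weighted sum (the context vector); all vectors below are illustrative.

```python
import math

# a_{t,j} = attention_score(h_e^j, h_d^t): dot-product scores, softmax,
# then a weighted sum of the encoder states (the context vector).

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(encoder_states, decoder_state):
    scores = [dot(h_e, decoder_state) for h_e in encoder_states]  # a_{t,j}
    weights = softmax(scores)
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(len(decoder_state))]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [1.0, 0.0]   # most similar to the first encoder state
weights, context = attend(encoder_states, decoder_state)
assert abs(sum(weights) - 1.0) < 1e-9
assert weights[0] == max(weights)
```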
• One typical end-to-end modeling of dialogue
ChitChat
Serban et al., Building end-to-end dialogue systems using generative hierarchical neural network models. In Proc. AAAI, 2016.
• The task is goal-oriented (API calls), but the system is made to work with an end-to-end memory network
– The problem was solved perfectly in DSTC6 Track 1
Memory Network for dialogue systems
• https://sites.google.com/site/deeplearningdialogue/
– Deep Learning for Dialogue Systems tutorial
by Yun-Nung Chen, Asli Celikyilmaz, and Dilek Hakkani-Tur
If you are interested in more recent works…
• Deep learning has been applied to several tasks of spoken dialogue systems in recent years
– Speech recognition, understanding, state tracking, action decision,
generation, and end-to-end modeling
• How do we apply deep learning to dialogue (in development)?
– Clarify your problem to set up the input and output
– Find similar systems among existing works in recent (2-3 years) conferences (SIGDIAL, NAACL, ACL, EMNLP, COLING, AAAI, IJCAI, …)
– Can you prepare enough paired data of inputs and outputs?
• How do we apply deep learning to dialogue (in research)?
– Find a mapping problem that requires a high-dimensional or non-linear mapping
• Consider the properties of your input and output, even if they cover only part of your problem
– Can you prepare enough paired data of inputs and outputs?
Summary
• 1/30
• Dialogue management with Q-learning
– We will see the detailed algorithm and implementation of the
dialogue manager with Q-learning
– We will discuss user simulators
Next contents