Nara Institute of Science and Technology
Augmented Human Communication Laboratory
PRESTO, Japan Science and Technology Agency
Advanced Cutting Edge
Research Seminar
Dialogue System with Deep Neural
Networks
Assistant Professor
Koichiro Yoshino
2018/1/25 ©Koichiro Yoshino, AHC-Lab. NAIST, PRESTO JST
Advanced Cutting Edge Research Seminar 2
1. Basis of spoken dialogue systems
– Type and modules of spoken dialogue systems
2. Deep learning for spoken dialogue systems
– Basis of deep learning (deep neural networks)
– Recent approaches of deep learning for spoken dialogue systems
3. Dialogue management using reinforcement learning
– Basis of reinforcement learning
– Statistical dialogue management using intention dependency graph
4. Dialogue management using deep reinforcement learning
– Implementation of deep Q-network in dialogue management
Course works
• Perceptron
– Simple binary classifier
• Multi-layer perceptron
– Combination of binary classifiers
• Deep neural networks
• Ways to apply deep neural networks
– What kind of problem can be solved with DNN?
Basis of deep neural networks
• Simplest unit of neural networks
– Takes several inputs and produces one output
– A perceptron makes a single decision given several inputs
y = sign(Σ_i w_i · φ_i(x) + b)
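The decision rule above can be written as a short sketch. The weights, bias, and the two-dimensional feature space (counts of "good" and "bad") are hypothetical, chosen only to illustrate the sign rule:

```python
import numpy as np

def perceptron_predict(w, b, phi_x):
    """Single perceptron: y = sign(sum_i w_i * phi_i(x) + b)."""
    return 1 if np.dot(w, phi_x) + b >= 0 else -1

# Hypothetical feature functions: phi(x) = [count("good"), count("bad")]
w = np.array([1.0, -1.0])   # hand-set weights for illustration
b = 0.0
assert perceptron_predict(w, b, np.array([1.0, 0.0])) == 1    # "good" -> positive
assert perceptron_predict(w, b, np.array([0.0, 1.0])) == -1   # "bad"  -> negative
```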
Simple perceptron
[Figure: simple perceptron — input x, feature functions φ_i(x), weights w_i for each φ_i, bias b, and binary output y (+1 or −1)]
• Problems the perceptron can solve
– Linearly separable binary classification problems
• Positive: “This room is good”
• Positive: “This building is cool”
• Negative: “This room is bad”
• Problems the perceptron cannot solve
– Non-linear problems
• Positive: “very good”
• Positive: “not bad”
• Negative: “very bad”
• Negative: “not good”
Properties of simple perceptron
[Figure: 2-D feature plots — the separable case (good, cool vs. bad divided by one line) and the non-linear case, where "very good" and "not bad" are positive but "very bad" and "not good" are negative, so no single line separates the classes]
• Using several classifiers
– Multi-layer perceptron (MLP)
• Feed-forward network
– Classifiers of the 1st layer take the same input, but learn different weights
– If we have two linear separating planes, we can classify the examples of the non-linear problem
Solutions for non-linear problems
[Figure: the four examples x_1 = "very good", x_2 = "very bad", x_3 = "not bad", x_4 = "not good" plotted on the not/very vs. good/bad axes, classified into y by two separating planes]
• 1st layer
– Input: same as the single perceptron
– Output: features for the decision of the 2nd layer
• Each perceptron may learn a mapping from the input feature space to a new feature space
• Kernel methods do a similar thing
• 2nd layer
– Input: features that come from the 1st layer
– Output: classification result
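The two-layer idea can be made concrete with hand-set weights. The mapping of words to binary features and all thresholds below are illustrative assumptions; a trained MLP would learn such weights from data. The "not"/"bad" sentiment example is exactly the XNOR function, which no single perceptron can compute:

```python
def step(z):
    """Threshold unit: 1 if z >= 0 else 0 (a {0,1} variant of sign)."""
    return 1 if z >= 0 else 0

def mlp_xnor(a, b):
    """Two-layer perceptron computing XNOR with hand-set weights.
    a = 1 if 'not' appears, b = 1 if 'bad' appears; output 1 = positive."""
    h1 = step(a + b - 1.5)      # hidden unit 1: logical AND (one plane)
    h2 = step(a + b - 0.5)      # hidden unit 2: logical OR (another plane)
    return step(h1 - h2 + 0.5)  # 2nd layer combines the two planes

assert mlp_xnor(0, 0) == 1  # "very good" -> positive
assert mlp_xnor(1, 1) == 1  # "not bad"   -> positive
assert mlp_xnor(0, 1) == 0  # "very bad"  -> negative
assert mlp_xnor(1, 0) == 0  # "not good"  -> negative
```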
Multi-layer perceptron (MLP)
[Figure: MLP separating x_1 = "very good", x_2 = "very bad", x_3 = "not bad", x_4 = "not good" — two 1st-layer perceptrons produce features φ_1(x) and φ_2(x), and the 2nd layer combines them, e.g. one combination of φ_1(x) and φ_2(x) gives negative, the other positive]
• A deep neural network (DNN) is a deeply layered multi-layer perceptron
– It can learn the mapping between X and y (y = f(X)) even if it is complex
– The restricted Boltzmann machine (RBM) was a key technique to train such models
• Pre-train the mapping of each layer (X and H_1, H_1 and H_2, …) and then fine-tune the entire network with backpropagation
Deep neural networks
[Figure: deep neural network — input units x_1 … x_4, hidden layers h^1, h^2, h^3 of four units each, and output y]
• A recurrent neural network is a neural network that has a recursion
– h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h)
– yp_t = softmax(W_hyp h_t + b_yp)
• tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}),  softmax(y)_i = e^{y_i} / Σ_k e^{y_k}
• This structure works well for sequential input (X_1, X_2, …, X_t)
– t is a time step
– The input h_{t−1} at time step t acts as a memory of the previous inputs
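One recurrence step of these equations can be sketched directly. The dimensions and the random weights below are illustrative only; the point is the shape of the computation, with h carried across time steps as the memory:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, bh, Why, by):
    """One RNN step: h_t = tanh(Wxh x_t + Whh h_{t-1} + bh),
    y_t = softmax(Why h_t + by)."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
    logits = Why @ h_t + by
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return h_t, e / e.sum()

# Tiny illustrative dimensions and random weights (not trained)
rng = np.random.default_rng(0)
D, H, V = 3, 4, 5                           # input, hidden, output sizes
Wxh, Whh = rng.normal(size=(H, D)), rng.normal(size=(H, H))
Why = rng.normal(size=(V, H))
h = np.zeros(H)
for x in rng.normal(size=(6, D)):           # a sequence of 6 input vectors
    h, y = rnn_step(x, h, Wxh, Whh, np.zeros(H), Why, np.zeros(V))
assert abs(y.sum() - 1.0) < 1e-9            # softmax output is a distribution
```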
Variation of neural networks:
Recurrent neural network (RNN)
[Figure: unrolled RNN — inputs x_1 … x_5, hidden states h_0, h_1 … h_5 passed along the sequence, outputs yp_1 … yp_5]
• CNN is a state-of-the-art algorithm for classification
– c_{i,j} = Σ_{s=0}^{m−1} Σ_{t=0}^{n−1} w_{s,t} x_{(i+s),(j+t)} + b_c
– a_{i,j} = tanh(c_{i,j}),  p = max_{i,j} a_{i,j}
– o = tanh(W_po p + b_o),  y = softmax(W_oy o + b_y)
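The convolution, activation, and max-pooling steps above can be sketched with a tiny hand-made input; the 4×4 matrix and the all-ones filter are illustrative assumptions, and the full-connection/softmax stage is omitted:

```python
import numpy as np

def conv2d_valid(x, w, b):
    """c_{i,j} = sum_{s,t} w_{s,t} * x_{i+s, j+t} + b  (valid convolution)."""
    m, n = x.shape
    fm, fn = w.shape
    c = np.empty((m - fm + 1, n - fn + 1))
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            c[i, j] = np.sum(w * x[i:i+fm, j:j+fn]) + b
    return c

x = np.arange(16.0).reshape(4, 4)   # toy "image" / feature matrix
w = np.ones((2, 2))                 # toy 2x2 filter
c = conv2d_valid(x, w, 0.0)         # 3x3 feature map
a = np.tanh(c)                      # activation
p = a.max()                         # max-pooling over the feature map
assert c.shape == (3, 3) and c[0, 0] == 10.0   # 0+1+4+5
```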
Variation of neural networks:
Convolutional neural network (CNN)
[Figure: CNN pipeline — convolution → activation → max-pooling → full connection and softmax]
• Deep learning can learn the mapping between 𝒙 and 𝒚
if we have a large-scale aligned data
– Speech sounds and their phonemes
– Transcribed utterances and dialogue states
– Belief and action-value function
– State and action
– Action and utterance
• Of course, it is not so simple, but successful deep learning work has solved
many mapping problems that were hard to solve with existing frameworks…
Ways to apply deep learning
Tasks of spoken dialogue systems
[Figure: spoken dialogue system pipeline — the user says "I'd like to take Kintetsu-line from Ikoma stat."; SLU extracts $FROM=Ikoma, $LINE=Kintetsu; the DM, with a model and knowledge base, tracks the state ($FROM=Ikoma, $TO_GO=???, $LINE=Kintetsu) and ranks actions (1 ask $TO_GO, 2 inform $NEXT, …); LG produces "Where will you go?"]
• Speech recognition
• Spoken language understanding
• Dialogue state tracking
• Action decision
• Language generation
• End-to-end dialogue
Speech recognition with DNN in early stage
• Conventional ASR architecture
– argmax_W P(W|X) = argmax_W P(X|W) P(W)
• W is the word sequence and X is the speech signal
• P(X|W): acoustic model,  P(W): language model
[Figure: GMM-HMM vs. DNN-HMM acoustic models — in both, HMM states for the phonemes "a r a" emit acoustic frames x_1, x_2, x_3; the DNN-HMM replaces the GMM emission model with a deep network]
Speech recognition with DNN in early stage
• Just replace the generative GMM probability of a phoneme with a
discriminative probability that classifies the phoneme from speech
– The rest of the architecture (HMM-based phoneme sequence search and
n-gram language modeling) stayed the same, but this reduced speech
recognition errors by 20-30%
• The language model calculates the likelihood of a word sequence W
– P(W) = P(w_1) P(w_2|w_1) … P(w_n|w_1, …, w_{n−1})
• Conventional language models use an N-gram model that approximates
the conditioning history with the previous N−1 words
– P(W) ≈ Π_i P(w_i | w_{i−N+1}, …, w_{i−1})
• The same problem can be solved with an RNN
– h_t = tanh(W_wh w_{t−1} + W_hh h_{t−1} + b_h)
– w_t = softmax(W_hy h_t + b_y)
• The RNN (and its successors) became the state-of-the-art LM
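The chain-rule factorization and the N-gram approximation can be sketched with a bigram (N = 2) model estimated from counts; the two-sentence toy corpus is a made-up assumption, used only to show the arithmetic:

```python
from collections import Counter

# Toy bigram LM: P(W) ~= prod_i P(w_i | w_{i-1}), estimated from counts
corpus = "where will you go </s> where is the station </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(w_prev, w):
    """Maximum-likelihood estimate count(w_prev, w) / count(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_sentence(words):
    """Chain rule with the bigram approximation."""
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(prev, w)
    return p

# "where" is followed by "will" once and "is" once -> P(will|where) = 1/2
assert p_bigram("where", "will") == 0.5
assert p_sentence(["where", "will", "you", "go", "</s>"]) == 0.5
```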
Language model and recurrent neural network
End-to-end speech recognition system
• Early DNN-based speech recognition systems just replaced some modules
with deep neural networks, but recent research tries to train the whole
model of argmax_W P(W|X) end to end
– Including the pre-processing of ASR
• Ochiai et al., "Multichannel end-to-end speech recognition." In Proc. ICML, 2017.
(Figure taken from the paper)
• SLU
– Convert the user utterance into machine-readable expressions
• DM
– Decide the next system action from the SLU result and dialogue history
Problem of Spoken language understanding (SLU)
and dialogue state tracking (DST)
[Figure: SLU and DST example — the utterance "I want to take Kintetsu from Ikoma" is tagged (Kintetsu = Line, Ikoma = From Stat), giving the SLU result Train_info{$FROM=Ikoma, $LINE=Kintetsu}; combined with the dialogue history ($TO_GO=Namba from an earlier turn), the tracked state becomes Train_info{$FROM=Ikoma, $TO_GO=Namba, $LINE=Kintetsu}, which feeds the action decision (1 inform $NEXT_TRAIN, 2 ask $TO_GO, …)]
Simple classification for SLU
A Multichannel Convolutional Neural Network for Cross-Language Dialog State Tracking. Shi et al., In Proc. IEEE-SLT, 2016.
[Figure: three CNN channels — a Chinese word model, a Chinese character model, and a (translated) English word model]
CNN for classification
• CNN requires a fixed-size matrix as its input
• Two techniques make variable-length sentences fit:
– Embedding: converts each word into a fixed-length meaning vector
– 0-padding: sets the matrix height to the maximum sentence length in
the training data, and fills with 0 if the sentence is shorter than that
[Figure: the words of "He doesn't have … himself" mapped to fixed-length word embedding vectors x_1, x_2, x_3, …, with rows of 0s appended up to the maximum sentence length]
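Embedding lookup plus 0-padding can be sketched in a few lines; the embedding table, the dimension, and the maximum length are all hypothetical stand-ins for values that would come from training data:

```python
import numpy as np

def pad_embed(sentence, emb, max_len, dim):
    """Build a fixed-size (max_len x dim) input matrix for a CNN:
    look up each word's embedding, then zero-pad up to max_len."""
    mat = np.zeros((max_len, dim))
    for i, w in enumerate(sentence[:max_len]):
        mat[i] = emb[w]
    return mat

dim, max_len = 4, 6
emb = {"good": np.ones(dim), "not": -np.ones(dim)}  # hypothetical embeddings
m = pad_embed(["not", "good"], emb, max_len, dim)
assert m.shape == (max_len, dim)    # fixed size regardless of sentence length
assert np.all(m[2:] == 0)           # rows beyond the sentence are 0-padded
```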
• SLU: find a dialogue frame F given the words of the utterance W
– argmax_F P(F|W)
• For this tagging problem there are many works
– Slot filling, domain/intent classification, dialogue act classification, …
• DST: find a dialogue state S given the sequence of frames F_{1:t}
– argmax_S P(S|F_{1:t})
• The two can be solved as a joint problem
– P(S|W_{1:t}) = P(S|F_{1:t}) P(F_{1:t}|W_{1:t})
• Can both be solved with the same sequential model?
– RNN, long short-term memory (LSTM) networks
Problem definition of SLU and DST
• Word-Based Dialog State Tracking with Recurrent Neural Networks.
Henderson et al., In Proc. SIGDIAL, pp. 292-300, 2014.
RNN-based dialogue state tracking
LSTM-based dialogue state tracking
[Figure: LSTM-based dialogue state tracker — the word sequence of the user utterance "Is there any activity in Singapore?" is embedded, fed through an LSTM together with other features, and the tracker outputs the frame Task: activity{Area: Singapore, Price range: -, …}]
Dialogue State Tracking using Long Short Term Memory Neural Networks.
Yoshino et al., In Proc. IWSDS, 2016.
• RNN: outputs a dialogue state given the sequence of words (utterances)
– h_t = tanh(W_Xh X_t + W_hh h_{t−1} + b_h)
• X_t is the sequence of words at time t
• The dialogue history is propagated through the hidden layer h_{t−1}
– yp_t = softmax(W_hyp h_t + b_yp)
• Output the dialogue state yp_t that has the highest probability
• Belief update
– b_t(s_j) ≈ P(o_t|s_j^t) Σ_{s_i} P(s_j^t|s_i^{t−1}) b_{t−1}(s_i)
Relation between belief update and RNN-based dialogue state tracker
[Figure: correspondence between the belief-update terms (observation, state transition, belief) and the RNN tracker's input, recurrence, and hidden state]
• Decide action 𝒂𝒕 given belief 𝒃𝒕
• There are two ways:
– Find the best policy (policy gradient)
– Find the best Q-function (Q-network)
Problem of action decision
[Figure: action decision — for the utterance "I'd like to go to Namba with Kintetsu", the belief b_t spreads probability over dialogue states s_1 … s_4 (variants of Train_info{$FROM=Ikoma, $TO_GO=Namba, $LINE=Kintetsu}); the action a_t is chosen from 1 inform $NEXT_TRAIN, 2 ask $TO_GO, …]
• Maximize the expected future reward (value function)
– V^{π*}(s_t) = max_π V^π(s_t)
= max_a Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) [R(s_t, a_t, s_{t+1}) + γ V^{π*}(s_{t+1})]
– Q^{π*}(s, a) = Σ_{s_{t+1}} P(s_{t+1}|s_t, a_t) [R(s_t, a_t, s_{t+1}) + γ max_{a_{t+1}} Q^{π*}(s_{t+1}, a_{t+1})]
• Policy gradient directly estimates the score V^{π*}(s_t)
• The Q-network calculates Q^{π*}(s, a) for each action, following the sampling manner of Q-learning
What are good actions?
• In policy gradient, the policy is not deterministic: π(s, a) is the
probability of selecting action a in state s
– J(θ) = V^{π_θ}(s)
– ∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(s, a) Q^{π_θ}(s, a)]
– If we learn the parameters θ of the policy π_θ by maximizing J(θ) on
existing data, we obtain the policy that maximizes the reward on the
existing data sequences
– Deep learning can be used for this parameter learning
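One gradient step of this rule can be sketched with a linear softmax policy standing in for the deep network. The state vector, learning rate, and the observed return replacing Q^{π_θ}(s, a) are all illustrative assumptions:

```python
import numpy as np

# One policy-gradient update for a softmax policy pi_theta(a|s) = softmax(theta @ s),
# using grad J = E[ grad log pi_theta(s,a) * Q(s,a) ], with Q replaced by an
# observed return (REINFORCE-style estimate).
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pg_update(theta, s, a, ret, lr=0.1):
    probs = softmax(theta @ s)
    grad_log = -probs[:, None] * s   # d log pi(a|s) / d theta_k = (1[k=a] - pi(k)) s
    grad_log[a] += s
    return theta + lr * ret * grad_log

theta = np.zeros((2, 3))             # 2 actions, 3-dim state features
s = np.array([1.0, 0.0, 0.0])
theta = pg_update(theta, s, a=1, ret=1.0)
assert softmax(theta @ s)[1] > 0.5   # the rewarded action became more likely
```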
Policy gradient
• Premise: if we can calculate Q(s, a) for every pair, we should choose
the action a that achieves max_a Q(s, a)
• Problem: we don't know P(s_{t+1}|s_t, a_t), so we cannot calculate Q(s, a) directly
• Solution: approximate P(s_{t+1}|s_t, a_t) by sampling
– Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [R(s_t, a_t, s_{t+1}) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})]
– Back-propagate the reward from the end of the sampled episode
– We will build a dialogue manager using this algorithm in the next class
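The sampling-based update can be sketched on a toy two-state "dialogue": the states, actions, transitions, and rewards below are invented for illustration (slot unknown vs. slot filled), not the manager built in class:

```python
import random
from collections import defaultdict

# Hypothetical environment: state 0 = slot unknown, state 1 = slot filled.
# Actions: 0 = "ask", 1 = "inform". Returns (next_state, reward, done).
def simulate(state, action):
    if state == 0 and action == 0:   # asking fills the slot
        return 1, 0.0, False
    if state == 1 and action == 1:   # informing once the slot is known
        return 1, 1.0, True
    return state, -0.1, False        # anything else wastes a turn

Q = defaultdict(float)
alpha, gamma = 0.5, 0.9
random.seed(0)
for _ in range(500):                 # sampled episodes with random exploration
    s, done = 0, False
    while not done:
        a = random.choice([0, 1])
        s2, r, done = simulate(s, a)
        target = r + (0 if done else gamma * max(Q[(s2, 0)], Q[(s2, 1)]))
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# The learned Q-values should say: ask first, then inform.
assert Q[(0, 0)] > Q[(0, 1)]
assert Q[(1, 1)] > Q[(1, 0)]
```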
Q-learning
• Idea: if we can regress the Q-value at each sampling step, learning
will be efficient
– L(θ_i) = E_{s,a,r,s′}[(y − Q(s, a; θ_i))²]
– y = R(s_t, a_t, s_{t+1}) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
• Regression, i.e., training the mapping between y and Q(s, a)?
Deep learning! (Deep Q-network; DQN)
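A single regression step of this loss can be sketched with a linear Q-function as a stand-in for the deep network (the belief vector, learning rate, and transition are illustrative assumptions):

```python
import numpy as np

# One DQN-style regression step with a linear Q-function Q(s, a) = theta[a] @ s.
def q_values(theta, s):
    return theta @ s

def dqn_step(theta, s, a, r, s_next, gamma=0.9, lr=0.1):
    y = r + gamma * q_values(theta, s_next).max()   # bootstrap target
    td_error = y - q_values(theta, s)[a]
    theta[a] += lr * td_error * s                   # gradient of (y - Q)^2 / 2
    return td_error ** 2                            # squared loss before update

theta = np.zeros((2, 3))                            # 2 actions, 3-dim belief
s = np.array([1.0, 0.0, 0.0])
loss_before = dqn_step(theta, s, a=0, r=1.0, s_next=s)
loss_after = dqn_step(theta, s, a=0, r=1.0, s_next=s)
assert loss_after < loss_before                     # regression fits the target
```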
Q-network
[Figure: Q-network — the belief b (e.g. s = s_1: 0.0, s = s_2: 0.9, …, s = s_n: 0.0) is fed through tanh layers to output Q(b, a)]
Joint learning of dialogue state tracking and
action decision with deep learning
• LSTM-based DST
results are used as
the input of DQN
LSTM: calculates b_t from observations
DQN: optimizes Q(b_t, a_{t+1}; θ) with regression
Fine-tuning of the entire network
Towards End-to-End
Learning for Dialog State
Tracking and Management
using Deep Reinforcement
Learning. Zhao et al., In
Proc. SIGDIAL, 2016
• Generate a sentence given a system action
• Conventional: statistical template-based approaches
– They are still weak for out-of-vocabulary (and out-of-template) expressions…
Language generation
[Figure: language generation — the system action "Ask $TO_GO" is realized as "Where will you go?"]
Example from recipe generation (translated from Japanese; T = tool, F = food, Ac = action, D = duration):
Heat/Ac the pot/T with oil/F
Add/Ac celery/F, green onion/F, and garlic/F
Stir-fry/Ac for about one minute/D
Generating procedural text from flow graphs. IPSJ Journal, 2015.
• RNN can generate a sequence of words by using the generated word as its
next input (decoder model)
– h_t = tanh(W_wh w_{t−1} + W_hh h_{t−1} + b_h)
– w_t = softmax(W_hyp h_t + b_yp)
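The feed-the-output-back-in loop can be sketched independently of the RNN internals. The `toy_step` function below is a hypothetical stand-in for the two equations above, hard-wired to cycle through a tiny vocabulary so the loop's behavior is visible:

```python
import numpy as np

def greedy_decode(h0, step_fn, bos_id, eos_id, max_len=10):
    """Decoder-style generation: feed the previously generated word back in
    as the next input until EOS is produced (step_fn wraps the RNN step)."""
    words, w, h = [], bos_id, h0
    for _ in range(max_len):
        h, probs = step_fn(w, h)
        w = int(np.argmax(probs))      # greedy choice of the next word
        if w == eos_id:
            break
        words.append(w)
    return words

# Hypothetical step function over a 4-word vocabulary: 1 -> 2 -> 3 -> EOS(0)
def toy_step(w, h):
    probs = np.zeros(4)
    probs[(w + 1) % 4] = 1.0
    return h, probs

assert greedy_decode(None, toy_step, bos_id=0, eos_id=0) == [1, 2, 3]
```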
RNN-language model for generation
[Figure: unrolled RNN language model generating "doesn't have … himself" from the inputs "He doesn't have … in"; the dimension of the output layer equals the vocabulary size]
Decoder with condition:
semantically conditioned LSTM
[Figure: semantically conditioned LSTM — a recurrent hidden layer over word embeddings, conditioned on a 1-hot dialogue act and slot-value vector; the condition decides what to say (contents), the language model decides how to say it]
Semantically Conditioned LSTM-
based Natural Language Generation
for Spoken Dialogue Systems. Wen
et al., In Proc. EMNLP, 2015.
Semantically conditioned LSTM: decoding results for dialogue systems
• Sequence-to-sequence
modeling of generation
– Change the response according
to the dialogue context
Context-aware NLG
Dusek et al., A context-aware natural
language generator for dialogue
systems. In Proc. SIGDIAL 2016.
QA style
[Figure: example-based QA-style system — the same SLU/DM/LG pipeline for "I'd like to take Kintetsu-line from Ikoma stat.", but the management step is skipped: the SLU result ($FROM=Ikoma, $LINE=Kintetsu) is matched against the knowledge base directly to produce "Where will you go?"]
Encoder-decoder
• We can combine the encoder and decoder ideas to build a neural network
that encodes (remembers) the input sentence and decodes an output sentence
– The RNN may remember not only the words but also their order
[Figure: encoder-decoder — the encoder reads "where is the rest room EOS"; the decoder generates a response word by word (e.g. "rest room is next (to the) entrance ." up to EOS), feeding each generated word back as the next input]
Vinyals, Oriol, and Quoc Le. "A neural
conversational model." arXiv preprint
arXiv:1506.05869 (2015).
Attention model
• Gives the point to attend to when decoding
[Figure: attention over encoder states — softmax weights multiply the encoder outputs of "where is the rest room EOS" at each step of decoding "next to the entrance . EOS", deciding whether to pass each input through or not]
• a_{t,j} = attention_score(h^e_j, h^d_t)
Attention model
[Figure: attention scores, softmax-normalized, weight each encoder state of "where is the rest room EOS" at every decoding step of "next to the entrance . EOS"]
• One typical end-to-end modeling of dialogue
ChitChat
Serban et al.,Building
end-to-end dialogue
systems using generative
hierarchical neural
network models. In Proc.
AAAI 2016.
• The task is goal-oriented (API calls), but the system is built on an
end-to-end memory network
– The problem was solved almost perfectly in DSTC6 Track 1
Memory Network for dialogue systems
• https://sites.google.com/site/deeplearningdialogue/
– Deep Learning for Dialogue Systems Tutorial
by Yun-Nung Chen, Asli Celikyilmaz, and Dilek Hakkani-Tur
If you are interested in more recent works…
• Deep learning has been applied to several tasks of spoken dialogue systems
in recent years
– Speech recognition, understanding, state tracking, action decision,
generation, and end-to-end modeling
• How do we apply deep learning for dialogue (in development)?
– Clarify your problem to set up the input and output
– Find similar systems in existing work from recent (2-3 years) conferences
(SIGDIAL, NAACL, ACL, EMNLP, COLING, AAAI, IJCAI, …)
– Can you prepare enough paired data of inputs and outputs?
• How do we apply deep learning for dialogue (in research)?
– Find a mapping problem that requires high-dimensional or non-linear modeling
• Consider the properties of your input and output, even if it is only a
part of your problem
– Can you prepare enough paired data of inputs and outputs?
Summary
• 1/30
• Dialogue management with Q-learning
– We will see the detailed algorithm and implementation of the
dialogue manager with Q-learning
– We will discuss the user simulator
Next contents