
Improving Goal-Oriented Visual Dialog Agents via Advanced Recurrent Nets with Tempered Policy Gradient

Rui Zhao and Volker Tresp
Siemens AG, Corporate Technology, Munich, Germany
Ludwig-Maximilians-Universität München, Munich, Germany
{ruizhao, volker.tresp}@siemens.com

Abstract

Learning goal-oriented dialogues by means of deep reinforcement learning has recently become a popular research topic. However, training text-generating agents efficiently is still a considerable challenge. Commonly used policy-based dialogue agents often end up focusing on simple utterances and suboptimal policies. To mitigate this problem, we propose a class of novel temperature-based extensions for policy gradient methods, referred to as Tempered Policy Gradients (TPGs). These methods encourage exploration with different temperature control strategies. We derive three variations of the TPGs and show their superior performance on a recently published AI testbed, the GuessWhat?! game. On the testbed, we achieve significant improvements with two innovations. The first is an extension of the state-of-the-art solutions with Seq2Seq and Memory Network structures that leads to an improvement of 9%. The second is the application of our newly developed TPG methods, which improves the performance by around another 5% and, even more importantly, helps produce more convincing utterances. TPG can easily be applied to any goal-oriented dialogue system.

1 Introduction

In recent years, deep learning has shown convincing performance in various areas such as image recognition, speech recognition, and natural language processing (NLP). Deep neural nets are capable of learning complex dependencies from huge amounts of data and their human-generated annotations in a supervised way. In contrast, reinforcement learning agents [Sutton and Barto, 1998] can learn directly from their interactions with the environment without any supervision and surpass human performance in several domains, for instance in the game of Go [Silver et al., 2016] as well as in many computer games [Mnih et al., 2015]. In this paper we are concerned with the application of both approaches to goal-oriented dialogue systems [Bordes and Weston, 2017; de Vries et al., 2017; Das et al., 2017; Strub et al., 2017; Lewis et al., 2017; Dhingra et al., 2016], a problem that has recently caught the attention of machine learning researchers. De Vries et al. [2017] have proposed a visually grounded object-guessing game called GuessWhat?! as an AI testbed. Das et al. [2017] formulated a visual dialogue system in which two chatbots ask and answer questions to identify a specific image within a group of images. More practically, dialogue agents have been applied to negotiate a deal [Lewis et al., 2017] and to access certain information from knowledge bases [Dhingra et al., 2016]. The essential idea in these systems is to train different dialogue agents to accomplish the tasks. In those papers, the agents have been trained with policy gradients, i.e., REINFORCE [Williams, 1992].

In order to improve the exploration quality of policy gradients, we present three instances of temperature-based methods. The first one is a single-temperature approach which is very easy to apply. The second one is a parallel approach with multiple temperature policies running concurrently. This second approach demands more computational resources, but results in more stable solutions. The third one is a temperature policy approach that dynamically adjusts the temperature for each action at each time step, based on action frequencies. This dynamic method is more sophisticated and proves more efficient in the experiments. In the experiments, all these methods demonstrate better exploration strategies in comparison to the plain policy gradient.

We demonstrate our approaches on a real-world dataset called GuessWhat?!. The GuessWhat?! game [de Vries et al., 2017] is a visual object discovery game between two players, the Oracle and the Questioner. The Questioner tries to identify an object by asking the Oracle questions. The original works [de Vries et al., 2017; Strub et al., 2017] first proposed supervised learning to simulate and optimize the game. Strub et al. [2017] showed that the performance could be improved by applying plain policy gradient reinforcement learning, which maximizes the game success rate, as a second processing step. Building on these previous works, we propose two network architecture extensions. We utilize a Seq2Seq model [Sutskever et al., 2014] to process the image along with the historical dialogue for question generation. For the guessing task, we develop a Memory Network [Sukhbaatar et al., 2015] with an Attention Mechanism [Bahdanau et al., 2014] to process the generated question-answer pairs. We first train these two models using the plain policy gradient and use them as our baselines. Subsequently, we train the models with our new TPG methods and compare their performance with the baselines. We show that the TPG method is compatible with state-of-the-art architectures such as Seq2Seq and Memory Networks and contributes orthogonally to these advanced neural architectures. To the best of our knowledge, the presented work is the first to propose temperature-based policy gradient methods to balance exploration and exploitation in the field of goal-oriented dialogue systems. We demonstrate the superior performance of our TPG methods by applying them to the GuessWhat?! game. Our contributions are:

• We introduce Tempered Policy Gradients in the context of goal-oriented dialogue systems, a novel class of approaches to temperature control that better leverage exploration and exploitation during training.

• We extend the state-of-the-art solutions for the GuessWhat?! game by integrating Seq2Seq and Memory Networks. We show that TPGs are compatible with these advanced models and further improve the performance.

2 Preliminaries

In our notation, we use x to denote the input to a policy network π, and x_i to denote the i-th element of the input vector. Similarly, w denotes the weight vector of π, and w_i denotes its i-th element. The output y is a multinoulli random variable with N states that follows a probability mass function f(y = n | π(x | w)), where Σ_{n=1}^{N} f(y = n | π(x | w)) = 1 and f(·) ≥ 0. In a nutshell, a policy network parametrizes a probabilistic unit that produces the sampled output; mathematically, y ∼ f(π(x | w)).

At this point we have defined the policy neural net and now discuss the performance measure commonly used for optimization. Typically, the expected value of the accumulated reward, i.e., the return, conditioned on the policy network parameters, E(r | w), is used. Here, E denotes the expectation operator, r the accumulated reward signal, and w the network weight vector. The objective of reinforcement learning is to update the weights in a way that maximizes the expected return at each trial. In particular, the REINFORCE updating rule is Δw_i = α_i (r − b_i) e_i, where Δw_i denotes the adjustment of weight w_i, α_i is a non-negative learning rate factor, and b_i is a reinforcement baseline. Here, e_i is the characteristic eligibility of w_i, defined as e_i = (∂f/∂w_i)/f = ∂ ln f/∂w_i. Williams [1992] proved that the updating quantity (r − b_i) ∂ ln f/∂w_i is an unbiased estimate of ∂E(r | w)/∂w_i.
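To make the update rule concrete, the following is a minimal PyTorch sketch of one REINFORCE step for a categorical policy. The paper's implementation is in Torch7; `policy_net`, the optimizer, and the baseline value here are placeholders, not the authors' components.

```python
import torch
import torch.nn.functional as F

def reinforce_step(policy_net, optimizer, x, reward, baseline):
    """One REINFORCE update: Delta w = alpha * (r - b) * d ln f(y | pi(x | w)) / dw."""
    logits = policy_net(x)                        # unnormalized scores over the N actions
    probs = F.softmax(logits, dim=-1)             # f(y = n | pi(x | w))
    y = torch.multinomial(probs, num_samples=1)   # sample an action y ~ f
    log_prob = torch.log(probs.gather(-1, y))     # ln f(y | pi(x | w))
    loss = -(reward - baseline) * log_prob.sum()  # minimizing this ascends (r - b) ln f
    optimizer.zero_grad()
    loss.backward()                               # autograd supplies the eligibility d ln f / dw
    optimizer.step()
    return y
```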

3 Tempered Policy Gradient

In order to improve the exploration quality of REINFORCE in the task of optimizing policy-based dialogue agents, we attempt to find the optimal compromise between exploration and exploitation. In TPGs we introduce a parameter τ, the sampling temperature of the probabilistic output unit, which allows us to explicitly control the strength of exploration.

3.1 Exploration and Exploitation

The trade-off between exploration and exploitation is one of the great challenges in reinforcement learning [Sutton and Barto, 1998]. To obtain a high reward, an agent must exploit the actions that have already proved effective in getting more rewards. However, to discover such actions, the agent must also try actions that appear suboptimal in order to explore the action space. In a stochastic task like text generation, each action, i.e., a word, must be tried many times to find out whether it is a reliable choice or not. The exploration-exploitation dilemma has been studied intensively over many decades [Carmel and Markovitch, 1999; Nachum et al., 2016; Liu et al., 2017]. Finding the balance between exploration and exploitation is considered crucial for the success of reinforcement learning [Thrun, 1992].

3.2 Temperature Sampling

In text generation, it is well known that the simple trick of temperature adjustment is sufficient to shift the language model to be more conservative or more diversified [Karpathy and Fei-Fei, 2015]. In order to control the trade-off between exploration and exploitation, we use a temperature parameter τ ≥ 0 to control the sampling. The output probability of each word is transformed by a temperature function as

$$f_\tau(y = n \mid \pi(x \mid w)) \;=\; \frac{f(y = n \mid \pi(x \mid w))^{1/\tau}}{\sum_{m=1}^{N} f(y = m \mid \pi(x \mid w))^{1/\tau}}.$$

We use the notation f_τ to denote a probability mass function f that has been transformed by a temperature function with temperature τ. When the temperature is high, τ > 1, the distribution becomes more uniform; when the temperature is low, τ < 1, the distribution becomes more peaked. TPG is defined as an extension of the Monte Carlo policy gradient approach: we use a higher temperature, τ > 1, to encourage the model to explore the action space, and conversely, a lower temperature, τ < 1, to encourage exploitation. In the extreme case, when τ = 0, we obtain greedy search.
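The transform can be written in a few lines. The sketch below assumes the policy output is already a NumPy probability vector over the vocabulary; the function names are ours, not from the paper's code.

```python
import numpy as np

def temper(probs, tau):
    """Apply the temperature function: f_tau(n) = f(n)^(1/tau) / sum_m f(m)^(1/tau)."""
    if tau == 0:                          # limit case: greedy search
        greedy = np.zeros_like(probs)
        greedy[np.argmax(probs)] = 1.0
        return greedy
    scaled = probs ** (1.0 / tau)         # tau > 1 flattens, tau < 1 sharpens
    return scaled / scaled.sum()

def sample_word(probs, tau=1.0, rng=None):
    """Draw a word index from the tempered distribution f_tau."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(probs), p=temper(probs, tau))
```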

3.3 Tempered Policy Gradient Methods

Here, we introduce three instances of TPGs in the domain of goal-oriented dialogues: the single, parallel, and dynamic tempered policy gradient methods.

Single-TPG: The Single-TPG method simply uses a global temperature τ_global during the whole training process, i.e., we use τ_global > 1 to encourage exploration. The forward pass is represented mathematically as y_{τ_global} ∼ f_{τ_global}(π(x | w)), where π(x | w) represents a policy neural network that parametrizes a distribution f_{τ_global} over the vocabulary, and y_{τ_global} denotes the word sampled from this tempered word distribution. After sampling, the weights of the neural net are updated using

$$\Delta w_i = \alpha_i (r - b_i)\, \partial \ln f(y_{\tau_{\mathrm{global}}} \mid \pi(x \mid w)) / \partial w_i.$$

Noteworthy is that the actual gradient, ∂ ln f(y_{τ_global} | π(x | w))/∂w_i, depends on the sampled word y_{τ_global} but does not depend directly on the temperature τ. We prefer to find the words that lead to a reward, so that the model can learn quickly from these actions; otherwise, the neural network only learns to avoid the current failure actions. With Single-TPG and τ > 1, the entire vocabulary of a dialogue agent is explored more efficiently than by REINFORCE, because nonpreferred words have a higher probability of being explored. This Single-TPG method is very easy to use and can yield a performance improvement after training, because goal-oriented dialogue optimization benefits from increased exploration. The temperature is initialized with τ = 1, then fine-tuned based on the learning curve on the validation set, and subsequently left fixed.
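As a sketch under the same placeholder setup as the REINFORCE snippet in Section 2: for a softmax output, raising the probabilities to the power 1/τ and renormalizing is equivalent to dividing the logits by τ, which the code below exploits. Note that the gradient still goes through the untempered log-probability of the sampled word, as in the update rule above.

```python
import torch
import torch.nn.functional as F

def single_tpg_step(policy_net, optimizer, x, reward, baseline, tau_global=1.5):
    logits = policy_net(x)
    probs = F.softmax(logits, dim=-1)                   # f, used for the gradient
    tempered = F.softmax(logits / tau_global, dim=-1)   # f_tau, used only for sampling
    y = torch.multinomial(tempered, num_samples=1)      # tau > 1: more exploration
    log_prob = torch.log(probs.gather(-1, y))           # ln f(y_tau_global | pi(x | w))
    loss = -(reward - baseline) * log_prob.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return y
```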

Parallel-TPG: A more advanced version of Single-TPG is Parallel-TPG, which deploys several Single-TPGs concurrently with different temperatures, τ_1, ..., τ_n, and updates the weights based on all generated samples. During the forward pass, multiple copies of the neural net parametrize multiple word distributions. The words are sampled in parallel at the various temperatures; mathematically, y_{τ_1}, ..., y_{τ_n} ∼ f_{τ_1,...,τ_n}(π(x | w)). After sampling, in the backward pass the weights are updated with the sum of gradients:

$$\Delta w_i = \sum_{k} \alpha_i (r - b_i)\, \partial \ln f(y_{\tau_k} \mid \pi(x \mid w)) / \partial w_i,$$

where k ∈ {1, ..., n}. The combined use of higher and lower temperatures ensures both exploration and exploitation at the same time. The sum over the weight updates of the parallel policies gives a more accurate Monte Carlo estimate of ∂E(r | w)/∂w_i, due to the nature of Monte Carlo methods [Robert, 2004]. Thus, compared to Single-TPG, we would argue that Parallel-TPG is more robust and stable, although it needs more computational power. However, these computations can easily be distributed in a parallel fashion using state-of-the-art graphics processing units.
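Parallel-TPG then becomes a small extension of the previous sketch: one sampled word per temperature, with the per-temperature terms summed before a single backward pass (again an illustrative sketch, not the authors' Torch7 code).

```python
import torch
import torch.nn.functional as F

def parallel_tpg_step(policy_net, optimizer, x, reward, baseline, taus=(1.0, 1.5)):
    logits = policy_net(x)
    probs = F.softmax(logits, dim=-1)
    loss = 0.0
    for tau in taus:                                    # one sampled word per temperature
        tempered = F.softmax(logits / tau, dim=-1)
        y = torch.multinomial(tempered, num_samples=1)
        loss = loss - (reward - baseline) * torch.log(probs.gather(-1, y)).sum()
    optimizer.zero_grad()
    loss.backward()                                     # gradient of the summed terms
    optimizer.step()
```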

Dynamic-TPG: As a third variant, we introduce Dynamic-TPG, the most sophisticated approach in the current TPG family. The essential idea is that we use a heuristic function h to assign a temperature τ to the word distribution at each time step t. The temperature is bounded in a predefined range [τ_min, τ_max]. The heuristic function we use here is based upon the term frequency inverse document frequency, tf-idf [Leskovec et al., 2014]. In the context of goal-oriented dialogues, we use the count of each word as the term frequency tf and the total number of generated dialogues during training as the document frequency df. We use the word that has the highest probability of being sampled at the current time step, y*_t, as the input to the heuristic function h. Here, y*_t is the maximizer of the probability mass function f; mathematically, y*_t = argmax f(π(x | w)). We propose that tf-idf(y*_t) approximates the concentration level of the distribution: if the same word is always sampled from a distribution, then that distribution is very concentrated. Too much concentration prevents the model from exploring, so a higher temperature is needed. In order to achieve this effect, the heuristic function is defined as

$$\tau_t^h = h(\text{tf-idf}(y_t^*)) = \tau_{\min} + (\tau_{\max} - \tau_{\min})\, \frac{\text{tf-idf}(y_t^*) - \text{tf-idf}_{\min}}{\text{tf-idf}_{\max} - \text{tf-idf}_{\min}}.$$

With this heuristic, words that occur very often are suppressed by applying a higher temperature to them, making them less likely to be selected in the near future. In the forward pass, a word is sampled using y_{τ_t^h} ∼ f_{τ_t^h}(π(x | w)). In the backward pass, the weights are updated correspondingly, using

$$\Delta w_i = \alpha_i (r - b_i)\, \partial \ln f(y_{\tau_t^h} \mid \pi(x \mid w)) / \partial w_i,$$

where τ_t^h is the temperature calculated by the heuristic function. Compared to Parallel-TPG, the advantage of Dynamic-TPG is that it assigns temperatures more appropriately, without increasing the computational load.
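The heuristic itself is a clipped linear map from tf-idf to temperature. The sketch below implements only that mapping h as we read it; how tf-idf(y*_t) is obtained from the word and dialogue counts follows the description above and is passed in as a plain number, and the default bounds of 0 and 8 are the ones used later in the experiments. The names in the usage comment (`tfidf_of`, `sample_word`) are placeholders from our earlier sketch, not functions from the paper's code.

```python
def dynamic_temperature(tfidf_value, tau_min=0.5, tau_max=1.5,
                        tfidf_min=0.0, tfidf_max=8.0):
    """Heuristic h: map tf-idf(y*_t) of the currently most probable word to a
    temperature in [tau_min, tau_max]; an over-used word gets a higher
    temperature and therefore more exploration at this time step."""
    clipped = min(max(tfidf_value, tfidf_min), tfidf_max)
    return tau_min + (tau_max - tau_min) * (clipped - tfidf_min) / (tfidf_max - tfidf_min)

# Usage at one decoding step, assuming `probs` is the current word distribution:
#   y_star = int(np.argmax(probs))                  # most probable word y*_t
#   tau_t = dynamic_temperature(tfidf_of(y_star))   # tf-idf tracked during training
#   y_t = sample_word(probs, tau=tau_t)             # tempered sampling, as sketched earlier
```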

4 GuessWhat?! Game

We evaluate our concepts on a recent AI testbed called the GuessWhat?! game [de Vries et al., 2017], available at https://guesswhat.ai. The dataset consists of 155k dialogues, including 822k question-answer pairs composed of around 5k different words, about 67k images [Lin et al., 2014], and 134k objects. The game is about visual object discovery through multi-round question answering between different players.

Formally, a GuessWhat?! game is represented by a tuple (I, D, O, o*), where I ∈ R^{H×W} denotes an image of height H and width W; D represents a dialogue composed of M rounds of question-answer pairs (QAs), D = (q_m, a_m)_{m=1}^{M}; O stands for a list of K objects, O = (o_k)_{k=1}^{K}; and o* is the target object. Each question is a sequence of words, q_m = {y_{m,1}, ..., y_{m,N_m}}, with length N_m. The words are taken from a defined vocabulary V, which consists of the words, a start token, and an end token. Each answer is either yes, no, or not applicable, i.e., a_m ∈ {yes, no, n.a.}. For each object o_k, there is a corresponding object category c_k ∈ {1, ..., C} and a pixel-wise segmentation mask S_k ∈ {0, 1}^{H×W}. Finally, we use colon notation (:) to select a subset of a sequence; for instance, (q, a)_{1:m} refers to the first m rounds of QAs in a dialogue.
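For readers who prefer code, the tuple (I, D, O, o*) maps naturally onto a small record type. The sketch below is our own illustration; the field names and types are chosen for clarity and are not taken from the GuessWhat?! codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

@dataclass
class GuessWhatGame:
    image: np.ndarray                 # I, e.g. of shape (H, W, 3)
    dialogue: List[Tuple[str, str]]   # D = ((q_1, a_1), ..., (q_M, a_M)), a_m in {yes, no, n.a.}
    categories: List[int]             # c_k in {1, ..., C} for each object o_k in O
    masks: List[np.ndarray]           # S_k in {0, 1}^(H x W), pixel-wise segmentation masks
    target_index: int                 # index of the target object o*
```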

4.1 Models and Pretraining

Following [Strub et al., 2017], we first train all three models in a supervised fashion.

Oracle: The task of the Oracle is to answer questions regarding the target object. We outline here the simple neural network architecture that achieved the best performance in the study of de Vries et al. [2017] and which we also use in our experiments. The input information is of three modalities, namely the question q, the spatial information x*_spatial, and the category c* of the target object. For encoding the question, de Vries et al. first use a lookup table to learn the embedding of words, then use a one-layer long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] to encode the whole question. For the spatial information, de Vries et al. extract an 8-dimensional vector of the location of the bounding box, [x_min, y_min, x_max, y_max, x_center, y_center, w_box, h_box], where x, y denote the coordinates and w_box, h_box denote the width and height of the bounding box, respectively. De Vries et al. normalize the image width and height so that the coordinates range from -1 to 1, with the origin at the image center. The category embedding of the object is also learned with a lookup table during training. In the last step, de Vries et al. concatenate all three embeddings into one feature vector and feed it into a multilayer perceptron (MLP) with one hidden layer. The softmax output layer predicts the distribution Oracle := p(a | q, c*, x*_spatial) over the three classes yes, no, and not applicable. The model is trained using the negative log-likelihood criterion. The Oracle structure is shown in Fig. 1.

Figure 1: Oracle model
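As an illustration of this architecture, here is a compressed PyTorch re-sketch (the authors' implementation is in Torch7; layer sizes and module names are placeholders):

```python
import torch
import torch.nn as nn

class Oracle(nn.Module):
    """Question LSTM + spatial vector + category embedding -> MLP -> {yes, no, n.a.}."""
    def __init__(self, vocab_size, n_categories, emb_dim=64, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.question_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.category_emb = nn.Embedding(n_categories, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden + emb_dim + 8, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))                    # yes / no / not applicable

    def forward(self, question_tokens, category, spatial):
        # question_tokens: (B, T) word ids; category: (B,) ids; spatial: (B, 8) box features
        _, (h, _) = self.question_lstm(self.word_emb(question_tokens))
        features = torch.cat([h[-1], self.category_emb(category), spatial], dim=-1)
        return self.mlp(features)                    # logits over the three answer classes
```

Training then minimizes the cross-entropy (negative log-likelihood) between these logits and the gold answers.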

Question-Generator: The goal of the Question-Generator (QGen) is to ask the Oracle meaningful questions, q_{m+1}, given the whole image I and the historical question-answer pairs (q, a)_{1:m}. In previous work [Strub et al., 2017], the state transition function was modelled as an LSTM, which was trained using whole dialogues so that the model memorizes the historical QAs. We refer to this as dialogue-level training. We develop a novel QGen architecture using a modified version of the Seq2Seq model [Sutskever et al., 2014]. The modified Seq2Seq model enables question-level training, which means that the model is fed with historical QAs and then learns to produce a new question. Following [Strub et al., 2017], we first encode the whole image into a fixed-size feature vector using the VGG-net [Simonyan and Zisserman, 2014]. The features come from the fc-8 layer of the VGG-net. For processing historical QAs, we use a lookup table to learn the word embeddings, then again use an LSTM encoder to encode the history information into a fixed-size latent representation, and concatenate it with the image representation:

$$s^{\mathrm{enc}}_{m,N_m} = \mathrm{encoder}(\mathrm{LSTM}((q, a)_{1:m}), \mathrm{VGG}(I)).$$

The encoder and decoder are coupled by initializing the decoder state with the last encoder state; mathematically, s^{dec}_{m+1,0} = s^{enc}_{m,N_m}. The LSTM decoder generates each word based on the concatenated representation and the previously generated word (note that the first word is a start token):

$$y_{m+1,n} = \mathrm{decoder}(\mathrm{LSTM}(y_{m+1,n-1}, s^{\mathrm{dec}}_{m+1,n-1})).$$

The decoder shares the same lookup table weights as the encoder. The Seq2Seq model, consisting of the encoder and the decoder, is trained end-to-end to minimize the negative log-likelihood cost. During testing, the decoder receives a start token and the representation from the encoder, and then generates a word at each time step until it encounters a question-mark token, QGen := p(y_{m+1,n} | (q, a)_{1:m}, I). The output is a complete question. After several question-answer rounds, the QGen outputs an end-of-dialogue token and stops asking questions. The overall structure of the QGen model is illustrated in Fig. 2.

Figure 2: Question-Generator model
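A condensed PyTorch sketch of this question-level Seq2Seq QGen is given below. The VGG fc-8 features are assumed to be precomputed, teacher forcing is used as in the supervised phase, and all sizes and names are placeholders rather than the authors' Torch7 code.

```python
import torch
import torch.nn as nn

class QGen(nn.Module):
    """Encode (dialogue history, image features); decode the next question word by word."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256, img_dim=1000):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # lookup table shared by encoder and decoder
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden + img_dim, batch_first=True)
        self.out = nn.Linear(hidden + img_dim, vocab_size)

    def forward(self, history_tokens, img_features, question_tokens):
        # history_tokens: (B, T_h) ids of (q, a)_{1:m}; img_features: (B, img_dim) VGG fc-8;
        # question_tokens: (B, T_q) ids of the target question, starting with the start token.
        _, (h_enc, _) = self.encoder(self.emb(history_tokens))
        s0 = torch.cat([h_enc[-1], img_features], dim=-1)            # concatenated representation
        state = (s0.unsqueeze(0), torch.zeros_like(s0).unsqueeze(0)) # initialize decoder with it
        dec_out, _ = self.decoder(self.emb(question_tokens), state)
        return self.out(dec_out)                                     # word logits per time step
```

At test time the decoder is instead unrolled step by step from the start token, feeding back the generated word, until a question-mark token is produced.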

Guesser: The goal of the Guesser model is to find out which object the Oracle model is referring to, given the complete history of the dialogue and a list of objects in the image, p(o* | (q, a)_{1:M}, x^O_spatial, c^O). The Guesser model has access to the spatial information, x^O_spatial, and the category information, c^O, of the objects in the list. The task of the Guesser model is challenging because it needs to understand the dialogue, focus on the important content, and then guess the object. To accomplish this task, we decided to integrate Memory [Sukhbaatar et al., 2015] and Attention [Bahdanau et al., 2014] modules into the Guesser architecture used in previous work [Strub et al., 2017]. First, we use an LSTM header to process the question-answer pairs of varying length in parallel into multiple fixed-size vectors. Each QA pair is encoded into a fact, Fact_m = LSTM((q, a)_m), and stored in a memory base. Later, we use the sum of the spatial and category embeddings of all objects as a key, Key_1 = MLP(x^O_spatial, c^O), to query the memory and calculate an attention mask over each fact, Attention_1(Fact_m) = Fact_m · Key_1. Next, we use the sum of the attended facts and the first key to calculate a second key. Further, we use the second key to query the memory base again to obtain a more accurate attention. These are the so-called "two hops" of attention in the literature [Sukhbaatar et al., 2015]. Finally, we compare the attended facts with each object embedding in the list using a dot product. The object most similar to these facts is the prediction, Guesser := p(o* | (q, a)_{1:M}, x^O_spatial, c^O). The intention of using the attention module here is to find the most relevant descriptions or facts concerning the candidate objects. We train the whole Guesser network end-to-end using the negative log-likelihood criterion. A more graphical description of the Guesser model is shown in Fig. 3.

Figure 3: Guesser model
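The two-hop attention can be sketched compactly. The snippet below is our reading of the description above in PyTorch; the one-hot category encoding, layer sizes, and the exact wiring of the hops are assumptions, not the authors' Torch7 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Guesser(nn.Module):
    """Two-hop memory attention over QA facts, scored against object embeddings."""
    def __init__(self, vocab_size, n_categories, emb_dim=128, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.fact_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.obj_mlp = nn.Sequential(nn.Linear(8 + n_categories, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))

    def encode_facts(self, qa_tokens):                 # (B, M, T) token ids -> (B, M, hidden)
        B, M, T = qa_tokens.shape
        _, (h, _) = self.fact_lstm(self.word_emb(qa_tokens.reshape(B * M, T)))
        return h[-1].reshape(B, M, -1)                 # one fact vector per QA pair

    def forward(self, qa_tokens, obj_spatial, obj_category_onehot):
        facts = self.encode_facts(qa_tokens)                                       # memory base
        objects = self.obj_mlp(torch.cat([obj_spatial, obj_category_onehot], -1))  # (B, K, hidden)
        key = objects.sum(dim=1)                                                   # first key
        for _ in range(2):                                                         # two attention hops
            scores = torch.bmm(facts, key.unsqueeze(-1)).squeeze(-1)               # fact-key dot products
            attn = F.softmax(scores, dim=-1)
            key = key + (attn.unsqueeze(-1) * facts).sum(dim=1)                    # attended facts + key
        return torch.bmm(objects, key.unsqueeze(-1)).squeeze(-1)                   # per-object logits
```

A softmax over the returned logits gives p(o* | (q, a)_{1:M}, x^O_spatial, c^O), and the whole network is trained with the negative log-likelihood of the target object.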

4.2 Reinforcement Learning

Now we post-train the QGen and the Guesser model with reinforcement learning, keeping the Oracle model fixed. In each game episode, if the models find the correct object, r = 1; otherwise, r = 0.


Next, we assign credit to each action of the QGen and the Guesser models. In the case of the QGen model, we spread the reward uniformly over the sequence of actions in the episode. The baseline function b used here is the running average of the game success rate. Note that the Guesser model has only one action in each episode, namely taking the guess. If the Guesser finds the correct object, it gets an immediate reward and the Guesser's parameters are updated using the REINFORCE rule without a baseline. The QGen is trained using the following four methods.

REINFORCE: The baseline method used here is REINFORCE [Williams, 1992]. During training, in the forward pass the words are sampled with τ = 1, y_{m+1,n} ∼ f(QGen(x | w)). In the backward pass, the weights are updated using REINFORCE as

$$w = w + \alpha (r - b) \nabla_w \ln f(y_{m+1,n} \mid \mathrm{QGen}(x \mid w)).$$

Single-TPG: We use the temperature τ_global = 1.5 during training to encourage exploration; mathematically, y^{τ_global}_{m+1,n} ∼ f_{τ_global}(QGen(x | w)). In the backward pass, the weights are updated using

$$w = w + \alpha (r - b) \nabla_w \ln f(y^{\tau_{\mathrm{global}}}_{m+1,n} \mid \mathrm{QGen}(x \mid w)).$$

Parallel-TPG: For Parallel-TPG, we use two temperatures, τ_1 = 1.0 and τ_2 = 1.5, to encourage exploration. The words are sampled in the forward pass using y^{τ_1}_{m+1,n}, y^{τ_2}_{m+1,n} ∼ f_{τ_1, τ_2}(QGen(x | w)). In the backward pass, the weights are updated using

$$w = w + \sum_{k=1}^{2} \alpha (r - b) \nabla_w \ln f(y^{\tau_k}_{m+1,n} \mid \mathrm{QGen}(x \mid w)).$$

Dynamic-TPG: The last method we evaluate is Dynamic-TPG. We use the heuristic function to calculate the temperature for each word at each time step:

$$\tau^h_{m+1,n} = \tau_{\min} + (\tau_{\max} - \tau_{\min})\, \frac{\text{tf-idf}(y^*_{m+1,n}) - \text{tf-idf}_{\min}}{\text{tf-idf}_{\max} - \text{tf-idf}_{\min}},$$

where we set τ_min = 0.5, τ_max = 1.5, tf-idf_min = 0, and tf-idf_max = 8. After calculating τ^h_{m+1,n}, we substitute its value into the sampling formula at each time step and sample the next word using

$$y^{\tau^h_{m+1,n}}_{m+1,n} \sim f_{\tau^h_{m+1,n}}(\mathrm{QGen}(x \mid w)).$$

In the backward pass, the weights are updated using

$$w = w + \alpha (r - b) \nabla_w \ln f(y^{\tau^h_{m+1,n}}_{m+1,n} \mid \mathrm{QGen}(x \mid w)).$$

For all four methods, we use greedy search in evaluation.

5 Experiment

We first train all the networks in a supervised fashion and then further optimize the QGen and the Guesser model using reinforcement learning. The source code is available at https://github.com/ruizhaogit/GuessWhat-TemperedPolicyGradient and uses Torch7 [Collobert et al., 2011].

#   Method                        Accuracy
1   [Strub et al., 2017]          52.30%
2   [Strub and de Vries, 2017]    60.30%
3   REINFORCE                     69.66%
4   Single-TPG                    69.76%
5   Parallel-TPG                  73.86%
6   Dynamic-TPG                   74.31%

Table 1: Performance comparison of our methods to other methods reported in the literature, after reinforcement learning

5.1 Pretraining

We train all three models with a dropout rate of 0.5 [Srivastava et al., 2014], using the ADAM optimizer [Kingma and Ba, 2014]. We use a learning rate of 0.0001 for the Oracle model and the Guesser model, and a learning rate of 0.001 for the QGen. All models are trained for at most 30 epochs and early-stopped after five epochs without improvement on the validation set. We report the performance on the test set, which consists of images not used in training. As the performance metric for all three models we report the game success rate, which equals the number of successful games divided by the total number of games. Compared to previous works [de Vries et al., 2017; Strub et al., 2017; Strub and de Vries, 2017], after supervised training our models obtain a game success rate of 48.77%, which is about 4% higher than the state-of-the-art method [Strub and de Vries, 2017] with 44.6% accuracy.

5.2 Reinforcement Learning

We first initialize all models with the pre-trained parameters from supervised learning and then post-train the QGen using either REINFORCE or TPG for 80 epochs. We update the parameters using stochastic gradient descent (SGD) with a learning rate of 0.001 and a batch size of 64. In each epoch, we sample each image in the training set once and randomly pick one of its objects as the target. We track the running average of the game success rate and use it directly as the baseline b in REINFORCE. We limit the maximum number of questions to 8 and the maximum number of words to 12. Simultaneously, we train the Guesser model using REINFORCE without a baseline, using SGD with a learning rate of 0.0001. The performance comparison between our baseline (#3) and the methods from the literature (#1 and #2) is shown in Tab. 1.
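For reference, the reinforcement-learning hyperparameters stated in this section can be collected in one place; the dictionary layout below is ours, and only the values come from the text.

```python
rl_config = {
    "qgen":     {"optimizer": "SGD", "learning_rate": 1e-3, "batch_size": 64, "epochs": 80},
    "guesser":  {"optimizer": "SGD", "learning_rate": 1e-4, "baseline": None},  # REINFORCE, no baseline
    "baseline": "running average of the game success rate",                     # used for the QGen updates
    "dialogue": {"max_questions": 8, "max_words_per_question": 12},
}
```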

REINFORCE Baseline: From Tab. 1, we see that our models trained with REINFORCE (#3) are about 9% better than the state-of-the-art methods (#1 and #2). The improvements are due to advanced mechanisms and techniques such as the Seq2Seq structure in the QGen, the memory and attention mechanisms in the Guesser, and the training of the Guesser model with reinforcement learning. One important difference is that our QGen model is trained at the question level. This means that the model first learns to query meaningfully, step by step, and eventually learns to conduct a meaningful dialogue. Compared to directly learning to manage a strategic conversation, this bottom-up training procedure helps the model absorb knowledge, because it breaks large tasks down into smaller, more manageable pieces. This makes the learning for the QGen much easier. In the remainder of the section, we use our models, boosted with memory network, attention, and Seq2Seq, trained with REINFORCE as a strong baseline, and analyse the performance improvements achieved by the TPGs.

Example 1, Policy Gradient (Status: Failure):
Is it in left? No / Is it in front? No / Is it in right? Yes / Is it in middle? Yes / Is it person? No / Is it ball? No / Is it bat? No / Is it car? Yes

Example 1, Tempered Policy Gradient (Status: Success):
Is it a person? No / Is it a vehicle? Yes / Is it a truck? Yes / Is it in front of photo? No / In the left half? No / In the middle of photo? Yes / Is it to the right photo? Yes / Is it in the middle of photo? Yes

Example 2, Policy Gradient (Status: Failure):
Is it in left? No / Is it in front? Yes / Is it in right? No / Is it in middle? Yes / Is it person? No / Is it giraffe? Yes / Is in middle? Yes / Is in middle? Yes

Example 2, Tempered Policy Gradient (Status: Success):
Is it a giraffe? Yes / In front of photo? Yes / In the left half? Yes / Is it in the middle of photo? Yes / Is it to the left of photo? Yes / Is it to the right photo? No / In the left in photo? No / In the middle of photo? Yes

Table 2: Samples generated by our improved models using REINFORCE ("Policy Gradient") and Dynamic-TPG ("Tempered Policy Gradient"). The green bounding boxes highlight the target objects; the red boxes highlight the wrong guesses.

From Tab. 1, we see that, compared to REINFORCE (#3), Single-TPG (#4) with τ_global = 1.5 achieves a comparable performance. With two different temperatures, τ_1 = 1.0 and τ_2 = 1.5, Parallel-TPG (#5) achieves an improvement of approximately 4%, although it requires more computational resources. Compared to Parallel-TPG, Dynamic-TPG uses only the same computational power as REINFORCE and still gives a larger improvement by using a dynamic temperature, τ_t^h ∈ [0.5, 1.5]. Overall, the best model is Dynamic-TPG (#6), which gives a 4.65% improvement over our strong baseline. Here, we have shown that our proposed methods contribute orthogonally, in the sense that they further improve models already boosted with advanced modules such as memory networks, attention, and Seq2Seq.

TPG Dialogue Samples: The generated dialogue samples in Tab. 2 give some interesting insights into why TPG methods produce better results. First of all, the sentences generated by TPG-trained models are on average longer and use slightly more complex structures, such as "Is it in the middle of photo?" instead of the simpler "Is it in middle?". Secondly, TPGs enable the models to explore better and comprehend more words. For example, in the first task (upper half of Tab. 2), both models ask about the category. The REINFORCE-trained model can only ask with the single word "car" to query about the vehicle category. In contrast, the TPG-trained model first asks about a more general category with the word "vehicle" and follows up with a more specific category, "truck". These two words, "vehicle" and "truck", give much more information than the single word "car" and help the Guesser model identify the truck among many cars. Lastly, similar to the category case, the models trained with TPG can first ask about a larger spatial range of the object and follow up with a smaller range. In the second task (lower half of Tab. 2), we see that the TPG-trained model first asks "In the left half?", which refers to all three giraffes in the left half, and the answer is "Yes". Then it asks "Is it to the left of photo?", which refers to the second giraffe from the left, and the answer is "Yes". Eventually the QGen asks "In the left in photo?", which refers to the leftmost giraffe, and the answer is "No". Such specific questions about locations are not learned using REINFORCE: the REINFORCE-trained model can only ask a similar question with the word "left". In this task, there are many giraffes in the left part of the image, and the top-down spatial questions help the Guesser model find the correct giraffe. To summarize, the TPG-trained models use longer and more informative sentences than the REINFORCE-trained models.

6 Conclusion

Our paper makes two contributions. Firstly, by extending existing models with Seq2Seq and Memory Networks, we could improve the performance of a goal-oriented dialogue system by 9%. Secondly, we introduced TPG, a novel class of temperature-based policy gradient approaches. TPGs boosted the performance of the goal-oriented dialogue system by another 4.7%. Among the three TPGs, Dynamic-TPG gave the best performance, helping the agent comprehend more words and produce more meaningful questions. TPG is a generic strategy to encourage word exploration on top of policy gradients and can be applied to any text-generating agent.


References

[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[Bordes and Weston, 2017] Antoine Bordes and Jason Weston. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations (ICLR), 2017.

[Carmel and Markovitch, 1999] David Carmel and Shaul Markovitch. Exploration strategies for model-based learning in multi-agent systems: Exploration strategies. Autonomous Agents and Multi-Agent Systems, 2(2):141–172, 1999.

[Collobert et al., 2011] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[Das et al., 2017] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision (ICCV), 2017.

[de Vries et al., 2017] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[Dhingra et al., 2016] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777, 2016.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.

[Karpathy and Fei-Fei, 2015] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Leskovec et al., 2014] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

[Lewis et al., 2017] Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? End-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125, 2017.

[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[Liu et al., 2017] Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.

[Nachum et al., 2016] Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans. Improving policy gradient by exploring under-appreciated rewards. arXiv preprint arXiv:1611.09321, 2016.

[Robert, 2004] Christian P. Robert. Monte Carlo Methods. Wiley Online Library, 2004.

[Silver et al., 2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Srivastava et al., 2014] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[Strub and de Vries, 2017] Florian Strub and Harm de Vries. GuessWhat?! models. https://github.com/GuessWhatGame/guesswhat/, 2017.

[Strub et al., 2017] Florian Strub, Harm de Vries, Jérémie Mary, Bilal Piot, Aaron C. Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. In International Joint Conference on Artificial Intelligence (IJCAI), 2017.

[Sukhbaatar et al., 2015] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.

[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[Thrun, 1992] Sebastian B. Thrun. Efficient exploration in reinforcement learning. 1992.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.