
LEARNING GOAL-ORIENTED VISUAL DIALOG VIA TEMPERED POLICY GRADIENT

    Rui Zhao, Volker Tresp

    Ludwig Maximilian University, Oettingenstr. 67, 80538 Munich, Germany

    Siemens AG, Corporate Technology, Otto-Hahn-Ring 6, 81739 Munich, Germany

    ABSTRACT

Learning goal-oriented dialogues by means of deep reinforcement learning has recently become a popular research topic. However, commonly used policy-based dialogue agents often end up focusing on simple utterances and suboptimal policies. To mitigate this problem, we propose a class of novel temperature-based extensions for policy gradient methods, referred to as Tempered Policy Gradients (TPGs). On a recent AI testbed, the GuessWhat?! game, we achieve significant improvements with two innovations. The first is an extension of the state-of-the-art solutions with Seq2Seq and Memory Network structures, which leads to an improvement of 7%. The second is the application of our newly developed TPG methods, which improves the performance by an additional 5% and, even more importantly, helps produce more convincing utterances.

Index Terms— Goal-Oriented Dialog System, Deep Reinforcement Learning, Recurrent Neural Network

    1. INTRODUCTION

In recent years, deep learning has shown convincing performance in various areas such as image recognition, speech recognition, and natural language processing (NLP). Deep neural nets are capable of learning complex dependencies from huge amounts of data and their human-generated annotations in a supervised way. In contrast, reinforcement learning agents [2] can learn directly from their interactions with the environment without any supervision and surpass human performance in several domains, for instance in the game of Go [3] as well as many computer games [4]. In this paper we are concerned with the application of both approaches to goal-oriented dialogue systems [5, 6, 7, 8, 9, 10, 11, 12, 13, 14], a problem that has recently caught the attention of machine learning researchers. De Vries et al. [6] have proposed a visually grounded object-guessing game called GuessWhat?! as an AI testbed. Das et al. [7] formulated a visual dialogue system in which two chatbots ask and answer questions to identify a specific image within a group of images. More practically, dialogue agents have been applied

    This paper is an extended version of the IJCAI workshop paper [1].

to negotiate a deal [12] and to access information from knowledge bases [13]. The essential idea in these systems is to train different dialogue agents to accomplish the tasks. In those papers, the agents have been trained with policy gradients, i.e., REINFORCE [15].

In order to improve the exploration quality of policy gradients, we present three instances of temperature-based methods. The first is a single-temperature approach that is very easy to apply. The second is a parallel approach with multiple temperature policies running concurrently; it is more demanding on computational resources but results in more stable solutions. The third is a temperature-policy approach that dynamically adjusts the temperature for each action at each time step, based on action frequencies. This dynamic method is more sophisticated and proves more efficient in the experiments, where all three methods demonstrate better exploration strategies than the plain policy gradient.

We demonstrate our approaches using a real-world dataset called GuessWhat?!. The GuessWhat?! game [6] is a visual object-discovery game between two players, the Oracle and the Questioner. The Questioner tries to identify an object by asking the Oracle questions. The original works [6, 8] first proposed supervised learning to simulate and optimize the game. Strub et al. [8] showed that the performance could be improved by applying plain policy-gradient reinforcement learning, which maximizes the game success rate, as a second processing step. Building on these previous works, we propose two network-architecture extensions. We utilize a Seq2Seq model [16] to process the image along with the historical dialogues for question generation. For the guessing task, we develop a Memory Network [17] with an Attention Mechanism [18] to process the generated question-answer pairs. We first train these two models using the plain policy gradient and use them as our baselines. Subsequently, we train the models with our new TPG methods and compare their performance with the baselines. We show that the TPG method is compatible with state-of-the-art architectures such as Seq2Seq and Memory Networks and contributes orthogonally to these advanced neural architectures. To the best of our knowledge, the presented work is the first to propose temperature-based policy gradient methods to balance exploration and exploitation in the field of goal-oriented dialogue systems. We demonstrate the superior performance of our TPG methods by applying them to the GuessWhat?! game.

    2. PRELIMINARIES

In our notation, we use $x$ to denote the input to a policy network $\pi$, and $x_i$ to denote the $i$-th element of the input vector. Similarly, $w$ denotes the weight vector of $\pi$, and $w_i$ denotes its $i$-th element. The output $y$ is a multinoulli random variable with $N$ states that follows a probability mass function $f(y = n \mid \pi(x \mid w))$, where $\sum_{n=1}^{N} f(y = n \mid \pi(x \mid w)) = 1$ and $f(\cdot) \geq 0$. In a nutshell, a policy network parametrizes a probabilistic unit that produces the sampled output; mathematically, $y \sim f(\pi(x \mid w))$.

Typically, the expected value of the accumulated reward, i.e., the return, conditioned on the policy network parameters, $E(r \mid w)$, is used as the objective. Here, $E$ denotes the expectation operator, $r$ the accumulated reward signal, and $w$ the network weight vector. The objective of reinforcement learning is to update the weights in a way that maximizes the expected return at each trial. In particular, the REINFORCE updating rule is $\Delta w_i = \alpha_i (r - b_i) e_i$, where $\Delta w_i$ denotes the adjustment of weight $w_i$, $\alpha_i$ is a nonnegative learning-rate factor, and $b_i$ is a reinforcement baseline. The characteristic eligibility of $w_i$ is defined as $e_i = (\partial f / \partial w_i)/f = \partial \ln f / \partial w_i$. Williams [15] proved that the updating quantity, $(r - b_i)\,\partial \ln f / \partial w_i$, is an unbiased estimate of $\partial E(r \mid w) / \partial w_i$.
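To make the update rule concrete, the following is a minimal sketch of a single REINFORCE step with a baseline for a categorical policy, written in PyTorch. The network, sizes, reward, and baseline values are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code) of one REINFORCE update with a
# baseline for a categorical policy over a vocabulary; sizes are made up.
vocab_size, hidden = 100, 32
policy = nn.Linear(hidden, vocab_size)                  # pi(x | w)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

x = torch.randn(hidden)                                 # input features
dist = torch.distributions.Categorical(logits=policy(x))
y = dist.sample()                                       # y ~ f(pi(x | w))

r, b = 1.0, 0.6                                         # return and running baseline
loss = -(r - b) * dist.log_prob(y)                      # d loss/dw = -(r - b) d ln f/dw
optimizer.zero_grad()
loss.backward()
optimizer.step()
```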

    3. TEMPERED POLICY GRADIENT

In order to improve the exploration quality of REINFORCE in the task of optimizing policy-based dialogue agents, we attempt to find the optimal compromise between exploration and exploitation. In TPGs we introduce a parameter $\tau$, the sampling temperature of the probabilistic output unit, which allows us to explicitly control the strength of the exploration.

    3.1. Exploration and Exploitation

The trade-off between exploration and exploitation is one of the great challenges in reinforcement learning [2]. To obtain a high reward, an agent must exploit the actions that have already proved effective in earning rewards. However, to discover such actions, the agent must also try actions that appear suboptimal, in order to explore the action space. In a stochastic task like text generation, each action, i.e., a word, must be tried many times to find out whether it is a reliable choice or not. The exploration-exploitation dilemma has been intensively studied over many decades [19, 20, 21]. Finding the balance between exploration and exploitation is considered crucial for the success of reinforcement learning [22].

    3.2. Temperature Sampling

In text generation, it is well known that the simple trick of temperature adjustment is sufficient to shift the language model to be more conservative or more diversified [23]. In order to control the trade-off between exploration and exploitation, we use a temperature parameter $\tau \geq 0$ to control the sampling. The output probability of each word is transformed by a temperature function as:

$$f_\tau(y = n \mid \pi(x \mid w)) = \frac{f(y = n \mid \pi(x \mid w))^{1/\tau}}{\sum_{m=1}^{N} f(y = m \mid \pi(x \mid w))^{1/\tau}}.$$

We use the notation $f_\tau$ to denote a probability mass function $f$ that has been transformed by a temperature function with temperature $\tau$. When the temperature is high, $\tau > 1$, the distribution becomes more uniform; when the temperature is low, $\tau < 1$, the distribution becomes more peaked.
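As a quick illustration of this transform (not tied to the paper's models), the following sketch tempers a made-up word distribution and samples from it; the probabilities are arbitrary.

```python
import numpy as np

# Illustrative sketch: transforming a word distribution with a sampling
# temperature tau, as in Section 3.2.
def temper(probs, tau):
    """Return f_tau, the tempered version of the distribution f."""
    p = np.asarray(probs, dtype=np.float64) ** (1.0 / tau)
    return p / p.sum()

f = np.array([0.70, 0.20, 0.05, 0.05])           # a peaked word distribution
print(temper(f, tau=1.5))                         # more uniform -> more exploration
print(temper(f, tau=0.5))                         # more peaked  -> more exploitation

rng = np.random.default_rng(0)
word = rng.choice(len(f), p=temper(f, tau=1.5))   # y_tau ~ f_tau(pi(x | w))
```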

    3.3. Tempered Policy Gradient Methods

Here, we introduce three instances of TPGs in the domain of goal-oriented dialogues: single, parallel, and dynamic tempered policy gradient methods.

Single-TPG: The Single-TPG method simply uses a global temperature $\tau_{\text{global}}$ during the whole training process, i.e., we use $\tau_{\text{global}} > 1$ to encourage exploration. The forward pass is represented mathematically as $y^{\tau_{\text{global}}} \sim f_{\tau_{\text{global}}}(\pi(x \mid w))$, where $\pi(x \mid w)$ represents a policy neural network that parametrizes a distribution $f_{\tau_{\text{global}}}$ over the vocabulary, and $y^{\tau_{\text{global}}}$ is the word sampled from this tempered word distribution. After sampling, the weights of the neural net are updated using

$$\Delta w_i = \alpha_i (r - b_i)\, \partial \ln f(y^{\tau_{\text{global}}} \mid \pi(x \mid w)) / \partial w_i.$$

Note that the actual gradient, $\partial \ln f(y^{\tau_{\text{global}}} \mid \pi(x \mid w)) / \partial w_i$, depends on the sampled word, $y^{\tau_{\text{global}}}$, but does not depend directly on the temperature $\tau$. With Single-TPG and $\tau > 1$, the entire vocabulary of a dialogue agent is explored more efficiently than with REINFORCE, because non-preferred words have a higher probability of being explored.
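A minimal sketch of one Single-TPG step follows, under the same toy assumptions as above: the word is sampled from the tempered distribution, but the gradient is taken through the untempered log-probability. For a softmax policy, dividing the logits by the temperature is equivalent to the transform of Section 3.2.

```python
import torch
import torch.nn as nn

# Sketch of a Single-TPG step (assumed toy setup): sample from f_tau,
# back-propagate ln f of the sampled word (not ln f_tau).
vocab_size, hidden, tau_global = 100, 32, 1.5
policy = nn.Linear(hidden, vocab_size)
x, r, b = torch.randn(hidden), 1.0, 0.6

logits = policy(x)
y = torch.distributions.Categorical(logits=logits / tau_global).sample()
plain = torch.distributions.Categorical(logits=logits)
loss = -(r - b) * plain.log_prob(y)          # gradient of ln f, not ln f_tau
loss.backward()
```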

Parallel-TPG: A more advanced version of Single-TPG is Parallel-TPG, which deploys several Single-TPGs concurrently with different temperatures, $\tau_1, \ldots, \tau_n$, and updates the weights based on all generated samples. During the forward pass, multiple copies of the neural net parametrize multiple word distributions, and the words are sampled in parallel at the various temperatures, $y^{\tau_1}, \ldots, y^{\tau_n} \sim f_{\tau_1, \ldots, \tau_n}(\pi(x \mid w))$. After sampling, in the backward pass the weights are updated with the sum of gradients. The formula is given by

$$\Delta w_i = \sum_{k} \alpha_i (r - b_i)\, \partial \ln f(y^{\tau_k} \mid \pi(x \mid w)) / \partial w_i,$$

where $k \in \{1, \ldots, n\}$. The combined use of higher and lower temperatures ensures both exploration and exploitation at the same time. The sum over the weight updates of the parallel policies gives a more accurate Monte Carlo estimate of $\partial E(r \mid w) / \partial w_i$, due to the nature of Monte Carlo methods [24]. Thus, compared to Single-TPG, we argue that Parallel-TPG is more robust and stable, although it needs more computational power. These computations can, however, easily be distributed in a parallel fashion using state-of-the-art graphics processing units.
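The following sketch of a Parallel-TPG step (same toy assumptions as before) sums the REINFORCE terms over words sampled at two temperatures.

```python
import torch
import torch.nn as nn

# Sketch of one Parallel-TPG step (assumed toy setup): sample the policy at
# several temperatures and sum the REINFORCE gradients of all samples.
vocab_size, hidden = 100, 32
temperatures = [1.0, 1.5]
policy = nn.Linear(hidden, vocab_size)
x, r, b = torch.randn(hidden), 1.0, 0.6

logits = policy(x)
plain = torch.distributions.Categorical(logits=logits)
loss = torch.zeros(())
for tau in temperatures:
    y = torch.distributions.Categorical(logits=logits / tau).sample()
    loss = loss - (r - b) * plain.log_prob(y)    # sum over the parallel samples
loss.backward()
```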

Dynamic-TPG: As a third variant, we introduce Dynamic-TPG, the most sophisticated approach in the current TPG family. The essential idea is to use a heuristic function $h$ to assign the temperature $\tau$ to the word distribution at each time step $t$. The temperature is bounded in a predefined range $[\tau_{\min}, \tau_{\max}]$. The heuristic function we use here is based on the term frequency inverse document frequency, tf-idf [25]. In the context of goal-oriented dialogues, we use the count of each generated word as the term frequency tf and the total number of generated dialogues during training as the document frequency df. The word with the highest probability of being sampled at the current time step, $y^*_t$, is the input to the heuristic function $h$; here, $y^*_t$ is the maximizer of the probability mass function $f$, defined as $y^*_t = \arg\max f(\pi(x \mid w))$. We propose that $\text{tf-idf}(y^*_t)$ approximates the concentration level of the distribution, in the sense that if the same word is always sampled from a distribution, then the distribution is very concentrated. Too much concentration prevents the model from exploring, so a higher temperature is needed. In order to achieve this effect, the heuristic function is defined as

$$\tau^h_t = h(\text{tf-idf}(y^*_t)) = \tau_{\min} + (\tau_{\max} - \tau_{\min})\, \frac{\text{tf-idf}(y^*_t) - \text{tf-idf}_{\min}}{\text{tf-idf}_{\max} - \text{tf-idf}_{\min}}.$$

With this heuristic, words that occur very often are suppressed by applying a higher temperature to them, making them less likely to be selected in the near future. In the forward pass, a word is sampled using $y^{\tau^h_t} \sim f_{\tau^h_t}(\pi(x \mid w))$. In the backward pass, the weights are updated correspondingly:

$$\Delta w_i = \alpha_i (r - b_i)\, \partial \ln f(y^{\tau^h_t} \mid \pi(x \mid w)) / \partial w_i,$$

where $\tau^h_t$ is the temperature calculated by the heuristic function. Compared to Parallel-TPG, the advantage of Dynamic-TPG is that it assigns temperatures more appropriately, without increasing the computational load.
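A sketch of the temperature heuristic is given below. The paper specifies the word count as tf and the number of generated dialogues as df but not the exact tf-idf formula, so treating tf-idf$(y^*)$ as the per-dialogue average count is our own reading; the clipping range and temperature bounds are the values reported later in Section 4.2.

```python
from collections import Counter

# Illustrative sketch of the Dynamic-TPG temperature heuristic; the tf-idf
# bookkeeping (tf / df as average occurrences per dialogue) is an assumption.
tau_min, tau_max = 0.5, 1.5
tfidf_min, tfidf_max = 0.0, 8.0
word_counts = Counter()        # tf: running count of each generated word
num_dialogues = 1              # df: total number of generated dialogues so far

def dynamic_temperature(argmax_word):
    tfidf = word_counts[argmax_word] / num_dialogues        # tf / df
    tfidf = min(max(tfidf, tfidf_min), tfidf_max)           # clip to [0, 8]
    # Map tf-idf linearly into [tau_min, tau_max]: frequently generated words
    # get a higher temperature and are therefore suppressed.
    return tau_min + (tau_max - tau_min) * (tfidf - tfidf_min) / (tfidf_max - tfidf_min)
```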

    4. GUESSWHAT?! GAME

We evaluate our methods using a recent testbed for AI called the GuessWhat?! game [6], available at https://guesswhat.ai. The dataset consists of 155k dialogues, including 822k question-answer pairs composed from a vocabulary of around 5k words, about 67k images [26], and 134k objects.

Fig. 1: Oracle model

The game is about visual object discovery through multi-round question answering between the two players.

Formally, a GuessWhat?! game is represented by a tuple $(I, D, O, o^*)$, where $I \in \mathbb{R}^{H \times W}$ denotes an image of height $H$ and width $W$; $D$ represents a dialogue composed of $M$ rounds of question-answer pairs (QAs), $D = (q_m, a_m)_{m=1}^{M}$; $O$ stands for a list of $K$ objects, $O = (o_k)_{k=1}^{K}$; and $o^*$ is the target object. Each question is a sequence of words, $q_m = \{y_{m,1}, \ldots, y_{m,N_m}\}$, with length $N_m$. The words are taken from a defined vocabulary $V$, which consists of the words plus a start token and an end token. Each answer is either yes, no, or not applicable, i.e., $a_m \in \{\text{yes}, \text{no}, \text{n.a.}\}$. For each object $o_k$, there is a corresponding object category $c_k \in \{1, \ldots, C\}$ and a pixel-wise segmentation mask $S_k \in \{0, 1\}^{H \times W}$. Finally, we use colon notation $(:)$ to select a subset of a sequence; for instance, $(q, a)_{1:m}$ refers to the first $m$ rounds of QAs in a dialogue.
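For concreteness, one possible in-memory layout for a single game instance is sketched below; the field names and types are our own assumptions and do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

# Illustrative layout of one GuessWhat?! game, mirroring the tuple (I, D, O, o*).
@dataclass
class GuessWhatGame:
    image: np.ndarray                  # I, of shape (H, W, 3)
    dialogue: List[Tuple[str, str]]    # D = (q_m, a_m), a_m in {"yes", "no", "n.a."}
    object_categories: List[int]       # c_k for each object o_k in O
    object_masks: List[np.ndarray]     # S_k, binary masks of shape (H, W)
    target_index: int                  # position of the target object o* in O
```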

    4.1. Models and Pretraining

Following [8], we first train all three models in a supervised fashion.

Oracle: The task of the Oracle is to answer questions about the target object. We outline here the simple neural network architecture that achieved the best performance in the study of [6], which we also use in our experiments. The input information is of three modalities: the question $q$, the spatial information $x^*_{\text{spatial}}$, and the category $c^*$ of the target object. For encoding the question, de Vries et al. first use a lookup table to learn word embeddings and then a one-layer long short-term memory (LSTM) network [27] to encode the whole question. For the spatial information, they extract an 8-dimensional vector describing the bounding box, $[x_{\min}, y_{\min}, x_{\max}, y_{\max}, x_{\text{center}}, y_{\text{center}}, w_{\text{box}}, h_{\text{box}}]$, where $x$ and $y$ denote coordinates and $w_{\text{box}}$ and $h_{\text{box}}$ denote the width and height of the bounding box, respectively. The image width and height are normalized so that the coordinates range from -1 to 1, with the origin at the image center. The category embedding of the object is also learned with a lookup table during training. In the last step, all three embeddings are concatenated into one feature vector and fed into a multilayer perceptron (MLP) with one hidden layer. The softmax output layer predicts the distribution $\text{Oracle} := p(a \mid q, c^*, x^*_{\text{spatial}})$ over the three classes, including no, yes, and not applicable. The model is trained using the negative log-likelihood criterion. The Oracle structure is shown in Fig. 1.
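A minimal sketch of this Oracle architecture follows; all layer sizes are assumptions, and the real implementation may differ in details such as the embedding and hidden dimensions.

```python
import torch
import torch.nn as nn

# Sketch of the Oracle: LSTM question encoder + spatial + category features,
# classified into {yes, no, n.a.} by a one-hidden-layer MLP. Sizes are assumed.
class Oracle(nn.Module):
    def __init__(self, vocab_size=5000, num_categories=90, emb=64, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb)
        self.question_lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.category_emb = nn.Embedding(num_categories, emb)
        self.mlp = nn.Sequential(nn.Linear(hidden + emb + 8, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, question_tokens, spatial_8d, category_id):
        _, (h, _) = self.question_lstm(self.word_emb(question_tokens))
        feats = torch.cat([h[-1], self.category_emb(category_id), spatial_8d], dim=-1)
        return self.mlp(feats)          # logits over {yes, no, n.a.}
```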

Fig. 2: Question-Generator model

Question-Generator: The goal of the Question-Generator (QGen) is to ask the Oracle meaningful questions, $q_{m+1}$, given the whole image, $I$, and the historical question-answer pairs, $(q, a)_{1:m}$. In previous work [8], the state transition function was modelled as an LSTM, which was trained using whole dialogues so that the model memorizes the historical QAs; we refer to this as dialogue-level training. We develop a novel QGen architecture using a modified version of the Seq2Seq model [16]. The modified Seq2Seq model enables question-level training, which means that the model is fed with historical QAs and then learns to produce a new question. Following [8], we first encode the whole image into a fixed-size feature vector using the VGG-net [28]; the features come from the fc-8 layer of the VGG-net. For processing historical QAs, we use a lookup table to learn the word embeddings, then again use an LSTM encoder to encode the history information into a fixed-size latent representation, and concatenate it with the image representation:

$$s^{\text{enc}}_{m,N_m} = \text{encoder}(\text{LSTM}((q, a)_{1:m}), \text{VGG}(I)).$$

The encoder and decoder are coupled by initializing the decoder state with the last encoder state, $s^{\text{dec}}_{m+1,0} = s^{\text{enc}}_{m,N_m}$. The LSTM decoder generates each word based on the concatenated representation and the previously generated word (the first word is a start token):

$$y_{m+1,n} = \text{decoder}(\text{LSTM}(y_{m+1,n-1}, s^{\text{dec}}_{m+1,n-1})).$$

The decoder shares the lookup table weights with the encoder. The Seq2Seq model, consisting of the encoder and the decoder, is trained end-to-end to minimize the negative log-likelihood cost. During testing, the decoder receives a start token and the representation from the encoder, and then generates a word at each time step until it produces a question-mark token, $\text{QGen} := p(y_{m+1,n} \mid (q, a)_{1:m}, I)$. The output is a complete question. After several question-answer rounds, the QGen outputs an end-of-dialogue token and stops asking questions. The overall structure of the QGen model is illustrated in Fig. 2.
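The sketch below mirrors this Seq2Seq QGen at a high level; the dimensions, the exact way the image features enter the decoder state, and all layer sizes are assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn

# Sketch of the Seq2Seq QGen: an LSTM encoder over the dialogue history,
# concatenated with VGG image features, initializes an LSTM decoder that
# emits the next question word by word. Sizes are assumed.
class QGen(nn.Module):
    def __init__(self, vocab_size=5000, emb=64, hidden=256, img_feat=1000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb)     # shared by encoder/decoder
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden + img_feat, batch_first=True)
        self.out = nn.Linear(hidden + img_feat, vocab_size)

    def forward(self, history_tokens, img_features, question_in):
        _, (h_enc, _) = self.encoder(self.word_emb(history_tokens))
        s0 = torch.cat([h_enc[-1], img_features], dim=-1)  # s_dec_{m+1,0}
        c0 = torch.zeros_like(s0)
        out, _ = self.decoder(self.word_emb(question_in),
                              (s0.unsqueeze(0), c0.unsqueeze(0)))
        return self.out(out)                               # next-word logits
```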

Guesser: The goal of the Guesser model is to find out which object the Oracle model is referring to, given the complete history of the dialogue and a list of objects in the image, $p(o^* \mid (q, a)_{1:M}, x^O_{\text{spatial}}, c^O)$. The Guesser model has access to the spatial information, $x^O_{\text{spatial}}$, and the category information, $c^O$, of the objects in the list. The task of the Guesser model is challenging because it needs to understand the dialogue, focus on the important content, and then guess the object. To accomplish this task, we integrate Memory [17] and Attention [18] modules into the Guesser architecture used in previous work [8]. First, we use an LSTM header to process the question-answer pairs of varying lengths in parallel into multiple fixed-size vectors. Each QA pair is encoded into a fact, $\text{Fact}_m = \text{LSTM}((q, a)_m)$, and stored in a memory base. Next, we use the sum of the spatial and category embeddings of all objects as a key, $\text{Key}_1 = \text{MLP}(x^O_{\text{spatial}}, c^O)$, to query the memory and calculate an attention mask over each fact, $\text{Attention}_1(\text{Fact}_m) = \text{Fact}_m \cdot \text{Key}_1$. We then use the sum of the attended facts and the first key to compute a second key, and use this second key to query the memory base again to obtain a more accurate attention; these are the so-called "two hops" of attention in the literature [17]. Finally, we compare the attended facts with each object embedding in the list using a dot product. The object most similar to these facts is the prediction, $\text{Guesser} := p(o^* \mid (q, a)_{1:M}, x^O_{\text{spatial}}, c^O)$. The intention of using the attention module is to find the most relevant descriptions or facts concerning the candidate objects. We train the whole Guesser network end-to-end using the negative log-likelihood criterion. A more graphical description of the Guesser model is shown in Fig. 3.

Fig. 3: Guesser model
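A rough sketch of this memory-and-attention computation is given below; it is our own simplification (objects are represented by a one-hot category concatenated with the 8-dimensional spatial vector, and the key update across the two hops is schematic), not the authors' exact model.

```python
import torch
import torch.nn as nn

# Sketch of the memory/attention Guesser: QA facts are attended with a key built
# from the object embeddings, refined over two hops, and the attended summary is
# scored against each object with a dot product. Sizes are assumed.
class Guesser(nn.Module):
    def __init__(self, vocab_size=5000, emb=64, hidden=128, num_categories=90):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb)
        self.fact_lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.obj_mlp = nn.Sequential(nn.Linear(num_categories + 8, hidden), nn.ReLU())

    def forward(self, qa_tokens, obj_feats):
        # qa_tokens: (M, T) token ids of the M QA pairs;
        # obj_feats: (K, num_categories + 8), one-hot category + spatial vector.
        _, (h, _) = self.fact_lstm(self.word_emb(qa_tokens))
        facts = h[-1]                                  # (M, hidden) memory base
        objects = self.obj_mlp(obj_feats)              # (K, hidden) object embeddings
        key = objects.sum(dim=0)                       # Key_1
        attended = facts.new_zeros(facts.size(1))
        for _ in range(2):                             # two hops of attention
            attn = torch.softmax(facts @ key, dim=0)   # attention mask over facts
            attended = attn @ facts                    # weighted sum of facts
            key = key + attended                       # refined key for the next hop
        return objects @ attended                      # scores; softmax over objects
```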

    4.2. Reinforcement Learning

Now, we post-train the QGen and the Guesser model with reinforcement learning, keeping the Oracle model fixed. In each game episode, the models receive a reward of $r = 1$ when they find the correct object and $r = 0$ otherwise.

Next, we assign credit to each action of the QGen and the Guesser models. In the case of the QGen model, we spread the reward uniformly over the sequence of actions in the episode. The baseline function, $b$, used here is the running average of the game success rate. The Guesser model has only one action in each episode, namely taking the guess; if the Guesser finds the correct object, it receives an immediate reward and its parameters are updated using the REINFORCE rule without a baseline.
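A small sketch of this credit assignment is shown below; the helper names and the momentum constant for the running average are our own assumptions.

```python
# Sketch of the credit assignment described above (helper names are assumed):
# every generated word receives the same (reward - baseline) credit, the baseline
# is a running average of the game success rate, and the Guesser gets a single
# baseline-free REINFORCE term per episode.
def qgen_loss(word_log_probs, reward, baseline):
    return -sum((reward - baseline) * lp for lp in word_log_probs)

def guesser_loss(guess_log_prob, reward):
    return -reward * guess_log_prob

def update_baseline(running_success, game_won, momentum=0.99):
    # The momentum value is an assumption; the paper only states "running average".
    return momentum * running_success + (1.0 - momentum) * float(game_won)
```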

The QGen is trained using the following four methods.

REINFORCE: The baseline method used here is REINFORCE [15]. During training, in the forward pass the words are sampled with $\tau = 1$, $y_{m+1,n} \sim f(\text{QGen}(x \mid w))$. In the backward pass, the weights are updated using REINFORCE, $w = w + \alpha (r - b) \nabla_w \ln f(y_{m+1,n} \mid \text{QGen}(x \mid w))$.

Single-TPG: We use the temperature $\tau_{\text{global}} = 1.5$ during training to encourage exploration, $y^{\tau_{\text{global}}}_{m+1,n} \sim f_{\tau_{\text{global}}}(\text{QGen}(x \mid w))$. In the backward pass, the weights are updated using $w = w + \alpha (r - b) \nabla_w \ln f(y^{\tau_{\text{global}}}_{m+1,n} \mid \text{QGen}(x \mid w))$.

Parallel-TPG: For Parallel-TPG, we use two temperatures, $\tau_1 = 1.0$ and $\tau_2 = 1.5$, to encourage exploration. The words are sampled in the forward pass using $y^{\tau_1}_{m+1,n}, y^{\tau_2}_{m+1,n} \sim f_{\tau_1, \tau_2}(\text{QGen}(x \mid w))$. In the backward pass, the weights are updated using $w = w + \sum_{k=1}^{2} \alpha (r - b) \nabla_w \ln f(y^{\tau_k}_{m+1,n} \mid \text{QGen}(x \mid w))$.

Dynamic-TPG: The last method we evaluate is Dynamic-TPG. We use the heuristic function to calculate the temperature for each word at each time step, $\tau^h_{m+1,n} = \tau_{\min} + (\tau_{\max} - \tau_{\min}) \frac{\text{tf-idf}(y^*_{m+1,n}) - \text{tf-idf}_{\min}}{\text{tf-idf}_{\max} - \text{tf-idf}_{\min}}$, where we set $\tau_{\min} = 0.5$, $\tau_{\max} = 1.5$, $\text{tf-idf}_{\min} = 0$, and $\text{tf-idf}_{\max} = 8$. After calculating $\tau^h_{m+1,n}$, we substitute its value at each time step and sample the next word using $y^{\tau^h_{m+1,n}}_{m+1,n} \sim f_{\tau^h_{m+1,n}}(\text{QGen}(x \mid w))$. In the backward pass, the weights are updated using $w = w + \alpha (r - b) \nabla_w \ln f(y^{\tau^h_{m+1,n}}_{m+1,n} \mid \text{QGen}(x \mid w))$. For all four methods, we use greedy search during evaluation.

    5. EXPERIMENT

We first train all the networks in a supervised fashion and then optimize the QGen and the Guesser model using reinforcement learning. Our implementation1 uses Torch [29].

    5.1. Pretraining

We train all three models with a dropout [30] rate of 0.5, using the Adam optimizer [31]. We use a learning rate of 0.0001 for the Oracle model and the Guesser model, and a learning rate of 0.001 for the QGen. All models are trained for at most 30 epochs and are stopped early after five epochs without improvement on the validation set. We report the performance on the test set, which consists of images not used in training. As the performance metric for all three models we report the game success rate, which equals the number of successful games divided by the total number of games. Compared to previous works [6, 8, 32], after supervised training our models obtain a game success rate of 48.77%.

1 https://github.com/ruizhaogit/GuessWhat-TemperedPolicyGradient

#   Method                                   Accuracy
1   Strub et al., 2017 [8]                   52.30%
2   Strub and de Vries, 2017 [32]            60.30%
3   Our Torch reimplementation of (# 2)      62.61%
4   (# 3) + new QGen (Seq2Seq)               63.47%
5   (# 4) + new Guesser (Memory Nets)        68.32%
6   (# 5) + new Guesser (REINFORCE)          69.66%
7   (# 6) + Single-TPG                       69.76%
8   (# 6) + Parallel-TPG                     73.86%
9   (# 6) + Dynamic-TPG                      74.31%

Table 1: Performance comparison and ablation tests

This is 4% higher than the state-of-the-art method [32], which reports 44.6% accuracy.

    5.2. Reinforcement Learning

We first initialize all models with pre-trained parameters from supervised learning and then post-train the QGen using either REINFORCE or TPG for 80 epochs. We update the parameters using stochastic gradient descent (SGD) with a learning rate of 0.001 and a batch size of 64. In each epoch, we sample each image in the training set once and randomly pick one of its objects as the target. We track the running average of the game success rate and use it directly as the baseline, $b$, in REINFORCE. We limit the maximum number of questions to 8 and the maximum number of words to 12. Simultaneously, we train the Guesser model using REINFORCE without a baseline, using SGD with a learning rate of 0.0001. The performance comparison is shown in Tab. 1.

Ablation Study: From Tab. 1 (# 2 & 3), we see that our reimplementation using Torch [29] achieves performance comparable to the original TensorFlow implementation [32]. We use our reimplementation as the baseline.

On top of this baseline, the new QGen model with the Seq2Seq structure improves the performance by about 1%, see Tab. 1 (# 3 & 4). With the Seq2Seq structure, our QGen model is trained at the question level. This means that the model first learns to query meaningfully, step by step, and eventually learns to conduct a meaningful dialogue. Compared to directly learning to manage a strategic conversation, this bottom-up training procedure helps the model absorb knowledge, because it breaks a large task down into smaller, more manageable pieces. This makes learning much easier for the QGen.

The next improvement comes from our new Guesser model, which uses a Memory Network with two hops of attention [17]. The memory and attention mechanisms bring an improvement of 4.85%, as shown in Tab. 1 (# 4 & 5). Furthermore, we additionally train the new Guesser model via REINFORCE (# 6). In this way, the Guesser and the QGen learn to cooperate with each other and improve the performance by another 1.34%, as shown in Tab. 1 (# 5 & 6).

Image 1, Policy Gradient:
Is it in left? No; Is it in front? No; Is it in right? Yes; Is it in middle? Yes; Is it person? No; Is it ball? No; Is it bat? No; Is it car? Yes. Status: Failure

Image 1, Tempered Policy Gradient:
Is it a person? No; Is it a vehicle? Yes; Is it a truck? Yes; Is it in front of photo? No; In the left half? No; In the middle of photo? Yes; Is it to the right photo? Yes; Is it in the middle of photo? Yes. Status: Success

Image 2, Policy Gradient:
Is it in left? No; Is it in front? Yes; Is it in right? No; Is it in middle? Yes; Is it person? No; Is it giraffe? Yes; Is in middle? Yes; Is in middle? Yes. Status: Failure

Image 2, Tempered Policy Gradient:
Is it a giraffe? Yes; In front of photo? Yes; In the left half? Yes; Is it in the middle of photo? Yes; Is it to the left of photo? Yes; Is it to the right photo? No; In the left in photo? No; In the middle of photo? Yes. Status: Success

Table 2: Samples generated by our improved models using REINFORCE ("Policy Gradient") and Dynamic-TPG ("Tempered Policy Gradient"). The green bounding boxes highlight the target objects; the red boxes highlight the wrong guesses.

Here, we take a closer look at the improvement brought by TPGs. From Tab. 1, we see that, compared to the REINFORCE-trained models (# 6), Single-TPG (# 7) with $\tau_{\text{global}} = 1.5$ achieves comparable performance. With two different temperatures, $\tau_1 = 1.0$ and $\tau_2 = 1.5$, Parallel-TPG (# 8) achieves an improvement of approximately 4%, at the cost of more computational resources. Compared to Parallel-TPG, Dynamic-TPG uses only the same computational power as REINFORCE and still gives a larger improvement by using a dynamic temperature, $\tau^h_t \in [0.5, 1.5]$. In summary, the best model is Dynamic-TPG (# 9), which gives a 4.65% improvement over the new models trained with REINFORCE (# 6).

TPG Dialogue Samples: The generated dialogue samples in Tab. 2 offer some interesting insights. First of all, the sentences generated by TPG-trained models are on average longer and use slightly more complex structures, such as "Is it in the middle of photo?" instead of the simpler form "Is it in middle?". Secondly, TPGs enable the models to explore better and comprehend more words. For example, in the first task (upper half of Tab. 2), both models ask about the category. The REINFORCE-trained model can only use the single word "car" to query about the vehicle category. In contrast, the TPG-trained model first asks about the more general category with the word "vehicle" and follows up with the more specific category "truck". These two words, "vehicle" and "truck", give much more information than the single word "car" and help the Guesser model identify the truck among many cars. Lastly, similar to the category case, the models trained with TPG can first ask about a larger spatial range of the object and then follow up with a smaller range. In the second task (lower half of Tab. 2), we see that the TPG-trained model first asks "In the left half?", which refers to all three giraffes in the left half, and the answer is "Yes". Then it asks "Is it to the left of photo?", which refers to the second giraffe from the left, and the answer is "Yes". Eventually the QGen asks "In the left in photo?", which refers to the leftmost giraffe, and the answer is "No". These specific questions about locations are not learned using REINFORCE; the REINFORCE-trained model can only ask a similar question with the word "left". In this task, there are many giraffes in the left part of the image, and the top-down spatial questions help the Guesser model find the correct giraffe. To summarize, the TPG-trained models use longer and more informative sentences than the REINFORCE-trained models.

    6. CONCLUSION

Our paper makes two contributions. Firstly, by extending existing models with Seq2Seq and Memory Networks, we could improve the performance of a goal-oriented dialogue system by 7%. Secondly, we introduced TPGs, a novel class of temperature-based policy gradient approaches. TPGs boosted the performance of the goal-oriented dialogue system by another 4.7%. Among the three TPGs, Dynamic-TPG gave the best performance, helping the agent comprehend more words and produce more meaningful questions. TPG is a generic strategy for encouraging word exploration on top of policy gradients and can be applied to any dialogue agent.

7. REFERENCES

[1] Rui Zhao and Volker Tresp, "Improving goal-oriented visual dialog agents via advanced recurrent nets with tempered policy gradient," arXiv preprint arXiv:1807.00737, 2018.

[2] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

[3] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, 2015.

[5] Antoine Bordes and Jason Weston, "Learning end-to-end goal-oriented dialog," in International Conference on Learning Representations (ICLR), 2017.

[6] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville, "GuessWhat?! Visual object discovery through multi-modal dialogue," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[7] Abhishek Das, Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra, "Learning cooperative visual dialog agents with deep reinforcement learning," in International Conference on Computer Vision (ICCV), 2017.

[8] Florian Strub, Harm de Vries, Jérémie Mary, Bilal Piot, Aaron C. Courville, and Olivier Pietquin, "End-to-end optimization of goal-driven and visually grounded dialogue systems," in International Joint Conference on Artificial Intelligence (IJCAI), 2017.

[9] Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng, "BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems," arXiv preprint arXiv:1711.05715, 2017.

[10] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky, "Deep reinforcement learning for dialogue generation," arXiv preprint arXiv:1606.01541, 2016.

[11] Jason D. Williams and Geoffrey Zweig, "End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning," arXiv preprint arXiv:1606.01269, 2016.

[12] Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra, "Deal or no deal? End-to-end learning for negotiation dialogues," arXiv preprint arXiv:1706.05125, 2017.

[13] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng, "End-to-end reinforcement learning of dialogue agents for information access," arXiv preprint arXiv:1609.00777, 2016.

[14] Rui Zhao and Volker Tresp, "Efficient dialog policy learning via positive memory retention," in IEEE Spoken Language Technology (SLT) (forthcoming), 2018.

[15] Ronald J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, 1992.

[16] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[17] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al., "End-to-end memory networks," in Advances in Neural Information Processing Systems, 2015, pp. 2440–2448.

[18] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[19] David Carmel and Shaul Markovitch, "Exploration strategies for model-based learning in multi-agent systems: Exploration strategies," Autonomous Agents and Multi-Agent Systems, vol. 2, no. 2, pp. 141–172, 1999.

[20] Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans, "Improving policy gradient by exploring under-appreciated rewards," arXiv preprint arXiv:1611.09321, 2016.

[21] Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng, "Stein variational policy gradient," arXiv preprint arXiv:1704.02399, 2017.

[22] Sebastian B. Thrun, "Efficient exploration in reinforcement learning," 1992.

[23] Andrej Karpathy and Li Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.

[24] Christian P. Robert, Monte Carlo Methods, Wiley Online Library, 2004.

[25] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press, 2014.

[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, Springer, 2014, pp. 740–755.

[27] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, 1997.

[28] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

[29] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.

[30] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[31] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[32] Florian Strub and Harm de Vries, "GuessWhat?! models," https://github.com/GuessWhatGame/guesswhat/, 2017.