Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4271–4281, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


ARAML: A Stable Adversarial Training Framework for Text Generation

Pei Ke∗, Fei Huang∗, Minlie Huang†, Xiaoyan Zhu
Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems

Beijing National Research Center for Information Science and Technology
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

[email protected], [email protected]@tsinghua.edu.cn, [email protected]

Abstract

Most of the existing generative adversarial networks (GAN) for text generation suffer from the instability of reinforcement learning training algorithms such as policy gradient, leading to unstable performance. To tackle this problem, we propose a novel framework called Adversarial Reward Augmented Maximum Likelihood (ARAML). During adversarial training, the discriminator assigns rewards to samples which are acquired from a stationary distribution near the data rather than the generator's distribution. The generator is optimized with maximum likelihood estimation augmented by the discriminator's rewards instead of policy gradient. Experiments show that our model can outperform state-of-the-art text GANs with a more stable training process.

1 Introduction

Natural text generation, as a key task in NLP, has been advanced substantially thanks to the flourishing of neural models (Bengio et al., 2003; Mikolov et al., 2010). Typical frameworks such as sequence-to-sequence (seq2seq) have been applied to various generation tasks, including machine translation (Sutskever et al., 2014) and dialogue generation (Vinyals and Le, 2015). The standard paradigm to train such neural models is maximum likelihood estimation (MLE), which maximizes the log-likelihood of observing each word in the text given the ground-truth preceding context (Graves, 2013).

Although widely used, MLE suffers from the exposure bias problem (Bengio et al., 2015; Ranzato et al., 2016): at test time, the model sequentially predicts the next word conditioned on its previously generated words, while during training it is conditioned on ground-truth words. To tackle this

∗ Equal contribution
† Corresponding author: Minlie Huang

problem, generative adversarial networks (GAN) with reinforcement learning (RL) training approaches have been introduced to text generation tasks (Yu et al., 2017; Che et al., 2017; Lin et al., 2017; Fedus et al., 2018; Guo et al., 2018; Shi et al., 2018; Xu et al., 2018), where the discriminator is trained to distinguish real and generated text samples to provide reward signals for the generator, and the generator is optimized via policy gradient (Yu et al., 2017).

However, recent studies have shown that potential issues of training GANs on discrete data are more severe than exposure bias (Semeniuta et al., 2018; Caccia et al., 2018). One of the fundamental issues when generating discrete text samples with GANs is training instability. Updating the generator with policy gradient always leads to an unstable training process because it is difficult for the generator to derive positive and stable reward signals from the discriminator even with careful pre-training (Che et al., 2017). As a result, the generator gets lost due to the high variance of reward signals and the training process may finally collapse (Li et al., 2017).

In this paper, we propose a novel adversarial training framework called Adversarial Reward Augmented Maximum Likelihood (ARAML) to deal with the instability issue of training GANs for text generation. At each iteration of adversarial training, we first train the discriminator to assign higher rewards to real data than to generated samples. Then, inspired by reward augmented maximum likelihood (RAML) (Norouzi et al., 2016), the generator is updated on the samples acquired from a stationary distribution with maximum likelihood estimation (MLE), weighted by the discriminator's rewards. This stationary distribution is designed to guarantee that the training samples surround the real data, thus the exploration space of our generator is indeed restricted by the MLE training objective, resulting in more stable training. Compared to other text GANs with RL training techniques, our framework acquires samples from the stationary distribution rather than the generator's distribution, and uses the RAML training paradigm to optimize the generator instead of policy gradient. Our contributions are mainly as follows:

• We analyze the fundamental issue of current GANs for text generation from the perspective of training instability.

• We propose a novel framework called Adversarial Reward Augmented Maximum Likelihood (ARAML), which incorporates stable RAML training into the adversarial training paradigm. Experimental results on three text generation tasks show the effectiveness of our method.

2 Related Work

Recently, text generation has been widely studied with neural models trained with maximum likelihood estimation (Graves, 2013). However, MLE tends to generate universal text (Li et al., 2016). Various methods have been proposed to enhance the generation quality by refining the objective function (Li et al., 2016; Mou et al., 2016) or modifying the generation distribution with external information like topic (Xing et al., 2017), sentence type (Ke et al., 2018), emotion (Zhou et al., 2018a) and knowledge (Zhou et al., 2018b).

As mentioned above, MLE suffers from the exposure bias problem (Bengio et al., 2015; Ranzato et al., 2016). Thus, reinforcement learning methods such as policy gradient (Ranzato et al., 2016) and actor-critic (Bahdanau et al., 2017) have been introduced to text generation tasks. Norouzi et al. (2016) proposed an efficient and stable approach called Reward Augmented Maximum Likelihood (RAML), which connects the log-likelihood and expected rewards to incorporate the MLE training objective into the RL framework.

Since some text generation tasks have no explicit metrics to be directly optimized, adversarial training has been applied to generating discrete text samples with a discriminator to learn a proper reward. For instance, SeqGAN (Yu et al., 2017) devised a discriminator to distinguish the real data and generated samples, and a generator to maximize the reward from the discriminator via policy gradient. Other variants of GANs have been proposed to improve the generator or the discriminator. To improve the generator, MaliGAN (Che et al., 2017) developed a normalized maximum likelihood optimization target for the generator to stably model the discrete sequences. LeakGAN (Guo et al., 2018) guided the generator with reward signals leaked from the discriminator at all generation steps to deal with the long text generation task. MaskGAN (Fedus et al., 2018) employed an actor-critic architecture to make the generator fill in missing text conditioned on the surrounding context, which is expected to mitigate the problem of mode collapse. As for the discriminator, RankGAN (Lin et al., 2017) replaced the traditional discriminator with a ranker to learn the relative ranking information between the real texts and generated ones. Inverse reinforcement learning (Shi et al., 2018) used a trainable reward approximator as the discriminator to provide dense reward signals at each generation step. DPGAN (Xu et al., 2018) introduced a language model based discriminator and regarded cross-entropy as rewards to promote the diversity of generation results.

The most similar works to our model are RAML (Norouzi et al., 2016) and MaliGAN (Che et al., 2017): 1) Compared with RAML, our model adds a discriminator to learn the reward signals instead of choosing existing metrics as rewards. We believe that our model can adapt to various text generation tasks, particularly those without explicit evaluation metrics. 2) Unlike MaliGAN, we acquire samples from a fixed distribution near the real data rather than the generator's distribution, which is expected to make the training process more stable.

3 Model

3.1 Task Definition and Model Overview

Text generation can be formulated as follows: given the real data distribution Pdata(X), the task is to train a generative model Gθ whose distribution PGθ(X) fits Pdata(X) well. In this formulation, X = x1x2···xm and xt (1 ≤ t ≤ m) denotes a word in the vocabulary V.

Figure 1 shows the overview of our model ARAML. This adversarial training framework consists of two phases: 1) The discriminator is trained to assign higher rewards to real data than to generated data. 2) The generator is trained on the samples acquired from a stationary distribution with the reward augmented MLE training objective. This training paradigm of the generator indeed constrains the search space with the MLE training objective, which alleviates the issue of unstable training.

Figure 1: Overview of ARAML. The training samples are acquired from a stationary distribution Ps based on the real data. The generator is then trained on the samples augmented by the discriminator's rewards. The discriminator is trained to distinguish real data and generated data.

3.2 Discriminator

The discriminator Dφ aims to distinguish real data and generated data like other GANs. Inspired by Least-Squares GAN (Mao et al., 2017), we devise the loss function as follows:

$$\mathcal{L}_{D_\phi} = \frac{1}{2}\mathbb{E}_{X\sim P_{data}(X)}\left[(D_\phi(X)-1)^2\right] + \frac{1}{2}\mathbb{E}_{X\sim P_{G_\theta}(X)}\left[(D_\phi(X))^2\right] \qquad (1)$$

This loss function forces the discriminator to assign higher rewards to real data than to generated data, so the discriminator can learn to provide more proper rewards as the training proceeds.
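
As a concrete illustration, the loss in Eq. (1) fits in a few lines. The sketch below assumes PyTorch and that d_real and d_fake are batches of discriminator outputs on real and generated sentences; the function and tensor names are ours, not from the released code.

    import torch

    def lsgan_discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
        # Eq. (1): 0.5 * E[(D(X) - 1)^2] over real data + 0.5 * E[D(X)^2] over generated data
        return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()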

3.3 Generator

The training objective of our generator Gθ is derived from the objective of other discrete GANs with the RL training method:

$$\mathcal{L}_{RL,\theta} = -\mathbb{E}_{X\sim P_{G_\theta}(X)}\left[r_\phi(X)\right] - \tau\mathcal{H}(P_{G_\theta}(X)) \qquad (2)$$

where rφ(X) denotes the rewards from the discriminator Dφ, and the entropy regularization term H(PGθ(X)) encourages Gθ to generate diverse text samples. τ is a temperature hyper-parameter to balance these two terms.

As mentioned above, discrete GANs suffer from the instability issue due to policy gradient, thus they are consequently difficult to train. Inspired by RAML (Norouzi et al., 2016), we introduce an exponential payoff distribution Qφ(X) to connect the RL loss with the RAML loss:

$$Q_\phi(X) = \frac{1}{Z}\exp(r_\phi(X)/\tau) \qquad (3)$$

where $Z = \sum_X \exp(r_\phi(X)/\tau)$. Thus, we can rewrite $\mathcal{L}_{RL,\theta}$ with PGθ(X) and Qφ(X) as follows:

$$\mathcal{L}_{RL,\theta} = \tau\,KL(P_{G_\theta}(X)\,\|\,Q_\phi(X)) + \text{constant} \qquad (4)$$
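
For completeness, Eq. (4) follows by expanding the KL term with the definition of Qφ(X) in Eq. (3); the two objectives differ only by τ log Z, which does not depend on θ:

$$\begin{aligned}
\tau\,KL(P_{G_\theta}(X)\,\|\,Q_\phi(X)) &= \tau\,\mathbb{E}_{X\sim P_{G_\theta}(X)}\left[\log P_{G_\theta}(X) - \frac{r_\phi(X)}{\tau} + \log Z\right] \\
&= -\mathbb{E}_{X\sim P_{G_\theta}(X)}\left[r_\phi(X)\right] - \tau\mathcal{H}(P_{G_\theta}(X)) + \tau\log Z \\
&= \mathcal{L}_{RL,\theta} + \tau\log Z
\end{aligned}$$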

Following RAML, we remove the constant term and optimize the KL divergence in the opposite direction:

$$\begin{aligned}
\mathcal{L}_{RAML,\theta} &= KL(Q_\phi(X)\,\|\,P_{G_\theta}(X)) \\
&= -\mathbb{E}_{X\sim Q_\phi(X)}\left[\log P_{G_\theta}(X)\right] - \mathcal{H}(Q_\phi(X)) \\
&= -\mathbb{E}_{X\sim Q_\phi(X)}\left[\log P_{G_\theta}(X)\right] + \text{constant}
\end{aligned} \qquad (5)$$

where H(Qφ(X)) is a constant in the training phase of the generator. It has been proved that $\mathcal{L}_{RL,\theta}$ and $\mathcal{L}_{RAML,\theta}$ are equivalent up to their first-order Taylor approximations, and they have the same global optimum (Norouzi et al., 2016). $\mathcal{L}_{RAML,\theta}$ can be trained in an MLE-like fashion, but sampling from the distribution Qφ(X) is intractable in the adversarial setting, because Qφ(X) varies with the discriminator Dφ. Thus, we introduce importance sampling to separate the sampling process from Dφ and obtain the final loss function:

$$\mathcal{L}_{G_\theta} = -\mathbb{E}_{X\sim P_s(X)}\left[W_\phi(X)\log P_{G_\theta}(X)\right] \qquad (6)$$

where Ps(X) denotes a stationary distribution and Wφ(X) ∝ Qφ(X)/Ps(X). To optimize this loss function, we first construct the fixed distribution Ps(X) to get samples, and devise a proper reward function rφ(X) to train the generator in a stable and effective way.

3.3.1 Sampling

We construct the distribution Ps based on Pdata:

$$P_s(X) = \mathbb{E}_{X\sim P_{data}(X)}\left[P_s(X_s|X)\right] \qquad (7)$$

In this way, Ps(Xs|X) can be designed to guarantee that Ps(X) is near Pdata(X), leading to a more stable training process. To obtain a new sample Xs from a real data sample X, we design three steps: sampling an edit distance d, sampling the positions {p1, p2, ···, pd} for substitution, and sampling the new words {w1, w2, ···, wd} filled into the corresponding positions. Thus, Ps(Xs|X) can be decomposed into three terms:

$$P_s(X_s|X) = P(d,p,w|X) = P(d|X)\,P(p|X,d)\,P(w|X,d,p) \qquad (8)$$

The first step is to sample an edit distance based on a real data sample X, where X = x1x2···xm is a sequence of length m. The number of sentences which have edit distance e to some input sentence can be computed approximately as below:

$$c(e,m) = \binom{m}{e}\cdot(|\mathcal{V}|-1)^{e} \qquad (9)$$

where c(e, m) denotes the number of sentences which have an edit distance e (e ∈ {0, 1, ..., m}) to a sentence of length m, and |V| indicates the size of the vocabulary. We then follow Norouzi et al. (2016) to re-scale the counts by exp{−e/τ} and normalize, so that we can sample an edit distance d∗ from:

$$P(d=d^*|X) = \frac{\exp\{-d^*/\tau\}\,c(d^*,m)}{\sum_{e=0}^{m}\exp\{-e/\tau\}\,c(e,m)} \qquad (10)$$

where τ, as a temperature hyper-parameter, restricts the search space surrounding the original sentence. A larger τ brings more samples with long edit distances.
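
A minimal sketch of this step, assuming NumPy; since the counts in Eq. (9) overflow quickly for realistic vocabulary sizes, the re-scaled scores are handled in log space (the function and variable names are ours):

    import numpy as np
    from math import lgamma

    def sample_edit_distance(m: int, vocab_size: int, tau: float, rng=None) -> int:
        # Eq. (9)-(10): sample d* with probability proportional to exp(-e/tau) * c(e, m)
        rng = rng or np.random.default_rng()
        e = np.arange(m + 1)
        log_binom = lgamma(m + 1) - np.array([lgamma(k + 1) + lgamma(m - k + 1) for k in e])
        log_c = log_binom + e * np.log(vocab_size - 1)   # log c(e, m)
        logits = -e / tau + log_c                        # re-scaled counts, in log space
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(rng.choice(m + 1, p=probs))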

The next step is to select positions for substitution based on the sampled edit distance d∗. Intuitively, we can randomly choose d∗ distinct positions in X to be replaced by new words. The probability of choosing the position p∗ is calculated as follows:

$$P(p=p^*|X, d=d^*) = \frac{d^*}{m} \qquad (11)$$

Following this sampling strategy, we can obtain the position set {p1, p2, ···, pd∗}. This strategy approximately guarantees that the edit distance between a new sentence and the original sentence is d∗.

At the final step, our model determines new words for substitution at each sampled position pj (j = 1, 2, ..., d∗). We can formulate this sampling process from the original sequence X to a new sample Xs as a sequential transition X = X0 → X1 → ··· → Xd∗ = Xs. At each step from Xj−1 to Xj (j = 1, ···, d∗), we first sample a new word wj from the distribution P(w|Xj−1, p = pj), then replace the old word at position pj of Xj−1 to obtain Xj. The whole sampling process can be decomposed as follows:

$$P(w|X, d=d^*, p=\{p_1,p_2,\cdots,p_{d^*}\}) = \prod_{j=1}^{d^*} P(w_j|X_{j-1}, p=p_j) \qquad (12)$$

There are two common sampling strategies to model P(w|Xj−1, p = pj), i.e. random sampling and constrained sampling. The random sampling strategy samples a new word wj according to the uniform distribution over the vocabulary V (Norouzi et al., 2016), while the constrained sampling strategy samples wj to maximize the language model score of the target sentence Xj (Su et al., 2018; Miao et al., 2019). Here, we adopt constrained sampling in our model and compare the performance of the two strategies in the experiments.
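
The random sampling variant of Eq. (11)-(12) is straightforward; constrained sampling would additionally rescore candidate words with the pre-trained language model, which is omitted here. A sketch with illustrative names, continuing the NumPy convention above (rng is a numpy Generator):

    def perturb_sentence(x, d_star, vocab, rng):
        # x: list of tokens; d_star: sampled edit distance; vocab: list of candidate words
        xs = list(x)
        positions = rng.choice(len(xs), size=d_star, replace=False)  # d* distinct positions, Eq. (11)
        for p in positions:
            xs[p] = vocab[rng.integers(len(vocab))]  # random sampling: uniform over V, Eq. (12)
        return xs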

3.3.2 Training

We devise the reward function rφ(X) according to the discriminator's output Dφ(X) and the stationary distribution Ps(X):

$$r_\phi(X) = \tau\cdot\left[\log P_s(X) + D_\phi(X)\right] \qquad (13)$$

Intuitively, this reward function encourages the generator to generate sentences with large sampling probability and high rewards from the discriminator. Thus, the weight of samples Wφ(X) can be calculated as follows:

$$W_\phi(X) \propto \frac{Q_\phi(X)}{P_s(X)} \propto \exp\{D_\phi(X)\} \qquad (14)$$

So far, we can successfully optimize the generator's loss $\mathcal{L}_{G_\theta}$ via Equation (6). This training paradigm makes our generator avoid the possible variance caused by policy gradient and get more stable reward signals from the discriminator, because our generator is restricted to explore the training samples near the real data.
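
A sketch of the resulting generator update (Eq. (6) with the weights of Eq. (14)), again assuming PyTorch. We normalize the weights within a batch via a softmax over the discriminator scores, which is one natural way to realize the proportionality in Eq. (14) and is an assumption on our part:

    import torch

    def araml_generator_loss(log_probs: torch.Tensor, d_scores: torch.Tensor) -> torch.Tensor:
        # log_probs: per-sentence sum of log P_G(x_t | x_<t) for samples X_s drawn from P_s
        # d_scores:  discriminator outputs D_phi(X_s) for the same samples
        weights = torch.softmax(d_scores, dim=0).detach()  # W_phi(X) ∝ exp{D_phi(X)}, Eq. (14)
        return -(weights * log_probs).sum()                # weighted MLE, Eq. (6)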

Algorithm 1 Adversarial Reward Augmented Maximum Likelihood
Require:
  Total adversarial training iterations: N_iters
  Steps of training the generator: G_steps
  Steps of training the discriminator: D_steps
1:  Pre-train the generator Gθ with the MLE loss
2:  Generate samples from PGθ
3:  Pre-train the discriminator Dφ via Eq.(1)
4:  Construct Ps(X) via Eq.(7) - Eq.(12)
5:  for each s = 1, 2, ..., N_iters do
6:    for each j = 1, 2, ..., G_steps do
7:      Update Gθ via Eq.(6)
8:    end for
9:    for each k = 1, 2, ..., D_steps do
10:     Update Dφ via Eq.(1)
11:   end for
12: end for
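
Algorithm 1 maps onto a standard alternating loop. The sketch below is schematic and reuses the loss sketches above; the data and model helpers (sample_from_ps, real_batch, generator, discriminator, the optimizers) are placeholders rather than functions from the released code:

    # Alternating ARAML training (schematic), after pre-training G, D and building P_s.
    for s in range(N_iters):
        for _ in range(G_steps):
            xs = sample_from_ps(batch_size)                           # samples from P_s, Eq. (7)-(12)
            g_loss = araml_generator_loss(generator.log_prob(xs),
                                          discriminator(xs))          # Eq. (6)
            g_optimizer.zero_grad(); g_loss.backward(); g_optimizer.step()
        for _ in range(D_steps):
            d_loss = lsgan_discriminator_loss(discriminator(real_batch()),
                                              discriminator(generator.sample(batch_size)))  # Eq. (1)
            d_optimizer.zero_grad(); d_loss.backward(); d_optimizer.step()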

3.4 Extension to Conditional Text Generation

We have shown our adversarial training framework for text generation tasks without an input. Actually, it can also be extended to conditional text generation tasks like dialogue generation. Given the data distribution Pdata(C, X), where C, X denote contexts and responses respectively, the objective function of ARAML's generator can be modified as below:

$$\mathcal{L}_{G_\theta} = -\mathbb{E}_{(C,X)\sim P_{data}(C,X)}\left[\mathbb{E}_{X_s\sim P_s(X_s|C,X)}\left[W_\phi(C,X_s)\log P_{G_\theta}(X_s|C)\right]\right] \qquad (15)$$

where Wφ(C, Xs) ∝ exp{Dφ(C, Xs)} and Dφ(C, Xs) is trained to distinguish whether Xs is the true response to C.
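
The change from Eq. (6) to Eq. (15) is mechanical: the sampled responses and the discriminator both condition on the context. In the sketches above, araml_generator_loss stays the same; only its inputs change. The call below is purely illustrative and its signatures are hypothetical:

    # Conditional case (Eq. (15)): perturb the gold response X given context C,
    # score (C, X_s) pairs with the discriminator, and weight log P_G(X_s | C).
    xs = perturb_sentence(response, d_star, vocab, rng)
    g_loss = araml_generator_loss(generator.log_prob(xs, context=context),
                                  discriminator(context, xs))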

3.5 Comparison with RAML and MaliGAN

The most similar works to our framework are RAML (Norouzi et al., 2016) and MaliGAN (Che et al., 2017). The main difference among them is the training objective of their generators. We show the different objective functions in Table 1. For comparison, we use the form with no input for all three models.

Our model is greatly inspired by RAML, which gets samples from a non-parametric distribution Q(X) constructed based on a specific reward. Compared to RAML, our reward comes from a learnable discriminator which varies as the adversarial training proceeds, rather than a specific reward function. This difference equips our framework with the ability to adapt to text generation tasks with no explicit evaluation metrics as rewards.

Our model is also similar to MaliGAN, which gets samples from the generator's distribution. In MaliGAN's training objective, Gθ′ also indicates the generator's distribution, but it is used in the sampling phase and fixed at each optimization step. The weight of samples is W′φ(X) ∝ Dφ(X)/(1 − Dφ(X)). Different from our model, MaliGAN acquires samples from the generator's distribution PGθ′, which usually brings samples with low rewards even with careful pre-training for the generator, leading to training instability. Instead, our framework gets samples from a stationary distribution Ps around the real data, thus our training process is more stable.

Model      Training objective of the generator
RAML       $\mathcal{L}_{G_\theta} = -\mathbb{E}_{X\sim Q(X)}[\log P_{G_\theta}(X)]$
MaliGAN    $\mathcal{L}_{G_\theta} = -\mathbb{E}_{X\sim P_{G_{\theta'}}(X)}[W'_\phi(X)\log P_{G_\theta}(X)]$
ARAML      $\mathcal{L}_{G_\theta} = -\mathbb{E}_{X\sim P_s(X)}[W_\phi(X)\log P_{G_\theta}(X)]$

Table 1: Training objectives of the generators of RAML, MaliGAN and ARAML.

4 Experiment

4.1 Datasets

Dataset       Amount (Train/Test)   Vocabulary   Length
COCO          80,000 / 5,000        4,839        12.8
EMNLP2017     49,996 / 10,000       5,721        27.8
WeiboDial     100,000 / 5,000       7,998        7.3 / 10.8

Table 2: Statistics of COCO, EMNLP2017 WMT and WeiboDial. The average lengths 7.3/10.8 of WeiboDial indicate the lengths of posts and responses, respectively.

We evaluated ARAML on three datasets: the COCO image caption dataset (Chen et al., 2015), the EMNLP2017 WMT dataset[1] and the WeiboDial single-turn dialogue dataset (Qian et al., 2018). COCO and EMNLP2017 WMT are common benchmarks with no input to evaluate the performance of discrete GANs, and we followed the existing works to preprocess these datasets (Shi et al., 2018; Guo et al., 2018). WeiboDial, as a dialogue dataset, was applied to test the performance of our model with an input trigger. We simply removed post-response pairs containing low-frequency words and randomly selected a subset for our training/test set. The statistics of the three datasets are presented in Table 2.

[1] http://statmt.org/wmt17/translation-task.html

4.2 Baselines

We compared our model with MLE, RL and GAN baselines. Since COCO and EMNLP2017 WMT have no input while WeiboDial regards posts as input, we chose the following baselines respectively:

MLE: an RNN model trained with the MLE objective (Graves, 2013). Its extension, Seq2Seq, can work on the dialogue dataset (Sutskever et al., 2014).
SeqGAN: the first text GAN model that updates the generator with policy gradient based on the rewards from the discriminator (Yu et al., 2017).
LeakGAN: a variant of SeqGAN that provides rewards based on the leaked information of the discriminator for the generator (Guo et al., 2018).
MaliGAN: a variant of SeqGAN that optimizes the generator with a normalized maximum likelihood objective (Che et al., 2017).
IRL: an inverse reinforcement learning method that replaces the discriminator with a reward approximator to provide dense rewards (Shi et al., 2018).
RAML: an RL approach that incorporates the MLE objective into the RL training framework, regarding BLEU as rewards (Norouzi et al., 2016).
DialogGAN: an extension of SeqGAN tuned to the dialogue generation task, with the MLE objective added to the adversarial objective (Li et al., 2017).
DPGAN: a variant of DialogGAN which uses a language model based discriminator and regards cross-entropy as rewards (Xu et al., 2018).

Note that MLE, SeqGAN, LeakGAN, MaliGAN and IRL are the baselines on COCO and EMNLP2017 WMT, while MLE, RAML, DialogGAN, and DPGAN are the baselines on WeiboDial. The original codes were used to test the baselines.

4.3 Implementation Details

The implementation details of our model are shown in Table 3. For COCO / EMNLP2017, the generator is an LSTM unit (Hochreiter and Schmidhuber, 1997) with 128 cells, and the discriminator is implemented based on Yu et al. (2017). For WeiboDial, the generator is an encoder-decoder structure with attention mechanism, where both the encoder and the decoder consist of a two-layer GRU (Cho et al., 2014) with 128 cells. The discriminator is implemented based on Tao et al. (2018). The language model used in the constrained sampling of ARAML is implemented in the same setting as the generators, and is pre-trained on the training set of each dataset. The codes and the datasets are available at https://github.com/kepei1106/ARAML.

Dataset                        COCO / EMNLP2017                 WeiboDial
Generator                      LSTM                             GRU
Discriminator                  GRU & CNN                        GRU & MLP
Temperature                    0.85 (COCO) / 0.9 (EMNLP2017)    0.95
Sampling size                  5                                5
Dimension of word embedding    128                              100
Batch size                     100                              100
Pre-training epochs (G/D/LM)   50 / 15 / 50                     50 / 10 / 30
Optimizer                      Adam                             Adam
Learning rate (G/D)            0.001 / 0.0001                   0.001 / 0.0001

Table 3: Implementation details of ARAML. G/D/LM indicates the generator / discriminator / language model used in constrained sampling, respectively.

As for the details of the baselines, the generators of all the baselines except LeakGAN are the same as ours. Note that the generator of LeakGAN consists of a hierarchical LSTM unit, thus we followed the implementation in the original paper. In terms of the differences, the discriminators of the GAN baselines are implemented based on the original papers. Other hyper-parameters of the baselines, including batch size, learning rate, and pre-training epochs, were set based on the original codes, because the convergence of the baselines is sensitive to these hyper-parameters.

4.4 Language Generation on COCO and EMNLP2017 WMT

We adopted forward/reverse perplexity (Zhao et al., 2018) and Self-BLEU (Zhu et al., 2018) to evaluate the quality of generated texts. Forward perplexity (PPL-F) indicates the perplexity on the generated data provided by a language model trained on real data, to measure the fluency of generated samples. Reverse perplexity (PPL-R) switches the roles of generated data and real data to reflect the discrepancy between the generated distribution and the data distribution. Self-BLEU (S-BLEU) regards each sentence in the generated collection as hypothesis and the others as reference to obtain BLEU scores, which evaluates the diversity of generated results.

                           COCO                                                                      EMNLP2017 WMT
Model      PPL-F           PPL-R           S-BLEU-2/3/4                                PPL-F            PPL-R              S-BLEU-2/3/4
MLE        18.83 ± 0.51    38.81 ± 0.97    0.790±0.006 / 0.598±0.009 / 0.419±0.010    55.64 ± 1.03     192.33 ± 9.51      0.771±0.005 / 0.505±0.009 / 0.304±0.008
SeqGAN     33.07 ± 1.98    49.24 ± 2.36    0.814±0.005 / 0.619±0.008 / 0.430±0.007    67.60 ± 3.48     276.82 ± 5.12      0.786±0.019 / 0.500±0.029 / 0.276±0.023
LeakGAN    11.43 ± 2.74    87.54 ± 6.42    0.877±0.032 / 0.762±0.045 / 0.645±0.049    17.92 ± 1.77     491.70 ± 18.29     0.890±0.013 / 0.751±0.011 / 0.604±0.009
MaliGAN    47.16 ± 2.94    54.40 ± 1.29    0.786±0.005 / 0.572±0.008 / 0.370±0.007    126.84 ± 11.18   288.20 ± 16.48     0.780±0.019 / 0.494±0.032 / 0.265±0.028
IRL        41.86 ± 11.82   84.23 ± 7.02    0.857±0.014 / 0.687±0.031 / 0.499±0.062    285.20 ± 36.47   1124.57 ± 109.80   0.890±0.008 / 0.656±0.052 / 0.406±0.077
ARAML      26.97 ± 0.55    35.79 ± 0.49    0.777±0.005 / 0.560±0.006 / 0.366±0.008    77.90 ± 0.70     188.41 ± 3.18      0.745±0.002 / 0.455±0.006 / 0.257±0.006

Table 4: Automatic evaluation on COCO and EMNLP2017 WMT. Each metric is presented with mean and standard deviation.
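
For reference, Self-BLEU as defined above can be computed with NLTK's sentence-level BLEU; the sketch treats each generated sentence as the hypothesis and all the others as references (function and variable names are ours, and smoothing is an assumption):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def self_bleu(generated, n=4):
        # generated: list of tokenized sentences (each a list of tokens)
        weights = tuple(1.0 / n for _ in range(n))
        smooth = SmoothingFunction().method1
        scores = []
        for i, hyp in enumerate(generated):
            refs = generated[:i] + generated[i + 1:]
            scores.append(sentence_bleu(refs, hyp, weights=weights, smoothing_function=smooth))
        return sum(scores) / len(scores)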

Results are shown in Table 4. LeakGAN performs best on forward perplexity because it can generate more fluent samples. As for reverse perplexity, our model ARAML beats the other baselines, showing that our model can fit the data distribution better. Other GANs, particularly LeakGAN, obtain high reverse perplexity due to mode collapse (Shi et al., 2018); thus they only capture limited fluent expressions, resulting in a large discrepancy between the generated distribution and the data distribution. ARAML also outperforms the baselines in terms of Self-BLEU, indicating that our model doesn't fall into mode collapse with the help of the MLE training objective and has the ability to generate more diverse sentences.

We also provide the standard deviation of each metric in Table 4, reflecting the stability of each model's performance. Our model ARAML nearly achieves the smallest standard deviation in all the metrics, indicating that our framework outperforms policy gradient in the stability of adversarial training.

4.5 Dialogue Generation on WeiboDial

Dialogue evaluation is an open problem, and existing works have found that automatic metrics have low correlation to human evaluation (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018). Thus, we resorted to manual evaluation to assess the generation quality on WeiboDial. We randomly sampled 200 posts from the test set and collected the generated results from all the models. For each pair of responses (one from ARAML and the other from a baseline, given the same input post), five annotators were hired to label which response is better (i.e. win, lose or tie) in terms of grammaticality (whether a response itself is grammatical and logical) and relevance (whether a response is appropriate and relevant to the post). The two metrics were evaluated independently.

The evaluation results are shown in Table 5. To measure the inter-annotator agreement, we calculated Fleiss' kappa (Fleiss, 1971) for each pairwise comparison, where the results show moderate agreement (0.4 ≤ κ ≤ 0.6). We also conducted a sign test to check the significance of the differences.
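
Fleiss' kappa for these pairwise comparisons can be computed directly from the annotation counts; a minimal sketch under the setup above (win / lose / tie categories, five raters per item), with an array layout of our own choosing:

    import numpy as np

    def fleiss_kappa(counts: np.ndarray) -> float:
        # counts: (n_items, n_categories) matrix; counts[i, j] = number of annotators
        # who assigned item i (a response pair) to category j (win / lose / tie)
        n_items, _ = counts.shape
        n_raters = counts.sum(axis=1)[0]
        p_j = counts.sum(axis=0) / (n_items * n_raters)                              # category proportions
        p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))   # per-item agreement
        p_bar, p_e = p_i.mean(), (p_j ** 2).sum()
        return (p_bar - p_e) / (1 - p_e)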

As shown in Table 5, ARAML performs significantly better than the other baselines in all the cases. This result indicates that the samples surrounding true responses provide stable rewards for the generator, and the stable RAML training paradigm significantly enhances the performance in both metrics.

4.6 Further Analysis on Stability

Figure 2: PPL-F/PPL-R curves of ARAML, SeqGAN, LeakGAN, MaliGAN and IRL in the training process. The shaded area indicates the standard deviation at each data point. The dotted vertical lines separate the pre-training and adversarial training phases (50 for ARAML, IRL and MaliGAN, 80 for SeqGAN and LeakGAN).

To verify the training stability, we conducted experiments on COCO many times and chose the best 5 trials for SeqGAN, LeakGAN, IRL, MaliGAN and ARAML, respectively. Then, we present the forward/reverse perplexity in the training process in Figure 2. We can see that our model, with smaller standard deviation, is more stable than the other GAN baselines in both metrics. Although LeakGAN reaches the best forward perplexity, its standard deviation is extremely large and it performs badly in reverse perplexity, indicating that it generates limited expressions that are grammatical yet divergent from the data distribution.

                         Grammaticality                           Relevance
Model                    Win (%)   Lose (%)   Tie (%)   κ         Win (%)   Lose (%)   Tie (%)   κ
ARAML vs. MLE            56.5**    36.5       7.0       0.456     50.5**    37.5       12.0      0.465
ARAML vs. RAML           54.5**    37.5       8.0       0.416     47.0*     40.5       12.5      0.480
ARAML vs. DialogGAN      75.5**    13.5       11.0      0.445     73.0**    11.0       16.0      0.460
ARAML vs. DPGAN          57.5**    36.0       6.5       0.435     56.5**    30.5       13.0      0.529

Table 5: Human evaluation on WeiboDial. The scores represent the percentages of Win, Lose or Tie when our model is compared with a baseline. κ denotes Fleiss' kappa (all are moderate agreement). The scores marked with * mean p-value < 0.05 and ** indicates p-value < 0.01 in the sign test.

4.7 Ablation Study

4.7.1 Impact of Temperature

The temperature τ controls the search space surrounding the real data, as we analyze in Section 3.3.1. To investigate its impact on the performance of our model, we fixed all the other hyper-parameters and tested ARAML with different temperatures on COCO.

Figure 3: PPL-F, PPL-R and S-BLEU of ARAML with different temperatures τ ∈ {0.8, 0.85, 0.9, 0.95} on COCO.

The experimental results are shown in Figure 3. We can see that as the temperature becomes larger, forward perplexity increases gradually while Self-BLEU decreases. As mentioned in Section 3.3.1, large temperatures encourage our generator to explore samples that are distant from the real data distribution, thus the diversity of generated results will be improved. However, these samples distant from the data distribution are more likely to be poor in fluency, leading to worse forward perplexity. Reverse perplexity is influenced by both generation quality and diversity, so the correlation between temperature and reverse perplexity is not intuitive. We can observe that the model with τ = 0.95 reaches the best reverse perplexity.

4.7.2 Impact of Sampling Strategy

We have mentioned two common sampling strategies in Section 3.3.1, i.e. random sampling and constrained sampling. To analyze their impact, we kept all the model structures and hyper-parameters fixed and tested ARAML with these two strategies on COCO.

Model     PPL-F          PPL-R          S-BLEU-2/3/4
ARAML-R   37.48 ± 0.53   37.44 ± 0.56   0.752 / 0.571 / 0.384
ARAML-C   26.97 ± 0.55   35.79 ± 0.49   0.777 / 0.560 / 0.366

Table 6: PPL-F, PPL-R and S-BLEU of ARAML with random sampling (ARAML-R) and constrained sampling (ARAML-C) on COCO.

Table 6 shows the results. It is obvious that random sampling hurts the model performance on every metric except S-BLEU-2, because it indeed makes low-quality samples available to the generator. Exploring these samples degrades the quality and diversity of generated results. Despite the worse performance on automatic metrics, random sampling doesn't affect the training stability of our framework. The standard deviation of ARAML-R is still smaller than that of the other GAN baselines.

4.8 Case Study

Table 7 presents the examples generated by the models on COCO. We can find that the other baselines suffer from grammatical errors (e.g. "in front of flying her kite" from MLE), repetitive expressions (e.g. "A group of people" from IRL) and incoherent statements (e.g. "A group of people sitting on a cell phone" from IRL). By contrast, our model performs well in these sentences and has the ability to generate grammatical and coherent results.

Model     Generated Samples
MLE       A little girl sitting on a beach in front of flying her kite at the beach.
          A little boy standing in a room next to a desk.
SeqGAN    A man sitting on a bench with snow board in the background.
          A brown gray cat is in the corner of a street.
LeakGAN   A person that is holding something while another kid is standing in the water.
          A room with a television, mantle, and a chair.
MaliGAN   A man with a shirt on holding one large pink giant and white kite.
          A couple and vases are outside on the bed.
IRL       A group of people wearing helmet sitting down on a cell phone.
          A group of people sitting in the middle of tracks.
ARAML     A man is wearing a hat and holding a toothbrush as he stands on the grass of a field.
          A boy reading a book on a sofa in a room.

Table 7: Examples of generated sentences on COCO. Grammatical errors are in red, while blue text represents repetitive expressions and green part indicates incoherent statements.

Post:        你还迟到哪?你没这习惯啊
             Are you late? You don't have this habit.
MLE:         你是不是要好好照顾自己的!!!
             Would you take care of yourself !!!
RAML:        这是我的生活,我也不想干
             This is my life. I don't want to do this.
DialogGAN:   这社会的温度快到了。周末,要躺床上!
             The temperature of the society has nearly reached. I'll lie on my bed on the weekend!
DPGAN:       我在上班,我就不习惯了
             I'm working and I'm used to doing this.
ARAML:       嗯,周末就不迟到了!
             Hum. I won't be late on the weekend!

Table 8: Examples of generated responses on WeiboDial.

Table 8 shows the generated examples on WeiboDial. It is obvious that the other baselines don't capture the topic word "late" in the post, and thus generate irrelevant responses. ARAML can provide a response that is grammatical and closely relevant to the post.

5 Conclusion

We propose a novel adversarial training framework to deal with the instability problem of current GANs for text generation. To address the instability issue caused by policy gradient, we incorporate RAML into the adversarial training paradigm to make our generator acquire stable rewards. Experiments show that our model outperforms several state-of-the-art GAN baselines with lower training variance, while achieving better performance on three text generation tasks.

Acknowledgments

This work was supported by the National Science Foundation of China (Grant No. 61936010/61876096) and the National Key R&D Program of China (Grant No. 2018YFC0830200). We would like to thank the THUNUS NExT Joint-Lab for the support.

References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In Proceedings of International Conference on Learning Representations.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language GANs falling short. arXiv preprint arXiv:1811.02549.

Arun Tejasvi Chaganty, Stephen Mussmann, and Percy Liang. 2018. The price of debiasing automatic metrics in natural language evaluation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 643–653.

Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. 2017. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Kyunghyun Cho, Bart Van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.

William Fedus, Ian J. Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better text generation via filling in the ______. In Proceedings of International Conference on Learning Representations.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Long text generation via adversarial training with leaked information. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5141–5148.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Pei Ke, Jian Guan, Minlie Huang, and Xiaoyan Zhu. 2018. Generating informative responses with controlled sentence function. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1499–1508.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.

Jiwei Li, Will Monroe, Tianlin Shi, Sebastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2157–2169.

Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pages 3155–3165.

Chia Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In International Conference on Computer Vision, pages 2813–2821.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Honza Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 1045–1048.

Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. In Proceedings of the 26th International Conference on Computational Linguistics, pages 3349–3358.

Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems, pages 1723–1731.

Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252.

Qiao Qian, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018. Assigning personality/profile to a chatting machine for coherent conversation generation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4279–4285.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In Proceedings of International Conference on Learning Representations.

Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. 2018. On accurate evaluation of GANs for language generation. arXiv preprint arXiv:1806.04936.

Zhan Shi, Xinchi Chen, Xipeng Qiu, and Xuanjing Huang. 2018. Toward diverse text generation with inverse reinforcement learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4361–4367.

Jinyue Su, Jiacheng Xu, Xipeng Qiu, and Xuanjing Huang. 2018. Incorporating discriminator in sentence generation: A Gibbs sampling method. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5496–5503.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 722–729.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In International Conference on Machine Learning Deep Learning Workshop.

Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3351–3357.

Jingjing Xu, Xuancheng Ren, Junyang Lin, and Xu Sun. 2018. Diversity-promoting GAN: A cross-entropy based generative adversarial network for diversified text generation. In Conference on Empirical Methods in Natural Language Processing, pages 3940–3949.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2852–2858.

Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. 2018. Adversarially regularized autoencoders. In Proceedings of the 35th International Conference on Machine Learning, pages 5897–5906.

Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018a. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Proceedings of the AAAI Conference on Artificial Intelligence.

Hao Zhou, Tom Young, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018b. Commonsense knowledge aware conversation generation with graph attention. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4623–4629.

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1097–1100.