


Attending to Future Tokens for Bidirectional Sequence Generation

Carolin Lawrence and Bhushan Kotnis and Mathias Niepert
NEC Laboratories Europe

{carolin.lawrence, bhushan.kotnis, mathias.niepert}@neclab.eu

Abstract

Neural sequence generation is typically performed token-by-token and left-to-right. Whenever a token is generated, only previously produced tokens are taken into consideration. In contrast, for problems such as sequence classification, bidirectional attention, which takes both past and future tokens into consideration, has been shown to perform much better. We propose to make the sequence generation process bidirectional by employing special placeholder tokens. Treated as a node in a fully connected graph, a placeholder token can take past and future tokens into consideration when generating the actual output token. We verify the effectiveness of our approach experimentally on two conversational tasks where the proposed bidirectional model outperforms competitive baselines by a large margin.

1 Introduction

When generating an output sequence, neural network models typically produce one token at a time. At each generation step, only the already produced sequence is taken into account. However, future and not-yet-produced tokens can also be highly relevant when choosing the current token. The importance of attending to both past and future tokens is apparent in self-attention architectures such as the Transformer (Vaswani et al., 2017). The self-attention module of a Transformer network treats a sequence bidirectionally as a fully connected graph of tokens: when a token is produced, all other tokens are taken into consideration. However, this requires the entire sequence to be known a priori, and when a Transformer is used for sequence generation, the self-attention process only includes previously produced tokens (Vaswani et al. (2017); Radford et al. (2019); inter alia). But bidirectional self-attention is a

crucial property of the highly successful language model BERT (Devlin et al., 2018). During the pre-training procedure of BERT, a fraction of input tokens is randomly masked out and the training objective is to predict these masked tokens correctly. BERT can then be fine-tuned for various classification tasks. Unfortunately, BERT cannot be directly used for sequence generation because the bidirectional nature of the approach requires the entire sequence to be known beforehand.

Inspired by BERT’s masking-based objective, we propose to start out with a sequence of placeholder tokens which are iteratively replaced by tokens from the output vocabulary to eventually generate the full output sequence. For an example see Figure 1. With this novel model component, the self-attention of a Transformer can take both past and future tokens into consideration, leading to Bidirectional Sequence generation (BISON). Furthermore, it allows us to directly incorporate the pre-trained language model BERT and, to the best of our knowledge, for the first time directly fine-tune it for sequence generation.

BISON makes two major contributions which we investigate in turn. First, we explore different stochastic placeholder replacement strategies to determine, at training time, where to position the placeholder tokens. This is crucial as we need the BISON models to be exposed to a large number of heterogeneous placeholder configurations. Second, we explore several strategies for iteratively generating, at inference time, a complete output sequence from an initial sequence of placeholders.

We evaluate our bidirectional sequence generation approach on two conversational tasks. BISON outperforms both competitive baselines and state-of-the-art neural network approaches on both datasets by a significant margin.



(1) p p p p p p p p p p p p p
(2) p you been p or enrolled in an p degree or program ?
(3) Have you been accepted or enrolled in an accredited degree or program ?

Figure 1: Example generation, going from a sequence of the placeholder token p (1), to an intermediate representation (2), and to the final output (3).

2 Sequence Generation with Transformers

For sequence-to-sequence tasks, an input sequence x = x1, x2, . . . , x|x| is to be mapped to an output sequence y = y1, y2, . . . , y|y| by some model πθ with learnable parameters θ. For neural models, this is typically done by first encoding the input sequence x and then calling a decoder t times to produce a sequence y token-by-token, from left to right.

A popular choice for both encoder and decoder is the Transformer (Vaswani et al., 2017). It takes a sequence of embedded tokens s = s1, s2, . . . , s|s| and treats it as a fully connected graph over which a self-attention module is applied: for each token st in the sequence, it assigns a probabilistic attention score at to every other token in the sentence. For the full mathematical details we refer the reader to Vaswani et al. (2017).

Typically a transformer encoder is employed to encode x, whereas a transformer decoder is used to produce y. In contrast to the encoder, the decoder at time step t only has access to previously produced tokens s<t = s1, s2, . . . , st−1. Consequently, the attention module cannot take possible future tokens into account when making its decision at time t. Additionally, in this encoder-decoder framework, there is a disconnect between input x and output y because the self-attention modules are applied to x and y in isolation before they are combined.

The latter weakness has been overcome in recent work (Radford et al., 2018; Wolf et al., 2019; Radford et al., 2019) by feeding the concatenation s = x ⊕ y to a transformer decoder. At training time, given the current token st, the transformer is trained to predict the next word st+1 via maximum likelihood estimation. At test time, the transformer is conditioned on x and then produces the output y token-by-token. But because the model is a transformer decoder, it is unable to take possible future tokens into account.
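As an illustration (not the setup used in the papers cited above), such a decoder-only baseline can be sketched with the Hugging Face transformers interface; the example strings are invented, and a careful implementation would restrict the loss to the y positions rather than the whole sequence.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch: feed the concatenation s = x (+) y to a left-to-right language model
# and train it with next-token maximum likelihood.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

x = "Have you been accepted or enrolled in an accredited degree or program ?"
y = " Yes ."                                  # illustrative gold output
ids = torch.tensor([tokenizer.encode(x + y)])

# The decoder can only attend to tokens left of the current position.
loss = model(ids, labels=ids).loss            # next-token cross-entropy over the sequence
loss.backward()
```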

3 Bidirectional Sequence Generation

During sequence generation, we want to take both past and future tokens into account. More formally, at time t, we want to attend to both s1, . . . , st−1 as well as st+1, . . . , s|s|. To do this, we give the sequence s = x ⊕ y, the concatenation of the sequences x and y, to a Transformer encoder, rather than a decoder. Of course, at inference time y is unknown. Thus, we propose to replace each token yj with a placeholder token p. Since the model needs to be exposed to heterogeneous placeholder token configurations during training time, we introduce a placeholder strategy that replaces some tokens yj with placeholder tokens p at training time. Hence, during training, a sequence y is replaced by a sequence p = p1, p2, . . . , p|y|, where a token pj is either the original token yj or the placeholder token p. We introduce two placeholder strategies in the following section. At inference time, p contains only placeholder tokens, up to some pre-determined maximum sequence length.

With the placeholder strategy in place, a Transformer encoder is given the sequence s = x ⊕ p as input. The self-attention module then computes hidden representations rt of each token st by attending to every other token in the sequence s. Because the output sequence is already present in the form of placeholder tokens, both past tokens as well as future, not-yet-produced tokens can be taken into consideration for every token st. Following the self-attention step, placeholder tokens are converted into a token from the output vocabulary with a language model (LM) classification layer, where for each placeholder pt, its hidden representation rt is mapped to a distribution dt over the output vocabulary.
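As a minimal sketch of this forward pass, the masked-LM head of BERT can be used to realize the LM classification layer; reusing BERT's [MASK] token as the placeholder p and the exact tensor handling below are assumptions for exposition, not the authors' released code.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

context = "have you been accepted ?"     # input sequence x (illustrative)
max_out_len = 10                         # pre-determined maximum output length

# s = [CLS] x [SEP] p ... p : every output position starts out as a placeholder.
placeholder_id = tokenizer.mask_token_id  # assumption: p is BERT's [MASK] token
ids = tokenizer.encode(context, add_special_tokens=True) + [placeholder_id] * max_out_len
ids = torch.tensor([ids])

with torch.no_grad():
    logits = model(ids).logits               # (1, |s|, vocab); bidirectional self-attention over s
d = logits[0, -max_out_len:].softmax(-1)     # distributions d_t for the placeholder positions
```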

At training time, each output sequence token is fed the gold label, and updates to the model πθ are performed using stochastic gradient descent with a cross-entropy loss, i.e.


Lπθ = − (1/M) Σ_{m=1}^{M} Σ_{j=1}^{|y|} log πθ(pj = yj | s),

where M is the size of a minibatch.
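A minimal sketch of this objective, assuming the gold tokens yj are available at the output positions of s and that positions belonging to the input context are excluded from the loss (helper names are illustrative):

```python
import torch
import torch.nn.functional as F

def bison_loss(logits, gold_ids, output_mask):
    """Cross-entropy over the |y| output positions of s = x (+) p.

    logits:      (batch, |s|, vocab) scores produced by the Transformer encoder
    gold_ids:    (batch, |s|) gold token ids (y_j at output positions, arbitrary elsewhere)
    output_mask: (batch, |s|) boolean, True for positions that belong to the output y
    """
    labels = gold_ids.masked_fill(~output_mask, -100)  # -100 is ignored by cross_entropy
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
```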

At inference time, the placeholder tokens can be replaced iteratively based on the probability distribution dt over the output vocabulary for each placeholder pt. Different sequence generation strategies are outlined in Section 3.2.

3.1 Placeholder Replacement Strategy

At inference time, the output sequence starts out as a sequence of placeholder tokens. To introduce this notion at training time, we require a strategy that replaces some output sequence tokens yj with the placeholder token p. The simplest approach would be to replace all output sequence tokens yj with the placeholder token p. However, with this approach the model is never confronted with a sequence containing a mix of output sequence tokens and placeholder tokens.

Due to the exponential number of possible replacement configurations per given token sequence, we introduce probabilistic generative models that we can use to draw diverse sequence replacements. To this end, we model the decision whether to use a placeholder or the original token with probabilistic models, two of which we propose in the following.

Bernoulli Random Variables (RV). We model each position of the input sequence with a binary random variable with a fixed mean. For every input sequence and every position i in the sequence, we draw from a Bernoulli variable with mean µ to decide whether to use yi or p. The expected number of placeholder tokens in a sequence of length |y| is |y|µ and the variance is σ2 = |y|µ(1 − µ). The variance of this strategy is determined by the mean and, therefore, the probabilistic model has one tunable parameter, µ.
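A sketch of this strategy (function and variable names are illustrative, not taken from the authors' code):

```python
import torch

def bernoulli_placeholders(y_ids, placeholder_id, mu=0.7):
    """Independently replace each output token y_i by the placeholder p
    with probability mu (one Bernoulli draw per position)."""
    use_placeholder = torch.rand(y_ids.shape) < mu
    return torch.where(use_placeholder,
                       torch.full_like(y_ids, placeholder_id),
                       y_ids)
```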

Gaussian Random Variables. The number of placeholder tokens of the input sequence p can be seen as drawn from an unknown optimal Binomial distribution. We can approximate this Binomial with a normal distribution N(µ, σ2), where µ is the mean and σ the standard deviation, both treated as hyperparameters. More formally, for every input sequence, we draw a value P ∼ N(µ, σ2). Multiplied with the sequence length |y|, the nearest integer value ⌊|y| · P⌉ is used as the number of placeholder tokens for the given sequence. The positions of the placeholder tokens in the sequence are then determined at random. The resulting probabilistic model's two parameters (mean µ and standard deviation σ) are not updated during training. Being able to tune the variance parameter independently of the mean parameter might provide an advantage over the parameterization with Bernoulli RVs.
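A corresponding sketch for the Gaussian strategy; clamping the sampled fraction to [0, 1] is an assumption not spelled out in the text.

```python
import torch

def gaussian_placeholders(y_ids, placeholder_id, mu=0.5, sigma=0.6):
    """Draw P ~ N(mu, sigma^2), use round(|y| * P) placeholder tokens and
    scatter them over uniformly random positions of the output sequence."""
    length = y_ids.size(0)
    P = torch.randn(1).item() * sigma + mu            # sample from N(mu, sigma^2)
    n_placeholders = int(round(max(0.0, min(1.0, P)) * length))
    positions = torch.randperm(length)[:n_placeholders]
    masked = y_ids.clone()
    masked[positions] = placeholder_id
    return masked
```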

3.2 Sequence Generation Strategies

Starting with a sequence of placeholder tokens at inference time, it is possible to generate output token sequences in arbitrary order. We experiment with the following strategies. The distributions used in all of these strategies are the distributions dt (t = 1, . . . , n) of the placeholders over the output vocabulary. We use the term uncover to mean that an output token is generated for a placeholder token.

One-step greedy. In a single time step, all placeholder tokens are uncovered simultaneously by picking the most probable token from the output vocabulary for each placeholder.

Highest probability. Placeholders are replaced iteratively; the placeholder to be uncovered is the one that assigns the highest probability to a token from the output vocabulary, indicating that the model is most confident about this token.

Lowest entropy. Placeholders are replaced iteratively; the placeholder to be uncovered is the one that exhibits the lowest entropy over its output vocabulary distribution, and the most likely token at this position is chosen. Intuitively, the lowest entropy indicates the position where the model's uncertainty in deciding between tokens of the output vocabulary is lowest.

Left-to-right. Placeholders are replaced iteratively, moving from left to right and thus mimicking the typical writing style for English. Note that this approach still differs from a Transformer decoder because future tokens are considered via the placeholder representations.

No look ahead. To test whether future placeholders hold useful information, we consider an adversarial sequence generation strategy: again we iteratively uncover placeholders from left to right, but we suppress all attention flows from future placeholders. This imitates the behaviour of a transformer decoder but follows the idea of predicting a token on a placeholder, rather than predicting the next word as is typically done in transformer decoders. If this performs worse than left-to-right, there is indeed valuable information in future placeholder tokens.
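For concreteness, here is a hedged sketch of the iterative "highest probability" strategy; `logits_fn` stands in for a call to the trained encoder and is an assumption for exposition.

```python
import torch

def uncover_highest_probability(logits_fn, input_ids, placeholder_id, max_len):
    """Iteratively uncover placeholders, always fixing the position whose most
    likely output token has the highest probability (illustrative sketch)."""
    seq = torch.cat([input_ids,
                     torch.full((max_len,), placeholder_id, dtype=torch.long)])
    covered = torch.zeros(seq.size(0), dtype=torch.bool)
    covered[-max_len:] = True                       # output part starts as placeholders
    for _ in range(max_len):
        probs = torch.softmax(logits_fn(seq.unsqueeze(0))[0], dim=-1)  # (|s|, vocab)
        best_prob, best_tok = probs.max(dim=-1)
        best_prob[~covered] = -1.0                  # only still-covered positions compete
        pos = int(best_prob.argmax())
        seq[pos] = best_tok[pos]                    # uncover this placeholder
        covered[pos] = False
    return seq[-max_len:]
```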

4 Experiments

We conduct a series of experiments to explore BISON's behavior. First, we want to compare the two token replacement strategies for training as well as the four generation strategies for inference. Second, we want to compare BISON to state-of-the-art methods and investigate the impact of its ability to attend to future tokens.

4.1 Datasets

We run experiments on the following two conversational datasets.

Goal-oriented SHARC (Saeidi et al., 2018). SHARC is a dialogue, text-based question-answering dataset. Unlike many popular QA datasets, answers cannot simply be extracted from the text. Given a regulatory text, such as a text from the UK government's website, and a user scenario with a corresponding question, it is necessary to interpret the text in the context of the specific user's needs. Before generating its final answer, a system may generate clarification questions. Finally, the system decides if the answer to the user's original question is “Yes”, “No” or “Irrelevant”, where the latter means the question cannot be answered with the given text.

We perform the evaluation with the official SHARC script. For a set of generated clarification questions, it computes BLEU n-gram scores for n = 1, 2, 3, 4 using the set of clarification questions in the gold responses. In each step of the conversation, the model under evaluation generates an output token sequence. This output is automatically assigned to the category “More” if it is a clarification question, and to “Yes”, “No”, or “Irrelevant” otherwise. Since this is a classification task, we can compute micro and macro accuracy for it. The final model is chosen using the highest BLEU-4 score on the development set.

The SHARC dataset has a hidden test set and, therefore, it is not feasible to evaluate our various model variants on it. Hence, we take 30 unique rule texts and their corresponding training examples from the training set. This leads to a new development set of 2,465 instances and leaves the official development set to be used as a test set here. Finally, we submitted our best model to be evaluated on the hidden test set.

Free-form DAILY DIALOG (Li et al., 2017). DAILY DIALOG is a dataset of written conversations occurring in daily life. Following the authors of the corpus, we report BLEU n-gram scores for n = 1, 2, 3, 4 for the generated output sequences with respect to the given gold responses. We tokenize these responses equally to ensure a fair comparison.

4.2 BISON Settings

We implement BISON based on the BERT PyTorch code [1] and initialize it with the pre-trained BERT model BERT-BASE-UNCASED (Devlin et al., 2018). Consequently, we employ the same model architecture and tokenisation as Devlin et al. (2018), resulting in a model with about 110M parameters. To remain compatible with the BERT model, we prepend each sequence with a [CLS] token and place a [SEP] token after the input context. Similarly, producing a second [SEP] token indicates the end of sequence generation. For the input context of SHARC, we follow Saeidi et al. (2018) and use the concatenation of question, rule text, scenario and history. The input context for DAILY DIALOG is the concatenation of all previous utterances.

On the SHARC and DAILY DIALOG training sets we train for 20 and 40 epochs, respectively, which equates in each case to about 200k seen examples. As optimizer we use ADAM (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999, an L2 weight decay of 0.01 and a learning rate warm-up over the first 10% of training steps. As learning rates we consider both the pre-training learning rate of BERT, 1e-4, and the fine-tuning learning rate, 3e-5. In preliminary experiments 3e-5 proved to be best for SHARC, whereas 1e-4 works best for DAILY DIALOG. We set the batch size to 15. Finally, the maximum sequence generation length is set to 50 for SHARC and to 100 for DAILY DIALOG, chosen based on values observed in the training data. As the maximum sequence length of the BERT model is 512, longer input sequences are truncated accordingly. For the main results, we employ the left-to-right sequence generation strategy, which we found to work best; results for the other sequence generation strategies are reported later.

[1] https://github.com/huggingface/pytorch-pretrained-BERT


For the Bernoulli RV approach, we test µ ∈ [0.2, 0.8] with increments of 0.1. For the Gaussian RV approach, we test all possible combinations of the hyperparameters µ ∈ {0.4, 0.5, 0.6} and σ ∈ {0.3, 0.6, 0.9}. The best combination on the SHARC dev set is µ = 0.5, σ = 0.6. It outperforms the best Bernoulli approach (µ = 0.7) by 3.4 points in BLEU-4 score. Some Bernoulli experiments in fact produced only a very small number of clarification questions; e.g., µ = 0.5 generated only 9 clarification questions on the development set, whereas the ground truth responses contain 846 clarification questions. This suggests that a high variance is important, as the Bernoulli setups all have a variance of 0.25 or lower and our best Gaussian approach has a variance of 0.6. We directly employ the Gaussian distribution with µ = 0.5, σ = 0.6 on the DAILY DIALOG task.

4.3 Baselines

To measure the success of our proposed approach, we consider the following three baselines.

Encoder-Decoder Transformer (E&D). First, we compare our bidirectional encoder to a standard encoder-decoder Transformer, where the decoder only has access to the tokens produced so far to compute its self-attention. We use the implementation of OpenNMT (Klein et al., 2017) and employ the parameters suggested by them, but adjust the learning rate to 0.1, which we found to work better for both datasets. Additionally, we increased the word and hidden dimension size to 768 and the number of attention heads to 12 to match the capacity of our model. Training ran for 50 epochs. Needing both an encoder and a decoder, this leads to a total of about 270M parameters.

Encoder-Decoder Transformer with BERT (E&D+B). The power of our bidirectional decoder stems from two advantages. First, we can initialize our model with the pre-trained BERT-BASE-UNCASED model. Second, the decoding process is bidirectional. It would be possible to transfer the first advantage to an encoder-decoder framework by using BERT embeddings. This is however only possible for the input sequence, because the bidirectionality of BERT requires the entire sequence to be available beforehand. Thus, we modify the implementation of OpenNMT to use the BERT model as the encoder. The weights are frozen when training the decoder, which produced better results than allowing the gradients to also flow through the BERT model. Again, with both an encoder and a decoder, this leads to a total of about 270M parameters.

Model    Micro Acc.  Macro Acc.  B-1   B-4
E&D      31.9        38.9        17.1   1.9
E&D+B    54.7        60.4        24.3   4.3
GPT2     60.4        65.1        53.7  33.9
BISON    64.9        68.8        61.8  46.2

Table 1: Results on the SHARC test set, averaged over 3 independent runs for GPT2 and BISON, reporting micro accuracy and macro accuracy for the classification task and BLEU-1 and BLEU-4 on instances for which a clarification question was generated. E&D uses no language model pre-training.

Model    Micro Acc.  Macro Acc.  B-1   B-4
E3       67.6        73.3        54.1  38.7
BISON    66.9        71.6        58.8  44.3

Table 2: Results on the official hidden SHARC test set of our model compared to the best model on the leaderboard, E3 (Zhong and Zettlemoyer, 2019).


GPT2. Radford et al. (2019) present a transformer decoder, GPT2, trained as a language model on large amounts of monolingual text. Radford et al. (2019) showed that it is possible to perform various tasks in a zero-shot setting by priming the language model with an input and letting it generate further words greedily. This setup can be transferred to a supervised setting, where the model is fine-tuned on a dataset by using maximum likelihood estimation to increase the probability of the gold output sequence (Wolf et al., 2019). As the starting point for the supervised learning, we initialize the model with the pre-trained model GPT-2-117M released by Radford et al. (2019) [2] and then fine-tune. With 117M parameters, this model is comparable to our model. Unlike the second baseline (E&D+B), this setup can directly employ a pre-trained model as our approach can, but it is not bidirectional.

[2] https://github.com/openai/gpt-2


4.4 Results

We report the results of our approach, the various baselines, as well as the previous state-of-the-art (SOTA) scores where applicable, in Tables 1 and 2 for SHARC and in Table 3 for DAILY DIALOG.

On the SHARC dataset, we observe very poor BLEU-4 performance for the encoder-decoder Transformer (E&D), which is consistent with results from Saeidi et al. (2018), who could not get an LSTM-based network to work without an additional classification head. Adding BERT (E&D+B) slightly improves performance. By directly leveraging a pre-trained model, GPT2 outperforms the previous models by a large margin, reaching 33.9% BLEU-4 and a micro accuracy of 60.4%. BISON is able to take future tokens into consideration and outperforms GPT2 by 12.3 percentage points in BLEU-4 and by 4.5 points in micro accuracy.

We submitted the best BISON model out of the three random runs of Table 1 to be evaluated on the hidden test set and report results in comparison to the best model on the leaderboard [3], E3 (Zhong and Zettlemoyer, 2019), in Table 2. BISON outperforms E3 by 5.6 BLEU-4 points, while it is only slightly worse than E3 in terms of accuracy.

On the DAILY DIALOG dataset the information retrieval-based method (IR in Table 3) introduced by Li et al. (2017) is very strong and outperforms the best end-to-end model (E2E) (Luo et al., 2018) by over 16 percentage points in BLEU-4. The best end-to-end model is based on LSTMs, and Luo et al. (2018) report performance increases when adding an attention module to their setup. The encoder-decoder transformer (E&D) outperforms this setup by over 2 percentage points in BLEU-4, and we conjecture that this is due to the transformer making more effective use of the attention principle. Adding BERT (E&D+B) does not help much on this dataset. But again we observe a large increase in performance when directly employing pre-trained models. GPT2 performs on par with the IR SOTA, achieving a BLEU-4 score of 19.4%. Again, BISON outperforms GPT2, here with a difference of 6.2 points in BLEU-4 and even larger increases in the other scores.

Effect of bidirectionality. To investigate whether our model benefits from bidirectionality, we consider a setup where BISON is not allowed to attend to future tokens during prediction (see Table 4). This causes a drop in BLEU-4 performance of about 25 points on the SHARC dataset and a drop of 10 points on the DAILY DIALOG dataset, showing that BISON has learnt during training to rely on the ability to attend to future tokens.

[3] https://sharc-data.github.io/leaderboard.html, 19 August 2019

Model    B-1   B-2   B-3   B-4
IR        -    25.8  20.4  19.4
E2E      14.2   5.7   3.8   2.8
E&D      22.3   6.8   5.7   5.2
E&D+B    26.1   7.3   6.0   5.5
---------------------------------
GPT2     42.3  23.6  20.7  19.4
BISON    54.9  32.6  28.0  25.6

Table 3: BLEU n-gram scores for n = 1, 2, 3, 4 on the DAILY DIALOG test set, averaged over 3 independent runs for GPT2 and BISON. Models above the line do not make use of a pre-trained language model. IR (SOTA) (Li et al., 2017) and E2E (SOTA) (Luo et al., 2018) are, to the best of our knowledge, the best previously published scores for information retrieval and end-to-end approaches.

SHARC
Model      Micro Acc.  Macro Acc.  B-1   B-4
BISON      64.9        68.8        61.8  46.2
past only  64.3        67.4        35.0  21.3

DAILY DIALOG (DD)
Model      B-1   B-2   B-3   B-4
BISON      54.9  32.6  28.0  25.6
past only  48.0  24.6  18.5  14.8

Table 4: Comparison of BISON to a setup where BISON is not allowed to attend to future tokens, i.e. past only, for SHARC and DAILY DIALOG (DD).


Effect of pre-trained model. We are curious how big the effect of the pre-trained model is. Thus, instead of starting with the BERT-BASE-UNCASED weights, we initialize BISON with random weights drawn from a normal distribution with mean 0.0 and standard deviation 0.02. Results are presented in Table 5 for SHARC and DAILY DIALOG. Even without a pre-trained language model, our approach outperforms the standard encoder-decoder transformer framework (E&D) on both datasets, although we had to increase the number of epochs to 40 for the SHARC dataset. On the DAILY DIALOG task, we even outperform GPT2. This demonstrates the effectiveness of our approach in itself, free of any pre-trained language model.


SHARC
Model    Micro Acc.  Macro Acc.  B-1   B-4
E&D      31.9        38.9        17.1   1.9
BISON    52.9        57.4        21.9   2.3

DAILY DIALOG (DD)
Model    B-1   B-2   B-3   B-4
E&D      22.3   6.8   5.7   5.2
BISON    46.3  27.0  23.6  22.4

Table 5: Best end-to-end models that do not use a pre-trained language model in comparison with BISON using randomly initialized weights, for SHARC and DAILY DIALOG (DD), averaged over 3 runs.

Strategy              SHARC  DAILY DIALOG
one-step greedy        22.9    9.3
lowest entropy         40.3   16.8
highest probability    50.9   16.4
left-to-right          46.2   23.8

Table 6: BLEU-4 using various sequence generation strategies for BISON on SHARC and DAILY DIALOG.


Effect of sequence generation strategies. We present the different sequence generation strategies in Table 6. The best overall sequence generation strategy is to predict from left to right, which achieves good results on both datasets. On the SHARC dataset the highest probability approach performs better than left-to-right. However, on DAILY DIALOG this approach is not as successful. This suggests that it might be worth selecting the best sequence generation strategy for each dataset individually. However, we hypothesize that left-to-right works consistently well due to the left-to-right nature of the English language. A brief experiment with a right-to-left strategy gave poor results.

5 Analysis

We believe that the placeholders capture sequential information present in the language model learned during pre-training. After running a transformer encoder where each position can attend to every other position, a placeholder token will have a probability distribution over the output vocabulary, and this distribution is informed by all other tokens in the input and output.

Dataset  α1           α2           α3
SHARC    92.6 ± 3.1    5.2 ± 2.4    2.2 ± 1.8
DD       97.0 ± 2.5    2.3 ± 2.3    0.7 ± 0.4

Dataset  ᾱ2            ᾱ3
SHARC    71.7 ± 13.7   28.3 ± 13.7
DD       70.2 ± 14.5   29.8 ± 14.5

Table 7: Average attention weights and standard deviation when predicting from left to right on both SHARC and DAILY DIALOG (DD) for different parts of the sequence, where α1 is for the input sequence x, α2/ᾱ2 is for the already produced sequence y and α3/ᾱ3 is for the sequence of remaining placeholder tokens p. The αk are the normalized attention weights across all three parts, whereas the ᾱk normalize over only the second and third part.

Thus, a placeholder could be seen as a mixture of tokens with varying probabilities. As placeholders are subsequently uncovered, the other placeholders can update their distribution by taking the newly revealed token into consideration.

For example, in Figure 2, for the sentence “is the animal an endangered animal ?”, while generating “endangered”, the self-attention head pays attention to the next placeholder token, which in the next step is revealed to be “animal”. While producing “endangered”, the distribution for the next position already placed a high probability on “animal”; thus the current token can take this into consideration and “endangered” is produced. Further heat maps demonstrating this can be found in the appendix.

To quantify this intuition, we measure the average attention score on various parts of the sequence. For this, we use the left-to-right prediction strategy. Thus, at time t, we can decompose our sequence into three parts, s = x ⊕ y ⊕ p, where x is the input, y the already produced sequence and p the remaining sequence of placeholder tokens. For each attention head, we can decompose the attention probabilities into three parts:

1. attention on the input text,

2. attention on the current word and already generated words (left of the current word),

3. attention on words that are yet to be generated (right of the current word).


Figure 2: Heat map (darker hues indicate higher attention) showing an example of where an attention head looks into the future while generating from left to right. Each row shows the attention over the output sequence for this row's placeholder token at that point in time. Words in previous rows have already been produced, whereas words of later rows still hold placeholder tokens. Thus the upper triangle of the matrix shows the attention that is paid to future tokens. The red square shows that while generating the token “endangered”, the attention head already takes the next placeholder into account, which is revealed to be “animal” in the next step. Best viewed in color.

This is mathematically expressed as

at = a0:|x| ⊕ a|x|+1:|x|+t ⊕ a|x|+t+1:|s|,

where |s| is the maximum possible sequence length. For each part we can calculate an average, leading to three values a1t, a2t and a3t.

Averaged over all T generation time steps and all D data points, we can derive a score for each part k = 1, 2, 3 and each attention head h:

αkh = (1/D) (1/T) Σ_{d=1}^{D} Σ_{t=1}^{T} akd,t.

Note that we use the attention heads in the final BERT self-attention layer. Averaging over all H attention heads, αk = (1/H) Σ_{h=1}^{H} αkh, leads to the results reported in Table 7 for both datasets. Unsurprisingly, we find that with scores of over 90% for both datasets, the majority of the attention is focused on the first part, i.e. the conditioning input x (see α1 in Table 7). The remaining attention is split between the already produced sequence (α2) and the future tokens (α3).
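A minimal sketch of this decomposition, assuming the attention distribution of the position uncovered at step t is available as a tensor of shape (heads, |s|); the function name is illustrative.

```python
import torch

def attention_shares(att, len_x, t):
    """Split attention over s = x (+) y (+) p into its three parts.

    att:   (heads, |s|) attention probabilities for the position uncovered at step t
    len_x: length of the input part x (including special tokens)
    t:     number of output positions already uncovered (current one included)
    """
    a1 = att[:, :len_x].sum(dim=-1)             # attention on the input text
    a2 = att[:, len_x:len_x + t].sum(dim=-1)    # attention on already produced tokens
    a3 = att[:, len_x + t:].sum(dim=-1)         # attention on future placeholders
    return a1, a2, a3
```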

To directly compare the relationship within the sequence generation, we re-normalize over α2 and α3, leading to new values ᾱ2 and ᾱ3 (see Table 7). Here we can see that the past, already produced tokens are about twice as important as the future, not-yet-produced tokens. But with scores of just under 30% on both datasets, we see that a substantial amount of attention is also focused on the future, not-yet-produced tokens.

Interestingly, with a standard deviation of about 14%, the values of ᾱ2 and ᾱ3 vary strongly across the different attention heads. For example, on the SHARC dataset we find one attention head where only about 9% of the attention is focused on the future and another where it is about 64%; the latter head thus pays more attention to the future than to the past. A graphical overview for both datasets can be found in the appendix.

6 Related Work

Transformers (Vaswani et al., 2017) model sequences as fully connected graphs and apply a bidirectional self-attention module where every token can attend to every other token. Because of this, a Transformer is not restricted to sequential orderings. However, Vaswani et al. (2017), inter alia, still restrict themselves to producing tokens from left to right and only allow a Transformer decoder to attend to previously produced tokens. Recently, several attempts have been made to lift the left-to-right restriction in Transformer- or LSTM-based models (Gu et al., 2019; Stern et al., 2019; Welleck et al., 2019; Zhou et al., 2019), but in those approaches it is not possible to attend to future, not-yet-produced tokens.

Concurrently to our work, Ghazvininejad et al. (2019) proposed a similar placeholder strategy for generation in the context of machine translation. However, they employ an encoder-decoder framework, whereas we only require an encoder, which more closely links input and output via a single shared attention module. Furthermore, they only consider uniform sampling of placeholders, whereas we found that a higher variance, which we can control with the Gaussian random variable approach, leads to better results.

Bidirectionality is one of the crucial ingredients in the success of the recently proposed unsupervised language model BERT (Devlin et al., 2018). For this, Devlin et al. (2018) propose a Transformer encoder to take full advantage of the bidirectional nature of the Transformer. Their resulting model, BERT, can be applied directly to various classification tasks but not to sequence generation tasks. Our approach shows how a Transformer encoder can be used for sequence generation, and this allows us to directly incorporate BERT into our experiments.

GPT (Radford et al., 2018) and GPT2 (Radford et al., 2019) are both pre-trained language models that instead use a Transformer decoder, which can only attend to already produced tokens. For dialogue, the GPT model has been fine-tuned on the chit-chat dataset PersonaChat (Zhang et al., 2018) by Wolf et al. (2019). While GPT and GPT2 can immediately be used as sequence generators, these models do not offer bidirectionality and they cannot attend to not-yet-produced tokens. Our bidirectional encoder for sequence generation can combine the best of both worlds.

7 Conclusion

We introduced bidirectional sequence generation by employing placeholders in the output sequence. These placeholder tokens are subsequently replaced by tokens of the output vocabulary. Crucially, this allows a transformer encoder to attend to both past and future, not-yet-produced tokens. Simply masking all placeholder tokens is not feasible; instead, we investigated two placeholder strategies, based on Bernoulli and Gaussian random variables. At prediction time, our approach is not restricted to producing the output sequence from left to right. However, this strategy proved to produce the most consistent results in our experiments.

Our approach outperforms previous end-to-end approaches that do not make use of any pre-trained language models. In conjunction with the pre-trained language model BERT, our bidirectional sequence generation approach allows us to achieve new state-of-the-art results on both conversational tasks. In the future, we would like to apply our approach to other sequence generation tasks. Additionally, we wonder whether a further performance increase could be achieved if the pre-training of BERT employed our placeholder strategy.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints, 1810.04805.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Constant-time machine translation with conditional masked language models. ArXiv e-prints, 1904.09324.

Jiatao Gu, Qi Liu, and Kyunghyun Cho. 2019. Insertion-based decoding with automatically inferred generation order. CoRR, abs/1902.01370.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA.

G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. ArXiv e-prints, 1701.02810.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP), Taipei, Taiwan.

Liangchen Luo, Jingjing Xu, Junyang Lin, Qi Zeng, and Xu Sun. 2018. An auto-encoder matching model for learning utterance-level semantic dependency in dialogue generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, and Dario Amodei. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion transformer: Flexible sequence generation via insertion operations. CoRR, abs/1902.03249.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS).

Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. 2019. Non-monotonic sequential text generation. CoRR, abs/1902.02192.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. ArXiv e-prints, 1901.08149.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

Victor Zhong and Luke Zettlemoyer. 2019. E3: Entailment-driven extracting and editing for conversational machine reading. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.

Long Zhou, Jiajun Zhang, and Chengqing Zong. 2019. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics, 7:91–105.

Appendix

A Forward & Backward Attention

Figures 3 and 4 present the normalized backward attention ᾱ2 and forward attention ᾱ3 for the different attention heads over the sequence generation part, for the SHARC and DAILY DIALOG datasets, respectively.

Figure 3: ᾱ2 (Backward Attention) and ᾱ3 (Forward Attention) across the 12 attention heads for the SHARC dataset.

Figure 4: ᾱ2 (Backward Attention) and ᾱ3 (Forward Attention) across the 12 attention heads for the DAILY DIALOG dataset.

B Heat Maps

The heat maps (darker hues indicate higher attention) of Figures 5 and 6 show examples of where an attention head strongly looks into the future while generating from left to right. Each row shows the attention over the output sequence for this row's placeholder token at that point in time. Words in previous rows have already been produced, whereas words of later rows still hold placeholder tokens. Thus the upper triangle of the matrix shows the attention that is paid to future tokens. The red square marks the point of interest. Best viewed in color.

Figure 5

In Figure 5, we can see that when deciding on the first word, “did”, high attention is paid to important future tokens. In particular, there is a strong focus on the 3rd placeholder, which is later revealed to be “want”. Attention is also paid to the position that is later revealed to become a question mark. This indicates that the model plans ahead and, realizing that a question will be formed, chooses the word “did”. Note that the similar sentence “you want to claim.” would not require the word “did”, as it would not be a question.

Also in Figure 5, both the words “want” and “to” pay strong attention to the final word “claim” in the phrase “want to claim”.

In Figure 6, when producing the word “ambulance”, the attention is focused on the next placeholder token, which in the next step is revealed to be the word “driver” in the noun phrase “ambulance driver”.

Figure 6