
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 823–835, July 5–10, 2020. ©2020 Association for Computational Linguistics


Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning

Joongbo Shin, Yoonhyung Lee, Seunghyun Yoon, Kyomin Jung
Seoul National University, Republic of Korea
{jbshin, cpi1234, mysmilish, kjung}@snu.ac.kr

Abstract

Even though BERT has achieved successful performance improvements in various supervised learning tasks, BERT is still limited by repetitive inferences on unsupervised tasks for the computation of contextual language representations. To resolve this limitation, we propose a novel deep bidirectional language model called a Transformer-based Text Autoencoder (T-TA). The T-TA computes contextual language representations without repetition and displays the benefits of a deep bidirectional architecture, such as that of BERT. In computation time experiments in a CPU environment, the proposed T-TA performs over six times faster than the BERT-like model on a reranking task and twelve times faster on a semantic similarity task. Furthermore, the T-TA shows competitive or even better accuracies than those of BERT on the above tasks. Code is available at https://github.com/joongbo/tta.

1 Introduction

A language model is an essential component of many natural language processing (NLP) applications ranging from automatic speech recognition (ASR) (Chan et al., 2016; Panayotov et al., 2015) to neural machine translation (NMT) (Sutskever et al., 2014; Sennrich et al., 2016; Vaswani et al., 2017). Recently, the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) and its variations have led to significant improvements in learning natural language representations and have achieved state-of-the-art performances on various downstream tasks such as the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) and question answering (Rajpurkar et al., 2016). BERT also continues to succeed in various unsupervised tasks, such as N-best list reranking for ASR and NMT (Shin et al., 2019; Salazar et al., 2019), confirming that deep bidirectional language models are useful in unsupervised applications as well.

However, when applied to unsupervised learning tasks, BERT is significantly inefficient at computing language representations at the inference stage (Salazar et al., 2019). During training, BERT adopts the masked language modeling (MLM) objective, which is to predict the original word of an explicitly masked word from the input sequence. Following the MLM objective, each contextual word representation must be computed by a two-step process: masking a word in the input and feeding the result to BERT. During the inference stage, this process is repeated n times to obtain the representations of all the words within a text sequence (Wang and Cho, 2019; Shin et al., 2019; Salazar et al., 2019), resulting in a computational complexity of O(n³)¹ in terms of the number of words n. Hence, it is necessary to reduce the computational complexity when applying the model to situations where the inference time is critical, e.g., mobile environments and real-time systems (Sanh et al., 2019; Lan et al., 2019). Considering this limitation of BERT, we pose a new research question: "Can we construct a deep bidirectional language model with a minimal inference time while maintaining the accuracy of BERT?"

In this paper, in response to the above question, we propose a novel bidirectional language model named the Transformer-based Text Autoencoder (T-TA), which has a reduced computational complexity of O(n²) when applied to unsupervised applications. The proposed model is trained with a new learning objective named language autoencoding (LAE). The LAE objective, which allows the target labels to be the same as the text input, is to predict every token in the input sequence simultaneously without merely copying the input to the output. To learn the proposed objective, we devise both a diagonal masking operation and an input isolation mechanism inside the T-TA based on the Transformer encoder (Vaswani et al., 2017). These components enable the proposed T-TA to compute contextualized language representations at once while maintaining the benefits of the deep bidirectional architecture of BERT.

¹A complexity of O(n²) is derived from the per-layer complexity of the Transformer (Vaswani et al., 2017).

We conduct a series of experiments on two unsupervised tasks: N-best list reranking and unsupervised semantic textual similarity. First, through runtime experiments in a CPU environment, we show that the proposed T-TA is 6.35 times faster than the BERT-like model on the reranking task and 12.7 times faster on the unsupervised semantic textual similarity task. Second, despite its faster inference time, the T-TA achieves performances competitive with BERT on reranking tasks. Furthermore, the T-TA outperforms BERT by up to 8 points in Pearson's r on unsupervised semantic textual similarity tasks.

2 Related Works

For autoencoders in language modeling, sequence-to-sequence learning approaches have been commonly used. These approaches encode a given sentence into a compressed vector representation, followed by a decoder that reconstructs the original sentence from the sentence-level representation (Sutskever et al., 2014; Cho et al., 2014; Dai and Le, 2015). To the best of our knowledge, however, none of these approaches consider an autoencoder that encodes word-level representations (such as BERT) without an autoregressive decoding process.

Many studies have been performed on neural network-based language models for word-level representations. Distributed word representations were proposed and attracted considerable interest, as they were considered to be fundamental building blocks for NLP tasks (Rumelhart et al., 1986; Bengio et al., 2003; Mikolov et al., 2013b). Subsequently, researchers explored contextualized representations of text in which each word has a different representation depending on the context (Peters et al., 2018; Radford et al., 2018). Most recently, a Transformer-based deep bidirectional model was proposed and applied to various supervised learning tasks with remarkable success (Devlin et al., 2019).

For unsupervised tasks, researchers have adopted recently developed language-representation models and investigated their effectiveness; a typical example is N-best list reranking for ASR and NMT tasks. In particular, studies have integrated left-to-right and right-to-left language models (Arisoy et al., 2015; Chen et al., 2017; Peris and Casacuberta, 2015) to outperform conventional unidirectional language models (Mikolov et al., 2010; Sundermeyer et al., 2012) in these tasks. Furthermore, BERT-based approaches have been explored and have achieved significant performance improvements on these tasks because bidirectional language models yield the pseudo-log-likelihood of a given sentence, and this score is useful for ranking the n-best hypotheses (Wang and Cho, 2019; Shin et al., 2019; Salazar et al., 2019).

Another line of research involves reducing the computation time and memory consumption of BERT. Lan et al. (2019) proposed two parameter-reduction techniques, factorized embedding parameterization and cross-layer parameter sharing, and reported 18 times fewer parameters and a 1.7-fold speedup in training. Similarly, Sanh et al. (2019) presented a method to pretrain a smaller model that can be fine-tuned for downstream tasks, achieving 1.4 times fewer parameters with a 1.6-fold speedup at inference. However, none of these studies developed methods that directly revise the BERT architecture to reduce the computational complexity during the inference stage.

3 Language Model Baselines

In a conventional language modeling task, the i-th token x_i is predicted using its preceding context x_{<i} = [x_1, ..., x_{i-1}]; throughout this paper, this objective is called causal language modeling (CLM) following Conneau and Lample (2019). As shown in Figure 1a, we can obtain (left-to-right) contextualized language representations H^C = [H^C_1, ..., H^C_n] after feeding the input sequence to the CLM-trained language model only once, where H^C_i = h^C(x_{<i}) is the hidden representation of the i-th token. This paper takes this unidirectional language model (uniLM) as our speed baseline. However, contextualized language representations obtained from the uniLM are insufficient to accurately encode a given text because future contexts cannot be leveraged to understand the current tokens during the inference stage.

Figure 1: Schematic diagrams of language models for the (a) CLM, (b) MLM, and (c) LAE objectives.

Recently, BERT (Devlin et al., 2019) was designed to enable the full contextualization of language representations by using the MLM objective, in which some tokens from the input sequence are randomly masked; the objective is to predict the original tokens at the masked positions using only their context. As in Figure 1b, we can obtain a contextualized representation of the i-th token H^M_i = h^M(M_i(x)) by masking the token in the input sequence and feeding it to the MLM-trained model, where M_i(x) = [x_1, ..., x_{i-1}, [MASK], x_{i+1}, ..., x_n] signifies an external masking operation. This paper takes this bidirectional language model (biLM) as our performance baseline. However, this mask-and-predict approach must be repeated n times to obtain all the language representations H^M = [H^M_1, ..., H^M_n] because learning occurs only at the masked position during the MLM training stage. Although the resulting language representations are robust and accurate, as a consequence of this repetition, the model is significantly inefficient when applied to unsupervised tasks such as N-best list reranking (Wang and Cho, 2019; Shin et al., 2019; Salazar et al., 2019).
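To make the cost of this repetition concrete, the sketch below (not the authors' code; `mlm_forward` is a hypothetical stand-in for one full forward pass of an MLM-trained model) shows the n-fold mask-and-predict loop needed to collect H^M for a single sentence:

```python
# Sketch of the mask-and-predict loop required by an MLM-trained biLM.
# `mlm_forward` is a placeholder; in practice it would be a full BERT-style forward pass.
import numpy as np

HIDDEN, MASK_ID = 16, 3

def mlm_forward(token_ids):
    # Placeholder forward pass: returns one hidden vector per input position.
    rng = np.random.default_rng(sum(token_ids))
    return rng.standard_normal((len(token_ids), HIDDEN))

def bilm_representations(token_ids):
    """Compute H^M by masking one position at a time (n forward passes in total)."""
    reps = []
    for i in range(len(token_ids)):
        masked = list(token_ids)
        masked[i] = MASK_ID              # external masking operation M_i(x)
        hidden = mlm_forward(masked)     # one full forward pass per position
        reps.append(hidden[i])           # keep only the masked position's output
    return np.stack(reps)                # shape: (n, HIDDEN)

print(bilm_representations([5, 42, 7, 99]).shape)  # (4, 16)
```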

4 Proposed Method

4.1 Language Autoencoding

In this paper, we propose a new learning objective named language autoencoding (LAE) for obtaining fully contextualized language representations without repetition. The LAE objective, with which the output is the same as the input, is to predict every token in a text sequence simultaneously without merely copying the input to the output. For the proposed task, a language model should reproduce the whole input at once while avoiding overfitting; otherwise, the model outputs only the representation copied from the input representation without learning any statistics of the language. To this end, the flow of information from the i-th input to the i-th output should be blocked inside the model, as shown in Figure 1c. From the LAE objective, we can obtain fully contextualized language representations H^L = [H^L_1, ..., H^L_n] all at once, where H^L_i = h^L(x_{\i}) and x_{\i} = [x_1, ..., x_{i-1}, x_{i+1}, ..., x_n]. The method for blocking the flow of information is described in the next section.
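A minimal sketch of the LAE training loss under the description above: the targets are simply the input tokens, and the per-position logits are assumed to come from a model that internally blocks the i-th input when predicting position i (the names below are illustrative, not from the paper):

```python
# Minimal sketch of the LAE loss: targets equal the inputs, and every position
# is predicted simultaneously. `logits` is assumed to come from a self-unknown model.
import numpy as np

def lae_loss(logits, token_ids):
    """Mean cross-entropy over all positions, with targets = inputs."""
    logits = logits - logits.max(axis=-1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    n = len(token_ids)
    return -log_probs[np.arange(n), token_ids].mean()

rng = np.random.default_rng(0)
dummy_logits = rng.standard_normal((4, 1000))   # (n, |V|) outputs for one sentence
print(lae_loss(dummy_logits, np.array([5, 42, 7, 99])))
```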

4.2 Transformer-based Text Autoencoder

In this section, we introduce the novel architecture of the proposed T-TA shown in Figure 2. As indicated by its name, the T-TA architecture is based on the Transformer encoder (Vaswani et al., 2017). To learn the proposed LAE objective, we develop both a diagonal masking operation and an input isolation mechanism inside the T-TA. Both developments are designed to enable the language model to predict all tokens simultaneously while maintaining the deep bidirectional property (see the descriptions in the following subsections). For brevity, we refer to the original paper on the Transformer encoder (Vaswani et al., 2017) for other details regarding the standard functions, such as the multihead attention and scaled dot-product attention mechanisms, layer normalization, and the position-wise fully connected feed-forward network.

4.2.1 Diagonal Masking

As shown in Figure 3, a diagonal masking operation is implemented inside the scaled dot-product attention mechanism so that each position is "self-unknown" during the inference stage. This operation prevents information from flowing to the same position in the next layer by masking out the diagonal values in the input of the softmax function. Specifically, the output vector at each position is the weighted sum of the values V at the other positions, where the attention weights come from the query Q and the key K.
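The following sketch illustrates this diagonal-masked scaled dot-product attention in plain NumPy (single head, no learned projections; a simplification for illustration, not the full multihead implementation):

```python
# Sketch of diagonal-masked scaled dot-product attention (DMSA): the diagonal of
# QK^T / sqrt(d) is masked out before the softmax so that position i never attends to itself.
import numpy as np

def dmsa(Q, K, V):
    """Diagonal-masked attention; Q, K, V have shape (n, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) attention logits
    np.fill_diagonal(scores, -1e9)                  # diagonal mask: keep positions "self-unknown"
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each output excludes its own input

n, d = 5, 8
rng = np.random.default_rng(0)
Z = dmsa(rng.standard_normal((n, d)), rng.standard_normal((n, d)), rng.standard_normal((n, d)))
print(Z.shape)  # (5, 8)
```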


Figure 2: Architecture of our T-TA. The highlighted box and dashed arrows are the innovations presented in this paper.

The diagonal mask becomes meaningless when we use it together with a residual connection or within a multilayer architecture. To retain the self-unknown property, we could remove the residual connection and adopt a single-layer architecture. However, a deep architecture is essential to capture the intricate patterns of natural language. To this end, we further develop the architecture described in the next section.

4.2.2 Input Isolation

We now propose an input isolation mechanism to ensure that the residual connection and the multilayer architecture are compatible with the above-mentioned diagonal masking operation. In the input isolation mechanism, the key and value inputs (K and V, respectively) of all encoding layers are isolated from the network flow and are fixed to the sum of the token embeddings and the position embeddings. Hence, only the query inputs (Q) are updated across the layers during the inference stage by referring to the fixed output of the embedding layer.

Figure 3: Diagonal masking of the scaled dot-product attention mechanism. The highlighted box and dashed arrow represent the innovations reported in this paper.

Additionally, we feed the position embeddings into the Q of the very first encoding layer, thereby making the self-attention mechanism effective. Otherwise, the attention weights would be the same at all positions, and the first self-attention mechanism would function as a simple average of all the input representations (except the "self" position). Finally, we apply the residual connection only to the query so that the self-unknown property is fully maintained. The dashed arrows in Figure 2 show the proposed input isolation mechanism inside the T-TA.

By using diagonal masking and input isolation in conjunction, the T-TA can have multiple encoder layers, enabling it to obtain high-quality contextual language representations after feeding a sequence into the model only once.
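The sketch below illustrates how diagonal masking and input isolation fit together, under simplifying assumptions (single head, no learned projections, layer normalization and the feed-forward sub-layer omitted): K and V stay fixed at X + P for every layer, and only the query stream, initialized to P, is updated across layers.

```python
# Sketch of the input isolation mechanism: fixed K = V = X + P, query stream starts at P,
# and the residual connection is applied to the query only. Sub-layer weights are omitted
# to keep the sketch short; this is not the full T-TA implementation.
import numpy as np

def dmsa(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    np.fill_diagonal(scores, -1e9)                  # keep position i blind to itself
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def t_ta_encode(X, P, num_layers=3):
    """Return contextual representations for token embeddings X and position embeddings P."""
    KV = X + P                                      # fixed key/value for all layers
    Q = P                                           # query starts from positions only
    for _ in range(num_layers):
        Q = Q + dmsa(Q, KV, KV)                     # residual applied to the query only
        # (layer norm and the position-wise feed-forward sub-layer omitted here)
    return Q

n, d = 5, 8
rng = np.random.default_rng(0)
H = t_ta_encode(rng.standard_normal((n, d)), rng.standard_normal((n, d)))
print(H.shape)  # (5, 8)
```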

4.3 Discussion and Analysis

Heretofore, we have introduced the new learning objective named LAE and the novel deep bidirectional language model named T-TA. We verify the architecture of the proposed T-TA in Section 4.3.1 and compare our model with the recently proposed strong baseline, BERT, in Section 4.3.2.

4.3.1 Verification of the Architecture

Here, we discuss in detail how diagonal masking with input isolation preserves the "self-unknown" property.

As shown in Figure 2, we have two input embeddings, namely, token embeddings X = [X_1, ..., X_n]^T ∈ R^{n×d} and position embeddings P = [P_1, ..., P_n]^T ∈ R^{n×d}, where d is the embedding dimension. Owing to the input isolation mechanism, the key and value K = V = X + P carry the information of the input tokens and are fixed in all layers, but the query Q^l is updated across the layers during the inference stage, starting from the position embeddings Q^1 = P in the first layer.

Let us consider the l-th encoding layer's query input Q^l and its output H^l = Q^{l+1}:

    H^l = SMSAN(Q^l, K, V)
        = g(Norm(Add(Q^l, f(Q^l, K, V)))),        (1)

where SMSAN(·) is the self-masked self-attention network, namely, the encoding layer of the T-TA; g(x) = Norm(Add(x, FeedForward(x))) signifies the two upper subboxes of the encoding layer in Figure 2; and f(·) is the (multihead) diagonal-masked self-attention (DMSA) mechanism. As illustrated in Figure 3, the DMSA module computes Z^l as follows:

    Z^l = f(Q^l, K, V) = DMSA(Q^l, K, V)
        = SoftMax(DiagMask(Q^l K^T / √d)) V.      (2)

In the DMSA module, the i-th element of Z^l = [Z^l_1, ..., Z^l_n]^T is always computed as a weighted average of the fixed V that discards the information of the i-th token X_i in V_i. Specifically, Z^l_i is the weighted average of V with the attention weight vector s^l_i, i.e., Z^l_i = s^l_i V, where s^l_i = [s^l_1, ..., s^l_{i-1}, 0, s^l_{i+1}, ..., s^l_n] ∈ R^{1×n}.

Here, we note that only the DMSA mechanism is relevant to the "self-unknown" property since no token representations refer to each other in the subsequent transformations from Z^l to H^l. Therefore, we can guarantee that the i-th element of the query representation in any layer, Q^l_i, never encounters the corresponding token representation, starting from Q^1_i = P_i. Consequently, the T-TA preserves the "self-unknown" property during the inference stage while maintaining the residual connection and multilayer architecture.

4.3.2 Comparison with BERT

There are several differences between the strong baseline BERT (Devlin et al., 2019) and the proposed T-TA, although both models learn deep bidirectional language representations.

• While BERT uses an external masking operation on the input, the T-TA has an internal masking operation in the model, as we intend. Additionally, while BERT is based on a denoising autoencoder, the T-TA is based on an autoencoder. With this novel approach, the T-TA does not need mask-and-predict repetition when computing contextual language representations. Consequently, we reduce the computational complexity from O(n³) with BERT to O(n²) with the T-TA in applications to unsupervised learning tasks.

• As with the T-TA, feeding an intact input (without masking) into BERT is also possible. However, we argue that this will significantly diminish the model performance in unsupervised applications since the MLM objective pays little attention to intact tokens during training. In the next section, we include experiments that reveal the model performance with intact inputs (see Tables 1, 3, and 4). For further reference, a previous study reported the same observation (Salazar et al., 2019).

5 Experiments

To evaluate the proposed method, we conduct a series of experiments. We first evaluate the contextual language representations obtained from the T-TA on N-best list reranking tasks. We then apply our method to unsupervised semantic textual similarity (STS) tasks. The following sections demonstrate that the proposed model is much faster than BERT during the inference stage (Section 5.2) while showing competitive or even better accuracies than those of BERT on reranking tasks (Section 5.3) and STS tasks (Section 5.4).

5.1 Language Model Setups

The main purpose of this paper is to compare the proposed T-TA with a biLM trained with the MLM objective. For a fair comparison, each model has the same number of parameters based on the Transformer as follows: |L| = 3 self-attention layers with d = 512 input and output dimensions, h = 8 attention heads, and d_f = 2048 hidden units for the position-wise feed-forward layers. We use the Gaussian error linear unit (GELU) activation function (Hendrycks and Gimpel, 2016) rather than the standard rectified linear unit (ReLU), following OpenAI GPT (Radford et al., 2018) and BERT (Devlin et al., 2019). In our experiments, we set the position embeddings to be trainable following BERT (Devlin et al., 2019) rather than using a fixed sinusoid (Vaswani et al., 2017), with supported sequence lengths up to 128 tokens. We use WordPiece embeddings (Wu et al., 2016) with a vocabulary of approximately |V| ≈ 30,000 tokens. The weights of the embedding layer and the last softmax layer of the Transformer are shared. For the speed baseline, we also implement a uniLM that has the same number of parameters as the T-TA and biLM.

For training, we create each training instance from a single sentence with [BOS] and [EOS] tokens at the beginning and end, respectively. We use 64 sentences per training batch and train the language models for 1M steps for ASR and 2M steps for NMT. We train the language models with Adam (Kingma and Ba, 2014) with an initial learning rate of 1e-4 and coefficients β₁ = 0.9 and β₂ = 0.999; the learning rate is warmed up over the first 50k steps and then decays linearly. We use a dropout probability of 0.1 on all layers. Our implementation is based on Google's official code for BERT².

To train the language models that we implement, we use an English Wikipedia dump (approximately 13 GB in size) containing approximately 120M sentences. The trained models are used for reranking in NMT and for unsupervised STS tasks. For the ASR reranking task, we use additional in-domain training data, namely, 4.0 GB of normalized text data from the official LibriSpeech corpus containing approximately 40M sentences.

5.2 Runtime Analysis

We first measure the runtime of each language model when computing the contextual language representations H^L ∈ R^{n×d} of a given text sequence. In the unsupervised STS tasks, we directly use H^L for the analysis. In the case of the reranking task, further computation is required: we compute Softmax(H^L E^T) to obtain the likelihood of each token, where E ∈ R^{|V|×d} is the weight parameter of the softmax layer. Therefore, the computational cost of the reranking task is larger than that of the STS task.

To measure the runtime, we use an Intel(R) Core(TM) i7-6850K CPU (3.60 GHz) and the TensorFlow 1.12.0 library with Python 3.6.8 on Ubuntu 16.04.6 LTS. In each experiment, we measure the runtime 50 times and average the results.

Figure 4 shows that the T-TA exhibits faster runtimes than the biLM, and the gap between the T-TA and the biLM widens as the sentence becomes longer. To facilitate a numerical comparison, we set the standard number of words to 20, which is approximately the average number of words in a contemporary English sentence (DuBay, 2006). In this setup, on the STS tasks, the T-TA takes approximately 9.85 ms, while the biLM takes approximately 125 ms; hence, the T-TA is 12.7 times faster than the biLM. On the reranking task, the T-TA is 6.35 times faster than the biLM (which is still significant); the smaller speedup occurs because the repetition of the biLM affects only the computation of H^L and not that of Softmax(H^L E^T).

Figure 4: Average runtimes of each model according to the number of words on the STS and reranking tasks, subscripted as sts and rrk, respectively.

²https://github.com/google-research/bert

For the visual clarity of Figure 4, we omit the runtime results of the uniLM, which is as fast as the T-TA (see Appendix B.1). With such a fast inference time, we next demonstrate that the T-TA is as accurate as BERT.

5.3 Reranking the N-best List

To evaluate the language models, we conduct experiments on the unsupervised task of reranking the N-best list. In these experiments, we apply each language model to rerank the 50 best candidate sentences, which are obtained in advance using each sequence-to-sequence model on ASR and NMT. The ASR and NMT models we implement are detailed in Appendices A.1 and A.2.

We rescore the sentences by linearly interpolating two scores, one from a sequence-to-sequence model and one from each language model, as follows:

    score = (1 − λ) · score_s2s + λ · score_lm,

where score_s2s is the score from the sequence-to-sequence model, score_lm is the score from the language model calculated as the sum (or mean) of the log-likelihood of each token, and the interpolation weight λ is set to the value that leads to the best performance on the development set.
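A sketch of this rescoring step, assuming `hidden` is H^L from a language model, `E` is the shared softmax weight matrix, and the language model score is the sum of per-token log-likelihoods (names are illustrative, not the authors' code):

```python
# Sketch of sentence rescoring: per-token log-likelihoods from Softmax(H E^T),
# summed and interpolated with the sequence-to-sequence score.
import numpy as np

def lm_sentence_score(hidden, E, token_ids):
    """Sum of log Softmax(H E^T) evaluated at the observed tokens."""
    logits = hidden @ E.T                                   # (n, |V|)
    logits -= logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return log_probs[np.arange(len(token_ids)), token_ids].sum()

def rerank_score(score_s2s, score_lm, lam=0.3):
    return (1.0 - lam) * score_s2s + lam * score_lm

rng = np.random.default_rng(0)
hidden, E = rng.standard_normal((4, 16)), rng.standard_normal((1000, 16))
print(rerank_score(-2.3, lm_sentence_score(hidden, E, np.array([5, 42, 7, 99]))))
```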


As a strong baseline language model, the pretrained BERT-base-uncased model (Devlin et al., 2019) is used for the reranking tasks. We also include the reranking results of traditional count-based 5-gram language models trained on each dataset using the KenLM library (Heafield, 2011).

We note that the T-TA and biLM (including BERT) assign the pseudo-log-likelihood as the score of a given sentence, whereas the uniLM assigns the log-likelihood. Because the reranking task is based on the relative scores of the n-best hypotheses, the fact that the bidirectional models yield the pseudo-log-likelihood of a given sentence does not impact this task (Wang and Cho, 2019; Shin et al., 2019; Salazar et al., 2019).

5.3.1 Results on ASR

For reranking in ASR, we use N-best lists obtained in advance from the dev and test sets using Seq2Seq_ASR, which we train on the LibriSpeech ASR corpus. Additionally, we use the N-best lists obtained from Shin et al. (2019) to confirm the robustness of the language models in another testing environment. Table 1 shows the word error rates (WERs) of each method after reranking. The interpolation weights λ are 0.3 or 0.4 for all N-best lists for ASR.

First, we confirm that the bidirectional models trained with the LAE (T-TA) and MLM (biLM) objectives consistently outperform the uniLM trained with the CLM objective. The performance gains from reranking are much smaller for the better base system Seq2Seq_ASR, and it is evidently challenging to rerank the N-best list using a language model if the speech recognition model already performs well enough. Interestingly, the T-TA is competitive with (or even better than) the biLM; this may result from the gap between the training and testing of the biLM: the biLM predicts multiple masks at a time during training but predicts only one mask at a time during testing. Moreover, the 3-layer T-TA is better than the 12-layer BERT-base, showing that in-domain data are critical for language model applications.

Finally, we note that feeding an intact input to BERT (denoted as "w/ BERT\M" in Table 1) causes the model to underperform relative to the other models, demonstrating that the mask-and-predict approach is necessary for effective reranking.

Method            dev-clean  dev-other  test-clean  test-other
Shin et al.          7.17      19.79       7.25       20.37
  w/ n-gram          5.62      16.85       5.75       17.72
  w/ *uniSANLMw      6.05      17.32       6.11       18.13
  w/ *biSANLMw       5.52      16.61       5.65       17.37
  w/ BERT            5.24      16.56       5.38       17.46
  w/ BERT\M          7.08      19.61       7.14       20.18
  w/ uniLM           5.07      16.20       5.14       17.00
  w/ biLM            4.94      16.09       5.14       16.81
  w/ T-TA            4.98      16.09       5.11       16.91
Seq2Seq_ASR          4.11      12.31       4.31       13.14
  w/ n-gram          3.94      11.93       4.15       12.89
  w/ BERT            3.72      11.59       3.97       12.46
  w/ BERT\M          4.09      12.26       4.28       13.15
  w/ uniLM           3.82      11.73       4.05       12.63
  w/ biLM            3.73      11.53       3.97       12.41
  w/ T-TA            3.67      11.56       3.97       12.38

Table 1: WERs after reranking with each language model on LibriSpeech. The 'other' sets are recorded in noisier environments than the 'clean' sets. Bold font denotes the best performance on each subtask, and * signifies a word-level language model from Shin et al. (2019).

5.3.2 Results on NMT

To compare the reranking performances in another domain, NMT, we again prepare N-best lists using Seq2Seq_NMT³ on the WMT13 German-to-English (De→En) and French-to-English (Fr→En) test sets. Table 2 shows the bilingual evaluation understudy (BLEU) scores of each method after reranking. For NMT, each interpolation weight is set to the value that gives the best performance on each test set with each method; the resulting weights λ are 0.4 or 0.5 for the N-best lists for NMT.

Method         De→En   Fr→En
Seq2Seq_NMT    27.83   29.63
  w/ n-gram    28.41   30.04
  w/ BERT      29.31   30.52
  w/ uniLM     28.80   30.21
  w/ biLM      28.76   30.32
  w/ T-TA      28.83   30.20

Table 2: BLEU scores after reranking with each language model on WMT13. Bold font denotes the best performance on each subtask, and underlined values signify the best performances among our implementations.

We confirm again that the bidirectional models trained with the LAE and MLM objectives perform better than the uniLM trained with the CLM objective. Additionally, reranking has less effect on the Fr→En translation than on the De→En translation because the base NMT system for Fr→En is better than that for De→En. The 12-layer BERT model appears much better than the other models at reranking on NMT; hence, the N-best hypotheses of the NMT model seem to be more indistinguishable than those of the ASR model from a language modeling perspective.

³The Seq2Seq models for De→En and Fr→En are trained independently using the t2t library (Vaswani et al., 2018).

All the reranking results on the ASR and NMT tasks demonstrate that the proposed T-TA performs both efficiently (similar to the uniLM) and effectively (similar to the biLM).

5.4 Unsupervised STS

In addition to the reranking task, we apply the language models to an STS task, that is, measuring the similarity between the meanings of sentence pairs. We use the STS Benchmark (STS-B) (Cer et al., 2017) and Sentences Involving Compositional Knowledge (SICK) (Marelli et al., 2014) datasets, both of which consist of sentence pairs with corresponding similarity scores. The evaluation metric for STS is Pearson's r between the predicted similarity scores and the reference scores of the given sentence pairs.

In this section, we address the unsupervised STS task to examine the inherent ability of each language model to obtain contextual language representations, and we mainly compare the language models that are trained on the English Wikipedia dump. To compute the similarity score of a given sentence pair, we use the cosine similarity of the two sentence representations, where each representation is obtained by averaging the language model's contextual representations. Specifically, the contextual representations of a given sentence are the outputs of the final encoding layer of each model, denoted as context in Tables 3 and 4. For comparison, we also use noncontextual representations, which are obtained from the outputs of the embedding layer, denoted as embed in Tables 3 and 4.
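A sketch of this scoring procedure, where `reps_a` and `reps_b` stand in for the final-layer (or embedding-layer) outputs of a language model for the two sentences:

```python
# Sketch of the unsupervised STS score: average the token-level representations of
# each sentence and take the cosine similarity of the two mean-pooled vectors.
import numpy as np

def sts_score(reps_a, reps_b):
    """Cosine similarity of mean-pooled token representations."""
    a, b = reps_a.mean(axis=0), reps_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
print(sts_score(rng.standard_normal((7, 16)), rng.standard_normal((9, 16))))
```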

As a strong baseline for unsupervised STS tasks, we also include the 12-layer BERT model (Devlin et al., 2019), and we employ BERT in the mask-and-predict approach for computing the contextual representations of each sentence. Note that we use the most straightforward approach to the unsupervised STS task in order to focus on comparing token-level language representations.

Method        STS-B-dev          STS-B-test
              context   embed    context   embed
BERT           64.78     -        54.22     -
BERT\M         59.17    60.07     47.91    48.19
BERT[CLS]      29.16              17.18
uniLM          56.25    63.87     39.57    55.00
uniLM[EOS]     40.75              38.30
biLM           59.99     -        50.76     -
biLM\M         53.20    58.80     36.51    49.08
T-TA           71.88    54.75     62.27    44.74
GloVe            -      52.4        -      40.6
Word2Vec         -      70.0        -      56.5

Table 3: Pearson's r × 100 results on the STS-B dataset. "-" denotes an infeasible value, and bold font denotes the top 2-performing models on each subtask.

5.4.1 Results on STS-B

The STS-B dataset has 5749/1500/1379 sentence pairs in its train/dev/test splits, with corresponding scores ranging from 0 to 5. We test the language models on the STS-B-dev and STS-B-test sets using the simplest approach to the unsupervised STS task. As additional baselines, we include the results of GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013a) from the official STS Benchmark site⁴.

Table 3 shows that, among the Transformer-based language models, our T-TA trained with the LAE objective best captures the semantics of a sentence. Remarkably, our 3-layer T-TA trained on a relatively small dataset outperforms the 12-layer BERT trained on a larger dataset (Wikipedia + BookCorpus). Furthermore, the embedding representations are trained better by the CLM objective than by the other language modeling objectives; we suppose that the uniLM depends strongly on the embedding layer due to its unidirectional context constraint.

Since the uniLM encodes all contexts in the last token, [EOS], we also use this last representation as the sentence representation; however, this approach does not outperform the averaged sentence representation. Similarly, BERT has a special token, [CLS], which is trained for the "next sentence prediction" objective; thus, we also use the [CLS] token to see how well this model learns a sentence representation, but it significantly underperforms the other models.

⁴http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Method     SICK-test
           context   embed
BERT        64.31      -
BERT\M      61.18    64.63
uniLM       54.20    65.69
biLM        58.98      -
biLM\M      53.79    62.67
T-TA        69.49    60.77

Table 4: Pearson's r × 100 results on the SICK dataset. "-" denotes an infeasible value, and bold font denotes the best performance on each subtask.

5.4.2 Results on SICK

We further evaluate the language models on the SICK dataset, which consists of 4934/4906 sentence pairs in its training/testing splits, with scores ranging from 1 to 5. The results are shown in Table 4, from which we draw the same observations as those reported for STS-B.

All results on the unsupervised STS tasks demonstrate that the T-TA learns textual semantics best through its token-level LAE objective.

6 Conclusion

In this work, we propose a novel deep bidirectional language model, namely, the T-TA, to eliminate the computational overhead of applying BERT to unsupervised applications. Experimental results on N-best list reranking and unsupervised STS tasks demonstrate that the proposed T-TA is significantly faster than the BERT-like approach, and its encoding ability is competitive with (or even better than) that of BERT.

Acknowledgments

K. Jung is with ASRI, Seoul National University, Korea. This work was supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10073144) and by the NRF grant funded by the Korean government (MSIT) (NRF2016M3C4A7952587).

References

Ebru Arisoy, Abhinav Sethy, Bhuvana Ramabhadran, and Stanley Chen. 2015. Bidirectional recurrent neural network language models for automatic speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5421–5425. IEEE.

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE.

Xie Chen, Anton Ragni, Xunying Liu, and Mark JF Gales. 2017. Investigating bidirectional recurrent neural network language models for speech recognition. In INTERSPEECH, pages 269–273.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

William H DuBay. 2006. The classic readability studies. Impact Information, Costa Mesa, California.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. arXiv preprint arXiv:1606.08415.

Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan. 2017. Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. Proc. Interspeech 2017, pages 949–953.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 1–8.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Alvaro Peris and Francisco Casacuberta. 2015. A bidirectional recurrent neural language model for machine translation. Procesamiento del Lenguaje Natural, 55:109–116.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Julian Salazar, Davis Liang, Toan Q Nguyen, and Katrin Kirchhoff. 2019. Pseudolikelihood reranking with masked language models. arXiv preprint arXiv:1910.14659.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.

Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University.

Joonbo Shin, Yoonhyung Lee, and Kyomin Jung. 2019. Effective sentence scoring method using BERT for speech recognition. In Asian Conference on Machine Learning, pages 1081–1093.

Martin Sundermeyer, Ralf Schluter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193–199.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson-Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. 2018. ESPnet: End-to-end speech processing toolkit. Proc. Interspeech 2018, pages 2207–2211.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Matthew D Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.


Appendix

A Implementation Details

A.1 Setup for the ASR System

This section introduces our implementation of the ASR system.

For the input features, we use an 80-band Mel-scale spectrogram derived from the speech signal. The target sequence is processed into 5K case-insensitive subword units created via unigram byte-pair encoding (Shibata et al., 1999). We use an attention-based encoder-decoder model as our acoustic model. The encoder is a 5-layer bidirectional long short-term memory (LSTM) network, with bottleneck layers that apply a linear transformation between every pair of LSTM layers. Additionally, a VGG module placed before the encoder reduces the number of encoding time steps to one-quarter through two max-pooling layers. The decoder is a 2-layer bidirectional LSTM network with a location-aware attention mechanism (Chorowski et al., 2015). All the layers have 1024 hidden units. The model is trained with an additional connectionist temporal classification (CTC) objective function because the left-to-right constraint of CTC helps learn alignments between speech-text pairs (Hori et al., 2017).

Our model is trained for 20 epochs on 960 h of LibriSpeech training data using the Adadelta optimizer (Zeiler, 2012). Using this acoustic model, we obtain the 50 best decoded sentences for each input audio file through the hybrid CTC-attention-based scoring method (Hori et al., 2017). For Seq2Seq_ASR, we additionally use a pretrained recurrent neural network language model (RNNLM) and combine the log-probability p_lm of the RNNLM during decoding as follows:

    log p(y_n | y_{1:n−1}) = log p_am(y_n | y_{1:n−1}) + β log p_lm(y_n | y_{1:n−1}),        (3)

where β is set to 0.7. We use the ESPnet end-to-end speech processing toolkit (Watanabe et al., 2018) for this implementation.
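For reference, a one-line sketch of the fusion in Eq. (3), with β = 0.7 as above (the probability values are illustrative only):

```python
# Sketch of the decoding-time score in Eq. (3): the acoustic-model log-probability
# combined with the RNNLM log-probability scaled by beta.
import numpy as np

def fused_log_prob(log_p_am, log_p_lm, beta=0.7):
    """log p(y_n | y_{1:n-1}) used during hybrid CTC-attention decoding."""
    return log_p_am + beta * log_p_lm

print(fused_log_prob(np.log(0.4), np.log(0.3)))
```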

Table 5 shows the oracle word error rates (WERs) of the 50-best lists, measured by assuming that the best sentence is always picked from the candidates. We also include the oracle WERs of the 50-best lists from Shin et al. (2019).

Method         dev-clean  dev-other  test-clean  test-other
Shin et al.       7.17      19.79       7.26       20.37
  oracle          3.18      12.98       3.19       13.61
Seq2Seq_ASR       4.11      12.31       4.31       13.14
  oracle          1.80       7.90       1.96        8.39

Table 5: Oracle WERs of the 50-best lists on LibriSpeech from each ASR system.

A.2 Setup for the NMT System

We implement the standard Transformer model (Vaswani et al., 2017) using the Tensor2Tensor library (Vaswani et al., 2018) for NMT. Both the encoder and the decoder of the Transformer consist of 6 layers with 512 hidden units, and the number of self-attention heads is 8. The maximum number of input tokens is set to 256, and we use a shared vocabulary of size 32k. For effective training, we let the token embedding layer and the last softmax layer share their weights. The other hyperparameters of our translation system follow the standard transformer_base_single_gpu setting in Google's official Tensor2Tensor repository⁵.

We train the baseline model on the standard WMT13 Fr→En and De→En datasets for 250k steps using the Adam optimizer (Kingma and Ba, 2014). We use linear-warmup-square-root-decay learning rate scheduling with the default learning rate (2.5e-4) and number of warmup steps (16k). Using this baseline translation model, we obtain the 50 best decoded sentences for each source sentence through beam search. The oracle BLEU scores for the NMT system are shown in Table 6.

Method         De→En   Fr→En
Seq2Seq_NMT    27.83   29.63
  oracle       38.18   39.58

Table 6: Oracle BLEU scores of the 50-best lists on WMT13.

B Additional Experiments

B.1 Runtimes of the uniLM and T-TA

As mentioned in Section 5.2, we also measure the runtimes of the uniLM we implement. Figure 5 shows the average runtimes of the uniLM and the T-TA according to the number of words in a sentence. Since we use subword tokens, the number of words n_w and the number of tokens n can differ (n_w ≤ n).

Figure 5: Runtimes according to the number of words for the uniLM and T-TA.

⁵https://github.com/tensorflow/tensor2tensor

B.2 Runtimes on a GPU

Additionally, we measure the runtimes in a GPU-augmented environment (using a GeForce GTX 1080 Ti). Figure 6 shows the average runtimes of the biLM and the T-TA according to the number of words in a sentence. At our 20-word standard on the STS task, the T-TA takes approximately 2.51 ms, whereas the biLM takes approximately 4.72 ms, showing that the T-TA is 1.88 times faster than the biLM. Compared to the CPU-only environment, the speed difference is significantly reduced due to the support offered by the GPU. Considering Figure 4, however, the CPU-only environment and the GPU-augmented environment show a similar tendency: the longer the sentence, the larger the runtime difference between the T-TA and the biLM.

Figure 6: Runtimes according to the number of words for the biLM and T-TA in the GPU-augmented environment.

B.3 Perplexity and Reranking

In general, perplexity (PPL) is a measure of how well a language model is trained. To investigate the relation between PPL and reranking, we compute the PPL of the reference sentences from the LibriSpeech dev-clean and test-clean sets using each language model. We can only obtain a pseudoperplexity (pPPL) from the biLM and T-TA since they do not follow the product rule, unlike the uniLM. Note that we compute subword-level (not word-level) (p)PPLs; these values are valid only within our vocabulary.
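A sketch of how such values can be computed, assuming per-token (pseudo-)log-likelihoods are available and that the subscripts a and m in Table 7 denote the average and median over sentences, respectively (an interpretation based on the discussion below the table):

```python
# Sketch: per-sentence (pseudo-)perplexity is the exponential of the negative mean
# per-token (pseudo-)log-likelihood; corpus-level (p)PPL_a / (p)PPL_m are then the
# average / median over sentences.
import numpy as np

def pseudo_perplexity(token_log_likelihoods):
    return float(np.exp(-np.mean(token_log_likelihoods)))

sentence_scores = [pseudo_perplexity(ll) for ll in (
    [-2.1, -0.4, -3.0], [-1.2, -0.9, -0.5, -2.2])]           # toy per-token scores
print(np.mean(sentence_scores), np.median(sentence_scores))   # (p)PPL_a, (p)PPL_m
```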

Set          Method   [WER]   (p)PPL_a   (p)PPL_m
dev-clean    uniLM    [3.82]    341.5      70.80
             biLM     [3.73]   (76.49)    (11.93)
             T-TA     [3.67]   (293.4)    (11.69)
test-clean   uniLM    [4.05]    495.5      73.18
             biLM     [3.97]   (75.43)    (12.72)
             T-TA     [3.97]   (590.0)    (12.43)

Table 7: (pseudo)Perplexities and corresponding WERs of the language models on LibriSpeech.

We find that the WERs are better aligned with the median pPPL_m than with the average pPPL_a. Interestingly, the pPPL_a of the T-TA is similar to the PPL_a of the uniLM, whereas the pPPL_m of the T-TA is similar to that of the biLM. We additionally discover that if a sentence is short, the T-TA shows a very high PPL, even higher than that of the uniLM.