
Strategies for Training Large Vocabulary Neural Language Models

Wenlin Chen†
Washington University
St Louis, MO
[email protected]

David Grangier
Facebook AI Research
Menlo Park, CA
[email protected]

Michael Auli
Facebook AI Research
Menlo Park, CA
[email protected]

Abstract

Training neural network language models over large vocabularies is still computationally very costly compared to count-based models such as Kneser-Ney. At the same time, neural language models are gaining popularity for many applications such as speech recognition and machine translation whose success depends on scalability. We present a systematic comparison of strategies to represent and train large vocabularies, including softmax, hierarchical softmax, target sampling, noise contrastive estimation and self normalization. We further extend self normalization to be a proper estimator of likelihood and introduce an efficient variant of softmax. We evaluate each method on three popular benchmarks, examining performance on rare words, the speed/accuracy trade-off and complementarity to Kneser-Ney.

1 Introduction

Neural network language models (Bengio et al., 2003; Mikolov et al., 2010) have gained popularity for tasks such as automatic speech recognition (Arisoy et al., 2012) and statistical machine translation (Schwenk et al., 2012; Vaswani et al., 2013). Furthermore, models similar in architecture to neural language models have been proposed for translation (Le et al., 2012; Devlin et al., 2014; Bahdanau et al., 2015), summarization (Chopra et al., 2015) and language generation (Sordoni et al., 2015).

†Work done while Wenlin was an intern at Facebook.

Language models assign a probability to a word given a context of preceding, and possibly subsequent, words. The model architecture determines how the context is represented and there are several choices including recurrent neural networks (Mikolov et al., 2010), or log-bilinear models (Mnih and Hinton, 2010). We experiment with a simple but proven feed-forward neural network model similar to Bengio et al. (2003). Our focus is not the model architecture or how the context can be represented but rather how to efficiently deal with large output vocabularies, a problem common to all approaches to neural language modeling and related tasks such as machine translation and language generation.

Practical training speed for these models quickly decreases as the vocabulary grows. This is due to three combined factors. First, model evaluation and gradient computation become more time consuming, mainly due to the need to compute normalized probabilities over a large vocabulary. Second, large vocabularies require more training data in order to observe enough instances of infrequent words, which increases training times. Third, a larger training set often allows for higher capacity models, which require more training iterations.

In this paper we provide an overview of popular strategies to model large vocabularies for language modeling. This includes the classical softmax over all output classes, hierarchical softmax which introduces latent variables, or clusters, to simplify normalization, target sampling which only considers a random subset of classes for normalization, noise contrastive estimation which discriminates between genuine data points and samples from a noise distribution, and infrequent normalization, also referred to as self-normalization, which computes the partition function at an infrequent rate.


We also extend self-normalization to be a proper estimator of likelihood. Furthermore, we introduce differentiated softmax, a novel variation of softmax which assigns more capacity to frequent words and which we show to be faster and more accurate than softmax (§2).

Our comparison assumes a reasonable budget of one week for training models. We evaluate on three well known benchmarks differing in the amount of training data and vocabulary size, that is Penn Treebank, Gigaword and the recently introduced Billion Word benchmark (§3).

Our results show that conclusions drawn from small datasets do not always generalize to larger settings. For instance, hierarchical softmax is less accurate than softmax on the small vocabulary Penn Treebank task but performs best on the very large vocabulary Billion Word benchmark, because hierarchical softmax is the fastest method for training and can perform more training updates in the same period of time. Furthermore, our results with differentiated softmax demonstrate that assigning capacity where it has the most impact allows us to train better models within our time budget (§4).

Unlike traditional count-based models, our neural models benefit less from more training data because the computational complexity of training is much higher, exceeding our time budget in some cases. Finally, our analysis shows clearly that Kneser-Ney count-based language models are very competitive on rare words, contrary to the common belief that neural models are better on infrequent words (§5).

2 Modeling Large Vocabularies

We first introduce our basic language model architecture with a classical softmax and then describe various other methods including a novel variation of softmax.

2.1 Softmax Neural Language Model

Our feed-forward neural network implements an n-gram language model, i.e., it is a parametric function estimating the probability of the next word $w_t$ given the $n-1$ previous context words $w_{t-1}, \ldots, w_{t-n+1}$. Formally, we take as input a sequence of discrete indexes representing the $n-1$ previous words and output a vocabulary-sized vector of probability estimates, i.e.,

$$f : \{1, \ldots, V\}^{n-1} \to [0, 1]^V,$$

where $V$ is the vocabulary size. This function results from the composition of simple differentiable functions or layers.

Specifically, $f$ composes an input mapping from discrete word indexes to continuous vectors, a succession of linear operations followed by hyperbolic tangent non-linearities, plus one final linear operation, followed by a softmax normalization.

The input layer maps each context word index to a continuous $d^0$-dimensional vector. It relies on a parameter matrix $W^0 \in \mathbb{R}^{V \times d^0}$ to convert the input

$$x = [w_{t-1}, \ldots, w_{t-n+1}] \in \{1, \ldots, V\}^{n-1}$$

to $n-1$ vectors of dimension $d^0$. These vectors are concatenated into a single $(n-1) \times d^0$ matrix,

$$h^0 = [W^0_{w_{t-1}}; \ldots; W^0_{w_{t-n+1}}] \in \mathbb{R}^{(n-1) \times d^0}.$$

This state $h^0$ is considered as a $(n-1) \times d^0$ vector by the next layer. The subsequent states are computed through $k$ layers of linear mappings followed by hyperbolic tangents, i.e.,

$$\forall i = 1, \ldots, k, \quad h^i = \tanh(W^i h^{i-1} + b^i) \in \mathbb{R}^{d^i}$$

where $W^i \in \mathbb{R}^{d^i \times d^{i-1}}$ and $b^i \in \mathbb{R}^{d^i}$ are learnable weights and biases and $\tanh$ denotes the component-wise hyperbolic tangent.

Finally, the last layer performs a linear operation followed by a softmax normalization, i.e.,

$$h^{k+1} = W^{k+1} h^k + b^{k+1} \in \mathbb{R}^V \qquad (1)$$

$$y = \frac{1}{Z} \exp(h^{k+1}) \in [0, 1]^V \qquad (2)$$

where $Z = \sum_{j=1}^{V} \exp(h^{k+1}_j)$ and $\exp$ denotes the component-wise exponential. The network output $y$ is therefore a vocabulary-sized vector of probability estimates.


We use the standard cross-entropy loss with respect to the computed log probabilities, whose gradient with respect to the pre-softmax activations is

$$\frac{\partial \log y_i}{\partial h^{k+1}_j} = \delta_{ij} - y_j$$

where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise. The gradient update therefore increases the score of the correct output $h^{k+1}_i$ and decreases the score of all other outputs $h^{k+1}_j$ for $j \neq i$.

A downside of the classical softmax formulation is that it requires computing the activations for all output words (see Equation 2). When grouping multiple input examples into a batch, Equation 1 amounts to a large matrix-matrix product of the form $W^{k+1} H^k$ where $W^{k+1} \in \mathbb{R}^{V \times d^k}$ and $H^k = [h^k_1; \ldots; h^k_l] \in \mathbb{R}^{d^k \times l}$, where $l$ is the number of input examples in a batch. For example, typical settings for the gigaword corpus (§3) are a vocabulary of size $V = 100{,}000$, an output word embedding size of $d^k = 1024$ and a batch size of $l = 500$ examples. This gives a very large matrix-matrix product of $100{,}000 \times 1024$ by $1024 \times 500$. The rest of the network involves matrix-matrix operations whose sizes are determined by the batch size and the layer dimensions, both of which are typically much smaller than the vocabulary size, ranging from hundreds to a couple of thousand. Therefore, the output layer dominates the complexity of the entire network.

This computational burden is high even for Graphics Processing Units (GPUs). GPUs are well suited for matrix-matrix operations when matrix dimensions are in the thousands, but become less efficient with dimensions over 10,000. The size of the output matrix is therefore a bottleneck during training. Previous work suggested tackling these products by sharding them across multiple GPUs (Sutskever et al., 2014), which introduces additional engineering challenges around inter-GPU communication. This paper focuses on orthogonal algorithmic solutions which are also relevant to parallel training.

2.2 Hierarchical Softmax

Hierarchical Softmax (HSM) organizes the output vocabulary into a tree where the leaves are the words and the intermediate nodes are latent variables, or classes (Morin and Bengio, 2005). The tree has potentially many levels and there is a unique path from the root to each word. The probability of a word is the product of the probabilities of the latent variables along the path from the root to the leaf, including the probability of the leaf. If the tree is perfectly balanced, this can reduce the complexity from $O(V)$ to $O(\log V)$.

We experiment with a version that follows Goodman (2001) and which has been used in Mikolov et al. (2011b). Goodman proposed a two-level tree which first predicts the class of the next word $c_t$ and then the actual word $w_t$ given context $x$:

$$p(w_t|x) = p(c_t|x)\, p(w_t|c_t, x) \qquad (3)$$

If the number of classes is $O(\sqrt{V})$ and each class has the same number of members, then we only need to compute $O(2\sqrt{V})$ outputs. This is a good strategy in practice as it yields weight matrices for clusters and words whose largest dimension is less than $\sim 1{,}000$, a setting for which GPUs are fast.

A popular strategy clusters words based on frequency. It slices the list of words sorted by frequency into clusters that contain an equal share of the total unigram probability. We pursue this strategy and compare it to random class assignment and to clustering based on word embedding features. The latter applies k-means over word embeddings obtained from Hellinger PCA over co-occurrence counts (Lebret and Collobert, 2014). Alternative word representations (Brown et al., 1992; Mikolov et al., 2013) are also relevant but an extensive study of word clustering techniques is beyond the scope of this work.
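The frequency-based assignment described above can be sketched as follows, assuming we slice the frequency-sorted vocabulary so that each cluster receives roughly an equal share of the total unigram probability; the function name and cluster count are illustrative.

```python
import numpy as np

def frequency_clusters(word_counts, n_clusters):
    """Assign frequency-sorted words to clusters with roughly equal unigram mass.

    word_counts: array of training-set counts indexed by word id.
    Returns an array mapping each word id to a cluster id in [0, n_clusters).
    """
    order = np.argsort(-word_counts)                   # most frequent first
    unigram = word_counts[order] / word_counts.sum()
    cumulative = np.cumsum(unigram)
    # Cut the cumulative unigram distribution into n_clusters equal shares.
    cluster_of_sorted = np.minimum(
        (cumulative * n_clusters).astype(int), n_clusters - 1)
    word2cluster = np.empty_like(order)
    word2cluster[order] = cluster_of_sorted
    return word2cluster
```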

2.3 Differentiated Softmax

This section introduces a novel variation of softmax that assigns variable capacity per word in the output layer. The weight matrix of the final layer $W^{k+1} \in \mathbb{R}^{V \times d^k}$ stores an output embedding of size $d^k$ for each of the $V$ words the language model may predict: $W^{k+1}_1, \ldots, W^{k+1}_V$. Differentiated softmax (D-Softmax) varies the dimension of the output embeddings $d^k$ across words depending on how much model capacity is deemed suitable for a given word. In particular, it is meaningful to assign more parameters to frequent words than to rare words. By definition, frequent words occur more often in the training data than rare words and therefore allow fitting more parameters.


Figure 1: Final weight matrix $W^{k+1}$ and hidden layer $h^k$ for differentiated softmax for partitions A, B, C of the output vocabulary with embedding dimensions $d_A$, $d_B$, $d_C$; non-shaded areas are zero.

In particular, we define partitions of the output vocabulary based on word frequency and the words in each partition share the same embedding size. For example, we may partition the frequency-ordered set of output word ids, $O = \{1, \ldots, V\}$, into $A_{d_A} = \{1, \ldots, K\}$ and $B_{d_B} = \{K+1, \ldots, V\}$ such that $A \cup B = O$ and $A \cap B = \emptyset$, where $d_A$ and $d_B$ are different output embedding sizes and $K$ is a word id.

Partitioning results in a sparse final weight matrix $W^{k+1}$ which arranges the embeddings of the output words in blocks, each one corresponding to a separate partition (Figure 1). The size of the final hidden layer $h^k$ is the sum of the embedding sizes of the partitions. The final hidden layer is effectively a concatenation of separate features for each partition which are used to compute the dot product with the corresponding embedding type in $W^{k+1}$. In practice, we compute separate matrix-vector products, or in batched form, matrix-matrix products, for each partition in $W^{k+1}$ and $h^k$.
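The block-wise computation can be sketched as below: the final hidden state is split into one segment per frequency band and each segment is multiplied only with the output embeddings of that band. This is an illustrative NumPy sketch under the notation of Figure 1, not the released implementation.

```python
import numpy as np

def d_softmax_scores(h, partitions):
    """Differentiated softmax output scores.

    h:          final hidden state of size d_A + d_B + ... (one segment per band).
    partitions: list of (W_band, d_band) pairs; W_band has shape |band| x d_band
                and holds the output embeddings of the words in that band.
    Returns the concatenated pre-softmax scores for the whole vocabulary.
    """
    scores, offset = [], 0
    for W_band, d_band in partitions:
        h_band = h[offset:offset + d_band]      # features reserved for this band
        scores.append(W_band @ h_band)          # small matrix-vector product
        offset += d_band
    return np.concatenate(scores)               # normalize jointly with a softmax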

Overall, differentiated softmax can lead to large speed-ups as well as accuracy gains since we can greatly reduce the complexity of computing the output layer. Most significantly, this strategy speeds up both training and inference. This is in contrast to hierarchical softmax which is fast during training but requires even more effort than softmax for computing the most likely next word.

2.4 Target Sampling

Sampling-based methods approximate the softmax normalization (Equation 2) by selecting a number of impostors instead of using all outputs. This can significantly speed up each training iteration, depending on the size of the impostor set.

We follow Jean et al. (2014) who choose as impostors all positive examples in a mini-batch as well as a subset of the remaining words. This subset is sampled uniformly and its size is chosen by cross-validation. A downside of sampling is that the (downsampled) final weight matrix $W^{k+1}$ (Equation 1) keeps changing between mini-batches. This is computationally costly and the success of sampling hinges on being able to estimate a good model while keeping the number of samples small.
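A hedged sketch of this sampled objective for one mini-batch, normalizing only over the batch targets plus a uniformly drawn distractor set; the function signature and sampling details are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def target_sampling_loss(H, targets, W_out, vocab_size, n_distractors, rng):
    """Average negative log-likelihood over a mini-batch, normalized only
    over the batch targets plus uniformly sampled distractor words.

    H:       l x d matrix of final hidden states for the batch.
    targets: length-l array of gold next-word ids.
    W_out:   V x d output embedding matrix.
    """
    distractors = rng.choice(vocab_size, size=n_distractors, replace=False)
    candidates = np.union1d(targets, distractors)        # restricted output set
    scores = H @ W_out[candidates].T                     # l x |candidates|
    scores -= scores.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(scores).sum(axis=1))
    # Position of each gold target inside the sorted candidate list.
    pos = np.searchsorted(candidates, targets)
    return -(scores[np.arange(len(targets)), pos] - log_norm).mean()
```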

2.5 Noise Contrastive Estimation

Noise contrastive estimation (NCE) is another sampling-based technique (Hyvarinen, 2010; Mnih and Teh, 2012). Contrary to target sampling, it does not maximize the training data likelihood directly. Instead, it solves a two-class problem of distinguishing genuine data from noise samples. The training algorithm samples a word $w$ given the preceding context $x$ from a mixture

$$P(w|x) = \frac{1}{k+1} P_{\text{train}}(w|x) + \frac{k}{k+1} P_{\text{noise}}(w|x)$$

where $P_{\text{train}}$ is the empirical distribution of the training set and $P_{\text{noise}}$ is a known noise distribution which is typically a context-independent unigram distribution fitted on the training set. The training algorithm fits the model $\hat{P}(w|x)$ to recover whether a mixture sample came from the data or the noise distribution, which amounts to minimizing the binary cross-entropy

$$-y \log \hat{P}(y=1|w,x) - (1-y) \log \hat{P}(y=0|w,x)$$

where $y$ is a binary variable indicating whether the current sample originates from the data ($y=1$) or the noise ($y=0$), and

$$\hat{P}(y=1|w,x) = \frac{\hat{P}(w|x)}{\hat{P}(w|x) + k P_{\text{noise}}(w|x)}, \qquad \hat{P}(y=0|w,x) = 1 - \hat{P}(y=1|w,x)$$

are the model estimates of the corresponding posteriors.

This formulation still involves a softmax over the vocabulary to compute $\hat{P}(w|x)$. However, Mnih and Teh (2012) suggest foregoing the normalization step and simply replacing $\hat{P}(w|x)$ with unnormalized exponentiated scores, which makes the complexity of training independent of the vocabulary size. At test time, the softmax normalization is reintroduced to obtain a proper distribution.
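For concreteness, a sketch of the per-example NCE objective with unnormalized scores in the spirit of Mnih and Teh (2012); the score function, noise distribution and number of noise samples are placeholders rather than the paper's exact setup.

```python
import numpy as np

def nce_loss(score_fn, context, target, noise_probs, k, rng):
    """Noise contrastive estimation loss for one (context, target) pair.

    score_fn(context, word) returns an unnormalized score s(w|x); exp(s)
    plays the role of the unnormalized model probability.
    noise_probs: unigram noise distribution over the vocabulary (P_noise).
    k:           number of noise samples per data sample.
    """
    def log_posterior_data(word):
        # log P(y=1 | w, x) = s - log(exp(s) + k * P_noise(w))
        s = score_fn(context, word)
        return s - np.logaddexp(s, np.log(k * noise_probs[word]))

    # Data term: the genuine next word should be classified as data (y = 1).
    loss = -log_posterior_data(target)
    # Noise term: k samples from P_noise should be classified as noise (y = 0).
    for w in rng.choice(len(noise_probs), size=k, p=noise_probs):
        s = score_fn(context, w)
        log_p_noise = np.log(k * noise_probs[w]) - np.logaddexp(
            s, np.log(k * noise_probs[w]))
        loss -= log_p_noise
    return loss
```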

2.6 Infrequent Normalization

Andreas and Klein (2015) also propose to relax score normalization. Their strategy (here referred to as WeaknormSQ) associates unnormalized likelihood maximization with a penalty term that favors normalized predictions. This yields the following loss over the training set $T$:

$$L^{(2)} = -\sum_{(w,x) \in T} s(w|x) + \alpha \sum_{(w,x) \in T} \big(\log Z(x)\big)^2$$

where $s(w|x)$ refers to the unnormalized score of word $w$ given context $x$ and $Z(x) = \sum_{w} \exp(s(w|x))$ refers to the partition function for context $x$. For efficient training, the second term can be down-sampled:

$$L^{(2)}_{\alpha,\gamma} = -\sum_{(w,x) \in T} s(w|x) + \alpha \sum_{(w,x) \in T_\gamma} \big(\log Z(x)\big)^2$$

where $T_\gamma$ is the training set sampled at rate $\gamma$. A small rate implies computing the partition function only for a small fraction of the training data.

This work extends this strategy to the case where the log partition term is not squared (Weaknorm), i.e.,

$$L^{(1)}_{\alpha,\gamma} = -\sum_{(w,x) \in T} s(w|x) + \alpha \sum_{(w,x) \in T_\gamma} \log Z(x)$$

For $\alpha = 1$, this loss is an unbiased estimator of the negative log-likelihood of the training data, $-\sum_{(w,x) \in T} \big( s(w|x) - \log Z(x) \big)$.
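Both objectives can be sketched per mini-batch as follows; this is an illustrative NumPy version with placeholder names, where the partition function is evaluated only on a random fraction gamma of the examples, mirroring the subsampled sums above.

```python
import numpy as np

def infrequent_norm_loss(scores, targets, alpha, gamma, rng, squared=True):
    """Mini-batch version of the infrequent-normalization losses (illustrative).

    scores:  l x V matrix of unnormalized scores s(w|x), one row per example.
    targets: gold next-word ids, one per example.
    alpha:   penalty strength; gamma: rate at which log Z(x) is evaluated.
    squared=True gives the WeaknormSQ penalty, squared=False gives Weaknorm.
    """
    # Unnormalized likelihood term: -sum of the scores of the gold words.
    data_term = -scores[np.arange(len(targets)), targets].sum()
    # Evaluate the partition function only on a gamma-fraction of the batch.
    mask = rng.random(len(targets)) < gamma
    if not mask.any():
        return data_term
    m = scores[mask].max(axis=1, keepdims=True)        # log-sum-exp stabilization
    log_Z = (m + np.log(np.exp(scores[mask] - m).sum(axis=1, keepdims=True))).ravel()
    penalty = (log_Z ** 2).sum() if squared else log_Z.sum()
    return data_term + alpha * penalty
```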

2.7 Other Methods

Fast locality-sensitive hashing has been used to approximate the dot-product between the final hidden layer activation $h^k$ and the output word embeddings (Vijayanarasimhan et al., 2014). However, during training, there is a high overhead for re-indexing the embeddings, and test time speed-ups virtually vanish as the batch size increases due to the efficiency of matrix-matrix products.

Dataset    Train     Test    Vocab   OOV
PTB        1M        0.08M   10k     5.8%
gigaword   4,631M    279M    100k    5.6%
billionW   799M      8.1M    793k    0.3%

Table 1: Dataset statistics. Number of tokens for train and test set, vocabulary size and ratio of out-of-vocabulary words in the test set.

3 Experimental Setup

This section describes the data used in our experiments, our evaluation methodology and our validation procedure.

Datasets Our experiments are performed over three datasets of different sizes: Penn Treebank (PTB), WMT11-lm (billionW) and English Gigaword, version 5 (gigaword). Penn Treebank is a well-established dataset for evaluating language models (Marcus et al., 1993). It is the smallest dataset with a benchmark setting relying on 1 million tokens and a vocabulary size of 10,000 (Mikolov et al., 2011a). The vocabulary roughly corresponds to words occurring at least twice in the training set. The WMT11-lm corpus has been recently introduced as a larger corpus to evaluate language models and their impact on statistical machine translation (Chelba et al., 2013). It contains close to a billion tokens and a vocabulary of about 800,000 words, which corresponds to words with more than 3 occurrences in the training set.1 This dataset is often referred to as the billion word benchmark. Gigaword (Parker et al., 2011) is the largest corpus we consider with 5 billion tokens of newswire data. Even though it has been used for language modeling previously (Heafield, 2011), there is no standard train/test split or vocabulary for this set. We split the data according to time: the training set covers the period 1994–2009 and the test data covers 2010. The vocabulary consists of the 100,000 most frequent words, which roughly corresponds to words with more than 100 occurrences in the training data. Table 1 summarizes data set statistics.

1 We use the version distributed by Tony Robinson at http://tiny.cc/1billionLM.

Evaluation Performance is evaluated in terms of perplexity over the test set. For PTB and billionW, we report perplexity results on a per-sentence basis, i.e., the model does not use context words across sentence boundaries and we score the end-of-sentence marker. This is the standard setting for these benchmarks. On gigaword, we do not segment the data into sentences; the model uses contexts crossing sentence boundaries and the evaluation does not include end-of-sentence markers.
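For reference, perplexity is the exponentiated average negative log-likelihood per scored token; a minimal sketch, assuming the model supplies a natural-log probability for every test token it scores:

```python
import math

def perplexity(log_probs):
    """Perplexity from natural-log probabilities, one per scored test token
    (including end-of-sentence markers where the benchmark requires them)."""
    return math.exp(-sum(log_probs) / len(log_probs))
```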

Our baseline is an interpolated Kneser-Ney (KN) language model and we use the KenLM toolkit to train 5-gram models without pruning (Heafield, 2011). For our neural models, we train 11-gram language models for gigaword and billionW, and a 6-gram language model for the smaller PTB. The parameters of the models are the weights $W^i$ and the biases $b^i$ for $i = 0, \ldots, k+1$. These parameters are learned by maximizing the log-likelihood of the training data relying on stochastic gradient descent (SGD) (LeCun et al., 1998).

Validation The hyper-parameters of the model are the number of layers $k$ and the dimension of each layer $d^i$, $\forall i = 0, \ldots, k$. These parameters are set by cross-validation, i.e., we pick the parameters which maximize the likelihood over a validation set (a subset of the training data excluded from sampling during SGD optimization). We also cross-validate the number of clusters as well as the clustering technique for hierarchical softmax, the number of frequency bands and their allocated capacity for differentiated softmax, the number of distractors for target sampling, the noise/data ratio for NCE, as well as the regularization rate and strength for infrequent normalization. Similarly, the SGD parameters, i.e., learning rate and mini-batch size, are also set to maximize validation accuracy.

Training Time We train for 168 hours (one week) on the large datasets (billionW, gigaword) and 24 hours (one day) for Penn Treebank. We select the hyper-parameters which yield the best validation perplexity after the allocated time and report the perplexity of the resulting model on the test set. This training time is a trade-off between being able to do a comprehensive exploration of the various settings for each method and good accuracy.

Figure 2: Penn Treebank learning curve on the validation set (perplexity vs. training time in hours; curves for Softmax, Sampling, HSM, D-Softmax, Weaknorm, WeaknormSQ and NCE).

4 Results

Looking at test results (Table 2) and learning paths on the validation sets (Figures 2, 3, and 4) we can see a clear trend: the competitiveness of softmax diminishes with the vocabulary size. Softmax does very well on the small vocabulary Penn Treebank corpus, but it does very poorly on the larger vocabulary billionW corpus. Faster methods such as sampling, hierarchical softmax, and infrequent normalization (Weaknorm and WeaknormSQ) are much better in the large-vocabulary setting of billionW.

D-Softmax performs very well on all data sets and shows that assigning higher capacity where it benefits most results in better models. Target sampling performs worse than softmax on gigaword but better on billionW. Hierarchical softmax performs very poorly on Penn Treebank, which is in stark contrast to billionW where it does very well. Noise contrastive estimation has good accuracy on billionW, where speed is essential to achieving good accuracy.

Of all the methods, hierarchical softmax processes the most training examples in a given time frame (Table 3). Our test time speed comparison assumes that we would like to find the highest scoring next word, rather than rescoring an existing string. This scenario requires scoring all output words, and D-Softmax can process nearly twice as many tokens per second as the other methods, whose complexity is then similar to softmax.


                PTB    gigaword  billionW
KN             141.2    57.1      70.2
Softmax        123.8    56.5     108.3
D-Softmax      121.1    52.0      91.2
Sampling       124.2    57.6     101.0
HSM            138.2    57.1      85.2
NCE            143.1    78.4     104.7
Weaknorm       124.4    56.9      98.7
WeaknormSQ     122.1    56.1      94.9
KN+Softmax     108.5    43.6      59.4
KN+D-Softmax   107.0    42.0      56.3
KN+Sampling    109.4    43.8      58.1
KN+HSM         115.0    43.9      55.6
KN+NCE         114.6    49.0      58.8
KN+Weaknorm    109.2    43.8      58.1
KN+WeaknormSQ  108.8    43.8      57.7

Table 2: Test perplexity of individual models and interpolation with Kneser-Ney.
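The table does not spell out how the neural models and Kneser-Ney are combined; assuming a standard linear interpolation of their probabilities with a mixture weight tuned on validation data (an assumption, not the paper's stated procedure), the combination looks as follows.

```python
def interpolate(p_neural, p_kn, lam):
    """Linear interpolation of a neural LM and a Kneser-Ney LM probability.

    p_neural, p_kn: probabilities both models assign to the same word in context.
    lam:            mixture weight in [0, 1], assumed tuned on a validation set.
    """
    return lam * p_neural + (1.0 - lam) * p_kn
```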

Figure 3: Gigaword learning curve on the validation set (perplexity vs. training time in hours; same methods as Figure 2).

4.1 Softmax

Despite being our baseline, softmax ranks among the most accurate methods on PTB and it is second best on gigaword after D-Softmax (with WeaknormSQ performing similarly). For billionW, the extremely large vocabulary makes softmax training too slow to compete with faster alternatives. However, of all the methods softmax has the simplest implementation and it has no additional hyper-parameters compared to other methods.

Figure 4: Billion Word learning curve on the validation set (perplexity vs. training time in hours; same methods as Figure 2).

              train    test
Softmax         510     510
D-Softmax       960     960
Sampling      1,060     510
HSM          12,650     510
NCE           4,520     510
Weaknorm      1,680     510
WeaknormSQ    2,870     510

Table 3: Training and testing speed on billionW in tokens per second. Most techniques are identical to softmax at test time; HSM can be faster at test time if only a few words involving few clusters are being scored.

4.2 Target Sampling

Figure 5 shows that target sampling is most accurate when the distractor set represents a large fraction of the vocabulary, i.e., more than 30% on gigaword (the best setting on billionW is even higher at 50%). Target sampling is asymptotically faster and therefore performs more iterations than softmax in the same time. However, it makes less progress in terms of perplexity reduction per iteration compared to softmax. Overall, it is not much better than softmax. A reason might be that the sampling procedure chooses distractors independently from context or current model performance. This does not favor sampling distractors the model incorrectly considers likely given the current context. These distractors would yield a high gradient that could make the model progress faster.


Figure 5: Number of distractors versus perplexity for target sampling over gigaword.

4.3 Hierarchical Softmax

Hierarchical softmax is very efficient for large vocabularies and it is the best method on billionW. On the other hand, HSM performs poorly on small vocabularies, as seen on Penn Treebank.

We found that a good word clustering structure helps learning: when each cluster contains words occurring in similar contexts, cluster likelihoods are easier to learn; when the cluster structure is uninformative, cluster likelihoods converge to the uniform distribution. This adversely affects accuracy since words can never have higher probability than their clusters (cf. Equation 3).

Our experiments group words into a two-level hierarchy and compare four clustering strategies over billionW and gigaword (§2.2). Random clustering shuffles the vocabulary and splits it into equally sized partitions. Frequency-based clustering first orders words based on the number of their occurrences and assigns words to clusters such that each cluster represents an equal share of frequency counts (Mikolov et al., 2011b). K-means runs the well-known clustering algorithm on Hellinger PCA word embeddings. Weighted k-means is similar but weights each word by its frequency.

Random clustering performs worst (Table 4), followed by frequency-based clustering, but k-means does best; weighted k-means performs similarly to its unweighted version. In our initial experiments, pure k-means performed very poorly because the most significant cluster captured up to 40% of the word frequencies in the data. We resorted to explicitly capping the frequency budget of each cluster to ~10%, which brought k-means to the performance of weighted k-means.

                   billionW   gigaword
random               98.51      62.27
frequency-based      92.02      59.47
k-means              85.70      57.52
weighted k-means     85.24      57.09

Table 4: Comparison of clustering techniques for hierarchical softmax.

4.4 Differentiated Softmax

D-Softmax is the best technique on gigaword, and the second best on billionW, after HSM. On PTB it ranks among the best techniques, whose perplexities cannot be reliably distinguished. The variable-capacity scheme of D-Softmax can assign large embeddings to frequent words, while keeping computational complexity manageable through small embeddings for rare words.

Unlike for hierarchical softmax, NCE or Weaknorm, the computational advantage of D-Softmax is preserved at test time (Table 3). D-Softmax is the fastest technique at test time, while ranking among the most accurate methods. This speed advantage is due to the low-dimensional representation of rare words, which negatively affects the model accuracy on these words (Table 5).

4.5 Noise Contrastive Estimation

For language modeling we found NCE difficult to use in practice. In order to work with large neural networks and large vocabularies, we had to dissociate the number of noise samples from the data-to-noise ratio in the modeled mixture. For instance, a data/noise ratio of 1/50 gives good performance in our experiments, but estimating only 50 noise sample posteriors per data point is wasteful given the cost of network evaluation. Moreover, this setting does not allow frequent sampling of every word in a large vocabulary. Our setting considers more noise samples and up-weights the data sample. This allows us to set the data/noise ratio independently from the number of noise samples.


Figure 6: Validation entropy versus NCE loss over gigaword for different experiments differing only in their learning rates and initial weights.

Overall, NCE results are better than softmax only for billionW, a setting for which softmax is very slow due to the very large vocabulary. Why does NCE perform so poorly? Figure 6 shows entropy on the validation set versus the NCE loss for several models. The results clearly show that similar NCE loss values can result in very different validation entropy. Although NCE might make sense for other metrics, it is not among the best techniques for minimizing perplexity.

4.6 Infrequent Normalization

Infrequent normalization (Weaknorm and WeaknormSQ) performs better than softmax on billionW and comparably to softmax on Penn Treebank and gigaword (Table 2). The speedup from skipping partition function computations is substantial. For instance, WeaknormSQ on billionW evaluates the partition only on 10% of the examples. In one week, the model is evaluated and updated on 868M tokens (with 86.8M partition evaluations) compared to 156M tokens for softmax.

Although referred to as self-normalizing in the literature (Andreas and Klein, 2015), the trained models still need to be normalized after training. The partition function cannot be considered a constant and varies greatly between data samples. On billionW, the 10th to 90th percentile range for WeaknormSQ was 9.4 to 10.3 on the natural log scale, i.e., a ratio of 2.5.

It is worth noting that the squared regularizer version of infrequent normalization (WeaknormSQ) is highly sensitive to the regularization parameter. We often found the regularization strength to be either too low (collapse) or too high (blow-up) after a few days of training. We added an extra unit to our model in order to bound predictions, which yields more stable training and better generalization performance. We bounded unnormalized predictions within the range $[-10, +10]$ by using $x \to 10 \tanh(x/5)$. We also observed that for the non-squared version of the technique (Weaknorm), a regularization strength of 1 was the best setting. With this choice, the loss is an unbiased estimator of the data likelihood.

5 Analysis

This section discusses model capacity, model initialization, training set size and performance on rare words.

5.1 Model Capacity

Training neural language models over large corpora highlights that training time, not training data, is the main factor limiting performance. The learning curves on gigaword and billionW indicate that most models are still making progress after one week. Training time therefore has to be taken into account when considering increasing capacity. Figure 7 shows validation perplexity versus the number of iterations for a week of training. This figure indicates that a softmax model with 1024 hidden units in the last layer could perform better than the 512-hidden-unit model with a longer training horizon. However, in the allocated time, 512 hidden units yield the best validation performance. D-Softmax shows that it is possible to selectively increase capacity, i.e., to allocate more hidden units to the representation of the most frequent words at the expense of rarer words. This captures most of the benefit of a larger softmax model while staying within a reasonable training budget.

5.2 Effect of Initialization

Several techniques for pre-training word embeddings have been recently proposed (Mikolov et al., 2013; Lebret and Collobert, 2014; Pennington et al., 2014). Our experiments use Hellinger PCA (Lebret and Collobert, 2014), motivated by its simplicity: it can be computed in a few minutes and only requires an implementation of parallel co-occurrence counting as well as fast randomized PCA. We consider initializing both the input word embeddings and the output matrix from PCA embeddings.


Figure 7: Validation perplexity per iteration on billionW for softmax and D-Softmax. Softmax uses the same 512 or 1024 units for all words. The first D-Softmax experiment uses 1024 units for the 50K most frequent words, 512 for the next 100K, and 64 units for the rest; the second experiment only considers two frequency bands. All learning curves end after one week.

Figure 8 shows that PCA is better than random for initializing both input and output word representations; initializing both from PCA is even better. The results show that even after a week of training, the initial conditions still impact the validation perplexity. This trend is not specific to softmax and similar outcomes have been observed for other strategies. After a week of training, we observe only for HSM that the random initialization of the output matrix can reach performance comparable to PCA initialization.

5.3 Training Set Size

Large training sets and a fixed training time introduce competition between slower models with more capacity and observing more training data. This trade-off only applies to iterative SGD optimization and does not apply to classical count-based models, which visit the training set once and then solve training in closed form.

Figure 8: Effect of random versus Hellinger PCA initialization of the input and output word representations on gigaword for softmax (validation perplexity vs. training time in hours).

Figure 9: Effect of training set size measured on the test set of gigaword for Softmax and Kneser-Ney (test perplexity vs. training data size in billions of tokens).

We compare Kneser-Ney and softmax, trained for one week on gigaword, with differently sized subsets of the training data. For each setting we take care to include all data from the smaller subsets. Figure 9 shows that the performance of the neural model improves very little beyond 500M tokens. In order to benefit from the full training set we would require a much higher training budget, faster hardware, or parallelization.

Scaling training to large datasets can have a significant impact on perplexity, even when data from the distribution of interest is limited. As an illustration, we adapted a softmax model trained on billionW to Penn Treebank and achieved a perplexity of 96, a far better result than with any model we trained from scratch on PTB (cf. Table 2).


                  1-4K   4-20K   20-40K   40-70K   70-100K
Kneser-Ney        3.48    7.85     9.76    10.76     11.57
Softmax           3.46    7.87     9.76    11.09     12.39
D-Softmax         3.35    7.79    10.13    12.22     12.69
Target sampling   3.51    7.62     9.51    10.81     12.06
HSM               3.49    7.86     9.38    10.30     11.24
NCE               3.74    8.48    10.60    12.06     13.37
Weaknorm          3.46    7.86     9.77    11.12     12.40
WeaknormSQ        3.46    7.79     9.67    10.98     12.32

Table 5: Test set entropy of various word frequency ranges on gigaword.

5.4 Rare Words

How well are neural models performing on rare words? To answer this question we computed entropy across word frequency bands of the vocabulary for Kneser-Ney and the neural models, that is, we report entropy for the 4,000 most frequent words, then the next most frequent 16,000 words and so on. Table 5 shows that Kneser-Ney is very competitive on rare words, contrary to the common belief that neural models are better on infrequent words. For frequent words, neural models are on par or better than Kneser-Ney. This highlights that the two approaches complement each other, as observed in our combination experiments (Table 2).
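The per-band figures can be reproduced by grouping test tokens by the frequency rank of their word type and averaging the negative log-probabilities within each band; the sketch below assumes base-2 logarithms and a placeholder model interface, both of which are assumptions rather than details given in the paper.

```python
import numpy as np

def entropy_by_band(word_ids, log2_probs, freq_rank, bands):
    """Average negative log2-probability of test tokens per frequency band.

    word_ids:   word id of each scored test token.
    log2_probs: model log2-probability of each test token (base 2 is an assumption).
    freq_rank:  frequency rank of each word id (0 = most frequent).
    bands:      rank ranges, e.g. [(0, 4000), (4000, 20000), (20000, 40000)].
    """
    ranks = freq_rank[word_ids]
    return [-log2_probs[(ranks >= lo) & (ranks < hi)].mean() for lo, hi in bands]
```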

Among the neural strategies, D-Softmax excels on frequent words but performs poorly on rare ones. This is because D-Softmax assigns more capacity to frequent words at the expense of rare ones. Overall, hierarchical softmax is the best neural technique for rare words since it is very fast: it performs more iterations than the other techniques and observes the occurrences of every rare word several times.

6 Conclusions

This paper presents the first comprehensive analysis of strategies to train large vocabulary neural language models. Large vocabularies are a challenge for neural networks as they need to compute the partition function over the entire vocabulary at each evaluation.

We compared classical softmax to hierarchical softmax, target sampling, noise contrastive estimation and infrequent normalization, commonly referred to as self-normalization. Furthermore, we extend infrequent normalization, or self-normalization, to be a proper estimator of likelihood and we introduce differentiated softmax, a novel variant of softmax which assigns less capacity to rare words in order to reduce computation.

Our results show that methods which are effective on small vocabularies are not necessarily the best on large vocabularies. In our setting, target sampling and noise contrastive estimation failed to outperform the softmax baseline. Overall, differentiated softmax and hierarchical softmax are the best strategies for large vocabularies. Compared to classical Kneser-Ney models, neural models are better at modeling frequent words, but they are less effective for rare words. A combination of the two is therefore very effective.

From this paper, we conclude that there is still a lot to explore in training from a combination of normalized and unnormalized objectives. We also see parallel training and better rare word modeling as promising future directions.


References

[Andreas and Klein2015] Jacob Andreas and Dan Klein. 2015. When and why are log-linear models self-normalizing? In Proc. of NAACL.

[Arisoy et al.2012] Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. 2012. Deep Neural Network Language Models. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pages 20–28, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR, May.

[Bengio et al.2003] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.

[Brown et al.1992] Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, Dec.

[Chelba et al.2013] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google.

[Chopra et al.2015] Sumit Chopra, Jason Weston, and Alexander M. Rush. 2015. Tuning as ranking. In Proc. of EMNLP. Association for Computational Linguistics, Sep.

[Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and Robust Neural Network Joint Models for Statistical Machine Translation. In Proc. of ACL. Association for Computational Linguistics, June.

[Goodman2001] Joshua Goodman. 2001. Classes for Fast Maximum Entropy Training. In Proc. of ICASSP.

[Heafield2011] Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Workshop on Statistical Machine Translation, pages 187–197.

[Hyvarinen2010] Michael Gutmann and Aapo Hyvarinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proc. of AISTATS.

[Jean et al.2014] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On Using Very Large Target Vocabulary for Neural Machine Translation. CoRR, abs/1412.2007.

[Le et al.2012] Hai-Son Le, Alexandre Allauzen, and Francois Yvon. 2012. Continuous Space Translation Models with Neural Networks. In Proc. of HLT-NAACL, pages 39–48, Montreal, Canada. Association for Computational Linguistics.

[Lebret and Collobert2014] Remi Lebret and Ronan Collobert. 2014. Word Embeddings through Hellinger PCA. In Proc. of EACL.

[LeCun et al.1998] Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Mueller. 1998. Efficient BackProp. In Genevieve Orr and Klaus-Robert Muller, editors, Neural Networks: Tricks of the Trade. Springer.

[Marcus et al.1993] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):314–330, Jun.

[Mikolov et al.2010] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent Neural Network based Language Model. In Proc. of INTERSPEECH, pages 1045–1048.

[Mikolov et al.2011a] Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Honza Cernocky. 2011a. Empirical Evaluation and Combination of Advanced Language Modeling Techniques. In Proc. of INTERSPEECH.

[Mikolov et al.2011b] Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2011b. Extensions of Recurrent Neural Network Language Model. In Proc. of ICASSP, pages 5528–5531.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3781.

[Mnih and Hinton2010] Andriy Mnih and Geoffrey E. Hinton. 2010. A Scalable Hierarchical Distributed Language Model. In Proc. of NIPS.

[Mnih and Teh2012] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proc. of ICML.

[Morin and Bengio2005] Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In Proc. of AISTATS.

[Parker et al.2011] Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword Fifth Edition. Technical report, Linguistic Data Consortium.

[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proc. of EMNLP.

[Schwenk et al.2012] Holger Schwenk, Anthony Rousseau, and Mohammed Attik. 2012. Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pages 11–19. Association for Computational Linguistics.


[Sordoni et al.2015] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In Proc. of NAACL. Association for Computational Linguistics, May.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proc. of NIPS.

[Vaswani et al.2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with Large-scale Neural Language Models improves Translation. In Proc. of EMNLP. Association for Computational Linguistics, October.

[Vijayanarasimhan et al.2014] Sudheendra Vijayanarasimhan, Jonathon Shlens, Rajat Monga, and Jay Yagnik. 2014. Deep networks with large output spaces. CoRR, abs/1412.7479.