Softmax Approximations for Learning WordEmbeddings and Language Modeling
Sebastian Ruder@seb ruder
1st NLP Meet-up
03.08.16
Agenda
1 Softmax
2 Softmax-based Approaches
  Hierarchical Softmax
  Differentiated Softmax
  CNN-Softmax
3 Sampling-based Approaches
  Margin-based Hinge Loss
  Noise Contrastive Estimation
  Negative Sampling
Language modeling objective
Goal: a probabilistic model of language
Maximize the probability of a word w_t given its n−1 previous words, i.e. p(w_t \mid w_{t-1}, \cdots, w_{t-n+1})
N-gram models:
p(w_t \mid w_{t-1}, \cdots, w_{t-n+1}) = \frac{\mathrm{count}(w_{t-n+1}, \cdots, w_{t-1}, w_t)}{\mathrm{count}(w_{t-n+1}, \cdots, w_{t-1})}
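A minimal Python sketch of the count-based estimate above, assuming the corpus is just a list of tokens; the function name and example corpus are illustrative, not from the slides:

```python
from collections import Counter

def ngram_prob(tokens, context, word):
    """Estimate p(word | context) by counting, where `context` is a tuple of
    the n-1 previous words and `tokens` is the training corpus."""
    n = len(context) + 1
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    if contexts[tuple(context)] == 0:
        return 0.0  # unseen context: return 0 rather than dividing by zero
    return ngrams[tuple(context) + (word,)] / contexts[tuple(context)]

corpus = "the cat sat on the mat because the cat was tired".split()
print(ngram_prob(corpus, ("the",), "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
```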
Softmax objective for language modeling
Figure: Predicting the next word with the softmax
Softmax objective for language modeling
Neural networks with softmax:
p(w \mid w_{t-1}, \cdots, w_{t-n+1}) = \frac{\exp(h^\top v_w)}{\sum_{w_i \in V} \exp(h^\top v_{w_i})}
where
h is the ”hidden” representation of the input (the previous words), of dimensionality d
v_{w_i} is the ”output” word embedding of word w_i, ≠ its input word embedding
V is the vocabulary
The inner product h^\top v_w computes the score (”unnormalized” probability) of the model for word w given the input
Output word embeddings are stored in a d × |V| matrix
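A minimal NumPy sketch of the full softmax just described; the array names (h, V_out) are illustrative:

```python
import numpy as np

def softmax_probs(h, V_out):
    """Full softmax over the vocabulary.

    h:      hidden representation, shape (d,)
    V_out:  output word embeddings, shape (d, |V|)
    Returns a probability vector of shape (|V|,).
    """
    scores = h @ V_out                    # unnormalized scores h^T v_w for every word in V
    scores -= scores.max()                # shift for numerical stability (does not change the result)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize by the partition function Z
```

The single product against the d × |V| matrix is exactly what makes the full softmax expensive for large vocabularies.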
Neural language model
Figure: Neural language model [Bengio et al., 2003]
Softmax use cases
Maximum entropy models use the same form of probability distribution:
P_h(y \mid x) = \frac{\exp(h \cdot f(x, y))}{\sum_{y' \in Y} \exp(h \cdot f(x, y'))}
where
h is a weight vector
f(x, y) is a feature vector
Pervasive use in NNs:
Go-to multi-class classification objective
”Soft” selection, e.g. for attention, memory retrieval, etc.
The denominator is called the partition function:
Z = \sum_{w_i \in V} \exp(h^\top v_{w_i})
Softmax-based vs. sampling-based
Softmax-based approaches keep the softmax layer intact but make it more efficient.
Sampling-based approaches optimize a different loss function that approximates the softmax.
Hierarchical Softmax
Softmax as a binary tree: evaluate at most log_2 |V| nodes instead of all |V| nodes
Figure: Hierarchical softmax [Morin and Bengio, 2005]
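A minimal sketch of how a word's probability is computed in a hierarchical softmax, assuming the tree (each word's path of inner nodes and the left/right decision taken at each) has already been built; all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(h, path_nodes, path_codes, node_vectors):
    """p(w | h) as a product of binary decisions along w's path in the tree.

    path_nodes:   indices of the inner nodes on the path from the root to w
    path_codes:   the decision taken at each node, encoded as +1 / -1
    node_vectors: matrix of inner-node vectors, shape (num_inner_nodes, d)
    Only about log2(|V|) dot products are needed instead of |V|.
    """
    prob = 1.0
    for node, code in zip(path_nodes, path_codes):
        prob *= sigmoid(code * (node_vectors[node] @ h))
    return prob
```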
Hierarchical Softmax
Structure is important; the fastest (and most commonly used) variant is the Huffman tree (short paths for frequent words)
Figure: Hierarchical softmax [Mnih and Hinton, 2008]
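An illustrative sketch of deriving Huffman code lengths from word frequencies with Python's standard heapq, purely to show why frequent words end up close to the root; it is not taken from any of the cited implementations:

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Return the Huffman code (path) length for each word; frequent words
    get short paths, rare words get long ones."""
    counter = itertools.count()  # tie-breaker so heapq never compares the dicts
    heap = [(f, next(counter), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, lens1 = heapq.heappop(heap)   # merge the two least frequent subtrees
        f2, _, lens2 = heapq.heappop(heap)
        merged = {w: l + 1 for w, l in {**lens1, **lens2}.items()}
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

print(huffman_code_lengths({"the": 1000, "cat": 50, "sat": 40, "zygote": 1}))
# "the" gets the shortest path, "zygote" one of the longest
```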
Differentiated Softmax
Idea: We have more knowledge (co-occurrences, etc.) about frequent words and less about rare words
→ words that occur more often allow us to fit more parameters; extremely rare words only allow us to fit a few
→ use different embedding sizes to represent each output word
Larger embeddings (more parameters) for frequent words, smaller embeddings for rare words
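A rough sketch of the idea, assuming the vocabulary is sorted by frequency and split into bands, each of which sees only a slice of the hidden layer; this mirrors the block-diagonal output matrix of [Chen et al., 2015] only loosely, and all names are illustrative:

```python
import numpy as np

def differentiated_softmax_probs(h, bands):
    """Softmax where each frequency band has its own embedding size.

    h:     hidden representation, shape (d,)
    bands: list of (d_band, embeddings) pairs ordered from frequent to rare,
           where embeddings has shape (d_band, band_size) and the d_band
           values sum to d; frequent bands get a larger d_band.
    """
    scores, offset = [], 0
    for d_band, embeddings in bands:
        h_band = h[offset:offset + d_band]   # this band only sees a slice of h
        scores.append(h_band @ embeddings)
        offset += d_band
    scores = np.concatenate(scores)          # one score per word across all bands
    scores -= scores.max()
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()     # still a proper distribution over |V|
```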
Differentiated Softmax
Figure: Differentiated softmax [Chen et al., 2015]
CNN-Softmax
Idea: Instead of learning all output word embeddings separately, learn a function to produce them
Figure: CNN-Softmax [Jozefowicz et al., 2016]
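A toy sketch of producing an output embedding from a word's characters with one convolution layer and max-pooling; the model of [Jozefowicz et al., 2016] is far larger and trained end-to-end, so treat this purely as an illustration (it also assumes the word has at least `width` characters):

```python
import numpy as np

def char_cnn_embedding(word, char_vectors, filters, width=3):
    """Produce an output word embedding from characters instead of a lookup.

    char_vectors: dict mapping each character to a vector of shape (d_char,)
    filters:      array of shape (d, width * d_char); d is the embedding size
    """
    chars = np.stack([char_vectors[c] for c in word])        # (len(word), d_char)
    windows = [chars[i:i + width].reshape(-1)                 # sliding character windows
               for i in range(len(word) - width + 1)]
    feature_maps = np.stack([filters @ w for w in windows])   # (num_windows, d)
    return feature_maps.max(axis=0)                           # max-pool over positions
```

Because rare words no longer need their own rows in the output matrix, the number of output parameters stops growing with |V|.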
Sampling-based approaches
Sampling-based approaches optimize a different loss function that approximates the softmax.
Margin-based Hinge Loss
Idea: Why do multi-class classification at all? There is only one correct word and many incorrect ones. [Collobert et al., 2011]
Train the model to produce higher scores for correct word windows than for incorrect ones, i.e. minimize
\sum_{x \in X} \sum_{w \in V} \max\{0, 1 - f(x) + f(x^{(w)})\}
where
x is a correct window
x^{(w)} is a ”corrupted” window (the target word replaced by a random word w)
f(x) is the score output by the model
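A minimal sketch of the per-example loss, assuming the model's scores for the correct and corrupted windows have already been computed; the function is illustrative, not Collobert et al.'s code:

```python
def hinge_loss(score_correct, scores_corrupted):
    """Ranking loss: push the correct window's score at least 1 above
    every corrupted window's score.

    score_correct:    f(x), the score of the observed window
    scores_corrupted: f(x^(w)) for windows with the target word replaced
    """
    return sum(max(0.0, 1.0 - score_correct + s) for s in scores_corrupted)
```

No normalization over the vocabulary is ever computed, which is why this objective is cheap but does not yield probabilities.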
Noise Contrastive Estimation
Idea: Train model to differentiate target word from noise
Figure: Noise Contrastive Estimation (NCE) [Mnih and Teh, 2012]
Noise Contrastive Estimation
Language modeling reduces to binary classification
Draw k noise samples from a noise distribution (e.g. the unigram distribution) for every word; correct words given their context are true (y = 1), noise samples are false (y = 0)
Minimize the cross-entropy with a logistic regression loss
Approximates the softmax as the number of noise samples k increases
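A sketch of the per-word NCE loss, assuming (as in [Mnih and Teh, 2012]) that the partition function is fixed to 1 so the unnormalized score exp(h^T v_w) can stand in for the model probability; all names are illustrative:

```python
import numpy as np

def nce_loss(h, target_vec, noise_vecs, q_target, q_noise, k):
    """Binary cross-entropy contrasting the target word with k noise samples.

    q_target, q_noise: probabilities of the target / noise words under the
    noise distribution (e.g. the unigram distribution).
    """
    def p_true(score, q):                    # P(y = 1 | word, context)
        return score / (score + k * q)

    target_score = np.exp(h @ target_vec)
    loss = -np.log(p_true(target_score, q_target))
    for v, q in zip(noise_vecs, q_noise):
        noise_score = np.exp(h @ v)
        loss -= np.log(1.0 - p_true(noise_score, q))
    return loss
```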
Negative Sampling
Simplification of NCE [Mikolov et al., 2013]
No longer approximates the softmax, as the goal is to learn high-quality word embeddings (rather than language modeling)
Makes NCE more efficient by making its most expensive term constant
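A sketch of the corresponding per-word loss; compared with the NCE sketch above, the expensive k · q(w) term is set to 1, so the posterior collapses to a plain sigmoid of the score:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(h, target_vec, noise_vecs):
    """Push the true word's score up and the sampled noise words' scores down."""
    loss = -np.log(sigmoid(h @ target_vec))
    for v in noise_vecs:
        loss -= np.log(sigmoid(-(h @ v)))
    return loss
```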
Thank you for your attention!
The content of most of these slides is also available as blog posts at sebastianruder.com.
For more information: [email protected]
Bibliography I
[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3:1137–1155.
[Chen et al., 2015] Chen, W., Grangier, D., and Auli, M. (2015). Strategies for Training Large Vocabulary Neural Language Models.
[Collobert et al., 2011] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
Bibliography II
[Jozefowicz et al., 2016] Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the Limits of Language Modeling.
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, pages 1–9.
[Mnih and Hinton, 2008] Mnih, A. and Hinton, G. E. (2008). A Scalable Hierarchical Distributed Language Model. Advances in Neural Information Processing Systems, pages 1–8.
Bibliography III
[Mnih and Teh, 2012] Mnih, A. and Teh, Y. W. (2012). A Fast and Simple Algorithm for Training Neural Probabilistic Language Models. Proceedings of the 29th International Conference on Machine Learning (ICML'12), pages 1751–1758.
[Morin and Bengio, 2005] Morin, F. and Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model. AISTATS, 5.