Softmax Approximations for Learning WordEmbeddings and Language Modeling
Sebastian Ruder@seb ruder
1st NLP Meet-up
03.08.16
Agenda
1 Softmax
2 Softmax-based Approaches
  Hierarchical Softmax
  Differentiated Softmax
  CNN-Softmax
3 Sampling-based Approaches
  Margin-based Hinge Loss
  Noise Contrastive Estimation
  Negative Sampling
Language modeling objective
Goal: a probabilistic model of language
Maximize the probability of a word w_t given its n−1 previous words, i.e. p(w_t \mid w_{t-1}, \cdots, w_{t-n+1})
N-gram models:
p(w_t \mid w_{t-1}, \cdots, w_{t-n+1}) = \frac{\mathrm{count}(w_{t-n+1}, \cdots, w_{t-1}, w_t)}{\mathrm{count}(w_{t-n+1}, \cdots, w_{t-1})}
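A minimal Python sketch of the count-based estimate above, assuming the corpus is just a list of tokens; the function name and example corpus are illustrative, not from the slides:

```python
from collections import Counter

def ngram_prob(tokens, context, word):
    """Estimate p(word | context) by counting, where `context` is a tuple of
    the n-1 previous words and `tokens` is the training corpus."""
    n = len(context) + 1
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    if contexts[tuple(context)] == 0:
        return 0.0  # unseen context: return 0 rather than dividing by zero
    return ngrams[tuple(context) + (word,)] / contexts[tuple(context)]

corpus = "the cat sat on the mat because the cat was tired".split()
print(ngram_prob(corpus, ("the",), "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
```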
Softmax objective for language modeling
Figure: Predicting the next word with the softmax
Softmax objective for language modeling
Neural networks with softmax:
p(w \mid w_{t-1}, \cdots, w_{t-n+1}) = \frac{\exp(h^\top v_w)}{\sum_{w_i \in V} \exp(h^\top v_{w_i})}
where
h is the ”hidden” representation of the input (the previous words), of dimensionality d
v_{w_i} is the ”output” word embedding of word w_i, ≠ its input word embedding
V is the vocabulary
The inner product h^\top v_w computes the score (”unnormalized” probability) of the model for word w given the input
Output word embeddings are stored in a d × |V| matrix
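A minimal NumPy sketch of the full softmax just described; the array names (h, V_out) are illustrative:

```python
import numpy as np

def softmax_probs(h, V_out):
    """Full softmax over the vocabulary.

    h:      hidden representation, shape (d,)
    V_out:  output word embeddings, shape (d, |V|)
    Returns a probability vector of shape (|V|,).
    """
    scores = h @ V_out                    # unnormalized scores h^T v_w for every word in V
    scores -= scores.max()                # shift for numerical stability (does not change the result)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize by the partition function Z
```

The single product against the d × |V| matrix is exactly what makes the full softmax expensive for large vocabularies.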
Neural language model
Figure: Neural language model [Bengio et al., 2003]
Softmax use cases
Maximum entropy models use the same form of probability distribution:
P_h(y \mid x) = \frac{\exp(h \cdot f(x, y))}{\sum_{y' \in Y} \exp(h \cdot f(x, y'))}
where
h is a weight vector
f(x, y) is a feature vector
Pervasive use in NNs:
Go-to multi-class classification objective
”Soft” selection, e.g. for attention, memory retrieval, etc.
The denominator is called the partition function:
Z = \sum_{w_i \in V} \exp(h^\top v_{w_i})
Softmax-based vs. sampling-based
Softmax-based approaches keep the softmax layer intact but make it more efficient.
Sampling-based approaches optimize a different loss function that approximates the softmax.
Hierarchical Softmax
Softmax as a binary tree: evaluate at most log_2 |V| nodes instead of all |V| nodes
Figure: Hierarchical softmax [Morin and Bengio, 2005]
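A minimal sketch of how a word's probability is computed in a hierarchical softmax, assuming the tree (each word's path of inner nodes and the left/right decision taken at each) has already been built; all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(h, path_nodes, path_codes, node_vectors):
    """p(w | h) as a product of binary decisions along w's path in the tree.

    path_nodes:   indices of the inner nodes on the path from the root to w
    path_codes:   the decision taken at each node, encoded as +1 / -1
    node_vectors: matrix of inner-node vectors, shape (num_inner_nodes, d)
    Only about log2(|V|) dot products are needed instead of |V|.
    """
    prob = 1.0
    for node, code in zip(path_nodes, path_codes):
        prob *= sigmoid(code * (node_vectors[node] @ h))
    return prob
```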
Hierarchical Softmax
Structure is important; the fastest (and most commonly used) variant is the Huffman tree (short paths for frequent words)
Figure: Hierarchical softmax [Mnih and Hinton, 2008]
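An illustrative sketch of deriving Huffman code lengths from word frequencies with Python's standard heapq, purely to show why frequent words end up close to the root; it is not taken from any of the cited implementations:

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Return the Huffman code (path) length for each word; frequent words
    get short paths, rare words get long ones."""
    counter = itertools.count()  # tie-breaker so heapq never compares the dicts
    heap = [(f, next(counter), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, lens1 = heapq.heappop(heap)   # merge the two least frequent subtrees
        f2, _, lens2 = heapq.heappop(heap)
        merged = {w: l + 1 for w, l in {**lens1, **lens2}.items()}
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

print(huffman_code_lengths({"the": 1000, "cat": 50, "sat": 40, "zygote": 1}))
# "the" gets the shortest path, "zygote" one of the longest
```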
Differentiated Softmax
Idea: We have more knowledge (co-occurrences, etc.) about frequent words and less about rare words
→ words that occur more often allow us to fit more parameters; extremely rare words only allow us to fit a few
→ use different embedding sizes to represent each output word
Larger embeddings (more parameters) for frequent words, smaller embeddings for rare words
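A rough sketch of the idea, assuming the vocabulary is sorted by frequency and split into bands, each of which sees only a slice of the hidden layer; this mirrors the block-diagonal output matrix of [Chen et al., 2015] only loosely, and all names are illustrative:

```python
import numpy as np

def differentiated_softmax_probs(h, bands):
    """Softmax where each frequency band has its own embedding size.

    h:     hidden representation, shape (d,)
    bands: list of (d_band, embeddings) pairs ordered from frequent to rare,
           where embeddings has shape (d_band, band_size) and the d_band
           values sum to d; frequent bands get a larger d_band.
    """
    scores, offset = [], 0
    for d_band, embeddings in bands:
        h_band = h[offset:offset + d_band]   # this band only sees a slice of h
        scores.append(h_band @ embeddings)
        offset += d_band
    scores = np.concatenate(scores)          # one score per word across all bands
    scores -= scores.max()
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()     # still a proper distribution over |V|
```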
Differentiated Softmax
Figure: Differentiated softmax [Chen et al., 2015]
CNN-Softmax
Idea: Instead of learning all output word embeddings separately, learn a function to produce them
Figure: CNN-Softmax [Jozefowicz et al., 2016]
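A toy sketch of producing an output embedding from a word's characters with one convolution layer and max-pooling; the model of [Jozefowicz et al., 2016] is far larger and trained end-to-end, so treat this purely as an illustration (it also assumes the word has at least `width` characters):

```python
import numpy as np

def char_cnn_embedding(word, char_vectors, filters, width=3):
    """Produce an output word embedding from characters instead of a lookup.

    char_vectors: dict mapping each character to a vector of shape (d_char,)
    filters:      array of shape (d, width * d_char); d is the embedding size
    """
    chars = np.stack([char_vectors[c] for c in word])        # (len(word), d_char)
    windows = [chars[i:i + width].reshape(-1)                 # sliding character windows
               for i in range(len(word) - width + 1)]
    feature_maps = np.stack([filters @ w for w in windows])   # (num_windows, d)
    return feature_maps.max(axis=0)                           # max-pool over positions
```

Because rare words no longer need their own rows in the output matrix, the number of output parameters stops growing with |V|.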
Sampling-based approaches
Sampling-based approaches optimize a different loss function that approximates the softmax.
Margin-based Hinge Loss
Idea: Why do multi-class classification at all? There is only one correct word and many incorrect ones. [Collobert et al., 2011]
Train the model to produce higher scores for correct word windows than for incorrect ones, i.e. minimize
\sum_{x \in X} \sum_{w \in V} \max\{0, 1 - f(x) + f(x^{(w)})\}
where
x is a correct window
x^{(w)} is a ”corrupted” window (the target word replaced by a random word w)
f(x) is the score output by the model
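A minimal sketch of the per-example loss, assuming the model's scores for the correct and corrupted windows have already been computed; the function is illustrative, not Collobert et al.'s code:

```python
def hinge_loss(score_correct, scores_corrupted):
    """Ranking loss: push the correct window's score at least 1 above
    every corrupted window's score.

    score_correct:    f(x), the score of the observed window
    scores_corrupted: f(x^(w)) for windows with the target word replaced
    """
    return sum(max(0.0, 1.0 - score_correct + s) for s in scores_corrupted)
```

No normalization over the vocabulary is ever computed, which is why this objective is cheap but does not yield probabilities.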
Noise Contrastive Estimation
Idea: Train model to differentiate target word from noise
Figure: Noise Contrastive Estimation (NCE) [Mnih and Teh, 2012]
Noise Contrastive Estimation
Language modeling reduces to binary classification
Draw k noise samples from a noise distribution (e.g. the unigram distribution) for every word; correct words given their context are true (y = 1), noise samples are false (y = 0)
Minimize the cross-entropy with a logistic regression loss
Approximates the softmax as the number of noise samples k increases
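A sketch of the per-word NCE loss, assuming (as in [Mnih and Teh, 2012]) that the partition function is fixed to 1 so the unnormalized score exp(h^T v_w) can stand in for the model probability; all names are illustrative:

```python
import numpy as np

def nce_loss(h, target_vec, noise_vecs, q_target, q_noise, k):
    """Binary cross-entropy contrasting the target word with k noise samples.

    q_target, q_noise: probabilities of the target / noise words under the
    noise distribution (e.g. the unigram distribution).
    """
    def p_true(score, q):                    # P(y = 1 | word, context)
        return score / (score + k * q)

    target_score = np.exp(h @ target_vec)
    loss = -np.log(p_true(target_score, q_target))
    for v, q in zip(noise_vecs, q_noise):
        noise_score = np.exp(h @ v)
        loss -= np.log(1.0 - p_true(noise_score, q))
    return loss
```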
Negative Sampling
Simplification of NCE [Mikolov et al., 2013]
No longer approximates the softmax, as the goal is to learn high-quality word embeddings (rather than language modeling)
Makes NCE more efficient by making its most expensive term constant
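A sketch of the corresponding per-word loss; compared with the NCE sketch above, the expensive k · q(w) term is set to 1, so the posterior collapses to a plain sigmoid of the score:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(h, target_vec, noise_vecs):
    """Push the true word's score up and the sampled noise words' scores down."""
    loss = -np.log(sigmoid(h @ target_vec))
    for v in noise_vecs:
        loss -= np.log(sigmoid(-(h @ v)))
    return loss
```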
Thank you for your attention!
The content of most of these slides is also available as blog posts at sebastianruder.com.
For more information: [email protected]
Bibliography I
[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3:1137–1155.
[Chen et al., 2015] Chen, W., Grangier, D., and Auli, M. (2015). Strategies for Training Large Vocabulary Neural Language Models.
[Collobert et al., 2011] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
Bibliography II
[Jozefowicz et al., 2016] Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the Limits of Language Modeling.
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, pages 1–9.
[Mnih and Hinton, 2008] Mnih, A. and Hinton, G. E. (2008). A Scalable Hierarchical Distributed Language Model. Advances in Neural Information Processing Systems, pages 1–8.
Bibliography III
[Mnih and Teh, 2012] Mnih, A. and Teh, Y. W. (2012). A Fast and Simple Algorithm for Training Neural Probabilistic Language Models. Proceedings of the 29th International Conference on Machine Learning (ICML'12), pages 1751–1758.
[Morin and Bengio, 2005] Morin, F. and Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model. AISTATS, 5.