Word2Vec: Learning of word representations in a vector space
Daniele Di Mitri - Joeri Hermans
23 March 2015
Student Lecture - Di Mitri & Hermans
Outline
1. Limitations of classic NLP techniques
2. Skip-gram
3. Negative sampling
4. Learning word representations
5. Applications
6. References
Limitations of classic NLP techniques
Classic NLP techniques: N-grams, bag of words ("love candy store")
• words treated as atomic units
• or as vectors in a high-dimensional space: [0,0,0,0,1,0,…,0], also known as one-hot
Simple and robust models, even when trained on huge amounts of data, BUT:
• no semantic relationships between words: not designed to model linguistic knowledge
• data is extremely sparse due to the high number of dimensions
• scaling up will not result in significant progress
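The one-hot limitation above can be made concrete with a tiny sketch (the three-word vocabulary is taken from the "love candy store" example; everything else is illustrative):

```python
import numpy as np

# Toy vocabulary from the slide's example phrase
vocab = ["love", "candy", "store"]

def one_hot(word):
    """One-hot encode a word over the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

love, candy = one_hot("love"), one_hot("candy")

# The dot product of any two distinct one-hot vectors is always 0:
# one-hot encodings carry no notion of similarity between words.
print(np.dot(love, candy))  # 0.0
```

With a real vocabulary of hundreds of thousands of words, these vectors are also extremely sparse, which is the second limitation listed above.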
Word's context
Successful intuition: a word's context represents its semantics.
(figure: the surrounding context words represent the meaning of "banking")
Feature vectors
• One-hot problem: [0,0,1] AND [1,0,0] = 0!
• Bengio et al. (2003) introduce word features (feature vectors) learned with a neural architecture that models P(w_t | w_{t-(n-1)}, …, w_{t-1})
candy = {0.124, -0.553, 0.923, 0.345, -0.009}
• dimensionality reduction using word vectors
• data sparsity is no longer a problem
• but: not computationally efficient
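A feature-vector model replaces one-hot lookups with rows of a dense embedding matrix. A minimal sketch (the vocabulary, dimension, and random initialization are illustrative; in practice the matrix is learned by the network):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["love", "candy", "store", "banking", "money"]
d = 5  # embedding dimension; tiny here, hundreds in practice

# Embedding matrix: one d-dimensional feature vector per word.
# Randomly initialized here; training would adjust these rows.
E = rng.normal(size=(len(vocab), d))

def vec(word):
    """Look up the dense feature vector for a word."""
    return E[vocab.index(word)]

print(vec("candy"))  # a dense vector like {0.124, -0.553, ...}
```

Each word now lives in d dimensions instead of |V| dimensions, which removes the sparsity problem at the cost of having to train the matrix.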
Importance of efficiency
• Mikolov et al. (2013) introduce two more computationally efficient neural architectures: skip-gram and continuous bag-of-words (CBOW)
• Hypothesis: simpler models trained on (a lot) more data will result in better word representations
• How to evaluate these word representations? Semantic similarity (cosine similarity)!
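Cosine similarity, the evaluation measure mentioned above, is just the cosine of the angle between two vectors (a minimal sketch with made-up vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))                                # 1.0: scaling keeps the direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0: orthogonal
```

Because it ignores vector length, cosine similarity compares only the direction of word vectors, which is what carries the semantics in these models.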
Example
vec("king") − vec("man") + vec("woman") ≈ vec("queen")
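The analogy can be answered by vector arithmetic followed by a nearest-neighbour search. A toy sketch (the 2-d vectors are made up so that the gender offset is consistent; real word2vec vectors have hundreds of dimensions):

```python
import numpy as np

# Hypothetical 2-d word vectors, chosen so "king" - "man" + "woman" lands on "queen"
vecs = {
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.5, 0.2]),
    "woman": np.array([0.5, 0.9]),
    "queen": np.array([0.9, 1.5]),
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = vecs[a] - vecs[b] + vecs[c]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the query words themselves, as is standard practice
    return max((w for w in vecs if w not in {a, b, c}),
               key=lambda w: cos(vecs[w], target))

print(analogy("king", "man", "woman"))  # queen
```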
Skip-gram
Feedforward NN for classification.
Classification task: given the current word, predict the previous and next words (the context).
The weights learned in the input-to-hidden weight matrix are our word vectors.
Supervised learning with unlabeled input data!
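The "unlabeled data" point follows from how training examples are built: every (current word, context word) pair extracted from raw text is a labeled example for the classifier. A sketch of the pair extraction (function name and window size are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for the skip-gram task."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...]
```

No human labeling is needed: the text itself supplies both the inputs and the targets.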
Negative sampling
• Computing the output probability over every word in the vocabulary is very expensive.
• Alongside the correct context word, select a few incorrect contexts at random.
• Faster training: only the weights of a few words are updated instead of those of all words in the language.
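A minimal sketch of drawing the random negatives (uniform sampling over a toy vocabulary; word2vec actually samples from a smoothed unigram distribution, which this simplification omits):

```python
import random

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def negative_samples(positive, k, rng=random.Random(0)):
    """Draw k 'negative' context words at random, excluding the true context word."""
    candidates = [w for w in vocab if w != positive]
    return rng.sample(candidates, k)

# For the true pair ('cat', 'sat'), train against 2 random incorrect contexts:
print(negative_samples("sat", k=2))
```

Each training step then updates only the vectors of the positive word and the k negatives, rather than all |V| output weights.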
Example applications
• In machine learning:
  • machine translation
• In data mining:
  • dimensionality reduction
References
1. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. 2003.
2. Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. 2008.
3. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. 2013.
4. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. 2013.
• Try the code: word2vec.googlecode.com