Word2Vec: Learning of word representations in a vector space
Daniele Di Mitri - Joeri Hermans
23 March 2015
Student Lecture - Di Mitri & Hermans
Outline
1. Limitations of classic NLP techniques
2. Skip-gram
3. Negative sampling
4. Learning word representations
5. Applications
6. References
Limitations of classic NLP techniques
Classic NLP techniques: N-grams, bag of words ("love candy store")
• words treated as atomic units
• or as vectors in a high-dimensional space: [0,0,0,0,1,0,…,0], also known as one-hot
Simple and robust models, even when trained on huge amounts of data, BUT:
• no semantic relationships between words: not designed to model linguistic knowledge
• data is extremely sparse due to the high number of dimensions
• scaling up will not result in significant progress
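The one-hot limitation above can be made concrete with a tiny sketch (the three-word vocabulary is taken from the "love candy store" example; everything else is illustrative):

```python
import numpy as np

# Toy vocabulary from the slide's example phrase
vocab = ["love", "candy", "store"]

def one_hot(word):
    """One-hot encode a word over the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

love, candy = one_hot("love"), one_hot("candy")

# The dot product of any two distinct one-hot vectors is always 0:
# one-hot encodings carry no notion of similarity between words.
print(np.dot(love, candy))  # 0.0
```

With a real vocabulary of hundreds of thousands of words, these vectors are also extremely sparse, which is the second limitation listed above.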
Word's context
Successful intuition: a word's context represents its semantics.
(figure: the surrounding context words represent the meaning of "banking")
Feature vectors
• One-hot problem: [0,0,1] AND [1,0,0] = 0!
• Bengio et al. (2003) introduce word features (feature vectors) learned with a neural architecture that models P(w_t | w_{t-(n-1)}, …, w_{t-1})
candy = {0.124, -0.553, 0.923, 0.345, -0.009}
• dimensionality reduction using word vectors
• data sparsity is no longer a problem
• but: not computationally efficient
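A feature-vector model replaces one-hot lookups with rows of a dense embedding matrix. A minimal sketch (the vocabulary, dimension, and random initialization are illustrative; in practice the matrix is learned by the network):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["love", "candy", "store", "banking", "money"]
d = 5  # embedding dimension; tiny here, hundreds in practice

# Embedding matrix: one d-dimensional feature vector per word.
# Randomly initialized here; training would adjust these rows.
E = rng.normal(size=(len(vocab), d))

def vec(word):
    """Look up the dense feature vector for a word."""
    return E[vocab.index(word)]

print(vec("candy"))  # a dense vector like {0.124, -0.553, ...}
```

Each word now lives in d dimensions instead of |V| dimensions, which removes the sparsity problem at the cost of having to train the matrix.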
Importance of efficiency
• Mikolov et al. (2013) introduce two more computationally efficient neural architectures: skip-gram and continuous bag-of-words (CBOW)
• Hypothesis: simpler models trained on (a lot) more data will result in better word representations
• How to evaluate these word representations? Semantic similarity (cosine similarity)!
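Cosine similarity, the evaluation measure mentioned above, is just the cosine of the angle between two vectors (a minimal sketch with made-up vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))                                # 1.0: scaling keeps the direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0: orthogonal
```

Because it ignores vector length, cosine similarity compares only the direction of word vectors, which is what carries the semantics in these models.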
Example
vec("king") − vec("man") + vec("woman") ≈ vec("queen")
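The analogy can be answered by vector arithmetic followed by a nearest-neighbour search. A toy sketch (the 2-d vectors are made up so that the gender offset is consistent; real word2vec vectors have hundreds of dimensions):

```python
import numpy as np

# Hypothetical 2-d word vectors, chosen so "king" - "man" + "woman" lands on "queen"
vecs = {
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.5, 0.2]),
    "woman": np.array([0.5, 0.9]),
    "queen": np.array([0.9, 1.5]),
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = vecs[a] - vecs[b] + vecs[c]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the query words themselves, as is standard practice
    return max((w for w in vecs if w not in {a, b, c}),
               key=lambda w: cos(vecs[w], target))

print(analogy("king", "man", "woman"))  # queen
```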
Skip-gram
Feedforward NN for classification.
Classification task: given the current word, predict the previous and next words (the context).
The weights learned in the input-to-hidden weight matrix are our word vectors.
Supervised learning with unlabeled input data!
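The "unlabeled data" point follows from how training examples are built: every (current word, context word) pair extracted from raw text is a labeled example for the classifier. A sketch of the pair extraction (function name and window size are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for the skip-gram task."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...]
```

No human labeling is needed: the text itself supplies both the inputs and the targets.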
Negative sampling
• Computing the output probability over every word in the vocabulary is very expensive.
• Alongside the correct context word, select a few incorrect contexts at random.
• Faster training: only the weights of a few words are updated instead of those of all words in the language.
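A minimal sketch of drawing the random negatives (uniform sampling over a toy vocabulary; word2vec actually samples from a smoothed unigram distribution, which this simplification omits):

```python
import random

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def negative_samples(positive, k, rng=random.Random(0)):
    """Draw k 'negative' context words at random, excluding the true context word."""
    candidates = [w for w in vocab if w != positive]
    return rng.sample(candidates, k)

# For the true pair ('cat', 'sat'), train against 2 random incorrect contexts:
print(negative_samples("sat", k=2))
```

Each training step then updates only the vectors of the positive word and the k negatives, rather than all |V| output weights.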
Example applications
• In machine learning:
  • machine translation
• In data mining:
  • dimensionality reduction
References
1. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. 2003.
2. Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. 2008.
3. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. 2013.
4. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. 2013.
• Try the code: word2vec.googlecode.com