Word2Vec: Vector Representations of Words
Mohammad Mahdavi
May 2016
Outline
Introduction
Background
Approach
Application
Implementation
Conclusion
Motivation
Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors.
Motivation
However, natural language processing systems traditionally treat words as discrete atomic symbols.
Encodings are arbitrary.
They provide no useful information to the system.
They lead to data sparsity, so we need more data.
Words as atomic units (one-hot encoding):
cat → [0, ..., 0, 1, 0, ..., 0]
dog → [0, ..., 1, 0, 0, ..., 0]
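As a minimal sketch of what these one-hot encodings look like in code (the toy vocabulary and the one_hot helper are illustrative, not from the deck):

import numpy as np

# Toy vocabulary; real systems index hundreds of thousands of words.
vocab = ["cat", "dog", "car", "tree"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A sparse, arbitrary vector: all zeros except a single 1.
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("cat"))                   # [1. 0. 0. 0.]
print(one_hot("dog"))                   # [0. 1. 0. 0.]
print(one_hot("cat") @ one_hot("dog"))  # 0.0: the encoding carries no similarity information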
What is Word2Vec?
Word2Vec is a model for learning vector representations of words, called "word embeddings".
cat → [0.122, 0.81, 0.405, ..., 0.77]
Using vector representations can overcome some of these obstacles!
Background
Vector Space Models
Vector space models (VSMs) represent (embed) words in a continuous vector space.
Semantically similar words are mapped to nearby points.
The basic idea is the distributional hypothesis: words that appear in the same contexts share semantic meaning.
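To make "nearby points" concrete, here is a small sketch of cosine similarity, the usual proximity measure in these spaces; the 4-dimensional vectors below are made up for illustration:

import numpy as np

cat = np.array([0.12, 0.81, 0.40, 0.77])
dog = np.array([0.10, 0.75, 0.45, 0.70])
car = np.array([0.90, 0.05, 0.30, 0.01])

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(cat, dog))  # high: semantically similar words sit close together
print(cosine(cat, car))  # lower: unrelated words sit further apart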
Word's Context
Example sentence: where there's a will there's a way.
With window = 3, a word's context is the (up to) three words on either side of the target word.
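A short sketch of how (target, context) pairs fall out of raw text under this definition of context (the context_pairs helper and the whitespace tokenization are illustrative):

def context_pairs(tokens, window=3):
    # Yield (target, context_word) for every word within `window`
    # positions of each target word.
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield target, tokens[j]

sentence = "where there's a will there's a way".split()
for target, context in context_pairs(sentence, window=3):
    print(target, "->", context)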
Approach
Characteristics
Word2vec is a computationally efficient model for learning word embeddings.
Word2vec is a successful example of "shallow" learning:
a very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
Supervised learning with unlabeled input data!
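To show just how shallow the model is, a toy numpy sketch of the skip-gram network: the hidden layer is a plain embedding lookup, the output layer is linear plus a softmax, and training is ordinary backpropagation. The sizes, word indices, and the train_pair helper are hypothetical, and real word2vec adds optimizations (negative sampling, hierarchical softmax) on top of this skeleton.

import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # the word embeddings
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # output weights

def train_pair(target, context, lr=0.05):
    h = W_in[target]                      # hidden layer = embedding lookup, no non-linearity
    scores = h @ W_out                    # linear output layer
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # softmax over the vocabulary
    grad = p.copy()
    grad[context] -= 1.0                  # gradient of cross-entropy w.r.t. the scores
    dW_out = np.outer(h, grad)
    W_in[target] -= lr * (W_out @ grad)   # backpropagate into the embedding
    W_out[...] -= lr * dW_out             # and into the output weights

# The "labels" are just neighbouring words taken from the unlabeled text:
train_pair(target=3, context=7)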
Learning Models
Continuous Bag-of-Words (CBOW): predicts the target word from the source context words.
Skip-gram: predicts the source context words from the target word.
Learning Models
CBOW: faster, suited to smaller data sets, stronger on syntactic similarity.
Skip-gram: more accurate, suited to larger data sets, stronger on semantic similarity.
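In gensim the choice between the two is a single constructor flag; a minimal sketch with a toy corpus (min_count is lowered so the tiny vocabulary is not pruned):

import gensim

sentences = [["where", "there", "is", "a", "will"],
             ["there", "is", "a", "way"]]  # toy corpus

# sg=0 (the default) trains CBOW; sg=1 trains skip-gram.
cbow_model = gensim.models.Word2Vec(sentences, min_count=1, sg=0)
skipgram_model = gensim.models.Word2Vec(sentences, min_count=1, sg=1)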
Application
Intuition
vec(w) ≡ the vector representation of word "w"

If the relation between words "u" and "v" is similar to the relation between words "p" and "q":
⟹ vec(u) - vec(v) ≈ vec(p) - vec(q)
⟹ vec(u) - vec(v) + vec(q) ≈ vec(p)
⟹ use "u", "v", and "q" to predict "p"

If words "u" and "v" have similar syntax or semantics:
⟹ vec(u) ≈ vec(v)
⟹ use "u" to predict "v"
Interesting Results
Results from a model trained on Persian text (translated here from the original Persian):
capital + Turkey → Ankara
country + homeland → Iran
minister + sport → Mahmoud Goudarzi
owner + Facebook → Mark Zuckerberg
president → Hassan Rouhani
(the vec(u) ≈ vec(v) case: use "u" to predict "v")
Interesting Results
Kerman - Rafsanjan + Ardakan → Yazd
Branko - Persepolis + Esteghlal → Mazloumi
Tehran - Iran + England → London
Rouhani - Iran + Turkey → Erdogan
Queiroz - football + volleyball → Lozano
(the analogy case: vec(u) - vec(v) + vec(q) ≈ vec(p), so "u", "v", and "q" predict "p")
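Queries like these map directly onto gensim's most_similar, written here in the 2016-era API used elsewhere in this deck (newer gensim versions expose it as model.wv.most_similar). The corpus below is only a stand-in; convincing analogies need a model trained on a large corpus.

import gensim

sentences = [["tehran", "iran", "london", "england", "president"]] * 50
model = gensim.models.Word2Vec(sentences, min_count=1)

# vec("tehran") - vec("iran") + vec("england") ≈ vec("london"), ideally:
print(model.most_similar(positive=["tehran", "england"], negative=["iran"]))

# Nearest neighbours of a single word (the vec(u) ≈ vec(v) case):
print(model.most_similar(positive=["president"]))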
Implementation
Libraries
Original implementation in C
Gensim implementation in Python
...
Gensim
Gensim is a free Python library.
It was originally built to generate a short list of the articles most similar to a given article (gensim = "generate similar").
Its Word2Vec training algorithms were originally ported from the C package and later extended with additional functionality.
Gensim's Word2Vec
import gensim

corpus_path = "corpus.txt"             # placeholder path: plain text, one sentence per line
model_backup_path = "word2vec.model"   # placeholder path for the saved model

# Load corpus: stream sentences from the text file
sentence_stream = gensim.models.word2vec.LineSentence(corpus_path)

# Generate bigrams: merge frequent word pairs into single tokens
bigram = gensim.models.Phrases(sentence_stream)
term_list = list(bigram[sentence_stream])

# Build and save the model
model = gensim.models.Word2Vec(term_list)
model.save(model_backup_path)

# Load and use the model
model = gensim.models.Word2Vec.load(model_backup_path)
model.most_similar(positive=["president"])
Learning Parameters
Vector size (∈ [100, 1000])
Window size (∈ [5, 15])
Minimum term occurrence (∈ [5, 100])
Learning method (CBOW or skip-gram)
Iteration count (∈ [5, 30])
Negative samples count (∈ [5, 30])
... (see the sketch below for the corresponding gensim parameters)
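A sketch of how these knobs map onto gensim's Word2Vec constructor, using the 2016-era parameter names (size and iter were later renamed vector_size and epochs in gensim 4); the values are examples picked from the ranges above:

import gensim

model = gensim.models.Word2Vec(
    term_list,        # tokenized corpus, as built on the previous slide
    size=300,         # vector size
    window=10,        # window size
    min_count=5,      # minimum term occurrence
    sg=1,             # learning method: 1 = skip-gram, 0 = CBOW
    iter=10,          # iteration count
    negative=10,      # negative samples count
)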
Conclusion
Summary
Word2vec is a computationally efficient model for learning word embeddings.
Its basic idea is that words in similar contexts have similar meanings.
The learned vectors can be used as input for many NLP tasks.
There are good free implementations, including Python's Gensim library.
References
[1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[2] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
[3] Goldberg, Yoav, and Omer Levy. "word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).
[4] Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).
[5] https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html#vector-representations-of-words
vec("Mohammad Mahdavi") + vec("Gmail") ≈ vec("Mohammad Mahdavi") + vec("Facebook") ≈ vec("moh.mahdavi.l")