Word2Vec: Vector Representations of Words
Mohammad Mahdavi, May 2016


Page 1

Word2Vec: Vector Representations of Words

Mohammad Mahdavi, May 2016


Page 2

Outline

Introduction

Background

Approach

Application

Implementation

Conclusion

Page 3

Motivation

Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors.


Page 4

Motivation

However, natural language processing systems traditionally treat words as discrete atomic symbols.

Words as atomic units (one-hot encoding):

cat → [0,...,0,1,0,...,0]
dog → [0,...,1,0,0,...,0]

Such encodings are arbitrary, provide no useful information to the system, and lead to data sparsity (so we need more data).
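To make the sparsity concrete, here is a minimal sketch (with a hypothetical four-word vocabulary) of how one-hot vectors are built; note that each vector is as long as the whole vocabulary:

import numpy as np

# Hypothetical toy vocabulary; real vocabularies hold 10^5 to 10^6 words.
vocabulary = ["cat", "dog", "will", "way"]
word_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # All zeros except a single 1 at the word's index.
    vector = np.zeros(len(vocabulary))
    vector[word_index[word]] = 1.0
    return vector

print(one_hot("cat"))  # [1. 0. 0. 0.]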


Page 5

What is Word2Vec?

Word2Vec is a model for learning vector representations of words, called "word embeddings".

cat ⟶ [0.122, 0.81, 0.405, ..., 0.77]

Using vector representations can overcome some of these obstacles!
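As a sketch of what a lookup looks like in gensim (with a hypothetical toy corpus, only to make the example self-contained; a real model needs far more text, see the implementation section):

import gensim

# Hypothetical toy corpus; real training needs a large corpus.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
model = gensim.models.Word2Vec(sentences, size=50, min_count=1)

# In gensim versions contemporary to this talk, indexing the model returns
# the word's dense vector (newer versions use model.wv["cat"]).
print(model["cat"][:5])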


Page 6

Outline

Introduction

Background

Approach

Application

Implementation

Conclusion

Page 7

Vector Space Models

Vector space models (VSMs) represent (embed) words in a continuous vector space.

Semantically similar words are mapped to nearby points.

The basic idea is the distributional hypothesis: words that appear in the same contexts share semantic meaning.
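"Nearby" is typically measured with cosine similarity; a minimal sketch with made-up toy vectors (not real embeddings):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between the vectors: 1.0 means same direction.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 3-dimensional "embeddings", purely illustrative.
cat = np.array([0.7, 0.1, 0.2])
dog = np.array([0.6, 0.2, 0.2])
car = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(cat, dog))  # high: similar contexts
print(cosine_similarity(cat, car))  # lower: different contexts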


Page 8

Word’s Context

where there's a will there's a way.

Target word and its context (window = 3).
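A minimal sketch of extracting a context window from the sentence above (tokenization simplified; e.g., for the center word "will" the context is the three words on each side):

sentence = "where there's a will there's a way".split()
window = 3

for i, target in enumerate(sentence):
    # Up to 'window' words on each side of the target word.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(target, "->", context)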


Page 9

Outline

Introduction

Background

Approach

Application

Implementation

Conclusion

Page 10

Characteristics

Word2vec is a computationally efficient model for learning word embeddings.

Word2vec is a successful example of “shallow” learning.

It is a very simple feedforward neural network with a single hidden layer, trained with backpropagation, and with no non-linearity in the hidden layer.

Supervised learning with unlabeled input data: the (input, label) training pairs are generated automatically from raw text.
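A minimal sketch of this architecture (toy sizes, random weights; a real implementation adds the training loop and the efficiency tricks described in the references):

import numpy as np

V, D = 10000, 300                     # vocabulary size, embedding dimension
W_in = 0.01 * np.random.randn(V, D)   # input weights: the word embeddings
W_out = 0.01 * np.random.randn(D, V)  # output weights

def predict_context_distribution(target_index):
    h = W_in[target_index]            # hidden layer is a plain lookup: no non-linearity
    scores = h @ W_out                # linear output layer
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()          # softmax over the whole vocabulary

probs = predict_context_distribution(42)  # P(context word | target word)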


Page 11

Learning Models

Continuous Bag-of-Words (CBOW): predicts target words from source context words.

Skip-gram: predicts source context words from the target words.
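A sketch of the training pairs each model derives from raw text (window shortened to 1 to keep the output small):

sentence = "where there's a will".split()
window = 1

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print("CBOW:      %s -> %s" % (context, target))  # context predicts target
    for c in context:
        print("Skip-gram: %s -> %s" % (target, c))    # target predicts each context word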


Page 12

Learning Models

CBOW: faster; works well with smaller data sets; better at syntactic similarity.

Skip-gram: more accurate; works well with larger data sets; better at semantic similarity.


Page 13

Outline

Introduction

Background

Approach

Application

Implementation

Conclusion

Page 14

Intuition

vec(w) ≡ the vector representation of word "w"

If the relation between words "u" and "v" is similar to the relation between words "p" and "q":
⟹ vec(u) − vec(v) ≈ vec(p) − vec(q)
⟹ vec(u) − vec(v) + vec(q) ≈ vec(p)
⟹ use "u", "v", and "q" to predict "p"

If words "u" and "v" have similar syntax or semantics:
⟹ vec(u) ≈ vec(v)
⟹ use "u" to predict "v"
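This arithmetic is what gensim's most_similar computes directly; a sketch with the classic English example, assuming 'model' is a Word2Vec model already trained on a large corpus:

# vec("king") - vec("man") + vec("woman") ≈ vec("queen")
# 'model' is assumed to be a gensim Word2Vec model trained on a large corpus.
model.most_similar(positive=["king", "woman"], negative=["man"])
# Typically returns "queen" near the top of the list.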


Page 15

Interesting Results

"capital" + "Turkey" → "Ankara"
"country" + "homeland" → "Iran"
"minister" + "sport" → "Mahmoud Goudarzi"
"owner" + "Facebook" → "Mark Zuckerberg"
"president" → "Hassan Rouhani"

(Examples translated from Persian.)


If words "u" and "v" have similar syntax or semantics:
⟹ vec(u) ≈ vec(v)
⟹ use "u" to predict "v"

Page 16

Interesting Results

"Kerman" − "Rafsanjan" + "Ardakan" → "Yazd"
"Branko" − "Persepolis" + "Esteghlal" → "Mazloumi"
"Tehran" − "Iran" + "England" → "London"
"Rouhani" − "Iran" + "Turkey" → "Erdogan"
"Queiroz" − "football" + "volleyball" → "Lozano"

(Examples translated from Persian.)


If the relation between words "u" and "v" is similar to the relation between words "p" and "q":
⟹ vec(u) − vec(v) ≈ vec(p) − vec(q)
⟹ vec(u) − vec(v) + vec(q) ≈ vec(p)
⟹ use "u", "v", and "q" to predict "p"

Page 17

Outline

Introduction

Background

Approach

Application

Implementation

Conclusion

Page 18

Libraries

Original implementation in C

Gensim Implementation in Python

...


Page 19

Gensim

Gensim is a FREE Python library.

It was originally built to generate a short list of the articles most similar to a given article (gensim = "generate similar").

The Word2Vec training algorithms were originally ported from the C package and extended with additional functionality.


Page 20

Gensim’s Word2Vec

import gensim

# Load corpus: one sentence per line, whitespace-tokenized.
corpus_path = "corpus.txt"  # hypothetical path to the training corpus
sentence_stream = gensim.models.word2vec.LineSentence(corpus_path)

# Generate bigrams: merge frequent word pairs into single tokens.
bigram = gensim.models.Phrases(sentence_stream)
term_list = list(bigram[sentence_stream])

# Build and save the model.
model = gensim.models.Word2Vec(term_list)
model_backup_path = "word2vec.model"  # hypothetical output path
model.save(model_backup_path)

# Load and use the model.
model = gensim.models.Word2Vec.load(model_backup_path)
model.most_similar(positive=["president"])


Page 21

Learning Parameters

Vector Size (∈ [100,1000])

Window Size (∈ [5,15])

Min Term Occurrence (∈ [5,100])

Learning Method (CBOW or Skip-Gram)

Iteration Count (∈ [5,30])

Negative Samples Count (∈ [5,30])

...
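A sketch of how these map onto gensim's Word2Vec constructor (parameter names from gensim versions current at the time of this talk; the values are illustrative, not recommendations):

import gensim

model = gensim.models.Word2Vec(
    term_list,      # the tokenized corpus built on the previous slide
    size=300,       # vector size
    window=10,      # window size
    min_count=5,    # min term occurrence
    sg=1,           # learning method: 1 = skip-gram, 0 = CBOW
    iter=10,        # iteration count
    negative=10,    # negative samples count
)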


Page 22

Outline

Introduction

Background

Approach

Application

Implementation

Conclusion

Page 23

Summary

Word2vec is a computationally efficient model for learning word embeddings.

Its basic idea is that words appearing in similar contexts have similar meanings.

The learned vectors can be used as input features for many NLP tasks.

There are good, free implementations, including Python’s Gensim library.


Page 24

References

[1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

[2] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.

[3] Goldberg, Yoav, and Omer Levy. "word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).

[4] Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).

[5] https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html#vector-representations-of-words


Page 25

vec("Mohammad Mahdavi") + vec("Gmail") ≈ vec("Mohammad Mahdavi") + vec("Facebook") ≈ vec("moh.mahdavi.l")
