Word2Vec: Vector Representations of Words
Mohammad Mahdavi
May 2016
Outline
Introduction
Background
Approach
Application
Implementation
Conclusion
Motivation
Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors.
Motivation
However, natural language processing systems traditionally treat words as discrete atomic symbols.
Encodings are arbitrary.
They provide no useful information to the system.
They lead to data sparsity, so we need more data.
Words as atomic units (one-hot encoding):
cat → [0, ..., 0, 1, 0, ..., 0]
dog → [0, ..., 1, 0, 0, ..., 0]
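As a minimal sketch of what these one-hot encodings look like in code (the toy vocabulary and the one_hot helper are illustrative, not from the deck):

import numpy as np

# Toy vocabulary; real systems index hundreds of thousands of words.
vocab = ["cat", "dog", "car", "tree"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A sparse, arbitrary vector: all zeros except a single 1.
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("cat"))                   # [1. 0. 0. 0.]
print(one_hot("dog"))                   # [0. 1. 0. 0.]
print(one_hot("cat") @ one_hot("dog"))  # 0.0: the encoding carries no similarity information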
What is Word2Vec?
Word2Vec is a model for learning vector representations of words, called "word embeddings".
cat → [0.122, 0.81, 0.405, ..., 0.77]
Using vector representations can overcome some of these obstacles!
Background
Vector Space Models
Vector space models (VSMs) represent (embed) words in a continuous vector space.
Semantically similar words are mapped to nearby points.
The basic idea is the distributional hypothesis: words that appear in the same contexts share semantic meaning.
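To make "nearby points" concrete, here is a small sketch of cosine similarity, the usual proximity measure in these spaces; the 4-dimensional vectors below are made up for illustration:

import numpy as np

cat = np.array([0.12, 0.81, 0.40, 0.77])
dog = np.array([0.10, 0.75, 0.45, 0.70])
car = np.array([0.90, 0.05, 0.30, 0.01])

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(cat, dog))  # high: semantically similar words sit close together
print(cosine(cat, car))  # lower: unrelated words sit further apart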
Word's Context
Example sentence: where there's a will there's a way.
With window = 3, a word's context is the (up to) three words on either side of the target word.
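A short sketch of how (target, context) pairs fall out of raw text under this definition of context (the context_pairs helper and the whitespace tokenization are illustrative):

def context_pairs(tokens, window=3):
    # Yield (target, context_word) for every word within `window`
    # positions of each target word.
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield target, tokens[j]

sentence = "where there's a will there's a way".split()
for target, context in context_pairs(sentence, window=3):
    print(target, "->", context)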
Approach
Characteristics
Word2vec is a computationally efficient model for learning word embeddings.
Word2vec is a successful example of "shallow" learning:
a very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
Supervised learning with unlabeled input data!
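To show just how shallow the model is, a toy numpy sketch of the skip-gram network: the hidden layer is a plain embedding lookup, the output layer is linear plus a softmax, and training is ordinary backpropagation. The sizes, word indices, and the train_pair helper are hypothetical, and real word2vec adds optimizations (negative sampling, hierarchical softmax) on top of this skeleton.

import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # the word embeddings
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # output weights

def train_pair(target, context, lr=0.05):
    h = W_in[target]                      # hidden layer = embedding lookup, no non-linearity
    scores = h @ W_out                    # linear output layer
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # softmax over the vocabulary
    grad = p.copy()
    grad[context] -= 1.0                  # gradient of cross-entropy w.r.t. the scores
    dW_out = np.outer(h, grad)
    W_in[target] -= lr * (W_out @ grad)   # backpropagate into the embedding
    W_out[...] -= lr * dW_out             # and into the output weights

# The "labels" are just neighbouring words taken from the unlabeled text:
train_pair(target=3, context=7)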
Learning Models
Continuous Bag-of-Words (CBOW): predicts the target word from the source context words.
Skip-gram: predicts the source context words from the target word.
Learning Models
CBOW: faster, suited to smaller data sets, stronger on syntactic similarity.
Skip-gram: more accurate, suited to larger data sets, stronger on semantic similarity.
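In gensim the choice between the two is a single constructor flag; a minimal sketch with a toy corpus (min_count is lowered so the tiny vocabulary is not pruned):

import gensim

sentences = [["where", "there", "is", "a", "will"],
             ["there", "is", "a", "way"]]  # toy corpus

# sg=0 (the default) trains CBOW; sg=1 trains skip-gram.
cbow_model = gensim.models.Word2Vec(sentences, min_count=1, sg=0)
skipgram_model = gensim.models.Word2Vec(sentences, min_count=1, sg=1)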
Application
Intuition
vec(w) ≡ the vector representation of word "w"

If the relation between words "u" and "v" is similar to the relation between words "p" and "q":
⟹ vec(u) - vec(v) ≈ vec(p) - vec(q)
⟹ vec(u) - vec(v) + vec(q) ≈ vec(p)
⟹ use "u", "v", and "q" to predict "p"

If words "u" and "v" have similar syntax or semantics:
⟹ vec(u) ≈ vec(v)
⟹ use "u" to predict "v"
Interesting Results
Results from a model trained on Persian text (translated here from the original Persian):
capital + Turkey → Ankara
country + homeland → Iran
minister + sport → Mahmoud Goudarzi
owner + Facebook → Mark Zuckerberg
president → Hassan Rouhani
(the vec(u) ≈ vec(v) case: use "u" to predict "v")
Interesting Results
Kerman - Rafsanjan + Ardakan → Yazd
Branko - Persepolis + Esteghlal → Mazloumi
Tehran - Iran + England → London
Rouhani - Iran + Turkey → Erdogan
Queiroz - football + volleyball → Lozano
(the analogy case: vec(u) - vec(v) + vec(q) ≈ vec(p), so "u", "v", and "q" predict "p")
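Queries like these map directly onto gensim's most_similar, written here in the 2016-era API used elsewhere in this deck (newer gensim versions expose it as model.wv.most_similar). The corpus below is only a stand-in; convincing analogies need a model trained on a large corpus.

import gensim

sentences = [["tehran", "iran", "london", "england", "president"]] * 50
model = gensim.models.Word2Vec(sentences, min_count=1)

# vec("tehran") - vec("iran") + vec("england") ≈ vec("london"), ideally:
print(model.most_similar(positive=["tehran", "england"], negative=["iran"]))

# Nearest neighbours of a single word (the vec(u) ≈ vec(v) case):
print(model.most_similar(positive=["president"]))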
Implementation
Libraries
Original implementation in C
Gensim implementation in Python
...
Gensim
Gensim is a free Python library.
It was originally built to generate a short list of the articles most similar to a given article (gensim = "generate similar").
Its Word2Vec training algorithms were originally ported from the C package and later extended with additional functionality.
Gensim's Word2Vec
import gensim

corpus_path = "corpus.txt"             # placeholder path: plain text, one sentence per line
model_backup_path = "word2vec.model"   # placeholder path for the saved model

# Load corpus: stream sentences from the text file
sentence_stream = gensim.models.word2vec.LineSentence(corpus_path)

# Generate bigrams: merge frequent word pairs into single tokens
bigram = gensim.models.Phrases(sentence_stream)
term_list = list(bigram[sentence_stream])

# Build and save the model
model = gensim.models.Word2Vec(term_list)
model.save(model_backup_path)

# Load and use the model
model = gensim.models.Word2Vec.load(model_backup_path)
model.most_similar(positive=["president"])
Learning Parameters
Vector size (∈ [100, 1000])
Window size (∈ [5, 15])
Minimum term occurrence (∈ [5, 100])
Learning method (CBOW or skip-gram)
Iteration count (∈ [5, 30])
Negative samples count (∈ [5, 30])
... (see the sketch below for the corresponding gensim parameters)
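A sketch of how these knobs map onto gensim's Word2Vec constructor, using the 2016-era parameter names (size and iter were later renamed vector_size and epochs in gensim 4); the values are examples picked from the ranges above:

import gensim

model = gensim.models.Word2Vec(
    term_list,        # tokenized corpus, as built on the previous slide
    size=300,         # vector size
    window=10,        # window size
    min_count=5,      # minimum term occurrence
    sg=1,             # learning method: 1 = skip-gram, 0 = CBOW
    iter=10,          # iteration count
    negative=10,      # negative samples count
)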
Conclusion
Summary
Word2vec is a computationally efficient model for learning word embeddings.
Its basic idea is that words in similar contexts have similar meanings.
The learned vectors can be used as input for many NLP tasks.
There are good free implementations, including Python's Gensim library.
References
[1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[2] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
[3] Goldberg, Yoav, and Omer Levy. "word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).
[4] Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).
[5] https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html#vector-representations-of-words
vec("Mohammad Mahdavi") + vec("Gmail") ≈ vec("Mohammad Mahdavi") + vec("Facebook") ≈ vec("moh.mahdavi.l")