CS 3750: Word Models
Presented by: Muheng Yan
University of Pittsburgh
Feb. 20, 2020
Are Document Models Enough?
• Recap: previously we used LDA and LSI to learn document representations
• What if we have very short documents, or even sentences? (e.g. Tweets)
• Can we investigate relationships between words/sentences with the previous models?
• We need to model words individually for finer granularity
Distributional Semantics: From a Linguistic Aspect
Word Embedding, Distributed Representations, Semantic Vector Space... what are they?

A more formal term from linguistics: Distributional Semantic Model

"... quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data." --Wikipedia

--> Represent elements of language (words here) as distributions of other elements (i.e. documents, paragraphs, sentences, and words)

E.g. word 1 = doc 1 + doc 5 + doc 10, or word 1 = 0.5*word 12 + 0.7*word 24
Document-Level Representation
Words as distributions of documents:

Latent Semantic Analysis/Indexing (LSA/LSI):
1. Build a co-occurrence matrix of words vs. documents (n by d)
2. Decompose the word-document matrix via SVD
3. Keep the largest singular values to get a low-rank approximation of the word-document matrix, whose rows serve as the word representations
Picture Credit: https://en.wikipedia.org/wiki/Latent_semantic_analysis
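As a concrete illustration of steps 1-3, here is a minimal sketch using scikit-learn's CountVectorizer and TruncatedSVD; the toy corpus and the choice of 2 components are illustrative assumptions, not from the slides.

```python
# A minimal LSA sketch: count matrix -> truncated SVD -> word vectors.
# The toy corpus and k=2 are illustrative choices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "machine learning on text data",
    "deep learning for language models",
    "graphs and random walks",
]

# 1. Word-document co-occurrence matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # shape: (d docs, n words)

# 2-3. SVD keeping the largest singular values; rows of the transformed
# word-by-document matrix are the low-rank word representations.
svd = TruncatedSVD(n_components=2)
word_vectors = svd.fit_transform(X.T)       # shape: (n words, 2)
print(dict(zip(vectorizer.get_feature_names_out(), word_vectors)))
```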
Word-Level Representation
I. Counting and Matrix Factorization
II. Latent Representation
  I. Neural Network for Language Models
  II. CBOW
  III. Skip-gram
  IV. Other Models
III. Graph-based Models
  I. Node2Vec
Counting and Matrix Factorization
• Counting methods start with constructing a matrix of co-occurrences between words and words (this can be extended to other levels; e.g. at the document level it becomes LSA)
• Due to the high dimensionality and sparsity, the matrix is usually paired with a dimensionality-reduction algorithm (PCA, SVD, etc.)
• The rows of the matrix approximate the distribution of co-occurring words for every word we are trying to model

Example models include: LSA, Explicit Semantic Analysis (ESA), Global Vectors for Word Representation (GloVe)
Explicit Semantic Analysis
• Similar words most likely appear with the same distribution of topics
• ESA represents topics by Wikipedia concepts (pages). ESA uses Wikipedia concepts as dimensions to construct the space into which words are projected
• For each dimension (concept), the words in that concept's article are counted
• An inverted index is then constructed to convert each word into a vector of concepts
• The vector constructed for each word represents the frequency of its occurrences within each concept
Picture and Content Credit: Ahmed Magooda
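To make the inverted-index idea concrete, a toy sketch follows; the two "concept articles" are hypothetical stand-ins for Wikipedia pages.

```python
# A toy sketch of ESA's inverted index: concepts (articles) become
# dimensions, and each word maps to its counts across concepts.
from collections import Counter

concepts = {
    "Machine_learning": "learning algorithms learn from data data",
    "Linguistics": "language words and grammar of language",
}

# Count each word per concept, then invert: word -> concept-frequency vector.
inverted = {}
for concept, text in concepts.items():
    for word, count in Counter(text.split()).items():
        inverted.setdefault(word, {})[concept] = count

print(inverted["learning"])   # {'Machine_learning': 1}
print(inverted["language"])   # {'Linguistics': 2}
```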
Global Vectors for Word Representation (GloVe)
1. Build the word-word co-occurrence matrix with a sliding window ($|V|$ by $|V|$) (and normalize as probabilities)
2. Construct the cost as:
$J = \sum_{i,j}^{|V|} f(X_{ij}) \left( v_i^\top \tilde{v}_j + b_i + b_j - \log X_{ij} \right)^2$
3. Use gradient descent to solve the optimization
"I learn machine learning in CS-3750"

Window=2     I    learn    machine    learning
I            0    1        1          0
learn        1    0        1          1
machine      1    1        0          2
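A minimal sketch of step 1 on the slide's example sentence; this helper is illustrative, not GloVe's actual preprocessing code.

```python
# Word-word co-occurrence counts with a sliding window. This is a simple
# unweighted count; the slide's table may use a different counting convention.
from collections import defaultdict

sentence = "I learn machine learning in CS-3750".split()
window = 2

counts = defaultdict(int)
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            counts[(center, sentence[j])] += 1

print(counts[("machine", "learning")])  # co-occurrence count for one pair
```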
GloVe Cont.
How is the cost derived?

Probability of words i and k appearing together: $P_{ik} = \frac{X_{ik}}{X_i}$

Using word k as a probe, the "ratio" of two word pairs: $ratio_{i,j,k} = \frac{P_{ik}}{P_{jk}}$

To model the ratio with embeddings v: $J = \sum_{i,j,k} \left( ratio_{i,j,k} - g(v_i, v_j, \tilde{v}_k) \right)^2$ -> $O(N^3)$

Simplify the computation by designing $g(\cdot) = e^{(v_i - v_j)^\top \tilde{v}_k}$

Thus we are trying to make $\frac{P_{ik}}{P_{jk}} = \frac{e^{v_i^\top \tilde{v}_k}}{e^{v_j^\top \tilde{v}_k}}$

Thus we have $J = \sum \left( \log P_{ik} - v_i^\top \tilde{v}_k \right)^2$

To expand the objective $\log P_{ik} = v_i^\top \tilde{v}_k$, we have $\log X_{ik} - \log X_i = v_i^\top \tilde{v}_k$, then $\log X_{ik} = v_i^\top \tilde{v}_k + b_i + \tilde{b}_k$. By doing this, we solve the problem that $X_{ik} \neq X_{ki}$ while the product $v_i^\top \tilde{v}_k$ is symmetric

Then we arrive at the final cost function $J = \sum_{i,j}^{|V|} f(X_{ij}) \left( v_i^\top \tilde{v}_j + b_i + b_j - \log X_{ij} \right)^2$, where $f(\cdot)$ is a weighting function
Value of ratio $P_{ik}/P_{jk}$:

                        j and k related    j and k not related
i and k related         1                  Inf
i and k not related     0                  1
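For concreteness, the final cost above can be computed directly in NumPy; the toy counts and random vectors are illustrative, and the weighting-function constants (x_max=100, alpha=0.75) follow the GloVe paper.

```python
# A hedged sketch of the GloVe cost in NumPy.
import numpy as np

V, d = 100, 50
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(V, V)) + 1        # toy co-occurrence counts
v, v_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)

def f(x, x_max=100.0, alpha=0.75):
    # Weighting function: caps the influence of very frequent pairs.
    return np.minimum((x / x_max) ** alpha, 1.0)

residual = v @ v_tilde.T + b[:, None] + b_tilde[None, :] - np.log(X)
J = np.sum(f(X) * residual ** 2)
print(J)
```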
Latent Representation
Modeling the distribution of context* for a given word through a series of latent variables, by maximizing the likelihood P(word | context)**

Usually implemented with neural networks

The learned latent variables are used as the representations of words after optimization

* Context refers to the other words from whose distribution we model the target word
** In some models it could be P(context | word), e.g. skip-gram
Neural Network for Language Models
Learning Objective (predicting the next word $w_t$):

Find the parameter set $\theta$ to minimize
$L(\theta) = -\frac{1}{T} \sum_t \log P(w_t \mid w_{t-1}, \dots, w_{t-n+1}) + R(\theta)$

where $P(\cdot) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$, $y = b + W_{out} \tanh(d + W_{in} x)$,

and $x$ is the lookup result for the n-length sequence:
$x = [C(w_{t-1}), \dots, C(w_{t-n+1})]$
* $(W_{out}, b)$ is the parameter set of the output layer; $(W_{in}, d)$ is the parameter set of the hidden layer
In this model we learn the parameters in $C$ ($|V| \times N$), $W_{in}$ ($(n-1) \cdot N \times$ hidden_size), and $W_{out}$ (hidden_size $\times |V|$)
Content Credit: Ahmed Magooda
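A minimal PyTorch sketch of this architecture; the vocabulary size, dimensions, and random data are illustrative assumptions, and the model is a simplified stand-in for the original, not an exact reproduction.

```python
# A feed-forward (Bengio-style) language model sketch in PyTorch.
import torch
import torch.nn as nn

V, N, hidden, n = 1000, 64, 128, 4   # vocab, embed dim, hidden size, context length

class FFLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.C = nn.Embedding(V, N)                  # lookup matrix C (|V| x N)
        self.hidden = nn.Linear((n - 1) * N, hidden) # (W_in, d)
        self.out = nn.Linear(hidden, V)              # (W_out, b)

    def forward(self, context):                      # context: (batch, n-1) word ids
        x = self.C(context).flatten(1)               # concatenated lookups
        return self.out(torch.tanh(self.hidden(x)))  # logits; softmax gives P(w_t | context)

model = FFLM()
logits = model(torch.randint(0, V, (8, n - 1)))
loss = nn.functional.cross_entropy(logits, torch.randint(0, V, (8,)))
loss.backward()
```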
RNN for Language Models
Learning Objective: similar to the NN for LM

Changes from the NN:
◦ The hidden layer is now a linear combination of the input of the current word t and the hidden state of the previous word t-1:
$s(t) = f(U w(t) + W s(t-1))$
where $f(\cdot)$ is the activation function
Content Credit: Ahmed Magooda
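A tiny NumPy sketch of the recurrent update above; the dimensions and the choice of tanh as f are assumptions for illustration.

```python
# Recurrent hidden-state update s(t) = f(U w(t) + W s(t-1)).
import numpy as np

N, H = 64, 128                       # embedding dim, hidden dim
U = np.random.randn(H, N) * 0.01     # input-to-hidden weights
W = np.random.randn(H, H) * 0.01     # hidden-to-hidden weights

def step(w_t, s_prev):
    # f = tanh here
    return np.tanh(U @ w_t + W @ s_prev)

s = np.zeros(H)
for w_t in np.random.randn(5, N):    # a toy 5-word sequence of embeddings
    s = step(w_t, s)
```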
Continuous Bag-of-Words Model
Picture Credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf
Learning Objective: maximizing the likelihood of $P(word \mid context)$ for every word in a corpus

Similar to the NN for LM, the inputs are one-hot vectors and the matrix $W$ here acts as the look-up matrix.

Differences compared to the NN for LM:
◦ Bi-directional: not predicting the "next" word; instead, predicting the center word inside a window, with words from both directions as input
◦ Significantly reduced complexity: only $2 \times |V| \times N$ parameters to learn
CBOW Cont.
Steps breakdown:
1. Generate the one-hot vectors for the context $(x_{c-m}, \dots, x_{c-1}, x_{c+1}, \dots, x_{c+m} \in \mathbb{R}^{|V|})$, and look up the word vectors $v_i = W x_i$
2. Average the vectors over the context: $\hat{v} = \frac{v_{c-m} + \dots + v_{c+m}}{2m}$
3. Generate the posterior $z = W' \hat{v}$, and turn it into probabilities $\hat{y} = \mathrm{softmax}(z)$
4. Calculate the loss as the cross-entropy $-\sum_{i=1}^{|V|} y_i \log(\hat{y}_i)$, which models $P(w_c \mid w_{c-m}, \dots, w_{c+m})$

Notations:
m: half window size
c: center word index
$w_i$: word i from vocabulary V
$x_i$: one-hot input of word i
$W \in \mathbb{R}^{n \times |V|}$: the context lookup matrix
$W' \in \mathbb{R}^{|V| \times n}$: the center lookup matrix
CBOW Cont.
Loss function:

For each word $w_c \in V$, minimize
$J(\theta) = -\log P(w_c \mid w_{c-m}, \dots, w_{c+m})$
$= -\log P(v_c' \mid \hat{v})$
$= -\log \frac{e^{v_c'^\top \hat{v}}}{\sum_{j=1}^{|V|} e^{v_j'^\top \hat{v}}}$
$= -v_c'^\top \hat{v} + \log \sum_{j=1}^{|V|} e^{v_j'^\top \hat{v}}$

Optimization: use SGD to update all relevant vectors $v_c'$ and $\hat{v}$
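The four steps and the loss can be traced in a few lines of NumPy; the vocabulary size, dimensions, and word indices below are illustrative assumptions.

```python
# A NumPy sketch of the CBOW forward pass and cross-entropy loss.
import numpy as np

V, n, m = 1000, 64, 2                      # vocab size, embed dim, half window
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (n, V))             # context lookup matrix
W_prime = rng.normal(0, 0.1, (V, n))       # center lookup matrix

context_ids = [3, 7, 21, 42]               # 2m context word indices (toy)
center_id = 5

v_hat = W[:, context_ids].mean(axis=1)     # steps 1-2: look up and average
z = W_prime @ v_hat                        # step 3: posterior scores
z -= z.max()                               # numerical stability
y_hat = np.exp(z) / np.exp(z).sum()        # softmax
loss = -np.log(y_hat[center_id])           # step 4: cross-entropy
print(loss)
```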
Skip-gram Model
Picture Credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf
Learning Objective: maximizing the likelihood of $P(context \mid word)$ for every word in a corpus

Steps Breakdown:
1. Generate the one-hot vector for the center word $x_c \in \mathbb{R}^{|V|}$, and calculate the embedded vector $v_c = W x_c \in \mathbb{R}^{n}$
2. Calculate the posterior $z = W' v_c$
3. For each word j in the context of the center word, calculate the probabilities $\hat{y} = \mathrm{softmax}(z)$
4. We want the probabilities $\hat{y}_{w_j}$ in $\hat{y}$ to match the true probabilities of the context words, which are $y_{c-m}, \dots, y_{c+m}$
Cost function constructed similarly to the CBOW model
Skip-gram Cont.
Cost Function:

For every center word $w_c \in V$, minimize:
$J(\theta) = -\log P(w_{c-m}, \dots, w_{c+m} \mid w_c)$
$= -\log \prod_{j=0, j \neq m}^{2m} P(w_{c-m+j} \mid w_c)$
$= -\log \prod_{j} P(v_{c-m+j}' \mid v_c)$
$= -\log \prod_{j} \frac{e^{v_{c-m+j}'^\top v_c}}{\sum_{k=1}^{|V|} e^{v_k'^\top v_c}}$
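A parallel NumPy sketch of the skip-gram cost, reusing the CBOW notation; again, all sizes and indices are illustrative assumptions.

```python
# Skip-gram cost: sum of -log P(context word | center word) over the window.
import numpy as np

V, n = 1000, 64
rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, (n, V))             # center lookup matrix
W_prime = rng.normal(0, 0.1, (V, n))       # context lookup matrix

center_id, context_ids = 5, [3, 7, 21, 42]

v_c = W[:, center_id]                      # step 1: embed the center word
z = W_prime @ v_c                          # step 2: posterior scores
z -= z.max()                               # numerical stability
y_hat = np.exp(z) / np.exp(z).sum()        # step 3: softmax

loss = -np.log(y_hat[context_ids]).sum()   # cost over the context window
print(loss)
```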
Skip-gram with Negative Sampling
An alternative way of learning skip-gram:

In the previous learning method, we loop heavily over negative samples when summing over $|V|$

Alternatively, we can reformulate the learning objective to enable "negative sampling", where we only take a few negative samples in each epoch

Alternative objective: maximize the likelihood $P(D=1 \mid w, c)$ if the word pair (w, c) is from the data, and maximize $P(D=0 \mid w, c)$ if (w, c) is not from the data
Skip-gram with Negative Sampling (Cont.)
We model the probability as:

$P(D=1 \mid w, c, \theta) = \sigma(v_c^\top v_w) = \frac{1}{1 + e^{-v_c^\top v_w}}$

And the optimization of the loss would be:

$\theta = \arg\max_\theta \prod_{(w,c) \in Data} P(D=1 \mid w, c, \theta) \prod_{(w,c) \notin Data} P(D=0 \mid w, c, \theta)$
$= \arg\max_\theta \prod_{(w,c) \in Data} P(D=1 \mid w, c, \theta) \prod_{(w,c) \notin Data} \left(1 - P(D=1 \mid w, c, \theta)\right)$
$= \arg\max_\theta \sum_{(w,c) \in Data} \log \frac{1}{1 + e^{-v_c^\top v_w}} + \sum_{(w,c) \notin Data} \log \left(1 - \frac{1}{1 + e^{-v_c^\top v_w}}\right)$
$= \arg\max_\theta \sum_{(w,c) \in Data} \log \frac{1}{1 + e^{-v_c^\top v_w}} + \sum_{(w,c) \notin Data} \log \frac{1}{1 + e^{v_c^\top v_w}}$
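The last line of the derivation is the per-pair negative-sampling loss; a NumPy sketch for a single (center, context) pair follows, with k, sizes, and indices as illustrative assumptions.

```python
# Negative-sampling objective for one positive pair plus k negatives.
import numpy as np

V, n, k = 1000, 64, 5
rng = np.random.default_rng(2)
v_center = rng.normal(0, 0.1, (V, n))      # center-word vectors
v_context = rng.normal(0, 0.1, (V, n))     # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w, c = 5, 42                               # a positive (center, context) pair
negatives = rng.integers(0, V, size=k)     # k sampled negative context words

# Maximize log sigma(v_c . v_w) for the positive pair and
# log sigma(-v_neg . v_w) for each negative sample.
pos = np.log(sigmoid(v_context[c] @ v_center[w]))
neg = np.log(sigmoid(-(v_context[negatives] @ v_center[w]))).sum()
loss = -(pos + neg)                        # minimize the negated objective
print(loss)
```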
Hierarchical Softmax and FastText
Hierarchical Softmax:

An alternative way to solve the dimensionality problem when softmaxing over y:
1. Build a binary tree of the words in V; each non-leaf node is associated with a pseudo-output to learn.
2. Define the loss for word c as the path to the word from the root
3. The probability $P(w_c \mid context)$ now becomes
$\prod_{j=1}^{\text{path length}} \sigma\left( [\![ n(w, j+1) \text{ is the left child of } n(w,j) ]\!] \cdot v_{n(w,j)}^\top \hat{v} \right)$, where $[\![ x ]\!]$ is 1 if x holds and -1 otherwise

FastText: Sub-word n-grams + Hierarchical Softmax
"apple" and "apples" refer to the same semantics, yet word-level models ignore such sub-word features.

The FastText model introduces sub-word n-gram inputs while keeping an architecture similar to the skip-gram model. This expands the input dimension to an even larger number, so it adopts hierarchical softmax to speed up the computation.
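For practice, Gensim exposes both ideas; a brief usage sketch follows, with a toy corpus and illustrative hyperparameters.

```python
# Training FastText with sub-word n-grams and hierarchical softmax in Gensim.
from gensim.models import FastText

sentences = [["i", "learn", "machine", "learning"],
             ["apples", "and", "apple", "trees"]]

# hs=1 enables hierarchical softmax (negative=0 disables negative sampling);
# min_n/max_n set the sub-word n-gram sizes.
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, hs=1, negative=0, min_n=3, max_n=6, epochs=10)
print(model.wv.similarity("apple", "apples"))  # sub-words make these close
```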
Graph-based Models
WordNet: a large lexical database of words. Words are grouped into sets of synonyms (synsets), each expressing a distinct concept

Provides word senses, part of speech, and semantic relationships between words

From this we can construct a real "net" of words, where words are nodes and word relationships defined by synsets are the edges
Content Credit: Ahmed Magooda
Node2Vec
A model to learn representations of nodes, applicable to any graph, including word nets

Turns graph topology into sequences with the random-walk algorithm:

Starting from a random node v, the probability of traveling to node x is:
$P(x \mid v) = \begin{cases} \frac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases}$

The transition probability is defined as:
$\pi_{vx} = \begin{cases} \frac{1}{p} & \text{if } x \text{ is the previous node } t \\ 1 & \text{if } x \text{ is a neighbor of } t \\ \frac{1}{q} & \text{if } x \text{ is a neighbor of a neighbor of } t \end{cases}$

p and q control the trade-off between breadth-first and depth-first search strategies

With the sequences generated, we can embed nodes as we did in language models
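A simplified sketch of one biased random-walk step; the toy graph, p, and q are illustrative, and edge weights are assumed uniform.

```python
# One node2vec-style biased step: weight 1/p to return, 1 to stay near the
# previous node, 1/q to move farther away.
import random

graph = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}   # toy adjacency list
p, q = 1.0, 2.0

def next_node(prev, cur):
    weights = []
    for x in graph[cur]:
        if x == prev:                 # back to the previous node: 1/p
            weights.append(1.0 / p)
        elif x in graph[prev]:        # neighbor of the previous node: 1
            weights.append(1.0)
        else:                         # farther from the previous node: 1/q
            weights.append(1.0 / q)
    return random.choices(graph[cur], weights=weights)[0]

walk = [1, 2]                          # seed the walk with one edge
for _ in range(8):
    walk.append(next_node(walk[-2], walk[-1]))
print(walk)                            # feed such walks to a skip-gram model
```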
Evaluation of Word Representations
• Intrinsic Evaluation: how good are the representations themselves?
• Extrinsic Evaluation: how effective are the learned representations in downstream tasks?
Content Credit: Ahmed Magooda
Intrinsic Evaluations
Word Similarity Task:
◦ Calculate the similarity of word pairs from the learned vectors using various distance metrics (e.g. Euclidean, cosine, etc.)
◦ Compare the calculated word similarities with human-annotated similarities
◦ Example test set: WordSim-353 (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/)

Analogy Task:
◦ Proposed by Mikolov et al. (2013)
◦ For a specific word relation, given a, b, and y, find x so that "a is to b as x is to y"
◦ "man is to king as woman is to queen"
"Analogy task" Content Credit: Ahmed Magooda
Extrinsic Evaluations
Learned word representations can be taken as inputs to encode texts in downstream NLP tasks, including:
◦ Sentiment analysis, POS tagging, QA, machine translation...
◦ GLUE benchmark: a collection of 9 sentence-level language understanding tasks (https://gluebenchmark.com/)
Summary
What we have covered:
- "words as distributions"
- evaluations of the word representations
                               Document-level    Word-level
Count & Decomposition          - LSA             - GloVe
Latent Vector Representation                     - NN for LM
                                                 - CBOW, Skip-gram, FastText
                                                 - Node2Vec
Software at Fingertips
LSA: manually via sklearn or Gensim
CBOW, Skip-gram: Gensim
Neural networks: Torch, Tensorflow
Pre-trained word representations:
◦ Word2vec: (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)
◦ GloVe: (https://nlp.stanford.edu/projects/glove/)
◦ FastText: (https://fasttext.cc/)
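A minimal Gensim training sketch tying the software back to the models above; the corpus and hyperparameters are illustrative assumptions.

```python
# Training word vectors with Gensim's Word2Vec.
from gensim.models import Word2Vec

sentences = [["i", "learn", "machine", "learning", "in", "cs-3750"],
             ["machine", "learning", "uses", "word", "models"]]

# sg=0 trains CBOW; sg=1 trains skip-gram (negative=5 uses negative sampling).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=20)
print(model.wv.most_similar("machine", topn=2))  # toy corpus, toy neighbors
```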
References

Word2Vec:
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space.
• Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality.
• Mikolov, T., Yih, W. T., & Zweig, G. (2013, June). Linguistic regularities in continuous space word representations.
• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

Counting:
• Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
• Gabrilovich, E., & Markovitch, S. (2007, January). Computing semantic relatedness using Wikipedia-based explicit semantic analysis.
• Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

NN for LM:
• Mikolov, T., Kopecky, J., Burget, L., & Glembek, O. (2009, April). Neural network based language models for highly inflective languages.
• Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model.

Graph-based:
• Princeton University. "About WordNet." WordNet. Princeton University. 2010. http://wordnet.princeton.edu
• Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855-864).