



CS 3750: Word Models
PRESENTED BY: MUHENG YAN

UNIVERSITY OF PITTSBURGH

FEB.20, 2020

Are Document Models Enough?

• Recap: previously we used LDA and LSI to learn document representations

• What if we have very short documents, or even sentences? (e.g. Tweets)

• Can we investigate relationships between words/sentences with the previous models?

• We need to model words individually for finer granularity


Distributional Semantics: from a Linguistic Aspect
Word Embedding, Distributed Representations, Semantic Vector Space... what are they?

A more formal term from linguistics: Distributional Semantic Model

"… quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data." --Wikipedia

--> Represent elements of language (here, words) as distributions of other elements (i.e. documents, paragraphs, sentences, or other words)

E.g. word 1 = doc 1 + doc 5 + doc 10, or word 1 = 0.5*word 12 + 0.7*word 24

Document Level Representation
Words as distributions of documents:

Latent Semantic Analysis/Indexing (LSA/LSI)
1. Build a co-occurrence matrix of words vs. documents (n by d)

2. Decompose the word-document matrix via SVD

3. Keep the largest singular values to get a low-rank approximation of the word-document matrix; its rows serve as the word representations (a code sketch follows below)

Picture Credit: https://en.wikipedia.org/wiki/Latent_semantic_analysis
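Below is a minimal sketch of this LSA pipeline using scikit-learn's CountVectorizer and TruncatedSVD; the toy corpus and the number of components are illustrative choices, not part of the original slides.

```python
# A minimal LSA sketch following the three steps above (toy corpus, illustrative sizes;
# assumes a recent scikit-learn with get_feature_names_out).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "machine learning on text data",
    "word models for short documents",
    "latent semantic analysis of documents",
]

# 1. Build the word-document co-occurrence matrix (here: raw term counts, docs x words).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # shape: (d, n)

# 2-3. Decompose via truncated SVD, keeping only the top singular values.
svd = TruncatedSVD(n_components=2)
doc_vectors = svd.fit_transform(X)           # low-rank document representations
word_vectors = svd.components_.T             # one row per word: the word representations

for word, vec in zip(vectorizer.get_feature_names_out(), word_vectors):
    print(word, vec)
```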


Word Level Representation
I. Counting and Matrix Factorization
II. Latent Representation
   I. Neural Network for Language Models
   II. CBOW
   III. Skip-gram
   IV. Other Models
III. Graph-based Models
   I. Node2Vec

Counting and Matrix Factorization
• Counting methods start by constructing a matrix of co-occurrences between words (this can be extended to other levels; e.g., at the document level it becomes LSA)

• Due to the high dimensionality and sparsity, they are usually combined with a dimensionality-reduction algorithm (PCA, SVD, etc.)

• The rows of the matrix approximate the distribution of co-occurring words for every word we are trying to model

Example models include LSA, Explicit Semantic Analysis (ESA), and Global Vectors for Word Representation (GloVe)


Explicit Semantic Analysis
• Similar words most likely appear with the same distribution of topics

• ESA represents topics by Wikipedia concepts (pages). ESA uses Wikipedia concepts as the dimensions of the space into which words are projected

• For each dimension (concept), the words in that concept's article are counted

• An inverted index is then constructed to convert each word into a vector of concepts

• The vector constructed for each word represents the frequency of its occurrences within each concept.

Picture and Content Credit: Ahmed Magooda

Global Vectors for Word Representation (GloVe)

1. Build the word-word co-occurrence matrix with a sliding window (|V| by |V|), optionally normalized into probabilities (a counting sketch follows the example below)

2. Construct the cost as:

$$J = \sum_{i,j}^{|V|} f(X_{i,j}) \bigl( v_i^{\top} v_j + b_i + b_j - \log X_{i,j} \bigr)^2$$

3. Use gradient descent to solve the optimization

"I learn machine learning in CS-3750"

Window = 2

          I   learn   machine   learning
I         0     1        1         0
learn     1     0        1         1
machine   1     1        0         2
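As a small illustration of step 1, the sketch below counts word-word co-occurrences with a symmetric sliding window; the sentence and window size mirror the example above, while plain unweighted counting is an assumption (GloVe implementations often down-weight counts by distance).

```python
# A small sketch of step 1: symmetric sliding-window word-word co-occurrence counts.
from collections import defaultdict

sentence = "I learn machine learning in CS-3750".split()
window = 2

X = defaultdict(int)                          # X[(w_i, w_j)] = co-occurrence count
for i, center in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if j != i:                            # skip the center word itself
            X[(center, sentence[j])] += 1

for pair, count in sorted(X.items()):
    print(pair, count)
```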


GloVe Cont.
How is the cost derived?

Probability that word k appears in the context of word i: $P_{ik} = \frac{X_{ik}}{X_i}$

Using word k as a probe, the "ratio" of two word pairs: $ratio_{i,j,k} = \frac{P_{ik}}{P_{jk}}$

To model the ratio with embeddings $v$: $J = \sum \bigl( ratio_{i,j,k} - g(v_i, v_j, v_k) \bigr)^2$, which is $O(N^3)$

Simplify the computation by designing $g(\cdot) = e^{(v_i - v_j)^{\top} v_k}$

Thus we are trying to make $\frac{P_{ik}}{P_{jk}} = \frac{e^{v_i^{\top} v_k}}{e^{v_j^{\top} v_k}}$

Thus we have $J = \sum \bigl( \log P_{ij} - v_i^{\top} v_j \bigr)^2$

To expand the objective $\log P_{ij} = v_i^{\top} v_j$: we have $\log X_{ij} - \log X_i = v_i^{\top} v_j$, so $\log X_{ij} = v_i^{\top} v_j + b_i + b_j$ (absorbing $\log X_i$ into the biases). By doing this, we solve the problem that $P_{ij} \neq P_{ji}$ while $v_i^{\top} v_j = v_j^{\top} v_i$.

Then we arrive at the final cost function $J = \sum_{i,j}^{|V|} f(X_{i,j}) \bigl( v_i^{\top} v_j + b_i + b_j - \log X_{i,j} \bigr)^2$, where $f(\cdot)$ is a weighting function.

Value of ratio        | j and k related | j and k not related
i and k related       |        1        |        Inf
i and k not related   |        0        |         1
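The following numpy sketch evaluates the final weighted least-squares cost on random toy data; the specific weighting function f (with x_max = 100 and alpha = 0.75, as in the GloVe paper) and all sizes are illustrative, not part of the slides.

```python
# An illustrative sketch of J = sum f(X_ij) * (v_i^T v_j + b_i + b_j - log X_ij)^2.
import numpy as np

def glove_cost(V, b, X):
    """V: (|V|, dim) embeddings, b: (|V|,) biases, X: (|V|, |V|) co-occurrence counts."""
    def f(x, x_max=100.0, alpha=0.75):
        # Weighting function from the GloVe paper: caps the influence of frequent pairs.
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    cost = 0.0
    n = X.shape[0]
    for i in range(n):
        for j in range(n):
            if X[i, j] > 0:                   # only nonzero co-occurrences contribute
                err = V[i] @ V[j] + b[i] + b[j] - np.log(X[i, j])
                cost += f(X[i, j]) * err ** 2
    return cost

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(10, 10)).astype(float)   # toy co-occurrence matrix
V = rng.normal(size=(10, 8))                           # toy embeddings
b = np.zeros(10)
print(glove_cost(V, b, X))
```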

Latent Representation
Modeling the distribution of context* for a given word through a set of latent variables, by maximizing the likelihood P(word | context)**

Usually implemented with neural networks

The learned latent variables are used as the representations of words after optimization

* Context refers to the other words from whose distribution we model the target word.
** In some models it could be P(context | word), e.g. skip-gram.


Neural Network for Language Model
Learning Objective (predicting the next word $w_j$):

Find the parameter set $\theta$ to minimize

$$L(\theta) = -\frac{1}{T} \sum_j \log P(w_j \mid w_{j-1}, \dots, w_{j-n+1}) + R(\theta)$$

where $P(\cdot) = \frac{e^{y_{w_j}}}{\sum_i e^{y_{w_i}}}$, $\quad y = b + W_{out} \tanh(d + W_{in} X)$,

and $X$ is the concatenation of the lookups for the (n-1)-word history:

$$X = [C(w_{j-1}), \dots, C(w_{j-n+1})]$$

* $(W_{out}, b)$ is the parameter set of the output layer, $(W_{in}, d)$ is the parameter set of the hidden layer

In this model we learn the parameters in $C$ ($|V| \times N$), $W_{in}$ ($(n-1) \cdot N \times$ hidden_size), and $W_{out}$ (hidden_size $\times |V|$)

Content Credit: Ahmed Magooda
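A numpy sketch of the forward pass just described (Bengio-style feedforward LM); vocabulary size, embedding size, hidden size, and the word indices are all toy values.

```python
# Forward pass of the feedforward language model: y = b + W_out tanh(d + W_in X).
import numpy as np

V, N, hidden, n = 1000, 50, 64, 4            # |V|, embedding dim, hidden size, n-gram order
rng = np.random.default_rng(0)

C = rng.normal(0, 0.1, (V, N))               # lookup table C (|V| x N)
W_in = rng.normal(0, 0.1, (hidden, (n - 1) * N)); d = np.zeros(hidden)
W_out = rng.normal(0, 0.1, (V, hidden));          b = np.zeros(V)

history = [12, 7, 99]                        # indices of w_{j-1}, ..., w_{j-n+1}
X = np.concatenate([C[w] for w in history])  # concatenated embedding lookups
y = b + W_out @ np.tanh(d + W_in @ X)        # unnormalized scores for every word in V
P = np.exp(y - y.max()); P /= P.sum()        # softmax: P(w_j | history)
print(int(P.argmax()))                       # most likely next-word index
```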

RNN for Language Model
Learning Objective: similar to the NN for LM

Changes from the feedforward NN:
◦ The hidden layer is now computed from a linear combination of the current word input $w(t)$ and the hidden state of the previous step $t-1$:

$$s(t) = f\bigl(U w(t) + W s(t-1)\bigr)$$

where $f(\cdot)$ is the activation function

Content Credit: Ahmed Magooda
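A minimal sketch of the recurrence s(t) = f(U w(t) + W s(t-1)), assuming tanh as the activation; the sizes and the toy word-index sequence are illustrative.

```python
# The RNN-LM hidden-state update: the state carries the whole history forward.
import numpy as np

V, hidden = 1000, 64
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (hidden, V))          # input-to-hidden weights
W = rng.normal(0, 0.1, (hidden, hidden))     # hidden-to-hidden weights

s = np.zeros(hidden)                         # initial hidden state
for w_idx in [12, 7, 99]:                    # a toy sequence of word indices
    w = np.zeros(V); w[w_idx] = 1.0          # one-hot input for the current word
    s = np.tanh(U @ w + W @ s)               # s(t) = f(U w(t) + W s(t-1))
print(s[:5])
```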


Continuous Bag-of-Words Model

Picture Credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf

Learning Objective: maximize the likelihood of $P(word \mid context)$ for every word in a corpus

Similar to the NN for LM, the inputs are one-hot vectors and the matrix $W$ here acts as the look-up matrix.

Differences compared to the NN for LM:
◦ Bi-directional: instead of predicting the "next" word, it predicts the center word inside a window, with words from both directions as input

◦ Significantly reduced complexity: only $2 \cdot |V| \cdot N$ parameters to learn

CBOW Cont.
Steps breakdown:

1. Generate the one-hot vectors for the context, $(x_{c-m}, \dots, x_{c-1}, x_{c+1}, \dots, x_{c+m}) \in \mathbb{R}^{|V|}$, and look up the word vectors $v_i = W^{\top} x_i$

2. Average the vectors over the context: $h_c = \frac{v_{c-m} + \dots + v_{c+m}}{2m}$

3. Compute the scores $z_c = W'^{\top} h_c$, and turn them into probabilities $\hat{y}_c = \mathrm{softmax}(z_c)$

4. Calculate the cross-entropy loss $-\sum_{i=1}^{|V|} y_i \log(\hat{y}_i)$, which estimates $P(w_c \mid w_{c-m}, \dots, w_{c+m})$ (a forward-pass sketch follows below)

Notations:
m: half window size
c: center word index

$w_i$: word i from vocabulary V
$x_i$: one-hot input of word i

$W \in \mathbb{R}^{|V| \times n}$: the context lookup matrix

$W' \in \mathbb{R}^{n \times |V|}$: the center lookup matrix


CBOW Cont.
Loss function:

For all $w_c \in V$, minimize:

$$J(\cdot) = -\log P(w_c \mid w_{c-m}, \dots, w_{c+m}) = -\log P(w'_c \mid h_c) = -\log \frac{e^{w_c'^{\top} h_c}}{\sum_{j=1}^{|V|} e^{w_j'^{\top} h_c}} = -w_c'^{\top} h_c + \log \sum_{j=1}^{|V|} e^{w_j'^{\top} h_c}$$

Optimization: use SGD to update all relevant vectors $w'_c$ and $w$

Skip-gram Model

Picture Credit: Francois Chaubard, Rohit Mundra, Richard Socher, from https://cs224d.stanford.edu/lecture_notes/notes1.pdf

Learning Objective: maximize the likelihood of $P(context \mid word)$ for every word in a corpus

Steps Breakdown:

1. Generate the one-hot vector for the center word, $x \in \mathbb{R}^{|V|}$, and calculate the embedded vector $h_c = W^{\top} x \in \mathbb{R}^n$

2. Calculate the scores $z_c = W'^{\top} h_c$
3. For each word j in the context of the center word, calculate the probabilities $\hat{y}_c = \mathrm{softmax}(z_c)$

4. We want the probabilities $\hat{y}_{c,j}$ in $\hat{y}_c$ to match the true probabilities of the context words, which are $y_{c-m}, \dots, y_{c+m}$

The cost function is constructed similarly to the CBOW model (a forward-pass sketch follows)
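A matching numpy sketch of the skip-gram steps, reusing the same toy setup as the CBOW sketch above (all sizes and indices are illustrative).

```python
# Skip-gram forward pass and cost for one center word and its window.
import numpy as np

V, n, m = 50, 16, 2
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (V, n))               # center-word lookup matrix
W_prime = rng.normal(0, 0.1, (n, V))         # output (context) matrix

center_id = 10
context_ids = [3, 8, 15, 22]                 # the 2m true context words

# 1. Embed the center word: h_c = W^T x (row lookup for the one-hot x).
h_c = W[center_id]
# 2-3. Scores and softmax over the vocabulary.
z_c = h_c @ W_prime
y_hat = np.exp(z_c - z_c.max()); y_hat /= y_hat.sum()
# 4. The cost sums -log P(context word | center word) over the window.
loss = -np.sum(np.log(y_hat[context_ids]))
print(loss)
```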


Skip-gram Cont.
Cost Function:

For every center word $w_c$ in V, minimize:

$$J(\cdot) = -\log P(w_{c-m}, \dots, w_{c+m} \mid w_c) = -\log \prod_{j=0,\, j \neq m}^{2m} P(w_{c-m+j} \mid w_c) = -\log \prod_{j=0,\, j \neq m}^{2m} P(w'_{c-m+j} \mid h_c) = -\log \prod_{j=0,\, j \neq m}^{2m} \frac{e^{w_{c-m+j}'^{\top} h_c}}{\sum_{k=1}^{|V|} e^{w_k'^{\top} h_c}}$$

Skip-gram with Negative Sampling
An alternative way of learning skip-gram:

With the previous learning method, we loop heavily over negative examples when summing over |V|

Alternatively, we can reformulate the learning objective to enable "negative sampling", where we only take a few negative samples for each training pair

Alternative Objective: maximize the likelihood of P(D=1 | w, c) if the word pair (w, c) comes from the data, and maximize the likelihood of P(D=0 | w, c) (i.e., minimize P(D=1 | w, c)) if (w, c) does not come from the data


Skip-gram with Negative Sampling
We model the probability as:

$$P(D=1 \mid w, c, \theta) = \mathrm{sigmoid}(h_c^{\top} h_w) = \frac{1}{1 + e^{-h_c^{\top} h_w}}$$

And the optimization of the objective would be:

$$\theta = \arg\max_{\theta} \prod_{(w,c) \in Data} P(D=1 \mid w, c, \theta) \prod_{(w,c) \notin Data} P(D=0 \mid w, c, \theta)$$

$$= \arg\max_{\theta} \prod_{(w,c) \in Data} P(D=1 \mid w, c, \theta) \prod_{(w,c) \notin Data} \bigl(1 - P(D=1 \mid w, c, \theta)\bigr)$$

$$= \arg\max_{\theta} \sum_{(w,c) \in Data} \log \frac{1}{1 + e^{-h_c^{\top} h_w}} + \sum_{(w,c) \notin Data} \log \Bigl(1 - \frac{1}{1 + e^{-h_c^{\top} h_w}}\Bigr)$$

$$= \arg\max_{\theta} \sum_{(w,c) \in Data} \log \frac{1}{1 + e^{-h_c^{\top} h_w}} + \sum_{(w,c) \notin Data} \log \frac{1}{1 + e^{h_c^{\top} h_w}}$$
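A small numpy sketch of this objective for one observed pair and a handful of sampled negative pairs; the vectors and the number of negatives (k = 5) are illustrative toy values.

```python
# Negative-sampling loss for one (word, context) pair plus k sampled negatives:
# minimize -log sigma(h_c . h_w) - sum_k log sigma(-h_c . h_neg).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
h_c = rng.normal(size=16)                       # context vector
h_pos = rng.normal(size=16)                     # vector of a word seen with this context
h_negs = rng.normal(size=(5, 16))               # k = 5 sampled negative word vectors

loss = -np.log(sigmoid(h_c @ h_pos))            # pull the true pair together
loss += -np.sum(np.log(sigmoid(-(h_negs @ h_c))))  # push the sampled pairs apart
print(loss)
```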

Hierarchical Softmax and FastText
Hierarchical Softmax:

An alternative way to address the dimensionality problem when computing the softmax over $y$:

1. Build a binary tree over the words in V; each non-leaf node is associated with a pseudo-output vector to learn.

2. Define the loss for word c along the path from the root to that word.

3. The probability $P(w_c \mid context)$ now becomes

$$\prod_{j=1}^{\text{path length}} \mathrm{sigmoid}\bigl(\llbracket n(w, j+1) \text{ is the child of } n(w, j) \rrbracket \cdot h_{w,j}^{\top} h_c\bigr)$$

where the indicator $\llbracket \cdot \rrbracket$ is +1 if true and -1 otherwise.

FastText: Sub-word n-grams + Hierarchical Softmax
"apple" and "apples" refer to the same semantics, yet word-level models ignore such sub-word features.

The FastText model introduces sub-word n-gram inputs while keeping an architecture similar to the skip-gram model. This expands the dimension of $y$ to an even larger number, so it adopts hierarchical softmax to speed up the computation (a small sketch of the sub-word n-grams follows).
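A small sketch of the sub-word idea: representing a word by its character n-grams with boundary markers (n between 3 and 6, as in FastText), so that "apple" and "apples" share most of their features.

```python
# Character n-gram features for a word, with "<" and ">" as boundary markers.
def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

a, b = char_ngrams("apple"), char_ngrams("apples")
print(sorted(a & b))                              # shared n-grams, e.g. '<ap', 'app', 'ppl', ...
print(len(a & b), "shared of", len(a | b))
```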


Graph-based Models
WordNet: a large lexical database of words. Words are grouped into sets of synonyms (synsets), each expressing a distinct concept.

Provides word senses, parts of speech, and semantic relationships between words

From this we can construct a real "net" of words, where words are the nodes and the relationships defined by synsets are the edges

Content Credit: Ahmed Magooda

Node2Vec
A model to learn representations of nodes; applicable to any graph, including word nets.

It turns the graph topology into sequences with a biased random-walk algorithm:

Starting from a node v, the probability of traveling to node x is:

$$P(x \mid v) = \begin{cases} \frac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases}$$

The (unnormalized) transition probability is defined as:

$$\pi_{vx} = \begin{cases} \frac{1}{p} & \text{if } x \text{ is the previous node } t \\ 1 & \text{if } x \text{ is a neighbor of } t \\ \frac{1}{q} & \text{if } x \text{ is a neighbor of a neighbor of } t \end{cases}$$

p and q control the trade-off between breadth-first-style and depth-first-style exploration.

With the sequences generated, we can embed nodes just as we embedded words in the language models above (a walk sketch follows).
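An illustrative sketch of one biased random walk using the unnormalized weights above; the toy graph and the values of p and q are assumptions made for the example.

```python
# One node2vec-style biased walk: the next step depends on the previous node t.
import random

graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
p, q = 1.0, 2.0                                   # return and in-out parameters

def biased_walk(start, length=10):
    walk = [start, random.choice(graph[start])]
    while len(walk) < length:
        prev, cur = walk[-2], walk[-1]
        weights = []
        for x in graph[cur]:
            if x == prev:                         # going back to the previous node t
                weights.append(1.0 / p)
            elif x in graph[prev]:                # x is also a neighbor of t
                weights.append(1.0)
            else:                                 # x is two hops away from t
                weights.append(1.0 / q)
        walk.append(random.choices(graph[cur], weights=weights)[0])
    return walk

print(biased_walk("a"))                           # a sequence usable like a "sentence"
```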


Evaluation of Word Representations
• Intrinsic Evaluation: How good are the representations?

• Extrinsic Evaluation: How effective are the learned representations in other downstream tasks?

Content Credit: Ahmed Magooda

Intrinsic Evaluations
Word Similarity Task
◦ Calculate the similarity of word pairs from the learned vectors using various distance metrics (e.g. Euclidean, cosine)

◦ Compare the calculated word similarities with human-annotated similarities

◦ Example test set: WordSim-353 (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/)

Analogy Task
◦ Proposed by Mikolov et al. (2013)

◦ For a specific word relation, given a, b, and y, find x so that "a is to b as x is to y"
◦ Example: "man is to king as woman is to queen"

"Analogy task" Content Credit: Ahmed Magooda


Extrinsic Evaluations
Learned word representations can be taken as inputs to encode text in downstream NLP tasks, including:
◦ Sentiment analysis, POS tagging, QA, machine translation...

◦ GLUE benchmark: a collection of datasets covering 9 sentence-level language understanding tasks (https://gluebenchmark.com/)

Summary
What we have covered:

- "words as distributions"

- evaluations of the word representations

                              | Document-level | Word-level
Count & Decomposition         | - LSA          | - GloVe
Latent Vector Representation  |                | - NN for LM
                              |                | - CBOW, Skip-gram, FastText
                              |                | - Node2Vec


Software at Fingertips
LSA: implement manually with sklearn, or use Gensim

CBOW, Skip-gram: Gensim

Neural networks: Torch, TensorFlow

Pre-trained word representations:
◦ Word2vec: (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)

◦ GloVe: (https://nlp.stanford.edu/projects/glove/)

◦ FastText: (https://fasttext.cc/)
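For example, a minimal Gensim sketch (assuming the 4.x API) for training CBOW and skip-gram on a toy corpus; all parameters are illustrative.

```python
# Training CBOW (sg=0) and skip-gram (sg=1) models with gensim on a toy corpus.
from gensim.models import Word2Vec

sentences = [["i", "learn", "machine", "learning"],
             ["word", "models", "for", "short", "documents"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, sg=0, min_count=1)
skipgram = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
print(skipgram.wv["learning"][:5])            # learned vector for "learning"
```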

References
Word2Vec:

• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space.

• Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality.

• Mikolov, T., Yih, W. T., & Zweig, G. (2013, June). Linguistic regularities in continuous space word representations.

• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

Counting:

• Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).

• Gabrilovich, E., & Markovitch, S. (2007, January). Computing semantic relatedness using Wikipedia-based explicit semantic analysis.

• Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis.

NN for LM:

• Mikolov, T., Kopecky, J., Burget, L., & Glembek, O. (2009, April). Neural network based language models for highly inflective languages.

• Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model.

Graph-based:

• Princeton University. "About WordNet." WordNet. Princeton University. 2010. http://wordnet.princeton.edu

• Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855-864).