Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 1232–1242, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics


Incorporating Latent Meanings of Morphological Compositions to Enhance Word Embeddings

Yang Xu†, Jiawei Liu†, Wei Yang‡∗, and Liusheng Huang‡
School of Computer Science and Technology,
University of Science and Technology of China, Hefei, 230027, China
†{smallant, ustcljw}@mail.ustc.edu.cn
‡{qubit, lshuang}@ustc.edu.cn

∗This is the corresponding author.

Abstract

Traditional word embedding approaches learn semantic information at the word level while ignoring the meaningful internal structures of words such as morphemes. Furthermore, existing morphology-based models directly incorporate morphemes to train word embeddings, but still neglect the latent meanings of morphemes. In this paper, we explore employing the latent meanings of the morphological compositions of words to train and enhance word embeddings. To this end, we propose three Latent Meaning Models (LMMs), named LMM-A, LMM-S and LMM-M respectively, which adopt different strategies to incorporate the latent meanings of morphemes during the training process. Experiments on word similarity, syntactic analogy and text classification are conducted to validate the feasibility of our models. The results demonstrate that our models outperform the baselines on five word similarity datasets. On the Wordsim-353 and RG-65 datasets, our models achieve nearly 5% and 7% gains over the classic CBOW model, respectively. For the syntactic analogy and text classification tasks, our models also surpass all the baselines, including a morphology-based model.

1 Introduction

Word embedding, which is also termed distributed word representation, has been a hot topic in the area of Natural Language Processing (NLP). The derived word embeddings have been used in plenty of tasks such as text classification (Liu et al., 2015), information retrieval (Manning et al., 2008), sentiment analysis (Shin et al., 2016), machine translation (Cho et al., 2014) and so on. Recently, some classic word embedding methods have been proposed, like Continuous Bag-of-Words (CBOW), Skip-gram (Mikolov et al., 2013a), and Global Vectors (GloVe) (Pennington et al., 2014). These methods can usually capture word-level semantic information but ignore the meaningful inner structures of words like English morphemes or Chinese characters.

The effectiveness of exploiting the internal compositions of words has been validated by some previous work (Luong et al., 2013; Botha and Blunsom, 2014; Chen et al., 2015; Cotterell et al., 2016). Some of these methods compute the word embeddings by directly adding the representations of morphemes/characters to context words or by optimizing a joint objective over distributional statistics and morphological properties (Qiu et al., 2014; Botha and Blunsom, 2014; Chen et al., 2015; Luong et al., 2013; Lazaridou et al., 2013), while others introduce probabilistic graphical models to build relationships between words and their internal compositions. For example, Bhatia et al. (2016) treat word embeddings as latent variables for a prior distribution, which reflects words' morphological properties, and feed the latent variables into a neural sequence model to obtain the final word embeddings. Cotterell et al. (2016) construct a Gaussian graphical model that binds the morphological analysis to pre-trained word embeddings, which can help to smooth the noisy embeddings. Besides, these two methods also have the ability to predict embeddings for unseen words.

Different from all the above models (we regard them as Explicit models in Fig. 1), where internal compositions are directly used to encode morphological regularities into words and the composition embeddings (like morpheme embeddings) are generated as by-products, we explore a new way to employ the latent meanings of morphological compositions rather than the compositions themselves to train word embeddings. As shown in Fig. 1, according to the distributional semantics hypothesis (Sahlgren, 2008), incredible and unbelievable probably have similar word embeddings because they have similar contexts. As a matter of fact, incredible is a synonym of unbelievable and their embeddings are expected to be close enough. Since the morphemes of the two words are different, especially the roots cred and believ, the explicit models may not significantly shorten the distance between the words in the vector space. Fortunately, the latent meanings of the different morphemes are the same (e.g., the latent meanings of the roots cred and believ are both "believe"), as listed in the lookup table (derived from the resources provided by Michigan State University),¹ which evidently implies that incredible and unbelievable share the same meanings. In addition, by replacing morphemes with their latent meanings, we can directly and simply quantize the similarities between words and their sub-compositions with the same metrics used in most NLP tasks, e.g., cosine similarity. Subsequently, the similarities are utilized to calculate the weights of the latent meanings of morphemes for each word.

¹https://msu.edu/~defores1/gre/roots/gre_rts_afx1.htm

[Figure 1 shows two example sentences whose words incredible and unbelievable are segmented into the morphemes in-cred-ible and un-believ-able, together with a lookup table mapping the prefixes in/un to "not", the roots cred/believ to "believe", and the suffixes ible/able to "able, capable". Explicit models directly use the morphemes, while our models employ the latent meanings of the morphemes.]

Figure 1: An illustration of explicit models and our models in an English corpus. Although incredible and unbelievable have different morphemes, their morphemes have the same latent meanings.


In this paper, we try different strategies to modify the input layer and update rules of a neural language model, e.g., CBOW or Skip-gram, and propose three lightweight and efficient models, termed Latent Meaning Models (LMMs), to not only encode morphological properties into words but also enhance the semantic similarities among word embeddings. Usually, the vocabulary derived from the corpus contains the vast majority or even all of the latent meanings. Rather than generating and training extra embeddings for latent meanings, we directly override the embeddings of the corresponding words in the vocabulary. Moreover, a word map is created to describe the relations between words and the latent meanings of their morphemes.
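For concreteness, a word-map entry can be thought of as a plain mapping from a word to its morphemes' latent meanings, grouped by morpheme type. The snippet below is only an illustrative sketch (the variable name word_map and the grouping into prefix/root/suffix lists are our own rendering of the "incredible" example from Fig. 1, not code from the paper).

```python
# A hypothetical word-map entry mirroring the "incredible" example in Fig. 1.
# Each latent meaning is itself a word from the vocabulary, so no extra
# embeddings need to be created for them.
word_map = {
    "incredible": {
        "prefix": ["in", "not"],          # latent meanings of the prefix in-
        "root":   ["believe"],            # latent meaning of the root cred
        "suffix": ["able", "capable"],    # latent meanings of the suffix -ible
    },
}
```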

For comparison, our models together with the state-of-the-art baselines are tested on two basic NLP tasks, which are word similarity and syntactic analogy, and one downstream text classification task. The results show that LMMs outperform the baselines and obtain satisfactory improvements on these tasks. In all, the main contributions of this paper are summarized as follows.

• Rather than directly incorporating the morphological compositions (surface forms) of words, we decide to employ the latent meanings of the compositions (underlying forms) to train the word embeddings. To validate the feasibility of our purpose, three specific models, named LMMs, are proposed with different strategies to incorporate the latent meanings.


• We utilize a medium-sized English corpus to train LMMs and the state-of-the-art baselines, and evaluate their performance on two basic NLP tasks, i.e., word similarity and syntactic analogy, and one downstream text classification task. The results show that LMMs outperform the baselines on five word similarity datasets. On the gold standard Wordsim-353 and RG-65, LMMs approximately achieve 5% and 7% gains over CBOW, respectively. For the syntactic analogy and text classification tasks, LMMs also surpass all the baselines.

• We conduct experiments to analyze the impacts of parameter settings, and the results demonstrate that the performance of LMMs on the smallest corpus is similar to the performance of CBOW on a corpus five times as large, which convinces us that LMMs have great advantages in enhancing word embeddings compared with traditional methods.

2 Background and Related Work

Considering the high efficiency of CBOW proposed by Mikolov et al. (2013a), our LMMs are built upon CBOW. Here, we first review the background of CBOW, and then present some related work on recent word-level and morphology-based word embedding methods.

CBOW with Negative Sampling With a sliding window, CBOW utilizes the context words in the window to predict the target word. Given a sequence of tokens $T = \{t_1, t_2, \cdots, t_n\}$, where $n$ is the size of the training corpus, the objective of CBOW is to maximize the following average log probability:

$$L = \frac{1}{n}\sum_{i=1}^{n} \log p\big(t_i \mid context(t_i)\big), \qquad (1)$$

where $context(t_i)$ represents the context words of $t_i$ in the sliding window, and $p(t_i \mid context(t_i))$ is derived by softmax. Due to the huge size of the English vocabulary, $p(t_i \mid context(t_i))$ cannot be calculated in a tolerable time. Therefore, negative sampling and hierarchical softmax were proposed to solve this problem. Owing to the efficiency of negative sampling, all our models are trained based on it. In terms of negative sampling, the log probability $\log p(t_O \mid t_I)$ is transformed as:

$$\log \delta\big(vec'(t_O)^{\top} vec(t_I)\big) + \sum_{i=1}^{m} \log\Big[1 - \delta\big(vec'(t_i)^{\top} vec(t_I)\big)\Big], \qquad (2)$$

where $m$ denotes the number of negative samples, and $\delta(\cdot)$ is the sigmoid function. The first term of Eq. (2) is the probability of the target word when its context is given. The second term indicates the probability that the negative samples do not share the same context as the target word.
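As a concrete illustration of Eq. (2), the following sketch computes the negative-sampling log probability for one training example. It is a minimal NumPy rendering of the formula, not the word2vec implementation; the function name and argument layout are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_logprob(v_input, v_target_out, v_neg_out):
    """Eq. (2): v_input is vec(t_I) (the input-side context representation),
    v_target_out is vec'(t_O), and v_neg_out stacks the output vectors of the
    m negative samples, shape (m, d)."""
    positive = np.log(sigmoid(v_target_out @ v_input))
    negative = np.sum(np.log(1.0 - sigmoid(v_neg_out @ v_input)))
    return positive + negative
```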

Word-level Word Embedding In general, word embedding models can mainly be divided into two branches. One is based on neural networks, like the classic CBOW model (Mikolov et al., 2013a), while the other is based on matrix factorization. Besides CBOW, Skip-gram (Mikolov et al., 2013a) is another widely used neural-network-based model, which predicts the context by using the target word. As for matrix factorization, Dhillon et al. (2015) proposed a spectral word embedding method to measure the correlation between the word information matrix and the context information matrix. In order to combine the advantages of models based on neural networks and matrix factorization, Pennington et al. (2014) proposed a famous word embedding model named GloVe, which is reported to outperform the CBOW and Skip-gram models on some tasks. These models are effective at capturing word-level semantic information while neglecting the inner structures of words. In contrast, the unheeded inner structures are utilized in both our LMMs and other morphology-based models.

Morphology-based Word Embedding Recently, some more fine-grained word embedding models have been proposed by exploiting the morphological compositions of words, e.g., roots and affixes. These morphology-based models can be divided into two main categories.

The first category directly adds the representations of internal structures to word embeddings or optimizes a joint objective over distributional statistics and morphological properties (Luong et al., 2013; Qiu et al., 2014; Botha and Blunsom, 2014; Lazaridou et al., 2013; Chen et al., 2015; Kim et al., 2016; Cotterell and Schutze, 2015). Chen et al. (2015) proposed a character-enhanced Chinese word embedding model, which splits a Chinese word into several characters and adds the characters into the input layer of their models.


Luong et al. (2013) utilized the morpheme segments produced by Morfessor (Creutz and Lagus, 2007) and constructed morpheme trees for words to learn morphologically-aware word embeddings with a recursive neural network. Kim et al. (2016) incorporated convolutional character information into English words. Their model can learn character-level semantic information for embeddings, which is proved to be effective for some morpheme-rich languages. However, with its huge architecture, it is very time-consuming. Cotterell and Schutze (2015) augmented the log-linear model to make words that share similar morphemes gather together in the vector space.

The other category tries to use probabilistic graphical models to connect words with their morphological compositions, and further learns word embeddings (Bhatia et al., 2016; Cotterell et al., 2016). Bhatia et al. (2016) employed morphemes as prior knowledge for the latent word embeddings, then fed the latent variables into a neural sequence model to obtain the final word embeddings. Cotterell et al. (2016) proposed a morpheme-based post-processor for pre-trained word embeddings. They constructed a Gaussian graphical model which can extrapolate continuous representations for unknown words.

However, these morphology-based models directly exploit the internal compositions of words to encode morphological regularities into word embeddings, and some by-products such as morpheme embeddings are also produced. In contrast, we employ the latent meanings of morphological compositions to provide deeper insights for training better word embeddings. Furthermore, since the latent meanings are included in the vocabulary, no extra embeddings are generated.

3 Our Latent Meaning Models

We leverage different strategies to modify the input layer and update rules of CBOW when incorporating the latent meanings of morphemes. Three specific models, named Latent Meaning Model-Average (LMM-A), LMM-Similarity (LMM-S) and LMM-Max (LMM-M), are proposed. It should be stated that, for now, our models mainly concern the derivational morphemes, which can be interpreted as meaningful words or phrases (i.e., latent meanings), not the inflectional morphemes like tense, number, gender, etc.

Figure 2: A paradigm of LMM-A. The sentence "it is an incredible thing" is selected as an example. When calculating the input vector of "incredible", we first find out the latent meanings of its morphemes in the word map, and add the vectors of all latent meanings to the vector of "incredible" with equal weights.

LMM-A assumes that all latent meanings of the morphemes of a word have equal contributions to the word. LMM-A is applicable to the condition where words are correctly segmented into morphemes and each morpheme is interpreted with appropriate latent meanings. However, refining the latent meanings for morphemes is time-consuming and needs vast human annotations. To address this concern, LMM-S is proposed. Motivated by the attention scheme, LMM-S holds the assumption that all latent meanings have different contributions, and assigns the outliers small weights to let them have little impact on the representation of the target word. Furthermore, in LMM-M, we only keep the latent meanings which have the greatest contributions to the corresponding word. In what follows, we introduce each of our LMMs in detail. At the end of this section, we introduce the update rules of the models.

3.1 LMM-A

Given a sequence of tokens $T = \{t_1, t_2, \cdots, t_n\}$, LMM-A assumes that the latent meanings of the morphemes of token $t_i$ ($i \in [1, n]$) have equal contributions to $t_i$, as shown in Fig. 2. The item for $t_i$ in the word map is $t_i \mapsto M_i$. $M_i$ is the set of latent meanings of $t_i$'s morphemes, and it consists of three sub-parts $P_i$, $R_i$ and $S_i$ corresponding to the latent meanings of the prefixes, roots and suffixes of $t_i$, respectively. Hence, at the input layer, the modified embedding of $t_i$ can be expressed as

$$\hat{v}_{t_i} = \frac{1}{2}\Big(v_{t_i} + \frac{1}{N_i}\sum_{w \in M_i} v_w\Big), \qquad (3)$$

where $v_{t_i}$ is the original word embedding of $t_i$, $N_i$ denotes the size of $M_i$, and $v_w$ indicates the embedding of latent meaning $w$. Meanwhile, we assume the original word embedding and the average embedding of the $v_w$ ($w \in M_i$) have equal weights, i.e., 0.5. Eventually, $\hat{v}_{t_i}$ rather than $v_{t_i}$ is utilized for training in CBOW.
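The LMM-A modification of the input layer can be sketched in a few lines. This is a simplified illustration under our own naming (lmm_a_input, latent_vecs); it assumes the latent-meaning vectors have already been looked up from the vocabulary via the word map.

```python
import numpy as np

def lmm_a_input(v_word, latent_vecs):
    """Eq. (3): average the original word vector with the unweighted mean of
    the latent-meaning vectors of the word's morphemes."""
    if len(latent_vecs) == 0:        # word without morpheme information
        return v_word
    mean_latent = np.mean(latent_vecs, axis=0)
    return 0.5 * (v_word + mean_latent)
```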

3.2 LMM-S

This model is proposed based on the attention scheme. We observe that many morphemes have more than one latent meaning. For instance, prefix in- means "in" and "not", and suffix -ible means "able" and "capable".² As Fig. 3 shows, for the item incredible ↦ {[in, not], [believe], [able, capable]} in the word map, the latent meanings have different biases towards "incredible". Therefore, we assign different weights to the latent meanings. We measure the weights of latent meanings by calculating the normalized similarities between token $t_i$ and the corresponding latent meanings. For LMM-S, the modified embedding of $t_i$ can be rewritten as

$$\hat{v}_{t_i} = \frac{1}{2}\Big[v_{t_i} + \sum_{w \in M_i} \omega_{\langle t_i, w\rangle} \cdot v_w\Big], \qquad (4)$$

where $v_{t_i}$ is the original vector of $t_i$, and $\omega_{\langle t_i, w\rangle}$ denotes the weight between $t_i$ and the latent meaning $w$ ($w \in M_i$). We use $\cos(v_a, v_b)$ to denote the cosine similarity between $v_a$ and $v_b$; then $\omega_{\langle t_i, w\rangle}$ is expressed as follows:

$$\omega_{\langle t_i, w\rangle} = \frac{\cos(v_{t_i}, v_w)}{\sum_{x \in M_i} \cos(v_{t_i}, v_x)}. \qquad (5)$$

²All the latent meanings of roots and affixes are taken from the resources we mentioned before.

Figure 3: A paradigm of LMM-S. In this model, all latent meanings of morphemes of "incredible" are added together with different weights.
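A compact sketch of the LMM-S input computation follows. It is our own illustrative code (the helper names cos and lmm_s_input are not from the paper), and it assumes the cosine similarities are positive so that the normalization of Eq. (5) is well defined.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def lmm_s_input(v_word, latent_vecs):
    """Eqs. (4)-(5): combine the latent-meaning vectors with weights
    proportional to their cosine similarity to the word, then average the
    weighted sum with the original word vector."""
    if len(latent_vecs) == 0:
        return v_word
    sims = np.array([cos(v_word, v) for v in latent_vecs])
    weights = sims / sims.sum()                     # normalized weights, Eq. (5)
    weighted = sum(w * v for w, v in zip(weights, latent_vecs))
    return 0.5 * (v_word + weighted)
```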

3.3 LMM-M

To further eliminate the impacts of some uncorrelated latent meanings on a word, in LMM-M we only select the latent meanings that have maximum similarities to the token $t_i$ from $P_i$, $R_i$ and $S_i$. As shown in Fig. 4, the latent meaning "not" of the prefix in is finally selected since the similarity between "not" and "incredible" is larger than that between "in" and "incredible". For token $t_i$, LMM-M is mathematically defined as

$$\hat{v}_{t_i} = \frac{1}{2}\Big[v_{t_i} + \sum_{w \in M^i_{\max}} \omega_{\langle t_i, w\rangle} \cdot v_w\Big], \qquad (6)$$

where $M^i_{\max} = \{P^i_{\max}, R^i_{\max}, S^i_{\max}\}$ is the set of latent meanings with maximum similarities towards token $t_i$, and $P^i_{\max}$, $R^i_{\max}$, $S^i_{\max}$ are obtained by the following equations:

$$P^i_{\max} = \arg\max_{w} \cos(v_{t_i}, v_w), \quad w \in P_i,$$
$$R^i_{\max} = \arg\max_{w} \cos(v_{t_i}, v_w), \quad w \in R_i, \qquad (7)$$
$$S^i_{\max} = \arg\max_{w} \cos(v_{t_i}, v_w), \quad w \in S_i.$$

The normalized weight $\omega_{\langle t_i, w\rangle}$ ($w \in M^i_{\max}$) can be derived similarly to Eq. (5).

Figure 4: A paradigm of LMM-M. The latent meanings with maximum similarities towards "incredible" are selected.
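The selection step of Eqs. (6)-(7) can be sketched as below. Again this is an illustrative rendering under our own names (lmm_m_input), assuming the latent-meaning vectors are passed in already grouped by prefix, root and suffix.

```python
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def lmm_m_input(v_word, prefix_vecs, root_vecs, suffix_vecs):
    """Eqs. (6)-(7): keep only the most similar latent meaning from each of
    P_i, R_i, S_i, then combine the selected vectors with normalized cosine
    weights as in Eq. (5) and average with the original word vector."""
    selected = []
    for vecs in (prefix_vecs, root_vecs, suffix_vecs):
        if len(vecs) > 0:
            selected.append(max(vecs, key=lambda v: _cos(v_word, v)))  # argmax
    if not selected:
        return v_word
    sims = np.array([_cos(v_word, v) for v in selected])
    weights = sims / sims.sum()
    return 0.5 * (v_word + sum(w * v for w, v in zip(weights, selected)))
```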


3.4 Update Rules for LMMs

After modifying the input layer of CBOW, Eq. (1) can be rewritten as

$$L = \frac{1}{n}\sum_{i=1}^{n} \log p\Big(v_{t_i} \,\Big|\, \sum_{t_j \in context(t_i)} \hat{v}_{t_j}\Big), \qquad (8)$$

where $\hat{v}_{t_j}$ is the modified vector of $v_{t_j}$ ($t_j \in context(t_i)$). Since the word map describes top-level relations between words and the latent meanings, these relations don't change during the training period. All parameters introduced by our models can be directly derived using the word map and the word vectors, thus no extra parameters need to be trained. When the gradient is propagated back to the input layer, we update not just the word vector $v_{t_j}$ ($t_j \in context(t_i)$) but also the vectors of the latent meanings in the vocabulary, with the same weights as they are added to the vector $\hat{v}_{t_j}$.
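The update rule can be pictured as distributing the input-side gradient over the vectors that formed the modified embedding. The sketch below is our own simplified interpretation (function name update_input_side, plain gradient-ascent step), not the authors' word2vec patch; weights are the per-latent-meaning coefficients of Eq. (3), (4) or (6) (e.g., 1/N_i for LMM-A), and the global 1/2 factor is applied separately.

```python
def update_input_side(grad, v_word, latent_vecs, weights, lr=0.025):
    """Distribute the gradient received by the modified input vector:
    the context word vector gets the 1/2 weight of the original embedding,
    and every latent-meaning vector (itself a vocabulary word vector) gets
    1/2 times the weight it contributed with at the input layer."""
    v_word += lr * 0.5 * grad
    for w, v in zip(weights, latent_vecs):
        v += lr * 0.5 * w * grad
```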

4 Experimental Setup

Before conducting the experiments, some experimental settings are first introduced in this section.

4.1 Corpus and Word Map

We utilize a medium-sized English corpus to train all word embedding models. The corpus stems from the website of the 2013 ACL Workshop on Machine Translation³ and is used in (Kim et al., 2016). We choose the news corpus of 2009, whose size is about 1.7GB. It contains approximately 500 million tokens and 600,000 words in the vocabulary. To get better quality of the word embeddings, we filter all digits and some punctuation marks out of the corpus.

³http://www.statmt.org/wmt13/translation-task.html

For many languages, there exist large morphological lexicons or morphological tools that can analyze any word form (Cotterell and Schutze, 2015). To create the word map, we need to obtain the morphemes of each word and interpret them with the lookup table mentioned above to get the latent meanings. Usually, the lookup table can also be derived from the morphological lexicons of different languages; although this costs some time and manpower, the lookup table only needs to be created once, since it represents common knowledge about a certain language. Specifically, we first perform an unsupervised morpheme segmentation of the vocabulary using Morfessor (Creutz and Lagus, 2007). Then we match the segmentation results against the morphological compositions in the lookup table, and the character sequence with the largest overlap ratio is viewed as a final morpheme and is further replaced by its latent meanings. Although the lookup table employed in this paper contains latent meanings for only 90 prefixes, 382 roots and 67 suffixes, we focus on validating the feasibility of enhancing word embeddings with the latent meanings of morphemes, and expanding the lookup table is left as future work.
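The matching step described above can be sketched as follows. The paper does not give the exact overlap formula, so the overlap_ratio definition below (longest common substring length over morpheme length) and the threshold are our assumptions, and build_word_map_entry is a hypothetical helper name.

```python
def overlap_ratio(segment, morpheme):
    """Assumed overlap measure: longest common substring length divided by
    the lookup-table morpheme's length."""
    best = 0
    for i in range(len(segment)):
        for j in range(i + 1, len(segment) + 1):
            if segment[i:j] in morpheme:
                best = max(best, j - i)
    return best / max(len(morpheme), 1)

def build_word_map_entry(segments, tables, threshold=0.5):
    """tables: {"prefix": {...}, "root": {...}, "suffix": {...}}, each mapping a
    morpheme to its list of latent meanings. For every Morfessor segment, keep
    the latent meanings of the best-matching lookup-table morpheme."""
    entry = {"prefix": [], "root": [], "suffix": []}
    for seg in segments:
        best_kind, best_meanings, best_score = None, None, threshold
        for kind, table in tables.items():
            for morpheme, meanings in table.items():
                score = overlap_ratio(seg, morpheme)
                if score > best_score:
                    best_kind, best_meanings, best_score = kind, meanings, score
        if best_kind is not None:
            entry[best_kind].extend(best_meanings)
    return entry
```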

4.2 Baselines

For comparison, we choose three word-level state-of-the-art word embedding models, namely CBOW, Skip-gram (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014), and we also implement an Explicitly Morpheme-related Model (EMM), which is a variant of the previous work (Qiu et al., 2014). The architecture of EMM is based on our LMM-A, where the latent meanings are replaced back by the morphemes and the embeddings of the morphemes are also learned when training the word embeddings. This enables our evaluation to focus on the critical difference between our models and the explicit model (Bhatia et al., 2016). We utilize the source code of word2vec⁴ to train CBOW and Skip-gram. GloVe is trained based on the code⁵ provided by Pennington et al. (2014). We modify the source code of word2vec to train our models and EMM.

4.3 Parameter Settings

Parameter settings have a great effect on the performance of word embeddings (Levy et al., 2015). For fairness, all models are trained with equal parameter settings. In order to accelerate the training process, CBOW, Skip-gram and EMM together with our models are trained using the negative sampling technique. It is suggested that a number of negative samples in the range 5-20 is useful for small corpora (Mikolov et al., 2013b). If a large corpus is used, the number of negative samples can be as small as 2-5. According to the size of the corpus we use, the number of negative samples is empirically set to 20 in this paper.

⁴https://github.com/dav/word2vec
⁵http://nlp.stanford.edu/projects/glove


Name          Pairs   Name          Pairs
RG-65            65   RW             2034
SCWS           2003   Men-3k         3000
Wordsim-353     353   WS-353-REL      252

Table 1: Details of datasets. The column "Pairs" shows the number of word pairs in each dataset.

The dimension of the word embeddings is set to 200, like that in (Dhillon et al., 2015). We set the context window size to 5, which is equal to the setting in (Mikolov et al., 2013b).

4.4 Evaluation Benchmarks

4.4.1 Word Similarity

This experiment is conducted to evaluate the ability of word embeddings to capture semantic information from the corpus. For English word similarity, we employ two gold standard datasets, Wordsim-353 (Finkelstein et al., 2001) and RG-65 (Rubenstein and Goodenough, 1965), as well as some other widely-used datasets including Rare-Word (Luong et al., 2013), SCWS (Huang et al., 2012), Men-3k (Bruni et al., 2014) and WS-353-Related (Agirre et al., 2009). More details of these datasets are shown in Table 1. Each dataset consists of three columns: the first two columns stand for word pairs and the last column is the human score. We utilize the cosine similarity, which is used in many previous works (Mikolov et al., 2013b; Pennington et al., 2014), as the metric to measure the distance between two words. The Spearman's rank correlation coefficient (ρ) is employed to evaluate the similarity between our results and the human scores. A higher ρ means better performance.
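The evaluation protocol can be summarized in a few lines; this sketch is our own (evaluate_word_similarity is not code from the paper) and simply pairs cosine similarities with human scores and reports Spearman's ρ, skipping out-of-vocabulary pairs.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_word_similarity(pairs, embeddings):
    """pairs: iterable of (word1, word2, human_score);
    embeddings: dict mapping word -> vector. Returns Spearman's rho."""
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            a, b = embeddings[w1], embeddings[w2]
            model_scores.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            human_scores.append(score)
    return spearmanr(model_scores, human_scores).correlation
```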

4.4.2 Syntactic Analogy

Based on the learned word embeddings, the core task of syntactic analogy is to answer the analogy question "a is to b as c is to __". We utilize the Microsoft Research Syntactic Analogies dataset, which was created by Mikolov et al. (2013c) and contains 8,000 questions. To answer the syntactic analogy question "a is to b as c is to d" where d is unknown, we assume that the word representations of a, b, c, d are va, vb, vc, vd, respectively. To get d, we first calculate vd = vb − va + vc. Then, we find the word d′ whose cosine similarity to vd is the largest. Finally, we set d as d′.
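The analogy procedure amounts to a nearest-neighbor search around vb − va + vc; the sketch below is an illustrative implementation (the name answer_analogy and the choice to exclude the three query words are our own assumptions).

```python
import numpy as np

def answer_analogy(a, b, c, embeddings):
    """Answer "a is to b as c is to ?" by returning the vocabulary word whose
    vector has the largest cosine similarity to v_b - v_a + v_c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):          # commonly excluded from the candidates
            continue
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```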

4.4.3 Text Classification

To further evaluate the learned word embeddings, we also conduct 4 text classification tasks using the 20 Newsgroups dataset.⁶ The dataset contains around 19,000 documents from 20 different newsgroups, each corresponding to a different topic, such as guns, motorcycles, electronics and so on. For each task, we randomly select the documents of 10 topics and split them into training/validation/test subsets at the ratio of 6:2:2, which are employed to train, validate and test an L2-regularized 10-category logistic regression (LR) classifier. As mentioned in (Tsvetkov et al., 2015), here we also regard the average word embedding of the words (excluding stop words and out-of-vocabulary words) in each document as the feature vector (the input of the classifier) of that document. The LR classifier is implemented with the scikit-learn toolkit (Pedregosa et al., 2011), which is an open-source Python module integrating many state-of-the-art machine learning algorithms.

⁶http://qwone.com/~jason/20Newsgroups
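A minimal version of this pipeline is sketched below. It is our own illustration of the described setup (averaged embeddings fed to scikit-learn's L2-regularized LogisticRegression); names such as document_feature are hypothetical, and hyperparameters are left at library defaults.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def document_feature(tokens, embeddings, stop_words, dim):
    """Average embedding of in-vocabulary, non-stop-word tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings and t not in stop_words]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_and_score(train_docs, y_train, test_docs, y_test, embeddings, stop_words, dim=200):
    X_train = np.stack([document_feature(d, embeddings, stop_words, dim) for d in train_docs])
    X_test = np.stack([document_feature(d, embeddings, stop_words, dim) for d in test_docs])
    clf = LogisticRegression(penalty="l2", max_iter=1000)   # L2-regularized LR
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)                        # classification accuracy
```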

5 Experimental Results

5.1 The Results on Word Similarity

Word similarity is conducted to test the semantic information which is encoded in word embeddings, and the results are listed in Table 2 (first 6 rows). We observe that our models surpass the comparative baselines on five datasets. Compared with the base model CBOW, it is remarkable that our models approximately achieve improvements of more than 5% and 7%, respectively, on the gold standard Wordsim-353 and RG-65. On WS-353-REL, the difference between CBOW and LMM-S even reaches 8%. This advantage demonstrates the effectiveness of our methods. Based on our strategy, more semantic information will be captured from the corpus when more latent meanings are added in the context window. By incorporating morphemes, EMM also performs better than the other baselines but fails to perform as well as ours. Actually, EMM mainly tunes the distributions of words in the vector space to let the morpheme-similar words gather closer, which means it just encodes more morphological properties into word embeddings but lacks the ability to capture more semantic information.



                      CBOW   Skip-gram   GloVe    EMM   LMM-A   LMM-S   LMM-M
Wordsim-353          58.77       61.94   49.40  60.01   62.05   63.13   61.54
RW                   40.58       36.42   33.40  40.83   43.12   42.14   40.51
RG-65                56.50       62.81   59.92  60.85   62.51   62.49   63.07
SCWS                 63.13       60.20   47.98  60.28   61.86   61.71   63.02
Men-3k               68.07       66.30   60.56  66.76   66.26   68.36   64.65
WS-353-REL           49.72       57.05   47.46  54.48   56.14   58.47   55.19
Syntactic Analogy    13.46       13.14   13.94  17.34   20.38   17.59   18.30
Text Classification  78.26       79.40   77.01  80.00   80.67   80.59   81.28

Table 2: Performance comparison (%) of our LMMs and the baselines on two basic NLP tasks (word similarity & syntactic analogy) and one downstream task (text classification). The bold digits indicate the best performances.

In particular, because of the medium-sized corpus and the experimental settings, GloVe doesn't perform as well as described in (Pennington et al., 2014).

5.2 The Results on Syntactic Analogy

In (Mikolov et al., 2013c), the dataset is divided into adjectives, nouns and verbs. For brevity, we only report performance on the whole dataset. As the middle row of Table 2 shows, all of our models outperform the comparative baselines to a great extent. Compared with CBOW, the advantage of LMM-A even reaches 7%. Besides, we observe that the suffix of "b" is usually the same as the suffix of "d" when answering the question "a is to b as c is to d". Based on our strategy, morpheme-similar words will not only gather closer but also have a trend to group near the latent meanings of their morphemes, which gives our embeddings an advantage in dealing with the syntactic analogy problem. EMM also performs well on this task but is still weaker than our models. Actually, syntactic analogy is also a semantics-related task because "c" and "d" have similar meanings. Since our models are better at capturing semantic information, they lead to higher performance than the explicitly morphology-based models.

5.3 The Results on Text Classification

For each of the 4 text classification tasks, we report the classification accuracy over the test set. The average classification accuracy across the 4 tasks is utilized as the evaluation metric for the different models. The results are displayed in the bottom row of Table 2. Since we simply use the average embedding of words as the feature vector for the 10-category classification, the overall classification accuracies of all models are merely around 80%. However, the classification accuracies of our LMMs still surpass all the baselines, especially CBOW and GloVe.

Moreover, it can be found that incorporating morphological knowledge (morphemes or latent meanings of morphemes) into word embeddings can contribute to enhancing the performance of word embeddings in downstream NLP tasks.

5.4 The Impacts of Parameter Settings

Parameter settings can affect the performance of word embeddings. For example, a larger corpus size (the ratio of tokens used for training) contains more semantic information, which can improve the performance on word similarity. We analyze the impacts of corpus size and window size on the performance of word embeddings. In the analysis of corpus size, we hold the same parameter settings as before. The sizes of tokens used for training are separately 1/5, 2/5, 3/5, 4/5 and 5/5 of the entire corpus mentioned above. We utilize the result of word similarity on Wordsim-353 as the evaluation criterion. From Fig. 5, we observe several phenomena. Firstly, the performance of our LMMs is better than CBOW at each corpus size. Secondly, the performance of CBOW is sensitive to the corpus size. In contrast, LMMs' performance is more stable than CBOW's. As we analyzed in the word similarity experiment, LMMs can increase the semantic information of word embeddings. It is worth noting that the performance of LMMs on the smallest corpus is even better than CBOW's performance on the largest corpus. In the analysis of window size, we observe that the performance of all word embeddings trained by the different models has a trend to ascend with increasing window size, as illustrated in Fig. 6. Our LMMs outperform CBOW under all the pre-set conditions. Besides, the worst performance of LMMs is nearly equal to the best performance of CBOW.


Figure 5: Parameter analysis of corpus size. X-axis denotes the ratio of tokens used for training, and Y-axis denotes the Spearman rank (%) of word similarity.

Figure 6: Parameter analysis of window size. X-axis and Y-axis denote the window size and Spearman rank (%) of word similarity, respectively.

5.5 Word Embedding Visualization

To visualize the embeddings of our models, we randomly select several words from the results of LMM-A. The dimensions of the selected word embeddings are reduced from 200 to 2 using Principal Component Analysis (PCA), and the 2-D word embeddings are illustrated in Fig. 7. The words with different colors reflect that they have different morphemes. It is apparent that words with similar morphemes have a trend to group together and stay near the latent meanings of their morphemes. In addition, we can also find some syntactic regularities in Fig. 7; for example, "physics" is to "physicist" as "science" is to "scientist", and "physicist" and "scientist" stay near the latent meaning, i.e., "human", of the suffix -ist.

Figure 7: The visualization of word embeddings. Based on PCA, we randomly select several words (e.g., anthropologist, biologist, physicist, scientist, microscope, premier, preview, agreeable, edible, visible, physics, science) from the word embeddings of LMM-A and illustrate them in this figure; the highlighted markers indicate the latent meanings of morphemes (here "human", "small" and "before").
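The 2-D projection behind Fig. 7 is a standard PCA reduction; the sketch below (our own helper name project_2d, using scikit-learn's PCA) shows the step from 200-dimensional embeddings to plottable coordinates.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_2d(words, embeddings):
    """Reduce the selected embeddings (e.g., 200-dimensional) to 2-D for plotting."""
    matrix = np.stack([embeddings[w] for w in words])
    return PCA(n_components=2).fit_transform(matrix)   # shape: (len(words), 2)
```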

6 Conclusion

In this paper, we explored a new direction: employing the latent meanings of morphological compositions rather than the internal compositions themselves to train word embeddings. Three specific models named LMM-A, LMM-S and LMM-M were proposed by modifying the input layer and update rules of CBOW. The source code of LMMs is available at https://github.com/Y-Xu/lmm.

To test the performance of our models, we chose three word-level word embedding models and implemented an Explicitly Morpheme-related Model (EMM) as comparative baselines, and tested them on two basic NLP tasks of word similarity and syntactic analogy, and one downstream text classification task. The experimental results demonstrate that our models outperform the baselines on five word similarity datasets. On the syntactic analogy as well as the text classification tasks, our models also surpass all the baselines including the EMM. In the future, we intend to evaluate our models on some morpheme-rich languages like Russian, German and so on.

Acknowledgments

The authors are grateful to the reviewers for constructive feedback. This work was supported by the National Natural Science Foundation of China (No. 61572456), the Anhui Province Guidance Funds for Quantum Communication and Quantum Computers and the Natural Science Foundation of Jiangsu Province of China (No. BK20151241).


References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. Association for Computational Linguistics.

Parminder Bhatia, Robert Guthrie, and Jacob Eisenstein. 2016. Morphological priors for probabilistic neural word embeddings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 490–500. Association for Computational Linguistics.

Jan Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In International Conference on Machine Learning, pages 1899–1907.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR), 49:1–47.

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. 2015. Joint learning of character and word embeddings. In International Conference on Artificial Intelligence.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.

Ryan Cotterell and Hinrich Schutze. 2015. Morphological word-embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1287–1292.

Ryan Cotterell, Hinrich Schutze, and Jason Eisner. 2016. Morphological smoothing and extrapolation of word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1651–1660. Association for Computational Linguistics.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing (TSLP), 4(1):3.

Paramveer S Dhillon, Dean P Foster, and Lyle H Ungar. 2015. Eigenwords: spectral word embeddings. Journal of Machine Learning Research, 16:3035–3078.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Meeting of the Association for Computational Linguistics: Long Papers, pages 873–882.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In The Thirtieth AAAI Conference on Artificial Intelligence, pages 2741–2749.

Angeliki Lazaridou, Marco Marelli, Roberto Zamparelli, and Marco Baroni. 2013. Compositional-ly derived representations of morphologically complex words in distributional semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1517–1526. Association for Computational Linguistics.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In The Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2418–2424.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113.

Christopher D Manning, Prabhakar Raghavan, Hinrich Schutze, et al. 2008. Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Siyu Qiu, Qing Cui, Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Co-learning of word representations and morpheme representations. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 141–150.

Herbert Rubenstein and John B Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Disability Studies, 20:33–53.

Bonggun Shin, Timothy Lee, and Jinho D Choi. 2016. Lexicon integrated cnn models with attention for sentiment analysis. arXiv preprint arXiv:1610.06272.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2049–2054.