
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 327–337, Copenhagen, Denmark, September 7–11, 2017. ©2017 Association for Computational Linguistics

MUSE: Modularizing Unsupervised Sense Embeddings

Guang-He Lee*† Yun-Nung Chen†

*Massachusetts Institute of Technology, Cambridge, MA

†National Taiwan University, Taipei, Taiwan

[email protected] [email protected]

Abstract

This paper proposes to address the word sense ambiguity issue in an unsupervised manner, where word sense representations are learned along a word sense selection mechanism given contexts. Prior work focused on designing a single model to deliver both mechanisms, and thus suffered from either coarse-grained representation learning or inefficient sense selection. The proposed modular approach, MUSE, implements flexible modules to optimize distinct mechanisms, achieving the first purely sense-level representation learning system with linear-time sense selection. We leverage reinforcement learning to enable joint training on the proposed modules, and introduce various exploration techniques on sense selection for better robustness. The experiments on benchmark data show that the proposed approach achieves the state-of-the-art performance on synonym selection as well as on contextual word similarities in terms of MaxSimC.

1 Introduction

Recently, deep learning methodologies have dominated several research areas in natural language processing (NLP), such as machine translation, language understanding, and dialogue systems. However, most applications rely on word-level embeddings to obtain semantics. Considering that natural language is highly ambiguous, standard word embeddings may suffer from polysemy issues. Neelakantan et al. (2014) pointed out that, due to the triangle inequality in vector space, if one word has two different senses but is restricted to one embedding, the sum of the distances between the word and its synonym in each sense upper-bounds the distance between the respective synonyms, which may be mutually irrelevant, in embedding space¹. Due to the theoretical inability to account for polysemy using a single embedding representation per word, multi-sense word representations have been proposed to address the ambiguity issue using multiple embedding representations for the different senses of a word (Reisinger and Mooney, 2010; Huang et al., 2012).

This paper focuses on unsupervised learning from an unannotated corpus. There are two key mechanisms for a multi-sense word representation system in such a scenario: 1) a sense selection (decoding) mechanism infers the most probable sense for a word given its context and 2) a sense representation mechanism learns to embed word senses in a continuous space.

Under this framework, prior work focused on designing a single model to deliver both mechanisms (Neelakantan et al., 2014; Li and Jurafsky, 2015; Qiu et al., 2016). However, the previously proposed models introduce side-effects: 1) mixing word-level and sense-level tokens achieves efficient sense selection but introduces ambiguous word-level tokens during the representation learning process (Neelakantan et al., 2014; Li and Jurafsky, 2015), and 2) pure sense-level tokens prevent ambiguity from word-level tokens but require exponential time complexity when decoding a sense sequence (Qiu et al., 2016).

Unlike the prior work, this paper proposes MUSE², a novel modularization framework incorporating sense selection and representation learning models, which implements flexible modules to optimize distinct mechanisms. Specifically, MUSE enables linear-time sense identity decoding with a sense selection module and purely sense-level representation learning with a sense representation module.

¹ d(rock, stone) + d(rock, shake) ≥ d(stone, shake)

² The trained models and code are available at https://github.com/MiuLab/MUSE.

With the modular design, we propose a novel joint learning algorithm on the modules by connecting to a reinforcement learning scenario, which achieves the following advantages. First, the decision making process under reinforcement learning better captures the sense selection mechanism than probabilistic and clustering methods. Second, our reinforcement learning algorithm realizes the first single objective function for modular unsupervised sense representation systems. Finally, we introduce various exploration techniques under reinforcement learning on sense selection to enhance robustness.

In summary, our contributions are five-fold:

• MUSE is the first system that maintains purely sense-level representation learning with linear-time sense decoding.

• We are among the first to leverage reinforcement learning to model the sense selection process in a sense representation system.

• We are among the first to propose a single objective for modularized unsupervised sense embedding learning.

• We introduce a sense exploration mechanism for the sense selection module to achieve better flexibility and robustness.

• Our experimental results show the state-of-the-art performance for synonym selection and contextual word similarities in terms of MaxSimC.

2 Related Work

There are three dominant types of approaches for learning multi-sense word representations in the literature: 1) clustering methods, 2) probabilistic modeling methods, and 3) lexical ontology based methods. Our reinforcement learning based approach can be loosely connected to clustering methods and probabilistic modeling methods.

Reisinger and Mooney (2010) first proposed multi-sense word representations on the vector space based on clustering techniques. With the power of deep learning, some work exploited neural networks to learn embeddings with sense selection based on clustering (Huang et al., 2012; Neelakantan et al., 2014). Chen et al. (2014) replaced the clustering procedure with a word sense disambiguation model using WordNet (Miller, 1995). Kågebäck et al. (2015) and Vu and Parker (2016) further leveraged a weighting mechanism and interactive process in the clustering procedure. Moreover, Guo et al. (2014) leveraged bilingual resources for clustering. However, most of the above approaches separated the clustering procedure and the representation learning procedure without a joint objective, which may suffer from the error propagation issue. Instead, the proposed approach, MUSE, enables joint training on sense selection and representation learning.

Instead of clustering, probabilistic modeling methods have been applied for learning multi-sense embeddings in order to make the sense selection more flexible, where Tian et al. (2014) and Jauhar et al. (2015) conducted probabilistic modeling with EM training. Li and Jurafsky (2015) exploited the Chinese Restaurant Process to infer the sense identity. Furthermore, Bartunov et al. (2016) developed a non-parametric Bayesian extension of the skip-gram model (Mikolov et al., 2013b). Despite reasonable modeling of sense selection, all the above methods mixed word-level and sense-level tokens during representation learning, and were thus unable to conduct representation learning purely at the sense level because of the complicated computation in their EM algorithms.

Recently, Qiu et al. (2016) proposed an EM algorithm to learn purely sense-level representations, where the computational cost is high when decoding the sense identity sequence, because it takes exponential time to search all sense combinations within a context window. Our modular design addresses this drawback: the sense selection module decodes a sense sequence with linear-time complexity, while the sense representation module keeps representation learning purely at the sense level.

Unlike a lot of relevant work that requires additional resources such as the lexical ontology (Pilehvar and Collier, 2016; Rothe and Schütze, 2015; Jauhar et al., 2015; Chen et al., 2015; Iacobacci et al., 2015) or bilingual data (Guo et al., 2014; Ettinger et al., 2016; Šuster et al., 2016), which may be unavailable in some languages, our model can be trained using only an unlabeled corpus. Also, some prior work proposed to learn topical embeddings and word embeddings jointly in order to consider the contexts (Liu et al., 2015a,b), whereas this paper focuses on learning multi-sense word embeddings.

[Figure 1 (image not reproduced). The diagram uses the example corpus sentence "Smartphone companies including apple blackberry, and sony will be invited." and shows two Sense Selection Modules (matrices P, Q_i and Q_j), one for the target word and one for the collocated word, connected to a Sense Representation Module (matrices U and V, with negative sampling).]

Figure 1: The MUSE architecture with a 3-step learning algorithm: 1) collocation sampling, 2) sense selection for sense representation learning, and 3) optimizing sense selection with a reward signal from sense representation. The reward signal is only passed to the target word to stabilize model training due to the directional architecture in the sense representation module.

3 Proposed Approach: MUSE

This work proposes a framework to modularize two key mechanisms for multi-sense word representations: a sense selection module and a sense representation module. The sense selection module decides which sense to use given a text context, whereas the sense representation module learns meaningful representations based on its statistical characteristics. Unlike prior work that suffers from either inefficient sense selection (Qiu et al., 2016) or coarse-grained representation learning (Neelakantan et al., 2014; Li and Jurafsky, 2015; Bartunov et al., 2016), the proposed modularized framework is capable of performing efficient sense selection and learning representations purely at the sense level simultaneously.

To learn sense-level representations, a sense selection model should first be established for sense identity decoding. On the other hand, the sense embeddings should guide the sense selection model when decoding a sense identity sequence. Therefore, these two modules should be entangled. This indicates that a naive two-stage algorithm or the two separate learning algorithms proposed by prior work are not optimal.

By connecting the proposed formulation with reinforcement learning literature, we design a novel joint training algorithm. Besides, taking advantage of the form of reinforcement learning, we are among the first to investigate various exploration techniques in sense selection for unsupervised sense embedding learning.

3.1 Model Architecture

Our model architecture is illustrated in Figure 1, where there are two modules in optimization.

3.1.1 Sense Selection Module

Formally speaking, given a corpus $C$, vocabulary $W$, and the $t$-th word $C_t = w_i \in W$, we would like to find the most probable sense $z_{ik} \in Z_i$, where $Z_i$ is the set of senses of word $w_i$. Assuming that a word sense is determined by the local context, we exploit a local context $\bar{C}_t = \{C_{t-m}, \cdots, C_{t+m}\}$ for sense selection according to the Markov assumption, where $m$ is the size of a context window. Then we can either formulate a probabilistic policy $\pi(z_{ik} \mid \bar{C}_t)$ about sense selection or estimate the individual likelihood $q(z_{ik} \mid \bar{C}_t)$ for each sense identity.

To ensure efficiency, here we exploit a linear neural architecture that takes word-level input tokens and outputs sense-level identities. The architecture is similar to continuous bag-of-words (CBOW) (Mikolov et al., 2013a). Specifically, given a word embedding matrix $P$, the local context can be modeled as the summation of word embeddings from its context $\bar{C}_t$. The output can be formulated with a 3-mode tensor $Q$, whose dimensions denote words, senses, and latent variables. Then we can model $\pi(z_{ik} \mid \bar{C}_t)$ or $q(z_{ik} \mid \bar{C}_t)$ correspondingly. Here we model $\pi(\cdot)$ as a categorical distribution using a softmax layer:

$$\pi(z_{ik} \mid \bar{C}_t) = \frac{\exp\left(Q_{ik}^{T} \sum_{j \in \bar{C}_t} P_j\right)}{\sum_{k' \in Z_i} \exp\left(Q_{ik'}^{T} \sum_{j \in \bar{C}_t} P_j\right)}. \tag{1}$$

On the other hand, the likelihood of selecting distinct sense identities, $q(z_{ik} \mid \bar{C}_t)$, is modeled as a Bernoulli distribution with a sigmoid function $\sigma(\cdot)$:

$$q(z_{ik} \mid \bar{C}_t) = \sigma\left(Q_{ik}^{T} \sum_{j \in \bar{C}_t} P_j\right). \tag{2}$$

Different modeling approaches require different learning methods, especially for the unsupervised setting. We defer the corresponding learning algorithms to § 3.2. Finally, with a built sense selection module, we can apply any selection algorithm such as a greedy selection strategy to infer the sense identity $z_{ik}$ given a word $w_i$ with its context $\bar{C}_t$.
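As a rough illustration of the two scoring functions in (1) and (2), the following sketch scores three senses of a single word from a summed context embedding and then applies greedy selection. The toy values of `P` and `Q_i`, the two-dimensional latent space, and all helper names are assumptions made for this example; this is not the released implementation.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def context_vector(P, context_ids):
    """Sum the word embeddings P[j] over the context window (CBOW-style)."""
    dim = len(P[0])
    return [sum(P[j][d] for j in context_ids) for d in range(dim)]

def policy(Q_i, ctx):
    """Eq. (1): softmax over the senses of one word."""
    scores = [dot(q, ctx) for q in Q_i]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def q_values(Q_i, ctx):
    """Eq. (2): an independent sigmoid likelihood for each sense."""
    return [1.0 / (1.0 + math.exp(-dot(q, ctx))) for q in Q_i]

# Hypothetical toy parameters: 4 vocabulary words, 2 latent dimensions.
P = [[0.1, 0.3], [0.2, -0.1], [-0.4, 0.5], [0.3, 0.2]]
# Hypothetical slice of the tensor Q for one word with 3 senses.
Q_i = [[0.5, -0.2], [-0.3, 0.8], [0.1, 0.1]]

ctx = context_vector(P, [0, 2, 3])  # ids of the words in the window
pi = policy(Q_i, ctx)
q = q_values(Q_i, ctx)
greedy_sense = max(range(len(q)), key=q.__getitem__)
print(pi, q, greedy_sense)
```

Both functions share the same linear score, so the greedy sense is the same under either formulation; they differ in whether the estimates for different senses are normalized against each other.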

We note that the modularized model enables efficient sense selection by leveraging word-level tokens, while retaining purely sense-level tokens in the representation module. Specifically, if $n$ denotes $\max_k |Z_k|$, decoding $L$ words requires searching $O(nL)$ senses due to independent sense selection. The prior work using a single model with purely sense-level tokens (Qiu et al., 2016) requires exponential time, $O(n^{2m})$, to calculate the collocation energy for every possible combination of sense identities within a context window for a single target sense. Further, Qiu et al. (2016) took an additional sequence decoding step with quadratic time complexity, $O(n^{4m}L)$, based on an exponential number $n^{2m}$ in the base unit. This demonstrates the gain in sense inference efficiency achieved by our proposed model.
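To give a feel for the gap between these complexity terms, the short calculation below plugs in $n = 3$ senses and a window size of $m = 5$, the values used later in the experimental setup, together with a hypothetical sequence length of $L = 10$. The function names are illustrative only, and the numbers are raw operation counts from the asymptotic expressions, not measured run times.

```python
def linear_cost(n, L):
    """Independent sense selection: O(n * L) senses searched."""
    return n * L

def window_energy_cost(n, m):
    """Sense combinations within one context window: O(n^(2m))."""
    return n ** (2 * m)

def sequence_decoding_cost(n, m, L):
    """Sequence decoding over window combinations: O(n^(4m) * L)."""
    return n ** (4 * m) * L

n, m, L = 3, 5, 10  # L is an assumed sequence length for illustration
print(linear_cost(n, L))                # 30
print(window_energy_cost(n, m))         # 59049
print(sequence_decoding_cost(n, m, L))  # 34867844010
```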

3.1.2 Sense Representation Module

Semantic representation learning is typically formulated as a maximum likelihood estimation (MLE) problem for collocation likelihood. In this paper, we use the skip-gram formulation (Mikolov et al., 2013b) considering that it requires less training time, where only two sense identities are required for stochastic training. Other popular candidates, like GloVe (Pennington et al., 2014) and CBOW (Mikolov et al., 2013a), require more sense identities to be selected as input and are thus not suitable for our scenario. For example, GloVe (Pennington et al., 2014) takes computationally expensive collocation counting statistics for each token in a corpus as input, which requires sense selection for every occurrence of the target word across the whole corpus for a single optimization step.

To learn the representations, we first create an input sense representation matrix $U$ and a collocation estimation matrix $V$ as the learning targets. Given a target word $w_i$ and collocated word $w_j$ with corresponding local contexts, we map them to their sense identities as $z_{ik}$ and $z_{jl}$ by the sense selection module, and maximize the sense collocation log likelihood $\log L(\cdot)$. A natural choice of the likelihood function is formulated as a categorical distribution over all possible collocated senses given the target sense $z_{ik}$:

$$\max_{U,V} \; \log L(z_{jl} \mid z_{ik}) = \log \frac{\exp\left(U_{z_{ik}}^{T} V_{z_{jl}}\right)}{\sum_{z_{uv}} \exp\left(U_{z_{ik}}^{T} V_{z_{uv}}\right)}. \tag{3}$$

Instead of enumerating all possible collocated senses, which is computationally expensive, we use the skip-gram objective (4) (Mikolov et al., 2013b), denoted $\bar{L}(\cdot)$, to approximate (3), as shown in the green block of Figure 1.

$$\max_{U,V} \; \log \bar{L}(z_{jl} \mid z_{ik}) = \log \sigma\left(U_{z_{ik}}^{T} V_{z_{jl}}\right) + \sum_{v=1}^{M} \mathbb{E}_{z_{uv} \sim p_{neg}(z)}\left[\log \sigma\left(-U_{z_{ik}}^{T} V_{z_{uv}}\right)\right], \tag{4}$$

where $p_{neg}(z)$ is the distribution over all senses for negative samples. In our experiments, with $|Z_i|$ senses for word $w_i$, we use $1/|Z_i|$ of the word-level unigram as the sense-level unigram for efficiency, together with the 3/4-th power trick of Mikolov et al. (2013b).
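The negative-sampling objective in (4) and the sense-level negative distribution can be sketched as follows. This is a minimal illustration under stated assumptions: the word counts, embedding values, and helper names are hypothetical, and the order in which the $1/|Z_i|$ split and the 3/4 power are applied is a guess, since the text does not spell it out.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sense_negative_distribution(word_counts, senses_per_word):
    """Split each word's unigram count evenly over its senses, apply the
    3/4 power, and renormalize (the order of these steps is an assumption)."""
    weights = {}
    for word, count in word_counts.items():
        k = senses_per_word[word]
        for s in range(k):
            weights[(word, s)] = (count / k) ** 0.75
    total = sum(weights.values())
    return {sense: w / total for sense, w in weights.items()}

def neg_sampling_objective(u_target, v_pos, v_negs):
    """Single-sample estimate of Eq. (4) for one target/collocated sense pair."""
    value = math.log(sigmoid(dot(u_target, v_pos)))
    for v_neg in v_negs:
        value += math.log(sigmoid(-dot(u_target, v_neg)))
    return value

rng = random.Random(0)
counts = {"apple": 90, "rock": 60, "bank": 30}  # hypothetical unigram counts
senses = {"apple": 3, "rock": 3, "bank": 3}     # 3 senses per word
p_neg = sense_negative_distribution(counts, senses)

# Hypothetical collocation matrix V (one row per sense) and one row of U.
V = {sense: [rng.uniform(-0.1, 0.1) for _ in range(4)] for sense in p_neg}
u = [0.2, -0.1, 0.05, 0.3]

M = 5  # number of negative samples in this toy example
negatives = rng.choices(list(p_neg), weights=list(p_neg.values()), k=M)
objective = neg_sampling_objective(u, V[("bank", 0)], [V[s] for s in negatives])
print(objective)
```

Only the target sense, the collocated sense, and the $M$ sampled negatives are touched, which is what makes the objective cheap compared with the full softmax in (3).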

We note that our modular framework can easily maintain purely sense-level tokens with an arbitrary representation learning model. In contrast, most related work using probabilistic modeling (Tian et al., 2014; Jauhar et al., 2015; Li and Jurafsky, 2015; Bartunov et al., 2016) bound sense representations to the sense selection mechanism, so efficient sense selection by leveraging word-level tokens can be achieved only at the cost of mixing word-level and sense-level tokens in their representation learning process.

3.2 Learning

Without the supervised signal for the proposed modules, it is desirable to connect two modules in a way where they can improve each other by their own estimations. First, a trivial way is to forward the prediction of the sense selection module to the representation module. Then we cast the estimated collocation likelihood as a reward signal for the selected sense for effective learning.

To realize the above procedure, we cast the learning problem as a one-step Markov decision process (MDP) (Sutton and Barto, 1998), where the state, action, and reward correspond to context $\bar{C}_t$, sense $z_{ik}$, and collocation log likelihood $\log \bar{L}(\cdot)$, respectively. Based on different modeling methods ((1) or (2)) in the sense selection module, we connect the model to respective reinforcement learning algorithms to solve the MDP. Specifically, we treat (1) as a policy distribution and (2) as Q-value estimation in the sense of the reinforcement learning literature.

The proposed MDP framework embodies several nuances of sense selection. First, the decision of a word sense is Markov: taking the whole corpus into consideration is not more helpful than a handful of necessary local contexts. Second, the decision making in MDP exploits a hard decision for selecting sense identity, which captures the sense selection process more naturally than a joint probability distribution among senses (Qiu et al., 2016). Finally, we exploit the reward mechanism in MDP to enable joint training: the estimation of sense representation is treated as a reward signal to guide sense selection. In contrast, the decision making under clustering (Huang et al., 2012; Neelakantan et al., 2014) considers the similarity within clusters instead of the outcome of a decision using a reward signal as MDP.

3.2.1 Policy Gradient Method

Because (1) fits a valid probability distribution, an intuitive optimization target is the expectation of resulting collocation likelihood among each sense. In addition, as the skip-gram formulation in (4) is unidirectional ($\bar{L}(z_{ik} \mid z_{jl}) \neq \bar{L}(z_{jl} \mid z_{ik})$), we perform one-side optimization for the target sense $z_{ik}$ to stabilize model training³. That is, for the target word $w_i$ and the collocated word $w_j$ given respective contexts $\bar{C}_t$ and $\bar{C}_{t'}$ ($0 < |t - t'| \leq m$), we first draw a sense $z_{jl}$ for $w_j$ from the policy $\pi(\cdot \mid \bar{C}_{t'})$ and optimize the expected collocation likelihood for the target sense $z_{ik}$ as follows,

$$\max_{P,Q} \; \mathbb{E}_{z_{ik} \sim \pi(\cdot \mid \bar{C}_t)}\left[\log \bar{L}(z_{jl} \mid z_{ik})\right]. \tag{5}$$

Note that (4) can be merged into (5) as a single objective. The objective is differentiable and supports stochastic optimization (Lei et al., 2016), which uses a stochastic sample $z_{ik}$ for optimization.

³ We observe about 4% performance drop by optimizing input selection $z_{ik}$ and output selection $z_{jl}$ simultaneously.

However, there are two possible disadvantages in this formulation. First, because the policy assumes the probability distribution in (1), optimizing the selected sense must affect the estimation of the other senses. Second, if applying stochastic gradient ascent to optimizing (5), it would always lower the probability estimation for the selected sense $z_{ik}$ even if the model accurately selects the right sense. The detailed proof is in Appendix A.

3.2.2 Value-Based Method

To address the above issues, we apply the Q-learning algorithm (Mnih et al., 2013). Instead of maintaining a probabilistic policy for sense selection, Q-learning estimates the Q-value (resulting collocation log likelihood) for each sense candidate directly and independently. Thus, the estimates for unselected senses are not necessarily influenced by the selected one. Note that in a one-step MDP, the reward is equivalent to the Q-value, so we will use reward and Q-value interchangeably hereinafter, depending on the context.

We further follow the convention of recent neural reinforcement learning by reducing the reward range to aid model training (Mnih et al., 2013). Specifically, we replace the log likelihood $\log \bar{L}(\cdot) \in (-\infty, 0]$ with the likelihood $\bar{L}(\cdot) \in [0, 1]$ as the reward function. Because $\log(\cdot)$ is monotonic, the relative ordering of the reward remains the same.

Furthermore, we exploit the probabilistic nature of likelihood for Q-learning. To elaborate, as Q-learning is used to approximate the Q-value for each action in typical reinforcement learning, most literature adopted square loss to characterize the discrepancy between the target and estimated Q-values (Mnih et al., 2013). In our setting where the Q-value/reward is a likelihood function, our model exploits cross-entropy loss to better capture the characteristics of probability distribution.

Given that the collocation likelihood in (4) is an approximation to the original categorical distribution with a softmax function shown in (3) (Mikolov et al., 2013b), we revise the formulation by omitting the negative sampling term. The resulting formulation $\hat{L}(\cdot)$ is a Bernoulli distribution indicating whether $z_{jl}$ collocates or not given $z_{ik}$:

$$\hat{L}(z_{jl} \mid z_{ik}) = \sigma\left(U_{z_{ik}}^{T} V_{z_{jl}}\right). \tag{6}$$

There are three advantages to using $\hat{L}(\cdot)$ in (6) instead of the approximated $\bar{L}(\cdot)$ in (4) and the original $L(\cdot)$ in (3). First, regarding the variance of estimation, $\hat{L}(\cdot)$ better captures $L(\cdot)$ than $\bar{L}(\cdot)$ does, because $\bar{L}(\cdot)$ involves sampling:

$$Var(\bar{L}(\cdot)) \geq Var(L(\cdot)) = Var(\hat{L}(\cdot)) = 0. \tag{7}$$

Second, regarding the relative ordering of estimation, for any two collocated senses $z_{jl}$ and $z_{jl'}$ with a target sense $z_{ik}$, the following equivalence holds:

$$L(z_{jl} \mid z_{ik}) < L(z_{jl'} \mid z_{ik}) \;\Leftrightarrow\; \bar{L}(z_{jl} \mid z_{ik}) < \bar{L}(z_{jl'} \mid z_{ik}) \;\Leftrightarrow\; \hat{L}(z_{jl} \mid z_{ik}) < \hat{L}(z_{jl'} \mid z_{ik}). \tag{8}$$

Third, for collocation computation, $L(\cdot)$ requires all sense identities and $\bar{L}(\cdot)$ requires $(M+1)$ sense identities, whereas $\hat{L}(\cdot)$ only requires 1 sense identity. In sum, the proposed $\hat{L}(\cdot)$ approximates $L(\cdot)$ with no variance, no “bias” (in terms of relative ordering), and significantly less computation.

Finally, because both the target distribution $\hat{L}(\cdot)$ in (6) and the estimated distribution $q(\cdot)$ in (2) are Bernoulli distributions, we follow the last section to conduct one-side optimization by fixing a collocated sense $z_{jl}$ and optimizing the selected sense $z_{ik}$ with cross entropy as

$$\min_{P,Q} \; H\left(\hat{L}(z_{ik} \mid z_{jl}),\; q(z_{ik} \mid \bar{C}_t)\right). \tag{9}$$
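The cross-entropy objective in (9) can be illustrated with a small gradient-descent sketch. It is a simplification under stated assumptions: the vectors are hypothetical toy values, the reward is computed once from fixed rows of $U$ and $V$ as in (6), and only the sense row `Q_ik` is updated, whereas the paper optimizes both $P$ and $Q$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(target, estimate):
    """H(target, estimate) between two Bernoulli distributions."""
    return -(target * math.log(estimate)
             + (1.0 - target) * math.log(1.0 - estimate))

def q_update(Q_ik, ctx, reward, lr):
    """One gradient-descent step on Eq. (9) with respect to Q_ik only."""
    q = sigmoid(dot(Q_ik, ctx))
    # For a sigmoid output with cross-entropy loss, dH/dscore = q - reward.
    return [w - lr * (q - reward) * c for w, c in zip(Q_ik, ctx)]

# Hypothetical rows of U and V for the selected target and collocated senses.
u_target = [0.4, -0.2, 0.7]
v_colloc = [0.5, 0.1, 0.9]
reward = sigmoid(dot(u_target, v_colloc))  # Eq. (6), used as the Q-value target

ctx = [0.3, 0.6, -0.1]   # summed context embedding (toy values)
Q_ik = [0.0, 0.0, 0.0]   # Q starts at zero, as in the experimental setup
loss_before = cross_entropy(reward, sigmoid(dot(Q_ik, ctx)))
for _ in range(200):
    Q_ik = q_update(Q_ik, ctx, reward, lr=0.5)
q_after = sigmoid(dot(Q_ik, ctx))
loss_after = cross_entropy(reward, q_after)
print(reward, q_after, loss_before, loss_after)
```

Because the update touches only the selected sense's row, the estimates for the other senses of the same word are left unchanged, which is the property that motivates the value-based formulation over the softmax policy.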

3.2.3 Joint Training

To jointly train sense selection and sense representation modules, we first select a pair of the collocated senses, $z_{ik}$ and $z_{jl}$, based on the sense selection module with any selecting strategy (e.g. greedy), and then optimize the sense representation module and the sense selection module using the above derivations. Algorithm 1 describes the proposed MUSE model training procedure.

Although both are modular, the major distinction between our framework and the two-stage clustering-representation learning framework (Neelakantan et al., 2014; Vu and Parker, 2016) is that we establish a reward signal from the sense representation module to the sense selection module to enable immediate and joint optimization.

Algorithm 1: Learning Algorithm

    for w_i = C_t ∈ C do
        sample w_j = C_t' (0 < |t' − t| ≤ m);
        z_ik = select(C̄_t, w_i);
        z_jl = select(C̄_t', w_j);
        optimize U, V by (4) for the sense representation module;
        optimize P, Q by (5) or (9) for the sense selection module;

3.3 Sense Selection Strategy

Given a fitness estimation for each sense, exploiting the greedy sense is the most popular strategy for clustering algorithms (Neelakantan et al., 2014; Kågebäck et al., 2015) and hard-EM algorithms (Qiu et al., 2016; Jauhar et al., 2015) in literature. However, there are two incentives to conduct exploration. First, in the early training stage when the fitness is not well estimated, it is desirable to explore underestimated senses. Second, due to high ambiguity in natural language, sometimes multiple senses in a word would fit in the same context. The dilemma between exploring sub-optimal choices and exploiting the optimal choice is called exploration-exploitation trade-off in reinforcement learning (Sutton and Barto, 1998).

We introduce exploration mechanisms for sense selection for both policy gradient and Q-learning. For policy gradient, we sample the policy distribution to approximate the expectation in (5). Because of the flexible formulation of Q-learning, the following classic exploration mechanisms are applied to sense selection:

• Greedy: selects the sense with the largest Q-value (no exploration).

• ε-Greedy: selects a random sense with probability ε, and adopts the greedy strategy otherwise (Mnih et al., 2013).

• Boltzmann: samples the sense based on the Boltzmann distribution modeled by Q-value. We directly use (1) as the Boltzmann distribution for simplicity.
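The three strategies listed above can be sketched in a few lines. The function names and toy inputs are assumptions for illustration; `boltzmann` takes the raw linear scores and applies the softmax of (1), following the statement that (1) is reused as the Boltzmann distribution.

```python
import math
import random

def greedy(q_values):
    """Pick the sense with the largest Q-value (no exploration)."""
    return max(range(len(q_values)), key=q_values.__getitem__)

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random sense, else be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)

def boltzmann(scores, rng):
    """Sample a sense from the softmax over the linear scores, as in Eq. (1)."""
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

rng = random.Random(0)
q = [0.2, 0.7, 0.1]  # hypothetical Q-values for three senses of one word
picks = [epsilon_greedy(q, 0.05, rng) for _ in range(1000)]
print(greedy(q), picks.count(1), boltzmann([0.1, 0.9, -0.3], rng))
```

With ε = 5%, as used in the experiments, nearly all selections remain greedy, while Boltzmann sampling explores in proportion to how close the competing scores are.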

We note that Q-learning with Boltzmann sampling yields the same sampling process as policy gradient but a different optimization objective. To the best of our knowledge, we are among the first to explore several exploration strategies for unsupervised sense embedding learning.

In the following sections, MUSE-Policy denotes the proposed MUSE model with policy learning, and MUSE-Greedy denotes the model using the corresponding sense selection strategy for Q-learning (likewise for MUSE-ε-Greedy and MUSE-Boltzmann).


4 Experiments

We evaluate our proposed MUSE model in both quantitative and qualitative experiments.

4.1 Experimental Setup

Our model is trained on the April 2010 Wikipedia dump (Shaoul and Westbury, 2010), which contains approximately 1 billion tokens. For fair comparison, we adopt the same vocabulary set as Huang et al. (2012) and Neelakantan et al. (2014). For preprocessing, we convert all words to lower case, apply the Stanford tokenizer and the Stanford sentence tokenizer (Manning et al., 2014), and remove all sentences with fewer than 10 tokens. The number of senses per word in $Q$ is set to 3, following prior work (Neelakantan et al., 2014).

In the experiments, the context window size is set to 5 ($|\bar{C}_t| = 11$). The subsampling technique introduced by word2vec (Mikolov et al., 2013b) is applied to accelerate the training process. The learning rate is set to 0.025. The embedding dimension is 300. We initialize $Q$ and $V$ as zeros, and $P$ and $U$ from the uniform distribution $[-\sqrt{1/100}, \sqrt{1/100}]$ such that each embedding has unit length in expectation (Lei et al., 2015). Our model uses 25 negative senses for negative sampling in (4). We use ε = 5% for the ε-Greedy sense selection strategy.
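The unit-length claim for this initialization can be checked with a short calculation. A zero-mean uniform variable on $[-a, a]$ has variance $a^2/3$, so with $a^2 = 1/100$ and 300 dimensions the expected squared norm is $300 \cdot (1/100)/3 = 1$. The sketch below, with hypothetical variable names and an arbitrary sample count, confirms this both analytically and by Monte Carlo; strictly speaking it verifies the expected squared length.

```python
import math
import random

dim = 300                   # embedding dimension from the setup
bound = math.sqrt(1 / 100)  # half-width of the uniform range

# Each coordinate has mean 0 and variance bound^2 / 3, so the expected
# squared norm is dim * bound^2 / 3.
expected_sq_norm = dim * bound ** 2 / 3

rng = random.Random(0)

def sample_sq_norm():
    """Squared Euclidean norm of one randomly initialized embedding."""
    return sum(rng.uniform(-bound, bound) ** 2 for _ in range(dim))

n_samples = 1000  # arbitrary number of Monte Carlo draws
empirical_sq_norm = sum(sample_sq_norm() for _ in range(n_samples)) / n_samples
print(expected_sq_norm, empirical_sq_norm)
```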

In optimization, we conduct mini-batch training with a batch size of 2048 using the following procedure: 1) select senses in the batch; 2) optimize $U, V$ using stochastic training within the batch for efficiency; 3) optimize $P, Q$ using mini-batch training for robustness.

4.2 Experiment 1: Contextual Word Similarity

To evaluate the quality of the learned sense embeddings, we compute the similarity score between each word pair given their respective local contexts and compare it with the human-judged score using Stanford's Contextual Word Similarities (SCWS) dataset (Huang et al., 2012). Specifically, given a list of word pairs with corresponding contexts, $S = \{(w_i, C_t, w_j, C_{t'})\}$, we calculate Spearman's rank correlation ρ between the human-judged similarity and the model similarity estimations⁴. Two major contextual similarity estimations are introduced by Reisinger and Mooney (2010): AvgSimC and MaxSimC. AvgSimC is a soft measurement that addresses the contextual information with a probability estimation:

$$\mathrm{AvgSimC}(w_i, C_t, w_j, C_{t'}) = \sum_{k=1}^{|Z_i|} \sum_{l=1}^{|Z_j|} \pi(z_{ik} \mid C_t)\, \pi(z_{jl} \mid C_{t'})\, d(z_{ik}, z_{jl}),$$

where $d(z_{ik}, z_{jl})$ refers to the cosine similarity between $U_{z_{ik}}$ and $U_{z_{jl}}$. AvgSimC weights the similarity measurement of each sense pair $z_{ik}$ and $z_{jl}$ by their probability estimations. On the other hand, MaxSimC is a hard measurement that only considers the most probable senses:

$$\mathrm{MaxSimC}(w_i, C_t, w_j, C_{t'}) = d(z_{ik}, z_{jl}), \quad z_{ik} = \arg\max_{z_{ik'}} \pi(z_{ik'} \mid C_t), \quad z_{jl} = \arg\max_{z_{jl'}} \pi(z_{jl'} \mid C_{t'}).$$

The baselines for comparison include classic clustering methods (Huang et al., 2012; Neelakantan et al., 2014), EM algorithms (Tian et al., 2014; Qiu et al., 2016; Bartunov et al., 2016), and the Chinese Restaurant Process (Li and Jurafsky, 2015)⁵, where all approaches are trained on the same corpus, except that Qiu et al. (2016) used more recent Wikipedia dumps. The embedding sizes of all baselines are 300, except 50 in Huang et al. (2012). For every competitor with multiple settings, we report the best performance in each similarity measurement setting and show it in Table 1.

| Method | MaxSimC | AvgSimC |
|---|---|---|
| Huang et al. (2012) | 26.1 | 65.7 |
| Neelakantan et al. (2014) | 60.1 | 69.3 |
| Tian et al. (2014) | 63.6 | 65.4 |
| Li and Jurafsky (2015) | 66.6 | 66.8 |
| Bartunov et al. (2016) | 53.8 | 61.2 |
| Qiu et al. (2016) | 64.9 | 66.1 |
| MUSE-Policy | 66.1 | 67.4 |
| MUSE-Greedy | 66.3 | 68.3 |
| MUSE-ε-Greedy | 67.4† | 68.6 |
| MUSE-Boltzmann | 67.9† | 68.7 |

Table 1: Spearman's rank correlation ρ × 100 on the SCWS dataset. † denotes superior performance to all unsupervised competitors.

⁴ For example, human-judged similarity between “... east bank of the Des Moines River ...” and “... basis of all money laundering ...” is 2.5 out of 10.0 in SCWS dataset (Huang et al., 2012).

⁵ We run Li and Jurafsky (2015)'s released code on our corpus for fair comparison.
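The AvgSimC and MaxSimC definitions can be illustrated with a small pure-Python sketch. The function names, the list-based vector representation, and the toy probabilities and vectors below are assumptions made for illustration; they are not taken from the SCWS data or the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def avg_sim_c(probs_i, senses_i, probs_j, senses_j):
    """Soft measure: cosine of every sense pair, weighted by both sense probabilities."""
    return sum(
        p_k * p_l * cosine(u_k, u_l)
        for p_k, u_k in zip(probs_i, senses_i)
        for p_l, u_l in zip(probs_j, senses_j)
    )


def max_sim_c(probs_i, senses_i, probs_j, senses_j):
    """Hard measure: cosine between the single most probable sense of each word."""
    k = max(range(len(probs_i)), key=lambda a: probs_i[a])
    l = max(range(len(probs_j)), key=lambda b: probs_j[b])
    return cosine(senses_i[k], senses_j[l])


# Toy example: two words with two senses each (illustrative values only).
senses_i = [[1.0, 0.0], [0.0, 1.0]]
probs_i = [0.9, 0.1]   # pi(z_ik | C_t)
senses_j = [[1.0, 0.0], [0.0, 1.0]]
probs_j = [0.2, 0.8]   # pi(z_jl | C_t')

avg_score = avg_sim_c(probs_i, senses_i, probs_j, senses_j)
max_score = max_sim_c(probs_i, senses_i, probs_j, senses_j)
```

In this toy case the most probable senses of the two words are orthogonal, so MaxSimC depends entirely on the selection being right, while AvgSimC still receives mass from the less probable sense pairs.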


| Method | ESL-50 | RD-300 | TOEFL-80 |
|---|---|---|---|
| 1) Conventional Word Embedding | | | |
| Global Context | 47.73 | 45.07 | 60.87 |
| Skip-Gram | 52.08 | 55.66 | 66.67 |
| 2) Word Sense Disambiguation | | | |
| IMS+SG | 41.67 | 53.77 | 66.67 |
| 3) Unsupervised Sense Embeddings | | | |
| EM | 27.08 | 33.96 | 40.00 |
| MSSG | 57.14 | 58.93 | 78.26 |
| CRP | 50.00 | 55.36 | 82.61 |
| MUSE-Policy | 52.38 | 51.79 | 79.71 |
| MUSE-Greedy | 57.14 | 58.93 | 79.71 |
| MUSE-ε-Greedy | 61.90† | 62.50† | 84.06† |
| MUSE-Boltzmann | 64.29† | 66.07† | 88.41† |
| 4) Supervised Sense Embeddings | | | |
| Retro-GC | 63.64 | 66.20 | 71.01 |
| Retro-SG | 56.25 | 65.09 | 73.33 |

Table 2: Accuracy on synonym selection. † denotes superior performance to all unsupervised competitors.

Our MUSE model achieves state-of-the-art performance on MaxSimC, demonstrating the superior quality of independent sense embeddings. On the other hand, MUSE achieves performance comparable with the best competitor in terms of AvgSimC (68.7 vs. 69.3), while outperforming the same competitor significantly in terms of MaxSimC (67.9 vs. 60.1). The results demonstrate not only the high quality of the sense representations but also accurate sense selection.

From the application perspective, MaxSimC corresponds to a typical scenario that uses a single embedding per word, while AvgSimC employs multiple sense vectors per word simultaneously, which not only brings computational overhead but also requires changing existing neural architectures for NLP. Hence, we argue that MaxSimC characterizes the practical usage of a sense representation system better than AvgSimC.

Among the various learning methods for MUSE, policy gradient performs worst, echoing our argument in § 3.2.1. On the other hand, the superior performance of Boltzmann sampling and ε-Greedy over Greedy selection demonstrates the effectiveness of exploration.
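To make the contrast with Greedy selection concrete, the following is a generic sketch of Boltzmann (softmax) sampling over per-sense scores. The `temperature` parameter, the function names, and the use of raw scores as exponents are assumptions for illustration; the exact parameterization used by MUSE-Boltzmann may differ.

```python
import math
import random


def boltzmann_probs(q_values, temperature=1.0):
    """Softmax distribution over senses; subtracting the max keeps exp() stable."""
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_select(q_values, temperature=1.0, rng=random):
    """Sample a sense index in proportion to its Boltzmann probability."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Unlike ε-Greedy, which explores uniformly, this scheme explores more often among senses whose scores are close to the best one, so clearly poor senses are rarely sampled.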

Finally, replacing the typical likelihood with the proposed approximation of L(·) as the reward signal yields a 2.3 times speedup for MUSE-ε-Greedy and a 1.3 times speedup for MUSE-Boltzmann in reaching 67.0 in MaxSimC, which demonstrates the efficacy of the proposed approximation over the typical L(·) in terms of convergence.

4.3 Experiment 2: Synonym Selection

We further evaluate our model on synonym selection using multi-sense word representations (Jauhar et al., 2015). Three standard synonym selection datasets are used: ESL-50 (Turney, 2001), RD-300 (Jarmasz and Szpakowicz, 2004), and TOEFL-80 (Landauer and Dumais, 1997). In the datasets, each question consists of a question word $w_Q$ and four answer candidates $\{w_A, w_B, w_C, w_D\}$, and the goal is to select the most semantically synonymous choice among the four candidates. For example, in the TOEFL-80 dataset, a question shows {(Q) enormously, (A) appropriately, (B) uniquely, (C) tremendously, (D) decidedly}, and the answer is (C). A multi-sense representation system selects the synonym of the question word $w_Q$ using the maximum sense-level cosine similarity as a proxy for semantic similarity (Jauhar et al., 2015).
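The selection rule described above can be sketched as follows, assuming each word is represented by a list of sense vectors. The function names and the toy two-dimensional vectors are hypothetical and serve only to show the maximum sense-level cosine similarity in action.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def max_sense_similarity(senses_a, senses_b):
    """Highest cosine similarity over all sense pairs of two words."""
    return max(cosine(u, v) for u in senses_a for v in senses_b)


def select_synonym(question_senses, candidates):
    """Return the candidate label whose best sense is closest to any question sense.

    candidates maps a label (e.g. "A") to that word's list of sense vectors.
    """
    return max(
        candidates,
        key=lambda label: max_sense_similarity(question_senses, candidates[label]),
    )


# Toy question with two senses and four single-sense candidates (made-up vectors).
question = [[1.0, 0.0], [0.0, 1.0]]
choices = {
    "A": [[-1.0, 0.0]],
    "B": [[0.5, 0.5]],
    "C": [[0.1, 0.9]],
    "D": [[-1.0, -1.0]],
}
answer = select_synonym(question, choices)
```

Because the maximum is taken over sense pairs, no context is needed at test time: a candidate is chosen if any one of its senses matches any one sense of the question word.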

Our model is compared with the following baselines: 1) conventional word embeddings: global context vectors (Huang et al., 2012) and skip-gram (Mikolov et al., 2013b); 2) applying supervised word sense disambiguation using the IMS system and then applying skip-gram on the disambiguated corpus (IMS+SG) (Zhong and Ng, 2010); 3) unsupervised sense embeddings: EM algorithm (Jauhar et al., 2015), multi-sense skip-gram (MSSG) (Neelakantan et al., 2014), Chinese restaurant process (CRP) (Li and Jurafsky, 2015), and the MUSE models; 4) supervised sense embeddings with WordNet (Miller, 1995): retrofitting global context vectors (Retro-GC) and retrofitting skip-gram (Retro-SG) (Jauhar et al., 2015).

Among unsupervised sense embedding approaches, CRP and MSSG refer to the baselines with the highest MaxSimC and AvgSimC in Table 1, respectively. Here we report the setting for each baseline based on the best average performance in this task. We also show the performance of supervised sense embeddings as an upper bound for unsupervised methods, due to their use of additional supervised information from WordNet.

The results are shown in Table 2, where our MUSE-ε-Greedy and MUSE-Boltzmann significantly outperform all unsupervised sense embedding methods, echoing the superior quality of our sense vectors shown in the last section. MUSE-Boltzmann also outperforms the supervised sense embeddings in all but one setting (Retro-GC on RD-300), without any supervised signal during training. Finally, the MUSE methods with proper exploration consistently outperform all unsupervised baselines, demonstrating the importance of exploration.

| Context | k-NN Senses |
|---|---|
| · · · braves finish the season in tie with the los angeles dodgers · · · | scoreless, otl, shootout, 6-6, hingis, 3-3, 7-7, 0-0 |
| · · · his later years proudly wore tie with the chinese characters for · · · | pants, trousers, shirt, juventus, blazer, socks, anfield |
| · · · of the mulberry or the blackberry and minos sent him to · · · | cranberries, maple, vaccinium, apricot, apple |
| · · · of the large number of blackberry users in the us federal · · · | smartphones, sap, microsoft, ipv6, smartphone |
| · · · shells and/or high explosive squash head hesh and/or anti-tank · · · | venter, thorax, neck, spear, millimeters, fusiform |
| · · · head was shaven to prevent head lice serious threat back then · · · | shaved, thatcher, loki, thorax, mao, luthor, chest |
| · · · appoint john pope republican as head of the new army of · · · | multi-party, appoints, unicameral, beria, appointed |

Table 3: Different word senses are selected by MUSE according to different contexts. The respective k-NN (sorted by collocation likelihood) senses are shown to indicate respective semantic meanings.

4.4 Qualitative Analysis

We further conduct qualitative analysis to check the semantic meanings of different senses learned by MUSE with k-nearest neighbors (k-NN) using sense representations. In addition, we provide contexts in the training corpus where the sense will be selected to validate the sense selection module. Table 3 shows the results. The learned sense embeddings of the words “tie”, “blackberry”, and “head” clearly correspond to correct senses under different contexts.
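A neighbor lookup of this kind can be sketched as a ranking of candidate senses by a collocation score. The sketch assumes that the score between a query sense and a candidate sense grows with the dot product of their vectors; the function name, the dictionary layout, and the placeholder labels and values are hypothetical.

```python
def k_nearest_senses(query_vec, candidate_vecs, k=3):
    """Return the k candidate labels with the highest dot-product score.

    candidate_vecs maps a sense label to its vector. The dot product stands in
    for the collocation likelihood, which is assumed to be monotone in it.
    """
    scores = {
        label: sum(a * b for a, b in zip(query_vec, vec))
        for label, vec in candidate_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Placeholder vectors; the labels s1..s4 are not real senses from the model.
query = [1.0, 0.5]
candidates = {
    "s1": [0.9, 0.4],
    "s2": [-0.3, 0.2],
    "s3": [0.5, 0.5],
    "s4": [0.0, -1.0],
}
neighbors = k_nearest_senses(query, candidates, k=2)
```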

Since we address an unsupervised setting that learns sense embeddings from an unannotated corpus, the discovered senses depend highly on the training corpus. From our manual inspection, it is common for our model to discover only two senses for a word, such as “tie” and “blackberry”. However, this work focuses on developing unsupervised learning methods for sense embeddings, and the number of discovered senses is not a focus.

5 Conclusion

This paper proposes a novel modularized framework for unsupervised sense representation learning, which supports not only the flexible design of modular tasks but also joint optimization among modules. The proposed model is the first work that implements purely sense-level representation learning with linear-time sense selection, and it achieves state-of-the-art performance on benchmark contextual word similarity and synonym selection tasks. In the future, we plan to investigate reinforcement learning methods to incorporate multi-sense word representations for downstream NLP tasks.

Acknowledgements

We would like to thank the reviewers for their insightful comments on the paper. The authors are supported by the Ministry of Science and Technology of Taiwan under the contract number 105-2218-E-002-033, Institute for Information Industry, and MediaTek Inc.

References

Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2016. Breaking sticks and ambiguities with adaptive skip-gram. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 130–138.

Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2015. Improving distributed representation of word sense via wordnet gloss composition and context clustering. Association for Computational Linguistics.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In EMNLP, pages 1025–1035. Citeseer.

Allyson Ettinger, Philip Resnik, and Marine Carpuat. 2016. Retrofitting sense-specific word vectors using parallel text. In Proceedings of NAACL-HLT, pages 1378–1383.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning sense-specific word embeddings by exploiting bilingual resources. In COLING, pages 497–507.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In Annual Meeting of the Association for Computational Linguistics (ACL).

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In ACL (1), pages 95–105.

Mario Jarmasz and Stan Szpakowicz. 2004. Roget's thesaurus and semantic similarity. Recent Advances in Natural Language Processing III: Selected Papers from RANLP, 2003:111.

Sujay Kumar Jauhar, Chris Dyer, and Eduard H Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In HLT-NAACL, pages 683–693.

Mikael Kågebäck, Fredrik Johansson, Richard Johansson, and Devdatt Dubhashi. 2015. Neural context embeddings for automatic discovery of word senses. In Proceedings of NAACL-HLT, pages 25–32.

Thomas K Landauer and Susan T Dumais. 1997. A solution to plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2):211.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding cnns for text: non-linear, non-consecutive convolutions. EMNLP.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1722–1732.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2015a. Learning context-sensitive word embeddings with neural tensor skip-gram model. In IJCAI, pages 1284–1290.

Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015b. Topical word embeddings. In AAAI, pages 2418–2424.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. NIPS Deep Learning Workshop.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. volume 14, pages 1532–1543.

Mohammad Taher Pilehvar and Nigel Collier. 2016. De-conflated semantic representations. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Lin Qiu, Kewei Tu, and Yong Yu. 2016. Context-dependent sense embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Joseph Reisinger and Raymond J Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109–117. Association for Computational Linguistics.

Sascha Rothe and Hinrich Schütze. 2015. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the ACL.

Cyrus Shaoul and Chris Westbury. 2010. The westbury lab wikipedia corpus.

Simon Šuster, Ivan Titov, and Gertjan van Noord. 2016. Bilingual learning of multi-sense embeddings with discrete autoencoders. NAACL-HLT 2016.

Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction, volume 1. MIT press Cambridge.

Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A probabilistic model for learning multi-prototype word embeddings. In COLING, pages 151–160.


Peter D Turney. 2001. Mining the web for synonyms: Pmi-ir versus lsa on toefl. In European Conference on Machine Learning, pages 491–502. Springer.

Thuy Vu and D Stott Parker. 2016. K-embeddings: Learning conceptual embeddings for words using context. In Proceedings of NAACL-HLT, pages 1262–1267.

Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83. Association for Computational Linguistics.

A Doubly Stochastic Gradient

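To derive the doubly stochastic gradient for equation (5), we first denote (5) as $J(\Theta)$ with $\Theta = \{P, Q\}$ and resolve the expectation form as:

$$J(\Theta) = \mathbb{E}_{z_{ik} \sim \pi(\cdot \mid C_t)}\left[\log L(z_{jl} \mid z_{ik})\right] = \sum_k \pi(z_{ik} \mid C_t) \log L(z_{jl} \mid z_{ik}).$$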

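Denote $\Theta = \{P, Q\}$ as the parameter set for policy $\pi$. The gradient with respect to $\Theta$ should be:

$$\begin{aligned}
\frac{\partial J(\Theta)}{\partial \Theta}
&= \frac{\partial}{\partial \Theta} \sum_k \pi(z_{ik} \mid C_t) \log L(z_{jl} \mid z_{ik}) \\
&= \sum_k \log L(z_{jl} \mid z_{ik}) \frac{\partial \pi(z_{ik} \mid C_t)}{\partial \Theta} \\
&= \sum_k \log L(z_{jl} \mid z_{ik}) \left( \frac{\partial \log \pi(z_{ik} \mid C_t)}{\partial \Theta} \right) \pi(z_{ik} \mid C_t) \\
&= \mathbb{E}_{z_{ik} \sim \pi(\cdot \mid C_t)}\left[ \log L(z_{jl} \mid z_{ik}) \frac{\partial \log \pi(z_{ik} \mid C_t)}{\partial \Theta} \right]
\end{aligned}$$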
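The identity above can be checked numerically on a toy problem. The sketch below assumes a softmax policy whose parameters are one logit per sense and treats the log likelihoods as fixed rewards that do not depend on those logits; these modeling choices, the function names, and the numbers are assumptions for illustration, not the parameterization of P and Q in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def objective(logits, rewards):
    """J = sum_k pi_k * reward_k, the expectation written out explicitly."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def score_function_gradient(logits, rewards):
    """E_k[reward_k * d log pi_k / d logits], computed exactly over all k."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for k, (p_k, r_k) in enumerate(zip(probs, rewards)):
        for j in range(len(logits)):
            # For a softmax policy, d log pi_k / d logit_j = 1[j == k] - pi_j.
            dlogp = (1.0 if j == k else 0.0) - probs[j]
            grad[j] += p_k * r_k * dlogp
    return grad


def finite_difference_gradient(logits, rewards, h=1e-6):
    """Central-difference estimate of dJ / d logits for comparison."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[j] += h
        down[j] -= h
        grad.append((objective(up, rewards) - objective(down, rewards)) / (2 * h))
    return grad


logits = [0.2, -0.5, 1.0]
rewards = [-0.1, -2.0, -0.7]  # stand-ins for log L(z_jl | z_ik) <= 0
analytic = score_function_gradient(logits, rewards)
numeric = finite_difference_gradient(logits, rewards)
```

Sampling a single $z_{ik}$ from the policy instead of summing over all $k$ gives an unbiased one-sample estimate of the same quantity.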

Accordingly, if we conduct typical stochastic gradient ascent training on $J(\Theta)$ with respect to $\Theta$ from samples $z_{ik}$ with a learning rate $\eta$, the update formula will be:

$$\Theta = \Theta + \eta \log L(z_{jl} \mid z_{ik}) \frac{\partial \log \pi(z_{ik} \mid C_t)}{\partial \Theta}.$$

However, the collocation log likelihood should always be non-positive: $\log L(z_{jl} \mid z_{ik}) \le 0$. Therefore, as long as the collocation log likelihood $\log L(z_{jl} \mid z_{ik})$ is negative, the update formula decreases the likelihood of choosing $z_{ik}$, despite the fact that $z_{ik}$ may be a good choice. On the other hand, if the log likelihood reaches 0, according to (4), it indicates:

$$\log L(z_{jl} \mid z_{ik}) = 0 \Rightarrow L(z_{jl} \mid z_{ik}) = 1 \Rightarrow U_{z_{ik}}^{T} V_{z_{jl}} \to \infty, \quad U_{z_{ik}}^{T} V_{z_{uv}} \to -\infty, \ \forall z_{uv},$$

which leads to computational overflow from an infinite value.
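The first issue can be reproduced with a toy softmax policy: because the reward is a non-positive log likelihood, a single sampled update lowers the probability of whichever sense was chosen, even when its likelihood is close to 1. The one-logit-per-sense policy, the function names, and the numbers are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, chosen, log_likelihood, lr=0.1):
    """One sampled update: logits += lr * log_likelihood * d log pi_chosen / d logits."""
    probs = softmax(logits)
    return [
        theta + lr * log_likelihood * ((1.0 if j == chosen else 0.0) - probs[j])
        for j, theta in enumerate(logits)
    ]


logits = [0.0, 0.0, 0.0]
before = softmax(logits)

# A "good" sense with likelihood near 1 still has a slightly negative reward.
after_good = softmax(policy_gradient_step(logits, chosen=0, log_likelihood=-0.01))
# A poor sense has a strongly negative reward and is pushed down harder.
after_poor = softmax(policy_gradient_step(logits, chosen=0, log_likelihood=-3.0))
```

Both updates reduce the probability of the chosen sense and differ only in magnitude, which is the behavior the paragraph above describes.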
