Word Sense Induction using Correlated Topic Model

Thanh Tung Hoang
Department of Computer Science
University of Engineering and Technology, VNU
Hanoi, Vietnam
[email protected]

Phuong Thai Nguyen
Department of Computer Science
University of Engineering and Technology, VNU
Hanoi, Vietnam
[email protected]

Abstract—Word sense induction (WSI) is the problem of automatically identifying the senses of a word from a corpus. This paper presents a method for solving the WSI problem based on the context clustering approach. The idea behind this approach is that similar contexts indicate similar meanings. Specifically, we have successfully applied the Correlated Topic Model (CTM) to partition the contexts of a word into clusters, each representing a sense of that word. Unlike some previous systems, where a single model is built for all words, in our system each word has its own model. Experimental results on the SemEval-2010 dataset show that CTM is a strong tool for modelling a word's contexts. Our system performs significantly better than all systems that participated in the SemEval-2010 workshop. Compared with other topic models used for WSI, our system can exploit additional useful information, namely the relationships between the senses of a word. The prospect of using CTM to discover correlations between the senses of multiple words is discussed at the end of this paper.

Keywords-word sense induction; correlated topic model; context clustering

I. INTRODUCTION

Word Sense Induction (WSI) and Word Sense Disambiguation (WSD) are two closely related problems which aim to identify the sense of a word given its context. While WSD systems usually follow a supervised approach, WSI systems are usually unsupervised. Supervised WSD requires annotated data and domain-specific knowledge, which are not always available for all domains and languages. WSI, on the other hand, is more flexible, as it requires neither annotated data nor specialized knowledge. WSI also addresses the problem of sense granularity by allowing the number of senses to vary according to the purpose at hand. WSI is therefore an attractive line of research alongside WSD, and it has been successfully applied to various fields of informatics, including information retrieval and machine translation.

A large number of methods have been developed to solve the WSI problem, including clustering methods and graph-based methods. Graph-based methods model the problem space as a graph whose nodes are words or sets of words and whose edges are word co-occurrences. In 2007, Klapaftis and Manandhar [7] introduced the UoY system based on a hypergraph model; the system outperformed the Most Frequent Sense (MFS) baseline on the SemEval-2007 dataset. In contrast to graph-based methods, clustering methods try to assign each occurrence of a word to the most appropriate cluster. In recent years, many works have shown that topic models are strong tools for clustering and can be applied to the WSI problem. In 2009, Brody and Lapata [4] applied latent Dirichlet allocation (LDA) [3] to the WSI problem and achieved strong results on the SemEval-2007 benchmark dataset. In 2011, Yao and Van Durme [2] showed that performance can be improved by using a Hierarchical Dirichlet Process (HDP) instead of LDA. However, because LDA and HDP assume that topics are nearly independent, these models cannot directly capture the correlation between senses. In the Correlated Topic Model (CTM) [1], on the other hand, information about topic correlation is captured in the covariance matrix of the logistic normal distribution. This information improves the predictive power of the model and can be used to form a graph representing the correlation of topics in the corpus.

Figure 1. The correlated topic model (from Blei et al. [1])

Building a WSI system based on the context clustering approach was our primary purpose. The system transforms the WSI problem into a topic modelling problem by viewing contexts as documents and word senses as topics. CTM was used to model the sense space of each ambiguous word in the dataset. Our system achieved significantly better performance than all systems that participated in the SemEval-2010 workshop [5] on two of the three evaluation methods. The system is also able to uncover the correlations between the senses of a word; this relationship can be represented graphically by a senses graph in which highly related senses are linked together.

II. WORD SENSE INDUCTION MODEL

A. The Correlated Topic Model

CTM is a probabilistic model for modelling the topics in a document collection [1]. CTM is a bag-of-words model, in which the words in a document are assumed to be exchangeable. A document in CTM exhibits all topics, each with a different proportion.


Table I. The context carries information about the sense of a word; however, there are cases where the context is not enough to determine a word's sense. This table shows examples of different senses of access occurring in similar contexts.

Example                               Sense description
He is at the access of the building   a way of leaving or entering
He gained access to the building      the act of approaching or entering
He cannot access the building         reach or gain access to

Given a model with K topics β_{1:K} and a multivariate normal distribution N(μ, Σ), an N-word document d is generated by the following generative process:

1) Draw η_d from the multivariate Gaussian distribution N(μ, Σ).
2) For each word n = 1, ..., N:
   a) Draw the topic assignment z_{d,n} from the multinomial distribution Mult(f(η_d)).
   b) Draw the word w_{d,n} from the multinomial distribution Mult(β_{z_{d,n}}).

where f(η_d) maps η to the topic proportions θ:

    θ_i = f(η)_i = exp(η_i) / Σ_j exp(η_j)    (1)
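To make the generative process concrete, the following is a minimal sketch in Python/NumPy; all values (K, the vocabulary size, μ, Σ, and the topic-word distributions β) are toy stand-ins rather than parameters from the paper.

import numpy as np

rng = np.random.default_rng(0)
K, V, N = 4, 1000, 40                       # number of topics, vocabulary size, document length
mu = np.zeros(K)                            # mean of the logistic normal (toy value)
Sigma = np.eye(K)                           # covariance; off-diagonal entries encode topic correlation
beta = rng.dirichlet(np.ones(V), size=K)    # K topic-word distributions (stand-ins)

eta = rng.multivariate_normal(mu, Sigma)    # step 1: draw eta_d ~ N(mu, Sigma)
theta = np.exp(eta) / np.exp(eta).sum()     # equation (1): map eta to topic proportions
doc = []
for n in range(N):                          # step 2: generate each word
    z = rng.choice(K, p=theta)              # 2a: topic assignment z_{d,n} ~ Mult(f(eta_d))
    doc.append(rng.choice(V, p=beta[z]))    # 2b: word w_{d,n} ~ Mult(beta_{z_{d,n}})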

B. Building the system

We decided to build an unsupervised WSI system based on the context clustering approach. This approach originates from the idea that similarity in word context indicates similarity in word sense. CTM was selected for modelling the topic space instead of other topic models because of its ability to capture the correlations between topics (senses) in the document collection. The correlations between senses help to improve the predictive power of the system, especially as the number of senses rises [1].

Our system is an unsupervised, language-independent system, which means that it does not require any domain- or language-specific knowledge or tagged data for training. It is also convenient to port our system to new languages and domains, as we only need to retrain the system on new untagged data.

To transform the WSI problem into a topic modelling problem, we view local contexts as documents and senses as topics. For each occurrence of an ambiguous word, we take the 20 words before and after that word as the local context, as sketched below. The local context contains the most valuable information, which is directly related to the target sense. In contrast, the global context may contain information that is unrelated to the target sense: due to its large size, multiple senses of a word may occur in the same global context. By using a small number of words surrounding the target word, the local context effectively mitigates this problem.
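The extraction of local contexts can be sketched as follows; whitespace tokenisation and the list-based representation are our assumptions, as the paper does not specify these details.

def local_contexts(tokens, target, window=20):
    # One pseudo-document per occurrence of the target word:
    # up to `window` words on each side, excluding the target itself.
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == target:
            contexts.append(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return contexts

# Every occurrence of "access" yields one context "document".
docs = local_contexts("he gained access to the building".split(), "access")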

For each word, there is a corresponding set of contexts and a vocabulary. The vocabulary of a word is the set of distinct words occurring in the contexts of that word. Low-frequency words and stop words, which do not carry valuable information, are removed from the vocabulary (see the sketch below). A model is built for each ambiguous word.

To examine the effect of the number of senses on the system's performance and on the correlations between senses, we varied the number of senses from 4 (which is close to the average number of senses in the testing dataset) to 14. Notably, this number is much smaller than in some other systems such as KSU-KDD [6], where it is typically 400 to 500. This is because in KSU-KDD only one model is built for all words. The KSU-KDD approach is convenient but might reduce system performance, because the contexts of different words can be very different and should be characterized by different sets of parameters. Therefore, it is necessary for each word to have its own model.
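A sketch of the per-word vocabulary construction described above; the stop-word list and the frequency threshold are illustrative assumptions, since the paper does not give concrete values.

from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # placeholder list

def build_vocabulary(contexts, min_count=5):
    # Count words over all contexts of the target word, then drop
    # stop words and low-frequency words.
    counts = Counter(word for ctx in contexts for word in ctx)
    return {w for w, c in counts.items() if c >= min_count and w not in STOP_WORDS}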

We trained models for the 100 target words in the SemEval-2010 dataset and used the resulting models to induce the senses of words. A sense in the system is a distribution over the vocabulary and is represented by its highest-probability words. To induce the senses of a word, our system takes the local context of that word as input and runs posterior inference on that context to produce an output vector representing the distribution over senses. The i-th component of the vector is the probability that the word is used in the i-th sense. Based on that vector, the most probable sense is selected as the sense of that word.
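The sense-selection step then reduces to an argmax over the inferred sense distribution; the vector below is a toy value standing in for the output of CTM posterior inference.

import numpy as np

def induce_sense(sense_proportions):
    # sense_proportions[i] = P(word is used in sense i | local context),
    # as produced by posterior inference on the context.
    return int(np.argmax(sense_proportions))

sense = induce_sense(np.array([0.10, 0.70, 0.15, 0.05]))  # selects sense 1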

Although the covariance matrix of the logistic normal distribution contains information about sense correlations, it is not directly suitable for building the senses graph: because almost no component of the matrix is exactly 0, the resulting graph would be very dense, with all nodes connected. To control the sparsity, Blei et al. used the LASSO to estimate this graph; details of the technique can be found in Blei et al. [1].
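As an illustration of the idea, a sparse graph over senses can be estimated with an L1-penalised (LASSO-style) covariance estimator such as scikit-learn's GraphicalLasso; this is a stand-in for the exact procedure of Blei et al. [1], and the η samples below are synthetic.

import numpy as np
from sklearn.covariance import GraphicalLasso

etas = np.random.default_rng(0).normal(size=(500, 14))  # synthetic inferred eta vectors
model = GraphicalLasso(alpha=0.1).fit(etas)             # L1 penalty controls sparsity

# Link senses i and j when the estimated precision matrix has a
# nonzero off-diagonal entry; isolated senses get no edges.
K = etas.shape[1]
edges = [(i, j) for i in range(K) for j in range(i + 1, K)
         if abs(model.precision_[i, j]) > 1e-6]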

III. EVALUATION METHODS

We used the data and the evaluation methods of SemEval-2010 task 14 (Word Sense Induction and Disambiguation) [5] to evaluate our system and to compare it with the other systems that participated in the SemEval-2010 workshop. The following evaluation methods assess the quality of a clustering solution by comparing it to the gold-standard senses.

A. V-measure

V-measure measures the homogeneity and completeness of a clustering solution; it is the harmonic mean of homogeneity and completeness. A better clustering solution is expected to have a higher V-measure score.
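For instance, V-measure can be computed with scikit-learn, given gold sense labels and induced cluster ids for the same instances (the labels below are toy values):

from sklearn.metrics import homogeneity_completeness_v_measure

gold = [0, 0, 1, 1, 2, 2]       # gold-standard sense labels
induced = [0, 0, 1, 2, 2, 2]    # induced cluster ids
h, c, v = homogeneity_completeness_v_measure(gold, induced)
print(f"homogeneity={h:.3f}, completeness={c:.3f}, V-measure={v:.3f}")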

B. Paired F-score

Paired F-score treats the clustering problem as a classification problem over pairs of instances; it is the harmonic mean of precision and recall. A better solution is expected to have a higher paired F-score.
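A small sketch of this pairwise view: every pair of instances sharing a cluster counts as a positive, and precision and recall are computed over induced versus gold pairs (toy labels shown).

from itertools import combinations

def same_cluster_pairs(labels):
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def paired_f_score(gold, induced):
    g, s = same_cluster_pairs(gold), same_cluster_pairs(induced)
    if not g or not s:
        return 0.0
    precision = len(g & s) / len(s)
    recall = len(g & s) / len(g)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(paired_f_score([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.4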

C. Supervised evaluation

Supervised evaluation evaluates a WSI system in a WSD setting. In this method, the testing dataset is split into a mapping corpus and an evaluation corpus. The mapping corpus is used to derive a mapping matrix that maps the induced sense space to the gold-standard sense space.

Table II. Details of the training and testing datasets (instance counts and average number of senses per word).

        Training instances   Testing instances   Avg. #senses
All          879807               8915               3.79
Nouns        716945               5285               4.46
Verbs        162862               3630               3.12

Figure 2. V-measure of the systems (our system is denoted by 'OS' followed by the number of senses)

For each testing instance, the corresponding induced sense proportion vector is multiplied by the mapping matrix to produce a gold-standard sense proportion vector. From the resulting vector, the most probable gold-standard sense is selected to label that instance. The result is then compared to the answer key.
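Applying the mapping matrix to one test instance can be sketched as follows; the matrix shape and values are illustrative, not taken from the task data.

import numpy as np

M = np.array([[0.9, 0.1],      # 3 induced senses mapped onto 2 gold senses
              [0.2, 0.8],
              [0.5, 0.5]])
induced_props = np.array([0.1, 0.7, 0.2])  # induced sense proportion vector

gold_props = induced_props @ M             # gold-standard sense proportion vector
label = int(np.argmax(gold_props))         # most probable gold sense labels the instance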

IV. RESULTS

In this section, we give the results of our system and of other systems on the SemEval-2010 dataset. The results are also compared to two baselines: Most Frequent Sense (MFS) and Random. To investigate the effect of the number of senses on system performance, we built several models with different numbers of senses.

The training and testing datasets contain 50 nouns and 50 verbs. The data are in XML form. For each word, there are approximately 9000 instances for training and another 900 for testing. Details of the datasets are shown in Table II.

A. System performance

1) V-measure: As shown in Figure 2, our system achieved a better V-measure than all other systems. Our system with 14 senses (OS14) has the highest score of 19.1%, which is 3.5% higher than Hermit, the system with the highest V-measure in the SemEval-2010 workshop. The results also suggest that the more senses our system produces, the higher its V-measure.

2) Paired F-score: In contrast to the previous evaluation method, paired F-score assigns lower scores to systems that produce larger numbers of senses. Figure 3 shows the paired F-scores of our system and of some other systems. As can be seen from the figure, systems with a higher V-measure get a lower paired F-score, and the paired F-scores of our system increase as the number of senses decreases.

Figure 3. Paired F-score of systems

Figure 4. Supervised recall of systems (80-20 split)

The Most Frequent Sense (MFS) baseline, which produces only one sense for each word, gets the highest paired F-score even though its V-measure is 0.

The two unsupervised evaluation methods thus each have their own drawbacks. Supervised evaluation partially solves the problem by mapping the induced senses to real senses, which reduces the effect of differences in the number of senses across systems.

3) Supervised evaluation: Figures 4 and 5 show the results of supervised evaluation on the 80-20 and 60-40 splits. To make the results more reliable, the task organizers provided 10 different datasets, and the final result is the average over runs on these datasets.

As in the unsupervised evaluation, we varied the number of senses to investigate its effect on system performance. The score increases as the number of senses rises from 4 to 8. This is due to the difference in domain between the training and testing corpora: when the number of senses in the system is close to that of the testing data (about 4 senses per word), the shift in domain reduces performance, and to accommodate this shift the system needs to produce more senses. However, when the number of senses continues to rise (beyond 8), the senses become too fine-grained, causing a decrease in system performance.

We also notice that the scores on the 80-20 datasets are higher than those on the 60-40 datasets. The change in the proportion of mapping and evaluation corpora affects systems differently, resulting in some changes in the system ranking.

B. Senses graph

A great advantage of CTM is that it can model the correlations between senses. However, the model can hardly find this information when the number of senses is small.


Figure 5. Supervised recall of systems (60-40 split)

Table III. Several senses of deploy (most probable words of each sense).

sense 7      sense 24      sense 17      sense 3       sense 6   sense 28
software     software      applications  people        time      army
server       applications  software      afghanistan   work      hutchinson
application  model         easy          military      day       care
user         based         key           iraq          free      afghanistan
computers    limited       media         orders        needed    son

To make this correlation clearer, we built some models with larger numbers of senses (e.g., 20 to 30 senses). We consider here the model of deploy with 30 senses and the corresponding senses graph, which is shown in Figure 6. Isolated senses were removed from the graph, and highly related senses were linked together to form clusters representing different senses of the target word. There are two major senses of deploy:

1) Deploy: (v) place troops or weapons in battle formation
2) Deploy: (v) distribute systematically or strategically

(Sense definitions are from WordNet.)

We investigate further the cluster of the 7th, 24th and 17th senses. Table III shows the most probable words of each sense. The three senses are closely related and represent the second major sense of deploy, namely the deployment of software. On the other hand, the cluster of the 3rd, 6th, 21st and 28th senses relates to the first major sense of deploy.

Figure 6. Senses graph of the word deploy

The senses graph gives an idea of how the senses are related. Different senses of a word are represented by clusters, not by individual nodes.

V. CONCLUSION AND FUTURE WORK

We have successfully applied the Correlated Topic Model to the problem of Word Sense Induction. Our system achieved high performance on the SemEval-2010 benchmark dataset. The system can capture the relationships between senses and represent them by senses graphs. We also investigated the effect of the number of induced senses on system performance.

However, building a model for each word is time consuming; this is the primary problem stopping us from applying our system to large-scale data. Improving the speed and accuracy of the system is our main focus for the near future.

Because words with similar meanings also appear in similar contexts, we can use our system to explore the semantic relationships between the senses of multiple words by building a single model for those words. This would also allow us to group words with similar meanings into clusters.

REFERENCES

[1] D. M. Blei and J. D. Lafferty (2007). A correlated topic model of Science. The Annals of Applied Statistics, Vol. 1, No. 1, pp. 17-35.

[2] Xuchen Yao and Benjamin Van Durme (2011). Nonparametric Bayesian Word Sense Induction. In Proceedings of the TextGraphs-6 Workshop, pp. 10-14.

[3] D. Blei, A. Ng, and M. Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, pp. 993-1022.

[4] Samuel Brody and Mirella Lapata (2009). Bayesian Word Sense Induction. In Proceedings of the 12th Conference of the European Chapter of the ACL, pp. 103-111.

[5] Suresh Manandhar, Ioannis P. Klapaftis, Dmitriy Dligach, and Sameer S. Pradhan (2010). SemEval-2010 Task 14: Word Sense Induction and Disambiguation.

[6] Wesam Elshamy, Doina Caragea, and William H. Hsu (2010). KSU KDD: Word Sense Induction by Clustering in Topic Space. In Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pp. 367-370.

[7] Ioannis P. Klapaftis and Suresh Manandhar (2007). UOY: A Hypergraph Model for Word Sense Induction & Disambiguation. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), ACL 2007, pp. 414-417.
