GSLT course: Statistical Methods
Term paper
A distributional or a translational basis for data-driven sense discrimination?
Anders Nøklestad, University of Oslo, [email protected]
Gunn Inger Lyse, University of Bergen, [email protected]
Received May 18th 2005
1 Introduction
How can we define senses, at least to the extent that they are useful for natural language processing (NLP) tasks such as word sense disambiguation (WSD) or machine translation (MT)? This paper aims at comparing and evaluating two different hypotheses on how to derive sense distinctions automatically from corpus data. The appeal of data-driven approaches to sense discrimination pertains to the fact that there is still no clear answer to how and where to draw the line between separate senses of a word, either from a philosophical or from a linguistic point of view (Ide et al., 2002). Looking at a word in a particular context, we as humans interpret the word sense quite effortlessly and unconsciously, even if we cannot necessarily agree on what to call this sense. This makes it natural to think that rather than starting with the definition of an a priori sense inventory, which is then to be assigned to word instances in context, we may instead seek to find criteria for grouping occurrences of a word in context, such that each group represents one sense of the word.
The distributional hypothesis assumes that the different senses of a target word may be teased out by clustering together those target word instances that display similar contextual properties. For clustering, we applied the SenseClusters software package, developed by Ted Pedersen et al.1 (section 2.1). The translational hypothesis is based on the notion that translations may be seen as the product of having interpreted the meaning of the source language text (Dyvik, 2003). Furthermore, the different senses of a word tend to be lexicalised differently in some other language (Resnik & Yarowsky, 1997); at least this tends to be true for semantically unrelated senses. For translation-based sense discrimination, we applied the Mirrors method, developed by Helge Dyvik (Dyvik, 1998a, 1998b, 2003) (section 2.2).
1 SenseClusters can be downloaded from http://www.d.umn.edu/~tpederse/senseclusters.html
Since the translational hypothesis requires parallel data, we applied the English-Norwegian Parallel Corpus (ENPC)2, developed at the University of Oslo, which has been automatically word-aligned by Sindre Sørensen at Aksis, Bergen. For the distributional hypothesis, we exploited only information from the Norwegian side of the ENPC. The ENPC contains approximately 2.6 million words (counting both language sides). It has already been asserted (Dyvik, 2003) that the Mirrors method is vulnerable to sparse data, as is the case when using the relatively small ENPC corpus. For this reason it becomes particularly interesting to see how far another data-driven approach, namely the distributional hypothesis, may take us given the same amount of corpus data.
It is not easy to evaluate sense inventories, however, since there does not really exist any ‘gold standard’ for comparison. We compare the resulting sense inventories manually on the basis of the sense distinctions in a common dictionary entry for the same lexical sample. We used the Norwegian dictionary Bokmålsordboka3 (University of Oslo). As will be seen, the clustering approach seems to be more vulnerable to sparse data than the Mirrors method. Furthermore, as a source of knowledge for automatic sense annotation of corpora, the clustering method is less accurate, whereas the Mirrors method can annotate only approximately half of the available material.
2 Two data-driven approaches
2.1 The distributional hypothesis
“You shall know a word by the company it keeps.” Firth (1968)
The distributional hypothesis essentially exploits the “one sense per collocation” property (Yarowsky, 1993), meaning that differing senses do not normally have the same contextual properties. Hence, senses may be discriminated by grouping (clustering) contexts that are similar.
SenseClusters is a freely available software package for clustering any kind of textual data. It has been used for general word sense discrimination (Purandare and Pedersen, 2004) as well as for named entity recognition (Pedersen et al., 2005, who refer to this task as “proper name discrimination”). SenseClusters is used for clustering contexts. In the task of word sense discrimination, all contexts should contain the particular word whose senses will be clustered (the target word), although the tool can also be used for clustering contexts that do not necessarily contain any particular target word (e.g., for spam detection). SenseClusters clusters contexts based on lexical features that are extracted from the text. These features are labelled unigrams, bigrams, co-occurrences and target co-occurrences. Bigrams are defined in this context as ordered pairs of words that may have intervening elements. Co-occurrences are unordered pairs of
2 The ENPC is located at http://www.hf.uio.no/iba/prosjekt/.
3 Bokmålsordboka is located at http://www.dokpro.uio.no/ordboksoek.html
words in a context, while target co-occurrences are co-occurrences that include the target word.
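As an illustration of the four feature types just described, consider the following sketch (a hypothetical helper, not part of SenseClusters itself; the assumption that intervening elements in a bigram are limited by a small window is ours):

```python
from itertools import combinations

def extract_features(context, target, window=2):
    """Sketch of the SenseClusters feature types: unigrams, bigrams
    (ordered pairs with possible intervening words), co-occurrences
    (unordered pairs), and target co-occurrences."""
    tokens = context.split()
    unigrams = set(tokens)
    # Bigrams: ordered pairs allowing up to `window` - 1 intervening tokens
    # (the window size is an assumed parameter for illustration).
    bigrams = set()
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + 1 + window]:
            bigrams.add((w1, w2))
    # Co-occurrences: unordered pairs of distinct words in the context.
    cooccurrences = {frozenset(p) for p in combinations(tokens, 2)
                     if len(set(p)) == 2}
    # Target co-occurrences: co-occurrences that include the target word.
    target_cooc = {p for p in cooccurrences if target in p}
    return unigrams, bigrams, cooccurrences, target_cooc
```

For the context fikk tak i boken with target tak, this yields the ordered pair (fikk, tak) as a bigram and the unordered pair {tak, boken} as a target co-occurrence.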
SenseClusters is a very flexible tool that allows the experimenter to set a wide range of parameters defining the way that features are extracted and the clustering is performed. For some of these parameters, we adopted the settings used by Kulkarni and Pedersen (2005): only unigrams, bigrams and co-occurrences that occur at least five times in the data were included. For bigrams and co-occurrences, we applied the additional requirement that the log-likelihood ratio should be high enough that there is a 95% certainty that the words are not independent. Bigrams, unigrams and co-occurrences that include stop words were excluded; stop words were defined to be prepositions, conjunctions, pronouns, and the infinitival marker å “to”.
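The combined frequency cut-off and 95% significance filter can be sketched as follows (our own illustration of the standard log-likelihood test, not the SenseClusters implementation; the 95% certainty level corresponds to a chi-square critical value of 3.841 at one degree of freedom):

```python
import math

# Chi-square critical value, 1 degree of freedom, p = 0.05: a pair whose
# G^2 exceeds this is dependent with 95% certainty.
CHI2_95_1DF = 3.841

def log_likelihood_ratio(n11, n12, n21, n22):
    """G^2 statistic over the 2x2 contingency table of a word pair:
    n11 = both words together, n12/n21 = one without the other,
    n22 = neither word."""
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    g2 = 0.0
    for obs, exp in ((n11, row1 * col1 / n), (n12, row1 * col2 / n),
                     (n21, row2 * col1 / n), (n22, row2 * col2 / n)):
        if obs > 0:
            g2 += 2 * obs * math.log(obs / exp)
    return g2

def keep_pair(n11, n12, n21, n22, min_count=5):
    # The paper's two filters: at least five occurrences, plus 95%
    # certainty that the two words are not independent.
    return n11 >= min_count and log_likelihood_ratio(n11, n12, n21, n22) > CHI2_95_1DF
```

A pair that occurs exactly as often as chance predicts gets a G^2 of zero and is filtered out, no matter how frequent it is.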
Words can be clustered based on first-order or second-order associations. In either case, the matrices may optionally be reduced using Singular Value Decomposition (SVD). We used the standard setting of SenseClusters, which is to cluster based on second-order associations and without SVD. The actual clustering is done using a method known as Repeated Bisections. This method produces almost as good results as agglomerative clustering, but with the speed of partitional clustering (Kulkarni and Pedersen, 2005).
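The control strategy behind Repeated Bisections can be sketched as follows (a simplified illustration of the idea, not the actual CLUTO/SenseClusters code; the two-way clustering routine `bisect` and the largest-cluster selection rule are assumptions for the sketch):

```python
def repeated_bisections(vectors, k, bisect):
    """Partition `vectors` into k clusters by repeatedly splitting the
    largest cluster in two with the supplied 2-way clustering function
    `bisect`, which takes a list of vectors and returns two lists of
    local indices (e.g. a 2-means step)."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        # Pick the largest remaining cluster and bisect it.
        clusters.sort(key=len, reverse=True)
        big = clusters.pop(0)
        left, right = bisect([vectors[i] for i in big])
        clusters.append([big[i] for i in left])
        clusters.append([big[i] for i in right])
    return clusters
```

Each step is a cheap partitional (2-way) clustering, which is why the method approaches agglomerative quality at partitional speed: only k - 1 bisections are ever performed.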
With unsupervised clustering approaches it is often hard to determine how the resulting clusters should be characterised in terms of semantic categories. The creators of SenseClusters aim at alleviating this problem by providing ways of extracting descriptive and discriminating labels for the clusters. The descriptive labels of a cluster are the top N bigrams in the cluster, according to the log-likelihood ratio. Its discriminating labels are those descriptive labels that are not descriptive labels of any other cluster. Thus, the discriminating labels of a cluster might be seen as expressing information that sets the cluster apart from other clusters. Unfortunately, in our experiments the sets of discriminating labels turned out to contain a huge number of bigrams, making them effectively useless.
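The relation between the two label types is a simple set difference, which can be made explicit as follows (our own sketch; the input is assumed to be each cluster's top-N bigrams, already ranked by log-likelihood ratio):

```python
def cluster_labels(top_bigrams_per_cluster):
    """Descriptive labels: the top-N bigrams of each cluster.
    Discriminating labels: those descriptive labels that are not
    descriptive labels of any other cluster."""
    descriptive = {c: set(bigrams)
                   for c, bigrams in top_bigrams_per_cluster.items()}
    discriminating = {}
    for c, labels in descriptive.items():
        # Everything that also describes some other cluster is removed.
        others = set().union(*(v for k, v in descriptive.items() if k != c))
        discriminating[c] = labels - others
    return descriptive, discriminating
```

If two clusters share most of their top bigrams, the discriminating sets shrink; conversely, when the top-bigram lists barely overlap (as in our experiments), nearly every descriptive label survives as "discriminating", which is why the sets became uselessly large.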
2.2 The translational hypothesis
“Meaning is manifested in the relation between languages.” Dyvik (1998)
The Mirrors method has been developed by Helge Dyvik (Dyvik, 1998a, 1998b, 2003) within the research project “From Parallel Corpus to Wordnet” at the University of Bergen (2001-2005)4. It is based on the hypothesis that the translational relation between languages may be viewed as a theoretical primitive, which may serve as the basis for deriving the sense distinctions of a word as well as the semantic relatedness between word senses. (Semantic relatedness will not be pursued here; somewhat simplified, we may for instance say that if both roof and ceiling point back to the same sense of tak (English
4 A web interface to the Mirrors method is located at http://ling.uib.no/~helge/mirrwebguide.html
“ceiling/roof” or “grip/hold”), then we assume roof and ceiling to be semantically related.)
In order to separate the senses of a target word TW, the Mirrors algorithm takes the set of translations of the TW from a parallel corpus as input, and “clusters” the translations into semantically similar groups on the basis of their translational overlap. For example, based on a manual extraction of translations from the ENPC, the noun tak was found to have the set of translations given in (1) below.
(1) tak {ceiling, cover, grip, hold, roof}
The Mirrors method assumes that we do not normally expect a contrastively ambiguous word to share its ambiguity with any other words, neither within the same language nor in another language. Hence, we do not expect semantically unrelated translational correspondences, such as roof vs. grip in (1), to have any other correspondents in common except tak itself. If two (or more) words have a translational overlap through other words than tak itself, on the other hand, we take this to be a non-coincidental indicator that they are semantically related. (For instance, grip and hold overlapped through Norwegian grep “grip” in addition to tak itself.) By this criterion, example (1) yielded the sense inventory given in (2) below.
(2) tak {ceiling roof} {cover} {grip hold}
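The overlap criterion behind the grouping in (2) can be sketched as follows. This is a simplified illustration of the criterion, not Dyvik's full algorithm: the back-translation sets below are hypothetical except for grep, which is taken from the text, and the transitive grouping via union-find is our own rendering.

```python
def mirror_sense_groups(target, back_translations):
    """Partition the translations of `target` into sense groups: two
    translations belong to the same group iff their back-translation
    sets overlap in some word other than `target` itself (applied
    transitively). `back_translations[t]` is the set of source-language
    correspondents of translation t."""
    words = sorted(back_translations)
    parent = {w: w for w in words}  # union-find structure

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            # Overlap through any word other than the target itself.
            if (back_translations[a] & back_translations[b]) - {target}:
                parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), set()).add(w)
    return sorted(sorted(g) for g in groups.values())
```

With grip and hold sharing grep, and roof and ceiling sharing some other common correspondent, the five translations in (1) fall into the three groups of (2), while cover remains a singleton.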
Recently, the ENPC has been automatically word-aligned, making it possible to do sense discrimination based on automatically extracted sets of translations. In the remainder of this report, we shall be using the latter kind of data. Based on the set of translations that was extracted from an automatic word alignment of the ENPC, the noun tak was found to have the sense inventory given in (3) below. As can be seen, this Mirrors output reflects the preliminary status of the automatic word alignment, which has an estimated precision of .84 and an estimated recall of .62. (For instance, it is hard to imagine a context in which tak could plausibly translate to tail.) The sense groups that stem from word alignment errors (by manual verification) are listed on the last line.
(3) tak {roof} {ceiling} {grasp hold} {sky} {top} {stroke}
    word alignment errors: {back} {mass} {pot} {tail} {lake Lake}
It is obvious that the Mirrors method is vulnerable to sparse data, since a small corpus such as the ENPC is bound to contain translational ‘gaps’, resulting in less translational overlap (grouping of senses) than we intuitively know would be possible. We see for instance that with automatic word alignment in (3), roof and ceiling were not grouped together as they were in (2), because they did not have enough translational overlap. This probably pertains to the fact that with automatic word alignment, we are left with the subset of the ENPC that was successfully word-aligned (cf. the estimated recall of 62% above).
3 Deriving senses automatically from the ENPC
Section 3.1 motivates the choice of test words; section 3.2 outlines the sense inventories derived through clustering and the Mirrors method, respectively.

3.1 Lexical sample choice
We focussed on two Norwegian nouns, tak and plan. The former word has two main senses according to Bokmålsordboka, which may be translated into English as (i) grip/hold and (ii) roof/ceiling. The most characteristic occurrence of sense (i) is as part of the phrase få tak i (“get hold of”). This phrase is likely to constitute a very distinctive context for sense (i) under the distributional hypothesis, given that the system focuses on a local context of ±5 words. Other occurrences are for instance ta et fast tak rundt henne “get a firm grip around her”, svømme noen tak “swim a few strokes”, and røre med faste tak “stir with firm strokes”. Sense (ii) relates to the notion of some kind of shield (for instance the roof of a building) or metaphorically to an upper bound. The ENPC contains 385 occurrences of this word, of which 199 represent sense (i) and 186 represent sense (ii). Since the distribution between the two senses is almost perfectly even in the corpus, and the overall number of occurrences is relatively high, we considered tak to be well suited for a “pilot study”, as we will not need to consider whether an imbalance between senses influenced the sense discrimination.
The noun plan has two main senses according to Bokmålsordboka, (i) intention/plan/schedule and (ii) level (e.g., lese en historie på to plan “reading a story at two levels” or et vannrett plan “a horizontal level”). This word occurs only 178 times in the ENPC, that is, with less than half as many occurrences as tak; 143 instances represent sense (i), and 28 represent sense (ii). (Additionally, 7 instances analyzed as the noun plan in the ENPC were found to actually represent the Norwegian noun planet “planet”, that is, 7 instances were the result of erroneous stemming. These were simply given an ERROR tag in our manual verification.) The noun plan was chosen in order to compare how the two sense discrimination hypotheses manage with fewer data. Whereas the former approach is purely statistics-based, the latter does need an overlap between translations, but does not need any such translation overlap to occur more than once in the entire corpus. Therefore, one might in principle expect the clustering algorithm to be more vulnerable than the Mirrors method with less data.
3.2 Resulting sense inventory with clustering vs. the Mirrors method
3.2.1 The distributional clustering sense inventory
The SenseClusters algorithm allows us to choose how many clusters (= senses) we want, whereas the Mirrors method derives a final set of sense distinctions. As our ‘gold standard’ sense inventory from Bokmålsordboka enumerates two main senses for both test nouns, we set the number of distributional clusters to two. This is also motivated by the fact that the main sense distinctions of the two test nouns in Bokmålsordboka have been made according to the etymological origin of the words, i.e., separating the two historically unrelated senses of each test noun. This fits well with the Mirrors method, which is expected to separate at least semantically unrelated senses.
3.2.2 The Mirrors method sense inventory
The sense divisions of the Mirrors method to be used in our experiments for the noun tak are given in (3) in section 2.2, and were briefly commented on in that section. Comparing Bokmålsordboka’s sense distinctions for tak against the two sense inventories produced by the Mirrors method, we see that the Mirrors method displays a tendency to generate more senses than what is desirable. Only one sense group is not a singleton (grasp and hold having been successfully grouped together), and both main senses according to Bokmålsordboka are represented by more than one sense group in the Mirrors output. It is, nonetheless, encouraging that the Mirrors method does not seem to fail at separating the main distinctions, even if too many distinctions are made.
With the test noun plan, the Mirrors method yielded the following sense inventory, where the singleton sense groups listed on the last line are “senses” that stem from erroneous word alignment.
(4) plan {programme project schedule scheme} {level plane} {plan} {action} {planning} {design}
    word alignment errors: {stand} {pace} {fanfare}
As with tak, the Mirrors method successfully kept translations that point to different main senses apart, but generated five senses that (intuitively) encapsulate the one sense of programme/plan.
4 Evaluation
The main question of this section may be formulated as follows: what are the strengths and weaknesses of the two hypotheses, given the same lexical sample and the same corpus resource?
4.1 The assignment of automatically derived senses to corpus instances
Clustering means, in effect, that all target word occurrences in the corpus are “sense-tagged”, since each target word occurrence from the corpus is a member of a cluster (representing one sense). Hence, the clustering output may be compared directly against the manually tagged ‘gold standard’, in which all target word instances in the ENPC are tagged according to the sense distinctions of Bokmålsordboka.
The results for clustering are summed up in table 1 below. The better accuracy for tak in comparison to plan seems to confirm, not surprisingly, that a clustering technique is vulnerable to sparse data.
Table 1. Clustering: assignment of senses to target word instances in the corpus. Accuracy is measured against a manual ‘gold standard’ sense assignment.

Target word TW    Instances in the ENPC    Accuracy
tak               385                      296/385 (76.9%)
plan              178                      88/171 (51.5%)
The Mirrors method does not in itself assign senses to occurrences of the target word in context. An algorithm has been written (Lyse, 2003) for sense-tagging the corpus using the Mirrors senses. This sense-tagging method assigns a sense to the target word in context, using the situated translation as a “sense indicator”. Using for instance the sense inventory in (4), all occurrences of Norwegian plan that were translated as one of the members of the first sense group may be sense-tagged as ‘programme’. This sense-tagging method is generally accurate, but depends on an identifiable translation. As an example, the phrase ha en plan om å… “have a plan to…” may sometimes simply be translated through the (intentional) verb “will”. In such cases of rewriting, the occurrence of plan will not have an identifiable correspondent matching the Mirrors sense inventory.
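The core of this translation-based tagging procedure can be sketched as follows (a simplified illustration of the idea in Lyse (2003), not its actual code; the data structures and the choice of the alphabetically first group member as the tag label are assumptions):

```python
def sense_tag(instances, sense_groups):
    """Tag each target-word instance with the sense group that contains
    its aligned translation; instances without an identifiable
    correspondent in any group remain untagged. `instances` is a list of
    (instance_id, translation-or-None) pairs; `sense_groups` is a list
    of sets of translations, as in the Mirrors sense inventory."""
    tagged, untagged = {}, []
    for inst_id, translation in instances:
        for group in sense_groups:
            if translation in group:
                # Label the instance by its group (here: first member).
                tagged[inst_id] = sorted(group)[0]
                break
        else:
            # No identifiable correspondent (e.g. free rewriting).
            untagged.append(inst_id)
    return tagged, untagged
```

The untagged remainder is exactly what the coverage figures in table 3 measure: instances whose translation either could not be identified or matches no sense group.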
Since senses occurring less than ten times in the material are not expected to be informative for any practical usage, we chose to weed out such senses. The senses with ten or more training instances are given in table 2 below.
Table 2. Mirrors sense inventory counting only senses with ten or more corpus instances
Target word    Sense inventory                          Number of corpus instances
tak            {roof}                                   93
               {ceiling}                                75
               {grasp hold}                             18
plan           {plan}                                   93
               {programme project schedule scheme}      14
               {level}                                  19
The automatic sense-tagging results using the sense inventory in table 2 above are summed up in table 3 below. As can be seen in table 3, a weakness of a translation-based assignment of senses to corpus instances is that the method is only applicable when the target word instance has an identifiable translational correspondent. A human translator may frequently choose phrasings that are not word-by-word. On the other hand, using translations as “sense indicators” has a considerably higher precision level than the clustering approach (cf. table 1). For each of the two test words, one erroneous sense tag was identified by manual verification. These errors occurred because the sense-tagging algorithm goes through the entire corresponding sentence looking for a match against the sense groups of the target word5. The good accuracies indicate that with larger amounts of parallel corpora, a translation-based sense-tagging procedure would represent a realistic alternative to manual sense-tagging of corpora (provided, of course, that the Mirrors senses make sense in the first place).
Table 3. Translation-based sense-tagging. Coverage measures how many instances (of the total) could be sense-tagged; accuracy measures how many instances in the sense-tagged material were assigned a correct tag by manual verification.

Target word    TW instances in the ENPC    Coverage           Accuracy
tak            385                         186/385 (48.3%)    185/186 (99.5%)
plan           178                         126/178 (70.8%)    125/126 (99.2%)
5 Conclusion
This paper was aimed at the comparison and evaluation of two different hypotheses on how to derive sense distinctions automatically from corpus data. The
5 With the recent availability of an automatic word alignment of the ENPC, we do not really need to go through the entire corresponding sentence, but can instead access the automatically aligned correspondent of the target word directly. In this assignment, however, we used sense-tagging based on the entire corresponding sentence.
experimental framework was to test the two data-driven approaches on the same lexical sample and using the same corpus resource.
In relation to the resulting sense inventory per se, it is hard to compare the two approaches directly, since it is difficult to characterize the ‘contents’ of the clustering output, that is, what each cluster represents semantically. Whereas the Mirrors method produces a final set of translationally motivated sense distinctions, clustering leaves it to the user to decide how many sense distinctions are wanted. For this reason, we may say that clustering is a more flexible tool than the Mirrors method. On the other hand, the Mirrors method yields perhaps clearer results if we want to look at the sense distinctions as such.
For the purpose of annotating a corpus with the data-driven sense inventory, it should be noted that the Mirrors method does not in itself do this. Applying the translation-based sense-tagging procedure described above, however, the results show that with translations as our knowledge source we may only tag a subset of the corpus. With clustering, on the other hand, the entire corpus will be annotated. In return, the results show that the use of translations as a ‘sense indicator’ is considerably more precise than clustering. Also, the annotation results show that the clustering algorithm seems more vulnerable to sparse data than the Mirrors method. Then again, since clustering does not depend on parallel data, it should not be hard to find larger corpus resources than the ENPC.
References
Dyvik, H. (1998a): “A translational basis for semantics”. In Stig Johansson and Signe Oksefjell (eds.): Corpora and Cross-linguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi. pp. 51-86.
Dyvik, H. (1998b): “Translations as semantic mirrors”. Proceedings of Workshop W13: Multilinguality in the Lexicon II, the 13th Biennial European Conference on Artificial Intelligence (ECAI 98), Brighton, UK. pp. 24-44.
Dyvik, H. (2003): “Translations as a Semantic Knowledge Source”. [Draft, 2003] http://www.hf.uib.no/i/LiLi/SLF/ans/Dyvik/transknow.pdf
Ide, N., T. Erjavec and D. Tufis (2002): “Sense Discrimination with Parallel Corpora”. Proceedings of the SIGLEX/SENSEVAL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, July 2002. Association for Computational Linguistics. pp. 54-60. http://acl.ldc.upenn.edu/acl2002/WSD/pdfs/WSD008.pdf
Kulkarni, A. and T. Pedersen (2005): “SenseClusters: Unsupervised Clustering and Labeling of Similar Contexts”. Proceedings of the Demonstration and Interactive Poster Session of the 43rd Annual Meeting of the Association for Computational Linguistics, June 26, 2005, Ann Arbor, MI.
Lyse, G. I. (2003): Fra speilmetoden til automatisk ekstrahering av et betydningstagget korpus for WSD-formål. Master’s thesis, Section for Linguistic Studies, University of Bergen.
Pedersen, T., A. Purandare, and A. Kulkarni (2005): “Name Discrimination by Clustering Similar Contexts”. Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics, February 13-19, 2005, Mexico City.
Purandare, A. and T. Pedersen (2004): “Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces”. Proceedings of the Conference on Computational Natural Language Learning (CoNLL), May 6-7, 2004, Boston, MA.
Yarowsky, D. (1993): “One sense per collocation”. Proceedings of the ARPA Human Language Technology Workshop, Princeton, NJ. pp. 266-271.