

GSLT course: Statistical Methods Term paper

A distributional or a translational basis for data-driven sense discrimination?

Anders Nøklestad, University of Oslo, [email protected]
Gunn Inger Lyse, University of Bergen, [email protected]

Received May 18th 2005 

1 Introduction

How can we define senses, at least to the extent that they are useful for natural language processing (NLP) tasks such as word sense disambiguation (WSD) or machine translation (MT)? This paper aims at comparing and evaluating two different hypotheses on how to derive sense distinctions automatically from corpus data. The appeal of data-driven approaches to sense discrimination pertains to the fact that there is still no clear answer to how and where to draw the line between separate senses of a word, whether from a philosophical or from a linguistic point of view (Ide et al., 2002). Looking at a word in a particular context, we as humans interpret the word sense quite effortlessly and unconsciously, even if we cannot necessarily agree on what to call this sense. This makes it natural to think that, rather than starting with the definition of an a priori sense inventory, which is then to be assigned to word instances in context, we may instead seek to find criteria for grouping occurrences of a word in context, such that each group represents one sense of the word.

The distributional hypothesis assumes that the different senses of a target word may be teased out by clustering together those target word instances that display similar contextual properties. For clustering, we applied the SenseClusters software package, developed by Ted Pedersen et al.¹ (section 2.1). The translational hypothesis is based on the notion that translations may be seen as the product of having interpreted the meaning of the source language text (Dyvik, 2003). Furthermore, the different senses of a word tend to be lexicalised differently in some other language (Resnik & Yarowsky, 1997); at least this tends to be true for semantically unrelated senses. For translation-based sense discrimination, we applied the Mirrors method, developed by Helge Dyvik (Dyvik, 1998a-b, 2003) (section 2.2).

1 SenseClusters can be downloaded from http://www.d.umn.edu/~tpederse/senseclusters.html


Since the translational hypothesis requires parallel data, we used the English-Norwegian Parallel Corpus (ENPC)², developed at the University of Oslo, which has been automatically word-aligned by Sindre Sørensen at Aksis, Bergen. The distributional hypothesis exploited only information from the Norwegian side of the ENPC. The ENPC contains approximately 2.6 million words (counting both language sides). It has already been asserted (Dyvik, 2003) that the Mirrors method is vulnerable to sparse data, as is the case when using the relatively small ENPC. For this reason it becomes particularly interesting to see how far another data-driven approach, namely the distributional hypothesis, may take us given the same amount of corpus data.

It is not easy to evaluate sense inventories, however, since there is no real ‘gold standard’ for comparison. We compare the resulting sense inventories manually on the basis of the sense distinctions in a common dictionary entry for the same lexical sample. We used the Norwegian dictionary Bokmålsordboka³ (University of Oslo). As will be seen, the clustering approach seems to be more vulnerable to sparse data than the Mirrors method. Furthermore, as a source of knowledge for automatic sense annotation of corpora, the clustering method is less accurate, but the Mirrors method can annotate only approximately half of the available material.

2 Two data-driven approaches

2.1 The distributional hypothesis

“You shall know a word by the company it keeps”. Firth (1968)

The distributional hypothesis essentially exploits the “one sense per collocation” property (Yarowsky, 1993), meaning that differing senses do not normally have the same contextual properties. Hence, senses may be discriminated by grouping (clustering) contexts that are similar.

SenseClusters is a freely available software package for clustering any kind of textual data. It has been used for general word sense discrimination (Purandare and Pedersen, 2004) as well as for named entity recognition (Pedersen et al., 2005, who refer to this task as “proper name discrimination”). SenseClusters is used for clustering contexts. In the task of word sense discrimination, all contexts should contain the particular word whose senses will be clustered (the target word), although the tool can also be used for clustering contexts that do not necessarily contain any particular target word (e.g., for spam detection). SenseClusters clusters contexts based on lexical features that are extracted from the text. These features are labelled unigrams, bigrams, co-occurrences and target co-occurrences. Bigrams are defined in this context as ordered pairs of words that may have intervening elements. Co-occurrences are unordered pairs of words in a context, while target co-occurrences are co-occurrences that include the target word.

2 The ENPC is located at http://www.hf.uio.no/iba/prosjekt/.
3 Bokmålsordboka is located at http://www.dokpro.uio.no/ordboksoek.html

SenseClusters is a very flexible tool that allows the experimenter to set a wide range of parameters defining the way that features are extracted and the clustering is performed. For our experiments, we adopted the parameter settings used by Kulkarni and Pedersen (2005) for some of these parameters: only unigrams, bigrams and co-occurrences that occur at least five times in the data were included. For bigrams and co-occurrences, we applied the additional requirement that the log-likelihood ratio should be high enough to reject, at the 95% level, the hypothesis that the two words are independent. Unigrams, bigrams and co-occurrences that include stop words were excluded; stop words were defined to be prepositions, conjunctions, pronouns, and the infinitival marker å “to”.
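The two filters described above can be illustrated with a small sketch. It is not the SenseClusters implementation: it assumes the standard log-likelihood ratio (G²) over a 2x2 contingency table and the chi-square critical value 3.841 (one degree of freedom, p = 0.05) as the reading of "95% certainty". The function names and the example counts are hypothetical.

```python
import math

# Assumption: "95% certainty" is read as the chi-square critical value
# for 1 degree of freedom at p = 0.05.
CHI2_95_1DF = 3.841
MIN_FREQ = 5  # frequency cut-off quoted in the text


def log_likelihood_ratio(n11, n1p, np1, npp):
    """G^2 statistic for a word pair (w1, w2).

    n11: number of times the pair (w1, w2) occurs
    n1p: number of pairs with w1 as first member
    np1: number of pairs with w2 as second member
    npp: total number of pairs in the data
    """
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    g2 = 0.0
    for o, e in zip(observed, expected):
        if o > 0:  # a zero cell contributes nothing to the sum
            g2 += o * math.log(o / e)
    return 2.0 * g2


def keep_pair(n11, n1p, np1, npp):
    """Apply both filters: minimum frequency and significant association."""
    return n11 >= MIN_FREQ and log_likelihood_ratio(n11, n1p, np1, npp) > CHI2_95_1DF


# Hypothetical counts: a strongly associated pair passes, while a pair that
# occurs exactly as often as chance predicts does not.
print(keep_pair(50, 60, 60, 1000))
print(keep_pair(10, 100, 100, 1000))
```

A pair must pass both tests, so a rare pair is dropped even when its association score is high.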

Words can be clustered based on first-order or second-order associations. In either case, the matrices may optionally be reduced using Singular Value Decomposition (SVD). We used the standard setting of SenseClusters, which is to cluster based on second-order associations and without SVD. The actual clustering is done using a method known as Repeated Bisections. This method produces almost as good results as agglomerative clustering, but with the speed of partitional clustering (Kulkarni and Pedersen, 2005).
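The difference between first-order and second-order associations can be made concrete with a toy sketch. The paper does not spell out how SenseClusters builds its second-order representation; the sketch assumes one common construction, in which each context is represented by the average of the co-occurrence vectors of its words. All function names and the miniature corpus are hypothetical, and the clustering step itself (Repeated Bisections) is not shown.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(contexts, window=5):
    """First-order vectors: for each word, counts of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in contexts:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def second_order_vector(tokens, vectors, target):
    """Average the first-order vectors of the context words, leaving out the target word."""
    words = [w for w in tokens if w != target and w in vectors]
    total = Counter()
    for w in words:
        total.update(vectors[w])
    return {k: v / len(words) for k, v in total.items()} if words else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy corpus (hypothetical): "rødt tak hus" and "flatt tak hus".
corpus = [["rødt", "tak", "hus"], ["flatt", "tak", "hus"]]
vectors = cooccurrence_vectors(corpus)

# These two contexts share no word apart from the target itself, so a
# first-order comparison finds no overlap between them.
a = second_order_vector(["rødt", "tak"], vectors, target="tak")
b = second_order_vector(["flatt", "tak"], vectors, target="tak")
print(cosine(a, b))
```

In this toy case the two contexts come out as similar because rødt and flatt occur with the same words elsewhere in the corpus, which is the intuition behind clustering on second-order associations.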

With unsupervised clustering approaches it is often hard to determine how the resulting clusters should be characterised in terms of semantic categories. The creators of SenseClusters aim at alleviating this problem by providing ways of extracting descriptive and discriminating labels for the clusters. The descriptive labels of a cluster are the top N bigrams in the cluster, according to the log-likelihood ratio. Its discriminating labels are those descriptive labels that are not descriptive labels of any other cluster. Thus, the discriminating labels of a cluster might be seen as expressing information that sets the cluster apart from other clusters. Unfortunately, in our experiments the sets of discriminating labels turned out to contain a very large number of bigrams, making them effectively useless.
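The relation between the two kinds of labels amounts to a ranking followed by a set difference, as the following sketch shows. It follows only the definitions given above; the function names, the bigrams and their scores are invented for illustration and are not SenseClusters output.

```python
def descriptive_labels(scored_bigrams, n):
    """Top-n bigrams of one cluster, ranked by log-likelihood score (highest first)."""
    ranked = sorted(scored_bigrams.items(), key=lambda kv: (-kv[1], kv[0]))
    return [bigram for bigram, _ in ranked[:n]]


def discriminating_labels(cluster_scores, n):
    """Descriptive labels of each cluster that are not descriptive labels of any other cluster."""
    descriptive = {c: descriptive_labels(s, n) for c, s in cluster_scores.items()}
    result = {}
    for cluster, labels in descriptive.items():
        others = set()
        for other, other_labels in descriptive.items():
            if other != cluster:
                others.update(other_labels)
        result[cluster] = [b for b in labels if b not in others]
    return result


# Hypothetical bigram scores for two clusters of the target word tak.
cluster_scores = {
    0: {("få", "tak"): 30.0, ("tak", "i"): 25.0, ("stort", "hus"): 5.0},
    1: {("tak", "på"): 20.0, ("stort", "hus"): 12.0, ("rødt", "tak"): 8.0},
}
print(discriminating_labels(cluster_scores, n=3))
```

The bigram shared by both clusters is dropped from both label sets, while everything else survives. With a large N and little overlap between clusters, the discriminating sets therefore stay almost as large as the descriptive ones.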

2.2 The translational hypothesis

“Meaning is manifested in the relation between languages”. Dyvik (1998)

The Mirrors method has been developed by Helge Dyvik (Dyvik, 1998a-b, 2003) within the research project “From Parallel Corpus to Wordnet” at the University of Bergen (2001-2005)⁴. It is based on the hypothesis that the translational relation between languages may be viewed as a theoretical primitive, which may serve as the basis for deriving the sense distinctions of a word as well as semantic relatedness between word senses. (Semantic relatedness will not be pursued here; somewhat simplified, we may for instance say that if both roof and ceiling point back to the same sense of tak (English “ceiling/roof” or “grip/hold”), then we assume roof and ceiling to be semantically related.)

4 A web interface to the Mirrors method is located at http://ling.uib.no/~helge/mirrwebguide.html

In order to separate the senses of a target word TW, the Mirrors algorithm takes the set of translations of the TW from a parallel corpus as input, and “clusters” the translations into semantically similar groups on the basis of their translational overlap. For example, based on a manual extraction of translations from the ENPC, the noun tak was found to have the set of translations given in (1) below.

(1) tak {ceiling, cover, grip, hold, roof}

The Mirrors method assumes that we do not normally expect a contrastively ambiguous word to share its ambiguity with any other word, neither within the same language nor in another language. Hence, we do not expect semantically unrelated translational correspondences, such as roof vs. grip in (1), to have any other correspondents in common except tak itself. If, on the other hand, two (or more) words have a translational overlap through other words than tak itself, we take this to be a non-coincidental indicator that they are semantically related. (For instance, grip and hold also overlap through Norwegian grep “grip”, in addition to tak itself.) By this criterion, example (1) yielded the sense inventory given in (2) below.

(2) tak {ceiling roof} {cover} {grip hold}
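The grouping criterion that leads from (1) to (2) can be sketched as follows. This is a deliberately simplified illustration, not Dyvik's full algorithm: it merges two translations whenever they share a Norwegian correspondent other than the target word, and applies this transitively. The overlap of grip and hold through grep is taken from the text; the word through which roof and ceiling overlap is not named in the paper, so the sketch uses a clearly hypothetical stand-in.

```python
def partition_translations(target, translations, back_translations):
    """Group translations of `target` that share a correspondent other than `target`.

    translations: set of target-language translations of `target`
    back_translations: maps each translation to the set of source-language
        words it corresponds to in the corpus
    """
    parent = {t: t for t in translations}

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ordered = sorted(translations)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            shared = (back_translations[a] & back_translations[b]) - {target}
            if shared:  # overlap through some word other than the target
                parent[find(a)] = find(b)

    groups = {}
    for t in ordered:
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


back_translations = {
    "grip": {"tak", "grep"},  # overlap via grep is mentioned in the text
    "hold": {"tak", "grep"},
    # Hypothetical stand-in: the paper does not name the overlapping word.
    "roof": {"tak", "HYPOTHETICAL_WORD"},
    "ceiling": {"tak", "HYPOTHETICAL_WORD"},
    "cover": {"tak"},
}
senses = partition_translations("tak", set(back_translations), back_translations)
print(senses)
```

Under these assumed correspondences the sketch reproduces the three groups in (2). If the overlap link between roof and ceiling is removed, the two fall apart into separate singleton senses.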

Recently, the ENPC has been automatically word-aligned, making it possible to do sense discrimination based on automatically extracted sets of translations. In the remainder of this report, we shall be using the latter kind of data. Based on the set of translations that was extracted from an automatic word alignment of the ENPC, the noun tak was found to have the sense inventory given in (3) below. As can be seen, this Mirrors output reflects the preliminary status of the automatic word alignment, which has an estimated precision of 0.84 and an estimated recall of 0.62. (For instance, it is hard to imagine a context in which tak could plausibly translate to tail.) The sense-groups that stem from word alignment errors (by manual verification) are listed on the last line.

(3) tak {roof} {ceiling} {grasp hold} {sky} {top} {stroke}
    word alignment errors: {back} {mass} {pot} {tail} {lake Lake}

It is obvious that the Mirrors method is vulnerable to sparse data, since a small corpus such as the ENPC is bound to contain translational ‘gaps’, resulting in less translational overlap (grouping of senses) than we intuitively know would be possible. We see for instance that with automatic word alignment in (3), roof and ceiling were not grouped together as they were in (2), because they did not have enough translational overlap. This probably pertains to the fact that with automatic word alignment, we are left with the subset of the ENPC that was successfully word-aligned (cf. the estimated recall of 62% above).

3 Deriving senses automatically from the ENPC

Section 3.1 motivates the choice of test words; section 3.2 outlines the sense inventories derived through clustering and the Mirrors method, respectively.

3.1 Lexical sample choice

We focussed on two Norwegian nouns, tak and plan. The former word has two main senses according to Bokmålsordboka, which may be translated into English as (i) grip/hold and (ii) roof/ceiling. The most characteristic occurrence of sense (i) is as part of the phrase få tak i (“get hold of”). This phrase is likely to constitute a very distinctive context for sense (i) for the distributional hypothesis, given that the system focuses on a local context of ±5 words. Other occurrences are, for instance, ta et fast tak rundt henne “get a firm grip around her”, svømme noen tak “swim a few strokes” and røre med faste tak “stir with firm strokes”. Sense (ii) relates to the notion of some kind of shield (for instance the roof of a building) or metaphorically to an upper bound. The ENPC contains 385 occurrences of this word, of which 199 represent sense (i) and 186 represent sense (ii). Since the distribution between the two senses is almost perfectly even in the corpus, and the overall number of occurrences is relatively high, we considered tak to be well suited as a “pilot study”, as we do not need to consider whether an imbalance between senses influenced the sense discrimination.

The noun plan has two main senses according to Bokmålsordboka: (i) intention/plan/schedule and (ii) level (e.g., lese en historie på to plan “reading a story at two levels” or et vannrett plan “a horizontal level”). This word occurs only 178 times in the ENPC, less than half as often as tak; 143 instances represent sense (i) and 28 represent sense (ii). (The remaining 7 instances analyzed as the noun plan in the ENPC were found to actually represent the Norwegian noun planet “planet”, that is, they were the result of erroneous stemming. These were simply given an ERROR tag in our manual verification.) The noun plan was chosen in order to compare how the two sense discrimination hypotheses manage with less data. Whereas the clustering approach is purely statistics-based, the Mirrors method does need an overlap between translations but does not need any such translation overlap to occur more than once in the entire corpus. Therefore, one might in principle expect the clustering algorithm to be more vulnerable than the Mirrors method when data are sparse.


3.2 Resulting sense inventory with clustering vs. the Mirrors method

3.2.1 The distributional clustering sense inventory

SenseClusters allows us to choose how many clusters (= senses) we want, whereas the Mirrors method derives a final set of sense distinctions. As our ‘gold standard’ sense inventory from Bokmålsordboka enumerates two main senses for both test nouns, we set the number of distributional clusters to two. This is also motivated by the fact that the main sense distinctions of the two test nouns in Bokmålsordboka have been made according to the etymological origin of the words, i.e., separating the two historically unrelated senses of each test noun. This fits well with the Mirrors method, which is expected to separate at least semantically unrelated senses.

3.2.2 The Mirrors method sense inventory

The sense divisions of the Mirrors method to be used in our experiments for the noun tak are given in (3) in section 2.2, and have been briefly commented on in that section. Comparing Bokmålsordboka’s sense distinctions for tak against the two sense inventories produced by the Mirrors method, we see that the Mirrors method displays a tendency to generate more senses than is desirable. Only one sense-group is not a singleton (grasp and hold having been successfully grouped together), and both main senses according to Bokmålsordboka are represented by more than one sense-group in the Mirrors output. It is, nonetheless, encouraging that the Mirrors method does not seem to fail at separating the main distinctions, even if too many distinctions are made.

With the test noun plan, the Mirrors method yielded the following sense inventory, where the singleton sense-groups listed on the last line are “senses” that stem from erroneous word alignment.

(4) plan {programme project schedule scheme} {level plane} {plan} {action} {planning} {design}
    word alignment errors: {stand} {pace} {fanfare}

As with tak, the Mirrors method successfully kept translations that point to different main senses apart, but generated five senses that (intuitively) encapsulate the one sense of programme/plan.


4 Evaluation 

The main question of this section may be formulated as follows: What are the strengths and weaknesses of the two hypotheses, given the same lexical sample and the same corpus resource?

4.1 The assignment of automatically derived senses to corpus instances

Clustering means, in effect, that all target word occurrences in the corpus are “sense-tagged”, since each target word occurrence from the corpus is a member of a cluster (representing one sense). Hence, the clustering output may be compared directly against the manually tagged ‘gold standard’, in which all target word instances in the ENPC are tagged according to the sense distinctions of Bokmålsordboka.

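The result for clustering is summed up in table 1 below. The better accuracy for tak in comparison to plan seems to confirm, not surprisingly, that a clustering technique is vulnerable to sparse data. (For plan, the denominator 171 presumably excludes the 7 instances that were given an ERROR tag; cf. section 3.1.)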

Table 1. Clustering: assignment of senses to target word instances in the corpus. Accuracy is measured against a manual ‘gold standard’ sense assignment.

Target word   TW instances in the ENPC   Accuracy
tak           385                        296/385 (76.9%)
plan          178                        88/171 (51.5%)

The Mirrors method does not in itself assign senses to occurrences of the target word in context. An algorithm has been written (Lyse, 2003) for sense-tagging the corpus using the Mirrors senses. This sense-tagging method assigns a sense to the target word in context using the situated translation as a “sense indicator”. Using for instance the sense inventory in (4), all occurrences of Norwegian plan that were translated as one of the members of the first sense-group may be sense-tagged as ‘programme’. This sense-tagging method is generally accurate, but depends on an identifiable translation. As an example, the phrase ha en plan om å.. “have a plan to..” may sometimes simply be translated through the (intentional) verb “will”. In such cases of rewriting, the occurrence of plan will not have an identifiable correspondent matching the Mirrors sense inventory.
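The tagging principle just described can be illustrated with a minimal sketch. It is not the algorithm of Lyse (2003), whose details are not given here. It assumes that the translated sentence is already tokenised, that matching is done on lower-cased word forms, and that the first matching sense-group wins. The sense-groups are the three from (4) that also appear in table 2; the label ‘programme’ follows the text, while the other two labels and the example sentences are hypothetical.

```python
def tag_instance(translated_tokens, sense_groups):
    """Return the label of the first sense-group with a member in the translated sentence, else None."""
    tokens = {t.lower() for t in translated_tokens}
    for label, members in sense_groups.items():
        if tokens & members:
            return label
    return None  # no identifiable correspondent: the instance stays untagged


# Three sense-groups for plan from (4); labels other than 'programme' are assumed.
PLAN_SENSES = {
    "programme": {"programme", "project", "schedule", "scheme"},
    "level": {"level", "plane"},
    "plan": {"plan"},
}

# Hypothetical English sides of aligned sentence pairs.
print(tag_instance("The scheme was approved".split(), PLAN_SENSES))
print(tag_instance("She will leave tomorrow".split(), PLAN_SENSES))
```

The second call returns None, which corresponds to the rewriting case above: the translator used “will”, so no member of any sense-group is found. Because the whole sentence is scanned, an unrelated occurrence of, say, level elsewhere in the sentence would also trigger a tag.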

Since senses occurring fewer than ten times in the material are not expected to be informative for any practical usage, we chose to weed out such senses. The senses with ten or more training instances are given in table 2 below.


Table 2. Mirrors sense inventory, counting only senses with ten or more corpus instances.

Target word   Sense inventory                        Number of corpus instances
tak           {roof}                                 93
              {ceiling}                              75
              {grasp hold}                           18
plan          {plan}                                 93
              {programme project schedule scheme}    14
              {level}                                19

The automatic sense-tagging results using the sense inventory in table 2 above are summed up in table 3 below. As can be seen in table 3, a weakness of a translation-based assignment of senses to corpus instances is that the method is only applicable when the target word instance has an identifiable translational correspondent. A human translator may frequently choose phrasings that are not word-by-word. On the other hand, using translations as “sense indicators” has a comparatively higher precision level than the clustering approach (cf. table 1). For each of the two test words, manual verification identified one erroneous sense-tag. These errors occurred because the sense-tagging algorithm goes through the entire corresponding sentence in search of a match against the sense-groups of the target word⁵. The good accuracies indicate that with larger amounts of parallel corpora, a translation-based sense-tagging procedure would represent a realistic alternative to manual sense-tagging of corpora (provided, of course, that the Mirrors senses make sense in the first place).

Table 3. Translation-based sense-tagging. Coverage measures how many instances (of the total) could be sense-tagged; accuracy measures how many instances in the sense-tagged material were assigned a correct tag according to manual verification.

Target word   TW instances in the ENPC   Coverage          Accuracy
tak           385                        186/385 (48.3%)   185/186 (99.5%)
plan          178                        126/178 (70.8%)   125/126 (99.2%)
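The two measures in table 3 are simple ratios, and the following sketch recomputes them from the counts in the table. It also derives the share of all instances that end up with a correct tag (coverage times accuracy). This last figure is not reported in the tables and is computed here only as an illustration; the function and variable names are hypothetical.

```python
def percent(numerator, denominator):
    """Ratio as a percentage, rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


def coverage_and_accuracy(instances, tagged, correct):
    """Coverage = tagged/instances, accuracy = correct/tagged, overall = correct/instances."""
    return (
        percent(tagged, instances),
        percent(correct, tagged),
        percent(correct, instances),
    )


# Counts taken from table 3.
counts = {
    "tak": {"instances": 385, "tagged": 186, "correct": 185},
    "plan": {"instances": 178, "tagged": 126, "correct": 125},
}
for word, c in counts.items():
    coverage, accuracy, overall = coverage_and_accuracy(**c)
    print(f"{word}: coverage {coverage}%, accuracy {accuracy}%, correctly tagged overall {overall}%")
```

The derived overall figure makes the trade-off explicit: translation-based tagging is almost always right when it applies, but it leaves a large part of the corpus untagged.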

5 Conclusion

This paper aimed at comparing and evaluating two different hypotheses on how to derive sense distinctions automatically from corpus data. The experimental framework was to test the two data-driven approaches on the same lexical sample and using the same corpus resource.

5 With the recent availability of automatic word alignment of the ENPC, we do not really need to go through the entire corresponding sentence, but can instead access the automatically aligned correspondent of the target word directly. In this assignment, however, we used sense-tagging based on the entire corresponding sentence.

In relation to the resulting sense inventory per se, it is hard to compare the two approaches directly, since it is difficult to characterize the ‘contents’ of the clustering output, that is, what each cluster represents semantically. Whereas the Mirrors method produces a final set of translationally motivated sense distinctions, clustering leaves it to the user to decide how many sense distinctions we want. For this reason, we may say that clustering is a more flexible tool than the Mirrors method. But on the other hand, the Mirrors method yields perhaps clearer results if we want to look at the sense distinctions as such.

For the purpose of annotating a corpus with the data-driven sense inventory, it should be noted that the Mirrors method does not in itself do this. Applying the translation-based sense-tagging procedure described, however, the results show that with translations as our knowledge source we can only tag a subset of the corpus. With clustering, on the other hand, the entire corpus will be annotated. But in return the results show that the use of translations as a ‘sense indicator’ is considerably more precise than clustering. Also, the annotation results show that the clustering algorithm seems more vulnerable to sparse data than the Mirrors method. Then again, since clustering does not depend on parallel data, it should not be hard to find larger corpus resources than the ENPC.

References

Dyvik, H. (1998a): “A translational basis for semantics”. In Stig Johansson and Signe Oksefjell (eds.): Corpora and Crosslinguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi. pp. 51-86.

Dyvik, H. (1998b): “Translations as semantic mirrors”. Proceedings of Workshop W13: Multilinguality in the lexicon II. The 13th biennial European Conference on Artificial Intelligence ECAI 98. Brighton, UK. pp. 24-44.

Dyvik, H. (2003): “Translations as a Semantic Knowledge Source”. [Draft, 2003] http://www.hf.uib.no/i/LiLi/SLF/ans/Dyvik/transknow.pdf

Ide, N., T. Erjavec and D. Tufis (2002): “Sense Discrimination with Parallel Corpora”. Proceedings of the SIGLEX/SENSEVAL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions. Philadelphia, July 2002. Association for Computational Linguistics. pp. 54-60. http://acl.ldc.upenn.edu/acl2002/WSD/pdfs/WSD008.pdf

Kulkarni, A. and T. Pedersen (2005): “SenseClusters: Unsupervised Clustering and Labeling of Similar Contexts”. Proceedings of the Demonstration and Interactive Poster Session of the 43rd Annual Meeting of the Association for Computational Linguistics. June 26, 2005, Ann Arbor, MI.

Lyse, G. I. (2003): Fra speilmetoden til automatisk ekstrahering av et betydningstagget korpus for WSD-formål [From the Mirrors method to automatic extraction of a sense-tagged corpus for WSD purposes]. Master’s thesis, Section for linguistic studies, University of Bergen.

Pedersen, T., A. Purandare and A. Kulkarni (2005): “Name Discrimination by Clustering Similar Contexts”. Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics. February 13-19, 2005, Mexico City.

Purandare, A. and T. Pedersen (2004): “Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces”. Proceedings of the Conference on Computational Natural Language Learning (CoNLL). May 6-7, 2004, Boston, MA.

Yarowsky, D. (1993): “One sense per collocation”. Proceedings, ARPA Human Language Technology Workshop. Princeton, N.J. pp. 266-271.