Disambiguating Nouns, Verbs, and Adjectives Using Automatically Acquired Selectional Preferences

Article (Unspecified) http://sro.sussex.ac.uk

McCarthy, Diana and Carroll, John (2003) Disambiguating Nouns, Verbs, and Adjectives Using Automatically Acquired Selectional Preferences. Computational Linguistics, 29 (4). pp. 639-654. ISSN 0891-2017

This version is available from Sussex Research Online: http://sro.sussex.ac.uk/id/eprint/1212/

This document is made available in accordance with publisher policies and may differ from the published version or from the version of record. If you wish to cite this item you are advised to consult the publisher's version. Please see the URL above for details on accessing the published version.

Copyright and reuse: Sussex Research Online is a digital repository of the research output of the University. Copyright and all moral rights to the version of the paper presented here belong to the individual author(s) and/or other copyright owners. To the extent reasonable and practicable, the material made available in SRO has been checked for eligibility before being made available.

Copies of full text items generally can be reproduced, displayed or performed and given to third parties in any format or medium for personal research or study, educational, or not-for-profit purposes without prior permission or charge, provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page and the content is not changed in any way.





© 2003 Association for Computational Linguistics

Disambiguating Nouns, Verbs, and Adjectives Using Automatically Acquired Selectional Preferences

Diana McCarthy* and John Carroll*
University of Sussex

Selectional preferences have been used by word sense disambiguation (WSD) systems as one source of disambiguating information. We evaluate WSD using selectional preferences acquired for English adjective–noun, subject, and direct object grammatical relationships with respect to a standard test corpus. The selectional preferences are specific to verb or adjective classes, rather than individual word forms, so they can be used to disambiguate the co-occurring adjectives and verbs, rather than just the nominal argument heads. We also investigate use of the one-sense-per-discourse heuristic to propagate a sense tag for a word to other occurrences of the same word within the current document in order to increase coverage. Although the preferences perform well in comparison with other unsupervised WSD systems on the same corpus, the results show that for many applications further knowledge sources would be required to achieve an adequate level of accuracy and coverage. In addition to quantifying performance, we analyze the results to investigate the situations in which the selectional preferences achieve the best precision and in which the one-sense-per-discourse heuristic increases performance.

1 Introduction

Although selectional preferences are a possible knowledge source in an automatic word sense disambiguation (WSD) system, they are not a panacea. One problem is coverage. Most previous work has focused on acquiring selectional preferences for verbs and applying them to disambiguate nouns occurring at subject and direct object slots (Ribas 1995; McCarthy 1997; Abney and Light 1999; Ciaramita and Johnson 2000; Stevenson and Wilks 2001). In normal running text, however, a large proportion of word tokens do not fall at these slots. There has been some work looking at other slots (Resnik 1997) and on using nominal arguments as disambiguators for verbs (Federici, Montemagni, and Pirrelli 1999; Agirre and Martinez 2001), but the problem of coverage remains. Selectional preferences can be used for WSD in combination with other knowledge sources (Stevenson and Wilks 2001), but there is a need to ascertain when they work well so that they can be utilized to their full advantage. This article is aimed at quantifying the disambiguation performance of automatically acquired selectional preferences in regard to nouns, verbs, and adjectives with respect to a standard test corpus and evaluation setup (SENSEVAL-2), and at identifying strengths and weaknesses. Although there is clearly a limit to coverage using preferences alone, because preferences are acquired only with respect to specific grammatical roles, we show that when dealing with running text rather than isolated examples, coverage can be increased at little cost in accuracy by using the one-sense-per-discourse heuristic.

* Department of Informatics, University of Sussex, Brighton BN1 9QH, UK. E-mail: {dianam,johnca}@sussex.ac.uk


We acquire selectional preferences as probability distributions over the WordNet (Fellbaum 1998) noun hyponym hierarchy. The probability distributions are conditioned on a verb or adjective class and a grammatical relationship. A noun is disambiguated by using the preferences to give probability estimates for each of its senses in WordNet, that is, for WordNet synsets. Verbs and adjectives are disambiguated by using the probability distributions and Bayes' rule to obtain an estimate of the probability of the adjective or verb class given the noun and the grammatical relationship. Previously, we evaluated noun and verb disambiguation on the English all-words task in the SENSEVAL-2 exercise (Cotton et al. 2001). We now present results also using preferences for adjectives, again evaluated on the SENSEVAL-2 test corpus (but carried out after the formal evaluation deadline). The results are encouraging, given that this method does not rely for training on any hand-tagged data or frequency distributions derived from such data. Although a modest amount of English sense-tagged data is available, we nevertheless believe it is important to investigate methods that do not require such data, because there will be languages or texts for which sense-tagged data for a given word is not available or relevant.

2 Motivation

The goal of this article is to assess the WSD performance of selectional preference models for adjectives, verbs, and nouns on the SENSEVAL-2 test corpus. There are two applications for WSD that we have in mind and at which we are directing our research. The first application is text simplification, as outlined by Carroll, Minnen, Pearce et al. (1999). One subtask in this application involves substituting words with their more frequent synonyms, for example, substituting letter for missive. Our motivation for using WSD is to filter out inappropriate senses of a word token so that the substituting synonym is appropriate given the context. For example, in the following sentence we would like to use strategy rather than dodge as a substitute for scheme:

A recent government study singled out the scheme as an example to others

We are also investigating the disambiguation of verb senses in running text before subcategorization information for the verbs is acquired, in order to produce a subcategorization lexicon specific to sense (Preiss and Korhonen 2002). For example, if subcategorization were acquired specific to sense rather than verb form, then distinct senses of fire could have different subcategorization entries:

fire(1) - sack: NP V NP

fire(2) - shoot: NP V NP, NP V

Selectional preferences could also then be acquired automatically from sense-tagged data in an iterative approach (McCarthy 2001).

3 Methodology

We acquire selectional preferences from automatically preprocessed and parsed text during a training phase. The parser is also applied to the test data in the run-time phase, to identify grammatical relations among nouns, verbs, and adjectives. The acquired selectional preferences are then applied to the noun-verb and noun-adjective pairs in these grammatical constructions for disambiguation.


[Figure 1: System overview. Solid lines indicate flow of data during training, and broken lines show flow at run time. Training data pass through the preprocessor and parser into selectional preference acquisition (which consults WordNet); test data pass through the same preprocessor and parser to yield grammatical relations (e.g., ncsubj drink_VV0 you_PPY; dobj drink_VV0 glass_NN1; ncmod glass_NN1 single_JJ), which the disambiguator combines with the acquired selectional preferences to produce sense-tagged output (instance ID plus WordNet sense tag, e.g., d04s12t09 drink%2:34:00).]

The overall structure of the system is illustrated in Figure 1. We describe the individual components in Sections 3.1–3.3 and 4.

3.1 Preprocessing

The preprocessor consists of three modules applied in sequence: a tokenizer, a part-of-speech (POS) tagger, and a lemmatizer.

The tokenizer comprises a small set of manually developed finite-state rules for identifying word and sentence boundaries. The tagger (Elworthy 1994) uses a bigram hidden Markov model augmented with a statistical unknown word guesser. When applied to the training data for selectional preference acquisition, it produces the single highest-ranked POS tag for each word. In the run-time phase it returns multiple tag hypotheses, each with an associated forward-backward probability, to reduce the impact of tagging errors. The lemmatizer (Minnen, Carroll, and Pearce 2001) reduces inflected verbs and nouns to their base forms. It uses a set of finite-state rules expressing morphological regularities and subregularities, together with a list of exceptions for specific (irregular) word forms.
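As a rough illustration of the first and third modules, the toy Python sketch below mimics a finite-state tokenizer and a rule-plus-exceptions lemmatizer. The rules and the exception list are invented for illustration; they stand in for, and are far simpler than, the actual tagger of Elworthy (1994) and lemmatizer of Minnen, Carroll, and Pearce (2001).

```python
import re

# Toy stand-ins for two of the three preprocessing modules. The suffix rules
# and the exception list below are invented for illustration only.

def tokenize(text):
    # Split into alphabetic words and single non-space punctuation marks.
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)

def lemmatize(word):
    # An exception list for irregular forms, then suffix-stripping rules
    # (finite-state-style regularities and subregularities).
    exceptions = {"men": "man", "drank": "drink"}
    if word in exceptions:
        return exceptions[word]
    for suffix, replacement in (("ies", "y"), ("sses", "ss"), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word
```

On the example sentence from Figure 1, `tokenize("The men drink here.")` yields the four words plus the final period, and `lemmatize("men")` returns "man" via the exception list.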

3.2 Parsing

The parser uses a wide-coverage unification-based shallow grammar of English POS tags and punctuation (Briscoe and Carroll 1995) and performs disambiguation using a context-sensitive probabilistic model (Briscoe and Carroll 1993), recovering from extra-grammaticality by returning partial parses. The output of the parser is a set of grammatical relations (Carroll, Briscoe, and Sanfilippo 1998) specifying the syntactic dependency between each head and its dependent(s), taken from the phrase structure tree that is returned from the disambiguation phase.

For selectional preference acquisition, we applied the analysis system to the 90 million words of the written portion of the British National Corpus (BNC); the parser produced complete analyses for around 60% of the sentences and partial analyses for over 95% of the remainder. Both in the acquisition phase and at run time, we extract from the analyser output subject–verb, verb–direct object, and noun–adjective


modifier dependencies.1 We did not use the SENSEVAL-2 Penn Treebank-style bracketings supplied for the test data.

3.3 Selectional Preference Acquisition

The preferences are acquired for grammatical relations (subject, direct object, and adjective–noun) involving nouns and grammatically related adjectives or verbs. We use WordNet synsets to define our sense inventory. Our method exploits the hyponym links given for nouns (e.g., cheese is a hyponym of food), the troponym links for verbs2 (e.g., limp is a troponym of walk), and the "similar-to" relationship given for adjectives (e.g., one sense of cheap is similar to flimsy).

The preference models are modifications of the tree cut models (TCMs) originally proposed by Li and Abe (1995, 1998). The main differences between that work and ours are that we acquire adjective as well as verb models, and also that our models are with respect to verb and adjective classes rather than forms. We acquire models for classes because we are using the models for WSD, whereas Li and Abe used them for structural disambiguation.

We define a TCM as follows. Let NC be the set of noun synsets (noun classes) in WordNet, NC = {nc ∈ WordNet}, and NS be the set of noun senses3 in WordNet, NS = {ns ∈ WordNet}. A TCM is a set of noun classes that partition NS disjointly. We use Γ to refer to such a set of classes in a TCM. A TCM is defined by Γ and a probability distribution:

    ∑_{nc ∈ Γ} p(nc) = 1    (8)
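Concretely, a TCM can be held as a mapping from the classes on Γ to their probabilities; the minimal sketch below (with invented class names and probabilities, not real acquired values) simply checks the constraint in equation (8):

```python
def is_valid_tcm(cut_probs, tol=1e-9):
    """A TCM here is a dict mapping each noun class on the cut (Gamma) to its
    probability; equation (8) requires the probabilities to sum to one.
    Class names used with this function are illustrative placeholders."""
    return abs(sum(cut_probs.values()) - 1.0) < tol

# An invented example cut over four top-level classes.
example_tcm = {"entity": 0.53, "abstraction": 0.26, "event": 0.14, "group": 0.07}
```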

The probability distribution is conditioned by the grammatical context. In this work the probability distribution associated with a TCM is conditioned on a verb class (vc) and either the subject or direct-object relation, or an adjective class (ac) and the adjective–noun relation. Let VC be the set of verb synsets (verb classes) in WordNet, VC = {vc ∈ WordNet}. Let AC be the set of adjective classes (which subsume WordNet synsets; we elaborate further on this subsequently). Thus the TCMs define a probability distribution over NS that is conditioned on a verb class (vc) or adjective class (ac) and a particular grammatical relation (gr):

    ∑_{nc ∈ Γ} p(nc|vc, gr) = 1    (9)

Acquisition of a TCM for a given vc and gr proceeds as follows. The data for acquiring the preference are obtained from a subset of the tuples involving verbs in the synset or troponym (subordinate) synsets. Not all verbs that are troponyms or direct members of the synset are used in training. We take the noun argument heads occurring with verbs that have no more than 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship. The threshold of 10 senses removes some highly polysemous verbs having many sense distinctions that are rather subtle. Verbs that have more than 10 senses include very frequent verbs such as be and do that do not select strongly for their

1 In a previous evaluation of grammatical-relation accuracy with in-coverage text, the analyzer returned subject–verb and verb–direct object dependencies with 84–88% recall and precision (Carroll, Minnen, and Briscoe 1999).

2 In WordNet, verbs are organized by the troponymy relation, but this is represented with the same hyponym pointer as is used in the noun hierarchy.

3 We refer to nouns attached to synsets as noun senses


[Figure 2: Verbs not in WordNet by BNC frequency. The plot shows the percentage of verbs not in WordNet (y-axis, 0–100%) against BNC frequency rank (x-axis, 0–200).]

arguments. The frequency threshold of 20 is intended to remove noisy data. We set the threshold by examining a plot of BNC frequency and the percentage of verbs at particular frequencies that are not listed in WordNet (Figure 2). Using 20 as a threshold for the subject slot results in only 5% of verbs that are not found in WordNet, whereas 73% of verbs with fewer than 20 BNC occurrences are not present in WordNet.4
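The training filter just described can be sketched as follows; the function name and the toy sense and frequency tables are our own invention, not part of the acquisition system:

```python
def verbs_for_training(verbs, wn_senses, bnc_freq, max_senses=10, min_freq=20):
    """Select the verbs whose argument heads feed preference acquisition:
    at most 10 WordNet senses and at least 20 BNC occurrences in the given
    grammatical relation (the thresholds discussed in the text). The sense
    and frequency mappings are toy stand-ins for WordNet and BNC counts."""
    return [v for v in verbs
            if wn_senses[v] <= max_senses and bnc_freq[v] >= min_freq]
```

With invented counts, a highly polysemous verb like be is excluded by the sense threshold and a rare verb is excluded by the frequency threshold.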

The selectional-preference models for adjective–noun relations are conditioned on an ac. Each ac comprises a group of adjective WordNet synsets linked by the "similar-to" relation. These groups are formed such that they partition all adjective synsets. Thus AC = {ac ∈ WordNet adjective synsets linked by similar-to}. For example, Figure 3 shows the adjective classes that include the adjective fundamental and that are formed in this way.5 For selectional-preference models conditioned on adjective classes, we use only those adjectives that have 10 or fewer synsets in WordNet and have 20 or more occurrences in the BNC.

The set of ncs in Γ is selected from all the possibilities in the hyponym hierarchy according to the minimum description length (MDL) principle (Rissanen 1978), as used by Li and Abe (1995, 1998). MDL finds the best TCM by considering the cost (in bits) of describing both the model and the argument head data encoded in the model. The cost (or description length) for a TCM is calculated according to equation (10). The number of parameters of the model is given by k, which is the number of ncs in Γ minus one. N is the sample of the argument head data. The cost of describing each noun argument head (n) is calculated by the log of the probability estimate for that noun:

    description length = model description length + data description length

                       = (k/2) × log |N| − ∑_{n ∈ N} log p(n)    (10)
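Equation (10) can be computed directly; the minimal sketch below assumes base-2 logarithms (costs in bits) and takes the per-noun estimate p(n) as a function argument rather than reimplementing the estimation itself:

```python
import math

def description_length(num_cut_classes, data, p):
    """MDL cost in bits of a candidate cut (equation 10): (k/2) * log|N| for
    the model plus -sum(log p(n)) for the data, where k is the number of
    classes on the cut minus one, N is the sample of argument heads, and p
    gives a noun's probability estimate under the cut."""
    k = num_cut_classes - 1
    model_dl = (k / 2) * math.log2(len(data))           # model description length
    data_dl = -sum(math.log2(p(n)) for n in data)       # data description length
    return model_dl + data_dl
```

For a two-class cut and four argument heads each estimated at p(n) = 0.5, the model cost is 1 bit and the data cost 4 bits.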

4 These threshold values are somewhat arbitrary, but it turns out that our results are not sensitive to the exact values.

5 For the sake of brevity, not all adjectives are included in this diagram.


[Figure 3: Adjective classes that include fundamental. The diagram shows groups of adjective synsets linked by the "similar-to" relation, including adjectives such as basic, underlying, rudimentary, elementary, important, key, central, cardinal, crucial, essential, chief, principal, primary, main, significant, profound, monumental, and epochal.]

The probability estimate for each n is obtained using the estimates for all the noun senses that n has. Let Cn be the set of ncs that include n as a direct member, Cn = {nc ∈ NC | n ∈ nc}. Let nc′ be a hypernym of nc on Γ (i.e., nc′ ∈ {Γ | nc ∈ nc′}), and let ns_nc′ = {ns ∈ nc′} (i.e., the set of noun senses at and beneath nc′ in the hyponym hierarchy). Then the estimate p(n) is obtained using the estimates for the hypernym classes on Γ for all the Cn that n belongs to:

    p(n) = ∑_{nc ∈ Cn} p(nc′) / |ns_nc′|    (11)

The probability at any particular nc′ is divided by |ns_nc′| to give the estimate for each p(ns) under that nc′.

The probability estimates for the {nc ∈ Γ} (p(nc|vc, gr) or p(nc|ac, gr)) are obtained from the tuples from the data of nouns co-occurring with verbs (or adjectives) belonging to the conditioning vc (or ac) in the specified grammatical relationship (<n, v, gr>). The frequency credit for a tuple is divided by |Cn| for any n, and by the number of synsets of v, |Cv| (or |Ca| if the gr is adjective–noun):

    freq(nc|vc, gr) = ∑_{v ∈ vc} ∑_{n ∈ nc} freq(n|v, gr) / (|Cn| |Cv|)    (12)
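Equation (12) can equivalently be computed by distributing each tuple's credit across the noun's classes and the verb's senses; the class and sense inventories below are invented toy data, not WordNet:

```python
from collections import defaultdict

def class_frequencies(tuples, classes_of_noun, senses_of_verb):
    """freq(nc|vc, gr) for one grammatical relation (equation 12): the credit
    of each (noun, verb, count) tuple is divided by |Cn| (the noun's classes)
    and |Cv| (the verb's senses), then summed per (nc, vc) pair."""
    freq = defaultdict(float)
    for noun, verb, count in tuples:
        credit = count / (len(classes_of_noun[noun]) * len(senses_of_verb[verb]))
        for vc in senses_of_verb[verb]:
            for nc in classes_of_noun[noun]:
                freq[nc, vc] += credit
    return freq
```

A tuple observed 4 times for a one-class noun and a two-sense verb contributes credit 2 to each of the verb's two conditioning classes.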

A hypernym nc′ includes the frequency credit attributed to all its hyponyms ({nc ∈ nc′}):

    freq(nc′|vc, gr) = ∑_{nc ∈ nc′} freq(nc|vc, gr)    (13)

This ensures that the total frequency credit at any Γ across the hyponym hierarchy equals the credit for the conditioning vc. This will be the sum of the frequency credit for all verbs that are direct members or troponyms of the vc, divided by the number of other senses of each of these verbs:

    freq(vc|gr) = ∑_{verb ∈ vc} freq(verb|gr) / |Cv|    (14)


[Figure 4: TCMs for the direct-object slot of two verb classes that include the verb seize: one class also containing clutch, and one also containing assume and usurp. Each cut spans top-level classes such as entity, group, possession, event, and human_action, annotated with its probability.]

To ensure that the TCM covers all NS in WordNet, we modify Li and Abe's original scheme by creating hyponym leaf classes below all WordNet's internal classes in the hyponym hierarchy. Each leaf holds the ns previously held at the internal class.

Figure 4 shows portions of two TCMs. The TCMs are similar, as they both contain the verb seize, but the TCM for the class that includes clutch has a higher probability for the entity noun class compared to the class that also includes assume and usurp. This example includes only top-level WordNet classes, although the TCM may use more specific noun classes.

4 Disambiguation

Nouns, adjectives, and verbs are disambiguated by finding the sense (nc, vc, or ac) with the maximum probability estimate in the given context. The method disambiguates nouns and verbs to the WordNet synset level, and adjectives to a coarse-grained level of WordNet synsets linked by the similar-to relation, as described previously.

4.1 Disambiguating Nouns

Nouns are disambiguated when they occur as subjects or direct objects and when modified by adjectives. We obtain a probability estimate for each nc to which the target noun belongs, using the distribution of the TCM associated with the co-occurring verb or adjective and the grammatical relationship.

Li and Abe used TCMs for the task of structural disambiguation. To obtain probability estimates for noun senses occurring at classes beneath hypernyms on the cut, Li and Abe used the probability estimate at the nc′ on the cut divided by the number of ns descendants, as we do when finding Γ during training, so the probability estimate is shared equally among all nouns in the nc′, as in equation (15):

    p(ns ∈ ns_nc′) = p(nc′) / |ns_nc′|    (15)

One problem with doing this is that in cases in which the TCM is quite high in the hierarchy, for example at the entity class, the probability of any ns occurring under this nc′ on the TCM will be the same, and does not allow us to discriminate among senses beneath this level.

For the WSD task, we compare the probability estimates at each nc ∈ Cn, so if a noun belongs to several synsets, we compare the probability estimates given the context of these synsets. We obtain estimates for each nc by using the probability of the hypernym nc′ on Γ. Rather than assume that all synsets under a given nc′ on Γ


have the same likelihood of occurrence, we multiply the probability estimate for the hypernym nc′ by the ratio of the prior frequency of the nc for which we seek the estimate, that is p(nc|gr), divided by the prior frequency of the hypernym nc′ (p(nc′|gr)):

    p(nc ∈ hyponyms of nc′ | vc, gr) = p(nc′|vc, gr) × p(nc|gr) / p(nc′|gr)    (16)

These prior estimates are taken from populating the noun hyponym hierarchy with the prior frequency data for the gr, irrespective of the co-occurring verbs. The probability at the hypernym nc′ will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.

Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ Cn that gives the largest estimate for p(nc|vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from Cv. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc′|vc, gr), and the estimate for nc is obtained as in equation (16).
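This selection can be sketched as an argmax over the noun's classes and the verb's classes using equation (16); all class names, TCM probabilities, and priors used with this function are invented toy values, loosely modeled on the sign–letter example discussed in the text:

```python
def disambiguate_noun(noun_classes, verb_classes, tcm, cut_hypernym, prior):
    """Choose the nc in Cn maximizing p(nc|vc, gr) over the verb's classes,
    estimated as in equation (16): the TCM probability of nc's hypernym on
    the cut, rescaled by the ratio of priors p(nc|gr) / p(nc'|gr)."""
    best_nc, best_score = None, -1.0
    for vc in verb_classes:
        for nc in noun_classes:
            nc_prime = cut_hypernym[vc][nc]   # hypernym of nc on this vc's cut
            score = tcm[vc][nc_prime] * prior[nc] / prior[nc_prime]
            if score > best_score:
                best_nc, best_score = nc, score
    return best_nc
```

With toy numbers, a sense under a lower-probability cut class can still win if the prior ratio concentrated on it is much larger, mirroring the letter example that follows.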

For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc|vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM compared to matter, which is above the correct sense of letter,6 the ratio of prior probabilities for the synset containing letter7 under entity is 0.001, whereas that for the synset under matter is 0.226. This gives a probability of 0.009 × 0.226 = 0.002 for the noun class probability given the verb class

[Figure 5: TCM for the direct-object slot of the verb class including sign and ratify. The cut spans classes such as entity, human_act, communication, message, and matter, each annotated with its probability estimate, above the five WordNet senses of letter (including the alphabetic_letter and written-message senses).]

6 The gloss is "a written message addressed to a person or organization; e.g. wrote an indignant letter to the editor."

7 The gloss is "owner who lets another person use something (housing usually) for hire."


(with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.

4.2 Disambiguating Verbs and Adjectives

Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc|vc, gr) and p(nc|ac, gr), respectively. These are combined with prior estimates for p(nc|gr) and p(vc|gr) (or p(ac|gr)) using Bayes' rule to give

    p(vc|nc, gr) = p(nc|vc, gr) × p(vc|gr) / p(nc|gr)    (17)

and for adjective–noun relations

    p(ac|nc, adjnoun) = p(nc|ac, adjnoun) × p(ac|adjnoun) / p(nc|adjnoun)    (18)

The prior distributions for p(nc|gr), p(vc|gr), and p(ac|adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |Cn|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc′) totals the frequency at its hyponyms:

    freq(nc′|gr) = ∑_{nc ∈ nc′} freq(nc|gr)    (19)

The distribution over VC is obtained similarly, using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.

To disambiguate a verb occurring with a given noun, the vc from Cv that gives the largest estimate for p(vc|nc, gr) is taken. The nc for the co-occurring noun is the nc from Cn that maximizes this estimate. The estimate for p(nc|vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc|nc, gr) rather than p(nc|vc, gr). An adjective is likewise disambiguated to the ac, from all those to which the adjective belongs, using the estimate for p(nc|ac, gr) and selecting the nc that maximizes the p(ac|nc, gr) estimate.
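A sketch of this procedure for verbs, using equation (17); in the real system the probability tables come from the acquired TCMs and priors, whereas the values used with this function here are invented:

```python
def disambiguate_verb(verb_classes, noun_classes, p_nc_given_vc, p_vc, p_nc):
    """Equation (17): choose the vc maximizing
        p(vc|nc, gr) = p(nc|vc, gr) * p(vc|gr) / p(nc|gr),
    also maximizing over the co-occurring noun's classes Cn."""
    best_vc, best_score = None, -1.0
    for vc in verb_classes:
        for nc in noun_classes:
            score = p_nc_given_vc[vc][nc] * p_vc[vc] / p_nc[nc]
            if score > best_score:
                best_vc, best_score = vc, score
    return best_vc
```

With toy tables for two hypothetical classes of fire, an object class that is likely under the shoot-like class but rare overall pulls the decision toward that class.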

4.3 Increasing Coverage: One Sense per Discourse

There is a significant limitation to the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations, and in argument head position. Moreover, we have TCMs only for adjective and verb classes in which there was at least one adjective or verb member that met our criteria for training (having no more than a threshold of 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where we did not have TCMs for one or more classes for the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church, and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document in order to increase coverage. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.
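The OSPD propagation step can be sketched as follows, treating a document as a list of (word, POS, tag) triples with tag None where the preferences assigned nothing; this representation is our own, chosen for illustration:

```python
from collections import defaultdict

def apply_ospd(tokens):
    """One-sense-per-discourse propagation: a tag is copied to untagged
    occurrences of the same (word, POS) in the discourse only when exactly
    one distinct tag was assigned to that word by the preferences."""
    assigned = defaultdict(set)
    for word, pos, tag in tokens:
        if tag is not None:
            assigned[word, pos].add(tag)
    out = []
    for word, pos, tag in tokens:
        if tag is None and len(assigned[word, pos]) == 1:
            (tag,) = assigned[word, pos]
        out.append((word, pos, tag))
    return out
```

A word left entirely untagged by the preferences, or tagged with two conflicting senses in the same discourse, receives no propagated tag.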


[Figure 6: SENSEVAL-2 English all-words task results. The plot shows precision against recall (both 0–100%) for our systems (sel, sel-ospd, sel-ospd-ana), the supervised systems, the other unsupervised systems, and the first sense heuristic.]

5 Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise, in three variants, under the names "sussex-sel" (selectional preferences), "sussex-sel-ospd" (with the OSPD heuristic), and "sussex-sel-ospd-ana" (with anaphora resolution).8 For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2, compared to other unsupervised systems.9

We have also plotted the results of the supervised systems and the precision and recall achieved by using the most frequent sense (as listed in WordNet).10

In the work reported here, we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective–noun relationships. For each test instance, we applied subject preferences before direct object preferences, and direct object preferences before adjective–noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.

We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences. We therefore

8 The motivation for using anaphora resolution was increased coverage, but anaphora resolution turned out not actually to improve performance.

9 We use unsupervised to refer to systems that do not use manually sense-tagged training data such as SemCor. Our systems, marked in the figure as sel, sel-ospd, and sel-ospd-ana, are unsupervised.

10 We are indebted to Judita Preiss for the most-frequent-sense result. This was obtained using the frequency data supplied with the WordNet 1.7 version prereleased for SENSEVAL-2.


Table 1
Overall results.

              With OSPD    Without OSPD
Precision     51.1%        52.3%
Recall        23.2%        20.0%
Attempted     45.5%        38.3%

Table 2
Precision results by part of speech.

                                          Precision (%)   Baseline precision (%)
Nouns                                     58.5            51.7
Polysemous nouns                          36.8            25.8
Verbs                                     40.9            29.7
Polysemous verbs                          38.1            25.3
Adjectives                                49.8            48.6
Polysemous adjectives                     35.5            24.0
Nouns, verbs, and adjectives              51.1            44.9
Polysemous nouns, verbs, and adjectives   36.8            27.3

do not include the coarse-grained results in the following; they are just slightly better than the fine-grained results, which seems to be typical of other systems.

Our latest overall results are shown in Table 1. In this table we show the results both with and without the OSPD heuristic. The results for the English SENSEVAL-2 tasks were generally much lower than those for the original SENSEVAL competition. At the time of the SENSEVAL-2 workshop, this was assumed to be due largely to the use of WordNet as the inventory, as opposed to HECTOR (Atkins 1993), but Palmer, Trang Dang, and Fellbaum (forthcoming) have subsequently shown that, at least for the lexical sample tasks, this was due to a harder selection of words with a higher average level of polysemy. For three of the most polysemous verbs that overlapped between the English lexical sample for SENSEVAL and SENSEVAL-2, the performance was comparable. Table 2 shows our precision results, including use of the OSPD heuristic, broken down by part of speech. Although the precision for nouns is greater than that for verbs, the difference is much less when we remove the trivial monosemous cases. Nouns, verbs, and adjectives all outperform their random baseline for precision, and the difference is more marked when monosemous instances are dropped.

Table 3 shows the precision results for polysemous words given the slot and the disambiguation source. Overall, once at least one word token has been disambiguated by the preferences, the OSPD heuristic seems to perform better than the selectional preferences. We can see, however, that although this is certainly true for the nouns, the difference for the adjectives (1.3%) is less marked, and the preferences outperform OSPD for the verbs. It seems that verbs obey the OSPD principle much less than nouns. Also, verbs are best disambiguated by their direct objects, whereas nouns appear to be better disambiguated as subjects and when modified by adjectives.

6 Discussion

6.1 Selectional Preferences
The precision of our system compares well with that of other unsupervised systems on the SENSEVAL-2 English all-words task, despite the fact that these other systems use a


Computational Linguistics Volume 29 Number 4

Table 3
Precision results for polysemous words by part of speech and slot or disambiguation source.

                                           Subject (%)    Dobj (%)    Adjm (%)    OSPD (%)
Polysemous nouns                               33.7          26.8        31.0        49.0
Polysemous verbs                               33.8          47.3          —         29.8
Polysemous adjectives                            —             —         35.1        36.4
Polysemous nouns, verbs, and adjectives        33.4          36.0        31.6        44.8

number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik, containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy, at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects. We would expect that WSD results using selectional preferences would be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic was 82.8% (Abney and Light 1999), whereas it was 53.6% for the polysemous direct objects in the all-words task. Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.

Although the accuracy of our system is encouraging, given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense as listed in WordNet (the "first-sense heuristic") would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information. Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file, and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4). For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example because the word shows "bursty" topic-related behavior.

Table 4
Percentages of words with a different predominant sense in SemCor across files and genres.

             File (%)    Genre (%)
Nouns           70          66
Verbs           79          74
Adjectives      25          21
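The file-level analysis above can be pictured with a small counting sketch. The tuple format and helper names below are invented for illustration; they are not SemCor's actual representation:

```python
from collections import Counter, defaultdict

def predominant_senses(tagged_tokens):
    """Map each (lemma, file) pair to its most frequent sense.

    tagged_tokens: iterable of (lemma, file_id, sense) triples,
    a toy stand-in for SemCor-style sense-tagged text.
    """
    counts = defaultdict(Counter)
    for lemma, file_id, sense in tagged_tokens:
        counts[(lemma, file_id)][sense] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def varies_across_files(tagged_tokens):
    """Fraction of lemmas seen in more than one file whose predominant
    sense differs between files."""
    by_key = predominant_senses(tagged_tokens)
    senses_per_lemma = defaultdict(set)
    files_per_lemma = Counter()
    for (lemma, _file_id), sense in by_key.items():
        senses_per_lemma[lemma].add(sense)
        files_per_lemma[lemma] += 1
    multi = {l: s for l, s in senses_per_lemma.items() if files_per_lemma[l] > 1}
    if not multi:
        return 0.0
    return sum(1 for s in multi.values() if len(s) > 1) / len(multi)
```

Running this over real SemCor token streams would reproduce the kind of file-level percentages reported in Table 4.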

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses, for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from similar training data to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated, but on the cohesiveness of the tuple <pred, arg, gr>. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.
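A minimal sketch of this thresholding experiment, assuming a toy representation in which each test item comes with model probabilities for its candidate senses (the function names are ours, not the system's):

```python
def disambiguate_with_threshold(scored_senses, threshold):
    """Return the top-scoring sense only if its model probability
    clears the threshold; otherwise abstain (return None).

    scored_senses: list of (sense, probability) pairs, standing in for
    the preference model's estimates for one test item.
    """
    if not scored_senses:
        return None
    sense, prob = max(scored_senses, key=lambda sp: sp[1])
    return sense if prob >= threshold else None

def precision_recall(predictions, gold):
    """Precision over attempted items and recall over all items,
    given parallel lists of predictions (None = abstained) and gold senses."""
    attempted = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    correct = sum(1 for p, g in attempted if p == g)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

Sweeping `threshold` over a range and plotting the resulting precision is the shape of the experiment behind Figure 7: raising the threshold trades coverage for precision.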



Figure 7
Thresholding the probability estimate for the highest-scoring class. (The plot shows precision, from 0 to 1, against the threshold, from 0 to 0.01, for the TCMs, the first-sense heuristic, and the random baseline.)

Table 5
Lemma/file combinations in SemCor with more than one sense evident (%).

Nouns        23
Verbs        19
Adjectives   16

6.2 The OSPD Heuristic
In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous11 lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.
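The OSPD propagation with the conflicting-evidence restriction can be sketched as follows; the token dictionaries are an invented stand-in for the system's internal representation:

```python
from collections import defaultdict

def apply_ospd(tokens):
    """Propagate a sense tag to untagged occurrences of the same lemma
    in the same document, unless tagged occurrences conflict.

    tokens: list of dicts with 'doc', 'lemma', and 'sense' keys, where
    sense is None if the preferences could not disambiguate the token.
    """
    senses = defaultdict(set)
    for t in tokens:
        if t["sense"] is not None:
            senses[(t["doc"], t["lemma"])].add(t["sense"])
    for t in tokens:
        key = (t["doc"], t["lemma"])
        # Propagate only when exactly one sense was seen in this document;
        # conflicting evidence (two or more senses) blocks propagation.
        if t["sense"] is None and len(senses[key]) == 1:
            t["sense"] = next(iter(senses[key]))
    return tokens
```

The coverage gain comes precisely from the second pass: tokens the preferences left untagged inherit the document-level sense when the evidence is consistent.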

7 Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus, using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of

11 Krovetz just looked at "actual ambiguity", that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.



other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

Selectional preferences work well for some word combinations and grammatical relationships, but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these, at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic based on sense-tagged data, such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and where the selectional-preference models provide a high probability for a secondary sense. In these cases, evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING". We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References
Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.

Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15–22.

Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.

Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25–59.

Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48–58, Prague, Czech Republic.

Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447–454.

Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35–41, Bergen, Norway.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway.

Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-00), pages 187–193.

Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.

Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.

Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239–248, Bulgaria.

Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269–281.

McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.

McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.

Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufmann.

Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering. Forthcoming.

Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.

Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52–57, Washington, DC.

Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118.

Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465–471.

Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.


© 2003 Association for Computational Linguistics

Disambiguating Nouns, Verbs, and Adjectives Using Automatically Acquired Selectional Preferences

Diana McCarthy*                     John Carroll*
University of Sussex                University of Sussex

Selectional preferences have been used by word sense disambiguation (WSD) systems as one source of disambiguating information. We evaluate WSD using selectional preferences acquired for English adjective–noun, subject, and direct object grammatical relationships with respect to a standard test corpus. The selectional preferences are specific to verb or adjective classes, rather than individual word forms, so they can be used to disambiguate the co-occurring adjectives and verbs, rather than just the nominal argument heads. We also investigate use of the one-sense-per-discourse heuristic to propagate a sense tag for a word to other occurrences of the same word within the current document, in order to increase coverage. Although the preferences perform well in comparison with other unsupervised WSD systems on the same corpus, the results show that for many applications further knowledge sources would be required to achieve an adequate level of accuracy and coverage. In addition to quantifying performance, we analyze the results to investigate the situations in which the selectional preferences achieve the best precision and in which the one-sense-per-discourse heuristic increases performance.

1 Introduction

Although selectional preferences are a possible knowledge source in an automatic word sense disambiguation (WSD) system, they are not a panacea. One problem is coverage. Most previous work has focused on acquiring selectional preferences for verbs and applying them to disambiguate nouns occurring at subject and direct object slots (Ribas 1995; McCarthy 1997; Abney and Light 1999; Ciaramita and Johnson 2000; Stevenson and Wilks 2001). In normal running text, however, a large proportion of word tokens do not fall at these slots. There has been some work looking at other slots (Resnik 1997) and on using nominal arguments as disambiguators for verbs (Federici, Montemagni, and Pirrelli 1999; Agirre and Martinez 2001), but the problem of coverage remains. Selectional preferences can be used for WSD in combination with other knowledge sources (Stevenson and Wilks 2001), but there is a need to ascertain when they work well so that they can be utilized to their full advantage. This article is aimed at quantifying the disambiguation performance of automatically acquired selectional preferences in regard to nouns, verbs, and adjectives, with respect to a standard test corpus and evaluation setup (SENSEVAL-2), and at identifying strengths and weaknesses. Although there is clearly a limit to coverage using preferences alone, because preferences are acquired only with respect to specific grammatical roles, we show that when dealing with running text rather than isolated examples, coverage can be increased at little cost in accuracy by using the one-sense-per-discourse heuristic.

* Department of Informatics, University of Sussex, Brighton BN1 9QH, UK. E-mail: {dianam,johnca}@sussex.ac.uk



We acquire selectional preferences as probability distributions over the WordNet (Fellbaum 1998) noun hyponym hierarchy. The probability distributions are conditioned on a verb or adjective class and a grammatical relationship. A noun is disambiguated by using the preferences to give probability estimates for each of its senses in WordNet, that is, for WordNet synsets. Verbs and adjectives are disambiguated by using the probability distributions and Bayes' rule to obtain an estimate of the probability of the adjective or verb class, given the noun and the grammatical relationship. Previously, we evaluated noun and verb disambiguation on the English all-words task in the SENSEVAL-2 exercise (Cotton et al. 2001). We now present results also using preferences for adjectives, again evaluated on the SENSEVAL-2 test corpus (but carried out after the formal evaluation deadline). The results are encouraging, given that this method does not rely for training on any hand-tagged data or frequency distributions derived from such data. Although a modest amount of English sense-tagged data is available, we nevertheless believe it is important to investigate methods that do not require such data, because there will be languages or texts for which sense-tagged data for a given word is not available or relevant.
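As a concrete illustration of the Bayes' rule step, the sketch below ranks candidate verb classes for a noun in a given grammatical relation via p(vc | n, gr) ∝ p(n | vc, gr) p(vc | gr). The dictionaries of conditional probabilities and class priors are toy stand-ins, not the paper's actual models:

```python
def best_verb_class(noun_probs, class_priors):
    """Rank candidate verb classes for a (noun, gr) pair with Bayes' rule.

    noun_probs: {vc: p(n | vc, gr)}, one value per class's TCM;
    class_priors: {vc: p(vc | gr)}. Both are illustrative inputs.
    Returns the best class and the normalized posterior distribution.
    """
    scores = {vc: noun_probs.get(vc, 0.0) * class_priors.get(vc, 0.0)
              for vc in class_priors}
    total = sum(scores.values())
    if total == 0.0:
        return None, scores  # no evidence: abstain
    posteriors = {vc: s / total for vc, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors
```

The same shape of computation, run over a noun's WordNet senses instead of verb classes, yields the noun disambiguation direction described above.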

2 Motivation

The goal of this article is to assess the WSD performance of selectional preference models for adjectives, verbs, and nouns on the SENSEVAL-2 test corpus. There are two applications for WSD that we have in mind and that are directing our research. The first application is text simplification, as outlined by Carroll, Minnen, Pearce et al. (1999). One subtask in this application involves substituting words with their more frequent synonyms, for example, substituting letter for missive. Our motivation for using WSD is to filter out inappropriate senses of a word token, so that the substituting synonym is appropriate given the context. For example, in the following sentence we would like to use strategy rather than dodge as a substitute for scheme:

A recent government study singled out the scheme as an example to others.

We are also investigating the disambiguation of verb senses in running text before subcategorization information for the verbs is acquired, in order to produce a subcategorization lexicon specific to sense (Preiss and Korhonen 2002). For example, if subcategorization were acquired specific to sense, rather than verb form, then distinct senses of fire could have different subcategorization entries:

fire(1) - sack:  NP V NP

fire(2) - shoot: NP V NP, NP V

Selectional preferences could also then be acquired automatically from sense-tagged data in an iterative approach (McCarthy 2001).

3 Methodology

We acquire selectional preferences from automatically preprocessed and parsed text during a training phase. The parser is applied to the test data as well, in the run-time phase, to identify grammatical relations among nouns, verbs, and adjectives. The acquired selectional preferences are then applied to the noun–verb and noun–adjective pairs in these grammatical constructions for disambiguation.



Figure 1
System overview. Solid lines indicate the flow of data during training, and broken lines show the flow at run time. (Training data pass through the preprocessor and parser to produce grammatical relations, from which selectional preferences are acquired over WordNet; at run time, grammatical relations extracted from the test data are passed, with the acquired preferences, to the disambiguator, which produces sense-tagged output consisting of instance IDs and WordNet sense tags.)

The overall structure of the system is illustrated in Figure 1. We describe the individual components in Sections 3.1–3.3 and 4.

3.1 Preprocessing
The preprocessor consists of three modules applied in sequence: a tokenizer, a part-of-speech (POS) tagger, and a lemmatizer.

The tokenizer comprises a small set of manually developed finite-state rules for identifying word and sentence boundaries. The tagger (Elworthy 1994) uses a bigram hidden Markov model augmented with a statistical unknown word guesser. When applied to the training data for selectional preference acquisition, it produces the single highest-ranked POS tag for each word. In the run-time phase it returns multiple tag hypotheses, each with an associated forward-backward probability, to reduce the impact of tagging errors. The lemmatizer (Minnen, Carroll, and Pearce 2001) reduces inflected verbs and nouns to their base forms. It uses a set of finite-state rules expressing morphological regularities and subregularities, together with a list of exceptions for specific (irregular) word forms.

3.2 Parsing
The parser uses a wide-coverage unification-based shallow grammar of English POS tags and punctuation (Briscoe and Carroll 1995) and performs disambiguation using a context-sensitive probabilistic model (Briscoe and Carroll 1993), recovering from extra-grammaticality by returning partial parses. The output of the parser is a set of grammatical relations (Carroll, Briscoe, and Sanfilippo 1998) specifying the syntactic dependency between each head and its dependent(s), taken from the phrase structure tree that is returned from the disambiguation phase.

For selectional preference acquisition we applied the analysis system to the 90 million words of the written portion of the British National Corpus (BNC); the parser produced complete analyses for around 60% of the sentences and partial analyses for over 95% of the remainder. Both in the acquisition phase and at run time, we extract from the analyser output subject–verb, verb–direct object, and noun–adjective modifier dependencies.1 We did not use the SENSEVAL-2 Penn Treebank–style bracketings supplied for the test data.

3.3 Selectional Preference Acquisition
The preferences are acquired for grammatical relations (subject, direct object, and adjective–noun) involving nouns and grammatically related adjectives or verbs. We use WordNet synsets to define our sense inventory. Our method exploits the hyponym links given for nouns (e.g., cheese is a hyponym of food), the troponym links for verbs2 (e.g., limp is a troponym of walk), and the "similar-to" relationship given for adjectives (e.g., one sense of cheap is similar to flimsy).

The preference models are modifications of the tree cut models (TCMs) originally proposed by Li and Abe (1995, 1998). The main differences between that work and ours are that we acquire adjective as well as verb models, and also that our models are with respect to verb and adjective classes, rather than forms. We acquire models for classes because we are using the models for WSD, whereas Li and Abe used them for structural disambiguation.

We define a TCM as follows. Let NC be the set of noun synsets (noun classes) in WordNet, NC = {nc ∈ WordNet}, and let NS be the set of noun senses3 in WordNet, NS = {ns ∈ WordNet}. A TCM is a set of noun classes that partition NS disjointly. We use Γ to refer to such a set of classes in a TCM. A TCM is defined by Γ and a probability distribution:

\sum_{nc \in \Gamma} p(nc) = 1    (8)

The probability distribution is conditioned by the grammatical context. In this work, the probability distribution associated with a TCM is conditioned on a verb class (vc) and either the subject or direct-object relation, or an adjective class (ac) and the adjective–noun relation. Let VC be the set of verb synsets (verb classes) in WordNet, VC = {vc ∈ WordNet}. Let AC be the set of adjective classes (which subsume WordNet synsets; we elaborate further on this subsequently). Thus the TCMs define a probability distribution over NS that is conditioned on a verb class (vc) or adjective class (ac) and a particular grammatical relation (gr):

\sum_{nc \in \Gamma} p(nc \mid vc, gr) = 1    (9)

Acquisition of a TCM for a given vc and gr proceeds as follows. The data for acquiring the preference are obtained from a subset of the tuples involving verbs in the synset or troponym (subordinate) synsets. Not all verbs that are troponyms or direct members of the synset are used in training. We take the noun argument heads occurring with verbs that have no more than 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship. The threshold of 10 senses removes some highly polysemous verbs having many sense distinctions that are rather subtle. Verbs that have more than 10 senses include very frequent verbs, such as be and do, that do not select strongly for their

1 In a previous evaluation of grammatical-relation accuracy with in-coverage text, the analyzer returned subject–verb and verb–direct object dependencies with 84–88% recall and precision (Carroll, Minnen, and Briscoe 1999).

2 In WordNet, verbs are organized by the troponymy relation, but this is represented with the same hyponym pointer as is used in the noun hierarchy.

3 We refer to nouns attached to synsets as noun senses.



Figure 2
Verbs not in WordNet, by BNC frequency. (The plot shows the percentage of verbs not in WordNet, from 0 to 100, against BNC frequency rank, from 0 to 200.)

arguments. The frequency threshold of 20 is intended to remove noisy data. We set the threshold by examining a plot of BNC frequency and the percentage of verbs at particular frequencies that are not listed in WordNet (Figure 2). Using 20 as a threshold for the subject slot results in only 5% of verbs that are not found in WordNet, whereas 73% of verbs with fewer than 20 BNC occurrences are not present in WordNet.4
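The two training filters just described can be sketched directly; the dictionaries below stand in for WordNet sense counts and BNC frequency counts:

```python
def select_training_verbs(verbs, senses_in_wordnet, bnc_freq,
                          max_senses=10, min_freq=20):
    """Keep verbs with at most max_senses WordNet senses and at least
    min_freq occurrences in the corpus; tuples headed by other verbs
    are discarded from preference training.

    senses_in_wordnet and bnc_freq are illustrative dicts, not actual
    WordNet or BNC lookups.
    """
    return [v for v in verbs
            if senses_in_wordnet.get(v, 0) <= max_senses
            and bnc_freq.get(v, 0) >= min_freq]
```

The sense cap excludes very frequent, weakly selecting verbs (be, do), while the frequency floor excludes rare, noisy forms.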

The selectional-preference models for adjective–noun relations are conditioned on an ac. Each ac comprises a group of adjective WordNet synsets linked by the "similar-to" relation. These groups are formed such that they partition all adjective synsets. Thus AC = {ac ∈ WordNet adjective synsets linked by similar-to}. For example, Figure 3 shows the adjective classes that include the adjective fundamental and that are formed in this way.5 For selectional-preference models conditioned on adjective classes, we use only those adjectives that have 10 synsets or less in WordNet and have 20 or more occurrences in the BNC.
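Forming adjective classes as groups of synsets linked by similar-to amounts to taking connected components of the similar-to graph; this sketch uses toy synset identifiers rather than WordNet itself:

```python
from collections import defaultdict

def adjective_classes(similar_to_pairs, all_synsets):
    """Partition adjective synsets into classes: the connected
    components of the undirected similar-to graph.

    similar_to_pairs: iterable of (synset_a, synset_b) links;
    all_synsets: every adjective synset, so singletons form their own class.
    """
    graph = defaultdict(set)
    for a, b in similar_to_pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, classes = set(), []
    for syn in all_synsets:
        if syn in seen:
            continue
        # Graph search collects one component starting from syn.
        component, frontier = set(), [syn]
        while frontier:
            node = frontier.pop()
            if node in component:
                continue
            component.add(node)
            frontier.extend(graph[node])
        seen |= component
        classes.append(component)
    return classes
```

Because components are disjoint and every synset is visited, the resulting classes partition the adjective synsets, as the ac definition requires.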

The set of ncs in Γ are selected from all the possibilities in the hyponym hierarchy according to the minimum description length (MDL) principle (Rissanen 1978), as used by Li and Abe (1995, 1998). MDL finds the best TCM by considering the cost (in bits) of describing both the model and the argument head data encoded in the model. The cost (or description length) for a TCM is calculated according to equation (10). The number of parameters of the model is given by k, which is the number of ncs in Γ minus one. N is the sample of the argument head data. The cost of describing each noun argument head (n) is calculated by the log of the probability estimate for that noun.

description length = model description length + data description length

                   = \frac{k}{2} \times \log |N| - \sum_{n \in N} \log p(n)    (10)
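Equation (10) translates into a short cost function; the interfaces here (a dict for the candidate cut, a callable for p(n)) are our own illustrative choices, not the system's implementation:

```python
import math

def description_length(cut_probs, argument_heads, noun_prob):
    """MDL cost (in bits) of a candidate tree cut, per equation (10):
    (k/2) * log2 |N| minus the summed log2 probability of the heads,
    where k is the number of classes in the cut minus one.

    cut_probs: {noun_class: probability} for the candidate cut;
    argument_heads: the sample N of noun argument heads;
    noun_prob: function n -> p(n) under this cut.
    """
    k = len(cut_probs) - 1
    model_cost = (k / 2) * math.log2(len(argument_heads))
    data_cost = -sum(math.log2(noun_prob(n)) for n in argument_heads)
    return model_cost + data_cost
```

A cut high in the hierarchy has few classes (cheap model, expensive data); a cut near the leaves is the reverse. MDL picks the cut with the lowest total.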

4 These threshold values are somewhat arbitrary, but it turns out that our results are not sensitive to the exact values.

5 For the sake of brevity, not all adjectives are included in this diagram.



Figure 3
Adjective classes that include fundamental. (Three classes are shown, formed from synsets linked by similar-to: one around basic, including root, radical, grassroots, underlying, fundamental, rudimentary, primal, and elementary; one around of import/important, including key, central, fundamental, cardinal, crucial, essential, chief, principal, primary, and main; and one around significant/important, including monumental, profound, fundamental, epochal, portentous, prodigious, noteworthy, and remarkable.)

The probability estimate for each n is obtained using the estimates for all the nss that n has. Let Cn be the set of ncs that include n as a direct member, Cn = {nc ∈ NC | n ∈ nc}. Let nc′ be a hypernym of nc on Γ (i.e., nc′ ∈ {Γ | nc ∈ nc′}), and let ns_{nc′} = {ns ∈ nc′} (i.e., the set of noun senses at and beneath nc′ in the hyponym hierarchy). Then the estimate p(n) is obtained using the estimates for the hypernym classes on Γ for all the Cn that n belongs to:

    p(n) = Σ_{nc ∈ Cn} p(nc′) / |ns_{nc′}|    (11)

The probability at any particular nc′ is divided by |ns_{nc′}| to give the estimate for each p(ns) under that nc′.
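A minimal sketch of equation (11), under the simplifying assumption that each class a noun directly belongs to has exactly one hypernym on the cut; all class names and probabilities are hypothetical.

```python
def p_noun(noun, direct_classes, cut_prob, cut_class_size, hypernym_on_cut):
    """Estimate p(n) by summing, over the classes Cn that directly contain
    the noun, the cut-class probability shared equally among the noun
    senses below that cut class (equation (11))."""
    total = 0.0
    for nc in direct_classes[noun]:
        nc_prime = hypernym_on_cut[nc]  # the hypernym of nc on the cut
        total += cut_prob[nc_prime] / cut_class_size[nc_prime]
    return total

# Toy example: a noun with two senses, lying under two cut classes
direct = {"letter": ["message", "symbol"]}
on_cut = {"message": "communication", "symbol": "abstraction"}
prob = {"communication": 0.2, "abstraction": 0.1}
size = {"communication": 40, "abstraction": 100}
estimate = p_noun("letter", direct, prob, size, on_cut)  # 0.2/40 + 0.1/100
```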

The probability estimates for the {nc ∈ Γ} (p(nc|vc, gr) or p(nc|ac, gr)) are obtained from the tuples from the data of nouns co-occurring with verbs (or adjectives) belonging to the conditioning vc (or ac) in the specified grammatical relationship (⟨n, v, gr⟩). The frequency credit for a tuple is divided by |Cn| for any n and by the number of synsets of v, |Cv| (or |Ca| if the gr is adjective–noun):

    freq(nc|vc, gr) = Σ_{v ∈ vc} Σ_{n ∈ nc} freq(n|v, gr) / (|Cn| × |Cv|)    (12)

A hypernym nc′ includes the frequency credit attributed to all its hyponyms ({nc ∈ nc′}):

    freq(nc′|vc, gr) = Σ_{nc ∈ nc′} freq(nc|vc, gr)    (13)

This ensures that the total frequency credit at any Γ across the hyponym hierarchy equals the credit for the conditioning vc. This will be the sum of the frequency credit for all verbs that are direct members or troponyms of the vc, divided by the number of other senses of each of these verbs:

    freq(vc|gr) = Σ_{verb ∈ vc} freq(verb|gr) / |Cv|    (14)
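The frequency-credit scheme of equations (12) and (13) can be sketched as follows: each co-occurrence count is split across the ambiguity of the noun and the verb, and the resulting credit is propagated up to every hypernym. The miniature hierarchy and counts are invented for illustration.

```python
from collections import defaultdict

def class_frequencies(tuples, noun_classes, n_verb_senses, parent):
    """Distribute freq(n|v, gr) credit over noun classes (equation (12))
    and propagate it up the hypernym hierarchy (equation (13))."""
    freq = defaultdict(float)
    for noun, verb, count in tuples:
        ncs = noun_classes[noun]
        credit = count / (len(ncs) * n_verb_senses[verb])
        for nc in ncs:
            node = nc
            while node is not None:  # hypernyms accumulate hyponym credit
                freq[node] += credit
                node = parent.get(node)
    return freq

# Toy data: "glass" is two-ways ambiguous, "drink" has three senses
tuples = [("glass", "drink", 6)]
ncs = {"glass": ["container", "quantity"]}
senses = {"drink": 3}
parent = {"container": "entity", "quantity": "abstraction",
          "entity": None, "abstraction": None}
freqs = class_frequencies(tuples, ncs, senses, parent)
# Each direct class gets 6 / (2 * 3) = 1.0, as does each class above it
```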


McCarthy and Carroll Disambiguating Using Selectional Preferences

Figure 4: TCMs for the direct-object slot of two verb classes that include the verb seize. [Diagram: portions of the noun hierarchy below the root, with TCM probabilities at top-level classes such as group, event, human_action, possession, and entity, for the class {seize, assume, usurp} and the class {seize, clutch}.]

To ensure that the TCM covers all NS in WordNet, we modify Li and Abe's original scheme by creating hyponym leaf classes below all WordNet's internal classes in the hyponym hierarchy. Each leaf holds the ns previously held at the internal class.

Figure 4 shows portions of two TCMs. The TCMs are similar, as they both contain the verb seize, but the TCM for the class that includes clutch has a higher probability for the entity noun class compared to the class that also includes assume and usurp. This example includes only top-level WordNet classes, although the TCM may use more specific noun classes.

4 Disambiguation

Nouns, adjectives, and verbs are disambiguated by finding the sense (nc, vc, or ac) with the maximum probability estimate in the given context. The method disambiguates nouns and verbs to the WordNet synset level, and adjectives to a coarse-grained level of WordNet synsets linked by the similar-to relation, as described previously.

4.1 Disambiguating Nouns

Nouns are disambiguated when they occur as subjects or direct objects and when modified by adjectives. We obtain a probability estimate for each nc to which the target noun belongs, using the distribution of the TCM associated with the co-occurring verb or adjective and the grammatical relationship.

Li and Abe used TCMs for the task of structural disambiguation. To obtain probability estimates for noun senses occurring at classes beneath hypernyms on the cut, Li and Abe used the probability estimate at the nc′ on the cut, divided by the number of ns descendants, as we do when finding Γ during training, so the probability estimate is shared equally among all nouns in the nc′, as in equation (15):

    p(ns ∈ ns_{nc′}) = p(nc′) / |ns_{nc′}|    (15)

One problem with doing this is that in cases in which the TCM is quite high in the hierarchy, for example, at the entity class, the probability of any ns's occurring under this nc′ on the TCM will be the same and does not allow us to discriminate among senses beneath this level.

For the WSD task we compare the probability estimates at each nc ∈ Cn, so if a noun belongs to several synsets, we compare the probability estimates given the context of these synsets. We obtain estimates for each nc by using the probability of the hypernym nc′ on Γ. Rather than assume that all synsets under a given nc′ on Γ



have the same likelihood of occurrence, we multiply the probability estimate for the hypernym nc′ by the ratio of the prior frequency of the nc, that is, p(nc|gr), for which we seek the estimate, divided by the prior frequency of the hypernym nc′ (p(nc′|gr)):

    p(nc ∈ hyponyms of nc′ | vc, gr) = p(nc′|vc, gr) × p(nc|gr) / p(nc′|gr)    (16)

These prior estimates are taken from populating the noun hyponym hierarchy with the prior frequency data for the gr, irrespective of the co-occurring verbs. The probability at the hypernym nc′ will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.

Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ Cn that gives the largest estimate for p(nc|vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from Cv. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc′|vc, gr), and the estimate for nc is obtained as in equation (16).
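A minimal sketch of this noun-disambiguation step, combining a TCM estimate at the cut with the prior ratio of equation (16). The sense labels, TCM values, and (relative) priors below are invented, chosen only so the toy run mirrors the letter/sign example discussed in the text.

```python
def disambiguate_noun(noun_classes, tcm, prior, hypernym_on_cut):
    """Pick the noun class nc maximizing p(nc|vc, gr): the TCM supplies
    p(nc'|vc, gr) at the hypernym nc' on the cut, scaled by the prior
    ratio p(nc|gr) / p(nc'|gr) as in equation (16)."""
    best, best_p = None, -1.0
    for nc in noun_classes:
        nc_prime = hypernym_on_cut[nc]
        p = tcm[nc_prime] * prior[nc] / prior[nc_prime]
        if p > best_p:
            best, best_p = nc, p
    return best, best_p

# Hypothetical run: the "message" sense of letter under matter wins despite
# entity's higher TCM probability, because its prior ratio is much larger
on_cut = {"letter_message": "matter", "letter_owner": "entity"}
tcm = {"matter": 0.009, "entity": 0.016}
prior = {"letter_message": 0.226, "matter": 1.0,   # ratio 0.226 under matter
         "letter_owner": 0.001, "entity": 1.0}     # ratio 0.001 under entity
sense, p = disambiguate_noun(["letter_message", "letter_owner"],
                             tcm, prior, on_cut)
```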

For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc|vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM compared to matter, which is above the correct sense of letter,6 the ratio of prior probabilities for the synset containing letter7 under entity is 0.001, whereas that for the synset under matter is 0.226. This gives a probability of 0.009 × 0.226 = 0.002 for the noun class probability given the verb class

Figure 5: TCM for the direct-object slot of the verb class including sign and ratify. [Diagram: a portion of the noun hierarchy with TCM probabilities at classes such as entity, human_act, communication, message, and matter, above the five WordNet senses of letter (message, alphabetic letter, owner who lets, etc.).]

6 The gloss is "a written message addressed to a person or organization, e.g., wrote an indignant letter to the editor."

7 The gloss is "owner who lets another person use something (housing usually) for hire."



(with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.

4.2 Disambiguating Verbs and Adjectives

Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc|vc, gr) and p(nc|ac, gr), respectively. These are combined with prior estimates for p(nc|gr) and p(vc|gr) (or p(ac|gr)) using Bayes' rule to give

    p(vc|nc, gr) = p(nc|vc, gr) × p(vc|gr) / p(nc|gr)    (17)

and for adjective–noun relations:

    p(ac|nc, adjnoun) = p(nc|ac, adjnoun) × p(ac|adjnoun) / p(nc|adjnoun)    (18)

The prior distributions for p(nc|gr), p(vc|gr), and p(ac|adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |Cn|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc′) totals the frequency at its hyponyms:

    freq(nc′|gr) = Σ_{nc ∈ nc′} freq(nc|gr)    (19)

The distribution over VC is obtained similarly, using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.

To disambiguate a verb occurring with a given noun, the vc from Cv that gives the largest estimate for p(vc|nc, gr) is taken. The nc for the co-occurring noun is the nc from Cn that maximizes this estimate. The estimate for p(nc|vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc|nc, gr), rather than p(nc|vc, gr). An adjective is likewise disambiguated to the ac from all those to which the adjective belongs, using the estimate for p(nc|ac, gr) and selecting the nc that maximizes the p(ac|nc, gr) estimate.
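The verb-disambiguation step of equation (17) amounts to a Bayes inversion over the candidate verb classes; the following sketch uses invented verb classes and probabilities (loosely modeled on the fire example of section 2).

```python
def disambiguate_verb(verb_classes, p_nc_given_vc, p_vc, p_nc):
    """Select the verb class vc maximizing
    p(vc|nc, gr) = p(nc|vc, gr) * p(vc|gr) / p(nc|gr)   (equation (17))."""
    scores = {vc: p_nc_given_vc[vc] * p_vc[vc] / p_nc for vc in verb_classes}
    return max(scores, key=scores.get), scores

# Toy example: two classes of "fire" with a person-denoting direct object
classes = ["fire_sack", "fire_shoot"]
likelihood = {"fire_sack": 0.30, "fire_shoot": 0.05}  # p(nc|vc, dobj)
prior_vc = {"fire_sack": 0.4, "fire_shoot": 0.6}      # p(vc|dobj)
best, scores = disambiguate_verb(classes, likelihood, prior_vc, p_nc=0.2)
# fire_sack: 0.30*0.4/0.2 = 0.6 ; fire_shoot: 0.05*0.6/0.2 = 0.15
```

Since p(nc|gr) is constant across the candidate verb classes, it affects the scores but not which class is selected.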

4.3 Increasing Coverage: One Sense per Discourse

There is a significant limitation to the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations and in argument head position. Moreover, we have TCMs only for adjective and verb classes in which there was at least one adjective or verb member that met our criteria for training (having no more than a threshold of 10 senses in WordNet, and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where we did not have TCMs for one or more classes for the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church, and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document in order to increase coverage. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.
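The OSPD propagation just described can be sketched as follows: a tag is propagated to the remaining tokens of the same (lemma, POS) type in a document only when the preferences supplied exactly one tag for that type. The document structure and sense labels are invented.

```python
from collections import defaultdict

def apply_ospd(tokens, tags):
    """tokens: list of (token_id, lemma, pos) for one discourse
    tags:   {token_id: sense} assigned by the selectional preferences
    Returns tags extended by the one-sense-per-discourse heuristic."""
    senses_by_type = defaultdict(set)
    for tid, lemma, pos in tokens:
        if tid in tags:
            senses_by_type[(lemma, pos)].add(tags[tid])
    extended = dict(tags)
    for tid, lemma, pos in tokens:
        candidates = senses_by_type[(lemma, pos)]
        if tid not in extended and len(candidates) == 1:  # unambiguous only
            extended[tid] = next(iter(candidates))
    return extended

doc = [(1, "letter", "n"), (2, "sign", "v"), (3, "letter", "n")]
tagged = apply_ospd(doc, {1: "letter%message"})
# token 3 inherits letter's tag; the untagged verb "sign" stays untagged
```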



Figure 6: SENSEVAL-2 English all-words task results. [Plot: precision against recall (both 0–100%) for our systems sel, sel-ospd, and sel-ospd-ana, other unsupervised systems, supervised systems, and the first-sense heuristic.]

5 Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise in three variants, under the names "sussex-sel" (selectional preferences), "sussex-sel-ospd" (with the OSPD heuristic), and "sussex-sel-ospd-ana" (with anaphora resolution).8 For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2, compared to other unsupervised systems.9

We have also plotted the results of the supervised systems and the precision and recall achieved by using the most frequent sense (as listed in WordNet).10

In the work reported here, we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective–noun relationships. For each test instance, we applied subject preferences before direct object preferences, and direct object preferences before adjective–noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.

We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences. We therefore

8 The motivation for using anaphora resolution was increased coverage, but anaphora resolution turned out not actually to improve performance.

9 We use unsupervised to refer to systems that do not use manually sense-tagged training data such as SemCor. Our systems, marked in the figure as sel, sel-ospd, and sel-ospd-ana, are unsupervised.

10 We are indebted to Judita Preiss for the most-frequent-sense result. This was obtained using the frequency data supplied with the WordNet 1.7 version prereleased for SENSEVAL-2.



Table 1
Overall results.

               With OSPD    Without OSPD
    Precision  51.1%        52.3%
    Recall     23.2%        20.0%
    Attempted  45.5%        38.3%

Table 2
Precision results by part of speech.

                                             Precision (%)   Baseline precision (%)
    Nouns                                    58.5            51.7
    Polysemous nouns                         36.8            25.8
    Verbs                                    40.9            29.7
    Polysemous verbs                         38.1            25.3
    Adjectives                               49.8            48.6
    Polysemous adjectives                    35.5            24.0
    Nouns, verbs, and adjectives             51.1            44.9
    Polysemous nouns, verbs, and adjectives  36.8            27.3

do not include in the following the coarse-grained results; they are just slightly better than the fine-grained results, which seems to be typical of other systems.

Our latest overall results are shown in Table 1. In this table we show the results both with and without the OSPD heuristic. The results for the English SENSEVAL-2 tasks were generally much lower than those for the original SENSEVAL competition. At the time of the SENSEVAL-2 workshop, this was assumed to be due largely to the use of WordNet as the inventory, as opposed to HECTOR (Atkins 1993), but Palmer, Trang Dang, and Fellbaum (forthcoming) have subsequently shown that, at least for the lexical sample tasks, this was due to a harder selection of words, with a higher average level of polysemy. For three of the most polysemous verbs that overlapped between the English lexical sample for SENSEVAL and SENSEVAL-2, the performance was comparable. Table 2 shows our precision results, including use of the OSPD heuristic, broken down by part of speech. Although the precision for nouns is greater than that for verbs, the difference is much less when we remove the trivial monosemous cases. Nouns, verbs, and adjectives all outperform their random baseline for precision, and the difference is more marked when monosemous instances are dropped.

Table 3 shows the precision results for polysemous words, given the slot and the disambiguation source. Overall, once at least one word token has been disambiguated by the preferences, the OSPD heuristic seems to perform better than the selectional preferences. We can see, however, that although this is certainly true for the nouns, the difference for the adjectives (1.3%) is less marked, and the preferences outperform OSPD for the verbs. It seems that verbs obey the OSPD principle much less than nouns. Also, verbs are best disambiguated by their direct objects, whereas nouns appear to be better disambiguated as subjects and when modified by adjectives.

6 Discussion

6.1 Selectional Preferences

The precision of our system compares well with that of other unsupervised systems on the SENSEVAL-2 English all-words task, despite the fact that these other systems use a



Table 3
Precision results for polysemous words by part of speech and slot or disambiguation source.

                                             Subject (%)  Dobj (%)  Adjm (%)  OSPD (%)
    Polysemous nouns                         33.7         26.8      31.0      49.0
    Polysemous verbs                         33.8         47.3      —         29.8
    Polysemous adjectives                    —            —         35.1      36.4
    Polysemous nouns, verbs, and adjectives  33.4         36.0      31.6      44.8

number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik, containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank, and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy, at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects. We would expect that WSD results using selectional preferences would be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data, the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic was 82.8% (Abney and Light 1999), whereas it was 53.6% for the polysemous direct objects in the all-words task. Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.

Although the accuracy of our system is encouraging, given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense, as listed in WordNet (the "first-sense heuristic"), would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information. Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4).

Table 4
Percentages of words with a different predominant sense in SemCor, across files and genres.

                File    Genre
    Nouns       70      66
    Verbs       79      74
    Adjectives  25      21

For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example, because the word shows "bursty" topic-related behavior.

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense, we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses, for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from training data similar to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example, sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated, but on the cohesiveness of the tuple ⟨pred, arg, gr⟩. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.



Figure 7: Thresholding the probability estimate for the highest-scoring class. [Plot: precision (0–1) against threshold (0–0.01) for the TCMs, the first-sense heuristic, and the random baseline.]

Table 5
Lemma/file combinations in SemCor with more than one sense evident (%).

    Nouns       23
    Verbs       19
    Adjectives  16

6.2 The OSPD Heuristic

In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous11 lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments, the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.

7 Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus, using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of

11 Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.



other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

Selectional preferences work well for some word combinations and grammatical relationships, but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these, at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and where the selectional-preference models provide a high probability for a secondary sense. In these cases, evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING." We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References
Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.
Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15–22.
Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.
Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25–59.
Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48–58, Prague, Czech Republic.
Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447–454.
Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35–41, Bergen, Norway.
Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway.
Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-00), pages 187–193.
Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.
Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.
Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.
Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.
Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.
Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239–248, Bulgaria.
Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.
Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269–281.
McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.
McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.
Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufmann.
Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.
Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.
Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering. Forthcoming.
Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.
Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52–57, Washington, DC.
Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118.
Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465–471.
Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.




We acquire selectional preferences as probability distributions over the WordNet (Fellbaum 1998) noun hyponym hierarchy. The probability distributions are conditioned on a verb or adjective class and a grammatical relationship. A noun is disambiguated by using the preferences to give probability estimates for each of its senses in WordNet, that is, for WordNet synsets. Verbs and adjectives are disambiguated by using the probability distributions and Bayes' rule to obtain an estimate of the probability of the adjective or verb class, given the noun and the grammatical relationship. Previously, we evaluated noun and verb disambiguation on the English all-words task in the SENSEVAL-2 exercise (Cotton et al. 2001). We now present results also using preferences for adjectives, again evaluated on the SENSEVAL-2 test corpus (but carried out after the formal evaluation deadline). The results are encouraging, given that this method does not rely for training on any hand-tagged data or frequency distributions derived from such data. Although a modest amount of English sense-tagged data is available, we nevertheless believe it is important to investigate methods that do not require such data, because there will be languages or texts for which sense-tagged data for a given word is not available or relevant.

2 Motivation

The goal of this article is to assess the WSD performance of selectional preference models for adjectives, verbs, and nouns on the SENSEVAL-2 test corpus. There are two applications for WSD that we have in mind and toward which we are directing our research. The first application is text simplification, as outlined by Carroll, Minnen, Pearce, et al. (1999). One subtask in this application involves substituting words with their more frequent synonyms, for example, substituting letter for missive. Our motivation for using WSD is to filter out inappropriate senses of a word token, so that the substituting synonym is appropriate given the context. For example, in the following sentence we would like to use strategy rather than dodge as a substitute for scheme:

A recent government study singled out the scheme as an example to others

We are also investigating the disambiguation of verb senses in running text before subcategorization information for the verbs is acquired, in order to produce a subcategorization lexicon specific to sense (Preiss and Korhonen 2002). For example, if subcategorization were acquired specific to sense rather than verb form, then distinct senses of fire could have different subcategorization entries:

fire(1) - sack    NP V NP

fire(2) - shoot   NP V NP, NP V

Selectional preferences could also then be acquired automatically from sense-tagged data in an iterative approach (McCarthy 2001).

3 Methodology

We acquire selectional preferences from automatically preprocessed and parsed text during a training phase. The parser is applied to the test data as well, in the run-time phase, to identify grammatical relations among nouns, verbs, and adjectives. The acquired selectional preferences are then applied to the noun-verb and noun-adjective pairs in these grammatical constructions for disambiguation.


McCarthy and Carroll Disambiguating Using Selectional Preferences

Figure 1: System overview. Solid lines indicate flow of data during training; broken lines show flow at run time. [Diagram omitted: training and test data pass through the preprocessor and parser; the extracted grammatical relations (e.g. dobj drink_VV0 glass_NN1, ncsubj drink_VV0 you_PPY) feed selectional preference acquisition, which uses WordNet to build preference models for verb classes such as drink/suck/sip; at run time the disambiguator applies the acquired preferences to produce sense-tagged output (instance ID plus WordNet sense tag).]

The overall structure of the system is illustrated in Figure 1. We describe the individual components in sections 3.1-3.3 and 4.

3.1 Preprocessing

The preprocessor consists of three modules applied in sequence: a tokenizer, a part-of-speech (POS) tagger, and a lemmatizer.

The tokenizer comprises a small set of manually developed finite-state rules for identifying word and sentence boundaries. The tagger (Elworthy 1994) uses a bigram hidden Markov model augmented with a statistical unknown word guesser. When applied to the training data for selectional preference acquisition, it produces the single highest-ranked POS tag for each word. In the run-time phase it returns multiple tag hypotheses, each with an associated forward-backward probability, to reduce the impact of tagging errors. The lemmatizer (Minnen, Carroll and Pearce 2001) reduces inflected verbs and nouns to their base forms. It uses a set of finite-state rules expressing morphological regularities and subregularities, together with a list of exceptions for specific (irregular) word forms.

3.2 Parsing

The parser uses a wide-coverage unification-based shallow grammar of English POS tags and punctuation (Briscoe and Carroll 1995) and performs disambiguation using a context-sensitive probabilistic model (Briscoe and Carroll 1993), recovering from extra-grammaticality by returning partial parses. The output of the parser is a set of grammatical relations (Carroll, Briscoe and Sanfilippo 1998) specifying the syntactic dependency between each head and its dependent(s), taken from the phrase structure tree that is returned from the disambiguation phase.

For selectional preference acquisition we applied the analysis system to the 90 million words of the written portion of the British National Corpus (BNC); the parser produced complete analyses for around 60% of the sentences and partial analyses for over 95% of the remainder. Both in the acquisition phase and at run time, we extract from the analyser output subject-verb, verb-direct object, and noun-adjective


Computational Linguistics Volume 29 Number 4

modifier dependencies.[1] We did not use the SENSEVAL-2 Penn Treebank-style bracketings supplied for the test data.

3.3 Selectional Preference Acquisition

The preferences are acquired for grammatical relations (subject, direct object, and adjective-noun) involving nouns and grammatically related adjectives or verbs. We use WordNet synsets to define our sense inventory. Our method exploits the hyponym links given for nouns (e.g. cheese is a hyponym of food), the troponym links for verbs[2] (e.g. limp is a troponym of walk), and the "similar-to" relationship given for adjectives (e.g. one sense of cheap is similar to flimsy).

The preference models are modifications of the tree cut models (TCMs) originally proposed by Li and Abe (1995, 1998). The main differences between that work and ours are that we acquire adjective as well as verb models, and also that our models are with respect to verb and adjective classes, rather than forms. We acquire models for classes because we are using the models for WSD, whereas Li and Abe used them for structural disambiguation.

We define a TCM as follows. Let NC be the set of noun synsets (noun classes) in WordNet, NC = {nc ∈ WordNet}, and NS be the set of noun senses[3] in WordNet, NS = {ns ∈ WordNet}. A TCM is a set of noun classes that partition NS disjointly. We use Γ to refer to such a set of classes in a TCM. A TCM is defined by Γ and a probability distribution:

    Σ_{nc ∈ Γ} p(nc) = 1    (8)

The probability distribution is conditioned by the grammatical context. In this work the probability distribution associated with a TCM is conditioned on a verb class (vc) and either the subject or direct-object relation, or an adjective class (ac) and the adjective-noun relation. Let VC be the set of verb synsets (verb classes) in WordNet, VC = {vc ∈ WordNet}. Let AC be the set of adjective classes (which subsume WordNet synsets; we elaborate further on this subsequently). Thus the TCMs define a probability distribution over NS that is conditioned on a verb class (vc) or adjective class (ac) and a particular grammatical relation (gr):

    Σ_{nc ∈ Γ} p(nc | vc, gr) = 1    (9)

Acquisition of a TCM for a given vc and gr proceeds as follows. The data for acquiring the preference are obtained from a subset of the tuples involving verbs in the synset or troponym (subordinate) synsets. Not all verbs that are troponyms or direct members of the synset are used in training. We take the noun argument heads occurring with verbs that have no more than 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship. The threshold of 10 senses removes some highly polysemous verbs having many sense distinctions that are rather subtle. Verbs that have more than 10 senses include very frequent verbs, such as be and do, that do not select strongly for their

[1] In a previous evaluation of grammatical-relation accuracy with in-coverage text, the analyzer returned subject-verb and verb-direct object dependencies with 84-88% recall and precision (Carroll, Minnen and Briscoe 1999).

[2] In WordNet, verbs are organized by the troponymy relation, but this is represented with the same hyponym pointer as is used in the noun hierarchy.

[3] We refer to nouns attached to synsets as noun senses.

Figure 2: Verbs not in WordNet, by BNC frequency. [Plot omitted: y-axis, percentage of verbs not in WordNet (0-100); x-axis, frequency rank (0-200).]

arguments. The frequency threshold of 20 is intended to remove noisy data. We set the threshold by examining a plot of BNC frequency and the percentage of verbs at particular frequencies that are not listed in WordNet (Figure 2). Using 20 as a threshold for the subject slot results in only 5% of verbs that are not found in WordNet, whereas 73% of verbs with fewer than 20 BNC occurrences are not present in WordNet.[4]

The selectional-preference models for adjective-noun relations are conditioned on an ac. Each ac comprises a group of adjective WordNet synsets linked by the "similar-to" relation. These groups are formed such that they partition all adjective synsets. Thus AC = {ac ∈ WordNet adjective synsets linked by similar-to}. For example, Figure 3 shows the adjective classes that include the adjective fundamental and that are formed in this way.[5] For selectional-preference models conditioned on adjective classes, we use only those adjectives that have 10 synsets or fewer in WordNet and have 20 or more occurrences in the BNC.

The set of ncs in Γ are selected from all the possibilities in the hyponym hierarchy according to the minimum description length (MDL) principle (Rissanen 1978), as used by Li and Abe (1995, 1998). MDL finds the best TCM by considering the cost (in bits) of describing both the model and the argument head data encoded in the model. The cost (or description length) for a TCM is calculated according to equation (10). The number of parameters of the model is given by k, which is the number of ncs in Γ minus one. N is the sample of the argument head data. The cost of describing each noun argument head (n) is calculated by the log of the probability estimate for that noun:

    description length = model description length + data description length
                       = (k/2) × log |N| − Σ_{n ∈ N} log p(n)    (10)
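As an illustrative sketch (not the authors' implementation), the MDL cost in equation (10) can be computed as follows, assuming the candidate cut is represented as a map from noun classes to probabilities, and each observed argument head is paired with the number of noun senses under its covering class (so that the per-noun estimate is p(nc′)/|ns_nc′|, as in equation (11)):

```python
import math

def description_length(cut_probs, data):
    """MDL cost (in bits) of a candidate tree cut, per equation (10).

    cut_probs: {noun_class_on_cut: probability}, summing to 1.
    data: list of (covering_class, senses_under_class) pairs, one per
          observed argument head; the per-noun estimate is
          p(class) / senses_under_class.
    Illustrative data layout, not the paper's actual code.
    """
    k = len(cut_probs) - 1                 # free parameters: |cut| - 1
    model_dl = (k / 2) * math.log2(len(data))
    data_dl = -sum(math.log2(cut_probs[nc] / size) for nc, size in data)
    return model_dl + data_dl
```

A cut lower in the hierarchy has more classes (larger k, so a higher model cost) but can fit the data more tightly (lower data cost); MDL selects the cut minimizing the sum.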

[4] These threshold values are somewhat arbitrary, but it turns out that our results are not sensitive to the exact values.

[5] For the sake of brevity, not all adjectives are included in this diagram.

Figure 3: Adjective classes that include fundamental. [Diagram omitted: three classes of WordNet adjective synsets linked by similar-to, one containing basic, root, radical, grassroots, underlying, fundamental, rudimentary, and related synsets; one containing of import, important, key, central, fundamental, cardinal, crucial, essential, and related synsets; and one containing profound, fundamental, monumental, epochal, earthshaking, and related synsets.]

The probability estimate for each n is obtained using the estimates for all the nss that n has. Let Cn be the set of ncs that include n as a direct member, Cn = {nc ∈ NC | n ∈ nc}. Let nc′ be the hypernym of nc on Γ (i.e. nc′ ∈ {Γ | nc ∈ nc′}), and let ns_nc′ = {ns ∈ nc′} (i.e. the set of noun senses at and beneath nc′ in the hyponym hierarchy). Then the estimate p(n) is obtained using the estimates for the hypernym classes on Γ for all the Cn that n belongs to:

    p(n) = Σ_{nc ∈ Cn} p(nc′) / |ns_nc′|    (11)

The probability at any particular nc′ is divided by |ns_nc′| to give the estimate for each p(ns) under that nc′.
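A minimal sketch of the estimate in equation (11), assuming the cut classes covering a noun's senses and the sense counts beneath them have been precomputed (all names are illustrative):

```python
def p_noun(noun, cut_probs, covering, senses_under):
    """Equation (11): sum, over the cut classes nc' covering the noun's
    senses, of p(nc') divided by the number of senses under nc'."""
    return sum(cut_probs[nc] / senses_under[nc] for nc in covering[noun])
```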

The probability estimates for the {nc ∈ Γ} (p(nc | vc, gr) or p(nc | ac, gr)) are obtained from the tuples from the data of nouns co-occurring with verbs (or adjectives) belonging to the conditioning vc (or ac) in the specified grammatical relationship (⟨n, v, gr⟩). The frequency credit for a tuple is divided by |Cn| for any n, and by the number of synsets of v, |Cv| (or |Ca| if the gr is adjective-noun):

    freq(nc | vc, gr) = Σ_{v ∈ vc} Σ_{n ∈ nc} freq(n | v, gr) / (|Cn| |Cv|)    (12)

A hypernym nc′ includes the frequency credit attributed to all its hyponyms ({nc ∈ nc′}):

    freq(nc′ | vc, gr) = Σ_{nc ∈ nc′} freq(nc | vc, gr)    (13)

This ensures that the total frequency credit at any Γ across the hyponym hierarchy equals the credit for the conditioning vc. This will be the sum of the frequency credit for all verbs that are direct members or troponyms of the vc, divided by the number of other senses of each of these verbs:

    freq(vc | gr) = Σ_{verb ∈ vc} freq(verb | gr) / |Cv|    (14)

Figure 4: TCMs for the direct-object slot of two verb classes that include the verb seize. [Diagram omitted: both cuts span top-level classes such as entity, group, possession, human_action, and event; the cut for the class {seize, clutch} assigns a substantially higher probability to entity than the cut for {seize, assume, usurp}.]

To ensure that the TCM covers all NS in WordNet, we modify Li and Abe's original scheme by creating hyponym leaf classes below all WordNet's internal classes in the hyponym hierarchy. Each leaf holds the ns previously held at the internal class.

Figure 4 shows portions of two TCMs. The TCMs are similar, as they both contain the verb seize, but the TCM for the class that includes clutch has a higher probability for the entity noun class compared to the class that also includes assume and usurp. This example includes only top-level WordNet classes, although the TCM may use more specific noun classes.

4 Disambiguation

Nouns, adjectives, and verbs are disambiguated by finding the sense (nc, vc, or ac) with the maximum probability estimate in the given context. The method disambiguates nouns and verbs to the WordNet synset level, and adjectives to a coarse-grained level of WordNet synsets linked by the similar-to relation, as described previously.

4.1 Disambiguating Nouns

Nouns are disambiguated when they occur as subjects or direct objects and when modified by adjectives. We obtain a probability estimate for each nc to which the target noun belongs, using the distribution of the TCM associated with the co-occurring verb or adjective and the grammatical relationship.

Li and Abe used TCMs for the task of structural disambiguation. To obtain probability estimates for noun senses occurring at classes beneath hypernyms on the cut, Li and Abe used the probability estimate at the nc′ on the cut divided by the number of ns descendants, as we do when finding Γ during training, so the probability estimate is shared equally among all nouns in the nc′, as in equation (15):

    p(ns ∈ ns_nc′) = p(nc′) / |ns_nc′|    (15)

One problem with doing this is that in cases in which the TCM is quite high in the hierarchy, for example at the entity class, the probability of any ns's occurring under this nc′ on the TCM will be the same and does not allow us to discriminate among senses beneath this level.

For the WSD task we compare the probability estimates at each nc ∈ Cn, so if a noun belongs to several synsets, we compare the probability estimates given the context of these synsets. We obtain estimates for each nc by using the probability of the hypernym nc′ on Γ. Rather than assume that all synsets under a given nc′ on Γ

have the same likelihood of occurrence, we multiply the probability estimate for the hypernym nc′ by the ratio of the prior frequency of the nc for which we seek the estimate, that is p(nc | gr), divided by the prior frequency of the hypernym nc′ (p(nc′ | gr)):

    p(nc ∈ hyponyms of nc′ | vc, gr) = p(nc′ | vc, gr) × p(nc | gr) / p(nc′ | gr)    (16)

These prior estimates are taken from populating the noun hyponym hierarchy with the prior frequency data for the gr, irrespective of the co-occurring verbs. The probability at the hypernym nc′ will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.

Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ Cn that gives the largest estimate for p(nc | vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from Cv. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc′ | vc, gr), and the estimate for nc is obtained as in equation (16).
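This selection can be sketched as a joint argmax over the noun's candidate classes and the verb's classes. The data structures below are illustrative, not the authors' code: tcm[vc][nc_cut] stands for p(nc′ | vc, gr), and prior[·] for the prior p(· | gr).

```python
def disambiguate_noun(noun_senses, verb_classes, tcm, prior):
    """Pick the noun class maximizing p(nc | vc, gr), per equation (16).

    noun_senses: list of (nc, nc_cut) pairs, each candidate synset paired
    with its hypernym on the cut for the relevant TCM.
    """
    best, best_p = None, 0.0
    for vc in verb_classes:
        for nc, nc_cut in noun_senses:
            # p(nc' | vc, gr) scaled by the ratio of prior frequencies
            p = tcm[vc][nc_cut] * prior[nc] / prior[nc_cut]
            if p > best_p:
                best, best_p = nc, p
    return best, best_p
```

In the letter/sign example of section 4.1, the message sense of letter wins because its prior ratio under matter (0.226) outweighs the higher TCM probability of entity combined with the tiny prior ratio (0.001) of the lessor sense.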

For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc | vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM compared to matter, which is above the correct sense of letter,[6] the ratio of prior probabilities for the synset containing letter[7] under entity is 0.001, whereas that for the synset under matter is 0.226. This gives a probability of 0.009 × 0.226 = 0.002 for the noun class probability given the verb class

Figure 5: TCM for the direct-object slot of the verb class including sign and ratify. [Diagram omitted: the cut spans classes such as entity, human_act, abstraction, relation, communication, message, and symbol, with the five WordNet synsets containing letter attached beneath different classes on the cut.]

[6] The gloss is "a written message addressed to a person or organization; e.g. wrote an indignant letter to the editor."

[7] The gloss is "owner who lets another person use something (housing usually) for hire."

(with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.

4.2 Disambiguating Verbs and Adjectives

Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc | vc, gr) and p(nc | ac, gr), respectively. These are combined with prior estimates for p(nc | gr) and p(vc | gr) (or p(ac | gr)) using Bayes' rule to give

    p(vc | nc, gr) = p(nc | vc, gr) × p(vc | gr) / p(nc | gr)    (17)

and for adjective-noun relations

    p(ac | nc, adjnoun) = p(nc | ac, adjnoun) × p(ac | adjnoun) / p(nc | adjnoun)    (18)

The prior distributions for p(nc | gr), p(vc | gr), and p(ac | adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |Cn|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc′) totals the frequency at its hyponyms:

    freq(nc′ | gr) = Σ_{nc ∈ nc′} freq(nc | gr)    (19)

The distribution over VC is obtained similarly, using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.

To disambiguate a verb occurring with a given noun, the vc from Cv that gives the largest estimate for p(vc | nc, gr) is taken. The nc for the co-occurring noun is the nc from Cn that maximizes this estimate. The estimate for p(nc | vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc | nc, gr) rather than p(nc | vc, gr). An adjective is likewise disambiguated to the ac, from all those to which the adjective belongs, using the estimate for p(nc | ac, gr) and selecting the nc that maximizes the p(ac | nc, gr) estimate.
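A sketch of the Bayes' rule inversion in equation (17), again with illustrative structures (tcm[vc][nc] standing for p(nc | vc, gr), and prior_vc/prior_nc for the prior distributions acquired during training):

```python
def disambiguate_verb(verb_classes, noun_classes, tcm, prior_vc, prior_nc):
    """Pick the verb class maximizing p(vc | nc, gr), per equation (17)."""
    best, best_p = None, 0.0
    for vc in verb_classes:
        for nc in noun_classes:
            # Bayes' rule: p(vc | nc, gr) = p(nc | vc, gr) p(vc | gr) / p(nc | gr)
            p = tcm[vc][nc] * prior_vc[vc] / prior_nc[nc]
            if p > best_p:
                best, best_p = vc, p
    return best, best_p
```

The adjective case (equation (18)) is identical in shape, with ac replacing vc and the adjective-noun relation as gr.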

4.3 Increasing Coverage: One Sense per Discourse

There is a significant limitation to the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations and in argument head position. Moreover, we have TCMs only for adjective and verb classes in which there was at least one adjective or verb member that met our criteria for training (having no more than a threshold of 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where we did not have TCMs for one or more classes for the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document in order to increase coverage. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.
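The propagation step just described can be sketched as follows (a hedged illustration; the token and tag representations are invented for the example):

```python
from collections import defaultdict

def apply_ospd(preference_tags, untagged_tokens):
    """One-sense-per-discourse propagation within a single document.

    preference_tags: list of ((lemma, pos), sense) decisions made by the
    selectional preferences; untagged_tokens: (lemma, pos) tokens left
    untagged.  A tag is propagated only when the preferences supplied
    exactly one tag for that (lemma, pos) type in the discourse.
    """
    senses = defaultdict(set)
    for key, sense in preference_tags:
        senses[key].add(sense)
    # keep only word types tagged unambiguously in this discourse
    unique = {k: next(iter(s)) for k, s in senses.items() if len(s) == 1}
    return [(tok, unique.get(tok)) for tok in untagged_tokens]
```

Tokens whose type received conflicting tags (or no tag) are left untagged, mirroring the restriction stated above.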

Figure 6: SENSEVAL-2 English all-words task results. [Plot omitted: precision against recall (both 0-100%) for our systems sel, sel-ospd, and sel-ospd-ana, other unsupervised systems, supervised systems, and the first-sense heuristic.]

5 Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise in three variants, under the names "sussex-sel" (selectional preferences), "sussex-sel-ospd" (with the OSPD heuristic), and "sussex-sel-ospd-ana" (with anaphora resolution).[8] For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2, compared to other unsupervised systems.[9]

We have also plotted the results of the supervised systems and the precision and recall achieved by using the most frequent sense (as listed in WordNet).[10]

In the work reported here, we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective-noun relationships. For each test instance, we applied subject preferences before direct object preferences, and direct object preferences before adjective-noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.

We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences. We therefore

[8] The motivation for using anaphora resolution was increased coverage, but anaphora resolution turned out not actually to improve performance.

[9] We use unsupervised to refer to systems that do not use manually sense-tagged training data such as SemCor. Our systems, marked in the figure as sel, sel-ospd, and sel-ospd-ana, are unsupervised.

[10] We are indebted to Judita Preiss for the most-frequent-sense result. This was obtained using the frequency data supplied with the WordNet 1.7 version prereleased for SENSEVAL-2.

Table 1: Overall results.

                 With OSPD    Without OSPD
    Precision      51.1%         52.3%
    Recall         23.2%         20.0%
    Attempted      45.5%         38.3%

Table 2: Precision results by part of speech.

                                               Precision (%)   Baseline precision (%)
    Nouns                                          58.5              51.7
    Polysemous nouns                               36.8              25.8
    Verbs                                          40.9              29.7
    Polysemous verbs                               38.1              25.3
    Adjectives                                     49.8              48.6
    Polysemous adjectives                          35.5              24.0
    Nouns, verbs, and adjectives                   51.1              44.9
    Polysemous nouns, verbs, and adjectives        36.8              27.3

do not include in the following the coarse-grained results; they are just slightly better than the fine-grained results, which seems to be typical of other systems.

Our latest overall results are shown in Table 1. In this table we show the results both with and without the OSPD heuristic. The results for the English SENSEVAL-2 tasks were generally much lower than those for the original SENSEVAL competition. At the time of the SENSEVAL-2 workshop, this was assumed to be due largely to the use of WordNet as the inventory, as opposed to HECTOR (Atkins 1993), but Palmer, Trang Dang, and Fellbaum (forthcoming) have subsequently shown that, at least for the lexical sample tasks, this was due to a harder selection of words with a higher average level of polysemy. For three of the most polysemous verbs that overlapped between the English lexical sample for SENSEVAL and SENSEVAL-2, the performance was comparable. Table 2 shows our precision results, including use of the OSPD heuristic, broken down by part of speech. Although the precision for nouns is greater than that for verbs, the difference is much less when we remove the trivial monosemous cases. Nouns, verbs, and adjectives all outperform their random baseline for precision, and the difference is more marked when monosemous instances are dropped.

Table 3 shows the precision results for polysemous words given the slot and the disambiguation source. Overall, once at least one word token has been disambiguated by the preferences, the OSPD heuristic seems to perform better than the selectional preferences. We can see, however, that although this is certainly true for the nouns, the difference for the adjectives (1.3%) is less marked, and the preferences outperform OSPD for the verbs. It seems that verbs obey the OSPD principle much less than nouns. Also, verbs are best disambiguated by their direct objects, whereas nouns appear to be better disambiguated as subjects and when modified by adjectives.

6 Discussion

6.1 Selectional Preferences

The precision of our system compares well with that of other unsupervised systems on the SENSEVAL-2 English all-words task, despite the fact that these other systems use a

Table 3: Precision results for polysemous words by part of speech and slot or disambiguation source.

                                               Subject (%)   Dobj (%)   Adjm (%)   OSPD (%)
    Polysemous nouns                               33.7         26.8       31.0       49.0
    Polysemous verbs                               33.8         47.3        —         29.8
    Polysemous adjectives                           —            —         35.1       36.4
    Polysemous nouns, verbs, and adjectives        33.4         36.0       31.6       44.8

number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy, at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects. We would expect that WSD results using selectional preferences would be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third, and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic was 82.8% (Abney and Light 1999), whereas it was 53.6% for the polysemous direct objects in the all-words task. Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.

Although the accuracy of our system is encouraging given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense as listed in WordNet (the "first-sense heuristic") would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information. Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4). For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example because the word shows "bursty" topic-related behavior.

Table 4: Percentages of words with a different predominant sense in SemCor across files and genres.

                  File (%)   Genre (%)
    Nouns            70         66
    Verbs            79         74
    Adjectives       25         21

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense, we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from similar training data to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated but on the cohesiveness of the tuple ⟨pred, arg, gr⟩. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.

Figure 7: Thresholding the probability estimate for the highest-scoring class. [Plot omitted: precision (0-1) against threshold (0-0.01) for the TCMs, the first-sense heuristic, and the random baseline.]

Table 5: Lemma/file combinations in SemCor with more than one sense evident (%).

    Nouns        23
    Verbs        19
    Adjectives   16

6.2 The OSPD Heuristic

In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous[11] lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.

7. Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus, using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

11 Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.

Selectional preferences work well for some word combinations and grammatical relationships, but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and when the selectional-preference models provide a high probability for a secondary sense. In these cases evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING". We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References
Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.

Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15–22.

Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.

Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25–59.

Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48–58, Prague, Czech Republic.

Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447–454.

Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35–41, Bergen, Norway.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway.

Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-00), pages 187–193.

Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.

Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.

Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239–248, Bulgaria.

Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269–281.

McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL '97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.

McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.

Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufmann.

Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering. Forthcoming.

Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.

Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52–57, Washington, DC.

Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118.

Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465–471.

Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.


Figure 1
System overview. Solid lines indicate flow of data during training, and broken lines show that at run time. [Diagram: training data (e.g., "where you can drink a single glass") and test data (e.g., "The men drink here") pass through the preprocessor and parser to produce grammatical relations such as dobj drink_VV0 glass_NN1, ncsubj drink_VV0 you_PPY, and ncmod glass_NN1 single_JJ; selectional preference acquisition uses WordNet to build TCMs for verb classes such as {drink, suck, sip}, which the disambiguator applies to produce sense-tagged output (instance IDs with WordNet sense tags, e.g. d04s12t09 drink%2:34:00).]

The overall structure of the system is illustrated in Figure 1. We describe the individual components in Sections 3.1–3.3 and 4.

3.1 Preprocessing
The preprocessor consists of three modules applied in sequence: a tokenizer, a part-of-speech (POS) tagger, and a lemmatizer.

The tokenizer comprises a small set of manually developed finite-state rules for identifying word and sentence boundaries. The tagger (Elworthy 1994) uses a bigram hidden Markov model augmented with a statistical unknown word guesser. When applied to the training data for selectional preference acquisition, it produces the single highest-ranked POS tag for each word. In the run-time phase, it returns multiple tag hypotheses, each with an associated forward-backward probability, to reduce the impact of tagging errors. The lemmatizer (Minnen, Carroll, and Pearce 2001) reduces inflected verbs and nouns to their base forms. It uses a set of finite-state rules expressing morphological regularities and subregularities, together with a list of exceptions for specific (irregular) word forms.

3.2 Parsing
The parser uses a wide-coverage unification-based shallow grammar of English POS tags and punctuation (Briscoe and Carroll 1995) and performs disambiguation using a context-sensitive probabilistic model (Briscoe and Carroll 1993), recovering from extra-grammaticality by returning partial parses. The output of the parser is a set of grammatical relations (Carroll, Briscoe, and Sanfilippo 1998) specifying the syntactic dependency between each head and its dependent(s), taken from the phrase structure tree that is returned from the disambiguation phase.

For selectional preference acquisition, we applied the analysis system to the 90 million words of the written portion of the British National Corpus (BNC); the parser produced complete analyses for around 60% of the sentences and partial analyses for over 95% of the remainder. Both in the acquisition phase and at run time, we extract from the analyser output subject–verb, verb–direct object, and noun–adjective modifier dependencies.1 We did not use the SENSEVAL-2 Penn Treebank–style bracketings supplied for the test data.

3.3 Selectional Preference Acquisition
The preferences are acquired for grammatical relations (subject, direct object, and adjective–noun) involving nouns and grammatically related adjectives or verbs. We use WordNet synsets to define our sense inventory. Our method exploits the hyponym links given for nouns (e.g., cheese is a hyponym of food), the troponym links for verbs2 (e.g., limp is a troponym of walk), and the "similar-to" relationship given for adjectives (e.g., one sense of cheap is similar to flimsy).

The preference models are modifications of the tree cut models (TCMs) originally proposed by Li and Abe (1995, 1998). The main differences between that work and ours are that we acquire adjective as well as verb models, and also that our models are with respect to verb and adjective classes, rather than forms. We acquire models for classes because we are using the models for WSD, whereas Li and Abe used them for structural disambiguation.

We define a TCM as follows. Let NC be the set of noun synsets (noun classes) in WordNet, NC = {nc ∈ WordNet}, and NS be the set of noun senses3 in WordNet, NS = {ns ∈ WordNet}. A TCM is a set of noun classes that partition NS disjointly. We use Γ to refer to such a set of classes in a TCM. A TCM is defined by Γ and a probability distribution:

\sum_{nc \in \Gamma} p(nc) = 1    (8)

The probability distribution is conditioned by the grammatical context. In this work, the probability distribution associated with a TCM is conditioned on a verb class (vc) and either the subject or direct-object relation, or an adjective class (ac) and the adjective–noun relation. Let VC be the set of verb synsets (verb classes) in WordNet, VC = {vc ∈ WordNet}. Let AC be the set of adjective classes (which subsume WordNet synsets; we elaborate further on this subsequently). Thus the TCMs define a probability distribution over NS that is conditioned on a verb class (vc) or adjective class (ac) and a particular grammatical relation (gr):

\sum_{nc \in \Gamma} p(nc \mid vc, gr) = 1    (9)
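To make the definition concrete, a TCM can be represented as a cut (a set of noun classes) paired with a distribution over those classes that sums to 1, as in equation (8). The sketch below is not the authors' code, and the class names and probabilities are illustrative only:

```python
# Minimal sketch (not the authors' code): a tree cut model as a set of
# WordNet noun classes plus a probability distribution over them that must
# sum to 1 (equation (8)). Class names and values are illustrative.
from dataclasses import dataclass

@dataclass
class TreeCutModel:
    cut: dict[str, float]  # noun class on the cut -> p(nc | vc, gr)

    def validate(self, tol: float = 1e-9) -> bool:
        """The distribution over the classes on the cut must sum to 1."""
        return abs(sum(self.cut.values()) - 1.0) <= tol

tcm = TreeCutModel({"entity": 0.53, "human_action": 0.26, "possession": 0.14,
                    "group": 0.05, "event": 0.02})
assert tcm.validate()
```

Because the classes on the cut partition NS, every noun sense falls under exactly one class, so this single distribution covers the whole noun hierarchy.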

Acquisition of a TCM for a given vc and gr proceeds as follows. The data for acquiring the preference are obtained from a subset of the tuples involving verbs in the synset, or troponym (subordinate) synsets. Not all verbs that are troponyms or direct members of the synset are used in training. We take the noun argument heads occurring with verbs that have no more than 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship. The threshold of 10 senses removes some highly polysemous verbs having many sense distinctions that are rather subtle. Verbs that have more than 10 senses include very frequent verbs, such as be and do, that do not select strongly for their arguments. The frequency threshold of 20 is intended to remove noisy data. We set the threshold by examining a plot of BNC frequency and the percentage of verbs at particular frequencies that are not listed in WordNet (Figure 2). Using 20 as a threshold for the subject slot results in only 5% of verbs not being found in WordNet, whereas 73% of verbs with fewer than 20 BNC occurrences are not present in WordNet.4

1 In a previous evaluation of grammatical-relation accuracy with in-coverage text, the analyzer returned subject–verb and verb–direct object dependencies with 84–88% recall and precision (Carroll, Minnen, and Briscoe 1999).
2 In WordNet, verbs are organized by the troponymy relation, but this is represented with the same hyponym pointer as is used in the noun hierarchy.
3 We refer to nouns attached to synsets as noun senses.

Figure 2
Verbs not in WordNet, by BNC frequency. [Plot: percentage of verbs not in WordNet against frequency rank.]

The selectional-preference models for adjective–noun relations are conditioned on an ac. Each ac comprises a group of adjective WordNet synsets linked by the "similar-to" relation. These groups are formed such that they partition all adjective synsets; thus AC = {ac ∈ WordNet adjective synsets linked by similar-to}. For example, Figure 3 shows the adjective classes that include the adjective fundamental and that are formed in this way.5 For selectional-preference models conditioned on adjective classes, we use only those adjectives that have 10 synsets or fewer in WordNet and have 20 or more occurrences in the BNC.
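Grouping synsets into classes that partition the inventory by following similar-to links amounts to computing connected components. A minimal sketch, not the authors' code, with toy synset identifiers and links standing in for the WordNet data:

```python
# Minimal sketch (not the authors' code) of forming adjective classes (ac):
# adjective synsets linked by similar-to are merged into groups that
# partition all adjective synsets. The synset ids and links are illustrative.
def adjective_classes(synsets, similar_to):
    """synsets: iterable of synset ids; similar_to: list of (s1, s2) links."""
    parent = {s: s for s in synsets}
    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path halving
            s = parent[s]
        return s
    for a, b in similar_to:                  # union the linked synsets
        parent[find(a)] = find(b)
    classes = {}
    for s in synsets:
        classes.setdefault(find(s), set()).add(s)
    return list(classes.values())

groups = adjective_classes(
    ["basic.a.1", "fundamental.s.1", "important.a.1", "key.s.1", "large.a.1"],
    [("basic.a.1", "fundamental.s.1"), ("important.a.1", "key.s.1")],
)
# Three classes: {basic, fundamental}, {important, key}, {large}.
```

A union-find structure keeps the grouping near-linear in the number of links; any synset with no similar-to link forms a singleton class, so the result is a true partition.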

The set of ncs in Γ are selected from all the possibilities in the hyponym hierarchy according to the minimum description length (MDL) principle (Rissanen 1978), as used by Li and Abe (1995, 1998). MDL finds the best TCM by considering the cost (in bits) of describing both the model and the argument head data encoded in the model. The cost (or description length) for a TCM is calculated according to equation (10). The number of parameters of the model is given by k, which is the number of ncs in Γ minus one. N is the sample of the argument head data. The cost of describing each noun argument head (n) is calculated by the log of the probability estimate for that noun:

\text{description length} = \text{model description length} + \text{data description length}
                          = \frac{k}{2} \times \log |N| - \sum_{n \in N} \log p(n)    (10)
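Equation (10) can be computed directly from a candidate cut and the sample of argument heads. The sketch below is not the authors' code; the cut, its class probabilities, and the sample are illustrative, and logs are taken base 2 so the cost is in bits:

```python
# Minimal sketch (not the authors' code) of the MDL cost in equation (10):
# description length = (k/2) * log|N|  -  sum over argument heads of log p(n).
# The cut probabilities and the sample below are illustrative.
import math

def description_length(cut_probs, sample):
    """cut_probs: noun class -> p(nc); sample: list of (noun, p(n)) pairs."""
    k = len(cut_probs) - 1                       # free parameters on the cut
    model_dl = (k / 2) * math.log2(len(sample))  # model description length
    data_dl = -sum(math.log2(p_n) for _, p_n in sample)
    return model_dl + data_dl

sample = [("cheese", 0.01), ("wine", 0.02), ("water", 0.05), ("beer", 0.02)]
cost = description_length({"food": 0.6, "beverage": 0.4}, sample)
```

Search over candidate cuts then keeps the one with the smallest total cost, trading model complexity (more classes on the cut) against fit to the data.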

4 These threshold values are somewhat arbitrary, but it turns out that our results are not sensitive to the exact values.

5 For the sake of brevity, not all adjectives are included in this diagram.

Figure 3
Adjective classes that include fundamental. [Diagram: groups of WordNet adjective synsets linked by similar-to; fundamental appears in a class with basic, root, radical, grassroots, underlying, and rudimentary; in a class with of import/important, key, central, cardinal, crucial, and essential; and in a class with significant/important, monumental, profound, and epochal.]

The probability estimate for each n is obtained using the estimates for all the nss that n has. Let C_n be the set of ncs that include n as a direct member: C_n = {nc ∈ NC | n ∈ nc}. Let nc′ be the hypernym of nc on Γ, and let ns_{nc′} = {ns ∈ nc′} (i.e., the set of noun senses at and beneath nc′ in the hyponym hierarchy). Then the estimate p(n) is obtained using the estimates for the hypernym classes on Γ for all the C_n that n belongs to:

p(n) = \sum_{nc \in C_n} \frac{p(nc')}{|ns_{nc'}|}    (11)

The probability at any particular nc′ is divided by |ns_{nc′}| to give the estimate for each p(ns) under that nc′.

The probability estimates for the {nc ∈ Γ} (p(nc | vc, gr) or p(nc | ac, gr)) are obtained from the tuples from the data of nouns co-occurring with verbs (or adjectives) belonging to the conditioning vc (or ac) in the specified grammatical relationship (⟨n, v, gr⟩). The frequency credit for a tuple is divided by |C_n| for any n, and by the number of synsets of v, |C_v| (or |C_a| if the gr is adjective–noun):

freq(nc \mid vc, gr) = \sum_{v \in vc} \sum_{n \in nc} \frac{freq(n \mid v, gr)}{|C_n| \, |C_v|}    (12)

A hypernym nc′ includes the frequency credit attributed to all its hyponyms ({nc ∈ nc′}):

freq(nc' \mid vc, gr) = \sum_{nc \in nc'} freq(nc \mid vc, gr)    (13)

This ensures that the total frequency credit at any Γ across the hyponym hierarchy equals the credit for the conditioning vc. This will be the sum of the frequency credit for all verbs that are direct members or troponyms of the vc, divided by the number of other senses of each of these verbs:

freq(vc \mid gr) = \sum_{verb \in vc} \frac{freq(verb \mid gr)}{|C_v|}    (14)
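The frequency-credit scheme of equations (12) and (13) can be sketched as a small routine: each tuple's count is split across the noun's classes and the verb's synsets, then propagated up the hierarchy. This is not the authors' code; the tiny hierarchy, sense inventories, and counts are all illustrative:

```python
# Minimal sketch (not the authors' code) of equations (12) and (13):
# frequency credit for each (noun, verb) tuple is split across the noun's
# classes C_n and the verb's synsets C_v, then propagated to hypernyms.
from collections import defaultdict

noun_classes = {"cheese": ["food"], "wine": ["beverage", "color"]}   # C_n
verb_synsets = {"drink": ["drink.v.1", "drink.v.2"]}                 # C_v
hypernym = {"food": "entity", "beverage": "food", "color": "entity"}

def class_freqs(tuples):
    """tuples: list of (noun, verb, count) for one vc and gr."""
    freq = defaultdict(float)
    for n, v, count in tuples:
        credit = count / (len(noun_classes[n]) * len(verb_synsets[v]))
        for nc in noun_classes[n]:
            freq[nc] += credit                   # equation (12)
            parent = hypernym.get(nc)
            while parent:                        # equation (13): propagate up
                freq[parent] += credit
                parent = hypernym.get(parent)
    return dict(freq)

freqs = class_freqs([("cheese", "drink", 4), ("wine", "drink", 4)])
# cheese: 4/(1*2) = 2 to food (and entity); wine: 4/(2*2) = 1 each to
# beverage and color, with beverage's credit propagating to food and entity.
```

Because every hyponym's credit reaches its hypernyms, the totals along any cut agree with the credit for the conditioning verb class, as the text notes.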

Figure 4
TCMs for the direct-object slot of two verb classes that include the verb seize. [Diagram: two cuts across top-level WordNet noun classes under the root (entity, group, event, human_action, possession, with classes such as control, monarchy, throne, party, money, straw, and handle beneath them); each class on a cut is annotated with its probability under the TCM for {seize, clutch} or the TCM for {seize, assume, usurp}.]

To ensure that the TCM covers all NS in WordNet, we modify Li and Abe's original scheme by creating hyponym leaf classes below all WordNet's internal classes in the hyponym hierarchy. Each leaf holds the ns previously held at the internal class.

Figure 4 shows portions of two TCMs. The TCMs are similar, as they both contain the verb seize, but the TCM for the class that includes clutch has a higher probability for the entity noun class, compared to the class that also includes assume and usurp. This example includes only top-level WordNet classes, although the TCM may use more specific noun classes.

4. Disambiguation

Nouns, adjectives, and verbs are disambiguated by finding the sense (nc, vc, or ac) with the maximum probability estimate in the given context. The method disambiguates nouns and verbs to the WordNet synset level, and adjectives to a coarse-grained level of WordNet synsets linked by the similar-to relation, as described previously.

4.1 Disambiguating Nouns
Nouns are disambiguated when they occur as subjects or direct objects and when modified by adjectives. We obtain a probability estimate for each nc to which the target noun belongs, using the distribution of the TCM associated with the co-occurring verb or adjective and the grammatical relationship.

Li and Abe used TCMs for the task of structural disambiguation. To obtain probability estimates for noun senses occurring at classes beneath hypernyms on the cut, Li and Abe used the probability estimate at the nc′ on the cut divided by the number of ns descendants, as we do when finding Γ during training, so the probability estimate is shared equally among all nouns in the nc′, as in equation (15):

p(ns \in ns_{nc'}) = \frac{p(nc')}{|ns_{nc'}|}    (15)

One problem with doing this is that in cases in which the TCM is quite high in the hierarchy, for example at the entity class, the probability of any ns's occurring under this nc′ on the TCM will be the same and does not allow us to discriminate among senses beneath this level.

For the WSD task we compare the probability estimates at each nc ∈ C_n, so if a noun belongs to several synsets, we compare the probability estimates given the context of these synsets. We obtain estimates for each nc by using the probability of the hypernym nc′ on Γ. Rather than assume that all synsets under a given nc′ on Γ have the same likelihood of occurrence, we multiply the probability estimate for the hypernym nc′ by the ratio of the prior frequency of the nc, that is p(nc | gr), for which we seek the estimate, divided by the prior frequency of the hypernym nc′ (p(nc′ | gr)):

p(nc \in \text{hyponyms of } nc' \mid vc, gr) = p(nc' \mid vc, gr) \times \frac{p(nc \mid gr)}{p(nc' \mid gr)}    (16)

These prior estimates are taken from populating the noun hyponym hierarchy with the prior frequency data for the gr, irrespective of the co-occurring verbs. The probability at the hypernym nc′ will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.

Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ C_n that gives the largest estimate for p(nc | vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from C_v. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc′ | vc, gr), and the estimate for nc is obtained as in equation (16).
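The selection rule amounts to an argmax over the noun's candidate classes, scoring each by equation (16). The sketch below is not the authors' code: the class names, cut, and probability tables are illustrative stand-ins (loosely modeled on the letter example that follows, with hypothetical sense labels):

```python
# Minimal sketch (not the authors' code) of noun disambiguation via
# equation (16): choose the noun class in C_n maximizing p(nc | vc, gr),
# where p(nc' | vc, gr) comes from the TCM and priors rescale within nc'.
# All tables below are illustrative stand-ins for the acquired models.

def disambiguate_noun(noun_classes, tcm, cut_hypernym, prior):
    """noun_classes: candidate ncs for the noun (C_n);
    tcm: p(nc' | vc, gr) for classes on the cut;
    cut_hypernym: nc -> its hypernym nc' on the cut;
    prior: p(nc | gr) for all classes."""
    def score(nc):
        nc_prime = cut_hypernym[nc]
        return tcm[nc_prime] * prior[nc] / prior[nc_prime]  # equation (16)
    return max(noun_classes, key=score)

# Toy version of the 'letter' example: the sense under 'matter' wins even
# though 'entity' has more mass on the cut, because of the prior ratios
# (0.226 under matter versus 0.001 under entity).
best = disambiguate_noun(
    ["letter_message", "letter_owner"],
    tcm={"matter": 0.009, "entity": 0.016},
    cut_hypernym={"letter_message": "matter", "letter_owner": "entity"},
    prior={"letter_message": 0.00226, "matter": 0.01,
           "letter_owner": 0.00002, "entity": 0.02},
)
# best == "letter_message"
```

The prior ratio is what lets the method discriminate below a high cut: two senses under the same cut class would otherwise tie, as noted for equation (15).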

For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc | vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM compared to matter, which is above the correct sense of letter,6 the ratio of prior probabilities for the synset containing letter7 under entity is 0.001, whereas that for the synset under matter is 0.226. This gives a probability of 0.009 × 0.226 = 0.002 for the noun class probability given the verb class (with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.

Figure 5
TCM for the direct-object slot of the verb class including sign and ratify. [Diagram: a cut across the WordNet noun hierarchy, with classes such as entity, human_act, communication, message, symbol, and matter annotated with their probabilities (matter at 0.009), above the five WordNet senses of letter.]

6 The gloss is "a written message addressed to a person or organization; e.g., wrote an indignant letter to the editor."
7 The gloss is "owner who lets another person use something (housing usually) for hire."

4.2 Disambiguating Verbs and Adjectives
Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc | vc, gr) and p(nc | ac, gr), respectively. These are combined with prior estimates for p(nc | gr) and p(vc | gr) (or p(ac | gr)) using Bayes' rule to give

p(vc \mid nc, gr) = \frac{p(nc \mid vc, gr) \times p(vc \mid gr)}{p(nc \mid gr)}    (17)

and for adjective–noun relations

p(ac \mid nc, adjnoun) = \frac{p(nc \mid ac, adjnoun) \times p(ac \mid adjnoun)}{p(nc \mid adjnoun)}    (18)

The prior distributions for p(nc | gr), p(vc | gr), and p(ac | adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |C_n|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc′) totals the frequency at its hyponyms:

freq(nc' \mid gr) = \sum_{nc \in nc'} freq(nc \mid gr)    (19)

The distribution over VC is obtained similarly, using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.

To disambiguate a verb occurring with a given noun, the vc from C_v that gives the largest estimate for p(vc | nc, gr) is taken. The nc for the co-occurring noun is the nc from C_n that maximizes this estimate. The estimate for p(nc | vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc | nc, gr), rather than p(nc | vc, gr). An adjective is likewise disambiguated to the ac, from all those to which the adjective belongs, using the estimate for p(nc | ac, gr) and selecting the nc that maximizes the p(ac | nc, gr) estimate.
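The verb-disambiguation step above can be sketched as an argmax over verb classes scored by equation (17). This is not the authors' code; the class names and probability tables below are illustrative stand-ins for the acquired TCMs and priors (loosely echoing the seize example of Figure 4):

```python
# Minimal sketch (not the authors' code) of equation (17): choose the verb
# class in C_v maximizing p(vc | nc, gr) via Bayes' rule, maximizing over the
# co-occurring noun's candidate classes as well. All tables are illustrative.

def disambiguate_verb(verb_classes, noun_classes, p_nc_given_vc, p_vc, p_nc):
    """Return the vc maximizing p(nc | vc, gr) * p(vc | gr) / p(nc | gr)."""
    def score(vc):
        return max(p_nc_given_vc[vc].get(nc, 0.0) * p_vc[vc] / p_nc[nc]
                   for nc in noun_classes)
    return max(verb_classes, key=score)

# Toy example: with an argument head known to be in the 'entity' class, the
# clutch-like class of seize wins over the assume/usurp-like class.
best_vc = disambiguate_verb(
    verb_classes=["seize_clutch", "seize_usurp"],
    noun_classes=["entity"],
    p_nc_given_vc={"seize_clutch": {"entity": 0.53, "possession": 0.07},
                   "seize_usurp": {"entity": 0.14, "possession": 0.26}},
    p_vc={"seize_clutch": 0.4, "seize_usurp": 0.6},
    p_nc={"entity": 0.5, "possession": 0.1},
)
# best_vc == "seize_clutch"
```

Since p(nc | gr) is the same denominator for every vc at a fixed nc, the Bayes inversion only reweights the TCM likelihood by the verb-class prior, which is what makes the scores comparable across classes.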

4.3 Increasing Coverage: One Sense per Discourse
There is a significant limitation to the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations, and in argument head position. Moreover, we have TCMs only for adjective and verb classes in which there was at least one adjective or verb member that met our criteria for training (having no more than a threshold of 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where we did not have TCMs for one or more classes for the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church, and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document in order to increase coverage. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.
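The propagation rule described above can be sketched directly: a tag spreads to untagged same-type occurrences only when the preferences supplied exactly one tag for that (word, POS) pair in the discourse. This is not the authors' code, and the document, positions, and sense labels are illustrative:

```python
# Minimal sketch (not the authors' code) of the one-sense-per-discourse step:
# a tag is propagated to other same-type occurrences in a document only if
# the preferences supplied a single tag for that (word, POS) in the discourse.
from collections import defaultdict

def apply_ospd(tokens, tags):
    """tokens: list of (word, pos) per document position;
    tags: position -> sense tag assigned by the preferences."""
    senses = defaultdict(set)
    for i, key in enumerate(tokens):
        if i in tags:
            senses[key].add(tags[i])
    out = dict(tags)
    for i, key in enumerate(tokens):
        if i not in out and len(senses[key]) == 1:   # no conflicting evidence
            (tag,) = senses[key]
            out[i] = tag
    return out

doc = [("letter", "n"), ("sign", "v"), ("letter", "n"), ("bank", "n")]
tagged = apply_ospd(doc, {0: "letter%message"})
# Position 2 inherits "letter%message"; "bank" stays untagged.
```

The conflicting-evidence check implements the proviso in the text: if two occurrences of a word type received different tags, neither is propagated.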

Figure 6
SENSEVAL-2 English all-words task results. [Plot: recall against precision for our systems (sel, sel-ospd, sel-ospd-ana), the supervised and other unsupervised systems, and the first-sense heuristic.]

5. Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise, in three variants, under the names "sussex-sel" (selectional preferences), "sussex-sel-ospd" (with the OSPD heuristic), and "sussex-sel-ospd-ana" (with anaphora resolution).8 For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2, compared to other unsupervised systems.9

We have also plotted the results of the supervised systems, and the precision and recall achieved by using the most frequent sense (as listed in WordNet).10

In the work reported here, we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective–noun relationships. For each test instance we applied subject preferences before direct object preferences, and direct object preferences before adjective–noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.

We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences. We therefore do not include in the following the coarse-grained results; they are just slightly better than the fine-grained results, which seems to be typical of other systems.

8 The motivation for using anaphora resolution was increased coverage, but anaphora resolution turned out not actually to improve performance.
9 We use unsupervised to refer to systems that do not use manually sense-tagged training data such as SemCor. Our systems, marked in the figure as sel, sel-ospd, and sel-ospd-ana, are unsupervised.
10 We are indebted to Judita Preiss for the most-frequent-sense result. This was obtained using the frequency data supplied with the WordNet 1.7 version prereleased for SENSEVAL-2.

Table 1
Overall results.

              With OSPD    Without OSPD
Precision     51.1%        52.3%
Recall        23.2%        20.0%
Attempted     45.5%        38.3%

Table 2
Precision results by part of speech.

                                          Precision (%)   Baseline precision (%)
Nouns                                     58.5            51.7
Polysemous nouns                          36.8            25.8
Verbs                                     40.9            29.7
Polysemous verbs                          38.1            25.3
Adjectives                                49.8            48.6
Polysemous adjectives                     35.5            24.0
Nouns, verbs, and adjectives              51.1            44.9
Polysemous nouns, verbs, and adjectives   36.8            27.3

Our latest overall results are shown in Table 1. In this table we show the results both with and without the OSPD heuristic. The results for the English SENSEVAL-2 tasks were generally much lower than those for the original SENSEVAL competition. At the time of the SENSEVAL-2 workshop, this was assumed to be due largely to the use of WordNet as the inventory, as opposed to HECTOR (Atkins 1993), but Palmer, Trang Dang, and Fellbaum (forthcoming) have subsequently shown that, at least for the lexical sample tasks, this was due to a harder selection of words, with a higher average level of polysemy. For three of the most polysemous verbs that overlapped between the English lexical sample for SENSEVAL and SENSEVAL-2, the performance was comparable. Table 2 shows our precision results, including use of the OSPD heuristic, broken down by part of speech. Although the precision for nouns is greater than that for verbs, the difference is much less when we remove the trivial monosemous cases. Nouns, verbs, and adjectives all outperform their random baselines for precision, and the difference is more marked when monosemous instances are dropped.

Table 3 shows the precision results for polysemous words, given the slot and the disambiguation source. Overall, once at least one word token has been disambiguated by the preferences, the OSPD heuristic seems to perform better than the selectional preferences. We can see, however, that although this is certainly true for the nouns, the difference for the adjectives (1.3%) is less marked, and the preferences outperform OSPD for the verbs. It seems that verbs obey the OSPD principle much less than nouns. Also, verbs are best disambiguated by their direct objects, whereas nouns appear to be better disambiguated as subjects and when modified by adjectives.

Table 3
Precision results for polysemous words, by part of speech and slot or disambiguation source.

                                          Subject (%)   Dobj (%)   Adjm (%)   OSPD (%)
Polysemous nouns                          33.7          26.8       31.0       49.0
Polysemous verbs                          33.8          47.3       —          29.8
Polysemous adjectives                     —             —          35.1       36.4
Polysemous nouns, verbs, and adjectives   33.4          36.0       31.6       44.8

6. Discussion

6.1 Selectional Preferences
The precision of our system compares well with that of other unsupervised systems on the SENSEVAL-2 English all-words task, despite the fact that these other systems use a number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik, containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank, and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy, at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects; we would expect WSD results using selectional preferences to be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third, and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data, the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic achieved 82.8% (Abney and Light 1999), whereas it achieved 53.6% for the polysemous direct objects in the all-words task. Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.

Although the accuracy of our system is encouraging given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense as listed in WordNet (the "first-sense heuristic") would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information. Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.

McCarthy and Carroll: Disambiguating Using Selectional Preferences

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4). For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example, because the word shows "bursty" topic-related behavior.

Table 4
Percentages of words with a different predominant sense in SemCor across files and genres.

             File    Genre
Nouns         70%     66%
Verbs         79%     74%
Adjectives    25%     21%
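The per-file tally behind this kind of analysis can be sketched as follows; the corpus records and sense labels below are hypothetical stand-ins for SemCor data, not actual counts from the paper.

```python
from collections import Counter, defaultdict

def predominant_senses(tagged_corpus):
    """For each (file, lemma) pair, return the most frequent sense -- the kind
    of per-file first-sense tally summarized in Table 4.  `tagged_corpus` is a
    hypothetical list of (file, lemma, sense) records."""
    counts = defaultdict(Counter)
    for fname, lemma, sense in tagged_corpus:
        counts[(fname, lemma)][sense] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

corpus = [
    ("br-a01", "bank", "finance"), ("br-a01", "bank", "finance"),
    ("br-a01", "bank", "river"),  ("br-j02", "bank", "river"),
]
ps = predominant_senses(corpus)
# "bank" has a different predominant sense in the two files, so it would
# count toward the "File" column of a Table 4-style analysis.
```

A word whose predominant sense differs between files (or between genre groupings of files) is exactly the case where a global first-sense heuristic becomes unreliable.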

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (precision = 8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from training data similar to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example, sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated but also on the cohesiveness of the tuple <pred, arg, gr>. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.
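The thresholding experiment can be sketched as follows; the list of predictions below is invented for illustration and does not reproduce the actual SENSEVAL-2 items.

```python
def precision_above_threshold(predictions, threshold):
    """Precision measured over only those items whose highest-scoring class
    probability reaches the threshold, as plotted in Figure 7.  `predictions`
    holds hypothetical (top_class_probability, is_correct) pairs."""
    kept = [correct for prob, correct in predictions if prob >= threshold]
    return sum(kept) / len(kept) if kept else None

preds = [(0.009, True), (0.0005, False), (0.004, True), (0.002, False)]
# Raising the threshold discards low-confidence items; precision is then
# computed over the retained items only.
for t in (0.0, 0.002, 0.005):
    print(t, precision_above_threshold(preds, t))
```

Sweeping the threshold over the model's probability estimates produces the precision curve of Figure 7.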

Computational Linguistics, Volume 29, Number 4

Figure 7
Thresholding the probability estimate for the highest-scoring class. [Graph: precision (0-1) against the threshold (0-0.01) for the TCM preferences, with the first-sense heuristic and the random baseline shown for comparison.]

Table 5
Lemma/file combinations in SemCor with more than one sense evident.

Nouns        23%
Verbs        19%
Adjectives   16%

6.2 The OSPD Heuristic

In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous[11] lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.

7 Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus, using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

[11] Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.

Selectional preferences work well for some word combinations and grammatical relationships but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and when the selectional-preference models provide a high probability for a secondary sense. In these cases evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING." We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References
Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1-8.
Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15-22.
Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.
Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25-59.
Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48-58, Prague, Czech Republic.
Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447-454.
Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35-41, Bergen, Norway.
Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269-270, Bergen, Norway.
Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-00), pages 187-193.
Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.
Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53-58, Stuttgart, Germany.
Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207-218.
Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415-439.
Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.
Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239-248, Bulgaria.
Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217-244.
Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269-281.
McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52-61.
McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.
Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303-308. Morgan Kaufmann.
Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207-223.
Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40-47.
Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, forthcoming.
Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.
Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52-57, Washington, DC.
Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112-118.
Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465-471.
Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321-349.


modifier dependencies.[1] We did not use the SENSEVAL-2 Penn Treebank-style bracketings supplied for the test data.

3.3 Selectional Preference Acquisition

The preferences are acquired for grammatical relations (subject, direct object, and adjective-noun) involving nouns and grammatically related adjectives or verbs. We use WordNet synsets to define our sense inventory. Our method exploits the hyponym links given for nouns (e.g., cheese is a hyponym of food), the troponym links for verbs[2] (e.g., limp is a troponym of walk), and the "similar-to" relationship given for adjectives (e.g., one sense of cheap is similar to flimsy).

The preference models are modifications of the tree cut models (TCMs) originally proposed by Li and Abe (1995, 1998). The main differences between that work and ours are that we acquire adjective as well as verb models, and also that our models are with respect to verb and adjective classes rather than forms. We acquire models for classes because we are using the models for WSD, whereas Li and Abe used them for structural disambiguation.

We define a TCM as follows. Let NC be the set of noun synsets (noun classes) in WordNet, NC = {nc ∈ WordNet}, and let NS be the set of noun senses[3] in WordNet, NS = {ns ∈ WordNet}. A TCM is a set of noun classes that partition NS disjointly. We use Γ to refer to such a set of classes in a TCM. A TCM is defined by Γ and a probability distribution:

    \sum_{nc \in \Gamma} p(nc) = 1    (8)

The probability distribution is conditioned by the grammatical context. In this work, the probability distribution associated with a TCM is conditioned on a verb class (vc) and either the subject or direct-object relation, or an adjective class (ac) and the adjective-noun relation. Let VC be the set of verb synsets (verb classes) in WordNet, VC = {vc ∈ WordNet}. Let AC be the set of adjective classes (which subsume WordNet synsets; we elaborate further on this subsequently). Thus the TCMs define a probability distribution over NS that is conditioned on a verb class (vc) or adjective class (ac) and a particular grammatical relation (gr):

    \sum_{nc \in \Gamma} p(nc \mid vc, gr) = 1    (9)
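Concretely, a TCM can be represented as a mapping from the classes on the cut to their probabilities. The class names and numbers below are illustrative (loosely modeled on Figure 4 later in the article), not actual acquired values.

```python
# A TCM: a set of noun classes cutting across the hyponym hierarchy, plus a
# probability distribution over those classes.  In the paper the distribution
# is conditioned on a verb or adjective class and a grammatical relation;
# these values are invented placeholders.
tcm_seize_clutch_dobj = {
    "entity": 0.53,
    "group": 0.14,
    "possession": 0.26,
    "human_action": 0.07,
}

def is_valid_tcm(tcm, tol=1e-9):
    """Check equations (8)/(9): the class probabilities must sum to one."""
    return abs(sum(tcm.values()) - 1.0) < tol

print(is_valid_tcm(tcm_seize_clutch_dobj))  # -> True
```

Because the classes on the cut partition NS, every noun sense falls under exactly one cut class, which is what makes the distribution well-defined.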

Acquisition of a TCM for a given vc and gr proceeds as follows. The data for acquiring the preference are obtained from a subset of the tuples involving verbs in the synset or troponym (subordinate) synsets. Not all verbs that are troponyms or direct members of the synset are used in training. We take the noun argument heads occurring with verbs that have no more than 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship. The threshold of 10 senses removes some highly polysemous verbs having many sense distinctions that are rather subtle. Verbs that have more than 10 senses include very frequent verbs such as be and do that do not select strongly for their

[1] In a previous evaluation of grammatical-relation accuracy with in-coverage text, the analyzer returned subject-verb and verb-direct object dependencies with 84-88% recall and precision (Carroll, Minnen, and Briscoe 1999).

[2] In WordNet, verbs are organized by the troponymy relation, but this is represented with the same hyponym pointer as is used in the noun hierarchy.

[3] We refer to nouns attached to synsets as noun senses.


Figure 2
Verbs not in WordNet, by BNC frequency. [Graph: percentage of verbs not in WordNet (0-100%) against frequency rank (0-200).]

arguments. The frequency threshold of 20 is intended to remove noisy data. We set the threshold by examining a plot of BNC frequency against the percentage of verbs at particular frequencies that are not listed in WordNet (Figure 2). Using 20 as a threshold for the subject slot results in only 5% of verbs not found in WordNet, whereas 73% of verbs with fewer than 20 BNC occurrences are not present in WordNet.[4]
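The two training thresholds amount to a simple filter over candidate verbs; the sense and frequency tables below are made up for illustration, not drawn from WordNet or the BNC.

```python
def training_verbs(verbs, wn_senses, bnc_freq, max_senses=10, min_freq=20):
    """Select verbs usable for preference acquisition: at most 10 WordNet
    senses and at least 20 BNC occurrences in the relevant slot.  The lookup
    tables here are hypothetical."""
    return [v for v in verbs
            if wn_senses.get(v, 0) <= max_senses
            and bnc_freq.get(v, 0) >= min_freq]

wn_senses = {"seize": 8, "be": 13, "absquatulate": 1}   # invented counts
bnc_freq = {"seize": 1500, "be": 500000, "absquatulate": 3}
print(training_verbs(["seize", "be", "absquatulate"], wn_senses, bnc_freq))
# -> ['seize']: "be" exceeds the sense threshold, "absquatulate" is too rare
```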

The selectional-preference models for adjective-noun relations are conditioned on an ac. Each ac comprises a group of adjective WordNet synsets linked by the "similar-to" relation. These groups are formed such that they partition all adjective synsets. Thus AC = {ac ∈ WordNet adjective synsets linked by similar-to}. For example, Figure 3 shows the adjective classes that include the adjective fundamental and that are formed in this way.[5] For selectional-preference models conditioned on adjective classes, we use only those adjectives that have 10 synsets or fewer in WordNet and have 20 or more occurrences in the BNC.

The set of ncs in Γ are selected from all the possibilities in the hyponym hierarchy according to the minimum description length (MDL) principle (Rissanen 1978), as used by Li and Abe (1995, 1998). MDL finds the best TCM by considering the cost (in bits) of describing both the model and the argument head data encoded in the model. The cost (or description length) for a TCM is calculated according to equation (10). The number of parameters of the model is given by k, which is the number of ncs in Γ minus one. N is the sample of the argument head data. The cost of describing each noun argument head (n) is calculated by the log of the probability estimate for that noun:

    \text{description length} = \text{model description length} + \text{data description length}
                              = \frac{k}{2} \times \log |N| - \sum_{n \in N} \log p(n)    (10)
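Equation (10) can be computed directly. The probability lists below are toy values standing in for the model's estimates over a four-noun sample under two hypothetical cuts; they are not acquired figures.

```python
import math

def description_length(num_cut_classes, head_probs):
    """Total MDL cost (in bits) of a tree cut model, per equation (10):
    (k/2) * log|N| for the model, plus -sum(log p(n)) for the data, where
    k is the number of classes on the cut minus one and N is the sample."""
    k = num_cut_classes - 1
    model_dl = (k / 2) * math.log2(len(head_probs))
    data_dl = -sum(math.log2(p) for p in head_probs)
    return model_dl + data_dl

# A coarser cut (fewer classes) is cheaper to describe but may fit the
# argument heads worse; MDL selects the cut minimizing the total cost.
coarse = description_length(2, [0.25, 0.25, 0.25, 0.25])
fine = description_length(6, [0.4, 0.4, 0.05, 0.05])
print(coarse, fine)
```

The trade-off is visible in the two terms: adding classes raises the model cost linearly in k while (at best) lowering the data cost.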

[4] These threshold values are somewhat arbitrary, but it turns out that our results are not sensitive to the exact values.

[5] For the sake of brevity, not all adjectives are included in this diagram.


Figure 3
Adjective classes that include fundamental. [Diagram: each class groups adjective synsets linked by the similar-to relation. fundamental appears in a class around basic (with root, radical, grassroots, underlying, rudimentary, primal, elementary, basal, base, ...), in a class around of import / important (with key, central, cardinal, crucial, essential, chief, principal, primary, main, significant, ...), and in a class around significant / important (with monumental, profound, epochal, earthshaking, portentous, prodigious, noteworthy, remarkable, ...).]

The probability estimate for each n is obtained using the estimates for all the nss that n has. Let Cn be the set of ncs that include n as a direct member, Cn = {nc ∈ NC | n ∈ nc}. Let nc′ be the hypernym of nc on Γ, and let ns_{nc′} = {ns ∈ nc′} (i.e., the set of noun senses at and beneath nc′ in the hyponym hierarchy). Then the estimate p(n) is obtained using the estimates for the hypernym classes on Γ for all the Cn that n belongs to:

    p(n) = \sum_{nc \in C_n} \frac{p(nc')}{|ns_{nc'}|}    (11)

The probability at any particular nc′ is divided by |ns_{nc′}| to give the estimate for each p(ns) under that nc′.

The probability estimates for the {nc ∈ Γ} (p(nc | vc, gr) or p(nc | ac, gr)) are obtained from the tuples from the data of nouns co-occurring with verbs (or adjectives) belonging to the conditioning vc (or ac) in the specified grammatical relationship (<n, v, gr>). The frequency credit for a tuple is divided by |Cn| for any n, and by the number of synsets of v, |Cv| (or |Ca| if the gr is adjective-noun):

    freq(nc \mid vc, gr) = \sum_{v \in vc} \sum_{n \in nc} \frac{freq(n \mid v, gr)}{|C_n| \, |C_v|}    (12)

A hypernym nc′ includes the frequency credit attributed to all its hyponyms ({nc ∈ nc′}):

    freq(nc' \mid vc, gr) = \sum_{nc \in nc'} freq(nc \mid vc, gr)    (13)

This ensures that the total frequency credit at any Γ across the hyponym hierarchy equals the credit for the conditioning vc. This will be the sum of the frequency credit for all verbs that are direct members or troponyms of the vc, divided by the number of other senses of each of these verbs:

    freq(vc \mid gr) = \sum_{verb \in vc} \frac{freq(verb \mid gr)}{|C_v|}    (14)
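The frequency-credit split of equation (12) can be sketched as follows; the co-occurrence counts, class memberships, and verb senses are invented toy data.

```python
from collections import defaultdict

def class_frequencies(tuples, noun_classes, verb_senses, target_vc):
    """Accumulate frequency credit per equation (12): each (noun, verb)
    co-occurrence count is split across the noun's classes |Cn| and the
    verb's senses |Cv|.  A hypothetical sketch with toy data."""
    freq = defaultdict(float)
    for (noun, verb), count in tuples.items():
        if verb not in verb_senses or target_vc not in verb_senses[verb]:
            continue
        cn = noun_classes[noun]          # Cn: classes containing the noun
        cv = verb_senses[verb]           # Cv: senses of the verb
        for nc in cn:
            freq[nc] += count / (len(cn) * len(cv))
    return dict(freq)

tuples = {("contract", "seize"): 4, ("money", "seize"): 2}
noun_classes = {"contract": ["possession", "communication"],
                "money": ["possession"]}
verb_senses = {"seize": ["seize_clutch", "seize_usurp"]}
freq = class_frequencies(tuples, noun_classes, verb_senses, "seize_clutch")
print(freq)  # each count is diluted by the ambiguity of both words
```

Propagating these leaf counts up the hierarchy (equation (13)) then conserves the total credit for the conditioning verb class (equation (14)).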


Figure 4
TCMs for the direct-object slot of two verb classes that include the verb seize. [Diagram: two cuts across the WordNet noun hierarchy beneath the root, over classes including entity, group, possession, and human_action; the cut for the class containing seize and clutch assigns a much higher probability to entity (0.53) than the cut for the class containing seize, assume, and usurp.]

To ensure that the TCM covers all NS in WordNet, we modify Li and Abe's original scheme by creating hyponym leaf classes below all WordNet's internal classes in the hyponym hierarchy. Each leaf holds the ns previously held at the internal class.

Figure 4 shows portions of two TCMs. The TCMs are similar, as they both contain the verb seize, but the TCM for the class that includes clutch has a higher probability for the entity noun class compared to the class that also includes assume and usurp. This example includes only top-level WordNet classes, although the TCM may use more specific noun classes.

4 Disambiguation

Nouns, adjectives, and verbs are disambiguated by finding the sense (nc, vc, or ac) with the maximum probability estimate in the given context. The method disambiguates nouns and verbs to the WordNet synset level, and adjectives to a coarse-grained level of WordNet synsets linked by the similar-to relation, as described previously.

4.1 Disambiguating Nouns

Nouns are disambiguated when they occur as subjects or direct objects and when modified by adjectives. We obtain a probability estimate for each nc to which the target noun belongs, using the distribution of the TCM associated with the co-occurring verb or adjective and the grammatical relationship.

Li and Abe used TCMs for the task of structural disambiguation. To obtain probability estimates for noun senses occurring at classes beneath hypernyms on the cut, Li and Abe used the probability estimate at the nc′ on the cut divided by the number of ns descendants (as we do when finding Γ during training), so the probability estimate is shared equally among all nouns in the nc′, as in equation (15):

    p(ns \in ns_{nc'}) = \frac{p(nc')}{|ns_{nc'}|}    (15)

One problem with doing this is that in cases in which the TCM is quite high in the hierarchy, for example, at the entity class, the probability of any ns's occurring under this nc′ on the TCM will be the same, which does not allow us to discriminate among senses beneath this level.

For the WSD task we compare the probability estimates at each nc ∈ Cn, so if a noun belongs to several synsets, we compare the probability estimates given the context of these synsets. We obtain estimates for each nc by using the probability of the hypernym nc′ on Γ. Rather than assume that all synsets under a given nc′ on Γ have the same likelihood of occurrence, we multiply the probability estimate for the hypernym nc′ by the ratio of the prior frequency of the nc for which we seek the estimate (p(nc | gr)) divided by the prior frequency of the hypernym nc′ (p(nc′ | gr)):

    p(nc \in \text{hyponyms of } nc' \mid vc, gr) = p(nc' \mid vc, gr) \times \frac{p(nc \mid gr)}{p(nc' \mid gr)}    (16)

These prior estimates are taken from populating the noun hyponym hierarchy with the prior frequency data for the gr, irrespective of the co-occurring verbs. The probability at the hypernym nc′ will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.

Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ Cn that gives the largest estimate for p(nc | vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from Cv. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc′ | vc, gr), and the estimate for nc is obtained as in equation (16).

For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc | vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM compared to matter, which is above the correct sense of letter,[6] the ratio of prior probabilities for the synset containing letter[7] under entity is 0.001, whereas that for the synset under matter is 0.226. This gives a probability of 0.009 × 0.226 = 0.002 for the noun class probability given the verb class

Figure 5
TCM for the direct-object slot of the verb class including sign and ratify. [Diagram: the cut includes classes such as entity, human_action, communication, message, writing, and matter (with probability 0.009); the five WordNet senses of letter fall beneath different classes on the cut, including the message sense beneath matter and the lessor sense beneath entity.]

[6] The gloss is "a written message addressed to a person or organization; e.g., wrote an indignant letter to the editor."

[7] The gloss is "owner who lets another person use something (housing usually) for hire."


(with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.
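The scoring in this worked example can be sketched as follows. The 0.009 TCM probability and 0.226 prior ratio come from the text above; the class names, the competing sense, and its numbers are assumed placeholders, not values from the acquired models.

```python
def disambiguate_noun(candidate_ncs, hypernym_on_cut, tcm_probs, prior_ratio):
    """Choose the noun class maximizing equation (16): the TCM probability at
    the hypernym on the cut, scaled by the prior ratio p(nc|gr)/p(nc'|gr)."""
    scores = {nc: tcm_probs[hypernym_on_cut[nc]] * prior_ratio[nc]
              for nc in candidate_ncs}
    return max(scores, key=scores.get), scores

# Two senses of "letter": the message sense (under matter on the cut) and
# the lessor sense (under entity).  The entity TCM value is invented.
hypernym_on_cut = {"letter_message": "matter", "letter_lessor": "entity"}
tcm_probs = {"matter": 0.009, "entity": 0.16}
prior_ratio = {"letter_message": 0.226, "letter_lessor": 0.001}

best, scores = disambiguate_noun(["letter_message", "letter_lessor"],
                                 hypernym_on_cut, tcm_probs, prior_ratio)
print(best)  # the prior ratio outweighs entity's higher TCM probability
```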

4.2 Disambiguating Verbs and Adjectives

Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc | vc, gr) and p(nc | ac, gr), respectively. These are combined with prior estimates for p(nc | gr) and p(vc | gr) (or p(ac | gr)), using Bayes' rule, to give

    p(vc \mid nc, gr) = \frac{p(nc \mid vc, gr) \times p(vc \mid gr)}{p(nc \mid gr)}    (17)

and for adjective-noun relations

    p(ac \mid nc, adjnoun) = \frac{p(nc \mid ac, adjnoun) \times p(ac \mid adjnoun)}{p(nc \mid adjnoun)}    (18)
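The Bayesian inversion of equation (17) can be sketched directly; all probability values below are invented placeholders rather than acquired estimates.

```python
def disambiguate_verb(candidate_vcs, p_nc_given_vc, p_vc, p_nc):
    """Choose the verb class maximizing p(vc | nc, gr) via Bayes' rule
    (equation (17)).  The denominator p(nc | gr) is shared, but is kept to
    mirror the equation."""
    scores = {vc: p_nc_given_vc[vc] * p_vc[vc] / p_nc for vc in candidate_vcs}
    return max(scores, key=scores.get)

p_nc_given_vc = {"seize_clutch": 0.02, "seize_usurp": 0.005}  # from the TCMs
p_vc = {"seize_clutch": 0.3, "seize_usurp": 0.7}              # prior p(vc|gr)
print(disambiguate_verb(["seize_clutch", "seize_usurp"],
                        p_nc_given_vc, p_vc, 0.04))  # -> seize_clutch
```

A TCM that fits the observed noun class well can thus overturn a verb-class prior, as the less frequent class wins here.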

The prior distributions for p(nc | gr), p(vc | gr), and p(ac | adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |Cn|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc′) totals the frequency at its hyponyms:

    freq(nc' \mid gr) = \sum_{nc \in nc'} freq(nc \mid gr)    (19)

The distribution over VC is obtained similarly, using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.
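The upward propagation of equation (19) can be sketched recursively; the toy hierarchy and counts below are invented, whereas the real models use the WordNet noun hierarchy.

```python
def propagate(freq, hyponyms):
    """Total frequency at each class = its own credit plus that of all
    classes beneath it (equation (19)).  `hyponyms` maps a class to its
    direct hyponyms in a hypothetical toy hierarchy."""
    def total(nc):
        return freq.get(nc, 0.0) + sum(total(h) for h in hyponyms.get(nc, ()))
    return {nc: total(nc) for nc in set(freq) | set(hyponyms)}

hyponyms = {"entity": ["animal", "artifact"], "animal": ["dog"]}
freq = {"dog": 3.0, "artifact": 2.0, "animal": 1.0}
totals = propagate(freq, hyponyms)
print(totals["entity"], totals["animal"])  # -> 6.0 4.0
```

This is why, as noted above, the probability at a hypernym necessarily totals the probability at its hyponyms.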

To disambiguate a verb occurring with a given noun, the vc from Cv that gives the largest estimate for p(vc | nc, gr) is taken. The nc for the co-occurring noun is the nc from Cn that maximizes this estimate. The estimate for p(nc | vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc | nc, gr), rather than p(nc | vc, gr). An adjective is likewise disambiguated to the ac, from all those to which the adjective belongs, using the estimate for p(nc | ac, gr) and selecting the nc that maximizes the p(ac | nc, gr) estimate.

4.3 Increasing Coverage: One Sense per Discourse

There is a significant limitation to the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations and in argument head position. Moreover, we have TCMs only for adjective and verb classes in which there was at least one adjective or verb member that met our criteria for training (having no more than a threshold of 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where we did not have TCMs for one or more classes for the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church, and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document in order to increase coverage. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.
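The OSPD propagation step described above can be sketched as follows; the token tuples and sense labels are hypothetical, not the paper's tag format.

```python
from collections import defaultdict

def apply_ospd(tokens):
    """One-sense-per-discourse propagation: if the preferences assigned
    exactly one distinct tag to a (word, PoS) type in this document, copy it
    to that type's untagged tokens; conflicting evidence blocks propagation."""
    evidence = defaultdict(set)
    for word, pos, tag in tokens:
        if tag is not None:
            evidence[(word, pos)].add(tag)
    out = []
    for word, pos, tag in tokens:
        tags = evidence[(word, pos)]
        if tag is None and len(tags) == 1:
            tag = next(iter(tags))       # propagate the unique tag
        out.append((word, pos, tag))
    return out

doc = [("letter", "n", "letter%message"), ("letter", "n", None),
       ("bank", "n", "bank%river"), ("bank", "n", "bank%finance"),
       ("bank", "n", None)]
tagged = apply_ospd(doc)
# The second "letter" inherits the unique tag; the untagged "bank" is left
# alone because two different senses were assigned in this discourse.
```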


Figure 6
SENSEVAL-2 English all-words task results. [Graph: recall against precision (both 0-100%) for our systems sel, sel-ospd, and sel-ospd-ana, the other supervised and unsupervised systems, and the first-sense heuristic.]

5 Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise, in three variants, under the names "sussex-sel" (selectional preferences), "sussex-sel-ospd" (with the OSPD heuristic), and "sussex-sel-ospd-ana" (with anaphora resolution).[8] For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2, compared to other unsupervised systems.[9] We have also plotted the results of the supervised systems and the precision and recall achieved by using the most frequent sense (as listed in WordNet).[10]

In the work reported here, we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective-noun relationships. For each test instance, we applied subject preferences before direct object preferences, and direct object preferences before adjective-noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.

We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences. We therefore

[8] The motivation for using anaphora resolution was increased coverage, but anaphora resolution turned out not actually to improve performance.

[9] We use unsupervised to refer to systems that do not use manually sense-tagged training data such as SemCor. Our systems, marked in the figure as sel, sel-ospd, and sel-ospd-ana, are unsupervised.

[10] We are indebted to Judita Preiss for the most-frequent-sense result. This was obtained using the frequency data supplied with the WordNet 1.7 version prereleased for SENSEVAL-2.


Table 1
Overall results.

             With OSPD    Without OSPD
Precision      51.1%         52.3%
Recall         23.2%         20.0%
Attempted      45.5%         38.3%

Table 2
Precision results by part of speech.

                                           Precision (%)   Baseline precision (%)
Nouns                                          58.5               51.7
Polysemous nouns                               36.8               25.8
Verbs                                          40.9               29.7
Polysemous verbs                               38.1               25.3
Adjectives                                     49.8               48.6
Polysemous adjectives                          35.5               24.0
Nouns, verbs, and adjectives                   51.1               44.9
Polysemous nouns, verbs, and adjectives        36.8               27.3

do not include in the following the coarse-grained results; they are just slightly better than the fine-grained results, which seems to be typical of other systems.

Our latest overall results are shown in Table 1. In this table we show the results both with and without the OSPD heuristic. The results for the English SENSEVAL-2 tasks were generally much lower than those for the original SENSEVAL competition. At the time of the SENSEVAL-2 workshop this was assumed to be due largely to the use of WordNet as the inventory, as opposed to HECTOR (Atkins 1993), but Palmer, Trang Dang, and Fellbaum (forthcoming) have subsequently shown that, at least for the lexical sample tasks, this was due to a harder selection of words with a higher average level of polysemy. For three of the most polysemous verbs that overlapped between the English lexical sample for SENSEVAL and SENSEVAL-2, the performance was comparable. Table 2 shows our precision results, including use of the OSPD heuristic, broken down by part of speech. Although the precision for nouns is greater than that for verbs, the difference is much less when we remove the trivial monosemous cases. Nouns, verbs, and adjectives all outperform their random baseline for precision, and the difference is more marked when monosemous instances are dropped.

Table 3 shows the precision results for polysemous words given the slot and the disambiguation source. Overall, once at least one word token has been disambiguated by the preferences, the OSPD heuristic seems to perform better than the selectional preferences. We can see, however, that although this is certainly true for the nouns, the difference for the adjectives (1.3%) is less marked, and the preferences outperform OSPD for the verbs. It seems that verbs obey the OSPD principle much less than nouns. Also, verbs are best disambiguated by their direct objects, whereas nouns appear to be better disambiguated as subjects and when modified by adjectives.

6 Discussion

6.1 Selectional Preferences

The precision of our system compares well with that of other unsupervised systems on the SENSEVAL-2 English all-words task, despite the fact that these other systems use a


Computational Linguistics Volume 29 Number 4

Table 3
Precision results for polysemous words, by part of speech and slot or disambiguation source

                                          Subject (%)   Dobj (%)   Adjm (%)   OSPD (%)
Polysemous nouns                          33.7          26.8       31.0       49.0
Polysemous verbs                          33.8          47.3       —          29.8
Polysemous adjectives                     —             —          35.1       36.4
Polysemous nouns, verbs, and adjectives   33.4          36.0       31.6       44.8

number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik, containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank, and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy, at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects. We would expect that WSD results using selectional preferences would be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third, and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic was 82.8% (Abney and Light 1999), whereas it was 53.6% for the polysemous direct objects in the all-words task. Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.

Although the accuracy of our system is encouraging, given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense as listed in WordNet (the "first-sense heuristic") would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information. Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file, and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4). For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example because the word shows "bursty" topic-related behavior.

Table 4
Percentages of words with a different predominant sense in SemCor across files and genres

            File   Genre
Nouns       70     66
Verbs       79     74
Adjectives  25     21

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves, or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense, we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses, for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from training data similar to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated, but on the cohesiveness of the tuple <pred, arg, gr>. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.
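The filtering strategy described here can be sketched as follows. This is an illustrative sketch, not code from the paper: the function accepts the model's top-ranked sense only when its probability estimate clears a threshold and (optionally) beats the runner-up by a margin, abstaining otherwise.

```python
def tag_with_threshold(candidates, threshold, min_margin=0.0):
    """Return the top-scoring sense only if its probability estimate
    clears `threshold` and beats the runner-up by `min_margin`;
    otherwise abstain (return None) and leave the word untagged."""
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    top_sense, top_p = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_p >= threshold and (top_p - runner_up) >= min_margin:
        return top_sense
    return None

# Hypothetical class probabilities for one test instance:
estimates = {"letter%message": 0.002, "letter%owner": 0.00016}
assert tag_with_threshold(estimates, threshold=0.001) == "letter%message"
assert tag_with_threshold(estimates, threshold=0.01) is None
```

Trading coverage for precision in this way assumes other knowledge sources can handle the abstained instances, as the Discussion suggests.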


[Figure 7: Thresholding the probability estimate for the highest-scoring class. Precision (0 to 1) is plotted against the threshold (0 to 0.01), with curves for the first-sense heuristic, the TCMs, and the random baseline.]

Table 5
Lemma-file combinations in SemCor with more than one sense evident

Nouns       23
Verbs       19
Adjectives  16

6.2 The OSPD Heuristic

In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous[11] lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.

7 Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus, using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of

11. Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.


other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

Selectional preferences work well for some word combinations and grammatical relationships, but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision, and to focus on these at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant, and where the selectional-preference models provide a high probability for a secondary sense. In these cases evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING." We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References

Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.

Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15–22.

Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.

Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25–59.

Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48–58, Prague, Czech Republic.

Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447–454.

Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35–41, Bergen, Norway.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway.

Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-00), pages 187–193.

Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.

Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.

Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239–248, Bulgaria.

Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269–281.

McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.

McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. PhD thesis, University of Sussex.

Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufmann.

Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering. Forthcoming.

Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.

Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52–57, Washington, DC.

Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118.

Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465–471.

Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.



[Figure 2: Verbs not in WordNet by BNC frequency. The percentage of verbs not in WordNet (0 to 100) is plotted against BNC frequency rank (0 to 200).]

arguments. The frequency threshold of 20 is intended to remove noisy data. We set the threshold by examining a plot of BNC frequency against the percentage of verbs at particular frequencies that are not listed in WordNet (Figure 2). Using 20 as a threshold for the subject slot results in only 5% of verbs not found in WordNet, whereas 73% of verbs with fewer than 20 BNC occurrences are not present in WordNet.[4]

The selectional-preference models for adjective-noun relations are conditioned on an ac. Each ac comprises a group of adjective WordNet synsets linked by the "similar-to" relation. These groups are formed such that they partition all adjective synsets; thus AC = {ac ∈ WordNet adjective synsets linked by similar-to}. For example, Figure 3 shows the adjective classes that include the adjective fundamental and that are formed in this way.[5] For selectional-preference models conditioned on adjective classes, we use only those adjectives that have 10 synsets or fewer in WordNet and have 20 or more occurrences in the BNC.
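A minimal sketch of how such a partition might be computed, treating each ac as a connected component of the (symmetric) similar-to graph. The synset identifiers and links below are illustrative toy data, not real WordNet content:

```python
from collections import defaultdict

def adjective_classes(similar_to):
    """Partition adjective synsets into classes (acs) by taking the
    connected components of the symmetric similar-to relation.
    `similar_to` maps a synset id to the synset ids it links to."""
    # Build an undirected adjacency map over all mentioned synsets.
    adj = defaultdict(set)
    for s, links in similar_to.items():
        adj[s]  # ensure synsets with no outgoing links still appear
        for t in links:
            adj[s].add(t)
            adj[t].add(s)
    adj = dict(adj)
    seen, classes = set(), []
    for s in adj:
        if s in seen:
            continue
        # Depth-first search: one connected component = one ac.
        stack, component = [s], set()
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(adj[u] - component)
        seen |= component
        classes.append(component)
    return classes

# Hypothetical similar-to links around "fundamental":
links = {
    "basic.a.01": {"fundamental.a.01", "rudimentary.s.01"},
    "important.a.01": {"key.s.01", "central.s.01"},
}
acs = adjective_classes(links)
assert {"basic.a.01", "fundamental.a.01", "rudimentary.s.01"} in acs
assert len(acs) == 2
```

Whether connected components exactly reproduce WordNet's similar-to groupings is an assumption here; the point is that the resulting classes partition the adjective synsets, as the text requires.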

The set of ncs in Γ are selected from all the possibilities in the hyponym hierarchy according to the minimum description length (MDL) principle (Rissanen 1978), as used by Li and Abe (1995, 1998). MDL finds the best TCM by considering the cost (in bits) of describing both the model and the argument head data encoded in the model. The cost (or description length) for a TCM is calculated according to equation (10). The number of parameters of the model is given by k, which is the number of ncs in Γ minus one. N is the sample of the argument head data. The cost of describing each noun argument head (n) is calculated by the log of the probability estimate for that noun:

\[
\text{description length} = \text{model description length} + \text{data description length}
= \frac{k}{2} \log |N| \;-\; \sum_{n \in N} \log p(n) \tag{10}
\]

4. These threshold values are somewhat arbitrary, but it turns out that our results are not sensitive to the exact values.

5. For the sake of brevity, not all adjectives are included in this diagram.


[Figure 3: Adjective classes that include fundamental. Each class is a group of adjective synsets linked by similar-to: for example, one class containing basic, fundamental, rudimentary, underlying, elementary, etc.; one containing important (of import), key, central, fundamental, cardinal, crucial, essential, etc.; and one containing significant, important, profound, fundamental, monumental, etc.]

The probability estimate for each n is obtained using the estimates for all the noun senses (ns) that n has. Let C_n be the set of ncs that include n as a direct member: C_n = {nc ∈ NC | n ∈ nc}. Let nc′ be the hypernym of nc on Γ, and let ns_{nc′} = {ns ∈ nc′} (i.e., the set of noun senses at and beneath nc′ in the hyponym hierarchy). Then the estimate p(n) is obtained using the estimates for the hypernym classes on Γ for all the C_n that n belongs to:

\[
p(n) = \sum_{nc \in C_n} \frac{p(nc')}{|ns_{nc'}|} \tag{11}
\]

The probability at any particular nc′ is divided by |ns_{nc′}| to give the estimate for each p(ns) under that nc′.

The probability estimates for the {nc ∈ Γ} (p(nc | vc, gr) or p(nc | ac, gr)) are obtained from the tuples from the data of nouns co-occurring with verbs (or adjectives) belonging to the conditioning vc (or ac) in the specified grammatical relationship (<n, v, gr>). The frequency credit for a tuple is divided by |C_n| for any n, and by the number of synsets of v, |C_v| (or |C_a| if the gr is adjective-noun):

\[
\text{freq}(nc \mid vc, gr) = \sum_{v \in vc} \sum_{n \in nc} \frac{\text{freq}(n \mid v, gr)}{|C_n|\,|C_v|} \tag{12}
\]

A hypernym nc′ includes the frequency credit attributed to all its hyponyms ({nc ∈ nc′}):

\[
\text{freq}(nc' \mid vc, gr) = \sum_{nc \in nc'} \text{freq}(nc \mid vc, gr) \tag{13}
\]

This ensures that the total frequency credit at any Γ across the hyponym hierarchy equals the credit for the conditioning vc. This will be the sum of the frequency credit for all verbs that are direct members or troponyms of the vc, divided by the number of other senses of each of these verbs:

\[
\text{freq}(vc \mid gr) = \sum_{\mathit{verb} \in vc} \frac{\text{freq}(\mathit{verb} \mid gr)}{|C_v|} \tag{14}
\]
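The credit-splitting and propagation scheme of equations (12) and (13) can be sketched as follows. The hierarchy, words, and counts are toy examples (in the real system the tuples come from parsed BNC data for one conditioning verb class and slot):

```python
from collections import defaultdict

def class_frequencies(tuples, noun_classes, n_verb_senses, hypernym_chain):
    """freq(nc | vc, gr) for one conditioning verb class and slot:
    each <n, v, gr> count is divided by |Cn| (the noun's classes) and
    |Cv| (the verb's senses), per equation (12), and the credit is then
    propagated up the hyponym hierarchy, per equation (13)."""
    freq = defaultdict(float)
    for n, v, count in tuples:
        classes = noun_classes[n]
        credit = count / (len(classes) * n_verb_senses[v])
        for nc in classes:
            freq[nc] += credit
            for ancestor in hypernym_chain.get(nc, ()):
                freq[ancestor] += credit
    return freq

# Toy data: "seize" has 2 senses; "throne" belongs to 2 classes.
freqs = class_frequencies(
    tuples=[("throne", "seize", 4)],
    noun_classes={"throne": ["seat", "power"]},
    n_verb_senses={"seize": 2},
    hypernym_chain={"seat": ["artifact", "entity"], "power": ["abstraction"]},
)
assert freqs["seat"] == 1.0      # 4 / (2 classes * 2 verb senses)
assert freqs["entity"] == 1.0    # credit propagated to hypernyms
```

Because every hypernym receives the credit of its hyponyms, the totals across any cut equal the credit for the conditioning vc, which is what equation (14) relies on.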


[Figure 4: TCMs for the direct-object slot of two verb classes that include the verb seize: one class also containing assume and usurp, and one also containing clutch. Each TCM assigns probabilities to top-level WordNet noun classes (e.g. entity, group, event, human_action, possession), with the clutch class giving a much higher probability to entity.]

To ensure that the TCM covers all NS in WordNet, we modify Li and Abe's original scheme by creating hyponym leaf classes below all WordNet's internal classes in the hyponym hierarchy. Each leaf holds the ns previously held at the internal class.

Figure 4 shows portions of two TCMs. The TCMs are similar, as they both contain the verb seize, but the TCM for the class that includes clutch has a higher probability for the entity noun class compared to the class that also includes assume and usurp. This example includes only top-level WordNet classes, although the TCM may use more specific noun classes.
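The leaf-class modification described above can be sketched as follows (toy hierarchy, hypothetical class identifiers):

```python
def add_leaf_classes(children, senses_at):
    """For every internal class that both has children and directly
    holds senses, create a new hyponym leaf class and move those senses
    into it, so that a cut can separate a class's own senses from those
    of its descendants (the modification to Li and Abe's scheme)."""
    children = {c: list(kids) for c, kids in children.items()}
    senses_at = dict(senses_at)
    for c in list(children):
        if children[c] and senses_at.get(c):
            leaf = c + "_leaf"
            children[c] = children[c] + [leaf]
            children[leaf] = []
            senses_at[leaf] = senses_at.pop(c)
    return children, senses_at

kids, senses = add_leaf_classes(
    children={"entity": ["object"], "object": []},
    senses_at={"entity": ["entity.n.01"], "object": ["object.n.01"]},
)
assert "entity_leaf" in kids["entity"]
assert senses["entity_leaf"] == ["entity.n.01"] and "entity" not in senses
```

After this transformation every noun sense sits at a leaf, so any cut through the hierarchy covers all of NS, as the text requires.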

4 Disambiguation

Nouns, adjectives, and verbs are disambiguated by finding the sense (nc, vc, or ac) with the maximum probability estimate in the given context. The method disambiguates nouns and verbs to the WordNet synset level, and adjectives to a coarse-grained level of WordNet synsets linked by the similar-to relation, as described previously.

4.1 Disambiguating Nouns

Nouns are disambiguated when they occur as subjects or direct objects and when modified by adjectives. We obtain a probability estimate for each nc to which the target noun belongs, using the distribution of the TCM associated with the co-occurring verb or adjective and the grammatical relationship.

Li and Abe used TCMs for the task of structural disambiguation. To obtain probability estimates for noun senses occurring at classes beneath hypernyms on the cut, Li and Abe used the probability estimate at the nc′ on the cut divided by the number of ns descendants, as we do when finding Γ during training, so the probability estimate is shared equally among all nouns in the nc′, as in equation (15):

\[
p(ns \in ns_{nc'}) = \frac{p(nc')}{|ns_{nc'}|} \tag{15}
\]

One problem with doing this is that in cases in which the TCM is quite high in the hierarchy, for example at the entity class, the probability of any ns occurring under this nc′ on the TCM will be the same, and this does not allow us to discriminate among senses beneath this level.

For the WSD task we compare the probability estimates at each nc ∈ C_n, so if a noun belongs to several synsets we compare the probability estimates given the context of these synsets. We obtain estimates for each nc by using the probability of the hypernym nc′ on Γ. Rather than assume that all synsets under a given nc′ on Γ


have the same likelihood of occurrence, we multiply the probability estimate for the hypernym nc′ by the ratio of the prior frequency of the nc for which we seek the estimate (p(nc | gr)) to the prior frequency of the hypernym nc′ (p(nc′ | gr)):

\[
p(nc \in \text{hyponyms of } nc' \mid vc, gr) = p(nc' \mid vc, gr) \times \frac{p(nc \mid gr)}{p(nc' \mid gr)} \tag{16}
\]

These prior estimates are taken from populating the noun hyponym hierarchy with the prior frequency data for the gr, irrespective of the co-occurring verbs. The probability at the hypernym nc′ will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.

Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ C_n that gives the largest estimate for p(nc | vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from C_v. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc′ | vc, gr), and the estimate for nc is obtained as in equation (16).

For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc | vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM compared to matter, which is above the correct sense of letter,[6] the ratio of prior probabilities for the synset containing letter[7] under entity is 0.001, whereas that for the synset under matter is 0.226. This gives a probability of 0.009 × 0.226 = 0.002 for the noun class probability given the verb class

[Figure 5: TCM for the direct-object slot of the verb class including sign and ratify. The cut passes through noun classes such as entity, human_action, communication, message, and matter, each with an attached probability (e.g. 0.009 for matter); the five WordNet senses of letter sit beneath different classes on the cut.]

6. The gloss is "a written message addressed to a person or organization, e.g. wrote an indignant letter to the editor."

7. The gloss is "owner who lets another person use something (housing usually) for hire."


(with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.
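The scoring in this worked example can be sketched as follows. The class names are shorthand, the prior ratios 0.226 and 0.001 and the TCM value 0.009 for matter come from the example above, and the TCM value assumed for entity is illustrative:

```python
def disambiguate_noun(senses, tcm, prior, cut_hypernym):
    """Choose the noun class maximising equation (16):
    p(nc | vc, gr) = p(nc' | vc, gr) * p(nc | gr) / p(nc' | gr),
    where nc' is the hypernym of nc on the cut."""
    def score(nc):
        anc = cut_hypernym[nc]
        return tcm[anc] * prior[nc] / prior[anc]
    return max(senses, key=score)

# Two competing senses of "letter" as direct object of "sign":
tcm = {"matter": 0.009, "entity": 0.16}          # p(nc' | vc, dobj); 0.16 assumed
prior = {"matter": 1.0, "entity": 1.0,           # normalised so the ratios
         "letter_message": 0.226,                # p(nc | gr) / p(nc' | gr)
         "letter_owner": 0.001}                  # are 0.226 and 0.001
cut_hypernym = {"letter_message": "matter", "letter_owner": "entity"}
best = disambiguate_noun(["letter_message", "letter_owner"],
                         tcm, prior, cut_hypernym)
assert best == "letter_message"                  # 0.009 * 0.226 beats 0.16 * 0.001
```

Weighting by the prior ratio is what lets the lower-probability matter class beat entity here; under equation (15) alone, the high-level entity class would swamp the comparison.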

4.2 Disambiguating Verbs and Adjectives

Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc | vc, gr) and p(nc | ac, gr), respectively. These are combined with prior estimates for p(nc | gr) and p(vc | gr) (or p(ac | gr)) using Bayes' rule to give

\[
p(vc \mid nc, gr) = \frac{p(nc \mid vc, gr) \times p(vc \mid gr)}{p(nc \mid gr)} \tag{17}
\]

and for adjective-noun relations

\[
p(ac \mid nc, \text{adjnoun}) = \frac{p(nc \mid ac, \text{adjnoun}) \times p(ac \mid \text{adjnoun})}{p(nc \mid \text{adjnoun})} \tag{18}
\]

The prior distributions for p(nc | gr), p(vc | gr), and p(ac | adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |C_n|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc′) totals the frequency at its hyponyms:

\[
\text{freq}(nc' \mid gr) = \sum_{nc \in nc'} \text{freq}(nc \mid gr) \tag{19}
\]

The distribution over VC is obtained similarly, using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.

To disambiguate a verb occurring with a given noun, the vc from C_v that gives the largest estimate for p(vc | nc, gr) is taken. The nc for the co-occurring noun is the nc from C_n that maximizes this estimate. The estimate for p(nc | vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc | nc, gr), rather than p(nc | vc, gr). An adjective is likewise disambiguated to the ac, from all those to which the adjective belongs, using the estimate for p(nc | ac, gr) and selecting the nc that maximizes the p(ac | nc, gr) estimate.
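A sketch of this joint maximisation for a verb follows; the class names and numbers are hypothetical, and `tcm[vc][nc]` stands in for the TCM estimate p(nc | vc, gr):

```python
def disambiguate_verb(verb_classes, noun_classes, tcm, prior_vc, prior_nc):
    """Equation (17): pick the vc (searching over the co-occurring
    noun's classes too) that maximises
    p(vc | nc, gr) = p(nc | vc, gr) * p(vc | gr) / p(nc | gr)."""
    best_vc, best_score = None, float("-inf")
    for vc in verb_classes:
        for nc in noun_classes:
            score = tcm[vc].get(nc, 0.0) * prior_vc[vc] / prior_nc[nc]
            if score > best_score:
                best_vc, best_score = vc, score
    return best_vc

# Toy example: object "throne" (class "power") favours the usurp sense.
tcm = {"seize_grab": {"artifact": 0.5}, "seize_usurp": {"power": 0.3}}
best = disambiguate_verb(
    verb_classes=["seize_grab", "seize_usurp"],
    noun_classes=["power"],
    tcm=tcm,
    prior_vc={"seize_grab": 0.5, "seize_usurp": 0.5},
    prior_nc={"power": 0.1},
)
assert best == "seize_usurp"
```

Dividing by the noun-class prior p(nc | gr) is the Bayes inversion: a verb class is preferred when the observed noun class is much more likely under that class's TCM than it is in general.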

4.3 Increasing Coverage: One Sense per Discourse

There is a significant limitation to the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations, and in argument head position. Moreover, we have TCMs only for adjective and verb classes in which there was at least one adjective or verb member that met our criteria for training (having no more than a threshold of 10 senses in WordNet, and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where we did not have TCMs for one or more classes for the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church, and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document in order to increase coverage. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.
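That propagation step can be sketched as follows (hypothetical data structures: keys are (document, word, part of speech, token index) tuples):

```python
from collections import defaultdict

def apply_ospd(preference_tags, all_instances):
    """Propagate a sense tag to every same-type occurrence in the same
    document, but only when the preference-tagged instances of that
    (doc, word, pos) agree on a single sense; conflicting evidence
    blocks propagation, as in the experiments described above."""
    evidence = defaultdict(set)
    for (doc, word, pos, _idx), sense in preference_tags.items():
        evidence[(doc, word, pos)].add(sense)
    tags = dict(preference_tags)
    for inst in all_instances:
        doc, word, pos, _idx = inst
        candidates = evidence[(doc, word, pos)]
        if inst not in tags and len(candidates) == 1:
            tags[inst] = next(iter(candidates))
    return tags

tagged = {("d1", "letter", "n", 0): "letter%message",
          ("d1", "bank", "n", 1): "bank%river",
          ("d1", "bank", "n", 4): "bank%money"}   # conflict: no propagation
out = apply_ospd(tagged, [("d1", "letter", "n", 7), ("d1", "bank", "n", 9)])
assert out[("d1", "letter", "n", 7)] == "letter%message"
assert ("d1", "bank", "n", 9) not in out
```

The conflicting-evidence check mirrors the restriction mentioned in Section 6.2: where a word type received two different tags within one file, the OSPD heuristic is not applied.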


[Figure 6: SENSEVAL-2 English all-words task results. Recall is plotted against precision (both 0 to 100%) for our three system variants (sel, sel-ospd, sel-ospd-ana), the supervised systems, the other unsupervised systems, and the first-sense heuristic.]

5 Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise in three variants, under the names "sussex-sel" (selectional preferences), "sussex-sel-ospd" (with the OSPD heuristic), and "sussex-sel-ospd-ana" (with anaphora resolution).[8] For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2 compared to other unsupervised systems.[9]

We have also plotted the results of the supervised systems and the precision and recall achieved by using the most frequent sense (as listed in WordNet).[10]

In the work reported here we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective-noun relationships. For each test instance we applied subject preferences before direct object preferences, and direct object preferences before adjective-noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.

We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences. We therefore do not include the coarse-grained results; they are just slightly better than the fine-grained results, which seems to be typical of other systems.

8. The motivation for using anaphora resolution was increased coverage, but anaphora resolution turned out not actually to improve performance.

9 We use unsupervised to refer to systems that do not use manually sense-tagged training data such asSemCor Our systems marked in the gure as sel sel-ospd and sel-ospd-ana are unsupervised

10 We are indebted to Judita Preiss for the most-frequent-sense result This was obtained using thefrequency data supplied with the WordNet 17 version prereleased for SENSEVAL-2

649

McCarthy and Carroll Disambiguating Using Selectional Preferences

Table 1Overall results

With OSPD Without OSPDPrecision 511 Precision 523Recall 232 Recall 200Attempted 455 Attempted 383

Table 2Precision results by part of speech

Precision () Baseline precision ()Nouns 585 517

Polysemous nouns 368 258Verbs 409 297

Polysemous verbs 381 253Adjectives 498 486

Polysemous adjectives 355 240Nouns verbs and adjectives 511 449

Polysemous nouns verbs and adjectives 368 273

do not include in the following the coarse-grained results they are just slightly betterthan the ne-grained results which seems to be typical of other systems

Our latest overall results are shown in Table 1 In this table we show the resultsboth with and without the OSPD heuristic The results for the English SENSEVAL-2tasks were generally much lower than those for the original SENSEVAL competitionAt the time of the SENSEVAL-2 workshop this was assumed to be due largely to theuse of WordNet as the inventory as opposed to HECTOR (Atkins 1993) but PalmerTrang Dang and Fellbaum (forthcoming) have subsequently shown that at least for thelexical sample tasks this was due to a harder selection of words with a higher averagelevel of polysemy For three of the most polysemous verbs that overlapped betweenthe English lexical sample for SENSEVAL and SENSEVAL-2 the performance wascomparable Table 2 shows our precision results including use of the OSPD heuristicbroken down by part of speech Although the precision for nouns is greater than thatfor verbs the difference is much less when we remove the trivial monosemous casesNouns verbs and adjectives all outperform their random baseline for precision andthe difference is more marked when monosemous instances are dropped

Table 3 shows the precision results for polysemous words, given the slot and the disambiguation source. Overall, once at least one word token has been disambiguated by the preferences, the OSPD heuristic seems to perform better than the selectional preferences. We can see, however, that although this is certainly true for the nouns, the difference for the adjectives (1.3%) is less marked, and the preferences outperform OSPD for the verbs. It seems that verbs obey the OSPD principle much less than nouns. Also, verbs are best disambiguated by their direct objects, whereas nouns appear to be better disambiguated as subjects and when modified by adjectives.

6 Discussion

6.1 Selectional Preferences

The precision of our system compares well with that of other unsupervised systems on the SENSEVAL-2 English all-words task, despite the fact that these other systems use a


Computational Linguistics Volume 29 Number 4

Table 3
Precision results for polysemous words by part of speech and slot or disambiguation source.

                                          Subject (%)   Dobj (%)   Adjm (%)   OSPD (%)
Polysemous nouns                          33.7          26.8       31.0       49.0
Polysemous verbs                          33.8          47.3       —          29.8
Polysemous adjectives                     —             —          35.1       36.4
Polysemous nouns, verbs, and adjectives   33.4          36.0       31.6       44.8

number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik, containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy, at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects. We would expect that WSD results using selectional preferences would be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third, and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic was 82.8% (Abney and Light 1999), whereas it was 53.6% for the polysemous direct objects in the all-words task. Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.

Although the accuracy of our system is encouraging given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense as listed in WordNet (the "first-sense heuristic") would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information. Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.
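For concreteness, the first-sense heuristic compared against here amounts to nothing more than picking the first-listed (most frequent) sense for every token. The sense inventory below is an invented stand-in, not real WordNet data:

```python
from typing import Optional

# Minimal sketch of the first-sense heuristic: always choose the sense listed
# first in the inventory (i.e., the most frequent in the sense-tagged data
# behind it). Lemmas and sense names here are illustrative only.
SENSE_INVENTORY = {
    ("letter", "n"): ["written_message", "alphabetic_character", "lessor"],
    ("sign", "v"): ["write_signature", "communicate_by_gesture"],
}

def first_sense(lemma: str, pos: str) -> Optional[str]:
    """Return the first-listed sense for (lemma, pos), or None if unknown."""
    senses = SENSE_INVENTORY.get((lemma, pos))
    return senses[0] if senses else None

print(first_sense("letter", "n"))  # written_message
```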

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file, and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4). For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example because the word shows "bursty" topic-related behavior.

Table 4
Percentages of words with a different predominant sense in SemCor across files and genres.

             File (%)   Genre (%)
Nouns        70         66
Verbs        79         74
Adjectives   25         21
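The analysis behind Table 4 can be sketched as: group sense-tagged instances by lemma, compute the predominant sense per file, and ask whether it varies. The file names, lemmas, and sense labels below are invented for illustration; the real analysis runs over SemCor:

```python
from collections import Counter, defaultdict

# Toy (file, lemma, sense) tags standing in for SemCor instances.
instances = [
    ("br-a01", "bank", "bank%river"), ("br-a01", "bank", "bank%river"),
    ("br-b20", "bank", "bank%finance"), ("br-b20", "bank", "bank%finance"),
    ("br-a01", "cancer", "cancer%disease"), ("br-b20", "cancer", "cancer%disease"),
]

by_lemma = defaultdict(Counter)
for fname, lemma, sense in instances:
    by_lemma[lemma][fname, sense] += 1

def predominant_varies(lemma: str) -> bool:
    """True if the most frequent sense of lemma differs between files."""
    per_file = defaultdict(Counter)
    for (fname, sense), n in by_lemma[lemma].items():
        per_file[fname][sense] += n
    tops = {counts.most_common(1)[0][0] for counts in per_file.values()}
    return len(tops) > 1

print(predominant_varies("bank"), predominant_varies("cancer"))  # True False
```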

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense, we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses, for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from training data similar to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated, but on the cohesiveness of the tuple ⟨pred, arg, gr⟩. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.


[Figure: precision (0–1) plotted against the threshold (0–0.01) on the probability estimate for the highest-scoring class, comparing the TCMs with the first-sense heuristic and the random baseline.]

Figure 7
Thresholding the probability estimate for the highest-scoring class.
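The thresholding experiment of Figure 7 reduces to filtering answers by the model's probability for the top-scoring class and measuring precision on the attempted subset. A sketch with invented (probability, correct?) pairs, not the actual SENSEVAL items:

```python
from typing import Optional

# Each prediction: (probability of the highest-scoring class, was it correct).
# These values are invented for illustration.
predictions = [(0.009, True), (0.004, False), (0.002, True),
               (0.0005, False), (0.0001, False)]

def precision_at(threshold: float) -> Optional[float]:
    """Precision over only those items whose top-class probability clears
    the threshold; None if nothing is attempted."""
    attempted = [ok for p, ok in predictions if p >= threshold]
    return sum(attempted) / len(attempted) if attempted else None

print(precision_at(0.0))    # precision over everything: 0.4
print(precision_at(0.003))  # higher-confidence subset: 0.5
```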

Table 5
Lemma/file combinations in SemCor with more than one sense evident.

             (%)
Nouns        23
Verbs        19
Adjectives   16

6.2 The OSPD Heuristic

In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous11 lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.

7 Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus, using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of

11. Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.


other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

Selectional preferences work well for some word combinations and grammatical relationships, but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and where the selectional-preference models provide a high probability for a secondary sense. In these cases evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments

This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING". We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References

Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.

Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15–22.

Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.

Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25–59.

Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48–58, Prague, Czech Republic.

Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447–454.

Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35–41, Bergen, Norway.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway.

Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference of Computational Linguistics (COLING-00), pages 187–193.

Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.

Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.

Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239–248, Bulgaria.

Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269–281.

McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.

McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.

Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufmann.

Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering. Forthcoming.

Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.

Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52–57, Washington, DC.

Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118.

Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465–471.

Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.


[Figure: two WordNet adjective classes containing fundamental, one centered on basic (with related adjectives such as root, radical, grassroots, underlying, rudimentary, elementary) and one on of import/important (with related adjectives such as central, cardinal, crucial, essential, chief, principal, significant).]

Figure 3
Adjective classes that include fundamental.

The probability estimate for each n is obtained using the estimates for all the ns's that n has. Let Cn be the set of ncs that include n as a direct member: Cn = {nc ∈ NC | n ∈ nc}. Let nc' be a hypernym of nc on the cut Γ (i.e., nc' ∈ {Γ | nc ∈ nc'}), and let ns_nc' = {ns ∈ nc'} (i.e., the set of noun senses at and beneath nc' in the hyponym hierarchy). Then the estimate p(n) is obtained using the estimates for the hypernym classes on Γ for all the Cn that n belongs to:

$$p(n) = \sum_{nc \in C_n} \frac{p(nc')}{|ns_{nc'}|} \qquad (11)$$

The probability at any particular nc' is divided by |ns_nc'| to give the estimate for each p(ns) under that nc'.
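As a toy illustration of equation (11): a noun's estimate sums, over the cut classes its senses fall under, each class's probability mass divided equally among the senses beneath that class. The class names, masses, and sense counts below are invented:

```python
# Sketch of equation (11): p(n) = sum over the classes containing a sense of n
# of p(nc') / |ns_{nc'}|. All values are illustrative, not from the paper.
cut = {
    # class on the cut: (probability mass, number of senses at or beneath it)
    "entity": (0.30, 300),
    "abstraction": (0.20, 400),
}
# which cut class each sense of the target noun falls under (invented)
senses_of_noun = {"letter_1": "abstraction", "letter_2": "entity"}

def p_noun(noun_senses: dict) -> float:
    """Sum the per-sense share of each covering cut class's probability."""
    return sum(cut[nc][0] / cut[nc][1] for nc in noun_senses.values())

print(p_noun(senses_of_noun))  # 0.2/400 + 0.3/300 = 0.0015
```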

The probability estimates for the {nc ∈ Γ} (p(nc | vc, gr) or p(nc | ac, gr)) are obtained from the tuples from the data of nouns co-occurring with verbs (or adjectives) belonging to the conditioning vc (or ac) in the specified grammatical relationship (⟨n, v, gr⟩). The frequency credit for a tuple is divided by |Cn| for any n, and by the number of synsets of v, |Cv| (or |Ca| if the gr is adjective-noun):

$$freq(nc \mid vc, gr) = \sum_{v \in vc} \sum_{n \in nc} \frac{freq(n \mid v, gr)}{|C_n|\,|C_v|} \qquad (12)$$

A hypernym nc' includes the frequency credit attributed to all its hyponyms ({nc ∈ nc'}):

$$freq(nc' \mid vc, gr) = \sum_{nc \in nc'} freq(nc \mid vc, gr) \qquad (13)$$

This ensures that the total frequency credit at any Γ across the hyponym hierarchy equals the credit for the conditioning vc. This will be the sum of the frequency credit for all verbs that are direct members or troponyms of the vc, divided by the number of other senses of each of these verbs:

$$freq(vc \mid gr) = \sum_{verb \in vc} \frac{freq(verb \mid gr)}{|C_v|} \qquad (14)$$


[Figure: portions of two TCMs for the direct-object slot, one for the verb class {seize, assume, usurp} and one for {seize, clutch}, showing probabilities over WordNet noun classes including entity, group, possession, human_action, and event.]

Figure 4
TCMs for the direct-object slot of two verb classes that include the verb seize.

To ensure that the TCM covers all NS in WordNet, we modify Li and Abe's original scheme by creating hyponym leaf classes below all WordNet's internal classes in the hyponym hierarchy. Each leaf holds the ns previously held at the internal class.

Figure 4 shows portions of two TCMs. The TCMs are similar, as they both contain the verb seize, but the TCM for the class that includes clutch has a higher probability for the entity noun class compared to the class that also includes assume and usurp. This example includes only top-level WordNet classes, although the TCM may use more specific noun classes.

4 Disambiguation

Nouns, adjectives, and verbs are disambiguated by finding the sense (nc, vc, or ac) with the maximum probability estimate in the given context. The method disambiguates nouns and verbs to the WordNet synset level, and adjectives to a coarse-grained level of WordNet synsets linked by the similar-to relation, as described previously.

4.1 Disambiguating Nouns

Nouns are disambiguated when they occur as subjects or direct objects and when modified by adjectives. We obtain a probability estimate for each nc to which the target noun belongs, using the distribution of the TCM associated with the co-occurring verb or adjective and the grammatical relationship.

Li and Abe used TCMs for the task of structural disambiguation. To obtain probability estimates for noun senses occurring at classes beneath hypernyms on the cut, Li and Abe used the probability estimate at the nc' on the cut divided by the number of ns descendants, as we do when finding Γ during training, so the probability estimate is shared equally among all nouns in the nc', as in equation (15):

$$p(ns \in ns_{nc'}) = \frac{p(nc')}{|ns_{nc'}|} \qquad (15)$$

One problem with doing this is that in cases in which the TCM is quite high in the hierarchy, for example at the entity class, the probability of any ns's occurring under this nc' on the TCM will be the same, which does not allow us to discriminate among senses beneath this level.

For the WSD task we compare the probability estimates at each nc ∈ Cn, so if a noun belongs to several synsets, we compare the probability estimates given the context of these synsets. We obtain estimates for each nc by using the probability of the hypernym nc' on Γ. Rather than assume that all synsets under a given nc' on Γ


have the same likelihood of occurrence, we multiply the probability estimate for the hypernym nc' by the ratio of the prior frequency of the nc for which we seek the estimate, that is, p(nc | gr), divided by the prior frequency of the hypernym nc' (p(nc' | gr)):

$$p(nc \in \text{hyponyms of } nc' \mid vc, gr) = p(nc' \mid vc, gr) \times \frac{p(nc \mid gr)}{p(nc' \mid gr)} \qquad (16)$$

These prior estimates are taken from populating the noun hyponym hierarchy with the prior frequency data for the gr, irrespective of the co-occurring verbs. The probability at the hypernym nc' will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.
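A toy sketch of equation (16): the TCM probability at a cut class is rescaled for each hyponym by the ratio of prior frequencies, so a rare synset under a probable class can still score below a commoner synset under a less probable class. All class names and numbers below are invented:

```python
# Sketch of equation (16): p(nc | vc, gr) for a hyponym nc of cut class nc'
# is p(nc' | vc, gr) scaled by the prior ratio p(nc | gr) / p(nc' | gr).
# Values are illustrative only.
tcm_prob_at_cut = {"entity": 0.009, "matter": 0.014}   # p(nc' | vc, gr)
prior = {"entity": 0.40, "matter": 0.031,              # p(. | gr)
         "lessor": 0.0004, "written_message": 0.007}
cut_ancestor = {"lessor": "entity", "written_message": "matter"}

def p_nc_given_vc(nc: str) -> float:
    """Equation (16): cut-class probability scaled by the prior ratio."""
    anc = cut_ancestor[nc]
    return tcm_prob_at_cut[anc] * prior[nc] / prior[anc]

# The rare synset under the large entity class scores lower than the
# commoner synset under matter, despite entity's larger prior.
print(p_nc_given_vc("lessor") < p_nc_given_vc("written_message"))  # True
```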

Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ Cn that gives the largest estimate for p(nc | vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from Cv. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc' | vc, gr), and the estimate for nc is obtained as in equation (16).

For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc | vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM compared to matter, which is above the correct sense of letter,6 the ratio of prior probabilities for the synset containing letter7 under entity is 0.001, whereas that for the synset under matter is 0.226. This gives a probability of 0.009 × 0.226 = 0.002 for the noun class probability given the verb class

[Figure: TCM for the direct-object slot of the verb class including sign and ratify, with probabilities over noun classes such as entity, human_act, communication, and message, above the five WordNet senses of letter.]

Figure 5
TCM for the direct-object slot of the verb class including sign and ratify.

6. The gloss is "a written message addressed to a person or organization; e.g., wrote an indignant letter to the editor."

7. The gloss is "owner who lets another person use something (housing usually) for hire."


(with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.
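The arithmetic of this example can be checked directly from the numbers quoted in the text:

```python
# The letter example: the TCM probability at the matter class (0.009) is
# scaled by the prior ratio for the correct synset beneath it (0.226).
p_matter_letter = 0.009 * 0.226
print(round(p_matter_letter, 3))  # 0.002, as in the text
```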

4.2 Disambiguating Verbs and Adjectives

Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc | vc, gr) and p(nc | ac, gr), respectively. These are combined with prior estimates for p(nc | gr) and p(vc | gr) (or p(ac | gr)) using Bayes' rule to give

$$p(vc \mid nc, gr) = \frac{p(nc \mid vc, gr) \times p(vc \mid gr)}{p(nc \mid gr)} \qquad (17)$$

and for adjective-noun relations

$$p(ac \mid nc, adjnoun) = \frac{p(nc \mid ac, adjnoun) \times p(ac \mid adjnoun)}{p(nc \mid adjnoun)} \qquad (18)$$

The prior distributions for p(nc | gr), p(vc | gr), and p(ac | adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |Cn|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc') totals the frequency at its hyponyms:

$$freq(nc' \mid gr) = \sum_{nc \in nc'} freq(nc \mid gr) \qquad (19)$$

The distribution over VC is obtained similarly, using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.

To disambiguate a verb occurring with a given noun, the vc from Cv that gives the largest estimate for p(vc | nc, gr) is taken. The nc for the co-occurring noun is the nc from Cn that maximizes this estimate. The estimate for p(nc | vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc | nc, gr), rather than p(nc | vc, gr). An adjective is likewise disambiguated to the ac, from all those to which the adjective belongs, using the estimate for p(nc | ac, gr) and selecting the nc that maximizes the p(ac | nc, gr) estimate.
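Selecting the verb class by equation (17) is an argmax over the numerator, since the denominator p(nc | gr) is constant across verb classes. A sketch with invented TCM and prior values (the class names loosely follow the seize example of Figure 4):

```python
# Sketch of verb disambiguation via equation (17):
# p(vc | nc, gr) = p(nc | vc, gr) * p(vc | gr) / p(nc | gr).
# All probabilities are illustrative, not from the paper.
def best_verb_class(tcm: dict, prior_vc: dict) -> str:
    """Argmax over vc of p(nc|vc, gr) * p(vc|gr); the denominator p(nc|gr)
    is the same for every vc, so it can be dropped."""
    return max(tcm, key=lambda vc: tcm[vc] * prior_vc[vc])

tcm_p_nc = {"seize_assume_usurp": 0.53, "seize_clutch": 0.07}  # p(nc|vc, gr)
prior = {"seize_assume_usurp": 0.3, "seize_clutch": 0.2}       # p(vc|gr)

print(best_verb_class(tcm_p_nc, prior))  # seize_assume_usurp
```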

4.3 Increasing Coverage: One Sense per Discourse

There is a significant limitation to the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations and in argument head position. Moreover, we have TCMs only for adjective and verb classes in which there was at least one adjective or verb member that met our criteria for training (having no more than a threshold of 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where we did not have TCMs for one or more classes for the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church, and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document in order to increase coverage. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.
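The OSPD propagation step just described can be sketched directly: collect the tags the preferences assigned within a document, and copy a tag to untagged occurrences of the same (word, part-of-speech) type only when exactly one tag was seen for it. A minimal sketch, with invented tokens and sense labels:

```python
# Sketch of one-sense-per-discourse (OSPD) propagation as described: a sense
# tag spreads to other same-type occurrences in the document only if the
# preferences supplied a single, unconflicting tag for that word type.
def apply_ospd(tokens, tags):
    """tokens: list of (word, pos); tags: dict of index -> sense or None."""
    seen = {}
    for i, (word, pos) in enumerate(tokens):
        if tags.get(i) is not None:
            seen.setdefault((word, pos), set()).add(tags[i])
    out = dict(tags)
    for i, (word, pos) in enumerate(tokens):
        candidates = seen.get((word, pos), set())
        if out.get(i) is None and len(candidates) == 1:
            out[i] = next(iter(candidates))  # propagate the single tag
    return out

tokens = [("letter", "n"), ("sign", "v"), ("letter", "n")]
tags = {0: "letter%message", 1: None, 2: None}
print(apply_ospd(tokens, tags))  # index 2 inherits letter%message
```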


[Figure: recall against precision (both 0–100%) on the SENSEVAL-2 English all-words task, plotting sel, sel-ospd, and sel-ospd-ana alongside the supervised systems, other unsupervised systems, and the first-sense heuristic.]

Figure 6
SENSEVAL-2 English all-words task results.

5 Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise, in three variants, under the names "sussex-sel" (selectional preferences), "sussex-sel-ospd" (with the OSPD heuristic), and "sussex-sel-ospd-ana" (with anaphora resolution).8 For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2, compared to other unsupervised systems.9

We have also plotted the results of the supervised systems and the precision and recall achieved by using the most frequent sense (as listed in WordNet).10

In the work reported here, we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective-noun relationships. For each test instance, we applied subject preferences before direct object preferences, and direct object preferences before adjective-noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.

We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences.

8. The motivation for using anaphora resolution was increased coverage, but anaphora resolution turned out not actually to improve performance.

9 We use unsupervised to refer to systems that do not use manually sense-tagged training data such asSemCor Our systems marked in the gure as sel sel-ospd and sel-ospd-ana are unsupervised

10 We are indebted to Judita Preiss for the most-frequent-sense result This was obtained using thefrequency data supplied with the WordNet 17 version prereleased for SENSEVAL-2

649

McCarthy and Carroll Disambiguating Using Selectional Preferences

Table 1Overall results

With OSPD Without OSPDPrecision 511 Precision 523Recall 232 Recall 200Attempted 455 Attempted 383

Table 2Precision results by part of speech

Precision () Baseline precision ()Nouns 585 517

Polysemous nouns 368 258Verbs 409 297

Polysemous verbs 381 253Adjectives 498 486

Polysemous adjectives 355 240Nouns verbs and adjectives 511 449

Polysemous nouns verbs and adjectives 368 273

do not include in the following the coarse-grained results they are just slightly betterthan the ne-grained results which seems to be typical of other systems

Our latest overall results are shown in Table 1 In this table we show the resultsboth with and without the OSPD heuristic The results for the English SENSEVAL-2tasks were generally much lower than those for the original SENSEVAL competitionAt the time of the SENSEVAL-2 workshop this was assumed to be due largely to theuse of WordNet as the inventory as opposed to HECTOR (Atkins 1993) but PalmerTrang Dang and Fellbaum (forthcoming) have subsequently shown that at least for thelexical sample tasks this was due to a harder selection of words with a higher averagelevel of polysemy For three of the most polysemous verbs that overlapped betweenthe English lexical sample for SENSEVAL and SENSEVAL-2 the performance wascomparable Table 2 shows our precision results including use of the OSPD heuristicbroken down by part of speech Although the precision for nouns is greater than thatfor verbs the difference is much less when we remove the trivial monosemous casesNouns verbs and adjectives all outperform their random baseline for precision andthe difference is more marked when monosemous instances are dropped

Table 3 shows the precision results for polysemous words given the slot and thedisambiguation source Overall once at least one word token has been disambiguatedby the preferences the OSPD heuristic seems to perform better than the selectionalpreferences We can see however that although this is certainly true for the nounsthe difference for the adjectives (13) is less marked and the preferences outperformOSPD for the verbs It seems that verbs obey the OSPD principle much less than nounsAlso verbs are best disambiguated by their direct objects whereas nouns appear tobe better disambiguated as subjects and when modied by adjectives

6 Discussion

61 Selectional PreferencesThe precision of our system compares well with that of other unsupervised systems onthe SENSEVAL-2 English all-words task despite the fact that these other systems use a

650

Computational Linguistics Volume 29 Number 4

Table 3Precision results for polysemous words by part of speech and slot or disambiguation source

Subject () Dobj () Adjm () OSPD ()Polysemous nouns 337 268 310 490Polysemous verbs 338 473 mdash 298Polysemous adjectives mdash mdash 351 364Polysemous nouns verbs and adjectives 334 360 316 448

number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank, and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy, at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects. We would expect that WSD results using selectional preferences would be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third, and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic was 82.8% (Abney and Light 1999), whereas it was 53.6% for the polysemous direct objects in the all-words task. Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.

Although the accuracy of our system is encouraging, given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense as listed in WordNet (the "first-sense heuristic") would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information. Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file and found that a large proportion of words have a different first sense in different files, and also in different genres (see Table 4). For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example because the word shows "bursty" topic-related behavior.

McCarthy and Carroll, Disambiguating Using Selectional Preferences

Table 4
Percentages of words with a different predominant sense in SemCor across files and genres.

             File (%)   Genre (%)
Nouns           70         66
Verbs           79         74
Adjectives      25         21
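As a concrete sketch, the per-file check behind this analysis might look like the following. The sense labels and per-file tallies are invented for illustration; this is not the actual SemCor analysis code.

```python
from collections import Counter

def predominant_sense(counts: Counter) -> str:
    """Return the most frequent sense in one file's tally of sense tags."""
    return counts.most_common(1)[0][0]

def varies_across_files(per_file_counts) -> bool:
    """True if a lemma's predominant (first) sense differs between files."""
    firsts = {predominant_sense(c) for c in per_file_counts if c}
    return len(firsts) > 1

# Hypothetical per-file sense tallies for the noun "letter".
file_a = Counter({"letter%message": 5, "letter%alphabetic": 1})
file_b = Counter({"letter%alphabetic": 4, "letter%message": 2})

print(varies_across_files([file_a, file_b]))  # → True
```

Run over every polysemous lemma occurring in more than one file, the proportion of `True` results gives the kind of percentages reported in Table 4.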

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense, we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses, for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from similar training data to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated, but on the cohesiveness of the tuple <pred, arg, gr>. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.
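A minimal sketch of this thresholding step, with invented probabilities and sense labels: items are attempted only when the model's probability for the highest-scoring class clears the threshold, and precision is measured over the attempted items.

```python
def precision_above_threshold(items, threshold):
    """items: (top_class_probability, predicted_sense, gold_sense) triples.
    Attempt only items whose model probability clears the threshold;
    return precision over the attempted items (None if none attempted)."""
    attempted = [(p, pred, gold) for p, pred, gold in items if p >= threshold]
    if not attempted:
        return None
    correct = sum(1 for _, pred, gold in attempted if pred == gold)
    return correct / len(attempted)

# Invented test instances: (probability, predicted, gold).
items = [(0.009, "s1", "s1"), (0.004, "s2", "s3"), (0.0005, "s1", "s2")]
print(precision_above_threshold(items, 0.002))  # → 0.5 (1 of 2 attempted correct)
```

Sweeping the threshold over a range of values yields a precision curve of the kind plotted in Figure 7, at the cost of attempting fewer items as the threshold rises.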



[Figure 7: Thresholding the probability estimate for the highest-scoring class. Precision is plotted against the threshold, for the TCMs, the first-sense heuristic, and the random baseline.]

Table 5
Lemma and file combinations in SemCor with more than one sense evident.

Nouns        23%
Verbs        19%
Adjectives   16%

6.2 The OSPD Heuristic

In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous11 lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.

7. Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus, using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of

11 Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.



other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

Selectional preferences work well for some word combinations and grammatical relationships, but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these, at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and where the selectional-preference models provide a high probability for a secondary sense. In these cases evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING." We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark Mclauchlan for useful discussions.

References

Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.

Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15–22.

Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.

Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25–59.

Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48–58, Prague, Czech Republic.

Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447–454.

Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35–41, Bergen, Norway.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway.

Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference of Computational Linguistics (COLING-00), pages 187–193.

Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.

Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.

Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239–248, Bulgaria.

Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269–281.

McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.

McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.

Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufmann.

Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering. Forthcoming.

Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.

Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52–57, Washington, DC.

Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118.

Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465–471.

Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.




[Figure 4: TCMs for the direct-object slot of two verb classes that include the verb seize: one class also containing assume and usurp, the other also containing clutch. Each TCM is a cut across WordNet noun classes such as entity, group, event, possession, and human_action, with a probability attached to each class on the cut.]

To ensure that the TCM covers all NS in WordNet, we modify Li and Abe's original scheme by creating hyponym leaf classes below all WordNet's internal classes in the hyponym hierarchy. Each leaf holds the ns previously held at the internal class.
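This modification can be sketched on a toy hierarchy. The dict-based tree and the `_leaf` naming convention here are our own illustrative choices, not the paper's implementation or the real WordNet API:

```python
def push_senses_to_leaves(node):
    """Create a hyponym leaf class below an internal class so that the
    senses previously held at the internal class are held at a leaf.
    node: {"name": str, "senses": list, "children": list} -- a toy
    stand-in for a WordNet hyponym hierarchy."""
    if node["children"] and node["senses"]:
        leaf = {"name": node["name"] + "_leaf",
                "senses": list(node["senses"]),
                "children": []}
        node["children"].append(leaf)
        node["senses"] = []
    for child in node["children"]:
        push_senses_to_leaves(child)
    return node

# An internal class "entity" that directly holds a sense and has a child.
tree = {"name": "entity",
        "senses": ["entity%1"],
        "children": [{"name": "object", "senses": ["object%1"], "children": []}]}
push_senses_to_leaves(tree)
print(tree["senses"], [c["name"] for c in tree["children"]])
# → [] ['object', 'entity_leaf']
```

After the transformation, every sense sits at a leaf, so a cut through the hierarchy can always separate the senses held at a class from those of its hyponyms.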

Figure 4 shows portions of two TCMs. The TCMs are similar, as they both contain the verb seize, but the TCM for the class that includes clutch has a higher probability for the entity noun class compared to the class that also includes assume and usurp. This example includes only top-level WordNet classes, although the TCM may use more specific noun classes.

4. Disambiguation

Nouns, adjectives, and verbs are disambiguated by finding the sense (nc, vc, or ac) with the maximum probability estimate in the given context. The method disambiguates nouns and verbs to the WordNet synset level, and adjectives to a coarse-grained level of WordNet synsets linked by the similar-to relation, as described previously.

4.1 Disambiguating Nouns

Nouns are disambiguated when they occur as subjects or direct objects and when modified by adjectives. We obtain a probability estimate for each nc to which the target noun belongs, using the distribution of the TCM associated with the co-occurring verb or adjective and the grammatical relationship.

Li and Abe used TCMs for the task of structural disambiguation. To obtain probability estimates for noun senses occurring at classes beneath hypernyms on the cut, Li and Abe used the probability estimate at the nc′ on the cut divided by the number of ns descendants, as we do when finding Γ during training, so the probability estimate is shared equally among all nouns in the nc′, as in equation (15):

p(ns ∈ ns_nc′) = p(nc′) / |ns_nc′|    (15)

One problem with doing this is that in cases in which the TCM is quite high in the hierarchy, for example at the entity class, the probability of any ns's occurring under this nc′ on the TCM will be the same, and does not allow us to discriminate among senses beneath this level.

For the WSD task we compare the probability estimates at each nc ∈ Cn, so if a noun belongs to several synsets, we compare the probability estimates given the context of these synsets. We obtain estimates for each nc by using the probability of the hypernym nc′ on Γ. Rather than assume that all synsets under a given nc′ on Γ have the same likelihood of occurrence, we multiply the probability estimate for the hypernym nc′ by the ratio of the prior frequency of the nc, that is, p(nc|gr), for which we seek the estimate, divided by the prior frequency of the hypernym nc′ (p(nc′|gr)):

p(nc ∈ hyponyms of nc′ | vc, gr) = p(nc′ | vc, gr) × p(nc | gr) / p(nc′ | gr)    (16)

These prior estimates are taken from populating the noun hyponym hierarchy with the prior frequency data for the gr, irrespective of the co-occurring verbs. The probability at the hypernym nc′ will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.

Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ Cn that gives the largest estimate for p(nc|vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from Cv. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc′|vc, gr), and the estimate for nc is obtained as in equation (16).

For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc|vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM compared to matter, which is above the correct sense of letter,6 the ratio of prior probabilities for the synset containing letter7 under entity is 0.001, whereas that for the synset under matter is 0.226. This gives a probability of 0.009 × 0.226 = 0.002 for the noun class probability given the verb class

[Figure 5: TCM for the direct-object slot of the verb class including sign and ratify, showing probabilities for cut classes such as entity, human_action, communication, message, matter, and symbol, with the five WordNet senses of letter beneath them.]

6 The gloss is "a written message addressed to a person or organization; e.g., wrote an indignant letter to the editor."

7 The gloss is "owner who lets another person use something (housing usually) for hire."



(with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.
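Putting equation (16) and the letter example together gives a small scoring sketch. The 0.009 TCM probability and the prior ratios 0.226 and 0.001 follow the text; the TCM probability for the class above the lessor sense (0.16) is an invented placeholder, and only two of the five senses are shown:

```python
def eq16(p_cut_given_vc_gr, prior_ratio):
    """Equation (16): p(nc|vc,gr) = p(nc'|vc,gr) * p(nc|gr)/p(nc'|gr),
    where nc' is the hypernym of nc on the cut and prior_ratio is the
    ratio p(nc|gr)/p(nc'|gr)."""
    return p_cut_given_vc_gr * prior_ratio

# Two competing senses of "letter" as direct object of "sign".
scores = {
    "letter (written message)": eq16(0.009, 0.226),  # under matter
    "letter (lessor)":          eq16(0.16, 0.001),   # under entity (0.16 invented)
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 6))  # → letter (written message) 0.002034
```

Even though the entity class carries more TCM probability, the prior ratio concentrates so little of it on the lessor synset that the message sense wins, as in the text.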

4.2 Disambiguating Verbs and Adjectives

Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc|vc, gr) and p(nc|ac, gr), respectively. These are combined with prior estimates for p(nc|gr) and p(vc|gr) (or p(ac|gr)) using Bayes' rule to give

p(vc | nc, gr) = p(nc | vc, gr) × p(vc | gr) / p(nc | gr)    (17)

and for adjective–noun relations,

p(ac | nc, adjnoun) = p(nc | ac, adjnoun) × p(ac | adjnoun) / p(nc | adjnoun)    (18)

The prior distributions for p(nc|gr), p(vc|gr), and p(ac|adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |Cn|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc′) totals the frequency at its hyponyms:

freq(nc′ | gr) = Σ_{nc ∈ nc′} freq(nc | gr)    (19)

The distribution over VC is obtained similarly, using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.
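The bottom-up propagation in equation (19) can be sketched on a toy hierarchy. The class names and frequency credits are invented; a node's total is its own direct credit plus everything propagated up from its hyponyms:

```python
def propagated_freq(node, base_freq):
    """Equation (19): a hypernym's frequency totals its direct credit
    plus the credit propagated from its hyponyms.
    node: {"name": str, "children": [...]}; base_freq: name -> credit."""
    total = base_freq.get(node["name"], 0.0)
    for child in node["children"]:
        total += propagated_freq(child, base_freq)
    return total

# Toy hierarchy: entity above person and artifact; credits are invented.
tree = {"name": "entity", "children": [
    {"name": "person", "children": []},
    {"name": "artifact", "children": []},
]}
freqs = {"person": 3.0, "artifact": 2.0, "entity": 0.5}
print(propagated_freq(tree, freqs))  # → 5.5
```

The credits are fractional because, as described above, each noun's credit is split across all the classes it belongs to before propagation.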

To disambiguate a verb occurring with a given noun, the vc from Cv that gives the largest estimate for p(vc|nc, gr) is taken. The nc for the co-occurring noun is the nc from Cn that maximizes this estimate. The estimate for p(nc|vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc|nc, gr), rather than p(nc|vc, gr). An adjective is likewise disambiguated to the ac, from all those to which the adjective belongs, using the estimate for p(nc|ac, gr) and selecting the nc that maximizes the p(ac|nc, gr) estimate.
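The selection step in equation (17) might be sketched as follows. All class labels and probabilities are invented for illustration; in the real system the three probabilities come from the TCMs and the prior distributions described above:

```python
def p_vc_given_nc(p_nc_given_vc_gr, p_vc_given_gr, p_nc_given_gr):
    """Equation (17): Bayes' rule for the verb class posterior."""
    return p_nc_given_vc_gr * p_vc_given_gr / p_nc_given_gr

def disambiguate_verb(candidate_pairs):
    """candidate_pairs: (vc, nc) -> (p(nc|vc,gr), p(vc|gr), p(nc|gr)).
    Return the verb class from the pair with the highest posterior."""
    best_vc, _ = max(candidate_pairs,
                     key=lambda pair: p_vc_given_nc(*candidate_pairs[pair]))
    return best_vc

# Invented numbers for two senses of "seize" with direct object "throne".
pairs = {
    ("seize/clutch", "throne/chair"):    (0.02, 0.4, 0.05),
    ("seize/usurp",  "throne/position"): (0.20, 0.6, 0.05),
}
print(disambiguate_verb(pairs))  # → seize/usurp
```

The same maximization over (ac, nc) pairs, with equation (18), disambiguates adjectives.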

4.3 Increasing Coverage: One Sense per Discourse

There is a significant limitation to the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations, and in argument head position. Moreover, we have TCMs only for adjective and verb classes in which there was at least one adjective or verb member that met our criteria for training (having no more than a threshold of 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where we did not have TCMs for one or more classes for the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church, and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document in order to increase coverage. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.
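A sketch of this propagation rule follows. The token representation and sense labels are our own illustrative choices; note the proviso that exactly one distinct candidate tag must have been supplied by the preferences for that word type in the discourse:

```python
from collections import defaultdict

def apply_ospd(tokens):
    """tokens: dicts with 'lemma', 'pos', and optional 'sense' (the tag
    assigned by the selectional preferences, if any). Propagate a sense
    to untagged tokens of the same lemma/POS within the discourse, but
    only when exactly one distinct sense was assigned to that type."""
    assigned = defaultdict(set)
    for t in tokens:
        if t.get("sense"):
            assigned[(t["lemma"], t["pos"])].add(t["sense"])
    for t in tokens:
        key = (t["lemma"], t["pos"])
        if not t.get("sense") and len(assigned[key]) == 1:
            t["sense"] = next(iter(assigned[key]))
    return tokens

# Toy discourse: one tagged occurrence of "letter", one untagged.
doc = [
    {"lemma": "letter", "pos": "n", "sense": "letter%message"},
    {"lemma": "letter", "pos": "n"},
    {"lemma": "sign", "pos": "v"},
]
tagged = apply_ospd(doc)
print(tagged[1]["sense"])  # → letter%message
```

If two occurrences of a word type had received conflicting tags, the `len(...) == 1` check would suppress propagation, matching the conflicting-evidence restriction described in Section 6.2.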



[Figure 6: SENSEVAL-2 English all-words task results. Recall is plotted against precision for our systems (sel, sel-ospd, sel-ospd-ana), the other unsupervised systems, the supervised systems, and the first-sense heuristic.]

5. Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise, in three variants under the names "sussex-sel" (selectional preferences), "sussex-sel-ospd" (with the OSPD heuristic), and "sussex-sel-ospd-ana" (with anaphora resolution).8 For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2, compared to other unsupervised systems.9

We have also plotted the results of the supervised systems, and the precision and recall achieved by using the most frequent sense (as listed in WordNet).10

In the work reported here, we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective–noun relationships. For each test instance, we applied subject preferences before direct object preferences, and direct object preferences before adjective–noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.

We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences. We therefore

8 The motivation for using anaphora resolution was increased coverage, but anaphora resolution turned out not actually to improve performance.

9 We use unsupervised to refer to systems that do not use manually sense-tagged training data such as SemCor. Our systems, marked in the figure as sel, sel-ospd, and sel-ospd-ana, are unsupervised.

10 We are indebted to Judita Preiss for the most-frequent-sense result. This was obtained using the frequency data supplied with the WordNet 1.7 version prereleased for SENSEVAL-2.



Table 1
Overall results.

             With OSPD   Without OSPD
Precision      51.1%        52.3%
Recall         23.2%        20.0%
Attempted      45.5%        38.3%

Table 2
Precision results by part of speech.

                                           Precision (%)   Baseline precision (%)
Nouns                                          58.5              51.7
Polysemous nouns                               36.8              25.8
Verbs                                          40.9              29.7
Polysemous verbs                               38.1              25.3
Adjectives                                     49.8              48.6
Polysemous adjectives                          35.5              24.0
Nouns, verbs, and adjectives                   51.1              44.9
Polysemous nouns, verbs, and adjectives        36.8              27.3

do not include in the following the coarse-grained results; they are just slightly better than the fine-grained results, which seems to be typical of other systems.

Our latest overall results are shown in Table 1 In this table we show the resultsboth with and without the OSPD heuristic The results for the English SENSEVAL-2tasks were generally much lower than those for the original SENSEVAL competitionAt the time of the SENSEVAL-2 workshop this was assumed to be due largely to theuse of WordNet as the inventory as opposed to HECTOR (Atkins 1993) but PalmerTrang Dang and Fellbaum (forthcoming) have subsequently shown that at least for thelexical sample tasks this was due to a harder selection of words with a higher averagelevel of polysemy For three of the most polysemous verbs that overlapped betweenthe English lexical sample for SENSEVAL and SENSEVAL-2 the performance wascomparable Table 2 shows our precision results including use of the OSPD heuristicbroken down by part of speech Although the precision for nouns is greater than thatfor verbs the difference is much less when we remove the trivial monosemous casesNouns verbs and adjectives all outperform their random baseline for precision andthe difference is more marked when monosemous instances are dropped

Table 3 shows the precision results for polysemous words given the slot and thedisambiguation source Overall once at least one word token has been disambiguatedby the preferences the OSPD heuristic seems to perform better than the selectionalpreferences We can see however that although this is certainly true for the nounsthe difference for the adjectives (13) is less marked and the preferences outperformOSPD for the verbs It seems that verbs obey the OSPD principle much less than nounsAlso verbs are best disambiguated by their direct objects whereas nouns appear tobe better disambiguated as subjects and when modied by adjectives

6 Discussion

61 Selectional PreferencesThe precision of our system compares well with that of other unsupervised systems onthe SENSEVAL-2 English all-words task despite the fact that these other systems use a

650

Computational Linguistics Volume 29 Number 4

Table 3Precision results for polysemous words by part of speech and slot or disambiguation source

Subject () Dobj () Adjm () OSPD ()Polysemous nouns 337 268 310 490Polysemous verbs 338 473 mdash 298Polysemous adjectives mdash mdash 351 364Polysemous nouns verbs and adjectives 334 360 316 448

number of different sources of information for disambiguation rather than selectionalpreferences in isolation Light and Greiff (2002) summarize some earlier WSD resultsfor automatically acquired selectional preferences These results were obtained forthree systems (Resnik 1997 Abney and Light 1999 Ciaramita and Johnson 2000) on atraining and test data set constructed by Resnik containing nouns occurring as directobjects of 100 verbs that select strongly for their objects

Both the test and training sets were extracted from the section of the Brown cor-pus within the Penn Treebank and used the treebank parses The test set comprisedthe portion of this data within SemCor containing these 100 verbs and the trainingset comprised 800000 words from the Penn Treebank parses of the Brown corpus notwithin SemCor All three systems obtained higher precision than the results we reporthere with Ciaramita and Johnsonrsquos Bayesian belief networks achieving the best accu-racy at 514 These results are not comparable with ours however for three reasonsFirst our results for the direct-object slot are for all verbs in the English all-words taskas opposed to just those selecting strongly for their direct objects We would expectthat WSD results using selectional preferences would be better for the latter class ofverbs Second we do not use manually produced parses but the output from our fullyautomatic shallow parser Third and nally the baselines reported for Resnikrsquos test setwere higher than those for the all-words task For Resnikrsquos test data the random base-line was 285 whereas for the polysemous nouns in the direct-object relation on theall-words task it was 239 The distribution of senses was also perhaps more skewedfor Resnikrsquos test set since the rst sense heuristic was 828 (Abney and Light 1999)whereas it was 536 for the polysemous direct objects in the all-words task Althoughour results do show that the precision for the TCMs compares favorably with that ofother unsupervised systems on the English all-words task it would be worthwhile tocompare other selectional preference models on the same data

Although the accuracy of our system is encouraging given that it does not usehand-tagged data the results are below the level of state-of-the-art supervised systemsIndeed a system just assigning to each word its most frequent sense as listed inWordNet (the ldquorst-sense heuristicrdquo) would do better than our preference models(and in fact better than the majority of the SENSEVAL-2 English all-words supervisedsystems) The rst-sense heuristic however assumes the existence of sense-tagged datathat are able to give a denitive rst sense We do not use any rst-sense informationAlthough a modest amount of sense-tagged data is available for English (Miller et al1993 Ng and Lee 1996) for other languages with minimal sense-tagged resources theheuristic is not applicable Moreover for some words the predominant sense variesdepending on the domain and text type

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file, and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4). For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example because the word shows "bursty" topic-related behavior.

McCarthy and Carroll   Disambiguating Using Selectional Preferences

Table 4
Percentages of words with a different predominant sense in SemCor across files and genres

              File   Genre
Nouns          70     66
Verbs          79     74
Adjectives     25     21
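The file-level analysis behind Table 4 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `predominant_senses` helper, the lemmas, and the SemCor-style file names and sense labels are all invented.

```python
# Sketch of the Table 4 analysis: for each polysemous lemma, find the
# predominant (most frequent) sense within each file, then flag lemmas
# whose predominant sense differs between files. All data are invented.
from collections import Counter, defaultdict

def predominant_senses(tagged):
    """tagged: iterable of (file_id, lemma, sense) tuples."""
    by_file = defaultdict(Counter)
    for file_id, lemma, sense in tagged:
        by_file[(file_id, lemma)][sense] += 1
    firsts = defaultdict(set)          # lemma -> set of per-file first senses
    for (file_id, lemma), counts in by_file.items():
        firsts[lemma].add(counts.most_common(1)[0][0])
    return dict(firsts)

data = [("br-a01", "bank", "bank%1"), ("br-a01", "bank", "bank%1"),
        ("br-b02", "bank", "bank%2"),
        ("br-a01", "letter", "letter%1"), ("br-b02", "letter", "letter%1")]
firsts = predominant_senses(data)
varies = {lemma for lemma, senses in firsts.items() if len(senses) > 1}
# 'bank' has a different predominant sense in the two files; 'letter' does not.
```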

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense, we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses, for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from similar training data to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated, but on the cohesiveness of the tuple <pred, arg, gr>. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against the threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.
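The thresholding experiment described above amounts to a simple back-off rule. The sketch below is an assumed illustration (the function name, class labels, and probabilities are invented), not the authors' implementation:

```python
# Accept the preference model's top-ranked class only when its probability
# estimate clears a threshold; otherwise return None so another knowledge
# source (e.g. the first-sense heuristic) can be applied instead.
def tag_with_threshold(class_probs, threshold):
    """class_probs: {class_label: probability estimate for this item}."""
    best = max(class_probs, key=class_probs.get)
    return best if class_probs[best] >= threshold else None

confident = tag_with_threshold({"nc_message": 0.008, "nc_entity": 0.001}, 0.004)
uncertain = tag_with_threshold({"nc_message": 0.002, "nc_entity": 0.001}, 0.004)
```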


Computational Linguistics Volume 29 Number 4

Figure 7
Thresholding the probability estimate for the highest-scoring class: precision plotted against the threshold (0 to 0.01) for the TCMs, with the first-sense heuristic and the random baseline shown for the same items.

Table 5
Lemma and file combinations in SemCor with more than one sense evident (%)

Nouns        23
Verbs        19
Adjectives   16

6.2 The OSPD Heuristic

In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous[11] lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.

7 Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus, using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

11 Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.

Selectional preferences work well for some word combinations and grammatical relationships, but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these, at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and when the selectional-preference models provide a high probability for a secondary sense. In these cases evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING." We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References
Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.
Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15–22.
Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.
Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25–59.
Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48–58, Prague, Czech Republic.
Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447–454.
Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35–41, Bergen, Norway.
Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway.
Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference of Computational Linguistics (COLING-00), pages 187–193.
Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.
Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.
Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.
Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.
Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.
Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239–248, Bulgaria.
Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.
Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269–281.
McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.
McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.
Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufmann.
Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.
Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.
Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering. Forthcoming.
Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.
Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52–57, Washington, DC.
Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118.
Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465–471.
Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.



have the same likelihood of occurrence, we multiply the probability estimate for the hypernym nc′ by the ratio of the prior frequency of the nc for which we seek the estimate, that is, p(nc|gr), divided by the prior frequency of the hypernym nc′, p(nc′|gr):

\[
p(nc \in \mathrm{hyponyms\ of\ } nc' \mid vc, gr) = p(nc' \mid vc, gr) \times \frac{p(nc \mid gr)}{p(nc' \mid gr)} \tag{16}
\]

These prior estimates are taken from populating the noun hyponym hierarchy with the prior frequency data for the gr, irrespective of the co-occurring verbs. The probability at the hypernym nc′ will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.

Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ Cn that gives the largest estimate for p(nc|vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from Cv. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc′|vc, gr), and the estimate for nc is obtained as in equation (16).

For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc|vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM compared to matter, which is above the correct sense of letter,[6] the ratio of prior probabilities for the synset containing letter[7] under entity is 0.001, whereas that for the synset under matter is 0.226. This gives a probability of 0.009 × 0.226 = 0.002 for the noun class probability given the verb class

Figure 5
TCM for the direct-object slot of the verb class including sign and ratify. The figure shows a cut through the WordNet noun hierarchy (with classes such as entity, human, interpretation, message, communication, abstraction, relation, human_act, and symbol) annotated with the probabilities the TCM assigns to each class; the five senses of letter attach below different classes, including entity and matter.

6 The gloss is "a written message addressed to a person or organization; e.g., wrote an indignant letter to the editor."

7 The gloss is "owner who lets another person use something (housing usually) for hire."


(with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.
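The scoring rule of equation (16), and the arithmetic of the letter/sign example, can be sketched as below. The prior ratios 0.001 and 0.226 and the TCM probability 0.009 above matter come from the text; the 0.16 value for entity and the sense labels are illustrative assumptions.

```python
# Equation (16): p(nc | vc, gr) = p(nc' | vc, gr) * p(nc | gr) / p(nc' | gr),
# i.e. the TCM probability at the hypernym class nc' on the cut, scaled by
# the ratio of prior frequencies of the sense's synset and of the hypernym.
def score_sense(tcm_prob_above, prior_ratio):
    return tcm_prob_above * prior_ratio

# Candidate senses of 'letter' as direct object of 'sign':
scores = {
    "letter_under_entity": score_sense(0.16, 0.001),   # TCM value assumed
    "letter_under_matter": score_sense(0.009, 0.226),  # values from the text
}
best = max(scores, key=scores.get)   # the sense under 'matter' wins
```

Despite entity carrying more TCM probability, the tiny prior ratio of the sense's synset under entity makes the matter reading win, matching the worked example.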

4.2 Disambiguating Verbs and Adjectives

Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc|vc, gr) and p(nc|ac, gr), respectively. These are combined with prior estimates for p(nc|gr) and p(vc|gr) (or p(ac|gr)) using Bayes' rule to give

\[
p(vc \mid nc, gr) = \frac{p(nc \mid vc, gr) \times p(vc \mid gr)}{p(nc \mid gr)} \tag{17}
\]

and for adjective–noun relations:

\[
p(ac \mid nc, adjnoun) = \frac{p(nc \mid ac, adjnoun) \times p(ac \mid adjnoun)}{p(nc \mid adjnoun)} \tag{18}
\]

The prior distributions for p(nc|gr), p(vc|gr), and p(ac|adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |Cn|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc′) totals the frequency at its hyponyms:

\[
\mathrm{freq}(nc' \mid gr) = \sum_{nc\,\in\,nc'} \mathrm{freq}(nc \mid gr) \tag{19}
\]

The distribution over VC is obtained similarly, using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.
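The propagation of frequency credit in equation (19) can be sketched over a toy hierarchy; the class names and counts below are invented for illustration:

```python
# Equation (19): the frequency of a hypernym class totals the frequency
# credit at all of its hyponyms, so counts are passed up the hierarchy.
from collections import defaultdict

def propagate(parent_of, direct_freq):
    """parent_of: {class: hypernym}; direct_freq: {class: frequency credit}."""
    total = defaultdict(float)
    for node, credit in direct_freq.items():
        total[node] += credit
        parent = parent_of.get(node)
        while parent is not None:        # walk up to the root
            total[parent] += credit
            parent = parent_of.get(parent)
    return dict(total)

parent_of = {"letter": "message", "memo": "message", "message": "communication"}
totals = propagate(parent_of, {"letter": 6.0, "memo": 2.0})
# 'message' and 'communication' each collect the 6 + 2 credit of their hyponyms.
```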

To disambiguate a verb occurring with a given noun, the vc from Cv that gives the largest estimate for p(vc|nc, gr) is taken. The nc for the co-occurring noun is the nc from Cn that maximizes this estimate. The estimate for p(nc|vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc|nc, gr) rather than p(nc|vc, gr). An adjective is likewise disambiguated to the ac, from all those to which the adjective belongs, using the estimate for p(nc|ac, gr) and selecting the nc that maximizes the p(ac|nc, gr) estimate.
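The Bayes inversion of equation (17) used to pick the verb class amounts to the following; the probabilities and class names are invented for illustration:

```python
# Equation (17): p(vc | nc, gr) = p(nc | vc, gr) * p(vc | gr) / p(nc | gr).
def p_vc_given_nc(p_nc_given_vc, p_vc, p_nc):
    return p_nc_given_vc * p_vc / p_nc

# Two hypothetical verb classes competing for the same noun class and gr:
scores = {"vc_ratify": p_vc_given_nc(0.10, 0.3, 0.2),
          "vc_compose": p_vc_given_nc(0.05, 0.4, 0.2)}
best = max(scores, key=scores.get)
```

Note that the denominator p(nc|gr) is shared by all candidate verb classes for a fixed noun and relation, so it does not change the argmax; it matters only for the absolute probability values.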

4.3 Increasing Coverage: One Sense per Discourse

There is a significant limitation to the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations, and in argument head position. Moreover, we have TCMs only for adjective and verb classes in which there was at least one adjective or verb member that met our criteria for training (having no more than a threshold of 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where we did not have TCMs for one or more classes for the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church, and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document, in order to increase coverage. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.
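The OSPD propagation just described (copy a tag to every token of the same word type and part of speech, but only when the preferences supplied a single, non-conflicting tag in that discourse) can be sketched as follows; the token representation and sense labels are assumptions:

```python
# One-sense-per-discourse: propagate a sense tag to untagged tokens of the
# same (lemma, POS) in the document, unless the selectional preferences
# supplied conflicting tags for that word type.
from collections import defaultdict

def apply_ospd(tokens):
    """tokens: list of dicts with 'lemma', 'pos', and optionally 'sense'."""
    tags = defaultdict(set)
    for tok in tokens:
        if "sense" in tok:
            tags[(tok["lemma"], tok["pos"])].add(tok["sense"])
    for tok in tokens:
        key = (tok["lemma"], tok["pos"])
        if "sense" not in tok and len(tags[key]) == 1:  # exactly one tag: no conflict
            tok["sense"] = next(iter(tags[key]))
    return tokens

doc = [{"lemma": "letter", "pos": "n", "sense": "letter%1"},
       {"lemma": "letter", "pos": "n"},                  # untagged: propagated
       {"lemma": "sound", "pos": "n", "sense": "sound%1"},
       {"lemma": "sound", "pos": "n", "sense": "sound%2"},
       {"lemma": "sound", "pos": "n"}]                   # conflicting: left untagged
apply_ospd(doc)
```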


Figure 6
SENSEVAL-2 English all-words task results: recall plotted against precision for our three variants (sel, sel-ospd, sel-ospd-ana), other unsupervised systems, supervised systems, and the first-sense heuristic.

5 Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise in three variants, under the names "sussex-sel" (selectional preferences), "sussex-sel-ospd" (with the OSPD heuristic), and "sussex-sel-ospd-ana" (with anaphora resolution).[8] For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2, compared to other unsupervised systems.[9] We have also plotted the results of the supervised systems and the precision and recall achieved by using the most frequent sense (as listed in WordNet).[10]

In the work reported here, we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective–noun relationships. For each test instance, we applied subject preferences before direct object preferences, and direct object preferences before adjective–noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.
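The slot ordering just described can be sketched as a simple precedence rule; the slot names and the helper are illustrative assumptions, not the authors' code:

```python
# Apply preferences in the order described above: subject first, then
# direct object, then adjective-noun. The first slot for which a
# preference is available decides which preference model is used.
SLOT_ORDER = ("subject", "dobj", "adjnoun")

def choose_slot(available_slots):
    """available_slots: set of slots with an applicable preference model."""
    for slot in SLOT_ORDER:
        if slot in available_slots:
            return slot
    return None

picked = choose_slot({"dobj", "adjnoun"})   # subject unavailable: dobj wins
```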

We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences. We therefore do not include the coarse-grained results in the following; they are just slightly better than the fine-grained results, which seems to be typical of other systems.

8 The motivation for using anaphora resolution was increased coverage, but anaphora resolution turned out not actually to improve performance.

9 We use "unsupervised" to refer to systems that do not use manually sense-tagged training data such as SemCor. Our systems, marked in the figure as sel, sel-ospd, and sel-ospd-ana, are unsupervised.

10 We are indebted to Judita Preiss for the most-frequent-sense result. This was obtained using the frequency data supplied with the WordNet 1.7 version prereleased for SENSEVAL-2.

Table 1
Overall results

             With OSPD    Without OSPD
Precision      51.1%         52.3%
Recall         23.2%         20.0%
Attempted      45.5%         38.3%

Table 2
Precision results by part of speech

                                          Precision (%)   Baseline precision (%)
Nouns                                         58.5              51.7
Polysemous nouns                              36.8              25.8
Verbs                                         40.9              29.7
Polysemous verbs                              38.1              25.3
Adjectives                                    49.8              48.6
Polysemous adjectives                         35.5              24.0
Nouns, verbs, and adjectives                  51.1              44.9
Polysemous nouns, verbs, and adjectives       36.8              27.3

Our latest overall results are shown in Table 1. In this table we show the results both with and without the OSPD heuristic. The results for the English SENSEVAL-2 tasks were generally much lower than those for the original SENSEVAL competition. At the time of the SENSEVAL-2 workshop this was assumed to be due largely to the use of WordNet as the inventory, as opposed to HECTOR (Atkins 1993), but Palmer, Trang Dang, and Fellbaum (forthcoming) have subsequently shown that, at least for the lexical sample tasks, this was due to a harder selection of words with a higher average level of polysemy. For three of the most polysemous verbs that overlapped between the English lexical sample for SENSEVAL and SENSEVAL-2, the performance was comparable. Table 2 shows our precision results, including use of the OSPD heuristic, broken down by part of speech. Although the precision for nouns is greater than that for verbs, the difference is much less when we remove the trivial monosemous cases. Nouns, verbs, and adjectives all outperform their random baseline for precision, and the difference is more marked when monosemous instances are dropped.

Table 3 shows the precision results for polysemous words, given the slot and the disambiguation source. Overall, once at least one word token has been disambiguated by the preferences, the OSPD heuristic seems to perform better than the selectional preferences. We can see, however, that although this is certainly true for the nouns, the difference for the adjectives (1.3%) is less marked, and the preferences outperform OSPD for the verbs. It seems that verbs obey the OSPD principle much less than nouns. Also, verbs are best disambiguated by their direct objects, whereas nouns appear to be better disambiguated as subjects and when modified by adjectives.

6 Discussion

6.1 Selectional Preferences

Table 3
Precision results for polysemous words by part of speech and slot or disambiguation source

                                          Subject (%)  Dobj (%)  Adj-m (%)  OSPD (%)
Polysemous nouns                             33.7        26.8      31.0       49.0
Polysemous verbs                             33.8        47.3        —        29.8
Polysemous adjectives                          —           —       35.1       36.4
Polysemous nouns, verbs, and adjectives      33.4        36.0      31.6       44.8

The precision of our system compares well with that of other unsupervised systems on the SENSEVAL-2 English all-words task, despite the fact that these other systems use a number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik, containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown cor-pus within the Penn Treebank and used the treebank parses The test set comprisedthe portion of this data within SemCor containing these 100 verbs and the trainingset comprised 800000 words from the Penn Treebank parses of the Brown corpus notwithin SemCor All three systems obtained higher precision than the results we reporthere with Ciaramita and Johnsonrsquos Bayesian belief networks achieving the best accu-racy at 514 These results are not comparable with ours however for three reasonsFirst our results for the direct-object slot are for all verbs in the English all-words taskas opposed to just those selecting strongly for their direct objects We would expectthat WSD results using selectional preferences would be better for the latter class ofverbs Second we do not use manually produced parses but the output from our fullyautomatic shallow parser Third and nally the baselines reported for Resnikrsquos test setwere higher than those for the all-words task For Resnikrsquos test data the random base-line was 285 whereas for the polysemous nouns in the direct-object relation on theall-words task it was 239 The distribution of senses was also perhaps more skewedfor Resnikrsquos test set since the rst sense heuristic was 828 (Abney and Light 1999)whereas it was 536 for the polysemous direct objects in the all-words task Althoughour results do show that the precision for the TCMs compares favorably with that ofother unsupervised systems on the English all-words task it would be worthwhile tocompare other selectional preference models on the same data

Although the accuracy of our system is encouraging given that it does not usehand-tagged data the results are below the level of state-of-the-art supervised systemsIndeed a system just assigning to each word its most frequent sense as listed inWordNet (the ldquorst-sense heuristicrdquo) would do better than our preference models(and in fact better than the majority of the SENSEVAL-2 English all-words supervisedsystems) The rst-sense heuristic however assumes the existence of sense-tagged datathat are able to give a denitive rst sense We do not use any rst-sense informationAlthough a modest amount of sense-tagged data is available for English (Miller et al1993 Ng and Lee 1996) for other languages with minimal sense-tagged resources theheuristic is not applicable Moreover for some words the predominant sense variesdepending on the domain and text type

To quantify this we carried out an analysis of the polysemous nouns verbsand adjectives in SemCor occurring in more than one SemCor le and found thata large proportion of words have a different rst sense in different les and alsoin different genres (see Table 4) For adjectives there seems to be a lot less ambi-

651

McCarthy and Carroll Disambiguating Using Selectional Preferences

Table 4Percentages of words with a different predominant sense in SemCor across les and genres

File GenreNouns 70 66Verbs 79 74Adjectives 25 21

guity (this has also been noted by Krovetz [1998] the data in SENSEVAL-2 bearthis out with many adjectives occurring only in their rst sense For nouns andverbs for which the predominant sense is more likely to vary among texts it wouldbe worthwhile to try to detect words for which using the predominant sense isnot a reliable strategy for example because the word shows ldquoburstyrdquo topic-relatedbehavior

We therefore examined our disambiguation results to see if there was any patternin the predicates or arguments that were easily disambiguated themselves or weregood disambiguators of the co-occurring word No particular patterns were evidentin this respect perhaps because of the small size of the test data There were nounssuch as team (precision= 2

2 ) and cancer ( 810) that did better than average but whether

or not they did better than the rst-sense heuristic depends of course on the sensein which they are used For example all 10 occurrences of cancer are in the rstsense so the rst sense heuristic is impossible to beat in this case For the test itemsthat are not in their rst sense we beat the rst-sense heuristic but on the otherhand we failed to beat the random baseline (The random baseline is 218 andwe obtained 214 for these items overall) Our performance on these items is lowprobably because they are lower-frequency senses for which there is less evidencein the untagged training corpus (the BNC) We believe that selectional preferenceswould perform best if they were acquired from similar training data to that for whichdisambiguation is required In the future we plan to investigate our models for WSDin specic domains such as sport and nance The senses and frequency distributionof senses for a given domain will in general be quite different from those in a balancedcorpus

There are individual words that are not used in the rst sense on which our TCMpreferences do well for example sound (precision = 2

2 ) but there are not enough datato isolate predicates or arguments that are good disambiguators from those that arenot We intend to investigate this issue further with the SENSEVAL-2 lexical sampledata which contains more instances of a smaller number of words

Performance of selectional preferences depends not just on the actual word beingdisambiguated but the cohesiveness of the tuple ltpred arg grgt We have thereforeinvestigated applying a threshold on the probability of the class (nc vc or ac) beforedisambiguation Figure 7 presents a graph of precision against threshold applied to theprobability estimate for the highest-scoring class We show alongside this the randombaseline and the rst-sense heuristic for these items Selectional preferences appear todo better on items for which the probability predicted by our model is higher but therst-sense heuristic does even better on these The rst sense heuristic with respect toSemCor outperforms the selectional preferences when it is averaged over a given textThat seems to be the case overall but there will be some words and texts for whichthe rst sense from SemCor is not relevant and use of a threshold on probability andperhaps a differential between probability of the top-ranked senses suggested by themodel should increase precision

652

Computational Linguistics Volume 29 Number 4

0

02

04

06

08

1

0 0002 0004 0006 0008 001

pre

cisi

on

threshold

first senseTCMs

random

Figure 7Thresholding the probability estimate for the highest-scoring class

Table 5Lemmale combinations in SemCor with more than one sense evident

Nouns 23Verbs 19Adjectives 16

62 The OSPD HeuristicIn these experiments we applied the OSPD heuristic to increase coverage One problemin doing this when using a ne-grained classication like WordNet is that althoughthe OSPD heuristic works well for homonyms it is less accurate for related senses(Krovetz 1998) and this distinction is not made in WordNet We did however ndthat in SemCor for the majority of polysemous11 lemma and le combinations therewas only one sense exhibited (see Table 5) We refrained from using the OSPD insituations in which there was conicting evidence regarding the appropriate sense fora word type occurring more than once in an individual le In our experiments theOSPD heuristic increased coverage by 7 and recall by 3 at a cost of only a 1decrease in precision

7 Conclusion

We quantied coverage and accuracy of sense disambiguation of verbs adjectivesand nouns in the SENSEVAL-2 English all-words test corpus using automaticallyacquired selectional preferences We improved coverage and recall by applying theone-sense-per-discourse heuristic The results show that disambiguation models usingonly selectional preferences can perform with accuracy well above the random base-line although accuracy would not be high enough for applications in the absence of

11 Krovetz just looked at ldquoactual ambiguityrdquo that is words with more than one sense in SemCor Wedene polysemy as those words having more than one sense in WordNet since we are usingSENSEVAL-2 data and not SemCor

653

McCarthy and Carroll Disambiguating Using Selectional Preferences

other knowledge sources (Stevenson and Wilks 2001) The results compare well withthose for other systems that do not use sense-tagged training data

Selectional preferences work well for some word combinations and grammaticalrelationships but not well for others We hope in future work to identify the situationsin which selectional preferences have high precision and to focus on these at theexpense of coverage on the assumption that other knowledge sources can be usedwhere there is not strong evidence from the preferences The rst-sense heuristic basedon sense-tagged data such as that available in SemCor seems to beat unsupervisedmodels such as ours For many words however the predominant sense varies acrossdomains and so we contend that it is worth concentrating on detecting when therst sense is not relevant and where the selectional-preference models provide a highprobability for a secondary sense In these cases evidence for a sense can be taken frommultiple occurrences of the word in the document using the one-sense-per-discourseheuristic

Acknowledgments

This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING." We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References

Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.

Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15–22.

Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.

Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25–59.

Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48–58, Prague, Czech Republic.

Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447–454.

Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35–41, Bergen, Norway.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway.

Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference of Computational Linguistics (COLING-00), pages 187–193.

Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.

Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.

Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239–248, Bulgaria.

Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269–281.

McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.

McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.

Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufmann.

Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Forthcoming, Natural Language Engineering.

Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.

Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52–57, Washington, DC.

Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118.

Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465–471.

Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 17(3):321–349.



Page 11: Disambiguating Nouns, Verbs, and Adjectives Using ...sro.sussex.ac.uk/1212/01/cl03.pdf · verbs and applying them to disambiguate nouns occurring at subject and direct ob-ject slots

648

Computational Linguistics Volume 29 Number 4

0

20

40

60

80

100

0 20 40 60 80 100

Rec

all

Precision

sel-ospd-ana sel-ospdsel

supervisedother unsupervised

first sense heuristicsel

sel-ospdsel-ospd-ana

Figure 6SENSEVAL-2 English all-words task results

5 Evaluation

We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al 2001) We entered a previous version of this system forthe SENSEVAL-2 exercise in three variants under the names ldquosussex-selrdquo (selectionalpreferences) ldquosussex-sel-ospdrdquo (with the OSPD heuristic) and ldquosussex-sel-ospd-anardquo(with anaphora resolution) 8 For SENSEVAL-2 we used only the direct object andsubject slots since we had not yet dealt with adjectives In Figure 6 we show how oursystem fared at the time of SENSEVAL-2 compared to other unsupervised systems9

We have also plotted the results of the supervised systems and the precision and recallachieved by using the most frequent sense (as listed in WordNet)10

In the work reported here we attempted disambiguation for head nouns and verbsin subject and direct object relationships and for adjectives and nouns in adjective-noun relationships For each test instance we applied subject preferences before directobject preferences and direct object preferences before adjectivendashnoun preferences Wealso propagated sense tags to test instances not in these relationships by applying theone-sense-per-discourse heuristic

We did not use the SENSEVAL-2 coarse-grained classication as this was notavailable at the time when we were acquiring the selectional preferences We therefore

8 The motivation for using anaphora resolution was increased coverage but anaphora resolution turnedout not actually to improve performance

9 We use unsupervised to refer to systems that do not use manually sense-tagged training data such asSemCor Our systems marked in the gure as sel sel-ospd and sel-ospd-ana are unsupervised

10 We are indebted to Judita Preiss for the most-frequent-sense result This was obtained using thefrequency data supplied with the WordNet 17 version prereleased for SENSEVAL-2

649

McCarthy and Carroll Disambiguating Using Selectional Preferences

Table 1Overall results

With OSPD Without OSPDPrecision 511 Precision 523Recall 232 Recall 200Attempted 455 Attempted 383

Table 2
Precision results by part of speech

                                           Precision (%)   Baseline precision (%)
Nouns                                      58.5            51.7
Polysemous nouns                           36.8            25.8
Verbs                                      40.9            29.7
Polysemous verbs                           38.1            25.3
Adjectives                                 49.8            48.6
Polysemous adjectives                      35.5            24.0
Nouns, verbs, and adjectives               51.1            44.9
Polysemous nouns, verbs, and adjectives    36.8            27.3

do not include in the following the coarse-grained results; they are just slightly better than the fine-grained results, which seems to be typical of other systems.

Our latest overall results are shown in Table 1. In this table we show the results both with and without the OSPD heuristic. The results for the English SENSEVAL-2 tasks were generally much lower than those for the original SENSEVAL competition. At the time of the SENSEVAL-2 workshop this was assumed to be due largely to the use of WordNet as the inventory, as opposed to HECTOR (Atkins 1993), but Palmer, Trang Dang, and Fellbaum (forthcoming) have subsequently shown that, at least for the lexical sample tasks, this was due to a harder selection of words with a higher average level of polysemy. For three of the most polysemous verbs that overlapped between the English lexical sample for SENSEVAL and SENSEVAL-2, the performance was comparable. Table 2 shows our precision results, including use of the OSPD heuristic, broken down by part of speech. Although the precision for nouns is greater than that for verbs, the difference is much less when we remove the trivial monosemous cases. Nouns, verbs, and adjectives all outperform their random baseline for precision, and the difference is more marked when monosemous instances are dropped.

Table 3 shows the precision results for polysemous words given the slot and the disambiguation source. Overall, once at least one word token has been disambiguated by the preferences, the OSPD heuristic seems to perform better than the selectional preferences. We can see, however, that although this is certainly true for the nouns, the difference for the adjectives (1.3%) is less marked, and the preferences outperform OSPD for the verbs. It seems that verbs obey the OSPD principle much less than nouns. Also, verbs are best disambiguated by their direct objects, whereas nouns appear to be better disambiguated as subjects and when modified by adjectives.

6 Discussion

6.1 Selectional Preferences
The precision of our system compares well with that of other unsupervised systems on the SENSEVAL-2 English all-words task, despite the fact that these other systems use a


Computational Linguistics Volume 29 Number 4

Table 3
Precision results for polysemous words by part of speech and slot or disambiguation source

                                           Subject (%)   Dobj (%)   Adjm (%)   OSPD (%)
Polysemous nouns                           33.7          26.8       31.0       49.0
Polysemous verbs                           33.8          47.3       —          29.8
Polysemous adjectives                      —             —          35.1       36.4
Polysemous nouns, verbs, and adjectives    33.4          36.0       31.6       44.8

number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy, at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects. We would expect that WSD results using selectional preferences would be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic was 82.8% (Abney and Light 1999), whereas it was 53.6% for the polysemous direct objects in the all-words task. Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.

Although the accuracy of our system is encouraging, given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense as listed in WordNet (the "first-sense heuristic") would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information. Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.
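The first-sense heuristic itself amounts to tagging every token with the sense that is most frequent in a sense-tagged resource. A minimal sketch, assuming a hypothetical stand-in frequency table rather than real SemCor or WordNet data:

```python
# Hedged sketch of the first-sense heuristic: tag each lemma with the
# sense most frequent in a sense-tagged corpus. The sense labels and
# corpus below are illustrative assumptions, not WordNet data.
from collections import Counter

def build_first_senses(tagged_corpus):
    """tagged_corpus: iterable of (lemma, sense) pairs."""
    counts = {}
    for lemma, sense in tagged_corpus:
        counts.setdefault(lemma, Counter())[sense] += 1
    # the "first sense" of a lemma is its most frequent tagged sense
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

tagged = [("bank", "bank%1"), ("bank", "bank%1"), ("bank", "bank%2"),
          ("plant", "plant%2")]
first = build_first_senses(tagged)
print(first["bank"])  # -> bank%1
```

The heuristic's weakness, as the text notes, is precisely that this table requires sense-tagged data and that the top-ranked sense can shift with domain and text type.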

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4). For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example because the word shows "bursty" topic-related behavior.

Table 4
Percentages of words with a different predominant sense in SemCor across files and genres

             File    Genre
Nouns        70%     66%
Verbs        79%     74%
Adjectives   25%     21%
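The analysis described above can be sketched as follows: compute each lemma's predominant sense separately per file and flag lemmas for which that sense is not constant. This is our own illustrative reconstruction, with a hypothetical data layout, not the paper's actual analysis code:

```python
# Hedged sketch: find lemmas whose predominant (most frequent) sense
# differs across files, given (lemma, file_id, sense) occurrences.
# The triple layout is an illustrative assumption.
from collections import Counter, defaultdict

def varying_first_sense(occurrences):
    per_file = defaultdict(lambda: defaultdict(Counter))
    for lemma, file_id, sense in occurrences:
        per_file[lemma][file_id][sense] += 1
    varying = set()
    for lemma, files in per_file.items():
        if len(files) < 2:
            continue  # lemma must occur in more than one file
        firsts = {c.most_common(1)[0][0] for c in files.values()}
        if len(firsts) > 1:
            varying.add(lemma)
    return varying

occ = [("bank", "f1", "s1"), ("bank", "f1", "s1"),
       ("bank", "f2", "s2"), ("star", "f1", "s1"), ("star", "f2", "s1")]
print(varying_first_sense(occ))  # -> {'bank'}
```

The same routine applies unchanged when the grouping key is genre rather than file.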

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense, we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses, for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from training data similar to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated, but on the cohesiveness of the tuple <pred, arg, gr>. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.
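The thresholding experiment described above can be sketched as a simple filter: only items whose top-class probability clears the threshold are attempted, and precision is then measured over that attempted subset. This is a minimal sketch under our own assumptions about the data layout, not the paper's evaluation code:

```python
# Hedged sketch: apply a probability threshold before committing to a
# sense, then score precision over the attempted subset only.
def threshold_precision(items, threshold):
    """items: list of (top_class_prob, predicted_sense, gold_sense)."""
    attempted = [(p, s, g) for p, s, g in items if p >= threshold]
    if not attempted:
        return None  # nothing attempted at this threshold
    correct = sum(1 for _, s, g in attempted if s == g)
    return correct / len(attempted)

items = [(0.009, "s1", "s1"), (0.004, "s2", "s1"), (0.007, "s3", "s3")]
print(threshold_precision(items, 0.006))  # -> 1.0
```

Raising the threshold trades coverage for precision, which is exactly the trade-off plotted in Figure 7.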


[Figure 7: precision (y-axis, 0 to 1) plotted against threshold (x-axis, 0 to 0.01), with curves for the first-sense heuristic, TCMs, and the random baseline.]

Figure 7
Thresholding the probability estimate for the highest-scoring class.

Table 5
Lemma-file combinations in SemCor with more than one sense evident

Nouns        23%
Verbs        19%
Adjectives   16%

6.2 The OSPD Heuristic
In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous11 lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.
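The OSPD propagation with the conflicting-evidence restriction described above can be sketched as follows. This is an illustrative sketch under our own assumptions about the token representation, not the paper's implementation; note that it only propagates a sense when all tagged occurrences of a lemma in the document agree:

```python
# Hedged sketch of one-sense-per-discourse (OSPD) propagation over one
# document: copy a lemma's sense to its untagged occurrences, but only
# when the tagged occurrences agree on a single sense.
def apply_ospd(tokens):
    """tokens: list of dicts with keys 'lemma' and 'sense' (None if untagged)."""
    senses = {}
    for t in tokens:
        if t["sense"] is not None:
            senses.setdefault(t["lemma"], set()).add(t["sense"])
    for t in tokens:
        if t["sense"] is None and len(senses.get(t["lemma"], set())) == 1:
            t["sense"] = next(iter(senses[t["lemma"]]))
    return tokens

doc = [{"lemma": "plant", "sense": "plant%1"},
       {"lemma": "plant", "sense": None},
       {"lemma": "bank", "sense": None}]
print(apply_ospd(doc)[1]["sense"])  # -> plant%1
```

Lemmas with no tagged occurrence (like "bank" here), or with conflicting tags, are left untouched, matching the restraint described in the text.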

7 Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus, using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

11 Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.

Selectional preferences work well for some word combinations and grammatical relationships but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these, at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and where the selectional-preference models provide a high probability for a secondary sense. In these cases evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING". We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark Mclauchlan for useful discussions.

References
Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1-8.

Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15-22.

Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.

Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25-59.

Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48-58, Prague, Czech Republic.

Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447-454.

Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35-41, Bergen, Norway.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269-270, Bergen, Norway.

Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference of Computational Linguistics (COLING-00), pages 187-193.

Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53-58, Stuttgart, Germany.

Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207-218.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415-439.

Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.

Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239-248, Bulgaria.

Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217-244.

Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269-281.

McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52-61.

McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.

Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303-308. Morgan Kaufmann.

Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207-223.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40-47.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Forthcoming, Natural Language Engineering.

Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.

Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52-57, Washington, DC.

Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112-118.

Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465-471.

Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321-349.


Light Marc and Warren Greiff 2002Statistical models for the induction anduse of selectional preferences CognitiveScience 26(3)269ndash281

McCarthy Diana 1997 Word sensedisambiguation for acquisition ofselectional preferences In Proceedings ofthe ACLEACL 97 Workshop AutomaticInformation Extraction and Building of LexicalSemantic Resources for NLP Applicationspages 52ndash61

McCarthy Diana 2001 Lexical Acquisition atthe Syntax-Semantics Interface DiathesisAlternations Subcategorization Frames andSelectional Preferences PhD thesisUniversity of Sussex

Miller George A Claudia Leacock RandeeTengi and Ross T Bunker 1993 Asemantic concordance In Proceedings of theARPA Workshop on Human LanguageTechnology pages 303ndash308 MorganKaufmann

Minnen Guido John Carroll and DarrenPearce 2001 Applied morphologicalprocessing of English Natural LanguageEngineering 7(3)207ndash223

Ng Hwee Tou and Hian Beng Lee 1996Integrating multiple knowledge sourcesto disambiguate word sense Anexemplar-based approach In Proceedingsof the 34th Annual Meeting of the Associationfor Computational Linguistics pages 40ndash47

Palmer Martha Hoa Trang Dang andChristiane Fellbaum 2003 Makingne-grained and coarse-grained sensedistinctions both manually andautomatically Forthcoming NaturalLanguage Engineering

Preiss Judita and Anna Korhonen 2002Improving subcategorization acquisitionwith WSD In Proceedings of the ACLWorkshop on Word Sense DisambiguationRecent Successes and Future DirectionsPhiladelphia PA

Resnik Philip 1997 Selectional preferenceand sense disambiguation In Proceedingsof the SIGLEX Workshop on Tagging Text withLexical Semantics Why What and Howpages 52ndash57 Washington DC

Ribas Francesc 1995 On learning moreappropriate selectional restrictions InProceedings of the Seventh Conference of theEuropean Chapter of the Association forComputational Linguistics pages 112ndash118

Rissanen Jorma 1978 Modelling byshortest data description Automatica14465ndash471

Stevenson Mark and Yorick Wilks 2001The interaction of knowledge sources inword sense disambiguation ComputationalLinguistics 17(3)321ndash349

Page 13: Disambiguating Nouns, Verbs, and Adjectives Using ...sro.sussex.ac.uk/1212/01/cl03.pdf · verbs and applying them to disambiguate nouns occurring at subject and direct ob-ject slots

650

Computational Linguistics Volume 29 Number 4

Table 3
Precision results for polysemous words by part of speech and slot or disambiguation source

                                           Subject (%)   Dobj (%)   Adjm (%)   OSPD (%)
Polysemous nouns                              33.7         26.8       31.0       49.0
Polysemous verbs                              33.8         47.3        —         29.8
Polysemous adjectives                          —            —         35.1       36.4
Polysemous nouns, verbs, and adjectives       33.4         36.0       31.6       44.8

number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences. These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.

Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects. We would expect that WSD results using selectional preferences would be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data, the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic was 82.8% (Abney and Light 1999), whereas it was 53.6% for the polysemous direct objects in the all-words task. Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.

Although the accuracy of our system is encouraging given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense as listed in WordNet (the "first-sense heuristic") would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information. Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.
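The first-sense heuristic discussed above is simple to state in code. The sketch below is illustrative only: the sense inventory and sense labels are invented stand-ins for WordNet's frequency-ordered synsets, not the actual WordNet data or the authors' implementation.

```python
# Illustrative sketch of the first-sense heuristic: always assign the sense
# listed first in a frequency-ordered inventory (WordNet orders senses by
# frequency counts derived from sense-tagged data such as SemCor).
# TOY_INVENTORY is a hypothetical stand-in, not real WordNet content.

TOY_INVENTORY = {
    # lemma -> senses ordered by (assumed) corpus frequency
    "bank": ["bank%financial", "bank%river"],
    "plant": ["plant%factory", "plant%flora"],
}

def first_sense(lemma, inventory=TOY_INVENTORY):
    """Return the most frequent sense for a lemma, or None if unknown."""
    senses = inventory.get(lemma)
    return senses[0] if senses else None

def precision(predictions, gold):
    """Precision over the items for which a prediction was made."""
    attempted = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    if not attempted:
        return 0.0
    return sum(p == g for p, g in attempted) / len(attempted)
```

The heuristic's strength and weakness are both visible here: it needs no context at all, but it also needs the frequency-ordered inventory, which presupposes sense-tagged training data.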

To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4). For adjectives there seems to be a lot less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense. For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example because the word shows "bursty" topic-related behavior.

McCarthy and Carroll   Disambiguating Using Selectional Preferences

Table 4
Percentages of words with a different predominant sense in SemCor across files and genres

             File (%)   Genre (%)
Nouns           70         66
Verbs           79         74
Adjectives      25         21

We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = 2/2) and cancer (precision = 8/10) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense, we beat the first-sense heuristic, but on the other hand we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low probably because they are lower-frequency senses, for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from training data similar to that for which disambiguation is required. In the future we plan to investigate our models for WSD in specific domains, such as sport and finance. The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.

There are individual words that are not used in the first sense on which our TCM preferences do well, for example sound (precision = 2/2), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.

Performance of selectional preferences depends not just on the actual word being disambiguated but on the cohesiveness of the tuple <pred, arg, gr>. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against a threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic with respect to SemCor outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant, and use of a threshold on probability, and perhaps a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.
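The thresholding scheme described above trades coverage for precision: the model commits to a sense only when its probability estimate for the top-ranked class clears a cutoff, and otherwise abstains so that another knowledge source can decide. A minimal sketch (the sense identifiers and probability values are invented for illustration; the real models assign probabilities over WordNet classes):

```python
def disambiguate_with_threshold(sense_probs, threshold):
    """Return the top-ranked sense if its probability estimate meets the
    threshold; otherwise abstain (return None), leaving the decision to
    another knowledge source.

    sense_probs: dict mapping sense identifiers to probability estimates.
    """
    if not sense_probs:
        return None
    best_sense, best_prob = max(sense_probs.items(), key=lambda kv: kv[1])
    return best_sense if best_prob >= threshold else None

def precision_coverage(predictions, gold):
    """Precision over attempted items, and coverage over all items."""
    attempted = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(attempted) / len(gold) if gold else 0.0
    prec = (sum(p == g for p, g in attempted) / len(attempted)
            if attempted else 0.0)
    return prec, coverage
```

Sweeping the threshold over a range such as 0 to 0.01 and plotting precision of the attempted items reproduces the shape of the precision-versus-threshold curve: coverage falls as the cutoff rises, while precision (ideally) climbs.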

Figure 7
Thresholding the probability estimate for the highest-scoring class
[Graph: precision (y-axis, 0 to 1) against threshold (x-axis, 0 to 0.01) for the first-sense heuristic, the TCMs, and the random baseline.]

Table 5
Lemma/file combinations in SemCor with more than one sense evident

             (%)
Nouns         23
Verbs         19
Adjectives    16

6.2 The OSPD Heuristic

In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous[11] lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.
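The cautious form of one-sense-per-discourse used here (propagate a sense to untagged occurrences of a lemma within a file only when the tagged occurrences agree, and refrain on conflict) can be sketched as follows. This is a hypothetical rendering of the heuristic as described, not the authors' code:

```python
def apply_ospd(occurrences):
    """One-sense-per-discourse with a conflict check.

    occurrences: list of (lemma, sense_or_None) pairs from a single file,
    where None marks an occurrence the disambiguator left untagged.
    Returns a new list in which untagged occurrences of a lemma receive
    the sense assigned elsewhere in the file, but only when every tagged
    occurrence of that lemma agrees; on conflicting evidence we refrain.
    """
    senses_per_lemma = {}
    for lemma, sense in occurrences:
        if sense is not None:
            senses_per_lemma.setdefault(lemma, set()).add(sense)

    result = []
    for lemma, sense in occurrences:
        if sense is None:
            candidates = senses_per_lemma.get(lemma, set())
            if len(candidates) == 1:  # unambiguous evidence within this file
                sense = next(iter(candidates))
        result.append((lemma, sense))
    return result
```

The conflict check is what limits the damage from WordNet's related (non-homonymous) senses: when a file genuinely mixes two senses of a lemma, no propagation happens and coverage is simply not extended for that word.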

7 Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

11. Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.

Selectional preferences work well for some word combinations and grammatical relationships but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these, at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and where the selectional-preference models provide a high probability for a secondary sense. In these cases, evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING". We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References

Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1–8.

Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15–22.

Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.

Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25–59.

Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48–58, Prague, Czech Republic.

Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447–454.

Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35–41, Bergen, Norway.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270, Bergen, Norway.

Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference of Computational Linguistics (COLING-00), pages 187–193.

Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53–58, Stuttgart, Germany.

Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS.

Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239–248, Bulgaria.

Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269–281.

McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.

McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. Ph.D. thesis, University of Sussex.

Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufmann.

Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Forthcoming, Natural Language Engineering.

Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.

Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, pages 52–57, Washington, DC.

Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118.

Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465–471.

Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.

Page 14: Disambiguating Nouns, Verbs, and Adjectives Using ...sro.sussex.ac.uk/1212/01/cl03.pdf · verbs and applying them to disambiguate nouns occurring at subject and direct ob-ject slots

651

McCarthy and Carroll Disambiguating Using Selectional Preferences

Table 4Percentages of words with a different predominant sense in SemCor across les and genres

File GenreNouns 70 66Verbs 79 74Adjectives 25 21

guity (this has also been noted by Krovetz [1998] the data in SENSEVAL-2 bearthis out with many adjectives occurring only in their rst sense For nouns andverbs for which the predominant sense is more likely to vary among texts it wouldbe worthwhile to try to detect words for which using the predominant sense isnot a reliable strategy for example because the word shows ldquoburstyrdquo topic-relatedbehavior

We therefore examined our disambiguation results to see if there was any patternin the predicates or arguments that were easily disambiguated themselves or weregood disambiguators of the co-occurring word No particular patterns were evidentin this respect perhaps because of the small size of the test data There were nounssuch as team (precision= 2

2 ) and cancer ( 810) that did better than average but whether

or not they did better than the rst-sense heuristic depends of course on the sensein which they are used For example all 10 occurrences of cancer are in the rstsense so the rst sense heuristic is impossible to beat in this case For the test itemsthat are not in their rst sense we beat the rst-sense heuristic but on the otherhand we failed to beat the random baseline (The random baseline is 218 andwe obtained 214 for these items overall) Our performance on these items is lowprobably because they are lower-frequency senses for which there is less evidencein the untagged training corpus (the BNC) We believe that selectional preferenceswould perform best if they were acquired from similar training data to that for whichdisambiguation is required In the future we plan to investigate our models for WSDin specic domains such as sport and nance The senses and frequency distributionof senses for a given domain will in general be quite different from those in a balancedcorpus

There are individual words that are not used in the rst sense on which our TCMpreferences do well for example sound (precision = 2

2 ) but there are not enough datato isolate predicates or arguments that are good disambiguators from those that arenot We intend to investigate this issue further with the SENSEVAL-2 lexical sampledata which contains more instances of a smaller number of words

Performance of selectional preferences depends not just on the actual word beingdisambiguated but the cohesiveness of the tuple ltpred arg grgt We have thereforeinvestigated applying a threshold on the probability of the class (nc vc or ac) beforedisambiguation Figure 7 presents a graph of precision against threshold applied to theprobability estimate for the highest-scoring class We show alongside this the randombaseline and the rst-sense heuristic for these items Selectional preferences appear todo better on items for which the probability predicted by our model is higher but therst-sense heuristic does even better on these The rst sense heuristic with respect toSemCor outperforms the selectional preferences when it is averaged over a given textThat seems to be the case overall but there will be some words and texts for whichthe rst sense from SemCor is not relevant and use of a threshold on probability andperhaps a differential between probability of the top-ranked senses suggested by themodel should increase precision

652

Computational Linguistics Volume 29 Number 4

0

02

04

06

08

1

0 0002 0004 0006 0008 001

pre

cisi

on

threshold

first senseTCMs

random

Figure 7Thresholding the probability estimate for the highest-scoring class

Table 5Lemmale combinations in SemCor with more than one sense evident

Nouns 23Verbs 19Adjectives 16

62 The OSPD HeuristicIn these experiments we applied the OSPD heuristic to increase coverage One problemin doing this when using a ne-grained classication like WordNet is that althoughthe OSPD heuristic works well for homonyms it is less accurate for related senses(Krovetz 1998) and this distinction is not made in WordNet We did however ndthat in SemCor for the majority of polysemous11 lemma and le combinations therewas only one sense exhibited (see Table 5) We refrained from using the OSPD insituations in which there was conicting evidence regarding the appropriate sense fora word type occurring more than once in an individual le In our experiments theOSPD heuristic increased coverage by 7 and recall by 3 at a cost of only a 1decrease in precision

7 Conclusion

We quantied coverage and accuracy of sense disambiguation of verbs adjectivesand nouns in the SENSEVAL-2 English all-words test corpus using automaticallyacquired selectional preferences We improved coverage and recall by applying theone-sense-per-discourse heuristic The results show that disambiguation models usingonly selectional preferences can perform with accuracy well above the random base-line although accuracy would not be high enough for applications in the absence of

11 Krovetz just looked at ldquoactual ambiguityrdquo that is words with more than one sense in SemCor Wedene polysemy as those words having more than one sense in WordNet since we are usingSENSEVAL-2 data and not SemCor

653

McCarthy and Carroll Disambiguating Using Selectional Preferences

other knowledge sources (Stevenson and Wilks 2001) The results compare well withthose for other systems that do not use sense-tagged training data

Selectional preferences work well for some word combinations and grammaticalrelationships but not well for others We hope in future work to identify the situationsin which selectional preferences have high precision and to focus on these at theexpense of coverage on the assumption that other knowledge sources can be usedwhere there is not strong evidence from the preferences The rst-sense heuristic basedon sense-tagged data such as that available in SemCor seems to beat unsupervisedmodels such as ours For many words however the predominant sense varies acrossdomains and so we contend that it is worth concentrating on detecting when therst sense is not relevant and where the selectional-preference models provide a highprobability for a secondary sense In these cases evidence for a sense can be taken frommultiple occurrences of the word in the document using the one-sense-per-discourseheuristic

AcknowledgmentsThis work was supported by UK EPSRCproject GRN36493 ldquoRobust AccurateStatistical Parsing (RASP)rdquo and EU FW5project IST-2001-34460 ldquoMEANINGrdquo We aregrateful to Rob Koeling and threeanonymous reviewers for their helpfulcomments on earlier drafts We would alsolike to thank David Weir and MarkMclauchlan for useful discussions

ReferencesAbney Steven and Marc Light 1999 Hiding

a semantic class hierarchy in a Markovmodel In Proceedings of the ACL Workshopon Unsupervised Learning in NaturalLanguage Processing pages 1ndash8

Agirre Eneko and David Martinez 2001Learning class-to-class selectionalpreferences In Proceedings of the FifthWorkshop on Computational LanguageLearning (CoNLL-2001) pages 15ndash22

Atkins Sue 1993 Tools for computer-aidedlexicography The Hector project InPapers in Computational LexicographyCOMPLEX 93 Budapest

Briscoe Ted and John Carroll 1993Generalised probabilistic LR parsing ofnatural language (corpora) withunication-based grammarsComputational Linguistics 19(1)25ndash59

Briscoe Ted and John Carroll 1995Developing and evaluating a probabilisticLR parser of part-of-speech andpunctuation labels In fourthACLSIGPARSE International Workshop onParsing Technologies pages 48ndash58 PragueCzech Republic

Carroll John Ted Briscoe and AntonioSanlippo 1998 Parser evaluation Asurvey and a new proposal In Proceedingsof the International Conference on Language

Resources and Evaluation pages 447ndash454Carroll John Guido Minnen and Ted

Briscoe 1999 Corpus annotation forparser evaluation In EACL-99Post-conference Workshop on LinguisticallyInterpreted Corpora pages 35ndash41 BergenNorway

Carroll John Guido Minnen Darren PearceYvonne Canning Siobhan Devlin andJohn Tait 1999 Simplifying English textfor language impaired readers InProceedings of the Ninth Conference of theEuropean Chapter of the Association forComputational Linguistics pages 269ndash270Bergen Norway

Ciaramita Massimiliano and Mark Johnson2000 Explaining away ambiguityLearning verb selectional preference withBayesian networks In Proceedings of the18th International Conference ofComputational Linguistics (COLING-00)pages 187ndash193

Cotton Scott Phil Edmonds AdamKilgarriff and Martha Palmer 2001SENSEVAL-2 Available at httpwwwslesharpcouksenseval2

Elworthy David 1994 Does Baum-Welchre-estimation help taggers In Proceedingsof the fourth ACL Conference on AppliedNatural Language Processing pages 53ndash58Stuttgart Germany

Federici Stefano Simonetta Montemagniand Vito Pirrelli 1999 SENSE Ananalogy-based word sensedisambiguation system Natural LanguageEngineering 5(2)207ndash218

Fellbaum Christiane editor 1998 WordNetAn Electronic Lexical Database MIT PressCambridge MA

Gale William Kenneth Church and DavidYarowsky 1992 A method fordisambiguating word senses in a largecorpus Computers and the Humanities

654

Computational Linguistics Volume 29 Number 4

26415ndash439Krovetz Robert 1998 More than one sense

per discourse In Proceedings of theACL-SIGLEX SENSEVAL WorkshopAvailable at httpwwwitribtonacukeventssensevalARCHIVEPROCEEDINGS

Li Hang and Naoki Abe 1995 Generalizingcase frames using a thesaurus and theMDL principle In Proceedings of theInternational Conference on Recent Advancesin Natural Language Processing pages239ndash248 Bulgaria

Li Hang and Naoki Abe 1998 Generalizingcase frames using a thesaurus and theMDL principle Computational Linguistics24(2)217ndash244

Light Marc and Warren Greiff 2002Statistical models for the induction anduse of selectional preferences CognitiveScience 26(3)269ndash281

McCarthy Diana 1997 Word sensedisambiguation for acquisition ofselectional preferences In Proceedings ofthe ACLEACL 97 Workshop AutomaticInformation Extraction and Building of LexicalSemantic Resources for NLP Applicationspages 52ndash61

McCarthy Diana 2001 Lexical Acquisition atthe Syntax-Semantics Interface DiathesisAlternations Subcategorization Frames andSelectional Preferences PhD thesisUniversity of Sussex

Miller George A Claudia Leacock RandeeTengi and Ross T Bunker 1993 Asemantic concordance In Proceedings of theARPA Workshop on Human LanguageTechnology pages 303ndash308 MorganKaufmann

Minnen Guido John Carroll and DarrenPearce 2001 Applied morphologicalprocessing of English Natural LanguageEngineering 7(3)207ndash223

Ng Hwee Tou and Hian Beng Lee 1996Integrating multiple knowledge sourcesto disambiguate word sense Anexemplar-based approach In Proceedingsof the 34th Annual Meeting of the Associationfor Computational Linguistics pages 40ndash47

Palmer Martha Hoa Trang Dang andChristiane Fellbaum 2003 Makingne-grained and coarse-grained sensedistinctions both manually andautomatically Forthcoming NaturalLanguage Engineering

Preiss Judita and Anna Korhonen 2002Improving subcategorization acquisitionwith WSD In Proceedings of the ACLWorkshop on Word Sense DisambiguationRecent Successes and Future DirectionsPhiladelphia PA

Resnik Philip 1997 Selectional preferenceand sense disambiguation In Proceedingsof the SIGLEX Workshop on Tagging Text withLexical Semantics Why What and Howpages 52ndash57 Washington DC

Ribas Francesc 1995 On learning moreappropriate selectional restrictions InProceedings of the Seventh Conference of theEuropean Chapter of the Association forComputational Linguistics pages 112ndash118

Rissanen Jorma 1978 Modelling byshortest data description Automatica14465ndash471

Stevenson Mark and Yorick Wilks 2001The interaction of knowledge sources inword sense disambiguation ComputationalLinguistics 17(3)321ndash349



Computational Linguistics Volume 29 Number 4

Figure 7
Thresholding the probability estimate for the highest-scoring class. [The plot shows precision (0 to 1) against the probability threshold (0 to 0.001) for the first-sense heuristic, the TCMs, and the random baseline.]

Table 5
Lemma-file combinations in SemCor with more than one sense evident.

    Nouns       23%
    Verbs       19%
    Adjectives  16%

6.2 The OSPD Heuristic

In these experiments we applied the OSPD heuristic to increase coverage. One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous[11] lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.
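The propagation step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the token representation, and the sense labels are all invented for the example. The key point is the conflicting-evidence check — a sense is propagated to untagged occurrences of a lemma in a file only when exactly one sense was evident for that lemma in that file.

```python
from collections import defaultdict

def apply_ospd(tagged_tokens):
    """One-sense-per-discourse propagation (sketch).

    tagged_tokens: list of dicts with keys 'file', 'lemma', and 'sense',
    where 'sense' is None when the preference models made no prediction.
    Returns a new list in which untagged occurrences inherit the sense of
    their lemma within the same file, provided the evidence is unanimous.
    """
    # Collect the set of senses evident for each (file, lemma) pair.
    evident = defaultdict(set)
    for tok in tagged_tokens:
        if tok['sense'] is not None:
            evident[(tok['file'], tok['lemma'])].add(tok['sense'])

    out = []
    for tok in tagged_tokens:
        new = dict(tok)
        senses = evident[(tok['file'], tok['lemma'])]
        # Propagate only when exactly one sense was seen; if the
        # evidence conflicts (or there is none), leave the token untagged.
        if new['sense'] is None and len(senses) == 1:
            new['sense'] = next(iter(senses))
        out.append(new)
    return out
```

Under this sketch, a file in which every tagged occurrence of a lemma agrees gains coverage for its untagged occurrences, while a file with conflicting tags is left untouched, mirroring the restraint described in the text.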

7 Conclusion

We quantified coverage and accuracy of sense disambiguation of verbs, adjectives, and nouns in the SENSEVAL-2 English all-words test corpus using automatically acquired selectional preferences. We improved coverage and recall by applying the one-sense-per-discourse heuristic. The results show that disambiguation models using only selectional preferences can perform with accuracy well above the random baseline, although accuracy would not be high enough for applications in the absence of other knowledge sources (Stevenson and Wilks 2001). The results compare well with those for other systems that do not use sense-tagged training data.

Selectional preferences work well for some word combinations and grammatical relationships, but not well for others. We hope in future work to identify the situations in which selectional preferences have high precision and to focus on these at the expense of coverage, on the assumption that other knowledge sources can be used where there is not strong evidence from the preferences. The first-sense heuristic, based on sense-tagged data such as that available in SemCor, seems to beat unsupervised models such as ours. For many words, however, the predominant sense varies across domains, and so we contend that it is worth concentrating on detecting when the first sense is not relevant and where the selectional-preference models provide a high probability for a secondary sense. In these cases evidence for a sense can be taken from multiple occurrences of the word in the document, using the one-sense-per-discourse heuristic.

[11] Krovetz just looked at "actual ambiguity," that is, words with more than one sense in SemCor. We define polysemy as those words having more than one sense in WordNet, since we are using SENSEVAL-2 data and not SemCor.

Acknowledgments
This work was supported by UK EPSRC project GR/N36493 "Robust Accurate Statistical Parsing (RASP)" and EU FW5 project IST-2001-34460 "MEANING". We are grateful to Rob Koeling and three anonymous reviewers for their helpful comments on earlier drafts. We would also like to thank David Weir and Mark McLauchlan for useful discussions.

References

Abney, Steven and Marc Light. 1999. Hiding a semantic class hierarchy in a Markov model. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, pages 1-8.

Agirre, Eneko and David Martinez. 2001. Learning class-to-class selectional preferences. In Proceedings of the Fifth Workshop on Computational Language Learning (CoNLL-2001), pages 15-22.

Atkins, Sue. 1993. Tools for computer-aided lexicography: The Hector project. In Papers in Computational Lexicography: COMPLEX '93, Budapest.

Briscoe, Ted and John Carroll. 1993. Generalised probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1):25-59.

Briscoe, Ted and John Carroll. 1995. Developing and evaluating a probabilistic LR parser of part-of-speech and punctuation labels. In Fourth ACL/SIGPARSE International Workshop on Parsing Technologies, pages 48-58, Prague, Czech Republic.

Carroll, John, Ted Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of the International Conference on Language Resources and Evaluation, pages 447-454.

Carroll, John, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In EACL-99 Post-conference Workshop on Linguistically Interpreted Corpora, pages 35-41, Bergen, Norway.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying English text for language impaired readers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 269-270, Bergen, Norway.

Ciaramita, Massimiliano and Mark Johnson. 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. In Proceedings of the 18th International Conference of Computational Linguistics (COLING-00), pages 187-193.

Cotton, Scott, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 2001. SENSEVAL-2. Available at http://www.sle.sharp.co.uk/senseval2.

Elworthy, David. 1994. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 53-58, Stuttgart, Germany.

Federici, Stefano, Simonetta Montemagni, and Vito Pirrelli. 1999. SENSE: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207-218.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Gale, William, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415-439.

Krovetz, Robert. 1998. More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop. Available at http://www.itri.bton.ac.uk/events/senseval/ARCHIVE/PROCEEDINGS/.

Li, Hang and Naoki Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 239-248, Bulgaria.

Li, Hang and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217-244.

Light, Marc and Warren Greiff. 2002. Statistical models for the induction and use of selectional preferences. Cognitive Science, 26(3):269-281.

McCarthy, Diana. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52-61.

McCarthy, Diana. 2001. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. PhD thesis, University of Sussex.

Miller, George A., Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303-308. Morgan Kaufmann.

Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207-223.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40-47.

Palmer, Martha, Hoa Trang Dang, and Christiane Fellbaum. 2003. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering. Forthcoming.

Preiss, Judita and Anna Korhonen. 2002. Improving subcategorization acquisition with WSD. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA.

Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, pages 52-57, Washington, DC.

Ribas, Francesc. 1995. On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112-118.

Rissanen, Jorma. 1978. Modelling by shortest data description. Automatica, 14:465-471.

Stevenson, Mark and Yorick Wilks. 2001. The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321-349.
