One sense per collocation.pdf


  • 7/28/2019 One sense per collocation.pdf


ONE SENSE PER COLLOCATION

David Yarowsky*

Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
yarowsky@unagi.cis.upenn.edu

ABSTRACT

Previous work [Gale, Church and Yarowsky, 1992] showed that with high probability a polysemous word has one sense per discourse. In this paper we show that for certain definitions of collocation, a polysemous word exhibits essentially only one sense per collocation. We test this empirical hypothesis for several definitions of sense and collocation, and discover that it holds with 90-99% accuracy for binary ambiguities. We utilize this property in a disambiguation algorithm that achieves precision of 92% using combined models of very local context.

1. INTRODUCTION

The use of collocations to resolve lexical ambiguities is certainly not a new idea. The first approaches to sense disambiguation, such as [Kelly and Stone 1975], were based on simple hand-built decision tables consisting almost exclusively of questions about observed word associations in specific positions. Later work from the AI community relied heavily upon selectional restrictions for verbs, although primarily in terms of features exhibited by their arguments (such as +DRINKABLE) rather than in terms of individual words or word classes. More recent work [Brown et al. 1991][Hearst 1991] has utilized a set of discrete local questions (such as word-to-the-right) in the development of statistical decision procedures. However, a strong trend in recent years is to treat a reasonably wide context window as an unordered bag of independent evidence points. This technique from information retrieval has been used in neural networks, Bayesian discriminators, and dictionary definition matching. In a comparative paper in this volume [Leacock et al. 1993], all three methods under investigation used words in wide context as a pool of evidence independent of relative position. It is perhaps not a coincidence that this work has focused almost exclusively on nouns, as will be shown in Section 6.2. In this study we will return again to extremely local sources of evidence, and show that models of discrete syntactic relationships have considerable advantages.

* This research was supported by an NDSEG Fellowship and by DARPA grant N00014-90-J-1863. The author is also affiliated with the Linguistics Research Department of AT&T Bell Laboratories, and greatly appreciates the use of its resources in support of this work. He would also like to thank Eric Brill, Bill Gale, Libby Levison, Mitch Marcus and Philip Resnik for their valuable feedback.

2. DEFINITIONS OF SENSE

The traditional definition of word sense is "One of several meanings assigned to the same orthographic string". As meanings can always be partitioned into multiple refinements, senses are typically organized in a tree such as one finds in a dictionary. In the extreme case, one could continue making refinements until a word has a slightly different sense every time it is used. If so, the title of this paper is a tautology. However, the studies in this paper are focused on the sense distinctions at the top of the tree. A good working definition of the distinctions considered are those meanings which are not typically translated to the same word in a foreign language.

Therefore, one natural type of sense distinction to consider are those words in English which indeed have multiple translations in a language such as French. As is now standard in the field, we use the Canadian Hansards, a parallel bilingual corpus, to provide sense tags in the form of French translations. Unfortunately, the Hansards are highly skewed in their sense distributions, and it is difficult to find words for which there are adequate numbers of a second sense. More diverse large bilingual corpora are not yet readily available.

We also use data sets which have been hand-tagged by native English speakers. To make the selection of sense distinctions more objective, we use words such as bass where the sense distinctions (fish and musical instrument) correspond to pronunciation differences ([bæs] and [beɪs]). Such data is often problematic, as the tagging is potentially subjective and error-filled, and sufficient quantities are difficult to obtain.

As a solution to the data shortages for the above methods, [Gale, Church and Yarowsky 1992b] proposed the use of "pseudo-words," artificial sense ambiguities created by taking two English words with the same part of speech (such as guerrilla and reptile), and replacing each instance of both in a corpus with a new polysemous word guerrilla/reptile. As it is entirely possible that the concepts guerrilla and reptile are represented by the same orthographic string in some foreign language, choosing between these two meanings based on context is a problem a word sense disambiguation algorithm could easily face. "Pseudo-words" are very useful for developing and testing disambiguation methods because of their nearly unlimited availability and the known, fully reliable ground truth they provide when grading performance.

Finally, we consider sense disambiguation for mediums other than clean English text. For example, we look at word pairs such as terse/tense and cookie/rookie which may be plausibly confused in optical character recognition (OCR). Homophones, such as aid/aide and censor/sensor, are ideal candidates for such a study because large data sets with known ground truth are available in written text, yet they are true ambiguities which must be resolved routinely in oral communication.

We discover that the central claims of this paper hold for all of these potential definitions of sense. This corroborating evidence makes us much more confident in our results than if they were derived solely from a relatively small hand-tagged data set.
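The pseudo-word construction described earlier in this section can be illustrated with a minimal sketch. It assumes the corpus is already tokenized into a list of strings; the function name `make_pseudoword` and the "/" separator are choices made here for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def make_pseudoword(tokens: List[str], word_a: str, word_b: str,
                    sep: str = "/") -> Tuple[List[str], List[Tuple[int, str]]]:
    """Replace every occurrence of word_a or word_b with one merged token.

    Returns the rewritten token list plus the hidden ground truth as
    (position, original_word) pairs, which can be used to grade a
    disambiguation algorithm afterwards.
    """
    pseudo = f"{word_a}{sep}{word_b}"
    rewritten: List[str] = []
    truth: List[Tuple[int, str]] = []
    for i, tok in enumerate(tokens):
        if tok in (word_a, word_b):
            rewritten.append(pseudo)   # the artificial ambiguous word
            truth.append((i, tok))     # remember which "sense" it really was
        else:
            rewritten.append(tok)
    return rewritten, truth


# Toy example (the sentence is invented for illustration).
tokens = "the guerrilla hid while the reptile slept".split()
ambiguous, truth = make_pseudoword(tokens, "guerrilla", "reptile")
```

Because the original word at each position is recorded, the ground truth is fully reliable by construction, which is the property the text emphasizes.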

3. DEFINITIONS OF COLLOCATION

Collocation means the co-occurrence of two words in some defined relationship. We look at several such relationships, including direct adjacency and first word to the left or right having a certain part-of-speech. We also consider certain direct syntactic relationships, such as verb/object, subject/verb, and adjective/noun pairs. It appears that content words (nouns, verbs, adjectives, and adverbs) behave quite differently from function words (other parts of speech); we make use of this distinction in several definitions of collocation.

We will attempt to quantify the validity of the one-sense-per-collocation hypothesis for these different collocation types.
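As a rough illustration of the positional collocation types listed above, the sketch below extracts four of them from a part-of-speech tagged sentence. The tag names in `CONTENT_TAGS`, the feature labels, and the function names are assumptions made here; the syntactic relationships (verb/object, subject/verb, adjective/noun) would need a parser and are not covered.

```python
from typing import Dict, List, Optional, Tuple

# Assumed coarse tag names for content words; any tagset could be substituted.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def first_content(tagged: List[Tuple[str, str]], start: int, step: int) -> Optional[str]:
    """Scan from `start` in direction `step` and return the first content word."""
    i = start
    while 0 <= i < len(tagged):
        word, tag = tagged[i]
        if tag in CONTENT_TAGS:
            return word
        i += step
    return None


def local_collocations(tagged: List[Tuple[str, str]], target: int) -> Dict[str, str]:
    """Return a few local collocations for the word at index `target`."""
    feats: Dict[str, str] = {}
    if target > 0:
        feats["word-to-left"] = tagged[target - 1][0]
    if target + 1 < len(tagged):
        feats["word-to-right"] = tagged[target + 1][0]
    left = first_content(tagged, target - 1, -1)
    right = first_content(tagged, target + 1, +1)
    if left is not None:
        feats["content-word-to-left"] = left
    if right is not None:
        feats["content-word-to-right"] = right
    return feats


# Invented example sentence; index 2 is the ambiguous word.
sentence = [("the", "DET"), ("presidential", "ADJ"), ("aide", "NOUN"),
            ("to", "ADP"), ("the", "DET"), ("senator", "NOUN")]
feats = local_collocations(sentence, 2)
```

Here the function word "to" is the directly adjacent right neighbor, while the first content word to the right is "senator", which shows why the two definitions can yield different evidence for the same context.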

4. EXPERIMENTS

In the experiments, we ask two central, related questions: For each definition of sense and collocation,

What is the mean entropy of the distribution Pr(Sense | Collocation)?

What is the performance of a disambiguation algorithm which uses only that collocation type as evidence?

We examine several permutations for each, and are interested in how the results of these questions differ when applied to polysemous nouns, verbs, and adjectives.

To limit the already very large number of parameters considered, we study only binary sense distinctions. In all cases the senses being compared have the same part of speech. The selection between different possible parts of speech has been heavily studied and is not replicated here.

4.1. Sample Collection

All samples were extracted from a 380 million word corpus collection consisting of newswire text (AP Newswire and Wall Street Journal), scientific abstracts (from NSF and the Department of Energy), the Canadian Hansards parliamentary debate records, Grolier's Encyclopedia, a medical encyclopedia, over 100 Harper & Row books, and several smaller corpora including the Brown Corpus, and ATIS and TIMIT sentences.¹

Hand Tagged (homographs): bass, axes, chi, bow, colon, lead, IV, sake, tear, ...
French Translation Distinctions: sentence, duty, drug, language, position, paper, single, ...
Homophones: aid/aide, cellar/seller, censor/sensor, cue/queue, pedal/petal, ...
OCR Ambiguities: terse/tense, gum/gym, deaf/dear, cookie/rookie, beverage/leverage, ...
Pseudo-Words: covered/waved, kissed/slapped, abused/escorted, cute/compatible, ...

Table 1: A sample of the words used in the experiments

The homophone pairs used were randomly selected from a list of words having the same pronunciation or which differed in only one phoneme. The OCR and pseudo-word pairs were randomly selected from corpus wordlists, with the former restricted to pairs which could plausibly be confused in a noisy FAX, typically words differing in only one character. Due to the difficulty of obtaining new data, the hand-tagged and French translation examples were borrowed from those used in our previous studies in sense disambiguation.

¹ Training and test samples were not only extracted from different articles or discourses but also from entirely different blocks of the corpus. This was done to minimize long range discourse effects such as one finds in the AP or Hansards.

4.2. Measuring Entropies

When computing the entropy of Pr(Sense | Collocation), we enumerate all collocations of a given type observed for the word or word pair being disambiguated. Table 2 shows the example of the homophone ambiguity aid/aide for the collocation type content-word-to-the-left. We list all words² appearing in such a collocation with either of these two "senses" of the homophone, and calculate the raw distributional count for each.

Collocation       Frequency as Aid    Frequency as Aide
foreign                 718                   1
federal                 297                   0
western                 146                   0
provide                  88                   0
covert                   26                   0
oppose                   13                   0
future                    9                   0
similar                   6                   0
presidential              0                  63
chief                     0                  40
longtime                  0                  26
aids-infected             0                   2
sleepy                    0                   1
disaffected               0                   1
indispensable             2                   1
practical                 2                   0
squander                  1                   0

Table 2: A typical collocational distribution for the homophone ambiguity aid/aide.

² Note: the entries in this table are lemmas (uninflected root forms), rather than raw words. By treating the verbal inflections squander, squanders, squandering, and squandered as the same word, one can improve statistics and coverage at a slight cost of lost subtlety. Although we will refer to "words in collocation" throughout this paper for simplicity, this should always be interpreted as "lemmas in collocation."

Note that the vast majority of the entries in Table 2 have zero as one of the frequency counts. It is not acceptable, however, to treat these as having zero probability and hence a zero entropy for the distribution. It is quite possible, especially for the lower frequency distributions, that we would see a contrary example in a larger sample. By cross-validation, we discover for the aid/aide example that for collocations with an observed 1/0 distribution, we would actually expect the minor sense to occur 6% of the time in an independent sample, on average. Thus a fairer distribution would be .94/.06, giving a cross-validated entropy of .33 bits rather than 0 bits. For a more unbalanced observed distribution, such as 10/0, the probability of seeing the minor sense decreases to 2%, giving a cross-validated entropy of H(.98, .02) = .14 bits. Repeating this process and taking the weighted mean yields the entropy of the full distribution, in this case .09 bits for the aid/aide ambiguity.

For each type of collocation, we also compute how well an observed probability distribution predicts the correct classification for novel examples. In general, this is a more useful measure for most of the comparison purposes we will address. Not only does it reflect the underlying entropy of the distribution, but it also has the practical advantage of showing how a working system would perform given this data.
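The entropy figures quoted in Section 4.2 can be checked with a short sketch. The binary entropy formula is standard; the helper `weighted_mean_entropy` and its input format (pairs of collocation frequency and cross-validated minor-sense probability) are assumptions made here, since the paper does not spell out how the weighting was implemented.

```python
import math
from typing import Iterable, Tuple


def binary_entropy(p_minor: float) -> float:
    """Entropy in bits of a two-sense distribution (1 - p_minor, p_minor)."""
    if p_minor <= 0.0 or p_minor >= 1.0:
        return 0.0
    p_major = 1.0 - p_minor
    return -(p_major * math.log2(p_major) + p_minor * math.log2(p_minor))


def weighted_mean_entropy(items: Iterable[Tuple[int, float]]) -> float:
    """Frequency-weighted mean entropy over collocations.

    Each item is (frequency, cross-validated minor-sense probability).
    """
    items = list(items)
    total = sum(freq for freq, _ in items)
    return sum(freq * binary_entropy(p) for freq, p in items) / total


print(round(binary_entropy(0.06), 2))  # 0.33, the observed 1/0 case
print(round(binary_entropy(0.02), 2))  # 0.14, the observed 10/0 case

# Hypothetical mix of one 1/0 collocation and one 10/0 collocation.
mixed = weighted_mean_entropy([(1, 0.06), (10, 0.02)])
```

Because high-frequency collocations carry more weight and have lower cross-validated entropy, the weighted mean over a full distribution falls below the single-observation value, consistent with the .09 bits reported for aid/aide.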

5. ALGORITHM

The sense disambiguation algorithm used is quite straightforward. When based on a single collocation type, such as the object of the verb or word immediately to the left, the procedure is very simple. One identifies if this collocation type exists for the novel context and if the specific words found are listed in the table of probability distributions (as computed above). If so, we return the sense which was most frequent for that collocation in the training data. If not, we return the sense which is most frequent overall.

When we consider more than one collocation type and combine evidence, the process is more complicated. The algorithm used is based on decision lists [Rivest, 1987], and was discussed in [Sproat, Hirschberg, and Yarowsky 1992]. The goal is to base the decision on the single best piece of evidence available. Cross-validated probabilities are computed as in Section 4.2, and the different types of evidence are sorted by the absolute value of the log of these probability ratios:

$$\mathrm{Abs}\left(\mathrm{Log}\left(\frac{\Pr(\mathrm{Sense}_1 \mid \mathrm{Collocation}_i)}{\Pr(\mathrm{Sense}_2 \mid \mathrm{Collocation}_i)}\right)\right)$$

When a novel context is encountered, one steps through the decision list until the evidence at that point in the list (such as word-to-left="presidential") matches the current context under consideration. The sense with the greatest listed probability is returned, and this cross-validated probability represents the confidence in the answer.

This approach is well-suited for the combination of multiple evidence types which are clearly not independent (such as those found in this study) as probabilities are never combined. Therefore this method offers advantages over Bayesian classifier techniques which assume independence of the features used. It also offers advantages over decision tree based techniques because the training pools are not split at each question. The interesting problems are how one should re-estimate probabilities conditional on questions asked earlier in the list, or how one should prune lower evidence which is categorically subsumed by higher evidence or is entirely conditional on higher evidence. [Bahl et al. 1989] have discussed some of these issues at length, and there is not space to consider them here. For simplicity, in this experiment no secondary smoothing or pruning is done. This does not appear to be problematic when small numbers of independent evidence types are used, but performance should increase if this extra step is taken.
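The decision-list procedure of this section can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the data layout, the function names, the evidence word "package", and the example probabilities are invented here, and the fallback to a default sense when nothing matches is borrowed from the single-collocation procedure. Probabilities are assumed to be cross-validated and therefore never exactly zero.

```python
import math
from typing import Dict, List, Optional, Tuple

Evidence = Tuple[str, str]                  # (collocation type, word)
Rule = Tuple[float, Evidence, str, float]   # (score, evidence, sense, probability)


def build_decision_list(probs: Dict[Evidence, Tuple[float, float]],
                        senses: Tuple[str, str]) -> List[Rule]:
    """Sort evidence by the absolute log of the ratio of sense probabilities."""
    rules: List[Rule] = []
    for evidence, (p1, p2) in probs.items():
        score = abs(math.log(p1 / p2))      # requires p1, p2 > 0
        if p1 >= p2:
            rules.append((score, evidence, senses[0], p1))
        else:
            rules.append((score, evidence, senses[1], p2))
    rules.sort(key=lambda rule: rule[0], reverse=True)
    return rules


def classify(rules: List[Rule], context: Dict[str, str],
             default_sense: str) -> Tuple[str, Optional[float]]:
    """Return the sense and confidence of the first rule matching the context."""
    for _score, (ctype, word), sense, prob in rules:
        if context.get(ctype) == word:
            return sense, prob
    return default_sense, None              # no listed evidence applies


# Hypothetical cross-validated probabilities as (Pr(aid), Pr(aide)).
probs = {
    ("word-to-left", "presidential"): (0.02, 0.98),
    ("word-to-right", "package"): (0.94, 0.06),
}
rules = build_decision_list(probs, ("aid", "aide"))
answer = classify(rules, {"word-to-left": "presidential",
                          "word-to-right": "package"}, default_sense="aid")
```

When both pieces of evidence are present, only the stronger one (here the left-hand "presidential") decides the answer; the two probabilities are never multiplied, which is why the method does not need an independence assumption.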

6. RESULTS AND DISCUSSION

6.1. One Sense Per Collocation

For the collocations studied, it appears that the hypothesis of one sense per collocation holds with high probability for binary ambiguities. The experimental results in the precision column of Table 3 quantify the validity of this claim. Accuracy varies from 90% to 99% for different types of collocation and part of speech, with a mean of 95%. The significance of these differences will be discussed in Section 6.2.

These precision values have several interpretations. First, they reflect the underlying probability distributions of sense


Figure 1 shows that nouns, verbs and adjectives also differ in their ability to be disambiguated by wider context. [Gale et al. 1993] previously showed that nouns can be disambiguated based strictly on distant context, and that useful information was present up to 10,000 words away. We replicated an experiment in which performance was calculated for disambiguations based strictly on 5 word windows centered at various distances (shown on the horizontal axis). Gale's observation was tested only on nouns; our experiment also shows that reasonably accurate decisions may be made for nouns using exclusively remote context. Our results in this case are based on test sets with equal numbers of the two senses. Hence chance performance is at 50%. However, when tested on verbs and adjectives, precision drops off with a much steeper slope as the distance from the ambiguous word increases. This would indicate that approaches giving equal weight to all positions in a broad window of context may be less well-suited for handling verbs and adjectives. Models which give greater weight to immediate context would seem more appropriate in these circumstances.

A similar experiment was applied to function words, and the dropoff beyond strictly immediate context was precipitous, converging at near chance performance for distances greater than 5. However, function words did appear to have predictive power of roughly 5% greater than chance in directly adjacent positions. The effect was greatest for verbs, where the function word to the right (typically a preposition or particle) served to disambiguate at a precision of 13% above chance. This would indicate that methods which exclude function words from models to minimize noise should consider their inclusion, but only for restricted local positions.

6.3. Comparison of Sense Definitions

Results for the 5 different definitions of sense ambiguity studied here are similar. However, they tend to fluctuate relative to each other across experiments, and there appears to be no consistent ordering of the mean entropy of the different types of sense distributions. Because of the very large number of permutations considered, it is not possible to give a full breakdown of the differences, and such a breakdown does not appear to be terribly informative. The important observation, however, is that the basic conclusions drawn from this paper hold for each of the sense definitions considered, and hence corroborate and strengthen the conclusions which can be drawn from any one.

6.4. Performance Given Little Evidence

One of the most striking conclusions to emerge from this study is that for the local collocations considered, decisions based on a single data point are highly reliable. Normally one would consider a 1/0 sense distribution in a 3944 sample training set to be noise, with performance based on this information not likely to much exceed the 69% prior probability expected by chance. But this is not what we observe. For example, when tested on the word-to-the-right collocation, disambiguations based solely on a single data point exceed 92% accuracy, and performance on 2/0 and 3/0 distributions climbs rapidly from there, and reaches nearly perfect accuracy for training samples as small as 15/0, as shown in Figure 2. In contrast, a collocation 30 words away which also exhibits a 1/0 sense distribution has a predictive value of only 3% greater than chance. This difference in the reliability of low frequency data from local and wide context will have implications for algorithm design.

[Figure 2 plot, titled "Low Counts are Reliable"; horizontal axis: Training Frequency (f)]

Figure 2: Percentage correct for disambiguations based solely on a single content-word-to-the-right collocation seen f times in the training data without counter-examples.

7. APPLICATIONS

7.1. Training Set Creation and Verification

This last observation has relevance for new data set creation and correction. Collocations with an ambiguous content word which have frequency greater than 10-15 and which do not belong exclusively to one sense should be flagged for human reinspection, as they are most likely in error. One can speed the sense tagging process by computing the most frequent collocates, and for each one assigning all examples to the same sense. For the data in Table 2, this will apparently fail for the foreign aid/aide example in 1 out of 719 instances (still 99.9% correct). However, in this example the model's classification was actually correct; the given usage was a misspelling in the 1992 AP Newswire: "Bush accelerated foreign aide and weapons sales to Iraq." It is quite likely that if it were indeed a foreign assistant being discussed, this example would also have another collocation (with the verb, for example), which would indicate the correct sense. Such inconsistencies should also be flagged for human supervision. Working from the most to least frequent collocates in this manner, one can use previously tagged collocates to automatically suggest the classification of other words appearing in different collocation types for those tagged examples. The one sense per discourse constraint can be used to refine this process further. We are working on a similar use of these two constraints for unsupervised sense clustering.

7.2. Algorithm Design

Our results also have implications for algorithm design. For the large number of current approaches which treat wide context as an unordered bag of words, it may be beneficial to model certain local collocations separately. We have shown that reliability of collocational evidence differs considerably between local and distant context, especially for verbs and adjectives. If one is interested in providing a probability with an answer, modeling local collocations separately will improve the probability estimates and reduce cross entropy.

Another reason for modeling local collocations separately is that this will allow the reliable inclusion of evidence with very low frequency counts. Evidence with observed frequency distributions of 1/0 typically constitutes on the order of 50% of all available evidence types, yet in a wide context window this low frequency evidence is effectively noise, with predictive power little better than chance. However, in very local collocations, single data points carry considerable information, and when used alone can achieve precision in excess of 92%. Their inclusion should improve system recall, with a much-reduced danger of overmodeling the data.

7.3. Building a Full Disambiguation System

Finally, one may ask to what extent local collocational evidence alone can support a practical sense disambiguation algorithm. As shown in Table 3, our models of single collocation types achieve high precision, but individually their applicability is limited. However, if we combine these models as described in Section 5, and use an additional function word collocation model when no other evidence is available, we achieve full coverage at a precision of 92%. This result is comparable to those previously reported in the literature using wider context of up to 50 words away [5,6,7,12]. Due to the large number of variables involved, we shall not attempt to compare these directly. Our results are encouraging, however, and we plan to conduct a more formal comparison of the "bag of words" approaches relative to our separate modeling of local collocation types. We will also consider additional collocation types covering a wider range of syntactic relationships. In addition, we hope to incorporate class-based techniques, such as the modeling of verb-argument selectional preferences [Resnik, 1992], as a mechanism for achieving improved performance on unfamiliar collocations.

8. CONCLUSION

This paper has examined some of the basic distributional properties of lexical ambiguity in the English language. Our experiments have shown that for several definitions of sense and collocation, an ambiguous word has only one sense in a given collocation with a probability of 90-99%. We showed how this claim is influenced by part-of-speech, distance, and sample frequency. We discussed the implications of these results for data set creation and algorithm design, identifying potential weaknesses in the common "bag of words" approach to disambiguation. Finally, we showed that models of local collocation can be combined in a disambiguation algorithm that achieves overall precision of 92%.

References

1. Bahl, L., P. Brown, P. de Souza, R. Mercer, "A Tree-Based Statistical Language Model for Natural Language Speech Recognition," in IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 1989.
2. Brown, Peter, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer, "Word Sense Disambiguation using Statistical Methods," Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, 1991, pp 264-270.
3. Gale, W., K. Church, and D. Yarowsky, "One Sense Per Discourse," Proceedings of the 4th DARPA Speech and Natural Language Workshop, 1992.
4. Gale, W., K. Church, and D. Yarowsky, "On Evaluation of Word-Sense Disambiguation Systems," in Proceedings, 30th Annual Meeting of the Association for Computational Linguistics, 1992b.
5. Gale, W., K. Church, and D. Yarowsky, "A Method for Disambiguating Word Senses in a Large Corpus," in Computers and the Humanities, 1993.
6. Hearst, Marti, "Noun Homograph Disambiguation Using Local Context in Large Text Corpora," in Using Corpora, University of Waterloo, Waterloo, Ontario, 1991.
7. Leacock, Claudia, Geoffrey Towell and Ellen Voorhees, "Corpus-Based Statistical Sense Resolution," in Proceedings, ARPA Human Language Technology Workshop, 1993.
8. Kelly, Edward, and Phillip Stone, Computer Recognition of English Word Senses, North-Holland, Amsterdam, 1975.
9. Resnik, Philip, "A Class-based Approach to Lexical Discovery," in Proceedings of 30th Annual Meeting of the Association for Computational Linguistics, 1992.
10. Rivest, R. L., "Learning Decision Lists," in Machine Learning, 2, 1987, pp 229-246.
11. Sproat, R., J. Hirschberg and D. Yarowsky, "A Corpus-based Synthesizer," in Proceedings, International Conference on Spoken Language Processing, Banff, Alberta. October 1992.
12. Yarowsky, David, "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora," in Proceedings, COLING-92, Nantes, France, 1992.
