
Lecture: Word Sense Disambiguation


Page 1: Lecture: Word Sense Disambiguation

Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Word Sense Disambiguation

Marina Santini

[email protected]  

 

Department of Linguistics and Philology

Uppsala University, Uppsala, Sweden

 

Spring  2016  

 


Page 2: Lecture: Word Sense Disambiguation

Previous Lecture: Word Senses
• Homonymy, polysemy, synonymy, metonymy, etc.

Practical activities: 1) SELECTIONAL RESTRICTIONS; 2) MANUAL DISAMBIGUATION OF EXAMPLES USING SENSEVAL SENSES

AIMS OF PRACTICAL ACTIVITIES:
• STUDENTS SHOULD GET ACQUAINTED WITH REAL DATA
• EXPLORATIONS OF APPLICATIONS, RESOURCES AND METHODS.


Page 3: Lecture: Word Sense Disambiguation

No preset solutions (this slide is to tell you that you are doing well :-) )

• Whatever your experience with data, it is a valuable experience:
  • Disappointment
  • Frustration
  • Feeling lost
  • Happiness
  • Power
  • Excitement
  • …
• All the students so far (also in previous courses) have presented their own solutions… many different solutions, and that is ok…

Page 4: Lecture: Word Sense Disambiguation

J&M's own solutions: Selectional Restrictions (just for your records; it does not mean they are necessarily better than yours…)


Page 5: Lecture: Word Sense Disambiguation

Other possible solutions…
• Kiss → concrete sense: touching with lips/mouth
  • animate kiss [using lips/mouth] animate/inanimate
  • Ex: he kissed her; the dolphin kissed the kid; why does the pope kiss the ground after he disembarks…

• Kiss → figurative sense: touching
  • animate kiss inanimate
  • Ex: "Walk as if you are kissing the Earth with your feet."


pursed  lips?  

Page 6: Lecture: Word Sense Disambiguation

No solutions or comments provided for Senseval

• All your impressions and feelings are plausible and acceptable :-)


Page 7: Lecture: Word Sense Disambiguation

Remember that in both activities…

• You have experienced cases of POLYSEMY!

• YOU HAVE TRIED TO DISAMBIGUATE THE SENSES MANUALLY, I.E. WITH YOUR HUMAN SKILLS…


Page 8: Lecture: Word Sense Disambiguation

Previous  lecture:  end  


Page 9: Lecture: Word Sense Disambiguation

Today: Word Sense Disambiguation (WSD)
• Given:
  • a word in context;
  • a fixed inventory of potential word senses;
• create a system that automatically decides which sense of the word is correct in that context.

Page 10: Lecture: Word Sense Disambiguation

Word Sense Disambiguation: Definition
• Word Sense Disambiguation (WSD) is the TASK of determining the correct sense of a word in context.
• It is an automatic task: we create a system that automatically disambiguates the senses for us.

• Useful for many NLP tasks: information retrieval (apple the fruit or Apple the company?), question answering (does United serve Philadelphia?), machine translation (English "bat" → Italian pipistrello, the animal, or mazza, the club?)


Page 11: Lecture: Word Sense Disambiguation

Anecdote:  the  poison  apple  

• In 1954, Alan Turing died after biting into an apple laced with cyanide.

• It was said that this half-bitten apple inspired the Apple logo… but apparently it is a legend :-)

• http://mentalfloss.com/article/64049/did-alan-turing-inspire-apple-logo


Page 12: Lecture: Word Sense Disambiguation

Be  alert…  

•  Word  sense  ambiguity  is  pervasive  !!!  


Page 13: Lecture: Word Sense Disambiguation

Acknowledgements Most  slides  borrowed  or  adapted  from:  

Dan Jurafsky and James H. Martin

Dan  Jurafsky  and  Christopher  Manning,  Coursera    

 

J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/

 

     

Page 14: Lecture: Word Sense Disambiguation

Outline:  WSD  Methods  

• Thesaurus/Dictionary Methods
• Supervised Machine Learning
• Semi-Supervised Learning (self-reading)


Page 15: Lecture: Word Sense Disambiguation

Word Sense Disambiguation

Dictionary and Thesaurus Methods

Page 16: Lecture: Word Sense Disambiguation

The  Simplified  Lesk  algorithm  

• Let's disambiguate "bank" in this sentence: The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.

•  given  the  following  two  WordNet  senses:    

function SIMPLIFIED LESK(word, sentence) returns best sense of word
  best-sense <- most frequent sense for word
  max-overlap <- 0
  context <- set of words in sentence
  for each sense in senses of word do
    signature <- set of words in the gloss and examples of sense
    overlap <- COMPUTEOVERLAP(signature, context)
    if overlap > max-overlap then
      max-overlap <- overlap
      best-sense <- sense
  end
  return(best-sense)

Figure 16.6 The Simplified Lesk algorithm. The COMPUTEOVERLAP function returns the number of words in common between two sets, ignoring function words or other words on a stop list. The original Lesk algorithm defines the context in a more complex way. The Corpus Lesk algorithm weights each overlapping word w by its -log P(w) and includes labeled training corpus data in the signature.

bank1  Gloss: a financial institution that accepts deposits and channels the money into lending activities
       Examples: "he cashed a check at the bank", "that bank holds the mortgage on my home"
bank2  Gloss: sloping land (especially the slope beside a body of water)
       Examples: "they pulled the canoe up on the bank", "he sat on the bank of the river and watched the currents"

Sense bank1 has two non-stopwords overlapping with the context in (16.19): deposits and mortgage, while sense bank2 has zero words, so sense bank1 is chosen.

There are many obvious extensions to Simplified Lesk. The original Lesk algorithm (Lesk, 1986) is slightly more indirect. Instead of comparing a target word's signature with the context words, the target signature is compared with the signatures of each of the context words. For example, consider Lesk's example of selecting the appropriate sense of cone in the phrase pine cone given the following definitions for pine and cone.

pine  1 kinds of evergreen tree with needle-shaped leaves
      2 waste away through sorrow or illness
cone  1 solid body which narrows to a point
      2 something of this shape whether solid or hollow
      3 fruit of certain evergreen trees

In this example, Lesk's method would select cone3 as the correct sense since two of the words in its entry, evergreen and tree, overlap with words in the entry for pine, whereas neither of the other entries has any overlap with words in the definition of pine. In general Simplified Lesk seems to work better than original Lesk.

The primary problem with either the original or simplified approaches, however, is that the dictionary entries for the target words are short and may not provide enough chance of overlap with the context.³ One remedy is to expand the list of words used in the classifier to include words related to, but not contained in, their […]

³ Indeed, Lesk (1986) notes that the performance of his system seems to roughly correlate with the length of the dictionary entries.
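A minimal Python sketch of the pseudocode above (an addition to these notes, not part of the original slides): it uses NLTK's WordNet glosses and examples as the signature and NLTK's English stop list as the function-word filter, so the winning sense comes back as an NLTK synset rather than the bank1/bank2 labels printed above.

from nltk.corpus import wordnet as wn, stopwords

STOP = set(stopwords.words('english'))

def simplified_lesk(word, sentence):
    # context: the non-stopword tokens of the sentence
    context = set(w.lower().strip('.,') for w in sentence.split()) - STOP
    senses = wn.synsets(word)
    if not senses:
        return None
    best_sense, max_overlap = senses[0], 0      # default: most frequent sense
    for sense in senses:
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature.update(example.lower().split())
        overlap = len((signature - STOP) & context)
        if overlap > max_overlap:
            max_overlap, best_sense = overlap, sense
    return best_sense

print(simplified_lesk("bank",
    "The bank can guarantee deposits will eventually cover future tuition "
    "costs because it invests in adjustable-rate mortgage securities"))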

Page 17: Lecture: Word Sense Disambiguation

The  Simplified  Lesk  algorithm  

The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.


Choose the sense with the most word overlap between gloss and context (not counting function words).

Page 18: Lecture: Word Sense Disambiguation

Drawback  

• Glosses and examples might be too short and may not provide enough chance of overlap with the context of the word to be disambiguated.


Page 19: Lecture: Word Sense Disambiguation

The Corpus(-based) Lesk algorithm

• Assumes we have some sense-labeled data (like SemCor)
• Take all the sentences with the relevant word sense:

These short, "streamlined" meetings usually are sponsored by local banks1, Chambers of Commerce, trade associations, or other civic organizations.

• Now add these to the gloss + examples for each sense, and call it the "signature" of a sense. Basically, it is an expansion of the dictionary entry.

• Choose the sense with the most word overlap between context and signature (i.e. the context words provided by the resources).

Page 20: Lecture: Word Sense Disambiguation

Corpus Lesk: IDF weighting
• Instead of just removing function words:

• Weigh each word by its `promiscuity' across documents
  • Down-weights words that occur in every `document' (gloss, example, etc.)
  • These are generally function words, but this is a more fine-grained measure

• Weigh each overlapping word by its inverse document frequency (IDF), as in the sketch below.
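A small illustration of the IDF-weighted overlap (the helper names are assumptions of this write-up): each gloss, example or sense-labelled sentence counts as one "document", and overlapping words contribute their IDF instead of a flat count.

import math
from collections import Counter

def idf_weights(documents):
    # documents: a list of token lists (one per gloss, example or SemCor sentence)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    n_docs = len(documents)
    return {w: math.log(n_docs / df[w]) for w in df}

def weighted_overlap(signature, context, idf):
    # Words that occur in almost every document get an IDF near zero,
    # so function words are down-weighted without an explicit stop list.
    return sum(idf.get(w, 0.0) for w in signature & context)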


Page 21: Lecture: Word Sense Disambiguation

Graph-based methods
• First, WordNet can be viewed as a graph
  • senses are nodes
  • relations (hypernymy, meronymy) are edges
  • Also add edges between words and unambiguous gloss words


[Figure: a fragment of the WordNet graph around drink1v; nodes include drink1v, drinker1n, drinking1n, toast4n, potation1n, sip1n, sip1v, beverage1n, milk1n, liquid1n, food1n, drink1n, helping1n, sup1v, consumption1n, consumer1n, consume1v.]

An undirected graph is a set of nodes that are connected together by bidirectional edges (lines).

Page 22: Lecture: Word Sense Disambiguation

How  to  use  the  graph  for  WSD  

“She drank some milk”
• choose the most central sense (several algorithms have been proposed recently; see the sketch below)


[Figure: the candidate senses of "drink" (drink1v through drink5v) and "milk" (milk1n through milk4n) in the WordNet graph, together with related nodes such as drinker1n, beverage1n, boozing1n, food1n, nutriment1n and drink1n.]
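A toy sketch of "choose the most central sense" using personalized PageRank, assuming the networkx library is available; the graph is a tiny hand-built fragment standing in for WordNet, and the restart mass is placed on the sense of the context word "drink" so that the better-connected sense of "milk" wins.

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("drink.v.01", "beverage.n.01"), ("drink.v.01", "drinker.n.01"),
    ("beverage.n.01", "milk.n.01"),  ("beverage.n.01", "food.n.01"),
    ("milk.n.02", "river.n.01"),     # an unrelated sense of "milk"
])

# "She drank some milk": restart on the sense of the context word "drink"
scores = nx.pagerank(G, personalization={"drink.v.01": 1.0})

best = max(["milk.n.01", "milk.n.02"], key=scores.get)
print(best)   # milk.n.01: it is reachable from the drink sense, milk.n.02 is not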

Page 23: Lecture: Word Sense Disambiguation

Word Meaning and Similarity

Word  Similarity:  Thesaurus  Methods  


Page 24: Lecture: Word Sense Disambiguation

Word  Similarity  

• Synonymy: a binary relation
  • Two words are either synonymous or not

• Similarity (or distance): a looser metric
  • Two words are more similar if they share more features of meaning

• Similarity is properly a relation between senses
  • We do not talk about the word "bank" being similar to the word "slope"; rather, we say:

• Bank1 is similar to fund3

• Bank2 is similar to slope5

• But we'll compute similarity over both words and senses

Page 25: Lecture: Word Sense Disambiguation

Why  word  similarity  

• Information retrieval
• Question answering
• Machine translation
• Natural language generation
• Language modeling
• Automatic essay grading
• Plagiarism detection
• Document clustering

Page 26: Lecture: Word Sense Disambiguation

Word  similarity  and  word  relatedness  

• We often distinguish word similarity from word relatedness
  • Similar words: near-synonyms
    • car, bicycle: similar

• Related words: can be related in any way
  • car, gasoline: related, not similar

Cf.  Synonyms:  car  &  automobile  

Page 27: Lecture: Word Sense Disambiguation

Two  classes  of  similarity  algorithms  

• Thesaurus-based algorithms
  • Are words "nearby" in the hypernym hierarchy?
  • Do words have similar glosses (definitions)?

• Distributional algorithms: next time!
  • Do words have similar distributional contexts?

Page 28: Lecture: Word Sense Disambiguation

Path-­‐based  similarity  

• Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
  • = they have a short path between them
  • concepts have a path of length 1 to themselves

Page 29: Lecture: Word Sense Disambiguation

Refinements to path-based similarity
• pathlen(c1, c2) (a distance metric) = 1 + the number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2

• simpath(c1, c2) = 1 / pathlen(c1, c2)

• wordsim(w1, w2) = max simpath(c1, c2) over c1 ∈ senses(w1), c2 ∈ senses(w2)

Sense  similarity  metric:  1  over  the  distance!  

Word  similarity  metric:    max  similarity  among  pairs  of  senses.  

For  all  senses  of  w1  and  all  senses  of  w2,  take  the  similarity  between  each  of  the  senses  of  w1  and  each  of  the  senses  of  w2  and  then  take  the  maximum  similarity  between  those  pairs.  

Page 30: Lecture: Word Sense Disambiguation

Example: path-based similarity
simpath(c1, c2) = 1 / pathlen(c1, c2)

simpath(nickel, coin) = 1/2 = .5
simpath(fund, budget) = 1/2 = .5
simpath(nickel, currency) = 1/4 = .25
simpath(nickel, money) = 1/6 = .17
simpath(coinage, Richter scale) = 1/6 = .17
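The same quantity can be queried from the real WordNet hierarchy with NLTK (a sketch: path_similarity is 1/(shortest path length + 1), matching the 1/pathlen convention above, but the printed values come from WordNet's actual graph and need not equal the toy numbers on this slide; the sense numbers below are assumptions).

from nltk.corpus import wordnet as wn

nickel = wn.synset('nickel.n.02')   # assumed to be the coin sense of "nickel"
coin   = wn.synset('coin.n.01')
money  = wn.synset('money.n.01')

print(nickel.path_similarity(coin))    # 1 / pathlen(nickel, coin)
print(nickel.path_similarity(money))   # 1 / pathlen(nickel, money)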

Page 31: Lecture: Word Sense Disambiguation

Problem  with  basic  path-­‐based  similarity  

• Assumes each link represents a uniform distance
  • But nickel to money seems to us to be closer than nickel to standard
  • Nodes high in the hierarchy are very abstract

• We instead want a metric that
  • represents the cost of each edge independently
  • makes words connected only through abstract nodes less similar

Page 32: Lecture: Word Sense Disambiguation

Informa$on  content  similarity  metrics  

• In simple words:
  • We define the probability of a concept C as the probability that a randomly selected word in a corpus is an instance of that concept.
  • Basically, for each random word in a corpus we compute how probable it is that it belongs to a certain concept.

Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI

Page 33: Lecture: Word Sense Disambiguation

Formally:  Informa$on  content  similarity  metrics  

•  Let’s  define  P(c) as:  •  The  probability  that  a  randomly  selected  word  in  a  corpus  is  an  instance  of  concept  c

• Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
  • for a given concept, each observed noun is either

    • a member of that concept with probability P(c)
    • not a member of that concept with probability 1 - P(c)

• All words are members of the root node (Entity)
  • P(root) = 1

• The lower a node is in the hierarchy, the lower its probability

Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI

Page 34: Lecture: Word Sense Disambiguation

Informa$on  content  similarity  

• For every concept (e.g. "natural elevation"), we count all the words in that concept, and then we normalize by the total number of words in the corpus.

• We get a probability value that tells us how probable it is that a random word is an instance of that concept:

P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N

[Figure: a fragment of the WordNet noun hierarchy under entity: geological-formation dominating natural elevation (with hill and ridge), shore (with coast), and cave (with grotto).]

In order to compute the probability of the concept "natural elevation", we take ridge, hill and natural elevation itself.
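A minimal sketch of this computation with made-up numbers (the counts and the word list are illustrative only, not corpus statistics): sum the counts of every word subsumed by the concept, divide by N, and take -log for the information content defined on the following slides.

import math

N = 1_000_000                                                    # total corpus tokens (made up)
counts = {"ridge": 120, "hill": 900, "natural elevation": 12}    # made-up counts

def concept_probability(words_under_concept, counts, N):
    # P(c): probability that a randomly chosen corpus word is an instance of c
    return sum(counts[w] for w in words_under_concept) / N

p = concept_probability(["ridge", "hill", "natural elevation"], counts, N)
ic = -math.log(p)                                                # IC(c) = -log P(c), i.e. surprisal
print(p, ic)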

Page 35: Lecture: Word Sense Disambiguation

Information content similarity
• WordNet hierarchy augmented with probabilities P(c)

D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998

Page 36: Lecture: Word Sense Disambiguation

Informa$on  content:  defini$ons  

1. Information content:
   IC(c) = -log P(c)

2. Most informative subsumer (lowest common subsumer):
   LCS(c1, c2) = the most informative (lowest) node in the hierarchy subsuming both c1 and c2

Page 37: Lecture: Word Sense Disambiguation

IC  aka…  

• A lot of people prefer the term surprisal to information or information content.

-log p(x) measures the amount of surprise generated by the event x. The smaller the probability of x, the bigger the surprisal. It's helpful to think about it this way, particularly for linguistics examples.

Page 38: Lecture: Word Sense Disambiguation

Using information content for similarity: the Resnik method

• The similarity between two words is related to their common information

• The more two words have in common, the more similar they are

• Resnik: measure common information as:
  • the information content of the most informative (lowest) subsumer (MIS/LCS) of the two nodes

• simresnik(c1, c2) = -log P(LCS(c1, c2))

Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.

Page 39: Lecture: Word Sense Disambiguation

Dekang  Lin  method  

• Intuition: similarity between A and B is not just what they have in common

• The more differences between A and B, the less similar they are:
  • Commonality: the more A and B have in common, the more similar they are
  • Difference: the more differences between A and B, the less similar they are

• Commonality: IC(common(A, B))
• Difference: IC(description(A, B)) - IC(common(A, B))

Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML

Page 40: Lecture: Word Sense Disambiguation

Dekang Lin similarity theorem
• The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are:

simLin(A, B) ∝ IC(common(A, B)) / IC(description(A, B))

• Lin (altering Resnik) defines IC(common(A, B)) as 2 × the information content of the LCS:

simLin(c1, c2) = 2 log P(LCS(c1, c2)) / ( log P(c1) + log P(c2) )

Page 41: Lecture: Word Sense Disambiguation

Lin similarity function

simLin(c1, c2) = 2 log P(LCS(c1, c2)) / ( log P(c1) + log P(c2) )

simLin(hill, coast) = 2 log P(geological-formation) / ( log P(hill) + log P(coast) )
                    = 2 ln 0.00176 / ( ln 0.0000189 + ln 0.0000216 ) = .59
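Both Resnik and Lin similarity are available through NLTK's WordNet interface together with pre-computed information-content files (a sketch, assuming the wordnet_ic data package is installed; because the probabilities come from Brown Corpus counts, the Lin value will be close to, but not necessarily exactly, the .59 above).

from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
hill  = wn.synset('hill.n.01')
coast = wn.synset('coast.n.01')

print(hill.res_similarity(coast, brown_ic))   # -log P(LCS(hill, coast))
print(hill.lin_similarity(coast, brown_ic))   # 2 log P(LCS) / (log P(hill) + log P(coast))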

Page 42: Lecture: Word Sense Disambiguation

The  (extended)  Lesk  Algorithm    

• A thesaurus-based measure that looks at glosses
  • Two concepts are similar if their glosses contain similar words

• Drawing paper: paper that is specially prepared for use in drafting
• Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface

• For each n-word phrase that's in both glosses, add a score of n²
  • "paper" and "specially prepared": 1 + 2² = 5 (see the sketch below)
• Compute the overlap also for other relations
  • glosses of hypernyms and hyponyms
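A small sketch of the phrase-overlap score (the helper name and the greedy matching strategy are assumptions of this write-up): the longest shared phrases are matched first, and each maximal n-word match adds n².

def overlap_score(gloss1, gloss2):
    a, b = gloss1.lower().split(), gloss2.lower().split()
    score, used = 0, set()
    for n in range(len(a), 0, -1):                 # longest phrases first
        for i in range(len(a) - n + 1):
            if any(j in used for j in range(i, i + n)):
                continue                           # already inside a longer match
            phrase = a[i:i + n]
            if any(phrase == b[k:k + n] for k in range(len(b) - n + 1)):
                score += n ** 2
                used.update(range(i, i + n))
    return score

print(overlap_score(
    "paper that is specially prepared for use in drafting",
    "the art of transferring designs from specially prepared paper "
    "to a wood or glass or metal surface"))        # "specially prepared" + "paper" = 4 + 1 = 5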

Page 43: Lecture: Word Sense Disambiguation

Summary: thesaurus-based similarity

Page 44: Lecture: Word Sense Disambiguation

Libraries  for  compu$ng  thesaurus-­‐based  similarity  

• NLTK
  • http://nltk.github.com/api/nltk.corpus.reader.html?highlight=similarity - nltk.corpus.reader.WordNetCorpusReader.res_similarity

• WordNet::Similarity
  • http://wn-similarity.sourceforge.net/
  • Web-based interface:
    • http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi


Page 45: Lecture: Word Sense Disambiguation

Machine Learning based approach

Page 46: Lecture: Word Sense Disambiguation

Basic  idea  

• If we have data that has been hand-labelled with correct word senses, we can use a supervised learning approach and learn from it!
  • We need to extract features and train a classifier
  • The output of training is an automatic system capable of assigning sense labels to unlabelled words in context.


Page 47: Lecture: Word Sense Disambiguation

Two  variants  of  WSD  task  •  Lexical  Sample  task    

• (we need labelled corpora for individual senses)
  • A small pre-selected set of target words (e.g. difficulty)
  • And an inventory of senses for each word
  • Supervised machine learning: train a classifier for each word

• All-words task
  • (each word in each sentence is labelled with a sense)
  • Every word in an entire text
  • A lexicon with senses for each word

SENSEVAL 1-2-3

Page 48: Lecture: Word Sense Disambiguation

Supervised  Machine  Learning  Approaches  

•  Summary  of  what  we  need:  •  the  tag  set  (“sense  inventory”)  •  the  training  corpus  •  A  set  of  features  extracted  from  the  training  corpus  •  A  classifier  

Page 49: Lecture: Word Sense Disambiguation

Supervised  WSD  1:  WSD  Tags  

• What's a tag? A dictionary sense?

•  For  example,  for  WordNet  an  instance  of  “bass”  in  a  text  has  8  possible  tags  or  labels  (bass1  through  bass8).  

Page 50: Lecture: Word Sense Disambiguation

8 senses of "bass" in WordNet
1. bass - (the lowest part of the musical range)
2. bass, bass part - (the lowest part in polyphonic music)
3. bass, basso - (an adult male singer with the lowest voice)
4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
5. freshwater bass, bass - (any of various North American lean-fleshed freshwater fishes especially of the genus Micropterus)
6. bass, bass voice, basso - (the lowest adult male singing voice)
7. bass - (the member with the lowest range of a family of musical instruments)
8. bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)

Page 51: Lecture: Word Sense Disambiguation

SemCor  

<wf pos=PRP>He</wf>
<wf pos=VB lemma=recognize wnsn=4 lexsn=2:31:00::>recognized</wf>
<wf pos=DT>the</wf>
<wf pos=NN lemma=gesture wnsn=1 lexsn=1:04:00::>gesture</wf>
<punc>.</punc>


SemCor: 234,000 words from Brown Corpus, manually tagged with WordNet senses
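SemCor also ships with NLTK, so the sense labels in the markup above can be read programmatically (a sketch, assuming the semcor data package is installed; the exact chunk structure is NLTK's, where sense-tagged chunks are small trees labelled with a WordNet lemma).

from nltk.corpus import semcor

first_sentence = semcor.tagged_sents(tag='sem')[0]
for chunk in first_sentence:
    if hasattr(chunk, 'label'):                 # sense-tagged chunks are Trees
        print(chunk.leaves(), '->', chunk.label())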

Page 52: Lecture: Word Sense Disambiguation

Supervised WSD: Extract feature vectors
Intuition from Warren Weaver (1955):

"If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words… But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word… The practical question is: "What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?""

the    window  

Page 53: Lecture: Word Sense Disambiguation

Feature  vectors  

•  Vectors  of  sets  of  feature/value  pairs  

Page 54: Lecture: Word Sense Disambiguation

Two  kinds  of  features  in  the  vectors  

• Collocational features and bag-of-words features

• Collocational/Paradigmatic
  • Features about words at specific positions near the target word
  • Often limited to just word identity and POS

• Bag-of-words
  • Features about words that occur anywhere in the window (regardless of position)
  • Typically limited to frequency counts

 

Generally speaking, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. But here the meaning is not exactly this…  

Page 55: Lecture: Word Sense Disambiguation

Examples  

•  Example  text  (WSJ):  

An  electric  guitar  and  bass  player  stand  off  to  one  side  not  really  part  of  the  scene  

• Assume a window of +/- 2 from the target

Page 56: Lecture: Word Sense Disambiguation

Examples  

•  Example  text  (WSJ)  

An  electric  guitar  and  bass  player  stand  off  to  one  side  not  really  part  of  the  scene,    

• Assume a window of +/- 2 from the target

Page 57: Lecture: Word Sense Disambiguation

Collocational features

• Position-specific information about the words and collocations in the window

• guitar and bass player stand

• word 1-, 2- and 3-grams in a window of ±3 are common

[…] manually tagged with WordNet senses (Miller et al. 1993, Landes et al. 1998). In addition, sense-tagged corpora have been built for the SENSEVAL all-word tasks. The SENSEVAL-3 English all-words test data consisted of 2081 tagged content word tokens, from 5,000 total running words of English from the WSJ and Brown corpora (Palmer et al., 2001).

The first step in supervised training is to extract features that are predictive of word senses. The insight that underlies all modern algorithms for word sense disambiguation was famously first articulated by Weaver (1955) in the context of machine translation:

  If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words. [...] But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word. [...] The practical question is: "What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?"

We first perform some processing on the sentence containing the window, typically including part-of-speech tagging, lemmatization, and, in some cases, syntactic parsing to reveal headwords and dependency relations. Context features relevant to the target word can then be extracted from this enriched input. A feature vector consisting of numeric or nominal values encodes this linguistic information as an input to most machine learning algorithms.

Two classes of features are generally extracted from these neighboring contexts, both of which we have seen previously in part-of-speech tagging: collocational features and bag-of-words features. A collocation is a word or series of words in a position-specific relationship to a target word (i.e., exactly one word to the right, or the two words starting 3 words to the left, and so on). Thus, collocational features encode information about specific positions located to the left or right of the target word. Typical features extracted for these context words include the word itself, the root form of the word, and the word's part-of-speech. Such features are effective at encoding local lexical and grammatical information that can often accurately isolate a given sense.

For example consider the ambiguous word bass in the following WSJ sentence:

(16.17) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.

A collocational feature vector, extracted from a window of two words to the right and left of the target word, made up of the words themselves, their respective parts-of-speech, and pairs of words, that is,

  [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}, w_{i-2}^{i-1}, w_{i}^{i+1}]   (16.18)

would yield the following vector:
  [guitar, NN, and, CC, player, NN, stand, VB, and guitar, player stand]

High performing systems generally use POS tags and word collocations of length 1, 2, and 3 from a window of words 3 to the left and 3 to the right (Zhong and Ng, 2010).

The second type of feature consists of bag-of-words information about neighboring words. A bag-of-words means an unordered set of words, with their exact […]


Page 58: Lecture: Word Sense Disambiguation

Bag-­‐of-­‐words  features  

• "an unordered set of words" – position ignored
• Choose a vocabulary: a useful subset of words in a training corpus

• Either: the count of how often each of those terms occurs in a given window, OR just a binary "indicator" 1 or 0

 

Page 59: Lecture: Word Sense Disambiguation

Co-­‐Occurrence  Example  

• Assume we've settled on a possible vocabulary of 12 words in "bass" sentences:

[fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]

• The vector for: guitar and bass player stand is [0,0,0,1,0,0,0,0,0,0,1,0]
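Putting the last few slides together, here is a sketch of both feature types for the window of +/- 2 around "bass" (the POS tags are hard-coded to mirror the textbook example rather than produced by a tagger):

sentence = ("An electric guitar and bass player stand off to one side "
            "not really part of the scene").split()
pos = ["DT", "JJ", "NN", "CC", "NN", "NN", "VB", "RP", "TO",
       "CD", "NN", "RB", "RB", "NN", "IN", "DT", "NN"]
t = sentence.index("bass")

# Collocational features: words and POS at fixed offsets, plus the two bigrams
colloc = [sentence[t-2], pos[t-2], sentence[t-1], pos[t-1],
          sentence[t+1], pos[t+1], sentence[t+2], pos[t+2],
          " ".join(sentence[t-2:t]), " ".join(sentence[t+1:t+3])]

# Bag-of-words features over the 12-word vocabulary from this slide
vocab = ["fishing", "big", "sound", "player", "fly", "rod", "pound",
         "double", "runs", "playing", "guitar", "band"]
window = set(w.lower() for w in sentence[t-2:t] + sentence[t+1:t+3])
bow = [1 if w in window else 0 for w in vocab]

print(colloc)   # ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB', 'guitar and', 'player stand']
print(bow)      # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]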

 

Page 60: Lecture: Word Sense Disambiguation

Word Sense Disambiguation

Classification

Page 61: Lecture: Word Sense Disambiguation

Classification
• Input:
  • a word w and some features f
  • a fixed set of classes C = {c1, c2, …, cJ}

• Output: a predicted class c ∈ C

Any kind of classifier:
• Naive Bayes
• Logistic regression
• Neural networks
• Support vector machines
• k-Nearest Neighbors
• etc.
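An end-to-end toy sketch, assuming scikit-learn is available: a handful of hand-labelled "bass" contexts are turned into bag-of-words vectors and fed to a Naive Bayes classifier (the tiny training set and the two-label sense inventory are invented for illustration).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["guitar and bass player stand",
               "play bass guitar in the band",
               "caught a huge bass while fishing",
               "sea bass with lemon and butter"]
train_labels = ["music", "music", "fish", "fish"]          # toy sense inventory

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)                  # bag-of-words features
classifier = MultinomialNB().fit(X, train_labels)

test = vectorizer.transform(["the bass line of the song"])
print(classifier.predict(test))                            # ['music']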

 

Page 62: Lecture: Word Sense Disambiguation

The  end    
