Language Data Resources: Word Senses, Sense-Tagged Corpora

• Lev V. Ščerba: And indeed, every sufficiently complex word must actually become the subject of a scientific monograph; therefore it is hard to expect in the near future the completion of a good dictionary.

Word sense disambiguation

• The different meanings of polysemous words are known as “senses”, and the process of deciding which one is being used in a particular context is known as “word sense disambiguation” (WSD)

Lexical Acquisition Bottleneck

• In NLP, many systems do not perform in practice as well as they could with adequate dictionary resources, because of the cost of producing, adapting, and maintaining these resources

• Solutions
– Reusing existing dictionaries and ontologies as lexicons
– Deriving disambiguation information directly from corpora

Usefulness of WSD

• NLP tools:
– Systems – carry out some task of “interest for its own sake” (e.g. MT, IR); applications potentially interesting for non-linguists
– Components – interesting for linguists and language engineers; e.g. WSD

Early approaches

• Preference semantics – 1970’s
– Selectional constraints (e.g. ANIMATE for subject of “to drink”)

• Word experts – 1980’s
– Hand-crafted disambiguators constructed for each word separately
– Limited applicability

• Polaroid words
– Gradual disambiguation (grammar, parser, lexicon, semantic interpreter, knowledge representation language)
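
The selectional constraints mentioned under preference semantics can be illustrated with a small sketch. It is a toy example, not a reconstruction of any historical system: the feature names, the `LEXICON` and `CONSTRAINTS` tables, and the function name are all assumptions invented here.

```python
# Toy lexicon: each sense of a noun carries a set of semantic features.
# All entries are hypothetical, invented for illustration.
LEXICON = {
    "pilot": {
        "pilot/person": {"ANIMATE", "HUMAN"},
        "pilot/flame": {"INANIMATE"},
    },
    "car": {
        "car/vehicle": {"INANIMATE"},
    },
}

# Selectional constraints: the feature a verb requires of its subject.
CONSTRAINTS = {"drink": "ANIMATE"}


def senses_allowed_as_subject(noun, verb):
    """Return the senses of `noun` compatible with the subject constraint of `verb`."""
    required = CONSTRAINTS.get(verb)
    senses = LEXICON.get(noun, {})
    if required is None:
        # No constraint recorded for this verb: every sense stays possible.
        return sorted(senses)
    return sorted(s for s, feats in senses.items() if required in feats)


print(senses_allowed_as_subject("pilot", "drink"))
```

Here the constraint filters out the inanimate sense of “pilot”; a noun with no ANIMATE sense yields an empty list, which signals a constraint violation rather than a disambiguation.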

Dictionary Based Approaches

• Since the 1980’s – dictionary publishers started to produce “Machine Readable Dictionaries” (now: machine-tractable dictionaries)

• Wider polysemy than in the systems described so far

Two claims about sense distribution

• One sense per discourse
– There is a very strong tendency for multiple uses of a word to share the same sense in a well-written discourse

• One sense per collocation
– With a high probability an ambiguous word has only one sense in a given collocation
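
One way the “one sense per discourse” claim could be exploited is as a post-processing step over tentative sense labels. The sketch below is a minimal illustration under that assumption; the function name and the sense labels are hypothetical.

```python
from collections import Counter


def apply_one_sense_per_discourse(labels):
    """Relabel every occurrence of a word in one discourse with the majority sense.

    `labels` holds tentative sense labels for successive occurrences of the
    same word; None marks an occurrence the tagger could not decide.
    """
    decided = [s for s in labels if s is not None]
    if not decided:
        # Nothing to propagate: leave the occurrences undecided.
        return list(labels)
    majority, _ = Counter(decided).most_common(1)[0]
    return [majority] * len(labels)


print(apply_one_sense_per_discourse(
    ["plant/factory", None, "plant/factory", "plant/flora"]
))
```

The minority label and the undecided occurrence are both overwritten by the majority sense, which is exactly where the heuristic helps and where it can also hurt if the discourse really does mix senses.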

Taxonomy of WSD Algorithms

• Knowledge based
• Corpus based
– Tagged corpora
– Untagged corpora
• Hybrid approaches

Word Senses and Lexicons

Sense tagging = attaching senses from some lexicon to words in text

Sense-enumerative dictionary
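
Sense tagging against a sense-enumerative dictionary can be pictured as the following data flow. This is only a sketch: the dictionary entries, the sense identifiers, and the `choose` hook are invented for illustration, and the default choice (first listed sense) is a placeholder, not a disambiguation method.

```python
# A sense-enumerative dictionary lists the senses of each headword.
# Entries and sense identifiers are hypothetical.
DICTIONARY = {
    "bank": [
        ("bank#1", "financial institution"),
        ("bank#2", "sloping land beside a river"),
    ],
    "interest": [
        ("interest#1", "curiosity or concern"),
        ("interest#2", "charge for borrowed money"),
    ],
}


def sense_tag(tokens, choose=lambda word, senses: senses[0][0]):
    """Attach a sense identifier to each token found in the dictionary.

    `choose` stands in for the actual disambiguation step; tokens that are
    not dictionary headwords are tagged with None.
    """
    tagged = []
    for tok in tokens:
        senses = DICTIONARY.get(tok.lower())
        tagged.append((tok, choose(tok, senses) if senses else None))
    return tagged


print(sense_tag(["The", "bank", "charges", "interest"]))
```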

Deficiencies of dictionaries

• Omissions and oversights
• Coverage of names
• Ghost words – e.g. “dord” = density (a misreading of “D or d”)
• Differentiating senses (P. Hanks: A serious problem for computer applications is that dictionaries compiled for human users focus on giving lists of meanings for each entry, without saying much about how one meaning may be distinguished from another in text)

Two levels of sense distinction

• Homography
– Two senses of a word are homographic when there is no obvious semantic relation between them (e.g. a ball – a dance or a rounded object)
– Risk of amateur etymology

• Polysemy

Distinguishing senses

• P. Hanks: No generally agreed criteria exist for what counts as a sense, or for how to distinguish one sense from another

• Zeugma: Arthur and his driving license expired last Thursday.

• Polysemy vs. vagueness (e.g. mountain)

The Bank Model

• Assumption A – Words have a finite set of clearly distinct, well-defined senses

• Assumption B – Native speakers of … know instantly and effortlessly which meaning applies in a given situation

• Criticism of the bank model: Kilgarriff (“I don’t believe in word senses”), Pustejovsky (Generative Lexicon), and many others…

NLP Lexicons

• Longman Dictionary of Contemporary English (LDOCE) – three-level embedded structure for sense distinctions (homographs, senses, optional subsenses)
• Roget’s Thesaurus
• Cambridge International Dictionary of English
• COBUILD English Language Dictionary
• WordNet

Thesaurus

Ontology

Ontology

• There is little agreement on what an ontology is… In general, an ontology can be described as an inventory of the objects, processes, etc. in a domain, as well as a specification of (some of) the relations that hold among them.

• Aristotle: genus (the category to which something belongs) and differentiae (the properties that uniquely distinguish category members from their parent and from one another)

• Nodes (concepts) in the hierarchy related by subsumption
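
Subsumption between nodes of a concept hierarchy can be sketched as a walk up parent links. The hierarchy below is a made-up fragment with single inheritance, and `PARENT` and `subsumes` are names chosen for this example only; real ontologies often allow multiple parents.

```python
# Hypothetical is-a hierarchy: child concept -> parent concept.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "animal": "entity",
}


def subsumes(general, specific):
    """Return True if `general` is `specific` itself or one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)  # None once the root has been passed
    return False


print(subsumes("animal", "dog"), subsumes("dog", "animal"))
```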

Ontologies in different traditions

• Philosophical
• Cognitive
• Artificial intelligence
• Lexical semantics
• Lexicography
• Information science

Princeton WordNet

• Lexical semantic network structured around the notion of synsets
• Synset – a group of literals of the same part of speech that are mutually interchangeable in a certain context (“set of synonyms”)
• http://www.cogsci.princeton.edu/~wn/w3wn.html
• Inspired by psycholinguistic theories of human lexical memory
• Broad coverage, rich lexical information, freely available
• Too fine-grained for practical NLP tasks
• Relations between two synsets: hyponymy, hyperonymy, meronymy …
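
The synset idea can be made concrete with a small data structure. This is a toy model, not the WordNet database format or any library API: the `Synset` class, the identifiers, and the hypernym link are assumptions invented for the example; only the literals of the first synset echo an example used elsewhere in these slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Synset:
    ident: str                      # hypothetical identifier
    pos: str                        # part of speech shared by all literals
    literals: List[str]             # mutually interchangeable words
    hypernym: Optional[str] = None  # ident of a more general synset, if any


SYNSETS = {s.ident: s for s in [
    Synset("communicate.v", "v", ["communicate"]),
    Synset("remark.v", "v", ["note", "observe", "make a remark", "remark"],
           hypernym="communicate.v"),
    Synset("watch.v", "v", ["observe", "watch"]),
]}


def synsets_for(literal, pos=None):
    """Return identifiers of all synsets containing `literal` (optionally of one POS)."""
    return [s.ident for s in SYNSETS.values()
            if literal in s.literals and (pos is None or s.pos == pos)]


print(synsets_for("observe"))
```

A literal that belongs to more than one synset, like “observe” here, is exactly a polysemous word in this model: each synset it appears in is one of its senses.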

EuroWordNet (i)

• Multilingual database containing several monolingual wordnets structured along the same lines as the Princeton WordNet 1.5

• English, Dutch, German, Spanish, French, Italian, Czech, Estonian

• Inter-Lingual-Index
• http://www.hum.uva.nl/~ewn

EuroWordNet (ii)

[Figure: a Princeton WordNet 1.5 synset linked to its equivalents in EuroWordNet]

• English (Princeton WordNet 1.5): note, observe, make a remark, remark
• Czech: prohodit, poznamenat, připomenout
• German: anmerken, bemerken
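
The role of the Inter-Lingual-Index can be sketched as a shared key that connects synsets across the monolingual wordnets. The record identifier `ili-0001`, the `ILI` table layout, and the `equivalents` function are assumptions made for this illustration; the literals are the ones from the example above.

```python
# Hypothetical layout: language -> ILI record id -> literals of the linked synset.
ILI = {
    "en": {"ili-0001": ["note", "observe", "make a remark", "remark"]},
    "cs": {"ili-0001": ["prohodit", "poznamenat", "připomenout"]},
    "de": {"ili-0001": ["anmerken", "bemerken"]},
}


def equivalents(word, source, target):
    """Literals in `target` whose synsets share an ILI record with `word` in `source`."""
    result = []
    for record, literals in ILI[source].items():
        if word in literals:
            result.extend(ILI[target].get(record, []))
    return result


print(equivalents("remark", "en", "de"))
```

Because every wordnet links only to the index, adding a language means adding one table rather than one mapping per language pair.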

Sense tagged corpora

• “interest” corpus
– 2k sentences containing the word “interest”

• SENSEVAL
– http://www.senseval.org
– WSD evaluation exercise, first run in 1998

• SEMCOR
– http://multisemcor.itc.it/semcor.php
– Subset of the English Brown corpus, 700k words
– More than 200k words sense-tagged according to Princeton WordNet 1.6

Final remarks

• Similarity of POS- and sense tagging
• Mapping lexical resources