Overview
Princeton WordNet (1980 - ongoing)
EuroWordNet (1996 - 1999)
The database design
The general building strategy
Towards a universal index of meaning
Global WordNet Association (2001 - ongoing)
Other wordnets
BalkaNet (2001 - 2004)
IndoWordnet (2002 - ongoing)
Meaning (2002 - 2005)
WordNet1.5
• Developed at Princeton by George Miller and his team as a model of the mental lexicon.
• Semantic network in which concepts are defined in terms of relations to other concepts.
• Structure: organized around the notion of synsets (sets of synonymous words) and basic semantic relations between these synsets.
• Initially no glosses.
• Main revision after tagging the Brown corpus with word meanings: SemCor.
http://www.cogsci.princeton.edu/~wn/w3wn.html
Structure of WordNet1.5
{vehicle}
{conveyance; transport}
{car; auto; automobile; machine; motorcar}
{cruiser; squad car; patrol car; police car; prowl car} {cab; taxi; hack; taxicab; }
{motor vehicle; automotive vehicle}
{bumper}
{car door}
{car window}
{car mirror}
{hinge; flexible joint}
{doorlock}
{armrest}
[Figure: the synsets above are linked by hyperonym and meronym relations.]
EuroWordNet
• The development of a multilingual database with wordnets for several European languages
• Funded by the European Commission, DG XIII, Luxembourg as projects LE2-4003 and LE4-8328
• March 1996 - September 1999
• 2.5 Million EURO.
• URL: http://www.hum.uva.nl/~ewn
Objectives of EuroWordNet
Languages covered:
• EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian
• EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian.
Size of vocabulary:
• EuroWordNet-1: 30,000 concepts - 50,000 word meanings.
• EuroWordNet-2: 15,000 concepts - 25,000 word meanings.
Type of vocabulary:
• the most frequent words of the languages
• all concepts needed to relate more specific concepts
Consortium
Organization                                   Country   Task
University of Amsterdam                        NL        Project Coordinator & Build the Dutch wordnet
Istituto Di Linguistica Computazionale Pisa    IT        Build the Italian wordnet
Fundacion Universidad Empresa                  ES        Build the Spanish wordnet
Université d’Avignon and Memodata at Avignon   FR        Build the French wordnet
Universität Tübingen                           DE        Build the German wordnet
University of Masaryk at Brno                  CZ        Build the Czech wordnet
University of Tartu, Estonia                   EE        Build the Estonian wordnet
University of Sheffield                        GB        Adapt the English wordnet
Novell Belgium NV                              BE        User; Build the common database
Xerox Research Centre, Meylan                  FR        User
Bertin & Cie, Plaisir, Paris                   FR        User
The basic principles of EuroWordNet
• the structure of the Princeton WordNet
• the design of the EuroWordNet database
• wordnets as language-specific structures
• the language-internal relations
• the multilingual relations
Specific features of EuroWordNet
it contains semantic lexicons for other languages than English.
each wordnet reflects the relations as a language-internal
system, maintaining cultural and linguistic differences in the
wordnets.
it contains multilingual relations from each wordnet to English
meanings, which makes it possible to compare the wordnets,
tracking down inconsistencies and cross-linguistic differences.
each wordnet is linked to a language independent top-
ontology and to domain labels.
Autonomous & Language-Specific
voorwerp{object}
lepel{spoon}
werktuig{tool}
tas{bag}
bak{box}
blok{block}
lichaam{body}
Wordnet1.5 Dutch Wordnet
bag spoon box
object
natural object (an object occurring naturally)
artifact, artefact (a man-made object)
instrumentality block body
container device implement
tool instrument
Differences in structure
•Artificial Classes versus Lexicalized Classes: instrumentality; natural object
•Lexicalization differences of classes: container and artifact (object) are not lexicalized in
Dutch
•What is the purpose of different hierarchies?
•Should we include all lexicalized classes from all (8) languages?
Conceptual ontology: A particular level or structuring may be required to achieve a better control or performance, or a more compact and coherent structure.
• introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool), • neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise).
What properties can we infer for spoons?
spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking
Linguistic versus Conceptual Ontologies
Linguistic ontology: Exactly reflects the relations between all the lexicalized words and expressions in a language. It therefore captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language.
What words can be used to name spoons?
spoon -> object, tableware, silverware, merchandise, cutlery
Separate Wordnets and Ontologies
[Figure: separate wordnets and ontologies]
Language-Neutral Ontology
• Reference Ontology Classes: BOX: ContainerProduct; SolidTangibleThing
• EuroWordNet Top-Ontology: Form: Cubic; Function: Contain; Origin: Artifact; Composition: Whole
Language-Specific Wordnets
• WordNet1.5: object, container, box
• Dutch Wordnet: voorwerp, doos
Wordnets versus ontologies
Wordnets: autonomous language-specific lexicalization patterns in a relational network. Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense-disambiguation.
Ontologies: data structure with formally defined concepts. Usage: making semantic inferences.
Classical Substitution Principle:
Any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms:
horse -> stallion, mare, pony, mammal, animal, being.
It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms:
horse -X-> cat, dog, camel, fish, plant, person, object.
Conceptual Distance Measurement:
Number of hierarchical nodes between words is a measurement of closeness, where the level and the local density of nodes are additional factors.
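The substitution principle and the distance measure can be sketched on a toy hierarchy. This is a minimal illustration, not the EuroWordNet implementation: the `HYPERONYM` table and all function names are assumptions, synonyms are left out because the toy data has one word per node, and the distance counts hierarchy links only, ignoring the level and local density factors mentioned above.

```python
# Toy hyperonym hierarchy (child -> parent); the entries are illustrative only.
HYPERONYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being",
}

def ancestors(word):
    """Return the chain of hyperonyms from the word up to the root."""
    chain = []
    while word in HYPERONYM:
        word = HYPERONYM[word]
        chain.append(word)
    return chain

def hyponyms(word):
    """Return all direct and indirect hyponyms of the word."""
    found = set()
    for child, parent in HYPERONYM.items():
        if parent == word:
            found.add(child)
            found |= hyponyms(child)
    return found

def substitutable(word, other):
    """Classical substitution: other is a hyperonym or a hyponym of word."""
    return other in ancestors(word) or other in hyponyms(word)

def distance(a, b):
    """Number of hierarchy links on the path between a and b via their
    lowest common hyperonym, or None if they share no hyperonym."""
    path_a = [a] + ancestors(a)
    path_b = [b] + ancestors(b)
    common = [node for node in path_a if node in path_b]
    if not common:
        return None
    lowest = common[0]
    return path_a.index(lowest) + path_b.index(lowest)
```

With this data, `substitutable("horse", "mammal")` holds while `substitutable("horse", "cat")` does not, and `distance("horse", "cat")` is smaller than `distance("stallion", "fish")`.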
Wordnets as Linguistic Ontologies
Linguistic Principles for deriving relations
1. Substitution tests (Cruse 1986):
1 a. It is a fiddle therefore it is a violin.
  b. It is a violin therefore it is a fiddle.
2 a. It is a dog therefore it is an animal.
  b. *It is an animal therefore it is a dog.
3 a. to kill (/a murder) causes to die (/death)
     to kill (/a murder) has to die (/death) as a consequence
  b. *to die / death causes to kill
     *to die / death has to kill as a consequence
Linguistic Principles for deriving relations
2. Principle of Economy (Dik 1978):
If a word W1 (animal) is the hyperonym of W2 (mammal) and W2
is the hyperonym of W3 (dog) then W3 (dog) should not be linked to W1 (animal) but to W2 (mammal).
3. Principle of Compatibility
If a word W1 is related to W2 via relation R1, W1 and W2 cannot be related via relation Rn, where Rn is defined as a distinct relation from R1.
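The Principle of Economy can be checked mechanically. The sketch below is an assumption about how such a check might look, not part of the EuroWordNet tools: it takes a hypothetical mapping from each word to its directly encoded hyperonyms and reports links that are already implied by a longer chain.

```python
def redundant_links(hyperonyms):
    """Return (word, hyperonym) links that are also reachable indirectly.

    hyperonyms: dict mapping a word to the list of its direct hyperonyms.
    """
    def reachable(start):
        # All hyperonyms reachable from start by following one or more links.
        seen, stack = set(), list(hyperonyms.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(hyperonyms.get(node, ()))
        return seen

    redundant = []
    for word, parents in hyperonyms.items():
        for parent in parents:
            # Redundant if another direct hyperonym already leads to this one.
            others = [p for p in parents if p != parent]
            if any(parent in reachable(other) for other in others):
                redundant.append((word, parent))
    return sorted(redundant)

# dog is linked both to mammal and, against the principle, directly to animal.
links = {"dog": ["mammal", "animal"], "mammal": ["animal"]}
violations = redundant_links(links)
```

Here `violations` contains only the link from dog to animal, which is the one the principle says should be dropped.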
Architecture of the EuroWordNet Data Base
[Figure: architecture of the EuroWordNet database]
Link types: I = Language Independent link; II = Link from Language Specific to Inter-Lingual-Index; III = Language Dependent Link
Inter-Lingual-Index: ILI-record {drive}
Ontology: 2OrderEntity; Location; Dynamic
Domains: Traffic; Air; Road
Lexical Items Table (Dutch): bewegen, gaan, rijden, berijden
Lexical Items Table (Italian): guidare, cavalcare, andare, muoversi
Lexical Items Table (English): drive, ride, move, go
Lexical Items Table (Spanish): cabalgar, jinetear, conducir, mover, transitar
The mono-lingual design of EuroWordNet
Language Internal Relations
WN 1.5 starting point
The ‘synset’ as a weak notion of synonymy: “two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value.” (Miller et al. 1993)
Relations between synsets:
Relation     POS-combination          Example
ANTONYMY     adjective-to-adjective
             verb-to-verb             open/close
HYPONYMY     noun-to-noun             car/vehicle
             verb-to-verb             walk/move
MERONYMY     noun-to-noun             head/nose
ENTAILMENT   verb-to-verb             buy/pay
CAUSE        verb-to-verb             kill/die
Differences EuroWordNet/WordNet1.5
• Added Features to relations
• Cross-Part-Of-Speech relations
• New relations to differentiate shallow hierarchies
• New interpretations of relations
EWN Relationship Labels
Disjunction/Conjunction of multiple relations of the same type
WordNet1.5
door 1 -- (a swinging or sliding barrier that will close the entrance to a room or building; "he knocked on the door"; "he slammed the door as he left")
  PART OF: doorway, door, entree, entry, portal, room access
door 6 -- (a swinging or sliding barrier that will close off access into a car; "she forgot to lock the doors of her car")
  PART OF: car, auto, automobile, machine, motorcar.
EWN Relationship Labels
{airplane}
  HAS_MERO_PART: conj1 {door}
  HAS_MERO_PART: conj2 disj1 {jet engine}
  HAS_MERO_PART: conj2 disj2 {propeller}
{door}
  HAS_HOLO_PART: disj1 {car}
  HAS_HOLO_PART: disj2 {room}
  HAS_HOLO_PART: disj3 {entrance}
{dog}
  HAS_HYPERONYM: conj1 {mammal}
  HAS_HYPERONYM: conj2 {pet}
{albino}
  HAS_HYPERONYM: disj1 {plant}
  HAS_HYPERONYM: disj2 {animal}
Default Interpretation: non-exclusive disjunction
EWN Relationship Labels
Disjunction/Conjunction of multiple relations of the same type
{dog}
  HAS_HYPONYM: disj1 {poodle}
  HAS_HYPONYM: disj1 {labrador}
  HAS_HYPONYM: {sheep dog} (Orthogonal)
  HAS_HYPONYM: {watch dog} (Orthogonal)
Default Interpretation: non-exclusive disjunction
Factive/Non-factive CAUSES (Lyons 1977)
factive (default interpretation):
“to kill causes to die”: {kill} CAUSES {die}
non-factive: E1 probably or likely causes event E2 or E1 is intended to cause some event E2:
“to search may cause to find”: {search} CAUSES {find} non-factive
EWN Relationship Labels
Reversed
In the database every relation must have a reverse counter-part but there is a difference between relations which are explicitly coded as reverse and automatically reversed relations:
{finger} HAS_HOLONYM {hand}
{hand} HAS_MERONYM {finger}
{paper-clip} HAS_MER_MADE_OF {metal}
{metal} HAS_HOL_MADE_OF {paper-clip} reversed
Negation
{monkey} HAS_MERO_PART {tail}
{ape} HAS_MERO_PART {tail} not
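The difference between explicitly coded and automatically reversed relations can be illustrated with a small sketch. It is only an assumption about how a database might fill in missing counterparts: the `INVERSE` table covers just the relation names used in the examples above, and the tuple layout and function name are hypothetical.

```python
# Inverse relation names, limited to the relations used in the examples.
INVERSE = {
    "HAS_HOLONYM": "HAS_MERONYM",
    "HAS_MERONYM": "HAS_HOLONYM",
    "HAS_MER_MADE_OF": "HAS_HOL_MADE_OF",
    "HAS_HOL_MADE_OF": "HAS_MER_MADE_OF",
}

def add_reversed(relations):
    """Add a counterpart labelled 'reversed' for every relation lacking one.

    relations: list of (source, relation, target, labels) tuples.
    """
    existing = {(s, r, t) for s, r, t, _ in relations}
    result = list(relations)
    for source, rel, target, _labels in relations:
        counterpart = (target, INVERSE[rel], source)
        if counterpart not in existing:
            # Generated automatically, so it carries the 'reversed' label.
            result.append((*counterpart, ("reversed",)))
            existing.add(counterpart)
    return result

coded = [
    ("finger", "HAS_HOLONYM", "hand", ()),          # explicitly coded
    ("hand", "HAS_MERONYM", "finger", ()),          # explicitly coded reverse
    ("paper-clip", "HAS_MER_MADE_OF", "metal", ()), # no reverse coded
]
completed = add_reversed(coded)
```

Only the paper-clip relation gains a generated counterpart; the finger/hand pair is left alone because both directions were coded by hand.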
Cross-Part-Of-Speech relations
WordNet1.5: nouns and verbs are not interrelated by basic semantic relations such as hyponymy and synonymy:
adornment 2: change of state -- (the act of changing something)
adorn 1: change, alter -- (cause to change; make different)
EuroWordNet: words of different parts of speech can be inter-linked with explicit xpos-synonymy, xpos-antonymy and xpos-hyponymy relations:
{adorn V} XPOS_NEAR_SYNONYM {adornment N}
The advantages of such explicit cross-part-of-speech relations are:
• similar words with different parts of speech are grouped together.
• the same information can be coded in an NP or in a sentence. By unifying higher-order nouns and verbs in the same ontology it will be possible to match expressions with very different syntactic structures but comparable content.
• by merging verbs and abstract nouns we can more easily link mismatches across languages that involve a part-of-speech shift. Dutch nouns such as “afsluiting”, “gehuil” are translated with the English verbs “close” and “cry”, respectively.
Entailment in WordNet
WordNet1.5: Entailment indicates the direction of the implication or entailment:
a. + Temporal Inclusion (the two situations partially or totally overlap)
   a.1 co-extensiveness (e.g., to limp/to walk): hyponymy/troponymy
   a.2 proper inclusion (e.g., to snore/to sleep): entailment
b. - Temporal Exclusion (the two situations are temporally disjoint)
   b.1 backward presupposition (e.g., to succeed/to try): entailment
   b.2 cause (e.g., to give/to have)
Subevents in EuroWordNet
Direction of the entailment is expressed by the labels factive and reversed:
{to succeed} is_caused_by {to try} factive
{to try} causes {to succeed} non-factive
Proper inclusion is described by the has_subevent/is_subevent_of relation in combination with the label reversed:
{to snore} is_subevent_of {to sleep}
{to sleep} has_subevent {to snore} reversed
{to buy} has_subevent {to pay}
{to pay} is_subevent_of {to buy} reversed
The interpretation of the CAUSE relation
WordNet1.5: The causal relation only holds between verbs and it should only apply to temporally disjoint situations.
EuroWordNet: the causal relation will also be applied across different parts of speech:
{to kill} v causes {death} n
{death} n is_caused_by {to kill} v reversed
{to kill} v causes {dead} a
{dead} a is_caused_by {to kill} v reversed
{murder} n causes {death} n
{death} n is_caused_by {murder} n reversed
The interpretation of the CAUSE relation
Various temporal relationships between the (dynamic/non-dynamic) situations may hold:
• Temporally disjoint: there is no time point when dS1 takes place and also S2 (which is caused by dS1) (e.g. to shoot/to hit);
• Temporally overlapping: there is at least one time point when both dS1 and S2 take place, and there is at least one time point when dS1 takes place and S2 (which is caused by dS1) does not yet take place (e.g. to teach/to learn);
• Temporally co-extensive: whenever dS1 takes place also S2 (which is caused by dS1) takes place and there is no time point when dS1 takes place and S2 does not take place, and vice versa (e.g. to feed/to eat).
Role relations
In the case of many verbs and nouns the most salient relation is not the hyperonym but the relation between the event and the involved participants. These relations are expressed as follows:
{hammer} ROLE_INSTRUMENT {to hammer}
{to hammer} INVOLVED_INSTRUMENT {hammer} reversed
{school} ROLE_LOCATION {to teach}
{to teach} INVOLVED_LOCATION {school} reversed
These relations are typically used when other relations, mainly hyponymy, do not clarify the position of the concept in the network, but the word is still closely related to another word.
Co_Role relations
guitar player
  HAS_HYPERONYM player
  CO_AGENT_INSTRUMENT guitar
player
  HAS_HYPERONYM person
  ROLE_AGENT to play music
  CO_AGENT_INSTRUMENT musical instrument
to play music
  HAS_HYPERONYM to make
  ROLE_INSTRUMENT musical instrument
guitar
  HAS_HYPERONYM musical instrument
  CO_INSTRUMENT_AGENT guitar player
ice saw
  HAS_HYPERONYM saw
  CO_INSTRUMENT_PATIENT ice
saw
  HAS_HYPERONYM saw
  ROLE_INSTRUMENT to saw
ice
  CO_PATIENT_INSTRUMENT ice saw REVERSED
Co_Role relations
Examples of the other relations are:
criminal CO_AGENT_PATIENT victim
novel writer/poet CO_AGENT_RESULT novel/poem
dough CO_PATIENT_RESULT pastry/bread
photographic camera CO_INSTRUMENT_RESULT photo
BE_IN_STATE and STATE_OF
Example: the poor are the ones to whom the state poor applies
Effect:
poor N HAS_HYPERONYM person N
poor N BE_IN_STATE poor A
poor A STATE_OF poor N reversed
IN_MANNER and MANNER_OF
Example: to slurp is to eat in a noisy manner
Effect:
slurp V HAS_HYPERONYM eat V
slurp V IN_MANNER noisily Adverb
noisily Adverb MANNER_OF slurp V reversed
Overview of the Language Internal relations in EuroWordnet
Same Part of Speech relations:
NEAR_SYNONYMY              apparatus - machine
HYPERONYMY/HYPONYMY        car - vehicle
ANTONYMY                   open - close
HOLONYMY/MERONYMY          head - nose
Cross-Part-of-Speech relations:
XPOS_NEAR_SYNONYMY         dead - death; to adorn - adornment
XPOS_HYPERONYMY/HYPONYMY   to love - emotion
XPOS_ANTONYMY              to live - dead
CAUSE                      die - death
SUBEVENT                   buy - pay; sleep - snore
ROLE/INVOLVED              write - pencil; hammer - hammer
STATE                      the poor - poor
MANNER                     to slurp - noisily
BELONG_TO_CLASS            Rome - city
Thematic networks
[Figure: thematic network around Dutch medical vocabulary]
Nodes:
behandelen (treat)
zieke (sick person, patient)
genezen (to get well)
arts (doctor)
scalpel
opereren (operate)
persoon (person)
wezen (being)
organisme (organism)
orgaan (organ)
maag (stomach)
maagaandoening (stomach disease)
ziekte (disease)
Relation labels on the links: Agent, Patient, Causes, Involves, Instrument, Part of
The multi-lingual design of EuroWordNet
Inter-Lingual-Index: unstructured fund of concepts to
provide an efficient mapping across the languages;
Index-records are mainly based on WordNet1.5 synsets
and consist of synonyms, glosses and source references;
Various types of complex equivalence relations are
distinguished;
Equivalence relations from synsets to index records: not
on a word-to-word basis;
Indirect matching of synsets linked to the same index
items;
The Multilingual Design
EWN Interlingual Relations
• EQ_SYNONYM: there is a direct match between a synset and an ILI-record
• EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously,
• HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record.
• HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records.
• other relations:
CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE, EQ_IS_STATE_OF/EQ_BE_IN_STATE
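Indirect matching through the Inter-Lingual-Index can be sketched as a lookup over equivalence links. The link data below is a small assumed sample (the words come from the architecture slide, but the specific EQ_SYNONYM assignments are illustrative), and the function names are hypothetical.

```python
from collections import defaultdict

# (language, synset, equivalence relation, ILI-record); assumed sample links.
EQUIVALENCE_LINKS = [
    ("NL", "rijden", "EQ_SYNONYM", "drive"),
    ("IT", "guidare", "EQ_SYNONYM", "drive"),
    ("ES", "conducir", "EQ_SYNONYM", "drive"),
    ("NL", "gaan", "EQ_SYNONYM", "go"),
    ("IT", "andare", "EQ_SYNONYM", "go"),
]

def index_by_ili(links):
    """Group the language-specific synsets by the ILI-record they point to."""
    by_record = defaultdict(list)
    for language, synset, relation, record in links:
        by_record[record].append((language, synset, relation))
    return by_record

def matches(language, synset, links):
    """Synsets in other languages that are linked to the same ILI-record(s)."""
    by_record = index_by_ili(links)
    found = []
    for lang, syn, _relation, record in links:
        if (lang, syn) == (language, synset):
            for other_lang, other_syn, _ in by_record[record]:
                if other_lang != language:
                    found.append((other_lang, other_syn, record))
    return sorted(found)

dutch_rijden = matches("NL", "rijden", EQUIVALENCE_LINKS)
```

No wordnet is linked to another directly: the Dutch synset reaches the Italian and Spanish ones only because all three point to the same index record.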
Equivalent Near Synonym
1. Multiple Targets
One sense for Dutch schoonmaken (to clean) which simultaneously matches with at least 4 senses of clean in WordNet1.5:
• {make clean by removing dirt, filth, or unwanted substances from}
• {remove unwanted substances from, such as feathers or pits, as of chickens or fruit}
• {remove in making clean; "Clean the spots off the rug"}
• {remove unwanted substances from - (as in chemistry)}
The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean.
Equivalent Near Synonym
2. Multiple Source meanings
Synsets inter-linked by a near_synonym relation can be linked to the same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation:
Dutch wordnet: toestel near_synonym apparaat
ILI-records: {machine}; {device}; {apparatus}; {tool}
Equivalent Hyponymy
has_eq_hyperonym
Typically used for gaps in WordNet1.5 or in English:
• genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin,
• pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: Dutch hoofd only refers to human head and Dutch kop only refers to animal head, English uses head for both.
has_eq_hyponym
Used when WordNet1.5 only provides more narrow terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g.: Spanish dedo can be used to refer to both finger and toe.
{ toe : part of foot }
{ finger : part of hand }
{ dedo , dito : finger or toe }
{ head : part of body }
{ hoofd : human head } { kop : animal head }
toe finger
head
dito
dedo
hoofd kop
GB-Net
NL-Net
IT-Net
ES-Net
= normal equivalence
= eq_has_hyponym
= eq_has_hyperonym
Complex mappings across languages
The methodologies for building wordnets
Overall Building Process
[Figure: flow chart of the building process, with steps labelled Ia, Ib, Ic, II and III. Box labels:]
• Machine Readable Dictionaries, Wordnets, Taxonomies, Corpora; loaded in local databases
• Specification of selection criteria
• Subset of word meanings
• Encoding of language internal and equivalence relations
• Wordnet fragment with links to WordNet1.5 in local database
• Load wordnet in the EuroWordNet Database
• Wordnet fragment in EuroWordNet database
• Comparing and restructuring the wordnet
• Improve and extend the wordnet fragments
• Adjust coverage, improve encoding
• Verification by users
• Demonstration in Information Retrieval
• Verification Report
Main Methods
Expand approach: translate WordNet1.5 synsets to another language and take over the structure
• easier and more efficient method
• compatible structure with WordNet1.5
• structure is close to WordNet1.5 but also biased by it
Merge approach: create an independent wordnet in another language and align the separate hierarchies by generating the appropriate translations
• more complex and labour intensive
• different structure from WordNet1.5
• language specific patterns can be maintained
Methods for extracting language-internal relations
• editors and database for manually encoding relations;
• comparison with WordNet1.5 structure;
• definition patterns in monolingual dictionaries;
• co-occurrences in corpora;
• morphology;
• bilingual dictionaries;
• lexical semantic substitution tests
Methods for extracting equivalence relations
• extract monosemous translations of English synsets, e.g. a Spanish word has only 1 translation to an English word which has only one sense and vice versa;
• disambiguation of multiple ambivalent translations by measuring the conceptual distance between the senses of these translations in the WordNet1.5 hierarchy (Rigau and Aguirre, 95);
• disambiguation of ambivalent translations by measuring the conceptual distance directly in the WordNet1.5 hierarchy between alternative translations and the translations of the direct semantic context in the source wordnet;
• disambiguation of ambivalent translations by measuring the overlap in top-concepts inherited in the source wordnet and inherited for the different senses of translations in WordNet1.5;
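The first method, extracting monosemous translations, can be sketched as a filter over a bilingual dictionary. This is an illustrative reading of the criterion rather than the procedure actually used: the dictionary entries and sense identifiers are invented, and "vice versa" is approximated by requiring that no other source word in the same dictionary translates to the English word.

```python
def monosemous_links(bilingual, english_senses):
    """Link source words to English synsets when both sides are unambiguous.

    bilingual: source word -> list of English translations
    english_senses: English word -> list of synset identifiers
    """
    # Which source words translate to each English word (the reverse view).
    back = {}
    for word, translations in bilingual.items():
        for english in translations:
            back.setdefault(english, set()).add(word)

    links = {}
    for word, translations in bilingual.items():
        if len(translations) != 1:
            continue  # the source word has more than one translation
        english = translations[0]
        senses = english_senses.get(english, [])
        # Keep the link only if the English word has one sense and no other
        # source word shares this translation.
        if len(senses) == 1 and back[english] == {word}:
            links[word] = senses[0]
    return links

# Hypothetical Spanish-English entries and sense inventory.
bilingual = {
    "guitarra": ["guitar"],
    "banco": ["bank", "bench"],
    "dedo": ["finger", "toe"],
}
english_senses = {
    "guitar": ["guitar#1"],
    "bank": ["bank#1", "bank#2"],
    "bench": ["bench#1"],
    "finger": ["finger#1"],
    "toe": ["toe#1"],
}
links = monosemous_links(bilingual, english_senses)
```

Only the unambiguous pair survives; words such as dedo, which have several translations, are left for the distance-based disambiguation methods in the list.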
Aligning wordnets
muziekinstrument
orgel
hammond orgel
organ ? organ organ
hammond organ
musical instrument
instrument
artifact object natural object
object
Inheriting Semantic Features
hart 1 orgaan 1 (Living Part) deel 2 (Part) iets 1 LEAF
-----
heart 1 playing card 1 card 1 (Artifact Function Object) paper 6 (Artifact Solid) material 5 (Substance) matter 1 inanimate object 1 entity 1 LEAF
heart 2 disposition 2 (Dynamic Experience Mental) nature 1 trait 1 (Property) attribute 1 (Property) abstraction 1 LEAF
heart 3 bravery 1 spirit 1 character 1 trait 1 (Property) attribute 1 (Property) abstraction 1 LEAF
heart 4 internal organ 1 organ 4 (Living Part) body part 1 (Living Part) part 10 entity 1 LEAF
Reliability of Equivalence Relations
Spanish wordnet
Confidence (Variants)   Nouns   Verbs   Total
100% (Manual)            7819    8394   16213
>96%                      382       0     382
>94%                     2948       0    2948
>92%                     1364       0    1364
>85%                    23113       0   23113
>84%                     4156       0    4156
Total                   39782    8394   48176
Reliability of Equivalence Relations
Dutch wordnet
                 Nouns                                  Verbs
Matching Type    No of synsets   Perc.    Reliability   No of Synsets   Perc.    Reliability
manual/ok        4138            17,00%   100%          3383            37,07%   100%
1 match          4846            19,91%   86%           763             8,36%    78%
2 matches        3059            12,57%   68%           652             7,15%    71%
3-9 matches      5408            22,22%   65%           2471            27,08%   49%
10+ matches      1864            7,66%    54%           980             10,74%   23%
0 matches        5022            20,64%   n.a.          876             9,60%    n.a.
Total            24337                                  9125
Conflicting Starting points
1. There should be a maximum of flexibility:
• the wordnets should be able to reflect language-specific relations and patterns
• the wordnets should be built relatively independently because each site has different starting points:
  • different tools, database and resources (Machine Readable Dictionaries)
  • differences in the languages
2. The wordnets have to be compatible in terms of coverage and relations to be useful for multilingual information retrieval and translation tools and to be able to compare the wordnets.
Measures to achieve maximal compatibility
• The results are loaded into a common Multilingual Database (Polaris):
  • consistency checks and types of incompatibility
  • specific comparison options to measure consistency and overlap in coverage
• User-guides for building wordnets in each language:
  • the steps to encode the relations for a word meaning.
  • common tests and criteria for all the relations.
  • overview of problems and solutions.
• A set of common Base-Concepts which are shared by all the sites, having:
  • most relations and the most-important positions in the wordnets
  • most meanings and badly defined
• Classification of the common Base Concepts in terms of a Top-Ontology of 63 basic Semantic Distinctions
• Top-Down Approach, where first the Base Concepts and their direct context are (manually) encoded and next the wordnets are (semi-automatically) extended top-down to include more specific concepts that depend on these Base Concepts.
Top-Ontology and Base Concepts
Top-Ontology with 63 higher-level concepts
Existing Ontologies:
• WordNet1.5 top-levels
• Aktions-Art models (Vendler, Verkuyl)
• Acquilex and Sift ontologies (EC-projects)
• Qualia-structure (Pustejovsky)
• Upper-Model, MikroKosmos, Cyc, Ad Hoc ANSI-Committee on ontologies
The ontology was adapted to represent the variety of concepts in the set of Common Base Concepts, across the 4 languages:
• homogenous Base-Concept Clusters
• average size of Base Concept Cluster
• apply to both nouns and verbs
Set of 1024 common Base Concepts making up the core of the separate wordnets.
Base Concepts
Procedure:
• Each site determined the set of word meanings with most relations (up to 15% of all relations) and high positions in the hierarchy.
• This set was extended with all meanings used to define the first selection.
• The local selection was translated to WordNet1.5 equivalences: 4 lists of WordNet1.5 synsets (between 450 and 2000 synsets per selection).
• These sets of WordNet1.5 translations have been compared.
Concepts selected by all sites: 30 synsets (24 noun synsets, 6 verb synsets).
Explanations:
• The individual selections are not representative enough.
• There are major differences in the way meanings are classified, which have an effect on the frequency of the relations.
• The translations of the selection to WordNet1.5 synsets are not reliable.
• The resources cover very different vocabularies.
Concepts selected by at least two sites: intersections of pairs
         NOUNS                        VERBS
         NL     ES    IT    GB/WN     NL    ES    IT    GB/WN
NL       1027   103   182   333       323   36    42    86
ES       103    523   45    284       36    128   18    43
IT       182    45    334   167       42    18    104   39
GB/WN    333    284   167   1296      86    43    39    236
Total Set of shared Base Concepts: Union of intersection pairs
                    Nouns   Verbs   Total
1stOrderEntities    491             491
2ndOrderEntities    272     228     500
3rdOrderEntities    33              33
Total               796     228     1024
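The two selection criteria, concepts chosen by all sites versus concepts chosen by at least two sites, correspond to a plain intersection and a union of pairwise intersections. The sketch below uses made-up selections with placeholder concept names purely to show the set operations; it does not reproduce the project data.

```python
from itertools import combinations

def shared_base_concepts(selections):
    """Union of the pairwise intersections: concepts chosen by >= 2 sites."""
    shared = set()
    for (_, first), (_, second) in combinations(selections.items(), 2):
        shared |= first & second
    return shared

def selected_by_all(selections):
    """Concepts chosen by every site."""
    return set.intersection(*selections.values())

# Placeholder selections per site (not the real Base Concept lists).
selections = {
    "NL": {"a", "b", "c"},
    "ES": {"b", "c", "d"},
    "IT": {"c", "e"},
    "GB/WN": {"a", "e", "f"},
}
common = shared_base_concepts(selections)
unanimous = selected_by_all(selections)
```

In this toy data no concept is chosen by all four sites, yet four concepts are shared by at least two, which mirrors why the project moved from the strict intersection (30 synsets) to the union of intersection pairs (1024 synsets).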
Table 4: Number of Common BCs represented in the local wordnets
       Related to CBCs   Eq_synonym Relations   Eq_near_Synonym relations   CBCs Without Direct Equivalent
AMS    992               725                    269                         97
FUE    1012              1009                   0                           15
PSA    878               759                    191                         9
Table 5: BC4 Gaps in at least two wordnets (10 synsets)
body covering#1
mental object#1; cognitive content#1; content#2
body substance#1
natural object#1
social control#1
place of business#1; business establishment#1
change of magnitude#1
plant organ#1
contractile organ#1
Plant part#1
psychological feature#1
spatial property#1; spatiality#1
Table 6: Local senses with complex equivalence relations to CBCs
NL ES IT
Eq_has_hyperonym 61 40 4
eq_has_hyponym 34 14 20
Eq_has_holonym 2 0
Eq_has_meronym 3 2
Eq_involved 3
Eq_is_caused_by 3
Eq_is_state_of 1
Example of complex relation
CBC: cause to feel unwell#1, Verb
Closest Dutch concept: {onwel#1}, Adjective (sick)
Equivalence relation: eq_is_caused_by
Adaptation of Base Concepts in EuroWordNet-2
A similar selection of fundamental concepts has been made in EuroWordNet-2
The selected concepts have been compared among German, French, Czech and Estonian and with the EuroWordNet-1 selection
The EuroWordNet-1 set has been extended to 1310 Base Concepts
A distinction has been made between Hard and Soft Base Concepts
• Hard: represented by only a single Index-record
• Soft: represented by several close Index-records
The final set has been used as starting point in EuroWordNet-2
NOUNS                                                   Local NBCs   Intersection with NBC-ewn1 (905)   % of NBC-ewn1   % of Local BCs   New BCs
FR                                                      787          787                                99,24%          100,00%          0
DE                                                      460          202                                25,47%          43,91%           258
CZ                                                      726          271                                34,17%          37,33%           455
EE                                                      703          389                                49,05%          55,33%           314
Union (selected by at least 1 side)                     1727         811                                102,27%         46,96%           916
Union of Intersections (selected by at least 2 sides)   619          516                                65,07%          83,36%           105
Intersection (selected by 4 sides)                      70           70                                 8,83%           100,00%

VERBS                                                   Local VBCs   Intersection with VBC-ewn1 (239)   % of VBC-EWN1   % of Local BCs   New BCs
FR                                                      225          225                                94.14%          100.00%          0
DE                                                      321          98                                 41.00%          30.53%           223
EE                                                      459          145                                60.67%          31.80%           314
CZ                                                      260          71                                 29.71%          27.31%           189
Union (selected by at least 1 side)                     872          233                                97.49%          26.72%           639
Union of Intersections (selected by at least 2 sides)   258          179                                74.90%          69.38%           61
Intersections (selected by 4 sides)                     30           30                                 12.55%          100.00%
Comparison of Base Concept Selections
Revised Set of Base Concepts
         EWN1                   EWN2                   EWN12
         Total   Hard   Soft    Total   Hard   Soft    Total   Hard   Soft
NOUNS    905     575    330     105     20     85      1010    595    415
VERBS    239     164    75      61      23     38      300     187    113
Table 7: Proposed, Missing and Selected Noun Base Concepts for EWN2
                  HARD      SOFT
      Local BCs   Missing   Total   Partial   Missing   Unique BCs   Shared BCs
FR    787         24        199     112       87        0            787
DE    460         427       322     97        225       199          216
EE    703         293       252     160       92        238          465
CZ    726         339       260     153       107       375          351

Table 8: Proposed, Missing and Selected Verb Base Concepts for EWN2
                  HARD      SOFT
      Total       Missing   Total   Partial   Missing   Unique BCs   Shared BCs
FR    225         30        45      11        34        0            225
DE    321         91        70      36        34        182          139
EE    459         52        43      36        7         254          205
CZ    260         126       76      35        41        162          98
Starting points for the Top-Ontology
• The ontology should support the building and encoding of semantic networks as
linguistic ontologies: networks of lexicalized words and expressions in a
language.
• The classification of the Base Concepts in terms of the Top Ontology should apply
to all the involved languages.
• Enforce uniformity and compatibility of the different wordnets, by providing a
common framework. Divide the Base Concepts (BCs) into coherent clusters to
enable contrastive-analysis and discussion of closely related word meanings
• Customize the database by assigning features to the top-concepts, irrespective of
language-specific structures.
• Provide an anchor point for connecting other ontologies to the Inter-Lingual-
Index, such as CYC, MikroKosmos, the Upper-Model, by linking them to the
corresponding ILI-records.
Principles for deciding on the distinctions
Starting point is that the wordnets are linguistic ontologies:
• Semantic classifications common in linguistic paradigms: Aktionsart models [Vendler 1967, Verkuyl 1972, Verkuyl 1989, Pustejovsky 1991], entity-orders [Lyons 1977], Aristotle’s Qualia-structure [Pustejovsky 1995].
• Ontologies developed in previous EC-projects, which had a similar basis and are well-known in the project consortium: Acquilex (BRA 3030, 7315), Sift (LE-62030) [Vossen and Bon 1996].
• The ontology should be capable of reflecting the diversity of the set of common BCs, across the 4 languages. In this sense the classification of the common BCs in terms of the top-concepts should result in:
Homogeneous Base Concept Clusters: classifications in WordNet1.5 and the other wordnets.
Average-sized Base Concept Clusters: not extremely large or small.
Other important characteristics:
The distinctions apply to nouns, verbs and adjectives alike, because these can be related in the language-specific wordnets via an xpos_synonymy relation, and the ILI-records can be related to any part-of-speech.
The top-concepts are hierarchically ordered by means of a subsumption relation but there can only be one super-type linked to each top-concept: multiple inheritance between top-concepts is not allowed.
In addition to the subsumption relation top-concepts can have an opposition-relation to indicate that certain distinctions are disjunct, whereas others may overlap.
There may be multiple relations from ILI-records to top-concepts: the Base Concepts can be cross-classified in terms of multiple top-concepts (as long as these have no opposition-relation between them), i.e. multiple inheritance from Top-Concept to Base Concept is allowed.
Result: the TCs function as cross-classifying features rather than conceptual classes.
Meanings for bodyparts are not linked to a single class BodyPart but to two features: Living and Part.
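The cross-classification rule above can be illustrated with a minimal sketch. The concept names come from the slides, but the dictionaries, the function names and the two opposition pairs listed here are assumptions made only for illustration; they are not the actual EuroWordNet data or tooling.

```python
# Hypothetical miniature of the top-ontology (illustrative data only).
# Single inheritance between top-concepts: each TC has at most one super-type.
SUPER_TYPE = {
    "Living": "Natural",
    "Natural": "Origin",
    "Artifact": "Origin",
    "Part": "Composition",
    "Group": "Composition",
}

# Assumed opposition-relations (disjoint distinctions), for illustration.
OPPOSED = {frozenset({"Natural", "Artifact"}), frozenset({"Part", "Group"})}


def ancestors(top_concept):
    """Return the top-concept followed by its chain of super-types."""
    chain = [top_concept]
    while chain[-1] in SUPER_TYPE:
        chain.append(SUPER_TYPE[chain[-1]])
    return chain


def is_valid_classification(top_concepts):
    """A Base Concept may carry several TCs unless two of them are opposed."""
    expanded = set()
    for tc in top_concepts:
        expanded.update(ancestors(tc))
    return not any(pair <= expanded for pair in OPPOSED)


# Body parts are linked to the two features Living and Part.
print(is_valid_classification({"Living", "Part"}))
# Living inherits from Natural, which is assumed to be opposed to Artifact.
print(is_valid_classification({"Living", "Artifact"}))
```

Treating the TCs as a set of features per Base Concept, rather than as a single class, is what makes the Living plus Part classification expressible without a dedicated BodyPart node.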
The EuroWordNet Top-Ontology: 63 concepts (excluding the top)
First Level [Lyons 1977]:
1stOrderEntity (491 BC synsets, all nouns): Any concrete entity (publicly) perceivable by the senses and located at any point in time, in a three-dimensional space.
2ndOrderEntity (500 BC synsets, 272 nouns and 228 verbs): Any Static Situation (property, relation) or Dynamic Situation, which cannot be grasped, heard, seen or felt as an independent physical thing. They can be located in time and occur or take place rather than exist; e.g. continue, occur, apply.
3rdOrderEntity (33 BC synsets, all nouns): An unobservable proposition that exists independently of time and space. They can be true or false rather than real. They can be asserted or denied, remembered or forgotten. E.g. idea, thought, information, theory, plan.
Third-order entities cannot occur, have no temporal duration and therefore fail on both tests:
a The same person was here again to-day
b The same thing happened/occurred again to-day
*? The idea, fact, expectation, etc. was here/occurred/took place
A positive test for a 3rdOrderEntity is based on the properties that can be predicated:
ok The idea, fact, expectation, etc.. is true, is denied, forgotten
The first division of the ontology is disjoint: BCs cannot be classified as combinations of these TCs. This distinction cuts across the different parts of speech in that:
1stOrderEntities are always (concrete) nouns. 2ndOrderEntities can be nouns, verbs and adjectives, where adjectives are always non-dynamic (refer to states and situations not involving a change of state). 3rdOrderEntities are always (abstract) nouns.
Test to distinguish 1st, 2nd and 3rd OrderEntities
Base Concepts classified as 3rdOrderEntities
theory; idea; structure; evidence; procedure; doctrine; policy; data point; content; plan of action; concept; plan; communication; knowledge base; cognitive content; know-how; category; information; abstract; info;
1stOrderEntity 1
  Origin 0 (the way in which an entity has come about)
    Natural 21
      Living 30
        Plant 18
        Human 106
        Creature 2
        Animal 123
    Artifact 144
  Function 0 (the typical activity or role that is associated with an entity)
    Vehicle 8
    Occupation 23
    Covering 8
    Garment 3
    Software 4
    Furniture 6
    Place 45
    Container 12
    Comestible 32
    Instrument 18
    Building 13
    Representation 12
      MoneyRepresentation 10
      LanguageRepresentation 34
      ImageRepresentation 9
  Form 0 (amorphous or fixed shape)
    Substance 32
      Solid 63
      Liquid 13
      Gas 1
    Object 62
  Composition 0 (group of self-contained wholes, or a part of such a whole)
    Part 86
    Group 63
Conjunctive classes of 1stOrderEntities
Frequent combinations
 5 Comestible;Solid;Artifact
 5 Container;Part;Solid;Living
 5 Furniture;Object;Artifact
 5 Instrument;Artifact
 5 Living
 5 Plant
 6 Liquid
 6 Object;Artifact
 6 Part;Living
 6 Place;Part;Solid
 7 Building;Object;Artifact
 7 Group
 7 LanguageRepresentation
 7 Vehicle;Object;Artifact
10 Instrument;Object;Artifact
12 Part
14 Place
14 Place;Part
15 Substance
19 LanguageRepresentation;Artifact
20 Occupation;Object;Human
22 Object;Animal; Function
38 Group;Human
42 Object;Human
Conjunctive classes of 1stOrderEntities
Low-frequency combinations
fruit: Comestible (Function), Object (Form), Part (Composition), Plant (Natural, Origin)
life: Group (Composition), Living (Natural, Origin)
cell: Part (Composition), Living (Natural, Origin)
skin: Covering (Function), Solid (Form), Part (Composition), Living (Natural, Origin)
arms: Instrument (Function), Group (Composition), Object (Form), Artifact (Origin)
1stOrderEntities classified as Function only
barrier 1; belonging 2; building material 1; causal agency 1; commodity 1; consumer goods 1; creation 3; curative 1; decoration 2; device 4; fastener 1; force 6; force 7; form 5; impediment 1; medicament 1; piece of work 1; possession 1; protection 4; remains 2; restraint 2; support 6; support 7; supporting structure 1; thing 3
2ndOrderEntity 0
  SituationType 6 (the event-structure in terms of which a situation can be characterized as a conceptual unit over time; Disjoint features)
    Dynamic 134 (he sat down quickly; a quick meeting)
      BoundedEvent 183
      UnboundedEvent 48
    Static 28 (?he sits quickly)
      Property 61
      Relation 38
  SituationComponent 0 (the most salient semantic component(s) that characterize(s) a situation; Conjuncted features)
    Cause 67, Communication 50, Condition 62, Physical 140
    Agentive 170, Existence 27, Experience 43, Possession 23
    Phenomenal 17, Location 76, Manner 21, Purpose 137
    Stimulating 25, Mental 90, Modal 10, Quantity 39
    Social 102, Time 24, Usage 8
Conjunctive classes of 2ndOrderEntities
Static
 5 Property;Physical;Condition
 5 Property;Stimulating;Physical
 5 Relation
 5 Relation;Social
 6 Static;Quantity
 7 Property;Condition
 8 Relation;Location
 9 Property
10 Relation;Physical;Location:
   adjoin 1; aim 4; blank space 1; course 7; direction 8; distance 1; elbow room 1; path 3; spatial property 1; spatial relation 1
Conjunctive classes of 2ndOrderEntities
Dynamic
5 BoundedEvent;Cause;Physical
5 BoundedEvent;Cause;Physical;Location
5 BoundedEvent;Time
5 Dynamic
5 Dynamic;Location
5 Dynamic;Phenomenal
5 Dynamic;Phenomenal;Physical
6 BoundedEvent;Agentive
6 BoundedEvent;Location
6 BoundedEvent;Physical;Location
6 Dynamic;Agentive;Communication
6 Dynamic;Cause
8 BoundedEvent;Agentive;Mental;Purpose
8 BoundedEvent;Quantity;Time
9 BoundedEvent;Cause
9 Dynamic;Experience;Mental
Example synsets: experience 7; find 3; affect 5; arouse 5; excite 2; cognition 1; desire 2; disposition 2; disposition 4; disturbance 7; emotion 1; feeling 1; humor 3; pleasance 1; process 4; look 8; phenomenon 1; cause to appear 1; perception 2; sensation 1; feel 12; experience 8; trouble 3; reality 1
Top-Down Building Procedure
1) Construction of a core wordnet from the common set of Base Concepts
• Find Representatives in the local language for the Common Base Concepts (1310 synsets)
• Add local Base Concepts that are not selected as Common Base Concepts
• Specify the hyperonyms of the local and common Base Concepts
2) Extend the Core Wordnets
• Add the first level of hyponyms to the core wordnets
• Add other hyponyms which have many sub-hyponyms
• Add other types of relations: XPOS, roles, meronymy, subevents, causes.
3) Verify the Selection
• Corpus frequency: Parole lexicons and corpora
• Top-Concept clustering
• Intersection of ILI-records
• Overlap in ILI-chains
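Step 2 of the procedure above can be sketched as a selection over a hyponymy graph. This is only an illustration under stated assumptions: the toy graph, the function names and the threshold `min_sub_hyponyms` are hypothetical, and the slides do not say how "many sub-hyponyms" was quantified.

```python
def descendants(node, hyponyms):
    """All direct and indirect hyponyms of a node."""
    seen = set()
    stack = list(hyponyms.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hyponyms.get(current, []))
    return seen


def extend_core(core, hyponyms, min_sub_hyponyms=2):
    """Add first-level hyponyms, then deeper hyponyms with many sub-hyponyms."""
    selected = set(core)
    # First level of hyponyms below the Base Concepts.
    for base_concept in core:
        selected.update(hyponyms.get(base_concept, []))
    # Other hyponyms that themselves dominate enough synsets.
    for base_concept in core:
        for node in descendants(base_concept, hyponyms):
            if len(descendants(node, hyponyms)) >= min_sub_hyponyms:
                selected.add(node)
    return selected


# Toy hyponymy graph modelled on the WordNet1.5 example earlier in the slides.
HYPONYMS = {
    "vehicle": ["motor vehicle"],
    "motor vehicle": ["car"],
    "car": ["cab", "cruiser"],
}

print(sorted(extend_core({"vehicle"}, HYPONYMS)))
```

With this toy graph, "car" is kept because it dominates two synsets, while the leaves "cab" and "cruiser" are left for a later building phase.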
Top-Down Building
[Figure: the Top-Ontology (63 TCs) sits above the Inter-Lingual-Index (1310 CBCs, 149 new ILIs, remaining WordNet1.5 synsets). Each wordnet below it is shown as layers: Hyperonyms, CBC Representatives, Local BCs, WMs related via non-hyponymy, First Level Hyponyms, Remaining Hyponyms.]
The current wordnets
Language  Synsets  No. of senses  Sens./syns.  Entries  Sens./entry  LIRels.  LIRels/syns  EQRels-ILI  EQRels/syn  Synsets without ILI
Dutch       44015          70201         1.59    56283         1.25   111639         2.54       53448        1.21                 7203
Spanish     23370          50526         2.16    27933         1.81    55163         2.36       21236        0.91                    0
Italian     40428          48499         1.20    32978         1.47   117068         2.90       71789        1.78                 1561
French      22745          32809         1.44    18777         1.75    49494         2.18       22730        1.00                   20
German      15132          20453         1.35    17098         1.20    34818         2.30       16347        1.08                    0
Czech       12824          19949         1.56    12283         1.62    26259         2.05       12824        1.00                    0
Estonian     7678          13839         1.80    10961         1.26    16318         2.13        9004        1.17                    0
English     16361          40588         2.48    17320         2.34    42140         2.58        n.a.        n.a.                 n.a.
WN15        94515         187602         1.98   126617         1.48   211375         2.24        n.a.        n.a.                 n.a.
Comparison of wordnets
• In-depth comparison of major semantic fields
• Comparison of the intersection of the associated ILI-records
• Distribution of the associated ILI-records over the different top ontology clusters
• Comparison of the hyponymy relations in the wordnets, projected on the associated ILI-records
Intersection of the associated ILI-records
              Nouns                                              Verbs
              frequency  % of (WN,IT,NL,ES)  % of (IT,NL,ES)     frequency  % of (WN,IT,NL,ES)  % of (IT,NL,ES)
Total                                 62780            32520                             12215             7455
ES                24596               39.2%            75.6%          4654               38.1%            62.4%
IT                14272               22.7%            43.9%          4673               38.3%            62.7%
NL                21259               33.9%            65.4%          6416               52.5%            86.1%
(ES, IT)          10907               17.4%            33.5%          3272               26.8%            43.9%
(ES, NL)          14773               23.5%            45.4%          3870               31.7%            51.9%
(IT, NL)           9862               15.7%            30.3%          3950               32.3%            53.0%
(ES, IT, NL)       8183               13.0%            25.2%          3051               25.0%            40.9%
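The percentages in the last row can be recomputed from the totals, and the underlying operation is a plain set intersection over the ILI-records each wordnet links to. The sketch below uses hypothetical function names and toy ILI identifiers; only the four figures passed to `pct` are taken from the table.

```python
def overlap_report(ili_by_language, reference_size):
    """Return (shared count, % of a reference set, % of the union)."""
    union = set().union(*ili_by_language.values())
    common = set.intersection(*ili_by_language.values())
    return (
        len(common),
        100 * len(common) / reference_size,
        100 * len(common) / len(union),
    )


def pct(part, whole):
    """Percentage rounded to one decimal, as printed on the slide."""
    return round(100 * part / whole, 1)


# Toy ILI-record identifiers, for illustration only.
toy = {"es": {1, 2, 3, 4}, "it": {2, 3, 5}, "nl": {2, 3, 4, 6}}
print(overlap_report(toy, reference_size=10))

# Figures from the (ES, IT, NL) row: nouns, then verbs.
print(pct(8183, 62780), pct(8183, 32520))
print(pct(3051, 12215), pct(3051, 7455))
```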
Distribution over the top ontology clusters
                    WN                NL                        ES                        IT
Top-Concept         TC-Tokens % of wn TC-Tokens % of nl % of wn TC-Tokens % of es % of wn TC-Tokens % of it % of wn
Animal                  14068   3.99%      1193   0.97%    8.5%      2458   1.81%   17.5%      1122   1.44%    8.0%
Artifact                19562   5.55%     10803   8.83%   55.2%      9969   7.36%   51.0%      6494   8.34%   33.2%
Building                 1022   0.29%       707   0.58%   69.2%       628   0.46%   61.4%       434   0.56%   42.5%
Comestible               3377   0.96%      1393   1.14%   41.2%      1614   1.19%   47.8%       624   0.80%   18.5%
Container                1725   0.49%       778   0.64%   45.1%       799   0.59%   46.3%       432   0.55%   25.0%
Covering                 2030   0.58%      1208   0.99%   59.5%      1027   0.76%   50.6%       690   0.89%   34.0%
Creature                  664   0.19%       159   0.13%   23.9%       254   0.19%   38.3%        27   0.03%    4.1%
Function                34081   9.68%     17668  14.44%   51.8%     18904  13.96%   55.5%     11043  14.18%   32.4%
Furniture                 298   0.08%       171   0.14%   57.4%       147   0.11%   49.3%        87   0.11%   29.2%
Garment                   756   0.21%       494   0.40%   65.3%       426   0.31%   56.3%       292   0.37%   38.6%
Gas                        93   0.03%        67   0.05%   72.0%        62   0.05%   66.7%        49   0.06%   52.7%
Group                   27805   7.90%      3357   2.74%   12.1%      3630   2.68%   13.1%      2337   3.00%    8.4%
Human                   11543   3.28%      6372   5.21%   55.2%      7683   5.67%   66.6%      4488   5.76%   38.9%
ImageRepresentation       780   0.22%       412   0.34%   52.8%       426   0.31%   54.6%       294   0.38%   37.7%
Instrument               7036   2.00%      4102   3.35%   58.3%      3590   2.65%   51.0%      2564   3.29%   36.4%
LanguageRepresent.       2844   0.81%      1273   1.04%   44.8%      1218   0.90%   42.8%       691   0.89%   24.3%
Liquid                   1629   0.46%       617   0.50%   37.9%       500   0.37%   30.7%       339   0.44%   20.8%
Living                  47104  13.37%     10225   8.36%   21.7%     13661  10.08%   29.0%      7408   9.51%   15.7%
Distribution over the top ontology clusters
                    WN                NL                        ES                        IT
Top-Concept         TC-Tokens % of wn TC-Tokens % of nl % of wn TC-Tokens % of es % of wn TC-Tokens % of it % of wn
MoneyRepresentation       372   0.11%       190   0.16%   51.1%       183   0.14%   49.2%       111   0.14%   29.8%
Natural                 68370  19.41%     21948  17.94%   32.1%     24556  18.13%   35.9%     14400  18.49%   21.1%
Object                  48162  13.68%     20206  16.51%   42.0%     22608  16.69%   46.9%     13242  17.00%   27.5%
Occupation               2059   0.58%      1209   0.99%   58.7%      1395   1.03%   67.8%       824   1.06%   40.0%
Part                    12083   3.43%      4806   3.93%   39.8%      5819   4.30%   48.2%      2586   3.32%   21.4%
Place                    5281   1.50%      2072   1.69%   39.2%      2439   1.80%   46.2%      1227   1.58%   23.2%
Plant                   18874   5.36%      1534   1.25%    8.1%      2012   1.49%   10.7%      1121   1.44%    5.9%
Representation            934   0.27%       560   0.46%   60.0%       577   0.43%   61.8%       302   0.39%   32.3%
Software                  201   0.06%        80   0.07%   39.8%        91   0.07%   45.3%        49   0.06%   24.4%
Solid                    6319   1.79%      2845   2.33%   45.0%      2721   2.01%   43.1%      1406   1.81%   22.3%
Substance               12365   3.51%      5447   4.45%   44.1%      5599   4.13%   45.3%      2847   3.66%   23.0%
Vehicle                   747   0.21%       466   0.38%   62.4%       466   0.34%   62.4%       352   0.45%   47.1%
Total                  352184       -    122362       -   34.7%    135462       -   38.5%     77882       -   22.1%
Comparison of the hyponymy relations, projected on the associated ILI-records
To be able to compare hyponymy chains, each word sense in the chain has been replaced by the ILI-records that are linked to these synsets which gives the following result:
veranderen (change)                   00064108
bewegen (move intransitive)           01046072
bewegen (move reflexive)              01046072
voortbewegen (move location)          01046072
verplaatsen (move from A to B)        01055491
stijgen (move to a higher position)   01094615
opstijgen (take off)                  00257753
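The projection described above amounts to a lookup from each word sense to its ILI-record. A minimal sketch follows, using the Dutch chain and identifiers from this slide; the function names are hypothetical, and collapsing consecutive repeats is an added assumption (the slide keeps the repeated record) that is useful when chains from different languages are compared by length.

```python
# Dutch chain from the slide, from general to specific, with its ILI-records.
SYNSET_TO_ILI = {
    "veranderen": "00064108",
    "bewegen (intransitive)": "01046072",
    "bewegen (reflexive)": "01046072",
    "voortbewegen": "01046072",
    "verplaatsen": "01055491",
    "stijgen": "01094615",
    "opstijgen": "00257753",
}
DUTCH_CHAIN = list(SYNSET_TO_ILI)  # dicts preserve insertion order


def project_chain(chain, synset_to_ili):
    """Replace each word sense by its ILI-record, skipping unlinked senses."""
    return [synset_to_ili[sense] for sense in chain if sense in synset_to_ili]


def collapse_repeats(ili_chain):
    """Merge consecutive senses that map to the same ILI-record."""
    collapsed = []
    for record in ili_chain:
        if not collapsed or collapsed[-1] != record:
            collapsed.append(record)
    return collapsed


projected = project_chain(DUTCH_CHAIN, SYNSET_TO_ILI)
print(projected)
print(collapse_repeats(projected))
```

Three Dutch levels map to the single record 01046072, so the seven-step Dutch chain corresponds to five distinct ILI-records.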
Coverage of complete noun chains projected over WN1.5 structure
             nodes (53467)         edges (53467)
             frequency       %     frequency       %
ES               14221   26.60         14221   26.60
NL                 650    1.22            17    0.03
IT                2760    5.16            49    0.09
(ES,NL)            352    0.66            10    0.02
(ES,IT)           1563    2.92            34    0.06
(NL,IT)            190    0.36             0    0.00
(ES,NL,IT)         136    0.25             0    0.00
Partial noun chains projected over WN1.5
LENGTH      ES      NL      IT  (ES,NL)  (ES,IT)  (NL,IT)  (ES,NL,IT)      WN
     1   53467   53213   53456    53148    53452    52862       52803   53467
     2   53385   43161   47346    41959    47138    40893       40636   53467
     3   51541   26862   44076    25162    42764    21573       21089   53434
     4   47930   15032   27878    13106    26260     7808        7112   52913
     5   42049    6771   21019     5454    19433     2996        2506   50693
     6   27582    2781   14817     1929    12552      949         799   45029
     7   16789     967    7865      726     6259      169         148   32299
     8    8337     196    3526       87     2648       17          12   20558
     9    3800       6    1062        3      779        -           -   11821
    10    1647       -     380        -      311        -           -    5881
    11     647       -      82        -       73        -           -    2576
    12     299       -      28        -       25        -           -    1176
    13     115       -       -        -        -        -           -     659
    14      19       -       -        -        -        -           -     295
    15       2       -       -        -        -        -           -      82
Partial noun chains with 1 gap projected over WN1.5
LENGTH      ES      NL      IT  (ES,NL)  (ES,IT)  (NL,IT)  (ES,NL,IT)      WN
     3    7804   29355   12152    28312    11619    20886       20439   53434
     4    7776   26152   11616    24655    11086    17228       16775   52913
     5    7333   18633   10480    16712     9652    11136       10561   50693
     6    6296   12019    7782    10158     6879     6023        5262   45029
     7    5017    5326    4602     3866     4119     2531        1960   32299
     8    3392    1891    2456     1046     2131      704         560   20558
     9    1914     487    1166      268      986      115          98   11821
    10    1038      83     538       32      485       11           7    5881
    11     564       2     173        1      163        -           -    2576
    12     232       -     108        -      101        -           -    1176
    13      98       -      35        -        4        -           -     659
    14      43       -       2        -        -        -           -     295
    15       5       -       -        -        -        -           -      82
    16       2       -       -        -        -        -           -       7
Independently of the wordnet structures in each language, we can manipulate the mapping across languages via the ILI.
We can use the information of all the languages to correct incompleteness and inconsistencies of the individual resources
Ultimately, we should try to find a minimal and sufficient set of concepts to provide an efficient mapping.
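Mapping across languages via the ILI can be pictured as a two-step lookup: from a word to the ILI-records it links to, and from those records to the words of another language. The sketch below is illustrative only. The words come from examples elsewhere in these slides, but the language codes, the record labels and the function names are assumptions.

```python
from collections import defaultdict

# (language, word, ILI-record) triples; record labels are illustrative.
EQ_SYNONYM_LINKS = [
    ("nl", "auto", "car"),
    ("fr", "voiture", "car"),
    ("es", "romper", "break; damage"),
    ("it", "rompere", "break; damage"),
]


def build_ili_index(links):
    """Group the words of every language under the ILI-record they link to."""
    index = defaultdict(lambda: defaultdict(set))
    for language, word, ili_record in links:
        index[ili_record][language].add(word)
    return index


def translate(word, source, target, links):
    """Words of the target language sharing an ILI-record with the word."""
    found = set()
    for by_language in build_ili_index(links).values():
        if word in by_language.get(source, set()):
            found |= by_language.get(target, set())
    return found


print(translate("auto", "nl", "fr", EQ_SYNONYM_LINKS))
print(translate("romper", "es", "it", EQ_SYNONYM_LINKS))
```

No language pair is linked directly; every wordnet only needs links to the ILI, which is why the granularity of the ILI-records determines how efficient the mapping is.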
Towards an efficient, condensed and universal index of sense-distinctions
Characteristics of the Inter-Lingual-Index
The Inter-Lingual-Index (ILI) is an unstructured fund of concepts with the sole purpose of providing an efficient mapping of senses across languages.
Requirements:
1. efficient level of granularity
ILI                                                  Wordnets
{break} “He broke the glass”                         breken (Dutch)
{break; cause to break}                              breken (Dutch)
{break; damage} inflict damage upon.                 romper (Spanish), rompere (Italian)
2. superset of concepts that occur across languages
ILI                                                  Wordnets
{cashier}          eq_hyperonym                      cassière (Dutch), cajera (Spanish)
{female cashier}   eq_synonym                        cassière (Dutch), cajera (Spanish)
A Minimal and Efficient set of concepts
• Globalizing the sense-differentiation:
  • create metonymic clusters
  • abstract from contextual specialization and grammatical perspectives
  • abstract from part-of-speech realization
  • abstract from productive and predictable meanings
• Extending the Inter-Lingual-Index to become the superset of concepts occurring in two or more wordnets only if:
  • concepts are unpredictable and unproductive
  • concepts cannot be linked exhaustively and uniquely to the ILI
Under-specified concepts: Metonymic clusters
club
{vereniging}NL
{club; verenigingsgebouw}NL
{club}EN
metonym# club: organization
metonym# club: building
eq_metonym eq_metonym
eq_synonym eq_synonym
Under-specified concepts: Generalization and Diathesis clusters
break
{rompere}IT
diathesis# break: inchoative
diathesis#break: causative
{breken; kapotgaan}NL
{breken; kapotmaken}NL eq_synonym eq_synonym
eq_diathesis
{rompersi}IT
eq_diathesis
Under-specified for POS
depart
{vertrekkenV}NL
{vertrekN}NL
{departV}EN
{departureN}EN
xpos# departure
xpos# depart
eq_xpos_synonym eq_xpos_synonym
eq_synonym eq_synonym
Overview of equivalence relations to the ILI
Relation            POS    Sources : Targets      Example
eq_synonym          same   1 : 1                  auto : voiture car
eq_near_synonym     any    many : many            apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym        same   many : 1 (usually)     citroenjenever : gin
eq_hyponym          same   (usually) 1 : many     dedo : toe, finger
eq_metonymy         same   many/1 : 1             universiteit, universiteitsgebouw : university
eq_diathesis        same   many/1 : 1             raken (cause), raken : hit
eq_generalization   same   many/1 : 1             schoonmaken : clean
Progress on restructuring the ILI
Clusters added manually and automatically based on:
• structural properties of WN1.5
• mapping to other sources: Levin’s classes, WN1.6
• cross-lingual mapping

        clusters  words  word senses  synsets
Nouns       1703   1398         3205     2895
Verbs       2905   1799         5134     3839
New ILIs from other wordnets have not yet been added. We estimated that for verbs hardly any new ILIs are needed, for nouns about 30% of non-translated concepts (2,000 synsets based on Dutch).
Effects of ILI-clusters
Intersection of ILI-references for Dutch, Spanish, Italian and English
Nouns: 2895 clustered synsets (4.6% of 62780 WN1.5 noun synsets); intersection increased from 7736 (23.8%) to 8183 (25.2%) out of the union of 32520 synsets.
Verbs: 3839 clustered synsets (31.4% of 12215 WN1.5 verb synsets); intersection increased from 1632 (21.9%) to 3051 (40.9%) out of the union of 7455 synsets.
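The increase follows from treating all members of a cluster as one ILI reference before intersecting. A minimal sketch under stated assumptions: the record labels, the cluster table and the function name are hypothetical, chosen to mirror the break/diathesis example rather than taken from the project data.

```python
def intersection_size(refs_by_language, cluster_of=None):
    """Count ILI references shared by all wordnets, optionally after
    replacing each record by the cluster it belongs to."""
    cluster_of = cluster_of or {}
    normalised = [
        {cluster_of.get(record, record) for record in refs}
        for refs in refs_by_language.values()
    ]
    return len(set.intersection(*normalised))


# Each wordnet links 'break' to a different diathesis alternation.
refs = {
    "nl": {"break: inchoative", "car"},
    "it": {"break: causative", "car"},
    "es": {"break: causative", "car"},
}
# Hypothetical diathesis cluster grouping both records under one key.
clusters = {"break: inchoative": "break", "break: causative": "break"}

print(intersection_size(refs))
print(intersection_size(refs, clusters))
```

Without the cluster only "car" is shared; with it, the two "break" records count as the same reference. This is the mechanism by which clustering could raise the intersection, and the verb figures above show the larger effect.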
Superset of all concepts
Procedure:
• Initially, the ILI will only contain WordNet1.5 synsets.
• A site that cannot find a proper equivalent among the available ILI-concepts will link the meaning to another ILI-record using a so-called complex-equivalence relation and will generate a potential new ILI-record:

  Dutch Meaning   Definition         Complex-equivalence   Target concept
  klunen          to walk on skates  has_eq_hyperonym      walk

• After a building-phase all potentially-new ILI-records are collected and verified for overlap by one site;
• a proposal for updating the ILI is distributed to all sites and has to be verified;
• the ILI is updated and all sites have to reconsider the equivalence relations for all meanings that can potentially be linked to the new ILI-records.
Filling gaps in the ILI
Types of gaps:
1. genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin
• Non-productive
• Non-compositional
2. pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: container, borrower, cajera (female cashier)
• Productive
• Compositional
Universality of gaps: concepts occurring in at least 2 languages
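The universality criterion can be sketched as a filter over the potential new ILI-records proposed by the individual sites. The proposals, language codes and function name below are hypothetical; the threshold of two languages is the one stated above.

```python
from collections import defaultdict


def universal_gaps(proposals, min_languages=2):
    """Keep proposed concepts that occur in at least `min_languages` wordnets."""
    languages = defaultdict(set)
    for language, concept in proposals:
        languages[concept].add(language)
    return sorted(
        concept
        for concept, seen_in in languages.items()
        if len(seen_in) >= min_languages
    )


# Hypothetical proposals as (language, concept) pairs.
PROPOSALS = [
    ("nl", "female cashier"),
    ("es", "female cashier"),
    ("nl", "citroenjenever"),
]

print(universal_gaps(PROPOSALS))
```

A concept proposed by one site only, such as citroenjenever here, stays linked through a complex-equivalence relation instead of becoming a new ILI-record.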
Productive and Predictable Lexicalizations exhaustively linked to the ILI
beat
stamp
{doodslaanV}NL
{cajeraN}ES
eq_has_hyperonym
{doodschoppenV}NL
{doodstampenV}NL
kill
kick
{tottrampelnV}DE
{totschlagenV}DE
eq_has_hyperonym
eq_has_hyperonym eq_has_hyperonym
eq_has_hyperonym
eq_has_hyperonym
eq_has_hyperonym
cashier
female
young
fish
{casière}NL
eq_has_hyperonym
{alevínN}ES
eq_has_hyperonym
eq_has_hyperonym
eq_has_hyperonym
eq_in_state
eq_in_state
eq_in_state
WordNet gaps across languages
                         ILI REFs (mostly hyperonyms)   ILIVars
                         Nouns   Verbs                  Nouns   Verbs
NL                         491      99                    551      82
DE                         109       9                    144      10
IT                          45      22                     77      66
NL&DE                       10       0                      2       0
NL&IT                        6       3                      1       0
DE&IT                        5       1                      0       0
NL&DE&IT                     3       0                      0       0
Union of intersections      15       4                      3       0
Towards an efficient, condensed and universal index of sense-distinctions
Productive derivations and compounds linked exhaustively
WordNet1.5
90,000concepts
Metonymy/Generalizationclusters
Universal Core meanings
POSIndependent
Non-predictable
Universal systematic polysemy and level of granularity
Language and domain specific lexicalizations that do not occur in a large variety of languages
Language specific realizations in grammatical forms
The EuroWordNet database
1.) The actual wordnets in Flaim database format: an indexing and compression format of Novell.
2.) Polaris (Louw 1997): Re-implementation of the Novell ConceptNet toolkit (Díez-Orzas et al 1995) adapted to the EuroWordNet architecture.
• import and export wordnets or wordnet selections from/to ASCII files.
• resolve links for imported concepts.
• edit and add concepts, variants and relations in the wordnets.
• access the ILI and ontologies and switch between the wordnets and ontologies via the ILI.
• extract, import and export clusters of senses based on relations.
• project synsets or clusters from one wordnet to another wordnet.
• compare clusters of synsets.
• import new or adapted ILI-records.
• update ILI-references to the updated ILI.
3.) Periscope (Cuypers and Adriaens 1997): a graphical interface for viewing the EuroWordNet database.
Global Wordnet Association
http://www.globalwordnet.org
provide a standardized framework to link, compare and build complete wordnets for all the European languages and dialects.
initialize the development of wordnets in non-European languages
develop more specific definitions, tests and procedures for evaluating and developing wordnets.
extend the specification of EuroWordNet to lexical units which are not yet covered (adjectives/adverbs, lexicalized phrases and multi-words).
develop (axiomatized) ontologies for Domains and World-Knowledge that can be shared by all languages via the ILI.
develop an efficient ILI for linking, sharing, consistency checking and cross-language technology applications. This ILI could function as a gold-standard of sense-distinctions.
organize an (annual/bi-annual) workshop or conference.
2nd Global Wordnet Conference
Location: Masaryk University, Brno (Czech Republic), January 20 - 23, 2004.
http://www.fi.muni.cz/gwc2004/
Other wordnet initiatives
Danish, Norwegian, Swedish, Portuguese, Arabic, Korean, Russian, Welsh, Basque, Catalan, Chinese
BalkaNet, IndoWordnet, Meaning
BalkaNet
• Funded by the European Union as project IST-2000-29388.
• 3-year project: 2001 - 2004
• Follows a strict EuroWordNet approach:
  • Expanded set of base concepts
  • Top-down building approach
• EWN database extended with: Greek, Romanian, Serbian, Turkish, Bulgarian, Czech
• Development of new wordnet database system: VisDic
http://www.ceid.upatras.gr/Balkanet/
IndoWordnet
Current Wordnet development in India: Hindi and Marathi at IIT Bombay, Tamil at Anna University-K.B Chandrashekhar Research Centre (AU-KBC) Chennai and Tamil University Tanjavur, Gujarati at MS University Baroda, Oriya at Utkal University Bhubaneswar and Bengali at IIT Kharagpur.
The Hindi WordNet is at an advanced stage of development with about 11000 semantically linked synsets and with associated software and user interface.
IndoWordnet
By the end of 2003 each Indian language will create a WordNet of 5000 synsets. These will be for about 2000 most frequent content words in each language. Use will be made of the wordlist sorted by frequency, available with the CIIL.
Language-specific WordNets developed by the following institutions:
• CIIL, Mysore: Kannada, Kashmiri, Punjabi, Urdu, Himachali, Malayalam
• IIT Bombay: Hindi, Marathi and Konkani
• AU-KBC Chennai and Tamil University Tanjavur: Tamil and Malayalam
• University of Hyderabad: Telugu
• University of Baroda: Gujarati
• Utkal University Bhubaneswar: Oriya
• IIT Kharagpur: Bengali
Research groups have to be identified for building the WordNets of Assamese, Nepali and Languages of the North East.
Meaning
Developing Multilingual Web-scale Language Technologies
http://www.lsi.upc.es/~nlp/meaning/
Meaning Objectives
• Funded by the European Union as project IST-2001-34460
• 3-year project: April 2002 - April 2005
• Large-scale (Lexical) Knowledge Bases
  • Automatic enrichment of EWN
  • Mixed approach (KB + ML)
  • Applied to Q/A, CLIR
• Problem: structural and lexical ambiguity
Meaning Approach
• Automatic collection of sense examples (Leacock et al. 98, Mihalcea and Moldovan 99)
• Large-scale WSD (Boosting, SVM, transductives)
• Large-scale Knowledge Acquisition (McCarthy 01, Agirre & Martinez 02)
Meaning Architecture
[Figure: a Multilingual Central Repository surrounded by the English, Spanish, Italian, Basque and Catalan EWNs, each paired with a Web Corpus for its language. The connections are labelled ACQ, WSD, UPLOAD and PORT.]
A combination of unsupervised Knowledge-based and supervised Machine Learning techniques that will provide a high-precision system that is able to tag running text with word senses
A system that acquires a huge number of examples per word from the web
The use of sophisticated linguistic information, such as syntactic relations, semantic classes, selectional restrictions, subcategorization information, domain, etc.
Efficient margin-based Machine Learning algorithms.
Novel algorithms that combine tagged examples with huge amounts of untagged examples in order to increase the precision of the system.
Meaning WP6: Word Sense Disambiguation
THE END...