Transcript
Page 1: Building Wordnets Piek Vossen, Irion Technologies

Building Wordnets

Piek Vossen, Irion Technologies

Page 2

Overview

- Starting points
- Semantic framework
- Process overview
- Methodologies in other projects
- Multilinguality

Page 3

Starting points

- Purpose of the wordnet database:
  - education, science, applications
  - formal ontology or linguistic ontology
  - making inferences or lexical substitution
  - conceptual density or large coverage
- Distributed development
- Reproducibility
- Available resources
- Language-specific features
- (Cross-language) compatibility
- Exploit community resources by projecting conceptual relations on a target wordnet

Page 4

Semantic framework

Page 5

Differences in wordnet structures

[Figure: the same part of the hierarchy in the Dutch Wordnet and in Wordnet1.5]

Dutch Wordnet: voorwerp {object}, with lepel {spoon}, werktuig {tool}, tas {bag}, bak {box}, blok {block}, lichaam {body}

Wordnet1.5: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; block; body; container; device; implement; tool; instrument; bag; spoon; box

- Artificial Classes versus Lexicalized Classes: instrumentality; natural object

- Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch

Page 6

Linguistic versus conceptual ontologies

Conceptual ontology: A particular level or structuring may be required to achieve better control or performance, or a more compact and coherent structure.

- Introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool).
- Neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise).

What properties can we infer for spoons?

spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking

Linguistic ontology: Exactly reflects the relations between all the lexicalized words and expressions in a language. It gives valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language?

What words can be used to name spoons?

spoon -> object, tableware, silverware, merchandise, cutlery

Page 7

Wordnets as Linguistic Ontologies

Classical Substitution Principle:

Any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms:

horse -> stallion, mare, pony, mammal, animal, being.

It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms:

horse -X-> cat, dog, camel, fish, plant, person, object.
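
The substitution principle above can be sketched as a lookup in a toy hyponymy hierarchy. The hierarchy and the function names `hyperonyms` and `can_substitute` are illustrative assumptions, not part of any wordnet toolkit; synonyms are left out to keep the sketch short.

```python
# Toy hierarchy: each word points to its direct hyperonym (illustrative data only).
HYPERONYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal", "camel": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being", "person": "being",
}

def hyperonyms(word):
    """Return all hyperonyms of a word, nearest first."""
    chain = []
    while word in HYPERONYM:
        word = HYPERONYM[word]
        chain.append(word)
    return chain

def can_substitute(word, candidate):
    """True if candidate is a hyperonym or a hyponym of word."""
    return candidate in hyperonyms(word) or word in hyperonyms(candidate)

for candidate in ["stallion", "animal", "cat", "plant"]:
    print("horse", candidate, can_substitute("horse", candidate))
```

Co-hyponyms such as cat share a hyperonym with horse but lie on a different branch, so the check rejects them.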

Conceptual Distance Measurement:

The number of hierarchical nodes between words is a measurement of closeness, where the level and the local density of nodes are additional factors.
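
A minimal sketch of the distance idea, assuming a toy tree-shaped hierarchy and counting only the hierarchy steps between two words. The level and local density factors mentioned above are deliberately not modelled, and the names `path_to_root` and `distance` are hypothetical.

```python
# Toy hierarchy: each word points to its direct hyperonym (illustrative data only).
HYPERONYM = {
    "stallion": "horse", "horse": "mammal", "cat": "mammal",
    "mammal": "animal", "fish": "animal",
}

def path_to_root(word):
    """Return the word followed by all its hyperonyms up to the top."""
    path = [word]
    while path[-1] in HYPERONYM:
        path.append(HYPERONYM[path[-1]])
    return path

def distance(a, b):
    """Count hierarchy steps between a and b via their nearest shared node."""
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None  # no shared hyperonym in this toy hierarchy

print(distance("horse", "cat"), distance("horse", "fish"))
```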

The main purpose is to predict which words can be used as substitutes in language, considering all the lexicalized words in a language.

Page 8

Define a semantic framework

- Definition of relations
- Diagnostic frames (Cruse 1986)
- Examples and corpus data
- Top-level ontology
- Constraints on relations
- Type consistency
- Large scale validation

Page 9

Process overview

Page 10

Techniques

- Manual encoding and verification
- Automatic extraction:
  - definitions
  - synonyms
  - distribution and similarity patterns in corpora
  - defining contexts, e.g. “cats and other pets”
  - parallel corpora, e.g. Bible translations
  - morphological structure
  - bilingual dictionaries
- Encode source and status of data: who, when, based on what algorithm, validated, final
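
As an illustration of extraction from defining contexts, the sketch below pulls hyponymy candidates out of the pattern "X and other Y" and stores each one with its source and status. The record fields and the function name `extract_candidates` are assumptions made for this example; words are not lemmatized, so plurals are kept as found.

```python
import re

# Matches a defining context such as "cats and other pets".
AND_OTHER = re.compile(r"\b(\w+) and other (\w+)\b")

def extract_candidates(text, source="pattern: X and other Y"):
    """Return unvalidated hyponymy candidates found in the text."""
    records = []
    for match in AND_OTHER.finditer(text):
        records.append({
            "hyponym": match.group(1).lower(),
            "hyperonym": match.group(2).lower(),
            "source": source,         # which algorithm produced the candidate
            "status": "unvalidated",  # to be changed after manual verification
        })
    return records

sample = "Cats and other pets need care. We sell spoons and other cutlery."
for record in extract_candidates(sample):
    print(record["hyponym"], "->", record["hyperonym"], "|", record["status"])
```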

Page 11

Encoding cycle

1. Collecting data
   - Vocabulary: what is the list of words of a language?
   - Concepts: what is the list of concepts related to the vocabulary?
2. Encoding data:
   - Defining synsets
   - Defining language internal relations: hyponymy, meronymy, roles, causal relations
   - Defining equivalence relations to English
   - Defining other relations, e.g. Ontology types, Domains
3. Validation
4. Go to 1.

Page 12

Where to start?

- How to get a first selection:
  - Words (alphabetic, frequency) -> concepts -> relations
  - Concept (hyperonym, domain, semantic feature) -> words -> concepts -> relations
- How to get a complete overview of words and expressions that belong to a segment of a wordnet?
  - Up to 20 hyperonyms for instrumentality: instrument, instrumentality, means, tool, device, machine, apparatus, ....
- Iterative process: collect, structure, collect, restructure...
  - using multiple sources of evidence
  - comparing results, e.g. a tricycle is a toy or a vehicle

Page 13

Synonymy as a basis?

- Synsets are the core unit of a wordnet database
- Synonymy is only vaguely defined: substitution in a context.
- Synonyms are very hard to detect
- Other relations (role relations, causal relations):
  - easier to detect and encode
  - easier to validate within a formal framework
  - easier to validate in a corpus
- A rich set of relations per concept helps alignment with other resources

Page 14

Diagnostic frames and examples

Agent Involvement

(A/an) X is the one/that who/which does the Y, typically intentionally.

Conditions:
- X is a noun
- Y is a verb in the gerundive form

Example: A teacher is the one who does the teaching intentionally

Effect: {to teach} (Y) INVOLVED_AGENT {teacher} (X)

Patient Involvement

(A/an) X is the one/that who/which undergoes the Y

Conditions:
- X is a noun
- Y is a verb in the gerundive form

Example: A learner is the one who undergoes the learning

Effect: {to learn} (Y) INVOLVED_PATIENT {learner} (X)
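
The two frames above can be turned into test sentences mechanically, so that an encoder or informant only has to judge whether the sentence sounds right. This is a minimal sketch; the template wording is simplified from the frames, and the names `FRAMES`, `frame_sentence` and `encode` are hypothetical.

```python
# Simplified sentence templates for the two role relations (assumed wording).
FRAMES = {
    "INVOLVED_AGENT": "{article} {x} is the one who does the {y}, typically intentionally.",
    "INVOLVED_PATIENT": "{article} {x} is the one who undergoes the {y}.",
}

def frame_sentence(relation, noun, gerund):
    """Fill the diagnostic frame for a noun (X) and a verb in gerundive form (Y)."""
    article = "An" if noun[0].lower() in "aeiou" else "A"
    return FRAMES[relation].format(article=article, x=noun, y=gerund)

def encode(relation, verb, noun):
    """Write the relation in the notation used for the Effect lines."""
    return f"{{to {verb}}} {relation} {{{noun}}}"

print(frame_sentence("INVOLVED_AGENT", "teacher", "teaching"))
print(encode("INVOLVED_AGENT", "teach", "teacher"))
```

The relation is only encoded once the generated sentence has been accepted as natural.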

Page 15

Diagnostic frames and examples

Result Involvement

(A/an) X comes into existence as a result of Y, where X is a noun and Y is a verb in the gerundive form and a hyponym of “make”, “produce”, “generate”.

Example:
- A crystal comes into existence as a result of crystalizing
- A crystal is the result of crystalizing
- A crystal is created by crystalizing

Effect: {to crystalize} (Y) INVOLVED_RESULT {crystal} (X)

Comments:
- Special kind of patient relation. The entity is not just changed or affected but it comes into existence as a result of the event.
- Only applies to concrete entities (1stOrder) or mental objects such as ideas (3rdOrder).
- Situations that result from other situations are related by the CAUSE relation.

Page 16

Hyponymy overloading (Guarino 1998, Vossen and Bloksma 1998)

The vocabulary does not clearly differentiate between orthogonal roles and disjoint types:
- role: passenger, teacher, student
- type: dog; cat
- ?: knife -> weapon, cutlery; spoon -> container, cutlery
  food material <- building material <-?- stone; <-?- water; <- brick;

Disjunctive and conjunctive hyperonyms:
- albino -> animal or plant
- spoon -> cutlery & container

Page 17

Hyponymy restructuring

[Figure: hyponymy hierarchy under ziekte (disease)]

Nodes shown:
- ziekte (disease)
- dierenziekte (animal disease)
- infectieziekte (infectious disease)
- ingewandsziekte (bowel disease)
- veeziekte (cattle disease)
- kolder (staggers: brain disease of cattle)
- vuilbroed (infectious disease of bees)
- haringwormziekte (anisakiasis: bowel disease of herrings)

Page 18

Methodologies in a number of projects

- Princeton Wordnet
- EuroWordNet:
  - English, Dutch, German, French, Spanish, Italian, Czech, Estonian
  - 10,000 up to 50,000 synsets
- BalkaNet:
  - Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian
  - 10,000 synsets

Page 19

Main strategies for building wordnets

Expand approach: translate WordNet synsets to another language and take over the structure
- easier and more efficient method
- compatible structure with WordNet
- vocabulary and structure are close to WordNet but also biased
- can exploit many resources linked to Wordnet: SUMO, Wordnet domains, selection restriction from BNC, etc...

Merge approach: create an independent wordnet in another language and align it with WordNet by generating the appropriate translations
- more complex and labor intensive
- different structure from WordNet
- language specific patterns can be maintained, i.e. very precise substitution patterns
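
A minimal sketch of the expand approach, assuming a tiny English fragment and a toy English-Dutch lexicon. The synset identifiers, the lexicon entries and the function name `expand` are made up for illustration; real synset IDs and a real bilingual dictionary would take their place.

```python
# Hypothetical English fragment: synset id -> member words.
ENGLISH_SYNSETS = {
    "instrument-n": ["instrument"],
    "musical_instrument-n": ["musical instrument"],
    "organ-n": ["organ"],  # intended sense: the musical instrument
}
# Hyponym synset -> hypernym synset.
ENGLISH_HYPERNYM = {
    "musical_instrument-n": "instrument-n",
    "organ-n": "musical_instrument-n",
}
# Toy bilingual lexicon (English -> Dutch); it knows nothing about senses.
EN_NL = {
    "instrument": ["instrument"],
    "musical instrument": ["muziekinstrument"],
    "organ": ["orgel", "orgaan"],
}

def expand(synsets, hypernym, lexicon):
    """Translate each synset and take over the source hypernym structure."""
    target = {
        sid: sorted({t for word in words for t in lexicon.get(word, [])})
        for sid, words in synsets.items()
    }
    # Keep a relation only when both ends received at least one translation.
    relations = {c: p for c, p in hypernym.items() if target.get(c) and target.get(p)}
    return target, relations

dutch_synsets, dutch_relations = expand(ENGLISH_SYNSETS, ENGLISH_HYPERNYM, EN_NL)
print(dutch_synsets["organ-n"], dutch_relations)
```

The translated organ synset picks up both orgel (the instrument) and orgaan (the body part), which is the kind of candidate that still needs manual verification.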

Page 20

Aligning wordnets

[Figure: a Dutch wordnet fragment aligned with an English wordnet fragment; question marks mark uncertain links]

Dutch wordnet: muziekinstrument; orgel; hammond orgel; orgaan

English wordnet: object; natural object; artifact object; instrument; musical instrument; organ (three senses shown); hammond organ

Page 21

General criteria for approach:
- Maximize the overlap with wordnets for other languages
- Maximize semantic consistency within and across wordnets
- Maximally focus the manual effort where needed
- Maximally exploit automatic techniques

Page 22

Top-down methodology

- Develop a core wordnet (5,000 synsets):
  - all the semantic building blocks or foundation to define the relations for all other more specific synsets, e.g. building -> house, church, school
  - provide a formal and explicit semantics
- Validate the core wordnet:
  - does it include the most frequent words?
  - are semantic constraints violated?
- Extend the core wordnet (5,000 synsets or more):
  - automatic techniques for more specific concepts with high-confidence results
  - add other levels of hyponymy
  - add specific domains
  - add ‘easy’ derivational words
  - add ‘easy’ translation equivalence
- Validate the complete wordnet

Page 23

Developing a core wordnet

- Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
  - high position in the hierarchy & high connectivity
  - represented as English WordNet synsets
  - Common base concepts: shared by various wordnets in different languages
  - Local base concepts: not shared
- EuroWordNet: 1024 synsets, shared by 2 or more languages
- BalkaNet: 5000 synsets (including 1024)
- Common semantic framework for all Base Concepts, in the form of a Top-Ontology
- Manually translate all Base Concepts (English Wordnet synsets) to synsets in the local languages (was applied for 13 Wordnets)
- Manually build and verify the hypernym relations for the Base Concepts
- All 13 Wordnets are developed from a similar semantic core closely related to the English Wordnet
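
The two selection criteria named above, a high position in the hierarchy and high connectivity, can be sketched as a simple filter. The toy hierarchy, the thresholds and the function names are assumptions for illustration; connectivity is approximated here by the number of direct hyponyms only, and this is not the actual EuroWordNet selection procedure.

```python
# Toy hierarchy: each concept points to its direct hyperonym (illustrative data only).
HYPERONYM = {
    "artifact": "object", "instrument": "artifact", "building": "artifact",
    "house": "building", "church": "building", "school": "building",
    "violin": "instrument",
}

def depth(concept):
    """Number of steps from the concept up to the top of the hierarchy."""
    steps = 0
    while concept in HYPERONYM:
        concept = HYPERONYM[concept]
        steps += 1
    return steps

def connectivity(concept):
    """Number of direct hyponyms (other relation types are ignored here)."""
    return sum(1 for parent in HYPERONYM.values() if parent == concept)

def base_concept_candidates(max_depth, min_connectivity):
    """Concepts that are both high in the hierarchy and well connected."""
    concepts = set(HYPERONYM) | set(HYPERONYM.values())
    return sorted(
        c for c in concepts
        if depth(c) <= max_depth and connectivity(c) >= min_connectivity
    )

print(base_concept_candidates(max_depth=2, min_connectivity=2))
```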

Page 24

Top-down methodology

[Figure: two local wordnets connected through the Inter-Lingual-Index. Labels: Top-Ontology (63 TCs); 1024 CBCs; Remaining WordNet1.5 Synsets; and, for each local wordnet: Hyperonyms, CBC Representatives, Local BCs, WMs related via non-hyponymy, First Level Hyponyms, Remaining Hyponyms]

Page 25

Top-down methodology

[Figure: building an Arabic Wordnet from the English Wordnet. Labels: SUMO Ontology; WordNet Domains (Domain “chemics”); Domain Named Entities; English-Arabic Lexicon (teach - darrasa); Arabic word frequency; Arabic roots & derivation rules; ABC, CBC, SBC; EuroWordNet BalkaNet Base Concepts; Hypernyms; Next Level Hyponyms; More Hyponyms; Easy Translations; Named Entities; 1000 Synsets; 5000 Synsets; Core wordnet = 5000 synsets; WordNet Synsets 1045678-v {teach} and 1045678-v {darrasa}]

Page 26

Advantages of the approach

- Well-defined semantics that can be inherited down to more specific concepts
  - Apply consistency checks
  - Automatic techniques can use semantic basis
- Most frequent concepts and words are covered
- High overlap and compatibility with other wordnets
- Manual effort is focussed on the most difficult concepts and words

Page 27

Distribution over the top ontology clusters

Top-Concept | WN TC-Tokens | WN % of wn | NL TC-Tokens | NL % of nl | NL % of wn | ES TC-Tokens | ES % of es | ES % of wn | IT TC-Tokens | IT % of it | IT % of wn
Animal | 14068 | 3.99% | 1193 | 0.97% | 8.5% | 2458 | 1.81% | 17.5% | 1122 | 1.44% | 8.0%
Artifact | 19562 | 5.55% | 10803 | 8.83% | 55.2% | 9969 | 7.36% | 51.0% | 6494 | 8.34% | 33.2%
Building | 1022 | 0.29% | 707 | 0.58% | 69.2% | 628 | 0.46% | 61.4% | 434 | 0.56% | 42.5%
Comestible | 3377 | 0.96% | 1393 | 1.14% | 41.2% | 1614 | 1.19% | 47.8% | 624 | 0.80% | 18.5%
Container | 1725 | 0.49% | 778 | 0.64% | 45.1% | 799 | 0.59% | 46.3% | 432 | 0.55% | 25.0%
Covering | 2030 | 0.58% | 1208 | 0.99% | 59.5% | 1027 | 0.76% | 50.6% | 690 | 0.89% | 34.0%
Creature | 664 | 0.19% | 159 | 0.13% | 23.9% | 254 | 0.19% | 38.3% | 27 | 0.03% | 4.1%
Function | 34081 | 9.68% | 17668 | 14.44% | 51.8% | 18904 | 13.96% | 55.5% | 11043 | 14.18% | 32.4%
Furniture | 298 | 0.08% | 171 | 0.14% | 57.4% | 147 | 0.11% | 49.3% | 87 | 0.11% | 29.2%
Garment | 756 | 0.21% | 494 | 0.40% | 65.3% | 426 | 0.31% | 56.3% | 292 | 0.37% | 38.6%
Gas | 93 | 0.03% | 67 | 0.05% | 72.0% | 62 | 0.05% | 66.7% | 49 | 0.06% | 52.7%
Group | 27805 | 7.90% | 3357 | 2.74% | 12.1% | 3630 | 2.68% | 13.1% | 2337 | 3.00% | 8.4%
Human | 11543 | 3.28% | 6372 | 5.21% | 55.2% | 7683 | 5.67% | 66.6% | 4488 | 5.76% | 38.9%
ImageRepresentation | 780 | 0.22% | 412 | 0.34% | 52.8% | 426 | 0.31% | 54.6% | 294 | 0.38% | 37.7%
Instrument | 7036 | 2.00% | 4102 | 3.35% | 58.3% | 3590 | 2.65% | 51.0% | 2564 | 3.29% | 36.4%
LanguageRepresent. | 2844 | 0.81% | 1273 | 1.04% | 44.8% | 1218 | 0.90% | 42.8% | 691 | 0.89% | 24.3%
Liquid | 1629 | 0.46% | 617 | 0.50% | 37.9% | 500 | 0.37% | 30.7% | 339 | 0.44% | 20.8%
Living | 47104 | 13.37% | 10225 | 8.36% | 21.7% | 13661 | 10.08% | 29.0% | 7408 | 9.51% | 15.7%
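
The "% of wn" figures for NL, ES and IT appear to be the TC-Tokens of that wordnet divided by the WN TC-Tokens for the same top concept. The sketch below recomputes two rows under that reading; the dictionary layout and the function name `pct_of_wn` are assumptions for illustration.

```python
# TC-Token counts copied from the Animal and Artifact rows of the table.
ROWS = {
    "Animal": {"WN": 14068, "NL": 1193, "ES": 2458, "IT": 1122},
    "Artifact": {"WN": 19562, "NL": 10803, "ES": 9969, "IT": 6494},
}

def pct_of_wn(top_concept, language):
    """Tokens of one wordnet as a percentage of the WN tokens for that concept."""
    row = ROWS[top_concept]
    return round(100 * row[language] / row["WN"], 1)

for concept in ROWS:
    print(concept, [pct_of_wn(concept, lang) for lang in ("NL", "ES", "IT")])
```

For Animal this gives 1193 / 14068 = 8.5% for NL, which matches the value in the table.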

Page 28

Wordnet Domains | Concepts | Proportion | Wordnet Domains | Concepts | Proportion
acoustics | 104 | 0.092% | linguistics | 1545 | 1.363%
administration | 2974 | 2.624% | literature | 686 | 0.605%
aeronautic | 154 | 0.136% | mathematics | 575 | 0.507%
agriculture | 306 | 0.270% | mechanics | 532 | 0.469%
alimentation | 28 | 0.025% | medicine | 2690 | 2.374%
anatomy | 2705 | 2.387% | merchant_navy | 485 | 0.428%
anthropology | 896 | 0.791% | meteorology | 231 | 0.204%
applied_science | 28 | 0.025% | metrology | 1409 | 1.243%
archaeology | 68 | 0.060% | military | 1490 | 1.315%
archery | 5 | 0.004% | money | 624 | 0.551%
architecture | 255 | 0.225% | mountaineering | 28 | 0.025%
art | 420 | 0.371% | music | 985 | 0.869%
artisanship | 148 | 0.131% | mythology | 314 | 0.277%
astrology | 17 | 0.015% | number | 220 | 0.194%
astronautics | 29 | 0.026% | numismatics | 43 | 0.038%
astronomy | 376 | 0.332% | occultism | 52 | 0.046%
athletics | 22 | 0.019% | oceanography | 10 | 0.009%

Page 29

EWN Interlingual Relations

• EQ_SYNONYM: there is a direct match between a synset and an ILI-record

• EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously.

• HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record.

• HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records.

• other relations: CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE, EQ_IS_STATE_OF/EQ_BE_IN_STATE
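
The first four relations can be read as a decision rule over the ILI-records found for a synset. The sketch below is one possible reading, not the EuroWordNet encoding procedure; the function name `choose_relation` and its three candidate lists are assumptions, and the examples are taken from elsewhere in this deck.

```python
def choose_relation(matches, broader, narrower):
    """Pick an equivalence relation from candidate ILI-records.

    matches:  ILI-records that match the synset directly
    broader:  ILI-records that are more general than the synset
    narrower: ILI-records that are more specific than the synset
    """
    if len(matches) == 1:
        return "EQ_SYNONYM"
    if len(matches) > 1:
        return "EQ_NEAR_SYNONYM"
    if broader:
        return "HAS_EQ_HYPERONYM"
    if narrower:
        return "HAS_EQ_HYPONYM"
    return None  # nothing to link to yet

# schoonmaken matches several senses of clean at once
print(choose_relation(["clean 1", "clean 2", "clean 3", "clean 4"], [], []))
# citroenjenever is more specific than anything available: link to gin
print(choose_relation([], ["gin"], []))
# dedo only has the more specific finger and toe
print(choose_relation([], [], ["finger", "toe"]))
```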

Page 30

Multilinguality

Page 31

Complex equivalence relations

eq_near_synonym

1. Multiple Targets

One sense for Dutch schoonmaken (to clean) which simultaneously matches with at least 4 senses of clean in WordNet1.5:

• {make clean by removing dirt, filth, or unwanted substances from}
• {remove unwanted substances from, such as feathers or pits, as of chickens or fruit}
• {remove in making clean; "Clean the spots off the rug"}
• {remove unwanted substances from - (as in chemistry)}

The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean.

2. Multiple Source meanings

Synsets inter-linked by a near_synonym relation can be linked to the same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation:

Dutch wordnet: toestel near_synonym apparaat
ILI-records: {machine}; {device}; {apparatus}; {tool}

Page 32

Complex equivalence relations

has_eq_hyperonym

Typically used for gaps in WordNet1.5 or in English:

• genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin,
• pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: Dutch hoofd only refers to human head and Dutch kop only refers to animal head, English uses head for both.

has_eq_hyponym

Used when WordNet1.5 only provides narrower terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g.: Spanish dedo can be used to refer to both finger and toe.

Page 33

Overview of equivalence relations to the ILI

Relation | POS | Sources : Targets | Example
eq_synonym | same | 1:1 | auto : voiture car
eq_near_synonym | any | many : many | apparaat, machine, toestel: apparatus, machine, device
eq_hyperonym | same | many : 1 (usually) | citroenjenever: gin
eq_hyponym | same | (usually) 1 : many | dedo : toe, finger
eq_metonymy | same | many/1 : 1 | universiteit, universiteitsgebouw: university
eq_diathesis | same | many/1 : 1 | raken (cause), raken: hit
eq_generalization | same | many/1 : 1 | schoonmaken : clean

Page 34

Filling gaps in the ILI

Types of GAPS

1. genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin,
   • Non-productive
   • Non-compositional
2. pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: container, borrower, cajera (female cashier)
   • Productive
   • Compositional
3. Universality of gaps: Concepts occurring in at least 2 languages
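
The universality criterion in item 3 amounts to counting in how many languages a gap concept is lexicalized. A minimal sketch, assuming gaps are recorded per language as short English glosses; the glosses, the data and the function name `universal_gaps` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical gap concepts per language, written as English glosses.
GAPS_BY_LANGUAGE = {
    "NL": {"female cashier", "beat to death", "kick to death"},
    "ES": {"female cashier", "young fish"},
    "DE": {"beat to death"},
}

def universal_gaps(gaps_by_language, min_languages=2):
    """Return the gap concepts found in at least min_languages languages."""
    languages_per_concept = defaultdict(set)
    for language, concepts in gaps_by_language.items():
        for concept in concepts:
            languages_per_concept[concept].add(language)
    return sorted(
        concept
        for concept, languages in languages_per_concept.items()
        if len(languages) >= min_languages
    )

print(universal_gaps(GAPS_BY_LANGUAGE))
```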

Page 35

Productive and Predictable Lexicalizations exhaustively linked to the ILI

[Figure: local synsets linked to ILI records through hypernym and in_state relations]

ILI records: beat, stamp, kick, kill, cashier, female, young, fish

Local synsets: {doodslaan V} NL, {doodschoppen V} NL, {doodstampen V} NL, {totschlagen V} DE, {tottrampeln V} DE, {cajera N} ES, {casière} NL, {alevín N} ES

Page 36

[Figure: the top-down methodology diagram for the English and Arabic wordnets, repeated from Page 25]

Page 37

[Figure: the hyponymy hierarchy under ziekte (disease), with the same node labels as on Page 17]

Page 38

[Figure: the hyponymy hierarchy under ziekte (disease), with the same node labels as on Page 17]

Page 39

Resources

- Monolingual dictionaries:
  - definitions
  - synonym relations
  - other relations
- Bi-lingual dictionaries: L-English, English-L
- Ontologies
- Thesauri
- Corpora:
  - monolingual
  - parallel

