WordNet for MT. Christiane Fellbaum, Dept. of Computer Science, fellbaum@princeton.edu


Page 1

WordNet for MT

Christiane Fellbaum

Dept. of Computer Science

fellbaum@princeton.edu

Page 2

The Challenge

Globalization requires more texts and speech to be translated faster across more languages

Page 3

The half-empty glass

• Manual translation is difficult, expensive, time-consuming

• Machine translation is of low quality, often unacceptable

Page 4

The half-full glass

• Human-aided machine translation can work for restricted domains (science, instruction manuals, etc.), but not for literature or poetry

• Restricted domains use limited vocabulary (terminology)

• Key words are less polysemous (often monosemous)

• “Prefabricated” phrases (“let x be y”) don’t need to be translated anew each time

Page 5

Hardest part of translation(?):

lexical disambiguation

Page 6

Focus on two challenges

• identify the intended sense of a polysemous word in the source

• find the context-appropriate word in the target language

Page 7

A local resource: WordNet

Page 8

What is WordNet?

• A large lexical database, or “electronic dictionary,” developed and maintained at Princeton

http://wordnet.princeton.edu

• Includes most English nouns, verbs, adjectives, and adverbs

• Can be used by humans and machines

• Princeton WordNet is for English only, but it is linked to wordnets in many other languages

Page 9

What’s special about WordNet?

• Traditional paper dictionaries are organized alphabetically: words that are found together (on the same page) are not related by meaning

• WordNet is organized by meaning: words in close proximity are semantically similar

• Human users and computers can browse WordNet and find words that are meaningfully related to their queries (somewhat like in a hyperdimensional thesaurus)

Page 10

What’s special about WordNet?

WordNet gives information about two fundamental, universal properties of human language:

polysemy and synonymy

Polysemy = one:many mapping of form and meaning

Synonymy = one:many mapping of meaning and form

Page 11

Polysemy

One word form expresses multiple meanings

{table, tabular_array}
{table, piece_of_furniture}
{table, mesa}
{table, postpone}

Note: the most frequent word forms are the most polysemous!
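For illustration only: a word's sense inventory can be queried programmatically. A minimal sketch using NLTK's interface to the Princeton WordNet; the word "table" and the print format are just examples.

# List the WordNet senses of "table"; each synset returned is one sense.
# Setup (once): pip install nltk; then in Python: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

senses = wn.synsets('table')
for synset in senses:
    print(synset.name(), '-', synset.definition())
print(f'"table" has {len(senses)} senses in WordNet')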

Page 12

Synonymy

One concept is expressed by several different word forms:

{beat, hit, strike}

{car, motorcar, auto, automobile}

Page 13

Polysemy and synonymy

Understanding and generating language (as for translation) means matching a word form with the intended, context-appropriate meaning

People (fluent speakers of a language) do this very efficiently

Page 14

Synonymy in WordNet

WordNet groups (roughly) synonymous, denotationally equivalent words into unordered sets of synonyms ("synsets")

{hit, beat, strike}
{big, large}
{queue, line}

By definition, each synset expresses a distinct meaning/concept

Each word form-meaning pair is unique

Page 15

Polysemy in WordNet

A word form that appears in n synsets is n-fold polysemous

{table, tabular_array}
{table, piece_of_furniture}
{table, mesa}
{table, postpone}

table is fourfold polysemous/has four senses
Four distinct concepts are associated with the word form table

Page 16

Some WordNet stats

Part of speech    Word forms    Synsets
noun              117,798       82,115
verb              11,529        13,767
adjective         21,479        18,156
adverb            4,481         3,621
total             155,287       117,659

Page 17

The “Net” part of WordNet

Synsets are interconnected

Bi-directional arcs express semantic relations

Result: large semantic network (graph)

Page 18

Hypo-/hypernymy relates noun synsets

Relates more/less general concepts
Creates hierarchies, or "trees":

              {vehicle}
             /         \
 {car, automobile}   {bicycle, bike}
     /       \               \
{convertible} {SUV}    {mountain bike}

"A car is a kind of vehicle" <=> "The class of vehicles includes cars, bikes"

Hierarchies can have up to 16 levels

Page 19

Hyponymy

Transitivity:

A car is a kind of vehicle

An SUV is a kind of car

=> An SUV is a kind of vehicle
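Because hyponymy is transitive, the full set of superordinates of a synset can be read off WordNet by following hypernym links upward. A minimal sketch with NLTK; the synset identifiers (car.n.01, vehicle.n.01) assume WordNet 3.0's sense numbering.

# Transitive closure over the hypernym (IS-A) relation, starting from car.n.01.
from nltk.corpus import wordnet as wn

car = wn.synset('car.n.01')
ancestors = list(car.closure(lambda s: s.hypernyms()))
for synset in ancestors:
    print(synset.name())

# vehicle.n.01 should appear among the ancestors even though it is not a direct hypernym of car.n.01
print(wn.synset('vehicle.n.01') in ancestors)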

Page 20

Meronymy/holonymy (part-whole relation)

   {car, automobile}
          |
      {engine}
      /       \
{spark plug}  {cylinder}

"An engine has spark plugs"
"Spark plugs and cylinders are parts of an engine"

Page 21

Meronymy/Holonymy

Inheritance:

A finger is part of a hand
A hand is part of an arm
An arm is part of a body
=> A finger is part of a body

Page 22

Structure of WordNet (Nouns)

{conveyance; transport}
  hyperonym of:
    {vehicle}
      hyperonym of:
        {motor vehicle; automotive vehicle}
          hyperonym of:
            {car; auto; automobile; machine; motorcar}
              hyperonym of:
                {cruiser; squad car; patrol car; police car; prowl car}
                {cab; taxi; hack; taxicab}
              meronyms:
                {bumper}
                {car door}
                  meronyms: {hinge; flexible joint}, {doorlock}, {armrest}
                {car window}
                {car mirror}

Page 23

WordNet Data Model

[Diagram. The data model has three layers: the vocabulary of a language (word forms), concepts (numbered records with glosses), and relations between concepts.

Word forms and the concept records they point to (sense numbers omitted):
  bank -> rec 12345 (financial institute), rec 54321 (side of a river)
  fiddle, violin -> rec 9876 (small string instrument)
  fiddler, violist -> rec 65438 (musician playing violin)
  string -> rec 35576 (string of instrument), rec 29551 (subatomic particle)

Relations between records:
  rec 9876 (small string instrument) type-of rec 25876 (string instrument)
  rec 65438 (musician playing violin) type-of rec 42654 (musician)
  rec 35576 (string of instrument) part-of rec 9876 (small string instrument)]

Page 24

A bit of history (or: Did we just make this stuff up?)

1980s Research in Artificial Intelligence (AI):

How do humans store and access knowledge about concepts?

Knowledge about concepts is huge--it must be stored in an efficient and economical fashion

Hypothesis: concepts are interconnected via meaningful relations

Page 25

A bit of history

One hypothesis:

Knowledge about concepts is computed “on the fly” via access to general concepts

E.g., we know that “canaries fly” because

“birds fly” and “canaries are a kind of bird”

Page 26

A bit of history

animal (alive, breathes, moves, ...)
   |
bird (has feathers, lays eggs, ...)
   |
canary (yellow, sings, ...)

Page 27

A bit of history

Knowledge is stored at the highest possible node (animals move, birds fly, canaries sing)

Collins & Quillian (1969) measured reaction times to statements involving knowledge distributed across different “levels”

Page 28

A bit of history

People confirmed statements like

(1) bird lays eggs

faster than

(2) canary lays eggs

Hypothesis: (2) requires “look-up” at higher level, (1) doesn’t

Page 29

A bit of history

Collins & Quillian's results are not compelling. Reaction times to statements like "do canaries move?" are influenced by:
--prototypicality (robins are more typical birds than emus)
--word frequency (robin occurs more often than emu, and people recognize frequent words faster)
--uneven semantic distance across levels

But the idea inspired WordNet (1986), which asked: Can most/all of the lexicon be represented as a semantic network? Or are there unconnectable words and concepts?

Page 30

Adjective relations

Strong association between members of antonymous adjective pairs:

hot-cold, old-new, high-low, big-small,...

WordNet connects members of such pairs ("direct antonyms") as well as similar but less salient adjectives (e.g., cool, lukewarm, ...)

Page 31

Experimental Evidence

Reaction time measurements for semantic judgments (it takes less time to confirm that hot and cold are opposites than that hot and chilly are)

Weakened by: frequency, prototypicality effects

Page 32

Relations among verbs

Manner relation connects verbs like

move-walk-run-jog

communicate-talk-whisper

Relations reflecting temporal or logical order:

divorce-marry, snore-sleep, buy-pay

Manner relation builds “trees”

Page 33

WN as a lexical resource

“Have concept, need words”

--depart from synset, travel in WordNet space

“Have word, need concept”

--query word form, find associated synsets

Page 34

Is WordNet a Thesaurus?

• Yes:
--it groups together meaningfully related words

• No:
--it labels the relations
--the relations are limited
--related words are linked to specific concepts (disambiguated); a thesaurus is a "bag of words"
--many words linked in WordNet do not co-occur in the same thesaurus entry
--WordNet allows one to measure and quantify the semantic similarity or distance among words and concepts

Page 35

Web interface for WN search

• Noun
  S: (n) bicycle, bike, wheel, cycle (a wheeled vehicle that has two wheels and is moved by foot pedals)
    o direct hyponym / full hyponym
      S: (n) bicycle-built-for-two, tandem bicycle, tandem (a bicycle with two sets of pedals and two seats)
      S: (n) mountain bike, all-terrain bike, off-roader (a bicycle with a sturdy frame and fat tires; originally designed for riding in mountainous country)
      S: (n) ordinary, ordinary bicycle (an early bicycle with a very large front wheel and small back wheel)
      S: (n) push-bike (a bicycle that must be pedaled)
      S: (n) safety bicycle, safety bike (bicycle that has two wheels of equal size; pedals are connected to the rear wheel by a multiplying gear)
      S: (n) velocipede (any of several early bicycles with pedals on the front wheel)
    o part meronym
      S: (n) bicycle seat, saddle (a seat for the rider of a bicycle)
      S: (n) bicycle wheel (the wheel of a bicycle)
      S: (n) chain (a series of (usually metal) rings or links fitted into one another to make a flexible ligament)
      S: (n) coaster brake (a brake on a bicycle that engages with reverse pressure on the pedals)
      S: (n) handlebar (the shaped bar used to steer a bicycle)
      S: (n) kickstand (a swiveling metal rod attached to a bicycle or motorcycle or other two-wheeled vehicle; the rod lies horizontally when not in use but can be kicked into a vertical position as a support to hold the vehicle upright when it is not being ridden)

• Verb
  S: (v) bicycle, cycle, bike, pedal, wheel (ride a bicycle)

Page 36

WordNet as a lexical resource

• WN has been incorporated into many dictionaries

• Google “define” usually brings up WN entry at the top of the list

• User-created visual interfaces (e.g., visualthesaurus.com)

Page 37
Page 38

Back to the problem

• How does a computer find the right word associated with a given concept, or

• the right concept associated with a given word?

(This is a crucial step in information retrieval, text mining, document sorting, machine translation...)

Page 39

• Example:

John needed cash so he walked over to the bank

Which bank?

Money institution? (building/institution?)

Sloping land by the water?

Page 40

Word Sense Discrimination

Less difficult: homonymy (unrelated senses of a word form): bank (river bank/financial institution), bat (mammal/racquet), club (social organization/stick)

Systems perform very well/near-perfect
E.g., can rely on the "one sense per discourse" rule (Yarowsky)

Very difficult: polysemy (related senses of a word form): bank (institution vs. building)

Systems perform much worse
Need local context

Page 41

Approaches to WSD

• Knowledge-based (using resources like WordNet)

• Statistical (corpus-based clustering of senses; many use WordNet in addition)

• Supervised (train on manually disambiguated corpus)

• Unsupervised (“discover” senses)

Page 42

Knowledge-Based Approaches

Determine which sense in a lexical resource (like WordNet) a given token (occurrence) of a word represents

“Dumb” method: assume that the most frequently occurring sense of a polysemous word is the context-appropriate one

Works amazingly well: 65-70% correct for nouns

Frequency is relative to a human-annotated “gold standard” corpus
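The most-frequent-sense baseline is easy to reproduce with NLTK's WordNet interface, where synsets come back in WordNet sense-number order (which roughly reflects tagged frequency in the gold-standard corpus). A minimal sketch; the example word and the helper name are illustrative only.

# Most-frequent-sense (MFS) baseline using NLTK's WordNet interface.
# Setup (once): nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def mfs_baseline(word, pos=wn.NOUN):
    # NLTK lists synsets in WordNet sense-number order;
    # sense 1 is (approximately) the most frequently tagged sense.
    senses = wn.synsets(word, pos=pos)
    return senses[0] if senses else None

sense = mfs_baseline('bank')
if sense is not None:
    print(sense.name(), '-', sense.definition())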

Page 43

Less “dumb” approaches

Page 44

Basic Assumptions

• Natural language context is coherent (*Colorless green ideas sleep furiously)

• Words co-occurring in context are semantically related to one another

• Given word w1 with sense 1, determine which sense 1, 2, 3,…of word w2 is the context-appropriate one

• WN allows one to determine and measure meaning similarity among co-occurring words

Page 45

WordNet-Based Method

• Look for words in the vicinity (context) of the target word

• Find that sense of the target word that is related to the context words in WordNet (shared superordinate, parts, definition, etc.)

• Shortest path among candidate words often shows the intended sense

Bank/money institution and cash are linked (they share words in their definitions), so if cash and bank co-occur in a context, bank likely has the "financial institution" sense
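One simple way to operationalize this over Princeton WordNet is to score each candidate sense of the target by its similarity to the senses of nearby context words and keep the best-scoring one. A minimal sketch using NLTK's path similarity; the scoring scheme and the example context are illustrative, not the specific algorithm of any one system.

# Pick the sense of a target word that is most similar (by WordNet path
# similarity) to the senses of the surrounding context words.
from nltk.corpus import wordnet as wn

def disambiguate(target, context_words, pos=wn.NOUN):
    best_sense, best_score = None, -1.0
    for sense in wn.synsets(target, pos=pos):
        score = 0.0
        for word in context_words:
            # best similarity over the context word's senses; None means "no path"
            sims = [sense.path_similarity(s) for s in wn.synsets(word, pos=pos)]
            sims = [s for s in sims if s is not None]
            if sims:
                score += max(sims)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

best = disambiguate('bank', ['cash', 'loan', 'money'])
print(best, '-', best.definition() if best else 'no sense found')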

Page 46

Deriving meaning similarity metrics from WN’s structure

Shortest path length between concepts in the noun hierarchies (Leacock & Chodorow)

Problem: edges are not of equal length in natural language (i.e., semantic distance is not uniform)

  possession            elephant
      |                     |
white elephant       Indian elephant
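For reference, the Leacock & Chodorow measure scales the shortest path by the overall depth of the taxonomy. In the usual formulation, with len(c1, c2) the length of the shortest path between the two concepts and D the maximum depth of the noun hierarchy:

sim_LCh(c1, c2) = -log( len(c1, c2) / (2 * D) )

Concepts connected by a path that is short relative to the depth of the hierarchy come out as more similar. (NLTK exposes this measure as Synset.lch_similarity.)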

Page 47

Corrections for differences in edge lengths

• Scale by depth of hierarchies of the synsets whose similarity is measured (path from target to its root)

• Density of sub-hierarchy: intuition that words in denser hierarchies (more sisters) are very closely related to one another

• Types of link (IS-A, HAS-A, other): too many "changes of direction" reduce the score

Page 48

Information-based similarity measure (Resnik)

Intuition: similarity of two concepts is the degree to which they share information in common

Can be determined by the concepts’ relative position wrt the lowest common superordinate

Define “class”: all concepts below the lowest common superordinate

Page 49

Information-based similarity measure (Resnik)

           medium of exchange
            /               \
         money               \
           |                  \
         cash                  \
           |                    \
         coin                 credit
         /   \                   |
    nickel   dime           credit card

Classes: coin, medium of exchange

Page 50

Information-based similarity measure (Resnik)

Information content of a concept c:

IC(c) = -log p(c)

p(c) is the probability of encountering a token of concept c in a corpus.
The class notion entails that an occurrence of any member of a class counts as an occurrence of all class members.
The probability of a class is the sum of the counts of all its members divided by the total number of words in the corpus.

p increases as you go up the hierarchy
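Spelled out (following Resnik's formulation; words(c) is the set of words subsumed by concept c, count(w) the corpus frequency of w, and N the total number of word tokens in the corpus):

p(c) = ( sum of count(w) for all w in words(c) ) / N

IC(c) = -log p(c)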

Page 51

Information-based similarity measure (Resnik)

Similarity of a pair of concepts is determined by the least probable (most informative) class they belong to

sim_R(c1, c2) = -log p(lso(c1, c2))

(lso = lowest common superordinate of c1 and c2)
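Resnik similarity is available directly in NLTK, which ships pre-computed information-content files for several corpora. A minimal sketch; the Brown-corpus IC file is one standard option, and the synset identifiers assume WordNet 3.0 naming.

# Resnik similarity over WordNet, using NLTK's pre-computed information-content counts.
# Setup (once): nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # IC estimated from the Brown corpus

coin = wn.synset('coin.n.01')
dime = wn.synset('dime.n.01')
credit_card = wn.synset('credit_card.n.01')

# Pairs whose lowest common superordinate is more informative (less probable) score higher.
print(dime.res_similarity(coin, brown_ic))
print(dime.res_similarity(credit_card, brown_ic))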

Page 52

“Lesk” method

overlap of words in definitions of synsets is a measure of similarity

(Lesk: Why is a pine cone not like an ice cream cone? Because there’s no lexical overlap in their definitions!)
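A bare-bones version of the Lesk idea (NLTK also ships its own nltk.wsd.lesk): score each candidate sense by the overlap between its gloss and the words of the context, and return the best one. A sketch only; the tokenization and the example sentence are deliberately crude.

# Simplified Lesk: choose the sense whose gloss shares the most words with the context.
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_sentence):
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk('cone', 'a pine cone fell from the tree'))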

Page 53

Knowledge-Lean Approaches

• Exploit distributional (contextual) properties of words in a corpus

• Context-based clustering (with/without WN support)

• Induce senses from clusters
• Use WN similarity to evaluate clusters

Page 54

Knowledge-Lean Approaches(McCarthy et al.)

• For each token of a target word, find words that are distributionally similar (based on corpus analysis)

• Nearest neighbors characterize the domain of the target

• Use WN relations and WN gloss overlap to measure similarity between target and its neighbors

• Sense of target that is most similar to words in the domain is predominant in that domain

• Target word (token) is assigned that WN sense

Page 55

Supervised systems

• Perform better than unsupervised ones
• WSD is a learning task
• Train classifiers on data annotated by humans ("gold standard")
• Each sense-tagged occurrence of a particular word is a feature vector used in learning
• Problem: hand-annotated data are sparse!

Page 56

Supervised learning with sparse data

• Start with a hand-annotated seed
• Determine contextual classifiers
• Augment contextual classifiers with WN similarity
• Reasoning: if baseball is a good discriminator for a sense of play, then football, hockey, etc. should also be good discriminators for that sense of play
• Use monosemous (single-sense) relatives only

Page 57

Similarity measures

Good reference:

Ted Pedersen: WordNet::Similarity

http://wn-similarity.sourceforge.net

Page 58

Back to MT

Now that we (think we) can discriminate word senses within one language, how do we find the corresponding senses in another language?

Page 59

WordNet(s) for Translation

Needed:

Wordnet(s) in the target language(s)

Page 60

Crosslinguistic WordNets

• Starting in the late 1990s, wordnets were built for languages other than English
• Genetically and typologically unrelated languages: Turkish, Hindi, Chinese, Korean, Basque, Zulu, Arabic, … (currently >60)

• Mapped to Princeton WordNet

www.globalwordnet.org

• Great potential for crosslinguistic applications

Page 61

Mapping words and synsets across multilingual WordNets

• The first set of foreign-language WNs ("EuroWordNet") was built with reference to Princeton WordNet
• Princeton WN serves as the hub ("interlingual index")
• Each synset in each WN was linked to a "record" (PWN synset identifier) in the index
• Crosslingual mapping of words and synsets proceeds via the index

Page 62

[Diagram: the EuroWordNet Inter-Lingual-Index. English synsets for vehicle, car, and train are linked to index records, which are in turn linked to top-level Domains (Transport: Road, Air, Water) and ontology concepts (Object, Device, TransportDevice), and to the corresponding words in the other wordnets:

Czech: dopravní prostředník, auto, vlak
French: véhicule, voiture, train
Estonian: liiklusvahend, auto, killavoor
German: Fahrzeug, Auto, Zug
Spanish: vehículo, auto, tren
Italian: veicolo, auto, treno
Dutch: voertuig, auto, trein]

Page 63

EWN Interlingual Index

The index is a flat list of synsets (relations were removed)

Relations are represented in each language-specific wordnet (incl. English WN, which "resurfaced" as one of the language-specific wordnets)

Page 64

Mismatches in multilingual WordNets

Concepts not lexicalized in English required new entries in the Interlingual Index (without an English synset):

--Arabic lexically distinguishes 12 kinds of cousin
--The index may refer to "son of father's brother", "daughter of mother's sister", etc.

An automatic system will likely choose the underspecified concept, "cousin." A human translator can decide to use "cousin" or a more specific paraphrase.

Page 65

Mismatches in multilingual WordNets

Conversely, some languages lack equivalents of English words:

--Dutch lacks a word for container but has words for kinds (hyponyms) of container (box, bag, bucket, ...)

The respective hierarchies reflect this difference:
Dutch: bag, box, ... are kinds of artifact
English: bag, box, ... are kinds of container, which is a kind of artifact

A translator may specify the kind of container

Page 66

English-Dutch snippet

English wordnet:
  object
    natural object
      body
    artifact, artefact (a man-made object)
      container
        bag
        box

Dutch wordnet:
  voorwerp {object}
    tas {bag}
    bak {box}
    lichaam {body}

Page 67

From EuroWordNet to Global WordNet

• Currently, wordnets exist for more than 60 languages, including:

• Arabic, Bantu, Basque, Chinese, Bulgarian, Estonian, Hebrew, Icelandic, Japanese, Kannada, Korean, Latvian, Nepali, Persian, Romanian, Sanskrit, Tamil, Thai, Turkish, Zulu...

• Many languages are genetically and typologically unrelated

• http://www.globalwordnet.org

Page 68

• The more languages, the more mismatches

• Not all languages have the same lexical categories (N, V, Adj)

Page 69

Problems with ILI model

The ILI model requires many index entries for concepts that are not lexicalized in the ILI language (English)

The ILI is English-centric and may bias the construction of other wordnets (esp. those built with the "translation" method)

Page 70

Better model

Replace the ILI, which is based on a natural language, with a formal, language-independent ontology

Concepts are represented by axioms in first-order logic

Page 71

Top-level ontologies have been worked out by philosophers (SUMO, DOLCE, ...)

They strictly categorize and distinguish entities (objects, endurants), events (perdurants), and properties

Page 72

Ontology

• Mapping of several competing ontologies to WordNet(s)

• Suggested Upper Merged Ontology (SUMO; Niles and Pease)

Page 73

SUMO

• Upper-level, formal ontology (abstract concepts)
• 1K terms, 4K axioms
• MILO (mid-level ontology): several K more terms
• SUMO+MILO: 20K terms, 70K axioms
• Axioms are written in the SUO-KIF language (Knowledge Interchange Format)
• All terms manually mapped to WN (and WNs in other languages)

Page 74

Axiom for “earlier” in SUMO

(<=>
  (earlier ?INTERVAL1 ?INTERVAL2)
  (before (EndFn ?INTERVAL1) (BeginFn ?INTERVAL2)))

“An interval that precedes another; the ending of the first interval is before the beginning of the second interval”

? = variable

Fn=function that takes a time interval and returns the time point at the end of the interval

Ontology also has axioms for “before,” etc.

Page 75

“Clean” ontologies refer to essential properties

distinguish among rigid and non-rigid entities

allow reasoning, inferencing:

if x is a y, and y is a z, then x is a z

Page 76

An ontology differs from a wordnet

Wordnets represent how we use language, e.g., the word "cat".

Ontologies represent what it is to be a cat, e.g., the meta-property "rigidity".

Page 77

Rigidity

"Cat" is a rigid concept. "Pet" is a non-rigid concept.

A concept is rigid if it is essential to all of its instances.

Permanence – Fluffy is always a cat, not always a pet.
Necessity – Fluffy cannot stop being a cat. Fluffy can stop being a pet.

Page 78

Reasoning with rigidity

Fluffy was a cat.
Fluffy was not a pet on Monday.
Fluffy was a pet on Tuesday.
#Fluffy was not a cat on Tuesday.

Pet-hood is sensitive to time and circumstances
Cat-hood is not
Need to distinguish rigid and non-rigid in the ontology!

Page 79

Making ontological distinctions

[Diagram, two ways of modeling Fluffy:
  (a) Fluffy (Instance) falls under Animal (Rigid) and under Pet (Non-Rigid)
  (b) Fluffy (Instance) falls under Cat (Rigid), a kind of Animal, and under Pet (Non-Rigid)]

Page 80

“Rudify”

An ontology annotation tool currently being developed

Page 81

“Rudify”

Rudify learns to classify words collected from the web by means of lexical patterns

Associate the words with the appropriate concept

Distinguish rigid from non-rigid concepts

Page 82
Page 83

Lexical patterns that distinguish rigid from non-rigid concepts

X (would be | make) (a good | a bad) Y
X is no longer a(n) Y

vs.

Xs (such as | like) Ys
Xs and other Ys
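A toy illustration of the pattern idea (not the actual Rudify implementation): count how often rigidity-suggesting vs. non-rigidity-suggesting patterns match a candidate word in a text collection, then classify by majority. The pattern set follows the slide; the corpus text and the decision rule are assumptions made for the sketch.

# Toy rigid/non-rigid classifier based on the lexical patterns above.
# Hypothetical sketch, not the Rudify tool itself.
import re

NON_RIGID_PATTERNS = [
    r'\b\w+ (would be|would make|makes?) (a good|a bad) {y}\b',
    r'\b\w+ is no longer an? {y}\b',
]
RIGID_PATTERNS = [
    r'\b{y}s? (such as|like) \w+',
    r'\b\w+s and other {y}s\b',
]

def classify_rigidity(concept, corpus_text):
    """Return 'rigid' or 'non-rigid' for `concept` based on pattern counts in `corpus_text`."""
    def count(patterns):
        return sum(len(re.findall(p.format(y=re.escape(concept)), corpus_text, re.IGNORECASE))
                   for p in patterns)
    return 'rigid' if count(RIGID_PATTERNS) >= count(NON_RIGID_PATTERNS) else 'non-rigid'

text = ("Fluffy would make a good pet. Rex is no longer a pet. "
        "Cats such as Fluffy purr. Lions and other cats live in Africa.")
print(classify_rigidity('pet', text))   # 'non-rigid' on this toy text
print(classify_rigidity('cat', text))   # 'rigid' on this toy text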

Page 84

Training Rudify

100 selected words/concepts

contexts from Google

50 rigid, 50 non-rigid

manually annotated as +/- rigid

Page 85

Testing Rudify

Two test suites:

--297 Base Concepts (identified by the Spanish wordnet team)

--287 terms referring to Regions and Species

(common regions and species, Latin species)

Page 86

How well does Rudify do?

Rudify’s classification is compared with that of OntoClean

Out of the 287 Regions and Species terms, Rudify misclassified only three; in each case a rigid concept was misclassified as non-rigid

Page 87

Error Analysis

(1) Misclassification due to lexical pattern:

Wolf was classified as non-rigid

“The dog is no longer a wolf but a separate species”

(2) Misclassification due to polysemy/metaphor

wildcat (animal, gun, mascot)

Apollo (butterfly, god, space mission)