The Global Wordnet Grid: anchoring languages to universal meaning. Piek Vossen, Irion Technologies/Free University of Amsterdam.

  • Published on
    27-Mar-2015


Transcript

  • Slide 1

The Global Wordnet Grid: anchoring languages to universal meaning. Piek Vossen, Irion Technologies/Free University of Amsterdam.
Slide 2 Overview: Wordnet, EuroWordNet background; Architecture of the Global Wordnet Grid; Mapping wordnets to the Grid; Advantages of shared knowledge structure; 7th Framework project KYOTO.
Slide 3 WordNet1.5: Semantic network in which concepts are defined in terms of relations to other concepts. Structure: organized around the notion of synsets (sets of synonymous words); basic semantic relations between these synsets. http://www.cogsci.princeton.edu/~wn/w3wn.html Developed at Princeton by George Miller and his team as a model of the mental lexicon.
Slide 4 Relational model of meaning: [Diagram: two relational networks over the words man, woman, boy, girl, cat, kitten, dog, puppy, animal; in the second network "girl" is replaced by Dutch "meisje".]
Slide 5 Structure of WordNet
Slide 6 Wordnet Data Model: [Diagram: the vocabulary of a language (bank, fiddle, violin, violist, fiddler, string) is linked by numbered senses (1, 2) to concepts, and concepts are linked to each other by relations (type-of, part-of). Concept records: rec: 12345 - financial institute; rec: 54321 - side of a river; rec: 9876 - small string instrument; rec: 65438 - musician playing violin; rec: 42654 - musician; rec: 25876 - string instrument; rec: 35576 - string of instrument; rec: 29551 - underwear.]
Slide 7 Usage of Wordnet: Improve recall of text-based analysis: Query -> Index. Synonyms: commence, begin. Hypernyms: taxi -> car. Hyponyms: car -> taxi. Meronyms: trunk -> elephant. Lexical entailments: gun -> shoot. Inferencing: what things can burn? Expression in language generation and translation: alternative words and paraphrases.
Slide 8 Improve recall: Information retrieval: small databases without redundancy, e.g. image captions, video text. Text classification: small training sets. Question & Answer systems: query analysis: who, whom, where, what, when.
Slide 9 Improve recall: Anaphora resolution: The girl fell off the table. She.... The glass fell off the table. It...
Coreference resolution: When he moved the furniture, the antique table got damaged. Information extraction (unstructured text to structured databases): generic forms or patterns "vehicle" -> text with specific cases "car".
Slide 10 Improve recall: Summarizers: sentence selection based on word counts -> concept counts; avoid repetition in summary -> language generation. Limited inferencing: detect locations, organisations, etc.
Slide 11 Many others: Data sparseness for machine learning: hapaxes can be replaced by semantic classes. Use redundancy for more robustness: spelling correction and speech recognition can build semantic expectations using Wordnet and make better choices. Sentiment and opinion mining. Natural language learning.
Slide 12
Slide 13 EuroWordNet: The development of a multilingual database with wordnets for several European languages. Funded by the European Commission, DG XIII, Luxembourg as projects LE2-4003 and LE4-8328. March 1996 - September 1999. 2.5 Million EURO. http://www.hum.uva.nl/~ewn http://www.illc.uva.nl/EuroWordNet/finalresults-ewn.html
Slide 14 EuroWordNet: Languages covered: EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian. EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian. Size of vocabulary: EuroWordNet-1: 30,000 concepts - 50,000 word meanings. EuroWordNet-2: 15,000 concepts - 25,000 word meanings.
Type of vocabulary: the most frequent words of the languages; all concepts needed to relate more specific concepts.
Slide 15 Wordnet family: [Diagram: an Inter-Lingual-Index of English-labelled concepts (Vehicle, Car, Train) is linked to a domain hierarchy (Transport: Road, Air, Water), to ontologies (DOLCE, SUMO: Object, Device, TransportDevice) and to the words of each language wordnet. English: vehicle, car, train. Czech: dopravní prostředek, auto, vlak. French: véhicule, voiture, train. Estonian: liiklusvahend, auto, killavoor. German: Fahrzeug, Auto, Zug. Spanish: vehículo, auto, tren. Italian: veicolo, auto, treno. Dutch: voertuig, auto, trein.] Princeton WordNet (Fellbaum 1998): 115,000 concepts. EuroWordNet (Vossen 1998): 8 languages. BalkaNet (Tufis 2004): 6 languages. Global Wordnet Association: all languages.
Slide 16 EuroWordNet: Wordnets are unique language-specific structures: different lexicalizations; differences in synonymy and homonymy; different relations between synsets; same organizational principles: synset structure and same set of semantic relations. Language-independent knowledge is assigned to the ILI and can thus be shared by all languages linked to the ILI: both an ontology and a domain hierarchy.
Slide 17 Autonomous & Language-Specific: [Diagram: two hierarchies side by side. Dutch Wordnet: voorwerp {object} directly above lepel {spoon}, werktuig {tool}, tas {bag}, bak {box}, blok {block}, lichaam {body}. WordNet1.5: object above natural object (an object occurring naturally) and artifact, artefact (a man-made object), with the further nodes instrumentality, block, body, container, device, implement, tool, instrument, bag, spoon, box.]
Slide 18 Artificial ontology: better control or performance, or a more compact and coherent structure. Introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool); neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise). What properties can we infer for spoons?
spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking. (Slide title: Linguistic versus Artificial Ontologies)
Slide 19 Linguistic versus Artificial Ontologies: Linguistic ontology: Exactly reflects the relations between all the lexicalized words and expressions in a language. Captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language. What words can be used to name spoons? spoon -> object, tableware, silverware, merchandise, cutlery.
Slide 20 Wordnets versus ontologies: Wordnets: autonomous language-specific lexicalization patterns in a relational network. Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense disambiguation. Ontologies: data structure with formally defined concepts. Usage: making semantic inferences.
Slide 21 The Multilingual Design: Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages; index records are mainly based on WordNet synsets and consist of synonyms, glosses and source references; various types of complex equivalence relations are distinguished; equivalence relations from synsets to index records: not on a word-to-word basis; indirect matching of synsets linked to the same index items.
Slide 22 Equivalent Near Synonym:
1. Multiple Targets (1:many): Dutch wordnet: schoonmaken (to clean) matches with 4 senses of clean in WordNet1.5: make clean by removing dirt, filth, or unwanted substances from; remove unwanted substances from, such as feathers or pits, as of chickens or fruit; remove in making clean, "Clean the spots off the rug"; remove unwanted substances from (as in chemistry).
2. Multiple Sources (many:1): Dutch wordnet: versiersel near_synonym versiering; ILI-record: decoration.
3.
Multiple Targets and Sources (many:many): Dutch wordnet: toestel near_synonym apparaat; ILI-records: machine; device; apparatus; tool.
Slide 23 Equivalent Hyperonymy: Typically used for gaps in English WordNet: genuine, cultural gaps for things not known in English culture: Dutch: klunen, to walk on skates over land from one frozen water to the other; pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English: Dutch: kunstproduct = artifact substance artifact object.
Slide 24 From EuroWordNet to Global WordNet: Currently, wordnets exist for more than 40 languages, including: Arabic, Bantu, Basque, Chinese, Bulgarian, Estonian, Hebrew, Icelandic, Japanese, Kannada, Korean, Latvian, Nepali, Persian, Romanian, Sanskrit, Tamil, Thai, Turkish, Zulu... Many languages are genetically and typologically unrelated. http://www.globalwordnet.org
Slide 25 Some downsides: Construction is not done uniformly. Coverage differs. Not all wordnets can communicate with one another. Proprietary rights restrict free access and usage. A lot of semantics is duplicated. Complex and obscure equivalence relations due to linguistic differences between English and other languages.
Slide 26 Next step: Global WordNet Grid: [Diagram: an Inter-Lingual Ontology (Object, Device, TransportDevice) is linked directly to the words of each language wordnet. English: vehicle, car, train. Czech: dopravní prostředek, auto, vlak. French: véhicule, voiture, train. Estonian: liiklusvahend, auto, killavoor. German: Fahrzeug, Auto, Zug. Spanish: vehículo, auto, tren. Italian: veicolo, auto, treno. Dutch: voertuig, auto, trein.]
Slide 27 GWNG: Main Features: Construct separate wordnets for each Grid language. Contributors from each language encode the same core set of concepts plus culture/language-specific ones. Synsets (concepts) can be mapped crosslinguistically via an ontology. No license constraints, freely available.
Slide 28 The Ontology: Main Features: Formal, artificial ontology serves as universal index of
concepts. List of concepts is not just based on the lexicon of a particular language (unlike in EuroWordNet) but uses ontological observations. Concepts are related in a type hierarchy. Concepts are defined with axioms.
Slide 29 The Ontology: Main Features: In addition to high-level (primitive) concepts, the ontology needs to express low-level concepts lexicalized in the Grid languages. Additional concepts can be defined with expressions in Knowledge Interchange Format (KIF), based on first-order predicate calculus and atomic elements.
Slide 30 The Ontology: Main Features: Minimal set of concepts (reductionist view): to express equivalence across languages; to support inferencing. Ontology must be powerful enough to encode all concepts that are lexically expressed in any of the Grid languages.
Slide 31 The Ontology: Main Features: Ontology need not and cannot provide a linguistic encoding for all concepts found in the Grid languages. Lexicalization in a language is not sufficient to warrant inclusion in the ontology. Lexicalization in all or many languages may be sufficient. Ontological observations will be used to define the concepts in the ontology.
Slide 32 Ontological observations: Identity criteria as used in OntoClean (Guarino & Welty 2002): rigidity: to what extent are properties true for entities in all worlds? You are always a human, but you can be a student for a short while. essence: what properties are essential for an entity? Shape is essential for a statue but not for the clay it is made of. unicity: what represents a whole and what entities are parts of these wholes? An ocean is a whole but the water it contains is not.
Slide 33 Type-role distinction: Current WordNet treatment: (1) a husky is a kind of dog (type); (2) a husky is a kind of working dog (role). What's wrong? (2) is defeasible, (1) is not: *This husky is not a dog. This husky is not a working dog. Other roles: watchdog, sheepdog, herding dog, lapdog, etc.
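The type-role distinction on Slide 33 can be illustrated with a minimal Python sketch. It is not taken from the slides: the names TYPE_PARENT, Entity, is_a and plays, and the individual "Rex", are hypothetical, and the tiny hierarchy only borrows the labels Canine, Husky and PoodleDog from the deck. The point is that a rigid type is fixed for an entity, while a role can be dropped without contradiction.

```python
from dataclasses import dataclass, field

# Hypothetical rigid type hierarchy: child -> parent (None marks a root).
TYPE_PARENT = {"Canine": None, "Husky": "Canine", "PoodleDog": "Canine"}


def is_subtype(t, ancestor):
    """Return True if type t equals ancestor or lies below it in the hierarchy."""
    while t is not None:
        if t == ancestor:
            return True
        t = TYPE_PARENT[t]
    return False


@dataclass
class Entity:
    name: str
    rigid_type: str                           # assumed fixed for the entity
    roles: set = field(default_factory=set)   # defeasible; may change over time

    def is_a(self, t):
        """Type membership follows only from the rigid type hierarchy."""
        return is_subtype(self.rigid_type, t)

    def plays(self, role):
        """Role membership is a contingent fact about the entity."""
        return role in self.roles


rex = Entity("Rex", "Husky", {"WorkingDog"})
rex.roles.discard("WorkingDog")  # "This husky is not a working dog" is acceptable
```

After the role is discarded, rex.is_a("Canine") still holds while rex.plays("WorkingDog") does not, which mirrors the contrast between statements (1) and (2).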
Slide 34 Ontology and lexicon: Hierarchy of disjunct types: Canine: PoodleDog; NewfoundlandDog; GermanShepherdDog; Husky. Lexicon: NAMES for TYPES: {poodle}EN, {poedel}NL, {pudoru}JP: (instance x Poodle). LABELS for ROLES: {watchdog}EN, {waakhond}NL, {banken}JP: ((instance x Canine) and (role x GuardingProcess)).
Slide 35 Ontology and lexicon: Hierarchy of disjunct types: River; Clay; etc. Lexicon: NAMES for TYPES: {river}EN, {rivier, stroom}NL: (instance x River). LABELS for dependent concepts: {rivierwater}NL (water from a river => water is not Unit): ((instance x water) and (instance y River) and (portion x y)). {kleibrok}NL (irregularly shaped piece of clay => Non-essential): ((instance x Object) and (instance y Clay) and (portion x y) and (shape x Irregular)).
Slide 36 Rigidity: The primitive concepts represented in the ontology are rigid types. Entities with non-rigid properties will be represented with KIF statements. But: ontology may include some universal, core concepts referring to roles like father, mother.
Slide 37 Properties of the Ontology: Minimal: terms are distinguished by essential properties only. Comprehensive: includes all distinct concept types of all Grid languages. Allows definitions via KIF of all lexemes that express non-rigid, non-essential properties of types. Logically valid, allows inferencing.
Slide 38 Mapping Grid Languages onto the Ontology: Explicit and precise equivalence relations among synsets in different languages, which is somewhat easier: type hierarchy is minimal; subtle differences can be encoded in KIF expressions. Grid database contains wordnets with synsets that label either primitive types in the hierarchies, or words relating to these types in ways made explicit in KIF expressions. If two languages create the same KIF expression, this is a statement of equivalence!
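The closing remark of Slide 38, that two languages producing the same KIF expression have stated an equivalence, can be sketched in Python. This is an illustration under stated assumptions, not the Grid's actual data format: a KIF conjunction is modelled as a frozenset of (predicate, argument, argument) tuples so that conjunct order is ignored, both languages are assumed to use the same variable names, and LEXICON, conj and equivalents are hypothetical names.

```python
def conj(*atoms):
    """Model a KIF-like conjunction as an order-insensitive set of atoms."""
    return frozenset(atoms)


GUARD_DOG = conj(("instance", "x", "Canine"), ("role", "x", "GuardingProcess"))

# Hypothetical Grid lexicon: (synset label, language) -> defining expression.
LEXICON = {
    ("poodle", "EN"): conj(("instance", "x", "Poodle")),
    ("poedel", "NL"): conj(("instance", "x", "Poodle")),
    ("watchdog", "EN"): GUARD_DOG,
    # Same conjuncts written in a different order by another contributor.
    ("waakhond", "NL"): conj(("role", "x", "GuardingProcess"),
                             ("instance", "x", "Canine")),
    ("banken", "JP"): GUARD_DOG,
}


def equivalents(label, lang):
    """Return, sorted, all other synsets defined by an identical expression."""
    target = LEXICON[(label, lang)]
    return sorted(
        key for key, expr in LEXICON.items()
        if expr == target and key != (label, lang)
    )


print(equivalents("watchdog", "EN"))  # [('banken', 'JP'), ('waakhond', 'NL')]
```

Comparing sets of atoms is only a syntactic check. Recognising expressions that are logically equivalent but written with different variables or predicates would need real KIF reasoning, which this sketch does not attempt.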
Slide 39 How to construct the GWNG: Take an existing ontology as starting point. Use English WordNet to maximize the number of disjunct types in the ontology. Link English WordNet synsets as names to the disjunct types. Provide KIF expressions for all other English words and synsets.
Slide 40 How to construct the GWNG: Copy the relation from the English Wordnet to the ontology to other languages, including KIF statements built for English. Revise KIF statements to make the mapping more precise. Map all words and synsets that are not and cannot be mapped to English WordNet to the ontology: propose extensions to the type hierarchy; create KIF expressions for all non-rigid concepts.
Slide 41 Initial Ontology: SUMO (Niles and Pease): SUMO = Suggested Upper Merged Ontology; consistent with good ontological practice; fully mapped to WordNet(s): 1000 equivalence mappings, the rest through subsumption; freely and publicly available; allows data interoperability; allows NLP; allows reasoning/inferencing.
Slide 42 Mapping Grid languages onto the Ontology: Check existing SUMO mappings to Princeton WordNet -> extend the ontology with rigid types for specific concepts. Extend it to many other WordNet synsets. Observe OntoClean principles! (Synsets referring to non-rigid, non-essential, non-unicitous concepts must be expressed in KIF.)
Slide 43 Lexicalizations not mapped to WordNet: Not added to the type hierarchy: {straathond}NL (a dog that lives in the streets): ((instance x Canine) and (habitat x Street)). Added to the type hierarchy: {klunen}NL (to walk on skates from one frozen body to the next over land): KluunProcess => WalkProcess. Axioms: (and (instance x Human) (instance y Walk) (instance z Skates) (wear x z) (instance s1 Skate) (instance s2 Skate) (before s1 y) (before y s2) etc. National dishes, customs, games,....
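The two outcomes on Slide 43 (define a synset with a KIF expression, or propose a new type) can be sketched as a small Python routine. Everything here is an assumption made for illustration: the function map_synset, its parameters, and the three dictionaries are hypothetical, and the decision whether a concept deserves a new type is left to the caller rather than derived from OntoClean criteria.

```python
# Hypothetical fragment of the type hierarchy: type -> parent (None = root here).
type_parent = {"Canine": None, "WalkProcess": None}
# Synsets defined by a KIF-like expression instead of a type of their own.
kif_definitions = {}
# Synsets that directly name a type in the hierarchy.
type_names = {}


def map_synset(synset, as_new_type, new_type=None, parent=None, kif=None):
    """Map a synset with no English WordNet equivalent onto the ontology."""
    if as_new_type:
        if parent not in type_parent:
            raise ValueError(f"unknown parent type: {parent}")
        type_parent[new_type] = parent   # proposed extension of the hierarchy
        type_names[synset] = new_type
    else:
        kif_definitions[synset] = kif    # the type hierarchy stays minimal


# {straathond}NL: described relative to existing types, no new type added.
map_synset("{straathond}NL", as_new_type=False,
           kif="((instance x Canine) and (habitat x Street))")
# {klunen}NL: added to the hierarchy below WalkProcess, as on the slide.
map_synset("{klunen}NL", as_new_type=True,
           new_type="KluunProcess", parent="WalkProcess")
```

After both calls the hierarchy has grown by exactly one type, KluunProcess, while {straathond}NL exists only as a definition over Canine and Street.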
Slide 44 Most mismatching concepts are not new types: They refer to sets of types in specific circumstances or to concepts that are dependent on these types; next to {rivierwater}NL there are many others: {theewater}NL (w...
