LREC 2008 AWN 1 Building WordNets: The Arabic case H. Rodríguez
Index of the talk
  - Introduction
  - Ontologies
  - Wordnets
  - Building wordnets
  - Arabic WordNet
  - Semi-automatic extensions of AWN
Ontologies: what an ontology is
  - (Gruber, 1993): an ontology is an explicit specification of a conceptualization.
  - (Studer et al., 1998): an ontology is a formal, explicit specification of a shared conceptualization.
Ontologies: lexico-conceptual ontologies
  - Some authors simply reject this term: an ontology is by definition conceptual and thus language independent (or better, language neutral).
  - Other authors admit that some conceptualizations differ from one language to another, thus leading to different ontologies.
  - (Barbu and Barbu-Mititelu, 2005) classify these differences as accidental, systematic and cultural.
Ontologies: mapping lexical items to concepts
  - The mapping between lexical items (words or multiwords) and concepts can be complex.
  - Due to polysemy, most lexical items can be mapped to more than one concept.
  - Due to synonymy, more than one word can be mapped to the same concept.
  - Usually the mapping is split into two steps: from words to word-senses (i.e. different word meanings) and from word-senses to concepts.
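The two-step mapping can be illustrated with a small sketch. The words, sense identifiers and concept names below are invented toy data, not entries from any wordnet; the point is only that polysemy shows up as one word with several senses, and synonymy as one concept reached from several words.

```python
from collections import defaultdict

# Hypothetical toy data: (word, word-sense id, concept) triples.
SENSES = [
    ("bank", "bank#1", "financial_institution"),
    ("bank", "bank#2", "river_side"),
    ("shore", "shore#1", "river_side"),
]

word_to_senses = defaultdict(list)    # step 1: word -> word-senses
sense_to_concept = {}                 # step 2: word-sense -> concept
concept_to_words = defaultdict(set)   # inverse view, exposes synonymy

for word, sense, concept in SENSES:
    word_to_senses[word].append(sense)
    sense_to_concept[sense] = concept
    concept_to_words[concept].add(word)


def concepts_of(word):
    """Follow both steps: word -> word-senses -> concepts."""
    return [sense_to_concept[s] for s in word_to_senses[word]]


print(concepts_of("bank"))                     # polysemy: two concepts
print(sorted(concept_to_words["river_side"]))  # synonymy: two words
```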
Ontologies: building lexico-conceptual ontologies
  - This derivation process is far from simple.
  - For an LO, the mapping words -> word-senses -> concepts is complex (and controversial):
      - (Kilgarriff, 1997) argues against the ontological status of word-senses.
      - (Edmonds and Hirst, 2002) greatly restrict the cases of absolute synonymy and propose instead modeling near-synonymy, for a fine-grained mapping between words and concepts.
Wordnets: Princeton's English WordNet (Miller et al., 1990), (Fellbaum, 1998)
  - Semantic information
  - More than 123,000 words organised in 117,000 synsets (WN 3.0)
  - More than 235,000 relations between synsets
  - Freely available: http://wordnet.princeton.edu/
Wordnets: Princeton's English WordNet
  - Lexicalised concepts (words, compounds, multiwords)
  - Synset: synonym set (of words)
  - Large semantic net connecting synsets: synonymy, antonymy, hyperonymy, hyponymy, meronymy, implication, causation...
  - Structure:
      - Noun hierarchy: depth ~12
      - Verb hierarchy: depth ~3
      - Adjectives and adverbs: not in a hierarchy, but in a star structure
Wordnets: beyond WN
  - EuroWordNet (Vossen 98)
      - EU-funded project
      - Integrated local wordnets in several languages:
          - English: Sheffield
          - Dutch: Amsterdam
          - Italian: Pisa
          - Spanish: UB, UPC, UNED
      - http://www.hum.uva.nl/~ewn/
Wordnets: beyond WN
  - EWN2: German (GermaNet), French, Czech, Swedish, Estonian
  - ITEM, CREL: Spanish, Catalan, Basque (UB, UPC)
  - EuroTerm, Jur-Wordnet: extending EWN in particular domains
  - Balkanet: extending EWN for the Balkan languages
  - Hownet: Chinese WN
Wordnets: macro ontologies based on WN
  - MCR
  - Yago
  - Omega
Building wordnets: EWN building
  - Base Concepts (BC): supposed to be the concepts that play the most important role in different languages.
  - Two main criteria:
      - a high position in the semantic hierarchy (abstract)
      - having many relations to other concepts (hub)
  - 1000 synsets
  - Vertical expansion: filling gaps and ensuring good overlap
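The two Base Concept criteria can be read as a simple filter over a synset graph. The sketch below is only an illustration: the thresholds `max_depth` and `min_relations`, the function name and the toy synsets are all assumptions, and the slides do not say how EWN actually combined the two criteria.

```python
def select_base_concepts(depth, relation_count, max_depth, min_relations):
    """Keep synsets that are high in the hierarchy and act as hubs.

    depth:          synset -> depth in the hierarchy (0 = root)
    relation_count: synset -> number of relations to other synsets
    Both thresholds are hypothetical tuning parameters.
    """
    return sorted(
        s for s in depth
        if depth[s] <= max_depth and relation_count.get(s, 0) >= min_relations
    )


# Invented toy hierarchy.
depth = {"entity": 0, "object": 1, "animal": 3, "poodle": 9}
relation_count = {"entity": 40, "object": 120, "animal": 300, "poodle": 2}

print(select_base_concepts(depth, relation_count, max_depth=3, min_relations=100))
```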
Building wordnets: EWN, Spanish WN
  - Automatic extension with human validation
  - Combination of 17 heuristic methods:
      1) simple rule
      2) pairwise combination
      3) logistic regression combination
Arabic WordNet
  - Funded by the USA REFLEX program (2005-2007)
  - Partners:
      - Universities: Princeton, Manchester, UPC (Barcelona), UB (Barcelona)
      - Companies: Articulate Software, Irion
Arabic WordNet: papers
  - Introducing the Arabic WordNet Project (Black et al., 2006)
  - Building a WordNet for Arabic (Elkateb et al., 2006)
  - Arabic WordNet: Current State and Future Extensions (Rodríguez et al., 2008)
  - Arabic WordNet: Semi-automatic Extensions using Bayesian Inference (Rodríguez et al., 2008)
Arabic WordNet: objectives
  - 10,000 synsets, including some amount of domain specific data
  - linked to PWN 2.0, finally to PWN 3.0
  - linked to SUMO
  - + 1,000 NE
  - manually built (or revised)
  - vowelized entries
  - including the root of each entry
Arabic WordNet: approach
  - Described in 3rd GWC (Elkateb et al., 2006)
  - Manually built
  - 2 lexicographic interfaces: Manchester, Barcelona
  - Guided by automatically generated suggestions of pairs coming from bilingual resources
Arabic WordNet: approach
  - BCs: covering of EWN & Balkanet Base Concepts
  - Filling gaps
  - Building Arabic specific synsets
  - Covering domain specific synsets
  - Adding NEs
  - (Semi) automatic extensions:
      - heuristic based
      - Bayesian networks
Arabic WordNet: resources used
  - LOGOS database of Arabic verbs: contains 944 fully conjugated Arabic verbs
  - Bilingual (Arabic-English) dictionaries:
      - NMSU bilingual Arabic-English lexicon: Salmon University of Barcelona Effel
  - Corpora:
      - Arabic GigaWord Corpus (from LDC)
      - UN (2000-2002) bilingual Arabic-English Corpus (from LDC)
Arabic WordNet: representation
  - Database (implemented in MySQL)
  - Interchange format (XML)
  - The database structure comprises four principal entity types: item, word, form and link.
Arabic WordNet: entity types
  - Item: conceptual entities, including synsets, ontology classes and instances.
  - Word: word senses.
  - Form: entity that contains lexical information (not merely inflectional variation):
      - roots
      - broken plural forms
  - Link: relates two items, and has a type such as equivalence, subsuming, etc. Links interconnect sense items, e.g., a PWN synset to an AWN synset, a synset to a SUMO concept, etc.
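As a rough picture of how the four entity types fit together, here is a minimal in-memory sketch. It is not the actual MySQL schema: every field name, identifier and the `links_from` helper are hypothetical, chosen only to mirror the descriptions above. The example word is kitAb ("book"), with root ktb and broken plural kutub, written in transliteration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:          # conceptual entity
    item_id: str
    kind: str        # "synset", "ontology class" or "instance"


@dataclass(frozen=True)
class Word:          # a word sense attached to an item
    word_id: str
    item_id: str
    value: str


@dataclass(frozen=True)
class Form:          # lexical information about a word
    word_id: str
    kind: str        # e.g. "root" or "broken_plural"
    value: str


@dataclass(frozen=True)
class Link:          # typed relation between two items
    kind: str        # e.g. "equivalence", "subsuming"
    source: str
    target: str


# Hypothetical identifiers, for illustration only.
items = [Item("awn:kitAb_n1", "synset"), Item("pwn:book_n1", "synset")]
words = [Word("w1", "awn:kitAb_n1", "kitAb")]
forms = [Form("w1", "root", "ktb"), Form("w1", "broken_plural", "kutub")]
links = [Link("equivalence", "awn:kitAb_n1", "pwn:book_n1")]


def links_from(all_links, item_id):
    """Return (kind, target) for every link leaving the given item."""
    return [(l.kind, l.target) for l in all_links if l.source == item_id]


print(links_from(links, "awn:kitAb_n1"))
```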
Arabic WordNet: current (final? we hope not!) figures
Up-to-date statistics: http://www.lsi.upc.edu/~mbertran/arabic/awn/query/sug_statistics.php

  DB content
    Arabic synsets                                   11270
    Arabic words                                     23496

  pos
    n                                                 7961
    v                                                 2538
    a                                                  661
    r                                                  110

  Named entities
    Synsets that are named entities                   1142
    Synsets that are not named entities              10028
    Words in synsets that are named entities          1656
Arabic WordNet: software
  - Lexicographer's Web Interface: http://www.lsi.upc.edu/~mbertran/arabic/awn/update/synset_browse.php
  - User's Web Interface: http://www.lsi.upc.edu/~mbertran/arabic/awn/index.html
  - The Arabic Word Spotter: http://www.lsi.upc.edu/~mbertran/arabic/wwwWn7/
  - AWN browser: http://sourceforge.net/projects/awnbrowser/
  - AWN to SUMO mapping, including automatic generation of Arabic paraphrases of SUMO formal axioms
Arabic WordNet: ongoing research
  - (Semi) automatic methods for enriching AWN:
      - Heuristic-based approach (GWC 2008)
      - Bayesian Networks (LREC 2008)
  - Automatically obtaining & linking NEs using Wikipedia as Knowledge Source
Arabic WordNet: (semi) automatic methods for enriching AWN
  - Key idea: in Arabic, many words sharing a common root have related meanings and can be derived from a base verbal form by means of a reduced set of lexical rules.
Semi-automatic Extensions of AWN: lexical rules
  - regular verbal derivative forms
  - regular nominal and adjectival derivative forms:
      - masdar (verbal noun)
      - masculine and feminine active and passive participles
  - inflected verbal forms
Semi-automatic Extensions of AWN
Procedure for generating a set of likely :
  1. produce an initial list of candidate word forms
  2. filter out the less likely candidates from this list
  3. generate an initial list of attachments
  4. score the reliability of these candidates
  5. manually review the best scored candidates and include the valid associations in AWN
Semi-automatic Extensions of AWN: scoring the reliability of the candidates
  - Build a graph representing the words, synsets and their associations
  - Synset-synset associations:
      - explicit in WN 2.0
      - path-based
  - Apply a set of heuristic rules that directly use the structure of the graph (GWC 2008)
  - Apply Bayesian inference (LREC 2008)
Using Heuristics, case 1
When a unique path A-E-S exists (i.e., A is only translated as E), and E is monosemous (i.e., it is associated with a single synset), then the output tuple is tagged as case 1.
Using Heuristics, case 2
If multiple paths A-E1-S and A-E2-S exist (i.e., A is translated as E1 or E2, and both E1 and E2 are associated with S among other possible associations), then the output tuple is tagged as case 2.
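Cases 1 and 2 depend only on the paths from an Arabic word A through its English translations E to a synset S, so they can be sketched directly over two dictionaries. The function name, the data layout and the toy entries are assumptions made for this illustration; the original implementation is not shown in the slides.

```python
def tag_case_1_or_2(arabic_word, synset, translations, synsets_of):
    """Return 1, 2 or None for the tuple (arabic_word, synset).

    translations: Arabic word  -> list of English translations
    synsets_of:   English word -> set of synsets it belongs to
    """
    english = translations.get(arabic_word, [])
    # English words that give a path A-E-S to this synset.
    supporting = [e for e in english if synset in synsets_of.get(e, set())]

    # Case 1: A has a single translation E, and E has a single synset S.
    if len(english) == 1 and supporting and len(synsets_of[english[0]]) == 1:
        return 1
    # Case 2: at least two translations of A both lead to S.
    if len(supporting) >= 2:
        return 2
    return None


# Invented toy data (A1, A2, E1... are placeholders, not real entries).
translations = {"A1": ["E1"], "A2": ["E2", "E3"]}
synsets_of = {"E1": {"S1"}, "E2": {"S2", "S3"}, "E3": {"S2"}}

print(tag_case_1_or_2("A1", "S1", translations, synsets_of))  # unique path
print(tag_case_1_or_2("A2", "S2", translations, synsets_of))  # two paths
print(tag_case_1_or_2("A2", "S3", translations, synsets_of))  # neither case
```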
Using Heuristics, case 3
If S in A-E-S has a semantic relation to one or more synsets S1, S2, ... that have already been associated with an Arabic word on the basis of either heuristic 1 or heuristic 2, then the output tuple is tagged as case 3.
Using Heuristics, case 4
If S in A-E-S has some semantic relation with S1, S2, ..., where S1, S2, ... belong to the set of synsets that have already been associated with related Arabic words, then the output tuple is tagged as case 4. In this case there is only one translation E of A but more than one synset associated with E. This heuristic can be sub-classified by the number of input edges or supporting semantic relations (1, 2, 3, ...).
Using Heuristics, case 5
Heuristic 5 is the same as heuristic 4 except that there are multiple translations E1, E2, ... of A and, for each translation Ei, there are possibly multiple associated synsets Si1, Si2, ... In this case the output tuple is tagged as case 5, and again the heuristic can be sub-classified by the number of input edges or supporting semantic relations (1, 2, 3, ...).
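The sub-classification of cases 4 and 5 by the number of supporting semantic relations can be sketched as a count of related synsets that are already anchored to related Arabic words. This is a simplification under stated assumptions: `related`, `anchored` and the function names are hypothetical, and the sketch distinguishes the two cases only by the number of translations of A, without checking how many synsets each translation has.

```python
def support(synset, related, anchored):
    """Number of synsets related to `synset` that are already anchored."""
    return len(related.get(synset, set()) & anchored)


def tag_case_4_or_5(arabic_word, synset, translations, related, anchored):
    """Return (case, support) or None when no relation supports the tuple.

    translations: Arabic word -> list of English translations
    related:      synset -> set of semantically related synsets
    anchored:     synsets already associated with related Arabic words
    """
    n = support(synset, related, anchored)
    if n == 0:
        return None
    case = 4 if len(translations[arabic_word]) == 1 else 5
    return (case, n)


# Invented toy data.
translations = {"A1": ["E1"], "A2": ["E2", "E3"]}
related = {"S1": {"S7", "S8"}, "S2": {"S8"}, "S3": {"S9"}}
anchored = {"S7", "S8"}

print(tag_case_4_or_5("A1", "S1", translations, related, anchored))
print(tag_case_4_or_5("A2", "S2", translations, related, anchored))
print(tag_case_4_or_5("A2", "S3", translations, related, anchored))
```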
Using Bayesian Inference: building the CPT for each node in the BN
  - EW-AW edges:
      - probabilities from statistical translation models built from the UN corpus using GIZA++ (word-word probabilities)
      - filtered to avoid pairs having Arabic expressions with invalid Buckwalter encodings
      - all the probability mass is distributed between pairs occurring in the BN
  - Other edges (EW-S, S-S):
      - linear distribution on priors
      - noisy-OR model
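Two of the ingredients above lend themselves to a short sketch: redistributing the translation probability mass over the pairs that actually occur in the network, and the standard noisy-OR combination of parent influences. The function names and toy numbers are assumptions, and renormalising per first element of the pair is one plausible reading of the slide, not a confirmed detail. The noisy-OR is the textbook form without a leak term.

```python
from collections import defaultdict


def renormalise(pair_probs, pairs_in_bn):
    """Keep only pairs present in the BN and rescale so that, for each
    first element, the kept probabilities sum to 1 (assumed reading)."""
    kept = {p: v for p, v in pair_probs.items() if p in pairs_in_bn}
    totals = defaultdict(float)
    for (src, _), v in kept.items():
        totals[src] += v
    return {(src, tgt): v / totals[src] for (src, tgt), v in kept.items()}


def noisy_or(active_parent_weights):
    """P(child is true | these parents are true), textbook noisy-OR,
    no leak term: 1 - prod(1 - w_i)."""
    q = 1.0
    for w in active_parent_weights:
        q *= 1.0 - w
    return 1.0 - q


# Invented word-word probabilities, e.g. as read from a translation table.
probs = {("e", "a1"): 0.2, ("e", "a2"): 0.2, ("e", "a3"): 0.6}
print(renormalise(probs, {("e", "a1"), ("e", "a2")}))
print(noisy_or([0.5, 0.5]))
```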
Using Bayesian Inference: performing Bayesian inference in the BN
  1. Assign probability 1 to the nodes in layer 1
  2. Infer the probabilities of the nodes in layer 3
  3. For each word in layer 1, select as candidates the synsets in layer 3 that are connected to it and have a probability over a threshold
  4. Score the candidate pair with this probability
  5. Select the candidates scored over a threshold
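The steps above can be approximated with a simple forward pass over three layers. This sketch is not the inference engine used in the work: it propagates marginals with a noisy-OR and treats the parents of each node as independent, which is an approximation. The layer contents, edge weights and function names are invented; the threshold 0.07 is borrowed from the results slide.

```python
def propagate(parent_probs, edges):
    """edges: child -> list of (parent, weight). Noisy-OR forward step,
    assuming parents are independent (an approximation)."""
    out = {}
    for child, parents in edges.items():
        q = 1.0
        for parent, w in parents:
            q *= 1.0 - w * parent_probs.get(parent, 0.0)
        out[child] = 1.0 - q
    return out


def candidates(layer1, edges12, edges23, threshold):
    """Return sorted (word, synset, score) triples scored over the threshold."""
    p1 = {w: 1.0 for w in layer1}        # step 1: layer 1 gets probability 1
    p2 = propagate(p1, edges12)
    p3 = propagate(p2, edges23)          # step 2: infer layer 3
    result = []
    for word in layer1:                  # steps 3-5: connected and over threshold
        mids = {m for m, ps in edges12.items() if any(p == word for p, _ in ps)}
        for synset, ps in edges23.items():
            connected = any(p in mids for p, _ in ps)
            if connected and p3[synset] > threshold:
                result.append((word, synset, p3[synset]))
    return sorted(result)


# Invented toy network: one layer-1 word, two layer-2 words, two synsets.
edges12 = {"E1": [("A", 0.5)], "E2": [("A", 0.5)]}
edges23 = {"S1": [("E1", 1.0), ("E2", 1.0)], "S2": [("E2", 0.1)]}

print(candidates(["A"], edges12, edges23, threshold=0.07))
```

In this toy network S1 is supported by both layer-2 words and is kept, while S2 has a single weak edge and falls below the threshold.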
Results: using HEU + BN (threshold 0.07)
  - precision 0.71: 65 accepted candidates out of 92 proposed
  - average 65/11 ≈ 6
  - extrapolating the results to the set of AWN verbs (>2,500) leads to 15,000 new synsets from 20,000 candidates
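The arithmetic behind these figures can be checked in a few lines. One assumption is needed: the slide does not say what the 11 in "65/11" counts, and the sketch reads it as the number of verbs in the evaluated sample. Under that reading, the extrapolation to 2,500 verbs comes out close to the rounded figures quoted above.

```python
accepted, proposed = 65, 92
sample_verbs = 11   # assumption: 11 is the number of verbs in the sample
awn_verbs = 2500    # lower bound given in the results (">2,500")

precision = accepted / proposed
accepted_per_verb = accepted / sample_verbs
proposed_per_verb = proposed / sample_verbs

estimated_accepted = awn_verbs * accepted_per_verb
estimated_proposed = awn_verbs * proposed_per_verb

print(f"precision          = {precision:.2f}")
print(f"accepted per verb  = {accepted_per_verb:.1f}")
print(f"estimated accepted = {estimated_accepted:.0f}")
print(f"estimated proposed = {estimated_proposed:.0f}")
```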
Conclusions
  - The BN approach doubles the number of candidates of the previous HEU approach (554 vs 272).
  - The sample is clearly insufficient.
  - The overlap of HEU + BN seems to improve the results.
  - An analysis of the errors shows that a substantial number were due to the lack of the shadda diacritic or of the feminine ending form (ta marbuta, ة).
Future work
  - Repeat the entire procedure, relying when possible on dictionaries containing diacritics.
  - Refine the scoring procedure by assigning different weights to the different relations.
  - Include additional relations (e.g. path-based).
  - Use additional Knowledge Sources for weighting the relations:
      - related entries already included in AWN
      - SUMO
      - Magnini's domain codes