Lexical Profiling for Arabic
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith
National Centre for Language Technology (NCLT),
School of Computing, Dublin City University
Funded by:
Enterprise Ireland, the Irish Research Council for Science,
Engineering and Technology (IRCSET), and
the EU projects PANACEA and META-NET
Overview
• Introduction
• Building the lexical database for Arabic
– Corpus-based Selection of Entries
– Morphological Details: Inflectional Paradigms
– Syntactic Details: Subcategorization Frames
• Web Application
• Conclusion
Introduction
• Modern Standard Arabic vs. Classical Arabic
• Current State of Arabic Lexicography
– Lexicons are not corpus-based
– Buckwalter Electronic Dictionary and Arabic Morphological Analyser
– No lexica for subcategorization frames
• Importance of Lexical Resources
Introduction
• Arabic Morphotactics
Aim
• Constructing a lexical database of Modern Standard Arabic
• Constructing a database for Arabic subcategorization frames
Methodology
Lexical Details
• Using a medium-scale manually created lexicon of 10,799 lemmas
• Using statistics from a 1-billion-word corpus (annotated by MADA)
– 90% from the LDC's Arabic Gigaword
– 10% collected from the Al-Jazeera website
Subcategorization Details
• Using a medium-scale manually created lexicon of 2,901 lemma-frame types
• Using the Penn Arabic Treebank (22,524 sentences and 587,665 words)
Extending the Lexical Database
• Start off with a seed lexicon
– Three lexical databases, manually constructed
• 5,925 nominal lemmas, with details on:
– Gender and number
– Inflection paradigm (13 continuation classes)
– Humanness
• 1,529 verb lemmas, with details on:
– Transitivity
– Whether passive is allowed or not
– Whether the imperative is allowed or not
• 490 patterns (456 for nominals and 34 for verbs)
• A lemma-root lookup database
Methodology
Extending the Lexical Database
• Automatically Extending the Lexical Database: Lexical Enrichment
– Data-driven filtering technique
• 40,648 lemmas (in Buckwalter or SAMA 3.1)
• Statistics from three web search engines
• Statistics from the corpus annotated by MADA
• 29,627 lemmas (left after filtering)
Extending the Lexical Database
• Automatically Extending the Lexical Database: Feature Enrichment
– Machine Learning
– Multilayer Perceptron classification algorithm
– Training Data: 4,816 nominals and 1,448 verbs
– Classes for nominals: continuation classes (or inflection paths), the semantico-grammatical feature of humanness, and POS (noun or adjective)
– Classes for verbs: transitivity, allowing the passive voice, and allowing the imperative mood
– We feed these datasets with frequency statistics from the corpus and build a vector grid.
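A minimal sketch of how per-lemma frequency statistics could be turned into fixed-length vectors for a classifier. The feature names in `FEATURES`, the function `build_vectors`, and the toy observations are all hypothetical; the slides do not list the actual features taken from the MADA-annotated corpus.

```python
from collections import Counter, defaultdict

# Hypothetical feature inventory; the real feature set is not given on the slide.
FEATURES = ["masc_sg", "fem_sg", "dual", "masc_pl", "fem_pl"]


def build_vectors(observations):
    """Turn (lemma, feature) corpus observations into relative-frequency vectors."""
    counts = defaultdict(Counter)
    for lemma, feature in observations:
        counts[lemma][feature] += 1

    vectors = {}
    for lemma, counter in counts.items():
        total = sum(counter[f] for f in FEATURES)
        # One row of the "vector grid": the share of each feature for this lemma.
        vectors[lemma] = [counter[f] / total if total else 0.0 for f in FEATURES]
    return vectors


# Made-up observations for one lemma, for illustration only.
toy = [("muEal~im", "masc_sg")] * 3 + [("muEal~im", "masc_pl")]
vectors = build_vectors(toy)
print(vectors["muEal~im"])
```

Each row can then be paired with a known class label (for example a continuation class) from the training data and passed to a classifier such as a multilayer perceptron.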
Extending the Lexical Database
• Feature enrichment using Machine Learning
Extending the Lexical Database
• With Machine Learning we add about 18,000 new lemmas:
– 12,974 nominals
– 5,034 verbs
Extending the Lexical Database
• Handling Broken Plurals
jAnib (side)
jawAnib (sides)
Poor handling of broken plurals in Buckwalter
(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc> <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>
(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc> <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>
Two differences: voc and gloss
Extending the Lexical Database
• Extracting Broken Plurals
<gloss>side/aspect</gloss>
<gloss>sides/aspects</gloss>
We use Levenshtein distance, which measures the difference between two strings (here, glosses that share the same lemmaID).
Distance of 2 / length of the first string = 0.15 (within the threshold of 0.4)
We collect 2,266 candidates
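The gloss comparison can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names are hypothetical, and treating the 0.4 threshold as inclusive is an assumption. The distance is normalised by the length of the first argument. For the pair above, the edit distance is 2; dividing by the 13 characters of "sides/aspects" gives roughly 0.15, while dividing by the 11 characters of "side/aspect" gives roughly 0.18. Both are within the threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def gloss_ratio(first: str, second: str) -> float:
    """Edit distance normalised by the length of the first string."""
    return levenshtein(first, second) / len(first)


def is_plural_candidate(first: str, second: str, threshold: float = 0.4) -> bool:
    """Assumption: a pair is kept when the ratio does not exceed the threshold."""
    return gloss_ratio(first, second) <= threshold


print(levenshtein("side/aspect", "sides/aspects"))
print(round(gloss_ratio("sides/aspects", "side/aspect"), 2))
print(is_plural_candidate("side/aspect", "sides/aspects"))
```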
Extending the Lexical Database
• Validating Broken Plurals
<voc>jAnib</voc> singular
pattern is: fAEil
regex is: .A.i.
<voc>jawAnib</voc> plural
pattern is: fawAEil
regex is: .awA.i.
Pattern database: 135 singular patterns, each selecting from a set of 82 broken plural patterns
2,266 candidates -> 1,965 are validated (87%)
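The pattern check can be sketched like this. The table `SINGULAR_TO_PLURALS` holds only the one pair shown on the slide (the full database of 135 singular and 82 plural patterns is not reproduced here), and the helper names are hypothetical. The sketch assumes that the letters f, E and l in a pattern always stand for root-consonant slots.

```python
import re

# Toy excerpt of the pattern database: singular pattern -> allowed plural patterns.
SINGULAR_TO_PLURALS = {"fAEil": {"fawAEil"}}


def pattern_to_regex(pattern: str) -> str:
    """Replace the root slots f, E, l with '.', keeping other characters literal."""
    return "".join("." if ch in "fEl" else re.escape(ch) for ch in pattern)


def validate(singular: str, plural: str) -> bool:
    """Accept a candidate if the singular fits a known pattern and the plural
    fits one of the broken plural patterns that this singular pattern selects."""
    for sg_pattern, pl_patterns in SINGULAR_TO_PLURALS.items():
        if re.fullmatch(pattern_to_regex(sg_pattern), singular):
            if any(re.fullmatch(pattern_to_regex(p), plural) for p in pl_patterns):
                return True
    return False


print(pattern_to_regex("fAEil"))
print(pattern_to_regex("fawAEil"))
print(validate("jAnib", "jawAnib"))
```

Matching is case-sensitive, which matters in Buckwalter transliteration, where "a" and "A" are different letters.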
Extending the Lexical Database
• Interesting statistics on Arabic plurals
Insights from the corpus:
5,570 lemmas have a feminine plural suffix
1,942 lemmas have a masculine plural suffix
2,730 lemmas have a broken plural form
Extraction of Subcat Frames
• Importance of subcategorization frames
• Advantage of Automatic Extraction
• Available Resource on Arabic Subcat Frames:
– None except the Arabic LFG Parser (Attia, 2008), available as open source
Extraction of Subcat Frames
What are LFG subcat frames?
• Governable GFs (SUBJ, OBJ, OBJθ, OBLθ, COMP and XCOMP)
• Non-governable GFs (ADJ and XADJ)
π<gf1, gf2, …, gfn>
{iEotamada Al-Tifolu EalaY wAlidati-hi “The child relied on his mother”
{iEotamada<(↑SUBJ)(↑OBL_EalaY)>
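A small sketch of how such a frame could be held as a data structure and printed in the π<gf1, …, gfn> form. The class `SubcatFrame` and the convention of writing a restriction after an underscore (as in `OBL_EalaY` for an oblique marked by the preposition EalaY) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Base names of the governable grammatical functions; ADJ and XADJ are
# non-governable and therefore never appear inside a frame.
GOVERNABLE = {"SUBJ", "OBJ", "OBL", "COMP", "XCOMP"}


@dataclass(frozen=True)
class SubcatFrame:
    predicate: str
    functions: Tuple[str, ...]

    def __post_init__(self):
        for gf in self.functions:
            base = gf.split("_", 1)[0]  # drop the assumed "_restriction" suffix
            if base not in GOVERNABLE:
                raise ValueError(f"{gf} is not a governable grammatical function")

    def render(self) -> str:
        args = "".join(f"(↑{gf})" for gf in self.functions)
        return f"{self.predicate}<{args}>"


frame = SubcatFrame("{iEotamada", ("SUBJ", "OBL_EalaY"))
print(frame.render())
```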
Extraction of Subcat Frames
Automatic extraction of subcat frames
• The ATB contains 22,524 sentences
• LFG annotation algorithm (DCU)
• Traversing trees and looking for dependencies
• Lemmatization
• We extract 7,746 lemma-frame types (for verbs, nouns and adjectives)
Extraction of Subcat Frames
Estimating the Subcategorization Probability
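The formula from this slide is not preserved in the text. One common way to estimate how likely a lemma is to take a given frame is relative frequency: the number of times the lemma occurs with that frame, divided by the number of times the lemma occurs at all. The sketch below shows that estimate. Whether the authors use exactly this estimate is an assumption, and the counts are made up.

```python
from collections import Counter


def frame_probabilities(observations):
    """Relative-frequency estimate of P(frame | lemma) from (lemma, frame) pairs."""
    observations = list(observations)  # allow any iterable, since we scan twice
    pair_counts = Counter(observations)
    lemma_totals = Counter(lemma for lemma, _ in observations)
    return {
        (lemma, frame): count / lemma_totals[lemma]
        for (lemma, frame), count in pair_counts.items()
    }


# Made-up treebank observations, for illustration only.
toy = [("{iEotamada", "SUBJ,OBL_EalaY")] * 3 + [("{iEotamada", "SUBJ")]
probs = frame_probabilities(toy)
print(probs[("{iEotamada", "SUBJ,OBL_EalaY")])
```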
Extraction of Subcat Frames
Evaluating the Subcategorization Extraction
Web Application
• AraComLex Lexicon Writing Application
www.cngl.ie/aracomlex
Byproducts of the Work
A number of open-source resources:
• Finite-state morphological transducer
• Arabic morphological patterns
• Subcategorization frames
• Arabic lemma frequency counts
Conclusion
• We successfully use machine learning to predict morpho-syntactic features for newly acquired words.
• We successfully extract subcategorization frames from the Penn Arabic Treebank.
• We build the specification and implementation of an Arabic lexicographic web application.