E lex presentation_03

Lexical Profiling for Arabic

Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith

National Centre for Language Technology (NCLT),

School of Computing, Dublin City University

Funded by:

Enterprise Ireland, the Irish Research Council for Science,

Engineering and Technology (IRCSET), and

the EU projects PANACEA and META-NET

Overview

• Introduction

• Building the lexical database for Arabic

– Corpus-based Selection of Entries

– Morphological Details: Inflectional Paradigms

– Syntactic Details: Subcategorization Frames

• Web Application

• Conclusion

Introduction

• Modern Standard Arabic vs. Classical Arabic

• Current State of Arabic Lexicography

– Lexicons are not corpus-based

– Buckwalter Electronic Dictionary and Arabic Morphological Analyser

– No lexica for subcategorization frames

• Importance of Lexical Resources

Introduction

• Arabic Morphotactics

Aim

• Constructing a lexical database of Modern Standard Arabic

• Constructing a database for Arabic subcategorization frames

Methodology

Lexical Details

• Using a medium-scale manually created lexicon of 10,799 lemmas

• Using statistics from a 1 billion word corpus (annotated by MADA)

– 90% from the LDC's Arabic Gigaword

– 10% collected from the Al-Jazeera website

Subcategorization Details

• Using a medium-scale manually created lexicon of 2,901 lemma-frame types

• Using the Penn Arabic Treebank (22,524 sentences and 587,665 words)

Extending the Lexical Database

• Start off with a seed lexicon

– Three lexical databases, manually constructed

• 5,925 nominal lemmas, with details on:

– Gender and number

– Inflection paradigm (13 continuation classes)

– Humanness

• 1,529 verb lemmas, with details on:

– Transitivity

– Whether passive is allowed or not

– Whether the imperative is allowed or not

• 490 patterns (456 for nominals and 34 for verbs)

• Lemma-root lookup database

Methodology


• Automatically Extending the Lexical Database: Lexical Enrichment

– Data-driven filtering technique

• 40,648 lemmas (in Buckwalter or SAMA 3.1)

• Statistics from three web search engines

• Statistics from the corpus annotated by MADA

• 29,627 lemmas (left after filtering)


Automatically Extending the Lexical Database: Feature Enrichment

– Machine Learning

– Multilayer Perceptron classification algorithm

– Training Data: 4,816 nominals and 1,448 verbs

– Classes for nominals: continuation classes (or inflection paths), the semantico-grammatical feature of humanness, and POS (noun or adjective)

– Classes for verbs: transitivity, allowing the passive voice, and allowing the imperative mood

– We feed these datasets with frequency statistics from the corpus and build a vector grid.
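As a rough illustration of the "vector grid" idea, the sketch below turns per-lemma corpus counts into fixed-length rows of relative frequencies. The feature names in `FEATURES`, the function `build_vector_grid`, and the sample counts are all hypothetical; the slides do not list the actual features. Rows of this kind could then be passed to a multilayer perceptron classifier.

```python
# Hypothetical feature names; the slides do not specify the real feature set.
FEATURES = ["fem_sg", "masc_pl", "fem_pl", "dual"]


def build_vector_grid(counts):
    """Map each lemma to a row of relative frequencies, one column per feature.

    `counts` maps a lemma to a dict of {feature name: corpus frequency}.
    """
    grid = {}
    for lemma, tag_counts in counts.items():
        total = sum(tag_counts.values())
        # Avoid division by zero for lemmas that never occur in the corpus.
        grid[lemma] = [
            tag_counts.get(feature, 0) / total if total else 0.0
            for feature in FEATURES
        ]
    return grid


# Invented counts for a single illustrative lemma.
sample_counts = {"lemma_1": {"fem_sg": 10, "masc_pl": 30}}
sample_grid = build_vector_grid(sample_counts)
print(sample_grid["lemma_1"])  # [0.25, 0.75, 0.0, 0.0]
```

Using relative rather than raw frequencies is a design choice made here so that frequent and rare lemmas are comparable; it is an assumption, not something stated in the slides.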


• Extending the Lexical Database

– Feature enrichment using Machine Learning


• Extending the Lexical Database

– With Machine Learning we add 18,000 new lemmas: 12,974 nominals and 5,034 verbs


• Handling Broken Plurals

jAnib (side)

jawAnib (sides)

Poor handling of broken plurals in Buckwalter

(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc> <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>

(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc> <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>

Two differences: voc and gloss


• Extracting Broken Plurals

<gloss>side/aspect</gloss>

<gloss>sides/aspects</gloss>

We use the Levenshtein distance, which measures the difference between two strings (here, glosses sharing the same lemmaID).

distance of 2 / length of the first string = 0.15 (within the threshold 0.4)

We collect 2,266 candidates
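The gloss comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `<=` comparison against the threshold, and the exclusion of identical glosses are assumptions. Note that the ratio 0.15 is reproduced when the distance of 2 is divided by the length of the plural gloss "sides/aspects" (13 characters), so the plural gloss is passed first here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_plural_candidate(first: str, second: str, threshold: float = 0.4) -> bool:
    """True if the glosses differ, but by no more than `threshold` when
    the distance is normalised by the length of the first gloss."""
    if not first:
        return False
    distance = levenshtein(first, second)
    # Assumption: identical glosses are not treated as singular/plural pairs.
    return 0 < distance and distance / len(first) <= threshold


distance = levenshtein("sides/aspects", "side/aspect")
print(distance, round(distance / len("sides/aspects"), 2))  # 2 0.15
print(is_plural_candidate("sides/aspects", "side/aspect"))  # True
```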


• Validating Broken Plurals

<voc>jAnib</voc> singular

pattern is: fAEil

regex is: .A.i.

<voc>jawAnib</voc> plural

pattern is: fawAEil

regex is: .awA.i.

Pattern database: 135 singular patterns that choose from a set of 82 broken plural patterns

2,266 candidates -> 1,965 are validated (87%)
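The pattern check can be sketched like this, assuming that the letters f, E and l in a pattern mark the three root-consonant slots and every other character is literal. The dictionary `PLURALS_FOR_SINGULAR` is a hypothetical one-entry excerpt standing in for the pattern database, and the function names are illustrative only.

```python
import re

# Assumption: f, E and l stand for the root consonants in a pattern.
ROOT_SLOTS = "fEl"

# Hypothetical excerpt: singular pattern -> broken plural patterns it may choose.
PLURALS_FOR_SINGULAR = {"fAEil": ["fawAEil"]}


def pattern_to_regex(pattern: str) -> str:
    """Replace each root slot with '.', keeping other characters literal."""
    return "".join("." if ch in ROOT_SLOTS else re.escape(ch) for ch in pattern)


def matches_pattern(word: str, pattern: str) -> bool:
    """True if the whole word fits the pattern's regular expression."""
    return re.fullmatch(pattern_to_regex(pattern), word) is not None


def validate(singular: str, plural: str) -> bool:
    """Accept a candidate pair if the singular fits a known singular pattern
    and the plural fits one of the plural patterns listed for it."""
    for sg_pattern, pl_patterns in PLURALS_FOR_SINGULAR.items():
        if matches_pattern(singular, sg_pattern) and any(
            matches_pattern(plural, p) for p in pl_patterns
        ):
            return True
    return False


print(pattern_to_regex("fAEil"), pattern_to_regex("fawAEil"))  # .A.i. .awA.i.
print(validate("jAnib", "jawAnib"))  # True
```

Because "." matches any character, this sketch does not check that singular and plural share the same root consonants; whether the original validation does so is not stated in the slides.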


• Interesting statistics on Arabic plurals

Insights from the corpus:

5,570 lemmas have a feminine plural suffix

1,942 lemmas have a masculine plural suffix

2,730 lemmas have broken plural forms

Extraction of Subcat Frames

• Importance of subcategorization frames

• Advantage of Automatic Extraction

• Available Resource on Arabic Subcat Frames:

– none except Arabic LFG Parser (Attia, 2008) - available as open source


What are LFG subcat frames?

• Governable GFs (SUBJ, OBJ, OBJθ, OBLθ, COMP and XCOMP)

• Non-governable GFs (ADJ and XADJ)

π<gf1,gf2,…gfn>

{iEotamada Al-Tifolu EalaY wAlidati-hi “The child relied on his mother”

{iEotamada<(↑SUBJ)(↑OBL_EalaY)>
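One way to hold such frames in code is sketched below: a predicate plus an ordered list of governable grammatical functions, with an optional marker (here the preposition EalaY) on obliques. The class names, the `marker` field and the `render` format are assumptions made for illustration, not the representation used in the AraComLex resources.

```python
from dataclasses import dataclass

# Non-governable GFs are never part of a subcat frame.
NON_GOVERNABLE = {"ADJ", "XADJ"}


@dataclass(frozen=True)
class Argument:
    gf: str           # governable grammatical function, e.g. "SUBJ" or "OBL"
    marker: str = ""  # optional marker, e.g. the preposition of an oblique


@dataclass(frozen=True)
class SubcatFrame:
    pred: str     # lemma in Buckwalter transliteration
    args: tuple   # ordered Argument instances

    def __post_init__(self):
        for arg in self.args:
            if arg.gf in NON_GOVERNABLE:
                raise ValueError(f"{arg.gf} is not a governable GF")

    def render(self) -> str:
        """Format the frame as pred<(↑gf1)(↑gf2)...>."""
        parts = []
        for arg in self.args:
            name = f"{arg.gf}_{arg.marker}" if arg.marker else arg.gf
            parts.append(f"(↑{name})")
        return f"{self.pred}<{''.join(parts)}>"


# The verb from the example sentence: a subject plus an oblique with EalaY.
rely_on = SubcatFrame("{iEotamada", (Argument("SUBJ"), Argument("OBL", "EalaY")))
print(rely_on.render())  # {iEotamada<(↑SUBJ)(↑OBL_EalaY)>
```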


Automatic extraction of subcat frames

• The ATB contains 22,524 sentences

• LFG Annotation algorithm (DCU)

• Traversing trees and looking for dependencies

• Lemmatization

• We extract 7,746 lemma-frame types (for verbs, nouns and adjectives)


Estimating the Subcategorization Probability
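The formula for this slide is not preserved in the text. A common estimate, assumed here rather than taken from the slides, is the relative frequency of a frame given its lemma: the number of times the lemma occurs with the frame divided by the number of times the lemma occurs at all. The function name and the sample observations below are hypothetical.

```python
from collections import Counter, defaultdict


def frame_probabilities(observations):
    """Relative-frequency estimate of P(frame | lemma).

    `observations` is an iterable of (lemma, frame) pairs, one per occurrence
    extracted from the treebank.
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    lemma_counts = Counter(lemma for lemma, _ in observations)
    probabilities = defaultdict(dict)
    for (lemma, frame), count in pair_counts.items():
        probabilities[lemma][frame] = count / lemma_counts[lemma]
    return dict(probabilities)


# Invented observations for one illustrative lemma.
sample = [("lemma_1", "SUBJ,OBJ")] * 3 + [("lemma_1", "SUBJ")]
estimates = frame_probabilities(sample)
print(estimates["lemma_1"])  # {'SUBJ,OBJ': 0.75, 'SUBJ': 0.25}
```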


Evaluating the Subcategorization Extraction


Evaluating the Subcategorization Extraction

Web Application

• AraComLex Lexicon Writing Application

www.cngl.ie/aracomlex

Byproducts of the Work

A number of open-source resources:

• Finite-state morphological transducer

• Arabic morphological patterns

• Subcategorization frames

• Arabic lemma frequency counts

Conclusion

• We successfully use machine learning to predict morpho-syntactic features for newly acquired words.

• We successfully extract subcategorization frames from the Penn Arabic Treebank.

• We build the specification and implementation of an Arabic lexicographic web application.
