E lex presentation_03

Lexical Profiling for Arabic

Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith

National Centre for Language Technology (NCLT),

School of Computing, Dublin City University

Funded by:

Enterprise Ireland, the Irish Research Council for Science

Engineering and Technology (IRCSET), and

the EU projects PANACEA and META-NET

Overview

• Introduction

• Building the lexical database for Arabic

– Corpus-based Selection of Entries

– Morphological Details: Inflectional Paradigms

– Syntactic Details: Subcategorization Frames

• Web Application

• Conclusion

Introduction

• Modern Standard Arabic vs. Classical Arabic

• Current State of Arabic Lexicography

– Lexicons are not corpus-based

– Buckwalter Electronic Dictionary and Arabic Morphological Analyser

– No lexica for subcategorization frames

• Importance of Lexical Resources

Introduction

• Arabic Morphotactics

Aim

• Constructing a lexical database of Modern Standard Arabic

• Constructing a database for Arabic subcategorization frames

Methodology

Lexical Details

• Using a medium-scale manually created lexicon of 10,799 lemmas

• Using statistics from a 1-billion-word corpus (annotated by MADA)

– 90% from the LDC's Arabic Gigaword

– 10% collected from the Al-Jazeera website

Subcategorization Details

• Using a medium-scale manually created lexicon of 2,901 lemma-frame types

• Using the Penn Arabic Treebank (22,524 sentences and 587,665 words)

Extending the Lexical Database

• Start off with a seed lexicon

– Three lexical databases, manually constructed

• 5,925 nominal lemmas, with details on:

– Gender and number

– Inflection paradigm (13 continuation classes)

– Humanness

• 1,529 verb lemmas, with details on:

– Transitivity

– Whether passive is allowed or not

– Whether the imperative is allowed or not

• 490 patterns (456 for nominals and 34 for verbs)

• Lemma-root lookup database

Methodology

Extending the Lexical Database

• Automatically Extending the Lexical Database: Lexical Enrichment

– Data-driven filtering technique

• 40,648 lemmas (in Buckwalter or SAMA 3.1)

• Statistics from three web search engines

• Statistics from the corpus annotated by MADA

• 29,627 lemmas (left after filtering)

Extending the Lexical Database

Automatically Extending the Lexical Database: Feature Enrichment

– Machine Learning

– Multilayer Perceptron classification algorithm

– Training Data: 4,816 nominals and 1,448 verbs

– Classes for nominals: continuation classes (or inflection paths), the semantico-grammatical feature of humanness, and POS (noun or adjective)

– Classes for verbs: transitivity, allowing the passive voice, and allowing the imperative mood

– We feed these datasets with frequency statistics from the corpus and build a vector grid.

Extending the Lexical Database

• Extending the Lexical Database

– Feature enrichment using Machine Learning

Extending the Lexical Database

• Extending the Lexical Database

– With Machine Learning we add about 18,000 new lemmas: 12,974 nominals and 5,034 verbs

Extending the Lexical Database

• Handling Broken Plurals

jAnib (side)

jawAnib (sides)

Poor handling of broken plural in Buckwalter

(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc> <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>

(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc> <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>

Two differences: voc and gloss

Extending the Lexical Database

• Extracting Broken Plurals

<gloss>side/aspect</gloss>

<gloss>sides/aspects</gloss>

We use Levenshtein distance, which measures the difference between two strings (here, glosses that share the same lemmaID).

distance of 2 / length of the first string = 0.15 (within the threshold 0.4)

We collect 2,266 candidates
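The gloss comparison can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the distance is normalised by the longer of the two glosses, an assumption that reproduces 2 / 13 ≈ 0.15 for the example pair (the slide itself says "the first string").

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(gloss_a: str, gloss_b: str) -> float:
    """Distance divided by the longer gloss length (assumed denominator)."""
    longest = max(len(gloss_a), len(gloss_b))
    return levenshtein(gloss_a, gloss_b) / longest if longest else 0.0


def is_plural_candidate(gloss_a: str, gloss_b: str, threshold: float = 0.4) -> bool:
    """Two glosses under one lemmaID are candidates if they are close enough."""
    return normalized_distance(gloss_a, gloss_b) <= threshold


print(levenshtein("side/aspect", "sides/aspects"))
print(is_plural_candidate("side/aspect", "sides/aspects"))
```

Normalising by length keeps the 0.4 threshold meaningful for both short and long glosses.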

Extending the Lexical Database

• Validating Broken Plurals

<voc>jAnib</voc> singular

pattern is: fAEil

regex is: .A.i.

<voc>jawAnib</voc> plural

pattern is: fawAEil

regex is: .awA.i.

Pattern database: 135 singular patterns that choose from a set of 82 broken plural patterns

2,266 candidates -> 1,965 are validated (87%)
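The validation step can be sketched with regular expressions. This is an illustration under assumptions: `PATTERN_DB` is a hypothetical one-entry excerpt holding only the fAEil / fawAEil pair from the example, not the actual database of 135 singular and 82 plural patterns, and the function name is invented.

```python
import re

# Hypothetical excerpt: each singular regex maps to the broken-plural
# regexes it may select. Only the example pair from the slide is included.
PATTERN_DB = {
    r".A.i.": [r".awA.i."],  # fAEil -> fawAEil
}


def is_valid_broken_plural(singular: str, plural: str, pattern_db=PATTERN_DB) -> bool:
    """Accept a candidate pair if the singular matches a known pattern and
    the plural matches one of the plural patterns licensed by it."""
    for sg_regex, pl_regexes in pattern_db.items():
        if re.fullmatch(sg_regex, singular):
            if any(re.fullmatch(pl, plural) for pl in pl_regexes):
                return True
    return False


print(is_valid_broken_plural("jAnib", "jawAnib"))
```

Note that this sketch checks only the vowel templates; it does not verify that the root consonants of singular and plural agree.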

Extending the Lexical Database

• Interesting statistics on Arabic plurals

Insights from the corpus:

5,570 lemmas have a feminine plural suffix

1,942 lemmas have a masculine plural suffix

2,730 lemmas have a broken plural form

Extraction of Subcat Frames

• Importance of subcategorization frames

• Advantage of Automatic Extraction

• Available Resource on Arabic Subcat Frames:

– None except the Arabic LFG Parser (Attia, 2008), available as open source

Extraction of Subcat Frames

What are LFG subcat frames?

• Governable GFs (SUBJ, OBJ, OBJθ, OBLθ, COMP and XCOMP)

• Non-governable GFs (ADJ and XADJ)

π<gf1,gf2,…gfn>

{iEotamada Al-Tifolu EalaY wAlidati-hi “The child relied on his mother”

{iEotamada<(↑SUBJ)(↑OBL_EalaY)>

Extraction of Subcat Frames

Automatic extraction of subcat frames

• The ATB contains 22,524 sentences

• LFG Annotation algorithm (DCU)

• Traversing trees and looking for dependencies

• Lemmatization

• We extract 7,746 lemma-frame types (for verbs, nouns and adjectives)

Extraction of Subcat Frames

Estimating the Subcategorization Probability
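As a rough sketch, and assuming a plain relative-frequency estimate (the count of a lemma with a given frame divided by the count of that lemma with any frame), the probability of a frame given a lemma could be computed as below. The function name and the toy observations are hypothetical, not figures from the treebank.

```python
from collections import Counter, defaultdict


def frame_probabilities(observations):
    """Estimate P(frame | lemma) by relative frequency.

    observations: iterable of (lemma, frame) pairs, one per extracted token.
    Returns {lemma: {frame: probability}}.
    """
    counts = defaultdict(Counter)
    for lemma, frame in observations:
        counts[lemma][frame] += 1
    return {
        lemma: {frame: n / sum(frames.values()) for frame, n in frames.items()}
        for lemma, frames in counts.items()
    }


# Toy, made-up observations for illustration only.
toy = [
    ("{iEotamada", "SUBJ,OBL"),
    ("{iEotamada", "SUBJ,OBL"),
    ("{iEotamada", "SUBJ,OBL"),
    ("{iEotamada", "SUBJ,OBJ"),
]
print(frame_probabilities(toy))
```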

Extraction of Subcat Frames

Evaluating the Subcategorization Extraction

Extraction of Subcat Frames

Evaluating the Subcategorization Extraction

Web Application

• AraComLex Lexicon Writing Application

www.cngl.ie/aracomlex

Byproducts of the Work

A number of open-source resources:

• Finite-state morphological transducer

• Arabic morphological patterns

• Subcategorization frames

• Arabic lemma frequency counts

Conclusion

• We successfully use machine learning to predict morpho-syntactic features for newly acquired words.

• We successfully extract subcategorization frames from the Penn Arabic Treebank

• We provide the specification and implementation of an Arabic lexicographic web application.