Natural Language Processing for Amazigh Language: Challenges and Future Directions

Fadoua Ataa Allah, Siham Boulaknadel

CEISIC, IRCAM

{ataaallah, boulaknadel}@ircam.ma

© Fadoua Ataa Allah and Siham Boulaknadel

LREC-2012: SALTMIL-AfLaT Workshop

Outline

Amazigh Language

Amazigh Complexity in NLP

State of the Technology on Amazigh

Future Directions

Amazigh language: Sociolinguistic Context

North African autochthonous language.

Spoken by millions of people, in several dialects.

LREC-2012: SALTMIL-AfLaT Workshop, 10/07/2012

Amazigh language: Sociolinguistic Context

Languages of Morocco:

Classical Arabic is an official language.

Amazigh has been an official language since 2011.

Moroccan Arabic (Darija) stands in a diglossic relationship with Classical Arabic.

French is the first foreign language.

Spanish is used in the north of Morocco.

English is becoming the second foreign language.

Amazigh language: History

[Figure: Tinzouline Inscriptions (Zagora, Morocco)]

Tifinagh, the Amazigh abjad, has been attested for 25 centuries.

Its written form has continued to change, from the traditional Tuareg writing to Tifinaghe-IRCAM.

Amazigh language: History

Direction

[Figure: Plate 9. Anou Elias, Mammanet Valley (Niger). Source: Henri Lhote, Oued Mammanet gravures, Les Nouvelles Editions Africaines, 1979.]

Moroccan Amazigh characteristics: Amazigh writing system

Direction: horizontal, from left to right.

Alphabet:

27 consonants: ⴱ, ⴳ, ⴳⵯ, ⴷ, ⴹ, ⴼ, ⴽ, ⴽⵯ, ⵀ, ⵃ, ⵄ, ⵅ, ⵇ, ⵊ, ⵍ, ⵎ, ⵏ, ⵔ, ⵕ, ⵖ, ⵙ, ⵚ, ⵛ, ⵜ, ⵟ, ⵣ, ⵥ;

2 semi-consonants: ⵢ and ⵡ;

4 vowels: ⴰ, ⵉ, ⵓ, ⴻ.

Punctuation marks: conventional signs, including “ ” (space), “.”, “,”, “;”, “:”, “?”, “!”, “…”, etc.

Numerals: Hindu-Arabic numerals [0-9].
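Two of the 27 consonants listed on this slide (ⴳⵯ and ⴽⵯ) are written as a base letter followed by the labialization mark ⵯ, so a tool that counts or classifies letters cannot simply iterate over code points. The following is a minimal Python sketch of letter segmentation and classification under that assumption. The names `segment_letters` and `classify` are hypothetical, and the inventories are copied from the slide rather than from any IRCAM tool.

```python
# Letter inventories copied from the slide (assumption: ⵯ is the
# Tifinagh labialization mark, U+2D6F, and always follows its base letter).
LABIALIZATION = "ⵯ"
CONSONANTS = set(
    "ⴱ ⴳ ⴳⵯ ⴷ ⴹ ⴼ ⴽ ⴽⵯ ⵀ ⵃ ⵄ ⵅ ⵇ ⵊ ⵍ ⵎ ⵏ ⵔ ⵕ ⵖ ⵙ ⵚ ⵛ ⵜ ⵟ ⵣ ⵥ".split()
)
SEMI_CONSONANTS = {"ⵢ", "ⵡ"}
VOWELS = {"ⴰ", "ⵉ", "ⵓ", "ⴻ"}


def segment_letters(text):
    """Split text into letters, attaching ⵯ to the letter before it."""
    letters = []
    for ch in text:
        if ch == LABIALIZATION and letters:
            letters[-1] += ch  # ⴳ + ⵯ becomes the single letter ⴳⵯ
        else:
            letters.append(ch)
    return letters


def classify(letter):
    """Return the class of one segmented letter."""
    if letter in VOWELS:
        return "vowel"
    if letter in SEMI_CONSONANTS:
        return "semi-consonant"
    if letter in CONSONANTS:
        return "consonant"
    return "other"  # punctuation, digits, spaces, non-Tifinagh text


if __name__ == "__main__":
    for letter in segment_letters("ⵜⴰⵏⵎⵉⵔⵜ"):
        print(letter, classify(letter))
```

Segmenting before classifying keeps the 27 + 2 + 4 inventory of the slide intact, at the cost of treating a stray ⵯ at the start of a string as an unclassified character.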

Amazigh Complexity in NLP

Different writing forms

Complex phonology and phonetic systems

Rich morphology

Amazigh Complexity in NLP: Amazigh script

The conversion of writing prescriptions into ‘Tifinaghe – Unicode’ is confronted with:

Spelling variation related to regional varieties ([tfucht] / [tafukt] (sun)),

Spelling variation based on the use or the elimination of spaces within or between words ([tadartino] / [tadart ino] (my house)),

Arabic or Latin transcription systems.

Amazigh Complexity in NLP: Phonology & phonetics

The main problem of Amazigh phonology and phonetics lies in allophones:

/ll/ is realized as [dj] in the North.

Amazigh Complexity in NLP: Morphology

Highly inflected language.

Word structure: Prefix + Stem + Suffix

Affixes set: prefixes, infixes, and suffixes.

Base form varies with paradigms: (qqim / svim (make sit)).

State of the Amazigh technology

Tifinaghe Encoding

Optical character recognition

Fundamental processing tools

Language resources

State of the Amazigh technology: Tifinaghe Encoding

ANSI, Unicode

State of the Amazigh technology: OCR

Amazigh OCR systems:

System focused on isolated printed characters, based on a syntactic approach using finite automata.

Global approach based on Hidden Markov Models for recognizing handwritten characters.

Method using invariant moments for recognizing printed script.

System based on an artificial neural network to recognize printed characters.

State of the Amazigh technology: Fundamental processing

Transliterator

Tagging assistance tool

Light stemmer

Search engine

Concordancer

State of the Amazigh technology: Fundamental processing

Transliterator

[Diagram labels: Tifinaghe Unicode, Arabic script, Latin script, Tifinaghe Latin, Converter, Transliterator]
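The transliterator on this slide moves text between Tifinaghe Unicode and other scripts. As an illustration only, here is a minimal Python sketch of a character-by-character mapping between Tifinagh and a Latin transcription. The mapping table is an assumption based on a commonly used Latin transcription of the 33 letters and is not taken from the IRCAM tool; the function names `to_latin` and `to_tifinagh` are hypothetical, and the Arabic-script direction is not covered.

```python
import unicodedata

# Assumed Tifinagh -> Latin table (illustrative, not the IRCAM tool's table).
TIFINAGH_TO_LATIN = {
    "ⴰ": "a", "ⴱ": "b", "ⴳ": "g", "ⴷ": "d", "ⴹ": "ḍ", "ⴻ": "e",
    "ⴼ": "f", "ⴽ": "k", "ⵀ": "h", "ⵃ": "ḥ", "ⵄ": "ɛ", "ⵅ": "x",
    "ⵇ": "q", "ⵉ": "i", "ⵊ": "j", "ⵍ": "l", "ⵎ": "m", "ⵏ": "n",
    "ⵓ": "u", "ⵔ": "r", "ⵕ": "ṛ", "ⵖ": "ɣ", "ⵙ": "s", "ⵚ": "ṣ",
    "ⵛ": "c", "ⵜ": "t", "ⵟ": "ṭ", "ⵡ": "w", "ⵢ": "y", "ⵣ": "z",
    "ⵥ": "ẓ", "ⵯ": "ʷ",
}

# Reverse table; Latin keys are NFC-normalized so that letters with a dot
# below are a single code point and can be looked up one character at a time.
LATIN_TO_TIFINAGH = {
    unicodedata.normalize("NFC", latin): tifinagh
    for tifinagh, latin in TIFINAGH_TO_LATIN.items()
}


def to_latin(text):
    """Transliterate Tifinagh to Latin; unknown characters pass through."""
    return "".join(TIFINAGH_TO_LATIN.get(ch, ch) for ch in text)


def to_tifinagh(text):
    """Transliterate Latin to Tifinagh; unknown characters pass through."""
    text = unicodedata.normalize("NFC", text)
    return "".join(LATIN_TO_TIFINAGH.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(to_latin("ⵜⴰⵏⵎⵉⵔⵜ"))
    print(to_tifinagh("tafukt"))
```

A one-to-one table like this ignores the spelling variation and spacing issues raised earlier in the talk, so a practical tool would need a normalization step before the lookup.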

State of the Amazigh technology: Fundamental processing

Tagging assistance tool

[Diagram labels: Amazigh raw corpora, Tag set, Tokenization, Manual Stemming, Manual POS, Tagged corpus, Stem list, Validation, Standard output]

State of the Amazigh technology: Fundamental processing

Light stemmer

[Flowchart: Begin; Prefix + Stem + Suffix; find the largest prefix; Stem + Suffix; find the largest suffix; Stem; End]
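The light stemmer on this slide reduces a word of the form Prefix + Stem + Suffix to its stem by finding the largest matching prefix and the largest matching suffix. A minimal Python sketch of that idea follows. The function name `light_stem`, the `min_stem_len` guard, the order of the two steps (prefix first), and the tiny affix sets are all assumptions made for illustration; they are not the affix lists of the IRCAM stemmer, and the sketch works on Latin transcription only.

```python
# Placeholder affix sets for the demo (hypothetical, not linguistically vetted).
DEMO_PREFIXES = {"t", "ta"}
DEMO_SUFFIXES = {"n", "ino"}


def light_stem(word, prefixes=DEMO_PREFIXES, suffixes=DEMO_SUFFIXES,
               min_stem_len=2):
    """Strip the longest matching prefix, then the longest matching suffix.

    An affix is removed only if at least `min_stem_len` characters remain.
    """
    # Prefix + Stem + Suffix -> Stem + Suffix
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
            word = word[len(prefix):]
            break
    # Stem + Suffix -> Stem
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["tadartino", "tadart", "ab"]:
        print(w, "->", light_stem(w))
```

Trying affixes from longest to shortest is what makes the match "largest"; the length guard is a common safeguard in light stemmers against stripping a short word down to nothing.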

State of the Amazigh technology: Fundamental processing

Search engine

[Architecture diagram labels: User Interface; Data Searching (Query Engine, Natural Language Processing Tools); Data Indexing (Indexer, Natural Language Processing Tools, Index); Data Crawling (Crawler, Repository); Web]
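The search engine architecture on this slide has a repository filled by a crawler, an indexer that builds an index, and a query engine, with natural language processing tools applied on both the indexing and the searching side. The Python sketch below illustrates only the indexing and querying steps with an in-memory inverted index. Everything in it is an assumption for illustration: the function names, the dictionary standing in for the repository, and the `normalize` function standing in for the NLP tools (a real system might apply transliteration or stemming there).

```python
from collections import defaultdict

PUNCTUATION = ".,;:?!…"


def normalize(token):
    """Stand-in for the NLP tools: lowercase and strip punctuation."""
    return token.lower().strip(PUNCTUATION)


def build_index(repository):
    """Data indexing: map each normalized term to the ids of its documents."""
    index = defaultdict(set)
    for doc_id, text in repository.items():
        for token in text.split():
            term = normalize(token)
            if term:
                index[term].add(doc_id)
    return index


def search(index, query):
    """Data searching: return ids of documents containing every query term."""
    terms = [t for t in (normalize(tok) for tok in query.split()) if t]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Toy repository; the texts are words quoted in the slides, not real pages.
    repository = {1: "tadart ino", 2: "tafukt", 3: "Tadart, tafukt."}
    index = build_index(repository)
    print(search(index, "tadart"))
    print(search(index, "tadart tafukt"))
```

Applying the same `normalize` function at indexing time and at query time is the design point the diagram suggests: a query only matches if both sides are reduced to the same form.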

State of the Amazigh technology: Fundamental processing

Concordancer

[Diagram labels: Input field (.txt, .doc, .pdf, .zip); Tokenization; Word / expression; Context display; List of the text words and their frequency]
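The concordancer on this slide tokenizes an input text, lists the words of the text with their frequency, and displays the context of a searched word or expression. A minimal Python sketch of those three steps is given below. The function names and the `width` parameter are hypothetical, file handling for .txt, .doc, .pdf, and .zip inputs is left out, and only single-word search is shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens (Unicode letters and digits)."""
    return re.findall(r"\w+", text)


def word_frequencies(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    # Arbitrary string of words quoted in the slides, not a real sentence.
    tokens = tokenize("tadart ino tafukt tadart")
    print(word_frequencies(tokens))
    for left, word, right in concordance(tokens, "tadart", width=1):
        print(f"{left:>10} [{word}] {right}")
```

Because `\w` in Python 3 matches Unicode letters, the same tokenizer also works on text written in Tifinagh.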

State of the Amazigh technology: Language resources

Corpora

Dictionary

Terminology database

State of the Amazigh technology: Language resources

Corpora: general corpus, POS corpus.

State of the Amazigh technology: Language resources

Dictionary: definition, Arabic equivalent words, French equivalent words, English equivalent words, synonyms, classification by domains, derivational families.

State of the Amazigh technology: Language resources

Terminology database:

Media vocabulary

Grammatical vocabulary

Future Directions

Building large and representative Amazigh corpora.

Developing a machine translation system.

Creating a pool of competent human resources.

Thank you for your attention

ⵜⴰⵏⵎⵉⵔⵜ