© Fadoua Ataa Allah and Siham Boulaknadel
Natural Language Processing for Amazigh Language:
Challenges and Future Directions
Fadoua Ataa Allah, Siham Boulaknadel
CEISIC, IRCAM
{ataaallah, boulaknadel}@ircam.ma
LREC-2012: SALTMIL-AfLaT Workshop 2
Outline
Amazigh Language
Amazigh Complexity in NLP
State of the Technology on Amazigh
Future Directions
Amazigh language
Sociolinguistic Context
North African autochthonous language
Spoken by millions of people, in several dialects
LREC-2012: SALTMIL-AfLaT Workshop, 10/07/2012
Amazigh language
Sociolinguistic Context
Languages of Morocco:
Classical Arabic is an official language.
Amazigh has been an official language since 2011.
Moroccan Arabic, or Darija, is the colloquial variety in diglossia with Classical Arabic.
French is the first foreign language.
Spanish is used in the north of Morocco.
English is becoming the second foreign language.
Amazigh language
History
The Amazigh abjad, Tifinagh, has been attested for 25 centuries.
Its written form has continued to change, from the traditional Tuareg writing to Tifinaghe-IRCAM.
Figure: Tinzouline Inscriptions (Zagora, Morocco)
Amazigh language
History
Direction
Figure: Plate 9, Anou Elias, Mammanet Valley (Niger). Source: Henri Lhote, Oued Mammanet gravures, Les Nouvelles Editions Africaines, 1979.
Moroccan Amazigh characteristics
Amazigh writing system
Direction: horizontal, from left to right.
Alphabet:
27 consonants: ⴱ, ⴳ, ⴳⵯ, ⴷ, ⴹ, ⴼ, ⴽ, ⴽⵯ, ⵀ, ⵃ, ⵄ, ⵅ, ⵇ, ⵊ, ⵍ, ⵎ, ⵏ, ⵔ, ⵕ, ⵖ, ⵙ, ⵚ, ⵛ, ⵜ, ⵟ, ⵣ, ⵥ;
2 semi-consonants: ⵢ and ⵡ;
4 vowels: ⴰ, ⵉ, ⵓ, ⴻ.
Punctuation marks: conventional signs, including “ ” (space), “.”, “,”, “;”, “:”, “?”, “!”, “…”, etc.
Numerals: Hindu-Arabic numerals [0-9].
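The letter inventory above can be held in a small lookup table. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the code-point range U+2D30 to U+2D7F is the Tifinagh block of the Unicode Standard. The labialized letters ⴳⵯ and ⴽⵯ are treated as single letters, each written as a base letter followed by the labialization mark.

```python
# Letter inventory copied from the slide; category names are illustrative.
CONSONANTS = "ⴱ ⴳ ⴳⵯ ⴷ ⴹ ⴼ ⴽ ⴽⵯ ⵀ ⵃ ⵄ ⵅ ⵇ ⵊ ⵍ ⵎ ⵏ ⵔ ⵕ ⵖ ⵙ ⵚ ⵛ ⵜ ⵟ ⵣ ⵥ".split()
SEMI_CONSONANTS = ["ⵢ", "ⵡ"]
VOWELS = ["ⴰ", "ⵉ", "ⵓ", "ⴻ"]
LABIALIZATION = "ⵯ"  # U+2D6F, written after ⴳ or ⴽ

CATEGORY = {}
for group, name in (
    (CONSONANTS, "consonant"),
    (SEMI_CONSONANTS, "semi-consonant"),
    (VOWELS, "vowel"),
):
    for letter in group:
        CATEGORY[letter] = name


def in_tifinagh_block(ch):
    """True if the character lies in the Unicode Tifinagh block."""
    return 0x2D30 <= ord(ch) <= 0x2D7F


def letters(word):
    """Split a word into letters, keeping ⴳⵯ and ⴽⵯ as single units."""
    out = []
    for ch in word:
        if ch == LABIALIZATION and out:
            out[-1] += ch  # attach the mark to the preceding base letter
        else:
            out.append(ch)
    return out


def classify(word):
    """Pair each letter of a word with its category ('other' if unknown)."""
    return [(letter, CATEGORY.get(letter, "other")) for letter in letters(word)]
```

For example, `classify("ⵜⴰⵏⵎⵉⵔⵜ")` pairs each letter of the closing word of the slides with its category.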
Amazigh Complexity in NLP
Different writing forms
Complex phonological and phonetic systems
Rich morphology
Amazigh Complexity in NLP
Amazigh script
Converting existing written material into Tifinaghe Unicode is confronted with:
Spelling variation related to regional varieties ([tfucht] vs. [tafukt] (sun)),
Spelling variation in the use or omission of spaces within or between words ([tadartino] vs. [tadart ino] (my house)),
Arabic or Latin transcription systems.
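Of the variation types listed above, the spacing one is the easiest to illustrate. The sketch below is an assumption of ours and not a method described in the slides: it groups forms that differ only in spacing by comparing them with whitespace removed. The function names are hypothetical. Regional variants such as [tfucht] and [tafukt] are not merged by this approach and would need linguistic knowledge.

```python
def spacing_key(text):
    """Key that ignores whitespace, so 'tadart ino' and 'tadartino' collide."""
    return "".join(text.split())


def group_spacing_variants(forms):
    """Group written forms that are identical once spaces are removed."""
    groups = {}
    for form in forms:
        groups.setdefault(spacing_key(form), []).append(form)
    return groups
```

Called on the four example forms from the slide, this puts the two spellings of "my house" in one group and leaves the two regional spellings of "sun" in separate groups.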
Amazigh Complexity in NLP
Phonology & phonetics
The main problem of Amazigh phonology and phonetics lies in allophones:
/ll/ is realized as [dj] in the North.
Amazigh Complexity in NLP
Morphology
Highly inflected language.
Word structure: Prefix + Stem + Suffix
Affix set: prefixes, infixes, and suffixes.
The base form varies with paradigms (qqim → svim (make sit)).
State of the Amazigh technology
Tifinaghe Encoding
Optical character recognition
Fundamental processing tools
Language resources
State of the Amazigh technology
Tifinaghe Encoding
Encodings: ANSI, Unicode
State of the Amazigh technology
OCR
Amazigh OCR systems:
A system focused on isolated printed characters, based on a syntactic approach using finite automata.
A global approach based on Hidden Markov Models for recognizing handwritten characters.
A method using invariant moments for recognizing printed script.
A system based on an artificial neural network to recognize printed characters.
State of the Amazigh technology
Fundamental processing
Transliterator
Tagging assistance tool
Light stemmer
Search engine
Concordancer
State of the Amazigh technology
Fundamental processing
Transliterator
Diagram components: Converter, Transliterator
Diagram scripts: Tifinaghe Unicode, Tifinaghe Latin, Arabic script, Latin script
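One direction of such a tool, Tifinaghe Unicode to Latin script, can be sketched as a character-by-character table lookup. This is a minimal sketch under stated assumptions: the mapping below follows a commonly used Latin transliteration of the IRCAM alphabet and is not necessarily the table used by the authors' tool, and the function name is hypothetical. Characters outside the table (spaces, punctuation, digits) are passed through unchanged.

```python
# Illustrative mapping (an assumption, not taken from the slides).
TIFINAGH_TO_LATIN = {
    "ⴰ": "a", "ⴱ": "b", "ⴳ": "g", "ⴷ": "d", "ⴹ": "ḍ", "ⴻ": "e", "ⴼ": "f",
    "ⴽ": "k", "ⵀ": "h", "ⵃ": "ḥ", "ⵄ": "ɛ", "ⵅ": "x", "ⵇ": "q", "ⵉ": "i",
    "ⵊ": "j", "ⵍ": "l", "ⵎ": "m", "ⵏ": "n", "ⵓ": "u", "ⵔ": "r", "ⵕ": "ṛ",
    "ⵖ": "ɣ", "ⵙ": "s", "ⵚ": "ṣ", "ⵛ": "c", "ⵜ": "t", "ⵟ": "ṭ", "ⵡ": "w",
    "ⵢ": "y", "ⵣ": "z", "ⵥ": "ẓ",
    "ⵯ": "ʷ",  # labialization mark, as in ⴳⵯ and ⴽⵯ
}


def transliterate(text):
    """Replace each Tifinagh character by its Latin counterpart.

    Characters without an entry in the table are copied as they are.
    """
    return "".join(TIFINAGH_TO_LATIN.get(ch, ch) for ch in text)
```

A reverse transliterator would need more than an inverted table, since Latin-script Amazigh text shows the spelling variation described earlier in the slides.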
State of the Amazigh technology
Fundamental processing
Tagging assistance tool
Diagram labels: Amazigh raw corpora; Tag set; Tokenization; Manual stemming; Manual POS tagging; Validation; Standard output; Tagged corpus; Stem list
State of the Amazigh technology
Fundamental processing
Light stemmer
Flowchart: Begin; Prefix + Stem + Suffix; find the largest prefix; Stem + Suffix; find the largest suffix; Stem; End
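The flowchart steps can be sketched as longest-match affix stripping. This is a sketch under stated assumptions, not the authors' implementation: the affix lists in the usage note are hypothetical placeholders in Latin transliteration, and the `min_stem` guard, which prevents stripping from leaving a very short stem, is our addition.

```python
def light_stem(word, prefixes, suffixes, min_stem=2):
    """Strip the largest matching prefix, then the largest matching suffix.

    Returns a (prefix, stem, suffix) triple. `min_stem` is an assumed guard:
    an affix is only removed if at least that many characters remain.
    """
    # Step 1: Prefix + Stem + Suffix -> Stem + Suffix
    prefix = max(
        (p for p in prefixes
         if word.startswith(p) and len(word) - len(p) >= min_stem),
        key=len,
        default="",
    )
    rest = word[len(prefix):]

    # Step 2: Stem + Suffix -> Stem
    suffix = max(
        (s for s in suffixes
         if rest.endswith(s) and len(rest) - len(s) >= min_stem),
        key=len,
        default="",
    )
    stem = rest[:len(rest) - len(suffix)]
    return prefix, stem, suffix
```

With the placeholder lists `prefixes = ["t", "ta"]` and `suffixes = ["t", "in"]`, the call `light_stem("tadart", prefixes, suffixes)` picks the longer prefix "ta" over "t" and then removes the suffix "t". A real stemmer would take its affix lists from a description of Amazigh morphology.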
State of the Amazigh technology
Fundamental processing
Search engine
Architecture diagram, by stage:
Data Crawling: Web, Crawler, Repository
Data Indexing: Indexer, Natural Language Processing Tools, Index
Data Searching: User Interface, Query Engine, Natural Language Processing Tools
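The indexing and searching stages can be illustrated with a toy inverted index. This is a minimal sketch with hypothetical names, not the architecture of the authors' system: a dictionary stands in for the crawled repository, and whitespace tokenization stands in for the natural language processing tools, which in a real system might include stemming or transliteration.

```python
from collections import defaultdict


def tokenize(text):
    """Placeholder for the NLP tools: whitespace tokenization only."""
    return text.lower().split()


def build_index(repository):
    """Build an inverted index from a {doc_id: text} repository."""
    index = defaultdict(set)
    for doc_id, text in repository.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Applying the same `tokenize` function to documents and queries is the design point the diagram suggests by listing the NLP tools in both the indexing and the searching stage.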
State of the Amazigh technology
Fundamental processing
Concordancer
Diagram labels: input field; input files (.txt, .doc, .pdf, .zip); Tokenization; Word / expression; Context display; List of the text words and their frequency
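The two outputs named in the diagram, a word frequency list and a context display, can be sketched as follows. This is a minimal keyword-in-context sketch with hypothetical function names, assuming plain text that has already been extracted from the input file and whitespace tokenization. It is not the authors' tool.

```python
from collections import Counter


def tokenize(text):
    """Whitespace tokenization (an assumption; the real tool may differ)."""
    return text.split()


def word_frequencies(text):
    """List of the text's words and their frequency."""
    return Counter(tokenize(text))


def concordance(text, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence.

    `width` is the number of tokens shown on each side of the keyword.
    """
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

`word_frequencies(text).most_common()` gives the frequency list sorted from most to least frequent.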
State of the Amazigh technology
Language resources
Corpora
Dictionary
Terminology database
State of the Amazigh technology
Language resources
Corpora:
General corpus,
POS corpus.
State of the Amazigh technology
Language resources
Dictionary:
Definition,
Arabic equivalent words,
French equivalent words,
English equivalent words,
Synonyms,
Classification by domains,
Derivational families.
State of the Amazigh technology
Language resources
Terminology database:
Media vocabulary
Grammatical vocabulary
Future Directions
Building large and representative Amazigh corpora.
Developing a machine translation system.
Creating a pool of competent human resources.
Thank you for your attention
ⵜⴰⵏⵎⵉⵔⵜ