6
Probabilistic Bi-directional Root-Pattern Relationships as Cognitive Model for Semantic Processing of Arabic Bassam Haddad Petra University/Department of Computer Science, Amman, Jordan [email protected] ; [email protected] Abstract— This paper is attempting in terms of a formal statistical language model to review the overall domination of Arabic morphology as a non-linear or non-concatenated processing system in the case of word identification. The basic components of this model are relying on bi-directional probabilistic root-pattern relationships acting as cognitive morphological factors for word recognition. Considering a root in the mental lexicon as the highest level of semantic abstraction for a morpheme allows the view of considering words as a functional or applicative process instantiating the most probable or known pattern to the most plausible root. As Arabic is known for its highly inflectional morphological structure and its high tendency to pattern and root ambiguity (Root-Homonymy and Pattern Polysemy) this model is assuming bi-directional morphological background knowledge for resolving ambiguities in form of probabilistic semantic network. As a major consequence, this paper is stressing the significance of this phenomenon in designing Arabic interactive cognitive systems particularly those related to interactive Arabic natural language understanding and word recognition and corrections. I. INTRODUCTION AND MOTIVATION This paper is dealing with some aspects related to cognitive informatics and cognitive linguistics in the context of the word identification problem in Arabic and, generally in other closed Semitic languages. Within a statistical language model, the domination of the root- pattern aspect in word identification is reviewed and in- vestigated. As a key result of this research, this paper is proposing considering the overall domination of the root- pattern phenomenon in designing cognitive systems such as interactive Arabic natural language understanding and in particular interactive word recognition, corrections and Arabic based e-leaning systems. In fact, since entering the age of Arabic computational linguistics, discussions about lexical semantic represen- tation of Arabic were controversial. Classical Arabic linguisticians recognized a long time ago the semantic dimension of the root-pattern phenomenon and strongly employed its significance in their theoretical lexical and syntactical analysis. However, after the appearance and implementation of the first computational models, and due to the overall unforeseen computational complexity, and possibly the dominating and relatively advanced Indo- German languages Natural Language Processing Technology, some issues were raised suggesting stem- based or lexeme-based units as an alternative [4,11] for building simpler morphological model for lexical analysis. According to the laws of ambiguities in [4], ambiguity increases proportionally with lifting the abstraction of lexical units. There are also some reports coming from Information Retrieval discussing the concept of stem- based indexing versus root-based indexing in the context of reducing computational complexity and increasing precision [8,9]. On the other hand, current cognitive linguistic research concerned with word identification for Semitic languages and in particular for Arabic and Hebrew; is increasing delivering increasingly evidence for the sensitivity of readers to root and pattern during visual word processing. These cognitive models proceed from the assumption, that word recognition can be viewed as a dynamic process operating on a mental lexicon. This view relies on the fact, that words actually contain basic phonological, semantic and orthographic characteristics, which need further morphological processing and organization [1, 2, 3, 10] and [12]. Since starting our research in Arabic Natural Language Processing, NLP aiming at building a comprehensive Natural Language understanding system, we adopted the root-pattern paradigm, motivated by our understanding of the nature of Arabic, and encouraged by the delivered indications by psycho-cognitive research in retrieval of linguistic constituents (phonological, semantic) of a word. Relying on these elements, we are in the process of pursuing our research based on a statistical language model for Arabic utilizing the root-pattern paradigm as a theoretical background and conceptual model; i.e. Statistical Language Model for Arabic based on Probabilistic Bi-directional Root-Pattern Relations”, (APRoPat). The application potentials of this model are wide-ranging, including supporting Morphological Analysis; Word Sense Disambiguation, Spell-Checking, POS-Tagging and supporting indexing and Ranking in root-pattern indexing based Search Engines, particularly when integrating this view within cognitive systems [5,6,7,8]. In this paper the focus of attention will be set on the root-pattern morphological organizational structure, emerging from the view, that roots in Arabic represent autonomous semantic lexical units. In the following sections, morphological factors in the context of the word identification problem will be introduced, whereas word recognition process will be reviewed within the main features of a statistical language model. Furthermore, empirical analysis toward model justification will be 279 978-1-4673-5188-1/12/$31.00 ©2012 IEEE CogInfoCom 2012 • 3rd IEEE International Conference on Cognitive Infocommunications • December 2-5, 2012, Kosice, Slovakia

[IEEE 2012 IEEE 3rd International Conference on Cognitive Infocommunications (CogInfoCom) - Kosice, Slovakia (2012.12.2-2012.12.5)] 2012 IEEE 3rd International Conference on Cognitive

  • Upload
    bassam

  • View
    213

  • Download
    1

Embed Size (px)

Citation preview

Page 1: [IEEE 2012 IEEE 3rd International Conference on Cognitive Infocommunications (CogInfoCom) - Kosice, Slovakia (2012.12.2-2012.12.5)] 2012 IEEE 3rd International Conference on Cognitive

Probabilistic Bi-directional Root-Pattern Relationships as Cognitive Model for Semantic

Processing of Arabic

Bassam Haddad Petra University/Department of Computer Science, Amman, Jordan

[email protected]; [email protected]

Abstract— This paper is attempting in terms of a formal statistical language model to review the overall domination of Arabic morphology as a non-linear or non-concatenated processing system in the case of word identification. The basic components of this model are relying on bi-directional probabilistic root-pattern relationships acting as cognitive morphological factors for word recognition. Considering a root in the mental lexicon as the highest level of semantic abstraction for a morpheme allows the view of considering words as a functional or applicative process instantiating the most probable or known pattern to the most plausible root. As Arabic is known for its highly inflectional morphological structure and its high tendency to pattern and root ambiguity (Root-Homonymy and Pattern Polysemy) this model is assuming bi-directional morphological background knowledge for resolving ambiguities in form of probabilistic semantic network. As a major consequence, this paper is stressing the significance of this phenomenon in designing Arabic interactive cognitive systems particularly those related to interactive Arabic natural language understanding and word recognition and corrections.

I. INTRODUCTION AND MOTIVATION

This paper is dealing with some aspects related to cognitive informatics and cognitive linguistics in the context of the word identification problem in Arabic and, generally in other closed Semitic languages. Within a statistical language model, the domination of the root-pattern aspect in word identification is reviewed and in-vestigated. As a key result of this research, this paper is proposing considering the overall domination of the root-pattern phenomenon in designing cognitive systems such as interactive Arabic natural language understanding and in particular interactive word recognition, corrections and Arabic based e-leaning systems.

In fact, since entering the age of Arabic computational linguistics, discussions about lexical semantic represen-tation of Arabic were controversial. Classical Arabic linguisticians recognized a long time ago the semantic dimension of the root-pattern phenomenon and strongly employed its significance in their theoretical lexical and syntactical analysis. However, after the appearance and implementation of the first computational models, and due to the overall unforeseen computational complexity, and possibly the dominating and relatively advanced Indo-German languages Natural Language Processing Technology, some issues were raised suggesting stem-based or lexeme-based units as an alternative [4,11] for

building simpler morphological model for lexical analysis. According to the laws of ambiguities in [4], ambiguity increases proportionally with lifting the abstraction of lexical units. There are also some reports coming from Information Retrieval discussing the concept of stem-based indexing versus root-based indexing in the context of reducing computational complexity and increasing precision [8,9].

On the other hand, current cognitive linguistic research concerned with word identification for Semitic languages and in particular for Arabic and Hebrew; is increasing delivering increasingly evidence for the sensitivity of readers to root and pattern during visual word processing. These cognitive models proceed from the assumption, that word recognition can be viewed as a dynamic process operating on a mental lexicon. This view relies on the fact, that words actually contain basic phonological, semantic and orthographic characteristics, which need further morphological processing and organization [1, 2, 3, 10] and [12].

Since starting our research in Arabic Natural Language Processing, NLP aiming at building a comprehensive Natural Language understanding system, we adopted the root-pattern paradigm, motivated by our understanding of the nature of Arabic, and encouraged by the delivered indications by psycho-cognitive research in retrieval of linguistic constituents (phonological, semantic) of a word. Relying on these elements, we are in the process of pursuing our research based on a statistical language model for Arabic utilizing the root-pattern paradigm as a theoretical background and conceptual model; i.e. “Statistical Language Model for Arabic based on Probabilistic Bi-directional Root-Pattern Relations”, (APRoPat). The application potentials of this model are wide-ranging, including supporting Morphological Analysis; Word Sense Disambiguation, Spell-Checking, POS-Tagging and supporting indexing and Ranking in root-pattern indexing based Search Engines, particularly when integrating this view within cognitive systems [5,6,7,8].

In this paper the focus of attention will be set on the root-pattern morphological organizational structure, emerging from the view, that roots in Arabic represent autonomous semantic lexical units. In the following sections, morphological factors in the context of the word identification problem will be introduced, whereas word recognition process will be reviewed within the main features of a statistical language model. Furthermore, empirical analysis toward model justification will be

279978-1-4673-5188-1/12/$31.00 ©2012 IEEE

CogInfoCom 2012 • 3rd IEEE International Conference on Cognitive Infocommunications • December 2-5, 2012, Kosice, Slovakia

Page 2: [IEEE 2012 IEEE 3rd International Conference on Cognitive Infocommunications (CogInfoCom) - Kosice, Slovakia (2012.12.2-2012.12.5)] 2012 IEEE 3rd International Conference on Cognitive

treated as well in the form of a certain implementation of this model, designed for non-word recognition and correction. In addition, this model will be conceptually compared in the context of Non-Word identification problem with MS-Word spell-Checker.

II. MORPHOLOGICAL FACTORS FOR WORD IDENTIFICATION

A. Characteristics of Arabic Morphology Arabic and other Semitic languages belong to a singular

non-concatenative or non-linear morphological class; whereas root letters are decisive for forming the majority of words. Roots accordingly represent the highest level of symbolic abstraction for a word. In this context, roots correspond to ground-morphemes as the smallest meaningful elements; however without considering the affixes as is the case in the indo-German languages.

Morphologically, words can be classified into three major lexical classes: Basic Derivative, Rigid (Non-Derivative or inflectional) and Arabized Words [5, 6, 7].

Basic Derivative Arabic Words form the overwhelm-ing majority of the Arabic lexical vocabulary. Most of these words can be generated from a templatic tri-radical root or a quadri-radical root by adding consistent prefixes and suffixes or filling vowels in a predetermined phonological pattern form. The patterns have two syntactical categories: verbal and nominal word patterns in addition to some semantic contents.

Non-Derivative words include the lexical non-inflectional word types such as pronouns, adverbs, par-ticles besides stem words, which cannot be reduced into a known root, whereas Arabized Basic Words consist of

words without Arabic origin such as (/ا�رت�ت/, Internet).

Arabized Basic Words, and Non-Derivative Words, do not linguistically exhibit an understandable root-pattern relationship [6].

B. Semantic Relationship between Root and Patterns

In the scientific community of cognitive science, there is a general agreement, that morphological factors affect word recognition, e.g. morphologically related words induce priming effects. The observation, that morphological manipulation affects word recognition, supports the view, that morphological units are explicitly represented in the mental lexicon facilitating the word

recognition process; e.g. the Arabic root ( /كتب/ , ktb,

Writing) as a prime for ( / ِكتابٌ / , kitābun, Book in singular

and in nominative). In Arabic, a root can be considered as a basic morphological unit carrying the core semantic and meaning of a possible word, whereas pattern word or word form as an instantiated word (see Definition 1 and 2) creates variation on the meaning and on the syntactical category. The exact meaning of a word cannot be predicated unambiguously without considering its morphological units; i.e. the root and its patterns.

Furthermore, the fact that a root is interwoven in a phonetic pattern and not necessarily contiguously si-tuated in a word, allows investigating morphological factors non-linearly. This view allows the possibility of organizing morphological units in the form of an asso-ciative or semantic relationship between constitutes of

Arabic morphology. Moreover, considering the possible ambiguity occurring during word identification, this model is assuming some kind of uncertainty among such root-pattern associative relationships.

In the following, word identification will be considered as an applicative process or pattern matching process. This aspect is interesting as this model is operating on abstract relationships between roots and patterns, and the meaning of a word cannot be exactly determined without some kind of substitution of the root radical within an adequate or consistent phonetic pattern. Additionally, the uncertainty aspect will be also considered by introducing bi-directional associative relations in form of probabilistic rules expressing an associative relationship on the root-pattern level.

C. Word Identification as Applicative Processess In the following some preliminary and basic notations

are introduced based on [5]. The main goal is to formalize Derivative Basic Words functionally; i.e. as root-pattern matching and substitution process on the level of Basic Derivative Word.

The set of all Arabic roots and patterns will be rep-resented by and receptively:

= {r1,r2,r3,...,rr},

= {pt1,pt2,pt3,...,ptpt} (1)

Let SUBroot be a root substitution replacing the root letters with letters occurring in a pattern.

Definition 1 (SUBroot : Templatic Root-Pattern Substitution)

Let (فr , fr , C1), (ع, ،r , C2) and (لr ,lr ,C3) be the templatic

root literals and pti ∈ containing the templatic root radicals, then a templatic root-pattern substitution is defined as

means substitute the first root literal (pt , fptف)/(r, fr , C1ف)

See) .(pt, fptف) with the pattern letter (r, fr ,C1ف)

transcription in Appendix).

Definition 2 (Root Reduction as an Instantiation of a Templatic associative Root-Pattern Relationship)

Let r ∈ and pt ∈ then a basic word can be represented as a Lambda-abstraction, which can be generated by application of a templatic root-pattern substitution “SUBroot “ to the pattern pt:

λ r. pt SUBroot (3)

This expression means to find a root (the most plausible), which is applicable to a given pattern using a Templatic Root-Pattern Substitution). By further analy-

SUBroot = {(فr , fr, C1)/(فpt, fpt ),

,(pt , ،ptع )/(r , ،r , C2ع)

{ ( pt, lptل)/( r, lr, C3ل)

(2)

B. Haddad • Probabilistic bi-directional root-pattern relationships as cognitive model for semantic processing of arabic

280

Page 3: [IEEE 2012 IEEE 3rd International Conference on Cognitive Infocommunications (CogInfoCom) - Kosice, Slovakia (2012.12.2-2012.12.5)] 2012 IEEE 3rd International Conference on Cognitive

sis, this expression can be understood as an uncertain or probabilistic rule for root extraction.

Most Arabic words can be generated from the tem-

platic tri-literal root (/فعل/, f‘l, C1C2C3) (see appendix) or

the quadri-literal root (/فعلل/, f‘ll, C1C2C3C3), whereas for

each valid Arabic root, ri , there is a certain number of consistent patterns, ptj, with which a root can be instanti-ated. The grade of consistency, expresses the presence of an associative relationship between a root and a pattern priming some word forms. Therefore, a lexical derivative Arabic word can be understood as a result of applying a substitution of a root literal with the corresponding con-sistent pattern literals; i.e. a generative or derivative process. Such a substitution can be regarded as a trans-formation operation of a root into a pattern word or an in-stantiation of template with root letters.

Example:

Let (/ علاف /, fā‘il, C1āC2iC3)1 be a pattern ∈ and (/كتب /, ktb, Writing) ∈, then the application of the templatic root substitution under using such root to the pattern

(/ علاف /, fā‘il, C1āC2iC3) generates the following word:

λ r. (/ لعِاف /, fā‘il, C1āC2iC3 ) SUBroot (/كتب /, ktb, Writing) =

(/ تباك /, kātib, Writer) 2.

Definition 3 (Pattern Reduction as a Templatic associative Root-Pattern Relationship)

Let r ∈ and pt ∈ then a root can be represented as a Lambda-abstraction, which can be extracted by application of a templatic root-pattern substitution “SUBroot “ to the root r:

λ pt. rSUBroot (4)

This expression means to find (the most plausible) associative or a consistent pattern form for a given root. By further analysis, this process can be understood as uncertain rule to confirm the association grade between the given root and a possible pattern.

The above introduced definitions are attempting to view the process of basic word identification as a bi-directional non-trivial search problem. Based on a lexical abstract root, which might be stored in the mental lexicon and a given word form associated with a phonetic pattern, a forward and backward process to derive or generate a consistent word.

D. Word Identification as a Search under Uncertainty Due to historical reasons and difficulties in represent-

ing short vowels without diacritics, the overwhelming existing written Arabic texts are not vocalized. Con-

1 Root radicals are depicted in bold to illustrate the application of the root-pattern substitution (see Appendix). 2 Generally a morphologically high density Arabic word can also be represented as an applicative process considering prefixes, suffixes besides their morpho-syntactic, MS and morpho-phonetic, MP Rules such as:

λ MS(Pref+(λMP (λr.Pt))+Suff) However, this is beyond the scope of this paper.

sidering additionally the fact that an Arabic root usually occurs with many different not phonetic vocalized pat-terns, increases the possibility of strong ambiguity and in particular on the lexical level for Arabic. Such ambi-guity is twofold in the sense, that for one root there will be many possible un-vocalized patterns, and for one pattern there might be more than one possible root whereas each root-pattern relationship might represent different word senses. We call, the first type as Pattern Polysemy and the second one as some kind of Root-Homonymy [5].

For solving this problem, this model is proposing to consider a probabilistic approach, motivated by the observation, that humans can quickly identify a frequent pattern or root. In the following a compound bidirectional relationship will be introduced inspired from the probabilistic measure presented in [5,7]

Definition 4 (Probabilistic Bi-directional Root-Pattern Relationship)

Let the root ri ∈, and the pattern ptj ∈ then Probabilistic Bi-directional associative Root-Pattern Re-lationships

ji

iji j

rpv

ppv Ptr ←⎯⎯⎯→ (5)

can be established as follows:

PT iji j i jRoot {((r , pt ), ppv ) | (r , pt )

×

}

(6)

where ij j ippv P(pt /r )

is defined as Pattern-

Predictive value and

}

Root jij i j iPT {((pt , r ), rpv ) | (pt , r ) ∈

×

(7)

where ji i jrpv P(r /pt )

is defined as Root Predictive

value.

PTRoot

can be interpreted as uncertain forward bi-nary rules for identification of a pattern or a basic de-rivative word based on known root as the case in defini-

tion 3. While RootPT

can be interpreted as uncertain backwards binary rule for extracting a root based on known pattern or a word form as the case in definition 4.

This model is assuming the existence of an uncertain relationship between roots and pattern variables measured by some probabilistic measures rpv

ppv

; i.e. root and pattern predictive values. Furthermore roots are lexical units carrying basic semantics, whereas patterns are considered in their variable form, so that a real word form is actually formed not before selecting a consistent phonetic pattern.

Example.:

Let ri=(/كتب /, ktb, Writing)∈ , ptj=(/مَفْعُوْل/, maf‘ūl,

maC1C2ūC3)∈ then based on the pattern predictive value

ijppv j iP(pt /r )

we can establish a binary uncertain

281

CogInfoCom 2012 • 3rd IEEE International Conference on Cognitive Infocommunications • December 2-5, 2012, Kosice, Slovakia

Page 4: [IEEE 2012 IEEE 3rd International Conference on Cognitive Infocommunications (CogInfoCom) - Kosice, Slovakia (2012.12.2-2012.12.5)] 2012 IEEE 3rd International Conference on Cognitive

relation expressing the probability for predicting the

instantiation of the pattern ptj =(/مَفْعُوْل/, maf‘ūl, maC1C2ūC3)

with the given root and vice versa; i.e.

ji i jrpv P(r /pt )

such3:

In this context, basic words identification can be re-garded as probabilistic bidirectional applicative process within a semantic network of associative relationships. In figure 2, bi-directional uncertain relationships between

the root ( /كتب/ , ktb, Writing) and some patterns:

faā‘ilun, aā i un),(/ / 1 2 3C C Cٌِــل , فاَعــ faā‘ilah, aā i ah(/ /, )1 2 3C C Cفَاعِلــــه

illustrating two cases of Pattern Reductions/Matching phonetically based on probabilistic values for representing uncertainty. It is to notice that some addi-tional syntactical and semantic information are also ga-thered with the reduction (applicative) process. This

example infers also that the word ( /كاتبة/ , kātibah, Writer)

as feminine is less frequent than the word (word ( /كاتب/ ,

kātib, Writer ) as masculine in the probabilistic semantic network model.

3 To obtain numerical values for representing the uncertainty between roots and patterns supporting the application potential of this model based on the morphological analysis of a corpus containing 50,544,830 Arabic word-forms in one 990 MB flat file, and Arabic dictionaries of about 31.5MB. Normalized conditional probabilities were assigned to 6860 Arabic roots in association with 650 patterns. The data was analyzed by ATW morphological analyzer of Arabic Textware. At present, I am working on lunching a comprehensive project considering root-particle, root-complement predicative values and refining previously obtained values.

III. EMPIRICAL ANALYSIS

The main features of the Statistical Language Model for Arabic based on Probabilistic Bi-directional Root-Pattern Relations (APRoPat) have been introduced in the context of word identification as a cognitive process (Figure 1). Based on the fundamental elements of APRoPat; namely the Probabilistic Bi-directional Root-

Pattern Relations and their values; i.e. PTRoot

and

RootPT

Values, a prototype program designed for Non-word Recognition and Correction has already been im-planted (Details about Type of error patterns in Arabic are found in [6]. The program was tested and compared with the most popular Arabic Spell-Checker; namely Arabic MS-Word, whereas humans have been asked to deiced whether isolated word lists are correct or not, and if not to try to correct them explaining the error type. Most of the involved persons were students at different university levels coming from different Arabic countries. The main goal of the experiment was not to evaluate MS word Spell-Checker’s performance against our system, but to get an overview about the difference between adopted lexical analysis compared to APRoPat model for word identification and its closeness to act as a computational model simulating a cognitive process for word identification following the root-pattern concept (See Table 1). There are indications for the view of adopting a linear organizational strategy in lexical analysis designed for MS-Word without a clear ranking strategy considering statistical analysis. We believe that such strategies might not produce natural human expected ranked corrections and results, as in the implementation strategy employed in APRoPat model.

However, as a major result of this test, we can con-clude that there are strong indications for the presence of the root-pattern factors in the mental morphological decomposition supported by test for non-word recogni-

(ktb, Writing ,/ كتب/)ji

ij

rpv ppv←⎯⎯⎯→ (maf‘ūl ma C1C2ūC3 ,/ مَفْعُوْل/)

Figure 1: Overview of the APRoPat Model: Root-Pattern and Pattern-Patten (i.e. stem-stem) Search on level of probabilistic abstract root-pattern relationship.

Figure 2: Bi-directional uncertain Relationship between the root

( /كتب/ , ktb, Writing) and some patterns illustrating the two cases of

Pattern Reductions.

B. Haddad • Probabilistic bi-directional root-pattern relationships as cognitive model for semantic processing of arabic

282

Page 5: [IEEE 2012 IEEE 3rd International Conference on Cognitive Infocommunications (CogInfoCom) - Kosice, Slovakia (2012.12.2-2012.12.5)] 2012 IEEE 3rd International Conference on Cognitive

tion considering the root-pattern paradigm, which ac-tually complies with previous studies for other Semitic languages as explained earlier4. The reason for not de-tecting some error types might be explained that deep semantic or cognitive pattern of errors primarily can be detected with deep root-pattern reduction analysis utilizing the root morpheme as the basis semantic unit in the context of contextual knowledge base occurring simultaneously in a network of root-pattern relationships.

IV. OUTLOOK AND CONCLUSION

This paper attempted to review the concept of the root-pattern paradigm in context of the word identification problem. Our current research towards building a comprehensive natural language understanding model for Arabic relies on a Statistical Language Model for Arabic (APRoPat)5. This model can be viewed as a probabilistic network emerging from roots as basic semantic units associated with patterns, with which the basic words can be generated and recognized. As this model is designed to be comprehensive, the focus of attention of presentation was primarily directed toward word identification. Computational aspects and complexity are beyond the scope of this presentation.

For testing the cognitive elements of this model, a non-word recognition system was implemented considering bi-directional root-pattern semantic relationship to identify words and non-words. The gathered results comply with the current studies of cognitive linguistic research concerned with word identification for Semitic languages and in particular for Arabic and Hebrew, which are increasingly supporting the existence of sensitivity of a human to root and patterns during visual word processing. These results are encouraging to review this model within other linguistic problems such as Word Sense Disambiguation; POS-Tagging and supporting indexing and Ranking in Root-Pattern indexing based Search Engines and Ranking Techniques and Machine Learning. As preliminary result of this study, the overall domination of such root-pattern phenomena as linguistic image schemata should have its significance and effect in designing cognitive systems such as interactive Arabic natural language understanding.

REFERENCES [1] Salim Abu-Rabia, “The Role of Morphology and Short

Vowelization in Reading Morphological Complex Words in Arabic: Evidence for the Domination of the Morpheme/Root-Based theory in Reading Arabic”, Scientific Research, Vol.3, No.4, 486-494, 2012.

[2] S. Bentin and Raphiq Ibrahim, “Phonological Processing During Visual Word Recognition: The Case of Arabic”, Journal of Ex-perimental Psychology Learning, Memory, and Cognition, Vol. 22, No. 2, 309-323, 1995.7

[3] S. Bentin and R. Forst, Morphological Factors in word Identifica-tion in Hebrew”, on Feldman L. (ed), Morphological aspects of language processing, Hillsdale NJ: Erlbaum, 271-292, 1994.

4 The Class of class of Non-Derivative words has not been considered in these analysis 5We are in the process of refining and training these values to consider

aspects involved in POS and Word Sense disambiguation and others, whereas there is a further ambition to extend these values to consider a stem based representation. This step is important for studying the inflectional morphology, which actually represent most elements of a linear morphology in Arabic.

[4] J. Dichy, “On lemmatization in Arabic A formal definition of Arabic entries of multilingual lexical databases”, Arabic Lan-guage Processing: Status and Prospects, Association for Com-putational Linguistics (ACL-2001), 10th Conference of the Eu-ropean Chapter, Morgan Kaufman Publisher, France, 2001.

[5] B. Haddad, “Representation of Arabic Words: An Approach towards Probabilistic Root-Pattern Relationships”, International Conference on Knowledge Engineering and Ontology Develop-ment, Madeira, Portugal, 2009.

[6] B. Haddad, Semantic Representation of Arabic: A logical Ap-proach towards Compositionality and Generalized Arabic Quan-

TABLE 1

PROBABILISTIC BI-DIRECTIONAL ROOT-PATTERN IMPLEMENTATION FOR NON-WORD RECOGNITION AND CORRECTION COMPARED WITH

MS-WORD AND HUMAN DECISION. Sample AProPat (with

Ranking Strategy)

Arabic MS-Word

Human Error Pattern

,/رااااااااااامي /)rāāāāāāāāāāmy)

رامي /) /, rāmī) No suggestion

,/رااااااااااامي /)rāāāāāāāāāām)

Insertion:unexpected multiple insertions. AProPat reduces to most probable root

(/ لد435643خا/, ḫā43563lid)

(/ دخال /, ḫālid) ,/خالج /) ḫāliğ) ,/ أخالد /) ’aḫālid) ,/ بخالد/) biḫālid) ,/فخالد /) faḫālid) ,/ آخالد/) kaḫālid)

(/ د خال /, ḫālid) Insertion: non-linear solution without Ranking

خالدمحمود/) /, ḫālidmaḥmūd)

(/ ,/خالد محمودḫālid maḫmūd)

خالد /) ,/محمود ḫālid maḫmūd)

,/خالد محمود /)ḫālid maḫmūd)

Run-on error

( /اجمتاع/ , ǧ’mitā‘)

(/ اجتماع/ , ǧ‘tima )‘

No suggestion

(/ اجتماع/ , ǧ‘tima )‘

Transposi-tion: root-pattern reduction problem (root per-mutation to find most probable valid root)

)/اتطر/ , ’ṭar )

( /اضطر/ , ’ḍṭar ) ( /تطر/ , taṭar ) ( /يتطر/ , yataṭar)

( /ابطر/ , ’bṭar) ( /لتطر/ , litaṭar) ( /اطر/ , ’ṭar ) ( /طرت/ , taṭar) ( /طرتاس/ , ’staṭar)

( /اضطر/ , ’ḍṭar ) Cognitive/Phonetic: possibly slang, Ranking

,/مضلوم /)maḍlūm)

,/مظلوم /)maḏlūm)

(/ مضلون /, maḍlūn) (/ ه مضلو /, maḍlūh) ,/مضلوك /)maḍlūk) ,/مضلو /)maḍlū) (/ م ملو /, malūm)

,/مظلوم /)maḏlūm)

Cognitive/Phonetic: Ranking Root-pattern reduction problem

(/ ,/اسخدم’sḫdama )

(/ استخدم /, ’staḫdama)

(/ اخدم /, ’ḫdama) (/ ,/استخدم’staḫdama) خ دم سا /)/, ’saḫ dama)

(/ استخدم /, ’staḫdama)

Deletion:Ranking

(’ rafī ,/رفيئ/) (/ رفيق /, rafīq) ,/رفي/) rafī )

Cognitive (rafīq ,/ رفيق /)/Phonetic: (slang) Root-pattern based on phonetic

283

CogInfoCom 2012 • 3rd IEEE International Conference on Cognitive Infocommunications • December 2-5, 2012, Kosice, Slovakia

Page 6: [IEEE 2012 IEEE 3rd International Conference on Cognitive Infocommunications (CogInfoCom) - Kosice, Slovakia (2012.12.2-2012.12.5)] 2012 IEEE 3rd International Conference on Cognitive

tifiers. In International Journal of computer processing of oriental languages, IJCPOL 20(1), World Scientific Publishing, 2007.

[7] B. Haddad and M. Yaseen, Detection and Correction of Non-Words in Arabic: A Hybrid Approach. In International Journal of Computer Processing of Oriental Languages, IJCPOL, Vol. 20, Number 4, World Scientific Publishing, 2007.

[8] M. Hattab, B. Haddad, M. Yaseen, A. Duraidi and A. Abu Shmias, “Addaall Arabic Search Engine: Improving Search based on Combination of Morphological Analysis and Generation Considering Semantic Patterns”, The second International Confe-rence on Arabic Language Resources and Tools, Cairo, Egypt, 2009.

[9] H. Moukdad, “Stemming and root-based approaches to the re-trieval of Arabic documents on the Web”, Webology, Volume 3, Number 1, 2006.

[10] K. Rastle, M. H. Davis, W. D. Marslen-Wilson, and L. K. Tyler, “Morphological and semantic effects in visual word recognition: A time-course study”, Languages and Cognitive Processes, 15 (4/5), 507–537, 2000.

[11] A. Soudi, V. Cavalli-Sforza, A “Computational Lexeme-Based Treatment of Arabic Morphology”, Arabic Language Processing: Status and Prospects, Association for Computational Linguistics (ACL-2001), 10th Conference of the European Chapter, Morgan Kaufman Publisher, France, 2001.

[12] L. Verhoeven, and C. A. Perfetti, “Morphological processing in reading acquisition: A cross-linguistic”, Applied Psycholinguistics, 32, 457–466, 2011.

APPENDIX

Transcription of Arabic letters based on DIN and Fischer W., Grammatik des Klassischen Arabisch. Otto Harrassowitz, Wiesbaden, 1972. Long vowels are represented through the letters ( ا , ā), (ى, ī) and :while short vowels as follows ,(ū ,و) ( fatḥa, —َ, a), (kasrah, —ِ, i) and (ḍammah, —ُ, u).

Formal annotations of the adopted transliteration:

I. A root ri∈ is depicted as 3-arguments, such as: (/Arabic form /,

Latin transliteration, abstract meaning). E.g. the root “(/كتب /, ktb,

Writing)”. II. A pattern ptj∈ is also depicted as 3-augments: (/Arabic form /,

Latin transliteration, root radicals template positions), whereas C1

, C2 and C3 represent root radicals variables such as (/مَفْعُوْل/,

maf‘ūl, maC1C2ūC3), i.e. f= C1, ‘= C2 and l= C3. III. Analog a root radical as 3-arguments (Arabic form, Latin

transliteration, Latin Variable trancription) such as the first root

radical, e.g. (فr ,fr,C1)

Letter Transcription Name

hamza ‘ ء

ā ’alif ا

’b Bā ب

’t tā ت

ṯ ā’t ث

’ḥ ḥā ح

’ḫ ḫā خ

d dāl د

ḏ āld ذ

’r rā ر

z zāy ز

s sān س

š šīn ش

ṣ sād ص

ḍ ḍāḍ ض

’ṭ ṭ ā ط

’ẓ ẓ ā ظ

ain‘ ‘ ع

ġ ġain غ

’f f ā ف

q qāf ق

k kāf ك

l lām ل

m mīm م

n nūn ن

’h hā ه

w, ū wāw و

’y, ī y ā ى

B. Haddad • Probabilistic bi-directional root-pattern relationships as cognitive model for semantic processing of arabic

284