
Arabic and the Computer

Van Mol Mark

Institute of Living Languages – Faculty of Arts, Katholieke Universiteit Leuven, Leuven, Belgium


Keywords: Arabic computing and tagging, lexicography, corpus-based analysis, design of relational databases for Arabic

Abstract: In this paper I discuss the possibilities offered by a multilingual relational database for Arabic in combination with a representative corpus of Arabic. I go into some detail on search results that can be obtained about the frequency of word forms in Arabic. The exploration of a varied corpus teaches us that, when corpora are compiled, the representativeness of the materials is of prime importance.

Introduction

The world has known several technical revolutions. Everybody is acquainted with the

industrial revolution. Mankind has always tried to improve the quality of life and work.

In many domains, progress has been made step by step for several generations. Transport

has improved step by step, from the invention of the bicycle to the production of cars.

Planes and ships have made transport much easier. In the field of manual work, too,

many improvements have been made over the last centuries. Electric devices of all kinds

have been developed from digging machines to bulldozers, etc. In all fields of science,

more and more progress has been made. In medicine, for example, huge progress is

made, even day by day, not only with the invention of new medicines, but also with many

new devices which have been developed for the diagnosis and treatment of all kinds of

illnesses.

But what about language? In the field of language, too, progress has been made step by step, from the first prints made with movable type by Gutenberg to the first typewriters and linotype machines in the past century. The most revolutionary progress in language, however, is taking place today and will continue in the few decades to come. In this paper we give a survey of the possibilities the computer creates with reference to the Arabic language.

Preliminary remarks

As for many languages, major progress has been made for Arabic with regard to text processing and the use of databases. Still, some problems remain to be solved. One of the big problems is the standardisation of the character encoding for Arabic. As everyone knows, many different 8-bit code pages (loosely called ASCII codes) are still in use. The adoption of Unicode in its UTF-8 encoding in the near future seems to be decisive for the progress of the practical and scientific use of Arabic on the computer. The exchange of files between Macintosh and PC remains a big problem. Until now, there have been no reliable programmes that convert a file in one of these codes into a UTF-8 file. A single character encoding, used by all and exchangeable by all, is a priority.
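Where the source encoding of a file is known, the conversion to UTF-8 can be sketched as follows. This is only an illustration under stated assumptions: it uses two Arabic code pages that ship with Python's standard codecs (`cp1256` and `iso8859_6`); the function name `to_utf8` and the sample word are ours, and detecting an unknown encoding is not addressed.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy code page and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # raises UnicodeDecodeError on invalid input
    return text.encode("utf-8")


word = "كتاب"  # kitaab (book), used here only as sample data

# The same word stored under two different legacy encodings.
legacy_windows = word.encode("cp1256")   # Windows Arabic code page
legacy_iso = word.encode("iso8859_6")    # ISO 8859-6 Arabic

# Both legacy byte strings convert to one and the same UTF-8 sequence.
utf8_from_windows = to_utf8(legacy_windows, "cp1256")
utf8_from_iso = to_utf8(legacy_iso, "iso8859_6")
```

The point of the sketch is that the legacy byte values differ between code pages, so a file is only exchangeable once it has been converted to a single shared encoding.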


The second very important step is the complete adaptation of software to all the requirements that the Arabic language imposes. Many programmes, for instance, do not take into account the fact that Arabic can, and from time to time must, be vocalised, which leads to frustration among those who use that software. Let me give just one example. When we produced the dictionary Arabic – Dutch * Dutch – Arabic (Van Mol and Bergman 2001), we first had to contact the database developers of 4D in Paris and work together with them to make the database completely suitable for the Arabic language. Once the database was completely adapted, we worked for many years on the development of the dictionaries. Every word was completely vocalised and checked many times. When the work was finished, we were able to export the files from the database to a desktop publishing programme in order to produce a paper dictionary.

The problem that arose, however, was that the vowel signs on the Arabic words were not in the right place in the printed version. For example, a fatha that had been put above the shadda was placed under the shadda after conversion, which, of course, yields another word. The same goes for other vowel signs, which often interfered with the dots of letters: fathas and dammas were printed in the dot of a fa or in the dots of a qaf, for example. We encountered the same problem with the new manual for Arabic which we have composed and which will be published by fall 2007. Here too, the diacritics were perfectly placed in the text processor, but when the texts were converted to the printing programme, many changes occurred. Interference occurred, for example, between the fatha and the hamza above the alif.

This kind of problem is not limited to Arabic; it also occurs with other languages when databases are used. For French, for example, problems sometimes occur when the data of a database are made available on the web. In that case, colleagues of mine find that the accent on the é disappears. Such practical problems ought to be resolved by programmers in the near future, because they make those systems unproductive. In my case, it meant much extra work, not only for me, but also for the publishing house, because we had to check every letter in the printer’s proof in order to correct it, and the printer had to adjust every diacritical sign manually in order to get a correctly printed form of the Arabic words. For both the author and the printing house, this is extra work that could have been avoided if the programmes had been better adapted for the treatment of Arabic.

Dictionaries

As far as the production of dictionaries is concerned, the Institute of Modern Languages

has developed together with 4D and Beyond in Belgium a database which is completely

adapted for the inventory and production of bilingual dictionaries Arabic – Foreign

language * Foreign Language – Arabic. The adaptation and the development of the

dictionaries have been made possible by a partial funding from the Dutch Language

Union. In what follows, we will give a concise description of the structure of the

database.

The database consists of data compiled from four departments. There is a department of

Arabic, a department of Foreign language (in our case Dutch), a department of Arabic –


Foreign language, and finally a department of Foreign language – Arabic. It is important

to stress the division between those four different parts, because this makes the database

suitable for use for other foreign languages as well. This is precisely one of the big

advantages of the revolution in the information and technology sciences. Electronic

databases and texts can be used over and over again. The Arabic department, for

example, has been completely developed and can be used also for other languages. If a

team in Sweden, for example, or Croatia wants to produce a dictionary Arabic – Swedish

(or Arabic – Croatian), the Arabic part can be used again, which means quite an

important saving of time for other researchers who want to devote themselves to the

production of Arabic – Foreign language dictionaries.
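The four-part division described above can be sketched as a small relational schema. This is a hypothetical illustration only, not the actual 4D database: all table names, column names and sample rows are our assumptions, and SQLite is used merely because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE arabic_lemma (          -- department 1: Arabic
    id INTEGER PRIMARY KEY,
    form TEXT NOT NULL,
    part_of_speech TEXT,
    pattern TEXT
);
CREATE TABLE foreign_lemma (         -- department 2: foreign language
    id INTEGER PRIMARY KEY,
    language TEXT NOT NULL,
    form TEXT NOT NULL
);
CREATE TABLE arabic_to_foreign (     -- department 3: Arabic - foreign language
    arabic_id INTEGER REFERENCES arabic_lemma(id),
    foreign_id INTEGER REFERENCES foreign_lemma(id)
);
CREATE TABLE foreign_to_arabic (     -- department 4: foreign language - Arabic
    foreign_id INTEGER REFERENCES foreign_lemma(id),
    arabic_id INTEGER REFERENCES arabic_lemma(id)
);
""")

# Toy data: one Arabic entry linked to a Dutch and a Swedish equivalent.
conn.execute("INSERT INTO arabic_lemma VALUES (?, ?, ?, ?)",
             (1, "kitaab", "noun", "Fi‘aal"))
conn.executemany("INSERT INTO foreign_lemma VALUES (?, ?, ?)",
                 [(1, "nl", "boek"), (2, "sv", "bok")])
conn.executemany("INSERT INTO arabic_to_foreign VALUES (?, ?)",
                 [(1, 1), (1, 2)])


def translations(arabic_form, language):
    """Return the foreign equivalents of an Arabic lemma for one language."""
    rows = conn.execute(
        """SELECT f.form
           FROM arabic_lemma a
           JOIN arabic_to_foreign af ON af.arabic_id = a.id
           JOIN foreign_lemma f ON f.id = af.foreign_id
           WHERE a.form = ? AND f.language = ?
           ORDER BY f.form""",
        (arabic_form, language),
    )
    return [row[0] for row in rows]
```

Because the Arabic table is independent of any particular foreign language, adding a new language in this sketch only means adding rows to the foreign and link tables; the Arabic rows are reused unchanged.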

The advantage of an electronic dictionary is that it can be updated at any time, and that it

can also be used in an electronic way, i.e. by conducting searches on the Internet by

means of the world-wide-web. But, of course, electronic lexical databases offer many more possibilities, depending on the richness of the database. In our database, which was primarily conceived for the production of a paper dictionary, we included a lot of meta-information. To give one example, in the Arabic part we included not only the usual part-of-speech categories, such as verb, noun, adjective, etc., but also information on Arabic word patterns and categories. Once this information is stored in the database, it is possible to obtain interesting information about the language in no time. To give only a few examples: one might wonder which broken plural forms are frequent and which are not. In our database, we obtained the following results for some triptotic plurals.

Pattern    Number of occurrences    %
Af‘aal                       673    39
Fu‘uul                       396    23
Fi‘aal                       267    15
Fu‘al                        183    11
Fu‘ul                        105    6
Fu‘l                          96    5.7
Fa‘iil                         6    0.3
Total                       1726    100

Table 1

So the most frequent pattern of triptotic broken plurals is: Af‘aal (673), followed by

Fu‘uul (396), followed by Fi‘aal (267), Fu‘al (183), Fu‘ul (105 occurrences), Fu‘l (96)

and finally Fa‘iil (6), which is the least frequent of the seven plurals investigated.
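Once the pattern of each entry is stored, a ranking like the one in Table 1 reduces to a count and a division. The sketch below recomputes the shares from the counts reported in Table 1; the variable names are ours, and percentages rounded independently may differ in the last digit from the published column, which appears to have been adjusted to sum to 100.

```python
# Counts of triptotic broken plural patterns, copied from Table 1.
counts = {
    "Af‘aal": 673,
    "Fu‘uul": 396,
    "Fi‘aal": 267,
    "Fu‘al": 183,
    "Fu‘ul": 105,
    "Fu‘l": 96,
    "Fa‘iil": 6,
}

total = sum(counts.values())

# Share of each pattern in percent, rounded to one decimal place.
shares = {pattern: round(100 * n / total, 1) for pattern, n in counts.items()}

# Patterns ordered from most to least frequent.
ranked = sorted(counts, key=counts.get, reverse=True)
```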

It is also possible to investigate the frequency of singular Arabic word patterns. For some

patterns our database yields the following results:


Pattern    Number of occurrences    %
Fa‘iil                       737    46
Fa‘‘aal                      383    24
Fa‘il                        222    14
Fu‘uul                       203    13
Fa‘laan                       71    3
Total                       1616    100

Table 2

The above tables, however, give only a static idea of the frequency of forms in the lexicon, and not of their frequency in real language use. Although the plural pattern Fa‘iil is relatively rare as a pattern, it does occur in frequent words, such as hamiir, the plural of himaar (donkey). The same goes for Table 2. The pattern alone does not tell us everything about the parts of speech concerned: the pattern Fa‘iil can be either a noun or an adjective.

In a lexical database, this information too can be added. We can distinguish between homographs that cover two or more grammatical categories.

Words and word forms in a lexical database

Words, of course, do not exist only in the static form in which they are stored in a dictionary as a lemma. Words can be declined or conjugated. In our lexical database we developed a programme that generates all the declined and conjugated forms of words. In this way, from the 27,000 Arabic words in the database a total of about 600,000 derived or conjugated forms were generated.

This means that for the verbs we generated all the tenses and moods (past, present, jussive and imperative) and all their persons. For the nouns, all plurals were generated, as well as the dual forms and the dual forms in the construct state, etc.
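To give an idea of what such generation involves, the following sketch produces the past-tense (perfect) forms of a sound triliteral verb by attaching the person suffixes to the stem. It is not the institute's generator: it works on a transliteration in the style used in this paper, covers only sound Form I verbs, and the names `PERFECT_SUFFIXES` and `perfect_forms` are ours.

```python
# Person suffixes of the perfect for sound verbs (m = masculine, f = feminine,
# s = singular, d = dual, p = plural).
PERFECT_SUFFIXES = {
    "3ms": "a", "3fs": "at", "2ms": "ta", "2fs": "ti", "1s": "tu",
    "3md": "aa", "3fd": "ataa", "2d": "tumaa",
    "3mp": "uu", "3fp": "na", "2mp": "tum", "2fp": "tunna", "1p": "naa",
}


def perfect_forms(stem):
    """Attach every person suffix to a perfect stem such as 'katab'."""
    return {person: stem + suffix for person, suffix in PERFECT_SUFFIXES.items()}


forms = perfect_forms("katab")  # kataba (to write)
```

Thirteen forms for one tense of one verb already shows how 27,000 lemmas can grow into hundreds of thousands of generated forms.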

The data obtained by generating all the derived and conjugated forms, however, also remain static data. This means that these data do not give a realistic view of the Arabic language either. Many of the generated forms, just like some lemmas in a dictionary, may be very uncommon. Some verbs, for example, may never have an imperative form, such as the verb ariqa (to be sleepless). Likewise, many verbs have no passive form, while other verbs have no active form, but only a passive one.

The same goes for nouns and the like. Not all nouns occur in the dual, and certainly not in the dual as the first element of a construct state. Although the information in lexical databases is crucial as a point of departure for scientific investigations of language, the opportunities that lexical databases offer for generating all kinds of word forms are not always to the point.

One always has to bear in mind that the more material is generated, the more work is created, because it all has to be checked, and the more elements are generated which might never occur in real language use and therefore risk being totally irrelevant. The more data we gather and generate, the more difficult it becomes to obtain precise results in scientific investigations, simply because checking it all demands an enormous effort.

The strategy to obtain more precise results in computerized language investigation is twofold. The first part is the compilation of a representative corpus and the second is the detailed tagging of the corpus in order to obtain scientifically relevant results.

Compilation of corpora

The lexical database we have was not simply compiled by transferring data from an

existing dictionary into our database. The easiest way to compile a database is precisely

that way. The problem however is that, by doing so, data are accumulated in a database

without our having any idea about their current relevance. The famous dictionary of Hans

Wehr, for example, contains a lot of words that are completely obsolete. It is a fact that

Modern Standard Arabic is in close relation to the Classical variety and that some

authors, for one reason or another, use obsolete and uncommon words. The use of obsolete Arabic words is therefore still understandable, but Hans Wehr also contains a lot of words that are simply Arabic transliterations of words from European languages.

In colonial times such words may have been in use, but nowadays many of them no longer occur at all in Modern Standard Arabic. An evolution in the use of words in Modern Arabic has undoubtedly taken place.

In order to obtain current data we based our lexical database on an Arabic text corpus of 4,000,000 words, all dating from 1980 till now. Precisely by comparing the lexical database obtained in this way with the dictionary of Hans Wehr, we were able to discover that many transliterated European words have disappeared from current Modern Standard Arabic usage. The second thing we have learned by basing our dictionary on a corpus is that some dictionaries, such as the English – Arabic dictionary of Nafis, contain multi-word expressions that are not used at all in current Arabic.
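The comparison between a dictionary and a corpus amounts to checking, for every dictionary entry, whether it is attested in the corpus at all. A minimal sketch, assuming the corpus has already been reduced to a list of lemmas; the function name and the toy data (including the placeholder entry) are hypothetical:

```python
from collections import Counter


def attestation(dictionary_lemmas, corpus_lemmas):
    """Split dictionary entries into attested (with frequency) and unattested."""
    freq = Counter(corpus_lemmas)
    attested = {lemma: freq[lemma] for lemma in dictionary_lemmas if freq[lemma] > 0}
    unattested = sorted(lemma for lemma in dictionary_lemmas if freq[lemma] == 0)
    return attested, unattested


# Toy data only: three dictionary entries and a five-token lemmatised corpus.
dictionary_lemmas = ["haafilat", "utuubiis", "obsolete_loanword"]
corpus_lemmas = ["haafilat", "kitaab", "utuubiis", "haafilat", "kitaab"]

attested, unattested = attestation(dictionary_lemmas, corpus_lemmas)
```

Entries that end up in `unattested` are candidates for the obsolete words and invented expressions discussed above; whether they are really obsolete still depends on how representative the corpus is.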

A lexicographer may try to describe concepts for which he does not find an expression in the target language, but this can only be his last option. When multi-word expressions exist and occur frequently in a language, it is up to the lexicographer to detect them and to present them in his dictionary. When the lexicographer does not know the current expression, he may give a description of it, but this must always be clearly marked, so that the user of the dictionary knows that the expression he finds is a description and, as such, an invention of the lexicographer, and not a currently used, generally accepted multi-word expression. Here corpus linguistics can play an important role because it gives insight into real language use.

Another observation we made by comparing our new corpus-based dictionary with older dictionaries that are not corpus-based is that there is a discrepancy between the Arabic words proposed in older dictionaries and the words we found in real language use. Over time, some lexicographers have filled dictionaries with words proposed by academics that never found their way into real usage. This is a problem for all languages. In Dutch too, the word proposed by academics for the high-speed train was not accepted by the population, which had got used to the French abbreviation TGV.

Representativeness of corpora

But here again some remarks have to be made. When researchers want to base their

findings on corpora, those ought to be representative. Until now there are no

representative corpora of the Arabic language. Researchers use the material which can be found most easily, which is, of course, the web and the electronic corpora of Arabic that are made available by some newspapers. For the findings of scientific work it is very important that the corpus on which research is based is clearly defined, in order to put the obtained results into perspective. Newspaper corpora only yield data about that segment of the language.

A second prerequisite is that the corpus ought to be tagged. Untagged corpora give us a

biased view of language. As an example, we take the splendid initiative of our colleague

Dilworth Parkinson with his ArabiCorpus on the web. The program gives a nice

concordance of mainly newspaper Arabic. Because the corpus is untagged, however, the data obtained are biased. ArabiCorpus, for example, gives information about the number and kind of words that occur before and after a chosen word. As those words are untagged, the information obtained is not specific enough to be scientifically relevant.

In most cases, the interference of the chosen word with other homomorphic words which

do not relate to the chosen word, is too big to be of scientific relevance. This interference

occurs, for example, with verbs of the first and the fourth form, and in many other cases.

When looking for the verb fahima (to understand) many other elements showed up,

among others the word hatfahum in the expression laqia hatfahu (to die), which in this

case is not a verb, but a noun with a pronominal suffix.
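The interference described here can be reproduced on a toy scale. The sketch below contrasts a raw string search with a search restricted by lemma and part of speech; the token list, the tag names and both function names are our assumptions, not the format of ArabiCorpus or of our own corpus.

```python
# Each token is (unvowelled form, lemma, part of speech); toy data only.
tokens = [
    ("فهم", "fahima", "VERB"),    # he understood
    ("حتفهم", "hatf", "NOUN"),    # hatfahum: noun 'hatf' + pronominal suffix
    ("يفهم", "fahima", "VERB"),   # he understands
    ("كتاب", "kitaab", "NOUN"),   # book
]


def raw_search(query, corpus):
    """Untagged search: every form that merely contains the query string."""
    return [form for form, _lemma, _pos in corpus if query in form]


def tagged_search(lemma, pos, corpus):
    """Tagged search: only forms analysed as the requested lemma and category."""
    return [form for form, tok_lemma, tok_pos in corpus
            if tok_lemma == lemma and tok_pos == pos]


raw_hits = raw_search("فهم", tokens)
verb_hits = tagged_search("fahima", "VERB", tokens)
```

The raw search returns the noun with its suffix alongside the two verb forms, whereas the tagged search returns only the verb forms.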

The tagging of Arabic corpora is so important precisely because of the strong bias one obtains when searching untagged corpora, due to the fact that Arabic is an unvowelled and agglutinative language. The second reason is that the larger the corpora we obtain, the more time we lose ferreting out the biased data. As a lexicographer and corpus analyst, I have a lot of experience with the time-consuming aspect of working with words.

I was told, a long time ago, that some publishing houses of dictionaries limited the

average time a lexicographer was permitted to work on one lemma to seven minutes. This

was done for the work to remain economically profitable. With the analysis of huge untagged and unrepresentative corpora, one may spend a day on the study of one lemma, losing much precious time cleaning up the data before they are manageable for interpretation.


Again, raw corpora, such as we find in ArabiCorpus, may yield interesting results. When

we looked for the collocation of kitaab (book) and tayyib or jayyid (good), according to

the search of ArabiCorpus the only proper adjective seemed to be jayyid (10 hits) and not

tayyib (no hits). Although this result is very interesting, one might ask whether the hits for jayyid apply both to the adverb used with the verb kataba (to write) and to the adjective used with the noun kitaab, or only to one of the two. To clarify this, further searches have to be made.

The necessity of variety in the choice of texts in a corpus

Let me give the following example of the need to compile corpora that are representative both of the region and of the type of texts. At the Leuven Institute of Modern Languages, a representative database of Arabic is under compilation. For the moment, the database comprises texts from five sources: Morocco, Egypt, Lebanon, Saudi Arabia and Arabic texts published in London. The texts range from newspaper Arabic to fiction and non-fiction, and the database also comprises texts from television programmes and schoolbooks.

In what follows we want to illustrate the importance of having different kinds of texts in a corpus, as well as texts from different countries. To this end, we have investigated the spread of some Arabic words in the corpus, for instance the word for ‘bus’, which can be utuubiis, baas, or the new word haafila.

The following table shows the results for the word for ‘bus’ in written media Arabic.

Text type    Country   haafilat   baas   utuubiis   Total
Newspapers   EGY              0      1         12      13
             LON              5      1          0       6
             LEB              1      1          0       2
             MOR              1      0          0       1
             SAU              1      1          0       2
Magazines    EGY              0      0          3       3
             LON              4      5         21      30
             LEB              1      1          0       2
             MOR              6      0          0       6
             SAU              3      0          1       4
Total                        22     10         37      69

Table 3

The following table gives a survey of the occurrence of the different words for ‘bus’ in Arabic literature. The literature corpus of the Gulf States is still under construction, which is why we cannot give results for the Gulf States for the fiction and non-fiction genres.


Text type    Country   haafilat   baas   utuubiis   Total
Non-fiction  EGY              2      0          2       4
             LEB              5      1          0       6
             MOR              9      1          0      10
Fiction      EGY              2      5         23      30
             LEB              1      8          1      10
             MOR             23      1          0      24
Total                        42     16         26      84

Table 4

Table 5 shows the results for the television corpus. Here we make a distinction between formal spoken language and informal spoken language. JAZ stands for programmes recorded from the Al-Jazeera television station.


Register     Country   haafilat   baas   utuubiis   Total
Informal     EGY              0      0          0       0
             JAZ              1      0          0       1
             LEB              0      4          0       4
             MOR              0      0          0       0
             SAU              0      0          0       0
Formal       EGY              0      2          2       4
             JAZ              3      2          0       5
             LEB              1      1          0       2
             MOR              0      1          0       1
             SAU              1      0          0       1
Total                         6     10          2      18

Table 5

Table 6 shows the results from schoolbooks. These are the manuals used in the different Arab

countries for the study of the Arabic language.


Text type    Country   haafilat   baas   utuubiis   Total
Schoolbooks  EGY             13      0          1      14
             LEB             18      0          0      18
             MOR             31      0          0      31
             SAU             10      0          0      10
Total                        72      0          1      73

Table 6

The last four tables show the importance of compiling varied corpora that contain all kinds of texts. From the above tables the following can be concluded. Starting with the last table (Table 6), we see very clearly that the language policy, as reflected in the schoolbooks for Arabic in the different Arab countries investigated, promotes the use of the word haafilat for ‘bus’. This is probably the word proposed by the different Arab academies and ministries of education as the word to be used from now on. Only in Egypt do we find one case of utuubiis.

In media Arabic, we clearly see that the word haafilat is used, but not in the majority of cases. In newspapers and magazines the word utuubiis is still prevalent (37 out of a total of 69), whereas on television the word baas is prevalent (10 out of a total of 18). The word haafilat is used in media Arabic in a limited way: 22 occurrences out of 69 in written media Arabic (newspapers and magazines), which is 32% of the total, and 6 occurrences out of 18 in television programmes, which is about the same percentage, viz. 33%.

It is remarkable, though, that in written media Arabic the non-fusha words are almost all used in Egypt and in London, viz. in 91% of the cases (43 occurrences out of 47). It is also remarkable that the other countries in the survey use the word haafilat in written media Arabic in the majority of cases.

This means that the words for ‘bus’ that are probably more frequently used in everyday spoken Arabic still have a large influence in media Arabic. When we take a look at the table for literature, we get quite a different result. In literature, half of the occurrences of the word for ‘bus’ are the fusha word haafilat (42 out of 84, which makes exactly 50%). But here too we discover that the larger part of the more colloquial words is used in Egyptian literature (30 occurrences out of 42, which is approximately 71%). It is also remarkable that in both Moroccan fiction and non-fiction, the authors use the fusha word instead of the other possibilities in the overwhelming majority of cases. The following table presents a survey of the percentages.

Percentage of use in fiction and non-fiction.

Text type    Country   haafilat   baas   utuubiis   Total
Non-fiction  EGY             50      0         50     100
             LEB             83     17          0     100
             MOR             90     10          0     100
Fiction      EGY              6     17         77     100
             LEB             10     80         10     100
             MOR             96      4          0     100

Table 7

This table makes it clear that the country where the fusha word haafilat is the most

widespread in literature is Morocco (an average of 93%), whereas in Lebanon the fusha

word haafilat is much more frequently used in non-fiction (83%) than in fiction (10%).

Another remarkable result is that Egyptian authors of fiction stick to the use of the

colloquial words, with only 6% occurrences of the fusha word haafilat.
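The percentages in Table 7 follow directly from the counts in Table 4. The sketch below recomputes them row by row; the data structure and function name are ours, and because each cell is rounded independently here, a value can differ by one point from the published table (for Egyptian fiction, 2 out of 30 rounds to 7 rather than 6).

```python
# Counts from Table 4 in the order (haafilat, baas, utuubiis).
literature_counts = {
    ("Non-fiction", "EGY"): (2, 0, 2),
    ("Non-fiction", "LEB"): (5, 1, 0),
    ("Non-fiction", "MOR"): (9, 1, 0),
    ("Fiction", "EGY"): (2, 5, 23),
    ("Fiction", "LEB"): (1, 8, 1),
    ("Fiction", "MOR"): (23, 1, 0),
}


def row_percentages(row):
    """Convert one row of counts into whole-number percentages of the row total."""
    total = sum(row)
    if total == 0:
        return tuple(0 for _ in row)
    return tuple(round(100 * n / total) for n in row)


percentages = {key: row_percentages(row) for key, row in literature_counts.items()}

# Average share of haafilat in Moroccan fiction and non-fiction.
morocco_average = (percentages[("Non-fiction", "MOR")][0]
                   + percentages[("Fiction", "MOR")][0]) / 2
```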

Another example: the word for ‘wet’

There are two words for ‘wet’ in Arabic, viz. muballal and mabloul. An investigation of the corpus yields the following results:

Text type    Country   mabloul   muballal   Total
Newspapers   EGY             1          1       2
             LON             0          0       0
             LEB             0          0       0
             MOR             0          0       0
             SAU             0          0       0
Magazines    EGY             0          0       0
             LON             0          0       0
             LEB             1          0       1
             MOR             0          0       0
             SAU             0          0       0
Total                        2          1       3

Table 8

This table clearly shows that some words do not often occur in written media Arabic. If we base word counts only on newspaper Arabic, as Fromm (1981) did, or only on prose, as Landa (1959) did, we obtain a biased view of frequency in the language as a whole.

Let us now examine the frequency of these words in oral media Arabic, viz. both formal

and informal spoken language in television programmes. The results are shown in the

following table.


Register     Country   mabloul   muballal   Total
Informal     EGY             1          0       1
             JAZ             0          3       3
             LEB             0          0       0
             MOR             0          1       1
             SAU             0          0       0
Formal       EGY             0          0       0
             JAZ             0          0       0
             LEB             0          0       0
             MOR             0          0       0
             SAU             0          0       0
Total                        1          4       5

Table 9

It is clear that also in oral media Arabic the occurrence of both words is very low. In

formal speech, we did not find any occurrence. Only in informal speech do both words

occur, but they seem to be rather uncommon in use.

The question we want to raise here is whether a literature corpus would yield a different result as far as both words are concerned. The following table presents the result.


Text type    Country   mabloul   muballal   Total
Non-fiction  EGY             0          2       2
             LEB             0          0       0
             MOR             0          4       4
Fiction      EGY             0          5       5
             LEB             5         14      19
             MOR             0          8       8
Total                        5         33      38

Table 10

Table 10 clearly shows that both words are much more frequent in literature than in media Arabic. It shows us even more, namely that the word muballal is used exclusively in all countries except Lebanon, where the word mabloul is still used in 26% of the cases. The media corpus and the literature corpus are the same size, viz. approximately 2,700,000 words.

The last table shows us the use in schoolbooks, which gives us the following result:


Text type    Country   mabloul   muballal   Total
Schoolbooks  EGY             0          1       1
             LEB             0          2       2
             MOR             3          8      11
             SAU             2          0       2
Total                        5         11      16

Table 11

In schoolbooks too, the frequency of the words is higher than in media Arabic, both oral and written. It is clear that the word muballal is used more frequently than mabloul, except in Saudi Arabia, where muballal is not used in schoolbooks at all. Because the literature corpus of the Gulf States is not yet available, we cannot examine whether the word mabloul is also preferred over muballal in the literature of the Gulf States. With these tables we hope to have made clear that, in order to give a complete view of the language, corpora of Arabic ought to be compiled from a great variety of text genres.

Detailed exploration of corpora

The searches conducted on the corpora above are relatively simple. We have examined words that are for the greater part unambiguous. Of the five words we have examined above, only one is fairly ambiguous, namely the word baas, for this word also has the meaning of bass (a musical voice); the word minibaas (minibus) also occurred often in the corpus. This word had to be filtered out, because it has nothing to do with the meaning ‘bus’.

The question is, however, how we might carry out more detailed and refined searches of Arabic corpora. Here the tagging of corpora becomes extremely important, especially because Arabic is a very ambiguous language. As Badawi and others (2004) have stressed in their introduction, and as is generally known, educated Arabic speakers too are conscious of the ambiguity of forms and words in Arabic, not only because of the unvowelled nature of the text but also because of its semi-agglutinative character. In earlier articles we have given some examples of this ambiguity (Van Mol 2000).

The question is how to tag the Arabic language in such a way that precise searches can be conducted. To this end, we have developed a two-phase tagging system at the Institute of Modern Languages at Leuven University. The first phase is based on the systematic use of diacritic signs to disambiguate words. To give only one example: in order to distinguish the preposition baa (with) from baa as the first consonant of a word, we put a kashida after the letter baa when it is a preposition (Van Mol 2002). Simple vocalisation of an Arabic text is not sufficient for effective disambiguation, because it does not disambiguate the text completely: a relatively large number of ambiguous strings of characters between spaces will always remain.
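The kashida convention can be exploited mechanically. The sketch below assumes, as a simplification of the convention described above, that a prepositional baa is always written as the letter baa (U+0628) immediately followed by a kashida (U+0640, tatweel); the function name is ours and the other conventions of the system are not modelled.

```python
BAA = "\u0628"       # Arabic letter baa
KASHIDA = "\u0640"   # Arabic tatweel (kashida)


def split_baa_preposition(token):
    """Return (preposition, rest) if the token is marked as baa + kashida + word.

    Tokens without the marker are returned unchanged with None as preposition.
    """
    marker = BAA + KASHIDA
    if token.startswith(marker) and len(token) > len(marker):
        return BAA, token[len(marker):]
    return None, token


marked = BAA + KASHIDA + "كتاب"   # preposition bi + kitaab, marked by convention
unmarked = "باب"                   # baab (door): baa is part of the word itself

marked_result = split_baa_preposition(marked)
unmarked_result = split_baa_preposition(unmarked)
```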

The system we have developed and described in our book on variation in Arabic from

broadcasts (Van Mol 2003) offers the possibility to do specific searches without

complicated programming. This way we are able to do specific investigations on Arabic

grammar. One of the points we have investigated is the use of the particles of the future

sawfa and sa in radio news broadcasts from Algeria, Egypt and Saudi Arabia on a corpus

of 300,000 words.

In that study, we were able to prove statistically that the particle sawfa occurs much less

than the particle sa in the three countries under investigation. A qualitative analysis,

however, revealed that the distinction in use of both particles, such as it is prescribed by

the Arab grammarians is not applied anymore in Modern Standard Arabic in news

broadcasts. The Arab grammarians prescribe that the particle sawfa has to be used when

indicating a remote future, whereas the particle sa has to be used to indicate a near future.

In reality, we see that both particles are used for a near future as well as for a remote

future. We might have concluded that this result means that the Arabic language has undergone an evolution on this point. But the only conclusion we can draw is that usage has moved away from the prescriptions of the traditional Arab grammarians. In order to establish a real evolution of the Arabic language on this point, the results ought to be compared with the analysis of a representative corpus of Classical texts from a certain period in history. One thing is clear, however: the analysis of corpora gives real evidence of the ‘life’ of languages.

Although the detailed tagging of corpora yields interesting results, we have to point out that corpora tagged in detail can also yield a tremendous amount of data that are very time-consuming to analyse. When we conducted an analysis of the conjunctions wa and fa on the smaller corpus of 300,000 words, we obtained 8,452 hits for the particle wa and 1,920 hits for the particle fa. Our corpus now comprises 12,000,000 manually tagged words. If we were to investigate the use of these conjunctions in our new corpus, which is 40 times larger, we can calculate by extrapolation that we would find approximately 340,000 occurrences of wa (8,452 × 40 = 338,080) and 76,800 occurrences of fa (1,920 × 40).
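The extrapolation is a simple proportion, sketched below with the figures given in the text. The function name is ours, and the calculation assumes that the relative frequency of the conjunctions is the same in the larger corpus as in the smaller one.

```python
def extrapolate(hits, sample_size, target_size):
    """Scale a hit count from a sample corpus to a larger corpus (linear assumption)."""
    return hits * target_size / sample_size


SMALL_CORPUS = 300_000
LARGE_CORPUS = 12_000_000

expected_wa = extrapolate(8_452, SMALL_CORPUS, LARGE_CORPUS)
expected_fa = extrapolate(1_920, SMALL_CORPUS, LARGE_CORPUS)
```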

These are enormous figures. If we wanted to examine every case, it would take maybe a year to investigate every detail. This is why, in tagged corpora too, strategies have to be developed which reduce the number of hits in a scientific way. When huge numbers of occurrences are found, it is necessary to take representative samples of the elements we are looking for. We cannot sufficiently stress the importance and the necessity of representativeness in language investigation, especially when corpus linguistics is involved.
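One simple way to reduce a very large set of hits is to draw a simple random sample of fixed size. The sketch below is an illustration only: the sample size of 500 and the seed are arbitrary assumptions, hits are represented by their positions in the result list, and stratifying the sample by country or text type is not shown.

```python
import random


def sample_hits(hit_ids, k, seed=0):
    """Draw k distinct hits at random; a fixed seed makes the sample reproducible."""
    rng = random.Random(seed)
    return sorted(rng.sample(hit_ids, k))


all_wa_hits = range(338_080)            # positions of the extrapolated hits for wa
sample = sample_hits(all_wa_hits, 500)  # the subset to be examined by hand
```

Recording the seed alongside the results allows another researcher to draw the same sample and check the analysis.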

The definitive tagging of Arabic corpora

In order to obtain even finer results of the tagging, we have developed at the institute a

(semi)automatic tagger that attributes Part of Speech information to each word in the

corpus. Affixes, both prefixes and suffixes are automatically recognized, but the program

also gives the different possibilities for every word. Sometimes a string of characters

between two spaces can have more than 20 tag possibilities. In an earlier article, we gave

an analysis of one sentence in Arabic and of the different possibilities that the programme

detected for every word (Van Mol 2005).
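How one string can receive several analyses may be illustrated by enumerating every way of splitting it into prefix, stem and suffix. This is a toy sketch and not the institute's tagger: the affix lists are deliberately tiny, the minimum stem length is an arbitrary assumption, and no lexicon is consulted to reject impossible stems.

```python
# Very small, illustrative affix inventories (the empty string means "no affix").
PREFIXES = ("", "و", "ف", "ب", "ل", "ال")
SUFFIXES = ("", "ه", "ها", "هم")


def candidate_splits(token, min_stem=2):
    """List all (prefix, stem, suffix) segmentations allowed by the affix lists."""
    analyses = []
    for prefix in PREFIXES:
        if not token.startswith(prefix):
            continue
        rest = token[len(prefix):]
        for suffix in SUFFIXES:
            if not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            if len(stem) >= min_stem:
                analyses.append((prefix, stem, suffix))
    return analyses


splits_hatfahum = candidate_splits("حتفهم")
splits_fhm = candidate_splits("فهم")
```

Even with these few affixes, the three-letter string fhm receives two segmentations (the stem fhm, or the prefix fa plus hm), which a real tagger then has to rank or disambiguate.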

The programme offers two ways to tag: either by manual choice of the correct tag (the semi-automatic way) or by automatic tagging. Thanks to the preliminary encoding, based on the systematic and conventional application of diacritic signs, we achieve an accuracy of 97% for the automatic and definitive tagging of Arabic corpora.

Future perspectives

Once a corpus is tagged this way it opens many new horizons for the future of Arabic

corpus linguistics. We are now at the stage of developing software to do complex

searches. Indeed, the definitive tagging will make it possible to do combined searches,

both for words and for grammatical categories, or a combination of both. It will be

possible, for instance, to investigate the use of the verb kaana as an auxiliary in combination

with other verbs, precisely because every verb is tagged as a verb.
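
A combined search of this kind, for a word together with a grammatical category, might look as follows. This is a minimal sketch and not the software under development at the institute: the `Token` structure, the tag names (`VERB`, `NOUN`, `CONJ`), the `window` parameter and the transliterated toy tokens are all assumptions made for illustration.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # word form as it appears in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag (hypothetical tag names)


def find_lemma_then_pos(tokens, lemma, pos, window=1):
    """Find a given lemma followed, within `window` tokens, by a given tag.

    Returns index pairs (position of the lemma, position of the first
    following token that carries the requested tag).
    """
    matches = []
    for i, tok in enumerate(tokens):
        if tok.lemma != lemma:
            continue
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if tokens[j].pos == pos:
                matches.append((i, j))
                break
    return matches


# Toy tagged tokens in simplified transliteration (illustrative only).
tokens = [
    Token("kaana", "kaana", "VERB"),
    Token("yaktubu", "kataba", "VERB"),
    Token("risaalatan", "risaala", "NOUN"),
    Token("wa", "wa", "CONJ"),
    Token("kaana", "kaana", "VERB"),
    Token("al-waladu", "walad", "NOUN"),
    Token("yal'abu", "la'iba", "VERB"),
]

adjacent = find_lemma_then_pos(tokens, "kaana", "VERB", window=1)
nearby = find_lemma_then_pos(tokens, "kaana", "VERB", window=2)
print(adjacent, nearby)
```

Widening the window from one to two tokens also catches cases in which a subject stands between kaana and the following verb. Such a query is possible only because every token carries its own tag.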

Another very important perspective for the future is the automatic tagging of other raw

Arabic corpora. Once we have a representative corpus that is tagged in a very accurate

way, this corpus might serve as a basis for the automatic tagging of other corpora, so that

we might obtain larger corpora with a high degree of correctness as far as the tagging is

concerned.

Another field in which tagged corpora may play a big role is the field of education (see

Van Mol 2006). Many applications can be developed for educational purposes. At the

institute we are currently developing a programme that will enable students to

query Arabic sentences by themselves about the meaning of the words and about

grammatical issues related to these words.

Conclusion

As mentioned in the Introduction, the innovation in techniques for the examination and

exploration of language is one of the latest achievements in science, similar to what text

processing did for the production and editing of texts in the past decade. The late

Hans Wehr, and also Lane, had to work a lifetime on the compilation of dictionaries for

Arabic. Others worked hard on the examination of language structures by means of books

and other relevant sources. Thirty years ago, when I was a young man, I was already

interested in making a frequency count of words in Arabic, in order to give my students a

clearer insight into the importance of some words. The way to do that was by writing

words on large sheets of paper, in alphabetical order, as they were found in

Arabic texts. The only way to count in those days was by tallying the words one by one.

The same was true for sociological investigations at that time. Statistics on observations

were all too often compiled by tallying. Later, I had to give up my tallying

of Arabic words, because I realized that it was not feasible that way.

The Arabic paper dictionary that we have developed for Dutch started the same way:

making the first notes in a booklet and then on sheets of paper until the first home computers

became available seven years after I started. With the first simple text processing

programmes much more became possible. First of all, the fact that data no longer had to be

copied over and over again for other uses, but could be reused, was

tremendous progress. The availability of the first databases made even more

possible. Time-consuming work became feasible. On the other hand, developments in

linguistics led to the possibility of the compilation of huge amounts of data.

This is the other side of the coin. The huge amount of data might lead to a proliferation

that might become complex and immense, and even unmanageable. It might even lead to

a kind of discouragement for some researchers who might be overwhelmed by the

availability of a vast amount of data, which might throw them back to drudgery.

I hope to have made clear that two basic principles have to be kept in mind for the further

responsible scientific analysis of future corpora. First of all, very specific tagging

software has to be developed, together with a representative corpus that is tagged according to the

specifications of a well-defined tagset. In the second place, representativeness of

corpora is essential if we do not want to take the risk of being overwhelmed by data that

we might never be able to analyse because of lack of time.

In other sciences too, investigations are based on representative samples. When

examining the quality of seawater, a small sample is taken. When investigating

ocean currents, small samples are likewise taken at different places so as to be

representative. In medicine, scanners exist that can give a survey of the human

body. The development of representative corpora and of the appropriate

software to investigate them may yield results comparable to those of a scanner in medicine,

provided that the scientist remains very critical about the representativeness of his data,

and that he is aware that not every detail has to be scrutinized.

Bibliography

Badawi, Elsaid, M. G. Carter, and A. Gully. 2004. Modern written Arabic: A

comprehensive grammar. London: Routledge.

Fromm, Wolf-Dietrich. 1981. Frequency dictionary of modern newspaper Arabic. State

mutual book & periodical services.

Landau, Jacob. 1959. A word count of modern Arabic prose. New York: American

Council of Learned Societies.

Van Mol, Mark. 2000. Exploring annotated Arabic corpora, preliminary results. In

Corpora and natural language processing: Proceedings of the International Conference

on Artificial and Computational Intelligence for Decision, Control, and Automation in

Engineering and Industrial Applications, pp. 94–98. Monastir.

Van Mol, Mark. 2002. The Semi-automatic tagging of Arabic corpora. In Workshop

Proceedings –Arabic Language Resources and Evaluation– Status and Prospects, pp.

40–44. LREC.

Van Mol, Mark. 2003. Variation in modern standard Arabic in radio news broadcasts: A

synchronic descriptive investigation in the use of complementary particles, Orientalia

Lovaniensia Analecta, no. 117. Leuven.

Van Mol, Mark. 2005. From lexical database to tagged Arabic corpus. In ACIDCA-ICMI

conference, International Conference on Machine Intelligence, Tozeur, 4–7 November

2005.

Van Mol, Mark. 2006. Arabic receptive language teaching: A new CALL approach. In

Handbook for Arabic Language Teaching Professionals in the 21st Century, ed. Kassem

M. Wahba and Zeinab Taha, pp. 305–314. Lawrence Erlbaum Publishers.
