Arabic and the Computer
Van Mol Mark
Institute of Living Languages – Faculty of Arts, Katholieke Universiteit Leuven, Leuven, Belgium
Keywords: Arabic computing and tagging, lexicography, corpus based analysis, conception of relational databases for
Arabic
Abstract: In this paper I discuss the possibilities offered by a multilingual relational database for
Arabic in combination with a representative corpus of Arabic. I go into some detail on
search results that can be obtained about the frequency of word forms in Arabic. The exploration of a
varied corpus teaches us that, when corpora are compiled, the representativeness of the
materials is of prime importance.
Introduction
The world has known several technical revolutions. Everybody is acquainted with the
industrial revolution. Mankind has always tried to improve the quality of life and work.
In many domains, progress has been made step by step over several generations. Transport
has improved step by step, from the invention of the bicycle to the production of cars.
Planes and ships have made transport much easier. In the field of manual work too,
many improvements have been made over the last centuries. Machines of all kinds
have been developed, from digging machines to bulldozers. In all fields of science,
more and more progress has been made. In medicine, for example, huge progress is
made day by day, not only with the invention of new medicines, but also with many
new devices that have been developed for the diagnosis and treatment of all kinds of
illnesses.
But what about language? In the field of language too, progress has been made step by
step, from Gutenberg's first prints with movable type to the first
typewriters and the linotype machines of the past century. The most revolutionary progress
in language, however, is taking place in our own era and will continue in the few decades to come. In
this paper we give a survey of the possibilities the computer creates with reference to
the Arabic language.
Preliminary remarks
As for many other languages, major progress has been made for Arabic in text
processing and in the use of databases. Still, some problems remain to be
solved. One of the major problems is the standardisation of the character encoding for Arabic.
As everyone knows, many different character codes (loosely called ASCII codes) are still in use.
The adoption of Unicode in its UTF-8 encoding in the near future seems decisive for the progress of
the practical and scientific use of Arabic on the computer. The exchange of files between Macintosh
and PC remains a big problem. Until now, there are no reliable programmes that can
convert a file in one of these older codes into a UTF-8 file. Agreeing on
one character code that is used by all and exchangeable by all is a priority.
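To make the conversion problem concrete, here is a minimal sketch in present-day Python, which is not one of the tools discussed in this paper. It assumes that the source encoding of a file is known in advance, which is exactly what is often uncertain in practice. The codec names cp1256 and iso8859_6 are two legacy Arabic encodings chosen only for illustration, and the function name is hypothetical.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    # decode() raises UnicodeDecodeError if the bytes do not fit the encoding
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


word = "كتاب"  # kitaab (book)

# The same word stored under two different legacy encodings
as_cp1256 = word.encode("cp1256")
as_iso = word.encode("iso8859_6")

# The byte sequences differ, so a file read with the wrong assumption is garbled
print(as_cp1256 == as_iso)

# Once the right source encoding is named, both converge on the same UTF-8 bytes
print(transcode_to_utf8(as_cp1256, "cp1256") == transcode_to_utf8(as_iso, "iso8859_6"))
```

The sketch does not guess the encoding: decoding bytes with the wrong codec usually succeeds silently and produces the wrong letters, which is why one shared character code matters.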
The second very important step is the complete adaptation of software to all the requirements
that the Arabic language imposes. Many programmes, for instance, do not take into
account the fact that Arabic can, and from time to time needs to, be vocalised. This
leads to frustration among those who
use that software. Let me give just one example. When we produced the dictionary
Arabic – Dutch * Dutch – Arabic (Van Mol and Bergman 2001), we first had to contact
the database developers of 4D in Paris and work together with them to make the database
completely suitable for the Arabic language. Once the database was completely
adapted, we worked for many years on the development of the dictionaries. Every word was
completely vocalised and checked many times. When the work was finished, we were
able to transfer the files from the database to a desktop publishing programme in order to
produce a paper dictionary.
The problem that arose, however, was that the vowels on the Arabic words were not in
the right place in the printed version. For example, a fatha that had been put above the
shadda was placed under the shadda after conversion, which, of course, yields another
word. The same goes for other vowels, which often interfered with the dots of letters:
fathas and dammas were printed in the dot of a fa or the dots of a qaf, for
example. We encountered the same problem with the new manual for Arabic, which we
have composed and which will be published in the fall of 2007. Here too, the diacritic
elements were perfectly placed in the text processor, but when the texts were converted to
the printing programme, many changes occurred. Interference occurred, for example,
between the fatha and the hamza above the alif, etc.
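One part of this problem can be checked mechanically: whether the sequence of letters and diacritics stored in the file is still the same after a conversion. The sketch below is an illustration in Python and not the procedure used for the dictionary. It compares two strings after Unicode normalisation, so that a fatha and a shadda stored in a different order still count as the same vocalisation, while a fatha replaced by a kasra does not. The function names are hypothetical. A check of this kind cannot detect a diacritic that is stored correctly but drawn in the wrong position by the printing programme.

```python
import unicodedata

REH = "\u0631"
SHADDA = "\u0651"
FATHA = "\u064E"
KASRA = "\u0650"


def same_vocalisation(before: str, after: str) -> bool:
    """True if both strings are canonically equivalent in Unicode."""
    return unicodedata.normalize("NFC", before) == unicodedata.normalize("NFC", after)


def describe(word: str) -> list[str]:
    """List the Unicode name of every character, base letters and marks alike."""
    return [unicodedata.name(ch) for ch in word]


original = REH + SHADDA + FATHA    # shadda typed first, then fatha
reordered = REH + FATHA + SHADDA   # same marks, different storage order
altered = REH + SHADDA + KASRA     # a different vowel, hence a different word

print(same_vocalisation(original, reordered))  # equivalent sequences
print(same_vocalisation(original, altered))    # not equivalent
print(describe(altered))
```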
Problems of this kind are not limited to Arabic; they also occur with other languages when
databases are used. For French, for example, problems sometimes occur when the data of a
database are made available via the web. In that case too, colleagues of mine find
that the accent on the é disappears, etc. Such practical problems ought to be resolved by
programmers in the near future, because they make these systems unproductive. In my
case, it meant much extra work, not only for me, but also for the publishing house,
because we had to check every letter in the printer’s proof in order to correct it, and the
printer had to adjust every diacritical sign manually in order to get a correctly printed
form of the Arabic words. For both the author and the printing house, this is extra work that
could have been avoided if the programmes had been better adapted to the treatment of
Arabic.
Dictionaries
As far as the production of dictionaries is concerned, the Institute of Modern Languages,
together with 4D and Beyond in Belgium, has developed a database that is completely
adapted to the inventory and production of bilingual dictionaries Arabic – Foreign
language * Foreign Language – Arabic. The adaptation and the development of the
dictionaries were made possible by partial funding from the Dutch Language
Union. In what follows, we give a concise description of the structure of the
database.
The database consists of data compiled in four departments. There is a department of
Arabic, a department of Foreign language (in our case Dutch), a department of Arabic –
Foreign language, and finally a department of Foreign language – Arabic. It is important
to stress the division between those four parts, because this makes the database
suitable for use with other foreign languages as well. This is precisely one of the big
advantages of the revolution in information technology. Electronic
databases and texts can be used over and over again. The Arabic department, for
example, has been completely developed and can also be used for other languages. If a
team in Sweden or Croatia, for example, wants to produce a dictionary Arabic – Swedish
(or Arabic – Croatian), the Arabic part can be used again, which means a considerable
saving of time for other researchers who want to devote themselves to the
production of Arabic – Foreign language dictionaries.
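The four-part division can be sketched as a small relational schema. The following Python example uses an in-memory SQLite database purely for illustration; the actual database was built in 4D, and all table and column names below are assumptions, not the real design. It shows how one Arabic entry can be reused for a second foreign language without being entered again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE arabic_entry (            -- the Arabic department
    id INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL,
    part_of_speech TEXT,
    pattern TEXT
);
CREATE TABLE foreign_entry (           -- the foreign-language department
    id INTEGER PRIMARY KEY,
    language TEXT NOT NULL,
    lemma TEXT NOT NULL
);
CREATE TABLE arabic_to_foreign (       -- Arabic - Foreign language
    arabic_id INTEGER REFERENCES arabic_entry(id),
    foreign_id INTEGER REFERENCES foreign_entry(id)
);
CREATE TABLE foreign_to_arabic (       -- Foreign language - Arabic
    foreign_id INTEGER REFERENCES foreign_entry(id),
    arabic_id INTEGER REFERENCES arabic_entry(id)
);
""")

# One Arabic entry, stored once
conn.execute("INSERT INTO arabic_entry VALUES (1, 'kitaab', 'noun', NULL)")
# Two foreign-language entries that both link to it
conn.execute("INSERT INTO foreign_entry VALUES (1, 'nl', 'boek')")
conn.execute("INSERT INTO foreign_entry VALUES (2, 'sv', 'bok')")
conn.executemany("INSERT INTO arabic_to_foreign VALUES (?, ?)", [(1, 1), (1, 2)])


def translations(arabic_lemma: str, language: str) -> list[str]:
    """Look up the foreign equivalents of an Arabic lemma for one language."""
    rows = conn.execute(
        """SELECT f.lemma
           FROM arabic_entry a
           JOIN arabic_to_foreign af ON af.arabic_id = a.id
           JOIN foreign_entry f ON f.id = af.foreign_id
           WHERE a.lemma = ? AND f.language = ?
           ORDER BY f.lemma""",
        (arabic_lemma, language),
    )
    return [row[0] for row in rows]


print(translations("kitaab", "nl"))
print(translations("kitaab", "sv"))
```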
The advantage of an electronic dictionary is that it can be updated at any time, and that it
can also be consulted electronically, i.e. by conducting searches on the Internet by
means of the World Wide Web. But, of course, electronic lexical databases offer
many more possibilities, depending on the richness of the database. In our database,
which was primarily conceived for the production of a paper dictionary, we included a lot
of meta-information. To give one example, in the Arabic part we did not only include the
usual part-of-speech categories, such as verb, noun, adjective, etc. We also added
information on Arabic word patterns and categories. Once this information is stored in
the database, it is possible to obtain interesting information about the language in no time.
To give only a few examples: one might wonder which broken plural forms are frequent,
and which are not. In our database, we obtained the following results for some triptotic
plurals.
Pattern Number of occurrences %
Af‘aal 673 39
Fu‘uul 396 23
Fi‘aal 267 15
Fu‘al 183 11
Fu‘ul 105 6
Fu‘l 96 5.7
Fa‘iil 6 0.3
Total 1726 100
Table 1
So the most frequent pattern of triptotic broken plurals is: Af‘aal (673), followed by
Fu‘uul (396), followed by Fi‘aal (267), Fu‘al (183), Fu‘ul (105 occurrences), Fu‘l (96)
and finally Fa‘iil (6), which is the least frequent of the seven plurals investigated.
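As a small worked example, the ranking and the shares in Table 1 can be recomputed from the raw counts. The sketch below is in Python and takes the counts directly from Table 1; in the database itself the counts would come from a query on the pattern field, which is not shown here. The variable names are illustrative only.

```python
# Number of entries per triptotic broken plural pattern, copied from Table 1
counts = {
    "Af‘aal": 673,
    "Fu‘uul": 396,
    "Fi‘aal": 267,
    "Fu‘al": 183,
    "Fu‘ul": 105,
    "Fu‘l": 96,
    "Fa‘iil": 6,
}

total = sum(counts.values())

# Share of each pattern as a percentage of all entries counted
shares = {pattern: 100 * n / total for pattern, n in counts.items()}

# Patterns ordered from most to least frequent
ranking = sorted(counts, key=counts.get, reverse=True)

for pattern in ranking:
    print(f"{pattern:8} {counts[pattern]:5} {shares[pattern]:5.1f}")
print(f"{'Total':8} {total:5}")
```

Computed this way, the shares are unrounded, so the last digits may differ slightly from the rounded figures printed in the table.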
It is also possible to investigate the frequency of singular Arabic word patterns. For some
patterns our database yields the following results:
Pattern Number of occurrences %
Fa‘iil 737 46
Fa‘‘aal 383 24
Fa‘il 222 14
Fu‘uul 203 13
Fa‘laan 71 3
Total 1616 100
Table 2
The above tables, however, give only a static picture of the frequency of forms: they count
how many entries in the database follow a pattern, not how often those forms occur in real
language use. Although the pattern Fa‘iil is relatively rare as a plural, it
does occur in frequent words, such as hamiir, the plural
of the word himaar (donkey). The same goes for Table 2. The
pattern alone does not tell us everything about the parts of speech concerned. The pattern
Fa‘iil can be either a noun or an adjective.
In a lexical database, this information too can be added. We can make a distinction
between homographs that cover two or more grammatical categories.
Words and word forms in a lexical database
Words, of course, do not exist only in the static form in which they are stored in a
dictionary as a lemma. Words can be declined or conjugated. In our lexical database we
developed a programme that generates all the derived and conjugated forms of words. In
this way, a total of about 600,000 derived
or conjugated forms were generated from the 27,000 Arabic words in the database.
This means that we have generated for the verbs all the tenses (past, present, jussive and
imperative) and also all the persons of the tenses. For the nouns all plurals were
generated, but also the dual forms of the words and the dual forms in a construct state etc.
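The principle of generating forms from a lemma can be illustrated with a deliberately small sketch. The Python example below is not the programme described here, whose rules are not given in this paper. It covers only the past tense of a sound first-form verb, in a simplified transliteration, and the person labels are assumptions made for the example.

```python
# Past-tense endings for a sound first-form verb, in simplified transliteration.
# Labels: person, gender (m/f), number (s = singular, d = dual, p = plural).
PAST_SUFFIXES = {
    "3ms": "a",
    "3fs": "at",
    "2ms": "ta",
    "2fs": "ti",
    "1s": "tu",
    "3md": "aa",
    "3fd": "ataa",
    "2d": "tumaa",
    "3mp": "uu",
    "3fp": "na",
    "2mp": "tum",
    "2fp": "tunna",
    "1p": "naa",
}


def past_forms(stem: str) -> dict[str, str]:
    """Attach every past-tense ending to the stem of a sound verb."""
    return {person: stem + suffix for person, suffix in PAST_SUFFIXES.items()}


forms = past_forms("katab")  # stem of kataba (to write)
for person, form in forms.items():
    print(person, form)
```

Even this toy generator produces thirteen forms for one tense of one verb, which gives an idea of how 27,000 lemmas lead to hundreds of thousands of generated forms. Weak, doubled and hamzated verbs need additional rules that are not sketched here.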
The data we obtained by generating all the derived and conjugated forms,
however, also remain static data. They do not give a realistic view
of the Arabic language either. Many of the generated forms, just like many
lemmas in a dictionary, may be very uncommon. Some verbs, for
example, may never occur in the imperative, such as the verb ariqa (to be
sleepless). Moreover, many verbs have no passive form, while other verbs have
no active form, but only a passive one.
The same goes for nouns and the like. Not all nouns occur in the dual, and certainly not
in the dual as the first element of a construct state. Although the
information in lexical databases is crucial as a point of departure for scientific investigations
of language, the opportunities that lexical databases offer
by generating all kinds of word forms are not always to the point.
One always has to bear in mind that the more forms are generated, the more work is created,
because everything has to be checked, and the more elements are generated that might never occur
in real language use and therefore risk being totally irrelevant. The more data we gather,
and the more data we generate, the more difficult it becomes to obtain precise results in
scientific investigations, simply because it demands an enormous effort
to check it all.
The strategy for obtaining more precise results in computerized language investigation is
twofold. The first part is the compilation of a representative corpus and the second is the
detailed tagging of the corpus in order to obtain scientifically relevant results.
Compilation of corpora
The lexical database we have was not simply compiled by transferring data from an
existing dictionary into our database. That is the easiest way to compile a database.
The problem, however, is that, by doing so, data are accumulated in a database
without our having any idea of their current relevance. The famous dictionary of Hans
Wehr, for example, contains a lot of words that are completely obsolete. It is a fact that
Modern Standard Arabic is closely related to the Classical variety and that some
authors, for one reason or another, use obsolete and uncommon words. The presence of
obsolete Arabic words is therefore still understandable, but Hans Wehr also contains a lot
of words that are simple Arabic transliterations from European languages.
In colonial times such words may have been in use, but nowadays many of them do
not occur at all anymore in Modern Standard Arabic. An evolution in the use of words in
Modern Arabic has undoubtedly taken place.
In order to obtain current data we based our lexical database on an Arabic text corpus of
4,000,000 words, all dating from 1980 to the present. Precisely by comparing the lexical
database obtained this way with the dictionary of Hans Wehr, we were able to discover
that many European, and thus foreign, transliterated
words have disappeared from current Modern Standard Arabic usage. The second thing we
have learned by basing our dictionary on a corpus is that some dictionaries, such as the
English Arabic dictionary of Nafis, contain multi-word expressions that are not used at all
in the current Arabic language.
A lexicographer may try to describe concepts that he does not find in the target language,
but this can only be his last option. When multi-word expressions exist and occur
frequently in a language, it is up to the lexicographer to detect them and to present them
in his dictionary. When the lexicographer does not know the current expression, he may
give a description of it, but this must always be clearly marked, so that the user of the
dictionary knows that the expression he finds in the dictionary is a description and, as
such, an invention of the lexicographer, and not a currently used, generally accepted
multi-word expression. Here corpus linguistics can play an important role because it
gives insight into real language use.
Another observation we made by comparing our new corpus-based dictionary with the
older existing dictionaries that are not corpus-based is that there is a discrepancy between
the Arabic words proposed in older dictionaries and the words we found in real language use.
Over time, some lexicographers fill dictionaries with words proposed by
academics that never found their way into actual usage. This is a problem for all languages. In
Dutch, too, the word proposed by academics for the high-speed train was not accepted by
the population, which had become used to the French abbreviation TGV.
Representativeness of corpora
But here again some remarks have to be made. When researchers want to base their
findings on corpora, those corpora ought to be representative. Until now there are no
representative corpora of the Arabic language. Researchers use the material that can be
found most easily, which is, of course, the web and the electronic corpora of
Arabic that are made available by some newspapers. For the
findings of scientific work it is very important that the corpus on which research is based is
clearly defined, so that the results obtained can be put into perspective. Newspaper corpora
yield data only about that segment of the language.
A second prerequisite is that the corpus ought to be tagged. Untagged corpora give us a
biased view of language. As an example, we take the splendid initiative of our colleague
Dilworth Parkinson with his ArabiCorpus on the web. The programme gives a nice
concordance of mainly newspaper Arabic. Because the corpus
is untagged, however, the data obtained are biased. ArabiCorpus, for example, gives information
about the number and kind of words that occur before and after a chosen word. As
those words are untagged, the information obtained is not specific enough to be
scientifically relevant.
In most cases, the interference of the chosen word with other homomorphic words that
do not relate to the chosen word is too great for the results to be of scientific relevance. This
interference occurs, for example, with verbs of the first and the fourth form, and in many other cases.
When we looked for the verb fahima (to understand), many other elements showed up,
among others the word hatfahum, from the expression laqia hatfahu (to die), which in this
case is not a verb, but a noun with a pronominal suffix.
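The interference can be reproduced with a toy example. The Python sketch below contrasts a naive string search with a search on tagged tokens; it is an illustration only and says nothing about how ArabiCorpus is implemented. The three-field token format and the tag names are assumptions made for the example.

```python
# Each token: (written form, lemma in transliteration, hypothetical tag)
corpus = [
    ("فهم", "fahima", "VERB"),
    ("لقي", "laqia", "VERB"),
    ("حتفهم", "hatf", "NOUN+PRON"),   # hatfahum: noun with pronominal suffix
    ("يفهم", "fahima", "VERB"),
]


def raw_search(tokens, pattern):
    """Untagged search: every form that contains the string is a hit."""
    return [form for form, _, _ in tokens if pattern in form]


def tagged_search(tokens, lemma, pos):
    """Tagged search: only forms of the requested lemma and category."""
    return [form for form, tok_lemma, tok_pos in tokens
            if tok_lemma == lemma and tok_pos == pos]


print(raw_search(corpus, "فهم"))               # also returns the noun form
print(tagged_search(corpus, "fahima", "VERB"))  # verb forms only
```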
The tagging of Arabic corpora is so important precisely because of the
strong bias one obtains when searching untagged corpora, which is due to the fact that Arabic is
written unvowelled and is agglutinative. The second reason is that the larger the corpora we
obtain, the more time we lose weeding out the biased data. As a
lexicographer and a corpus analyser, I have a lot of experience with the time-consuming
aspect of working with words.
I was told, a long time ago, that some publishers of dictionaries limited the
average time a lexicographer was permitted to work on one lemma to seven minutes. This
was done to keep the work economically profitable. With the analysis of huge
untagged and unrepresentative corpora, one may spend a day on the study of one
lemma, and lose much of that precious time cleaning the data before they
are manageable for interpretation.
Admittedly, raw corpora, such as we find in ArabiCorpus, may yield interesting results. When
we looked for the collocation of kitaab (book) with tayyib or jayyid (good),
the search in ArabiCorpus suggested that the only proper adjective is jayyid (10 hits) and not
tayyib (no hits). Although this result is very interesting, one might ask
whether the hits for jayyid cover both its use as an adverb with the verb kataba (to
write) and its use as an adjective with the noun kitaab, or only one of the two. In order to
clear this up, other searches have to be made.
The necessity of variety in the choice of texts in a corpus
Let me give the following example of the need to compile corpora that
are representative both of the region and of the type of texts. At the Leuven Institute of
Modern Languages, a representative database of Arabic is under compilation. For the
moment, the database comprises texts from five sources, i.e. Morocco, Egypt,
Lebanon, Saudi Arabia and Arabic texts published in London. The texts vary from
newspaper Arabic to fiction and non-fiction, and they also include texts from
television programmes and schoolbooks.
In what follows we want to illustrate the importance of the availability of different kinds
of texts in a corpus, and of texts from different countries. To illustrate
this, we have investigated the spread of some Arabic words in the corpus, for instance the
word for ‘bus’, which can be utuubiis, baas, or the new word haafilat.
The following table shows the results for the word for ‘bus’ in written media Arabic.
Genre       Country  haafilat  baas  utuubiis  Total
Newspapers  EGY             0     1        12     13
Newspapers  LON             5     1         0      6
Newspapers  LEB             1     1         0      2
Newspapers  MOR             1     0         0      1
Newspapers  SAU             1     1         0      2
Magazines   EGY             0     0         3      3
Magazines   LON             4     5        21     30
Magazines   LEB             1     1         0      2
Magazines   MOR             6     0         0      6
Magazines   SAU             3     0         1      4
Total                      22    10        37     69
Table 3
The following table gives a survey of the occurrence of the different possibilities for the
word ‘bus’ in Arabic in literature. The literature corpus of the Gulf States is still under
construction. That is why we cannot give results for the Gulf States for the fiction and
non-fiction genres.
Genre        Country  haafilat  baas  utuubiis  Total
Non-fiction  EGY             2     0         2      4
Non-fiction  LEB             5     1         0      6
Non-fiction  MOR             9     1         0     10
Fiction      EGY             2     5        23     30
Fiction      LEB             1     8         1     10
Fiction      MOR            23     1         0     24
Total                       42    16        26     84
Table 4
Table 5 shows the results for the television corpus. Here we make a distinction between
formal spoken language and informal spoken language. JAZ stands for programmes recorded
from the Al-Jazeera television station.
Register  Country  haafilat  baas  utuubiis  Total
Informal  EGY             0     0         0      0
Informal  JAZ             1     0         0      1
Informal  LEB             0     4         0      4
Informal  MOR             0     0         0      0
Informal  SAU             0     0         0      0
Formal    EGY             0     2         2      4
Formal    JAZ             3     2         0      5
Formal    LEB             1     1         0      2
Formal    MOR             0     1         0      1
Formal    SAU             1     0         0      1
Total                     6    10         2     18
Table 5
Table 6 shows the results from schoolbooks. These are the manuals used in the different Arab
countries for the study of the Arabic language.
Genre        Country  haafilat  baas  utuubiis  Total
Schoolbooks  EGY            13     0         1     14
Schoolbooks  LEB            18     0         0     18
Schoolbooks  MOR            31     0         0     31
Schoolbooks  SAU            10     0         0     10
Total                       72     0         1     73
Table 6
The last four tables show the importance of compiling varied corpora that contain
all kinds of texts. From the above tables the following can be concluded. Starting with the
last table (Table 6), we see very clearly that the language policy, as it is reflected in
schoolbooks of the Arabic language in the different Arab countries investigated, advises
the use of the word haafilat for ‘bus’. This is a word that is probably proposed by the
different Arab academies and ministries of education as the word to be used from now on
and in the future. Only in Egypt do we find one case of the use of utuubiis.
In media Arabic, we clearly see that the word haafilat is used, but not in the majority of
cases. In newspapers and magazines the word utuubiis is still prevalent (37 out of
a total of 69), whereas on television the word baas is prevalent (10 out of a total of 18). The
word haafilat, though, is used in media Arabic in a limited way: 22 occurrences out of 69
in written media Arabic, which is 32% of the total, and 6 occurrences out of 18 in television
programmes, which is about the same percentage as in written media language, viz. 33%.
It is remarkable, though, that in written media Arabic the non-fusha words are almost all
used in Egypt and in London, viz. in 91% of the cases (43 occurrences out of 47). It is
also remarkable that the other countries in the survey use the word haafilat in written
media Arabic in the majority of cases.
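The percentages quoted for written media Arabic follow directly from Table 3. As a check, the following Python sketch recomputes them from the counts in that table; the data structure and the helper name are illustrative choices, not part of the database.

```python
# Counts from Table 3 as (haafilat, baas, utuubiis) per genre and country
TABLE_3 = {
    ("Newspapers", "EGY"): (0, 1, 12),
    ("Newspapers", "LON"): (5, 1, 0),
    ("Newspapers", "LEB"): (1, 1, 0),
    ("Newspapers", "MOR"): (1, 0, 0),
    ("Newspapers", "SAU"): (1, 1, 0),
    ("Magazines", "EGY"): (0, 0, 3),
    ("Magazines", "LON"): (4, 5, 21),
    ("Magazines", "LEB"): (1, 1, 0),
    ("Magazines", "MOR"): (6, 0, 0),
    ("Magazines", "SAU"): (3, 0, 1),
}


def percent(part: int, whole: int) -> int:
    """Share of part in whole, rounded to a whole percentage."""
    return round(100 * part / whole)


total = sum(sum(row) for row in TABLE_3.values())
haafilat = sum(row[0] for row in TABLE_3.values())

# baas and utuubiis together are the non-fusha words
non_fusha = total - haafilat
non_fusha_egy_lon = sum(
    row[1] + row[2]
    for (genre, country), row in TABLE_3.items()
    if country in ("EGY", "LON")
)

print(haafilat, total, percent(haafilat, total))                      # 22 69 32
print(non_fusha_egy_lon, non_fusha, percent(non_fusha_egy_lon, non_fusha))  # 43 47 91
```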
This means that the influence of words for ‘bus’ that are probably more frequently used in
everyday spoken Arabic is still large in media Arabic. When we take a look at the table for
literature we get quite a different result. In literature half of the occurrences of the word
for ‘bus’ are the fusha word haafilat (42 out of 84, which is exactly 50%). But here too
we discover that the larger part of the more colloquial words is used in Egyptian literature
(30 occurrences out of 42, which is approximately 71%). It is also remarkable that in both
Moroccan fiction and non-fiction, the authors use the fusha word instead of the other
possibilities in the overwhelming majority of cases. The following table presents a survey
of the percentages.
Percentage of use in fiction and non-fiction.
Genre        Country  haafilat  baas  utuubiis  Total
Non-fiction  EGY            50     0        50    100
Non-fiction  LEB            83    17         0    100
Non-fiction  MOR            90    10         0    100
Fiction      EGY             6    17        77    100
Fiction      LEB            10    80        10    100
Fiction      MOR            96     4         0    100
Table 7
This table makes it clear that the country where the fusha word haafilat is the most
widespread in literature is Morocco (an average of 93%), whereas in Lebanon the fusha
word haafilat is much more frequently used in non-fiction (83%) than in fiction (10%).
Another remarkable result is that Egyptian authors of fiction stick to the
colloquial words: the fusha word haafilat accounts for only 6% of the occurrences.
Another example: the word for ‘wet’
There are two words for ‘wet’ in Arabic, viz. the word muballal
and the word mabloul. An investigation of the corpus yields the following results:
Genre       Country  mabloul  muballal  Total
Newspapers  EGY            1         1      2
Newspapers  LON            0         0      0
Newspapers  LEB            0         0      0
Newspapers  MOR            0         0      0
Newspapers  SAU            0         0      0
Magazines   EGY            0         0      0
Magazines   LON            0         0      0
Magazines   LEB            1         0      1
Magazines   MOR            0         0      0
Magazines   SAU            0         0      0
Total                      2         1      3
Table 8
This table clearly shows that some words do not often occur in written media Arabic. If
we base word counts only on newspaper Arabic, as was done by Fromm (1981), or only on
prose, as was done by Landa (1959), we obtain a biased view of frequency in
the language as a whole.
Let us now examine the frequency of these words in oral media Arabic, viz. both formal
and informal spoken language in television programmes. The results are shown in the
following table.
Register  Country  mabloul  muballal  Total
Informal  EGY            1         0      1
Informal  JAZ            0         3      3
Informal  LEB            0         0      0
Informal  MOR            0         1      1
Informal  SAU            0         0      0
Formal    EGY            0         0      0
Formal    JAZ            0         0      0
Formal    LEB            0         0      0
Formal    MOR            0         0      0
Formal    SAU            0         0      0
Total                    1         4      5
Table 9
It is clear that in oral media Arabic too, the occurrence of both words is very low. In
formal speech, we did not find any occurrence. Only in informal speech do both words
occur, but they seem to be rather uncommon.
The question we want to raise here is whether a literature corpus would yield a different
result as far as both words are concerned. In the following table we present the result.
Genre        Country  mabloul  muballal  Total
Non-fiction  EGY            0         2      2
Non-fiction  LEB            0         0      0
Non-fiction  MOR            0         4      4
Fiction      EGY            0         5      5
Fiction      LEB            5        14     19
Fiction      MOR            0         8      8
Total                       5        33     38
Table 10
Table 10 clearly shows that both words are much more frequent in literature than in
media Arabic. It shows us even more, namely that the word muballal is used exclusively
in all countries except Lebanon, where the word mabloul is still used in 26% of the
cases. The media corpus and the literature corpus are of the same size, viz.
approximately 2,700,000 words.
The last table shows us the use in schoolbooks, which gives us the following result:
Genre        Country  mabloul  muballal  Total
Schoolbooks  EGY            0         1      1
Schoolbooks  LEB            0         2      2
Schoolbooks  MOR            3         8     11
Schoolbooks  SAU            2         0      2
Total                       5        11     16
Table 11
In schoolbooks too, the frequency of the words is higher than in media Arabic, both oral
and written. It is clear that the word muballal is more frequently used than mabloul,
except in Saudi Arabia, where muballal is not used in schoolbooks at all. Because the literature
corpus of the Gulf States is not available yet, we cannot examine whether the word mabloul is
also preferred over the word muballal in the literature of the Gulf States. With
these tables we hope to have made clear that, in order to give a total view of
the language, corpora of the Arabic language ought to be compiled with a great variety of
text genres.
Detailed exploration of corpora
The searches conducted above are relatively simple. We have examined words
that are for the greater part unambiguous. Of the five words we have examined above,
only one is fairly ambiguous, namely the word baas, for this word also has
the meaning of bass (a musical voice). The word minibaas (minibus) also often
occurred in the corpus. Such occurrences had to be filtered out, because they do not
correspond to the word for ‘bus’ under investigation.
The question is, however, how we might be able to do more detailed and refined searches
of Arabic corpora. Here the tagging of corpora becomes extremely important, especially
because written Arabic is highly ambiguous. As Badawi
and others (2004) have stressed in their introduction, and as is generally known,
educated Arab speakers too are conscious of the ambiguity of forms and words in Arabic, not
only because of the unvowelled nature of the text but also because of its
semi-agglutinative character. In earlier articles we have given some examples of this ambiguity
(Van Mol 2000).
The question is how to tag the Arabic language in such a way that
precise searches can be conducted. In order to do so, we have developed a two-phase tagging
system at the Institute of Modern Languages at Leuven University. The first phase is based
on the systematic use of diacritic signs to disambiguate words. To give only one
example: in order to distinguish the preposition baa (with) from baa as the first
consonant of a word, we used a kashida after the letter baa as a preposition (Van Mol
2002). Simple vocalisation of an Arabic text is not sufficient for effective
disambiguation, because it does not disambiguate a text completely: a relatively large
number of ambiguous strings of characters between
spaces will always remain.
The system we have developed and described in our book on variation in Arabic
broadcasts (Van Mol 2003) makes it possible to run specific searches without
complicated programming. In this way we are able to carry out specific investigations of Arabic
grammar. One of the points we have investigated is the use of the particles of the future
sawfa and sa in radio news broadcasts from Algeria, Egypt and Saudi Arabia in a corpus
of 300,000 words.
In that study, we were able to prove statistically that the particle sawfa occurs much less
frequently than the particle sa in the three countries under investigation. A qualitative analysis,
however, revealed that the distinction in use between the two particles, as prescribed by
the Arab grammarians, is no longer applied in Modern Standard Arabic in news
broadcasts. The Arab grammarians prescribe that the particle sawfa has to be used to
indicate a remote future, whereas the particle sa has to be used to indicate a near future.
In reality, we see that both particles are used for a near future as well as for a remote
future. We might have concluded that the Arabic language has
undergone an evolution on this point. But the only conclusion we can draw is that current
usage departs from the prescriptions of the traditional Arab grammarians. In
order to establish a real evolution of the Arabic language on this point, the results ought to be
compared with the analysis of a representative corpus of Classical texts from a certain
period in history. One thing is clear, however: the analysis of corpora gives
real evidence of the ‘life’ of languages.
Although the detailed tagging of corpora yields interesting results, we have to point out
that corpora tagged in detail can also yield a tremendous amount of data that are very
time-consuming to analyse. When we conducted an analysis of the conjunctions wa and
fa on the smaller corpus of 300,000 words, we obtained 8,452 hits for the particle wa and
1,920 hits for the particle fa. Our corpus now comprises 12,000,000 manually tagged
words. If we were to investigate the use of these conjunctions in our new corpus, we can
calculate by extrapolation that we would find approximately 340,000 occurrences of wa and
76,800 occurrences of fa.
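The extrapolation is a simple proportion: the new corpus is forty times larger than the old one, so the expected number of hits is forty times higher. The short Python sketch below spells out the arithmetic. It assumes that the relative frequency of the particles stays the same in the larger corpus, which is an assumption and not a finding; the function name is illustrative.

```python
def extrapolate(hits: int, sample_words: int, target_words: int) -> float:
    """Scale a hit count from a small corpus to a larger one.

    Assumes the relative frequency of the word is the same in both corpora.
    """
    return hits * target_words / sample_words


SMALL_CORPUS = 300_000
LARGE_CORPUS = 12_000_000

wa_expected = extrapolate(8_452, SMALL_CORPUS, LARGE_CORPUS)
fa_expected = extrapolate(1_920, SMALL_CORPUS, LARGE_CORPUS)

print(wa_expected)  # 338080.0, i.e. approximately 340,000
print(fa_expected)  # 76800.0
```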
These are enormous figures. If we wanted to examine every case, it might take
a year to investigate every detail. This is why strategies that reduce the number of hits in a
scientific way have to be developed for tagged corpora as well. When huge numbers of occurrences
are found, it is necessary to take representative samples of the elements we are
looking for. We cannot sufficiently stress the importance and the necessity of
representativeness in language investigation, especially when corpus linguistics is
involved.
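One possible way of reducing a large set of hits is to draw a random sample of fixed size. The Python sketch below shows a simple random sample with a fixed seed, so that the same sample can be drawn again and checked by others. It is an illustration and not the sampling strategy of the institute, which is not described here; the sample size of 500 and the seed are arbitrary assumptions. Where country or text genre matter, the sample would have to be drawn separately within each group.

```python
import random


def sample_hits(hit_ids, sample_size, seed=2007):
    """Draw a reproducible simple random sample of corpus hits."""
    rng = random.Random(seed)               # fixed seed: same sample every run
    k = min(sample_size, len(hit_ids))      # never ask for more than exists
    return sorted(rng.sample(hit_ids, k))


# Stand-in for the positions of the expected 76,800 hits for fa
fa_hits = list(range(76_800))

subset = sample_hits(fa_hits, 500)
print(len(subset))
print(subset[:5])
```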
The definitive tagging of Arabic corpora
In order to obtain even finer tagging results, we have developed at the institute a
(semi)automatic tagger that attributes part-of-speech information to each word in the
corpus. Affixes, both prefixes and suffixes, are automatically recognized, and the programme
also gives the different possibilities for every word. Sometimes a string of characters
between two spaces can have more than 20 possible tags. In an earlier article, we gave
an analysis of one sentence in Arabic and of the different possibilities that the programme
detected for every word (Van Mol 2005).
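How one string can receive several analyses may be illustrated with a toy segmenter. The Python sketch below simply enumerates every way of splitting a transliterated string into an optional prefix, a stem and an optional suffix. It is not the tagger described here: the affix lists are small hypothetical samples, and no lexicon is consulted to reject impossible stems.

```python
# Hypothetical, very incomplete affix lists in simplified transliteration
PREFIXES = ("", "wa", "fa", "bi", "sa", "al")
SUFFIXES = ("", "hu", "haa", "hum")


def candidate_segmentations(token: str) -> list[tuple[str, str, str]]:
    """Return every (prefix, stem, suffix) split allowed by the affix lists."""
    candidates = []
    for prefix in PREFIXES:
        if not token.startswith(prefix):
            continue
        rest = token[len(prefix):]
        for suffix in SUFFIXES:
            if not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            if stem:  # an analysis needs a non-empty stem
                candidates.append((prefix, stem, suffix))
    return candidates


for analysis in candidate_segmentations("wakitaabuhum"):
    print(analysis)
```

Even with these few affixes the string receives four candidate analyses, only one of which (wa + kitaabu + hum) is correct. A real tagger additionally has to assign a part of speech to each candidate, which is how the number of possibilities can exceed 20.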
The programme offers two ways of tagging: either by manual choice of the correct tag, the
semi-automatic way, or by automatic tagging. Thanks to the preliminary encoding, which is based
on the systematic and conventional application of diacritic signs, we reach an
accuracy of 97% for the automatic and definitive tagging of Arabic corpora.
Future perspectives
Once a corpus is tagged this way, many new horizons open up for the future of Arabic
corpus linguistics. We are now at the stage of developing software to do complex
searches. Indeed, the definitive tagging will make it possible to do combined searches,
both for words and for grammatical categories, or a combination of both. It will be
possible, for instance, to investigate the use of the verb kaana as an auxiliary in combination
with other verbs, precisely because every verb is tagged as a verb.
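A combined search of this kind could look roughly as follows. The sketch assumes that each token carries a word form, a lemma and a Part of Speech tag; the `Token` structure, the tag names `VERB` and `NOUN`, the transliterated example and the three-word window are illustrative assumptions and do not reflect our actual tagset or software.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # word form as it appears in the text (transliterated here)
    lemma: str  # dictionary form
    pos: str    # Part of Speech tag


def find_auxiliary_uses(tokens, aux_lemma="kaana", verb_tag="VERB", window=3):
    """Return (aux_index, verb_index) pairs where a form of the auxiliary
    is followed by another verb within `window` tokens."""
    matches = []
    for i, tok in enumerate(tokens):
        if tok.lemma != aux_lemma or tok.pos != verb_tag:
            continue
        # Look ahead for the first token tagged as a verb.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if tokens[j].pos == verb_tag:
                matches.append((i, j))
                break
    return matches


# Hypothetical tagged sentence: "kaana al-waladu yaktubu" (the boy was writing).
sentence = [
    Token("kaana", "kaana", "VERB"),
    Token("al-waladu", "walad", "NOUN"),
    Token("yaktubu", "kataba", "VERB"),
]

print(find_auxiliary_uses(sentence))  # [(0, 2)]
```

The search combines a lexical condition (the lemma kaana) with a grammatical one (the following word must be tagged as a verb), which is only possible once every word carries a reliable tag.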
Another very important perspective for the future is the automatic tagging of other raw
Arabic corpora. Once we have a representative corpus that is tagged in a very accurate
way, this corpus might serve as a basis for the automatic tagging of other corpora, so that
we might obtain larger corpora with a high degree of correctness as far as the tagging is
concerned.
Another field in which tagged corpora may play a big role is the field of education (see
Van Mol 2006). Many applications can be developed for educational purposes. At the
institute we are currently developing a programme that will allow students to
query Arabic sentences by themselves about the meaning of the words and about
grammatical issues related to these words.
Conclusion
As mentioned in the Introduction, the innovation in techniques for the examination and
exploration of language is one of the latest achievements in science, similar to what text
processing has done for the production and editing of texts in the past decade. The late
Hans Wehr, and also Lane, had to work a lifetime on the compilation of dictionaries for
Arabic. Others worked hard on the examination of language structures by means of books
and other relevant sources. Thirty years ago, when I was a young man, I was already
interested in making a frequency count of words in Arabic, in order to give my students a
clearer insight into the importance of some words. The way to do that was to write the
words on large sheets of paper, in alphabetical order, as they were found in
Arabic texts. The only way to count in those days was by tallying the words one by one.
The same was true for sociological investigations at that time, where statistics on
observations were largely compiled by tallying. Later, I had to give up my tallying
of Arabic words, because I realised that the task was not feasible that way.
The Arabic paper dictionary that we have developed for Dutch started the same way:
the first notes were made in a notebook and then on loose sheets, until the first home
computers became available seven years after I started. With the first simple text processing
programmes much more became possible. First of all, the fact that data no longer had to be
copied over and over again for other uses, but could be reused, was
tremendous progress. The availability of the first databases made even more
possible. Time-consuming work became feasible. On the other hand, developments in
linguistics led to the possibility of compiling huge amounts of data.
This is the other side of the coin. The huge amount of data might lead to a proliferation
that might become complex and immense, and even unmanageable. It might even lead to
a kind of discouragement for some researchers who might be overwhelmed by the
availability of a vast amount of data, which might throw them back to drudgery.
I hope to have made clear that two basic principles have to be kept in mind for the further
responsible scientific analysis of future corpora. The first is the development of highly
specific tagging software and of a representative corpus that is tagged according to the
specifications of a well-defined tagset. The second is the representativeness of
corpora, which is essential if we do not want to take the risk of being overwhelmed by data
that we might never be able to analyse because of lack of time.
Other sciences, too, base their investigations on representative samples. When
examining the quality of seawater, a small sample is taken. When investigating
ocean currents, small samples are likewise taken at different places so as to be
representative. In medicine, scanners exist that can give a survey of the human
body. The development of representative corpora and of the appropriate
software to investigate them may yield results comparable to those of a scanner in medicine,
provided that the scientist remains very critical about the representativeness of the data
and is aware that not every detail has to be scrutinized.
Bibliography
Badawi, Elsaid, M. G. Carter, and A. Gully. 2004. Modern written Arabic: A
comprehensive grammar. London: Routledge.
Fromm, Wolf-Dietrich. 1981. Frequency dictionary of modern newspaper Arabic. State
mutual book & periodical services.
Landau, Jacob. 1959. A word count of modern Arabic prose. New York: American
Council of Learned Societies.
Van Mol, Mark. 2000. Exploring annotated Arabic corpora, preliminary results. In
Corpora and natural language processing: Proceedings of the International Conference
on Artificial and Computational Intelligence for Decision, Control, and Automation in
Engineering and Industrial Applications, pp. 94–98. Monastir.
Van Mol, Mark. 2002. The Semi-automatic tagging of Arabic corpora. In Workshop
Proceedings –Arabic Language Resources and Evaluation– Status and Prospects, pp.
40–44. LREC.
Van Mol, Mark. 2003. Variation in modern standard Arabic in radio news broadcasts: A
synchronic descriptive investigation in the use of complementary particles, Orientalia
Lovaniensia Analecta, no. 117. Leuven.
Van Mol, Mark. 2005. From lexical database to tagged Arabic corpus. In ACIDCA-ICMI
conference, International Conference on Machine Intelligence, Tozeur, 4–7 November
2005.
Van Mol, Mark. 2006. Arabic receptive language teaching: A new CALL approach. In
Handbook for Arabic Language Teaching Professionals in the 21st Century, ed. Kassem
M. Wahba and Zeinab Taha, pp. 305–314. Lawrence Erlbaum Publishers.