[IEEE 2013 International Conference on Computing, Electrical and Electronics Engineering (ICCEEE) - Khartoum, Sudan (2013.08.26-2013.08.28)] 2013 INTERNATIONAL CONFERENCE ON COMPUTING,

2013 INTERNATIONAL CONFERENCE ON COMPUTING, ELECTRICAL AND ELECTRONIC ENGINEERING (ICCEEE)


Segmentation Algorithm for Arabic Handwritten Text Based on Contour Analysis

Yusra Osman

Nile Center for Technology Research, NCTR

Khartoum, Sudan

[email protected]

Abstract—Segmentation is the process of dividing a binary image into useful regions according to certain conditions. It is the most important phase in any optical character recognition (OCR) system, and its accuracy significantly affects the recognition rate of that system. In cursive languages such as Arabic, segmentation is complicated, especially in handwritten documents, because writers' styles differ and because of the special cases of overlapping characters and ligatures. Hence, segmentation algorithms must be designed around general descriptors that most writers follow. In this paper, a segmentation algorithm for Arabic handwriting is developed. The main idea of the algorithm is to divide the selected image into lines and sub-words. Then, the contour of each sub-word is traced. After that, the algorithm detects the exact points where the contour changes from a horizontal line to a vertical or curved line. Finally, the coordinates of these points are taken as the segmentation points. The algorithm was tested on words from the IFN/ENIT database. Over 537 tested words containing 3222 characters, the algorithm extracted 89.4% of the character segmentation points correctly.

Keywords- Segmentation; Arabic language features; Handwritten text

I. INTRODUCTION

Image segmentation is one of the most important concerns in digital image processing and a long-standing problem in computer vision. It has been classified as the second core stage, after recognition, in any optical character recognition (OCR) system. In addition, it is the main source of errors in the recognition stage and still represents a challenge. Hence, an efficient design of the segmentation stage is needed.

The design challenge of a segmentation algorithm comes from both the language and the type of writing it processes. In terms of the type of writing, handwritten documents are the most difficult. This difficulty arises from many sources. One example is the cursive style that most writers tend toward. Another is the writer's mood: a single writer may have more than one style of writing depending on mood. In terms of the language of the handwritten text, Arabic is one of the most difficult languages. Moreover, Arabic still suffers from a lack of research compared to Latin and Chinese [1].

The main challenges in Arabic are that some characters may overlap with others, which makes the segmentation paths within words ambiguous and difficult to detect. Also, some characters may be stacked together in ligatures. Thus, there may be more than one character in one region of interest.

Several methods and algorithms have been presented to segment handwritten text into characters. However, most of them do not currently solve the overlapping problem. Some of them use artificial intelligence methods and others do not.

In [2], the algorithm first detects the baseline of each sub-word using horizontal projection analysis. Second, it analyzes the vertical projection of the sub-word to examine its relative distance from the calculated baseline. Points far from the baseline are ignored.

Another algorithm is presented in [3]; it determines the intersection points by analyzing a skeleton image. Then, the algorithm analyzes the image contour to find the closest path to the extracted intersection points. The algorithm chooses the first three lower peaks of the distance map between the intersection points and the chain code as the final segmentation points.

A different technique is introduced in [4]; it uses two processes to determine the final sequence of tentative segmentation paths. First, the algorithm applies a simplified version of the character segmentation algorithm found in [5], which relies on vertical projection analysis of each sub-word to determine weak points. Second, the algorithm extracts the convex dominant points by detecting the zero derivatives of the curvature of the contour.

The use of a neural network for validating the segmentation points is presented in [6]. The segmentation points are validated using features such as segment directions.

In this paper, the proposed algorithm mimics the way a human writes Arabic script. Instead of detecting the baseline or the intersection points, the algorithm detects the points where the contour of each sub-word changes from a horizontal line to a vertical or curved line. These points represent where writers start to write a new letter.

The organization of this paper is as follows. Section II covers basic background on the properties of the Arabic language. Then, a detailed description of the proposed algorithm is provided in Section III. The results obtained from the proposed algorithm are presented in Section IV. A conclusion and future work are discussed in Section V.

978-1-4673-6232-0/13/$31.00 ©2013 IEEE


II. BACKGROUND

An in-depth understanding of the characteristics of Arabic writing is essential for implementing a segmentation system for it. This knowledge helps to assess the suitability of existing techniques and may also lead to the development of new ones. Arabic has no upper-case or lower-case letters and is written from right to left. There are 28 characters in the Arabic alphabet. Each character has 2 to 4 different forms, depending on its position in the word or sub-word; all letters are shown in Figure 1.

Figure 1 Arabic alphabet

Also, some Arabic words consist of one or more units called sub-words when they contain certain special characters such as “ز, ذ, ر, د, ا”. As an example, the word (حاسب, Haseb) consists of two sub-words: (حا, Ha) and (سب, seb). Figure 2 illustrates an example of an Arabic word consisting of four sub-words.

Figure 2 Arabic word contains two sub-words

Arabic is a very challenging language; it is a cursive language, meaning that the characters in a sub-word are connected along an imaginary horizontal line (the baseline), as shown in Figure 3.

Figure 3 The concept of baseline in Arabic text

Moreover, some special characters such as “و ,ر, ز” lead to overlapping between characters, as shown in Figure 4. Also, there is a special group of characters “ه, م , ح , ج , خ” that cause the sub-word to have stacked characters; this is illustrated by an example in Figure 5.

Figure 4 Some overlapping cases between characters

Figure 5 Some characters stacked together in ligatures

Text segmentation algorithms can be categorized into segmentation into primitives and segmentation into characters [4]. Segmentation into primitives divides a cursive word into primitives that are possibly smaller than characters, such as intersection points, loops, dots and ligatures. The decision on how many primitives to combine into a larger unit is made later, at the recognition or classification stage. Segmentation into characters, on the other hand, tries to detect the exact positions of the characters in order to segment the word.

III. PROPOSED APPROACH

The model of the proposed algorithm is illustrated in Figure 6. In the first stage of the model, the selected image is converted to a binary image and cleaned to remove any small noisy components. To reconcile the processing direction of the programming tool with the right-to-left writing direction of Arabic, the pre-processed image is flipped horizontally. The image is then ready to enter the segmentation phase, which consists of three levels: line, sub-word and character segmentation.
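The pre-processing stage can be sketched as follows. This is a minimal illustration rather than the author's implementation: the fixed threshold of 128, the assumption that ink is darker than paper, and the function name preprocess are all hypothetical, and the removal of small noisy components is omitted here.

```python
def preprocess(gray, threshold=128):
    """Binarize a grayscale image and mirror it horizontally.

    gray is a list of rows of intensities in 0..255; ink is assumed to be dark.
    Returns rows of 0/1 values where 1 marks an ink (foreground) pixel.
    The threshold value is an assumption, not a value taken from the paper.
    """
    binary = [[1 if pixel < threshold else 0 for pixel in row] for row in gray]
    # Mirror every row so that right-to-left text is scanned left to right.
    return [row[::-1] for row in binary]


page = [
    [255, 255, 255, 255],
    [255,  10, 255, 255],
    [255,  20,  30, 255],
]
print(preprocess(page))  # [[0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 1, 0]]
```

Flipping only mirrors the column order, so any column index found later can be mapped back to the original image as width - 1 - column.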

Figure 6 The algorithm's model

[Figure 6 stages: Image acquisition and pre-processing; page segmentation into lines; line segmentation into sub-words; sub-word segmentation into characters]

In the second stage of the model, the image is segmented into lines by analyzing the horizontal projection of the image rows. This analysis detects the white spaces between rows, because the actual text lines lie between the white spaces. Hence, the starting and ending indices of each text line between two successive white spaces are recorded as Line Start and Line End respectively. Table I contains the line segmentation procedure in detail.

TABLE I. LINES SEGMENTATION ALGORITHM

Step 0  Build the horizontal projection of the image. Store the projection values of the rows in array x.
        Take the first element in array x.
        Go to Step 1.

Step 1  If the current element's index is smaller than the maximum index in array x:
            If the current element equals 0 and the next element is greater than 0:
                Store the next element's index in the Line Start array. Go to Step 2.
            Else if the current element is greater than 0 and the next element equals 0:
                Store the current element's index in the Line End array. Go to Step 2.
            Else:
                Go to Step 2.
        Else:
            Go to Step 3.

Step 2  Take the next element in array x. Go to Step 1.

Step 3  END
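The procedure in Table I can be sketched in a few lines of Python. The function names are illustrative, and the handling of a text line that touches the first or last row is an assumption added for robustness; Table I itself only records transitions between zero and non-zero values.

```python
def horizontal_projection(binary):
    """Count ink pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in binary]


def find_text_runs(projection):
    """Return (starts, ends) of consecutive non-zero runs in a projection.

    A zero -> non-zero transition opens a run and a non-zero -> zero
    transition closes it, as in Table I. Unlike the table, a run that
    touches the first or last index is also reported (an assumption).
    """
    starts, ends = [], []
    previous = 0
    for index, value in enumerate(projection):
        if previous == 0 and value > 0:
            starts.append(index)
        elif previous > 0 and value == 0:
            ends.append(index - 1)
        previous = value
    if previous > 0:
        ends.append(len(projection) - 1)
    return starts, ends


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]
line_start, line_end = find_text_runs(horizontal_projection(page))
print(line_start, line_end)  # [1, 4] [2, 5]
```

Each pair (line_start[i], line_end[i]) gives the first and last row of one text line.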

After the image is segmented into lines, every line is divided into sub-words. This time, the vertical projection of each line is analyzed. The white spaces are detected as zero values. Then, the starting and ending indices of each run of non-zero values between white spaces are stored as Sub-word Start and Sub-word End respectively. Table II explains the steps of the procedure.

TABLE II. SUB-WORDS SEGMENTATION ALGORITHM

Step 0  Build the vertical projection of the text line. Store the projection values of the columns in array y.
        Take the first element in array y.
        Go to Step 1.

Step 1  If the current element's index is smaller than the maximum index in array y:
            If the current element equals 0 and the next element is greater than 0:
                Store the next element's index in the Sub-word Start array. Go to Step 2.
            Else if the current element is greater than 0 and the next element equals 0:
                Store the current element's index in the Sub-word End array. Go to Step 2.
            Else:
                Go to Step 2.
        Else:
            Go to Step 3.

Step 2  Take the next element in array y. Go to Step 1.

Step 3  END
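A compact way to express the sub-word step is to group consecutive columns by whether they contain ink. The sketch below is one possible formulation, with illustrative function names; it returns the same start and end indices that Table II collects in two arrays, but as pairs.

```python
from itertools import groupby


def vertical_projection(line):
    """Count ink pixels (value 1) in every column of a binary text line."""
    return [sum(column) for column in zip(*line)]


def split_sub_words(line):
    """Return (start, end) column pairs, one per sub-word in a text line."""
    projection = vertical_projection(line)
    spans = []
    column = 0
    # Group consecutive columns by whether they contain any ink.
    for has_ink, group in groupby(projection, key=lambda value: value > 0):
        width = len(list(group))
        if has_ink:
            spans.append((column, column + width - 1))
        column += width
    return spans


line = [
    [1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 1],
]
print(split_sub_words(line))  # [(0, 1), (4, 4), (6, 6)]
```

Sub-words that overlap vertically share at least one inked column and therefore come out as a single span, which is why the later stage labels connected components inside each span.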

Now the image is divided into lines and sub-words and is ready to be segmented into characters. At this stage, the incoming image is thinned to a width of one pixel. The purpose of this step is to reduce the number of boundary pixels that must be traced. Then, the small dots and zigzags that Arabic text may contain are checked: if the area percentage of a connected component is less than 10%, that object is deleted. Hence, the dots and zigzags are removed so that the analysis concentrates on the main shape of the sub-word.
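The removal of small components can be illustrated with a breadth-first labeling of connected components. The text does not state what the 10% is measured against; the sketch below assumes it is the component's share of all ink pixels in the sub-word, and it assumes 8-connectivity. Both choices, and the function names, are assumptions.

```python
from collections import deque


def label_components(binary):
    """Group ink pixels into 8-connected components (lists of (row, col))."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_small_components(binary, min_fraction=0.10):
    """Erase components whose share of the total ink is below min_fraction."""
    components = label_components(binary)
    total = sum(len(pixels) for pixels in components)
    cleaned = [row[:] for row in binary]
    for pixels in components:
        if len(pixels) < min_fraction * total:
            for y, x in pixels:
                cleaned[y][x] = 0
    return cleaned


# An 11-pixel stroke with a single-pixel dot above it (1/12 of the ink).
sub_word = [
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
]
cleaned = remove_small_components(sub_word)
print(cleaned[0])  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

A fixed percentage is simple, but a short letter in a long sub-word can fall below the threshold just as a dot does.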

Next, the remaining object is labeled, because the sub-word may contain more than one connected component. This is a typical case of overlapping characters, so the algorithm should check for this possibility too. Then, every connected component is traced beginning from a point called the starting point. Figure 7 shows the algorithm's flow chart.

Figure 7 The algorithm's flow chart

After that, the Freeman chain code [7] is applied to translate the traced boundary into numbers between 0 and 7 to express the direction change. Figure 8 illustrates the chain code numbering.
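As a small illustration of the encoding, the sketch below converts an ordered list of traced pixels into Freeman codes. It assumes the common numbering in which 0 points right and the codes increase counter-clockwise in 45-degree steps (2 = up, 4 = left, 6 = down); the numbering actually used in Figure 8 may differ. The function name chain_code is illustrative.

```python
# Freeman 8-direction codes under the assumed convention:
# 0 = right, then counter-clockwise (2 = up, 4 = left, 6 = down).
# Keys are (row step, column step); rows grow downward in image coordinates.
FREEMAN_CODE = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def chain_code(path):
    """Translate an ordered list of 8-adjacent (row, col) pixels into codes."""
    codes = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        step = (r1 - r0, c1 - c0)
        if step not in FREEMAN_CODE:
            raise ValueError(f"pixels {(r0, c0)} and {(r1, c1)} are not adjacent")
        codes.append(FREEMAN_CODE[step])
    return codes


# A stroke that runs left along a row and then climbs straight up.
path = [(5, 4), (5, 3), (5, 2), (4, 2), (3, 2)]
print(chain_code(path))  # [4, 4, 2, 2]
```

A path of n pixels yields n - 1 codes, one per step between neighboring pixels.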

[Figure 7 flow chart steps: Start; sub-word thinning; removal of small components (dots and zigzags); starting point calculation for the remaining object; trace the object's boundaries; build the chain code of the traced boundary; check whether there is a change to a vertical or sloped line (Yes/No); record this pixel's coordinates; End]

Figure 8 Freeman chain code

Next, the chain code is analyzed to determine the points where it changes from a steady horizontal line to a vertical or sloped line. Figure 9 shows a typical illustration of chain code tracing for the letter Lam “ل”.

Figure 9 Typical example of chain code tracing of the letter Lam “ل”

The aim of this analysis is to mimic the way a human writes Arabic script. Thus, the coordinates where the writer starts to draw a new character are detected. A typical analysis follows the chain code of the word contour, as in Figure 10, where a change of the chain code from 4 to 2 indicates a vertical line.
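The detection rule can be sketched as a scan over the chain code that flags every position where a run of horizontal codes is followed by a different direction. The paper does not say how long a run must be to count as "steady"; the min_run parameter below is a hypothetical stand-in for that criterion, and treating codes 0 and 4 as horizontal assumes the usual Freeman numbering.

```python
HORIZONTAL = {0, 4}  # assumed: codes for moving right or left along a row


def segmentation_indices(codes, min_run=3):
    """Return chain-code indices where a horizontal run gives way to another direction.

    min_run is a hypothetical parameter: the number of consecutive horizontal
    codes required before a change counts as leaving a steady horizontal line.
    """
    points = []
    run = 0
    for index, code in enumerate(codes):
        if code in HORIZONTAL:
            run += 1
        else:
            if run >= min_run:
                points.append(index)
            run = 0
    return points


codes = [4, 4, 4, 4, 2, 2, 2, 6, 6, 4, 4, 4, 3, 3, 4, 2]
print(segmentation_indices(codes))  # [4, 12]
```

If codes[i] encodes the step from the i-th to the (i+1)-th contour pixel, then a returned index i identifies the pixel at which the new direction begins, and its coordinates are what would be recorded as a candidate segmentation point.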

Figure 10 The defined segmentation points in the algorithm

By the end of this stage, all character segmentation points are recorded, along with the line and sub-word segmentation points obtained in the previous stages. Finally, all these points are marked on the image of the input text. Figure 11 shows a typical process for an Arabic word.

Figure 11 A typical Arabic word processed by the algorithm.

IV. EXPERIMENTAL RESULTS

The developed algorithm was tested on 537 words taken from the IFN/ENIT database. The words were selected carefully to cover all shapes of the Arabic characters. The obtained results show that 89.4% of the segmentation points were extracted correctly.

Figure 12 Samples from the obtained results

In Figure 12, segmented lines are marked in green and blue, while sub-word starts and ends are marked in cyan and magenta. The starting points from which the boundary was traced are marked with yellow triangles, while character segmentation points are marked with red circles.

Some characters were over-segmented. The letters Seen and Sheen were over-segmented because they are composed of connected vertical lines, as shown in Figure 13. The task of merging these over-segmented parts is left to the classification stage.

Figure 13 Over-segmentation problem in letters seen and sheen respectively

Also, the letters Taa and Thaa were over-segmented because the algorithm also detects the vertical line they contain, as illustrated in Figure 14.

Figure 14 Over-segmentation problem in letters Thaa and Taa respectively

The algorithm successfully detects the ligatures contained in Arabic words, as shown in Figure 15. This is attributed to the algorithm's design, which detects any change of the contour from a horizontal line to a vertical or curved line.

Figure 15 Successful segmentation of ligatures

Other letters produce a segmentation point very close to the end of the sub-word. A possible solution is to merge this point with the sub-word end segmentation point. However, there is a special case for the letter Alif, where this point is important and merging would lead to the loss of the letter. This task is therefore left to the recognition phase, which checks whether the object located between two segmentation points near the sub-word end is the letter Alif (a vertical line) or a remainder of the previously segmented letter (a curved object).

To integrate the results obtained by this algorithm with the recognition phase of any OCR system, all the extracted segmentation points are merged together. The starting points, character segmentation points and sub-word end points form the final coordinates to be considered. Note that the sub-word starting points were used at the beginning to calculate the starting points. The starting points' coordinates are more accurate because they identify each sub-word individually, even when sub-words overlap and are not recognized as separate sub-words at the sub-word segmentation stage. Finally, every letter is obtained by extracting the object that lies between two successive coordinates in the array of final segmentation points.
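The final extraction step can be illustrated as pairing successive segmentation coordinates. The sketch assumes that each point is reduced to a column index within the sub-word image, which simplifies the two-dimensional coordinates the algorithm records; the function names are illustrative.

```python
def letter_spans(start, char_points, end):
    """Merge segmentation columns and pair successive ones into letter spans."""
    columns = sorted({start, end, *char_points})
    return list(zip(columns, columns[1:]))


def crop_columns(binary, left, right):
    """Cut the columns left..right (inclusive) out of a binary image."""
    return [row[left:right + 1] for row in binary]


spans = letter_spans(start=2, char_points=[9, 15], end=21)
print(spans)  # [(2, 9), (9, 15), (15, 21)]
```

Each span can then be passed to crop_columns to obtain the image region handed to the recognition phase.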

The errors can be attributed to the following reasons. First, dots located far from the main body of the word, specifically at its beginning or end, resulted in very small sub-words. To solve this problem, an additional check before tracing the contour of each sub-word would test whether its area is very small and, if so, bypass the remaining procedure of determining character segmentation points.

Another source of error is the deletion of small connected components in the sub-word based on their area percentage. Sometimes, this condition leads to the loss of characters, as illustrated in Figure 16.

Figure 16 Error while removing small components resulted in losing a letter.

Finally, a comparison between the results obtained by the proposed algorithm and previous work is shown in Figure 17.

Figure 17 Result comparison.


V. CONCLUSION AND FUTURE WORK

The algorithm has been tested on the IFN/ENIT database and achieves an accuracy of 89.4% in segmenting Arabic handwritten text into characters. Future work will focus on enhancing the results of the character segmentation stage and integrating this algorithm with an OCR system.

ACKNOWLEDGMENT

The author would like to thank the Nile Center for Technology Research for its great help and support, which reflected positively on this work.

REFERENCES

[1] A. M. Zeki. The segmentation problem in Arabic character recognition the state of the art. In First International Conference on Information and Communication Technologies, ICICT., pages 11-26, 2005.

[2] A. Lawgali, A. Bouridane, M. Angelova and Z. Ghassemolooy. Automatic segmentation for Arabic character handwritten. In Proceedings of the 18th IEEE International Conference on Image Processing, Brussels, pages 3529-3532.

[3] S. Wshah, Z. Shi and V. Govindaraju. Segmentation of Arabic handwriting based on both contour and skeleton segmentation. In Proceedings of the 10th International Conference on Document Analysis and Recognition, 2009.

[4] A. Cheung, M. Bennamoun and N.W. Bergmann, “An Arabic optical character recognition system using recognition-based segmentation” Pattern Recognition 34 215-233, 2001.

[5] A. Amin, Recognition of Arabic handprinted mathematical formulae, Arabian J. Sci. Eng. 16 (4B) (1991) 532-542.

[6] H. A. Al-Hamad and R. Abu Zitar. Development of an efficient neural-based segmentation technique for Arabic handwriting recognition. Pattern Recogn., 43(8):2773–2798, 2010.

[7] Rafael C. Gonzalez, Richard E. Woods, and Steven L. Eddins, Digital Image Processing using MATLAB, 2004.

[8] T. Sari, L. Souici, and M. Sellami. Off-line handwritten Arabic character segmentation algorithm: ACSA. In Frontiers in Handwriting Recognition, 2002. Proc. 8th Intern. Workshop on, pages 452-457, 2002.
