2013 INTERNATIONAL CONFERENCE ON COMPUTING, ELECTRICAL AND ELECTRONIC ENGINEERING (ICCEEE)
Segmentation Algorithm for Arabic Handwritten Text
based on Contour Analysis
Yusra Osman
Nile Center for Technology Research, NCTR
Khartoum, Sudan
Abstract—Segmentation is the process of dividing a binary image into useful regions according to certain conditions. It is the most important phase in any optical character recognition (OCR) system, and its accuracy significantly affects the recognition rate of that system. In cursive languages such as Arabic, segmentation is complicated, especially in handwritten documents, because writers' styles differ and because of the special cases of overlapping characters and ligatures. Hence, the design of segmentation algorithms must be based on general descriptors that most writers follow. In this paper, a segmentation algorithm for Arabic handwriting is developed. The main idea of the algorithm is to divide the selected image into lines and sub-words. Then, the contour of each sub-word is traced. After that, the algorithm detects the exact points where the contour changes its state from a horizontal line to a vertical or curved line. Finally, the coordinates of these points are taken as the segmentation points. The algorithm was tested on words from the IFN/ENIT database. Over 537 tested words containing 3222 characters, the algorithm achieved 89.4% correct character segmentation points.
Keywords- Segmentation; Arabic language features;
Handwritten text
I. INTRODUCTION
Image segmentation is one of the most important concerns in digital image processing and a long-standing problem in computer vision. It has been classified as the second core stage after recognition in any optical character recognition (OCR) system. In addition, it is the main source of errors in the recognition stage and still represents a challenge. Hence, an efficient design of the segmentation stage is needed.
The design challenge of a segmentation algorithm comes from both the language and the type of writing it processes. In terms of type of writing, handwritten documents are the most difficult. This difficulty arises from many sources. One example is the cursive style that most writers tend to use. Another is the mood of the writer: one writer may have more than one style of writing depending on his or her mood. In terms of the language of the handwritten text, Arabic is one of the most difficult languages. Moreover, Arabic still suffers from a lack of research compared to Latin and Chinese [1].
The main challenges in Arabic are that some characters may overlap with others, which makes the segmentation paths of a word ambiguous and difficult to detect, and that some characters may be stacked on top of each other in ligatures. Thus, there may be more than one character in one region of interest.
Several methods and algorithms have been presented to segment handwritten text into characters. However, most of them do not currently solve the overlapping problem. Some of them use artificial intelligence methods and others do not.
In [2], the algorithm first detects the baseline of each sub-word using horizontal projection analysis. Second, it analyzes the vertical projection of the sub-word to examine its distance from the calculated baseline. Points far from the baseline are ignored.
Another algorithm is presented in [3], which determines the intersection points by analyzing a skeleton image. Then, the algorithm analyzes the image contour to find the closest path to the extracted intersection points. The algorithm chooses the first three lower peaks of the distance map between the intersection points and the chain code as the final segmentation points.
A different technique is introduced in [4], which uses two processes to determine the final sequence of tentative segmentation paths. First, the algorithm applies a simplified version of the character segmentation algorithm found in [5], which relies on vertical projection analysis of each sub-word to determine weak points. Second, the algorithm extracts the convex dominant points by detecting the zero derivatives of the contour curvature.
The use of a neural network to validate segmentation points is presented in [6]. The validation relies on features such as segment directions.
In this paper, the proposed algorithm imitates the way a person writes Arabic script. Instead of detecting the baseline or the intersection points, the algorithm detects the points where the contour of each sub-word changes from a horizontal line to a vertical or curved line. These points represent where the writer starts to write a new letter.
The organization of this paper is as follows. Section II covers some basic background on the properties of the Arabic language. A detailed description of the proposed algorithm is provided in Section III. The results obtained with the proposed algorithm are presented in Section IV. Conclusions and future work are discussed in Section V.
978-1-4673-6232-0/13/$31.00 ©2013 IEEE
II. BACKGROUND
An in-depth understanding of the characteristics of Arabic writing is essential for implementing a segmentation system for it. This knowledge helps to assess the suitability of existing techniques and may also lead to the development of new ones. Arabic has no upper-case or lower-case letters and is written from right to left. There are 28 characters in the Arabic alphabet. Each character has 2 to 4 different forms, depending on its position in the word or sub-word; all letters are shown in Figure 1.
Figure 1 Arabic alphabet
Also, some Arabic words consist of one or more units called sub-words when they contain certain special characters such as “ز, ذ, ر, د, ا”. As an example, the word (حاسب, Haseb) consists of two sub-words: (حا, Ha) and (سب, seb). Figure 2 illustrates an example of an Arabic word that consists of four sub-words.
Figure 2 Arabic word contains two sub-words
Arabic is a very challenging language; it is a cursive language, meaning that characters are connected in a sub-word along an imaginary horizontal line (the baseline), as shown in Figure 3.
Figure 3 The concept of baseline in Arabic text
Moreover, some special characters such as “و, ر, ز” lead to overlapping between characters, as shown in Figure 4. Also, there is a special group of characters “ه, م, ح, ج, خ” that cause the sub-word to have stacked characters; this is illustrated by an example in Figure 5.
Figure 4 Some overlapping cases between characters
Figure 5 Some characters stacked together in ligatures
Text segmentation algorithms can be categorized into segmentation into primitives and segmentation into characters [4]. Segmentation into primitives splits a cursive word into primitives that are possibly smaller than characters, such as intersection points, loops, dots and ligatures. The decision to merge primitives into larger units is made later, at the recognition or classification stage. Segmentation into characters, on the other hand, tries to detect the exact character boundaries in order to segment the word.
III. PROPOSED APPROACH
The model of the proposed algorithm is illustrated in Figure 6. In the first stage of the model, the selected image is converted to a binary image and cleaned to remove any small noisy components. To reconcile the programming utility with the right-to-left writing direction of Arabic, the pre-processed image is flipped horizontally. The image is then ready to enter the segmentation phase, which consists of three levels: line, sub-word and character segmentation.
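The pre-processing stage can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the input format (a list of rows of 0..255 intensities), the assumption of dark ink on a light background, the fixed threshold of 128, and the name `preprocess` are all assumptions, and noise removal is omitted.

```python
def preprocess(gray, threshold=128):
    """Binarize a grayscale image and flip it horizontally.

    gray: list of rows, each a list of intensities in 0..255 (assumed format).
    Returns a binary image where 1 marks ink (dark pixels) and 0 background.
    """
    # Assumed fixed threshold; the paper does not state how binarization is done.
    binary = [[1 if px < threshold else 0 for px in row] for row in gray]
    # Mirror each row so right-to-left text can be scanned left to right.
    return [row[::-1] for row in binary]


sample = [
    [255, 255, 10],
    [255, 20, 30],
]
flipped = preprocess(sample)
```

Here the ink pixels on the right edge of `sample` end up on the left edge of `flipped`.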
Figure 6 The algorithm's model (stages: image acquisition and pre-processing; page segmentation into lines; line segmentation into sub-words; sub-word segmentation into characters)
In the second stage of the model, the image is segmented into lines by analyzing the horizontal projection of the image rows. This analysis detects the white spaces between image rows, because the actual text lines lie between the white spaces. Hence, the starting and ending indices of each text line between two successive white spaces are recorded as Line Start and Line End respectively. Table I contains the line segmentation procedure in detail.
TABLE I. LINES SEGMENTATION ALGORITHM
Step 0  Build up the horizontal projection of the image. Store the projection values of the rows in array x.
        Take the first element in array x.
        Go to Step 1.
Step 1  If the current element index is smaller than the max index in array x
            If the current element equals 0 and the next element is greater than 0
                Store the next element's index in the Line Start array. Go to Step 2.
            Else If the current element is greater than 0 and the next element equals 0
                Store the current element's index in the Line End array. Go to Step 2.
            Else
                Go to Step 2.
        Else
            Go to Step 3.
Step 2  Take the next element in array x. Go to Step 1.
Step 3  END
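The procedure in Table I can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the image is a list of rows of 0/1 values with 1 marking ink, the function names are hypothetical, and the zero padding at both ends is an addition (not in Table I) so that a text line touching the first or last row is also reported.

```python
def horizontal_projection(binary):
    """Number of ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in binary]


def find_runs(projection):
    """Return (starts, ends) of consecutive non-zero runs in a projection.

    Mirrors the 0 -> non-zero and non-zero -> 0 transitions of Table I.
    Padding with a zero on both sides is an added assumption so that a run
    touching the first or last index is also reported.
    """
    padded = [0] + list(projection) + [0]
    starts, ends = [], []
    for i in range(len(padded) - 1):
        current, nxt = padded[i], padded[i + 1]
        if current == 0 and nxt > 0:
            starts.append(i)        # index of nxt in the unpadded array
        elif current > 0 and nxt == 0:
            ends.append(i - 1)      # index of current in the unpadded array
    return starts, ends


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
]
line_starts, line_ends = find_runs(horizontal_projection(page))
```

For the small `page` above, the first text line spans rows 1 to 2 and the second consists of row 4 only.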
After the image is segmented into lines, every line is divided into sub-words. This time, the vertical projection of each line is analyzed. The white spaces are detected as zero values. Then, the starting and ending indices of each run of non-zero values between two white spaces are stored as Sub-word Start and Sub-word End respectively. Table II lists the steps of the procedure.
TABLE II. SUB-WORDS SEGMENTATION ALGORITHM
Step 0  Build up the vertical projection of the line image. Store the projection values of the columns in array y.
        Take the first element in array y.
        Go to Step 1.
Step 1  If the current element index is smaller than the max index in array y
            If the current element equals 0 and the next element is greater than 0
                Store the next element's index in the Sub-word Start array. Go to Step 2.
            Else If the current element is greater than 0 and the next element equals 0
                Store the current element's index in the Sub-word End array. Go to Step 2.
            Else
                Go to Step 2.
        Else
            Go to Step 3.
Step 2  Take the next element in array y. Go to Step 1.
Step 3  END
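The same idea applied to columns, as in Table II, might look like the following sketch. It assumes a text line stored as a list of rows of 0/1 values; the function names are hypothetical, and grouping with `itertools.groupby` is simply one convenient way to find the non-zero runs, not necessarily how the author implemented it.

```python
from itertools import groupby


def vertical_projection(line):
    """Number of ink pixels (value 1) in each column of a binary text line."""
    return [sum(col) for col in zip(*line)]


def subword_spans(line):
    """Return (start, end) column indices of each sub-word in a text line."""
    projection = vertical_projection(line)
    spans, col = [], 0
    # Group consecutive columns by whether they contain any ink.
    for has_ink, group in groupby(projection, key=lambda v: v > 0):
        width = len(list(group))
        if has_ink:
            spans.append((col, col + width - 1))
        col += width
    return spans


line = [
    [1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0],
]
spans = subword_spans(line)
```

Each tuple pairs a Sub-word Start index with its Sub-word End index, so the two arrays of Table II are kept together.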
Now the image is divided into lines and sub-words and is ready to be segmented into characters. At this stage, the incoming image is thinned to a width of one pixel. The purpose of this step is to reduce the number of boundary pixels that must be traced. Then, the small dots and zigzags that may be contained in the Arabic text are checked: if the area percentage of a connected component is less than 10%, that object is deleted. Hence, the dots and zigzags are removed so that the analysis concentrates on the main shape of the sub-word.
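The removal of small components can be illustrated with the sketch below. The paper does not say what the 10% is measured against, so this example assumes it is the component's share of all ink pixels in the sub-word; the use of 8-connectivity and the function names are likewise assumptions made for illustration.

```python
from collections import deque


def label_components(binary):
    """Group ink pixels (value 1) into 8-connected components.

    Returns a list of components, each a list of (row, col) pixels.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_small_components(binary, min_fraction=0.10):
    """Delete components whose share of the total ink area is below min_fraction."""
    components = label_components(binary)
    total = sum(len(p) for p in components)
    cleaned = [[0] * len(row) for row in binary]
    for pixels in components:
        # Assumed criterion: component area relative to all ink in the sub-word.
        if total and len(pixels) / total >= min_fraction:
            for y, x in pixels:
                cleaned[y][x] = 1
    return cleaned


subword = [
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # isolated dot
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # main stroke (11 pixels)
]
cleaned = remove_small_components(subword)
```

In this example the dot accounts for 1 of 12 ink pixels (about 8.3%), so it is removed while the main stroke is kept.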
Next, the remaining object is labeled, because the sub-word may contain more than one connected component. This is a typical case of overlapping characters, so the algorithm must check for this possibility too. Then, every connected component is traced, beginning from a point called the starting point. Figure 7 shows the flow chart of the algorithm.
Figure 7 The algorithm's flow chart (boxes: Start; sub-words thinning; removal of small components (dots and zigzags); starting point calculation for the remaining object; trace object's boundaries; build up the chain code of the traced boundary; decision "is there a change to a vertical or slope line?" (Yes/No); record this pixel's coordinates; End)
After that, the Freeman chain code [7] is applied to translate the traced boundary into numbers between 0 and 7 that express the direction changes. Figure 8 illustrates the chain code numbering.
Figure 8 Freeman chain code
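As an illustration of the encoding step, the sketch below converts an ordered list of traced pixels into Freeman chain codes. It assumes the common numbering in which 0 points east and the codes increase counterclockwise (2 = north, 4 = west, 6 = south); the numbering actually used in Figure 8 may differ, and the function name `chain_code` is hypothetical.

```python
# Direction table for the 8-connected Freeman chain code, assuming the common
# convention 0 = east, numbered counterclockwise (2 = north, 4 = west, 6 = south).
# Offsets are (d_row, d_col) with rows growing downward, as in image arrays.
FREEMAN = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def chain_code(path):
    """Translate an ordered list of adjacent (row, col) pixels into chain codes."""
    codes = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        step = (r1 - r0, c1 - c0)
        if step not in FREEMAN:
            raise ValueError(f"pixels {(r0, c0)} and {(r1, c1)} are not 8-neighbours")
        codes.append(FREEMAN[step])
    return codes


# Two steps west, two steps north, then one step north-east.
path = [(5, 5), (5, 4), (5, 3), (4, 3), (3, 3), (2, 4)]
codes = chain_code(path)
```

Under this assumed numbering, a horizontal stroke appears as a run of 4s (or 0s) and a vertical stroke as a run of 2s (or 6s).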
Next, the chain code is analyzed to determine the points where it changes from a steady horizontal line to a vertical or sloped line. Figure 9 shows a typical illustration of chain code tracing for the letter Lam “ل”.
Figure 9 Typical example of chain code tracing of the letter Lam “ل”
The aim of this analysis is to imitate the way a person writes Arabic script. Thus, the coordinates where the writer starts to draw a new character are detected. A typical analysis follows the chain code of the word contour as in Figure 10, where a change of the chain code from 4 to 2 indicates a vertical line.
Figure 10 The defined segmentation points in the algorithm
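The detection of these change points can be sketched as follows. This is only one possible reading of "steady horizontal line": the example treats codes 0 and 4 as horizontal (under the assumed numbering 0 = east, 4 = west) and requires a minimum run length `min_run` before a change counts; both the parameter and the function name are assumptions, not details given in the paper.

```python
HORIZONTAL = {0, 4}  # east and west under the assumed Freeman numbering


def segmentation_candidates(codes, min_run=3):
    """Indices in `codes` where a steady horizontal run turns vertical or sloped.

    min_run is an assumed parameter: the number of consecutive horizontal
    codes required before a direction change counts as leaving a steady line.
    """
    points, run = [], 0
    for i, code in enumerate(codes):
        if code in HORIZONTAL:
            run += 1
        else:
            if run >= min_run:
                points.append(i)  # first non-horizontal step after the run
            run = 0
    return points


codes = [4, 4, 4, 4, 2, 2, 2, 4, 4, 3, 4, 4, 4, 5]
candidates = segmentation_candidates(codes)
```

If `codes[i]` encodes the step from the i-th to the (i+1)-th traced pixel, then the coordinates of the i-th pixel would be the ones recorded for each index returned here. The short run of two 4s in the middle is ignored because it is shorter than `min_run`.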
By the end of this stage, all character segmentation points have been recorded, in addition to the line and sub-word segmentation points obtained in the previous stage. Finally, all these points are marked on the image of the input text. Figure 11 shows a typical process for an Arabic word.
Figure 11 A typical Arabic word processed by the algorithm.
IV. EXPERIMENTAL RESULTS
The developed algorithm was tested on 537 words taken from the IFN/ENIT database. The words were selected carefully to cover all shapes of the Arabic characters. The results show that 89.4% of the segmentation points were extracted correctly.
Figure 12 Samples from the obtained results
In Figure 12, segmented lines are colored green and blue, while sub-word starts and ends are colored cyan and magenta. The starting points from which the boundary was traced are marked with yellow triangles, while character segmentation points are marked with red circles.
Some characters were over-segmented. The letters Seen and Sheen were over-segmented because they are composed of connected vertical lines, as shown in Figure 13. The task of merging these over-segmented parts is left to the classification stage.
Figure 13 Over-segmentation problem in letters seen and sheen respectively
The letters Taa and Thaa were also over-segmented because the algorithm detects the vertical line they contain, as illustrated in Figure 14.
Figure 14 Over-segmentation problem in letters Thaa and Taa
respectively
The algorithm successfully detects the ligatures contained in Arabic words, as shown in Figure 15. This is attributed to the algorithm's design, which detects any change of the contour from a horizontal line to a vertical or curved line.
Figure 15 Successful segmentation of ligatures
Other letters show a segmentation point very close to the end of the sub-word. A possible solution is to merge this point with the sub-word end segmentation point. However, there is a special case for the letter Alif, where this point is important and merging would lead to the loss of the letter. This task is therefore left to the recognition phase, which should check whether the object located between two segmentation points near the sub-word end is the letter Alif (a vertical line) or a remainder of the previously segmented letter (a curved object).
To integrate the results obtained by this algorithm with the recognition phase of any OCR system, all the extracted segmentation points are merged together. The starting points, character segmentation points and sub-word end points form the final set of coordinates. Note that the sub-word starting points were used at the beginning to calculate the starting points. The coordinates of the starting points are more accurate because they identify each sub-word individually, even if sub-words overlap and were not recognized as separate sub-words at the sub-word segmentation stage. Finally, every letter is obtained by extracting the object that lies between two successive coordinates in the array of final segmentation points.
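The final merging and pairing step might be sketched as below. As a simplification, this example handles one sub-word at a time and reduces every segmentation point to its column index; both choices, and the function name `letter_spans`, are assumptions made for illustration rather than details from the paper.

```python
def letter_spans(start, char_points, end):
    """Column spans of the letters in one sub-word.

    start, end: columns of the tracing start point and the sub-word end point.
    char_points: columns of the character segmentation points in between.
    """
    # Merge all points, drop duplicates, and order them along the writing axis.
    cuts = sorted({start, end, *char_points})
    # Each pair of successive cut columns delimits one candidate letter.
    return list(zip(cuts, cuts[1:]))


spans = letter_spans(0, [25, 12], 38)
```

Each resulting pair bounds the image region that would be passed to the recognition stage as one letter.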
The errors can be attributed to the following reasons. The first is the case of dots located far from the main body of the word, specifically at its beginning or end, which results in very small sub-words. To solve this problem, an additional check could be made before tracing the contour of each sub-word: if its area is very small, the remaining procedure for determining character segmentation points is bypassed.
Another source of error is the deletion of small connected components in the sub-word based on their area percentage. Sometimes this condition leads to the loss of characters, as illustrated in Figure 16.
Figure 16 Error while removing small components resulted in losing a letter.
Finally, Figure 17 presents a comparison between the results of the proposed algorithm and previous work.
Figure 17 Result comparison.
V. CONCLUSION AND FUTURE WORK
The algorithm has been tested on the IFN/ENIT database and shows an accuracy of 89.4% in segmenting Arabic handwritten text into characters. Future work will focus on improving the results of the character segmentation stage and on integrating this algorithm with an OCR system.
ACKNOWLEDGMENT
The author would like to thank the Nile Center for Technology Research for its great help and support, which reflected positively on this work.
REFERENCES
[1] A. M. Zeki. The segmentation problem in Arabic character recognition: the state of the art. In First International Conference on Information and Communication Technologies (ICICT), pages 11-26, 2005.
[2] A. Lawgali, A. Bouridane, M. Angelova and Z. Ghassemolooy. Automatic segmentation for Arabic character handwritten. In Proceedings of the 18th IEEE International Conference on Image Processing, Brussels, pages 3529-3532.
[3] S. Wshah, Z. Shi and V. Govindaraju. Segmentation of Arabic handwriting based on both contour and skeleton segmentation. In Proceedings of the 10th International Conference on Document Analysis and Recognition, 2009.
[4] A. Cheung, M. Bennamoun and N. W. Bergmann. An Arabic optical character recognition system using recognition-based segmentation. Pattern Recognition, 34:215-233, 2001.
[5] A. Amin. Recognition of Arabic handprinted mathematical formulae. Arabian J. Sci. Eng., 16(4B):532-542, 1991.
[6] H. A. Al-Hamad and R. Abu Zitar. Development of an efficient neural-based segmentation technique for Arabic handwriting recognition. Pattern Recogn., 43(8):2773-2798, 2010.
[7] R. C. Gonzalez, R. E. Woods and S. L. Eddins. Digital Image Processing Using MATLAB, 2004.
[8] T. Sari, L. Souici and M. Sellami. Off-line handwritten Arabic character segmentation algorithm: ACSA. In Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, pages 452-457, 2002.