Research Report on the Nepali OCR Prajwal Rupakheti ... Phase 2/CCs/Nepal/MPP/Papers/200… · thereby present the logic, architecture and results of the Nepali ... note, we can see

Madan Puraskar Pustakalaya PANL10n/Admin/ReportsNepali OCR 30092009

Research Report on the Nepali OCR

Prajwal Rupakheti [email protected]

Bal Krishna Bal [email protected] .np

September, 2009

Research Report on the Nepali OCR 1

mailto:[email protected]




AbstractIn this report we present the results of training 202 Nepali Characters via Teserract 2.04. We thereby present the logic, architecture and results of the Nepali Fragmenter module, whose output is readily accepted by Tesseract for the recognition process.

INTRODUCTION AND BACKGROUND

We come across with various text scripts around the world. Some are written from top to bottom like Chinese. Some are written from left to right like Urdu. Languages like Nepali, Bengali, Hindi , English, Spanish, French, Russian etc. follow their script to be written from left to right. However as compared to English, Spanish, French, Russian etc., Nepali, Bengali and Hindi have a common feature of containing straight line on top of words of the language. Characters in a single word have a connected headline, where as different words in a sentence have disconnected headline.

For e.g.मेरो कलम हरायो ( English translation : My pen is lost) We can see that every single word in the sentence has connected headline on its top. In this note, we can see uniformity in connecting component of different characters in a single word. Although the connecting component make it somewhat complicated as compared to OCR process like in English, it is relatively easier as compared to languages like Urdu, where characters in word are connected but in a sporadic fashion.

Nepali Character Set:

Nepali has 36 consonant characters and 12 vowel characters. Every single vowel can modify a single consonant character. In this regard there can be total of 436 characters. Besides there are 10 numeric characters from zero to nine which make it altogether 446 characters.

Why Tesseract?

Tesseract can handle left to right languages. However Tesseract is unlikely to handle connected scripts. Since connected component in Nepali script(Headline) can be disconnected at desired place(by use of fragmenter tool being developed) to be accepted by Tesseract, we can surely opt for Tesseract. Besides Tesseract is free and open source and has been authenticated by so many users.



Tesseract Vs HTK

In the earlier phase, we had been pursuing with HTK based Nepali OCR. The recognition rate of HTK is almost close to that of Tesseract. However, there is an extra overhead of processing optical character in HTK, which is somewhat cumbersome and have been creating regular crash while running OCR.

While switching to Tesseract, we have very few overhead of preprocessing. In the meantime, this is able to process image of bigger size containing so many lines of text. However since Tesseract yields higher accuracy in disconnected text, we are left with only a small overhead of fragmenting different characters in a single words which are connected by a single headline.

Briefing on Tools Developed For Training Tesseract

In order to simplify our training process via Tesseract for the time being, we have developed our own shell script that embeds the functionality of Tesseract. All the Nepali images that are to be trained are put in a separate folder along with our script file trainer.sh. Now the training process follows the following 5 steps.

1) To make Box file:./trainer.sh makebox

2) Edit the box file a per the character present in the images

3) To make .tr file./trainer.sh train

4) To cluster:./trainer.sh mftraining./trainer.sh cntraining

5) To compute the character set:./trainer.sh unicharset_extractor

6) To copy the data files to tesseract directory./trainer.sh finalize



Results of Training with Tesseract

We have altogether trained 202 Nepali characters so far. This includes all the consonants, vowels and numeral characters. Besides, 144 derived characters are also trained.

S No

Type of characters Total number Recognized characters

1 Consonant (figure 1) 36 332 Vowels (Figure 2, first line) 12 123 Derived characters ( Consonant modified by

vowels) (figure 3 and figure)144 141

4 Numbers (Figure 2, second line) 10 10Total 202 196Accuracy Percentage 97%


Figure 1: Nepali Consonant Characters

Figure 2: Vowels (first line) and numeric(second line) Nepali Characters


The FragmentationFrom the above discussion, we have already divulged the need for accurate fragmentation of Nepali optical characters from the scanned printed text document. In such documents, we have generally the following hierarchy.

1. A scanned document has multiple pages of text.2. Every single page has multiple lines of text. 3. Every line has physically disconnected words.4. Every single word has one or more characters which are connected with header line at

the top.

Considering the above in mind, we have made a similar hierarchy of modules in our implementation. Distinct modules for lines fragmentation, words fragmentation and character fragmentation have been implemented.


Figure 3: Derived Nepali Characters

Figure 4: (More) Derived Nepali Characters


Brief Description of the Fragmentation Technique

The fragmenter demands a noise free and purely aligned horizontal nepali texts. If it is not there we have to make our image comply such requirements

Lines Fragmentation: A line in a single page of document has a starting row and ending row boundary. We identify the starting and ending row boundary with the following steps. 1. Create a horizontal histogram of binarized page image i.e. count the number of zeros in

every single row of binarized page image and store in a one dimensional array H. 2. We mark the starting row boundary of particular line in the page document if after series

of continuous zeros in H we encounter a non Zero value in H. 3. We mark the ending row boundary of the line in step 2 if after series of continuous non

zeros in H we encounter Zero value in H. 4. Go back to step 2 until the whole H is not traversed.

Words Fragmentation: Once the lines are fragmented different words in the line has to be fragmented. In this case we identify the starting and ending column boundary for words in the line with the following steps.1. Create a vertical histogram of binarized pixel matrix of the line whose words are to be

fragmented i.e. count the number of zeros in every single column and store it in one dimensional array V.

2. We mark the starting column boundary of a particular word in the line fragment if after series of continuous zeros in V we encounter a nonzero value in V.

3. We mark the ending column boundary of the word in step 2 if after series of continuous non zeros values in V we encounter a zero value in V.

4. Go back to step 2 until the whole V is not traversed.

Header Line Width identification: After Word fragmentation is done, the starting row boundary of header line and its width is determined by following the steps below.1. Create a vertical histogram of binarized pixel matrix of the word whose characters are to

be fragmented i.e. count the number of zeros in every single column and store it in one dimensional array V.

2. Find the maximum value m in the histogram V.3. Start traversing V from beginning and continue until V yields value that is equal to m.4. We mark the starting row index s of header line in a particular word if V[s] is equal to m

and keep traversing V.5. We mark the ending row index e of header line in a particular word If V[e] is less than m.6. The headline width is es.



Character Fragmentation: After identifying the width of the header line w in a fragmented word, we proceed for identifying the starting and ending column boundary of all characters in the word with following steps. 1. Create a vertical histogram of binarized pixel matrix of the words whose characterss are

to be fragmented i.e. count the number of zeros in every single column and store it in one dimensional array V.

2. We mark the the starting column boundary of a particular character in the word fragment if after series of continuous w in V we encounter a value greater than m in V.

3. We mark the ending column boundary of the character in step 2 if after series of continuous values greater than w in V we encounter a value equal to m in V.

4. Go back to step 2 until whole V is not traversed.

Different Components of the Fragmenter module

1. Image Reader: Reads the tiff image, binarizes it and store it in a matrix. 2. Image Writer: Creates a binary tiff image out of binary pixel matrix. 3. PageImage: This is module that contains information about the pixel data of the single

page of the scanned document. This module is able to fragment the lines of text present in it and store it in a linked list.

4. LineImage: This is the module that contains information about the pixel data of the single line of text in a particular page of scanned document. This is able to fragments all words present in it and store all those word in a linked list.

5. WordImage: This is the module that contains information about pixel data of the word present in a single line of text. This module evaluates mark the header line boundary and finds it width. It also fragments the character present in it and store it in a linked list.

6. CharacterImage: This module contains the pixel information about a single character present in particular word. This module is able to erase the pixels towards its column boundary from the header line.

Figure 5 below depicts the architecture of the Fragmenter.



+fragm ent()p ixe lsM atr ix

Pag eImag e

+fragm ent()

p ixe lsM atr ixrow Star tIndexrow EndIndex

L in eImag e

+fragm ent()

p ixe lsM atr ixrow StartIndexrow EndIndexcolStar tIndexcolEndIndexheaderLineW idth

W o rd Imag e

+fragm ent()

p ixe lsM atr ixrow StartIndexrow EndIndexcolStar tIndexcolEndIndexheaderLineW idth

C h aracterImag e

fragm ents the lines and stores in list L of L ineIm age

for a ll l in L{ l.fragm ent();}

*

*

1

*1

*1

*

fragm ents the w ords and stores in list L o f W ordIm age

for a ll l in L{ l.fragm ent();}

fragm ents the characters and stores in list L of C haracter Im age

for a ll l in L{ l.fragm ent( );}

Erases the headerline p ixe ls at the boundary

Imag eR ead er Imag eW riter

+fragm ent()

F rag men ter

1

*

1

*

1

*

calls the fragm ent( )of PageIm age object.

Figure 5: Architecture of the Fragmenter

API used for Fragmentation

As Tesseract reads and recognize image only of tiff format, we need to develop the fragmenter especially for tiff images. Since Java asprise API provides efficient and easy to use classes for reading writing and binarizing tiff image, we opt to choose java asprise API for the purpose (currently a trial version of it).



Results of the Fragmenter:Figure 7 is the output after fragmenting image in Figure 6.

Figure 6: A sample input to the fragmenter

Figure 7: Output of the sample input to fragmenter in Figure 2

Using Fragmenter and Tesseract for the recognition process

Once the Tesseract is properly setup in the machine we proceed with the following steps for the recognition process

• Copy the fragmenter directory anywhere you like in your Linux machine.• Enter into fragmenter directory with the command cd fragmenter• Run the command the command

java classpath .;aspriseTIFF.jar np.org.mpp.nepalifragmenter.Main InputImageFilePath output.tif(NOTE: InputImageFilePath is the complete path of the image file to be fragmented.)

• This will create output.tif file with fragmentation which should be fed as input to tessaract with following tesseract output.tif output –l nep



(NOTE: see the corresponding editable text in output.txt file created in the fragmenter directory)

Future Work and Recommendations

The Tesseract module has a limitation on not being able to fragment joint(half) character. Besides, for reading a bulky TIFF image, we have used a trial version java asprise TIFF API which is currently injecting noise character in the image. We are in a need of searching another TIFF reader and writer API instead of java asprise TIFF API with a slight change in ImageReader and ImageWriter module discussed earlier. A proper HCI gluing fragmenter and Tesseract would need to be developed. We are yet to consider noise removal and alignment of image before feeding it to fragmenter. Our current fragmenter assumes the image to be noise free and well aligned horizontally.

Acknowledgement

The PAN L10n works have been carried out with the aid of a grant from the International Development Research Centre, Ottawa, Canada, administered through the Center for Research in Urdu Language Processing (CRLUP), National University of Computing and Emerging Sciences, Lahore, Pakistan (NUCES).

ReferencesVeena Bansal R.M.K Sinha, A Complete OCR for Printed Hindi Text in Devanagari Script,IIT, Kanpurhttp://code.google.com/p/Tesseractocr


http://code.google.com/p/Tesseract-ocr

Documents

Research Report on the Nepali OCR Prajwal Rupakheti ... Phase 2/CCs/Nepal/MPP/Papers/200… · thereby present the logic, architecture and results of the Nepali ... note, we can see