
25th June 2002, IEMCT, CDAC Pune

Non-linear Normalization to Improve Telugu OCR

Atul Negi, Chakravarthy Bhagvati, V.V. Suresh Kumar

Department of Computer and Information Sciences,

University of Hyderabad


Acknowledgements

Ministry of Information Technology, New Delhi, under the project

Resource Center for Indian Language Technology Solutions (Telugu)


Organization of Presentation

• Introduction
• Telugu Script
• Classification By Template Matching
• Complete OCR Algorithm
• Nonlinear Normalization
• Results
• Concluding Remarks
• Bibliography
• Contact Information


Introduction

• OCR Research in Indian Scripts
– Initial era pioneers: RMK Sinha, Deekshatulu, ISI Kolkata
– Maturity: mid-nineties, complete systems for Bangla and Devanagari
• Recent Status of OCR in Indian Scripts
– ICDAR 1999, Bangalore
– ILOCR Workshop, 2002, UoH
– Sadhana, Indian Acad. Sci., Feb. 2002, Special Issue


Introduction: Progress of Telugu OCR

• Structural approach (ref. 4)
– moments and size of the character used
• Neural networks (ref. 1)
– connected components, training and recognition
• Template matching (ref. 5)
– connected components, templates and linear size normalization
• Wavelet multi-resolution analysis (ref. 6)


Telugu Script

• Features of Telugu
– Basic vowel sounds (Acchulu): 16 symbols
– Simple consonants (Hallulu): 36 symbols
– Vowel sounds (Matraas): 16 symbols
– Half consonants (Voththus): 30 symbols
• Complexity of Character Recognition
– Composition of characters and syllables from the above symbols: 5000 or so in common use
• Reducing Complexity
– Identification of glyphs used in composition: about 400


Few Telugu Characters

• Achchus

• Hallus

• Maatras

• Voththus


Classification By Template Matching

• Why Template Matching?
– Feature extraction effectiveness
– Dimensionality (size 32x32)
• Fringe Distances (ref. 10)
– No need for blurring
– Distances pre-computed and stored
– Ease of matching
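
The fringe-distance idea can be illustrated with a minimal sketch. It assumes binary images stored as lists of 0/1 rows, a city-block metric, and a one-directional distance (black pixels of the input summed over the template's pre-computed map). These choices and the names `fringe_map` and `fringe_distance` are illustrative assumptions, not details taken from the slides or from ref. 10.

```python
from collections import deque

def fringe_map(image):
    """City-block distance from every pixel to the nearest black (1) pixel."""
    rows, cols = len(image), len(image[0])
    inf = rows + cols  # larger than any city-block distance inside the grid
    dist = [[inf] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if image[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    # Multi-source breadth-first search outward from all black pixels.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] > dist[r][c] + 1:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist

def fringe_distance(image, template_map):
    """Sum the template's fringe-map values under the input's black pixels."""
    return sum(template_map[r][c]
               for r, row in enumerate(image)
               for c, pixel in enumerate(row) if pixel)

# Hypothetical 3x3 glyphs, only to exercise the functions.
template = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
candidate = [[0, 0, 0], [0, 0, 0], [0, 0, 1]]
stored_map = fringe_map(template)  # computed once per template, then reused
print(fringe_distance(candidate, stored_map))
```

Because each template's map is computed once and stored, matching an input against it costs only one lookup per black pixel of the input.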


The Complete OCR Algorithm

• Read an input binary image
• Segment the image into words
• Extract the connected components from each word
• For each component
– (a) Normalize size to match stored templates
– (b) Compute fringe distance map
– (c) Compute fringe distance from all templates
– (d) Output template with smallest fringe distance
– (e) Convert template code to ISCII
• Store ISCII output in a file
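
The connected-component extraction step can be sketched as follows. The sketch assumes a binary word image stored as lists of 0/1 rows and 8-connectivity; the slides do not state which connectivity was used, and the names `connected_components` and `bounding_box` are hypothetical.

```python
from collections import deque

def connected_components(image):
    """Return a list of components, each a list of (row, col) black pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not seen[r][c]:
                # Flood-fill a new component starting from this pixel.
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):      # 8-connectivity (assumed)
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and image[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(pixels)
    return components

def bounding_box(pixels):
    """Smallest (top, left, bottom, right) box containing the component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)

# Hypothetical word image with three separate blobs.
word = [[1, 1, 0, 0, 1],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]
parts = connected_components(word)
print(len(parts), [bounding_box(p) for p in parts])
```

Each component's bounding box gives the sub-image that would then be size-normalized and matched against the stored templates.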


Nonlinear Normalization

• Need for Normalization
– Preprocessing step to equalize size, position, inclination etc. to ease recognition
– Necessary when recognition is by template matching
• Non-Linear Normalization
– All parts of the character image are not treated equally
– Hypothesis: differences between characters will be increased, therefore improving discrimination


Nonlinear Normalization Technique

• Line density equalization, analogous to histogram density equalization (ref. 13)
• Generalization: Feature Density Equalization (ref. 14)
– Projection of feature density onto the horizontal and vertical axes
– Feature projection functions H(i) and V(j) of the input, i = 1, …, I and j = 1, …, J
– New position (m, n) is computed in the normalized image of size (M, N) for each point (i, j) in the input image of size (I, J)


Nonlinear Normalization Technique

• Feature Density Equalization
– Feature projection functions $H(k)$ and $V(l)$ of the input, $k = 1, \dots, I$ and $l = 1, \dots, J$
– New position $(m, n)$ in the output of size $(M, N)$, for each point $(i, j)$ in the input image of size $(I, J)$:
– $m = \sum_{k=1}^{i} H(k) \cdot M \,/\, \sum_{k=1}^{I} H(k)$
– $n = \sum_{l=1}^{j} V(l) \cdot N \,/\, \sum_{l=1}^{J} V(l)$
• NSN by dot density
– $H(i) = \sum_{j=1}^{J} f(i, j) + \alpha_H$
– $V(j) = \sum_{i=1}^{I} f(i, j) + \alpha_V$
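
The dot-density mapping can be tried out with a minimal sketch. It assumes a binary image `f` stored as lists of 0/1 rows indexed as `f[i][j]`, forward mapping of black pixels only, rounding up to the next integer position, and a default value of 1 for the constant added to each projection. All of these, and the name `nsn_dot_density`, are illustrative assumptions; the slides do not specify them.

```python
import math
from itertools import accumulate

def nsn_dot_density(f, M, N, alpha_h=1, alpha_v=1):
    """Map each black pixel (i, j) of f to (m, n) in an M x N output image."""
    I, J = len(f), len(f[0])
    # Projections of dot density onto each axis, plus a positive constant
    # so that empty rows and columns still occupy some output space.
    H = [sum(f[i][j] for j in range(J)) + alpha_h for i in range(I)]
    V = [sum(f[i][j] for i in range(I)) + alpha_v for j in range(J)]
    cum_h, cum_v = list(accumulate(H)), list(accumulate(V))
    out = [[0] * N for _ in range(M)]
    for i in range(I):
        # m = (sum of H up to i) * M / (sum of all H), 1-based, rounded up
        m = min(M, math.ceil(cum_h[i] * M / cum_h[-1]))
        for j in range(J):
            if f[i][j]:
                n = min(N, math.ceil(cum_v[j] * N / cum_v[-1]))
                out[m - 1][n - 1] = 1
    return out

# Hypothetical 4x4 glyph: a dense top row and one isolated pixel.
glyph = [[1, 1, 1, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 0, 0, 0]]
for row in nsn_dot_density(glyph, 4, 4):
    print(row)
```

Dense rows and columns advance the cumulative sums faster, so they are allotted more of the output than sparse ones. Pure forward mapping of this kind can leave gaps in the output when it enlarges a region; a practical version would need to fill them.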


Example


Normalized Glyphs


Results

[Bar chart: Image 1 to Image 5 on the horizontal axis, counts from 0 to 500 on the vertical axis; series: Number of Glyphs, Linear Normalization, Non-Linear Normalization]

Fig. 5. Graphical representation of comparison of Linear and Non-linear Normalization


Image 1

• Misclassifications: 1 (NSN), 7 (linear normalization)

• Total Glyphs: 145 (accuracy: 99% NSN, 95.2% linear)


Image 5

• Misclassifications: 105 (NSN), 136 (linear normalization)

• Total Glyphs: 354 (accuracy: 70.3% NSN, 61.6% linear)
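
The percentages quoted for the two images follow from the glyph totals and misclassification counts. A short check, using only the numbers given above (the `accuracy` helper is a hypothetical name):

```python
def accuracy(total_glyphs, misclassified):
    """Percentage of glyphs classified correctly, to one decimal place."""
    return round(100 * (total_glyphs - misclassified) / total_glyphs, 1)

# (total glyphs, NSN misclassifications, linear misclassifications)
results = {"Image 1": (145, 1, 7), "Image 5": (354, 105, 136)}

for name, (total, nsn_errors, linear_errors) in results.items():
    print(name, accuracy(total, nsn_errors), accuracy(total, linear_errors))
```

For Image 1 the NSN figure works out to 99.3%, which appears in the results as 99%.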


Discussion

• Why should nonlinear normalization succeed despite shape distortions?

• Is this the best that we can do?

• Why not use this always?


Concluding Remarks

• Non-linear normalization appears to improve OCR accuracy (based on 1300 glyphs examined)

• More experimentation with the features is required to overcome problems like gaps

• Further testing on a variety of fonts and sizes is required to confirm the recognition improvement with more confidence


Bibliography

1. M.B. Sukhswami, P. Seetharamulu, and Arun K. Pujari, “Recognition of Telugu characters using Neural networks”, Int. J. of Neural Systems, 6(3):317, (1995).

2. R. Kasturi and S.N. Srihari (Eds.), Proc. Fifth International Conf. Document Analysis and Recognition, Bangalore, India, IEEE Computer Society Press, Los Alamitos, CA, (1999).

3. B.B. Chaudhuri, U. Garain, and M. Mitra, “On OCR of the most popular two Indian language scripts: Devanagari and Bangla”, in Visual Text Recognition and Document Processing, Ed. N. Murshed, World Scientific Press, (2000).

4. SNS Rajasekharan and B.L. Deekshatulu, “Generation and Recognition of Printed Telugu characters”, Computer Graphics and Image Processing, 6:335-360, (1977).

5. Atul Negi, Chakravarthy Bhagvati, and B. Krishna, “An OCR system for Telugu”, Proc. Sixth International Conf. Document Analysis and Recognition, Seattle, USA, IEEE Computer Society Press, Los Alamitos, CA, (2001).

6. A.K. Pujari, C.D. Naidu, and B.C. Jinaga, “An adaptive and intelligent character recognizer for Telugu scripts using multiresolution analysis and associative memory”, Proc. Canadian Conf. on AI, Calgary, Canada, May 2002, LNCS, Springer Verlag, (2002).

7. B. Krishna, “Design and implementation of a Telugu script recognition system”, Technical report, Dept. of Computer and Information Sciences, University of Hyderabad, Hyderabad, India, (2000).

8. R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison-Wesley, (1993).

9. O.D. Trier, A.K. Jain, and R. Taxt, “Feature extraction methods for character recognition-a survey”, Pattern Recognition, 29(4):641-662, (1996).

10. R.L. Brown, “The fringe distance measure: an easily calculated image distance measure with recognition results comparable to Gaussian blurring”, IEEE Trans. Systems, Man and Cybernetics, 24(1):111-116, (1994).

11. K. Wong, R. Casey, and F. Wahl, “Document analysis system”, IBM J. Research and Development, 26(6), (1982).

12. G. Nagy, S. Seth, and M. Vishwanathan, “A prototype document image analysis system for technical journals”, Computer, 25(7), (1992).

13. H. Yamada, K. Yamamoto, and T. Saito, “A nonlinear normalization method for handprinted Kanji character recognition-line density equalization”, Pattern Recognition, 23(9):1023-1029, (1990).

14. S-W. Lee and J-S. Park, “Nonlinear shape normalization methods for the recognition of large set handwritten characters”, Pattern Recognition, 27(7):895-902, (1994).

15. V.V. Suresh Kumar, “Non-linear Normalization Techniques to Improve OCR”, Technical report, Dept. of Computer and Information Sciences, University of Hyderabad, Hyderabad, India, (2002).


Contact Information

Atul Negi, Chakravarthy Bhagvati

Department of Computer and Information Sciences,

University of Hyderabad

Hyderabad 500 046, AP INDIA

Email: [email protected]

Visit http://www.uohyd.ernet.in

and http://www.Languagetechnologies.ac.in
