25th June 2002 IEMCT CDAC Pune 1

Non-linear Normalization to Improve Telugu OCR

Atul Negi, Chakravarthy Bhagvati, V.V. Suresh Kumar

Department of Computer and Information Sciences,

University of Hyderabad


Acknowledgements

Ministry of Information Technology, New Delhi, under the project Resource Center for Indian Language Technology Solutions (Telugu)


Organization of Presentation
• Introduction
• Telugu Script
• Classification By Template Matching
• Complete OCR Algorithm
• Nonlinear Normalization
• Results
• Concluding Remarks
• Bibliography
• Contact Information


Introduction
• OCR Research in Indian Scripts
– Initial era pioneers: RMK Sinha, Deekshatulu, ISI Kolkata
– Maturity: mid-nineties, complete systems
  • Bangla
  • Devanagari
• Recent Status of OCR in Indian Scripts
– ICDAR 1999, Bangalore
– ILOCR Workshop, 2002, UoH
– Sadhana, Indian Acad. Sci., Feb. 2002, Special Issue


Introduction: Progress of Telugu OCR

• Structural approach (ref. 4)
– Moments and size of the character used
• Neural networks (ref. 1)
– Connected components, training and recognition
• Template matching (ref. 5)
– Connected components, templates and linear size normalization
• Wavelet multi-resolution analysis (ref. 6)


Telugu Script
• Features of Telugu
– Basic vowel sounds (Acchulu): 16 symbols
– Simple consonants (Hallulu): 36 symbols
– Vowel sounds (Matraas): 16 symbols
– Half consonants (Voththus): 30 symbols
• Complexity of Character Recognition
– Composition of characters and syllables from the above symbols: 5000 or so in common use
• Reducing Complexity
– Identification of glyphs used in composition: about 400


Few Telugu Characters

• Achchus

• Hallus

• Maatras

• Voththus


Classification By Template Matching

• Why Template Matching?
– Feature extraction effectiveness
– Dimensionality (size 32x32)
• Fringe Distances (ref. 10)
– No need for blurring
– Distances pre-computed and stored
– Ease of matching


The Complete OCR Algorithm
• Read an input binary image
• Segment the image into words
• Extract the connected components from each word
• For each component
– (a) Normalize size to match stored templates
– (b) Compute fringe distance map
– (c) Compute fringe distance from all templates
– (d) Output template with smallest fringe distance
– (e) Convert template code to ISCII
• Store ISCII output in a file
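Steps (b) to (d) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the fringe map is taken to be the city-block distance from each pixel to the nearest black pixel, the fringe distance is taken to be a symmetric sum of those map values, and the names `fringe_map`, `fringe_distance`, `classify` and the template codes are hypothetical. The exact definition of the measure is in ref. 10.

```python
def fringe_map(img):
    """City-block distance from every pixel to the nearest black (1) pixel.

    Assumption: a two-pass distance transform stands in for the fringe map.
    """
    rows, cols = len(img), len(img[0])
    far = rows + cols  # larger than any possible in-image distance
    d = [[0 if img[r][c] else far for c in range(cols)] for r in range(rows)]
    # Forward pass: propagate distances from the top and left neighbours.
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                d[r][c] = min(d[r][c], d[r - 1][c] + 1)
            if c > 0:
                d[r][c] = min(d[r][c], d[r][c - 1] + 1)
    # Backward pass: propagate distances from the bottom and right neighbours.
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            if r < rows - 1:
                d[r][c] = min(d[r][c], d[r + 1][c] + 1)
            if c < cols - 1:
                d[r][c] = min(d[r][c], d[r][c + 1] + 1)
    return d


def fringe_distance(img_a, map_a, img_b, map_b):
    """Assumed symmetric form: black pixels of each image read the other's map."""
    rows, cols = len(img_a), len(img_a[0])
    a_on_b = sum(map_b[r][c] for r in range(rows) for c in range(cols) if img_a[r][c])
    b_on_a = sum(map_a[r][c] for r in range(rows) for c in range(cols) if img_b[r][c])
    return a_on_b + b_on_a


def classify(component, templates):
    """Return the code of the template with the smallest fringe distance.

    `templates` maps a (hypothetical) template code to (bitmap, fringe map),
    so the template maps are computed once and stored.
    """
    comp_map = fringe_map(component)
    return min(
        templates,
        key=lambda code: fringe_distance(component, comp_map, *templates[code]),
    )


# Toy 3x3 glyphs standing in for size-normalized 32x32 templates.
vertical = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
horizontal = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
templates = {
    "v": (vertical, fringe_map(vertical)),
    "h": (horizontal, fringe_map(horizontal)),
}
best = classify(vertical, templates)
```

Storing each template together with its map mirrors the "distances pre-computed and stored" point: at recognition time only the map of the incoming component has to be computed.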


Nonlinear Normalization
• Need for Normalization
– Preprocessing step to equalize size, position, inclination etc. to ease recognition
– Necessary when recognition is by template matching
• Non-Linear Normalization
– All parts of the character image are not treated equally
– Hypothesis: differences between characters will be increased, therefore improved discrimination


Nonlinear Normalization Technique
• Line density equalization, analogous to histogram density equalization (ref. 13)
• Generalization: Feature Density Equalization (ref. 14)
– Projection of feature density onto horizontal, vertical axes
– Feature projection functions H(i) and V(j) of the input, i = 1, …, I and j = 1, …, J
– New position (m, n) in the normalized image of size (M, N) is computed for each point (i, j) in the input image of size (I, J)


Nonlinear Normalization Technique
• Feature Density Equalization
– Feature projection functions H(k) and V(l) of the input, k = 1, …, I and l = 1, …, J
– New position (m, n) in the output of size (M, N), for each point (i, j) in the input image of size (I, J):
– $m = \sum_{k=1}^{i} H(k) \cdot M \,/\, \sum_{k=1}^{I} H(k)$
– $n = \sum_{l=1}^{j} V(l) \cdot N \,/\, \sum_{l=1}^{J} V(l)$
– $H(i) = \sum_{j=1}^{J} f(i, j) + \alpha_H$
– $V(j) = \sum_{i=1}^{I} f(i, j) + \alpha_V$ (NSN by dot density)
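The mapping above can be tried out with a small Python sketch. It is a hedged illustration, not the authors' code: f(i, j) is assumed to be 1 for a black pixel and 0 otherwise (dot density), the additive terms in H(i) and V(j) are taken to be small constant offsets named `alpha_h` and `alpha_v` here, the total density over all I rows (or J columns) is used as the normalizing sum, and each input pixel is painted over the whole output span it maps to so that stretching does not leave holes. All function names are hypothetical.

```python
from itertools import accumulate


def projections(f, alpha_h=1.0, alpha_v=1.0):
    """H(i) = sum_j f(i, j) + alpha_h and V(j) = sum_i f(i, j) + alpha_v."""
    H = [sum(row) + alpha_h for row in f]
    V = [sum(col) + alpha_v for col in zip(*f)]
    return H, V


def boundaries(density, out_size):
    """Output coordinate reached after each input index: cumulative density
    scaled so that the full input maps onto 0..out_size."""
    cum = list(accumulate(density))
    total = cum[-1]
    return [0] + [round(c * out_size / total) for c in cum]


def normalize(f, M, N, alpha_h=1.0, alpha_v=1.0):
    """Nonlinear normalization of binary image f to M x N by dot density."""
    H, V = projections(f, alpha_h, alpha_v)
    row_b = boundaries(H, M)
    col_b = boundaries(V, N)
    out = [[0] * N for _ in range(M)]
    for i, row in enumerate(f):
        for j, pixel in enumerate(row):
            if pixel:
                # Input pixel (i, j) covers the output block between the
                # boundaries reached before and after it.
                for m in range(row_b[i], row_b[i + 1]):
                    for n in range(col_b[j], col_b[j + 1]):
                        out[m][n] = 1
    return out


# A 4x4 corner shape: the dense first row and first column receive more
# output space than the sparse remainder.
glyph = [
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
normalized = normalize(glyph, 8, 8)
```

With `alpha_h = alpha_v = 1`, the first row has H = 5 against H = 2 for the others, so it is stretched to four of the eight output rows. Linear normalization would give every input row exactly two.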


Example


Normalized Glyphs


Results

Fig. 5. Graphical representation of comparison of Linear and Non-linear Normalization. (Chart for Image 1 to Image 5; series: Number of Glyphs, Linear Normalization, Non-Linear Normalization; vertical axis from 0 to 500.)


Image 1
Misclassifications: 1 (NSN), 7 (linear normalization)
Total Glyphs: 145 (accuracy: 99% NSN, 95.2% linear)


Image 5
Misclassifications:
• 105 (NSN)
• 136 (linear normalization)
Total Glyphs: 354 (accuracy: 70.3% NSN, 61.6% linear)
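The percentages quoted for Image 1 and Image 5 follow directly from the glyph and misclassification counts. The short Python check below recomputes them; the helper name `accuracy` is ours, and the assignment of the first percentage to NSN and the second to linear normalization is inferred from the counts.

```python
def accuracy(total_glyphs, misclassified):
    """Percentage of glyphs classified correctly."""
    return 100.0 * (total_glyphs - misclassified) / total_glyphs


# (total glyphs, NSN misclassifications, linear misclassifications) per image,
# taken from the two result slides.
results = {"Image 1": (145, 1, 7), "Image 5": (354, 105, 136)}

for name, (total, nsn_errors, linear_errors) in results.items():
    print(
        f"{name}: NSN {accuracy(total, nsn_errors):.1f}%, "
        f"linear {accuracy(total, linear_errors):.1f}%"
    )
# Image 1: NSN 99.3%, linear 95.2%
# Image 5: NSN 70.3%, linear 61.6%
```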


Discussion

• Why should nonlinear normalization succeed despite shape distortions?

• Is this the best that we can do?

• Why not use this always?


Concluding Remarks

• Non-linear normalization appears to improve OCR accuracy (based on 1300 glyphs examined)

• More experimentation with the features is required to overcome problems like gaps

• Further testing on a variety of fonts and sizes is required before the recognition improvement can be concluded with more confidence


Bibliography
1. M.B. Sukhswami, P. Seetharamulu, and Arun K. Pujari, “Recognition of Telugu characters using Neural networks”, Int. J. of Neural Systems, 6(3):317, (1995).
2. R. Kasturi and S.N. Srihari (Eds.), Proc. Fifth International Conf. Document Analysis and Recognition, Bangalore, India, IEEE Computer Society Press, Los Alamitos, CA, (1999).
3. B.B. Chaudhuri, U. Garain, and M. Mitra, “On OCR of the most popular two Indian language scripts: Devanagari and Bangla”, in Visual Text Recognition and Document Processing, Ed. N. Murshed, World Scientific Press, (2000).
4. SNS Rajasekharan and B.L. Deekshatulu, “Generation and Recognition of Printed Telugu characters”, Computer Graphics and Image Processing, 6:335-360, (1977).
5. Atul Negi, Chakravarthy Bhagvati, and B. Krishna, “An OCR system for Telugu”, Proc. Sixth International Conf. Document Analysis and Recognition, Seattle, USA, IEEE Computer Society Press, Los Alamitos, CA, (2001).
6. A.K. Pujari, C.D. Naidu, and B.C. Jinaga, “An adaptive and intelligent character recognizer for Telugu scripts using multiresolution analysis and associative memory”, Proc. Canadian Conf. on AI, Calgary, Canada, May 2002, LNCS, Springer Verlag, (2002).
7. B. Krishna, “Design and implementation of a Telugu script recognition system”, Technical report, Dept. of Computer and Information Sciences, University of Hyderabad, Hyderabad, India, (2000).
8. R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison-Wesley, (1993).
9. O.D. Trier, A.K. Jain, and R. Taxt, “Feature extraction methods for character recognition-a survey”, Pattern Recognition, 29(4):641-662, (1996).
10. R.L. Brown, “The fringe distance measure: an easily calculated image distance measure with recognition results comparable to Gaussian blurring”, IEEE Trans. System Man and Cybernetics, 24(1):111-116, (1994).
11. K. Wong, R. Casey, and F. Wahl, “Document analysis system”, IBM J. Research and Development, 26(6), (1982).
12. G. Nagy, S. Seth, and M. Vishwanathan, “A prototype document image analysis system for technical journals”, Computer, 25(7), (1992).
13. H. Yamada, K. Yamamoto, and T. Saito, “A nonlinear normalization method for handprinted Kanji character recognition-line density equalization”, Pattern Recognition, 23(9):1023-1029, (1990).
14. S-W. Lee and J-S. Park, “Nonlinear shape normalization methods for the recognition of large set handwritten characters”, Pattern Recognition, 27(7):895-902, (1994).
15. V.V. Suresh Kumar, “Non-linear Normalization Techniques to Improve OCR”, Technical report, Dept. of Computer and Information Sciences, University of Hyderabad, Hyderabad, India, (2002).


Contact Information

Atul Negi, Chakravarthy Bhagvati

Department of Computer and Information Sciences,

University of Hyderabad

Hyderabad 500 046, AP INDIA

Email: [email protected]

Visit http://www.uohyd.ernet.in

and http://www.Languagetechnologies.ac.in