Talk on the Tesseract-OCR system for Malayalam at the National Conference on Free Software
National Conference on Free Software
Nishad T R
NIT, Calicut
http://www.himili.com/ocr/
Lessons from Indic OCR Development
Overview
History and Evolution of OCR
When, Where, Why and How of OCR
Selection of an OCR Engine and other gears
Putting it all together, and why
Tesseract architectural style
Challenges in Indic OCR
Lessons learned and applied
Where is it NOW?
Apr 13, 2023 3
OCR in General
Engine
Training Data
Input Tools
Output formatting tools
Three competitors
Ocrad: Ocrad is the GNU OCR program. It was written by Antonio Diaz Diaz and is licensed under the GPL.
GOCR: GOCR is an OCR program written by Joerg Schulenburg and others. It is licensed under the GPL.
Tesseract: Originally developed at Hewlett-Packard, Tesseract was released as open source in 2005; Google has sponsored its development since 2006.
And how they performed
Again how they performed
And the winner is ….
Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test. The only caveat is that one absolutely must convert the input to bitonal.
Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy.
GOCR gives poor output at a slow speed.
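The bitonal caveat can be illustrated with a minimal sketch, assuming a fixed global threshold on 8-bit grayscale pixels. The function name `to_bitonal` and the threshold of 128 are assumptions made for illustration and do not come from the talk; real scans would normally be converted with an imaging tool before being handed to Tesseract.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels to pure black (0) or pure white (255).

    gray_rows is a list of rows, each a list of ints in 0..255.
    The fixed threshold is an assumption; scanned pages with uneven
    lighting often need an adaptive method instead.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray_rows]


# A tiny 2x3 "image": dark ink on a light background.
sample = [
    [250, 30, 240],
    [120, 128, 200],
]
bitonal = to_bitonal(sample)
```

Every pixel in `bitonal` is either 0 or 255, which is the two-level input the comparison says Tesseract requires.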
Development Process Evolution
Fostering contributions: developer focus and avoiding starvation; code, code review, documentation, support
Recognizing ego: trust and good intentions; beware of maniacal focus
Limits of volunteerism: eight knives and an apple (dining developer problem); eight knives and a pumpkin; eight pumpkins and no knives
How Debayan tamed Matra
http://debayanin.googlepages.com/hackingtesseract
And how they performed
To train for another language, you have to create 8 data files in the tessdata subdirectory. Language codes follow the ISO 639-3 standard.
tessdata/xxx.freq-dawg
tessdata/xxx.word-dawg
tessdata/xxx.user-words
tessdata/xxx.inttemp
tessdata/xxx.normproto
tessdata/xxx.pffmtable
tessdata/xxx.unicharset
tessdata/xxx.DangAmbigs
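As a rough sketch of the file list above, the following Python checks which of the eight data files are still missing for a given language code. The helper names `expected_files` and `missing_files` are hypothetical, and `mal` is used as the ISO 639-3 code for Malayalam; the suffixes are taken directly from the list.

```python
import os

# The eight suffixes named on the slide, in the same order.
TESSDATA_SUFFIXES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def expected_files(lang, tessdata_dir="tessdata"):
    """Return the eight paths a trained language is expected to provide."""
    return [os.path.join(tessdata_dir, f"{lang}.{suffix}")
            for suffix in TESSDATA_SUFFIXES]


def missing_files(lang, tessdata_dir="tessdata"):
    """Return the expected paths that do not exist on disk yet."""
    return [path for path in expected_files(lang, tessdata_dir)
            if not os.path.isfile(path)]
```

Calling `missing_files("mal")` from the directory that contains `tessdata` lists whatever still has to be generated before the language can be used.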
The BOX File concept
Command
tesseract fontfile.tif fontfile batch.nochop makebox
Sample Box
അ 8 682 53 703
ആ 62 676 112 703
ഇ 121 676 155 705
ഈ 165 677 220 705
ഉ 232 677 256 704
ഊ 265 677 313 705
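The sample box entries above can be read with a short parser. This is a minimal sketch, assuming each entry is a character followed by four integers in the order left, bottom, right, top, with the origin at the bottom-left corner of the image; the names `Box` and `parse_box_lines` are hypothetical and not part of Tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One box file entry: a character and its bounding box in pixels."""
    char: str
    left: int
    bottom: int
    right: int
    top: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.top - self.bottom


def parse_box_lines(text: str) -> List[Box]:
    """Parse box file text, one entry per line, skipping malformed lines."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # blank or incomplete line
        left, bottom, right, top = (int(value) for value in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


SAMPLE = """\
അ 8 682 53 703
ആ 62 676 112 703
ഇ 121 676 155 705
ഈ 165 677 220 705
ഉ 232 677 256 704
ഊ 265 677 313 705
"""

boxes = parse_box_lines(SAMPLE)
```

With the entries parsed, a quick sanity check such as flagging boxes with a non-positive width or height can catch mistakes before the box file is used for training.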
In Kindergarten
His Teacher
JTesseract is a GUI for Tesseract that eases the training process. It is released under the Apache 2.0 license and currently works only on Windows.
Developed by Ruwan Janapriya Egoda Gamage, http://www.janapriya.net
Features: visual box file editing; project-based training process
His Classmates
nopapaper
LibTIFF
This software provides support for the Tag Image File Format (TIFF), a widely used format for storing image data. The latest version of the TIFF specification is available on-line in several different formats, as are a number of Technical Notes (TTN's).
Windows GUI
Questions?
Places to see:
Front Door: http://code.google.com/p/tesseract-ocr
jtesseract: http://code.google.com/p/jtesseract/
FreeOCR: http://www.freeocr.net
http://www.himili.com/ocr