Khmer OCR
LONG Seangmeng
Lecturer and researcher, GIC - ITC
Scientific Day, 3rd May 2012
Khmer OCR
• What is OCR?
• Khmer OCR Project
• State of the Art
• Khmer OCR System
• Project status
• Perspectives
Khmer OCR Project
• 2011
• Team
  – Dr. SENG Sopheap, ITC
  – Mr. LONG Seangmeng, ITC
  – Mr. EN Sovann (doing master)
  – Ms. PRUM Sophea (doing PhD)
  – Mr. HAO Jeudi (5th year)
• Develop a Khmer OCR system
  – Font independent
  – Size independent
State of the Art

Author | Limitation | Result
CHEY Chanoeurn, KOSIN Chamnongthai and PINIT Kumhom | 10 characters (បពជកភណឃសវទ) | 92%
CHEY Chanoeurn, KOSIN Chamnongthai and PINIT Kumhom | 20 fonts | 92.85% (size 22), 91.66% (size 18), 89.27% (size 12)
ING Leng Ieng and MUAZ Ahmed | Limon R1 font, size 22 | 98.88%
KRUY Vanna | Font and size independent (manual preparation for new fonts) | 97%
EN Sovann | Font and size independent (manual preparation for new fonts) | 96%
Khmer OCR System
Text Image → Pre processing → Segmentation → Recognition → Post processing → Editable Text
[Figure: a sample Khmer text image, its segmented blobs, and the resulting editable text]
Khmer OCR System (cont.)
• Pre processing
  – Binarization
  – Noise removal
  – Skew detection and correction
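
As an illustration of the binarization step, here is a minimal sketch in Python with NumPy. It assumes a global Otsu threshold on an 8-bit grayscale image. The slides do not say which binarization method the project uses, so the technique and the names `otsu_threshold` and `binarize` are assumptions made for this example.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level t (0..255) that maximizes between-class variance.

    `gray` is assumed to be a 2-D uint8 array (hypothetical input format).
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)                 # number of pixels with value <= t
    w1 = hist.sum() - w0                 # number of pixels with value > t
    sum0 = np.cumsum(hist * levels)      # intensity sum of pixels <= t
    mu0 = sum0 / np.maximum(w0, 1)       # mean of the dark class
    mu1 = (sum0[-1] - sum0) / np.maximum(w1, 1)  # mean of the bright class
    between = w0 * w1 * (mu0 - mu1) ** 2
    return int(np.argmax(between))

def binarize(gray):
    """Return a boolean mask where True marks ink (dark) pixels."""
    return gray <= otsu_threshold(gray)
```

The resulting mask can then be passed to noise removal and segmentation. A locally adaptive threshold may work better on unevenly lit scans.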
Khmer OCR System (cont.)
• Recognition
  – Training images (sample images), each with a label
  – Blob to be recognized: search the training images for the closest match
  – Result: the label of the closest match (in the example, Label: ក)
Khmer OCR System (cont.)
• Recognition (cont.)
  – How to find closest match?
  – How to represent the blob image?
• Fourier transform: any function f(t) with period T can be written as a sum of sinusoids:
  f(t) = a[0]/2 + Σ (n = 1, 2, …) ( a[n] cos(2πnt/T) + b[n] sin(2πnt/T) )
• Blob image => 2-D Fourier transform
• The blob image (B) is represented by its Fourier coefficients: B[0], B[1], B[2], …
• City block distance between two blobs B and B':
  Distance = |B[0] – B'[0]| + |B[1] – B'[1]| + |B[2] – B'[2]| + …
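
The matching scheme above can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the project's implementation: it assumes that the feature vector is the magnitudes of the lowest-frequency 2-D Fourier coefficients, and that blobs are zero-padded to a fixed canvas. The names `fourier_features`, `city_block` and `closest_label`, and the parameters `size` and `keep`, are hypothetical.

```python
import numpy as np

def fourier_features(blob, size=32, keep=8):
    """Represent a binary blob image by low-frequency Fourier magnitudes.

    The blob is zero-padded (or cropped) to size x size by fft2, and only
    the keep x keep lowest-frequency coefficients are retained (assumption).
    """
    spectrum = np.fft.fft2(blob.astype(np.float64), s=(size, size))
    return np.abs(spectrum[:keep, :keep]).ravel()

def city_block(b1, b2):
    """City block (L1) distance: |B[0]-B'[0]| + |B[1]-B'[1]| + ..."""
    return float(np.sum(np.abs(b1 - b2)))

def closest_label(blob, training):
    """Return the label of the training image closest to `blob`.

    `training` is a list of (image, label) pairs.
    """
    feats = fourier_features(blob)
    best = min(training,
               key=lambda item: city_block(feats, fourier_features(item[0])))
    return best[1]
```

In practice the training features would be computed once and cached instead of being recomputed for every query. Using magnitudes only makes the features insensitive to where the blob sits inside the padded canvas.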
Khmer OCR System (cont.)
• Post processing
  Blobs → Assembling → Reordering → Spell Checking
  [Figure: a sample Khmer phrase shown after each stage, and the candidate words considered during spell checking]
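
One kind of reordering rule can be sketched as follows. In Unicode, the Khmer vowel signs េ (U+17C1), ែ (U+17C2) and ៃ (U+17C3) are stored after the consonant cluster they belong to, although they are drawn to its left. Blobs read left to right therefore yield these vowels too early. The sketch below moves such a vowel behind the following consonant and its subscripts. The slides do not list the project's actual rules, so this rule and the function name `reorder_visual_to_logical` are assumptions for illustration.

```python
PRE_BASE_VOWELS = {"\u17c1", "\u17c2", "\u17c3"}  # vowel signs drawn left of the base
COENG = "\u17d2"  # sign that turns the next consonant into a subscript

def is_consonant(ch):
    """True for Khmer consonants (U+1780 to U+17A2)."""
    return "\u1780" <= ch <= "\u17a2"

def reorder_visual_to_logical(text):
    """Move each pre-base vowel after the consonant cluster that follows it."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in PRE_BASE_VOWELS and i + 1 < n and is_consonant(text[i + 1]):
            j = i + 2
            # Extend the cluster over any COENG + consonant (subscript) pairs.
            while j + 1 < n and text[j] == COENG and is_consonant(text[j + 1]):
                j += 2
            out.append(text[i + 1:j])  # consonant cluster first
            out.append(ch)             # then the vowel sign
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

A complete rule set would also have to handle two-part vowels and other signs. This sketch covers only the single-glyph pre-base case.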
Project status (√ = done, X = not yet done)
• Pre processing
  – Binarization and noise removal √
  – Skew detection and correction X
• Segmentation √
• Recognition
  – Feature extraction √
  – Automatic generation of training data for new fonts √
• Post processing
  – Assembling and reordering rules
    • Manual √
    • Automatic X
  – Spell checking X
• Performance evaluation X