International Atomic Energy Agency
International Nuclear Information System (INIS)
OCR at INIS
INIS Training Seminar, 7-11 October 2013, Vienna, Austria
Branko Krznarić
INIS Unit
(based on the presentation by Yves Reynaud)
Outline
• What is OCR?
• OCR Objectives
• Principles
• Techniques
• Software
What is OCR?
(source: pcmag.com)
Optical Character Recognition (OCR)
• OCR is the “conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text.” [1]
• Make digitized images of printed documents searchable.
• Font encoding issues.
OCR Objectives
We can “find the needle in the haystack”
• OCR enables basic search within an unstructured document.
• OCR adds extra value to your image.
• OCR brings your digitized collection to life.
OCR Techniques
• Pre-processing
    • De-skew
    • Despeckle
    • Binarization (optional)
    • Line removal
    • Layout analysis (zoning)
• Post-processing (dictionary)
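Two of the pre-processing steps listed above, binarization and despeckling, can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, the image is assumed to be a list of rows of 8-bit gray levels, and Otsu's method is used here as one common way to pick a threshold, not necessarily what any of the tools in this presentation use.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # pixels at or below the candidate threshold
    sum_dark = 0  # sum of their gray levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance: large when the two groups are well separated.
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """Convert a grayscale image to a bitmap where 1 = ink and 0 = paper."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


def despeckle(bitmap):
    """Remove ink pixels that have no ink among their 8 neighbours."""
    rows, cols = len(bitmap), len(bitmap[0])
    cleaned = [row[:] for row in bitmap]
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] == 1:
                neighbours = sum(
                    bitmap[rr][cc]
                    for rr in range(max(0, r - 1), min(rows, r + 2))
                    for cc in range(max(0, c - 1), min(cols, c + 2))
                ) - 1  # subtract the pixel itself
                if neighbours == 0:
                    cleaned[r][c] = 0
    return cleaned
```

In practice these steps run before layout analysis, so that zoning and recognition work on a clean black-and-white page.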
Scanned vs. Vector Image
“Do not look at the trees (letters); try to see the forest (sentences).”
F0R 488UR1N6 7H3 L0N63V17Y 0F 1NF0RM4710N, P3RH4P8 7H3 M087 1MP0R74N7 R0L3 1N 7H3 0P3R4710N 0F 4 D16174L 4RCH1V3 18 M4N461N6 7H3 1D3N717Y, 1N736R17Y 4ND QU4L17Y 0F 7H3 4RCH1V38 1783LF 48 4 7RU873D 80URC3 0F 7H3 CUL7UR4L R3C0RD.
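A reader decodes the text above from context. A dictionary-based post-processing step can imitate that in a very rough way. The sketch below is only an illustration: the look-alike table is read off this one example, the vocabulary is a tiny stand-in for a real dictionary, and the function name `correct` is hypothetical.

```python
import re

# Assumed confusion table: digits that stand in for similar-looking letters.
LOOKALIKES = str.maketrans("0134678", "OIEAGTS")

# Tiny stand-in vocabulary; a real system would load a full dictionary.
VOCABULARY = {"THE", "LONGEVITY", "OF", "INFORMATION"}


def correct(text, vocabulary=VOCABULARY):
    """Replace look-alike digits only when the result is a known word."""
    def fix(match):
        token = match.group(0)
        candidate = token.translate(LOOKALIKES)
        return candidate if candidate in vocabulary else token

    return re.sub(r"[A-Z0-9]+", fix, text)


print(correct("7H3 L0N63V17Y 0F 1NF0RM4710N, 2013"))
```

The dictionary check is what keeps a genuine number such as 2013 untouched: its translation is not a known word, so the original token is kept.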
Verdana Font
FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.
Brush Script MT (Windows Font)
FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.
PCs ≠ Humans
• OCR compares patterns and selects the closest match. It can be constrained to a specific context, but that requires customization.
• People adapt to circumstances and can read past misspellings when the context is clear.
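The "closest match" idea can be sketched with toy 3x3 glyph templates. Everything here is an assumption made for illustration: the templates, the Hamming-distance comparison, and the `allowed` parameter standing in for a context restriction. Commercial OCR engines use far richer features.

```python
# Hypothetical 3x3 templates; "1" marks an ink pixel.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def hamming(a, b):
    """Count the pixels in which two glyph bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph, allowed=None):
    """Return the template character closest to the glyph.

    `allowed` optionally restricts the candidates, which is one simple
    way of forcing the recognizer into a specific context.
    """
    candidates = {
        ch: tpl for ch, tpl in TEMPLATES.items()
        if allowed is None or ch in allowed
    }
    return min(candidates, key=lambda ch: hamming(glyph, candidates[ch]))


noisy_t = ("111", "010", "011")  # a "T" with one stray pixel
print(recognize(noisy_t))                      # closest overall template
print(recognize(noisy_t, allowed={"I", "L"}))  # closest within a restricted set
```

The second call shows the cost of a wrong restriction: the engine still returns its closest match, even though the true character is not among the candidates.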
True or false
Usually, printed text is adequately sampled if each line is at least two pixels in thickness.
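The two-pixel rule translates directly into a scanning resolution. The short sketch below does the arithmetic; the function names and the 0.2 mm stroke width are assumptions chosen for illustration, not values from the presentation.

```python
MM_PER_INCH = 25.4


def stroke_width_px(stroke_mm, dpi):
    """Width in pixels of a stroke of the given thickness at a scan resolution."""
    return stroke_mm * dpi / MM_PER_INCH


def min_dpi(stroke_mm, min_pixels=2):
    """Lowest resolution at which the stroke covers at least `min_pixels`."""
    return min_pixels * MM_PER_INCH / stroke_mm


# Hypothetical thin stroke of 0.2 mm:
print(round(stroke_width_px(0.2, 300), 2))  # pixels per stroke at 300 dpi
print(round(stroke_width_px(0.2, 200), 2))  # pixels per stroke at 200 dpi
print(round(min_dpi(0.2)))                  # resolution needed for two pixels
```

Under this assumption a 300 dpi scan satisfies the rule and a 200 dpi scan does not, since the stroke then covers fewer than two pixels.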
Zoom in
Zoom in
Results from OCR
It is in this context that I…
… and an additional protocol on the basis…
Chinese Raster Image (scanned)
Chinese Vector Image (OCR)
滤器
Arabic Raster Image (scanned)
Arabic Vector Image (OCR)
ا هذوشملت
Japanese Raster Image (scanned)
Japanese Vector Image (OCR)
Font Encoding
Font Encoding (cont.)
OCR Software
• Abbyy FineReader (multilingual OCR)
• Adobe Acrobat
• InftyReader
Abbyy FineReader (interface)
InftyReader - an OCR System for Math Documents
[Figure: sample page containing equations (12) and (13); recognized text: “where a […]. The indices now range from 1 to 5. The bosonic fields obey the commutation rules”]
Reference
[1] “Optical character recognition”, Wikipedia, http://en.wikipedia.org/wiki/Optical_character_recognition. Retrieved 2013-09-23.
Thank you!