OCR at INIS
International Atomic Energy Agency (IAEA), International Nuclear Information System (INIS)
INIS Training Seminar, 7-11 October 2013, Vienna, Austria
Branko Krznarić, INIS Unit (based on the presentation by Yves Reynaud)

Page 1: OCR at  INIS

IAEA
International Atomic Energy Agency

International Nuclear Information System (INIS)

OCR at INIS

INIS Training Seminar
7-11 October 2013, Vienna, Austria

Branko Krznarić
INIS Unit
(based on the presentation by Yves Reynaud)

Page 2: OCR at  INIS

Outline

• What is OCR?
• OCR Objectives
• Principles
• Techniques
• Software

Page 3: OCR at  INIS

What is OCR?

(source: pcmag.com)

Page 4: OCR at  INIS

Optical Character Recognition (OCR)

• OCR is the “conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text.” [1]

• Make digitized images of printed documents searchable.

• Font encoding issues.

Page 5: OCR at  INIS

OCR Objectives

We can “find the needle in the haystack”

• OCR enables basic search within an unstructured document.

• OCR adds extra value to your image.
• OCR brings your digitized collection to life.
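
As a rough illustration of the search objective: once OCR has produced text for each page image, a plain keyword lookup becomes possible. The sketch below is a minimal example and not the INIS workflow; the page texts and the `build_index` and `search` helpers are made up for illustration.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return index

def search(index, term):
    """Return the sorted page numbers whose OCR text contains the term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical OCR output, keyed by page number.
pages = {
    1: "Safeguards agreement between the Agency and the State",
    2: "An additional protocol on the basis of the model text",
    3: "Annual report of the Agency",
}

index = build_index(pages)
print(search(index, "agency"))    # pages that mention "agency"
print(search(index, "protocol"))  # pages that mention "protocol"
```

Without the OCR text layer, the same pages would be images only and none of these lookups would return anything.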

Page 6: OCR at  INIS

OCR Techniques

• Pre-processing
    • De-skew
    • Despeckle
    • Binarization (optional)
    • Line removal
    • Layout analysis (zoning)
• Post-processing (dictionary)
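
Three of the steps listed above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any particular OCR product works: the fixed threshold of 128, the "no ink among the eight neighbours" speckle rule, the 0.8 similarity cutoff, the sample patch and the word list are all chosen for the example.

```python
import difflib

def binarize(gray, threshold=128):
    """Turn a grayscale grid (0 = black, 255 = white) into 1 = ink, 0 = paper."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]

def despeckle(bitmap):
    """Clear ink pixels that have no ink among their eight neighbours."""
    height, width = len(bitmap), len(bitmap[0])
    cleaned = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x]:
                continue
            neighbours = sum(
                bitmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0
    return cleaned

def correct_word(word, dictionary):
    """Replace a word with its closest dictionary entry, if one is close enough."""
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=0.8)
    return matches[0] if matches else word

# Hypothetical grayscale patch: a vertical stroke plus one isolated dark pixel.
gray = [
    [250,  20, 245, 250, 250],
    [250,  25, 250, 250,  30],
    [245,  15, 250, 250, 250],
    [250,  20, 250, 250, 250],
]
bitmap = despeckle(binarize(gray))
for row in bitmap:
    print("".join("#" if pixel else "." for pixel in row))

# Post-processing: a misread final character is pulled back to a dictionary word.
print(correct_word("protoco1", ["protocol", "context", "additional"]))
```

The isolated dark pixel is removed while the stroke survives, and the dictionary step only changes a word when a sufficiently similar entry exists.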

Page 7: OCR at  INIS

Scanned vs. Vector Image

Page 8: OCR at  INIS

“Do not look at the trees (letters); try to see the forest (sentences)”

F0R 488UR1N6 7H3 L0N63V17Y 0F 1NF0RM4710N, P3RH4P8 7H3 M087 1MP0R74N7 R0L3 1N 7H3 0P3R4710N 0F 4 D16174L 4RCH1V3 18 M4N461N6 7H3 1D3N717Y, 1N736R17Y 4ND QU4L17Y 0F 7H3 4RCH1V38 1783LF 48 4 7RU873D 80URC3 0F 7H3 CUL7UR4L R3C0RD.
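
The digit-for-letter swaps in the sample above are regular, so a machine can undo them with a fixed lookup table. The sketch below assumes the seven substitutions read off this sample (0→O, 1→I, 3→E, 4→A, 6→G, 7→T, 8→S); the `undo_lookalikes` name is made up for illustration. Applied blindly, such a table would also corrupt genuine numbers, which is one reason real post-processing relies on context.

```python
# Assumed digit-to-letter substitutions, read off the sample text above.
LOOKALIKES = str.maketrans(
    {"0": "O", "1": "I", "3": "E", "4": "A", "6": "G", "7": "T", "8": "S"}
)

def undo_lookalikes(text):
    """Replace look-alike digits with the letters they stand in for."""
    return text.translate(LOOKALIKES)

sample = "7H3 M087 1MP0R74N7 R0L3"
print(undo_lookalikes(sample))  # THE MOST IMPORTANT ROLE
```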

Page 9: OCR at  INIS

Verdana Font

FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.

Page 10: OCR at  INIS

Brush Script MT (Windows Font)

FOR ASSURING THE LONGEVITY OF INFORMATION, PERHAPS THE MOST IMPORTANT ROLE IN THE OPERATION OF A DIGITAL ARCHIVE IS MANAGING THE IDENTITY, INTEGRITY AND QUALITY OF THE ARCHIVES ITSELF AS A TRUSTED SOURCE OF THE CULTURAL RECORD.

Page 11: OCR at  INIS

PCs ≠ Humans

• OCR compares patterns and selects the closest match. It can be forced to a specific context, but requires customization.

• People adapt to circumstances and can circumvent misspellings if context is clear.
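
"Compares patterns and selects the closest match" can be illustrated with a toy nearest-template classifier. This is only a sketch of the idea; commercial engines use far richer features. The 3x5 glyph bitmaps and the `distance` and `recognize` helpers are invented for the example.

```python
# Hypothetical 3x5 templates: "#" is ink, "." is paper.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def distance(glyph, template):
    """Count the pixels where the glyph and the template disagree."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def recognize(glyph):
    """Return the template character with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda char: distance(glyph, TEMPLATES[char]))

# A noisy "T": one pixel of the top bar is missing.
noisy = ["##.", ".#.", ".#.", ".#.", ".#."]
print(recognize(noisy))
```

The classifier always returns some character, even for a badly damaged glyph, which is the behaviour the slide contrasts with human reading.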

Page 12: OCR at  INIS

True or false

Usually, printed text is adequately sampled if each line is at least two pixels in thickness:
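
Whether a given scan meets this two-pixel rule of thumb can be estimated from the scan resolution and the stroke width of the print. The arithmetic below is a sketch; the 0.2 mm stroke width is an assumed value for illustration, not a figure from the presentation.

```python
MM_PER_INCH = 25.4

def stroke_pixels(stroke_mm, dpi):
    """Width of a printed stroke in pixels at a given scan resolution."""
    return stroke_mm * dpi / MM_PER_INCH

def min_dpi(stroke_mm, min_pixels=2):
    """Lowest resolution at which the stroke spans at least min_pixels."""
    return min_pixels * MM_PER_INCH / stroke_mm

stroke_mm = 0.2  # assumed stroke width of small body text
for dpi in (150, 200, 300):
    print(dpi, "dpi:", round(stroke_pixels(stroke_mm, dpi), 2), "pixels")
print("minimum resolution:", round(min_dpi(stroke_mm)), "dpi")
```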

Page 13: OCR at  INIS

Zoom in

Page 14: OCR at  INIS

Zoom in

Page 15: OCR at  INIS

Results from OCR

It is in this context that I…

… and an additional protocol on the basis…

Page 16: OCR at  INIS

Chinese Raster Image (scanned)

Page 17: OCR at  INIS

Chinese Vector Image (OCR)

滤器

Page 18: OCR at  INIS

Arabic Raster Image (scanned)

Page 19: OCR at  INIS

Arabic Vector Image (OCR)

ا هذوشملت

Page 20: OCR at  INIS

Japanese Raster Image (scanned)

Page 21: OCR at  INIS

Japanese Vector Image (OCR)

Page 22: OCR at  INIS

Font Encoding

Page 23: OCR at  INIS

Font Encoding (cont.)

Page 24: OCR at  INIS

OCR Software

• Abbyy FineReader (multilingual OCR)
• Adobe Acrobat
• InftyReader

Page 25: OCR at  INIS

Abbyy FineReader (interface)

Page 26: OCR at  INIS

InftyReader - an OCR System for Math Documents

[Sample of recognized mathematical text; equations (12) and (13) are not reproduced]

(12) where a . The indices now range from 1 to 5. The bosonic fields obey the commutation rules (13)

Page 27: OCR at  INIS

Reference

[1] “Optical character recognition”, Wikipedia. http://en.wikipedia.org/wiki/Optical_character_recognition. Retrieved 2013-09-23.

Page 28: OCR at  INIS

Thank you!