IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

OCR in libraries – some practical remarks

Günter Mühlberger

Department for Digitisation and Digital Preservation

University Innsbruck Library

OCR in Libraries

Not an easy chapter...
Is the glass half empty or half full?
Historical fonts: black letter, Gothic, Old Cyrillic, ...
Great attempts at full-text:
– JSTOR (1994)
– Google (2004)
But: still many digital libraries without integrated full-text

OCR and Digitization

OCR changes everything!
Workflow has to be adapted at all steps:
– Preparation and selection of material
– Image processing & scanning
– Quality control
– Storage and preservation
– Correction and user involvement
– Full-text search
– Web interfaces for digital libraries
Significant increase in complexity

Preparation

Which material will be taken for scanning? Options:
– Bound volumes?
– Microfilm?
– Loose folios?

Option: Bound volumes

Pros
– That’s the way books/journals/newspapers are held in the library
Cons
– Often narrow binding, especially with newspapers
– Often warping due to humidity
Remark
– Technical solution: ScanRobots make life easier and double the speed compared to manual operation, e.g. 700 – 1000 pages per hour
– Investment for ScanRobots must not be underestimated

Option: Microfilm

Pros
– If a microfilm is available, it is a cheap alternative
– Easy option (no handling of volumes)
Cons
– Microfilms have the same problems as bound volumes
– Microfilms were often produced with minimal quality control
– Microfilms from before 1990 are often not in good condition
Remark
– If the microfilm was produced with good quality, then there is no significant difference in OCR quality
– Case study with BL material will be published on the IMPACT site

Option: Loose folios

Pros
– No narrow binding, less warping
– Extremely fast performance with industry scanners – low price
– Duplicates can be sent to off-shore providers in huge packages
Cons
– Not feasible for material before 1850 – libraries would run into justification problems
– Organisational effort to organise duplicates (but completeness has to be evaluated anyway)
Remark
– By far the best option to produce high quality with the lowest resources
– Especially interesting for newspapers, 20th century material and grey literature
– Used e.g. by MOA, JSTOR

Good, bad and ugly images

Careful scanning is the be-all and end-all
– ScanRobots and document scanners lower the requirements for a good operator, but individual skill is still decisive
Criteria for a good page image are simple:
– sharp
– distinct fonts with clear curves
– clear background, no show-through from the reverse side
– no warping of the page and no geometrical distortions
– complete shot with some white frame around the text borders
– text lines parallel or at right angles to the borders
– no noise from users
If you have perfect images you can wait until OCR technology improves; with bad images you will never get good results

Bad print – broken characters

Bitonal or 8/24 bit – 300 or 400 ppi – JPEG or TIFF?

Bitonal vs. 8/24 bit
– Rose Holley, D-Lib paper, 2009: greyscale scanning does not lead to better results
– Experiment: microfilm scanned bitonal or greyscale – no difference
Simple experiments show the opposite
– Innsbrucker Zeitungsarchiv: bitonal and 24 bit
– Results are clearly better with colour
300 or 400 ppi resolution
– Very small font: Word text, 4 point font
JPEG vs. TIFF RGB
– Tests with the Treventus ScanRobot, but also with other material, show that TIFF RGB images have no advantage over compressed JPEGs
Modern documents with medium-sized fonts can be scanned at 300 ppi and bitonal, but documents with small fonts, challenging paper quality etc. should be scanned at 400 ppi and 8 or 24 bit, and can be stored as JPEGs with e.g. 90% compression rate
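
To make the 300 vs. 400 ppi question concrete for the 4 point font mentioned above, the following sketch converts a font size in points into pixels at a given scan resolution. It relies only on the definition of the typographic point (1/72 inch); the function name is illustrative, and the result is the nominal font size, so actual letter shapes are somewhat smaller.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def font_size_px(font_size_pt, ppi):
    """Nominal font size in pixels at the given scan resolution (ppi)."""
    return font_size_pt / POINTS_PER_INCH * ppi


if __name__ == "__main__":
    for ppi in (300, 400):
        print(f"4 pt at {ppi} ppi: {font_size_px(4, ppi):.1f} px")
    # 4 pt at 300 ppi: 16.7 px
    # 4 pt at 400 ppi: 22.2 px
```

Moving from 300 to 400 ppi gives each character roughly a third more pixels in height, which may matter most for very small print.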

Accuracy

Is the glass half full or half empty?
– Rose Holley: <90% word recognition is a poor result
– Google: OCR every image, so every correctly recognized word is better than nothing
– Painful errors?
– Mature users?
Character vs. word accuracy
– Word accuracy says much more and is much easier to measure: each word that would be correctly found in a full-text search can be counted as correct.
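
One possible reading of this word-accuracy definition is sketched below: every ground-truth word that also occurs in the OCR output counts as correct, regardless of position. The tokenisation (lowercasing, ignoring punctuation) and the function names are assumptions made for illustration, not the measurement procedure used in the projects mentioned here.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed normalisation: lowercase word tokens, punctuation ignored.
    return re.findall(r"\w+", text.lower())


def word_accuracy(ground_truth, ocr_text):
    """Share of ground-truth words that a full-text search would find in the OCR text."""
    truth = Counter(tokenize(ground_truth))
    ocr = Counter(tokenize(ocr_text))
    total = sum(truth.values())
    if total == 0:
        return 1.0
    # A word counts as found at most as often as it occurs in the OCR output.
    found = sum(min(count, ocr[word]) for word, count in truth.items())
    return found / total


if __name__ == "__main__":
    truth = "the glass is half full"
    ocr = "the glafs is half full"
    print(word_accuracy(truth, ocr))  # 4 of 5 words found: 0.8
```

A single wrong character costs one whole word here, which is why word accuracy is lower than character accuracy for the same page.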

Examples from real world projects

Based on: ABBYY Recognition Server 2
– Reichstagsprotokolle, 1925
– Zedler, 1744
– Coburger Zeitung, 1808
– Judentum, 1803
– Eckartshausen, 1792
– Landesbauernkammer, 1921
– Galvani, 1793
– Hieber, 1722
– Hofmann, 1875
– Buschendorf, 1805
– Schreiben, 1689
– Latin texts

Correction of OCR text

Until recently regarded as "absurd"
But:
– Crowd sourcing
– New technologies
Crowd sourcing
– Figures from the Australian Newspaper Project:
– Correction via a simple editor: line-by-line correction
– Since August 2008, 6,000 users have contributed
– 7 million lines in 318,000 articles were corrected
– At 50 characters per line, this is worth about 200,000 EUR (compared to the prices of service providers)
New technologies
– IBM: CONCERT Tool, LMU: PostCorrection Tool
– Productivity compared to simple rekeying will be increased several times over (at least 1:5)
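
The crowd-sourcing estimate above can be retraced with simple arithmetic. The sketch below uses only the figures given here (7 million lines, 50 characters per line, about 200,000 EUR); the price per 1,000 characters it prints is derived from those figures and is not a quoted service-provider rate.

```python
def total_characters(lines_corrected, chars_per_line):
    return lines_corrected * chars_per_line


def implied_price_per_1000_chars(value_eur, lines_corrected, chars_per_line):
    """Price per 1,000 characters implied by the estimated total value."""
    return value_eur / (total_characters(lines_corrected, chars_per_line) / 1000)


if __name__ == "__main__":
    chars = total_characters(7_000_000, 50)
    price = implied_price_per_1000_chars(200_000, 7_000_000, 50)
    print(f"{chars:,} characters corrected")        # 350,000,000 characters corrected
    print(f"about {price:.2f} EUR per 1,000 chars")  # about 0.57 EUR per 1,000 chars
```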

What to do with OCR results?

Structural enhancement
– INEX: competition based on OCR files
– Functional Extension Parser
Preservation
– Complexity is significantly increased
– Output: TXT, PDF, ABBYY XML
– ALTO format
– How to integrate corrective actions of users?
– Proposal for enhancing the ALTO format
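
As a rough illustration of what ALTO output offers beyond plain TXT, the sketch below reads the words and their word-confidence values (the `CONTENT` and `WC` attributes of ALTO `String` elements) from a small hand-written fragment. The sample is heavily trimmed and not schema-valid, the helper names are hypothetical, and treating a missing `WC` as full confidence is an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed ALTO fragment for illustration only (not schema-valid).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine>
        <String CONTENT="OCR" WC="0.98"/><SP/>
        <String CONTENT="in" WC="0.95"/><SP/>
        <String CONTENT="librarics" WC="0.41"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    # Strip the "{namespace}" prefix so the code works across ALTO versions.
    return tag.rsplit("}", 1)[-1]


def text_lines(xml_text):
    """Return the recognised text, one string per ALTO TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


def low_confidence_words(xml_text, threshold=0.5):
    """Words whose WC value is below the threshold (missing WC is assumed to be 1.0)."""
    root = ET.fromstring(xml_text)
    return [element.get("CONTENT", "") for element in root.iter()
            if local_name(element.tag) == "String"
            and float(element.get("WC", "1.0")) < threshold]


if __name__ == "__main__":
    print(text_lines(SAMPLE_ALTO))            # ['OCR in librarics']
    print(low_confidence_words(SAMPLE_ALTO))  # ['librarics']
```

Low-confidence words found this way could be one starting point for offering corrections to users, though how corrections are written back is exactly the open question raised above.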

Digital library applications

Full-text search
– JSTOR, Google, publishers
– Faceted search (Solr)
Indexing through search engines
– Site XML
Visibility of the OCR text
– User training (by doing)
– Necessary if correction is to be included
New research fields
– Text mining
– Linking of texts
– Near duplicates, similarity and new identifiers
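
For the faceted search with Solr mentioned above, a request could look roughly like the sketch below, which only builds the query URL using Solr's standard `q`, `rows`, `wt`, `facet` and `facet.field` parameters. The host, the core name `library` and the field names `ocr_text`, `year` and `newspaper_title` are hypothetical and would depend on the actual index schema.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that returns hits plus facet counts per field."""
    params = [
        ("q", query),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    url = facet_query_url(
        "http://localhost:8983/solr/library/select",  # hypothetical core
        "ocr_text:Zeitung",                           # hypothetical field
        ["year", "newspaper_title"],                  # hypothetical facets
    )
    print(url)
```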

Summary

OCR is a "must"
– For documents of the 19th and 20th centuries, OCR generally provides useful or even very good results
– Before 1800: improvements can be expected from IMPACT
– Careful and exact scanning is always the main prerequisite, preferably at 400 ppi and 8 or 24 bit
– Test runs with random sets
Modern applications
– Full-text search
– Visibility of the erroneous text
– Options for correcting the text by users
– Several export formats (also for end-users)
– Site XML for search engines
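
The "test runs with random sets" recommended in the summary could start from a reproducible random selection of pages, as in the sketch below. The page identifiers, the sample size and the fixed seed are illustrative assumptions; the point is only that the same sample can be drawn again when settings or OCR engines are compared.

```python
import random


def pick_test_pages(page_ids, sample_size, seed=0):
    """Draw a reproducible random sample of page identifiers for an OCR test run."""
    rng = random.Random(seed)  # fixed seed so the same set can be re-drawn later
    pages = list(page_ids)
    k = min(sample_size, len(pages))
    return sorted(rng.sample(pages, k))


if __name__ == "__main__":
    all_pages = [f"page_{n:04d}" for n in range(1, 1001)]  # hypothetical identifiers
    print(pick_test_pages(all_pages, 5))
```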

Thank you for your attention!