Internet Archive OCR Stack in 2021

Switching to Open Source Software

Merlijn B.W. Wajer ([email protected])

March 26, 2021, Internet Archive


What is OCR?

Optical Character Recognition: "reading text from images"

Why do we need OCR?

- Document analysis, exploration and accessibility

Examples:

- Linking to specific pages (analysis)

- Creation of downstream formats like plaintext, PDF, ePUB (accessibility)

- Full text search (exploration)


Scale

- 2 to 3 million pages every day


How it used to work

- Abbyy GZ files in items

- DjVu XML, PDF, ePUBs created using Abbyy GZ

- Special OCR cluster for all OCR work

Goal: replace this, but maintain quality and downstream formats

Self-imposed deadline: before start of 2021



Moving to an OSS stack

- Prior experience with Tesseract: used to deskew images in microfilm project

- Available engines: Tesseract, Calamari and ocropy

- File formats: Abbyy XML, ALTO, PAGE XML, hOCR, ...

We picked hOCR


https://github.com/UB-Mannheim/ocr-fileformat/wiki/Comparison-of-OCR-formats

Timeline

2020:

- Sept 25: First meeting to discuss alternatives

- Sept 29: Tesseract evaluation

- Oct 20: Microfilm

- Oct 22: Books

- Nov 30: CDs and LPs

- Dec 2: 'opensource' collection

- Dec 7: Switched over completely


Timeline II

2021:

- Jan 5: Support for searching inside hOCR items (jump to specific pages)

- Jan 25: word based hOCR and character based hOCR (compressed)

- Feb 4: hOCR pageindex and searchtext

- Feb 11: Use pageindex and searchtext when searching inside

- Mar 24: Support for converting Abbyy to hOCR
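The slides do not describe the layout of the pageindex and searchtext files, so the following is only a sketch of the general idea: keep one plain-text blob for searching, plus a list of per-page start offsets so that a hit can be mapped back to a page. The function names and the zero-based page numbering are assumptions made for illustration, not the archive.org format.

```python
from bisect import bisect_right


def build_search_text(pages):
    """Join per-page text with newlines and record where each page starts."""
    starts = []
    offset = 0
    for page_text in pages:
        starts.append(offset)
        offset += len(page_text) + 1  # +1 for the newline separator
    return "\n".join(pages), starts


def page_for_offset(starts, offset):
    """Return the zero-based page whose text contains the given offset."""
    return bisect_right(starts, offset) - 1


def find_pages(text, starts, query):
    """Return the page number of every occurrence of query in text."""
    hits = []
    pos = text.find(query)
    while pos != -1:
        hits.append(page_for_offset(starts, pos))
        pos = text.find(query, pos + 1)
    return hits


# Hypothetical three-page item.
pages = ["Cover page", "Linux is a UNIX clone", "Order Linux today"]
text, starts = build_search_text(pages)
hits = find_pages(text, starts, "Linux")
```

Because the offsets are sorted, the lookup is a binary search, so mapping a hit to a page stays cheap even for items with thousands of pages.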


Visual inspection: hocrjs

- https://archive.org/services/hocr-view/view?identifier=sim_general-radio-experimenter_1928-06_3_1

- https://github.com/kba/hocrjs

Visual inspection: PDF


Visual inspection: PDF searching


Searching

Magazine from March 1993 mentions Linux

First release: September 1991


Searching inside

Just $60 for a copy of the UNIX clone...


Tesseract

- Developed by HP in the 1980s, open sourced in 2005, development sponsored by Google from 2006

- Actively maintained by community

- Many languages and scripts including: Arabic, CJK, Indian, Fraktur script

- Script and orientation detection

- Version 4 has a new recognition engine (LSTM-based)

List of supported languages


https://raw.githubusercontent.com/tesseract-ocr/tesseract/master/doc/tesseract.1.asc
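As a rough illustration of how Tesseract is typically driven from Python, the sketch below builds two command lines for the `tesseract` CLI: one that writes hOCR output and one that only runs orientation and script detection. The file names are hypothetical, the `run` helper assumes a `tesseract` binary with the relevant trained data installed, and none of this is taken from the Internet Archive OCR module itself.

```python
import subprocess


def tesseract_hocr_command(image_path, output_base, lang="eng"):
    """Command that writes <output_base>.hocr for the given image.

    `lang` may combine several models with "+", e.g. "eng+deu".
    """
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


def tesseract_osd_command(image_path):
    """Command for orientation and script detection only (--psm 0)."""
    return ["tesseract", image_path, "stdout", "--psm", "0"]


def run(command):
    """Run a command and return its standard output as text."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical file names; call run(...) on these where Tesseract is installed.
hocr_cmd = tesseract_hocr_command("page_0001.png", "page_0001", lang="eng+deu")
osd_cmd = tesseract_osd_command("page_0001.png")
```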

Searching for Tesseract

(New York Times, 1904-06-25: Vol 53 Iss 16997)

Searching for Tesseract with Tesseract (Tesseract engine going through an existential crisis)


Tesseract 5

Tesseract 4 vs Tesseract 5 (20201231 alpha)

https://twitter.com/brewster_kahle/status/1364742767880990722

hOCR

- Created by Tom Breuel in 2007

- Currently maintained by Konstantin Baierer at https://kba.cloud/hocr-spec/1.2

- Supported by existing open source tooling

- Extends (X)HTML

- Supports many typesetting elements such as tables and photos, as well as detailed information per character/glyph

We developed various tools (AGPL licensed) as a Python package: https://archive.org/~merlijn/archive-hocr-tools
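To make the "extends (X)HTML" point concrete, here is a minimal sketch that reads words, bounding boxes and confidences from a small hOCR fragment using only the Python standard library. The sample document is hand-written for illustration, and `parse_words` is a hypothetical helper, not part of archive-hocr-tools.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample: one page, one line, two words.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 800 600">
   <span class="ocr_line" title="bbox 10 10 300 40">
    <span class="ocrx_word" title="bbox 10 10 90 40; x_wconf 96">Internet</span>
    <span class="ocrx_word" title="bbox 100 10 180 40; x_wconf 91">Archive</span>
   </span>
  </div>
 </body>
</html>"""

XHTML = "{http://www.w3.org/1999/xhtml}"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WCONF_RE = re.compile(r"x_wconf (\d+)")


def parse_words(hocr_text):
    """Return a list of dicts with text, bbox and confidence for each word."""
    root = ET.fromstring(hocr_text)
    words = []
    for span in root.iter(XHTML + "span"):
        if span.get("class") != "ocrx_word":
            continue  # skip line-level spans and other elements
        title = span.get("title", "")
        bbox = BBOX_RE.search(title)
        conf = WCONF_RE.search(title)
        words.append({
            "text": "".join(span.itertext()).strip(),
            "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
            "confidence": int(conf.group(1)) if conf else None,
        })
    return words


words = parse_words(SAMPLE_HOCR)
```

The layout information lives in the `class` and `title` attributes, which is why ordinary XML and HTML tooling is enough to work with hOCR files.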

OCR module

- Python

- Heuristics for script and language detection, autonomous mode

- Extensive language and script mapping

- Separate, small modules for downstream files

- Custom Tesseract Debian repo: https://archive.org/download/tesseract-deb/

Documentation and source at https://archive.org/services/docs/api/ocr.html
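A language mapping of the kind listed above could look roughly like the sketch below, which turns item metadata language codes into a Tesseract `-l` argument. The table is a tiny hypothetical excerpt, and falling back to a fixed default is a simplification made here; the actual module is described as using heuristics and detection instead.

```python
# Hypothetical excerpt: ISO 639-1 codes mapped to Tesseract model names.
TESSERACT_LANGS = {
    "en": "eng",
    "de": "deu",
    "fr": "fra",
    "nl": "nld",
    "ar": "ara",
    "ja": "jpn",
}


def tesseract_lang_argument(metadata_languages, default="eng"):
    """Build a value for Tesseract's -l option, e.g. "eng+deu".

    Unknown codes are skipped and duplicates are dropped while keeping order.
    """
    codes = []
    for lang in metadata_languages:
        code = TESSERACT_LANGS.get(lang.strip().lower())
        if code and code not in codes:
            codes.append(code)
    return "+".join(codes) if codes else default


lang_arg = tesseract_lang_argument(["en", "DE", "en"])
```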

Challenges

- XML

- Quality and quality comparison are hard

- There are a lot of languages and scripts out there

- Many edge cases in user uploaded content

- Working on PDF creation/compression in parallel


Community and Collaboration

- OCR-D project

- Tesseract developers

- Slack #ocr-g channel - for all who are interested (drop me an email)


Future work

- ePUB (coming soon)

- Image and photo detection (Tesseract supports it)

- Working on creating an open access data set for OCR training

- Way for users to submit corrections?

- OCRopus 4 (Tom Breuel)


Team and Acknowledgements

- Hank: Lots of support, documentation and advice

- Jim: DjVu XML and PDF help

- Derek: Language and script mapping, help all over the place

- Andrea, Elizabeth and Richard: OCR quality comparison

- Brewster: for letting me spearhead this effort

Special thanks to folks from the Community:

- Stefan Weil: Tesseract bug fixes and speed improvements

- Konstantin Baierer: maintaining hOCR spec and hocrjs

- Tom Breuel: advice, past work and working on OCRopus 4


Summary

- We process a lot of pages every day

- hOCR replaces Abbyy on archive.org

- Great open source tools, we added some as well

- Ability to analyse and fix bugs everywhere in the stack

- Faster "search inside"

- We now support a lot more languages and scripts

- Started on Sept 25 2020, switched over completely on Dec 7

Questions?


archive.org
