Internet Archive OCR Stack in 2021: Switching to Open Source Software
Merlijn B.W. Wajer ([email protected])
March 26, 2021, Internet Archive
What is OCR?
Optical Character Recognition: "reading text from images"
Why do we need OCR?
- Document analysis, exploration and accessibility
Examples:
- Linking to specific pages (analysis)
- Creation of downstream formats like plaintext, PDF, ePUB (accessibility)
- Full text search (exploration)
Scale
- 2-3 million pages every day
How it used to work
- Abbyy GZ files in items
- DjVu XML, PDF, ePUBs created using Abbyy GZ
- Special OCR cluster for all OCR work
Goal: replace Abbyy, but maintain quality and downstream formats
Self-imposed deadline: before start of 2021
Moving to an OSS stack
- Prior experience with Tesseract: used to deskew images in microfilm project
- Available engines: Tesseract, Calamari and ocropy
- File formats: Abbyy XML, ALTO, PAGE XML, hOCR, ...
We picked hOCR
Timeline
2020:
- Sept 25: First meeting to discuss alternatives
- Sept 29: Tesseract evaluation
- Oct 20: Microfilm
- Oct 22: Books
- Nov 30: CDs and LPs
- Dec 2: 'opensource' collection
- Dec 7: Switched over completely
Timeline II
2021:
- Jan 5: Support for searching inside hOCR items (jump to specific pages)
- Jan 25: word-based hOCR and character-based hOCR (compressed)
- Feb 4: hOCR pageindex and searchtext
- Feb 11: Use pageindex and searchtext when searching inside
- Mar 24: Support for converting Abbyy to hOCR
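How a page index and a search text can work together when searching inside an item can be sketched as follows. This is an illustration only: the actual pageindex and searchtext file formats are not described in these slides, so the function names (build_search_index, find_pages) and the in-memory layout (one concatenated string plus a list of page start offsets) are assumptions.

```python
from bisect import bisect_right


def build_search_index(pages):
    """Concatenate per-page text and record where each page starts.

    Hypothetical layout: returns (searchtext, page_starts), where
    page_starts[i] is the character offset of page i in searchtext.
    """
    parts = []
    page_starts = []
    offset = 0
    for text in pages:
        page_starts.append(offset)
        parts.append(text)
        offset += len(text) + 1  # +1 for the newline separator
    return "\n".join(parts), page_starts


def find_pages(searchtext, page_starts, query):
    """Return the sorted 0-based page numbers on which query occurs."""
    hits = set()
    pos = searchtext.find(query)
    while pos != -1:
        # The page containing pos is the last one starting at or before it.
        hits.add(bisect_right(page_starts, pos) - 1)
        pos = searchtext.find(query, pos + 1)
    return sorted(hits)


pages = ["The UNIX clone", "Linux for just $60", "No match here"]
searchtext, page_starts = build_search_index(pages)
print(find_pages(searchtext, page_starts, "Linux"))
```

In this sketch a search touches only one flat string, and the offset list turns a match position into a page to jump to without re-parsing the hOCR.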
Visual inspection: hocrjs
- https://archive.org/services/hocr-view/view?identifier=sim_general-radio-experimenter_1928-06_3_1
- https://github.com/kba/hocrjs
Visual inspection: PDF
Visual inspection: PDF searching
Searching
Magazine from March 1993 mentions Linux
First release: September 1991
Searching inside
Just $60 for a copy of the UNIX clone...
Tesseract
- Developed by HP in the 1980s, open sourced in 2005, development sponsored by Google from 2006
- Actively maintained by the community
- Many languages and scripts including: Arabic, CJK, Indian, Fraktur script
- Script and orientation detection
- Version 4 has a new (LSTM-based) recognition engine
List of supported languages
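As a rough illustration of how Tesseract can be asked for hOCR output, the sketch below builds a command line for the tesseract CLI. The helper name tesseract_hocr_command and the file names are hypothetical; the `-l` and `--psm` options and the `hocr` config name come from the tesseract CLI. This is not the Internet Archive's OCR module, only a minimal example.

```python
import shlex


def tesseract_hocr_command(image_path, output_base, languages, psm=None):
    """Build an argv list that asks the tesseract CLI for hOCR output.

    The helper and its arguments are hypothetical; only the CLI options
    (-l, --psm) and the 'hocr' config name belong to tesseract itself.
    """
    # Several languages are combined with '+', e.g. "eng+deu".
    cmd = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if psm is not None:
        # Page segmentation mode; e.g. 1 enables orientation/script detection.
        cmd += ["--psm", str(psm)]
    cmd.append("hocr")  # config file that switches the output to hOCR
    return cmd


cmd = tesseract_hocr_command("page_0001.png", "page_0001", ["eng", "deu"])
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) would run the engine, assuming tesseract and the requested language data are installed; the result is written to page_0001.hocr.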
Searching for Tesseract
(New York Times, 1904-06-25: Vol 53 Iss 16997)
Searching for Tesseract with Tesseract (Tesseract engine going through an existential crisis)
Tesseract 5
Tesseract 4 vs Tesseract 5 (20201231 alpha)
https://twitter.com/brewster_kahle/status/1364742767880990722
hOCR
- Created by Tom Breuel in 2007
- Currently maintained by Konstantin Baierer at https://kba.cloud/hocr-spec/1.2
- Supported by existing open source tooling
- Extends (X)HTML
- Supports many typesetting elements such as tables and photos, as well as detailed information per character/glyph
We developed various tools (AGPL licensed). Python package: https://archive.org/~merlijn/archive-hocr-tools
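Because hOCR extends (X)HTML, word text and bounding boxes can be read with ordinary HTML tooling. The sketch below uses only the Python standard library and does not use the archive-hocr-tools package; the class name HocrWordParser and the sample markup are made up for illustration. It assumes word-level hOCR in which `ocrx_word` spans contain no nested per-character spans.

```python
import re
from html.parser import HTMLParser

# hOCR stores properties in the title attribute, e.g. "bbox 10 20 120 50; x_wconf 96".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) for every span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else ()
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside a word (see the note above).
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 800'>
 <span class='ocr_line' title='bbox 10 20 300 50'>
  <span class='ocrx_word' title='bbox 10 20 120 50; x_wconf 96'>Internet</span>
  <span class='ocrx_word' title='bbox 130 20 300 50; x_wconf 93'>Archive</span>
 </span>
</div>
"""

parser = HocrWordParser()
parser.feed(SAMPLE)
for text, bbox in parser.words:
    print(text, bbox)
```

Character-based hOCR nests further spans inside each word, so a real tool would need to track nesting depth rather than close the word on the first end tag.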
OCR module
- Python
- Heuristics for script and language detection, autonomous mode
- Extensive language and script mapping
- Separate, small modules for downstream files
- Custom Tesseract Debian repo: https://archive.org/download/tesseract-deb/
Documentation and source at https://archive.org/services/docs/api/ocr.html
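What a language mapping might look like can be sketched as below. The table and the function tesseract_languages are hypothetical and far smaller than any real mapping; only the model names (eng, deu, fra, nld) are actual Tesseract language codes. The OCR module's real heuristics are documented at the URL above and are not reproduced here.

```python
# Hypothetical mapping from free-form metadata values to Tesseract model names.
LANGUAGE_TO_TESSERACT = {
    "english": "eng", "eng": "eng", "en": "eng",
    "german": "deu", "ger": "deu", "deu": "deu", "de": "deu",
    "french": "fra", "fre": "fra", "fra": "fra", "fr": "fra",
    "dutch": "nld", "dut": "nld", "nld": "nld", "nl": "nld",
}


def tesseract_languages(metadata_languages):
    """Map metadata language strings to a de-duplicated list of model names.

    Unknown values are returned separately so a caller could fall back to
    script detection (the fallback itself is not shown here).
    """
    known, unknown = [], []
    for raw in metadata_languages:
        code = LANGUAGE_TO_TESSERACT.get(raw.strip().lower())
        if code is None:
            unknown.append(raw)
        elif code not in known:
            known.append(code)
    return known, unknown


known, unknown = tesseract_languages(["English", "ger", "EN", "Klingon"])
print(known, unknown)
```

Keeping unmapped values separate, rather than dropping them, leaves room for a detection step to handle user-uploaded items with missing or unusual language metadata.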
Challenges
- XML
- Quality and quality comparison is hard
- There are a lot of languages and scripts out there
- Many edge cases in user uploaded content
- Working on PDF creation/compression in parallel
Community and Collaboration
- OCR-D project
- Tesseract developers
- Slack #ocr-g channel - for all who are interested (drop me an email)
Future work
- ePUB (coming soon)
- Image and photo detection (Tesseract supports it)
- Working on creating an open access data set for OCR training
- Way for users to submit corrections?
- OCRopus 4 (Tom Breuel)
Team and Acknowledgements
- Hank: Lots of support, documentation and advice
- Jim: DjVu XML and PDF help
- Derek: Language and script mapping, help all over the place
- Andrea, Elizabeth and Richard: OCR quality comparison
- Brewster: for letting me spearhead this effort
Special thanks to folks from the Community:
- Stefan Weil: Tesseract bug fixes and speed improvements
- Konstantin Baierer: maintaining hOCR spec and hocrjs
- Tom Breuel: advice, past work and working on OCRopus 4
Summary
- We process a lot of pages every day
- hOCR replaces Abbyy on archive.org
- Great open source tools, we added some as well
- Ability to analyse and fix bugs everywhere in the stack
- Faster "search inside"
- We now support a lot more languages and scripts
- Started on Sept 25, 2020, switched over completely on Dec 7
Questions?