Internet Archive OCR Stack in 2021: Switching to Open Source Software

Merlijn B.W. Wajer ([email protected])
Internet Archive
March 26, 2021

What is OCR?

Optical Character Recognition: "reading text from images"

Why do we need OCR?

• Document analysis, exploration and accessibility

Examples:

• Linking to specific pages (analysis)
• Creation of downstream formats like plaintext, PDF, ePUB (accessibility)
• Full text search (exploration)

Scale

• 2 to 3 million pages every day
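
To put that volume in perspective, the short calculation below converts it into an average sustained rate. It assumes the load is spread evenly over 24 hours, which the slides do not state; the function name is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def pages_per_second(pages_per_day: float) -> float:
    """Average throughput needed to keep up with a daily page volume."""
    return pages_per_day / SECONDS_PER_DAY


# Assumption: work is spread evenly over the whole day.
low = pages_per_second(2_000_000)
high = pages_per_second(3_000_000)
print(f"{low:.1f} to {high:.1f} pages per second")
# prints "23.1 to 34.7 pages per second"
```

Real load is likely burstier than this average, so peak capacity would need to be higher.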

How it used to work

• Abbyy GZ files in items
• DjVu XML, PDF, ePUBs created using Abbyy GZ
• Special OCR cluster for all OCR work

Goal: replace it, but maintain quality and downstream formats

Self-imposed deadline: before the start of 2021

Moving to an OSS stack

• Prior experience with Tesseract: used to deskew images in microfilm project
• Available engines: Tesseract, Calamari and ocropy
• File formats: Abbyy XML, ALTO, PAGE XML, hOCR, ...

We picked hOCR

Timeline

2020:

• Sept 25: First meeting to discuss alternatives
• Sept 29: Tesseract evaluation
• Oct 20: Microfilm
• Oct 22: Books
• Nov 30: CDs and LPs
• Dec 2: 'opensource' collection
• Dec 7: Switched over completely

Timeline II

2021:

• Jan 5: Support for searching inside hOCR items (jump to specific pages)
• Jan 25: Word based hOCR and character based hOCR (compressed)
• Feb 4: hOCR pageindex and searchtext
• Feb 11: Use pageindex and searchtext when searching inside
• Mar 24: Support for converting Abbyy to hOCR
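
The slides do not describe the format of the pageindex and searchtext files. The sketch below only illustrates the general idea of pairing a flat text file with an offset index, under two assumptions that are not from the talk: that the search text is the plain text of all pages joined by newlines, and that the index stores the character offset at which each page starts. The names `build_index` and `find_pages` are hypothetical.

```python
from bisect import bisect_right


def build_index(pages):
    """Join page texts with newlines and record where each page starts."""
    offsets = []
    position = 0
    for text in pages:
        offsets.append(position)
        position += len(text) + 1  # +1 for the newline separator
    return "\n".join(pages), offsets


def find_pages(searchtext, offsets, query):
    """Return the 0-based numbers of the pages on which query occurs."""
    hits = []
    start = searchtext.find(query)
    while start != -1:
        # The last page whose start offset is <= the match position.
        page = bisect_right(offsets, start) - 1
        if not hits or hits[-1] != page:
            hits.append(page)
        start = searchtext.find(query, start + 1)
    return hits


pages = ["A UNIX clone", "called Linux", "Linux costs $60"]
searchtext, offsets = build_index(pages)
print(find_pages(searchtext, offsets, "Linux"))  # prints [1, 2]
```

With a layout like this, a search only has to scan the flat text and do a binary search in the index, instead of parsing the full hOCR file to find the matching page.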

Visual inspection: PDF

Visual inspection: PDF searching

Searching

Magazine from March 1993 mentions Linux

First Linux release: September 1991

Searching inside

Just $60 for a copy of the UNIX clone...

Tesseract

• Developed by HP in the 1980s, later open sourced, backed by Google from 2006
• Actively maintained by the community
• Many languages and scripts, including Arabic, CJK, Indian and Fraktur script
• Script and orientation detection
• Version 4 has a new recognition engine

List of supported languages

Searching for Tesseract

(New York Times, 1904-06-25: Vol 53 Iss 16997)

Searching for Tesseract with Tesseract (Tesseract engine going through an existential crisis)

Tesseract 5

Tesseract 4 vs Tesseract 5 (20201231 alpha)

https://twitter.com/brewster_kahle/status/1364742767880990722

hOCR

• Created by Tom Breuel in 2007
• Currently maintained by Konstantin Baierer at https://kba.cloud/hocr-spec/1.2
• Supported by existing open source tooling
• Extends (X)HTML
• Supports many typesetting elements such as tables and photos, as well as detailed information per character/glyph

We developed various tools (AGPL licensed). Python package: https://archive.org/~merlijn/archive-hocr-tools
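
As a rough illustration of what "extends (X)HTML" means in practice, the sketch below reads word boxes and confidences from a tiny hand-written hOCR fragment using only the Python standard library. It does not use archive-hocr-tools. The `ocrx_word` class and the `bbox` and `x_wconf` properties come from the hOCR specification; the sample markup and the helper names `parse_title` and `words` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. Real hOCR files are full XHTML documents that
# declare the XHTML namespace, so tag names would need that namespace.
SAMPLE = """<div class="ocr_page" title="bbox 0 0 1000 1500">
  <span class="ocr_line" title="bbox 100 200 400 240">
    <span class="ocrx_word" title="bbox 100 200 220 240; x_wconf 96">Internet</span>
    <span class="ocrx_word" title="bbox 240 200 400 240; x_wconf 91">Archive</span>
  </span>
</div>"""


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for field in title.split(";"):
        tokens = field.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def words(hocr):
    """Yield (text, bbox, confidence) for every ocrx_word element."""
    root = ET.fromstring(hocr)
    for element in root.iter("span"):
        if element.get("class") == "ocrx_word":
            props = parse_title(element.get("title", ""))
            bbox = tuple(int(value) for value in props["bbox"])
            confidence = float(props["x_wconf"][0])
            yield element.text, bbox, confidence


result = list(words(SAMPLE))
for text, bbox, confidence in result:
    print(text, bbox, confidence)
```

Because the layout data lives in ordinary attributes, generic HTML and XML tooling can process hOCR without a dedicated parser.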

OCR module

• Python
• Heuristics for script and language detection, Autonomous mode
• Extensive language and script mapping
• Separate, small modules for downstream files
• Custom Tesseract debian repo: https://archive.org/download/tesseract-deb/

Documentation and source at https://archive.org/services/docs/api/ocr.html
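
The OCR module itself is not shown in the slides. As a minimal sketch of what a language mapping step could look like, the code below turns free-form metadata language names into a Tesseract command line that writes hOCR. The mapping table, the fallback choice and the function name are assumptions for illustration; only the `tesseract IMAGE OUTPUTBASE -l LANGS hocr` invocation and the model names `eng`, `nld`, `deu` and `fra` are standard Tesseract usage.

```python
# Hypothetical mapping from metadata language names to Tesseract models.
LANGUAGE_TO_MODEL = {
    "english": "eng",
    "dutch": "nld",
    "german": "deu",
    "french": "fra",
}


def tesseract_command(image, output_base, languages, fallback="eng"):
    """Build a Tesseract command line that produces output_base.hocr."""
    models = []
    for language in languages:
        model = LANGUAGE_TO_MODEL.get(language.strip().lower())
        if model and model not in models:
            models.append(model)
    if not models:
        # Assumption: fall back to one default model when nothing maps.
        models.append(fallback)
    # Tesseract accepts several models joined with "+", e.g. "eng+nld".
    return ["tesseract", image, output_base, "-l", "+".join(models), "hocr"]


command = tesseract_command("page.png", "page", ["English", "Dutch"])
print(" ".join(command))
# prints "tesseract page.png page -l eng+nld hocr"
```

The resulting list can be passed to `subprocess.run` on a machine with Tesseract and the relevant language models installed.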

Challenges

• XML
• Quality and quality comparison are hard
• There are a lot of languages and scripts out there
• Many edge cases in user uploaded content
• Working on PDF creation/compression in parallel

Community and Collaboration

• OCR-D project
• Tesseract developers
• Slack #ocr-g channel, for all who are interested (drop me an email)

Future work

• ePUB (coming soon)
• Image and photo detection (Tesseract supports it)
• Working on creating an open access data set for OCR training
• Way for users to submit corrections?
• OCRopus 4 (Tom Breuel)

Team and Acknowledgements

• Hank: Lots of support, documentation and advice
• Jim: DjVu XML and PDF help
• Derek: Language and script mapping, help all over the place
• Andrea, Elizabeth and Richard: OCR quality comparison
• Brewster: for letting me spearhead this effort

Special thanks to folks from the community:

• Stefan Weil: Tesseract bug fixes and speed improvements
• Konstantin Baierer: maintaining hOCR spec and hocrjs
• Tom Breuel: advice, past work and working on OCRopus 4

Summary

• We process a lot of pages every day
• hOCR replaces Abbyy on archive.org
• Great open source tools, we added some as well
• Ability to analyse and fix bugs everywhere in the stack
• Faster "search inside"
• We now support a lot more languages and scripts
• Started on Sept 25, 2020, switched over completely on Dec 7

Questions?
