
Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confusion

Presentation of the paper Correcting Noisy OCR: Context Beats Confusion by John Evershed and Kent Fitch at DATeCH 2014. #digidays

Page 1: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confusion

Automatic OCR correction http://overproof.projectcomputing.com

Correcting noisy OCR

- Context beats Confusion

[ presentation viewable at http://goo.gl/n85gR6 ]

Page 2

who are we?

● Australian software company

● developers John and Kent

● we put theory into practice

Page 3

main target - newspapers

● the first draft of history

● popular if made available

● usually poorly digitized

● too extensive for full human correction

Page 4

goals

● run on commodity cloud server

● optimal for noisy text

● at least 1000 words/sec

● correct at least 50% of errors

Page 5

division of labour

[diagram: MANAGER, TRIAGE and CORE components, with labels "bad", "good" and "models"]

Page 6

snippets for the core

● prefer triaged good words at start/end

● column aware

● some easy corrections applied

● some suggestions supplied

● bag of topic words available

● surrounding noise level indicated

Page 7

error contexts

● spell: vowals or consonnants

● type: you jit teh wrng key

● OCR: roprcroiitativcs cf thc Coveriuient

● random: anygh<eg 0at7happen

Page 8

confusion cost matrix

93: w ← w

155: e ← e

3750: c ← e

4451: m ← rn

6652: rn ← m

11065: E ← m
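One way a table like this could be produced is to count aligned (truth, OCR) character units in ground-truth data and turn the relative frequencies into scaled negative log probabilities. The sketch below is an assumption, not the authors' stated method: the function name `confusion_costs`, the scale factor of 1000, and the reading of `c ← e` as "OCR printed c where the truth was e" are all hypothetical.

```python
import math
from collections import Counter

def confusion_costs(aligned_pairs, scale=1000):
    """Map (ocr_unit, truth_unit) to an integer cost.

    aligned_pairs: iterable of (truth_unit, ocr_unit) tuples taken from
    hand-corrected text aligned against its OCR output (hypothetical input).
    Cost is -scale * ln(P(ocr_unit | truth_unit)), so frequent outcomes are cheap.
    """
    pair_counts = Counter(aligned_pairs)
    truth_totals = Counter()
    for (truth, _ocr), n in pair_counts.items():
        truth_totals[truth] += n
    return {
        (ocr, truth): round(-scale * math.log(n / truth_totals[truth]))
        for (truth, ocr), n in pair_counts.items()
    }

# Toy data: a true "e" was read correctly 6 times and misread as "c" twice.
pairs = [("e", "e")] * 6 + [("e", "c")] * 2
costs = confusion_costs(pairs)
print(costs[("e", "e")])  # 288
print(costs[("c", "e")])  # 1386
```

Under this reading, identity entries such as `w ← w` stay cheap but not free, because even a correctly printed character carries some probability of having been misread.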

Page 9

word cost (eg rnorniny|morning)

language cost

● lexicon frequency

● entity list

● rare word list

● character 4-gram

error cost

● edit sum

● visual correlation

● generator hint
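As a rough illustration of how a language cost and an error cost might be combined into one word cost, the sketch below adds a lexicon-frequency cost to a supplied error cost and ranks candidates. It covers only two of the listed ingredients (lexicon frequency and an edit sum). The frequencies, the error costs, the scale of 1000 and the simple addition are all assumptions made for the example.

```python
import math

# Toy relative frequencies; real values would come from a corpus-derived lexicon.
LEXICON_FREQ = {"morning": 2.0e-4, "mourning": 1.0e-5}

def language_cost(word, scale=1000):
    """Rarer words cost more: -scale * ln(relative frequency)."""
    return round(-scale * math.log(LEXICON_FREQ[word]))

def word_cost(candidate, error_cost):
    """Total cost of a candidate = language cost + cost of the implied OCR errors."""
    return language_cost(candidate) + error_cost

def rank(error_costs):
    """Return (cost, word) pairs, cheapest first."""
    return sorted((word_cost(w, e), w) for w, e in error_costs.items())

# Hypothetical error costs for correcting "rnorniny" to each candidate.
ranked = rank({"morning": 19000, "mourning": 31000})
print(ranked[0])  # (27517, 'morning')
```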

Page 10

word character confusion

m o r n i n g

r n o r n i n y
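The alignment above can be scored with a weighted edit distance that allows multi-character units, so that "rn" can map to "m" in one step. The following is a minimal sketch under stated assumptions: the three confusion costs are the slide's values keyed as (OCR unit, true unit), which is one possible reading of the arrows, while `SAME_COST` and `DEFAULT_COST` are invented placeholders for every pair the slide does not list.

```python
import math

# Slide values, keyed here as (ocr_unit, truth_unit); the direction is an assumption.
CONFUSION = {("rn", "m"): 6652, ("m", "rn"): 4451, ("c", "e"): 3750}
SAME_COST = 100       # assumed cost of a character read correctly
DEFAULT_COST = 12000  # assumed cost of any unlisted substitution, insertion or deletion

def unit_cost(ocr_unit, truth_unit):
    if ocr_unit == truth_unit and len(ocr_unit) == 1:
        return SAME_COST
    if (ocr_unit, truth_unit) in CONFUSION:
        return CONFUSION[(ocr_unit, truth_unit)]
    if len(ocr_unit) <= 1 and len(truth_unit) <= 1:
        return DEFAULT_COST
    return math.inf  # unlisted multi-character units are not allowed

def edit_cost(ocr, truth, max_unit=2):
    """Cheapest way to explain `ocr` as a misreading of `truth`."""
    rows, cols = len(ocr) + 1, len(truth) + 1
    best = [[math.inf] * cols for _ in range(rows)]
    best[0][0] = 0
    for i in range(rows):
        for j in range(cols):
            if best[i][j] == math.inf:
                continue
            # Consume a units of OCR text and b units of truth in one step.
            for a in range(max_unit + 1):
                for b in range(max_unit + 1):
                    if a == b == 0 or i + a > len(ocr) or j + b > len(truth):
                        continue
                    step = unit_cost(ocr[i:i + a], truth[j:j + b])
                    if best[i][j] + step < best[i + a][j + b]:
                        best[i + a][j + b] = best[i][j] + step
    return best[-1][-1]

print(edit_cost("rnorniny", "morning"))  # 19152 = 6652 + 5 * 100 + 12000
```

With these placeholder costs the cheapest alignment pairs "rn" with "m", matches "o r n i n" one to one, and pays the default cost for "y" against "g".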

Page 11

visual correlation

Page 12

suggestion methods

● gift

● common, cached

● language

● entities

● split/join

● generated (magic)

Page 13

searching for gold (A*)

[diagram: character search tree]

purple nodes: working priority queue

red nodes: output priority queue
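A best-first search of this kind can be sketched with a single working heap. This is not the authors' implementation: it is A* with a zero heuristic (that is, uniform-cost search), it prunes any prefix that cannot start a lexicon word, and it collects outputs in a plain list because results already leave the working queue in cost order. The lexicon, `KEEP_COST`, and the `y` to `g` cost are hypothetical; only the 6652 for "rn" read where "m" was meant comes from the confusion slide.

```python
import heapq

LEXICON = {"morning", "mourning", "morn"}  # toy lexicon
PREFIXES = {w[:k] for w in LEXICON for k in range(len(w) + 1)}
# OCR unit -> [(intended unit, cost)]; the "y" entry is an invented example cost.
ALTERNATIVES = {"rn": [("m", 6652)], "y": [("g", 9000)]}
KEEP_COST = 100  # assumed cost of accepting an OCR character as printed

def suggest(ocr, limit=3):
    """Return up to `limit` (cost, word) suggestions, cheapest first."""
    work = [(0, 0, "")]  # working priority queue of (cost, position, prefix)
    out = []
    while work and len(out) < limit:
        cost, pos, prefix = heapq.heappop(work)
        if pos == len(ocr):
            if prefix in LEXICON:
                out.append((cost, prefix))
            continue
        options = [(ocr[pos], KEEP_COST, 1)]
        for size in (1, 2):
            unit = ocr[pos:pos + size]
            if len(unit) == size:
                options += [(rep, c, size) for rep, c in ALTERNATIVES.get(unit, [])]
        for rep, c, size in options:
            if prefix + rep in PREFIXES:  # prune dead-end prefixes early
                heapq.heappush(work, (cost + c, pos + size, prefix + rep))
    return out

print(suggest("rnorniny"))  # [(16152, 'morning')]
```

A non-zero admissible heuristic, for example a lower bound on the cost of the remaining characters, would turn this into A* proper and reduce the number of nodes expanded.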

Page 14

amazing generated suggestions

Parhumuitar} ← Parliamentary

I.iulwuvB ← Railways

Itegtniont ← Regiment

niltfltory ← adultery

uj.rccu.eut ← agreement

couniutfc.o ← committee

cnuipuii ← company

dctoimiuatJOu ← determination

uiidcrtkikcr’a ← undertaker’s

Page 15

selecting best combination

unsiejitlv: unsightly unseemly unsettle unsteady Unsightly urgently

bohavlour: behaviour behavour behavior Behaviour behaviours behaving

abonf: about above along been

am: am an a in as

unsiejitlv: unsightly unseemly unsettle unsteady Unsightly urgently

disgrie: disgrace disagree disguise desire degree disease

[NOTE: word joins and splits are also supported]
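Choosing one candidate per position so that the whole sequence is cheapest can be sketched as a Viterbi-style dynamic program over the candidate columns. The version below is an illustration only: it uses bigram transition costs, ignores joins and splits, and every number in it is a toy value chosen to show a context cost overriding the cheapest single-word choice.

```python
def best_combination(candidates, bigram_cost, default=5):
    """candidates: one list of (word, word_cost) per OCR token.
    bigram_cost: {(previous_word, word): cost}; unseen pairs cost `default`.
    Returns (total_cost, chosen_words) for the cheapest path.
    """
    # paths[word] = (cost of the best path ending in word, that path)
    paths = {w: (c, [w]) for w, c in candidates[0]}
    for column in candidates[1:]:
        new_paths = {}
        for word, cost in column:
            prev_word, (prev_cost, prev_seq) = min(
                paths.items(),
                key=lambda item: item[1][0] + bigram_cost.get((item[0], word), default),
            )
            total = prev_cost + bigram_cost.get((prev_word, word), default) + cost
            new_paths[word] = (total, prev_seq + [word])
        paths = new_paths
    return min(paths.values(), key=lambda p: p[0])

# Toy costs: "unsightly" is the cheaper word on its own, but the (invented)
# bigram cost makes "unseemly behaviour" the cheaper sequence.
columns = [
    [("unsightly", 2), ("unseemly", 3)],
    [("behaviour", 1), ("behaving", 4)],
]
bigrams = {("unseemly", "behaviour"): 1}
print(best_combination(columns, bigrams))  # (5, ['unseemly', 'behaviour'])
```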

Page 16

training

● 5-grams - subset selection

● corpus 1,2,3-grams - statistical build

● extra word lists - easy

● error model - bootstrap or new pairs

Page 17

testing

● 65000 words ground truth including foreign (US) newspapers

● all measures exceeded goal:

○ search errors (article word types)

○ read errors (article word tokens)

○ entropy weighted term errors
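The difference between a type-based and a token-based measure can be shown with a small sketch. The definitions below are simplified assumptions rather than the paper's exact formulas: recall is taken as the share of distinct ground-truth words that appear anywhere in the OCR text of the article, and the token error rate assumes the two token lists are already aligned one to one.

```python
def type_recall(truth_tokens, ocr_tokens):
    """Share of distinct true words that a search of the OCR text would still find."""
    truth_types = set(truth_tokens)
    return len(truth_types & set(ocr_tokens)) / len(truth_types)

def token_error_rate(truth_tokens, ocr_tokens):
    """Share of running words a reader sees wrong (lists assumed aligned)."""
    wrong = sum(t != o for t, o in zip(truth_tokens, ocr_tokens))
    return wrong / len(truth_tokens)

truth = "the morning herald of the morning".split()
ocr = "the rnorniny herald of the morning".split()
print(type_recall(truth, ocr))       # 1.0
print(token_error_rate(truth, ocr))  # 0.16666666666666666
```

In this toy article one token is misread, so a reader sees an error, yet a search for "morning" still succeeds because the word is read correctly elsewhere in the same article.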

Page 18

SMH sample

                     Before  After
Recall               83.8%   94.1%   recall misses reduced 63.3%
Raw Error Rate       18.5%   5.5%    errors reduced 70.1%
Weighted Error Rate  16.2%   6.7%    weighted errors reduced 59.4%

Page 19

questions?

Presentation viewable at http://goo.gl/n85gR6

Page 21

National Library of Australia’s TROVE

● 1.4m distinct visitors/month

● 16m pageviews/month

● 80% of usage is old newspapers

○ 13m pages, over 600 titles

○ 85k lines corrected/day

Page 22

Even this massive volunteer effort cannot keep up

● < 2% of errors have been corrected

● % corrected is declining

● Hence searching is unreliable, OCR’ed text is hard to read and reuse

● Trove’s accuracy is “typical”

Page 24

159 randomly selected news articles from The Sydney Morning Herald

47.4K words hand-corrected to ground truth

Page 25

SMH sample

                       Before  After
Recall                 83.8%   94.1%   recall misses reduced 63.3%
False positive recall  26.7%   9.1%    false positives reduced 65.8%
Raw Error Rate         18.5%   5.5%    errors reduced 70.1%
Weighted Error Rate    16.2%   6.7%    weighted errors reduced 59.4%
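The "reduced" figures read as relative reductions, (before - after) / before, with recall first converted to a miss rate. The sketch below recomputes two of them from the rounded percentages shown. The function name is hypothetical, and the small differences from the slide (70.3% against 70.1%, 63.6% against 63.3%) are presumably because the slide was computed from unrounded counts, which is an assumption.

```python
def relative_reduction(before, after):
    """Fraction of the original errors that were removed."""
    return (before - after) / before

# Raw error rate, SMH sample: 18.5% before, 5.5% after.
raw = relative_reduction(18.5, 5.5)

# Recall misses: 100% minus recall, before and after.
misses = relative_reduction(100 - 83.8, 100 - 94.1)

print(f"{raw:.1%}")     # 70.3%
print(f"{misses:.1%}")  # 63.6%
```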

Page 30

49 randomly selected news articles from LoC Chronicling America

18.1K words hand-corrected to ground truth

Page 31

LOC sample

                       Before  After
Recall                 84.0%   93.1%   recall misses reduced 56.6%
False positive recall  23.6%   8.8%    false positives reduced 62.8%
Raw Error Rate         19.1%   6.4%    errors reduced 66.7%
Weighted Error Rate    16.0%   7.7%    weighted errors reduced 51.8%