Shape-Free Statistical Information in Optical Character Recognition
Scott Leishman Sam Roweis
Machine Learning Group Talk
University of Toronto
April 2nd, 2007
Optical Character Recognition
Definition: Optical Character Recognition (OCR) is the process by which digital images of textual symbols are translated into a machine-readable representation.
The Early Years - “Optical” Recognition Systems
System Proposed in Handel’s 1933 Patent Filing
A classic pattern recognition problem
The idea dates back at least to patent filings in the early 1930s
Initial uses included telegraph processing and as an aid to the blind
The “Middle” Ages - Constrained Template Matching
The OCR-A Font
First commercial OCR systems appeared in the mid 1950s
System designs heavily influenced by computer logic circuitry and electronics
Input documents extremely constrained
Most systems used template matching (limited to a few fonts and sizes)
Current Systems - Visual Classification
Neural network or other shape-based classifiers trained on a large variety of fonts
Incorporation of some contextual information after the initial recognition phase
Many commercial offerings, some claiming accuracy rates > 99%
Problem solved??
Ideal Input
Product     Character Accuracy
Acrobat          99.66
Any2DjVu         99.74
Tesseract        98.48
Some Problem Cases - Noisy Curled Characters
Acrobat
Any2DjVu
Tesseract
Some Problem Cases - Textured Background
Acrobat
Any2DjVu
Tesseract
Some Problem Cases - Unknown Font
Acrobat
Any2DjVu
Tesseract
Clearly these systems have not been trained on a font in this style/size!
Improving on the Untrained Font Problem
Ideally want a system that adapts to each document to be recognized
Exploit language consistency within a document (true unless dealing with dictionaries, machine translation papers, etc.)
Exploit glyph shape and font consistency within a document (relatively true for most cases, except things like ransom notes and font manuals)
Our Approach
Estimate frequency distributions of character co-occurrences from a large text corpus
Locate and cluster together similar-looking glyph images from a document to be recognized
Estimate frequency distributions over the clusters
Determine the mapping between these two
Preprocessing - Denoising
Input document images can contain noise from a number of sources:
Fax line noise
Scanning sensor noise
Other unwanted marks (e.g. staples, large book gutters)
Find components and use aspect ratio to throw out very large or very small objects, thus removing additive noise (must be careful not to throw out small punctuation symbols)
Convolve the image to smooth over dropout pixels (must be careful not to join non-touching symbols; fill small holes)
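The dropout-smoothing step can be sketched with a simple neighbourhood rule. This is a hedged illustration in NumPy, not the system's actual convolution: the slides do not specify the kernel or thresholds, so the "fill if 6+ of 8 neighbours are on, drop if 0 neighbours are on" rule here is an assumption.

```python
import numpy as np

def smooth_dropouts(img):
    # img: binary 2-D array; fill pixels whose 8-neighbourhood is
    # mostly "on" (dropout holes) and clear isolated noise specks
    p = np.pad(img, 1)
    # sum of the 8 neighbours of every pixel (pad keeps borders safe)
    nb = sum(np.roll(np.roll(p, dy, 0), dx, 1)
             for dy in (-1, 0, 1) for dx in (-1, 0, 1)
             if (dy, dx) != (0, 0))[1:-1, 1:-1]
    out = img.copy()
    out[(img == 0) & (nb >= 6)] = 1   # fill small holes
    out[(img == 1) & (nb == 0)] = 0   # remove isolated pixels
    return out
```

A real system would also tune these thresholds against punctuation, per the caution in the slide above.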
Preprocessing - Binarization
Input document images are often given in full colour or grayscale
Reducing them to two intensities improves processing speed, and certain algorithms require binary images (e.g. Hausdorff distance)
Methods fall under one of two categories: global or local
No magic bullet!
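As one concrete example of a global method (the slides do not name a specific algorithm, so Otsu's method is chosen here purely for illustration), a single threshold can be picked by maximizing between-class variance over the intensity histogram:

```python
import numpy as np

def otsu_threshold(gray):
    # gray: array of intensities in 0..255; returns a global threshold
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # background class probability
    mu = np.cumsum(p * np.arange(256))    # cumulative intensity mean
    mu_t = mu[-1]
    # between-class variance for every candidate threshold
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b[~np.isfinite(sigma_b)] = 0
    return int(np.argmax(sigma_b))
```

Local (adaptive) methods instead compute a threshold per window, which handles uneven illumination that defeats any single global cut, hence "no magic bullet".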
Preprocessing - Page Deskewing
Manually placed scanned pages typically vary in skew by up to ±15°
Some algorithms for line finding and other processing tasks will not work correctly on skewed documents
Many early approaches based on the generalized Hough transform [Duda1972]
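A minimal sketch of skew estimation, using the projection-profile idea closely related to the Hough approaches mentioned above (this specific formulation is an assumption, not the deck's exact algorithm): project foreground pixels at a range of candidate angles and keep the angle whose profile is peakiest, since correctly aligned text lines collapse into a few dense bins.

```python
import numpy as np

def estimate_skew(img, angles=np.linspace(-15, 15, 61)):
    # img: 2-D binary array, foreground = 1; returns skew in degrees
    ys, xs = np.nonzero(img)
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        t = np.deg2rad(a)
        # project each foreground pixel perpendicular to a line at
        # angle `a`; text lines at that angle fall into few bins
        rows = np.round(ys * np.cos(t) - xs * np.sin(t)).astype(int)
        counts = np.bincount(rows - rows.min())
        score = np.var(counts)   # peaky profile => high variance
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle
```

The ±15° search range matches the skew variation quoted on the slide.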
Preprocessing - Page Segmentation
Top-down and bottom-up approaches exist (XY-cuts, run-length smoothing, area Voronoi technique)
Textual region identification is non-trivial; for mostly textual documents, one can look for regular valleys in a projection profile taken perpendicular to the reading-order direction
Region sequence determination is also difficult, particularly for multi-column documents or documents with figure captions
Isolating Individual Symbols - Connected Components
Simple 2-pass sweep used to group pixels into connected components [Rosenfeld1966]
Bounding box co-ordinates calculated and stored
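The classic two-pass sweep can be sketched as follows: the first pass assigns provisional labels and records equivalences between touching labels, the second pass resolves each provisional label to a canonical component id. This sketch assumes 4-connectivity; the slides do not state which connectivity the system uses.

```python
def label_components(img):
    # img: list of lists of 0/1; two-pass labelling with union-find
    h, w = len(img), len(img[0])
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    labels = [[0] * w for _ in range(h)]
    nxt = 1
    # pass 1: provisional labels, record label equivalences
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            up = labels[y - 1][x] if y else 0
            left = labels[y][x - 1] if x else 0
            if up and left:
                labels[y][x] = up
                parent[find(up)] = find(left)   # merge the two labels
            elif up or left:
                labels[y][x] = up or left
            else:
                labels[y][x] = nxt
                parent[nxt] = nxt
                nxt += 1
    # pass 2: replace provisional labels with canonical ids
    canon = {}
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                root = find(labels[y][x])
                canon.setdefault(root, len(canon) + 1)
                labels[y][x] = canon[root]
    return labels, len(canon)
```

Bounding boxes then follow directly from the min/max coordinates of each label.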
Isolating Individual Symbols - Neighbour / Line Finding
Nearest neighbouring component (and distance) calculated for each component in the 4 principal directions (requires 2 sweeps over the component label image)
Lines found by following chains of neighbouring components, then baseline and x-height offsets estimated from profile sums
Attempts made to merge diacritical and other small vertical components belonging to the same symbol, like i, é, :, =, ?
Clustering Connected Components
Bottom-up, agglomerative clustering used
Component pixel intensities compared to determine clustering
Distance metrics considered include Euclidean and Hausdorff
For nearly noiseless documents, an initial sweep using a small threshold is performed to rapidly reduce the number of clusters
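The rapid initial sweep can be sketched as a single greedy pass (this is a simplification of the full agglomerative procedure, and the threshold value is illustrative): each glyph joins the first cluster whose running average bitmap is within a small Euclidean distance, otherwise it seeds a new cluster.

```python
import numpy as np

def cluster_glyphs(glyphs, threshold=1.0):
    # glyphs: list of equal-sized binary glyph bitmaps
    centroids, members = [], []
    for g in glyphs:
        g = np.asarray(g, dtype=float)
        dists = [np.linalg.norm(g - c) for c in centroids]
        if dists and min(dists) < threshold:
            i = int(np.argmin(dists))
            n = len(members[i])
            # update the running average for this cluster
            centroids[i] = (centroids[i] * n + g) / (n + 1)
            members[i].append(g)
        else:
            centroids.append(g)
            members.append([g])
    return centroids, members
```

A conservative (small) threshold keeps clusters pure at the cost of multiple clusters per character, exactly the trade-off noted later in the deck.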
Hausdorff Distance Metric
Would prefer to weight mis-matched pixel intensities so that "on" pixels lying further away from the nearest "on" pixel in the comparison image are charged a larger cost
The Hausdorff distance [Huttenlocher1992] does this, measuring the distance between two images A, B as follows:

D_H = max(h(A, B), h(B, A))    (1)

where

h(X, Y) = max over i in X' of min over j in Y' of d(X(i), Y(j))    (2)

and d(x, y) is replaced by a distance metric like Euclidean distance; X' denotes the set of foreground pixels of X
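Equations (1) and (2) translate directly into a brute-force implementation over the two foreground point sets (real systems use distance transforms for speed; this sketch favours clarity):

```python
import numpy as np

def directed_hausdorff(A, B):
    # h(A, B): for each "on" pixel of A, distance to the nearest
    # "on" pixel of B; take the worst (largest) such distance
    pa = np.argwhere(A)   # foreground coordinates of A
    pb = np.argwhere(B)
    # pairwise Euclidean distances between the two point sets
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2)
    return d.min(axis=1).max()

def hausdorff(A, B):
    # D_H = max(h(A, B), h(B, A)), the symmetric distance of eq. (1)
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

Note the asymmetry of h: a glyph missing a stroke is penalized differently in each direction, which is why both directions are taken in (1).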
Merging Fragmented Symbols
After one round of match-based processing, cluster information is used to piece together fragmented glyphs
Clusters containing components which share nearest neighbours belonging to the same cluster are marked for merger
Provided they lie only a small distance apart, belong to the same line, and the neighbours in turn list components in the original cluster as their neighbour, a merger is performed between these components
Symbol segmentation is often carried out as part of the recognition process
Splitting Apart Touching Symbols
Clustering information also used to try to split components containing multiple symbols
Candidate horizontal split points found based on component width and vertical projection sums
Halves are recursively searched for matches against other cluster centroids (using Hausdorff distance, for example)
Refining the Clusters
Matching, merging, and splitting process is repeated over affected clusters until no further changes are seen
Conservative thresholds used to prevent cluster impurities (at the expense of leaving multiple clusters per character)
No real attempts made at rescaling cluster averages
Determining Word Boundaries
Determining word boundaries is critical for accurate contextual estimation, since our approach makes use of within-word positional frequency
For most textual documents, the distribution over neighbouring horizontal component distances is bimodal, with the smaller mode representing inter-character spacing and the second inter-word spacing
We attempt to determine the cut-off point by fitting a mixture model to these frequency counts
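One way to realize the mixture-model fit is a small 1-D EM loop over two Gaussians; the slides do not specify the mixture family or the cut-off rule, so both the Gaussian assumption and the midpoint-of-means cut-off below are illustrative choices.

```python
import numpy as np

def fit_two_gaussians(x, iters=200):
    # 1-D EM for a two-component Gaussian mixture
    x = np.asarray(x, dtype=float)
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        d = x[:, None] - mu[None, :]
        p = pi * np.exp(-0.5 * d ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, mixing weights
        n = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n
        d = x[:, None] - mu[None, :]
        var = (r * d ** 2).sum(axis=0) / n + 1e-6
        pi = n / len(x)
    return mu, var, pi

def spacing_cutoff(dists):
    # cut-off between inter-character and inter-word spacing:
    # here simply the midpoint of the two fitted means
    mu, _, _ = fit_two_gaussians(dists)
    return float(mu.mean())
```

Distances below the cut-off are then treated as intra-word gaps, those above as word breaks.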
Symbol Decoding Using Positional Frequency
Neighbours in reading order give a cluster id sequence
Common words like the, of, and, it dominate documents that are written in grammatically correct English prose (Zipf's "law")
Certain character n-grams are much more common in English words than others (contrast ing with zzz)
Given word breaks, certain characters often occur in particular word positions; uppercase letters tend to be seen in the first position of a word, while letters like s and punctuation symbols like . are common choices for the last position of a word
Positional Frequency (Cont’d)
We exploit these regularities by estimating the word positional frequencies for each symbol
Estimated first for each symbol in a large labelled text corpus, then on each cluster using component and estimated word boundary information
Counts normalized to create distributions over each word length (unless no counts are seen for that symbol at a particular word length)
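The counting step above can be sketched in a few lines: tally each symbol per (word length, position) cell, then normalize each cell into a distribution. The `max_len` cap is an assumption for illustration.

```python
from collections import Counter, defaultdict

def positional_frequencies(words, max_len=10):
    # counts[(L, p)]: Counter of symbols seen at position p
    # of words of length L
    counts = defaultdict(Counter)
    for w in words:
        if 1 <= len(w) <= max_len:
            for p, c in enumerate(w):
                counts[(len(w), p)][c] += 1
    # normalize each (length, position) cell into a distribution
    freqs = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        freqs[key] = {c: n / total for c, n in ctr.items()}
    return freqs
```

Run once over the labelled reference corpus, and again over the cluster-id sequences of the input document, this yields the two sets of distributions to be matched.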
Matching Observed Cluster Vectors to Reference Vectors

Counts define a point in an x(x+1)/2 dimensional "positional" feature space (where x is the maximum word length considered)
Mapping from observed counts to reference counts can be carried out via likelihood or cross-entropy minimization
Our current approach:
Normalize counts within each word length
Feature vectors re-weighted based on word length to minimize the effect of wild positional differences on seldom-seen long words
Euclidean distance measured between each cluster feature vector and symbol vectors to find the closest map
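A sketch of the feature-space construction and the Euclidean matching step (the per-word-length re-weighting is omitted here for brevity, so this is a simplified version of the approach): one slot per (word length L, position p) pair with p < L gives exactly x(x+1)/2 dimensions.

```python
import numpy as np

def feature_vector(freqs_for_symbol, max_len):
    # freqs_for_symbol: dict (L, p) -> frequency of this symbol
    # at position p of length-L words; missing cells are 0
    v = []
    for L in range(1, max_len + 1):
        for p in range(L):
            v.append(freqs_for_symbol.get((L, p), 0.0))
    return np.array(v)   # dimension: max_len * (max_len + 1) / 2

def closest_symbol(cluster_vec, symbol_vecs):
    # symbol_vecs: dict symbol -> reference feature vector;
    # return the symbol whose vector is nearest in Euclidean distance
    return min(symbol_vecs,
               key=lambda s: np.linalg.norm(cluster_vec - symbol_vecs[s]))
```

Each observed cluster vector is mapped to its nearest reference symbol vector, giving the initial cluster-to-symbol assignment that dictionary lookup then refines.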
Improving the Mappings via Dictionary Lookup
Positional mapping alone often provides a good match for frequently seen symbols like lowercase vowels and some consonants; however, other symbols regularly exhibit high variance between cluster and corresponding reference vectors
Causes for this include the short length of most input documents, coupled with document type and subject matter
Use dictionary lookup to improve matchings
the αuick
α = ?
Dictionary Lookup Procedure
Greedily assign clusters to symbols based on occurrence frequency
Try candidates in order based on positional feature distance
Dictionary Lookup Procedure (Cont'd)
Look up each partially assigned word cluster sequence in which this cluster appears, and check for it in the dictionary (use wildcards for unmapped clusters)
If at least some threshold of these words match, permanently map this symbol to each component in this cluster
If multiple symbols produce the same word lookup score, ties are first broken based on line offset information (each symbol belongs to one of 4 classes: ascenders, descenders, normal, and short x-height)
If ties still remain, or no symbol achieves the threshold score, the closest positional match is taken as the final symbol
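The wildcard lookup can be sketched with regular expressions (a hedged illustration: the function name, the `?` wildcard convention, and scoring as a simple hit fraction are assumptions; the deck only specifies "wildcards for unmapped clusters" and a threshold on matching words):

```python
import re

def lookup_score(pattern_words, dictionary):
    # pattern_words: partially decoded words, '?' marks a position
    # whose cluster has not been mapped to a symbol yet
    hits = 0
    for w in pattern_words:
        # escape literals, then turn escaped '?' into "any character"
        rx = re.compile('^' + re.escape(w).replace(r'\?', '.') + '$')
        if any(rx.match(d) for d in dictionary):
            hits += 1
    return hits / len(pattern_words)
```

A candidate symbol whose trial assignment pushes this score over the threshold is then permanently mapped, per the procedure above.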
Experimental Setup
Symbol alphabet contains 92 different symbols, including upper and lower case letters, digits, punctuation, brackets, and simple arithmetic operators
Reference positional symbol counts and the word lookup dictionary were constructed from a 17,601 word chunk of the Reuters-21578 news corpus [Lewis2004] (744,522 symbols). All symbols left intact, but trailing "Reuter" byline removed
Initial input tests performed against 10 15-page Legal documents from the UNLV ISRI OCR dataset [Nartker2005] (fine fax mode dpi)
Results
                % Correct
word              92.49
character         90.72
low letters       97.79
upper letters      4.02
digits             7.51
other sym.        31.44

Text aligned to ground truth as best as possible, then string edit operations used to determine accuracy by class
Results for lowercase letters found to be roughly on par with shape-based approaches, but other symbols and overall performance worse
Lowercase letters dominate this dataset (84.2% of all symbols)
Impact of Segmentation and Clustering on Performance

                Regular   Perfect
word              92.49     95.30
character         90.72     96.07
low letters       97.79    100
upper letters      4.02     65.66
digits             7.51     23.64
other sym.        31.44     61.72

To determine the impact of segmentation and clustering on performance, the ASCII codes of each symbol were used to group symbols together
This represents a perfect clustering (no split or merged symbols, only a single cluster for each distinct symbol)
Impact of Document Length on Performance
[Plot: % of characters correctly identified vs. number of characters in document (0 to 12,000); series: Combined, Low, Upper, Digits, Other]
Short Document Results
159 short 1-2 page business letters sampled from the UNLV ISRI OCR dataset [Nartker2005] (fine fax mode dpi)
2010 symbols per document on average

                Bus.   Bus. (Perfect)   Legal   Legal (Perfect)
word           50.50        79.91       92.49        95.30
character      67.84        88.24       90.72        96.07
low letters    73.17        95.71       97.79       100
upper letters   9.21        50.65        4.02        65.66
digits          6.84        20.52        7.51        23.64
other sym.     24.55        52.36       31.44        61.72
Conclusions and Future Work
Can't build a complete system using context alone
Works well when characters appear enough times that they start to somewhat approximate their reference counts
Positional counts and word lookup scores provide a fairly useful source of information for the recognition process, something that isn't being exploited much by current approaches
Biggest improvements can be had by improving segmentation performance (rather than improving contextual collection techniques); the majority of errors are due to merged symbols not being separated during the clustering phase
Future work involves exploring more detailed models for statistical information comparison
Advantages Of Our Approach
Big gain is that our approach is font and resolution independent (though we have tested using baseline and x-height offsets to improve tie-breaking)
Can also be re-targeted to other phonetic languages by plugging in an appropriate lookup dictionary
Works (faster) on symbolically compressed documents (see next slide)
Symbolically Compressed Documents
JBIG2: image compression standard for binary images, specially suited to images composed of repeated subimages (like textual document images)
Lossless compression scheme stores a single template image, as well as co-ordinate offsets on each page, and image differences at those offsets
Lossy versions also store templates and offsets, but don't store the residual image differences
Used in PDF (1.4 and higher), DjVu, xpdf, and others
Our OCR approach can work directly with these compressed documents without having to perform clustering (though split and merge refinements may be required)
Related Work
Cryptogram decoding [Nagy1984]
Ho and Nagy: symbol class identification
Huang: entropy-based approach [Huang2006]
Automatic Script and Language Determination
Recognizing glyphs via statistical language features is only possible if we know the underlying language of the input region
Possible to distinguish Han scripts from Latin-based scripts by measuring the frequency and height of upward concaving runs of pixels [Spitz1997]
Able to distinguish amongst 23 Latin-based languages using character codes from frequently occurring word images