Shape-Free Statistical Information in Optical Character Recognition
Scott Leishman Sam Roweis
Machine Learning Group Talk
University of Toronto
April 2nd, 2007
Optical Character Recognition
Definition: Optical Character Recognition (OCR) is the process by which digital images of textual symbols are translated into a machine-readable representation.
The Early Years - “Optical” Recognition Systems
System Proposed in Handel’s 1933 Patent Filing
A classic pattern recognition problem
The idea dates back at least to patent filings in the early 1930s
Initial uses included telegraph processing and as an aid to the blind
The “Middle” Ages - Constrained Template Matching
The OCR-A Font
First commercial OCR systems appeared in the mid 1950s
System designs heavily influenced by computer logic circuitry and electronics
Input documents extremely constrained
Most systems used template matching (limited to a few fonts and sizes)
Current Systems - Visual Classification
Neural network or other shape-based classifiers trained on a large variety of fonts
Incorporation of some contextual information after the initial recognition phase
Many commercial offerings, some claiming accuracy rates > 99%
Problem solved??
Ideal Input
Product     Character Accuracy
Acrobat          99.66
Any2DjVu         99.74
Tesseract        98.48
Some Problem Cases - Noisy Curled Characters
Acrobat
Any2DjVu
Tesseract
Some Problem Cases - Textured Background
Acrobat
Any2DjVu
Tesseract
Some Problem Cases - Unknown Font
Acrobat
Any2DjVu
Tesseract
Clearly these systems have not been trained on a font in this style/size!
Improving on the Untrained Font Problem
Ideally want a system that adapts to each document to be recognized
Exploit language consistency within a document (true unless dealing with dictionaries, machine translation papers, etc.)
Exploit glyph shape and font consistency within a document (relatively true for most cases, except things like ransom notes and font manuals)
Our Approach
Estimate frequency distributions of character co-occurrences from a large text corpus
Locate and cluster together similar-looking glyph images from a document to be recognized
Estimate frequency distributions over the clusters
Determine the mapping between these two
Preprocessing - Denoising
Input document images can contain noise from a number of sources:
Fax line noise
Scanning sensor noise
Other unwanted marks (e.g. staples, large book gutters)
Find components and use aspect ratio to throw out very large or very small objects, thus removing additive noise (must be careful not to throw out small punctuation symbols)
Convolve the image to smooth over dropout pixels (must be careful not to join non-touching symbols; fill small holes)
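The dropout-smoothing step can be sketched with a simple neighbourhood rule. This is a hedged illustration in NumPy, not the system's actual convolution: the slides do not specify the kernel or thresholds, so the "fill if 6+ of 8 neighbours are on, drop if 0 neighbours are on" rule here is an assumption.

```python
import numpy as np

def smooth_dropouts(img):
    # img: binary 2-D array; fill pixels whose 8-neighbourhood is
    # mostly "on" (dropout holes) and clear isolated noise specks
    p = np.pad(img, 1)
    # sum of the 8 neighbours of every pixel (pad keeps borders safe)
    nb = sum(np.roll(np.roll(p, dy, 0), dx, 1)
             for dy in (-1, 0, 1) for dx in (-1, 0, 1)
             if (dy, dx) != (0, 0))[1:-1, 1:-1]
    out = img.copy()
    out[(img == 0) & (nb >= 6)] = 1   # fill small holes
    out[(img == 1) & (nb == 0)] = 0   # remove isolated pixels
    return out
```

A real system would also tune these thresholds against punctuation, per the caution in the slide above.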
Preprocessing - Binarization
Input document images are often given in full colour or grayscale
Reducing them to two intensities improves processing speed, and certain algorithms require binary images (e.g. Hausdorff distance)
Methods fall under one of two categories: global or local
No magic bullet!
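As one concrete example of a global method (the slides do not name a specific algorithm, so Otsu's method is chosen here purely for illustration), a single threshold can be picked by maximizing between-class variance over the intensity histogram:

```python
import numpy as np

def otsu_threshold(gray):
    # gray: array of intensities in 0..255; returns a global threshold
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # background class probability
    mu = np.cumsum(p * np.arange(256))    # cumulative intensity mean
    mu_t = mu[-1]
    # between-class variance for every candidate threshold
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b[~np.isfinite(sigma_b)] = 0
    return int(np.argmax(sigma_b))
```

Local (adaptive) methods instead compute a threshold per window, which handles uneven illumination that defeats any single global cut, hence "no magic bullet".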
Preprocessing - Page Deskewing
Manually placed scanned pages typically vary in skew by up to ±15°
Some algorithms for line finding and other processing tasks will not work correctly on skewed documents
Many early approaches based on the generalized Hough transform [Duda1972]
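A minimal sketch of skew estimation, using the projection-profile idea closely related to the Hough approaches mentioned above (this specific formulation is an assumption, not the deck's exact algorithm): project foreground pixels at a range of candidate angles and keep the angle whose profile is peakiest, since correctly aligned text lines collapse into a few dense bins.

```python
import numpy as np

def estimate_skew(img, angles=np.linspace(-15, 15, 61)):
    # img: 2-D binary array, foreground = 1; returns skew in degrees
    ys, xs = np.nonzero(img)
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        t = np.deg2rad(a)
        # project each foreground pixel perpendicular to a line at
        # angle `a`; text lines at that angle fall into few bins
        rows = np.round(ys * np.cos(t) - xs * np.sin(t)).astype(int)
        counts = np.bincount(rows - rows.min())
        score = np.var(counts)   # peaky profile => high variance
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle
```

The ±15° search range matches the skew variation quoted on the slide.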
Preprocessing - Page Segmentation
Top-down and bottom-up approaches exist (XY-cuts, run-length smoothing, area Voronoi technique)
Textual region identification is non-trivial; for mostly textual documents, one can look for regular valleys in a projection profile taken perpendicular to the reading-order direction
Region sequence determination is also difficult, particularly for multi-column documents or documents with figure captions
Isolating Individual Symbols - Connected Components
Simple 2-pass sweep used to group pixels into connected components [Rosenfeld1966]
Bounding box co-ordinates calculated and stored
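The classic two-pass sweep can be sketched as follows: the first pass assigns provisional labels and records equivalences between touching labels, the second pass resolves each provisional label to a canonical component id. This sketch assumes 4-connectivity; the slides do not state which connectivity the system uses.

```python
def label_components(img):
    # img: list of lists of 0/1; two-pass labelling with union-find
    h, w = len(img), len(img[0])
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    labels = [[0] * w for _ in range(h)]
    nxt = 1
    # pass 1: provisional labels, record label equivalences
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            up = labels[y - 1][x] if y else 0
            left = labels[y][x - 1] if x else 0
            if up and left:
                labels[y][x] = up
                parent[find(up)] = find(left)   # merge the two labels
            elif up or left:
                labels[y][x] = up or left
            else:
                labels[y][x] = nxt
                parent[nxt] = nxt
                nxt += 1
    # pass 2: replace provisional labels with canonical ids
    canon = {}
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                root = find(labels[y][x])
                canon.setdefault(root, len(canon) + 1)
                labels[y][x] = canon[root]
    return labels, len(canon)
```

Bounding boxes then follow directly from the min/max coordinates of each label.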
Isolating Individual Symbols - Neighbour / Line Finding
Nearest neighbouring component (and distance) calculated for each component in the 4 principal directions (requires 2 sweeps over the component label image)
Lines found by following chains of neighbouring components, then baseline and x-height offsets estimated from profile sums
Attempts made to merge diacritical and other small vertical components belonging to the same symbol, like i, é, :, =, ?
Clustering Connected Components
Bottom-up, agglomerative clustering used
Component pixel intensities compared to determine clustering
Distance metrics considered include Euclidean and Hausdorff
For nearly noiseless documents, an initial sweep using a small threshold is performed to rapidly reduce the number of clusters
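The rapid initial sweep can be sketched as a single greedy pass (this is a simplification of the full agglomerative procedure, and the threshold value is illustrative): each glyph joins the first cluster whose running average bitmap is within a small Euclidean distance, otherwise it seeds a new cluster.

```python
import numpy as np

def cluster_glyphs(glyphs, threshold=1.0):
    # glyphs: list of equal-sized binary glyph bitmaps
    centroids, members = [], []
    for g in glyphs:
        g = np.asarray(g, dtype=float)
        dists = [np.linalg.norm(g - c) for c in centroids]
        if dists and min(dists) < threshold:
            i = int(np.argmin(dists))
            n = len(members[i])
            # update the running average for this cluster
            centroids[i] = (centroids[i] * n + g) / (n + 1)
            members[i].append(g)
        else:
            centroids.append(g)
            members.append([g])
    return centroids, members
```

A conservative (small) threshold keeps clusters pure at the cost of multiple clusters per character, exactly the trade-off noted later in the deck.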
Hausdorff Distance Metric
Would prefer to weight mis-matched pixel intensities so that "on" pixels lying further away from the nearest "on" pixel in the comparison image are charged a larger cost
The Hausdorff distance [Huttenlocher1992] does this, measuring the distance between two images A, B as follows:

D_H = max(h(A, B), h(B, A))    (1)

where

h(X, Y) = max over i in X' of min over j in Y' of d(X(i), Y(j))    (2)

and d(x, y) is replaced by a distance metric like Euclidean distance; X' denotes the set of foreground pixels of X
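Equations (1) and (2) translate directly into a brute-force implementation over the two foreground point sets (real systems use distance transforms for speed; this sketch favours clarity):

```python
import numpy as np

def directed_hausdorff(A, B):
    # h(A, B): for each "on" pixel of A, distance to the nearest
    # "on" pixel of B; take the worst (largest) such distance
    pa = np.argwhere(A)   # foreground coordinates of A
    pb = np.argwhere(B)
    # pairwise Euclidean distances between the two point sets
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2)
    return d.min(axis=1).max()

def hausdorff(A, B):
    # D_H = max(h(A, B), h(B, A)), the symmetric distance of eq. (1)
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

Note the asymmetry of h: a glyph missing a stroke is penalized differently in each direction, which is why both directions are taken in (1).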
Merging Fragmented Symbols
After one round of match-based processing, cluster information is used to piece together fragmented glyphs
Clusters containing components which share nearest neighbours belonging to the same cluster are marked for merger
Provided they lie only a small distance apart, belong to the same line, and the neighbours in turn list components in the original cluster as their neighbour, a merger is performed between these components
Symbol segmentation is often carried out as part of the recognition process
Splitting Apart Touching Symbols
Clustering information also used to try to split components containing multiple symbols
Candidate horizontal split points found based on component width and vertical projection sums
Halves are recursively searched for matches against other cluster centroids (using Hausdorff distance, for example)
Refining the Clusters
Matching, merging, and splitting process is repeated over affected clusters until no further changes are seen
Conservative thresholds used to prevent cluster impurities (at the expense of leaving multiple clusters per character)
No real attempts made at rescaling cluster averages
Determining Word Boundaries
Determining word boundaries is critical for accurate contextual estimation, since our approach makes use of within-word positional frequency
For most textual documents, the distribution over neighbouring horizontal component distances is bimodal, with the smaller mode representing inter-character spacing and the second inter-word spacing
We attempt to determine the cut-off point by fitting a mixture model to these frequency counts
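One way to realize the mixture-model fit is a small 1-D EM loop over two Gaussians; the slides do not specify the mixture family or the cut-off rule, so both the Gaussian assumption and the midpoint-of-means cut-off below are illustrative choices.

```python
import numpy as np

def fit_two_gaussians(x, iters=200):
    # 1-D EM for a two-component Gaussian mixture
    x = np.asarray(x, dtype=float)
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        d = x[:, None] - mu[None, :]
        p = pi * np.exp(-0.5 * d ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, variances, mixing weights
        n = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n
        d = x[:, None] - mu[None, :]
        var = (r * d ** 2).sum(axis=0) / n + 1e-6
        pi = n / len(x)
    return mu, var, pi

def spacing_cutoff(dists):
    # cut-off between inter-character and inter-word spacing:
    # here simply the midpoint of the two fitted means
    mu, _, _ = fit_two_gaussians(dists)
    return float(mu.mean())
```

Distances below the cut-off are then treated as intra-word gaps, those above as word breaks.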
Symbol Decoding Using Positional Frequency
Neighbours in reading order give a cluster id sequence
Common words like the, of, and, it dominate documents that are written in grammatically correct English prose (Zipf's "law")
Certain character n-grams are much more common in English words than others (contrast ing with zzz)
Given word breaks, certain characters often occur in particular word positions; uppercase letters tend to be seen in the first position of a word, while letters like s and punctuation symbols like . are common choices for the last position of a word
Positional Frequency (Cont’d)
We exploit these regularities by estimating the word positional frequencies for each symbol
Estimated first for each symbol in a large labelled text corpus, then on each cluster using component and estimated word boundary information
Counts normalized to create distributions over each word length (unless no counts are seen for that symbol at a particular word length)
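The counting step above can be sketched in a few lines: tally each symbol per (word length, position) cell, then normalize each cell into a distribution. The `max_len` cap is an assumption for illustration.

```python
from collections import Counter, defaultdict

def positional_frequencies(words, max_len=10):
    # counts[(L, p)]: Counter of symbols seen at position p
    # of words of length L
    counts = defaultdict(Counter)
    for w in words:
        if 1 <= len(w) <= max_len:
            for p, c in enumerate(w):
                counts[(len(w), p)][c] += 1
    # normalize each (length, position) cell into a distribution
    freqs = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        freqs[key] = {c: n / total for c, n in ctr.items()}
    return freqs
```

Run once over the labelled reference corpus, and again over the cluster-id sequences of the input document, this yields the two sets of distributions to be matched.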
Matching Observed Cluster Vectors to Reference Vectors

Counts define a point in an x(x+1)/2 dimensional "positional" feature space (where x is the maximum word length considered)
Mapping from observed counts to reference counts can be carried out via likelihood or cross-entropy minimization
Our current approach:
Normalize counts within each word length
Feature vectors re-weighted based on word length to minimize the effect of wild positional differences on seldom-seen long words
Euclidean distance measured between each cluster feature vector and symbol vectors to find the closest map
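A sketch of the feature-space construction and the Euclidean matching step (the per-word-length re-weighting is omitted here for brevity, so this is a simplified version of the approach): one slot per (word length L, position p) pair with p < L gives exactly x(x+1)/2 dimensions.

```python
import numpy as np

def feature_vector(freqs_for_symbol, max_len):
    # freqs_for_symbol: dict (L, p) -> frequency of this symbol
    # at position p of length-L words; missing cells are 0
    v = []
    for L in range(1, max_len + 1):
        for p in range(L):
            v.append(freqs_for_symbol.get((L, p), 0.0))
    return np.array(v)   # dimension: max_len * (max_len + 1) / 2

def closest_symbol(cluster_vec, symbol_vecs):
    # symbol_vecs: dict symbol -> reference feature vector;
    # return the symbol whose vector is nearest in Euclidean distance
    return min(symbol_vecs,
               key=lambda s: np.linalg.norm(cluster_vec - symbol_vecs[s]))
```

Each observed cluster vector is mapped to its nearest reference symbol vector, giving the initial cluster-to-symbol assignment that dictionary lookup then refines.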
Improving the Mappings via Dictionary Lookup
Positional mapping alone often provides a good match for frequently seen symbols like lowercase vowels and some consonants; however, other symbols regularly exhibit high variance between cluster and corresponding reference vectors
Causes for this include the short length of most input documents, coupled with document type and subject matter
Use dictionary lookup to improve matchings
the αuick
α = ?
Dictionary Lookup Procedure
Greedily assign clusters to symbols based on occurrence frequency
Try candidates in order based on positional feature distance
Dictionary Lookup Procedure (Cont'd)
Look up each partially assigned word cluster sequence in which this cluster appears, and check for it in the dictionary (use wildcards for unmapped clusters)
If at least some threshold of these words match, permanently map this symbol to each component in this cluster
If multiple symbols produce the same word lookup score, ties are first broken based on line offset information (each symbol belongs to one of 4 classes: ascenders, descenders, normal, and short x-height)
If ties still remain, or no symbol achieves the threshold score, the closest positional match is taken as the final symbol
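The wildcard lookup can be sketched with regular expressions (a hedged illustration: the function name, the `?` wildcard convention, and scoring as a simple hit fraction are assumptions; the deck only specifies "wildcards for unmapped clusters" and a threshold on matching words):

```python
import re

def lookup_score(pattern_words, dictionary):
    # pattern_words: partially decoded words, '?' marks a position
    # whose cluster has not been mapped to a symbol yet
    hits = 0
    for w in pattern_words:
        # escape literals, then turn escaped '?' into "any character"
        rx = re.compile('^' + re.escape(w).replace(r'\?', '.') + '$')
        if any(rx.match(d) for d in dictionary):
            hits += 1
    return hits / len(pattern_words)
```

A candidate symbol whose trial assignment pushes this score over the threshold is then permanently mapped, per the procedure above.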
Experimental Setup
Symbol alphabet contains 92 different symbols, including upper and lower case letters, digits, punctuation, brackets, and simple arithmetic operators
Reference positional symbol counts and the word lookup dictionary were constructed from a 17,601 word chunk of the Reuters-21578 news corpus [Lewis2004] (744,522 symbols). All symbols left intact, but trailing "Reuter" byline removed
Initial input tests performed against 10 15-page Legal documents from the UNLV ISRI OCR dataset [Nartker2005] (fine fax mode dpi)
Results
                % Correct
word              92.49
character         90.72
low letters       97.79
upper letters      4.02
digits             7.51
other sym.        31.44

Text aligned to ground truth as best as possible, then string edit operations used to determine accuracy by class
Results for lowercase letters found to be roughly on par with shape-based approaches, but other symbols and overall performance worse
Lowercase letters dominate this dataset (84.2% of all symbols)
Impact of Segmentation and Clustering on Performance

                Regular   Perfect
word              92.49     95.30
character         90.72     96.07
low letters       97.79    100
upper letters      4.02     65.66
digits             7.51     23.64
other sym.        31.44     61.72

To determine the impact of segmentation and clustering on performance, the ASCII codes of each symbol were used to group symbols together
This represents a perfect clustering (no split or merged symbols, only a single cluster for each distinct symbol)
Impact of Document Length on Performance
[Plot: % of characters correctly identified vs. number of characters in document (0 to 12,000); series: Combined, Low, Upper, Digits, Other]
Short Document Results
159 short 1-2 page business letters sampled from the UNLV ISRI OCR dataset [Nartker2005] (fine fax mode dpi)
2010 symbols per document on average

                Bus.   Bus. (Perfect)   Legal   Legal (Perfect)
word           50.50        79.91       92.49        95.30
character      67.84        88.24       90.72        96.07
low letters    73.17        95.71       97.79       100
upper letters   9.21        50.65        4.02        65.66
digits          6.84        20.52        7.51        23.64
other sym.     24.55        52.36       31.44        61.72
Conclusions and Future Work
Can't build a complete system using context alone
Works well when characters appear enough times that they start to somewhat approximate their reference counts
Positional counts and word lookup scores provide a fairly useful source of information for the recognition process, something that isn't being exploited much by current approaches
Biggest improvements can be had by improving segmentation performance (rather than improving contextual collection techniques); the majority of errors are due to merged symbols not being separated during the clustering phase
Future work involves exploring more detailed models for statistical information comparison
Advantages Of Our Approach
Big gain is that our approach is font and resolution independent (though we have tested using baseline and x-height offsets to improve tie-breaking)
Can also be re-targeted to other phonetic languages by plugging in an appropriate lookup dictionary
Works (faster) on symbolically compressed documents (see next slide)
Symbolically Compressed Documents
JBIG2: image compression standard for binary images, specially suited to images composed of repeated subimages (like textual document images)
Lossless compression scheme stores a single template image, as well as co-ordinate offsets on each page, and image differences at those offsets
Lossy versions also store templates and offsets, but don't store the residual image differences
Used in PDF (1.4 and higher), DjVu, xpdf, and others
Our OCR approach can work directly with these compressed documents without having to perform clustering (though split and merge refinements may be required)
Related Work
Cryptogram decoding [Nagy1984]
Ho and Nagy: symbol class identification
Huang: entropy-based approach [Huang2006]
Automatic Script and Language Determination
Recognizing glyphs via statistical language features is only possible if we know the underlying language of the input region
Possible to distinguish Han scripts from Latin-based scripts by measuring the frequency and height of upward concaving runs of pixels [Spitz1997]
Able to distinguish amongst 23 Latin-based languages using character codes from frequently occurring word images