Distinguishing Mathematics Notation from English Text using
Computational Geometry
D. Drake, H.S. BairdDepartment of Computer Science and
EngineeringLehigh University
The TaskDifferentiate isolated math and English textlines
English text or Math?
English text or Math?
How can Optical Character Recognition (OCR) systems make this distinction? (a) math symbols; (b) spatial arrangement
Applications of Textline ClassificationCommercial OCR systems: far better on text
than on math– e.g. OCR systems still garble math
Textline classification allows:– Processing of text/math differently– Hand off math to special purpose recognizers– Users can see Math textlines as image
• No OCR errors
Prior WorkPast approaches:Symbol recognitionplus spatial analysis
Requires a classifier for special math symbolsOften sensitive to – font – font size – text orientation – language
Independent of: – font – font size – text orientationEasily extendable to other languagesBut may not handle as many cases–-let’s see…
Our approach:Purely spatial analysis
Voronoi DiagramsGiven a set of point sites in the plane,Partition the plane into regions such that the points in each region are closer to one site than any other
A computational geometry data structure which is invariant under arbitrary nonsingular similarity transformations (translation, rotation, & scaling) --- and is efficiently computable
We use Kise’s Area Voronoi diagramsInput Image
Sample points on boundaryof black connected components
Compute Voronoi Diagram
Compute Area Voronoi Diagram
Compute Neighbor Graph
Input to our classifier – decideswhether textline is math or text
Kise’s algorithm run on math notation
Features of the Neighbor Graph we use for Classification
Crafted to detect spatial arrangements among characters that distinguish math from text
Edge Features– angle (wrt horizontal)– ratio of areas– ratio of diameters– ‘shadowing’:
] +
Node Features– aspect ratio– diameter/area ratio– ‘fanout’:
45°
0°
90°
Down
Left Right
Up
Coarsely quantizedBinary-valued: presence (1) or absence (0)
Classifier design• 77 node binary features
– 2926 quadratic binary features (ANDing pairs of features)
– assume class-conditional independence among quadratic features
– trained a Bayesian node classifier
• 29 edge binary features– 406 quadratic binary features (pairs of features)
– assume class-conditional independence among quadratic features
– trained a Bayesian edge classifier
• Combined results into a textline classifier
• Runs fast: 0.072 CPU sec per textline on average
(on a 650 MHz SunBlade); not optimized for speed
Training & Test data• Collected 264 images of textlines:
– from scanned math books– also, synthesized using LaTeX
• Training set:– 132 textlines: 68 math, 64 text– 7273 nodes total: 2273 math, 5000 text– 9358 edges total: 3827 math, 5531 text
• Test set:– 132 textlines: 68 math, 64 text– 7072 nodes total: 2269 math, 4803 text– 9322 edges total: 4005 math, 5317 text
(A small, preliminary trial….)
Examples of Correctly Classified Textlines
ResultsExperiment performed on synthetically-generated images and
scanned books
Classified as:True class
Math Text
Math 67 1Text 0 64
Confusion Matrix
Data Set:True class
Training Testing
Math 0.029 0.015Text 0.000 0.000
Overall 0.015 0.008
Error RatesExamples of misclassified textlines:
Summary
– Analysis of spatial arrangements (without symbol recognition) handles many cases
– Automatically trainable– Needs no prior knowledge of font, font size, or
spacing– Far less effort to train spatial classifiers than to
build a recognizer for math symbols in all typefaces, sizes, etc
– Possibly easily extendable to (trainable on) other languages than English
Future Work
– Locate inline math– Mop up failure cases by adding a few more
simple spatial features– Speed up (if desirable) by pruning features
AcknowledgementsKoichi Kise, Osaka Prefecture University for generously contributed advice and code---------------------------------------------------------------
Derek Drake who, after all, did all the work
who by rights should be giving this talk
… but he’s starting the CS PhD program
at Purdue Univ. this week