
Page 1

Versatile Document Image Content Extraction

Henry S. Baird, Michael A. Moll, Jean Nonnemaker, Matthew R. Casey, Don L. Delorenzo

Pattern Recognition Research Lab

Page 2

Document Image Content Extraction Problem

Given an image of a document
• Find regions containing handwriting, machine-print text, graphics, line art, logos, photographs, noise, etc.

Page 3

Difficulties

Vast diversity of document types

Arduous data collection

• How big is a representative training set?

• Expense of preparing correctly labeled “ground-truthed” samples

Lack of consensus on how to evaluate performance

Page 4

Our Research Goals

Versatility First

• Beware “brittle” or narrow approaches
• Develop methods that work across the broadest possible spectrum of document and image types

Voracious Classifiers
• Belief that the accuracy of a classifier has more to do with training data than with other considerations

• Want to train on extremely large (and representative) data sets (in reasonable amounts of time)

Extremely High-Speed Classification
• Ideally, perform nearly at I/O rates (as fast as images can be read)

Too ambitious?

Page 5

Related Strategies (for the future)

Amplification
• Real ground-truthed training samples are hard to find and expensive to generate, and it is difficult to ensure coverage

• Want to use real samples as ‘seeds’ for massive synthetic generation of pseudo-randomly perturbed samples for use in supplementary training

Confidence Before Accuracy
• Confidence may be more important than accuracy, since even modest accuracy (across all cases) can be useful

Near-Infinite Space
• Design for best performance in the near future, when main memory will be orders of magnitude larger and faster

Data-Driven Design
• Avoid arbitrary engineering decisions such as the choice of features, instead allowing the training data to determine them

Page 6

Document Images

Range of document and image types:

• Color, grey-level, black and white
• Any size or resolution
• Many file formats (TIFF, JPEG, PNG, etc.)

Pre-processing step converts every image into a three-channel color PNG file in the HSL (Hue, Saturation, Luminance) color space (see the sketch below)
• Bi-level and grey images map primarily into the Luminance component
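For illustration, a minimal Python sketch of this kind of normalization step, using Pillow and colorsys (assumed tooling, not the authors' actual pipeline): every input is converted to RGB and then stored with its H, S, and L values in the three PNG channels, so bitonal and greyscale inputs are carried almost entirely by the Luminance channel.

    # Minimal sketch (Pillow + colorsys assumed; not the authors' pipeline):
    # normalize any input image to a three-channel HSL PNG so that bitonal
    # and greyscale inputs land mostly in the Luminance channel.
    import colorsys
    from PIL import Image

    def to_hsl_png(in_path: str, out_path: str) -> None:
        img = Image.open(in_path).convert("RGB")  # handles bitonal, grey, and color inputs
        src = img.load()
        out = Image.new("RGB", img.size)          # the three PNG channels will hold H, S, L
        dst = out.load()
        for y in range(img.height):
            for x in range(img.width):
                r, g, b = (v / 255.0 for v in src[x, y])
                h, l, s = colorsys.rgb_to_hls(r, g, b)  # note colorsys returns (H, L, S)
                dst[x, y] = (int(h * 255), int(s * 255), int(l * 255))
        out.save(out_path, format="PNG")

The per-pixel loop is only meant to make the channel mapping explicit; a vectorized version would be preferable for large collections.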

Page 7

Document Image Content Types

Now: Handwriting, Machine Print, Line Art, Photos, Junk/Noise, Blank

Soon: Maps, Mathematical Equations, Engineering Drawings, Chemical Diagrams

Gathering a large collection of electronic images: 7,123 page images so far

We attempt to collect samples for each content type in black and white, greyscale, and color
• Avoid arbitrary image-processing decisions of our own

Carefully “zoned” images are a rare commodity
• Our software accepts existing ground truth in the form of rectangular zones

• We have developed a ground-truthing tool for images that are not zoned

Page 8

Coverage of Image and Content Types

      Bitonal             Greyscale         Color
HW    Amem, UNLV, NIST    Amem, UNLV, LU    Amem, NYPL
MP    UW, UNLV, NIST      UW, UNLV, LU      Amem, NYPL
LA    LU, UNLV            LU, UNLV          Amem, NYPL
PH    UNLV                UNLV, LU          Amem, LU
MT    ISI, UW             ADS, LU           (none)
MA    NYPL, Amem          NYPL, Amem        NYPL, Amem
ED    UW                  LU                (none)
CD    UW                  (none)            (none)
JK    All                 All               All
BL    All                 All               All

Page 9

Statistical Framework for Classification

Each training and test sample will be a pixel, not a region
• Want to avoid the arbitrariness and restrictiveness associated with the choice of a limited class of shapes

• Since each image can contain millions of pixels, it is easy to exceed 1 billion training samples

• This policy was suggested by Thomas Breuel

Page 10

Example of Classifying Pixels

[Figure: an input image beside its per-pixel classification output, with regions labeled photo, machine print, and handwriting; roughly 75% of pixels are classified correctly]

Page 11

Features

Simple, local features of each pixel (box averages are sketched below):

• Average luminosity of every pixel in 1x1, 3x3, 9x9, and 27x27 boxes around the given pixel

• Average luminosity of the 20 pixels on either side of the given pixel along horizontal and vertical lines and along lines of slope +/-1 and +/-2

• Also for each box and line, the average change and maximum change from one pixel to a neighbor

The choice of features is merely expedient
• Expect to refine them indefinitely
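As one concrete (assumed, illustrative) reading of the box-average features above, the following Python/numpy sketch computes the mean luminance in square windows centered on every pixel, using a summed-area table so that each window size costs O(1) per pixel:

    # Minimal sketch (numpy assumed; illustrative, not the authors' feature code):
    # mean luminance in square windows centered on each pixel, computed with a
    # summed-area table so every window size costs O(1) per pixel.
    import numpy as np

    def box_mean_features(lum, sizes=(1, 3, 9, 27)):
        """lum: 2-D array of per-pixel luminance; returns an (H, W, len(sizes)) array."""
        h, w = lum.shape
        sat = np.zeros((h + 1, w + 1))                    # summed-area table with zero border
        sat[1:, 1:] = np.cumsum(np.cumsum(lum, axis=0), axis=1)
        feats = np.empty((h, w, len(sizes)))
        ys = np.arange(h)[:, None]
        xs = np.arange(w)[None, :]
        for k, s in enumerate(sizes):
            r = s // 2
            y0, y1 = np.clip(ys - r, 0, h), np.clip(ys + r + 1, 0, h)
            x0, x1 = np.clip(xs - r, 0, w), np.clip(xs + r + 1, 0, w)
            box_sum = sat[y1, x1] - sat[y0, x1] - sat[y1, x0] + sat[y0, x0]
            feats[:, :, k] = box_sum / ((y1 - y0) * (x1 - x0))  # boxes are clipped at borders
        return feats

The directional line averages and the change statistics listed above would be computed analogously along the stated orientations.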

Page 12

A Nearly Ideal Classifier (For Our Purposes)

k-Nearest Neighbor Algorithm
• Asymptotic error rate is no worse than twice the Bayes error

• Generalizes directly to more than two classes (more readily than, e.g., SVMs)

• Competitive in accuracy with more recently developed methodologies

We have implemented a straightforward exhaustive 5-NN search (sketched below)

We are aware of many techniques (editing, tree search, etc.) for speeding up kNN that work well in practice, but they do not appear to hold up across the broadest range of cases

We hope to exploit highly non-uniform data distributions
• Explore using hashing techniques with geometric tree search

• Intrinsic dimensionality of data seems low
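A minimal sketch of an exhaustive k-NN classifier of the kind described above (numpy assumed; the function name and tie handling are illustrative, not the authors' implementation):

    # Minimal sketch (numpy assumed; illustrative): exhaustive k-NN by scanning
    # every training sample for every test sample, then majority vote among the k.
    import numpy as np

    def knn_classify(train_X, train_y, test_X, k=5):
        """train_X: (n, d), train_y: (n,), test_X: (m, d); returns (m,) predicted labels."""
        preds = np.empty(len(test_X), dtype=train_y.dtype)
        for i, x in enumerate(test_X):
            d2 = np.sum((train_X - x) ** 2, axis=1)   # squared Euclidean distance to all n points
            nearest = np.argpartition(d2, k)[:k]      # indices of the k closest training samples
            labels, counts = np.unique(train_y[nearest], return_counts=True)
            preds[i] = labels[np.argmax(counts)]      # majority vote (ties broken arbitrarily)
        return preds

Each test sample costs n distance computations, which is what motivates the approximate methods on the following slides.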

Page 13

Adaptive k-d Trees

Recursively partition the set of points in stages

• At each stage, divide one partition into two sub-partitions

• Assume it is possible to choose cuts that achieve balance
• This ensures find operations run in O(log n) worst-case time

• Final partitions are generally hyperrectangles
• This approximates kNN under the infinity norm

The pruning power of k-d trees (Bentley) speeds up range searches
• Given a search point (a test sample), it is fast to find the enclosing hyperrectangle

• Adaptive k-d trees guarantee shallow trees

Page 14

We Use Non-adaptive k-d Trees

Constructed in a manner similar to adaptive k-d trees, except that the distribution of the data is ignored when generating cuts
• If upper and lower bounds are known for each feature, cuts can be placed at the midpoints of these bounds

• There is no balance guarantee, and the time- and space-optimality properties of adaptive k-d trees are lost

• However, the values of the cut thresholds can be predicted in advance, so the total number of cuts, r, is known
• The hyperrectangle any sample lies in can be computed in O(r) time (see the sketch below)
  o Computing the cuts is so fast that its cost is negligible
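A minimal Python sketch (assumed, illustrative) of the idea: with known per-feature bounds, the midpoint cut thresholds are fixed in advance, so locating a sample's cell is just r comparisons.

    # Minimal sketch (illustrative): with known per-feature bounds, midpoint cuts
    # are predetermined, so locating a sample's hyperrectangle costs O(r) comparisons.
    def cell_bits(x, lo, hi, cuts_per_dim):
        """x, lo, hi: per-feature value and bounds; cuts_per_dim: number of cuts per feature.
        Returns one list of 0/1 cut outcomes per feature."""
        bits = []
        for v, a, b, r in zip(x, lo, hi, cuts_per_dim):
            dim_bits = []
            for _ in range(r):
                mid = (a + b) / 2.0        # predetermined midpoint cut at this level
                if v >= mid:
                    dim_bits.append(1)
                    a = mid                # descend into the upper half
                else:
                    dim_bits.append(0)
                    b = mid                # descend into the lower half
            bits.append(dim_bits)
        return bits

For example, cell_bits([0.7], [0.0], [1.0], [3]) returns [[1, 0, 1]]: 0.7 is above 0.5, below 0.75, and above 0.625.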

Page 15

Bit-Interleaving Addresses

Partitions can be addressed using bit-interleaving (see the sketch below)

[Figure: worked example of bit-interleaving the 0/1 outcomes of six cuts, yielding the partition address 100111]
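A minimal sketch (assumed, illustrative) of the interleaving itself: per-dimension cut bits are merged one cut level at a time into a single integer address. The two-dimension example in the trailing comment is my own and happens to produce a six-bit address of the same form as the 100111 shown above.

    # Minimal sketch (illustrative): interleave per-dimension cut bits, one cut
    # level at a time, into a single integer partition address.
    def interleave(bits_per_dim):
        """bits_per_dim: one equal-length list of 0/1 cut outcomes per feature."""
        address = 0
        for level in range(len(bits_per_dim[0])):   # take the same cut level from every dimension
            for dim_bits in bits_per_dim:
                address = (address << 1) | dim_bits[level]
        return address

    # e.g. two dimensions with three cuts each:
    # interleave([[1, 0, 1], [0, 1, 1]]) == 0b100111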

Page 16

Assumption of Bit-Interleaving

Since, in this context, we expect our data to be non-uniformly distributed in feature space, only a small fraction of the partitions should contain any training data

Therefore very few distinct bit-interleaved addresses should occur, making it possible to store them in a dictionary data structure (see the sketch below)

Experiments show that the number of occupied partitions, as a function of bit-interleaved address length, is asymptotically cubic (far better than exponential!)
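A minimal sketch (assumed, illustrative; simplified to a majority vote over a cell's samples rather than an exact 5-NN within it) of storing only the occupied partitions in a dictionary keyed by bit-interleaved address:

    # Minimal sketch (illustrative): only occupied partitions get dictionary
    # entries, and a test sample is classified against its own cell's contents.
    from collections import Counter, defaultdict

    def build_index(addresses, labels):
        """addresses: bit-interleaved addresses of training pixels; labels: their content types."""
        index = defaultdict(list)
        for addr, label in zip(addresses, labels):
            index[addr].append(label)           # unoccupied cells never appear in the dict
        return index

    def classify(index, addr, default="blank"):
        """addr: bit-interleaved address of a test pixel; default is an assumed fallback label."""
        cell = index.get(addr)
        if not cell:                            # empty cell: no neighbors available to vote
            return default
        return Counter(cell).most_common(1)[0][0]   # majority label within the cell

Storing the feature vectors alongside the labels would allow a true 5-NN vote restricted to the cell, which is closer to the hashed search whose accuracy is reported on the results slide.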

Page 17

Cubic Growth of Populated Bins

Page 18

Initial Results and Analysis

n = 844,525 training points, d = 15 features, tested on 192,405 points
• Brute-force 5-NN classified 70% correctly but required 163 billion distance calculations (844,525 × 192,405 ≈ 1.6 × 10^11)

• Hashing bit-interleaved addresses of length r = 32 classified 66% correctly, with a speedup of 148 times

Of course, this only approximates kNN
• A test sample may hash into a cell that does not contain its true k nearest neighbors

Page 19

Approximate, but Quite Good

[Figure: an input image and the approximate classifier's output, with regions labeled photo, machine print, and handwriting; roughly 75% of pixels are classified correctly]

Page 20

Speed Tradeoff

Page 21

Future Work

Much larger-scale experiments

• Wider range of content types

More and better features

Explore refined approximate kNN methods

• How does the BIA (bit-interleaved address) approach behave on vastly larger data sets?

• Paging hash tables that grow too large
• Exploit style consistency (isogeny)

Compare to CARTs, Locality-Sensitive Hashing, etc.

Page 22

Thank You!

Henry S. Baird
Michael A. Moll