
Page 1

Versatile Document Image Content Extraction

Henry S. Baird, Michael A. Moll, Jean Nonnemaker, Matthew R. Casey, Don L. Delorenzo

Pattern Recognition Research Lab

Page 2

Document Image Content Extraction Problem

Given an image of a document
• Find regions containing handwriting, machine-print text, graphics, line art, logos, photographs, noise, etc.

Page 3

Difficulties

Vast diversity of document types

Arduous data collection

• How big is a representative training set?

• Expense of preparing correctly labeled “ground-truthed” samples

Lack of consensus on how to evaluate performance

Page 4

Our Research Goals

Versatility First

• Beware “brittle” or narrow approaches
• Develop methods that work across the broadest possible spectrum of document and image types

Voracious Classifiers
• Belief that the accuracy of a classifier has more to do with training data than with other considerations

• Want to train on extremely large (and representative) data sets (in reasonable amounts of time)

Extremely High-Speed Classification
• Ideally, perform nearly at I/O rates (as fast as images can be read)

Too ambitious?

Page 5

Related Strategies (for the future)

Amplification
• Real ground-truthed training samples are hard to find and expensive to generate, and it is difficult to ensure coverage

• Want to use real samples as ‘seeds’ for massive synthetic generation of pseudo-randomly perturbed samples for use in supplementary training

Confidence Before Accuracy
• Confidence may be more important than accuracy, since even modest accuracy (across all cases) can be useful

Near-Infinite Space
• Design for best performance in the near future, when main memory will be orders of magnitude larger and faster

Data-Driven Design
• Avoid arbitrary engineering decisions such as the choice of features, instead allowing the training data to determine them

Page 6

Document Images

Range of document and image types:

• Color, grey-level, black and white
• Any size or resolution
• Many file formats (TIFF, JPEG, PNG, etc.)

Pre-processing step converts every image into a three-channel color PNG file in the HSL (Hue, Saturation, Luminance) color space (see the sketch below)
• Bi-level and grey images map primarily into the Luminance component
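For illustration, a minimal Python sketch of this kind of normalization step, using Pillow and colorsys (assumed tooling, not the authors' actual pipeline): every input is converted to RGB and then stored with its H, S, and L values in the three PNG channels, so bitonal and greyscale inputs are carried almost entirely by the Luminance channel.

    # Minimal sketch (Pillow + colorsys assumed; not the authors' pipeline):
    # normalize any input image to a three-channel HSL PNG so that bitonal
    # and greyscale inputs land mostly in the Luminance channel.
    import colorsys
    from PIL import Image

    def to_hsl_png(in_path: str, out_path: str) -> None:
        img = Image.open(in_path).convert("RGB")  # handles bitonal, grey, and color inputs
        src = img.load()
        out = Image.new("RGB", img.size)          # the three PNG channels will hold H, S, L
        dst = out.load()
        for y in range(img.height):
            for x in range(img.width):
                r, g, b = (v / 255.0 for v in src[x, y])
                h, l, s = colorsys.rgb_to_hls(r, g, b)  # note colorsys returns (H, L, S)
                dst[x, y] = (int(h * 255), int(s * 255), int(l * 255))
        out.save(out_path, format="PNG")

The per-pixel loop is only meant to make the channel mapping explicit; a vectorized version would be preferable for large collections.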

Page 7

Document Image Content Types

Now: Handwriting, Machine Print, Line Art, Photos, Junk/Noise, Blank

Soon: Maps, Mathematical Equations, Engineering Drawings, Chemical Diagrams

Gathering a large collection of electronic images: 7,123 page images so far

We attempt to collect samples for each content type in black and white, greyscale, and color
• Avoid arbitrary image-processing decisions of our own

Carefully “zoned” images are a rare commodity
• Our software accepts existing ground truth in the form of rectangular zones

• We have developed a ground-truthing tool for images that are not zoned

Page 8

Coverage of Image and Content Types

      Bitonal             Greyscale         Color
HW    Amem, UNLV, NIST    Amem, UNLV, LU    Amem, NYPL
MP    UW, UNLV, NIST      UW, UNLV, LU      Amem, NYPL
LA    LU, UNLV            LU, UNLV          Amem, NYPL
PH    UNLV                UNLV, LU          Amem, LU
MT    ISI, UW             ADS, LU           (none)
MA    NYPL, Amem          NYPL, Amem        NYPL, Amem
ED    UW                  LU                (none)
CD    UW                  (none)            (none)
JK    All                 All               All
BL    All                 All               All

Page 9

Statistical Framework for Classification

Each training and test sample will be a pixel, not a region
• Want to avoid the arbitrariness and restrictiveness associated with the choice of a limited class of shapes

• Since each image can contain millions of pixels, it is easy to exceed 1 billion training samples

• This policy was suggested by Thomas Breuel

Page 10

Example of Classifying Pixels

[Figure: an input image beside its per-pixel classification output, with regions labeled photo, machine print, and handwriting; roughly 75% of pixels are classified correctly]

Page 11

Features

Simple, local features of each pixel (box averages are sketched below):

• Average luminosity of every pixel in 1x1, 3x3, 9x9, and 27x27 boxes around the given pixel

• Average luminosity of the 20 pixels on either side of the given pixel along horizontal and vertical lines and along lines of slope +/-1 and +/-2

• Also for each box and line, the average change and maximum change from one pixel to a neighbor

The choice of features is merely expedient
• Expect to refine them indefinitely
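As one concrete (assumed, illustrative) reading of the box-average features above, the following Python/numpy sketch computes the mean luminance in square windows centered on every pixel, using a summed-area table so that each window size costs O(1) per pixel:

    # Minimal sketch (numpy assumed; illustrative, not the authors' feature code):
    # mean luminance in square windows centered on each pixel, computed with a
    # summed-area table so every window size costs O(1) per pixel.
    import numpy as np

    def box_mean_features(lum, sizes=(1, 3, 9, 27)):
        """lum: 2-D array of per-pixel luminance; returns an (H, W, len(sizes)) array."""
        h, w = lum.shape
        sat = np.zeros((h + 1, w + 1))                    # summed-area table with zero border
        sat[1:, 1:] = np.cumsum(np.cumsum(lum, axis=0), axis=1)
        feats = np.empty((h, w, len(sizes)))
        ys = np.arange(h)[:, None]
        xs = np.arange(w)[None, :]
        for k, s in enumerate(sizes):
            r = s // 2
            y0, y1 = np.clip(ys - r, 0, h), np.clip(ys + r + 1, 0, h)
            x0, x1 = np.clip(xs - r, 0, w), np.clip(xs + r + 1, 0, w)
            box_sum = sat[y1, x1] - sat[y0, x1] - sat[y1, x0] + sat[y0, x0]
            feats[:, :, k] = box_sum / ((y1 - y0) * (x1 - x0))  # boxes are clipped at borders
        return feats

The directional line averages and the change statistics listed above would be computed analogously along the stated orientations.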

Page 12

A Nearly Ideal Classifier (For Our Purposes)

k-Nearest Neighbor Algorithm
• Asymptotic error rate is no worse than twice the Bayes error

• Generalizes directly to more than two classes (more readily than, e.g., SVMs)

• Competitive in accuracy with more recently developed methodologies

We have implemented a straightforward exhaustive 5-NN search (sketched below)

We are aware of many techniques (editing, tree search, etc.) for speeding up kNN that work well in practice, but they do not appear to hold up across the broadest range of cases

We hope to exploit highly non-uniform data distributions
• Explore using hashing techniques with geometric tree search

• Intrinsic dimensionality of data seems low
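A minimal sketch of an exhaustive k-NN classifier of the kind described above (numpy assumed; the function name and tie handling are illustrative, not the authors' implementation):

    # Minimal sketch (numpy assumed; illustrative): exhaustive k-NN by scanning
    # every training sample for every test sample, then majority vote among the k.
    import numpy as np

    def knn_classify(train_X, train_y, test_X, k=5):
        """train_X: (n, d), train_y: (n,), test_X: (m, d); returns (m,) predicted labels."""
        preds = np.empty(len(test_X), dtype=train_y.dtype)
        for i, x in enumerate(test_X):
            d2 = np.sum((train_X - x) ** 2, axis=1)   # squared Euclidean distance to all n points
            nearest = np.argpartition(d2, k)[:k]      # indices of the k closest training samples
            labels, counts = np.unique(train_y[nearest], return_counts=True)
            preds[i] = labels[np.argmax(counts)]      # majority vote (ties broken arbitrarily)
        return preds

Each test sample costs n distance computations, which is what motivates the approximate methods on the following slides.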

Page 13

Adaptive k-d Trees

Recursively partition the set of points in stages

• At each stage, divide one partition into two sub-partitions

• Assume it is possible to choose cuts that achieve balance
• This ensures find operations run in O(log n) worst-case time

• Final partitions are generally hyperrectangles
• This approximates kNN under the infinity norm

The pruning power of k-d trees (Bentley) speeds up range searches
• Given a search point (a test sample), it is fast to find the enclosing hyperrectangle

• Adaptive k-d trees guarantee shallow trees

Page 14

We Use Non-adaptive k-d Trees

Constructed in a manner similar to adaptive k-d trees, except that the distribution of the data is ignored when generating cuts
• If upper and lower bounds are known for each feature, cuts can be placed at the midpoints of these bounds

• There is no balance guarantee, and the time- and space-optimality properties of adaptive k-d trees are lost

• However, the values of the cut thresholds can be predicted in advance, so the total number of cuts, r, is known
• The hyperrectangle any sample lies in can be computed in O(r) time (see the sketch below)
  o Computing the cuts is so fast that its cost is negligible
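A minimal Python sketch (assumed, illustrative) of the idea: with known per-feature bounds, the midpoint cut thresholds are fixed in advance, so locating a sample's cell is just r comparisons.

    # Minimal sketch (illustrative): with known per-feature bounds, midpoint cuts
    # are predetermined, so locating a sample's hyperrectangle costs O(r) comparisons.
    def cell_bits(x, lo, hi, cuts_per_dim):
        """x, lo, hi: per-feature value and bounds; cuts_per_dim: number of cuts per feature.
        Returns one list of 0/1 cut outcomes per feature."""
        bits = []
        for v, a, b, r in zip(x, lo, hi, cuts_per_dim):
            dim_bits = []
            for _ in range(r):
                mid = (a + b) / 2.0        # predetermined midpoint cut at this level
                if v >= mid:
                    dim_bits.append(1)
                    a = mid                # descend into the upper half
                else:
                    dim_bits.append(0)
                    b = mid                # descend into the lower half
            bits.append(dim_bits)
        return bits

For example, cell_bits([0.7], [0.0], [1.0], [3]) returns [[1, 0, 1]]: 0.7 is above 0.5, below 0.75, and above 0.625.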

Page 15

Bit-Interleaving Addresses

Partitions can be addressed using bit-interleaving (see the sketch below)

[Figure: worked example of bit-interleaving the 0/1 outcomes of six cuts, yielding the partition address 100111]
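A minimal sketch (assumed, illustrative) of the interleaving itself: per-dimension cut bits are merged one cut level at a time into a single integer address. The two-dimension example in the trailing comment is my own and happens to produce a six-bit address of the same form as the 100111 shown above.

    # Minimal sketch (illustrative): interleave per-dimension cut bits, one cut
    # level at a time, into a single integer partition address.
    def interleave(bits_per_dim):
        """bits_per_dim: one equal-length list of 0/1 cut outcomes per feature."""
        address = 0
        for level in range(len(bits_per_dim[0])):   # take the same cut level from every dimension
            for dim_bits in bits_per_dim:
                address = (address << 1) | dim_bits[level]
        return address

    # e.g. two dimensions with three cuts each:
    # interleave([[1, 0, 1], [0, 1, 1]]) == 0b100111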

Page 16

Assumption of Bit-Interleaving

Since, in this context, we expect our data to be non-uniformly distributed in feature space, only a small fraction of the partitions should contain any training data

Therefore very few distinct bit-interleaved addresses should occur, making it possible to store them in a dictionary data structure (see the sketch below)

Experiments show that the number of occupied partitions, as a function of bit-interleaved address length, is asymptotically cubic (far better than exponential!)
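A minimal sketch (assumed, illustrative; simplified to a majority vote over a cell's samples rather than an exact 5-NN within it) of storing only the occupied partitions in a dictionary keyed by bit-interleaved address:

    # Minimal sketch (illustrative): only occupied partitions get dictionary
    # entries, and a test sample is classified against its own cell's contents.
    from collections import Counter, defaultdict

    def build_index(addresses, labels):
        """addresses: bit-interleaved addresses of training pixels; labels: their content types."""
        index = defaultdict(list)
        for addr, label in zip(addresses, labels):
            index[addr].append(label)           # unoccupied cells never appear in the dict
        return index

    def classify(index, addr, default="blank"):
        """addr: bit-interleaved address of a test pixel; default is an assumed fallback label."""
        cell = index.get(addr)
        if not cell:                            # empty cell: no neighbors available to vote
            return default
        return Counter(cell).most_common(1)[0][0]   # majority label within the cell

Storing the feature vectors alongside the labels would allow a true 5-NN vote restricted to the cell, which is closer to the hashed search whose accuracy is reported on the results slide.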

Page 17

Cubic Growth of Populated Bins

Page 18

Initial Results and Analysis

n = 844,525 training points, d = 15 features, tested on 192,405 points
• Brute-force 5-NN classified 70% correctly but required 163 billion distance calculations (844,525 × 192,405 ≈ 1.6 × 10^11)

• Hashing bit-interleaved addresses of length r = 32 classified 66% correctly, with a speedup of 148 times

Of course, this only approximates kNN
• A test sample may hash into a cell that does not contain its true k nearest neighbors

Page 19

Approximate, but Quite Good

[Figure: an input image and the approximate classifier's output, with regions labeled photo, machine print, and handwriting; roughly 75% of pixels are classified correctly]

Page 20

Speed Tradeoff

Page 21

Future Work

Much larger-scale experiments

• Wider range of content types

More and better features

Explore refined approximate kNN methods

• How does the BIA (bit-interleaved address) approach behave on vastly larger data sets?

• Paging hash tables that grow too large
• Exploit style consistency (isogeny)

Compare to CARTs, Locality-Sensitive Hashing, etc.

Page 22

Thank You!

Henry S. Baird
Michael A. Moll