Henry S. Baird & Daniel Lopresti, Pattern Recognition Research Lab
Versatile Document Image Content Extraction
Henry S. Baird, Michael A. Moll, Jean Nonnemaker, Matthew R. Casey, Don L. Delorenzo
Document Image Content Extraction Problem
Given an image of a document
Find regions containing handwriting, machine-print text, graphics, line art, logos, photographs, noise, etc.
Difficulties
Vast diversity of document types
Arduous data collection
• How big is a representative training set?
• Expense of preparing correctly labeled “ground-truthed” samples
Lack of consensus on how to evaluate performance
Our Research Goals
Versatility First
• Beware “brittle” or narrow approaches
• Develop methods that work across the broadest possible spectrum of document and image types
Voracious Classifiers
• Belief that the accuracy of a classifier has more to do with its training data than with any other consideration
• Want to train on extremely large (and representative) data sets (in reasonable amounts of time)
Extremely High-Speed Classification
• Ideally, perform nearly at I/O rates (as fast as images can be read)
Too ambitious?
Related Strategies (for the future)
Amplification
• Real ground-truthed training samples are hard to find, expensive to generate, and difficult to ensure coverage with
• Want to use real samples as ‘seeds’ for massive synthetic generation of pseudo-randomly perturbed samples for use in supplementary training
Confidence Before Accuracy
• Confidence is perhaps more important than accuracy, since even modest accuracy (across all cases) can be useful
Near-Infinite Space
• Design for best performance in the near future, when main memory will be orders of magnitude larger and faster
Data-Driven Design
• Avoid arbitrary engineering decisions, such as the choice of features, instead allowing the training data to determine them
Document Images
Range of document and image types
• Color, grey-level, black and white
• Any size or resolution
• Many file formats (TIFF, JPEG, PNG, etc)
Pre-processing step: convert images into three-channel PNG files in the HSL (Hue, Saturation, Luminance) color space
• Bi-level and grey images will map primarily into the Luminance component
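This per-pixel conversion can be sketched with Python's standard `colorsys` module (a minimal illustration; the slides do not specify the actual pipeline, file I/O, or function names, which are ours):

```python
import colorsys

def rgb_to_hsl(r, g, b):
    """Convert one RGB pixel (components in [0, 1]) to (hue, saturation, luminance)."""
    # colorsys returns components in the order (h, l, s); reorder to (h, s, l)
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    return h, s, l

# A pure grey pixel carries no hue or saturation: all of its information
# ends up in the luminance channel, as the slides note.
h, s, l = rgb_to_hsl(0.5, 0.5, 0.5)
print(h, s, l)  # → 0.0 0.0 0.5
```

Applied to every pixel, this maps bi-level and greyscale inputs onto the luminance component while leaving hue and saturation at zero.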
Document Image Content Types
Now: Handwriting, Machine Print, Line Art, Photos, Junk/Noise, Blank
Soon: Maps, Mathematical Equations, Engineering Drawings, Chemical Diagrams
Gathering a large collection of electronic images: 7,123 page images so far
We attempt to collect samples of each content type in black and white, greyscale, and color
• Avoid arbitrary image-processing decisions of our own
Carefully “zoned” images are a rare commodity
• Our software accepts existing ground truth in the form of rectangular zones
• We have developed a ground-truthing tool for images that are not zoned
Coverage of Image and Content Types
Type  Bitonal           Greyscale        Color
HW    Amem, UNLV, NIST  Amem, UNLV, LU   Amem, NYPL
MP    UW, UNLV, NIST    UW, UNLV, LU     Amem, NYPL
LA    LU, UNLV          LU, UNLV         Amem, NYPL
PH    UNLV              UNLV, LU         Amem, LU
MT    ISI, UW           ADS, LU          (none)
MA    NYPL, Amem        NYPL, Amem       NYPL, Amem
ED    UW                LU               (none)
CD    UW                (none)           (none)
JK    All               All              All
BL    All               All              All
Statistical Framework for Classification
Each training & test sample will be a pixel, not a region
• Want to avoid the arbitrariness and restrictiveness associated with choosing a limited class of region shapes
• Since each image can contain millions of pixels, it is easy to exceed 1 billion training samples
• This policy was suggested by Thomas Breuel
Example of Classifying Pixels
[Figure: input image alongside the per-pixel classification output, with photo, machine-print, and handwriting regions marked; ~75% of pixels classified correctly]
Features
Simple, local features of each pixel
• Average luminosity of every pixel in 1x1, 3x3, 9x9, and 27x27 boxes around the given pixel
• Average luminosity of the 20 pixels on either side of the given pixel along horizontal and vertical lines and lines of slope +/- 1 and 2
• Also, for each box and line, the average change and maximum change in luminosity from one pixel to a neighbor
Choice of features is merely expedient
• Expect to refine indefinitely
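The box features above can be sketched in a few lines of plain Python (an illustration with our own names and a simple border policy, since the slides do not say how image borders are handled):

```python
def box_mean(img, x, y, k):
    """Mean luminance over the k x k box centred on pixel (x, y).

    `img` is a 2-D list of luminance values; pixels falling outside the
    image are simply ignored (one possible border policy -- an assumption).
    """
    h, w = len(img), len(img[0])
    half = k // 2
    vals = [img[j][i]
            for j in range(max(0, y - half), min(h, y + half + 1))
            for i in range(max(0, x - half), min(w, x + half + 1))]
    return sum(vals) / len(vals)

def pixel_features(img, x, y):
    # Average luminosity in 1x1, 3x3, 9x9 and 27x27 boxes around the pixel.
    return [box_mean(img, x, y, k) for k in (1, 3, 9, 27)]

img = [[0.5] * 30 for _ in range(30)]   # a uniform toy image
print(pixel_features(img, 15, 15))      # → [0.5, 0.5, 0.5, 0.5]
```

The line and gradient features would follow the same pattern, sampling along fixed directions instead of boxes.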
A Nearly Ideal Classifier (For Our Purposes)
K Nearest Neighbor (kNN) Algorithm
• Asymptotic error rate is no worse than twice the Bayes error
• Generalizes directly to more than two classes (more readily than, e.g., SVMs)
• Competitive in accuracy with more recently developed methodologies
We have implemented a straightforward exhaustive 5-NN search
We are aware of many techniques (editing, tree search, etc) for speeding up kNN that work well in practice but do not appear to hold up in the broadest range of cases
We hope to exploit highly non-uniform data distributions
• Explore using hashing techniques with geometric tree search
• Intrinsic dimensionality of the data seems low
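An exhaustive k-NN search of the kind described above fits in a few lines (a minimal sketch with our own names and data layout; the slides do not show the implementation):

```python
from collections import Counter

def knn_classify(train, query, k=5):
    """Exhaustive k-NN: scan every training sample, keep the k nearest
    by squared Euclidean distance, and return the majority label.

    `train` is a list of (feature_vector, label) pairs.
    """
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(train, key=lambda s: sqdist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], "blank")] * 3 + [([1.0, 1.0], "print")] * 3
print(knn_classify(train, [0.1, 0.1], k=5))  # → blank
```

The scan is O(n) per query, which is why the later slides pursue tree and hashing speedups.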
Adaptive k-d Trees
Recursively partition a set of points in stages
• At each stage, divide one partition into two sub-partitions
• Assume it is possible to choose cuts to achieve balance
• Ensures find operations run in O(log n) time at worst
• Final partitions are generally hyperrectangles
• This approximates kNN under the infinity norm
Pruning power of k-d trees (Bentley) speeds up range searches
• Given a search point (a test sample), it is fast to find the enclosing hyperrectangle
• Adaptive k-d trees guarantee shallow trees
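The median-cut construction can be sketched as follows (an illustration with our own data layout; the slides do not show an implementation):

```python
def build_adaptive_kd(points, depth=0):
    """Adaptive k-d tree: split each partition at the median of one
    dimension, cycling through dimensions by depth.  Median cuts balance
    the two sides, which is what keeps the tree O(log n) deep."""
    if len(points) <= 1:
        return {"leaf": points}
    dim = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    return {
        "dim": dim,
        "cut": pts[mid][dim],  # median cut => balanced halves
        "left": build_adaptive_kd(pts[:mid], depth + 1),
        "right": build_adaptive_kd(pts[mid:], depth + 1),
    }

def find_cell(tree, x):
    """Walk from the root to the enclosing leaf hyperrectangle:
    one comparison per level, O(log n) for a balanced tree."""
    while "leaf" not in tree:
        tree = tree["right"] if x[tree["dim"]] >= tree["cut"] else tree["left"]
    return tree["leaf"]
```

For example, `find_cell(build_adaptive_kd([(1, 2), (3, 4), (5, 6), (7, 8)]), (5.1, 6.1))` walks two levels and returns the leaf containing `(5, 6)`.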
We Use Non-adaptive k-d Trees
Constructed in a manner similar to adaptive k-d trees, except that the distribution of the data is ignored in generating cuts
• If upper and lower bounds are known for each feature, cuts can be placed at the midpoints of these bounds
• No balance guarantee: the time- and space-optimality properties of adaptive k-d trees are lost
• However, the values of the cut thresholds can be predicted, and as a result the total number of cuts, r, is known
• The hyperrectangle any sample lies in can be computed in O(r) time
  o Computing the cuts is so fast it is negligible
Bit-Interleaving Addresses
Partitions can be addressed using bit-interleaving
[Figure: a 2-D feature space cut recursively into numbered partitions (1–6); recording the 0/1 outcome of each successive cut yields a partition's bit-interleaved address, e.g. 100111]
Assumption of Bit-Interleaving
Since we expect our data to be non-uniformly distributed in feature space, only a small fraction of partitions should contain any training data
Therefore very few distinct bit-interleaved addresses should occur, making it possible to store them in a dictionary data structure
Experiments show that the number of occupied partitions, as a function of bit-interleaved address length, grows asymptotically cubically (far better than exponentially!)
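A minimal self-contained sketch of the cell dictionary (the address computation here cycles midpoint cuts through the dimensions; all names and the data layout are our own assumptions):

```python
from collections import Counter

def address(x, bounds, r):
    """Bit-interleaved cell address: r midpoint cuts, cycling through the
    feature dimensions, each bit recording which side of the cut x lies on."""
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    bits = []
    for i in range(r):
        d = i % len(bounds)
        mid = (lo[d] + hi[d]) / 2.0
        if x[d] >= mid:
            bits.append("1"); lo[d] = mid
        else:
            bits.append("0"); hi[d] = mid
    return "".join(bits)

def build_index(train, bounds, r):
    """Dictionary keyed by cell address; only occupied cells ever appear
    as keys, so the structure stays small for non-uniform data."""
    index = {}
    for x, label in train:
        index.setdefault(address(x, bounds, r), []).append(label)
    return index

def approx_label(index, query, bounds, r):
    """Approximate classification: majority label within the query's cell;
    None when the cell holds no training data (the approximation's cost)."""
    labels = index.get(address(query, bounds, r))
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]
```

A query point that falls into an unoccupied cell gets no answer here, which is one face of the approximation error discussed on the next slide.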
Cubic Growth of Populated Bins
Initial Results and Analysis
n = 844,525 training points, d = 15 features, tested on 192,405 points
• Brute-force 5NN classified 70% correctly but required 163 billion distance calculations
• Hashing bit-interleaved addresses of length r = 32 classified 66% correctly, with a speedup of 148 times
Of course, this only approximates kNN
• A test sample can hash into a cell that does not contain its true k nearest neighbors
Approximate, but Quite Good
[Figure: input image and per-pixel classification output (photo, machine-print, and handwriting regions); ~75% of pixels classified correctly]
Speed Tradeoff
Future Work
Much larger-scale experiments
• Wider range of content types
More and better features
Explore refined approximate kNN methods
• How does the bit-interleaved address approach behave on vastly larger data sets?
• Paging hash tables that grow too large
• Exploit style consistency (isogeny)
Compare to CARTs, Locality-Sensitive Hashing, etc
Thank You!
Henry S. Baird, Michael A. Moll