Multiple Frame Integration for OCR on
Mobile Devices
Master’s Thesis
Georg Krispel
Advisor: Horst Bischof
December 12, 2016
Institute for Computer Graphics and Vision
Anyline GmbH
Scene Text Recognition on Mobile Devices
Scene Text Recognition Use Cases
Scene Text Recognition on Mobile Devices Cont’d
• An almost orthogonal view is assumed
• A search window is introduced to improve the user experience and to spare the user from searching for the text
• Sophisticated preprocessing steps
• Text recognition
• Possible repetition for validation
Problems
• Low-resolution images from outdated mobile phones
• Reflections and glare
• Poor lighting conditions
Objectives
• Evaluate how these effects can be mitigated to improve overall text recognition results
• Exploit multiple frames available in the camera stream and their
redundant information (Multiple Frame Integration)
• Implement the resulting pipeline on mobile hardware
Scene Text Processing Pipeline
Assumptions
• Text is written on a nearly planar surface
• The surface is well textured
• Sufficiently smooth motion of the camera
Overview
• Detect text in keyframes and track it (i.e. the underlying plane) over time
• Keyframes are selected according to blurriness and the text detection result
• The underlying plane is rectified before text detection
• Multiple threads are used to outsource expensive tasks
• Asynchronous plane rectification and scene text detection
• The dominant plane is tracked to propagate text detection results to the remaining frames
• Reinitialization after a certain time or when tracking degenerates
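The division of labor above can be sketched in a few lines of Python. The function names and the box format below are illustrative placeholders, not the thesis implementation: a worker thread runs the expensive rectification and detection on a keyframe, while the main loop keeps "tracking" every frame and picks up detection results as they arrive.

```python
import queue
import threading

# Hypothetical stand-in for the expensive stage described above; the
# real pipeline runs plane rectification and scene text detection here.
def rectify_and_detect(frame):
    return {"frame": frame, "boxes": [(10, 10, 50, 20)]}

detections = queue.Queue()

def detection_worker(keyframe):
    # TextDetectionThread: rectify the plane, then detect text
    detections.put(rectify_and_detect(keyframe))

frames = list(range(10))
worker = threading.Thread(target=detection_worker, args=(frames[0],))
worker.start()

boxes, tracked = [], []
for f in frames:
    # MainThread: poll for asynchronous detection results ...
    if not detections.empty():
        boxes = detections.get()["boxes"]
    # ... and keep tracking every frame (here: simply propagate boxes)
    tracked.append((f, list(boxes)))

worker.join()
while not detections.empty():   # drain results that arrived late
    boxes = detections.get()["boxes"]
```

The main loop never blocks on detection; it only propagates the most recent detection result, which is the point of the multi-threaded design.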
Initialization Process
[Figure: initialization timeline. The MainThread tracks frames #0 to #10 while the TextDetectionThread runs plane rectification and text detection asynchronously on keyframe #0; once the result arrives, tracking continues together with MFI.]
Pipeline Initialization Process
Initialization Process Cont’d
Pipeline Processing Example
Modules
A modular design allows the individual parts of the pipeline to be exchanged:
• Visual Tracking
• Rectification
• Text Detection
• Multiple Frame Integration
Visual Tracking
• Feature-based
• Good Features to Track and Kanade-Lucas-Tomasi tracking (8, 13, 14)
• AKAZE features (1, 2) and FLANN matching (9)
• Intensity-based refinement for text patches only
• Parametric image alignment using ECC (6)
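The intensity-based refinement can be illustrated with a single translational Lucas-Kanade (Gauss-Newton) update on a synthetic image pair. This is a minimal sketch under simplifying assumptions; the actual pipeline uses pyramidal KLT and ECC alignment, not this toy.

```python
import numpy as np

# One translational Lucas-Kanade step on a smooth synthetic image pair;
# the true shift between I and J is (0.3, 0.2) pixels.
y, x = np.mgrid[0:64, 0:64].astype(float)

def gaussian(cx, cy):
    # smooth test pattern centered at (cx, cy)
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / 50.0)

I = gaussian(32.0, 32.0)
J = gaussian(32.3, 32.2)         # I shifted by (0.3, 0.2)

Iy, Ix = np.gradient(I)          # spatial gradients (central differences)
It = J - I                       # temporal difference

# Normal equations of the linearized error sum((Ix*dx + Iy*dy + It)^2)
A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
              [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
b = np.array([np.sum(Ix * It), np.sum(Iy * It)])
dx, dy = -np.linalg.solve(A, b)  # dx ~ 0.3, dy ~ 0.2
```

A single step already recovers the shift well here because the image is smooth and the displacement is subpixel; real trackers iterate and use image pyramids for larger motion.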
Rectification
• Rectangular region localization and extraction (LocEx) module by Hartl et al. (7)
• M-Estimator Sample Consensus (MSAC) based vanishing point detection by Nieto et al. (12)
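Plane rectification ultimately amounts to estimating a homography that maps the localized quadrilateral onto an axis-aligned rectangle. A minimal DLT (direct linear transform) sketch in NumPy, with made-up corner coordinates for illustration:

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate the 3x3 homography mapping src points to dst (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # h is the null vector of the 8x9 system, found via SVD
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)

def warp_point(H, x, y):
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# a perspectively distorted display region and its rectified target
quad = [(12, 20), (98, 28), (92, 70), (8, 60)]
rect = [(0, 0), (100, 0), (100, 40), (0, 40)]
H = homography_dlt(quad, rect)
```

With exactly four correspondences the solution is exact; warping the image with H yields the fronto-parallel patch that is handed to text detection.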
Scene Text Detection
• TextSpotter (TS) by Neumann et al. (11)
• Based on classification and grouping of Extremal Regions
• Stroke Width Transform (SWT) by Epshtein et al. (5)
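The intuition behind the SWT can be shown with a deliberately simplified toy: measuring horizontal run lengths of foreground pixels. The real transform instead shoots rays along gradient directions from edge pixels, but the idea is the same: text strokes have a nearly constant width.

```python
import numpy as np

def horizontal_stroke_widths(binary):
    """Collect lengths of horizontal foreground runs (toy stroke widths)."""
    widths = []
    for row in binary:
        run = 0
        for v in row:
            if v:
                run += 1
            elif run:
                widths.append(run)
                run = 0
        if run:                 # run reaching the right border
            widths.append(run)
    return widths

img = np.zeros((10, 20), dtype=bool)
img[:, 4:7] = True    # a vertical stroke, 3 px wide
img[:, 12:15] = True  # another stroke of the same width
widths = horizontal_stroke_widths(img)
```

A region whose widths cluster tightly around one value, as here, is stroke-like; background clutter produces widely scattered widths.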
Multiple Frame Integration
[Figure: the two MFI approaches. Image enhancement integrates multiple registered frames and runs text recognition once on the result; recognition result fusion runs text recognition on every frame and fuses the per-frame readings (e.g. 04954, 00954, 84954, 04954, 02954 fused to 04954).]
MFI approaches
Multiple Frame Integration Cont’d
• Image Enhancement
• Minimum Operator
• Integration method by Yi et al. (15)
• Result Fusion
• Voting for most frequent recognition
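Both families of MFI methods can be sketched on toy data: a pixel-wise minimum over registered frames to suppress glare, and per-character majority voting over per-frame readings. All values below are illustrative.

```python
import numpy as np
from collections import Counter

# Image enhancement: the minimum operator suppresses bright glare
# pixels that appear in only some of the registered frames.
frames = np.stack([
    np.full((2, 2), 50),
    np.array([[50, 255], [50, 50]]),   # one glare pixel
    np.full((2, 2), 50),
])
integrated = frames.min(axis=0)        # glare removed

# Result fusion: vote per character position over per-frame readings.
readings = ["04954", "00954", "84954", "04954", "02954"]
fused = "".join(Counter(chars).most_common(1)[0][0]
                for chars in zip(*readings))
```

The voting scheme assumes readings of equal length (i.e. a stable character segmentation); per-frame recognition errors are outvoted as long as each character is read correctly in most frames.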
Impact of MFI Approaches on Overall Recognition Results
Datasets
• We assumed the use case of reading energy meters
• We tailored our pipeline to detect only the respective numbers
• Only bright text on a dark background
• An additional histogram-based verification step
• Constrained bounding box dimensions
Datasets
Exemplary frames of the evaluation datasets showing different types of energy
meters and ground truth annotation
Datasets Cont’d
Video ID | Light source | Max. resolution | No. of frames | Duration
1 | Tungsten | 768x1366 | 228 | 00:07
2 | Daylight | 768x1366 | 209 | 00:07
3 | Flash | 768x1366 | 581 | 00:19
4 | Daylight | 768x1366 | 1280 | 00:42
5 | Tungsten | 768x1366 | 507 | 00:16
6 | Tungsten | 768x1366 | 234 | 00:07
Detection and Tracking Accuracy
• We used the CLEAR MOT evaluation framework (4)
• Multiple Object Tracking Precision (MOTP)
• Multiple Object Tracking Accuracy (MOTA)
• We compared our method with full tracking-by-detection approaches
• There, subsequently occurring bounding boxes are associated by their overlap using Munkres' algorithm (10).
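MOTA penalizes misses, false positives and mismatches relative to the number of ground-truth objects, MOTA = 1 − (misses + FP + mismatches) / GT, which is why it can become negative when a detector produces many false positives. The overlap-based association step can be sketched as follows; on this toy example, brute force over permutations stands in for Munkres' algorithm, and equal numbers of boxes per frame are assumed.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def associate(prev_boxes, curr_boxes, min_iou=0.3):
    """Best one-to-one assignment by total IoU (brute force stands in
    for Munkres' algorithm; assumes equally many boxes per frame)."""
    best, best_score = [], -1.0
    for perm in permutations(range(len(curr_boxes))):
        pairs = [(i, j) for i, j in enumerate(perm)
                 if iou(prev_boxes[i], curr_boxes[j]) >= min_iou]
        score = sum(iou(prev_boxes[i], curr_boxes[j]) for i, j in pairs)
        if score > best_score:
            best, best_score = pairs, score
    return best

prev = [(0, 0, 10, 10), (20, 0, 10, 10)]
curr = [(21, 1, 10, 10), (1, 0, 10, 10)]
matches = associate(prev, curr)
```

Each matched pair keeps its track identity; unmatched ground-truth boxes count as misses, unmatched detections as false positives, and identity changes as mismatches.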
Detection and Tracking Accuracy Cont’d
Res. | Method | MOTP | Misses | FP rate | MM | MOTA
768x1366 | NATIVE TS | 0.74 | 0.27 | 0.98 | 0.08 | -0.32
480x854 | NATIVE TS | 0.74 | 0.29 | 1.52 | 0.14 | -0.95
480x854 | NATIVE SWT | 0.60 | 0.99 | 0.87 | 0.00 | -0.86
480x854 | TS | 0.75 | 0.57 | 0.13 | 0.02 | 0.28
480x854 | MSAC&TS | 0.73 | 0.59 | 0.10 | 0.02 | 0.29
480x854 | LOCEX&TS | 0.75 | 0.57 | 0.13 | 0.02 | 0.28
480x854 | KLT&MSAC&TS | 0.71 | 0.48 | 0.31 | 0.00 | 0.21
480x854 | AK&MSAC&TS | 0.70 | 0.52 | 0.11 | 0.02 | 0.36
Hybrid | KLT&MSAC&TS | 0.62 | 0.49 | 0.17 | 0.00 | 0.34
Multiple Object Tracking Precision and Accuracy
Runtime
Device | Resolution | Tracking method | Rectification, detection | Total
Laptop | 480x854 | AKAZE | 318.2 | 144.4
Laptop | 480x854 | KLT | 263.1 | 22.2
Laptop | Hybrid | KLT | 73.1 | 5.0
Shield Tablet | 480x720 | KLT | 2788.3 | 469.2
Shield Tablet | Hybrid | KLT | 519.2 | 84.9
Average time performance measurements in milliseconds
Reading Accuracy
We extracted the text patches and used the Anyline Energy module1 to read the meter values from
• the current patch and
• the currently available integrated counterpart, or the fusion of the preceding results, respectively.
These recognition rates are compared.
1https://www.anyline.io/energy-anyline-io-de/
Reading Accuracy Cont’d
Single extracted frames sampled during a sequence of 62 frames, compared to the respective integration results.
Reading Accuracy Cont’d
Degenerated Multi-frame Integration over Time
Reading Accuracy Cont’d
[Figure: bar chart of recognition rates (0.0 to 1.0) for the methods SF, MIN, YI and HIST, shown separately for ECC-refined and hybrid tracking.]
The recognition rates using the single extracted frames and the different MFI methods
Reading Accuracy Cont’d
Resolution | Single frame | Minimum operator | Yi integration | Histogram voting
768x1366 | 0.45 | 0.44 | 0.55 | 0.63
480x854 | 0.38 | 0.50 | 0.50 | 0.62
320x568 | 0.36 | 0.48 | 0.43 | 0.61
Hybrid | 0.33 | 0.29 | 0.27 | 0.61
Recognition rates
Conclusion & Outlook
• We showed that our MFI approach achieves real-time performance on mobile hardware with little optimization
• The multi-threaded detection and tracking approach can keep up with full detection approaches
• A distinct improvement of the recognition rates is possible
• In general, image enhancement integration methods require almost perfect image registration
• If text recognition is fast enough, result fusion methods should be preferred over the evaluated image enhancement approaches
Questions?
References I
[1] P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In
European Conference on Computer Vision, 2012.
[2] P. F. Alcantarilla, J. Nuevo, and A. Bartoli. Fast explicit diffusion
for accelerated features in nonlinear scale spaces. In British Machine
Vision Conference, 2013.
[3] D. L. Baggio, S. Emami, D. M. Escriva, K. Ievgen, N. Mahmood,
J. Saragih, and R. Shilkrot. Mastering OpenCV with Practical
Computer Vision Projects. Packt Publishing, Limited, 2012.
[4] K. Bernardin and R. Stiefelhagen. Evaluating multiple object
tracking performance: the CLEAR MOT metrics. EURASIP Journal
on Image and Video Processing, 2008(1):1–10, 2008.
References II
[5] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural
scenes with stroke width transform. In Conference on Computer
Vision and Pattern Recognition, pages 2963–2970. IEEE, 2010.
[6] G. D. Evangelidis and E. Z. Psarakis. Parametric image alignment
using enhanced correlation coefficient maximization. Transactions on
Pattern Analysis and Machine Intelligence, 30(10):1858–1865, 2008.
[7] A. Hartl and G. Reitmayr. Rectangular target extraction for mobile
augmented reality applications. In International Conference on
Pattern Recognition, pages 81–84. IEEE, 2012.
[8] B. D. Lucas, T. Kanade, et al. An iterative image registration
technique with an application to stereo vision. In International Joint
Conference on Artificial Intelligence, volume 81, pages 674–679,
1981.
References III
[9] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with
automatic algorithm configuration. International Conference on
Computer Vision Theory and Applications, 2(331-340):2, 2009.
[10] J. Munkres. Algorithms for the assignment and transportation
problems. Journal of the Society for Industrial and Applied
Mathematics, 5(1):32–38, March 1957.
[11] L. Neumann and J. Matas. Real-time scene text localization and
recognition. In Conference on Computer Vision and Pattern
Recognition, pages 3538–3545. IEEE, 2012.
[12] M. Nieto and L. Salgado. Real-time robust estimation of vanishing
points through nonlinear optimization. In SPIE Photonics Europe,
pages 772402–772402. International Society for Optics and
Photonics, 2010.
References IV
[13] J. Shi and C. Tomasi. Good features to track. In Computer Society
Conference on Computer Vision and Pattern Recognition, pages
593–600. IEEE, 1994.
[14] C. Tomasi and T. Kanade. Detection and tracking of point features.
School of Computer Science, Carnegie Mellon Univ. Pittsburgh,
1991.
[15] J. Yi, Y. Peng, and J. Xiao. Using multiple frame integration for the
text recognition of video. In International Conference on Document
Analysis and Recognition, pages 71–75. IEEE, 2009.