Multiple Frame Integration for OCR on
Mobile Devices
Master’s Thesis
Georg Krispel
Advisor: Horst Bischof
December 12, 2016
Institute for Computer Graphics and Vision
Anyline GmbH
Scene Text Recognition on Mobile Devices
Scene Text Recognition Use Cases
Scene Text Recognition on Mobile Devices Cont’d
• An almost orthogonal view is assumed
• A search window is introduced to improve the user experience and to spare the user from searching for the text
• Sophisticated preprocessing steps
• Text recognition
• Possible repetition for validation
Problems
• Low-resolution images from outdated mobile phones
• Reflections and glare
• Poor lighting conditions
Objectives
• Evaluate how these effects can be mitigated to improve overall text recognition results
• Exploit multiple frames available in the camera stream and their
redundant information (Multiple Frame Integration)
• Implement the resulting pipeline on mobile hardware
Scene Text Processing Pipeline
Assumptions
• Text is written on a nearly planar surface
• The surface is well textured
• Sufficiently smooth motion of the camera
Overview
• Detect text in keyframes and track it (i.e. the underlying plane) over time
• Keyframes are selected according to blurriness and the text detection result
• The underlying plane is rectified before text detection
• Multiple threads are used to outsource expensive tasks
• Asynchronous plane rectification and scene text detection
• The dominant plane is tracked to propagate text detection results to the remaining frames
• Reinitialization after a certain time or when tracking degenerates
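The division of labor above can be sketched in a few lines of Python. The function names and the box format below are illustrative placeholders, not the thesis implementation: a worker thread runs the expensive rectification and detection on a keyframe, while the main loop keeps "tracking" every frame and picks up detection results as they arrive.

```python
import queue
import threading

# Hypothetical stand-in for the expensive stage described above; the
# real pipeline runs plane rectification and scene text detection here.
def rectify_and_detect(frame):
    return {"frame": frame, "boxes": [(10, 10, 50, 20)]}

detections = queue.Queue()

def detection_worker(keyframe):
    # TextDetectionThread: rectify the plane, then detect text
    detections.put(rectify_and_detect(keyframe))

frames = list(range(10))
worker = threading.Thread(target=detection_worker, args=(frames[0],))
worker.start()

boxes, tracked = [], []
for f in frames:
    # MainThread: poll for asynchronous detection results ...
    if not detections.empty():
        boxes = detections.get()["boxes"]
    # ... and keep tracking every frame (here: simply propagate boxes)
    tracked.append((f, list(boxes)))

worker.join()
while not detections.empty():   # drain results that arrived late
    boxes = detections.get()["boxes"]
```

The main loop never blocks on detection; it only propagates the most recent detection result, which is the point of the multi-threaded design.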
Initialization Process
[Figure: initialization timeline. The MainThread tracks frames #0 to #10 while the TextDetectionThread runs plane rectification and text detection asynchronously on keyframe #0; once the result arrives, tracking continues together with MFI.]
Pipeline Initialization Process
Initialization Process Cont’d
Pipeline Processing Example
Modules
A modular design allows the individual parts of the pipeline to be exchanged:
• Visual Tracking
• Rectification
• Text Detection
• Multiple Frame Integration
Visual Tracking
• Feature-based
• Good Features to Track and Kanade-Lucas-Tomasi tracking (8, 13, 14)
• AKAZE features (1, 2) and FLANN matching (9)
• Intensity-based refinement for text patches only
• Parametric image alignment using ECC (6)
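The intensity-based refinement can be illustrated with a single translational Lucas-Kanade (Gauss-Newton) update on a synthetic image pair. This is a minimal sketch under simplifying assumptions; the actual pipeline uses pyramidal KLT and ECC alignment, not this toy.

```python
import numpy as np

# One translational Lucas-Kanade step on a smooth synthetic image pair;
# the true shift between I and J is (0.3, 0.2) pixels.
y, x = np.mgrid[0:64, 0:64].astype(float)

def gaussian(cx, cy):
    # smooth test pattern centered at (cx, cy)
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / 50.0)

I = gaussian(32.0, 32.0)
J = gaussian(32.3, 32.2)         # I shifted by (0.3, 0.2)

Iy, Ix = np.gradient(I)          # spatial gradients (central differences)
It = J - I                       # temporal difference

# Normal equations of the linearized error sum((Ix*dx + Iy*dy + It)^2)
A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
              [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
b = np.array([np.sum(Ix * It), np.sum(Iy * It)])
dx, dy = -np.linalg.solve(A, b)  # dx ~ 0.3, dy ~ 0.2
```

A single step already recovers the shift well here because the image is smooth and the displacement is subpixel; real trackers iterate and use image pyramids for larger motion.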
Rectification
• Rectangular region localization and extraction (LocEx) module by Hartl et al. (7)
• M-Estimator Sample Consensus (MSAC) based vanishing point detection by Nieto et al. (12)
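Plane rectification ultimately amounts to estimating a homography that maps the localized quadrilateral onto an axis-aligned rectangle. A minimal DLT (direct linear transform) sketch in NumPy, with made-up corner coordinates for illustration:

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate the 3x3 homography mapping src points to dst (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # h is the null vector of the 8x9 system, found via SVD
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)

def warp_point(H, x, y):
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# a perspectively distorted display region and its rectified target
quad = [(12, 20), (98, 28), (92, 70), (8, 60)]
rect = [(0, 0), (100, 0), (100, 40), (0, 40)]
H = homography_dlt(quad, rect)
```

With exactly four correspondences the solution is exact; warping the image with H yields the fronto-parallel patch that is handed to text detection.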
Scene Text Detection
• TextSpotter (TS) by Neumann et al. (11)
• Based on classification and grouping of Extremal Regions
• Stroke Width Transform (SWT) by Epshtein et al. (5)
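The intuition behind the SWT can be shown with a deliberately simplified toy: measuring horizontal run lengths of foreground pixels. The real transform instead shoots rays along gradient directions from edge pixels, but the idea is the same: text strokes have a nearly constant width.

```python
import numpy as np

def horizontal_stroke_widths(binary):
    """Collect lengths of horizontal foreground runs (toy stroke widths)."""
    widths = []
    for row in binary:
        run = 0
        for v in row:
            if v:
                run += 1
            elif run:
                widths.append(run)
                run = 0
        if run:                 # run reaching the right border
            widths.append(run)
    return widths

img = np.zeros((10, 20), dtype=bool)
img[:, 4:7] = True    # a vertical stroke, 3 px wide
img[:, 12:15] = True  # another stroke of the same width
widths = horizontal_stroke_widths(img)
```

A region whose widths cluster tightly around one value, as here, is stroke-like; background clutter produces widely scattered widths.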
Multiple Frame Integration
[Figure: the two MFI approaches. Image enhancement integrates multiple registered frames and runs text recognition once on the result; recognition result fusion runs text recognition on every frame and fuses the per-frame readings (e.g. 04954, 00954, 84954, 04954, 02954 fused to 04954).]
MFI approaches
Multiple Frame Integration Cont’d
• Image Enhancement
• Minimum Operator
• Integration method by Yi et al. (15)
• Result Fusion
• Voting for most frequent recognition
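Both families of MFI methods can be sketched on toy data: a pixel-wise minimum over registered frames to suppress glare, and per-character majority voting over per-frame readings. All values below are illustrative.

```python
import numpy as np
from collections import Counter

# Image enhancement: the minimum operator suppresses bright glare
# pixels that appear in only some of the registered frames.
frames = np.stack([
    np.full((2, 2), 50),
    np.array([[50, 255], [50, 50]]),   # one glare pixel
    np.full((2, 2), 50),
])
integrated = frames.min(axis=0)        # glare removed

# Result fusion: vote per character position over per-frame readings.
readings = ["04954", "00954", "84954", "04954", "02954"]
fused = "".join(Counter(chars).most_common(1)[0][0]
                for chars in zip(*readings))
```

The voting scheme assumes readings of equal length (i.e. a stable character segmentation); per-frame recognition errors are outvoted as long as each character is read correctly in most frames.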
Impact of MFI Approaches on Overall Recognition Results
Datasets
• We assumed the use case of reading energy meters
• We tailored our pipeline to detect only the respective numbers
• Only bright text on a dark background
• An additional histogram-based verification step
• Constrained bounding box dimensions
Datasets
Exemplary frames of the evaluation datasets showing different types of energy
meters and ground truth annotation
Datasets Cont’d
Video ID | Light source | Max. resolution | No. of frames | Duration
1 | Tungsten | 768x1366 | 228 | 00:07
2 | Daylight | 768x1366 | 209 | 00:07
3 | Flash | 768x1366 | 581 | 00:19
4 | Daylight | 768x1366 | 1280 | 00:42
5 | Tungsten | 768x1366 | 507 | 00:16
6 | Tungsten | 768x1366 | 234 | 00:07
Detection and Tracking Accuracy
• We used the CLEAR MOT evaluation framework (4)
• Multiple Object Tracking Precision (MOTP)
• Multiple Object Tracking Accuracy (MOTA)
• We compared our method with full tracking-by-detection approaches
• There, subsequently occurring bounding boxes are associated by their overlap using Munkres' algorithm (10).
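MOTA penalizes misses, false positives and mismatches relative to the number of ground-truth objects, MOTA = 1 − (misses + FP + mismatches) / GT, which is why it can become negative when a detector produces many false positives. The overlap-based association step can be sketched as follows; on this toy example, brute force over permutations stands in for Munkres' algorithm, and equal numbers of boxes per frame are assumed.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def associate(prev_boxes, curr_boxes, min_iou=0.3):
    """Best one-to-one assignment by total IoU (brute force stands in
    for Munkres' algorithm; assumes equally many boxes per frame)."""
    best, best_score = [], -1.0
    for perm in permutations(range(len(curr_boxes))):
        pairs = [(i, j) for i, j in enumerate(perm)
                 if iou(prev_boxes[i], curr_boxes[j]) >= min_iou]
        score = sum(iou(prev_boxes[i], curr_boxes[j]) for i, j in pairs)
        if score > best_score:
            best, best_score = pairs, score
    return best

prev = [(0, 0, 10, 10), (20, 0, 10, 10)]
curr = [(21, 1, 10, 10), (1, 0, 10, 10)]
matches = associate(prev, curr)
```

Each matched pair keeps its track identity; unmatched ground-truth boxes count as misses, unmatched detections as false positives, and identity changes as mismatches.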
Detection and Tracking Accuracy Cont’d
Res. | Method | MOTP | Misses | FP rate | MM | MOTA
768x1366 | NATIVE TS | 0.74 | 0.27 | 0.98 | 0.08 | -0.32
480x854 | NATIVE TS | 0.74 | 0.29 | 1.52 | 0.14 | -0.95
480x854 | NATIVE SWT | 0.60 | 0.99 | 0.87 | 0.00 | -0.86
480x854 | TS | 0.75 | 0.57 | 0.13 | 0.02 | 0.28
480x854 | MSAC&TS | 0.73 | 0.59 | 0.10 | 0.02 | 0.29
480x854 | LOCEX&TS | 0.75 | 0.57 | 0.13 | 0.02 | 0.28
480x854 | KLT&MSAC&TS | 0.71 | 0.48 | 0.31 | 0.00 | 0.21
480x854 | AK&MSAC&TS | 0.70 | 0.52 | 0.11 | 0.02 | 0.36
Hybrid | KLT&MSAC&TS | 0.62 | 0.49 | 0.17 | 0.00 | 0.34
Multiple Object Tracking Precision and Accuracy
Runtime
Device | Resolution | Tracking method | Rectification, detection | Total
Laptop | 480x854 | AKAZE | 318.2 | 144.4
Laptop | 480x854 | KLT | 263.1 | 22.2
Laptop | Hybrid | KLT | 73.1 | 5.0
Shield Tablet | 480x720 | KLT | 2788.3 | 469.2
Shield Tablet | Hybrid | KLT | 519.2 | 84.9
Average time performance measurements in milliseconds
Reading Accuracy
We extracted the text patches and used the Anyline Energy module1 to read the meter values from
• the current patch and
• the currently available integrated counterpart, or the fusion of the preceding results, respectively.
These recognition rates are compared.
1https://www.anyline.io/energy-anyline-io-de/
Reading Accuracy Cont’d
Single extracted frames sampled during a sequence of 62 frames, compared to the respective integration results.
Reading Accuracy Cont’d
Degenerated Multi-frame Integration over Time
Reading Accuracy Cont’d
[Figure: bar chart of recognition rates (0.0 to 1.0) for the methods SF, MIN, YI and HIST, shown separately for ECC-refined and hybrid tracking.]
The recognition rates using the single extracted frames and the different MFI methods
Reading Accuracy Cont’d
Resolution | Single frame | Minimum operator | Yi integration | Histogram voting
768x1366 | 0.45 | 0.44 | 0.55 | 0.63
480x854 | 0.38 | 0.50 | 0.50 | 0.62
320x568 | 0.36 | 0.48 | 0.43 | 0.61
Hybrid | 0.33 | 0.29 | 0.27 | 0.61
Recognition rates
Conclusion & Outlook
• We showed that our MFI approach achieves real-time performance on mobile hardware with little optimization
• The multi-threaded detection and tracking approach can keep up with full detection approaches
• A distinct improvement of the recognition rates is possible
• In general, image enhancement integration methods require almost perfect image registration
• If text recognition is fast enough, result fusion methods should be preferred over the evaluated image enhancement approaches
Questions?
References I
[1] P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In
European Conference on Computer Vision, 2012.
[2] P. F. Alcantarilla, J. Nuevo, and A. Bartoli. Fast explicit diffusion
for accelerated features in nonlinear scale spaces. In British Machine
Vision Conference, 2013.
[3] D. L. Baggio, S. Emami, D. M. Escriva, K. Ievgen, N. Mahmood,
J. Saragih, and R. Shilkrot. Mastering OpenCV with Practical
Computer Vision Projects. Packt Publishing, Limited, 2012.
[4] K. Bernardin and R. Stiefelhagen. Evaluating multiple object
tracking performance: the CLEAR MOT metrics. EURASIP Journal
on Image and Video Processing, 2008(1):1–10, 2008.
References II
[5] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural
scenes with stroke width transform. In Conference on Computer
Vision and Pattern Recognition, pages 2963–2970. IEEE, 2010.
[6] G. D. Evangelidis and E. Z. Psarakis. Parametric image alignment
using enhanced correlation coefficient maximization. Transactions on
Pattern Analysis and Machine Intelligence, 30(10):1858–1865, 2008.
[7] A. Hartl and G. Reitmayr. Rectangular target extraction for mobile
augmented reality applications. In International Conference on
Pattern Recognition, pages 81–84. IEEE, 2012.
[8] B. D. Lucas, T. Kanade, et al. An iterative image registration
technique with an application to stereo vision. In International Joint
Conference on Artificial Intelligence, volume 81, pages 674–679,
1981.
References III
[9] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with
automatic algorithm configuration. International Conference on
Computer Vision Theory and Applications, 2(331-340):2, 2009.
[10] J. Munkres. Algorithms for the assignment and transportation
problems. Journal of the Society for Industrial and Applied
Mathematics, 5(1):32–38, March 1957.
[11] L. Neumann and J. Matas. Real-time scene text localization and
recognition. In Conference on Computer Vision and Pattern
Recognition, pages 3538–3545. IEEE, 2012.
[12] M. Nieto and L. Salgado. Real-time robust estimation of vanishing
points through nonlinear optimization. In SPIE Photonics Europe,
pages 772402–772402. International Society for Optics and
Photonics, 2010.
References IV
[13] J. Shi and C. Tomasi. Good features to track. In Computer Society
Conference on Computer Vision and Pattern Recognition, pages
593–600. IEEE, 1994.
[14] C. Tomasi and T. Kanade. Detection and tracking of point features.
School of Computer Science, Carnegie Mellon Univ. Pittsburgh,
1991.
[15] J. Yi, Y. Peng, and J. Xiao. Using multiple frame integration for the
text recognition of video. In International Conference on Document
Analysis and Recognition, pages 71–75. IEEE, 2009.