
Page 1

Extraction of Text Objects in Video Documents: Recent Progress

Jing Zhang and Rangachar Kasturi

University of South Florida, Department of Computer Science and Engineering

Page 2

Acknowledgements

The work presented here is that of numerous researchers from around the world. We thank them for their contributions towards the advances in video document processing.

In particular, we would like to thank the authors of the papers whose work is cited in this presentation and in our paper.

Page 3

• Introduction
• Recent Progress
• Performance Evaluation
• Discussion

Outline

Page 4

Since the 1990s, with the rapid growth of available multimedia documents and the increasing demand for information indexing and retrieval, much effort has been devoted to text extraction in images and videos.

Introduction

Page 5

• Text Extraction in Video

– Text consists of words that are well-defined models of concepts for human communication.

– Text objects embedded in video contain much semantic information related to the multimedia content.

– Text extraction techniques play an important role in content-based multimedia information indexing and retrieval.

Introduction

Page 6

Extracting text in video presents unique challenges compared to scanned documents:

Introduction

Cons:
– Low contrast
– Low resolution
– Color bleeding
– Unconstrained backgrounds
– Unknown text color, size, position, orientation, and layout

Pros:
– Temporal redundancy (text in video usually persists for at least several seconds, to give human viewers the necessary time to read it)

Page 7

Introduction

• Caption Text is artificially superimposed on the video at the time of editing.
• Scene Text naturally occurs in the field of view of the camera during video capture.
• The extraction of scene text is a much tougher task due to varying lighting, complex movement, and transformation.

(Figure: examples of scene text and caption text.)

Page 8

Five stages of text extraction in video:

1) Text Detection: finding regions in a video frame that contain text;

2) Text Localization: grouping text regions into text instances and generating a set of tight bounding boxes around all text instances;

3) Text Tracking: following a text event as it moves or changes over time and determining the temporal and spatial locations and extents of text events;

4) Text Binarization: binarizing the text bounded by text regions and marking text as one binary level and background as the other;

5) Text Recognition: performing OCR on the binarized text image.

Introduction
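To make the data flow between these five stages concrete, here is a minimal Python sketch of the pipeline; every function below is a hypothetical stub standing in for a real algorithm, not an implementation from any cited system.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

# Hypothetical stage stubs; a real system plugs its own algorithms in here.
def detect_text(frame) -> List[Box]:                # 1) detection: candidate regions
    return [(0, 0, 100, 20)]

def localize_text(boxes: List[Box]) -> List[Box]:   # 2) localization: tight boxes
    return boxes

def track_text(tracks, boxes: List[Box], t: int):   # 3) tracking: associate over time
    tracks.append((t, boxes))
    return tracks

def binarize_text(frame, box: Box):                 # 4) binarization: text vs. background
    return "binary image"

def recognize_text(binary) -> str:                  # 5) recognition: OCR on binary image
    return "HELLO"

def extract_text(frames) -> List[str]:
    tracks = []
    for t, frame in enumerate(frames):
        tracks = track_text(tracks, localize_text(detect_text(frame)), t)
    return [recognize_text(binarize_text(frames[t], box))
            for t, boxes in tracks for box in boxes]

print(extract_text(["frame0", "frame1"]))  # ['HELLO', 'HELLO']
```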

Page 9

Introduction

(Flowchart: Video Clips → Text Detection → Text Localization → Text Tracking → Text Binarization → Text Recognition → Text Objects)

Page 10

The goal of text detection, text localization, and text tracking is to generate accurate bounding boxes for all text objects in video frames and to assign a unique identity to each text event, which is composed of the same text object appearing in a sequence of consecutive frames.

Introduction

Page 11

This presentation concentrates on the approaches proposed for text extraction in videos during the most recent five years, summarizing and discussing the recent progress in this research area.

Introduction

Page 12

Region Based Approach utilizes the different region properties between text and background to extract text objects.
– Bottom-up: separating the image into small regions and then grouping character regions into text regions.
– Color features, edge features, and connected component methods.

Texture Based Approach uses distinct texture properties of text to extract text objects from the background.
– Top-down: extracting texture features of the image and then locating text regions.
– Spatial variance, Fourier transform, wavelet transform, and machine learning methods.

Introduction

Page 13

• Introduction
• Recent Progress
• Performance Evaluation
• Discussion

Outline

Page 14

Text extraction in video documents, as an important research branch of content-based information retrieval and indexing, continues to be a topic of much interest to researchers.

A large number of newly proposed approaches in the literature have contributed to impressive progress in text extraction techniques.

Recent Progress

Page 15

Recent Progress

Prior to 2003

• Only a few text extraction approaches considered the temporal nature of video.

• Very little work was done on scene text.

• Objective performance evaluation metrics were scarce.

Now

• Temporal redundancy of video is utilized by almost all recent text extraction approaches.

• Scene text extraction is being extensively studied.

• A comprehensive performance evaluation framework has been developed.

Page 16

The progress of text extraction in videos can be categorized into three types:

1. New and improved text extraction approaches

2. Text extraction techniques adopted from other research fields

3. Text extraction approaches proposed for specific text types and specific genre of video documents

Recent Progress

Page 17

1. New and improved text extraction approaches:

The new and improved approaches play an important role in the recent progress of text extraction techniques for videos. These new approaches introduce not only new algorithms but also new understanding of the problem.

Recent Progress

Page 18

H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005.

A text string is modeled as its center line and the skeletons of characters by ridges at different hierarchical scales.

Recent Progress-New and improved text extraction approaches

First line: images with rectangles showing the text regions. Second line: zoom on the text regions. Third line: ridges detected at two scales (red at the coarse scale, blue at the fine scale) in the text region, representing the local structure of text lines whatever the type of text.

Page 19

• Abstract. We propose a novel approach for finding text in images by using ridges at several scales. A text string is modelled by a ridge at a coarse scale representing its center line and numerous short ridges at a smaller scale representing the skeletons of characters. Skeleton ridges have to satisfy geometrical and spatial constraints such as the perpendicularity or non-parallelism to the central ridge. In this way, we obtain a hierarchical description of text strings, which can provide direct input to an OCR or a text analysis system. The proposed method does not depend on a particular alphabet, it works with a wide variety in size of characters and does not depend on orientation of text string. The experimental results show a good detection.

H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005.

X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and Learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.

Abstract: This paper proposes an approach based on the statistical modeling and learning of neighboring characters to extract multilingual texts in images. The case of three neighboring characters is represented as the Gaussian mixture model and discriminated from other cases by the corresponding ‘pseudo-probability’ defined under Bayes framework. Based on this modeling, text extraction is completed through labeling each connected component in the binary image as character or non-character according to its neighbors, where a mathematical morphology based method is introduced to detect and connect the separated parts of each character, and a Voronoi partition based method is advised to establish the neighborhoods of connected components. We further present a discriminative training algorithm based on the maximum–minimum similarity (MMS) criterion to estimate the parameters in the proposed text extraction approach. Experimental results in Chinese and English text extraction demonstrate the effectiveness of our approach trained with the MMS algorithm, which achieved the precision rate of 93.56% and the recall rate of 98.55% for the test data set. In the experiments, we also show that the MMS provides significant improvement of overall performance, compared with influential training criterions of the maximum likelihood (ML) and the maximum classification error (MCE).

Page 20

X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008.

The GMM-based algorithm models the text features of three neighboring characters as a Gaussian mixture to extract text objects.

Recent Progress-New and improved text extraction approaches

An example of neighborhood computation. In each figure, image (a) shows a binary image, where black dots denote centroids of CCs; image (b) shows the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set. However, the neighborhoods of characters cannot be completely reflected in the Delaunay triangulation. (c) The solution: all triples of nodes joined one by one on the convex hull of the centroid set are also taken as neighbor sets.

(a) (b) (c)
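As a small illustration of the neighborhood construction described above, the Delaunay triangulation of connected-component centroids is available in SciPy; the centroid coordinates below are made up for illustration, and the convex-hull augmentation from the paper is omitted.

```python
import numpy as np
from scipy.spatial import Delaunay

# Hypothetical centroids of connected components (one per character candidate).
centroids = np.array([[10, 12], [34, 11], [58, 13], [82, 12], [40, 60]])

tri = Delaunay(centroids)
# Each simplex is a triple of centroid indices, i.e., one candidate neighbor
# set of three components to be scored by the Gaussian mixture model.
neighbor_sets = [tuple(sorted(s)) for s in tri.simplices]
print(neighbor_sets)
```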

Page 21

P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of the International Conference on Signal Processing, IEEE, Vol. 4, 2006.

Only the vertical edge features are utilized to find text regions based on the observation that vertical edges can enhance the characteristic of text and eliminate most irrelevant information.

Recent Progress-New and improved text extraction approaches

(a) (b) (c) (d)

(a) Original image, (b) detected group of vertical lines, (c) extracted text region, (d) result
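A minimal OpenCV sketch of this vertical-edge idea, run on a synthetic frame; the threshold and morphological kernel size are guesses for illustration, not the parameters used in the paper.

```python
import cv2
import numpy as np

# Synthetic gray frame standing in for a real video frame.
img = np.full((120, 320), 30, np.uint8)
cv2.putText(img, "BREAKING NEWS", (10, 70), cv2.FONT_HERSHEY_SIMPLEX, 1, 255, 2)

# Vertical edges (Sobel in the x direction) respond strongly on character strokes.
vert = np.uint8(np.clip(np.abs(cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)), 0, 255))
_, edges = cv2.threshold(vert, 80, 255, cv2.THRESH_BINARY)

# Close horizontal gaps so neighboring strokes merge into candidate text regions.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
regions = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
print(cv2.connectedComponents(regions)[0] - 1, "candidate region(s)")
```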

Page 22

K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of the Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007.

Character-stroke is used to extract text objects by utilizing three line scans (a set of pixels along the horizontal line of an intensity image) to detect image intensity changes.

Recent Progress-New and improved text extraction approaches

(a) Original image, (b) intensity plots along the scan lines l, l-2, and l+2 (the plotted pulse width corresponds to the stroke width), (c) image thresholded at 0.35, (d) the thresholded image after morphological operations and connected component analysis.
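A rough single-line-scan sketch of this stroke test, assuming light text on a dark background; the contrast threshold and maximum stroke width are assumptions, not the paper's parameters (the paper uses three scan lines and an intensity threshold).

```python
import numpy as np

def stroke_candidates(scanline, max_width=20, min_contrast=40):
    """Find runs along one image row whose intensity rises sharply and falls
    back within max_width pixels -- crude evidence of a character stroke."""
    line = scanline.astype(int)          # avoid uint8 wraparound in differences
    diffs = np.diff(line)
    strokes = []
    for r in np.where(diffs > min_contrast)[0]:                  # stroke starts
        falls = np.where(diffs[r + 1:r + 1 + max_width] < -min_contrast)[0]
        if falls.size:                                           # stroke ends
            strokes.append((int(r) + 1, int(r + 1 + falls[0])))
    return strokes

row = np.array([10] * 5 + [200] * 4 + [10] * 6 + [210] * 3 + [12] * 5, np.uint8)
print(stroke_candidates(row))  # two stroke runs: [(5, 8), (15, 17)]
```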

Page 23

D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events from digital video, International Journal on Document Analysis and Recognition, Vol. 5, pp. 138-157, 2003

Recent Progress-New and improved text extraction approaches

An 8×8 block-wise DCT is applied to each video frame. For each block, 19 optimal coefficients that best correspond to the properties of text are determined empirically. The sum of the absolute values of these coefficients is regarded as a measure of the "text energy" of that block. The motion vectors of MPEG-compressed videos are used for text object tracking.

(a) Original image, (b) text energy, (c) tracking result
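A sketch of the block-wise DCT "text energy" computation; the coefficient mask below is a placeholder mid-frequency band, not the 19 empirically chosen coefficients from the paper.

```python
import numpy as np
from scipy.fftpack import dct

def block_text_energy(frame, coeff_mask):
    """Sum of |DCT coefficients| selected by coeff_mask for each 8x8 block."""
    h, w = frame.shape
    energy = np.zeros((h // 8, w // 8))
    for by in range(h // 8):
        for bx in range(w // 8):
            block = frame[by * 8:(by + 1) * 8, bx * 8:(bx + 1) * 8].astype(float)
            coeffs = dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")
            energy[by, bx] = np.abs(coeffs[coeff_mask]).sum()
    return energy

# Placeholder mask: a mid-frequency diagonal band, standing in for the paper's
# empirically selected coefficient set.
u, v = np.meshgrid(range(8), range(8), indexing="ij")
mask = (u + v >= 2) & (u + v <= 5)

frame = np.random.default_rng(0).integers(0, 256, (64, 64))
print(block_text_energy(frame, mask).shape)  # (8, 8) text-energy map
```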

Page 24

In addition, many former text extraction approaches have been enhanced and extended recently: besides entirely new approaches, many improved approaches have been presented to overcome the limitations of earlier ones.

By extracting and integrating more comprehensive characteristics of text objects, these approaches can provide more robust performance than previous approaches.

Recent Progress-New and improved text extraction approaches

Page 25

S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of the Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005.

A color-related detector, a wavelet-based texture detector, an edge-based contour detector, and the temporal invariance principle are adopted to detect candidate caption regions. A parallel fusion strategy then combines the outputs of these detectors.

C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of the IEEE International Conference on Image Processing, pp. 985-988, 2006.

Euclidean distance based and cosine similarity based clustering methods are applied complementarily in the RGB color space to partition the original image into three clusters: textual foreground, textual background, and noise.

Recent Progress-New and improved text extraction approaches

Overview of the proposed algorithm combining color and spatial information.

Page 26

M.R. Lyu, J. Song, M. Cai, A comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.

The sequential multi-resolution paradigm removes the redundancy of the parallel multi-resolution paradigm: no text edge can appear several times at different resolution levels.

Recent Progress-New and improved text extraction approaches

Sequential multiresolution paradigm

Page 27

J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp. 283-290, 2006.

Fuzzy C-means based individual frame clustering is replaced by the fuzzy clustering ensemble (FCE) based multi-frame clustering to utilize temporal redundancy.

Recent Progress-New and improved text extraction approaches

Fuzzy cluster ensemble for text detection in videos
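For reference, a minimal implementation of plain fuzzy C-means, the per-frame clustering building block that the FCE approach aggregates across frames; the random initialization and fixed iteration count are arbitrary choices for this sketch.

```python
import numpy as np

def fuzzy_cmeans(X, c=3, m=2.0, iters=50, seed=0):
    """Minimal fuzzy C-means: returns soft memberships U (n x c) and centers."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))        # random soft assignments
    for _ in range(iters):
        W = U ** m                                    # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted cluster means
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))              # closer => higher membership
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

# Three well-separated 2-D blobs; the memberships recover the grouping.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(mu, 0.3, (30, 2)) for mu in (0.0, 3.0, 6.0)])
U, centers = fuzzy_cmeans(X)
print(centers.round(1))
```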

Page 28

2. Text extraction techniques adopted from other research fields:

Another encouraging development is that more and more techniques that have been successfully applied in other research fields are being adapted for text extraction.

Because these approaches were not initially designed for the text extraction task, many unique characteristics of their original research fields are intrinsically embedded in them.

Therefore, by using these approaches from other fields, we can view the text extraction problem from the viewpoints of related research fields and benefit from them. This is a promising way to find good solutions for the text extraction task.

Recent Progress

Page 29

K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003.

The continuously adaptive mean shift algorithm (CAMSHIFT) was initially used to detect and track faces in a video stream.

Recent Progress-Text extraction techniques adopted from other research fields

Example of text detection using CAMSHIFT. (a) input image (540×400), (b) initial window configuration for CAMSHIFT iteration (5×5-sized windows located at regular intervals of (25, 25)), (c) texture classified region marked as white and gray level (white: text region, gray: non-text region), and (d) final detection result
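OpenCV ships CAMSHIFT directly; as a rough sketch of this adaptation, the code below runs it on a synthetic text-probability map standing in for the output of the SVM texture classifier described in the paper, with a deliberately offset initial window.

```python
import cv2
import numpy as np

# Synthetic text-probability map (values in 0..255); in Kim et al. this comes
# from an SVM texture classifier applied to the frame.
prob = np.zeros((400, 540), np.uint8)
prob[180:220, 100:300] = 255  # one dense "text" region for illustration

window = (90, 170, 30, 30)    # initial (x, y, w, h), deliberately offset
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

# CAMSHIFT shifts and rescales the window toward the dense probability mass.
rot_rect, window = cv2.CamShift(prob, window, criteria)
print("adapted window:", window)
```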

Page 30

The multiscale statistical process control (MSSPC) was originally proposed for detecting changes in univariate and multivariate signals.

H.B. Aradhye and G.K. Myers, Exploiting Videotext "Events" for Improved Videotext Detection, Proceedings of the Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 894-898, 2007.

Recent Progress-Text extraction techniques adopted from other research fields

Substeps involved in the use of MSSPC for videotext event detection

Page 31

D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp. 14-19, 2006.

Discriminative Random Fields (DRF) was initially applied to detect man-made building in 2D images.

Recent Progress-Text extraction techniques adopted from other research fields

(a) 2D DRF, with state si and one of its neighbors sj . (b) 3D DRF, with multiple 2D DRFs stacked over time. (c) 2D DRF-HMM type(A), with intra-frame dependencies modelled by undirected DRFs, and inter-frame dependencies modelled by HMMs. States are shared between the two models.

Page 32

Sparse representation was initially used for research on the receptive fields of simple cells.

W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using Sparse Representations, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 412-416, 2007.

Recent Progress-Text extraction techniques adopted from other research fields

(a) (b) (c)

(a) Camera Captured Image; (b) foreground text generated by image decomposition via sparse representations; (c) binarized result of (b) using Otsu’s method.

Page 33

3. Text extraction approaches proposed for specific text types and specific genre of video documents:

Besides general text extraction approaches, an increasing number of approaches have been proposed for specific text types.

Based on domain knowledge, these specific approaches can take advantage of the unique properties of a specific text type or video genre, and often achieve better performance than general approaches.

Recent Progress

Page 34

W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005.

This approach is composed of two stages: 1. localizing road signs; 2. detecting text.

Recent Progress-Text extraction approaches proposed for specific text types and specific genre of video documents

Architecture of the proposed framework

Page 35

A content fluctuation curve based on the number of chalk pixels is used to measure the content in each frame of instructional videos. The frames with enough chalk pixels are extracted as key frames. Hausdorff distance and connected-component decomposition are adopted to reduce the redundancy of key frames by matching the content and mosaicking the frames.

C. Choudary and T. Liu, Summarization of Visual Content in Instructional Videos, IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007.

Recent Progress-Text extraction approaches proposed for specific text types and specific genre of video documents

Comparison of our summary frames with the key frames obtained using different key frame selection methods in a test video. (a) our summarization algorithm; (b) fixed sampling; (c) dynamic clustering; (d) tolerance band. Our summary frames are rich in content and more appealing.

(a) (b) (c) (d)

Page 36

Additional References:

• C. Mancas-Thilou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of the IEEE International Conference on Image Processing, pp. 985-988, 2006.

• D. Q. Zhang and S. F. Chang, Learning to Detect Scene Text Using a Higher-order MRF with Belief Propagation, IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2004.

• L. Tang and J.R. Kender, A unified text extraction method for instructional videos, Proceedings of the IEEE International Conference on Image Processing, Vol. 3, pp. 11-14, 2005.

• M.R. Lyu, J. Song, M. Cai, A comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005.

• S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of the Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005.

• C.C. Lee, Y.C. Chiang, C.Y. Shih, H.M. Huang, Caption localization and detection for news videos using frequency analysis and wavelet features, Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 539-542, 2007.

• …

Recent Progress-Text extraction approaches proposed for specific text types and specific genre of video documents

Page 37

• Introduction
• Recent Progress
• Performance Evaluation
• Discussion

Outline

Page 38

Evaluation Metrics:

Video Analysis and Content Extraction (VACE)

Performance Evaluation

R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. (http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.57)

Page 39

Text: Task Definition

Detection Task: Spatially locate the blocks of text in each video frame in a video sequence.
• Text blocks (objects) contain all words in a particular line of text where the font and size are the same.

Tracking Task: Spatially/temporally locate and track the text objects in a video sequence

Recognition Task: Transcribe the words in each frame, including their spatial location (detection implied)

Page 40

Task Definition Highlights

• Annotate an oriented bounding rectangle around text objects (the reference annotation was done by VideoMining Inc., State College, PA)

• Detection and Tracking tasks
– Line-level annotation with IDs maintained
– Rules based on similarity of font, proximity, and readability levels

• Recognition task
– Word level (IDs maintained)

• Documents
– Annotation guidelines
– Evaluation protocol

• Tools
– ViPER (Annotation)
– USF-DATE (Scoring)

Page 41

• Micro-corpus: a small amount of data that was created after extensive discussions with the research community to act as a seed for initial annotation experiments and to provide new participants with a concrete sampling of the datasets and the tasks.

Data Resources

VIDEO DATA     NUMBER OF CLIPS   TOTAL MINS
MICRO-CORPUS   5                 10
TRAINING       50                175
TESTING        50                175

Page 42

These discussions were coordinated as a series of weekly teleconferences with VACE contractors and other eminent members of the CV community.

The discussions made the research community a partner in the evaluations and helped us in:

1. selecting the video recordings to be used in the evaluations,

2. creating the specifications for the ground truth annotations and scoring tools, and

3. defining the evaluation infrastructure for the program.

Data Resources

Page 43

The videos follow the MPEG-2 standard, progressively scanned at 720 × 480 resolution, with a GOP (Group of Pictures) of 12 for the broadcast news corpus, where the frame rate was 29.97 fps (frames per second), and a GOP of 10 for the surveillance dataset, where the frame rate was 25 fps.

TASK                     DOMAIN           SOURCE
Text Detect & Track      Broadcast News   ABC & CNN*
Face Detect & Track      Broadcast News   ABC & CNN*
Vehicle Detect & Track   Surveillance     i-LIDS**

Data Resources

* Distributed by the Linguistic Data Consortium (LDC), http://www.ldc.upenn.edu
** i-LIDS [Multiple Camera Tracking / Parked Vehicle Detection / Abandoned Baggage Detection] scenario datasets were developed by the UK Home Office and CPNI. (http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imagingtechnology/video-based-detection-systems/i-lids/)

Page 44

Text Ground Truth: Every new text area was marked with a box when it appeared in the video. The box was moved and scaled to fit the text as it moved in successive frames. This process was done at the text line level until the text disappeared from the frame.

Reference Annotations

Three readability levels:

READABILITY = 2 (black): clearly readable text
READABILITY = 1 (gray): partially readable text
READABILITY = 0 (white): completely unreadable text

Page 45

• Text regions were tagged based on a comprehensive set of rules:

1. All text within a selected block must have the same readability level and type.

2. Blocks of text must have the same size and font.

3. The bounding box should be tight to the extent that there is no space between the box and the text.

4. Text boxes may not overlap other text boxes unless the characters themselves are superimposed atop one another.

Reference Annotations

Page 46

Sample Annotation Clip (line-level)

Page 47

• The Frame Detection Accuracy (FDA) measure calculates the spatial overlap between the ground truth and system output objects as the ratio of the spatial intersection of the two objects to their spatial union. The sum of all of the overlaps is normalized over the average of the number of ground truth and detected objects.

\mathrm{Overlap\ Ratio} = \sum_{i=1}^{N_{mapped}^{(t)}} \frac{\left| G_i^{(t)} \cap D_i^{(t)} \right|}{\left| G_i^{(t)} \cup D_i^{(t)} \right|}

FDA(t) = \frac{\mathrm{Overlap\ Ratio}}{\left( N_G^{(t)} + N_D^{(t)} \right) / 2}

Frame Detection Accuracy (FDA)

Detection Metric

G_i denotes the i-th ground truth object at the sequence level and G_i^{(t)} denotes the i-th ground truth object in frame t.

D_i denotes the i-th detected object at the sequence level and D_i^{(t)} denotes the i-th detected object in frame t.

N_G^{(t)} and N_D^{(t)} denote the number of ground truth objects and the number of detected objects in frame t, respectively.
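A small Python sketch of the per-frame FDA computation; it assumes the ground-truth-to-detection mapping is already given, whereas the actual VACE protocol computes an optimal one-to-one mapping between the two sets.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def fda(gt_boxes, det_boxes, mapping):
    """FDA for one frame; `mapping` pairs ground-truth and detection indices."""
    overlap = sum(iou(gt_boxes[g], det_boxes[d]) for g, d in mapping)
    denom = (len(gt_boxes) + len(det_boxes)) / 2.0
    return overlap / denom if denom else 0.0

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
det = [(0, 0, 10, 10), (21, 20, 31, 30)]
print(round(fda(gt, det, [(0, 0), (1, 1)]), 3))  # one perfect, one partial match
```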

Page 48

• The Sequence Frame Detection Accuracy (SFDA) is essentially the average of the FDA measure over all of the relevant frames in the sequence.

SFDA = \frac{\sum_{t=1}^{N_{frames}} FDA(t)}{\sum_{t=1}^{N_{frames}} \exists \left( N_G^{(t)}\ \mathrm{OR}\ N_D^{(t)} \right)}

Sequence Frame Detection Accuracy (SFDA)

Range: 0 to 1 (higher is better)

Detection Metric

N_{frames} is the number of frames in the sequence.

Page 49

• The Average Tracking Accuracy (ATA) is a spatio-temporal measure which penalizes fragmentations in both the temporal and spatial dimensions while accounting for the number of objects detected and tracked, missed objects, and false positives.

Tracking Metric

Sequence Track Detection Accuracy (STDA):

STDA = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} \left| G_i^{(t)} \cap D_i^{(t)} \right| / \left| G_i^{(t)} \cup D_i^{(t)} \right|}{N_{\left( G_i \cup D_i \neq \varnothing \right)}}

Average Tracking Accuracy (ATA):

ATA = \frac{STDA}{\left( N_G + N_D \right) / 2}

Range: 0 to 1 (higher is better)

N_G and N_D denote the number of unique ground truth objects and the number of unique detected objects in the given sequence, respectively. Uniqueness is defined by object IDs.
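A compact sketch of the STDA/ATA computation given per-track frame IoUs; the bookkeeping of which frames each object exists in, and the track mapping itself, are assumed to be done elsewhere.

```python
def ata(track_ious, frames_existing, n_gt, n_det):
    """track_ious[i]: IoUs of mapped pair i over frames where both boxes exist;
    frames_existing[i]: number of frames where G_i or D_i exists."""
    stda = sum(sum(ious) / n for ious, n in zip(track_ious, frames_existing))
    return stda / ((n_gt + n_det) / 2.0)

# One well-tracked object, one fragmented track, plus an unmapped GT object.
print(round(ata([[1.0, 1.0, 0.9], [0.5, 0.6]], [3, 4], n_gt=3, n_det=2), 3))
```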

Page 50

FDA = \frac{0.4505 + 1}{(5 + 5)/2} = 0.2901

Spatial alignment error (ratio = .4505)

Correctly detected object – perfect overlap (ratio = 1.0)

3 false alarm objects

3 missed objects

Green: detected box; Red: ground truth box; Yellow: overlap in mapped boxes

Example Detection Scoring

Page 51

Annotation Quality

Evaluation relies on manual labeling.

The degree of consistency becomes increasingly important as systems approach human levels of performance.

Humans fatigue easily when performing such tedious tasks.

A high degree of consistency would be difficult to achieve with somewhat subjective attributes like readability.

10% of the entire corpus was doubly annotated by multiple annotators and checked for quality using the evaluation measures.

Page 52

Annotation Quality

For the doubly annotated corpus:

Text detection: average Sequence Frame Detection Accuracy (SFDA) = 95%
Text tracking: average Average Tracking Accuracy (ATA) = 85%

The scores for the current state-of-the-art automatic algorithms are significantly lower than these numbers (22% relative for text detection, and 61% relative for text tracking).

Page 53

Annotation Quality

Flowchart of Annotation Quality Control Procedure. Steps denoted by dark shaded boxes were carried out by the annotators. Steps denoted by light shaded boxes were carried out by the evaluators.

Page 54

Text Detection and Tracking – VACE

(Chart: mean SFDA/ATA scores for English text detection and tracking on broadcast news, for four systems A, B, C, and D; scores range from 0 to 1.)

Page 55

Text Recognition Evaluation

• Datasets: Broadcast News

• Training/Dry Run Development Set: 5 clips
– 14.5 minutes
– 1181 words

• Evaluation Set: 25 clips
– 62.5 minutes
– 4178 word objects
– 68,738 word frame instances

Page 56

Text Recognition Evaluation

Evaluate only the most easily readable text (to establish a baseline at a high level of inter-annotator agreement)

• Type = graphic (no scene text)
• Readability = 2
• Logo = false
• Occlusion = false
• Ambiguous = false
– Exclude scrolling (ticker) and dynamic text (scoreboard)

• Case insensitive and punctuation ignored

Page 57

Sample Annotation Clip (Word-level)

Page 58

• Spatially map system output detected words to reference words, then compare the strings for mapped words:
– An unmapped word in the system output incurs an Insertion (I) error
– An unmapped word in the reference incurs a Deletion (D) error
– A mapped word with a character mismatch incurs a Substitution (S) error

• Errors are accumulated over the entire test set
• Also generate: Character Error Rate (CER)

WER = \frac{I + D + S}{\mathrm{Total\ \#\ Words\ in\ Ref}}

REF:        The  raven  caws   at       midnight
Sys Output:      raven  calls  at  at   midnight
             D          S          I

WER = (1 + 1 + 1)/5 = 3/5 (60%)

Recognition Evaluation Metrics
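The word error rate is a word-level edit distance; here is a self-contained sketch that reproduces the slide's example (case-insensitive, punctuation already removed).

```python
def word_error_rate(ref, hyp):
    """WER = (I + D + S) / (# words in ref), via edit distance over words."""
    r, h = ref.lower().split(), hyp.lower().split()
    # d[i][j]: edits needed to turn the first i ref words into the first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# The slide's example: one deletion, one substitution, one insertion -> 3/5.
print(word_error_rate("The raven caws at midnight",
                      "raven calls at at midnight"))  # 0.6
```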

Page 59

Individual Clip Word Error Rate

(Plot: clip-wise WER for the 25 evaluation clips; WER per clip is normalized by the number of words in each clip.)

Page 60

Scores (Word Error Rate)

(Chart: word error rates with different normalizations, WER/Word and WER/Frames.)

WER      CER
0.4233   0.2823

Page 61

• Introduction
• Recent Progress
• Performance Evaluation
• Discussion

Outline

Page 62

• The recent progress provides many promising solutions and research directions for the text extraction problem.

• Due to the large variations of text objects in videos, no single approach can achieve satisfactory performance in all applications.

• To further improve the performance of text extraction techniques, much work in the area remains.

Discussion

Page 63

Detection and Localization

– How to efficiently combine several complementary extraction algorithms to produce better performance, and how to extract better features by analyzing the shape of characters and the relationships between text and background, still need more investigation.

Discussion

Page 64

Tracking

– Although text tracking is an indispensable step for text extraction in videos, not many text tracking approaches have been reported in recent years.

– More effort is needed to focus on tracking, not only for static and scrolling text, but also for dynamic text objects (growing, shrinking, and rotating text).

Discussion

Page 65

Datasets:

– Besides extraction approaches themselves, datasets need attention: because most algorithms are still tested on their own datasets, a large, freely available, annotated video dataset is urgently needed in order to compare and evaluate algorithms.

Discussion

Page 66

THANK YOU!

See you at ICPR 2008 in December