
A Localization Method of Video Text in TV Program Video

Yoshihiko Kawai, Takahiro Mochizuki and Hideki Sumiyoshi

Video text superimposed on a TV program includes useful information for analyzing the content of the program. For example, such text may represent a person's name, the substance of an utterance, or the score of a match in some sport. This report introduces our technique of localizing video text with high accuracy. We used a machine learning framework and image features reflecting the layout of character components such as lines and points to improve detection accuracy. In an experiment on broadcast TV programs, our technique achieved a recall rate of 89.9%, precision of 88.0%, and F-value of 0.889, which is a 0.135 improvement in F-value over existing techniques.

1. Introduction

Recently, with the increase in storage media capacities and the conversion to file-based broadcast systems, it has become possible for broadcasting stations to manage large amounts of video data centrally. To use this accumulated video data effectively, a search technology that enables desired scenes to be found easily is essential.

To be able to search the video data, labels (metadata) that can be searched must be attached to each scene beforehand. However, it is not practicable to process all of the video data manually, so various technologies are being studied1)-3) to automate the analysis of video content. One such technology is the analysis of subtitles*1 superimposed on video4)-9). Subtitles include useful information for searches, such as names of people and places, summaries of statements made, topics discussed, and when points are won in sports. The results of analyzing subtitles can be presented as text data, so it is highly compatible with keyword-based search systems4). On this basis, we have developed a technology for the accurate detection of subtitles in television program video. We expect to use existing optical character recognition (OCR)*2 technology to recognize the text of the subtitles.

One existing technology is the extraction of edges*3 appearing at the same location across multiple frames, where regions containing high concentrations of these edges are judged to be subtitles5). This technology is based on the assumption that subtitles will be displayed continuously at the same position for a fixed period of time. For this reason, it does not support cases in which the position or size of the text varies over time. There is also a technology that improves on this as it supports moving subtitles by estimating the motion of the edges6). There are also technologies that detect captions in single frames without using the relationships between frames7)8). However, each of these methods depends heavily on knowledge gained from experience, and the accuracy of detection can drop if aspects of the program, such as the genre or production style, differ from what was assumed. There have also been technologies aimed at improving the robustness of these methods using machine learning9), but these still have issues in areas such as computational cost and detection accuracy.

In this article, we propose a new subtitle detection technology with a framework in which a new image feature value, called edge-space patterns, is combined with machine learning. Edge-space patterns are feature values that reflect the arrangements of lines and points that form characters, and can be used to discriminate between subtitle regions and other regions. Machine learning using a support vector machine (SVM)*4 is employed in this discrimination. Evaluation tests using real television program video were carried out to verify the effectiveness of this technology.

2. Subtitle region detection technology

Figure 1 gives an overview of the proposed technology. The input is a single-frame image extracted from television program video [Fig. 1 (a)].


*1 In this article, "subtitle" refers to subtitle text or "open captions" (which are always displayed).

*2 Technology that reads text written in an image or on paper using a scanner or other means, and converts it to text coding that can be used on computers.

*3 Places where the brightness changes suddenly.

*4 Technology that infers labels for unknown input data using a discriminator that was tuned using training data that is already labeled.


A small region called a scan window is first decided, and then this region is moved a little at a time, each time determining whether the scan window is part of a subtitle area or not [Fig. 1 (b)]. The SVM is used to make this determination. Then, groups of more than a certain number of overlapping subtitle scan window positions are selected [Fig. 1 (c)]. Finally, the bounding boxes of the selected areas (smallest rectangle containing the area) are computed, and areas with low edge density*5 or small size are excluded [Fig. 1 (d)]. These processes are described in more detail below.
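As an illustration, the overall flow in Fig. 1 can be summarized by the following minimal Python sketch. The helper functions is_candidate and edge_density, the SciPy-based connected-component step, and the omission of the width and height thresholds (both 0 in Table 2) are assumptions made for illustration, not the exact implementation.

import numpy as np
from scipy import ndimage

def detect_subtitle_regions(frame, is_candidate, edge_density,
                            win=16, step=4, t_ov=2, t_rd=380.0, t_ra=700.0):
    """Sketch of Fig. 1: frame is a 2-D grayscale array; is_candidate is the
    SVM-based test of Fig. 1 (b); edge_density follows footnote *5."""
    h, w = frame.shape
    votes = np.zeros((h, w), dtype=np.int32)             # s(x, y): scan window overlap counts
    for y in range(0, h - win + 1, step):                # (b) slide the scan window
        for x in range(0, w - win + 1, step):
            if is_candidate(frame[y:y + win, x:x + win]):
                votes[y:y + win, x:x + win] += 1
    mask = votes >= t_ov                                 # (c) keep pixels with enough overlaps
    labels, _ = ndimage.label(mask)                      # (d) contiguous regions -> bounding boxes
    boxes = []
    for sl in ndimage.find_objects(labels):
        ys, xs = sl
        area = (ys.stop - ys.start) * (xs.stop - xs.start)
        if edge_density(frame[sl]) > t_rd and area > t_ra:
            boxes.append((xs.start, ys.start, xs.stop, ys.stop))
    return boxes                                         # (e) detected subtitle regions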

2.1 Detecting candidate regions using a scan window

In this section, we describe the detection of candidate regions using a scan window, as shown in Fig. 1 (b). Details of the process are shown in Fig. 2. First, whether the edge density in the scan window region is greater than threshold Tsd is determined. Then the SVM is used to discriminate whether the scan window region is a subtitle candidate region. This discrimination is made using five types of image feature value: the edge-space patterns we devised, edge-orientation histograms12), edge-magnitude histograms, local binary patterns13) and Haar wavelet textures. Each of these feature values is described in more detail below.
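A minimal sketch of this per-window test is shown below, with scikit-learn's SVC standing in for the SVM and the five feature extractors passed in as placeholder functions; these helper names are assumptions for illustration only.

import numpy as np
from sklearn.svm import SVC

T_SD = 380.0  # edge-density threshold Tsd from Table 2

def window_features(window, extractors):
    """Concatenate the five feature vectors (ESP, EOH, EMH, LBP, HWT)."""
    return np.concatenate([f(window) for f in extractors])

def make_is_candidate(svm: SVC, extractors, edge_density):
    """Build the test applied at each scan window position (Fig. 2)."""
    def is_candidate(window):
        if edge_density(window) <= T_SD:                  # gate on edge density before the SVM
            return False
        x = window_features(window, extractors).reshape(1, -1)
        return svm.predict(x)[0] == 1                     # 1 = subtitle candidate region
    return is_candidate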

(1) Edge-space patterns

Edge-space patterns are a feature value that reflects the arrangement of lines and points that form characters.

Figure 1: Overview of proposed method. (a) Input frame image (scan window); (b) detect candidate regions using scan window; (c) discriminate based on scan window overlap; (d) discriminate based on density and rectangle size; (e) detection result.

*5 The edge intensity per pixel in a given area. Edge intensity is the change in brightness at an edge.

Figure 2: Procedure of candidate region detection using scan windows [Process in Fig. 1 (b)]. The flow is: scan window → discrimination by edge density → discrimination by machine learning with a support vector machine (SVM) using edge-space patterns (ESPs), an edge-orientation histogram (EOH), an edge-magnitude histogram (EMH), local binary patterns (LBP), and Haar wavelet textures (HWTs) → consolidation → discrimination of candidate region or not.


Each of the pixels in the scan window is first labeled as “edge” if the edge intensity is greater than threshold Tbi, and “nonedge” otherwise. The edge intensity is computed using a Sobel filter*6.

Then, for each pixel (called the "focus pixel"), the edge-space pattern, ESP8,n, based on the labels of n of the surrounding eight pixels, is computed. The example for n=1 is shown in Fig. 3. If the focus pixel is an edge, then, depending on whether a particular one of the eight neighboring pixels is a nonedge or not, the corresponding element of the vector is set to 1 or 0, respectively. There are eight ways of selecting one pixel from the surrounding eight pixels, and one further dimension is added for the case in which the focus pixel is a nonedge, resulting in a feature vector with nine dimensions. We refer to the edge-space pattern for the focus pixel at (x, y) as ESP8,1(x, y).

In Fig. 3, the positional relationships among edge and nonedge pixels neighboring the focus pixel are expressed using nine patterns. If the pixel set matches a pattern, the corresponding vector element is set to 1. The list of patterns for n=2 is shown in Fig. 4. In our technique, ESP8,n for n=1 to 5 is computed for each pixel in the scan window region, and the averages, ESP8,1 ... ESP8,5, are concatenated to obtain the feature value.
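The following Python sketch shows one reading of ESP8,1 for a single scan window: eight elements for an edge focus pixel with a particular non-edge neighbor, plus one element for a non-edge focus pixel, averaged over the window. It is illustrative only; the extension to n=2 to 5 and the border handling are assumptions.

import numpy as np

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def esp_8_1(edge_map):
    """edge_map: 2-D boolean array (True = edge pixel) for one scan window."""
    h, w = edge_map.shape
    acc = np.zeros(9)
    count = 0
    for y in range(1, h - 1):                  # skip the window border for simplicity (assumption)
        for x in range(1, w - 1):
            v = np.zeros(9)
            if edge_map[y, x]:
                for i, (dy, dx) in enumerate(NEIGHBORS):
                    if not edge_map[y + dy, x + dx]:       # that neighbor is a non-edge pixel
                        v[i] = 1.0
            else:
                v[8] = 1.0                                  # extra dimension for a non-edge focus pixel
            acc += v
            count += 1
    return acc / max(count, 1)                              # averaged 9-dimensional ESP_{8,1}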

(2) Edge-orientation histogram

The edge-orientation histogram12) is a feature value based on the distribution of edge orientations. First, the edge orientation for each pixel in the scan window region is quantized into k steps, and for each orientation, the sum of the edge intensities of all pixels, Ek, is computed. The edge orientation and intensity for each pixel are computed using a Sobel filter.

The edge-orientation histogram, EOHk1,k2, is defined by a ratio of Ek1 to Ek2. The proposed technique quantizes edge orientations into eight directions and computes 64 histograms, EOHk1,k2, for k1=1 to 8 and k2=1 to 8.
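A sketch of this 64-dimensional feature is given below. The treatment of orientation as unsigned (folded into [0, π)) and the small epsilon that guards the ratios are assumptions made for illustration.

import numpy as np
from scipy import ndimage

def edge_orientation_histogram(window, bins=8, eps=1e-6):
    """EOH sketch: Sobel orientation quantized into 8 bins, E_k summed per bin,
    and all 64 ratios E_k1 / E_k2 returned as the feature vector."""
    gx = ndimage.sobel(window.astype(float), axis=1)
    gy = ndimage.sobel(window.astype(float), axis=0)
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)                    # unsigned orientation in [0, pi)
    k = np.minimum((angle / np.pi * bins).astype(int), bins - 1)
    e = np.array([magnitude[k == i].sum() for i in range(bins)])
    return (e[:, None] / (e[None, :] + eps)).ravel()             # EOH_{k1,k2} for k1, k2 = 1..8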

(3) Edge-magnitude histogram

The edge-magnitude histogram is a feature value based on the distribution of edge magnitudes. The edge magnitude for each pixel in the scan window is quantized into k levels, and for each level, the frequency of occurrence (the number of pixels at the kth level), EMHk, is computed. In the proposed technique, edge magnitudes are quantized into 16 levels and edge-magnitude histograms, EMHk, for k=1 to 16 are computed.
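A corresponding sketch for the 16-level edge-magnitude histogram follows; the upper bound of the magnitude range used for quantization is an assumption.

import numpy as np
from scipy import ndimage

def edge_magnitude_histogram(window, levels=16, max_magnitude=1442.5):
    """EMH sketch: per-pixel Sobel magnitude quantized into 16 levels,
    counting the number of pixels at each level (EMH_k, k = 1..16).
    max_magnitude is an assumed bound for 8-bit input."""
    gx = ndimage.sobel(window.astype(float), axis=1)
    gy = ndimage.sobel(window.astype(float), axis=0)
    magnitude = np.hypot(gx, gy)
    k = np.minimum((magnitude / max_magnitude * levels).astype(int), levels - 1)
    return np.bincount(k.ravel(), minlength=levels).astype(float)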

Figure 3: Example of computing edge-space pattern, ESP8,1 (for n=1). The figure shows the nine patterns (legend: edge pixel, non-edge pixel, X = don't care), an input example around a focus pixel, and the resulting feature vector (0 1 1 0 1 0 1 1 0); an element is 1 if the corresponding pattern matches.

Figure 4: List of patterns for edge-space pattern ESP8,2 (for n=2) (legend: edge pixel, non-edge pixel, X = don't care).

*6 A method for computing the intensity and orientation of edges from the changes in brightness in the x and y directions, computed from multiple surrounding pixels.


(4) Local binary pattern

Local binary patterns13) are a feature value based on whether the focus pixel has higher or lower brightness than the surrounding pixels. The local binary pattern LBPp,r, computed from p pixels on a circumference of radius r, is defined as follows:

LBPp,r(x, y) = Σ(i=1 to p) δp,r(xi, yi)   if Up,r(x, y) ≤ 2
LBPp,r(x, y) = p + 1                      otherwise

Here, the function δp,r is 1 if the pixel at (x+xi, y+yi), near the focus pixel at (x, y), has greater brightness than the focus pixel, and 0 otherwise. Also, xi, yi for i=1 to p give the coordinates of the ith point on the circumference. Up,r represents the total number of times the value of δp,r changes between 0 and 1 around the circumference. In the proposed technique, LBPp,r ({p,r} = {8,1}, {16,2}, {23,4}) for each pixel in the scan window is computed, as well as a histogram of these frequencies*7, and this histogram is used as the feature value.
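A sketch using scikit-image is shown below; its 'uniform' method follows the same uniform-pattern rule as the equation above (the summed δ values when Up,r ≤ 2, and p+1 otherwise). Normalizing each histogram is an assumption for illustration.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histograms(window, configs=((8, 1), (16, 2), (23, 4))):
    """LBP sketch: compute LBP_{p,r} for each pixel and build the occurrence
    histogram (*7) for each (p, r) pair, then concatenate them."""
    feats = []
    for p, r in configs:
        codes = local_binary_pattern(window, P=p, R=r, method='uniform')  # values 0..p+1
        hist = np.bincount(codes.astype(int).ravel(), minlength=p + 2).astype(float)
        feats.append(hist / hist.sum())
    return np.concatenate(feats)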

(5) Haar wavelet textures

Haar wavelet textures are a feature value based on spatial frequencies in the image. The Haar wavelet transform*8 is first applied to the scan window for three iterations. Then, in each of the sub-band regions, the brightness variance is computed, and these variances are concatenated to obtain the feature value.
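The sketch below illustrates this feature with a plain NumPy Haar decomposition; the unnormalized averaging/differencing filters and the inclusion of the final approximation band are assumptions.

import numpy as np

def haar_step(a):
    """One 2-D Haar step: returns the approximation band and the LH, HL, HH detail bands."""
    a = a[:a.shape[0] // 2 * 2, :a.shape[1] // 2 * 2].astype(float)
    s = (a[0::2, :] + a[1::2, :]) / 2.0           # vertical averages
    d = (a[0::2, :] - a[1::2, :]) / 2.0           # vertical differences
    ll = (s[:, 0::2] + s[:, 1::2]) / 2.0
    lh = (s[:, 0::2] - s[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, (lh, hl, hh)

def haar_wavelet_texture(window, levels=3):
    """HWT sketch: three Haar iterations, variance of each sub-band, concatenated."""
    feats, a = [], window
    for _ in range(levels):
        a, bands = haar_step(a)
        feats.extend(float(np.var(b)) for b in bands)
    feats.append(float(np.var(a)))                # final approximation band (assumption)
    return np.array(feats)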

2.2 Discrimination by scan window overlap

In this section, we describe the filtering of candidate regions on the basis of scan window overlap, as shown in Fig. 1 (c). Scan windows falsely selected as candidate regions tend to be solitary or scattered in small numbers. The procedure for excluding false detections is shown in Table 1. The number of scan window overlaps, s(x, y), at the focus pixel (x, y) is maintained for each pixel, and when the scan has been completed for the entire frame, pixels with s(x, y) below the threshold Tov are excluded as false detections.

2.3 Discrimination by edge density and rectangle size

Finally, we describe discrimination based on edge density and rectangle size, as shown in Fig. 1 (d). First, the sets of pixels from the process presented in Section 2.2 are divided into contiguous regions (neighboring pixels are gathered into regions), and then the bounding box of each region is derived. If the edge density and size of the bounding box region exceed threshold values, the region is determined to be a subtitle region. The conditions for designating a bounding box region, R, as a subtitle region are given below.

(D(R) > Trd) and (|R| > Tra) and (|R|w > Trw) and (|R|h > Trh)

Here, D(R) is the edge density in bounding box region R, |R| is the area, |R|w the width, and |R|h the height. Trd, Tra, Trw, and Trh are the corresponding threshold values.

This concludes the explanation of the process for detecting subtitle regions in the input frame image.

3. Evaluation testing

3.1 Test video and threshold settings

To verify the utility of our technology, we evaluated it with tests using television program video. The video resolution was 432×240 pixels. Each pixel had a brightness value in the range from 0 to 255. A scan window size of 16×16 pixels was used, and the entire frame was scanned by moving the window in increments of four pixels in both the horizontal and vertical directions. The threshold values used are given in Table 2. We selected these values on the basis of preliminary test results, to balance detection accuracy and processing time.


*7 A histogram with the LBP value on the horizontal axis and the total number of pixels with that value on the vertical axis.

*8 A wavelet transform that separates an input image into sub-bands using the sums and differences of the brightness of neighboring pixels.

  F: all pixels in the frame image
  for all (x, y) ∈ F do s(x, y) ← 0
  while scan of the whole frame is not complete do begin
    Shift scan window S
    if S is a subtitle candidate region then begin
      for all (x, y) ∈ S do s(x, y) ← s(x, y) + 1
    end
  end
  for all (x, y) ∈ F do begin
    if s(x, y) < Tov then remove pixel (x, y) from the candidates
  end.

Table 1: Discrimination procedure for candidate regions on the basis of scan window overlap



3.2 Accuracy of discrimination using scan window

We first evaluated the accuracy of the discrimination of candidate regions using the scan window.

(1) Test settings

Video segments from ten programs broadcast on NHK were used for the tests. This included a total of approximately 320 minutes of video in various genres such as news, documentary, and drama. One frame was taken every second from each program video, and random partial images of the same size as the scan window were extracted from the images.

When extracting the partial images, only those with edge density exceeding the threshold Tsd were selected, in accordance with the process described in Section 2.1. A total of 9,500 subtitle regions (examples) and 9,500 non-subtitle regions (counterexamples) were extracted. Of these, 7,000 examples and 7,000 counterexamples were used as training data, and the rest were used as test data. The following expression for discrimination accuracy was used as an index for evaluation.

(2) Test results

We compared discrimination accuracy among the feature values from Section 2.1, namely, edge-space pattern (ESP), edge-orientation histogram (EOH), edge-magnitude histogram (EMH), local binary pattern (LBP), and Haar wavelet texture (HWT).

The test results are shown in Table 3. The accuracy values for individual feature values are given in the first five lines of the table. Of these, our ESP feature value yielded the highest accuracy, 81.3%. The sixth line gives the results of combining all of the feature values. The discrimination accuracy in this case was 94.1%, the highest of all the cases.

3.3 Subtitle detection accuracy

We then evaluated the accuracy for detecting subtitle regions.

(1) Test settings

Video segments from 23 programs broadcast on NHK were used for these tests. A total of approximately ten hours of video from various genres was used. The data was completely different from that used for the tests described in Section 3.2. One frame image was extracted every second from the program video and used as test data; a total of 37,457 frame images were extracted. Subtitle bounding box regions were manually annotated in these frames and used as the ground truth, which contained a total of 23,216 subtitle regions. To evaluate the detection accuracy, the following values for recall rate, precision, and F score were used.

Here, Ncrr is the number of subtitle regions correctly detected by our technology, Nmiss is the number of subtitle regions not detected by our technology, and Nerr is the number of regions incorrectly detected as subtitle regions by our technology. The recall rate reflects how few of the true subtitle regions were missed, while the precision reflects how few of the detected regions were false detections. The F score is an index that combines recall and precision, balancing the tradeoff between them. For each of these indices, higher values indicate higher accuracy.

Discrimination accuracy = (Number of correctly discriminated test cases) / (Total number of test cases)

Recall rate = Ncrr / (Ncrr + Nmiss)

Precision = Ncrr / (Ncrr + Nerr)

F score = (2 × Recall rate × Precision) / (Recall rate + Precision)
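For reference, the following small Python helper evaluates these three measures; the example call uses the counts for the proposed technique from Table 4.

def detection_scores(n_crr, n_miss, n_err):
    """Recall rate, precision, and F score as defined above."""
    recall = n_crr / (n_crr + n_miss)
    precision = n_crr / (n_crr + n_err)
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score

# Proposed technique (Table 4): Ncrr = 20,871, Nmiss = 23,216 - 20,871, Nerr = 23,717 - 20,871
# detection_scores(20871, 2345, 2846) -> approximately (0.899, 0.880, 0.889)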

  Threshold   Value
  Tsd         380.0
  Tbi         600.0
  Tov           2.0
  Trd         380.0
  Tra         700.0
  Trw           0.0
  Trh           0.0

Table 2: Threshold settings

  #  Feature value                 Discrimination accuracy
  1  ESP                           81.3% (4,066/5,000)
  2  EOH                           71.4% (3,570/5,000)
  3  EMH                           70.0% (3,500/5,000)
  4  LBP                           78.6% (3,929/5,000)
  5  HWT                           79.8% (3,988/5,000)
  6  ESP + EOH + EMH + LBP + HWT   94.1% (4,706/5,000)

Table 3: Comparison of scan window discrimination accuracy


*9 The width (thickness) of lines comprising characters.

  Method               Recall rate             Precision               F score
  Epshtein10)          27.3% ( 6,338/23,216)   30.8% ( 6,338/20,578)   0.289
  Takao5)              75.4% (17,505/23,216)   60.3% (17,505/29,030)   0.670
  Minetto11)           63.8% (14,811/23,216)   92.3% (14,811/16,047)   0.754
  Proposed technique   89.9% (20,871/23,216)   88.0% (20,871/23,717)   0.889

Table 4: Accuracy of subtitle region detection

  Method               Average processing time per frame
  Epshtein10)          0.133 s
  Takao5)              0.050 s
  Minetto11)           21.239 s
  Proposed method      0.275 s

Table 5: Comparison of processing times

(2) Comparison of techniques

To verify the utility of our technology, we compared it with earlier techniques. These included a traditional technique reported by Takao et al.5) (Takao), and two relatively new techniques reported by Epshtein et al.10) and Minetto et al.11) (Epshtein and Minetto, respectively).

Takao is a subtitle region detection technique based on static edge density. Epshtein is a technique based on character stroke width*9. For our tests, we used a program created by Kumar and Perrault14). Minetto is a text region detection technique in which an SVM discriminator and feature values based on Fourier transforms are used. For our tests, we used a program created by Minetto15).

(3) Test results

The test results are shown in Table 4. In the results, our technology exhibited recall of 89.9%, precision of 88.0%, and an F score of 0.889. The recall and F score were the highest among the techniques tested. Evaluations for individual television programs were also good, with almost all programs yielding F scores of 0.7 or greater, showing robust behavior spanning various genres including news, documentary, and variety. The F score indicated particularly high accuracy for news programs, with values of 0.95 and greater. Subtitles in news programs usually have clear boundaries between subtitle and background regions, so this may have increased the detection accuracy. On the other hand, fine patterns in performers' costumes and in crowd scenes in drama programs resulted in more false detections.

Having the second-highest F score after our technology, Minetto had recall of 63.8%, precision of 92.3%, and an F score of 0.754. This precision was the highest of all the techniques compared. Minetto includes postprocessing to reduce false positives in addition to the usual discrimination process, which may account for its higher precision. However, such strict discrimination also increases the number of missed regions, reducing the recall rate. This resulted in an F score 0.135 lower than that of our technology.

Takao yielded recall of 75.4%, precision of 60.3%, and an F score of 0.670. It is based on edge density, which is a simple feature value that produces more false detections in regions with high spatial frequency, reducing its precision. It also deals only with edges that appear continuously at the same position, so it cannot detect subtitles that change in position or size over time.

Epshtein yielded recall of 27.3%, precision of 30.8%, and an F score of 0.289. It showed a significant drop in accuracy owing to the difficulty in correctly detecting character stroke width when subtitles are layered over complex backgrounds. Another possible reason for the decrease in accuracy could be that the program implemented by Kumar et al.14) was tuned for images captured using a mobile device.

(4) Comparison of processing time

Table 5 gives the results of comparing processing times. The processing environment consisted of an Intel Xeon X5690 3.47 GHz CPU and 72 GB of memory.

Our technology consumed 0.275 s of computing time per frame. Epshtein and Takao required less processing time than our technique, at 0.133 and 0.050 s/frame, which are approximately 1/2 and 1/6 of the processing time required by our technology, respectively. These techniques are likely faster because they do not incorporate machine-learning discrimination, which is costly. The per-frame processing time for Minetto was 21.239 s. This is approximately 77 times that required by our technology, and is particularly long compared with the other conventional techniques.


Minetto may have required such a large amount of computing resources because it executes many iterations of discrimination using machine learning. This measure reduces false positives, but because the discrimination is performed on many candidate regions, the computing cost is high. It also consumes computing time to consolidate pixels and to extract text candidate regions using the positional relationships among text candidate regions. When processing large amounts of video data, computation cost, in addition to detection accuracy, becomes important. We will continue to improve our technology in the future, taking into consideration the balance between detection accuracy and computation cost.

4. Conclusion

In this article, we introduced a technology for detecting subtitle regions in television program video. The technology uses a new image feature value called edge-space patterns combined with machine learning, enabling robust operation for various types of program video. Edge-space patterns are a feature value that represents the arrangements of lines and points that form characters.

In tests using real broadcast video, the new technology achieved very good results, with a recall rate of 89.9%, precision of 88.0%, and an F score of 0.889. This F score was 0.135 higher than that achieved by a conventional technique. In the future, we will continue our investigation on feature values that enable more accurate discrimination and on ways to reduce false-positive detection.

This article was revised and amended in accordance with the following article appearing in the Journal of the Institute of Image Information and Television Engineers:

Y. Kawai, T. Mochizuki and H. Sumiyoshi: "A Localization Method of Video Text in TV Program Video Using Edge Space Pattern," Journal of the Institute of Image Information and Television Engineers, Vol. 69, No. 5, pp. J197-J206 (2015)

References

1) S. Kiranyaz and M. Gabbouj: "Hierarchical Cellular Tree: An Efficient Indexing Scheme for Content-Based Retrieval on Multimedia Databases," IEEE Trans. Multimedia, Vol. 9, No. 1, pp. 102-119 (2007)

2) T. Sato, T. Kanade, E. K. Hughes, M. A. Smith and S. Satoh: "Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions," ACM Multimedia Syst., Vol. 7, No. 5, pp. 385-395 (1999)

3) N. Babaguchi, Y. Kawai and T. Kitahashi: "Event Based Indexing of Broadcasted Sports Video by Intermodal Collaboration," IEEE Trans. Multimedia, Vol. 4, No. 1, pp. 68-75 (2002)

4) K. Jung, K. I. Kim and A. K. Jain: "Text Information Extraction in Images and Video: A Survey," Pattern Recognition, pp. 977-997 (2004)

5) S. Takao, Y. Ariki and J. Ogata: "Cross Media Passage Retrieval - Access Method to Spoken Documents by Telop and CG Flip Character Strings as Queries -," Trans. IEICE D, Vol. J84-D-II, No. 8, pp. 1809-1816 (2001)

6) H. Arai, H. Kuwano, S. Kurakake and T. Sugimura: "Detection of Telop in Video Data," Trans. IEICE D, Vol. J83-D-II, No. 6, pp. 1477-1486 (2000)

7) Y. Zhong, H. Zhang and A. K. Jain: "Automatic Caption Localization in Compressed Video," IEEE Trans. Pattern Anal. Mach. Intell., Vol. 22, No. 4, pp. 385-392 (2000)

8) Y. Kawai, H. Sumiyoshi and M. Fujii: "A Method for Detecting Display Regions of Open Caption from Single Video Frame," ITE Winter Annual Convention, 3-7 (2010)

9) R. Lienhart and A. Wernicke: "Localizing and Segmenting Text in Images and Videos," IEEE Trans. Circuits Syst. Video Technol., Vol. 12, No. 4, pp. 256-268 (2002)

10) B. Epshtein, E. Ofek and Y. Wexler: "Detecting Text in Natural Scenes with Stroke Width Transform," Proc. IEEE CVPR, pp. 2963-2970 (2010)

11) R. Minetto, N. Thome, M. Cord, N. J. Leite and J. Stolfi: "SnooperText: A Text Detection System for Automatic Indexing of Urban Scenes," Comput. Vision Image Understanding, Vol. 122, pp. 92-104 (2014)

12) K. Levi and Y. Weiss: "Learning Object Detection from a Small Number of Examples: The Importance of Good Features," Proc. IEEE CVPR, Vol. 2, pp. 53-60 (2004)

13) T. Ojala, M. Pietikainen and T. Maenpaa: "Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns," IEEE Trans. Pattern Anal. Mach. Intell., Vol. 24, No. 7, pp. 971-987 (2002)

14) S. Kumar and A. Perrault: "Text Detection on Nokia N900 Using Stroke Width Transform," https://github.com/aperrau/DetectText (2010)

15) R. Minetto: "SnooperText Source Code (Java)," http://www.dainf.ct.utfpr.edu.br/~rminetto/projects/snoopertext/