2011 IEEE First International Conference on Consumer Electronics - Berlin (ICCE-Berlin), Berlin, Germany, 6-8 September 2011

Overlay Text Localization for Full HD Video Applications

Jong-Ju Hong, Yun-Ki Han, Woo-Jin Song, Member, IEEE

Dept. of Electronic and Electrical Engineering Pohang University of Science and Technology, Pohang, 790-784

Korea [email protected]

Junghwan Lee, HeeJung Hong, EuiYeol Oh LG Display, R&D Center

1007, Deogeun-ri, Wollong-myeon, Paju-si, Gyeonggi-do, 413-811 Korea

Abstract—In various full HD video applications, high-level image enhancement requires localization of overlay text. In this paper, we propose a text localization algorithm that uses a region-shrinking multistep process to achieve both high accuracy and fast processing in full HD videos.

Keywords-text detection, region-shrinking multistep processing, full HD image processing

I. INTRODUCTION

Video is an important source of information in modern media. Various kinds of content are displayed in full high definition (HD), i.e., 1920 x 1080 pixels. To aid visual understanding and to emphasize certain content, overlay text may be inserted into a video stream. To achieve high-level image enhancement, the overlay text must be separated from the background image; this separation is built on text detection and localization. In most cases, "text localization" designates both text detection and localization.

Several algorithms to localize text from images and videos have been proposed for text segmentation for video indexing [1], text detection and localization for multilingual video [2], caption localization [3], and other applications [4-8]. Existing algorithms localize the detected text accurately, but they are not appropriate for the latest full HD video format. These methods require huge computational complexity because they repeat several whole-image resolution processes and as a result cannot process full HD videos in real-time.

In this paper, we propose a text localization algorithm that uses a region-shrinking multistep process to achieve both high accuracy and fast processing in full HD videos. First, we characterize the overlay text features of diverse modern video content; these features both guide the algorithm and improve the accuracy of text localization. Then we set an efficient overall text localization strategy based on these overlay text features.

Figure 1. Overall Text Localization Strategy

II. TEXT LOCALIZATION

Our proposed text localization algorithm using region-shrinking multistep processing (Fig. 1) exploits the differences between overlay text and the background image. Overlay text has strong edges, obeys size and stroke restrictions, shows regular alignment in the horizontal or vertical direction, appears in clusters, and persists across consecutive frames of video. The processes of the overall strategy are explained in Sections II-A through II-F.

A. Text Edge Detection

Overlay text can be detected using edge detection, because the intensity of a text pixel is similar to that of neighboring text pixels but differs from that of non-text pixels. We conduct sparse text edge detection in the horizontal and vertical directions using a first-order derivative to obtain fast detection:

Gh(i, j) = |I(i, j + 1) - I(i, j)|,
Gv(i, j) = |I(i + 1, j) - I(i, j)|,   (1)

where Gh and Gv are the gradient images in the horizontal and vertical directions, respectively, i and j are the pixel coordinates, and I is the intensity image. Note that Gh and Gv are

This work was supported in part by the Brain Korea (BK) 21 Program funded by the MEST, in part by LG-Display Co. and in part by the IT R&D program of MCST/IITA (2008-F-031-01), Korea.


978-1-4577-0234-1/11/$26.00 ©2011 IEEE


calculated only on a subset of rows and columns (ch rows and cv columns), chosen according to the desired detection rate. Then, the text edge is determined by thresholding:

TE(i, j) = 1 if max(Gh(i, j), Gv(i, j)) > TTE, and TE(i, j) = 0 otherwise,   (2)

where the threshold TTE is determined by the unimodal thresholding method applied to the gradient histogram [9]. This method successfully detects text edges (Fig. 2).
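As an illustration, the sparse gradient computation of (1) and the thresholding of (2) can be sketched in NumPy. The threshold t_te is passed in directly here (the paper derives TTE by unimodal thresholding of the gradient histogram [9]), and the exact semantics of the ch/cv sampling are an assumption:

```python
import numpy as np

def text_edge_map(I, t_te, ch=1, cv=1):
    """Sparse first-order text edge detection, a sketch of (1)-(2).

    I      : 2-D intensity image
    t_te   : edge threshold (assumed given; the paper derives it [9])
    ch, cv : row/column sampling strides (assumed semantics)
    """
    I = I.astype(np.float32)
    gh = np.zeros_like(I)
    gv = np.zeros_like(I)
    # Horizontal gradient |I(i, j+1) - I(i, j)|, on sampled rows only.
    gh[::ch, :-1] = np.abs(I[::ch, 1:] - I[::ch, :-1])
    # Vertical gradient |I(i+1, j) - I(i, j)|, on sampled columns only.
    gv[:-1, ::cv] = np.abs(I[1:, ::cv] - I[:-1, ::cv])
    # Text edge wherever either gradient exceeds the threshold, eq. (2).
    return (np.maximum(gh, gv) > t_te).astype(np.uint8)
```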

Figure 2. Text Edge Detection: (a)RGB Image; (b)Intensity Image; (c)Text Edge Image

B. Text Block Formulation

We formulate blocks composed of M rows and N columns of pixels. Then, for each block, we count the number of text edges (NTEs) obtained using (2). The NTE of the block in the m-th block row and n-th block column is denoted by NTE(m, n):

NTE(m, n) = Σ over pixels (i, j) in block (m, n) of TE(i, j).   (3)
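The block-wise count of (3) can be sketched as a tiled sum, assuming for simplicity that the image dimensions are divisible by the block size:

```python
import numpy as np

def nte_blocks(te, M, N):
    """Count text edges per M x N block, a sketch of eq. (3).

    te must have dimensions divisible by M and N for this simple reshape.
    """
    H, W = te.shape
    assert H % M == 0 and W % N == 0
    # Partition the edge map into (M, N) tiles and sum each tile.
    return te.reshape(H // M, M, W // N, N).sum(axis=(1, 3))
```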

C. Multistep Thresholding for Text Block Selection

To select text blocks using NTE(m, n), we employ multiple thresholds. In preliminary experiments, using only one threshold caused many misses and false alarms because NTEs are widely distributed within the overlay text region. In particular, some non-text regions have NTEs similar to those of text regions and can be falsely identified as text blocks (Fig. 3, yellow rectangle). Furthermore, small NTEs may occur near the edges of text regions, and these blocks can be missed if only a single threshold is applied (Fig. 3, red rectangle).

Thus, we propose a multistep NTE thresholding algorithm (Fig. 4) that exploits the overlay text features and improves the accuracy of text block selection. In step 1, simple block selection with a threshold TNTE is used. In step 2, thresholding with TNTE/2 is applied around the blocks selected in step 1. Finally, in step 3, thresholding with TNTE/4 is applied if neighboring blocks are already selected and more than two blocks were identified in the previous steps.
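A minimal sketch of the three-step cascade follows. It uses the thresholds TNTE, TNTE/2, and TNTE/4 from the text, but simplifies the paper's neighborhood rules to plain 8-adjacency, so it is illustrative rather than the exact procedure of Fig. 4:

```python
import numpy as np

def multistep_select(nte, t_nte):
    """Three-step cascade over the block NTE map (simplified Fig. 4).

    Step 1 keeps blocks with NTE > T_NTE; steps 2 and 3 relax the
    threshold to T_NTE/2 and T_NTE/4, but only for blocks that are
    8-adjacent to an already selected block.
    """
    def has_selected_neighbor(b):
        # Count selected blocks among the 8 neighbours of each block.
        p = np.pad(b.astype(np.int32), 1)
        neigh = (p[:-2, :-2] + p[:-2, 1:-1] + p[:-2, 2:] +
                 p[1:-1, :-2] +               p[1:-1, 2:] +
                 p[2:, :-2] + p[2:, 1:-1] + p[2:, 2:])
        return neigh > 0

    b = nte > t_nte                                          # step 1
    b = b | ((nte > t_nte / 2) & has_selected_neighbor(b))   # step 2
    b = b | ((nte > t_nte / 4) & has_selected_neighbor(b))   # step 3
    return b.astype(np.uint8)
```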

Figure 3. NTE Distribution: (a) RGB Image and Target Regions (Yellow and Red Rectangles); (b) NTE Distribution of the Yellow (top) and Red (bottom) Regions in (a)

The accuracy of multistep NTE thresholding is influenced by the NTE distribution and TNTE. Determining these two factors is not difficult because they are related to the block size and to the threshold used in text edge detection, TTE. Intuitively, a larger block includes more text edge candidates, whereas a higher value of TTE reduces the number of detected text edges.




Figure 4. Overall Multistep Thresholding for Text Block Selection

Text blocks are selected stepwise (Fig. 5): in step 1, the blocks most likely to contain overlay text are selected; in steps 2 and 3, additional likely blocks are selected around the blocks chosen in the previous steps.

Figure 5. Multistep Thresholding at Step 1 and Step 2

D. Stroke Constrained Text Block Sifting

Text consists of a limited number of strokes, which appear as solid lines in an image. Thus characters composed of strokes produce fewer solid lines than a complicated background does. This stroke restriction can be applied to sift non-text blocks out of the candidate text blocks selected in Section II-C. Because sifting is applied only to these few candidate blocks, it requires comparatively little processing time.

Figure 6. Stroke Extraction Algorithm

Figure 7. Stroke Constrained Block Sifting: (a)RGB Image and Target Regions (Yellow and Red Rectangles); (b) Selected Blocks on the Text Edge

Image of Regions in (a); (c) Blocks remaining after Block Sifting on the Stroke Extracted Image of Regions in (a)


Thresholding rules of Fig. 4:

Step 1:
B1(m, n) = 1 if NTE(m, n) > TNTE; 0 elsewhere.

Step 2:
B2(m, n) = 1 if B1(m, n) = 1, or if NTE(m, n) > TNTE/2 and max over (m', n') of B1(m', n') = 1; 0 elsewhere,
where m' ∈ {m, m ± 1}, n' ∈ {n, n ± 1}, (m', n') ≠ (m, n).

Step 3:
B3(m, n) = 1 if B2(m, n) = 1, or if NTE(m, n) > TNTE/4 and max over (m'', n'') of B2(m'', n'') = 1; 0 elsewhere,
where m'' ∈ {m, m ± 1}, n'' ∈ {n, n ± 1}, (m'', n'') ≠ (m, n), and (m'', n'') = argmax over (m', n') of NTE(m', n').



The proposed stroke extraction algorithm is based on Canny edge detection [10]. The Canny detector detects edges and represents them as lines one pixel wide. The resultant lines in text regions are strokes of characters, but those in non-text regions are not. The text in a block has a limited number of strokes because the block size defined in Section II-B is fixed at M x N and the size of overlay text is restricted according to the image resolution [11].

The result of Canny edge detection is a binary image: pixels have a value of 1 if they lie on a detected line and 0 otherwise. To extract strokes, connected non-zero pixels are found and joined to form lines (Fig. 6). Each line is a stroke, and the number of strokes Nstroke(m, n) in a block is constrained to sift the block:

B(m, n) = 0 if Nstroke(m, n) > Tstroke.   (4)
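The stroke-count test of (4) can be sketched by grouping 8-connected components in the binary Canny edge image of a block; the simple BFS labeling here stands in for the stroke extraction of Fig. 6:

```python
import numpy as np
from collections import deque

def count_strokes(edge_block):
    """Count 8-connected line components in a binary edge block.

    Each connected set of edge pixels is treated as one stroke,
    standing in for the stroke extraction of Fig. 6.
    """
    H, W = edge_block.shape
    visited = np.zeros((H, W), dtype=bool)
    strokes = 0
    for si in range(H):
        for sj in range(W):
            if edge_block[si, sj] and not visited[si, sj]:
                strokes += 1                  # found a new connected line
                visited[si, sj] = True
                q = deque([(si, sj)])
                while q:                      # flood-fill its pixels
                    i, j = q.popleft()
                    for di in (-1, 0, 1):
                        for dj in (-1, 0, 1):
                            ni, nj = i + di, j + dj
                            if (0 <= ni < H and 0 <= nj < W
                                    and edge_block[ni, nj]
                                    and not visited[ni, nj]):
                                visited[ni, nj] = True
                                q.append((ni, nj))
    return strokes

def sift_block(edge_block, t_stroke):
    """Eq. (4): reject the block if it has more than t_stroke strokes."""
    return 0 if count_strokes(edge_block) > t_stroke else 1
```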

E. Text Block Refinement

Overlay text in an image shows regular alignment and clusters together. Thus, the text blocks obtained in Section II-D are refined using these attributes to improve accuracy. If a block is missing within a clustered text block region, it is reclassified from non-text block to text block. Conversely, if an isolated block has no neighboring text blocks, it is reclassified from text block to non-text block:

B(m, n) = 0 if S(m, n) < TRF1 and B'(m, n) = 1,
B(m, n) = 1 if S(m, n) > TRF2 and B'(m, n) = 0,   (5)

where S(m, n) = Σ (p = -P to P) Σ (q = -Q to Q) B'(m + p, n + q),
and B'(m, n) is the block map identified in the previous stage.

Moreover, the clustered blocks are refined into a rectangular box, because overlay text is inserted in a horizontal or vertical arrangement [2].
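A sketch of the refinement rule in (5): S(m, n) sums B' over a (2P+1) x (2Q+1) window; the thresholds t_rf1 and t_rf2 here are illustrative values, not the paper's:

```python
import numpy as np

def refine_blocks(b_prev, P=1, Q=1, t_rf1=3, t_rf2=6):
    """Neighbourhood-density refinement, a sketch of eq. (5).

    b_prev : binary block map from the previous stage (B').
    t_rf1 < t_rf2 are assumed illustrative thresholds.
    """
    H, W = b_prev.shape
    p = np.pad(b_prev.astype(np.int32), ((P, P), (Q, Q)))
    # S(m, n) = sum of B'(m+p, n+q) over the window.
    s = np.zeros((H, W), dtype=np.int32)
    for dp in range(2 * P + 1):
        for dq in range(2 * Q + 1):
            s += p[dp:dp + H, dq:dq + W]
    b = b_prev.copy()
    b[(s < t_rf1) & (b_prev == 1)] = 0   # isolated block -> non-text
    b[(s > t_rf2) & (b_prev == 0)] = 1   # hole in a cluster -> text
    return b
```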

F. Text Localization in Consecutive Frames

In videos, overlay text appears in consecutive frames. Because the correlation between image signals in consecutive frames is high, the text localization of the current frame is copied from the previous frame if the selected blocks are identical. The fast process requires a single pixel-wise iteration at full resolution, whereas the exact process repeats several iterations only at block resolution (Sections II-D and II-E). Therefore, retaining the location of text blocks across consecutive frames reduces processing time effectively, so the proposed text localization can process full HD videos in real time.
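The frame-reuse strategy can be sketched as follows. Here coarse_fn and localize_fn are hypothetical interfaces standing in for the block selection and the full localization pipeline, and equality of the coarse block maps stands in for the paper's "selected blocks are identical" test:

```python
import numpy as np

def localize_sequence(frames, localize_fn, coarse_fn):
    """Reuse localization across consecutive frames (Section II-F sketch).

    localize_fn : full (exact) localization of one frame -> result
    coarse_fn   : cheap per-frame block map; if it matches the previous
                  frame's map, the previous localization is copied.
    Both callables are hypothetical interfaces, not the paper's API.
    """
    results = []
    prev_coarse = prev_loc = None
    for f in frames:
        coarse = coarse_fn(f)
        if prev_coarse is None or not np.array_equal(coarse, prev_coarse):
            prev_loc = localize_fn(f)   # recompute only when blocks change
        prev_coarse = coarse
        results.append(prev_loc)        # otherwise reuse previous result
    return results
```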

III. EXPERIMENTAL RESULTS

The accuracy of the proposed text localization was evaluated on various videos, including movies, TV shows and animations. Despite the markedly reduced complexity of the proposed algorithm for real-time full HD video processing, overlay text was successfully localized (Fig. 8).

Figure 8. Experimental Results

The reliability of the proposed algorithm was evaluated on full HD video using the miss rate (RM) and false alarm rate (RFA) defined in (6). The proposed algorithm showed comparably low miss and false alarm rates on full HD videos, i.e., the "Avatar Tribe" (normal movie), "Avatar Landing" (movie with complex background) and "One Piece" (animation) scenes (Table I).

RM = (# of Missed Frames) / (# of Text Frames),
RFA = (# of False Alarmed Frames) / (# of Total Frames).   (6)
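Eq. (6) amounts to two ratios; as a check against Table I, the "Avatar Landing" row (0 missed of 282 text frames, 6 false alarms of 573 total frames) gives RM = 0 and RFA of about 1.05%:

```python
def miss_and_false_alarm_rates(n_missed, n_text_frames, n_false, n_total_frames):
    """Accuracy measures of eq. (6)."""
    r_m = n_missed / n_text_frames     # RM  = missed / text frames
    r_fa = n_false / n_total_frames    # RFA = false alarms / total frames
    return r_m, r_fa
```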



TABLE I. EVALUATION OF DETECTION ACCURACY

Scene          | Text/Non-text (Total) Frames | % RM (# of Frames) | % RFA (# of Frames)
Avatar Tribe   | 642/182 (824)                | 0 (0)              | 0.12 (1)
Avatar Landing | 282/291 (573)                | 0 (0)              | 1.05 (6)
One Piece      | 223/231 (454)                | 0 (0)              | 1.1 (5)

IV. CONCLUSION

An overlay text localization algorithm for full HD video has been proposed. The proposed region-shrinking multistep processing for text localization is based on the characterized overlay text features. In full HD video, both high accuracy and fast processing were achieved using the proposed algorithm.

REFERENCES

[1] R. Lienhart and W. Effelsberg, "Automatic text segmentation and text recognition for video indexing," Multimedia Systems, vol. 8, pp. 69-81, 2000.
[2] M. R. Lyu, J. Song, and M. Cai, "A comprehensive method for multilingual video text detection, localization, and extraction," IEEE Trans. Circuits and Systems for Video Technology, vol. 15, no. 2, pp. 243-255, Feb. 2005.
[3] Y. Zhong, H. Zhang, and A. K. Jain, "Automatic caption localization in compressed video," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 4, pp. 385-392, Apr. 2000.
[4] K. Jung, K. I. Kim, and A. K. Jain, "Text information extraction in images and video: a survey," Pattern Recognition, vol. 37, pp. 977-997, 2004.
[5] A. K. Jain and S. Bhattacharjee, "Text segmentation using Gabor filters for automatic document processing," Machine Vision and Applications, vol. 5, 1992.
[6] Y. Zhong, K. Karu, and A. K. Jain, "Locating text in complex color images," Proc. Third International Conference on Document Analysis and Recognition, vol. 1, 1995.
[7] V. Wu, R. Manmatha, and E. M. Riseman, "TextFinder: an automatic system to detect and recognize text in images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 11, pp. 1224-1229, Nov. 1999.
[8] H. Li, D. Doermann, and O. Kia, "Automatic text detection and tracking in digital video," IEEE Trans. Image Processing, vol. 9, no. 1, Jan. 2000.
[9] P. L. Rosin, "Unimodal thresholding," Pattern Recognition, vol. 34, pp. 2083-2096, 2001.
[10] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679-698, 1986.
[11] W. Kim and C. Kim, "A new approach for overlay text detection and extraction from complex video scene," IEEE Trans. Image Processing, vol. 18, no. 2, pp. 401-411, Feb. 2009.


