

Distinguishing Text/Non-Text Natural Images with Multi-Dimensional Recurrent Neural Networks

Pengyuan Lyu, Baoguang Shi, Chengquan Zhang, Xiang Bai
School of Electronic Information and Communications

Huazhong University of Science and Technology
Wuhan, Hubei, China 430074

Email: {lvpyuan, shibaoguang, zchengquan}@gmail.com [email protected]

Abstract—In this paper, we focus on the text/non-text classification problem: distinguishing images that contain text from a large collection of natural images. To this end, we propose a novel neural network architecture, termed Convolutional Multi-Dimensional Recurrent Neural Network (CMDRNN), which distinguishes text/non-text images by classifying local image blocks, taking both region pixels and dependencies among blocks into account. The network is composed of a Convolutional Neural Network (CNN) and a Multi-Dimensional Recurrent Neural Network (MDRNN). The CNN extracts rich, high-level image representations, while the MDRNN analyzes dependencies along multiple directions and produces block-level predictions. By evaluating CMDRNN on a public dataset, we observe improvements over prior art in terms of both speed and accuracy.

I. INTRODUCTION

Text is a dominant information carrier of human civilization. Scene text is ubiquitous, especially in urban areas. It appears on road signs, billboards, product packaging, etc., and provides important cues for image understanding and analysis. Reading text in the wild facilitates numerous real-world applications, e.g. image retrieval, geolocation, and driverless cars. For these reasons, scene text reading has attracted great interest from the computer vision community in the past few years [33], [18], [26], [21]. However, among the large number of images on the internet, only a small portion contain text. Directly applying existing scene text reading methods such as [11], [32], [29], [14] tends to be inefficient, as these methods typically involve exhaustive searches within local image patches or candidate regions. Therefore, in practice, mining textual information from image data requires text/non-text classification as a preprocessing step.

By the definition in [31], an image should be classified as a text image if it contains any text, or a non-text image if no text exists at all. Therefore, the problem of text/non-text classification is different from a general image classification problem, mainly in that scene text may appear at any location and scale. It is also a very challenging problem, because 1) scene text exhibits large variations in font, color and layout; 2) scene text can easily be confused by a machine with backgrounds such as brick walls, fences, and grass; 3) a text/non-text classification algorithm should be efficient enough to deal with large amounts of data.

Some attempts have been made to address the text/non-text classification problem. Several recent approaches distinguish text from non-text in document images. Indermuhle et al. [12] and Vidya et al. [24] each proposed methods for distinguishing text from non-text regions in handwritten documents. In [23], Tran et al. proposed a novel hybrid method for document layout analysis. In [1], the algorithm proposed by Alessi et al. is able to detect whether text is present in an image; however, it can only be used on scanned document images or simple natural images. The methods mentioned above are suitable for document analysis, but perform poorly on natural scenes. As far as we know, Zhang et al. [31] first proposed a method specially designed for classifying scene text/non-text images. In their method, Maximally Stable Extremal Regions (MSER) [17], a CNN [16] and Bag-of-Words (BoW) [4] are combined to address this issue.

The above-mentioned methods classify text/non-text images by classifying all local image patches, each of which is represented and classified independently. However, this scheme ignores the dependencies between local patches. We argue that such dependencies are valuable and should be exploited, since image context is important for distinguishing text from non-text backgrounds. In scene text, a word or sentence usually consists of characters of the same color, font and scale, and it tends to occupy an elongated region, horizontal or vertical, which cannot be fully covered by a single square-shaped patch. By leveraging the dependencies among the patches of a word or sentence, one can better distinguish text from other confusing backgrounds.

Different from previous works, in this paper we propose an algorithm that effectively utilizes spatial context to robustly discriminate text images. To capture the spatial information of text, Recurrent Neural Networks (RNNs) are used in our model; they have been proven effective in scene text recognition, speech recognition and image captioning tasks [20], [9], [28]. We use Multi-Dimensional Recurrent Neural Networks (MDRNNs) [8] in order to capture context along multiple dimensions. Specifically, we construct a deep neural network that combines the advantages of CNNs and MDRNNs, named Convolutional Multi-Dimensional Recurrent Neural Network (CMDRNN). An overview of the network architecture is shown in Fig. 2.


Fig. 1. Examples of text/non-text images from various scenarios. (a) Text images contain text; text in natural images varies significantly in size, color, font, direction, location, and layout. (b) Non-text images contain no text at all. Many non-text regions in natural images (such as brick walls, branches, and grass) have texture similar to that of text regions.

We train this network end-to-end, using only natural images and their corresponding labels. As shown in Fig. 2, the CNN acts as a feature extractor, which automatically learns high-level features of text and non-text regions. The MDRNN then plays an important role as a decoder, accessing contextual information along multiple dimensions. In this way, we obtain the probability distribution over text/non-text for every block of the input image.

The contributions of this paper are as follows. Firstly, we propose a text image classification method that effectively uses the spatial characteristics of scene text and is robust and effective at discriminating text images. We evaluate our method on a challenging dataset of natural images, showing that it achieves state-of-the-art performance in both speed and accuracy. This means the method can easily be deployed as a non-text filter before many vision applications, with small time cost and high accuracy. Secondly, different from previous methods for text image discrimination, our algorithm predicts not only a result for the full image but also one for every block of the input image, and these block-level predictions may be further used for text detection in the future.

The details of our method are presented in Section II. In Section III, we verify our method on a public benchmark. Conclusions are drawn in Section IV.

II. METHODOLOGY

In this section, we describe the proposed method for text image classification in detail. The model consists of two parts: the CNN as an encoder and the MDRNN as a decoder. At the bottom of the CNN, the model takes a natural image as input, and the convolutional layers automatically extract 2-dimensional feature maps from the input image. Every pixel in the feature maps extracted by the CNN corresponds to a region of the input image. On top of the convolutional neural network, a multi-dimensional recurrent neural network is built in order to decode the 2-dimensional feature maps and make a prediction for each pixel of the feature maps output by the convolutional layers. The MDRNN outputs a sequence $Y = \{y_{11}, \dots, y_{H'W'}\}$, where $y$ represents the predicted label of each pixel in the 2-dimensional feature maps, the subscript of $y$ denotes its location, and $H'$ and $W'$ represent the height and width of the 2-dimensional feature maps, respectively. Though the model is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained end-to-end with one loss function.

A. Convolutional Network

In our method, the CNN acts as an encoder and plays the role of image feature extractor. CNNs are currently state-of-the-art and have been widely used and studied for many image tasks, such as object recognition [15], [2] and detection [7], [6]. We use a deep convolutional network in our model, constructed from convolutional and max-pooling layers; a batch normalization layer [13] is also used in the network. The architecture of the convolutional layers is similar to VGG [22] (we use fewer convolutional layers than VGG16 for higher efficiency), as VGG has state-of-the-art performance. The CNN is used to extract a feature representation from an input image. All images need to be scaled to the same size before being fed into the CNN. A scaled input image of shape $H \times W$ thus gives rise to a tensor of features of shape $C \times H' \times W'$, where $C$ is the number of feature maps and $H'$ and $W'$ are the height and width of the feature maps, respectively. Specifically, the features can be seen as a 2-dimensional map $X$ consisting of $H' \times W'$ feature vectors $\{x_{11}, x_{12}, \dots, x_{ij}, \dots, x_{H'W'}\}$.


Fig. 2. Model overview (Input image → DCNN → Multi-Direction 2-Dim LSTM → Output). The architecture consists of two parts: 1) convolutional layers, which automatically extract 2-dimensional feature maps from the input image; 2) multi-dimensional recurrent layers, which capture spatial context in two dimensions and predict a label distribution for each block. The model can be trained end-to-end with gradient descent.

This means the feature vector $x_{mn}$ is the concatenation of the pixels at the $m$-th row and $n$-th column of all the feature maps.

The CNN is translation invariant, as the convolutional and max-pooling layers are applied to local regions. Therefore, each feature vector in the 2-dimensional map corresponds to a rectangular region of the original image. In addition, every rectangular region in the original image has the same relative location as its feature vector in the 2-dimensional map. Thus, each vector in the 2-dimensional map can be considered the image descriptor of its corresponding region.
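To make this correspondence concrete, here is a minimal illustrative sketch (in Python) of the mapping from a feature-vector position to its image block, assuming the 640 × 640 input and the six stride-2 max-pooling layers listed later in Table I (a total stride of 64) and ignoring receptive-field overlap; the function name is ours, not part of the paper's method.

def block_region(m, n, total_stride=64):
    """Return (top, left, bottom, right) pixel coordinates of the image block
    described by the feature vector at grid position (m, n).
    Assumes a total down-sampling stride of 64 (six 2x2 max-pooling layers)
    and ignores receptive-field overlap for simplicity."""
    return (m * total_stride, n * total_stride,
            (m + 1) * total_stride, (n + 1) * total_stride)

# With a 640 x 640 input the feature map is 10 x 10; the vector at (2, 3)
# describes the block spanning rows 128-192 and columns 192-256.
print(block_region(2, 3))  # (128, 192, 192, 256)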

The output of the CNN is fed into the MDRNN as input and decoded by it.

B. Multi-Dimensional Recurrent Neural Networks

A deep multi-directional, multi-dimensional recurrent neural network is built on top of the CNN as the recurrent layers. The choice of MDRNN in our model is governed by its ability to access contextual information in multi-dimensional domains, which has been proved effective for image segmentation and object detection in [3], [8]. Text in natural images has good local geometric continuity, so contextual information is important for text image classification. Thus, the MDRNN is used as a decoder to capture the spatial context of text/non-text regions and to predict a label distribution $y_{mn}$ for each feature vector $x_{mn}$ in the 2-dimensional map $X = \{x_{11}, x_{12}, \dots, x_{H'W'}\}$. Hence, $y_{mn}$ is the text/non-text probability of $x_{mn}$ and of the corresponding block in the original image.

Long Short-Term Memory (LSTM) is a type of RNN unit specially designed by Hochreiter and Schmidhuber [10] and Gers et al. [5] to overcome the problem of vanishing gradients. An LSTM unit is made up of a memory cell and three multiplicative gates: the input gate, the output gate and the forget gate. As the names imply, the memory cell stores past context, while the input gate and output gate control the input to and output from the memory cell.

Fig. 3. 2-Dim LSTM unit. $i$, $o$ and $f$ denote the input gate, output gate and forget gates, respectively; $c$ is the memory cell and $u$ the updating term. The concentric circles represent multiplication by a gate value.

In addition, the memory in the memory cell can be cleared by the forget gate. The effect of the gates is to allow the cells to store and access information over long periods of time compared with the standard RNN, thereby overcoming the vanishing gradient problem.

A standard LSTM contains only a single forget gate, which means it can only access context information along one dimension. To access context information in multiple dimensions, we can easily extend the standard LSTM to a two-dimensional LSTM by using two forget gates in an LSTM unit, as shown in Fig. 3.


Fig. 4. Left: An input image. Right: A 2-dimensional RNN captures spatial context in two dimensions. The pink dots represent features of non-text blocks, and the green dots represent features of text blocks.

Moreover, a multi-dimensional LSTM can access multi-directional context when it contains multiple separate hidden layers that process the input feature map along multiple directions. The multiple hidden layers are connected to a single output layer, thereby providing the network with access to multi-directional context.

As shown in Fig. 4, the output of the CNN can be processed by an MDLSTM as follows:

$h_{mn} = \mathrm{MDLSTM}(x_{mn}, h_{m-1,n}, h_{m,n-1})$   (1)

$y_{mn} = \mathrm{SoftMax}(W h_{mn})$   (2)

Here, MDLSTM stands for the Multi-Dimensional Long Short-Term Memory unit, which takes the feature vector $x_{mn}$ and the memory states $h_{m-1,n}$ and $h_{m,n-1}$ as input and outputs $h_{mn}$. The matrix $W$ contains trained parameters, and $y_{mn}$ is the probability distribution over text/non-text. Note that the formulation in (1) corresponds to an MDLSTM whose scanning direction is from the top left to the bottom right.
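The recurrence in Eq. (1)-(2) can be illustrated with the following PyTorch sketch of a single-direction (top-left to bottom-right) 2-dimensional LSTM. This is a minimal illustration, not the paper's exact implementation: the class and function names are ours, the gate layout follows Fig. 3 (input gate, output gate, two forget gates, updating term), and the full model described in Section III-B uses four scanning directions with 256 hidden units.

import torch
import torch.nn as nn


class MDLSTMCell(nn.Module):
    """2-D LSTM cell with two forget gates, one per incoming spatial direction (Fig. 3)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # One affine map produces the input gate i, output gate o,
        # the two forget gates f1/f2 and the cell update u.
        self.affine = nn.Linear(input_size + 2 * hidden_size, 5 * hidden_size)

    def forward(self, x_mn, h_left, c_left, h_up, c_up):
        z = self.affine(torch.cat([x_mn, h_left, h_up], dim=-1))
        i, o, f1, f2, u = z.chunk(5, dim=-1)
        i, o, f1, f2 = (torch.sigmoid(g) for g in (i, o, f1, f2))
        c_mn = f1 * c_left + f2 * c_up + i * torch.tanh(u)  # memory cell
        h_mn = o * torch.tanh(c_mn)                         # hidden state of Eq. (1)
        return h_mn, c_mn


def mdlstm_decode(cell, classifier, X):
    """Scan an (H', W', C) feature map from top-left to bottom-right and return
    the per-block text/non-text probabilities y_mn of Eq. (2)."""
    Hp, Wp, _ = X.shape
    zeros = X.new_zeros(cell.hidden_size)
    h = [[None] * Wp for _ in range(Hp)]
    c = [[None] * Wp for _ in range(Hp)]
    y = torch.empty(Hp, Wp, 2)
    for m in range(Hp):
        for n in range(Wp):
            h_up, c_up = (h[m - 1][n], c[m - 1][n]) if m > 0 else (zeros, zeros)
            h_left, c_left = (h[m][n - 1], c[m][n - 1]) if n > 0 else (zeros, zeros)
            h[m][n], c[m][n] = cell(X[m, n], h_left, c_left, h_up, c_up)
            y[m, n] = torch.softmax(classifier(h[m][n]), dim=-1)
    return y


# Usage on a dummy 10 x 10 map of 512-dimensional CNN features.
cell = MDLSTMCell(input_size=512, hidden_size=256)
classifier = nn.Linear(256, 2)  # plays the role of the matrix W in Eq. (2)
probs = mdlstm_decode(cell, classifier, torch.randn(10, 10, 512))
print(probs.shape)  # torch.Size([10, 10, 2])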

C. Network Training

The CNN and MDLSTM can be cascaded into one network and trained end-to-end. Our loss is the sum of the log loss over two classes for each block on the training set:

$\mathrm{Loss} = \sum_{i} \sum_{m=1}^{H'} \sum_{n=1}^{W'} L_{cls}(y_{mn}^{i}, l_{mn}^{i})$

The superscript $i$ indexes the $i$-th training sample, and $H'$ and $W'$ denote the height and width of the feature maps, respectively. $l_{mn}$ is the ground truth of the block in the $m$-th row and the $n$-th column of the spatial partition of the input image. The classification loss $L_{cls}$ is the log loss over two classes (text vs. non-text). The above loss is minimized w.r.t. all the parameters of the DCNN and the MDLSTM. The network is trained by standard back-propagation [19]. In particular, Back-Propagation Through Time (BPTT) [27] is applied to calculate the error differentials in the recurrent layers.
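As a sketch of this objective (our own illustration in PyTorch, assuming the network ends with the LogSoftMax layer of Table I and therefore outputs per-block log-probabilities of shape (N, 2, H', W')):

import torch
import torch.nn.functional as F

def block_log_loss(log_probs, labels):
    """Sum of the two-class log loss over all blocks of all training samples.
    log_probs: (N, 2, H', W') per-block log-probabilities (non-text vs. text).
    labels:    (N, H', W') ground-truth block labels l_mn in {0, 1}."""
    return F.nll_loss(log_probs, labels, reduction='sum')

# Dummy example: a batch of 8 images with 10 x 10 block grids.
log_probs = F.log_softmax(torch.randn(8, 2, 10, 10), dim=1)
labels = torch.randint(0, 2, (8, 10, 10))
print(block_log_loss(log_probs, labels))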

III. EXPERIMENTS

We perform several experiments to assess the effectiveness of our model on a standard, very challenging public benchmark for text image discrimination. The dataset and settings for training and testing are described in III-A.

Fig. 5. The ground truth of an input image. We obtain the ground truth for every block with respect to the feature vector in the 2-dimensional maps by calculating the overlap of all bounding boxes with the block.

Then the details of the model are described in III-B, and the results with comprehensive comparisons are reported in III-C.

A. Dataset

We evaluate the proposed algorithm on a public dataset called TextDis, released by Zhang et al. [31], which is, to the best of our knowledge, the only dataset proposed for natural text image classification. The dataset is mainly composed of natural images; a small number of computer-produced images and scanned document images are also included. There are 5302 text images and 6000 non-text images in the training set, and the test set includes 2000 text images and 2000 non-text images. In the text images, the text regions are manually labeled with bounding boxes. As shown in Fig. 5, we obtain the ground truth for every block with respect to the feature vector in the 2-dimensional maps by calculating the overlap of all bounding boxes with the block. We assign a positive label to a block if it has an IoB (Intersection-over-Block) overlap higher than 1/16 with any text bounding box. The ground truth of an image can then be represented as $L = \{l_{11}, l_{12}, \dots, l_{H'W'}\}$.
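This labeling rule can be sketched as follows (a simplified illustration with function names of our own; boxes are axis-aligned rectangles in pixel coordinates, and the example block size corresponds to the 64-pixel blocks implied by the configuration in Table I):

def iob(block, box):
    """Intersection-over-Block: the intersection area of `block` and `box`
    divided by the block's own area. Rectangles are (x1, y1, x2, y2)."""
    bx1, by1, bx2, by2 = block
    tx1, ty1, tx2, ty2 = box
    iw = max(0, min(bx2, tx2) - max(bx1, tx1))
    ih = max(0, min(by2, ty2) - max(by1, ty1))
    return iw * ih / float((bx2 - bx1) * (by2 - by1))

def block_label(block, text_boxes, threshold=1.0 / 16):
    """A block gets a positive (text) label if its IoB overlap with any
    ground-truth text bounding box is higher than 1/16."""
    return int(any(iob(block, box) > threshold for box in text_boxes))

# Example: a 64 x 64 block that overlaps a ground-truth text box.
print(block_label((128, 128, 192, 192), [(150, 140, 400, 180)]))  # 1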

B. Implementation Details

The configuration of the network architecture used in this work is shown in Table I. In this paper, all images are scaled to 640 × 640 and converted to grayscale before being fed into the model. In the DataLayer, the values of the input image are normalized. After 6 max-pooling steps, a 10 × 10 2-dimensional map $X$ is generated. We then use two 4-direction 2-dimensional LSTM layers as the recurrent layers; we name this model 2Dim4DirLSTM. To understand how much the MDRNN contributes to the final accuracy, we also train a baseline model, called CNN Block, which consists of the CNN layers with a softmax layer on each of the feature-map pixels. Besides, we train a model whose recurrent layers consist of two 1-direction 2-dimensional LSTM layers, namely 2Dim1DirLSTM, to verify the necessity of accessing contextual information from multiple directions. We train all models end-to-end with the ADADELTA [30] optimization method, for about 10 epochs on the training set and with a batch size of 8 throughout the training phase.
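A corresponding training-loop sketch under these settings (PyTorch, ADADELTA, batch size 8; `model` and `train_loader` are hypothetical placeholders for the CMDRNN and the TextDis data pipeline, which the paper does not specify at code level):

import torch
import torch.nn.functional as F

def train(model, train_loader, epochs=10, device='cuda'):
    """End-to-end training with the ADADELTA optimizer, roughly following Section III-B."""
    model = model.to(device)
    optimizer = torch.optim.Adadelta(model.parameters())
    for epoch in range(epochs):
        for images, block_labels in train_loader:  # images: (8, 1, 640, 640), normalized grayscale
            log_probs = model(images.to(device))   # (8, 2, 10, 10) per-block log-probabilities
            loss = F.nll_loss(log_probs, block_labels.to(device), reduction='sum')
            optimizer.zero_grad()
            loss.backward()                        # BPTT handles the recurrent layers
            optimizer.step()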

At test time, we regard a block as a text block if its text probability is greater than 0.5, and we classify an image as a text image if any block in it is predicted as text.
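In code, this decision rule is simply (a sketch; `probs` is assumed to hold the (H', W', 2) per-block output for one image, with the text probability in the last channel):

import torch

def classify_image(probs, threshold=0.5):
    """probs: (H', W', 2) per-block [non-text, text] probabilities.
    A block is a text block if its text probability is greater than 0.5;
    the image is a text image if any of its blocks is a text block."""
    text_blocks = probs[..., 1] > threshold  # boolean map of text blocks
    return ('text' if text_blocks.any() else 'non-text'), text_blocks

label, block_map = classify_image(torch.rand(10, 10, 2))
print(label, int(block_map.sum()), 'text blocks')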


Fig. 6. Classification results on the TextDis benchmark. The pictures in the first row are samples of text images from the TextDis dataset; the yellow bounding boxes mark the text regions of these images. The highlighted blocks in the bottom row are the text blocks detected by our model.

TABLE I
NETWORK CONFIGURATION SUMMARY. In this table, 'k', 's' and 'p' stand for kernel size, stride and padding size respectively; '#maps' is the number of feature maps, '#hidden units' is the number of hidden units in the recurrent layer, and '#dirs' is the number of directions of the MDLSTM.

Type | Configurations
Input | 640 × 640 grayscale image
DataLayer | mean: 0, range: [−1, 1]
Convolution 1 | k: 3×3, p: 1, s: 1, #maps: 64
Convolution 2 | k: 3×3, p: 1, s: 1, #maps: 128
MaxPooling 1 | k: 2×2, s: 2
Convolution 3 | k: 3×3, p: 1, s: 1, #maps: 128
MaxPooling 2 | k: 2×2, s: 2
Convolution 4 | k: 3×3, p: 1, s: 1, #maps: 128
MaxPooling 3 | k: 2×2, s: 2
Convolution 5 | k: 3×3, p: 1, s: 1, #maps: 256
Batch Normalization 1 | -
MaxPooling 4 | k: 2×2, s: 2
Convolution 6 | k: 3×3, p: 1, s: 1, #maps: 256
MaxPooling 5 | k: 2×2, s: 2
Convolution 7 | k: 3×3, p: 1, s: 1, #maps: 512
Batch Normalization 2 | -
MaxPooling 6 | k: 2×2, s: 2
2-Dimensional LSTM 1 | #hidden units: 256, #dirs: 4
2-Dimensional LSTM 2 | #hidden units: 256, #dirs: 4
LogSoftMax | -
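For reference, the convolutional encoder of Table I could be transcribed into the following PyTorch sketch. The grouping into nn.Sequential, the final reshape into a map of feature vectors, and the ReLU activations after each convolution are our assumptions; the table itself does not specify the nonlinearity.

import torch
import torch.nn as nn

# Convolutional encoder transcribed from Table I: a 640 x 640 grayscale image is
# reduced by six 2 x 2 max-pooling layers to a 10 x 10 map of 512-dimensional features.
encoder = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, 3, 1, 1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(128, 128, 3, 1, 1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(128, 128, 3, 1, 1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(128, 256, 3, 1, 1), nn.ReLU(inplace=True),
    nn.BatchNorm2d(256),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(256, 512, 3, 1, 1), nn.ReLU(inplace=True),
    nn.BatchNorm2d(512),
    nn.MaxPool2d(2, 2),
)

x = torch.randn(1, 1, 640, 640)           # normalized grayscale input
features = encoder(x)                     # (1, 512, 10, 10)
X = features.squeeze(0).permute(1, 2, 0)  # (10, 10, 512) map of feature vectors x_mn
print(X.shape)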

C. Comparative Evaluation

In this section, we evaluate our method on the TextDis dataset and compare it with other algorithms, such as LLC [25] and CNN Coding [31].

1) Quantitative results: The evaluation protocol for text image discrimination was proposed in [31]. In this paper, we follow [31] to measure the performance of our model. Performance is measured by Precision (P), Recall (R), and F-measure (F).

The classification results of the different methods can be found in Table II. The results in the first three rows of Table II come from [31], which were evaluated on the same dataset, so they can be compared fairly with our method.

TABLE II
THE RESULTS OF DIFFERENT COMPARISON METHODS. We evaluate the precision, recall, F-measure and time cost of the different methods.

Methods | Precision | Recall | F-Measure | Time Cost
LLC | 0.839 | 0.774 | 0.805 | -
CNN | - | - | 0.821 | -
CNN Coding | 0.898 | 0.903 | 0.901 | 0.46s
Ours (CNN Block) | 0.838 | 0.927 | 0.880 | 0.06s
Ours (2Dim1DirLSTM) | 0.875 | 0.926 | 0.900 | 0.07s
Ours (2Dim4DirLSTM) | 0.904 | 0.933 | 0.918 | 0.09s

As we can see, our model outperforms all the other methods in precision, recall and F-measure. It outperforms the previous art [31] by almost 1 percent in precision, 3 percent in recall and 2 percent in F-measure, a clear improvement. Besides, the results of our CNN Block and MDLSTM models show that the MDLSTM can access local context information, which is advantageous for text/non-text classification. In addition, from the results of 2Dim1DirLSTM and 2Dim4DirLSTM we can see that a multi-direction MDLSTM accesses more context information and makes more precise predictions.

2) Qualitative results: Our method not only classifies text/non-text images at the image level, but also predicts the probability of text appearing in each block of the input image, so we can obtain the coarse position of text in text images. These results may be further used as input to a text detection model in the future. Besides, our method is very robust and can classify text blocks containing text of different sizes, colors, fonts, directions, locations, languages and layouts. We show some examples in Fig. 6.

3) Runtime evaluation: The running time of our algorithm is also evaluated. We measure it on a regular workstation (CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz; GPU: Tesla K40c; RAM: 64GB). As shown in Table II, our model is efficient at test time: an image is processed in 0.09s on a single GPU, which is about 5x faster than CNN Coding.


IV. CONCLUSIONS

In this paper, we propose a method for automatically classifying text/non-text natural images, called Convolutional Multi-Dimensional Recurrent Neural Network (CMDRNN), which combines the advantages of CNNs and MDRNNs and can be trained end-to-end. The CMDRNN classifies text/non-text images at the block level while taking a whole image as input. The CMDRNN is robust, effective and efficient, and experiments on a public dataset demonstrate that it achieves superior performance in both speed and accuracy compared to other methods. In future work, we would like to combine the CMDRNN with a text reading system for mining textual information.

ACKNOWLEDGEMENTS

This work was primarily supported by the National Natural Science Foundation of China (NSFC) (No. 61222308, No. 61573160) and the Open Project Program of the National Key Laboratory of Digital Publishing Technology (No. F2016001).

REFERENCES

[1] Nicolo G. Alessi, Sebastiano Battiato, Giovanni Gallo, Massimo Mancuso, and Filippo Stanco. Automatic discrimination of text images. In Electronic Imaging 2003, pages 351–359. International Society for Optics and Photonics, 2003.
[2] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. CoRR, abs/1412.7755, 2014.
[3] Sean Bell, C. Lawrence Zitnick, Kavita Bala, and Ross Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[4] David Filliat. A visual bag of words method for interactive qualitative localization and mapping. In Robotics and Automation, 2007 IEEE International Conference on, pages 3921–3926. IEEE, 2007.
[5] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. In Artificial Neural Networks, 1999. ICANN 99. Ninth International Conference on (Conf. Publ. No. 470), pages 2451–2471, 1999.
[6] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[7] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[8] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. Multi-dimensional recurrent neural networks. In Proceedings of the 17th International Conference on Artificial Neural Networks, pages 549–558, 2007.
[9] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE, 2013.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] W. Huang, Y. Qiao, and X. Tang. Robust scene text detection with convolution neural network induced MSER trees. In Proc. of ECCV, 2014.
[12] Emanuel Indermuhle, Horst Bunke, Faisal Shafait, and Thomas Breuel. Text versus non-text distinction in online handwritten documents. In Proceedings of the 2010 ACM Symposium on Applied Computing, pages 3–7. ACM, 2010.
[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015.
[14] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with convolutional neural networks. IJCV, pages 1–20, 2014.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. of NIPS, 2012.
[16] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.
[17] Jiri Matas, Ondrej Chum, Martin Urban, and Tomas Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10):761–767, 2004.
[18] A. Mishra, K. Alahari, and C. V. Jawahar. Scene text recognition using higher order language priors. In Proc. of BMVC, 2012.
[19] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
[20] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR, abs/1507.05717, 2015.
[21] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Robust scene text recognition with automatic rectification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[23] Tuan Anh Tran, In Seop Na, and Soo Hyung Kim. Page segmentation using minimum homogeneity algorithm and adaptive mathematical morphology. International Journal on Document Analysis and Recognition (IJDAR), pages 1–19, 2016.
[24] V. Vidya, T. R. Indhu, and V. K. Bhadran. Classification of handwritten document image into text and non-text regions. In Proceedings of the Fourth International Conference on Signal and Image Processing 2012 (ICSIP 2012), pages 103–112. Springer, 2013.
[25] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3360–3367. IEEE, 2010.
[26] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In Proc. of ICCV, 2011.
[27] Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[28] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2048–2057, 2015.
[29] C. Yao, X. Bai, and W. Liu. A unified framework for multi-oriented text detection and recognition. IEEE Trans. Image Processing, 23(11):4737–4749, 2014.
[30] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012.
[31] Chengquan Zhang, Cong Yao, Baoguang Shi, and Xiang Bai. Automatic discrimination of text and non-text natural images. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 886–890. IEEE, 2015.
[32] Zheng Zhang, Wei Shen, Cong Yao, and Xiang Bai. Symmetry-based text line detection in natural scenes. In Proc. of CVPR, pages 2558–2567, 2015.
[33] Y. Zhu, C. Yao, and X. Bai. Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science, 2015.
