
Detecting Multi-Oriented Text with Corner-based Region Proposals

Linjie Deng (a), Yanxiang Gong (a), Yi Lin (b), Jingwen Shuai (a), Xiaoguang Tu (a), Yuefei Zhang (c), Zheng Ma (a), Mei Xie (a,*)

(a) School of Information and Communication Engineering, UESTC, Chengdu, China
(b) National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China
(c) Chongqing Institute of Public Security Science and Technology, Chongqing, China

Abstract

Previous approaches for scene text detection usually rely on manually defined sliding windows. This work presents an intuitive two-stage region-based method to detect multi-oriented text without any prior knowledge regarding the textual shape. In the first stage, we estimate the possible locations of text instances by detecting and linking corners instead of shifting a set of default anchors. The quadrilateral proposals are geometry adaptive, which allows our method to cope with various text aspect ratios and orientations. In the second stage, we design a new pooling layer named Dual-RoI Pooling which embeds data augmentation inside the region-wise subnetwork for more robust classification and regression over these proposals. Experimental results on public benchmarks confirm that the proposed method is capable of achieving comparable performance with state-of-the-art methods. The code is publicly available at https://github.com/xhzdeng/crpn.

Keywords: Multi-Oriented Text Detection, Dual-RoI Pooling, Corner-based Region Proposal Network

1. Introduction

Automatically reading text in the wild is a fundamental problem of computer vision since text in scene images commonly conveys valuable information. It has been widely used in various applications such as multilingual translation, automotive assistance and image retrieval. Normally, a reading system consists of two sub-tasks: detection and recognition. This work focuses on text detection, which is the essential prerequisite of the subsequent processes in the whole workflow.

Though extensively studied [1, 2, 3, 4] in recent years, scene text detection remains enormously challenging due to the diversity of text instances and undesirable image quality. Before the era of deep learning [5], most related works utilized sliding windows [4, 6] or connected components [1, 2, 3] with hand-crafted features. Sliding-window methods detect text by shifting several windows over all positions at multiple scales. Although these methods are able to achieve a high recall rate, the large number of candidates may result in low precision and heavy computation. Connected-component approaches focus on detecting individual characters and the relationships between them. Although these methods are faster than the previous ones, errors occur and accumulate throughout each of the sequential steps [7], which may degrade detection performance.

Recently, benefiting from the significant achievements of generic object detection based on deep neural networks [8], high-performance methods [9, 10, 11] have been modified to detect horizontal scene text [12, 13, 14], and the results have amply demonstrated their effectiveness. In addition, in order to achieve multi-oriented text detection, some methods [15, 16] designed several rotated anchors to find the proposals that best match inclined text instances. Although these methods have shown promising performance, as [16] notes, man-made anchor shapes may not be the optimal design.

In this paper, we tackle scene text detection from a new perspective, which is mostly learned from the process of object annotation.

* Corresponding author. Email address: [email protected] (Mei Xie)

Preprint submitted to Journal of LaTeX Templates, June 4, 2019

arXiv:1804.02690v2 [cs.CV] 3 Jun 2019


Figure 1: Annotating an instance of text: (a) The conventional way of drawing a rectangular bounding-box: clicking on a corner (blue point) and dragging the mouse to the diagonally opposite corner (red point) to obtain a rectangle (solid purple box). The rectangular bounding-box is not very accurate for a rotated text region. (b) (c) Drawing another diagonal (dashed arrow in (b)) to finish the annotation (solid orange box in (c)) for an inclined text instance. Best viewed in color.

As shown in Fig. 1, the most common way [17, 18] to make an annotation is to click on a corner of an imaginary rectangle enclosing the object, and then drag the mouse to the diagonally opposite corner to get a rectangular box. To annotate an inclined text instance, we need another diagonal to obtain a quadrilateral bounding-box. If the box is not particularly accurate, users can make further amendments by adjusting the coordinates of the corners. Following the above procedure, our motivation is to harness corners to infer the locations of text bounding-boxes. Basically, the proposed method is a two-stage region-based [19] detection framework. In the region proposal stage, we abandon the anchor fashion and utilize corners to generate quadrilateral proposals for estimating the possible locations of text instances. In the region classification stage, a region-wise network is trained for text/non-text classification and corner regression over these proposals. Based on corners, the quadrilateral outputs are flexible enough to capture various text aspect ratios and orientations. The resulting model is trained end-to-end and obtains an F-measure of 0.876 on ICDAR 2013, 0.845 on ICDAR 2015 and 0.591 on COCO-Text. Besides, compared with recently published works [20, 21], it is competitively fast in running speed.

We summarize our primary contributions as follows:

(1) Unlike previous works that rely heavily on well-designed anchors, our new corner-based region proposal network generates quadrilateral region proposals by detecting and linking the corners of text bounding-boxes.

(2) In order to suppress the negative linkages, we design a new variable named Link Direction to indicate where to find the partner for each detected corner candidate.

(3) A nearly cost-free module named Dual-RoI Pooling, which embeds data augmentation inside the region-wise subnetwork, is presented to improve the utilization of training data.

(4) Our scene text detector achieves strong performance and competitive speed on public benchmarks. All of our training and testing code is open source.

2. Methodology

In this section, we will describe the proposed method. Technically, our method is based on the Faster R-CNN [9] and DeNet [22] detection frameworks. We make some innovations to combine and extend them for detecting multi-oriented text. Details will be delineated in the following.

2.1. Corner-based Region Proposal Network

Here we introduce our new region proposal algorithm named Corner-based Region Proposal Network (CRPN). It draws primarily on DeNet [22], which is a novel generic object detection framework. We extend the Corner-based RoI Detector in DeNet, which can only output rectangular proposals, to generate quadrilateral ones for matching arbitrary-oriented text instances. In general, two intersecting diagonals determine a quadrangle. The main purpose of CRPN is to search for corners on the whole image and find the intersecting diagonals which link them. It mainly consists of two steps: corner detection and proposal sampling.
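The observation that two intersecting diagonals determine a quadrangle can be made concrete with a small geometric check. The sketch below (hypothetical helpers written for illustration, not part of the released code) tests whether two candidate diagonals properly intersect and, if so, assembles the four corner points into a quadrilateral in a consistent order.

```python
import numpy as np

def segments_intersect(p1, p2, q1, q2):
    """Return True if segment p1->p2 properly intersects segment q1->q2."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1 = cross(q1, q2, p1)
    d2 = cross(q1, q2, p2)
    d3 = cross(p1, p2, q1)
    d4 = cross(p1, p2, q2)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def quad_from_diagonals(top_left, bot_right, top_right, bot_left):
    """Assemble a quadrilateral (clockwise from top-left) if the two
    diagonals actually cross each other; otherwise return None."""
    if not segments_intersect(top_left, bot_right, top_right, bot_left):
        return None
    return np.array([top_left, top_right, bot_right, bot_left], dtype=np.float32)

# Example: two crossing diagonals yield a valid quadrilateral
print(quad_from_diagonals((0, 0), (10, 8), (10, 0), (0, 8)))
```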


2.1.1. Corner Detection

Similar to semantic segmentation, the task of corner detection is performed by predicting the probability of each position (x, y) belonging to one of the predefined corner types. Since texts are usually very close to each other in natural scenes, it is hard to separate corners that come from adjacent text instances. Thus, we use the one-versus-rest strategy and apply a four-branch ConvNet to predict the independent probability, denoted as P_i(x, y), for each corner type, where i ∈ {top_left, top_right, bot_right, bot_left}. Obviously, only top_left and bot_right, or top_right and bot_left, can be connected diagonally. However, most of the linkages are negative because there is only one positive partner for each corner candidate. For the purpose of suppressing the negative linkages, we define a new variable named Link Direction. Let θ(p, q) be the orientation of a diagonal from p to q with respect to the horizontal, where p and q are two corner candidates of two types that can be linked into a diagonal. Then θ(p, q) is discretized into one of K values, as HOG [23] did. The Link Direction of corner p to q is calculated as follows:

$$D_{\overrightarrow{pq}} = \operatorname{ceil}\left(\frac{K \cdot \theta(p, q)}{2\pi}\right) \tag{1}$$

Then we convert the binary classification problem (corner/non-corner) into a multi-class one, where each class corresponds to a value of Link Direction. To this end, each corner detector outputs a prediction map with channel dimension K + 1 (plus one for background and other corner types). The independent probability of each position belonging to corner type i is given by:

$$P_i(x, y) = \sum_{k=1}^{K} P'_i(k \mid x, y), \quad k \in \{0, 1, \ldots, K\} \tag{2}$$

P'_i(k | x, y) is computed by a softmax over the K + 1 output maps of corner class i. In the same way that [1] applies the gradient direction d_p of each edge pixel p to find another edge pixel q whose gradient direction d_q is roughly opposite to d_p, we filter out the linkages for which:

$$\left|D_{\overrightarrow{pq}} - D_p\right| > 1 \tag{3}$$

where D_p is the predicted Link Direction of corner p, and D_{pq} is the practical Link Direction of corner p to q, which can be easily calculated via Eq. 1. As depicted in Fig. 2, the Link Direction indicates where to find the partner for each corner and is also helpful for separating two adjacent texts.
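To make Eqs. 1-3 concrete, the snippet below (an illustrative sketch, not the authors' released implementation) computes the discretized Link Direction of a diagonal and applies the filtering rule of Eq. 3 to a candidate linkage; K is the number of direction bins and the corner coordinates are hypothetical.

```python
import math

K = 24  # number of Link Direction bins, as used in the paper

def link_direction(p, q, k=K):
    """Discretize the orientation of diagonal p->q into one of k bins (Eq. 1)."""
    theta = math.atan2(q[1] - p[1], q[0] - p[0]) % (2 * math.pi)
    return math.ceil(k * theta / (2 * math.pi))

def keep_linkage(p, q, predicted_direction, k=K):
    """Keep the linkage p->q only if its practical Link Direction is close
    to the direction predicted for corner p (Eq. 3)."""
    practical = link_direction(p, q, k)
    return abs(practical - predicted_direction) <= 1

# Example: a top_left corner at (10, 12) linked to a bot_right corner at (90, 40)
p, q = (10, 12), (90, 40)
print(link_direction(p, q))                       # direction bin of the diagonal
print(keep_linkage(p, q, predicted_direction=2))  # True if the linkage survives Eq. 3
```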


Figure 2: Suppressing the negative linkages via Link Direction: The green and yellow points are two types of detected corners which can be connected diagonally. (a) The solid (red) and dashed (orange and purple) arrows represent the practical and predicted Link Directions respectively. The dashed orange arrows will be filtered out via Eq. 3. (b) The output region proposals (solid blue boxes) are generated from the dashed purple arrows. Best viewed in color.

2.1.2. Proposal Sampling

We develop a simple algorithm for searching corner candidates in the probability maps and assembling them into quadrilateral proposals. In particular, in order to improve the recall rate, we only require three different corner types to generate a proposal. The working steps of the algorithm are described as follows:


Algorithm 1: Generating Proposals by Searching and Linking Corners
Input: the outputs of each corner detector
Output: quadrilateral proposals

Step 1: Predefine a threshold T and search the probability map for corner candidates where P_i(x, y) > T. A candidate which is close to a higher-scoring selected one is rejected, and only the top-M ranked candidates are used in the next step.
Step 2: Generate a set of unique diagonals by linking every candidate of type top_left with every candidate of type bot_right. The negative linkages are filtered out via Eq. 3.
Step 3: Generate another set of diagonals by rotating each diagonal to be collinear with every corner of the last two types. A quadrilateral proposal is generated by two intersecting diagonals (the source one and the rotated one).
Step 4: Repeat Step 2 and Step 3 with corners of types top_right and bot_left.

With the corner probabilities, we can easily estimate the score that a proposal B contains a text instance by applying a Naïve Bayes classifier to each corner of B:

$$P(B) = \prod_{i} P_i(x_i, y_i) \tag{4}$$

where (x_i, y_i) indicates the coordinates of the corner of type i associated with proposal B. We also adopt non-maximum suppression (NMS) on proposals based on their scores to reduce redundancy, because most of the proposals highly overlap with each other. Considering that the standard NMS based on the IoU of rectangles is unsatisfactory for quadrilateral proposals, we use the algorithm introduced by [15], which can compute the IoU of quadrangles, to solve this problem. The vast majority of proposals are discarded after this operation.
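The proposal scoring of Eq. 4 together with an NMS over quadrilaterals can be sketched as follows. Here the polygon IoU is computed with the shapely library as a stand-in for the quadrangle-IoU algorithm of [15], and the proposal format (a 4x2 array of corners plus per-corner probabilities) is an assumption for illustration.

```python
import numpy as np
from shapely.geometry import Polygon

def proposal_score(corner_probs):
    """Eq. 4: product of the corner probabilities associated with a proposal."""
    return float(np.prod(corner_probs))

def quad_iou(quad_a, quad_b):
    """IoU of two quadrilaterals given as 4x2 arrays of corner points."""
    pa, pb = Polygon(quad_a), Polygon(quad_b)
    inter = pa.intersection(pb).area
    union = pa.union(pb).area
    return inter / union if union > 0 else 0.0

def quad_nms(quads, scores, iou_thresh=0.7):
    """Keep the highest-scoring proposals, dropping any quad whose IoU with
    an already kept quad exceeds iou_thresh."""
    order = np.argsort(-np.asarray(scores))
    keep = []
    for idx in order:
        if all(quad_iou(quads[idx], quads[k]) <= iou_thresh for k in keep):
            keep.append(idx)
    return keep
```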

2.2. Dual-RoI Pooling


Figure 3: Dual-RoI Pooling: The Dual-RoI Pooling is actually performed on the feature map; we draw it on the source image for better viewing. Different colored arrows represent different operations. Best viewed in color.

As presented in [24], the RoI pooling layer extracts a fixed-length feature vector from the feature map for each region of interest (RoI), and then each vector is fed into a sequence of fully connected layers for further classification and bounding-box regression. In our task, the traditional RoI pooling, which can only handle axis-aligned RoIs, is also not accurate for quadrilateral ones. Thus, the Rotation RoI (RRoI) pooling presented by [15] is adopted to address this issue. Each RRoI is represented by a five-tuple (r, c, h, w, θ), where r, c are the coordinates of the bounding-box center, h, w are the height and width of the bounding-box, and θ is the rotation angle with respect to the horizontal. This operation works by mapping the rotated RoI into an axis-aligned one via an affine transformation, and then dividing the h × w window into an H × W grid of sub-windows and max-pooling the values in each sub-window into the corresponding output grid cell, as shown in Fig. 3. Since the input format of RRoI pooling is a rotated rectangle, we convert the quadrilateral proposals into rotated rectangular ones through the method minAreaRect in OpenCV, and θ falls within the interval [0°, 90°).
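One possible way to perform this conversion with OpenCV is sketched below. Note that cv2.minAreaRect's angle convention has changed across OpenCV versions, so the normalization into [0°, 90°) shown here is an assumption about how to reconcile the returned angle with the paper's convention.

```python
import cv2
import numpy as np

def quad_to_rroi(quad):
    """Convert a quadrilateral proposal (4x2 corner points) into an RRoI
    five-tuple (r, c, h, w, theta) with theta in [0, 90) degrees."""
    pts = np.asarray(quad, dtype=np.float32)
    (cx, cy), (w, h), angle = cv2.minAreaRect(pts)
    # Bring the angle into [0, 90): shifting by 90 degrees swaps width/height.
    if angle < 0:
        angle += 90.0
        w, h = h, w
    elif angle >= 90.0:
        angle -= 90.0
        w, h = h, w
    return (cy, cx, h, w, angle)   # (r, c, h, w, theta); r = row, c = column

quad = [(10, 20), (110, 35), (100, 75), (0, 60)]
print(quad_to_rroi(quad))
```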

Usually, to improve the robustness of the model when encountering various orientations of text, existing methods [20, 25, 26] make use of data augmentation that rotates source images to different angles to harvest sufficient data for training. Despite the effectiveness of data augmentation, its main drawback lies in the fact that learning all the possible transformations of augmented data requires more network parameters, and it may also result in a significant increase of training cost and over-fitting risk [27]. With the aim of alleviating these drawbacks, we present a built-in data augmentation named Dual-RoI Pooling. For each RRoI O represented by (r, c, h, w, θ), we transform it into another RRoI O' represented by (r', c', h', w', θ') where:

$$r' = r, \quad c' = c, \quad h' = w, \quad w' = h, \quad \theta' = 90^{\circ} - \theta \tag{5}$$

As described in Fig. 3, inputting the RoI O' to the RRoI pooling layer will obtain a totally different grid of sub-windows. We combine these two RoIs as a Dual-RoI and fuse their feature vectors to get the final output via an element-wise adding operation. In essence, Dual-RoI pooling embeds multi-instance learning [28] inside the region-wise subnetwork, which is helpful for finding the most representative features for training, similar to TI-POOLING [29]. As a result of directly conducting the transformation on the feature map instead of on the source image, our module is more efficient than [29]. Considering the diversity of text arrangements, we argue that element-wise adding is more appropriate than element-wise maximum in our task.
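A minimal sketch of the Dual-RoI idea, assuming a generic rroi_pool(feature_map, rroi) operator is available (the actual layer in the paper is a Caffe implementation): it applies Eq. 5 to derive the second RoI and fuses the two pooled features by element-wise addition.

```python
import numpy as np

def dual_rroi_pool(feature_map, rroi, rroi_pool):
    """Dual-RoI pooling: pool the original RRoI and its Eq. 5 counterpart,
    then fuse the two fixed-size features by element-wise addition.

    rroi      : (r, c, h, w, theta) with theta in degrees, [0, 90)
    rroi_pool : callable(feature_map, rroi) -> pooled feature (H x W x C)
    """
    r, c, h, w, theta = rroi
    dual = (r, c, w, h, 90.0 - theta)            # Eq. 5
    feat_a = rroi_pool(feature_map, rroi)
    feat_b = rroi_pool(feature_map, dual)
    return feat_a + feat_b                        # element-wise adding

# Toy usage with a dummy pooling operator that ignores geometry
dummy_pool = lambda fmap, roi: np.ones((7, 7, 512), dtype=np.float32)
fused = dual_rroi_pool(np.zeros((50, 50, 512)), (12, 30, 8, 24, 30.0), dummy_pool)
print(fused.shape)  # (7, 7, 512)
```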

2.3. Network Architecture

The network architecture of the proposed method is diagrammed in Fig. 4. All of our experiments are implemented on VGG-16 [30], though other networks are also applicable. To obtain more accurate corner predictions, we choose the output of Conv4 as the final feature map, whose size is 1/8 of the input image. As has been shown in many works [31, 32], combining the fine-grained details from low-level layers with the coarse semantic information from high-level layers is helpful for the detector. Our feature fusion structure mainly inherits from PVANet [32] but is slightly different from it. We add a deconvolutional layer to carry out upsampling, and a 3 × 3 × 64 convolutional layer with stride = 2 to conduct subsampling. The concatenated features are combined by a 1 × 1 × 512 convolutional layer.
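The fusion structure described above can be sketched roughly as follows (a PyTorch approximation for illustration only; the paper's implementation is in Caffe, and the channel counts of the VGG-16 stages are assumed from the standard architecture).

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse Conv3/Conv4/Conv5 of VGG-16 into a single 512-channel map at
    1/8 of the input resolution, following the description in Sec. 2.3."""
    def __init__(self, c3=256, c4=512, c5=512):
        super().__init__()
        # Subsample Conv3 (1/4 resolution) to 1/8 with a 3x3, stride-2 convolution.
        self.sub = nn.Conv2d(c3, 64, kernel_size=3, stride=2, padding=1)
        # Upsample Conv5 (1/16 resolution) to 1/8 with a deconvolution.
        self.up = nn.ConvTranspose2d(c5, c5, kernel_size=4, stride=2, padding=1)
        # Combine the concatenated features with a 1x1x512 convolution.
        self.combine = nn.Conv2d(64 + c4 + c5, 512, kernel_size=1)

    def forward(self, conv3, conv4, conv5):
        x = torch.cat([self.sub(conv3), conv4, self.up(conv5)], dim=1)
        return self.combine(x)

# Toy usage with feature maps at 1/4, 1/8 and 1/16 of a 640x640 input
fuse = FeatureFusion()
out = fuse(torch.zeros(1, 256, 160, 160),
           torch.zeros(1, 512, 80, 80),
           torch.zeros(1, 512, 40, 40))
print(out.shape)  # torch.Size([1, 512, 80, 80])
```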


Figure 4: The architecture of the proposed detection network. Best viewed in color.


2.4. Loss Function

Based on the above definitions, we give the multi-task loss L of the proposed network. The model is trained to simultaneously minimize the losses on corner segmentation, proposal classification and corner regression. Overall, the loss function is a weighted sum of these losses:

$$L = \lambda_{seg} L_{seg} + \lambda_{cls} L_{cls} + \lambda_{loc} L_{loc} \tag{6}$$

where λ_seg, λ_cls, λ_loc are user-defined constants indicating the relative strength of each component. In our experiments, all weight constants λ are set to 1.

2.4.1. Loss for Segmentation

In the training stage, the loss function for the segmentation task is computed over all pixels in the output map against the ground-truth, which is identified by mapping the corners to a single position in the label map; corners out of boundary are simply discarded. For a typical natural image, the distribution of corner/non-corner pixels is heavily biased. It is possible to optimize the loss over all pixels, but this will bias towards negative samples as they are dominant. Instead, we randomly sample 32 pixels as a mini-batch in which positive and negative pixels have a ratio of up to 1:1. If the number of positive pixels is not sufficient, we pad the mini-batch with negative ones. Furthermore, we adopt the weighted softmax loss function introduced in [33] to balance the loss between the corner/non-corner classes, given by:

$$L_{seg} = \sum_{i} \frac{-1}{|S_i|} \sum_{m=1}^{|S_i|} \sum_{k=0}^{K} \omega_i^k \, I(z_i^m = k) \log P'_i(k \mid s_i^m) \tag{7}$$

S_i is a mini-batch of corner type i, and |S_i| is the number of samples. K denotes the maximum value of the Link Direction. z_i^m is the ground-truth of the m-th sample s_i^m. The function I(·) evaluates to 1 when z_i^m = k and 0 otherwise. P'_i(k | s_i^m) represents how likely the value of the Link Direction of s_i^m is k. ω_i^k is the balanced weight between corner and non-corner samples, given by:

$$\omega_i^k =
\begin{cases}
\dfrac{\sum_m I(z_i^m = 0)}{|S_i|} & \text{if } k > 0 \\[2ex]
\dfrac{\sum_m I(z_i^m > 0)}{|S_i|} & \text{otherwise}
\end{cases} \tag{8}$$
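The balanced sampling and weighting of Eqs. 7-8 can be sketched as follows (a NumPy illustration under assumed array shapes; the real loss is a Caffe layer in the authors' code, and the sketch assumes the image has enough negative pixels to fill the mini-batch).

```python
import numpy as np

def weighted_seg_loss(probs, labels, batch_size=32, rng=None):
    """Weighted softmax loss for one corner type (Eqs. 7-8).

    probs  : (N, K+1) softmax outputs P'_i(k | s) for candidate pixels
    labels : (N,) ground-truth Link Direction values, 0 = non-corner
    """
    rng = rng or np.random.default_rng()
    pos = np.where(labels > 0)[0]
    neg = np.where(labels == 0)[0]
    # Mini-batch with up to a 1:1 positive/negative ratio, padded with negatives.
    n_pos = min(len(pos), batch_size // 2)
    n_neg = batch_size - n_pos
    chosen = []
    if n_pos > 0:
        chosen.append(rng.choice(pos, n_pos, replace=False))
    chosen.append(rng.choice(neg, n_neg, replace=False))
    idx = np.concatenate(chosen)
    p, z = probs[idx], labels[idx]
    # Eq. 8: weight corner samples by the negative fraction and vice versa.
    w_for_pos = np.mean(z == 0)
    w_for_neg = np.mean(z > 0)
    weights = np.where(z > 0, w_for_pos, w_for_neg)
    # Eq. 7 (for this corner type): weighted negative log-likelihood.
    nll = -np.log(p[np.arange(len(idx)), z] + 1e-12)
    return np.mean(weights * nll)

# Toy usage with K = 24 direction bins plus background
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(25), size=500)
labels = rng.integers(0, 25, size=500)
print(weighted_seg_loss(probs, labels, rng=rng))
```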

2.4.2. Loss for Region-wise Network

As in [24], the region-wise network has two sibling outputs: one for proposal classification and another for corner regression. The first objective aims to distinguish text from non-text with a softmax loss, as follows:

$$L_{cls} = y \log(p) + (1 - y) \log(1 - p) \tag{9}$$

where y equals 1 if the input sample is text, and 0 otherwise, and p is the text confidence score. The second objective attempts to regress the fine bounding-box, as follows:

$$L_{loc} = \mathrm{smooth}_{L_1}(t - t^*) \tag{10}$$

t and t* represent the predicted and target regression tuples respectively, and smooth_{L1} is the robust L1 loss defined in R-CNN [5]. The regression target t* consists of eight terms:

$$t^*_{x_i} = (x_i - x^*_i)/w, \qquad t^*_{y_i} = (y_i - y^*_i)/h \tag{11}$$

(x_i, y_i) and (x*_i, y*_i) denote the coordinates of the corner of type i for the proposal and the target bounding-box respectively. w and h are the width and height of the minimal bounding rectangle of the input proposal. The target bounding-box for each proposal is identified by selecting the ground-truth with the largest overlap.


3. Experiments

In the following part, we describe the implementation details of our method. We also run a number of ablations to analyze the effectiveness of the proposed components. Finally, we present the evaluation results on three public benchmarks: ICDAR 2013 [34], ICDAR 2015 [35] and COCO-Text [36].

3.1. Implementation Details

The backbone of the network is initialized by pretraining a model for ImageNet [37] classification, and all other new layers are initialized by "Xavier" [38]. The training images are collected from SynthText [13], ICDAR 2013 and ICDAR 2015. We randomly pick 100,000 images from SynthText for pretraining, and then the real data from the training sets of ICDAR 2013 and 2015 is used to finetune a unified model. The model is trained end-to-end using the standard SGD algorithm. Momentum and weight decay are set to 0.9 and 5 × 10^-4 respectively. Following the multi-scale training in [32], we resize the images in each training iteration such that the short side of the input is randomly chosen between 480 and 800. In the pretraining stage, the learning rate is set to 10^-3 for the first 60k iterations, and then decayed to 10^-4 for the remaining 40k iterations. In the finetuning stage, the learning rate is fixed to 10^-4 for 20k iterations throughout. No extra data augmentation is used.

For the trade-off between efficiency and accuracy, we set K = 24 and M = 32 in our implementation. Moreover, the threshold T is set to 0.1 in training for high recall and 0.5 in testing for high precision. Thanks to localization by corners, the proposals are very accurate and only 200 proposals are used for further detection at test time. The proposed method is implemented using Caffe [39]. All experiments are carried out on a standard PC with an Intel i7-6800k CPU and a single NVIDIA 1080Ti GPU. All of our results are reported on single-scale testing images with a single model.
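The multi-scale resizing used during training (short side randomly chosen in [480, 800]) can be illustrated with the following sketch; the helper is hypothetical and OpenCV is used here only for convenience.

```python
import cv2
import numpy as np

def random_short_side_resize(image, short_min=480, short_max=800, rng=None):
    """Resize an image so its short side is a random value in [short_min, short_max],
    keeping the aspect ratio, as in the multi-scale training of [32]."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    target = rng.integers(short_min, short_max + 1)
    scale = target / min(h, w)
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    return resized, scale

img = np.zeros((600, 900, 3), dtype=np.uint8)
resized, scale = random_short_side_resize(img)
print(resized.shape, round(scale, 3))
```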

3.2. Ablation Study

To investigate the effectiveness of the proposed components, we conduct several ablation studies. We adopt Faster R-CNN [9] as our baseline model and convert the format of its outputs to match the evaluation of ICDAR 2015. Results are shown in Table 1 and discussed in detail next.

Table 1: Ablations on ICDAR 2015.

Method                     Feature Map           Recall   Precision   F-measure
Baseline                   conv5, stride = 16    0.606    0.553       0.579
CRPN + RoI Pooling         conv5, stride = 16    0.707    0.611       0.656
CRPN + RoI Pooling         fusion, stride = 8    0.767    0.781       0.774
CRPN + RRoI Pooling        fusion, stride = 8    0.788    0.870       0.827
CRPN + Dual-RoI Pooling    fusion, stride = 8    0.807    0.887       0.845

First, using the Corner-based RPN instead of the conventional RPN improves the F-measure from 0.579 to 0.656. It is worth noting that the regression task in the baseline is designed for refining rectangular bounding-boxes. Next, with the aim of generating more accurate proposals, we predict on conv4 with the fusion structure introduced in Section 2.3, and the resulting model obtains a huge improvement in F-measure (0.656 vs 0.774). Replacing RoI Pooling with RRoI Pooling leads to an F-measure of 0.827, which also outperforms the recent Rotation RPN (0.800 F-measure, shown in Table 3). From this we can conclude that the Corner-based RPN is more effective at matching multi-oriented text in scene images. Finally, by incorporating Dual-RoI Pooling, our method further obtains a gain of 1.8% in F-measure and also achieves the highest recall and precision in the list. This suggests that Dual-RoI Pooling is indeed helpful for the scene text detector.

3.3. Experimental Results

We evaluate our method on public benchmarks, following the standard evaluation protocols in each field. Fig. 5 shows some detection results from these datasets.


Figure 5: Selected results from the public benchmarks. Viewing digitally with zoom is recommended.

3.3.1. ICDAR 2013 Focused Scene Text

The ICDAR 2013 [34] dataset consists of 229 training and 233 testing images which were captured by users explicitly focusing the camera on the text content of interest. It is the standard benchmark for evaluating horizontal or nearly horizontal text detection. In this benchmark, all of the testing images are resized with a fixed short side of 640, and we obtain an F-measure of 0.876 using the ICDAR 2013 standard. As depicted in Table 2, the proposed method achieves performance comparable to a state-of-the-art method [40], which utilizes a Connectionist Text Proposal Network to detect text by predicting a sequence of fine-scale text components. Our method also outperforms other compared methods including DeepText [12], FCRN [13] and TextBoxes [14], which are mainly designed for nearly horizontal text detection. The proposed method runs at 9.1 fps, which is slightly faster than the recently published work [21]. However, it is still too slow for real-time (25 fps) or nearly real-time applications.

Table 2: Results on ICDAR 2013 Focused Scene Text. The results are reported in terms of Recall (R), Precision (P) and F-measure (F). (–) means no report in their papers.

                    ICDAR Standard         DetEval
Method              R      P      F        R      P      F        FPS
FASText [3]         0.693  0.840  0.768    –      –      –        6.7
TextFlow [7]        0.759  0.852  0.803    –      –      –        1.1
TextBoxes [14]      0.740  0.860  0.800    0.740  0.880  0.810    11.1
FCRN [13]           0.764  0.938  0.842    0.755  0.920  0.830    0.8
DeepText [12]       0.830  0.870  0.850    –      –      –        0.6
SegLink [41]        –      –      –        0.830  0.877  0.853    20.6
CTPN [40]           0.730  0.930  0.820    0.820  0.930  0.880    7.1
SSTD [21]           0.860  0.880  0.870    0.860  0.890  0.880    7.7
Ours                0.839  0.919  0.876    0.840  0.921  0.879    9.1


3.3.2. ICDAR 2015 Incidental Scene Text

The ICDAR 2015 [35] benchmark was released during the ICDAR 2015 Robust Reading Competition. It provides 1000 training and 500 test images which were collected without any specific prior attention. Therefore, it is more difficult than previous ICDAR challenges. In this benchmark, we rescale all of the testing images such that their short side is 900 pixels in order to detect small text regions. As shown in Table 3, our method achieves an F-measure of 0.845, surpassing all of the anchor-fashion methods, including RRPN [15] and R2CNN [26], which are also extended from the Faster R-CNN [9] framework and employ VGG-16 [30] as the network backbone. FTSN [25], which is based on detecting and segmenting text instances, is very close to us in performance, but it is built on a more powerful network (ResNet-101 [42]). However, compared with EAST [43], which utilizes a deep regression network that directly predicts text regions with arbitrary orientations in full images, there is still a big gap in terms of recall.

Table 3: Results on ICDAR 2015 Incidental Scene Text.

Method              Recall   Precision   F-measure
MCLAB FCN [44]      0.430    0.710       0.540
CTPN [40]           0.520    0.740       0.610
DMPNet [16]         0.680    0.730       0.710
SegLink [41]        0.768    0.731       0.750
SSTD [21]           0.730    0.800       0.770
RRPN [15]           0.770    0.840       0.800
EAST [43]           0.833    0.783       0.807
NLPR CASIA [20]     0.800    0.820       0.810
R2CNN [26]          0.797    0.856       0.825
FTSN [25]           0.800    0.886       0.841
Ours                0.807    0.887       0.845

3.3.3. COCO-Text

The original images of COCO-Text [36] are harvested from the Microsoft COCO [45] dataset, and it contains 43,686 training images and 20,000 images for validation and testing. It is the largest dataset for text detection and recognition in scene images to date. The testing images are resized with a fixed short side of 900, and our method achieves 0.633, 0.555 and 0.591 in recall, precision and F-measure using the official online evaluation system, as shown in Table 4. It is worth noting that no images from COCO-Text are involved in the training phase. The presented results demonstrate that our method can be applied practically in unseen contexts.

Table 4: Results on COCO-Text. The results of compared methods are taken from the public COCO-Text leaderboard.

Method               Recall   Precision   F-measure
SCUT DLVClab         0.625    0.316       0.420
SARI FDU RRPN        0.632    0.333       0.436
UM                   0.654    0.475       0.551
TDN SJTU v2          0.543    0.624       0.580
Text Detection DL    0.618    0.609       0.613
Ours                 0.633    0.555       0.591

3.4. Limitations

We further analyze the limitations of our method. In essence, the proposed framework utilizes a bottom-up strategy from corners to quadrilateral bounding-boxes, hence the problem of error accumulation still exists. As shown in Fig. 6(a), our method may fail to detect text consisting of a single character because we drop such instances in the training stage for high precision. Moreover, our method still struggles with some extreme scenarios, such as the examples in Fig. 6(b) and (c). Finally, we have to magnify the image size in order to detect small text regions, which also limits the efficiency of our method.



Figure 6: Failure cases on ICDAR 2013 and 2015. Red boxes are false positives or true negatives. Best viewed zoomed in color.

4. Conclusion and Future Work

This paper presents an intuitive region-based method for multi-oriented text detection, borrowing the idea from the process of object annotation. We discard the anchor strategy and employ corners to estimate the possible locations of text instances. Based on corners, our method is flexible enough to generate quadrilateral proposals for capturing various kinds of text. Moreover, we design a built-in data augmentation module inside the region-wise subnetwork, which not only utilizes training data more efficiently, but also improves the robustness of the resulting model. In the future, the performance of our method can be further improved by using much stronger networks, such as ResNet [42] or DenseNet [46]. Additionally, we are also interested in extending this method into an end-to-end text reading system.

References

[1] B. Epshtein, E. Ofek, Y. Wexler, Detecting text in natural scenes with stroke width transform, in: CVPR, 2010.
[2] L. Neumann, J. Matas, Real-time scene text localization and recognition, in: CVPR, 2012.
[3] M. Busta, L. Neumann, J. Matas, Fastext: Efficient unconstrained scene text detector, in: ICCV, 2015.
[4] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Reading text in the wild with convolutional neural networks, IJCV.
[5] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012.
[6] Z. Zhang, W. Shen, C. Yao, X. Bai, Symmetry-based text line detection in natural scenes, in: CVPR, 2015.
[7] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, C. Lim Tan, Text flow: A unified text detection system in natural scene images, in: ICCV, 2015.
[8] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE.
[9] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: NIPS, 2015.
[10] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: CVPR, 2016.
[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, in: ECCV, 2016.
[12] Z. Zhong, L. Jin, S. Zhang, Z. Feng, Deeptext: A unified framework for text proposal generation and text detection in natural images, arXiv preprint arXiv:1605.07314.
[13] A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, in: CVPR, 2016.
[14] M. Liao, B. Shi, X. Bai, X. Wang, W. Liu, Textboxes: A fast text detector with a single deep neural network, in: AAAI, 2017.
[15] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, X. Xue, Arbitrary-oriented scene text detection via rotation proposals, arXiv preprint arXiv:1703.01086.
[16] Y. Liu, L. Jin, Deep matching prior network: Toward tighter multi-oriented text detection, in: CVPR, 2017.
[17] D. P. Papadopoulos, J. R. Uijlings, F. Keller, V. Ferrari, Extreme clicking for efficient object annotation, in: ICCV, 2017.
[18] K.-K. Maninis, S. Caelles, J. Pont-Tuset, L. Van Gool, Deep extreme cut: From extreme points to object segmentation, arXiv preprint arXiv:1711.09081.
[19] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: CVPR, 2014.
[20] W. He, X.-Y. Zhang, F. Yin, C.-L. Liu, Deep direct regression for multi-oriented scene text detection, in: ICCV, 2017.
[21] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, X. Li, Single shot text detector with regional attention, in: ICCV, 2017.
[22] L. Tychsen-Smith, L. Petersson, Denet: Scalable real-time object detection with directed sparse sampling, in: ICCV, 2017.
[23] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005.
[24] R. Girshick, Fast R-CNN, in: ICCV, 2015.
[25] Y. Dai, Z. Huang, Y. Gao, K. Chen, Fused text segmentation networks for multi-oriented scene text detection, arXiv preprint arXiv:1709.03272.
[26] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, Z. Luo, R2cnn: Rotational region cnn for orientation robust scene text detection, arXiv preprint arXiv:1706.09579.
[27] Y. Zhou, Q. Ye, Q. Qiu, J. Jiao, Oriented response networks, in: CVPR, 2017.
[28] J. Wu, Y. Yu, C. Huang, K. Yu, Deep multiple instance learning for image classification and auto-annotation, in: CVPR, 2015.
[29] D. Laptev, N. Savinov, J. M. Buhmann, M. Pollefeys, Ti-pooling: Transformation-invariant pooling for feature learning in convolutional neural networks, in: CVPR, 2016.
[30] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: ICLR, 2015.
[31] T. Kong, A. Yao, Y. Chen, F. Sun, Hypernet: Towards accurate region proposal generation and joint object detection, in: CVPR, 2016.
[32] K.-H. Kim, S. Hong, B. Roh, Y. Cheon, M. Park, Pvanet: Lightweight deep neural networks for real-time object detection, arXiv preprint arXiv:1608.08021.
[33] W. Shen, K. Zhao, Y. Jiang, Y. Wang, Z. Zhang, X. Bai, Object skeleton extraction in natural images by fusing scale-associated deep side outputs, in: CVPR, 2016.
[34] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, L. P. De Las Heras, Icdar 2013 robust reading competition, in: ICDAR, 2013.
[35] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al., Icdar 2015 competition on robust reading, in: ICDAR, 2015.
[36] A. Veit, T. Matera, L. Neumann, J. Matas, S. Belongie, Coco-text: Dataset and benchmark for text detection and recognition in natural images, arXiv preprint arXiv:1601.07140.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, IJCV 115 (3).
[38] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: AISTATS, 2010.
[39] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, arXiv preprint arXiv:1408.5093.
[40] Z. Tian, W. Huang, T. He, P. He, Y. Qiao, Detecting text in natural image with connectionist text proposal network, in: ECCV, 2016.
[41] B. Shi, X. Bai, S. Belongie, Detecting oriented text in natural images by linking segments, in: CVPR, 2017.
[42] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016.
[43] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, East: An efficient and accurate scene text detector, in: CVPR, 2017.
[44] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, X. Bai, Multi-oriented text detection with fully convolutional networks, in: CVPR, 2016.
[45] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C. L. Zitnick, Microsoft coco: Common objects in context, in: ECCV, 2014.
[46] G. Huang, Z. Liu, K. Q. Weinberger, L. van der Maaten, Densely connected convolutional networks, in: CVPR, 2017.
