Online Handwritten Cursive Word Recognition Using Segmentation-free and Segmentation-based Methods

Bilan Zhu (a), Arti Shivram (b), Venu Govindaraju (b) and Masaki Nakagawa (a)

(a) Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology, Tokyo, Japan
{zhubilan, nakagawa}@cc.tuat.ac.jp

(b) Center for Unified Biometrics and Sensors, University at Buffalo, Buffalo, US
{ashivram, govind}@buffalo.edu

3rd Asian Conference on Pattern Recognition (ACPR 2015), November 3-6, 2015, Kuala Lumpur, Malaysia.

Abstract

This paper describes a comparison between online handwritten cursive word recognition using a segmentation-free method and that using a segmentation-based method. To search for the optimal segmentation and recognition path as the recognition result, we attempt two methods, segmentation-free and segmentation-based, where we expand the search space using a character-synchronous beam search strategy. The probable search paths are evaluated by integrating character recognition scores with geometric characteristics of the character patterns in a Conditional Random Field (CRF) model. Our methods restrict the search paths using the trie lexicon of words and the preceding paths during the path search. We show this comparison on a publicly available dataset (IAM-OnDB).

1. Introduction

The development of pen-based or touch-based input devices such as tablets and smartphones has led to a push towards more fluid interaction with these devices. Realizing online handwritten character recognition with high performance is vital, especially for applications such as the natural input of text on smartphones, to provide a satisfactory user experience.

Without character recognition cues, characters cannot be segmented unambiguously because the segmentation points between characters are not obvious. A feasible way to overcome this ambiguity is integrated segmentation and recognition, which is classified into segmentation-free and segmentation-based methods.

The segmentation-based method attempts to split cursive words into character patterns at their true boundaries and to label the split character patterns. A word pattern is over-segmented into primitive segments such that each segment comprises a single character or part of a character. The segments are combined to generate candidate character patterns (forming a candidate lattice), which are evaluated using character recognition, incorporating geometric and linguistic contexts.

On the other hand, the segmentation-free method avoids the problems associated with explicit segmentation. It utilizes one-dimensional structural models such as HMMs or MRFs, concatenating character models to construct word models based on the provided lexicon during recognition, and it selects or discards segmentation points using the character models.

Online handwritten word recognition using a recurrent neural network [1] performs word recognition by continuously moving the input window of the network across the frame sequence representation of a word, thus generating activation traces at the output of the network. These output traces are subsequently examined to determine the ASCII string(s) best representing the word image.

Which of the segmentation-free and segmentation-based methods is better for handwritten string recognition has attracted considerable attention and discussion. In this paper, we implement both methods and compare them for online handwritten cursive word recognition. For both methods (segmentation-free and segmentation-based), we search for the optimal path as the recognition result. We expand the search space using a character-synchronous beam search strategy, and the probable search paths are evaluated by a path evaluation criterion in a CRF model. To evaluate character patterns, we combine an MRF model [2] with a P2DBMN-MQDF (pseudo 2D bi-moment normalization - modified quadratic discriminant function) recognizer [3].

The rest of this paper is organized as follows: Section 2 describes the preprocessing steps, Section 3 describes the CRF model for our word recognition, Section 4 presents the two search methods (segmentation-free and segmentation-based), Section 5 presents the experimental results, and Section 6 concludes the paper.

2. Preprocessing

Given a word, we compute baselines using linear regression lines that approximate the local minima or the local maxima.


Then, we normalize the different slants of words by shearing every word according to its slant angle. After transforming every word to a given corpus height (the distance between the baseline and the corpus line), we re-sample the trajectory points using linear interpolation so that the pen-tip coordinates are equidistant. Then, smoothing is accomplished by applying a Gaussian filter to each coordinate point in the sequence. Critical points such as the start and end points of strokes and points of high curvature are detected and included in the re-sampled and smoothed trajectory output. We also remove delayed strokes (e.g., crosses on 't's, dots on 'i's, etc.). Finally, we extract feature points using the method developed by Ramer [2]. The start and end points of every stroke are picked as feature points. Then, the point most distant from the straight line between adjacent feature points is selected as a feature point if its distance to the straight line is greater than a threshold value. This selection is applied recursively until no more feature points are selected.
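To make the recursive selection concrete, here is a minimal Python sketch; the function names and the threshold value are illustrative assumptions, not the authors' implementation:

```python
def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the straight line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    if (ax, ay) == (bx, by):
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    num = abs((bx - ax) * (py - ay) - (by - ay) * (px - ax))
    den = ((bx - ax) ** 2 + (by - ay) ** 2) ** 0.5
    return num / den

def extract_feature_points(stroke, threshold=2.0):
    """Recursively select feature points of one stroke.

    stroke: list of (x, y) pen-tip points after re-sampling and smoothing.
    The start and end points are always kept; 'threshold' is a hypothetical
    distance threshold in the normalized coordinate space."""
    keep = {0, len(stroke) - 1}

    def recurse(lo, hi):
        # find the point farthest from the straight line stroke[lo]-stroke[hi]
        best_d, best_i = threshold, None
        for i in range(lo + 1, hi):
            d = point_line_distance(stroke[i], stroke[lo], stroke[hi])
            if d > best_d:
                best_d, best_i = d, i
        if best_i is not None:          # keep it and recurse on both sides
            keep.add(best_i)
            recurse(lo, best_i)
            recurse(best_i, hi)

    if len(stroke) > 1:
        recurse(0, len(stroke) - 1)
    return [stroke[i] for i in sorted(keep)]
```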

3. Word recognition CRFs

3.1. CRFs

Shivram et al. [4] have proposed a CRF model for online handwritten word recognition, where the CRF model was constructed from over-segmented points. Here we apply a similar model, but formulate it based on feature points so that the segmentation-free method can also be applied.

Consider an input word pattern X which has a sequence of feature points $F = \{f_1, f_2, f_3, \ldots, f_g\}$ as shown in Fig. 1. There are many probable segmentations $\{S_1, S_2, S_3, \ldots, S_m\}$, and each $S_i = \{s^i_1, s^i_2, s^i_3, \ldots, s^i_{n_i}\}$, where $s^i_j$ denotes a candidate character pattern between feature points $f_k$ and $f_l$ ($k, l = 1 \ldots g$). For each $S_i$ we construct a neighborhood graph $g_i$ as shown in Fig. 1, where each node denotes a candidate character pattern and each link represents the relationship between neighboring candidate character patterns.

Let $(s^i_{j-1}, s^i_j)$ denote a pair of neighboring candidate character patterns $s^i_{j-1}$ and $s^i_j$. Then $s^i_j$ and $(s^i_{j-1}, s^i_j)$ correspond to unary (single) and binary (pair-wise) cliques, respectively.

Assuming that Y refers to the label sequence of $S_i$, $P(Y|S_i, X)$ can be approximated by a real-valued energy function $E(X, S_i, Y, N, \Lambda)$ with clique set N and parameters $\Lambda$:

$$P(Y|S_i, X) = \frac{\exp(-E(X, S_i, Y, N, \Lambda))}{Z(S_i, X)}, \qquad Z(S_i, X) = \sum_{Y'} \exp(-E(X, S_i, Y', N, \Lambda)) \qquad (1)$$

The normalization constant $Z(S_i, X)$ may be ignored if we do not require strict probability values. Then, the problem of finding the best label $Y^*$, which involves maximizing $P(Y|S_i, X)$, becomes equivalent to minimizing the total graph energy:

$$Y^* = \arg\max_Y P(Y|S_i, X) = \arg\min_Y E(X, S_i, Y, N, \Lambda) \qquad (2)$$

To maximize efficiency, we utilize only unary and binary cliques. Thus, the total energy function is defined as

$$E(X, S_i, Y, N, \Lambda) = -\sum_{(s^i_{j-1}, s^i_j) \in S_i} \sum_{k=1}^{K} \lambda_k \, f_k^{(s^i_{j-1}, s^i_j)}(y_{j-1}, y_j) \qquad (3)$$

where $f_k^{(s^i_{j-1}, s^i_j)}(y_{j-1}, y_j)$ are feature functions on a binary clique $(s^i_{j-1}, s^i_j)$, $\Lambda = \{\lambda_k\}$ are weighting parameters, and $y_{j-1}$ and $y_j$ denote the labels of $s^i_{j-1}$ and $s^i_j$, respectively. In eq. (3) we consider only binary cliques $(s^i_{j-1}, s^i_j)$, since they subsume the feature functions of unary cliques $s^i_j$. The total energy function measures the plausibility of Y: the smaller $E(X, S_i, Y, N, \Lambda)$ is, the larger $P(Y|S_i, X)$ becomes.

For all segmentations $\{S_1, S_2, S_3, \ldots, S_m\}$, we select the best path. Denoting the number of feature points on $S_i$ by $N(S_i, X)$, we use the following path evaluation criterion to evaluate all segmentation and recognition paths:

$$EC(S_i, Y, X) = \frac{E(X, S_i, Y, N, \Lambda)}{N(S_i, X)} \qquad (4)$$

$N(S_i, X)$ is the same for all segmentation hypotheses over the total path length, but it differs when evaluating partial paths in a beam search. Therefore, it is necessary to normalize the path scores by $N(S_i, X)$ when evaluating partial paths in a beam search. When using a Viterbi search such as dynamic programming (DP), this makes the model analogous to a semi-CRF framework as in [5].

3.2. Feature functions

The feature functions are unary if defined on single cliques, and binary if defined on pair-wise cliques. In addition, the feature functions are class-relevant if they depend on the character class (or class pair); otherwise they are class-irrelevant. Thus, there are four types of feature functions: unary class-relevant, unary class-irrelevant, binary class-relevant and binary class-irrelevant.

Two output scores of character recognizers on candidate character patterns serve as unary class-relevant feature functions: a P2DBMN-MQDF classifier on direction histogram features [3] and an MRF classifier [2]. The P2DBMN-MQDF classifier evaluates the shapes of character patterns, while the MRF classifier evaluates their structures. The extracted feature points (Section 2) are normalized so that the minimum X coordinate is 0 while the Y coordinates are unchanged, and they are used to construct one-dimensional structural character MRFs to perform recognition. For the P2DBMN-MQDF classifier, the feature vector dimensionality is reduced using Fisher Linear Discriminant Analysis (FLDA). There are few symbol classes in English (78 for our dataset), and with FLDA the reduced dimensionality must be less than the number of classes. Using too low a feature dimensionality results in low recognition accuracy.

Figure 1: Extracting feature points and neighborhood graphs.


To resolve this, we create sub-classes for each symbol category by applying k-means clustering and use these sub-classes instead of the true class labels. Using sub-classes also better expresses the distribution of each symbol, resulting in improved accuracy. We input the original re-sampled and smoothed points (Section 2) of each candidate character pattern into the P2DBMN-MQDF recognizer to obtain a recognition score.
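The sub-class construction could be sketched as follows; this is an illustrative sketch that assumes scikit-learn's KMeans and a hypothetical per-class dictionary of training feature vectors, not the authors' exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def make_subclasses(features_by_class, n_subclasses=4, seed=0):
    """Split each symbol class into k-means sub-classes so that FLDA can keep
    more discriminant dimensions (the reduced dimensionality must stay below
    the number of (sub-)classes).

    features_by_class: dict mapping class label -> (n_samples, dim) array.
    Returns parallel arrays of feature vectors and sub-class labels."""
    all_feats, sub_labels = [], []
    for label, feats in features_by_class.items():
        k = min(n_subclasses, len(feats))    # cannot have more clusters than samples
        km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(feats)
        for f, c in zip(feats, km.labels_):
            all_feats.append(f)
            sub_labels.append(f"{label}_{c}")  # e.g. 'a_0', 'a_1', ...
    return np.array(all_feats), np.array(sub_labels)
```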

We use three unary class-relevant feature functions to evaluate node attributes (character inner structure, character size, single-character position), and a binary class-relevant feature function and a binary class-irrelevant feature function to capture the dependencies between nodes (pair-character position and horizontal overlap). For these we use log-likelihood scores from quadratic discriminant function (QDF) classifiers.

3.3. Parameter Learning

We train the weighting parameters Λ with a genetic algorithm (GA) on the validation data so as to maximize the recognition rate on that data. We treat each weight as an element of a chromosome. To evaluate the fitness of a chromosome, each training word pattern is searched for the optimal path evaluated using the weight values in the chromosome. To save computation, we first set each weight to 1 and select the top 300 recognition candidates (segmentation-recognition paths) for each training word. We then train the weight parameters by GA using the selected 300 recognition candidates. After some iterations, we use the updated weight values to re-select the top 300 recognition candidates for each training word pattern. We repeat this recognition candidate selection three times.
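A rough sketch of this GA-style weight tuning is given below; the chromosome encoding, crossover and mutation settings are not specified in the paper, so the ones used here are assumptions:

```python
import random

def ga_tune_weights(candidates_per_word, n_weights, pop_size=30, generations=50,
                    mutation_sigma=0.1, seed=0):
    """Tune feature-function weights by a simple genetic algorithm.

    candidates_per_word: for each training word, a list of
        (is_correct, feature_score_vector) pairs, i.e. the pre-selected
        top-300 segmentation-recognition paths.
    Fitness = number of words whose lowest-energy candidate is the correct one."""
    rng = random.Random(seed)

    def fitness(weights):
        correct = 0
        for cands in candidates_per_word:
            # lower energy (negative weighted score) is better, cf. eq. (3)-(4)
            best = min(cands, key=lambda c: -sum(w * f for w, f in zip(weights, c[1])))
            correct += best[0]
        return correct

    population = [[rng.uniform(0.0, 2.0) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]           # keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 + rng.gauss(0, mutation_sigma) for x, y in zip(a, b)]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```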

4. Search methods

To search for the optimal path as the recognition result, we attempt segmentation-free and segmentation-based methods. Both methods use a character-synchronous beam search strategy in which a trie lexicon is first constructed from a word database, as shown in Fig. 2. When training the character MRFs, we can count the range of the number (length) of feature points for each character class. For instance, the length range of feature points for the character "O" is from 5 to 50, as shown in Fig. 2. Then, the range of the number of feature points from each node of the trie to a terminal can be calculated from the length ranges of the character classes; the numbers shown in parentheses in each node box are this length range to the terminal. We can restrict the searched paths by these length ranges, resulting in improved recognition accuracy. We use the length restriction in terms of feature points for the segmentation-free method, and the length restriction in terms of nodes (characters) to the terminal for the segmentation-based method. For instance, the lengths in nodes to the terminal for the character "O" are 3, 5 and 7.
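The trie with per-node length ranges might be represented roughly as below; the node and field names are illustrative, and the per-character length ranges are assumed to come from MRF training statistics:

```python
class TrieNode:
    """Trie node holding the range of feature-point lengths from this node to any terminal."""
    def __init__(self, char=None):
        self.char = char
        self.children = {}      # char -> TrieNode
        self.is_terminal = False
        self.min_to_end = 0     # minimum feature points needed to reach a terminal
        self.max_to_end = 0     # maximum feature points allowed to reach a terminal

def build_trie(words, char_len_range):
    """words: iterable of lexicon words.
    char_len_range: dict char -> (min_pts, max_pts) counted from MRF training data."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode(ch))
        node.is_terminal = True

    def annotate(node):
        # length range contributed by this node's own character
        lo, hi = char_len_range.get(node.char, (0, 0)) if node.char else (0, 0)
        if node.children:
            ranges = [annotate(c) for c in node.children.values()]
            child_lo = min(r[0] for r in ranges)
            child_hi = max(r[1] for r in ranges)
        else:
            child_lo = child_hi = 0
        # a terminal node may end the word here, so its minimum is its own range
        node.min_to_end = lo + (0 if node.is_terminal else child_lo)
        node.max_to_end = hi + child_hi
        return node.min_to_end, node.max_to_end

    annotate(root)
    return root
```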

4.1. Segmentation-free method

We use the example of the extracted feature points of a word pattern shown in Fig. 1, and illustrate a search in Fig. 3 to describe our process. We conduct the search and expansion from expansion depth d1 to d5.

We first search the trie lexicon from its start nodes. In Fig. 2, the start nodes are [O], [p] and [i], and based on these we expand the root node and set its child nodes: N1-1 with character category [O], N1-2 with character category [p] and N1-3 with character category [i], where each child node has a character category C and a start point of the feature sequence. They share the start point f1. For each node, the length from the start point f1 to the terminal point f56 is 56, so the node N1-3 is erased because the length range to the terminal of the trie node [i] is 9-44, which does not cover this length. For each of the nodes N1-1 and N1-2, we match the feature points from the start point fi to the point fi+M against the states S = {s1, s2, s3,…, sJ} of the character MRF model of the corresponding C by a Viterbi search, where M is the maximum number of feature points of C, calculated from the training patterns. Before matching, we shift the X coordinates of the feature points from fi to fi+M so that the minimum X coordinate is 0. Then we obtain MRF paths at the [End] state, which correspond to end points such as f16, f17 and f18 for N1-1.

Figure 2: Portion of the trie lexicon of words.
Figure 3: Segmentation-free search.
Figure 4: Segmentation candidate lattice.


We sort the scores of the MRF paths (note that each score S_MRF is normalized by the number N_fea of feature points from the start point to the end point of the path, i.e., S_MRF / N_fea), and select the N_MRF top MRF paths. In Fig. 3, N_MRF is set to three, so three MRF paths (from f1 to f16, from f1 to f17, from f1 to f18) of N1-1 and three MRF paths (from f1 to f12, from f1 to f15, from f1 to f18) of N1-2 are selected; we call them sub-nodes. Node N1-1 has three sub-nodes (N1-1-1, N1-1-2 and N1-1-3), while N1-2 has three sub-nodes (N1-2-1, N1-2-2 and N1-2-3).
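To make the matching step concrete, here is a heavily simplified Viterbi sketch for a left-to-right character model; the real MRF of [2] uses richer state, emission and transition scores than the single per-point log-likelihood function assumed here:

```python
import math

def viterbi_match(points, n_states, emit_logp, trans_logp=0.0):
    """Match feature points against a simplified left-to-right character model.

    points: feature points f_i .. f_{i+M} of the candidate span.
    emit_logp(state, point): assumed per-point log-likelihood of 'point' under 'state'.
    Returns {end_index: normalized_score} for spans that reach the last state,
    i.e. candidate end points with scores normalized by span length (S_MRF / N_fea)."""
    NEG = -math.inf
    # best[s] = best log score of a path currently in state s
    best = [NEG] * n_states
    best[0] = emit_logp(0, points[0])
    end_scores = {}
    for t in range(1, len(points)):
        new = [NEG] * n_states
        for s in range(n_states):
            stay = best[s]
            advance = best[s - 1] + trans_logp if s > 0 else NEG
            score = max(stay, advance)
            if score > NEG:
                new[s] = score + emit_logp(s, points[t])
        best = new
        if best[n_states - 1] > NEG:
            # reaching the last state at time t gives a candidate end point
            end_scores[t] = best[n_states - 1] / (t + 1)
    return end_scores
```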

For all the sub-nodes up to depth d1, we evaluate all search paths according to the path evaluation criterion in eq. (4), sort them, and then keep only several top paths, erasing the others. The number of selected top paths is called the beam band. In Fig. 3, the beam band is set to two, and the two paths ending at N1-1-1 and N1-1-3 are selected for d1.

Then, we apply the same method to process the next depth. Finally, the expansion reaches depth d5. For all nodes up to depth d5, we evaluate all paths according to the path evaluation criterion, sort them, and then select the optimal path as the recognition result.
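The character-synchronous beam pruning can then be sketched as follows (a simplified sketch reusing the hypothetical path_criterion from the sketch in Section 3.1):

```python
def prune_beam(partial_paths, beam_band):
    """Keep only the best 'beam_band' partial paths at the current expansion depth.

    partial_paths: list of lists of Candidate objects (see path_criterion above).
    Lower normalized energy (eq. (4)) means a more plausible path."""
    ranked = sorted(partial_paths, key=path_criterion)
    return ranked[:beam_band]

# e.g. with beam_band = 2, only the two best-scoring paths survive to the next depth:
# survivors = prune_beam(expanded_paths, beam_band=2)
```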

For each node Nk-j with a character category C and a start point fi, we need to execute MRF matching to decide its end points and sub-nodes. After that, we need to extract features to recognize each sub-node with the P2DBMN-MQDF recognizer. Different nodes may require the same pair of C and fi. Therefore, for each pair of C and fi, once its end points and sub-nodes are decided and the scores of its sub-nodes are calculated, we store them and reuse them for other nodes from the second time on. For each pair of start and end points, we also store the features for the P2DBMN-MQDF recognizer and reuse them from the second time on. This greatly reduces recognition time. We call this the storage strategy of scores and features (SSSF).
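The SSSF amounts to memoizing the expensive calls on their keys; a minimal sketch with hypothetical function and field names:

```python
class SSSFCache:
    """Store-and-reuse strategy for MRF matching results and recognizer features (SSSF)."""
    def __init__(self, mrf_match_fn, feature_fn):
        self._mrf_match_fn = mrf_match_fn  # (char_category, start_point) -> sub-node scores
        self._feature_fn = feature_fn      # (start_point, end_point) -> feature vector
        self._mrf_cache = {}
        self._feat_cache = {}

    def mrf_result(self, char_category, start_point):
        key = (char_category, start_point)
        if key not in self._mrf_cache:              # computed only on the first request
            self._mrf_cache[key] = self._mrf_match_fn(char_category, start_point)
        return self._mrf_cache[key]

    def features(self, start_point, end_point):
        key = (start_point, end_point)
        if key not in self._feat_cache:
            self._feat_cache[key] = self._feature_fn(start_point, end_point)
        return self._feat_cache[key]
```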

4.2. Segmentation-based method

This method selects candidate segmentation points P = {p1, p2, p3,…, pq} from the feature points F = {f1, f2, f3,…, fg} and considers only the selected points for segmentation. We take the pen-up, pen-down, minima and maxima points as candidate segmentation points, which are used to over-segment the word into primitive segments. One or more consecutive primitive segments are combined to form a candidate character pattern. All possible combinations of these candidate character patterns are represented in a lattice where each node denotes a candidate character pattern and each edge denotes a segmentation point. Each candidate character pattern that lies between segmentation points pk and pl (k, l = 1~q) is followed by candidate character patterns that start from segmentation point pl. A candidate character pattern does not start from a pen-up point and does not end at a pen-down point. We also restrict the width of each candidate character pattern to a threshold value. Figure 4 shows the segmentation candidate lattice for the word pattern shown in Fig. 1.
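Generating candidate character patterns from consecutive primitive segments might be sketched as follows; the segment attributes, the maximum number of primitives per character and the width threshold are assumptions for illustration:

```python
def build_candidate_patterns(primitives, max_prims_per_pattern=4, max_width=150.0):
    """Combine consecutive primitive segments into candidate character patterns.

    primitives: ordered list of primitive segments from over-segmentation, each with
    .starts_at_pen_up, .ends_at_pen_down, .x_min and .x_max attributes (assumed).
    Returns a list of (start_index, end_index_exclusive) candidate patterns."""
    candidates = []
    n = len(primitives)
    for start in range(n):
        if primitives[start].starts_at_pen_up:
            continue                               # must not start at a pen-up point
        for end in range(start, min(start + max_prims_per_pattern, n)):
            width = max(p.x_max for p in primitives[start:end + 1]) - \
                    min(p.x_min for p in primitives[start:end + 1])
            if width > max_width:
                break                              # width restriction on a candidate pattern
            if primitives[end].ends_at_pen_down:
                continue                           # must not end at a pen-down point
            candidates.append((start, end + 1))
    return candidates
```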

For each node in the lattice, the possible lengths (in characters) to the terminal node are calculated. This is done by first setting the length of all terminal nodes to one and then working backwards one level at a time: for each preceding node, the length vector is obtained by adding one to the lengths of its succeeding nodes. This is shown in Fig. 4, where the numbers in each node box are the possible lengths. This length vector is used together with the lexicon to prune unlikely search paths, improving both recognition accuracy and speed.
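The backward length computation over the lattice can be sketched like this, assuming the lattice is given as a dictionary of successor lists keyed by node id:

```python
from functools import lru_cache

def lengths_to_terminal(successors, terminal_nodes):
    """Compute, for every lattice node, the possible path lengths (in nodes)
    to a terminal node.

    successors: dict node_id -> list of successor node_ids (the candidate lattice,
    assumed acyclic since patterns always move forward).
    terminal_nodes: node_ids whose candidate pattern ends at the last feature point."""
    terminal_nodes = set(terminal_nodes)
    all_nodes = set(successors) | {n for succ in successors.values() for n in succ} | terminal_nodes

    @lru_cache(maxsize=None)
    def lengths(node):
        if node in terminal_nodes:
            return frozenset({1})
        result = set()
        for nxt in successors.get(node, []):
            result.update(l + 1 for l in lengths(nxt))
        return frozenset(result)

    return {node: sorted(lengths(node)) for node in all_nodes}
```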

We show a search in Fig. 5, where d1 - d5 represent the depth levels in the search space. Starting with the root node, lattice nodes (1) - (6) (Fig. 4) are expanded synchronously at depth d1. At N1-1, lattice node (1) can be matched to three possible characters (O, p, i) from the lexicon (Fig. 2). Since the possible lengths for node (1) are 6-19, while the lexicon length from p is 4 and that from i is 2, which do not match these possible lengths, we drop them. The case is similar for N1-2 - N1-6. This forms the expansion for depth d1. At this stage, all paths are evaluated according to the path evaluation criterion in eq. (4), and only a few top-scoring paths are kept while the others are pruned. The number of selected paths is the beam band (beam width). In Fig. 5 the beam band is two, and the two paths ending at N1-5-O and N1-6-O are selected for d1. Similarly, at depth d2 this process is repeated to expand the search. Finally, the expansion is repeated up to depth d5, yielding one candidate, 'Offer', as the recognized word.

A candidate character pattern (lattice node) may appear in several search paths. Evaluating node-character matches using a character recognizer requires extracting features from the character pattern. To avoid redundant processing, for each node, once its features are extracted we store them, and they are reused in subsequent calls. Additionally, for a node and a character category, we store the recognition score at the first recognition and reuse this stored score for subsequent evaluations. We call this SSSF as well, in analogy with the segmentation-free method.

5. Experiments

We evaluate the word recognizers on the publicly available IAM-OnDB dataset [6], which consists of handwritten sentences acquired from a 'smart' whiteboard.

Figure 5: Segmentation-based search.


It has four disjoint sets: a training set (5,364 lines), two validation sets (1,438 and 1,518 lines), and a test set (3,859 lines). For our isolated word recognition, we generated word-level ground truth using the two-step approach elaborated in [7]; a few words giving segmentation errors were dropped, but their number is far too small to have a significant impact on the results. We set the trie lexicon to contain only one word (the word string label of the word pattern to be recognized), used the word recognizer trained in previous research [4] to recognize each word pattern, and took the recognized segmentation result as the character segmentation labels of the pattern. In this way, we obtained character-level segmentation labels for the word patterns of IAM-OnDB.

We used the training data to train the feature function models, and then tested the recognition rate on the test data. Each of these models was trained on 78 symbols. The constructed trie lexicon contains 5,562 words (the actual size of the test set lexicon). The weighting parameters Λ were estimated on the validation sets by GA. The experiments were run on an Intel(R) Core(TM) 2 Quad CPU Q9550 @ 2.83 GHz with 4.00 GB of memory. Table 1 shows the results; the numbers enclosed in square brackets are the average recognition time per word.

Table 1: Word recognition rate (%).

Method      Beam        All             Without MRF   Without P2DBMN-MQDF   Without geometric features   Without length restriction
Seg. based  300-beam    85.96 [1.0 s]   60.63         82.48                 83.67                        79.69
Seg. based  1000-beam   86.54 [1.2 s]   -             -                     -                            -
Seg. free   300-beam    85.58 [2.8 s]   60.10         82.12                 83.11                        80.56
Seg. free   1000-beam   86.12 [7.4 s]   -             -                     -                            -

From these results, we can see that: (1) Using all feature function models significantly improves the recognition accuracy by combining an MRF model with a P2DBMN-MQDF recognizer. The MRF model applies a structural method that is weak at capturing global character information but robust against character shape variations, while the P2DBMN-MQDF recognizer uses an unstructured method that is robust against noise but weak against character shape variations. By combining them, they compensate for each other's disadvantages. (2) The length restriction brings higher recognition accuracy. (3) The two methods (segmentation-free and segmentation-based) yielded comparable results, with the segmentation-based method performing slightly better. Although both methods use the same path evaluation criterion in a CRF model, their recognition results differ. We consider that the segmentation-based method attempts to select candidate segmentation points at true character boundaries, resulting in reduced confusion and improved recognition speed, while the segmentation-free method uses the MRF model to select the N_MRF top MRF paths and some end points from a start point for a character category in a single search, resulting in improved processing speed. Therefore, it should be possible to significantly improve recognition speed by combining the two methods. (4) Our system outperforms a state-of-the-art recurrent neural network (BLSTM), whose recognition rate is 85.3% for the same dataset and lexicon [1].

We also evaluated the SSSF and found that at beam band 300 it greatly improved recognition speed (by about eight times for the segmentation-based method and 89 times for the segmentation-free method). The memory consumption of our recognizer is about 8.5 MB.

6. Conclusion

This paper presented a system for online handwritten English cursive word recognition and compared the segmentation-free and segmentation-based methods under the same path evaluation criterion in a CRF model. Experimental results demonstrate that the segmentation-based method performs better because it attempts to select candidate segmentation points at true character boundaries, resulting in reduced confusion and improved recognition speed. We expect to significantly improve recognition speed by combining the two methods in the future. We have already shown that combining the MRF model with the P2DBMN-MQDF recognizer achieves a very high recognition rate for online handwritten Japanese text; in this paper, we have shown that this combination is also effective for online English word recognition.

Acknowledgement

This research is partially supported by Grant-in-Aid for Scientific Research (C) 15K00225.

References

[1] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. PAMI, 31(5), 855-868, 2009.

[2] B. Zhu and M. Nakagawa. On-line handwritten Japanese characters recognition using a MRF model with parameter optimization by CRF. Proc. 11th ICDAR, 603–607, 2011.

[3] C.-L. Liu and X.-D. Zhou. Online Japanese character recognition using trajectory-based normalization and direction feature extraction, Proc. 10th IWFHR, 217-222, 2006.

[4] A. Shivram, B. Zhu, S. Setlur, M. Nakagawa, and V. Govindaraju. Segmentation based online word recognition: A conditional random field driven beam search strategy. Proc. 12th ICDAR, 852–856, 2013.

[5] X.-D. Zhou, D.-H. Wang, F. Tian, C.-L. Liu, and M. Nakagawa. Handwritten Chinese/Japanese text recognition using semi-Markov conditional random fields. IEEE Trans. PAMI, 35(10), 2413–2426, 2013.

[6] M. Liwicki and H. Bunke. IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. Proc. 8th ICDAR, 956-961, 2005.

[7] C. T. Nguyen, B. Zhu, and M. Nakagawa. A semi-incremental recognition method for on-line handwritten English text. Proc. 14th ICFHR, 234–239, 2014.