
Pattern Recognition 41 (2008) 3442--3451


High accuracy handwritten Chinese character recognition using LDA-based compound distances

Tian-Fu Gao∗, Cheng-Lin Liu
National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, Beijing 100190, PR China

ARTICLE INFO

Article history: Received 9 December 2007; Received in revised form 11 April 2008; Accepted 15 April 2008

Keywords: Handwritten Chinese character recognition; LDA; Compound distance; Compound Mahalanobis function

ABSTRACT

To improve the accuracy of handwritten Chinese character recognition (HCCR), we propose linear discriminant analysis (LDA)-based compound distances for discriminating similar characters. The LDA-based method is an extension of the previous compound Mahalanobis function (CMF), which calculates a complementary distance on a one-dimensional subspace (discriminant vector) for discriminating two classes and combines this complementary distance with a baseline quadratic classifier. We use LDA to estimate the discriminant vector for better discriminability and show that under restrictive assumptions, the CMF is a special case of our LDA-based method. Further improvements can be obtained when the discriminant vector is estimated from higher-dimensional feature spaces. We evaluated the methods in experiments on the ETL9B and CASIA databases using the modified quadratic discriminant function (MQDF) as the baseline classifier. The results demonstrate the superiority of the LDA-based method over the CMF and the superiority of discriminant vector learning from high-dimensional feature spaces. Compared to the MQDF, the proposed method reduces the error rates by over 26%.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

The problem of handwritten Chinese character recognition (HCCR) has received considerable attention from the research community for its potential in many applications and its technical challenge. Since the 1970s, many effective methods of image preprocessing, feature extraction, classification, and post-processing have been proposed for HCCR, and the recognition performance has improved constantly [1]. However, compared to human recognition ability and the requirements of applications, the current recognition accuracy of computers is still insufficient, especially for unconstrained handwriting. Chinese character recognition involves a large character set (e.g., about 5000 characters are frequently used in China, while some recognition systems consider over 10,000 characters), and there are many similar characters. Similar characters are prone to be confused with each other by either computer or human, since they often share common radicals and differ only subtly in local details. The variability of writing styles further blurs the shape difference between similar characters. Hence, improving the discrimination of similar characters can contribute significantly to the reduction of recognition errors.

∗ Corresponding author. Tel.: +86 10 62632251.
E-mail addresses: [email protected] (T.-F. Gao), [email protected] (C.-L. Liu).

0031-3203/$30.00 © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2008.04.011

A Chinese character recognition system usually adopts a hierarchical classification scheme to improve both classification speed and accuracy. Often, a simple and fast classifier (coarse classifier) is used to select candidate classes that include the true class of the test character with high probability; then a high accuracy classifier (fine classifier) identifies the class from the candidates. The fine classifier can be designed to discriminate all character classes, a subset of classes, or a pair of classes. An all-class fine classifier can readily classify an arbitrary subset of candidate classes (the candidate subset varies with the test character), but training it to separate all characters is non-trivial. It is relatively easier to train a classifier for separating a fixed subset of classes, but the partitioning of class subsets is also non-trivial. In the following, we review some previous work on fine classifier design.

Classifiers based on quadratic discriminant functions (QDFs) have been applied successfully to HCCR, the most popular among them being the modified quadratic discriminant function (MQDF) proposed by Kimura et al. [2]. Compared to the ordinary QDF, the MQDF reduces the computational complexity and meanwhile improves the generalization performance by replacing the eigenvalues in the minor subspace of each class with a constant. Other improved versions of the QDF include the pseudo Bayes classifier [3], the asymmetric Mahalanobis distance (MD) [4], the determinant normalized QDF [5], etc. These quadratic classifiers, though they perform fairly well in HCCR, cannot discriminate all classes well, especially similar characters. Liu applied discriminative training to parameter learning of the MQDF for


HCCR [6], but showed that discriminative MQDF [7] improves the accuracy of HCCR only slightly. On the other hand, he showed that discriminative feature extraction (discriminative subspace learning) [8] improves the accuracy significantly.

Neural network classifiers and support vector machines (SVMs) are suitable for discriminating subsets of confusing characters, but for discriminative training, the subset of classes must be fixed. Fu and Xu designed probabilistic decision-based neural networks for discriminating groups of classes divided by clustering, with each network trained on the samples of the classes in a group [9]. SVMs have been used by Dong et al. in HCCR for classifying one class from the others, by designing for each class an SVM trained on all samples using a fast algorithm [10]. The training complexity can be reduced by training with the samples of one class (positive samples) and its confusing classes (negative samples) only, as done for neural networks [11]. The confusing classes can be detected according to nearest neighbor rules or using a baseline classifier. Kawatani et al. separate a class from its confusing classes by designing a QDF using Fisher's linear discriminant analysis (LDA) and add this discriminant function to the baseline weighted Euclidean distance (ED) for classification [12].

Classifiers for subsets of classes can better discriminate similar characters than the all-class classifier because of the lower separation complexity. In the simplest case, the subset classifier discriminates only two classes. Using such pair discriminators to reorder the candidate classes given by a baseline classifier has shown success [13,14]. However, the exhaustive set of pair discriminators is huge: for M classes, the total number of pairs is M(M−1)/2. For thousands of classes of Chinese characters, storing pairwise classifiers for all pairs is prohibitive, even though most pairs of Chinese characters can be well separated by linear classifiers [15]. Actually, pair discriminators are needed only for confusing similar pairs, since most pairs are well separated by the baseline classifier.

The compound Mahalanobis function (CMF), proposed by Suzuki et al. [16], can be viewed as a pairwise classifier. It calculates a complementary distance on a one-dimensional subspace (discriminant vector) for a pair of classes. The complementary distance is combined with the baseline quadratic classifier (the MD in Ref. [16]) for re-ordering two candidate classes. The discriminant vector is estimated from the Gaussian density parameters of the two classes at recognition time. Since the pairwise classifier stores no extra parameters in addition to the base classifier, pair discrimination is available for all pairs of classes. This method has been extended to other types of quadratic classifiers (we call this family of methods compound function or compound distance methods) [17,18], including the modified projection distance [17] and the MQDF.

Though the CMF has shown superior recognition performance in previous works [16--18], it has two drawbacks. First, the discriminant vector for two classes is not optimized, as we will discuss in Section 4. Second, despite the CMF having no extra parameters, calculating the discriminant vector during recognition is time-consuming. To overcome these problems, we propose to estimate the discriminant vector using LDA [19]. We will show that under restrictive assumptions, the CMF method is a special case of our LDA method. We estimate the discriminant vector for selected pairs of classes in the training stage. In the recognition stage, a distance measure is calculated on the one-dimensional subspace and is added to the baseline classifier (the MQDF in our case) for re-ordering two candidate classes. Besides higher discriminability than the CMF, the LDA-based compound distance method is flexible in that the discriminant vector can be estimated in an arbitrary feature space, while the CMF is calculated in the same feature space as the baseline quadratic classifier. For Chinese character recognition, the baseline classifier is mostly designed on a lower-dimensional subspace transformed from the original feature space for classification efficiency. We will evaluate the LDA-based compound distance on the original feature space and on subspaces of variable dimensionality, and show that compound distances in higher-dimensional spaces give higher recognition accuracy.

Some preliminary results of our method have been presented at a conference [20], where the LDA-based compound distance was used in the same feature space as the MQDF. In this paper, we extend the method to variable feature spaces and report extensive experimental results. In the rest of this paper, we review the MQDF and CMF methods in Section 2 and give an overview of our Chinese character recognition system in Section 3; the proposed LDA-based compound distance method is described in Section 4; Section 5 presents our experimental results and Section 6 gives concluding remarks.

2. MQDF and CMF

The MQDF of Kimura et al. [2] has been widely applied to handwritten Japanese and Chinese character recognition with great success. We take it as a baseline classifier and use compound distance functions to further improve the accuracy.

2.1. MQDF

The MQDF is a modified version of the ordinary QDF, which assumes a Gaussian density for each class [19]. The QDF suffers from its quadratic number of parameters, which cannot be estimated reliably when the number of samples per class is small compared to the feature dimensionality. The MQDF reduces the complexity of the QDF by replacing the minor eigenvalues of the covariance matrix of each class with a constant. Consequently, the minor eigenvectors are cancelled out in the discriminant function. This reduces the complexity of both storage and computation and meanwhile improves the generalization performance of classification.

Denote the input character pattern by a $d$-dimensional feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$. For classification, each class is assumed to have a Gaussian density $p(\mathbf{x}|\omega_i) = N(\boldsymbol{\mu}_i, \Sigma_i)$, where $\boldsymbol{\mu}_i$ and $\Sigma_i$ are the class mean and covariance matrix, respectively. Assuming equal a priori class probabilities, the discriminant function is given by the log-likelihood:

$$-2 \log p(\mathbf{x}|\omega_i) = (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \log|\Sigma_i| + C_I, \quad (1)$$

where $C_I$ is a class-independent term, and is usually omitted. We take the minus log-likelihood to make the discriminant function a distance measure. The covariance matrix $\Sigma_i$ can be diagonalized as $\Sigma_i = \Phi_i \Lambda_i \Phi_i^T$, where $\Lambda_i = \mathrm{diag}[\lambda_{i1}, \ldots, \lambda_{ik}, \ldots, \lambda_{id}]$ has the eigenvalues of $\Sigma_i$ (in descending order) as diagonal elements, and $\Phi_i$ is an orthonormal matrix comprising as columns the eigenvectors of $\Sigma_i$. Replacing the minor eigenvalues with a constant, i.e., replacing $\Lambda_i$ with $\mathrm{diag}[\lambda_{i1}, \ldots, \lambda_{ik}, \delta_i, \ldots, \delta_i]$ ($k$ is the number of principal eigenvectors to be retained), the discriminant function of Eq. (1) becomes what we call the MQDF:

$$f(\mathbf{x}, \omega_i) = \sum_{j=1}^{k} \frac{1}{\lambda_{ij}} [(\mathbf{x} - \boldsymbol{\mu}_i)^T \mathbf{u}_{ij}]^2 + \frac{1}{\delta_i} \left( \|\mathbf{x} - \boldsymbol{\mu}_i\|^2 - \sum_{j=1}^{k} [(\mathbf{x} - \boldsymbol{\mu}_i)^T \mathbf{u}_{ij}]^2 \right) + \sum_{j=1}^{k} \log \lambda_{ij} + (d-k) \log \delta_i, \quad (2)$$

where $\mathbf{u}_{ij}$, $j = 1, \ldots, k$, are the principal eigenvectors of the covariance matrix of class $\omega_i$. In classification, the input pattern is classified to the class of minimum quadratic distance (MQDF), and multiple candidate classes are ordered in ascending order of distances.
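As an illustration, the following minimal numpy sketch (our own rendering of Eq. (2), not the authors' code; all function and variable names are hypothetical) computes the MQDF distance of a feature vector to one class:

```python
import numpy as np

def mqdf_distance(x, mean, eigvecs, eigvals, delta):
    """MQDF distance of Eq. (2) for one class.

    x       : (d,) feature vector
    mean    : (d,) class mean
    eigvecs : (d, k) principal eigenvectors of the class covariance (columns)
    eigvals : (k,) corresponding principal eigenvalues, in descending order
    delta   : constant replacing the d - k minor eigenvalues
    """
    d, k = x.shape[0], eigvals.shape[0]
    diff = x - mean
    proj = eigvecs.T @ diff                    # (x - mu)^T u_ij, j = 1..k
    quad = np.sum(proj**2 / eigvals)           # principal-subspace term
    residual = diff @ diff - np.sum(proj**2)   # squared norm in the minor subspace
    return (quad + residual / delta
            + np.sum(np.log(eigvals)) + (d - k) * np.log(delta))

# Classification assigns x to the class of minimum distance, e.g.:
# label = min(classes, key=lambda c: mqdf_distance(x, mu[c], U[c], lam[c], delta))
```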


Fig. 1. Projection vectors of CMF in 3D space (minor subspace in 2D).

2.2. CMF

In the MQDF, the $d-k$ minor eigenvectors of each class are totally cancelled out. The CMF [16--18] makes use of these minor eigenvectors to further discriminate pairs of classes. As in the MQDF, the minor subspace is implicitly represented by the principal eigenvectors, and the minor eigenvectors need not be stored.

For discriminating two classes $\omega_j$ and $\omega_i$, the CMF method calculates a compound distance measure for each class. With respect to $\omega_j$, the CMF is formulated as

$$f_{CMF}(\mathbf{x}, \omega_j) = (1 - \alpha) f_{MQDF}(\mathbf{x}, \omega_j) + \alpha f_{MD}(\mathbf{x}', \omega_j; \omega_i), \quad (0 \leq \alpha \leq 1), \quad (3)$$

where $f_{MQDF}(\mathbf{x}, \omega_j)$ is the MQDF (or possibly any other QDF), $f_{MD}(\mathbf{x}', \omega_j; \omega_i)$ is an added MD (or ED) to $\omega_j$, and $\alpha$ is a weighting coefficient. The added distance is calculated on a one-dimensional subspace (discriminant vector) estimated from the minor eigenvectors of the two classes. $\mathbf{x}'$ is the projection of $\mathbf{x}$ on the subspace:

$$\mathbf{x}' = \boldsymbol{\mu}_j + [(\mathbf{x} - \boldsymbol{\mu}_j)^T \mathbf{w}] \mathbf{w}, \quad (4)$$

where

$$\mathbf{w} = \frac{\sum_{m=k+1}^{d} [(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \mathbf{u}_{jm}] \mathbf{u}_{jm}}{\sqrt{\sum_{m=k+1}^{d} [(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \mathbf{u}_{jm}]^2}} \quad (5)$$

is the discriminant vector estimated from the minor subspace of class $\omega_j$. As shown in Fig. 1, $\mathbf{w}$ is the normalized projection of $\boldsymbol{\mu}_i - \boldsymbol{\mu}_j$ onto the minor subspace of class $\omega_j$. This direction is proposed to provide complementary information for discriminating class $\omega_j$ from $\omega_i$ (Fig. 1).

The added MD $f_{MD}(\mathbf{x}', \omega_j; \omega_i)$ is then calculated by

$$f_{MD}(\mathbf{x}', \omega_j; \omega_i) = \frac{\|\mathbf{x}' - \boldsymbol{\mu}_j\|^2}{\sigma^2} = \frac{\left\{ (\mathbf{x} - \boldsymbol{\mu}_j)^T (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) - \sum_{m=1}^{k} [(\mathbf{x} - \boldsymbol{\mu}_j)^T \mathbf{u}_{jm}] [(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \mathbf{u}_{jm}] \right\}^2}{\sigma^2 \left\{ \|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2 - \sum_{m=1}^{k} [(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \mathbf{u}_{jm}]^2 \right\}}. \quad (6)$$

In the above formula, the minor eigenvectors $\mathbf{u}_{jm}$, $m = k+1, \ldots, d$, are eliminated according to

$$\boldsymbol{\mu}_i - \boldsymbol{\mu}_j = \sum_{m=1}^{d} [(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \mathbf{u}_{jm}] \mathbf{u}_{jm}.$$

Thus, the CMF has no extra parameters compared to the baseline MQDF classifier.

In the same way, we can calculate the CMF $f_{CMF}(\mathbf{x}, \omega_i)$ for class $\omega_i$. Finally, the two classes are discriminated by comparing $f_{CMF}(\mathbf{x}, \omega_j)$ and $f_{CMF}(\mathbf{x}, \omega_i)$, and the input pattern is classified to the class of smaller CMF. If we want to reorder more than two candidate classes given by the baseline classifier, we need to compare each pair of them by the CMF, and the class with the maximum number of winning votes is accepted as the final result.
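To make the computation concrete, here is a minimal sketch of the CMF for one class of a pair, following Eqs. (3)--(6); it reconstructs the complementary distance from the stored MQDF parameters at recognition time, as the method prescribes. This is our own illustration under the paper's notation, not the authors' code, and it reuses the hypothetical `mqdf_distance` from the sketch above:

```python
import numpy as np

def cmf_distance(x, mu_j, mu_i, U_j, lam_j, delta_j, sigma2, alpha):
    """Compound Mahalanobis function f_CMF(x, omega_j) of Eq. (3).

    mu_j, U_j, lam_j, delta_j : MQDF parameters of class j
    mu_i                      : mean of the competing class i
    sigma2                    : scale constant (sigma^2 in Eq. (6))
    alpha                     : weighting coefficient in [0, 1]
    """
    diff_mu = mu_i - mu_j
    diff_x = x - mu_j
    # Eq. (6): complementary MD along the minor-subspace direction, with
    # the minor eigenvectors eliminated via the k principal ones.
    num = (diff_x @ diff_mu - (diff_x @ U_j) @ (U_j.T @ diff_mu)) ** 2
    den = sigma2 * (diff_mu @ diff_mu - np.sum((U_j.T @ diff_mu) ** 2))
    f_md = num / den
    # Eq. (3): combine with the baseline MQDF output.
    return (1 - alpha) * mqdf_distance(x, mu_j, U_j, lam_j, delta_j) + alpha * f_md
```

The two classes are then compared by `cmf_distance(x, mu_j, mu_i, ...)` versus `cmf_distance(x, mu_i, mu_j, ...)`, and the smaller value wins.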

3. Overview of recognition system

The diagram of our HCCR system is shown in Fig. 2. The input character image is first normalized to a standard size, and after feature extraction and dimensionality reduction, the reduced feature vector is fed to the classification module. Dimensionality reduction is usually performed by LDA, which performs fairly well for classification of a large number of classes. In the classification stage, the coarse classifier gives some candidate classes according to the ED from the input vector to the class means, and the fine classifier (MQDF) reorders the candidate classes. If the top two candidates ordered by the MQDF are similar characters (the pairs of similar characters are determined in the training stage and stored in a database), the LDA-based compound distance (MQDF+LDA) is used to choose a class from the two candidates (the details of the method will be given in Section 4.4); otherwise, the top candidate given by the MQDF is taken as the recognition result.

The CMF method [17] was applied to discriminate all pairs of classes. When the baseline classifier gives more than two candidate classes, each pair of candidates is classified by the CMF, and the final result is determined by the voting rule. In our method, since the discriminant vector for a pair of classes must be estimated in training and stored in a database, we only discriminate selected pairs of confusing characters. Even though the LDA-based compound distance is applied only when the top two candidates given by the baseline classifier belong to a pre-selected pair, the LDA-based discriminant vector offers higher recognition accuracy than the CMF.
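The following sketch shows one plausible top-level flow of the system in Fig. 2 (our own hypothetical code and naming; `mqdf_distance` and `compound_decision` refer to the sketches in Sections 2.1 and 4.4):

```python
import numpy as np

def recognize(x, means, mqdf_params, pair_vectors, n_candidates=10):
    """Coarse-to-fine classification with pair discrimination (hypothetical).

    means        : (K, d) class means for the coarse Euclidean classifier
    mqdf_params  : per-class MQDF parameters (mean, eigvecs, eigvals, delta)
    pair_vectors : dict mapping a pre-selected similar pair (i, j), i < j,
                   to its stored pair-discrimination parameters
    """
    # Coarse classification: candidates by Euclidean distance to class means.
    dists = np.sum((means - x) ** 2, axis=1)
    candidates = np.argsort(dists)[:n_candidates]
    # Fine classification: reorder the candidates by MQDF.
    ranked = sorted(candidates, key=lambda c: mqdf_distance(x, *mqdf_params[c]))
    top1, top2 = ranked[0], ranked[1]
    pair = (min(top1, top2), max(top1, top2))
    if pair in pair_vectors:
        # Pre-selected similar pair: decide by the LDA-based compound
        # distance of Section 4.4.
        return compound_decision(x, top1, top2, pair_vectors[pair], mqdf_params)
    return top1
```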

4. Pair discrimination using LDA

To improve the accuracy of pair discrimination, we propose to estimate the discriminant vector using LDA [19], and we discuss the relationship between LDA and the CMF method.

4.1. LDA-based discriminant vector

Fig. 2. Diagram of HCCR system.

The CMF method projects the input feature vector onto an axis in the minor subspace of one class, which is based on the covariance of one class only. To improve the separability of two classes, we instead estimate the discriminant axis using LDA, which considers the covariance matrices of both classes. In LDA, the projection axis (discriminant vector) $\mathbf{w}$ for discriminating two classes is estimated to maximize the Fisher criterion:

$$J(\mathbf{w}) = \mathrm{tr}((\mathbf{w}^T S_w \mathbf{w})^{-1} (\mathbf{w}^T S_B \mathbf{w})), \quad (7)$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, and $S_B$ and $S_w$ denote the between-class scatter matrix and within-class scatter matrix, respectively. For two classes $\omega_i$ and $\omega_j$, with means $\boldsymbol{\mu}_i$ and $\boldsymbol{\mu}_j$, covariance matrices $\Sigma_i$ and $\Sigma_j$, and a priori probabilities $p_i$ and $p_j$ (often assumed $p_i = p_j$), the within-class and between-class scatter matrices can be written as

$$S_w = p_i \Sigma_i + p_j \Sigma_j = \Sigma_{ij}, \quad (8)$$

$$S_B = (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T. \quad (9)$$

By LDA, the optimal discriminant vector is obtained as

$$\mathbf{w} = S_w^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) = (p_i \Sigma_i + p_j \Sigma_j)^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) = \Sigma_{ij}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j). \quad (10)$$

$\Sigma_{ij}$ is the average covariance matrix of the two classes and can be rewritten as

$$\Sigma_{ij} = W \Lambda W^T = \sum_{m=1}^{d} \lambda_m \mathbf{w}_m \mathbf{w}_m^T,$$

where $W = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_d]$ and $\Lambda = \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_d]$ (the eigenvalues are ordered in descending order). So,

$$\mathbf{w} = \Sigma_{ij}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) = W \Lambda^{-1} W^T (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) = \sum_{n=1}^{d} \frac{1}{\lambda_n} \mathbf{w}_n \mathbf{w}_n^T (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j). \quad (11)$$

In calculating $\mathbf{w}$, we encounter the same problem of estimation error of $\Sigma_{ij}$ on limited samples: when estimated by maximum likelihood (ML), the minor eigenvalues tend to be under-estimated, as in the MQDF. For this reason, we set a threshold

$$b = a \cdot \left( \sum_{i=1}^{d} \lambda_i / d \right) = a \cdot (\mathrm{tr}(\Sigma_{ij})/d), \quad (12)$$

where the constant multiplier $a$ is selected empirically (we will give details on how to choose $a$ in Section 5), and $d$ is the dimensionality of $\Sigma_{ij}$. Whenever an eigenvalue of $\Sigma_{ij}$ is smaller than $b$, it is set equal to $b$.

The covariance matrices $\Sigma_i$ and $\Sigma_j$ cannot be reconstructed from the parameters of the MQDF (principal eigenvectors and eigenvalues). Hence, the discriminant vector $\mathbf{w}$ must be estimated from the training samples of the two classes and stored in the classification dictionary. The vector $\mathbf{w}$ calculated by Eq. (11) is normalized to unit norm: $\tilde{\mathbf{w}} = \mathbf{w}/\|\mathbf{w}\|$. In classification, the feature vector of the input pattern is projected onto this axis as in Eq. (4) for discriminating the two classes.
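The training-time estimation of Eqs. (11) and (12) is straightforward; a minimal numpy sketch, assuming equal priors and using our own naming, is:

```python
import numpy as np

def pair_discriminant_vector(mu_i, mu_j, cov_i, cov_j, a=0.3):
    """LDA discriminant vector of Eqs. (11)-(12), estimated offline.

    The minor eigenvalues of the average covariance are floored at
    b = a * tr(Sigma_ij) / d to counter ML under-estimation.
    """
    sigma_ij = 0.5 * (cov_i + cov_j)            # Eq. (8) with p_i = p_j
    eigvals, eigvecs = np.linalg.eigh(sigma_ij)
    b = a * np.trace(sigma_ij) / sigma_ij.shape[0]
    eigvals = np.maximum(eigvals, b)            # eigenvalue flooring, Eq. (12)
    # Eq. (11): w = W diag(1/lambda) W^T (mu_i - mu_j)
    w = eigvecs @ ((eigvecs.T @ (mu_i - mu_j)) / eigvals)
    return w / np.linalg.norm(w)                # normalize to unit norm
```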

4.2. CMF and LDA

In the following, we show that the discriminant vector calculated by LDA in Eq. (11) covers that of the CMF as a special case. When $\Sigma_i \approx \Sigma_j$, we can set $\Sigma_{ij} = \Sigma_j$ in Eq. (10), and the vector $\mathbf{w}$ in Eq. (11) becomes

$$\mathbf{w} = \sum_{m=1}^{d} \frac{1}{\lambda_{jm}} \mathbf{u}_{jm} \mathbf{u}_{jm}^T (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j), \quad (\lambda_{j1} \geq \lambda_{j2} \geq \cdots \geq \lambda_{jd}). \quad (13)$$

When $\lambda_{jm}$ is large enough, we can set $1/\lambda_{jm} \approx 0$. We assume that this holds true for the principal eigenvalues ($m = 1, \ldots, k$), which are significantly bigger than the minor ones ($m = k+1, \ldots, d$). Then, the leading $k$ terms of Eq. (13) can be neglected. Further, if we reasonably set the minor eigenvalues to a small constant (as done in the MQDF), $\lambda_{jm} = \delta$, $m = k+1, \ldots, d$, the vector $\mathbf{w}$ becomes

$$\mathbf{w} = \sum_{m=k+1}^{d} \frac{1}{\lambda_{jm}} \mathbf{u}_{jm} \mathbf{u}_{jm}^T (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) = \frac{1}{\delta} \sum_{m=k+1}^{d} [(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \mathbf{u}_{jm}] \mathbf{u}_{jm}. \quad (14)$$

The normalized vector $\tilde{\mathbf{w}} = \mathbf{w}/\|\mathbf{w}\|$ is then the same as $\mathbf{w}$ in Eq. (5). This shows that the CMF method is a special case of the LDA-based method under restrictive assumptions (equal covariance matrices for the two classes, and minor eigenvalues much smaller than the principal ones).

The CMF and the LDA-based method are devised from different viewpoints, but both project two classes onto a discriminant vector. Though the CMF was not discussed from the viewpoint of LDA in the literature [16], it works in a similar way to LDA and results in the same discriminant vector as LDA under the above assumptions. These assumptions are reasonable because: (1) LDA also assumes a small difference between the covariance matrices of the two classes, but it takes the average of the two matrices rather than one of them; and (2) with the eigenvalues of a covariance or scatter matrix ordered in descending order, the minor eigenvalues are generally much smaller than the leading ones.

From Eq. (11), the discriminant vector is calculated from the covariance matrices of the two classes. Since the diagonalization of the average covariance matrix is computationally expensive, it must be computed offline, and the discriminant vectors of selected class pairs should be stored in the classification dictionary. On the other hand, the discriminant vectors of the CMF method can be calculated online by Eq. (5) and apply to all pairs of classes without storing extra parameters. However, the CMF method assumes that the covariance matrices of each pair of classes are identical, which degrades classification performance.
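The special-case relationship can be checked numerically. The toy sketch below (entirely made-up dimensions and data) builds a covariance whose minor eigenvalues all equal a small constant and verifies that the LDA vector of Eq. (11) nearly coincides with the CMF vector of Eq. (5):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, delta = 8, 3, 1e-3
# Covariance with k large principal eigenvalues and equal small minor ones.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam = np.concatenate([np.array([50.0, 20.0, 10.0]), np.full(d - k, delta)])
cov = Q @ np.diag(lam) @ Q.T
mu_i, mu_j = rng.standard_normal(d), rng.standard_normal(d)

# CMF vector, Eq. (5): normalized projection of mu_i - mu_j on the minor subspace.
U_minor = Q[:, k:]
w_cmf = U_minor @ (U_minor.T @ (mu_i - mu_j))
w_cmf /= np.linalg.norm(w_cmf)

# LDA vector, Eq. (11), with identical class covariances (Sigma_ij = cov).
w_lda = np.linalg.solve(cov, mu_i - mu_j)
w_lda /= np.linalg.norm(w_lda)

print(abs(w_cmf @ w_lda))  # close to 1: the two directions nearly coincide
```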

4.3. LDA in higher-dimensional feature spaces

For discriminating a large number of Chinese characters, which have complex shapes, the character pattern is usually represented by a high-dimensional feature vector (512D in our case [6]). To improve the efficiency of classification, the MQDF is often performed after dimensionality reduction by global LDA over all the classes (e.g., Refs. [3,6,18]). The global subspace learned by LDA gives high recognition accuracy overall, but similar characters (proximate in the feature space) cannot be well separated in the subspace. For pair discrimination using the LDA-based compound distance, the discriminant vector can be calculated either in the same feature space as the MQDF or in a different feature space (possibly the original feature space). We will show in experiments that LDA-based compound distances in higher-dimensional feature spaces outperform those in the same space as the MQDF.

We follow Ref. [21] to show that global LDA (a transform for all classes) deteriorates the separability between similar characters, and hence to support LDA-based compound distances in higher-dimensional feature spaces. The global dimensionality reduction by LDA is performed as below.

Considering all the $K$ classes, the between-class and within-class scatter matrices in the Fisher criterion of Eq. (7) can be formulated as

$$S_B = \sum_{i=1}^{K} p_i (\boldsymbol{\mu}_i - \boldsymbol{\mu}_0)(\boldsymbol{\mu}_i - \boldsymbol{\mu}_0)^T, \quad (15)$$

$$S_w = \sum_{i=1}^{K} p_i \Sigma_i, \quad (16)$$

where $\boldsymbol{\mu}_0 = \sum_{i=1}^{K} p_i \boldsymbol{\mu}_i$ denotes the overall mean. It is shown in Ref. [21] that the between-class scatter matrix can be derived as the summation of pairwise distances:

$$S_B = \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} p_i p_j S_{ij}, \quad (17)$$

where

$$S_{ij} = (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \quad (18)$$

is the outer product of the difference vector of the means of two classes $\omega_i$ and $\omega_j$.

Replacing $S_B$ in Eq. (7) by Eq. (17), the Fisher criterion can be written as

$$J(\mathbf{w}) = \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} p_i p_j \, \mathrm{tr}((\mathbf{w}^T S_w \mathbf{w})^{-1} (\mathbf{w}^T S_{ij} \mathbf{w})). \quad (19)$$

If we assume the pooled within-class scatter matrix to equal the $d \times d$ identity matrix, $S_w = I_d$, and the discriminant vector $\mathbf{w}$ is normalized, $\mathbf{w}^T \mathbf{w} = 1$, the Fisher criterion becomes

$$J(\mathbf{w}) = \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} p_i p_j E_{ij}^2, \quad (20)$$

where $E_{ij}^2 = \mathrm{tr}(\mathbf{w}^T S_{ij} \mathbf{w}) = (\mathbf{w}^T \boldsymbol{\mu}_i - \mathbf{w}^T \boldsymbol{\mu}_j)^T (\mathbf{w}^T \boldsymbol{\mu}_i - \mathbf{w}^T \boldsymbol{\mu}_j) = \|\mathbf{w}^T (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)\|^2$ is the squared ED between the two class means in the projected subspace. Hence, the solution of the Fisher criterion maximizes the average squared ED between pairs of classes in the reduced subspace, and the larger the distance between two classes $\omega_i$ and $\omega_j$ in the original space, the more prominently the direction of $\boldsymbol{\mu}_i - \boldsymbol{\mu}_j$ contributes to the discriminant vector $\mathbf{w}$. If $S_w$ does not equal the identity matrix, it can first be whitened, and then the distance between pairs of classes in the whitened space similarly influences the direction of the discriminant vector.

Fig. 3 illustrates how the squared ED between class means affects the direction of the projection vector learned by LDA. For simplicity, only a two-dimensional model is depicted, and we assume that all the classes have Gaussian densities and equal a priori probabilities. In Fig. 3(a), the vector is learned by LDA for all three classes, while in Fig. 3(b), it is learned only for the left two classes. It is evident that projection on the vector learned by global LDA will heavily overlap the left two classes in Fig. 3(a), but this does not happen in Fig. 3(b).

For classification of a large number of classes, similar classes tend to have small distances in the feature space, and the difference vector between them contributes less to the directions of the subspace axes learned by LDA than those of well-separated classes. Thus, the directions of the subspace axes may differ considerably from the difference vector between similar classes, and so the discrimination of similar classes in the projected subspace becomes more difficult.
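The sketch below mimics Fig. 3 with made-up numbers: three whitened Gaussian classes, two close together and one far away. The leading global LDA direction is dominated by the distant class and nearly collapses the close pair, while pairwise LDA separates it:

```python
import numpy as np

# Three 2-D class means: two close together, one far away (cf. Fig. 3).
means = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0]])
Sw = np.eye(2)  # whitened within-class scatter, as assumed in Eq. (20)

def lda_direction(ms):
    """Leading discriminant direction for the given class means."""
    m0 = ms.mean(axis=0)
    Sb = sum(np.outer(m - m0, m - m0) for m in ms)
    eigvals, eigvecs = np.linalg.eigh(np.linalg.solve(Sw, Sb))
    return eigvecs[:, -1]  # eigenvector of the largest eigenvalue

w_global = lda_direction(means)    # dominated by the distant third class
w_pair = lda_direction(means[:2])  # separates the two close classes

# Projected separation of the close pair on each axis:
sep = lambda w: abs(w @ (means[0] - means[1]))
print(sep(w_global), sep(w_pair))  # the global axis nearly collapses the pair
```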

To overcome the deteriorated separability between similar classes in the feature subspace learned by global LDA, we propose to estimate the two-class discriminant vector of the LDA-based compound distance from the original feature space. The LDA-based compound distance method, which performs a local LDA for two classes in the original feature space, yielded the best accuracies in our experiments.

Fig. 3. (a) LDA for three classes; and (b) LDA for two classes.

Besides the original feature space, the discriminant vector for two similar classes can also be calculated in subspaces other than that of the MQDF. Two similar classes may overlap heavily in the feature space of the MQDF but overlap less in other feature spaces. To do this, two global LDA transform matrices are needed: one for the subspace of the MQDF, and the other for a different global subspace, on which local LDA is applied to get pair discriminators. This procedure makes use of the complementarity of different feature spaces and also proves effective in discriminating similar characters in our experiments.

4.4. Similar character recognition

We use the MQDF as the baseline classifier, and for each input pattern $\mathbf{x}$, the MQDF classifier gives the two top-ranked candidate classes $\omega_i$ and $\omega_j$. The compound distance function is used to further identify to which class the input pattern belongs. The compound distances for the two classes are computed by

$$\begin{cases} f(\mathbf{x}, \omega_i) = (1 - \alpha) \, f_{MQDF}(\mathbf{x}, \omega_i) + \alpha \, f_{LDA}(\mathbf{x}, \omega_i), \\ f(\mathbf{x}, \omega_j) = (1 - \alpha) \, f_{MQDF}(\mathbf{x}, \omega_j) + \alpha \, f_{LDA}(\mathbf{x}, \omega_j), \end{cases} \quad (21)$$

and the smaller one decides the class of the input pattern. $f_{MQDF}(\mathbf{x}, \omega_i)$ and $f_{MQDF}(\mathbf{x}, \omega_j)$ are the MQDF outputs of classes $\omega_i$ and $\omega_j$, as given by Eq. (2). $f_{LDA}(\mathbf{x}, \omega_i)$ and $f_{LDA}(\mathbf{x}, \omega_j)$ are the distances in the one-dimensional subspace (specified by the discriminant vector $\tilde{\mathbf{w}}$) learned by two-class LDA.

Projecting $\mathbf{x}$ and $\boldsymbol{\mu}_i$ (note that they may be in a different feature space from that of the baseline MQDF classifier) onto the discriminant vector $\tilde{\mathbf{w}}$, we obtain $\tilde{x} = \mathbf{x}^T \tilde{\mathbf{w}}$ and $\tilde{\mu}_i = \boldsymbol{\mu}_i^T \tilde{\mathbf{w}}$. Then

$$f_{LDA}(\mathbf{x}, \omega_i) = f_{ED}(\mathbf{x}, \omega_i) = (\tilde{x} - \tilde{\mu}_i)^2 / C, \quad (22)$$

where $C$ is a constant regulating the scale of the ED $f_{ED}(\mathbf{x}, \omega_i)$ to be comparable to $f_{MQDF}(\mathbf{x}, \omega_i)$. For better discrimination, the ED may be replaced by the MD:

$$f_{LDA}(\mathbf{x}, \omega_i) = f_{MD}(\mathbf{x}, \omega_i) = (\tilde{x} - \tilde{\mu}_i)^2 / \tilde{\sigma}_i^2, \quad (23)$$

where $\tilde{\sigma}_i^2$ is the variance of class $\omega_i$ in the one-dimensional subspace.

The distance of Eq. (22) or (23) is combined with the MQDF as in Eq. (21) for discriminating the two candidate classes given by the baseline classifier (MQDF) if the two classes belong to a similar pair selected in training (the discriminant vectors of selected pairs are stored in the classification dictionary); otherwise, the top candidate given by the baseline classifier is taken as the classification result.
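A minimal sketch of this decision rule, using the MD variant of Eq. (23) with precomputed pair parameters (our own hypothetical layout; `mqdf_distance` is the sketch from Section 2.1), might look as follows:

```python
def compound_decision(x, c_i, c_j, pair_params, mqdf_params, alpha=0.5):
    """Choose between two candidate classes by Eqs. (21) and (23).

    pair_params : (w, m_i, m_j, v_i, v_j) stored in training: the unit
                  discriminant vector, the projected class means, and the
                  per-class variances on the one-dimensional subspace.
    For simplicity this sketch assumes the pair vector lives in the same
    feature space as the MQDF input; when it is estimated in a different
    space (e.g. the original 512D features), project that space's feature
    vector onto w instead.
    """
    w, m_i, m_j, v_i, v_j = pair_params
    t = x @ w  # project the input onto the discriminant vector
    f_i = (1 - alpha) * mqdf_distance(x, *mqdf_params[c_i]) + alpha * (t - m_i) ** 2 / v_i
    f_j = (1 - alpha) * mqdf_distance(x, *mqdf_params[c_j]) + alpha * (t - m_j) ** 2 / v_j
    return c_i if f_i < f_j else c_j
```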

5. Experimental results

We evaluated the recognition performance of the LDA-based compound distances and the previous CMF on two databases of handwritten Japanese/Chinese characters: ETL9B and CASIA. The ETL9B database, collected by the Electro-Technical Laboratory (ETL) of Japan, contains handwritten samples of 3036 characters, including 2965 Kanji characters and 71 hiragana, with 200 samples per class. We chose the first 20 and last 20 samples of each class for testing, and the remaining 160 samples of each class for training. The CASIA database, collected by the Institute of Automation of the Chinese Academy of Sciences, contains 3755 Chinese characters of the level-1 set of the standard GB2312-80, with 300 samples per class. We chose 250 samples per class for training and the remaining 50 samples per class for testing.

For character image pre-processing and feature extraction, we adopt the same methods as in Ref. [6]. Each binary character image is normalized to a gray-scale image of 64 × 64 pixels by the bi-moment normalization method [22], and 8-direction gradient direction features [23] are extracted. The resulting 512-dimensional feature vector is projected onto a 160-dimensional subspace learned by global LDA. The 160-dimensional projected vector is then fed to the MQDF classifier. For similar character discrimination, we also calculate LDA-based compound distances from higher-dimensional and lower-dimensional feature spaces learned by global LDA, as well as from the original feature space.

In implementing the MQDF, the constant minor eigenvalue $\delta_i$ in Eq. (2) is made class-independent and equal to the average of all eigenvalues (this value empirically performs well, and slightly different values make little difference). The $\sigma^2$ in Eq. (6) and the constant $C$ in Eq. (22) are also set equal to $\delta_i$ (which performs best among the values $0.1k \cdot \delta_i$, $k = 1, \ldots, 10$, though all these values show only slight differences in performance). This regulates the scales of $f_{MD}(\mathbf{x}', \omega_j; \omega_i)$ in Eq. (6) and $f_{ED}(\mathbf{x}, \omega_i)$ in Eq. (22) such that the re-scaled values are comparable to the outputs of the MQDF. The parameter $a$ in Eq. (12) was empirically set to 0.3, which performs best among the values 0.1, 0.2, ..., 1 in our experiments (other choices give only slightly inferior performance).

Table 1
Numbers of similar pairs for MQDF with variable number k

k       10      20      30      40      50
ETL9B   29,244  30,279  31,173  27,616  30,897
CASIA   70,098  70,904  71,867  66,378  71,784

Similar character pairs were selected on the training dataset by 5-fold cross validation, i.e., rotationally using 4/5 of the training data for training the MQDF classifier and the remaining 1/5 for validation. The MQDF classifier gives candidate classes with minimum quadratic distances on each validation sample. We consider only the leading five candidate classes (in ascending order of distances) given by the MQDF for selecting similar character pairs. We use the criterion below to measure the similarity between a candidate class and the true class of a validation sample:

$$\left| \frac{f_{MQDF}(\mathbf{x}, c_t) - f_{MQDF}(\mathbf{x}, c_i)}{f_{MQDF}(\mathbf{x}, c_t)} \right| \leq 0.1, \quad (24)$$

where $f_{MQDF}(\mathbf{x}, c_t)$ is the true-class output of the MQDF, and $f_{MQDF}(\mathbf{x}, c_i)$ is the $i$-th candidate class output of the MQDF. If a candidate class precedes the true class in the ordered candidate list, it is paired with the true class; otherwise, if the condition of Eq. (24) is met, the candidate class is also paired with the true class.
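The pair-selection rule can be summarized in a few lines of code; the data layout below (a list of per-sample tuples) is our own assumption, not the paper's:

```python
def select_similar_pairs(samples, tau=0.1):
    """Collect similar pairs from cross-validation outputs.

    samples : iterable of (true_class, f_true, candidates), where f_true is
              the MQDF output of the true class and candidates is the top-5
              list of (class, f_mqdf) in ascending order of distance.
    """
    pairs = set()
    for true_class, f_true, candidates in samples:
        for c, f_c in candidates:
            if c == true_class:
                continue
            # Pair if the candidate precedes the true class, or if the
            # relative difference of MQDF outputs is small, Eq. (24).
            if f_c < f_true or abs((f_true - f_c) / f_true) <= tau:
                pairs.add((min(c, true_class), max(c, true_class)))
    return pairs
```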

When varying the number of principal eigenvectors per class for the MQDF, k = 10, 20, 30, 40, 50, the numbers of similar pairs obtained are listed in Table 1. To see how the discriminant vectors (one for each similar pair) enlarge the storage space of the classification dictionary, we take the case of the MQDF with k = 40 as an example. In this case, the average number of discriminant vectors per class is 27,616/3036 ≈ 9.1 for the ETL9B database and 66,378/3755 ≈ 17.7 for the CASIA database. For the MQDF, the number of vectors stored for each class is k + 1 (k eigenvectors plus one mean vector). So the increase in storage is 9.1/41 ≈ 22.2% for the ETL9B database and 17.7/41 ≈ 43.2% for the CASIA database.

5.1. Comparison of CMF and LDA-based compound distance

Using the MQDF as the baseline classifier, we compare our LDA-based compound distance method with the CMF method of Refs. [17,18] (denoted by MQDF + CMF). Our LDA-based distance function has two variations depending on the distance metric in the one-dimensional discriminant subspace: ED and MD. The two variations are denoted by MQDF + ED and MQDF + MD, respectively. For all these methods, the pair discriminant vector is calculated in the same feature space as the MQDF, i.e., the 160D subspace learned by global LDA. The number of principal eigenvectors per class for the MQDF was fixed at k = 40.

By varying the weighting coefficient $\alpha$ in [0,1] ($\alpha = 0$ corresponds to $f_{MQDF}(\mathbf{x})$ only, and $\alpha = 1$ corresponds to $f_{LDA}(\mathbf{x})$ only), we obtained the test accuracies on the ETL9B and CASIA databases using different compound functions, as shown in Figs. 4 and 5, respectively. In Fig. 4, we can see that the best accuracy on the ETL9B database reaches 99.33% by MQDF + MD, 99.31% by MQDF + ED, and 99.29% by MQDF + CMF, whereas the accuracy of the MQDF classifier is 99.16%. In Fig. 5, the best accuracies of MQDF + MD, MQDF + ED, and MQDF + CMF on the CASIA database are 98.40%, 98.34%, and 98.29%, respectively, whereas the accuracy of the MQDF classifier is 98.06%. Comparing the performance of the three compound distance methods, it is evident that MQDF + MD gives the best accuracy, and MQDF + ED also outperforms MQDF + CMF. We can also see that using only the ED or MD is better than using only the CMF in discriminating similar characters ($\alpha = 1$), but combining them with the MQDF is evidently beneficial.

Fig. 4. Test accuracies on ETL9B database using MQDF with k = 40 as baseline classifier.

Fig. 5. Test accuracies on CASIA database using MQDF with k = 40 as baseline classifier.

Fig. 6. Test accuracies on ETL9B database using MQDF with varying number k.

In Fig. 4, the performance of MQDF + MD turns out to be unstable for some weighting coefficients. This happens to MQDF + MD but not to MQDF + ED, and is not evident on the CASIA database in Fig. 5. The reason might be that there are fewer training samples in the ETL9B database than in the CASIA database, so the estimation of the variance in the one-dimensional subspace is unreliable on ETL9B.

We then fixed the weighting coefficient ($\alpha = 0.5$ for MQDF + MD, $\alpha = 0.6$ for MQDF + ED and MQDF + CMF) and evaluated the methods with varying number k for the MQDF. The test accuracies on the two databases are shown in Figs. 6 and 7, respectively. Again, it is evident that the LDA-based methods MQDF + MD and MQDF + ED outperform the previous CMF method (MQDF + CMF), and MQDF + MD performs best among them.

Fig. 7. Test accuracies on CASIA database using MQDF with varying number k.

To exemplify the effects of compound distance functions in recognition, Fig. 8 shows some samples that are misrecognized by the MQDF. Fig. 8(a) shows examples that are corrected by MQDF + CMF and MQDF + MD ($\alpha = 0.5$), and Fig. 8(b) shows examples on which both MQDF and MQDF + CMF fail, but MQDF + MD succeeds. Fig. 8(c) shows the detailed outputs of some samples of Fig. 8(b), in which we can see that the MQDF outputs for the two classes are nearly equal; the pair discriminant function $f_{CMF}$ fails to identify the correct class, but $f_{MD}$ succeeds, so the LDA-based compound distance ($f_{MQDF} + f_{MD}$) identifies the correct class.

Table 2 gives the average CPU times for classifying a test sample using MQDF (k = 40), MQDF + CMF, and MQDF + MD on a personal computer with an AMD Athlon 64 X2 2 GHz processor. Compared to the baseline MQDF, MQDF + CMF and MQDF + MD increase the computational overhead only slightly. MQDF + MD is less computationally intensive than MQDF + CMF because in MQDF + MD the discriminant vectors are pre-computed in training, while the discriminant vectors of MQDF + CMF are computed in real time during classification.

5.2. LDA-based compound distances in varying feature spaces

In this section, we show the effects of LDA-based compound distances with the discriminant vector estimated in varying feature spaces. Since MQDF + MD performs better than MQDF + ED, we show the results of MQDF + MD only. Let MQDF(k) + MD(d) denote the compound distance with k being the number of principal eigenvectors per class for the MQDF and d being the dimensionality of the feature space for pair discrimination. The feature vectors for pair discrimination were obtained by projecting the original pattern vectors onto subspaces learned by global LDA. The value of d was set to 80, 160, 240, 320, and 512. When d = 512, pair discrimination is performed in the original feature space; when d = 160, the feature space of pair discrimination is the same as that of the MQDF.

We fixed k = 40 and obtained the test accuracies with varying d on the two databases, as shown in Figs. 9 and 10, respectively. We can see that the accuracy is higher when pair discrimination is performed in higher-dimensional feature spaces. The best performance is obtained by pair discrimination in the original feature space (d = 512). In this case, the best test accuracies are 99.38% on the ETL9B database and 98.57% on the CASIA database.

Tables 3 and 4 list the test accuracies of MQDF(k) + MD(d) with varying k and d on the two databases. The weighting coefficient $\alpha$ was set to 0.3 for MQDF + MD when d = 512, and to 0.5 in all other cases. We can see that MQDF + MD outperforms the baseline MQDF classifier, that the accuracy increases as the feature dimensionality of pair discrimination increases, and that it is best when d = 512 (pair discrimination in the original feature space). When using the MQDF with k = 40 as the baseline classifier, the test accuracy on the ETL9B database increases from 99.16% by MQDF to 99.38% by MQDF + MD, and the accuracy on the CASIA database increases from 98.06% by MQDF to 98.57% by MQDF + MD.


Fig. 8. Some characters misrecognized by MQDF (k = 40) from the CASIA database: (a) samples misrecognized by MQDF, but corrected by MQDF + CMF and MQDF + MD; (b) samples misrecognized by both MQDF and MQDF + CMF, but corrected by MQDF + MD; and (c) detailed outputs of some samples of (b).

Table 2
CPU times of classification by MQDF (k = 40), MQDF + CMF, and MQDF + MD

        MQDF (ms)   MQDF + CMF (ms)   MQDF + MD (ms)
ETL9B   9.31        11.02             10.64
CASIA   10.35       12.75             11.85

Fig. 9. Test accuracies on ETL9B database using MQDF(40) + MD(d) with varying d.

Fig. 10. Test accuracies on CASIA database using MQDF(40)+MD(d) with varying d.

The percentage of error reduction is (0.84 − 0.62)/0.84 = 26.2% for the ETL9B database and (1.94 − 1.43)/1.94 = 26.3% for the CASIA database. When k = 10 (MQDF with fewer principal eigenvectors per class), the percentage of error reduction by MQDF + MD is even more prominent.


Table 3
Test accuracies (%) of MQDF(k) + MD(d) on ETL9B database

         MQDF    MQDF + MD
                 d = 80   d = 160   d = 240   d = 320   d = 512
k = 10   99.00   99.26    99.27     99.30     99.30     99.34
k = 20   99.11   99.30    99.31     99.31     99.32     99.36
k = 30   99.13   99.32    99.32     99.33     99.33     99.38
k = 40   99.16   99.34    99.33     99.34     99.35     99.38
k = 50   99.16   99.33    99.33     99.33     99.34     99.37

Table 4
Test accuracies (%) of MQDF(k) + MD(d) on CASIA database

         MQDF    MQDF + MD
                 d = 80   d = 160   d = 240   d = 320   d = 512
k = 10   97.66   98.14    98.25     98.29     98.32     98.41
k = 20   97.89   98.27    98.35     98.38     98.42     98.50
k = 30   98.01   98.30    98.36     98.42     98.46     98.53
k = 40   98.06   98.34    98.40     98.44     98.49     98.57
k = 50   98.04   98.33    98.39     98.44     98.47     98.56

Table 5
Cumulative accuracies (%) of MQDF with different k on ETL9B

k         10      20      30      40      50
Correct   99.00   99.11   99.13   99.16   99.16
Top 2     99.77   99.81   99.82   99.82   99.82
Top 3     99.86   99.89   99.89   99.89   99.89

Table 6
Cumulative accuracies (%) of MQDF with different k on CASIA

k         10      20      30      40      50
Correct   97.66   97.89   98.01   98.06   98.04
Top 2     99.44   99.53   99.56   99.58   99.58
Top 3     99.73   99.77   99.79   99.80   99.80

Tables 5 and 6 list the leading three cumulative accuracies of the MQDF with different numbers k of principal eigenvectors per class on the ETL9B and CASIA databases, respectively. We can see that the cumulative accuracy of the top two candidates is very high. This is why we considered only the top two candidates given by the MQDF in our experiments. The databases used in our experiments contain mostly neatly written characters. For recognition of unconstrained handwriting (such databases are not yet available), considering more candidates in pair discrimination, by voting or confidence fusion, may yield higher accuracy. This will be investigated in our future work.

6. Discussions and conclusion

We have proposed the LDA-based compound distance method for discriminating similar characters in handwritten Chinese character recognition. Compared to the previous compound Mahalanobis function (CMF) method, our method takes into account the difference between the covariance matrices of two classes and thus yields better discrimination performance. We have also shown that under restrictive assumptions, the CMF method can be viewed as a special case of our method. Though the LDA-based method needs to store discriminant vectors for selected similar pairs, it gives significantly higher recognition accuracies than the CMF method. Our experiments on two large databases (ETL9B and CASIA) demonstrate that the proposed LDA-based method outperforms the previous CMF method, and that pair discrimination in higher-dimensional feature spaces (other than the feature space of the baseline MQDF classifier) can further improve the recognition accuracy. In the best case, the proposed method reduces the error rate of the baseline MQDF by over 26%.

In our experiments, we applied pair discrimination only when the top two candidates given by the baseline MQDF classifier are among the pre-selected pairs. Since the precision of the top two candidates of the MQDF is sufficiently high, as listed in Tables 5 and 6, pair discrimination among more than two candidate classes (as is possible for the CMF method) does not increase the overall accuracy considerably. However, when the precision of the baseline classifier is not high enough, say, when the baseline classifier is less accurate or the sample data is more confusing, considering more than two candidates will help. To apply the LDA-based method in this case, we need to find a way to optimally fuse the outputs of the MQDF and pair discrimination when the top two candidates are not among the pre-selected pairs. We are also considering alternative two-class classifiers (such as support vector machines) to replace LDA-based pair discrimination, but since pairs of Chinese characters are mostly linearly separable and the sample set for training a pair discriminator is limited, this does not necessarily give significantly higher accuracy. These works are underway in our group.

Acknowledgments

This work was supported in part by the Hundred Talents Program of the Chinese Academy of Sciences and the National Natural Science Foundation of China (NSFC) under Grant nos. 60543004 and 60723005. The authors thank the anonymous reviewers for their valuable comments.

References

[1] R. Dai, C. Liu, B. Xiao, Chinese character recognition: history, status and prospects, Frontiers Comput. Sci. China 1 (2) (2007) 126--136.

[2] F. Kimura, K. Takashina, S. Tsuruoka, Y. Miyake, Modified quadratic discriminant functions and its application to Chinese character recognition, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1) (1987) 149--153.

[3] F. Kimura, T. Wakabayashi, S. Tsuruoka, Y. Miyake, Improvement of handwritten Japanese character recognition using weighted direction code histogram, Pattern Recognition 30 (8) (1997) 1329--1337.

[4] N. Kato, M. Suzuki, S. Omachi, H. Aso, Y. Nemoto, A handwritten character recognition system using directional element feature and asymmetric Mahalanobis distance, IEEE Trans. Pattern Anal. Mach. Intell. 21 (3) (1999) 258--262.

[5] T. Kawatani, Handwritten Kanji recognition with determinant normalized quadratic discriminant function, in: Proceedings of the 15th ICPR, vol. 2, Barcelona, Spain, 2000, pp. 343--346.

[6] C.-L. Liu, High accuracy handwritten Chinese character recognition using quadratic classifiers with discriminative feature extraction, in: Proceedings of the 18th ICPR, vol. 2, Hong Kong, 2006, pp. 942--946.

[7] C.-L. Liu, H. Sako, H. Fujisawa, Discriminative learning quadratic discriminant function for handwriting recognition, IEEE Trans. Neural Networks 15 (2) (2004) 430--444.

[8] A. Biem, S. Katagiri, B.-H. Juang, Pattern recognition using discriminative feature extraction, IEEE Trans. Signal Process. 45 (2) (1997) 500--504.

[9] H.C. Fu, Y.Y. Xu, Multilinguistic handwritten character recognition by Bayesian decision-based neural networks, IEEE Trans. Signal Process. 46 (10) (1998) 2781--2789.

[10] J.X. Dong, A. Krzyzak, C.Y. Suen, An improved handwritten Chinese character recognition system using support vector machine, Pattern Recognition Lett. 26 (12) (2005) 1849--1856.

[11] K. Saruta, N. Kato, M. Abe, Y. Nemoto, High accuracy recognition of ETL9B using exclusive learning neural network-II (ELNET-II), IEICE Trans. Inf. Syst. 79-D (5) (1996) 516--521.

[12] T. Kawatani, H. Shimizu, Handwritten Kanji recognition with the LDA method, in: Proceedings of the 14th ICPR, vol. 2, Brisbane, Australia, 1998, pp. 1031--1035.

[13] T. Ishii, Y. Waizumi, N. Kato, Y. Nemoto, Recognition system for handwritten characters by alternative method using neural network, Trans. IEICE Jpn. J83-D-II (3) (2000) 988--995.

[14] Y. Jin, S. Ma, Pairwise classifier combination and its application on Chinese character recognition, in: Proceedings of the 5th World Congress on Intelligent Control and Automation, vol. 5, Hangzhou, China, 2004, pp. 4075--4078.

[15] Y. Jin, S. Ma, Experiment on linear classification of Chinese character, J. Chinese Inf. Process. 14 (2) (2002) 55--60.

[16] M. Suzuki, S. Ohmachi, N. Kato, H. Aso, Y. Nemoto, A discrimination method of similar characters using compound Mahalanobis function, Trans. IEICE Jpn. J80-D-II (10) (1997) 2752--2760.

[17] T. Nakajima, T. Wakabayashi, F. Kimura, Y. Miyake, Accuracy improvement by compound discriminant functions for resembling character recognition, Trans. IEICE Jpn. J83-D-II (2) (2000) 623--633.

[18] H. Liu, X. Ding, Handwritten character recognition using gradient feature and quadratic classifier with multiple discrimination schemes, in: Proceedings of the 8th ICDAR, Seoul, Korea, 2005, pp. 19--23.

[19] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, New York, 1990.

[20] T.-F. Gao, C.-L. Liu, LDA-based compound distance for handwritten Chinese character recognition, in: Proceedings of the 9th ICDAR, Curitiba, Brazil, 2007, pp. 904--909.

[21] M. Loog, R.P.W. Duin, R. Haeb-Umbach, Multiclass linear dimension reduction by weighted pairwise Fisher criterion, IEEE Trans. Pattern Anal. Mach. Intell. 23 (7) (2001) 762--766.

[22] C.-L. Liu, H. Sako, H. Fujisawa, Handwritten Chinese character recognition: alternatives to nonlinear normalization, in: Proceedings of the 7th ICDAR, Edinburgh, Scotland, 2003, pp. 524--528.

[23] C.-L. Liu, K. Nakashima, H. Sako, H. Fujisawa, Handwritten digit recognition: investigation of normalization and feature extraction techniques, Pattern Recognition 37 (2) (2004) 265--279.

About the Author---TIAN-FU GAO received the B.S. degree in Automatic Control in 2001 and the M.E. degree in Pattern Recognition and Artificial Intelligence in 2004 from Harbin Engineering University, Harbin, China. He is now a Ph.D. candidate in Pattern Recognition and Artificial Intelligence at the Institute of Automation, Chinese Academy of Sciences, Beijing, China.

About the Author---CHENG-LIN LIU received the B.S. degree in Electronic Engineering from Wuhan University, Wuhan, China, the M.E. degree in Electronic Engineering from Beijing Polytechnic University, Beijing, China, and the Ph.D. degree in Pattern Recognition and Artificial Intelligence from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at the Korea Advanced Institute of Science and Technology (KAIST) and later at Tokyo University of Agriculture and Technology from March 1996 to March 1999. From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. Since 2005, he has been a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China, and is now the Deputy Director of the laboratory. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially their applications to character recognition and document analysis. He has published over 70 technical papers in international journals and conferences. He won the IAPR/ICDAR Young Investigator Award in 2005.