
Pattern Recognition 40 (2007) 229–243
www.elsevier.com/locate/patcog

Face recognition using a kernel fractional-step discriminant analysis algorithm

Guang Dai a, Dit-Yan Yeung a,∗, Yun-Tao Qian b

a Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
b College of Computer Science, Zhejiang University, Hangzhou, 310027, PR China

Received 22 December 2005; received in revised form 18 May 2006; accepted 22 June 2006

Abstract

Feature extraction is among the most important problems in face recognition systems. In this paper, we propose an enhanced kernel discriminant analysis (KDA) algorithm called kernel fractional-step discriminant analysis (KFDA) for nonlinear feature extraction and dimensionality reduction. Not only can this new algorithm, like other kernel methods, deal with nonlinearity required for many face recognition tasks, it can also outperform traditional KDA algorithms in resisting the adverse effects due to outlier classes. Moreover, to further strengthen the overall performance of KDA algorithms for face recognition, we propose two new kernel functions: cosine fractional-power polynomial kernel and non-normal Gaussian RBF kernel. We perform extensive comparative studies based on the YaleB and FERET face databases. Experimental results show that our KFDA algorithm outperforms traditional kernel principal component analysis (KPCA) and KDA algorithms. Moreover, further improvement can be obtained when the two new kernel functions are used.
© 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Face recognition; Feature extraction; Nonlinear dimensionality reduction; Kernel discriminant analysis; Kernel fractional-step discriminant analysis

1. Introduction

1.1. Linear discriminant analysis

Linear subspace techniques have played a crucial role in face recognition research for linear dimensionality reduction and feature extraction. The two most well-known methods are principal component analysis (PCA) and linear discriminant analysis (LDA), which are for feature extraction under the unsupervised and supervised learning settings, respectively. Eigenface [1], one of the most successful face recognition methods, is based on PCA. It finds the optimal projection directions that maximally preserve the data variance. However, since it does not take into account class label information, the "optimal" projection directions found, though useful for data representation and reconstruction, may not give the most discriminating features for separating different face classes. On the other hand, LDA seeks the optimal projection directions that maximize the ratio of between-class scatter to within-class scatter. Since the face image space is typically of high dimensionality but the number of face images available for training is usually rather small, a major computational problem with the LDA algorithm is that the within-class scatter matrix is singular and hence the original LDA algorithm cannot be applied directly. This problem can be attributed to undersampling of data in the high-dimensional image space.

∗ Corresponding author. Tel.: +852 2358 6977; fax: +852 2358 1477. E-mail address: [email protected] (D.-Y. Yeung).

0031-3203/$30.00 © 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2006.06.030

Over the past decade, many variants of the original LDA algorithm have been proposed for face recognition, with most of them trying to overcome the problem due to undersampling. Some of these methods perform PCA first before applying LDA in the PCA-based subspace, as is done in Fisherface (also known as PCA+LDA) [2,3]. Belhumeur et al. [2] carried out comparative experiments based on the Harvard and Yale face databases and found that Fisherface did give better performance than Eigenface in many cases.



However, by analyzing the sensitivity on the spectral range of the within-class eigenvalues, Liu and Wechsler [4] found that the generalization ability of Fisherface can be degraded since some principal components with small eigenvalues correspond to high-frequency components and hence can play the role of latent noise in Fisherface. To overcome this limitation, Liu and Wechsler [4] developed two enhanced LDA models for face recognition by simultaneous diagonalization of the within-class and between-class scatter matrices instead of the conventional LDA procedure. More recently, some other LDA-based methods have been developed for face recognition from different perspectives. Ye et al. proposed LDA/GSVD [5] and LDA/QR [6] by employing generalized singular value decomposition (GSVD) and QR decomposition, respectively, to solve a generalized eigenvalue problem. Some other researchers have proposed the direct LDA algorithm and its variants [7–10]. These methods have been found to be both efficient and effective for many face recognition tasks.

For multi-class classification problems involving more than two classes, a major drawback of LDA is that the conventional optimality criteria defined based on the scatter matrices do not correspond directly to classification accuracy [11–13]. An immediate implication is that optimizing these criteria does not necessarily lead to an increase in the accuracy. This phenomenon can also be explained in terms of the adverse effects due to the so-called outlier classes [14], resulting in inaccurate estimation of the between-class scatter. As a consequence, the linear transformation of traditional LDA tends to overemphasize the inter-class distances between already well-separated classes in the input space at the expense of classes that are close to each other, leading to significant overlap between them. To tackle this problem, some outlier-class-resistant schemes [11,14,13] based on certain statistical model assumptions have been proposed. They are common in that a weighting function is incorporated into the Fisher criterion by giving higher weights to classes that are closer together in the input space, as they are more likely to lead to misclassification. Although these methods are generally more effective than traditional LDA, it is difficult to set the weights appropriately, particularly when the statistical model assumptions may not be valid in such applications as face recognition where the data are seriously undersampled. Recently, Lotlikar and Kothari [12] proposed an interesting idea which allows fractional steps to be made in dimensionality reduction. The method, referred to as fractional-step LDA (FLDA), allows for the relevant distances to be more correctly weighted. A more recent two-phase method, called direct FLDA (DFLDA) [15], attempts to apply FLDA to high-dimensional face patterns by combining the FLDA and direct LDA algorithms.

1.2. Kernel discriminant analysis

In spite of their simplicity, linear dimensionality reduction methods have limitations when the decision boundaries between classes are nonlinear. For example, in face recognition applications where there exists high variability in the facial features such as illumination, facial expression and pose, high nonlinearity is commonly incurred. This calls for nonlinear extensions of conventional linear methods to deal with such situations. The past decade has witnessed the emergence of a powerful approach in machine learning called kernel methods, which exploit the so-called "kernel trick" to devise nonlinear generalizations of linear methods while preserving the computational tractability of their linear counterparts. The first kernel method proposed is the support vector machine (SVM) [16], which essentially constructs a separating hyperplane in the high-dimensional (possibly infinite-dimensional) feature space F obtained through a nonlinear feature map Φ : R^n → F. The kernel trick allows inner products in the feature space to be computed entirely in the input space without performing the mapping explicitly. Thus, linear methods which represent the relationships between data in terms of inner products only can readily be "kernelized" to give their nonlinear extensions. Inspired by the success of SVM, kernel subspace analysis techniques have been proposed to extend linear subspace analysis techniques to nonlinear ones by applying the same kernel trick, leading to performance improvement in face recognition over their linear counterparts. As in other kernel methods, these kernel subspace analysis methods essentially map each input data point x ∈ R^n into some feature space F via a mapping Φ and then perform the corresponding linear subspace analysis in F. Schölkopf et al. [17] pioneered the combination of the kernel trick with PCA to develop the kernel PCA (KPCA) algorithm for nonlinear PCA. Based on KPCA, Yang et al. [18] proposed a kernel extension to the Eigenface method for face recognition. Using a cubic polynomial kernel, they showed that kernel Eigenface outperforms the original Eigenface method. Moghaddam [19] also demonstrated that KPCA using the Gaussian RBF kernel gives better performance than PCA for face recognition. More recently, Liu [20] further extended kernel Eigenface to include fractional-power polynomial models that correspond to non-positive semi-definite kernel matrices.

In the same spirit as the kernel extension of PCA, a kernel extension of LDA, called kernel discriminant analysis (KDA), has also been developed and found to be more effective than PCA, KPCA and LDA for many classification applications due to its ability to extract nonlinear features that exhibit high class separability. Mika et al. [21] first proposed a two-class KDA algorithm, which was later generalized by Baudat and Anouar [22] to give the generalized discriminant analysis (GDA) algorithm for multi-class problems. Motivated by the success of Fisherface and KDA, Yang [23] proposed a combined method called kernel Fisherface for face recognition. Liu et al. [24] conducted a comparative study on PCA, KPCA, LDA and KDA and showed that KDA outperforms the other methods for face recognition tasks involving variations in pose and illumination. Other researchers have proposed a number of KDA-based algorithms [25–29] addressing different problems in face recognition applications.

1.3. This paper

However, similar to the case of LDA-based methods, the adverse effects due to outlier classes also affect the performance of KDA-based algorithms. In fact, if the mapped data points in F are more separated from each other, such effects can become even more significant. To remedy this problem, we propose in this paper an enhanced KDA algorithm called kernel fractional-step discriminant analysis (KFDA). The proposed method, similar to DFLDA, involves two steps. In the first step, KDA with a weighting function incorporated is performed to obtain a low-dimensional subspace. In the second step, a subsequent FLDA procedure is applied to the low-dimensional subspace to accurately adjust the weights in the weighting function through making fractional steps, leading to a set of enhanced features for face recognition.

Moreover, recent research in kernel methods shows that an appropriate choice of the kernel function plays a crucial role in the performance delivered. To further improve the performance of KDA-based methods for face recognition, we propose two new kernel functions called the cosine fractional-power polynomial kernel and the non-normal Gaussian RBF kernel. Extensive comparative studies performed on the YaleB and FacE REcognition Technology (FERET) face databases give the following findings:

(1) Compared with other methods such as KPCA and KDA, KFDA is much less sensitive to the adverse effects due to outlier classes and hence is superior to them in terms of recognition accuracy.

(2) For both KDA and KFDA, the two new kernels generally outperform the conventional kernels, such as the polynomial kernel and Gaussian RBF kernel, that have been commonly used in face recognition. KFDA, when used with the new kernels, delivers the highest face recognition accuracy.

The rest of this paper is organized as follows. In Section 2, we briefly review the conventional KPCA and KDA algorithms. In Section 3, the new KFDA algorithm is presented, and then the two new kernel functions are introduced to further improve the performance. Extensive experiments have been performed, with results given and discussed in Section 4, demonstrating the effectiveness of both the KFDA algorithm and the new kernel functions. Finally, Section 5 concludes this paper.

2. Brief review of KPCA and KDA

The key ideas behind kernel subspace analysis methods are to first implicitly map the original input data points into a feature space F via a feature map Φ and then implement some linear subspace analysis method with the corresponding optimality criteria in F to discover the optimal nonlinear features with respect to the criteria. Moreover, instead of performing the computation for the linear subspace analysis directly in F, the implementation based on the kernel trick is completely dependent on a kernel matrix or Gram matrix whose entries can be computed entirely from the input data points without requiring the corresponding feature points in F.
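As a concrete illustration of this point, the following minimal Python sketch builds a Gram matrix from raw input vectors using a Gaussian RBF kernel. The function names and the toy data are illustrative and not taken from the paper; any Mercer kernel could be substituted for rbf_kernel.

```python
import numpy as np

def rbf_kernel(x, y, sigma2=1e4):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / sigma^2)."""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / sigma2)

def gram_matrix(X, kernel=rbf_kernel):
    """N x N Gram matrix with K[i, j] = k(x_i, x_j), computed from the input
    vectors alone -- the feature map is never evaluated explicitly."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(i, N):
            K[i, j] = K[j, i] = kernel(X[i], X[j])
    return K

# Toy usage: six random points in R^3.
X = np.random.RandomState(0).randn(6, 3)
K = gram_matrix(X)
```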

Let X denote a training set of N face images belonging to c classes, with each image represented as a vector in R^n. Moreover, let X_i ⊂ X be the ith class containing N_i examples, with x_i^j denoting the jth example in X_i. In this section, we briefly review two kernel subspace analysis methods, KPCA and KDA, performed on the data set X.

2.1. KPCA algorithm

Schölkopf et al. [17] first proposed KPCA as a nonlinear extension of PCA. We briefly review the method in this subsection.

Given the data set X, the covariance matrix in F, denoted by Σ^Φ, is given by

\Sigma^{\Phi} = E[(\Phi(x) - E[\Phi(x)])(\Phi(x) - E[\Phi(x)])^T] = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N_i} (\Phi(x_i^j) - m^{\Phi})(\Phi(x_i^j) - m^{\Phi})^T,   (1)

where m^Φ = \frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{N_i}\Phi(x_i^j) is the mean over all N feature vectors in F. Since KPCA seeks to find the optimal projection directions in F, onto which all patterns are projected to give the corresponding covariance matrix with maximum trace, the objective function can be defined by maximizing the following:

J_{\mathrm{kpca}}(v) = v^T \Sigma^{\Phi} v.   (2)

Since the dimensionality of F is generally very high or even infinite, it is inappropriate to solve the following eigenvalue problem directly as in traditional PCA:

\Sigma^{\Phi} v = \lambda v.   (3)

Fortunately, we can show that the eigenvector v must lie in the space spanned by {Φ(x_i^j)} in F and thus it can be expressed in the form of the following linear expansion:

v = \sum_{i=1}^{c} \sum_{j=1}^{N_i} w_i^j \Phi(x_i^j).   (4)

Substituting Eq. (4) into Eq. (2), we obtain an equivalent eigenvalue problem as follows:

\left(I - \frac{1}{N}\mathbf{1}\right) K \left(I - \frac{1}{N}\mathbf{1}\right)^T w = \lambda w,   (5)


where I is an N × N identity matrix, \mathbf{1} is an N × N matrix with all entries equal to one, w = (w_1^1, \ldots, w_1^{N_1}, \ldots, w_c^1, \ldots, w_c^{N_c})^T is the vector of expansion coefficients of a given eigenvector v, and K is the N × N Gram matrix, which can be written in block form as K = (K_{lh})_{l,h=1,\ldots,c} with K_{lh} = (k_{ij})_{i=1,\ldots,N_l}^{j=1,\ldots,N_h} and k_{ij} = \langle \Phi(x_l^i), \Phi(x_h^j) \rangle.

The solution to Eq. (5) can be found by solving for the orthonormal eigenvectors w_1, \ldots, w_m corresponding to the m largest eigenvalues \lambda_1, \ldots, \lambda_m, arranged in descending order. Thus, the eigenvectors of Eq. (3) can be obtained as \Phi w_i (i = 1, \ldots, m), where \Phi = [\Phi(x_1^1), \ldots, \Phi(x_1^{N_1}), \ldots, \Phi(x_c^1), \ldots, \Phi(x_c^{N_c})]. Furthermore, the corresponding normalized eigenvectors v_i (i = 1, \ldots, m) can be obtained as v_i = (1/\sqrt{\lambda_i}) \Phi w_i, since (\Phi w_i)^T \Phi w_i = \lambda_i.

With v_i = (1/\sqrt{\lambda_i}) \Phi w_i (i = 1, \ldots, m) constituting the m orthonormal projection directions in F, any novel input vector x can obtain its low-dimensional feature representation y = (y_1, \ldots, y_m)^T in F as

y = (v_1, \ldots, v_m)^T \Phi(x),   (6)

with each KPCA feature y_i (i = 1, \ldots, m) expanded further as

y_i = v_i^T \Phi(x) = \frac{1}{\sqrt{\lambda_i}} w_i^T \Phi^T \Phi(x) = \frac{1}{\sqrt{\lambda_i}} w_i^T \big(k(x_1^1, x), \ldots, k(x_1^{N_1}, x), \ldots, k(x_c^1, x), \ldots, k(x_c^{N_c}, x)\big)^T.   (7)
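To make the KPCA recipe of Eqs. (5)–(7) concrete, here is a minimal NumPy sketch of the fitting and projection steps. It assumes the Gram matrix K has already been computed; the function names are illustrative rather than taken from the paper, and the small eigenvalue clipping is a numerical safeguard of ours, not part of the derivation.

```python
import numpy as np

def kpca_fit(K, m):
    """Eq. (5): eigen-decompose the centred Gram matrix and keep the m leading
    eigenpairs.  The returned coefficients are scaled by 1/sqrt(lambda_i) so
    that the corresponding directions in F are orthonormal (cf. Eqs. (6)-(7))."""
    N = K.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N         # I - (1/N) * 1
    Kc = C @ K @ C.T                            # centred Gram matrix
    lam, W = np.linalg.eigh(Kc)                 # ascending eigenvalues
    lam, W = lam[::-1][:m], W[:, ::-1][:, :m]   # m largest eigenpairs
    lam = np.clip(lam, 1e-12, None)             # numerical safeguard before the sqrt
    return W / np.sqrt(lam)

def kpca_transform(W_scaled, k_x):
    """Eq. (7): project a new point given its kernel vector
    k_x = (k(x_1, x), ..., k(x_N, x))."""
    return W_scaled.T @ k_x
```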

Unlike traditional PCA, which only captures second-order statistics in the input space, KPCA captures second-order statistics in the feature space, which can correspond to higher-order statistics in the input space depending on the feature map (and hence kernel) used. Therefore, KPCA is superior to PCA in extracting more powerful features, which are especially essential when the face image variations due to illumination and pose are significantly complex and nonlinear. Nevertheless, KPCA is still an unsupervised learning method and hence the (nonlinear) features extracted by KPCA do not necessarily give rise to high separability between classes.

2.2. KDA algorithm

Similar to the kernel extension of PCA that gives KPCA, the kernel trick can also be applied to LDA to give its kernel extension. KDA seeks the optimal projection directions in F by simultaneously maximizing the between-class scatter and minimizing the within-class scatter in F. In this subsection, we briefly review the KDA method.

For a data set X and a feature map Φ, the between-class scatter matrix S_b^Φ and the within-class scatter matrix S_w^Φ in F can be defined as

S_b^{\Phi} = \sum_{i=1}^{c} \frac{N_i}{N} (m_i^{\Phi} - m^{\Phi})(m_i^{\Phi} - m^{\Phi})^T   (8)

and

S_w^{\Phi} = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N_i} (\Phi(x_i^j) - m_i^{\Phi})(\Phi(x_i^j) - m_i^{\Phi})^T,   (9)

where m_i^Φ = (1/N_i)\sum_{j=1}^{N_i}\Phi(x_i^j) is the class mean of X_i in F and m^Φ = (1/N)\sum_{i=1}^{c}\sum_{j=1}^{N_i}\Phi(x_i^j) is the overall mean as before.

Analogous to LDA, which operates in the input space, the optimal projection directions for KDA can be obtained by maximizing the Fisher criterion in F:

J(v) = \frac{v^T S_b^{\Phi} v}{v^T S_w^{\Phi} v}.   (10)

Like KPCA, we do not solve this optimization problem directly due to the high or even infinite dimensionality of F. As for KPCA, we can show that any solution v ∈ F must lie in the space spanned by {Φ(x_i^j)} in F and thus it can be expressed as

v = \sum_{i=1}^{c} \sum_{j=1}^{N_i} w_i^j \Phi(x_i^j).   (11)

Substituting Eq. (11) into the numerator and denominator of Eq. (10), we obtain

v^T S_b^{\Phi} v = w^T K_b w   (12)

and

v^T S_w^{\Phi} v = w^T K_w w,   (13)

where w = (w_1^1, \ldots, w_1^{N_1}, \ldots, w_c^1, \ldots, w_c^{N_c})^T and K_b and K_w can be seen as the variant scatter matrices based on some manipulation of the Gram matrix K. As a result, the solution to Eq. (10) can be obtained by maximizing the following optimization problem instead:

J_f(w) = \frac{w^T K_b w}{w^T K_w w},   (14)

giving the m leading eigenvectors w_1, \ldots, w_m of the matrix K_w^{-1} K_b as the solution.

For any input vector x, its low-dimensional feature representation y = (y_1, \ldots, y_m)^T can then be obtained as

y = (w_1, \ldots, w_m)^T \big(k(x_1^1, x), \ldots, k(x_1^{N_1}, x), \ldots, k(x_c^1, x), \ldots, k(x_c^{N_c}, x)\big)^T.   (15)

Note that the solution above is based on the assumption that the within-class scatter matrix K_w is invertible. However, for face recognition applications, this assumption is almost always invalid due to the undersampling problem discussed above. One simple method for solving this problem is to use the pseudo-inverse of K_w instead. Another simple method is to add a small multiple of the identity matrix (εI for some small ε > 0) to K_w to make it non-singular. Although more effective methods have been proposed (e.g., Refs. [28,30]), we keep it simple in this paper by adding εI to K_w with ε = 10^{-7} in our experiments.
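The following sketch illustrates one straightforward way to assemble K_b and K_w from the Gram matrix, apply the εI regularization just mentioned (ε = 10^{-7}), and extract the leading eigenvectors of K_w^{-1} K_b as in Eqs. (14)–(15). The particular construction of K_b and K_w follows the standard GDA formulation and the later Eqs. (22)–(25); the code is an illustrative approximation, not the authors' implementation.

```python
import numpy as np

def kda_fit(K, labels, m, eps=1e-7):
    """Kernel discriminant analysis in the span of the training points.
    K: N x N Gram matrix, labels: length-N class indices, m: number of
    discriminant directions, eps: ridge added to K_w (Section 2.2)."""
    N = K.shape[0]
    mean_all = K.mean(axis=1, keepdims=True)          # overall mean of the columns k_i^j
    Kb = np.zeros((N, N))
    Kw = np.zeros((N, N))
    for c in np.unique(labels):
        cols = K[:, labels == c]                      # columns k_c^j of the Gram matrix
        mean_c = cols.mean(axis=1, keepdims=True)     # class mean vector (cf. Eq. (23))
        diff = mean_c - mean_all
        Kb += (cols.shape[1] / N) * (diff @ diff.T)   # between-class part
        centred = cols - mean_c
        Kw += (centred @ centred.T) / N               # within-class part (cf. Eq. (22))
    Kw += eps * np.eye(N)                             # epsilon * I regularization
    evals, evecs = np.linalg.eig(np.linalg.solve(Kw, Kb))   # eigenvectors of Kw^{-1} Kb
    order = np.argsort(-evals.real)[:m]
    return evecs[:, order].real                       # columns w_1, ..., w_m

def kda_transform(W, k_x):
    """Eq. (15): features of a new point from its kernel vector
    k_x = (k(x_1, x), ..., k(x_N, x))."""
    return W.T @ k_x
```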

3. Feature extraction via KFDA

3.1. Kernel fractional-step discriminant analysis

The primary objective of the new KFDA algorithm is to overcome the adverse effects caused by outlier classes. A commonly adopted method to solve this problem is to incorporate a weighting function into the Fisher criterion by using a weighted between-class scatter matrix in place of the ordinary between-class scatter matrix, as in Refs. [11,13,14,31,32]. However, it is not clear how to set the weights in the weighting function appropriately to put more emphasis on those classes that are close together and hence are more likely to lead to misclassification. Our KFDA algorithm involves two steps. The first step is similar to the ordinary KDA procedure in obtaining a low-dimensional subspace, and the second step applies the FLDA procedure to further reduce the dimensionality by adjusting the weights in the weighting function automatically through making fractional steps.

As in Refs. [11,31,32], we define the weighted between-class scatter matrix in F as follows:

S_B^{\Phi} = \sum_{i=1}^{c-1} \sum_{j=i+1}^{c} \frac{N_i N_j}{N^2} w(d_{ij}) (m_i^{\Phi} - m_j^{\Phi})(m_i^{\Phi} - m_j^{\Phi})^T,   (16)

where the weighting function w(d_{ij}) is a monotonically decreasing function of the Euclidean distance d_{ij} = ‖m_i^Φ − m_j^Φ‖, with m_i^Φ and m_j^Φ being the class means of X_i and X_j in F, respectively. Apparently, the weighted between-class scatter matrix S_B^Φ degenerates to the conventional between-class scatter matrix S_b^Φ if the weighting function in Eq. (16) always gives a constant weight value. In this sense S_B^Φ can be regarded as a generalization of S_b^Φ. According to the FLDA procedure in Ref. [12], the weighting function should drop faster than the Euclidean distance between the class means of X_i and X_j in F grows. As a result, the only constraint for the weighting function w(d_{ij}) = d_{ij}^{-p}, where p ∈ N, is p ≥ 3.

Moreover, it is easy to note that d_{ij} in F can be computed by applying the kernel trick as follows:

d_{ij} = \|m_i^{\Phi} - m_j^{\Phi}\|
       = \sqrt{ \Big( \sum_{x_{i_1} \in X_i} \frac{\Phi(x_{i_1})}{N_i} - \sum_{x_{j_1} \in X_j} \frac{\Phi(x_{j_1})}{N_j} \Big)^T \Big( \sum_{x_{i_2} \in X_i} \frac{\Phi(x_{i_2})}{N_i} - \sum_{x_{j_2} \in X_j} \frac{\Phi(x_{j_2})}{N_j} \Big) }
       = \sqrt{ \sum_{x_{i_1}, x_{i_2} \in X_i} \frac{k_{i_1,i_2}}{N_i^2} + \sum_{x_{j_1}, x_{j_2} \in X_j} \frac{k_{j_1,j_2}}{N_j^2} - \sum_{x_{i_1} \in X_i,\, x_{j_2} \in X_j} \frac{k_{i_1,j_2}}{N_i N_j} - \sum_{x_{i_2} \in X_i,\, x_{j_1} \in X_j} \frac{k_{j_1,i_2}}{N_i N_j} },   (17)

where k_{i_1,i_2} = k(x_{i_1}, x_{i_2}) = \langle \Phi(x_{i_1}), \Phi(x_{i_2}) \rangle for x_{i_1}, x_{i_2} \in X_i, k_{j_1,j_2} = k(x_{j_1}, x_{j_2}) = \langle \Phi(x_{j_1}), \Phi(x_{j_2}) \rangle for x_{j_1}, x_{j_2} \in X_j, k_{i_1,j_2} = k(x_{i_1}, x_{j_2}) = \langle \Phi(x_{i_1}), \Phi(x_{j_2}) \rangle and k_{j_1,i_2} = k(x_{j_1}, x_{i_2}) = \langle \Phi(x_{j_1}), \Phi(x_{i_2}) \rangle.

Based on the definition of S_B^Φ in Eq. (16), we define a new Fisher criterion in F as

J(v) = \frac{v^T S_B^{\Phi} v}{v^T S_w^{\Phi} v}.   (18)

Again, we can express the solution as v = \sum_{i=1}^{c} \sum_{j=1}^{N_i} w_i^j \Phi(x_i^j) and hence rewrite the Fisher criterion in Eq. (18) as (see Appendix A.1)¹

J_f(w) = \frac{w^T K_B w}{w^T K_w w},   (19)

where

w = (w_1^1, \ldots, w_1^{N_1}, \ldots, w_c^1, \ldots, w_c^{N_c})^T,   (20)

K_B = \sum_{i=1}^{c-1} \sum_{j=i+1}^{c} \frac{N_i N_j}{N^2} w(d_{ij}) (m_i - m_j)(m_i - m_j)^T,   (21)

K_w = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N_i} (k_i^j - m_i)(k_i^j - m_i)^T,   (22)

with

m_i = \Big( \frac{1}{N_i} \sum_{h=1}^{N_i} k(x_1^1, x_i^h), \ldots, \frac{1}{N_i} \sum_{h=1}^{N_i} k(x_1^{N_1}, x_i^h), \ldots, \frac{1}{N_i} \sum_{h=1}^{N_i} k(x_c^1, x_i^h), \ldots, \frac{1}{N_i} \sum_{h=1}^{N_i} k(x_c^{N_c}, x_i^h) \Big)^T,   (23)

m_j = \Big( \frac{1}{N_j} \sum_{h=1}^{N_j} k(x_1^1, x_j^h), \ldots, \frac{1}{N_j} \sum_{h=1}^{N_j} k(x_1^{N_1}, x_j^h), \ldots, \frac{1}{N_j} \sum_{h=1}^{N_j} k(x_c^1, x_j^h), \ldots, \frac{1}{N_j} \sum_{h=1}^{N_j} k(x_c^{N_c}, x_j^h) \Big)^T,   (24)

k_i^j = \big( k(x_1^1, x_i^j), \ldots, k(x_1^{N_1}, x_i^j), \ldots, k(x_c^1, x_i^j), \ldots, k(x_c^{N_c}, x_i^j) \big)^T.   (25)

¹ Since the term K_w is the same as that in classical KDA, Appendix A.1 only provides the derivation for K_B in Eq. (21).

The solution to Eq. (18) is thus given by the m leading eigenvectors w_1, \ldots, w_m of the matrix K_w^{-1} K_B.

For any input vector x, its low-dimensional feature representation z = (z_1, \ldots, z_m)^T can thus be given by (see Appendix A.2)

z = (w_1, \ldots, w_m)^T \big( k(x_1^1, x), \ldots, k(x_1^{N_1}, x), \ldots, k(x_c^1, x), \ldots, k(x_c^{N_c}, x) \big)^T.   (26)
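A small sketch of this first (weighted KDA) step follows: it computes the pairwise feature-space distances d_ij of Eq. (17) and the weighted between-class matrix K_B of Eq. (21) purely from the Gram matrix, using the weighting function w(d_ij) = d_ij^(-p). The function names and the small numerical guard on d_ij are ours, not the paper's.

```python
import numpy as np

def class_mean_vectors(K, labels):
    """m_i of Eq. (23): for each class, the mean of the Gram-matrix columns
    belonging to that class."""
    classes = np.unique(labels)
    return {c: K[:, labels == c].mean(axis=1) for c in classes}, classes

def pairwise_feature_space_distances(K, labels):
    """d_ij of Eq. (17), computed purely from kernel evaluations."""
    classes = np.unique(labels)
    d = {}
    for a, i in enumerate(classes):
        for j in classes[a + 1:]:
            in_i, in_j = labels == i, labels == j
            sq = (K[np.ix_(in_i, in_i)].mean() + K[np.ix_(in_j, in_j)].mean()
                  - 2.0 * K[np.ix_(in_i, in_j)].mean())
            d[(i, j)] = np.sqrt(max(sq, 0.0))            # clip tiny negative round-off
    return d

def weighted_between_class_matrix(K, labels, p=12):
    """K_B of Eq. (21) with the weighting function w(d_ij) = d_ij^(-p)."""
    N = K.shape[0]
    means, classes = class_mean_vectors(K, labels)
    d = pairwise_feature_space_distances(K, labels)
    KB = np.zeros((N, N))
    for a, i in enumerate(classes):
        for j in classes[a + 1:]:
            Ni, Nj = np.sum(labels == i), np.sum(labels == j)
            diff = (means[i] - means[j])[:, None]
            weight = max(d[(i, j)], 1e-12) ** (-p)       # guard against coincident means
            KB += (Ni * Nj / N ** 2) * weight * (diff @ diff.T)
    return KB
```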

Through the weighted KDA procedure described above, we obtain a low-dimensional subspace where almost all classes become linearly separable, although some classes remain closer to each other than others. In what follows, an FLDA step is applied directly to further reduce the dimensionality of this subspace from m to the required m′ through making fractional steps. The primary motivation for FLDA comes from the following consideration [12]. In order to substantially improve the robustness of the choice of the weighting function and avoid the instability of the algorithm brought about by an inaccurate weighting function [11], FLDA introduces a form of automatic gain control, which reduces the dimensionality in small fractional steps rather than integral steps. This allows the between-class scatter matrix and its eigenvectors to be iteratively recomputed in accordance with the variations of the weighting function, so that the chance of overlap between classes can be reduced. Therefore, in the output classification space, FLDA can increase the separability between classes that have small inter-class distances in the input space while preserving the high separability between classes that are already far apart. Furthermore, unlike some other techniques [11,13,14], FLDA computes the weighting function without having to adopt any restrictive statistical model assumption, making it applicable to more general problem settings, including face recognition tasks with high variability between images.

Fig. A.1 summarizes the KFDA algorithm for feature extraction, which is used with the nearest neighbor rule for classification. In the next section, we propose a further extension of the KFDA algorithm by introducing two new kernel functions.
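Fig. A.1 itself is not reproduced here; the sketch below only illustrates the second, fractional-step stage as described in the text and in Ref. [12]: the weighted between-class scatter of the projected data is recomputed at every fractional step while the weakest direction is gradually shrunk and finally discarded. It is a simplified illustration under our own assumptions (r = 30 steps, w(d_ij) = d_ij^(-12)), not the authors' exact procedure.

```python
import numpy as np

def weighted_Sb(Y, labels, p=12):
    """Weighted between-class scatter of the projected data Y (rows are samples),
    with w(d_ij) = d_ij^(-p) recomputed from the current class means."""
    classes = np.unique(labels)
    N, k = Y.shape
    means = {c: Y[labels == c].mean(axis=0) for c in classes}
    Sb = np.zeros((k, k))
    for a, i in enumerate(classes):
        for j in classes[a + 1:]:
            Ni, Nj = np.sum(labels == i), np.sum(labels == j)
            diff = (means[i] - means[j])[:, None]
            dij = max(np.linalg.norm(diff), 1e-12)       # guard against coincident means
            Sb += (Ni * Nj / N ** 2) * dij ** (-p) * (diff @ diff.T)
    return Sb

def fractional_step_reduction(Y, labels, target_dim, r=30, p=12):
    """Drop one dimension at a time; each dimension is removed over r fractional
    steps so that the weights w(d_ij) are re-estimated as the class means move
    (the 'automatic gain control' described in the text)."""
    Y = np.asarray(Y, dtype=float).copy()
    while Y.shape[1] > target_dim:
        for step in range(1, r + 1):
            Sb = weighted_Sb(Y, labels, p)
            _, V = np.linalg.eigh(Sb)        # eigenvectors, ascending eigenvalues
            V = V[:, ::-1]                   # most discriminant direction first
            Y = Y @ V                        # rotate into the current eigenbasis
            Y[:, -1] *= (r - step) / r       # fractionally shrink the weakest direction
        Y = Y[:, :-1]                        # the last dimension is now zero; drop it
    return Y
```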

3.2. New kernel functions

Kernel subspace analysis methods make use of the input data for the computation exclusively in the form of inner products in F, with the inner products computed implicitly via a kernel function, called a Mercer kernel, k(x, y) = ⟨Φ(x), Φ(y)⟩, where ⟨·,·⟩ is the inner product operator in F. A symmetric function is a Mercer kernel if and only if the Gram matrix formed by applying this function to any finite subset of X is positive semi-definite. Some kernel functions such as the polynomial kernel, Gaussian RBF kernel and sigmoid kernel have been commonly used in many practical applications of kernel methods. For face recognition using kernel subspace analysis methods in particular [18,23–25,27,29,31,32], the polynomial kernel and Gaussian RBF kernel have been used extensively to demonstrate that kernel subspace analysis methods are more effective than their linear counterparts in many cases. While in principle any Mercer kernel can be used with kernel subspace analysis methods, not all kernel functions are equally good in terms of the performance that the kernel methods can deliver.

Recently, the feasibility and effectiveness of different kernel choices for kernel subspace analysis methods have been investigated in the context of face recognition applications. Chen et al. [33] proposed a KDA-based method for face recognition with the parameter of the Gaussian RBF kernel determined in advance. Yang et al. [30] proposed using a kernel function for KDA that combines multiple Gaussian RBF kernels with different parameters, with the combination coefficients determined so as to optimally combine the individual kernels. However, the effectiveness of combining multiple kernels is still implicit for KDA, since the conventional Fisher criterion in F is converted into the two-dimensional Fisher criterion at the same time. Liu et al. [26] extended the polynomial kernel to the so-called cosine polynomial kernel by applying the cosine measure. One motivation is that the inner product of two vectors in F can be regarded as a similarity measure between them. Another motivation is that such a measure has been found to perform well in practice. Moreover, Liu [20] extended the ordinary polynomial kernel in an innovative way to include the fractional-power polynomial kernel as well. However, the fractional-power polynomial kernel proposed, like the sigmoid kernel, is not a Mercer kernel. Nevertheless, Liu applied KPCA with the fractional-power polynomial kernel to face recognition and showed improvement in performance when compared with the ordinary polynomial kernel.

Inspired by the effectiveness of the cosine measure for kernel functions [26], we further extend the fractional-power polynomial kernel in this paper by incorporating the cosine measure to give the so-called cosine fractional-power polynomial kernel. Note that the cosine measure can only be used with non-stationary kernels such as the polynomial kernel, but not with isotropic kernels such as the Gaussian RBF kernel that depend only on the difference between two vectors. To improve the face recognition performance of the Gaussian RBF kernel, we also go beyond the traditional Gaussian RBF kernel by considering non-normal Gaussian RBF kernels, k_RBF(x, y) = exp(−‖x − y‖^d/σ²) with d ≥ 0 and d ≠ 2. According to Ref. [16], non-normal Gaussian RBF kernels satisfy the Mercer condition, and hence are Mercer kernels, if and only if 0 ≤ d ≤ 2.

In summary, the kernel functions considered in this paper are listed below:

• Polynomial kernel (poly):

k_{poly}(x, y) = (\theta_1\, x \cdot y^T - \theta_2)^d with d ∈ N and d ≥ 1.   (27)

• Fractional-power polynomial kernel (fp-poly):

k_{fp\text{-}poly}(x, y) = (\theta_1\, x \cdot y^T - \theta_2)^d with 0 < d < 1.   (28)

• Cosine polynomial kernel (c-poly):

k_{c\text{-}poly}(x, y) = \frac{k_{poly}(x, y)}{\sqrt{k_{poly}(x, x)\, k_{poly}(y, y)}}.   (29)

• Cosine fractional-power polynomial kernel (cfp-poly):

k_{cfp\text{-}poly}(x, y) = \frac{k_{fp\text{-}poly}(x, y)}{\sqrt{k_{fp\text{-}poly}(x, x)\, k_{fp\text{-}poly}(y, y)}}.   (30)

• Gaussian RBF kernel (RBF):

k_{RBF}(x, y) = \exp(-\|x - y\|^2 / \sigma^2).   (31)

• Non-normal Gaussian RBF kernel (NN-RBF):

k_{nn\text{-}RBF}(x, y) = \exp(-\|x - y\|^d / \sigma^2) with d ≥ 0 and d ≠ 2.   (32)
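For reference, the six kernels of Eqs. (27)–(32) can be written down directly. In the sketch below the parameter names theta1, theta2 and sigma2 stand for the kernel parameters used above (their default values are the ones reported in Section 4.2), and the sign-preserving fractional power in the fp-poly kernel is our assumption for handling a negative base.

```python
import numpy as np

def k_poly(x, y, d=2, theta1=1e-9, theta2=1.0):          # Eq. (27), integer d >= 1
    return (theta1 * np.dot(x, y) - theta2) ** d

def k_fp_poly(x, y, d=0.4, theta1=1e-9, theta2=1.0):      # Eq. (28), 0 < d < 1
    base = theta1 * np.dot(x, y) - theta2
    # sign-preserving fractional power (our assumption for a negative base)
    return np.sign(base) * np.abs(base) ** d

def k_c_poly(x, y, d=2):                                   # Eq. (29)
    return k_poly(x, y, d) / np.sqrt(k_poly(x, x, d) * k_poly(y, y, d))

def k_cfp_poly(x, y, d=0.4):                               # Eq. (30)
    return k_fp_poly(x, y, d) / np.sqrt(k_fp_poly(x, x, d) * k_fp_poly(y, y, d))

def k_rbf(x, y, sigma2=1e9):                               # Eq. (31)
    return np.exp(-np.sum((x - y) ** 2) / sigma2)

def k_nn_rbf(x, y, d=1.8, sigma2=1e9):                     # Eq. (32), d >= 0, d != 2
    return np.exp(-np.linalg.norm(x - y) ** d / sigma2)
```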

4. Experimental results

4.1. Visualization of data distributions

We first compare the distributions of data points after applying FLDA, KDA and KFDA, respectively. For the sake of visualization, we reduce the dimensionality to 2. Two databases are used for this set of experiments: the image segmentation database from the UCI Machine Learning Repository² and the YaleB face database.³

The instances in the image segmentation database are drawn randomly from a database of seven outdoor images. Each instance is represented by 19 continuous-valued features extracted from a 3 × 3 region, and the corresponding class label is obtained by manual segmentation into seven classes: brick-face, sky, foliage, cement, window, path and grass. The database consists of 30 instances for each of the seven classes, summing up to a total of 210 instances for the whole database. Since the dimensionality of the instances in the image segmentation database is relatively low, FLDA can be applied to the instances directly. We also follow the suggestion of Ref. [12] to set the number of fractional steps to 30 in our experiments. Fig. A.2(a)–(c) depicts the distributions of data points based on the two most discriminant features after applying FLDA, KDA and KFDA, respectively. For KDA and KFDA, the Gaussian RBF kernel k(x_1, x_2) = exp(−‖x_1 − x_2‖²/10⁴) is used. Moreover, following Ref. [12], the weighting functions in both FLDA and KFDA are set to w(d_{ij}) = d_{ij}^{-12}. We can see from Fig. A.2(a) that most classes still remain non-separable even after applying FLDA. For KDA, Fig. A.2(b) shows that many classes become more linearly separable. However, some classes (such as brick-face, foliage, cement and window) are so closely clustered together that they cannot be perfectly separated. On the other hand, Fig. A.2(c) shows that all classes are very well separated and are more equally spaced when KFDA is applied.

² http://www.ics.uci.edu/∼mlearn/MLRepository.
³ http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html.

We further perform some experiments on the YaleB face database [34]. The YaleB face database contains 5850 face images of 10 subjects, each captured under 585 different viewing conditions (9 poses × 65 illumination conditions). To facilitate visualization, we select a subset of 240 face images from the database, with 30 images for each of eight subjects. All images have been manually cropped and then normalized to a size of 46 × 56 with 256 gray levels. Fig. A.3 shows the 30 images of one individual used in the experiments. Since the face patterns are high-dimensional, we use DFLDA [15] in place of FLDA [12] for this set of experiments. Fig. A.4(a)–(c) depicts the distributions of data points based on the two most discriminant features after applying DFLDA, KDA and KFDA, respectively. For KDA and KFDA, the Gaussian RBF kernel k(x_1, x_2) = exp(−‖x_1 − x_2‖²/10⁹) is used. As above, the weighting functions in both DFLDA and KFDA are set to w(d_{ij}) = d_{ij}^{-12}. The findings are similar to those for the image segmentation data above. While KDA gives classes that are more compact than those obtained by DFLDA, the classes obtained by KFDA are both more compact and well separated from each other, as shown in Fig. A.4(c).

4.2. Face recognition experiments

In this subsection, we study the face recognition performance of KFDA with the different kernel functions given in Section 3.2 and compare it with two other kernel methods, KPCA and KDA. We use the FERET face database for these experiments.⁴

The FERET face database [35] is from the FERET Program sponsored by the US Department of Defense's Counterdrug Technology Development Program through the Defense Advanced Research Projects Agency (DARPA), and it has become the de facto standard for evaluating state-of-the-art face recognition algorithms. The whole database contains 13,539 face images of 1565 subjects taken during different photo sessions, with variations in size, pose, illumination, facial expression and even age. The subset we use in our experiments includes 200 subjects, each with four different images. All images are obtained by cropping based on the manually located centers of the eyes and are normalized to the same size of 92 × 112 with 256 gray levels. Fig. A.5 shows some sample images used in our experiments.

⁴ http://www.itl.nist.gov/iad/humanid/feret/.

In the following experiments, the images for each subject are randomly partitioned into two disjoint sets for training and testing. More specifically, three images per subject are randomly chosen from the four images available for training, while the remaining image is used for testing. For each feature representation obtained by a dimensionality reduction method, we use the nearest neighbor rule with the Euclidean distance measure to assess the classification accuracy. Each experiment is repeated 10 times and the average classification rates are reported. We set θ₁ = 10⁻⁹ and θ₂ = 1 for the first four non-stationary kernels presented in Section 3.2, and σ² = 10⁹ for the last two isotropic kernels.
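The evaluation protocol just described (random 3/1 split per subject, nearest neighbor classification with the Euclidean distance, 10 repetitions) can be sketched as follows; the feature extractor is left as a placeholder callable, and the function names are ours.

```python
import numpy as np

def one_nn_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Nearest neighbor rule with the Euclidean distance measure."""
    correct = 0
    for f, y in zip(test_feats, test_labels):
        nearest = np.argmin(np.linalg.norm(train_feats - f, axis=1))
        correct += int(train_labels[nearest] == y)
    return correct / len(test_labels)

def run_protocol(X, labels, extract_features, n_trials=10, n_train=3, seed=0):
    """X: one row per image; labels: subject ids (four images per subject).
    extract_features(X_tr, y_tr, X_te) -> (train_feats, test_feats) is the
    dimensionality reduction method under evaluation (placeholder callable)."""
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    rates = []
    for _ in range(n_trials):
        tr_idx, te_idx = [], []
        for s in np.unique(labels):
            idx = rng.permutation(np.where(labels == s)[0])
            tr_idx.extend(idx[:n_train])
            te_idx.extend(idx[n_train:])
        tr, te = np.array(tr_idx), np.array(te_idx)
        f_tr, f_te = extract_features(X[tr], labels[tr], X[te])
        rates.append(one_nn_accuracy(f_tr, labels[tr], f_te, labels[te]))
    return float(np.mean(rates))
```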

We first compare KFDA with KDA using different non-stationary kernels: the polynomial kernel (poly), fractional-power polynomial kernel (fp-poly), cosine polynomial kernel (c-poly) and cosine fractional-power polynomial kernel (cfp-poly). As in Ref. [12], we set the weighting function for KFDA to w(d_{ij}) = d_{ij}^{-12}. The only remaining parameter in the different non-stationary kernels is the exponent d. For simplicity, we only study the polynomial (poly) and cosine polynomial (c-poly) kernels with exponents d ∈ {1, 2} and the fractional-power polynomial (fp-poly) and cosine fractional-power polynomial (cfp-poly) kernels with exponents d ∈ {0.8, 0.6, 0.4}. We have also tested the polynomial and cosine polynomial kernels with d ∈ {3, 4, 5}, but the results are similar to those for d = 2, so we do not include them in the comparison. Fig. A.6 compares KDA and KFDA on the different non-stationary kernels. We have also tried KPCA, but its results are very poor compared with KDA and KFDA and hence are not included in the graphs. Fig. A.6(a)–(c) shows the performance of KDA on the four different non-stationary kernels. The fractional-power polynomial kernel outperforms the polynomial kernel, and the cosine fractional-power polynomial kernel achieves a further improvement. The performance of KFDA on the different non-stationary kernels shows a similar trend, as we can see in Fig. A.6(d)–(f). Finally, Fig. A.6(g) and (h) shows that KFDA outperforms KDA because KFDA can resist the adverse effects due to outlier classes.

We next compare KDA and KFDA on different isotropic kernels, namely the Gaussian RBF kernel (RBF) and the non-normal Gaussian RBF kernel (NN-RBF). Fig. A.7 shows the results. The weighting function for KFDA is still set to w(d_{ij}) = d_{ij}^{-12} as before, and the non-normal Gaussian RBF kernel is tested with exponents d ∈ {1.5, 1.6, 1.8}.

Fig. A.7(a) and (b) shows that the non-normal Gaussian RBF kernel generally gives better results than the Gaussian RBF kernel for both KDA and KFDA. In Fig. A.7(c), we can see that KFDA significantly outperforms KDA when the Gaussian RBF kernel is used. This is also true for the non-normal Gaussian RBF kernel, as shown in Fig. A.7(d). This shows that KFDA can successfully resist the adverse effects due to outlier classes and hence can improve the face recognition accuracy significantly. We have also performed some experiments for both KDA and KFDA on the non-normal Gaussian RBF kernel with d > 2. Except for a few cases, the recognition rates are generally far lower than those for the Gaussian RBF kernel. A possible explanation is that the kernel is no longer a Mercer kernel when d > 2.

We further try KFDA with some other weighting functions, w(d_{ij}) ∈ {d_{ij}^{-4}, d_{ij}^{-6}, d_{ij}^{-8}, d_{ij}^{-10}}, as recommended by Ref. [12]. Table A.1 shows the results comparing KFDA with the weighting functions w(d_{ij}) ∈ {d_{ij}^{-4}, d_{ij}^{-8}, d_{ij}^{-12}} against KPCA and KDA. We can see that KFDA is superior to KPCA and KDA on the different kernels for all weighting functions tried. A closer look shows that the cosine fractional-power polynomial kernel and the non-normal Gaussian RBF kernel can effectively boost the performance of KFDA. Specifically, KFDA with the cosine fractional-power polynomial kernel for d = 0.4 and the weighting function w(d_{ij}) = d_{ij}^{-12} achieves a recognition rate of 90% using 62 features. Using the non-normal Gaussian RBF kernel with d = 1.8 and the weighting function w(d_{ij}) = d_{ij}^{-12}, it gives a recognition rate of 91.65% using only 46 features. On the other hand, KPCA requires many more features for all kernels and yet the results are not satisfactory. Except for the non-normal Gaussian RBF kernel, KDA using the other kernels also requires many more features than KFDA to attain its highest recognition rates and yet it is still inferior to KFDA.

To sufficiently evaluate the overall performance of KFDA, we follow the suggestions of Refs. [27,15] and also report the average percentages of the error rate of KFDA over those of both KPCA and KDA on the different kernels. Under our experimental settings, the average percentage of the error rate of KFDA over that of another method is calculated as the average of (100 − α_i)/(100 − β_i) for i = 5, \ldots, 198, where α_i and β_i are the recognition rates (in percent) of KFDA and the other method, respectively, when i features are used. The corresponding results are summarized in Table A.2, where, as recommended by Ref. [12], the weighting functions w(d_{ij}) ∈ {d_{ij}^{-4}, d_{ij}^{-8}, d_{ij}^{-12}} are used for KFDA. It is clear from the results that the weighting scheme in KFDA can bring about a performance improvement over KDA.
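As a small worked illustration of this metric, the sketch below averages the error-rate ratios over i = 5, ..., 198 features; the array-indexing convention (position i holds the rate obtained with i features) is our assumption.

```python
import numpy as np

def average_error_percentage(kfda_rates, other_rates, lo=5, hi=198):
    """Average of (100 - r_KFDA(i)) / (100 - r_other(i)) over i = lo, ..., hi,
    expressed as a percentage.  Position i of each array is assumed to hold the
    recognition rate (%) obtained when i features are used."""
    i = np.arange(lo, hi + 1)
    ratios = (100.0 - np.asarray(kfda_rates)[i]) / (100.0 - np.asarray(other_rates)[i])
    return 100.0 * float(ratios.mean())
```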

5. Conclusion

In this paper, we have proposed a novel kernel-based feature extraction method called KFDA. Not only can this new method deal with nonlinearity in a disciplined manner that is computationally attractive, it can also outperform traditional KDA algorithms in resisting the adverse effects due to outlier classes by incorporating a weighted between-class scatter matrix and adjusting its weights via making fractional steps in dimensionality reduction. We then further improve the performance of KFDA by using two new kernel functions: the cosine fractional-power polynomial kernel and the non-normal Gaussian RBF kernel. Extensive face recognition experiments based on the YaleB and FERET face databases show that KFDA significantly outperforms KPCA and is also superior to KDA for all the kernel functions tried. Moreover, the new kernels outperform the polynomial kernel and Gaussian RBF kernel.

Acknowledgments

The research described in this paper has been supported by Competitive Earmarked Research Grant (CERG) HKUST621305 from the Research Grants Council (RGC) of the Hong Kong Special Administrative Region, China.

Appendix A. Detailed derivations

A.1. Derivation of Eq. (21)

For clarity, let us denote \Phi = (\Phi(x_1^1), \ldots, \Phi(x_1^{N_1}), \ldots, \Phi(x_c^1), \ldots, \Phi(x_c^{N_c})). Then we have v = \Phi w.

Based on the relationship between Eqs. (18) and (19), we have K_B = \Phi^T S_B^{\Phi} \Phi.

Fig. A.1. Summary of KFDA algorithm.

To derive Eq. (21), K_B above can be expanded as

K_B = \Phi^T \Big( \sum_{i=1}^{c-1} \sum_{j=i+1}^{c} \frac{N_i N_j}{N^2} w(d_{ij}) (m_i^{\Phi} - m_j^{\Phi})(m_i^{\Phi} - m_j^{\Phi})^T \Big) \Phi
    = \sum_{i=1}^{c-1} \sum_{j=i+1}^{c} \frac{N_i N_j}{N^2} w(d_{ij}) (\Phi^T m_i^{\Phi} - \Phi^T m_j^{\Phi})(\Phi^T m_i^{\Phi} - \Phi^T m_j^{\Phi})^T.

Since m_i^{\Phi} = (1/N_i) \sum_{h=1}^{N_i} \Phi(x_i^h), the term \Phi^T m_i^{\Phi} can be expanded as

\Phi^T m_i^{\Phi} = \big( (m_i^{\Phi})^T \Phi(x_1^1), \ldots, (m_i^{\Phi})^T \Phi(x_1^{N_1}), \ldots, (m_i^{\Phi})^T \Phi(x_c^1), \ldots, (m_i^{\Phi})^T \Phi(x_c^{N_c}) \big)^T
    = \Big( \frac{1}{N_i} \sum_{h=1}^{N_i} \Phi(x_i^h)^T \Phi(x_1^1), \ldots, \frac{1}{N_i} \sum_{h=1}^{N_i} \Phi(x_i^h)^T \Phi(x_1^{N_1}), \ldots, \frac{1}{N_i} \sum_{h=1}^{N_i} \Phi(x_i^h)^T \Phi(x_c^1), \ldots, \frac{1}{N_i} \sum_{h=1}^{N_i} \Phi(x_i^h)^T \Phi(x_c^{N_c}) \Big)^T


Fig. A.2. Distributions of 210 instances from the image segmentation database plotted based on the two most discriminant features obtained by three dimensionality reduction methods: (a) FLDA; (b) KDA; (c) KFDA; (d) magnification of region A in (b).

Fig. A.3. Thirty sample images of one subject from the YaleB face database used in the experiments.


Fig. A.4. Distributions of 240 images from the YaleB face database plotted based on the two most discriminant features obtained by three dimensionality reduction methods: (a) DFLDA; (b) KDA; (c) KFDA; (d) magnification of region A in (b).

Fig. A.5. Some sample images from the FERET face database used in the experiments.


Fig. A.6. Comparison of KDA and KFDA on different non-stationary kernels. (a) KDA on polynomial kernel (poly) and fractional-power polynomial kernel (fp-poly); (b) KDA on cosine polynomial kernel (c-poly) and cosine fractional-power polynomial kernel (cfp-poly); (c) KDA on non-cosine kernels (poly and fp-poly) and cosine kernels (c-poly and cfp-poly); (d) KFDA on polynomial kernel (poly) and fractional-power polynomial kernel (fp-poly); (e) KFDA on cosine polynomial kernel (c-poly) and cosine fractional-power polynomial kernel (cfp-poly); (f) KFDA on non-cosine kernels (poly and fp-poly) and cosine kernels (c-poly and cfp-poly); (g) KDA vs. KFDA on polynomial kernel (poly) and fractional-power polynomial kernel (fp-poly); (h) KDA vs. KFDA on cosine polynomial kernel (c-poly) and cosine fractional-power polynomial kernel (cfp-poly).

Expressing the above in terms of kernel evaluations,

\Phi^T m_i^{\Phi} = \Big( \frac{1}{N_i} \sum_{h=1}^{N_i} k(x_1^1, x_i^h), \ldots, \frac{1}{N_i} \sum_{h=1}^{N_i} k(x_1^{N_1}, x_i^h), \ldots, \frac{1}{N_i} \sum_{h=1}^{N_i} k(x_c^1, x_i^h), \ldots, \frac{1}{N_i} \sum_{h=1}^{N_i} k(x_c^{N_c}, x_i^h) \Big)^T.

Similarly, the term \Phi^T m_j^{\Phi} can be expanded as

\Phi^T m_j^{\Phi} = \Big( \frac{1}{N_j} \sum_{h=1}^{N_j} k(x_1^1, x_j^h), \ldots, \frac{1}{N_j} \sum_{h=1}^{N_j} k(x_1^{N_1}, x_j^h), \ldots, \frac{1}{N_j} \sum_{h=1}^{N_j} k(x_c^1, x_j^h), \ldots, \frac{1}{N_j} \sum_{h=1}^{N_j} k(x_c^{N_c}, x_j^h) \Big)^T.


Fig. A.7. Comparison of KDA and KFDA on different isotropic kernels. (a) KDA on Gaussian RBF kernel (RBF) and non-normal Gaussian RBF kernel (NN-RBF); (b) KFDA on Gaussian RBF kernel (RBF) and non-normal Gaussian RBF kernel (NN-RBF); (c) KDA vs. KFDA on Gaussian RBF kernel (RBF); (d) KDA vs. KFDA on non-normal Gaussian RBF kernel (NN-RBF).

For clarity, we denote the expansion of \Phi^T m_i^{\Phi} by m_i and the expansion of \Phi^T m_j^{\Phi} by m_j. Thus,

K_B = \sum_{i=1}^{c-1} \sum_{j=i+1}^{c} \frac{N_i N_j}{N^2} w(d_{ij}) (m_i - m_j)(m_i - m_j)^T,

and hence we obtain Eq. (21).

A.2. Derivation of Eq. (26)

Let \Phi = (\Phi(x_1^1), \ldots, \Phi(x_1^{N_1}), \ldots, \Phi(x_c^1), \ldots, \Phi(x_c^{N_c})) and v_i = \Phi w_i (i = 1, \ldots, m).

Thus, v_i = \Phi w_i (i = 1, \ldots, m) constitute the discriminant vectors with respect to the Fisher criterion J(v) in Eq. (18).

Then, the low-dimensional feature representation z = (z_1, \ldots, z_m)^T of the vector x can be calculated as

z = (v_1, \ldots, v_m)^T \Phi(x)
  = (w_1, \ldots, w_m)^T \Phi^T \Phi(x)
  = (w_1, \ldots, w_m)^T \big( \Phi(x)^T \Phi(x_1^1), \ldots, \Phi(x)^T \Phi(x_1^{N_1}), \ldots, \Phi(x)^T \Phi(x_c^1), \ldots, \Phi(x)^T \Phi(x_c^{N_c}) \big)^T
  = (w_1, \ldots, w_m)^T \big( k(x_1^1, x), \ldots, k(x_1^{N_1}, x), \ldots, k(x_c^1, x), \ldots, k(x_c^{N_c}, x) \big)^T.


Table A.1
Best recognition rates (%) and corresponding numbers of features (·) for KPCA, KDA and KFDA on different kernels

Kernel              KPCA          KDA           KFDA, w = d_ij^{-4}   KFDA, w = d_ij^{-8}   KFDA, w = d_ij^{-12}
poly d = 2          69.40 (292)   83.30 (198)   84.55 (55)            86.70 (59)            87.90 (50)
poly d = 1          69.70 (267)   82.45 (199)   84.25 (50)            86.45 (56)            87.35 (59)
c-poly d = 2        69.60 (241)   84.20 (199)   85.20 (86)            87.40 (75)            88.50 (58)
c-poly d = 1        69.60 (196)   82.90 (197)   84.90 (55)            86.85 (53)            87.95 (64)
fp-poly d = 0.8     69.70 (269)   82.95 (199)   84.85 (59)            87.20 (62)            88.15 (53)
fp-poly d = 0.6     69.65 (265)   84.05 (199)   85.30 (57)            87.45 (63)            88.25 (55)
fp-poly d = 0.4     69.65 (267)   84.20 (199)   86.65 (81)            88.20 (59)            89.70 (63)
cfp-poly d = 0.8    69.65 (208)   83.40 (198)   85.30 (55)            87.25 (62)            88.25 (56)
cfp-poly d = 0.6    69.70 (292)   84.55 (199)   86.00 (51)            87.75 (73)            88.60 (51)
cfp-poly d = 0.4    69.70 (292)   85.40 (198)   86.95 (67)            88.75 (59)            90.00 (62)
RBF d = 2           69.60 (290)   84.80 (199)   85.55 (90)            87.85 (55)            88.60 (58)
NN-RBF d = 1.8      69.00 (271)   86.65 (38)    88.85 (32)            90.65 (61)            91.65 (46)
NN-RBF d = 1.6      68.05 (296)   87.80 (27)    89.20 (36)            90.70 (31)            91.50 (46)
NN-RBF d = 1.5      67.90 (297)   87.90 (37)    89.05 (32)            90.05 (38)            90.50 (44)

Table A.2
Average error percentages (%) for KPCA and KDA when compared with KFDA on different kernels

                    KFDA/KPCA                                KFDA/KDA
Kernel              w = d_ij^{-4}  d_ij^{-8}  d_ij^{-12}     w = d_ij^{-4}  d_ij^{-8}  d_ij^{-12}
poly d = 2          52.36          48.32      46.37          88.74          81.79      78.37
poly d = 1          54.50          50.58      48.98          90.08          83.56      80.63
c-poly d = 2        50.39          46.28      44.42          88.90          81.51      78.09
c-poly d = 1        53.51          49.43      47.68          90.56          83.53      80.44
fp-poly d = 0.8     52.82          48.93      47.39          90.47          83.62      80.84
fp-poly d = 0.6     50.72          46.86      45.45          90.30          83.25      80.62
fp-poly d = 0.4     46.66          43.68      42.29          90.93          84.93      81.99
cfp-poly d = 0.8    50.58          48.49      46.85          87.76          83.95      80.95
cfp-poly d = 0.6    49.46          45.94      44.57          90.41          83.82      81.16
cfp-poly d = 0.4    45.29          42.26      40.87          91.07          84.76      81.75
RBF d = 2           48.53          44.49      42.77          88.59          81.10      77.79
NN-RBF d = 1.8      39.64          34.45      32.25          82.66          71.78      67.06
NN-RBF d = 1.6      39.79          34.20      32.48          84.41          73.90      70.00
NN-RBF d = 1.5      39.31          35.61      34.06          90.23          81.71      78.10

Eq. (26) can thus be obtained (Figs. A.1–A.7, Tables A.1 and A.2).

References

[1] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognit. Neurosci. 3 (1) (1991) 71–86.

[2] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 711–720.

[3] D.L. Swets, J.J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18 (8) (1996) 831–836.

[4] C.J. Liu, H. Wechsler, Robust coding schemes for indexing and retrieval from large face databases, IEEE Trans. Image Process. 9 (1) (2000) 132–137.

[5] J. Ye, R. Janardan, C.H. Park, H. Park, An optimization criterion for generalized discriminant analysis on undersampled problems, IEEE Trans. Pattern Anal. Mach. Intell. 26 (8) (2004) 982–994.

[6] J. Ye, Q. Li, A two-stage linear discriminant analysis via QR-decomposition, IEEE Trans. Pattern Anal. Mach. Intell. 27 (6) (2005) 929–941.

[7] H. Cevikalp, M. Neamtu, M. Wilkes, A. Barkana, Discriminative common vectors for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 27 (1) (2005) 4–13.

[8] L.F. Chen, H.Y.M. Liao, M.T. Ko, J.C. Lin, G.J. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition 33 (10) (2000) 1713–1726.

[9] J. Yang, J.Y. Yang, Why can LDA be performed in PCA transformed space?, Pattern Recognition 36 (2) (2003) 563–566.

[10] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition 34 (11) (2001) 2067–2070.

Page 15: Facerecognitionusingakernelfractional-stepdiscriminant ... recognition using a... · We perform extensive comparative studies based on ... nel gives better performance than PCA for

G. Dai et al. / Pattern Recognition 40 (2007) 229–243 243

[11] Y.X. Li, Y.Q. Gao, H. Erdogan, Weighted pairwise scatter to improvelinear discriminant analysis, in: Proceedings of the sixth InternationalConference on Spoken Language Processing, 2000.

[12] R. Lotlikar, R. Kothari, Fractional-step dimensionality reduction,IEEE Trans. Pattern Anal. Mach. Intell. 22 (6) (2000) 623–627.

[13] A.K. Qin, P.N. Suganthan, M. Loog, Uncorrelated heteroscedasticLDA based on the weighted pairwise Chernoff criterion, PatternRecognition 38 (4) (2005) 613–616.

[14] M. Loog, R.P.W. Duin, R. Haeb-Umbach, Multiclass linear dimensionreduction by weighted pairwise Fisher criteria, IEEE Trans. PatternAnal. Mach. Intell. 23 (7) (2001) 762–766.

[15] J.W. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Face recognitionusing LDA-based algorithms, IEEE Trans. Neural Networks 14 (1)(2003) 195–200.

[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.[17] B. Schölkopf, A. Smola, K.R. Müller, Nonlinear component analysis

as a kernel eigenvalue problem, Neural Comput. 10 (1999)1299–1319.

[18] M.H. Yang, N. Ahuja, D. Kriegman, Face recognition using kerneleigenfaces. in: Proceedings of the IEEE International Conference onImage Processing, 2000, pp. 37–40.

[19] B. Moghaddam, Principal manifolds and probabilistic subspaces forvisual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 24 (6)(2002) 780–788.

[20] C.J. Liu, Gabor-based kernel PCA with fractional power polynomialmodels for face recognition, IEEE Trans. Pattern Anal. Mach. Intell.26 (5) (2004) 572–581.

[21] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.R. Müller, Fisherdiscriminant analysis with kernels, in: Y.H. Hu, J. Larsen, E. Wilson,S. Douglas (Eds.), Proceedings of the Neural Networks for SignalProcessing IX, 1999, pp. 41–48.

[22] G. Baudat, F. Anouar, Generalized discriminant analysis using akernel approach, Neural Comput. 12 (2000) 2385–2404.

[23] M.H. Yang, Kernel eigenfaces vs. kernel Fisherfaces: face recognitionusing kernel methods, in: Proceedings of the Fifth IEEE InternationalConference on Automatic Face and Gesture Recognition, May 2002,pp. 215–220.

[24] Q.S. Liu, R. Huang, H.Q. Lu, S.D. Ma, Face recognition usingkernel based Fisher discriminant analysis. in: Proceedings of theFifth IEEE International Conference on Automatic Face and GestureRecognition, May 2002, pp. 187–191.

[25] G. Dai, Y.T. Qian, Kernel generalized nonlinear discriminant analysisalgorithm for pattern recognition, in: Proceedings of the IEEEInternational Conference on Image Processing, 2004, pp. 2697–2700.

[26] Q.S. Liu, H.Q. Lu, S.D. Ma, Improving kernel Fisher discriminantanalysis for face recognition, IEEE Trans. Circuits Syst. VideoTechnol. 14 (1) (2004) 42–49.

[27] J.W. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Face recognitionusing kernel direct discriminant analysis algorithms, IEEE Trans.Neural Networks 14 (1) (2003) 117–126.

[28] J.W. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, J. Wang, Anefficient kernel discriminant analysis method, Pattern Recognition 38(10) (2005) 1788–1790.

[29] J. Yang, A.F. Frangi, J.Y. Yang, D. Zhang, Z. Jin, KPCA plus LDA:a complete kernel Fisher discriminant framework for featureextraction and recognition, IEEE Trans. Pattern Anal. Mach. Intell.27 (2) (2005) 230–244.

[30] S. Yang, S.C. Yan, D. Xu, X.O. Tang, C. Zhang, Fisher+kernelcriterion for discriminant analysis, in: Proceedings of the IEEEComputer Society Conference on Computer Vision and PatternRecognition, June 2005, pp. 197–202.

[31] G. Dai, Y.T. Qian, S. Jia, A kernel fractional-step nonlineardiscriminant analysis for pattern recognition, in: Proceedings of the18th International Conference on Pattern Recognition, vol. 2, August2004, pp. 431–434.

[32] G. Dai, D.Y. Yeung, Nonlinear dimensionality reduction forclassification using kernel weighted subspace method, in: Proceedingsof the IEEE International Conference on Image Processing,September 2005, pp. 838–841.

[33] W.S. Chen, P.C. Yuen, J. Huang, D.Q. Dai, Kernel machine-based one-parameter regularized Fisher discriminant method for facerecognition, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 35 (4)(2005) 659–669.

[34] A.S. Georghiades, P.N. Belhumeur, D.J. Kriegman, From few tomany: Illumination cone models for face recognition under variablelighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6)(2001) 643–660.

[35] A.P.J. Phillips, H. Moon, P.J. Rauss, S. Rizvi, The FERET evaluationmethodology for face recognition algorithms, IEEE Trans. PatternAnal. Mach. Intell. 22 (10) (2000) 1090–1104.

About the Author—GUANG DAI received his B.Eng. degree in mechanical engineering from Dalian University of Technology and M.Phil. degree in Computer Science from Zhejiang University. He is currently a Ph.D. student in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology. His main research interests include semi-supervised learning, nonlinear dimensionality reduction and related applications.

About the Author—DIT-YAN YEUNG received his B.Eng. degree in electrical engineering and M.Phil. degree in computer science from the University of Hong Kong and his Ph.D. degree in computer science from the University of Southern California in Los Angeles. He was an Assistant Professor at the Illinois Institute of Technology in Chicago before he joined the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology, where he is currently an Associate Professor. His current research interests are in machine learning and pattern recognition.

About the Author—YUN-TAO QIAN received his B.E. and M.E. degrees in automatic control from Xi’an Jiaotong University in 1989 and 1992, respectively, and his Ph.D. degree in signal processing from Xidian University in 1996. From 1996 to 1998, he was a postdoctoral fellow at the Northwestern Polytechnical University. Since 1998, he has been with Zhejiang University, where he is currently a Professor in Computer Science. His research interests include machine learning, signal and image processing, and pattern recognition.