Face Recognition Using Total Margin-Based Adaptive Fuzzy Support Vector Machines


IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 1, JANUARY 2007


    Yi-Hung Liu, Member, IEEE, and Yen-Ting Chen

Abstract: This paper presents a new classifier called total margin-based adaptive fuzzy support vector machines (TAF-SVM) that deals with several problems that may occur in support vector machines (SVMs) when applied to face recognition. The proposed TAF-SVM not only solves the overfitting problem resulting from outliers through fuzzification of the penalty, but also corrects the skew of the optimal separating hyperplane caused by very imbalanced data sets by using the different cost algorithm. In addition, by introducing the total margin algorithm to replace the conventional soft margin algorithm, a lower generalization error bound can be obtained. These three functions are embodied in the traditional SVM, and the resulting TAF-SVM is reformulated in both linear and nonlinear cases.

By using two databases, the Chung Yuan Christian University (CYCU) multiview and the facial recognition technology (FERET) face databases, and using the kernel Fisher's discriminant analysis (KFDA) algorithm to extract discriminating face features, experimental results show that the proposed TAF-SVM is superior to SVM in terms of face-recognition accuracy. The results also indicate that the proposed TAF-SVM achieves smaller error variances than SVM over a number of tests, so that better recognition stability is obtained.

Index Terms: Face recognition, kernel Fisher's discriminant analysis (KFDA), support vector machines (SVMs).

    I. INTRODUCTION

MANY computer vision-based systems have become more and more important and attractive in recent years, such as surveillance, automatic access control, and human-robot interaction. Face recognition plays a critical role in those applications. Due to the complicated pattern distribution arising from large variations in facial expressions, facial details, illumination conditions, and viewpoints, the face-recognition task has been considered one of the most difficult pattern-recognition research fields. Recently, various approaches have been proposed, e.g., [3], [5], [12], [15], [16], [22], [23], and [25]-[32]. From these systems, we can conclude that how to extract discriminating features from raw face images and how to accurately classify different people based on these input features are the two keys to the development of reliable and high-accuracy face-recognition systems. This paper aims to propose a new classifier called total margin-based adaptive fuzzy support vector machines (TAF-SVM), which can enhance the performance of support vector machines (SVM) for face recognition. In addition to classifier design, selecting a good feature extractor is also necessary.

Manuscript received July 1, 2005; revised March 1, 2006. This work was supported by the National Science Council of Taiwan, R.O.C., under Grant 93-2212-E-033-011.

The authors are with the Department of Mechanical Engineering, Chung Yuan Christian University, Chung-Li 32023, Taiwan, R.O.C. (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNN.2006.883013


    A. Feature Selection

Principal component analysis (PCA) [12] and Fisher's linear discriminant analysis (FLDA) are widely used linear subspace analysis methods in facial feature extraction. Compared with PCA, FLDA is better able to extract discriminating features since its objective is to maximize the between-class scatter and minimize the within-class scatter. FLDA has been successfully applied to face recognition in [32] and shown to be superior to PCA. Due to their linear nature, however, the capabilities of linear subspace analysis methods are still limited. Motivated by the success of the kernel trick in SVMs [8], [13], Schölkopf et al. [24] proposed kernel PCA (KPCA) by combining PCA with the kernel trick. Since the kernel trick is capable of representing nonlinear relations of input data, KPCA is better than PCA in terms of representation and reconstruction. This has also been evidenced by Kim's work [25], in which KPCA combined with a linear SVM classifier was applied to face recognition.

Another nonlinear subspace analysis method, called generalized discriminant analysis (GDA) or kernel Fisher's discriminant analysis (KFDA), was proposed by Baudat et al. [9]. KFDA first nonlinearly maps input data into a higher dimensional feature space in which FLDA is performed. Recently, several works have shown that KFDA is much more effective than KPCA in face recognition [3], [22], [23]. This is due to the fact that KFDA keeps the nature of FLDA, which is based on the separability maximization criterion, while the unsupervised learning-based KPCA is designed only for pattern representation/reconstruction. Therefore, this paper adopts KFDA as the feature extractor so that the goals of extracting discriminating features and reducing the input dimensionality can both be reached.

    B. Classifier Design

Although KFDA has proven superior for extracting discriminating features, its performance can drop when it meets new inputs that were never considered in the training process, for example, a test face whose viewpoint does not face the camera while the training faces are frontal. Features extracted with KFDA are not invariant to these large changes because KFDA is essentially an appearance-based method. In [3], the authors suggested that a classifier more sophisticated than the nearest neighbor (NN) classifier was still needed, even with the KFDA algorithm employed for multiview face recognition, because the face-pattern distribution can remain nonseparable in the KFDA-based subspace. In other words, a classifier with good generalization ability and minimal empirical risk is necessary to make up for the drawback of the appearance-based feature extractor. On this basis, an SVM can serve as a good classifier candidate.



SVM was proposed by Vapnik et al. [13] and has been successfully applied to various applications such as the unsupervised segmentation of switching dynamics [46], face membership authentication [47], and image fusion [48]. Recently, several works related to face recognition have used SVMs as classifiers and yielded satisfactory results [25]-[31]. In those systems, the SVMs used are regular SVMs. However, some studies not directly related to face recognition have indicated that SVM suffers from several critical problems when applied to certain data types. The first problem is that SVM is very sensitive to outliers, since the penalty weight for every data point is the same [5]-[7]. Second, the class-boundary-skew problem is met when SVM is applied to learning from imbalanced data sets in which the negative data heavily outnumber the positive data [1], [11], [17], [33]. The class boundary, i.e., the optimal separating hyperplane (OSH) learned by SVM, can be skewed towards the positive class. In consequence, the false-negative rate can be very high, making SVM ineffective in identifying the targets that belong to the positive class; this is the class-boundary-skew problem. These two problems limit the performance of SVM. Unfortunately, they also occur in SVM-based face recognition.

In face recognition, for example, a face image with an exaggerated expression may produce an outlier. If the outlier possesses a nonzero slack variable, the soft margin algorithm used in the regular SVM will try to find a hyperplane that corrects this error, and the overfitting problem may follow. The other problem is that SVM was originally designed for binary classification, while face recognition is practically a multiclass classification problem. To extend the binary SVM to multiclass face recognition, most existing systems [25]-[31] used the one-against-all (OAA) method. As far as the computational effort is concerned, OAA may be more efficient than the one-against-one (OAO) strategy. The advantage of OAA over OAO is that we only have to construct one hyperplane for each of the $c$ classes instead of $c(c-1)/2$ pairwise decision functions. This decreases the computational effort roughly by a factor of $(c-1)/2$; in some cases it can be brought down even further [35]. This may be the reason that the authors of [25]-[31] used OAA in their systems, though it has been reported that OAO is better than OAA in terms of classification accuracy [2], [36], [45].

When the OAA method is used, one of the classes is the target (positive) class and the rest of the classes form the negative class for the learning of each OSH. The class-boundary-skew problem therefore occurs. Also, the larger the number of classes becomes, the more imbalanced the training set is when the OAA method is applied.
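To make these counts concrete, here is a minimal sketch (the function name and the toy numbers are illustrative, not from the paper) that computes the number of hyperplanes each strategy needs and the per-OSH training-set imbalance that OAA induces:

```python
def multiclass_svm_stats(c, faces_per_class):
    """Hyperplane counts for OAA/OAO and the per-OSH training-set
    imbalance that OAA induces, assuming c equally sized classes."""
    return {
        "OAA_hyperplanes": c,                       # one OSH per class
        "OAO_hyperplanes": c * (c - 1) // 2,        # one OSH per class pair
        "OAA_positives": faces_per_class,           # the target subject
        "OAA_negatives": (c - 1) * faces_per_class, # all remaining subjects
    }

# A CYCU-like setting: 30 subjects, 21 training faces each -> 29:1 imbalance.
print(multiclass_svm_stats(30, 21))
```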

To remedy these problems when SVM is applied to face recognition, this paper proposes a new classifier called TAF-SVM. TAF-SVM solves the overfitting problem by fuzzifying the training set, which is equivalent to fuzzifying the penalty term [7], [44]. In this manner, training data are no longer treated equally but are treated differently according to their relative importance. Besides, TAF-SVM also embodies the different cost algorithm [11], [33], by which TAF-SVM adapts itself to the imbalanced training set so that the false-negative rate is reduced and the recognition accuracy is enhanced.

Another contribution of this paper is that we replace the soft margin algorithm with the total margin algorithm [4] in TAF-SVM. The total margin algorithm not only considers the errors but also involves the information of correctly classified data points in the construction of the OSH. Compared with the conventional soft margin algorithm used in the regular SVM, a lower generalization error bound can be reached. This facilitates face recognition, since generalization ability plays a very important role in predicting unseen face images. We combine these approaches in TAF-SVM and show that the face-recognition accuracy is significantly improved compared with applying any single approach alone, including the regular SVM.

This paper is organized as follows. Section II presents the KFDA-based feature extraction method. A brief review of SVM is given in Section III; then the problems of applying SVM to face recognition are pointed out in detail, together with the solutions embodied in the TAF-SVM. In Section IV, we reformulate the TAF-SVM in both linear and nonlinear cases. Experimental results are presented and discussed in Section V. Conclusions are drawn in Section VI.

II. FEATURE EXTRACTION VIA KFDA

A face image is first scanned row by row to form a vector $\mathbf{x} \in \mathbb{R}^n$. The training set contains $N$ images out of $c$ subjects, namely $X = \{X_1, X_2, \ldots, X_c\}$ and $N = \sum_{i=1}^{c} n_i$, where $X_i$ is the set of class $i$ and $n_i$ is the cardinality of $X_i$. For KFDA, the within-class scatter $S_w^{\Phi}$ and between-class scatter $S_b^{\Phi}$ in the feature space $F$ are given by

$$S_w^{\Phi} = \sum_{i=1}^{c} \sum_{j=1}^{n_i} \bigl(\Phi(\mathbf{x}_j^i) - \mathbf{m}_i^{\Phi}\bigr)\bigl(\Phi(\mathbf{x}_j^i) - \mathbf{m}_i^{\Phi}\bigr)^{T} \quad (1)$$

$$S_b^{\Phi} = \sum_{i=1}^{c} n_i \bigl(\mathbf{m}_i^{\Phi} - \mathbf{m}^{\Phi}\bigr)\bigl(\mathbf{m}_i^{\Phi} - \mathbf{m}^{\Phi}\bigr)^{T} \quad (2)$$

where $\Phi$ is a nonlinear mapping function that maps the data from the input space to a higher dimensional feature space, $\Phi: \mathbb{R}^n \rightarrow F$, $\mathbf{x}_j^i$ denotes the $j$th face image in the $i$th class, and $\mathbf{m}_i^{\Phi}$ and $\mathbf{m}^{\Phi}$ are the class mean and the total mean of the mapped data, respectively. The mapped data are centered in $F$ [9], [24].

KFDA seeks to find a set of discriminating orthonormal eigenvectors $\mathbf{w}$ for the projection of an input face image by performing FLDA in $F$, in which the between-class scatter is maximized and the within-class scatter is minimized. This is equivalent to solving the following maximization problem:

$$\max_{\mathbf{w}} \; \frac{\mathbf{w}^{T} S_b^{\Phi} \mathbf{w}}{\mathbf{w}^{T} S_w^{\Phi} \mathbf{w}}. \quad (3)$$

Solutions $\mathbf{w}_k$ associated with the largest nonzero eigenvalues must lie in the span of all mapped data; so, for $\mathbf{w}_k$, there exists a normalized expansion coefficient vector $\boldsymbol{\alpha}^k = [\alpha_1^k, \ldots, \alpha_N^k]^T$ such that

$$\mathbf{w}_k = \sum_{p=1}^{N} \alpha_p^k \, \Phi(\mathbf{x}_p). \quad (4)$$


Thus, for a testing face image $\mathbf{x}$, its projection on the $k$th eigenvector is computed by

$$y_k = \mathbf{w}_k \cdot \Phi(\mathbf{x}). \quad (5)$$

We do not need to know the nonlinear mapping $\Phi$ exactly. By using the kernel trick, the projection $y_k$ can be easily obtained by

$$y_k = \sum_{p=1}^{N} \alpha_p^k \, k(\mathbf{x}_p, \mathbf{x}) \quad (6)$$

where the kernel function is defined as the dot product of mapped vectors

$$k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}). \quad (7)$$

The radial basis function (RBF) kernel is used in this paper and is expressed as

$$k(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right) \quad (8)$$

where the width $\sigma$ is specified a priori by the user.

To project a face image into the new coordinates, the eigenvectors associated with the first $q$ largest nonzero eigenvalues are selected to construct the transformation matrix $W = [\mathbf{w}_1, \ldots, \mathbf{w}_q]$, such that the dimensionality of a face image is reduced from $n$ to $q$. To simplify the notation in the following, we let the number of projection vectors $q$ be equal to $c - 1$.
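A minimal sketch of the kernel-trick projection in (6)-(8): it assumes the expansion coefficients (here called alpha, one column per eigenvector) have already been obtained by solving the KFDA eigenproblem, which is not shown; all names are illustrative.

```python
import numpy as np

def rbf_kernel(x, y, sigma):
    """RBF kernel of (8): exp(-||x - y||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kfda_project(x, train_faces, alpha, sigma):
    """Projection of (6): y_k = sum_p alpha[p, k] * k(x_p, x).

    train_faces: (N, n) raw training face vectors
    alpha:       (N, q) expansion coefficients, one column per eigenvector
    """
    k_vec = np.array([rbf_kernel(xp, x, sigma) for xp in train_faces])
    return alpha.T @ k_vec   # q-dimensional feature vector

# Toy usage: 4 "faces" of dimension 6 projected onto q = 2 eigenvectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
A = rng.normal(size=(4, 2))   # placeholder coefficients, not real KFDA output
print(kfda_project(X[0], X, A, sigma=1.0))
```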

III. BASIC IDEAS OF TAF-SVM

    A. Basic Review of SVM

In SVM, the training set is given as $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x}_i$ is the training data and $y_i$ is its class label, being either $+1$ or $-1$. Let $\mathbf{w}$ and $b$ be the weight vector and the bias of the separating hyperplane. The objective of SVM is to find the OSH by maximizing the margin of separation and minimizing the training errors:

$$\text{Minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \xi_i \quad (9)$$

$$\text{Subject to} \quad y_i\bigl(\mathbf{w} \cdot \Phi(\mathbf{x}_i) + b\bigr) \geq 1 - \xi_i, \quad i = 1, \ldots, l \quad (10a)$$

$$\xi_i \geq 0, \quad i = 1, \ldots, l \quad (10b)$$

where $\Phi$ is the nonlinear mapping function which maps the data from the input space into a higher dimensional feature space. The $\xi_i$ are slack variables representing the error measures of the data points. The penalty weight $C$ is a free parameter; it measures the size of the penalties assigned to the errors. Minimizing the first term in (9) is equivalent to maximizing the margin of separation, which is related to minimizing the Vapnik-Chervonenkis (VC) dimension. The formulation of the objective function in (9) is in perfect accord with the structural risk minimization (SRM) principle, by which good generalization ability can be achieved [8].

By introducing the Lagrangian, the primal constrained optimization problem can be solved through its dual form. The predicted class of an unseen data point $\mathbf{x}$ is the output of the decision function

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N_s} \alpha_i y_i \, k(\mathbf{s}_i, \mathbf{x}) + b\right) \quad (11)$$

where the $\alpha_i$ are the nonnegative Lagrange multipliers for the inequality constraints (10a) of the primal problem, the $\mathbf{s}_i$ are the support vectors, for which $\alpha_i > 0$, and $N_s$ is the number of support vectors. The optimal value of $b$ is calculated with the Kuhn-Tucker (KT) complementary conditions.
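For orientation, the standard soft-margin SVM of (9)-(11) with the RBF kernel can be reproduced with an off-the-shelf solver. This minimal sketch uses scikit-learn on synthetic stand-in features (not face data) and includes none of the TAF-SVM extensions introduced below.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for KFDA feature vectors with labels in {-1, +1}.
X, y = make_classification(n_samples=200, n_features=29, random_state=0)
y = 2 * y - 1

# C is the penalty weight of (9); gamma plays the role of 1/(2*sigma^2) in (8).
clf = SVC(kernel="rbf", C=10.0, gamma=0.05).fit(X, y)
print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```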

    B. Basic Ideas of TAF-SVM

1) Dealing With the Overfitting Problem via Fuzzification of the Training Set: One issue in using SVM for face recognition is how to tackle the overfitting problem, since the large variations resulting from facial expressions and viewpoints may produce outliers in the pattern distribution. As shown in previous research [5], [6], SVM is very sensitive to outliers or noise, since the penalty term of SVM treats every data point equally in the training process. This may result in overfitting if one or a few data points have relatively very large values of $\xi_i$. Wang et al. and Huang et al. proposed the fuzzy SVM (FSVM) to deal with the overfitting problem [7], [44], based on the idea that a membership value is assigned to each data point according to its relative importance in its class, so that a less important data point is punished less. To achieve this, the penalty term is redefined in FSVM as the fuzzy penalty $C\sum_i s_i \xi_i$, where $s_i$ is the membership value denoting the relative importance of point $\mathbf{x}_i$ to its own class.

We incorporate the concept of FSVM into the proposed TAF-SVM. The training set is first divided into two sets, the fuzzy positive training set $S^+$ and the fuzzy negative training set $S^-$, denoted by

$$S^+ = \bigl\{(\mathbf{x}_i^+, +1, s_i^+)\bigr\}_{i=1}^{l^+} \quad (12a)$$

$$S^- = \bigl\{(\mathbf{x}_j^-, -1, s_j^-)\bigr\}_{j=1}^{l^-} \quad (12b)$$

where the membership values $\sigma \leq s_i^+ \leq 1$ and $\sigma \leq s_j^- \leq 1$ stand for the relative importance of the points $\mathbf{x}_i^+$ and $\mathbf{x}_j^-$ to the positive class and negative class, respectively. The variable $\sigma$ is a small positive real number. $l^+$ and $l^-$ are the cardinalities of the fuzzy positive training set and the fuzzy negative training set, respectively, and $l = l^+ + l^-$.

2) Adaptation to Imbalanced Face Training Sets via the Different Cost Algorithm: Face recognition is practically a multiclass classification task, while SVM was designed for binary classification. The OAO and OAA methods are two popular ways to realize SVM-based multiclass classification [2]. Based on the pairwise learning framework, the OAO method needs to construct $c(c-1)/2$ OSHs and uses the voting strategy to make final decisions if there are $c$ subjects to be recognized. Compared with the OAO method, the OAA method, by which only $c$ OSHs need to be learned, is more effective in terms of computational effort. Therefore, most existing SVM-based face-recognition systems chose the OAA method to accomplish the multiclass classification task [25]-[31]. However, a critical problem follows, the class-boundary-skew phenomenon, which had never been pointed out in these SVM and OAA method-based face-recognition systems.


When the OAA method is used to learn each OSH for multiclass face recognition, one of the subjects forms the positive class and the rest form the negative class. In this manner, the training faces of the negative class significantly outnumber the training faces of the positive class: the ratio of the size of the negative class to the size of the positive class is $(c-1):1$. A very imbalanced face training set is produced, and the larger the number of subjects, the heavier the imbalance of the face training set.

It has recently been reported that the success of SVM is limited when applied to imbalanced data sets [1], [11], [17], [33], because the OSH is skewed towards the positive class, resulting in the class-boundary-skew phenomenon. To solve this critical problem, several remedies have been proposed, including the oversampling and undersampling techniques [18], combining oversampling with undersampling [19], the synthetic minority oversampling technique (SMOTE) [20], different error cost algorithms [1], [33], the class-boundary-alignment algorithm [17], and SMOTE with the different cost algorithm (SDC) [11].

These methods can be divided into three categories. The methods proposed in [18]-[20] process the data before feeding them into the classifier: the oversampling technique duplicates the positive data by interpolation, while the undersampling technique removes redundant negative data to reduce the imbalance ratio. They are classifier-independent approaches. The second category is the algorithm-based approach [1], [17], [33]. For example, Veropoulos et al. [1] and Lin et al. [33] proposed different cost algorithms suggesting that, by assigning a heavier penalty to the smaller class, the skew of the OSH can be corrected. The third category is the SDC method, which combines SMOTE and the different error cost algorithm [11].

For face recognition, since each training data point carries particular face information, we choose not to use any presampling techniques. Instead, the proposed TAF-SVM adopts the different cost algorithm to adapt to the imbalanced face training sets arising in OAA-based multiclass classification. Another reason for using this algorithm is that it was originally designed to solve the skew problem of SVM. By combining the fuzzy penalty and the different cost algorithm, the proposed fuzzified biased penalties are expressed as

$$C^+ \sum_{i=1}^{l^+} s_i^+ \xi_i^+ + C^- \sum_{j=1}^{l^-} s_j^- \xi_j^- \quad (13)$$

where $C^+$ and $C^-$ are the penalty weights for the errors of the positive class and negative class, respectively, and the slack variables $\xi_i^+$ and $\xi_j^-$ measure the errors of the data belonging to the positive class and the negative class, respectively. By setting $C^+ > C^-$, to meet the central concept of the different cost algorithm, the OSH is pushed farther away from the smaller class.
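An analogous effect can be obtained with off-the-shelf tools: scikit-learn's SVC multiplies C per sample by a class weight and an optional per-sample weight, which mirrors the spirit of the fuzzified biased penalty in (13) (effective penalty C * class_weight[y_i] * s_i), though it is not the TAF-SVM formulation itself. A hedged sketch, with illustrative numbers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced toy set: roughly 90% of samples in the negative class.
X, y = make_classification(n_samples=300, weights=[0.9], random_state=1)
y = 2 * y - 1   # class -1 is the large class here

# Heavier penalty for the smaller positive class (different cost idea);
# per-sample weights in [0.4, 1] play the role of the fuzzy values s_i.
class_weight = {+1: 29.0, -1: 1.0}   # e.g., a 29:1 OAA imbalance
memberships = np.random.default_rng(1).uniform(0.4, 1.0, size=len(y))

clf = SVC(kernel="rbf", C=10.0, gamma=0.05, class_weight=class_weight)
clf.fit(X, y, sample_weight=memberships)
print("training accuracy:", clf.score(X, y))
```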

3) Improvement of the Generalization Error Bound via the Total Margin Algorithm: Since it is impossible to take all face information into consideration, i.e., the available face training samples are always finite and not numerous, the generalization ability of a classifier dominates the prediction accuracy for unseen faces. The soft margin algorithm used in SVM relaxes the measure of the margin by introducing slack variables for the errors. An OSH is found with the maximal margin of separation by maximizing the minimum distance between a few extreme values (the support vectors) and the separating hyperplane. However, using only a few extreme training data points causes a loss of information, because most of the information is contained in the nonextreme data, which form the majority of the training set. Feng et al. proposed the scaled SVM [21], which employs not only the support vectors but also the means of the classes to reduce the generalization error of SVM. However, the face-pattern distribution is generally non-Gaussian and highly nonconvex [3], [22]; that is, the mean of a class may not be very representative. Another approach for improving the generalization error bound, called the total margin algorithm, has been proposed by Yoon et al. [4].

The total margin algorithm extends the soft margin algorithm by introducing extra surplus variables $\rho_i$ for the correctly classified data points. The surplus variable measures the distance between a correctly classified data point and the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = +1$ or $\mathbf{w} \cdot \mathbf{x} + b = -1$, depending on whether the data point belongs to the positive or the negative class. In addition to minimizing the sum of the slack variables (the misclassified data points) while maximizing the margin of separation, as the soft margin algorithm does, the total margin algorithm suggests that the sum of the surplus variables (for the correctly classified data points) should also be maximized simultaneously. Maximizing the sum of the surplus variables is equivalent to maximizing $\sum_i \rho_i$, which in turn is equivalent to minimizing $-\sum_i \rho_i$. Therefore, the total margin-based SVM is formulated as the constrained optimization problem

$$\text{Minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \xi_i - D \sum_{i=1}^{l} \rho_i$$

$$\text{Subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i + \rho_i, \quad \xi_i \geq 0, \quad \rho_i \geq 0, \quad i = 1, \ldots, l \quad (14)$$

where $C$ is the weight for the misclassified data points (the slack variables $\xi_i$) and $D$ is the weight for the correctly classified data points, i.e., the surplus variables $\rho_i$.
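Because (14) is still a convex quadratic program, the linear total margin machine can be prototyped directly with a generic QP solver. The sketch below is one possible reading of (14), not the authors' implementation: it stacks the variables as z = [w, b, xi, rho], assumes D < C so the objective stays bounded, and uses cvxopt.

```python
import numpy as np
from cvxopt import matrix, solvers

def total_margin_svm(X, y, C=10.0, D=0.1):
    """Linear total-margin SVM of (14):
    min 0.5*||w||^2 + C*sum(xi) - D*sum(rho)
    s.t. y_i(w.x_i + b) >= 1 - xi_i + rho_i, xi_i >= 0, rho_i >= 0."""
    l, d = X.shape
    n = d + 1 + 2 * l                        # z = [w, b, xi, rho]
    P = np.zeros((n, n)); P[:d, :d] = np.eye(d)
    q = np.hstack([np.zeros(d + 1), C * np.ones(l), -D * np.ones(l)])
    # margin constraints: -y_i(w.x_i + b) - xi_i + rho_i <= -1
    G1 = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(l), np.eye(l)])
    G2 = np.hstack([np.zeros((l, d + 1)), -np.eye(l), np.zeros((l, l))])  # -xi <= 0
    G3 = np.hstack([np.zeros((l, d + 1)), np.zeros((l, l)), -np.eye(l)])  # -rho <= 0
    G, h = np.vstack([G1, G2, G3]), np.hstack([-np.ones(l), np.zeros(2 * l)])
    solvers.options["show_progress"] = False
    z = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))["x"]).ravel()
    return z[:d], z[d]                       # w, b

# Toy usage: two Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
w, b = total_margin_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```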

From (14), we can see that the construction of the OSH is no longer controlled only by a few extreme data points, most of which may be misclassified, but also by the correctly classified data points, which form the majority of the training set. The advantages are clear. First, the essence of the soft margin-based SVM is to rely only on the set of data points that take extreme values, the so-called support vectors. From the statistics of extreme values, we know that the disadvantage of such an approach is that the information contained in most samples (the nonextreme values) is lost, so such an approach is bound to be less efficient than one that takes the lost information into account [21], i.e., the correctly classified data points. The total margin algorithm can therefore be more efficient and gain better generalization ability than the soft margin algorithm, since the information of all samples is considered in the construction of the OSH. Second, from the objective in (14), minimizing $-\sum_i \rho_i$ implies that the obtained OSH gains more correctly classified data points, because minimizing $-\sum_i \rho_i$ is equivalent to maximizing the sum of the surplus variables. Therefore, in this paper, we adopt the total margin algorithm as one of the bases in the development of TAF-SVM for face recognition.


    Fig. 1. Geometric interpretation of slack variables and surplus variables used in TAF-SVM.


In order to facilitate the reformulation of TAF-SVM in Section IV, the use of surplus variables in combination with the imbalanced penalties is illustrated here. Since TAF-SVM considers both the different cost algorithm and the total margin algorithm, the geometric relationship between the positive/negative slack variables and the positive/negative surplus variables is illustrated in Fig. 1.

In Fig. 1, the white circles and the black circles denote the data points belonging to the positive class and the negative class, respectively. The slack variable $\xi_i^+$ measures the distance between the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = +1$ and a misclassified data point that is supposed to be classified as positive. Conversely, $\xi_j^-$ is the distance from a misclassified negative data point to the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = -1$. The surplus variable $\rho_i^+$ measures the distance between a correctly classified positive data point and the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = +1$, and $\rho_j^-$ measures the distance between a correctly classified negative data point and the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = -1$. All these variables are nonnegative, and at least one of $\xi$ and $\rho$ is zero for any data point. Furthermore, consider the positive training data points, any of which can have one of two classification outcomes: misclassified or correctly classified. Table I summarizes the relationship between the slack variable and the surplus variable according to the classification outcome of a positive point. Notice that the point considered in Table I can be any of the positive training data points, while the point shown in Fig. 1 is just one misclassified positive training data point.

IV. REFORMULATION OF TAF-SVM

In this section, we reformulate the proposed TAF-SVM for the linearly nonseparable and nonlinearly nonseparable cases based on the aforementioned ideas.

TABLE I: INTERPRETATION OF THE RELATIONSHIPS BETWEEN SLACK VARIABLES, SURPLUS VARIABLES, AND THE POINT LOCATIONS, TAKING THE POSITIVE TRAINING DATA POINTS AS EXAMPLE

    A. Linearly Nonseparable Case

The primal problem for the linearly nonseparable case is reformulated as follows:

$$\text{Minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C^+ \sum_{i=1}^{l^+} s_i^+ \xi_i^+ + C^- \sum_{j=1}^{l^-} s_j^- \xi_j^- - D^+ \sum_{i=1}^{l^+} s_i^+ \rho_i^+ - D^- \sum_{j=1}^{l^-} s_j^- \rho_j^- \quad (15)$$

Subject to

$$\mathbf{w} \cdot \mathbf{x}_i^+ + b \geq 1 - \xi_i^+ + \rho_i^+, \quad i = 1, \ldots, l^+ \quad (16a)$$

$$-(\mathbf{w} \cdot \mathbf{x}_j^- + b) \geq 1 - \xi_j^- + \rho_j^-, \quad j = 1, \ldots, l^- \quad (16b)$$

$$\xi_i^+ \geq 0 \quad (16c)$$

$$\xi_j^- \geq 0 \quad (16d)$$

$$\rho_i^+ \geq 0 \quad (16e)$$

$$\rho_j^- \geq 0 \quad (16f)$$

where $C^+$ and $C^-$ are the weights for the positive and negative slack variables, respectively, and $D^+$ and $D^-$ are the weights for the positive and negative surplus variables, respectively. It is difficult to solve this constrained optimization problem directly. Similar to SVM, the primal optimization problem of TAF-SVM is


transformed to the dual form by introducing a set of nonnegative Lagrange multipliers $\alpha_i$, $\tilde{\alpha}_j$, $\beta_i$, $\tilde{\beta}_j$, $\gamma_i$, and $\tilde{\gamma}_j$ for the constraints (16a)-(16f), yielding the Lagrangian

$$\begin{aligned} L = {} & \frac{1}{2}\|\mathbf{w}\|^2 + C^+ \sum_i s_i^+ \xi_i^+ + C^- \sum_j s_j^- \xi_j^- - D^+ \sum_i s_i^+ \rho_i^+ - D^- \sum_j s_j^- \rho_j^- \\ & - \sum_i \alpha_i \bigl[(\mathbf{w} \cdot \mathbf{x}_i^+ + b) - 1 + \xi_i^+ - \rho_i^+\bigr] - \sum_j \tilde{\alpha}_j \bigl[-(\mathbf{w} \cdot \mathbf{x}_j^- + b) - 1 + \xi_j^- - \rho_j^-\bigr] \\ & - \sum_i \beta_i \xi_i^+ - \sum_j \tilde{\beta}_j \xi_j^- - \sum_i \gamma_i \rho_i^+ - \sum_j \tilde{\gamma}_j \rho_j^- \end{aligned} \quad (17)$$

Differentiation with respect to $\mathbf{w}$, $b$, $\xi_i^+$, $\xi_j^-$, $\rho_i^+$, and $\rho_j^-$ yields

$$\mathbf{w} = \sum_i \alpha_i \mathbf{x}_i^+ - \sum_j \tilde{\alpha}_j \mathbf{x}_j^- \quad (18a)$$

$$\sum_i \alpha_i - \sum_j \tilde{\alpha}_j = 0 \quad (18b)$$

$$C^+ s_i^+ - \alpha_i - \beta_i = 0 \quad (18c)$$

$$C^- s_j^- - \tilde{\alpha}_j - \tilde{\beta}_j = 0 \quad (18d)$$

$$\alpha_i - D^+ s_i^+ - \gamma_i = 0 \quad (18e)$$

$$\tilde{\alpha}_j - D^- s_j^- - \tilde{\gamma}_j = 0 \quad (18f)$$

By resubstituting these equations into the primal problem, the dual problem is obtained:

$$\text{Maximize} \quad \sum_i \alpha_i + \sum_j \tilde{\alpha}_j - \frac{1}{2}\Bigl\| \sum_i \alpha_i \mathbf{x}_i^+ - \sum_j \tilde{\alpha}_j \mathbf{x}_j^- \Bigr\|^2 \quad (19)$$

$$\text{Subject to} \quad \sum_i \alpha_i - \sum_j \tilde{\alpha}_j = 0 \quad (20a)$$

$$D^+ s_i^+ \leq \alpha_i \leq C^+ s_i^+, \quad i = 1, \ldots, l^+ \quad (20b)$$

$$D^- s_j^- \leq \tilde{\alpha}_j \leq C^- s_j^-, \quad j = 1, \ldots, l^- \quad (20c)$$

    B. Nonlinearly Nonseparable Case

The dual form for the nonlinearly nonseparable case can be obtained by using the kernel function $k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$, where $\Phi$ is a nonlinear map. The objective is as follows:

$$\text{Maximize} \quad \sum_i \alpha_i + \sum_j \tilde{\alpha}_j - \frac{1}{2}\Bigl[\sum_{i,i'} \alpha_i \alpha_{i'} k(\mathbf{x}_i^+, \mathbf{x}_{i'}^+) - 2\sum_{i,j} \alpha_i \tilde{\alpha}_j k(\mathbf{x}_i^+, \mathbf{x}_j^-) + \sum_{j,j'} \tilde{\alpha}_j \tilde{\alpha}_{j'} k(\mathbf{x}_j^-, \mathbf{x}_{j'}^-)\Bigr] \quad (21)$$

The constraints for this maximization problem are the same as those in the dual form of the linear case, (20a)-(20c). The KT complementary conditions play a key role in the optimality. The KT complementary conditions for the nonlinear TAF-SVM are given by

$$\alpha_i \bigl[(\mathbf{w} \cdot \Phi(\mathbf{x}_i^+) + b) - 1 + \xi_i^+ - \rho_i^+\bigr] = 0 \quad (22a)$$

$$\tilde{\alpha}_j \bigl[-(\mathbf{w} \cdot \Phi(\mathbf{x}_j^-) + b) - 1 + \xi_j^- - \rho_j^-\bigr] = 0 \quad (22b)$$

$$(C^+ s_i^+ - \alpha_i)\,\xi_i^+ = 0 \quad (22c)$$

$$(C^- s_j^- - \tilde{\alpha}_j)\,\xi_j^- = 0 \quad (22d)$$

$$(\alpha_i - D^+ s_i^+)\,\rho_i^+ = 0 \quad (22e)$$

$$(\tilde{\alpha}_j - D^- s_j^-)\,\rho_j^- = 0 \quad (22f)$$

The optimal value of $b$ can be calculated with any data point in the training set satisfying the KT complementary conditions. However, from the numerical perspective, it is better to take the mean value of $b$ resulting from all such data points [14]. Therefore, the optimal value of $b$ is computed by

$$b = \frac{1}{|\tilde{S}^+| + |\tilde{S}^-|} \left[\sum_{\mathbf{x} \in \tilde{S}^+} \bigl(1 - \mathbf{w} \cdot \Phi(\mathbf{x})\bigr) + \sum_{\mathbf{x} \in \tilde{S}^-} \bigl(-1 - \mathbf{w} \cdot \Phi(\mathbf{x})\bigr)\right] \quad (23)$$

where $\mathbf{w} \cdot \Phi(\mathbf{x})$ is evaluated through the kernel expansion of (18a), and $\tilde{S}^+$ and $\tilde{S}^-$ are the subsets of $S^+$ and $S^-$, respectively,

$$\tilde{S}^+ = \bigl\{\mathbf{x}_i^+ : D^+ s_i^+ < \alpha_i < C^+ s_i^+\bigr\} \quad (24a)$$

$$\tilde{S}^- = \bigl\{\mathbf{x}_j^- : D^- s_j^- < \tilde{\alpha}_j < C^- s_j^-\bigr\} \quad (24b)$$

For an unseen data point $\mathbf{x}$, its predicted class is the output of the decision function

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_i \alpha_i k(\mathbf{x}_i^+, \mathbf{x}) - \sum_j \tilde{\alpha}_j k(\mathbf{x}_j^-, \mathbf{x}) + b\right) \quad (25)$$
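Given dual multipliers from any QP solver, prediction reduces to kernel evaluations against the training data. A minimal sketch of (23) and (25) under the sign conventions above; the function and argument names (alpha_p, alpha_n, and so on) are illustrative.

```python
import numpy as np

def w_phi(x, Xp, Xn, alpha_p, alpha_n, kernel):
    """Kernel expansion of w.phi(x) from (18a)."""
    return (sum(a * kernel(xp, x) for a, xp in zip(alpha_p, Xp))
            - sum(a * kernel(xn, x) for a, xn in zip(alpha_n, Xn)))

def bias(margin_pos, margin_neg, Xp, Xn, alpha_p, alpha_n, kernel):
    """Bias of (23): average over margin points of both classes."""
    vals = [1.0 - w_phi(x, Xp, Xn, alpha_p, alpha_n, kernel) for x in margin_pos]
    vals += [-1.0 - w_phi(x, Xp, Xn, alpha_p, alpha_n, kernel) for x in margin_neg]
    return float(np.mean(vals))

def predict(x, Xp, Xn, alpha_p, alpha_n, b, kernel):
    """Decision function of (25)."""
    return int(np.sign(w_phi(x, Xp, Xn, alpha_p, alpha_n, kernel) + b))

# Toy usage with an RBF kernel and made-up multipliers.
rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)
Xp, Xn = np.array([[1.0, 1.0]]), np.array([[-1.0, -1.0]])
ap, an = np.array([0.5]), np.array([0.5])
b = bias(Xp, Xn, Xp, Xn, ap, an, rbf)
print(predict(np.array([0.8, 1.2]), Xp, Xn, ap, an, b, rbf))   # -> 1
```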

According to the formulation of the TAF-SVM, three main properties are discussed and summarized as follows.

1) From an inspection of the constraints in the dual form, we can see that the Lagrange multipliers ($\alpha_i$ and $\tilde{\alpha}_j$) are bounded above (by $C^+ s_i^+$ and $C^- s_j^-$) and below (by $D^+ s_i^+$ and $D^- s_j^-$). Therefore, according to (20b) and (20c), all training data are support vectors for TAF-SVM, since every data point has a nonzero multiplier; this reflects the role of the total margin algorithm. On the contrary, in the soft margin-based SVM, the OSH is constructed only from the few data points whose multipliers satisfy $\alpha_i > 0$.

2) In SVM, the $\alpha_i$ are bounded by the range $0 \leq \alpha_i \leq C$. For all data points, the feasible regions are fixed once $C$ is chosen. In TAF-SVM, the feasible region is dynamic, since the upper and lower bounds differ for every data point: the bounds of the feasible region are functions of the assigned membership values.


This means that a less important data point has a narrower feasible region.

Another question is how to fuzzify the training set efficiently. Basically, the rule for assigning proper membership values can depend on the relative importance of the data points to their own classes. Therefore, for a positive data point, its membership value can be calculated with the membership function

$$s_i^+ = \begin{cases} 1 - \dfrac{d(\mathbf{x}_i^+, \mathbf{m}^+)}{\max_p d(\mathbf{x}_p^+, \mathbf{m}^+) + \delta}, & \text{if } 1 - \dfrac{d(\mathbf{x}_i^+, \mathbf{m}^+)}{\max_p d(\mathbf{x}_p^+, \mathbf{m}^+) + \delta} \geq \sigma \\ \sigma, & \text{otherwise} \end{cases} \quad (26)$$

where $d(\cdot,\cdot)$ denotes the Euclidean distance, $\delta$ is a small positive constant that keeps the denominator positive, the lower bound $\sigma$ is a nonnegative small real number and is user-defined, and $\mathbf{m}^+$ is the mean of all data points in $S^+$. The membership values of all the fuzzified positive training data are thus bounded in $[\sigma, 1]$. The same procedure is used for the fuzzification of the negative data, with the mean calculated from all the negative data. (A small code sketch of this rule follows the list.)

3) In SVM, only one free parameter $C$ has to be adjusted. A more complex procedure may be expected for TAF-SVM, since there are four free parameters to adjust: $C^+$, $C^-$, $D^+$, and $D^-$. However, the adjustment process can be simplified using some relationships. First, the inequality constraints in (20b) and (20c) require that the two inequalities $C^+ \geq D^+$ and $C^- \geq D^-$ hold. Second, based on the concept of adaptation to imbalanced data sets, the relationships $C^+ > C^-$ and $D^+ > D^-$ are required if the size of the positive class is smaller than that of the negative class. Two ratios are defined to simplify the adjustment of these parameters:

$$r_1 = \frac{C^+}{C^-} = \frac{D^+}{D^-} \quad (27)$$

$$r_2 = \frac{C^+}{D^+} = \frac{C^-}{D^-} \quad (28)$$

By fixing any one of the four parameters $C^+$, $C^-$, $D^+$, and $D^-$, and setting the values of $r_1$ and $r_2$, the other three parameters can be obtained directly (see the sketch after this list). In the case of $r_1 = 1$, no adaptation effort is made for the imbalanced case. Furthermore, TAF-SVM becomes the standard SVM if $r_1$ equals 1 and $r_2$ goes to infinity ($C^+ = C^-$, $D^+ = D^- = 0$, with $C^+$ finite) and the membership values are all set to 1. Note that a very small positive number is added to the denominator to avoid division by zero when $D^+ = 0$. Also, when $r_1 = 1$, $r_2 = \infty$, and $\sigma \leq s_i < 1$, the proposed TAF-SVM becomes the FSVM. Therefore, SVM and FSVM can be viewed as two particular cases of the proposed TAF-SVM.
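Two small sketches tied to the list above. The first implements the distance-to-mean membership rule in the spirit of (26) (the exact normalization is an assumption); the second derives the four penalty weights from one fixed weight and the ratios (27) and (28), reproducing the parameter tuples reported in Section V.

```python
import numpy as np

def memberships(X_class, sigma=0.4, delta=1e-6):
    """Membership values in [sigma, 1] from distance to the class mean,
    in the spirit of (26); the normalization by the largest distance
    is an assumed form, not verbatim from the paper."""
    m = X_class.mean(axis=0)
    d = np.linalg.norm(X_class - m, axis=1)
    return np.maximum(1.0 - d / (d.max() + delta), sigma)

def taf_svm_weights(C_minus=10.0, r1=20.0, r2=4.0):
    """Derive (C+, C-, D+, D-) from the fixed C- and the ratios
    r1 = C+/C- = D+/D- (27) and r2 = C+/D+ = C-/D- (28)."""
    C_plus = r1 * C_minus
    return C_plus, C_minus, C_plus / r2, C_minus / r2

X = np.vstack([np.random.default_rng(0).normal(0, 1, (20, 2)), [[8.0, 8.0]]])
print(memberships(X).round(2))           # the far outlier gets the floor value
print(taf_svm_weights(10.0, 20.0, 4.0))  # -> (200.0, 10.0, 50.0, 2.5)
print(taf_svm_weights(10.0, 30.0, 6.0))  # -> (300.0, 10.0, 50.0, 1.666...)
```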

    V. EXPERIMENTAL RESULTS

    A. Experiment on CYCU Face Database

Here, we present a set of experiments carried out on the Chung Yuan Christian University (CYCU) multiview face database [10]. The CYCU multiview face database contains 3150 face images of 30 subjects and involves variations in facial expression, viewpoint, and sex. Each image is a 112 × 92 24-bit pixel matrix. The viewpoint is governed by two parameters, the rotation angle and the tilt angle, with seven rotation angles and three tilt angles (rotation in {0°, ±15°, ±30°, ±45°} and tilt in {0°, ±15°}). For each viewpoint of each subject we prepared five face images with different facial expressions; therefore, each subject has 105 face images. Fig. 2 shows the 30 subjects in this database. All images contain face contours and hair. The color of the background is nearly white and the lighting condition is controlled to be uniform. Fig. 3 shows the collected 21 images, covering the 21 different viewpoints, of one subject.

Fig. 2. Thirty subjects of the CYCU multiview face database.

Fig. 3. Collected 21 face images of one of the 30 subjects in the CYCU multiview face database.

1) Analysis of Face-Pattern Distributions in the KFDA-Based Subspace: Two cases are analyzed in this subsection. Before the experiments, all color face images are converted to gray-level images, and the contrast of each gray image is enhanced by histogram equalization. All gray-level images of 112 × 92 pixels are resized to 28 × 23 pixel matrices before the feature extraction. In addition, in both Case 1 and Case 2 the features extracted by KFDA are the first two most discriminating features.

Case 1: Fig. 4 depicts the distribution of the face patterns of five subjects randomly chosen from the database in the KFDA-based subspaces. Each subject contributes 21 patterns covering the whole viewpoint range, i.e., each of the 21 viewpoints provides one image per person. Two observations follow. First, in Fig. 4(a) we observe that there exists an outlier for one of the classes. This outlier is very far from the main body of its class and falls into another class. The SVM will suffer from the overfitting problem when it is applied to the binary classification problem between these two classes. Second, according to the distribution shown in Fig. 4(b), there exists an overlap between three of the classes.


Fig. 4. Distribution of the 105 face patterns of the five subjects in KFDA-based subspaces with the RBF kernel, for two kernel-width settings (a) and (b).

To identify one of the overlapped classes by using the OAA method, the imbalance ratio of the positive class to the negative class is 1:4. The OSH learned by the traditional SVM will be skewed toward the positive class. Consequently, the number of false negatives will increase and the recognition accuracy will decrease.

Case 2: Most face-recognition studies evaluate their systems by changing face conditions such as expression, viewpoint, and illumination. Accordingly, several well-known databases are widely used, such as the Olivetti Research Laboratory (ORL) face database, the University of Manchester Institute of Science and Technology (UMIST) multiview face database, and the Yale face database. The three face databases consider different conditions. For example, the ORL database contains 400 face images in which all frontal face images have different facial expressions and details (glasses or no glasses). The UMIST database consists of 575 face images covering a wide range of poses, from one-sided profile to frontal views, as well as expressions. The Yale database contains 165 frontal face images with different expressions, illumination conditions, and small occlusions (glasses). These databases cover most of the conditions crucial for evaluating face-recognition systems. However, all the faces in these databases are bounded well; that is, they do not take variations of face contour and hair into consideration.

The face-recognition task follows the face-detection task. For example, the SVM-based face-detection system [34] searches for faces in an image by using size-varying windows to scan the image and perform the face/nonface classification. Once faces are detected, they are framed by rectangular bounding boxes of different sizes and then sent to the face-recognition system. The framed face images detected by different face-detection systems (or even the same one) may contain the whole hair and face contour, just partial hair and contour, or neither. Most existing face-recognition systems do not evaluate this factor, since all the images in the three databases are full faces containing both hair and contours.

Er et al. conducted an interesting experiment in their work [16]. They evaluated their system (discrete cosine transform (DCT) + FLDA + RBF neural networks) on two groups of data, one being full faces from the Yale database and the other closely cropped faces from the same database, and achieved error rates of 1.8% and 4.8%, respectively, which were lower than those of other approaches such as eigenfaces and Fisherfaces. However, neither group considers both full faces and cropped faces at the same time. Nevertheless, comparing the two results shows that the information of face contour and hair style is important for face recognition. This study evaluates the proposed TAF-SVM by letting this information be a variable.

In this paper, we assume that an input face can be a full face or a partially cropped face, in order to fulfill the requirement that, in addition to variations resulting from different expressions and viewpoints, a robust and reliable face-recognition system should also withstand the variation due to size-varying face-bounding boxes. This case investigates the influence of changes in the size of the face-bounding box on the face-pattern distribution in the KFDA-based subspace. To achieve this goal, each face image is cropped to a new face image with two integer cropping sizes, one for the rows and one for the columns. This procedure is called face cutting and is illustrated in Fig. 5, where the operator round(.) forces the derived cropping size to become an integer. The dotted white rectangle is the face-bounding box. After the cutting, the cropped image is resized to a new 112 × 92 image. In this manner, an input face may contain the whole face contour or just part of it.
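A minimal sketch of one plausible reading of the face-cutting step (the exact border rule and rounding detail are not fully recoverable from this excerpt, so this is an assumed form): strip a border of delta rows plus a proportional, rounded number of columns, then resize back to 112 x 92.

```python
import numpy as np

def face_cut(img, delta, out_shape=(112, 92)):
    """Crop a border of delta rows and round(delta * w / h) columns,
    then resize back to out_shape (nearest-neighbour, pure numpy)."""
    h, w = img.shape
    dc = int(round(delta * w / h))   # column cropping size, kept an integer
    cropped = img[delta:h - delta, dc:w - dc]
    rows = np.linspace(0, cropped.shape[0] - 1, out_shape[0]).astype(int)
    cols = np.linspace(0, cropped.shape[1] - 1, out_shape[1]).astype(int)
    return cropped[np.ix_(rows, cols)]

img = np.random.default_rng(0).integers(0, 256, (112, 92))
print(face_cut(img, delta=7).shape)   # -> (112, 92)
```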

The face-cutting procedure is applied to all 105 face images used in Case 1, with the cropping size randomly chosen within the range [0, 7]. Fig. 6(a) and (b) shows the distribution of these 105 randomly cropped face patterns in the KFDA-based subspaces. Compared with the distribution depicted in Fig. 4, face images with different cropping sizes significantly increase the interclass ambiguity and decrease the intraclass compactness. Although the KFDA-based feature extraction method tries to maximize the between-class separability and the within-class compactness, it cannot absorb the large variations caused by viewpoint and the size-varying face-bounding box.


Fig. 5. Face-cutting procedure and the cropped face images with different cropping sizes. The dotted white rectangle is the face-bounding box.

Fig. 6. Distribution of the randomly cropped 105 face patterns of five subjects in the KFDA-based subspaces, for two kernel-width settings (a) and (b).

Therefore, a robust classifier is still needed even though the robust feature extraction method KFDA has been used; here, a robust classifier means one better than the NN classifier. Besides, because outliers appear in the distribution and imbalanced training sets arise when the OAA method is employed, a classifier more robust than SVM is also needed. This, too, motivates this paper.

2) Sensitivity Test of TAF-SVM: The goal of this experiment is to test the sensitivity of TAF-SVM to its intrinsic parameters, including the penalty weights $C^+$ and $C^-$, the weights for the surplus variables $D^+$ and $D^-$, and the lower bound $\sigma$ used in the fuzzy membership function. To make the following experiments more constructive, three conditions containing different criteria for the collection of the training set and the test set are defined as follows.

Condition 1: For the training set, each subject offers 21 face images picked from all 21 angle combinations; each angle combination randomly offers one image per subject. Therefore, the training set contains 630 face images of the 30 subjects. The test set is collected by the same procedure, and the two sets have no overlap.

Condition 2: Each set is provided with 21 face images from every subject, so each set has 630 face images in total. The face images, unlike in Condition 1, are picked randomly from confined angle combinations. For the training set, only rotations of ±30° and 0° and tilts of ±15° and 0° are considered. For the test set, only combinations of rotations of ±45° and ±15° with tilts of ±15° and 0° are picked. Chosen face images are not picked again.

Condition 3: Face images are randomly chosen from all 3150 face images in the CYCU face database for the training set and the test set. Each of the 30 subjects provides 21 face images for each set, and there is no overlap between the two sets. Chosen face images are not picked again.

As far as the viewpoint of the face is concerned, the degrees of uncertainty of the three data-collection criteria are clearly different: Condition 2 has the highest degree of uncertainty among the three, while Condition 1 has the lowest. Also, the face-cutting procedure is applied to all face images before the feature extraction, with a random cropping size in the range [0, 7].

Before extracting features via the KFDA method, the optimal RBF kernel parameter, which results in the minimum error rate, is found by searching the variation range of $2\sigma^2$ from 1 to 10. The error rate is the average error rate over ten runs; whenever the next run is performed, the training set and the test set are reprepared based on Condition 3. Following the method used in [3] and [15], the average error rate is computed by

$$\bar{E} = \frac{1}{R \cdot n_t} \sum_{i=1}^{R} e_i \quad (29)$$

where $R$ is the number of runs, $e_i$ is the number of errors in the $i$th run, and $n_t$ is the total number of testing face images per run. Note that the total testing face images means the training set during the parameter-selection process and classifier training, while in the comparison of different classification systems (online testing) it means the test set. After trial and error, the optimal KFDA parameter was found to be $2\sigma^2 = 5.6 \times 10$, which resulted in the lowest average error rate measured from the ten training sets and also in a low average error rate of 11.8% measured from the ten test sets using the NN classifier.


With the optimal kernel value, a total of 29 discriminating feature components is extracted from each face image by the KFDA method.
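The averaging in (29) is simple enough to state as code; a one-function sketch with made-up run counts:

```python
def average_error_rate(errors_per_run, n_test):
    """Average error rate of (29): total errors over R runs divided by
    R times the number of test images per run."""
    R = len(errors_per_run)
    return sum(errors_per_run) / (R * n_test)

# e.g., three runs with 630 test images each (illustrative numbers)
print(average_error_rate([20, 25, 18], n_test=630))   # -> 0.0333...
```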

a) Sensitivity test on $C^+$, $C^-$, $D^+$, and $D^-$: The first experiment tests the sensitivity of TAF-SVM to the four parameters $C^+$, $C^-$, $D^+$, and $D^-$, which have been condensed into the ratios $r_1$ and $r_2$ defined by (27) and (28). The values of $r_1$ and $r_2$ for this experiment are {1, 10, 20, 30, 40} and {2, 4, 6, 8, $\infty$}, respectively. The value of $C^-$ is set to 10 and kept fixed. The lower bound $\sigma$ of the fuzzy membership function is fixed to 0.4. The RBF kernel is also used for TAF-SVM, with its kernel parameter set to 0.05. The experimental results for the three conditions are shown in Fig. 7.

The lowest error rates for the three conditions are 2.10%, 7.10%, and 4.80%, obtained when the pairs $(r_1, r_2)$ equal (20, 4), (30, 4), and (30, 6), respectively. The corresponding values of $(C^+, C^-, D^+, D^-)$ are (200, 10, 50, 2.5), (300, 10, 75, 2.5), and (300, 10, 50, 1.66), respectively. In addition, the results indicate that the error rates can be reduced by changing the ratios $r_1$ and $r_2$. In the following, we take the results of Condition 3, shown in Fig. 7(c), as examples to show how the performance of TAF-SVM is affected under different $r_1$ and $r_2$, in three steps.

Step 1) In Fig. 7(c), the largest average error rate, 7.76%, occurs at the position $(r_1, r_2) = (1, \infty)$, which means that $(C^+, C^-, D^+, D^-) = (10, 10, 0, 0)$. At this position, the different cost algorithm is disabled because the penalties for the positive and the negative classes are the same ($C^+ = C^- = 10$). The total margin algorithm is also disabled because $D^+ = D^- = 0$. Therefore, only the fuzzy penalty is used in TAF-SVM.

Step 2) As the position moves from $(1, \infty)$ to $(30, \infty)$, the average error rate decreases from 7.76% to 5.94%. At the position $(30, \infty)$, the different cost algorithm is enabled because the penalties for the positive and the negative classes become different: $C^+ = 300$ and $C^- = 10$. With $C^+ = 300$ and $C^- = 10$, the penalty for the positive class is much larger than that for the negative class. This meets the rule of the different cost algorithm: assign a heavier penalty to the smaller class. In the experiments of Fig. 7(c), the number of negative training data is 29 times the number of positive training data in the learning of each OSH by OAA TAF-SVM, because there are 30 subjects in the CYCU database to be classified. On the other hand, the total margin algorithm is still disabled at this position because $D^+ = D^- = 0$. Comparing the analysis in Step 1) with that in Step 2), we see that the error rate is reduced by the application of the different cost algorithm.

Step 3) As the position moves from $(30, \infty)$ to $(30, 6)$, the average error rate decreases from 5.94% to 4.8%. At the position (30, 6), not only is the different cost algorithm enabled ($C^+ = 300$ and $C^- = 10$), but the total margin algorithm is enabled as well ($D^+ = 50$ and $D^- = 1.66$). Comparing the analyses in Steps 1) and 2) with that in Step 3), we see that the error rate can be further reduced by involving the total margin algorithm after the use of the different cost algorithm.

Fig. 7. Comparisons of average error rates among different pairs of $(r_1, r_2)$ used in TAF-SVM under different data collection conditions.


TABLE III: PARAMETER SETTING IN KFDA, SVM, AND TAF-SVM

TABLE IV: COMPARISON OF THE AVERAGE ERROR RATE AND STANDARD DEVIATION (SD) OVER TEN RUNS BETWEEN TAF-SVM AND OTHER SYSTEMS

TABLE V: COMPARISON OF COMPUTATION TIME AMONG DIFFERENT SYSTEMS

based SVM, the improvement is very limited (from 11.41% to 9.02%), while it is significant using OAO-based SVM (from 11.41% to 7.55%). It is not surprising that the difference in recognition accuracy between OAO- and OAA-based SVM is so apparent, since the OAO method does not produce imbalanced data sets while the OAA method does. As a matter of fact, Lin et al. [2] have indicated that the OAO method is more suitable for practical use than the OAA method in terms of classification accuracy, according to their experiments on various popular data sets. In this paper, we also suggest that the OAO method is better than the OAA method for SVM-based face recognition. However, this suggestion holds only for face-recognition accuracy, because the OAO method takes more recognition time than the OAA method (see Table V).

    recognition time than the OAA method (seeTable V).We conduct this experiment mainly based on the reason that

    a robust face classifier should be able to maintain good sta-

    bility while expecting that it can achieve the best recognition

    accuracy under the training with different training sets. The re-

    sults of Table IV indicate that TAF-SVM outperforms OAO-

    and OAA-based SVM. This is due to the fact that TAF-SVM

    not only can adapt to the imbalanced face data sets but also

    can avoid the overfitting problem and improve the generaliza-

    tion error bound. In addition, the system KFDA TAF-SVM

    achieves the lowest standard deviation (0.57) compared with the

    system KFDA SVM. It indicates that the TAF-SVM is more

    stable than SVM.

4) Computational Complexity: Our experiments were run on an Intel Xeon 3.0-GHz workstation (1 MB L2 cache, DDR2 2.0 GB SDRAM, 800-MHz front-side bus, and 10 000-rpm SCSI hard disk). The training program was implemented in Matlab, since Matlab easily solves the eigenvalue problem for KFDA and the constrained optimization problems for both SVM and TAF-SVM. After training, we saved the expansion coefficients for KFDA and the information indispensable for recognition, including the support vectors, the Lagrange multipliers, and the optimal bias for each OSH. The test program was implemented in C++, since the recognition process executes only simple calculations such as dot products of vectors, their linear combinations, and decision making. We recorded the computation times of the first run of the last experiment and list them in Table V.

Most of the training time was spent solving the constrained optimization problem. The larger the number of training data, the more time the training process needed; moreover, the training time grew faster than proportionally with the size of the training set. The total training time of OAA-based SVM (2429.7 s for 30 OSHs) is much larger than that of OAO-based SVM (234.8 s for 435 OSHs), as shown in Table V. Moreover, we found that the training time of the proposed TAF-SVM was smaller than that of OAA-based SVM. This may be because, for TAF-SVM, the feasible regions are functions of membership values less than one; that is, most data have comparatively smaller feasible regions in the search for the Lagrange multipliers than in SVM. Although the training is time-consuming, the biggest concern for face recognition is the online recognition speed.

In the training of an OSH for the OAA-based SVM, we noticed that the percentage of obtained support vectors is around 20%-25%. The proposed TAF-SVM, for which the percentage of support vectors is 100%, takes 4.75 (76.4/16.1) times the recognition time of OAA-based SVM, as listed in Table V. The recognition time for TAF-SVM is around 0.1231 s per subject, a speed acceptable for security and visual-surveillance tasks.

    B. Experiment on FERET Database

The facial recognition technology (FERET) face database, obtained from the FERET program [37], [38], [43], is commonly used for testing state-of-the-art face-recognition algorithms. In the following, the proposed TAF-SVM is tested on a subset of this database.

This subset contains 1400 images of 200 subjects, consisting of the images whose names are marked with the two-character strings ba, bj, bk, be, bf, bd, and bg. Each subject has seven images involving variations in illumination, pose, and facial expression. In our experiment, each original image is cropped so that it contains only the face and hair, then resized to 80 × 80 pixels and preprocessed by histogram equalization. Some images of one of the 200 subjects are shown in Fig. 9. Six of the seven images of each subject are randomly chosen for training, and the remaining one is used for testing. The training set size is 1200 and the test set size is 200. We run this process 20 times and obtain 20 different training and test sets; in each run there is no overlap between the training set and the test set.


Fig. 9. First row: images of one of the 200 subjects in the FERET database. Second row: cropped images of those in the first row after histogram equalization.

TABLE VI: COMPARISON OF AVERAGE ERROR RATE AND SD AMONG DIFFERENT SYSTEMS WITH KFDA FEATURE EXTRACTION ON THE FERET DATABASE

1) Performance Test After the KFDA Feature Extraction: In this experiment, the face images go through the KFDA feature extraction before classification. Therefore, we first find the optimal parameters of KFDA.

a) Optimal parameter selection: The first step is to find the optimal parameters of KFDA for the experiment on the subset of the FERET database. Only two parameters need to be determined, namely the RBF kernel parameter and the number of chosen eigenvectors $q$. The optimal parameter pair is the one, over wide ranges of both parameters, resulting in the lowest average error rate, where each average error rate is computed from the 20 error rates under a specific parameter pair with the errors measured by an NN classifier. In the sequel, the optimal parameters $2\sigma^2 = 6.1 \times 10$ and $q = 199$ are found for KFDA. The training sets are then projected onto the 199 eigenvectors, giving 20 projected training sets.

The projected training sets are then used to find the optimal parameters for SVM and TAF-SVM, respectively; the RBF kernel is again used in the classifiers. Similar to the search process for KFDA's optimal parameters, the optimal parameters of each classifier are those that result in the lowest average error rate over wide search ranges of the classifier's parameters.
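The selection loop is the same for the feature extractor and the classifiers, so one sketch suffices. In the code below, `kfda_fit_project` is a hypothetical stand-in for the paper's KFDA implementation (it fits on one training split and returns projected train/test features for a given parameter pair); everything else is standard NumPy:

```python
import numpy as np
from itertools import product

def nn_error(train_f, train_y, test_f, test_y):
    """Error rate of a nearest-neighbour classifier on projected features."""
    d2 = ((test_f[:, None, :] - train_f[None, :, :]) ** 2).sum(axis=-1)
    pred = train_y[d2.argmin(axis=1)]
    return float((pred != test_y).mean())

def select_kfda_params(splits, sigmas, eig_counts, kfda_fit_project):
    """Return the (sigma, m) pair with the lowest NN error averaged
    over the 20 runs.  `kfda_fit_project(tr, te, sigma, m)` must
    return (train_f, train_y, test_f, test_y) in the m-dimensional
    KFDA subspace; it is a placeholder, not part of the paper."""
    best_pair, best_err = None, np.inf
    for sigma, m in product(sigmas, eig_counts):
        errs = [nn_error(*kfda_fit_project(tr, te, sigma, m))
                for tr, te in splits]
        if np.mean(errs) < best_err:
            best_pair, best_err = (sigma, m), float(np.mean(errs))
    return best_pair, best_err
```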

b) Comparisons among different systems: After selecting the optimal parameters for each classifier, we test and compare the classification accuracies by feeding the 20 test sets into the different systems. The experimental results are listed in Table VI.

First, by comparing the results in Tables VI and IV, we can observe that the error rates obtained from the FERET database are much larger than those obtained from the CYCU database. This may be due to the following facts: 1) the number of available training data from the FERET database (six per subject) is much smaller than that from the CYCU database (21 per subject); 2) larger variations exist in the FERET database; and 3) the number of subjects in the subset of the FERET database (200 subjects) is much larger than that of the CYCU database (30 subjects). Nevertheless, the experimental results on the FERET database show that TAF-SVM still performs better than SVM. Based on the results in Table VI, TAF-SVM outperforms SVM (OAO) and SVM (OAA) in average error rate by 3.21% and 5.10%, respectively. Additionally, TAF-SVM achieves the lowest variance among these systems, which indicates that TAF-SVM keeps better stability than SVM when facing different unseen patterns.

It is worth noting that although KFDA extracts discriminating features from the original image raw data by maximizing between-class separability, this does not mean that the class distribution in the KFDA-based subspace will be separable. This is evidenced by the result in Table VI: the error rate of KFDA + NN is 22.18%. This result implies that, even with the optimal parameters of KFDA, numerous errors still exist between classes; that is, the class distribution in the KFDA-based subspace is still nonseparable.

This may result from the following two reasons. First, the face patterns involve too large a variation for KFDA to separate the classes well. Second, in this paper, the KFDA used for the subset of the FERET database and for the CYCU database actually suffers from the so-called small sample size (SSS) problem [3], because in our experiment the number of training patterns is smaller than the dimensionality of the input training patterns. For example, in our training sets, each pattern (an 80 × 80 pixel image) is a 6400-dimensional vector, while the number of available training patterns is only 1200 (200 subjects, six per subject). The SSS problem also occurs in the KFDA used for the experiment on the CYCU database, where each training pattern (a 28 × 23 pixel image) is a 644-dimensional vector, while the number of training patterns in each training set is 630 (21 per subject). Since the KFDA used for the two databases suffers from the SSS problem, the within-class scatter matrix in (1) is degenerate because it contains a nontrivial null space.

To solve the SSS problem in numerical computation, this paper adds a matrix μI, where I is the identity matrix and μ is a small positive number, to the inner product kernel matrix when finding the expansion coefficient vectors for the data projection. This method is very simple and was suggested by Mika et al. [39], [40]. However, it discards the discriminant information contained in the null space of the within-class scatter matrix, and the discarded information may contain the most significant discriminant information [3], [41], [42]. This means that even if the most discriminant eigenvectors have been used for the data projection in our experiments, these eigenvectors are actually not the most discriminant. Hence, although KFDA is employed in our work, the face-pattern distribution is still nonseparable, as in the face-pattern distributions shown in Figs. 4(b) and 6.
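A minimal sketch of this regularization follows (our notation, not the paper's: M and N denote the kernelized between- and within-class matrices in the sense of Mika et al. [39], and μ is the small constant mentioned above):

```python
import numpy as np
from scipy.linalg import eigh

def kfda_coefficients(M, N, mu=1e-3, m=199):
    """Expansion coefficients for KFDA under the SSS problem.

    N (the kernelized within-class matrix) is singular when the
    training set is small, so a scaled identity mu*I is added to
    make it positive definite before solving the generalized
    eigenproblem  M a = lam (N + mu I) a."""
    N_reg = N + mu * np.eye(N.shape[0])
    lam, alpha = eigh(M, N_reg)            # ascending eigenvalues
    order = np.argsort(lam)[::-1]          # most discriminant first
    return alpha[:, order[:m]]             # m leading eigenvectors
```

The trade-off named in the text is visible here: adding μI makes N_reg invertible, but the directions in the null space of N, which may carry the strongest discriminant information, are no longer privileged by the eigenproblem.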

Several more efficient solutions to this SSS problem have recently been proposed, such as kernel direct discriminant analysis (KDDA) [3], a generalization of direct LDA [41], and complete kernel Fisher discriminant analysis (CKFD), which combines kernel PCA and LDA [42].


TABLE VII
COMPARISONS OF AVERAGE ERROR RATE AND SD AMONG DIFFERENT SYSTEMS WITHOUT KFDA FEATURE EXTRACTION ON FERET DATABASE

We expect that the classification accuracy of each system in Tables VI and IV would be improved if, instead of KFDA, KDDA or CKFD were used for the face-feature extraction in this paper.

Moreover, though KFDA tries to minimize the within-class scatter to obtain larger intraclass compactness, this cannot guarantee that no outliers will appear in the KFDA-based subspace. For example, in Fig. 4(a), an outlier still exists in the KFDA-based subspace. Under such a situation, SVM (OAO) and SVM (OAA) may suffer from the overfitting problem, and the classification performance drops. Furthermore, for SVM (OAA), although KFDA has been used for the face-feature extraction, imbalanced training data sets are still unavoidable in the KFDA-based subspace: in the training of an OSH via SVM (OAA), the imbalance ratio of negative training data to positive data is 199:1. Such a large imbalance ratio results in the class-boundary-skew problem for SVM (OAA). This may be why SVM (OAO) always performs better than SVM (OAA), because the ratio of negative training data to positive data is always 1:1 for SVM (OAO) if the class sizes are equal. To sum up, Table VI shows that the proposed TAF-SVM improves the classification performance of SVM (OAO) and SVM (OAA), and this significant improvement should be attributed not only to the use of the fuzzy penalty and the different cost algorithm, but also to the total margin algorithm embedded in TAF-SVM. The different-cost idea is sketched below.
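The different-cost remedy is not specific to TAF-SVM: any soft-margin SVM that accepts per-class penalties can counteract the 199:1 skew. As a hedged illustration only (this is scikit-learn's SVC with an arbitrary gamma, not the paper's TAF-SVM formulation):

```python
from sklearn.svm import SVC

# One OAA subproblem on this subset: 6 positive images vs. 1194
# negatives (199:1).  Raising the minority-class penalty so that
# C+ = 199 * C- discourages the separating hyperplane from
# drifting toward the positive class.
clf = SVC(kernel="rbf", gamma=1e-3, C=1.0,
          class_weight={+1: 199.0, -1: 1.0})
# clf.fit(X_train, y_train)   # y_train takes values in {+1, -1}
```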

2) Performance Test Without KFDA's Feature Extraction: In this experiment, the raw image vectors are sent directly into each classifier without using KFDA as the feature extractor. Since the KFDA feature extractor is no longer used, the optimal parameters of each classifier need to be reselected. Note that the inputs of each classifier are normalized to zero mean and unit variance, as sketched below. After feeding the 20 different test sets into these systems directly, without the KFDA feature extractor, the average error rates are obtained and listed in Table VII.
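The paper does not say whether the normalization is per pixel or per image; the sketch below assumes per-pixel statistics estimated on the training set and reused on the test set:

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    """Scale each input dimension to zero mean and unit variance,
    using training-set statistics only (eps guards against
    division by zero for constant pixels)."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + eps
    return (X_train - mu) / sd, (X_test - mu) / sd
```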

Comparing the results reported in Tables VII and VI, we find that the average error rate of each system in Table VI is lower than that listed in Table VII. For example, the average error rate of KFDA + TAF-SVM is 14.15%, while that of TAF-SVM alone is 20.40%. Therefore, we can conclude that using KFDA as the feature extractor significantly enhances the classification accuracy of each classifier. From Table VII, it can be seen that, in terms of the average classification rate, TAF-SVM outperforms SVM (OAA) and SVM (OAO) by 7.23% and 3.73%, respectively. In addition, TAF-SVM still achieves the lowest variance.

VI. CONCLUSION AND FUTURE WORK

A new classifier called TAF-SVM is proposed in this paper. TAF-SVM is mainly designed to remedy two drawbacks of the traditional SVM when applied to face recognition, the class-boundary-skew problem and the overfitting problem, by introducing the different cost algorithm and the fuzzification of the training set. Another contribution is the enhancement of the generalization ability of SVM through the total margin algorithm. Experimental results show that the proposed TAF-SVM is superior to OAO- and OAA-based SVM in terms of both face-classification rate and stability, validating TAF-SVM as an improvement over SVM in classification accuracy for face recognition.

Based on the work presented, several topics remain worth studying in the future. First, the circle-like membership model used here for the training-set fuzzification is not a very efficient model, since the face-pattern distribution is in general non-Gaussian and nonconvex; a study of better membership models is needed. Second, experimental results have shown that using KFDA as the feature extractor enhances the classification accuracy. However, for face recognition, KFDA suffers from the SSS problem in our work. We believe that if this problem is solved, e.g., by using variants of KFDA such as KDDA [3] or CKFD [42], the face-recognition accuracy can be further enhanced with the TAF-SVM classifier.

    ACKNOWLEDGMENT

    The authors would like to thank the reviewers for their useful

    comments and suggestions, and Prof. H.-P. Huang, Prof. S.-G.

    Miaou, Prof. P. C. P. Chao, and H.-Y. Lin for their help in

    preparing this paper.

    REFERENCES

[1] K. Veropoulos, C. Campbell, and N. Cristianini, "Controlling the sensitivity of support vector machines," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI'99), Stockholm, Sweden, 1999, pp. 55–60.
[2] C. W. Hsu and C. J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.
[3] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, "Face recognition using kernel direct discriminant analysis algorithms," IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 117–126, Jan. 2003.
[4] M. Yoon, Y. Yun, and H. Nakayama, "A role of total margin in support vector machines," in Proc. Int. Joint Conf. Neural Netw., 2003, vol. 3, pp. 2049–2053.
[5] I. Guyon, N. Matic, and V. Vapnik, "Discovering informative patterns and data cleaning," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park, CA: AAAI Press, 1996, pp. 181–203.
[6] X. Zhang, "Using class-center vectors to build support vector machines," in Proc. IEEE Workshop Neural Netw. Signal Process. (NNSP'99), Madison, WI, 1999, pp. 3–11.
[7] C. F. Lin and S. D. Wang, "Fuzzy support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 464–471, Mar. 2002.
[8] V. Vapnik, Statistical Learning Theory. New York: Springer-Verlag, 1998.
[9] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Comput., vol. 12, pp. 2385–2404, 2000.
[10] Chung Yuan Christian Univ. (CYCU), Multiview Face Database, Chungli, Taiwan [Online]. Available: http://vsclab.me.cycu.edu.tw/~face/face_index.html
[11] R. Akbani, S. Kwek, and N. Japkowicz, "Applying support vector machines to imbalanced datasets," in Proc. 15th Eur. Conf. Mach. Learn. (ECML), Pisa, Italy, 2004, pp. 39–50.
[12] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cogn. Neurosci., vol. 3, no. 1, pp. 71–86, 1991.
[13] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, pp. 273–297, 1995.
[14] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining Knowl. Disc., vol. 2, pp. 121–167, 1998.
[15] M. J. Er, S. Wu, J. Liu, and H. L. Toh, "Face recognition with radial basis function (RBF) neural networks," IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 697–710, May 2002.
[16] M. J. Er, W. L. Chen, and S. Q. Wu, "High-speed face recognition based on discrete cosine transform and RBF neural networks," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 679–691, May 2005.
[17] G. Wu and E. Cheng, "Class-boundary alignment for imbalanced dataset learning," in Proc. ICML 2003 Workshop Learn. Imbalanced Data Sets II, Washington, DC, 2003, pp. 49–56.
[18] N. Japkowicz, "The class imbalance problem: Significance and strategies," in Proc. 2000 Int. Conf. Artif. Intell.: Special Track on Inductive Learning, Las Vegas, NV, 2000, pp. 111–117.
[19] C. Ling and C. Li, "Data mining for direct marketing problems and solutions," in Proc. 4th Int. Conf. Knowl. Disc. Data Mining, New York, 1998, pp. 73–79.
[20] N. Chawla, K. Bowyer, and W. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
[21] J. F. Feng and P. Williams, "The generalization error of the symmetric and scaled support vector machines," IEEE Trans. Neural Netw., vol. 12, no. 5, pp. 1255–1260, Sep. 2001.
[22] M. H. Yang, "Kernel Eigenfaces vs. kernel Fisherfaces: Face recognition using kernel methods," in Proc. 5th IEEE Int. Conf. Autom. Face Gesture Recognit., Washington, DC, 2002, pp. 215–220.
[23] Q. S. Liu, H. Q. Lu, and S. D. Ma, "Improving kernel Fisher discriminant analysis for face recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 42–49, Jan. 2004.
[24] B. Schölkopf, A. Smola, and K. R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299–1319, 1998.
[25] K. I. Kim, K. Jung, and H. J. Kim, "Face recognition using kernel principal component analysis," IEEE Signal Process. Lett., vol. 9, no. 2, pp. 40–42, Feb. 2002.
[26] G. Cui and W. Gao, "SVMs for few examples-based face recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Hong Kong, 2003, vol. 2, pp. 381–384.
[27] W. Chi, G. Dai, and L. Zhang, "Face recognition based on independent Gabor features and support vector machine," in Proc. 5th World Congr. Intell. Control Autom., Hangzhou, China, 2004, vol. 5, pp. 4030–4033.
[28] C. Y. Li, F. Liu, and Y. X. Xie, "Face recognition using self-organizing feature maps and support vector machines," in Proc. 5th Int. Conf. Comput. Intell. Multimedia Appl., Xi'an, China, 2003, pp. 37–42.
[29] G. Dai and C. Zhou, "Face recognition using support vector machines with the robust feature," in Proc. 12th IEEE Int. Workshop Robot Human Interactive Commun., 2003, pp. 49–53.
[30] S. Y. Zhang and H. Qiao, "Face recognition with support vector machine," in Proc. IEEE Int. Conf. Robot., Intell. Syst. Signal Process., Changsha, China, 2003, vol. 2, pp. 726–730.
[31] K. I. Kim, J. Kim, and K. Jung, "Recognition of facial images using support vector machines," in Proc. 11th Workshop Stat. Signal Process., Singapore, 2001, pp. 468–471.
[32] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 711–720, Jul. 1997.
[33] Y. Lin, Y. Lee, and G. Wahba, "Support vector machines for classification in nonstandard situations," Mach. Learn., vol. 46, pp. 191–202, 2002.
[34] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," in Proc. Comp. Vis. Pattern Recognit. (CVPR), Puerto Rico, 1997, pp. 130–136.
[35] U. H.-G. Kressel, "Pairwise classification and support vector machines," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999.
[36] T. Van Gestel, J. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, and J. Vandewalle, "Benchmarking least squares support vector machine classifiers," Mach. Learn., vol. 54, no. 1, pp. 5–32, 2004.
[37] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, "The FERET evaluation methodology for face-recognition algorithms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1090–1104, Oct. 2000.
[38] P. J. Phillips, The Facial Recognition Technology (FERET) Database (2004) [Online]. Available: http://www.itl.nist.gov/iad/humanid/feret/feret_master.html
[39] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher discriminant analysis with kernels," in Proc. IEEE Int. Workshop Neural Netw. Signal Process. IX, Aug. 1999, pp. 41–48.
[40] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Constructing descriptive and discriminant nonlinear features: Rayleigh coefficients in kernel feature spaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 623–628, May 2003.
[41] H. Yu and J. Yang, "A direct LDA algorithm for high-dimensional data with application to face recognition," Pattern Recognit., vol. 34, pp. 2067–2070, 2001.
[42] J. Yang, A. F. Frangi, J. Y. Yang, D. Zhang, and Z. Jin, "KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 2, pp. 230–244, Feb. 2005.
[43] J. Yang, J. Y. Yang, and A. F. Frangi, "Combined Fisherfaces framework," Image Vis. Comput., vol. 21, no. 12, pp. 1037–1044, 2003.
[44] H. P. Huang and Y. H. Liu, "Fuzzy support vector machines for pattern recognition and data mining," Int. J. Fuzzy Syst., vol. 4, no. 3, pp. 826–835, 2002.
[45] J. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Process. Lett., vol. 9, pp. 293–300, 1999.
[46] M. W. Chang, C. J. Lin, and R. C. H. Weng, "Analysis of switching dynamics with competing support vector machines," IEEE Trans. Neural Netw., vol. 15, no. 3, pp. 720–727, May 2004.
[47] S. N. Pang, D. Kim, and S. Y. Bang, "Face membership authentication using SVM classification tree generated by membership-based LLE data partition," IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 436–446, Mar. 2005.
[48] S. Li, J. T. Y. Kwok, I. W. H. Tsang, and Y. Wang, "Fusing images with different focuses using support vector machines," IEEE Trans. Neural Netw., vol. 15, no. 6, pp. 1555–1561, Nov. 2004.

Yi-Hung Liu (M'04) received the B.S. degree in naval architecture and marine engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1994, and the M.S. degree in engineering science and ocean engineering in 1996 and the Ph.D. degree in mechanical engineering in 2003, both from National Taiwan University, Taipei, Taiwan, R.O.C.
He is currently an Assistant Professor with the Department of Mechanical Engineering, Chung Yuan Christian University, Chungli, Taiwan, R.O.C. His research interests include computer vision, machine learning, pattern recognition, data mining, automatic control, and their associated applications.

Yen-Ting Chen was born in Kaohsiung, Taiwan, R.O.C. He received the B.S. and M.S. degrees in mechanical engineering from Chung Yuan Christian University, Chungli, Taiwan, R.O.C., in 2004 and 2006, respectively.
He is currently with the Industrial Technology Research Institute (ITRI), Hsinchu, Taiwan, R.O.C., where he works on intelligent robots. His research interests include machine vision and neural networks.