

Pattern Recognition 44 (2011) 386–400

Contents lists available at ScienceDirect

Pattern Recognition


journal homepage: www.elsevier.com/locate/pr

Realtime training on mobile devices for face recognition applications

Kwontaeg Choi a, Kar-Ann Toh b, Hyeran Byun a,*

a Department of Computer Science, Yonsei University, 134 Shinchon-dong, Seodaemun-gu, Seoul 120-749, Republic of Korea
b School of Electrical and Electronic Engineering, Yonsei University, 134 Shinchon-dong, Seodaemun-gu, Seoul 120-749, Republic of Korea

article info

Article history:

Received 13 January 2010

Received in revised form 25 June 2010
Accepted 7 August 2010

Keywords:

Face recognition

Realtime training

Mobile application

Random projection

0031-3203/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2010.08.009
* Corresponding author. Tel.: +82 2 2123 2719; fax: +82 2 363 2599.
E-mail address: [email protected] (H. Byun).

abstract

Due to the increases in processing power and storage capacity of mobile devices over the years, the incorporation of realtime face recognition into mobile devices is no longer unattainable. However, the possibility of realtime learning of a large number of samples within mobile devices must first be established. In this paper, we attempt to establish this possibility by presenting a realtime training algorithm on mobile devices for face recognition related applications. This differentiates our work from traditional algorithms which focused on realtime classification. In order to solve the challenging realtime issue on mobile devices, we extract local face features using local random bases, and then a sequential neural network is trained incrementally with these features. We demonstrate the effectiveness of the proposed algorithm and the feasibility of its application on mobile devices through empirical experiments. Our results show that the proposed algorithm significantly outperforms several popular face recognition methods with a dramatic reduction in computational time. Moreover, only the proposed method shows the ability to train additional samples incrementally in realtime, without memory failure or accuracy degradation, on a recent mobile phone model.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. Background and motivation

Due to the popularity of high performance mobile devices such as Apple's iPhone, Google's Android and RIM's Blackberry, the incorporation of computer vision and pattern recognition techniques into camera-equipped mobile devices can further extend their level of usage intelligence. Such intelligent uses include mobile interaction, context-awareness, wearable computing and mobile entertainment. In addition, the rapid growth of the corresponding application markets for mobile phones, such as the AppStore, Android Market and BlackBerry App World, is attracting much attention in both academia and industry. A low cost implementation of an accurate face recognition system can thus be useful for a wide range of applications such as active focusing, photo management, law enforcement and entertainment.

To date, the most common application of face recognition in mobile devices is identity authentication for access control and prevention of unauthorized mobile phone usage. Since more and more private information is being stored in mobile devices for uses such as mobile banking, a biometric enabled system may be useful for secure applications in addition to a simple PIN code. Another application is law enforcement, such as criminal investigation and terrorist tracking; a common scenario could be that of an unknown person being compared with images in a database. A third application is the use of a face for auto focusing in digital cameras. Face detection based auto focusing has already been incorporated in most digital cameras, and some products even support face recognition based auto focus in which registered friends or family members are focused with high priority. Yet another application of great interest is photo annotation and management of large face image databases. Here, faces in a photograph could be linked to names listed in the user's address book or on social-networking web sites.

A mobile phone based face recognition system not only suffers from common problems such as illumination, occlusion and pose variations, but is also limited by the following factors:

(i) Low-quality image: The quality of face images produced by a camera equipped in mobile devices can be affected by distortions and occlusions due to various accessories. Thus the face variation between training and test samples could be larger than that under fixed and cooperative environments. The dynamic nature of these environments can cause much performance degradation.

(ii) Limited computing power: The computational power of mobile devices has been growing rapidly. For example, the recent mobile phone T'Omnia comes with an 800 MHz clock, while many other mobile phones use a clock of 500–600 MHz. However, such computing power may not be sufficient for the implementation of many complex vision and pattern recognition algorithms. In addition, most processors within mobile devices do not support floating point instruction sets.

(iii) Limited memory: Many recent mobile phones support 256 MB RAM and several GB of external flash memory. This memory space may be insufficient for face recognition algorithms which handle a lot of face images of high dimensionality. Algorithms with a high computational time cost can still be run on mobile devices, although they will take a long time. However, our implementations (without any code optimization) of well-known face recognition methods such as principal component analysis (PCA) [1], the support vector machine (SVM) [2] and neural networks [3] failed to run on recent mobile devices due to memory allocation errors.

(iv) Batch-based asymmetric training: Many face recognition algorithms adopt a certain optimization procedure which requires a lot of training samples in advance. This causes problems in mobile device applications. However, many previous works have not paid much attention to the training issue for mobile devices. Some works have addressed this issue by focusing on verification, which has a much lower computational cost than identification. Other works adopt a server-based method where training can be performed offline on a PC.

1.2. Related works and problem definition

Although many works have aimed at developing robust face recognition algorithms, relatively few works have focused on adapting the technology to the domain of mobile computing. Here we group existing face recognition works on mobile devices by their approaches (sensing-oriented, classification-oriented and device-oriented).

Under the sensing-oriented approach [4–6], mobile devices are used to sense video and audio streams, while most processing components including feature extraction, classification and training are performed on a server computer via a wireless network in order to overcome the computational limitation of mobile devices. A major drawback of these solutions is that the system cannot function at locations without a stable network connection.

In the classification-oriented approach [7,8], the sensing, feature extraction and classification tasks are performed within the mobile devices, and those off-line learning tasks which require high computational time and space are performed on another computer. This approach assumes that off-line training is infrequent.

In the device-oriented approach, all tasks including off-line training are performed within the mobile devices. However, due to the limitation of computational resources in mobile devices, most works have used a simple prototype and verification system with a small number of face images and individuals. For example, [9] adopted skin color and fiducial points for face recognition on a Nokia 6630 model (ARM 220 MHz processor). The computational costs for training and testing were not evaluated, and the accuracy was very low, as it was evaluated using very few samples. Ref. [10] proposed a verification system using a skin-color based face detector and a local binary pattern (LBP) on a Nokia N90 mobile phone (ARM 220 MHz processor and 31 MB RAM). The system ran at 2 fps. Ref. [11] demonstrated a realtime, embedded face recognition system on an ARM 806 MHz processor. AdaBoost based face detection [12] and the ARENA algorithm, a memory-based recognition method, were adopted for fast training and testing. However, the accuracy and the computational costs of training and testing were not presented. These works adopt simple template matching algorithms for classification tasks, which get slower as the number of samples increases. Moreover, accuracy seems to be low. For commercial products, [13] described the OKAO vision algorithm on various processors for verification. Recognition time, enrollment time and the required ROM/RAM sizes were presented. However, the detailed algorithm and the evaluation results were not given.

In this work, we attempt to solve two problems. The first problem is realtime training: given some image samples, training must be performed within a few seconds. The conventional batch-based training approach may be inappropriate for this purpose. The second problem is the large face variation arising from digital images taken using different camera devices and with partial occlusion due to accessories. Our solution to these problems is to update the learned model parameters incrementally using local features.

1.3. Contributions

In this paper, we propose the extraction of local features using a random basis, after which a sequential neural network is trained incrementally with these features. To the best of our knowledge, this is the first realtime incremental learning algorithm for mobile phones that deals with a large number of face images. Two characteristics of our proposed algorithm are enumerated as follows:

(i) Local random basis: Unlike the conventional random basis which extracts orthogonal and global features, we generate a non-orthogonal local random basis which can be robust to local deformation as well as being computationally efficient.

(ii) Realtime incremental training: The proposed method combines an efficient non-training based feature extractor with a sequential classifier, thus reducing the computational time of retraining dramatically. Moreover, our method can be considered an incremental classifier at the decision level. Thus, the classification cost is independent of the number of training samples. This differentiates it from the well-known incremental subspace methods.

1.4. Organization

The rest of this paper is organized as follows. Section 2 briefly introduces existing random projection and sequential feedforward neural networks. In Section 3, we propose a local random projection to generate a non-orthogonal local basis and an ensemble of incremental classifiers in order to increase the recognition accuracy. In Section 4, several experiments are presented using public face databases where seven algorithms are compared. Finally, our conclusions are given in Section 5.

2. Background

In this section, we present some background on random projection and the sequential feedforward neural network for immediate reference.

2.1. Random projection

Unlike most training based feature extraction methods, which require a lot of training samples and certain optimization criteria, random projection (RP) [14] has emerged as an efficient dimensionality reduction method which does not require any training samples. In RP, the original high-dimensional data can be projected onto a low-dimensional subspace using a set of randomly generated bases, where the projection matrix consists of i.i.d. entries with zero mean and constant variance, and the bases are orthonormal.

In order to significantly reduce the computational cost during the generation of a projection matrix and the projection onto a low-dimensional subspace, Achlioptas [15] proposed a sparse random projection (SRP) with elements belonging to ±√s (s ∈ {1, 3}) and 0, without orthonormalization. A projection matrix R ∈ ℝ^{D×d}, with an original face dimension D and a reduced dimension d, consists of the following entries:

$$R_{ji} = \sqrt{s}\times\begin{cases} +1 & \text{with prob. } \frac{1}{2s} \\ \;\;0 & \text{with prob. } 1-\frac{1}{s} \\ -1 & \text{with prob. } \frac{1}{2s} \end{cases} \qquad (1)$$

When the sparsity s is 3, then 2/3 of the entries contain zero values. This leads to a threefold increase in speed. Moreover, the multiplications with √s can be delayed and no floating point arithmetic is needed. More recently, Li et al. [16] proposed a very sparse random projection method. They showed that one can use s ≫ 3 (e.g., s = √D, or s = D/log D) to significantly increase the speed of the computation. Experimental results in [14] show that although RP represents faces in a random, low-dimensional subspace, its overall performance is comparable to that of PCA while having lower computational requirements and being data independent.
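As a concrete illustration, the following minimal sketch (in Python with NumPy; the function name and the example sizes are ours, not from the paper) generates an SRP matrix according to Eq. (1) and projects a batch of vectorized face images:

```python
import numpy as np

def sparse_random_projection(D, d, s=3, seed=0):
    """D x d sparse random projection matrix following Eq. (1): entries are
    +sqrt(s) with prob. 1/(2s), 0 with prob. 1 - 1/s, and -sqrt(s) with
    prob. 1/(2s). No training data and no orthonormalization are needed."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([1.0, 0.0, -1.0], size=(D, d),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return np.sqrt(s) * signs

# Project n vectorized face images (rows) from D down to d dimensions.
D, d = 1024, 64                 # e.g. 32x32 images reduced to 64 features
R = sparse_random_projection(D, d)
X = np.random.rand(5, D)        # stand-in for preprocessed face images
Z = X @ R                       # low-dimensional features, shape (5, d)
```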

The main reasons that RP has not been widely applied are its low accuracy and its sensitivity to local deformation. Thus RP has not been well adopted for face recognition except in cancelable biometrics [17,18].

2.2. Online sequential extreme learning machine

The multilayer feedforward neural network (also called the multilayer perceptron network, MLP) has been conveniently adopted for empirical learning in many classification applications. In this work, we consider an MLP with a single hidden layer and an output layer formed by neurons with linear activation functions. For convenience, we will call this network a single layer feedforward network (SLFN) [19].

Given n samples x_i ∈ ℝ^D and corresponding targets y_i, a standard SLFN with h hidden nodes and an activation function g can be mathematically modeled as

$$y_i = \sum_{j=1}^{h} \beta_j\, g(w_j x_i + b_j) \qquad (2)$$

where w_j = [u_{j1}, u_{j2}, …, u_{jD}]^T is the weight vector connecting the jth hidden node to the input nodes, b_j is the threshold of the jth hidden node, and β_j is the weight connecting the jth hidden node to the output nodes.

For a K-class problem, denote X ∈ ℝ^{D×n} as the training samples, Θ = [β_1, …, β_K] ∈ ℝ^{h×K} as the weight matrix, which is a collection of the weight vectors β, and Y ∈ ℝ^{n×K} as the indicator matrix. The n equations above can be written more compactly as

$$H\Theta = Y \qquad (3)$$

where

$$H = g(X, w, b) = \begin{bmatrix} g(w_1 x_1 + b_1) & \cdots & g(w_h x_1 + b_h) \\ \vdots & \ddots & \vdots \\ g(w_1 x_n + b_1) & \cdots & g(w_h x_n + b_h) \end{bmatrix} \qquad (4)$$

The weight parameters Θ can be estimated by minimizing the least squares error, giving the solution [20]:

$$\Theta = (H^T H)^{-1} H^T Y \qquad (5)$$

As SLFNs can work as universal approximators with adjustable hidden parameters, in practice the hidden node parameters of SLFNs can actually be randomly generated according to any continuous sampling distribution. This is seen in the random setting of w and b in the extreme learning machine (ELM) [20]. Unlike traditional implementations and learning theory, the ELM theory shows that the hidden node parameters can be completely independent of the training data. Moreover, for new data samples, the ELM can be retrained efficiently using only those additional samples via a recursive least squares formulation. For this purpose, Huang [21] proposed the online sequential ELM (OS-ELM). When a new block of feature data H_{t+1} = g(X_{t+1}, w, b) and the corresponding indicator matrix Y_{t+1} are received, the parameter Θ_{t+1} can be estimated recursively as

$$\Theta_{t+1} = \Theta_t + P_{t+1} H_{t+1}^T (Y_{t+1} - H_{t+1}\Theta_t) \qquad (6)$$

where

$$P_{t+1} = P_t - P_t H_{t+1}^T (I + H_{t+1} P_t H_{t+1}^T)^{-1} H_{t+1} P_t \qquad (7)$$

However, OS-ELM will incur a high computational cost when dealing with high dimensional face images. Moreover, the learning could be unstable due to the random assignment of the weight parameters. Therefore OS-ELM is not directly applicable to mobile-based face recognition. In order to address these limitations of OS-ELM, we propose a locally random neural network in the next section.
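To make the recursion concrete, here is a minimal, self-contained OS-ELM sketch following Eqs. (2)–(7); the class layout, the sigmoid activation and the full-rank assumption on the initial chunk are illustrative choices of ours rather than the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OSELM:
    """Minimal online sequential ELM following Eqs. (2)-(7)."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Hidden parameters are random and fixed, independent of the data.
        self.W = rng.standard_normal((dim, hidden))   # input weights w_j
        self.b = rng.standard_normal(hidden)          # hidden biases b_j
        self.P = None                                 # running (H^T H)^{-1} term
        self.Theta = None                             # hidden-to-output weights

    def _hidden(self, X):
        return sigmoid(X @ self.W + self.b)           # H = g(Xw + b), n x h

    def init_fit(self, X0, Y0):
        """Batch least squares on the initial chunk, Eq. (5). Assumes the
        chunk has at least `hidden` samples so H^T H is invertible."""
        H = self._hidden(X0)
        self.P = np.linalg.inv(H.T @ H)
        self.Theta = self.P @ H.T @ Y0

    def update(self, Xt, Yt):
        """Recursive update with a new chunk, Eqs. (6)-(7)."""
        H = self._hidden(Xt)
        S = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ S @ H @ self.P
        self.Theta = self.Theta + self.P @ H.T @ (Yt - H @ self.Theta)

    def predict(self, X):
        return self._hidden(X) @ self.Theta           # one-versus-all scores
```

Note that the update touches only the new chunk; previously seen samples are never revisited, which is what makes the later incremental training on mobile devices feasible.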

3. The proposed locally random incremental classifier

In this section, we present a locally random incremental classifier (LRIC) and analyze how the proposed method achieves high accuracy, together with its computational complexity.

3.1. Overview of the proposed method

In this section, we provide an overview of the proposed method in comparison with the well known PCA and incremental PCA (IPCA) [22].

As illustrated in the feature extraction step in Fig. 1, the proposed method extracts local features using randomly generated bases whose entries are +√s, 0 and −√s with probability 1/(2s), 1 − 1/s and 1/(2s), respectively, where s is the sparsity of the basis. This random basis generation, without the need for training samples, is extremely inexpensive on mobile devices compared with the singular value decomposition over high-dimensional face images required by the PCA and IPCA methods. Moreover, for D dimensional features, a single basis can be represented using only D/s bits instead of D×8×8 bits, when we map the non-zero entries +√s and −√s of a basis to 1 and 0, respectively.

In the classification step, we adopt a supervised neural network to classify the extracted local features. Since the proposed method is a recursive formulation, the decision regions are updated according to newly arrived data. This reduces the computational cost, since the calculation is cumulative and no re-computation of previous data is needed. In the final classification stage, the neural network adopts a one-versus-all technique for multi-class problems. Thus the complexity of the classification task is proportional to the number of classes C rather than the number of training samples N. Meanwhile, the complexity of the conventional subspace PCA approach, including its incremental version, is proportional to N, since each test sample is compared to the entire set of trained samples in order to determine its class label. Therefore, subspace learning is not suitable for mobile device applications.

Fig. 1. Overview of the proposed method.

Fig. 2. Examples of bases obtained using various subspace methods: (a) PCA, (b) ICA, (c) LFA, (d) LNMF, (e) RP, (f) SRP (s = 3), (g) SRP (s = 32), (h) LRP (s = 3), (i) LRP (s = 16). In (f)–(i), black indicates 0, white indicates +√s and gray indicates −√s.

3.2. Local random projection (LRP)

Unlike the global approach, the local approach, such as independent component analysis (ICA) [23], local feature analysis (LFA) [24] or local nonnegative matrix factorization (LNMF) [25] (see Figs. 2(b)–(d)), offers several advantages including robustness to local deformations, lighting variations, and partial occlusion [25]. However, the main problem in mobile applications is the high computational cost of estimating those local bases from a given large pool of training samples. Meanwhile, RP and SRP (see Figs. 2(e)–(g)) do not seem to represent any local feature, since the distribution of non-zero entries in a basis is not as localized as those extracted by local methods. Therefore, locality of features and computational cost must be well balanced on mobile devices.

In order to combine the robustness of local feature extraction with the efficiency of RP, we propose a local random projection (LRP) in which the random distribution of non-zero entries in each basis is changed to a localized distribution, as illustrated in Figs. 2(h)–(i). Although learning these local random bases does not require any training samples, we need to determine four parameters: the quantity, the location, the sparsity and the distribution pattern of the local regions, as shown in Fig. 3. These parameters must be selected appropriately since they may influence the accuracy and the computational cost of the method.

Without loss of generality and for simplicity, we assume that each basis represents a single, square local region, with a quantity parameter of one. The sparsity parameter, given by s in Eq. (1), is related to the size of the local region, such that when s is small, the local region is large (see Fig. 3(b)). The location of the local region can be set randomly within the image template, as shown in Fig. 3(c). The distribution of non-zero entries in the local region may be random or structural, as shown in Fig. 3(d). Below we discuss how to determine these three remaining parameters in order to efficiently and effectively extract the local features.

Fig. 3. The four parameters in the proposed local random projection. (a) Quantity; (b) sparsity; (c) location; (d) distribution.

Fig. 4. Single basis representation using a 64-bit integer.


3.2.1. Sparsity setting

Sparsity may influence both effectiveness and efficiency. For example, when s is set at 1 in Eq. (1), the ratios of −√s, 0 and +√s within a basis are 50%, 0% and 50%, respectively. Therefore every pixel is sampled using this basis and no local structural information can be extracted. On the contrary, when s is D/2 where D = 1024, the ratios of −√s, 0 and +√s within a basis are 0.0978%, 99.8% and 0.0978%, respectively. In this case, no structural information can be extracted since most entries within the basis are zeros. Since controlling the degree of sparsity within each basis would incur a high computational cost, in this work we aim at finding an efficient sparsity representation suitable for mobile device applications, instead of optimizing the sparsity for accuracy improvement.

For simplicity, we assume that s is a square, such as 4, 9, 16, 25, 36, 49 or 64 (s ∈ [2², 8²]), in order to avoid the floating-point operation in Eq. (1). When s is 4, 25% of the pixels are sampled, which may not give a local basis. When s is 36, only 2.7% of the pixels are sampled, which may not extract structural information. Therefore, the possible values of s are 9, 16 and 25.

Hence, s is related to the size of the local region, which is usually an integer [16] and not a floating-point number. The size of the local region can be calculated as D/s. Given a sparsity of 16, the dot product between the jth basis R_j and an image x can be calculated efficiently as follows:

$$R_j \cdot x = \sqrt{16}\,[-1\;1\;0\;0\;1\;1\;\ldots\;0\;0\;0]\cdot x = \Big(\sum_{R_{ji}=+1} x_i \;-\; \sum_{R_{ji}=-1} x_i\Big) \ll 2 \qquad (8)$$

Eq. (8) needs only 64 integer additions and one shift operation, with no floating-point or integer multiplication. This may lead to a reduction in the time complexity. In conventional subspace methods, the same dot product requires 1024 floating-point multiplications and 1024 floating-point additions.

Furthermore, a basis with a sparsity of 16 can be represented with a very small amount of memory. Since the number of non-zero entries in the local region is 64, it can be represented by a 64 (= D/s) bit integer if we map −1 to 0, as shown in Fig. 4. This means that a single basis can be represented as a 64-bit integer. Any basis R_j can be represented as (x, y, 64 bits), where x and y are the location of the local region in the template. The memory requirement of this representation is only 10×d bytes. Meanwhile, the memory requirement for a conventional subspace projection matrix R ∈ ℝ^{D×d} is D × d × sizeof(double). This space may be large when many mobile applications are running on a mobile device.
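The following sketch illustrates the integer-only projection of Eq. (8) with one bit-packed basis; the scan order, the variable names and the use of plain Python integers for the 64-bit mask are illustrative assumptions rather than the paper's C++ implementation:

```python
import numpy as np

def project_bitpacked(region_pixels, sign_bits):
    """Integer-only dot product of one local basis with an image, Eq. (8).

    region_pixels: the 64 pixel intensities of the basis's 8x8 local
                   region, flattened in a fixed scan order.
    sign_bits:     64-bit integer; bit i set means entry +1, clear means -1
                   (the paper's mapping of -1 to a 0 bit).
    Returns (sum over +1 pixels - sum over -1 pixels) << 2, i.e. the dot
    product with the +/-sqrt(16) basis, using only additions and one shift.
    """
    pos = neg = 0
    for i in range(64):
        if (sign_bits >> i) & 1:
            pos += region_pixels[i]
        else:
            neg += region_pixels[i]
    return (pos - neg) << 2

rng = np.random.default_rng(0)
# A random packed sign pattern built from two 32-bit draws.
bits = (int(rng.integers(0, 2**32)) << 32) | int(rng.integers(0, 2**32))
pixels = rng.integers(0, 256, 64).tolist()   # 8-bit intensities of the region
print(project_bitpacked(pixels, bits))
```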

3.2.2. Location setting

As shown in Figs. 2(b)–(d), the locations of the hot spot regions determined by local feature extraction methods may be distributed globally across the face appearance and may represent distinctive facial components. In order to generate such bases without the need for optimization, we consider three location settings in terms of the degree of overlap, as shown in Fig. 5.

(i) "Overlapping": The location of the local region can be set randomly within the template window, as shown in Fig. 5(a). Each basis can be represented using 10 bytes, including the x–y location as described in the previous subsection. Hence, many locations are possible. However, some pixels may be sampled redundantly while others are never sampled at all. In order to reduce this effect, a lot of bases must be generated.

(ii) "Non-overlapping": The template is divided into equally sized local regions, as shown in Fig. 5(b), and the location of the local region is set as one of the subregions. In order to store this type of basis, the location (x, y) can be dropped, so each basis can be represented in 64 bits. Unlike "overlapping", only a limited number of bases can be generated: the possible number of locations is D/(D/s) = s. For example, 16 locations are possible when the size of the template is 32×32. In order to sample every pixel without redundancy, the number of bases must be s. Compared to "overlapping", the number of bases is small, but the accuracy may be lower than that of "overlapping" with a large number of bases.

(iii) "Semi-overlapping": Subimage partitioning without overlapping may neglect the relationships among the local regions [26]. "Semi-overlapping" can connect adjacent local regions and combine the different information in each region. In order to store this type of basis, the location (x, y) can be dropped as in "non-overlapping", so each basis can be represented in 64 bits. The possible number of locations in "semi-overlapping" is s + (√s − 1)². For example, 16 + 9 = 25 locations are possible when the size of the template is 32×32.

Fig. 5. The three location settings of local basis: (a) overlapping; (b) non-overlapping; (c) semi-overlapping.

The number of bases is a critical issue on mobile devices since it is related to accuracy and computational cost; a code sketch of the three settings is given below. Location selection by overlapping may be the most reliable when a lot of bases are used. However, this method increases the computational cost. Location selection by non-overlapping may be less accurate compared to the other selection methods, but its computational cost is low. For security applications which require high accuracy, overlapping selection is appropriate. In the experiment section, we evaluate these three location selection methods.
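A sketch of the three location settings for a 32×32 template with s = 16 (8×8 regions) follows; interpreting "semi-overlapping" as the s grid cells plus the (√s − 1)² half-shifted cells is our reading of the counts given above, not an implementation taken from the paper:

```python
import numpy as np

def candidate_locations(template=32, region=8, mode="semi-overlapping",
                        n_random=30, seed=0):
    """Top-left corners of the local region under the three settings.

    "non-overlapping": the s = (template/region)^2 grid cells (16 here).
    "semi-overlapping": the grid cells plus the (sqrt(s)-1)^2 = 9 cells
                        shifted by half a region (25 in total).
    "overlapping":      n_random positions drawn anywhere in the template.
    """
    if mode == "overlapping":
        rng = np.random.default_rng(seed)
        hi = template - region + 1
        return [tuple(rng.integers(0, hi, 2)) for _ in range(n_random)]
    grid = [(x, y) for x in range(0, template, region)
                   for y in range(0, template, region)]
    if mode == "non-overlapping":
        return grid
    half = [(x, y) for x in range(region // 2, template - region, region)
                   for y in range(region // 2, template - region, region)]
    return grid + half

print(len(candidate_locations(mode="non-overlapping")))   # 16
print(len(candidate_locations(mode="semi-overlapping")))  # 25
```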

3.2.3. Distribution setting

In order to extract structural information from the face appearance, the distribution of non-zero entries within the local regions must be well designed, because it may affect the accuracy. The simplest form is the random distribution, as illustrated in the leftmost image of Fig. 3(d), where −1 and 1 are set randomly with equal probability according to Eq. (1). Each single basis can be generated by a simple random permutation of [−1 −1 −1 −1 −1 −1 −1 … 1 1 1 1 1].

The number of bases under this distribution is fairly large. However, this symmetric random distribution of −1 and 1 is only a subset of all 2^{D/s} possible patterns, and it may not represent the structural information of the face well. To alleviate this constraint, we consider an asymmetric sparse random projection as follows:

$$R_{ji} = \sqrt{s}\times\begin{cases} +1 & \text{with prob. } \frac{k}{s} \\ \;\;0 & \text{with prob. } 1-\frac{1}{s} \\ -1 & \text{with prob. } \frac{1-k}{s} \end{cases} \qquad (9)$$

where k controls the ratio of −1 and 1. When k = 0 or 1 (biased distribution), all entries of each local region are −1 or 1, respectively. When k = 0.5, Eq. (9) is the same as the original Eq. (1). This asymmetric distribution may increase the approximation error of random projection. However, this problem can be alleviated using an ensemble approach.

The main issues in Eq. (9) include how to determine the parameter k and how to distribute the non-zero entries for structural information extraction. Inspired by face recognition using Haar-like features [27], we design a Haar-like distribution which can capture intensity gradients at different locations, spatial frequencies and directions. In order to generate the Haar-like distribution of the middle or rightmost images in Fig. 3(d), the local region is decomposed into 2×2 unequally sized subregions, and the entries of each subregion contain an equally distributed −1 or 1. In this asymmetric distribution, k is not selected explicitly. Unlike Haar-like feature selection [28], which uses a boosting method and a lot of training samples, the proposed method generates the Haar-like bases randomly, without an optimization formulation or training samples, for efficient feature extraction on mobile devices.

According to the ratio k and the distribution patterns of +√s and −√s in Eq. (9), the relationship among the four distributions (asymmetric biased, asymmetric Haar-like, asymmetric random and symmetric random) is shown in Fig. 6. The number of possible bases under these distributions can affect the overall accuracy of the ensemble framework, since similar features projected onto a small number of bases can lead to low diversity. In the experiments, we evaluate the influence of these distributions on verification accuracy.

Fig. 6. Possible basis among different types of distributions.
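The sketch below samples the sign pattern of one local region under Eq. (9), together with one possible Haar-like layout; the particular 2×2 split point and sign assignment are illustrative assumptions, since the paper draws these patterns at random:

```python
import numpy as np

def asymmetric_region_signs(region_size=64, k=0.5, seed=0):
    """Sample the +/-1 pattern of one local region under Eq. (9): within
    the non-zero region, an entry is +1 with probability k and -1 with
    probability 1 - k. k = 0.5 recovers the symmetric pattern of Eq. (1);
    k = 0 or k = 1 gives the fully biased pattern."""
    rng = np.random.default_rng(seed)
    return np.where(rng.random(region_size) < k, 1, -1)

def haar_like_region_signs(side=8, split=(3, 5)):
    """One Haar-like layout: the side x side region is cut into a 2x2 set
    of unequally sized subregions at the given split point, and each
    subregion is filled with a constant +1 or -1 (diagonal signs here;
    other splits and sign assignments would be drawn at random)."""
    sx, sy = split
    p = np.ones((side, side), dtype=int)
    p[:sx, sy:] = -1        # top-right block
    p[sx:, :sy] = -1        # bottom-left block
    return p.ravel()
```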

Table 1. An overview of our experimental evaluations.

Evaluation                              Section   Database
Accuracy
  Robustness
    Realistic variation                 4.4.1     AR
    Synthesized occlusion               4.4.2     AT&T
    Illumination variation              4.4.3     EYALEB
    Misalignment                        4.4.4     ETRI, BioID
  Incremental learning
    Chunk-by-chunk update               4.4.5     AR
  Influence of parameters
    Sparsity, location, distribution    4.4.6     AR
Execution time
  On PC
    Training                            4.5.1     PIE
    Test                                4.5.2     PIE
  On mobile phones
    Training                            4.6       PIE
    Test                                4.6       PIE


3.3. Locally random incremental classifier (LRIC)

Due to the randomness of the proposed LRP, which needs no training, its accuracy can be lower than that of PCA, which is training based. In order to deal with this problem, we can combine our LRP with a supervised classifier such as an SVM, a neural network or kernel ridge regression at the decision level. However, since these classifiers need an iterative search for local solutions, reformulating them into a recursive mode complicates the problem. Therefore we adopt the OS-ELM, which has already been formulated in recursive form. Other incremental classifiers can also be combined with our LRP under the proposed framework. Thus, we call the combination of LRP and OS-ELM the locally random incremental classifier (LRIC).

In the proposed LRIC, the H matrix in Eqs. (4) and (6) is calculated as H = g(RX, w, b) ∈ ℝ^{n×h}. Compared with the original OS-ELM, which uses the high dimensional face image x ∈ ℝ^D, we train the network using the low dimensional random subspace feature Rx ∈ ℝ^d, where d ≪ D. This reduces the computational cost as well as the otherwise huge memory consumption.

One consideration when training the LRIC is the number of bases d and the number of hidden units h, which are generally estimated empirically. Since we propose to use an ensemble of networks to improve the recognition accuracy, the time consuming task boils down to the empirical determination of the learning parameters of all networks. Since 1/s of the pixels are sampled in the proposed LRP, we set d = D/s and empirically fix h = 2d for realtime training on mobile devices.

Finally, we construct many LRICs and combine them into a powerful decision rule in an ensemble framework in order to improve the stability and the overall accuracy. After generating C local random projection matrices R^c (c = 1, …, C) and receiving a new block of feature data, H^c_{t+1} = g(R^c X_{t+1}, w^c, b^c), we train each LRIC to find the corresponding matrix Θ^c as follows:

$$\Theta^c_{t+1} = \Theta^c_t + P^c_{t+1}\,(H^c_{t+1})^T\,(Y_{t+1} - H^c_{t+1}\Theta^c_t) \qquad (10)$$

where P^c_{t+1} is

$$P^c_{t+1} = P^c_t - P^c_t (H^c_{t+1})^T \big(I + H^c_{t+1} P^c_t (H^c_{t+1})^T\big)^{-1} H^c_{t+1} P^c_t \qquad (11)$$

The class label of an unseen sample x can be predicted using various fusion methods. Here, we adopt rank based voting using the five candidate labels within rank 5, rather than only the predicted label at rank 1; a sketch is given below. We note that a training-based fusion cannot be applied in this framework, since the proposed method is based on incremental learning.
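A sketch of the ensemble decision rule follows, assuming classifiers exposing the predict() interface of the OS-ELM sketch in Section 2.2; the rank based voting with five candidates mirrors the description above, while the tie-breaking is an implementation choice of ours:

```python
import numpy as np
from collections import Counter

def lric_predict(classifiers, projections, x, rank=5):
    """Ensemble decision over C LRIC members via rank based voting.

    classifiers: trained models exposing predict(), e.g. the OSELM sketch.
    projections: the matching local random projection matrices R^c.
    Each member votes for its top-`rank` class labels; the label with the
    most votes wins (ties broken by first occurrence in the Counter).
    """
    votes = Counter()
    for clf, R in zip(classifiers, projections):
        scores = clf.predict((x @ R)[None, :])[0]   # one-versus-all scores
        top = np.argsort(scores)[::-1][:rank]       # rank-5 candidate labels
        votes.update(int(c) for c in top)
    return votes.most_common(1)[0][0]
```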

We also consider another well-known incremental classifier, the incremental SVM (ISVM) [29,30], whose key idea is to retain the Karush–Kuhn–Tucker (KKT) conditions on all previously seen data while updating the SVM upon inclusion or deletion of a support vector, given a new data sample. However, this method incurs an expensive computational cost to maintain the KKT conditions during the incremental step, even on a PC. Moreover, the method updates the optimal solution of the SVM training problem only after each single training sample is added.

3.4. Analysis

In this section, we analyze how the proposed method achieves high accuracy and examine its computational complexity.

3.4.1. Complexity analysis

In this section, we analyze the computational complexity of the updating and classification steps. In the following analysis, n is the number of samples, h is the number of hidden nodes, d is the number of bases and k is the number of classes. In general, the complexity is determined by the dominant O(·) term. However, we present the precise computational complexity in order to analyze the effect of each factor.

(i) Computational complexity for updating Θ and P:

Equation    Complexity
H           O(ndh) = O(nd²)
Eq. (10)    O(d²n) + O(dn²) + O(d²n) + O(d³) + O(n³) + O(d²)
Eq. (11)    O(d²n) + O(d²n) + O(d³) + O(d²) + O(d²)

(ii) Computational complexity for classification:

Equation    Complexity
H           O(hd) = O(d²)
Ŷ = HΘ      O(dk)

The analysis shows that the complexity is largely determined by n³ and d³ and is relatively unaffected by the number of classes k. However, in our application, the relationship among these parameters is as follows:

$$n^3 < k < d^2 < d^3 \qquad (12)$$

In the incremental learning step, the chunk size n is small and the number of classes k is also small. However, d² and d³ may be much larger than n³ and k. Therefore, most of the computational complexity is governed by d.
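As a hedged numeric illustration (using settings drawn from the rest of the paper: D = 1024 and s = 16, hence d = D/s = 64; a chunk of n = 2 samples as in Table 11; and k = 100 classes as in the AR experiments), Eq. (12) becomes

$$n^3 = 8 \;<\; k = 100 \;<\; d^2 = 4096 \;<\; d^3 = 262\,144$$

so the O(d³) terms of the P update dominate each incremental step.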

4. Experiments

In this section, we use six databases and eight well-known algorithms to evaluate recognition accuracies and execution times on a desktop computer with a 3.2 GHz CPU and 2.0 GB RAM, using MATLAB 7.0. Finally, we evaluate our method on a mobile phone using C++ without code optimization. Since the evaluation is rather extensive and contains many details, an overview of our experiments is summarized in Table 1.


4.1. Databases setup

In the absence of a public face database captured from mobile devices, we evaluate the performance of the proposed method using selected public face recognition databases which contain various appearance variations such as expression, illumination, occlusion and time interval, under both controlled and real world conditions. Here, the appearance variation between the training set and the test set would be as large as that under a mobile environment. In particular, samples taken under normal conditions are used for training and those with large variations are used for testing. Fig. 7 shows some face image samples from these databases.

(i) AR face database [31] (AR). This database contains over 4000 color face images corresponding to 126 identities. These images feature frontal views with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf). Each individual has 26 different images, taken during two sessions separated by a two-week interval, each session consisting of 13 faces with different facial expressions, illumination conditions and occlusions. In our experiments, we use a subset of this database excluding some damaged or missing images, as in [26]. This subset contains 2600 face images corresponding to 100 subjects (50 men, 50 women), with 26 shots for each person.

(ii) AT&T face database [32] (AT&T). This database contains 400 gray images corresponding to the faces of 40 people. For some subjects, images were taken at different times, with variations of illumination, facial expression (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

(iii) The Extended Yale face database B [33] (EYALEB). This database contains 16,128 face images of 38 human subjects captured under nine poses and 64 illumination conditions. In our experiments, we use a subset of this database excluding some damaged images, as in [26]. This subset contains only the frontal pose, and includes 1920 images from 30 subjects.

(iv) The Pose, Illumination and Expression database [34] (PIE). The PIE database consists of 41,368 images of 68 people, where the face of each person was captured under 13 poses, 43 illumination conditions, and four expressions. Since this database is very large, we choose only five near frontal poses (C05, C07, C09, C27, C29) with different illumination and expression conditions, which gives a total of 11,560 images (68 people × 170 images).

(v) ETRI database [35] (ETRI). The ETRI database consists of 1100 frontal images from 55 identities (21 males, 24 females), including face images slightly rotated in plane and out of plane, as well as faces with eye glasses and facial hair. This database was captured in an indoor office.

(vi) BioID database [36] (BioID). During the recording of this database, special emphasis was placed on "real world" conditions. Therefore the test set features a large variety of illuminations, backgrounds, and face sizes. The dataset consists of 1521 gray level images with a resolution of 384×286 pixels. Each image shows the frontal view of the face of one of 23 different test persons.

Fig. 7. Face databases: (a) AR face database—realistic face variations evaluation; (b) AT&T face database—synthesized occlusion evaluation; (c) Extended Yale B database—illumination evaluation; (d) PIE face database—evaluation using a large number of training samples; (e) ETRI database—sensibility evaluation to misalignment; (f) BioID database—sensibility evaluation to misalignment.

Table 3. Robustness evaluation with respect to realistic face variations in the AR database, in terms of the recognition accuracy rate.

Training    1–7                                                      1–13
Testing     Expression  Illumination  Sunglass         Scarf         Time
            15–17       18–20         8–10    21–23    11–13  24–26  14–26
PCA-SVM     0.6733      0.6833        0.6200  0.3667   0.1467 0.0700 0.5462
LDA-SVM     0.6833      0.6733        0.6333  0.3600   0.1500 0.0767 0.5469
RP-NN       0.6978      0.6533        0.6278  0.3556   0.1656 0.0860 0.5038
ERP-NN      0.7000      0.6600        0.6450  0.3867   0.1633 0.0867 0.5179
ELM         0.7583      0.7850        0.4900  0.2733   0.2217 0.1200 0.4146
LRIC        0.9000      0.9700        0.6467  0.5133   0.7467 0.4667 0.8408

4.2. Preprocessing

Since automatic face alignment is yet another open issue that adds to the complexity of the problem, we adopt a manual process to align and crop the relevant face regions as the preprocessing step, in order to stay focused on the effectiveness of the proposed feature extraction method as a separate issue. Here we note that real-time facial feature point detection can be achieved within 100 ms on mobile devices [35,9], and many commercial digital cameras are even equipped with real-time face detection for active focusing and face expression recognition; such components can be adopted as a pre-requisite preprocessing step prior to our proposed feature extraction method under a fully automatic operation mode. After cropping, the original images are resized to 32×32 in order to simulate the low-resolution images of the mobile environment, where the high-frequency image regions can be blurred.

4.3. Benchmarking setup

We compare our method with eight face recognition algorithms which can be categorized into three approaches. The settings of these methods are shown in Table 2.

Table 2. Benchmarking setup.

Method         Setting
Baseline
  PCA-SVM      20 to the maximum number of PCA bases; SVM with a polynomial kernel
  LDA-SVM      Fisher LDA; SVM with a polynomial kernel
Data independent
  RP           20–400 random bases
  ERP          Ensemble of several RPs
Incremental method
  CCIPCA [22]  IPCA without estimating the covariance matrix
  ILDA [37]    ILDA using the sufficient spanning set approximation
  ISVM [30]    Extended code for multi-class problems using the one-versus-all technique
  OS-ELM       Input features: the original raw images

4.4. Performance evaluation

In this section, we conduct three experiments to evaluate robustness with respect to variations in imaging conditions, one experiment on incremental learning, and three further experiments based on the three settings of the local random basis, giving a total of seven experiments.

4.4.1. Robustness evaluation with respect to realistic face variations in the AR database

In this experiment, the performance under variations of facial expression, illumination, occlusion and time using the AR database is evaluated. The first seven images (1–7) are used for training, and the remaining images, divided into six subsets according to the variation category, are used to evaluate the performance of our method. For the time-variation subset, all images of the first session (1–13) are used for training, and the remaining images (14–26) from the second session are used for testing. A total of seven test subsets are thus generated for the evaluation.

Table 3 compares several algorithms with the proposed LRIC algorithm. In all subsets, the proposed method significantly outperforms the other methods in terms of recognition accuracy rate. In particular, the proposed method is robust to severe local deformation in the scarf and sunglass subsets. Under this situation, the global based methods, including random projection and ELM, appear to be seriously impaired by the occlusions. The recognition accuracy of ELM without dimension reduction is better than that of most of the conventional methods in the expression, illumination and scarf subsets, but worse in the sunglass and time subsets. Another observation is that the random projection methods have almost the lowest accuracy among all of the methods, except for the ensemble approach using multiple RPs. The standard RP is computationally inexpensive but its accuracy is low. Meanwhile, our local random projection followed by ELM is relatively robust to severe face variation. From these results, it is clear that the LRIC significantly outperforms the other compared methods on the AR database with severe facial variations.

Fig. 8 illustrates how the proposed method improves the overall accuracy. As shown in the figure, the randomness of LRP can reduce accuracy and increase diversity. In order to overcome this shortcoming of LRP, we adopt a supervised neural network which can increase the distinctiveness of local features among different classes.

Fig. 8. Accuracy improvement via supervised learning and fusion.

Fig. 9. Synthesized occlusion scenarios in the AT&T database.

Table 4. Robustness evaluation with respect to synthesized occlusions in the AT&T database.

Scenarios        Scenario 1                Scenario 2
Occlusion sizes  5       10      15        5       10      15
PCA-SVM          0.9450  0.5475  0.1575    0.7188  0.6038  0.3238
LDA-SVM          0.9375  0.5188  0.1575    0.6872  0.5913  0.3106
RP-NN            0.9037  0.5063  0.1475    0.5125  0.2125  0.2275
ERP-NN           0.9313  0.5800  0.1725    0.5537  0.2087  0.2437
ELM              0.3700  0.1075  0.0800    0.0588  0.0550  0.0525
LRIC             0.9475  0.9187  0.8200    0.9050  0.8812  0.8025

Table 5. Robustness evaluation with respect to illumination variations.

Database   EYALEB subset 2   EYALEB subset 3   EYALEB subset 4
PCA-SVM    0.9978            0.7447            0.2030
LDA-SVM    0.9430            0.6447            0.1579
RP-NN      0.9825            0.5237            0.0695
ERP-NN     0.9927            0.5447            0.1122
ELM        0.1360            0.0421            0.0376
LRIC       1.0000            0.8553            0.3684


This improves the accuracy of each local classifier. Finally, the ensemble framework boosts the overall accuracy.

4.4.2. Robustness evaluation with respect to synthesized occlusion in the AT&T database

Unlike a conventional face recognition system under controlled environments, partial occlusion can occur frequently in photo images captured by mobile devices in an uncontrolled environment. Therefore we design two scenarios, as shown in Fig. 9. In the first scenario, partial occlusion occurs only in the test images. In the second scenario, partial occlusion is present in both the training and test images. We simulate various levels of occlusion (5×5, 10×10 and 15×15 blocks) by placing a randomly located zero block on the face image. In this experiment, eight images of each identity are randomly selected for training, and the remaining two images are used for testing. The averaged results of 10 runs are reported.

As can be seen from the results in Table 4, the proposed method significantly outperforms all of the other compared methods. In the first scenario, the accuracies of the compared conventional methods decrease rapidly as the size of the occlusion block increases; these methods do not perform well when the occlusion block is 15×15 (22% of the image). The ELM based method records the worst accuracy, due to an unstable least squares solution arising from the small number of training samples. Unlike the AR database, the AT&T database contains small numbers of samples and classes, and the original ELM is very sensitive to the number of training samples. Even for the 15×15 occlusion size, the proposed method works well, and the accuracy degradation caused by the synthesized occlusion is the least among the compared algorithms.

In the second scenario, the 10×10 occlusion size is also applied to the training set, giving a result similar to that of the first scenario. Since occlusion is contained in both the training and test samples, the overall accuracies of all methods are lower than those of the first scenario, even for a 5×5 occlusion. However, the proposed method still works well in this scenario. The ELM records the worst accuracy in all cases due to a low rank problem and the partial occlusion. From Table 4, we can conclude that the proposed method is relatively more robust to partial occlusion than the other compared methods, even when the occlusion is included in the training samples.

4.4.3. Robustness evaluation with respect to illumination variations in the EYALEB database

In this experiment, the performance with respect to variations of illumination using the EYALEB database is evaluated. The images are divided into five subsets according to the angle of the light: the first subset covers the angular range from 0° to 5°, the second subset covers 10–20°, the third subset covers 25–35°, the fourth subset covers 50–70°, and the remaining images (≥85°) are not used. The images in subset 1 are selected for training, and the images in the other subsets are used for testing. For all images, histogram equalization is performed before feature extraction.

Table 5 compares the results of our method with those of the conventional methods. When the angle of the light is small, the accuracies of all methods excluding ELM are fairly high. This means that the proposed local approach has no advantage over the holistic conventional methods in this case. The ELM records a meaningless accuracy due to a low rank problem, as in the previous experiment. When the angle of the light source direction is increased, the conventional methods appear to be more sensitive to illumination variations, while the proposed method is relatively less sensitive. The proposed method does not significantly outperform the other methods as in the previous results on partial occlusion, possibly because the illumination varies over the entire face image.

4.4.4. Sensibility to misalignment in the ETRI and BioID databases

In order to observe the impact of image misalignment on the proposed method, we perform a sensitivity experiment using the datasets that come with ground truth eye positions (the ETRI and BioID databases). Here, we add various translational errors within ±5 pixels of the ground truth eye positions. Fig. 10 shows that the performances of both local and global methods are sensitive to face alignment, and that the proposed method is more sensitive than PCA. This shows that the performance gain of the proposed method relies heavily on accurate alignment. Hence, in order to take advantage of the proposed method, accurate alignment is a pre-requisite.

Fig. 10. Sensibility to misalignment in the ETRI and BioID databases. (a) ETRI. (b) BioID.

Table 6. Incremental learning evaluation in the AR database.

            1:2      3:4      5:6      7:8      9:10     11:13
CCIPCA-NN   0.4300   0.3708   0.4331   0.3954   0.2508   0.1769
CCIPCA-NN*  0.4300   0.4800   0.5108   0.5354   0.5415   0.5692
ILDA-NN     0.4054   0.3415   0.4315   0.3723   0.2662   0.0185
ILDA-NN*    0.4054   0.4362   0.5108   0.5469   0.5569   0.5700
ISVM        Out of memory error
OS-ELM      0.4285   0.4415   0.5123   0.5100   0.4723   0.4119
LRIC        0.5162   0.5423   0.7254   0.7469   0.7492   0.8215

Fig. 11. Performance degradation of IPCA.

Table 7. Sparsity parameter evaluation.

Sparsity (region)  Expression subset  Scarf11–13 subset  Time subset  Avg.
4 (16×16)          0.9400             0.3200             0.8462       0.7020
5 (14×14)          0.9367             0.5567             0.8615       0.7849
6 (13×13)          0.9367             0.6167             0.8800       0.8111
7 (12×12)          0.9300             0.7433             0.8538       0.8423
8 (11×11)          0.9500             0.7467             0.8877       0.8614
10 (10×10)         0.9533             0.8633             0.8762       0.8976
12 (9×9)           0.9500             0.2733             0.8800       0.7011
16 (8×8)           0.9467             0.7300             0.8254       0.8340
20 (7×7)           0.9367             0.4633             0.8315       0.7438
28 (6×6)           0.9067             0.5167             0.8446       0.7560
40 (5×5)           0.8900             0.3833             0.7923       0.6885
64 (4×4)           0.8200             0.5133             0.6492       0.6608


Apart from adopting existing alignment techniques [38], other misalignment correction techniques may also be applied to resolve this issue.

4.4.5. Incremental learning evaluation in the AR database

The objective of this experiment is to demonstrate the effectiveness of incremental learning. First, we partition the entire database into three parts: the initial set (1–2), the additional set (3–13), and the test set (14–26), where the additional set is used to update the model parameters for incremental learning. Since ILDA and LRIC can be retrained by chunk-by-chunk updating, two images of 100 individuals (200 images in total) are used as each chunk dataset, with each model being updated five times. Sample-by-sample updating is used for CCIPCA and ISVM, because these methods cannot update the model parameters chunk-by-chunk. The comparison results are presented in Table 6.

It is easy to see that the recognition rates increase with the number of added samples. When the number of training samples is small, the recognition accuracy of all methods is low. However, the accuracies of the LRICs are competitively higher than those of the other incremental learning methods as the number of training samples increases. From the results, our first observation is that ISVM failed to learn the high dimensional face images even with a large number of samples. The second observation is that the accuracy of incremental subspace learning followed by the nearest neighbor (NN) classifier decreases when new samples are added. In the CCIPCA-NN* and ILDA-NN* methods, the old training samples are re-projected; as can be seen, the performances of these two methods are better than those before re-projection.

Fig. 11 illustrates the performance degradation of IPCA due to the lack of reprojection of old training samples. In the original image space, the training sample x and two test samples x' and x'' can be very close to each other. However, S_t x and S_{t+1} x'' can lie far apart along another axis if only x'' is projected onto the new subspace S_{t+1}. If both the basis and the mean of the features are updated by a large amount, this can lead to severe performance degradation. This is a critical limitation of incremental subspace learning.

4.4.6. Evaluation on the influence of parameters in the AR database

In order to evaluate the influence of the three parameters of the proposed LRP (sparsity, location and distribution pattern), we conduct three experiments using the expression, scarf and time subsets of the AR database.

Table 7 shows the influence of various sparsity settings on test recognition accuracy. As can be seen from the table, the average recognition accuracy varies with the sparsity setting. In the expression subset, which contains small local variations, the influence of sparsity is small. When the sparsity is large (> 28), the accuracy is low since a small local block does not extract relevant structural information from the images. In the scarf 11–13 subset, which contains a large occlusion, the maximum accuracy difference is about 60%; in this case, neither a very small nor a very large sparsity is a good choice. In the time subset, a large sparsity appears not to extract the facial features well. These results show that, when test images contain severe local variations, sparsity strongly influences accuracy and should be neither too high nor too low, which supports the proposed sparsity setting. In this paper, we fix the sparsity at 16 for computational efficiency and local feature extraction.

Table 8. Location parameter evaluation: OV ("overlapping"), NO ("non-overlapping") and SO ("semi-overlapping").

Location        OV                                                      NO      SO
# classifiers   10      16      25      30      40      50      60      16      25
Expression      0.886   0.910   0.933   0.923   0.933   0.933   0.950   0.913   0.936
Scarf 11–13     0.423   0.350   0.480   0.533   0.720   0.730   0.843   0.823   0.790
Time            0.668   0.826   0.783   0.851   0.860   0.868   0.872   0.857   0.877

Table 9. Distribution parameter evaluation.

Subset        Symmetric random   Asymmetric random   Asymmetric Haar-like
Expression    0.9300             0.9200              0.9000
Scarf 11–13   0.8033             0.8267              0.6767
Time          0.8469             0.8473              0.8208

Table 10. Training time evaluation.

Step   PCA-SVM   LDA-SVM   CCIPCA   ILDA    OS-ELM   LRIC
1      0.43      0.32      7.60     5.70    9.36     0.54
2      1.37      0.88      8.81     31.72   5.03     2.09
3      2.62      1.72      8.93     32.20   4.49     2.13
4      4.18      2.72      8.54     33.07   4.65     2.15
5      5.83      3.69      8.15     47.48   4.58     2.21
6      7.77      4.89      7.29     48.99   4.62     2.37
7      9.90      6.07      7.69     35.90   4.66     2.10
8      12.03     7.41      7.01     35.55   4.36     2.18
9      13.60     8.26      7.01     37.91   4.93     2.03
10     15.57     9.49      7.05     39.02   4.71     2.22
11     17.85     10.96     6.73     41.13   4.65     2.08
12     20.38     12.34     6.91     41.85   4.77     2.07
13     24.85     14.30     7.61     44.00   3.87     2.15
14     26.75     15.75     6.96     47.35   3.93     2.25
15     29.83     17.30     7.17     53.18   4.65     2.29
16     31.67     18.70     7.12     61.16   3.97     2.23
17     35.03     21.02     7.27     53.54   4.72     2.25

Table 11. Evaluation on the size of each chunk.

Chunk size   CCIPCA   ILDA      OS-ELM   LRIC
2            0.0017   5.1680    1.7730   0.0383
5            0.0143   5.4928    0.8419   0.0290
8            0.0759   6.1842    1.3522   0.0408
11           0.1158   6.5436    1.8619   0.0395
14           0.1237   7.1666    2.4623   0.0445
17           0.1311   7.9933    2.8265   0.0483
20           0.1408   9.0230    3.1265   0.0505
23           0.1595   11.0799   3.4001   0.0579
26           0.1470   12.3070   3.3635   0.0636
29           0.1522   15.2242   3.2987   0.0819



Table 8 shows the influence of the three location parameters on test accuracy. The number of combined classifiers C depends on the location parameter: in "overlapping", C is taken as 10, 16, 25, 30, 40, 50 and 60, while in "non-overlapping" and "semi-overlapping" it is 16 and 25, respectively. As can be seen, the test accuracy tends to increase when several classifiers are combined. The "overlapping" setting with 60 OS-ELM classifiers shows the best accuracies on most databases; however, the complexity of this combination is 3.5 or 2.5 times higher than those of "non-overlapping" and "semi-overlapping", respectively. In the expression subset, the maximum accuracy difference is much smaller than that of the scarf subset. According to this observation, when there is a severe local variation, the system requires a large number of classifiers for reliable accuracy. Taking all of these factors into account, "semi-overlapping" is considered a reasonable compromise between accuracy and computational cost; a sketch of this decision-level combination is given below.
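The following minimal sketch (ours, not the authors' code; the sum rule and the score shapes are assumptions, since the exact fusion rule is not restated in this section) shows the shape of such a decision-level combination of C local classifiers:

```python
# Hedged sketch of decision-level fusion: C local classifiers, one per block
# location, each emit a score vector over the classes; the ensemble decides
# on the summed scores (sum rule assumed for illustration).
import numpy as np

def fuse_and_classify(block_scores):
    """block_scores: list of C arrays, each of shape (n_classes,)."""
    total = np.sum(block_scores, axis=0)   # decision-level fusion
    return int(np.argmax(total))           # fused class decision

rng = np.random.default_rng(4)
C, n_classes = 25, 100                     # e.g. "semi-overlapping": C = 25
block_scores = [rng.standard_normal(n_classes) for _ in range(C)]
print(fuse_and_classify(block_scores))
```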

Finally, in order to study the influence of the distribution pattern of the non-zero entries, we evaluate symmetric random (SR), asymmetric random (AR) and asymmetric Haar-like distributions. The parameters are fixed at sparsity = 16, location = "overlapping" and C = 30. Table 9 shows that there is almost no performance difference between SR and AR. In Eq. (9), we introduced k, which controls the ratio of −1 to 1 entries and may cause an increase of the approximation error in random projection; Table 9 shows empirically that this can be alleviated by fusion. However, the accuracy of the asymmetric Haar-like distribution is not good, because simple Haar-like patterns within a small local region are less distinctive than a symmetric random distribution. An AdaBoost based feature selection technique using a generic face database may solve this problem.
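To make the three LRP parameters concrete, the sketch below is our own illustration under stated assumptions: a 32×32 image, the mapping of sparsity s to a local block of 1024/s pixels read off from Table 7 (s = 16 gives an 8×8 block), and k taken as the probability of a −1 entry in the spirit of Eq. (9), whose exact form is not repeated here.

```python
# Hypothetical sketch of one local sparse random basis: the non-zero +-1
# entries of each basis vector are confined to a single local block.
import numpy as np

def local_random_basis(img_shape=(32, 32), block=(8, 8), n_bases=20,
                       k=0.5, rng=None):
    """k is the assumed probability of a -1 entry (0.5 = symmetric random)."""
    rng = rng if rng is not None else np.random.default_rng()
    H, W = img_shape
    bh, bw = block
    R = np.zeros((n_bases, H * W))
    r0 = rng.integers(0, H - bh + 1)       # block location shared by this
    c0 = rng.integers(0, W - bw + 1)       # local classifier's bases
    for i in range(n_bases):
        B = np.zeros((H, W))
        B[r0:r0 + bh, c0:c0 + bw] = rng.choice([-1.0, 1.0],
                                               size=(bh, bw), p=[k, 1 - k])
        R[i] = B.ravel()
    return R

R = local_random_basis()                   # 20 x 1024 projection matrix
x = np.random.default_rng(1).standard_normal(32 * 32)  # a vectorized face
feature = R @ x                            # local feature fed to a classifier
```

Setting k away from 0.5 gives the asymmetric random variant; replacing the random ±1 pattern inside the block with a rectangle template would give the Haar-like variant evaluated in Table 9.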


4.5. Computational cost evaluation in the PIE database

In this section, we conduct three experiments on training time and one experiment on testing time using the PIE database, which contains a large number of samples.

4.5.1. Training time evaluation

We partition the database into 17 parts, where each part contains 680 samples, to measure the average time needed to update the model parameters. In the cases of PCA-SVM and LDA-SVM, a batch-based retraining approach is used to measure the training execution time. In the case of CCIPCA, which is based on sample-by-sample updating, the total execution time is measured for training all of the 680 samples. For our method, the parameters are set at sparsity = 16, location = "non-overlapping" and distribution = "asymmetric random".

Table 10 shows the execution times of the above-mentioned batch and incremental methods to train newly added samples, and Table 11 shows the results for CCIPCA, ILDA, OS-ELM and our algorithm for various chunk sizes. As can be seen from the tables, the incremental methods use almost constant time for updating, while the batch based methods need much more time to retrain the entire sample set. CCIPCA is much faster than ILDA. Although ILDA is an incremental learning method, its CPU time is the worst because the algorithm relies on high dimensional QR and SVD decompositions during updates of the two scatter matrices. The training time of OS-ELM is almost constantly high due to the use of high dimensional features without dimension reduction. From these results, it is clear that our algorithm outperforms all of the compared methods.
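The constant per-chunk cost comes from recursive least-squares style updates. A minimal sketch of the chunk-by-chunk recursion in the OS-ELM flavor is given below (our own illustration; the paper's LRIC update at the decision level differs in its details, and the dimensions are arbitrary):

```python
# Chunk-by-chunk recursive least squares: each update costs O(n^2 m) for a
# chunk of n samples and m features, independent of how many samples have
# already been learned.
import numpy as np

def rls_init(H0, T0, lam=1e-3):
    """H0: n0 x m features, T0: n0 x c targets; lam is a ridge term."""
    P = np.linalg.inv(H0.T @ H0 + lam * np.eye(H0.shape[1]))
    W = P @ H0.T @ T0
    return P, W

def rls_update(P, W, H, T):
    """Woodbury-identity update for one new chunk (H: n x m, T: n x c)."""
    K = np.linalg.inv(np.eye(H.shape[0]) + H @ P @ H.T)
    P = P - P @ H.T @ K @ H @ P
    W = W + P @ H.T @ (T - H @ W)
    return P, W

rng = np.random.default_rng(0)
P, W = rls_init(rng.standard_normal((50, 20)), rng.standard_normal((50, 3)))
for _ in range(5):                         # five chunks, constant cost each
    P, W = rls_update(P, W, rng.standard_normal((10, 20)),
                      rng.standard_normal((10, 3)))
```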

Fig. 12 plots the average execution times of the LRICs over the size of each chunk and the number of classes. As can be seen from the results, the computational costs of the LRICs are proportional to the size of each chunk. The number of classes does not significantly affect the total computation, as mentioned in the complexity analysis.

Fig. 12. Evaluation of various chunk sizes and the number of classes.

Table 12. Test time to classify 100 test samples.

# training samples   PCA-SVM   LDA-SVM   CCIPCA   ILDA    OS-ELM   LRIC
680                  0.090     0.051     0.641    0.297   0.013    0.033
1360                 0.128     0.076     1.297    0.609   0.013    0.031
2040                 0.157     0.102     2.094    0.781   0.013    0.026
2720                 0.190     0.186     1.750    1.109   0.014    0.025
3400                 0.304     0.127     2.750    1.328   0.013    0.024
4080                 0.270     0.174     3.781    1.500   0.013    0.025
4760                 0.329     0.413     3.422    1.563   0.013    0.024
5440                 0.346     0.261     3.609    1.734   0.014    0.042
6120                 0.369     0.218     4.469    2.109   0.014    0.026
6800                 0.413     0.254     5.328    2.516   0.013    0.026
7480                 0.422     0.255     5.578    2.625   0.014    0.037
8160                 0.537     0.303     6.234    2.875   0.018    0.035

Table 13. Training time evaluation in a mobile device (Android phone): OS-ELM+ uses 500 hidden units and OS-ELM++ uses 1000 hidden units. The measurement units are milliseconds.

Chunk size   5      10     15     20      25      30
OS-ELM+      2985   5904   9070   12081   16216   16938
OS-ELM++     Fail
LRIC         1264   2704   4224   5696    6608    8864

Table 14. Test time evaluation in a mobile device.

                OS-ELM+   LRIC
Android phone   194 ms    65 ms


4.5.2. Test time evaluation

Table 12 shows the test execution times of the proposed method and the other methods when classifying 100 test samples. When the number of training samples increases, the test times of PCA-SVM and LDA-SVM also increase slightly because the number of support vectors is proportional to the number of training samples. LDA-SVM is much faster than PCA-SVM because PCA-SVM uses a much higher feature dimensionality. The test times of CCIPCA and ILDA, which use the nearest neighbor classifier, increase severely compared to those of PCA-SVM and LDA-SVM. Unlike those of the other methods, the testing times of both OS-ELM and the proposed method are short and nearly constant.
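The scaling difference is easy to see in code. The toy benchmark below (ours; the feature dimension and class count are arbitrary stand-ins, not the paper's settings) compares a nearest-neighbor search, whose cost grows with the gallery, against a fixed linear decision, whose cost does not:

```python
# Toy timing: NN scans the whole gallery per probe, while a trained linear
# map costs one matrix product regardless of how many samples were learned.
import time
import numpy as np

rng = np.random.default_rng(2)
d, c = 100, 68                                  # assumed feature dim, classes
W = rng.standard_normal((d, c))                 # fixed classifier weights
probes = rng.standard_normal((100, d))          # 100 probes, as in Table 12

for n in (680, 8160):                           # smallest/largest gallery
    gallery = rng.standard_normal((n, d))
    t0 = time.perf_counter()
    d2 = ((probes ** 2).sum(1)[:, None] + (gallery ** 2).sum(1)[None]
          - 2.0 * probes @ gallery.T)           # squared distances (100 x n)
    nn_idx = np.argmin(d2, axis=1)
    t_nn = time.perf_counter() - t0
    t0 = time.perf_counter()
    labels = np.argmax(probes @ W, axis=1)      # cost independent of n
    t_lin = time.perf_counter() - t0
    print(f"n={n}: NN {t_nn:.4f}s, linear {t_lin:.5f}s")
```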

4.6. Evaluation in a mobile device

The classification accuracy was not re-evaluated on the mobile platform since similar numerical implementations based on double-precision floating point arithmetic were adopted in both the PC and mobile environments.

For the evaluation, a recent mobile phone, the Google Dev Phone (Qualcomm 7210 processor at 528 MHz, 192 MB of memory and Android 1.5), is used. In order to generate native code libraries from the C and C++ sources on the Google Dev Phone, we use the Android Native Development Kit, because Java code on an Android phone runs much more slowly than native C code. No code optimization is applied in our experiments.

Table 13 shows the training times for various chunk sizes, and Table 14 shows the test times for classifying one sample on the mobile device. In the case of OS-ELM, we failed to train the classifier with 1024 feature dimensions and 1000 hidden units due to a memory allocation error. Thus we train the OS-ELM classifier with 1024 feature dimensions and 500 hidden units in order to compare it with our method at sparsity = 16, location = "non-overlapping" and distribution = "asymmetric random". In the previous experiments, OS-ELM produced its best accuracy with a large number of hidden units (over 1000); however, such an OS-ELM cannot be trained within a practical time frame on the mobile device. When five new images are available, the proposed method can be trained in 1264 ms. Even though a small number of hidden units has been used, OS-ELM is still more than twice as slow as LRIC.

4.7. Summary of experiments

(i) From Tables 3–5, all three databases used for accuracy evaluation showed that the proposed method is robust to local deformations such as scarves, sunglasses and synthesized occlusions in both the training and test images. However, regarding illumination variation, the proposed method does not significantly outperform the other methods, because illumination changes spread over the entire face image.

(ii) As observed from Tables 10 and 11, the training time of the proposed method is consistently lower than those of the compared methods, even when several classifiers are combined. Moreover, the test time is not only short but also independent of the number of training samples, and only the proposed method was trained successfully on a mobile device.

(iii) The accuracy of the conventional random basis method, which extracts orthogonal and global features, is lower than that of PCA, although it requires no training time.

(iv) The accuracy of OS-ELM is slightly higher than that of the subspace methods. However, as seen from Tables 4 and 5, the reliability of its accuracy is very low due to an unstable least squares solution arising from the small number of training samples.

(v) The training time of CCIPCA is almost constantly low, whereas ILDA has expensive computational costs.

(vi) The test times of RP-NN, CCIPCA-NN and ILDA-NN, which use the nearest neighbor rule for the final decision, are impractical for mobile device applications.

Finally, the compared methods are summarized in Table 15 in terms of test accuracy and training and test execution times.

Table 15. Summary of results.

Method             Accuracy                         Training time                      Test time
PCA-SVM, LDA-SVM   Sensitive to local deformation   Proportional to # entire samples   Fast
RP, ERP            Poorer than PCA                  No training time                   Very slow
CCIPCA             Similar to PCA                   Constantly low                     Very slow
ILDA               Similar to LDA                   Very slow                          Very slow
ISVM               Failed to learn                  Not evaluated                      Not evaluated
OS-ELM             Sensitive to # samples           Constantly high                    Very fast
Proposed           Robust to local deformation      Constantly low                     Very fast


5. Conclusion

In this work, we focused on the challenging problem of learning a large number of samples sequentially within mobile devices in realtime by presenting a realtime training algorithm for face recognition related applications. Simultaneously, we considered the partial occlusion problem, which occurs frequently in photo images captured by mobile devices. In order to address these challenging problems, we extracted face features using local random bases and then trained a sequential neural network incrementally with these features. Unlike conventional methods, which use a set of random bases that extract orthogonal and global features, we generated non-orthogonal local random bases that are robust to local deformation as well as computationally efficient. Moreover, we adopted an ensemble of local random features and sequential neural networks that improves the overall accuracy. The proposed incremental learning framework at the decision level is differentiated from the well-known incremental subspace methods, which operate at the feature level. We demonstrated both the efficiency and effectiveness of the proposed algorithm and the feasibility of its application in mobile devices through empirical experiments. The results show that the proposed algorithm significantly outperformed several popular face recognition methods, even with partial occlusion in both the training and test images. Moreover, the training and test times were constantly low under the different evaluations, and the memory consumption was also low. This establishes the possibility of training model parameters in realtime on mobile devices. We believe that the proposed framework of local random projection and an incremental classifier can be easily extended to other computer vision and pattern recognition tasks on mobile devices.

Acknowledgment

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2010-C1090-1021-0008). This work was supported by the National Research Foundation of Korea (NRF) through the Biometrics Engineering Research Center (BERC) at Yonsei University (R112002105070030 (2010)).

References

[1] M. Turk, A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3 (1) (1991) 71–86.

[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, 2000.

[3] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 2005.

[4] E. Weinstein, P. Ho, B. Heisele, T. Poggio, K. Steele, A. Agarwal, Handheld face identification technology in a pervasive computing environment, in: Short Paper Proceedings, Pervasive 2002, 2002, pp. 48–54.

[5] T. Hazen, E. Weinstein, R. Kabir, A. Park, B. Heisele, Multi-modal face and speaker identification on a handheld device, in: Workshop on Multimodal User Authentication, Santa Barbara, CA, 2003, pp. 113–120.

[6] S. Mukherjee, Z. Chen, A. Gangopadhyay, S. Russell, A secure face recognition system for mobile-devices without the need of decryption, in: Workshop on Secure Knowledge Management, 2008.

[7] J. Yang, X. Chen, W. Kunz, A PDA-based face recognition system, in: Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision, IEEE Computer Society, Washington, DC, USA, 2002, p. 19.

[8] S. Jung, Y. Chung, J. Yoo, K. Moon, Real-time face verification for mobile platforms, in: Proceedings of the 4th International Symposium on Advances in Visual Computing, Part II, Springer, 2008, p. 832.

[9] C. Schneider, N. Esau, L. Kleinjohann, B. Kleinjohann, Feature based face localization and recognition on mobile devices, in: Proceedings of the 9th International Conference on Control, Automation, Robotics and Vision, 2006, pp. 1–6.

[10] A. Hadid, J. Heikkila, O. Silven, M. Pietikainen, Face and eye detection for person authentication in mobile phones, in: First ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC '07, 2007, pp. 101–108.

[11] S.W. Chu, M.C. Yeh, K.T. Cheng, A real-time, embedded face-annotation system, in: Proceedings of the 16th ACM International Conference on Multimedia, 2008, pp. 989–990.

[12] P. Viola, M. Jones, Robust real-time object detection, International Journal of Computer Vision 57 (2) (2002) 137–154.

[13] Y. Ijiri, M. Sakuragi, S. Lao, Security management for mobile devices by face recognition, in: Proceedings of the 7th International Conference on Mobile Data Management, 2006, p. 49.

[14] N. Goel, G. Bebis, A. Nefian, Face recognition experiments with random projection, in: Proceedings of SPIE, vol. 5779, 2005, pp. 426–437.

[15] D. Achlioptas, Database-friendly random projections: Johnson–Lindenstrauss with binary coins, Journal of Computer and System Sciences 66 (4) (2003) 671–687.

[16] P. Li, T. Hastie, K. Church, Very sparse random projections, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, p. 296.

[17] A. Jain, K. Nandakumar, A. Nagar, Biometric template security, EURASIP Journal on Advances in Signal Processing 2008 (2008) 1–17.

[18] Y. Kim, A. Beng-Jin-Teoh, K.A. Toh, A performance driven methodology for cancelable face templates generation, Pattern Recognition 43 (7) (2010) 2544–2559.

[19] K. Toh, Deterministic neural classification, Neural Computation 20 (6) (2008) 1565–1595.

[20] G. Huang, Q. Zhu, C. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1–3) (2006) 489–501.

[21] N. Liang, G. Huang, P. Saratchandran, N. Sundararajan, A fast and accurate online sequential learning algorithm for feedforward networks, IEEE Transactions on Neural Networks 17 (6) (2006) 1411–1423.

[22] J. Weng, Y. Zhang, W. Hwang, Candid covariance-free incremental principal component analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (8) (2003) 1034–1040.

[23] M. Bartlett, J. Movellan, T. Sejnowski, Face recognition by independent component analysis, IEEE Transactions on Neural Networks 13 (6) (2002) 1450–1464.

[24] P. Penev, J. Atick, Local feature analysis: a general statistical theory for object representation, Network: Computation in Neural Systems 7 (3) (1996) 477–500.

[25] T. Feng, S. Li, H. Shum, H. Zhang, Local nonnegative matrix factorization as a visual representation, in: Proceedings of the 2nd International Conference on Development and Learning, 2001, pp. 178–183.

[26] Y. Zhu, J. Liu, S. Chen, Semi-random subspace method for face recognition, Image and Vision Computing 27 (9) (2009) 1358–1370.

[27] M. Jones, P. Viola, Face recognition using boosted local features, in: Proceedings of the International Conference on Computer Vision, 2003.

[28] S. Li, Z. Zhang, FloatBoost learning and statistical face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (2004) 1112–1123.

[29] Z. Liang, Y. Li, Incremental support vector machine learning in the primal and applications, Neurocomputing 72 (2009) 2249–2258.


[30] G. Cauwenberghs, T. Poggio, Incremental and decremental support vector machine learning, in: Proceedings of Advances in Neural Information Processing Systems, The MIT Press, 2001, p. 409.

[31] A. Martinez, R. Benavente, The AR face database, CVC Technical Report #24.

[32] AT&T face database, http://www.uk.research.att.com/facedatabase.html

[33] A. Georghiades, P. Belhumeur, D. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 643–660.

[34] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression (PIE) database, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (12) (2003) 1615–1618.

[35] S. Jung, J. Yoo, A robust eye detection method in facial region, in: Computer Vision/Computer Graphics Collaboration Techniques, pp. 596–606.

[36] BioID face database, http://www.bioid.com/support/downloads/software/bioid-face-database.html

[37] T. Kim, S. Wong, B. Stenger, J. Kittler, R. Cipolla, Incremental linear discriminant analysis using sufficient spanning set approximations, in: Proceedings of Computer Vision and Pattern Recognition, 2007.

[38] S. Shan, Y. Chang, W. Gao, B. Cao, P. Yang, Curse of mis-alignment in face recognition: problem and a novel mis-alignment learning solution, in: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004, pp. 314–320.

Kwontaeg Choi received the B.S. degree in Computer Science from Hallym University, Chooncheon, Korea, in 2001, and the M.S. degree in Computer Science from Yonsei University, Seoul, Korea, in 2006. He is currently a Ph.D. candidate in Computer Science at Yonsei University, Seoul, Korea. His research interests include computer vision, pattern recognition, and face recognition.

Kar-Ann Toh is a full professor in the School of Electrical and Electronic Engineering at Yonsei University, South Korea. He received the Ph.D. degree from Nanyang Technological University (NTU), Singapore, and worked for two years in the aerospace industry prior to his post-doctoral appointments at research centres in NTU from 1998 to 2002. He was affiliated with the Institute for Infocomm Research in Singapore from 2002 to 2005 prior to his current appointment in Korea. His research interests include biometrics, pattern classification, optimization and neural networks. He is a co-inventor of a US patent and has made several PCT filings related to biometric applications. Besides being an active member in publications, Dr. Toh has served as a member of the technical program committee for international conferences related to biometrics and artificial intelligence. He has also served as a reviewer for international journals including several IEEE Transactions. He is a senior member of the IEEE.

Hyeran Byun received the B.S. and M.S. degrees in mathematics from Yonsei University, Seoul, Korea, and the Ph.D. degree in computer science from Purdue University, West Lafayette, IN. She was an Assistant Professor at Hallym University, Chooncheon, Korea, from 1994 to 1995. She is currently a Professor of Computer Science at Yonsei University. Her research interests include computer vision, image and video processing, artificial intelligence, event recognition, gesture recognition, and pattern recognition.