Giovani Chiachia â€œLearning Person-Specific Face Representationsâ€

Giovani Chiachia

“Learning Person-Specific Face Representations”

“Aprendendo Representacoes Especıficas para a

Face de cada Pessoa”

CAMPINAS

2013

i

ii

Ficha catalográficaUniversidade Estadual de Campinas

Biblioteca do Instituto de Matemática, Estatística e Computação CientíficaAna Regina Machado - CRB 8/5467

Chiachia, Giovani, 1981- C43L ChiLearning person-specific face representations / Giovani Chiachia. – Campinas,

SP : [s.n.], 2013.

ChiOrientador: Alexandre Xavier Falcão. ChiCoorientador: Anderson de Rezende Rocha. ChiTese (doutorado) – Universidade Estadual de Campinas, Instituto de

Computação.

Chi1. Identificação biométrica. 2. Reconhecimento facial (Computação). 3. Visão

por computador. 4. Aprendizado de máquina. I. Falcão, Alexandre Xavier,1966-. II.Rocha, Anderson de Rezende,1980-. III. Universidade Estadual de Campinas.Instituto de Computação. IV. Título.

Informações para Biblioteca Digital

Título em outro idioma: Aprendendo representações específicas para a face de cada pessoaPalavras-chave em inglês:Biometric identificationHuman face recognition (Computer science)Computer visionMachine learningÁrea de concentração: Ciência da ComputaçãoTitulação: Doutor em Ciência da ComputaçãoBanca examinadora:Alexandre Xavier Falcão [Orientador]Hélio PedriniEduardo Alves do Valle JuniorZhao LiangWalter Jerome ScheirerData de defesa: 27-08-2013Programa de Pós-Graduação: Ciência da Computação

Powered by TCPDF (www.tcpdf.org)

iv

Institute of Computing /Instituto de Computacao

University of Campinas /Universidade Estadual de Campinas

Learning Person-Specific Face Representations

Giovani Chiachia1

August 27, 2013

Examiner Board/Banca Examinadora :

• Prof. Dr. Alexandre Xavier Falcao (Supervisor)

• Prof. Dr. Helio Pedrini

Institute of Computing - UNICAMP

• Prof. Dr. Eduardo Alves do Valle Junior

School of Electrical and Computer Engineering - UNICAMP

• Prof. Dr. Zhao Liang

Institute of Mathematics and Computer Science - USP

• Prof. Dr. Walter Jerome Scheirer

School of Engineering and Applied Sciences - Harvard University

1Financial support: FAPESP scholarship (2010/00994-8) 2009–2013

vii

Abstract

Humans are natural face recognition experts, far outperforming current automated face

recognition algorithms, especially in naturalistic, “in-the-wild” settings. However, a strik-

ing feature of human face recognition is that we are dramatically better at recognizing

highly familiar faces, presumably because we can leverage large amounts of past expe-

rience with the appearance of an individual to aid future recognition. Researchers in

psychology have even suggested that face representations might be partially tailored or

optimized for familiar faces. Meanwhile, the analogous situation in automated face recog-

nition, where a large number of training examples of an individual are available, has been

largely underexplored, in spite of the increasing relevance of this setting in the age of

social media. Inspired by these observations, we propose to explicitly learn enhanced face

representations on a per-individual basis, and we present a collection of methods enabling

this approach and progressively justifying our claim. By learning and operating within

person-specific representations of faces, we are able to consistently improve performance

on both the constrained and the unconstrained face recognition scenarios. In particu-

lar, we achieve state-of-the-art performance on the challenging PubFig83 familiar face

recognition benchmark. We suggest that such person-specific representations introduce

an intermediate form of regularization to the problem, allowing the classifiers to generalize

better through the use of fewer — but more relevant — face features.

ix

Resumo

Os seres humanos sao especialistas natos em reconhecimento de faces, com habilidades

que excedem em muito as dos metodos automatizados vigentes, especialmente em cenarios

nao controlados, onde nao ha a necessidade de colaboracao por parte do indivıduo sendo

reconhecido. No entanto, uma caracterıstica marcante do reconhecimento de face hu-

mano e que nos somos substancialmente melhores no reconhecimento de faces familiares,

provavelmente porque somos capazes de consolidar uma grande quantidade de experiencia

previa com a aparencia de um certo indivıduo e de fazer uso efetivo dessa experiencia

para nos ajudar no reconhecimento futuro. De fato, pesquisadores em psicologia tem ate

mesmo sugerido que a representacao interna que fazemos das faces pode ser parcialmente

adaptada ou otimizada para rostos familiares. Enquanto isso, a situacao analoga no reco-

nhecimento facial automatizado — onde um grande numero de exemplos de treinamento

de um indivıduo estao disponıveis — tem sido muito pouco explorada, apesar da cres-

cente relevancia dessa abordagem na era das mıdias sociais. Inspirados nessas observacoes,

nesta tese propomos uma abordagem em que a representacao da face de cada pessoa e

explicitamente adaptada e realcada com o intuito de reconhece-la melhor. Apresenta-

mos uma colecao de metodos de aprendizado que endereca e progressivamente justifica

tal abordagem. Ao aprender e operar com representacoes especıficas para face de cada

pessoa, nos somos capazes de consistentemente melhorar o poder de reconhecimento dos

nossos algoritmos. Em particular, nos obtemos resultados no estado da arte na base de

dados PubFig83, uma desafiadora colecao de imagens instituıda e tornada publica com

o objetivo de promover o estudo do reconhecimento de faces familiares. Nos sugerimos

que o aprendizado de representacoes especıficas para face de cada pessoa introduz uma

forma intermediaria de regularizacao ao problema de aprendizado, permitindo que os

classificadores generalizem melhor atraves do uso de menos — porem mais relevantes —

caracterısticas faciais.

xi

A minha esposa, Thais Regina de Souza

Chiachia.

xiii

Agradecimentos

Apos anos de trabalho, muitas sao as pessoas as quais eu gostaria de expressar meus

sinceros agradecimentos.

Primeiramente, eu gostaria de agradecer aos professores doutores Alexandre Xavier

Falcao e Anderson Rocha pela orientacao que me deram durante o desenvolvimento deste

trabalho. Nao foram poucas as ocasioes que demandaram suporte, reflexoes e mudancas

de rumo, e eles administraram isso com sabedoria, dando-me autonomia na medida certa

para que hoje, de fato, eu sinta-me pesquisador. Aprendi muito com eles.

Sem o acolhimento da Universidade Estadual de Campinas (UNICAMP) e o fomento

da Fundacao de Amparo a Pesquisa do Estado de Sao Paulo (FAPESP 2010/00994-8),

este trabalho tampouco seria possıvel. Gostaria de agradece-las profundamente pelo apoio.

Nos anos vindouros, comprometo-me a retribuı-lo atraves da aplicacao dos conhecimentos

que obtive em prol de uma sociedade mais prospera.

Tambem gostaria de agradecer aos doutores David Cox e Nicolas Pinto por terem me

recebido tao gentilmente nos EUA durante os seis meses em que la estive na Universidade

de Harvard. Essa passagem foi fundamental para o sucesso deste projeto. Por la, tambem

aprendi muito.

De forma geral, gostaria de agradecer a todos que colaboraram, direta ou indireta-

mente, com este trabalho. Ressalto aqui a colaboracao com o doutor William R. Schwartz,

tao oportuna, e as conversas com os doutores Nicolas Poilvert e James Bergstra, tao produ-

tivas e inspiradoras. A todos os colegas de trabalho, discentes, docentes e administrativos,

meu muito obrigado.

E preciso muita perseveranca para seguir o caminho do doutoramento e minha famılia

e, sem duvida, uma das grandes responsaveis por eu te-la mantido durante esses anos.

Pelo amor incondicional que me dao — e de amor deriva-se suporte, motivacao, paciencia,

empatia, etc. — por fim eu gostaria de agradecer a minha esposa e aos meus pais. Pelos

mesmos motivos, gostaria de agradecer a minha avo, aos meus irmaos e a todos “la em

casa”. Sao pessoas que genuinamente me querem bem, assim como meus grandes amigos,

que aqui tambem agradeco.

xv

“Programming, like all engineering, is

a lot of work: we have to build

everything from scratch. Learning is

more like farming, which lets nature do

most of the work. Farmers combine

seeds with nutrients to grow crops.

Learners combine knowledge with data

to grow programs.”

Pedro Domingos

xvii

Related Publications

I. Giovani Chiachia, Alexandre X. Falcao, and Anderson Rocha. Person-specific Face

Representation for Recognition. In IEEE/IAPR Intl. Joint Conference on Biomet-

rics, Washington DC, 2011.

II. Giovani Chiachia, Nicolas Pinto, William R. Schwartz, Anderson Rocha, Alexan-

dre X. Falcao, and David Cox. Person-Specific Subspace Analysis for Unconstrained

Familiar Face Identification. In British Machine Vision Conference, Surrey, 2012.

III. Manuel Gunther, Artur Costa-Pazo, Changxing Ding, Elhocine Boutellaa, and Gio-

vani Chiachia et al. The 2013 Face Recognition Evaluation in Mobile Environment.

In IEEE/IAPR Intl. Conference on Biometrics, Madrid, 2013.

IV. Giovani Chiachia, Alexandre X. Falcao, Nicolas Pinto, Anderson Rocha, and David

Cox. Learning Person-Specific Face Representations. IEEE Trans. on Pattern

Analysis and Machine Intelligence, (submitted), 2013.

xix

Contents

Abstract ix

Resumo xi

Dedication xiii

Agradecimentos xv

Epigraph xvii

Related Publications xix

1 Introduction 1

1.1 Thesis Organization and Contributions . . . . . . . . . . . . . . . . . . . . 3

2 Background 5

2.1 Face Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Recognition Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Datasets and Evaluation Protocol 10

3.1 Constrained: UND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2 Unconstrained: PubFig83 . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4 Preliminary Evaluation 14

4.1 Discriminant Patch Selection (DPS) . . . . . . . . . . . . . . . . . . . . . . 14

4.2 DPS Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.3 Experiments in the Controlled Scenario . . . . . . . . . . . . . . . . . . . . 16

4.4 Experiments in the Unconstrained Scenario . . . . . . . . . . . . . . . . . . 19

5 Person-Specific Subspace Analysis 21

5.1 Partial Least Squares (PLS) . . . . . . . . . . . . . . . . . . . . . . . . . . 21

xxi

5.2 Person-Specific PLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

6 Deep Person-Specific Models 30

6.1 L3+ Top Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7 Conclusion and Future Work 39

Bibliography 41

A Running Example of our Preliminary Evaluation 49

B Additional Results on Person-Specific Subspace Analysis 51

C Scatter Plots from Different Subspace Analysis Techniques 53

D Overview on Deep Visual Hierarchies 55

E Scoring Best in the ICB-2013 Competition and the Applicability of Our

Approach in the MOBIO Dataset 58

E.1 The MOBIO Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

E.2 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

E.3 Our Winning Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

E.4 Learning Person-Specific Representations . . . . . . . . . . . . . . . . . . . 64

E.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

xxiii

List of Tables

4.1 Experimental details and performance evaluation in the controlled scenario 17

4.2 Preliminary evaluation in the unconstrained scenario . . . . . . . . . . . . 20

5.1 Comparison of different face subspace analysis techniques on the PubFig83

dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

6.1 Comparisons in identification mode with our person-specific filter learning

approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.2 Identification results on PubFig83 available in the literature. . . . . . . . . 37

B.1 Visual comparison of different face subspace analysis techniques . . . . . . 52

B.2 Comparison of different face subspace analysis techniques in the Face-

book100 dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

E.1 Systems initially designed for the ICB-2013 competition . . . . . . . . . . . 62

E.2 Results obtained with the replacement of 1-NN predictions by one-versus-

all linear SVMs in the MOBIO dataset . . . . . . . . . . . . . . . . . . . . 63

E.3 Comparison among LDA, PS-PLS, and Deep PS representation learning

approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

E.4 Results obtained by incorporating gallery images in the process of learning

person-specific representations . . . . . . . . . . . . . . . . . . . . . . . . . 65

xxv

List of Figures

1.1 Pipelines illustrating how methods can be regarded with respect to the face

representation approach they employ . . . . . . . . . . . . . . . . . . . . . 2

2.1 Milestones in the history of face representation . . . . . . . . . . . . . . . . 6

2.2 Face recognition from the constrained to the unconstrained scenario . . . . 8

3.1 Training and test images of four individuals in the UND constrained dataset 11

3.2 Images of four individuals in a given split of PubFig83 . . . . . . . . . . . 13

4.1 Person-specific and general models obtained with DPS and the resulting

most discriminant patches . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.2 Per-split evaluation on the UND dataset . . . . . . . . . . . . . . . . . . . 19

5.1 Schematic illustration of our person-specific subspace analysis approach . . 24

5.2 Scatter plot, model illustration, and representative samples resulting from

the use of PS-PLS models . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6.1 Schematic diagram of the L3+ convolutional neural network, detailing our

approach to learn deep person-specific models . . . . . . . . . . . . . . . . 33

6.2 Plot of the results obtained with Deep Person-Specific Models in identifi-

cation mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.3 Comparisons with Deep Person-Specific Models in verification mode . . . . 37

A.1 Illustration of the identification scheme adopted in our preliminary evaluation 50

C.1 Visualization of the training and test samples projected onto the first two

projection vectors of each subspace model . . . . . . . . . . . . . . . . . . 54

D.1 Architecture of one hypothetical layer using three well-known biologically-

inspired operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

E.1 Representative training and test images from the MOBIO evaluation set . 60

xxvii

Chapter 1

Introduction

The notion of creating a face “representation” tailored to the structure found in faces

is a longstanding and foundational idea in automated face recognition research [1, 2, 3].

Indeed, a multitude of face recognition approaches employ an initial transformation into

a general representation space before performing further processing [4, 5, 6, 7]. However,

while the resulting face representation naturally captures structure found in common

with all faces, much less attention has been paid to exploring the possibility of face

representations constructed on a per individual basis.

Several observations motivate exploring the problem of person-specific face represen-

tations. First, intuitively, different facial features can be differentially distinctive across

individuals. For instance, a given individual might have a distinctive nose, or a particular

relationship between face features. Meanwhile, in realistic environments, these features

might undergo significant variation due to changes in lighting, viewing angle, occlusion,

etc. Exploring feature extraction that is tailored to specific individuals of interest is a

potentially promising approach to tackling this problem.

In addition, the task of learning specialized representations in a per-individual basis

has a natural relationship to the notion of “familiarity” in human face recognition, in that

the brain may rely on enhanced face representations for familiar individuals [8, 9]. If we

consider that humans are generally excellent at identifying familiar individuals even under

uncontrolled viewing conditions [10] and that the advantage of humans over machines in

this scenario is still substantial [11], face familiarity is a specially relevant notion to pursue

in the design of robust face recognition systems [12].

Finally, we argue that exploring this approach is especially timely today, as cameras

become increasingly ubiquitous, recording an ever-growing torrent of image and video

data. While to date much of face recognition research has focused on matching (e.g.,

same/different) paradigms based on image pairs, the sheer volume of image data, in

combination with user-driven cooperative face labeling, makes “familiar” face recognition

1

2

(a)

person-speci�crepresentation

learning

representationstailored

forgeneralrepresentation

learning/extraction

(b)

our approach

person 1

person 2

person n

...

classi�er

classi�er

classi�er

...

generalrepresentation


learning/extraction

classi�er(s)general

representationface

images

Figure 1.1: Pipelines illustrating how methods can be regarded with respect to the facerepresentation approach they employ. Both pipelines (a) and (b) transform the inputimages into a feature set where the faces are described by the same, general attributes.Common techniques to derive this representation are Eigenface [1], Gabor wavelets [2],Local Binary Patterns [3], Fisherface [4], among others. On top of general face repre-sentations, methods following pipeline (a) directly perform learning tasks. In contrast,as presented in pipeline (b), our approach is to explicitly cast these general representa-tions in person-specific ones by means of intermediate learning tasks that are based ondomain-knowledge, and are aimed at emphasizing the most discriminant face aspects ofeach individual.

increasingly relevant. One context where such an approach is especially attractive is in

social media, where the problem is often to recognize an individual belonging to a lim-

ited, fixed gallery of possible friends, for whom many previous labeled training examples

are frequently available. More generally, the ability to leverage a large number of past

examples of specific individuals is a potential boon any time multiple examples of some

finite number of persons of interest are available.

In Fig. 1.1, we present two distinct pipelines illustrating how our approach compares

with methods most commonly found in the literature. As a first step, both pipelines

(a) and (b) transform the input images into a feature set where the faces are described

by the same, general attributes. Well-known techniques to derive this representation are

Eigenface [1], Gabor wavelets [2], Local Binary Patterns [3], Fisherface [4], Scale-Invariant

Feature Transform [13], among others. On top of general face representations, face recog-

nition methods following pipeline (a) directly perform learning tasks such as training one

or multiple binary classifiers [14, 15, 16], learning similarity measures [6, 17], or learning

sparse encodings [7]. In contrast, as presented in pipeline (b), our approach is to explic-

itly cast these general representations in person-specific ones by means of an intermediate

1.1. Thesis Organization and Contributions 3

learning task that is based on domain-knowledge, and are aimed at emphasizing the most

discriminant face aspects of each individual. From a machine learning perspective, we

believe that these enhanced intermediate representations might alleviate the problem,

allowing the subsequent classifiers to generalize better.

While few previous works have already considered the use of person-specific repre-

sentations in face recognition [18, 19, 20], the advantages of the underlying concept has

never been attested before. Here we validate the concept of person-specific face represen-

tations, and describe approaches to building them ranging from a patch-based method,

to subspace learning, to deep convolutional network features. Taken together, we argue

that these techniques show that the person-specific representation learning approach holds

great promise in advancing face recognition research.

1.1 Thesis Organization and Contributions

As a consequence of being one of the most active pursuits in computer vision [12], the

face recognition problem has been addressed from many different perspectives. In spite

of this fact, it is still possible to devise seminal works in the area. Likewise, it is also

possible to draw a connection between the progress made in the development of the

algorithms and the recognition scenario that they are targeted to. Therefore, in order to

better contextualize this thesis, in Chapter 2 we present a summary of face representation

techniques and recognition scenarios as they evolved over time.

Our experiments consider both the constrained and the unconstrained face recognition

scenarios respectively represented by the UND [21] and the PubFig83 [16] datasets intro-

duced in Chapter 3. After describing these datasets, we then present and evaluate three

distinct methods for person-specific representation learning, with the goal of progressively

validating the overarching approach.

The first method, presented in Chapter 4, is designed to be as simple as possible and

is based on an algorithm that we call “discriminant patch selection” (DPS) [22]. This

algorithm enables us to carry out an evaluation of the idea of person-specific representa-

tions in a constrained face recognition scenario where an intuitive understanding is more

tenable.

Second, in Chapter 5, we explore a more powerful set of techniques based on subspace

projection [23]. In particular, we introduce a person-specific application of partial least

squares (PS-PLS) to generate per-individual subspaces, and show that operating in these

subspaces yields state-of-the-art performance on the PubFig83 benchmark dataset. A key

motivating insight here is that a person-specific subspace, due to its supervised nature,

can capture both aspects of the face that are good for discriminating it from others, as

well as natural variation in appearance that is present in the unconstrained images of that

1.1. Thesis Organization and Contributions 4

individual. We show that generating person-specific subspaces yields significant improve-

ments in face recognition performance as compared to either “general” representation

learning approaches or classic supervised learning alone. Further, we show that such sub-

space methods, when applied atop a deep convolution neural network representation can

achieve recognition performance that exceeds previous state-of-the-art performance.

Therefore, in our third and last method, we incorporate person-specific learning di-

rectly into a deep convolutional neural network. We demonstrate in Chapter 6 that, as

long as we observe a few key principles in the network information flow, it is possible to

learn discriminative filters at the topmost convolutional layer of the network with a simple

approach based on SVMs. The inspiration to this approach comes from the assumption

that class-specific transformations might be learned at the top of the human ventral vi-

sual stream hierarchy [24], and that neurons responding to specific faces might exist in

the brain at even deeper stages [25]. We compare our method with other approaches

and demonstrate that the proposed learning strategy produces an additional and signifi-

cant performance boost on the PubFig83 dataset, for both identification and verification

paradigms.

Finally, a compilation of our contributions and experimental findings, along with new

directions to this line of research, are presented in Chapter 7.

Chapter 2

Background

There is a sensible relationship between the progress made in the development of face

representation algorithms and the recognition scenario that they are targeted to. In this

chapter, we present a summary of these techniques and scenarios as they evolved over

time.

2.1 Face Representation

Since the seminal work of Kanade [26] in automated face recognition, the task of trans-

forming pixel values into features conveying more important information is a paramount

step in any face recognition pipeline. Intuitively, pixel values are highly correlated and

uninformative by their own. So, back in 1973, Kanade proposed to represent faces based

on distances and angles between fiducial points such as eye corners, mouth extrema, nos-

trils, among others, with procedures to automatically detect them [26]. This work is the

first milestone that we consider in the timeline presented in Fig. 2.1 about groundbreaking

contributions to the topic of face representation.

Methods solely based on geometric attributes, as proposed by Kanade, are today

known to discard rich information of facial appearance. After a dormant period [27], face

recognition revived in 1991 with the advent of Eigenface, a technique based on principal

component analysis (PCA) for learning and extracting low dimensional face representa-

tions via subspace projection [1]. Indeed, Eigenface gave rise to a class of face repre-

sentation methods known as holistic [28], with projection vectors operating in the full

image domain. While the Eigenface method learns projection vectors according to the

principle of overall maximal variance, Fisherface, based on linear discriminant analysis

(LDA), learns basis vectors with the objective of maximizing the ratio of between-class and

within-class variance [4]. The incorporation of class label information in the framework

of holistic methods was an important step towards better face representations. Hence,

5

2.1. Face Representation 6


learning/extraction

classi�er(s)general

representationface

images

1973 1991 1997 2006

geometricatributes LBPs

Gabor�ltersEigenface Fisherface

deep visualhierarchies

Kanade[26]

Turk &Pentland [1]

Belhumeuret al. [4]

Wiskottet al. [2]

Ahonenet al. [3]

Chopraet al. [29]

Figure 2.1: Milestones in the history of face representation, from Kanade’s seminalwork [26] to the use of deep visual hierarchies [29]. In order to better contextualize thetechniques, we replicate the traditional recognition pipeline of Fig 1.1(a) at the bottom.

Fisherface is the third, key face representation technique that we highlight in Fig. 2.1.

Contemporary to Fisherface, another milestone in face representation was the use

of Gabor filter responses to represent the appearance of regions around facial fiducial

points [2]. This approach can be seen as an extension of Kanade’s method, in that the

representation relies on fiducial points. However, while Kanade only relied on geomet-

ric measures computed from these points, appearance information extracted from their

neighborhood provide much richer information for the task at hand. This original work in-

troduced the broad idea of locally representing facial features, and inspired a fruitful vein

of representation methods. Within this vein, we point out in Fig. 2.1 the widely used local

binary patterns (LBPs) [3]. It consists of a simple and fast image transformation based

on local pixel value comparisons that leads to a compact texture description. In fact, the

use of LBPs for face representation is coupled with the extraction of local histograms,

resulting in a representation with a certain degree of translation invariance [3].

Finally, the last approach for face representation considered in this overview refers to

a class of representation methods based on deep visual hierarchies, whose first application

on raw face images [29] was at about the same time as LBPs. These hierarchies can be

seen as a form of face representation that departs from the idea of “engineered” features

by instead using a cascade of linear and nonlinear local operations that are — to some

extent — learned directly from the face images, as in the original work of Chopra et

al. [29] with convolutional neural networks.

Today, most of the principles underlying these representation techniques are present

in state-of-the-art approaches. For example, among the best performing methods in un-

2.2. Recognition Scenarios 7

constrained face verification, there are systems that rely on fiducial points to extract face

features from their neighborhood [30, 31], something that borrows ideas from the first

(fiducial points), the fourth (local features), and the fifth (LBPs and related) milestones

presented in Fig. 2.1. PCA and LDA are vastly used as an intermediate processing step

of many current top performers [31, 32]. Deep visual hierarchies have definitely demon-

strated their potential for unconstrained face recognition [16]. In addition, each of these

methods were unfolded and combined in a profusion of ways that are beyond the scope of

this overview. As we shall see throughout the thesis, there is a good overlap between the

general representation techniques highlighted in Fig. 2.1 and the techniques that serve us

as basis to learn person-specific face representations.

2.2 Recognition Scenarios

Research on automatic face recognition in the 1990s and the early 2000s was mostly based

on mugshot-like images with controlled levels of variation. Indeed, it all started with the

Facial Recognition Technology (FERET) program in 1994 [33], that can be regarded as

the first attempt to organize the area around a well-defined problem. After 1994, FERET

evaluations were carried out for more two years and images from the last edition, in

1996, are still available for research purposes. They are similar to the images of the

FRGC (experiment 1) [34] and the UND (collection X1) [21] datasets, shown in the left

part of Fig. 2.2. Since the users were asked to meet specific poses and expressions, and

illumination conditions were carefully taken into account, this image acquisition scenario

is referred to as constrained.

From constrained images, many lessons have been learned. Among them, for example,

the fact that females are harder to recognize than males [35]. These findings and, more

importantly, research directions — such as the need to make systems more robust to

changes in illumination — were only possible with the concerted effort of institutions like

the National Institute of Standards and Technology (NIST), which was in charge of the

FERET and FRGC programs, and currently promotes advances in the area by means of

challenges such as FRVT [36] and GBU [37], among others.1 Nowadays, many benchmarks

for automatic face recognition consider more realistic, uncontrolled face images in their

protocol. For example, the GBU challenge considers face pictures taken outdoors and in

hallways [37]. Likewise, a recent competition on mobile face recognition [32] — based on

the MOBIO dataset [38] — was carried out on images captured with little to no control,2

under conditions approaching the unconstrained setting (Fig. 2.2).

A new perspective to face recognition research was introduced with the release of

1http://www.nist.gov/itl/iad/ig/face.cfm2In fact, users were asked to be seated and pictures were taken indoors.

http://www.nist.gov/itl/iad/ig/face.cfm


recognition scenarios

FRGC(exp. 1) [34]

constrained unconstrained

large,diversegallery

UND(col. X1) [21]

MOBIO[38]

LFW[39]

PubFig83[16]

target scenario

Figure 2.2: Face recognition from the constrained to the unconstrained scenario. The sce-nario around which the area was first organized was constrained, in that individuals wereasked to meet specific poses and expressions, and illumination conditions were carefullytaken into account. With advances on the technology and the advent of the Internet,nowadays many research groups target their algorithms to the unconstrained scenario,where requirements on individuals are minimal.

Labeled Faces in the Wild (LFW) [39], a dataset based on the original idea of collecting

images of celebrities from the Internet with the only requirement that their faces were

detectable by the Viola-Jones algorithm [40]. Even though the resulting dataset was biased

towards face pictures typically found in news media, it embodied a new factor of variation

in the recognition scenario: diversity in appearance. Due to its interesting properties and

ease of use, and possibly also because its curators constantly update and report progress

made on it,3 LFW is currently largely adopted. Indeed, LFW has motivated the creation

of many other datasets, among them PubFig [11] and its refined version PubFig83 [16],

which have similar recording conditions, but serves to other purposes (Fig 2.2).

While LFW contains over 5,000 people, only five individuals have more than 100

images. In contrast, PubFig83 has 83 people, but each individual has at least 100 images.

While LFW — like most NIST challenges, including GBU [37] — is designed for pair

matching tests, and has a protocol that does not allow learning any parameter from

gallery images,4 PubFig83 is designed to approach familiar face recognition, and has a

protocol that actually fosters learning algorithms to take most out of gallery images.

Notwithstanding the fact that many other interesting datasets remain to be cited,5

we believe that the ones that we mentioned here illustrate well the continuum from the

constrained to the unconstrained recognition scenarios. For example, Multi-PIE [41] is

3http://vis-www.cs.umass.edu/lfw/results.html4In fact, LFW is conceived in terms of “pairs”, not individual images. The notion of gallery and probe

images does not even exists in this dataset.5A non-exhaustive list can be found at http://www.face-rec.org/databases.

http://vis-www.cs.umass.edu/lfw/results.html

http://www.face-rec.org/databases


an interesting dataset to study face recognition under severe pose variations. To this

purpose, a laborious setup was used to acquire images from precisely different viewpoints.

Its highly controlled nature enables researchers to factor out other sources of variation and

carefully address the problem. However, exactly because of its motivation, the dataset

reflects a constrained recognition scenario.

Overall, we consider PubFig83 as our target scenario in this work because its has a

large pool of heterogeneous face images for each individual and its evaluation protocol

allows us to learn from these images. In fact, there is a perfect match between the

recognition scenario that this dataset reproduces and the motivation of this thesis.

Chapter 3

Datasets and Evaluation Protocol

We follow the idea of gaining insight into the constrained scenario, where factors interfer-

ing in the results are alleviated, to later extending our representation learning methods to

a scenario that best suits the approach. In the following sections, we present the controlled

and the uncontrolled datasets of our choice, with their respective evaluation protocol, to

accomplish this goal.

3.1 Constrained: UND

Our experiments in the controlled scenario are based on the X1 collection of the UND

face dataset [21]. This dataset is arranged in weekly acquisition sessions in which four

face images were obtained by the combination of a small variation in illumination and

two slightly different facial expressions.

We designed an evaluation protocol that allows us to learn person-specific representa-

tions from gallery images as well as to account for variability in our tests. In particular,

we considered the 54 subjects whose attendance to the acquisition sessions were highest,

so that each person was recorded at least in seven and at most in ten sessions. This

procedure resulted in a dataset with 1,864 images — with at least 28 images per indi-

vidual — which enabled us to split the dataset into ten pairs of training and test sets.

Considering the images in chronological order, for each split, we selected two images of

each individual for the training set and used the remaining images as test samples. In

addition, all images were registered by the position of the eyes, cropped with an elliptical

mask, and were made 260×300 pixels in size.

Fig. 3.1 presents training and test images of four individuals in UND. We can see that

test images differ from training images only by a small amount, specially due to facial ex-

pression. UND represents the typical dataset used in automated face recognition research

until the late 1990s and the early 2000s. While our target scenario is unconstrained face

10

3.1. Constrained: UND 11

train

~test

train

~test

train

~test

train

~test

Figure 3.1: Training and test images of four individuals in the UND dataset. As we cansee, test images differ from training images only by a small amount. This recognitionscenario was typical in automated face recognition research of the early 2000s.

recognition, in this thesis, the controlled images of UND serve to provide insight regarding

the value of person-specific representations.

Evaluations in this dataset are performed in identification mode, where the task is to

identify which of a set of previously-known faces a new test face belongs to.

3.2. Unconstrained: PubFig83 12

3.2 Unconstrained: PubFig83

The PubFig83 dataset [16] is a subset of the PubFig dataset [11], which is, in turn,

a large collection of real-world images of celebrities collected from the Internet. This

subset was established and released to promote research on familiar face recognition from

unconstrained images, and it is the result of a series of processing steps aimed at removing

spurious face samples from PubFig, i.e., non-detectable, near-duplicate, etc. In addition,

only persons for whom 100 or more face images remained were considered, leading to a

dataset with 83 individuals.

To our knowledge, this is the publicly available face dataset with the largest amount

of unconstrained, uncorrelated images per individual. This characteristic is fundamental

in validating our claim — which has a perfect fit with the dataset motivation — and that

is why this thesis is mostly validated on PubFig83.1

We aligned the images by the position of the eyes and followed the original evaluation

protocol of [16], where the dataset is split into ten pairs of training and test sets with

images selected randomly and without replacement. For each individual, 90 images were

considered for training and 10 for test.

In Fig. 3.2, we present images of four individuals in a given split of PubFig83. While

here we only have space to show 10 (out of 90) training images of each individual, all their

respective test images are presented. We can observe that this dataset is considerably

more challenging than UND. Indeed, due to its unconstrained nature, PubFig83 presents

at the same time all factors of variation in face appearance: pose, expression, illumination,

occlusion, hairstyle, aging, among others. Extracting representations from these images

in a way that such intrapersonal variation is alleviated, while extrapersonal variation is

emphasized, is the foundational purpose of automatic face representation research [42].

Another challenging aspect of the dataset is that images are originally 100×100 pixels in

size.

On PubFig83, we report results both in identification mode as well as in verification

mode. In the later, the task is to decide whether or not a given test face belongs to a

claimed identity.

1Though, in part, we additionally validate our methods on the private Facebook100 dataset [16], aswe shall see in Appendix B.

3.2. Unconstrained: PubFig83 13

~train test ~train test ~train test ~train test

Figure 3.2: Images of four individuals in a given split of PubFig83. While here we onlyhave space to show 10 (out of 90) training images of each individual, all their respective testimages are presented. We can observe that this dataset is considerably more challengingthan UND. Indeed, due to its unconstrained nature, PubFig83 presents at the same timeall factors of variation in face appearance.

Chapter 4

Preliminary Evaluation

This preliminary evaluation is aimed at being as simple and intuitive as possible. There-

fore, here we follow the basic idea of matching face images via histograms of Local Binary

Patterns (LBPs) extracted from patches on different positions of the face. Indeed, the

approach presented in this section is closely related to the methods in [3], but using a

different patch selection mechanism that is crucial to our purpose.

Given that we calculate LBPs from an 8-neighborhood, our matching schema considers

histograms with 256 bins. Formally, let P ′ be the set of patches considered for the match-

ing and Hp be the histogram of the LBPs from patch p. The patch-based dissimilarity

between images I1 and I2 is

D(I1, I2,P ′) =∑∀p∈P ′

256∑b=1

|Hp,b(I1)−Hp,b(I2)|, (4.1)

where Hp,b represents the value of bin b of patch p. In other words, the dissimilarity

corresponds to the summation of the absolute difference over the bins of each patch

histogram, i.e., the L1 distance.

4.1 Discriminant Patch Selection (DPS)

The concept of selecting patches to better describe object classes in images has been

studied in many contexts. For example, in [43], the authors present methods for selecting

patches that are informative to detect objects, and, in [44], patch selection is proposed in

a probabilistic framework for the recognition of vehicle types.

The idea of our DPS procedure is to determine (x, y) coordinates for patch selection

according to the discriminability they have in a group of aligned training images with at

least two images per category. For a given patch in a given image, its discriminability

14

4.2. DPS Setup 15

Algorithm 1 Discriminant Patch Selection

Input: Set of training images T and classes C, set of patch positions P, discriminabilityfunction F (p, I,G), and patch selection criterion.

Output: Class-specific models Mc and patches P ′ selected according to the providedcriterion.

Auxiliary: Function C(I), image I, and variables c and d.

1. For each c ∈ C and p ∈ P do Mc,p ← 02. For each patch position p ∈ P do3. For each image I ∈ T do4. c← C(I).5. d← F (p, I, T \{I}).6. Mc,p ←Mc,p + d.7. Select patches from models Mc into P ′ according to the

criterion related to their discriminability.

is measured on an individual basis with respect to patches of the other training images.

By interchanging such image, the discriminability of patches at the same position is

computed for all classes. This is done for the whole set of patches. At the end, each

class is associated with one discriminability value per patch position. We refer to these

mappings as the class-specific models that we use for patch selection.

Let T be a set of labeled training images and P be a set with all patch positions

considered for selection. Assuming that function F (p, I,G) measures how good a patch at

p ∈ P in image I discriminates its class with respect to other patches in the image subset

G = T \{I}, and considering that function C(I) retrieves the correct class c ∈ C to which

image I belongs, a pseudocode for the method can be defined as in Alg. 1.

Note that Mc in Alg. 1 is considered a class-specific model in the sense that the

discriminability of patches at p ∈ P with respect to class c are accumulated in Mc,p.

While the patch selection criterion may take into account the discriminability of the

patches by the problem classes (i.e., by Mc), it may also fuse the models in order to

consider patch discriminabilities common to the whole training set, in which case we

obtain general models.

4.2 DPS Setup

For both experiments in the constrained and in the unconstrained scenario, we consider

T as the training set of a particular dataset split. The set P contains all possible patch

4.3. Experiments in the Controlled Scenario 16

positions regarding patch sizes of 20×20 and 10×10 pixels — empirically chosen for the

UND and the PubFig83 datasets, respectively — lying in the image domain.

Concerning the discriminant function F (p, I, T \{I}), we measure the discriminability

of a patch at position p in a given pivot image I as the identification rank obtained

by matching it with all patches at the same position in the remaining images. The

discriminant criterion is actually the negation of the rank, provided that the lower the

rank, the more discriminant the patch. Such measurement of discriminability by the

identification rank was only possible because we consider at least two training images per

class in the dataset splits (Chapter 3).

Finally, we select patches based on models Mc according to the experiment we want

to evaluate. Our main purpose is to build person-specific representations via the selection

of the most discriminant patches from each Mc model. In order to avoid overlapping

patches, we constrain the selection so that each new selected patch must have its center

at a minimum distance from the previously selected ones.

4.3 Experiments in the Controlled Scenario

As shown in Fig. 1.1(b), the learning of person-specific representations results in repre-

sentation spaces associated to each subject. Therefore, a classification engine is required

to operate in each of these spaces. For the sake of simplicity, the experiments in the con-

trolled scenario are based on nearest neighbor (1-NN) classifications. In order to recognize

a test face, we match it to all faces in the gallery in each representation space according

to Eq. 4.1. As a result, we obtain a number of 1-NN predictions. These predictions are

then fused by a voting scheme, i.e., the identity with the greater number of votes is given

to the test face. See Appendix A for a running example of this identification scheme.

The experiments with controlled images consist of comparing the identification rate

obtained in the UND dataset with the selection of patches according to six different

criteria. We start with the selection of the person-specific most discriminant patches, i.e.,

the criterion that implements the idea of learning a good face representation specific to

each person, and call this selection strategy as experiment A.

In Table 4.1, we present the characteristics of each experiment along with the mean

accuracy and the standard error obtained across the ten dataset splits (Sec. 3.1). As

we can see, in experiment A we have n = |C| = 54 person-specific representation spaces

— corresponding to the number of subjects in the dataset — each one composed by the

concatenation of histograms of LBPs computed from the 48 most discriminant patches of

that person.1 An illustration of a given person-specific model is provided in Fig. 4.1(a)

1We decided to select 48 patches for each person because such number seemed to us appropriate todescribe a large portion of the face.


Table 4.1: Experimental details and performance evaluation in the controlled scenario.As we can see from experiment A, the representation based on the person-specific mostdiscriminant patches resulted in better identification rates. A per-split performance plotis presented in Fig. 4.2.

exp. patch sel. person- # rep. # patches accuracycriteria specific spaces /space (%)

A most disc. yes n 48 97.07±.36

B least disc. yes n 48 92.72±.75C most disc. yes 1 48n 94.89±.65D most disc. no 1 48 94.87±.39E random yes n 48 96.21±.45F non-overlap no 1 13×15 96.23±.42

as well as the patches that were selected to represent this individual in experiment A.

The first alternative patch selection criterion that we compare with experiment A is to

select the least discriminant patches for the person-specific representation spaces. This can

be viewed as a sanity check to assure that DPS is behaving as expected. Compared with

A, it is possible to observe that experiment B presents a significant drop in performance.

The next comparison is the most interesting outcome of this preliminary evaluation. It

consists of contrasting experiment A with experiment C, whose patch selection strategy is

the same, but the patches are assembled into a single representation space. The interest-

ing point to observe is that the same data are employed by both methods. In experiment

A, we consider 54 representation spaces with 48 patches each, while in experiment C, we

consider a single feature space with the same 54×48=2,592 patches. The difference in

performance observed between experiments A and C suggests that undesirable cancella-

tions are occurring when the person-specific representations are tiled in a single space.

We consider this fact as a good support to our hypothesis.2

Experiment D refers to the selection of the 48 patches that are the most discriminant

for all persons simultaneously. In this case, the patch discriminability from the persons

are correspondingly merged by summing them up before the selection, leading to a set of

general discriminant patches (see Sec. 4.1). This strategy is well-known in the literature

and reflects the paradigm of creating a representation space that highlights the importance

of face aspects that better distinguish among all individuals. Fig. 4.1(b) shows the model

obtained in D along with the corresponding most discriminant patches. With respect to

accuracy, we can also observe in Table 4.1 a significant difference between experiments A

2Note that because we are using 1-NN classifiers, these cancellations only occur due to the votingscheme (Appendix A) employed in experiment A before the final prediction. Otherwise, experiments Aand C would perform exactly the same.


person-speci�cmodel

most disc.patches

generalmodel

most disc.patches(a) (b)

Figure 4.1: A given person-specific model and the most discriminant patches for thisindividual (a). The general model obtained with the summation of all person-specificmodels and the corresponding most discriminant patches (b). Models learned from thefirst dataset split.

and D. Aside from this fact, here it is possible to notice the importance of the eyebrows

in face recognition, which are facial features known to contribute in an important way in

human face perception [9, 12].

We also evaluate the random selection of 48 patches per individual within the ellip-

tical face domain, following the same matching strategy used in experiments A and B.

This experiment is called E and performed worse than A as well. Interestingly, however,

experiment E performed better than C and D. We believe that the random criterion, by

being uniform and not allowing overlapping patches, enabled a well distributed selection

of patches within and among the person-specific representation spaces. This possible

representation regularly covering the face image domain may have led to this good per-

formance.

Therefore, the last experiment in the controlled scenario, named F, stands for a reg-

ular grid composition of non-overlapping patches covering the entire image. Given that

images in the UND dataset are 260×300 pixels in size and patches are 20×20 pixels, this

method employs a grid of 13×15=195 patches to describe the faces. We can observe in

Table 4.1 that experiment A also prevails over F, and that they are the top performing

representation strategies.

Notwithstanding the proximity among accuracies presented in Table 4.1, in Fig. 4.2

we provide a per-split comparison of the experiments. This visualization enables us to see

that experiment A achieves a consistently better performance across the splits. Therefore,

when the experiments are paired by the splits and a Wilcoxon signed-rank test is carried

out, the performance of A is significantly different from all other experiments (p < 0.01).

4.4. Experiments in the Unconstrained Scenario 19

Figure 4.2: Per-split evaluation on the UND dataset. Details of each experiment areavailable in Table 4.1. When individually compared with each other experiment, theperformance of A is significantly different according to the Wilcoxon signed-rank test(p < 0.01).

Overall, the results measured in this controlled setting provide early evidence about the

potential importance of learning person-specific face representations.

4.4 Experiments in the Unconstrained Scenario

With the use of the PubFig83 dataset, this preliminary evaluation gains in importance

not only due to the uncontrolled conditions in which the images were obtained, but also

due to the scale of the learning task, which increases from two images of 54 individuals

in the UND dataset to 90 images of 83 subjects in PubFig83.

We start this round of evaluation by considering the top two performing methods from

the previous section, namely experiments A and F. As mentioned in Sec. 3.2, we report

results on PubFig83 following the same protocol of [16], providing mean accuracy and

standard error obtained from ten random training/test dataset splits.

In Table 4.2, we can observe that the difference in performance between experiments

A and F is considerably greater than the difference between the same methods in the

controlled scenario (Table 4.1). Given that predictions in both methods are made with

1-NN classifiers, we understand that the selection of person-specific discriminant patches

provides an important aid in the classifier generalization.

Despite the relative superiority of experiment A over F, when compared with state-of-

the-art methods [16, 23], which use robust visual representations and powerful classifica-

tion engines, the performance obtained with experiment A on PubFig83 is substantially

4.4. Experiments in the Unconstrained Scenario 20

Table 4.2: Preliminary evaluation in the unconstrained scenario. Here the difference inperformance between the top two methods in the controlled scenario (exp. A and F) ismuch greater. However, as exp. G and H suggest, in this scenario we must consider morerobust learning techniques.

exp. patch sel. person- # rep. classi- accuracycriteria specific spaces fier (%)

A most disc. yes n 1-NN 45.25±.56F non-overlap no 1 1-NN 32.16±.71G most disc. yes n SVM 62.94±.28H non-overlap yes 1 SVM 65.28±.52

lower. In order to evaluate the impact of using a better classifier on top of the same

visual representations, we replace the 1-NN classifier in experiments A and F with linear

SVMs, and call these new experiments as G and H, respectively. We use LIBSVM [45]

to train the linear machines and, for each split, we estimate the SVM regularization con-

stant C via grid search, considering a re-split of the training set and possible C values of

{10−3, 10−2, . . . , 105}.As expected, in Table 4.2 we can see that the use of SVMs in experiments G and H

results in a significant performance boost. We note that experiment H, although does not

operate in person-specific representations, is presented as person-specific. This is because

we use a one-versus-all learning strategy when training the classifiers. More interesting,

however, is the fact that SVM operates better in experiment H, when it is provided with

the whole set of LBP histograms, so that its learning principle can make the most out of

the training data.

In general, the experiments conducted in this section give us the idea that we need

better visual representations in order to obtain satisfactory performance on the PubFig83

dataset. Moreover, we observe that the combination of our discriminant patch selection

(DPS) method with the 1-NN classifier, which was fundamental in providing insight into

the controlled problem, cannot cope with the challenging problem imposed by PubFig83.

Therefore, we conclude that beyond better visual representations, we also need more

robust techniques to further pursue the idea of explicitly learning person-specific face

representations.

Chapter 5

Person-Specific Subspace Analysis

The creation of subspaces tailored for faces is a classic technique in the face recognition

literature; a variety of matrix-factorization techniques have been applied to faces (e.g.,

Eigenface [1], Fisherface [4], Tensorface [5], etc.), which seek to model structure across

a set of training faces, such that new face examples can be projected onto these spaces

and can be compared. A principle advantage of projecting onto such subspaces is in the

reduction of noise by limiting comparison to few relevant dimensions of variability in faces,

as measured across a large number of images. However, while these methods naturally

capture general structure across a set of faces, they typically discover either just structure

that is common to reconstruct all faces (as in the case of Eigenface), or just structure

that is common to discriminate all faces at the same time (as in the case of Fisherface).

In this section, we propose the use of a technique to build person-specific models

on any kind of visual representation in Rd. In particular, we build person-specific face

subspaces from orthonormal projection vectors obtained by using a discriminative per-

individual configuration of partial least squares [46], which we refer to as person-specific

PLS or PS-PLS models. While partial least squares methods have been used in other

contexts in face recognition before [47, 48], in the absence of a dataset that contains

many examples per individual such as PubFig83, it is not possible for PLS methods to

model natural variability in face appearance found in unconstrained images. Even though

any projection technique that attempts to discriminate between face identities, one at a

time, can be considered person-specific in some sense, subspace models can offer more

degrees of freedom to accommodate within-class variance in appearance.

5.1 Partial Least Squares (PLS)

Partial least squares is a class of methods primarily designed to model relations between

sets of observed variables by means of latent vectors [46, 49]. It can also be applied as

21

5.1. Partial Least Squares (PLS) 22

a discriminant tool for the estimation of a low dimensional space that maximizes the

separation between samples of different classes. PLS has been used in different areas

[50, 51] and, recently, it is also being successfully applied to computer vision problems for

dimensionality reduction, regression, and classification purposes [47, 48, 52, 53, 54].

Given two matrices X and Y respectively with d and k mean-centered variables and

both with n samples, PLS decomposes X and Y into

X = TPT + E and Y = UQT + F, (5.1)

where Tn×p and Un×p are matrices containing the desired number p of latent vectors,

matrices Pd×p and Qk×p represent the loadings, and matrices En×d and Fn×k are the

residuals.

One approach to perform the PLS decomposition employs the Nonlinear Iterative

Partial Least Squares (NIPALS) algorithm [46], in which projection vectors w and c are

determined iteratively such that

[cov(t,u)]2 = max||w||=||c||=1

[cov(Xw,Yc)]2, (5.2)

where cov(t,u) is the sample covariance between the latent vectors t and u. In order to

compute w and c, given a random initialization of u, the following steps are repeatedly

executed [49]:

1) uold = u 4) t = Xw 7) u = Yc

2) w = XTu 5) c = YT t 8) if ||u− uold|| > ε,

3) ||w|| → 1 6) ||c|| → 1 go to Step 1

When there is only one variable in Y, i.e., if k = 1, then u can be initialized as

u = Y = y. In this case, the steps above are executed only once per latent vector to be

extracted [49]. The loadings are then computed by regressing X on t and Y on u, i.e.,

p = XT t/(tTt) and q = YTu/(uTu). (5.3)

In this work, we use PLS to model the relations between face samples and their identi-

ties. The relationship between X and Y is then asymmetric and the predicted variables in

Y are modeled as indicators. In the asymmetric case, after computing the latent vectors,

matrices X and Y are deflated by subtracting their rank-one approximations based on t,

that is,

X = X− tpT and Y = Y − ttTY/(tTt). (5.4)

Such deflation rule ensures orthogonality among the latent vectors {ti}pi=1 extracted over

the iterations. For details about the different types of PLS, their applicability to regression

and other problems, and how they compare with other techniques, we refer the reader to

[49, 55, 56].

5.2. Person-Specific PLS 23

5.2 Person-Specific PLS

We learn face models with PLS for each person c at a time by setting k = 1, Yn×k = yc,

and yc,s = 1 if sample s (out of n) belongs to class c or yc,s = 0 otherwise. As Y has a

single variable, this variant of PLS is also known as PLS1 [49]. It is worth recalling from

Sec. 5.1 that when k = 1, we can initialize u = yc and that, in this case, obtaining the

projection vectors {w}pi=1 is straightforward. In other words, at each iteration i,

wi = XiTyc, (5.5)

where Xi is the matrix X deflated up to iteration i according to Eq. 5.4.

The person-specific face model that we consider in this case is the subspace spanned

by the set of orthonormal vectors {wi}pi=1 produced by NIPALS for a person c. Given

that the variables in X are also normalized to unit variance, wi expresses the relative

importance of the face features (i.e., the variables) to discriminate person c from the

others. As {wi}pi=1 are orthogonal, this model accounts for within-person variance in

the face appearance throughout the samples, a property also suggested to be relevant in

mental representations of familiar faces [8].

In Fig. 5.1 we illustrate the approach. From the visual representation of the train-

ing samples, PS-PLS creates a different face subspace for each individual. All training

samples are then projected onto each person-specific subspace, so that a classifier can be

trained by considering the different representations of the samples over the subspaces.

The classification engine that we use in our experiments is made by linear SVMs in a

one-versus-all configuration, but it could be of any type provided it can operate in multi-

ple representation spaces. Given a test sample, an overall decision is made according to

decisions made in each person-specific subspace. In this work, we predict the face identity

by choosing the person whose corresponding SVM scored highest.

5.3 Experiments

As already mentioned, PS-PLS models can be learned from arbitrary Rd input spaces.

Hence, we consider four different visual representations in order to evaluate them. The

first visual representation that we take into account is the one that performed best in our

preliminary evaluation on PubFig83 (Sec. 4.4, experiment H), and it is based on non-

overlapping histograms of LBP patches. The second and third representations are called

V1-like+ and HT-L2-1st. They are taken from [16] and can be thought of as biologically-

inspired visual models of increasing complexity. Finally, the fourth visual representation

is similar in spirit to HT-L2-1st and consists of a three-layer hierarchical convolutional

5.3. Experiments 24

Dataset Visual Representations

100 examples/person

...

...

subspaces

R100x100

R~20

PS-PLS...

... ...one-versus-all

linear SVMs

90 trainingsamples

10 testingsamples

R>25,000

100 examples/person

feature extraction

PS-PLS projection matrices

Ove

rall

Perf

orm

ance

Figure 5.1: From the training samples, PS-PLS creates a different face subspace for eachindividual. A different classifier is then trained in each subspace.

network. We refer to this visual representation as L3+, as it is a slight modification of

the HT-L3-1st network found in [15].1

The main baseline for PS-PLS models consists of training linear SVMs straight from

these visual representations, in which case we call the method RAW. In addition to

comparing RAW and PS-PLS, we also consider subspace models obtained via principal

component analysis (PCA), linear discriminant analysis (LDA), and Random Projection

(RP). PCA is intuitively appealing in the context of face recognition and decomposes

the training set in a way that most of the variance among the samples can be explained

by a much smaller and ordered vector basis. LDA is another well-known technique that

attempts to separate samples from different classes by means of projection vectors point-

ing to directions that decrease within-class variance while increasing the between-classes

variance. As our PS-PLS setup seeks to maximize the separation only between-class, we

argue that this offers a good compromise between LDA and PCA. Finally, due to its inter-

esting properties [57, 58], we also consider RP vectors sampled from a univariate normal

1The only difference between L3+ and HT-L3-1st is that the later performs an additional normalizationas a last step.

5.4. Results 25

distribution.

We further evaluate person-specific PCA models (PS-PCA) and multiclass PLS models

with the idea that they would provide insight regarding the value of person-specific spaces.

PS-PCA models are built only with the training samples of the person. For the multiclass

PLS models, we assume k as the number of classes and make Yn×k = {y1,y2, . . . ,yk},with yc,s = 1 if sample s belongs to class c or yc,s = 0 otherwise. Still, in the inner loop of

the NIPALS algorithm, each projection vector is considered after satisfying a convergence

tolerance ε = 10−6 or after 30 iterations, whichever comes first (see Sec. 5.1 for details).

In this case, as Y has multiple variables, this form of PLS is also known as PLS2 [49].

While there remains substantial room to evaluate other subspace methods — including

kernelized versions of PCA [59], LDA [60], and PLS [61] — we chose here to focus on

some of the most popular and straightforward methods available, with the goal of cleanly

assessing the benefit of building person-specific subspaces.

The evaluation framework has two parameters: the regularization constant C of the

linear SVMs, and the number of projection vectors to be considered, which is relevant

in the cases where the projection vectors are ordered by their variance or discriminative

power (PCA, PS-PCA, PLS, and PS-PLS). We use a separate grid search to estimate

these parameters for each split. For this purpose, we re-split the training set so that we

obtain 80 samples per class to generate intermediate models and 10 samples per class to

validate them. We consider {10−3, 10−2, . . . , 105} as possible values to search for C. For

the RAW and LDA models, this is the only parameter that we have to search, because, in

the RAW case, no projection is made in practice and, in LDA, the number of projection

vectors is fixed to the number of classes minus 1.

The possible number of projection vectors that we consider in the search can be repre-

sented as {1m, 2m, . . . , 8m}. For person-specific subspace models, m = 10, i.e., starting

from 10, the number of projection vectors is increased by 10 up to the total number of

data points per person in the validation set. Correspondingly, for the multiclass models,

m = 10n, where n is the number of persons in the dataset. The only exception is PLS,

where m = n. Although PLS is a multiclass model, we observed that the ideal number

of projection vectors is concentrated in the first few, and so we decided to refine the

search accordingly, while keeping the same number of trials as for the other models. For

all methods, the Scikit-learn package [62] was used to compute the subspace models and

LIBSVM [45] was used to train the linear SVMs. In all cases, the data was scaled to zero

mean and unit variance.

5.4. Results 26

Table 5.1: Mean identification rates obtained with different face subspace analysis tech-niques on the PubFig83 dataset. In all cases, the final identities are estimated by linearSVMs. In the last column, we present the most frequent number of projection vectorsfound by grid search (see Sec. 5.3 for details).

Models LBP V1-like+ HT-L2-1st L3+RAW 65.28±.52 74.81±.35 83.66±.55 88.18±.24 d (Rd)Multiclass UnsupervisedRP 61.77±.57 69.04±.44 79.92±.50 85.77±.26 6,640PCA 65.14±.48 74.59±.36 83.36±.47 87.86±.31 6,640Multiclass SupervisedLDA 59.01±.54 76.16±.50 81.14±.30 87.83±.39 –PLS 63.88±.54 74.90±.45 83.07±.47 87.20±.31 332Person-SpecificPS-PCA 21.70±.58 29.95±.31 44.76±.45 54.58±.36 80PS-PLS 67.90±.58 77.59±.53 84.32±.38 89.06±.32 20

5.4 Results

The results are shown in Table 5.1. In general, comparisons are done with the first row,

where performance is assessed with the RAW visual representations. The remaining rows

are divided according to the type of subspace analysis technique.2 It is possible to observe

that the only face subspace in which we could consistently get better results than RAW

across the different representations is PS-PLS.

With the multiclass unsupervised techniques, we see no boost in performance above

RAW. Since unconstrained face images have a considerable amount of noise and these

techniques do not regard its removal while estimating the models, this is perfectly reason-

able. We observe that the visual representation on which the performance of RP dropped

most is V1-like+, the largest in terms of input space dimensionality. Both for RP and

PCA, the most frequent number of projection vectors found by grid search was 6,640,

i.e., the maximum allowed. This gives us the intuition that, operating with these uncon-

strained face images, the best that RP and PCA can do is to retain as much information

in the input space as possible.

For the multiclass supervised subspace models, we observe performance increases only

with LDA on the V1-like+ representation. While for HT-L2-1st and L3+ this may be

simply the case of there being less room for improvement, we think that person-specific

manifolds in the multiclass subspace are impaired by a more complex relation among the

2Note that the performance obtained with RAW LBP representations is the same of exp. H in Table4.2, as the methods are the same.

5.4. Results 27

projection vectors. Since both PLS and PS-PLS follow the same rule for the estimation of

the projection vectors, the results corroborate the idea that representing each individual

in its own subspace results in better performance.

In the person-specific category, we see that PS-PCA considerably diminishes the pre-

dictive power of the features in the input space. In all cases, the best number of projection

vectors found by grid search was 80, i.e., the maximum allowed. When compared with

PS-PLS, we can see here the importance of person-specific models being also discrimina-

tive, besides generative, for this task. We cannot disregard noise in the unconstrained

scenario.

In Appendix B, the very same performance pattern is observed with other two visual

representations and on an additional private dataset called Facebook100. We omit these

numbers here for a better flow in reading and also because, due to privacy concerns, results

on Facebook100 are non-replicable. In any case, these extra experiments strengthen the

advantage of person-specific subspace analysis via PLS in the familiar face identification

setting.

In Fig. 5.2(a), we present a scatter plot of training and test samples projected onto the

first two PS-PLS projection vectors of Adam Sandler’s subspace learned from V1-like+

representations.3 Similar plots for PCA, LDA and multiclass PLS are available in Ap-

pendix C. Considering that the samples of Adam Sandler are in red, Fig. 5.2(a) illustrates

one point that we observed throughout the experiments, i.e., that the predictive power

of the first PS-PLS projection vectors is higher than that of the second one. Indeed, in

PS-PLS, we found that the only projection vector that leads to mean projection responses

significantly different between positive and negative samples is the first one. Although all

subsequent projection vectors considerably increase performance, we believe that, from

the second vector on, they progressively account more for person-specific variance than

discriminative information. In our experiments, performance began to saturate around

20 projection vectors.

Fig. 5.2(b) is the result of mapping the importance of each V1-like+ feature back to the

spatial domain, regarding their relative importance found by the first PS-PLS projection

vector. Based on these illustrations, we can roughly see that higher importance is being

given to Adam Sandler’s mouth and forehead (first row), to Alec Baldwin’s eyes, hairstyle,

and chin (second row), and to the configural relationship of Angelina Jolie’s face attributes

(third row).

Columns in Fig. 5.2(c) show the person-specific most, average, and least responsive

face samples with respect to the projection onto the first PS-PLS projection vector. For

3As PubFig83 is a dataset with celebrities, we use their names in this discussion. Also, we chose touse V1-like+ in this illustration because the relation of image pixels to the elements of its feature vectoris more intuitive.

5.4. Results 28

(c) (e)

1st P

S-PL

S pr

oj. v

ecto

r

(b)

most responsiveleast responsive average

with

in-c

lass

sam

ples

leas

t res

pons

ive

sam

ples

(d)

hard

sam

ples

reco

gniz

ed

no case

(a)

1st PS-PLS proj. vector

PS-PLSsubspace

2nd

proj

. vec

tor

Figure 5.2: (a) Scatter plot of training and test samples projected onto the first twoPS-PLS projection vectors of Adam Sandler’s subspace learned using V1-like+. (b) FirstPS-PLS projection vector for three individuals in the dataset. (c) Within-class most,on average, and least responsive face samples with respect to the projection onto (b).(d) Overall least responsive training sample w.r.t. (b). (e) Test samples correctly rec-ognized when considering person-specific representations, but mistaken when using theRAW description. Samples in the first row of (b-e) are highlighted in (a).

Adam Sandler, these samples are highlighted in the plot in Fig. 5.2(a). It is difficult

to infer anything concrete from these images, but we can see that the least responsive

samples represent large variations in pose alignment and occlusion.

5.4. Results 29

Still in Fig. 5.2, column (d) represents the overall least responsive training sample

with respect to (b). These samples tend to be of the opposite gender, and hair seems

to play a role for the first two individuals. Finally, in column (e) we present one test

sample of each person that was not recognized when considering the RAW description of

the faces, but that was recognized with the aid of PS-PLS models. Despite showing just

one sample for Adam Sandler, there were three such cases, which are all highlighted in

Fig. 5.2(a).

In general, we argue that these subspaces are useful both for noise removal and for

accentuating discriminative person-specific face aspects. In unconstrained face recogni-

tion settings, both of these issues are of fundamental importance. Considering the results

obtained with the RAW visual representations, we see that linear SVMs achieve rea-

sonably high level of performance; however, when these same classifiers are trained and

operate in PS-PLS subspaces, they perform better, suggesting that these 20-dimensional

person-specific subspaces not only embed comparable levels of the available face identity

information, but also amplify it.

Chapter 6

Deep Person-Specific Models

While person-specific subspace analysis is a promising general approach to learning person-

specific representations from arbitrary underlying feature representations, the superior

baseline performance of the L3+ visual representation in Sec. 5.4 led us to explore whether

the key theme of person-specific representation could be incorporated more integrally into

that feature representation.

The L3+ representation is based on the use of deep architectures for processing visual

information [15]. Such approach has a long tradition in the machine learning literature

[63, 64, 65, 66], and has been gaining attention due to recent breakthrough results in a

number of important vision problems [15, 67, 68, 69]. These techniques seek to mimic

the neural computation of the brain in the hope of eventually reproducing its abilities

in specific tasks. The basic architecture employs a hierarchical cascade of linear and

nonlinear operations, applied in the framework of a generalized convolution. For an

overview on this type of visual representation, see Appendix D.

Since the work of Hinton et al. [66], the strategy of greedily learning intermediate levels

of representation as a building block to construct deep networks has been much discussed.

While the focus has been put on unsupervised methods aimed at minimizing some kind

of reconstruction error [70, 71, 72], little attention has been devoted to supervised layer-

wise representation learning. This is possibly because discriminative learning strategies

employed at early layers may prematurely discard information that would be critical to

learn higher-level features about the target [71].

The work of Pinto et al. [15] is of considerable importance to unconstrained face

recognition in general and to this work in particular. On the one hand, it achieves state-

of-the-art performance in the challenging Labeled Faces in the Wild benchmark [39].1 On

the other hand, it is the basis of our L3+, a likewise best performing face representation in

the ICB-2013 competition (Appendix E). In fact, this representation can be understood

1http://vis-www.cs.umass.edu/lfw/results.html

30

http://vis-www.cs.umass.edu/lfw/results.html

6.1. L3+ Top Layer 31

as the read-out of a three-layer convolutional neural network whose architecture was

determined by performing a brute-force optimization of model hyper-parameters, while

using random weights for the network’s convolution filters [15].

Here, we ask if these underlying L3+ representations can be augmented by incorpo-

rating a person-specific learning process for setting their linear filter weights, resulting

in an architecture that is both “deep” and person-specific. In order to construct these

deep person-specific face models, we build on the idea of learning increasingly complex

representations (i.e., filter weights), one layer after the other. To be more precise, we are

interested in learning person-specific models at the top layer of the L3+ network. We focus

on the top layer not only because of the potentially disadvantages of discriminative filter

learning at early layers but also for other two reasons: (i) the neuroscientific conjecture

that class-specific neurons should exist in high levels of the human ventral visual stream

hierarchy [24] and (ii) the experimental evidence suggesting that neurons responding to

faces of specific individuals should exist in the brain at even deeper stages [25].2

6.1 L3+ Top Layer

Given that the top layer of the L3+ network is the object of our interest in the attempt to

learn deep person-specific representations, in this section we briefly describe its architec-

ture and operations according to [15]. As we can observe in the left panel of Fig. 6.1, the

third and topmost layer of the L3+ network sequentially performs linear filtering, filter

response activation, and local pooling.3

The filtering operation takes a 34×34×128 input from the previous layer corresponding

to 128 feature maps and convolves it with filters Φi of size 5× 5× 128 in order to create

k higher level new feature maps

fi = x⊗ Φi ∀i ∈ {1, 2, . . . , k}, (6.1)

where x is the input, ⊗ denotes the convolution operation, and k = 256 is the number of

filters. The output of the filtering operation is then subjected to an activation function

of the form

ai = max(0, fi), (6.2)

and these activations are, in turn, pooled together and spatially downsampled with a

stride of 2 (downsampling factor of 4). In particular, the pooling and downsampling

2Another practical reason not to learn discriminative filters at early layers is spatial variance. Facemisalignment is a serious problem in unconstrained face recognition that is significantly alleviated athigher levels of the network. For example, in L3+, each input cell in the third layer has a receptive fieldcorresponding to a region of 65× 65 pixels in the input image.

3For an intuitive explanation of these operations, we refer the reader to Appendix D.

6.2. Proposed Approach 32

operation can be defined as

pi = downsample2( 10√

(ai)10 ⊗ 17×7), (6.3)

where 17×7 is a 7 × 7 matrix of ones representing the pooling neighborhood. Note that

the pooling operation is simply the L10-norm of the activations in the pooling region,

and can be regarded as a soft-max pooling in the sense of [65]. Finally, after these three

operations, the network outputs a visual representation of size 12× 12× 256.

6.2 Proposed Approach

We propose an approach based on linear support vector machines (SVMs) to learn filters

on the third layer of the L3+ representation. As we can see in the right panel of Fig. 6.1,

an input image when transformed up to layer 2 is a feature vector x of size 34× 34× 128.

From a training set X with n samples, we are interested in learning 5 × 5 × 128 filters

Φi that later will be convolved with representations at the same input level. Given that

these filters are meant to be person-specific, the type of SVM training that we carry out is

one-versus-all and assumes that filters are going to be learned by taking as input the same

neighborhood Ni of 5×5 elements in space from all samples in X. In Fig. 6.1, this means

to consider features in the same red volume from all images, training an SVM with Alec

Baldwin, for example, as the positive class and the other persons as the negative class.

By doing so, a person-specific filter expected to be highly responsive to Alec Baldwin’s

face aspects in Ni is learned.

Let XNibe the training set at neighborhood Ni and yc be the labels for person c such

that yc,s = +1 if sample s (out of n) belongs to class c or yc,s = −1 otherwise. A filter

for c in Ni is simply the hyperplane Φi obtained with the solution of the linear support

vector classification problem

minΦi,bi

1

2||Φi||2 + C

n∑s

max{0, 1− yc,s(Φi · xsNi+ bi)}, (6.4)

where C is the regularization constant that we set to 105 in order to obtain a parameter-

free hard-margin method. In fact, the filter itself is the pair (Φi, bi) with the intercept biensuring that responses from different filters will be in the same range. For the sake of

notation clarity we use only Φi to denote this pair.

It is possible to observe that a correspondence between filters Φ and neighborhoods Nexists, that is, both Φi andNi have the same index i specifying from which region the filter

is going to be learned. Indeed, there is an important fact in determining i that allows us to

train independent filters. Recalling that the spatial resolution of the input samples at layer

6.2. Proposed Approach 33

person-speci�c linear SVM

input

...

Φ1

Φ2

Φk

⊗⊗

⊗

Pool

Filter

Activate

L2

L3person-speci�c

linear SVM

55

34

34

128

128

200

200

Φi

�lter learningsublayers in L3

Figure 6.1: Schematic diagram of the L3+ convolutional neural network, detailing theoperations (sublayers) of its topmost layer (left panel) and illustrating how data from aninput image is sampled in order to learn person-specific filters Φi at a given neighborhood(right panel). Early steps of the network [15] are omitted to emphasize the processingsteps of our interest.

2 is 34×34 and that filters are 5×5 in space, we can train (34−5+1)2 = 900 such filters.

However, we observe that the correlation between filters trained from adjacent regions is

undesirably high, and so there is no benefit in considering them all. This is not the case

though if we subsample possible neighborhoods with a stride of 3, in which situation the

mean correlation among the filters is close to zero. Therefore, the proposed procedure

to learn a third person-specific layer in the L3+ hierarchy considers (b34−53c + 1)2 = 100

filters Φ.4

The final component that we add to our filter learning approach is inspired by an ob-

4Although the number of filters was empirically determined in this study, this number can be seen asa hyperparameter of the proposed method to be adjusted on other problem-domains.

6.3. Experiments and Results 34

servation about the information flow in the network when operating with random filters.5

Provided that after drawing the weights from a uniform distribution the filters are mean

centered, and given the activation function in Eq. 6.2, we observed that, on average, half

of the linear filtering responses after activation are set to zero. The enforcement of such

“calibrated” sparsity showed to be quite relevant to the network performance in our tests

and, therefore, we replicate this behavior by assuming α as the mean response of the

person-specific filters on the training set and using an activation function of the form

ai = max(0, fi − α) (6.5)

instead. Without this shift on activation we found that SVM filters are too selective, i.e.,

almost all filter responses are set to zero if we rather use Eq. 6.2.

The observance of the two aforementioned properties of (i) independence and (ii)

calibrated sparsity in our learning framework allows the network to represent well face

images even of other individuals. No matter which stimuli these person-specific filters

are trained to respond best, these properties naturally enable them to be as informative

as random filters are. However, we expect that when these filters operate in images of

the persons whose face aspects they were trained to discriminate, they might significantly

increase the ability of the system at recognizing these persons.

Even though the proposed approach is tailored to the deep architecture of our in-

terest and designed to strengthen our hypothesis in the context of person-specific face

representation learning, the method seems to extend naturally to other object recognition

problems. To our knowledge, this is the first attempt to learn “stackable” layer-wise rep-

resentations with maximum-margin classifiers. Given the large amount of variation that

unconstrained images have (e.g., Fig. 5.2), even large-scale datasets such as PubFig83

— with thousands of training images — require methods with strong generalization abil-

ities. The idea of piecing together maximum-margin filters in convolutional networks is

potentially relevant in this concern.

6.3 Experiments and Results

The experiments that we carry out in order to evaluate our approach consist of clamping

both the architecture and filter weights of L3+ up to layer 2 and varying two aspects of

its third layer while we measure performance in the PubFig83 dataset. The first aspect

is the filter type, i.e., how filters are determined, and the second aspect is the number of

filters.

5As random filters are known to perform surprisingly well in the general class of convolutional neuralnetworks [73, 74], we found valuable to investigate some of their characteristics.


The obvious baseline with respect to the filter type is the use of random filters, which

are used in the standard L3+ visual representation from Sec. 5.4. We also consider

filters of the type proposed by Coates and Ng in [75], whose use in large quantities

corroborates the notion that good performance can be achieved with inexpensive filter

quantization and encoding techniques [75]. We evaluate their K-means-like method that

takes normalized and ZCA whitened patches as input and computes filters using dot-

products as the similarity metric rather than the Euclidean distance [75].

In order to compare our approach with competitive configurations of these methods, we

scale the number of filters in the third layer up to as many as 2,048, and we vary this num-

ber as an experimental parameter. Both for random as well as for K-means-like filters, we

assess performance with k = {100, 256, 512, 1024, 2048} filters. In the person-specific case,

we measure performance with pure k = 100 person-specific filters, but we also concatenate

them with filters of the two other types, so that the overall number of filters matches the

other cases. This gives rise to methods that we call person-specific (PS)+random and

PS+K-means-like, that are evaluated with k = 100 + {156, 412, 924, 1948} filters.

In addition to random and K-means-like filters, we made a substantial effort to com-

pare our approach with filters trained via backpropagation. However, we found that in

this case, even considering a small number of filters, the network rapidly overfits to the

training samples, resulting in poor performance on the test set. Considering both the

third (convolutional) and the fourth (fully-connected) layers, such network has almost

four million parameters when trained with k = 256 filters. We believe that the availabil-

ity of only n = 7, 470 training samples in PubFig83 did not allow us to obtain good levels

of performance in this attempt.

Regardless the filter type and the number of filters, all other operations and architec-

tural parameters in the third layer are made as presented in Sec. 6.1. The only exception

is the activation function, where Eq. 6.2 is replaced by Eq. 6.5 in cases where the filters

are learned, i.e., when using person-specific and K-means-like filters.6 In these cases, α

is determined as explained in Sec. 6.2. Still concerning filter learning issues, we sample

exactly the same patches in both cases; each person-specific filter (out of 100) is learned

from a set of n patches, and all K-means-like filters are learned from a training set with the

same 100n patches. Finally, on top of all the resulting visual representations, hard-margin

person-specific linear SVMs are trained with C = 105.

The experimental results are presented in Table 6.1 and Fig. 6.2 for the methods in

identification mode and in Fig. 6.3 in verification mode. In accordance to all results

presented in this thesis, we report the mean accuracy and standard error over ten dataset

splits (see Sec. 3.2).

6As advocated in [75], this is in fact a very good encoding scheme to use with large quantities ofK-means-like filters.


Table 6.1: Results in identification mode. A clear boost in performance can be observedwith the use of person-specific filters. It can also be seen that person-specific filterscombine well with the other two types. In particular, when combined with 1948 K-means-like filters, the method achieves the best result on PubFig83 to our knowledge.

Number Filter typeof random K-means person- person-specific+

filters like specific- random K-means100 85.38±.26 83.99±.42 90.62±.27 – –256 88.18±.24 87.69±.29 – 91.43±.24 91.37±.32512 88.76±.32 89.43±.28 – 91.60±.29 91.87±.231024 89.26±.26 90.46±.37 – 91.67±.25 92.17±.272048 89.40±.31 91.29±.34 – 91.07±.27 92.28±.26

100 256 512 1024 2048number of filters in the third layer

84

86

88

90

92

acc

ura

cy(%

)

person-specific (PS) filters

filter type

PS + randomrandomPS + K-means-likeK-means-like

Figure 6.2: Plot of the results obtained in the identification scenario with intervals corre-sponding to standard errors.

From Table 6.1 and Fig. 6.2, we clearly see a dramatic boost in performance when

considering person-specific filters, especially if we take into account correspondence in the

number of filters.7 Comparable levels of performance with 100 person-specific filters is

not achieved even with 2,048 random filters and is only achieved with more than 1,024

K-means-like filters. In addition, we see that person-specific filters combine well with the

other two types. In particular, when combined with 1,948 K-means-like filters, the method

achieves a mean accuracy of 92.28%, the best result on PubFig83 to our knowledge.

As expected, there is a clear relationship between the number of filters and performance

7Note that the performance achieved with 256 random filters in Table 6.1 is the same as in Table 5.1when we use the RAW L3+ representation. In fact, they are the same method.


Table 6.2: Identification results on PubFig83 available in the literature.

Pinto Chiachia Bergstra Carlosimages et al. [16] et al. [23] et al. [76] et al. [77] This

CVPRW’11 BMVC’12 ICML’13 FG’13 thesisunaligned 85.22±.45 – 86.50±.70 – –

aligned 87.11±.56 88.75±.26 – 73.47±.41 92.28±.28

in both random and K-means-like cases. Interestingly, K-means-like performs worse than

random with small numbers of filters but achieves much better performance when this

number increases. While this may contradict the argument that filter learning becomes

crucial with the decrease in filter quantity [75], it appears to corroborate the observation

that large numbers of K-means-like filters might span the space of inputs more equitably,

which increases the chances that a few filters will be near the input and leads to a few

but high activations [75].

For comparison purposes, in Table 6.2 we present other face identification results on

PubFig83 available in the literature. While we make a distinction between results using

aligned images and results using unaligned images, the work of Pinto et al. [16] gives us

an idea about how these setups compare.

0.01 0.1 1 10 100false acceptance rate (%)

54.2

61.8

74.280.0

90.0

100.0

corr

ect

acc

epta

nce

rate

(%

)

filter type

person-specificrandomK-means-like

Figure 6.3: Comparison of the methods in the verification scenario. When the system isset to wrongly accept only 0.01% of the test cases, we can observe a dramatic improvementin performance, which is especially relevant to high security applications.

In Fig. 6.3, we present the performance of the methods in verification mode, where the

task is to decide whether or not a given test face belongs to a claimed identity. Given that

such pair matching is done with the use of only one person-specific model with 100 filters,

we found reasonable to compare methods only with this number of filters. It is possible


to observe that the use of person-specific filters results in a great improvement in correct

acceptance, especially when the system is set to wrongly accept only 0.01% of the test

cases. In high security applications, this difference is of extreme relevance, suggesting that

the approach of learning person-specific representation is not only conceptually relevant

— as we observed throughout the thesis — but also readily applicable in the verification

scenario.

Chapter 7

Conclusion and Future Work

In this thesis, we presented three techniques, based on different learning principles, to ex-

plicitly and progressively build on the idea that generating person-specific representations

can boost face recognition performance.

We motivated the idea as an attempt to model two different attributes of human

face perception, and conducted interrelated experiments in both constrained and uncon-

strained settings, achieving not only insight into the value of person-specific representa-

tion, but also state-of-the-art results.

To tackle the challenging problem of unconstrained face recognition, we first introduced

the use of person-specific subspaces to leverage any kind of input visual representation in

Rd. We believe that this approach represents a first step towards the incorporation of the

notion of face “familiarity” into face recognition systems — a notion that is known to be

of key importance in biological vision. In addition, we proposed an original framework

that uses SVMs to learn “deep” person-specific models in a convolutional neural network,

again achieving superior recognition performance.

With the consistent improvements that we observed throughout the experiments in

both face identification and face verification tasks, we showed that the use of intermediate,

person-specific representation has the power to boost recognition performance beyond

what either generic face representation learning, or traditional supervised learning can

achieve alone.

While any sort of supervised learning might arguably be considered a form of “person-

specific” representation, here we have found that the inclusion of intermediate, problem-

driven person-specific representation learning steps lead to significant boosts in perfor-

mance. One possible explanation for this phenomenon is that such representations intro-

duce an intermediate form of regularization to the face recognition problem, allowing the

classifiers to generalize better by enforcing them to use less but more relevant features.

An important direction to this line of research is to assess the boundaries within

39

40

which this hypothesis holds true. For example, in Appendix E, we show that the lack of

diversity in learnable face images — being this diversity an assumption in human face

familiarity [8] — impairs our approach in a recognition scenario unarguably easier than

PubFig83. Determining the extent of applicability of our approach, and continuing to

explore the wide range of possible techniques for learning person-specific representations

will be a promising area for future research.

In the short term, we also envision the extension of our deep person-specific models

to other problem-domains, in which case they will be class-specific. Indeed, the notion of

learning “stackable” layer-wise representations with maximum-margin classifiers — that

usually leads to classifiers with strong generalization abilities — might be interesting

to explore in problems where training samples (compared to the problem difficulty) are

scarce, i.e., most unconstrained computer vision problems that we currently deal with,

such as face or object recognition. In situations where we have more filters than we

would like to use, we also plan to use Adaboost [78] or some related method to select

them, similar to the approach of Berg and Belhumeur to select a good combination of

SVM classifiers [30]. Finally, we can also investigate a potential compromise between

unsupervised and supervised layer-wise filter learning. Semi-supervised filter learning in

the sense of [79] is also a possibility.

Bibliography

[1] M. Turk and A. Pentland, “Face Recognition using Eigenfaces,” in IEEE Conf. on

Computer Vision and Pattern Recognition, 1991. 1, 2, 5, 21

[2] L. Wiskott, J.-M. Fellous, N. Kruger, and C. V. D. Malsburg, “Face Recognition

By Elastic Bunch Graph Matching,” IEEE Trans. on Pattern Analysis and Machine

Intelligence, vol. 19, pp. 775–779, 1997. 1, 2, 6

[3] T. Ahonen, A. Hadid, and M. Pietikainen, “Face Description with Local Binary

Patterns: Application to Face Recognition,” IEEE Trans. on Pattern Analysis and

Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006. 1, 2, 6, 14

[4] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. Fisherfaces: Recogni-

tion using Class Specific Linear Projection,” IEEE Trans. on Pattern Analysis and

Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997. 1, 2, 5, 21

[5] M. A. O. Vasilescu and D. Terzopoulos, “Multilinear Image Analysis for Facial Recog-

nition,” in IEEE Intl. Conf. on Pattern Recognition, 2002. 1, 21

[6] L. Wolf, T. Hassner, and Y. Taigman, “The One-Shot Similarity Kernel,” in IEEE

Intl. Conf. on Computer Vision, 2009. 1, 2

[7] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust Face Recognition via

Sparse Representation,” IEEE Trans. on Pattern Analysis and Machine Intelligence,

vol. 31, no. 2, pp. 210–227, 2009. 1, 2

[8] A. M. Burton, R. Jenkins, and S. R. Schweinberger, “Mental Representations of

Familiar Faces,” British Journal of Psychology, vol. 102, no. 4, pp. 943–58, 2011. 1,

23, 40

[9] P. Sinha, B. Balas, Y. Ostrovsky, and R. Russell, “Face Recognition by Humans:

Nineteen Results All Computer Vision Researchers Should Know About,” Proceed-

ings of the IEEE, vol. 94, no. 11, pp. 1948–1962, 2006. 1, 18

41

BIBLIOGRAPHY 42

[10] A. M. Burton, S. Wilson, M. Cowan, and V. Bruce, “Face Recognition in Poor-

Quality Video: Evidence From Security Surveillance,” Psych. Science, vol. 10, no. 3,

pp. 243–248, 1999. 1

[11] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute and Simile

Classifiers for Face Verification,” in IEEE Intl. Conf. on Computer Vision, 2009. 1,

8, 12

[12] R. Chellappa, P. Sinha, and P. Phillips, “Face Recognition by Computers and Hu-

mans,” IEEE Computer, vol. 43, no. 2, pp. 46–55, 2010. 1, 3, 18

[13] D. G. Lowe, “Object Recognition from Local Scale-Invariant Features,” in IEEE Intl.

Conf. on Computer Vision, 1999. 2

[14] B. Heisele, P. Ho, and T. Poggio, “Face Recognition with Support Vector Machines:

Global versus Component-based Approach,” in IEEE Conf. on Computer Vision and

Pattern Recognition, 2001. 2

[15] N. Pinto and D. D. Cox, “Beyond Simple Features: A Large-Scale Feature Search

Approach to Unconstrained Face Recognition,” in IEEE Conf. on Automatic Face

and Gesture Recognition, 2011. 2, 24, 30, 31, 33, 55, 56

[16] N. Pinto, Z. Stone, T. Zickler, and D. D. Cox, “Scaling-up Biologically-Inspired

Computer Vision: A Case Study in Unconstrained Face Recognition on Facebook,”

in IEEE Conf. on Computer Vision and Pattern Recognition Workshops, 2011. 2, 3,

7, 8, 12, 19, 23, 37, 51, 58, 59

[17] Z. Cui, W. Li, D. Xu, S. Shan, and X. Chen, “Fusing Robust Face Region Descriptors

via Multiple Metric Learning for Face Recognition in the Wild,” in IEEE Conf. on

Computer Vision and Pattern Recognition, 2013. 2

[18] S. Krishna, J. Black, and S. Panchanathan, “Using Genetic Algorithms to Find

Person-Specific Gabor Feature Detectors for Face Indexing and Recognition,” in

IAPR Intl. Conf. on Biometrics. Springer, 2005, pp. 182–191. 3

[19] S. Zafeiriou, A. Tefas, and I. Pitas, “Learning Discriminant Person-Specific Facial

Models using Expandable Graphs,” IEEE Trans. on Information Forensics and Se-

curity, vol. 2, no. 1, pp. 55–68, 2007. 3

[20] J. Sivic, M. Everingham, and A. Zisserman, ““Who are you?”: Learning Person

Specific Classifiers from Video”,” in IEEE Conf. on Computer Vision and Pattern

Recognition, 2009. 3

BIBLIOGRAPHY 43

[21] X. Chen, P. J. Flynn, and K. W. Bowyer, “Visible-light and Infrared Face Recog-

nition,” in ACM Workshop on Multimodal User Authentication, 2003. 3, 7, 10, 58,

59

[22] G. Chiachia, A. X. Falcao, and A. Rocha, “Person-specific Face Representation for

Recognition,” in IEEE Intl. Joint Conf. on Biometrics, 2011. 3

[23] G. Chiachia, N. Pinto, W. R. Schwartz, A. Rocha, A. X. Falcao, and D. Cox, “Person-

Specific Subspace Analysis for Unconstrained Familiar Face Identification,” in British

Machine Vision Conference, 2012. 3, 19, 37

[24] T. Poggio, “The Computational Magic of the Ventral Stream,” Nature Precedings,

2011, doi:10.1038/npre.2012.6117.3. 4, 31

[25] R. Q. Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried, “Invariant Visual

Representation by Single Neurons in the Human Brain,” Nature, vol. 435, no. 7045,

pp. 1102–1107, 2005. 4, 31

[26] T. Kanade, “Picture Processing by Computer Complex and Recognition of Human

Faces,” Ph.D. dissertation, Kioto University, 1973. 5, 6

[27] S. Z. Li and A. K. Jain, “Introduction,” in Handbook of Face Recognition, S. Z. Li

and A. K. Jain, Eds. Springer, 2011, ch. 1, pp. 1–15. 5

[28] W. Zhao, R. Chellapa, P. J. Phillips, and A. Rosenfeld, “Face Recognition: A Litera-

ture Survey,” ACM Computing Surveys, vol. 35, no. 4, pp. 399–458, December 2003.

5

[29] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a Similarity Metric Discriminatively,

with Application to Face Verification,” in IEEE Conf. on Computer Vision and

Pattern Recognition, 2005. 6

[30] T. Berg and P. N. Belhumeur, “Tom-vs-Pete Classifiers and Identity-Preserving

Alignment for Face Verification,” in British Machine Vision Conference (BMVC),

2012. 7, 40

[31] D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of Dimensionality: High-dimensional

Feature and Its Efficient Compression for Face Verification,” in IEEE Conf. on Com-

puter Vision and Pattern Recognition, 2013. 7

[32] M. Gunther, A. Costa-Pazo, C. Ding, E. Boutellaa, and G. C. et al., “The 2013 Face

Recognition Evaluation in Mobile Environment,” in IEEE Intl. Conf. on Biometrics,

2013. 7, 58, 63

BIBLIOGRAPHY 44

[33] P. J. Phillips, H. Moon, P. J. Rauss, and S. Rizvi, “The FERET Evaluation Method-

ology for Face-Recognition Algorithms,” IEEE Trans. on Pattern Analysis and Ma-

chine Intelligence, vol. 22, no. 10, 2000. 7

[34] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, and W. Worek, “Preliminary

Face Recognition Grand Challenge Results,” in IEEE Intl. Conf. on Automatic Face

and Gesture Recognition, 2006, pp. 15–24. 7

[35] P. J. Phillips, P. Grother, and R. Micheals, “Evaluation Methods in Face Recogni-

tion,” in Handbook of Face Recognition, S. Z. Li and A. K. Jain, Eds. Springer,

2011, ch. 21, pp. 550–574. 7

[36] P. J. Phillips, W. T. Scruggs, A. J. O’Toole, P. J. Flynn, K. W. Bowyer, C. L. Schott,

and M. Sharpe, “FRVT 2006 and ICE 2006 Large-Scale Experimental Results,” IEEE

Trans. on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 831–846,

2010. 7

[37] P. J. Phillips, J. R. Beveridge, B. Draper, G. H. Givens, A. J. O’Toole, D. Bolme,

J. Dunlop, Y. M. Lui, H. A. Sahibzada, and S. Weimer, “An Introduction to the

Good, the Bad, & the Ugly Face Recognition Challenge Problem,” in IEEE Intl.

Conf. on Automatic Face and Gesture Recognition, 2011. 7, 8

[38] C. McCool, S. Marcel, A. Hadid, M. Pietikainen, P. Matejka, J. Cernocky, N. Poh,

J. Kittler, A. Larcher, C. Levy, D. Matrouf, J.-F. Bonastre, P. Tresadern, and

T. Cootes, “Bi-Modal Person Recognition on a Mobile Phone: Using Mobile Phone

Data,” in IEEE ICME Workshop on Hot Topics in Mobile Multimedia, 2012. 7, 58

[39] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled Faces in the

Wild,” Univ. of Massachusetts, Amherst, Tech. Rep., 2007. 8, 30

[40] P. Viola and M. Jones, “Robust Real-time Object Detection,” Intl. Journal of Com-

puter Vision, vol. 57, no. 2, pp. 137–154, 2004. 8

[41] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-PIE,” in IEEE

Intl. Conf. on Automatic Face and Gesture Recognition, 2008. 8

[42] J.-K. Kamarainen, A. Hadid, , and M. Pietikainen, “Local Representation of Facial

Features,” in Handbook of Face Recognition, S. Z. Li and A. K. Jain, Eds. Springer,

2011, ch. 4, pp. 79–108. 12

[43] A. Vashist, Z. Zhao, A. Elgammal, I. Muchnik, and C. Kulikowski, “Discriminative

Patch Selection using Combinatorial and Statistical Models for Patch-Based Object

BIBLIOGRAPHY 45

Recognition,” IEEE Conf. on Computer Vision and Pattern Recognition Workshops,

2006. 14

[44] M. S. Sarfraz and M. Khan, “A Probabilistic Framework for Patch based Vehicle

Type Recognition,” in Intl. Conf. on Computer Vision Theory and Applications,

2011. 14

[45] C. Chang and C. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM

Trans. on Intelligent Systems and Technology, vol. 2, no. 3, p. 27, 2011. 20, 25

[46] H. Wold, “Partial Least Squares,” Wiley Encyclopedia of Statistical Sciences, vol. 6,

pp. 581–591, 1985. 21, 22

[47] W. R. Schwartz, H. Guo, J. Choi, and L. S. Davis, “Face Identification Using Large

Feature Sets,” IEEE Trans. on Image Processing, vol. 21, no. 4, pp. 2245–2255, 2011.

21, 22, 51

[48] H. Guo, W. R. Schwartz, and L. S. Davis, “Face Verification using Large Feature

Sets and One Shot Similarity,” in IEEE Intl. Joint Conf. on Biometrics, 2011. 21,

22

[49] R. Rosipal and N. Kramer, “Overview and Recent Advances in Partial Least

Squares,” Springer LNCS: Subspace, Latent Structure and Feature Selection Tech-

niques, pp. 34–51, 2006. 21, 22, 23, 25

[50] F. Lindgren, P. Geladi, A. Berglund, M. Sjostrom, and S. Wold, “Interactive Vari-

able Selection (IVS) for PLS. Part II: Chemical Applications,” Wiley Journal of

Chemometrics, vol. 9, no. 5, pp. 331–342, 1995. 22

[51] D. V. Nguyen and D. M. Rocke, “Tumor Classification by Partial Least Squares

Using Microarray Gene Expression Data,” Oxford Bioinformatics, vol. 18, no. 1, pp.

39–50, 2002. 22

[52] W. R. Schwartz, A. Rocha, and H. Pedrini, “Face Spoofing Detection through Partial

Least Squares and Low-Level Descriptors,” in IEEE Intl. Joint Conf. on Biometrics,

2011. 22

[53] W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis, “Human Detection

Using Partial Least Squares Analysis,” in IEEE Intl. Conf. on Computer Vision,

2009. 22

BIBLIOGRAPHY 46

[54] A. Kembhavi, D. Harwood, and L. Davis, “Vehicle Detection Using Partial Least

Squares,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, no. 6,

pp. 1250–1265, 2011. 22

[55] P. Geladi, “Notes on the History and Nature of Partial Least Squares (PLS) Mod-

elling,” Wiley Journal of Chemometrics, vol. 2, no. 4, pp. 231–246, 1988. 22

[56] H. Abdi, “Partial Least Squares Regression and Projection on Latent Structure Re-

gression,” Wiley Int. Reviews: Computational Statistics, vol. 2, no. 4, 2010. 22

[57] E. Bingham and H. Mannila, “Random Projection in Dimensionality Reduction:

Applications to Image and Text Data,” in ACM Conf. on Knowledge Discovery and

Data Mining, 2001. 24

[58] J. Wright and G. Hua, “Implicit Elastic Matching with Random Projections for

Pose-variant Face Recognition,” in IEEE Conf. on Computer Vision and Pattern

Recognition, 2009. 24

[59] B. Scholkopf, A. J. Smola, and K. R. Muller, “Kernel Principal Component Analysis,”

in Springer Intl. Conf. on Artificial Neural Networks, 1997. 25

[60] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. R. Mullers, “Fisher Discriminant

Analysis with Kernels,” in IEEE Neural Networks for Signal Processing Workshop,

1999. 25

[61] R. Rosipal and L. J. Trejo, “Kernel Partial Least Squares Regression in Reproducing

Kernel Hilbert Space,” Journal of Machine Learning Research, vol. 2, pp. 97–123,

2001. 25

[62] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine

Learning Research, 2011. 25

[63] K. Fukushima, “Neocognitron: A Self-Organizing Neural Network Model for a Mech-

anism of Pattern Recognition unaffected by Shift in Position,” Springer Biological

Cybernetics, vol. 36, no. 4, pp. 193–202, 1980. 30, 55

[64] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and

L. D. Jackel, “Backpropagation Applied to Handwritten Zip Code Recognition,” MIT

Neural Computation, vol. 1, no. 4, pp. 541–551, 1989. 30, 55

[65] M. Riesenhuber and T. Poggio, “Hierarchical Models of Object Recognition in Cor-

tex,” Nature Neuroscience, vol. 2, pp. 1019–1025, 1999. 30, 32

BIBLIOGRAPHY 47

[66] G. E. Hinton, S. Osindero, and Y. Teh, “A Fast Learning Algorithm for Deep Belief

Nets,” MIT Neural Computation, vol. 18, pp. 1527–1554, 2006. 30, 57

[67] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep

Convolutional Neural Networks,” in Advances in Neural Information Processing Sys-

tems, 2012. 30, 55

[68] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and

A. Y. Ng, “Building High-level Features Using Large Scale Unsupervised Learning,”

in Intl. Conf. on Machine Learning, 2012. 30, 55

[69] C. F. Cadieu, H. Hong, D. Yamins, N. Pinto, N. J. Majaj, and J. J. DiCarlo, “The

Neural Representation Benchmark and its Evaluation on Brain and Machine,” Intl.

Conf. on Learning Representations, 2013. 30

[70] M. A. Ranzato, F. J. Huang, Y. lan Boureau, and Y. Lecun, “Unsupervised Learning

of Invariant Feature Hierarchies with Applications to Object Recognition,” in IEEE

Conf. on Computer Vision and Pattern Recognition, 2007. 30, 57

[71] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy Layer-wise Training

of Deep Networks,” in Advances in Neural Information Processing Systems, 2007. 30,

57

[72] Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng, “Tiled Convolutional

Neural Networks,” in Advances in Neural Information Processing Systems, 2010. 30

[73] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the Best Multi-

Stage Architecture for Object Recognition?” in IEEE Intl. Conf. on Computer Vi-

sion, 2009. 34, 56

[74] A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Ng, “On Random Weights

and Unsupervised Feature Learning,” in Intl. Conf. on Machine Learning, 2011. 34

[75] A. Coates and A. Ng, “The Importance of Encoding Versus Training with Sparse

Coding and Vector Quantization,” in Intl. Conf. on Machine Learning, 2011. 35, 37,

56

[76] J. Bergstra, D. Yamins, and D. D. Cox, “Making a Science of Model Search: Hy-

perparameter Optimization in Hundreds of Dimensions for Vision Architectures,” in

Intl. Conf. on Machine Learning, 2013. 37

BIBLIOGRAPHY 48

[77] G. P. Carlos, H. Pedrini, and W. R. Schwartz, “Fast and Scalable Enrollment for

Face Identification based on Partial Least Squares,” in IEEE Conf. on Automatic

Face and Gesture Recognition, 2013. 37

[78] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning

and an application to boosting,” in Proceedings of the Second European Conference

on Computational Learning Theory. Springer-Verlag, 1995, pp. 23–37. 40

[79] J. Weston, F. Ratle, and R. Collobert, “Deep learning via semi-supervised embed-

ding,” in Intl. Conf. on Machine Learning, 2008. 40, 57

[80] J. J. DiCarlo and D. D. Cox, “Untangling invariant object recognition,” Trends in

Cognitive Sciences, vol. 11, pp. 333–341, 2007. 55

[81] J. J. DiCarlo, D. Zoccolan, and N. C. Rust, “How does the brain solve visual object

recognition?” Neuron, vol. 73, pp. 415–34, 2012 Feb 9 2012. 55

[82] D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat’s striate

cortex.” The Journal of physiology, vol. 148, pp. 574–591, 1959. 55

[83] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by

back-propagating errors,” Nature, vol. 323, pp. 533–536+, 1986. 57

[84] Y. Freund and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line

Learning and an Application to Boosting,” Journal of Computer and System Sci-

ences, vol. 55, no. 1, pp. 119 – 139, 1997. 62

Appendix A

Running Example of our Preliminary

Evaluation

In Figure A.1, we provide a running example of the identification scheme adopted in our

preliminary evaluation. From top to bottom, the diagram starts with the person-specific

representation of the gallery images Gc,m into the feature spaces Sc, where c denotes the

modeled persons in the training/gallery set, and m indicates which of the multiple samples

of the person is being considered. In this example, we have two persons modeled with

two gallery samples each. Thus, c = {1, 2} and m = {a, b}. After representing the gallery

samples in each person-specific feature space, we obtain samples Gcc,m, which means Gc,m

represented in the feature space modeled for person c.

In order to recognize a probe P, we represent it in each feature space Sc, and the

resulting Pc samples are correspondingly matched to the gallery. In this example, we

match P1 to the samples G1c,m and P2 to the samples G2

c,m. The matchings are then

ranked according to the dissimilarities and an identity is established by each nearest-

neighbor classifier. Here we have two classifiers, one for S1 and one for S2. Finally, a

voting scheme is done by considering the decisions of the classifiers, and the person in the

gallery which has the most votes is taken as the probe identity.

49

50

Matching

G1,a G1,bS1 G2,a G2,bS2

G11,a G1

1,b G12,a G1

2,b G21,a G2

1,b G22,a G2

2,b

Representation

P1 P2

S1 S2P

G11,a

G11,b

G12,a

G12,b

2

1

4

3

Matchingrank in S1

G21,a

G21,b

G22,a

G22,b

3

1

4

2

Matchingrank in S2

G1

G2

1+1

0

Fusion byvoting

Identification

Gallery

Probe

Figure A.1: Illustration of the identification scheme adopted in our preliminary evaluation,considering c = {1, 2} persons in the training/gallery set, with m = {a, b} samples each.From top to bottom, the diagram starts with the person-specific representation of thegallery Gc,m into the feature spaces Sc. Such a representation results in Gc

c,m, whichmeans Gc,m represented in the feature space c. Given a probe P, its representationsPc are correspondingly matched to the gallery. The matchings are then ranked and anidentity is established by each classifier. Finally, a voting scheme is done by consideringtheir decisions, and the person which has the most votes is taken as the probe identity.

Appendix B

Additional Results on

Person-Specific Subspace Analysis

In addition to the core results on subspace analysis presented in Chapter 5, we also eval-

uated the approach on two additional visual representations and one additional dataset.

The first additional representation is named HT-L3-1st and was taken from [16]. It can be

thought of as a visual model slightly different from the L3+ model presented in Chapter 5.

The second additional representation is, in turn, a blend of local binary patterns (LBP),

histogram of oriented gradients (HOG), and Gabor wavelets (LBP+HOG+Gab), and was

taken from [47] in order to test our method with a representation on which partial least

squares (PLS) was already known to perform well.

The additional face dataset that we consider is Facebook100, which is similar in spirit

to PubFig83. Indeed, a remarkably linear relationship between performance achieved on

each set by a variety of algorithms has been reported in [16]. Both sets enable the inves-

tigation of face recognition methods where a considerable number of natural face images

from the individuals is available. As advocated in Chapter 1, Facebook100 reflects the

exact scenario on which learning person-specific representations is especially attractive,

i.e., social media. The reason why we omitted Facebook100 from our main results is

because this dataset is private [16].

As we can observe in Table B.1, the results on PubFig83 with the additional repre-

sentations are similar to the results reported in Chapter 5. Again, the only face subspace

in which we could consistently get better results than RAW is PS-PLS.

For the Facebook100 dataset, we present in Table B.2 the performance obtained with

the most competitive method of each category considered in Tables 5.1 and B.1. The

results are similar to the ones obtained on PubFig83, where PCA representations per-

formed most like RAW, LDA did better in V1-like+, and PS-PLS performed best across

all representations.

51

52

Table B.1: Comparison of different face subspace analysis techniques on two additionalvisual representation applied on the PubFig83 dataset. In all cases, the final identitiesare estimated by linear SVMs.

Models HT-L3-1st LBP+HOG+GabRAW 87.66±.29 82.63±.28Multiclass UnsupervisedRP 85.61±.37 75.07±.37PCA 87.50±.28 82.44±.34Multiclass SupervisedLDA 85.72±.33 83.40±.22PLS 86.63±.35 83.02±.26Person-SpecificPS-PCA 52.65±.62 33.02±.39PS-PLS 88.75±.26 85.42±.29

Table B.2: Comparison of different face subspace analysis techniques in the Facebook100dataset.

Models V1-like+ HT-L2-1st HT-L3-1stRAW 79.96±.19 85.81±.29 88.89±.25PCA 79.81±.18 85.70±.29 88.88±.25LDA 81.04±.29 83.07±.26 87.25±.29PS-PLS 81.53±.25 86.84±.19 89.70±.25

Taken together, these results strengthen the use of person-specific subspace analysis

via PLS in the unconstrained familiar face identification setting.

Appendix C

Scatter Plots from Different

Subspace Analysis Techniques

In Chapter 5, we proposed a person-specific application of partial least squares (PS-PLS)

to generate per-individual subspaces of familiar faces. By means of a straightforward

evaluation methodology, we compared different subspace analysis techniques for modeling

the problem. Extending Fig. 5.2(a), where we showed a scatter plot of training and

test samples projected onto the first two projection vectors of a PS-PLS model, in this

appendix we show these projections for three other subspace analysis techniques evaluated

in our experiments, namely PCA, LDA, and PLS. As in Fig. 5.2(a), Adam Sandler’s

samples are in red.

The overall distribution of the points is in accordance to our expectations, where

samples are spread out in PCA subspace, are more concentrated, apart with respect to

the other classes, and Gaussian shaped in LDA subspace, and are also apart but less

concentrated in PLS and PS-PLS. Due to its person-specific nature, we can observe a

clear better separation of the samples in PS-PLS.

53

54

PCA LDA

PLS PS-PLS

1st projection vector

2nd

proj

ectio

n ve

ctor

Figure C.1: Visualization of the training and test samples projected onto the first twoprojection vectors of each model. All models were obtained from the same training/testsplit using the V1-like+ representation.

Appendix D

Overview on Deep Visual Hierarchies

Humans have an impressive ability in recognizing faces, vehicle types, and a profusion

of other objects without much effort. Fortunately, neuroscience has provided a number

of important directions to computer vision researchers in their attempt to artificially

reproduce these abilities. These directions come not only from recent research suggesting,

for example, that the ventral visual stream of primates consists of a feedforward cascaded

hierarchy that gradually “untangles” information about objects in the scene [80, 81].

These directions come also from seminal works like the one from Hubel and Wiesel [82],

stating that the visual cortex is made by cells that are sensitive to small regions of the

input space, called receptive fields, and that these cells are of two types; one that responds

maximally to specific stimulus, known as simple cells, and another that account for local

invariance to the exact position where the stimulus occurred, known as complex cells.

In fact, computer vision and machine learning researchers have been taking advantage

of these findings for a long time. In the early 1980s, for example, Fukushima [63] proposed

neocognitron, a self-organizing artificial neural network inspired in the cell types of Hubel

and Wiesel [82], designed to extract robust signatures from visual patterns. With the same

inspiration, Lecun et al. [64] proposed convolutional neural networks in late 1980s along

with a procedure to discriminatively train them via backpropagation. Indeed, many other

contributions have been made to computer vision literature in the past decades following

the same principle of learning a visual hierarchy capable of representing high level concepts

straight from image pixels.

Modern approaches like [15, 68, 67] often employ a sequence of well defined opera-

tions such as (i) linear filtering followed by nonlinear activation – mimicking simple cell

behavior, (ii) local pooling – mimicking complex cell behavior, and (iii) local normaliza-

tion – attempting to model competitive interactions among neurons. These operations

can be thought of sublayers of a feedforward network with many layers. In Fig. D.1, we

present the architecture of one hypothetical layer. Note in red the receptive fields of each

55

56

layer input

one

hypo

thet

ical

laye

r

receptive�eld

�ltering

pooling

normalization

input to layer above

numberof �lters

operations

Figure D.1: Architecture of one hypothetical layer using three well-known biologically-inspired operations.

operation and how spatial resolution decreases as information flows in the network. The

intuition of using deep visual hierarchies is that, by “correctly stacking” many of these

biologically-inspired nonlinear operations, increasingly complex abstractions will emerge

at top layers.

There are two important — and strongly related [75] — aspects to consider when

designing deep visual hierarchies. One is how to determine the network architecture and

the other is how to determine which stimuli the filters will maximally respond to.

The architectural details are usually referred to as hyperparameters, and define which

operations should be involved, what are their receptive field, in what order should they

be applied, how many layers should be used, how many filters should be considered in

each layer, etc. While in many cases these hyperparameters are presented as tangential to

the algorithm, recent work has shown that they are crucial to the method’s performance

[15, 75, 73].

The method employed to determine which stimuli the filters will maximally respond

57

to is often called the training procedure. One of the most common training procedures is

backpropagation [83], which adjusts all filters in the network by minimizing the difference

— and propagating it backwards — between the obtained and the desired representation

in the topmost layer of the hierarchy. One problem of learning all filters at the same

time is the huge amount of examples required. In Fig. D.1, one can think that a filter

weight, i.e., a parameter, needs to be learned for each arrow between the bottom sublayer

and the sublayer above. It is not uncommon to have a network with tens of millions

of such parameters. Therefore, this network would likewise require tens of millions of

training samples in order to learn filters that would generalize the network behavior to

new samples.

However, Hinton et al. [66] showed in 2006 that a particular form of probabilistic

graphical model called restricted Boltzmann machines can be trained and stacked in a

greedy manner, so that a bound on the probability of representing well the training data

is increased at each layer. Since then, the term deep learning has been used to denote

various other methods following the same principle of learning filters one layer after the

other [71, 70, 79]. A key advantage of this layer-wise learning strategy is the alleviation

of the aforementioned over-parameterization problem.

Appendix E

Scoring Best in the ICB-2013

Competition and the Applicability of

Our Approach in the MOBIO

Dataset

We were recently fortunate by scoring best in a competition on mobile face recognition

organized as part of the prestigious International Conference on Biometrics [32]. This

competition was carried out on the MOBIO database [38], which can be considered a

relatively unconstrained dataset. In fact, in terms of user collaboration, the MOBIO

dataset can be seen in-between the UND [21] and the PubFig83 [16] datasets. Most

importantly, however, is the fact that this dataset also reflects a timely use case, which is

face recognition in mobile devices.

In this appendix, we first present in Sec. E.1 the MOBIO dataset as well as its relevant

aspects. Then, in Sec. E.2, we describe the performance measures that were used to

evaluate the competitors. With this information, in Sec. E.3 we are able to report details

about our winning method. After that, a thorough evaluation of our person-specific

representation learning approach on this dataset is presented in Sec. E.4. Final remarks,

lessons learned, and directions obtained with this experience are presented in Sec. E.5.

E.1 The MOBIO Dataset

The MOBIO dataset has 152 people with a female-male ratio of nearly 1:2 (100 males

and 52 females). It is the result of an international collaboration, in which images from

six institutions of five different countries were recorded in 12 distinct occasions for each

58

E.1. The MOBIO Dataset 59

individual.1 The dataset can be considered challenging in the sense that images were

acquired without control over factors such as illumination, facial expression, and face

pose. Moreover, in some cases, only parts of the face are visible.

For the competition, 150 out of the 152 individuals were considered. Based on the

gender of the individuals, the evaluation protocol is split up into female and male. Still,

for the sake of fairness, individuals in the dataset are divided into three subsets, namely

the training set, the development set, and the evaluation set.

The training set has 50 individuals — 13 females and 37 males — with 192 images

each and can be used for any purpose to aid the systems, from learning subspace models

to leveraging score normalization. In addition, this is the only subset where gender can

be combined according to the participant’s needs.

The development set has 42 individuals — 18 females and 24 males — and can be

used to tune the hyperparameters of the algorithm, e.g., the number of projection vectors

while learning subspaces, which similarity measure to use, etc. For each person in this

set, there are five gallery images — which in the context of our method we call training

images — and 105 test images. For each gender, participants were asked to submit a score

file containing one similarity score between each test sample and each gallery person. For

example, the score file related to the female protocol in the development set must contain

18×(18×105)=34,020 similarity scores.

The evaluation set, in turn, is used to assess the final system performance. It has 58

individuals — 20 females and 38 males — with samples arranged in exactly the same way

as the development set, i.e., five gallery (or training) images and 105 test images. In order

to disallow participants to optimize parameters on the evaluation set, test file names were

anonymized and shuffled. Luckily, after the competition, the organizers released the test

files with their original names. This way, we are now able to carry out experiments on

our own and compare the performance of new approaches with the competition numbers.

In Fig. E.1, we present training and test images of four individuals in the evaluation

set. While we can clearly see variation in pose, expression, and illumination, we can

also observe that the individuals are — to some extent — collaborating with the image

acquisition process. This is the reason we regard the MOBIO dataset as representing an

intermediate scenario in terms of user collaboration (see Fig. 2.2). It is far from being

as constrained as UND [21], but at the same time is not as “wild” as PubFig83 [16].

More importantly, however, is to observe the difference in appearance among the training

and the test images. In fact, we can see that the five training images of each individual

look quite similar. While this is a natural consequence from the fact that these images

were recorded in the same session, this considerably diminishes the discriminative power

of learning techniques operating on them. Moreover, such homogeneity in appearance is

1In particular for the competition, all images available were captured by mobile phones.

E.1. The MOBIO Dataset 60

gallery (train)

~test

gallery (train)

~test

gallery (train)

~test

gallery (train)

~test

Figure E.1: Training and test images from the MOBIO evaluation set. We can observethat the dataset represents an intermediate recognition scenario in terms of user collab-oration (see Fig. 2.2). It is not as constrained as UND (Fig. 3.1), but at the same timeis not as “wild” as PubFig83 (Fig. 3.2). More importantly, however, is to observe thedifference in appearance among the training and the test images. In fact, we can seethat the five training images of each individual look quite similar. This is not aligned tothe notion of familiarity that we pursue in this thesis, and considerably diminishes thediscriminative power of learning techniques operating on them.

E.2. Performance Measures 61

not aligned to the notion of familiarity that we attempt to approach in this thesis with

PubFig83.

E.2 Performance Measures

During the competition, the systems were analyzed in verification mode and the perfor-

mance metrics adopted by the organizers are based on compromises between false accep-

tance (FAR) and false rejection (FRR) rates. Indeed, what determine the relationship

between these two measures is a threshold θ above which the system predicts that the

matching images are from the same individual. By increasing θ, we decrease FAR and

increase FRR. Conversely, by decreasing θ, we increase FAR and decrease FRR.

The main performance metrics used on the competition are actually known as equal

error rate (EER) and half total error rate (HTER). In particular, EER was adopted to

measure performance in the development set and HTER to measure performance in the

evaluation set. For this purpose, a θdev is first computed to measure the EER on the

development set and then is used to measure the HTER on the evaluation set. Formally,

θdev = arg minθ

|FARdev(θ)− FRRdev(θ)|

EER =FARdev(θdev) + FRRdev(θdev)

2

HTER =FAReval(θdev) + FRReval(θdev)

2

(E.1)

where the subscripts “dev” and “eval” denote values computed on the development and

on the evaluation set, respectively.

As mentioned in Sec. E.1, both development and evaluation sets are split up into female

and male subsets, and the systems are independently evaluated in each gender. For a given

gender, θdev is obtained from the development set and used in the evaluation set of the

same gender. Therefore, the main performance metrics considered in the competition

were two EER values — one for each gender — and, likewise, two HTER values.

E.3 Our Winning Method

We started designing our system by aligning the images with the eye positions provided

by the organizers, as we did for the UND and the PubFig83 datasets. Naturally, the

visual representation of our choice was L3+, as we observed throughout the thesis that

it achieves superior performance. By the time that the competition was running, there

was a rule stating that no parameter could be learned on the evaluation set. Since this

E.3. Our Winning Method 62

Table E.1: Our initial systems. We can observe that learning a subspace model with LDAon the training set was fundamental in performance. Experiment A is the system whosescores we first submitted to the competition organizers, while experiment B is identicalto A, but is does not use LDA.

system LDA matchingEER on dev. set HTER on eval. set

female male female maleA yes 1-NN 5.026 4.405 11.724 7.282B no 1-NN 11.852 10.635 19.732 14.645

forbade us from learning person-specific models from gallery images, we had to recast our

face recognition approach.

In the short timeframe that we had to put together a system meeting these conditions,

we could experiment a few ideas before submitting our score files. Given that the training

set was the only set that we could perform learning tasks, and that individuals in this

set were different from the individuals in the development and the evaluation sets, we

regarded this problem as a transfer learning problem.

In a first attempt, we tried to use deep person-specific filters learned from individuals

in the training set to represent individuals in the other two sets. In accordance to the

procedure presented in Chapter 6, we learned 100 person-specific filters for each individual

(out of 50) and then, as an extension, we used AdaBoost [84] to select an optimal subset

of filters performing best in the development set.

Another idea that occurred to us to leverage L3+ in this scenario was to perform

multiclass supervised subspace learning on the training set, using the techniques of this

type that we had previously evaluated in Chapter 5, namely multiclass partial least squares

(PLS) and linear discriminant analysis (LDA). In this attempt, different from what we

observed in Chapter 5 — where LDA and multiclass PLS performed equivalently —

LDA showed to perform much better than multiclass PLS in transferring discriminative

structure from the training set to the development set.

It turned out that, from the few ideas that we evaluated, our most effective approach by

the submission deadline consisted of the ensemble of standard L3+ visual representations,

LDA subspace analysis performed on the training set, and nearest neighbor predictions

— considering the maximum score obtained while matching each test image to the five

gallery images of each individual. This approach is presented in Table E.1 as system A,

whose scores we first submitted to the competition organizers. Due to the importance of

LDA throughout our experiments with the MOBIO dataset, we also present here system

B, which is identical to A except that it does not use LDA.

A few weeks after submitting system A, we received a manuscript from the organizers

with the description and performance of all systems submitted to the competition. From

E.3. Our Winning Method 63

Table E.2: Results obtained with the replacement of nearest neighbor predictions byone-versus-all linear SVMs. As we can observe, the use of linear SVMs did improveperformance. However, while the performance of system D over B was substantiallybetter, learning linear SVMs on the LDA subspace did not boost performance as greatly,as we can observe by comparing C with A.

system LDA matchingEER on dev. set HTER on eval. set

female male female maleA yes 1-NN 5.026 4.405 11.724 7.282B no 1-NN 11.852 10.635 19.732 14.645C yes linear SVM 4.709 3.492 10.833 6.210D no linear SVM 7.196 6.786 15.655 8.747

that document, we first realized that our system had superior performance.2 In addition,

we also realized that a few other participants actually learned a discriminative binary

model for each individual. They did so by considering gallery images of a single individual

as positive samples and images in the training set as negative samples, repeating this

process for all individuals.

This called our attention because, in our opinion, they were actually learning thou-

sands of parameters from the evaluation set (even though using only gallery images),

something that was clearly forbidden according to the guideline. Our reaction was to im-

mediately replace our nearest neighbor classifier by a one-versus-all linear SVM for each

individual, training them in the same way. As we can observe in Table E.2, the use of

linear SVMs did boost the performance of our systems. However, while the performance

of system D over B was substantially better, learning linear SVMs on the LDA subspace

did not boost performance as greatly, as we can observe by comparing system C with A.3

In the end, the organizers accepted our arguments about the fact that the compe-

tition guideline was misleading, and allowed us to send them new score files from our

slightly better system C, which ended up being the best performing single system of the

competition [32].

The little boost in performance while using SVMs instead of nearest neighbor classi-

fication was somehow surprising to us. In PubFig83, for example, when we replace one

by the other, the difference in performance is quite considerable, of over 30% in favor

2It is worth nothing that our system was considered by the organizers as belonging to the categoryof simple systems, in which predictions are made by a single classification engine. The other category,known as fusion systems, is related to systems that combine many visual representations with variousclassification engines to produce final matching scores. In any case, if we take the mean between HTERson the evaluation set, our single system performed best than all other systems [32].

3It is also important to observe that all performance comparisons carried out during the competitionwere solely based on female and male EERs obtained on the development set. As mentioned in Sec. E.1,test file names in the evaluation set were encrypted at that time.

E.4. Learning Person-Specific Representations 64

of SVMs in terms of identification rate. A more detailed analysis of the data, however,

enables us to conjecture two possible reasons for this fact. First, as mentioned in Sec. E.1,

the gallery/training images of each individual are quite homogeneous (see Fig. E.1), and

this may not allow discriminative learning tasks to capture most informative features

based on them. Second, the appearance of the 50 individuals in the training set — to

which we train linear SVMs against — may not represent well the appearance of other

individuals in either the development and the evaluation sets, which may also explain

discriminative models performing under our expectations.

E.4 Learning Person-Specific Representations

Given the fact that we did not have the chance to learn person-specific representations

by the time of the competition, in this section we present an evaluation of the two best

performing representation learning techniques proposed in this thesis — namely person-

specific partial least squares (PS-PLS) and deep person-specific models (Deep PS) —

on the MOBIO dataset. As performance considering nearest neighbor (1-NN) and SVM

predictions were close in systems A and C (Fig. E.2), in this section we always report

results considering both prediction schemes.

We first evaluate how PS-PLS and Deep PS compare with LDA, which can also be seen

as a representation learning method. In Table E.3, the resulting systems are presented

as E, F, G, and H. We can clearly observe that neither PS-PLS nor Deep PS could

beat systems A and C. While this is in opposition to our experiments on PubFig83

and Facebook100 (Chapters 5 and 6 and Appendix B) — where PS-PLS outperformed

LDA and the advantage of Deep PS was conclusive — it also strengthen the conjecture

presented in the previous section that (i) gallery/training images in the MOBIO dataset

are quite homogeneous and that (ii) individuals in the training set may not represent well

the appearance of other individuals in the development and the evaluation sets. Both of

these issues may have impaired the process of learning person-specific representations in

systems E, F, G, and H.

Another point that is clear to observe from Table E.3 is that LDA appears to be central

in obtaining good performance in this dataset. Therefore, given that PS-PLS models can

be learned from any kind of input in Rd, we decided to further evaluate the construction of

person-specific models with PS-PLS over LDA features. Moreover, we decided to slightly

change the MOBIO protocol by considering a learning scenario closer to the scenario

approached on UND, PubFig83, and Facebook100. To this end, we incorporated gallery

images from the other individuals of the same set/gender as negative samples in the

process of learning PS-PLS models. For example, when we train a person-specific model

for a given female (out of 20) in the evaluation set, now we also include gallery images of

E.4. Learning Person-Specific Representations 65

Table E.3: Comparison among LDA, PS-PLS, and Deep PS representation learning ap-proaches. Neither PS-PLS nor Deep PS could beat systems A and C. While this is inopposition to our experiments throughout the thesis, it emphasizes that the MOBIOdataset and protocol is adverse for learning person-specific representations.

system Rep. Learning matchingEER on dev. set HTER on eval. set

female male female maleA LDA 1-NN 5.026 4.405 11.724 7.282C LDA linear SVM 4.709 3.492 10.833 6.210E PS-PLS 1-NN 12.121 12.033 19.509 13.848F PS-PLS linear SVM 8.934 7.268 15.351 10.123G Deep PS 1-NN 9.101 5.784 12.395 10.064H Deep PS linear SVM 6.878 4.873 16.454 8.679

Table E.4: Results obtained by incorporating gallery images in the process of learningperson-specific representations with PS-PLS on top of LDA features. It is possible toobserve from system I that the competition numbers are considerably improved in thissetting of the MOBIO database. In addition, here we can also note from system J thatrepresentations learned with PS-PLS models consistently resulted in better performance.

gallery included

systemPS-PLS

matchingEER on dev. set HTER on eval. set

on LDA female male female maleA no 1-NN 5.026 4.405 11.724 7.282I no linear SVM 3.181 2.656 8.377 4.931J yes 1-NN 1.796 2.624 6.397 4.182K yes linear SVM 3.439 3.531 10.457 5.644

the other 19 females in the negative set.

These experiments gave rise to systems I, J, and K, as presented in Table E.4. While

the baseline system A was not affected by this new learning strategy, the other baseline

system C (the competition winner) was. Hence, we present system I as its replacement. In

this new scenario, it is possible to observe that the competition numbers are considerably

improved when comparing I with C (Table E.3). Here we can also note from system

J that person-specific representations learned with PS-PLS models consistently resulted

in better performance. Moreover, 1-NN predictions outperformed linear SVMs (system

K) in this particular scenario. In our opinion, the performance of system J supports

our initial guess that the MOBIO dataset — with its homogeneous gallery images and

in its original protocol — represents a ill-posed problem for learning person-specific face

representations.

E.5. Conclusions 66

E.5 Conclusions

Participating in the ICB2-2013 competition on face recognition was opportune in several

ways. First, naturally, having produced a best performing system is the confirmation that

we are grounded in good technology. Also, the fact that we could iterate over many ideas

and rigorously evaluate them in a timely manner strengthen our work in general.

In addition, we learned a lot by evaluating our methods on the MOBIO dataset,

which, even though reflects a presumably easier recognition scenario than PubFig83, has

a different image collection process. While PubFig83 has a large pool of diverse gallery

images and approach the operational scenario of face recognition in social media, MOBIO

has only five homogeneous gallery images for each individual and approach the “one-time

enrollment” operational scenario.

After several unsuccessful attempts and a slight modification in the MOBIO protocol,

the multitude of systems evaluated in this Appendix culminated in the person-specific-

representation-based system J, which was able to beat our ICB-2013 winning method. At

the same time that this reassures our claim, it spontaneously suggest that we should care-

fully consider particularities of the operational scenario to which we target our systems.

Documents

Giovani Chiachia â€œLearning Person-Specific Face Representationsâ€