
PERSON IDENTIFICATION USING FACE AND SPEECH BIOMETRICS

by

Imran Naseem

A Thesis Presented to the Graduate Research School

In Partial Fulfillment of the Requirements for the Degree

DOCTOR OF PHILOSOPHY

IN

Electrical, Electronic and Computer Engineering

UNIVERSITY OF WESTERN AUSTRALIA

Crawley, WA 6009, Australia

June 2010


Acknowledgements

In the name of Allah, the Most Gracious and the Most Merciful

All praise and glory goes to Almighty Allah (Subhanahu Wa Ta'ala), who gave me the courage and patience to carry out this work. Peace and blessings of Allah be upon His last Prophet Muhammad (peace be upon him). First and foremost, gratitude is due to the esteemed University of Western Australia and to its learned faculty members for imparting quality knowledge. My deep appreciation and heartfelt gratitude go to my thesis supervisors, Dr. Roberto Togneri and Prof. Mohammed Bennamoun, for their constant support and the numerous moments of attention they devoted throughout the course of this research work. Working with them in a friendly and motivating environment was truly a joyful experience of my life. I am thankful to Roberto Togneri in particular for having faith in my efforts and ideas, and for giving me the liberty to choose my research topics. He also efficiently managed and upgraded the Signal and Information Processing (SIP) lab, which made my experimental work much easier. Thanks are also due to Prof. Mohammed Bennamoun for allowing me to use the lab facilities in his school.

I would like to acknowledge my mentor, Dr. Muhammad Hafiz Afzal, for providing unconditional support, love and guidance from far away. I wish I could be in his company again. Acknowledgement is due to my friends Dr. Nazim Khan, Ghazi Abu Rumman, Salim, Bandar, Abdul Rahman, Shafiq, Hisham, Adnan Azam and many others, all of whom I would not be able to itemize. I owe thanks to my housemates and friends Khalid, Abdul Rahman, Tahir, Umair, Shahnawaz and Asim Aqeel for their help, motivation and pivotal support. They made my work and stay at UWA very pleasant and joyful. My heartfelt thanks go to my friends of old, Hashim Raza Khan, Khawar Saeed, Mudassir Masood, Imran Azam, Faisal Zaheer, Mazhar Azim, Sajid Anwar, Moinuddin, Saad Azhar, Arshad Raza and Aiman Rashid. I wish we could get together some time.

Last, but not least, I thank my family: my respected father, Muhammad Naseem Siddiqui, and my loving mother, Rashida Gulnaz, for educating me, and for their unconditional love, support and encouragement to pursue my interests, even when those interests went beyond the boundaries of language, field and geography. I thank my wife Javeria, with whom I have recently started a new era of my life, full of love, devotion, care and understanding. My dearest sisters Sadia, Zuvia and Moniza have always been a great moral support for me; I love them all and wish them a prosperous future. I thank my brother Arsalan Naseem for taking care of the family while I am overseas. My grandfather, Muhammad Jameel Siddiqui (late), was a light for me in dark times and gloomy circumstances. I enjoyed each and every moment spent in his company, and his memories are still a source of joy for me. His demise was a great setback for the whole family, but everyone has to go; nobody stays in this mortal world forever. May Allah grant him paradise and forgiveness. With all humbleness, I pray to Allah (Subhanahu Wa Ta'ala) for the prosperity and well-being of the whole human race, irrespective of religion, caste, creed and ethnicity. May Allah show us all the right path which will lead us to success in this world and the hereafter, Ameen.


Contents

Acknowledgements

Abstract

1 Introduction
   1.1 Motivation
   1.2 Face as a Biometric
   1.3 Voice as a Biometric
   1.4 Aims and Scope
   1.5 Thesis Structure
   1.6 Contributions
   1.7 Publications

2 Sparse Representation for Visual Biometric Recognition
   2.1 Introduction
   2.2 Compressive Sensing
   2.3 Sparse Representation Classification
   2.4 Sparse Representation Classification for Recognition from Still Face Images
      2.4.1 Yale Database
      2.4.2 AT&T Database
      2.4.3 AR Database
   2.5 Sparse Representation Classification for Video-based Face Recognition
      2.5.1 Scale Invariant Feature Transform (SIFT) for Face Recognition
      2.5.2 Experimental Results and Discussion
   2.6 Sparse Representation Classification for Ear Biometric
      2.6.1 Experiments and Discussion
   2.7 Conclusion
   2.8 Publications

3 Linear Regression for Face Identification
   3.1 Introduction
   3.2 Linear Regression for Face Recognition
      3.2.1 Linear Regression Classification (LRC) Algorithm
      3.2.2 Modular Approach for the LRC Algorithm
   3.3 Experimental Results
      3.3.1 AT&T Database
      3.3.2 Yale Database
      3.3.3 Georgia Tech (GT) Database
      3.3.4 FERET Database
      3.3.5 Extended Yale B Database
      3.3.6 AR Database
   3.4 Conclusion
   3.5 Publications

4 Robust Regression for Face Recognition
   4.1 Introduction
   4.2 The Problem of Robust Estimation
   4.3 Robust Linear Regression Classification (RLRC) for Robust Face Recognition
   4.4 Case Study: Face Recognition in Presence of Severe Illumination Variations
      4.4.1 Yale Face Database B
      4.4.2 CMU-PIE Face Database
      4.4.3 AR Database
      4.4.4 FERET Database
   4.5 Case Study: Face Recognition in Presence of Random Pixel Noise
   4.6 Conclusion
   4.7 Publications

5 Speaker Identification using Sparse Representation
   5.1 Introduction
   5.2 Sparse Representation for Speaker Identification
   5.3 Experimental Evaluation
   5.4 Conclusion
   5.5 Publications

6 Speaker Identification using Linear Regression
   6.1 Introduction
   6.2 Linear Regression Classification (LRC) Algorithm
   6.3 Experimental Results
   6.4 Conclusion and Future Directions
   6.5 Publications

7 Conclusions and Future Directions
   7.1 Contributions
   7.2 Future Directions

Bibliography


List of Figures

2.1 A typical subject from the Yale database with various poses and variations.
2.2 Recognition accuracy for the (a) Yale and (b) AT&T databases with respect to feature dimension.
2.3 A typical subject from the AT&T database.
2.4 Gesture variations in the AR database; note the changing position of the head with different poses.
2.5 A typical localized face from the VidTIMIT database with extracted SIFTs.
2.6 A sample video sequence from the VidTIMIT database.
2.7 (a) Rank profiles and (b) ROC curves for the SIFT, SRC and the combination of the two classifiers.
2.8 Variation in performance with respect to bias in fusion.
2.9 A typical subject from the UND database illustrating different pose and illumination variations.
2.10 A typical cropped ear (a) and its compressed form in the feature space (b).
2.11 (a) Rank profile and (b) ROC curves for the UND database.
2.12 (a) Rank profile and (b) ROC curves for the FEUD database.
2.13 A typical subject from the FEUD database.
3.1 A typical subject from the AT&T database.
3.2 (a) Test image from subject 1. (b) Residuals using a randomly selected false subspace. (c) Residuals using subspace 1.
3.3 (a) Recognition accuracy for the AT&T database with respect to feature dimension using the LRC algorithm. (b) Cross-validation with 20 random selections of gallery and probe images.
3.4 A typical subject from the Yale database with various poses and variations.
3.5 Yale database: recognition accuracy with respect to feature dimension for a randomly selected experiment.
3.6 Samples of a typical subject from the GT database.
3.7 A typical subject from the FERET database; fa and fb represent frontal shots with gesture variations, while ql and qr correspond to pose variations.
3.8 Starting from the top, each row illustrates samples from subsets 1, 2, 3, 4 and 5 respectively.
3.9 Recognition accuracy with varying feature dimension for EP1, EP2, EP3 and EP4.
3.10 Gesture variations in the AR database; note the changing position of the head with different poses. The first and second rows correspond to two different sessions incorporating neutral, happy, angry and screaming expressions respectively.
3.11 Examples of contiguous occlusion in the AR database.
3.12 Recognition accuracy versus feature dimension for scarf occlusion using the LRC approach.
3.13 A sample image indicating eye and mouth locations for the purpose of manual alignment.
3.14 Samples of cropped and aligned faces from the AR database.
3.15 Case studies for the Modular LRC approach for the problem of scarf occlusion.
3.16 (a) Distance measures dj(n) for the four partitions; note that non-face components make decisions with low evidence. (b) Recognition accuracies for all blocks.
4.1 Yale Face Database B: starting from the top, each row represents typical images from subsets 3, 4 and 5 respectively. Note that subset 5 (third row) characterizes the worst illumination variations.
4.2 The 21 different illumination variations for a typical subject from the CMU-PIE database. These images were captured without any ambient lighting, thereby exhibiting more severe luminance alterations.
4.3 Performance curves for the CMU-PIE database under EP 2.
4.4 Various luminance variations for a typical subject of the AR database; the two rows represent two different sessions.
4.5 ROC curves for the FERET database.
4.6 The first row illustrates some gallery images from subsets 1 and 2, while the second row shows some probes from subset 3.
4.7 Probe images corrupted with (a) 20%, (b) 40%, (c) 60% and (d) 80% dead pixels.
4.8 Recognition accuracy of various approaches for a range of dead-pixel noise densities.
4.9 Probes with (a) 20%, (b) 40%, (c) 70% and (d) 90% salt-and-pepper noise density.
4.10 Recognition accuracy curves in the presence of varying densities of salt-and-pepper noise.
4.11 Probe images corrupted with (a) 4, (b) 6, (c) 8 and (d) 10 variance speckle noise.
4.12 Dead-pixel noise: the first row shows rank-recognition profiles, while the second row shows the Receiver Operating Characteristics (ROC). From left to right, columns indicate 20%, 40%, 60% and 80% noise density respectively.
4.13 Salt-and-pepper noise: the first row shows rank-recognition curves, while the second row shows the Receiver Operating Characteristics (ROC). From left to right, columns indicate 50%, 60%, 70% and 80% noise densities respectively.
4.14 Speckle noise: the first row shows rank-recognition curves, while the second row shows receiver operating characteristics. From left to right, columns indicate noise with variances 2, 4, 6 and 8 respectively.
4.15 Gaussian noise: the first row shows rank-recognition curves, while the second row shows the Receiver Operating Characteristics (ROC). From left to right, columns indicate noise with variances 0.5, 0.7, 0.8 and 0.9 respectively.
4.16 Probe images corrupted with (a) 0.2, (b) 0.4, (c) 0.6 and (d) 0.8 variance zero-mean Gaussian noise.
4.17 Recognition accuracy of various approaches in the presence of speckle noise for different variances.
4.18 Recognition accuracy of various approaches in the presence of Gaussian noise for different variances.
6.1 Experiment Set 1: recognition accuracy of various approaches with respect to the number of mixtures.
6.2 Experiment Set 2: recognition accuracy of various approaches with respect to the number of mixtures.


List of Tables

2.1 Results for the Yale database using the leave-one-out method.
2.2 Results for two experiment sets using the AT&T database.
2.3 Recognition results for gesture variations under Experiment Set 1.
2.4 Recognition results for gesture variations under Experiment Set 2.
2.5 Summary of results.
3.1 Results for EP1 and EP2 using the AT&T database.
3.2 Results for the Yale database using the leave-one-out method.
3.3 Results for the Georgia Tech database.
3.4 Results for the FERET database.
3.5 Results for the Extended Yale B database.
3.6 Recognition results for gesture variations using the LRC approach.
3.7 Recognition results for gesture variations under EP3.
3.8 Recognition results for gesture variations under EP4.
3.9 Recognition results for occlusion.
3.10 Comparison of the DEF with the Sum Rule for three case studies.
4.1 Outline of the Robust Linear Regression Classification (RLRC) algorithm.
4.2 Details of the subsets for Yale Face Database B with respect to light source directions.
4.3 Recognition results for Yale Face Database B.
4.4 Performance comparison with state-of-the-art algorithms characterizing training images captured from near-frontal lighting. All results are as reported in [1].
4.5 Performance comparison with state-of-the-art algorithms characterizing training images with severe lighting conditions. All results are as reported in [1].
4.6 Results for the AR database under EP 1.
4.7 Results for the AR database under EP 2.
4.8 Results for the AR database under EP 3.
4.9 Verification results for dead-pixel noise.
4.10 Verification results for salt-and-pepper noise.
4.11 Verification results for speckle noise.
4.12 Verification results for Gaussian noise.
5.1 Experimental results for the TIMIT database.
6.1 Experiment Set 1: recognition accuracy for various approaches with respect to different numbers of mixtures.
6.2 Experiment Set 2: recognition accuracy for various approaches with respect to different numbers of mixtures.


Abstract

Increasing security threats have recently highlighted the importance of efficient authentication systems. Although face and speech biometrics have shown good performance, there are key robustness issues which challenge the reliability of these systems: illumination, expression, pose and occlusion, for instance, remain open challenges in the paradigm of face recognition, and addressing them requires novel algorithms. This dissertation investigates the recently proposed face recognition algorithm called Sparse Representation Classification (SRC) with respect to these key robustness issues. Since local features such as eyes, ears and lips have shown better performance than their global counterparts, the thesis successfully extends the SRC approach to the problem of ear recognition. In the paradigm of face recognition, three novel algorithms, namely Linear Regression Classification (LRC), Modular LRC and Robust Linear Regression Classification (RLRC), are proposed to address the issues of severe expression variation, adverse luminance variation, contiguous occlusion and random pixel corruption. Extensive experiments have been conducted on standard databases and excellent results are reported. In particular, using the Modular LRC approach we achieve the best result yet reported for the challenging scarf occlusion problem. Addressing the problem of luminance, the proposed RLRC algorithm achieves 100% recognition accuracy on the most adversely distorted Subset 5 of the Yale Face Database B, outperforming contemporary illumination-invariant face recognition algorithms. In the paradigm of speaker recognition, we propose two novel algorithms based on sparse representation and linear regression; these are tested on subsets of the TIMIT database, achieving a competitive performance index compared with state-of-the-art approaches. The dissertation is presented as a compilation of publications.


Chapter 1

Introduction

1.1 Motivation

With the advent of technology, electronic devices have become an integral part of our day-to-day life. These devices are used in applications ranging from everyday home appliances to high-performance scientific equipment in modern labs. In many cases, the use of these facilities must be restricted to a limited group of people. For example, you would not allow a stranger to access your office computer; similarly, it would be a serious problem if your bank account information were hacked and could be accessed through an ATM. Some kind of "key" must therefore exist so that only the person(s) concerned, and not any intruder, can access these facilities. The earliest and most commonly used technique is that of passwords: the user is required to type in a password, and if it is correct the user is allowed access to the facility; otherwise the system rejects the user. There are several problems associated with the use of passwords [2]. Passwords can be forgotten, and it is difficult to memorize different passwords for different applications; users therefore tend to use one password for various venues, which greatly increases the level of risk. With recent technological developments, passwords can also be "stolen" or hacked. To counter these difficulties, smart-cards were introduced in conjunction with passwords, so that the system can only be accessed by the person in possession of a card. Again, this poses several problems, as the card can be lost, stolen, damaged or even duplicated. To avoid these troubles, systems based on biometrics started to gain more popularity and acceptance [2].


Biometrics is the capture of a person's physical or behavioral characteristics, or traits, for use in authentication or identification. Biometrics help us avoid the problems discussed above: in a biometrics-based environment, the subject is an identity in himself. It is important to point out, however, that with the recent emergence of multimedia technology, traditional biometric systems have also become vulnerable. A biometric system can be fooled by presenting very fine quality recorded biometric samples when the person concerned is not actually present [3]. This problem, referred to as "liveness detection", is in fact an emerging research area in the field of biometrics. However, even with the issue of liveness detection, biometrics arguably remain the best choice for secure authentication. Several biometric systems exist today:

• Fingerprints

• Voice

• Face

• Hand Geometry

• Eye Geometry

• Iris

• Ear Geometry

Systems based on the above-mentioned features have been shown to be very efficient. One must be mindful, however, that each biometric comes with some intrinsic weaknesses, and a system suitable for one application may not fit well for another. For example, fingerprints are arguably the most developed of all the biometrics and achieve very good recognition accuracy, but they require a high level of cooperation from the user. This makes fingerprints non-user-friendly and unsuitable for many important applications, such as surveillance. Similarly, iris patterns have proven to be excellent features for human recognition, but the process of acquiring a good iris image is complex, expensive and intrusive. Of the above-mentioned biometrics, face and speech are the two most natural choices as they are user-friendly, do not require physical contact and are less intrusive.


1.2 Face as a Biometric

The importance of face recognition is highlighted by widely deployed video surveillance systems, in which cameras capture images that can be used to monitor abnormal activities in sensitive areas. The face recognition problem can be defined as the process of identifying an individual from his or her face image, which may be captured by a camera or extracted from a video. Face recognition is a challenging task in pattern recognition, mainly because the image of a face is prone to change due to a number of factors such as noise, illumination, viewpoint, age, facial expression and occlusion. In the past 40 years, we have witnessed major development in the field of face recognition, driven chiefly by the need for such systems in various commercial security applications. In spite of this advancement, recognition systems still face certain limitations due to the above-mentioned robustness issues. In particular, contiguous occlusion and illumination variation are considered the most challenging problems in the paradigm of face recognition [3].

1.3 Voice as a Biometric

Automatic speaker recognition systems identify people from their utterances. Depending on the nature of the application, speaker identification or speaker verification systems can be designed to operate in either text-dependent or text-independent mode. In text-dependent mode the user is required to utter a specific password, while for text-independent ASR (Automatic Speaker Recognition) there is no such constraint. Success in both cases depends on modeling the speech characteristics which distinguish one user from another. Text-dependent mode is used for applications where the user is willing to cooperate by memorizing a phrase or password to be spoken. Research in the field of speaker recognition traces back to the early 1960s, when Lawrence Kersta at Bell Labs [4] made the first major step in speaker verification by computers, introducing the term "voiceprint" for a spectrogram. Since then, there has been a tremendous amount of research in the area, and speech has emerged as a mature biometric trait.


1.4 Aims and Scope

This research focused on developing novel and robust recognition algorithms for face and speech biometrics. Recent developments in the theory of compressive sensing [5] have found

numerous applications in various fields of signal processing. Recently this concept of sparse

representation has been used for the problem of face recognition [6]. Sparse Representa-

tion Classification (SRC) presented in [6] has shown some interesting results. The feature

extraction stage has been a debatable topic within the face recognition community. A

number of approaches have been presented incorporating complicated computations. In

[6] it has been shown that, with the choice of an appropriate classifier, merely downsampled images are sufficient at the feature extraction stage to yield good results. SRC was essentially the starting point for this research: we successfully extended this intriguing approach to other challenging problems of view-based biometric recognition systems. These experiments were quite encouraging and led to the development of a novel face

recognition algorithm, called Linear Regression Classification (LRC). The proposed LRC

algorithm showed excellent results on standard face databases. We further extended the

approach to a patch-based technique called Modular LRC to achieve excellent results for

the challenging problem of contiguous occlusion. Noting that LRC formulates the problem

of face recognition as a task of linear regression, we further proposed to use robust statis-

tics to develop a Robust Linear Regression Classification (RLRC) algorithm to tackle the

challenging issues of illumination variations and random pixel corruption.
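The core idea behind LRC (model each class by the linear span of its training vectors, then assign a probe to the class whose least-squares reconstruction leaves the smallest residual) can be sketched in a few lines. This is a minimal illustration on synthetic toy data, not the implementation evaluated in later chapters; all function and variable names here are hypothetical:

```python
import numpy as np

def lrc_classify(class_matrices, y):
    """Assign probe y to the class whose least-squares reconstruction
    (projection onto the span of that class's training vectors) leaves
    the smallest residual."""
    residuals = []
    for X in class_matrices:                    # X: (m, n_i) training columns
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals.append(np.linalg.norm(y - X @ beta))
    return int(np.argmin(residuals))

# synthetic toy data: two classes of 5-d feature vectors, 3 samples each
rng = np.random.default_rng(0)
X0 = rng.normal(size=(5, 3))
X1 = rng.normal(size=(5, 3)) + 2.0
y = X1 @ np.array([0.5, 0.3, 0.2])              # probe lying in class 1's span
print(lrc_classify([X0, X1], y))                # -> 1
```

Because the probe lies exactly in the span of the second class's training vectors, its residual for that class is essentially zero, which is the decision rule LRC relies on.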

Traditionally, the problem of speaker identification is tackled using Gaussian Mixture Model (GMM) based probabilistic approaches. Recently, however, the concept of the GMM mean supervector has enabled the representation of a speaker as a point in a high-dimensional space, i.e. the speaker space [7]. Essentially, with this approach, a

variable-length utterance can be represented as a fixed-length feature vector in the feature

space which was not possible before. Consequently the problem of speaker recognition

can be tackled as a general problem of pattern recognition. In [7] the concept of Support

Vector Machine (SVM) has been successfully used to yield competitive results compared

to the state-of-the-art probabilistic approaches. The SVM computations, however, tend to be expensive, particularly for the one-against-all SVM architecture; with a large number of Gaussian mixtures, the simulations are not feasible on a standard machine. It

is therefore imperative to further explore current pattern classification algorithms for the

problem of speaker identification. With this understanding, we extended the SRC and LRC classification algorithms to the problem of speaker identification, achieving performance competitive with state-of-the-art approaches, including the SVM approach. The proposed algorithm remains one of the simplest of the current state-of-the-art classification approaches.

1.5 Thesis Structure

The research presented in this thesis has either been published, accepted for publication or is under review in prestigious journals and conferences; the thesis is therefore presented as a compilation of these publications. It is worth pointing out that, for the sake of consistency

and flow, some chapters consist of more than one publication. Nevertheless each chapter

is self-contained and does not require linkage with any other chapter. Since the thesis

is presented as a combination of publications, there is an inevitable overlap between the

chapters when describing the general problem statements. The organization of the thesis

is as follows:

Chapter 2 is an extensive evaluation of the recently introduced SRC algorithm for view-

based biometrics. It begins with the evaluation of robustness of the SRC algorithm for

two major issues. In particular we address the issues of severe expression variations and

moderate illumination variations. We further investigated the performance of the SRC

algorithm on the video-based face recognition problem. Considering that in the paradigm

of face recognition, local features (such as eyes, ears and lips) have shown good results compared to their global counterparts, we extended the SRC approach to the problem of

ear recognition. Publications incorporated in the chapter are also reported at the end of

the chapter.

Chapter 3 presents the Linear Regression Classification (LRC) algorithm which is a

novel approach for the problem of face recognition. The difficult problem of occluded faces


is also successfully addressed using the modular LRC approach. Publications constituting

the chapter are also reported at the end.

Chapter 4 presents the Robust Linear Regression Classification (RLRC) algorithm

which is a novel approach to address two major robustness issues namely (1) severe il-

lumination variations and (2) random pixel noise. The proposed algorithm has shown superior performance compared to state-of-the-art robust approaches. Publications

constituting the chapter are also reported at the end.

Chapter 5 presents the novel extension of the SRC algorithm for the problem of speaker

identification. Experiments have been conducted and results are compared to state-of-the-art approaches, including the recently proposed SVM classification. The publication arising

from the chapter is indicated at the end.

Chapter 6 presents the novel LRC algorithm for the problem of speaker identification.

Extensive experiments have demonstrated the good performance of the proposed approach. The publication arising from the chapter is indicated at the end.

Chapter 7 concludes the dissertation with a summary of the contributions and sug-

gested future directions of the research.

1.6 Contributions

1. Evaluation of the robustness of the Sparse Representation Classification (SRC) algorithm

for (1) slight-to-moderate light variations and (2) severe expression variations.

2. Extension of the SRC algorithm for the problem of ear recognition.

3. Extension of the SRC algorithm for video-based face recognition.

4. Development of a novel face recognition algorithm called Linear Regression Clas-

sification (LRC) demonstrating excellent results compared to the benchmark ap-

proaches.


5. Development of Modular LRC approach to tackle the difficult problem of contiguous

occlusion using the novel Distance based Evidence Fusion (DEF) algorithm. The

proposed algorithm achieved the best results ever reported for the scarf occlusion

problem.

6. Development of the Robust Linear Regression Classification (RLRC) algorithm for

the challenging issues of (1) illumination variations and (2) random pixel corruption.

The proposed algorithm achieved excellent results compared to state-of-the-art robust approaches.

7. Development of a novel algorithm in the paradigm of speaker identification using

the concept of sparse representation.

8. Development of a novel linear regression based speaker recognition algorithm achieving results comparable to state-of-the-art approaches.

1.7 Publications

Key publications arising from the thesis are the following:

1. Imran Naseem, Roberto Togneri and Mohammed Bennamoun,“Linear Regression

for Face Recognition”, In Print IEEE Transactions on Pattern Analysis and Machine

Intelligence (IEEE TPAMI)

2. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Robust Regres-

sion for Face Recognition”, First revision submitted IEEE Transactions on Pattern

Analysis and Machine Intelligence (IEEE TPAMI)

3. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Face Identification

using Linear Regression”, International Conference on Image Processing ICIP09,

Cairo, Egypt.

4. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, Sparse Representa-

tion for View-Based Face Recognition, accepted as a book chapter in the book Advances in Face Image Analysis: Techniques and Technologies, Ed. Y.-J. Zhang, IGI Global

Publishing.


5. Imran Naseem, Roberto Togneri and Mohammed Bennamoun,“Sparse Represen-

tation for Speaker Identification”, Accepted in IAPR International Conference on

Pattern Recognition, ICPR’10

6. Imran Naseem, Roberto Togneri and Mohammed Bennamoun,“Robust Regres-

sion for Face Recognition”, Accepted in IAPR International Conference on Pattern

Recognition, ICPR’10

7. Imran Naseem, Roberto Togneri and Mohammed Bennamoun,“Sparse Represen-

tation for Video-Based Face Recognition”, book chapter in Advances in Biometrics

(Lecture Notes in Computer Science, LNCS series), Springer Berlin / Heidelberg.

Volume 5558/2009, pages 219-228. 0302-9743 (Print) 1611-3349 (Online), ISBN

978-3-642-01792-6.

8. Imran Naseem, Roberto Togneri and Mohammed Bennamoun,“Sparse Represen-

tation for Video-Based Face Recognition”, International Conference on Biometrics,

ICB09, Alghero, Italy.

9. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Represen-

tation for Ear Biometrics”, book chapter in Advances in Visual Computing (Lecture

Notes in Computer Science, LNCS series), Springer Berlin / Heidelberg. Volume

5359/2008, pages 336-345. ISSN 0302-9743 (Print) 1611-3349 (Online), ISBN 978-

3-540-89645-6.

10. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Represen-

tation for Ear Biometrics”, International Symposium on Visual Computing (ISVC),

December 1-3, 2008, Las Vegas, Nevada, USA.

11. Imran Naseem, Roberto Togneri and Mohammed Bennamoun,“Linear Regression

for Speaker Identification”, Submitted to InterSpeech 2010


Chapter 2

Sparse Representation for Visual

Biometric Recognition1

2.1 Introduction

With ever-increasing security threats, the need for invulnerable authentication systems is becoming more acute. Traditional means of securing a facility essentially depend on strategies corresponding to “what you have” or “what you know”, for example smart cards, keys and passwords. These systems, however, can easily be fooled. Passwords, for example, are difficult to remember, and people therefore tend to use the same password for multiple facilities, making it more susceptible to hacking. Similarly, cards and keys can easily be stolen or forged. A more inalienable approach is therefore to adopt strategies corresponding to “what you are” or “what you exhibit”, i.e. biometrics. Although the issue of “liveness” has recently been highlighted due to advances in digital media technology, biometrics arguably remain the best choice.

Among the other available biometrics, such as speech, iris, fingerprints, hand geometry

and gait, face seems to be the most natural choice [2]. It is nonintrusive, requires a mini-

mum of user cooperation and is cheap to implement. The importance of face recognition

is highlighted for widely used video surveillance systems where we typically have facial

1Parts from this chapter have been published in the International Symposium on Visual Computing (ISVC’08) and the International Conference on Biometrics (ICB’09). The research has also been accepted for publication as a book chapter in the upcoming book “Advances in Face Image Analysis: Techniques and Technologies”.


images of suspects. With the additional temporal dimension, video sequences are much

more informative than still images. As a result the person identification task is facili-

tated due to specific attributes of each subject such as head rotation and pose variation

along the temporal dimension. Additionally, more efficient face representations, such as super-resolution images, can be derived from video sequences for further enhancement of

the overall system. These motivations have urged researchers to look into the develop-

ment of face recognition systems that can utilize the spatiotemporal information in video

sequences. It is therefore becoming imperative to evaluate present state-of-the-art face

recognition algorithms for video-based applications. A face recognition system works in

three modes: 1) Face Identification/Recognition 2) Face Verification and 3) The Watch

List approach. Face identification/recognition is a 1:N matching problem where a Closed

Universe model is used. Therefore each probe image is implicitly assumed to be from one

of the registered users. Face verification on the other hand, is defined as a 1:1 matching

problem and requires the confirmation of the identity claimed by a user. The watch list

approach, as proposed in the Face Recognition Vendor Test (FRVT 2002) [8], assumes an Open

Universe model where a probe face image may or may not correspond to the registered

users. Similarity scores are computed against each subject in the gallery and an alarm is

raised if the score exceeds a given threshold [3].

The ear is also an important biometric, gaining popularity primarily due to its immunity to the aging factor [9]. With increasing age, faces sag, speech becomes heavier, fingerprints wear out and the style of walking declines [9]. Ears, however, tend to maintain their shape and seem to be the most time-invariant biometric. Furthermore, they are also pose invariant, as they do not change shape with changes in gesture. These properties have urged researchers to consider ears for the purpose of person identification. Historically

speaking, Iannarelli [10] first provided sufficient experimental evidence, in 1989, to draw the attention of researchers to the problem of ear recognition. However, it was not until the last decade that the computer vision community started evaluating ear recognition

systems with some appreciable results [11]. Various studies in the area include neural

network approaches [12] and Principal Component Analysis (PCA) variations [13, 14].

The PCA approaches however rely heavily on the normalization processes for any reliable


results [15].

Appearance-based face recognition systems, either employing the whole face or local

landmark features such as eyes, lips, ears etc., critically depend on manifold learning meth-

ods. A gray-scale face image of order a× b can be represented as an ab-dimensional vector

in the original image space. However, any attempt at recognition in such a high-dimensional space is vulnerable to a variety of issues, often referred to as the curse of dimensionality.

Typically in pattern recognition problems it is believed that high-dimensional data vectors

are redundant measurements of an underlying source. The objective of manifold learning

is therefore to uncover this “underlying source” by a suitable transformation of high-

dimensional measurements to low-dimensional data vectors. View-based face recognition

methods are no exception to this rule. Therefore, at the feature extraction stage, images

are transformed to low dimensional vectors in a face space. The main objective is to find

a basis function for this transformation, which could distinguishably represent faces in the

face space. Linear transformation from the image space to the feature space is perhaps the

most traditional way of dimensionality-reduction, also called “Linear Subspace Analysis”.

A number of approaches have been reported in the literature including Principal Com-

ponent Analysis (PCA) [16], [13], Linear Discriminant Analysis (LDA) [17] and Indepen-

dent Component Analysis (ICA) [18], [19]. These approaches have been classified in two

categories namely reconstructive and discriminative methods. Reconstructive approaches

(such as PCA and ICA) are reported to be robust to problems of contaminated pixels, whereas discriminative approaches (such as LDA) are known to yield better results

in clean conditions [20]. Nevertheless, the choice of the manifold learning method for a

given problem of face recognition has been a hot topic of research in the face recognition

literature. These debates have recently been challenged by a new concept of “Sparse Rep-

resentation Classification (SRC)” [6]. It has been shown that unorthodox features such

as downsampled images and random projections can serve equally well. As a result the

choice of the feature space may no longer be so critical [6]. What really matters is the

dimensionality of the feature space and the design of the classifier. The key factor in the

success of sparse representation classification is the recent development of “Compressive

Sensing” theory [5].


Recent developments in the theory of compressive sensing [5] have found numerous applications in various fields of signal processing. Recently, this concept of sparse representation

has been used for the problem of face recognition [6]. The reported results are encouraging

enough to extend the algorithm to other biometrics and further evaluate their performance

for harder face recognition problems under the most challenging practical constraints such

as video-based face recognition, occlusion and varying ambient illumination. The main

objective of this chapter is therefore twofold: (1) To extend the Sparse Representation

Classification (SRC) method for the problem of the emerging ear biometric. (2) To eval-

uate the SRC algorithm for more challenging and realistic face recognition issues such as

gesture variations, illumination variations and video-based applications.

The rest of the chapter is organized as follows: Section 2.2 briefly covers the problem of

compressive sensing followed by description of Sparse Representation Classification (SRC)

in Section 2.3. Section 2.4 consists of extensive experiments on still face images followed

by evaluations on video-based face recognition problem in Section 2.5. Evaluations for ear

biometric recognition are presented in Section 2.6. The chapter is concluded in Section

2.7 followed by a list of publications arising from this research in Section 2.8.

2.2 Compressive Sensing

Most of the signals of practical interest are compressible in nature. For example, audio

signals are compressible in localized Fourier domain and digital images are compressible in

Discrete Cosine Transform (DCT) and wavelet domains. This concept of compressibility

gives rise to the notion of transform coding so that subsequent processing of information is

computationally efficient. It simply means that a signal when transformed into a specific

domain becomes sparse in nature and can be approximated efficiently by, say, K large coefficients, ignoring all the small values. However, the initial data acquisition is

typically performed in accordance with the Nyquist sampling theorem, which states that a signal can be safely recovered from its samples only if they are drawn at a sampling rate of at least twice the maximum frequency of the signal.

Consequently, the data acquisition part can be an overhead since a huge number of acquired

samples will have to be further compressed for any subsequent realistic processing.


As a result, a legitimate question is whether there is any efficient way of data acquisition

so as to remove the Nyquist overhead while still safely recovering the signal. The new area of

compressive sensing answers this question. Let us formulate the problem in the following

manner [21].

Let g be a signal vector of order N × 1. Any signal in R^N can be represented in terms of an N × N orthonormal basis matrix Ψ and an N × 1 vector of weighting coefficients s such that:

g = Ψs (2.1)

It has to be noted here that g and s are essentially two different representations of the

same signal. g is the signal expressed in the time domain while s is the signal represented

in the Ψ domain. Note that the transformation of the signal g into the Ψ basis makes it K-sparse.

This means that ideally s has only K non-zero entries.

Now the aim of compressive sensing is to measure a low dimensional vector y of order

M × 1 (M < N) such that the original information g can be safely retrieved from y. It

means that we are looking for a transformation Φ such that

y_{M×1} = Φ_{M×N} g_{N×1}    (2.2)

From equation 2.1

y_{M×1} = Φ_{M×N} Ψ_{N×N} s_{N×1} = Θ_{M×N} s_{N×1}    (2.3)

In equation 2.3 the main aim is to design a stable measurement matrix Φ which would

ensure that there is no information loss in the compressible signal due to the dimensionality reduction from R^N to R^M [21]. Leaving the issue of the measurement matrix for a moment, we

would like to emphasize that given the measurement vector y in equation 2.3 the problem

is still ill-posed as we are looking for N unknowns with a system of M equations. However

the issue is easily resolved due to the K-sparse nature of s which means that essentially


there will only be K non-zero entries in s and hence we will be looking to find K unknowns

from a system of M equations, where K ≤ M.

It has been shown that for equation 2.3 to be true Θ must satisfy the Restricted Isom-

etry Property (RIP) [22]. Alternatively it has been discussed in [5, 22] that the stability of

the measurement matrix Φ could be ensured if it is incoherent with the sparsifying basis

Ψ. In the framework of compressive sensing this discussion boils down to the selection of

Φ as a random matrix. This means that if, for instance, we select Φ as a Gaussian random

matrix, such that the entries of the matrix Φ are independent and identically distributed

(iid) then Θ will satisfy the RIP with a high probability [5, 22, 23].

Once the RIP property of Θ is satisfied in equation 2.3, the recovery of vector s is

merely a problem of using a suitable reconstruction algorithm. In the compressive sensing

literature [5, 22] it has been shown that s can be recovered with a high probability using

the l1 optimization given that

M ≥ c K log(N/K)    (2.4)

c being a small constant.
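To make this recovery result concrete, the l1 optimization (basis pursuit) can be cast as a linear program by splitting s into non-negative parts. The sketch below assumes SciPy's linprog is available and uses arbitrary toy values for N, M and K; it recovers a K-sparse signal from M < N random Gaussian measurements:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Theta, y):
    """Solve min ||s||_1 subject to Theta @ s = y by writing s = u - v
    with u, v >= 0, which turns the l1 objective into a linear program."""
    M, N = Theta.shape
    c = np.ones(2 * N)                           # sum(u) + sum(v) = ||s||_1
    A_eq = np.hstack([Theta, -Theta])            # Theta @ (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * N))
    return res.x[:N] - res.x[N:]

# recover a K-sparse signal from M < N random Gaussian measurements
rng = np.random.default_rng(1)
N, M, K = 50, 25, 3                              # M comfortably exceeds c K log(N/K)
s_true = np.zeros(N)
s_true[rng.choice(N, size=K, replace=False)] = rng.normal(size=K)
Theta = rng.normal(size=(M, N)) / np.sqrt(M)     # iid Gaussian measurement matrix
y = Theta @ s_true
s_hat = basis_pursuit(Theta, y)
print(np.max(np.abs(s_hat - s_true)))            # close to zero: exact recovery
```

With an iid Gaussian Φ (and hence Θ), the l1 program recovers the K-sparse vector with high probability once M satisfies equation 2.4, which is what this toy configuration illustrates.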

2.3 Sparse Representation Classification

We now discuss the basic framework of the face recognition system in the context of sparse

representation [6]. Let us assume that we have k distinct classes and ni images available

for training from the ith class. Each training sample is a gray scale image of order a × b.

The image is downsampled to an order w × h and is converted into a 1-D vector vi,j by

concatenating the columns of the downsampled image such that vi,j ∈ R^m (m = wh).

Here i is the index of the class, i = 1, 2, . . . , k and j is the index of the training sample,

j = 1, 2, . . . , ni. All this training data from the ith class is placed in a matrix Ai such that

Ai = [vi,1, vi,2, . . . , vi,ni] ∈ R^{m×ni}. As stated in [6], when the training samples from

the ith class are sufficient, the test sample y from the same class will approximately lie in

the linear span of the columns of Ai:


y = αi,1 vi,1 + αi,2 vi,2 + · · · + αi,ni vi,ni    (2.5)

where αi,j are real scalar quantities. Now we develop a dictionary matrix A for all k

classes by concatenating Ai, i = 1, 2, . . . , k as follows:

A = [A1, A2, . . . , Ak] ∈ R^{m×ni·k}    (2.6)

Now a test pattern y can be represented as a linear combination of all n training

samples (n = ni × k):

y = Ax (2.7)

where x is an unknown vector of coefficients. Now, from equation 2.7 it is relatively straightforward to note that only those entries of x that are non-zero correspond to the

class of y [6]. This means that if we are able to solve equation 2.7 for x we can actually

find the class of the test pattern y. Recent research in compressive sensing and sparse

representation [22, 5, 24, 25, 26] has shown that the sparsity of the solution of equation 2.7 enables us to solve the problem using l1-norm minimization:

(l1) : x1 = arg min ‖x‖1 subject to Ax = y    (2.8)

Once we have estimated x1, ideally it should have nonzero entries corresponding to

the class of y and now deciding the class of y is a simple matter of locating indices of the

non-zero entries in x1. However due to noise and modeling limitations x1 is commonly

corrupted by some small nonzero entries belonging to different classes. To resolve this

problem we define an operator δi for each class i so that δi(x1) gives us a vector in R^n

where the only nonzero entries are from the ith class. This process is repeated k times for

each class. Now for a given class i we can approximate yi = Aδi(x1) and assign the test

pattern to the class with the minimum residual between y and yi.

min_i ri(y) = ‖y − A δi(x1)‖2    (2.9)
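The pipeline of equations 2.5 to 2.9 can be condensed into a short sketch. Here the l1 step of equation 2.8 is solved as a linear program via SciPy's linprog (one possible solver choice, not necessarily the one used in the reported experiments), and the dictionary is built from synthetic toy data:

```python
import numpy as np
from scipy.optimize import linprog

def src_classify(A_list, y):
    """SRC sketch: solve min ||x||_1 s.t. A x = y (equation 2.8) as a
    linear program, then pick the class whose coefficients give the
    smallest reconstruction residual (equation 2.9)."""
    A = np.hstack(A_list)                        # dictionary of all training vectors
    m, n = A.shape
    res = linprog(np.ones(2 * n),                # x = u - v, u, v >= 0
                  A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=[(0, None)] * (2 * n))
    x1 = res.x[:n] - res.x[n:]
    residuals, start = [], 0
    for Ai in A_list:                            # delta_i keeps only class-i entries
        ni = Ai.shape[1]
        residuals.append(np.linalg.norm(y - Ai @ x1[start:start + ni]))
        start += ni
    return int(np.argmin(residuals))

# synthetic toy data: 3 classes of 8-d features, 4 training samples each
rng = np.random.default_rng(2)
A_list = [rng.normal(size=(8, 4)) + 5.0 * c for c in range(3)]
y = A_list[2] @ np.array([0.4, 0.3, 0.2, 0.1])   # probe from class 2's span
print(src_classify(A_list, y))
```

Because the probe is a sparse combination of the third class's training vectors, the l1 solution concentrates its non-zero entries on that class and the class-wise residual of equation 2.9 selects it.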


2.4 Sparse Representation Classification for Recognition from

Still Face Images

2.4.1 Yale Database

The Yale database, maintained at Yale university, consists of 165 grayscale images from

15 individuals [27]. Images from each subject reflect gesture variations incorporating

normal, happy, sad, sleepy, surprised, and wink expressions. Luminance variation is also

addressed by including images with lighting source from central, right and left directions.

A couple of images with and without spectacles are also included. Figure 2.1 represents 11

different images from a single subject. Experiments are conducted on the original database

without any preprocessing stages of face cropping and/or normalization. Each 320 × 243

grayscale image is downsampled to an order of 25× 25 to get a 625-d feature vector. The

experiments are conducted using the leave-one-out approach as reported quite regularly

in the literature [28], [29], [30]. A comprehensive comparison of various approaches is

provided in Table 2.1. Note that the error rates have been transformed to recognition

rates for [30]. The SRC approach substantially outperformed all reported techniques

showing an improvement of 5.48% over the best contestant i.e. the Fisherfaces approach.

Note that the SRC approach leads the traditional PCA and ICA approaches by a margin

of 22.58% and 26.66% respectively. The choice of feature-space dimension is elaborated in Figure 2.2 (a), where the dimensionality curve is shown for a randomly selected leave-one-out experiment. Classification accuracy becomes fairly constant in an approximately 600-D feature space.
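The downsampling and column-concatenation step can be sketched as follows. This toy version uses a simple nearest-neighbour decimation on a dummy array, since the exact resampling filter is not specified in the text:

```python
import numpy as np

def downsample_feature(image, size=(25, 25)):
    """Downsample a grayscale image (2-D array) by nearest-neighbour
    index selection and concatenate its columns into one feature vector."""
    h, w = image.shape
    rows = np.linspace(0, h - 1, size[0]).astype(int)
    cols = np.linspace(0, w - 1, size[1]).astype(int)
    small = image[np.ix_(rows, cols)]          # size[0] x size[1] thumbnail
    return small.flatten(order="F")            # column-wise stacking

# a 320 x 243 Yale-sized dummy image becomes a 625-dimensional feature vector
img = np.arange(243 * 320, dtype=float).reshape(243, 320)
v = downsample_feature(img)
print(v.shape)                                  # -> (625,)
```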

Figure 2.1: A typical subject from the Yale database with various poses and variations.



Figure 2.2: Recognition accuracy for the (a) Yale and (b) AT&T databases with respect to feature dimension.

Table 2.1: Results for Yale database using the leave-one-out method.

Evaluation Method   Approach                    Recognition Rate

Leave-one-out       ICA [28]                    71.52%
                    Kernel Eigenfaces [28]      72.73%
                    Edge map [30]               73.94%
                    Eigenfaces [30]             75.60%
                    Correlation [30]            76.10%
                    Linear subspace [30]        78.40%
                    2DPCA [28]                  84.24%
                    Eigenface w/o 1st 3 [30]    84.70%
                    LEM [30]                    85.45%
                    Fisherfaces [30]            92.70%
                    SRC                         98.18%

2.4.2 AT&T Database

The AT&T database is maintained at AT&T Laboratories Cambridge. Ten

different images from one of the 40 subjects from the database are shown in Figure 2.3.

The database incorporates facial gestures such as smiling or non-smiling, open or closed

eyes and alterations like glasses or without glasses. It also characterizes a maximum of

20◦ rotation of the face with some scale variations of about 10%.

The choice of dimensionality for the AT&T database is elaborated in Figure 2.2(b), which

reflects that the recognition rate becomes fairly constant above a 40-dimensional feature

space. Therefore each 112 × 92 grayscale image is downsampled to an order 7 × 6 and is

transformed to a 42-dimensional feature vector by column concatenation.


Figure 2.3: A typical subject from the AT&T database

To provide a comparative value for the SRC approach we follow two evaluation strate-

gies as proposed in the literature [28], [29], [31]. The first evaluation strategy takes the first

five images of each individual as a training set, while the last five are designated as probes.

Another set of experiments were conducted using the “leave-one-out” approach. A detailed

comparison of the results for the two experimental setups is summarized in Table 2.2; all results are as reported in [28]. For the first set of experiments the SRC algorithm achieves a comparable recognition accuracy of 93% in a 42-D feature space; the best results are reported for the 2DPCA approach, which are 3% better than the SRC method. For the second set of experiments the SRC approach attains a high recognition accuracy of 97.5% in a 42-D feature space; it outperforms the ICA approach by approximately 3.7% and is fairly comparable to the Fisherfaces, Eigenfaces, Kernel Eigenfaces and 2DPCA approaches.

Table 2.2: Results for two experiment sets using the AT&T database.

Evaluation Method   Approach                    Recognition Rate

Experiment Set 1    Fisherfaces                 94.50%
                    ICA                         85.00%
                    Kernel Eigenfaces           94.00%
                    2DPCA                       96.00%
                    SRC                         93.00%

Experiment Set 2    Fisherfaces                 98.50%
                    ICA                         93.80%
                    Eigenfaces                  97.50%
                    Kernel Eigenfaces           98.00%
                    2DPCA                       98.30%
                    SRC                         97.50%


2.4.3 AR Database

The AR database consists of more than 4,000 color images of 126 subjects (70 men and 56

women) [32]. The database characterizes divergence from ideal conditions by incorporating

various facial expressions (neutral, smile, anger and scream), luminance alterations (left

light on, right light on and all side lights on) and occlusion modes (sunglass and scarf). Due

to the large number of subjects and the substantial amount of variations, the AR database

is much more challenging compared to the AT&T and Yale databases. It has been used

by researchers as a test-bed to evaluate and benchmark face recognition algorithms. In

this research we address the problem of varying facial expressions, see Figure 2.4. We

evaluate the AR database under two experimental setups as proposed in the literature; for all experiments the 576 × 768 image frames are downsampled to an order 8 × 10, constituting an 80-D feature space.

(a) (b) (c) (d)

Figure 2.4: Gesture variations in the AR database; note the changing position of the head with different poses.

For the first set of experiments we follow the setup as designed in [30]. A subset of AR

database consisting of 112 individuals is randomly selected. The system is trained using

only one image per subject which characterizes neutral expression (Figure 2.4(a)), therefore

we have 112 gallery images. The system is tested on the remaining three expressions

shown in Figure 2.4 (b), (c) and (d) altogether making 336 probe images. Table 2.3 shows

a thorough comparison of the SRC approach and the results reported in [30]. EM and LEM stand for Edge Map and Line Edge Map respectively, while all other approaches are variants of Principal Component Analysis (PCA) [30].

The SRC approach achieves a good overall recognition accuracy of 89.58%, which outperforms the best reported result of 75.67% (the 112-eigenvectors approach) by a margin of 13.91%. For the cases of smile and anger expressions we obtained 93.75%


Table 2.3: Recognition Results for Gesture Variations under Experiment Set 1

Approach                     Smile     Anger     Scream    Overall

20-eigenvectors              87.85%    78.57%    34.82%    67.08%
60-eigenvectors              94.64%    84.82%    41.96%    73.80%
112-eigenvectors             93.97%    87.50%    45.54%    75.67%
112-eigenvectors w/o 1st 3   82.04%    73.21%    32.14%    62.46%
EM                           52.68%    81.25%    20.54%    51.49%
LEM                          78.57%    92.86%    31.25%    67.56%
SRC                          93.75%    91.07%    83.93%    89.58%

and 91.07% respectively, which are quite comparable to the best contenders, i.e. 94.64% (60-eigenvectors) and 92.86% (LEM). For the screaming expression, the SRC approach clearly outperforms all the reported approaches, attaining a recognition accuracy of 83.93%.

In the second set of experiments we compare the proposed approach with two state-of-the-art algorithms: Bayesian Eigenfaces (MIT) [33] and FaceIt (Visionics). The Bayesian Eigenfaces approach was reported to be one of the best in the 1996 FERET test [34], whereas the FaceIt algorithm (based on Local Feature Analysis [35]) is claimed to be one of the most successful commercial face recognition systems [36]. A new subset of the AR

database is generated by randomly selecting 116 individuals. The system is trained using

the neutral expression of the first session (Figure 2.4 (a)) and therefore we have 116 gallery

images. The system is validated for all other expressions of the same session (Figures 2.4

(b), (c) and (d)) making altogether 348 probe images. A comprehensive comparison of

the SRC approach with these two state-of-the-art algorithms is presented in Table 2.4; all the results are as reported in [36]. For mild variations due to smile and anger expressions the SRC approach yields quite competitive recognition accuracies of 92.24% and 91.38% in

comparison to FaceIt and MIT approaches. For the severe case of screaming expression the

SRC leads the FaceIt and MIT approaches by a margin of 5.62% and 42.62% respectively.


Table 2.4: Recognition Results for Gesture Variations under Experiment Set 2

Approach    Smile     Anger     Scream    Overall

FaceIt      96.00%    93.00%    78.00%    89.00%
MIT         94.00%    72.00%    41.00%    60.00%
SRC         92.24%    91.38%    83.62%    89.08%

2.5 Sparse Representation Classification for Video-based Face

Recognition

In this section we evaluate the SRC algorithm for the problem of video-based face recognition. For the purpose of comparative analysis, experiments were also conducted using the Scale Invariant Feature Transform (SIFT) [37]. We now describe the basic archi-

tecture of SIFT-based classification followed by extensive experiments on the VidTIMIT

database [38], [39].

2.5.1 Scale Invariant Feature Transform (SIFT) for Face Recognition

The Scale Invariant Feature Transform (SIFT) was proposed in 1999 for the extraction of

unique features from images [40]. The idea, initially proposed for a more generic object

recognition task, was later successfully applied for the problem of face recognition [37].

Interesting characteristics of scale/rotation invariance and locality in both the spatial and frequency domains have made the SIFT-based approach a standard technique in the paradigm of view-based face recognition. The first step in the derivation of the SIFT features is the identification of potential pixels of interest, called "keypoints", in the face image. An efficient way of achieving this is to make use of the scale-space extrema of the Difference-of-Gaussian (DoG) function convolved with the face image [40]. These potential keypoints are further refined by rejecting candidates with low contrast or poor localization along edges, using the ratio of principal curvatures criterion. Orientation(s) are then assigned to each keypoint based on local image gradient direction(s). A gradient orientation histogram is formed using the neighboring pixels of each keypoint. Contributions from neighbors are weighted by their magnitudes and by a circular Gaussian window. Peaks in the histogram


represent the dominant directions and are used to align the histogram for rotation invariance. A 4 × 4 array of neighborhood histograms, each with eight orientation bins, results in 128-dimensional SIFT features. For illumination robustness, the vectors are normalized to unit length, thresholded to a ceiling of 0.2 and finally renormalized. Figure 2.5

shows a typical face from the VidTIMIT database [38, 39] with extracted SIFT features.

Figure 2.5: A typical localized face from the VidTIMIT database with extracted SIFTs.

During validation a SIFT feature vector from the query video fq is matched with the

feature vector from the gallery:

e = arccos( f_q f_g^T )    (2.10)

where fg corresponds to a SIFT vector from a training video sequence. All SIFT

vectors from the query frame are matched with all SIFT features from a training frame

using Equation 2.10. Pairs of features with the minimum error e are considered as matches.

Note that if more than one SIFT vector from a given query frame happens to be the best

match with the same SIFT vector from gallery (i.e. many-to-one match scenario), the one

with the minimum error e is chosen. Other false matches were reduced by matching the

SIFT vectors from only nearby regions of the two images.
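The matching rule above can be sketched compactly. This is an illustrative implementation under stated assumptions: descriptors are already normalized to unit length (so the dot product is a cosine), the nearby-region constraint is omitted, and `match_sift` is a hypothetical helper name.

```python
import numpy as np

def match_sift(query_feats, gallery_feats):
    """Match unit-norm SIFT descriptors with the angular error of
    Equation 2.10, e = arccos(f_q f_g^T), resolving many-to-one
    matches by keeping only the minimum-error pair."""
    dots = np.clip(query_feats @ gallery_feats.T, -1.0, 1.0)
    errors = np.arccos(dots)                      # query x gallery angle matrix
    best = {}                                     # gallery idx -> (query idx, e)
    for qi in range(errors.shape[0]):
        gi = int(np.argmin(errors[qi]))           # best gallery match for qi
        e = errors[qi, gi]
        if gi not in best or e < best[gi][1]:     # keep the minimum-error pair
            best[gi] = (qi, e)
    return [(qi, gi, e) for gi, (qi, e) in best.items()]
```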

In principle, for different image pairs we have different number of matches. This

information is further harnessed to be used as an additional similarity measure between

the two faces. The final similarity score between two frames is computed by normalizing

the average error e between their matching pairs of SIFT features and the total number


of matches z on a scale [0,1] and then using a weighted sum rule.

e′ = (e − min(e)) / max(e − min(e))    (2.11)

z′ = (z − min(z)) / max(z − min(z))    (2.12)

s = (1/2) (β_e e′ + β_z (1 − z′))    (2.13)

where βe and βz are the weights of normalized average error e′ and normalized number

of matches z′ respectively. It has to be noted that e′ is a distance (dis-similarity) measure

while z′ is a similarity score, therefore in Equation 2.13 z′ is subtracted from 1 for a

homogeneous fusion. Consequently s becomes a distance measure.
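Equations 2.11–2.13 amount to min–max normalization followed by a weighted sum; a minimal sketch (it assumes the score arrays are not constant, so the denominators are nonzero, and `fuse_scores` is a hypothetical helper name):

```python
import numpy as np

def fuse_scores(e, z, beta_e=1.0, beta_z=1.0):
    """Min-max normalise the average matching error e and the match count z
    (Equations 2.11-2.12) and fuse them with a weighted sum (Equation 2.13).
    The result s is a distance: smaller means more similar frames."""
    e = np.asarray(e, dtype=float)
    z = np.asarray(z, dtype=float)
    e_n = (e - e.min()) / (e - e.min()).max()
    z_n = (z - z.min()) / (z - z.min()).max()
    # z_n is a similarity, so 1 - z_n turns it into a distance before fusing
    return 0.5 * (beta_e * e_n + beta_z * (1.0 - z_n))

s = fuse_scores([0.1, 0.5, 0.9], [30, 20, 10])
```

With equal weights, the frame with the smallest error and the largest number of matches receives the smallest fused distance s.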

2.5.2 Experimental Results and Discussion

The problem of temporal face recognition using the SRC and SIFT feature based face

recognition algorithms was evaluated on the VidTIMIT database [38], [39]. VidTIMIT is

a multimodal database consisting of video sequences and corresponding audio files from

43 distinct subjects. The video section of the database characterizes 10 different video

files from each subject. Each video file is a sequence of 512 × 384 JPEG images. Two

video sequences were used for training while the remaining eight were used for validation.

Due to the high correlation between consecutive frames, training and testing were carried

out on alternate frames. Off-line batch learning mode [41] was used for these experiments

and therefore probe frames did not add any information to the system.

Face localization is the first step in any face recognition system. Fully automatic face

localization was carried out using a Haar-like feature based face detection algorithm [42]

during off-line training and on-line recognition sessions. For the SIFT based face recogni-

tion, each detected face in a video frame was scale-normalized to 150× 150 and histogram


Figure 2.6: A sample video sequence from the VidTIMIT database.

equalized before the extraction of the SIFT features. We achieved an identification rate of

93.83%. Verification experiments were also conducted for a more comprehensive compar-

ison between the two approaches. An Equal Error Rate (EER) of 1.8% was achieved for

the SIFT based verification. Verification rate at 0.01 False Accept Rate (FAR) was found

to be 97.32%.
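For reference, the EER is the operating point at which the false-accept rate equals the false-reject rate. A rough sketch of how it can be estimated from distance scores follows; this is not the evaluation code used in the thesis, and `equal_error_rate` is a hypothetical helper name.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep a threshold over distance scores (lower = more similar) and
    return the error rate at the point where FAR and FRR cross."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    best_gap, eer = np.inf, None
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor <= t)   # impostors wrongly accepted
        frr = np.mean(genuine > t)     # genuine users wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), 0.5 * (far + frr)
    return eer
```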

For the SRC classifier, each detected face in a frame is downsampled to order 10 ×

10. Column concatenation is carried out to generate a 100-dimensional feature vector as

discussed in Section 2.3. Off-line batch learning is carried out on alternate frames using

two video sequences as discussed above. Unorthodox downsampled images in combination

with the SRC classifier yielded a quite comparable recognition accuracy of 94.45%. The EER

dropped to 1.3% with a verification accuracy of 98.23% at 0.01 FAR. The rank profile and

ROC (Receiver Operating Characteristics) curves are shown in Figure 2.7 (a) and 2.7 (b)

respectively.

We further investigated the complementary nature of the two classifiers by fusing them

at the score level. The weighted sum rule is used which is perhaps the major work-horse

in the field of combining classifiers [43]. Both classifiers were equally weighted and a high

recognition accuracy of 97.73% was achieved which outperforms the SIFT based classi-

fier and the SRC classifier by a margin of 3.90% and 3.28% respectively. Verification

experiments also produced superior results with an EER of 0.3% which is better than

the SIFT and the SRC based classification by 1.5% and 1.0% respectively. An excellent



Figure 2.7: (a) Rank profiles and (b) ROC curves for the SIFT, SRC and the combination of the two classifiers.


Table 2.5: Summary of results

Evaluation Attributes SIFT SRC Fusion

Recognition Accuracy 93.83% 94.45% 97.73%

Equal Error Rate 1.80% 1.30% 0.30%

Verification rate at 0.01 FAR 97.32% 98.23% 99.90%

verification of 99.90% at an FAR of 0.01 is reported. Fusion of the two classifiers substan-

tially improved the rank profile as well achieving 100% results at rank-5 only. A detailed

comparison of the results is provided in Table 2.5.

Presented results certainly reflect a comparable performance index for the SRC classi-

fier as compared to state-of-the-art SIFT based recognition. Extensive experiments based

on identification, verification and rank-recognition evaluations consistently reflect better

results for the SRC approach. Moreover the complementary information exhibited by the

SRC method increased the verification success of the combined system to 99.9% for the

standard 0.01 FAR criterion. Figure 2.8 shows variation in the recognition accuracy with

the change in the normalized weight of the SRC classifier at the fusion stage. The highest recognition is achieved approximately when both classifiers are equally weighted, i.e. when no prior information about the participating experts is incorporated in the fusion.

Apart from these appreciable results it was found that the l1-norm minimization using

a large dictionary matrix made the iterative convergence lengthy and slow. To provide a

comparative value we performed computational analysis for a randomly selected identifi-

cation trial. The time required by the SRC algorithm for classifying a single frame on a

typical 2.66 GHz machine with 2 GB memory was found to be 297.46 seconds (approx-

imately 5 minutes). This duration is approximately 5 times greater than the processing

time of the SIFT algorithm for the same frame which was found to be 58.18 seconds (ap-

proximately 1 minute). Typically a video sequence consists of hundreds of frames which

would suggest a rather prolonged span for the evaluation of the whole video sequence.

Noteworthy is the fact that experiments were conducted using an offline learning mode

[41]. The probe frames did not contribute to the dictionary information. Critically speak-

ing, the spatiotemporal information in video sequences is best harnessed using smart online



Figure 2.8: Variation in performance with respect to bias in fusion.

[44] and hybrid [45] learning modes. These interactive learning algorithms add useful in-

formation along the temporal dimension and therefore enhance the overall performance.

However, in the context of SRC classification, this would suggest an even larger dictionary

matrix and consequently a lengthier evaluation.

2.6 Sparse Representation Classification for Ear Biometric

2.6.1 Experiments and Discussion

We conducted extensive experiments to validate the SRC algorithm using subsets of the UND

database [46, 13] and the FEUD [47] database. The subset of the UND database consists

of 32 subjects with six profile images each. Other subjects of the UND database have

fewer images and are therefore inadequate for our ear recognition system. Six different

images of a typical subject from the UND database are shown in Figure 2.9. The subjects

were photographed under varying lighting conditions and with the head rotations of -90

and -75 degrees, observed from the top in a clockwise direction. From each image the ear

portion is manually cropped and is shown in Figure 2.10. At the feature extraction stage,

the ear intensity image is downsampled to an order of 30 × 30 as shown in Figure 2.10.


Figure 2.9: A typical subject from the UND database illustrating different pose and illumination variations.

The columns of the downsampled ear image are concatenated to form a 900-D feature vector. The features extracted from all the training images are used to develop the gallery, i.e. the dictionary matrix A.

(a) (b)

Figure 2.10: A typical cropped ear (a) and its compressed form in the feature space (b).

We evaluated the proposed system under two evaluation protocols, Tr3V3 and Tr4V2.

Tr3V3 corresponds to three training and three testing images per subject while Tr4V2

corresponds to four training and two testing images per person. The proposed algorithm

gave a recognition rate of 91.67% and 96.88% for Tr3V3 and Tr4V2 respectively. The rank

profile of the system is shown in Figure 2.11(a). The ROC curves are shown in Figure

2.11(b) with an Equal Error Rate (EER) of approximately 0.05 and 0.03 for Tr3V3 and

Tr4V2 respectively.

The FEUD ear database consists of 56 subjects with five images each, Figure 2.13

depicts a typical subject of the FEUD database. The ear portions are manually cropped

and a 625-D feature space is formed by downsampling the images to an order 25× 25. We

define two evaluation protocols as Tr3V2 and Tr4V1. Tr3V2 corresponds to three training

and two testing images while Tr4V1 corresponds to four training and one testing image



Figure 2.11: (a) Rank profile for the UND database. (b) ROC curves for the UND database.



Figure 2.12: (a) Rank profile for the FEUD database. (b) ROC curves for the FEUD database.


per person. We obtained high recognition rates of 95.54% and 98.21% for Tr3V2 and

Tr4V1 respectively with the rank profile of the validation shown in Figure 2.12(a). The

ROC evaluation of the system gives an EER of 0.02 and 0.01 (approximately) for Tr3V2

and Tr4V1 respectively as shown in Figure 2.12(b).

Figure 2.13: A typical subject from the FEUD database

2.7 Conclusion

Sparse representation classification has recently emerged as the latest paradigm in the

research of appearance-based face recognition. A comprehensive evaluation of the SRC

algorithm shows performance comparable to traditional, state-of-the-art approaches. It has

also been found robust for the problem of varying facial expressions. For the video-based

face recognition, an identification rate of 94.45% is achieved on the VidTIMIT database

which is quite comparable to 93.83% accuracy using state-of-the-art SIFT features based

algorithm. Verification experiments were also conducted and the SRC approach exhibited

an EER of 1.30% which is 0.5% better than the SIFT method. The SRC classifier was

found to nicely complement the SIFT based method; the fusion of the two methods using

the weighted sum rule consistently produced superior results for identification, verification

and rank-recognition experiments. However since SRC requires an iterative convergence

using an l1-norm minimization, the approach was found computationally expensive as com-

pared to the SIFT based recognition. Typically SRC required 5 minutes (approximately)

for processing a single recognition trial which is 5 times greater than the time required

by the SIFT based approach. To the best of our knowledge, this is the first evaluation of

the SRC algorithm on a video database. From the experiments presented in the chapter,

it is quite safe to maintain that additional work is required before the SRC approach is

declared as a standard approach for video-based applications. Computational expense

is arguably an inherent issue with video processing giving rise to the emerging area of


“Video Abstraction”. Efficient algorithms have been proposed to cluster video sequences

along the temporal dimension (for example [48] including others). These clusters are then

portrayed by cluster-representative frame(s)/features resulting in a substantial decrease of

complexity. Given the good performance of the SRC algorithm presented in this research,

the evaluation of the method using state-of-the-art video abstraction methods will be the

subject of our future research. For the problem of user authentication using ear biometric,

the proposed system is evaluated using standard ear databases yielding high recognition

accuracies for various evaluation protocols. In particular experiments were conducted on

the UND [46, 13] and the FEUD [47] databases with session variability and incorporating

different head rotations and lighting conditions. The proposed system does not assume any prior normalization of the ear region and is found to be robust to varying lighting conditions and different head rotations, yielding a high recognition rate of the order of 98%. The

interesting outcomes of the research such as a high recognition accuracy with a low train-

ing overhead, robustness to practical constraints and independence from normalization

overheads are certainly encouraging enough to extend the compressive sensing approach

to other biometrics.


2.8 Publications

1. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, Sparse Representa-

tion for View-Based Face Recognition, Accepted as a book chapter in book Advances

in Face Image Analysis: Techniques and Technologies ED. Y.-J. Zhang, IGI Global

Publishing.

2. Imran Naseem, Roberto Togneri and Mohammed Bennamoun,“Sparse Represen-

tation for Video-Based Face Recognition”, book chapter in Advances in Biometrics

(Lecture Notes in Computer Science, LNCS series), Springer Berlin / Heidelberg.

Volume 5558/2009, pages 219-228. 0302-9743 (Print) 1611-3349 (Online), ISBN

978-3-642-01792-6.

3. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Represen-

tation for Ear Biometrics”, book chapter in Advances in Visual Computing (Lecture

Notes in Computer Science, LNCS series), Springer Berlin / Heidelberg. Volume

5359/2008, pages 336-345. ISSN 0302-9743 (Print) 1611-3349 (Online), ISBN 978-

3-540-89645-6.


Chapter 3

Linear Regression for Face

Identification 1

3.1 Introduction

Face recognition systems are known to be critically dependent on manifold learning meth-

ods. A gray-scale face image of an order a × b can be represented as an ab-dimensional

vector in the original image space. However any attempt of recognition in such a high

dimensional space is vulnerable to a variety of issues often referred to as curse of di-

mensionality. Therefore, at the feature extraction stage, images are transformed to low

dimensional vectors in the face space. The main objective is to find such a basis func-

tion for this transformation, which could distinguishably represent faces in the face space.

A number of approaches have been reported in the literature, such as Principal Component Analysis (PCA) [16], [13], Linear Discriminant Analysis (LDA) [17] and Independent

Component Analysis (ICA) [18], [19]. Primarily these approaches are classified in two cat-

egories i.e. reconstructive and discriminative methods. Reconstructive approaches (such

as PCA and ICA) are reported to be robust for the problem of contaminated pixels [49],

whereas discriminative approaches (such as LDA) are known to yield better results in

1Parts from the chapter have been accepted/published in IEEE Transactions on Pattern Analysis andMachine Intelligence (TPAMI) and IEEE International Conference in Image Processing (ICIP’09). Generalproblem statements and literature review in the Introduction section of the chapter are included for thesake of completeness and to make the chapter self-contained. Since the thesis is presented as a compilationof independent publications, the repetition of the general statements between the chapters is thereforeinevitable.


clean conditions [20]. Apart from these traditional approaches, it has been shown recently

that unorthodox features such as downsampled images and random projections can serve

equally well. In fact the choice of the feature space may no longer be so critical [6]. What

really matters is the dimensionality of feature space and the design of the classifier.

In this chapter we propose a fairly simple but efficient linear regression based classifi-

cation (LRC) for the problem of face identification. Samples from a specific object class

are known to lie on a linear subspace [17], [50]. We use this concept to develop class spe-

cific models of the registered users simply using the downsampled gallery images, thereby

defining the task of face recognition as a problem of linear regression. Least squares es-

timation is used to estimate the vectors of parameters for a given probe against all class

models. Finally the decision is ruled in favor of the class with the most precise estimation.

The proposed classifier can be categorized as a Nearest Subspace (NS) approach.

An important relevant work is presented in [6] where downsampled images from all

classes are used to develop a dictionary matrix during the training session. Each probe

image is represented as a linear combination of all gallery images thereby resulting in an

ill-conditioned inverse problem. With the latest research in compressive sensing and sparse

representation, sparsity of the vector of coefficients is harnessed to solve the ill-conditioned

problem using the l1-norm minimization. In [51] where the concept of Locally Linear Re-

gression (LLR) is introduced specifically to tackle the problem of pose. Main thrust of the

research is to indicate an approximate linear mapping between a nonfrontal face image and

its frontal counterpart, the estimation of linear mapping is further formulated as a predic-

tion problem with a regression-based solution. For the case of severe pose variations, the

nonfrontal image is sampled to obtain many overlapped local segments. Linear regression

is applied to each small patch to predict the corresponding virtual frontal patch, the LLR

approach has shown some good results in presence of coarse alignment. In [52] a two-step

approach has been adopted fusing the concept of wavelet decomposition and discriminant

analysis to design a sophisticated feature extraction stage. These discriminant features

are used to develop feature planes (for Nearest Feature Plane - NFP classifier) and feature

spaces (for Nearest Feature Space - NFS classifier). The query image is projected onto

the subspaces and decision is ruled in favor of the subspace with the minimum distance.


However, the proposed LRC approach, for the first time, uses simply the downsampled

images in combination with the linear regression classification to achieve superior results

compared to the benchmark techniques.

Further, for the problem of severe contiguous occlusion, a modular representation of images is expected to help [53]. Based on this concept we propose an efficient

Modular LRC Approach. The proposed approach segments a given occluded image and

reaches individual decisions for each block. These intermediate decisions are combined

using a novel Distance based Evidence Fusion (DEF) algorithm to reach the final decision.

The proposed DEF algorithm uses the distance metrics of the intermediate decisions to

decide about the “goodness” of a partition. There are two major advantages of using the

DEF approach. Firstly, the non-face partitions are rejected dynamically and therefore do not take part in the final decision making. Secondly, the overall recognition performance is better than the best individual result of the combined partitions due to efficient decision

fusion of the face segments.
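One plausible reading of this scheme can be sketched as follows. This is illustrative only: the per-block subspace distances are taken as given, and the specific rejection rule (keeping the blocks with the smallest minimum distances and letting them vote) is an assumption, not the DEF algorithm itself.

```python
import numpy as np

def def_fusion(block_distances, keep_ratio=0.5):
    """Distance-based evidence fusion sketch: each row holds one block's
    distances to the N class subspaces.  Blocks whose best (smallest)
    distance is large are treated as unreliable (e.g. occluded) and
    rejected; the surviving blocks vote via their minimum-distance class."""
    d = np.asarray(block_distances, dtype=float)
    confidence = d.min(axis=1)                  # smaller = more face-like
    n_keep = max(1, int(len(d) * keep_ratio))
    kept = np.argsort(confidence)[:n_keep]      # most confident partitions
    votes = d[kept].argmin(axis=1)              # per-block class decisions
    return int(np.bincount(votes).argmax())     # majority class label
```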

The rest of the chapter is organized as follows: In Section 3.2 the proposed LRC and

Modular LRC algorithms are described. This is followed by extensive experiments using

standard databases under a variety of evaluation protocols in Section 3.3. The chapter

concludes in Section 3.4 followed by a list of publications arising from this chapter in

Section 3.5.

3.2 Linear Regression for Face Recognition

3.2.1 Linear Regression Classification (LRC) Algorithm

Let there be N distinguished classes with p_i training images from the ith class, i = 1, 2, ..., N. Each grayscale training image is of an order a × b and is represented as u_i^(m) ∈ R^{a×b}, i = 1, 2, ..., N and m = 1, 2, ..., p_i. Each gallery image is downsampled to an order c × d and transformed to a vector through column concatenation such that u_i^(m) ∈ R^{a×b} → w_i^(m) ∈ R^{q×1}, where q = cd and cd << ab. Each image vector is normalized so that the maximum pixel value is 1. Using the concept that patterns from the same class lie on a linear subspace [50], we develop a class-specific model X_i by stacking the q-dimensional image vectors,

X_i = [ w_i^(1)  w_i^(2)  ...  w_i^(p_i) ] ∈ R^{q×p_i},   i = 1, 2, ..., N    (3.1)

Each vector w_i^(m), m = 1, 2, . . . , p_i, spans a subspace of R^q, also called the column space of X_i. Therefore, at the training level each class i is represented by a vector subspace, X_i,

which is also called the regressor or predictor for class i. Let z be an unlabeled test image

and our problem is to classify z as one of the classes i = 1, 2, . . . , N. We transform and normalize the grayscale image z to an image vector y ∈ R^(q×1), as discussed for the gallery. If y belongs to the ith class, it should be represented as a linear combination of the training images from the same class (lying in the same subspace), i.e.

y = X_i β_i,   i = 1, 2, . . . , N    (3.2)

where β_i ∈ R^(p_i×1) is the vector of parameters. Given that q ≥ p_i, the system of equations in Equation 3.2 is well conditioned and β_i can be estimated using least-squares estimation [54], [55], [56]:

β_i = (X_i^T X_i)^(−1) X_i^T y    (3.3)

The estimated vector of parameters, β_i, along with the predictors X_i, is used to predict the response vector for each class i:

ŷ_i = X_i β_i,   i = 1, 2, . . . , N    (3.4)

ŷ_i = X_i (X_i^T X_i)^(−1) X_i^T y

ŷ_i = H_i y

where the predicted vector ŷ_i ∈ R^(q×1) is the projection of y onto the ith subspace. In other words, ŷ_i is the closest vector, in the ith subspace, to the observation vector y in the Euclidean sense [57]. H_i is called a hat matrix since it maps y into ŷ_i. We now calculate the distance measure between the predicted response vector ŷ_i, i = 1, 2, . . . , N, and the



Algorithm: Linear Regression Classification (LRC)

Inputs: Class models X_i ∈ R^(q×p_i), i = 1, 2, . . . , N, and a test image vector y ∈ R^(q×1).
Output: Class of y

1. β_i ∈ R^(p_i×1) is evaluated against each class model: β_i = (X_i^T X_i)^(−1) X_i^T y, i = 1, 2, . . . , N
2. ŷ_i is computed for each β_i: ŷ_i = X_i β_i, i = 1, 2, . . . , N
3. The distance between the original and predicted response variables is calculated: d_i(y) = ‖y − ŷ_i‖_2
4. The decision is made in favor of the class with the minimum distance d_i(y)

original response vector y,

d_i(y) = ‖y − ŷ_i‖_2,   i = 1, 2, . . . , N    (3.5)

and rule in favor of the class with the minimum distance, i.e.

min_i d_i(y),   i = 1, 2, . . . , N    (3.6)
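The LRC steps above (Equations 3.1 and 3.3–3.6) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the thesis implementation: the function names and data layout are our own, and `np.linalg.lstsq` is used in place of the explicit normal-equations inverse of Equation 3.3, which is numerically safer when X_i^T X_i is close to singular.

```python
import numpy as np

def train_lrc(class_images):
    """Build one regressor matrix X_i per class (Equation 3.1).

    class_images: list with one entry per class; each entry is a list of
    p_i grayscale images (2-D arrays). Each image is vectorized by column
    concatenation and normalized to a maximum pixel value of 1.
    """
    models = []
    for images in class_images:
        cols = [im.flatten(order="F") / im.max() for im in images]
        models.append(np.column_stack(cols))  # X_i in R^(q x p_i)
    return models

def classify_lrc(models, y):
    """Classify a test vector y: project it onto each class subspace by
    least squares and return the class with the smallest residual."""
    distances = []
    for X in models:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # Equation 3.3
        y_hat = X @ beta                              # Equation 3.4
        distances.append(np.linalg.norm(y - y_hat))   # Equation 3.5
    return int(np.argmin(distances))                  # Equation 3.6
```

A probe image is downsampled, vectorized and normalized exactly as the gallery images before being passed to `classify_lrc`.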

3.2.2 Modular Approach for the LRC Algorithm

The problem of identifying partially occluded faces could be efficiently dealt with using

the modular representation approach [53]. Contiguous occlusion can safely be assumed

local in nature, in the sense that it corrupts only a portion of conterminous pixels of the image, the amount of contamination being unknown. In the modular approach we utilize

the neighborhood property of the contaminated pixels by dividing the face image into a

number of sub-images. Each sub-image is now processed individually and a final decision

is made by fusing information from all the sub-images. A commonly reported technique

for decision fusion is majority voting [53]. However, a major pitfall of majority voting is that it treats noisy and clean partitions equally. For instance, if three out of four

partitions of an image are corrupted, majority voting is likely to be erroneous no matter

how significant the clean partition may be in the context of facial features. The task

becomes even more complicated by the fact that the distribution of occlusion over a face

image is never known a priori and therefore, along with face and non-face sub-images, we

are likely to have face portions corrupted with occlusion. Some sophisticated approaches

have been developed to filter out the potentially contaminated image pixels (for example

[58]). In this section we make use of the specific nature of distance classification to develop



a fairly simple but efficient fusion strategy which implicitly deemphasizes corrupted sub-images, significantly improving the overall classification accuracy. We propose to use the distance metric as evidence of our belief in the “goodness” of the intermediate decisions taken on the sub-images; the approach is called “Distance based Evidence Fusion” (DEF).

To formulate the concept, let us suppose that each training image is segmented into M partitions and each partitioned image is designated v_n, n = 1, 2, . . . , M. The nth partition of all p_i training images from the ith class is subsampled and transformed to vectors, as discussed in Section 3.2.1, to develop a class-specific and partition-specific subspace U_i^(n):

U_i^(n) = [ w_i^(1)(n)   w_i^(2)(n)   · · ·   w_i^(p_i)(n) ],   i = 1, 2, . . . , N    (3.7)

Each class is now represented by M subspaces, and altogether we have M × N subspace models. A given probe image is now partitioned into M segments accordingly. Each partition is transformed to an image vector y^(n), n = 1, 2, . . . , M. Given that i is the true class for the given probe image, y^(n) is expected to lie on the nth subspace of the ith class, U_i^(n), and should satisfy:

y^(n) = U_i^(n) β_i^(n)    (3.8)

The vector of parameters and the response vectors are estimated as discussed in Section

3.2.1

β_i^(n) = [ (U_i^(n))^T U_i^(n) ]^(−1) (U_i^(n))^T y^(n)    (3.9)

ŷ_i^(n) = U_i^(n) β_i^(n),   i = 1, 2, . . . , N    (3.10)

The distance measure between the estimated and the original response vector is computed:

d_i(y^(n)) = ‖y^(n) − ŷ_i^(n)‖_2,   i = 1, 2, . . . , N    (3.11)



Now for the nth partition an intermediate decision, called j^(n), is reached with a corresponding minimum distance calculated as:

d_{j^(n)} = min_i d_i(y^(n)),   i = 1, 2, . . . , N    (3.12)

Therefore, we now have M decisions j^(n) with M corresponding distances d_{j^(n)}, and we decide in favor of the class with the minimum distance:

Decision = arg min_j d_{j^(n)},   n = 1, 2, . . . , M    (3.13)
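The modular DEF rule (Equations 3.7–3.13) then amounts to running LRC independently on each partition and keeping the intermediate decision whose residual is smallest. The sketch below is again our own illustration, under an assumed data layout in which `partition_models[n][i]` holds the matrix U_i^(n), and is not the thesis code.

```python
import numpy as np

def classify_def(partition_models, probe_partitions):
    """Modular LRC with Distance-based Evidence Fusion.

    partition_models: list of M lists; partition_models[n][i] is the
    regressor matrix U_i^(n) for class i built from the nth sub-image.
    probe_partitions: list of the M probe vectors y^(n).
    """
    best_class, best_dist = None, np.inf
    for models_n, y_n in zip(partition_models, probe_partitions):
        # intermediate LRC decision j^(n) for this partition (Eq. 3.12)
        dists = []
        for U in models_n:
            beta, *_ = np.linalg.lstsq(U, y_n, rcond=None)
            dists.append(np.linalg.norm(y_n - U @ beta))  # Eq. 3.11
        j_n = int(np.argmin(dists))
        # fuse: keep the decision carrying the smallest residual (Eq. 3.13)
        if dists[j_n] < best_dist:
            best_class, best_dist = j_n, dists[j_n]
    return best_class
```

Because the fused decision is keyed to the smallest residual, a heavily occluded partition, whose residuals are large for every class, is implicitly ignored; this is exactly the behavior that plain majority voting lacks.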

3.3 Experimental Results

Extensive experiments were carried out to illustrate the efficacy of the proposed approach. Several standard databases, including the AT&T [59], Yale [27] and AR [32] databases, have been addressed. These databases incorporate several deviations from

the ideal conditions including pose, illumination, occlusion and gesture alterations. Sev-

eral standard evaluation protocols, reported in the face recognition literature, have been

adopted and a comprehensive comparison of the proposed approach with the state of art

techniques has been presented.

3.3.1 AT&T Database

We attack the problem of face recognition by first addressing the AT&T database [59].

The AT&T database is maintained at AT&T Laboratories Cambridge. Ten

different images from one of the 40 subjects from the database are shown in Figure 3.1.

The database incorporates facial gestures such as smiling or non-smiling, open or closed

eyes and alterations like glasses or without glasses. It also characterizes a maximum of 20◦

rotation of the face with some scale variations of about 10%. Half of the database is used

as gallery while the other half is used for validation. For the purpose of elaboration we

compared the residuals (y − ŷ) generated using the true subspace model with those using

a false subspace model under the framework of the proposed LRC approach. Figure 3.2



(a) shows a test image of subject 1. The residuals using a false subspace model (X_10), shown in Figure 3.2(b), are substantially greater than the residuals using the true subspace model (X_1), shown in Figure 3.2(c); note that the residuals are shown for a 25-dimensional feature space. Large residuals using X_10 reflect an imprecise prediction of the response vector, whereas small residuals using X_1 testify to a precise estimation and lead to a correct classification.

Figure 3.1: A typical subject from the AT&T database

The choice of dimensionality for the AT&T database is elaborated in Figure 3.3(a), which

reflects that the recognition rate becomes fairly constant above a 40-dimensional feature

space. Therefore each 112 × 92 grayscale image is downsampled to an order 7 × 6 and is

transformed to a 42-dimensional feature vector by column concatenation. To make the re-

sults independent of a particular choice of a training data set we made 20 different random

selections of the gallery and probe images and found an average recognition accuracy of

96.8%, the worst and the best results for these random selections being 93.5% and 99.5%

respectively. The results for these random experiments are summarized in Figure 3.3(b).

To provide a comparative value for our approach we follow two evaluation protocols as

proposed in the literature [28], [29], [31]. Evaluation Protocol 1 (EP1) takes the first five

images of each individual as a training set, while the last five are designated as probes. For

Evaluation Protocol 2 (EP2) the “leave-one-out” strategy is adopted. A detailed comparison of the results for the two evaluation protocols is summarized in Table 3.1; all results are as reported in [28]. For EP1 the LRC algorithm achieves a comparable recognition accuracy of 93.5% in a 50-D feature space; the best results are reported for the 2DPCA approach, which are 2.5% better than the LRC method. For EP2 the LRC approach attains a high recognition success of 98.75% in a 50-D feature space; it outperforms the



[Panels (b) and (c) plot residual value against coefficient index for the false and the true class subspace, respectively; panel (a) is the test image.]

Figure 3.2: (a) Test image from subject 1. (b) Residuals using a randomly selected false subspace. (c) Residuals using subspace 1.




Figure 3.3: (a) Recognition accuracy for the AT&T database with respect to feature dimension using the LRC algorithm. (b) Cross-validation with 20 random selections of gallery and probe images.



ICA approach by 5% (approximately) and is fairly comparable to Fisherfaces, Eigenfaces,

Kernel Eigenfaces and 2DPCA approaches.

Table 3.1: Results for EP1 and EP2 using the AT&T database.

Evaluation Protocol   Approach            Recognition Rate
EP1                   Fisherfaces         94.50%
                      ICA                 85.00%
                      Kernel Eigenfaces   94.00%
                      2DPCA               96.00%
                      LRC                 93.50%
EP2                   Fisherfaces         98.50%
                      ICA                 93.80%
                      Eigenfaces          97.50%
                      Kernel Eigenfaces   98.00%
                      2DPCA               98.30%
                      LRC                 98.75%

3.3.2 Yale Database

The Yale database, maintained at Yale university, consists of 165 grayscale images from

15 individuals [27]. Images from each subject reflect gesture variations incorporating

normal, happy, sad, sleepy, surprised, and wink expressions. Luminance variation is also

addressed by including images with lighting source from central, right and left directions.

A couple of images with and without spectacles are also included. Figure 3.4 represents 11

different images from a single subject. Experiments are conducted on the original database

without any preprocessing stages of face cropping and/or normalization. Each 320 × 243

grayscale image is downsampled to an order of 25 × 25 to get a 625-D feature vector. The

experiments are conducted using the leave-one-out approach as reported quite regularly

in the literature [28], [29], [30]. A comprehensive comparison of various approaches is

provided in Table 3.2. Note that the error rates have been transformed to recognition rates

for [30]. Apart from the Fisherfaces method, the LRC approach substantially outperforms all reported techniques, showing an improvement of 7.31% over the best contestant, i.e. the LEM (Line Edge Map) approach. Note that the proposed approach leads the traditional

PCA and ICA approaches by a margin of 17.16% and 21.24% respectively. The choice

of feature space dimension is elaborated in Figure 3.5; the dimensionality curve is shown



for a randomly selected leave-one-out experiment. Classification accuracy becomes fairly

constant in an approximately 600-D feature space.

Figure 3.4: A typical subject from the Yale database with various poses and variations.

Table 3.2: Results for the Yale database using the leave-one-out method.

Approach                   Recognition Rate
ICA [28]                   71.52%
Kernel Eigenfaces [28]     72.73%
Edge map [30]              73.94%
Eigenfaces [30]            75.60%
Correlation [30]           76.10%
Linear subspace [30]       78.40%
2DPCA [28]                 84.24%
Eigenface w/o 1st 3 [30]   84.70%
LEM [30]                   85.45%
Fisherfaces [30]           92.70%
LRC                        92.76%

3.3.3 Georgia Tech (GT) Database

The Georgia Tech (GT) database consists of 50 subjects with 15 images per subject [60].

It characterizes several variations such as pose, expression, cluttered background and

illumination (see Figure 3.6).

Images were downsampled to an order of 15 × 15 to constitute a 225-D feature space.

The first 8 images of each subject were used for training while the remaining 7 served as probes




Figure 3.5: Yale database: Recognition accuracy with respect to feature dimension for a randomly selected experiment.



Figure 3.6: Samples of a typical subject from the GT database.



[61]; all experiments were conducted on the original database without any cropping or normalization.

Table 3.3 shows a detailed comparison of the LRC with a variety of approaches; all results are as reported in [61], with recognition error rates converted to recognition success rates. Since the results in [61] are shown for a large range of feature dimensions, for the sake of fair comparison we have picked the best reported results. The proposed LRC algorithm outperforms the traditional PCAM and PCAE approaches by a margin of 12% and 18.57%

respectively, achieving a high recognition accuracy of 92.57%. It is also shown to be fairly

comparable to all other methods including the latest ERE approaches.

Table 3.3: Results for the Georgia Tech database.

Method             PCAM     PCAE     BML      DSL      NLDA
Recognition Rate   80.57%   74.00%   87.43%   90.57%   88.86%

Method             FLDA     UFS      ERE Sb   ERE St   LRC
Recognition Rate   90.71%   90.86%   92.86%   93.14%   92.57%

3.3.4 FERET Database

Evaluation Protocol 1 (EP1)

The FERET database is arguably one of the largest publicly available databases [62]. Following [61], [63], we construct a subset of the database consisting of 128 subjects with at least 4 images per subject; we used 4 images per subject [61]. Figure 3.7 shows

images of a typical subject from the FERET database. It has to be noted that in [61] the

database consists of 256 subjects, 128 subjects (i.e. 512 images) are used to develop the

face space while the remaining 128 subjects are used for the face recognition trials. The

proposed LRC approach uses the gallery images of each person to form a linear subspace,

therefore it does not require any additional development of the face space. However it

requires multiple gallery images for a reliable construction of the linear subspaces. Using a single gallery image for each person is not sufficient in the context of linear regression, as this corresponds to only a single regressor (or predictor) observation, leading to erroneous

least-squares calculations.

Cross validation experiments for LRC were conducted in a 42-D feature space, for each

recognition trial 3 images per person were used for training while the system was tested



(fa) (fb) (ql) (qr)

Figure 3.7: A typical subject from the FERET database; fa and fb represent frontal shots with gesture variations, while ql and qr correspond to pose variations.

for the fourth one. The results are shown in Table 3.4. The frontal images fa and fb

incorporate gesture variations with small pose, scale and rotation changes, whereas ql and

qr correspond to major pose variations (see [62] for details). The proposed LRC approach

copes well with the problem of facial expressions in presence of small pose variations

achieving high recognition rates of 91.41% and 94.53% for fa and fb respectively. It

outperforms the benchmark PCA and ICA I algorithms by margins of 17.19% and 17.97%

for fa and 21.09% and 23.44% for fb respectively. The LRC approach however shows

degraded recognition rates of 78.13% and 84.38% for the severe pose variations of ql and

qr respectively, however even with such major posture changes it is substantially superior

to the PCA and ICA I approaches. In an overall sense we achieve a recognition accuracy of

87.11% which is favorably comparable to 83.00% recognition achieved by ERE [61] using

single gallery images.

Table 3.4: Results for the FERET database.

Experiment   Method   fa       fb       ql       qr       Overall
EP1          PCA      74.22%   73.44%   65.63%   72.66%   71.48%
             ICA I    73.44%   71.09%   65.63%   68.15%   69.57%
             LRC      91.41%   94.53%   78.13%   84.38%   87.11%
EP2          PCA      80.00%   78.75%   67.50%   71.75%   74.50%
             ICA I    77.50%   77.25%   68.50%   70.25%   73.37%
             LRC      93.25%   93.50%   75.25%   76.00%   84.50%

Evaluation Protocol 2 (EP2)



In this experimental setup we validated the consistency of the proposed approach with

a large number of subjects. We now have a subset of the FERET database consisting of 400 randomly selected persons. Cross-validation experiments were conducted as discussed above; results are reported in Table 3.4. The proposed LRC approach showed quite agreeable results with the large database as well. It persistently achieved high recognition rates

of 93.25% and 93.50% for fa and fb respectively. For the case of severe pose variations

of ql and qr, we note a slight degradation in the performance, as expected. The overall performance, however, remains quite comparable, with an average recognition success of

84.50%. For all case-studies the proposed LRC approach is found to be superior to the

benchmark PCA and ICA I approaches.

3.3.5 Extended Yale B Database

Extensive experiments were carried out using the Extended Yale B database [64], [65].

The database consists of 2,414 frontal-face images of 38 subjects under various lighting

conditions. The database was divided into 5 subsets; subset 1, consisting of 266 images (7 images per subject) under nominal lighting conditions, was used as the gallery while all others were used for validation (see Figure 3.8). Subsets 2 and 3, each consisting of 12

images per subject, characterize slight-to-moderate luminance variations, while subset 4

(14 images per person) and subset 5 (19 images per person) depict severe light variations.

All experiments for the LRC approach were conducted with images downsampled to an order 20 × 20; results are shown in Table 3.5. The proposed LRC approach showed excellent

performance for moderate light variations yielding 100% recognition accuracy for subsets

2 and 3. The recognition success however falls to 83.27% and 33.61% for subsets 4 and 5

respectively. The proposed LRC approach has shown better tolerance for considerable illumination variations compared to the benchmark reconstructive approaches, comprehensively outperforming PCA and ICA I for all case studies. The proposed algorithm, however, could not withstand severe luminance alterations.



Figure 3.8: Starting from the top, each row illustrates samples from subsets 1, 2, 3, 4 and 5 respectively.

Table 3.5: Results for the Extended Yale B database.

Approach   Subset 2   Subset 3   Subset 4   Subset 5
PCA        98.46%     80.04%     15.79%     24.38%
ICA I      98.03%     80.70%     15.98%     22.02%
LRC        100%       100%       83.27%     33.61%

3.3.6 AR Database

The AR database consists of more than 4,000 color images of 126 subjects (70 men and 56

women) [32]. The database characterizes divergence from ideal conditions by incorporating

various facial expressions (neutral, smile, anger and scream), luminance alterations (left

light on, right light on and all side lights on) and occlusion modes (sunglass and scarf).

It also contains adverse scenarios of occlusion with luminance (sunglass with left light on,



sunglass with right light on, scarf with left light on and scarf with right light on). To

take care of session variability the pictures were taken in two sessions separated by two

weeks and no restrictions regarding wear, make-up, hair style etc. were imposed on the

participants. Due to the large number of subjects and the substantial amount of variations,

the AR database is much more challenging compared to the AT&T and Yale databases.

It has been used by researchers as a test-bed to evaluate and benchmark face recognition

algorithms. In this research we address two fundamental challenges of face recognition i.e.

facial expression variations and contiguous occlusion.

Gesture Variations

Facial expressions are defined as the variations in appearance of the face induced by inter-

nal emotions or social communications [66]. Analysis of these expressions is an emerging

research area in the field of behavioral sciences [67], [68]. In the context of face identification, the problem of varying facial expressions refers to the development of face recognition

systems which are robust to these changes. The task becomes more challenging due to the

natural variations in the head orientation with the changes in facial expressions as depicted

in Figure 3.10. Most face detection and orientation normalization algorithms make use of facial features such as the eyes, nose and mouth. It has to be noted that for the

case of adverse gesture variations such as “scream” the eyes of the subject naturally get

closed (see Figure 3.10(d) and (h)). Consequently under such severe conditions the eyes

cannot be automatically detected and therefore face normalization is likely to be erro-

neous. Hence there are two possible configurations for a realistic evaluation of robustness

for a given face recognition algorithm: 1) By implementing an automatic face localization

and normalization module before the actual face recognition module. 2) By evaluating

the algorithm using the original frame of the face image rather than a manually localized and

aligned face. With this understanding we validate the proposed LRC algorithm for the

problem of gesture variations on the original, uncropped and unnormalized AR database.

We design four evaluation strategies, i.e. Evaluation Protocol 1 (EP1), Evaluation Protocol

2 (EP2), Evaluation Protocol 3 (EP3) and Evaluation Protocol 4 (EP4). For all of these

evaluation protocols the 576 × 768 image frames are downsampled to an order 10 × 10



constituting a 100-D feature space. The choice of dimensionality is elaborated in Figure

3.9. The recognition accuracy of the proposed approach becomes fairly constant after a

dimensionality index of 50.


Figure 3.9: Recognition accuracy with varying feature dimension for EP1, EP2, EP3 and EP4.

Evaluation Protocol 1

Out of the 125 subjects of the AR database, a subset is generated by randomly selecting 100 individuals (50 males and 50 females). The database characterizes four facial expressions: neutral, smile, anger and scream.

EP1 is based on a leave-one-out strategy, i.e. each time the system is trained using images of 3 different expressions (600 images in all) while the testing session is conducted using the left-out expression (200 images) [58]. The LRC algorithm achieves a high recognition accuracy for all facial expressions; the results for a 100-D feature space are reported in Table 3.6, with an overall average recognition of 98.88%. For the case of screaming, the proposed approach achieves 99.5%, which outperforms the results in [58] by 12.5%; noteworthy is the fact that the results in [58] are shown on a subset consisting of only 50 individuals.



(a) (b) (c) (d)

(e) (f) (g) (h)

Figure 3.10: Gesture variations in the AR database; note the changing position of the head with different poses. The first and second rows correspond to 2 different sessions, incorporating neutral, happy, angry and screaming expressions respectively.

Table 3.6: Recognition results for gesture variations using the LRC approach.

Evaluation Protocol   Gesture   Recognition Accuracy
EP1                   Neutral   99.00%
                      Smile     98.50%
                      Anger     98.50%
                      Scream    99.50%
                      Overall   98.88%
EP2                   Smile     98.00%
                      Anger     95.00%
                      Scream    95.00%
                      Overall   96.00%

Evaluation Protocol 2

Under EP2 we design a typical experimental setup by training the system only on the neutral expression (Figures 3.10(a) and (e)) while testing on the smile, anger and scream expressions (Figures 3.10(b), (c), (d), (f), (g) and (h)). Therefore the gallery consists of

200 images and we have 600 probe images. The results for a 100-D feature space are

reported in Table 3.6. The LRC algorithm achieves an overall accuracy of 96% with 98%



recognition for the smile and 95% for the scream and anger expressions.

Evaluation Protocol 3

Under EP3 we follow the experimental setup as designed in [30]. We now have a subset

of the AR database consisting of 112 individuals. The system is trained using only one image

per subject, which characterizes the neutral expression (Figure 3.10(a)); therefore we have 112

gallery images. The system is tested on the remaining three expressions shown in Figure

3.10 (b), (c) and (d) altogether making 336 probe images. Table 3.7 shows a thorough

comparison of the LRC approach and the results reported in [30]. EM and LEM stand for Edge Map and Line Edge Map respectively, while all other approaches are variants of Principal Component Analysis (PCA) [30]. Note that in [30] the choice of session for the

AR database is not explicitly mentioned, therefore to remove the ambiguity we evaluated

the LRC algorithm for both sessions.

For the first session, the proposed LRC approach achieves a high recognition accuracy of 92.86% in the overall sense, which outperforms the best reported result of 75.67%

(112-eigenvectors approach) by a margin of 17.19%. For the cases of the smile and anger expressions we obtained 93.75% and 95.54% respectively, which are quite comparable to the best contestants, i.e. 94.64% (60-eigenvectors) and 92.86% (LEM). However, for the most

adverse case of the screaming expression, the proposed LRC approach beats all the reported approaches by a wide margin, attaining a recognition accuracy of 89.29% and maintaining a difference of 43.75% with the best competitor (112-eigenvectors). The main achievement

of the proposed method is its consistently excellent performance for gentle (smile and anger)

as well as severe (scream) facial expressions.

For the second session, Figure 3.10(e) is used for training while validation is conducted

on Figures 3.10 (f), (g) and (h). The consistency of the proposed approach is more

pronounced and highlighted for this experimental setup as the LRC now achieves a high

recognition rate of 97.32% for all three expressions, outperforming the best reported results

for smile, anger and scream by 2.68%, 4.46% and 51.78% respectively. In the overall sense

the LRC approach is now better than the best competitor by a margin of 21.65%.

Evaluation Protocol 4

Under EP4 we compare the proposed approach with two state-of-the-art algorithms:



Table 3.7: Recognition results for gesture variations under EP3.

Approach                     Smile    Anger    Scream   Overall
20-eigenvectors              87.85%   78.57%   34.82%   67.08%
60-eigenvectors              94.64%   84.82%   41.96%   73.80%
112-eigenvectors             93.97%   87.50%   45.54%   75.67%
112-eigenvectors w/o 1st 3   82.04%   73.21%   32.14%   62.46%
EM                           52.68%   81.25%   20.54%   51.49%
LEM                          78.57%   92.86%   31.25%   67.56%
LRC (Session 1)              93.75%   95.54%   89.29%   92.86%
LRC (Session 2)              97.32%   97.32%   97.32%   97.32%

Bayesian Eigenfaces (MIT) [33] and FaceIt (Visionics). The Bayesian Eigenfaces approach

was reported to be one of the best in the 1996 FERET test [34], whereas the FaceIt

algorithm (based on Local Feature Analysis [35]) is claimed to be one of the most successful commercial face recognition systems [36]. A new subset of the AR database is generated by

randomly selecting 116 individuals. The system is trained using the neutral expression of

the first session (Figure 3.10 (a)) and therefore we have 116 gallery images. The system is

validated for all other expressions of the same session (Figures 3.10 (b), (c) and (d)) making

altogether 348 probe images. A comprehensive comparison of the LRC approach with these

two state-of-the-art algorithms is presented in Table 3.8; all results are as reported in [36]. For mild variations due to the smile and anger expressions the LRC approach yields competitive recognition accuracies of 93.97% and 95.69% in comparison to the FaceIt and

MIT approaches. The efficacy of the proposed approach is highlighted for the severe case

of screaming expression where the LRC comprehensively outperforms the FaceIt and MIT

approaches by a margin of 11.66% and 48.66% respectively. The consistent performance

of the LRC approach yields an excellent identification rate of 93.10% in the overall sense,

which is better than either of the FaceIt or the MIT approach.

Contiguous Occlusion

The problem of face identification in the presence of contiguous occlusion is arguably one of the most challenging paradigms in the context of robust face recognition. Commonly worn items such as caps, sunglasses and scarves tend to obstruct facial features, causing recognition


Table 3.8: Recognition Results for Gesture Variations under EP4

Approach   Recognition Accuracy
           Smile     Anger     Scream    Overall
FaceIt     96.00%    93.00%    78.00%    89.00%
MIT        94.00%    72.00%    41.00%    60.00%
LRC        93.97%    95.69%    89.66%    93.10%

errors. Moreover, in the presence of occlusion, the problems of automatic face localization and normalization discussed in the previous section are magnified even further. Experiments on manually cropped and aligned databases therefore make the implicit assumption of an evenly cropped and well-aligned face, which is not available in practice.

The AR database contains two modes of contiguous occlusion, i.e. images with a pair of sunglasses and images with a scarf. Figure 3.11 reflects these two scenarios for two different sessions.

A subset of the AR database consisting of 100 randomly selected individuals (50 men and 50

women) is used for empirical evaluation. The system is trained using Figures 3.10 (a)-(h)

for each subject thereby generating a gallery of 800 images. Probes consist of Figures

3.11 (a) and (b) for sunglass occlusion and Figures 3.11 (c) and (d) for scarf occlusion.

The proposed approach is evaluated on the original database without any manual cropping

and/or normalization.

For the case of sunglass occlusion the proposed LRC approach achieves a high recog-

nition accuracy of 96% in a 100-D feature space. Table 3.9 depicts a detailed comparison

of the LRC approach with a variety of approaches reported in [6] consisting of Principal

Component Analysis (PCA), Independent Component Analysis - architecture I (ICA I),

Local Nonnegative Matrix Factorization (LNMF), least-squares projection onto the sub-

space spanned by all face images and Sparse Representation based Classification (see [6]

for details). NN and NS corresponds to Nearest Neighbors and Nearest Subspace based

classification respectively. The LRC algorithm comprehensively outperforms the best com-

petitor (SRC) by a margin of 9%. To the best of our knowledge the LRC approach achieves

the best results for the case of sunglass occlusion, note that in [6] a comparable recognition

rate of 97.5% has been achieved by a subsequent image partitioning approach.

For the case of severe scarf occlusion the proposed approach gives a recognition accu-



Figure 3.11: Examples of contiguous occlusion in the AR database.

Table 3.9: Recognition Results for Occlusion

Approach    Recognition Accuracy
            Sunglass   Scarf
PCA+NN      70.00%     12.00%
ICA I+NN    53.50%     15.00%
LNMF+NN     33.50%     24.00%
l2+NS       64.50%     12.50%
SRC         87.00%     59.50%
LRC         96.00%     26.00%

racy of 26% in a 3600-D feature space. Figure 3.12 shows the performance of the system

with respect to an increasing dimensionality of the feature space. Although the LRC algorithm outperforms the classical PCA and ICA I approaches by margins of 14% and 11% respectively, it lags the SRC approach by a margin of 33.5%.

We now demonstrate the efficacy of the proposed Modular LRC approach under severe

occlusion conditions. As a preprocessing step the AR database is normalized, both in

scale and orientation, generating a cropped and aligned subset of images consisting of 100

subjects. Images are manually aligned using the eye and mouth locations as shown in Figure 3.13 [69]; each image is cropped to an order of 292 × 240. Some images from the normalized database are shown in Figure 3.14.

We follow the evaluation protocol discussed in Section 3.3.6; all images are partitioned into 4 blocks as shown in Figure 3.15 (a). The blocks are numbered in ascending order from left to right, starting from the top, and the LRC algorithm for each sub-image uses a 100-D feature space as discussed in the previous section. Figure 3.16 (a) illustrates the


Figure 3.12: The recognition accuracy versus feature dimension for scarf occlusion using the LRC approach.

efficacy of the proposed approach for a random probe image. In our proposed approach we use the distance measures dj(n) as evidence of our belief in a sub-image. The key point to note in Figure 3.16 (a) is that corrupted sub-images (i.e. blocks 3 and 4 in Figure 3.15 (a)) reach a decision with a low belief, i.e. high distance measures dj(n). In the final decision making these corrupted blocks are therefore rejected, giving a high recognition accuracy of 95%. The superiority of the proposed approach is even more pronounced when considering the individual recognition rates of the sub-images in Figure 3.16 (b). Blocks 1 and 2 yield high classification accuracies of 94% and 90% respectively, whereas blocks 3 and 4 give only 1% each. Note that the effect of the proposed approach is twofold: first, it automatically de-emphasizes the non-face partitions; second, the efficient and dynamic fusion harnesses the complementary information of the face sub-images to yield an overall recognition accuracy of 95%, which is better than the best of the participating face partitions.
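The block-wise decisions and the evidence-based rejection described above can be sketched as follows. This is an illustrative reading of the scheme rather than the thesis implementation; the function names and the winner-takes-all fusion rule are our own simplifications:

```python
import numpy as np

def lrc_distances(class_models, y):
    """LRC on one sub-image: least-squares fit of probe vector y onto each
    class subspace X_i; returns the residual norms d_i(y)."""
    d = []
    for X in class_models:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # project onto span(X)
        d.append(np.linalg.norm(y - X @ beta))
    return np.array(d)

def modular_lrc(block_models, block_probes):
    """Evidence-based fusion sketch: each block decides with a belief given by
    its minimum distance; the most confident (smallest-distance) block wins,
    so corrupted blocks with uniformly high distances are effectively rejected."""
    best_label, best_dist = -1, np.inf
    for models, y in zip(block_models, block_probes):
        d = lrc_distances(models, y)
        j = int(np.argmin(d))
        if d[j] < best_dist:
            best_label, best_dist = j, d[j]
    return best_label
```

Because a corrupted block's best match still has a large residual, it cannot outvote a clean block, which is the sense in which such blocks are "rejected".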

Note that in Figure 3.15 (a) the partitioning is such that the uncorrupted sub-images


Figure 3.13: A sample image indicating eyes and mouth locations for the purpose of manual alignment.


Figure 3.14: Samples of cropped and aligned faces from the AR database.

(blocks 1 and 2) correspond to undistorted and complete eyes, which are arguably among the most discriminant facial features. One could therefore argue that the high classification accuracy is due to this coincidence and that the approach might not work well otherwise. To

remove this ambiguity we partitioned the images into six and eight blocks as shown in

Figure 3.15 (b) and (c) respectively. Blocks are numbered left to right, starting from the

top. For Figure 3.15 (b) partitions 1 and 2 give high recognition accuracies of 92.5% and



Figure 3.15: Case studies for Modular LRC approach for the problem of scarf occlusion.

90% respectively, while the remaining blocks, i.e. 3, 4, 5 and 6, yield 8%, 7%, 0% and 1% recognition respectively. Interestingly, although the best block gives 92.5%, which is 1.5% less than the best block of Figure 3.15 (a), the overall classification accuracy comes out to be 95.5%.

Similarly, in Figure 3.15 (c), blocks 1, 2, 3 and 4 give classification accuracies of 88.5%, 84.5%, 80% and 77.5% respectively, while the corrupted blocks 5, 6, 7 and 8 produce only 3%, 0.5%, 1% and 1%. The proposed evidence-based algorithm yields a high classification accuracy of 95%. A key point to note is that the best individual result, 88.5%, lags the best individual result of Figure 3.15 (a) by 5.5%; however, the proposed combination of blocks yields a comparable overall recognition rate. Interestingly, the eyebrow regions (blocks 1 and 2) in Figure 3.15 (c) are found to be the most useful.

To the best of our knowledge, the recognition accuracy of 95.5% achieved by the presented approach is the best result reported for the case of scarf occlusion. The previous best was 93.5%, achieved by the partitioned SRC approach in [6]; a 93% classification accuracy is also reported for 50 subjects in [58]. Finally we compare the proposed

DEF approach with the weighted sum rule which is perhaps the major work-horse in the

field of combining classifiers [43]. The comparison for three case studies in Figure 3.15

is presented in Table 3.10. Note that without any prior knowledge of the goodness of a

specific partition we used equal weights for all sub-images of a given partitioned image.

The DEF approach comprehensively outperforms the sum rule for the three case studies


Figure 3.16: (a) Distance measures dj(n) for the four partitions; note that non-face components make decisions with low evidences. (b) Recognition accuracies for all blocks.


showing improvements of 38.5%, 33.5% and 20% respectively. The performance of the sum rule improves as the proportion of pure face regions increases, demonstrating a strong dependency on how the partitioning is performed. On the other hand, the proposed DEF approach shows consistent performance even for the worst-case partitioning, i.e. Figure 3.15 (a), which consists of an equal number of face and non-face partitions.

Table 3.10: Comparison of the DEF with the Sum Rule for Three Case Studies

Case Study     Sum Rule   DEF
4-partitions   56.50%     95.00%
6-partitions   62.00%     95.50%
8-partitions   75.00%     95.00%

3.4 Conclusion

In this chapter a novel nearest subspace classification algorithm is proposed which formulates the face identification task as a problem of linear regression. The proposed LRC algorithm is extensively evaluated on the most standard databases using a variety of evaluation protocols reported in the face recognition literature. Specifically, the challenges of varying facial expressions and contiguous occlusion are addressed. Extensive comparative analysis with state-of-the-art algorithms clearly reflects the potency of the proposed approach. The proposed LRC approach reveals a number of interesting outcomes. Apart from the Modular LRC approach for face identification in the presence of disguise, the LRC approach yields high recognition accuracies without requiring any preprocessing steps of face localization and/or normalization. We argue that in the presence of non-ideal conditions such as occlusion, illumination and severe gestures, a cropped and aligned face is generally not available. Therefore, consistent and reliable performance on unprocessed standard databases makes the LRC algorithm appropriate for real scenarios. For the case of varying gestures, the LRC approach has been shown to cope well with the most severe screaming expression, where state-of-the-art techniques lag behind, indicating consistency across mild and severe changes. For the problem of face recognition in the presence of disguise, the Modular LRC algorithm, using an efficient evidential fusion strategy, yields the best results reported


in the literature. The simple architecture of the proposed approach makes it computationally efficient and therefore a suitable candidate for video-based face recognition applications. Other future directions include robustness issues related to illumination, random pixel corruption and pose variations.


3.5 Publications

1. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Linear Regression for Face Recognition”, in press, IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI).

2. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Face Identification

using Linear Regression”, in the International Conference on Image Processing (ICIP'09),

Cairo, Egypt.


Chapter 4

Robust Regression for Face Recognition¹

4.1 Introduction

In general, face recognition systems critically depend on manifold learning methods. A

gray-scale face image of order a × b can be represented as an ab-dimensional vector in

the original image space. Typically, in pattern recognition problems, it is believed that

high-dimensional data vectors are redundant measurements of an underlying source. The

objective of manifold learning is therefore to uncover this “underlying source” by a suit-

able transformation of high-dimensional measurements to low-dimensional data vectors.

Therefore, at the feature extraction stage, images are transformed to low dimensional

vectors in a face space. The main objective is to find a basis function for this transfor-

mation, which could distinguishably represent faces in the face space. In the presence of

noise, however, this is known to be an extremely challenging task [3], [70]. It follows from coding theory that iterative measurements are more likely to safely recover information in the presence of noise [71]; working in a low-dimensional feature space while maintaining robustness is therefore an arduous problem in object recognition. A

¹ A part of the chapter has been accepted for publication in the International Conference on Pattern Recognition (ICPR'10). The initial submission to the IEEE TPAMI received revisions in December 2009; the chapter has been duly revised and resubmitted in April 2010. General problem statements and the literature review in the Introduction section of the chapter are included for the sake of completeness and to make the chapter self-contained. Since the thesis is presented as a compilation of publications, some repetition of general statements between chapters is therefore inevitable.


number of approaches have been reported in the literature for dimensionality reduction.

In the context of robustness, these approaches have been broadly classified in two cate-

gories namely generative/reconstructive and discriminative methods [72]. Reconstructive

approaches (such as PCA [13], ICA [31] and NMF [73], [74]) are reported to be robust to problems related to missing and contaminated pixels; these methods essentially exploit the redundancy in the visual data to produce representations with a sufficient reconstruction property. Formally, given an input x and label y, the generative classifiers learn a

model of the joint probability p(x, y) and classify using p(y|x), which is determined using

the Bayes’ rule. The discriminative approaches (such as LDA [17]), on the other hand,

are known to yield better results in “clean” conditions [20] owing to the flexible decision

boundaries. The optimal decision boundaries are determined using the posterior p(y|x)

directly from the data [72] and are consequently more sensitive to outliers. Apart from

these traditional approaches, it has been shown recently that unorthodox features such as

downsampled images and random projections can serve equally well. In fact the choice of

the feature space may no longer be so critical [75], [76], [6]. What really matters is the

dimensionality of the feature space and the design of the classifier.

In the paradigm of face recognition, illumination variation is considered to be a major

robustness issue [3]. Several approaches have been proposed in the literature to tackle

the problem. A sophisticated approach in [64] models images of a subject with a fixed

pose but different illumination conditions as a convex cone in the space of images. Con-

sequently a small number of training images of each face taken under various lighting

conditions are used for the reconstruction of shape and albedo of the face. Although the

approach has demonstrated some good results, in practice the computation of an exact

illumination cone for a given subject is quite expensive and tedious due to a large num-

ber of extreme rays. Studies have shown that the facial images under varying luminance

conditions can be modeled as low-dimensional linear subspaces [65]. Basis images for this

purpose may be obtained using a 3D model under diffuse lighting based on spherical har-

monics. Arrangement of physical lighting can be harnessed so as to obtain images which

can directly be used as basis vectors for low-dimensional linear space. Another line of ac-

tion is to normalize/compensate the illumination effect by some kind of preprocessing such


as histogram equalization, gamma correction and logarithm transform. However these el-

ementary global processing techniques are not of much help in the presence of nonuniform

illumination variations [77]. Moreover some latest approaches such as the Line Edge Map

(LEM) [30] and Face-ARG matching [78] have shown good tolerance under adverse lumi-

nance alterations. The use of geometrical/structural information of the face region justifies

the implicit robustness of these approaches.

Apart from the illumination problem, it has also been shown in the literature that

traditional face recognition approaches do not cope well in the presence of severe random

noise [6], [79], [80], [81],[82]. Most of the approaches in the literature, robust to random

pixel noise, are variants of neural network classification. An important work is presented

in [80] incorporating a robust kernel approach in the presence of severe noise. Encouraging

results have been shown for two important problems of the additive noise (salt and pep-

per) and the multiplicative noise (speckle) compared to the traditional SVM approaches.

Similarly, in [82] it has been shown that a neural network classifier outperforms the traditional

PCA [13], 2DPCA [28], LDA [17] and Laplacianfaces [83] approaches for the case of severe

additive Gaussian noise. Apart from these neural network approaches, recently sparse

representation classification (SRC) has been presented [6], [81]. In the presence of noise

(modeled as a uniform random variable) this approach has been shown to outperform the traditional approaches of PCA, ICA I, LNMF and L2+NS; however, other important noise models such as speckle and salt-and-pepper noise are not addressed.

In this chapter we propose a robust classification algorithm for the problem of face

recognition in the presence of random pixel distortion. Samples from a specific object

class are known to lie on a linear subspace [50], [17]. In our previous work [75], [76]

we proposed to develop class specific models of the registered users thereby defining the

task of face recognition as a problem of linear regression. In the work presented here, we

extend our investigations to the problem of noise contaminated probes, where the inverse

problem is solved using a novel application of the robust linear Huber estimation [84], [85]

and the class label is decided based on the subspace with the most precise estimation.

The proposed approach, although simple in architecture, has demonstrated promising results for two critical robustness issues: severe illumination variations and random pixel


noise.

The rest of the chapter is organized as follows: The fundamental problem of robust

estimation is discussed in Section 4.2 followed by the face recognition problem formulation

in Section 4.3. Section 4.4 demonstrates the efficacy of the proposed approach for the

problem of severely varying illumination followed by the experiments for random pixel

corruption in Section 4.5. The chapter concludes in Section 4.6, followed by a list of publications in Section 4.7.

4.2 The Problem of Robust Estimation

Consider a linear model

y = Xβ + e (4.1)

where the dependent or response variable y ∈ Rq×1, the regressor or predictor variable

X ∈ Rq×p, the vector of parameters β ∈ R

p×1 and error term e ∈ Rq×1 . The problem of

robust estimation is to estimate the vector of parameters β so as to minimize the residual

r = y − ŷ ; ŷ = Xβ̂ (4.2)

ŷ being the predicted response variable and β̂ the estimated parameter vector. In classical statistics the error term e is conventionally taken as zero-mean Gaussian noise [86]. A traditional method to optimize the regression is to solve the least squares (LS) problem

arg min_β Σ_{j=1}^{q} rj²(β)    (4.3)

where rj(β) is the jth component of the residual vector r. However in the presence of

outliers, least squares estimation is inefficient and can be biased. Although it has been

claimed that classical statistical methods are robust, they are only robust in the sense of

type I error. A type I error corresponds to the rejection of the null hypothesis when it is in fact true. It is straightforward to note that the type I error rate for classical approaches in the presence of outliers tends to be lower than the nominal value. This is often referred to


as the conservatism of classical statistics. With contaminated data, however, the type II error increases drastically. A type II error occurs when the null hypothesis is not rejected when it is in fact false. This drawback is often referred to as the inadmissibility of the classical approaches. Additionally, classical statistical methods are known to perform well under the homoskedastic data model. In many real scenarios, however, this assumption does not hold and heteroskedasticity must be accommodated, thereby emphasizing the need for robust estimation.
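The inadmissibility of least squares under contamination is easy to verify numerically: a single gross outlier is enough to drag the estimate of Equation 4.3 far from the true parameters. A small synthetic sketch (our own construction, not from the thesis):

```python
import numpy as np

# Noise-free line y = 1 + 2x sampled at 20 points, then one gross outlier added.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
y = 1.0 + 2.0 * x
y_out = y.copy()
y_out[-1] += 50.0                           # a single contaminated observation

beta_clean = np.linalg.lstsq(X, y, rcond=None)[0]      # recovers (1, 2) exactly
beta_out = np.linalg.lstsq(X, y_out, rcond=None)[0]    # slope badly biased
```

With the clean data the LS fit recovers (1, 2) exactly; with the outlier the slope estimate jumps by several multiples of its true value, which is precisely the bias a bounded ρ-function is designed to suppress.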

Several approaches to robust estimation have been proposed such as R-estimators

and L-estimators. However M -estimators have shown superiority due to their generality

and high breakdown point [84], [86]. Primarily M -estimators are based on minimizing a

function of residuals

β̂ = arg min_{β ∈ R^{p×1}} F(β) ≡ Σ_{j=1}^{q} ρ(rj(β))    (4.4)

where ρ(r) is a symmetric function with a unique minimum at zero [84], [85]

ρ(r) = r²/(2γ)      for |r| ≤ γ
ρ(r) = |r| − γ/2    for |r| > γ    (4.5)

γ being a tuning constant called the Huber threshold. Many algorithms have been developed for calculating the Huber M-estimate in Equation 4.4; some of the most efficient are based on Newton's method [87]. M-estimators have been found to be robust and statistically efficient compared to classical methods [88], [89], [57]. Although robust methods,

in general, are superior to their classical counterparts, they have rarely been addressed

in applied fields [58], [86]. Several reasons for this paradox have been discussed in [86]; the computational expense of robust methods has been a major hindrance [88].

However, with recent developments in computational power, this reason has become in-

significant. The reluctance in the use of robust regression methods may also be credited

to the belief of many statisticians that classical methods are robust.
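The thesis relies on Newton-type solvers for the Huber M-estimate [87]; an alternative and easily reproduced scheme is iteratively reweighted least squares (IRLS), in which observations whose residuals exceed the threshold γ receive weight γ/|rj| instead of 1. The sketch below is our own illustration, not the implementation used in the experiments:

```python
import numpy as np

def huber_fit(X, y, gamma=1.0, n_iter=100, tol=1e-10):
    """Huber M-estimate of Equation 4.4 via iteratively reweighted least squares.

    The weight psi(r)/r implied by the rho of Equation 4.5 is 1 for
    |r| <= gamma and gamma/|r| beyond it, so gross residuals are
    progressively down-weighted instead of dominating the fit.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary LS start
    for _ in range(n_iter):
        r = y - X @ beta
        a = np.abs(r)
        w = np.where(a <= gamma, 1.0, gamma / np.maximum(a, 1e-12))
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On a line y = 1 + 2x sampled at 20 points with one observation shifted by +50, this routine recovers a slope of roughly 2.3, whereas the plain LS fit is dragged to roughly 15.6.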


Table 4.1: Outline of the Robust Linear Regression Classification (RLRC) Algorithm

Inputs: Class models Xi ∈ R^{q×pi}, i = 1, 2, ..., N, and a test image vector y ∈ R^{q×1}.
Output: Class of y.

1. β̂i ∈ R^{pi×1} is evaluated against each class model: β̂i = arg min_{βi ∈ R^{pi×1}} { F(βi) ≡ Σ_{j=1}^{q} ρ(rj(βi)) }, i = 1, 2, ..., N
2. ŷi is computed for each β̂i: ŷi = Xi β̂i, i = 1, 2, ..., N
3. The distance between the original and predicted response variables is calculated: di(y) = ‖y − ŷi‖2, i = 1, 2, ..., N
4. The decision is made in favor of the class with the minimum distance di(y).


4.3 Robust Linear Regression Classification (RLRC) for Robust Face Recognition

Consider N distinguished classes with pi training images from the ith class, i = 1, 2, ..., N. Each grayscale training image is of order a × b and is represented as ui^(m) ∈ R^{a×b}, i = 1, 2, ..., N, m = 1, 2, ..., pi. Each gallery image is downsampled to an order c × d and transformed to a vector through column concatenation such that ui^(m) ∈ R^{a×b} → wi^(m) ∈ R^{q×1}, where q = cd, cd << ab. Each image vector is normalized so that the maximum pixel value is 1. Using the concept that patterns from the same class lie on a linear subspace [50], we develop a class-specific model Xi by stacking the q-dimensional image vectors,

Xi = [wi^(1) wi^(2) ... wi^(pi)] ∈ R^{q×pi}, i = 1, 2, ..., N    (4.6)
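The gallery preprocessing and stacking of Equation 4.6 can be sketched as follows; the block-averaging downsampler and the function name are our own stand-ins, since the text does not prescribe a particular downsampling method:

```python
import numpy as np

def make_class_model(images, out_shape=(10, 10)):
    """Build the class-specific regressor X_i of Equation 4.6.

    Each a x b training image is downsampled to c x d, vectorized by column
    concatenation (q = c*d), normalized so its maximum pixel value is 1,
    and stacked as a column of X_i in R^{q x p_i}.  Assumes nonnegative images.
    """
    c, d = out_shape
    cols = []
    for img in images:
        a, b = img.shape
        # crude block-mean downsampling (illustrative assumption)
        img = img[: (a // c) * c, : (b // d) * d]
        small = img.reshape(c, a // c, d, b // d).mean(axis=(1, 3))
        w = small.flatten(order="F")          # column concatenation
        cols.append(w / w.max())              # maximum pixel value -> 1
    return np.column_stack(cols)
```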

Each vector wi^(m), m = 1, 2, ..., pi, spans a subspace of R^q, also called the column space of Xi. Therefore, at the training level each class i is represented by a vector subspace Xi,

of Xi. Therefore at the training level each class i is represented by a vector subspace, Xi,

which is also called the regressor or predictor for class i. Let z be an unlabeled test image

and our problem is to classify z as one of the classes i = 1, 2, . . . , N . We transform and

normalize the grayscale image z to an image vector y ∈ Rq×1 as discussed for the gallery.

If y belongs to the ith class it should be represented as a linear combination of the training

images from the same class (lying in the same subspace) i.e.

y = Xiβi + e , i = 1, 2, . . . , N (4.7)

where βi ∈ R^{pi×1}. From the perspective of face recognition, the training of the system corresponds to the development of the explanatory variable Xi, which is normally done in a controlled environment; the explanatory variable can therefore safely be regarded as noise-free. The issue of robustness comes into play when a given test pattern is contaminated with noise, which may arise due to luminance conditions, sensor malfunction, channel noise, etc. Given that q ≥ pi, the system of equations in Equation 4.7 is well-conditioned, and βi is estimated using robust Huber estimation as discussed in Section 4.2 [85]


β̂i = arg min_{βi ∈ R^{pi×1}} F(βi) ≡ Σ_{j=1}^{q} ρ(rj(βi)), i = 1, 2, ..., N    (4.8)

where rj(βi) is the jth component of the residual

r(βi) = y − Xiβi , i = 1, 2, . . . , N (4.9)

The estimated vector of parameters β̂i, along with the predictor Xi, is used to predict the response vector for each class i:

ŷi = Xi β̂i, i = 1, 2, ..., N    (4.10)

We now calculate the distance measure between the predicted response vector ŷi, i = 1, 2, ..., N, and the original response vector y,

di(y) = ‖y − ŷi‖2, i = 1, 2, ..., N    (4.11)

and rule in favor of the class with the minimum distance, i.e.

arg min_i di(y), i = 1, 2, ..., N    (4.12)

The proposed RLRC algorithm is outlined in Table 4.1.
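Table 4.1 translates almost line for line into code. The sketch below is illustrative only: it substitutes a plain IRLS solver for the Newton-type Huber solvers cited in the text [87], and the function names are our own:

```python
import numpy as np

def huber_beta(X, y, gamma=1.0, n_iter=100):
    """Step 1 of Table 4.1: Huber M-estimate of beta_i (here via simple IRLS)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        a = np.abs(y - X @ beta)
        w = np.where(a <= gamma, 1.0, gamma / np.maximum(a, 1e-12))
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
    return beta

def rlrc(class_models, y, gamma=1.0):
    """Steps 1-4 of Table 4.1: robust fit per class, predict, rank by distance."""
    dists = []
    for X in class_models:                       # one regressor X_i per class
        beta = huber_beta(X, y, gamma)           # step 1: robust beta_i
        y_hat = X @ beta                         # step 2: predicted response
        dists.append(np.linalg.norm(y - y_hat))  # step 3: d_i(y)
    return int(np.argmin(dists))                 # step 4: minimum-distance class
```

On a probe drawn from one class subspace and corrupted on a few pixels, the Huber fit down-weights the corrupted entries, so the correct class still attains the smallest distance.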

4.4 Case Study: Face Recognition in the Presence of Severe Illumination Variations

The proposed RLRC algorithm is extensively evaluated on various databases incorporating

several modes of luminance variations. In particular, we address three standard databases, namely the Yale Face Database B [64], the CMU-PIE database [90] and the AR database [32]. For all experiments, images are histogram-equalized and transformed to the logarithm domain.


Figure 4.1: Yale Face Database B: starting from the top, each row represents typical images from subsets 3, 4 and 5 respectively. Note that subset 5 (third row) characterizes the worst illumination variations.

Table 4.2: Details of the subsets for the Yale Face Database B with respect to light source directions.

Subset                     1       2       3       4       5
Lighting angle (degrees)   0–12    13–25   26–50   51–77   >77
Number of images           70      120     120     140     190

4.4.1 Yale Face Database B

Yale face database B consists of 10 individuals with 9 poses incorporating 64 different

illumination alterations for each pose [64]. The database has been used by researchers as

a test-bed for the evaluation of robust face recognition algorithms. Since we are concerned with the illumination-tolerant face recognition problem, only the frontal images of the subjects are considered. The images are divided into 5 subsets with respect to the angle between the light source direction and the camera axis (see Figure 4.1 and Table 4.2). Interested readers may refer to [64] for further details of the database. All images are downsampled to an order of 50 × 50.

We follow the evaluation protocol as reported in [64], [77], [91], [92], [93], [94], [95].

Training is conducted using subset 1 and the system is validated on the remaining sub-

sets. A detailed comparison of the results with some of the latest approaches is shown in Table


Table 4.3: Recognition Results for the Yale Face Database B

Method                                  Subset 3   Subset 4   Subset 5
No Normalization [77]                   89.20%     48.60%     22.60%
Histogram Equalization [77]             90.80%     45.80%     58.90%
Linear Subspace [64]                    100.00%    85.00%     N/A
Cones-attached [64]                     100.00%    91.40%     N/A
Cones-cast [64]                         100.00%    100.00%    N/A
Gradient Angle [91]                     100.00%    98.60%     N/A
Harmonic Images [92]                    99.70%     96.90%     N/A
Illumination Ratio Images [93]          96.70%     81.40%     N/A
Quotient Illumination Relighting [94]   100.00%    90.60%     82.50%
9PL [95]                                100.00%    97.20%     N/A
Method in [77]                          100.00%    99.82%     98.29%
RLRC                                    100.00%    100.00%    100.00%

4.3, all results are as reported in [77]. Note that the error rates have been converted to

the recognition success rates. Since subset 3 incorporates moderate luminance variations,

most of the state-of-art algorithms report error-free recognition as shown in Table 4.3.

For subset 4 with more adverse illumination variations, the proposed algorithm achieves

100% recognition which is either better than or comparable to all the results reported in

the literature. In particular the proposed approach outperforms the Cones-attached, Illu-

mination Ratio Images and Quotient Illumination Relighting methods by 8.60%, 18.60%

and 9.40% respectively. It is also found to be fairly comparable to the latest Cones-cast

and Gradient Angle approaches. Subset 5 represents the worst case scenario with angle

between the light source direction and camera axis being greater than 77◦. The pro-

posed RLRC algorithm consistently achieves 100% recognition for these severe alterations, comparing favorably with all the reported results in the literature and beating the Quotient

Illumination Relighting method by more than 17%. Noteworthy is the fact that results for

this subset are not available in the literature for most of the contemporary approaches.

4.4.2 CMU-PIE Face Database

Evaluation Protocol 1 (EP 1)

Extensive experiments were conducted on the CMU-PIE database [90]. We follow the evaluation protocol proposed in [1], randomly selecting a subset of the database consisting of 65 subjects with 21 illumination variations per subject; all images are resized to a size of 50 × 50. Figure 4.2 shows the 21 different alterations (labeled 1–21) for a typical subject.

Figure 4.2: The 21 different illumination variations for a typical subject from the CMU-PIE database. These images were captured without any ambient lighting, thereby demonstrating more severe luminance alterations.

Table 4.4: Performance comparison with state-of-the-art algorithms for training images captured under near-frontal lighting. All results are as reported in [1].

Training Images            IPCA     3D Linear Subspace   Fisherfaces   MACE Filters   Corefaces   RLRC
5,6,7,8,9,10,11,18,19,20   97.60%   97.30%               97.30%        100.00%        100.00%     100.00%
5,6,7,8,9,10               91.40%   97.10%               89.30%        99.90%         100.00%     99.41%
5,7,9,10                   72.40%   93.20%               71.40%        99.90%         99.90%      99.85%
7,10,19                    36.10%   50.90%               73.30%        99.10%         99.90%      99.93%
8,9,10                     78.00%   97.80%               82.10%        99.90%         99.90%      99.41%
18,19,20                   91.00%   98.40%               94.20%        99.90%         100.00%     100.00%

Table 4.5: Performance comparison with state-of-the-art algorithms for training images captured under severe lighting conditions. All results are as reported in [1].

Training Images   IPCA     3D Linear Subspace   Fisherfaces   MACE Filters   Corefaces   RLRC
3,7,16            95.90%   99.90%               99.90%        100.00%        100.00%     100.00%
1,10,16           90.70%   99.90%               99.90%        100.00%        100.00%     100.00%
2,7,16            88.57%   99.85%               100.00%       100.00%        100.00%     100.00%
4,7,13            91.40%   98.90%               99.10%        100.00%        100.00%     100.00%
3,10,16           91.70%   100.00%              99.90%        100.00%        100.00%     100.00%
3,16              44.30%   N/A                  49.90%        99.90%         99.90%      99.93%

We follow two experimental setups as proposed in [1]: in the

first set of experiments the system is trained using images with near frontal lighting and

validation is conducted across the whole database. A detailed comparison of the performance with the state-of-the-art approaches is depicted in Table 4.4. The proposed RLRC

algorithm is found to be highly comparable with the latest approaches of MACE filters and Corefaces; it also comprehensively outperforms the IPCA, 3D linear subspace

and Fisherfaces methods for various case-studies of training sessions. For instance with

training images labeled 7, 10 and 19, the proposed RLRC algorithm achieves 99.93% recog-

nition which is 63.83%, 49.03% and 26.63% better than IPCA, 3D linear subspace and

Fisherfaces methods respectively.

For the second set of experiments training is conducted on images captured under

extreme lighting conditions; the system is again validated across the whole database. The

proposed RLRC algorithm is found to be comparable with the latest approaches as shown

in Table 4.5. The only erroneous recognition trial was for the case with the training images

labeled 3 and 16. The error may be attributed to the fact that the system was trained

using only 2 images, which provides only a couple of regressor or predictor observations

for each class in the context of the RLRC algorithm. Apart from insufficient information,

it has to be noted that images 3 and 16 (Figure 4.2) have adverse luminance conditions.

Evaluation Protocol 2 (EP 2)

Under Evaluation Protocol 2 (EP 2) we follow the leave-one-out strategy on the 68 subjects

of the CMU-PIE database as proposed in a recent work of generalized quotient image

[96]. A detailed comparison with the best results in [96] is shown in Figure 4.3. The

proposed RLRC approach consistently attained high recognition accuracy for all leave-

one-out experiments. In particular, apart from one recognition trial we attained an error-free performance index with 100% recognition accuracy. Only one error was reported for

the seventh leave-one-out experiment where we achieved a recognition rate of 98.53%. It

is appropriate to point out that the performance curve for the S-QI method in Figure 4.3 is an

approximation to the curve shown in [96].


Figure 4.3: Performance curves for the CMU-PIE database under EP 2.

4.4.3 AR Database

The AR face database contains over 4000 color images taken in two sessions separated by

two weeks [32]. The database characterizes various deviations from the ideal conditions

including facial expressions, luminance conditions and occlusion modes. In particular,

there are three lighting modes with left light on, right light on and both lights on. Figure

4.4 represents these variations for the two sessions.

Evaluation Protocol 1 (EP 1)

We follow the evaluation protocol proposed in [97]: a subset of the database consisting of 118 randomly selected individuals is used. Training is performed on images with

nominal lighting conditions (Figure 4.4 (a) and (e)) while validation is conducted on

images with adverse ambient lighting (Figure 4.4 (b), (c), (d), (f), (g) and (h)). Therefore

altogether we have 236 (118 × 2) gallery images and 708 (118 × 6) probes.

All images are downsampled to a size of 180 × 180.

Figure 4.4: Various luminance variations for a typical subject of the AR database; the two rows represent the two different sessions.

Table 4.6: Results for the AR database under EP 1.

Method   Recognition Accuracy
LPP      65.25%
DLPP     96.89%
RLRC     95.76%

The results are detailed in Table 4.6: the proposed RLRC algorithm outperforms the Locality Preserving Projections

(LPP) method by a margin of 30.51% and is quite comparable to the Discriminant Locality

Preserving Projections (DLPP) method. All results are as reported in [97].

Evaluation Protocol 2 (EP 2)

Under EP 2 we follow the experimental setup as proposed in [98]. We now have a subset

of 121 subjects; training is done on Figure 4.4 (a) and the system is validated for adverse

luminance variations of the same session i.e Figures 4.4 (b), (c) and (d). Therefore we

have 121 gallery images and 363 (121 × 3) probes. The results are tabulated in Table 4.7; all results are as reported in [98].

Table 4.7: Results for the AR database under EP 2.

Method              Recognition Accuracy
PCA                 25.90%
PCA+HE              37.70%
PCA+BHE             71.30%
PCA+2D Face Model   81.80%
RLRC                94.49%

Table 4.8: Results for the AR database under EP 3.

Method          Left-Light   Right-Light   Both-Lights
1-NN [78]       22.20%       17.80%        3.70%
PCA [78]        7.40%        7.40%         2.20%
LEM [30]        92.90%       91.10%        74.10%
Face-ARG [78]   98.50%       96.30%        91.10%
RLRC            96.30%       94.07%        94.07%

The proposed RLRC algorithm achieves a high recognition accuracy of 94.49%, outperforming the latest 2D face model approach by a

margin of 12.69%.

Evaluation Protocol 3 (EP 3)

Under Evaluation Protocol 3 (EP 3) we follow the experimental setup as proposed in

recent works of Face-ARG matching [78] and Line Edge Map (LEM) [30]. These recent

approaches use geometric quantities and structural information of the human face and have therefore been shown to be robust to severe illumination variations. We select a subset of the

AR database consisting of 135 subjects. The system is trained using Figure 4.4 (a) while

Figures 4.4 (b), (c) and (d) serve as probes, altogether we have 135 gallery images and

405 (135 × 3) probes. The results are tabulated in Table 4.8; note that the

results in [30] are shown for 112 subjects.

The proposed RLRC approach shows a consistent performance across all illumination

modes of the AR database. For the cases of “left light on” and “right light on”, recognition

accuracies of 96.30% and 94.07% are achieved which are fairly comparable to the latest

LEM and Face-ARG approaches as shown in Table 4.8. For the most challenging problem

of illumination with “both lights on” the proposed RLRC approach attains 94.07% recog-

nition which is favorably comparable with the Face-ARG approach and outperforms the

LEM approach by a margin of approximately 20%. The conventional methods of PCA and


1-NN reported in [78] are not competitive, as they lag far behind these latest approaches.

4.4.4 FERET Database

The FERET database is arguably one of the largest publicly available face databases, with two versions [62]: the gray FERET database and the color FERET database. The database addresses

several challenging issues such as expression variations, pose alterations and the aging factor. For the case of varying illumination there is only one evaluation protocol, recognized

as “fafc” within the framework of the gray FERET database. The methodology utilizes only one gallery image for each of the 1196 subjects; the gallery size is therefore 1196. The

probe set consists of 194 images, refer to [62] for further details on the FERET evaluation

methodology. It is worth noting that recognizing a person from a single gallery image is in itself an independent, challenging problem within the paradigm of face recognition [99] and as such is

not the focus of the presented research. However to evaluate the efficacy of the proposed

algorithm with a single gallery image per subject, we conducted extensive experiments

as shown in Figure 4.5. The proposed RLRC algorithm outperformed 13 of the reported

14 algorithms, lagging behind only the algorithm tagged as USC MAR 97 in [62]. Figure

4.5 illustrates the receiver operating characteristics for the three best reported algorithms (in

the sense of recognition accuracy). The proposed RLRC algorithm achieves a verification

accuracy of 70.10% at 0.001 FAR, which lags 9.28% behind the best result reported for USC MAR 97. The proposed RLRC algorithm, however, comprehensively outperforms

the other 13 algorithms beating UMD MAR 97 and EF HIST DEV M12 by a margin of

approximately 30% and 50% (at 0.001 FAR) respectively. It has to be noted that for higher

values of FAR the proposed RLRC algorithm is better than the best reported method.

In particular the RLRC algorithm achieves better performance from 0.017 FAR onwards

with a good equal error rate of approximately 0.03, which is better than the 0.05 reported for the USC MAR 97 method [62].
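The equal error rates quoted here are read off the operating point where the false accept and false reject rates coincide; a minimal sketch of that computation (our own helper, assuming FAR and FRR sampled at the same decision thresholds):

```python
import numpy as np

def equal_error_rate(far, frr):
    """Approximate the EER from FAR/FRR curves sampled at common thresholds:
    take the threshold where the two error rates are closest and average them."""
    far, frr = np.asarray(far), np.asarray(frr)
    i = np.argmin(np.abs(far - frr))   # operating point with FAR ~= FRR
    return (far[i] + frr[i]) / 2.0

# Toy curves: FAR falls and FRR rises as the decision threshold tightens.
far = np.array([0.30, 0.10, 0.05, 0.01, 0.001])
frr = np.array([0.001, 0.01, 0.05, 0.10, 0.30])
print(equal_error_rate(far, frr))  # 0.05
```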


Figure 4.5: ROC curves for the FERET database.

4.5 Case Study: Face Recognition in Presence of Random Pixel Noise

Extensive experiments were carried out using the Extended Yale B database [64], [65].

The database consists of 2,414 frontal-face images of 38 subjects under various lighting

conditions. Subsets 1 and 2, consisting of 719 images under normal-to-moderate lighting conditions, were used as the gallery. Subset 3, consisting of 456 images under severe luminance alterations, was designated as the probe set. Sample gallery and probe images are shown in Figure 4.6. The choice of training and testing images is specifically intended to isolate the effect of

noise.

The proposed approach was validated for a range of exemplary noise models specific to


Figure 4.6: The first row illustrates some gallery images from subsets 1 and 2, while the second row shows some probes from subset 3.

image data. For all experiments the location of noisy pixels is unknown to the algorithm.

Figure 4.7 reflects the probe images corrupted with various degrees of dead pixel noise.


Figure 4.7: Probe images corrupted with (a) 20%, (b) 40%, (c) 60% and (d) 80% dead pixels.
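Dead-pixel corruption of this kind can be simulated by zeroing a randomly chosen fraction of pixels, with the locations kept hidden from the classifier; a sketch (the choice of zero as the dead-pixel value is our assumption):

```python
import numpy as np

def add_dead_pixels(image, fraction, seed=None):
    """Set a random `fraction` of pixels to zero ("dead" pixels)."""
    rng = np.random.default_rng(seed)
    noisy = image.copy()
    n = int(round(fraction * noisy.size))
    idx = rng.choice(noisy.size, size=n, replace=False)  # locations unknown to the classifier
    noisy.flat[idx] = 0.0
    return noisy

img = np.ones((50, 50))
corrupted = add_dead_pixels(img, 0.60, seed=0)
print((corrupted == 0).mean())  # 0.6
```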

The proposed robust linear regression classification approach comprehensively outper-

forms the benchmark reconstructive algorithms of PCA [13] and ICA I [31] showing a high

breakdown point as shown in Figure 4.8. Even for the worst case scenario of 90% corrupted

pixels, the proposed approach achieves 94.52% recognition accuracy outperforming PCA

and ICA I by 49.13% and 75.44% respectively. In the presence of outliers, the L1-norm computation is reported to be more effective than the usual Euclidean distance measure [88], [89]. In the literature, median filtering has also been shown to improve image

understanding in the presence of noise [100]. Therefore for an appropriate indication of

the performance index for the proposed RLRC algorithm we also conducted experiments


Figure 4.8: Recognition accuracy of various approaches for a range of dead-pixel noise densities.

using median filtering and L1-norm calculations for PCA. Note that median preprocessing is indicated by the letter “M” before the corresponding approach. The performance curves are shown in Figure 4.8; the proposed RLRC algorithm consistently outperforms

these robust variants of the benchmark approaches.

In particular, for the case of 90% corruption we note a major performance difference

with the RLRC algorithm outperforming the L1-norm and M-PCA methods by 89.04%

and 91.67% respectively. Noteworthy is the fact that standard preprocessing and robust

calculations are of no use in such severe noise conditions.

For a comprehensive comparison with the benchmark generative approaches, extensive

verification experiments were conducted; the similarity scores were normalized to the scale [0, 1]. Results are detailed in Figure 4.12 and Table 4.9. Rank-recognition profiles for various degrees of dead-pixel contamination show an excellent performance index for the proposed

RLRC approach. With the increasing noise intensity the proposed approach shows good


tolerance as compared to PCA and ICA I.

Specifically with 80% noise density, the proposed RLRC approach achieves an excellent

rank-1 recognition accuracy of 99.78%, comprehensively beating the PCA and ICA I rank-1

recognition results by 28.29% and 48.68% respectively. The RLRC approach achieves an

excellent equal error rate (EER) of 0.40% as compared to 5.80% and 11.02% for PCA and

ICA I respectively. The verification rate of 99.78% at a typical 0.01 FAR, as indicated in Table

4.9, also comprehensively outperforms the benchmark approaches.
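The rank-recognition profiles referred to above can be computed from a probe-by-class similarity matrix; the sketch below (our own helper, with toy scores) counts a probe as correct at rank k when its true class appears among the k best-scoring classes:

```python
import numpy as np

def rank_k_accuracy(scores, true_labels, k):
    """Rank-k recognition rate from a probes x classes similarity matrix
    (higher scores mean more similar)."""
    order = np.argsort(-scores, axis=1)              # classes sorted best-first
    hits = (order[:, :k] == np.asarray(true_labels)[:, None]).any(axis=1)
    return hits.mean()

scores = np.array([[0.9, 0.5, 0.1],
                   [0.2, 0.3, 0.8],
                   [0.4, 0.7, 0.6]])
labels = [0, 2, 2]                                   # third probe is misranked at rank 1
print(round(rank_k_accuracy(scores, labels, 1), 2))  # 0.67
print(rank_k_accuracy(scores, labels, 2))            # 1.0
```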

In the next set of experiments we contaminate the probe images with data drop-out (“pepper”) and saturation (“salt”) noise simultaneously, commonly referred to as salt and pepper noise [101]. This fat-tailed noise, also called impulsive noise or spike noise [101],

can be caused by analog-to-digital converter errors and bit errors in transmission [102],

[103]. Figure 4.9 reflects probes distorted with various degrees of salt and pepper noise. In

the overall sense, the proposed RLRC approach is favorably comparable with the bench-

mark reconstructive approaches as depicted in Figure 4.10. At a noise density of 70%

for instance, the RLRC algorithm gives 9.21% and 18.64% better recognition accuracy

compared to PCA and ICA I respectively. However under high noise densities of 80%

and 90%, PCA seems to be better than either ICA I or RLRC. It should be noted that

under such severe conditions of salt and pepper noise, although PCA gives a better comparative performance index, the recognition accuracy achieved (e.g., 48.25% at 80%

noise density) is itself far from satisfactory. For salt and pepper noise we note that the

median preprocessing results in a significant improvement both for PCA and RLRC ap-

proaches. For instance at 70% noise density M-PCA achieves 93.86% recognition which is

22.15% better than simple PCA. Similarly the M-RLRC shows an improvement of almost

20% compared to simple RLRC achieving a maximum recognition of 100%. The L1-norm

calculation is of little benefit, as M-RLRC consistently performs better than all competing

approaches.

Verification experiments were also conducted for salt and pepper noise; the results are shown in Figure 4.13 and Table 4.10. For noise densities up to 70%, the proposed RLRC algorithm shows a better performance index, both in recognition and verification, compared

to PCA and ICA I. For instance at a 60% contamination level RLRC achieves a high



Figure 4.9: Probes with (a) 20%, (b) 40%, (c) 70% and (d) 90% salt and pepper noise density.
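Salt and pepper contamination of this kind can be simulated by driving a random fraction of pixels to the extreme intensities; in the sketch below (our own helper, assuming intensities in [0, 1] and an even salt/pepper split) `density` is the total fraction of corrupted pixels:

```python
import numpy as np

def add_salt_and_pepper(image, density, seed=None):
    """Corrupt a [0, 1] grayscale image with salt-and-pepper noise:
    affected pixels are set to 1.0 (salt) or 0.0 (pepper) with equal chance."""
    rng = np.random.default_rng(seed)
    noisy = image.copy()
    mask = rng.random(image.shape) < density   # which pixels are corrupted
    salt = rng.random(image.shape) < 0.5       # 50/50 split among corrupted pixels
    noisy[mask & salt] = 1.0
    noisy[mask & ~salt] = 0.0
    return noisy

img = np.full((50, 50), 0.5)
noisy = add_salt_and_pepper(img, 0.70, seed=0)
print(sorted(np.unique(noisy)))  # [0.0, 0.5, 1.0]
```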


Figure 4.10: Recognition accuracy curves in the presence of varying densities of salt and pepper noise.


Table 4.9: Verification results for dead-pixel noise

           20% Noise Density    40% Noise Density    60% Noise Density    80% Noise Density
Approach   EER      Verif.      EER      Verif.      EER      Verif.      EER       Verif.
PCA        0.70%    99.78%      1.00%    97.81%      1.80%    94.30%      5.80%     75.44%
ICA I      0.60%    100%        1.50%    96.93%      3.70%    92.11%      11.02%    51.54%
RLRC       0.20%    100%        0.20%    100%        0.20%    100%        0.40%     99.78%

Table 4.10: Verification results for salt and pepper noise

           50% Noise Density    60% Noise Density    70% Noise Density    80% Noise Density
Approach   EER      Verif.      EER      Verif.      EER      Verif.      EER       Verif.
PCA        2.12%    95.61%      4.10%    87.28%      6.36%    75.22%      15.13%    53.07%
ICA I      2.16%    93.86%      4.25%    83.99%      8.83%    62.28%      17.24%    42.54%
RLRC       0.21%    99.78%      2.18%    97.37%      5.48%    85.75%      19.30%    39.91%



Figure 4.11: Probe images corrupted with speckle noise of variance (a) 4, (b) 6, (c) 8 and (d) 10.

rank-1 recognition accuracy of 94.52%, outperforming PCA and ICA I by a difference

of 12.50% and 16.45% respectively (refer to Figure 4.13 (b)). A low EER of 2.18% for

the proposed RLRC approach is also favorably comparable to the 4.10% and 4.25% of PCA

and ICA I respectively. Note the major performance difference of the receiver operating

characteristics in Figure 4.13 (f). An excellent verification rate of 97.37% at the standard 0.01 FAR comprehensively outperforms the benchmark approaches (refer to Table 4.10).

However, for a noise density greater than 70% the verification results for PCA are better

than either the RLRC or ICA I approaches. For instance, at 80% noise contamination, PCA

achieves a verification rate of 53.07% at a typical 0.01 FAR which is better than ICA

I and RLRC by 12.53% and 13.16% respectively, also the EER performance for PCA is

superior to both approaches as shown in Table 4.10. The superior performance of PCA at severe noise densities is, however, undone by the fact that it is unable to reach satisfactory performance in the absolute sense, as a 53.07% success rate is not reliable by

any standard. For low to moderate salt and pepper noise, the proposed RLRC remains

the best choice.

Speckle noise is regarded as a major interference in digital imaging and therefore forms

another important robustness issue. The proposed approach was extensively evaluated by

adding varying multiplicative speckle noise to probes as shown in Figure 4.11. Speckle

noise is efficiently modeled as a zero-mean uniform random variable.
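Under this model a noisy probe is J = I + I·n with n uniform and zero-mean; a uniform variable on [-a, a] has variance a²/3, so a = √(3v) for a target variance v. The sketch below is our own (the flat test image and [0, 255] intensity range are assumptions):

```python
import numpy as np

def add_speckle(image, variance, seed=None):
    """Multiplicative speckle noise J = I + I * n, with n drawn from a
    zero-mean uniform distribution of the requested variance."""
    rng = np.random.default_rng(seed)
    a = np.sqrt(3.0 * variance)                # Var(U[-a, a]) = a^2 / 3
    n = rng.uniform(-a, a, size=image.shape)
    return image + image * n

img = np.full((50, 50), 100.0)                 # flat image, intensities in [0, 255]
noisy = add_speckle(img, 6.0, seed=0)
print(noisy.shape)  # (50, 50)
```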

The proposed RLRC approach showed a good performance index as shown in Figure

4.17, consistently achieving a high recognition accuracy for a wide range of error variance.

The effect of speckle noise with a variance of 6 is shown in Figure 4.11 (b); the image


Figure 4.12: Dead-pixel noise: the first row shows rank-recognition profiles while the second row shows the Receiver Operating Characteristics (ROC). From left to right, the columns indicate 20%, 40%, 60% and 80% noise density respectively.


Figure 4.13: Salt and pepper noise: the first row shows rank-recognition curves while the second row shows the Receiver Operating Characteristics (ROC). From left to right, the columns indicate 50%, 60%, 70% and 80% noise densities respectively.


is badly distorted and traditional reconstructive approaches fail to produce competitive

results. The proposed RLRC approach attains a high recognition accuracy of 91.67%

outperforming PCA and ICA I approaches by 16.67% and 23.69% respectively. Noteworthy

is the tolerance and consistency of the proposed approach for highly corrupted data. The

best recognition results are reported for the RLRC approach with median filtering (M-RLRC, indicated by the red dashed line in Figure 4.17). For instance, for the worst case of

speckle noise with variance 10 the M-RLRC approach achieves a high recognition success

of 97.37% comprehensively outperforming all competing approaches, the best competitor

being the RLRC without any preprocessing achieving 86.40% (solid blue line in Figure

4.17). Median filtering variants again show significant improvement of approximately 12%

compared to the unprocessed computations. Note that the L1-norm robust calculations are not of much help in such adverse conditions.

Results for verification experiments are shown in Figure 4.14 and Table 4.11. In par-

ticular, in the presence of noise density with variance 8, the RLRC approach achieves a

high rank-1 recognition accuracy of 89.04% outperforming PCA and ICA I by margins

of 19.52% and 22.15% respectively. At a typical 0.01 FAR the proposed RLRC reaches a

verification rate of 94.74% with an EER of only 3.07%, substantially outperforming both competing approaches (refer to Figures 4.14 (d) and (h)).

Since all imaging systems acquire images by counting photons, the detector noise (modeled as additive white Gaussian noise) is always an important case study in the context of robustness [104]. The probes were distorted by adding zero-mean

Gaussian noise with a wide range of error variance as shown in Figure 4.16. Since classical

statistical methods are known to be efficient in the presence of Gaussian noise, we also

conducted experiments by solving Equation 4.7 using the least squares (LS) approach.
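To make the contrast concrete, the class-wise regression in Equation 4.7 can be solved either by ordinary least squares or by a robust M-estimator; the sketch below uses a Huber estimator fitted by iteratively reweighted least squares as a generic stand-in for the robust estimation step of RLRC (the tuning constant, iteration count and toy galleries are our assumptions, not the thesis' exact settings):

```python
import numpy as np

def huber_irls(X, y, delta=1.345, iters=50):
    """Huber M-estimate of beta in y = X beta + e via iteratively
    reweighted least squares, initialised from the ordinary LS solution."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    for _ in range(iters):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-9  # robust MAD scale
        u = np.abs(r) / (s * delta)
        w = np.where(u <= 1.0, 1.0, 1.0 / u)                     # Huber weights
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def min_residual_classify(probe, galleries):
    """Assign the probe to the class whose gallery regressors leave the
    smallest robust-fit residual (the minimum-residual decision rule)."""
    res = [np.linalg.norm(probe - X @ huber_irls(X, probe)) for X in galleries]
    return int(np.argmin(res))

rng = np.random.default_rng(0)
X0 = rng.normal(size=(100, 3))            # class-0 gallery (3 training vectors)
X1 = rng.normal(size=(100, 3))            # class-1 gallery
probe = X0 @ np.array([1.0, -2.0, 0.5])   # probe generated from class 0
probe[:5] += 5.0                          # a few grossly corrupted "pixels"
print(min_residual_classify(probe, [X0, X1]))  # 0
```

The robust fit downweights the corrupted entries, so the residual against the correct class stays small even though an ordinary least-squares fit would be pulled toward the outliers.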

To harness redundant measurements in the presence of noise, all experiments for LS and

RLRC were conducted in the original image space. Results shown in Figure 4.18 reflect the

superiority of the proposed method. The RLRC approach consistently outperformed all

other approaches for a wide range of error variance. In particular, with an error variance of

0.8 the RLRC approach beats PCA, ICA I and LS methods by margins of 7.89%, 16.88%

and 48.48% respectively. Even with a severe additive noise of 0.9 variance a reasonable


Figure 4.14: Speckle noise: the first row shows rank-recognition curves while the second row shows the receiver operating characteristics. From left to right, the columns indicate noise densities with variances 2, 4, 6 and 8 respectively.


Figure 4.15: Gaussian noise: First row represents rank recognition curves while second row shows Receiver Operating Characteristics(ROC). From left to right columns indicate noise densities with variances 0.5, 0.7, 0.8, and 0.9 respectively.


Table 4.11: Verification Results for Speckle Noise

            Variance = 2       Variance = 4       Variance = 6       Variance = 8
Approach    EER     Verif.     EER     Verif.     EER     Verif.     EER     Verif.
PCA         3.20%   87.94%     4.09%   81.80%     5.71%   79.82%     6.62%   73.90%
ICA I       4.60%   86.84%     5.57%   77.63%     6.36%   70.61%     8.27%   67.76%
RLRC        0.52%   99.34%     1.00%   98.90%     2.61%   96.27%     3.07%   94.74%

Table 4.12: Verification Results for Gaussian Noise

            Variance = 0.5     Variance = 0.7     Variance = 0.8     Variance = 0.9
Approach    EER     Verif.     EER     Verif.     EER     Verif.     EER     Verif.
PCA         2.19%   95.61%     3.28%   92.11%     2.23%   93.20%     4.05%   86.18%
ICA I       2.06%   94.30%     3.50%   89.47%     4.16%   86.62%     4.17%   86.40%
LS          7.80%   78.73%    13.49%   56.36%    15.88%   50.00%    18.20%   46.93%
RLRC        0.21%   99.78%     0.43%   99.78%     1.30%   98.68%     1.63%   98.68%


[Figure 4.16 — four probe images, panels (a)–(d).]

Figure 4.16: Probe images corrupted with (a) 0.2, (b) 0.4, (c) 0.6 and (d) 0.8 variance zero-mean Gaussian noise.

93.64% recognition accuracy was achieved. The LS approach showed an interesting behavior: for low-variance noise its performance is broadly comparable to the RLRC approach, but at low SNR the LS method substantially lags robust linear regression classification. The best recognition results are obtained for RLRC with median filtering (M-RLRC); for instance, with 0.8 noise variance a recognition accuracy of 99.78% is reported, which is 3.73% better than the plain RLRC approach. Median filtering also substantially improved the performance of PCA; however, the two top performance curves are obtained for the M-RLRC and RLRC methods.

Verification results for various SNR case-studies of AWGN are shown in Figure 4.15 and Table 4.12. In particular, for the worst-case scenario of 0.9 variance Gaussian noise, the proposed RLRC approach achieves a high verification rate of 98.68% at 0.01 FAR, comprehensively outperforming the PCA, ICA I and LS methods by 12.50%, 12.28% and 51.75% respectively (see Figure 4.15 (h)). The performance difference of more than 50% compared to the LS approach signifies the importance of robust regression for this particular case-study of face recognition. In terms of EER the proposed approach attains an excellent 1.63%, while the other approaches substantially lag behind (refer to Table 4.12).
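The EER figures quoted above are operating points on the genuine and impostor score distributions. As a hedged illustration with synthetic scores (not the thesis data), the EER can be computed by sweeping a decision threshold until the false-accept and false-reject rates meet:

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate: sweep thresholds over all observed scores and return
    the point where false-accept rate (impostors scored >= t) and
    false-reject rate (genuine scored < t) are closest."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

# Synthetic similarity scores: genuine trials score higher on average
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 2000)
impostor = rng.normal(0.0, 1.0, 2000)
rate = eer(genuine, impostor)
# For unit-variance normals two sigma apart, the theoretical EER is about 0.159
```

The verification rate at a fixed FAR (0.01 in the tables above) is read off the same ROC by choosing the threshold that yields that FAR.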

[Figure 4.17 — recognition accuracy versus noise variance for PCA+NN, ICA I+NN, RLRC, M-PCA, PCA+L1 and M-RLRC.]

Figure 4.17: Recognition accuracy of various approaches in the presence of speckle noise for different variances.

[Figure 4.18 — recognition accuracy versus noise variance for PCA+NN, ICA I+NN, LS, RLRC, M-PCA, PCA+L1 and M-RLRC.]

Figure 4.18: Recognition accuracy of various approaches in the presence of Gaussian noise for different variances.

4.6 Conclusion

In this chapter we presented a novel robust face recognition algorithm based on robust Huber estimation. For the first time, the problem of robust face recognition has been formulated as a robust Huber estimation task. The proposed Robust Linear Regression Classification (RLRC) algorithm has been evaluated for two case

studies, i.e., severe illumination variation and random pixel corruption. For the case of illumination-invariant face recognition, we demonstrated results on three standard databases incorporating adverse luminance alterations. A comprehensive comparison with state-of-the-art robust approaches indicates a comparable performance index for the proposed RLRC approach. We demonstrated, for the first time, error-free recognition on the most challenging Subset 5 of the Yale Face Database B. In addition, we reported a comparable evaluation of the RLRC algorithm on the CMU-PIE and AR databases under the standard evaluation protocols reported in the literature.

In addition, the problem of random pixel corruption was addressed. The proposed RLRC approach has shown good results for various noise models, comprehensively outperforming the reconstructive benchmark approaches. In particular, the proposed approach attains a high verification rate of 99.78% at 0.01 FAR for the important case-study of probes contaminated with 80% dead-pixel noise. This performance is appreciable considering that the benchmark approaches are unable to provide satisfactory results under such severe noise conditions. Similarly, in the presence of severe AWGN the proposed RLRC approach beats the traditional generative approaches by a margin of the order of 12%. It has also been shown experimentally that the classical LS approach is extremely inefficient in the presence of severe AWGN and lags the proposed RLRC algorithm by more than 50%. For a fair comparison, the robust variants of the base systems were also evaluated, and a comprehensive comparison demonstrates the efficacy of the proposed approach.

Apart from the good performance index of the proposed approach, there are several interesting outcomes of the presented research. In the paradigm of view-based face recognition, the choice of features for a given case-study has been a debatable topic. Recent research has, however, shown the competency of unorthodox features such as downsampled images and random projections, indicating a divergence from the conventional ideology [75], [76], [6]. The proposed RLRC approach in fact conforms to this emerging belief: it has been shown that, with an appropriate choice of classifier, the original image space can produce good results compared to the traditional subspace approaches. The good results for randomly distributed noisy pixels are encouraging enough to extend the proposed algorithm to the problem of contiguous occlusion, where contaminated pixels are known to have a connected neighborhood.


4.7 Publications

1. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Robust Regression for Face Recognition”, first revision submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI).

2. Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Robust Regression for Face Recognition”, accepted in IAPR International Conference on Pattern Recognition, ICPR’10.


Chapter 5

Speaker Identification using Sparse Representation¹

5.1 Introduction

Human speech is a natural way of recognizing a person, and Automatic Speaker Recognition (ASR) systems have therefore been widely deployed for secure authentication. Conventional speaker recognition algorithms make use of acoustic features to develop probabilistic speaker models and utilize an adequate statistical distance metric for classification. Gaussian Mixture Models (GMMs) have typically been used to develop a probabilistic model for each speaker in a given database [105]. The large-scale acceptance of GMMs as the standard in ASR can be credited to a number of factors, such as high accuracy, the ability to scale training algorithms to large data sets, and the probabilistic framework. A speech signal is naturally characterized by continuous changes in the spectral domain; consequently, a number of Gaussian components (typically of the order of 64) are necessary to model the speaker-dependent features over the length of an utterance. The collection of these Gaussian components constitutes the complete Gaussian mixture model.
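As a rough sketch of this GMM baseline (not the thesis implementation: feature extraction is omitted, the data is synthetic, and the EM loop is deliberately minimal), a diagonal-covariance GMM can be fit per speaker and a probe utterance assigned to the model with the highest average log-likelihood:

```python
import numpy as np

def fit_diag_gmm(X, n_comp=2, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM to the rows of X with plain EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=n_comp, replace=False)].astype(float)
    var = np.tile(X.var(axis=0) + 1e-6, (n_comp, 1))
    w = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibilities from per-component Gaussian log-densities
        logp = -0.5 * (((X[:, None, :] - means) ** 2) / var
                       + np.log(2 * np.pi * var)).sum(axis=-1) + np.log(w)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = r.sum(axis=0) + 1e-9
        w = nk / n
        means = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return w, means, var

def avg_loglik(X, model):
    """Average per-frame log-likelihood of X under a fitted GMM."""
    w, means, var = model
    logp = -0.5 * (((X[:, None, :] - means) ** 2) / var
                   + np.log(2 * np.pi * var)).sum(axis=-1) + np.log(w)
    m = logp.max(axis=1)
    return float((m + np.log(np.exp(logp - m[:, None]).sum(axis=1))).mean())

# Toy demo: two "speakers" with well-separated 4-dim feature clouds
rng = np.random.default_rng(1)
frames = {"A": rng.normal(0.0, 1.0, (500, 4)),
          "B": rng.normal(5.0, 1.0, (500, 4))}
models = {spk: fit_diag_gmm(F) for spk, F in frames.items()}
probe = rng.normal(5.0, 1.0, (100, 4))            # utterance from speaker B
scores = {spk: avg_loglik(probe, m) for spk, m in models.items()}
best = max(scores, key=scores.get)
print(best)                                       # identified as "B"
```

In a real system the rows of each `X` would be per-frame acoustic features (e.g. MFCCs) and the mixture order would be far higher, as noted above.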

A more efficient approach is to develop a Universal Background Model (UBM) using utterances from a set of speakers and adapt this universal model to a particular speaker using Maximum-A-Posteriori (MAP) adaptation [106]. This state-of-the-art approach is commonly referred to as the GMM-UBM. There are several benefits to this approach that have accounted for significant performance improvements in GMM-based classification. For instance, when training data is not available for the adaptation of components in the UBM, the speaker values revert to those in the UBM to provide a more robust speaker model. In contrast, when ample training data is available for a given GMM component, the values approach those of the ML estimate.

¹ The chapter has been accepted for publication in the International Conference on Pattern Recognition (ICPR’10).
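The interpolation between UBM and data statistics described above is, for the means, a relevance-MAP update: each component mean is pulled toward its data mean by a weight that grows with the soft frame count. A minimal sketch (the relevance factor r = 16 and the helper name are illustrative assumptions, not taken from this thesis):

```python
import numpy as np

def map_adapt_means(ubm_means, frames, resp, r=16.0):
    """Mean-only relevance-MAP adaptation of UBM component means.
    resp[t, k] is the UBM posterior of component k for frame t;
    r is the relevance factor controlling the prior strength."""
    nk = resp.sum(axis=0)                                    # soft counts n_k
    ex = (resp.T @ frames) / np.maximum(nk, 1e-9)[:, None]   # E_k[x]
    alpha = nk / (nk + r)                                    # data-dependent weight
    return alpha[:, None] * ex + (1.0 - alpha)[:, None] * ubm_means

# Toy check: 1000 frames, all assigned to component 0 of a 2-component UBM
ubm_means = np.zeros((2, 3))
frames = np.full((1000, 3), 10.0)
resp = np.tile([1.0, 0.0], (1000, 1))
adapted = map_adapt_means(ubm_means, frames, resp)
# Component 0 is pulled almost all the way to the data mean (alpha ~ 0.98);
# component 1, with no assigned frames, keeps its UBM mean.
```

This shows exactly the behavior noted above: with no adaptation data a component reverts to the UBM value, and with ample data it approaches the ML estimate.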

Recently, an intriguing variation of the GMM-UBM approach has enabled the representation of a speaker as a point in a high-dimensional space, i.e. the speaker space [7]. The main idea is to concatenate the means of the GMM components to form a so-called GMM mean supervector [7]. In this way, a variable-length utterance can be represented as a fixed-length feature vector, and the problem of speaker recognition can therefore be tackled as a general problem of pattern recognition. One technique that has received significant focus in the pattern recognition literature is the Support Vector Machine (SVM). The discriminative nature of the SVM has been successfully applied to a variety of pattern recognition tasks. An SVM is basically a two-class classifier that fits a separating hyperplane between the two classes (assuming linear separability). In recent years, SVM-based classification has become a major focus in the task of speaker identification and verification [7].
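The supervector construction itself is a simple concatenation of the adapted component means. A minimal sketch (the optional sqrt(w_k)/sigma_k scaling follows the common KL-divergence supervector kernel normalization; treating it as applicable here is an assumption, not a claim about this chapter's exact kernel):

```python
import numpy as np

def gmm_mean_supervector(means, ubm_weights=None, ubm_stds=None):
    """Concatenate the K x d matrix of adapted component means into one
    fixed-length vector; optionally apply the sqrt(w_k)/sigma_k scaling used
    by the KL-divergence supervector kernel."""
    m = np.asarray(means, dtype=float)
    if ubm_weights is not None and ubm_stds is not None:
        m = np.sqrt(np.asarray(ubm_weights))[:, None] * (m / np.asarray(ubm_stds))
    return m.reshape(-1)

# 64 mixtures of 39-dim features -> a 2496-dim fixed-length representation
sv = gmm_mean_supervector(np.zeros((64, 39)))
print(sv.shape)   # (2496,)
```

Whatever the utterance length, the resulting vector always has dimension K·d, which is what lets the identification task be recast as ordinary pattern classification.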

Typically, in pattern recognition problems, it is believed that high-dimensional data vectors are redundant measurements of an underlying source. The objective of manifold learning is therefore to uncover this “underlying source” by a suitable transformation of high-dimensional measurements to low-dimensional data vectors. The main objective is to find a basis for this transformation which can distinguishably represent patterns in the feature space. A number of approaches for dimensionality reduction have been reported in the literature. These approaches have been broadly classified into two categories, namely generative/reconstructive and discriminative methods. Reconstructive approaches (such as Principal Component Analysis, or PCA [13]) are reported to be robust to noisy data; these methods essentially exploit the redundancy in the original data to produce representations with a sufficient reconstruction property. Formally, given an input x and label y, generative classifiers learn a model of the joint probability p(x, y) and classify using p(y|x), which is determined using Bayes’ rule. Discriminative approaches (such as Linear Discriminant Analysis, or LDA [17]), on the other hand, are known to yield better results in “clean” conditions [20] owing to their flexible decision boundaries. The optimal decision boundaries are determined using the posterior p(y|x) directly from the data and are consequently more sensitive to outliers. In the speaker recognition community there is a growing interest in the exploration of these manifold learning methods. PCA, for instance, has shown some good results in this regard and is usually referred to as the Eigenvoice approach [107], [108]. Along the same lines, the concept of the Fishervoice, based on the LDA approach, has recently been proposed to address the problem of semi-supervised speaker clustering [109].

In this chapter we present a novel speaker identification algorithm based on sparse representation [21]. We propose to utilize the concept of the GMM mean supervector to develop an overcomplete dictionary using training utterances from all the speakers. The fixed-length GMM mean supervector of a given test utterance from an unknown speaker is represented as a linear combination of this overcomplete dictionary. This representation is naturally sparse, since the test utterance corresponds to only a small fraction of the whole training database. Using this sparsity, we propose to solve the inverse problem using l1-norm minimization, as it is shown to recover the sparsest solution [24]. The vector of coefficients thus obtained will have non-zero entries corresponding to the class of the test utterance. The proposed algorithm is evaluated on a subset of the widely available TIMIT speech corpus [110]. Comparative analysis with state-of-the-art speaker recognition algorithms yields a fairly comparable performance index for the proposed algorithm. To the best of our knowledge, this is the first time that sparse representation classification has been used for the problem of speaker identification.

The rest of the chapter is organized as follows: the basic framework of the proposed algorithm is presented in Section 5.2, followed by the experimental evaluation in Section 5.3. The chapter is concluded in Section 5.4.


5.2 Sparse Representation for Speaker Identification

Sparse, or parsimonious, representation of signals is regarded as a major research area in the paradigm of statistical signal processing. Most signals of practical interest are compressible in nature. For example, audio signals are compressible in a localized Fourier domain, and digital images are compressible in the Discrete Cosine Transform (DCT) and wavelet domains. Recent research in the area of compressive sampling has shown that if the optimal representation of a signal is sufficiently sparse when linearly represented with respect to an overcomplete dictionary (also referred to as a measurement matrix), it can be efficiently computed using convex optimization [111, 22, 25, 26, 5, 24].

The main objective of compressive sensing theory is to achieve computational efficiency for information processing using the parsimonious representation of signals. From this perspective, compressive sensing essentially circumvents the Shannon-Nyquist bound by sampling at a much lower rate while still safely recovering the original information [111]. Although the compressive sensing paradigm is not intended for classification purposes, the sparse representation of a signal with respect to a basis remains implicitly discriminative in nature: it selects only those basis vectors which most compactly represent the signal and rejects the others [6].

We exploit this discriminative nature of sparse representation to propose a novel speaker identification algorithm. The proposed algorithm incorporates the GMM mean supervector kernel approach [7] to represent the utterances as feature vectors of a fixed dimension.

We now present the basic framework of the proposed speaker identification algorithm. Let us assume that we have k distinct classes and that n_i utterances are available for training from the ith class. Each variable-length training utterance is mapped to a fixed-dimension feature vector using the GMM mean supervector kernel [7]. Let the resultant feature vector be designated v_{i,j} ∈ R^m, where i = 1, 2, . . . , k is the index of the class and j = 1, 2, . . . , n_i is the index of the training utterance. All the training data from the ith class is placed in a matrix A_i = [v_{i,1}, v_{i,2}, . . . , v_{i,n_i}] ∈ R^{m×n_i}. Let y ∈ R^m be the GMM mean supervector for a test utterance from the ith speaker. A

fundamental concept in pattern recognition indicates that patterns from the same class lie


on a linear subspace [50]; therefore, if y belongs to the ith class and the training samples from the ith class are sufficient, y will approximately lie in the linear span of the columns of A_i:

    y = α_{i,1} v_{i,1} + α_{i,2} v_{i,2} + · · · + α_{i,n_i} v_{i,n_i}        (5.1)

where the α_{i,j} are real scalars. Since the identity i of the test sample y is unknown, we develop a global dictionary matrix A for all k classes by concatenating the A_i, i = 1, 2, . . . , k, as follows:

    A = [A_1, A_2, . . . , A_k] ∈ R^{m×n}        (5.2)

The test pattern y can now be represented as a linear combination of all n training samples (n = n_i × k):

    y = Ax        (5.3)

where

    x = [0, · · · , 0, α_{i,1}, α_{i,2}, · · · , α_{i,n_i}, 0, · · · , 0]^T ∈ R^n        (5.4)

is an unknown vector of coefficients. From equation 5.3 and the discussion above, it follows that only those entries of x that are non-zero correspond to the class of y [6]. This means that if we can solve equation 5.3 for x, we can determine the class of the test pattern y. Recent research in compressive sensing and sparse representation [22, 25, 26, 5, 24] has shown that the sparsity of the solution of equation 5.3 enables us to solve the problem using l1-norm minimization:

    (l1):  x_1 = arg min ‖x‖_1  subject to  Ax = y        (5.5)

Once we have estimated x_1, ideally it should have non-zero entries corresponding only to the class of y, and deciding the class of y is then a simple matter of locating the indices of the non-zero entries in x_1. However, due to noise and modeling limitations, x_1 is commonly corrupted by small non-zero entries belonging to other classes. To resolve this


problem, we define an operator δ_i for each class i such that δ_i(x_1) gives a vector in R^n whose only non-zero entries are from the ith class. This process is repeated for each of the k classes. For a given class i we can then compute the approximation y_i = A δ_i(x_1) and assign the test pattern to the class with the minimum residual between y and y_i:

    min_i  r_i(y) = ‖y − A δ_i(x_1)‖_2        (5.6)
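The whole pipeline can be sketched compactly. This is a toy illustration under stated assumptions (a small random dictionary rather than real supervectors, and the textbook LP reformulation of l1 minimization with x split into non-negative parts u and v), not the thesis implementation:

```python
import numpy as np
from scipy.optimize import linprog

def src_classify(A, y, class_ids):
    """Sparse-representation classification: solve min ||x||_1 s.t. Ax = y
    via the standard LP split x = u - v (u, v >= 0), then assign y to the
    class with the smallest reconstruction residual ||y - A d_i(x)||_2."""
    m, n = A.shape
    res = linprog(c=np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None), method="highs")
    x = res.x[:n] - res.x[n:]
    residuals = {}
    for cl in set(class_ids):
        mask = np.array([cid == cl for cid in class_ids], dtype=float)
        residuals[cl] = np.linalg.norm(y - A @ (x * mask))  # delta_i keeps class i
    return min(residuals, key=residuals.get), x

# Toy dictionary: three utterance "supervectors" per class, two classes
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))
ids = [0, 0, 0, 1, 1, 1]
y = 0.7 * A[:, 1] + 0.3 * A[:, 2]     # test pattern in class 0's span
label, x = src_classify(A, y, ids)
print(label)                          # class 0
```

The residual step (equation 5.6) is what makes the decision robust to the small spurious coefficients that leak into other classes.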

5.3 Experimental Evaluation

The TIMIT corpus is a collection of phonetically balanced sentences sampled at 16 kHz (8 kHz bandwidth), consisting of 10 utterances from each of 630 speakers across 8 dialect regions of the USA [110]. Extensive experiments were conducted on a randomly selected subset of the TIMIT database consisting of 114 speakers. For our experiments we used 8 utterances per speaker for training (5 SX and 3 SI sentences), while 2 utterances (the 2 SA sentences) constituted the testing set. Refer to [110], [105] for further details.

Table 5.1: Experimental Results for the TIMIT database

Approach                 Recognition Accuracy
GMM                      92.98%
GMM-UBM                  96.93%
GMM-SVM                  97.80%
Sparse Representation    98.24%

At the feature extraction stage, the GMM mean supervector approach [7] (with 64 mixtures) is used to generate fixed-length feature vectors from variable-length utterances. In all experiments a pre-emphasis filter with coefficient 0.97 was applied to the sampled waveform, and features were extracted from 25 ms frames generated every 10 ms; all frames were windowed using a Hamming window. Comparative analysis is performed using three state-of-the-art approaches, i.e., the GMM [105], the GMM-UBM [106] and the GMM-SVM [7] speaker identification algorithms. For the implementation of the GMM and GMM-UBM systems, the Hidden Markov Model Toolkit (HTK version 3.4.1) [112] was configured to model a single-state HMM, with the standard MLLR (Maximum Likelihood Linear Regression) and MAP (Maximum-A-Posteriori) adaptation scripts used to adapt the UBM accordingly for the GMM-UBM models and GMM-SVM supervectors.


For the GMM-SVM, the SVM-KM toolbox [113] was used to implement the one-against-all SVM classifier.

Results are shown in Table 5.1. The proposed sparse representation identification algorithm achieves 98.24% recognition accuracy, which is better than all the contesting approaches. The conventional GMM approach [105], for instance, attains 92.98% recognition, lagging the proposed approach by 5.26%. The state-of-the-art GMM-UBM approach [106] yields a comparable identification success of 96.93%. The recently proposed GMM-SVM system [7] also attains good performance, with 97.80% recognition.

5.4 Conclusion

With recent developments in the paradigm of speaker recognition, variable-length utterances can be represented as fixed-length features in a high-dimensional feature space. The task of speaker identification can therefore now be viewed as a traditional pattern classification problem. Motivated by these studies, we proposed a novel speaker identification algorithm based on sparse representation. Noting that a given test utterance from a particular speaker corresponds to only a fraction of the whole training database, we proposed to develop an overcomplete dictionary of all training utterances. A given test utterance is thus represented as a linear combination of all training utterances, giving rise to a naturally sparse representation. The inverse problem is solved using l1-minimization, which yields the sparsest solution. Consequently, the vector of coefficients is also sparse, with non-zero entries corresponding to the class of the unknown speaker. The proposed algorithm was evaluated on the standard TIMIT database, and comparative analysis was performed with state-of-the-art speaker identification approaches. The proposed sparse representation classification algorithm has shown a good performance index and compares favorably with all contesting approaches.

Although the initial results for the proposed algorithm are quite good, the TIMIT database characterizes an ideal acquisition environment and does not depict key robustness issues (e.g., reverberant noise and session variability). The good performance under clean conditions is encouraging enough to extend the proposed approach to robust speaker recognition on more challenging databases.


5.5 Publications

Imran Naseem, Roberto Togneri and Mohammed Bennamoun, “Sparse Representation for Speaker Identification”, accepted in IAPR International Conference on Pattern Recognition, ICPR’10.


Chapter 6

Speaker Identification using Linear Regression¹

6.1 Introduction

With increasing security concerns, automatic person identification has emerged as an active research area over the last two decades. Human speech is a natural way of recognizing a person, and Automatic Speaker Recognition systems have therefore been widely deployed for secure authentication. Conventional speaker recognition algorithms make use of acoustic features to develop probabilistic speaker models and utilize an adequate statistical distance metric for classification. Gaussian Mixture Models (GMMs) have typically been used to develop a probabilistic model for each speaker in a given database [105]. The large-scale acceptance of GMMs as the standard in the paradigm of speaker identification can be credited to a number of factors, such as high accuracy, the ability to scale training algorithms to large data sets, and the probabilistic framework. A speech signal is naturally characterized by continuous changes in the spectral domain; consequently, a number of Gaussian components are necessary to model the speaker-dependent features over the length of an utterance. The collection of these Gaussian components constitutes the complete Gaussian mixture model.

¹ The chapter is under review for prospective publication in InterSpeech 2010. The general problem statements and literature review in the Introduction section of the chapter are included for the sake of completeness and to make the chapter self-contained. Since the thesis is presented as a compilation of independent publications, the repetition of general statements between chapters is therefore inevitable.


A more efficient approach is to develop a Universal Background Model (UBM) using utterances from a set of speakers and adapt this universal model to a particular speaker using Maximum-A-Posteriori (MAP) adaptation [106]. This state-of-the-art approach is commonly referred to as the GMM-UBM. There are several benefits to this approach that have accounted for significant performance improvements in GMM-based classification. For instance, when training data is not available for the adaptation of components in the UBM, the speaker values revert to those in the UBM to provide a more robust speaker model. In contrast, when ample training data is available for a given GMM component, the values approach those of the ML estimate.

Recently, an intriguing variation of the GMM-UBM approach has enabled the representation of a speaker as a point in a high-dimensional space, i.e. the speaker space [7]. The main idea is to concatenate the means of the GMM components to form a so-called GMM mean supervector [7]. In this way, a variable-length utterance can be represented as a fixed-length feature vector, and the problem of speaker recognition can therefore be tackled as a general problem of pattern recognition. One technique that has received significant focus in the pattern recognition literature is the Support Vector Machine (SVM). The discriminative nature of the SVM has been successfully applied to a variety of pattern recognition tasks. An SVM is basically a two-class classifier that fits a separating hyperplane between the two classes (assuming linear separability). In recent years, SVM-based classification has become a major focus in the task of speaker identification and verification [7].

Typically, in pattern recognition problems, it is believed that high-dimensional data vectors are redundant measurements of an underlying source. The objective of manifold learning is therefore to uncover this “underlying source” by a suitable transformation of high-dimensional measurements to low-dimensional data vectors. The main objective is to find a basis for this transformation which can distinguishably represent patterns in the feature space. A number of approaches for dimensionality reduction have been reported in the literature. These approaches have been broadly classified into two categories, namely generative/reconstructive and discriminative methods. Reconstructive approaches (such as Principal Component Analysis, or PCA [13]) are reported to be robust to noisy data; these methods essentially exploit the redundancy in the original data to produce representations with a sufficient reconstruction property. Formally, given an input x and label y, generative classifiers learn a model of the joint probability p(x, y) and classify using p(y|x), which is determined using Bayes’ rule. Discriminative approaches (such as Linear Discriminant Analysis, or LDA [17]), on the other hand, are known to yield better results in “clean” conditions [20] owing to their flexible decision boundaries. The optimal decision boundaries are determined using the posterior p(y|x) directly from the data and are consequently more sensitive to outliers. In the speaker recognition community there is a growing interest in the exploration of these manifold learning methods. PCA, for instance, has shown some good results in this regard and is usually referred to as the Eigenvoice approach [107], [108]. Along the same lines, the concept of the Fishervoice, based on the LDA approach, has recently been proposed to address the problem of semi-supervised speaker clustering [109]. An important related work is presented in [114], where an overcomplete dictionary matrix is developed using training utterances from all speakers. Noting that the representation over this dictionary is intrinsically sparse, each test utterance is represented as a linear combination of all training utterances (the dictionary matrix). The inverse problem is solved using l1 optimization, which yields the sparsest solution. Consequently, the vector of coefficients is also sparse, with non-zero entries corresponding to the class of the unknown speaker.

entries corresponding to the class of the unknown speaker.

In the paradigm of pattern recognition it is well known that, in general, samples from the same object class lie on a linear subspace [17], [50]. In this research we utilize this concept to present a novel speaker identification algorithm. Essentially, the idea of the GMM mean supervector is used to develop class-specific subspaces using training utterances from each speaker. The fixed-length GMM mean supervector of a given test utterance from an unknown speaker is represented against each class model, thereby defining the task of speaker identification as a problem of linear regression. Least squares estimation is used to estimate the vector of parameters for a given test utterance against all speaker models. Finally, the decision is ruled in favor of the class with the most precise estimation. The proposed classifier can be categorized as a Nearest Subspace (NS) approach. The proposed algorithm is evaluated on a subset of the widely available TIMIT speech corpus [110].


Comparative analysis with state-of-the-art speaker recognition algorithms yields a fairly comparable performance index for the proposed algorithm.

The rest of the chapter is organized as follows: Section 6.2 presents the proposed

algorithm, followed by closed-set speaker identification experiments in Section 6.3. The

chapter is concluded in Section 6.4.

6.2 Linear Regression Classification (LRC) Algorithm

Let there be N distinct classes, with pi training utterances from the ith class, i = 1, 2, ..., N. Each variable-length training utterance is mapped to a fixed-dimension feature vector using the GMM mean supervector kernel [7]. Let the resultant feature vector be designated w_i^(m) ∈ R^(q×1), where q is the length of the feature vector and m = 1, 2, ..., pi. Using the concept that patterns from the same class lie on a linear subspace [50], we develop a class-specific model Xi by stacking the feature vectors:

Xi = [ w_i^(1)  w_i^(2)  ...  w_i^(pi) ] ∈ R^(q×pi),  i = 1, 2, ..., N    (6.1)

Each vector w_i^(m), m = 1, 2, ..., pi, spans a subspace of R^q, also called the column space of Xi. Therefore, at the training level, each class i is represented by a vector subspace Xi, which is also called the regressor or predictor for class i. Let z be an unlabeled test utterance; our problem is to classify z as one of the classes i = 1, 2, ..., N. We transform the utterance z to a feature vector y ∈ R^(q×1) as discussed for training. If y belongs to the ith class, it should be representable as a linear combination of the training utterances from the same class (lying in the same subspace), i.e.

y = Xi βi,  i = 1, 2, ..., N    (6.2)

where βi ∈ R^(pi×1) is the vector of parameters. Given that q ≥ pi, the system of equations in Equation 6.2 is overdetermined and βi can be estimated using least squares estimation [54], [55], [56]:

βi = (Xi^T Xi)^(−1) Xi^T y    (6.3)


Algorithm: Linear Regression Classification (LRC)

Inputs: Class models Xi ∈ R^(q×pi), i = 1, 2, ..., N, and a test-utterance feature vector y ∈ R^(q×1).
Output: Class of y

1. βi ∈ R^(pi×1) is evaluated against each class model: βi = (Xi^T Xi)^(−1) Xi^T y, i = 1, 2, ..., N
2. yi is computed for each βi: yi = Xi βi, i = 1, 2, ..., N
3. Distance calculation: di(y) = ‖y − yi‖2, i = 1, 2, ..., N
4. Decision is made in favor of the class with the minimum distance di(y)

The estimated vector of parameters, βi, along with the predictor Xi, is used to predict the response vector for each class i:

yi = Xi βi = Xi (Xi^T Xi)^(−1) Xi^T y = Hi y,  i = 1, 2, ..., N    (6.4)

where the predicted vector yi ∈ R^(q×1) is the projection of y onto the ith subspace. In other words, yi is the closest vector in the ith subspace to the observation vector y in the Euclidean sense [57]. Hi is called a hat matrix, since it maps y into yi. We now calculate the distance between each predicted response vector yi, i = 1, 2, ..., N, and the original response vector y,

di(y) = ‖y − yi‖2,  i = 1, 2, ..., N    (6.5)

and rule in favor of the class with the minimum distance, i.e.

min_i di(y),  i = 1, 2, ..., N    (6.6)
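The four steps of the LRC algorithm can be sketched in a few lines of NumPy; `lrc_predict` is an illustrative helper name, and `np.linalg.lstsq` replaces the explicit normal-equations inverse of Equation 6.3 for numerical stability:

```python
import numpy as np

def lrc_predict(class_models, y):
    """Linear Regression Classification: assign y to the class whose
    subspace reconstructs it with the smallest Euclidean error."""
    distances = []
    for X in class_models:                           # X: (q, p_i) regressor for one class
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimate (Eq. 6.3)
        y_hat = X @ beta                             # projection onto the class subspace (Eq. 6.4)
        distances.append(np.linalg.norm(y - y_hat))  # reconstruction error (Eq. 6.5)
    return int(np.argmin(distances))                 # decision rule (Eq. 6.6)
```

Applied to the class models Xi of Equation 6.1, the returned index corresponds to the decision rule of Equation 6.6.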

6.3 Experimental Results

The TIMIT corpus is a collection of phonetically balanced sentences sampled at 16 kHz (8

kHz bandwidth), consisting of 10 utterances from 630 speakers across 8 dialect regions in

the USA [110]. Two sets of experiments were conducted on a randomly selected subset of

the TIMIT database consisting of 200 speakers. For Experiment Set 1 we used 8 utterances


Table 6.1: Experiment Set 1: Recognition accuracy of various approaches with respect to different numbers of mixtures.

Approach   16 MIX    32 MIX    64 MIX    128 MIX   256 MIX
GMM        89.75%    94.25%    92.25%    87.50%    73.50%
GMM-UBM    89.75%    95.50%    96.50%    97.00%    98.50%
GMM-SVM    92.25%    95.75%    NA        NA        NA
LRC        89.75%    96.00%    96.00%    96.25%    95.50%

Table 6.2: Experiment Set 2: Recognition accuracy of various approaches with respect to different numbers of mixtures.

Approach   16 MIX    32 MIX    64 MIX    128 MIX   256 MIX
GMM        66.50%    60.50%    51.25%    32.00%    14.75%
GMM-UBM    71.75%    75.25%    74.50%    74.00%    77.50%
GMM-SVM    70.75%    83.50%    82.50%    NA        NA
LRC        70.75%    83.00%    78.25%    66.00%    45.50%

per speaker for training (5 SX and 3 SI sentences) while 2 utterances (2 SA sentences)

constituted the testing set. Refer to [110], [105] for further details.

At the feature extraction stage, the GMM mean supervector approach [7] is used to generate fixed-length feature vectors from variable-length utterances. In all experiments a pre-emphasis filter with coefficient 0.97 was applied to the sampled waveform, and features were extracted from 25 ms frames at a 10 ms frame rate; all frames were windowed using the Hamming window function. Essentially, 13-dimensional MFCC features were concatenated with 13-dimensional delta features and 13-dimensional acceleration features, thereby generating a 39-dimensional feature vector [112]. Comparative analysis is performed using three state-of-the-art approaches, i.e., the GMM [105], the GMM-UBM [106] and the GMM-SVM [7] speaker identification algorithms. For the implementation of the GMM and GMM-UBM systems, the Hidden Markov Model ToolKit (HTK version 3.4.1) [112] was configured to model a single-state HMM, with the standard MLLR (Maximum Likelihood Linear Regression) and MAP (Maximum A Posteriori) adaptation scripts used to adapt the UBM accordingly for the GMM-UBM models and GMM-SVM supervectors. For the GMM-SVM, the SVM-KM toolbox [113] was used to implement the one-against-all SVM classifier.
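For illustration, the mapping from a variable-length utterance to a fixed-length GMM mean supervector can be sketched in NumPy as follows, assuming a diagonal-covariance UBM is already trained; the function name and the relevance factor r = 16 are illustrative textbook choices, not the HTK configuration used in the thesis:

```python
import numpy as np

def map_mean_supervector(frames, ubm_w, ubm_mu, ubm_var, r=16.0):
    """Map a variable-length utterance (T x d MFCC frames) to a fixed-length
    GMM mean supervector (K*d,) by MAP-adapting the UBM means."""
    # log-density of every frame under each diagonal-covariance Gaussian
    diff = frames[:, None, :] - ubm_mu[None, :, :]                    # (T, K, d)
    log_g = -0.5 * np.sum(diff**2 / ubm_var + np.log(2*np.pi*ubm_var), axis=2)
    log_p = np.log(ubm_w) + log_g                                     # (T, K)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                         # frame posteriors
    n = gamma.sum(axis=0)                                             # soft counts per mixture
    Ex = (gamma.T @ frames) / np.maximum(n[:, None], 1e-10)           # first-order statistics
    alpha = n / (n + r)                                               # adaptation coefficients
    mu_adapted = alpha[:, None] * Ex + (1 - alpha[:, None]) * ubm_mu  # MAP mean update
    return mu_adapted.ravel()                                         # fixed-length supervector
```

Stacking these supervectors per speaker yields the class models Xi of Equation 6.1; mixtures that receive few frames stay close to the UBM means, which is what keeps the supervector length fixed regardless of utterance duration.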

For a comprehensive comparison, experiments were conducted with different numbers of Gaussian mixtures; results are shown in Table 6.1 and Figure 6.1. The proposed linear


[Figure 6.1: Experiment Set 1: Recognition accuracy of various approaches with respect to the number of mixtures; the plot shows recognition accuracy (70–100%) versus the number of mixtures (0–256) for the GMM, GMM-UBM, GMM-SVM and LRC systems.]

regression classification algorithm demonstrates a comparable performance index in all experiments. For the case of 32 mixtures, for instance, it achieves 96.00% recognition accuracy, which is better than all the contesting approaches. The conventional GMM [105] approach, for instance, attains 94.25% recognition, lagging 1.75% behind the proposed approach. The state-of-the-art GMM-UBM [106] approach yields a comparable identification success of 95.50%. The recently proposed GMM-SVM system [7] also attains a good performance with 95.75% recognition. The best recognition accuracy of 98.50% is reported for the GMM-UBM approach with 256 Gaussian mixtures. It should be noted that, as the one-against-all SVM does not scale well with the supervector size (i.e. the number of mixtures), results were not reported where the SVM failed to run.

It is interesting to note in Figure 6.1 how the standard speaker recognition approaches behave as the number of Gaussian mixtures changes. The primitive GMM system is the most affected approach, while the proposed linear regression classification method


showed consistent performance with respect to the varying number of Gaussian mixtures.

Experiment Set 2 was designed to evaluate the reliability of the proposed nearest subspace classification algorithm with fewer training utterances. We therefore selected 3 SI utterances for training, while 2 SA sentences were used to validate the system. Comprehensive results are shown in Table 6.2 and Figure 6.2. The conventional GMM system was unable to cope with the smaller number of training utterances and yielded a maximum recognition accuracy of 66.50% with 32 Gaussian mixtures. The state-of-the-art GMM-UBM system achieved a maximum performance index of 77.50% utilizing 256 mixtures. The proposed linear regression classification algorithm attained 83.00% accuracy with 32 mixtures, outperforming the GMM and the GMM-UBM systems by 16.50% and 5.50% respectively. The proposed system achieved a competitive performance index compared to the latest GMM-SVM approach. Noteworthy is the fact that computations with a high number of mixtures were not possible for the sophisticated one-against-all SVM approach. The proposed LRC system is, however, much simpler in architecture, yet achieves comparable results.

[Figure 6.2: Experiment Set 2: Recognition accuracy of various approaches with respect to the number of mixtures; the plot shows recognition accuracy (10–90%) versus the number of mixtures (0–256) for the GMM, GMM-UBM, GMM-SVM and LRC systems.]


6.4 Conclusion and Future Directions

With recent developments in the paradigm of speaker recognition, variable-length utterances can be represented as fixed-length features in a high-dimensional feature space. The task of speaker identification can therefore now be viewed as a traditional pattern classification problem. Motivated by these studies, we proposed a novel speaker identification algorithm based on linear regression. Noting that samples from a particular class lie on a linear subspace, we developed class-specific models using training utterances from each class. A given test utterance is then represented against each class model, and the pattern recognition task is thereby formulated as a problem of linear regression. The inverse problem is solved using least squares estimation, and the decision is ruled in favor of the class with the minimum reconstruction error. The proposed algorithm was evaluated on the standard TIMIT database, and comparative analysis was performed with state-of-the-art speaker identification approaches.

Although the initial investigations of the proposed algorithm are quite promising, the TIMIT database characterizes an ideal acquisition environment and does not depict key robustness issues (e.g. reverberant noise and session variability). The proposed framework is not expected to cope with noisy conditions, since in the presence of outliers least squares estimation is inefficient and can be biased [89]. Although it has been claimed that classical statistical methods are robust, they are robust only in the sense of type I error. A type I error corresponds to the rejection of the null hypothesis when it is in fact true. It is straightforward to note that the type I error rate of classical approaches in the presence of outliers tends to be lower than the nominal value; this is often referred to as the conservatism of classical statistics. However, due to contaminated data, the type II error increases drastically. A type II error occurs when the null hypothesis is not rejected when it is in fact false; this drawback is often referred to as the inadmissibility of the classical approaches. Additionally, classical statistical methods are known to perform well under a homoskedastic data model. In many real scenarios, however, this assumption does not hold and heteroskedasticity is unavoidable, thereby emphasizing the need for robust estimation [86]. Moreover, for realistic applications the assumption of Gaussian noise is not always true; therefore, in general, LS estimation lacks robustness.


Although robust methods are, in general, superior to their classical counterparts, they have rarely been adopted in applied fields [86]. Several reasons for this paradox are discussed in [86]; the computational expense of robust methods has been a major hindrance [88]. With recent growth in computational power, however, this reason has become insignificant. The extension of the proposed work will therefore be along the lines of utilizing iterative robust estimation algorithms to solve the inverse problem in Equation 6.2.
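As a sketch of this direction, a Huber-type estimate of βi can be computed by iteratively reweighted least squares (IRLS); the tuning constant c = 1.345 and the MAD-based scale estimate are standard textbook choices and, like the function name, are assumptions rather than the thesis implementation:

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=20):
    """Robust regression via iteratively reweighted least squares with
    Huber weights: w(r) = 1 for |r| <= c*s, else c*s/|r|."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # ordinary LS starting point
    for _ in range(n_iter):
        r = y - X @ beta                              # residuals at current estimate
        s = 1.4826 * np.median(np.abs(r)) + 1e-12     # robust scale (normalized MAD)
        w = np.where(np.abs(r) <= c * s, 1.0, c * s / np.abs(r))
        Xw = X * w[:, None]                           # row-weighted design matrix
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)    # weighted normal equations
    return beta
```

Because gross outliers receive weights close to zero, the fit is driven by the bulk of the data, unlike the plain least squares estimate of Equation 6.3.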


6.5 Publications

Imran Naseem, Roberto Togneri and Mohammed Bennamoun, "Linear Regression for Speaker Identification", submitted to InterSpeech 2010.


Chapter 7

Conclusions and Future Directions

In this chapter we summarize the key contributions of the research. Important future

directions identified by the research are also presented.

7.1 Contributions

Original contributions of the presented research are the following:

• Evaluation of the Sparse Representation Classification (SRC) Concept:

Feature extraction methodology, in the context of face recognition, has been a hot

research area for the past two decades. Sophisticated features with complex com-

putations have been successfully used to tackle various robustness issues. These

studies have recently been challenged by a new concept of SRC. It has been shown

for the first time that the correct design of the classifier makes performance largely independent of the feature extraction module, and that even randomly selected features, such as downsampled images and random projections, can yield competitive results compared to

orthodox feature extraction methodologies. With the successful implementation of

the SRC for the view-based face recognition problem, it became imperative to evaluate

the methodology for (1) other view-based biometrics and (2) harder problems in the

paradigm of face recognition.

With this understanding we successfully extended the SRC approach for the prob-

lem of ear recognition. Investigations yielded an agreeable performance index for


the SRC approach on several ear databases. The SRC was found to be robust to

light variations and head rotations. We also evaluated the SRC approach for two

challenging issues of face recognition, i.e., mild-to-normal illumination variations and severe expression variations. The SRC system has been shown to be tolerant to illumination variations and has produced a performance index comparable to

the benchmark approaches. It has also yielded good results for moderate expression

variations, however for severe expression variations (such as anger and scream) there

is a tendency of performance degradation. The results are however comparable to

most of the benchmark approaches.

Due to the large-scale deployment of video surveillance systems, it is becoming imperative to evaluate face recognition algorithms on video sequences. The large amount

of data available for training and testing in a sequence poses many problems such as

over-fitting, natural head rotations, degradation due to computational complexity

etc. With this understanding we extended the evaluations of the SRC approach to

video sequences. The SRC approach attained a good performance index compared

to SIFT (Scale Invariant Feature Transform) achieving a verification rate of 98.23%

at 0.01 FAR for the VidTIMIT database. The complex design of the SRC classifier

due to iterative l1-optimization, however, lags in computational terms: a randomly selected recognition trial of the SRC classifier was reported to be approximately 5 times slower than the swift SIFT approach.

• The Novel Linear Regression Classification (LRC) Algorithm for Face

Recognition: A novel face recognition algorithm based on the concept of linear

regression has been presented. We showed for the first time that simple downsampled images, in combination with a linear regression classification approach, can

produce excellent results for various problems of face recognition. Extensive experi-

ments, incorporating standard databases, were conducted to show the efficacy of the

proposed approach. In particular we showed that for the cases of severe expression

variations, where standard approaches fail to produce satisfactory result, the pro-

posed LRC algorithm attained an excellent performance index. We also introduced

a novel concept of Distance based Evidence Fusion (DEF) to develop a novel mod-


ular approach, called Modular LRC, to address the difficult problem of contiguous

occlusion. In particular, we attained the best results ever reported for the difficult

case of scarf occlusion using the Modular LRC approach. It has to be noted that

amongst the contemporary databases, the scarf occlusion mode of the AR database

is arguably the only available database incorporating naturally occluded images.

• The Novel Robust Linear Regression Classification (RLRC) Algorithm

for Robust Face Recognition: Extending the concept of the proposed LRC al-

gorithm, and noting that the LRC approach actually formulates the pattern recog-

nition problem as a task of linear regression, we proposed to use robust estimation

to tackle the difficult problem of random pixel corruption. We showed phenomenal

results for severe illumination variations and random pixel image noise. In partic-

ular, we achieved 100% recognition for the most difficult Subset 5 of the Yale Face

Database B, a result which had never been reported in the contemporary literature. Excellent

results have also been demonstrated for standard image-noise models compared to

benchmark generative approaches.

• The Novel Implementation of the SRC for the problem of Speaker Iden-

tification: In the paradigm of speaker recognition, probabilistic modeling has been

the major workhorse. Recently, however, an intriguing extension of the GMM-UBM

has made it possible to represent an utterance in a low-dimensional feature space

called the “Speaker Space”. The problem of speaker recognition can therefore be

viewed as a general task of pattern recognition. We therefore implemented the concept of SRC classification for this problem. Experiments were conducted using a subset of the TIMIT database, and the proposed framework produced a competitive performance index compared to state-of-the-art speaker recognition approaches.

• The novel Linear Regression Classification for the Problem of Speaker

Identification: Working along the same lines we proposed a novel implementation

of the Linear Regression Classification (LRC) algorithm for the problem of speaker

identification. To the best of our knowledge, this is the first time that the nearest subspace classification concept has been introduced in the context of speaker


recognition. The proposed framework is the simplest of the present state-of-the-art

approaches. It primarily uses the concept of supervectors to develop class-specific

speaker models. The test utterance is presented against each speaker model and

therefore the otherwise probabilistic task of speaker identification boils down to a

problem of linear regression. The proposed algorithm is evaluated using the TIMIT

database and has shown a good performance index compared to the benchmark

approaches.

7.2 Future Directions

The presented research has opened a number of future directions in the fields of speaker

and face recognition. The simple architecture of the proposed LRC algorithm makes it

quite tempting for other biometric applications. Computationally complex video-based

face recognition could be a straightforward extension. It will also be interesting to study

the behavior of simple downsampled images, in conjunction with the LRC classifier, for

other view-based biometrics such as iris, lips, hand geometry, body gait etc. Based on

the concept of linear regression, 3D biometrics can also be tackled, introducing a new concept of classification for 3D faces, ears, etc.

Robust regression has shown some good results for randomly spread noise in a face

image. Given that under contiguous occlusion the corrupted pixels are known to have a connected neighborhood, the proposed RLRC approach can be extended to the problem of contiguous occlusion. Essentially, we made use of robust Huber estimation to solve the inverse problem in the context of RLRC; this approach can further be evaluated and developed using other efficient robust estimation approaches presented in the robust statistics literature. The LRC algorithm has also shown good results for the problem of speaker identification. However, robustness in the context of LRC is an open research area. It will

be very interesting to note if the robust statistical methods are able to tackle the issues

related to speech noise. In other words, the use of RLRC for robust speaker recognition

will be a promising research area.


Bibliography

[1] M Savvides, B. V. K Vijay Kumar, and P. K Khosla. Corefaces - Robust Shift

Invariant PCA based Correlation Filter for Illumination Tolerant Face Recognition.

In IEEE Conf. on Computer Vision and Pattern Recognition, 2004.

[2] A Jain, A Ross, and S Prabhakar. An Introduction to Biometric Recognition. IEEE

Transactions on Circuits and Systems for Video Technology, 14(1):4–20, Jan 2004.

[3] A. F Abate, M Nappi, D Riccio, and G Sabatino. 2D and 3D Face Recognition: A

Survey. Pattern Recognition Letters, 28(2007):1885–1906, 2007.

[4] M. Pawlewski and J. Jones. Speaker verification: Part 1. Biometric Technology

Today, 14(6):9–11, June 2006.

[5] D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306,

April 2006.

[6] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust Face Recognition

via Sparse Representation. IEEE Trans. PAMI, 31(2):210–227, Feb 2009.

[7] W Campbell, D Sturim, and D Reynolds. Support vector machines using GMM

supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–

311, 2006.

[8] P Phillips, P Grother, R Micheals, D Blackburn, E Tabassi, and M Bone. Face

Recognition Vendor Test 2002: Evaluation Report. 2002.

[9] L. Collins. Earmarked (biometrics). IEE Review, 51(11):38–40, Nov. 2005.


[10] A. Iannarelli. Ear Identification. Paramount Publishing Company, Fremont, California, 1989.

[11] D. J. Hurley, B. Arbab-Zavar, and M. S. Nixon. Handbook of Biometrics, chapter The Ear as a Biometric. 2007.

[12] B. Moreno and A. Sanchez. On the use of outer ear images for personal identifi-

cation in security applications. In Proc. IEEE 33rd Annual Intl. Conf. on Security

Technology, pages 469–476, 1999.

[13] M Turk and A Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

[14] K. Iwano, T. Hirose, E. Kamibayashi, and S. Furui. Audio-visual person authen-

tication using speech and ear images. In Proc. of Workshop on Multimodal User

Authentication, pages 85–90, 2003.

[15] B. Arbab-Zavar, M. S. Nixon, and D. J. Hurley. On model-based analysis of ear biometrics. In IEEE Intl. Conf. on Biometrics: Theory, Applications and Systems,

September 2007.

[16] I. T Jolliffe. Principal Component Analysis. Springer, New York, 1986.

[17] P Belhumeur, J Hespanha, and D Kriegman. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. PAMI, 19(7):711–720, July 1997.

[18] P Comon. Independent Component Analysis - A New Concept ? Signal Processing,

36:287–314, 1994.

[19] M Bartlett, H Lades, and T Sejnowski. Independent Component Representations

for Face Recognition. In Proc. of the SPIE: Conference on Human Vision and

Electronic Imaging III, 3299:528–539, 1998.

[20] R. O Duda, P. E Hart, and D. G Stork. Pattern Classification. John Wiley & Sons,

Inc., 2000.

[21] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24, 2007.


[22] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal

reconstruction from highly incomplete frequency information. IEEE Trans. Inform.

Theory, 52(2):489–509, 2006.

[23] R. Baraniuk, M. Davenport, R. DeVore, and M. B. Wakin. The Johnson-Lindenstrauss lemma meets compressed sensing. dsp.rice.edu/cs/jlcs-v03.pdf, 2006.

[24] D. Donoho. For most large underdetermined systems of linear equations the minimal

l1-norm solution is also the sparsest solution. Comm. on Pure and Applied Math,

59(6):797–829, 2006.

[25] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and

inaccurate measurements. Comm. on Pure and Applied Math, 59(8):1207–1223,

2006.

[26] E. Candes and T. Tao. Near-optimal signal recovery from random projections:

Universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.

[27] Yale Univ. Face Database. http://cvc.yale.edu/projects/yalefaces, 2002.

[28] J Yang, D Zhang, A. F Frangi, and J Yang. Two-dimensional PCA: A New Approach

to Appearance-based Face Representation and Recognition. IEEE Trans. PAMI,

26(1):131–137, January 2004.

[29] M. H Yang. Kernel Eigenfaces vs Kernel Fisherfaces: Face Recognition using Kernel Methods. Proc. Fifth IEEE Int'l Conf. Automatic Face and Gesture Recognition (FGR'02), pages 215–220, May 2002.

[30] Y Gao and M. K. H Leung. Face Recognition using Line Edge Map. IEEE Trans.

PAMI, 24(6):764–779, June 2002.

[31] P. C Yuen and J. H Lai. Face Representation using Independent Component Anal-

ysis. Pattern Recognition, 35(6):1247–1257, 2002.

[32] A Martinez and R Benavente. The AR Face Database. Technical Report 24, CVC,

June 1998.


[33] B Moghaddam and A. P Pentland. Probabilistic Visual Learning for Object Repre-

sentation. IEEE Trans. on PAMI, 19(7):696–710, 1997.

[34] P. J Phillips, H Wechsler, J. S Huang, and P. J Rauss. The FERET Database and

Evaluation Procedure for Face-recognition Algorithms. Image and Vision Comput-

ing, 16(5):295–306, 1998.

[35] P Penev and J Atick. Local Feature Analysis: A General Statistical Theory for

Object Representation. Network: Computation in Neural Systems, 7(3):477–500,

1996.

[36] R Gross, J Shi, and J Cohn. Quo Vadis Face Recognition? In Third Workshop on

Empirical Evaluation Methods in Computer Vision, 2001.

[37] M Bicego, A Lagorio, E Grosso, and M Tistarelli. On the use of SIFT features for

face authentication. CVPRW, 2006.

[38] C. Sanderson and K. K. Paliwal. Identity verification using speech and face information. Digital Signal Processing, 14(5):449–480, 2004.

[39] C Sanderson. Biometric person recognition: Face, speech and fusion. VDM-Verlag,

2008.

[40] D Lowe. Object recognition from local scale-invariant features. Intl. Conf. on Com-

puter Vision, pages 1150–1157, 1999.

[41] K Lee, J Ho, M Yang, and D Kriegman. Visual tracking and recognition using

probabilistic appearance manifolds. CVIU, 99(3):303–331, 2005.

[42] P Viola and M Jones. Robust real-time face detection. International Journal of

Computer Vision, 57(2):137–154, 2004.

[43] J Kittler, M Hatef, R. P. W Duin, and J Matas. On combining classifiers. IEEE

Trans. on Pattern Analysis and Machine Intelligence, 20(3):226–238, 1998.

[44] L Liu, Y Wang, and T Tan. Online appearance model. CVPR, pages 1–7, 2007.


[45] K Lee and D Kriegman. Online probabilistic appearance manifolds for video-based

recognition and tracking. CVPR, 1:852–859, 2005.

[46] P. J. Flynn, K. W. Bowyer, and P. J. Phillips. Assessment of time dependency in face recognition: An initial study. Audio- and Video-Based Biometric Person Authentication, pages 44–51, 2003.

[47] L. Lu, X. Zhang, Y. Zhao, and Y. Jia. Ear recognition based on statistical shape

model. In International Conference on Innovative Computing Information and Con-

trol (ICICIC-06), 2006.

[48] A. B Chan and N Vasconcelos. Modeling, clustering, and segmenting video with

mixtures of dynamic textures. IEEE Trans. PAMI, 30:909–926, May 2008.

[49] A Leonardis and H Bischof. Robust Recognition using Eigenimages. Computer

Vision and Image Understanding, 78(1):99–118, 2000.

[50] R Basri and D Jacobs. Lambertian Reflectance and Linear Subspaces. IEEE Trans. PAMI, 25(3):218–233, 2003.

[51] X Chai, S Shan, X Chen, and W Gao. Locally Linear Regression for Pose-Invariant

Face Recognition. IEEE Trans. PAMI, 16(7):1716–1725, July 2007.

[52] J Chien and C Wu. Discriminant Waveletfaces and Nearest Feature Classifiers for

Face Recognition. IEEE Trans. PAMI, 24(12):1644–1649, Dec 2002.

[53] A Pentland, B Moghaddam, and T Starner. View-based and Modular Eigenspaces

for Face Recognition. Proc. of IEEE Conf. on Computer Vision and Pattern Recog-

nition, 1994.

[54] T Hastie, R Tibshirani, and J Friedman. The Elements of Statistical Learning; Data

Mining, Inference and Prediction. Springer Series in Statistics. Springer, 2001.

[55] G. A. F Seber. Linear Regression Analysis. Wiley, 2003.

[56] T. P Ryan. Modern Regression Methods. Wiley, 1997.

[57] R. G Staudte and S. J Sheather. Robust Estimation and Testing. Wiley, 1990.

[58] S Fidler, D Skocaj, and A Leonardis. Combining Reconstructive and Discriminative Subspace Methods for Robust Classification and Regression by Subsampling. IEEE Trans. PAMI, 28(3):337–350, March 2006.

[59] F Samaria and A Harter. Parameterisation of a Stochastic Model for Human Face Identification. Proc. Second IEEE Workshop Applications of Computer Vision, Dec. 1994.

[60] Georgia Tech Face Database. http://www.anefian.com/face_reco.htm, 2007.

[61] Xudong Jiang, Bappaditya Mandal, and Alex Kot. Eigenfeature Regularization and Extraction in Face Recognition. IEEE Trans. PAMI, 30(3):383–394, March 2008.

[62] P. J Phillips, H Moon, S Rizvi, and P Rauss. The FERET Evaluation Methodology for Face Recognition Algorithms. IEEE Trans. PAMI, 22(10):1090–1104, Oct 2000.

[63] J Lu, K. N Plataniotis, A. N Venetsanopoulos, and S. Z Li. Ensemble-Based Discriminant Learning with Boosting for Face Recognition. IEEE Trans. Neural Networks, 17(1):166–178, Jan 2006.

[64] A Georghiades, P Belhumeur, and D Kriegman. From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Trans. PAMI, 23(6):643–660, 2001.

[65] K. C Lee, J Ho, and D Kriegman. Acquiring Linear Subspaces for Face Recognition under Variable Lighting. IEEE Trans. PAMI, 27(5):684–698, 2005.

[66] S. Z Li and A. K Jain, editors. Handbook of Face Recognition. Springer, 2005.

[67] P Ekman. The Argument and Evidence About Universals in Facial Expressions of Emotions, pages 143–164. Wiley, 1989.

[68] K Scherer and P Ekman. Handbook of Methods in Nonverbal Behavior Research. Cambridge University Press, Cambridge, UK, 1982.

[69] L Zhang and G. W Cottrell. When Holistic Processing is Not Enough: Local Features Save the Day. In Proc. of the Twenty-sixth Annual Cognitive Science Society Conference, 2004.

[70] W Zhao, R Chellappa, P Phillips, and A Rosenfeld. Face Recognition: A Literature Survey. ACM Computing Surveys, 35(4):399–458, 2003.

[71] F MacWilliams and N Sloane. The Theory of Error-Correcting Codes. North-Holland, 1981.

[72] P. M Roth and M Winter. Survey of Appearance-based Methods for Object Recognition. Technical report, Inst. for Computer Graphics and Vision, Graz University of Technology, Austria, January 2008.

[73] Daniel D Lee and H. S Seung. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature, 401:788–791, 1999.

[74] Daniel D Lee and H. S Seung. Algorithms for Non-negative Matrix Factorization. Advances in Neural Information Processing Systems, pages 556–562, 2001.

[75] Imran Naseem, Roberto Togneri, and Mohammed Bennamoun. Linear Regression for Face Recognition. IEEE Trans. on PAMI (in press), 2009.

[76] Imran Naseem, Roberto Togneri, and Mohammed Bennamoun. Face Identification using Linear Regression. IEEE ICIP, 2009.

[77] Weilong Chen, Meng Joo Er, and Shiqian Wu. Illumination Compensation and Normalization for Robust Face Recognition Using Discrete Cosine Transform in Logarithm Domain. IEEE Trans. on Systems, Man and Cybernetics, 36(2):458–464, 2006.

[78] Bo-Gun Park, Kyoung-Mu Lee, and Sang-Uk Lee. Face Recognition using Face-ARG matching. IEEE Trans. on PAMI, 27(12):1982–1988, 2005.

[79] A. L. M Levada, D. C Correa, D. H. P Salvadeo, J. H Saito, and D. A Nelson. Novel Approaches for Face Recognition: Template-Matching using Dynamic Time Warping and LSTM Neural Network Supervised Classification. Intl. Conf. on Systems, Signals and Image Processing, pages 241–244, 2008.

[80] Chia-Te Liao and Shang-Hong Lai. A Novel Robust Kernel for Appearance-based Learning. Intl. Conf. on Pattern Recognition, pages 1–4, Dec. 2008.

[81] John Wright, Yi Ma, Julien Mairal, Guillermo Sapiro, Thomas Huang, and Shuicheng Yan. Sparse Representation for Computer Vision and Pattern Recognition. IEEE Intl. Conf. of Computer Vision and Pattern Recognition, 2009.

[82] Xiaoyin Xu and Majid Ahmadi. A Human Face Recognition System Using Neural Classifiers. Intl. Conf. on Computer Graphics, Imaging and Visualization, 2007.

[83] X He, S Yan, Y Hu, P Niyogi, and H-J Zhang. Face Recognition using Laplacianfaces. IEEE Trans. PAMI, 27(3):328–340, March 2005.

[84] P. J Huber. Robust Statistics. New York: John Wiley, 1981.

[85] H. B Nielsen. Computing a Minimizer of a Piecewise Quadratic - Implementation. Technical report, Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Sep. 1998.

[86] F. R Hampel, E. M Ronchetti, P. J Rousseeuw, and W. A Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 1986; reprinted 2005.

[87] K Madsen and H. B Nielsen. Finite Algorithms for Robust Linear Regression. BIT Computer Science and Numerical Mathematics, 30(4):682–699, 1990.

[88] R. A Maronna, R. D Martin, and V. J Yohai. Robust Statistics: Theory and Methods. Wiley, 2006.

[89] P. J Rousseeuw and A. M Leroy. Robust Regression and Outlier Detection. Wiley, 2003.

[90] T Sim, S Baker, and M Bsat. The CMU Pose, Illumination and Expression (PIE) Database of Human Faces. Technical Report CMU-RI-TR-01-02, Robotics Institute, Carnegie Mellon University, January 2001.

[91] H. F Chen, P. N Belhumeur, and D. J Kriegman. In Search of Illumination Invariants. In IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 13–15, 2000.

[92] L Zhang and D Samaras. Face Recognition under Variable Lighting using Harmonic Image Exemplars. In IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 19–25, 2003.

[93] J Zhao, Y Su, D Wang, and S Luo. Illumination Ratio Images: Synthesizing and Recognition with Varying Illumination. Pattern Recognition Letters, 24:2703–2710, 2003.

[94] S Shan, W Gao, B Cao, and D Zhao. Illumination Normalization for Robust Face Recognition against Varying Lighting Conditions. In IEEE Workshop on AMFG, pages 157–164, 2003.

[95] K. C Lee, J Ho, and D. J Kriegman. Acquiring Linear Subspaces for Face Recognition under Variable Lighting. IEEE Trans. PAMI, 27(5):684–698, May 2005.

[96] H Wang, Stan Z. Li, and Y Wang. Generalized Quotient Image. In IEEE Conf. on Computer Vision and Pattern Recognition, 2004.

[97] Weiwei Yu, Xiaolong Teng, and Chongqing Liu. Face Recognition using Discriminant Locality Preserving Projections. Image Vision Computing, 24:239–248, 2006.

[98] Xudong Xie and Kin-Man Lam. Face Recognition under Varying Illumination based on a 2D Face Shape Model. Pattern Recognition, 38, 2005.

[99] A Martinez. Recognizing Imprecisely Localized, Partially Occluded, and Expression Variant Faces from a Single Sample per Class. IEEE Trans. PAMI, 24(6):748–763, June 2002.

[100] Wenyi Zhao and Rama Chellappa, editors. Face Processing: Advanced Modelling and Methods. Academic Press, 2006.

[101] R. C Gonzalez and R. E Woods. Digital Image Processing. Pearson Prentice Hall, 2007.

[102] Linda G Shapiro and George C Stockman. Computer Vision. Prentice Hall, 2001.

[103] Charles Boncelet. Handbook of Image and Video Processing, chapter Image Noise Models. 2005.

[104] Junichi Nakamura. Image Sensors and Signal Processing for Digital Still Cameras. CRC Press, 2005.

[105] D. A Reynolds. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, 17(1-2):91–108, August 1995.

[106] D. A Reynolds, T. F Quatieri, and R. B Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3), 2000.

[107] R Kuhn, J-C Junqua, P Nguyen, and N Niedzielski. Rapid speaker adaptation in Eigenvoice space. IEEE Trans. on Speech and Audio Processing, 8(6):695–706, Nov 2000.

[108] R Kuhn, P Nguyen, J-C Junqua, L Goldwasser, N Niedzielski, S Fincke, K Field, and M Contolini. Eigenvoices for speaker adaptation. ICSLP, pages 1771–1774, 1998.

[109] S. M Chu, H Tang, and T. S Huang. Fishervoice and semi-supervised speaker clustering. ICASSP, pages 4089–4092, 2009.

[110] J Garofolo, L Lamel, W Fisher, J Fiscus, D Pallett, and N Dahlgren. DARPA TIMIT: Acoustic-phonetic continuous speech corpus CD-ROM. LDC catalog number LDC93S1, 1993.

[111] E. Candes. Compressive sampling. In International Congress of Mathematicians, 2006.

[112] S Young, D Kershaw, J Odell, D Ollason, V Valtchev, and P Woodland. Hidden Markov model toolkit (HTK) version 3.4 user guide. 2002.

[113] S Canu, Y Grandvalet, V Guigue, and A Rakotomamonjy. SVM and kernel methods Matlab toolbox. Perception Systèmes et Information, INSA de Rouen, Rouen, France, 2005.

[114] Imran Naseem, Roberto Togneri, and Mohammed Bennamoun. Sparse Representation for Speaker Identification. International Conference on Pattern Recognition, 2010.
