Human Recognition Using Local 3D Ear
and Face Features
Syed Mohammed Shamsul Islam
This thesis is presented for the degree of
Doctor of Philosophy
of The University of Western Australia
School of Computer Science and Software Engineering.
June 2010
© Copyright 2010 by Syed Mohammed Shamsul Islam

THE UNIVERSITY OF WESTERN AUSTRALIA
DECLARATION FOR THESES CONTAINING PUBLISHED WORK AND/OR WORK PREPARED FOR PUBLICATION
The examination of the thesis is an examination of the work of the student. The work must have been substantially conducted by the student during enrolment in the degree.
Where the thesis includes work to which others have contributed, the thesis must include a statement that makes the student's contribution clear to the examiners. This may be in the form of a description of the precise contribution of the student to the work presented for examination and/or a statement of the percentage of the work that was done by the student.
In addition, in the case of co-authored publications included in the thesis, each author must give their signed permission for the work to be included. If signatures from all the authors cannot be obtained, the statement detailing the student's contribution to the work must be signed by the coordinating supervisor.
Please tick one of the statements below.
1. This thesis does not contain work that I have published, nor work under review for publication.
2. This thesis contains only sole-authored work, some of which has been published and/or prepared for publication under sole authorship. The bibliographical details of the work and where it appears in the thesis are outlined below.
3. [✓] This thesis contains published work and/or work prepared for publication, some of which has been co-authored. The bibliographical details of the work and where it appears in the thesis are outlined below. The student must attach to this declaration a statement for each publication that clarifies the contribution of the student to the work. This may be in the form of a description of the precise contributions of the student to the published work and/or a statement of percent contribution by the student. This statement must be signed by all authors. If signatures from all the authors cannot be obtained, the statement detailing the student's contribution to the published work must be signed by the coordinating supervisor.
The bibliographical details of the publications included in this thesis and the contribution of the candidate to those appear at pages XIV and XVII respectively in the thesis.
Student Signature . . . . .
Coordinating Supervisor Signature . . . . .
Heartily dedicated to my parents
Syed Mohammed Nazrul Islam and Mosammat Jebunnesa
and also to my wife Hafeza
for all their love and inspiration
Abstract
The field of Biometrics is rapidly gaining popularity due to increasing breaches
of traditional security systems and the decreasing costs of sensors. Among the bio-
metric traits, the ear and the face are considered to be the most socially accepted
due to their easy and non-intrusive data acquisition. Furthermore, their feature
richness and physical proximity make them good candidates for fusion. However,
occlusions due to the presence of hair and ornaments and deformations due to facial
expressions pose great challenges for real-life applications of these two biometrics.
These challenges are addressed in this dissertation through the development of ef-
ficient and robust algorithms for ear detection, ear data representation and finally,
the combination of ear and face biometrics using robust fusion techniques. The
dissertation is organized as a set of papers already published and/or submitted to
journals or internationally refereed conferences.
In this dissertation, a fast and fully automatic approach for detecting 3D ears
from corresponding 2D and 3D profile images using a Cascaded AdaBoost algorithm
is proposed. The classifiers are trained with three new Haar-like features and the
detection is performed using a 16 × 24 detection window placed around the ear.
The approach is highly robust to hair, earrings and earphones and, unlike other
approaches, it does not require any assumption about the localization of the nose or
the ear pit. The proposed ear detection approach achieves a detection rate of 99.9%
on the UND-J Biometrics Database with 830 images of 415 subjects (the largest
publicly available profile database), taking only 7.7 ms per image on average using
a C++ implementation on a Core 2 Quad 9550, 2.83 GHz PC.
For ear recognition, I initially proposed to apply the Iterative Closest Point
(ICP) algorithm in a hierarchical manner: first with a low-resolution and then with
a higher-resolution mesh of the 3D ear data. The results obtained in the first stage are used
for coarse alignment for the next stage and thus the computational cost of this
accurate iterative algorithm is reduced. In order to achieve better efficiency and
robustness to occlusions, 3D local features (L3DFs) are used for data representation
and matching. Local features are used to develop a rejection classifier, to extract a
minimal rectangular feature-rich region and to compute the initial alignment for the
ICP algorithm. An improved technique for feature matching is also proposed using
geometric consistency among the corresponding features. On the UND-J database
with 415 gallery and 415 probe images, an identification rate of 93.5% and an Equal
Error Rate (EER) of 4.1% are obtained. The corresponding rates on a new dataset
of 50 subjects, all wearing earphones, are 98% and 1% respectively. With an
un-optimized MATLAB implementation, the average times required for the L3DF-based
matching and for the full matching including ICP are 0.06 and 2.28 seconds respectively.
In order to further increase the robustness, two techniques are presented for
fusing the ear biometrics with face biometrics. In score-level fusion, scores from
the face are computed using the same matching technique proposed for the ear
and a weighted sum rule with complementary weights is used for fusion. For
the fusion of the ear and the face local features (feature-level fusion), the shape
similarity among the local features from the two different modalities is utilized in
the construction of the multimodal ear-face gallery and probe datasets prior to
concatenation. Matching is performed using L3DF-based similarity measures similar
to those used for unimodal matching. The proposed score-level fusion
technique achieves identification (rank-1) and verification (at 0.001 FAR) rates of
99.4% and 99.7% respectively with neutral facial expression and 96.8% and 97.1%
respectively with non-neutral facial expressions on the largest available multimodal
dataset using FRGC v.2 and UND databases. The feature-level fusion approach
achieves an accuracy comparable to that of the score-level fusion without requiring
ICP-like expensive algorithms in matching.
The unimodal and multimodal approaches proposed in this dissertation for ear
and face biometrics can be extended for recognition with other biometric traits and
objects and for other applications such as robotics, medicine and forensic sciences.
Acknowledgements
I am thankful to God for giving me the opportunity and strength to complete
my Ph.D. I am also grateful to many people whose help and support made this
journey possible. First of all, I would like to thank my supervisors Mohammed
Bennamoun, Robyn Owens and Rowan Davies. Their continuous guidance and
support were always a source of inspiration for me. They created a motivating,
enthusiastic and friendly environment which is ideal for research. Their critical and
insightful feedback greatly improved my work and quality of presentation.
I am grateful to Dr. Ajmal Syeed Mian for the useful discussions we had regarding
3D shape acquisition and biometrics. I also had useful discussions with other Ph.D.
candidates, visiting scholars and research fellows including Professor Wesley Snyder
from North Carolina State University and Faisal Al-Osaimi. I would also like to
thank the anonymous reviewers whose constructive criticism and feedback helped
me improve my papers and subsequent work.
I am thankful to Jonathan Wan, K. M. Tracey, Ashley Chew, Laurie McKeaig,
Joe Sandon and other staff members in the Computer Science Support Group for
their technical assistance. I would also like to acknowledge the co-operation and
participation of the students and staff members of the CSSE and other schools of
UWA in 3D ear and face data acquisition in our laboratory.
I extend my sincere gratitude to my parents who taught me the value of hard
work and always prayed for my success. I would also like to thank my wife and
kids for their patience during the course of my Ph.D. Without my wife’s support, I
would not have had the peace of mind required for research. I am also grateful to
all my relatives and friends who kept me motivated by reinforcing my enthusiasm.
Finally, I would like to acknowledge the institutions and people who made their
data available for public use, including The University of Notre Dame, the National
Institute of Standards and Technology (NIST), University of Surrey, the University
of Sheffield and the Center for Biological and Computational Learning at MIT for
their face databases, the University of Science and Technology Beijing for their ear
database and D’Errico for the surface fitting code. I also acknowledge the financial
and logistical support obtained through grants DP0664228 and LE0775672 of the
Australian Research Council (ARC) and the Scholarships for International Research
Fees (SIRFs), University International Stipend (UIS), SIRF Safety Net Top-Up
Scholarships and SIRFs-Completion offered to the candidate by the University of
Western Australia.
Contents
List of Tables vii
List of Figures ix
Publications Included in this Thesis xiv
Contribution of Candidate to Published Papers xvii
1 Introduction 1
1.1 Motivations of the Research . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Structure of the dissertation . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 A Review of Recent Advances in 3D Ear and Expression In-
variant Face Biometrics . . . . . . . . . . . . . . . . . . . . . . 6
1.4.2 An ICP Based Hierarchical Matching Approach for 3D Ear
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.3 A Fast and Fully Automatic Ear Recognition Approach Based
on 3D Local Surface Features . . . . . . . . . . . . . . . . . . 6
1.4.4 Refining Local 3D Feature Matching through Geometric Con-
sistency for Robust Biometric Recognition . . . . . . . . . . . 7
1.4.5 Efficient Detection and Recognition of Textured 3D Ears . . . 7
1.4.6 Fusion of 3D Ear and Face Biometrics for Robust Human
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 A Review of Recent Advances in 3D Ear and Expression Invariant
Face Biometrics 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Preliminary Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 2D versus 3D Biometrics . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Unimodal Biometrics . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Multimodal Biometrics . . . . . . . . . . . . . . . . . . . . . . 13
2.2.5 Fusion Techniques . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Image Acquisition Techniques . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Existing Databases . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Face Detection Techniques . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 2D Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 3D Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Ear Detection Techniques . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Using Only 2D Data . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2 Using Only 3D Data . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.3 Using Both 2D and 3D Data . . . . . . . . . . . . . . . . . . . 22
2.5.4 Discussion of the Ear Detection Techniques . . . . . . . . . . . 23
2.6 3D Data Representation Techniques . . . . . . . . . . . . . . . . . . . 23
2.6.1 Global Representation . . . . . . . . . . . . . . . . . . . . . . 24
2.6.2 Local Representation . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.3 Comparative Evaluation of the Representation Techniques . . 27
2.7 Recognition Techniques with 3D Face Data . . . . . . . . . . . . . . . 28
2.7.1 Rigid Approaches . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7.2 Non-rigid Approaches . . . . . . . . . . . . . . . . . . . . . . . 30
2.8 Recognition Techniques with 3D Ear Data . . . . . . . . . . . . . . . 34
2.8.1 Approaches Using Local Features . . . . . . . . . . . . . . . . 35
2.8.2 Approaches Using Global Features . . . . . . . . . . . . . . . 35
2.8.3 Approaches Without Extracting Any Features . . . . . . . . . 36
2.8.4 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . 36
2.9 Multi-Biometric Recognition with 3D Ear and Face . . . . . . . . . . 38
2.10 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.10.1 Image Acquisition Related Challenges . . . . . . . . . . . . . . 40
2.10.2 Robustness Related Challenges . . . . . . . . . . . . . . . . . 41
2.10.3 Efficiency Related Challenges . . . . . . . . . . . . . . . . . . 41
2.10.4 Application Related Challenges . . . . . . . . . . . . . . . . . 42
2.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 An ICP Based Hierarchical Matching Approach for 3D Ear Recog-
nition 45
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 3D Ear Detection and Normalization . . . . . . . . . . . . . . . . . . 47
3.3 3D Ear Matching and Recognition . . . . . . . . . . . . . . . . . . . . 48
3.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.1 Dataset Used . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.2 Recognition Rate of the Single-step ICP . . . . . . . . . . . . 49
3.4.3 Improvement Obtained with the Two-step ICP . . . . . . . . 50
3.4.4 Analysis of the Misclassification . . . . . . . . . . . . . . . . . 50
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 A Fast and Fully Automatic Ear Recognition Approach Based on
3D Local Surface Features 53
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Ear Data Extraction and Normalization . . . . . . . . . . . . 55
4.2.2 Feature Location Identification . . . . . . . . . . . . . . . . . 56
4.2.3 3D Local Feature Extraction . . . . . . . . . . . . . . . . . . . 58
4.2.4 Feature Matching . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.5 Coarse Registration of Gallery and Probe Data . . . . . . . . 60
4.2.6 Fine Matching with ICP . . . . . . . . . . . . . . . . . . . . . 60
4.2.7 Final Similarity Measures . . . . . . . . . . . . . . . . . . . . 60
4.3 Results and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.1 Data Set Used . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.2 Recognition Rate with 3D Local Features Only . . . . . . . . 61
4.3.3 Fine Matching with the ICP . . . . . . . . . . . . . . . . . . . 62
4.3.4 Occlusion and Pose Invariance . . . . . . . . . . . . . . . . . . 62
4.3.5 Analysis of the Failures . . . . . . . . . . . . . . . . . . . . . . 63
4.3.6 Recognition Speed . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 Refining Local 3D Feature Matching through Geometric Consis-
tency for Robust Biometric Recognition 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Proposed Refinement Technique . . . . . . . . . . . . . . . . . . . . . 67
5.3.1 Computation of Distance Consistency . . . . . . . . . . . . . . 68
5.3.2 Computation of Rotation Consistency . . . . . . . . . . . . . . 68
5.3.3 Final Similarity Measures . . . . . . . . . . . . . . . . . . . . 69
5.4 Result and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5 Comparative Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 74
6 Efficient Detection and Recognition of Textured 3D Ears 75
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Related Work and Contributions . . . . . . . . . . . . . . . . . . . . 77
6.2.1 Ear Detection Approaches . . . . . . . . . . . . . . . . . . . . 77
6.2.2 Ear Recognition Approaches . . . . . . . . . . . . . . . . . . . 79
6.2.3 Motivations and Contributions . . . . . . . . . . . . . . . . . . 81
6.3 Automatic Detection and Extraction of Ear Data . . . . . . . . . . . 82
6.3.1 Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.2 Construction of the Classifiers . . . . . . . . . . . . . . . . . . 84
6.3.3 Training the Classifiers . . . . . . . . . . . . . . . . . . . . . . 85
6.3.4 Ear Detection with the Cascaded Classifiers . . . . . . . . . . 86
6.3.5 Multi-detection Integration . . . . . . . . . . . . . . . . . . . . 87
6.3.6 3D Ear Region Extraction . . . . . . . . . . . . . . . . . . . . 88
6.3.7 Extracted Ear Data Normalization . . . . . . . . . . . . . . . 89
6.4 Representation and Extraction of Local 3D Features . . . . . . . . . . 89
6.4.1 KeyPoint Selection for L3DFs . . . . . . . . . . . . . . . . . . 89
6.4.2 Feature Extraction and Compression . . . . . . . . . . . . . . 91
6.5 L3DF Based Matching Approach . . . . . . . . . . . . . . . . . . . . 92
6.5.1 Finding Correspondence Between Candidate Features . . . . . 92
6.5.2 Filtering with Geometric Consistency . . . . . . . . . . . . . . 93
6.5.3 Other Similarity Measures Based on L3DFs . . . . . . . . . . 95
6.5.4 Building a Rejection Classifier . . . . . . . . . . . . . . . . . . 96
6.5.5 Extraction of a Minimal Rectangular Area . . . . . . . . . . . 96
6.5.6 Coarse Alignment of Gallery-Probe Pairs . . . . . . . . . . . . 97
6.5.7 Fine Alignment with the ICP . . . . . . . . . . . . . . . . . . 97
6.6 Detection Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.6.1 Correct Detection . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.6.2 False Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.6.3 Detection Speed . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.7 Recognition Performance . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.7.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.7.2 Identification and Verification Results on the UND Database . 103
6.7.3 Robustness to Occlusions . . . . . . . . . . . . . . . . . . . . . 105
6.7.4 Robustness to Pose Variations . . . . . . . . . . . . . . . . . . 106
6.7.5 Analysis of the Failures . . . . . . . . . . . . . . . . . . . . . . 106
6.7.6 Speed of Recognition . . . . . . . . . . . . . . . . . . . . . . . 107
6.7.7 Evaluation of L3DFs, Geometric Consistency Measures and
Minimal Rectangular Area of Dataset . . . . . . . . . . . . . . 108
6.8 Comparison with Other Approaches . . . . . . . . . . . . . . . . . . . 109
6.8.1 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.8.2 Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7 Fusion of 3D Ear and Face Biometrics for Robust Human Recog-
nition 115
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Related Work and Contributions . . . . . . . . . . . . . . . . . . . . 118
7.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2.2 Motivations and Contributions . . . . . . . . . . . . . . . . . . 119
7.3 Data Acquisition and Feature Extraction . . . . . . . . . . . . . . . . 121
7.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.3.2 Ear and Face Data Extraction . . . . . . . . . . . . . . . . . . 121
7.3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4 Unimodal Matching and Score-level Fusion . . . . . . . . . . . . . . . 123
7.4.1 Matching Technique . . . . . . . . . . . . . . . . . . . . . . . 123
7.4.2 Fusion of Scores . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.5 Feature-Level Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.5.1 Fusion of L3DFs from Ear and Face . . . . . . . . . . . . . . . 127
7.5.2 Matching the Fused Features . . . . . . . . . . . . . . . . . . . 127
7.6 Performance of the Score-level Fusion Approach . . . . . . . . . . . . 130
7.6.1 Results Using L3DF-Based Measures Only . . . . . . . . . . . 131
7.6.2 Improvement Using ICP . . . . . . . . . . . . . . . . . . . . . 132
7.6.3 Misclassifications . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.7 Performance of the Feature-Level Fusion Approach . . . . . . . . . . 134
7.7.1 Results on Data with Neutral Expression . . . . . . . . . . . . 134
7.7.2 Results on Data with Non-neutral Facial Expressions . . . . . 134
7.7.3 Choice of the Number of PCA Components . . . . . . . . . . 134
7.7.4 Performance of Different Similarity Measures . . . . . . . . . . 135
7.8 Comparative Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.8.1 Comparison between the Proposed Fusion Approaches . . . . 137
7.8.2 Comparison with Other Approaches . . . . . . . . . . . . . . . 137
7.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8 Conclusion 141
8.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Bibliography 145
List of Tables
2.1 Some existing profile and frontal face databases . . . . . . . . . . . . 15
2.2 Summary of recognition approaches for face with varying expressions-1 32
2.3 Summary of recognition approaches for face with varying expressions-2 33
2.4 Summary of the existing 3D ear recognition approaches . . . . . . . . 37
2.5 Multi-biometric Approaches with 3D ear and face data . . . . . . . . 38
6.1 Summary of the existing 3D ear recognition approaches . . . . . . . . 80
6.2 Ear detection results on different datasets . . . . . . . . . . . . . . . 98
6.3 Performance variations for using L3DFs and geometric consistency
measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4 Comparison of the detection approach of this paper with the ap-
proaches of Chen and Bhanu [32] and Yan and Bowyer [179] . . . . . 109
6.5 Comparison of the recognition approach of this paper with others on
the UND database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1 Multi-biometric approaches with 2D and 3D ear and face data . . . . 117
7.2 Summary of score-level fusion results . . . . . . . . . . . . . . . . . . 130
7.3 Summary of the comparison of our fusion approaches with others
(identification and verification rates are measured at rank-1 and at
0.001 FAR respectively) . . . . . . . . . . . . . . . . . . . . . . . . . 136
List of Figures
2.1 Salient features of an external ear (left and right images are 2D and
3D views respectively of the same ear). . . . . . . . . . . . . . . . . . 10
2.2 Taxonomy of the biometric approaches with ear and face that are cov-
ered in this manuscript with reference to the relevant section number
within braces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Classification of fusion techniques in multimodal biometric systems
(adapted from [84]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Comparative results of 2D face detection (performed and illustrated
in [183]): (a) Using an OpenCV-based implementation of cascaded Ad-
aBoost [131] and (b) Using the approach of Nilsson et al. [128] (best
seen in color). Notice that the approach in (a) failed to detect the
face of the first person. . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Samples of 3D face detection: (a) in the presence of occlusions (b) in
case of multiple faces in a scene (c) in large scale variations [126]. . . 18
2.6 Different steps of localizing ear in the approach of Ansari and Gupta
in [7]: (a) Convex and concave edges extracted from a side face image
(b) Possible outer helix curves (c) Completed ear boundary (best seen
in color). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Samples of ear detection under severe occlusion using AdaBoost [74] (best
seen in color). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 Block diagram of the ear detection approach proposed by Yan and
Bowyer [179]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.9 Global data representation: (a) the SFR [114] (b) the balloon image
(SSR convexity map) [124]. . . . . . . . . . . . . . . . . . . . . . . . 24
2.10 Local surface data representation: (a) point signatures (face surface
and sphere) (b) iso-contours (c) the spin image (d) the tensor (e) A
feature point (asterisk) and its neighbors (dots) and basic constituents
of an LSP [32] (f) L3DF [83]: A local surface (right image) and the
region of ear from which it is extracted (shown by a circle on the left
image). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.11 Sample of 3D face (top), variance in expressions (middle) and expres-
sion insensitive binary masks of 3D faces (bottom) [114] (best seen in
color). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.12 Illustration of deformation models: (a) Original, annotated face model,
geometry image and normal image (left to right) used in [91] (b) Orig-
inal test scan, anchor point extraction and deformable model used
in [107] (c) Original and bilinear model of two different expressions [125]. 31
2.13 Block diagram of the ear recognition system proposed in [83]. . . . . . 34
2.14 Identification plots of ear, face and fusion of these two modalities [152]. 39
2.15 Block diagram of the L3DF based ear-face multimodal recognition
system fused at score level [76]. . . . . . . . . . . . . . . . . . . . . . 40
3.1 Block diagram of the 3D ear detection approach. . . . . . . . . . . . . 48
3.2 Sample of the full and reduced meshes of the extracted 3D ear data. . 49
3.3 Flowchart of the matching algorithm using coarse-to-fine hierarchical
technique with ICP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4 Example of recognition: (a) 2D and range image of the gallery ear
(b) 2D and range image of the probe ear with a small ear-ring and
hair (This figure is best seen in color). . . . . . . . . . . . . . . . . . 51
3.5 Recognition rates with Single-step and Two-step ICP. . . . . . . . . . 51
3.6 Examples of improvement with the two-step ICP which was not cor-
rectly recognized by the single-step ICP [left image is the gallery and
right one is the probe]. . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7 Examples of misclassification:(a, b, c) 2D image of a gallery (left) and
the corresponding probe (right) with pose variations, (d) 2D image
of a probe with occlusions (e) Range image of a probe having missing
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Block diagram of the proposed ear recognition system. . . . . . . . . 56
4.2 Locations of local features (shown with dots) on the range images of
different views (in rows) of different individuals (in columns). (This
figure is best seen in color). . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Repeatability of local feature locations . . . . . . . . . . . . . . . . . 58
4.4 Example of a 3D local surface (right image). The region from which
it is extracted is shown by a circle on the left image. . . . . . . . . . . 59
4.5 Identification results. . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Examples of correct recognition in the presence of occlusions. (a)
With ear-rings. and (b) With hair. (2D and the corresponding range
images are placed in the top and bottom row respectively) . . . . . . 62
4.7 Example of correct recognition of gallery-probe pairs with pose vari-
ations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.8 Example of misclassification. (a) With large pose variations. (b)
With ear-ring, hair and pose variations. . . . . . . . . . . . . . . . . . 63
5.1 Example of a 3D local surface (right image) and the region from which
it is extracted (left image, marked with a circle) [83]. . . . . . . . . . 67
5.2 Feature correspondences before (left image) and after (right image)
filtering with geometric consistency (best seen in color). . . . . . . . . 69
5.3 Identification results for fusion of ears and faces on dataset A: (a)
without using geometric consistency. (b) with geometric consistency . 70
5.4 Verification results for fusion of ears and faces on dataset A: (a) with-
out using geometric consistency. (b) with geometric consistency . . . 70
5.5 Identification results using geometric consistency: (a) on dataset B
(b) on dataset C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.6 Comparing identification performance of different geometric consis-
tency measures (on dataset C) . . . . . . . . . . . . . . . . . . . . . . 73
6.1 Block diagram of the proposed ear detection and recognition system. 76
6.2 Block diagram of the proposed ear detection approach. . . . . . . . . 83
6.3 Features used in training the AdaBoost (features (f), (g) and (h) are
proposed for detecting specific features of the ear) . . . . . . . . . . . 84
6.4 Examples of ear (top) and non-ear (bottom) images used in the training. 85
6.5 Sample of detections: (a) Detection with single window. (b) Multi-
detection integration (best seen in color). . . . . . . . . . . . . . . . . 87
6.6 Example of a 3D local surface (right image). The region from which
it is extracted is shown by a circle on the left image. . . . . . . . . . . 90
6.7 (a) Location of keypoints on the gallery (left) and the probe (right)
images of three different individuals (this figure is best seen in color).
(b) Cumulative percentage of repeatability of the keypoints. . . . . . 90
6.8 Feature correspondences after filtering with geometric consistency
(best seen in color). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.9 Extraction of a minimal rectangular area containing all the matching
L3DFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.10 Detection in presence of occlusions with hair and ear-rings (the inset
image is the enlargement of the corresponding detected ear). . . . . . 99
6.11 Example of test images for which the detector failed. . . . . . . . . . 99
6.12 Example of ear images detected from profile images with ear-phones. 99
6.13 Detection performance under two different types of synthesized oc-
clusions (on a subset of the UND-F dataset with 203 images). . . . . 100
6.14 Detection of a motion blurred image. . . . . . . . . . . . . . . . . . . 101
6.15 False detection evaluation: (a) FAR (on number of profile images)
with respect to number of stages in the cascade. (b) The ROC curve
for classification of cropped ear and non-ear images (best seen in color). 101
6.16 Recognition results on the UND-J dataset: (a) Identification rate.
(b) Verification rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.17 Examples of correct recognition in presence of occlusions: (a) With
ear-rings and (b) With hair (c) With ear-phones (2D and the corre-
sponding range images are placed in the top and bottom row respec-
tively (best seen in color)). . . . . . . . . . . . . . . . . . . . . . . . . 104
6.18 Recognition results on the UND-G dataset: (a) Identification rate for
different off center pose variations. (b) CMC curves for 30◦ and 45◦
off center pose variations. . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.19 Examples of correct recognition of four gallery-probe pairs with pose
variations (best seen in color). . . . . . . . . . . . . . . . . . . . . . . 106
6.20 Examples of failures: (a) Two probe images with missing data. (b)
A gallery-probe pair with a large pose variation. . . . . . . . . . . . . 107
6.21 Comparing identification performance of different L3DF-based simi-
larity measures (without ICP on the UND-F dataset). . . . . . . . . . 108
7.1 Data acquisition using Minolta Vivid 910 scanner: (a) 2D and 3D
profile images captured to extract 3D ear data. (b) 3D frontal face
scanned to extract 3D face data . . . . . . . . . . . . . . . . . . . . . 122
7.2 Example of an extracted 3D local surface feature [83]. . . . . . . . . 123
7.3 Feature correspondences established between a gallery and a probe
ear after matching is performed. The image of the probe ear (right
image) is flipped for better visibility (best seen in color). . . . . . . . 124
7.4 Block diagram of the proposed multimodal recognition system with
score-level fusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.5 Block diagram of the proposed multimodal recognition system with
feature-level fusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.6 Correspondences between the features extracted from the frontal face
and those from the ear of the same person. These correspondences are
established using algorithm 1. Only the first 40 features are shown
for a better visibility (best seen in color). . . . . . . . . . . . . . . . . 126
7.7 Block diagram of the feature-set constructed by fusion of ear and
face L3DFs (subscript ck indicates the index number of the closest
feature with respect to its left or right feature for ear-face and face-
ear combination respectively). . . . . . . . . . . . . . . . . . . . . . . 128
7.8 Repeatability of fused features in the gallery and probe images of ten
individuals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.9 Identification results for score-level fusion of ear and face (without
using ICP scores): (a) with neutral expression. (b) with non-neutral
expressions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.10 ROC curves for score-level fusion of ear and face (without using ICP
scores): (a) with neutral expression. (b) with non-neutral expressions. 131
7.11 Recognition rates for different combinations of ear and face weights
using the weighted sum rule. . . . . . . . . . . . . . . . . . . . . . . . 132
7.12 Examples of 2D and corresponding range images of four correctly
recognized probes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.13 Examples of five misclassified multimodal probes where face data have
large expression changes and the ear data have data losses due to hair
and large out-of-plane pose variations compared to their respective
gallery data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.14 Identification results for feature-level fusion of ear and face features
under neutral and non-neutral facial expressions. . . . . . . . . . . . . 134
7.15 Verification results for feature-level fusion of ear and face features
under neutral and non-neutral expressions. . . . . . . . . . . . . . . . 135
7.16 Effect of using a different number of PCA components on the iden-
tification results (data with non-neutral expressions: 100 gallery and
100 probe images). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.17 Performance of various similarity measures used in feature-level fusion
matching for face data with neutral expression. . . . . . . . . . . . . 136
Publications Arising from This Dissertation
This dissertation contains works already published and/or submitted to journals
or internationally refereed conferences. The bibliographical details of each work and
where it appears in the dissertation are outlined below.
International Journal Publications (Fully Refereed)
[1] Islam, S.M.S., Bennamoun, M., Owens, R. and Davies, R., “A Review of
Recent Advances in 3D Ear and Expression Invariant Face Biometrics,” ACM
Computing Surveys (revision in preparation), Nov. 2009. (Chapter 2)
[2] Islam, S.M.S., Davies, R., Bennamoun, M. and Mian, A.S., “Efficient Detection
and Recognition of Textured 3D Ears,” International Journal of Computer
Vision (revision under review), March 2010. (Chapter 6)
[3] Islam, S.M.S., Davies, R., Bennamoun, M., Owens, R.A. and Mian, A.S., “Fu-
sion of Local 3D Ear and Face Features for Human Recognition,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence (under review), June
2010. (Chapter 7)
Book Chapters (Fully Refereed)
[4] Islam, S.M.S., Bennamoun, M., Owens, R. and Davies, R., “Biometric
Approaches of 2D-3D Ear and Face: A Survey,” in Computer and Information
Sciences and Engineering, T. Sobh (ed.), pages 509-514, 2008.
The preliminary ideas of this paper were refined and extended to contribute
towards [1] which forms Chapter 2 of this dissertation.
[5] Islam, S.M.S., Davies, R., Mian, A.S. and Bennamoun, M., “A Fast
and Fully Automatic Ear Recognition Approach Based on 3D Local Surface
Features,” J. Blanc-Talon et al. (Eds.): ACIVS 2008, Lecture Notes in Com-
puter Science (LNCS) volume 5259, pages 1081-1092, Oct. 2008. (Chapter
4)
[6] Islam, S.M.S., Bennamoun, M., Mian, A.S. and Davies, R., “Score Level Fusion
of Ear and Face Local 3D Features for Fast and Expression-invariant Human
Recognition,” M. Kamel and A. Campilho (Eds.): ICIAR 2009, Lecture Notes
in Computer Science (LNCS) volume 5627, Springer, Heidelberg, pages 387-
396, 2009.
The preliminary ideas of this paper were refined and extended to contribute
towards [3] which forms Chapter 7 of this dissertation.
International Conference Publications (Fully Refereed)
[7] Islam, S.M.S., Bennamoun, M. and Davies, R., “Fast and Fully Automatic
Ear Detection Using Cascaded AdaBoost”. In Proc. of IEEE Workshop on
Application of Computer Vision, 2008 (WACV’08), USA, pages 1-6, Jan.7-8,
2008.
The preliminary ideas and results of this paper were refined and extended to
contribute towards [2] which forms Chapter 6 of this dissertation.
[8] Islam, S.M.S., Bennamoun, M., Mian, A.S. and Davies, R., “A Fully Auto-
matic Approach for Human Recognition from Profile Images Using 2D and 3D
Ear Data,” In Proc. of the Fourth International Symposium on 3D Data Pro-
cessing, Visualization and Transmission (3DPVT’08), Atlanta, GA, USA,
pages 131-141, June 18-20, 2008. (Chapter 3)
[9] Islam, S.M.S. and Davies, R., “Refining Local 3D Feature Matching through
Geometric Consistency for Robust Biometric Recognition,” In Proc. of Digi-
tal Image Computing: Techniques and Applications (DICTA), pages 513-518,
December 2009. (Chapter 5)
[10] Islam, S.M.S., Bennamoun, M., Mian, A.S. and Davies, R., “Expression-Robust
Human Recognition from Local 3D Ear and Face Features,” In Proc. of
the Tenth Electrical Engineering and Computing Symposium (PEECS 2009),
Perth, Australia, Oct., 2009.
The preliminary ideas and results of this paper were refined and extended to
contribute towards [3] which forms Chapter 7 of this dissertation.
Note: In the 2008 Journal Citation Report, ACM Computing Surveys (ACMCS) is
ranked number one among 84 Computer Science Theory and Methods titles. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence (TPAMI) and International
Journal of Computer Vision (IJCV) are the top two ranking journals in the area of
Computer Vision. TPAMI is also ranked number one among 229 Electrical Engi-
neering titles in the same report. WACV, 3DPVT, DICTA and ACIVS conferences
are ranked as A, C, B and B respectively in the 2008 ranking of the Computing Re-
search and Education Association of Australia (CORE) and the 2010 ranking of the
Excellence in Research for Australia (ERA) Initiative, a program of the Australian
Research Council (ARC).
Contribution of the Candidate to the Published
Work
The contribution of the candidate to all the published papers was 80%. He co-
authored them with his three supervisors and in some cases also with one member of
the same research group. The candidate developed and implemented the algorithms,
performed the experiments and wrote the papers. Other authors reviewed the papers
and provided useful feedback for improvement.
CHAPTER 1
Introduction
Biometric recognition systems are rapidly gaining popularity due to increasing breaches
of traditional security systems and the decreasing costs of sensors. The current re-
search trend is to combine multiple biometric traits to improve accuracy and robust-
ness. The availability of a rich set of distinctive features and the possibility
of easy and non-intrusive data acquisition make the ear and the face more popular
than other biometric traits. They are also good candidates for fusion due to their
physical proximity. However, occlusions due to the presence of hair and ornaments
and deformations due to facial expressions pose great challenges for real-life appli-
cations of these two biometrics. The main objective of this dissertation is to address
these challenges by developing an efficient and robust approach for ear detection, ear
data representation and finally, combining ear biometrics with face biometrics using
robust fusion techniques. In this chapter, the research motivation of this work is
elaborated followed by a concise statement of the research problem, a list of my ma-
jor contributions and the organization of the dissertation as a set of papers already
published and/or submitted to journals or internationally refereed conferences.
1.1 Motivations of the Research
Incidents of breaches of traditional password or identity card-based systems have
been increasing alarmingly. CIFAS [43], the UK’s fraud prevention service reports a
16% increase in identity frauds in the UK in the year 2008. In the USA, the number
of victims of such fraud increased by 22%, with losses of USD 48 billion over the
year 2008 [87]. Such a rapid increase in security breaches is a strong justification
for replacing or augmenting traditional recognition systems with biometric systems
which are based on human traits that cannot be stolen, denied or faked easily. In
a biometric system one (in the unimodal case) or multiple (in the multimodal case)
physiological (e.g. face, fingerprint, palmprint, iris and DNA) or behavioral (such
as handwriting, gait and voice) characteristics of a subject are taken into consid-
eration for automatic recognition purposes [16, 85, 146]. A biometric recognition
system may operate in one or both of two modes: authentication and identification.
In authentication, one-to-one matching is performed to compare a user’s biomet-
ric to the template of a claimed identity. In identification, one-to-many matching
is performed to associate an identity with the user by matching it against every
identity in a database. Biometric systems can be applied efficiently in numerous
real-environment applications including various civil IDs (e.g. passports, drivers
licenses, national ID cards and voter ID cards), access control, surveillance, law en-
forcement, multimedia management and human computer interaction. With these
priorities, the allocation of government funds and the increasing computational ef-
ficiency of modern computers have boosted the emergence of biometric recognition
systems over the last few years [14].
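To make the two operating modes above concrete, the following sketch contrasts them. It is purely illustrative and not code from this dissertation: match_score stands for any similarity function between a probe sample and a stored template, and the gallery is assumed to be a mapping from identities to enrolled templates.

```python
# Illustrative sketch only: match_score(a, b) is any similarity function and
# gallery is assumed to map identity -> enrolled template.

def authenticate(probe, claimed_id, gallery, match_score, threshold):
    """Authentication mode: one-to-one match against the claimed identity."""
    return match_score(probe, gallery[claimed_id]) >= threshold

def identify(probe, gallery, match_score):
    """Identification mode: one-to-many match against every enrolled identity."""
    return max(gallery, key=lambda identity: match_score(probe, gallery[identity]))
```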
Biometric systems operating on a single biometric trait suffer from a number
of problems including noise in sensed data, intra-class variations, inter-class similar-
ities (i.e. overlaps of features in the case of large databases), non-universality (i.e.
meaningful biometric data may not be acquired from a subset of users) and spoof
attacks [146]. To overcome these problems, multimodal approaches have been proposed.
A system is called multimodal if it collects data from different biometric sources or
uses different types of sensors (e.g. infra-red and reflected light), or uses multiple
samples of data or multiple algorithms [158] to combine the data [16]. Face and
voice data were unified in numerous early works listed in the study of Ross and
Govindarajan [144]. Frischholz and Dieckmann [55] integrated lip data with voice
and face biometrics. Finger, hand, ear, gait and iris images were also integrated
with face images taken by different sensors. In such multimodal systems a decision
can be made on the basis of different subsets of biometrics depending on their avail-
ability and confidence. Such systems are also more robust to spoof attacks as it is
relatively difficult to spoof multiple biometrics simultaneously.
Among the biometric traits, the face and the ear have become popular due
to their rich set of distinctive features as well as the possibility of easy and non-
intrusive acquisition of their images. There has been increasing research interest
in using these two biometric traits for identification and authentication purposes
in the last few years. However, the accuracy and the robustness required for real-
world applications are still to be achieved. Face recognition with neutral expressions
has reached maturity with a high degree of accuracy [17, 78, 117, 190]. However,
changes due to facial expressions, the use of cosmetics and eye glasses, the presence
of facial hair including beards and aging, significantly affect the performance of face
recognition systems. The ear, compared to the face, is much smaller in size but
has a rich structure [8] and a distinct shape [86] which remains unchanged from
the age of 8 to 70 years (as determined by Iannarelli [72] in a study of 10,000
ears). It is, therefore, a very suitable alternative or complement to the face for
effective human recognition [20, 28, 69, 78]. However, the reduced spatial resolution,
uniform distribution of color and sometimes the presence of nearby hair and ear-rings
make the ear very challenging for nonintrusive biometric applications. Therefore, a
multimodal approach using ear as a biometric trait must also address the issue of
efficiently removing the effect of occlusions.
Fusion of ear and face data in an efficient way is also a challenging problem.
Fusion at the matching score or decision level is easy to perform and therefore,
most of the existing multimodal systems use either of the two. But fusion at these
levels cannot fully exploit the discriminating capabilities of the combined biometrics.
Fusion at the data or feature extraction level is believed to produce better results
in terms of accuracy and robustness because richer information about the ID or
class can be combined at these levels [146]. However, very few works [28, 144,
115, 98] have so far been performed using fusion at these levels. Fusion at the
feature level is the most challenging [144, 146, 86] because the feature sets of various
modalities may not be compatible and the relationship between the feature spaces
of different biometric systems may not be known. Moreover, the resulting feature
vectors may be of higher dimensionality and a significantly more complex matching
algorithm may be required. In addition, good features may be degraded by bad
features during fusion and hence, an efficient feature selection approach needs to
be applied prior to fusion. Furthermore, most commercial biometric systems do not
provide access to the raw data or feature sets. Solving these challenges will therefore
constitute a significant contribution.
The speed of the multimodal biometric recognition system is also an important
factor for real time applications, particularly when deployed in public places such
as airports and stadiums. Unfortunately, the most accurate algorithms (e.g. ICP) are
computationally expensive [181]. Therefore, developing an accurate as well as time-
efficient algorithm is of great research significance.
Finally, achieving acceptable results on databases significantly larger than those
used to date is another major challenge that needs to be addressed. Most of the
proposed biometric systems which achieve high accuracy have only been tested on
databases of fewer than 100 subjects.
1.2 Problem Statement
The research problem of this dissertation can be stated as follows. Is it possible
to develop efficient and robust approaches for 2D and 3D ear detection and data
representation to enable construction of a robust fusion approach, combining ear
biometrics with 3D face biometrics for effective human recognition? The answer is
yes, and this dissertation outlines how this can be done.
1.3 Research Contributions
The following is a summary of the major contributions in this dissertation.
• A fast and fully automatic ear detection approach using a Cascaded AdaBoost
algorithm is proposed. The classifiers are trained with Haar-like features and
the detection is performed using a rectangular window placed around the ear. No
assumption is made about the localization of the nose or the ear pit. (Chapter
3 and 6)
• For ear recognition, in order to reduce the computational expense of the ICP
algorithm, I propose a hierarchical ICP-based approach where the algorithm
is first used with a low and then with a higher resolution mesh of 3D ear data.
The result of the first stage of ICP is used as an initial transformation (coarse
alignment) for the second stage of ICP. (Chapter 3)
• The local 3D features (L3DFs) originally proposed by Mian and my supervi-
sors [117] for the face are modified to make them suitable for the representation
of 3D ear data. It has been shown that these local 3D features exhibit a high
degree of repeatability for different ear images of the same subject. (Chapter
4)
• The L3DFs have been utilized for matching a gallery and a probe 3D face
dataset using only feature distances. In this dissertation, I improved matching
based on the geometric consistency amongst the matched features. I included
an explicit second round of matching where only the matches that are consis-
tent with most of the other matches are allowed. (Chapter 5)
• I devised approaches to utilize the result of local feature-based matching for
an early rejection of most of the false matches and for coarsely aligning a
gallery-probe pair prior to a finer match using expensive algorithms such as
ICP. (Chapter 4 and 6)
• A novel approach to extract minimal feature-rich data points which only in-
cludes the matched local features is proposed to obtain a reduced dataset for
final ICP alignment. This reduction significantly increases the time efficiency
and the accuracy of the recognition system. (Chapter 4 and 6)
• Ear recognition experiments are performed on a new in-house database of pro-
file images with ear-phones and on the UND Collection-J which is the largest
publicly available profile database. High recognition rates are achieved with-
out expensive preprocessing such as an explicit extraction of the ear contours.
(Chapter 6)
• Two complete multimodal recognition systems fusing the ear and face biomet-
rics at score and feature levels are proposed based on the automatic extraction
of efficient local features. To the best of my knowledge, this is the first feature-
level fusion approach using 3D ear and face features extracted from the profile
and frontal face images respectively. (Chapter 7)
• The same type of local 3D features are used to represent both the ear and face
data at both the score and feature levels of fusion, which permits fair comparisons
of the two biometric traits and of the two fusion techniques. (Chapter 7)
• Fusion experiments are performed on the largest possible multimodal dataset
using publicly available profile (the UND) and frontal (the FRGC v2) face
databases. The performance of the face biometrics especially with non-neutral
expressions improves significantly when fused with ear biometrics using the
proposed score-level fusion technique. The proposed feature-level fusion ap-
proach achieves an accuracy comparable to that of the score-level fusion with-
out requiring ICP-like expensive matching algorithms. (Chapter 7)
1.4 Structure of the dissertation
This dissertation is organized as a series of papers published in internationally
refereed journals, books and conferences. Each paper constitutes an independent set
of work in the process of biometric recognition with minor overlaps. However, these
papers together contribute to a complete and coherent theme for human recogni-
tion using local 3D features. Chapters 2 to 7 correspond to our publica-
tions [79], [75], [80], [81] and [82] respectively. In Chapter 2, essential background
to unimodal and multimodal recognition is provided and a comprehensive review is
made to identify potential areas of contribution in the areas of 2D and 3D ear and
face detection, data representation and unimodal and multimodal matching. The
core content of this dissertation is laid out in Chapters 3 to 7. Chapter 3 describes
an ICP based ear recognition approach. A more efficient approach based on local
3D features is described in Chapter 4. An improved matching technique is proposed
in Chapter 5. Chapter 6 describes an efficient and complete recognition system for
the ear biometric starting from the detection to the (matching) decision making.
Chapter 7 describes two different fusion approaches to fuse the ear with the face
biometrics. I summarize my conclusions and provide suggestions for future work in
Chapter 8. A more detailed overview of each chapter is presented below.
1.4.1 A Review of Recent Advances in 3D Ear and Expression Invariant
Face Biometrics (Chapter 2)
In this chapter, a comprehensive review of unimodal and multimodal recogni-
tion using 3D ear and face data is presented. Associated data collection, detection,
representation and matching techniques are covered with a focus on the challeng-
ing problem of expression variations. All approaches are classified according to
their methodologies. Through the analysis of the scope and limitations of these
techniques, it is concluded that further research should investigate fast and fully
automatic ear-face multimodal systems robust to occlusions and deformations.
1.4.2 An ICP Based Hierarchical Matching Approach for 3D Ear Recog-
nition (Chapter 3)
In this chapter, a fully automatic and fast technique based on the AdaBoost
algorithm (fully described in Chapter 6) is used to detect a subject’s ear from his/her
2D and corresponding 3D profile images. A modified version of the Iterative Closest
Point (ICP) algorithm is then used for the matching of this extracted probe ear
to previously stored ear data in a gallery database. A coarse-to-fine hierarchical
technique is used where the ICP algorithm is first applied on low and then on high
resolution meshes of 3D ear data.
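The following sketch illustrates this coarse-to-fine strategy under stated assumptions: icp stands for any standard ICP routine that takes an initial transformation and returns a refined one together with a residual error, and the low-resolution stage is approximated here by simple point subsampling (the actual mesh reduction and parameters used in this chapter may differ).

```python
import numpy as np

def hierarchical_icp(probe_pts, gallery_pts, icp, step=4):
    # Stage 1: coarse alignment on low-resolution data (every step-th point);
    # cheap, because the cost of ICP grows with the number of points.
    coarse_T, _ = icp(probe_pts[::step], gallery_pts[::step], init=np.eye(4))
    # Stage 2: full-resolution ICP seeded with the coarse transformation, so
    # the expensive stage starts near the solution and converges quickly.
    fine_T, error = icp(probe_pts, gallery_pts, init=coarse_T)
    return fine_T, error
```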
1.4.3 A Fast and Fully Automatic Ear Recognition Approach Based on
3D Local Surface Features (Chapter 4)
In this chapter, an approach is proposed for human ear recognition based on
robust 3D local features. The features are constructed at distinctive locations in
the 3D ear data, with a surface approximated around each location based on its
neighborhood information. Correspondences are then established between gallery and
probe features and the two data sets are aligned based on these correspondences.
A minimal rectangular subset of the whole 3D ear data containing only the corre-
sponding features is then passed to the Iterative Closest Point (ICP) algorithm for a
finer alignment and the final recognition. Experiments were performed on the UND
biometric database.
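One standard way to realize the coarse alignment from matched feature locations is the closed-form SVD-based (Kabsch) rigid-body solution sketched below. This is an illustration of the principle rather than the exact procedure of this chapter; the input arrays are assumed to hold corresponding keypoint coordinates.

```python
import numpy as np

def coarse_align(probe_kp, gallery_kp):
    """probe_kp, gallery_kp: (N, 3) arrays of corresponding feature locations."""
    p_mean, g_mean = probe_kp.mean(axis=0), gallery_kp.mean(axis=0)
    # Cross-covariance of the centred correspondences.
    H = (probe_kp - p_mean).T @ (gallery_kp - g_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1) in the least-squares solution.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = g_mean - R @ p_mean
    return R, t  # a probe point p maps onto the gallery frame as R @ p + t
```

The resulting (R, t) can then serve as the initial transformation for the finer ICP alignment described above.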
1.4.4 Refining Local 3D Feature Matching through Geometric Consis-
tency for Robust Biometric Recognition (Chapter 5)
In this chapter, I eliminate some of the incorrect matches that occur during the
local 3D based matching. The approach is based on the geometric consistency among
the initial matches. Some related similarity measures are also computed and consid-
ered for the final matching decision. The performance of the approach is evaluated
on different datasets and compared with the performance of other approaches.
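A minimal sketch of the core idea is given below, using preservation of pairwise distances as the consistency test (a rigid transformation cannot change them). The tolerance and support threshold are illustrative assumptions; the measures developed in this chapter also include rotation consistency.

```python
import numpy as np

def filter_matches(matches, tol=2.0, min_support=0.5):
    """matches: list of (probe_xyz, gallery_xyz) pairs of matched keypoints."""
    kept = []
    for i, (p_i, g_i) in enumerate(matches):
        support = 0
        for j, (p_j, g_j) in enumerate(matches):
            if i == j:
                continue
            # For two correct matches, the probe-side and gallery-side
            # keypoint distances must agree (rigidity preserves distances).
            if abs(np.linalg.norm(p_i - p_j) - np.linalg.norm(g_i - g_j)) < tol:
                support += 1
        # Keep a match only if it is consistent with most of the other matches.
        if support >= min_support * (len(matches) - 1):
            kept.append((p_i, g_i))
    return kept
```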
1.4.5 Efficient Detection and Recognition of Textured 3D Ears (Chapter
6)
In this chapter, a very fast 2D AdaBoost detector is combined with a fast 3D local
feature matching and fine matching via an Iterative Closest Point (ICP) algorithm to
obtain a complete, robust and fully automatic system with a good balance between
speed and accuracy. Ear images are detected from 2D profile images using the
proposed Cascaded AdaBoost detector. The corresponding 3D ear data are then
extracted from the co-registered range image and represented with local 3D features.
Unlike previous approaches, local features are used to construct a rejection classifier,
to extract a minimal region with feature-rich data points and finally, to compute
the initial transformation for matching with the ICP algorithm. The performance
of the proposed approaches is also evaluated and compared with that of other approaches.
1.4.6 Fusion of 3D Ear and Face Biometrics for Robust Human Recog-
nition (Chapter 7)
In this chapter, a 3D local feature based approach is proposed to fuse ear and
face biometrics at the score and the feature levels of fusion. Score-level fusion is
performed using a weighted sum rule with some complementary weights. Feature-
level fusion is based on the similarity among the local features from the ear and
the face. Finally, the proposed fusion approaches are compared with other existing
approaches, demonstrating the superiority of these new techniques.
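As a sketch of the score-level rule, assuming distance-type scores and min-max normalization (the weight value and the normalization scheme are illustrative choices, not the tuned values of Chapter 7):

```python
import numpy as np

def minmax(s):
    # Map raw matching scores onto [0, 1] so the two modalities are comparable.
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def weighted_sum_fusion(ear_scores, face_scores, w_ear=0.4):
    # Complementary weights: w_face = 1 - w_ear. With distance-type scores,
    # the gallery entry with the smallest fused score is the best match.
    return w_ear * minmax(ear_scores) + (1.0 - w_ear) * minmax(face_scores)
```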
CHAPTER 2
A Review of Recent Advances in 3D Ear and
Expression Invariant Face Biometrics
Abstract
Biometric based human recognition is rapidly gaining popularity due to breaches
of traditional security systems and the lowering cost of sensors. The current research
trend is to combine multiple traits to improve accuracy and robustness. This paper
comprehensively reviews unimodal and multimodal recognition using 3D ear and face
data. It covers associated data collection, detection, representation and matching
techniques and focuses on the challenging problem of expression variations. All the
approaches are classified according to their methodologies. Through the analysis of
the scope and limitations of these techniques, it is concluded that further research
should investigate fast and fully automatic ear-face multimodal systems robust to
occlusions and deformations.
keywords
Biometrics, 3D Ear, 3D Face, Detection, 3D Data Representation, Multimodal
Recognition, Facial Expressions
2.1 Introduction
Incidents of breaching traditional password or ID based systems have been in-
creasing significantly as indicated by recent reports. CIFAS [43], the UK’s fraud
prevention service reports a 16% increase in identity frauds in the UK in the year
2008. In the USA, the number of victims of such fraud increased by 22%, resulting in
a loss of $48 billion over the year 2008 [87]. The rapid increase in security breaches is
a strong justification for replacing or augmenting traditional ID-based systems with
biometric systems which are based on human traits that cannot be stolen, denied or
faked easily. Biometric systems can be applied efficiently in many government ap-
plications (such as passports, voter ID and driver's licenses) and in forensic, security
and law enforcement applications. Driven by these priorities, the allocation of govern-
ment funds and the increasing computational efficiency of modern computers have
boosted the emergence of biometric recognition systems over the last few years [14].
This article is under review in ACM Computing Surveys, November 2009.
Figure 2.1: Salient features of an external ear: the Helix, Antihelix, Tragus,
Antitragus, Triangular Fossa, Incisure Intertragica, Concha/Ear Pit and Lobe
(left and right images are 2D and 3D views respectively of the same ear).
Unlike traditional recognition systems, the success of a biometric system depends
not only on the accuracy but also on the acceptability of the biometric trait being
used. The face and the ear have become popular due to their rich set of
distinctive features (as illustrated in Fig. 2.1 for the ear) as well as the possibility
of easy and non-intrusive acquisition of their images. Ample research has been per-
formed in the last few years proposing different methods of using these two biometric
traits for identification and authentication purposes. However, the accuracy and the
robustness required for real-world applications are still to be achieved. This implies
that it would be beneficial to look at existing and proposed approaches to identify
the challenges and suggest future research directions.
Pose variations and changes in facial expressions are the most challenging prob-
lems in fully exploiting the non-intrusiveness of face recognition techniques [189].
Although 3D approaches are sufficiently robust to pose variations compared to their
2D counterparts, the non-rigid deformations due to facial expressions severely affect
the results of 3D approaches. Consequently, expression variance became one of the
main focuses of the Face Recognition Grand Challenge (FRGC) [141] and gained
substantial interest in the computer vision and pattern recognition research commu-
nities. A few promising methods have been proposed to tackle the problem, however,
none of the approaches is free from limitations and able to fully solve this crucial
problem. Several survey papers ([3, 17, 96, 148, 191]) and books ([46, 147, 193, 100])
have been published on face recognition in general, but to the best of our knowledge,
there is no review devoted to the challenging problem of facial expression variance.
Therefore, in this survey, we focus on 3D face recognition approaches which are
significantly robust to non-neutral facial expressions.
In contrast to the face, the shape of the ear is not affected by expression changes
and also does not change with aging between the ages of 8 and 70 years [72]. Although
these properties have made the ear very popular in the research community, there are very few
Figure 2.2: Taxonomy of the biometric approaches with ear and face that are covered
in this manuscript with reference to the relevant section number within braces. The
taxonomy comprises: 3D Data Acquisition [2.3.1] (passive stereo, e.g. Geometrix;
pure structured light, e.g. Minolta; hybrid, e.g. Qlonerator); 2D/3D Detection
[2.4, 2.5], covering the face [2.4] (2D appearance-based [2.4.1] using AdaBoost, SVM,
SMQT and SNoW, or fuzzy logic; and 3D [2.4.2]) and the ear [2.5] (2D [2.5.1] using
outer helix curves, an RBF network, one-line based landmarks and 2D masks, or
intensity difference and AdaBoost; 3D [2.5.2] using landmarks and 3D masks, 3D
template matching, or an ear shape model; and 2D and 3D [2.5.3] using global-to-local
shape registration, or nose tip and ear pit detection and the snake algorithm); 3D
Representation [2.6] (global feature based [2.6.1]: COSMOS, SFR, balloon image;
local feature based [2.6.2]: point signatures, iso-contours, spin image (SI), spherical
SI, tensor, LSP, L3DF); and 3D Recognition [2.7-2.9], covering the face with
expression variations [2.7] (rigid [2.7.1] and non-rigid [2.7.2]), the ear [2.8] (using
local features [2.8.1], using global features [2.8.2], or without using any features
[2.8.3]) and ear-face multimodal approaches [2.9] (e.g. score level fusion).
survey papers ([143, 38, 69, 118]) and most of those describe the 2D and the early
exploratory 3D approaches. The current trend is to use 3D data for ear recognition,
however, some of the recent approaches use either only 2D or both 2D and 3D
information for ear detection. Therefore, in this survey we provide a comprehensive
and up-to-date review of the existing 2D/3D detection and 3D recognition techniques
proposed for the ear biometrics.
Considering the problems of unimodal biometrics (see Section 2.2), attempts have
been made to integrate the face with other biometric traits such as gait [194, 55],
palmprint [56, 97], fingerprint [166, 158], images of the iris [164] and recently, the
ear (see Section 2.9). In addition to the robustness to expression variations and
aging, the ear has another advantage over other alternatives due to its proximity
to the face: ear data can easily and non-intrusively be collected (with the same or
similar sensor) along with the face image. Ear images can efficiently supplement
face images when frontal views are difficult to collect or are occluded. Although
most of the ear-face multimodal approaches report better accuracy and robustness
than individual modalities, it is still important to investigate effective approaches
for fusion by analyzing the existing approaches.
We provide a more comprehensive review than past survey papers on face or
ear recognition by including the data collection, detection and representation tech-
niques currently proposed for these two modalities. A taxonomy of the approaches
is illustrated in Fig. 2.2. In addition, the latest relevant surveys in the field of face
and ear were in 2007 [3, 38], and the latest on data representation was in 2005 [119]. Our
previous review work in these areas [77, 78] also first appeared in 2007. Given that
technology changes rapidly, the authors felt an up-to-date and more elaborate survey
would make a valuable contribution to the field.
The paper is organized as follows. Preliminary concepts are introduced in the
next section. Techniques for acquisition and detection of face and ear data are
described in Sections 2.3, 2.4 and 2.5 respectively. Representation techniques used
for face and ear biometrics are described in Section 2.6. Recent approaches for
recognition of these two biometric traits are summarized in Sections 2.7 and 2.8.
Multimodal approaches involving these are described in Section 2.9. The challenges
to be met for improving the existing systems are outlined in Section 2.10 and the
conclusions are provided in Section 2.11.
2.2 Preliminary Concepts
2.2.1 Biometrics
Biometrics can be defined as the automatic recognition of a person based on his
or her physiological traits (such as face, fingerprint, palmprint, iris and DNA) or
behavioral traits (e.g. handwriting, gait, signature) [14, 193, 146]. Desirable char-
acteristics of a biometric trait are: universal availability to everyone, distinctiveness
to individuals, permanence over a long period of time and quantitative measurabil-
ity [86].
2.2.2 2D versus 3D Biometrics
Biometrics can be broadly categorized as 2D and 3D, based on the type of data
used. Although 2D images are easier and less expensive to acquire, they have many
inherent problems such as variance to pose and illumination, and sensitivity to the
use of cosmetics, clothing and other decorations. Biometric systems using 3D data
are potentially free of these problems. However, 3D data sometimes suffer from
spikes and holes due to sensor errors.
2.2.3 Unimodal Biometrics
Unimodal biometric systems operate with a single type of biometric data. These
systems suffer from a number of problems including noise in sensed data, intra-
class variations, inter-class similarities (i.e. overlaps of features in the case of large
databases), non-universality (for example, in fingerprint systems, incorrect minutiae
may be extracted in the case of poor quality of the ridges) and spoof attacks [158,
146, 86, 85, 110, 142].
2.2.4 Multimodal Biometrics
A system may be called multimodal if it collects data from different biometric
sources, uses different types of sensors (e.g. infra-red or reflected light), or uses
multiple samples of data or multiple algorithms [158] to combine the data [16]. Thus,
in multimodal systems a decision can be made on the basis of different subsets of
biometrics depending on their availability and confidence. These systems are also
more robust to spoof attacks as it is relatively difficult to spoof multiple biometrics
simultaneously. Woodard et al. [166] compared the performance of different modali-
ties and reported a 97% rank-1 recognition rate for a multimodal approach with ear,
face and fingerprint and 80%, 93% and 80% for individual modalities respectively
on a dataset of 425 samples of each modality from 85 subjects.
2.2.5 Fusion Techniques
Different biometric modalities can be fused or combined at different levels of
the biometric recognition process. Based on the level of fusion, Jain et al. [84]
categorized the fusion techniques as illustrated in Fig. 2.3.
Prior to classification, fusion can be performed at the data or sensor level com-
bining raw data from different sensors or at the feature extraction level by combining
feature vectors obtained by either using different sensors or employing different fea-
ture extraction algorithms. Techniques for fusing information after the classification
or matching stage can be grouped into four categories as shown in Fig. 2.3 (adapted
from [84]). In dynamic classifier selection, the results of the classifier that is most
likely to provide the correct decision for the specific input pattern are chosen [167, 9].
When each biometric matcher produces a match score indicating the proximity of
the input data to a template, fusion is performed
Figure 2.3: Classification of fusion techniques in multimodal biometric systems
(adapted from [84]): pre-classification level fusion (at the sensor or data level, or
at the feature level) and post-classification level fusion (dynamic classifier selection,
and fusion at the measurement or match score level, the rank level, or the abstract
or decision level).
at the measurement or match score or confidence level. Fusion can also be performed
at the rank level when the output of each biometric matcher is a subset of possible
matches sorted in decreasing order of confidence, and at the abstract or decision
level when each biometric matcher individually decides on the best match based on
the input presented to it (refer to [145, 84, 146, 147] for details).
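For instance, rank-level fusion is often illustrated with the Borda count. A minimal sketch follows, assuming each matcher returns one distance score per gallery entry; this is the generic Borda rule, not a method from any of the surveyed papers.

```python
import numpy as np

def borda_fusion(score_lists):
    # score_lists: one distance array per matcher, all over the same gallery.
    total = np.zeros(len(score_lists[0]))
    for scores in score_lists:
        total += np.argsort(np.argsort(scores))  # rank of each entry (0 = best)
    return np.argsort(total)                     # fused ranking, best first
```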
2.3 Image Acquisition Techniques
Data collection is the very first step in a biometric recognition system. Since
the detailed description of different data acquisition systems is not within the scope
of this paper, a short description of those involving still images is provided in this
section for completeness. We also summarize the existing ear and face databases
which are publicly available for research purposes.
2.3.1 Techniques
In a non-intrusive application, ear and face data can be collected from still im-
ages or video sequences of the profile and frontal views of a subject respectively.
2D images can be captured using ordinary cameras. However, 3D data collection
requires some special sensing devices.
Existing approaches for acquiring 3D data can be classified into three categories:
passive stereo (e.g. the Geometrix system), pure structured light (e.g. the Minolta
sensor) and the hybrid approach (e.g. the 3Q ‘Qlonerator’ system). In the first
approach, 3D locations of a subject are computed from two images taken by two
cameras with a known geometric relationship. In the second approach, light patterns
projected from a light projector are detected in the image of the subject to compute
Table 2.1: Some existing profile and frontal face databases

Database Name and Source | Data Type | #Sub. | #Img. | Image Description
FRGC v2 [141] | 2D, 3D | 466 | 50000 | Frontal; neutral and smiling; collected using Minolta Vivid 900/910.
FERET [51] | 2D Color | 1199 | 14,126 | 15 sessions with time lapse up to two years.
PIE, CMU [44] | 2D | 68 | 41368 | 13 poses, 43 illuminations, 4 expressions.
NIST Mugshot Identification [129] | 2D | 1573 | 3248 | 131 and 89 cases with two or more front and profile views respectively.
CAS-PEAL [25] | 2D | 1040 | 30900 | 27 poses, 5 expressions, 6 accessories and 15 lighting directions.
IV2 Multimodal Biometric [140] | 2D, 3D | 300 | 2400 | Face data collected using Minolta Vivid 7000 with five facial expressions, two different lighting conditions and three poses (frontal, left profile and right profile).
3D RMA [1] | 3D | 120 | 360 | Three poses, collected using the structured light technique.
The Bosphorus [15] | 3D | 105 | 4666 | Various poses, expressions and occlusion conditions.
BU-3DFE [12] | 2D, 3D | 100 | 2500 | Neutral and six more expressions with four different lighting conditions. Data from 56 males and 44 females.
GavabDB [123] | 3D | 61 | 549 | Nine images from each subject: 2 frontal views with neutral expression, 2 x-rotated views (30), 2 y-rotated views (90) and 3 different expressions.
BioID [13] | 2D | 23 | 1521 | Upright frontal images with a resolution of 384x286 pixels.
UH [91] | 2D | N/A | 884 | With neutral and different expressions while reading aloud; with or without ear and face accessories; acquired using a 3dMD-based prototype system.
The Yale Face Database [173] | 2D | 15 | 165 | Six expressions; three poses, with or without glasses.
USTB-III [160] | 2D | 79 | 220 | Profile and other poses and partial occlusions.
UND-F [155] | 2D, 3D | 302 | 942 | Profile, time lapsed, collected using Minolta Vivid 910.
UND-J2 [156] | 2D, 3D | 415 | 1800 | Profile, time lapsed.
UCR-ES2 [32] | 3D | 155 | 902 | Profile images with pose variations (±35 degrees), six shots per subject taken all on the same day using Minolta Vivid 300.
the 3D locations. In the hybrid approach, instead of a single camera, a stereo camera
rig is used along with a light projector, and 3D locations corresponding to the projected
light patterns are computed as in the first approach. In applications where a large number of
corresponding points are required, approaches with light patterns are helpful as
they simplify the selection of such points. Details of the approaches can be found
in [17].
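The geometric core of the passive stereo approach is point triangulation. A minimal linear (DLT) sketch is given below, assuming the "known geometric relationship" is available as 3x4 projection matrices for the two cameras; pure structured light replaces the second camera's measurement with the known direction of a projected pattern, which makes finding the correspondences trivial.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # P1, P2: 3x4 camera projection matrices; x1, x2: matching pixels (u, v).
    # Each view contributes two linear constraints on the homogeneous 3D point.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # homogeneous -> Euclidean coordinates
```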
An example of a Minolta scanner and a sample of the range data it can capture
are shown in Figure 7.1.
2.3.2 Existing Databases
There are many databases available with different types of 2D and 3D data. Some
large and widely used (in recent publications) databases for ear and face recognition
are described in Table 2.1. Interested readers are referred to [62, 73] for a summary
of other databases.
2.4 Face Detection Techniques
Face images with a controlled background can be easily detected using color or
motion or both. However, detection of faces with unconstrained backgrounds is
comparatively difficult. A few recent and widely used pose and expression invariant
detection techniques (based on still images) are described in this section.
2.4.1 2D Approaches
As described in earlier surveys ([96, 183, 54, 182, 66]), most of the face detec-
tion approaches are based on 2D information. Kong et al. [96] categorized these
approaches as knowledge-based (encoding human knowledge to capture relation-
ship between facial features), feature-invariant (searching features unaffected by the
variations in pose, viewpoint or lighting conditions), template matching (using pat-
terns of the whole face or facial features to find correlation between faces) and
appearance-based approaches (capturing representative variability of facial appear-
ance by learning the models or templates using a set of training images). Since the
first three types of approaches do not work well in the case of small and poor quality
images, most of the modern approaches are appearance based and can be further
classified as below:
Using AdaBoost-Based Classifiers One of the most popular, fast and robust face
detection algorithms is the cascaded AdaBoost algorithm proposed by Viola and
Jones [161]. This method is characterized by the use of the integral image and rectangular
Haar-like features with a cascade of classifiers, where each successive classifier is
based on the rejection or acceptance result of the previous classifier. The authors
reported a detection rate of 93.7% with 422 false positives on the MIT+CMU face
database. A speed of 15 frames per second was obtained while scanning 384 by 288
pixel images using multiple scales of 24×24 size patches on a conventional 700 MHz
Intel Pentium III machine. The algorithm has been followed by many variations
and improvements such as for occluded images (e.g. [60]), for rotation invariance or
multi-view (e.g. [104, 68, 33]) and for speed up (e.g. [151, 121, 64, 168, 137]). Instead
of using Haar-like features, recently, Meynet et al. [113] used Gaussian features and
Xiaohua et al. [169] proposed using Gabor features and hierarchical regions in a
cascade of classifiers.
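The efficiency of this family of detectors rests on the integral image, which reduces the sum over any rectangle to four array look-ups; the brief sketch below shows the table and one standard two-rectangle Haar-like feature (the parameter names are illustrative).

```python
import numpy as np

def integral_image(img):
    # Summed-area table with a zero row and column prepended, so that any
    # rectangle sum costs exactly four look-ups.
    return np.pad(np.cumsum(np.cumsum(img, axis=0), axis=1), ((1, 0), (1, 0)))

def rect_sum(ii, r, c, h, w):
    # Sum of pixels in the h-by-w rectangle whose top-left corner is (r, c).
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_feature(ii, r, c, h, w):
    # Two-rectangle Haar-like feature: left half minus right half.
    return rect_sum(ii, r, c, h, w // 2) - rect_sum(ii, r, c + w // 2, h, w // 2)
```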
Using Local SMQT Features and Split SNoW Classifiers Nilsson et al. [128] used
the combination of Successive Mean Quantization Transform (SMQT) features and
a split up Sparse Network of Winnows (SNoW) classifiers for frontal face detection.
With 15 false positives, they obtained around a 100% detection rate on the BioID
database and around 81% on the MIT+CMU database. However, considering both
databases together, the reported detection accuracy was 95% with a 1.93× 10−7
false positive rate. Fig. 2.4 shows a sample of face detection using this algorithm
compared to the Open Source Computer Vision Library (OpenCV) [131] based im-
plementation of the cascaded AdaBoost algorithm. The authors demonstrated that
the algorithm performs better than other approaches on the BioID database, how-
ever, the approach would fail in detecting faces smaller than 32 × 32 as this is the
fixed size of the patch used for detection. Although the detection time is not re-
ported, the requirement of down-sampling the image instead of the features/patches
would limit the speed of the detection.
Using Support Vector Machine (SVM) Classifiers One of the earliest approaches
to face detection using SVM was proposed by Osuna et al. [132]. They detected
vertically oriented and un-occluded frontal views of human faces in gray level images.
The approach was further enhanced for multi-view face detection in [102] and [174].
Most recently, Hotta [67] used SVM classifiers with horizontal rectangular features
and a combination kernel of various sizes for view-independent face detection.
Using Fuzzy Logic Ghiass and Sadati [59] used some filters to obtain binary
images from multi-view face images. They then produced fuzzy models from the
distribution of zeros in the faces and used a fuzzy approach for face detection.
Because the approach is independent of skin color, persons of any skin color can be
detected. Experiments on 60 images of the Yale Database B under
Figure 2.4: Comparative results of 2D face detection (performed and illustrated
in [183]): (a) using the OpenCV-based implementation of cascaded AdaBoost [131]
and (b) using the approach of Nilsson et al. [128] (best seen in color). Notice that
the approach in (a) failed to detect the face of the first person.
different illumination conditions showed an average error rate of 1.2%.
Figure 2.5: Samples of 3D face detection: (a) in the presence of occlusions, (b) in
the case of multiple faces in a scene and (c) under large scale variations [126].
2.4.2 3D Approaches
Mian et al. [114] proposed a simple but fully automatic approach for 3D face
detection and obtained 98.3% accuracy on the FRGC v2 dataset. At first, they
detected the nose tip from the 3D depth images and then took a spherical region
around that. However, the system assumes that the image contains only a single
face, that pose variations do not exceed 15 degrees along the x and y-axes, and that the nose tip
is not occluded. Moreover, the segmentation is not robust to scale variance because
a sphere of pre-defined radius is cropped over the entire database.
Colombo et al. [45] proposed detecting salient features of a 3D face (e.g. nose and
eye) by analyzing the mean and Gaussian curvature of the surfaces in the scene. A
set of candidate faces was built grouping pairs of candidate eyes with the candidate
noses. Distances between eyes and noses were computed and compared with those
in a typical human face, and the candidate face was discarded if the disagreement was
too high. The portion of the surface under a candidate face was projected into a
new range image and finally processed by a PCA trained classifier to discriminate
between faces and non-faces. On a set of 150 3D faces the approach achieved a 96%
detection rate. However, the approach is highly sensitive to the presence of outliers
and holes around the eyes and nose regions.
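Mean and Gaussian curvature can be estimated directly from a range image treated as a Monge patch z = f(x, y). The following sketch uses finite-difference derivatives and assumes unit grid spacing; in practice the depth map would be smoothed first.

```python
import numpy as np

def mean_gaussian_curvature(z):
    # First and second derivatives of the depth map by central differences.
    zy, zx = np.gradient(z)            # along rows (y) and columns (x)
    zyy, zyx = np.gradient(zy)
    zxy, zxx = np.gradient(zx)
    g = 1.0 + zx**2 + zy**2
    K = (zxx * zyy - zxy**2) / g**2                    # Gaussian curvature
    H = ((1 + zx**2) * zyy - 2 * zx * zy * zxy
         + (1 + zy**2) * zxx) / (2 * g**1.5)           # mean curvature
    return H, K
```

The signs of H and K then classify each point into surface types (peak, pit, ridge, valley or saddle), from which candidate noses and eyes, or the ear pit used in Section 2.5.3, can be hypothesized.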
Niese et al. [127] performed 3D point clustering using texture information to
localize the face. The approach is limited to pose variations up to ±45◦ from the
upright position and also relies on the availability of a texture map. However, it is
significantly robust to illumination and expression variations.
Recently, Nair and Cavallaro [126] proposed using a Point Distribution Model
(PDM) for detecting a face from its 3D mesh data. They extracted candidate
locations (inner eye and nose tip vertices) on the mesh from low-level curvature-
based feature maps and used them for fitting the model, thus, not relying on texture,
pose or orientation information. The face was then detected by classifying the
transformations between model points and candidate vertices based on the upper-
bound of the deviation of the parameters from the mean model. A 99.6% detection
rate was achieved on 827 meshes (427 face meshes from GavabDB face database and
400 object scans from the NTU 3D Model Database ver.1). The system is robust
to different facial expressions and minor occlusions. It also works well with multiple
faces in the same scene and with scale variations (see Fig. 2.5). However, the model
fitting procedure of this approach is time consuming, taking the authors 121 s on
average over the GavabDB database on a 3.2 GHz Intel Pentium 4 CPU.
2.5 Ear Detection Techniques
In the case of non-intrusive approaches, ear data are segmented from the profile
images which can vary in appearance under different viewing and illumination con-
ditions. Therefore, extracting ear data from arbitrary profile images is a challenging
problem. Existing approaches are categorized and described briefly in this section.
Figure 2.6: Different steps of localizing the ear in the approach of Ansari and Gupta
in [7]: (a) convex and concave edges extracted from a side face image, (b) possible
outer helix curves and (c) the completed ear boundary (best seen in color).
2.5.1 Using Only 2D Data
Using Outer Helix Curves Ansari and Gupta in [7] utilized the shape of the
outer helix curve of the ear for its localization on a 2D profile image with arbitrary
background. They extracted the edges using the Canny edge detector [24] and
segmented them into convex and concave edges (see Fig. 2.6). They eliminated non-
ear edges and found the final outer helix curve based on the relative values of angles
and some predefined thresholds. Then, the two end points of the helix curve were
joined with straight lines to get the complete ear boundary. They obtained 93.34%
ear localization accuracy on a database of 700 samples. The approach does
not require any template and can localize ears rotated in any direction; however, it
fails for images which are of poor quality or occluded by hair.
Using Radial Basis Function (RBF) Network He et al. [65] used an RBF network
to map a side face image to a surface in which the steepest peak indicates the
location of the ear in the side face image. They convolved the mapping surface
with a Laplacian mask and extracted the ear as a 200×120 pixel image from the
input side face image of size 320×240 pixels. Yuizono et al. [187] also used a similar
approach, however, they applied both pyramid hierarchy and sequential similarity
detection algorithms to speed up the extraction process. No occlusion with hair or
earrings is considered in this approach.
Using One-line Based Landmarks and 2D Masks Yan and Bowyer [176] proposed
manually selecting the Triangular Fossa and the Incisure Intertragica (see Fig. 2.1) on the
original 2D profile image and then drawing a line to be used as a landmark. The
landmark was used to find the orientation and size of the ear. A mask was then
rotated and scaled accordingly and applied on the original image to crop the ear
data. The authors used this method for PCA-based and edge-based matching.
Figure 2.7: Samples of ear detection under severe occlusion using AdaBoost [74]
(best seen in color).
Utilizing Intensity Difference and AdaBoost Algorithm Ear contours were de-
tected based on illumination changes within a chosen window by Choras [37]. In
this method the difference between the maximum and minimum intensity values of a
window is compared to a threshold computed from the mean and standard deviation
of that region to decide whether the center of the region belongs to the contour of
the ear or to the background.
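A sketch of this windowed test is given below; the window size, the weighting factor and the exact threshold combination are illustrative, as the constants of the original paper are not reproduced here.

```python
import numpy as np

def contour_mask(img, win=7, k=1.5):
    # Mark a pixel as an ear-contour candidate when the max-min intensity
    # spread in its window exceeds a threshold formed from the window mean
    # and standard deviation.
    r, (h, w) = win // 2, img.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = img[y - r:y + r + 1, x - r:x + r + 1]
            mask[y, x] = patch.max() - patch.min() > patch.mean() + k * patch.std()
    return mask
```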
In recent work, Islam et al. [74] proposed an ear detection approach based on
the AdaBoost algorithm described in [149]. Rectangular Haar-like features that
compute the difference of intensity in neighboring regions were used for training
and detection. The approach is fully automatic, very fast and is not affected by
significant occlusions and degradation of the input images (see Fig. 2.7). The
authors reported a detection rate of 99.9% while testing on the UND Biometrics
Database with 830 images of 415 subjects. A 100% detection rate was obtained for
203 images of the UND-F dataset with a False Acceptance Rate (FAR) of 5× 10−6.
A 480 by 640 test image can be scanned in 7.66 ms on a Core 2 Quad 9550, 2.83
GHz machine using a C++ implementation of this ear detection algorithm.
A variant of the AdaBoost algorithm called the Gentle AdaBoost algorithm
and some asymmetric Haar-like features were used by Li and Zhang [186] for ear
detection. On the USTB database of 220 images (which were also used as positive
samples in training the algorithm), they obtained a 99.5% detection rate at 0.023
FAR. Respective results on the CAS-PEAL dataset of 166 images were 98.8% and
0.036%. Detection time was not reported.
2.5.2 Using Only 3D Data
Using Two-line Based Landmarks and 3D Masks Yan and Bowyer [176] drew
two lines on the original range image: one line along the border between the ear and
the face, and the other from the top of the ear to the bottom. They used these lines
to find the orientation and scaling of the ear. A mask was then rotated and scaled
accordingly and applied on the original image to crop 3D ear data in an ICP-based
matching approach. The approach is simple, but requires manual intervention.
Using 3D Template Matching Chen and Bhanu [29] built a template model repre-
sented by an average histogram of shape index values in off-line mode. Their on-line
ear detection mode includes: step edge detection and thresholding, image dilation,
connected-component labeling and template matching. They obtained 91.5% cor-
rect detection rate with 2.52% false alarm rate on 30 side face range images of 30
subjects collected using Minolta Vivid 300. The approach is simple and compara-
tively less expensive as it only searches the potential ear regions around the step
edges, avoiding an exhaustive search over the entire test image. However, detected
rectangular regions sometimes do not include the full ear and sometimes include
some extra regions around the ear.
Using Ear Shape Model Chen and Bhanu [31] represented an ear shape model
by a set of discrete 3D vertices on the ear helix and anti-helix parts and aligned it
with the range images for detecting those ear parts. They reported 92.6% detection
accuracy on the UCR dataset with 312 images and an average detection time of 6.5
sec on a 2.4G Celeron CPU.
2.5.3 Using Both 2D and 3D Data
Global-to-Local Shape Registration Chen and Bhanu [32] used both color and
range images and a global-to-local shape registration. They obtained 99.3% and
87.71% detection rates on the University of California, Riverside (UCR) ear database
of 902 images from 155 subjects and on the University of Notre Dame (UND)
database of 302 subjects respectively. The average detection time reported is 9.48
sec for the UCR dataset on a 2.4G Celeron CPU.
Using Nose-tip and Ear-pit Detection and the Snake Algorithm Yan and Bowyer [179]
proposed taking a predefined sector from the nose tip in order to locate the ear re-
gion. They cropped out the non-ear portion from that sector by skin detection
and detected the ear pit using Gaussian smoothing and curvature estimation. They
then applied an active contour algorithm to extract the ear contour. As illustrated
in Fig. 2.8, they used the ear pit as the starting point for the snake algorithm. Us-
ing the color or the depth information separately for the active contour algorithm,
detection accuracies of 79% and 85% were obtained respectively. The accuracy was
improved to 100% using both the color and depth information. The approach is
fully automatic, however, it will fail if the ear pit is not visible.
Figure 2.8: Block diagram of the ear detection approach proposed by Yan and
Bowyer [179]. Starting from a profile face image (2D color and 3D range image),
the pipeline comprises: (i) nose tip detection, with preprocessing to drop the
shoulder and some hair areas, cropping of a sector from the nose tip with a radius
of 20 cm and spanning +/- 30 degrees, and removal of non-ear skin regions by
transforming each 2D pixel into the YCbCr color space and color matching; (ii) ear
pit detection, applying Gaussian smoothing with an 11x11 pixel window to remove
noise, calculating the Gaussian (K) and mean (H) curvatures, grouping 3D points
with the same curvature label into regions, selecting regions with K>0 and H>0 as
pit regions, and applying a symmetric voting method to select the real ear pit; and
(iii) ear extraction using an active contour algorithm, with the initial contour an
ellipse centered on the ear pit (major axis 20 pixels, minor axis 30 pixels), growing
the contour with appropriate parameters for 150 iterations and cropping the final
contour as the extracted ear.
2.5.4 Discussion of the Ear Detection Techniques
In this section, ear detection techniques are categorized based on the data used
for detection. Approaches using only 2D data can also be used for 3D ear recognition
because using modern acquisition devices (see Section 2.3) both 2D and range images
can be collected together and they can be co-registered. Therefore, after detection
of the ear from 2D images, corresponding 3D data can be extracted from the co-
registered range image.
Apart from the type of data used, as discussed in this section, some approaches
consider only a close and small side face view around the ear while others consider
the whole profile image. Similarly, some approaches only localize and perform recog-
nition on a roughly cropped ear region, while others apply further elimination
techniques to crop ear data concisely from the background. Besides, some approaches
require manual intervention for initialization while others are fully automatic.
There are performance variations among the above approaches. For exam-
ple, higher accuracy can be achieved using a concise crop of the ear data, which in
turn may increase the detection time. Although AdaBoost based detection is very
simple and fast, it requires training with a large population. Therefore, the choice
of the most appropriate technique should depend on the type and requirements of
the application.
2.6 3D Data Representation Techniques
Prior to recognition, data should be structured in an appropriate manner to
enable efficient matching. Many techniques have been proposed for 2D and 3D
object representation as described and analyzed in [23], [111] and [119]. In this
section, we only describe those representation techniques which can represent 3D
ear and face effectively. Based on how the whole face or ear is defined or how the
matching can be performed using the representation, we categorize the techniques
as global and local and discuss them briefly in the following sections.
2.6.1 Global Representation
COSMOS: Dorai and Jain [48] proposed a representation technique for 3D free-
form objects which they called Curvedness-Orientation-Shape Map On Sphere (COS-
MOS). In this representation, local and global shape information of an object such
as surface area, curvedness and connectivity are integrated in terms of maximal
surface patches of constant shape index [95]. The patches are mapped onto the
unit sphere via their orientations and aggregated via their shape spectral functions.
Since the ear and the face are objects with arbitrary curves and holes, they can be
represented with this approach; however, it requires the associated range data to
be occlusion free.
Figure 2.9: Global data representation: (a) the SFR [114] and (b) the balloon image
(SSR convexity map) [124].
Spherical Face Representation (SFR): Mian et al. [114] proposed the Spherical
Face Representation (SFR) for representing 3D face data. Here, the point cloud
is quantized into spherical bins rather than 3D grids as in the case of their tensor
representation. As shown in Fig. 2.9(a), an n bin SFR is computed by quantizing
the distances of all points from the origin (e.g. nose tip in the case of the face)
into a histogram of n + 1 bins. The authors demonstrated that SFRs belonging to
the same individual follow a similar curve shape, while those of different individuals
follow different shapes.
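A minimal sketch of an n-bin SFR, assuming the scan is indexed by its detected nose tip; the bin range and the normalization are illustrative choices.

```python
import numpy as np

def sfr(points, nose_tip, n=15):
    # Quantize the distances of all points from the origin (the nose tip)
    # into a histogram of n + 1 spherical bins.
    r = np.linalg.norm(points - nose_tip, axis=1)
    hist, _ = np.histogram(r, bins=n + 1, range=(0.0, r.max()))
    return hist / len(points)   # normalize so scans of different sizes compare
```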
Balloon Image Representation: Pears [138] proposed sampling the Radial Basis
Function (RBF) model of an object (at arbitrary resolutions in 3D space) over a set
of concentric spheres to construct a spherically sampled RBF (SSR) histogram, also
called a balloon image (see Fig. 2.9(b)). A convexity value is computed from this
histogram which is then used to estimate the volumetric intersection of the object
and a bounding sphere, centered on any object surface point. Minimization of this
volume is used to define and localize high curvature surfaces on the object.
This representation is pose invariant and relatively immune to missing parts, as
the RBF function is defined everywhere in 3D space. It also performs well in the
presence of noise as the SSR convexity values are derived as a summation, which has
the effect of suppressing (averaging) noise. It can be used for localizing the nose tip
on face images and then aligning the face to a 3D upper face template using either
ICP on nose-centered data or the method of “isoradius contours” [139].
Figure 2.10: Local surface data representation: (a) point signatures (face surface
and sphere), (b) iso-contours, (c) the spin image, (d) the tensor, (e) a feature point
(asterisk), its neighbors (dots) and the basic constituents of an LSP [32] and (f)
L3DF [83]: a local surface (right image) and the region of the ear from which it is
extracted (shown by a circle on the left image).
2.6.2 Local Representation
Point Signatures: In the point signatures representation [41], 1D signatures are
extracted from a surface. A sphere of predefined radius is centered at a point on the
surface. The intersection of this sphere with the object’s surface gives a 3D space
curve. A plane is fitted to this space curve at the center point and is translated
in the direction of its normal to the center of the sphere. Next, the 3D curve is
projected perpendicularly to the translated plane, forming a new 2D curve. This
projection of points from the 3D curve to 2D curve forms a signed distance profile
known as point signatures. The starting point of this signature is defined by the
point on the signature that gives the maximum distance from the 3D curve. Point
signatures are calculated for every point on the object’s surface.
Although the point signature representation is very simple to implement, more than
one point on the 3D curve may lie at the same maximum distance from the plane,
making the representation ambiguous. The starting point of the signature
is also very sensitive to noise and hence the representation is not stable and robust.
Iso-Contours: Mpiperis et al. [124] proposed this technique for representing the
face. As shown in Fig. 2.10(b), they represented 3D information of the face surface
by a set of planar curves formed by the intersection of the surface with equidistant
parallel planes.
Spin Image (SI) Representation: A spin image is a representation of a data point
belonging to a surface using a 2D array of values (i.e. 2D image) generated like a
sheet spinning about the normal of that point [89]. In this approach, the ear or the
face data can be represented by a stack of spin images computed at each vertex of
its mesh (see Fig. 2.10(c)). It requires normal estimation, which may be corrupted
by residual noise and missing parts in the data.
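The construction reduces to a few lines: every neighbour q of an oriented point (p, n) maps to a radial distance alpha and a signed height beta, which index a 2D histogram. The bin count and support size below are illustrative.

```python
import numpy as np

def spin_image(points, p, n, bins=16, support=30.0):
    # (p, n): a surface point and its unit normal. Each neighbouring point
    # contributes alpha = distance from the normal axis and beta = signed
    # height along the normal.
    d = points - p
    beta = d @ n
    alpha = np.sqrt(np.maximum((d ** 2).sum(axis=1) - beta ** 2, 0.0))
    img, _, _ = np.histogram2d(alpha, beta, bins=bins,
                               range=[[0.0, support], [-support, support]])
    return img
```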
Sphere-Spin-Image (SSI) Representation: Wang et al. [162] proposed SSIs to
represent the local shape of points on a facial surface by means of a histogram. This
is obtained by mapping the 3D coordinates of points lying within a spherical space,
centered at that point, into a 2D space. Points are selected by means of a minimum
principal curvature analysis. For each subject a single SSI series is built up and a
simple correlation coefficient is used to compare the similarity between SSI series of
different subjects.
Tensor Representation: Mian et al. [116] proposed representing 3D surfaces with
third order tensors. In their approach, the cloud of points representing a view of
an object (here, face or ear) is first converted into triangular meshes. Then, the 3D
object in the mesh is quantized into a 3D grid to construct a third order tensor (see
Fig. 2.10(d)). Since the co-ordinate basis used to define the grid is extracted from
the underlying surface itself, this tensor representation is pose invariant. This can
be used for representing very low resolution range images.
Local Surface Patch (LSP): Chen and Bhanu [32] proposed a local feature based
representation technique based on a surface patch consisting of a feature point P
and its N neighbors. The representation includes the feature point, its surface type,
centroid of the patch, and a histogram of shape index values against the dot product
of the surface normal at the point and its neighbors [11]. The components of an
LSP are illustrated in Fig. 2.10(e). The 2D histogram and surface type are used for
matching of LSPs of different individuals and the centroid is used for computing the
rigid transformation.
Local 3D Feature (L3DF): Mian et al. [117] and Islam et al. [83] described a local
surface representation technique in which, at first, a small number of distinctive 3D
keypoints are identified on the detected 3D ear or face region. A 3D surface is then
approximated at each of the keypoints based on the neighborhood information and
used as the feature for that point (see Fig. 2.10(f)). A coordinate frame is defined
by centering on the keypoint and aligning with the principal axes from Principal
Components Analysis (PCA) to make the feature pose invariant.
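A sketch of such a pose-invariant local frame is given below, assuming the keypoint's neighbouring points are available; sign disambiguation of the PCA axes, which a full method needs, is omitted for brevity.

```python
import numpy as np

def local_frame(neighbors, keypoint):
    # Centre the neighbourhood on the keypoint and express it in the principal
    # axes of its own scatter, so the same physical surface patch yields (up
    # to axis signs) the same coordinates under any sensor pose.
    Q = neighbors - keypoint
    _, _, Vt = np.linalg.svd(Q - Q.mean(axis=0), full_matrices=False)
    return Q @ Vt.T
```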
2.6.3 Comparative Evaluation of the Representation Techniques
Approaches that directly use 3D surface information (e.g. the tensor based repre-
sentation and the L3DF) have comparatively more discriminating capabilities than
those mapping the 3D surface onto 2D histograms or extracting 1D signatures. Mian
et al. [116] compared the performance of the Spin Image to their tensor based repre-
sentation. They reported around 39% more correct pairwise registration at 600 faces
per view and 44% more matching pairs with less than 2 cm error. However, the high
descriptiveness of the tensor representation makes it sensitive to non-rigid deforma-
tions such as facial expression changes. Mian et al. [114] also demonstrated that
SFR is comparatively less descriptive but more robust to non-neutral expressions.
Bhanu and Chen [11] compared the cumulative matching performance of the LSP
with that of the SI on the UCR dataset and obtained a slightly better result for the for-
mer (94.8% and 92.9% rank-1 respectively). They also compared the efficiency of
these two and the SSI. They found that the average times required for finding the nearest
neighbors and the group of corresponding surface descriptors and then performing
verification by these three techniques were 89.42, 162.07 and 150.57 seconds respec-
tively on an AMD Opteron 1.8 GHz processor.
Pears [138] compared the performance of spin images with that of balloon images
for nose tip identification and obtained 70% and 99.6% accuracy respectively.
Mpiperis et al. [124] demonstrated that iso-contours outperform point signatures
both in computational efficiency and in recognition rates. They obtained an improve-
ment in accuracy from 86.2% to 91.4% for face recognition using iso-contours.
2.7 Recognition Techniques with 3D Face Data
Among the biometric traits, the face is the most heavily researched one. Exten-
sive surveys on 2D face recognition techniques can be found in [189, 3, 148, 192].
However, the maturity of 2D face recognition has failed to resolve several problems,
including pose variations, and therefore researchers are now pursuing 3D face recog-
nition. Most of the early 3D approaches obtained quite significant accuracy under
neutral expressions but remained severely affected by the deformations due to expres-
sion changes. Therefore, the current trend is to find approaches robust to such
deformations. We can broadly categorize these approaches as rigid and non-rigid
based on the methodology adopted. In this section, we describe some representative
approaches of each category.
2.7.1 Rigid Approaches
In these approaches, human faces are considered as rigid objects and expression
invariant features of faces (e.g. the length of nose and the distance between eyes)
or rigid/semi-rigid regions such as nose and eyes-forehead regions (see Fig. 2.11) are
identified and matched for recognition purposes. These are simple but suffer from the
disadvantage that deformable parts of the face that still encompass discriminative
information are rejected during matching.
Chua et al. [40] used point signatures for representing 3D faces with non-neutral
expressions and only matched those extracted from the upper part of the face.
Husken [71] presented a multimodal approach using 2D and 3D hierarchical graph
matching (HGM). The HGM worked as an elastic graph storing local features in its
nodes, and structural information in its edges. Their 2D modality performed better
than the 3D. However, the fusion performed better than each individual modality,
with a 96.8% verification rate at 0.001 false acceptance rate on the FRGC v2 database
while matching neutral vs. all images. The matching approach is faster than ICP
or a similar iterative approach.
Chang et al. [27] developed a 3D face recognition system based on a combina-
tion of the match scores from matching multiple overlapping regions around the
nose (including a circle and an ellipse centered at the nose and the nose itself). The
matching and the fusion were performed using ICP and a product fusion rule respec-
tively. Their algorithm was tested with a database of 4485 3D scans of 449 subjects
including 2349 and 1590 probes and 449 and 355 gallery images for neutral and non-
neutral expressions respectively. The system provided 97.1% and 87.1% recognition
rates with Equal-Error Rate (EER) of 0.12 and 0.23 for neutral and non-neutral ex-
Figure 2.11: Sample of 3D face (top), variance in expressions (middle) and expression
insensitive binary masks of 3D faces (bottom) [114] (best seen in color).
pressions respectively. However, in another approach of combining the scores from
3D PCA and 3D ICP algorithms [26], they obtained a better recognition rate of 92%
for data with non-neutral expressions.
Li et al. [101] constructed separate PCA spaces for the texture (which is relatively
less affected by facial expressions) and the non-invariant geometric attributes by
fitting a triangulated generic mask and warping the texture. They combined them
using a linear weighted sum rule for face recognition. While testing the algorithm
on a subset of Yale face database with 90 face images (with six different expressions)
from 15 subjects, 96% accuracy was obtained. The approach was not fully automatic
as the marker points for the face masks were selected manually.
Faltemier et al. [49] demonstrated better results under facial expression changes
using multiple samples with multiple expressions for each subject in the gallery
dataset. On a superset of FRGC v2 dataset (called ND-2006) containing 13450 im-
ages with six different expressions, they obtained a 97.2% rank-one recognition rate
while using five gallery images including two with neutral and three with happiness
expressions. The drawbacks of the approach include the difficulty of acquiring
prompted expressions in a controlled setting and also the errors that may occur
when the expression of a probe matches that of a gallery scan belonging to a different
subject.
Mian et al. [114] proposed a fully automatic face recognition approach based on
pose correction using the nose tip point and the Hotelling transform algorithm, a
rejection classifier using the SFR (see Section 2.6.1) and the Scale-Invariant Feature
Transform (SIFT) descriptor. For matching, they used a novel region based match-
ing approach and the modified ICP algorithm. In an experiment with the FRGC
v2 dataset, they achieved 99.74% and 98.31% verification rates at a 0.001 false ac-
ceptance rate (FAR) and identification rates of 99.02% and 95.37% for probes with
neutral and non-neutral expressions, respectively.
2.7.2 Non-rigid Approaches
In non-rigid approaches, human faces are considered as non-rigid objects and
deformations are applied to 3D facial scans to counteract expression deformations
or to reduce their influence on the recognition performance [4]. Such deformation
modeling is possible as the deformations caused by non-neutral expressions follow
patterns governed by the underlying anatomy of the face. The success of
non-rigid approaches depends on the ability to differentiate between expression
deformations and interpersonal disparities. Some of the non-rigid approaches are
discussed below.
Li and Barreto [99] proposed integrating expression recognition and face recogni-
tion in a single system in order to solve the problem of facial expression deformations.
In their work, at first an assessment of the expression of an unknown face was made
and then an appropriate recognition sub-system was used for person recognition.
For non-neutral images, the correct face was found by modeling the variations of
the face features between the neutral face and the face with expression. Classifica-
tion was performed using Linear Discriminant Analysis (LDA) and Support Vector
Machines (SVM). The system was tested with 30 neutral and 30 smiling face (3D
range) images from 30 subjects and an 80% recognition rate was obtained for
smiling faces.
Bronstein et al. [18] proposed using an isometric model of facial expression. The
probe facial surface was embedded into the surface of the model without requir-
ing the same amount of information. The Generalized Multi-Dimensional Scaling
(GMDS) numerical core was used that provided flexibility in handling partial surface
matching. The GMDS made use of some metric distortions that served as dissim-
ilarity measures resulting in a small number of surface samples which were then
matched using a hierarchical matching strategy. They obtained a 100% recognition
rate for a test with 180 3D faces from 30 subjects with partial occlusions and six
different expressions. In this approach facial expressions are assumed to be isomet-
ric, which is not always true. Besides, it requires intensive preprocessing prior to
matching.
Kakadiaris et al. [91] and Passalis et al. [135] used an annotated face model
(AFM), wavelet analysis, normal maps and a composite alignment algorithm to
extract a biometric signature from 3D face data. The AFM was deformed elastically
to fit each face, thus allowing the annotation of its different anatomical areas such
Figure 2.12: Illustration of deformation models: (a) original annotated face model,
geometry image and normal image (left to right) used in [91], (b) original test scan,
anchor point extraction and deformable model used in [107] and (c) original and
bilinear model of two different expressions [125].
as the nose, eyes and mouth. The approach was tested with a large dataset of
5000 scans from FRGC v2 and UH databases. Verification rates of 99% and 95.6%
(at 0.001 FAR) were obtained respectively for the neutral vs. neutral and neutral vs.
non-neutral subsets of the FRGC v2 database. The approach is fully automatic,
pose-invariant and robust to noisy and missing data. Although the final matching
is efficient (using a non-iterative distance metric), the enrollment phase (illustrated
in Fig. 2.12(a)) is computationally expensive.
Lu and Jain [107] fitted a facial surface deformation model (see Fig. 2.12(b))
to the test scans to handle expressions and large pose variations. A geodesic-based
re-sampling approach was applied to extract the landmarks for the model. With a
subset of FRGC v2 dataset (150 test scans of 50 subjects each with one neutral, one
smiling and one surprise expression), rank-one identification accuracy of 97% was
achieved. The system requires manual landmark labeling for deformation modeling.
Wang et al. [163] proposed a guidance-based constraint deformation (GCD)
model to reduce the shape distortion caused by expression. The probe model was
deformed toward the gallery model with some constraints in a Poisson equation
framework prior to matching the two models. The approach reported 11.8% and 6%
improvement of identification and verification performance respectively compared
to ICP. However, since the deformation is not performed according to expression
patterns, some interpersonal disparities might be lost.
Mpiperis et al. [125] developed asymmetric bilinear models (see Fig. 2.12(c))
capable of decoupling the identity and facial expression factors. The models were
constructed after the establishment of correspondence among the set of faces. These
models were then fitted to unknown faces for recognition. A bootstrap set of faces
was used for tuning the models. The system achieved an 86% rank-1 face recognition rate
on the BU-3DFE face dataset of 100 subjects. Images of 50 subjects were used in
training the models and those of the remaining subjects were used for testing. The
performance of the approach is limited by its requirement of a large bootstrap
training set and accurate point correspondences between faces.
Table 2.2: Summary of recognition approaches for face with varying expressions-1

Category | Approach | Methodology | Advantages | Disadvantages
Rigid | Chua et al. [40] | Matching point signatures extracted from the upper part of the face | Simple and fast | Sensitive to outliers; loses discriminative features on regions other than the upper face
Rigid | Chang et al. [27] | Matching multiple overlapping regions around the nose using ICP | Simple | Loses some discriminative features from deformable regions
Rigid | Husken [71] | Using 2D and 3D hierarchical graph matching (HGM) | Matching is non-iterative and hence fast | Results for neutral vs. non-neutral scans were not reported
Rigid | Li et al. [101] | Combined texture and geometry attributes of faces using PCA | Considers six different expressions | Marker points for face masks were selected manually; tested on a smaller dataset
Rigid | Faltemier et al. [49] | Using multiple samples with multiple expressions for each subject in the gallery dataset | Better results than using a single sample | Problems in acquiring data; computationally expensive
Rigid | Mian et al. [114] | Using nose tip detection, SFR, SIFT, region-based matching and ICP | Simple, fast and high verification rate | Considers a single face per scan
Amor et al. [6] proposed to study the deformability and elasticity proper-
ties of the face based on an anatomical analysis and to segment the facial surface
into regions according to their degree of deformation and elasticity. They computed the final
Table 2.3: Summary of recognition approaches for face with varying expressions (Part 2)

Category | Approach | Methodology | Advantages | Disadvantages
Non-rigid | Li and Barreto [99] | Classifying the expression and using the corresponding recognition sub-system, using LDA and SVM | Simple and fast | Tested only for neutral and smiling expressions; low accuracy
Non-rigid | Bronstein et al. [18] | Using an isometric model, the GMDS numerical core and a hierarchical matching strategy | Fast and robust to partial occlusions | Assumes facial expressions are isometric and involves expensive preprocessing
Non-rigid | Kakadiaris et al. [91] | Using an annotated face model, wavelet analysis, normal maps and a composite alignment algorithm | Fully automatic and efficient matching | Expensive enrollment
Non-rigid | Lu and Jain [107] | Fitting a facial surface deformation model to the test scans and using geodesic-based re-sampling | Improved performance over rigid ICP | Not fully automatic and tested on only three expressions
Non-rigid | Wang et al. [163] | Using a GCD-based deformation model in a Poisson equation framework prior to matching | Improved performance over rigid ICP | Some interpersonal disparities may be lost during model fitting
Non-rigid | Mpiperis et al. [125] | Using asymmetric bilinear models | Simple | Requires expensive training and accurate point correspondences
Non-rigid | Amor et al. [6] | Segmentation of the facial surface based on deformations, giving priority to stable regions | Robust to shape deformations | Tested on a smaller dataset
Non-rigid | Al-Osaimi et al. [4] | Modeling expression deformations from training data in PCA eigenvectors and using them to morph out the deformations | Preserves more interpersonal disparities | Training is expensive
matching score giving more importance to the stable facial regions. On a subset
of the IV2 3D face dataset with 50 gallery images and 400 probes (eight instances per
subject, including four with non-neutral expressions) of 50 subjects, they obtained a 97.5% recognition
rate with an EER of 5.5%. The authors demonstrated improved performance of this
approach over global ICP, especially for severe shape deformations.
Al-Osaimi et al. [4] proposed a non-rigid face recognition approach where pat-
terns of expression deformations were modeled from training data in PCA eigenvec-
tors without leaving out the interpersonal disparities. The patterns were then used
to morph out the expression deformations from the 3D scans before matching and
extracting the similarity measures. They obtained a best identification rate of 95%
for 400 probes with non-neutral expressions while using a training data size of 1700
scan pairs. On FRGC v2 database, they obtained 98.35% and 97.8% verification
rates at 0.001 FAR for neutral and non-neutral expressions respectively.
A summary of the methodology adopted and the advantages and the disadvan-
tages or limitations of each of the above approaches is provided in Table 2.3.
2.8 Recognition Techniques with 3D Ear Data
After detection and extraction of 3D ear data and/or features, an appropriate
matching algorithm is applied for recognition. Based on whether any features are
extracted from the raw ear data for matching the ears in the gallery and the probe
databases, we can divide the existing ear recognition approaches into three groups
as discussed below.
Figure 2.13: Block diagram of the ear recognition system proposed in [83].
2.8.1 Approaches Using Local Features
Chen and Bhanu [32] used LSP (see Section 2.6.2) local features for coarse align-
ment prior to applying the ICP algorithm. They obtained an ear recognition accu-
racy of 96.4% using 302 gallery and probe images from the UND database (Collec-
tion F). However, they reported only 87.5% recognition for straight-on versus 45-degree-off
images. They also performed an evaluation on the UCR ES2 dataset and obtained a
94.4% rank-one recognition rate. Their approach requires manual extraction of the
ear contour prior to recognition in case the automatic detection (87.71% accuracy)
fails.
In their recent approach, Chen and Bhanu [35] reduced the matching time for ear
recognition by combining feature embedding and SVM based rank learning. ICP
was applied on a short list of candidate models which was generated by ranking
the similarities for all model-test pairs using the learning algorithm. Although they
obtained the best timing result with this approach (192 sec compared to 1270 sec on
an AMD Opteron 1.8 GHz processor), it came with a performance penalty of 2.4%
(96.7% to 94.3%) while testing on the 212 images of 212 subjects from the UND-F
dataset.
Islam et al. [83] used L3DFs (see Section 2.6.2) for coarse alignment. Corre-
spondence between the gallery and the probe features was established and the ini-
tial matching decision was made based on the distance between the feature surfaces.
The rotation between the matching features was used in coarse alignment of the best
few probes. Then they applied ICP on a minimal rectangular subset of the whole
3D ear data containing the corresponding features only. The approach is illustrated
in Fig. 6.1. While evaluating on the first 100 images of the UND-F dataset, they
obtained 90% rank-one recognition rate. They improved the performance in [80]
using a refined matching technique with geometrical consistency checks and using
both the rotation and translation obtained from the corresponding L3DF matching
for the coarse alignment. They obtained rank-one identification rates of 92.77% and
95.03% on the UND-J and the UND-F dataset respectively.
2.8.2 Approaches Using Global Features
Passalis et al. [136] proposed a different approach by extracting a compact bio-
metric signature for matching 3D ears. They used a generic Annotated Ear Model
(AEM), ICP and Simulated Annealing algorithms to register and fit each ear dataset.
They obtained 93.9% and 94.4% recognition rates for the UND Collection J and an
extended database of 525 subjects respectively.
2.8.3 Approaches Without Extracting Any Features
Yan and Bowyer [180] applied 3D ICP on UND-F dataset and obtained 84.1%
identification accuracy. To improve their accuracy, they investigated the effect of
combining scores from different algorithms [178]. On a database of 3D images from
302 subjects, they obtained rank-one recognition rate of 90.2%, 87.7% and 69.9%
for combining 3D ICP with 3D edge, 3D PCA with 3D ICP and 3D PCA with 3D
edge algorithms respectively. They also tested the effect of using multiple images
per subject on a dataset of 169 subjects (each with at least four images taken on
four different dates). Having two images per person in both the gallery and the
probe datasets and applying a new fusion rule using interval distance distribution
between rank-one and rank-two, they obtained a rank-one recognition rate of 97%
for 3D ICP algorithm. The corresponding result for one gallery and one probe was
only 81.7%. However, they improved their accuracy to 97.5% on a dataset of 404
subjects [176] and 98.7% on the UND-F dataset of 302 subjects by removing outliers
before applying a modified version of ICP [178].
In their recent work [179], Yan and Bowyer applied the snake algorithm to pre-
cisely crop the ear and a further modified version of ICP for matching 3D ear data.
They obtained 97.8% rank-one recognition with an EER of 1.2% on the UND data
set (Collection J) consisting of 1386 probes and 415 gallery images. However, an
accuracy of 95.7% is reported on a dataset of 70 images occluded with earrings. In
another experiment with a dataset of 24 subjects each having a straight-on and a 45
degree off center image, they achieved only a 70.8% recognition rate. The system
requires the nose tip or the ear pit to be clearly visible, which may not always be
the case due to pose variations or hair covering these regions.
Islam et al. [75] proposed a hierarchical approach for applying ICP for 3D ear
recognition. The ICP algorithm was first applied on low and then on high resolution
meshes of 3D ear data. The automatically detected ear regions were not concisely
cropped. A rank-one recognition accuracy of 93% was reported for the first 100
images of the UND-F dataset.
2.8.4 Summary and Discussion
In this section, approaches for ear recognition are categorized and described.
A summary of the representative methods of these three groups is illustrated in
Table 6.1.
In general, approaches with local features are faster in computation and suitable
for application with large numbers of users. Both approaches in [83] and [32] use
Table 2.4: Summary of the existing 3D ear recognition approaches

Category | Source | Algorithm Used | Gallery | Probe | RR (%)
Using local features | Chen and Bhanu [32] | LSP and ICP | 302 | 302 | 96.4
Using local features | Islam et al. [83] | L3DF and ICP | 302 | 302 | 95.03
Using local features | Chen and Bhanu [35] | LSP, SVM and ICP | 212 | 212 | 94.3
Using global features | Passalis et al. [136] | AEM, ICP and DMF | 415 | 415 | 93.9
Without extracting any feature | Yan and Bowyer [179] | The snake and modified ICP | 415 | 1386 | 97.8
Without extracting any feature | Yan and Bowyer [176] | Modified ICP | 404 | 404 | 97.5
Without extracting any feature | Yan and Bowyer [178] | Modified ICP | 302 | 302 | 98.7
Without extracting any feature | Islam et al. [75] | Hierarchical application of ICP | 100 | 100 | 93
local features for recognition. However, the latter uses these for coarse alignment
only and the former uses them for the rejection of a large number of false matches
as well as for a coarse alignment of the remaining candidates prior to the ICP
matching. Both rotation and translation are used for the coarse alignment in these
two approaches whereas authors in [179] use only translation (no rotation) for this
purpose.
Approaches based on ICP alone are computationally more expensive,
as the algorithm is iterative. The recognition performance of the approaches
using ICP also depends on how concisely each ear has been detected because the
hair and skin around the ear make the alignment unstable. This explains the higher
recognition results in [179] and [32] compared to those in [83]. In [179], ICP is used
for matching concisely cropped ear data (using the snake algorithm) and in [32] a
large number (12.29%) of ears are concisely cropped manually in case their auto-
matic detector fails. Also, in [179, 32] ICP is applied on every gallery-probe pair
whereas in [83], the use of L3DF allows one to apply ICP on a subset (best 40) of
pairs for identification.
Although the final matching time is low (less than 1 ms per comparison) in [136],
its enrollment and feature extraction modules using both ICP and Simulated An-
nealing are computationally more expensive (15-30 sec on a 3-GHz Pentium 4 PC).
The approach also requires that the ear pit is not occluded because the annotated
ear model used for fitting the ear data is based on this area. It also does not perform
well for ears with intricate geometric structures.
Most of the existing approaches use either the right or the left ear data for
recognition of a subject. Insignificant performance differences (0.6%) are reported
in [170] for using the two ear images separately. However, only around 90% accuracy is
reported by Yan and Bowyer [176] when matching a mirrored left ear against a
stored right ear, indicating that symmetry-based ear recognition cannot be expected
to be highly accurate. Experiments are also performed to see the effect of using both
ear data in a multimodal approach. Lu et al. [106] and Xiaoxun and Yunde [170]
reported approximately 2% and 5% improved accuracy respectively by fusing data
from both ears.
2.9 Multi-Biometric Recognition with 3D Ear and Face
Most of the ear-face biometric approaches use score-level fusion. Although there
are some 2D approaches for fusing ear and face at the level of feature (e.g. [133, 172,
171, 28]) and data (e.g. [184]) levels, to the best of our knowledge, there are no 3D ear-face
multimodal approaches fused at these two lower levels. A summary of the relevant
available multimodal approaches is given in Table 7.1 and discussed as follows.
Table 2.5: Multi-biometric approaches with 3D ear and face data

Source | Methodology | Dataset (#probes, #galleries) | RR (%)
Yan [175] | Using ICP with sum and interval fusion rules on multi-instance gallery and probe images | 174*2*2, 174*2*2 | 100
Islam et al. [76, 80] | Using L3DF matching, ICP and a weighted sum rule | 315*2, 326*2 | 99.04
Theoharis et al. [152] | Annotated model fitting and using ICP, SA and wavelet analysis | 324*2, 324*2 | 99.7
Yan [175] combined ear and face at score level using the sum and interval fusion
rules. On a dataset of 174 subjects, each with two ear shapes and two face shapes
in the gallery and the probe datasets, they obtained rank-one recognition rates of
93.1%, 97.7% and 100% for the ear, the face and the fusion respectively.
Theoharis et al. [152] proposed a unified 3D face and ear recognition system
using wavelets. They extracted geometry images from 3D ear and face data by
fitting annotated ear and face models representing the respective average shapes to
them through an ICP and simulated annealing based registration process. Then,
the wavelet transform was applied to the extracted images to find the biometric
signature. For each modality, the distance between the feature vectors of gallery
and probe is weighted accordingly and then summed up for fusion. Although the
final score of each individual modality was not computed, we classify this approach
as score-level fusion, since the feature vectors of the two modalities were not
combined before the per-modality distances were computed, and the result
is therefore comparable to score-level fusion.
In a multimodal database composed of 324 gallery and the same number of
probe images (all collected from FRGC v2 and UND databases), 99.7% rank-one
recognition was reported for the above approach. The probe dataset for this ex-
periment contained some images with non-neutral expressions but most of them
were with neutral expression. However, the identification plots of this experiment
(see Fig. 2.14) illustrate the importance of the multimodal fusion. The fact that
the fusion curve reaches a 100% recognition rate before rank 15, whereas neither
single modality reaches 100% up to rank 20, indicates that the failures to
identify a subject are uncorrelated; therefore, one modality can compensate for the
shortcomings of the other.
Figure 2.14: Identification plots of ear, face and fusion of these two modalities [152].
Recently, Islam et al. [76] proposed an L3DF-based approach for fusing 3D ear and
face data at score level. As shown in Fig. 2.15, at first, they detected the ear and
the face automatically using techniques in [74] and [114] respectively. Following a
normalization step, face and ear L3DFs were extracted and matched as described
in [117] and [83] respectively. Matching scores from the ear and the face modalities
are then fused according to a weighted sum rule. The performance of the system
was evaluated on a multimodal dataset with 326 gallery images and 315 probes
Figure 2.15: Block diagram of the L3DF based ear-face multimodal recognition
system fused at score level [76].
with neutral facial expression and 311 probes with non-neutral facial expressions
all collected from the FRGC v2 and the UND Biometric databases. They obtained
98.71% and 98.1% identification rates, and 99.68% and 96.83% verification rates
at an FAR of 0.001, for the probe sets with neutral and non-neutral
images respectively. Using a refined matching technique [80], they improved their
identification rate to 99.04% in the case of non-neutral face data.
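In outline, the weighted sum rule used at the score level can be sketched as follows. This is a minimal illustration in Python/NumPy, not the tuned system of [76]: it assumes min-max normalized matching errors per modality and a single illustrative ear weight w_ear.

    import numpy as np

    def min_max_normalize(scores):
        # Map raw matching errors onto [0, 1]; lower still means better.
        s = np.asarray(scores, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)

    def fuse(ear_scores, face_scores, w_ear=0.5):
        # Weighted sum rule: normalize each modality, then combine.
        return (w_ear * min_max_normalize(ear_scores)
                + (1.0 - w_ear) * min_max_normalize(face_scores))

    # Identification: the gallery subject with the smallest fused score wins.
    ear = [0.8, 0.3, 0.9]   # hypothetical per-subject ear matching errors
    face = [0.6, 0.4, 0.7]  # hypothetical per-subject face matching errors
    print("rank-1 match:", int(np.argmin(fuse(ear, face))))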
2.10 Challenges
In this section, the problems faced by researchers and the challenges to be
addressed in achieving reliable performance with the ear and the face biometrics
are identified and discussed.
2.10.1 Image Acquisition Related Challenges
1. Sensor Errors: Artifacts are normally found in the sensed 3D data especially
on the oily or the hairy regions of the face or the ear such as the ear pits, the
eyes, eyebrows, mustache or beard. Missing data (holes) occur when a sensor
is unable to acquire data and outliers (spikes) occur due to an inter-reflection
in a projected light pattern or a correspondence error in stereo [17].
Sensor errors can be modeled using statistical techniques, and their impact can be
minimized by propagating the estimated errors through the identification system to
provide a realistic confidence measure on the final decision.
2. Image Distortions: Digital images obtained by the sensors are sometimes dis-
torted due to compression, damage to storage devices and transmission over
noisy channels. To develop robust biometric recognition, more research needs
to be performed exploring the sources of image distortion.
3. Cost of the Sensors: Although 2D data can be obtained using very cheap,
ordinary cameras, 3D sensors are still too expensive for common use.
2.10.2 Robustness Related Challenges
1. Occlusion: Apart from the artifacts due to the sensor errors, there might
be other reasons for occlusions in the captured data such as the presence of
earrings or facial ornaments, sun-glasses, hair coming over the ear or the face or
the growth of beards and mustaches. Although quite acceptable recognition
rates are achieved with a clear view of the ear and the face data, accurate
recognition under occlusion is still a great challenge.
2. Facial Expression: Recognizing faces under non-neutral expressions is chal-
lenging due to their non-linear nature and the lack of an associated mathematical
model. As discussed in Section 2.7, 3D face recognition techniques proposed
so far have not yet achieved significant accuracy for large databases with un-
constrained expression changes. In comparison to approaches based on non-
invariant features, approaches that analyze and model different modalities of
expressions seem more promising and require further attention.
3. Aging: Aging has an effect on facial appearance and often the gallery im-
ages are taken well before the probe images. To avoid having to keep the gallery
images up-to-date, the generic effects of aging can be modeled. However, mod-
eling of such effects is very difficult and very few researchers have worked on
this [193].
2.10.3 Efficiency Related Challenges
1. Efficient Fusion Technique: An important avenue for improving existing mul-
timodal biometric systems is to apply an efficient data or feature level fusion.
Fusion at the match score or decision level is easy to perform. But fusion
at these levels may not fully exploit the discriminating capabilities of the
combined biometrics. Fusion at the data or feature extraction level is be-
lieved to produce better results in terms of accuracy and robustness because
richer information about the ID or the class of an object can be combined at
these levels [146]. However, fusion at the feature level is the most challeng-
ing [146, 86, 144], because the feature sets of various modalities may not be
compatible and the relationship between the feature spaces of different biomet-
ric systems may not be known. Moreover, the resulting feature vectors may increase
in dimensionality, and a significantly more complex matching algorithm may be re-
quired. In addition, good features may be degraded by bad features during
fusion; hence, an efficient feature selection approach must be applied prior
to fusion. These challenges should be addressed for a successful multimodal
approach.
2. Efficiency of Matching Algorithm: The speed of a multimodal biometric recog-
nition system is an important factor for real time applications, particularly
when deployed in public places such as airports and stadiums. Unfortunately,
most of the matching algorithms that address the issue of accuracy (such as
ICP used in [179]) are computationally expensive. Therefore, developing an
accurate as well as time-efficient algorithm is of great research interest.
2.10.4 Application Related Challenges
1. Scalability and Benchmarks: Testing with significantly larger databases and
getting acceptable results is another big challenge to be addressed. Most of
the proposed biometric systems are tested with databases containing data from
fewer than 500 subjects. Again, although there is a benchmark database like
FRGC v2 for 3D face data, there is no comparable benchmark for 3D ear data.
2. Automation: Currently there are very few fully automatic recognition systems
available. However, real-time applications require that recognition be performed
in a fully automatic manner.
2.11 Conclusion
In this paper, an up-to-date review of existing approaches for two promising
biometric traits, the ear and the face, is provided. Starting with preliminary
concepts, the paper categorizes and analyzes all the techniques involving data ac-
quisition, detection, representation and unimodal and multimodal recognition with
these two modalities, thus providing the reader with a comprehensive overview of
the research field. It is found that many solutions have been proposed with unimodal
approaches, and most of them report quite high recognition and low error rates in
controlled scenarios; however, they suffer a significant decrease in accuracy in the
presence of pose and expression variations and occlusions. Although it is perceived
that the accuracy and robustness can be increased with fusion of 3D ear and face,
very few such approaches have been proposed. The identification and discussion of
the underlying problems and challenges in this paper imply that significant further
research should be performed in the area of developing fast and fully automatic
ear-face multimodal systems using low-cost acquisition devices and with a data or
feature level of fusion.
Acknowledgements
We acknowledge the use of profile images from the UND and the USTB databases.
We would like to thank Dr. Spadaccini, Dr. Thorne and Dr. Sohel for their useful
reviews.
CHAPTER 3
An ICP Based Hierarchical Matching Approach
for 3D Ear Recognition
Abstract
The use of ear shape as a biometric trait for recognizing people in different
applications is one of the most recent trends in the research community. In this
work, a fully automatic and fast technique based on the AdaBoost algorithm is used
to detect a subject’s ear from his/her 2D and corresponding 3D profile images. A
modified version of the Iterative Closest Point (ICP) algorithm is then used for the
matching of this extracted probe ear to the previously stored ear data in a gallery
database. A coarse-to-fine hierarchical technique is used where the ICP algorithm
is first applied on low and then on high resolution meshes of 3D ear data. We
obtain a rank one recognition rate of 93% while testing with the University of Notre
Dame Biometrics Database. The proposed recognition approach does not require
any manual intervention or sharp extraction of ear contour from the detected ear
region. No segmentation of the extracted ear is required and more importantly, the
system performance does not rely on the presence of a particular feature of the ear.
3.1 Introduction
Due to instances of fraud with the traditional ID based systems, biometric recog-
nition systems are gaining popularity day by day. In such a system, one or more
physiological (e.g. face, fingerprint, palmprint, iris and DNA) or behavioral (e.g.
handwriting, gait and voice) traits of a subject are taken into consideration for au-
tomatic recognition. Although the ear as a biometric trait is not as accurate as iris
or DNA, it is non-intrusive and easy to be collected. The face is also non-intrusive
but its appearance is affected by changes in facial expressions, use of cosmetics or
eye glasses. The ear is also smaller in size but rich in features and its shape does
not change with aging between 8 years and 70 years [72]. It can be used separately
or in a multimodal approach with the face for effective human recognition in many
applications including some national IDs, security, surveillance and law enforcement
This article is published in the Proc. of the Fourth International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT'08), pp. 131-141, June 18-20, 2008, with the title "Fully Automatic Approach for Human Recognition from Profile Images Using 2D and 3D Ear Data".
applications. However, in any of these applications, accurate recognition of the ear
is an important step.
A biometric recognition system may operate in one or both of two modes: au-
thentication and identification. In authentication, one-to-one matching is performed
to compare a user’s biometric to the template of the claimed identity. In identifi-
cation, one-to-many matching is done to associate an identity with the user by
matching it against every identity in the database. For both modes, an accurate
detection is an essential pre-requisite. However, ear detection from arbitrary profile
images is a challenging problem due to the fact that the ear is sometimes occluded
by hair and earrings and ear images can vary in appearance under different view-
ing and illumination conditions. Consequently, most of the existing ear recognition
algorithms assume that the ear has been accurately detected [77].
One of the earliest ear detection methods uses Canny edge maps to detect the
ear contour [20]. Hurley et al. [70] proposed the “force field transformation” for ear
detection. Alvarez et al. [5] used a modified active contour algorithm and Ovoid
model for detecting the ear. Yan and Bowyer [179] proposed taking a predefined
sector from the nose tip to locate the ear region. The non-ear portion from that
sector is cropped out by skin detection and the ear pit was detected using Gaussian
smoothing and curvature estimation. Then, they applied an active contour algo-
rithm to extract the ear contour. The system is automatic but fails if the ear pit
is not visible. Li Yuan and Mu [185] used a modified CAMSHIFT algorithm to
roughly track the profile image as the region of interest (ROI). Contour fitting
is then performed on the ROI for more accurate localization using the contour information
of the ear.
Most recently, Islam et al. [74] proposed an ear detection approach based on
the AdaBoost algorithm [149]. The system was trained with rectangular Haar-
like features and using a dataset of varied races, sexes, appearances, orientations
and illuminations. The data was collected by cropping and synthesizing from the
University of Notre Dame (UND) biometrics database [157, 179], the NIST Mugshot
Identification Database (MID), the XM2VTSDB [112], the USTB, the MIT-CBL
and the UMIST database. The approach is fully automatic, provides 100% detection
when tested with 203 non-occluded images of the UND profile face database, and
also works well with some occluded and degraded images.
As summarized in the survey of Pun et al. [143] and Islam et al. [77], most
of the proposed ear recognition approaches use either PCA (Principal Component
Analysis) or the ICP algorithm for matching. Choras [37] proposed a different
automated geometrical method. Testing with 240 images (20 different views) of 12
subjects, a 100% recognition rate was reported. Genetic local search and force field
transformation based approaches have also been proposed by Yuizono et al. [187]
and Hurley et al. [70] respectively. The first ever ear recognition system tested with
a larger database of 415 subjects is proposed by Yan and Bowyer [179]. Using a
modified version of the ICP, they achieved an accuracy of 95.7% with occlusion and
97.8% without occlusion (with an Equal Error Rate (EER) of 1.2%). The system
does not work well if the ear pit is not visible.
In this work, we have adopted the work of Islam et al. [74] for detecting the
ear from 2D profile images and then extended it for cropping the corresponding 3D
profile face data. After ear detection, we apply a variant of the Iterative Closest
Point (ICP) algorithm for recognition of the ear at different mesh resolutions of the
extracted 3D ear data. Using two different resolutions hierarchically, we obtain a
rank-1 recognition rate of 93%. The proposed system is fully automatic and does
not rely on the presence of a particular feature of the ear (e.g. the ear pit). It also
does not require an accurate extraction of the ear contour and hence reduces the
computational cost. Besides, the ear recognition results can be combined with other
biometric modalities such as 2D and 3D faces to obtain a more robust and accurate
human recognition system.
The paper is organized as follows. The proposed system for 3D ear detection
and that for 3D ear recognition are described in Section 3.2 and 3.3 respectively.
Results obtained are reported and discussed in Section 5.4. Section 7.9 concludes
our findings.
3.2 3D Ear Detection and Normalization
The proposed 3D ear detection approach utilizes the AdaBoost based 2D ear
detection technique developed by Islam et al. [74] (described fully in Section 6.3 of
this dissertation). Since the 3D profile data are co-registered with the corresponding
2D images, the 2D ear detector is first scanned through the whole profile image to
localize the ear. A rectangle is placed covering the ear. The 3D data corresponding
to this rectangular region is cropped for use in 3D ear recognition. The complete
ear detection process from 2D and 3D face profile image/data is shown in Figure
6.2. A sample of a profile image and corresponding 2D and 3D ear data detected by
our system is also shown in the same figure.
Once the 3D ear is detected, we remove all the spikes and holes by filtering
the data. We perform triangulation on the data points, remove edges longer than a
Figure 3.1: Block diagram of the 3D ear detection approach.
threshold of 0.6 and, finally, remove disconnected points. The data are then normalized
by shifting to the mean and uniformly sampled using a grid of size 82 by 56
pixels.
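The centering and resampling steps can be sketched as below. This is a minimal version assuming a filtered N x 3 (x, y, z) point cloud, with SciPy's griddata standing in for whatever interpolator was actually used:

    import numpy as np
    from scipy.interpolate import griddata

    def normalize_ear(points, rows=82, cols=56):
        # Shift the point cloud to its mean (normalization step).
        pts = np.asarray(points, dtype=float)
        pts = pts - pts.mean(axis=0)
        # Uniformly sample the depth values over an 82 x 56 grid spanning
        # the x-y extent of the cropped ear region.
        gx, gy = np.meshgrid(np.linspace(pts[:, 0].min(), pts[:, 0].max(), cols),
                             np.linspace(pts[:, 1].min(), pts[:, 1].max(), rows))
        gz = griddata(pts[:, :2], pts[:, 2], (gx, gy), method='linear')
        return np.dstack((gx, gy, gz))  # rows x cols x 3 normalized ear data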
3.3 3D Ear Matching and Recognition
The extracted 3D ear data of a subject (also called probe data) is matched with
one (for authentication) or all (for identification) 3D ear data (also called gallery
data) stored in the gallery database built off-line. Matching can be performed based
on the error of registering the two data sets, more specifically, two clouds of
points. The ICP algorithm [10] is considered one of the most accurate algorithms
for this purpose. However, it is computationally expensive and may converge to a
local minimum if the two data sets are not nearly registered. To minimize these
limitations, we adopted a coarse-to-fine hierarchical technique. ICP is first applied
on low and then, on high resolution meshes of the 3D ear data.
The mesh reduction was performed using the surface simplification algorithm of
Garland and Heckbert [57] as it preserves features on the surfaces of a mesh (unlike
the sampling reduction which removes points without any regard to the features).
The degree of mesh reduction can be controlled by fixing the number of triangles
to be retained in the reduced mesh or by fixing the surface error approximation
(using quadric matrices). We achieved better results using the latter option (see
Section 3.4.3).
Initially, reduced meshes (see Figure 3.2) of the probe (created on-line) and the
gallery (created off-line) ear data are used for coarse registration with the ICP. The
rotation and translation resulting from this coarse registration are applied to the
original data set and then, the ICP algorithm is applied to them to get a finer
match. The matching approach which we term as the two-step ICP is shown in the
Figure 3.2: Sample of the full and reduced meshes of the extracted 3D ear data:
(a) full mesh with 55,006 triangles; (b) reduced mesh with 2,000 triangles; (c)
reduced mesh with 400 triangles.
flowchart of Figure 3.3. Subscripts 'p' and 'g' in the flowchart are used for the probe
and the gallery respectively.
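A compact sketch of the two-step matching follows, assuming N x 3 NumPy arrays of probe and gallery points. A bare point-to-point ICP stands in for the modified ICP actually used, and simple point subsampling stands in for the quadric-error mesh simplification of [57]; only the coarse-then-fine structure is the point being illustrated.

    import numpy as np
    from scipy.spatial import cKDTree

    def icp(source, target, iterations=30):
        # Minimal point-to-point ICP: returns (R, t, RMS registration error).
        src, tgt = np.asarray(source, float), np.asarray(target, float)
        tree = cKDTree(tgt)
        R_tot, t_tot = np.eye(3), np.zeros(3)
        for _ in range(iterations):
            _, idx = tree.query(src)              # closest-point pairs
            m = tgt[idx]
            sc, mc = src.mean(0), m.mean(0)
            U, _, Vt = np.linalg.svd((src - sc).T @ (m - mc))
            R = Vt.T @ U.T
            if np.linalg.det(R) < 0:              # avoid reflections
                Vt[-1] *= -1
                R = Vt.T @ U.T
            t = mc - R @ sc
            src = src @ R.T + t
            R_tot, t_tot = R @ R_tot, R @ t_tot + t
        err = np.sqrt((tree.query(src)[0] ** 2).mean())
        return R_tot, t_tot, err

    def two_step_icp(probe, gallery, step=10):
        # Step 1: coarse registration on reduced (low-resolution) data.
        R, t, _ = icp(np.asarray(probe, float)[::step],
                      np.asarray(gallery, float)[::step])
        # Step 2: fine registration of the coarsely aligned full data.
        _, _, err = icp(np.asarray(probe, float) @ R.T + t, gallery)
        return err  # gallery ear with the smallest error is the rank-1 match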
3.4 Results and Discussion
In this section, results of our recognition system using both two and three levels
of mesh resolution are reported. The reasons for misclassifications are also discussed.
3.4.1 Dataset Used
The recognition performance of the proposed system was evaluated against two
datasets collected from the UND Biometrics Database [157, 179]. Dataset A consists
of 200 arbitrarily selected profile images of 100 different subjects, each of size
640 by 480 pixels. Among the chosen images, 100 images collected in the year
2003 are used in the gallery database and another 100 images of the same subjects
collected in the year 2004 are used for the probe database. No image of this dataset
was used in the training of the detection classifiers. In Dataset B, the whole UND
database is used, excluding the images of two subjects (due to data errors). 300
images taken in 2003 are used as the gallery. For each subject, one image out of the
multiple available images is arbitrarily chosen as the probe.
3.4.2 Recognition Rate of the Single-step ICP
Using Dataset A, we obtained recognition rates of 93% and 94% at rank one and
rank ten respectively with the single-step ICP. Testing with Dataset B provided
recognition rates of 93.98%, 94.31%, 95.31% and 96.32% at ranks one, two, three and
ten respectively. Figure 3.4 shows a sample of the correct recognition. It is noticed
that the system works even in the presence of partial occlusions due to hair and
ear-rings.
Figure 3.3: Flowchart of the matching algorithm using coarse-to-fine hierarchical
technique with ICP.
3.4.3 Improvement Obtained with the Two-step ICP
The recognition rate with Dataset A improved to 93%, 94% and 95% for rank
1, rank 2 and rank 7 respectively when we applied the two-step ICP as mentioned
in Section 3.3. In our experiment, we chose a quadric error factor of 10 for the mesh
reduction. The improvement is shown by the plots in Figure 3.5. It is worth men-
tioning that the recognition results become worse (84%, 88% and 88% respectively)
when we fix the number of triangles at 400 for the mesh reduction. This is due to
the fact that the size of the detected window is not the same for all the ear images.
The two-step ICP provides better results in the cases where the single-step ICP
converged to a local minimum and also where the probe and the gallery images differ
slightly in rotation and translation. An example of this improvement is shown in
figure 3.6.
3.4.4 Analysis of the Misclassification
Both single-step and two-step ICP failed to recognize some of the probe images.
Visual inspection of these probe images and the corresponding gallery images in-
Figure 3.4: Example of recognition: (a) 2D and range image of the gallery ear (b)
2D and range image of the probe ear with a small ear-ring and hair (This figure is
best seen in color).
Figure 3.5: Recognition rates with Single-step and Two-step ICP.
Figure 3.6: Examples of gallery-probe pairs correctly recognized by the two-step ICP
but not by the single-step ICP [the left image is the gallery and the right one is the probe].
dicates that the ICP algorithm cannot work properly if there is a large variation in
pose (rotation and translation) as shown in Figure 3.7-(a, b, c), occlusion by hair
or ear-rings as shown in Figure 3.7-(a, d), or missing data as in Figure 3.7-(e).
The pose variation might occur during the capture of the probe profile
image as mentioned in Section 7.3.1, or be introduced by the detection program (see Figure 3.7-(d)).
Figure 3.7: Examples of misclassification: (a, b, c) 2D image of a gallery (left) and
the corresponding probe (right) with pose variations; (d) 2D image of a probe with
occlusions; (e) range image of a probe having missing data.
3.5 Conclusion
Our method for ear detection and recognition from profile images is fully auto-
matic and robust to some degree of occlusion due to hair and ear-rings. One of
its major strengths is that it does not assume accurate prior detection of the ear.
It also does not rely on the presence of particular features like the ear pit. Thus,
it is suitable for any non-intrusive biometric application. In this paper, it is also
shown that the recognition performance of the ICP algorithm improves with the
proposed hierarchical approach. ICP performance may be further improved by pro-
viding better initial registration. This can be done by invariant feature matching.
The recognition system can be made more robust to occlusion by representing the
extracted ear data with occlusion-invariant representations. These constitute our
future research tasks.
Acknowledgements
Authors acknowledge the use of the University of Notre Dame Biometrics Database
for 3D ear detection and recognition. This research is sponsored by ARC grants
DP0664228 and DP0881813.
CHAPTER 4
A Fast and Fully Automatic Ear Recognition
Approach Based on 3D Local Surface Features
Abstract
Sensitivity of global features to pose, illumination and scale variations encour-
aged researchers to use local features for object representation and recognition.
Availability of 3D scanners also made the use of 3D data (which is less affected
by such variations compared to its 2D counterpart) very popular in computer vision
applications. In this paper, an approach is proposed for human ear recognition based
on robust 3D local features. The features are constructed on distinctive locations
in the 3D ear data with an approximated surface around them based on the neigh-
borhood information. Correspondences are then established between gallery and
probe features and the two data sets are aligned based on these correspondences.
A minimal rectangular subset of the whole 3D ear data only containing the corre-
sponding features is then passed to the Iterative Closest Point (ICP) algorithm for
final recognition. Experiments were performed on the UND biometric database and
the proposed system achieved 90, 94 and 96 percent recognition rates for ranks one,
two and three respectively. The approach is fully automatic, comparatively very fast
and makes no assumption about the localization of the nose or the ear pit, unlike
previous works on ear recognition.
4.1 Introduction
Among the biometric traits used for computer vision, the face and the ear
have gained most of the attention of the research community due to their non-
intrusiveness and the ease of data collection. Face recognition with neutral expres-
sions has reached its maturity with a high degree of accuracy. But changes of face
geometry due to the changes of facial expression, use of cosmetics and eye glasses,
aging, covering with beard or hair significantly affect the performance of face recog-
nition systems. The ear is considered as an alternative to be used separately or
in combination with the face as it is comparatively less affected by such changes.
However, its smaller size and the frequent presence of nearby hair and ear-rings make
it very challenging to use in non-interactive biometric applications.
Published in Lecture Notes in Computer Science (LNCS) 5259, J. Blanc-Talon et al. (Eds.), pp. 1081-1092, Oct. 2008.
As noted in the survey of Pun et al. [143] and Islam et al. [77], most of the pro-
posed ear recognition approaches use either Principal Components Analysis (PCA)
[188, 70, 28] or the ICP algorithm [188, 70, 28, 179, 181, 177, 30] or their combination
[176] for matching purposes. Choras [37] and Yuizono et al. [187] proposed geomet-
rical feature-based and genetic local search based approaches respectively. Both
reported error-free recognition, but on comparatively smaller datasets con-
taining high-quality 2D ear images taken on the same day and without any
hair or ear-rings. Similarly, Hurley et al. [70] proposed the force field transformation
for ear feature extraction and claimed 99.2% recognition on a smaller data set of
only 63 subjects and without considering occlusions with ear-rings and hair.
The first ever ear recognition system tested with a larger database (415 subjects)
is proposed by Yan and Bowyer [179]. Using an automatic ear detection based on
the localization of the nose and the ear pit, active contour based ear data extraction
and, finally, matching with a modified version of the ICP, they achieved an accuracy of
95.7% allowing occlusion and 97.8% on examples without any occlusion (with an
Equal Error Rate (EER) of 1.2%). The system is not expected to work properly if
the nose (for example, due to pose variation) or the ear pit (for example, due to its
being covered with hair) is not clearly visible, which is a common case. In an experiment
where straight-on ear images were matched with twenty-four 45-degree-off images
(a subset of Collection G of the UND database), it achieved only a 70.8% recognition
rate.
Most of the approaches above are based on global features. This requires an
accurate normalization of ear data with respect to pose, illumination and scale.
These approaches are also inherently sensitive to occlusion. As demonstrated in
this paper, local features are less affected by these factors. Recently, Chen and
Bhanu [32] used a local surface shape descriptor to represent ear data. However,
they only used the representation for a coarse alignment of the ear. The whole ear
data was then used for matching with a modified version of ICP. They obtained
96.4% recognition on the Collection F of UND database (302 subjects) and 87.5%
recognition for straight-on to 45 degree off images. They reported an ear detection
accuracy of 87.1% only. Moreover, they assume that all the ear data are accurately
extracted (manually, if needed) from the profile images prior to recognition.
In this paper, the 3D local surface features proposed for face recognition in [117]
are adapted for the ear recognition. The authors of the work reported a very high
recognition rate of 99% on neutral versus neutral and 93.5% on neutral versus all
face data when tested on the FRGC v2 3D face data set. They also obtained a very
good time efficiency of 23 matches per second on a 3.2 GHz Pentium IV machine
with 1GB RAM. However, since ear features are different and more challenging than
face features, we modified the feature creation and matching approach to make them
suitable for ear recognition. Following [117], at first a small number of distinctive
3D feature point locations are identified on each fully automatically detected
3D ear region. A 3D surface is then approximated around each selected keypoint
based on the nearby data points and used as the feature for that point. A coordinate
frame centred on the key point and aligned with the principal axes from PCA is
used to make the features pose invariant. Correspondence is established between
the gallery and the probe features and the matching decision is made based on the
distance between the feature surfaces and the transformation between them. This
yields a reasonable recognition method based on only local features. However, the
recognition performance is improved by aligning the probe and the gallery data set
based on the initial transformation between the corresponding features and followed
by the application of the Iterative Closest Point (ICP) algorithm on only a minimal
rectangular subset of the whole 3D ear data containing the corresponding features
only. This novel approach of extracting a reduced data set for final alignment
significantly increases the time efficiency as well. Thus, the proposed system has
three main advantages: 1) it is fully automatic, 2) it is comparatively very fast and 3) it makes
no assumption about the localization of the nose or the ear pit, unlike previous
works on ear recognition.
The paper is organized as follows. The proposed approach for 3D ear recognition
is described in Sect. 5.3. The results obtained are reported and discussed in Sect.
5.4 followed by conclusions in Sect. 7.9.
4.2 Methodology
Our ear recognition system consists of seven main parts as shown in Fig. 6.1.
Each of the components is described in this section.
4.2.1 Ear Data Extraction and Normalization
The ear region is detected on 2D profile face images using the AdaBoost based
detector described by Islam et al. [74]. This detector is chosen as it is fully automatic
and also due to its speed and high accuracy of 99.89% on the UND profile face
database with 942 images of 302 subjects [75]. The corresponding 3D data is then
extracted from the co-registered 3D profile data as described in [75]. To ensure the
whole ear is included and to allow the extraction of features on and slightly outside
Figure 4.1: Block diagram of the proposed ear recognition system.
the ear region, we expanded the detected ear region by an additional 25 pixels
in each direction.
Consequently, the extracted 3D ear data varies in dimensions depending on the
detection window. Hence, we normalized the 3D data by centering on the mean and
then sampling on a uniform grid of 132 by 106. The surface fitting was performed
using an interpolation algorithm at 0.5mm resolution. Since there were some missing
data regions as shown in Fig. 6.7, we removed interpolated data for those regions
after fitting to the grid.
4.2.2 Feature Location Identification
A 3D local feature can be depicted as a 3D surface constructed using data points
within a sphere of radius r1 centred at location p. As outlined by Mian et al. [117], the
criterion for identifying a feature location is that it should lie on a surface
distinctive enough to differentiate between range images of different persons.
To avoid many matching features in a single small region (in other words, to
increase distinctiveness), we only consider as possible feature points those that lie on a
2mm grid. Then we find the distance of each data point from the boundary and
keep only those points with a distance greater than a predefined boundary limit.
The boundary limit is chosen slightly larger than the radius of the 3D local feature
surface (r1) so that the feature calculation does not depend on regions outside the
boundary and the allowed region corresponds closely with the ear. We call the points
within this limit seed points.
To check whether the data points around a seed point contain enough descriptive
information, we adopt the approach of Mian et al. [117] discussed in short as follows.
We randomly choose a seed point and take a sphere of data points around that point
Figure 4.2: Locations of local features (shown with dots) on the range images of
different views (in rows) of different individuals (in columns). (This figure is best
seen in color).
which are within a distance of r1. We apply the PCA on those data points and align
them with their principal axes using the computed rotation matrix. The difference
between the ranges of the first two principal axes of the local region is computed
as δ, which is then compared to a threshold (δt). We only accept a seed point as a
distinctive feature location if δ is higher than δt. The higher δt is, the fewer
features we get, but lowering δt can result in the selection of less significant feature
points. This is because the value of δ indicates the extent of asymmetric variation
in depth in that point cloud. For example, a δ of zero for a point cloud means it
could be completely planar or spherical.
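A sketch of this seed-point test, assuming points is an N x 3 NumPy array of the ear's 3D data (here the PCA is done via an SVD of the centered neighborhood, and the parameter values follow those quoted below):

    import numpy as np

    def is_distinctive(points, seed, r1=10.0, delta_t=2.0):
        # Sphere of data points within r1 of the seed point.
        nbr = points[np.linalg.norm(points - seed, axis=1) <= r1]
        # Align the local region with its principal axes (PCA via SVD).
        centered = nbr - nbr.mean(axis=0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        aligned = centered @ Vt.T
        ranges = aligned.max(axis=0) - aligned.min(axis=0)
        # delta: difference between the ranges of the first two principal axes.
        delta = ranges[0] - ranges[1]
        return delta > delta_t  # accept the seed point as a feature location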
We continue selecting feature locations from the available seed points until we
get a significant number of points (Fn). For a seed resolution of 2mm, an r1 of 10,
a δt of 2 and an Fn of 200, we found 200 feature locations for most of the gallery
and probe ears. However, we found as few as 65 features, particularly in cases
where missing data occur. The values of these parameters were chosen empirically.
However, it is reported by Mian et al. [117] that the performance of the feature
point detection algorithm does not vary significantly with small variations of these
parameters.
Fig. 6.7 shows the suitability of our local features on the ear data. It illustrates
that local feature locations are different for ear images of different
individuals. It also shows that these features have a high degree of repeatability for
the ear data of the same individual. Here, by repeatability we mean the proportion
Figure 4.3: Repeatability of local feature locations
of probe feature points that have a corresponding gallery feature point within a par-
ticular distance. Similar to [117], the probe and gallery data of the same individual are
aligned using ICP before the repeatability is computed. The cumulative
percentage of repeatability as a function of the nearest neighbor error between gallery
and probe features of ten different individuals is shown in Fig. 4.3. The repeatability
reaches around 80% at an error of 2mm, which is the sampling distance between the
seed points.
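The repeatability measure itself is straightforward to state; a sketch, assuming the two keypoint sets are N x 3 NumPy arrays that have already been ICP-aligned as described:

    import numpy as np
    from scipy.spatial import cKDTree

    def repeatability(probe_keypoints, gallery_keypoints, max_error=2.0):
        # Fraction of probe feature locations having a gallery feature
        # location within max_error (mm) after alignment.
        dist, _ = cKDTree(gallery_keypoints).query(probe_keypoints)
        return float((dist <= max_error).mean())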
4.2.3 3D Local Feature Extraction
After a seed point qualifies as a keypoint, we extract a surface feature from its
neighborhood. As described in Sect. 6.4.1, while testing the suitability of the seed
point, we take the sphere of data points within r1 of that seed point, aligned
with their principal axes. We use these rotated data points to construct the 3D local
surface feature. Similar to [117], the principal direction of the local surface is used
as the 3D coordinates to calculate the features. Since the coordinate basis is defined
locally based on the shape of the surface, the computed features are potentially
stable and pose invariant.
We fit a uniformly sampled (1mm resolution) 3D surface on a 30 × 30 lattice
to these data points. In order to avoid boundary effects, we crop the inner
20 × 20 lattice region from the bigger surface. This smaller surface is then concatenated
to form a feature vector to be used for matching. Consequently, the dimension of
Figure 4.4: Example of a 3D local surface (right image). The region from which it
is extracted is shown by a circle on the left image.
our feature vector is 400. An example of a 3D local surface feature is shown in Fig.
7.2.
For surface fitting, we use a publicly available surface fitting code [47]. The
motivation behind the selection of this algorithm is that it builds a surface over
the complete lattice, extrapolating (rather than interpolating) smoothly into the
corners. Therefore, it is less sensitive to noise and outliers in the data.
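Putting the pieces together, the feature construction can be sketched as below. Note that nearest-neighbor gridding stands in for the smoothly extrapolating surface fit of [47], so this is an approximation of the construction, not the exact code:

    import numpy as np
    from scipy.interpolate import griddata

    def extract_l3df(points, keypoint, r1=10.0):
        # Local sphere of data points around the accepted keypoint.
        nbr = points[np.linalg.norm(points - keypoint, axis=1) <= r1]
        centered = nbr - nbr.mean(axis=0)
        # Rotate into the local PCA frame (this gives pose invariance).
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        local = centered @ Vt.T
        # Fit a 30 x 30 surface at 1mm resolution over the local x-y plane.
        axis = np.linspace(-14.5, 14.5, 30)       # 30 samples, 1mm apart
        gx, gy = np.meshgrid(axis, axis)
        gz = griddata(local[:, :2], local[:, 2], (gx, gy), method='nearest')
        # Crop the inner 20 x 20 region to avoid boundary effects.
        return gz[5:25, 5:25].ravel()             # 400-dimensional feature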
4.2.4 Feature Matching
The similarity between two features is calculated as the Root Mean Square
(RMS) distance between corresponding points on the 20 × 20 grid generated when
the feature is created (aligned following the axes in the PCA). The RMS distance
is computed from each probe feature location to all the gallery feature locations.
Matching gallery features which are located more than a threshold (th) away are
discarded to avoid matching in quite different areas of the cropped image. The
RMS distances between the probe feature and the remaining gallery features are then com-
puted. The gallery feature with the minimum distance is considered
as the corresponding gallery feature for that particular probe feature. The mean of
the distances for all the matched probe and gallery features is used as a similarity
measure.
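A sketch of this matching step follows, assuming each feature is the 400-dimensional vector from the previous section, stored together with its location on the cropped image, all as NumPy arrays; the value of the location-gating threshold th below is arbitrary, not the one used in the experiments:

    import numpy as np

    def match_score(probe_feats, probe_locs, gallery_feats, gallery_locs,
                    th=15.0):
        # Mean, over probe features, of the RMS distance to the best-matching
        # gallery feature lying within th of the probe feature's location.
        best = []
        for f, loc in zip(probe_feats, probe_locs):
            near = np.linalg.norm(gallery_locs - loc, axis=1) <= th
            if not near.any():
                continue  # no gallery feature close enough in location
            rms = np.sqrt(((gallery_feats[near] - f) ** 2).mean(axis=1))
            best.append(rms.min())
        return float(np.mean(best))  # lower mean distance => better match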
Unlike [117], we also used the implied rotation between probe and gallery for each
pair of matching features, calculated from the rotations used to generate
the two features. The angle between each of these rotations and all the others is
computed, and the rotation with the most similar rotations (within five degrees)
is chosen as the most representative rotation. The ratio of the size of the largest
cluster of rotation angles to the total number of matching features is used as an
additional similarity measure.
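The rotation-consistency measure can be sketched as follows, given the list of 3 x 3 rotation matrices implied by the matched feature pairs (a minimal O(n^2) clustering; the five-degree tolerance is the one quoted above):

    import numpy as np

    def angle_deg(Ra, Rb):
        # Angle of the relative rotation between two rotation matrices.
        c = (np.trace(Ra @ Rb.T) - 1.0) / 2.0
        return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

    def rotation_consistency(rotations, tol=5.0):
        # Size of the largest cluster of mutually similar rotations, as a
        # fraction of all matched features (the RR similarity measure).
        counts = [sum(angle_deg(Ri, Rj) <= tol for Rj in rotations)
                  for Ri in rotations]
        return max(counts) / len(rotations)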
4.2.5 Coarse Registration of Gallery and Probe Data
The input profile images of the gallery and the probe may have pose (rotation and
translation) variations. To minimize the effect of such variations, unlike [117], we
use the correspondence and rotation information obtained from the 3D local feature
matching for the initial or coarse registration of the probe to the gallery image data.
We applied the following approaches for this purpose. In the first approach Singular
Value Decomposition (SVD) is used to find the rotation and translation matrix from
the gallery and probe data points corresponding to the matched 3D local features
(see Sect. 7.4.1). In the second approach, the rotation and the translation with the
maximum number of occurrences (within five degrees and 2mm of limit respectively)
are used to coarsely align the probe data to the gallery.
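The first (SVD-based) approach is the standard orthogonal Procrustes solution computed over the matched feature locations; a sketch, assuming N x 3 NumPy arrays of corresponding points:

    import numpy as np

    def coarse_align(probe_pts, gallery_pts):
        # Rigid (R, t) mapping matched probe points onto their corresponding
        # gallery points, via SVD of the cross-covariance matrix.
        p_c, g_c = probe_pts.mean(axis=0), gallery_pts.mean(axis=0)
        U, _, Vt = np.linalg.svd((probe_pts - p_c).T @ (gallery_pts - g_c))
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:   # guard against a reflection solution
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = g_c - R @ p_c
        return R, t                # apply as: probe @ R.T + t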
4.2.6 Fine Matching with ICP
The Iterative Closest Point (ICP) algorithm [10] is considered to be one of the
most accurate algorithms for registering two clouds of data points, provided the
data sets are roughly aligned. Since ICP is computationally expensive, we extract
a reduced rectangular region and feed it to the modified version of ICP in [114]. The minimum
and maximum co-ordinate values of the matched local 3D features were used to
extract the reduced rectangular region from the originally detected gallery and probe
ear data. This smaller but feature-rich region also minimizes the probability of being
affected by the presence of hair and ear-rings.
4.2.7 Final Similarity Measures
The final decision regarding the matching is made based on the results of ICP as
well as the local feature matching. Therefore, the final similarity measures are: (i)
Distance between local 3D features (ii) ICP error and (iii) The ratio of the size of
the largest cluster of rotation angles to the total number of matching features (RR).
As in [117], each of the similarity measures was scaled to 0 and 1 for its minimum
and maximum values respectively. A weight factor is then computed as the ratio of
the difference of the minimum value from the mean to that of the second minimum
value from the mean of a similarity measure. The final result is the weighted summation of all the similarity measures. However, the third similarity measure (RR) is subtracted from one before multiplication with the corresponding weight factor, as it is opposite in sense to the other measures (the higher its value, the better the match).
4.3 Results and Discussions
The recognition performance of our proposed approach is evaluated in this sec-
tion. The results with and without ICP are reported separately. Examples of correct
and misclassifications are analyzed. The time requirement for matching is also re-
ported.
4.3.1 Data Set Used
The Collection F from the University of Notre Dame Profile face database is
used to perform the recognition experiments of the proposed approach. We have taken 200 profile images of the first 100 different subjects. Among these, the 100 images collected in 2003 are used as the gallery database and the first 100 images of the same subjects collected in 2004 are used as the probe database.
Figure 4.5: Identification results (identification rate vs. rank for ranks 1-20; curves for "Local 3D features only" and "Local 3D features and ICP").
4.3.2 Recognition Rate with 3D Local Features Only
In our first experiment, we performed recognition considering the matching errors with local 3D features only. We obtained identification rates of 84%, 88% and 90% for rank-1, rank-2 and rank-3 respectively. The results are shown in the plot of Fig. 4.5.
4.3.3 Fine Matching with the ICP
Some of the matching failures with the local 3D features alone are recovered by fine alignment with ICP after the initial alignment using those features. Using the combined similarity measure described in Sect. 4.2.7, the identification rates improve to 90%, 94% and 96% for rank-1, rank-2 and rank-3 respectively. The results are shown in Fig. 4.5.
Figure 4.6: Examples of correct recognition in the presence of occlusions: (a) with ear-rings and (b) with hair (2D images and the corresponding range images are placed in the top and bottom rows respectively).
Figure 4.7: Example of correct recognition of gallery-probe pairs with pose varia-
tions.
4.3.4 Occlusion and Pose Invariance
Our approach with local 3D features is found to be robust to the presence of
partial occlusions due to hair and ear-rings. Some examples are illustrated in Fig.
4.6.
The proposed local 3D features are pose-invariant due to the way they are created. However, if the pose variation (especially out-of-plane rotation) causes self-occlusion, the number of 3D local features and their repeatability decrease in the gallery-probe pair. Therefore, we noticed some misclassifications using local 3D features only in the presence of some pose variations. However, with the finer registration with ICP, most of those failures were corrected. Fig. 4.7 shows two such examples. Profile images are used for the example on the right to illustrate the pose variations.
Figure 4.8: Examples of misclassification: (a) with large pose variations and (b) with ear-ring, hair and pose variations.
4.3.5 Analysis of the Failures
The proposed system fails mostly in cases of missing data (due to sensor error
or ear-rings), large pose variations causing self-occlusion and severe occlusions with
hair and ear-rings.
The repeatability of the local 3D features in misclassified images was found
to be very low. The worst misclassified gallery-probe pair has only around 33%
repeatability at 2mm (the feature point resolution).
Two examples of misclassification are illustrated in Fig. 4.8: one has large in-plane and out-of-plane rotations, and the other has a large out-of-plane rotation, an ear-ring and hair covering a portion of the ear. The hair around the gallery ear of the second example hides the depth variation at the edges (see the range image in Fig. 4.8b).
4.3.6 Recognition Speed
An unoptimized MATLAB implementation of the recognition system was run on a Pentium 4 with a 3.6 GHz CPU and 3.25 GB of RAM. It takes around 0.3762
sec to match the 3D local features of a gallery-probe pair. This timing is somewhat longer than that reported in [117] because we do not compress the feature vector and we additionally compute the rotation similarity measure (see Sect. 7.4.1) while matching the features. The ICP stage requires a further 8.6 sec per match on the same platform.
4.4 Conclusion
In this paper, a robust approach based on local 3D features is proposed for 3D ear recognition. The approach is fully automatic and comparatively very fast. It is shown to be robust to pose and scale variations and to occlusions due to hair and ear-rings. It also makes no assumption about the localization of the nose or the ear pit. The large variation between the rank-1, rank-2 and rank-3 results indicates that finer tuning of the parameters of our system is likely to improve the performance. The time efficiency of the system can also be improved by reducing the dimensionality of the local feature vector by projecting the features onto a PCA subspace.
Acknowledgements
We acknowledge the use of the UND Biometrics databases for ear detection and recognition. We would also like to thank D'Errico for the surface fitting code. This research is sponsored by ARC grants DP0664228 and DP0881813.
CHAPTER 5
Refining Local 3D Feature Matching through
Geometric Consistency for Robust Biometric
Recognition
Abstract
Local features are gaining popularity due to their robustness to occlusion and other variations such as minor deformation. However, using local features for the recognition of biometric traits, which are generally highly similar, can produce large numbers of false matches. To increase recognition performance, we propose to eliminate some incorrect matches using a simple form of geometric consistency and some associated similarity measures. The performance of the approach is evaluated on different datasets and compared with some previous approaches. We obtain an improvement from 81.60% to 92.77% in rank-1 ear identification on the University of Notre Dame Biometric Database, the largest publicly available profile face database, with 415 subjects.
keywords
Local features; feature matching; geometric consistency; biometric recognition.
5.1 Introduction
The performance of a biometric recognition system greatly depends on the rep-
resentation of the underlying distinctive features and the algorithm for matching
those features. Due to their ease of computation, various global features have been used in biometric systems. However, global features are often not robust to variations in observation conditions, including the presence of occlusions and deformations. In search of features that are more robust under these variations, researchers have proposed using local features.
Among 2D local features SIFT (scale invariant feature transform) [105] and its
variants are used in many biometric applications [92, 134, 21, 109]. Also Guo and
Xu [63] proposed using the Local Similarity Binary Pattern (LSBP) and Local
Binary Pattern (LBP) for ear data representation and matching. However, 2D data
This article is published in the Proc. of Digital Image Computing: Techniques and Applications (DICTA), pp. 513-518, December 2009.
has many inherent problems compared to its 3D counterpart, including sensitivity to the use of cosmetics, clothing and other decorations. Recently, Mian et al. [117] proposed the Local 3D Feature (L3DF), inspired by 2D SIFT, which was found to be very effective for face representation and recognition. Islam et al. [83] also found it to be very fast and somewhat robust for ear recognition.
Among the biometric traits, ears and faces are considered to be most suitable for
non-intrusive biometric recognition. However, they are also highly similar among
themselves [35]. Therefore, using local features for these two biometric traits (espe-
cially for ears) produces many false or incorrect matches for a gallery-probe pair. In
this paper, we find initial feature matches using L3DFs with simple metrics (such
as RMS distance between features) similar to Mian et al. [117]. However, we pro-
pose using a geometric consistency check to filter out some of the incorrect feature
matches in an additional round of matching. The consistency of a feature match is judged from the difference between the implied distances on the probe and gallery to a particular match from the first round; this reference match is chosen to maximize the consistency within the first round. We also develop a similarity measure based on this consistency by computing the proportion of consistent distances in the first round.
Finally, we combine this additional measure with the proportion of consistent rota-
tions and mean distance error of the selected features computed as we previously
proposed in [83, 76] (Chapter 4 of this dissertation). Experiments with different
datasets prove the effectiveness of our technique via a considerable improvement in
the recognition results.
The paper is organized as follows. The proposed approach for refining matching
is described in Section 5.3 after describing the background and motivation in Section
5.2. The results obtained are reported and discussed in Section 5.4 and compared
with other approaches in Section 7.8. Section 7.9 concludes with some future
research directions.
5.2 Background and Motivation
The local 3D feature that we use to demonstrate our approach can be depicted
as a 3D surface constructed using data points within a sphere of radius r1 centered
at location p. Fig. 7.2 shows a local 3D feature surface extracted from an ear.
As described in [83](Chapter 4 of this dissertation), a 20 × 20 grid of heights is
approximated using the surface points of a 3D surface feature. The grid is aligned
following the axes defined by the Principal Component Analysis (PCA). The simi-
larity between two features is calculated as the Root Mean Square (RMS) distance
Figure 5.1: Example of a 3D local surface (right image) and the region from which
it is extracted (left image, marked with a circle) [83].
between corresponding heights on this grid. The similarity is computed for each probe feature location against all gallery feature locations, excluding those located more than a threshold away. (As previously, this threshold is empirically chosen as 45mm in the case of ear matching, since our automatic ear detection procedure does not precisely locate the ear.) The gallery feature with the maximum similarity is considered as the matching feature for that particular probe feature.
The starting point for the current work was the observation that for a correspond-
ing probe and gallery the set of feature matches produced by the above technique
generally contains a large proportion of incorrect matches, and that the correct
matches must be geometrically consistent, while the incorrect matches are unlikely
to be. In contrast, for a non-corresponding probe and gallery, it is unlikely that
there will be a large set of features that match well and are geometrically consis-
tent. Here geometrically consistent means that there is a rigid transformation that
maps the probe feature locations to the corresponding gallery feature locations.
While it seems possible to check full geometric consistency during the matching
process, by constructing rigid transformations from sets of matched points using
Random Sample Consensus (RANSAC) [52] or a variant such as Progressive Sample
Consensus (PROSAC) [42], we chose instead to stick close to the original local 3D
feature matching algorithm of Mian et al. [117] in order to build on its demonstrated
strengths.
5.3 Proposed Refinement Technique
In this section, at first we describe how distance consistency can be used to filter
out incorrect matches. Then, we describe the derivation of two similarity measures
to be used in addition to the feature similarity.
5.3.1 Computation of Distance Consistency
In order to discard incorrect matches, we propose to add a second round of
feature matching each time a probe is compared with a gallery. This second round
uses geometric consistency based on information extracted from the feature matches
generated by the first round. The first round of feature matching is done just as
described in Section 5.2, and we use the matches generated to identify those matches
that are most geometrically consistent.
For simplicity, we measure geometric consistency of a feature match just by
counting how many of the other feature matches from the first round yield consistent
distances on the probe and gallery. More precisely, for a match with locations $p_i$, $g_i$ we count how many other matched locations $p_j$, $g_j$ satisfy:
$$\bigl|\, \|p_i - p_j\| - \|g_i - g_j\| \,\bigr| \;<\; d_{key} + \kappa \sqrt{\|p_i - p_j\|}$$
Here the threshold includes a term to allow for the spacing between candidate keypoints, as well as one that grows with the square root of the actual probe distance $\|p_i - p_j\|$ to account for minor deformations and measurement errors.
To exploit geometric consistency quickly and simply, we just find the match from
the first round which is most “distance-consistent” according to this measure. Then,
in the second round, we only allow feature matches that are distance-consistent with
this match - thus for each probe feature we find the best matching gallery feature
that is distance consistent with the chosen match.
We also compute the ratio of the maximum distance consistency to the total number of matches found in the first round of matching and use it as a similarity measure: the proportion of consistent distances (λ).
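A sketch of this count in Python ($d_{key}$ and $\kappa$ are the spacing and deformation constants from the inequality above; the numeric defaults below are placeholders, not the thesis settings):

```python
import numpy as np

def distance_consistency(p_locs, g_locs, d_key=2.0, kappa=0.5):
    """p_locs, g_locs: (M, 3) keypoints of the M first-round matches.
    Returns, per match, how many other matches it is consistent with."""
    dp = np.linalg.norm(p_locs[:, None] - p_locs[None], axis=-1)
    dg = np.linalg.norm(g_locs[:, None] - g_locs[None], axis=-1)
    ok = np.abs(dp - dg) < d_key + kappa * np.sqrt(dp)
    np.fill_diagonal(ok, False)       # a match is not compared with itself
    counts = ok.sum(axis=1)
    # argmax(counts) is the most distance-consistent match; its count over
    # the number of matches gives the proportion of consistent distances.
    return counts
```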
5.3.2 Computation of Rotation Consistency
Similar to Islam et al. [83, 76], we also compute a similarity measure based on
the consistency of the rotations implied by the feature matches. Each feature match
implies a certain rotation between the probe and gallery, since we store the rotation
matrix used to create the probe feature from the probe (calculated using PCA), and
similarly for the gallery feature, and we assume that the match occurs because the
features have been aligned in the same way and come from corresponding points.
We can thus calculate the implied rotation from probe to gallery as $R_g^{-1} R_p$ (or equivalently $R_g^T R_p$, since $R_g$ is a rotation matrix), where $R_p$ and $R_g$ are the rotations used for the probe and gallery features.
Figure 5.2: Feature correspondences before (left image) and after (right image)
filtering with geometric consistency (best seen in color).
We calculate these rotations for all feature matches, and then for each we deter-
mine the count of how many of the other rotations it is consistent with. Consistency
between two rotations $R_1$ and $R_2$ is determined by finding the angle between them, i.e., the angle of the rotation $R_1^{-1} R_2$ (around the appropriate axis of rotation). When this angle is less than $10^\circ$ we consider the two rotations consistent. We choose the rotation that is consistent with the largest number of other matches, and then use the proportion of matches consistent with it (α) as a similarity measure. As we shall see in Section 5.4, this measure proves to be the strongest of the measures used prior to applying the Iterative Closest Point (ICP) algorithm in our ear recognition experiments.
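A sketch of this computation (the 10° tolerance is from the text; the function names and the trace-identity formulation are ours):

```python
import numpy as np

def rotation_angle_deg(R1, R2):
    """Angle of R1^T R2 via the trace identity cos(a) = (tr - 1) / 2."""
    c = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def rotation_consistency(rotations, tol_deg=10.0):
    """rotations: list of implied probe-to-gallery rotations Rg^T @ Rp."""
    counts = [sum(rotation_angle_deg(Ri, Rj) < tol_deg
                  for j, Rj in enumerate(rotations) if j != i)
              for i, Ri in enumerate(rotations)]
    best = int(np.argmax(counts))
    alpha = counts[best] / max(len(rotations), 1)
    return rotations[best], alpha   # representative rotation, proportion
```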
Fig. 5.2 illustrates an example of the correspondences between the features of a probe image and the corresponding gallery image (mirrored in the z direction) before (left image) and after (right image) the geometric consistency check is performed. The green channel (brighter in a black-and-white reproduction) indicates the amount of rotational consistency for each match. It should be clear that the filtering has increased the proportion of matches that involve corresponding parts of the two ear images.
5.3.3 Final Similarity Measures
When the second round is complete for a particular probe and gallery, we use
the mean error of the feature matches (γ) in the second round, proportion of con-
sistent distances (λ) and the proportion of consistent rotations (α) from the second
round to measure the closeness of the match. Similar to [83], we perform min-max
normalization of the similarity measures and compute a weight factor (η) for each
Figure 5.3: Identification results for the fusion of ears and faces on dataset A: (a) without and (b) with geometric consistency (identification rate vs. rank; curves for Ear, Face and combined).
Figure 5.4: Verification results for the fusion of ears and faces on dataset A: (a) without and (b) with geometric consistency (verification rate vs. false acceptance rate on a log scale; curves for Ear, Face and combined).
of the measures as the ratio of the difference between the mean and the minimum value to the difference between the mean and the second minimum value of that similarity measure. We
compute the weighted sum as follows:
$$\varepsilon = \eta_f\,\gamma + \eta_r (1 - \alpha) + \eta_d (1 - \lambda)$$
Based on the above score, we sort the candidate gallery images and apply coarse
alignment and final ICP only on the minimal rectangular area of the best 20 candi-
dates to make the final decision of matching.
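A sketch of this scoring over a set of gallery candidates (whether the weights are computed before or after normalization is not spelled out above; this sketch normalizes first, and all names are ours):

```python
import numpy as np

def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def weight(x):
    """eta: (mean - minimum) / (mean - second minimum) of a measure."""
    s = np.sort(x)
    return (x.mean() - s[0]) / (x.mean() - s[1] + 1e-12)

def fused_scores(gamma, alpha, lam):
    """gamma: mean feature error (lower is better); alpha, lam: proportions
    of consistent rotations/distances (higher is better). One entry per
    gallery candidate. Returns epsilon, where lower means a closer match."""
    g, a, l = minmax(gamma), minmax(alpha), minmax(lam)
    return weight(g) * g + weight(a) * (1 - a) + weight(l) * (1 - l)
```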
For coarse alignment, we use the translation (t) corresponding to the maximum
distance consistency and the rotation (R) corresponding to the largest cluster of
consistent rotations as follows:
$$d_p' = R\,d_p + t$$
where $d_p$ and $d_p'$ are the probe point coordinates before and after coarse alignment.
5.4 Result and Discussion
To evaluate the performance of our refined matching on ear and face biometrics,
we have used Collection F and J from the University of Notre Dame Profile Biomet-
rics Database [157, 179] and a subset of the FRGC v2. For different experiments, we
have sub-divided these two datasets into the following three subsets. In dataset A,
326 frontal images from the FRGC v2 and 326 profile images from the UND Col-
lection J are used for gallery database. Another set of 311 images from these two
databases is used as a probe dataset. All the ear data are automatically extracted
from the profile images using the technique described in [74, 75]. In dataset B, the
earliest 302 images of the UND Collection F are used as gallery and the latest 302
images are used as probe (there is a time lapse of 17.7 weeks on average between
the earliest and latest images). The entire Collection J of the UND database is
considered as dataset C which includes 415 earliest images as the gallery and an-
other 415 images as probes. However, since the ear in one of the profile images was not automatically detected, the number of probes included in the recognition experiments for this dataset is 414.
Using dataset A, we obtain 79.74% and 69.13% rank-1 identification rates for the ear with and without the second stage of matching with distance consistency, respectively. Rank-n means the right answer is within the top n matches. The score-level fusion of ear and face improves the identification result from 98.07% to 98.71% (see Fig. 5.3)
and the verification result from 98.71% to 99.68% at a False Acceptance Rate (FAR) of 0.001 using the proposed matching approach (see Fig. 5.4).
As shown in Fig. 5.3(b), we obtain only a little improvement for the face. This is
because our matching algorithm for faces already requires distance consistency with
respect to the detected nose tip position as described in [114]. For ears it seems
difficult to reliably detect a similar position due to the possibility of occlusion, and
we consider it a strength of our technique that it does not require any specific part
of the ear to be visible.
Figure 5.5: Identification results using geometric consistency: (a) on dataset B and (b) on dataset C (identification rate vs. rank; curves for L3DF-based measures and ICP followed by L3DF).
Using the translation corresponding to the maximum distance consistency and the best rotation as the initial transformation, and then applying the ICP algorithm to the gallery and probe data, the ear identification result improves to 92.60%. The corresponding result is 86.50% when only the rotation and translation obtained from an application of ICP to the matched feature points are used for coarse alignment.
On dataset B, we perform two experiments. In the first experiment, we use
feature errors and the proportion of consistent rotations for initial matching and
only the rotation for coarse alignment. We then apply the final ICP on the minimal
datapoints (as described in Section 5.3) of the gallery and the probe. This yields
93.71% rank-1 recognition for ear biometrics. In the second experiment, we apply the second stage of matching with distance consistency and use both the proportion of consistent distances and the proportion of consistent rotations for initial matching as well as for coarse alignment prior to the application of the final ICP on the best 20 gallery candidates. These changes improve the recognition rate to 95.03%, which confirms the significance of using the
geometric consistencies (see Fig. 5.5(a)).
On dataset C, without the geometric consistency check we obtain 71.57% and
81.60% rank-1 ear identification rate before and after using ICP on the best 20
gallery candidates sorted by L3DF-based measures. However, the results improve
to 79.76% and 92.77% respectively when we use our geometric consistency checks
and measures. The improved results are shown in Fig. 5.5(b).
The strength of the proportion of consistent rotations (propRot) measure over
other similarity measures is illustrated in Fig. 5.6 for an identification scenario on
dataset C. In the figure, the mean error of feature matches and the proportion of
consistent distances are indicated by errL3DF and propDist respectively.
Figure 5.6: Comparing the identification performance of different geometric consistency measures on dataset C (identification rate vs. rank; curves for errL3DF, propRot and propDist).
5.5 Comparative Study
Islam et al. [83] reported an 84% rank-1 ear recognition result using the first stage of feature matching along with the proportion of best rotations on the first 100 images of Collection F of the UND database. For the same dataset, using distance consistency and the proportion of consistent distances, we have obtained 92% rank-1 ear recognition.
Using distance consistency, we have obtained 9% better identification and 8% better verification results for ear recognition compared to those reported in [76].
Mian et al. [117] used edge error and node error on graphs constructed from the keypoints of the matching local features. Although this roughly measures the distance consistency between matched features, it is less reliable when there are many bad matches. Moreover, that work used the measure only as a similarity score rather than for filtering out incorrect matches. For the dataset used
in [76], we obtain 69.13% recognition accuracy for ears using feature errors and graph errors. For the same dataset, the corresponding result with our distance consistency and proportion of consistent rotations is 79.74%.
Chen and Bhanu [32] use geometric constraints similar to ours on their Local Feature Patches. However, they do not use an explicit second round of matching.
5.6 Conclusion and Future Work
Our results indicate that this simple technique yields worthwhile gains. It also seems very likely that we could exploit geometric consistency further. For example, instead of choosing only the most distance-consistent match, we could choose a whole set of matches that are mutually distance consistent. Then, provided that we have at least three matches, we could calculate a rigid transformation via least-squares fitting and use it to directly map probe locations to the required gallery locations (modulo the threshold). We intend to do this in future work, and further gains seem likely.
Acknowledgment
This research is sponsored by ARC grants DP0664228 and LE0775672. The authors acknowledge the use of the UND Biometrics database of profile images and the FRGC v2 face database for ear and face recognition. They would also like to thank M. Bennamoun for his valuable input, A. Mian for his 3D face normalization code and D'Errico for his surface fitting code.
CHAPTER 6
Efficient Detection and Recognition of Textured
3D Ears
Abstract
The use of ear shape as a biometric trait is a recent trend in research. However,
fast and accurate detection and recognition of the ear are very challenging because
of its complex geometry. In this work, a very fast 2D AdaBoost detector is combined
with fast 3D local feature matching and fine matching via an Iterative Closest Point
(ICP) algorithm to obtain a complete, robust and fully automatic system with a
good balance between speed and accuracy. Ear images are detected from 2D profile
images using the proposed Cascaded AdaBoost detector. The corresponding 3D
ear data is then extracted from the co-registered range image and represented with
local 3D features. Unlike previous approaches, local features are used to construct
a rejection classifier, to extract a minimal region with feature-rich data points and
finally, to compute the initial transformation for matching with the ICP algorithm.
The proposed system provides a detection rate of 99.9% and an identification rate
of 95.4% on Collection F of the UND database. On a Core 2 Quad 9550, 2.83 GHz
machine, it takes around 7.7 ms to detect an ear from a 640×480 image. Extracting
features from an ear takes 22.2 sec and matching it with a gallery using only the
local features takes 0.06 sec while using the full matching including ICP requires
2.28 sec on average.
keywords
Biometrics, ear detection, 3D ear recognition, 3D local features, geometric con-
sistency.
6.1 Introduction
Instances of fraudulent breaches of traditional identity card based systems have
motivated increased interest in strengthening security using biometrics for automatic
recognition [85, 146]. Among the biometric traits, the face and the ear have received
some significant attention due to the non-intrusiveness and the ease of data collec-
tion. Face recognition with neutral expressions has reached maturity with a high
This article is under review in the International Journal of Computer Vision, March 2010.
Figure 6.1: Block diagram of the proposed ear detection and recognition system. From the 2D and 3D profile face data, the pipeline performs ear detection and ear data extraction, ear data normalization, keypoint identification and 3D local feature extraction; on-line probe features are matched against off-line stored gallery features (L3DF-based matching), followed by coarse alignment using the feature correspondences, fine matching with ICP and the recognition decision.
degree of accuracy [17, 78, 117, 190]. However, changes due to facial expressions, the use of cosmetics and eyeglasses, the presence of facial hair including beards, and aging significantly affect the performance of face recognition systems. The ear, com-
pared to the face, is much smaller in size but has a rich structure [8] and a distinct
shape [86] which remains unchanged from 8 to 70 years of age (as determined by
Iannarelli [72] in a study of 10,000 ears). It is, therefore, a very suitable alternative
or complement to the face for effective human recognition [20, 28, 69, 78].
However, reduced spatial resolution and a uniform distribution of color sometimes make it difficult to detect and recognize the ear from arbitrary profile or side face images. The presence of nearby hair and ear-rings also makes it very challenging for non-interactive biometric applications.
In this work, we demonstrate that the Cascaded AdaBoost (Adaptive Boost-
ing) [161] approach with appropriate Haar-features allows accurate and very fast
detection of ears while being sufficient for a Local 3D Feature (L3DF) [117] based
recognition. A detection rate of 99.9% is obtained on the UND Biometrics Database
with 830 images of 415 subjects, taking only 7.7 ms on average using a C++ implementation on a Core 2 Quad 9550, 2.83 GHz PC. The approach is found to be significantly robust to ear-rings, hair and ear-phones. As illustrated in Fig. 6.1, the
detected ear sub-window is cropped from the 2D and the corresponding co-registered
3D data and represented with Local 3D Features. These features are constructed
by approximating surfaces around some distinctive keypoints based on the neigh-
boring information. When matching a probe with a gallery, a rejection classifier is
built based on the distance and geometric consistency among the feature vectors. A
minimal rectangular region containing all the matching features is extracted from
the probe and the best few gallery candidates. These selected and minimal gallery-
probe datasets are coarsely aligned based on the geometric information extracted
from the feature correspondences and then finely matched via the Iterative Closest
Point (ICP) algorithm. While evaluating the performance of the complete system
on the UND-J, the largest available ear database, we obtain an identification rate
of 93.5% with an Equal Error Rate (EER) of 4.1%. The corresponding rates for the
UND-F dataset are 95.4% and 2.3% and the rates for a new dataset of 50 subjects all
wearing ear-phones are 98% and 1%. With an unoptimized MATLAB implementa-
tion, the average time required for the feature extraction, the L3DF-based matching
and for the full matching including ICP are 22.2, 0.06 and 2.28 seconds respectively.
The rest of the paper is organized as follows. Related work and contributions
of this paper are described in Section 7.2. The proposed ear detection approach is
elaborated in Section 6.3. The recognition approach with local 3D features is ex-
plained in Sections 6.4 and 6.5. The performance of the approaches are evaluated in
Sections 6.6 and 6.7. The proposed approaches are compared with other approaches
in Section 7.8 followed by a conclusion in Section 7.9.
6.2 Related Work and Contributions
In this section, we describe the methodology and performance of the existing 2D
and 3D ear detection and recognition approaches. We then discuss the motivation
inspired from the limitations of these approaches and highlight the contributions of
this paper.
6.2.1 Ear Detection Approaches
Based on the type of data used, existing ear detection or ear region extraction
approaches can be classified as 2D, 3D and multimodal 2D+3D. However, most ap-
proaches use only 2D profile images. One of the earliest 2D ear detection approaches
is proposed by Burge and Burger [20] who used Canny edge maps [24] to find the
ear contours. Ansari and Gupta [7] also used a Canny edge detector to extract the
ear edges and segmented them into convex and concave curves. After the elimina-
tion of non-ear edges, they found the final outer helix curve based on the relative
values of angles and some predefined thresholds. They then joined the two end
points of the helix curve with straight lines to get the complete ear boundary. They
obtained 93.3% accuracy of localizing the ears on a database of 700 samples. Ear
contours were also detected based on illumination changes within a chosen window
by Choras [37]. The author compared the difference between the maximum and
minimum intensity values of a window to a threshold computed from the mean and
standard deviation of that region in order to decide whether the center of the region
belongs to the contour of the ear or to the background.
Ear detection approaches that utilize 2D template matching include the work
of Yuizono et al. [187] where both hierarchical and sequential similarity detection
algorithms were used to detect the ear from 2D intensity images. Another technique
based on a modified snake algorithm and an ovoid model was proposed by Alvarez et
al. [5]. It requires the user to input an approximated ear contour which is then used
for estimating the ovoid model parameters for matching. Yan and Bowyer [176] manually selected the Triangular Fossa and the Incisure Intertragica on the original 2D profile image and drew two lines to be used as landmarks: one along the border between the ear and the face, and the other from the top of the ear to the bottom. The authors found this method suitable for PCA-based and edge-based matching.
The Hough Transform can extract shapes with properties equivalent to template
matching and was used by Arbab-Zavar and Nixon [8] to detect the elliptical shape
of the ear. The authors successfully detected the ear region in all of the 252 profile
images of a non-occluded subset of the XM2VTS database. For the UND database,
they first detected the face region using skin detection and the Canny edge operator
followed by the extraction of the ear region using their proposed method with a
success rate of 91%. They also introduced synthetic occlusions vertically from top
to bottom on the ear region of the first dataset and obtained around 93% and 90%
detection rates for 20% and 30% occlusion respectively. Recently, Gentile et al. [58]
used AdaBoost [161] to detect the ear from a profile face as part of their multi-
biometric approach for detecting drivers’ profiles in a security checkpoint. In an
experiment with 46 images from 23 subjects, they obtained an ear detection rate of
97% with seven false positives per image. They did not report the efficiency of their
system.
Approaches using only 3D or range data include Yan and Bowyer’s two-line
based landmarks and 3D masks [176], Chen and Bhanu’s 3D template matching [29]
and the ear-shape-model based approach [31]. Similar to their 2D technique [176]
mentioned above, Yan and Bowyer [176] drew two lines on the original range image
to find the orientation and scaling of the ear. They rotated and scaled a mask
accordingly and applied it on the original image to crop the 3D ear data in an
ICP-based matching approach. Chen and Bhanu [29] combined template matching
with average histogram to detect ears. They achieved a 91.5% detection rate with
about 3% False Positive Rate (FPR). In [31], they represented an ear shape model
by a set of discrete 3D vertices on the ear helix and anti-helix parts and aligned
the model with the range images to detect the ear parts. With this approach, they
obtained 92.5% detection accuracy on the University of California, Riverside (UCR)
ear dataset with 312 images and an average detection time of 6.5 sec on a 2.4 GHz
Celeron CPU.
Among the multimodal 2D+3D approaches, Yan and Bowyer [179] and Chen and
Bhanu [32] are prominent. In the first approach, the ear region was initially located
by taking a predefined sector from the nose tip. The non-ear portion was then
cropped out from that sector using a skin detection algorithm and the ear pit was
detected using Gaussian smoothing and curvature estimation algorithms. An active
contour algorithm was applied to extract the ear contour. Using the color and the
depth information separately for the active contour, ear detection accuracies of 79%
and 85% respectively were obtained. However, using both 2D and 3D information,
ears from all the profile images were successfully extracted. Thus, the system is
automatic but depends highly on the accuracy of detection of nose tip and ear pit
and it fails when the ear pit is not visible. Chen and Bhanu [32] also used both color
and range images to extract ear data. They used a reference ear shape model based
on the helix and anti-helix curves and the global-to-local shape registration. They
obtained 99.3% and 87.7% detection rates while tested on the UCR ear database of
902 images from 155 subjects and on 700 images of the UND database, respectively.
The detection time for the UCR database is reported as 9.5 sec with a MATLAB
implementation on a 2.4 GHz Celeron CPU.
6.2.2 Ear Recognition Approaches
Most of the existing ear recognition techniques are based on 2D data and exten-
sive surveys can be found in [78, 143]. Some of them report very high accuracies but
on smaller databases; e.g. Choras [37] obtained 100% recognition on a database
of 12 subjects and Hurley et al. [70] obtained 99.2% accuracy on a database of 63
subjects. As expected, performance generally drops for larger databases, e.g. Yan
and Bowyer [180] report a performance drop from 92% to 84.1% for database sizes
of 25 and 302 subjects respectively. Also, most of the approaches do not consider
occlusion in the ear images (e.g. [63, 106, 34, 70, 188, 39]). Considering these issues
and the scope of the paper, only those approaches using large 3D databases and
somewhat occluded data are summarized in Table 6.1 and described below.
Table 6.1: Summary of the existing 3D ear recognition approaches

Publication                           | Methodology    | Dataset (gallery, probe) | Rec. Rate (%)
Yan and Bowyer, 2007 [179]            | 3D ICP         | UND-J (415, 1386)        | 97.8
Chen and Bhanu, 2007 [32]             | LSP and 3D ICP | UCR (155, 155)           | 96.8
                                      |                | UND-F (302, 302)         | 96.4
Passalis et al., 2007 [136]           | AEM, ICP, DMF  | UND-J (415, 415)         | 93.9
Cadavid and Abdel-Mottaleb, 2007 [22] | 3D ICP         | Proprietary (61, 25)     | 84
Yan and Bowyer, 2005 [180]            | 3D ICP         | UND-F (302, 302)         | 84.1
Yan and Bowyer [179] applied 3D ICP with an initial translation using the ear pit location computed during the ear detection process. They achieved 97.8% rank-1 recognition with an Equal Error Rate (EER) of 1.2% on the whole UND Collection
J dataset consisting of 1386 probes of 415 subjects and 415 gallery images. They
obtained a recognition rate of 95.7% on a subset of 70 images from this dataset which
have limited occlusions with earrings and hair. In another experiment with the UND
Collection G dataset of 24 subjects each having a straight-on and a 45 degrees off
center image, they achieved a 70.8% recognition rate. However, the system is not expected to work properly if the nose tip or the ear pit is not clearly visible, which may happen due to pose variations or coverage by hair or ear-phones (see Fig. 6.10 and 6.12).
Chen and Bhanu [32] used a modified version of ICP for 3D ear recognition. They
obtained 96.4% recognition on Collection F of the UND database (including occluded
and non-occluded images of 302 subjects) and 87.5% recognition for straight-on to 45
degree off images. They obtained 94.4% rank-1 recognition rate for the UCR dataset
ES2 which comprises 902 images of 155 subjects taken all in the same day. They
used local features for representation and coarse alignment of ear data and obtained
a better performance than their helix-anti-helix representation. Their approach
assumes perfect ear detection, otherwise manual extraction of the ear contour is
performed prior to recognition.
Passalis et al. [136] used a generic annotated ear model (AEM), ICP and Simu-
lated Annealing algorithms to register and fit each ear dataset. They then extracted
a compact biometric signature for matching. Their approach required 30 sec for en-
rolment per individual and less than 1 ms for matching two biometric signatures
on a Pentium 4, 3 GHz CPU. They computed the full similarity matrix with 415
columns (galleries) and 415 rows (probes) for the UND-J dataset, taking seven hours of enrolment and a few minutes of matching, and achieved a 93.9% recognition rate.
Cadavid and Abdel-Mottaleb [22] extracted a 3D ear model from video sequences
and used 3D ICP for matching. They obtained 84% rank one recognition while
testing with a database of 61 gallery and 25 probe non-occluded images.
All of the above recognition approaches have only considered left or right ears.
An exception is Choras [39] who proposed to pre-classify each detected ear as left
or right based on the geometrical parameters of the earlobe. The author reported
accurate pre-classification of all 800 images from 80 subjects. Hence, distinguishing
left and right ears seems relatively easy. In cases where both profile images are not
available, extracted ear data from the opposite profile can be mirrored for matching
with still relatively reliable recognition. Yan and Bowyer [176, 179] experimentally
demonstrated that although some people’s left and right ears have recognizably
different shapes, most people’s two ears are approximately bilaterally symmetric.
They obtained around 90% recognition rate while matching mirrored left ears to
right ears on a dataset of 119 subjects. We have focused on left ears, but the above
work suggests our research can be used in other situations also.
6.2.3 Motivations and Contributions
Most of the ear detection approaches mentioned above are not fast enough to be
applied in real-time applications. Recently, Viola and Jones have used the AdaBoost
algorithm [53, 150] to detect faces and obtained a speed of 15 frames per second while
scanning 384 by 288 pixel images on a 700 MHz Intel Pentium III [161]. Owing to this extreme speed and simplicity of implementation, AdaBoost has further been used for
detecting the ball in a soccer game [153], pedestrians [122], eyes [130], mouths [103]
and hands [36]. However, existing ear detection using AdaBoost (see Section 6.2.1)
does not achieve significant accuracy. In fact, even for faces, Viola and Jones [161]
obtained only 93.7% detection rate with 422 false positives on MIT+CMU face
database. Ear detection is more challenging because ears are much smaller than
faces and are often covered by hair, ear-rings, ear-phones, etc. The challenge lies in reducing incorrect or partial localizations while maintaining a high correct detection rate. Hence,
we are motivated to determine the right way to instantiate the general AdaBoost
approach with the specifics required in order to specialize it for ear detection.
Most of the ear recognition approaches use global features and ICP for matching.
Compared to local features, global features are more sensitive to occlusions and vari-
ations in pose, scale and illumination. Although ICP is considered to be the most
accurate matching algorithm, it is computationally expensive and it requires con-
cisely cropped ear data and a good initial alignment between the gallery-probe pair
so that it does not converge to a local minimum. Yan and Bowyer [179] suggested
that the performance of ICP might be enhanced using feature classifiers. Recently,
Mian et al. [117] proposed local 3D features for face recognition. Using these fea-
tures alone, they reported 99% recognition accuracy on neutral versus neutral and
93.5% on neutral versus all on the FRGC v2 3D face dataset. They also obtained
a time efficiency of 23 matches per second on a 3.2 GHz Pentium IV machine with
1GB RAM. In this paper, we adapt these features for the ear and use them for
coarse alignment as well as for rejecting a large number of false matches. We also
use L3DFs for extracting a minimal set of datapoints to be used in ICP.
The specific contributions of this paper are as follows:
1. A fast and fully automatic ear detection approach using cascaded AdaBoost
classifiers trained with three new features and a rectangular detection window.
No assumption is made about the localization of the nose or the ear pit.
2. The local 3D features are used for ear recognition in a more accurate way than
originally proposed in [117] for the face including an explicit second round of
matching based on geometric consistency. L3DFs are used not only for coarse
alignment but also for rejecting most false matches.
3. A novel approach for extracting minimal feature-rich data points for the final
ICP alignment is proposed which significantly increases the time efficiency of
the recognition system.
4. Experiments are performed on a new database of profile images with ear-
phones along with the largest publicly available dataset of the UND and high
recognition rates are achieved without an explicit extraction of the ear con-
tours.
6.3 Automatic Detection and Extraction of Ear Data
The ear region is detected on 2D profile images using a detector based on the
AdaBoost algorithm [53, 150, 74, 75]. Following [161], Haar-like features are used
as weak classifiers and learned from a number of ear and non-ear images. After
training, the detector first scans through the 2D profile images to identify a small
rectangular region containing the ear. The corresponding 3D data is then extracted
from the co-registered 3D profile data. The complete detection framework is shown
in the block diagram of Fig. 6.2. A sample of a profile image and the corresponding
2D and 3D ear data detected by our system is also shown in the same figure. The
details of the construction of the detector and its functional procedures are described
in this section.
Figure 6.2: Block diagram of the proposed ear detection approach. Off-line training: 2D and 3D profile face data are collected, the 2D data are pre-processed, rectangular Haar features are created, classifiers are trained using cascaded AdaBoost, and the classifiers are validated and updated with false positives. On-line detection: the 2D ear region is scanned and detected in the 2D profile, and the corresponding 3D ear data is extracted from the 3D profile.
6.3.1 Feature Space
The eight different types of rectangular Haar feature templates as shown in
Fig. 6.3 are used to construct our AdaBoost based detector. Among these, the first
five (a-e) were also used by Viola and Jones [161] to detect different types of lines
and curves in the face. We devised the later three templates (f-h) to detect specific
features of the ear which are not available in the frontal face. The center-surround
template is designed to detect any cavity in the ear (e.g. ear pit) and the other two
(adopted from [90]) are for detecting helix and the anti-helix curves. Although (f)
is the intersection of (c) and (e), we use it as a separate feature template because no
linear combination of those features yields (f) and as will be discussed in the next
sub-section, the AdaBoost algorithm used for feature selection greedily chooses the
best individual Haar features, rather than their best combination.
Figure 6.3: The Haar feature templates (a)-(h) used in training the AdaBoost detector (templates (f), (g) and (h) are proposed for detecting specific features of the ear).
To create a number of Haar features out of the above eight types of templates or
filters, we choose a window to which all the input samples are normalized. Viola and
Jones [161] used a square window of size 24×24 for face detection. Our experiments
with training data show that ears are roughly proportional to a rectangular window
of size 16 × 24. One benefit of choosing a smaller window size is the reduction of
training time and resources. The templates are shifted and scaled horizontally and
vertically along the chosen window and a feature is numbered for each location, scale
and type. Thus, for the chosen window size and a shift of one pixel, we obtain an
over-complete basis of 96413 potential features.
The value of a feature is computed by subtracting the sum of the pixels in the
grey region(s) from that of the dark region(s) (except in the case of (c), (e) and (f)
in Fig. 6.3, where the sum in the dark region is multiplied by 2, 2 and 8 respectively
before performing the subtraction in order to make the weight of this region equal
to that of the grey region(s)).
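For concreteness, a sketch of how such a feature value is computed in constant time from an integral image (the geometry shown is the two-rectangle type (a); names are ours):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y+1, :x+1]."""
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)

def rect_sum(ii, top, left, h, w):
    """Sum of img[top:top+h, left:left+w] using four lookups."""
    a = ii[top + h - 1, left + w - 1]
    b = ii[top - 1, left + w - 1] if top > 0 else 0.0
    c = ii[top + h - 1, left - 1] if left > 0 else 0.0
    d = ii[top - 1, left - 1] if top > 0 and left > 0 else 0.0
    return a - b - c + d

def two_rect_feature(ii, top, left, h, w):
    """Type (a): grey half minus dark half, side by side."""
    grey = rect_sum(ii, top, left, h, w // 2)
    dark = rect_sum(ii, top, left + w // 2, h, w // 2)
    return grey - dark
```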
6.3.2 Construction of the Classifiers
The rectangular Haar-like features described above constitute the weak classifiers
of our detection algorithm. A set of such classifiers is selected and combined to construct a strong classifier via AdaBoost [53, 150], and a sequence of these strong classifiers is then cascaded following Viola and Jones [161]. Thus, each strong classifier
in the cascade is a linear combination of the best weak classifiers, with weights
inversely proportional to training errors on those examples not previously rejected
by an early stage of the cascade. This results in a fast detection as most of the
negative sub-windows are rejected using only a small number of features associated
with the initial stages.
The optimization of the number of stages, the number of features per stage
and the threshold for each stage for a target detection rate (D) and false positive
rate ($F_t$) is obtained, similar to [161], by aiming for a fixed maximum FPR ($f_m$) and a minimum detection rate ($d_{min}$) for each stage. These are computed from the following inequalities: $F_t < (f_m)^n$ and $D > (d_{min})^n$, where $n$ is the number of stages, typically 10-50.
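As an illustrative check (the per-stage targets are not restated here, so these numbers are hypothetical): with $n = 18$ stages, per-stage targets of $f_m = 0.65$ and $d_{min} = 0.999$ would satisfy the overall goals chosen below, since $F_t < 0.65^{18} \approx 4.3 \times 10^{-4} < 0.001$ and $D > 0.999^{18} \approx 0.982 > 0.98$.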
6.3.3 Training the Classifiers
The training datasets used to build the proposed ear detector, their preprocessing stage, the chosen training parameters and other implementation aspects are described as follows.
Dataset The positive training set is built with 5000 left ear images cropped from
the profile face images of different databases covering a wide range of races, sexes,
appearances, orientations and illuminations. This set includes 429 images of the
University of Notre Dame (UND) Biometrics Database (Collection F) [157, 179],
659 of the NIST Mugshot Identification Database (MID) [129], 573 of XM2VTSDB
[112], 201 images of the USTB [159], 15 of the MIT-CBCL [165, 120], and 188 of the
UMIST [61, 154] face databases. It also includes around 3000 images synthesized by rotating some images from the USTB, the UND and the XM2VTSDB databases by -15 to +15 degrees.
Our negative training set for the first stage of the cascade includes 10,000 images
randomly chosen from a set of around 65,000 non-ear images. These images are
mostly cropped from profile images excluding the ear area. We also include some
images of trees, birds and landscapes randomly downloaded from the web. Examples
of the positive and negative image set are shown in Fig. 6.4. The negative training
set for the second and subsequent stages are made up dynamically as follows. A set
of 6000 large images without ears are scanned through at the end of each stage of
the cascade by the classifier developed in that stage. Any sub-window classified as
an ear is considered as a false positive and a set of not more than 5000 such false
positives are randomly chosen to include in the negative set for the following stages.
Figure 6.4: Examples of ear (top) and non-ear (bottom) images used in the training.
The validation set used to compute the rates of detection and false positives
during the training process includes 5000 positives (cropped and synthesized ear
images) and 6000 negatives (non-ear images). The negatives for the first stage are
randomly chosen from a set of 12000 images not included in the training set. For
the second and the subsequent stages, negatives are randomly chosen from the false
positives found by the classifier of the previous stage and unused in the negative
training set.
Preprocessing the Data As mentioned earlier, input images are collected from
different sources with varying size and intensity values. Therefore, all the input
images are scale normalized to the chosen input pattern size. Viola and Jones
reported a square input pattern of size 24 × 24 as the most suitable for detecting
frontal faces [161]. Considering the shape of the ear, we instead use a rectangular
pattern of size 16× 24.
The variance of the intensity values of the images is also normalized to minimize the effect of lighting. A similar normalization is performed for each sub-window scanned during testing.
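In code, this per-window lighting normalization is a one-liner (a sketch under our own naming):

```python
import numpy as np

def normalize_window(win, eps=1e-8):
    """Zero-mean, unit-variance lighting normalization of a sub-window."""
    win = win.astype(np.float64)
    return (win - win.mean()) / (win.std() + eps)
```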
Training the Cascade In order to train the cascade, we choose $F_t = 0.001$ and $D = 0.98$. However, to quickly reject most of the false positives using a small number of features, we fix the first four stages to be completed with 10, 20, 60 and 80 features. We also performed validation after adding every ten features for the first ten stages and after every 25 features for the remaining stages. The detection and false positive rates computed during the validation of each stage decrease gradually towards the target. The training finishes at stage 18 with a total of 5035 rectangular features, including 1425 features in the last stage.
Training Time The training process involved a huge amount of computation due to the large training set and the very low target FPR, taking several weeks on a single PC. To speed up the process, we distributed the job over a network of around 30 PCs. For this purpose, we used MatlabMPI, a MATLAB implementation of the Message Passing Interface (MPI) standard that allows any MATLAB program to exploit multiple processors [93]. This reduced the training time to the order of days. An optimized C or C++ implementation would reduce this time further, but since the training never needs to be performed again, our MATLAB implementation was sufficient.
6.3.4 Ear Detection with the Cascaded Classifiers
The trained classifiers of all the stages are used to build the ear detector in a cascaded manner. The detector is scanned over a test profile image at different scales and locations. A classifier in the cascade is only used when a sub-window in the
test image is detected as positive (ear) by the classifier of the previous stage and
accepted finally only when it passes through all of them.
To detect various sizes of ears, instead of resizing the given test image, we scale
up the detector along with the corresponding features and use an integral image
calculation. The approach is similar to that of Viola and Jones [161] who also
illustrated this to be more time-efficient than the conventional pyramid approach.
If the rectangular detector (of 16 × 24 or its scaled-up size) matches any sub-
window of the image, a rectangle is drawn to show the detection (See Fig. 6.5). The
integration of multiple detections (if any) is described in the following sub-section
and the overall performance of the detector is discussed in Section 6.6.
As mentioned in Section 6.3.3, our system is trained for detecting ears from the
left profile images only. However, if the input image is a right profile and the ear
detector fails, then the features constituting the detector can be flipped to detect
the right ears.
Figure 6.5: Sample of detections: (a) Detection with single window. (b) Multi-
detection integration (best seen in color).
6.3.5 Multi-detection Integration
Since the detector scans over a region in the test image with different scales and
shift sizes, there is the possibility of multiple detections of the ear or ear-like regions.
To integrate such multiple detections, we propose the clustering algorithm reported
in Algorithm 1.
The clustering algorithm is based on the percentage of overlap of the rectangles
representing the detected sub-windows. We cluster a pair of rectangles together if
the mutual area shared between them is larger than a predefined threshold, minOv
(0 < minOv < 1). A higher value of this parameter may result in multiple detections
near the ear. We empirically chose a threshold value of 0.01.
Based on the observation that the number of true detections at different scales
over the ear region is larger than the false detections on ear-like region(s) (if any),
we added an option in the algorithm to avoid such false positive(s) by only taking
the one that clusters the maximum number of rectangles. This is appropriate when
only one ear needs to be detected which is the case for most recognition applications.
An example of integrating three detection windows is illustrated in Fig. 6.5(b).
Each of the detections is shown by a rectangle in yellow lines while the integrated
detection window is shown by a rectangle in bold dotted cyan lines.
Algorithm 1. Integration of multiple ear detections
0. (Input) Given a set of detected rectangles rects, the minimum
percentage of overlap required minOv, and the option for avoiding
false detections opt.
1. (Initialize) Set the intermediate rectangle set tempRects empty.
2. (Multi-detection integration procedure)
2.a While the number of rectangles in rects >= 1:
i. Find the areas of intersection of the first rectangle in
rects with all rectangles in rects.
ii. Find the rectangles combRects, and their number intN,
whose percentage of overlap >= minOv.
iii. Store the mean of combRects and intN in tempRects.
iv. Remove the rectangles in combRects from rects.
2.b If tempRects contains more than one rectangle and opt = 'yes':
i. Find the rectangle fRect in tempRects for which intN
is maximum.
ii. Remove all rectangles except fRect from tempRects.
End if
3. (Output) Output the rectangle(s) in tempRects.
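A minimal Python sketch of Algorithm 1, with rectangles as (x, y, w, h) tuples. The exact definition of "percentage of overlap" is not spelled out above, so the ratio used here (intersection over the smaller rectangle) is our assumption, as are the helper names:

    import numpy as np

    def intersection_area(a, b):
        w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
        h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
        return max(w, 0) * max(h, 0)

    def overlap_ratio(a, b):
        # Assumed reading: shared area relative to the smaller rectangle.
        return intersection_area(a, b) / min(a[2] * a[3], b[2] * b[3])

    def integrate_detections(rects, min_ov, keep_densest=True):
        rects = [tuple(r) for r in rects]
        clusters = []                      # (mean rectangle, intN) pairs
        while rects:
            ref = rects[0]
            grp = [r for r in rects if overlap_ratio(ref, r) >= min_ov]
            clusters.append((tuple(np.mean(grp, axis=0)), len(grp)))
            rects = [r for r in rects if r not in grp]
        if keep_densest:                   # opt = 'yes' in Algorithm 1
            clusters = [max(clusters, key=lambda c: c[1])]
        return [c[0] for c in clusters]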
6.3.6 3D Ear Region Extraction
Assuming that the 3D profile data are co-registered with corresponding 2D data
(which is normally the case when data is collected with a range scanner), the location
information of the detected rectangular ear region in the 2D profile is used for
3D ear data extraction. To ensure that the whole ear is included and to allow
the extraction of features on and slightly outside the ear region, we expanded the
detected ear regions by an additional 25 pixels in each direction. This extended ear
region is then cropped to be used as 3D ear data. Fig. 6.9 illustrates the original
and expanded region of extraction. If our ear detection system indicates that a right
ear is detected, we flip the 3D ear data to allow it to be matched with the left ears
in the gallery.
6.3.7 Extracted Ear Data Normalization
Once the 3D ear is detected, we remove all the spikes by filtering the data. We
perform triangulation on the data points, remove edges longer than a predefined
threshold of 0.6 mm and finally, remove disconnected points [114].
The extracted 3D ear data varies in dimensions depending on the detection
window. Therefore, we normalize the 3D data by centering it on the mean and then
sampling it on a uniform grid of up to 132 mm by 106 mm. The resampling makes the
data points more uniformly distributed and fills any holes. It also makes
the local features more stable and increases the accuracy of ICP based matching.
We perform a surface fitting based on the interpolation of the neighboring data
points at 0.5 mm resolution. This also fills holes or missing data (if any) due to oily
skin or sensor error [176, 179] (as shown in Fig. 6.17 and 6.20).
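A sketch of the resampling step using SciPy, assuming the cropped ear is available as an N x 3 point cloud after mean-centering; the 0.5 mm grid pitch is from the text, while the interpolation method is our choice:

    import numpy as np
    from scipy.interpolate import griddata

    def resample_ear(points, step=0.5):
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        xi = np.arange(x.min(), x.max(), step)
        yi = np.arange(y.min(), y.max(), step)
        XI, YI = np.meshgrid(xi, yi)
        # Interpolating z over a uniform grid makes the point density
        # uniform and fills small holes inside the convex hull.
        ZI = griddata((x, y), z, (XI, YI), method='cubic')
        return XI, YI, ZI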
6.4 Representation and Extraction of Local 3D Features
The performance of any recognition system greatly depends on how the relevant
data is represented and how the significant features are extracted from it. Although
the core of our data representation and the feature extraction technique is similar to
the L3DFs proposed for face data in [117, 83], the technique is modified to make it
suitable for the ear as preliminarily presented in our previous work [83] and further
enhanced as described in this section.
6.4.1 KeyPoint Selection for L3DFs
A 3D local surface feature can be depicted as a 3D surface constructed using
data points within a small sphere of radius r1 centered at a keypoint p. An example
of a feature is shown in Fig. 6.6. As outlined by Mian et al. [117], keypoints
are selected from surfaces distinct enough to differentiate between range images of
different persons.
Figure 6.6: Example of a 3D local surface (right image). The region from which it
is extracted is shown by a circle on the left image.
Figure 6.7: (a) Location of keypoints on the gallery (left) and the probe (right)
images of three different individuals (this figure is best seen in color). (b) Cumulative
percentage of repeatability of the keypoints.
For keypoints we only consider data points that lie on a grid with a resolution of
2 mm in order to increase distinctiveness of the surface to be extracted. We find the
distance of each of the data points from the boundary and take only those points
with a distance greater than a predefined boundary limit. The boundary limit is
chosen slightly longer than the radius of the 3D local feature surface (r1) so that
the feature calculation does not depend on regions outside the boundary and the
allowed region corresponds closely with the ear. We call the points within this limit
seed points. In our experiments, a boundary limit of r1 + 10 mm was found to be the
most suitable.
To check whether the data points around a seed point contain enough descriptive
information, we adopt the approach of Mian et al. [117] discussed in short as follows.
We randomly choose a seed point and take a sphere of data points around that point
which are within a distance of r1. We apply PCA on those data points and align
them with their principal axes. The difference between the eigenvalues along the
first two principal axes of the local region is computed as δ. It is then compared to
a threshold (t1) and we accept a seed point as a keypoint if δ > t1. The higher t1,
the fewer features we get, but lowering t1 can result in the selection of less significant
feature points with unreliable orientations. This is because the value of δ indicates
the extent of asymmetrical depth variation around a seed point; for example,
t1 = 0 would admit a completely planar or spherical surface.
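A sketch of this keypoint test, with δ taken as the difference of the two largest eigenvalues of the neighborhood's covariance (whether covariance or raw scatter is used is our assumption; t1 = 2 is from the text):

    import numpy as np

    def is_keypoint(nbhd, t1=2.0):
        # nbhd: (N, 3) data points within radius r1 of the seed point.
        centered = nbhd - nbhd.mean(axis=0)
        eigvals = np.sort(np.linalg.eigvalsh(np.cov(centered.T)))[::-1]
        delta = eigvals[0] - eigvals[1]   # asymmetry of depth variation
        return delta > t1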
We continue selecting seed points as keypoints until nf features are created.
For a seed resolution (rs) of 2 mm, r1 of 15 mm, t1 of 2 and nf of 200, we found
200 keypoints for most of the gallery and probe ears. However, we found as few
as 65 features, particularly in cases where data is missing.
The values of all the parameters used in the feature extraction are empirically
chosen and the effect of their variation is further discussed in the Appendix. Our
experiments with ear data and those with face data by Mian et al. [117] show that the
performance of the keypoint detection algorithm and hence 3D recognition do not
vary significantly with small variations in the values of these parameters. Therefore,
we use the same values to extract features from all the ear databases.
The suitability of our local features on the ear data is illustrated in Fig. 6.7a.
It shows that keypoints are different for ear images of different individuals. It
also shows that these features have a high degree of repeatability for the ear data
of the same individual. By repeatability we mean the proportion of probe
feature points that have a corresponding gallery feature point within a particular
distance. We performed a quantitative analysis of the repeatability similar to [117].
The probe and the gallery data of the same individual are aligned using the ICP
algorithm in order to allow computation of repeatability. The cumulative percentage
of repeatability as a function of the nearest neighbor error between gallery and probe
features of ten different individuals is shown in Fig. 6.7b. The repeatability reaches
around 80% at an error of 2 mm which is the sampling distance between the seed
points.
6.4.2 Feature Extraction and Compression
After a seed point qualifies as a keypoint, we extract a surface feature from its
neighborhood. As described in Section 6.4.1, while testing for the suitability of
the seed point we take a sphere of data points with a radius of r1 from that seed
point and align them to their principal axes. We use these rotated data points to
construct the 3D local surface feature. Similar to [117], the principal directions of
the local surface are used as the 3D coordinates to calculate the features. Since the
coordinate basis is defined locally based on the shape of the surface, the computed
features are mostly pose invariant. However, large changes in viewpoints can cause
different points of the ear to occlude and cause perturbations in the local coordinate
basis.
We fit a 30 × 30 uniformly sampled 3D surface (with a resolution of 1 mm) to
these data points. In order to avoid boundary effects, we crop the inner region of
20× 20 datapoints and store it as a feature (see Fig. 6.6).
For surface fitting, we use a publicly available surface fitting code [47]. The
motivation behind the selection of this algorithm is that it builds a surface over
the complete lattice approximating (rather than interpolating) and extrapolating
smoothly into the corners. Therefore, it is less sensitive to noise, outliers and missing
data.
In order to reduce computational time and memory storage and to make features
more robust to noise, we apply PCA on the whole gallery feature set as in [117],
after centering on the mean. The top 11 eigenvectors are then used to project gallery
and probe features into vectors of dimension 11. Unlike [117], we do not normalize
the variance of the dimensions, nor the size of the features. Instead, we preserve as
much as possible of the original geometry in the features.
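A sketch of this compression step, assuming each 20 x 20 surface is flattened to a 400-vector and the basis is learned from the gallery set only:

    import numpy as np

    def pca_basis(gallery_feats, k=11):
        # gallery_feats: (N, 400) matrix of flattened surface features.
        mean = gallery_feats.mean(axis=0)
        _, _, Vt = np.linalg.svd(gallery_feats - mean, full_matrices=False)
        return mean, Vt[:k].T             # mean (400,), basis (400, k)

    def compress(feats, mean, basis):
        # No per-dimension variance normalization, as stated above.
        return (feats - mean) @ basis     # (N, k) compressed features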
6.5 L3DF Based Matching Approach
In this section, our method of matching gallery and probe datasets is described.
We establish correspondences between extracted L3DFs [83] similar to Mian et
al. [117] for face. However, we use geometric consistency checks [80] to refine the
matching and to calculate additional similarity measures. We use the matching
information to reject a large number of false gallery candidates and to coarsely
align the remaining candidates prior to the application of a modified version of the ICP
algorithm for fine matching. The complete matching algorithm is formulated in
Algorithm 2 and discussed as follows.
6.5.1 Finding Correspondence Between Candidate Features
The similarity between two features is calculated as the Root Mean Square
(RMS) distance between corresponding points on the 20 × 20 grid generated when
the feature is created (aligned following the axes in the PCA). The RMS distance
is computed from each probe feature to all the gallery features. Matching gallery
features which are located more than a threshold (th1) away are discarded to avoid
matching in quite different areas of the cropped image. The gallery feature with
the minimum distance is considered as the corresponding feature for that particular
Figure 6.8: Feature correspondences after filtering with geometric consistency (best
seen in color).
probe feature. When multiple probe features match the same gallery feature we
retain the best match for that gallery feature as in [117].
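A sketch of this correspondence search, operating on the PCA-compressed feature vectors and keypoint locations (Euclidean distance on compressed features stands in for the RMS grid distance; th1 = 45 mm as in the Appendix):

    import numpy as np

    def match_features(p_feats, p_locs, g_feats, g_locs, th1=45.0):
        best = {}                         # gallery index -> (probe index, distance)
        for i, (pf, pl) in enumerate(zip(p_feats, p_locs)):
            near = np.linalg.norm(g_locs - pl, axis=1) <= th1   # location gate
            if not near.any():
                continue
            d = np.linalg.norm(g_feats[near] - pf, axis=1)
            j = np.flatnonzero(near)[d.argmin()]
            # Keep only the best probe match per gallery feature.
            if j not in best or d.min() < best[j][1]:
                best[j] = (i, d.min())
        return [(pi, gj, dist) for gj, (pi, dist) in best.items()]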
6.5.2 Filtering with Geometric Consistency
Unlike previous works on L3DFs for the face [117], we found it necessary to
improve our matches for the ear using geometric consistency. We add a second round
of feature matching each time a probe is compared with a gallery that uses geometric
consistency based on information extracted from the feature matches generated by
the first round. The first round of feature matching is performed just as described in
Section 6.5.1, and we use the matches generated to identify a subset that is most
geometrically consistent.
For simplicity, we measure the geometric consistency of a feature match by counting
the number of other feature matches from the first round that yield consistent
distances on the probe and gallery. More precisely, for a match with locations pi, gi we
count how many other match locations pj, gj satisfy Eqn. (6.1).

| ‖pi − pj‖ − ‖gi − gj‖ | < rs + κ √‖pi − pj‖     (6.1)
The right hand side of the above equation is a function of the spacing between
candidate keypoints (the seed resolution rs), a constant κ, and the square root of
the actual probe distance, which accounts for minor deformations and measurement
errors. The constant κ is determined empirically as 0.1.
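A sketch of the consistency count of Eqn. (6.1), given the keypoint locations of the M feature matches:

    import numpy as np

    def distance_consistency(p_locs, g_locs, rs=2.0, kappa=0.1):
        # p_locs, g_locs: (M, 3) matched keypoint locations.
        counts = np.empty(len(p_locs), dtype=int)
        for i in range(len(p_locs)):
            dp = np.linalg.norm(p_locs - p_locs[i], axis=1)
            dg = np.linalg.norm(g_locs - g_locs[i], axis=1)
            ok = np.abs(dp - dg) < rs + kappa * np.sqrt(dp)   # Eqn. (6.1)
            counts[i] = ok.sum() - 1      # exclude the match itself
        return counts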
Algorithm 2. Matching a probe with the gallery
0. (Input) Given a probe, gallery data and features, the distance
threshold th1, the distance multiplier κ, the angle threshold th2 and
the minimum number of matches m.
1. (Matching based on local 3D features)
1.a (Distance check) For each feature of the probe and all features of a
gallery:
(i) Discard gallery features with distance from the probe feature
location > th1.
(ii) Pair the probe feature with the closest gallery feature, by feature
distance.
1.b (Distance consistency check)
(i) For each of the matching feature pairs, count how many other
matches satisfy Eqn. (6.1).
(ii) Choose T as the translation of the match pair with the highest
count, CT.
(iii) Compute the proportion of consistent distances (αd = CT / |matches|).
1.c (2nd stage of matching)
(i) For all the gallery features repeat step (1.a), but do not allow
matches which are inconsistent with T.
(ii) Compute the mean of the feature distances of the matching feature
pairs (εf).
(iii) Discard the gallery if there are fewer than m feature pairs.
1.d (Calculating rotation consistency measure)
(i) For each of the selected matching pairs, count how many other
matches have rotation angles within th2.
(ii) Choose R as the rotation of the pair with the highest count, CR.
(iii) Compute the proportion of consistent rotations (αr = CR / |matches|).
1.e (Calculating keypoint distance measure)
(i) Align the keypoints of the matching probe features to those of the
corresponding gallery features using ICP.
(ii) Record the ICP error as the keypoint distance measure (εn).
2. Repeat step (1) for all galleries.
3. (Rejection classifier)
3.a For each of the gallery candidates:
(i) For each of the similarity measures (εf, αd, αr, εn), compute the
corresponding weight factor (ηf, ηd, ηr, ηn).
(ii) Compute the similarity score ε according to Eqn. (6.2).
3.b Rank the gallery candidates according to ε and discard those having
rank over 40.
4. (ICP-based matching) For each of the selected gallery candidates:
(i) Extract a minimum rectangle containing all matches from both the
gallery and probe data.
(ii) Coarsely align the extracted probe data with the gallery data using
T and R.
(iii) Finely align the extracted probe data with that of the gallery
using ICP.
5. (Output) Output the gallery having minimum ICP error as the best match
for the probe.
We then simply find the match from the first round which is most ‘distance-
consistent’ according to this measure. In the second round, we follow the same
matching procedure as in round-1 but only allow feature matches that are distance-
consistent with this match. Fig. 6.8 illustrates an example of the matches between
the features of a probe image and the corresponding gallery image (mirrored in the
z direction) in the second round. Here the green channel is used to indicate the
amount of rotational consistency for each match (best viewed in color, although in
grey scale the green channel generally dominates). It is clear that a good proportion
of these matches involve corresponding parts of the two ear images.
6.5.3 Other Similarity Measures Based on L3DFs
In addition to the mean feature distance for all the matched probe and gallery
features (εf ) used in [117], we also derive three more similarity measures based on
the geometric consistency of matched features as discussed below.
We compute the ratio of the maximum distance consistency to the total number
of matches found in the first round of matching and use that as a similarity measure,
proportion of consistent distances (αd).
We also include a component based on the consistency of the rotations implied
by the feature matches in our measure of similarity between probes and galleries.
Each feature match implies a certain 3D rotation between the probe and gallery,
since we store the rotation matrix used to create the probe feature from the probe
(calculated using PCA), and similarly for the gallery feature, and we assume that
the match occurs because the features have been aligned in the same way and come
from corresponding points. We can thus calculate the implied rotation from probe
to gallery as Rg⁻¹Rp, where Rp and Rg are the rotations used for the probe and
gallery features respectively.
We calculate these rotations for all feature matches, and for each, we determine
the count of how many of the other rotations it is consistent with. Consistency
between two rotations R1 and R2 is determined by finding the angle between them,
i.e., the rotation angle of R1⁻¹R2 (around the appropriate axis of rotation). We
consider two rotations consistent when the angle is less than 10◦ (th2). We choose
the rotation of the match that is consistent with the largest number of other matches,
and use the proportion of matches consistent with this as a similarity measure called
the proportion of consistent rotations (αr). As we shall see in Section 6.7, this
measure is the strongest among the measures used prior to applying ICP in our ear
recognition experiments and fusing with the other measures provides only a modest
but worthwhile improvement. We also use the rotation with the highest consistency
for ICP coarse alignment as described in Section 6.5.6.
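A sketch of this rotation-consistency computation; the angle between two rotations is recovered from the trace of their relative rotation (a standard identity), and the rest follows the description above:

    import numpy as np

    def rotation_angle_deg(R):
        # Rotation angle of a 3x3 rotation matrix, via its trace.
        c = (np.trace(R) - 1.0) / 2.0
        return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

    def rotation_consistency(Rp_list, Rg_list, th2=10.0):
        # Implied probe-to-gallery rotation per match: Rg^{-1} Rp
        # (the inverse of a rotation matrix is its transpose).
        implied = [Rg.T @ Rp for Rp, Rg in zip(Rp_list, Rg_list)]
        counts = [sum(rotation_angle_deg(Ri.T @ Rj) < th2
                      for Rj in implied) - 1 for Ri in implied]
        best = int(np.argmax(counts))
        alpha_r = counts[best] / len(implied)  # proportion of consistent rotations
        return implied[best], alpha_r          # rotation reused for coarse alignment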
Lastly, we develop another similarity measure called keypoint distance measure
(εn) based on the distance between the keypoints of the corresponding features. To
compute it, we apply ICP only on the keypoints (not the whole dataset) of the
matched features (obtained in the second round of matching). This corresponds to
the ‘graph node error’ described in Mian et al. [117] for the face.
6.5.4 Building a Rejection Classifier
As in [117], the similarity measures (εf , αr, αd and εn) computed in the previous
sub-section are first normalized on the scale from 0 to 1 and then combined using a
confidence weighted sum rule as shown in Eqn. (6.2).
ε = ηf εf + ηr (1 − αr) + ηd (1 − αd) + ηn εn     (6.2)
The weight factors (ηf ,ηr,ηd,ηn) are dynamically computed individually for each
probe during recognition as the ratio between the minimum and second minimum
values, taken relative to the mean value of the similarity measure for that probe
[117]. Therefore, the weights reflect the relative importance of the similarity mea-
sures for a particular probe based on the confidence in each of them. Note that
the second and third similarity measures (αr and αd) are subtracted from unity
before multiplication with the corresponding weight factors, as these have a polarity
opposite to that of the other measures (for αr and αd, higher values indicate a better match).
We observe that the combination of these similarity measures provides an accept-
able recognition rate and most of the misclassified images are matched within the
rank of 40 (see Section 6.7.7). Therefore, unlike [117], we use this combined classifier
as a rejection classifier to discard a large number of poor candidates, retaining only
the best 40 identities (ranked according to this classifier) for fine matching using ICP
as described in the following sections.
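A sketch of the fusion of Eqn. (6.2). The dynamic weight formula is paraphrased above, so the version here (separation of the two best values relative to the mean) is one plausible reading, not the exact thesis formula:

    import numpy as np

    def confidence_weight(scores):
        # Assumed reading: gap between minimum and second minimum,
        # relative to the mean score for this probe.
        s = np.sort(np.asarray(scores, dtype=float))
        return (s[1] - s[0]) / (s.mean() + 1e-12)

    def fused_score(eps_f, alpha_r, alpha_d, eps_n):
        # Inputs: per-gallery measures, each normalized to [0, 1];
        # the alpha measures are flipped so that lower is better.
        terms = [np.asarray(eps_f, float),
                 1 - np.asarray(alpha_r, float),
                 1 - np.asarray(alpha_d, float),
                 np.asarray(eps_n, float)]
        return sum(confidence_weight(t) * t for t in terms)   # Eqn. (6.2)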
6.5.5 Extraction of a Minimal Rectangular Area
We extract a reduced rectangular region (containing all the matching features)
from the originally detected gallery and probe ear data. This region is identi-
fied using the minimum and maximum co-ordinate values of the matched L3DFs.
Fig. 6.9 illustrates this region in comparison with other extraction windows (see
Section 6.3.6).
Figure 6.9: Extraction of a minimal rectangular area containing all the matching
L3DFs. The figure shows the original detection window, the extended extraction
window used for feature extraction, and the minimal extraction window used for ICP.
L3DFs do not match in regions with occlusion or excessive missing data.
By selecting a sub-window where the L3DFs match, such regions are generally
excluded. Moreover, this smaller but feature-rich region reduces the processing time
as described in Section 6.7.6.
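A sketch of the crop, assuming the bounding box is taken over the x-y extent of the matched keypoint locations:

    import numpy as np

    def minimal_rectangle(points, match_locs):
        # points: (N, 3) ear data; match_locs: (M, 3) matched keypoints.
        lo = match_locs[:, :2].min(axis=0)
        hi = match_locs[:, :2].max(axis=0)
        inside = np.all((points[:, :2] >= lo) & (points[:, :2] <= hi), axis=1)
        return points[inside]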
6.5.6 Coarse Alignment of Gallery-Probe Pairs
The input profile images of the gallery and the probe may have pose (rotation
and translation) variations. To minimize the effect of such variations, we apply the
transformation in Eqn. (6.3) to the minimal rectangular area of the probe dataset.
P ′ = RP + t (6.3)
where P and P′ are the probe dataset before and after coarse alignment respectively.
We use the translation (t) corresponding to the matched pair with the maximum
distance consistency and the rotation (R) corresponding to the matched pair with
the largest cluster of consistent rotations for this alignment. Our results show
better performance with this approach than with the alternative of minimizing
the sum of squared distances between points in the feature matches.
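Applying Eqn. (6.3) is then a one-liner once T and R are chosen (here P is a 3 x N point matrix):

    import numpy as np

    def coarse_align(P, R, t):
        # P' = R P + t, with t broadcast over the N columns of P.
        return R @ P + t.reshape(3, 1)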
6.5.7 Fine Alignment with the ICP
The Iterative Closest Point (ICP) algorithm [10] is considered to be one of the
most accurate algorithms for registration of two clouds of data points provided the
datasets are roughly aligned. We apply a modified version of ICP as described
in [114]. The computational expense of ICP is minimized using the minimal rectan-
gular area as described in Section 6.5.5. The final decision regarding the matching
is made based on the results of ICP.
Table 6.2: Ear detection results on different datasets

Database    No. of    Description of the test images including                  Undetected   Detection
            images    challenges involved                                       image(s)     rate (%)
UND-F       203       Not used in training and validation of the classifiers    0            100
UND-F       942       Images from 302 subjects including some partially         1            99.9
                      occluded images
UND-J       830       2 images from each of 415 subjects including some         1            99.9
                      partially occluded images
UND-J       146       54 images occluded with ear-rings and 92 images           1            99.1
                      partially occluded with hair
XM2VTSDB    104       Severely occluded with hair (see Fig. 6.11)               50           51.9
UWADB       50        All images occluded with ear-phones                       0            100
6.6 Detection Performance
In this section, we report and discuss the accuracy and speed of our ear detector.
6.6.1 Correct Detection
We performed experiments on six different datasets with different types of
occlusions. The results are summarized in Table 6.2, which shows the high accuracy
of our detector.
Test profile images in the first dataset in Table 6.2 are carefully separated from
the training and validation set. Images in the fourth dataset are partially
occluded with hair and ear-rings, and those in the fifth dataset are severely occluded
with hair. Some examples of correct detection of such images are shown in Fig. 6.10.
The detector failed only for the images where hair covered most of the ear (see
Fig. 6.11). However, such occluded ear images may not be as useful for biometric
recognition anyway, as they lack sufficient ear features.
For some applications another kind of occlusion is likely to be common: occlusion
due to ear-phones, since people are increasingly using ear-phones with their mobile
phones or to listen to music. Therefore, we collected images from 50 subjects in our
laboratory using a Minolta Vivid 910 range scanner, each of whom was requested to
wear ear-phones (see Fig. 6.12). Correct detection of ears in these images confirms
that our detection algorithm does not require the ear pit to be visible. To the best
Figure 6.10: Detection in presence of occlusions with hair and ear-rings (the inset
image is the enlargement of the corresponding detected ear).
Figure 6.11: Example of test images for which the detector failed.
Figure 6.12: Example of ear images detected from profile images with ear-phones.
of our knowledge, we are the first to use an ear dataset with ear-phones on.
In order to further analyze the robustness of our ear detector to occlusion, we
synthetically occluded the ear region of profile images similar to Arbab-Zavar and
Nixon [8]. After correctly detecting the ear region without occlusion, we introduced
different percentages of occlusion and repeated the detection. During each pass
of the detection test, occlusion was increased vertically (from top to bottom) or
horizontally (from right to left) by masking in increments of 10% of the originally
detected window. The results of these experiments applied on the first dataset in
Table 6.2 are shown in Fig. 6.13. The plots demonstrate a very high detection rate
up to occlusion levels of 20% (vertical) and 30% (horizontal), which then decreases
sharply at 40% and 50% occlusion respectively. A better
performance is obtained under horizontal occlusions which are the most common
types of occlusion caused by hair.
Figure 6.13: Detection performance under two different types of synthesized occlu-
sions (on a subset of the UND-F dataset with 203 images).
To evaluate the performance of our detector under pose variations, we performed
experiments with a subset of Collection G of the UND database. It includes straight-
on, 15◦ off center, 30◦ off center, and 45◦ off center images. For each of the poses,
there are 24 images (details of the dataset can be found in [179]). Our detector
successfully detected ears in all of the images proving its robustness up to 45 degrees
of pose variation.
We also found that our detector is robust to other degradations of images such
as motion blur as shown in Fig. 6.14.
Figure 6.14: Detection of a motion blurred image.
Figure 6.15: False detection evaluation: (a) FAR (on number of profile images) with
respect to number of stages in the cascade. (b) The ROC curve for classification of
cropped ear and non-ear images (best seen in color).
6.6.2 False Detection
On the first dataset in Table 6.2, for a scale factor of 1.25 and step size of 1.5,
our detector scanned a total of 1308335 sub-windows in 203 images of size 640×480.
Only seven sub-windows were falsely detected as ears, resulting in a false positive
rate (FPR) of 5 × 10⁻⁶. These seven false positives were easily eliminated using the
multi-detection integration as mentioned in Section 6.3.5. The relationship between
the FPR and the number of stages in the cascade is shown in Fig. 6.15(a). As
illustrated, the FPR decreases exponentially with an increase in the number of
stages, following the maximum FPR set for each stage, fm = 0.7079. This is because
the classifiers of the subsequent stages are trained to correctly classify the samples
misclassified by the previous stages.
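The per-stage figure fm = 0.7079 is consistent with budgeting the target FPR over roughly 20 stages. Assuming independent stage-wise rejection (our reading; the derivation is not given above), the cascade FPR after N stages is bounded by the product of the per-stage rates:

    F = f1 · f2 · · · fN ≤ fm^N,  with  fm = Ft^(1/20) = 0.001^(1/20) ≈ 0.7079,

so fm^18 ≈ 2 × 10⁻³ at the 18 trained stages; the observed FPR of 5 × 10⁻⁶ is far below this bound, presumably because each stage beats its per-stage target.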
In order to evaluate the classification performance of our trained strong classi-
fiers, we cropped and synthesized (by a rotation of -5 to +5 degrees) 4423 ear and
5000 non-ear images. The results are illustrated in Fig. 6.15(b). The correct
classification rate is 97.1% with no false positives (see the inset plot of Fig. 6.15(b)),
and we achieve very high classification accuracy at very low false positive rates. In
fact, we obtained 99.8% and 99.9% classification rates at 0.04% and 0.2% FPRs
respectively, corresponding to only 2 and 12 false positives.
6.6.3 Detection Speed
Our detector achieves extremely fast detection speeds. The exact speed of the
detector depends on the step (shift) size, the scale factor and the initial scale. With
a step size of 1.5, an initial scale of 5 and a scale factor of 1.25, the proposed
detector can detect the ear in a 640 × 480 image in 7.7 ms on a Core 2 Quad Q9550,
2.83 GHz machine using a C++ implementation of the detection algorithm. This
time also includes the time required by the multi-detection integration algorithm,
which is 0.1 ms on average.
6.7 Recognition Performance
The experimental results of ear recognition using our proposed approach on
different datasets are reported and discussed in this section. The robustness of the
approach is evaluated against different types of occlusions and pose variations. The
effects of using L3DFs, geometric consistency and the minimal rectangular area of
datapoints for ICP are also summarized in this section.
6.7.1 Datasets
Collections F, G and J from the University of Notre Dame Biometrics Database
[157, 179] are used to perform the recognition experiments of the proposed approach.
Collection F and J include 942 and 830 images of 302 and 415 subjects respectively
collected using a Minolta Vivid 910 range scanner in high resolution mode. There
is a wide time lapse of 17.7 weeks on average between the earliest and latest images
of subjects. There are also variations in pose between them and some images are
occluded with hair and ear rings. The earliest image and the latest image for each
subject are included in the gallery and the probe dataset respectively. As mentioned
in Section 6.6.1, Collection G includes images from 24 subjects each having images
at four different poses, straight-on, 15◦ off center, 30◦ off center and 45◦ off center.
We keep images with straight-on pose in the gallery and others in the probe dataset.
We also tested our algorithm on 100 profile images from 50 subjects with and
without ear-phones on. The images were collected at the University of Western Aus-
tralia using a Minolta Vivid 910 range scanner in low resolution mode. There are sig-
nificant data losses in the ear-pit regions of the images as shown in Fig. 6.17(c). Im-
ages without ear-phones are included in the gallery and others in the probe dataset.
All the ear data were extracted automatically as described in Section 6.3 except
for the following.
1. For the purpose of evaluating the system as a fully automatic one, we consid-
ered the undetected probe (see Section 6.6) as a failure and kept the number
of probes as 415 in the computation of the recognition performance on the
UND-J dataset.
2. There are three images in the 45◦ off center subset of the UND-G dataset
where the 2D and 3D data clearly do not correspond. For these images, we
manually extracted the 3D data.
We also used the same values of the parameters in the matching algorithm for
all experiments across all the databases.
6.7.2 Identification and Verification Results on the UND Database
On the UND Database Collection-F and Collection-J, we obtained rank-1 identi-
fication rates of 95.4% and 93.5% respectively. The Cumulative Match Characteristic
(CMC) curve illustrating the results for the top 40 ranks for the UND-J dataset is shown
in Fig. 6.16(a). The plot was obtained using ICP for matching a probe with the
selected gallery dataset after being coarsely aligned using L3DFs only. The little
Figure 6.16: Recognition results on the UND-J dataset: (a) Identification rate. (b)
Verification rate.
Figure 6.17: Examples of correct recognition in presence of occlusions: (a) With
ear-rings and (b) With hair (c) With ear-phones (2D and the corresponding range
images are placed in the top and bottom row respectively (best seen in color)).
Figure 6.18: Recognition results on the UND-G dataset: (a) Identification rate for
different off center pose variations. (b) CMC curves for 30◦ and 45◦ off center pose
variations.
gain in accuracy up to rank-40 shows that the ICP algorithm is very accurate when
the correct gallery passes the rejection classifier.
We also evaluated the verification performance in terms of the Receiver Oper-
ating Characteristic (ROC) curve and the Equal Error Rate (EER). We obtained a
verification rate of 96.4% at an FAR of 0.001 with an EER of 2.3% for the UND-F
dataset. On the UND-J dataset, we obtain 94% verification at an FAR of 0.001 with
an EER of 4.1% (see Fig. 6.16(b)).
6.7.3 Robustness to Occlusions
To evaluate the robustness of our approach to occlusions, we selected occluded
images from various databases as described below.
1. Ear-rings: We found 11 cases in UND Collection-F where either the gallery
or the probe image is with ear-rings. All these were correctly recognized by
our system. Some of the examples are illustrated in Fig. 6.17a. Although we
used 3D data for recognition, 2D images are shown for a better illustration.
2. Hair: Our approach is also significantly robust to occlusion with hair. Out of
59 images with partial occlusion with hair in the UND-F dataset, 54 are cor-
rectly recognized, yielding a recognition rate of 91.5% (see Fig. 6.17b). The mis-
classified examples also have other problems, as discussed in Section 6.7.5.
3. Ear-phones: Experiments on 50 probe images with ear-phones from the
UWADB provide a rank-1 identification of 98% and verification of 98% with an
EER of 1%. These results confirm the robustness of our approach to occlusions
in the ear pit region (see Fig. 6.17(c)).
6.7.4 Robustness to Pose Variations
Pose variations may occur during the capture of the probe profile image, or during
the detection phase if the detection rectangle is chosen in different positions. Large
variations in pose (particularly in the case of out-of-plane rotations) sometimes intro-
duce self-occlusions that decrease the repeatability of local 3D features in a gallery-
probe pair. Therefore, although the local 3D features are somewhat pose invariant
due to the way they are constructed (see Section 6.4.2), we noticed some
misclassifications when using only local 3D features. However, with the finer match-
ing via ICP, most such probe images in the UND-F and UND-J datasets are
recognized correctly. The results on the UND-G dataset, which has probes with pose varia-
recognized correctly. The results on UND-G dataset having probes with pose varia-
tions up to 45◦ are plotted in Fig. 6.18. We achieved 100%, 87.5% and 33.3% rank-1
identification rates for 15◦, 30◦ and 45◦ off center pose variations respectively. Our
results are comparable to those of Yan and Bowyer [179] and Chen and Bhanu [32]
for up to 30◦. Some examples of accurate recognition under large pose variations
are illustrated in Fig. 6.19.
Figure 6.19: Examples of correct recognition of four gallery-probe pairs with pose
variations (best seen in color).
6.7.5 Analysis of the Failures
Most of the misclassifications that occurred in all our experiments involve missing
data (inside or very close to the ear contour) in either the gallery or the probe image.
The remaining ones involve large out-of-plane pose variations and/or occlusions with
hair. Some examples are illustrated in Fig. 6.20.
Figure 6.20: Examples of failures: (a) Two probe images with missing data. (b) A
gallery-probe pair with a large pose variation.
6.7.6 Speed of Recognition
On a Core 2 Quad Q9550, 2.83 GHz machine, an unoptimized MATLAB imple-
mentation of our feature extraction algorithm requires around 22.2 sec to extract
local 3D features from a probe ear image. A similar implementation of our algorithm
for matching 200 L3DFs of a probe with those in a gallery requires 0.06 sec on av-
erage. This includes 0.02 sec required by the computation of geometric consistency
measures. For the full algorithm, including L3DF as rejection classifier, followed
by coarse alignment and ICP on a minimal rectangle, the average time to match a
probe-gallery pair in the identification case is 2.28 sec on the UND dataset. Timings
for different combinations of our recognition algorithms are given in Table 6.3.
Table 6.3: Performance variations for using L3DFs and geometric consistency measures

                                              Rank-1 identification rate (%)           Avg. matching
Approach                                      UWADB    UND-F      UND-F     UND-J      time (sec)
                                              (50-50)  (100-100)  (302-302) (415-415)
ICP only                                      98       80         -         -          58.09
L3DF without geometric consistency            86       89         76.8      71.6       0.04
L3DF using geometric consistency              88       92         83.44     79.8       0.06
ICP and L3DF without geometric consistency    96       98         93.7      81.6       2.43
ICP and L3DF using geometric consistency      98       98         95.4      93.5       2.28
6.7.7 Evaluation of L3DFs, Geometric Consistency Measures and Min-
imal Rectangular Area of Dataset
We obtain improved accuracy and efficiency using the local 3D feature based
rejection classifier and extracting the minimal rectangular area prior to the applica-
tion of ICP. The improvements are summarized in Table 6.3, where the numbers in
parentheses following the database name are the numbers of images in the gallery
and probe sets respectively.
Results in the first row of the above-mentioned table are computed using the
same ICP implementation as in our final approach, but without using L3DFs and
the corresponding minimal rectangular area of data points. The wider detection window,
with hair and outliers, and the absence of a proper initial transformation caused ICP
to yield very poor results compared to our approach with L3DFs. In all cases,
our method is significantly more efficient than the raw ICP algorithm.
As shown in Table 6.3, using geometric consistency measures improved the accu-
racy of identification significantly especially for the larger datasets. Improvements
using these consistency measures for other biometric applications are described in
detail in [80]. The performance of these measures on the UND-F database is shown
in Fig. 6.21. The legends errL3DF, propRot, propDist and errDist are used for the
similarity measures εf , αr, αd and εn respectively, as described in Section 6.5.3. The
CMC curves show the significance of the rotation consistency measure compared to
other similarity measures.
Figure 6.21: Comparing identification performance of different L3DF-based similar-
ity measures (without ICP on the UND-F dataset).
6.8 Comparison with Other Approaches
In this section, our detection and recognition approaches are compared with
similar approaches using 3D data and reporting performance on the UND datasets.
Tables 6.4 and 6.5 summarize the comparisons. Matching times are computed on
different machines in different approaches. A dual-processor 2.8 GHz Pentium Xeon,
a 1.8 GHz AMD Opteron and a 3 GHz Pentium 4 are used in [179], [32] and [136]
respectively.
6.8.1 Detection
The ear detection approaches in [136] and [180] are not automatic. The ap-
proaches in [32] and [179] use both color and depth information for detection. The
latter achieves a low detection accuracy (79%) using only color information which
also depends on the accuracy of nose tip and ear pit detection. These detection
approaches are therefore not directly comparable to ours, since we use only grey-scale
information.
Unlike other approaches, we have performed experiments with ear images of
people with ear-phones on and observed that our detection rate does not vary much
if the ear pit is blocked or invisible. Our experiments with synthetic occlusion on
the ear region of interest demonstrate better results than those reported by Arbab-
Zavar and Nixon [8] for vertically increasing occlusions up to 25% of the ear.
Table 6.4: Comparison of the detection approach of this paper with the approaches
of Chen and Bhanu [32] and Yan and Bowyer [179]

Approach                Ear pit    Nose tip   Data type         Dataset         Det. rate   Det. time
                        detected?  detected?                    (#images)       (%)         (sec)
This paper              No         No         Intensity         UND-F (942)     99.9        0.008
                                                                UND-J (830)     99.9
Chen and Bhanu [32]     No         No         Color and depth   UND-F (700)     87.7        N/A
                                                                UCR (902)       99.3        9.48
Yan and Bowyer [179]    Yes        Yes        Color             UND-J (1386)    79          N/A
                                              Depth                             85
                                              Color and depth                   100
Table 6.5: Comparison of the recognition approach of this paper with others on the
UND database

Approach                Manually   Conciseness of the          Features   Rejection    Initial           Feature matching
                        cropped    cropping window             used       classifier   transformation    time
                        images     around the ear                         used?        for ICP
This paper              Nil        A flexible rectangular      Local      Yes          Translation and   L3DF: 0.06 sec
                                   area is cropped                                     rotation          (MATLAB)
Chen and Bhanu [32]     12.3%      A concise rectangular       Local      No           Translation and   LSP: 3.7 sec, H/AH:
                                   area is cropped                                     rotation          1.1 sec (C++)
Yan and Bowyer [179]    Nil        Very concise crop along     Global     No           Translation       N/A
                                   the contour
Passalis et al. [136]   All        Very concise crop around    Global     No           N/A               Less than 1 ms
                                   the concha                                                            (enrolment: 15-30 sec)
Regarding robustness to pose variations, unlike other reported approaches, we
performed experiments separately for profiles with various levels of pose variations
and obtained 100% detection accuracy. The approach of Yan and Bowyer [179]
greatly depends on the accuracy of the ear pit detection which can be affected by
large pose variations.
Also, none of the above approaches reports the time required for ear detection
on the UND dataset (although Chen and Bhanu [32] reports 9.48 sec for detection
on the UCR database). In [179], an active contour algorithm is used iteratively
which increases the computational cost. It also uses skin detection and constraints
in the active contour algorithm for a concise cropping of the 3D ear data. Chen and
Bhanu [32] also use skin detection and edge detection for finding the initial Regions
of Interest (ROI), RANSAC-based Data-Aligned Rigidity-Constrained Exhaustive
Search algorithm (DARCES) for initial alignment and the ICP for fine alignment
of a reference model on to the ROIs and the Thin Plate Spline Transformation for
the final ear shape matching. On the other hand, we use the AdaBoost algorithm,
whose off-line training is expensive but the trained detector is extremely fast.
On the basis of the above, our ear detection approach is faster than all of the
above approaches, while maintaining high accuracy and producing boundaries con-
sistent enough for recognition.
6.8.2 Recognition
In our approach, an extremely fast detector is paired with fast matching using 3D
local features which allows us to extract a minimal rectangular area from a flexibly
cropped rectangular ear region sometimes with other parts of the profile image (e.g.
hair and skin). However, recognition performance depends on how concisely each
ear has been detected, especially when using the ICP algorithm. This is because the
hair and skin around the ear makes the alignment unstable. This explains the high
recognition results in [179] where ICP is used for matching concisely cropped ear
data, with a penalty in time. Similarly, the slightly better result in [32] is explained
by the fine manual cropping of a large proportion (12.3%) of ears which could not
be detected by their automatic detector. Also, in [179, 32] ICP is applied on every
gallery-probe pair whereas our use of L3DF allows us to apply ICP on a subset (best
40) of pairs for identification.
Unlike the approaches in [136], [180] and [179], we and Chen and Bhanu [32]
use local features for recognition. However, the construction of their local features
is quite different from ours and they use these for coarse alignment only and not
for rejecting false matches. We use the local 3D features for the rejection of a large
number of false matches as well as for a coarse alignment of the remaining candidates
prior to the ICP matching. This considerably reduces our computational cost since
we apply ICP on a reduced dataset. However, similar to them, we use both rotation
and translation for the coarse alignment whereas Yan and Bowyer [179] use only
translation (no rotation). For matching local features, Chen and Bhanu [32] use a
geometric constraint similar to ours but without a second round of matching. Our
experiments show that matching is improved by a second round (see Section 6.7).
Although the final matching time in [136] is very low (less than 1 ms per com-
parison), its enrolment and the feature extraction modules using both ICP and
Simulated Annealing are slightly more computationally expensive (30 sec compared
to 22.2 sec in our case). The approach has an option of omitting the deformable
model fitting step which can reduce the enrolment timing to 15 sec, however, with
a penalty of 1% in recognition performance. Unlike our approach, it also requires
that the ear pit is not occluded because the annotated ear model used for fitting the
ear data is based on this area. Moreover, the authors mention that the approach
fails in cases of ears with intricate geometric structures.
6.9 Conclusion
In this paper, a complete and fully automatic approach for human recognition
from 2D and 3D profile images is proposed. Our AdaBoost-based ear detection ap-
proach with three new Haar feature templates and the rectangular detector is very
fast and significantly robust to hair, ear rings and ear-phones. The modified con-
struction and efficient use of local 3D features to find potential matches and feature-
rich areas, and also for coarse alignment prior to the ICP, makes our recognition
approach computationally inexpensive and significantly robust to occlusions and
pose variations. Using two-stage feature matching with geometric consistency mea-
sures significantly improved the matching performance. Unlike other approaches,
the performance of our system does not rely on any assumption about the localiza-
tion of the nose or the ear pit. The speed of the recognition can be improved further
by implementing the algorithms in C/C++ and using faster techniques for feature
matching, such as geometric hashing, as well as faster variants of ICP.
Appendix: Parameter Selection
The parameters used in the implementation of our detection, feature extraction
and matching algorithms are listed below with a short description of the effect of
the variation of their values. All the values are determined empirically using the
training data.

Detection Related Parameters:
1. Target false positive rate (Ft): We chose Ft = 0.001. Increasing its value will
decrease the number of stages of the cascade for ear detection, but will reduce
the accuracy.
2. Target detection rate (D): We chose D = 0.98. Increasing its value will
increase the number of stages in the cascade or the minimum detection rate
per stage which will necessitate more features to be included in each stage of
the cascade.
3. Minimum overlap (minOv): We used minOv = 0.1 in our multi-detection
integration algorithm. Using a higher value for this parameter may result in
multiple detections near the ear.
Feature Extraction Related Parameters:
1. Inner radius of the feature sphere (r1): It is the radius of the sphere within
which data points are used to construct the 3D local features. For a larger value
of r1, a feature becomes more global and hence, more descriptive. However,
locality is also important to increase robustness to occlusion. Its value is
chosen relative to the average ear size. We tested with values 10, 15, 20 and
30 mm and the best results were obtained with r1 = 15 mm.
2. Boundary limit (r1 + x): It is a function of the inner radius and is used to
avoid the boundary effect. We chose x = 10 mm. A higher value of x will
reduce the number of seed points and hence the keypoints. A lower value may
result in having some keypoints outside the reliable and feature-rich area of
the ear, likely including hair.
3. Threshold for choosing the keypoints (t1): We chose t1 = 2 to have around 200
significant features in most cases. The higher the value of t1, the fewer features
we get. However, lowering t1 can result in the selection of less significant
feature points. For example, t1 = 0 will allow constructing a feature from a
completely planar or spherical surface.
4. Seed resolution (rs): It defines the minimum spacing between candidate seed
points. In our experiments we chose rs = 2 mm.
5. Number of features per ear (nf ): This parameter determines the maximum
number of features to be created per ear. We chose nf = 200. A higher value
increases the number of feature points obtained, resulting in higher computational
cost; however, recognition becomes unreliable for candidates having too few features.
Matching Related Parameters:
1. The threshold limiting the distance between feature locations (th1): It controls
the number of matches to be discarded. The higher its value, the more (but
less significant) matches will be included; a smaller value will reduce the
number of matches. In our experiments, th1 = 45 mm provided the best results.
2. The distance multiplier (κ): This parameter is part of the threshold to deter-
mine the distance consistency. We empirically determine its value as 0.1. A
higher value will allow less consistent matches to be used in constructing the
rejection classifier.
3. The threshold for rotation consistency (th2): We chose th2 = 10◦. A higher
value will allow considering matches having higher rotation variations in the
calculation of rotation consistency. However, smaller values may discard po-
tentially correct matches.
4. The minimum number of matches (m): This parameter limits the number of
gallery candidates having enough matching features with a probe. We chose
m = 10. A higher value may discard potential matches while a lower value
would not allow the keypoint distance measure computation to be performed.
Acknowledgments
This research is sponsored by the Australian Research Council (ARC) grant
DP0664228. The authors acknowledge the use of the UND, the NIST, the XM2VTSDB,
the UMIST, the USTB and the MIT-CBCL face databases for ear detection and the
UND profile and the UCR ear databases for ear recognition. They would also like to
acknowledge Mitsubishi Electric Research Laboratories, Inc., Jones and Viola for the
permission to use their rectangular filters and D’Errico for the surface fitting code.
They would like to thank R. Owens and W. Snyder for their helpful discussions and
K. M. Tracey, J. Wan and A. Chew for their technical assistance.
CHAPTER 7
Fusion of 3D Ear and Face Biometrics for Robust
Human Recognition
Abstract
The vulnerability of unimodal biometric systems in the presence of noise or occlu-
sions and variations of pose, scale and illumination motivates the use of multimodal
biometrics. However, selecting appropriate biometric modalities and fusion tech-
niques is still very challenging. The ear and the face are highly attractive biometric
modalities for fusion because of their feature-rich physiological structure, physical
proximity and non-intrusive acquisition. In this paper, local 3D features are auto-
matically extracted from both modalities for representation, and fusion is performed
at score and feature levels. Fusion of L3DF-based and ICP-based matching scores
from the two modalities using a weighted sum-rule significantly outperforms cur-
rent unimodal approaches especially for non-neutral facial expressions. To the best
of our knowledge, this paper is the first to propose fusing 3D features extracted
from 3D ear and the frontal face data. The proposed score-level fusion technique
achieves rank-1 identification and verification (at 0.001 FAR) rates of 99.4% and
99.7% respectively with neutral facial expression and 96.8% and 97.1% respectively
with non-neutral facial expressions on the largest multimodal dataset using FRGC
v.2 and UND databases.
Keywords
3D ear, 3D face, local 3D surface features, multimodal recognition, score-level
fusion, feature-level fusion.
7.1 Introduction
Traditional identity card and password based systems are increasingly exploited
via theft or fakery [43, 88]. Biometric recognition using human physiological (such
as face, fingerprint, palm-print and iris) or behavioral (e.g. handwriting, gait and
voice) characteristics is comparatively more robust to such frauds. As it is relatively
difficult to spoof multiple biometrics simultaneously, multibiometric systems have
This article is under review in IEEE Transactions on Pattern Analysis and Machine Intelligence, June 2010.
recently been proposed where a decision is made based on a combination of different
subsets of biometrics.
In terms of acceptability, the face is considered the most promising biometric
trait. Face data can be collected easily and non-intrusively. The face is also rich
in distinct features. Using 3D or a combination of 2D and 3D face images, very
high recognition rates have been obtained for faces with neutral expression [117].
However, changes in facial expression significantly change the facial geometry [99]
reducing the effectiveness of face recognition algorithms. Occlusions caused by hair
or ornaments introduce an additional challenge. To reduce the impact of expression
variations and occlusions on the recognition performance, researchers have proposed
the integration of face data with other biometric modalities such as fingerprints [158],
palm prints [56], hand geometry [144], gait [194], iris [164], voice [19] and most re-
cently the ear. Among these alternatives, the ear has the advantage that it is
co-located with the face and hence, respective data can easily be collected (with the
same or similar sensor). Moreover, ear shape does not change with expressions and
with ageing from 8 years to 70 years [72]. Another advantage of this choice of modal-
ities is that the ear and face data have a very low correlation, which is a desirable
criterion for any fusion approach. Theoharis et al. [152] computed a correlation of
only 0.16 between an ear and a face image using the Pearson correlation coefficient.
They also illustrated that the ear-face fusion curve can reach a 100% recognition
rate before rank-15 whereas none of the modalities reached 100% accuracy before
rank 20. This implies that the instances of failure to identify a subject using the two
modalities are uncorrelated, and therefore, one modality can generally compensate
for any shortcoming of the other.
Ear and face biometrics can be fused at different levels of the recognition pro-
cess [84]. In this paper, we propose score and feature-level fusion approaches. We
detect the face region of interest from the frontal face images based on the position
of the nose tip [114]. To detect the ear from the profile images, we use our previously
developed ear detection technique using AdaBoost [74] (Section 6.3 of this disserta-
tion). For score-level fusion, following a normalization step, face and ear local 3D
features (L3DFs) are extracted and matched separately. Matching scores from the
two modalities are then fused according to a weighted sum rule. For feature-level
fusion, after extracting L3DFs from the ear and face data, we concatenate them
based on their local shape similarity. We then match the fused features of the probe
and the gallery according to fused feature distance and using geometric consistency
measures. The performance of the proposed approaches is evaluated on the largest
Table 7.1: Multi-biometric approaches with 2D and 3D ear and face data

Category        Source                         Data Type and                   Algorithm                              Id. Rate
                                               Database Size
Score-level     Yan, 2006 [175]                3D images from 174 subjects     Using ICP with sum and interval        100%
fusion                                                                         fusion rules on multi-instance
                                                                               gallery and probe images
                Woodard et al., 2006 [166]     3D images from 85 subjects      ICP and RMS                            97%
                Theoharis et al., 2008 [152]   3D images from 324 subjects     Annotated model fitting and using      99.7%
                                                                               ICP, SA and wavelet analysis
Feature-level   Xu et al., 2007 [171]          2D images from 79 subjects      KFDA                                   96.8%
fusion          Pan et al., 2007 [133]         2D profile images from 38       CCA, PCA                               97.4%
                                               subjects
                Xu and Mu, 2007 [172]          190 2D images from 38           KCCA                                   98.7%
                                               subjects
ear-face dataset available, comprising 326 gallery images, 311 probes with neutral facial expression and 315 probes with non-neutral facial expressions. All images are
taken from the FRGC v.2 [141] and the University of Notre Dame (UND) [179]
Biometric databases and there is only one instance per subject in both the gallery
and the probe dataset. We achieve an identification rate of 99.0% and a verification
rate of 99.7% at 0.001 False Acceptance Rate (FAR) for score-level fusion of the ear
with neutral face biometrics. Corresponding results with non-neutral expressions
are 96.8% and 97.1% respectively. We obtain slightly lower identification and verification rates of 98.4% and 99.0% respectively for feature-level fusion of
the ear with neutral face data. With non-neutral expressions of face data, we ob-
tain 94.9% and 96.8% identification and verification rates respectively. Both fusion
approaches are efficient and fully automatic.
The remainder of the paper is organized as follows. Related work, motivations
and contributions of this paper are described in the next section. The data acquisi-
tion and feature extraction techniques used in the paper are described in Section 7.3.
The proposed approaches for score-level and feature-level fusion are described in Sec-
tions 7.4 and 7.5 respectively. Results are reported and discussed in Sections 7.6
and 7.7. The proposed fusion approaches are compared with each other and with other approaches in Section 7.8. A conclusion is provided in Section 7.9.
7.2 Related Work and Contributions
7.2.1 Related Work
Multimodal recognition with ear and face is a very recent research trend. Only
very few approaches have been proposed using different levels of fusion [78] (See
Chapter 2 of this dissertation for details). Some of the most relevant approaches
using score and feature levels of fusion are summarized in Table 7.1 and discussed
below.
Approaches with Score-level Fusion: In score-level fusion, matching scores from
different modalities are combined to make the recognition decision. Different fusion
rules have been proposed. Kittler et al. [94] and Jain et al. [84] empirically demon-
strated that the sum rule provides better results than other score fusion rules in a
number of cases. Recently, Luciano and Krzyzak [108] demonstrated better results
for the weighted sum rule.
There are only a few 2D approaches using score-level fusion of the ear and face biometrics, including the works of Luciano and Krzyzak [108] and Abate et al. [2]. However, to the best of
our knowledge, there are only three approaches using 3D data for score-level fusion
of these two modalities. Considering their relevance to the research in this paper,
only 3D approaches are discussed below.
Yan [175] combined ear and face at score-level using sum and interval fusion
rules. On a dataset of 174 subjects, each with two ear shapes and two face shapes
in the gallery and the probe datasets, they obtained rank-one recognition rates of
93.1%, 97.7% and 100% for the ear, the face and the fusion respectively [118].
Woodard et al. [166] used a score-level fusion technique for combining 3D face
scores with ear and finger scores. Using the Iterative Closest Point (ICP) algorithm,
they obtained 97% rank-one recognition rate on a small database of 85 subjects.
Theoharis et al. [152] proposed extracting geometry images from 3D face and
ear modalities and then fitting annotated ear and face models through an ICP and
simulated annealing based registration process. They applied the wavelet transform
to the extracted images to get individual feature vectors. The distance between
the feature vectors of the gallery and the probe was weighted and then summed
for fusion. Although they combined features before matching, we classify their ap-
proach as score-level fusion because they directly applied the L1 distance metric for
matching the fused features, which is equivalent to matching the individual modal-
ity separately and hence would produce the same results as score-level fusion. In a
multimodal database composed of 324 gallery and the same number of probe images
(all collected from the FRGC v2 and the UND databases), a rank-one identification
rate of 99.7% was reported. The probe dataset for this experiment contained some
images with non-neutral facial expressions but most of them were neutral.
Approaches with Feature-level Fusion: In feature-level fusion, extracted features
from different modalities are combined prior to matching. We are not aware of
any 3D approaches using this fusion technique and only the following three 2D
approaches are found in the literature.
Xu et al. [171] proposed feature-level fusion of 2D profile face and the ear biomet-
rics. They used Kernel Fisher Discriminant Analysis (KFDA) to extract features
and tested their approach on a dataset of 237 gallery and 474 probe 2D profile
images from 79 subjects (all taken from the University of Science and Technology
Beijing (USTB) database). They included three 2D images (with the rotation of -5,
0 and +5 degrees around the vertical axis) per person in the gallery and six images
per person in the probe dataset. They obtained 96.8% recognition accuracy while
using a minimum-distance classifier and the weighted sum rule (with weights 0.55
and 0.45 for face and ear respectively).
In their feature-level fusion of profile face and ear, Pan et al. [133] used Canonical
Correlation Analysis (CCA) to extract features. They reduced the dimension of the
associated feature vector using PCA and used a minimum-distance based classifier.
They built a multi-instance dataset from the USTB database taking three instances
per gallery and two instances per probe for each of the 38 subjects. With this
dataset, they obtained an identification rate of 97.4% using fused feature vector of
dimension 50. The same research group improved the identification rate to 98.7%
using Kernel Canonical Correlation Analysis (KCCA) on the same data set [172].
7.2.2 Motivations and Contributions
As reported in the literature review (Section 7.2.1), most of the score-level fusion
techniques and all of the feature-level fusion techniques use only 2D data. However,
2D data are severely affected by changes in illumination, scale and pose variations
that are common for public applications. Therefore, in this paper, we propose to
use 3D data, which are commonly less sensitive to such variations.
Occlusions and deformations are also very common in non-intrusive applications
of ear-face biometrics. Most of the current ear-face multimodal approaches use
global features which are affected by these variations. In this work, we use local 3D
features (L3DFs) to represent both the ear and face data. L3DFs were first proposed
by our research group [114] and used for object retrieval and face recognition [117].
These features exhibit a high level of repeatability and are very fast to compute.
Mian et al. [114] reported 23 matches per second using MATLAB on a 3.2 GHz
Pentium IV machine.
In comparison to other levels of fusion, score-level fusion has many benefits with
respect to implementation and computation. It involves the processing of less data and is consequently a faster and easier way to recognize people [84].
Feature-level fusion, on the other hand, may preserve more discriminating features
prior to matching and is intuitively believed to provide better accuracy. However,
this has not been experimentally proved in the literature. Therefore, we propose to
apply both fusion techniques using the same data and similarly extracted features
to compare their recognition performance.
It is also interesting to note that all the existing feature-level fusion approaches
use profile images for both the ear and face biometrics. Although it is easy and
cost effective to use a single image for both modalities, a frontal face possesses
more discriminating features than a profile face and hence may result in a better
recognition accuracy. To explore this, we propose to extract ear features from the
profile image and face features from the frontal face images.
Our approaches are fully automatic all the way from data acquisition to recog-
nition. We use our previously developed tools [83] to automatically extract a short
list of candidates and select a minimal rectangular region from the whole dataset.
Thus, we can minimize the cost when using ICP for improved accuracy, especially
in the case of face images with non-neutral expressions.
The specific contributions of this paper are as follows:
1. Two complete ear-face multimodal recognition systems are proposed, based on efficient local features that are extracted fully automatically.
2. To the best of our knowledge, this is the first paper to present a feature-level
fusion approach using 3D ear and face features extracted from the profile and
frontal face images respectively.
3. The same type of local 3D features are used to represent both the ear and
face data in both score and feature levels of fusion, which permits fair
comparisons of the two biometric traits as well as the fusion techniques.
4. Experiments are performed on the largest possible multimodal dataset using
publicly available profile (the UND) and frontal (the FRGC v2) face databases.
5. The performance of the face biometrics especially with non-neutral expressions
improves significantly when fused with the ear biometrics using the proposed
score-level fusion technique.
6. The proposed feature-level fusion approach achieves an accuracy comparable
to that of the score-level fusion of the ear and face data without requiring
ICP-like expensive algorithms in matching.
7.3 Data Acquisition and Feature Extraction
7.3.1 Dataset
To perform our experiments on both the ear and face data, we create a multi-
modal dataset comprising data from Collection-J of the UND Profile face database [179]
and the Fall2003 and Spring2004 datasets of the FRGC v.2 frontal face database [141].
The UND database has images from 415 individuals and the FRGC v.2 database
has images from 466 individuals. However, only 326 images in the gallery of the
UND database are available in the list of the images with neutral expression in the
FRGC v.2 database. Similarly, the number of probe face images with neutral and
non-neutral expressions in the FRGC v.2 database which are also available in the
probe images of the UND database are 311 and 315 respectively. Thus, our multi-
modal dataset includes 326 gallery images and 311 probes with neutral expressions
and 315 probes with non-neutral expressions. To the best of our knowledge, this is
the largest publicly available ear-face database.
7.3.2 Ear and Face Data Extraction
The ear region is detected from 2D profile images using the AdaBoost based
detector described in our previous work [74] (described fully in Section 6.3 of this
dissertation). This detector is chosen because it is very accurate, fast and fully
automatic. A detection accuracy of 99.9% is obtained on the UND profile face
database with 942 images of 302 subjects. The corresponding 3D data are then
extracted from the co-registered 3D profile data as described in [75]. As shown in
Fig. 7.1, a rectangular area of data points around the ear is extracted from the profile, which sometimes includes portions of the hair and the face. Therefore,
the extracted data are normalized and then uniformly sampled on a grid of 132 mm
by 106 mm.
The face region is detected from the 3D frontal face image based on the position
of the nose tip as described in [114]. Face data are also normalized and sampled on
a uniform grid of 160 mm by 160 mm.
Figure 7.1: Data acquisition using the Konica Minolta Vivid 910 scanner: (a) 2D and 3D profile images captured to extract the 3D ear data (the ear is detected in the 2D view and the corresponding 3D data are extracted from the range image using the 2D co-ordinates); (b) the frontal face is scanned and the 3D face data are extracted from the range image.
7.3.3 Feature Extraction
Local 3D features are extracted from 3D ear and face data. A number of distinc-
tive 3D feature point locations (keypoints) are automatically selected on the 3D ear
and 3D face region based on the asymmetrical variations in depth around them. The
variation is determined by the difference between the first two eigenvalues (centered
on the keypoints) following [117]. The number and locations of the keypoints are
found to be different for the ear and the face images of different individuals and
hence can be used as a digital signature for biometric purposes. It is also observed
that these keypoints exhibit a high degree of repeatability in different images of the
same individual [117, 83].
A spherical area of radius R is cropped around each selected keypoint and aligned on its principal axes. A uniformly sampled (1 mm resolution) 3D surface on a 30 × 30 lattice is then approximated from the cropped data points using D'Errico's surface fitting code [47]. In order to avoid boundary effects, an inner 20 × 20 lattice is cropped from the larger surface and converted into a 400 dimensional feature vector for the corresponding keypoint. An example of an extracted ear feature is illustrated in Fig. 7.2. More details can be found in [117, 83].
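A minimal Python/NumPy sketch of this extraction pipeline follows, under stated assumptions: the radius R and the keypoint threshold t are placeholders, and scipy's griddata stands in for D'Errico's gridfit [47] (the thesis implementation was in MATLAB).

```python
# Sketch of L3DF extraction: keypoint test by eigenvalue asymmetry, then
# principal-axis alignment and surface fitting of a cropped sphere.
import numpy as np
from scipy.interpolate import griddata

def is_keypoint(points, center, R=20.0, t=2.0):
    """Select a point if the depth variation around it is sufficiently
    asymmetric: the difference of the two largest eigenvalues of the local
    covariance matrix exceeds a threshold (R and t are assumptions)."""
    nbhd = points[np.linalg.norm(points - center, axis=1) <= R]
    ev = np.linalg.eigvalsh(np.cov(nbhd.T))[::-1]   # eigenvalues, descending
    return (ev[0] - ev[1]) > t

def extract_l3df(points, keypoint, R=20.0):
    """Crop a sphere of radius R, align it on its principal axes, fit a 30x30
    surface at 1 mm resolution, keep the inner 20x20 lattice (400-D vector)."""
    crop = points[np.linalg.norm(points - keypoint, axis=1) <= R]
    crop = crop - crop.mean(axis=0)
    _, _, vt = np.linalg.svd(crop, full_matrices=False)
    aligned = crop @ vt.T                               # principal-axis frame
    gx, gy = np.meshgrid(np.arange(-15, 15), np.arange(-15, 15))  # 30x30 grid
    surf = griddata(aligned[:, :2], aligned[:, 2], (gx, gy), method='linear')
    inner = surf[5:25, 5:25]                            # avoid boundary effects
    return np.nan_to_num(inner).ravel()                 # 400-dimensional L3DF
```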
Figure 7.2: Example of an extracted 3D local surface feature [83] (left: location of a local 3D feature on the range image of an ear; right: the extracted 3D local surface feature).
7.4 Unimodal Matching and Score-level Fusion
The main steps in our multimodal recognition system with score-level fusion are
shown in the block diagram of Fig. 7.4. Each of the components is described in this
section.
7.4.1 Matching Technique
Ear images in the gallery and the probe datasets are matched using an L3DF-based coarse matching technique followed by ICP-based fine matching. A preliminary
version of the approaches can be found in our previous works [83, 80] on ear recog-
nition. We adopt the same approaches with the same parameters for matching of
the face data. The approaches are described briefly as follows.
L3DF-based Matching At first, the Root Mean Square (RMS) distances between a probe feature and all the gallery features are computed. Gallery features whose keypoint locations are more than a predefined distance threshold λ1 = 45 mm away from that of the probe feature are discarded. The remaining gallery features are sorted according
to their RMS distance from the probe feature and the one with minimum distance
is paired with the probe feature. If multiple probe features match the same gallery
feature, the best match is retained. Then for a match with locations $(p_i, g_i)$, we count how many other match locations $(p_j, g_j)$ satisfy the following inequality:

\[ \bigl|\,\|p_i - p_j\| - \|g_i - g_j\|\,\bigr| \;<\; r_s + \kappa \sqrt{\|p_i - p_j\|} \tag{7.1} \]
where $r_s$ is the seed resolution and the value of $\kappa$ is empirically chosen as 0.1. We compute the proportion of distance-consistent matches (ρd) as the ratio of the maximum count to the total number of matches. In the second level, we repeat the feature matching, allowing only those matches that are consistent
Figure 7.3: Feature correspondences established between a gallery and a probe ear
after matching is performed. The image of the probe ear (right image) is flipped for
better visibility (best seen in color).
Figure 7.4: Block diagram of the proposed multimodal recognition system with score-level fusion: face and ear regions are detected from the frontal and profile face images, face and ear L3DFs are extracted and matched against their respective gallery databases, candidates are selected and coarsely aligned using feature correspondences, faces and ears are finely matched with ICP, and the matching scores are fused to produce the recognition result.
with the most consistent match. We compute the mean feature distance (σs) of the retained matches. An example of feature correspondence between a probe and the corresponding gallery ear is shown in Fig. 7.3. In the second level of matching, we also find the underlying rotation between each matched gallery-probe feature pair and compute the proportion of consistent rotations (ρr) as the ratio of the number of occurrences of the most frequent rotation (within a threshold λ2 = 10°) to the total number of matches. Finally, we compute the keypoint distance measure (σk) by applying ICP to the keypoints of the matched features only.
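As a concrete illustration of the distance-consistency measure of Eqn. (7.1), the following Python sketch counts, for each match, how many other matches preserve pairwise keypoint distances and returns ρd together with the most consistent match used to constrain the second matching stage. The seed-resolution value here is an assumption.

```python
import numpy as np

def distance_consistency(probe_pts, gallery_pts, rs=2.0, kappa=0.1):
    """probe_pts, gallery_pts: (n, 3) keypoint locations of the n matched
    feature pairs. rs is the seed resolution (the value is illustrative);
    kappa = 0.1 as chosen empirically in the text."""
    n = len(probe_pts)
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        dp = np.linalg.norm(probe_pts - probe_pts[i], axis=1)      # |p_i - p_j|
        dg = np.linalg.norm(gallery_pts - gallery_pts[i], axis=1)  # |g_i - g_j|
        ok = np.abs(dp - dg) < rs + kappa * np.sqrt(dp)            # Eqn. (7.1)
        counts[i] = ok.sum() - 1                 # exclude the pair itself
    most_consistent = int(np.argmax(counts))     # the reference match T
    rho_d = counts[most_consistent] / n          # proportion of consistent matches
    return rho_d, most_consistent
```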
Candidate Selection and ICP-based Matching We select the best 40 gallery candidates sorted according to the feature-based matching scores described above. We also use the most consistent translation and rotation found in the feature-based matching for a coarse alignment of the gallery and probe ear (or face) data. In order to align them finely, we employ a modified version of the Iterative Closest Point (ICP) algorithm [114] on a minimal rectangular area of the data containing only the matching features. We use the ICP error as a similarity measure (σi).
Final Similarity Measure and Score Normalization In order to make the final
matching decision based on the fusion of ear and face modalities, we combine the
following scores: (i) the mean feature distance of the matches retained by the sec-
ond round (σs), (ii) the proportion of distance-consistent matches (ρd), (iii) the
proportion of consistent rotations from the second round (ρr) and (iv) the ICP error
(σi).
We normalize the above scores on a 0 to 1 scale using the min-max rule. A
weight factor (ω) is then computed as the ratio between the minimum and second
minimum values, taken relative to the mean value of the similarity measure for that
probe. The final score (ηx) for modality x (here ear and face) is then computed
by summing the products of the scores and the corresponding weights (confidence
weighted sum rule) [117] as shown in Eqn. (7.2).
\[ \eta_x = \omega_s \sigma_s + \omega_d (1 - \rho_d) + \omega_r (1 - \rho_r) + \omega_i \sigma_i \tag{7.2} \]
7.4.2 Fusion of Scores
The matching scores from the ear and face data (ηe and ηf respectively) are
fused using a weighted sum rule (similar to the one used in Eqn. (7.2)) as follows:

\[ \varepsilon_s = \omega_e \eta_e + \omega_f \eta_f \tag{7.3} \]
In Eqn. (7.3), ωe and ωf are weights for the ear and the face modalities respec-
tively.
Since L3DFs are more distinctive and reliable for the face data than for the ear data, we multiply the confidence weights (computed above) by complementary modality weights during fusion. We empirically found that allocating nearly double weight to the face scores provides better results (see Section 7.6.1).
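A minimal sketch pulling this section together is given below: min-max normalization, one plausible reading of the confidence weight (the prose above is loose, so the exact formula is an assumption), the per-modality score of Eqn. (7.2), and the fusion rule of Eqn. (7.3) with the complementary weights of Section 7.6.1.

```python
import numpy as np

def minmax(x):
    """Normalize a vector of similarity scores to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def confidence_weight(scores):
    """Assumed form of the weight factor: ratio of the second-minimum to the
    minimum score, taken relative to the mean score for this probe."""
    s = np.sort(np.asarray(scores, dtype=float))
    return (s[1] / max(s[0], 1e-12)) / s.mean()

def modality_score(sigma_s, rho_d, rho_r, sigma_i, w_s, w_d, w_r, w_i):
    """Eqn. (7.2): confidence-weighted sum of the four normalized measures."""
    return w_s * sigma_s + w_d * (1 - rho_d) + w_r * (1 - rho_r) + w_i * sigma_i

def fused_score(eta_ear, eta_face, w_ear=0.35, w_face=0.65):
    """Eqn. (7.3): weighted sum of the ear and face scores (neutral-expression
    weights; 0.45/0.55 performed best for non-neutral expressions)."""
    return w_ear * eta_ear + w_face * eta_face
```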
7.5 Feature-Level Fusion
In this section, we describe our proposed approach for fusing local 3D features
obtained from ear and face data. Fig. 7.5 shows a block diagram of the approach.
Figure 7.5: Block diagram of the proposed multimodal recognition system with feature-level fusion: ear L3DFs extracted from the 2D and 3D profile face images and face L3DFs extracted from the 3D frontal face images are fused; during off-line enrolment the fused features populate the gallery database, and at recognition time the fused probe features are matched against it.
Figure 7.6: Correspondences between the features extracted from the frontal face and those from the ear of the same person. These correspondences are established using Algorithm 1. Only the first 40 correspondences are shown for better visibility (best seen in color).
7.5.1 Fusion of L3DFs from Ear and Face
Although the global shapes of the ear and the face are different, there exists some
similarity in their local shapes (e.g. curves, ridges and holes). We utilize these sim-
ilarities in establishing correspondences between ear and face local features in order
to fuse these two modalities. Fig. 7.6 shows an example of such correspondences.
For each ear feature, we compute the RMS distance between that feature and all
the face features of the same individual and sort them according to that distance.
The face feature with minimum distance is paired up with the corresponding ear
feature. We do the same for all ear features. Fusion can also be performed as a face-ear combination by pairing each face feature with the most similar ear feature. We consider both combinations because the numbers of ear and face features are not always the same, and one ear or face feature sometimes corresponds to multiple similar features. However, for uniformity of representation, we keep the ear features on the left and the corresponding face features on the right side of the fused feature vector.
As illustrated in Fig. 7.7, we also keep the two combinations as two halves of the
fused feature-set. The cumulative percentage of repeatability of the fused features
in different images of the same person is reported in Fig. 7.8 for ten subjects.
Each entry of the fused feature-set has 800 columns, as each ear or face feature vector has a dimension of 400. The number of rows of the feature-set depends on the number of ear features (n) and the number of face features (m). In order to reduce the feature dimension, we apply PCA to the fused feature vectors. The number of selected PCA components was empirically chosen as 10 (see Section 7.7.3).
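A Python sketch of this fusion step is given below, under stated assumptions: scikit-learn's PCA stands in for the thesis's (unspecified) PCA code, and only the ear-face half is shown, the face-ear half being built symmetrically.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_ear_face(ear_feats, face_feats):
    """ear_feats: (n, 400); face_feats: (m, 400). Returns the (n, 800)
    ear-face half of the fused feature-set: each ear feature is concatenated
    with its minimum-RMS-distance face feature (ear left, face right)."""
    fused = []
    for e in ear_feats:
        rms = np.sqrt(((face_feats - e) ** 2).mean(axis=1))  # RMS distances
        fused.append(np.concatenate([e, face_feats[rms.argmin()]]))
    return np.asarray(fused)

# Reduction of the 800-D fused vectors to 10 PCA components (Section 7.7.3):
# reduced = PCA(n_components=10).fit_transform(fuse_ear_face(E, F))
```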
7.5.2 Matching the Fused Features
In order to match a fused probe feature vector with a fused gallery feature
vector, we apply a two-level matching technique similar to the one used for the ear
or face features in the case of score-level fusion (see Section 7.4.1). However, the
similarity measures are now computed with the constraint that the thresholds or
limits are satisfied for both the ear and face features. For example, during the first stage of matching, we only allow fused gallery vectors whose ear and face features are both within a distance threshold λ1 of the corresponding ear and
face features in the probe vector. As described in Algorithm 1, we use five different
similarity measures: mean feature distance of the first and the second stage (δ1
and δ2 respectively), proportion of consistent rotations of the second stage (δ3),
proportion of distance-consistent matches (δ4) and keypoint distance measure (δ5).
We perform the above feature-based matching separately for both combinations
Algorithm 1. Matching a multimodal probe with a multimodal gallery.

0. (Input) Given a multimodal (ear-face or face-ear combination) probe, a multimodal gallery, the distance threshold λ1, seed resolution rs, distance constant κ, angle threshold λ2 and minimum number of matches m.
1. (Distance check) For each feature of the probe and all features of a gallery:
   (i) Discard gallery features whose keypoint location is more than λ1 away from that of the probe feature, for both the ear and face features.
   (ii) Pair the probe feature with the closest gallery feature, by both ear and face feature distance.
   (iii) Count the number of matches (nt) and discard the gallery if there are fewer than m feature pairs.
   (iv) Compute the mean feature distance of the matching feature pairs (δ1).
2. (Distance consistency check)
   (i) For each matching feature pair, find the number (nd) of other matches that satisfy Eqn. (7.1).
   (ii) Find the match with the maximum nd and record it as T.
   (iii) Compute the proportion of distance-consistent matches (δ4 = max(nd)/nt).
3. (Second stage of matching)
   (i) For all the gallery features, repeat Step 1 but do not allow matches which are inconsistent with T.
   (ii) Compute the mean feature distance of the matching feature pairs (δ2).
   (iii) Discard the gallery if there are fewer than m feature pairs.
4. (Rotation consistency measure)
   (i) For each of the selected matching pairs, find the number (nr) of other matches whose rotation angles are within λ2.
   (ii) Compute the proportion of consistent rotations (δ3 = max(nr)/nt).
5. (Keypoint distance measure)
   (i) Align the keypoints of the matching probe features to those of the corresponding gallery features using ICP.
   (ii) Record the ICP error as the keypoint distance measure (δ5).
6. (Output) Return the similarity measures δ1, δ2, δ3, δ4, δ5.
Figure 7.7: Block diagram of the feature-set constructed by fusion of ear and face L3DFs: the ear-face combination pairs each ear feature E_1 ... E_n with its closest face feature F_c1 ... F_ck, and the face-ear combination pairs each face feature F_1 ... F_m with its closest ear feature E_c1 ... E_ck (subscript ck indicates the index of the closest feature with respect to its left or right feature for the ear-face and face-ear combination respectively).
Figure 7.8: Repeatability of fused features in the gallery and probe images of ten individuals (cumulative percentage of repeatability versus nearest neighbour error in mm).
Figure 7.9: Identification rate versus rank for score-level fusion of ear and face (without using ICP scores; curves: ear, face, and score-level fusion): (a) with neutral expression; (b) with non-neutral expressions.
(Section 7.5.1) of the fused feature-set. We then compute the mean of the first two
similarity measures resulting from both combinations and retain the other measures for the final score computation. All the similarity measures are normalized on a scale of 0 to 1, and the corresponding weighting factors ηk (where k = 1 to 5) are computed following an approach similar to the one used in Section 7.4.1. The
final similarity measure (εf ) is computed as a weighted sum of all these normalized
similarity measures with double weights to those obtained from the fusion with
respect to the face feature, as given by the following equation:

\[ \varepsilon_f = \eta_1 \delta_1 + 2\eta_2 \delta_2 + \eta_{3e} \delta_{3e} + 2\eta_{3f} \delta_{3f} + \eta_{4e} \delta_{4e} + 2\eta_{4f} \delta_{4f} + \eta_{5e} \delta_{5e} + 2\eta_{5f} \delta_{5f} \]
where subscripts 'e' and 'f' indicate that the corresponding similarity measure is computed from the ear-face and the face-ear combination of the features respectively.
Table 7.2: Summary of score-level fusion results (rates in %; Geo. Cons. = geometric consistency). Similarity measures: (1) Ear: L3DF-based without Geo. Cons.; (2) Face: L3DF-based without Geo. Cons.; (3) Ear: L3DF-based with Geo. Cons.; (4) Face: L3DF-based with Geo. Cons.; (5) Ear: L3DF-based with Geo. Cons. + ICP; (6) Face: L3DF-based with Geo. Cons. + ICP.

Expression    Measure      (1)    (2)    (3)    (4)    (5)    (6)    (1)+(2)  (3)+(4)  (5)+(6)
Non-neutral   Id. Rate     78.1   80.0   79.4   84.8   88.6   83.2    94.9     95.2     96.8
Non-neutral   Ver. Rate    77.5   82.9   79.1   84.8   87.6   83.5    94.6     96.2     97.1
Neutral       Id. Rate     78.5   95.5   80.1   96.8   90.4   97.4    99.4     99.0     98.4
Neutral       Ver. Rate    78.1   98.4   72.4   98.1   89.7   97.8    99.7     99.4     99.4
7.6 Performance of the Score-level Fusion Approach
The recognition performance of our proposed score-level fusion approach is eval-
uated in this section. Results of fusing L3DFs scores with or without the inclusion
Figure 7.10: ROC curves (verification rate versus false acceptance rate, log scale) for score-level fusion of ear and face (without using ICP scores; curves: ear, face, and score-level fusion): (a) with neutral expression; (b) with non-neutral expressions.
of the ICP score as a similarity measure are shown separately to demonstrate their
individual contributions. Results are summarized in Table 7.2 and discussed below.
7.6.1 Results Using L3DF-Based Measures Only
Identification: Using L3DF-based measures including geometric consistency checks, we obtain rank-1 identification rates of 80.1% for the ear and 96.8% for the face (rank-n means the right answer is in the top n matches) on the database described in Section 7.3.1. The score-level fusion of these two modalities improves the overall performance to 99.0% rank-1 identification accuracy. The results are illustrated in Fig. 7.9(a).
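For completeness, the identification and verification measures reported in this chapter can be computed from a probe-by-gallery matrix of final similarity scores as in the following generic sketch; this is the standard evaluation protocol, not code from the thesis, and lower scores mean better matches.

```python
import numpy as np

def rank_n_rate(scores, probe_ids, gallery_ids, n=1):
    """Rank-n identification rate from a (probes x gallery) score matrix."""
    order = np.argsort(scores, axis=1)[:, :n]           # n best gallery entries
    top_ids = np.asarray(gallery_ids)[order]
    return float(np.mean([p in row for p, row in zip(probe_ids, top_ids)]))

def verification_rate(genuine, impostor, far=0.001):
    """Verification rate at a given FAR: the acceptance threshold is set so
    that the stated fraction of impostor scores is (wrongly) accepted."""
    thr = np.quantile(impostor, far)                    # lower score = accept
    return float(np.mean(np.asarray(genuine) <= thr))
```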
We also perform experiments with a gallery of neutral faces and probes of face
images with different expressions such as happy, sad or angry. For the database
mentioned above, we obtain rank-1 identification rates of 79.4%, 84.8% and 95.2%
for the ear, the face and their score-level fusion respectively (see Fig. 7.9(b)).
The variation of the recognition rates for different combinations of ear and face
complementary weights (See Section 7.4.2) used for fusion with a weighted sum rule
on data with neutral facial expression is illustrated in Fig. 7.11. From the plot we
can see that the recognition rate reaches a peak for ear and face weights of 0.35
and 0.65 respectively. It then declines as greater weighting is given to the face. For
data with non-neutral facial expressions, we obtain the best result with ear and face
weights of 0.45 and 0.55 respectively. This confirms that, for local 3D features, the face data are more reliable than the ear data, particularly with neutral facial expressions.
In order to evaluate the contribution of the geometric consistency measures on
the two different modalities and on their fusion, we perform experiments using only
the mean feature distance as similarity measure. We obtain worse results of 72%,
96.8% and 98.7% for the ear, the face (with neutral expression) and the fusion
respectively. Thus, the use of geometric consistency measures contributes to a sig-
nificant improvement for the ear, and only a slight improvement for the face. A
possible reason is that our matching algorithm for faces already requires distance
consistency with respect to the detected nose tip position as described in [114]. In
the case of the ear, it is difficult to reliably detect a similar landmark position due
to the possibility of occlusion. We actually consider this to be a strength of our
technique that it does not require any specific part of the ear to be visible.
Figure 7.11: Recognition rates for different combinations of ear and face weights using the weighted sum rule (x-axis: ear weight, with face weight = 1 − ear weight; y-axis: identification rate).
Verification: We obtain a verification rate of 98.1% at a FAR of 0.001 with neu-
tral facial expression when equal weights are used for both modalities. However, the
verification rate increases to 99.7% for the same FAR of 0.001, when we assign 0.35
and 0.65 weights to the ear and the face scores respectively. For the probe dataset
with facial expressions, the verification rate with the face only is 83.5%, which improves to 94.9% after fusion with equal weights and to 97.1% after fusion with weights of 0.45 and 0.55 applied to the ear and the face scores respectively (see Fig. 7.10).
7.6.2 Improvement Using ICP
Considering the ICP scores from both the ear and face data during fusion, we
obtain a slightly improved result for data with non-neutral facial expressions com-
pared to those with L3DF only. The rank-1 identification rates and verification rates
at 0.001 FAR obtained for this approach are reported in Table 7.2. We obtain the best results with ear and face weights of (0.35, 0.65) and (0.45, 0.55) for neutral and non-neutral expression data respectively.
Although the results of the individual modalities improve with the use of ICP for neutral expressions, the fusion result decreases slightly. This implies that we do not have to apply expensive post-processing (using the ICP algorithm) in applications where a neutral facial expression can be ensured.
Fig. 7.12 shows some of the probes which are misclassified with face data only but
are recognized correctly after fusion with ear data. 2D images of the corresponding
probe range images are also shown in the top row for a better visualization of the
expressions.
Figure 7.12: Examples of 2D and corresponding range images of four correctly recognized probes.
7.6.3 Misclassifications
Only five out of 315 probes were misclassified. The range images of those face
and ear probes are shown in the top and the bottom rows respectively in Fig. 7.13.
It is apparent that there are large expression changes in the face probes and data
losses due to hair plus large out-of-plane pose variations in the ear probes.
Figure 7.13: Examples of five misclassified multimodal probes where face data have
large expression changes and the ear data have data losses due to hair and large
out-of-plane pose variations compared to their respective gallery data.
7.7 Performance of the Feature-Level Fusion Approach
In this section, the performance of the proposed feature-level fusion approach is
evaluated. Experiments were performed on the same dataset as the one used for the
score-level fusion for a fair comparison. Along with the results, the selection of the
parameters and the similarity measures are also discussed.
7.7.1 Results on Data with Neutral Expression
For feature-level fusion of ear and face data with neutral facial expression, we
obtain a rank-1 identification rate of 98.4%. In the same scenario, we obtain a verification rate of 99.0% at a FAR of 0.001. The results are illustrated in Figs. 7.14
and 7.15 respectively.
Figure 7.14: Identification rate versus rank for feature-level fusion of ear and face features under neutral and non-neutral facial expressions.
7.7.2 Results on Data with Non-neutral Facial Expressions
Our experiments for the evaluation of feature-level fusion of ear and face data
with non-neutral facial expressions result in identification rates of 94.9%, 97.1%
and 97.8% at rank one, two and three respectively (see Fig. 7.14). These results are
obtained using PCA for feature vector reduction and using all the similarity measures
described in Section 7.5.2. As shown in Fig. 7.15, with the same implementation
scenario, we obtain a verification rate of 96.8% at 0.001 FAR.
7.7.3 Choice of the Number of PCA Components
We performed a number of experiments using different numbers of PCA com-
ponents (eigenvalues) on a subset of the data with non-neutral facial expressions
Figure 7.15: Verification results (ROC curves, false accept rate on a log scale) for feature-level fusion of ear and face features under neutral and non-neutral expressions.
(100 gallery and 100 probe images), using only the mean feature distance of the first stage (δ1) as the similarity measure. The identification results for numbers of components from 10 to 70, in steps of 10, are illustrated in Fig. 7.16. The plots show insignificant performance differences across this range, and the best result is obtained with 10 components.
Figure 7.16: Effect of using a different number of PCA components on the identification results (data with non-neutral expressions: 100 gallery and 100 probe images).
7.7.4 Performance of Different Similarity Measures
The performance of different similarity measures used in feature-level fusion for
face data with neutral expression is shown in Fig. 7.17. Similarity measures are
Figure 7.17: Performance of various similarity measures used in feature-level fusion matching for face data with neutral expression (legend: Feature Distance-1, Feature Distance-2, Rotation Consistency (e-f), Rotation Consistency (f-e), Distance Consistency (e-f), Distance Consistency (f-e), Keypoint Distance (e-f), Keypoint Distance (f-e)).
Table 7.3: Summary of the comparison of our fusion approaches with others (identification and verification rates are measured at rank-1 and at 0.001 FAR respectively)

Source         Methodology                       Id. Rate  Ver. Rate  Dataset   Multi-instance  Automation        Robustness to non-neutral
                                                 (%)       (%)        (#Subj.)  enrolment                         expressions
Yan [175]      ICP with sum and interval         100       -          174       2 images        Manually          Not shown
               fusion rules                                                                     extracted ear
Theoharis      Annotated model fitting using     99.7      -          324       No              Manual ear        Includes a few such images,
et al. [152]   ICP, SA and wavelet analysis                                                     extraction        but no separate test
This paper     L3DF matching, ICP and            99.4      99.7       326       No              Fully automatic   Id. rate 96.8%,
               weighted sum rule                                                                                  Ver. rate 97.1%
This paper     L3DF-based matching of            98.4      99.0       326       No              Fully automatic   Id. rate 94.9%,
               combined features                                                                                  Ver. rate 96.8%
named in short for δ1, δ2, δ3e, δ3f, δ4e, δ4f, δ5e and δ5f respectively, which are described in Section 7.5.2. As illustrated in this figure, the fourth similarity measure, the proportion of consistent rotations originating from the face-ear combination (δ3f), provides the maximum identification rate.
7.8 Comparative Study
In this section, we first compare our score-level and feature-level fusion approaches with each other. We then compare them with other 3D ear-face multimodal approaches. These comparisons are summarized
in Table 7.3 and discussed below.
7.8.1 Comparison between the Proposed Fusion Approaches
As summarized in Table 7.3, the proposed score-level fusion approach achieves
better accuracy than the feature-level fusion approach for the multimodal recogni-
tion with the ear and face biometrics. Our previous work [117] with 2D Scale Invari-
ant Feature Transform (SIFT) and 3D local features for multimodal face recognition
also demonstrated similar results. This is mostly due to the difficulty in fusing local
features in a repeatable way. As illustrated in Fig. 7.8, we only obtained a 40%
repeatability of the ear-face fused feature vector even with 10 mm nearest neigh-
bor error. However, the results reported in Section 7.7 express one strength of our
feature-level fusion technique that it performs well even when features are fused
differently between a probe and gallery. If features could be fused in the same way,
the technique may outperform score-level fusion.
7.8.2 Comparison with Other Approaches
Yan [175] and Theoharis et al. [152] used precise manual extraction and/or substantial preprocessing of the ear data (e.g. using the snake algorithm, and removing the face or the neck area with a skin detection algorithm), which might have contributed to their higher identification accuracies. In contrast, our approach uses a fully automatic extraction technique that extracts a rectangular area around the ear, sometimes including extra regions with hair and skin, and applies only minimal preprocessing to remove holes and spikes from the extracted 3D ear data.
The performance of the approach in [175] was evaluated on a smaller database
from only 174 subjects each having two ear images and two face images in the gallery
and the probe datasets. Faltemier et al. [50] experimentally demonstrated that the
multi-instance approach performs better than the single-instance approach, albeit
with a penalty of additional computation. We use a larger database without any
multi-instance in the gallery or the probe datasets (Section 7.3).
None of the approaches in the table performed separate experiments with face data under non-neutral expressions, which severely affect performance. Theoharis et al. [152] used a subset of the FRGC v.2 dataset which includes faces with non-neutral expressions, but the authors did not mention how many of the selected faces had non-neutral expressions. On a larger dataset, but with multi-instance gallery and probe datasets collected from the FRGC v.2 database, Mian et al. [117]
obtained 86.7% and 92.7% identification and verification rates respectively using
face L3DFs involving non-neutral expressions. In this paper, we obtain better re-
sults (94.9% and 95.2% respectively) fusing scores from ear L3DFs and face L3DFs
(without considering ICP scores).
Neither Theoharis et al. [152] nor Yan [175] reported verification rates, one of the important indicators of the performance of a biometric system. Our approach shows high verification accuracy for face data with
both neutral and non-neutral expressions.
The matching time is also not reported for either of these two approaches. However, since Yan [175] used ICP on the whole dataset, our technique with local features is expected to be faster than that approach. The final matching in [152] is expected to be faster, since they used a weighted L1 distance metric to compare wavelet coefficients extracted from the ear and face data. However, their registration and deformable model fitting steps are computationally expensive.
7.9 Conclusion
In this paper, two multimodal ear-face biometric recognition approaches are
proposed, one with fusion at the score-level and another at the feature-level. These
approaches are based on local 3D features which are very fast to compute and robust
to pose and scale variations, and occlusions due to hair and earrings. The fusion
with ear data (which is not sensitive to changes in facial expression) significantly
improves the face recognition results under non-neutral expressions. The comple-
mentary features of the two modalities also provide better results under neutral
expression even without using the expensive ICP algorithm. The feature-level fu-
sion approach performs better than the unimodal approaches based on the ear or face only but, contrary to what might intuitively be expected, not better than the score-level fusion. This is possi-
bly due to the difficulty of fusing local ear and face features in a repeatable way. The
performance of the proposed multimodal recognition can be improved further using
a hybrid fusion approach where the result of a lower (e.g. data or feature) level of
fusion can be fed to a higher (e.g. score) level of fusion. One possibility would be to
short list gallery candidates using the result of data-level or feature-level fusion. A
finer matching algorithm such as ICP can then be applied to each modality and at
the end, score-level fusion can be used to make the final matching decision. Thus,
we can combine the benefits of the different possible levels of fusion. Building a
larger ear-face multimodal database with more challenging images to evaluate the
robustness of the proposed techniques can be another avenue of further research.
Acknowledgments
This research is sponsored by the Australian Research Council (ARC) under
grants DP0664228, DP0881813 and LE0775672 and by the University of Western
Australia under Scholarships for International Research Fees (SIRFs)- Completion.
We acknowledge the use of the FRGC v.2 and the UND Biometrics databases for
ear and face detection and recognition.
CHAPTER 8
Conclusion
8.1 Discussion
In this dissertation, a complete and fully automatic unimodal approach for human recognition using ear biometrics is proposed, along with two multimodal recognition approaches using ear and face data collected from 2D and 3D frontal and profile images.
The Iterative Closest Point (ICP) algorithm is found to be one of the most
accurate algorithms for matching ears in the gallery and the probe dataset. In this
dissertation, in order to reduce the computational expense of this iterative algorithm,
I used the ICP algorithm in a hierarchical manner: first with a low and then with
higher resolution meshes of 3D ear data. The result of the first application of ICP
was used as a coarse alignment prior to the second application of ICP. I obtained a
rank one recognition rate of 93% on the UND Biometrics Database.
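A minimal NumPy/SciPy sketch of this coarse-to-fine idea follows: a first ICP pass on uniformly subsampled points stands in for the low-resolution meshes, and its result initializes a second pass at full resolution. The subsampling rate and iteration counts are assumptions, and the original implementation was in MATLAB.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(src, dst, iters=20):
    """Basic point-to-point ICP aligning src to dst; returns (R, t, rms)."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(dst)
    for _ in range(iters):
        moved = src @ R.T + t
        _, idx = tree.query(moved)                      # closest-point pairs
        cp, cq = moved.mean(0), dst[idx].mean(0)
        H = (moved - cp).T @ (dst[idx] - cq)            # cross-covariance
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        Rs = Vt.T @ D @ U.T                             # Kabsch rotation step
        R, t = Rs @ R, Rs @ t + (cq - Rs @ cp)          # accumulate transform
    dists, _ = tree.query(src @ R.T + t)
    return R, t, float(np.sqrt((dists ** 2).mean()))    # RMS matching error

def hierarchical_icp(src, dst, step=8):
    """Coarse pass on subsampled clouds, then a fine pass from that alignment
    (the returned transform is relative to the coarsely aligned cloud)."""
    R0, t0, _ = icp(src[::step], dst[::step], iters=10)
    return icp(src @ R0.T + t0, dst, iters=20)
```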
In order to achieve more efficiency and robustness against occlusion, I devised a
very fast ear detection approach and used 3D local features for data representation
and matching. My AdaBoost-based ear detection approach with three new Haar
feature templates and the rectangular detector is very fast and significantly robust
to hair, earrings and earphones. On Collection J of the UND Biometrics Database
with 830 images from 415 subjects, my proposed system provides a detection rate
of 99.9%. On a Core 2 Quad 9550, 2.83 GHz machine, it takes around 7.7 ms to
detect an ear from a 640 × 480 image. Unlike other approaches, the performance of
my proposed system does not rely on any assumptions about the localization of the
nose or the ear pit.
The modified construction of local 3D features and their efficient use in finding potential matches and feature-rich areas, and in coarsely aligning the data prior to ICP, make my recognition approach computationally inexpensive and significantly robust to occlusions and pose variations. The use of two-stage feature matching with geometric consistency measures significantly improved the matching performance.
With an un-optimized MATLAB implementation, it takes only 22.2 sec to extract around 200 local features from a 3D ear; matching a probe ear with a gallery using only local features takes 0.06 sec, and the full matching including ICP requires 2.28 sec on average. The evaluation of the
performance of the complete system on UND-J (the largest available ear database)
gives an identification rate of 93.5% and an Equal Error Rate (EER) of 4.1%. The
corresponding rates for the UND-F dataset are 95.4% and 2.3%. On a new in-house
dataset of 50 subjects all wearing earphones, I obtained an identification rate of 98%
with an EER of 1%.
In order to overcome the vulnerability of unimodal biometric systems, I pre-
sented two multimodal recognition systems fusing the ear with the face biometrics
at score and feature levels. The face is chosen because it is physically close to the ear and, like the ear, its image can be collected non-intrusively. In both approaches, local 3D features are used to represent both the ear and face data, and the shape similarity among the features in the gallery and probe datasets is utilized
for classification. The fusion with ear data (which is not sensitive to changes in facial
expression) significantly improves the face recognition results under non-neutral ex-
pressions. The complementary features of the two modalities provide better results
under neutral expression. The evaluation of both approaches is performed on the
largest publicly available multimodal ear-face dataset constructed using data from
the FRGC v.2 and the UND Biometric database. The score-level fusion technique
achieves an identification rate of 99.4% and a verification rate of 99.7% (at 0.001
FAR) with neutral facial expression. Corresponding rates with non-neutral facial
expressions are 96.8% and 97.1% respectively. I obtained comparable recognition
results with the feature-level fusion of the two modalities without using expensive
matching algorithms such as ICP.
Thus, a complete and fully automatic system of unimodal and multimodal recog-
nition using the ear and the face biometrics from detection to decision making is
presented in this dissertation. These approaches can be adapted for recognition with
other biometric traits and objects and have the potential to be extended to other
applications such as robotics, medicine and forensic sciences.
8.2 Future Work
The accuracy and efficiency of the algorithms and approaches proposed in this
dissertation can be improved in a number of ways. Some possibilities are listed
below:
• The performance evaluation of the approaches presented in this dissertation
asserts that local feature-based matching and multimodal systems perform
better than global feature-based matching and unimodal systems respectively.
Therefore, a potential extension of the research would be to further improve the
performance of ear or face modality by combining 2D local features (e.g. SIFT)
with 3D local features. 2D features can be extracted from corresponding 2D
images that can be captured and co-registered along with the 3D scans using
most of the current 3D data acquisition devices.
• The performance of the proposed multimodal recognition may be improved
using a hybrid fusion approach where the result of a lower (e.g. data or
feature) level of fusion can be fed to a higher (e.g. score) level of fusion.
One possibility would be to short list gallery candidates using the result of
data-level or feature-level fusion. A finer matching algorithm such as ICP can
then be applied to each modality and finally, a score-level fusion technique
can be used to make the final matching decision. Thus, we can combine the
benefits of the different possible levels of fusion.
• The robustness can further be improved by including some other biometric
traits with the proposed ear-face multimodal approach. For example, gait im-
ages can be included in the case of non-intrusive applications and fingerprints
or iris biometrics can be included in the case of controlled applications.
• Most of the algorithms presented in this dissertation are implemented using
un-optimized MATLAB code. Therefore, the speed of the recognition can be
improved by implementing these algorithms on a C/C++ platform and using
faster techniques for feature matching like geometric hashing as well as faster
variants of the ICP algorithm using, for example, kd-trees.
• In this dissertation, I explored fusion of the ear and face biometrics at the
feature and score levels. It would be interesting to fuse them at data level and
compare the results. Local features can be extracted from the fused data and
matching can be performed using my proposed matching techniques.
• Fusion approaches in this dissertation were tested with the largest available
ear-face multimodal dataset obtained from publicly available frontal face and
profile databases. The dataset contains images from 326 subjects with a va-
riety of facial expressions, poses and occlusions with hair and ornaments. A
larger ear-face multimodal database with more challenging images (e.g. with
eyeglasses and earplugs) could be built to evaluate the robustness of the pro-
posed techniques.
Bibliography
[1] 3DRMA. 3D RMA : 3D database. available at
http://www.sic.rma.ac.be/ beumier/DB/3d rma.html, 1998.
[2] A. F. Abate, M. Nappi, and D. Riccio. Face and ear: A bimodal identification
system. In Proc. ICIAR 2006, Part II, pages 297–304, 2006.
[3] A.F. Abate, M. Nappi, D. Riccio, and G. Sabatino. 2D and 3D face recognition: A
survey. Pattern Recognition Letters, 28(14):1885–1906, Oct. 2007.
[4] F. Al-Osaimi, M. Bennamoun, and A. Mian. An Expression Deformation Ap-
proach to Non-rigid 3D Face Recognition. International Journal of Computer Vision
(IJCV), 81(3):302–316, 2009.
[5] L. Alvarez, E. Gonzalez, and L. Mazorra. Fitting Ear Contour Using an Ovoid
Model. In Proc. Int’l Carnahan Conf. on Security Technology, 2005., pages 145–
148, 2005.
[6] B. B. Amor, M. Ardabilian, and L. Chen. Toward a Region-Based 3D Face Recog-
nition Approach. In Proc. Multimedia and Expo, 2008, pages 101–104, 2008.
[7] S. Ansari and P. Gupta. Localization of Ear Using Outer Helix Curve of the Ear.
In Proc. Int’l Conf. on Computing: Theory and Applications, 2007, pages 688–692,
2007.
[8] B. Arbab-Zavar and M. S. Nixon. On shape-mediated enrolment in ear biometrics.
Advances in visual computing, Lecture Notes in Computer Science, 4842:549–558,
2007.
[9] C. Barbu, R. Iqbal, and Jing Peng. An Ensemble Approach to Robust Biometrics
Fusion. In Proc. CVPR Workshop, 2006, pages 56–56, 2006.
[10] P. J. Besl and N. D. McKay. A Method for Registration of 3-D Shapes. IEEE Trans.
PAMI, 14(2):239–256, 1992.
[11] B. Bhanu and H. Chen. 3D Ear Detection from Side Face Range Images. Springer,
2008.
[12] Binghamton University, USA. Binghamton University 3D Facial Expression (BU-
3DFE). available at
http://www.cs.binghamton.edu/ lijun/Research/3DFE/3DFE Analysis.html, 2006.
[13] BioID AG. BioID Face Database. available at
http://support.bioid.com/downloads/facedb/index.php, 2001.
[14] Biometric Consortium. Introduction to Biometrics. 2009.
[15] Bosphorus. The Bosphorus Database. available at
http://bosphorus.ee.boun.edu.tr/, 2008.
[16] K. W. Bowyer, K. I. Chang, Ping Yan, P. J. Flynn, E. Hansley, and S. Sarkar.
Multi-Modal Biometrics: an Overview. In Proc. Second Workshop on Multimodal
User Authentication, 2006.
[17] K.W. Bowyer, K.I. Chang, and P.J. Flynn. A Survey of Approaches and Challenges
in 3D and Multi-Modal 3D+2D Face Recognition. Computer Vision and Image
Understanding, 101(1):1–15, 2006.
[18] A. Bronstein, M. Bronstein, and R. Kimmel. Robust expression-invariant face recog-
nition from partially missing data. In Proc. ECCV’06, LNCS., pages 396–408, 2006.
[19] R. Brunelli and D. Falavigna. Person Identification Using Multiple Cues. IEEE
Trans. PAMI, 12:955–966, 1995.
[20] M. Burge and W. Burger. Ear Biometrics in Computer Vision. In Proc. ICPR’00,
pages 822–826, 2000.
[21] J.D. Bustard and M.S. Nixon. Robust 2D Ear Registration and Recognition Based
on SIFT Point Matching. In Proc. 2nd IEEE International Conference on Biomet-
rics: Theory, Applications and Systems, 2008. BTAS 2008, pages 1–6, 2008.
[22] S. Cadavid and M. Abdel-Mottaleb. Human Identification Based on 3D Ear Models.
In Proc. IEEE International Conference on Biometrics: Theory, Applications, and
Systems, 2007. BTAS 2007, pages 1–6, 2007.
[23] R. J. Campbell and P. J. Flynn. A Survey of Free-form Object Representation and
Recognition Techniques. Computer Vision and Image Understanding, 81(2):166–
210, 2001.
[24] J. Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.
[25] CAS-PEAL. CAS-PEAL Face Database. available at
http://www.jdl.ac.cn/peal/index.html, 2004.
[26] K.I. Chang, K.W. Bowyer, and P.J. Flynn. Adaptive rigid multi-region selection for
handling expression variation in 3d face recognition. In Proc. CVPR, pages 157–157,
2005.
[27] K.I. Chang, K.W. Bowyer, and P.J. Flynn. Multiple Nose Region Matching
for 3D Face Recognition under Varying Facial Expression. IEEE Trans. PAMI,
28(10):1695–1700, 2006.
[28] Kyong Chang, K.W. Bowyer, S. Sarkar, and B. Victor. Comparison and combination
of ear and face images in appearance-Based biometrics. IEEE Transactions PAMI,
9:1160–1165, 2003.
[29] H. Chen and B. Bhanu. Human Ear Detection from Side Face Range Images. In
Proc. ICPR 2004, 3:574–577, 2004.
[30] H. Chen and B. Bhanu. Contour Matching for 3D Ear Recognition. In Proc. IEEE
Workshops on Application of Computer Vision, pages 123–128, 2005.
[31] H. Chen and B. Bhanu. Shape Model-Based 3D Ear Detection from Side Face Range
Images. pages 122–122, 2005.
[32] H. Chen and B. Bhanu. Human Ear Recognition in 3D. IEEE Trans. PAMI,
29(4):718–737, 2007.
[33] H.-Y. Chen, C.-L. Huang, and C.-M. Fu. Hybrid-boost Learning for Multi-pose Face
Detection and Facial Expression Recognition. Pattern Recognition, 41(3):1173–1185,
2008.
[34] Hui Chen and B. Bhanu. Contour Matching for 3D Ear Recognition. In Proc. IEEE
Workshops on Application of Computer Vision, pages 123–128, 2005.
[35] Hui Chen and B. Bhanu. Efficient Recognition of Highly Similar 3D Objects in
Range Images. IEEE Trans. PAMI, 31(1):172–179, 2009.
[36] Qing Chen, Nicolas D. Georganas, and Emil M. Petriu. Real-time Vision-Based
Hand Gesture Recognition Using Haar-like Features. In Proc. IEEE Instrumentation
and Measurement Technology Conf., pages 1–6, 2007.
[37] M. Choras. Ear Biometrics Based on Geometrical Feature Extraction. Electronic
Letters on Computer Vision and Image Analysis, 5:84–95, 2005.
[38] M. Choras. Image Feature Extraction Methods for Ear Biometrics: A Survey. In
Proc. 6th International Conference on Computer Information Systems and Indus-
trial Management Applications, pages 261–265, 2007.
[39] M. Choras. Image Pre-classification for Biometrics Identification Systems. Advances
in Information Processing and Protection, Pejas, J. and Saeed, K. (ed.), Springer
US, 3:361–370, 2007.
[40] C. Chua, F. Han, and Y. Ho. 3D Human Face Recognition Using Point Signatures.
In IEEE Analysis and Modeling of Faces and Gestures, pages 233–238, 2000.
[41] C. S. Chua and R. Jarvis. Point Signatures: A New Representation for 3D Object
Recognition. Int’l Journal of Computer Vision, 25(1):63–85, 1997.
[42] O. Chum and J. Matas. Matching with PROSAC - Progressive Sample Consensus.
In Proc. the CVPR'05, 1:220–226, 2005.
[43] CIFAS. 2008 Fraud Trends, CIFAS. 2009.
[44] CMU. PIE Database, CMU. available at
http://www.ri.cmu.edu/research_project_detail.html?project_id=418&menu_id=261,
2000.
[45] A. Colombo, C. Cusano, and R. Schettini. Face3 a 2D+3D Robust Face Recognition
System. In Proc. ICIAP 2007, pages 393–398, 2007.
[46] K. Delac, M. Grgic, and M. S. (ed.) Bartlett. Recent Advances in Face Recognition.
IN-TECH, Vienna, Austria, 2008.
[47] J. D’Errico. Surface Fitting Using Gridfit. MATLAB Central, File Exchange, 2006.
http://www.mathworks.com/matlabcentral/fileexchange/8998.
[48] C. Dorai and A.K. Jain. COSMOS-A Representation Scheme for 3D Free-form
Objects. IEEE Trans. PAMI, 19(10):1115–1130, 1997.
[49] T.C. Faltemier, K.W. Bowyer, and P.J. Flynn. Using a Multi-Instance Enrollment
Representation to Improve 3D Face Recognition. In Proc. IEEE Int’l Conference
on Biometrics: Theory, Applications, and Systems, pages 1–6, 2007.
[50] T.C. Faltemier, K.W. Bowyer, and P.J. Flynn. Using multi-instance enrollment to
improve performance of 3D face recognition. Computer Vision and Image Under-
standing, 112(2):114–125, Nov. 2008.
[51] FERET. The Color FERET Database. available at
http://face.nist.gov/colorferet/, 2003.
[52] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model
Fitting with Applications to Image Analysis and Automated Cartography. Comm.
of the ACM, 24:381–395, 1981.
[53] Y. Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line
Learning and An Application to Boosting. In Proc. European Conf. on Compu-
tational Learning Theory, 1995.
[54] R. Frischholz. Face Detection Homepage. 2008.
[55] R. Frischholz and U. Dieckmann. Bioid: A multimodal biometric identification
system. IEEE Computer, 33(2):64–68, 2000.
[56] Y. Gao and M. Maggs. Feature-Level Fusion in Personal Identification. In Proc.
CVPR’05, 1:468–473, 2005.
[57] M. Garland and P. Heckbert. Surface Simplification Using Quadric Error Metrics.
In SIGGRAPH, 1997.
[58] J.E. Gentile, K.W. Bowyer, and P.J. Flynn. Profile Face Detection: A Subset Multi-
Biometric Approach. In Proc. Biometrics: Theory, Applications and Systems, 2008.
BTAS 2008, pages 1–6, 2008.
[59] R.S. Ghiass and N. Sadati. Multi-view Face Detection and Recognition under
Variable Lighting Using Fuzzy Logic. In Proc. IEEE International Conference on
Wavelet Analysis and Pattern Recognition (ICWAPR), 1:74–79, 2008.
[60] L. Goldmann, U. J. Monich, and T. Sikora. Components and Their Topology for Ro-
bust Face Detection in the Presence of Partial Occlusions. IEEE Trans. Information
Forensics and Security, 2(3):559–569, 2007.
[61] D.B. Graham and N.M. Allinson. Characterizing Virtual Eigensignatures for Gen-
eral Purpose Face Recognition. Face Recognition: from Theory to Applications,
NATO ASI Series F, Computer and Systems Sciences, H. Wechsler, P. J. Phillips,
V. Bruce, F. Fogelman-Soulie and T. S. Huang (eds), 163:446–456, 1998.
[62] M. Grgic and K. Delac. Databases, Face Recognition Homepage. 2009.
[63] Yimo Guo and Zhengguang Xu. Ear Recognition Using a New Local Matching
Approach. In Proc. the 15th IEEE International Conference on Image Processing,
ICIP’08, pages 289–292, 2008.
[64] C.H. Han and K.-B. Sim. Real-time Face Detection Using AdaBoost Algorithm. In
Proc. the Int’l Conf. on Control, Automation and Systems (ICCAS), pages 1892–
1895, 2008.
[65] N. He, K. Sato, and Y. Takahashi. Partial Face Extraction and Recognition Using
Radial Basis Function Networks. IAPR Workshop on Machine Vision Applications,
pages 144–147, 2000.
[66] E. Hjelmas and B.K. Low. Face Detection: A Survey. Computer Vision and Image
Understanding, 83(3):236–274, 2001.
[67] K. Hotta. View Independent Face Detection Based on Horizontal Rectangular Fea-
tures and Accuracy Improvement Using Combination Kernel of Various Sizes. Pat-
tern Recognition, 42(3):437–444, 2009.
[68] Chang Huang, Haizhou Ai, Yuan Li, and Shihong Lao. High-Performance Rotation
Invariant Multiview Face Detection. IEEE Trans. PAMI, 29(4):671–686, 2007.
[69] D. J. Hurley, B. Arbab-Zavar, and M. S. Nixon. The Ear As a Biometric. EUSIPCO
2007, pages 25–29, 2007.
[70] D. J. Hurley, M. S. Nixon, and J. N. Carter. Force Field Feature Extraction for Ear
Biometrics. Computer Vision and Image Understanding, 98(3):491–512, 2005.
[71] M. Husken, M. Brauckmann, S. Gehlen, and C. Malsburg. Strategies and benefits
of fusion of 2D and 3D face recognition. In Proc. CVPR, pages 174–174, 2005.
[72] A. Iannarelli. Ear Identification. Forensic Identification Series. Paramount Pub-
lishing Company, Fremont, California, 1989.
[73] ISL. Image Databases. 2009.
[74] S.M.S. Islam, M. Bennamoun, and R. Davies. Fast and Fully Automatic Ear De-
tection Using Cascaded AdaBoost. In Proc. IEEE Workshop on Application of
Computer Vision, pages 1–6, 2008.
[75] S.M.S. Islam, M. Bennamoun, A. Mian, and R. Davies. A Fully Automatic Approach
for Human Recognition from Profile Images Using 2D and 3D Ear Data. Proc.
3DPVT, pages 131–141, 2008.
[76] S.M.S. Islam, M. Bennamoun, A. Mian, and R. Davies. Score Level Fusion of Ear
and Face Local 3D Features for Fast and Expression-invariant Human Recognition.
M. Kamel and A. Campilho (Eds.): ICIAR 2009, LNCS 5627, Springer, Heidelberg,
pages 387–396, 2009.
[77] S.M.S. Islam, M. Bennamoun, R. Owens, and R. Davies. Biometric Approaches of
2D-3D Ear and Face: A Survey. Proc. Int’l Conf. on Systems, Computing Sciences
and Software Engineering, 2007.
[78] S.M.S. Islam, M. Bennamoun, R. Owens, and R. Davies. Biometric Approaches of
2D-3D Ear and Face: A Survey. Advances in Computer and Information Sciences
and Engineering, T. Sobh(ed.), Springer Netherlands, pages 509–514, 2008.
[79] S.M.S. Islam, M. Bennamoun, R. Owens, and R. Davies. A Review of Recent
Advances in 3D Ear and Expression Invariant Face Biometrics. Under review with
ACM Computing Surveys, 2010.
[80] S.M.S. Islam and R. Davies. Refining Local 3D Feature Matching through Geometric
Consistency for Robust Biometric Recognition. Proc. Digital Image Computing:
Techniques and Applications (DICTA), pages 513–518, 2009.
[81] S.M.S. Islam, R. Davies, M. Bennamoun, and A. Mian. A Fast and Fully Automatic
Approach for Ear Detection and 3D Recognition from Profile Images. International
Journal of Computer Vision (Under review), 2010.
[82] S.M.S. Islam, R. Davies, M. Bennamoun, R.A. Owens, and A.S. Mian. Fusion of
Local 3D Ear and Face Features for Human Recognition. IEEE Transactions on
Pattern Analysis and Machine Intelligence (Under review), 2010.
[83] S.M.S. Islam, R. Davies, A. Mian, and M. Bennamoun. A Fast and Fully Automatic
Ear Recognition Approach Based on 3D Local Surface Features. J. Blanc-Talon et
al. (Eds.): ACIVS 2008, LNCS 5259, Springer, Heidelberg, pages 1081–1092, 2008.
[84] A. K. Jain, K. Nandakumar, and A. Ross. Score Normalization in Multimodal
Biometric Systems. Pattern Recognition, 38(12):2270–2285, 2005.
[85] A. K. Jain, A. Ross, and S. Pankanti. Biometrics: A Tool For Information Security.
IEEE Trans. Information Forensics and Security, 1(2):125–143, 2006.
[86] A. K. Jain, A. Ross, and S. Prabhakar. An introduction to biometric recognition.
IEEE Trans. Circuits and Systems for Video Technology, 14(1):4–20, 2004.
[87] Javelin. The 2009 Identity Fraud Survey Report, Javelin Strategy & Research. 2009.
[88] Javelin. The 2010 Identity Fraud Survey Report, Javelin Strategy & Research.
available at
http://www.idsafety.net/2010IDFraudReportRelease.pdf, 2010.
[89] A. E. Johnson and M. Hebert. Using Spin Images for Efficient Object Recognition
in Cluttered 3D Scenes. IEEE Trans. PAMI, 21(5):674–686, 1999.
[90] M. Jones and P. Viola. Fast Multi-view Face Detection. Technical Report TR2003-
96, MERL, 2003.
[91] I.A. Kakadiaris, G. Passalis, G. Toderici, M.N. Murtuza, Yunliang Lu, N. Karam-
patziakis, and T. Theoharis. Three-Dimensional Face Recognition in the Presence
of Facial Expressions: An Annotated Deformable Model Approach. IEEE Trans.
PAMI, 29(4):640–649, 2007.
[92] Yan Ke and R. Sukthankar. PCA-SIFT: a more distinctive representation for local
image descriptors. In Proc. the CVPR’04, 2:506–513, 2004.
[93] J. Kepner. MatlabMPI. Journal of Parallel and Distributed Computing, 64(8):997–
1005, 2004.
[94] J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On Combining Classifiers. IEEE
Transactions PAMI, 20(3):226–239, 1998.
[95] J.J. Koenderink and A.J. van Doorn. Surface Shape and Curvature Scales. Image and
Vision Computing, 10:557–565, 1992.
[96] S.G. Kong, J. Heo, B.R. Abidi, J. Paik, and M.A. Abidi. Recent advances in
visual and infrared face recognition: a review. Computer Vision and Image Under-
standing, 97(1):103–135, 2005.
[97] A. Kumar, D. C. M. Wong, H. Shen, and A. K. Jain. Personal verification using
palmprint and hand geometry biometric. In Proc. Int’l Conf. on Audio- and Video-
Based Person Authentication, pages 668–675, 2003.
[98] A. Kumar and D. Zhang. Personal recognition using hand shape and texture. IEEE
Trans. Image Processing, 15(8):2454–2461, 2006.
[99] Chao Li and A. Barreto. An Integrated 3D Face-Expression Recognition Approach.
In Proc. ICASSP’06, 3:III–III, 2006.
[100] S. Z. Li and A. K. Jain. Handbook of Face Recognition. Springer, 2005.
[101] X. Li, G. Mori, and H. Zhang. Expression-Invariant Face Recognition with Expres-
sion Classification. In Proc. Canadian Conf. on Computer and Robot Vision, pages
77–83, 2006.
[102] Y. Li, S. Gong, J. Sherrah, and H. Liddell. Support Vector Machine Based Multi-
view Face Detection and Recognition. Image and Vision Computing, 22(5):413–427,
2004.
[103] R. Lienhart, L. Liang, and A. Kuranov. A Detector Tree of Boosted Classifiers for
Real-Time Object Detection and Tracking. In Proc. the Int’l Conf. on Multimedia
and Expo, 2003. ICME ’03, 2:277–280, 2003.
[104] R. Lienhart and J. Maydt. An Extended Set of Haar-like Features for Rapid Object
Detection. In Proc. the Int'l Conf. on Image Processing, 1:900–903, 2002.
[105] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Int. Journal
of Computer Vision, 60:91–110, 2004.
[106] Lu Lu, Xiaoxun Zhang, Youdong Zhao, and Yunde Jia. Human Identification Based
on 3D Ear Models. In Proc. International Conference on Innovative Computing,
Information and Control, 2006. ICICIC '06, 3:353–356, 2006.
[107] X. Lu and A. K. Jain. Deformation Modeling for Robust 3D Face Matching. IEEE
Trans. of PAMI, 30(8):1346–1356, 2008.
[108] L. Luciano and A. Krzyzak. Automated Multimodal Biometrics Using Face and Ear.
M. Kamel and A. Campilho (Eds.): ICIAR 2009, LNCS 5627, Springer, Heidelberg,
pages 451–460, 2009.
[109] A. Majumdar and R. K. Ward. Discriminative SIFT Features for Face Recogni-
tion. In Proc. Canadian Conference on Electrical and Computer Engineering, 2009.
CCECE ’09, pages 27–30, 2009.
[110] D. Maltoni, A. Jain, J. Wayman, and D. Maio. Biometric Systems: Technology,
Design and Performance Evaluation. Springer Verlag, 2005.
[111] G. Mamic and M. Bennamoun. Representation and Recognition of 3D Free-Form Ob-
jects. Digital Signal Processing (DSP), Academic Press, 12(1):47–76, Jan. 2002.
[112] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTSDB: The
Extended M2VTS Database. In Proc. the 2nd Conf. on Audio and Video-base
Biometric Personal Verification, Springer Verlag, New York, pages 1–6, 1999.
[113] J. Meynet, V. Popovici, and J.-P. Thiran. Face detection with boosted Gaussian
features. Pattern Recognition, 40(8):2283–2291, 2007.
[114] A. S. Mian, M. Bennamoun, and R. Owens. An Efficient Multimodal 2D-3D Hybrid
Approach to Automatic Face Recognition. IEEE Trans. PAMI, 29(11):1927–1943,
2007.
[115] A.S. Mian, M. Bennamoun, and R. Owens. 2D and 3D multimodal hybrid face
recognition. In Proc. European Conf. on Computer Vision (ECCV), Part 3, pages
344–355, 2006.
[116] A.S. Mian, M. Bennamoun, and R. Owens. A Novel Representation and Feature
Matching Algorithm for Automatic Pairwise Registration of Range Images. Inter-
national Journal of Computer Vision (IJCV), 66(1):19–40, 2006.
[117] A.S. Mian, M. Bennamoun, and R. Owens. Keypoint Detection and Local Feature
Matching for Textured 3D Face Recognition. International Journal of Computer
Vision, 79(1):1–12, 2008.
[118] C. Middendorff, K.W. Bowyer, and Ping Yan. Multi-Modal Biometrics Involving
the Human Ear. In Proc. IEEE Conference on CVPR, 3:1–2, 2007.
[119] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE
Trans. PAMI, 27(10):1615–1630, 2005.
[120] MIT-CBCL. MIT-CBCL Face Recognition Database.
http://cbcl.mit.edu/software-datasets/heisele/facerecognition-database.html, 2004.
[121] T. Mita, T. Kaneko, and O. Hori. Joint Haar-like Features for Face Detection. In
Proc. IEEE International Conference on Computer Vision, 2005. ICCV 2005,
2:1619–1626, 2005.
[122] G. Monteiro, P. Peixoto, and U. Nunes. Vision-Based Pedestrian Detection using
Haar-Like features. Robotica 2006-Scientific meeting of the 6th Robotics Portuguese
Festival,Portugal, 2006.
[123] A.B. Moreno and A. Sanchez. GavabDB: A 3D face database. In Proc. Workshop
Biometrics on the Internet COST275, pages 77–85, 2004.
[124] I. Mpiperis, S. Malassiotis, and M.G. Strintzis. 3D Face Recognition by Point Sig-
natures and Iso-contours. In Proc. SPPRA, pages 233–238, 2007.
[125] I. Mpiperis, S. Malassiotis, and M. G. Strintzis. Bilinear Models for 3-D Face and
Facial Expression Recognition. IEEE Trans. Information Forensics and Security,
3(3):498–511, Sept. 2008.
[126] P. Nair and A. Cavallaro. 3-D Face Detection, Landmark Localization, and Regis-
tration Using a Point Distribution Model. IEEE Trans. Multimedia, 11(4):611–623,
2009.
[127] R. Niese, A. Al-Hamadi, and B. Michaelis. A Novel Method for 3D Face Detection
and Normalization. Journal of Multimedia, 2(5):1–12, 2007.
[128] M. Nilsson, J. Nordberg, and I. Claesson. Face Detection using Local SMQT Fea-
tures and Split Up SNoW Classifier. In Proc. IEEE Int'l Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2:589–592, 2007.
[129] NIST-MID. NIST Mugshot Identification Database (MID).
http://www.nist.gov/srd/nistsd18.htm, 1994.
[130] Zhiheng Niu, Shiguang Shan, Shengye Yan, Xilin Chen, and Wen Gao. 2D Cascaded
AdaBoost for Eye Localization. In Proc. ICPR 2006, 2:1216–1219, 2006.
[131] OpenCV. OpenCV Library. http://sourceforge.net/projects/opencvlibrary/.
[132] E. Osuna, R. Freund, and F. Girosit. Training Support Vector Machines: an Ap-
plication to Face Detection. In Proc. CVPR’97, pages 130–136, 1997.
[133] Y. Pan, X. Cao, X. Xu, Y. Lu, and Y. Zhao. The study of multimodal recognition
based on ear and face. In Proc. IEEE Int'l Conference on Audio, Language and
Image Processing, ICALIP, pages 385–389, 2008.
[134] Jae-Han Park, Kyung-Wook Park, Seung-Ho Baeg, and Moon-Hong Baeg. π-SIFT:
A Photometric and Scale Invariant Feature Transform. In Proc. the 19th
International Conference on Pattern Recognition, 2008. ICPR 2008, 2:1–4, 2008.
[135] G. Passalis, I. Kakadiaris, T. Theoharis, G. Toderici, and N. Murtuza. Evalua-
tion of 3D Face Recognition in the presence of facial expressions: an Annotated
Deformable Model approach. In Proc. IEEE Workshop Face Recognition Grand
Challenge Experiments, 3:171–179, 2005.
[136] G. Passalis, I.A. Kakadiaris, T. Theoharis, G. Toderici, and T. Papaioannou. To-
wards Fast 3D Ear Recognition for Real-Life Biometric Applications. In Proc. IEEE
Conference on Advanced Video and Signal Based Surveillance, 2007. AVSS 2007,
3:39–44, 2007.
[137] S. Pavani, D. Delgado, and A. F. Frangi. Haar-like features with optimally weighted
rectangles for rapid object detection. Pattern Recognition, 43(1):160–172, 2010.
[138] N. Pears. RBF Shape Histograms and Their Application to 3D Face Processing. In
Proc. IEEE International Conference on Automatic Face and Gesture Recognition,
2008. FG08, pages 1–8, 2008.
[139] N.E. Pears and T.D. Heseltine. Isoradius contours: New Representations and Tech-
niques for 3D Face Matching and Registration. In Proc. Int. Symposium on 3DPVT,
pages 176–183, 2006.
[140] D. Petrovska-Delacretaz, S. Lelandais, J. Colineau, L. Chen, B. Dorizzi, M. Ardabil-
ian, E. Krichen, M.-A. Mellakh, A. Chaari, S. Guerfi, J. D'Hose, and B. Ben Amor.
The IV2 Multimodal Biometric Database (Including Iris, 2D, 3D, Stereoscopic, and
Talking Face Data), and the IV2-2007 Evaluation Campaign. In Proc. BTAS08,
pages 1–7, 2008.
[141] P.J. Phillips, P.J. Flynn, T. Scruggs, K.W. Bowyer, Jin Chang, K. Hoffman, J. Mar-
ques, Jaesik Min, and W. Worek. Overview of the Face Recognition Grand Chal-
lenge. In Proc. CVPR’05, 1:947–954, 2005.
[142] P.J. Phillips, A. Martin, C.L. Wilson, and M. Przybocki. An introduction to eval-
uating biometric systems. Computer, 33(2):56–63, 2000.
[143] K. H. Pun and Y. S. Moon. Recent Advances in Ear Biometrics. In Proc. IEEE
Int’l Conf. on Automatic Face and Gesture Recognition, pages 164–169, 2004.
[144] A. Ross and R. Govindarajan. Feature Level Fusion Using Hand and Face Biomet-
rics. In Proc. SPIE Conf. on Biometric Technology for Human Identification II,
pages 196–204, 2005.
[145] A. Ross and A. K. Jain. Information Fusion in Biometrics. Pattern Recognition
Letters, 24(13):2115–2125, 2003.
[146] A. Ross and A. K. Jain. Multimodal Biometrics: An Overview. In Proc. European
Signal Processing Conf., pages 1221–1224, 2004.
[147] A. A. Ross, K. Nandakumar, and A. K. Jain. Handbook of Multibiometrics. Springer,
2006.
[148] A. Ruifrok, A. Scheenstra, and R. C. Veltkamp. A Survey of 3D Face Recogni-
tion Methods. In Proc. Audio- and Video-Based Biometric Person Authentication
(AVBPA 2005), LNCS 3546, pages 891–899, 2005.
[149] R.E. Schapire and Y. Singer. Improved Boosting Algorithms Using Confidence-rated
Predictions. Mach. Learn., 37(3):297–336, 1999.
[150] P.Y. Simard, L. Bottou, P. Haffner, and Y. LeCun. Boxlets: A Fast Convolution
Algorithm for Signal Processing and Neural Networks. In M. Kearns, S. Solla, and
D. Cohn (Eds.), Advances in Neural Information Processing Systems, 11:571–577, 1999.
[151] J. Sochman and J. Matas. AdaBoost with Totally Corrective Updates for Fast Face
Detection. In Proc. IEEE Int’l Conf. on Automatic Face and Gesture Recognition,
2004, pages 445–450, 2004.
[152] T. Theoharis, G. Passalis, G. Toderici, and I.A. Kakadiaris. Unified 3D Face and Ear
Recognition Using Wavelets on Geometry Images. Pattern Recognition, 41(3):796–
804, 2008.
[153] A. Treptow and A. Zell. Real-time Object Tracking for Soccer Robots without Color
Information. Robotics and Autonomous Systems, 48(1):41–48, 2004.
[154] UMIST. The UMIST Face Database.
http://images.ee.umist.ac.uk/danny/database.html, 2002.
[155] UND. University of Notre Dame Biometrics Database.
http://www.nd.edu/~cvrl/CVRL/Data_Sets.html, 2004.
[156] UND. University of Notre Dame Biometrics Database.
http://www.nd.edu/~cvrl/CVRL/Data_Sets.html, 2005.
[157] UND. University of Notre Dame Biometrics Database.
http://www.nd.edu/~cvrl/CVRL/Data_Sets.html, 2005.
[158] O. Ushmaev and S. Novikov. Biometric Fusion: Robust Approach. In Proc. Int’l
Workshop on Multimodal User Authentication (MMUA 2006), 2006.
[159] USTB. USTB Ear Database.
http://www.en.ustb.edu.cn/resb/, 2002.
[160] USTB. The USTB database III. available at
http://www.ustb.edu.cn/resb/en/doc/Imagedb_123_intro_en.pdf, 2004.
[161] P. Viola and M.J. Jones. Robust Real-Time Face Detection. Int’l Journal of Com-
puter Vision, 57(2):137–154, 2004.
[162] Y. Wang, G. Pan, and Z. Wu. Sphere-spin-image: A Viewpoint Invariant Surface
Representation for 3D Face Recognition. In Proc. Internat. Conf. on Computational
Science (ICCS04), Lecture Notes in Computer Science, Vol. 3037, pages 427–434,
2004.
[163] Y. Wang, G. Pan, and Z. Wu. 3D Face Recognition in the Presence of Expression: A
Guidance-Based Constraint Deformation Approach. In Proc. CVPR07, pages 1–7,
2007.
[164] Y. Wang, T. Tan, and A. K. Jain. Combining Face and Iris Biometrics for Identity
Verification. In Proc. Int’l Conf. on Audio- and Video-Based Person Authentication,
pages 805–813, 2003.
[165] B. Weyrauch, J. Huang, B. Heisele, and V. Blanz. Component-Based Face Recog-
nition with 3D Morphable Models. In First IEEE Workshop on Face Processing in
Video, Washington, D.C., 2004.
[166] D.L. Woodard, T.C. Faltemier, Ping Yan, P.J. Flynn, and K.W. Bowyer. A Com-
parison of 3D Biometric Modalities. In Proc. CVPR Workshop, pages 57–61, 2006.
[167] K. Woods, K. Bowyer, and W. P. Kegelmeyer. Combination of Multiple Classifiers
Using Local Accuracy Estimates. Trans. Pattern Anal. Mach. Intell., 19(4):405–410,
1997.
[168] J. Wu, S.C. Brubaker, M.D. Mullin, and J.M. Rehg. Fast Asymmetric Learning for
Cascade Face Detection. IEEE Trans. PAMI, 30(3):369–382, 2008.
[169] L. Xiaohua, K.-M. Lam, S. Lansun, and Z. Jiliu. Face detection using simplified Ga-
bor features and hierarchical regions in a cascade of classifiers. Pattern Recognition
Letters, 30(8):717–728, 2009.
[170] Zhang Xiaoxun and Jia Yunde. Symmetrical Null Space LDA for Face and Ear
Recognition. Neurocomputing, 70(4-6):842–848, 2007.
[171] X.-N. Xu, Z.-C. Mu, and L. Yuan. Feature-level Fusion Method Based on KFDA
for Multimodal Recognition Fusing Ear and Profile Face. In Proc. Int'l Conference
on Wavelet Analysis and Pattern Recognition (ICWAPR), 3:1306–1310, 2007.
[172] Xiaona Xu and Zhichun Mu. Feature Fusion Method Based on KCCA for Ear and
Profile Face Based Multimodal Recognition. In Proc. IEEE International Confer-
ence on Automation and Logistics, pages 620–623, 2007.
[173] Yale. The Yale Face Database. available at
http://cvc.yale.edu/projects/yalefaces/yalefaces.html, 1997.
[174] J. Yan. Ensemble SVM Regression Based Multi-View Face Detection System. In
Proc. IEEE Workshop on Machine Learning for Signal Processing, pages 163–169,
2007.
[175] P. Yan. Ear Biometrics in Human Identification. PhD thesis, University of Notre
Dame, 2006.
[176] P. Yan and K. W. Bowyer. Empirical Evaluation of Advanced Ear Biometrics. In
Proc. CVPR, pages 41–41, 2005.
[177] P. Yan and K. W. Bowyer. ICP-Based Approaches for 3D Ear Recognition. In
Proc. SPIE-Volume 5779: Biometric Technology for Human Identification II, Anil
K. Jain, Nalini K. Ratha, Editors, pages 282–291, 2005.
[178] P. Yan and K. W. Bowyer. Multi-Biometric 2D and 3D Ear Recognition. T. Kanade,
A. Jain and N. K. Ratha (Eds.) AVBPA 2005, LNCS 3546, Springer, Heidelberg,
pages 503–512, 2005.
[179] P. Yan and K. W. Bowyer. Biometric Recognition Using 3D Ear Shape. IEEE
Trans. PAMI, 29(8):1297–1308, 2007.
[180] Ping Yan and Kevin W. Bowyer. Ear Biometrics Using 2D and 3D Images. In Proc.
CVPR, pages 121–121, 2005.
[181] Ping Yan and Kevin W. Bowyer. An Automatic 3D Ear Recognition System. In
Proc. the Third Int'l Symposium on 3DPVT, pages 326–333, 2006.
[182] Ming-Hsuan Yang, D. Kriegman, and N. Ahuja. Detecting Faces in Images: A
Survey. IEEE Trans. PAMI, 24(1):34–58, 2002.
[183] M.H. Yap, H. Ugail, R. Zwiggelaar, B. Rajoub, V. Doherty, S. Appleyard, and
G. Hurdy. A Short Review of Methods for Face Detection and Multifractal Analysis.
In Proc. Int’l Conference on CyberWorlds, pages 231–236, 2009.
[184] L. Yuan, Z. Mu, and Y. Liu. Multimodal Recognition Using Face Profile and Ear.
In Proc. the 1st Int’l Symposium on SCAA, pages 887–891, 2006.
[185] Li Yuan and Zhi-Chun Mu. Ear Detection Based on Skin-Color and Contour Infor-
mation. In Proc. Int’l Conf. on Machine Learning and Cybernetics, 4:2213–2217,
2007.
[186] Li Yuan and Feng Zhang. Ear Detection Based on Improved AdaBoost Algorithm.
Proc. International Conference on Machine Learning and Cybernetics, 4:2414–2417,
2009.
[187] T. Yuizono, Y. Wang, K. Satoh, and S. Nakayama. Study on Individual Recognition
for Ear Images by Using Genetic Local Search. In Proc. Congress on Evolutionary
Computation, pages 237–242, 2002.
[188] Hai-Jun Zhang, Zhi-Chun Mu, Wei Qu, Lei-Ming Liu, and Cheng-Yang Zhang. A
Novel Approach for Ear Recognition Based on ICA and RBF Network. In Proc.
Int’l Conf. on Machine Learning and Cybernetics, 2005, pages 4511–4515, 2005.
[189] Xiaozheng Zhang and Yongsheng Gao. Face recognition across pose: A review.
Pattern Recognition, 42(11):2876–2896, 2009.
[190] W. Zhao, R. Chellappa, A. Rosenfeld, and P.J. Phillips. Face Recognition: A
Literature Survey. ACM Computing Surveys, 35(4):399–458, 2003.
[191] W. Zhao, R. Chellappa, A. Rosenfeld, and P.J. Phillips. Face Recognition: A
Literature Survey. ACM Computing Surveys, 35(4):399–458, 2003.
[192] W. Zhao, R. Chellappa, A. Rosenfeld, and P.J. Phillips. Face Recognition: A
Literature Survey. ACM Computing Surveys, 35(4):399–458, 2003.
[193] S. K. Zhou, R. Chellappa, and W. Zhao. Unconstrained Face Recognition (Interna-
tional Series on Biometrics). Springer, 2006.
[194] Xiaoli Zhou and B. Bhanu. Integrating Face and Gait for Human Recognition. In
Proc. CVPR Workshop, 2006, pages 55–55, 2006.