Human Recognition Using Local 3D Ear
and Face Features
Syed Mohammed Shamsul Islam
This thesis is presented for the degree of
Doctor of Philosophy
of The University of Western Australia
School of Computer Science and Software Engineering.
June 2010
© Copyright 2010 by Syed Mohammed Shamsul Islam

THE UNIVERSITY OF WESTERN AUSTRALIA
DECLARATION FOR THESES CONTAINING PUBLISHED WORK AND/OR WORK PREPARED FOR PUBLICATION
The examination of the thesis is an examination of the work of the student. The work must have been substantially conducted by the student during enrolment in the degree.
Where the thesis includes work to which others have contributed, the thesis must include a statement that makes the student's contribution clear to the examiners. This may be in the form of a description of the precise contribution of the student to the work presented for examination and/or a statement of the percentage of the work that was done by the student.
In addition, in the case of co-authored publications included in the thesis, each author must give their signed permission for the work to be included. If signatures from all the authors cannot be obtained, the statement detailing the student's contribution to the work must be signed by the coordinating supervisor.
Please tick one of the statements below.
1. This thesis does not contain work that I have published, nor work under review for publication.
2. This thesis contains only sole-authored work, some of which has been published and/or prepared for publication under sole authorship. The bibliographical details of the work and where it appears in the thesis are outlined below.
3. [✓] This thesis contains published work and/or work prepared for publication, some of which has been co-authored. The bibliographical details of the work and where it appears in the thesis are outlined below. The student must attach to this declaration a statement for each publication that clarifies the contribution of the student to the work. This may be in the form of a description of the precise contributions of the student to the published work and/or a statement of percent contribution by the student. This statement must be signed by all authors. If signatures from all the authors cannot be obtained, the statement detailing the student's contribution to the published work must be signed by the coordinating supervisor.
The bibliographical details of the publications included in this thesis and the contribution of the candidate to those appear at pages XIV and XVII respectively in the thesis.
Student Signature . . . . .
Coordinating Supervisor Signature . . . . .
Heartily dedicated to my parents
Syed Mohammed Nazrul Islam and Mosammat Jebunnesa
and also to my wife Hafeza
for all their love and inspiration
Abstract
The field of Biometrics is rapidly gaining popularity due to increasing breaches
of traditional security systems and the decreasing costs of sensors. Among the bio-
metric traits, the ear and the face are considered to be the most socially accepted
due to their easy and non-intrusive data acquisition. Furthermore, their feature
richness and physical proximity make them good candidates for fusion. However,
occlusions due to the presence of hair and ornaments and deformations due to facial
expressions pose great challenges for real-life applications of these two biometrics.
These challenges are addressed in this dissertation through the development of ef-
ficient and robust algorithms for ear detection, ear data representation and finally,
the combination of ear and face biometrics using robust fusion techniques. The
dissertation is organized as a set of papers already published and/or submitted to
journals or internationally refereed conferences.
In this dissertation, a fast and fully automatic approach for detecting 3D ears
from corresponding 2D and 3D profile images using a Cascaded AdaBoost algorithm
is proposed. The classifiers are trained with three new Haar-like features and the
detection is performed using a 16 × 24 detection window placed around the ear.
The approach is highly robust to hair, earrings and earphones and, unlike other
approaches, it does not require any assumption about the localization of the nose or
the ear pit. The proposed ear detection approach achieves a detection rate of 99.9%
on the UND-J Biometrics Database with 830 images of 415 subjects (the largest
publicly available profile database), taking only 7.7 ms per image on average using
a C++ implementation on a Core 2 Quad 9550, 2.83 GHz PC.
For ear recognition, I initially proposed to apply the Iterative Closest Point
(ICP) algorithm in a hierarchical manner: first with a low-resolution and then with
a higher-resolution mesh of the 3D ear data. The results obtained in the first stage are used
for coarse alignment for the next stage and thus the computational cost of this
accurate iterative algorithm is reduced. In order to achieve better efficiency and
robustness to occlusions, 3D local features (L3DFs) are used for data representation
and matching. Local features are used to develop a rejection classifier, to extract a
minimal rectangular feature-rich region and to compute the initial alignment for the
ICP algorithm. An improved technique for feature matching is also proposed using
geometric consistency among the corresponding features. On the UND-J database
with 415 gallery and 415 probe images, an identification rate of 93.5% and an Equal
Error Rate (EER) of 4.1% are obtained. The corresponding rates on a new dataset
of 50 subjects, all wearing earphones, are 98% and 1% respectively. With an
un-optimized MATLAB implementation, the average times required for the L3DF-based
matching and for the full matching including ICP are 0.06 and 2.28 seconds respectively.
In order to further increase the robustness, two techniques are presented for
fusing the ear biometrics with face biometrics. In score-level fusion, scores from
the face are computed using the same matching technique proposed for the ear
and a weighted sum rule with complementary weights is used for fusion. For
the fusion of the ear and the face local features (feature-level fusion), the shape
similarity among the local features from the two different modalities is utilized in
the construction of the multimodal ear-face gallery and probe datasets prior to
concatenation. Matching is performed using L3DF-based similarity measures similar
to those used for unimodal matching. The proposed score-level fusion
technique achieves identification (rank-1) and verification (at 0.001 FAR) rates of
99.4% and 99.7% respectively with neutral facial expression and 96.8% and 97.1%
respectively with non-neutral facial expressions on the largest available multimodal
dataset using FRGC v.2 and UND databases. The feature-level fusion approach
achieves an accuracy comparable to that of the score-level fusion without requiring
ICP-like expensive algorithms in matching.
The unimodal and multimodal approaches proposed in this dissertation for ear
and face biometrics can be extended for recognition with other biometric traits and
objects and for other applications such as robotics, medicine and forensic sciences.
Acknowledgements
I am thankful to God for giving me the opportunity and strength to complete
my Ph.D. I am also grateful to many people whose help and support made this
journey possible. First of all, I would like to thank my supervisors Mohammed
Bennamoun, Robyn Owens and Rowan Davies. Their continuous guidance and
support were always a source of inspiration for me. They created a motivating,
enthusiastic and friendly environment which is ideal for research. Their critical and
insightful feedback greatly improved my work and quality of presentation.
I am grateful to Dr. Ajmal Syeed Mian for the useful discussions we had regarding
3D shape acquisition and biometrics. I also had useful discussions with other Ph.D.
candidates, visiting scholars and research fellows including Professor Wesley Snyder
from North Carolina State University and Faisal Al-Osaimi. I would also like to
thank the anonymous reviewers whose constructive criticism and feedback helped
me improve my papers and subsequent work.
I am thankful to Jonathan Wan, K. M. Tracey, Ashley Chew, Laurie McKeaig,
Joe Sandon and other staff members in the Computer Science Support Group for
their technical assistance. I would also like to acknowledge the co-operation and
participation of the students and staff members of the CSSE and other schools of
UWA in 3D ear and face data acquisition in our laboratory.
I extend my sincere gratitude to my parents who taught me the value of hard
work and always prayed for my success. I would also like to thank my wife and
kids for their patience during the course of my Ph.D. Without my wife’s support, I
would not have had the peace of mind required for research. I am also grateful to
all my relatives and friends who kept me motivated by reinforcing my enthusiasm.
Finally, I would like to acknowledge the institutions and people who made their
data available for public use, including The University of Notre Dame, the National
Institute of Standards and Technology (NIST), University of Surrey, the University
of Sheffield and the Center for Biological and Computational Learning at MIT for
their face databases, the University of Science and Technology Beijing for their ear
database and D’Errico for the surface fitting code. I also acknowledge the financial
and logistical support obtained through grants DP0664228 and LE0775672 of the
Australian Research Council (ARC) and the Scholarships for International Research
Fees (SIRFs), University International Stipend (UIS), SIRF Safety Net Top-Up
Scholarships and SIRFs-Completion offered to the candidate by the University of
Western Australia.
Contents
List of Tables vii
List of Figures ix
Publications Included in this Thesis xiv
Contribution of Candidate to Published Papers xvii
1 Introduction 1
1.1 Motivations of the Research . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Structure of the dissertation . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 A Review of Recent Advances in 3D Ear and Expression In-
variant Face Biometrics . . . . . . . . . . . . . . . . . . . . . . 6
1.4.2 An ICP Based Hierarchical Matching Approach for 3D Ear
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.3 A Fast and Fully Automatic Ear Recognition Approach Based
on 3D Local Surface Features . . . . . . . . . . . . . . . . . . 6
1.4.4 Refining Local 3D Feature Matching through Geometric Con-
sistency for Robust Biometric Recognition . . . . . . . . . . . 7
1.4.5 Efficient Detection and Recognition of Textured 3D Ears . . . 7
1.4.6 Fusion of 3D Ear and Face Biometrics for Robust Human
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 A Review of Recent Advances in 3D Ear and Expression Invariant
Face Biometrics 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Preliminary Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 2D versus 3D Biometrics . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Unimodal Biometrics . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Multimodal Biometrics . . . . . . . . . . . . . . . . . . . . . . 13
2.2.5 Fusion Techniques . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Image Acquisition Techniques . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Existing Databases . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Face Detection Techniques . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 2D Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 3D Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Ear Detection Techniques . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Using Only 2D Data . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2 Using Only 3D Data . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.3 Using Both 2D and 3D Data . . . . . . . . . . . . . . . . . . . 22
2.5.4 Discussion of the Ear Detection Techniques . . . . . . . . . . . 23
2.6 3D Data Representation Techniques . . . . . . . . . . . . . . . . . . . 23
2.6.1 Global Representation . . . . . . . . . . . . . . . . . . . . . . 24
2.6.2 Local Representation . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.3 Comparative Evaluation of the Representation Techniques . . 27
2.7 Recognition Techniques with 3D Face Data . . . . . . . . . . . . . . . 28
2.7.1 Rigid Approaches . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7.2 Non-rigid Approaches . . . . . . . . . . . . . . . . . . . . . . . 30
2.8 Recognition Techniques with 3D Ear Data . . . . . . . . . . . . . . . 34
2.8.1 Approaches Using Local Features . . . . . . . . . . . . . . . . 35
2.8.2 Approaches Using Global Features . . . . . . . . . . . . . . . 35
2.8.3 Approaches Without Extracting Any Features . . . . . . . . . 36
2.8.4 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . 36
2.9 Multi-Biometric Recognition with 3D Ear and Face . . . . . . . . . . 38
2.10 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.10.1 Image Acquisition Related Challenges . . . . . . . . . . . . . . 40
2.10.2 Robustness Related Challenges . . . . . . . . . . . . . . . . . 41
2.10.3 Efficiency Related Challenges . . . . . . . . . . . . . . . . . . 41
2.10.4 Application Related Challenges . . . . . . . . . . . . . . . . . 42
2.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 An ICP Based Hierarchical Matching Approach for 3D Ear Recog-
nition 45
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 3D Ear Detection and Normalization . . . . . . . . . . . . . . . . . . 47
3.3 3D Ear Matching and Recognition . . . . . . . . . . . . . . . . . . . . 48
3.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.1 Dataset Used . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.2 Recognition Rate of the Single-step ICP . . . . . . . . . . . . 49
3.4.3 Improvement Obtained with the Two-step ICP . . . . . . . . 50
3.4.4 Analysis of the Misclassification . . . . . . . . . . . . . . . . . 50
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 A Fast and Fully Automatic Ear Recognition Approach Based on
3D Local Surface Features 53
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Ear Data Extraction and Normalization . . . . . . . . . . . . 55
4.2.2 Feature Location Identification . . . . . . . . . . . . . . . . . 56
4.2.3 3D Local Feature Extraction . . . . . . . . . . . . . . . . . . . 58
4.2.4 Feature Matching . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.5 Coarse Registration of Gallery and Probe Data . . . . . . . . 60
4.2.6 Fine Matching with ICP . . . . . . . . . . . . . . . . . . . . . 60
4.2.7 Final Similarity Measures . . . . . . . . . . . . . . . . . . . . 60
4.3 Results and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.1 Data Set Used . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.2 Recognition Rate with 3D Local Features Only . . . . . . . . 61
4.3.3 Fine Matching with the ICP . . . . . . . . . . . . . . . . . . . 62
4.3.4 Occlusion and Pose Invariance . . . . . . . . . . . . . . . . . . 62
4.3.5 Analysis of the Failures . . . . . . . . . . . . . . . . . . . . . . 63
4.3.6 Recognition Speed . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 Refining Local 3D Feature Matching through Geometric Consis-
tency for Robust Biometric Recognition 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Proposed Refinement Technique . . . . . . . . . . . . . . . . . . . . . 67
5.3.1 Computation of Distance Consistency . . . . . . . . . . . . . . 68
5.3.2 Computation of Rotation Consistency . . . . . . . . . . . . . . 68
5.3.3 Final Similarity Measures . . . . . . . . . . . . . . . . . . . . 69
5.4 Result and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5 Comparative Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 74
6 Efficient Detection and Recognition of Textured 3D Ears 75
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Related Work and Contributions . . . . . . . . . . . . . . . . . . . . 77
6.2.1 Ear Detection Approaches . . . . . . . . . . . . . . . . . . . . 77
6.2.2 Ear Recognition Approaches . . . . . . . . . . . . . . . . . . . 79
6.2.3 Motivations and Contributions . . . . . . . . . . . . . . . . . . 81
6.3 Automatic Detection and Extraction of Ear Data . . . . . . . . . . . 82
6.3.1 Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.2 Construction of the Classifiers . . . . . . . . . . . . . . . . . . 84
6.3.3 Training the Classifiers . . . . . . . . . . . . . . . . . . . . . . 85
6.3.4 Ear Detection with the Cascaded Classifiers . . . . . . . . . . 86
6.3.5 Multi-detection Integration . . . . . . . . . . . . . . . . . . . . 87
6.3.6 3D Ear Region Extraction . . . . . . . . . . . . . . . . . . . . 88
6.3.7 Extracted Ear Data Normalization . . . . . . . . . . . . . . . 89
6.4 Representation and Extraction of Local 3D Features . . . . . . . . . . 89
6.4.1 KeyPoint Selection for L3DFs . . . . . . . . . . . . . . . . . . 89
6.4.2 Feature Extraction and Compression . . . . . . . . . . . . . . 91
6.5 L3DF Based Matching Approach . . . . . . . . . . . . . . . . . . . . 92
6.5.1 Finding Correspondence Between Candidate Features . . . . . 92
6.5.2 Filtering with Geometric Consistency . . . . . . . . . . . . . . 93
6.5.3 Other Similarity Measures Based on L3DFs . . . . . . . . . . 95
6.5.4 Building a Rejection Classifier . . . . . . . . . . . . . . . . . . 96
6.5.5 Extraction of a Minimal Rectangular Area . . . . . . . . . . . 96
6.5.6 Coarse Alignment of Gallery-Probe Pairs . . . . . . . . . . . . 97
6.5.7 Fine Alignment with the ICP . . . . . . . . . . . . . . . . . . 97
6.6 Detection Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.6.1 Correct Detection . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.6.2 False Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.6.3 Detection Speed . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.7 Recognition Performance . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.7.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.7.2 Identification and Verification Results on the UND Database . 103
6.7.3 Robustness to Occlusions . . . . . . . . . . . . . . . . . . . . . 105
6.7.4 Robustness to Pose Variations . . . . . . . . . . . . . . . . . . 106
6.7.5 Analysis of the Failures . . . . . . . . . . . . . . . . . . . . . . 106
6.7.6 Speed of Recognition . . . . . . . . . . . . . . . . . . . . . . . 107
6.7.7 Evaluation of L3DFs, Geometric Consistency Measures and
Minimal Rectangular Area of Dataset . . . . . . . . . . . . . . 108
6.8 Comparison with Other Approaches . . . . . . . . . . . . . . . . . . . 109
6.8.1 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.8.2 Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7 Fusion of 3D Ear and Face Biometrics for Robust Human Recog-
nition 115
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Related Work and Contributions . . . . . . . . . . . . . . . . . . . . 118
7.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2.2 Motivations and Contributions . . . . . . . . . . . . . . . . . . 119
7.3 Data Acquisition and Feature Extraction . . . . . . . . . . . . . . . . 121
7.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.3.2 Ear and Face Data Extraction . . . . . . . . . . . . . . . . . . 121
7.3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4 Unimodal Matching and Score-level Fusion . . . . . . . . . . . . . . . 123
7.4.1 Matching Technique . . . . . . . . . . . . . . . . . . . . . . . 123
7.4.2 Fusion of Scores . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.5 Feature-Level Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.5.1 Fusion of L3DFs from Ear and Face . . . . . . . . . . . . . . . 127
7.5.2 Matching the Fused Features . . . . . . . . . . . . . . . . . . . 127
7.6 Performance of the Score-level Fusion Approach . . . . . . . . . . . . 130
7.6.1 Results Using L3DF-Based Measures Only . . . . . . . . . . . 131
7.6.2 Improvement Using ICP . . . . . . . . . . . . . . . . . . . . . 132
7.6.3 Misclassifications . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.7 Performance of the Feature-Level Fusion Approach . . . . . . . . . . 134
7.7.1 Results on Data with Neutral Expression . . . . . . . . . . . . 134
7.7.2 Results on Data with Non-neutral Facial Expressions . . . . . 134
7.7.3 Choice of the Number of PCA Components . . . . . . . . . . 134
7.7.4 Performance of Different Similarity Measures . . . . . . . . . . 135
7.8 Comparative Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.8.1 Comparison between the Proposed Fusion Approaches . . . . 137
7.8.2 Comparison with Other Approaches . . . . . . . . . . . . . . . 137
7.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8 Conclusion 141
8.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Bibliography 145
List of Tables
2.1 Some existing profile and frontal face databases . . . . . . . . . . . . 15
2.2 Summary of recognition approaches for face with varying expressions-1 32
2.3 Summary of recognition approaches for face with varying expressions-2 33
2.4 Summary of the existing 3D ear recognition approaches . . . . . . . . 37
2.5 Multi-biometric Approaches with 3D ear and face data . . . . . . . . 38
6.1 Summary of the existing 3D ear recognition approaches . . . . . . . . 80
6.2 Ear detection results on different datasets . . . . . . . . . . . . . . . 98
6.3 Performance variations for using L3DFs and geometric consistency
measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4 Comparison of the detection approach of this paper with the ap-
proaches of Chen and Bhanu [32] and Yan and Bowyer [179] . . . . . 109
6.5 Comparison of the recognition approach of this paper with others on
the UND database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1 Multi-biometric approaches with 2D and 3D ear and face data . . . . 117
7.2 Summary of score-level fusion results . . . . . . . . . . . . . . . . . . 130
7.3 Summary of the comparison of our fusion approaches with others
(identification and verification rates are measured at rank-1 and at
0.001 FAR respectively) . . . . . . . . . . . . . . . . . . . . . . . . . 136
List of Figures
2.1 Salient features of an external ear (left and right images are 2D and
3D views respectively of the same ear). . . . . . . . . . . . . . . . . . 10
2.2 Taxonomy of the biometric approaches with ear and face that are cov-
ered in this manuscript with reference to the relevant section number
within braces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Classification of fusion techniques in multimodal biometric systems
(adapted from [84]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Comparative results of 2D face detection (performed and illustrated
in [183]): (a) Using an OpenCV-based implementation of cascaded Ad-
aBoost [131] and (b) Using the approach of Nilsson et al. [128] (best
seen in color). Notice that the approach in (a) failed to detect the
face of the first person. . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Samples of 3D face detection: (a) in the presence of occlusions (b) in
case of multiple faces in a scene (c) in large scale variations [126]. . . 18
2.6 Different steps of localizing ear in the approach of Ansari and Gupta
in [7]: (a) Convex and concave edges extracted from a side face image
(b) Possible outer helix curves (c) Completed ear boundary (best seen
in color). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Samples of ear detection under severe occlusion using AdaBoost [74] (best
seen in color). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 Block diagram of the ear detection approach proposed by Yan and
Bowyer [179]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.9 Global data representation: (a) the SFR [114] (b) the balloon image
(SSR convexity map) [124]. . . . . . . . . . . . . . . . . . . . . . . . 24
2.10 Local surface data representation: (a) point signatures (face surface
and sphere) (b) iso-contours (c) the spin image (d) the tensor (e) A
feature point (asterisk) and its neighbors (dots) and basic constituents
of an LSP [32] (f) L3DF [83]: A local surface (right image) and the
region of ear from which it is extracted (shown by a circle on the left
image). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.11 Sample of 3D face (top), variance in expressions (middle) and expres-
sion insensitive binary masks of 3D faces (bottom) [114] (best seen in
color). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.12 Illustration of deformation models: (a) Original, annotated face model,
geometry image and normal image (left to right) used in [91] (b) Orig-
inal test scan, anchor point extraction and deformable model used
in [107] (c) Original and bilinear model of two different expressions [125]. 31
2.13 Block diagram of the ear recognition system proposed in [83]. . . . . . 34
2.14 Identification plots of ear, face and fusion of these two modalities [152]. 39
2.15 Block diagram of the L3DF based ear-face multimodal recognition
system fused at score level [76]. . . . . . . . . . . . . . . . . . . . . . 40
3.1 Block diagram of the 3D ear detection approach. . . . . . . . . . . . . 48
3.2 Sample of the full and reduced meshes of the extracted 3D ear data. . 49
3.3 Flowchart of the matching algorithm using coarse-to-fine hierarchical
technique with ICP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4 Example of recognition: (a) 2D and range image of the gallery ear
(b) 2D and range image of the probe ear with a small ear-ring and
hair (This figure is best seen in color). . . . . . . . . . . . . . . . . . 51
3.5 Recognition rates with Single-step and Two-step ICP. . . . . . . . . . 51
3.6 Examples of improvement with the two-step ICP which was not cor-
rectly recognized by the single-step ICP [left image is the gallery and
right one is the probe]. . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7 Examples of misclassification:(a, b, c) 2D image of a gallery (left) and
the corresponding probe (right) with pose variations, (d) 2D image
of a probe with occlusions (e) Range image of a probe having missing
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Block diagram of the proposed ear recognition system. . . . . . . . . 56
4.2 Locations of local features (shown with dots) on the range images of
different views (in rows) of different individuals (in columns). (This
figure is best seen in color). . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Repeatability of local feature locations . . . . . . . . . . . . . . . . . 58
4.4 Example of a 3D local surface (right image). The region from which
it is extracted is shown by a circle on the left image. . . . . . . . . . . 59
4.5 Identification results. . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Examples of correct recognition in the presence of occlusions. (a)
With ear-rings. and (b) With hair. (2D and the corresponding range
images are placed in the top and bottom row respectively) . . . . . . 62
4.7 Example of correct recognition of gallery-probe pairs with pose vari-
ations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.8 Example of misclassification. (a) With large pose variations. (b)
With ear-ring, hair and pose variations. . . . . . . . . . . . . . . . . . 63
5.1 Example of a 3D local surface (right image) and the region from which
it is extracted (left image, marked with a circle) [83]. . . . . . . . . . 67
5.2 Feature correspondences before (left image) and after (right image)
filtering with geometric consistency (best seen in color). . . . . . . . . 69
5.3 Identification results for fusion of ears and faces on dataset A: (a)
without using geometric consistency. (b) with geometric consistency . 70
5.4 Verification results for fusion of ears and faces on dataset A: (a) with-
out using geometric consistency. (b) with geometric consistency . . . 70
5.5 Identification results using geometric consistency: (a) on dataset B
(b) on dataset C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.6 Comparing identification performance of different geometric consis-
tency measures (on dataset C) . . . . . . . . . . . . . . . . . . . . . . 73
6.1 Block diagram of the proposed ear detection and recognition system. 76
6.2 Block diagram of the proposed ear detection approach. . . . . . . . . 83
6.3 Features used in training the AdaBoost (features (f), (g) and (h) are
proposed for detecting specific features of the ear) . . . . . . . . . . . 84
6.4 Examples of ear (top) and non-ear (bottom) images used in the training. 85
6.5 Sample of detections: (a) Detection with single window. (b) Multi-
detection integration (best seen in color). . . . . . . . . . . . . . . . . 87
6.6 Example of a 3D local surface (right image). The region from which
it is extracted is shown by a circle on the left image. . . . . . . . . . . 90
6.7 (a) Location of keypoints on the gallery (left) and the probe (right)
images of three different individuals (this figure is best seen in color).
(b) Cumulative percentage of repeatability of the keypoints. . . . . . 90
6.8 Feature correspondences after filtering with geometric consistency
(best seen in color). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.9 Extraction of a minimal rectangular area containing all the matching
L3DFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.10 Detection in presence of occlusions with hair and ear-rings (the inset
image is the enlargement of the corresponding detected ear). . . . . . 99
6.11 Example of test images for which the detector failed. . . . . . . . . . 99
6.12 Example of ear images detected from profile images with ear-phones. 99
6.13 Detection performance under two different types of synthesized oc-
clusions (on a subset of the UND-F dataset with 203 images). . . . . 100
6.14 Detection of a motion blurred image. . . . . . . . . . . . . . . . . . . 101
6.15 False detection evaluation: (a) FAR (on number of profile images)
with respect to number of stages in the cascade. (b) The ROC curve
for classification of cropped ear and non-ear images (best seen in color). 101
6.16 Recognition results on the UND-J dataset: (a) Identification rate.
(b) Verification rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.17 Examples of correct recognition in presence of occlusions: (a) With
ear-rings and (b) With hair (c) With ear-phones (2D and the corre-
sponding range images are placed in the top and bottom row respec-
tively (best seen in color)). . . . . . . . . . . . . . . . . . . . . . . . . 104
6.18 Recognition results on the UND-G dataset: (a) Identification rate for
different off center pose variations. (b) CMC curves for 30◦ and 45◦
off center pose variations. . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.19 Examples of correct recognition of four gallery-probe pairs with pose
variations (best seen in color). . . . . . . . . . . . . . . . . . . . . . . 106
6.20 Examples of failures: (a) Two probe images with missing data. (b)
A gallery-probe pair with a large pose variation. . . . . . . . . . . . . 107
6.21 Comparing identification performance of different L3DF-based simi-
larity measures (without ICP on the UND-F dataset). . . . . . . . . . 108
7.1 Data acquisition using Minolta Vivid 910 scanner: (a) 2D and 3D
profile images captured to extract 3D ear data. (b) 3D frontal face
scanned to extract 3D face data . . . . . . . . . . . . . . . . . . . . . 122
7.2 Example of an extracted 3D local surface feature [83]. . . . . . . . . 123
7.3 Feature correspondences established between a gallery and a probe
ear after matching is performed. The image of the probe ear (right
image) is flipped for better visibility (best seen in color). . . . . . . . 124
7.4 Block diagram of the proposed multimodal recognition system with
score-level fusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.5 Block diagram of the proposed multimodal recognition system with
feature-level fusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.6 Correspondences between the features extracted from the frontal face
and those from the ear of the same person. These correspondences are
established using algorithm 1. Only the first 40 features are shown
for a better visibility (best seen in color). . . . . . . . . . . . . . . . . 126
7.7 Block diagram of the feature-set constructed by fusion of ear and
face L3DFs (subscript ck indicates the index number of the closest
feature with respect to its left or right feature for ear-face and face-
ear combination respectively). . . . . . . . . . . . . . . . . . . . . . . 128
7.8 Repeatability of fused features in the gallery and probe images of ten
individuals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.9 Identification results for score-level fusion of ear and face (without
using ICP scores): (a) with neutral expression. (b) with non-neutral
expressions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.10 ROC curves for score-level fusion of ear and face (without using ICP
scores): (a) with neutral expression. (b) with non-neutral expressions. 131
7.11 Recognition rates for different combinations of ear and face weights
using the weighted sum rule. . . . . . . . . . . . . . . . . . . . . . . . 132
7.12 Examples of 2D and corresponding range images of four correctly
recognized probes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.13 Examples of five misclassified multimodal probes where face data have
large expression changes and the ear data have data losses due to hair
and large out-of-plane pose variations compared to their respective
gallery data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.14 Identification results for feature-level fusion of ear and face features
under neutral and non-neutral facial expressions. . . . . . . . . . . . . 134
7.15 Verification results for feature-level fusion of ear and face features
under neutral and non-neutral expressions. . . . . . . . . . . . . . . . 135
7.16 Effect of using a different number of PCA components on the iden-
tification results (data with non-neutral expressions: 100 gallery and
100 probe images). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.17 Performance of various similarity measures used in feature-level fusion
matching for face data with neutral expression. . . . . . . . . . . . . 136
Publications Arising from This Dissertation
This dissertation contains works already published and/or submitted to journals
or internationally refereed conferences. The bibliographical details of each work and
where it appears in the dissertation are outlined below.
International Journal Publications (Fully Refereed)
[1] Islam, S.M.S., Bennamoun, M., Owens, R. and Davies, R., “A Review of
Recent Advances in 3D Ear and Expression Invariant Face Biometrics,” ACM
Computing Surveys (revision in preparation), Nov. 2009. (Chapter 2)
[2] Islam, S.M.S., Davies, R., Bennamoun, M. and Mian, A.S., “Efficient Detection
and Recognition of Textured 3D Ears,” International Journal of Computer
Vision (revision under review), March 2010. (Chapter 6)
[3] Islam, S.M.S., Davies, R., Bennamoun, M., Owens, R.A. and Mian, A.S., “Fu-
sion of Local 3D Ear and Face Features for Human Recognition,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence (under review), June
2010. (Chapter 7)
Book Chapters (Fully Refereed)
[4] Islam, S.M.S., Bennamoun, M., Owens, R. and Davies, R., “Biometric
Approaches of 2D-3D Ear and Face: A Survey,” in Computer and Information
Sciences and Engineering, T. Sobh (ed.), pages 509-514, 2008.
The preliminary ideas of this paper were refined and extended to contribute
towards [1] which forms Chapter 2 of this dissertation.
[5] Islam, S.M.S., Davies, R., Mian, A.S. and Bennamoun, M., “A Fast
and Fully Automatic Ear Recognition Approach Based on 3D Local Surface
Features,” J. Blanc-Talon et al. (Eds.): ACIVS 2008, Lecture Notes in Com-
puter Science (LNCS) volume 5259, pages 1081-1092, Oct. 2008. (Chapter
4)
[6] Islam, S.M.S., Bennamoun, M., Mian, A.S. and Davies, R., “Score Level Fusion
of Ear and Face Local 3D Features for Fast and Expression-invariant Human
Recognition,” M. Kamel and A. Campilho (Eds.): ICIAR 2009, Lecture Notes
in Computer Science (LNCS) volume 5627, Springer, Heidelberg, pages 387-
396, 2009.
The preliminary ideas of this paper were refined and extended to contribute
towards [3] which forms Chapter 7 of this dissertation.
International Conference Publications (Fully Refereed)
[7] Islam, S.M.S., Bennamoun, M. and Davies, R., “Fast and Fully Automatic
Ear Detection Using Cascaded AdaBoost”. In Proc. of IEEE Workshop on
Application of Computer Vision, 2008 (WACV’08), USA, pages 1-6, Jan.7-8,
2008.
The preliminary ideas and results of this paper were refined and extended to
contribute towards [2] which forms Chapter 6 of this dissertation.
[8] Islam, S.M.S., Bennamoun, M., Mian, A.S. and Davies, R., “A Fully Auto-
matic Approach for Human Recognition from Profile Images Using 2D and 3D
Ear Data,” In Proc. of the Fourth International Symposium on 3D Data Pro-
cessing, Visualization and Transmission (3DPVT’08), Atlanta, GA, USA,
pages 131-141, June 18-20, 2008. (Chapter 3)
[9] Islam, S.M.S. and Davies, R., “Refining Local 3D Feature Matching through
Geometric Consistency for Robust Biometric Recognition,” In Proc. of Digi-
tal Image Computing: Techniques and Applications (DICTA), pages 513-518,
December 2009. (Chapter 5)
[10] Islam, S.M.S., Bennamoun, M., Mian, A.S. and Davies, R., “Expression-Robust
Human Recognition from Local 3D Ear and Face Features,” In Proc. of
the Tenth Electrical Engineering and Computing Symposium (PEECS 2009),
Perth, Australia, Oct., 2009.
The preliminary ideas and results of this paper were refined and extended to
contribute towards [3] which forms Chapter 7 of this dissertation.
Note: In the 2008 Journal Citation Report, ACM Computing Surveys (ACMCS) is
ranked number one among 84 Computer Science Theory and Methods titles. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence (TPAMI) and International
Journal of Computer Vision (IJCV) are the top two ranking journals in the area of
Computer Vision. TPAMI is also ranked number one among 229 Electrical Engi-
neering titles in the same report. WACV, 3DPVT, DICTA and ACIVS conferences
are ranked as A, C, B and B respectively in the 2008 ranking of the Computing Re-
search and Education Association of Australia (CORE) and the 2010 ranking of the
Excellence in Research for Australia (ERA) Initiative, a program of the Australian
Research Council (ARC).
Contribution of the Candidate to the Published
Work
The contribution of the candidate to all the published papers was 80%. He co-
authored them with his three supervisors and in some cases also with one member of
the same research group. The candidate developed and implemented the algorithms,
performed the experiments and wrote the papers. Other authors reviewed the papers
and provided useful feedback for improvement.
CHAPTER 1
Introduction
Biometric recognition systems are rapidly gaining popularity due to increasing breaches
of traditional security systems and the decreasing costs of sensors. The current re-
search trend is to combine multiple biometric traits to improve accuracy and robust-
ness. The availability of a rich set of distinctive features and the possibility
of easy and non-intrusive data acquisition make the ear and the face more popular
than other biometric traits. They are also good candidates for fusion due to their
physical proximity. However, occlusions due to the presence of hair and ornaments
and deformations due to facial expressions pose great challenges for real-life appli-
cations of these two biometrics. The main objective of this dissertation is to address
these challenges by developing an efficient and robust approach for ear detection, ear
data representation and finally, combining ear biometrics with face biometrics using
robust fusion techniques. In this chapter, the research motivation of this work is
elaborated followed by a concise statement of the research problem, a list of my ma-
jor contributions and the organization of the dissertation as a set of papers already
published and/or submitted to journals or internationally refereed conferences.
1.1 Motivations of the Research
Incidents of breaches of traditional password or identity card-based systems have
been increasing alarmingly. CIFAS [43], the UK’s fraud prevention service reports a
16% increase in identity frauds in the UK in the year 2008. In the USA, the number
of victims of such fraud increased by 22%, with losses of USD 48 billion over the
year 2008 [87]. Such a rapid increase in security breaches is a strong justification
for replacing or augmenting traditional recognition systems with biometric systems
which are based on human traits that cannot be stolen, denied or faked easily. In
a biometric system one (in the unimodal case) or multiple (in the multimodal case)
physiological (e.g. face, fingerprint, palmprint, iris and DNA) or behavioral (such
as handwriting, gait and voice) characteristics of a subject are taken into consid-
eration for automatic recognition purposes [16, 85, 146]. A biometric recognition
system may operate in one or both of two modes: authentication and identification.
In authentication, one-to-one matching is performed to compare a user’s biomet-
ric to the template of a claimed identity. In identification, one-to-many matching
is performed to associate an identity with the user by matching it against every
identity in a database. Biometric systems can be applied efficiently in numerous
real-environment applications including various civil IDs (e.g. passports, drivers
licenses, national ID cards and voter ID cards), access control, surveillance, law en-
forcement, multimedia management and human computer interaction. With these
priorities, the allocation of government funds and the increasing computational ef-
ficiency of modern computers have boosted the emergence of biometric recognition
systems over the last few years [14].
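To make the two operating modes above concrete, the following sketch contrasts them. It is purely illustrative and not code from this dissertation: match_score stands for any similarity function between a probe sample and a stored template, and the gallery is assumed to be a mapping from identities to enrolled templates.

```python
# Illustrative sketch only: match_score(a, b) is any similarity function and
# gallery is assumed to map identity -> enrolled template.

def authenticate(probe, claimed_id, gallery, match_score, threshold):
    """Authentication mode: one-to-one match against the claimed identity."""
    return match_score(probe, gallery[claimed_id]) >= threshold

def identify(probe, gallery, match_score):
    """Identification mode: one-to-many match against every enrolled identity."""
    return max(gallery, key=lambda identity: match_score(probe, gallery[identity]))
```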
Biometric systems operating on a single biometric trait suffer from a number
of problems including noise in sensed data, intra-class variations, inter-class similar-
ities (i.e. overlaps of features in the case of large databases), non-universality (i.e.
meaningful biometric data may not be acquired from a subset of users) and spoof
attacks [146]. To overcome these problems, multimodal approaches have been proposed.
A system is called multimodal if it collects data from different biometric sources or
uses different types of sensors (e.g. infra-red and reflected light), or uses multiple
samples of data or multiple algorithms [158] to combine the data [16]. Face and
voice data were unified in numerous early works listed in the study of Ross and
Govindarajan [144]. Frischholz and Dieckmann [55] integrated lip data with voice
and face biometrics. Finger, hand, ear, gait and iris images were also integrated
with face images taken by different sensors. In such multimodal systems a decision
can be made on the basis of different subsets of biometrics depending on their avail-
ability and confidence. Such systems are also more robust to spoof attacks as it is
relatively difficult to spoof multiple biometrics simultaneously.
Among the biometric traits, the face and the ear have become popular due
to their rich set of distinctive features as well as the possibility of easy and non-
intrusive acquisition of their images. There has been increasing research interest
in using these two biometric traits for identification and authentication purposes
in the last few years. However, the accuracy and the robustness required for real-
world applications are still to be achieved. Face recognition with neutral expressions
has reached maturity with a high degree of accuracy [17, 78, 117, 190]. However,
changes due to facial expressions, the use of cosmetics and eye glasses, the presence
of facial hair including beards and aging, significantly affect the performance of face
recognition systems. The ear, compared to the face, is much smaller in size but
has a rich structure [8] and a distinct shape [86] which remains unchanged from
the age of 8 to 70 years (as determined by Iannarelli [72] in a study of 10,000
ears). It is, therefore, a very suitable alternative or complement to the face for
effective human recognition [20, 28, 69, 78]. However, the reduced spatial resolution,
uniform distribution of color and sometimes the presence of nearby hair and ear-rings
make the ear very challenging for nonintrusive biometric applications. Therefore, a
multimodal approach using ear as a biometric trait must also address the issue of
efficiently removing the effect of occlusions.
Fusion of ear and face data in an efficient way is also a challenging problem.
Fusion at the matching score or decision level is easy to perform and therefore,
most of the existing multimodal systems use either of the two. But fusion at these
levels cannot fully exploit the discriminating capabilities of the combined biometrics.
Fusion at the data or feature extraction level is believed to produce better results
in terms of accuracy and robustness because richer information about the ID or
class can be combined at these levels [146]. However, very few works [28, 144,
115, 98] have so far been performed using fusion at these levels. Fusion at the
feature level is the most challenging [144, 146, 86] because the feature sets of various
modalities may not be compatible and the relationship between the feature spaces
of different biometric systems may not be known. Moreover, the resulting feature
vectors may be of higher dimensionality and a significantly more complex matching
algorithm may be required. In addition, good features may be degraded by bad
features during fusion and hence, an efficient feature selection approach needs to
be applied prior to fusion. Furthermore, most commercial biometric systems do not
provide access to the raw data or feature sets. Solving these challenges will therefore
constitute a significant contribution.
The speed of the multimodal biometric recognition system is also an important
factor for real time applications, particularly when deployed in public places such
as airports and stadiums. Unfortunately, the most accurate algorithms (e.g. ICP) are
computationally expensive [181]. Therefore, developing an accurate as well as time-
efficient algorithm is of great research significance.
Finally, achieving acceptable results on databases significantly larger than those
used to date is another major challenge that needs to be addressed. Most of the
proposed biometric systems which achieve high accuracy have only been tested on
databases of fewer than 100 subjects.
1.2 Problem Statement
The research problem of this dissertation can be stated as follows. Is it possible
to develop efficient and robust approaches for 2D and 3D ear detection and data
representation to enable construction of a robust fusion approach, combining ear
biometrics with 3D face biometrics for effective human recognition? The answer is
yes, and this dissertation outlines how this can be done.
1.3 Research Contributions
The following is a summary of the major contributions in this dissertation.
• A fast and fully automatic ear detection approach using a Cascaded AdaBoost
algorithm is proposed. The classifiers are trained with Haar-like features and
the detection is performed using a rectangular window placed around the ear. No
assumption is made about the localization of the nose or the ear pit. (Chapter
3 and 6)
• For ear recognition, in order to reduce the computational expense of the ICP
algorithm, I propose a hierarchical ICP-based approach where the algorithm
is first used with a low and then with a higher resolution mesh of 3D ear data.
The result of the first stage of ICP is used as an initial transformation (coarse
alignment) for the second stage of ICP. (Chapter 3)
• The local 3D features (L3DFs) originally proposed by Mian and my supervi-
sors [117] for the face are modified to make them suitable for the representation
of 3D ear data. It has been shown that these local 3D features exhibit a high
degree of repeatability for different ear images of the same subject. (Chapter
4)
• The L3DFs have been utilized for matching a gallery and a probe 3D face
dataset using only feature distances. In this dissertation, I improved matching
based on the geometric consistency amongst the matched features. I included
an explicit second round of matching where only the matches that are consis-
tent with most of the other matches are allowed. (Chapter 5)
• I devised approaches to utilize the result of local feature-based matching for
an early rejection of most of the false matches and for coarsely aligning a
gallery-probe pair prior to a finer match using expensive algorithms such as
ICP. (Chapter 4 and 6)
• A novel approach to extract minimal feature-rich data points which only in-
cludes the matched local features is proposed to obtain a reduced dataset for
final ICP alignment. This reduction significantly increases the time efficiency
and the accuracy of the recognition system. (Chapter 4 and 6)
• Ear recognition experiments are performed on a new in-house database of pro-
file images with ear-phones and on the UND Collection-J which is the largest
publicly available profile database. High recognition rates are achieved with-
out expensive preprocessing such as an explicit extraction of the ear contours.
(Chapter 6)
• Two complete multimodal recognition systems fusing the ear and face biomet-
rics at score and feature levels are proposed based on the automatic extraction
of efficient local features. To the best of my knowledge, this is the first feature-
level fusion approach using 3D ear and face features extracted from the profile
and frontal face images respectively. (Chapter 7)
• The same type of local 3D features are used to represent both the ear and face
data at both the score and feature levels of fusion, which permits fair comparisons
of the two biometric traits and of the two fusion techniques. (Chapter 7)
• Fusion experiments are performed on the largest possible multimodal dataset
using publicly available profile (the UND) and frontal (the FRGC v2) face
databases. The performance of the face biometrics especially with non-neutral
expressions improves significantly when fused with ear biometrics using the
proposed score-level fusion technique. The proposed feature-level fusion ap-
proach achieves an accuracy comparable to that of the score-level fusion with-
out requiring ICP-like expensive matching algorithms. (Chapter 7)
1.4 Structure of the dissertation
This dissertation is organized as a series of papers published in internationally
refereed journals, books and conferences. Each paper constitutes an independent set
of work in the process of biometric recognition with minor overlaps. However, these
papers together contribute to a complete and coherent theme for human recogni-
tion using local 3D features. Chapters 2 to 7 correspond to our publica-
tions [79], [75], [80], [81] and [82] respectively. In Chapter 2, essential background
to unimodal and multimodal recognition is provided and a comprehensive review is
made to identify potential areas of contribution in the areas of 2D and 3D ear and
face detection, data representation and unimodal and multimodal matching. The
core content of this dissertation is laid out in Chapters 3 to 7. Chapter 3 describes
an ICP based ear recognition approach. A more efficient approach based on local
3D features is described in Chapter 4. An improved matching technique is proposed
in Chapter 5. Chapter 6 describes an efficient and complete recognition system for
the ear biometric starting from the detection to the (matching) decision making.
Chapter 7 describes two different fusion approaches to fuse the ear with the face
biometrics. I summarize my conclusions and provide suggestions for future work in
Chapter 8. A more detailed overview of each chapter is presented below.
1.4.1 A Review of Recent Advances in 3D Ear and Expression Invariant
Face Biometrics (Chapter 2)
In this chapter, a comprehensive review of unimodal and multimodal recogni-
tion using 3D ear and face data is presented. Associated data collection, detection,
representation and matching techniques are covered with a focus on the challeng-
ing problem of expression variations. All approaches are classified according to
their methodologies. Through the analysis of the scope and limitations of these
techniques, it is concluded that further research should investigate fast and fully
automatic ear-face multimodal systems robust to occlusions and deformations.
1.4.2 An ICP Based Hierarchical Matching Approach for 3D Ear Recog-
nition (Chapter 3)
In this chapter, a fully automatic and fast technique based on the AdaBoost
algorithm (fully described in Chapter 6) is used to detect a subject’s ear from his/her
2D and corresponding 3D profile images. A modified version of the Iterative Closest
Point (ICP) algorithm is then used for the matching of this extracted probe ear
to previously stored ear data in a gallery database. A coarse-to-fine hierarchical
technique is used where the ICP algorithm is first applied on low and then on high
resolution meshes of 3D ear data.
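The following sketch illustrates this coarse-to-fine strategy under stated assumptions: icp stands for any standard ICP routine that takes an initial transformation and returns a refined one together with a residual error, and the low-resolution stage is approximated here by simple point subsampling (the actual mesh reduction and parameters used in this chapter may differ).

```python
import numpy as np

def hierarchical_icp(probe_pts, gallery_pts, icp, step=4):
    # Stage 1: coarse alignment on low-resolution data (every step-th point);
    # cheap, because the cost of ICP grows with the number of points.
    coarse_T, _ = icp(probe_pts[::step], gallery_pts[::step], init=np.eye(4))
    # Stage 2: full-resolution ICP seeded with the coarse transformation, so
    # the expensive stage starts near the solution and converges quickly.
    fine_T, error = icp(probe_pts, gallery_pts, init=coarse_T)
    return fine_T, error
```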
1.4.3 A Fast and Fully Automatic Ear Recognition Approach Based on
3D Local Surface Features (Chapter 4)
In this chapter, an approach is proposed for human ear recognition based on
robust 3D local features. The features are constructed at distinctive locations in
the 3D ear data, with a surface approximated around each location based on its
neighborhood information. Correspondences are then established between gallery and
probe features and the two data sets are aligned based on these correspondences.
A minimal rectangular subset of the whole 3D ear data containing only the corre-
sponding features is then passed to the Iterative Closest Point (ICP) algorithm for a
finer alignment and the final recognition. Experiments were performed on the UND
biometric database.
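One standard way to realize the coarse alignment from matched feature locations is the closed-form SVD-based (Kabsch) rigid-body solution sketched below. This is an illustration of the principle rather than the exact procedure of this chapter; the input arrays are assumed to hold corresponding keypoint coordinates.

```python
import numpy as np

def coarse_align(probe_kp, gallery_kp):
    """probe_kp, gallery_kp: (N, 3) arrays of corresponding feature locations."""
    p_mean, g_mean = probe_kp.mean(axis=0), gallery_kp.mean(axis=0)
    # Cross-covariance of the centred correspondences.
    H = (probe_kp - p_mean).T @ (gallery_kp - g_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1) in the least-squares solution.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = g_mean - R @ p_mean
    return R, t  # a probe point p maps onto the gallery frame as R @ p + t
```

The resulting (R, t) can then serve as the initial transformation for the finer ICP alignment described above.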
1.4.4 Refining Local 3D Feature Matching through Geometric Consis-
tency for Robust Biometric Recognition (Chapter 5)
In this chapter, I eliminate some of the incorrect matches that occur during the
local 3D based matching. The approach is based on the geometric consistency among
the initial matches. Some related similarity measures are also computed and consid-
ered for the final matching decision. The performance of the approach is evaluated
on different datasets and compared with the performance of other approaches.
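A minimal sketch of the core idea is given below, using preservation of pairwise distances as the consistency test (a rigid transformation cannot change them). The tolerance and support threshold are illustrative assumptions; the measures developed in this chapter also include rotation consistency.

```python
import numpy as np

def filter_matches(matches, tol=2.0, min_support=0.5):
    """matches: list of (probe_xyz, gallery_xyz) pairs of matched keypoints."""
    kept = []
    for i, (p_i, g_i) in enumerate(matches):
        support = 0
        for j, (p_j, g_j) in enumerate(matches):
            if i == j:
                continue
            # For two correct matches, the probe-side and gallery-side
            # keypoint distances must agree (rigidity preserves distances).
            if abs(np.linalg.norm(p_i - p_j) - np.linalg.norm(g_i - g_j)) < tol:
                support += 1
        # Keep a match only if it is consistent with most of the other matches.
        if support >= min_support * (len(matches) - 1):
            kept.append((p_i, g_i))
    return kept
```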
1.4.5 Efficient Detection and Recognition of Textured 3D Ears (Chapter
6)
In this chapter, a very fast 2D AdaBoost detector is combined with a fast 3D local
feature matching and fine matching via an Iterative Closest Point (ICP) algorithm to
obtain a complete, robust and fully automatic system with a good balance between
speed and accuracy. Ear images are detected from 2D profile images using the
proposed Cascaded AdaBoost detector. The corresponding 3D ear data are then
extracted from the co-registered range image and represented with local 3D features.
Unlike previous approaches, local features are used to construct a rejection classifier,
to extract a minimal region with feature-rich data points and finally, to compute
the initial transformation for matching with the ICP algorithm. The performance
of the proposed approaches is also evaluated and compared with that of other approaches.
1.4.6 Fusion of 3D Ear and Face Biometrics for Robust Human Recog-
nition (Chapter 7)
In this chapter, a 3D local feature based approach is proposed to fuse ear and
face biometrics at the score and the feature levels of fusion. Score-level fusion is
performed using a weighted sum rule with some complementary weights. Feature-
level fusion is based on the similarity among the local features from the ear and
the face. Finally, the proposed fusion approaches are compared with other existing
approaches, demonstrating the superiority of these new techniques.
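As a sketch of the score-level rule, assuming distance-type scores and min-max normalization (the weight value and the normalization scheme are illustrative choices, not the tuned values of Chapter 7):

```python
import numpy as np

def minmax(s):
    # Map raw matching scores onto [0, 1] so the two modalities are comparable.
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def weighted_sum_fusion(ear_scores, face_scores, w_ear=0.4):
    # Complementary weights: w_face = 1 - w_ear. With distance-type scores,
    # the gallery entry with the smallest fused score is the best match.
    return w_ear * minmax(ear_scores) + (1.0 - w_ear) * minmax(face_scores)
```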
CHAPTER 2
A Review of Recent Advances in 3D Ear and
Expression Invariant Face Biometrics
Abstract
Biometric based human recognition is rapidly gaining popularity due to breaches
of traditional security systems and the lowering cost of sensors. The current research
trend is to combine multiple traits to improve accuracy and robustness. This paper
comprehensively reviews unimodal and multimodal recognition using 3D ear and face
data. It covers associated data collection, detection, representation and matching
techniques and focuses on the challenging problem of expression variations. All the
approaches are classified according to their methodologies. Through the analysis of
the scope and limitations of these techniques, it is concluded that further research
should investigate fast and fully automatic ear-face multimodal systems robust to
occlusions and deformations.
keywords
Biometrics, 3D Ear, 3D Face, Detection, 3D Data Representation, Multimodal
Recognition, Facial Expressions
2.1 Introduction
Incidents of breaching traditional password or ID based systems have been in-
creasing significantly as indicated by recent reports. CIFAS [43], the UK’s fraud
prevention service reports a 16% increase in identity frauds in the UK in the year
2008. In the USA, the number of victims of such fraud increased by 22%, resulting in
a loss of $48 billion over the year 2008 [87]. The rapid increase in security breaches is
a strong justification for replacing or augmenting traditional ID-based systems with
biometric systems which are based on human traits that cannot be stolen, denied or
faked easily. Biometric systems can be applied efficiently in many government ap-
plications (such as passports, voter ID and driver's licenses) and in forensic, security
and law enforcement applications. Driven by these priorities, the allocation of govern-
ment funds and the increasing computational efficiency of modern computers have
boosted the emergence of biometric recognition systems over the last few years [14].
This article is under review in ACM Computing Surveys, November 2009.
Figure 2.1: Salient features of an external ear: the Helix, Antihelix, Tragus,
Antitragus, Triangular Fossa, Incisure Intertragica, Concha/Ear Pit and Lobe
(left and right images are 2D and 3D views respectively of the same ear).
Unlike traditional recognition systems, the success of a biometric system depends
not only on the accuracy but also on the acceptability of the biometric trait being
used. The face and the ear have become popular due to their rich set of
distinctive features (as illustrated in Fig. 2.1 for the ear) as well as the possibility
of easy and non-intrusive acquisition of their images. Ample research has been per-
formed in the last few years proposing different methods of using these two biometric
traits for identification and authentication purposes. However, the accuracy and the
robustness required for real-world applications are still to be achieved. This implies
that it would be beneficial to look at existing and proposed approaches to identify
the challenges and suggest future research directions.
Pose variations and changes in facial expressions are the most challenging prob-
lems in fully exploiting the non-intrusiveness of face recognition techniques [189].
Although 3D approaches are sufficiently robust to pose variations compared to their
2D counterparts, the non-rigid deformations due to facial expressions severely affect
the results of 3D approaches. Consequently, expression variance became one of the
main focuses of the Face Recognition Grand Challenge (FRGC) [141] and gained
substantial interest in the computer vision and pattern recognition research commu-
nities. A few promising methods have been proposed to tackle the problem, however,
none of the approaches is free from limitations and able to fully solve this crucial
problem. Several survey papers ([3, 17, 96, 148, 191]) and books ([46, 147, 193, 100])
have been published on face recognition in general, but to the best of our knowledge,
there is no review devoted to the challenging problem of facial expression variance.
Therefore, in this survey, we focus on 3D face recognition approaches which are
significantly robust to non-neutral facial expressions.
In contrast to the face, the shape of the ear is not affected by expression changes
and also does not change with aging between the ages of 8 and 70 years [72]. Although
these properties have made the ear very popular in the research community, there are very few
Figure 2.2: Taxonomy of the biometric approaches with ear and face that are covered
in this manuscript with reference to the relevant section number within braces. The
taxonomy comprises: 3D Data Acquisition [2.3.1] (passive stereo, e.g. Geometrix;
pure structured light, e.g. Minolta; hybrid, e.g. Qlonerator); 2D/3D Detection
[2.4, 2.5], covering the face [2.4] (2D appearance-based [2.4.1] using AdaBoost, SVM,
SMQT and SNoW, or fuzzy logic; and 3D [2.4.2]) and the ear [2.5] (2D [2.5.1] using
outer helix curves, an RBF network, one-line based landmarks and 2D masks, or
intensity difference and AdaBoost; 3D [2.5.2] using landmarks and 3D masks, 3D
template matching, or an ear shape model; and 2D and 3D [2.5.3] using global-to-local
shape registration, or nose tip and ear pit detection and the snake algorithm); 3D
Representation [2.6] (global feature based [2.6.1]: COSMOS, SFR, balloon image;
local feature based [2.6.2]: point signatures, iso-contours, spin image (SI), spherical
SI, tensor, LSP, L3DF); and 3D Recognition [2.7-2.9], covering the face with
expression variations [2.7] (rigid [2.7.1] and non-rigid [2.7.2]), the ear [2.8] (using
local features [2.8.1], using global features [2.8.2], or without using any features
[2.8.3]) and ear-face multimodal approaches [2.9] (e.g. score level fusion).
survey papers ([143, 38, 69, 118]) and most of those describe the 2D and the early
exploratory 3D approaches. The current trend is to use 3D data for ear recognition,
however, some of the recent approaches use either only 2D or both 2D and 3D
information for ear detection. Therefore, in this survey we provide a comprehensive
and up-to-date review of the existing 2D/3D detection and 3D recognition techniques
proposed for the ear biometrics.
Considering the problems of unimodal biometrics (see Section 2.2), attempts have
been made to integrate the face with other biometric traits such as gait [194, 55],
palmprint [56, 97], fingerprint [166, 158], images of the iris [164] and recently, the
ear (see Section 2.9). In addition to the robustness to expression variations and
aging, the ear has another advantage over other alternatives due to its proximity
to the face: ear data can easily and non-intrusively be collected (with the same or
similar sensor) along with the face image. Ear images can efficiently supplement
face images when frontal views are difficult to collect or are occluded. Although
most of the ear-face multimodal approaches report better accuracy and robustness
than individual modalities, it is still important to investigate effective approaches
for fusion by analyzing the existing approaches.
We provide a more comprehensive review than past survey papers on face or
ear recognition by including the data collection, detection and representation tech-
niques currently proposed for these two modalities. A taxonomy of the approaches
is illustrated in Fig. 2.2. In addition, the latest relevant surveys in the field of face
and ear were in 2007 [3, 38], and the latest on data representation was in 2005 [119]. Our
previous review work in these areas [77, 78] also first appeared in 2007. Given that
technology changes rapidly, the authors felt an up-to-date and more elaborate survey
would make a valuable contribution to the field.
The paper is organized as follows. Preliminary concepts are introduced in the
next section. Techniques for acquisition and detection of face and ear data are
described in Sections 2.3, 2.4 and 2.5 respectively. Representation techniques used
for face and ear biometrics are described in Section 2.6. Recent approaches for
recognition of these two biometric traits are summarized in Sections 2.7 and 2.8.
Multimodal approaches involving these are described in Section 2.9. The challenges
to be met for improving the existing systems are outlined in Section 2.10 and the
conclusions are provided in Section 2.11.
2.2 Preliminary Concepts
2.2.1 Biometrics
Biometrics can be defined as the automatic recognition of a person based on his
or her physiological traits (such as face, fingerprint, palmprint, iris and DNA) or
behavioral traits (e.g. handwriting, gait, signature) [14, 193, 146]. Desirable char-
acteristics of a biometric trait are: universal availability to everyone, distinctiveness
to individuals, permanence over a long period of time and quantitative measurabil-
ity [86].
2.2.2 2D versus 3D Biometrics
Biometrics can be broadly categorized as 2D and 3D, based on the type of data
used. Although 2D images are easier and less expensive to acquire, they have many
inherent problems such as variance to pose and illumination, and sensitivity to the
use of cosmetics, clothing and other decorations. Biometric systems using 3D data
are potentially free of these problems. However, 3D data sometimes suffer from
spikes and holes due to sensor errors.
2.2.3 Unimodal Biometrics
Unimodal biometric systems operate with a single type of biometric data. These
systems suffer from a number of problems including noise in sensed data, intra-
class variations, inter-class similarities (i.e. overlaps of features in the case of large
databases), non-universality (for example, in fingerprint systems, incorrect minutiae
may be extracted in the case of poor quality of the ridges) and spoof attacks [158,
146, 86, 85, 110, 142].
2.2.4 Multimodal Biometrics
A system may be called multimodal if it collects data from different biometric
sources, uses different types of sensors (e.g. infra-red or reflected light), or uses
multiple samples of data or multiple algorithms [158] to combine the data [16]. Thus,
in multimodal systems a decision can be made on the basis of different subsets of
biometrics depending on their availability and confidence. These systems are also
more robust to spoof attacks as it is relatively difficult to spoof multiple biometrics
simultaneously. Woodard et al. [166] compared the performance of different modali-
ties and reported a 97% rank-1 recognition rate for a multimodal approach with ear,
face and fingerprint and 80%, 93% and 80% for individual modalities respectively
on a dataset of 425 samples of each modality from 85 subjects.
2.2.5 Fusion Techniques
Different biometric modalities can be fused or combined at different levels of
the biometric recognition process. Based on the level of fusion, Jain et al. [84]
categorized the fusion techniques as illustrated in Fig. 2.3.
Prior to classification, fusion can be performed at the data or sensor level com-
bining raw data from different sensors or at the feature extraction level by combining
feature vectors obtained by either using different sensors or employing different fea-
ture extraction algorithms. Techniques for fusing information after the classification
or matching stage can be grouped into four categories as shown in Fig. 2.3 (adapted
from [84]). In dynamic classifier selection, the results of the classifier that is most
likely to provide the correct decision for the specific input pattern are chosen [167, 9].
When each biometric matcher produces a match score indicating the proximity of
the input data to a template, fusion is performed
Figure 2.3: Classification of fusion techniques in multimodal biometric systems
(adapted from [84]): pre-classification level fusion (at the sensor or data level, or
at the feature level) and post-classification level fusion (dynamic classifier selection,
and fusion at the measurement or match score level, the rank level, or the abstract
or decision level).
at the measurement or match score or confidence level. Fusion can also be performed
at the rank level when the output of each biometric matcher is a subset of possible
matches sorted in decreasing order of confidence, and at the abstract or decision
level when each biometric matcher individually decides on the best match based on
the input presented to it (refer to [145, 84, 146, 147] for details).
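For instance, rank-level fusion is often illustrated with the Borda count. A minimal sketch follows, assuming each matcher returns one distance score per gallery entry; this is the generic Borda rule, not a method from any of the surveyed papers.

```python
import numpy as np

def borda_fusion(score_lists):
    # score_lists: one distance array per matcher, all over the same gallery.
    total = np.zeros(len(score_lists[0]))
    for scores in score_lists:
        total += np.argsort(np.argsort(scores))  # rank of each entry (0 = best)
    return np.argsort(total)                     # fused ranking, best first
```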
2.3 Image Acquisition Techniques
Data collection is the very first step in a biometric recognition system. Since
the detailed description of different data acquisition systems is not within the scope
of this paper, a short description of those involving still images is provided in this
section for completeness. We also summarize the existing ear and face databases
which are publicly available for research purposes.
2.3.1 Techniques
In a non-intrusive application, ear and face data can be collected from still im-
ages or video sequences of the profile and frontal views of a subject respectively.
2D images can be captured using ordinary cameras. However, 3D data collection
requires some special sensing devices.
Existing approaches for acquiring 3D data can be classified into three categories:
passive stereo (e.g. the Geometrix system), pure structured light (e.g. the Minolta
sensor) and the hybrid approach (e.g. the 3Q ‘Qlonerator’ system). In the first
approach, 3D locations of a subject are computed from two images taken by two
cameras with a known geometric relationship. In the second approach, light patterns
projected from a light projector are detected in the image of the subject to compute
Table 2.1: Some existing profile and frontal face databases

Database Name and Source | Data Type | #Sub. | #Img. | Image Description
FRGC v2 [141] | 2D, 3D | 466 | 50000 | Frontal; neutral and smiling; collected using Minolta Vivid 900/910.
FERET [51] | 2D Color | 1199 | 14,126 | 15 sessions with time lapse up to two years.
PIE, CMU [44] | 2D | 68 | 41368 | 13 poses, 43 illuminations, 4 expressions.
NIST Mugshot Identification [129] | 2D | 1573 | 3248 | 131 and 89 cases with two or more front and profile views respectively.
CAS-PEAL [25] | 2D | 1040 | 30900 | 27 poses, 5 expressions, 6 accessories and 15 lighting directions.
IV2 Multimodal Biometric [140] | 2D, 3D | 300 | 2400 | Face data collected using Minolta Vivid 7000 with five facial expressions, two different lighting conditions and three poses (frontal, left profile and right profile).
3D RMA [1] | 3D | 120 | 360 | Three poses, collected using the structured light technique.
The Bosphorus [15] | 3D | 105 | 4666 | Various poses, expressions and occlusion conditions.
BU-3DFE [12] | 2D, 3D | 100 | 2500 | Neutral and six more expressions with four different lighting conditions. Data from 56 males and 44 females.
GavabDB [123] | 3D | 61 | 549 | Nine images from each subject: 2 frontal views with neutral expression, 2 x-rotated views (30), 2 y-rotated views (90) and 3 different expressions.
BioID [13] | 2D | 23 | 1521 | Upright frontal images with a resolution of 384x286 pixels.
UH [91] | 2D | N/A | 884 | With neutral and different expressions while reading aloud; with or without ear and face accessories; acquired using a 3dMD-based prototype system.
The Yale Face Database [173] | 2D | 15 | 165 | Six expressions; three poses, with or without glasses.
USTB-III [160] | 2D | 79 | 220 | Profile and other poses and partial occlusions.
UND-F [155] | 2D, 3D | 302 | 942 | Profile, time lapsed, collected using Minolta Vivid 910.
UND-J2 [156] | 2D, 3D | 415 | 1800 | Profile, time lapsed.
UCR-ES2 [32] | 3D | 155 | 902 | Profile images with pose variations (±35 degrees), six shots per subject taken all on the same day using Minolta Vivid 300.
the 3D locations. In the hybrid approach, instead of a single camera, a stereo camera
rig is used along with a light projector, and 3D locations corresponding to the projected
light patterns are computed as in the first approach. In applications where a large number of
corresponding points are required, approaches with light patterns are helpful as
they simplify the selection of such points. Details of the approaches can be found
in [17].
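The geometric core of the passive stereo approach is point triangulation. A minimal linear (DLT) sketch is given below, assuming the "known geometric relationship" is available as 3x4 projection matrices for the two cameras; pure structured light replaces the second camera's measurement with the known direction of a projected pattern, which makes finding the correspondences trivial.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # P1, P2: 3x4 camera projection matrices; x1, x2: matching pixels (u, v).
    # Each view contributes two linear constraints on the homogeneous 3D point.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # homogeneous -> Euclidean coordinates
```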
An example of a Minolta scanner and a sample of the range data it can capture
are shown in Figure 7.1.
2.3.2 Existing Databases
There are many databases available with different types of 2D and 3D data. Some
large and widely used (in recent publications) databases for ear and face recognition
are described in Table 2.1. Interested readers are referred to [62, 73] for a summary
of other databases.
2.4 Face Detection Techniques
Face images with a controlled background can be easily detected using color or
motion or both. However, detection of faces with unconstrained backgrounds is
comparatively difficult. A few recent and widely used pose and expression invariant
detection techniques (based on still images) are described in this section.
2.4.1 2D Approaches
As described in earlier surveys ([96, 183, 54, 182, 66]), most of the face detec-
tion approaches are based on 2D information. Kong et al. [96] categorized these
approaches as knowledge-based (encoding human knowledge to capture relation-
ship between facial features), feature-invariant (searching features unaffected by the
variations in pose, viewpoint or lighting conditions), template matching (using pat-
terns of the whole face or facial features to find correlation between faces) and
appearance-based approaches (capturing representative variability of facial appear-
ance by learning the models or templates using a set of training images). Since the
first three types of approaches do not work well in the case of small and poor quality
images, most of the modern approaches are appearance based and can be further
classified as below:
Using AdaBoost-Based Classifiers One of the most popular, fast and robust face
detection algorithms is the cascaded AdaBoost algorithm proposed by Viola and
Jones [161]. This method is characterized by the use of the integral image and rectangular
Haar-like features with a cascade of classifiers, where each successive classifier is
based on the rejection or acceptance result of the previous classifier. The authors
reported a detection rate of 93.7% with 422 false positives on the MIT+CMU face
database. A speed of 15 frames per second was obtained while scanning 384 by 288
pixel images using multiple scales of 24×24 size patches on a conventional 700 MHz
Intel Pentium III machine. The algorithm has been followed by many variations
and improvements such as for occluded images (e.g. [60]), for rotation invariance or
multi-view (e.g. [104, 68, 33]) and for speed up (e.g. [151, 121, 64, 168, 137]). Instead
of using Haar-like features, recently, Meynet et al. [113] used Gaussian features and
Xiaohua et al. [169] proposed using Gabor features and hierarchical regions in a
cascade of classifiers.
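The efficiency of this family of detectors rests on the integral image, which reduces the sum over any rectangle to four array look-ups; the brief sketch below shows the table and one standard two-rectangle Haar-like feature (the parameter names are illustrative).

```python
import numpy as np

def integral_image(img):
    # Summed-area table with a zero row and column prepended, so that any
    # rectangle sum costs exactly four look-ups.
    return np.pad(np.cumsum(np.cumsum(img, axis=0), axis=1), ((1, 0), (1, 0)))

def rect_sum(ii, r, c, h, w):
    # Sum of pixels in the h-by-w rectangle whose top-left corner is (r, c).
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_feature(ii, r, c, h, w):
    # Two-rectangle Haar-like feature: left half minus right half.
    return rect_sum(ii, r, c, h, w // 2) - rect_sum(ii, r, c + w // 2, h, w // 2)
```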
Using Local SMQT Features and Split SNoW Classifiers Nilsson et al. [128] used
the combination of Successive Mean Quantization Transform (SMQT) features and
a split up Sparse Network of Winnows (SNoW) classifiers for frontal face detection.
With 15 false positives, they obtained around a 100% detection rate on the BioID
database and around 81% on the MIT+CMU database. However, considering both
databases together, the reported detection accuracy was 95% with a 1.93× 10−7
false positive rate. Fig. 2.4 shows a sample of face detection using this algorithm
compared to the Open Source Computer Vision Library (OpenCV) [131] based im-
plementation of the cascaded AdaBoost algorithm. The authors demonstrated that
the algorithm performs better than other approaches on the BioID database, how-
ever, the approach would fail in detecting faces smaller than 32 × 32 as this is the
fixed size of the patch used for detection. Although the detection time is not re-
ported, the requirement of down-sampling the image instead of the features/patches
would limit the speed of the detection.
Using Support Vector Machine (SVM) Classifiers One of the earliest approaches
to face detection using SVM was proposed by Osuna et al. [132]. They detected
vertically oriented and un-occluded frontal views of human faces in gray level images.
The approach was further enhanced for multi-view face detection in [102] and [174].
Most recently, Hotta [67] used SVM classifiers with horizontal rectangular features
and a combination kernel of various sizes for view-independent face detection.
Using Fuzzy Logic Ghiass and Sadati [59] used some filters to obtain binary
images from multi-view face images. They then produced fuzzy models from the
distribution of zeros in the faces and used a fuzzy approach for face detection.
Because the approach is independent of skin color, persons of any skin color can be
detected. Experiments on 60 images of the Yale Database B under
Figure 2.4: Comparative results of 2D face detection (performed and illustrated
in [183]): (a) using the OpenCV-based implementation of cascaded AdaBoost [131]
and (b) using the approach of Nilsson et al. [128] (best seen in color). Notice that
the approach in (a) failed to detect the face of the first person.
different illumination conditions showed an average error rate of 1.2%.
Figure 2.5: Samples of 3D face detection: (a) in the presence of occlusions, (b) in
the case of multiple faces in a scene and (c) under large scale variations [126].
2.4.2 3D Approaches
Mian et al. [114] proposed a simple but fully automatic approach for 3D face
detection and obtained 98.3% accuracy on the FRGC v2 dataset. At first, they
detected the nose tip from the 3D depth images and then took a spherical region
around that. However, the system assumes that the image contains only a single
face, that pose variations do not exceed 15 degrees along the x and y-axes, and that the nose tip
is not occluded. Moreover, the segmentation is not robust to scale variance because
a sphere of pre-defined radius is cropped over the entire database.
Colombo et al. [45] proposed detecting salient features of a 3D face (e.g. nose and
eye) by analyzing the mean and Gaussian curvature of the surfaces in the scene. A
set of candidate faces was built grouping pairs of candidate eyes with the candidate
noses. Distances between eyes and noses were computed and compared with those
in a typical human face, and the candidate face was discarded if the disagreement was
too high. The portion of the surface under a candidate face was projected into a
new range image and finally processed by a PCA trained classifier to discriminate
between faces and non-faces. On a set of 150 3D faces the approach achieved a 96%
detection rate. However, the approach is highly sensitive to the presence of outliers
and holes around the eyes and nose regions.
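Mean and Gaussian curvature can be estimated directly from a range image treated as a Monge patch z = f(x, y). The following sketch uses finite-difference derivatives and assumes unit grid spacing; in practice the depth map would be smoothed first.

```python
import numpy as np

def mean_gaussian_curvature(z):
    # First and second derivatives of the depth map by central differences.
    zy, zx = np.gradient(z)            # along rows (y) and columns (x)
    zyy, zyx = np.gradient(zy)
    zxy, zxx = np.gradient(zx)
    g = 1.0 + zx**2 + zy**2
    K = (zxx * zyy - zxy**2) / g**2                    # Gaussian curvature
    H = ((1 + zx**2) * zyy - 2 * zx * zy * zxy
         + (1 + zy**2) * zxx) / (2 * g**1.5)           # mean curvature
    return H, K
```

The signs of H and K then classify each point into surface types (peak, pit, ridge, valley or saddle), from which candidate noses and eyes, or the ear pit used in Section 2.5.3, can be hypothesized.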
Niese et al. [127] performed 3D point clustering using texture information to
localize the face. The approach is limited to pose variations up to ±45◦ from the
upright position and also relies on the availability of a texture map. However, it is
significantly robust to illumination and expression variations.
Recently, Nair and Cavallaro [126] proposed using a Point Distribution Model
(PDM) for detecting a face from its 3D mesh data. They extracted candidate
locations (inner eye and nose tip vertices) on the mesh from low-level curvature-
based feature maps and used them for fitting the model, thus, not relying on texture,
pose or orientation information. The face was then detected by classifying the
transformations between model points and candidate vertices based on the upper-
bound of the deviation of the parameters from the mean model. A 99.6% detection
rate was achieved on 827 meshes (427 face meshes from GavabDB face database and
400 object scans from the NTU 3D Model Database ver.1). The system is robust
to different facial expressions and minor occlusions. It also works well with multiple
faces in the same scene and with scale variations (see Fig. 2.5). However, the model
fitting procedure of this approach is time consuming, taking the authors 121 s on
average over the GavabDB database on a 3.2 GHz Intel Pentium 4 CPU.
2.5 Ear Detection Techniques
In the case of non-intrusive approaches, ear data are segmented from the profile
images which can vary in appearance under different viewing and illumination con-
ditions. Therefore, extracting ear data from arbitrary profile images is a challenging
problem. Existing approaches are categorized and described briefly in this section.
Figure 2.6: Different steps of localizing the ear in the approach of Ansari and Gupta
in [7]: (a) convex and concave edges extracted from a side face image, (b) possible
outer helix curves and (c) the completed ear boundary (best seen in color).
2.5.1 Using Only 2D Data
Using Outer Helix Curves Ansari and Gupta in [7] utilized the shape of the
outer helix curve of the ear for its localization on a 2D profile image with arbitrary
background. They extracted the edges using the Canny edge detector [24] and
segmented them into convex and concave edges (see Fig. 2.6). They eliminated non-
ear edges and found the final outer helix curve based on the relative values of angles
and some predefined thresholds. Then, the two end points of the helix curve were
joined with straight lines to get the complete ear boundary. They obtained 93.34%
ear localization accuracy on a database of 700 samples. The approach does
not require any template and can localize ears rotated in any direction; however, it
fails for images which are of poor quality or occluded by hair.
Using Radial Basis Function (RBF) Network He et al. [65] used an RBF network
to map a side face image to a surface in which the steepest peak indicates the
location of the ear in the side face image. They convolved the mapping surface
with a Laplacian mask and extracted the ear as a 200×120 pixel image from the
input side face image of size 320×240 pixels. Yuizono et al. [187] also used a similar
approach, however, they applied both pyramid hierarchy and sequential similarity
detection algorithms to speed up the extraction process. No occlusion with hair or
earrings is considered in this approach.
Using One-line Based Landmarks and 2D Masks Yan and Bowyer [176] proposed
manually selecting the Triangular Fossa and the Incisure Intertragica (see Fig. 2.1) on the
original 2D profile image and then drawing a line to be used as a landmark. The
landmark was used to find the orientation and size of the ear. A mask was then
rotated and scaled accordingly and applied on the original image to crop the ear
data. The authors used this method for PCA-based and edge-based matching.
Figure 2.7: Samples of ear detection under severe occlusion using AdaBoost [74]
(best seen in color).
Utilizing Intensity Difference and AdaBoost Algorithm Ear contours were de-
tected based on illumination changes within a chosen window by Choras [37]. In
this method the difference between the maximum and minimum intensity values of a
window is compared to a threshold computed from the mean and standard deviation
of that region to decide whether the center of the region belongs to the contour of
the ear or to the background.
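A sketch of this windowed test is given below; the window size, the weighting factor and the exact threshold combination are illustrative, as the constants of the original paper are not reproduced here.

```python
import numpy as np

def contour_mask(img, win=7, k=1.5):
    # Mark a pixel as an ear-contour candidate when the max-min intensity
    # spread in its window exceeds a threshold formed from the window mean
    # and standard deviation.
    r, (h, w) = win // 2, img.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = img[y - r:y + r + 1, x - r:x + r + 1]
            mask[y, x] = patch.max() - patch.min() > patch.mean() + k * patch.std()
    return mask
```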
In recent work, Islam et al. [74] proposed an ear detection approach based on
the AdaBoost algorithm described in [149]. Rectangular Haar-like features that
compute the difference of intensity in neighboring regions were used for training
and detection. The approach is fully automatic, very fast and is not affected by
significant occlusions and degradation of the input images (see Fig. 2.7). The
authors reported a detection rate of 99.9% while testing on the UND Biometrics
Database with 830 images of 415 subjects. A 100% detection rate was obtained for
203 images of the UND-F dataset with a False Acceptance Rate (FAR) of 5× 10−6.
A 480 by 640 test image can be scanned in 7.66 ms on a Core 2 Quad 9550, 2.83
GHz machine using a C++ implementation of this ear detection algorithm.
A variant of the AdaBoost algorithm called the Gentle AdaBoost algorithm
and some asymmetric Haar-like features were used by Li and Zhang [186] for ear
detection. On the USTB database of 220 images (which were also used as positive
samples in training the algorithm), they obtained a 99.5% detection rate at 0.023
FAR. Respective results on the CAS-PEAL dataset of 166 images were 98.8% and
0.036%. Detection time was not reported.
2.5.2 Using Only 3D Data
Using Two-line Based Landmarks and 3D Masks Yan and Bowyer [176] drew
two lines on the original range image: one line along the border between the ear and
the face, and the other from the top of the ear to the bottom. They used these lines
to find the orientation and scaling of the ear. A mask was then rotated and scaled
accordingly and applied on the original image to crop 3D ear data in an ICP-based
matching approach. The approach is simple, but requires manual intervention.
Using 3D Template Matching Chen and Bhanu [29] built a template model repre-
sented by an average histogram of shape index values in off-line mode. Their on-line
ear detection mode includes: step edge detection and thresholding, image dilation,
connected-component labeling and template matching. They obtained 91.5% cor-
rect detection rate with 2.52% false alarm rate on 30 side face range images of 30
subjects collected using Minolta Vivid 300. The approach is simple and compara-
tively less expensive as it only searches the potential ear regions around the step
edges, avoiding an exhaustive search over the entire test image. However, detected
rectangular regions sometimes do not include the full ear and sometimes include
some extra regions around the ear.
Using Ear Shape Model Chen and Bhanu [31] represented an ear shape model
by a set of discrete 3D vertices on the ear helix and anti-helix parts and aligned it
with the range images for detecting those ear parts. They reported 92.6% detection
accuracy on the UCR dataset with 312 images and an average detection time of 6.5
sec on a 2.4G Celeron CPU.
2.5.3 Using Both 2D and 3D Data
Global-to-Local Shape Registration Chen and Bhanu [32] used both color and
range images and a global-to-local shape registration. They obtained 99.3% and
87.71% detection rates on the University of California, Riverside (UCR) ear database
of 902 images from 155 subjects and on the University of Notre Dame (UND)
database of 302 subjects respectively. The average detection time reported is 9.48
sec for the UCR dataset on a 2.4G Celeron CPU.
Using Nose-tip and Ear-pit Detection and the Snake Algorithm Yan and Bowyer [179]
proposed taking a predefined sector from the nose tip in order to locate the ear re-
gion. They cropped out the non-ear portion from that sector by skin detection
and detected the ear pit using Gaussian smoothing and curvature estimation. They
then applied an active contour algorithm to extract the ear contour. As illustrated
in Fig. 2.8, they used the ear pit as the starting point for the snake algorithm. Us-
ing the color or the depth information separately for the active contour algorithm,
detection accuracies of 79% and 85% were obtained respectively. The accuracy was
improved to 100% using both the color and depth information. The approach is
fully automatic, however, it will fail if the ear pit is not visible.
Figure 2.8: Block diagram of the ear detection approach proposed by Yan and
Bowyer [179]. Starting from a profile face image (2D color and 3D range image),
the pipeline comprises: (i) nose tip detection, with preprocessing to drop the
shoulder and some hair areas, cropping of a sector from the nose tip with a radius
of 20 cm and spanning +/- 30 degrees, and removal of non-ear skin regions by
transforming each 2D pixel into the YCbCr color space and color matching; (ii) ear
pit detection, applying Gaussian smoothing with an 11x11 pixel window to remove
noise, calculating the Gaussian (K) and mean (H) curvatures, grouping 3D points
with the same curvature label into regions, selecting regions with K>0 and H>0 as
pit regions, and applying a symmetric voting method to select the real ear pit; and
(iii) ear extraction using an active contour algorithm, with the initial contour an
ellipse centered on the ear pit (major axis 20 pixels, minor axis 30 pixels), growing
the contour with appropriate parameters for 150 iterations and cropping the final
contour as the extracted ear.
2.5.4 Discussion of the Ear Detection Techniques
In this section, ear detection techniques are categorized based on the data used
for detection. Approaches using only 2D data can also be used for 3D ear recognition
because using modern acquisition devices (see Section 2.3) both 2D and range images
can be collected together and they can be co-registered. Therefore, after detection
of the ear from 2D images, corresponding 3D data can be extracted from the co-
registered range image.
Apart from the type of data used, as discussed in this section, some approaches
consider only a close and small side face view around the ear while others consider
the whole profile image. Similarly, some approaches only localize and perform recog-
nition on a roughly cropped ear region, while others apply further elimination
techniques to crop ear data concisely from the background. Besides, some approaches
require manual intervention for initialization while others are fully automatic.
There are performance variations among the above approaches. For exam-
ple, higher accuracy can be achieved using a concise crop of the ear data, which in
turn may increase the detection time. Although AdaBoost based detection is very
simple and fast, it requires training with a large population. Therefore, the choice
of the most appropriate technique should depend on the type and requirements of
the application.
2.6 3D Data Representation Techniques
Prior to recognition, data should be structured in an appropriate manner to
enable efficient matching. Many techniques have been proposed for 2D and 3D
object representation as described and analyzed in [23], [111] and [119]. In this
section, we only describe those representation techniques which can represent 3D
ear and face effectively. Based on how the whole face or ear is defined or how the
matching can be performed using the representation, we categorize the techniques
as global and local and discuss them briefly in the following sections.
2.6.1 Global Representation
COSMOS: Dorai and Jain [48] proposed a representation technique for 3D free-
form objects which they called Curvedness-Orientation-Shape Map On Sphere (COS-
MOS). In this representation, local and global shape information of an object such
as surface area, curvedness and connectivity are integrated in terms of maximal
surface patches of constant shape index [95]. The patches are mapped onto the
unit sphere via their orientations and aggregated via their shape spectral functions.
Since the ear and the face are objects with arbitrary curves and holes, they can be
represented with this approach; however, it requires the associated range data to
be occlusion free.
Figure 2.9: Global data representation: (a) the SFR [114] and (b) the balloon image
(SSR convexity map) [124].
Spherical Face Representation (SFR): Mian et al. [114] proposed the Spherical
Face Representation (SFR) for representing 3D face data. Here, the point cloud
is quantized into spherical bins rather than 3D grids as in the case of their tensor
representation. As shown in Fig. 2.9(a), an n bin SFR is computed by quantizing
the distances of all points from the origin (e.g. nose tip in the case of the face)
into a histogram of n + 1 bins. The authors demonstrated that SFRs belonging to
the same individual follow a similar curve shape, while those of different individuals
follow different shapes.
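A minimal sketch of an n-bin SFR, assuming the scan is indexed by its detected nose tip; the bin range and the normalization are illustrative choices.

```python
import numpy as np

def sfr(points, nose_tip, n=15):
    # Quantize the distances of all points from the origin (the nose tip)
    # into a histogram of n + 1 spherical bins.
    r = np.linalg.norm(points - nose_tip, axis=1)
    hist, _ = np.histogram(r, bins=n + 1, range=(0.0, r.max()))
    return hist / len(points)   # normalize so scans of different sizes compare
```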
Balloon Image Representation: Pears [138] proposed sampling the Radial Basis
Function (RBF) model of an object (at arbitrary resolutions in 3D space) over a set
of concentric spheres to construct a spherically sampled RBF (SSR) histogram, also
called a balloon image (see Fig. 2.9(b)). A convexity value is computed from this
histogram which is then used to estimate the volumetric intersection of the object
and a bounding sphere, centered on any object surface point. Minimization of this
volume is used to define and localize high curvature surfaces on the object.
This representation is pose invariant and relatively immune to missing parts, as
the RBF function is defined everywhere in 3D space. It also performs well in the
presence of noise as the SSR convexity values are derived as a summation, which has
the effect of suppressing (averaging) noise. It can be used for localizing the nose tip
on face images and then aligning the face to a 3D upper face template using either
ICP on nose-centered data or the method of “isoradius contours” [139].
Figure 2.10: Local surface data representation: (a) point signatures (face surface
and sphere), (b) iso-contours, (c) the spin image, (d) the tensor, (e) a feature point
(asterisk), its neighbors (dots) and the basic constituents of an LSP [32] and (f)
L3DF [83]: a local surface (right image) and the region of the ear from which it is
extracted (shown by a circle on the left image).
2.6.2 Local Representation
Point Signatures: In the point signatures representation [41], 1D signatures are
extracted from a surface. A sphere of predefined radius is centered at a point on the
surface. The intersection of this sphere with the object’s surface gives a 3D space
curve. A plane is fitted to this space curve at the center point and is translated
in the direction of its normal to the center of the sphere. Next, the 3D curve is
projected perpendicularly to the translated plane, forming a new 2D curve. This
projection of points from the 3D curve to 2D curve forms a signed distance profile
known as point signatures. The starting point of this signature is defined by the
point on the signature that gives the maximum distance from the 3D curve. Point
signatures are calculated for every point on the object’s surface.
Although the point signature representation is very simple to implement, more than
one point on the 3D curve may lie at the same maximum distance from the plane,
making the representation ambiguous. The starting point of the signature
is also very sensitive to noise and hence the representation is not stable and robust.
Iso-Contours: Mpiperis et al. [124] proposed this technique for representing the
face. As shown in Fig. 2.10(b), they represented 3D information of the face surface
by a set of planar curves formed by the intersection of the surface with equidistant
parallel planes.
Spin Image (SI) Representation: A spin image is a representation of a data point
belonging to a surface using a 2D array of values (i.e. 2D image) generated like a
sheet spinning about the normal of that point [89]. In this approach, the ear or the
face data can be represented by a stack of spin images computed at each vertex of
its mesh (see Fig. 2.10(c)). It requires normal estimation, which may be corrupted
by residual noise and missing parts in the data.
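The construction reduces to a few lines: every neighbour q of an oriented point (p, n) maps to a radial distance alpha and a signed height beta, which index a 2D histogram. The bin count and support size below are illustrative.

```python
import numpy as np

def spin_image(points, p, n, bins=16, support=30.0):
    # (p, n): a surface point and its unit normal. Each neighbouring point
    # contributes alpha = distance from the normal axis and beta = signed
    # height along the normal.
    d = points - p
    beta = d @ n
    alpha = np.sqrt(np.maximum((d ** 2).sum(axis=1) - beta ** 2, 0.0))
    img, _, _ = np.histogram2d(alpha, beta, bins=bins,
                               range=[[0.0, support], [-support, support]])
    return img
```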
Sphere-Spin-Image (SSI) Representation: Wang et al. [162] proposed SSIs to
represent the local shape of points on a facial surface by means of a histogram. This
is obtained by mapping the 3D coordinates of points lying within a spherical space,
centered at that point, into a 2D space. Points are selected by means of a minimum
principal curvature analysis. For each subject a single SSI series is built up and a
simple correlation coefficient is used to compare the similarity between SSI series of
different subjects.
Tensor Representation: Mian et al. [116] proposed representing 3D surfaces with
third order tensors. In their approach, the cloud of points representing a view of
an object (here, face or ear) is first converted into triangular meshes. Then, the 3D
object in the mesh is quantized into a 3D grid to construct a third order tensor (see
Fig. 2.10(d)). Since the co-ordinate basis used to define the grid is extracted from
the underlying surface itself, this tensor representation is pose invariant. This can
be used for representing very low resolution range images.
Local Surface Patch (LSP): Chen and Bhanu [32] proposed a local feature based
representation technique based on a surface patch consisting of a feature point P
and its N neighbors. The representation includes the feature point, its surface type,
centroid of the patch, and a histogram of shape index values against the dot product
of the surface normal at the point and its neighbors [11]. The components of an
LSP are illustrated in Fig. 2.10(e). The 2D histogram and surface type are used for
matching of LSPs of different individuals and the centroid is used for computing the
rigid transformation.
Local 3D Feature (L3DF): Mian et al. [117] and Islam et al. [83] described a local
surface representation technique in which, at first, a small number of distinctive 3D
keypoints are identified on the detected 3D ear or face region. A 3D surface is then
approximated at each of the keypoints based on the neighborhood information and
used as the feature for that point (see Fig. 2.10(f)). A coordinate frame is defined
by centering on the keypoint and aligning with the principal axes from Principal
Components Analysis (PCA) to make the feature pose invariant.
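A sketch of such a pose-invariant local frame is given below, assuming the keypoint's neighbouring points are available; sign disambiguation of the PCA axes, which a full method needs, is omitted for brevity.

```python
import numpy as np

def local_frame(neighbors, keypoint):
    # Centre the neighbourhood on the keypoint and express it in the principal
    # axes of its own scatter, so the same physical surface patch yields (up
    # to axis signs) the same coordinates under any sensor pose.
    Q = neighbors - keypoint
    _, _, Vt = np.linalg.svd(Q - Q.mean(axis=0), full_matrices=False)
    return Q @ Vt.T
```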
2.6.3 Comparative Evaluation of the Representation Techniques
Approaches that directly use 3D surface information (e.g. the tensor based repre-
sentation and the L3DF) have comparatively more discriminating capabilities than
those mapping the 3D surface onto 2D histograms or extracting 1D signatures. Mian
et al. [116] compared the performance of the Spin Image to their tensor based repre-
sentation. They reported around 39% more correct pairwise registration at 600 faces
per view and 44% more matching pairs with less than 2 cm error. However, the high
descriptiveness of the tensor representation makes it sensitive to non-rigid deforma-
tions such as facial expression changes. Mian et al. [114] also demonstrated that
SFR is comparatively less descriptive but more robust to non-neutral expressions.
Bhanu and Chen [11] compared the cumulative matching performance of the LSP
with that of the SI on the UCR dataset and obtained a slightly better result for the for-
mer (94.8% and 92.9% rank-1 respectively). They also compared the efficiency of
these two and the SSI. They found that the average times required for finding the nearest
neighbors and the group of corresponding surface descriptors and then performing
verification by these three techniques were 89.42, 162.07 and 150.57 seconds respec-
tively on an AMD Opteron 1.8 GHz processor.
Pears [138] compared the performance of spin images with that of balloon images
for nose tip identification and obtained 70% and 99.6% accuracy respectively.
Mpiperis et al. [124] demonstrated that iso-contours outperform point signatures
both in computational efficiency and in recognition rates. They obtained an improve-
ment in accuracy from 86.2% to 91.4% for face recognition using iso-contours.
2.7 Recognition Techniques with 3D Face Data
Among the biometric traits, the face is the most heavily researched one. Exten-
sive surveys on 2D face recognition techniques can be found in [189, 3, 148, 192].
However, the maturity of 2D face recognition has failed to resolve several problems,
including pose variations, and therefore researchers are now pursuing 3D face recog-
nition. Most of the early 3D approaches obtained quite significant accuracy under
neutral expressions but remained severely affected by the deformations due to expres-
sion changes. Therefore, the current trend is to find approaches robust to such
deformations. We can broadly categorize these approaches as rigid and non-rigid
based on the methodology adopted. In this section, we describe some representative
approaches of each category.
2.7.1 Rigid Approaches
In these approaches, human faces are considered as rigid objects and expression
invariant features of faces (e.g. the length of nose and the distance between eyes)
or rigid/semi-rigid regions such as nose and eyes-forehead regions (see Fig. 2.11) are
identified and matched for recognition purposes. These are simple but suffer from the
disadvantage that deformable parts of the face that still encompass discriminative
information are rejected during matching.
Chua et al. [40] used point signatures for representing 3D faces with non-neutral
expressions and only matched those extracted from the upper part of the face.
Husken [71] presented a multimodal approach using 2D and 3D hierarchical graph
matching (HGM). The HGM worked as an elastic graph storing local features in its
nodes, and structural information in its edges. Their 2D modality performed better
than the 3D. However, the fusion performed better than each individual modality,
with a 96.8% verification rate at 0.001 false acceptance rate on the FRGC v2 database
while matching neutral vs. all images. The matching approach is faster than ICP
or a similar iterative approach.
Chang et al. [27] developed a 3D face recognition system based on a combina-
tion of the match scores from matching multiple overlapping regions around the
nose (including a circle and an ellipse centered at the nose and the nose itself). The
matching and the fusion were performed using ICP and a product fusion rule respec-
tively. Their algorithm was tested with a database of 4485 3D scans of 449 subjects
including 2349 and 1590 probes and 449 and 355 gallery images for neutral and non-
neutral expressions respectively. The system provided 97.1% and 87.1% recognition
rates with Equal-Error Rate (EER) of 0.12 and 0.23 for neutral and non-neutral ex-
Figure 2.11: Sample of 3D face (top), variance in expressions (middle) and expression
insensitive binary masks of 3D faces (bottom) [114] (best seen in color).
pressions respectively. However, in another approach of combining the scores from
3D PCA and 3D ICP algorithms [26], they obtained a better recognition rate of 92%
for data with non-neutral expressions.
Li et al. [101] constructed separate PCA spaces for the texture (which is relatively
less affected by facial expressions) and the non-invariant geometric attributes by
fitting a triangulated generic mask and warping the texture. They combined them
using a linear weighted sum rule for face recognition. While testing the algorithm
on a subset of Yale face database with 90 face images (with six different expressions)
from 15 subjects, 96% accuracy was obtained. The approach was not fully automatic
as the marker points for the face masks were selected manually.
Faltemier et al. [49] demonstrated better results under facial expression changes
using multiple samples with multiple expressions for each subject in the gallery
dataset. On a superset of FRGC v2 dataset (called ND-2006) containing 13450 im-
ages with six different expressions, they obtained a 97.2% rank-one recognition rate
while using five gallery images including two with neutral and three with happiness
expressions. The drawbacks of the approach include the difficulty of acquiring
prompted expressions in a controlled setting and also the errors that may occur
when the expression of a probe matches that of a gallery scan belonging to a different
subject.
Mian et al. [114] proposed a fully automatic face recognition approach based on
pose correction using the nose tip point and the Hotelling transform algorithm, a
rejection classifier using the SFR (see Section 2.6.1) and the Scale-Invariant Feature
Transform (SIFT) descriptor. For matching, they used a novel region based match-
ing approach and the modified ICP algorithm. In an experiment with the FRGC
v2 dataset, they achieved 99.74% and 98.31% verification rates at a 0.001 false ac-
ceptance rate (FAR) and identification rates of 99.02% and 95.37% for probes with
neutral and non-neutral expressions, respectively.
2.7.2 Non-rigid Approaches
In non-rigid approaches, human faces are considered as non-rigid objects and
deformations are applied to 3D facial scans to counteract expression deformations
or to reduce their influence on the recognition performance [4]. Such deformation
modeling is possible as the deformations caused by non-neutral expressions follow
patterns governed by the underlying anatomy of the face. The success of
non-rigid approaches depends on the ability to differentiate between expression
deformations and interpersonal disparities. Some of the non-rigid approaches are
discussed below.
Li and Barreto [99] proposed integrating expression recognition and face recogni-
tion in a single system in order to solve the problem of facial expression deformations.
In their work, at first an assessment of the expression of an unknown face was made
and then an appropriate recognition sub-system was used for person recognition.
For non-neutral images, the correct face was found by modeling the variations of
the face features between the neutral face and the face with expression. Classifica-
tion was performed using Linear Discriminant Analysis (LDA) and Support Vector
Machines (SVM). The system was tested with 30 neutral and 30 smiling face (3D
range) images from 30 subjects and an 80% recognition rate was obtained for
smiling faces.
Bronstein et al. [18] proposed using an isometric model of facial expression. The
probe facial surface was embedded into the surface of the model without requir-
ing the same amount of information. The Generalized Multi-Dimensional Scaling
(GMDS) numerical core was used that provided flexibility in handling partial surface
matching. The GMDS made use of some metric distortions that served as dissim-
ilarity measures resulting in a small number of surface samples which were then
matched using a hierarchical matching strategy. They obtained a 100% recognition
rate for a test with 180 3D faces from 30 subjects with partial occlusions and six
different expressions. In this approach facial expressions are assumed to be isomet-
ric, which is not always true. Besides, it requires intensive preprocessing prior to
matching.
Kakadiaris et al. [91] and Passalis et al. [135] used an annotated face model
(AFM), wavelet analysis, normal maps and a composite alignment algorithm to
extract a biometric signature from 3D face data. The AFM was deformed elastically
to fit each face, thus allowing the annotation of its different anatomical areas such
Figure 2.12: Illustration of deformation models: (a) original annotated face model,
geometry image and normal image (left to right) used in [91], (b) original test scan,
anchor point extraction and deformable model used in [107] and (c) original and
bilinear model of two different expressions [125].
as the nose, eyes and mouth. The approach was tested with a large dataset of
5000 scans from FRGC v2 and UH databases. Verification rates of 99% and 95.6%
(at 0.001 FAR) were obtained respectively for the neutral vs. neutral and neutral vs.
non-neutral subsets of the FRGC v2 database. The approach is fully automatic,
pose-invariant and robust to noisy and missing data. Although the final matching
is efficient (using a non-iterative distance metric), the enrollment phase (illustrated
in Fig. 2.12(a)) is computationally expensive.
Lu and Jain [107] fitted a facial surface deformation model (see Fig. 2.12(b))
to the test scans to handle expressions and large pose variations. A geodesic-based
re-sampling approach was applied to extract the landmarks for the model. With a
subset of FRGC v2 dataset (150 test scans of 50 subjects each with one neutral, one
smiling and one surprise expression), rank-one identification accuracy of 97% was
achieved. The system requires manual landmark labeling for deformation modeling.
Wang et al. [163] proposed a guidance-based constraint deformation (GCD)
model to reduce the shape distortion caused by expression. The probe model was
deformed toward the gallery model with some constraints in a Poisson equation
framework prior to matching the two models. The approach reported 11.8% and 6%
improvement of identification and verification performance respectively compared
to ICP. However, since the deformation is not performed according to expression
patterns, some interpersonal disparities might be lost.
Mpiperis et al. [125] developed asymmetric bilinear models (see Fig. 2.12(c))
capable of decoupling the identity and facial expression factors. The models were
constructed after the establishment of correspondence among the set of faces. These
models were then fitted to unknown faces for recognition. A bootstrap set of faces
was used for tuning the models. The system achieved an 86% rank-1 face recognition rate
on the BU-3DFE face dataset of 100 subjects. Images of 50 subjects were used in
training the models and those of the remaining subjects were used for testing. The
performance of the approach is limited by its requirement of a large bootstrap
training set and accurate point correspondences between faces.
Table 2.2: Summary of recognition approaches for face with varying expressions-1

Category | Approach | Methodology | Advantages | Disadvantages
Rigid | Chua et al. [40] | Matching point signatures extracted from the upper part of the face | Simple and fast | Sensitive to outliers; loses discriminative features on regions other than the upper face
Rigid | Chang et al. [27] | Matching multiple overlapping regions around the nose using ICP | Simple | Loses some discriminative features from deformable regions
Rigid | Husken [71] | Using 2D and 3D hierarchical graph matching (HGM) | Matching is non-iterative and hence fast | Results for neutral vs. non-neutral scans were not reported
Rigid | Li et al. [101] | Combined texture and geometry attributes of faces using PCA | Considers six different expressions | Marker points for face masks were selected manually; tested on a smaller dataset
Rigid | Faltemier et al. [49] | Using multiple samples with multiple expressions for each subject in the gallery dataset | Better results than using a single sample | Problems in acquiring data; computationally expensive
Rigid | Mian et al. [114] | Using nose tip detection, SFR, SIFT, region-based matching and ICP | Simple, fast and high verification rate | Considers a single face per scan
Amor et al. [6] proposed to study the deformability and elasticity proper-
ties of the face based on an anatomical analysis and to segment the facial surface
into regions according to their degree of deformation and elasticity. They computed the final
Table 2.3: Summary of recognition approaches for face with varying expressions (Part 2)

Category | Approach | Methodology | Advantages | Disadvantages
Non-rigid | Li and Barreto [99] | Classifying the expression and using the corresponding recognition sub-system, using LDA and SVM | Simple and fast | Tested only for neutral and smiling expressions; low accuracy
Non-rigid | Bronstein et al. [18] | Using an isometric model, the GMDS numerical core and a hierarchical matching strategy | Fast and robust to partial occlusions | Assumes facial expressions are isometric and involves expensive preprocessing
Non-rigid | Kakadiaris et al. [91] | Using an annotated face model, wavelet analysis, normal maps and a composite alignment algorithm | Fully automatic and efficient matching | Expensive enrollment
Non-rigid | Lu and Jain [107] | Fitting a facial surface deformation model to the test scans and using geodesic-based re-sampling | Improved performance over rigid ICP | Not fully automatic and tested on only three expressions
Non-rigid | Wang et al. [163] | Using a GCD-based deformation model in a Poisson equation framework prior to matching | Improved performance over rigid ICP | Some interpersonal disparities may be lost during model fitting
Non-rigid | Mpiperis et al. [125] | Using asymmetric bilinear models | Simple | Requires expensive training and accurate point correspondences
Non-rigid | Amor et al. [6] | Segmentation of the facial surface based on deformations, giving priority to stable regions | Robust to shape deformations | Tested on a smaller dataset
Non-rigid | Al-Osaimi et al. [4] | Modeling expression deformations from training data in PCA eigenvectors and using them to morph out the deformations | Preserves more interpersonal disparities | Training is expensive
matching score giving more importance to the stable facial regions. On a subset
of the IV2 3D face dataset with 50 gallery images and 400 probes (eight instances per
subject, including four with non-neutral expressions) of 50 subjects, they obtained a 97.5% recognition
rate with an EER of 5.5%. The authors demonstrated improved performance of this
approach over global ICP, especially for severe shape deformations.
Al-Osaimi et al. [4] proposed a non-rigid face recognition approach where pat-
terns of expression deformations were modeled from training data in PCA eigenvec-
tors without leaving out the interpersonal disparities. The patterns were then used
to morph out the expression deformations from the 3D scans before matching and
extracting the similarity measures. They obtained a best identification rate of 95%
for 400 probes with non-neutral expressions while using a training data size of 1700
scan pairs. On FRGC v2 database, they obtained 98.35% and 97.8% verification
rates at 0.001 FAR for neutral and non-neutral expressions respectively.
A summary of the methodology adopted and the advantages and the disadvan-
tages or limitations of each of the above approaches is provided in Table 2.3.
2.8 Recognition Techniques with 3D Ear Data
After detection and extraction of 3D ear data and/or features, an appropriate
matching algorithm is applied for recognition. Based on whether any features are
extracted from the raw ear data for matching the ears in the gallery and the probe
databases, we can divide the existing ear recognition approaches into three groups
as discussed below.
Figure 2.13: Block diagram of the ear recognition system proposed in [83].
2.8.1 Approaches Using Local Features
Chen and Bhanu [32] used LSP (see Section 2.6.2) local features for coarse align-
ment prior to applying the ICP algorithm. They obtained an ear recognition accu-
racy of 96.4% using 302 gallery and probe images from the UND database (Collec-
tion F). However, they reported only 87.5% recognition for straight-on versus 45-degree-off
images. They also performed an evaluation on the UCR ES2 dataset and obtained a
94.4% rank-one recognition rate. Their approach requires manual extraction of the
ear contour prior to recognition in case the automatic detection (87.71% accuracy)
fails.
In their recent approach, Chen and Bhanu [35] reduced the matching time for ear
recognition by combining feature embedding and SVM based rank learning. ICP
was applied on a short list of candidate models which was generated by ranking
the similarities for all model-test pairs using the learning algorithm. Although they
obtained the best timing result with this approach (192 sec compared to 1270 sec on
an AMD Opteron 1.8 GHz processor), it came with a performance penalty of 2.4%
(96.7% to 94.3%) while testing on the 212 images of 212 subjects from the UND-F
dataset.
Islam et al. [83] used L3DFs (see Section 2.6.2) for coarse alignment. Corre-
spondence between the gallery and the probe features was established and the ini-
tial matching decision was made based on the distance between the feature surfaces.
The rotation between the matching features was used in coarse alignment of the best
few probes. Then they applied ICP on a minimal rectangular subset of the whole
3D ear data containing the corresponding features only. The approach is illustrated
in Fig. 6.1. While evaluating on the first 100 images of the UND-F dataset, they
obtained 90% rank-one recognition rate. They improved the performance in [80]
using a refined matching technique with geometrical consistency checks and using
both the rotation and translation obtained from the corresponding L3DF matching
for the coarse alignment. They obtained rank-one identification rates of 92.77% and
95.03% on the UND-J and the UND-F dataset respectively.
2.8.2 Approaches Using Global Features
Passalis et al. [136] proposed a different approach by extracting a compact bio-
metric signature for matching 3D ears. They used a generic Annotated Ear Model
(AEM), ICP and Simulated Annealing algorithms to register and fit each ear dataset.
They obtained 93.9% and 94.4% recognition rates for the UND Collection J and an
extended database of 525 subjects respectively.
2.8.3 Approaches Without Extracting Any Features
Yan and Bowyer [180] applied 3D ICP on UND-F dataset and obtained 84.1%
identification accuracy. To improve their accuracy, they investigated the effect of
combining scores from different algorithms [178]. On a database of 3D images from
302 subjects, they obtained rank-one recognition rate of 90.2%, 87.7% and 69.9%
for combining 3D ICP with 3D edge, 3D PCA with 3D ICP and 3D PCA with 3D
edge algorithms respectively. They also tested the effect of using multiple images
per subject on a dataset of 169 subjects (each with at least four images taken on
four different dates). Having two images per person in both the gallery and the
probe datasets and applying a new fusion rule using interval distance distribution
between rank-one and rank-two, they obtained a rank-one recognition rate of 97%
for 3D ICP algorithm. The corresponding result for one gallery and one probe was
only 81.7%. However, they improved their accuracy to 97.5% on a dataset of 404
subjects [176] and 98.7% on the UND-F dataset of 302 subjects by removing outliers
before applying a modified version of ICP [178].
In their recent work [179], Yan and Bowyer applied the snake algorithm to pre-
cisely crop the ear and a further modified version of ICP for matching 3D ear data.
They obtained 97.8% rank-one recognition with an EER of 1.2% on the UND data
set (Collection J) consisting of 1386 probes and 415 gallery images. However, an
accuracy of 95.7% is reported on a dataset of 70 images occluded with earrings. In
another experiment with a dataset of 24 subjects each having a straight-on and a 45
degree off center image, they achieved only a 70.8% recognition rate. The system
requires the nose tip or the ear pit to be clearly visible, which may not always be
the case due to pose variations or hair covering these regions.
Islam et al. [75] proposed a hierarchical approach for applying ICP for 3D ear
recognition. The ICP algorithm was first applied on low and then on high resolution
meshes of 3D ear data. The automatically detected ear regions were not concisely
cropped. A rank-one recognition accuracy of 93% was reported for the first 100
images of the UND-F dataset.
2.8.4 Summary and Discussion
In this section, approaches for ear recognition are categorized and described.
A summary of the representative methods of these three groups is illustrated in
Table 6.1.
In general, approaches with local features are faster in computation and suitable
for application with large numbers of users. Both approaches in [83] and [32] use
Table 2.4: Summary of the existing 3D ear recognition approaches

Category | Source | Algorithm Used | Gallery | Probe | RR (%)
Using local features | Chen and Bhanu [32] | LSP and ICP | 302 | 302 | 96.4
Using local features | Islam et al. [83] | L3DF and ICP | 302 | 302 | 95.03
Using local features | Chen and Bhanu [35] | LSP, SVM and ICP | 212 | 212 | 94.3
Using global features | Passalis et al. [136] | AEM, ICP and DMF | 415 | 415 | 93.9
Without extracting any feature | Yan and Bowyer [179] | The snake and modified ICP | 415 | 1386 | 97.8
Without extracting any feature | Yan and Bowyer [176] | Modified ICP | 404 | 404 | 97.5
Without extracting any feature | Yan and Bowyer [178] | Modified ICP | 302 | 302 | 98.7
Without extracting any feature | Islam et al. [75] | Hierarchical application of ICP | 100 | 100 | 93
local features for recognition. However, the latter uses these for coarse alignment
only and the former uses them for the rejection of a large number of false matches
as well as for a coarse alignment of the remaining candidates prior to the ICP
matching. Both rotation and translation are used for the coarse alignment in these
two approaches whereas authors in [179] use only translation (no rotation) for this
purpose.
Approaches based on ICP alone are computationally more expensive,
as the algorithm is iterative. The recognition performance of the approaches
using ICP also depends on how concisely each ear has been detected because the
hair and skin around the ear make the alignment unstable. This explains the higher
recognition results in [179] and [32] compared to those in [83]. In [179], ICP is used
for matching concisely cropped ear data (using the snake algorithm) and in [32] a
large number (12.29%) of ears are concisely cropped manually in case their auto-
matic detector fails. Also, in [179, 32] ICP is applied on every gallery-probe pair
whereas in [83], the use of L3DF allows one to apply ICP on a subset (best 40) of
pairs for identification.
Although the final matching time is low (less than 1 ms per comparison) in [136],
its enrollment and feature extraction modules using both ICP and Simulated An-
nealing are computationally more expensive (15-30 sec on a 3-GHz Pentium 4 PC).
The approach also requires that the ear pit is not occluded because the annotated
ear model used for fitting the ear data is based on this area. It also does not perform
well for ears with intricate geometric structures.
Most of the existing approaches use either the right or the left ear data for
recognition of a subject. Insignificant performance differences (0.6%) are reported
in [170] for using the two ear images separately. However, only around 90% accuracy is
reported by Yan and Bowyer [176] when matching a mirrored left ear against a
stored right ear, indicating that symmetry-based ear recognition cannot be expected
to be highly accurate. Experiments are also performed to see the effect of using both
ear data in a multimodal approach. Lu et al. [106] and Xiaoxun and Yunde [170]
reported approximately 2% and 5% improved accuracy respectively by fusing data
from both ears.
2.9 Multi-Biometric Recognition with 3D Ear and Face
Most of the ear-face biometric approaches use score-level fusion. Although there
are some 2D approaches for fusing ear and face at the level of feature (e.g. [133, 172,
171, 28]) and data (e.g. [184]) levels, to the best of our knowledge, there are no 3D ear-face
multimodal approaches fused at these two lower levels. A summary of the relevant
available multimodal approaches is given in Table 7.1 and discussed as follows.
Table 2.5: Multi-biometric approaches with 3D ear and face data

Source | Methodology | Dataset (#probes, #galleries) | RR (%)
Yan [175] | Using ICP with sum and interval fusion rules on multi-instance gallery and probe images | 174*2*2, 174*2*2 | 100
Islam et al. [76, 80] | Using L3DF matching, ICP and a weighted sum rule | 315*2, 326*2 | 99.04
Theoharis et al. [152] | Annotated model fitting and using ICP, SA and wavelet analysis | 324*2, 324*2 | 99.7
Yan [175] combined ear and face at score level using the sum and interval fusion
rules. On a dataset of 174 subjects, each with two ear shapes and two face shapes
in the gallery and the probe datasets, they obtained rank-one recognition rates of
93.1%, 97.7% and 100% for the ear, the face and the fusion respectively.
Theoharis et al. [152] proposed a unified 3D face and ear recognition system
using wavelets. They extracted geometry images from 3D ear and face data by
fitting annotated ear and face models representing the respective average shapes to
them through an ICP and simulated annealing based registration process. Then,
the wavelet transform was applied to the extracted images to find the biometric
signature. For each modality, the distance between the feature vectors of gallery
and probe is weighted accordingly and then summed up for fusion. Although the
final score of each individual modality was not computed, we classify this approach
as score-level fusion, since the feature vectors of the two modalities were not
combined before the per-modality distances were computed, and the result
is therefore comparable to score-level fusion.
In a multimodal database composed of 324 gallery and the same number of
probe images (all collected from FRGC v2 and UND databases), 99.7% rank-one
recognition was reported for the above approach. The probe dataset for this ex-
periment contained some images with non-neutral expressions but most of them
were with neutral expression. However, the identification plots of this experiment
(see Fig. 2.14) illustrate the importance of the multimodal fusion. The fact that
the fusion curve reaches a 100% recognition rate before rank 15, whereas neither
single modality reaches 100% up to rank 20, indicates that the failures to
identify a subject are uncorrelated; therefore, one modality can compensate for the
shortcomings of the other.
Figure 2.14: Identification plots of ear, face and fusion of these two modalities [152].
Recently, Islam et al. [76] proposed an L3DF-based approach for fusing 3D ear and
face data at score level. As shown in Fig. 2.15, at first, they detected the ear and
the face automatically using techniques in [74] and [114] respectively. Following a
normalization step, face and ear L3DFs were extracted and matched as described
in [117] and [83] respectively. Matching scores from the ear and the face modalities
are then fused according to a weighted sum rule. The performance of the system
was evaluated on a multimodal dataset with 326 gallery images and 315 probes
Figure 2.15: Block diagram of the L3DF based ear-face multimodal recognition
system fused at score level [76].
with neutral facial expression and 311 probes with non-neutral facial expressions
all collected from the FRGC v2 and the UND Biometric databases. They obtained
98.71% and 98.1% identification rates, and 99.68% and 96.83% verification rates
at an FAR of 0.001, for the probe sets with neutral and non-neutral
images respectively. Using a refined matching technique [80], they improved their
identification rate to 99.04% in the case of non-neutral face data.
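In outline, the weighted sum rule used at the score level can be sketched as follows. This is a minimal illustration in Python/NumPy, not the tuned system of [76]: it assumes min-max normalized matching errors per modality and a single illustrative ear weight w_ear.

    import numpy as np

    def min_max_normalize(scores):
        # Map raw matching errors onto [0, 1]; lower still means better.
        s = np.asarray(scores, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)

    def fuse(ear_scores, face_scores, w_ear=0.5):
        # Weighted sum rule: normalize each modality, then combine.
        return (w_ear * min_max_normalize(ear_scores)
                + (1.0 - w_ear) * min_max_normalize(face_scores))

    # Identification: the gallery subject with the smallest fused score wins.
    ear = [0.8, 0.3, 0.9]   # hypothetical per-subject ear matching errors
    face = [0.6, 0.4, 0.7]  # hypothetical per-subject face matching errors
    print("rank-1 match:", int(np.argmin(fuse(ear, face))))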
2.10 Challenges
In this section, the problems faced by researchers and the challenges to be
addressed in achieving reliable performance with the ear and the face biometrics
are identified and discussed.
2.10.1 Image Acquisition Related Challenges
1. Sensor Errors: Artifacts are normally found in the sensed 3D data especially
on the oily or the hairy regions of the face or the ear such as the ear pits, the
eyes, eyebrows, mustache or beard. Missing data (holes) occur when a sensor
is unable to acquire data and outliers (spikes) occur due to an inter-reflection
in a projected light pattern or a correspondence error in stereo [17].
Sensor errors can be modeled using statistical techniques, and their impact can be
minimized by propagating the estimated errors through the identification system to
provide a realistic confidence measure on the final decision.
2. Image Distortions: Digital images obtained by the sensors are sometimes dis-
torted due to compression, damage to storage devices and transmission over
noisy channels. To develop robust biometric recognition, more research needs
to be performed exploring the sources of image distortion.
3. Cost of the Sensors: Although 2D data can be obtained using very cheap,
ordinary cameras, 3D sensors are still too expensive for common use.
2.10.2 Robustness Related Challenges
1. Occlusion: Apart from the artifacts due to the sensor errors, there might
be other reasons for occlusions in the captured data such as the presence of
earrings or facial ornaments, sun-glasses, hair coming over the ear or the face or
the growth of beards and mustaches. Although quite acceptable recognition
rates are achieved with a clear view of the ear and the face data, accurate
recognition under occlusion is still a great challenge.
2. Facial Expression: Recognizing faces under non-neutral expressions is chal-
lenging due to their non-linear nature and the lack of an associated mathematical
model. As discussed in Section 2.7, 3D face recognition techniques proposed
so far have not yet achieved significant accuracy for large databases with un-
constrained expression changes. In comparison to approaches based on non-
invariant features, approaches that analyze and model different modalities of
expressions seem more promising and require further attention.
3. Aging: Aging has an effect on facial appearance and often the gallery im-
ages are taken well before the probe images. To avoid having to keep the gallery
images up-to-date, the generic effects of aging can be modeled. However, mod-
eling of such effects is very difficult and very few researchers have worked on
this [193].
2.10.3 Efficiency Related Challenges
1. Efficient Fusion Technique: An important avenue for improving existing mul-
timodal biometric systems is to apply an efficient data or feature level fusion.
Fusion at the match score or decision level is easy to perform. But fusion
at these levels may not fully exploit the discriminating capabilities of the
combined biometrics. Fusion at the data or feature extraction level is be-
lieved to produce better results in terms of accuracy and robustness because
richer information about the ID or the class of an object can be combined at
these levels [146]. However, fusion at the feature level is the most challeng-
ing [146, 86, 144], because the feature sets of various modalities may not be
compatible and the relationship between the feature spaces of different biomet-
ric systems may not be known. Moreover, the resulting feature vectors may increase
in dimensionality, and a significantly more complex matching algorithm may be re-
quired. In addition, good features may be degraded by bad features during
fusion; hence, an efficient feature selection approach must be applied prior
to fusion. These challenges should be addressed for a successful multimodal
approach.
2. Efficiency of Matching Algorithm: The speed of a multimodal biometric recog-
nition system is an important factor for real time applications, particularly
when deployed in public places such as airports and stadiums. Unfortunately,
most of the matching algorithms that address the issue of accuracy (such as
ICP used in [179]) are computationally expensive. Therefore, developing an
accurate as well as time-efficient algorithm is of great research interest.
2.10.4 Application Related Challenges
1. Scalability and Benchmarks: Testing with significantly larger databases and
getting acceptable results is another big challenge to be addressed. Most of
the proposed biometric systems are tested with databases containing data from
fewer than 500 subjects. Again, although there is a benchmark database like
FRGC v2 for 3D face data, there is no comparable benchmark for 3D ear data.
2. Automation: Currently there are very few fully automatic recognition systems
available. However, real-time applications require that recognition be performed
in a fully automatic manner.
2.11 Conclusion
In this paper, an up-to-date review of existing approaches for two promising
biometric traits, the ear and the face, is provided. Starting with preliminary
concepts, the paper categorizes and analyzes all the techniques involving data ac-
quisition, detection, representation and unimodal and multimodal recognition with
these two modalities, thus providing the reader with a comprehensive overview of
the research field. It is found that many solutions have been proposed with unimodal
approaches, and most of them report quite high recognition and low error rates in
controlled scenarios; however, they suffer a significant decrease in accuracy in the
presence of pose and expression variations and occlusions. Although it is perceived
that the accuracy and robustness can be increased with fusion of 3D ear and face,
very few such approaches have been proposed. The identification and discussion of
the underlying problems and challenges in this paper imply that significant further
research should be performed in the area of developing fast and fully automatic
ear-face multimodal systems using low-cost acquisition devices and with a data or
feature level of fusion.
Acknowledgements
We acknowledge the use of profile images from the UND and the USTB databases.
We would like to thank Dr. Spadaccini, Dr. Thorne and Dr. Sohel for their useful
reviews.
CHAPTER 3
An ICP Based Hierarchical Matching Approach
for 3D Ear Recognition
Abstract
The use of ear shape as a biometric trait for recognizing people in different
applications is one of the most recent trends in the research community. In this
work, a fully automatic and fast technique based on the AdaBoost algorithm is used
to detect a subject’s ear from his/her 2D and corresponding 3D profile images. A
modified version of the Iterative Closest Point (ICP) algorithm is then used for the
matching of this extracted probe ear to the previously stored ear data in a gallery
database. A coarse-to-fine hierarchical technique is used where the ICP algorithm
is first applied on low and then on high resolution meshes of 3D ear data. We
obtain a rank one recognition rate of 93% while testing with the University of Notre
Dame Biometrics Database. The proposed recognition approach does not require
any manual intervention or sharp extraction of ear contour from the detected ear
region. No segmentation of the extracted ear is required and more importantly, the
system performance does not rely on the presence of a particular feature of the ear.
3.1 Introduction
Due to instances of fraud with the traditional ID based systems, biometric recog-
nition systems are gaining popularity day by day. In such a system, one or more
physiological (e.g. face, fingerprint, palmprint, iris and DNA) or behavioral (e.g.
handwriting, gait and voice) traits of a subject are taken into consideration for au-
tomatic recognition. Although the ear as a biometric trait is not as accurate as iris
or DNA, it is non-intrusive and easy to be collected. The face is also non-intrusive
but its appearance is affected by changes in facial expressions, use of cosmetics or
eye glasses. The ear is also smaller in size but rich in features and its shape does
not change with aging between 8 years and 70 years [72]. It can be used separately
or in a multimodal approach with the face for effective human recognition in many
applications including some national IDs, security, surveillance and law enforcement
This article is published in the Proc. of the Fourth International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT'08), pp. 131-141, June 18-20, 2008, with the title "Fully Automatic Approach for Human Recognition from Profile Images Using 2D and 3D Ear Data".
applications. However, in any of these applications, accurate recognition of the ear
is an important step.
A biometric recognition system may operate in one or both of two modes: au-
thentication and identification. In authentication, one-to-one matching is performed
to compare a user’s biometric to the template of the claimed identity. In identifi-
cation, one-to-many matching is done to associate an identity with the user by
matching it against every identity in the database. For both modes, an accurate
detection is an essential pre-requisite. However, ear detection from arbitrary profile
images is a challenging problem due to the fact that the ear is sometimes occluded
by hair and earrings and ear images can vary in appearance under different view-
ing and illumination conditions. Consequently, most of the existing ear recognition
algorithms assume that the ear has been accurately detected [77].
One of the earliest ear detection methods uses Canny edge maps to detect the
ear contour [20]. Hurley et al. [70] proposed the “force field transformation” for ear
detection. Alvarez et al. [5] used a modified active contour algorithm and Ovoid
model for detecting the ear. Yan and Bowyer [179] proposed taking a predefined
sector from the nose tip to locate the ear region. The non-ear portion from that
sector is cropped out by skin detection and the ear pit was detected using Gaussian
smoothing and curvature estimation. Then, they applied an active contour algo-
rithm to extract the ear contour. The system is automatic but fails if the ear pit
is not visible. Li Yuan and Mu [185] used a modified CAMSHIFT algorithm to
roughly track the profile image as the region of interest (ROI). Contour fitting
is then performed on the ROI for more accurate localization using the contour information
of the ear.
Most recently, Islam et al. [74] proposed an ear detection approach based on
the AdaBoost algorithm [149]. The system was trained with rectangular Haar-
like features and using a dataset of varied races, sexes, appearances, orientations
and illuminations. The data was collected by cropping and synthesizing from the
University of Notre Dame (UND) biometrics database [157, 179], the NIST Mugshot
Identification Database (MID), the XM2VTSDB [112], the USTB, the MIT-CBL
and the UMIST database. The approach is fully automatic, provides 100% detection
when tested with 203 non-occluded images of the UND profile face database, and
also works well with some occluded and degraded images.
As summarized in the survey of Pun et al. [143] and Islam et al. [77], most
of the proposed ear recognition approaches use either PCA (Principal Component
Analysis) or the ICP algorithm for matching. Choras [37] proposed a different
automated geometrical method. Testing with 240 images (20 different views) of 12
subjects, a 100% recognition rate was reported. Genetic local search and force field
transformation based approaches have also been proposed by Yuizono et al. [187]
and Hurley et al. [70] respectively. The first ever ear recognition system tested with
a larger database of 415 subjects is proposed by Yan and Bowyer [179]. Using a
modified version of the ICP, they achieved an accuracy of 95.7% with occlusion and
97.8% without occlusion (with an Equal Error Rate (EER) of 1.2%). The system
does not work well if the ear pit is not visible.
In this work, we have adopted the work of Islam et al. [74] for detecting the
ear from 2D profile images and then extended it for cropping the corresponding 3D
profile face data. After ear detection, we apply a variant of the Iterative Closest
Point (ICP) algorithm for recognition of the ear at different mesh resolutions of the
extracted 3D ear data. Using two different resolutions hierarchically, we obtain a
rank-1 recognition rate of 93%. The proposed system is fully automatic and does
not rely on the presence of a particular feature of the ear (e.g. the ear pit). It also
does not require an accurate extraction of the ear contour and hence reduces the
computational cost. Besides, the ear recognition results can be combined with other
biometric modalities such as 2D and 3D faces to obtain a more robust and accurate
human recognition system.
The paper is organized as follows. The proposed system for 3D ear detection
and that for 3D ear recognition are described in Section 3.2 and 3.3 respectively.
Results obtained are reported and discussed in Section 5.4. Section 7.9 concludes
our findings.
3.2 3D Ear Detection and Normalization
The proposed 3D ear detection approach utilizes the AdaBoost based 2D ear
detection technique developed by Islam et al. [74] (described fully in Section 6.3 of
this dissertation). Since the 3D profile data are co-registered with the corresponding
2D images, the 2D ear detector is first scanned through the whole profile image to
localize the ear. A rectangle is placed covering the ear. The 3D data corresponding
to this rectangular region is cropped for use in 3D ear recognition. The complete
ear detection process from 2D and 3D face profile image/data is shown in Figure
6.2. A sample of a profile image and corresponding 2D and 3D ear data detected by
our system is also shown in the same figure.
Once the 3D ear is detected, we remove all the spikes and holes by filtering
the data. We perform triangulation on the data points, remove edges longer than a
Figure 3.1: Block diagram of the 3D ear detection approach.
threshold of 0.6 and, finally, remove disconnected points. The data are then normalized
by shifting to the mean and uniformly sampled using a grid of size 82 by 56
pixels.
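The centering and resampling steps can be sketched as below. This is a minimal version assuming a filtered N x 3 (x, y, z) point cloud, with SciPy's griddata standing in for whatever interpolator was actually used:

    import numpy as np
    from scipy.interpolate import griddata

    def normalize_ear(points, rows=82, cols=56):
        # Shift the point cloud to its mean (normalization step).
        pts = np.asarray(points, dtype=float)
        pts = pts - pts.mean(axis=0)
        # Uniformly sample the depth values over an 82 x 56 grid spanning
        # the x-y extent of the cropped ear region.
        gx, gy = np.meshgrid(np.linspace(pts[:, 0].min(), pts[:, 0].max(), cols),
                             np.linspace(pts[:, 1].min(), pts[:, 1].max(), rows))
        gz = griddata(pts[:, :2], pts[:, 2], (gx, gy), method='linear')
        return np.dstack((gx, gy, gz))  # rows x cols x 3 normalized ear data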
3.3 3D Ear Matching and Recognition
The extracted 3D ear data of a subject (also called probe data) is matched with
one (for authentication) or all (for identification) 3D ear data (also called gallery
data) stored in the gallery database built off-line. Matching can be performed based
on the error of registering the two data sets, more specifically, two clouds of
points. The ICP algorithm [10] is considered one of the most accurate algorithms
for this purpose. However, it is computationally expensive and may converge to a
local minimum if the two data sets are not nearly registered. To minimize these
limitations, we adopted a coarse-to-fine hierarchical technique. ICP is first applied
on low and then, on high resolution meshes of the 3D ear data.
The mesh reduction was performed using the surface simplification algorithm of
Garland and Heckbert [57] as it preserves features on the surfaces of a mesh (unlike
the sampling reduction which removes points without any regard to the features).
The degree of mesh reduction can be controlled by fixing the number of triangles
to be retained in the reduced mesh or by fixing the surface error approximation
(using quadric matrices). We achieved better results using the latter option (see
Section 3.4.3).
Initially, reduced meshes (see Figure 3.2) of the probe (created on-line) and the
gallery (created off-line) ear data are used for coarse registration with the ICP. The
rotation and translation resulting from this coarse registration are applied to the
original data set and then, the ICP algorithm is applied to them to get a finer
match. The matching approach which we term as the two-step ICP is shown in the
Figure 3.2: Sample of the full and reduced meshes of the extracted 3D ear data:
(a) full mesh with 55,006 triangles; (b) reduced mesh with 2,000 triangles; (c)
reduced mesh with 400 triangles.
flowchart of Figure 3.3. Subscripts 'p' and 'g' in the flowchart are used for the probe
and the gallery respectively.
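A compact sketch of the two-step matching follows, assuming N x 3 NumPy arrays of probe and gallery points. A bare point-to-point ICP stands in for the modified ICP actually used, and simple point subsampling stands in for the quadric-error mesh simplification of [57]; only the coarse-then-fine structure is the point being illustrated.

    import numpy as np
    from scipy.spatial import cKDTree

    def icp(source, target, iterations=30):
        # Minimal point-to-point ICP: returns (R, t, RMS registration error).
        src, tgt = np.asarray(source, float), np.asarray(target, float)
        tree = cKDTree(tgt)
        R_tot, t_tot = np.eye(3), np.zeros(3)
        for _ in range(iterations):
            _, idx = tree.query(src)              # closest-point pairs
            m = tgt[idx]
            sc, mc = src.mean(0), m.mean(0)
            U, _, Vt = np.linalg.svd((src - sc).T @ (m - mc))
            R = Vt.T @ U.T
            if np.linalg.det(R) < 0:              # avoid reflections
                Vt[-1] *= -1
                R = Vt.T @ U.T
            t = mc - R @ sc
            src = src @ R.T + t
            R_tot, t_tot = R @ R_tot, R @ t_tot + t
        err = np.sqrt((tree.query(src)[0] ** 2).mean())
        return R_tot, t_tot, err

    def two_step_icp(probe, gallery, step=10):
        # Step 1: coarse registration on reduced (low-resolution) data.
        R, t, _ = icp(np.asarray(probe, float)[::step],
                      np.asarray(gallery, float)[::step])
        # Step 2: fine registration of the coarsely aligned full data.
        _, _, err = icp(np.asarray(probe, float) @ R.T + t, gallery)
        return err  # gallery ear with the smallest error is the rank-1 match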
3.4 Results and Discussion
In this section, results of our recognition system using both two and three levels
of mesh resolution are reported. The reasons for misclassifications are also discussed.
3.4.1 Dataset Used
The recognition performance of the proposed system was evaluated against two
datasets collected from the UND Biometrics Database [157, 179]. Dataset A consists
of 200 arbitrarily selected profile images of 100 different subjects, each of size
640 by 480 pixels. Among the chosen images, 100 images collected in the year
2003 are used in the gallery database and another 100 images of the same subjects
collected in the year 2004 are used for the probe database. No image of this dataset
was used in the training of the detection classifiers. In Dataset B, the whole UND
database is used, excluding the images of two subjects (due to data errors). 300
images taken in 2003 are used as the gallery. For each subject, one image out of the
multiple available images is arbitrarily chosen as the probe.
3.4.2 Recognition Rate of the Single-step ICP
Using Dataset A, we obtained recognition rates of 93% and 94% at rank one and
rank ten respectively with the single-step ICP. Testing with Dataset B provided
recognition rates of 93.98%, 94.31%, 95.31% and 96.32% at ranks one, two, three and
ten respectively. Figure 3.4 shows a sample of the correct recognition. It is noticed
that the system works even in the presence of partial occlusions due to hair and
ear-rings.
Figure 3.3: Flowchart of the matching algorithm using coarse-to-fine hierarchical
technique with ICP.
3.4.3 Improvement Obtained with the Two-step ICP
The recognition rate with Dataset A improved to 93%, 94% and 95% for rank
1, rank 2 and rank 7 respectively when we applied the two-step ICP as mentioned
in Section 3.3. In our experiment, we chose a quadric error factor of 10 for the mesh
reduction. The improvement is shown by the plots in Figure 3.5. It is worth men-
tioning that the recognition results become worse (84%, 88% and 88% respectively)
when we fix the number of triangles at 400 for the mesh reduction. This is due to
the fact that the size of the detected window is not the same for all the ear images.
The two-step ICP provides better results in the cases where the single-step ICP
converged to a local minimum and also where the probe and the gallery images differ
slightly in rotation and translation. An example of this improvement is shown in
figure 3.6.
3.4.4 Analysis of the Misclassification
Both single-step and two-step ICP failed to recognize some of the probe images.
Visual inspection of these probe images and the corresponding gallery images in-
Figure 3.4: Example of recognition: (a) 2D and range image of the gallery ear (b)
2D and range image of the probe ear with a small ear-ring and hair (This figure is
best seen in color).
Figure 3.5: Recognition rates with Single-step and Two-step ICP.
Figure 3.6: Examples of gallery-probe pairs correctly recognized by the two-step ICP
but not by the single-step ICP [the left image is the gallery and the right one is the probe].
dicates that the ICP algorithm cannot work properly if there is a large variation in
pose (rotation and translation) as shown in Figure 3.7-(a, b, c), occlusion by hair
or ear-rings as shown in Figure 3.7-(a, d), or missing data as in Figure 3.7-(e).
The pose variation might occur during the capture of the probe profile
image as mentioned in Section 7.3.1, or be introduced by the detection program (see Figure 3.7-(d)).
Figure 3.7: Examples of misclassification: (a, b, c) 2D image of a gallery (left) and
the corresponding probe (right) with pose variations; (d) 2D image of a probe with
occlusions; (e) range image of a probe having missing data.
3.5 Conclusion
Our method for ear detection and recognition from profile images is fully auto-
matic and robust to some degree of occlusion due to hair and ear-rings. One of
its major strengths is that it does not assume accurate prior detection of the ear.
It also does not rely on the presence of particular features like the ear pit. Thus,
it is suitable for any non-intrusive biometric application. In this paper, it is also
shown that the recognition performance of the ICP algorithm improves with the
proposed hierarchical approach. ICP performance may be further improved by pro-
viding better initial registration. This can be done by invariant feature matching.
The recognition system can be made more robust to occlusion by representing the
extracted ear data with occlusion-invariant representations. These constitute our
future research tasks.
Acknowledgements
Authors acknowledge the use of the University of Notre Dame Biometrics Database
for 3D ear detection and recognition. This research is sponsored by ARC grants
DP0664228 and DP0881813.
CHAPTER 4
A Fast and Fully Automatic Ear Recognition
Approach Based on 3D Local Surface Features
Abstract
Sensitivity of global features to pose, illumination and scale variations encour-
aged researchers to use local features for object representation and recognition.
Availability of 3D scanners also made the use of 3D data (which is less affected
by such variations compared to its 2D counterpart) very popular in computer vision
applications. In this paper, an approach is proposed for human ear recognition based
on robust 3D local features. The features are constructed on distinctive locations
in the 3D ear data with an approximated surface around them based on the neigh-
borhood information. Correspondences are then established between gallery and
probe features and the two data sets are aligned based on these correspondences.
A minimal rectangular subset of the whole 3D ear data only containing the corre-
sponding features is then passed to the Iterative Closest Point (ICP) algorithm for
final recognition. Experiments were performed on the UND biometric database and
the proposed system achieved 90, 94 and 96 percent recognition rates for ranks one,
two and three respectively. The approach is fully automatic, comparatively very fast
and makes no assumption about the localization of the nose or the ear pit, unlike
previous works on ear recognition.
4.1 Introduction
Among the biometric traits used for computer vision, the face and the ear
have gained most of the attention of the research community due to their non-
intrusiveness and the ease of data collection. Face recognition with neutral expres-
sions has reached its maturity with a high degree of accuracy. But changes of face
geometry due to the changes of facial expression, use of cosmetics and eye glasses,
aging, covering with beard or hair significantly affect the performance of face recog-
nition systems. The ear is considered as an alternative to be used separately or
in combination with the face as it is comparatively less affected by such changes.
However, its smaller size and the frequent presence of nearby hair and ear-rings make
it very challenging to use in non-interactive biometric applications.
Published in Lecture Notes in Computer Science (LNCS) 5259, J. Blanc-Talon et al. (Eds.), pp. 1081-1092, Oct. 2008.
As noted in the survey of Pun et al. [143] and Islam et al. [77], most of the pro-
posed ear recognition approaches use either Principal Components Analysis (PCA)
[188, 70, 28] or the ICP algorithm [188, 70, 28, 179, 181, 177, 30] or their combination
[176] for matching purposes. Choras [37] and Yuizono et al. [187] proposed geomet-
rical feature-based and genetic local search based approaches respectively. Both
reported error-free recognition, but on comparatively smaller datasets con-
taining high-quality 2D ear images taken on the same day and without any
hair or ear-rings. Similarly, Hurley et al. [70] proposed the force field transformation
for ear feature extraction and claimed 99.2% recognition on a smaller data set of
only 63 subjects and without considering occlusions with ear-rings and hair.
The first ever ear recognition system tested with a larger database (415 subjects)
is proposed by Yan and Bowyer [179]. Using an automatic ear detection based on
the localization of the nose and the ear pit, active contour based ear data extraction
and, finally, matching with a modified version of the ICP, they achieved an accuracy of
95.7% allowing occlusion and 97.8% on examples without any occlusion (with an
Equal Error Rate (EER) of 1.2%). The system is not expected to work properly if
the nose (for example, due to pose variation) or the ear pit (for example, due to its
being covered with hair) is not clearly visible, which is a common case. In an experiment
where straight-on ear images were matched with twenty-four 45-degree-off images
(a subset of Collection G of the UND database), it achieved only a 70.8% recognition
rate.
Most of the approaches above are based on global features. This requires an
accurate normalization of ear data with respect to pose, illumination and scale.
These approaches are also inherently sensitive to occlusion. As demonstrated in
this paper, local features are less affected by these factors. Recently, Chen and
Bhanu [32] used a local surface shape descriptor to represent ear data. However,
they only used the representation for a coarse alignment of the ear. The whole ear
data was then used for matching with a modified version of ICP. They obtained
96.4% recognition on the Collection F of UND database (302 subjects) and 87.5%
recognition for straight-on to 45 degree off images. They reported an ear detection
accuracy of 87.1% only. Moreover, they assume that all the ear data are accurately
extracted (manually, if needed) from the profile images prior to recognition.
In this paper, the 3D local surface features proposed for face recognition in [117]
are adapted for the ear recognition. The authors of the work reported a very high
recognition rate of 99% on neutral versus neutral and 93.5% on neutral versus all
face data when tested on the FRGC v2 3D face data set. They also obtained a very
good time efficiency of 23 matches per second on a 3.2 GHz Pentium IV machine
with 1GB RAM. However, since ear features are different and more challenging than
face features, we modified the feature creation and matching approach to make them
suitable for ear recognition. Following [117], at first a small number of distinctive
3D feature point locations are identified on each fully automatically detected
3D ear region. A 3D surface is then approximated around each selected keypoint
based on the nearby data points and used as the feature for that point. A coordinate
frame centred on the key point and aligned with the principal axes from PCA is
used to make the features pose invariant. Correspondence is established between
the gallery and the probe features and the matching decision is made based on the
distance between the feature surfaces and the transformation between them. This
yields a reasonable recognition method based on only local features. However, the
recognition performance is improved by aligning the probe and the gallery data set
based on the initial transformation between the corresponding features and followed
by the application of the Iterative Closest Point (ICP) algorithm on only a minimal
rectangular subset of the whole 3D ear data containing the corresponding features
only. This novel approach of extracting a reduced data set for final alignment
significantly increases the time efficiency as well. Thus, the proposed system has
three main advantages: 1) it is fully automatic, 2) it is comparatively very fast and 3) it makes
no assumption about the localization of the nose or the ear pit, unlike previous
works on ear recognition.
The paper is organized as follows. The proposed approach for 3D ear recognition
is described in Sect. 5.3. The results obtained are reported and discussed in Sect.
5.4 followed by conclusions in Sect. 7.9.
4.2 Methodology
Our ear recognition system consists of seven main parts as shown in Fig. 6.1.
Each of the components is described in this section.
4.2.1 Ear Data Extraction and Normalization
The ear region is detected on 2D profile face images using the AdaBoost based
detector described by Islam et al. [74]. This detector is chosen as it is fully automatic
and also due to its speed and high accuracy of 99.89% on the UND profile face
database with 942 images of 302 subjects [75]. The corresponding 3D data is then
extracted from the co-registered 3D profile data as described in [75]. To ensure the
whole ear is included and to allow the extraction of features on and slightly outside
Figure 4.1: Block diagram of the proposed ear recognition system.
the ear region, we expanded the detected ear region by an additional 25 pixels
in each direction.
Consequently, the extracted 3D ear data varies in dimensions depending on the
detection window. Hence, we normalized the 3D data by centering on the mean and
then sampling on a uniform grid of 132 by 106. The surface fitting was performed
using an interpolation algorithm at 0.5mm resolution. Since there were some missing
data regions as shown in Fig. 6.7, we removed interpolated data for those regions
after fitting to the grid.
4.2.2 Feature Location Identification
A 3D local feature can be depicted as a 3D surface constructed using data points
within a sphere of radius r1 centred at location p. As outlined by Mian et al. [117], the
criterion for identifying a feature location is that it should lie on a surface
distinctive enough to differentiate between range images of different persons.
To avoid many matching features in a single small region (in other words, to
increase distinctiveness), we only consider as possible feature points those that lie on a
2mm grid. Then we find the distance of each data point from the boundary and
keep only those points with a distance greater than a predefined boundary limit.
The boundary limit is chosen slightly larger than the radius of the 3D local feature
surface (r1) so that the feature calculation does not depend on regions outside the
boundary and the allowed region corresponds closely with the ear. We call the points
within this limit seed points.
To check whether the data points around a seed point contain enough descriptive
information, we adopt the approach of Mian et al. [117] discussed in short as follows.
We randomly choose a seed point and take a sphere of data points around that point
Figure 4.2: Locations of local features (shown with dots) on the range images of
different views (in rows) of different individuals (in columns). (This figure is best
seen in color).
which are within a distance of r1. We apply the PCA on those data points and align
them with their principal axes using the computed rotation matrix. The difference
between the ranges of the first two principal axes of the local region is computed
as δ, which is then compared to a threshold (δt). We only accept a seed point as a
distinctive feature location if δ is higher than δt. The higher δt is, the fewer
features we get, but lowering δt can result in the selection of less significant feature
points. This is because the value of δ indicates the extent of asymmetric variation
in depth in that point cloud. For example, a δ of zero for a point cloud means it
could be completely planar or spherical.
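A sketch of this seed-point test, assuming points is an N x 3 NumPy array of the ear's 3D data (here the PCA is done via an SVD of the centered neighborhood, and the parameter values follow those quoted below):

    import numpy as np

    def is_distinctive(points, seed, r1=10.0, delta_t=2.0):
        # Sphere of data points within r1 of the seed point.
        nbr = points[np.linalg.norm(points - seed, axis=1) <= r1]
        # Align the local region with its principal axes (PCA via SVD).
        centered = nbr - nbr.mean(axis=0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        aligned = centered @ Vt.T
        ranges = aligned.max(axis=0) - aligned.min(axis=0)
        # delta: difference between the ranges of the first two principal axes.
        delta = ranges[0] - ranges[1]
        return delta > delta_t  # accept the seed point as a feature location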
We continue selecting feature locations from the available seed points until we
get a significant number of points (Fn). For a seed resolution of 2mm, an r1 of 10,
a δt of 2 and an Fn of 200, we found 200 feature locations for most of the gallery
and probe ears. However, we found as few as 65 features, particularly in cases
where missing data occur. The values of these parameters were chosen empirically.
However, it is reported by Mian et al. [117] that the performance of the feature
point detection algorithm does not vary significantly with small variations of these
parameters.
Fig. 6.7 shows the suitability of our local features on the ear data. It illustrates
that local feature locations are different for ear images of different
individuals. It also shows that these features have a high degree of repeatability for
the ear data of the same individual. Here, by repeatability we mean the proportion
Figure 4.3: Repeatability of local feature locations
of probe feature points that have a corresponding gallery feature point within a par-
ticular distance. Similar to [117], the probe and gallery data of the same individual are
aligned using ICP before the repeatability is computed. The cumulative
percentage of repeatability as a function of the nearest neighbor error between gallery
and probe features of ten different individuals is shown in Fig. 4.3. The repeatability
reaches around 80% at an error of 2mm, which is the sampling distance between the
seed points.
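The repeatability measure itself is straightforward to state; a sketch, assuming the two keypoint sets are N x 3 NumPy arrays that have already been ICP-aligned as described:

    import numpy as np
    from scipy.spatial import cKDTree

    def repeatability(probe_keypoints, gallery_keypoints, max_error=2.0):
        # Fraction of probe feature locations having a gallery feature
        # location within max_error (mm) after alignment.
        dist, _ = cKDTree(gallery_keypoints).query(probe_keypoints)
        return float((dist <= max_error).mean())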
4.2.3 3D Local Feature Extraction
After a seed point qualifies as a keypoint, we extract a surface feature from its
neighborhood. As described in Sect. 6.4.1, while testing the suitability of the seed
point, we take the sphere of data points within r1 of that seed point, aligned
with their principal axes. We use these rotated data points to construct the 3D local
surface feature. Similar to [117], the principal direction of the local surface is used
as the 3D coordinates to calculate the features. Since the coordinate basis is defined
locally based on the shape of the surface, the computed features are potentially
stable and pose invariant.
We fit a uniformly sampled (1mm resolution) 3D surface on a 30 × 30 lattice
to these data points. In order to avoid boundary effects, we crop the inner
20 × 20 lattice region from the bigger surface. This smaller surface is then concatenated
to form a feature vector to be used for matching. Consequently, the dimension of
Figure 4.4: Example of a 3D local surface (right image). The region from which it
is extracted is shown by a circle on the left image.
our feature vector is 400. An example of a 3D local surface feature is shown in Fig.
7.2.
For surface fitting, we use a publicly available surface fitting code [47]. The
motivation behind the selection of this algorithm is that it builds a surface over
the complete lattice, extrapolating (rather than interpolating) smoothly into the
corners. Therefore, it is less sensitive to noise and outliers in the data.
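Putting the pieces together, the feature construction can be sketched as below. Note that nearest-neighbor gridding stands in for the smoothly extrapolating surface fit of [47], so this is an approximation of the construction, not the exact code:

    import numpy as np
    from scipy.interpolate import griddata

    def extract_l3df(points, keypoint, r1=10.0):
        # Local sphere of data points around the accepted keypoint.
        nbr = points[np.linalg.norm(points - keypoint, axis=1) <= r1]
        centered = nbr - nbr.mean(axis=0)
        # Rotate into the local PCA frame (this gives pose invariance).
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        local = centered @ Vt.T
        # Fit a 30 x 30 surface at 1mm resolution over the local x-y plane.
        axis = np.linspace(-14.5, 14.5, 30)       # 30 samples, 1mm apart
        gx, gy = np.meshgrid(axis, axis)
        gz = griddata(local[:, :2], local[:, 2], (gx, gy), method='nearest')
        # Crop the inner 20 x 20 region to avoid boundary effects.
        return gz[5:25, 5:25].ravel()             # 400-dimensional feature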
4.2.4 Feature Matching
The similarity between two features is calculated as the Root Mean Square
(RMS) distance between corresponding points on the 20 × 20 grid generated when
the feature is created (aligned following the axes in the PCA). The RMS distance
is computed from each probe feature location to all the gallery feature locations.
Matching gallery features which are located more than a threshold (th) away are
discarded to avoid matching in quite different areas of the cropped image. The
RMS distances between the probe feature and the remaining gallery features are then com-
puted. The gallery feature with the minimum distance is considered
as the corresponding gallery feature for that particular probe feature. The mean of
the distances for all the matched probe and gallery features is used as a similarity
measure.
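A sketch of this matching step follows, assuming each feature is the 400-dimensional vector from the previous section, stored together with its location on the cropped image, all as NumPy arrays; the value of the location-gating threshold th below is arbitrary, not the one used in the experiments:

    import numpy as np

    def match_score(probe_feats, probe_locs, gallery_feats, gallery_locs,
                    th=15.0):
        # Mean, over probe features, of the RMS distance to the best-matching
        # gallery feature lying within th of the probe feature's location.
        best = []
        for f, loc in zip(probe_feats, probe_locs):
            near = np.linalg.norm(gallery_locs - loc, axis=1) <= th
            if not near.any():
                continue  # no gallery feature close enough in location
            rms = np.sqrt(((gallery_feats[near] - f) ** 2).mean(axis=1))
            best.append(rms.min())
        return float(np.mean(best))  # lower mean distance => better match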
Unlike [117], we also used the implied rotation between probe and gallery for each
pair of matching features, calculated from the rotations used to generate
the two features. The angle between each of these rotations and all the others is
computed, and the rotation with the most similar rotations (within five degrees)
is chosen as the most representative rotation. The ratio of the size of the largest
cluster of rotation angles to the total number of matching features is used as an
additional similarity measure.
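The rotation-consistency measure can be sketched as follows, given the list of 3 x 3 rotation matrices implied by the matched feature pairs (a minimal O(n^2) clustering; the five-degree tolerance is the one quoted above):

    import numpy as np

    def angle_deg(Ra, Rb):
        # Angle of the relative rotation between two rotation matrices.
        c = (np.trace(Ra @ Rb.T) - 1.0) / 2.0
        return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

    def rotation_consistency(rotations, tol=5.0):
        # Size of the largest cluster of mutually similar rotations, as a
        # fraction of all matched features (the RR similarity measure).
        counts = [sum(angle_deg(Ri, Rj) <= tol for Rj in rotations)
                  for Ri in rotations]
        return max(counts) / len(rotations)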
4.2.5 Coarse Registration of Gallery and Probe Data
The input profile images of the gallery and the probe may have pose (rotation and
translation) variations. To minimize the effect of such variations, unlike [117], we
use the correspondence and rotation information obtained from the 3D local feature
matching for the initial or coarse registration of the probe to the gallery image data.
We applied the following approaches for this purpose. In the first approach Singular
Value Decomposition (SVD) is used to find the rotation and translation matrix from
the gallery and probe data points corresponding to the matched 3D local features
(see Sect. 7.4.1). In the second approach, the rotation and the translation with the
maximum number of occurrences (within five degrees and 2mm of limit respectively)
are used to coarsely align the probe data to the gallery.
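The first (SVD-based) approach is the standard orthogonal Procrustes solution computed over the matched feature locations; a sketch, assuming N x 3 NumPy arrays of corresponding points:

    import numpy as np

    def coarse_align(probe_pts, gallery_pts):
        # Rigid (R, t) mapping matched probe points onto their corresponding
        # gallery points, via SVD of the cross-covariance matrix.
        p_c, g_c = probe_pts.mean(axis=0), gallery_pts.mean(axis=0)
        U, _, Vt = np.linalg.svd((probe_pts - p_c).T @ (gallery_pts - g_c))
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:   # guard against a reflection solution
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = g_c - R @ p_c
        return R, t                # apply as: probe @ R.T + t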
4.2.6 Fine Matching with ICP
The Iterative Closest Point (ICP) algorithm [10] is considered to be one of the
most accurate algorithms for registering two clouds of data points, provided the
data sets are roughly aligned. Since ICP is computationally expensive, we extract
a reduced rectangular region and feed it to the modified version of ICP in [114]. The minimum
and maximum co-ordinate values of the matched local 3D features were used to
extract the reduced rectangular region from the originally detected gallery and probe
ear data. This smaller but feature-rich region also minimizes the probability of being
affected by the presence of hair and ear-rings.
4.2.7 Final Similarity Measures
The final decision regarding the matching is made based on the results of ICP as
well as the local feature matching. Therefore, the final similarity measures are: (i)
Distance between local 3D features (ii) ICP error and (iii) The ratio of the size of
the largest cluster of rotation angles to the total number of matching features (RR).
As in [117], each of the similarity measures was scaled to 0 and 1 for its minimum
and maximum values respectively. A weight factor is then computed as the ratio of
the difference of the minimum value from the mean to that of the second minimum
value from the mean of a similarity measure. The final result is the weighted summation of all the similarity measures. However, the third similarity measure (RR) is subtracted from one before multiplication with the corresponding weight factor, as it is opposite in sense to the other measures (the higher its value, the better the match).
4.3 Results and Discussions
The recognition performance of our proposed approach is evaluated in this sec-
tion. The results with and without ICP are reported separately. Examples of correct
and misclassifications are analyzed. The time requirement for matching is also re-
ported.
4.3.1 Data Set Used
The Collection F from the University of Notre Dame Profile face database is
used to perform the recognition experiments of the proposed approach. We have taken 200 profile images of the first 100 different subjects. Among these, the 100 images collected in 2003 are used as the gallery database and the first 100 images of the same subjects collected in 2004 are used as the probe database.
Figure 4.5: Identification results (identification rate vs. rank for ranks 1-20; curves for "Local 3D features only" and "Local 3D features and ICP").
4.3.2 Recognition Rate with 3D Local Features Only
In our first experiment, we performed recognition considering the matching errors with local 3D features only. We obtained identification rates of 84%, 88% and 90% for rank-1, rank-2 and rank-3 respectively. The results are shown in the plot of Fig. 4.5.
4.3.3 Fine Matching with the ICP
Some of the matching failures with the local 3D features alone are recovered by fine alignment with ICP after the initial alignment using those features. Using the combined similarity measure described in Sect. 4.2.7, the identification rates improve to 90%, 94% and 96% for rank-1, rank-2 and rank-3 respectively. The results are shown in Fig. 4.5.
Figure 4.6: Examples of correct recognition in the presence of occlusions: (a) with ear-rings and (b) with hair (2D images and the corresponding range images are placed in the top and bottom rows respectively).
Figure 4.7: Example of correct recognition of gallery-probe pairs with pose varia-
tions.
4.3.4 Occlusion and Pose Invariance
Our approach with local 3D features is found to be robust to the presence of
partial occlusions due to hair and ear-rings. Some examples are illustrated in Fig.
4.6.
The proposed local 3D features are pose-invariant due to the way they are created. However, if the pose variation (especially out-of-plane rotation) causes self-occlusion, the number of 3D local features and their repeatability decrease in the gallery-probe pair. Therefore, we noticed some misclassifications using local 3D features only in the presence of some pose variations. However, with the finer registration with ICP, most of those failures were corrected. Fig. 4.7 shows two such examples. Profile images are used for the example on the right to illustrate the pose variations.
Figure 4.8: Examples of misclassification: (a) with large pose variations and (b) with ear-ring, hair and pose variations.
4.3.5 Analysis of the Failures
The proposed system fails mostly in cases of missing data (due to sensor error
or ear-rings), large pose variations causing self-occlusion and severe occlusions with
hair and ear-rings.
The repeatability of the local 3D features in misclassified images was found
to be very low. The worst misclassified gallery-probe pair has only around 33%
repeatability at 2mm (the feature point resolution).
Two examples of misclassification are illustrated in Fig. 4.8: one has large in-plane and out-of-plane rotations, and the other has a large out-of-plane rotation, an ear-ring and hair covering a portion of the ear. The hair around the gallery ear of the second example hides the depth variation at the edges (see the range image in Fig. 4.8b).
4.3.6 Recognition Speed
An unoptimized MATLAB implementation of the recognition system was run on a Pentium 4 with a 3.6 GHz CPU and 3.25 GB of RAM. It takes around 0.3762
sec to match the 3D local features of a gallery-probe pair. This timing is somewhat longer than that reported in [117] because we do not compress the feature vector and we additionally compute the rotation similarity measure (see Sect. 7.4.1) while matching the features. The ICP stage requires a further 8.6 sec per match on the same platform.
4.4 Conclusion
In this paper, a robust approach based on local 3D features is proposed for 3D ear recognition. The approach is fully automatic and comparatively very fast. It is shown to be robust to pose and scale variations and to occlusions due to hair and ear-rings. It also makes no assumption about the localization of the nose or the ear pit. The large variation between the rank-1, rank-2 and rank-3 results indicates that finer tuning of the parameters of our system is likely to improve the performance. The time efficiency of the system can also be improved by reducing the dimensionality of the local feature vector by projecting the features onto a PCA subspace.
Acknowledgements
We acknowledge the use of the UND Biometrics databases for ear detection and recognition. We would also like to thank D'Errico for the surface fitting code. This research is sponsored by ARC grants DP0664228 and DP0881813.
CHAPTER 5
Refining Local 3D Feature Matching through
Geometric Consistency for Robust Biometric
Recognition
Abstract
Local features are gaining popularity due to their robustness to occlusion and other variations such as minor deformation. However, using local features for the recognition of biometric traits, which are generally highly similar, can produce large numbers of false matches. To increase recognition performance, we propose to eliminate some incorrect matches using a simple form of geometric consistency and some associated similarity measures. The performance of the approach is evaluated on different datasets and compared with some previous approaches. We obtain an improvement from 81.60% to 92.77% in rank-1 ear identification on the University of Notre Dame Biometric Database, the largest publicly available profile face database, with 415 subjects.
keywords
Local features; feature matching; geometric consistency; biometric recognition.
5.1 Introduction
The performance of a biometric recognition system greatly depends on the rep-
resentation of the underlying distinctive features and the algorithm for matching
those features. Due to their ease of computation, various global features have been used in biometric systems. However, global features are often not robust to variations in observation conditions, including the presence of occlusions and deformations. In search of features that are more robust under these variations, researchers have proposed using local features.
Among 2D local features SIFT (scale invariant feature transform) [105] and its
variants are used in many biometric applications [92, 134, 21, 109]. Also Guo and
Xu [63] proposed using the Local Similarity Binary Pattern (LSBP) and Local
Binary Pattern (LBP) for ear data representation and matching. However, 2D data
This article is published in the Proc. of Digital Image Computing: Techniques and Applications (DICTA), pp. 513-518, December 2009.
has many inherent problems compared to its 3D counterpart, including sensitivity to the use of cosmetics, clothing and other decorations. Recently, Mian et al. [117] proposed the Local 3D Feature (L3DF), inspired by 2D SIFT, which was found to be very effective for face representation and recognition. Islam et al. [83] also found it to be very fast and somewhat robust for ear recognition.
Among the biometric traits, ears and faces are considered to be most suitable for
non-intrusive biometric recognition. However, they are also highly similar among
themselves [35]. Therefore, using local features for these two biometric traits (espe-
cially for ears) produces many false or incorrect matches for a gallery-probe pair. In
this paper, we find initial feature matches using L3DFs with simple metrics (such
as RMS distance between features) similar to Mian et al. [117]. However, we pro-
pose using a geometric consistency check to filter out some of the incorrect feature
matches in an additional round of matching. The consistency of a feature match is judged from the difference between the implied distances on the probe and gallery to a particular match from the first round; this reference match is chosen to maximize the consistency within the first round. We also develop a similarity measure based on this consistency by computing the proportion of consistent distances in the first round.
Finally, we combine this additional measure with the proportion of consistent rota-
tions and mean distance error of the selected features computed as we previously
proposed in [83, 76] (Chapter 4 of this dissertation). Experiments with different
datasets prove the effectiveness of our technique via a considerable improvement in
the recognition results.
The paper is organized as follows. The proposed approach for refining matching
is described in Section 5.3 after describing the background and motivation in Section
5.2. The results obtained are reported and discussed in Section 5.4 and compared
with other approaches in Section 7.8. Section 7.9 concludes with some future
research directions.
5.2 Background and Motivation
The local 3D feature that we use to demonstrate our approach can be depicted
as a 3D surface constructed using data points within a sphere of radius r1 centered
at location p. Fig. 7.2 shows a local 3D feature surface extracted from an ear.
As described in [83](Chapter 4 of this dissertation), a 20 × 20 grid of heights is
approximated using the surface points of a 3D surface feature. The grid is aligned
following the axes defined by the Principal Component Analysis (PCA). The simi-
larity between two features is calculated as the Root Mean Square (RMS) distance
Figure 5.1: Example of a 3D local surface (right image) and the region from which
it is extracted (left image, marked with a circle) [83].
between corresponding heights on this grid. The similarity is computed for each probe feature location against all gallery feature locations, excluding those located more than a threshold away. (As previously, this threshold is empirically chosen as 45mm in the case of ear matching, since our automatic ear detection procedure does not precisely locate the ear.) The gallery feature with the maximum similarity is considered as the matching feature for that particular probe feature.
The starting point for the current work was the observation that for a correspond-
ing probe and gallery the set of feature matches produced by the above technique
generally contains a large proportion of incorrect matches, and that the correct
matches must be geometrically consistent, while the incorrect matches are unlikely
to be. In contrast, for a non-corresponding probe and gallery, it is unlikely that
there will be a large set of features that match well and are geometrically consis-
tent. Here geometrically consistent means that there is a rigid transformation that
maps the probe feature locations to the corresponding gallery feature locations.
While it seems possible to check full geometric consistency during the matching
process, by constructing rigid transformations from sets of matched points using
Random Sample Consensus (RANSAC) [52] or a variant such as Progressive Sample
Consensus (PROSAC) [42], we chose instead to stick close to the original local 3D
feature matching algorithm of Mian et al. [117] in order to build on its demonstrated
strengths.
5.3 Proposed Refinement Technique
In this section, at first we describe how distance consistency can be used to filter
out incorrect matches. Then, we describe the derivation of two similarity measures
to be used in addition to the feature similarity.
5.3.1 Computation of Distance Consistency
In order to discard incorrect matches, we propose to add a second round of
feature matching each time a probe is compared with a gallery. This second round
uses geometric consistency based on information extracted from the feature matches
generated by the first round. The first round of feature matching is done just as
described in Section 5.2, and we use the matches generated to identify those matches
that are most geometrically consistent.
For simplicity, we measure geometric consistency of a feature match just by
counting how many of the other feature matches from the first round yield consistent
distances on the probe and gallery. More precisely, for a match with locations $p_i$, $g_i$ we count how many other matched locations $p_j$, $g_j$ satisfy:
$$\bigl|\, \|p_i - p_j\| - \|g_i - g_j\| \,\bigr| \;<\; d_{key} + \kappa \sqrt{\|p_i - p_j\|}$$
Here the threshold includes a term to allow for the spacing between candidate keypoints, as well as one that grows with the square root of the actual probe distance $\|p_i - p_j\|$ to account for minor deformations and measurement errors.
To exploit geometric consistency quickly and simply, we just find the match from
the first round which is most “distance-consistent” according to this measure. Then,
in the second round, we only allow feature matches that are distance-consistent with
this match - thus for each probe feature we find the best matching gallery feature
that is distance consistent with the chosen match.
We also compute the ratio of the maximum distance consistency to the total number of matches found in the first round of matching and use it as a similarity measure: the proportion of consistent distances (λ).
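A sketch of this count in Python ($d_{key}$ and $\kappa$ are the spacing and deformation constants from the inequality above; the numeric defaults below are placeholders, not the thesis settings):

```python
import numpy as np

def distance_consistency(p_locs, g_locs, d_key=2.0, kappa=0.5):
    """p_locs, g_locs: (M, 3) keypoints of the M first-round matches.
    Returns, per match, how many other matches it is consistent with."""
    dp = np.linalg.norm(p_locs[:, None] - p_locs[None], axis=-1)
    dg = np.linalg.norm(g_locs[:, None] - g_locs[None], axis=-1)
    ok = np.abs(dp - dg) < d_key + kappa * np.sqrt(dp)
    np.fill_diagonal(ok, False)       # a match is not compared with itself
    counts = ok.sum(axis=1)
    # argmax(counts) is the most distance-consistent match; its count over
    # the number of matches gives the proportion of consistent distances.
    return counts
```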
5.3.2 Computation of Rotation Consistency
Similar to Islam et al. [83, 76], we also compute a similarity measure based on
the consistency of the rotations implied by the feature matches. Each feature match
implies a certain rotation between the probe and gallery, since we store the rotation
matrix used to create the probe feature from the probe (calculated using PCA), and
similarly for the gallery feature, and we assume that the match occurs because the
features have been aligned in the same way and come from corresponding points.
We can thus calculate the implied rotation from probe to gallery as $R_g^{-1} R_p$ (or equivalently $R_g^T R_p$, since $R_g$ is a rotation matrix), where $R_p$ and $R_g$ are the rotations used for the probe and gallery features.
Figure 5.2: Feature correspondences before (left image) and after (right image)
filtering with geometric consistency (best seen in color).
We calculate these rotations for all feature matches, and then for each we deter-
mine the count of how many of the other rotations it is consistent with. Consistency
between two rotations $R_1$ and $R_2$ is determined by finding the angle between them, i.e., the angle of the rotation $R_1^{-1} R_2$ (around the appropriate axis of rotation). When this angle is less than $10^\circ$ we consider the two rotations consistent. We choose the rotation that is consistent with the largest number of other matches, and then use the proportion of matches consistent with it (α) as a similarity measure. As we shall see in Section 5.4, this measure proves to be the strongest of the measures used prior to applying the Iterative Closest Point (ICP) algorithm in our ear recognition experiments.
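A sketch of this computation (the 10° tolerance is from the text; the function names and the trace-identity formulation are ours):

```python
import numpy as np

def rotation_angle_deg(R1, R2):
    """Angle of R1^T R2 via the trace identity cos(a) = (tr - 1) / 2."""
    c = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def rotation_consistency(rotations, tol_deg=10.0):
    """rotations: list of implied probe-to-gallery rotations Rg^T @ Rp."""
    counts = [sum(rotation_angle_deg(Ri, Rj) < tol_deg
                  for j, Rj in enumerate(rotations) if j != i)
              for i, Ri in enumerate(rotations)]
    best = int(np.argmax(counts))
    alpha = counts[best] / max(len(rotations), 1)
    return rotations[best], alpha   # representative rotation, proportion
```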
Fig. 5.2 illustrates an example of the correspondences between the features of a probe image and the corresponding gallery image (mirrored in the z direction) before (left image) and after (right image) the geometric consistency check is performed. The green channel (brighter in a black-and-white reproduction) indicates the amount of rotational consistency for each match. It should be clear that the filtering has increased the proportion of matches that involve corresponding parts of the two ear images.
5.3.3 Final Similarity Measures
When the second round is complete for a particular probe and gallery, we use
the mean error of the feature matches (γ) in the second round, proportion of con-
sistent distances (λ) and the proportion of consistent rotations (α) from the second
round to measure the closeness of the match. Similar to [83], we perform min-max
normalization of the similarity measures and compute a weight factor (η) for each
Figure 5.3: Identification results for the fusion of ears and faces on dataset A: (a) without and (b) with geometric consistency (identification rate vs. rank; curves for Ear, Face and combined).
Figure 5.4: Verification results for the fusion of ears and faces on dataset A: (a) without and (b) with geometric consistency (verification rate vs. false acceptance rate on a log scale; curves for Ear, Face and combined).
of the measures as the ratio of the difference between the mean and the minimum value to the difference between the mean and the second minimum value of that similarity measure. We
compute the weighted sum as follows:
$$\varepsilon = \eta_f\,\gamma + \eta_r (1 - \alpha) + \eta_d (1 - \lambda)$$
Based on the above score, we sort the candidate gallery images and apply coarse
alignment and final ICP only on the minimal rectangular area of the best 20 candi-
dates to make the final decision of matching.
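A sketch of this scoring over a set of gallery candidates (whether the weights are computed before or after normalization is not spelled out above; this sketch normalizes first, and all names are ours):

```python
import numpy as np

def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def weight(x):
    """eta: (mean - minimum) / (mean - second minimum) of a measure."""
    s = np.sort(x)
    return (x.mean() - s[0]) / (x.mean() - s[1] + 1e-12)

def fused_scores(gamma, alpha, lam):
    """gamma: mean feature error (lower is better); alpha, lam: proportions
    of consistent rotations/distances (higher is better). One entry per
    gallery candidate. Returns epsilon, where lower means a closer match."""
    g, a, l = minmax(gamma), minmax(alpha), minmax(lam)
    return weight(g) * g + weight(a) * (1 - a) + weight(l) * (1 - l)
```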
For coarse alignment, we use the translation (t) corresponding to the maximum
distance consistency and the rotation (R) corresponding to the largest cluster of
consistent rotations as follows:
$$d_p' = R\,d_p + t$$
where $d_p$ and $d_p'$ are the probe point coordinates before and after coarse alignment.
5.4 Result and Discussion
To evaluate the performance of our refined matching on ear and face biometrics,
we have used Collection F and J from the University of Notre Dame Profile Biomet-
rics Database [157, 179] and a subset of the FRGC v2. For different experiments, we
have sub-divided these two datasets into the following three subsets. In dataset A,
326 frontal images from the FRGC v2 and 326 profile images from the UND Col-
lection J are used for gallery database. Another set of 311 images from these two
databases is used as a probe dataset. All the ear data are automatically extracted
from the profile images using the technique described in [74, 75]. In dataset B, the
earliest 302 images of the UND Collection F are used as gallery and the latest 302
images are used as probe (there is a time lapse of 17.7 weeks on average between
the earliest and latest images). The entire Collection J of the UND database is
considered as dataset C which includes 415 earliest images as the gallery and an-
other 415 images as probes. However, since the ear in one of the profile images was not automatically detected, the number of probes included in the recognition experiments for this dataset is 414.
Using dataset A, we obtain 79.74% and 69.13% rank-1 identification rates for the ear with and without the second stage of matching with distance consistency, respectively. Rank-n means the right answer is within the top n matches. The score-level fusion of ear and face improves the identification result from 98.07% to 98.71% (see Fig. 5.3)
and the verification result from 98.71% to 99.68% at a False Acceptance Rate (FAR) of 0.001 using the proposed matching approach (see Fig. 5.4).
As shown in Fig. 5.3(b), we obtain only a little improvement for the face. This is
because our matching algorithm for faces already requires distance consistency with
respect to the detected nose tip position as described in [114]. For ears it seems
difficult to reliably detect a similar position due to the possibility of occlusion, and
we consider it a strength of our technique that it does not require any specific part
of the ear to be visible.
Figure 5.5: Identification results using geometric consistency: (a) on dataset B and (b) on dataset C (identification rate vs. rank; curves for L3DF-based measures and ICP followed by L3DF).
Using the translation corresponding to the maximum distance consistency and the best rotation as the initial transformation, and then applying the ICP algorithm to the gallery and probe data, the ear identification result improves to 92.60%. The corresponding result is 86.50% when only the rotation and translation obtained from an application of ICP to the matched feature points are used for coarse alignment.
On dataset B, we perform two experiments. In the first experiment, we use
feature errors and the proportion of consistent rotations for initial matching and
only the rotation for coarse alignment. We then apply the final ICP on the minimal
datapoints (as described in Section 5.3) of the gallery and the probe. This yields
93.71% rank-1 recognition for ear biometrics. In the second experiment, we apply the second stage of matching with distance consistency and use both the proportion of consistent distances and the proportion of consistent rotations for initial matching as well as for coarse alignment prior to the application of the final ICP on the best 20 gallery candidates. These changes improve the recognition rate to 95.03%, which confirms the significance of using the
geometric consistencies (see Fig. 5.5(a)).
On dataset C, without the geometric consistency check we obtain 71.57% and
81.60% rank-1 ear identification rate before and after using ICP on the best 20
gallery candidates sorted by L3DF-based measures. However, the results improve
to 79.76% and 92.77% respectively when we use our geometric consistency checks
and measures. The improved results are shown in Fig. 5.5(b).
The strength of the proportion of consistent rotations (propRot) measure over
other similarity measures is illustrated in Fig. 5.6 for an identification scenario on
dataset C. In the figure, the mean error of feature matches and the proportion of
consistent distances are indicated by errL3DF and propDist respectively.
Figure 5.6: Comparing the identification performance of different geometric consistency measures on dataset C (identification rate vs. rank; curves for errL3DF, propRot and propDist).
5.5 Comparative Study
Islam et al. [83] reported an 84% rank-1 ear recognition result using the first stage of feature matching along with the proportion of best rotations on the first 100 images of Collection F of the UND database. For the same dataset, using distance consistency and the proportion of consistent distances, we have obtained 92% rank-1 ear recognition.
Using distance consistency, we have obtained 9% better identification and 8% better verification results for ear recognition compared to those reported in [76].
Mian et al. [117] used edge error and node error on graphs constructed from the keypoints of the matching local features. Although this roughly measures the distance consistency between matched features, it is less reliable when there are many bad matches. Moreover, that work used the measure only as a similarity score rather than for filtering out incorrect matches. For the dataset used
in [76], we obtain 69.13% recognition accuracy for ears using feature errors and graph errors. For the same dataset, the corresponding result with our distance consistency and proportion of consistent rotations is 79.74%.
Chen and Bhanu [32] use geometric constraints similar to ours on their Local Feature Patches. However, they do not use an explicit second round of matching.
5.6 Conclusion and Future Work
Our results indicate that this simple technique yields worthwhile gains. It also seems very likely that we could exploit geometric consistency further. For example, instead of choosing only the most distance-consistent match, we could choose a whole set of matches that are mutually distance consistent. Then, provided that we have at least three matches, we could calculate a rigid transformation via least-squares fitting and use it to directly map probe locations to the required gallery locations (modulo the threshold). We intend to do this in future work, and further gains seem likely.
Acknowledgment
This research is sponsored by ARC grants DP0664228 and LE0775672. The authors acknowledge the use of the UND Biometrics database of profile images and the FRGC v2 face database for ear and face recognition. They would also like to thank M. Bennamoun for his valuable input, A. Mian for his 3D face normalization code and D'Errico for his surface fitting code.
CHAPTER 6
Efficient Detection and Recognition of Textured
3D Ears
Abstract
The use of ear shape as a biometric trait is a recent trend in research. However,
fast and accurate detection and recognition of the ear are very challenging because
of its complex geometry. In this work, a very fast 2D AdaBoost detector is combined
with fast 3D local feature matching and fine matching via an Iterative Closest Point
(ICP) algorithm to obtain a complete, robust and fully automatic system with a
good balance between speed and accuracy. Ear images are detected from 2D profile
images using the proposed Cascaded AdaBoost detector. The corresponding 3D
ear data is then extracted from the co-registered range image and represented with
local 3D features. Unlike previous approaches, local features are used to construct
a rejection classifier, to extract a minimal region with feature-rich data points and
finally, to compute the initial transformation for matching with the ICP algorithm.
The proposed system provides a detection rate of 99.9% and an identification rate
of 95.4% on Collection F of the UND database. On a Core 2 Quad 9550, 2.83 GHz
machine, it takes around 7.7 ms to detect an ear from a 640×480 image. Extracting
features from an ear takes 22.2 sec and matching it with a gallery using only the
local features takes 0.06 sec while using the full matching including ICP requires
2.28 sec on average.
keywords
Biometrics, ear detection, 3D ear recognition, 3D local features, geometric con-
sistency.
6.1 Introduction
Instances of fraudulent breaches of traditional identity card based systems have
motivated increased interest in strengthening security using biometrics for automatic
recognition [85, 146]. Among the biometric traits, the face and the ear have received
some significant attention due to the non-intrusiveness and the ease of data collec-
tion. Face recognition with neutral expressions has reached maturity with a high
This article is under review in the International Journal of Computer Vision, March 2010.
Figure 6.1: Block diagram of the proposed ear detection and recognition system. From the 2D and 3D profile face data, the pipeline performs ear detection and ear data extraction, ear data normalization, keypoint identification and 3D local feature extraction; on-line probe features are matched against off-line stored gallery features (L3DF-based matching), followed by coarse alignment using the feature correspondences, fine matching with ICP and the recognition decision.
degree of accuracy [17, 78, 117, 190]. However, changes due to facial expressions, the use of cosmetics and eyeglasses, the presence of facial hair including beards, and aging significantly affect the performance of face recognition systems. The ear, com-
pared to the face, is much smaller in size but has a rich structure [8] and a distinct
shape [86] which remains unchanged from 8 to 70 years of age (as determined by
Iannarelli [72] in a study of 10,000 ears). It is, therefore, a very suitable alternative
or complement to the face for effective human recognition [20, 28, 69, 78].
However, reduced spatial resolution and a uniform distribution of color sometimes make it difficult to detect and recognize the ear from arbitrary profile or side face images. The presence of nearby hair and ear-rings also makes it very challenging for non-interactive biometric applications.
In this work, we demonstrate that the Cascaded AdaBoost (Adaptive Boost-
ing) [161] approach with appropriate Haar-features allows accurate and very fast
detection of ears while being sufficient for a Local 3D Feature (L3DF) [117] based
recognition. A detection rate of 99.9% is obtained on the UND Biometrics Database
with 830 images of 415 subjects, taking only 7.7 ms on average using a C++ implementation on a Core 2 Quad 9550, 2.83 GHz PC. The approach is found to be significantly robust to ear-rings, hair and ear-phones. As illustrated in Fig. 6.1, the
detected ear sub-window is cropped from the 2D and the corresponding co-registered
3D data and represented with Local 3D Features. These features are constructed
by approximating surfaces around some distinctive keypoints based on the neigh-
boring information. When matching a probe with a gallery, a rejection classifier is
built based on the distance and geometric consistency among the feature vectors. A
minimal rectangular region containing all the matching features is extracted from
the probe and the best few gallery candidates. These selected and minimal gallery-
probe datasets are coarsely aligned based on the geometric information extracted
from the feature correspondences and then finely matched via the Iterative Closest
Point (ICP) algorithm. While evaluating the performance of the complete system
on the UND-J, the largest available ear database, we obtain an identification rate
of 93.5% with an Equal Error Rate (EER) of 4.1%. The corresponding rates for the
UND-F dataset are 95.4% and 2.3% and the rates for a new dataset of 50 subjects all
wearing ear-phones are 98% and 1%. With an unoptimized MATLAB implementa-
tion, the average time required for the feature extraction, the L3DF-based matching
and for the full matching including ICP are 22.2, 0.06 and 2.28 seconds respectively.
The rest of the paper is organized as follows. Related work and contributions
of this paper are described in Section 7.2. The proposed ear detection approach is
elaborated in Section 6.3. The recognition approach with local 3D features is ex-
plained in Sections 6.4 and 6.5. The performance of the approaches are evaluated in
Sections 6.6 and 6.7. The proposed approaches are compared with other approaches
in Section 7.8 followed by a conclusion in Section 7.9.
6.2 Related Work and Contributions
In this section, we describe the methodology and performance of the existing 2D
and 3D ear detection and recognition approaches. We then discuss the motivation
inspired from the limitations of these approaches and highlight the contributions of
this paper.
6.2.1 Ear Detection Approaches
Based on the type of data used, existing ear detection or ear region extraction
approaches can be classified as 2D, 3D and multimodal 2D+3D. However, most ap-
proaches use only 2D profile images. One of the earliest 2D ear detection approaches
is proposed by Burge and Burger [20] who used Canny edge maps [24] to find the
ear contours. Ansari and Gupta [7] also used a Canny edge detector to extract the
ear edges and segmented them into convex and concave curves. After the elimina-
tion of non-ear edges, they found the final outer helix curve based on the relative
values of angles and some predefined thresholds. They then joined the two end
points of the helix curve with straight lines to get the complete ear boundary. They
obtained 93.3% accuracy of localizing the ears on a database of 700 samples. Ear
contours were also detected based on illumination changes within a chosen window
by Choras [37]. The author compared the difference between the maximum and
minimum intensity values of a window to a threshold computed from the mean and
standard deviation of that region in order to decide whether the center of the region
belongs to the contour of the ear or to the background.
Ear detection approaches that utilize 2D template matching include the work
of Yuizono et al. [187] where both hierarchical and sequential similarity detection
algorithms were used to detect the ear from 2D intensity images. Another technique
based on a modified snake algorithm and an ovoid model was proposed by Alvarez et
al. [5]. It requires the user to input an approximated ear contour which is then used
for estimating the ovoid model parameters for matching. Yan and Bowyer [176] manually selected the Triangular Fossa and the Incisure Intertragica on the original 2D profile image and drew two lines to be used as landmarks: one along the border between the ear and the face, and the other from the top of the ear to the bottom. The authors found this method suitable for PCA-based and edge-based matching.
The Hough Transform can extract shapes with properties equivalent to template
matching and was used by Arbab-Zavar and Nixon [8] to detect the elliptical shape
of the ear. The authors successfully detected the ear region in all of the 252 profile
images of a non-occluded subset of the XM2VTS database. For the UND database,
they first detected the face region using skin detection and the Canny edge operator
followed by the extraction of the ear region using their proposed method with a
success rate of 91%. They also introduced synthetic occlusions vertically from top
to bottom on the ear region of the first dataset and obtained around 93% and 90%
detection rates for 20% and 30% occlusion respectively. Recently, Gentile et al. [58]
used AdaBoost [161] to detect the ear from a profile face as part of their multi-
biometric approach for detecting drivers’ profiles in a security checkpoint. In an
experiment with 46 images from 23 subjects, they obtained an ear detection rate of
97% with seven false positives per image. They did not report the efficiency of their
system.
Approaches using only 3D or range data include Yan and Bowyer’s two-line
based landmarks and 3D masks [176], Chen and Bhanu’s 3D template matching [29]
and the ear-shape-model based approach [31]. Similar to their 2D technique [176]
mentioned above, Yan and Bowyer [176] drew two lines on the original range image
to find the orientation and scaling of the ear. They rotated and scaled a mask
accordingly and applied it on the original image to crop the 3D ear data in an
ICP-based matching approach. Chen and Bhanu [29] combined template matching
with average histogram to detect ears. They achieved a 91.5% detection rate with
about 3% False Positive Rate (FPR). In [31], they represented an ear shape model
by a set of discrete 3D vertices on the ear helix and anti-helix parts and aligned
the model with the range images to detect the ear parts. With this approach, they
obtained 92.5% detection accuracy on the University of California, Riverside (UCR)
ear dataset with 312 images and an average detection time of 6.5 sec on a 2.4 GHz
Celeron CPU.
Among the multimodal 2D+3D approaches, Yan and Bowyer [179] and Chen and
Bhanu [32] are prominent. In the first approach, the ear region was initially located
by taking a predefined sector from the nose tip. The non-ear portion was then
cropped out from that sector using a skin detection algorithm and the ear pit was
detected using Gaussian smoothing and curvature estimation algorithms. An active
contour algorithm was applied to extract the ear contour. Using the color and the
depth information separately for the active contour, ear detection accuracies of 79%
and 85% respectively were obtained. However, using both 2D and 3D information,
ears from all the profile images were successfully extracted. Thus, the system is
automatic but depends highly on the accuracy of detection of nose tip and ear pit
and it fails when the ear pit is not visible. Chen and Bhanu [32] also used both color
and range images to extract ear data. They used a reference ear shape model based
on the helix and anti-helix curves and the global-to-local shape registration. They
obtained 99.3% and 87.7% detection rates while tested on the UCR ear database of
902 images from 155 subjects and on 700 images of the UND database, respectively.
The detection time for the UCR database is reported as 9.5 sec with a MATLAB
implementation on a 2.4 GHz Celeron CPU.
6.2.2 Ear Recognition Approaches
Most of the existing ear recognition techniques are based on 2D data and exten-
sive surveys can be found in [78, 143]. Some of them report very high accuracies but
on smaller databases; e.g. Choras [37] obtained 100% recognition on a database
of 12 subjects and Hurley et al. [70] obtained 99.2% accuracy on a database of 63
subjects. As expected, performance generally drops for larger databases, e.g. Yan
and Bowyer [180] report a performance drop from 92% to 84.1% for database sizes
of 25 and 302 subjects respectively. Also, most of the approaches do not consider
occlusion in the ear images (e.g. [63, 106, 34, 70, 188, 39]). Considering these issues
and the scope of the paper, only those approaches using large 3D databases and
somewhat occluded data are summarized in Table 6.1 and described below.
Table 6.1: Summary of the existing 3D ear recognition approaches

Publication                           | Methodology    | Dataset (gallery, probe) | Rec. Rate (%)
Yan and Bowyer, 2007 [179]            | 3D ICP         | UND-J (415, 1386)        | 97.8
Chen and Bhanu, 2007 [32]             | LSP and 3D ICP | UCR (155, 155)           | 96.8
                                      |                | UND-F (302, 302)         | 96.4
Passalis et al., 2007 [136]           | AEM, ICP, DMF  | UND-J (415, 415)         | 93.9
Cadavid and Abdel-Mottaleb, 2007 [22] | 3D ICP         | Proprietary (61, 25)     | 84
Yan and Bowyer, 2005 [180]            | 3D ICP         | UND-F (302, 302)         | 84.1
Yan and Bowyer [179] applied 3D ICP with an initial translation using the ear pit location computed during the ear detection process. They achieved 97.8% rank-1 recognition with an Equal Error Rate (EER) of 1.2% on the whole UND Collection
J dataset consisting of 1386 probes of 415 subjects and 415 gallery images. They
obtained a recognition rate of 95.7% on a subset of 70 images from this dataset which
have limited occlusions with earrings and hair. In another experiment with the UND
Collection G dataset of 24 subjects each having a straight-on and a 45 degrees off
center image, they achieved a 70.8% recognition rate. However, the system is not expected to work properly if the nose tip or the ear pit is not clearly visible, which may happen due to pose variations or coverage by hair or ear-phones (see Fig. 6.10 and 6.12).
Chen and Bhanu [32] used a modified version of ICP for 3D ear recognition. They
obtained 96.4% recognition on Collection F of the UND database (including occluded
and non-occluded images of 302 subjects) and 87.5% recognition for straight-on to 45
degree off images. They obtained 94.4% rank-1 recognition rate for the UCR dataset
ES2 which comprises 902 images of 155 subjects taken all in the same day. They
used local features for representation and coarse alignment of ear data and obtained
a better performance than their helix-anti-helix representation. Their approach
assumes perfect ear detection, otherwise manual extraction of the ear contour is
performed prior to recognition.
Passalis et al. [136] used a generic annotated ear model (AEM), ICP and Simu-
lated Annealing algorithms to register and fit each ear dataset. They then extracted
a compact biometric signature for matching. Their approach required 30 sec for en-
rolment per individual and less than 1 ms for matching two biometric signatures
on a Pentium 4, 3 GHz CPU. They computed the full similarity matrix with 415
columns (galleries) and 415 rows (probes) for the UND-J dataset, taking seven hours of enrolment and a few minutes of matching, and achieved a 93.9% recognition rate.
Cadavid and Abdel-Mottaleb [22] extracted a 3D ear model from video sequences
and used 3D ICP for matching. They obtained 84% rank one recognition while
testing with a database of 61 gallery and 25 probe non-occluded images.
All of the above recognition approaches have only considered left or right ears.
An exception is Choras [39] who proposed to pre-classify each detected ear as left
or right based on the geometrical parameters of the earlobe. The author reported
accurate pre-classification of all 800 images from 80 subjects. Hence, distinguishing
left and right ears seems relatively easy. In cases where both profile images are not
available, extracted ear data from the opposite profile can be mirrored for matching
with still relatively reliable recognition. Yan and Bowyer [176, 179] experimentally
demonstrated that although some people’s left and right ears have recognizably
different shapes, most people’s two ears are approximately bilaterally symmetric.
They obtained around 90% recognition rate while matching mirrored left ears to
right ears on a dataset of 119 subjects. We have focused on left ears, but the above
work suggests our research can be used in other situations also.
6.2.3 Motivations and Contributions
Most of the ear detection approaches mentioned above are not fast enough to be
applied in real-time applications. Recently, Viola and Jones have used the AdaBoost
algorithm [53, 150] to detect faces and obtained a speed of 15 frames per second while
scanning 384 by 288 pixel images on a 700 MHz Intel Pentium III [161]. Owing to this extreme speed and simplicity of implementation, AdaBoost has further been used for
detecting the ball in a soccer game [153], pedestrians [122], eyes [130], mouths [103]
and hands [36]. However, existing ear detection using AdaBoost (see Section 6.2.1)
does not achieve significant accuracy. In fact, even for faces, Viola and Jones [161]
obtained only 93.7% detection rate with 422 false positives on MIT+CMU face
database. Ear detection is more challenging because ears are much smaller than
faces and are often covered by hair, ear-rings, ear-phones, etc. The challenge lies in reducing incorrect or partial localizations while maintaining a high correct detection rate. Hence,
we are motivated to determine the right way to instantiate the general AdaBoost
approach with the specifics required in order to specialize it for ear detection.
Most of the ear recognition approaches use global features and ICP for matching.
Compared to local features, global features are more sensitive to occlusions and vari-
ations in pose, scale and illumination. Although ICP is considered to be the most
accurate matching algorithm, it is computationally expensive and it requires con-
cisely cropped ear data and a good initial alignment between the gallery-probe pair
so that it does not converge to a local minimum. Yan and Bowyer [179] suggested
that the performance of ICP might be enhanced using feature classifiers. Recently,
Mian et al. [117] proposed local 3D features for face recognition. Using these fea-
tures alone, they reported 99% recognition accuracy on neutral versus neutral and
93.5% on neutral versus all on the FRGC v2 3D face dataset. They also obtained
a time efficiency of 23 matches per second on a 3.2 GHz Pentium IV machine with
1GB RAM. In this paper, we adapt these features for the ear and use them for
coarse alignment as well as for rejecting a large number of false matches. We also
use L3DFs for extracting a minimal set of datapoints to be used in ICP.
The specific contributions of this paper are as follows:
1. A fast and fully automatic ear detection approach using cascaded AdaBoost
classifiers trained with three new features and a rectangular detection window.
No assumption is made about the localization of the nose or the ear pit.
2. The local 3D features are used for ear recognition in a more accurate way than
originally proposed in [117] for the face including an explicit second round of
matching based on geometric consistency. L3DFs are used not only for coarse
alignment but also for rejecting most false matches.
3. A novel approach for extracting minimal feature-rich data points for the final
ICP alignment is proposed which significantly increases the time efficiency of
the recognition system.
4. Experiments are performed on a new database of profile images with ear-
phones along with the largest publicly available dataset of the UND and high
recognition rates are achieved without an explicit extraction of the ear con-
tours.
6.3 Automatic Detection and Extraction of Ear Data
The ear region is detected on 2D profile images using a detector based on the
AdaBoost algorithm [53, 150, 74, 75]. Following [161], Haar-like features are used
as weak classifiers and learned from a number of ear and non-ear images. After
training, the detector first scans through the 2D profile images to identify a small
rectangular region containing the ear. The corresponding 3D data is then extracted
from the co-registered 3D profile data. The complete detection framework is shown
in the block diagram of Fig. 6.2. A sample of a profile image and the corresponding
2D and 3D ear data detected by our system is also shown in the same figure. The
details of the construction of the detector and its functional procedures are described
in this section.
Figure 6.2: Block diagram of the proposed ear detection approach. Off-line training: 2D and 3D profile face data are collected, the 2D data are pre-processed, rectangular Haar features are created, classifiers are trained using cascaded AdaBoost, and the classifiers are validated and updated with false positives. On-line detection: the 2D ear region is scanned and detected in the 2D profile, and the corresponding 3D ear data is extracted from the 3D profile.
6.3.1 Feature Space
The eight different types of rectangular Haar feature templates as shown in
Fig. 6.3 are used to construct our AdaBoost based detector. Among these, the first
five (a-e) were also used by Viola and Jones [161] to detect different types of lines
and curves in the face. We devised the later three templates (f-h) to detect specific
features of the ear which are not available in the frontal face. The center-surround
template is designed to detect any cavity in the ear (e.g. ear pit) and the other two
(adopted from [90]) are for detecting helix and the anti-helix curves. Although (f)
is the intersection of (c) and (e), we use it as a separate feature template because no
linear combination of those features yields (f) and as will be discussed in the next
sub-section, the AdaBoost algorithm used for feature selection greedily chooses the
best individual Haar features, rather than their best combination.
Figure 6.3: The Haar feature templates (a)-(h) used in training the AdaBoost detector (templates (f), (g) and (h) are proposed for detecting specific features of the ear).
To create a number of Haar features out of the above eight types of templates or
filters, we choose a window to which all the input samples are normalized. Viola and
Jones [161] used a square window of size 24×24 for face detection. Our experiments
with training data show that ears are roughly proportional to a rectangular window
of size 16 × 24. One benefit of choosing a smaller window size is the reduction of
training time and resources. The templates are shifted and scaled horizontally and
vertically along the chosen window and a feature is numbered for each location, scale
and type. Thus, for the chosen window size and a shift of one pixel, we obtain an
over-complete basis of 96413 potential features.
The value of a feature is computed by subtracting the sum of the pixels in the
grey region(s) from that of the dark region(s) (except in the case of (c), (e) and (f)
in Fig. 6.3, where the sum in the dark region is multiplied by 2, 2 and 8 respectively
before performing the subtraction in order to make the weight of this region equal
to that of the grey region(s)).
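For concreteness, a sketch of how such a feature value is computed in constant time from an integral image (the geometry shown is the two-rectangle type (a); names are ours):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y+1, :x+1]."""
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)

def rect_sum(ii, top, left, h, w):
    """Sum of img[top:top+h, left:left+w] using four lookups."""
    a = ii[top + h - 1, left + w - 1]
    b = ii[top - 1, left + w - 1] if top > 0 else 0.0
    c = ii[top + h - 1, left - 1] if left > 0 else 0.0
    d = ii[top - 1, left - 1] if top > 0 and left > 0 else 0.0
    return a - b - c + d

def two_rect_feature(ii, top, left, h, w):
    """Type (a): grey half minus dark half, side by side."""
    grey = rect_sum(ii, top, left, h, w // 2)
    dark = rect_sum(ii, top, left + w // 2, h, w // 2)
    return grey - dark
```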
6.3.2 Construction of the Classifiers
The rectangular Haar-like features described above constitute the weak classifiers
of our detection algorithm. A set of such classifiers is selected and combined to construct a strong classifier via AdaBoost [53, 150], and a sequence of these strong classifiers is then cascaded following Viola and Jones [161]. Thus, each strong classifier
in the cascade is a linear combination of the best weak classifiers, with weights
inversely proportional to training errors on those examples not previously rejected
by an early stage of the cascade. This results in a fast detection as most of the
negative sub-windows are rejected using only a small number of features associated
with the initial stages.
The optimization of the number of stages, the number of features per stage
and the threshold for each stage for a target detection rate (D) and false positive
rate ($F_t$) is obtained, similar to [161], by aiming for a fixed maximum FPR ($f_m$) and a minimum detection rate ($d_{min}$) for each stage. These are computed from the following inequalities: $F_t < (f_m)^n$ and $D > (d_{min})^n$, where $n$ is the number of stages, typically 10-50.
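As an illustrative check (the per-stage targets are not restated here, so these numbers are hypothetical): with $n = 18$ stages, per-stage targets of $f_m = 0.65$ and $d_{min} = 0.999$ would satisfy the overall goals chosen below, since $F_t < 0.65^{18} \approx 4.3 \times 10^{-4} < 0.001$ and $D > 0.999^{18} \approx 0.982 > 0.98$.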
6.3.3 Training the Classifiers
The training datasets used to build the proposed ear detector, their preprocessing stage, the chosen training parameters and other implementation aspects are described as follows.
Dataset The positive training set is built with 5000 left ear images cropped from
the profile face images of different databases covering a wide range of races, sexes,
appearances, orientations and illuminations. This set includes 429 images of the
University of Notre Dame (UND) Biometrics Database (Collection F) [157, 179],
659 of the NIST Mugshot Identification Database (MID) [129], 573 of XM2VTSDB
[112], 201 images of the USTB [159], 15 of the MIT-CBCL [165, 120], and 188 of the
UMIST [61, 154] face databases. It also includes around 3000 images synthesized by rotating some images from the USTB, the UND and the XM2VTSDB databases by -15 to +15 degrees.
Our negative training set for the first stage of the cascade includes 10,000 images
randomly chosen from a set of around 65,000 non-ear images. These images are
mostly cropped from profile images excluding the ear area. We also include some
images of trees, birds and landscapes randomly downloaded from the web. Examples
of the positive and negative image set are shown in Fig. 6.4. The negative training
set for the second and subsequent stages are made up dynamically as follows. A set
of 6000 large images without ears are scanned through at the end of each stage of
the cascade by the classifier developed in that stage. Any sub-window classified as
an ear is considered as a false positive and a set of not more than 5000 such false
positives are randomly chosen to include in the negative set for the following stages.
Figure 6.4: Examples of ear (top) and non-ear (bottom) images used in the training.
The validation set used to compute the rates of detection and false positives
during the training process includes 5000 positives (cropped and synthesized ear
images) and 6000 negatives (non-ear images). The negatives for the first stage are
randomly chosen from a set of 12000 images not included in the training set. For
the second and the subsequent stages, negatives are randomly chosen from the false
positives found by the classifier of the previous stage and unused in the negative
training set.
Preprocessing the Data As mentioned earlier, input images are collected from
different sources with varying size and intensity values. Therefore, all the input
images are scale normalized to the chosen input pattern size. Viola and Jones
reported a square input pattern of size 24 × 24 as the most suitable for detecting
frontal faces [161]. Considering the shape of the ear, we instead use a rectangular
pattern of size 16× 24.
The variance of the intensity values of the images is also normalized to minimize the effect of lighting. A similar normalization is performed for each sub-window scanned during testing.
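In code, this per-window lighting normalization is a one-liner (a sketch under our own naming):

```python
import numpy as np

def normalize_window(win, eps=1e-8):
    """Zero-mean, unit-variance lighting normalization of a sub-window."""
    win = win.astype(np.float64)
    return (win - win.mean()) / (win.std() + eps)
```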
Training the Cascade In order to train the cascade, we choose $F_t = 0.001$ and $D = 0.98$. However, to quickly reject most of the false positives using a small number of features, we fix the first four stages to be completed with 10, 20, 60 and 80 features. We also performed validation after adding every ten features for the first ten stages and after every 25 features for the remaining stages. The detection and false positive rates computed during the validation of each stage decrease gradually towards the target. The training finishes at stage 18 with a total of 5035 rectangular features, including 1425 features in the last stage.
Training Time The training process involved a huge amount of computation due to the large training set and the very low target FPR, taking several weeks on a single PC. To speed up the process, we distributed the job over a network of around 30 PCs. For this purpose, we used MatlabMPI, a MATLAB implementation of the Message Passing Interface (MPI) standard that allows any MATLAB program to exploit multiple processors [93]. This reduced the training time to the order of days. An optimized C or C++ implementation would reduce this time further, but since the training never needs to be performed again, our MATLAB implementation was sufficient.
6.3.4 Ear Detection with the Cascaded Classifiers
The trained classifiers of all the stages are used to build the ear detector in a cascaded manner. The detector is scanned over a test profile image at different scales and locations. A classifier in the cascade is only used when a sub-window in the
test image is detected as positive (ear) by the classifier of the previous stage and
accepted finally only when it passes through all of them.
To detect various sizes of ears, instead of resizing the given test image, we scale
up the detector along with the corresponding features and use an integral image
calculation. The approach is similar to that of Viola and Jones [161] who also
illustrated this to be more time-efficient than the conventional pyramid approach.
If the rectangular detector (of 16 × 24 or its scaled-up size) matches any sub-
window of the image, a rectangle is drawn to show the detection (See Fig. 6.5). The
integration of multiple detections (if any) is described in the following sub-section
and the overall performance of the detector is discussed in Section 6.6.
As mentioned in Section 6.3.3, our system is trained for detecting ears from the
left profile images only. However, if the input image is a right profile and the ear
detector fails, then the features constituting the detector can be flipped to detect
the right ears.
Figure 6.5: Sample of detections: (a) Detection with single window. (b) Multi-
detection integration (best seen in color).
6.3.5 Multi-detection Integration
Since the detector scans over a region in the test image with different scales and
shift sizes, there is the possibility of multiple detections of the ear or ear-like regions.
To integrate such multiple detections, we propose the clustering algorithm reported
in Algorithm 1.
The clustering algorithm is based on the percentage of overlap of the rectangles
representing the detected sub-windows. We cluster a pair of rectangles together if
the mutual area shared between them is larger than a predefined threshold, minOv
(0 < minOv < 1). A higher value of this parameter may result in multiple detections
near the ear. We empirically chose a threshold value of 0.01.
Based on the observation that the number of true detections at different scales
over the ear region is larger than the false detections on ear-like region(s) (if any),
we added an option in the algorithm to avoid such false positive(s) by only taking
the one that clusters the maximum number of rectangles. This is appropriate when
only one ear needs to be detected which is the case for most recognition applications.
An example of integrating three detection windows is illustrated in Fig. 6.5(b).
Each of the detections is shown by a rectangle in yellow lines while the integrated
detection window is shown by a rectangle in bold dotted cyan lines.
Algorithm 1. Integration of multiple ear detections
0. (Input) Given a set of detected rectangles rects, the minimum
percentage of overlap required minOv, and the option for avoiding
false detections opt.
1. (Initialize) Set the intermediate rectangle set tempRects empty.
2. (Multi-detection integration procedure)
2.a While the number of rectangles in rects >= 1:
i. Find the areas of intersection of the first rectangle in
rects with all rectangles in rects.
ii. Find the rectangles combRects, and their number intN,
whose percentage of overlap >= minOv.
iii. Store the mean of combRects and intN in tempRects.
iv. Remove the rectangles in combRects from rects.
2.b If tempRects contains more than one rectangle and opt = 'yes':
i. Find the rectangle fRect in tempRects for which intN
is maximum.
ii. Remove all rectangles except fRect from tempRects.
End if
3. (Output) Output the rectangle(s) in tempRects.
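A minimal Python sketch of Algorithm 1, with rectangles as (x, y, w, h) tuples. The exact definition of "percentage of overlap" is not spelled out above, so the ratio used here (intersection over the smaller rectangle) is our assumption, as are the helper names:

    import numpy as np

    def intersection_area(a, b):
        w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
        h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
        return max(w, 0) * max(h, 0)

    def overlap_ratio(a, b):
        # Assumed reading: shared area relative to the smaller rectangle.
        return intersection_area(a, b) / min(a[2] * a[3], b[2] * b[3])

    def integrate_detections(rects, min_ov, keep_densest=True):
        rects = [tuple(r) for r in rects]
        clusters = []                      # (mean rectangle, intN) pairs
        while rects:
            ref = rects[0]
            grp = [r for r in rects if overlap_ratio(ref, r) >= min_ov]
            clusters.append((tuple(np.mean(grp, axis=0)), len(grp)))
            rects = [r for r in rects if r not in grp]
        if keep_densest:                   # opt = 'yes' in Algorithm 1
            clusters = [max(clusters, key=lambda c: c[1])]
        return [c[0] for c in clusters]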
6.3.6 3D Ear Region Extraction
Assuming that the 3D profile data are co-registered with corresponding 2D data
(which is normally the case when data is collected with a range scanner), the location
information of the detected rectangular ear region in the 2D profile is used for
3D ear data extraction. To ensure that the whole ear is included and to allow
the extraction of features on and slightly outside the ear region, we expanded the
detected ear regions by an additional 25 pixels in each direction. This extended ear
region is then cropped to be used as 3D ear data. Fig. 6.9 illustrates the original
and expanded region of extraction. If our ear detection system indicates that a right
ear is detected, we flip the 3D ear data to allow it to be matched with the left ears
in the gallery.
6.3.7 Extracted Ear Data Normalization
Once the 3D ear is detected, we remove all the spikes by filtering the data. We
perform triangulation on the data points, remove edges longer than a predefined
threshold of 0.6 mm and finally, remove disconnected points [114].
The extracted 3D ear data varies in dimensions depending on the detection
window. Therefore, we normalize the 3D data by centering it on the mean and then
sampling it on a uniform grid of up to 132 mm by 106 mm. The resampling makes the
data points more uniformly distributed and fills any holes. It also makes
the local features more stable and increases the accuracy of ICP based matching.
We perform a surface fitting based on the interpolation of the neighboring data
points at 0.5 mm resolution. This also fills holes or missing data (if any) due to oily
skin or sensor error [176, 179] (as shown in Fig. 6.17 and 6.20).
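A sketch of the resampling step using SciPy, assuming the cropped ear is available as an N x 3 point cloud after mean-centering; the 0.5 mm grid pitch is from the text, while the interpolation method is our choice:

    import numpy as np
    from scipy.interpolate import griddata

    def resample_ear(points, step=0.5):
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        xi = np.arange(x.min(), x.max(), step)
        yi = np.arange(y.min(), y.max(), step)
        XI, YI = np.meshgrid(xi, yi)
        # Interpolating z over a uniform grid makes the point density
        # uniform and fills small holes inside the convex hull.
        ZI = griddata((x, y), z, (XI, YI), method='cubic')
        return XI, YI, ZI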
6.4 Representation and Extraction of Local 3D Features
The performance of any recognition system greatly depends on how the relevant
data is represented and how the significant features are extracted from it. Although
the core of our data representation and the feature extraction technique is similar to
the L3DFs proposed for face data in [117, 83], the technique is modified to make it
suitable for the ear as preliminarily presented in our previous work [83] and further
enhanced as described in this section.
6.4.1 KeyPoint Selection for L3DFs
A 3D local surface feature can be depicted as a 3D surface constructed using
data points within a small sphere of radius r1 centered at a keypoint p. An example
of a feature is shown in Fig. 6.6. As outlined by Mian et al. [117], keypoints
are selected from surfaces distinct enough to differentiate between range images of
different persons.
Figure 6.6: Example of a 3D local surface (right image). The region from which it
is extracted is shown by a circle on the left image.
Figure 6.7: (a) Location of keypoints on the gallery (left) and the probe (right)
images of three different individuals (this figure is best seen in color). (b) Cumulative
percentage of repeatability of the keypoints.
For keypoints we only consider data points that lie on a grid with a resolution of
2 mm in order to increase distinctiveness of the surface to be extracted. We find the
distance of each of the data points from the boundary and take only those points
with a distance greater than a predefined boundary limit. The boundary limit is
chosen slightly longer than the radius of the 3D local feature surface (r1) so that
the feature calculation does not depend on regions outside the boundary and the
allowed region corresponds closely with the ear. We call the points within this limit
seed points. In our experiments, a boundary limit of r1 + 10 mm was found to be the
most suitable.
To check whether the data points around a seed point contain enough descriptive
information, we adopt the approach of Mian et al. [117] discussed in short as follows.
We randomly choose a seed point and take a sphere of data points around that point
which are within a distance of r1. We apply PCA on those data points and align
them with their principal axes. The difference between the eigenvalues along the
first two principal axes of the local region is computed as δ. It is then compared to
a threshold (t1) and we accept a seed point as a keypoint if δ > t1. The higher t1,
the fewer features we get, but lowering t1 can result in the selection of less significant
feature points with unreliable orientations. This is because the value of δ indicates
the extent of asymmetrical depth variation around a seed point; for example,
t1 = 0 would admit a completely planar or spherical surface.
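A sketch of this keypoint test, with δ taken as the difference of the two largest eigenvalues of the neighborhood's covariance (whether covariance or raw scatter is used is our assumption; t1 = 2 is from the text):

    import numpy as np

    def is_keypoint(nbhd, t1=2.0):
        # nbhd: (N, 3) data points within radius r1 of the seed point.
        centered = nbhd - nbhd.mean(axis=0)
        eigvals = np.sort(np.linalg.eigvalsh(np.cov(centered.T)))[::-1]
        delta = eigvals[0] - eigvals[1]   # asymmetry of depth variation
        return delta > t1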
We continue selecting seed points as keypoints until nf features are created.
For a seed resolution (rs) of 2 mm, r1 of 15 mm, t1 of 2 and nf of 200, we found
200 keypoints for most of the gallery and probe ears. However, we found as few
as 65 features, particularly in cases where data is missing.
The values of all the parameters used in the feature extraction are empirically
chosen and the effect of their variation is further discussed in the Appendix. Our
experiments with ear data and those with face data by Mian et al. [117] show that the
performance of the keypoint detection algorithm and hence 3D recognition do not
vary significantly with small variations in the values of these parameters. Therefore,
we use the same values to extract features from all the ear databases.
The suitability of our local features on the ear data is illustrated in Fig. 6.7a.
It shows that keypoints are different for ear images of different individuals. It
also shows that these features have a high degree of repeatability for the ear data
of the same individual. By repeatability we mean the proportion of probe
feature points that have a corresponding gallery feature point within a particular
distance. We performed a quantitative analysis of the repeatability similar to [117].
The probe and the gallery data of the same individual are aligned using the ICP
algorithm in order to allow computation of repeatability. The cumulative percentage
of repeatability as a function of the nearest neighbor error between gallery and probe
features of ten different individuals is shown in Fig. 6.7b. The repeatability reaches
around 80% at an error of 2 mm which is the sampling distance between the seed
points.
6.4.2 Feature Extraction and Compression
After a seed point qualifies as a keypoint, we extract a surface feature from its
neighborhood. As described in Section 6.4.1, while testing for the suitability of
the seed point we take a sphere of data points with a radius of r1 from that seed
point and align them to their principal axes. We use these rotated data points to
construct the 3D local surface feature. Similar to [117], the principal directions of
the local surface are used as the 3D coordinates to calculate the features. Since the
coordinate basis is defined locally based on the shape of the surface, the computed
features are mostly pose invariant. However, large changes in viewpoints can cause
different points of the ear to occlude and cause perturbations in the local coordinate
basis.
We fit a 30 × 30 uniformly sampled 3D surface (with a resolution of 1 mm) to
these data points. In order to avoid boundary effects, we crop the inner region of
20× 20 datapoints and store it as a feature (see Fig. 6.6).
For surface fitting, we use a publicly available surface fitting code [47]. The
motivation behind the selection of this algorithm is that it builds a surface over
the complete lattice approximating (rather than interpolating) and extrapolating
smoothly into the corners. Therefore, it is less sensitive to noise, outliers and missing
data.
In order to reduce computational time and memory storage and to make features
more robust to noise, we apply PCA on the whole gallery feature set as in [117],
after centering on the mean. The top 11 eigenvectors are then used to project gallery
and probe features into vectors of dimension 11. Unlike [117], we do not normalize
the variance of the dimensions, nor the size of the features. Instead, we preserve as
much as possible of the original geometry in the features.
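A sketch of this compression step, assuming each 20 x 20 surface is flattened to a 400-vector and the basis is learned from the gallery set only:

    import numpy as np

    def pca_basis(gallery_feats, k=11):
        # gallery_feats: (N, 400) matrix of flattened surface features.
        mean = gallery_feats.mean(axis=0)
        _, _, Vt = np.linalg.svd(gallery_feats - mean, full_matrices=False)
        return mean, Vt[:k].T             # mean (400,), basis (400, k)

    def compress(feats, mean, basis):
        # No per-dimension variance normalization, as stated above.
        return (feats - mean) @ basis     # (N, k) compressed features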
6.5 L3DF Based Matching Approach
In this section, our method of matching gallery and probe datasets is described.
We establish correspondences between extracted L3DFs [83] similar to Mian et
al. [117] for face. However, we use geometric consistency checks [80] to refine the
matching and to calculate additional similarity measures. We use the matching
information to reject a large number of false gallery candidates and to coarsely
align the remaining candidates prior to the application of a modified version of the ICP
algorithm for fine matching. The complete matching algorithm is formulated in
Algorithm 2 and discussed as follows.
6.5.1 Finding Correspondence Between Candidate Features
The similarity between two features is calculated as the Root Mean Square
(RMS) distance between corresponding points on the 20 × 20 grid generated when
the feature is created (aligned following the axes in the PCA). The RMS distance
is computed from each probe feature to all the gallery features. Matching gallery
features which are located more than a threshold (th1) away are discarded to avoid
matching in quite different areas of the cropped image. The gallery feature with
the minimum distance is considered as the corresponding feature for that particular
Figure 6.8: Feature correspondences after filtering with geometric consistency (best
seen in color).
probe feature. When multiple probe features match the same gallery feature we
retain the best match for that gallery feature as in [117].
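A sketch of this correspondence search, operating on the PCA-compressed feature vectors and keypoint locations (Euclidean distance on compressed features stands in for the RMS grid distance; th1 = 45 mm as in the Appendix):

    import numpy as np

    def match_features(p_feats, p_locs, g_feats, g_locs, th1=45.0):
        best = {}                         # gallery index -> (probe index, distance)
        for i, (pf, pl) in enumerate(zip(p_feats, p_locs)):
            near = np.linalg.norm(g_locs - pl, axis=1) <= th1   # location gate
            if not near.any():
                continue
            d = np.linalg.norm(g_feats[near] - pf, axis=1)
            j = np.flatnonzero(near)[d.argmin()]
            # Keep only the best probe match per gallery feature.
            if j not in best or d.min() < best[j][1]:
                best[j] = (i, d.min())
        return [(pi, gj, dist) for gj, (pi, dist) in best.items()]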
6.5.2 Filtering with Geometric Consistency
Unlike previous works on L3DFs for the face [117], we found it necessary to
improve our matches for the ear using geometric consistency. We add a second round
of feature matching each time a probe is compared with a gallery that uses geometric
consistency based on information extracted from the feature matches generated by
the first round. The first round of feature matching is performed just as described in
Section 6.5.1, and we use the matches generated to identify a subset that is most
geometrically consistent.
For simplicity, we measure the geometric consistency of a feature match by counting
the number of other feature matches from the first round that yield consistent
distances on the probe and gallery. More precisely, for a match with locations pi, gi we
count how many other match locations pj, gj satisfy Eqn. (6.1).

| ‖pi − pj‖ − ‖gi − gj‖ | < rs + κ √‖pi − pj‖     (6.1)
The right hand side of the above equation is a function of the spacing between
candidate keypoints (the seed resolution rs), a constant κ, and the square root of
the actual probe distance, which accounts for minor deformations and measurement
errors. The constant κ is determined empirically as 0.1.
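A sketch of the consistency count of Eqn. (6.1), given the keypoint locations of the M feature matches:

    import numpy as np

    def distance_consistency(p_locs, g_locs, rs=2.0, kappa=0.1):
        # p_locs, g_locs: (M, 3) matched keypoint locations.
        counts = np.empty(len(p_locs), dtype=int)
        for i in range(len(p_locs)):
            dp = np.linalg.norm(p_locs - p_locs[i], axis=1)
            dg = np.linalg.norm(g_locs - g_locs[i], axis=1)
            ok = np.abs(dp - dg) < rs + kappa * np.sqrt(dp)   # Eqn. (6.1)
            counts[i] = ok.sum() - 1      # exclude the match itself
        return counts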
Algorithm 2. Matching a probe with the gallery
0. (Input) Given a probe, gallery data and features, the distance
threshold th1, the distance multiplier κ, the angle threshold th2 and
the minimum number of matches m.
1. (Matching based on local 3D features)
1.a (Distance check) For each feature of the probe and all features of a
gallery:
(i) Discard gallery features with distance from the probe feature
location > th1.
(ii) Pair the probe feature with the closest gallery feature, by feature
distance.
1.b (Distance consistency check)
(i) For each of the matching feature pairs, count how many other
matches satisfy Eqn. (6.1).
(ii) Choose T as the translation of the match pair with the highest
count, CT.
(iii) Compute the proportion of consistent distances (αd = CT / |matches|).
1.c (2nd stage of matching)
(i) For all the gallery features repeat step (1.a), but do not allow
matches which are inconsistent with T.
(ii) Compute the mean of the feature distances of the matching feature
pairs (εf).
(iii) Discard the gallery if there are fewer than m feature pairs.
1.d (Calculating rotation consistency measure)
(i) For each of the selected matching pairs, count how many other
matches have rotation angles within th2.
(ii) Choose R as the rotation of the pair with the highest count, CR.
(iii) Compute the proportion of consistent rotations (αr = CR / |matches|).
1.e (Calculating keypoint distance measure)
(i) Align the keypoints of the matching probe features to those of the
corresponding gallery features using ICP.
(ii) Record the ICP error as the keypoint distance measure (εn).
2. Repeat step (1) for all galleries.
3. (Rejection classifier)
3.a For each of the gallery candidates:
(i) For each of the similarity measures (εf, αd, αr, εn), compute the
corresponding weight factor (ηf, ηd, ηr, ηn).
(ii) Compute the similarity score ε according to Eqn. (6.2).
3.b Rank the gallery candidates according to ε and discard those having
rank over 40.
4. (ICP-based matching) For each of the selected gallery candidates:
(i) Extract a minimum rectangle containing all matches from both the
gallery and probe data.
(ii) Coarsely align the extracted probe data with the gallery data using
T and R.
(iii) Finely align the extracted probe data with that of the gallery
using ICP.
5. (Output) Output the gallery having minimum ICP error as the best match
for the probe.
We then simply find the match from the first round which is most ‘distance-
consistent’ according to this measure. In the second round, we follow the same
matching procedure as in round-1 but only allow feature matches that are distance-
consistent with this match. Fig. 6.8 illustrates an example of the matches between
the features of a probe image and the corresponding gallery image (mirrored in the
z direction) in the second round. Here the green channel is used to indicate the
amount of rotational consistency for each match (best viewed in color, although in
grey scale the green channel generally dominates). It is clear that a good proportion
of these matches involve corresponding parts of the two ear images.
6.5.3 Other Similarity Measures Based on L3DFs
In addition to the mean feature distance for all the matched probe and gallery
features (εf ) used in [117], we also derive three more similarity measures based on
the geometric consistency of matched features as discussed below.
We compute the ratio of the maximum distance consistency to the total number
of matches found in the first round of matching and use that as a similarity measure,
proportion of consistent distances (αd).
We also include a component based on the consistency of the rotations implied
by the feature matches in our measure of similarity between probes and galleries.
Each feature match implies a certain 3D rotation between the probe and gallery,
since we store the rotation matrix used to create the probe feature from the probe
(calculated using PCA), and similarly for the gallery feature, and we assume that
the match occurs because the features have been aligned in the same way and come
from corresponding points. We can thus calculate the implied rotation from probe
to gallery as Rg⁻¹Rp, where Rp and Rg are the rotations used for the probe and
gallery features respectively.
We calculate these rotations for all feature matches, and for each, we determine
the count of how many of the other rotations it is consistent with. Consistency
between two rotations R1 and R2 is determined by finding the angle between them,
i.e., the rotation angle of R1⁻¹R2 (around the appropriate axis of rotation). We
consider two rotations consistent when the angle is less than 10◦ (th2). We choose
the rotation of the match that is consistent with the largest number of other matches,
and use the proportion of matches consistent with this as a similarity measure called
the proportion of consistent rotations (αr). As we shall see in Section 6.7, this
measure is the strongest among the measures used prior to applying ICP in our ear
recognition experiments and fusing with the other measures provides only a modest
but worthwhile improvement. We also use the rotation with the highest consistency
for ICP coarse alignment as described in Section 6.5.6.
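A sketch of this rotation-consistency computation; the angle between two rotations is recovered from the trace of their relative rotation (a standard identity), and the rest follows the description above:

    import numpy as np

    def rotation_angle_deg(R):
        # Rotation angle of a 3x3 rotation matrix, via its trace.
        c = (np.trace(R) - 1.0) / 2.0
        return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

    def rotation_consistency(Rp_list, Rg_list, th2=10.0):
        # Implied probe-to-gallery rotation per match: Rg^{-1} Rp
        # (the inverse of a rotation matrix is its transpose).
        implied = [Rg.T @ Rp for Rp, Rg in zip(Rp_list, Rg_list)]
        counts = [sum(rotation_angle_deg(Ri.T @ Rj) < th2
                      for Rj in implied) - 1 for Ri in implied]
        best = int(np.argmax(counts))
        alpha_r = counts[best] / len(implied)  # proportion of consistent rotations
        return implied[best], alpha_r          # rotation reused for coarse alignment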
Lastly, we develop another similarity measure called keypoint distance measure
(εn) based on the distance between the keypoints of the corresponding features. To
compute it, we apply ICP only on the keypoints (not the whole dataset) of the
matched features (obtained in the second round of matching). This corresponds to
the ‘graph node error’ described in Mian et al. [117] for the face.
6.5.4 Building a Rejection Classifier
As in [117], the similarity measures (εf , αr, αd and εn) computed in the previous
sub-section are first normalized on the scale from 0 to 1 and then combined using a
confidence weighted sum rule as shown in Eqn. (6.2).
ε = ηf εf + ηr (1 − αr) + ηd (1 − αd) + ηn εn     (6.2)
The weight factors (ηf ,ηr,ηd,ηn) are dynamically computed individually for each
probe during recognition as the ratio between the minimum and second minimum
values, taken relative to the mean value of the similarity measure for that probe
[117]. Therefore, the weights reflect the relative importance of the similarity mea-
sures for a particular probe based on the confidence in each of them. Note that
the second and third similarity measures (αr and αd) are subtracted from unity
before multiplication with the corresponding weight factors, as these have a polarity
opposite to that of the other measures (for αr and αd, higher values indicate a better match).
We observe that the combination of these similarity measures provides an accept-
able recognition rate and most of the misclassified images are matched within the
rank of 40 (see Section 6.7.7). Therefore, unlike [117], we use this combined classifier
as a rejection classifier to discard a large number of poor candidates, retaining only
the best 40 identities (ranked according to this classifier) for fine matching using ICP
as described in the following sections.
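A sketch of the fusion of Eqn. (6.2). The dynamic weight formula is paraphrased above, so the version here (separation of the two best values relative to the mean) is one plausible reading, not the exact thesis formula:

    import numpy as np

    def confidence_weight(scores):
        # Assumed reading: gap between minimum and second minimum,
        # relative to the mean score for this probe.
        s = np.sort(np.asarray(scores, dtype=float))
        return (s[1] - s[0]) / (s.mean() + 1e-12)

    def fused_score(eps_f, alpha_r, alpha_d, eps_n):
        # Inputs: per-gallery measures, each normalized to [0, 1];
        # the alpha measures are flipped so that lower is better.
        terms = [np.asarray(eps_f, float),
                 1 - np.asarray(alpha_r, float),
                 1 - np.asarray(alpha_d, float),
                 np.asarray(eps_n, float)]
        return sum(confidence_weight(t) * t for t in terms)   # Eqn. (6.2)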
6.5.5 Extraction of a Minimal Rectangular Area
We extract a reduced rectangular region (containing all the matching features)
from the originally detected gallery and probe ear data. This region is identi-
fied using the minimum and maximum co-ordinate values of the matched L3DFs.
Fig. 6.9 illustrates this region in comparison with other extraction windows (see
Section 6.3.6).
Figure 6.9: Extraction of a minimal rectangular area containing all the matching
L3DFs. The figure shows the original detection window, the extended extraction
window used for feature extraction, and the minimal extraction window used for ICP.
L3DFs do not match in regions with occlusion or excessive missing data.
By selecting a sub-window where the L3DFs match, such regions are generally
excluded. Moreover, this smaller but feature-rich region reduces the processing time
as described in Section 6.7.6.
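A sketch of the crop, assuming the bounding box is taken over the x-y extent of the matched keypoint locations:

    import numpy as np

    def minimal_rectangle(points, match_locs):
        # points: (N, 3) ear data; match_locs: (M, 3) matched keypoints.
        lo = match_locs[:, :2].min(axis=0)
        hi = match_locs[:, :2].max(axis=0)
        inside = np.all((points[:, :2] >= lo) & (points[:, :2] <= hi), axis=1)
        return points[inside]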
6.5.6 Coarse Alignment of Gallery-Probe Pairs
The input profile images of the gallery and the probe may have pose (rotation
and translation) variations. To minimize the effect of such variations, we apply the
transformation in Eqn. (6.3) to the minimal rectangular area of the probe dataset.
P ′ = RP + t (6.3)
where P and P′ are the probe dataset before and after coarse alignment respectively.
We use the translation (t) corresponding to the matched pair with the maximum
distance consistency and the rotation (R) corresponding to the matched pair with
the largest cluster of consistent rotations for this alignment. Our results show
better performance with this approach than with the alternative of minimizing
the sum of squared distances between points in the feature matches.
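Applying Eqn. (6.3) is then a one-liner once T and R are chosen (here P is a 3 x N point matrix):

    import numpy as np

    def coarse_align(P, R, t):
        # P' = R P + t, with t broadcast over the N columns of P.
        return R @ P + t.reshape(3, 1)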
6.5.7 Fine Alignment with the ICP
The Iterative Closest Point (ICP) algorithm [10] is considered to be one of the
most accurate algorithms for registration of two clouds of data points provided the
datasets are roughly aligned. We apply a modified version of ICP as described
in [114]. The computational expense of ICP is minimized using the minimal rectan-
gular area as described in Section 6.5.5. The final decision regarding the matching
is made based on the results of ICP.
Table 6.2: Ear detection results on different datasets

Database    No. of    Description of the test images including                  Undetected   Detection
            images    challenges involved                                       image(s)     rate (%)
UND-F       203       Not used in training and validation of the classifiers    0            100
UND-F       942       Images from 302 subjects including some partially         1            99.9
                      occluded images
UND-J       830       2 images from each of 415 subjects including some         1            99.9
                      partially occluded images
UND-J       146       54 images occluded with ear-rings and 92 images           1            99.1
                      partially occluded with hair
XM2VTSDB    104       Severely occluded with hair (see Fig. 6.11)               50           51.9
UWADB       50        All images occluded with ear-phones                       0            100
6.6 Detection Performance
In this section, we report and discuss the accuracy and speed of our ear detector.
6.6.1 Correct Detection
We performed experiments on six different datasets with different types of
occlusions. The results are summarized in Table 6.2, which shows the high accuracy
of our detector.
Test profile images in the first dataset in Table 6.2 are carefully separated from
the training and validation set. Images in the fourth dataset are partially
occluded with hair and ear-rings, and those in the fifth dataset are severely occluded
with hair. Some examples of correct detection of such images are shown in Fig. 6.10.
The detector failed only for the images where hair covered most of the ear (see
Fig. 6.11). However, such occluded ear images may not be as useful for biometric
recognition anyway, as they lack sufficient ear features.
For some applications another kind of occlusion is likely to be common: occlusion
due to ear-phones, since people are increasingly using ear-phones with their mobile
phones or to listen to music. Therefore, we collected images from 50 subjects in our
laboratory using a Minolta Vivid 910 range scanner, each of whom was requested to
wear ear-phones (see Fig. 6.12). Correct detection of ears in these images confirms
that our detection algorithm does not require the ear pit to be visible. To the best
Figure 6.10: Detection in presence of occlusions with hair and ear-rings (the inset
image is the enlargement of the corresponding detected ear).
Figure 6.11: Example of test images for which the detector failed.
Figure 6.12: Example of ear images detected from profile images with ear-phones.
of our knowledge, we are the first to use an ear dataset with ear-phones on.
In order to further analyze the robustness of our ear detector to occlusion, we
synthetically occluded the ear region of profile images similar to Arbab-Zavar and
Nixon [8]. After correctly detecting the ear region without occlusion, we introduced
different percentages of occlusion and repeated the detection. During each pass
of the detection test, occlusion was increased vertically (from top to bottom) or
horizontally (from right to left) by masking in increments of 10% of the originally
detected window. The results of these experiments applied on the first dataset in
Table 6.2 are shown in Fig. 6.13. The plots demonstrate a very high detection rate
up to occlusion levels of 20% (vertical) and 30% (horizontal), which then decreases
sharply at 40% and 50% occlusion respectively. A better
performance is obtained under horizontal occlusions which are the most common
types of occlusion caused by hair.
Figure 6.13: Detection performance under two different types of synthesized occlu-
sions (on a subset of the UND-F dataset with 203 images).
To evaluate the performance of our detector under pose variations, we performed
experiments with a subset of Collection G of the UND database. It includes straight-
on, 15◦ off center, 30◦ off center, and 45◦ off center images. For each of the poses,
there are 24 images (details of the dataset can be found in [179]). Our detector
successfully detected ears in all of the images proving its robustness up to 45 degrees
of pose variation.
We also found that our detector is robust to other degradations of images such
as motion blur as shown in Fig. 6.14.
Figure 6.14: Detection of a motion blurred image.
Figure 6.15: False detection evaluation: (a) FAR (on number of profile images) with
respect to number of stages in the cascade. (b) The ROC curve for classification of
cropped ear and non-ear images (best seen in color).
6.6.2 False Detection
On the first dataset in Table 6.2, for a scale factor of 1.25 and step size of 1.5,
our detector scanned a total of 1308335 sub-windows in 203 images of size 640×480.
Only seven sub-windows were falsely detected as ears, resulting in a false positive
rate (FPR) of 5 × 10⁻⁶. These seven false positives were easily eliminated using the
multi-detection integration as mentioned in Section 6.3.5. The relationship between
the FPR and the number of stages in the cascade is shown in Fig. 6.15(a). As
illustrated, the FPR decreases exponentially with an increase in the number of
stages, following the maximum FPR set for each stage, fm = 0.7079. This is because
the classifiers of the subsequent stages are trained to correctly classify the samples
misclassified by the previous stages.
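The per-stage figure fm = 0.7079 is consistent with budgeting the target FPR over roughly 20 stages. Assuming independent stage-wise rejection (our reading; the derivation is not given above), the cascade FPR after N stages is bounded by the product of the per-stage rates:

    F = f1 · f2 · · · fN ≤ fm^N,  with  fm = Ft^(1/20) = 0.001^(1/20) ≈ 0.7079,

so fm^18 ≈ 2 × 10⁻³ at the 18 trained stages; the observed FPR of 5 × 10⁻⁶ is far below this bound, presumably because each stage beats its per-stage target.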
In order to evaluate the classification performance of our trained strong classi-
fiers, we cropped and synthesized (by a rotation of -5 to +5 degrees) 4423 ear and
5000 non-ear images. The results are illustrated in Fig. 6.15(b). The correct
classification rate is 97.1% with no false positives (see the inset plot of Fig. 6.15(b)),
and we achieve very high classification accuracy at very low false positive rates. In
fact, we obtained 99.8% and 99.9% classification rates at 0.04% and 0.2% FPRs
respectively, corresponding to only 2 and 12 false positives.
6.6.3 Detection Speed
Our detector achieves extremely fast detection speeds. The exact speed of the
detector depends on the step (shift) size, the scale factor and the initial scale. With
a step size of 1.5, an initial scale of 5 and a scale factor of 1.25, the proposed
detector can detect the ear in a 640 × 480 image in 7.7 ms on a Core 2 Quad Q9550,
2.83 GHz machine using a C++ implementation of the detection algorithm. This
time also includes the time required by the multi-detection integration algorithm,
which is 0.1 ms on average.
6.7 Recognition Performance
The experimental results of ear recognition using our proposed approach on
different datasets are reported and discussed in this section. The robustness of the
approach is evaluated against different types of occlusions and pose variations. The
effects of using L3DFs, geometric consistency and the minimal rectangular area of
datapoints for ICP are also summarized in this section.
6.7.1 Datasets
Collections F, G and J from the University of Notre Dame Biometrics Database
[157, 179] are used to perform the recognition experiments of the proposed approach.
Collection F and J include 942 and 830 images of 302 and 415 subjects respectively
collected using a Minolta Vivid 910 range scanner in high resolution mode. There
is a wide time lapse of 17.7 weeks on average between the earliest and latest images
of subjects. There are also variations in pose between them and some images are
occluded with hair and ear rings. The earliest image and the latest image for each
subject are included in the gallery and the probe dataset respectively. As mentioned
in Section 6.6.1, Collection G includes images from 24 subjects each having images
at four different poses, straight-on, 15◦ off center, 30◦ off center and 45◦ off center.
We keep images with straight-on pose in the gallery and others in the probe dataset.
We also tested our algorithm on 100 profile images from 50 subjects with and
without ear-phones on. The images were collected at the University of Western Aus-
tralia using a Minolta Vivid 910 range scanner in low resolution mode. There are sig-
nificant data losses in the ear-pit regions of the images as shown in Fig. 6.17(c). Im-
ages without ear-phones are included in the gallery and others in the probe dataset.
All the ear data were extracted automatically as described in Section 6.3 except
for the following.
1. For the purpose of evaluating the system as a fully automatic one, we consid-
ered the undetected probe (see Section 6.6) as a failure and kept the number
of probes as 415 in the computation of the recognition performance on the
UND-J dataset.
2. There are three images in the 45◦ off center subset of the UND-G dataset
where the 2D and 3D data clearly do not correspond. For these images, we
manually extracted the 3D data.
We also used the same values of the parameters in the matching algorithm for
all experiments across all the databases.
6.7.2 Identification and Verification Results on the UND Database
On the UND Database Collection-F and Collection-J, we obtained rank-1 identi-
fication rates of 95.4% and 93.5% respectively. The Cumulative Match Characteristic
(CMC) curve illustrating the results for the top 40 ranks for the UND-J dataset is shown
in Fig. 6.16(a). The plot was obtained using ICP for matching a probe with the
selected gallery dataset after being coarsely aligned using L3DFs only. The little
Figure 6.16: Recognition results on the UND-J dataset: (a) Identification rate. (b)
Verification rate.
Figure 6.17: Examples of correct recognition in presence of occlusions: (a) With
ear-rings and (b) With hair (c) With ear-phones (2D and the corresponding range
images are placed in the top and bottom row respectively (best seen in color)).
Figure 6.18: Recognition results on the UND-G dataset: (a) Identification rate for
different off center pose variations. (b) CMC curves for 30◦ and 45◦ off center pose
variations.
gain in accuracy up to rank-40 shows that the ICP algorithm is very accurate when
the correct gallery passes the rejection classifier.
We also evaluated the verification performance in terms of the Receiver Oper-
ating Characteristic (ROC) curve and the Equal Error Rate (EER). We obtained a
verification rate of 96.4% at an FAR of 0.001 with an EER of 2.3% for the UND-F
dataset. On the UND-J dataset, we obtain 94% verification at an FAR of 0.001 with
an EER of 4.1% (see Fig. 6.16(b)).
6.7.3 Robustness to Occlusions
To evaluate the robustness of our approach to occlusions, we selected occluded
images from various databases as described below.
1. Ear-rings: We found 11 cases in UND Collection-F where either the gallery
or the probe image is with ear-rings. All these were correctly recognized by
our system. Some of the examples are illustrated in Fig. 6.17a. Although we
used 3D data for recognition, 2D images are shown for a better illustration.
2. Hair: Our approach is also significantly robust to occlusion with hair. Out of
59 images with partial occlusion with hair in the UND-F dataset, 54 are cor-
rectly recognized, yielding a recognition rate of 91.5% (see Fig. 6.17b). The mis-
classified examples also have other problems, as discussed in Section 6.7.5.
3. Ear-phones: Experiments on 50 probe images with ear-phones from the
UWADB provide a rank-1 identification of 98% and verification of 98% with an
EER of 1%. These results confirm the robustness of our approach to occlusions
in the ear pit region (see Fig. 6.17(c)).
6.7.4 Robustness to Pose Variations
Pose variations may occur during the capture of the probe profile image, or during
the detection phase if the detection rectangle is chosen in different positions. Large
variations in pose (particularly in the case of out-of-plane rotations) sometimes intro-
duce self-occlusions that decrease the repeatability of local 3D features in a gallery-
probe pair. Therefore, although the local 3D features are somewhat pose invariant
due to the way they are constructed (see Section 6.4.2), we noticed some
misclassifications when using only local 3D features. However, with the finer match-
ing via ICP, most such probe images in the UND-F and UND-J datasets are
recognized correctly. The results on the UND-G dataset, which has probes with pose varia-
recognized correctly. The results on UND-G dataset having probes with pose varia-
tions up to 45◦ are plotted in Fig. 6.18. We achieved 100%, 87.5% and 33.3% rank-1
identification rates for 15◦, 30◦ and 45◦ off center pose variations respectively. Our
results are comparable to those of Yan and Bowyer [179] and Chen and Bhanu [32]
for up to 30◦. Some examples of accurate recognition under large pose variations
are illustrated in Fig. 6.19.
Figure 6.19: Examples of correct recognition of four gallery-probe pairs with pose
variations (best seen in color).
6.7.5 Analysis of the Failures
Most of the misclassifications that occurred in all our experiments involve missing
data (inside or very close to the ear contour) in either the gallery or the probe image.
The remaining ones involve large out-of-plane pose variations and/or occlusions with
hair. Some examples are illustrated in Fig. 6.20.
Figure 6.20: Examples of failures: (a) Two probe images with missing data. (b) A
gallery-probe pair with a large pose variation.
6.7.6 Speed of Recognition
On a Core 2 Quad Q9550, 2.83 GHz machine, an unoptimized MATLAB imple-
mentation of our feature extraction algorithm requires around 22.2 sec to extract
local 3D features from a probe ear image. A similar implementation of our algorithm
for matching 200 L3DFs of a probe with those in a gallery requires 0.06 sec on av-
erage. This includes 0.02 sec required by the computation of geometric consistency
measures. For the full algorithm, including L3DF as rejection classifier, followed
by coarse alignment and ICP on a minimal rectangle, the average time to match a
probe-gallery pair in the identification case is 2.28 sec on the UND dataset. Timings
for different combinations of our recognition algorithms are given in Table 6.3.
Table 6.3: Performance variations for using L3DFs and geometric consistency measures

                                              Rank-1 identification rate (%)           Avg. matching
Approach                                      UWADB    UND-F      UND-F     UND-J      time (sec)
                                              (50-50)  (100-100)  (302-302) (415-415)
ICP only                                      98       80         -         -          58.09
L3DF without geometric consistency            86       89         76.8      71.6       0.04
L3DF using geometric consistency              88       92         83.44     79.8       0.06
ICP and L3DF without geometric consistency    96       98         93.7      81.6       2.43
ICP and L3DF using geometric consistency      98       98         95.4      93.5       2.28
6.7.7 Evaluation of L3DFs, Geometric Consistency Measures and Min-
imal Rectangular Area of Dataset
We obtain improved accuracy and efficiency using the local 3D feature based
rejection classifier and extracting the minimal rectangular area prior to the applica-
tion of ICP. The improvements are summarized in Table 6.3, where the numbers in
parentheses following the database name are the numbers of images in the gallery
and probe sets respectively.
Results in the first row of the above-mentioned table are computed using the
same ICP implementation as in our final approach, but without using L3DFs and
the corresponding minimal rectangular area of data points. The wider detection window,
with hair and outliers, and the absence of a proper initial transformation caused ICP
to yield very poor results compared to our approach with L3DFs. In all cases,
our method is significantly more efficient than the raw ICP algorithm.
As shown in Table 6.3, using geometric consistency measures improved the accu-
racy of identification significantly especially for the larger datasets. Improvements
using these consistency measures for other biometric applications are described in
detail in [80]. The performance of these measures on the UND-F database is shown
in Fig. 6.21. The legends errL3DF, propRot, propDist and errDist are used for the
similarity measures εf , αr, αd and εn respectively, as described in Section 6.5.3. The
CMC curves show the significance of the rotation consistency measure compared to
other similarity measures.
Figure 6.21: Comparing identification performance of different L3DF-based similar-
ity measures (without ICP on the UND-F dataset).
6.8 Comparison with Other Approaches
In this section, our detection and recognition approaches are compared with
similar approaches using 3D data and reporting performance on the UND datasets.
Tables 6.4 and 6.5 summarize the comparisons. Matching times are computed on
different machines in different approaches. A dual-processor 2.8 GHz Pentium Xeon,
a 1.8 GHz AMD Opteron and a 3 GHz Pentium 4 are used in [179], [32] and [136]
respectively.
6.8.1 Detection
The ear detection approaches in [136] and [180] are not automatic. The ap-
proaches in [32] and [179] use both color and depth information for detection. The
latter achieves a low detection accuracy (79%) using only color information which
also depends on the accuracy of nose tip and ear pit detection. These detection
approaches are therefore not directly comparable to ours, since we use only grey-scale
information.
Unlike other approaches, we have performed experiments with ear images of
people with ear-phones on and observed that our detection rate does not vary much
if the ear pit is blocked or invisible. Our experiments with synthetic occlusion on
the ear region of interest demonstrate better results than those reported by Arbab-
Zavar and Nixon [8] for vertically increasing occlusions up to 25% of the ear.
Table 6.4: Comparison of the detection approach of this paper with the approaches
of Chen and Bhanu [32] and Yan and Bowyer [179]

Approach                Ear pit    Nose tip   Data type         Dataset         Det. rate   Det. time
                        detected?  detected?                    (#images)       (%)         (sec)
This paper              No         No         Intensity         UND-F (942)     99.9        0.008
                                                                UND-J (830)     99.9
Chen and Bhanu [32]     No         No         Color and depth   UND-F (700)     87.7        N/A
                                                                UCR (902)       99.3        9.48
Yan and Bowyer [179]    Yes        Yes        Color             UND-J (1386)    79          N/A
                                              Depth                             85
                                              Color and depth                   100
Table 6.5: Comparison of the recognition approach of this paper with others on the
UND database

Approach                Manually   Conciseness of the          Features   Rejection    Initial           Feature matching
                        cropped    cropping window             used       classifier   transformation    time
                        images     around the ear                         used?        for ICP
This paper              Nil        A flexible rectangular      Local      Yes          Translation and   L3DF: 0.06 sec
                                   area is cropped                                     rotation          (MATLAB)
Chen and Bhanu [32]     12.3%      A concise rectangular       Local      No           Translation and   LSP: 3.7 sec, H/AH:
                                   area is cropped                                     rotation          1.1 sec (C++)
Yan and Bowyer [179]    Nil        Very concise crop along     Global     No           Translation       N/A
                                   the contour
Passalis et al. [136]   All        Very concise crop around    Global     No           N/A               Less than 1 ms
                                   the concha                                                            (enrolment: 15-30 sec)
Regarding robustness to pose variations, unlike other reported approaches, we
performed experiments separately for profiles with various levels of pose variations
and obtained 100% detection accuracy. The approach of Yan and Bowyer [179]
greatly depends on the accuracy of the ear pit detection which can be affected by
large pose variations.
Also, none of the above approaches reports the time required for ear detection
on the UND dataset (although Chen and Bhanu [32] reports 9.48 sec for detection
on the UCR database). In [179], an active contour algorithm is used iteratively
which increases the computational cost. It also uses skin detection and constraints
in the active contour algorithm for a concise cropping of the 3D ear data. Chen and
Bhanu [32] also use skin detection and edge detection for finding the initial Regions
of Interest (ROI), RANSAC-based Data-Aligned Rigidity-Constrained Exhaustive
Search algorithm (DARCES) for initial alignment and the ICP for fine alignment
of a reference model on to the ROIs and the Thin Plate Spline Transformation for
the final ear shape matching. On the other hand, we use the AdaBoost algorithm,
whose off-line training is expensive but the trained detector is extremely fast.
On the basis of the above, our ear detection approach is faster than all of the
above approaches, while maintaining high accuracy and producing boundaries con-
sistent enough for recognition.
6.8.2 Recognition
In our approach, an extremely fast detector is paired with fast matching using 3D
local features which allows us to extract a minimal rectangular area from a flexibly
cropped rectangular ear region sometimes with other parts of the profile image (e.g.
hair and skin). However, recognition performance depends on how concisely each
ear has been detected, especially when using the ICP algorithm. This is because the
hair and skin around the ear makes the alignment unstable. This explains the high
recognition results in [179] where ICP is used for matching concisely cropped ear
data, with a penalty in time. Similarly, the slightly better result in [32] is explained
by the fine manual cropping of a large proportion (12.3%) of ears which could not
be detected by their automatic detector. Also, in [179, 32] ICP is applied on every
gallery-probe pair whereas our use of L3DF allows us to apply ICP on a subset (best
40) of pairs for identification.
Unlike the approaches in [136], [180] and [179], we and Chen and Bhanu [32]
use local features for recognition. However, the construction of their local features
is quite different from ours and they use these for coarse alignment only and not
for rejecting false matches. We use the local 3D features for the rejection of a large
number of false matches as well as for a coarse alignment of the remaining candidates
prior to the ICP matching. This considerably reduces our computational cost since
we apply ICP on a reduced dataset. However, similar to them, we use both rotation
and translation for the coarse alignment whereas Yan and Bowyer [179] use only
translation (no rotation). For matching local features, Chen and Bhanu [32] use a
geometric constraint similar to ours but without a second round of matching. Our
experiments show that matching is improved by a second round (see Section 6.7).
Although the final matching time in [136] is very low (less than 1 ms per com-
parison), its enrolment and the feature extraction modules using both ICP and
Simulated Annealing are slightly more computationally expensive (30 sec compared
to 22.2 sec in our case). The approach has an option of omitting the deformable
model fitting step which can reduce the enrolment timing to 15 sec, however, with
a penalty of 1% in recognition performance. Unlike our approach, it also requires
that the ear pit is not occluded because the annotated ear model used for fitting the
ear data is based on this area. Moreover, the authors mention that the approach
fails in cases of ears with intricate geometric structures.
6.9 Conclusion
In this paper, a complete and fully automatic approach for human recognition
from 2D and 3D profile images is proposed. Our AdaBoost-based ear detection ap-
proach with three new Haar feature templates and the rectangular detector is very
fast and significantly robust to hair, ear rings and ear-phones. The modified con-
struction and efficient use of local 3D features to find potential matches and feature-
rich areas, and also for coarse alignment prior to the ICP, makes our recognition
approach computationally inexpensive and significantly robust to occlusions and
pose variations. Using two-stage feature matching with geometric consistency mea-
sures significantly improved the matching performance. Unlike other approaches,
the performance of our system does not rely on any assumption about the localiza-
tion of the nose or the ear pit. The speed of the recognition can be improved further
by implementing the algorithms in C/C++ and using faster techniques for feature
matching, such as geometric hashing, as well as faster variants of ICP.
Appendix: Parameter Selection
The parameters used in the implementation of our detection, feature extraction
and matching algorithms are listed below with a short description of the effect of
the variation of their values. All the values are determined empirically using the
training data.

Detection Related Parameters:
1. Target false positive rate (Ft): We chose Ft = 0.001. Increasing its value will
decrease the number of stages of the cascade for ear detection, but will reduce
the accuracy.
2. Target detection rate (D): We chose D = 0.98. Increasing its value will
increase the number of stages in the cascade or the minimum detection rate
per stage which will necessitate more features to be included in each stage of
the cascade.
3. Minimum overlap (minOv): We used minOv = 0.1 in our multi-detection
integration algorithm. Using a higher value for this parameter may result in
multiple detections near the ear.
Feature Extraction Related Parameters:
1. Inner radius of the feature sphere (r1): It is the radius of the sphere within
which data points are used to construct the 3D local features. For a larger value
of r1, a feature becomes more global and hence, more descriptive. However,
locality is also important to increase robustness to occlusion. Its value is
chosen relative to the average ear size. We tested with values 10, 15, 20 and
30 mm and the best results were obtained with r1 = 15 mm.
2. Boundary limit (r1 + x): It is a function of the inner radius and is used to
avoid the boundary effect. We chose x = 10 mm. A higher value of x will
reduce the number of seed points and hence the keypoints. A lower value may
result in having some keypoints outside the reliable and feature-rich area of
the ear, likely including hair.
3. Threshold for choosing the keypoints (t1): We chose t1 = 2 to have around 200
significant features in most cases. The higher the value of t1, the fewer features
we get. However, lowering t1 can result in the selection of less significant
feature points. For example, t1 = 0 will allow constructing a feature from a
completely planar or spherical surface.
4. Seed resolution (rs): It defines the minimum spacing between candidate seed
points. In our experiments we chose rs = 2 mm.
5. Number of features per ear (nf ): This parameter determines the maximum
number of features to be created per ear. We chose nf = 200. A higher value
increases the number of feature points obtained, resulting in higher computational
cost; however, recognition becomes unreliable for candidates having too few features.
Matching Related Parameters:
1. The threshold limiting the distance between feature locations (th1): It controls
the number of matches to be discarded. The higher its value, the more (but
less significant) matches will be included; a smaller value will reduce the
number of matches. In our experiments, th1 = 45 mm provided the best results.
2. The distance multiplier (κ): This parameter is part of the threshold to deter-
mine the distance consistency. We empirically determine its value as 0.1. A
higher value will allow less consistent matches to be used in constructing the
rejection classifier.
3. The threshold for rotation consistency (th2): We chose th2 = 10◦. A higher
value will allow considering matches having higher rotation variations in the
calculation of rotation consistency. However, smaller values may discard po-
tentially correct matches.
4. The minimum number of matches (m): This parameter limits the number of
gallery candidates having enough matching features with a probe. We chose
m = 10. A higher value may discard potential matches while a lower value
would not allow the keypoint distance measure computation to be performed.
Acknowledgments
This research is sponsored by the Australian Research Council (ARC) grant
DP0664228. The authors acknowledge the use of the UND, the NIST, the XM2VTSDB,
the UMIST, the USTB and the MIT-CBCL face databases for ear detection and the
UND profile and the UCR ear databases for ear recognition. They would also like to
acknowledge Mitsubishi Electric Research Laboratories, Inc., Jones and Viola for the
permission to use their rectangular filters and D’Errico for the surface fitting code.
They would like to thank R. Owens and W. Snyder for their helpful discussions and
K. M. Tracey, J. Wan and A. Chew for their technical assistance.
CHAPTER 7
Fusion of 3D Ear and Face Biometrics for Robust
Human Recognition
Abstract
The vulnerability of unimodal biometric systems in the presence of noise or occlu-
sions and variations of pose, scale and illumination motivates the use of multimodal
biometrics. However, selecting appropriate biometric modalities and fusion tech-
niques is still very challenging. The ear and the face are highly attractive biometric
modalities for fusion because of their feature-rich physiological structure, physical
proximity and non-intrusive acquisition. In this paper, local 3D features are auto-
matically extracted from both modalities for representation, and fusion is performed
at score and feature levels. Fusion of L3DF-based and ICP-based matching scores
from the two modalities using a weighted sum-rule significantly outperforms cur-
rent unimodal approaches especially for non-neutral facial expressions. To the best
of our knowledge, this paper is the first to propose fusing 3D features extracted
from 3D ear and the frontal face data. The proposed score-level fusion technique
achieves rank-1 identification and verification (at 0.001 FAR) rates of 99.4% and
99.7% respectively with neutral facial expression and 96.8% and 97.1% respectively
with non-neutral facial expressions on the largest multimodal dataset using FRGC
v.2 and UND databases.
Keywords
3D ear, 3D face, local 3D surface features, multimodal recognition, score-level
fusion, feature-level fusion.
7.1 Introduction
Traditional identity card and password based systems are increasingly exploited
via theft or fakery [43, 88]. Biometric recognition using human physiological (such
as face, fingerprint, palm-print and iris) or behavioral (e.g. handwriting, gait and
voice) characteristics is comparatively more robust to such frauds. As it is relatively
difficult to spoof multiple biometrics simultaneously, multibiometric systems have
This article is under review in IEEE Transactions on Pattern Analysis and Machine Intelligence, June 2010.
recently been proposed where a decision is made based on a combination of different
subsets of biometrics.
In terms of acceptability, the face is considered the most promising biometric
trait. Face data can be collected easily and non-intrusively. The face is also rich
in distinct features. Using 3D or a combination of 2D and 3D face images, very
high recognition rates have been obtained for faces with neutral expression [117].
However, changes in facial expression significantly change the facial geometry [99]
reducing the effectiveness of face recognition algorithms. Occlusions caused by hair
or ornaments introduce an additional challenge. To reduce the impact of expression
variations and occlusions on the recognition performance, researchers have proposed
the integration of face data with other biometric modalities such as fingerprints [158],
palm prints [56], hand geometry [144], gait [194], iris [164], voice [19] and most re-
cently the ear. Among these alternatives, the ear has the advantage that it is
co-located with the face and hence, respective data can easily be collected (with the
same or similar sensor). Moreover, ear shape does not change with expressions and
with ageing from 8 years to 70 years [72]. Another advantage of this choice of modal-
ities is that the ear and face data have a very low correlation, which is a desirable
criterion for any fusion approach. Theoharis et al. [152] computed a correlation of
only 0.16 between an ear and a face image using the Pearson correlation coefficient.
They also illustrated that the ear-face fusion curve can reach a 100% recognition
rate before rank-15 whereas none of the modalities reached 100% accuracy before
rank 20. This implies that the instances of failure to identify a subject using the two
modalities are uncorrelated, and therefore, one modality can generally compensate
for any shortcoming of the other.
Ear and face biometrics can be fused at different levels of the recognition pro-
cess [84]. In this paper, we propose score and feature-level fusion approaches. We
detect the face region of interest from the frontal face images based on the position
of the nose tip [114]. To detect the ear from the profile images, we use our previously
developed ear detection technique using AdaBoost [74] (Section 6.3 of this disserta-
tion). For score-level fusion, following a normalization step, face and ear local 3D
features (L3DFs) are extracted and matched separately. Matching scores from the
two modalities are then fused according to a weighted sum rule. For feature-level
fusion, after extracting L3DFs from the ear and face data, we concatenate them
based on their local shape similarity. We then match the fused features of the probe
and the gallery according to fused feature distance and using geometric consistency
measures. The performance of the proposed approaches is evaluated on the largest
Table 7.1: Multi-biometric approaches with 2D and 3D ear and face data

Category        Source                         Data Type and                   Algorithm                              Id. Rate
                                               Database Size
Score-level     Yan, 2006 [175]                3D images from 174 subjects     Using ICP with sum and interval        100%
fusion                                                                         fusion rules on multi-instance
                                                                               gallery and probe images
                Woodard et al., 2006 [166]     3D images from 85 subjects      ICP and RMS                            97%
                Theoharis et al., 2008 [152]   3D images from 324 subjects     Annotated model fitting and using      99.7%
                                                                               ICP, SA and wavelet analysis
Feature-level   Xu et al., 2007 [171]          2D images from 79 subjects      KFDA                                   96.8%
fusion          Pan et al., 2007 [133]         2D profile images from 38       CCA, PCA                               97.4%
                                               subjects
                Xu and Mu, 2007 [172]          190 2D images from 38           KCCA                                   98.7%
                                               subjects
ear-face dataset available, comprising 326 gallery images, 311 probes with neutral facial expression and 315 probes with non-neutral facial expressions. All images are
taken from the FRGC v.2 [141] and the University of Notre Dame (UND) [179]
Biometric databases and there is only one instance per subject in both the gallery
and the probe dataset. We achieve an identification rate of 99.0% and a verification
rate of 99.7% at 0.001 False Acceptance Rate (FAR) for score-level fusion of the ear
with neutral face biometrics. Corresponding results with non-neutral expressions
are 96.8% and 97.1% respectively. We obtain slightly lower identification and verification rates of 98.4% and 99.0% respectively for feature-level fusion of
the ear with neutral face data. With non-neutral expressions of face data, we ob-
tain 94.9% and 96.8% identification and verification rates respectively. Both fusion
approaches are efficient and fully automatic.
The remainder of the paper is organized as follows. Related work, motivations
and contributions of this paper are described in the next section. The data acquisi-
tion and feature extraction techniques used in the paper are described in Section 7.3.
The proposed approaches for score-level and feature-level fusion are described in Sec-
tions 7.4 and 7.5 respectively. Results are reported and discussed in Sections 7.6
and 7.7. The proposed fusion approaches are compared with each other and with other approaches in Section 7.8. A conclusion is provided in Section 7.9.
7.2 Related Work and Contributions
7.2.1 Related Work
Multimodal recognition with ear and face is a very recent research trend. Only
very few approaches have been proposed using different levels of fusion [78] (See
Chapter 2 of this dissertation for details). Some of the most relevant approaches
using score and feature levels of fusion are summarized in Table 7.1 and discussed
below.
Approaches with Score-level Fusion: In score-level fusion, matching scores from
different modalities are combined to make the recognition decision. Different fusion
rules have been proposed. Kittler et al. [94] and Jain et al. [84] empirically demon-
strated that the sum rule provides better results than other score fusion rules in a
number of cases. Recently, Luciano and Krzyzak [108] demonstrated better results
for the weighted sum rule.
There are only a few 2D approaches using score-level fusion of the ear and face biometrics, including the works of Luciano and Krzyzak [108] and Abate et al. [2]. However, to the best of
our knowledge, there are only three approaches using 3D data for score-level fusion
of these two modalities. Considering their relevance to the research in this paper,
only 3D approaches are discussed below.
Yan [175] combined ear and face at score-level using sum and interval fusion
rules. On a dataset of 174 subjects, each with two ear shapes and two face shapes
in the gallery and the probe datasets, they obtained rank-one recognition rates of
93.1%, 97.7% and 100% for the ear, the face and the fusion respectively [118].
Woodard et al. [166] used a score-level fusion technique for combining 3D face
scores with ear and finger scores. Using the Iterative Closest Point (ICP) algorithm,
they obtained 97% rank-one recognition rate on a small database of 85 subjects.
Theoharis et al. [152] proposed extracting geometry images from 3D face and
ear modalities and then fitting annotated ear and face models through an ICP and
simulated annealing based registration process. They applied the wavelet transform
to the extracted images to get individual feature vectors. The distance between
the feature vectors of the gallery and the probe was weighted and then summed
for fusion. Although they combined features before matching, we classify their ap-
proach as score-level fusion because they directly applied the L1 distance metric for
matching the fused features, which is equivalent to matching the individual modal-
ity separately and hence would produce the same results as score-level fusion. In a
multimodal database composed of 324 gallery and the same number of probe images
(all collected from the FRGC v2 and the UND databases), a rank-one identification
rate of 99.7% was reported. The probe dataset for this experiment contained some
images with non-neutral facial expressions but most of them were neutral.
Approaches with Feature-level Fusion: In feature-level fusion, extracted features
from different modalities are combined prior to matching. We are not aware of
any 3D approaches using this fusion technique and only the following three 2D
approaches are found in the literature.
Xu et al. [171] proposed feature-level fusion of 2D profile face and the ear biomet-
rics. They used Kernel Fisher Discriminant Analysis (KFDA) to extract features
and tested their approach on a dataset of 237 gallery and 474 probe 2D profile
images from 79 subjects (all taken from the University of Science and Technology
Beijing (USTB) database). They included three 2D images (with the rotation of -5,
0 and +5 degrees around the vertical axis) per person in the gallery and six images
per person in the probe dataset. They obtained 96.8% recognition accuracy while
using a minimum-distance classifier and the weighted sum rule (with weights 0.55
and 0.45 for face and ear respectively).
In their feature-level fusion of profile face and ear, Pan et al. [133] used Canonical
Correlation Analysis (CCA) to extract features. They reduced the dimension of the
associated feature vector using PCA and used a minimum-distance based classifier.
They built a multi-instance dataset from the USTB database taking three instances
per gallery and two instances per probe for each of the 38 subjects. With this
dataset, they obtained an identification rate of 97.4% using fused feature vector of
dimension 50. The same research group improved the identification rate to 98.7%
using Kernel Canonical Correlation Analysis (KCCA) on the same data set [172].
7.2.2 Motivations and Contributions
As reported in the literature review (Section 7.2.1), most of the score-level fusion
techniques and all of the feature-level fusion techniques use only 2D data. However,
2D data are severely affected by changes in illumination, scale and pose variations
that are common for public applications. Therefore, in this paper, we propose to
use 3D data, which are commonly less sensitive to such variations.
Occlusions and deformations are also very common in non-intrusive applications
of ear-face biometrics. Most of the current ear-face multimodal approaches use
global features which are affected by these variations. In this work, we use local 3D
features (L3DFs) to represent both the ear and face data. L3DFs were first proposed
by our research group [114] and used for object retrieval and face recognition [117].
These features exhibit a high level of repeatability and are very fast to compute.
Mian et al. [114] reported 23 matches per second using MATLAB on a 3.2 GHz
Pentium IV machine.
In comparison to other levels of fusion, score-level fusion has many benefits with
respect to implementation and computation. It involves the processing of less data and is consequently a faster and easier way to recognize people [84].
Feature-level fusion, on the other hand, may preserve more discriminating features
prior to matching and is intuitively believed to provide better accuracy. However,
this has not been experimentally proved in the literature. Therefore, we propose to
apply both fusion techniques using the same data and similarly extracted features
to compare their recognition performance.
It is also interesting to note that all the existing feature-level fusion approaches
use profile images for both the ear and face biometrics. Although it is easy and
cost effective to use a single image for both modalities, a frontal face possesses
more discriminating features than a profile face and hence may result in a better
recognition accuracy. To explore this, we propose to extract ear features from the
profile image and face features from the frontal face images.
Our approaches are fully automatic all the way from data acquisition to recog-
nition. We use our previously developed tools [83] to automatically extract a short
list of candidates and select a minimal rectangular region from the whole dataset.
Thus, we can minimize the cost when using ICP for improved accuracy, especially
in the case of face images with non-neutral expressions.
The specific contributions of this paper are as follows:
1. Two complete ear-face multimodal recognition systems are proposed, based on efficient local features that are extracted fully automatically.
2. To the best of our knowledge, this is the first paper to present a feature-level
fusion approach using 3D ear and face features extracted from the profile and
frontal face images respectively.
3. The same type of local 3D features are used to represent both the ear and
face data in both score and feature levels of fusion, which permits fair
comparisons of the two biometric traits as well as the fusion techniques.
4. Experiments are performed on the largest possible multimodal dataset using
publicly available profile (the UND) and frontal (the FRGC v2) face databases.
5. The performance of the face biometrics especially with non-neutral expressions
improves significantly when fused with the ear biometrics using the proposed
score-level fusion technique.
6. The proposed feature-level fusion approach achieves an accuracy comparable
to that of the score-level fusion of the ear and face data without requiring
ICP-like expensive algorithms in matching.
7.3 Data Acquisition and Feature Extraction
7.3.1 Dataset
To perform our experiments on both the ear and face data, we create a multi-
modal dataset comprising data from Collection-J of the UND Profile face database [179]
and the Fall2003 and Spring2004 datasets of the FRGC v.2 frontal face database [141].
The UND database has images from 415 individuals and the FRGC v.2 database
has images from 466 individuals. However, only 326 images in the gallery of the
UND database are available in the list of the images with neutral expression in the
FRGC v.2 database. Similarly, the number of probe face images with neutral and
non-neutral expressions in the FRGC v.2 database which are also available in the
probe images of the UND database are 311 and 315 respectively. Thus, our multi-
modal dataset includes 326 gallery images and 311 probes with neutral expressions
and 315 probes with non-neutral expressions. To the best of our knowledge, this is
the largest publicly available ear-face database.
7.3.2 Ear and Face Data Extraction
The ear region is detected from 2D profile images using the AdaBoost based
detector described in our previous work [74] (described fully in Section 6.3 of this
dissertation). This detector is chosen because it is very accurate, fast and fully
automatic. A detection accuracy of 99.9% is obtained on the UND profile face
database with 942 images of 302 subjects. The corresponding 3D data are then
extracted from the co-registered 3D profile data as described in [75]. As shown in
Fig. 7.1, a rectangular area of data points around the ear is extracted from the profile, which sometimes includes portions of the hair and the face. Therefore,
the extracted data are normalized and then uniformly sampled on a grid of 132 mm
by 106 mm.
The face region is detected from the 3D frontal face image based on the position
of the nose tip as described in [114]. Face data are also normalized and sampled on
a uniform grid of 160 mm by 160 mm.
Figure 7.1: Data acquisition using the Konica Minolta Vivid 910 scanner: (a) 2D and 3D profile images captured to extract the 3D ear data (the ear is detected in the 2D view and the corresponding 3D data are extracted from the range image using the 2D co-ordinates); (b) the frontal face is scanned and the 3D face data are extracted from the range image.
7.3.3 Feature Extraction
Local 3D features are extracted from 3D ear and face data. A number of distinc-
tive 3D feature point locations (keypoints) are automatically selected on the 3D ear
and 3D face region based on the asymmetrical variations in depth around them. The
variation is determined by the difference between the first two eigenvalues (centered
on the keypoints) following [117]. The number and locations of the keypoints are
found to be different for the ear and the face images of different individuals and
hence can be used as a digital signature for biometric purposes. It is also observed
that these keypoints exhibit a high degree of repeatability in different images of the
same individual [117, 83].
A spherical area of radius R is cropped around each selected keypoint and aligned on its principal axes. A uniformly sampled (1 mm resolution) 3D surface on a 30 × 30 lattice is then approximated from the cropped data points using D'Errico's surface fitting code [47]. In order to avoid boundary effects, an inner 20 × 20 lattice is cropped from the larger surface and converted into a 400 dimensional feature vector for the corresponding keypoint. An example of an extracted ear feature is illustrated in Fig. 7.2. More details can be found in [117, 83].
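A minimal Python/NumPy sketch of this extraction pipeline follows, under stated assumptions: the radius R and the keypoint threshold t are placeholders, and scipy's griddata stands in for D'Errico's gridfit [47] (the thesis implementation was in MATLAB).

```python
# Sketch of L3DF extraction: keypoint test by eigenvalue asymmetry, then
# principal-axis alignment and surface fitting of a cropped sphere.
import numpy as np
from scipy.interpolate import griddata

def is_keypoint(points, center, R=20.0, t=2.0):
    """Select a point if the depth variation around it is sufficiently
    asymmetric: the difference of the two largest eigenvalues of the local
    covariance matrix exceeds a threshold (R and t are assumptions)."""
    nbhd = points[np.linalg.norm(points - center, axis=1) <= R]
    ev = np.linalg.eigvalsh(np.cov(nbhd.T))[::-1]   # eigenvalues, descending
    return (ev[0] - ev[1]) > t

def extract_l3df(points, keypoint, R=20.0):
    """Crop a sphere of radius R, align it on its principal axes, fit a 30x30
    surface at 1 mm resolution, keep the inner 20x20 lattice (400-D vector)."""
    crop = points[np.linalg.norm(points - keypoint, axis=1) <= R]
    crop = crop - crop.mean(axis=0)
    _, _, vt = np.linalg.svd(crop, full_matrices=False)
    aligned = crop @ vt.T                               # principal-axis frame
    gx, gy = np.meshgrid(np.arange(-15, 15), np.arange(-15, 15))  # 30x30 grid
    surf = griddata(aligned[:, :2], aligned[:, 2], (gx, gy), method='linear')
    inner = surf[5:25, 5:25]                            # avoid boundary effects
    return np.nan_to_num(inner).ravel()                 # 400-dimensional L3DF
```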
Figure 7.2: Example of an extracted 3D local surface feature [83] (left: location of a local 3D feature on the range image of an ear; right: the extracted 3D local surface feature).
7.4 Unimodal Matching and Score-level Fusion
The main steps in our multimodal recognition system with score-level fusion are
shown in the block diagram of Fig. 7.4. Each of the components is described in this
section.
7.4.1 Matching Technique
Ear images in the gallery and the probe datasets are matched using an L3DF-based coarse matching technique followed by ICP-based fine matching. A preliminary
version of the approaches can be found in our previous works [83, 80] on ear recog-
nition. We adopt the same approaches with the same parameters for matching of
the face data. The approaches are described briefly as follows.
L3DF-based Matching At first, the Root Mean Square (RMS) distances between a probe feature and all the gallery features are computed. Gallery features whose keypoint locations are more than a predefined distance threshold λ1 = 45 mm away from that of the probe feature are discarded. The remaining gallery features are sorted according
to their RMS distance from the probe feature and the one with minimum distance
is paired with the probe feature. If multiple probe features match the same gallery
feature, the best match is retained. Then for a match with locations $(p_i, g_i)$, we count how many other match locations $(p_j, g_j)$ satisfy the following inequality:

\[ \bigl|\,\|p_i - p_j\| - \|g_i - g_j\|\,\bigr| \;<\; r_s + \kappa \sqrt{\|p_i - p_j\|} \tag{7.1} \]
where $r_s$ is the seed resolution and the value of $\kappa$ is empirically chosen as 0.1. We compute the proportion of distance-consistent matches (ρd) as the ratio of the maximum count to the total number of matches. In the second level, we repeat the feature matching, allowing only those matches that are consistent
Figure 7.3: Feature correspondences established between a gallery and a probe ear
after matching is performed. The image of the probe ear (right image) is flipped for
better visibility (best seen in color).
Figure 7.4: Block diagram of the proposed multimodal recognition system with score-level fusion: face and ear regions are detected from the frontal and profile face images, face and ear L3DFs are extracted and matched against their respective gallery databases, candidates are selected and coarsely aligned using feature correspondences, faces and ears are finely matched with ICP, and the matching scores are fused to produce the recognition result.
with the most consistent match. We compute the mean feature distance (σs) of the retained matches. An example of feature correspondence between a probe and the corresponding gallery ear is shown in Fig. 7.3. In the second level of matching, we also find the underlying rotation between each matched gallery-probe feature pair and compute the proportion of consistent rotations (ρr) as the ratio of the number of occurrences of the most frequent rotation (within a threshold λ2 = 10°) to the total number of matches. Finally, we compute the keypoint distance measure (σk) by applying ICP to the keypoints of the matched features only.
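As a concrete illustration of the distance-consistency measure of Eqn. (7.1), the following Python sketch counts, for each match, how many other matches preserve pairwise keypoint distances and returns ρd together with the most consistent match used to constrain the second matching stage. The seed-resolution value here is an assumption.

```python
import numpy as np

def distance_consistency(probe_pts, gallery_pts, rs=2.0, kappa=0.1):
    """probe_pts, gallery_pts: (n, 3) keypoint locations of the n matched
    feature pairs. rs is the seed resolution (the value is illustrative);
    kappa = 0.1 as chosen empirically in the text."""
    n = len(probe_pts)
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        dp = np.linalg.norm(probe_pts - probe_pts[i], axis=1)      # |p_i - p_j|
        dg = np.linalg.norm(gallery_pts - gallery_pts[i], axis=1)  # |g_i - g_j|
        ok = np.abs(dp - dg) < rs + kappa * np.sqrt(dp)            # Eqn. (7.1)
        counts[i] = ok.sum() - 1                 # exclude the pair itself
    most_consistent = int(np.argmax(counts))     # the reference match T
    rho_d = counts[most_consistent] / n          # proportion of consistent matches
    return rho_d, most_consistent
```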
Candidate Selection and ICP-based Matching We select the best 40 gallery candidates sorted according to the feature-based matching scores described above. We also use the most consistent translation and rotation found in the feature-based matching for a coarse alignment of the gallery and probe ear (or face) data. In order to align them finely, we employ a modified version of the Iterative Closest Point (ICP) algorithm [114] on a minimal rectangular area of the data containing only the matching features. We use the ICP error as a similarity measure (σi).
Final Similarity Measure and Score Normalization In order to make the final
matching decision based on the fusion of ear and face modalities, we combine the
following scores: (i) the mean feature distance of the matches retained by the sec-
ond round (σs), (ii) the proportion of distance-consistent matches (ρd), (iii) the
proportion of consistent rotations from the second round (ρr) and (iv) the ICP error
(σi).
We normalize the above scores on a 0 to 1 scale using the min-max rule. A
weight factor (ω) is then computed as the ratio between the minimum and second
minimum values, taken relative to the mean value of the similarity measure for that
probe. The final score (ηx) for modality x (here ear and face) is then computed
by summing the products of the scores and the corresponding weights (confidence
weighted sum rule) [117] as shown in Eqn. (7.2).
\[ \eta_x = \omega_s \sigma_s + \omega_d (1 - \rho_d) + \omega_r (1 - \rho_r) + \omega_i \sigma_i \tag{7.2} \]
7.4.2 Fusion of Scores
The matching scores from the ear and face data (ηe and ηf respectively) are
fused using a weighted sum rule (similar to the one used in Eqn. (7.2)) as follows:

\[ \varepsilon_s = \omega_e \eta_e + \omega_f \eta_f \tag{7.3} \]
In Eqn. (7.3), ωe and ωf are weights for the ear and the face modalities respec-
tively.
Since L3DFs are more distinctive and reliable for the face data than for the ear data, we multiply the confidence weights (computed above) by complementary modality weights during fusion. We empirically found that allocating nearly double weight to the face scores provides better results (see Section 7.6.1).
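A minimal sketch pulling this section together is given below: min-max normalization, one plausible reading of the confidence weight (the prose above is loose, so the exact formula is an assumption), the per-modality score of Eqn. (7.2), and the fusion rule of Eqn. (7.3) with the complementary weights of Section 7.6.1.

```python
import numpy as np

def minmax(x):
    """Normalize a vector of similarity scores to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def confidence_weight(scores):
    """Assumed form of the weight factor: ratio of the second-minimum to the
    minimum score, taken relative to the mean score for this probe."""
    s = np.sort(np.asarray(scores, dtype=float))
    return (s[1] / max(s[0], 1e-12)) / s.mean()

def modality_score(sigma_s, rho_d, rho_r, sigma_i, w_s, w_d, w_r, w_i):
    """Eqn. (7.2): confidence-weighted sum of the four normalized measures."""
    return w_s * sigma_s + w_d * (1 - rho_d) + w_r * (1 - rho_r) + w_i * sigma_i

def fused_score(eta_ear, eta_face, w_ear=0.35, w_face=0.65):
    """Eqn. (7.3): weighted sum of the ear and face scores (neutral-expression
    weights; 0.45/0.55 performed best for non-neutral expressions)."""
    return w_ear * eta_ear + w_face * eta_face
```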
7.5 Feature-Level Fusion
In this section, we describe our proposed approach for fusing local 3D features
obtained from ear and face data. Fig. 7.5 shows a block diagram of the approach.
Figure 7.5: Block diagram of the proposed multimodal recognition system with feature-level fusion: ear L3DFs extracted from the 2D and 3D profile face images and face L3DFs extracted from the 3D frontal face images are fused; during off-line enrolment the fused features populate the gallery database, and at recognition time the fused probe features are matched against it.
Figure 7.6: Correspondences between the features extracted from the frontal face and those from the ear of the same person. These correspondences are established using Algorithm 1. Only the first 40 correspondences are shown for better visibility (best seen in color).
7.5.1 Fusion of L3DFs from Ear and Face
Although the global shapes of the ear and the face are different, there exists some
similarity in their local shapes (e.g. curves, ridges and holes). We utilize these sim-
ilarities in establishing correspondences between ear and face local features in order
to fuse these two modalities. Fig. 7.6 shows an example of such correspondences.
For each ear feature, we compute the RMS distance between that feature and all
the face features of the same individual and sort them according to that distance.
The face feature with minimum distance is paired up with the corresponding ear
feature. We do the same for all ear features. Fusion can also be performed as a face-ear combination by pairing each face feature with the most similar ear feature. We consider both combinations because the numbers of ear and face features are not always the same, and one ear or face feature sometimes corresponds to multiple similar features. However, for uniformity of representation, we keep the ear features on the left and the corresponding face features on the right side of the fused feature vector.
As illustrated in Fig. 7.7, we also keep the two combinations as two halves of the
fused feature-set. The cumulative percentage of repeatability of the fused features
in different images of the same person is reported in Fig. 7.8 for ten subjects.
Each entry of the fused feature-set has 800 columns, as each ear or face feature vector has a dimension of 400. The number of rows of the feature-set depends on the number of ear features (n) and the number of face features (m). In order to reduce the feature dimension, we apply PCA to the fused feature vectors. The number of selected PCA components was empirically chosen as 10 (see Section 7.7.3).
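A Python sketch of this fusion step is given below, under stated assumptions: scikit-learn's PCA stands in for the thesis's (unspecified) PCA code, and only the ear-face half is shown, the face-ear half being built symmetrically.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_ear_face(ear_feats, face_feats):
    """ear_feats: (n, 400); face_feats: (m, 400). Returns the (n, 800)
    ear-face half of the fused feature-set: each ear feature is concatenated
    with its minimum-RMS-distance face feature (ear left, face right)."""
    fused = []
    for e in ear_feats:
        rms = np.sqrt(((face_feats - e) ** 2).mean(axis=1))  # RMS distances
        fused.append(np.concatenate([e, face_feats[rms.argmin()]]))
    return np.asarray(fused)

# Reduction of the 800-D fused vectors to 10 PCA components (Section 7.7.3):
# reduced = PCA(n_components=10).fit_transform(fuse_ear_face(E, F))
```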
7.5.2 Matching the Fused Features
In order to match a fused probe feature vector with a fused gallery feature
vector, we apply a two-level matching technique similar to the one used for the ear
or face features in the case of score-level fusion (see Section 7.4.1). However, the
similarity measures are now computed with the constraint that the thresholds or
limits are satisfied for both the ear and face features. For example, during the first stage of matching, we only allow fused gallery vectors whose ear and face features are both within a distance threshold λ1 of the corresponding ear and
face features in the probe vector. As described in Algorithm 1, we use five different
similarity measures: mean feature distance of the first and the second stage (δ1
and δ2 respectively), proportion of consistent rotations of the second stage (δ3),
proportion of distance-consistent matches (δ4) and keypoint distance measure (δ5).
We perform the above feature-based matching separately for both combinations
Algorithm 1. Matching a multimodal probe with a multimodal gallery.

0. (Input) Given a multimodal (ear-face or face-ear combination) probe, a multimodal gallery, the distance threshold λ1, seed resolution rs, distance constant κ, angle threshold λ2 and minimum number of matches m.
1. (Distance check) For each feature of the probe and all features of a gallery:
   (i) Discard gallery features whose keypoint location is more than λ1 away from that of the probe feature, for both the ear and face features.
   (ii) Pair the probe feature with the closest gallery feature, by both ear and face feature distance.
   (iii) Count the number of matches (nt) and discard the gallery if there are fewer than m feature pairs.
   (iv) Compute the mean feature distance of the matching feature pairs (δ1).
2. (Distance consistency check)
   (i) For each matching feature pair, find the number (nd) of other matches that satisfy Eqn. (7.1).
   (ii) Find the match with the maximum nd and record it as T.
   (iii) Compute the proportion of distance-consistent matches (δ4 = max(nd)/nt).
3. (Second stage of matching)
   (i) For all the gallery features, repeat Step 1 but do not allow matches which are inconsistent with T.
   (ii) Compute the mean feature distance of the matching feature pairs (δ2).
   (iii) Discard the gallery if there are fewer than m feature pairs.
4. (Rotation consistency measure)
   (i) For each of the selected matching pairs, find the number (nr) of other matches whose rotation angles are within λ2.
   (ii) Compute the proportion of consistent rotations (δ3 = max(nr)/nt).
5. (Keypoint distance measure)
   (i) Align the keypoints of the matching probe features to those of the corresponding gallery features using ICP.
   (ii) Record the ICP error as the keypoint distance measure (δ5).
6. (Output) Return the similarity measures δ1, δ2, δ3, δ4, δ5.
Figure 7.7: Block diagram of the feature-set constructed by fusion of ear and face L3DFs: the ear-face combination pairs each ear feature E_1 ... E_n with its closest face feature F_c1 ... F_ck, and the face-ear combination pairs each face feature F_1 ... F_m with its closest ear feature E_c1 ... E_ck (subscript ck indicates the index of the closest feature with respect to its left or right feature for the ear-face and face-ear combination respectively).
Figure 7.8: Repeatability of fused features in the gallery and probe images of ten individuals (cumulative percentage of repeatability versus nearest neighbour error in mm).
Figure 7.9: Identification rate versus rank for score-level fusion of ear and face (without using ICP scores; curves: ear, face, and score-level fusion): (a) with neutral expression; (b) with non-neutral expressions.
(Section 7.5.1) of the fused feature-set. We then compute the mean of the first two
similarity measures resulting from both combinations and retain the other measures for the final score computation. All the similarity measures are normalized on a scale of 0 to 1, and the corresponding weighting factors ηk (where k = 1 to 5) are computed following an approach similar to the one used in Section 7.4.1. The
final similarity measure (εf ) is computed as a weighted sum of all these normalized
similarity measures with double weights to those obtained from the fusion with
respect to the face feature, as given by the following equation:

\[ \varepsilon_f = \eta_1 \delta_1 + 2\eta_2 \delta_2 + \eta_{3e} \delta_{3e} + 2\eta_{3f} \delta_{3f} + \eta_{4e} \delta_{4e} + 2\eta_{4f} \delta_{4f} + \eta_{5e} \delta_{5e} + 2\eta_{5f} \delta_{5f} \]
where subscripts 'e' and 'f' indicate that the corresponding similarity measure is computed from the ear-face and the face-ear combination of the features respectively.
Table 7.2: Summary of score-level fusion results (rates in %; Geo. Cons. = geometric consistency). Similarity measures: (1) Ear: L3DF-based without Geo. Cons.; (2) Face: L3DF-based without Geo. Cons.; (3) Ear: L3DF-based with Geo. Cons.; (4) Face: L3DF-based with Geo. Cons.; (5) Ear: L3DF-based with Geo. Cons. + ICP; (6) Face: L3DF-based with Geo. Cons. + ICP.

Expression    Measure      (1)    (2)    (3)    (4)    (5)    (6)    (1)+(2)  (3)+(4)  (5)+(6)
Non-neutral   Id. Rate     78.1   80.0   79.4   84.8   88.6   83.2    94.9     95.2     96.8
Non-neutral   Ver. Rate    77.5   82.9   79.1   84.8   87.6   83.5    94.6     96.2     97.1
Neutral       Id. Rate     78.5   95.5   80.1   96.8   90.4   97.4    99.4     99.0     98.4
Neutral       Ver. Rate    78.1   98.4   72.4   98.1   89.7   97.8    99.7     99.4     99.4
7.6 Performance of the Score-level Fusion Approach
The recognition performance of our proposed score-level fusion approach is eval-
uated in this section. Results of fusing L3DFs scores with or without the inclusion
Figure 7.10: ROC curves (verification rate versus false acceptance rate, log scale) for score-level fusion of ear and face (without using ICP scores; curves: ear, face, and score-level fusion): (a) with neutral expression; (b) with non-neutral expressions.
of the ICP score as a similarity measure are shown separately to demonstrate their
individual contributions. Results are summarized in Table 7.2 and discussed below.
7.6.1 Results Using L3DF-Based Measures Only
Identification: Using L3DF-based measures including geometric consistency checks, we obtain rank-1 identification rates of 80.1% for the ear and 96.8% for the face (rank-n means the right answer is in the top n matches) on the database described in Section 7.3.1. The score-level fusion of these two modalities improves the overall performance to 99.0% rank-1 identification accuracy. The results are illustrated in Fig. 7.9(a).
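For completeness, the identification and verification measures reported in this chapter can be computed from a probe-by-gallery matrix of final similarity scores as in the following generic sketch; this is the standard evaluation protocol, not code from the thesis, and lower scores mean better matches.

```python
import numpy as np

def rank_n_rate(scores, probe_ids, gallery_ids, n=1):
    """Rank-n identification rate from a (probes x gallery) score matrix."""
    order = np.argsort(scores, axis=1)[:, :n]           # n best gallery entries
    top_ids = np.asarray(gallery_ids)[order]
    return float(np.mean([p in row for p, row in zip(probe_ids, top_ids)]))

def verification_rate(genuine, impostor, far=0.001):
    """Verification rate at a given FAR: the acceptance threshold is set so
    that the stated fraction of impostor scores is (wrongly) accepted."""
    thr = np.quantile(impostor, far)                    # lower score = accept
    return float(np.mean(np.asarray(genuine) <= thr))
```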
We also perform experiments with a gallery of neutral faces and probes of face
images with different expressions such as happy, sad or angry. For the database
mentioned above, we obtain rank-1 identification rates of 79.4%, 84.8% and 95.2%
for the ear, the face and their score-level fusion respectively (see Fig. 7.9(b)).
The variation of the recognition rates for different combinations of ear and face
complementary weights (See Section 7.4.2) used for fusion with a weighted sum rule
on data with neutral facial expression is illustrated in Fig. 7.11. From the plot we
can see that the recognition rate reaches a peak for ear and face weights of 0.35
and 0.65 respectively. It then declines as greater weighting is given to the face. For
data with non-neutral facial expressions, we obtain the best result with ear and face
weights of 0.45 and 0.55 respectively. This confirms that, for local 3D features, the face data are more reliable than the ear data, particularly with neutral facial expressions.
In order to evaluate the contribution of the geometric consistency measures on
the two different modalities and on their fusion, we perform experiments using only
the mean feature distance as similarity measure. We obtain worse results of 72%,
96.8% and 98.7% for the ear, the face (with neutral expression) and the fusion
respectively. Thus, the use of geometric consistency measures contributes to a sig-
nificant improvement for the ear, and only a slight improvement for the face. A
possible reason is that our matching algorithm for faces already requires distance
consistency with respect to the detected nose tip position as described in [114]. In
the case of the ear, it is difficult to reliably detect a similar landmark position due
to the possibility of occlusion. We actually consider this to be a strength of our
technique that it does not require any specific part of the ear to be visible.
Figure 7.11: Recognition rates for different combinations of ear and face weights using the weighted sum rule (x-axis: ear weight, with face weight = 1 − ear weight; y-axis: identification rate).
Verification: We obtain a verification rate of 98.1% at a FAR of 0.001 with neu-
tral facial expression when equal weights are used for both modalities. However, the
verification rate increases to 99.7% for the same FAR of 0.001, when we assign 0.35
and 0.65 weights to the ear and the face scores respectively. For the probe dataset
with facial expressions, the verification rate with the face only is 83.5%, which improves to 94.9% after fusion with equal weights and to 97.1% after fusion with weights of 0.45 and 0.55 applied to the ear and the face scores respectively (see Fig. 7.10).
7.6.2 Improvement Using ICP
Considering the ICP scores from both the ear and face data during fusion, we
obtain a slightly improved result for data with non-neutral facial expressions com-
pared to those with L3DF only. The rank-1 identification rates and verification rates
at 0.001 FAR obtained for this approach are reported in Table 7.2. We obtain the best results with ear and face weights of (0.35, 0.65) and (0.45, 0.55) for neutral and non-neutral expression data respectively.
Although the results of the individual modalities improve with the use of ICP for neutral expressions, the fusion result decreases slightly. This implies that we do not have to apply expensive post-processing (using the ICP algorithm) in applications where a neutral facial expression can be ensured.
Fig. 7.12 shows some of the probes which are misclassified with face data only but
are recognized correctly after fusion with ear data. 2D images of the corresponding
probe range images are also shown in the top row for a better visualization of the
expressions.
Figure 7.12: Examples of 2D and corresponding range images of four correctly recognized probes.
7.6.3 Misclassifications
Only five out of 315 probes were misclassified. The range images of those face
and ear probes are shown in the top and the bottom rows respectively in Fig. 7.13.
It is apparent that there are large expression changes in the face probes and data
losses due to hair plus large out-of-plane pose variations in the ear probes.
Figure 7.13: Examples of five misclassified multimodal probes where face data have
large expression changes and the ear data have data losses due to hair and large
out-of-plane pose variations compared to their respective gallery data.
7.7 Performance of the Feature-Level Fusion Approach
In this section, the performance of the proposed feature-level fusion approach is
evaluated. Experiments were performed on the same dataset as the one used for the
score-level fusion for a fair comparison. Along with the results, the selection of the
parameters and the similarity measures are also discussed.
7.7.1 Results on Data with Neutral Expression
For feature-level fusion of ear and face data with neutral facial expression, we
obtain a rank-1 identification rate of 98.4%. In the same scenario, we obtain a verification rate of 99.0% at a FAR of 0.001. The results are illustrated in Figs. 7.14
and 7.15 respectively.
Figure 7.14: Identification rate versus rank for feature-level fusion of ear and face features under neutral and non-neutral facial expressions.
7.7.2 Results on Data with Non-neutral Facial Expressions
Our experiments for the evaluation of feature-level fusion of ear and face data
with non-neutral facial expressions result in identification rates of 94.9%, 97.1%
and 97.8% at rank one, two and three respectively (see Fig. 7.14). These results are
obtained using PCA for feature vector reduction and using all the similarity measures
described in Section 7.5.2. As shown in Fig. 7.15, with the same implementation
scenario, we obtain a verification rate of 96.8% at 0.001 FAR.
7.7.3 Choice of the Number of PCA Components
We performed a number of experiments using different numbers of PCA com-
ponents (eigenvalues) on a subset of the data with non-neutral facial expressions
Figure 7.15: Verification results (ROC curves, false accept rate on a log scale) for feature-level fusion of ear and face features under neutral and non-neutral expressions.
(100 gallery and 100 probe images), using only the mean feature distance of the first stage (δ1) as the similarity measure. The identification results for numbers of components from 10 to 70, in steps of 10, are illustrated in Fig. 7.16. The plots show insignificant performance differences across this range, and the best result is obtained with 10 components.
Figure 7.16: Effect of using a different number of PCA components on the identification results (data with non-neutral expressions: 100 gallery and 100 probe images).
7.7.4 Performance of Different Similarity Measures
The performance of different similarity measures used in feature-level fusion for
face data with neutral expression is shown in Fig. 7.17. Similarity measures are
Figure 7.17: Performance of various similarity measures used in feature-level fusion matching for face data with neutral expression (legend: Feature Distance-1, Feature Distance-2, Rotation Consistency (e-f), Rotation Consistency (f-e), Distance Consistency (e-f), Distance Consistency (f-e), Keypoint Distance (e-f), Keypoint Distance (f-e)).
Table 7.3: Summary of the comparison of our fusion approaches with others (identification and verification rates are measured at rank-1 and at 0.001 FAR respectively)

Source         Methodology                       Id. Rate  Ver. Rate  Dataset   Multi-instance  Automation        Robustness to non-neutral
                                                 (%)       (%)        (#Subj.)  enrolment                         expressions
Yan [175]      ICP with sum and interval         100       -          174       2 images        Manually          Not shown
               fusion rules                                                                     extracted ear
Theoharis      Annotated model fitting using     99.7      -          324       No              Manual ear        Includes a few such images,
et al. [152]   ICP, SA and wavelet analysis                                                     extraction        but no separate test
This paper     L3DF matching, ICP and            99.4      99.7       326       No              Fully automatic   Id. rate 96.8%,
               weighted sum rule                                                                                  Ver. rate 97.1%
This paper     L3DF-based matching of            98.4      99.0       326       No              Fully automatic   Id. rate 94.9%,
               combined features                                                                                  Ver. rate 96.8%
named in short for δ1, δ2, δ3e, δ3f, δ4e, δ4f, δ5e and δ5f respectively, which are described in Section 7.5.2. As illustrated in this figure, the fourth similarity measure, the proportion of consistent rotations originating from the face-ear combination (δ3f), provides the maximum identification rate.
7.8 Comparative Study
In this section, we first compare our score-level and feature-level fusion approaches with each other. We then compare them with other 3D ear-face multimodal approaches. These comparisons are summarized
in Table 7.3 and discussed below.
7.8.1 Comparison between the Proposed Fusion Approaches
As summarized in Table 7.3, the proposed score-level fusion approach achieves
better accuracy than the feature-level fusion approach for the multimodal recogni-
tion with the ear and face biometrics. Our previous work [117] with 2D Scale Invari-
ant Feature Transform (SIFT) and 3D local features for multimodal face recognition
also demonstrated similar results. This is mostly due to the difficulty in fusing local
features in a repeatable way. As illustrated in Fig. 7.8, we only obtained a 40%
repeatability of the ear-face fused feature vector even with 10 mm nearest neigh-
bor error. However, the results reported in Section 7.7 express one strength of our
feature-level fusion technique that it performs well even when features are fused
differently between a probe and gallery. If features could be fused in the same way,
the technique may outperform score-level fusion.
7.8.2 Comparison with Other Approaches
Yan [175] and Theoharis et al. [152] used precise manual extraction and/or substantial preprocessing of the ear data (e.g. using the snake algorithm, and removing the face or the neck area with a skin detection algorithm), which might have contributed to their higher identification accuracies. In contrast, our approach uses a fully automatic extraction technique that extracts a rectangular area around the ear, sometimes including extra regions with hair and skin, and applies only minimal preprocessing to remove holes and spikes from the extracted 3D ear data.
The performance of the approach in [175] was evaluated on a smaller database
from only 174 subjects each having two ear images and two face images in the gallery
and the probe datasets. Faltemier et al. [50] experimentally demonstrated that the
multi-instance approach performs better than the single-instance approach, albeit
with a penalty of additional computation. We use a larger database without any
multi-instance in the gallery or the probe datasets (Section 7.3).
None of the approaches in the table performed separate experiments with face data under non-neutral expressions, which severely affect performance. Theoharis et al. [152] used a subset of the FRGC v.2 dataset which includes faces with non-neutral expressions, but the authors did not mention how many of the selected faces had non-neutral expressions. On a larger dataset, but with multi-instance gallery and probe datasets collected from the FRGC v.2 database, Mian et al. [117]
obtained 86.7% and 92.7% identification and verification rates respectively using
face L3DFs involving non-neutral expressions. In this paper, we obtain better re-
sults (94.9% and 95.2% respectively) fusing scores from ear L3DFs and face L3DFs
(without considering ICP scores).
Neither Theoharis et al. [152] nor Yan [175] reported verification rates, one of the important indicators of the performance of a biometric system. Our approach shows high verification accuracy for face data with
both neutral and non-neutral expressions.
The matching time is also not reported for either of these two approaches. However, since Yan [175] used ICP on the whole dataset, our technique with local features is expected to be faster than that approach. The final matching in [152] is expected to be faster, since they used a weighted L1 distance metric to compare wavelet coefficients extracted from the ear and face data. However, their registration and deformable model fitting steps are computationally expensive.
7.9 Conclusion
In this paper, two multimodal ear-face biometric recognition approaches are
proposed, one with fusion at the score-level and another at the feature-level. These
approaches are based on local 3D features which are very fast to compute and robust
to pose and scale variations, and occlusions due to hair and earrings. The fusion
with ear data (which is not sensitive to changes in facial expression) significantly
improves the face recognition results under non-neutral expressions. The comple-
mentary features of the two modalities also provide better results under neutral
expression even without using the expensive ICP algorithm. The feature-level fu-
sion approach performs better than the unimodal approaches based on the ear or face only but, contrary to what might intuitively be expected, not better than the score-level fusion. This is possi-
bly due to the difficulty of fusing local ear and face features in a repeatable way. The
performance of the proposed multimodal recognition can be improved further using
a hybrid fusion approach where the result of a lower (e.g. data or feature) level of
fusion can be fed to a higher (e.g. score) level of fusion. One possibility would be to
short list gallery candidates using the result of data-level or feature-level fusion. A
finer matching algorithm such as ICP can then be applied to each modality and at
the end, score-level fusion can be used to make the final matching decision. Thus,
we can combine the benefits of the different possible levels of fusion. Building a
larger ear-face multimodal database with more challenging images to evaluate the
robustness of the proposed techniques can be another avenue of further research.
Acknowledgments
This research is sponsored by the Australian Research Council (ARC) under
grants DP0664228, DP0881813 and LE0775672 and by the University of Western
Australia under Scholarships for International Research Fees (SIRFs)- Completion.
We acknowledge the use of the FRGC v.2 and the UND Biometrics databases for
ear and face detection and recognition.
CHAPTER 8
Conclusion
8.1 Discussion
In this dissertation, a complete and fully automatic unimodal approach for human recognition using ear biometrics is proposed, along with two multimodal recognition approaches using ear and face data collected from 2D and 3D frontal and profile images.
The Iterative Closest Point (ICP) algorithm is found to be one of the most
accurate algorithms for matching ears in the gallery and the probe dataset. In this
dissertation, in order to reduce the computational expense of this iterative algorithm,
I used the ICP algorithm in a hierarchical manner: first with a low and then with
higher resolution meshes of 3D ear data. The result of the first application of ICP
was used as a coarse alignment prior to the second application of ICP. I obtained a
rank one recognition rate of 93% on the UND Biometrics Database.
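A minimal NumPy/SciPy sketch of this coarse-to-fine idea follows: a first ICP pass on uniformly subsampled points stands in for the low-resolution meshes, and its result initializes a second pass at full resolution. The subsampling rate and iteration counts are assumptions, and the original implementation was in MATLAB.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(src, dst, iters=20):
    """Basic point-to-point ICP aligning src to dst; returns (R, t, rms)."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(dst)
    for _ in range(iters):
        moved = src @ R.T + t
        _, idx = tree.query(moved)                      # closest-point pairs
        cp, cq = moved.mean(0), dst[idx].mean(0)
        H = (moved - cp).T @ (dst[idx] - cq)            # cross-covariance
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        Rs = Vt.T @ D @ U.T                             # Kabsch rotation step
        R, t = Rs @ R, Rs @ t + (cq - Rs @ cp)          # accumulate transform
    dists, _ = tree.query(src @ R.T + t)
    return R, t, float(np.sqrt((dists ** 2).mean()))    # RMS matching error

def hierarchical_icp(src, dst, step=8):
    """Coarse pass on subsampled clouds, then a fine pass from that alignment
    (the returned transform is relative to the coarsely aligned cloud)."""
    R0, t0, _ = icp(src[::step], dst[::step], iters=10)
    return icp(src @ R0.T + t0, dst, iters=20)
```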
In order to achieve more efficiency and robustness against occlusion, I devised a
very fast ear detection approach and used 3D local features for data representation
and matching. My AdaBoost-based ear detection approach with three new Haar
feature templates and the rectangular detector is very fast and significantly robust
to hair, earrings and earphones. On Collection J of the UND Biometrics Database
with 830 images from 415 subjects, my proposed system provides a detection rate
of 99.9%. On a Core 2 Quad 9550, 2.83 GHz machine, it takes around 7.7 ms to
detect an ear from a 640 × 480 image. Unlike other approaches, the performance of
my proposed system does not rely on any assumptions about the localization of the
nose or the ear pit.
The modified construction of local 3D features and their efficient use in finding potential matches and feature-rich areas, and in coarsely aligning the data prior to ICP, make my recognition approach computationally inexpensive and significantly robust to occlusions and pose variations. The use of two-stage feature matching with geometric consistency measures significantly improved the matching performance.
With an un-optimized MATLAB implementation, it takes only 22.2 sec to extract around 200 local features from a 3D ear; matching a probe ear with a gallery using only local features takes 0.06 sec, and the full matching including ICP requires 2.28 sec on average. The evaluation of the
performance of the complete system on UND-J (the largest available ear database)
gives an identification rate of 93.5% and an Equal Error Rate (EER) of 4.1%. The
corresponding rates for the UND-F dataset are 95.4% and 2.3%. On a new in-house
dataset of 50 subjects all wearing earphones, I obtained an identification rate of 98%
with an EER of 1%.
In order to overcome the vulnerability of unimodal biometric systems, I pre-
sented two multimodal recognition systems fusing the ear with the face biometrics
at score and feature levels. The face is chosen because it is physically close to the ear and, like the ear, its image can be collected non-intrusively. In both approaches, local 3D features are used to represent both the ear and face data, and the shape similarity among the features in the gallery and probe datasets is utilized
for classification. The fusion with ear data (which is not sensitive to changes in facial
expression) significantly improves the face recognition results under non-neutral ex-
pressions. The complementary features of the two modalities provide better results
under neutral expression. The evaluation of both approaches is performed on the
largest publicly available multimodal ear-face dataset constructed using data from
the FRGC v.2 and the UND Biometric database. The score-level fusion technique
achieves an identification rate of 99.4% and a verification rate of 99.7% (at 0.001
FAR) with neutral facial expression. Corresponding rates with non-neutral facial
expressions are 96.8% and 97.1% respectively. I obtained comparable recognition
results with the feature-level fusion of the two modalities without using expensive
matching algorithms such as ICP.
Thus, a complete and fully automatic system of unimodal and multimodal recog-
nition using the ear and the face biometrics from detection to decision making is
presented in this dissertation. These approaches can be adapted for recognition with
other biometric traits and objects and have the potential to be extended to other
applications such as robotics, medicine and forensic sciences.
8.2 Future Work
The accuracy and efficiency of the algorithms and approaches proposed in this
dissertation can be improved in a number of ways. Some possibilities are listed
below:
• The performance evaluation of the approaches presented in this dissertation
asserts that local feature-based matching and multimodal systems perform
better than global feature-based matching and unimodal systems respectively.
Therefore, a potential extension of the research would be to further improve the
performance of ear or face modality by combining 2D local features (e.g. SIFT)
with 3D local features. 2D features can be extracted from corresponding 2D
images that can be captured and co-registered along with the 3D scans using
most of the current 3D data acquisition devices.
• The performance of the proposed multimodal recognition may be improved
using a hybrid fusion approach where the result of a lower (e.g. data or
feature) level of fusion can be fed to a higher (e.g. score) level of fusion.
One possibility would be to short list gallery candidates using the result of
data-level or feature-level fusion. A finer matching algorithm such as ICP can
then be applied to each modality and finally, a score-level fusion technique
can be used to make the final matching decision. Thus, we can combine the
benefits of the different possible levels of fusion.
• The robustness can further be improved by including some other biometric
traits with the proposed ear-face multimodal approach. For example, gait im-
ages can be included in the case of non-intrusive applications and fingerprints
or iris biometrics can be included in the case of controlled applications.
• Most of the algorithms presented in this dissertation are implemented using
un-optimized MATLAB code. Therefore, the speed of the recognition can be
improved by implementing these algorithms on a C/C++ platform and using
faster techniques for feature matching like geometric hashing as well as faster
variants of the ICP algorithm using, for example, kd-trees.
• In this dissertation, I explored fusion of the ear and face biometrics at the
feature and score levels. It would be interesting to fuse them at data level and
compare the results. Local features can be extracted from the fused data and
matching can be performed using my proposed matching techniques.
• Fusion approaches in this dissertation were tested with the largest available
ear-face multimodal dataset obtained from publicly available frontal face and
profile databases. The dataset contains images from 326 subjects with a va-
riety of facial expressions, poses and occlusions with hair and ornaments. A
larger ear-face multimodal database with more challenging images (e.g. with
eyeglasses and earplugs) could be built to evaluate the robustness of the pro-
posed techniques.
Bibliography
[1] 3DRMA. 3D RMA : 3D database. available at
http://www.sic.rma.ac.be/ beumier/DB/3d rma.html, 1998.
[2] A. F. Abate, M. Nappi, and D. Riccio. Face and ear: A bimodal identification
system. In Proc. ICIAR 2006, Part II, pages 297–304, 2006.
[3] A.F. Abate, M. Nappi, D. Riccio, and G. Sabatino. 2D and 3D face recognition: A
survey. Pattern Recognition Letters, 28(14):1885–1906, Oct. 2007.
[4] F. Al-Osaimi, M. Bennamoun, and A. Mian. An Expression Deformation Ap-
proach to Non-rigid 3D Face Recognition. International Journal of Computer Vision
(IJCV), 81(3):302–316, 2009.
[5] L. Alvarez, E. Gonzalez, and L. Mazorra. Fitting Ear Contour Using an Ovoid
Model. In Proc. Int’l Carnahan Conf. on Security Technology, 2005., pages 145–
148, 2005.
[6] B. B. Amor, M. Ardabilian, and L. Chen. Toward a Region-Based 3D Face Recog-
nition Approach. In Proc. Multimedia and Expo, 2008, pages 101–104, 2008.
[7] S. Ansari and P. Gupta. Localization of Ear Using Outer Helix Curve of the Ear.
In Proc. Int’l Conf. on Computing: Theory and Applications, 2007, pages 688–692,
2007.
[8] B. Arbab-Zavar and M. S. Nixon. On shape-mediated enrolment in ear biometrics.
Advances in visual computing, Lecture Notes in Computer Science, 4842:549–558,
2007.
[9] C. Barbu, R. Iqbal, and Jing Peng. An Ensemble Approach to Robust Biometrics
Fusion. In Proc. CVPR Workshop, 2006, pages 56–56, 2006.
[10] P. J. Besl and N. D. McKay. A Method for Registration of 3-D Shapes. IEEE Trans.
PAMI, 14(2):239–256, 1992.
[11] B. Bhanu and H. Chen. 3D Ear Detection from Side Face Range Images. Springer,
2008.
[12] Binghamton University, USA. Binghamton University 3D Facial Expression (BU-
3DFE). available at
http://www.cs.binghamton.edu/ lijun/Research/3DFE/3DFE Analysis.html, 2006.
[13] BioID AG. BioID Face Database. available at
http://support.bioid.com/downloads/facedb/index.php, 2001.
[14] Biometric Consortium. Introduction to Biometrics. 2009.
[15] Bosphorus. The Bosphorus Database. available at
http://bosphorus.ee.boun.edu.tr/, 2008.
[16] K. W. Bowyer, K. I. Chang, Ping Yan, P. J. Flynn, E. Hansley, and S. Sarkar.
Multi-Modal Biometrics: an Overview. In Proc. Second Workshop on Multimodal
User Authentication, 2006.
[17] K.W. Bowyer, K.I. Chang, and P.J. Flynn. A Survey of Approaches and Challenges
in 3D and Multi-Modal 3D+2D Face Recognition. Computer Vision and Image
Understanding, 101(1):1–15, 2006.
[18] A. Bronstein, M. Bronstein, and R. Kimmel. Robust expression-invariant face recog-
nition from partially missing data. In Proc. ECCV’06, LNCS., pages 396–408, 2006.
[19] R. Brunelli and D. Falavigna. Person Identification Using Multiple Cues. IEEE
Trans. PAMI, 12:955–966, 1995.
[20] M. Burge and W. Burger. Ear Biometrics in Computer Vision. In Proc. ICPR’00,
pages 822–826, 2000.
[21] J.D. Bustard and M.S. Nixon. Robust 2D Ear Registration and Recognition Based
on SIFT Point Matching. In Proc. 2nd IEEE International Conference on Biomet-
rics: Theory, Applications and Systems, 2008. BTAS 2008, pages 1–6, 2008.
[22] S. Cadavid and M. Abdel-Mottaleb. Human Identification Based on 3D Ear Models.
In Proc. IEEE International Conference on Biometrics: Theory, Applications, and
Systems, 2007. BTAS 2007, pages 1–6, 2007.
[23] R. J. Campbell and P. J. Flynn. A Survey of Free-form Object Representation and
Recognition Techniques. Computer Vision and Image Understanding, 81(2):166–
210, 2001.
[24] J. Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.
[25] CAS-PEAL. CAS-PEAL Face Database. available at
http://www.jdl.ac.cn/peal/index.html, 2004.
[26] K.I. Chang, K.W. Bowyer, and P.J. Flynn. Adaptive rigid multi-region selection for
handling expression variation in 3d face recognition. In Proc. CVPR, pages 157–157,
2005.
[27] K.I. Chang, K.W. Bowyer, and P.J. Flynn. Multiple Nose Region Matching
for 3D Face Recognition under Varying Facial Expression. IEEE Trans. PAMI,
28(10):1695–1700, 2006.
[28] Kyong Chang, K.W. Bowyer, S. Sarkar, and B. Victor. Comparison and combination
of ear and face images in appearance-Based biometrics. IEEE Transactions PAMI,
9:1160–1165, 2003.
[29] H. Chen and B. Bhanu. Human Ear Detection from Side Face Range Images. In
Proc. ICPR 2004, 3:574–577, 2004.
[30] H. Chen and B. Bhanu. Contour Matching for 3D Ear Recognition. In Proc. IEEE
Workshops on Application of Computer Vision, pages 123–128, 2005.
[31] H. Chen and B. Bhanu. Shape Model-Based 3D Ear Detection from Side Face Range
Images. pages 122–122, 2005.
[32] H. Chen and B. Bhanu. Human Ear Recognition in 3D. IEEE Trans. PAMI,
29(4):718–737, 2007.
[33] H.-Y. Chen, C.-L. Huang, and C.-M. Fu. Hybrid-boost Learning for Multi-pose Face
Detection and Facial Expression Recognition. Pattern Recognition, 41(3):1173–1185,
2008.
[34] Hui Chen and B. Bhanu. Contour Matching for 3D Ear Recognition. In Proc. IEEE
Workshops on Application of Computer Vision, pages 123–128, 2005.
[35] Hui Chen and B. Bhanu. Efficient Recognition of Highly Similar 3D Objects in
Range Images. IEEE Trans. PAMI, 31(1):172–179, 2009.
[36] Qing Chen, Nicolas D. Georganas, and Emil M. Petriu. Real-time Vision-Based
Hand Gesture Recognition Using Haar-like Features. In Proc. IEEE Instrumentation
and Measurement Technology Conf., pages 1–6, 2007.
[37] M. Choras. Ear Biometrics Based on Geometrical Feature Extraction. Electronic
Letters on Computer Vision and Image Analysis, 5:84–95, 2005.
[38] M. Choras. Image Feature Extraction Methods for Ear Biometrics: A Survey. In
Proc. 6th International Conference on Computer Information Systems and Indus-
trial Management Applications, pages 261–265, 2007.
[39] M. Choras. Image Pre-classification for Biometrics Identification Systems. Advances
in Information Processing and Protection, Pejas, J. and Saeed, K. (ed.), Springer
US, 3:361–370, 2007.
[40] C. Chua, F. Han, and Y. Ho. 3D Human Face Recognition Using Point Signatures.
In IEEE Analysis and Modeling of Faces and Gestures, pages 233–238, 2000.
[41] C. S. Chua and R. Jarvis. Point Signatures: A New Representation for 3D Object
Recognition. Int’l Journal of Computer Vision, 25(1):63–85, 1997.
[42] O. Chum and J. Matas. Matching with PROSAC - Progressive Sample Consensus.
In Proc. the CVPR'05, 1:220–226, 2005.
[43] CIFAS. 2008 Fraud Trends, CIFAS. 2009.
[44] CMU. PIE Database, CMU. available at
http://www.ri.cmu.edu/research_project_detail.html?project_id=418&menu_id=261,
2000.
[45] A. Colombo, C. Cusano, and R. Schettini. Face3 a 2D+3D Robust Face Recognition
System. In Proc. ICIAP 2007, pages 393–398, 2007.
[46] K. Delac, M. Grgic, and M. S. (ed.) Bartlett. Recent Advances in Face Recognition.
IN-TECH, Vienna, Austria, 2008.
[47] J. D’Errico. Surface Fitting Using Gridfit. MATLAB Central, File Exchange, 2006.
http://www.mathworks.com/matlabcentral/fileexchange/8998.
[48] C. Dorai and A.K. Jain. COSMOS-A Representation Scheme for 3D Free-form
Objects. IEEE Trans. PAMI, 19(10):1115–1130, 1997.
[49] T.C. Faltemier, K.W. Bowyer, and P.J. Flynn. Using a Multi-Instance Enrollment
Representation to Improve 3D Face Recognition. In Proc. IEEE Int’l Conference
on Biometrics: Theory, Applications, and Systems, pages 1–6, 2007.
[50] T.C. Faltemier, K.W. Bowyer, and P.J. Flynn. Using multi-instance enrollment to
improve performance of 3D face recognition. Computer Vision and Image Under-
standing, 112(2):114–125, Nov. 2008.
[51] FERET. The Color FERET Database. available at
http://face.nist.gov/colorferet/, 2003.
[52] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model
Fitting with Applications to Image Analysis and Automated Cartography. Comm.
of the ACM, 24:381–395, 1981.
[53] Y. Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line
Learning and An Application to Boosting. In Proc. European Conf. on Compu-
tational Learning Theory, 1995.
[54] R. Frischholz. Face Detection Homepage. 2008.
[55] R. Frischholz and U. Dieckmann. Bioid: A multimodal biometric identification
system. IEEE Computer, 33(2):64–68, 2000.
[56] Y. Gao and M. Maggs. Feature-Level Fusion in Personal Identification. In Proc.
CVPR’05, 1:468–473, 2005.
[57] M. Garland and P. Heckbert. Surface Simplification Using Quadric Error Metrics.
In SIGGRAPH, 1997.
[58] J.E. Gentile, K.W. Bowyer, and P.J. Flynn. Profile Face Detection: A Subset Multi-
Biometric Approach. In Proc. Biometrics: Theory, Applications and Systems, 2008.
BTAS 2008, pages 1–6, 2008.
[59] R.S. Ghiass and N. Sadati. Multi-view Face Detection and Recognition under
Variable Lighting Using Fuzzy Logic. In Proc. IEEE International Conference on
Wavelet Analysis and Pattern Recognition (ICWAPR), 1:74–79, 2008.
[60] L. Goldmann, U. J. Monich, and T. Sikora. Components and Their Topology for Ro-
bust Face Detection in the Presence of Partial Occlusions. IEEE Trans. Information
Forensics and Security, 2(3):559–569, 2007.
[61] D.B. Graham and N.M. Allinson. Characterizing Virtual Eigensignatures for Gen-
eral Purpose Face Recognition. Face Recognition: from Theory to Applications,
NATO ASI Series F, Computer and Systems Sciences, H. Wechsler, P. J. Phillips,
V. Bruce, F. Fogelman-Soulie and T. S. Huang (eds), 163:446–456, 1998.
[62] M. Grgic and K. Delac. Databases, Face Recognition Homepage. 2009.
[63] Yimo Guo and Zhengguang Xu. Ear Recognition Using a New Local Matching
Approach. In Proc. the 15th IEEE International Conference on Image Processing,
ICIP’08, pages 289–292, 2008.
[64] C.H. Han and K.-B. Sim. Real-time Face Detection Using AdaBoost Algorithm. In
Proc. the Int’l Conf. on Control, Automation and Systems (ICCAS), pages 1892–
1895, 2008.
[65] N. He, K. Sato, and Y. Takahashi. Partial Face Extraction and Recognition Using
Radial Basis Function Networks. IAPR Workshop on Machine Vision Applications,
pages 144–147, 2000.
[66] E. Hjelmas and B.K. Low. Face Detection: A Survey. Computer Vision and Image
Understanding, 83(3):236–274, 2001.
[67] K. Hotta. View Independent Face Detection Based on Horizontal Rectangular Fea-
tures and Accuracy Improvement Using Combination Kernel of Various Sizes. Pat-
tern Recognition, 42(3):437–444, 2009.
[68] Chang Huang, Haizhou Ai, Yuan Li, and Shihong Lao. High-Performance Rotation
Invariant Multiview Face Detection. IEEE Trans. PAMI, 29(4):671–686, 2007.
[69] D. J. Hurley, B. Arbab-Zavar, and M. S. Nixon. The Ear As a Biometric. EUSIPCO
2007, pages 25–29, 2007.
[70] D. J. Hurley, M. S. Nixon, and J. N. Carter. Force Field Feature Extraction for Ear
Biometrics. Computer Vision and Image Understanding, 98(3):491–512, 2005.
[71] M. Husken, M. Brauckmann, S. Gehlen, and C. Malsburg. Strategies and benefits
of fusion of 2D and 3D face recognition. In Proc. CVPR, pages 174–174, 2005.
[72] A. Iannarelli. Ear Identification. Forensic Identification Series. Paramount Pub-
lishing Company, Fremont, California, 1989.
[73] ISL. Image Databases. 2009.
[74] S.M.S. Islam, M. Bennamoun, and R. Davies. Fast and Fully Automatic Ear De-
tection Using Cascaded AdaBoost. In Proc. IEEE Workshop on Application of
Computer Vision, pages 1–6, 2008.
[75] S.M.S. Islam, M. Bennamoun, A. Mian, and R. Davies. A Fully Automatic Approach
for Human Recognition from Profile Images Using 2D and 3D Ear Data. Proc.
3DPVT, pages 131–141, 2008.
[76] S.M.S. Islam, M. Bennamoun, A. Mian, and R. Davies. Score Level Fusion of Ear
and Face Local 3D Features for Fast and Expression-invariant Human Recognition.
M. Kamel and A. Campilho (Eds.): ICIAR 2009, LNCS 5627, Springer, Heidelberg,
pages 387–396, 2009.
[77] S.M.S. Islam, M. Bennamoun, R. Owens, and R. Davies. Biometric Approaches of
2D-3D Ear and Face: A Survey. Proc. Int’l Conf. on Systems, Computing Sciences
and Software Engineering, 2007.
[78] S.M.S. Islam, M. Bennamoun, R. Owens, and R. Davies. Biometric Approaches of
2D-3D Ear and Face: A Survey. Advances in Computer and Information Sciences
and Engineering, T. Sobh(ed.), Springer Netherlands, pages 509–514, 2008.
[79] S.M.S. Islam, M. Bennamoun, R. Owens, and R. Davies. A Review of Recent
Advances in 3D Ear and Expression Invariant Face Biometrics. Under review with
ACM Computing Surveys, 2010.
[80] S.M.S. Islam and R. Davies. Refining Local 3D Feature Matching through Geometric
Consistency for Robust Biometric Recognition. Proc. Digital Image Computing:
Techniques and Applications (DICTA), pages 513–518, 2009.
[81] S.M.S. Islam, R. Davies, M. Bennamoun, and A. Mian. A Fast and Fully Automatic
Approach for Ear Detection and 3D Recognition from Profile Images. International
Journal of Computer Vision (Under review), 2010.
[82] S.M.S. Islam, R. Davies, M. Bennamoun, R.A. Owens, and A.S. Mian. Fusion of
Local 3D Ear and Face Features for Human Recognition. IEEE Transactions on
Pattern Analysis and Machine Intelligence (Under review), 2010.
[83] S.M.S. Islam, R. Davies, A. Mian, and M. Bennamoun. A Fast and Fully Automatic
Ear Recognition Approach Based on 3D Local Surface Features. J. Blanc-Talon et
al. (Eds.): ACIVS 2008, LNCS 5259, Springer, Heidelberg, pages 1081–1092, 2008.
[84] A. K. Jain, K. Nandakumar, and A. Ross. Score Normalization in Multimodal
Biometric Systems. Pattern Recognition, 38(12):2270–2285, 2005.
[85] A. K. Jain, A. Ross, and S. Pankanti. Biometrics: A Tool For Information Security.
IEEE Trans. Information Forensics and Security, 1(2):125–143, 2006.
[86] A. K. Jain, A. Ross, and S. Prabhakar. An introduction to biometric recognition.
IEEE Trans. Circuits and Systems for Video Technology, 14(1):4–20, 2004.
[87] Javelin. The 2009 Identity Fraud Survey Report, Javelin Strategy & Research. 2009.
[88] Javelin. The 2010 Identity Fraud Survey Report, Javelin Strategy & Research.
available at
http://www.idsafety.net/2010IDFraudReportRelease.pdf, 2010.
[89] A. E. Johnson and M. Hebert. Using Spin Images for Efficient Object Recognition
in Cluttered 3D Scenes. IEEE Trans. PAMI, 21(5):674–686, 1999.
[90] M. Jones and P. Viola. Fast Multi-view Face Detection. Technical Report TR2003-
96, MERL, 2003.
[91] I.A. Kakadiaris, G. Passalis, G. Toderici, M.N. Murtuza, Yunliang Lu, N. Karam-
patziakis, and T. Theoharis. Three-Dimensional Face Recognition in the Presence
of Facial Expressions: An Annotated Deformable Model Approach. IEEE Trans.
PAMI, 29(4):640–649, 2007.
[92] Yan Ke and R. Sukthankar. PCA-SIFT: a more distinctive representation for local
image descriptors. In Proc. the CVPR’04, 2:506–513, 2004.
[93] J. Kepner. MatlabMPI. Journal of Parallel and Distributed Computing, 64(8):997–
1005, 2004.
[94] J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On Combining Classifiers. IEEE
Transactions PAMI, 20(3):226–239, 1998.
[95] J.J. Koenderink and A.J. van Doorn. Surface Shape and Curvature Scales. Image and
Vision Computing, 10:557–565, 1992.
[96] S.G. Kong, J. Heo, B.R. Abidi, J. Paik, and M.A. Abidi. Recent advances in
visual and infrared face recognition: a review. Computer Vision and Image Under-
standing, 97(1):103–135, 2005.
[97] A. Kumar, D. C. M. Wong, H. Shen, and A. K. Jain. Personal verification using
palmprint and hand geometry biometric. In Proc. Int’l Conf. on Audio- and Video-
Based Person Authentication, pages 668–675, 2003.
[98] A. Kumar and D. Zhang. Personal recognition using hand shape and texture. IEEE
Trans. Image Processing, 15(8):2454–2461, 2006.
[99] Chao Li and A. Barreto. An Integrated 3D Face-Expression Recognition Approach.
In Proc. ICASSP’06, 3:III–III, 2006.
[100] S. Z. Li and A. K. Jain. Handbook of Face Recognition. Springer, 2005.
[101] X. Li, G. Mori, and H. Zhang. Expression-Invariant Face Recognition with Expres-
sion Classification. In Proc. Canadian Conf. on Computer and Robot Vision, pages
77–83, 2006.
[102] Y. Li, S. Gong, J. Sherrah, and H. Liddell. Support Vector Machine Based Multi-
view Face Detection and Recognition. Image and Vision Computing, 22(5):413–427,
2004.
[103] R. Lienhart, L. Liang, and A. Kuranov. A Detector Tree of Boosted Classifiers for
Real-Time Object Detection and Tracking. In Proc. the Int’l Conf. on Multimedia
and Expo, 2003. ICME ’03, 2:277–280, 2003.
[104] R. Lienhart and J. Maydt. An Extended Set of Haar-like Features for Rapid Object
Detection. In Proc. the Int'l Conf. on Image Processing, 1:900–903, 2002.
[105] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Int. Journal
of Computer Vision, 60:91–110, 2004.
[106] Lu Lu, Xiaoxun Zhang, Youdong Zhao, and Yunde Jia. Human Identification Based
on 3D Ear Models. In Proc. International Conference on Innovative Computing,
Information and Control, 2006. ICICIC '06, 3:353–356, 2006.
[107] X. Lu and A. K. Jain. Deformation Modeling for Robust 3D Face Matching. IEEE
Trans. of PAMI, 30(8):1346–1356, 2008.
[108] L. Luciano and A. Krzyzak. Automated Multimodal Biometrics Using Face and Ear.
M. Kamel and A. Campilho (Eds.): ICIAR 2009, LNCS 5627, Springer, Heidelberg,
pages 451–460, 2009.
[109] A. Majumdar and R. K. Ward. Discriminative SIFT Features for Face Recogni-
tion. In Proc. Canadian Conference on Electrical and Computer Engineering, 2009.
CCECE ’09, pages 27–30, 2009.
[110] D. Maltoni, A. Jain, J. Wayman, and D. Maio. Biometric Systems: Technology,
Design and Performance Evaluation. Springer Verlag, 2005.
[111] G. Mamic and M. Bennamoun. Representation and Recognition of 3D Free-Form Ob-
jects. Digital Signal Processing (DSP), Academic Press, 12(1):47–76, Jan. 2002.
[112] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTSDB: The
Extended M2VTS Database. In Proc. the 2nd Conf. on Audio and Video-base
Biometric Personal Verification, Springer Verlag, New York, pages 1–6, 1999.
[113] J. Meynet, V. Popovici, and J.-P. Thiran. Face detection with boosted Gaussian
features. Pattern Recognition, 40(8):2283–2291, 2007.
[114] A. S. Mian, M. Bennamoun, and R. Owens. An Efficient Multimodal 2D-3D Hybrid
Approach to Automatic Face Recognition. IEEE Trans. PAMI, 29(11):1927–1943,
2007.
[115] A.S. Mian, M. Bennamoun, and R. Owens. 2D and 3D multimodal hybrid face
recognition. In Proc. European Conf. on Computer Vision (ECCV), Part 3, pages
344–355, 2006.
[116] A.S. Mian, M. Bennamoun, and R. Owens. A Novel Representation and Feature
Matching Algorithm for Automatic Pairwise Registration of Range Images. Inter-
national Journal of Computer Vision (IJCV), 66(1):19–40, 2006.
[117] A.S. Mian, M. Bennamoun, and R. Owens. Keypoint Detection and Local Feature
Matching for Textured 3D Face Recognition. International Journal of Computer
Vision, 79(1):1–12, 2008.
[118] C. Middendorff, K.W. Bowyer, and Ping Yan. Multi-Modal Biometrics Involving
the Human Ear. In Proc. IEEE Conference on CVPR, 3:1–2, 2007.
[119] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE
Trans. PAMI, 27(10):1615–1630, 2005.
[120] MIT-CBCL. MIT-CBCL Face Recognition Database.
http://cbcl.mit.edu/software-datasets/heisele/facerecognition-database.html, 2004.
[121] T. Mita, T. Kaneko, and O. Hori. Joint Haar-like Features for Face Detection. In
Proc. IEEE International Conference on Computer Vision, 2005. ICCV 2005,
2:1619–1626, 2005.
[122] G. Monteiro, P. Peixoto, and U. Nunes. Vision-Based Pedestrian Detection using
Haar-Like features. Robotica 2006-Scientific meeting of the 6th Robotics Portuguese
Festival,Portugal, 2006.
[123] A.B. Moreno and A. Sanchez. GavabDB: A 3D face database. In Proc. Workshop
Biometrics on the Internet COST275, pages 77–85, 2004.
[124] I. Mpiperis, S. Malassiotis, and M.G. Strintzis. 3D Face Recognition by Point Sig-
natures and Iso-contours. In Proc. SPPRA, pages 233–238, 2007.
[125] I. Mpiperis, S. Malassiotis, and M. G. Strintzis. Bilinear Models for 3-D Face and
Facial Expression Recognition. IEEE Trans. Information Forensics and Security,
3(3):498–511, Sept. 2008.
[126] P. Nair and A. Cavallaro. 3-D Face Detection, Landmark Localization, and Regis-
tration Using a Point Distribution Model. IEEE Trans. Multimedia, 11(4):611–623,
2009.
[127] R. Niese, A. Al-Hamadi, and B. Michaelis. A Novel Method for 3D Face Detection
and Normalization. Journal of Multimedia, 2(5):1–12, 2007.
[128] M. Nilsson, J. Nordberg, and I. Claesson. Face Detection using Local SMQT Fea-
tures and Split Up SNoW Classifier. In Proc. IEEE Int'l Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2:589–592, 2007.
[129] NIST-MID. NIST Mugshot Identification Database (MID).
http://www.nist.gov/srd/nistsd18.htm, 1994.
[130] Zhiheng Niu, Shiguang Shan, Shengye Yan, Xilin Chen, and Wen Gao. 2D Cascaded
AdaBoost for Eye Localization. In Proc. ICPR 2006, 2:1216–1219, 2006.
[131] OpenCV. OpenCV Library. http://sourceforge.net/projects/opencvlibrary/.
[132] E. Osuna, R. Freund, and F. Girosit. Training Support Vector Machines: an Ap-
plication to Face Detection. In Proc. CVPR’97, pages 130–136, 1997.
[133] Y. Pan, X. Cao, X. Xu, Y. Lu, and Y. Zhao. The study of multimodal recognition
based on ear and face. In Proc. IEEE Int'l Conference on Audio, Language and
Image Processing, ICALIP, pages 385–389, 2008.
[134] Jae-Han Park, Kyung-Wook Park, Seung-Ho Baeg, and Moon-Hong Baeg. π-SIFT:
A Photometric and Scale Invariant Feature Transform. In Proc. the 19th
International Conference on Pattern Recognition, 2008. ICPR 2008, 2:1–4, 2008.
[135] G. Passalis, I. Kakadiaris, T. Theoharis, G. Toderici, and N. Murtuza. Evalua-
tion of 3D Face Recognition in the presence of facial expressions: an Annotated
Deformable Model approach. In Proc. IEEE Workshop Face Recognition Grand
Challenge Experiments, 3:171–179, 2005.
[136] G. Passalis, I.A. Kakadiaris, T. Theoharis, G. Toderici, and T. Papaioannou. To-
wards Fast 3D Ear Recognition for Real-Life Biometric Applications. In Proc. IEEE
Conference on Advanced Video and Signal Based Surveillance, 2007. AVSS 2007,
3:39–44, 2007.
[137] S. Pavani, D. Delgado, and A. F. Frangi. Haar-like features with optimally weighted
rectangles for rapid object detection. Pattern Recognition, 43(1):160–172, 2010.
[138] N. Pears. RBF Shape Histograms and Their Application to 3D Face Processing. In
Proc. IEEE International Conference on Automatic Face and Gesture Recognition,
2008. FG08, pages 1–8, 2008.
[139] N.E. Pears and T.D. Heseltine. Isoradius contours: New Representations and Tech-
niques for 3D Face Matching and Registration. In Proc. Int. Symposium on 3DPVT,
pages 176–183, 2006.
[140] D. Petrovska-Delacretaz, S. Lelandais, J. Colineau, L. Chen, B. Dorizzi, M. Ardabil-
ian, E. Krichen, M.-A. Mellakh, A. Chaari, S. Guerfi, J. D'Hose, and B. Ben Amor.
The IV2 Multimodal Biometric Database (Including Iris, 2D, 3D, Stereoscopic, and
Talking Face Data), and the IV2-2007 Evaluation Campaign. In Proc. BTAS08,
pages 1–7, 2008.
[141] P.J. Phillips, P.J. Flynn, T. Scruggs, K.W. Bowyer, Jin Chang, K. Hoffman, J. Mar-
ques, Jaesik Min, and W. Worek. Overview of the Face Recognition Grand Chal-
lenge. In Proc. CVPR’05, 1:947–954, 2005.
[142] P.J. Phillips, A. Martin, C.L. Wilson, and M. Przybocki. An introduction to eval-
uating biometric systems. Computer, 33(2):56–63, 2000.
[143] K. H. Pun and Y. S. Moon. Recent Advances in Ear Biometrics. In Proc. IEEE
Int’l Conf. on Automatic Face and Gesture Recognition, pages 164–169, 2004.
[144] A. Ross and R. Govindarajan. Feature Level Fusion Using Hand and Face Biomet-
rics. In Proc. SPIE Conf. on Biometric Technology for Human Identification II,
pages 196–204, 2005.
[145] A. Ross and A. K. Jain. Information Fusion in Biometrics. Pattern Recognition
Letters, 24(13):2115–2125, 2003.
[146] A. Ross and A. K. Jain. Multimodal Biometrics: An Overview. In Proc. European
Signal Processing Conf., pages 1221–1224, 2004.
[147] A. A. Ross, K. Nandakumar, and A. K. Jain. Handbook of Multibiometrics. Springer,
2006.
[148] A. Ruifrok, A. Scheenstra, and R. C. Veltkamp. A Survey of 3D Face Recogni-
tion Methods. In Proc. Audio- and Video-Based Biometric Person Authentication
(AVBPA 2005), LNCS 3546, pages 891–899, 2005.
[149] R.E. Schapire and Y. Singer. Improved Boosting Algorithms Using Confidence-rated
Predictions. Mach. Learn., 37(3):297–336, 1999.
[150] P.Y. Simard, L. Bottou, P. Haffner, and Y. LeCun. Boxlets: A Fast Convolution
Algorithm for Signal Processing and Neural Networks. In M. Kearns, S. Solla, and
D. Cohn (Eds.), Advances in Neural Information Processing Systems, 11:571–577, 1999.
[151] J. Sochman and J. Matas. AdaBoost with Totally Corrective Updates for Fast Face
Detection. In Proc. IEEE Int’l Conf. on Automatic Face and Gesture Recognition,
2004, pages 445–450, 2004.
[152] T. Theoharis, G. Passalis, G. Toderici, and I.A. Kakadiaris. Unified 3D Face and Ear
Recognition Using Wavelets on Geometry Images. Pattern Recognition, 41(3):796–
804, 2008.
[153] A. Treptow and A. Zell. Real-time Object Tracking for Soccer Robots without Color
Information. Robotics and Autonomous Systems, 48(1):41–48, 2004.
[154] UMIST. The UMIST Face Database.
http://images.ee.umist.ac.uk/danny/database.html, 2002.
[155] UND. University of Notre Dame Biometrics Database.
http://www.nd.edu/~cvrl/CVRL/Data_Sets.html, 2004.
[156] UND. University of Notre Dame Biometrics Database.
http://www.nd.edu/~cvrl/CVRL/Data_Sets.html, 2005.
[157] UND. University of Notre Dame Biometrics Database.
http://www.nd.edu/~cvrl/CVRL/Data_Sets.html, 2005.
[158] O. Ushmaev and S. Novikov. Biometric Fusion: Robust Approach. In Proc. Int’l
Workshop on Multimodal User Authentication (MMUA 2006), 2006.
[159] USTB. USTB Ear Database.
http://www.en.ustb.edu.cn/resb/, 2002.
[160] USTB. The USTB database III. available at
http://www.ustb.edu.cn/resb/en/doc/Imagedb_123_intro_en.pdf, 2004.
[161] P. Viola and M.J. Jones. Robust Real-Time Face Detection. Int’l Journal of Com-
puter Vision, 57(2):137–154, 2004.
[162] Y. Wang, G. Pan, and Z. Wu. Sphere-spin-image: A Viewpoint Invariant Surface
Representation for 3D Face Recognition. In Proc. Internat. Conf. on Computational
Science (ICCS04), Lecture Notes in Computer Science, Vol. 3037, pages 427–434,
2004.
[163] Y. Wang, G. Pan, and Z. Wu. 3D Face Recognition in the Presence of Expression: A
Guidance-Based Constraint Deformation Approach. In Proc. CVPR07, pages 1–7,
2007.
[164] Y. Wang, T. Tan, and A. K. Jain. Combining Face and Iris Biometrics for Identity
Verification. In Proc. Int’l Conf. on Audio- and Video-Based Person Authentication,
pages 805–813, 2003.
[165] B. Weyrauch, J. Huang, B. Heisele, and V. Blanz. Component-Based Face Recog-
nition with 3D Morphable Models. In First IEEE Workshop on Face Processing in
Video, Washington, D.C., 2004.
[166] D.L. Woodard, T.C. Faltemier, Ping Yan, P.J. Flynn, and K.W. Bowyer. A Com-
parison of 3D Biometric Modalities. In Proc. CVPR Workshop, pages 57–61, 2006.
[167] K. Woods, K. Bowyer, and W. P. Kegelmeyer. Combination of Multiple Classifiers
Using Local Accuracy Estimates. Trans. Pattern Anal. Mach. Intell., 19(4):405–410,
1997.
[168] J. Wu, S.C. Brubaker, M.D. Mullin, and J.M. Rehg. Fast Asymmetric Learning for
Cascade Face Detection. IEEE Trans. PAMI, 30(3):369–382, 2008.
[169] L. Xiaohua, K.-M. Lam, S. Lansun, and Z. Jiliu. Face detection using simplified Ga-
bor features and hierarchical regions in a cascade of classifiers. Pattern Recognition
Letters, 30(8):717–728, 2009.
[170] Zhang Xiaoxun and Jia Yunde. Symmetrical Null Space LDA for Face and Ear
Recognition. Neurocomputing, 70(4-6):842–848, 2007.
[171] X.-N. Xu, Z.-C. Mu, and L. Yuan. Feature-level Fusion Method Based on KFDA
for Multimodal Recognition Fusing Ear and Profile Face. In Proc. Int'l Conference
on Wavelet Analysis and Pattern Recognition (ICWAPR), 3:1306–1310, 2007.
[172] Xiaona Xu and Zhichun Mu. Feature Fusion Method Based on KCCA for Ear and
Profile Face Based Multimodal Recognition. In Proc. IEEE International Confer-
ence on Automation and Logistics, pages 620–623, 2007.
[173] Yale. The Yale Face Database. available at
http://cvc.yale.edu/projects/yalefaces/yalefaces.html, 1997.
[174] J. Yan. Ensemble SVM Regression Based Multi-View Face Detection System. In
Proc. IEEE Workshop on Machine Learning for Signal Processing, pages 163–169,
2007.
[175] P. Yan. Ear Biometrics in Human Identification. PhD thesis, University of Notre
Dame, 2006.
[176] P. Yan and K. W. Bowyer. Empirical Evaluation of Advanced Ear Biometrics. In
Proc. CVPR, pages 41–41, 2005.
[177] P. Yan and K. W. Bowyer. ICP-Based Approaches for 3D Ear Recognition. In
Proc. SPIE-Volume 5779: Biometric Technology for Human Identification II, Anil
K. Jain, Nalini K. Ratha, Editors, pages 282–291, 2005.
[178] P. Yan and K. W. Bowyer. Multi-Biometric 2D and 3D Ear Recognition. T. Kanade,
A. Jain and N. K. Ratha (Eds.) AVBPA 2005, LNCS 3546, Springer, Heidelberg,
pages 503–512, 2005.
[179] P. Yan and K. W. Bowyer. Biometric Recognition Using 3D Ear Shape. IEEE
Trans. PAMI, 29(8):1297–1308, 2007.
[180] Ping Yan and Kevin W. Bowyer. Ear Biometrics Using 2D and 3D Images. In Proc.
CVPR, pages 121–121, 2005.
[181] Ping Yan and Kevin W. Bowyer. An Automatic 3D Ear Recognition System. In
Proc. the Third Int'l Symposium on 3DPVT, pages 326–333, 2006.
[182] Ming-Hsuan Yang, D. Kriegman, and N. Ahuja. Detecting Faces in Images: A
Survey. IEEE Trans. PAMI, 24(1):34–58, 2002.
[183] M.H. Yap, H. Ugail, R. Zwiggelaar, B. Rajoub, V. Doherty, S. Appleyard, and
G. Hurdy. A Short Review of Methods for Face Detection and Multifractal Analysis.
In Proc. Int’l Conference on CyberWorlds, pages 231–236, 2009.
[184] L. Yuan, Z. Mu, and Y. Liu. Multimodal Recognition Using Face Profile and Ear.
In Proc. the 1st Int’l Symposium on SCAA, pages 887–891, 2006.
[185] Li Yuan and Zhi-Chun Mu. Ear Detection Based on Skin-Color and Contour Infor-
mation. In Proc. Int’l Conf. on Machine Learning and Cybernetics, 4:2213–2217,
2007.
[186] Li Yuan and Feng Zhang. Ear Detection Based on Improved AdaBoost Algorithm.
Proc. International Conference on Machine Learning and Cybernetics, 4:2414–2417,
2009.
[187] T. Yuizono, Y. Wang, K. Satoh, and S. Nakayama. Study on Individual Recognition
for Ear Images by Using Genetic Local Search. In Proc. Congress on Evolutionary
Computation, pages 237–242, 2002.
[188] Hai-Jun Zhang, Zhi-Chun Mu, Wei Qu, Lei-Ming Liu, and Cheng-Yang Zhang. A
Novel Approach for Ear Recognition Based on ICA and RBF Network. In Proc.
Int’l Conf. on Machine Learning and Cybernetics, 2005, pages 4511–4515, 2005.
[189] Xiaozheng Zhang and Yongsheng Gao. Face recognition across pose: A review.
Pattern Recognition, 42(11):2876–2896, 2009.
[190] W. Zhao, R. Chellappa, A. Rosenfeld, and P.J. Phillips. Face Recognition: A
Literature Survey. ACM Computing Surveys, 35(4):399–458, 2003.
[191] W. Zhao, R. Chellappa, A. Rosenfeld, and P.J. Phillips. Face Recognition: A
Literature Survey. ACM Computing Surveys, 35(4):399–458, 2003.
[192] W. Zhao, R. Chellappa, A. Rosenfeld, and P.J. Phillips. Face Recognition: A
Literature Survey. ACM Computing Surveys, 35(4):399–458, 2003.
[193] S. K. Zhou, R. Chellappa, and W. Zhao. Unconstrained Face Recognition (Interna-
tional Series on Biometrics). Springer, 2006.
[194] Xiaoli Zhou and B. Bhanu. Integrating Face and Gait for Human Recognition. In
Proc. CVPR Workshop, 2006, pages 55–55, 2006.