Automatic Alignment and Reconstruction of Facial Depth Images


Giancarlo Taveira and Leandro A. F. Fernandes
Instituto de Computação, Universidade Federal Fluminense (UFF), CEP 24210-240 Niterói, RJ, Brazil

Article history: Available online 12 December 2013

Keywords: depth image; alignment; interpolation; resampling; face image

Abstract

Face, gender, ethnic and age group classification systems often work through an alignment, feature extraction, and identification pipeline. The quality of the alignment process is thus central to the performance of the identification process. Furthermore, missing portions of depth information can greatly affect results. Appropriate image reconstruction is therefore crucial for the correct operation of those systems. This paper presents a simple and effective approach for the automatic alignment and reconstruction of damaged facial depth images. By using only four facial landmarks and the raw depth data, our approach converts a given damaged depth image into a smooth depth function, performs the 3D alignment of the underlying face with the face of an average person, and produces an aligned depth image having arbitrary resolution. Our experiments show that the proposed approach outperforms commonly used methods. For instance, we show that it improves the quality of a state-of-the-art gender classification technique.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

The ability to retrieve information from facial depth images has many practical applications, including face recognition, age group estimation, and gender and ethnic group classification. Unfortunately, depth data is often damaged due to limitations intrinsic to off-the-shelf depth-image capturing systems. Examples include, but are not limited to, depth shadowing and the influence of reflective, refractive, and infrared-absorbing materials in the scene (Zhu et al., 2008). Also, the number of pixels covering the imaged face and the face's orientation often vary from image to image, making it difficult or even impossible to use captured images without proper alignment and reconstruction of the depth data (Szeliski, 2010).

Virtually every computer vision researcher who needs to perform alignment and reconstruction of facial depth data presents his or her own solution to the problem. A well-known technique is to identify some facial features by curvature and compute the alignment based on them (Moreno et al., 2005). Solutions based on principal component analysis (PCA) have also been proposed (Stormer and Rigoll, 2008). However, most of these attempts do not make proper use of depth information while performing the alignment, restricting the solution to the 2D image plane. Also, linear interpolation is commonly used to fill the holes (Wu et al., 2010), leading to unnatural flat artifacts on the facial surface.

It is remarkable that the solutions commonly applied in the literature contradict the common wisdom that appropriate

This paper has been recommended for acceptance by Dmitry Goldgof.
Corresponding author. Tel.: +55 21 2629 5665; fax: +55 21 2629 5669. E-mail addresses: [email protected] (G. Taveira), [email protected] (L.A.F. Fernandes).

alignment and resampling techniques must be employed in order to produce corrected depth images from the original ones. For instance, it is important to make use of the depth information intrinsic to this kind of data in order to align the structures of interest (i.e., the faces) in the 3D space, not just on the image plane. Furthermore, considering the nature of the surface of interest, it is clear that one must apply smooth non-linear interpolation techniques capable of reconstructing the damaged portions of the original image and of producing depth values with sub-pixel precision. With such care, the expectation is that the performance of depth-based classification techniques may be improved.

This paper presents a simple and effective method for aligning and reconstructing facial depth images from damaged depth data in a completely automatic way (Section 3). The approach uses information extracted from valid pixels to adjust a smooth thin-plate spline (TPS) interpolating function that naturally reconstructs the depth information of missing pixels (see Fig. 1) and computes smooth transitions among existing ones. The approach also explores facial landmarks in order to determine the actual position and orientation of the imaged face in the 3D space. The relation between the set of landmarks in the actual face and a set of canonical landmarks is used to map the shape of the imaged face to a standard space, where the resulting aligned image is generated by ray casting the reconstructed surface. The ray casting scheme is derived from the relief mapping (RM) technique proposed by Policarpo et al. (2005) for real-time rendering of surface details mapped on coarse triangular meshes. Our approach easily fits into popular processing pipelines, and can be extended to produce correct color and normal map images to be used with the resulting depth images.


Fig. 1. For the same subject: (a) the original color image, (b) the original damaged depth image, and (c) the image with reconstructed depth information produced by our technique. Six aligned and reconstructed depth images of different subjects are presented in (d). Images (b), (c) and (d) are presented in false color, where dark red pixels denote the surface closest to the camera. Notice in (c) the smooth transition of depth values in the originally corrupted portions (navy blue pixels in (b)). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Our experiments (Section 4) show that the approximation errors produced by our method are smaller than those obtained using linear interpolation for reconstruction with iterative closest point (ICP) for 3D alignment of the depth data, and with 2D alignment of the depth images (see Fig. 2). We also present a comparative study among four distinct interpolation methods (nearest-neighbor, linear, natural-neighbor and thin-plate spline) using the proposed alignment method in the 3D domain. Each interpolation method was applied as part of the state-of-the-art gender classification processes proposed by Wu et al. (2010, 2011). Since the classification techniques receive surface normals computed from reconstructed facial depth images as input, the performance of these classification models as a function of the input images indicates the quality and the influence of each interpolation method on the result.

2. Related work

This section discusses the use of the TPS for the interpolation of facial color images, the use of interpolation schemes to resample facial depth data, and schemes for aligning human body surfaces.

Rosen (1996) developed the Java applet entitled AlexWarp. Since its creation, the applet has gained popularity among internet users worldwide for its simple and fast method of facial image warping. When the user provides one pair of landmark points, the applet determines the region to warp, warps it, and then outputs the warped picture. One major drawback of the AlexWarp applet is that transformations can only be applied one at a time. The AlexWarp applet works on color images in the 2D domain with a limited number of control points.

Whitbeck and Guo (2006) implemented an applet as an improvement over the AlexWarp program. In their implementation, the TPS was used to allow more control points to be added, instead of just one pair as in AlexWarp. The TPS was applied in the 2D domain in order to interpolate warped color images. In our work, we use TPSs on depth images. We use all points from the original depth image and do not intend to apply warping to the data.

Guo et al. (2004) created an average morphable shape represented by a TPS to be used in face recognition applications. The average face was created using a database of 60 individuals (33 males and 27 females) containing only records of Asian people, which resulted in a model restricted to a particular ethnic group. The facial landmarks, a total of 7, were manually set. In order to reconstruct the face of a subject, they projected the color 2D image over the average 3D model. A reduced number of control points and the usage of a single TPS are some of the limitations of their work. The authors reported that the results, although not great, showed an interesting potential. In our work, we use several TPS functions to build a different 3D model for each subject. Also, we developed an adaptive block scheme in order to allow the use of all depth values of the image pixels while performing the reconstruction of damaged depth information.

Moreno et al. (2005) developed a 3D face modeling system and used two face recognition methods to test their model, one based on PCA and another one based on support vector machines (SVM). Their system aims to work on face images with varying poses, in situations where there is no control over the depth data acquisition. They reported that median and Gaussian filters were applied in the pre-processing stage in order to remove noise and to smooth the curvature of the resulting surfaces.


Fig. 2. Histograms showing the distribution of (a) minimum, (b) mean and (c) maximum squared error values computed from the depth values of a reference image and images produced using the proposed alignment and reconstruction approach (blue), a common 3D ICP-based alignment method with linear reconstruction (green), and a common 2D alignment technique with linear reconstruction (red). Notice that the error values of the proposed approach are smaller than those of the common approaches. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Instead, we propose the use of a TPS-based interpolation scheme combined with an RM-based ray casting technique to reconstruct missing portions of data as smooth surface patches.

Stormer and Rigoll (2008) proposed a procedure consisting of facial feature hypotheses extraction by invariant curvature features, PCA-based classification, and iterative closest point alignment to create aligned and normalized patches of faces in range images. They find the facial features by pre-processing the spatial discrete data, and apply linear interpolation and a low-pass filter to get a closed and smooth surface. Their final results are patches that contain a minor portion of the eyes and nose to serve as input to a classification system. Their alignment and normalization procedure is not very accurate, since they use an iterative approach that approximates the facial features by simple squared-distance nearest neighbors. The use of linear interpolation is also adopted by other face classification systems, including state-of-the-art gender classification techniques (Wu et al., 2010, 2011). It is important to notice that the interpolating surface of the linear method is only C^0 continuous. We propose the use of a TPS-based interpolation scheme that produces C^1 continuous surfaces. Also, in contrast with Stormer and Rigoll's approach, we perform non-iterative 3D alignment using only four facial landmarks.

Segundo et al. (2012) presented a method for pre-aligning surfaces and for better finding a correspondence of keypoints between two objects. They also developed a system that automatically detects incorrect results, removing the need for manual human inspection. The authors used speeded up robust features (SURF) to find the correspondence between a single object's data obtained from different views. They also used SURF to find the transformation matrix to project the geometric data into the color information.

Azouz et al. (2004) proposed a 3D human model that is based on signed distance for shape analysis. They used PCA to obtain generic human characteristics. Their representation model lacks a good pre-processing stage, as they only apply a Taubin filter to remove noise in the data, and no alignment procedure is performed. PCA and ICP were used in the work of Yan and Bowyer (2005). The authors compared different approaches to ear recognition, both 2D and 3D. Ear recognition is particularly relevant in the field of biometrics. They report that an ICP-based approach outperformed every other result. Kakadiaris et al. (2007) presented a unified software and hardware solution for 3D face recognition. They used a variation of ICP for aligning the images, but proposed the use of deformable model fitting (DMF) for subsampling depth data. The original ICP method is described by Besl and McKay (1992). In Section 4 we compare our face alignment approach against an ICP-based scheme.

Xu et al. (2009) presented a promising way to build a robust recognition system integrating depth and intensity information. Although their face recognition classifier is robust, their preprocessing stage is very poor. Therefore, their results could be improved by applying a more sophisticated alignment and reconstruction method. Our experiments show that our face alignment and reconstruction scheme may be used to improve gender classification processes.

Tekumalla and Cohen (2004) used a method based on the moving least squares (MLS) projection to fill holes in triangular meshes. Wang and Oliveira (2003) also used MLS as a hole-filling approach. This approach, although efficient to implement, can only handle holes of simple geometry that resemble a plane, since it relies on parameterizing the vicinity of the hole by orthographic projection onto a plane. Our hole-filling procedure, on the other hand, is independent of the structure of a mesh. It uses the point cloud as input and it fits a smooth surface to the input data.

3. The proposed alignment and reconstruction approach

The computation of the aligned facial depth image consists of four main steps. First, we take the input raw depth image and crop the rectangular region that contains the face of the imaged person (Section 3.1). Then, we adaptively subdivide the cropped region into smaller regions, and adjust a TPS to the depth data in each of them (Section 3.2). The TPSs not only guarantee smooth interpolation of depth data while producing the final image, but also provide the reconstruction of damaged and missing portions of depth information. In the third step, we use affine transformations to perform the 3D alignment of the landmarks of the given face with the landmarks of the face of an average person (Section 3.3). Lastly, by applying the same transformations to the TPSs, we map the input face to a standard space in which we perform ray casting in order to produce an aligned facial image having arbitrary resolution (Section 3.4). Section 3.5 describes how the parameters of the average person can be obtained from the images in the dataset. Section 3.6 discusses how to transform the raw color images and how to produce normal maps that are consistent with the resulting depth images.

3.1. Image cropping

We define the axis-aligned cropping rectangle that contains the face by using the image coordinates (u_no, v_no)^T of the nose (no) as pivot point, and using the horizontal distance d_le,re between the outer corners of the left (le) and right (re) eyes and the vertical distance d_no,ch between the imaged nose and chin (ch) as parameters for computing, respectively, the width and height of the resulting sub-image (Fig. 3). The lower and upper corners of the cropping rectangle are expressed as:

$$C_{\mathit{lower}} = \begin{pmatrix} \max(u_{no} - u_\Delta,\, 1) \\ \max(v_{no} - v_\Delta,\, 1) \end{pmatrix} \quad \text{and} \quad C_{\mathit{upper}} = \begin{pmatrix} \min(u_{no} + u_\Delta,\, w_{ac}) \\ \min(v_{no} + v_\Delta,\, h_{ac}) \end{pmatrix}, \tag{1}$$

where w_ac and h_ac are, respectively, the width and height of the input image, having pixel coordinates u ∈ [1, w_ac] and v ∈ [1, h_ac]. u_Δ and v_Δ are computed, respectively, as:

$$u_\Delta = \min\left(\left\lceil 1.5\, d_{le,re} / 2 \right\rceil,\; w_{ac} - u_{no}\right) \quad \text{and} \tag{2}$$

$$v_\Delta = \min\left(\left\lceil 1.5\, d_{no,ch} \right\rceil,\; h_{ac} - v_{no}\right), \tag{3}$$

where ⌈·⌉ denotes the ceiling function. The proportion value 1.5 was empirically chosen for our experiments.

It is important to emphasize that defining a sub-image using a cropping rectangle is an optional step of our approach. By limiting the data to the sub-image that contains the region of interest (i.e., the subject's face), we alleviate the computational cost of subsequent steps. Furthermore, the cropped image does not have to be perfectly symmetrical with respect to the imaged face, because the actual alignment will be performed by the final steps of our algorithm (see Sections 3.3 and 3.4). Also, the produced cropping rectangles may have different resolutions, since their dimensions are proportional to distances in a given input image. The only requirement for a sub-image is that it contains the face of the subject. Therefore, the empirical scaling factor of 1.5 used in (2) and (3) may be changed in order to fit the images of different databases. However, we emphasize that the scale factor of 1.5 met our requirements well, even for faces covering different portions of the images.

The location of the eyes, nose and chin is usually provided by the image database (e.g., the UND Biometric Database, Collection D (Chang et al., 2003)). These locations can also be retrieved by automatic techniques (Romero-Huertas and Pears, 2008; Perakis et al., 2009) or manually identified.
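For illustration, the cropping rule in (1)-(3) can be written in a few lines. The sketch below is ours (the paper's C++/MATLAB implementation is not reproduced here); it assumes 1-based pixel coordinates, as in the equations above.

```python
import math

def crop_rect(u_no, v_no, d_le_re, d_no_ch, w_ac, h_ac, scale=1.5):
    """Cropping rectangle of Eqs. (1)-(3), with 1-based pixel coordinates."""
    u_delta = min(math.ceil(scale * d_le_re / 2.0), w_ac - u_no)   # Eq. (2)
    v_delta = min(math.ceil(scale * d_no_ch), h_ac - v_no)         # Eq. (3)
    lower = (max(u_no - u_delta, 1), max(v_no - v_delta, 1))       # Eq. (1)
    upper = (min(u_no + u_delta, w_ac), min(v_no + v_delta, h_ac))
    return lower, upper
```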

Fig. 3. Image cropping. (left) A grayscale visualization of an original depth image (640 × 480) before any cropping is done. The distances d_le,re and d_no,ch are proportional to the cropping rectangle. (right) A scaled example of a cropped image (195 × 293). See (1) for details. Lighter shades of gray correspond to points closer to the camera.

3.2. Depth data interpolation and reconstruction

Depth data interpolation and reconstruction are performed using TPSs adjusted to the raw depth data. The TPS is the 2D analog of the cubic spline in one dimension (Bookstein, 1989). It encodes a scalar height function that can be evaluated at a given (u, v)^T coordinate in order to retrieve the scalar value z that best describes the height surface passing through N non-overlapping control points having coordinates (u, v, z)^T. In this paper, u and v are coordinates of valid pixels (i.e., pixels storing valid depth information), and z is the associated depth value. It is important to notice that a TPS is adjusted to an unstructured set of control points, and also that it can be evaluated at any real-valued (u, v)^T position, returning a smoothly interpolated z value. Thus, sub-pixel sampling and damaged facial depth reconstruction are naturally handled by the TPS-based interpolation scheme adopted in our work.

A TPS is described by 2(N + 3) parameters, which include six global affine motion parameters and 2N coefficients for the correspondences of the control points. These parameters are computed by solving a linear system having a closed-form solution. Due to the large number of parameters, the computation of a single TPS for all valid pixels in a cropped image may be unfeasible. We avoid such an issue by dividing the cropped image into adaptive blocks having a small number of control points, and by fitting a different TPS to each one of the blocks. Such an approach has two advantages: (i) it allows our technique to handle images having arbitrary size; and (ii) the procedure is less prone to numerical instability.
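As an aside, the per-block fitting step can be prototyped with off-the-shelf tools. The sketch below is our illustration (not the authors' code) and uses SciPy's RBFInterpolator, whose thin-plate-spline kernel with a degree-1 polynomial tail matches the classic TPS formulation of radial coefficients plus an affine part.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def fit_tps(depth, valid_mask):
    """Fit a TPS to the valid pixels of one block; returns a callable
    that smoothly interpolates depth at arbitrary real-valued (u, v)."""
    v, u = np.nonzero(valid_mask)                  # control point pixels
    pts = np.column_stack([u, v]).astype(float)    # (N, 2) array of (u, v)
    z = depth[v, u].astype(float)                  # associated depth values
    return RBFInterpolator(pts, z, kernel='thin_plate_spline', degree=1)

# usage: tps = fit_tps(block, block > 0); z = tps([[10.25, 7.5]])
```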

Fig. 4. An example of how the blocks in the last row and in the last column may be smaller than the rest of the blocks in the image. In the proposed algorithm, each block may independently grow in size until enough control points are within its boundaries (see the blocks in the lower right corner of the image).

The adaptive blocks are initially distributed uniformly over the cropped image as a regular grid comprised of square entries having fixed size (most blocks in Fig. 4). However, since depth data may be damaged, some of the blocks may not contain enough control points to define a TPS. In such a case, we incrementally change the size of an ill-defined block by including a ring of surrounding pixels in it. An ill-defined block grows until it has enough valid pixels to solve the linear system of equations that computes the coefficients of the TPS (see the blocks in the lower right corner of Fig. 4). The TPS assigned to a block fits the points covered by the original block size and the points inside the overlapping region. However, after the TPS has been fitted, the evaluation of the smooth surface related to a block is performed only inside the original coverage of that block.
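The growth rule for ill-defined blocks can be sketched as follows. This is our own reading of the procedure, under assumed conventions (a boolean validity mask, 0-based array indices, and a caller-supplied minimum number of control points for the TPS linear system):

```python
def grow_block(valid_mask, r0, c0, size, min_points):
    """Grow a block by rings of surrounding pixels until it holds
    at least min_points valid pixels (or covers the whole image)."""
    H, W = valid_mask.shape
    top, left = r0, c0
    bottom, right = min(r0 + size, H), min(c0 + size, W)
    while valid_mask[top:bottom, left:right].sum() < min_points:
        if top == 0 and left == 0 and bottom == H and right == W:
            break                                  # cannot grow any further
        top, left = max(top - 1, 0), max(left - 1, 0)
        bottom, right = min(bottom + 1, H), min(right + 1, W)
    return top, left, bottom, right                # expanded block bounds
```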

3.3. Three-dimensional face alignment

The alignment stage computes the affine transformation that maps a face defined in the actual space (i.e., the 3D space where the imaged face resides) to a standard position and orientation defined in the standard space (i.e., the 3D space where all faces will be aligned). The transformation for a given imaged face is computed from the location of four landmarks in the actual coordinate frame (namely, left eye (le), right eye (re), nose (no) and chin (ch)) and the equivalent locations in the standard coordinate frame. In the following equations, each location is represented by a point P_{S,F}, where S is ac or st for, respectively, the actual or the standard frame, and F is one of the labels in {le, re, no, ch}.

The coordinates of P_{ac,F} are computed from the input depth image as:

$$P_{ac,F} = (x_{ac,F},\, y_{ac,F},\, z_{ac,F})^T = z_{ac,F} \left( K^{-1} Q_{ac,F} \right), \tag{4}$$

where Q_{ac,F} = (u_F, v_F, 1)^T is the location (in pixels) of the given label point in image space, z_{ac,F} is the depth retrieved from the (u_F, v_F)^T pixel, and K^{-1} is the inverse of the matrix that models the intrinsic camera parameters:

$$K = \begin{pmatrix} f m_u & c & o_u \\ 0 & f m_v & o_v \\ 0 & 0 & 1 \end{pmatrix}. \tag{5}$$

In (5), f is the focal length, m_u and m_v are the scale factors relating pixels to distance, and c, o_u and o_v represent the skew and the coordinates of the principal point, respectively. The intrinsic parameters are usually provided by the depth camera, but they can also be retrieved from calibration procedures (Hartley and Zisserman, 2000).

The formulas for computing the 3 × 3 matrix M and the 3 × 1 offset vector O modeling the intended affine transformation are given by:

$$M = Q P^{-1} \quad \text{and} \quad O = P_{st,le} - M P_{ac,le}, \tag{6}$$

where P and Q are 3 × 3 matrices computed as:

$$P = \begin{pmatrix} P_{ac,re} - P_{ac,le} & P_{ac,no} - P_{ac,le} & P_{ac,ch} - P_{ac,le} \end{pmatrix}, \tag{7}$$

$$Q = \begin{pmatrix} P_{st,re} - P_{st,le} & P_{st,no} - P_{st,le} & P_{st,ch} - P_{st,le} \end{pmatrix}. \tag{8}$$

The procedure for computing P_{st,F} is presented in Section 3.5. Once M and O are known, the affine mapping of a general point P_ac in the actual space to the standard space is given by:

$$P_{st} = M P_{ac} + O. \tag{9}$$
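Equations (4) and (6)-(9) translate directly into a short NumPy sketch. The code below is illustrative (function names are ours); it assumes the four landmark pixels, their depth values, and the intrinsic matrix K are available, with each landmark stored as a 3-vector:

```python
import numpy as np

def landmark_3d(K_inv, q_pixel, z):
    """Back-project a landmark pixel to 3D; Eq. (4): P = z * (K^-1 Q)."""
    Q = np.array([q_pixel[0], q_pixel[1], 1.0])    # homogeneous pixel point
    return z * (K_inv @ Q)

def alignment_transform(P_ac, P_st):
    """Affine map (M, O) of Eqs. (6)-(8); P_ac and P_st are dictionaries
    keyed by 'le', 're', 'no', 'ch' holding 3D landmark positions."""
    basis = lambda P: np.column_stack([P['re'] - P['le'],
                                       P['no'] - P['le'],
                                       P['ch'] - P['le']])
    M = basis(P_st) @ np.linalg.inv(basis(P_ac))   # Eq. (6) via Eqs. (7)-(8)
    O = P_st['le'] - M @ P_ac['le']
    return M, O

# Eq. (9): a general point maps to the standard space as p_st = M @ p_ac + O
```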

3.4. Producing the final depth image

The final depth image of a given face is computed by casting rays (one ray per resulting image pixel) from a pinhole camera defined in the standard space to the surface of the subject's face mapped from the actual coordinate frame to the standard coordinate frame (Section 3.3). Due to space restrictions, this paper does not present the proposed ray casting procedure in detail. However, it can be derived from a well-known rendering technique: relief mapping (RM) (Policarpo et al., 2005).

The central idea of our ray casting procedure is to use RM to quickly find the first intersection of each cast ray with the surface encoded by the set of TPSs (Section 3.2). Once the first intersection is found for a given ray, the z coordinate of the intersection point in the camera's coordinate system is stored in its respective image pixel. As in RM, we start the process with a linear search. Beginning at the center of projection O, we step along the ray passing through the current pixel mapped to the image plane in the 3D space at increments of δ, looking for the first point inside the surface. Once the first point under the TPS surface has been identified, the binary search starts, using the last point outside the surface and the current one. The role of the linear search is to quickly approximate the first intersection between the cast ray and the TPS surface. The role of the binary search, on the other hand, is to find the exact location of such an intersection.

Recall from Section 3.2 that the original surface of the face was encoded into a set of TPSs that define a height field in the actual coordinate frame. Rendering such an analytical 3D representation from an arbitrary point of view may be tricky, since the depth information cannot be transformed from the actual coordinate frame to the standard space with the guarantee that it will remain an unambiguous height map after such mapping. The use of RM as the ray casting approach is an elegant solution to this problem, because it turns the harder problem of finding the intersection of the ray with the analytical surface in 3D into the simpler problem of walking in the 2D domain of the height field function while looking for the intersection in its codomain. To do that, one has to: (i) map the whole setup of the ray casting procedure (i.e., the camera's center of projection O_st and the pixel points Q_st in the image plane) from the standard space to the actual space by inverting (9):

$$P_{ac} = M^{-1} \left( P_{st} - O \right);$$

(ii) find the first intersection point using our RM-based ray casting procedure; (iii) map the intersection point back to the standard space using (9); and (iv) compute the final depth value as the signed distance between O_st and the intersection point.
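The linear-plus-binary search can be summarized by the sketch below, which is ours rather than the paper's implementation. It assumes that height is a callable h(u, v) over the TPS height field of Section 3.2, that origin + t * direction for t in [0, 1] spans the search segment already mapped into the actual frame, and that a point lies inside the surface once its z coordinate reaches h(u, v):

```python
import numpy as np

def intersect_height_field(origin, direction, height, steps=256, refine=20):
    """First ray/height-field hit via linear search, then binary refinement."""
    t_out, t_in = 0.0, None
    for i in range(1, steps + 1):          # linear search: find the first
        t = i / steps                      # sample point inside the surface
        p = origin + t * direction
        if p[2] >= height(p[0], p[1]):
            t_in = t
            break
        t_out = t
    if t_in is None:
        return None                        # the ray misses the surface
    for _ in range(refine):                # binary search between the last
        t_mid = 0.5 * (t_out + t_in)       # outside point and the hit
        p = origin + t_mid * direction
        if p[2] >= height(p[0], p[1]):
            t_in = t_mid
        else:
            t_out = t_mid
    return origin + t_in * direction       # approximate intersection point
```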

3.5. Computing the average person

The average person is defined in the standard space. It is used as the target face during the face-alignment stage of our procedure. In order to set up an average person, one needs to specify the locations P_{st,F} of the four face landmarks. In our experiments, we computed the coordinates of P_{st,F} from average values retrieved from the input dataset. However, one can place the landmarks in the way that is most convenient for a particular application.

We build a mean tetrahedron from the tetrahedra defined by the landmarks of each input face. The base of such tetrahedra is defined by the location of both eyes and the chin. The apex is set to be the nose. The vertices are computed as:

$$P_{st,le} = \frac{1}{2} \begin{pmatrix} w_{st} - l_{\mathit{eye}} \\ h_{st} - l_{\mathit{chin}} \\ 0 \end{pmatrix}, \quad P_{st,re} = \frac{1}{2} \begin{pmatrix} w_{st} + l_{\mathit{eye}} \\ h_{st} - l_{\mathit{chin}} \\ 0 \end{pmatrix}, \quad P_{st,ch} = \frac{1}{2} \begin{pmatrix} w_{st} \\ h_{st} + l_{\mathit{chin}} \\ 0 \end{pmatrix}, \quad P_{st,no} = \frac{1}{2} \begin{pmatrix} w_{st} \\ h_{st} - l_{\mathit{chin}} + 2\, l_{\mathit{nose}} \\ 2\, l_{\mathit{tip}} \end{pmatrix},$$

where w_st and h_st are, respectively, the width and height of the (standard) resulting image, l_eye is the mean distance between the eyes of the faces in the dataset, l_chin denotes the mean distance from the chin to the middle of the eyes, l_nose is the mean distance from the orthogonal projection of the nose onto the base plane to the middle of the eyes, and l_tip denotes the mean distance from the nose tip to the base plane. These distances were measured in the 3D coordinate frame where each input face resides.

3.6. Producing correct color and normal map images

The computation of correct color and normal map images to be used with the aligned depth images is straightforward. The color information related to the first intersection of the cast ray with the surface encoded by the TPS can be retrieved from another TPS encoding the color of the input image pixels. By doing so, one guarantees a smooth interpolation of color values as well as the reconstruction of missing portions of color information. The normal vectors for the normal map can be retrieved directly from the TPS encoding depth data by computing the normal of the surface at the intersection point.
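For example, the normal at the intersection point follows from the gradient of the height field z = h(u, v). Our sketch below uses central finite differences for clarity; analytic TPS derivatives could be used instead:

```python
import numpy as np

def surface_normal(height, u, v, eps=0.5):
    """Unit normal of the height field z = h(u, v), from the gradient
    of the implicit surface z - h(u, v) = 0 (finite differences)."""
    dz_du = (height(u + eps, v) - height(u - eps, v)) / (2.0 * eps)
    dz_dv = (height(u, v + eps) - height(u, v - eps)) / (2.0 * eps)
    n = np.array([-dz_du, -dz_dv, 1.0])
    return n / np.linalg.norm(n)
```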

4. Experiments and discussion

We have implemented our technique using C++ and MATLAB. The C++ code was compiled using Microsoft Visual Studio as dynamic link libraries (DLLs) so that they could be called from MATLAB. OpenMP was used to exploit parallel computing in the TPS computation. The system was tested on several real depth images. We have applied our method to the UND Biometric Database (Collection D) (Chang et al., 2003). This dataset comprises 953 images of 277 individuals, recorded using the Minolta Vivid series 3D scanner. The UND database has the advantage that it contains the 2D color images, the corresponding range images, and the location of the facial landmarks in image space.

In our experiments, we set the initial size of the adaptive blocks where TPSs are adjusted (Section 3.2) to 32 × 32. The number of blocks depends on the size of the cropped image (Section 3.1). The resolution of the final images was set to w_st = h_st = 118. However, it is important to notice that our approach can produce smooth images having any resolution. The location of the facial landmarks in the input images was retrieved from the database. We found the following mean values (expressed in millimeters) while defining the average person (Section 3.5): l_eye = 101.3238, l_nose = 40.0234, l_chin = 104.3238, and l_tip = 38.2279. Notice that those values are consistent, since they are proportional to the average subject's face. Hence, they could be used without change by any application of our technique, even by those that process a different dataset. The synthetic pinhole camera (Section 3.4) was placed 1750 mm apart from the base plane of the mean tetrahedron, with the optic axis coinciding with the displacement vector of the nose, and with its x and y axes aligned, respectively, to the x and y axes of the standard coordinate frame.

We compared our results to a widely used approach based on 2D alignment with linear interpolation. In this approach, we first filled the missing portions of depth information using a linear interpolation method provided by MATLAB. In turn, we applied the transformation matrix that aligns the triangle defined by the points Q_{ac,le}, Q_{ac,re} and Q_{ac,no} to the same standard coordinates presented in Section 3.5. Both operations can be performed by the imtransform function using the bilinear interpolation scheme (see the MATLAB documentation for details). The sum of the squared differences in the z values was calculated for both methods. We compared different recordings of faces belonging to the same subject, since their alignment should match better than with other subjects' faces. Every recording of a subject was compared against the other recordings of the same subject, and we took the minimum, mean and maximum squared error values produced. Our analysis covered 180 subjects. Histograms showing the minimum, mean and maximum values for each subject can be seen in Fig. 2. The figure also presents the errors produced by aligning the depth data in 3D using an ICP-based solution, followed by linear interpolation of the surface for reconstruction. We used the ICP implementation provided by the Point Cloud Library in our experiments. The guess transformation matrix was computed from the centroids of the source and target point clouds and from their eigenvectors. The same alignments were achieved by using the facial landmarks to compute the guess transformation matrix. It is important to comment that, in our approach, the larger errors are gathered mostly in the neck region (outside the face). With the common 2D, and sometimes with the common 3D approaches, the errors are scattered all over the image. By analyzing just the nose region (Fig. 5), the maximum error produced by our method on this database becomes two orders of magnitude smaller than the error produced by the common approaches.

Our alignment procedure depends on the identification of facial landmarks in color images. We have verified the robustness of the proposed approach against errors in the detection of the eyes' outer corners, nose tip and chin by adding noise to the image

location of fiducial marks provided by the database. In turn, we compared every noisy recording of a subject against the other noisy recordings of the same subject. Fig. 6 presents the histograms of minimum, mean and maximum squared error values produced for each of the 180 subjects regarding the original location of the landmarks, and after adding Gaussian noise with mean 0 and standard deviation (σ) ranging from 1 to 5. The histogram in Fig. 6a shows that the Gaussian perturbations did not affect the minimum error produced for each subject. Notice that the squared error values are virtually zero even for σ = 5. The comparison between Figs. 6b and 2b shows that the distribution of mean squared errors is more favorable for the proposed approach with imprecise landmark locations than for the ICP-based alignment with linear interpolation. The distribution of maximum errors produced for σ = 5 (Fig. 6c) is equivalent to the one produced for the ICP-based approach (Fig. 2c). Given that the mean size of cropped regions in the database is 280 × 200 pixels and σ = 5 leads to errors of up to ±15 pixels, we conclude that using detected facial landmarks for aligning faces is a feasible solution even in the presence of noise.

We also performed an experiment that compared four distinct interpolation methods (i.e., nearest-neighbor, linear, natural-neighbor and TPS) applied as part of the gender classification models presented by Wu et al. (2010, 2011). We chose to use two of the three gender classification models provided by Wu et al.: Principal Geodesic Analysis (PGA) and Supervised Weighted PGA (SWPGA). Further details on PGA can be found in Wu et al. (2010). The SWPGA is described in Wu et al. (2011).

The SWPGA method relies on the proper setting of the parameter d, which indicates the number of features (i.e., relevant dimensions of the facial feature space) to be used during the iterative construction of the weight map, and on the number of iterations. In our experiments we followed Wu et al. (2011), choosing d = 5 and setting the number of iterations to 6000.

The performance of the interpolation methods used to provide input data for the gender classification techniques was measured by comparing the confusion matrix (Kohavi and Provost, 1998) and the Matthews Correlation Coefficient (MCC) (Matthews, 1975) of each technique under the k-fold cross-validation framework (Kohavi, 1995). The image set used for both training and testing was comprised of 180 images (90 females and 90 males) from different individuals. Although each subject may have several different images in the dataset, only one image per subject was used during the tests. In this way we avoid bias caused by duplicated individuals.

The cross-validation was run with k = 5 folds. Therefore, each fold contained 36 subjects (18 females and 18 males). At each round, one fold was reserved for testing and the others were used for training. After all rounds have finished, the measured statistics are averaged over the number of rounds.

The gender classification can be interpreted as a binary classification. For that matter, it is necessary to fit the two genders within two classes: Female and Male. The following nomenclature is used to refer to the cells of the resulting confusion matrix: true females (TF) are the females identified as such, true males (TM) are the males identified as such, false females (FF) are the males incorrectly classified as being females, and false males (FM) are the females incorrectly classified as being males. The confusion matrix provides information about the number of correct classifications in comparison to the predicted classifications for each class. Using the aforementioned naming convention, we calculated four distinct measurements: accuracy, true females rate (TFR), true males rate (TMR) and the MCC.

The accuracy is the proportion of true results (both TF and TM) in the population. The accuracy can be calculated as follows:

$$\mathit{accuracy} = \frac{TF + TM}{TF + FF + FM + TM}. \tag{10}$$

TFR measures the proportion of actual females which are correctly identified as such. Similarly, TMR measures the proportion of males which are correctly identified. A perfect predictor would be described as 100% TFR and 100% TMR. The TFR and TMR are described as:

$$\mathit{TFR} = \frac{TF}{TF + FM} \quad \text{and} \quad \mathit{TMR} = \frac{TM}{TM + FF}. \tag{11}$$

By looking at (11), TFR can be interpreted as a bias towards female classification, or the capacity of correctly identifying the female gender. Similarly, TMR can be interpreted as the bias towards male classification, or the capacity of correctly identifying the male gender.

The MCC (Matthews, 1975) is in essence a correlation coefficient between the observed and predicted binary classifications. A coefficient of +1 represents a perfect prediction, 0 means no better than random prediction, and −1 indicates total disagreement between prediction and observation.
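All four measurements follow directly from the confusion matrix cells. In the sketch below (ours), TF, TM, FF and FM play the roles of TP, TN, FP and FN in the standard MCC formula:

```python
import math

def gender_metrics(TF, TM, FF, FM):
    """Accuracy and rates of Eqs. (10)-(11), plus the MCC (Matthews, 1975)."""
    accuracy = (TF + TM) / (TF + FF + FM + TM)          # Eq. (10)
    TFR = TF / (TF + FM)                                # Eq. (11)
    TMR = TM / (TM + FF)
    denom = math.sqrt((TF + FF) * (TF + FM) * (TM + FF) * (TM + FM))
    mcc = (TF * TM - FF * FM) / denom if denom else 0.0
    return accuracy, TFR, TMR, mcc
```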


Fig. 5. Color visualization of the squared error on the nose region. (top) Using our method. (center) Using ICP for 3D alignment and linear interpolation. (bottom) Using 2D alignment and linear interpolation. Notice the difference in maximum error values. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 6. Histograms showing the distribution of (a) minimum, (b) mean and (c) maximum squared error values computed from the comparison of images aligned using the original location of facial landmarks provided by the database, and locations corrupted by Gaussian noise with mean 0 and standard deviation (σ) ranging from 1 to 5.

We calculated the average of each metric over the k folds, and the results are presented in Fig. 7 for PGA and SWPGA, respectively. As expected, the nearest-neighbor interpolation had the poorest accuracy (0.7 for PGA and 0.677 for SWPGA), while the linear and natural-neighbor interpolations are considered tied (0.711 for PGA and 0.683 for SWPGA, both). The TPS had the highest accuracy (0.717 for PGA and 0.706 for SWPGA), suggesting that smoother interpolation can increase gender classification performance. It is important to emphasize that we have used the full set of PGA features during the classification step of the PGA technique, but only the d = 5 leading PGA features during the training step of the SWPGA technique. In that case, one must be careful while reading the graphs in Fig. 7 in order to compare the performance of PGA and SWPGA; it would not be a fair comparison. As pointed out in Wu et al. (2011), SWPGA outperforms PGA when the same number of PGA features is used during the training and

the classification steps. The results in Fig. 7 should only be analyzed to evaluate the performance of the interpolation procedures.

By comparing the results of changing the interpolation technique used in both the PGA and SWPGA methods (Fig. 7), it is important to notice that the accuracy differences in the PGA model are smaller than those found in the SWPGA model. The TPS in the SWPGA showed an increased accuracy of up to 3%. We believe that, since the SWPGA iteratively creates a weight map to describe relevant discriminating regions, iterative methods have the tendency to amplify the errors introduced by the simpler interpolating functions; as a consequence, the TPS resulted in higher accuracy when applied to the gender discriminating model.

The purpose of the weight map computed by the SWPGA is to improve the gender discriminating capacity of the leading features extracted from a training set (Wu et al., 2011). The leading features are estimated from pixel-by-pixel coherence between subjects of

Fig. 7. The results of (a) PGA and (b) SWPGA. Notice how the TPS provides a higher accuracy value than the nearest-neighbor, linear and natural-neighbor interpolations.

the same gender and pixel-by-pixel incoherence between genders. We believe that the lack of continuity (artifacts) introduced by the nearest-neighbor and the linear interpolation schemes affects the computation of proper weights, because the artifacts may mask the small non-soft features expected in male faces. By comparing the TMR and TFR coefficients computed for the simpler interpolation schemes and for the proposed approach, it is possible to conclude that male classification benefits from the use of the TPS-based scheme. Notice in Fig. 7b that the TMR increased from 0.685 (nearest) to 0.720 (TPS), while the TFR increased from 0.681 (nearest) to 0.709 (TPS). Such an improvement may be explained by the TPS's ability to estimate the depth value of a point on the surface from all data points in the same block, and not just from the closest data points as in nearest-neighbor or linear interpolation, leading to a continuous and coherent surface.

Similarly to the aforementioned metrics, the computed MCCs show that the nearest-neighbor, linear, and natural-neighbor interpolations are equivalent to each other when applied with both non-iterative and iterative classification approaches. The TPS, on the other hand, may improve the result of iterative techniques. The MCC coefficients computed for the interpolation schemes used in combination with PGA were 0.402 (nearest), 0.424 (linear and natural), and 0.435 (TPS). In contrast, the coefficients computed for the simpler interpolation schemes with SWPGA were 0.361 (nearest) and 0.375 (linear and natural), while the MCC coefficient for TPS with SWPGA was 0.419. In practice, this means that for non-iterative techniques the simpler interpolation methods, especially the linear and natural-neighbor interpolations, can be applied with no major impact on the final classification result. However, it is recommended to use our TPS-based approach in techniques that may amplify interpolation errors (e.g., iterative methods).

5. Conclusions

We have presented a completely automatic approach for aligning and reconstructing damaged facial depth images. The approach uses the TPS to smoothly interpolate existing data, facial landmarks to ensure data alignment, and RM-based ray casting to render the final aligned depth image at arbitrary resolution. In order to reduce the high computational cost of the TPS, a block division approach was introduced where separate TPSs are adjusted to each block, considerably reducing the time needed to fit an interpolation. We demonstrated the effectiveness of the proposed technique by implementing it and using it to align faces of several real depth images available in a well-known biometric database.

The proposed alignment and interpolation methods were compared against two common approaches: one based on 2D alignment and linear interpolation that operates on intensity images, and another based on 3D alignment using ICP and linear interpolation for depth data sampling. The errors of the proposed method were up to two orders of magnitude smaller than those of the common approaches. This result suggests that the proposed techniques should be used to achieve better results in procedures that are currently based on naive alignment and reconstruction of depth data.

When using the proposed alignment method (in the 3D domain) while varying only the interpolation scheme, the differences in the final results favored the proposed technique. More specifically, the experiments suggest that the TPS gives better results when used within iterative methods, since the feedback mechanisms of this kind of procedure tend to amplify errors (e.g., interpolation errors). Also, the TPS has proven able to increase the gender classification accuracy of the SWPGA model by 3%. With that in mind, our conclusion is that while the research is in a prototyping phase the researcher may use a simple interpolation method (i.e., linear) in order to validate the implementation, and later use a more sophisticated interpolation method (i.e., our approach) to improve the final results.

We believe that these ideas may lead to better results in procedures that are currently based on naive alignment of depth data. We are currently exploring ways of analyzing the error propagation through the stages of our algorithm. A reference implementation of the approach described here will be made available to other research groups on the authors' home page.

Acknowledgments

This work was sponsored by FAPERJ (E-26/111.468/2011). Giancarlo was sponsored by a CAPES fellowship. We thank the Computer Vision Research Laboratory of the University of Notre Dame for the database used in this research. We thank Wu, Smith and Hancock for kindly providing the implementation of their gender classification technique, and the anonymous reviewers for their comments and insightful suggestions.

References

Azouz, Z.B., Rioux, M., Shu, C., Lepage, R., 2004. Analysis of human shape variation using volumetric techniques. In: Proc. of CASA, pp. 197–206.


Besl, P.J., McKay, N.D., 1992. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14, 239–256.
Bookstein, F.L., 1989. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell. 11, 567–585.
Chang, K., Bowyer, K., Flynn, P., 2003. Face recognition using 2D and 3D facial data. In: ACM Workshop on Multimodal User Authentication, pp. 25–32.
Guo, H., Jiang, J., Zhang, L., 2004. Building a 3D morphable face model by using thin plate splines for face reconstruction. In: Proc. of SINOBIOMETRICS, pp. 258–267.
Hartley, R.I., Zisserman, A., 2000. Multiple View Geometry in Computer Vision. Cambridge University Press.
Kakadiaris, I.A., Passalis, G., Toderici, G., Murtuza, M.N., Lu, Y., Karampatziakis, N., Theoharis, T., 2007. Three-dimensional face recognition in the presence of facial expressions: an annotated deformable model approach. IEEE Trans. Pattern Anal. Mach. Intell. 29, 640–649.
Kohavi, R., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. of IJCAI, pp. 1137–1143.
Kohavi, R., Provost, F., 1998. Glossary of terms. Mach. Learn. 30, 271–274.
Matthews, B., 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451.
Moreno, A.B., Sanchez, A., Velez, J.F., Diaz, F.J., 2005. Face recognition using 3D local geometrical features: PCA vs. SVM. In: Proc. of ISPA, pp. 185–190.
Perakis, P., Theoharis, T., Passalis, G., Kakadiaris, I.A., 2009. Automatic 3D facial region retrieval from multi-pose facial datasets. In: Proc. of Eurographics Workshop on 3D Object Retrieval, pp. 37–44.
Policarpo, F., Oliveira, M.M., Comba, J., 2005. Real-time relief mapping on arbitrary polygonal surfaces. In: Proc. of ACM SIGGRAPH I3D, pp. 155–162.

Romero-Huertas, M., Pears, N., 2008. 3D facial landmark localisation by matching simple descriptors. In: Proc. of IEEE Intern. Conf. on BTAS, pp. 1–6.
Rosen, A., 1996. AlexWarp applet.
Segundo, M.P., Gomes, L., Bellon, O.R.P., Silva, L., 2012. Automating 3D reconstruction pipeline by SURF-based alignment. In: Proc. of IEEE ICIP, pp. 1761–1764.
Stormer, A., Rigoll, G., 2008. A multi-step alignment scheme for face recognition in range images. In: Proc. of IEEE ICIP, pp. 2748–2751.
Szeliski, R., 2010. Computer Vision: Algorithms and Applications. Springer.
Tekumalla, L.S., Cohen, E., 2004. A hole-filling algorithm for triangular meshes.
Wang, J., Oliveira, M.M., 2003. A hole-filling strategy for reconstruction of smooth surfaces in range images. In: Proc. of SIBGRAPI. IEEE, pp. 11–18.
Whitbeck, M., Guo, H., 2006. Multiple landmark warping using thin-plate splines. In: Proc. of IPCV, pp. 256–263.
Wu, J., Smith, W., Hancock, E., 2011. Gender discriminating models from facial surface normals. Pattern Recognit. 44, 2871–2886.
Wu, J., Smith, W.A.P., Hancock, E.R., 2010. Facial gender classification using shape-from-shading. Image Vis. Comput. 28, 1039–1048.
Xu, C., Li, S., Tan, T., Quan, L., 2009. Automatic 3D face recognition from depth and intensity Gabor features. Pattern Recognit. 42, 1895–1905.
Yan, P., Bowyer, K.W., 2005. Ear biometrics using 2D and 3D images. In: Proc. of CVPR Workshops. IEEE, p. 121.
Zhu, J., Wang, L., Yang, R., Davis, J., 2008. Fusion of time-of-flight depth and stereo for high accuracy depth maps. In: Proc. of CVPR, pp. 1–8.