
Short Papers

Joint Depth and Color Camera Calibration with Distortion Correction

Daniel Herrera C., Juho Kannala, Member, IEEE, and Janne Heikkilä, Member, IEEE

Abstract—We present an algorithm that simultaneously calibrates two color cameras, a depth camera, and the relative pose between them. The method is designed to meet three key requirements: it is accurate, practical, and applicable to a wide range of sensors. The method requires only a planar surface to be imaged from various poses. The calibration does not use depth discontinuities in the depth image, which makes it flexible and robust to noise. We apply this calibration to a Kinect device and present a new depth distortion model for the depth sensor. We perform experiments that show an improved accuracy with respect to the manufacturer's calibration.

Index Terms—Camera calibration, depth camera, camera pair, distortion, Kinect.


1 INTRODUCTION

COLOR and depth information provide complementary cues about a scene. Many applications, such as scene reconstruction and image-based rendering, need to capture both simultaneously. This requires at least two sensors, as no single sensor is able to capture both. A basic device for scene reconstruction is a depth and color camera pair, which consists of a color camera rigidly attached to a depth sensor (e.g., a time-of-flight (ToF) camera, laser range scanner, or structured light scanner). The increasingly popular Kinect device is an example of such a camera pair.

In order to reconstruct a scene from the camera pair measurements, the system must be calibrated. This includes internal calibration of each camera as well as relative pose calibration between the cameras. Color camera calibration has been studied extensively [1], [2]. For depth sensors, different calibration methods have been developed depending on the technology used. ToF cameras simultaneously produce an intensity and a depth image from the same viewpoint, which simplifies calibration because color discontinuities can be accurately localized [3]. Most structured light systems can calibrate the projector and camera separately. However, if the internals of the device are not open, we might not have access to the original intensity images. The Kinect device, for example, uses an infrared camera to detect a projected dot pattern. However, it returns a processed image that is not aligned with the original infrared image.

There is a particular need to calibrate the Kinect device because it delivers depth information in Kinect disparity units (kdu), whose conversion to metric units changes for each device. Furthermore, independent calibration of the cameras may not yield the optimal system parameters, and a comprehensive calibration of the system as a whole could improve individual calibration as it uses all the available information.

Depth cameras have been observed to suffer from complicated geometric distortions due to the processing performed and the inevitable tolerances in their manufacturing. Whereas a radial and tangential distortion model is usually sufficient to correct the 2D pixel positions in color cameras, depth cameras require a more complicated model to correct the 3D measurement volume.

1.1 Previous Work

A standard approach is to calibrate the cameras independently and then calibrate only the relative pose between them [4], [5], [6]. This may not be the optimal solution as measurements from one camera can improve the calibration of the other camera. Moreover, the independent calibration of a depth camera may require a high precision 3D calibration object, which can be avoided using joint calibration.

Fuchs and Hirzinger [3] propose a multispline model for time-of-flight cameras. Their model has a very high number of parameters and it requires a robotic arm to know the exact pose of the camera. Lindner and Kolb [7] use a high-resolution color camera to determine the pose of the camera, removing the need for a robotic arm. Lichti [8] proposes a calibration method for an individual laser range scanner using only a planar calibration object. It performs a comprehensive calibration of all parameters. However, it relies on the depth camera delivering radiometric intensity and range for each pixel. This is not directly applicable to a camera pair because the color and depth information are not taken from the same reference frame. Zhu et al. [9] describe a method for fusing depth from stereo and ToF cameras. Their calibration uses the triangulation from the stereo cameras as ground truth. This may not be optimal as it ignores the possible errors in stereo triangulation and measurement uncertainties.

ToF cameras are known to present distortions both in the optical ray direction and in the measured depth. Kim et al. [10] show a principled approach to correcting these distortions. Cui et al. [11] show that the depth distortion of ToF cameras is radially symmetric and scene dependent. Thus, they estimate new distortion correction parameters for each image. The Kinect device has also shown radially symmetric distortions [12]. However, being a structured light sensor, the nature of the distortions is different.

Kinect devices are calibrated during manufacturing with a proprietary algorithm. The calibrated parameters are stored in the device's internal memory and are used by the official drivers to perform the reconstruction. This is adequate for casual use, but we have observed that the manufacturer's calibration does not correct the depth distortion. Other calibration algorithms have been developed by the Kinect community. The first algorithms (e.g., [13]) calibrated only the intrinsics (focal length and principal point) of the infrared camera, but did not calibrate the parameters to convert Kinect disparity units to meters. In our previous work [14], we make a comprehensive calibration of all parameters of the camera pair. However, depth distortion was not corrected. Using a similar formulation, Zhang and Zhang [15] augment our previous work with correspondences between the color and depth images, but still do not address distortion of the depth values. Smíšek et al. [12] include a depth distortion correction component as the average of the residuals in metric coordinates. We propose a disparity distortion correction that depends on the observed disparity, which further improves accuracy.

1.2 Motivation

As a motivation for our work, we propose three requirements that an optimal calibration algorithm must fulfill. To the best of our knowledge, no available calibration algorithm for a depth and color camera pair fulfills all three criteria.

Accurate. The method should provide the best combination of intrinsic and extrinsic parameters that minimizes the reprojection error for both cameras over all calibration images. This may seem like an obvious principle, but we stress it because partial calibrations, where each camera is calibrated independently and the relative pose is estimated separately, may not achieve the lowest reprojection error.

Practical. It should be practical to use with readily available materials. A high precision 3D calibration object is not easy or cheap to obtain, and a robotic arm or a high precision mechanical setup to record the exact pose of the camera pair is usually not practical, whereas a planar surface is usually readily available.

Widely applicable. To be applicable to a wide range of depth sensors, one cannot assume that color discontinuities are visible in the depth image. Moreover, some depth sensors, like the one used for our experiments, may not provide accurate measurements at sharp depth discontinuities. Thus, neither color nor depth discontinuities are suitable features for depth camera calibration. The method should use features based on depth measurements that are most reliable for a wide range of cameras (e.g., planes).

Finally, the increasing popularity of the Kinect device provides an additional motivation for our research. For example, the work from Shotton et al. [16] based on the Kinect was selected as the best paper at CVPR 2011. We believe that many applications would benefit from improved accuracy. We have previously released a Kinect calibration toolbox [14] that has been well received by the developer community. With this work we aim to provide an improved calibration algorithm for the Kinect community.

1.3 Overview of the Approach

We use a planar checkerboard pattern for calibration, which can be constructed from any readily available planar surface (e.g., a flat table or a wall). We use multiple views of the calibration plane and, for each view, all cameras take an image. The checkerboard corners provide suitable constraints for the color images, while the planarity of the points provides constraints on the depth images. The pixels at the borders of the calibration object can be ignored, and thus depth discontinuities are not needed. Fig. 1 shows sample images from the color and depth cameras used for calibration. The three orientations shown constrain the three dimensions of the relative translation between the depth and color cameras.

In the color images, the checkerboard corners are extracted. In the depth images, the area containing the plane is located. To initialize the depth camera intrinsics, the user also selects the corners of the calibration plane in the depth images. A standard calibration based on user-selected corners [2] is performed on each image individually to initialize the calibration. An iterative nonlinear bundle adjustment is then performed, and our formulation allows a closed-form solution of the disparity distortion parameters.

2 CALIBRATION MODEL

Our setup consists of a depth and color camera pair with an optional high resolution color camera rigidly attached. Although the camera pair is sufficient for calibration, in the case of a Kinect device the internal color camera has low quality. Therefore, if one needs high-quality color images with depth maps, an external camera is very useful. Our system calibrates both color cameras simultaneously. We will refer to the high-resolution camera as the external camera, while the color camera from the camera pair will be referred to simply as the color camera. Our implementation and experiments use the Kinect sensor, which consists of a projector-camera pair, as the depth sensor that measures per-pixel disparity. The external camera is a Canon EOS 5D Mark II.

2.1 Color Camera Intrinsics

We use a similar intrinsic model as Heikkilä [1], which consists of a pinhole model with radial and tangential distortion correction. The projection of a point from color camera coordinates x_c = [x_c, y_c, z_c]^T to color image coordinates p_c = [u_c, v_c]^T is obtained through the following equations. The point is first normalized by x_n = [x_n, y_n]^T = [x_c/z_c, y_c/z_c]^T. Distortion is then applied:

    x_g = [ 2 k_3 x_n y_n + k_4 (r^2 + 2 x_n^2) ;  k_3 (r^2 + 2 y_n^2) + 2 k_4 x_n y_n ],   (1)

    x_k = (1 + k_1 r^2 + k_2 r^4 + k_5 r^6) x_n + x_g,   (2)

where r^2 = x_n^2 + y_n^2 and k_c = [k_1, ..., k_5] is a vector containing the distortion coefficients.

Finally, the image coordinates are obtained:

    [u_c ; v_c] = [ f_cx  0 ; 0  f_cy ] [x_k ; y_k] + [u_0c ; v_0c],   (3)

where f_c = [f_cx, f_cy] are the focal lengths and p_0c = [u_0c, v_0c] is the principal point. The same model applies to the color and external cameras. The model for each camera is described by L_c = {f_c, p_0c, k_c}.
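For concreteness, the forward model (1)-(3) could be sketched as below. This is an illustrative NumPy sketch only; the function name and argument layout are our own and are not part of the released toolbox, which is implemented in Matlab.

```python
import numpy as np

def project_color(x_c, f, p0, k):
    """Forward projection of a point in color-camera coordinates to pixel
    coordinates, following the pinhole + radial/tangential model of (1)-(3).
    x_c: (3,) point; f: (2,) focal lengths; p0: (2,) principal point;
    k: (5,) distortion coefficients [k1, ..., k5]."""
    x_c, f, p0, k = map(np.asarray, (x_c, f, p0, k))
    xn = x_c[:2] / x_c[2]                       # normalization: [x/z, y/z]
    r2 = xn[0]**2 + xn[1]**2                    # r^2 = xn^2 + yn^2
    # Tangential distortion term, equation (1)
    xg = np.array([2*k[2]*xn[0]*xn[1] + k[3]*(r2 + 2*xn[0]**2),
                   k[2]*(r2 + 2*xn[1]**2) + 2*k[3]*xn[0]*xn[1]])
    # Radial distortion plus tangential offset, equation (2)
    xk = (1 + k[0]*r2 + k[1]*r2**2 + k[4]*r2**3) * xn + xg
    # Pinhole mapping to pixel coordinates, equation (3)
    return f * xk + p0
```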

2.2 Depth Camera Intrinsics

In our experiments we used the Kinect as a depth camera. Yet, the method allows different kinds of depth sensors by replacing this intrinsic model. The Kinect's depth sensor consists of an infrared projector that emits a constant pattern and an infrared camera that measures the disparity between the observed pattern and a prerecorded image at a known constant depth. The output consists of an image of scaled disparity values in Kinect disparity units.

The transformation between depth camera coordinates x_d = [x_d, y_d, z_d]^T and depth image coordinates p_d = [u_d, v_d]^T follows a similar model to that used for the color camera. Equation (3) is used with the respective parameters f_d and p_0d. However, whereas the color camera's distortion is defined in terms of the forward model (world to image), we define the geometric distortion of the depth camera in terms of the backward model (image to world). This is computationally convenient in our case because our formulation of the bundle adjustment in Section 3.3 backward projects optical rays for the depth camera but forward projects optical rays for the color camera. Further, previous studies have shown that the lens distortion model defined by (1) and (2) works well in both directions [1], [17]. Thus, the geometric distortion model for the depth camera is obtained by simply switching the roles of x_n and x_k in (1) and (2).

The relation between the obtained disparity value d and the depth z_d contains two parts: a scaled inverse and a distortion correction. The scaled inverse has been observed by most previous calibration approaches to fit the observations; it is modeled by the equation

    z_d = 1 / (c_1 d_k + c_0),   (4)

where c_1 and c_0 are part of the depth camera intrinsic parameters to be calibrated and d_k is the undistorted disparity (i.e., after distortion correction).


Fig. 1. Top: Sample calibration images from the external camera. Bottom: Disparity images. Note the inaccuracies at the edges and that the checkerboard is not visible.


Note that for all Kinect devices c_1 is negative, which means that higher disparity values correspond to points farther away from the camera (the opposite of traditional stereo disparity units).

When calibrated using only (4) (i.e., without distortion correction), the Kinect displays a fixed error pattern in the measurements (Fig. 2). Because the internal algorithms of the Kinect are proprietary and closed, it is not possible to pinpoint the exact nature of this distortion. However, we can correct it based on observations. It was suggested in [12] that this distortion could be corrected by applying a spatially varying offset Z_δ to the calculated depth

    z_dd = z_d + Z_δ(u, v),   (5)

and it was observed that this usually reduces the reprojection error. However, we have found that a more accurate calibration is made by correcting for the distortion directly in disparity units.

The shape of the error pattern is constant, but its magnitude decreases as the distance from the object increases. To demonstrate this decay, we took the errors of all pixels from planes at several distances and normalized them (dividing all images by Fig. 2). Fig. 3 shows the resulting median values for each measured disparity. The normalized error fits well to an exponential decay. This led us to construct a distortion model that has per-pixel coefficients and decays exponentially with increasing disparity. Hence, we use a spatially varying offset that decays as the Kinect disparity increases:

    d_k = d + D_δ(u, v) · exp(α_0 − α_1 d),   (6)

where d is the distorted disparity as returned by the Kinect, D_δ contains the spatial distortion pattern, and α = [α_0, α_1] models the decay of the distortion effect.

Note that this model does not enforce any smoothness on the spatial distortion pattern. To properly constrain this pattern, it is enough to include some (four) images of a flat surface that spans the entire depth image. We add images of an empty wall at several depths. These images do not need the checkerboard pattern since they are only needed to constrain the distortion pattern. This ensures that all pixels in the depth image have samples to estimate their coefficients D_δ(u, v).

Although this disparity distortion model was developed with the Kinect in mind, it bears similarities with the model of a ToF camera. Kim et al. [10] obtained results similar to Fig. 3, except that they fit a sixth degree polynomial instead of an exponential. Furthermore, the calibration of this ToF camera model is simpler because they do not use per-pixel coefficients.

Equations (4) and (6) are used when measured disparities are transformed to metric coordinates, also known as the backward model. The inverse of these functions, the forward model, is also needed to compute the reprojection errors. The inverse of (4) is straightforward:

    d_k = 1 / (c_1 z_d) − c_0 / c_1.   (7)

But the inverse of (6) is a bit more involved because of the exponential. We perform two variable substitutions to isolate the exponential product:

    y = exp(α_0 − α_1 d_k + α_1 D_δ(u, v) y),   where y = (d_k − d) / D_δ(u, v),
    y = exp(α_1 D_δ(u, v) y) exp(α_0 − α_1 d_k),
    −ỹ / (α_1 D_δ(u, v)) = exp(−ỹ) exp(α_0 − α_1 d_k),   where ỹ = −y α_1 D_δ(u, v),
    ỹ exp(ỹ) = −α_1 D_δ(u, v) exp(α_0 − α_1 d_k).

The product can be solved using the Lambert W function [18], which is the solution to the relation W(z) exp(W(z)) = z:

    ỹ = W(−α_1 D_δ(u, v) exp(α_0 − α_1 d_k)),
    (d − d_k) α_1 = W(−α_1 D_δ(u, v) exp(α_0 − α_1 d_k)),
    d = d_k + W(−α_1 D_δ(u, v) exp(α_0 − α_1 d_k)) / α_1.   (8)

Although the Lambert W function is a transcendental function, there are many accurate approximations in the literature [18] and modern mathematical packages include implementations of it (e.g., Matlab).
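As an illustration, the backward model ((6) followed by (4)) and the forward model ((7) followed by (8)) could be written as below. This is a Python/NumPy sketch only (the released toolbox is in Matlab); SciPy's lambertw stands in for the Lambert W function, and all names are our own illustrative choices.

```python
import numpy as np
from scipy.special import lambertw

def disparity_to_depth(d, u, v, c0, c1, D_delta, alpha):
    """Backward model: raw Kinect disparity -> metric depth, via (6) then (4)."""
    dk = d + D_delta[v, u] * np.exp(alpha[0] - alpha[1] * d)   # undistort, (6)
    return 1.0 / (c1 * dk + c0)                                # scaled inverse, (4)

def depth_to_disparity(z, u, v, c0, c1, D_delta, alpha):
    """Forward model: metric depth -> raw disparity, via (7) then (8)."""
    dk = 1.0 / (c1 * z) - c0 / c1                              # inverse of (4), i.e. (7)
    w = lambertw(-alpha[1] * D_delta[v, u] * np.exp(alpha[0] - alpha[1] * dk))
    return dk + np.real(w) / alpha[1]                          # re-distort, (8)
```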

The model for the depth camera is described by L_d = {f_d, p_0d, k_d, c_0, c_1, D_δ, α}, where the first three parameters come from the model described in Section 2.1 and the last four are used to transform disparity to depth values.

2.3 Extrinsics and Relative Pose

Fig. 4 shows the different reference frames present in a scene. Points from one reference frame can be transformed to another using a rigid transformation denoted by T = {R, t}, where R is a rotation and t a translation. For example, the transformation of a point x_w from world coordinates {W} to color camera coordinates {C} follows x_c = ^W R_C x_w + ^W t_C.


Fig. 2. Error residuals (kdu) without distortion correction of a plane at 0.56 m (left) and 1.24 m (right).

Fig. 3. Distortion magnitude with increasing disparity.

Fig. 4. Reference frames and transformations. {D}, {C}, and {E} are the depth, color, and external cameras, respectively. For image i, {V_i} is attached to the calibration plane and {W_i} is the calibration pattern.


Reference {V_i} is anchored to the corner of the calibration plane of image i and is only used for initialization. The relative poses (^D T_C and ^E T_C) are constant, while each image has its own world-to-camera pose ^{W_i} T_C. By design, the table and the checkerboard are coplanar, but the full transformation between {V} and {W} is unknown.

3 CALIBRATION METHOD

A block diagram of our calibration method is presented in Fig. 5. The individual steps are described in the following sections.

3.1 Corner-Based Calibration

The calibration of a color camera is a well-studied problem; we use Zhang's method [2] to initialize the camera parameters. Briefly, the steps are the following: The checkerboard corners are extracted from the intensity image. A homography is then computed for each image using the known corner positions in world coordinates {W_i} and the measured positions in the image. Each homography then imposes constraints on the camera parameters, which are solved with a linear system of equations. The distortion coefficients are initially set to zero.

The same method is used to initialize the depth camera parameters. However, because the checkerboard is not visible in the depth image, the four corners of the calibration plane are extracted (the gray plane in Fig. 1). These corners are very noisy and are only used here to obtain an initial guess. The homography is thus computed between {V_i} and {D}, also using Zhang's method. This initializes the focal lengths, the principal point, and the transformation ^{V_i} T_D. Using these initial parameters we obtain an estimate for the expected depth of each selected corner. With this expected depth and the measured disparity, an overdetermined system of linear equations is built using (4), which gives an initial guess for the depth parameters (c_0 and c_1).
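Since (4) gives 1/z_d = c_1 d_k + c_0, which is linear in c_0 and c_1, a few corner correspondences suffice for a least-squares initial guess. The sketch below is illustrative only (names are ours); at this stage the raw disparity is used in place of d_k.

```python
import numpy as np

def init_depth_params(disparities, expected_depths):
    """Initial guess for (c0, c1) from corners with known expected depth.
    Builds the overdetermined linear system 1/z_i = c1*d_i + c0 and solves it
    in the least-squares sense."""
    d = np.asarray(disparities, dtype=float)
    z = np.asarray(expected_depths, dtype=float)
    A = np.column_stack([d, np.ones_like(d)])       # rows: [d_i, 1]
    rhs = 1.0 / z                                    # 1/z_i
    (c1, c0), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return c0, c1
```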

3.2 Relative Pose Estimation

The relative pose between the external and color cameras can be obtained directly because their pose with respect to the same reference frame {W} is known. For the depth camera, however, only the pose with respect to {V} is known, which is not aligned to {W}. To obtain the relative pose ^C T_D, we take advantage of the fact that {V} and {W} are coplanar by design. We extract the plane equation in each reference frame and use it as a constraint. We define a plane using the equation n^T x − δ = 0, where n is the unit normal and δ is the distance to the origin.

If we divide a rotation matrix into its columns R = [r_1, r_2, r_3] and choose the parameters of the plane in both frames as n = [0, 0, 1]^T and δ = 0, the plane parameters in camera coordinates are

    n = r_3  and  δ = r_3^T t,   (9)

where we use ^{W_i} R_C and ^{W_i} t_C for the color camera and ^{V_i} R_D and ^{V_i} t_D for the depth camera.

As mentioned by Unnikrishnan and Hebert [5], the relative pose can be obtained in closed form from several images. The plane parameters for each image are concatenated in matrices of the form M_c = [n_c1, n_c2, ..., n_cn], b_c = [δ_c1, δ_c2, ..., δ_cn], and likewise for the depth camera to form M_d and b_d. The relative transformation is then

    ^C R'_D = M_d M_c^T,   (10)

    ^C t_D = (M_c M_c^T)^{-1} M_c (b_c − b_d)^T.   (11)

Due to noise, ^C R'_D may not be orthonormal. We obtain a valid rotation matrix through SVD using ^C R_D = U V^T, where U S V^T is the SVD of ^C R'_D.

3.3 Nonlinear Minimization

The calibration method aims to minimize the weighted sum of squares of the measurement reprojection errors over all parameters (L_c, L_d, L_e, ^E T_C, ^D T_C, and ^{W_i} T_C for all images i). The error for the color camera is the Euclidean distance between the measured corner position p_c and its reprojected position p̂_c (the same for the external camera, with p_e and p̂_e, respectively), whereas for the depth camera it is the difference between the measured disparity d and the predicted disparity d̂. The predicted disparity is obtained by calculating the distance to the calibration plane along the optical ray and transforming it to disparity using (7) and (8). Because the errors have different units, they are weighted using the inverse of the corresponding measurement variance (σ_c^2, σ_d^2, σ_e^2). The resulting cost function is

    c = Σ ||p̂_c − p_c||^2 / σ_c^2 + Σ (d̂ − d)^2 / σ_d^2 + Σ ||p̂_e − p_e||^2 / σ_e^2.   (12)

Note that (12) is highly nonlinear and depends on a large number of parameters (D_δ alone contains 307,200 entries). To separate the optimization of the disparity distortion parameters from the rest, we make a slight modification to (12). Instead of comparing the reprojection in measured disparity space, we calculate the residuals in undistorted disparity space:

    c = Σ ||p̂_c − p_c||^2 / σ_c^2 + Σ (d̂_k − d_k)^2 / σ_d^2 + Σ ||p̂_e − p_e||^2 / σ_e^2.   (13)

It is also possible to optimize (12) by inverting the roles of (6) and (8). However, including the Lambert W function in the backward camera model would make it cumbersome to use for transforming measurements into 3D points. We tested both approaches and found no practical advantage of minimizing (12) over (13).

The optimization has three steps, as shown in Fig. 5. The initialization gives a very rough guess of the depth camera parameters and relative pose, whereas the color camera parameters have fairly good initial values. Thus, the first step optimizes only L_d and ^D T_C with all other parameters fixed. The optimization then continues iteratively with two alternating steps. In the first step, D_δ is kept constant and (13) is minimized using the Levenberg-Marquardt algorithm over all other parameters. In the second step, the spatial disparity distortion pattern D_δ is optimized independently for each pixel. The initial values of the depth distortion model (α and D_δ) are not critical, and initially assuming zero for both has proven to yield accurate results. The algorithm iterates until the residuals are stable.
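A skeleton of this alternating scheme is sketched below. It is illustrative only: the residual function of (13) and the closed-form per-pixel update of Section 3.4 are left as user-supplied callbacks, and all names are ours rather than the toolbox's.

```python
import numpy as np
from scipy.optimize import least_squares

def alternate_calibration(residuals, solve_D_delta, theta0, D_delta0,
                          max_iter=10, tol=1e-6):
    """Alternating minimization sketch.
    residuals(theta, D_delta): weighted residual vector of (13), D_delta fixed.
    solve_D_delta(theta): closed-form per-pixel update (Section 3.4)."""
    theta, D_delta = np.asarray(theta0, float), D_delta0
    prev_cost = np.inf
    for _ in range(max_iter):
        # Step 1: Levenberg-Marquardt over all parameters except D_delta.
        theta = least_squares(residuals, theta, args=(D_delta,), method='lm').x
        # Step 2: closed-form update of the spatial disparity distortion pattern.
        D_delta = solve_D_delta(theta)
        cost = float(np.sum(residuals(theta, D_delta) ** 2))
        if prev_cost - cost < tol * max(prev_cost, 1.0):
            break                                   # residuals are stable
        prev_cost = cost
    return theta, D_delta
```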

3.4 Disparity Distortion Estimation

Optimizing D_δ separately is more efficient because the entries in D_δ are independent of each other and the estimation of D_δ(u, v) takes into account only measurements obtained from pixel (u, v). Moreover, when the other parameters are fixed, we can solve for D_δ(u, v) in closed form.


Fig. 5. Calibration algorithm. Before dashed line: initialization. After dashed line: nonlinear minimization.


Each disparity measurement d is first undistorted using (6). We compute a predicted disparity d̂_k using the distance to the plane and (7). Thus, minimizing (13) over D_δ is equivalent to minimizing its middle term, which leads to the following cost function:

    c_d = Σ_images Σ_{u,v} (d + D_δ(u, v) · exp(α_0 − α_1 d) − d̂_k)^2.   (14)

This minimization is straightforward because (14) is quadratic in each D_δ(u, v), and hence the optimal value of each D_δ(u, v) is obtained by solving a linear equation.

For comparison, we also calibrated using the model of Smíšek et al. [12]. The value of Z_δ(u, v) is calculated as the mean difference between the measured depth z_d and the expected depth ẑ_d:

    Z_δ(u, v) = Σ_N (ẑ_d − z_d) / N.   (15)

4 RESULTS

The color camera of the Kinect delivers images with a resolution of 1280 × 1024, whereas the resolution of the external camera was 2784 × 1856. Three different data sets were captured (A1, A2, and B1). Two of them were captured with the same Kinect (A1 and A2) and one with a different Kinect (B1). For each set, the captured images were divided into calibration and validation groups with 60 and 14 images, respectively. The calibration images were used to estimate all the parameters in the model; then the intrinsic parameters were kept fixed to estimate only the rig's pose (^W T_C) for the validation images. All results presented here were obtained from the validation images.

We implemented our algorithm in Matlab. The code has been released as a toolbox for the research community. It can be found on the same website as our previous work [14]. We used 60 plane orientations for calibration. However, we found that comparable accuracy is achieved with only 20 positions: four for each orientation shown in Fig. 1 at different depths, and four of a flat surface that covers the entire depth image. The calibration with 60 positions takes 15 min on a 2.4 GHz computer, but only 3 min with 20 positions.

Normally, the different calibrations (A1, A2, and B1) would produce slightly different error variances (σ_c, σ_d, and σ_e). To compare the data sets, the variances were kept constant (σ_c = 0.18 px, σ_d = 0.9 kdu, and σ_e = 0.30 px).

4.1 Calibration Accuracy

Fig. 6 shows the spatial patterns obtained for the distortion correction using (6) and (5). We can observe a very similar pattern in both images. Table 1 shows a comparison of the calibration results obtained using both types of distortion correction and no correction. The three models were calibrated using the same data sets, and the table shows the results of validation against the respective validation set.

The distortion correction proposed by Smíšek et al. [12] improves the reprojection error for data sets A1 and A2. However, because it reduces the error in metric space, it increases the reprojection error for B1. In contrast, our approach produces the best results in all cases. The standard deviations of the reprojection errors for all sets were very low (< 1 px and < 1 kdu), which demonstrates an accurate calibration. Also note that even though no spatial smoothness was enforced for the spatial distortion pattern, the obtained pattern is smooth, which shows that the procedure provides enough constraints.

Table 2 shows the results of calibration without the external camera. The calibration accuracy remains the same. The external camera is thus not necessary for an accurate calibration. Still, its joint calibration is a useful feature for many applications that need a high-quality external camera.

4.2 Comparison with Manufacturer Calibration

The drivers provided by the manufacturer (Primesense) use factory-calibrated settings to convert the disparity measurements to 3D points. We used these calibration parameters and compared their performance to that of our calibration. Using the disparity images from the A2 data set, both calibrations were used to obtain 3D points. The calibration of our method was done with the A1 data set to avoid any bias. A plane was fitted to each cloud of points and the measured depth was compared to the expected one based on the plane's depth at the given pixel. The error measurements are shown in Fig. 7 for both calibrations. The measurements were grouped by depth in 64 bins from 0.4 m to 3.7 m. For each bin, the standard deviation of the error was plotted.


Fig. 6. Spatial distortion patterns obtained.

TABLE 1. Calibration with Different Distortion Models

Std. deviation of residuals with a 99 percent confidence interval.

TABLE 2. Calibration without the External Camera

Std. deviation of residuals with a 99 percent confidence interval.

Fig. 7. Measurement uncertainty for varying depths.


The uncertainty was also simulated using the calibrated model. For a given depth, the expected disparity for each pixel was calculated using (7) and (8). Gaussian noise (μ = 0 and σ = 0.6) was applied to this disparity and the corrupted depth was obtained through (4) and (6). The standard deviation of the error between the initial and corrupted depths is plotted as a solid line in Fig. 7. We can see that these virtual results are very close to our experimental results.
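This simulation can be reproduced with a short Monte Carlo sketch that reuses the disparity_to_depth and depth_to_disparity sketches from Section 2.2; the noise level matches the value stated above, while the sample count and function names are our illustrative choices.

```python
import numpy as np

def simulate_depth_uncertainty(z_true, u, v, c0, c1, D_delta, alpha,
                               sigma_d=0.6, n_samples=10000, rng=None):
    """Estimate the depth error standard deviation at a given true depth:
    project depth to disparity with (7)-(8), add Gaussian disparity noise,
    and map back through (6) and (4)."""
    rng = np.random.default_rng() if rng is None else rng
    d = depth_to_disparity(z_true, u, v, c0, c1, D_delta, alpha)   # forward model
    d_noisy = d + rng.normal(0.0, sigma_d, size=n_samples)         # disparity noise
    z_noisy = disparity_to_depth(d_noisy, u, v, c0, c1, D_delta, alpha)
    return float(np.std(z_noisy - z_true))
```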

Like any stereo camera, the Kinect is expected to be more accurate the closer the object is. Due to the inverse relation between disparity and depth, zero-mean noise with constant variance on the measured disparity results in higher depth uncertainty as the depth increases. This relation is shown in Fig. 7. Our method clearly outperforms the manufacturer calibration at ranges up to 1.5 m. At 1 m distance, the manufacturer calibration has twice the uncertainty. After 1.5 m, the distortion correction has a smaller influence on the reconstruction and both calibrations have approximately similar accuracy.

It is suspected that the infrared image is not distortion corrected before the depth estimation algorithm is applied, which produces the depth distortion pattern. This could be why the disparity distortion has the same spatial distribution as a radial distortion (i.e., concentric circles). Overall, it seems that the manufacturer's model was fit to better explain the observations in the middle and far range, whereas our model is able to explain the observations across the whole measurement range.

4.3 Variability of Kinect Devices

To justify the need for calibrating the Kinect, we used the calibration of one data set on the validation images of another data set. The external camera was not used for the validation here because its relative pose is different between the different data sets. The results are presented in Table 3. They show that reconstruction using the calibration of another Kinect is highly inaccurate and increases the reprojection error considerably, both for the color and the depth camera, thus supporting the idea that each Kinect must be individually calibrated to achieve maximum accuracy.

4.4 3D Ground Truth

We also compared the accuracy of calibration by reconstructing a hollow cube whose sides are known to be at 90° from each other. Fig. 8 shows the reference cube. A point cloud was obtained from the disparity image and planes were fitted to the points from each side. Table 4 shows how much the angle between the obtained planes deviates from 90°. Our method clearly achieves a better reconstruction accuracy.
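For reference, fitting a plane to each face and measuring the angle between the fitted planes can be done as in the following sketch (illustrative only; the actual evaluation code may differ):

```python
import numpy as np

def fit_plane_normal(points):
    """Least-squares plane normal of an (n, 3) point cloud via SVD."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[-1]                        # direction of smallest variance

def angle_between_planes(points_a, points_b):
    """Angle (degrees) between two fitted planes; deviation from 90 degrees
    measures the reconstruction accuracy in the cube test."""
    na, nb = fit_plane_normal(points_a), fit_plane_normal(points_b)
    cos = abs(float(np.dot(na, nb)))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```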

5 CONCLUSION

We have presented a calibration algorithm for a depth and color camera pair that is optimal in the sense of the postulated principles. The algorithm takes into account color and depth features simultaneously to improve the calibration of the camera pair system as a whole. It requires only a planar surface and a simple checkerboard pattern.

The results show that our algorithm achieved a better calibration for the Kinect than that provided by the manufacturer. The disparity distortion correction model considerably improved reconstruction accuracy, outperforming previously proposed models. At 1 m distance our calibration showed twice the reconstruction accuracy of the manufacturer's calibration. Moreover, we have released our code as a Matlab toolbox to the research community.

The extension of the calibration to several external color cameras is straightforward and is already implemented in the released toolbox. In addition, we believe that our algorithm is flexible enough to be used with other types of depth sensors by replacing the intrinsic model of the depth camera. The constraints used can be applied to any type of depth sensor. Future work can include the calibration of a ToF and color camera pair.

ACKNOWLEDGMENTS

This project has been funded by the Academy of Finland's project #127702.

REFERENCES

[1] J. Heikkilä, "Geometric Camera Calibration Using Circular Control Points," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1066-1077, Oct. 2000.

[2] Z. Zhang, "A Flexible New Technique for Camera Calibration," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330-1334, Nov. 2000.

[3] S. Fuchs and G. Hirzinger, "Extrinsic and Depth Calibration of ToF-Cameras," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-6, 2008.

[4] Q. Zhang and R. Pless, "Extrinsic Calibration of a Camera and Laser Range Finder (Improves Camera Calibration)," Proc. IEEE/RSJ Int'l Conf. Intelligent Robots and Systems, vol. 3, pp. 2301-2306, 2004.

[5] R. Unnikrishnan and M. Hebert, "Fast Extrinsic Calibration of a Laser Rangefinder to a Camera," Technical Report CMU-RI-TR-05-09, Robotics Inst., 2005.

[6] D. Scaramuzza, A. Harati, and R. Siegwart, "Extrinsic Self Calibration of a Camera and a 3D Laser Range Finder from Natural Scenes," Proc. IEEE/RSJ Int'l Conf. Intelligent Robots and Systems, pp. 4164-4169, 2007.

[7] M. Lindner and A. Kolb, "Calibration of the Intensity-Related Distance Error of the PMD TOF-Camera," Proc. SPIE: Intelligent Robots and Computer Vision XXV, vol. 6764, pp. 67640W-67640W-8, 2007.

[8] D. Lichti, "Self-Calibration of a 3D Range Camera," The Int'l Archives of the Photogrammetry, Remote Sensing, and Spatial Information Sciences, Part B, pp. 927-932, 2008.

[9] J. Zhu, L. Wang, R. Yang, and J. Davis, "Fusion of Time-of-Flight Depth and Stereo for High Accuracy Depth Maps," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2008.

[10] Y. Kim, D. Chan, C. Theobalt, and S. Thrun, "Design and Calibration of a Multi-View ToF Sensor Fusion System," Proc. IEEE CS Computer Vision and Pattern Recognition Workshops, 2008.

[11] Y. Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt, "3D Shape Scanning with a Time-of-Flight Camera," Proc. IEEE Computer Vision and Pattern Recognition Workshops, pp. 1173-1180, 2010.

[12] J. Smíšek, M. Jančošek, and T. Pajdla, "3D with Kinect," Proc. IEEE Int'l Conf. Computer Vision Workshops, 2011.

[13] N. Burrus, “Kinect Calibration,” http://nicolas.burrus.name/index.php/Research/KinectCalibration, Nov. 2011.


TABLE 3. Variability of Kinects

Std. deviation of residuals using different sets. Dark cells indicate a device mismatch.

Fig. 8. 3D reference cube. Color and disparity images.

TABLE 4. Angular Error between Reconstructed Planes

∠ab is the angle between planes a and b.


[14] D. Herrera C., J. Kannala, and J. Heikkilä, "Accurate and Practical Calibration of a Depth and Color Camera Pair," Proc. 14th Int'l Conf. Computer Analysis of Images and Patterns, http://www.ee.oulu.fi/dherrera/kinect/, pp. 437-445, 2011.

[15] C. Zhang and Z. Zhang, "Calibration between Depth and Color Sensors for Commodity Depth Cameras," Proc. Int'l Conf. Multimedia and Expo, 2011.

[16] J. Shotton, A. Fitzgibbon, M. Cook, and A. Blake, "Real-Time Human Pose Recognition in Parts from Single Depth Images," Proc. IEEE Computer Vision and Pattern Recognition, pp. 1297-1304, 2011.

[17] D. Brown, "Close-Range Camera Calibration," Photogrammetric Eng., vol. 37, no. 8, pp. 855-866, 1971.

[18] D. Barry, J.-Y. Parlange, L. Li, H. Prommer, C. Cunningham, and F. Stagnitti, "Analytical Approximations for Real Values of the Lambert W-Function," Math. and Computers in Simulation, vol. 53, nos. 1/2, pp. 95-103, 2000.

