

Flexible 3D Models from Uncalibrated Cameras

T.F.Cootes, E.C. Di Mauro, C.J.Taylor, A.Lanitis
Department of Medical Biophysics,

University of Manchester, Manchester M13 9PT

email: [email protected]

Abstract

We describe how to build statistically-based flexible models of the 3D structure of variable objects, given a training set of uncalibrated images. We assume that for each example object there are two labelled images taken from different viewpoints. From each image pair a 3D structure can be reconstructed, up to either an affine or projective transformation, depending on which camera model is used. The reconstructions are aligned by choosing the transformations which minimise the distances between matched points across the training set. A statistical analysis results in an estimate of the mean structure of the training examples and a compact parameterised model of the variability in shape across the training set. Experiments have been performed using pinhole and affine camera models. Results are presented for both synthetic data and real images.

1 Introduction

In many vision problems we study objects which are either deformable in themselves (such as faces) or can be considered examples of a class of variable shapes (such as cars). In order to represent them we must model this variability explicitly. However, the only information we have about the 3D objects is often that contained in one or more 2D images. To understand their 3D shape we must perform some form of reconstruction. When we have accurate camera calibration data we can obtain metric information and recover the euclidean structure of the objects. Unfortunately camera calibration is not robust, is easily lost (by moving or refocusing the camera) and can be difficult to obtain. Recent work has shown that it is possible to obtain relative 3D structure (up to an affine or projective transformation) from uncalibrated images [4-9]. In this paper we combine this work on structure from uncalibrated cameras with methods of building statistical models of shape [1]. We demonstrate how to build flexible models of three dimensional structure from uncalibrated images of examples.

We assume that we have a training set of objects from the class we wish to model, and that for each we have a pair of arbitrary views. From each pair we can reconstruct the structure of the example up to either a projective or an affine transformation, depending upon which camera model is assumed. We give examples for both pinhole and affine camera models. The set of reconstructed examples are aligned into a common reference frame by finding the affine or projective transform for each which minimises the total variance across the training set. The mean shape and a compact parameterised model of shape variation are obtained by performing a statistical analysis in this frame for the set of reconstructions. We show results for synthetic images of cars in which the modes of variation are known, and for real images of a vice and human faces. We discuss practical applications of the method.

BMVC 1995 doi:10.5244/C.9.15

2 Background

Cootes et al [1] describe 2D statistical models of variable image structure. These models are generated by manually locating landmark points on the structures of interest in each of a set of training images, aligning these training shapes into a common reference frame, then performing a statistical analysis to obtain the mean shape and main modes of shape variation. They show how these Point Distribution Models (PDMs) can be used in image search [1,2] by creating Active Shape Models (ASMs). An ASM is analogous to a 'snake' in that it refines its position and shape under the influence of image evidence, giving robust object location. Hill et al [11] show how the PDM/ASM approach can be extended to 3D when volume or range images are available, for example in medical imaging. A review of other deformable models is given in [1]. Shen and Hogg [3] have recently shown how a fairly coarse flexible 3D model can be generated from a set of image sequences.

There is a well established literature on methods of extracting 3D structures from two or more 2D images. Because camera calibration is often inconvenient, and in any case non-robust, recent work has focused on what can be learnt about scene structure from uncalibrated images [6]. Developments in projective geometry [10,12] have led to various constructions which are invariant to camera parameters and pose [16]. Hartley et al [4,5], Mohr et al [7] and Faugeras [6] have described methods of reconstructing the positions of 3D points up to a projective transformation of 3D space, given their projection into two uncalibrated images. An alternative approach is to assume an affine camera model, an approximation acceptable when the distance to the subject is large compared to the subject's depth. In this case the Factorisation Method of Tomasi and Kanade [8,9] can be used to reconstruct structure up to an affine transformation of 3D space. This is robust and works well on noisy data from real images.

Most of the work on recovering 3D structure has assumed that the objects viewed are rigid. Sparr describes a framework for dealing with objects which deform in ways which are locally affine [15]. Blake et al [13] and Beardsley et al [14] have also developed geometric models which allow affine deformations, for tracking objects in image sequences.

3 Overview - Flexible Models from 2D Images

We assume that we have a training set of paired uncalibrated images of objects of interest, in which landmark points representing key points on the objects have been located. Suppose we have N such pairs, each containing n landmark points. To build a flexible model from this data we must perform three steps:


i) Reconstruct 3D structure (up to an affine or projective transformation) from each paired set of image points.

ii) Align the sets of reconstructed points, applying affine or projective transformations to minimise the distances between equivalent points across the training set - this defines a reference frame for the model.

iii) Apply a Principal Component Analysis to the reconstructed 3D data in the reference frame, to generate the mean shape and main modes of shape variation.

This is analogous to the method of building 2D Point Distribution Models [1], the main difference being in the alignment stage. In the 2D case alignment involves choosing the rotation, translation and scale for each example which minimises the mean squared deviation of the aligned set of shapes. In this case we must choose the most suitable projective (15 parameter) or affine (12 parameter) transformation to align the examples.

4 Reconstructing 3D structure

Given two uncalibrated projective views of a structure represented by a set of points, it is possible to compute the relative positions of the points in three dimensions up to an arbitrary (unknown) affine or projective transformation of 3-space, depending on the camera model used. We have investigated the use of both projective and affine camera models.

For a projective camera it is possible to reconstruct structure up to a projective transformation of 3D space [4,6]. Faugeras [6], Hartley et al [4,5] and Mohr et al [7] all describe reconstruction algorithms.

For an affine camera we can reconstruct the structure up to an affine transformation of 3D space using the factorisation (SVD) method of Tomasi and Kanade [8]. This was developed for shape and motion recovery from image sequences, and gives a robust method of recovering the structure given two or more images of an object. Although the original method assumed an orthographic projection model (later extended to use a paraperspective camera model [9]) it is able to generate structure up to an affine transformation if an uncalibrated affine camera model is used.
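The following is a minimal sketch of a two-view affine reconstruction in the spirit of the factorisation method [8]; the function name and the use of NumPy are our own choices, and no metric constraints are applied, so the structure is returned only up to the affine ambiguity described above.

```python
import numpy as np

def affine_reconstruction(pts1, pts2):
    """pts1, pts2: (n, 2) arrays of matched image points from the two views.
    Returns an (n, 3) array of 3D points, defined only up to an affine
    transformation of 3-space (no metric constraints are imposed)."""
    # Stack the two views into a 4 x n measurement matrix and subtract the
    # centroid of each row; this removes the unknown affine-camera translations.
    W = np.vstack([pts1.T, pts2.T]).astype(float)
    W -= W.mean(axis=1, keepdims=True)
    # Rank-3 factorisation via SVD: W ~ (camera stack) x (3 x n structure).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    S = np.diag(np.sqrt(s[:3])) @ Vt[:3]
    return S.T
```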

5 Aligning into a Reference Frame

After reconstruction the sets of points can be considered to each lie in an arbitrary reference frame. Before we can apply any statistical analysis, we must move them all into an appropriate co-ordinate frame; for the statistics to be valid this would ideally be a euclidean co-ordinate frame. This step is analogous to the alignment step used when building Point Distribution Models in 2D (Cootes et al [1]). In that case shapes are presented at arbitrary positions with arbitrary orientations and scales. Before a mean shape can be calculated it is necessary to align each example so that they have consistent centres, orientations and scales. This is achieved by minimising the sum of square distances between points after transformation, and applying a constraint to the mean shape. The mean can be constrained by aligning with a set of reference points, usually one of the original examples. We use a similar method for the reconstructed 3D shapes, generalising it to allow affine and projective transformations during the alignment.


The general alignment algorithm has the following steps:

i) Apply a transformation to each set of points to minimise the distance to a reference set.

ii) REPEAT:
   - Calculate the mean structure
   - Align the mean with a reference set
   - Re-align each set of points to minimise the distance to the mean
   UNTIL change sufficiently small.

The reference set serves two purposes. The first is to ensure that the alignment algorithm converges. Without constraining the mean by re-aligning it with a reference set, the system is underdetermined - for instance the shapes can shrink to a point. We could use any of the reconstructed sets of points as the reference set. Its second purpose is to define a suitable reference frame for shapes in which to perform a statistical analysis of the point positions (see below). After alignment all the examples can be considered to be in the frame of the reference set. Ideally the reference set should be the true euclidean positions of some of the points. If the reference set is severely distorted compared to the true structure, for instance if it is much shorter in one dimension, the statistics of the models will be biased. However, since we are building deformable models, which will only be valid up to an affine or projective transformation, an approximate set of reference points is adequate for most situations.
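As an illustration, the sketch below implements the affine variant of this alignment loop; the helper names are our own, and the affine fit anticipates the least-squares solution given in Section 5.2.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 3D affine map taking src towards dst (both (n, 3))."""
    src_h = np.hstack([src, np.ones((len(src), 1))])   # homogeneous, n x 4
    A, *_ = np.linalg.lstsq(src_h, dst, rcond=None)    # 4 x 3 solution
    return A

def apply_affine(A, pts):
    return np.hstack([pts, np.ones((len(pts), 1))]) @ A

def align_set(shapes, reference, n_iters=50, tol=1e-6):
    """shapes: list of (n, 3) reconstructions; reference: (n, 3) approximate
    euclidean positions used to constrain the mean."""
    aligned = [apply_affine(fit_affine(s, reference), s) for s in shapes]
    prev_mean = None
    for _ in range(n_iters):
        mean = np.mean(aligned, axis=0)
        # Constrain the mean by aligning it with the reference set.
        mean = apply_affine(fit_affine(mean, reference), mean)
        # Re-align each example to the constrained mean.
        aligned = [apply_affine(fit_affine(s, mean), s) for s in shapes]
        if prev_mean is not None and np.abs(mean - prev_mean).max() < tol:
            break
        prev_mean = mean
    return aligned, mean
```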

5.1 Projective Case

In the projective case the alignment consists of calculating the projective transformation which minimises the distance (measured in euclidean space) between original and target points. The transformation has 15 degrees of freedom. An initial estimate can be obtained using the method described in Appendix A. Where necessary this can be optimised using a non-linear optimisation on the elements of the projection matrix to minimise the euclidean distance between points.

Since a projective transformation of 3D space is constrained by knowing 5 point positions, the reference set must define at least 5 points in their approximate euclidean positions. These are only required in this training phase, and in most situations this is fairly easy to do, as we usually know the approximate shape of the objects we are studying.
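A minimal sketch of such a refinement step is given below, assuming SciPy is available. The parameterisation (the 15 free elements of a 4x4 matrix with h44 fixed at 1) follows Appendix A, while the function name and the use of scipy.optimize.least_squares are our own.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_projective(H0, x, y):
    """Refine a 4x4 3D projective transform (h44 fixed at 1) so that the
    euclidean distances between the mapped points and the targets y are
    minimised. x, y: (n, 3) point arrays; H0: initial 4x4 estimate."""
    xh = np.hstack([x, np.ones((len(x), 1))])          # homogeneous, n x 4

    def residuals(p):
        H = np.append(p, 1.0).reshape(4, 4)
        yh = xh @ H.T
        return (yh[:, :3] / yh[:, 3:4] - y).ravel()    # euclidean errors

    p0 = (H0 / H0[3, 3]).ravel()[:-1]                  # 15 free parameters
    res = least_squares(residuals, p0)
    return np.append(res.x, 1.0).reshape(4, 4)
```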

5.2 Affine Case

Suppose we have two sets of $n$ matched points $\{\mathbf{x}_i = (x_{i1}, x_{i2}, x_{i3}, 1)^T\}$ and $\{\mathbf{y}_i = (y_{i1}, y_{i2}, y_{i3}, 1)^T\}$ which differ by an unknown affine transformation $A$, i.e.

$$\mathbf{y}_i = A\mathbf{x}_i \quad (i = 1..n).$$

Let $\mathbf{a}_j^T$ be the $j$'th row of $A$ (with $\mathbf{a}_4^T = (0\ 0\ 0\ 1)$). Then the first three rows can be obtained by a least squares solution to the $n$ linear equations

$$\mathbf{a}_j^T\mathbf{x}_i = y_{ij} \quad (j = 1, 2, 3;\ i = 1..n).$$
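In code, this row-by-row least-squares solution might look like the following sketch (our own construction, using homogeneous 4-vectors as in the text):

```python
import numpy as np

def estimate_affine(x, y):
    """x, y: (n, 4) homogeneous points (last coordinate 1) related by y = A x.
    Returns the 4x4 affine matrix A with bottom row (0, 0, 0, 1)."""
    A = np.zeros((4, 4))
    A[3] = (0.0, 0.0, 0.0, 1.0)
    for j in range(3):
        # n linear equations a_j^T x_i = y_ij in the 4 unknowns of row j.
        A[j], *_ = np.linalg.lstsq(x, y[:, j], rcond=None)
    return A
```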


An affine transformation can be constrained by determining 4 points in 3D space, so we could supply the approximate positions of 4 or more points for the reference set. However, unless our cameras have seriously non-square pixels, the factorisation method itself can give a good approximation of the true structure up to scaling [8,9]. Thus we can use one of the reconstructed examples as a reference set.

6 Building a Flexible Model

We build a 3D model using the method of Hill et al [11]. We represent a set of 3D points $\{\mathbf{x}_i = (x_i, y_i, z_i, 1)^T\ (i = 1..n)\}$ as a single $3n$ element vector

$$\mathbf{X} = (x_1, \ldots, x_n, y_1, \ldots, y_n, z_1, \ldots, z_n)^T.$$

Thus the set of $N$ reconstructed objects in the reference frame are given by the $N$ $3n$ element vectors $\{\mathbf{X}_i\ (i = 1..N)\}$. We can calculate the mean of these, $\bar{\mathbf{X}}$, and apply a Principal Component Analysis (PCA) to generate a set of modes of variation. These modes are given by the unit eigenvectors of the covariance matrix of the $\mathbf{X}_i$ which correspond to the $t$ largest eigenvalues (see [1] for more details). A linear model of the landmark points of the objects in the training set is given by

$$\mathbf{X} = \bar{\mathbf{X}} + \Phi\mathbf{b}$$

where $\Phi$ is a $3n \times t$ matrix of eigenvectors representing the modes, and $\mathbf{b}$ is a $t$ element vector of shape parameters. Limits on the shape parameters can be obtained from the statistics of the training set (typically we use limits of about 3 standard deviations).

We have built models using both synthetic data and real images. We found, however, that our current implementations of the projective reconstruction and alignment are too sensitive to noise to build sensible models from real images. The affine case proved considerably more robust. We describe experiments and give quantitative results using both synthetic and real images.
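The model-building step can be sketched as follows; this is a simplified illustration using NumPy, with names of our own choosing and no attempt at an optimised eigen-decomposition.

```python
import numpy as np

def build_shape_model(shapes, t):
    """shapes: list of N aligned (n, 3) point sets.
    Returns the mean 3n-vector, the 3n x t matrix of modes and the t
    largest eigenvalues of the covariance matrix."""
    # Each example as a single 3n-vector (x1..xn, y1..yn, z1..zn).
    X = np.array([np.concatenate([s[:, 0], s[:, 1], s[:, 2]]) for s in shapes])
    mean = X.mean(axis=0)
    C = np.cov(X - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:t]
    return mean, evecs[:, order], evals[order]

# New shapes are generated as X = mean + Phi @ b, with each parameter b_k
# typically limited to about +/- 3 * sqrt(lambda_k).
```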

7 Experiments Using Synthetic Data

As a synthetic example we generated sets of 16 3D points representing the vertices of a car. Figure 1 shows a typical set and indicates the dimensions which we allowed to vary at random. On average the car was 30 units long, 10 units wide and 9 units high. When projected into an image (using a pinhole camera model) we arranged for its projection to be about 200 pixels wide. For this synthetic example we did not hide occluded points - we assumed a wire frame model in which all the points could be located.

Figure 1: Synthetic car data, showing the dimensions which vary in the training set.

7.1 Results for Projective Model

We generated 20 random car structures and projected the points of each into 2 images. In the noise free case reconstructions were perfect up to a projectivity. We aligned the reconstructions as described above, using the known true mean structure as the reference object, and built a statistical model of the variations. Figure 2 shows the most significant mode of variation of the model. This modifies the rear of the car, changing it from a saloon to an estate. In addition there is a tapering caused by the projective transformations allowed in the alignment procedure. Figure 3 shows the second mode of variation, a change in the relative height of the bonnet and roof, again with some tapering effects.

Figure 2: Effect of varying the first shape parameter (saloon to estate + some tapering).
Figure 3: Effect of varying the second shape parameter (relative height change).

7.2 Results for Affine Model

We used similar synthetic data and built models using the affine camera approximation. This proved to be more robust to noise, and we were able to perform quantitative experiments to characterise its performance [18]. The quality of the reconstruction depends on the positional noise in the 2D images, the distance of the cameras from the object relative to the object depth and the angle between the cameras. Figure 4 shows the first three modes of shape variation (z = 500, θ = 45°) of the model reconstructed from noise free data. There is a small amount of shearing caused by the affine transformation allowed in the alignment phase, but most of the variation is that present in the training set.

8 Experiments Using Real Data

8.1 Engineers Vice Results

We took 7 pairs of images of a vice with different jaw openings (Figures 5, 6). On each we marked 19 landmark points. For these experiments we only used points which were visible in all images. We reconstructed assuming an affine camera and trained a statistical model. The reconstruction from the first shape was used as a reference set for the alignment. The model has only one main mode of variation, which represents 91% of the total variance of the training set. This mode, which is illustrated in Figure 7, opens and closes the jaws of the vice - the only degree of freedom affecting the shape present in the training images. Subsequent modes model the noise in the training set.

8.2 Face Image Results

As part of another project [17] we had available images of faces from various individuals with their heads held in different poses (Figure 8).


Figure 4: First three modes of shape variation (noise free case); shape parameters varied between +/-2 s.d.s observed in the training set. (Mode 1: hatchback <-> saloon; Mode 2: front bonnet length change; Mode 3: width of roof change.)

Figure 5: Example of an image of the vice used for training a model.

Figure 6 : Examples of shapes used for training

Figure 7: Most significant mode of vice model shape variation - opening of the jaws.

We selected a pair of images for each of 12 people, and used an automatic method to locate 144 landmarks on each [17]. We then used these sets of landmarks to build a 3D model, assuming an affine camera model and using the first reconstruction as a reference. Since each subject held their head in a different pose in each of the image pairs, we effectively had two views of the same structure. Figure 9 shows the mean of the reconstructed model, and the effect of varying the first shape parameter. The method has successfully reconstructed the structure, and the shape variation gives the main variation in face structure and expression between individuals in the training set. There is, of course, some noise caused by the errors in the landmark locations and the changes in expression between images (they were taken at different times by a single camera), but the overall structure and shape variation is plausible.

Figure 8: Examples of pairs of training shapes for face model

Figure 9: Front and side views of the reconstruction of the mean and first mode of shape variation of a face model (smiling mode); shape parameter b1 shown at -2 s.d., the mean (b1 = 0) and +2 s.d.

9 Discussion and Conclusions

We have demonstrated that 3D flexible models can be generated from pairs of uncalibrated images. Two camera models have been used. Model building using a projective camera model relies upon a good reconstruction of structure from two uncalibrated images, which can be hard to achieve. For more robust reconstructions many images are required, such as from a sequence. The factorisation method gives a far more reliable reconstruction, but requires an affine camera model. This is acceptable when the object is far from the camera relative to its depth, but can cause distortions when the object is nearer.

The examples we have given assume all points are visible in all images. If some points are occluded in some image pairs, we can reformulate the methods to allow for this. Weights can be assigned to each point in each example: 1 if it is present in both images of a pair, 0 if not. We can reconstruct from the points, perform a weighted alignment and apply a weighted PCA to obtain the model. As long as there is sufficient overlap in the visible points across different pairs, we should be able to obtain a complete model. Thus we could build up a full model of a 3D object by adding together pairs of views from different orientations. In the affine reconstruction case the full structure of a single object can be recovered from multiple views, each with some occluded points, using the SVD method [8,9].
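One possible form of the weighted statistics is sketched below; this is our own construction rather than an implementation from the paper. Each coordinate carries a 0/1 visibility weight, the mean ignores missing coordinates, and each covariance entry is estimated only from examples in which both coordinates are present.

```python
import numpy as np

def weighted_shape_model(X, W, t):
    """X: (N, 3n) shape vectors; W: (N, 3n) visibility weights (1 present,
    0 occluded). Returns a weighted mean and the first t modes."""
    W = W.astype(float)
    # Weighted mean: occluded coordinates do not contribute.
    mean = (W * X).sum(axis=0) / np.maximum(W.sum(axis=0), 1e-12)
    D = W * (X - mean)                          # zero out missing coordinates
    C = (D.T @ D) / np.maximum(W.T @ W, 1.0)    # pairwise-weighted covariance
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:t]
    return mean, evecs[:, order], evals[order]
```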

A long term goal is to construct a flexible 3D model given a training set of single uncalibrated images of different examples of a class of objects, taken from arbitrary viewing positions. This would require estimating the mean structure, its allowed variations and the projections required into each image to minimise the errors, an extension of current optimisation based reconstruction methods. This is likely to be a difficult optimisation problem.

Elsewhere we demonstrate that we can estimate the model shape parameters given the 2D point positions in a new view of a modelled object [18]. The ability to estimate shape parameters from new images will allow classification (such as into different types of car) or certain measurements to be made. In addition this will allow us to implement a local search strategy similar to Active Shape Models [1]. An initial estimate of the projected model point positions in an image will be refined by locating better estimates nearby in the image and updating the projection and shape parameters accordingly. The approach has proved to be a fast and robust method of finding instances of 2D Point Distribution Models in images, and we anticipate an analogous method will allow us to locate projections of variable 3D objects.

Appendix A: Recovering the Projective Transformation between Sets of Points

Suppose we have two sets of matched points $\{\mathbf{x}_i = (x_{i1}, x_{i2}, x_{i3}, 1)^T\}$ and $\{\mathbf{y}_i = (y_{i1}, y_{i2}, y_{i3}, 1)^T\}$ which differ by an unknown projective transformation $H$, i.e. $\omega_i\mathbf{y}_i = H\mathbf{x}_i$ $(i = 1..n)$. We can recover $H$ as follows.

For the $i$'th point we have

$$\omega_i = \mathbf{h}_4^T\mathbf{x}_i, \qquad \text{where } H = \begin{pmatrix} \mathbf{h}_1^T \\ \mathbf{h}_2^T \\ \mathbf{h}_3^T \\ \mathbf{h}_4^T \end{pmatrix}.$$

Substituting in for $\omega_i$ and assuming that $h_{44} = 1$ gives 3 linear constraints on the other 15 elements of $H$:

$$\mathbf{h}_j^T\mathbf{x}_i - y_{ij}\,\mathbf{h}_4^T\mathbf{x}_i = 0 \qquad (j = 1, 2, 3).$$

Thus we have a total of $3n$ equations which can be solved using least squares approaches if necessary. If we have only $m < n$ known matches (if some points are occluded for instance) then we have only $3m$ equations, but a solution can still be obtained if $m \ge 5$.

Note that this method is not directly minimising the sum of euclidean distance errors, but gives an approximation acceptable for our purposes. More accurate estimates require a non-linear optimisation procedure.
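A sketch of this linear solution is given below; the ordering of the 15 unknowns and the function name are our own, but the constraints are those derived above with h44 = 1.

```python
import numpy as np

def estimate_projective(x, y):
    """x, y: (n, 4) homogeneous 3D points (last coordinate 1) related by
    w_i y_i = H x_i. Solves the 3n linear constraints for the 15 unknown
    elements of H (h44 = 1); needs at least 5 matches in general position."""
    n = len(x)
    M = np.zeros((3 * n, 15))
    rhs = np.zeros(3 * n)
    for i in range(n):
        for j in range(3):
            r = 3 * i + j
            M[r, 4 * j:4 * j + 4] = x[i]           # h_j . x_i terms
            M[r, 12:15] = -y[i, j] * x[i, :3]      # -y_ij * (h_4 . x_i), h44 fixed
            rhs[r] = y[i, j]                       # y_ij * h44 * x_i4 = y_ij
    p, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return np.vstack([p[:12].reshape(3, 4), np.append(p[12:], 1.0)])
```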

Acknowledgements

Tim Cootes is funded by an EPSRC Postdoctoral Fellowship. Enea Di Mauro is funded by an EPSRC Project Grant. Andreas Lanitis is funded by a University of Manchester Research Studentship and an ORS award.

References

[1] T.F.Cootes, C.J.Taylor, D.H.Cooper and J.Graham, Active Shape Models - Their Training and Application. Computer Vision and Image Understanding, Vol.61, No.1, 1995, pp.38-59.

[2] T.F.Cootes, C.J.Taylor, A.Lanitis, Active Shape Models: Evaluation of a Multi-Resolution Method for Improving Image Search, in Proc. British Machine Vision Conference, (Ed. E.Hancock) BMVA Press, 1994, pp.327-338.

[3] X.Shen, D.Hogg, 3D Shape Recovery Using a Deformable Model, in Proc. British Machine Vision Conference, (Ed. E.Hancock) BMVA Press, 1994, pp.387-396.

[4] R.Hartley, R.Gupta and T.Chang, Stereo from Uncalibrated Cameras, in Proc. CVPR'92, IEEE Press, 1992, pp.761-764.

[5] R.I.Hartley, Projective Reconstruction and Invariants from Multiple Images. IEEE PAMI, Vol.16, No.10, 1994, pp.1036-1041.

[6] O.Faugeras, What can be seen in three dimensions with an uncalibrated stereo rig?, in Proc. European Conference on Computer Vision, 1992, pp.563-578.

[7] R.Mohr, B.Boubakeur, P.Brand, Accurate Projective Reconstruction, in [16], pp.257-275.

[8] C.Tomasi, T.Kanade, Shape and Motion from Image Streams under Orthography: a Factorization Method. IJCV, Vol.9, No.2, 1992, pp.137-154.

[9] C.J.Poelman, T.Kanade, A Paraperspective Factorization Method for Shape and Motion Recovery. Proc. ECCV 1994, Lecture Notes in Computer Science, Vol.801, (Ed. J.O.Eklundh), pp.97-108.

[10] O.Faugeras, Three Dimensional Computer Vision, A Geometric Viewpoint, MIT Press, 1993.

[11] A.Hill, A.Thornham, C.J.Taylor, Model Based Interpretation of 3D Medical Images, in Proc. British Machine Vision Conference 1993, Vol.2, (Ed. J.Illingworth) BMVA Press, pp.339-348.

[12] K.Kanatani, Geometric Computation for Machine Vision, Oxford University Press, 1993.

[13] A.Blake, R.Curwen, A.Zisserman, Affine-invariant contour tracking with automatic control of spatiotemporal scale, in Proc. Fourth International Conference on Computer Vision, IEEE Computer Society Press, 1993, pp.66-75.

[14] P.A.Beardsley, A.Zisserman and D.W.Murray, Navigation using Affine Structure from Motion. Proc. ECCV 1994 (Vol.2), Lecture Notes in Computer Science 801, (Ed. J.O.Eklundh), Springer-Verlag, 1994, pp.85-96.

[15] G.Sparr, A Common Framework for Kinetic Depth Reconstruction and Motion for Deformable Objects. Proc. ECCV 1994 (Vol.2), Lecture Notes in Computer Science 801, (Ed. J.O.Eklundh), Springer-Verlag, 1994, pp.471-482.

[16] J.L.Mundy, A.Zisserman, D.Forsyth (Eds.), Applications of Invariance in Computer Vision, Lecture Notes in Computer Science 825, Springer-Verlag, 1993.

[17] A.Lanitis, C.J.Taylor, T.F.Cootes, A Unified Approach to Coding and Interpreting Face Images. Proc. ICCV 1995, pp.368-373.

[18] T.F.Cootes, Building Flexible 3D Shape Models from Uncalibrated Camera Images. Internal Report, Dept. Medical Biophysics, Manchester University, England, March 1995.