
Nonlinear Spatial Normalization Using Basis Functions

John Ashburner* and Karl J. Friston

Functional Imaging Laboratory, Wellcome Department of Cognitive Neurology, Institute of Neurology, London, United Kingdom


Abstract: We describe a comprehensive framework for performing rapid and automatic nonlabel-based nonlinear spatial normalizations. The approach adopted minimizes the residual squared difference between an image and a template of the same modality. In order to reduce the number of parameters to be fitted, the nonlinear warps are described by a linear combination of low spatial frequency basis functions. The objective is to determine the optimum coefficients for each of the bases by minimizing the sum of squared differences between the image and template, while simultaneously maximizing the smoothness of the transformation using a maximum a posteriori (MAP) approach. Most MAP approaches assume that the variance associated with each voxel is already known and that there is no covariance between neighboring voxels. The approach described here attempts to estimate this variance from the data, and also corrects for the correlations between neighboring voxels. This makes the same approach suitable for the spatial normalization of both high-quality magnetic resonance images, and low-resolution noisy positron emission tomography images. A fast algorithm has been developed that utilizes Taylor's theorem and the separable nature of the basis functions, meaning that most of the nonlinear spatial variability between images can be automatically corrected within a few minutes. Hum. Brain Mapping 7:254–266, 1999. © 1999 Wiley-Liss, Inc.

Key words: registration; anatomy; imaging; stereotaxy; basis functions; spatial normalization; PET; MRI; functional mapping


INTRODUCTION

This paper concerns the problem of nonlinear spatial normalization: namely, how to map a single subject's brain image into a standard space. The solution of this problem allows for a wide range of voxel-based analyses and facilitates the comparison of different subjects and data bases. The problem of spatial normalization is not a trivial one; indeed, on some anatomical scales it is not clear that a solution even exists.

A fundamental advantage of using spatially normalized images is that activations can be reported according to a set of meaningful Euclidian coordinates within a standard space [Fox, 1995]. New results can be readily incorporated into ongoing brain atlas and data base projects such as that being developed by the International Consortium for Human Brain Mapping (ICBM) [Mazziotta et al., 1995]. The most commonly adopted coordinate system within the brain imaging community is that described by the atlas of Talairach and Tournoux [1988].

Grant sponsor: Wellcome Trust.
*Correspondence to: John Ashburner, Functional Imaging Laboratory, Wellcome Department of Cognitive Neurology, Institute of Neurology, 12 Queen Square, London WC1N 3BG, UK. E-mail: j.ashburner@fil.ion.ucl.ac.uk
Received 21 October 1997; Accepted 25 January 1999

When whole-brain structural images (typically high-resolution magnetic resonance imaging; MRI) of the subject are available in addition to the functional images, the images can be coregistered using any one of a number of methods for intermodality registration [Pelizzari et al., 1988; Woods et al., 1992; Studholme et al., 1995; Collignon et al., 1995; Ashburner and Friston, 1997]. This allows the spatial transformations that warp the images to the reference space to be determined from the structural images. These warps can then be applied to the functional images. Because there are only six rigid body parameters required to map between the structural and functional images, the coregistration parameters can be determined fairly accurately. The structural images should have higher spatial resolution, less noise, and more structural information than the functional images, allowing a more accurate nonlinear registration to be obtained.

However, not every functional imaging unit has ready access to a high-quality magnetic resonance (MR) scanner, so for many functional imaging studies there are no structural images of the subject available to the researcher. In this case, it is necessary to determine the required warps based solely on the functional images. These images may have a limited field of view, contain very little useful signal, or be particularly noisy. An ideal spatial normalization routine would need to be robust enough to cope with this type of data.

Nonlinear spatial transformations can be broadly divided into label-based and nonlabel-based. Label-based techniques identify homologous features (labels) in the image and template and find the transformations that best superpose them. The labels can be points, lines, or surfaces. Homologous features are often identified manually, but this process is time-consuming and subjective. Another disadvantage of using points as landmarks is that there are very few readily identifiable discrete points in the brain. A similar problem is faced during identification of homologous lines. However, surfaces are more readily identified, and in many instances they can be extracted automatically (or at least semiautomatically). Once they are identified, the spatial transformation is effected by bringing the homologies together. If the labels are points, then the required transformations at each of those points are known. Between the points, the deforming behavior is not known, so it is forced to be as "smooth" as possible. There are a number of methods for modeling this smoothness. The simplest models include fitting splines through the points in order to minimize bending energy [Bookstein, 1989]. More complex forms of interpolation are often used when the labels are surfaces. For example, Thompson and Toga [1996] mapped surfaces together using a fluid model.

Nonlabel-based approaches identify a spatial transformation that minimizes some index of the difference between an object and a template image, where both are treated as unlabeled continuous processes. The matching criterion is usually based upon minimizing the sum of squared differences or maximizing the correlation coefficient between the images. For this criterion to be successful, it requires the template to appear as a warped version of the image. In other words, there must be correspondence in the gray levels of the different tissue types between the image and template.

There are a number of approaches to nonlabel-based spatial normalization. A potentially enormous number of parameters are required to describe the nonlinear transformations that warp two images together (i.e., the problem is very high-dimensional). The forms of spatial normalization tend to differ in how they cope with the large number of parameters.

Some have abandoned conventional optimization approaches, and use viscous fluid models [Christensen et al., 1994, 1996] to describe the warps. In these models, finite-element methods are used to solve the partial differential equations that model one image as it "flows" to the same shape as the other. The major advantage of these methods is that they are able to account for large nonlinear displacements and also ensure that the topology of the warped image is preserved, but they do have the disadvantage that they are computationally expensive. Not every unit in the functional imaging field has the capacity to routinely perform spatial normalizations using these methods.

Others adopt a multiresolution approach whereby only a few of the parameters are determined at any one time [Collins et al., 1994b]. Usually, the entire volume is used to determine parameters that describe global low-frequency deformations. The volume is then subdivided, and slightly higher-frequency deformations are found for each subvolume. This continues until the desired deformation precision is achieved.

Another approach is to reduce the number of parameters that model the deformations. Some groups simply use only a 9- or 12-parameter affine transformation to spatially normalize their images, accounting for differences in position, orientation, and overall brain size. Low spatial frequency global variability in head shape can be accommodated by describing deformations by a linear combination of low-frequency basis functions [Amit et al., 1991]. The small number of parameters will not allow every feature to be matched exactly, but it will permit the global head shape to be modeled. The method described in this paper is one such approach. The rationale for adopting a low-dimensional approach is that there is not necessarily a one-to-one mapping between any pair of brains. Different subjects have different patterns of gyral convolutions, and even if gyral anatomy can be matched exactly, this is no guarantee that areas of functional specialization will be matched in a homologous way. For the purpose of averaging signals from functional images of different subjects, very high-resolution spatial normalization may be unnecessary or unrealistic.

The deformations required to transform images to the same space are not clearly defined. Unlike rigid body transformations, where the constraints are explicit, those for nonlinear warping are more arbitrary. Without any constraints it is of course possible to transform any image such that it matches another exactly. The issue is therefore less about the nature of the transformation and more about defining constraints or priors under which a transformation is effected. The validity of a transformation can usually be reduced to the validity of these priors. Priors are normally incorporated using some form of Bayesian scheme, using estimators such as the maximum a posteriori (MAP) estimate or the minimum variance estimate (MVE). The MAP estimate is the single solution that has the highest a posteriori probability of being correct, and is the estimate that we attempt to obtain in this paper. The MVE was used by Miller et al. [1993], and is the solution that is the conditional mean of the posterior. The MVE is probably more appropriate than the MAP estimate for spatial normalization. However, if the errors associated with the parameter estimates and also the priors are normally distributed, then the MVE and the MAP estimates are identical.

The remainder of this paper is organized as follows. The theory section describes how the registrations are performed, beginning with a description of the Gauss-Newton optimization scheme employed. Following this, the paper describes specific implementational details for determining the optimal linear combination of spatial basis functions. This involves using properties of Kronecker tensor products for the rapid computation of the curvature matrix used by the optimization. A Bayesian framework using priors based upon membrane energy is then incorporated into the registration model. The following section provides an evaluation focusing on the utility of nonlinear deformations per se, and then the use of priors in a Bayesian framework. There then follows a discussion of the issues raised, and of those for future consideration.

THEORY

There are two steps involved in registering any pair of images together. There is the registration itself, whereby the parameters describing a transformation are determined. Then there is the transformation, where one of the images is transformed according to the set of parameters. The registration step involves matching the object image to some form of standardized template image. Unlike in the work of Christensen et al. [1994, 1996] or the segmentation work of Collins et al. [1994a, 1995], spatial normalization requires that the images themselves are transformed to the space of the template, rather than a transformation being determined that transforms the template to the individual images.

The nonlinear spatial normalization approach described here assumes that the images have already been approximately registered with the template according to a 9-parameter [Collins et al., 1994b] or 12-parameter [Ashburner et al., 1997] affine registration. This section will illustrate how the parameters describing global shape differences between the images and template are determined.

We begin by introducing a simple method of optimization based upon partial derivatives. Then the parameters describing the spatial transformations are introduced. In the current approach, the nonlinear warps are modeled by linear combinations of smooth basis functions, and a fast algorithm for determining the optimum combination of basis functions is described. For speed and simplicity, a relatively small number of parameters (approximately 1,000) are used to describe the nonlinear components of the registration.

The optimization method is extended to utilize Bayesian statistics in order to obtain a more robust fit. This requires knowledge of the errors associated with the parameter estimates, and also knowledge of the a priori distribution from which the parameters are drawn. This distribution is modeled in terms of a cost function based on the membrane energy of the deformations. We conclude with some comments on extending the schemes when matching images from different modalities.

Basic optimization algorithm

The objective of optimization is to determine a set of parameters for which some function is minimized (or maximized). One of the simplest cases involves determining the optimum parameters for a model in order to minimize the sum of squared differences between the model and a set of real-world data (χ²). Usually there are many parameters in the model, and it is not possible to exhaustively search through the whole parameter space. The usual approach is to make an initial estimate, and to iteratively search from there. At each iteration, the model is evaluated using the current estimates, and χ² is computed. A judgment is then made about how the parameters should be modified, before continuing on to the next iteration.

The image-registration approach described here is essentially an optimization. In the simplest case, one image (the object image) is spatially transformed so that it matches another (the template image), by minimizing χ². In order for χ² minimization to work, it is important that a distorted version of the object image can be adequately modeled by the template (or templates). When using a single template image, it should have the same contrast distribution as the images that are to be matched to it. The parameters that are optimized are those that describe the spatial transformation (although there are often other nuisance parameters required by the model, such as intensity scaling parameters). The algorithm of choice [Friston et al., 1995] is one that is similar to Gauss-Newton optimization, and it is illustrated here:

Suppose that e_i(p) is the function describing the difference between the object and template images at voxel i, when the vector of model parameters has values p. For each voxel (i), a first approximation of Taylor's theorem can be used to estimate the value that this difference will take if the parameters p are increased by t:

$$e_i(p + t) \simeq e_i(p) + t_1 \frac{\partial e_i(p)}{\partial p_1} + t_2 \frac{\partial e_i(p)}{\partial p_2} + \cdots$$

This allows the construction of a set of simultaneous equations (of the form At ≃ −e) for estimating the values that t should assume in order to minimize Σ_i e_i(p + t)²:

$$\begin{pmatrix}
\frac{\partial e_1(p)}{\partial p_1} & \frac{\partial e_1(p)}{\partial p_2} & \cdots \\
\frac{\partial e_2(p)}{\partial p_1} & \frac{\partial e_2(p)}{\partial p_2} & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}
\begin{pmatrix} t_1 \\ t_2 \\ \vdots \end{pmatrix}
\simeq
-\begin{pmatrix} e_1(p) \\ e_2(p) \\ \vdots \end{pmatrix}.$$

From this we can derive an iterative scheme for improving the parameter estimates. For iteration n, the parameters p are updated as:

$$p^{(n+1)} = p^{(n)} - (A^T A)^{-1} A^T e \qquad (1)$$

where

$$A = \begin{pmatrix}
\frac{\partial e_1(p)}{\partial p_1} & \frac{\partial e_1(p)}{\partial p_2} & \cdots \\
\frac{\partial e_2(p)}{\partial p_1} & \frac{\partial e_2(p)}{\partial p_2} & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}
\quad \text{and} \quad
e = \begin{pmatrix} e_1(p) \\ e_2(p) \\ \vdots \end{pmatrix}.$$

This process is repeated until χ² can no longer be decreased, or for a fixed number of iterations. There is no guarantee that the best global solution will be reached, since the algorithm can get caught in a local minimum. The number of potential local minima is decreased by working with smooth images. This also has the effect of making the first-order Taylor approximation more accurate for larger displacements.

In practice, A^T A and A^T e from Eq. (1) are computed "on the fly" for each iteration. By computing these matrices using only a few rows of A and e at a time, much less computer memory is required than is necessary for storing the whole of matrix A. Also, the partial derivatives ∂e_i(p)/∂p_j can be rapidly computed from the gradients of the images using the chain rule.¹ These calculations will be illustrated more fully in the next section.

¹Trilinear interpolation of the voxel lattice (rather than the sampling lattice) is used to resample the images at the desired coordinates. Gradients of the images are obtained at the same time, using a finite-difference method on the same lattice. No assumptions are made about voxel values that lie outside the field of view of image f. Points where y_i falls outside the domain of f are not included in the computations.
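
As a concrete illustration of Eq. (1), the following is a minimal NumPy sketch of the iterative scheme. The callback name residual_and_jacobian is ours and is assumed to return the residual vector e and the Jacobian A at the current parameters; it is not part of the method described in the paper.

```python
import numpy as np

def gauss_newton(p, residual_and_jacobian, n_iter=12):
    """Iteratively apply the update of Eq. (1).

    `residual_and_jacobian(p)` is assumed to return the residual vector e
    (image/template differences at the sampled voxels) and the matrix A of
    partial derivatives de_i/dp_j evaluated at the current parameters p.
    """
    for _ in range(n_iter):
        e, A = residual_and_jacobian(p)
        # p <- p - (A'A)^-1 A'e, solved via the normal equations
        p = p - np.linalg.solve(A.T @ A, A.T @ e)
    return p
```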

Parameterizing the spatial transformations

Spatial transformations are described by a linear combination of smooth basis functions. The choice of basis functions depends partly upon how translations at the boundaries should behave. If points at the boundary over which the transformation is computed are not required to move in any direction, then the basis functions should consist of the lowest frequencies of the three-dimensional discrete sine transform (DST). If there are to be no constraints at the boundaries, then a three-dimensional discrete cosine transform (DCT) is more appropriate. Both of these transforms use the same set of basis functions to represent warps in each of the directions. Alternatively, a mixture of DCT and DST basis functions can be used to constrain translations at the surfaces of the volume to be parallel to the surface only (sliding boundary conditions). By using a different combination of DCT and DST basis functions, the corners of the volume can be fixed and the remaining points on the surface can be free to move in all directions (bending boundary conditions) [Christensen, 1994]. The basis functions described here are the lowest-frequency components of the three (or two)-dimensional discrete cosine transform. In one dimension, the DCT of a function is generated by premultiplication with the matrix B^T, where the elements of the M by J matrix B are defined by:

$$b_{m,1} = \frac{1}{\sqrt{M}}, \qquad m = 1 \ldots M$$
$$b_{m,j} = \sqrt{\frac{2}{M}} \cos\left(\frac{\pi (2m-1)(j-1)}{2M}\right), \qquad m = 1 \ldots M,\ j = 2 \ldots J. \qquad (2)$$

The two-dimensional DCT basis functions are shown in Figure 1, and a schematic example of a deformation based upon the DCT is shown in Figure 2.
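
For reference, a small sketch (not taken from the paper) of Eq. (2) in NumPy; dct_basis returns the M × J matrix B whose columns are the lowest-frequency DCT basis functions.

```python
import numpy as np

def dct_basis(M, J):
    """M x J matrix B of Eq. (2): the J lowest-frequency DCT basis functions
    sampled at M points (the columns are orthonormal, so B.T @ B is identity)."""
    m = np.arange(1, M + 1)[:, None]   # sample index m = 1..M
    j = np.arange(1, J + 1)[None, :]   # frequency index j = 1..J
    B = np.sqrt(2.0 / M) * np.cos(np.pi * (2 * m - 1) * (j - 1) / (2 * M))
    B[:, 0] = 1.0 / np.sqrt(M)         # the j = 1 column is the constant term
    return B
```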

The optimized parameters can be separated into a number of distinct groups. The most important are those for describing translations in the three orthogonal directions (t1, t2, and t3). The model for defining the nonlinear warps uses deformations consisting of a linear combination of basis functions. In three dimensions, the transformation from coordinates x_i to coordinates y_i is:

$$y_{1,i} = x_{1,i} - u_{1,i} = x_{1,i} - \sum_{j=1}^{J} t_{j,1}\, b_{j,1}(x_i)$$
$$y_{2,i} = x_{2,i} - u_{2,i} = x_{2,i} - \sum_{j=1}^{J} t_{j,2}\, b_{j,2}(x_i)$$
$$y_{3,i} = x_{3,i} - u_{3,i} = x_{3,i} - \sum_{j=1}^{J} t_{j,3}\, b_{j,3}(x_i)$$

where t_{j,d} is the jth coefficient for dimension d, and b_{j,d}(x) is the jth basis function at position x for dimension d.

The optimization involves minimizing the sum of squared differences between the object image (f) and a template image (g). The images may be scaled differently, so an additional parameter (w) is needed to accommodate this difference. The minimized function is then:

$$\sum_i \left( f(y_i) - w\, g(x_i) \right)^2.$$

Each element of vector e (from the previous section) contains f(y_i) − wg(x_i). Derivatives of the function f(y_i) − wg(x_i) with respect to each parameter are required in order to compute matrix A. These can be obtained using the chain rule:

$$\frac{\partial f(y_i)}{\partial t_{j,1}} = \frac{\partial f(y_i)}{\partial y_{1,i}} \frac{\partial y_{1,i}}{\partial t_{j,1}} = \frac{\partial f(y_i)}{\partial y_{1,i}}\, b_j(x_i)$$
$$\frac{\partial f(y_i)}{\partial t_{j,2}} = \frac{\partial f(y_i)}{\partial y_{2,i}} \frac{\partial y_{2,i}}{\partial t_{j,2}} = \frac{\partial f(y_i)}{\partial y_{2,i}}\, b_j(x_i)$$
$$\frac{\partial f(y_i)}{\partial t_{j,3}} = \frac{\partial f(y_i)}{\partial y_{3,i}} \frac{\partial y_{3,i}}{\partial t_{j,3}} = \frac{\partial f(y_i)}{\partial y_{3,i}}\, b_j(x_i).$$

Figure 1. The lowest-frequency basis functions of a two-dimensional discrete cosine transform.

In order to adopt the Gauss-Newton optimization strategy, A^T A and A^T e need to be computed on each iteration. Assuming that the lowest J frequencies of a three-dimensional DCT are used to define the warps, and there are I sampled positions in the image, then the theoretical size of the matrix A is I × (3J³ + 1). The straightforward computation of A^T A and A^T e would be very time consuming. We now describe how this can be done in a much more expedient manner.

A fast algorithm

The fast algorithm for computing A^T A and A^T e is shown in Figure 3. The remainder of this section explains the matrix terminology used, and why it is so efficient.

For simplicity, the algorithm is only illustrated in two dimensions, although it has been implemented to estimate warps in three dimensions. Images f and g are considered as matrices F and G, respectively. Row m of F is denoted by f_{m,:}, and column n by f_{:,n}. The basis functions used by the algorithm are generated in separable form from matrices B1 and B2. By treating the transform coefficients as matrices T1 and T2, the deformation fields can be rapidly constructed by computing B1 T1 B2^T and B1 T2 B2^T.
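
In code, this separable construction of the deformation fields amounts to two small matrix products. The following is a sketch for the two-dimensional case; B1 and B2 could be the basis matrices produced by the dct_basis sketch given earlier.

```python
import numpy as np

def deformation_fields(B1, B2, T1, T2):
    """Displacement fields from the coefficient matrices T1 (J1 x J2) and
    T2 via the separable products B1 T1 B2' and B1 T2 B2' described above."""
    U1 = B1 @ T1 @ B2.T    # displacements along the first image dimension
    U2 = B1 @ T2 @ B2.T    # displacements along the second image dimension
    return U1, U2
```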

Between each iteration, image F is resampled according to the latest parameter estimates. The derivatives of F are also resampled to give ∇1F and ∇2F. The ith elements of these matrices contain f(y_i), ∂f(y_i)/∂y_{1,i}, and ∂f(y_i)/∂y_{2,i}, respectively.
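
A hedged sketch of how this resampling step might look, along the lines of the footnote (trilinear interpolation, finite-difference gradients). It assumes scipy is available and that g has already been sampled at the points x_i; the helper name is ours, and the footnote's exclusion of points falling outside the field of view is not implemented here.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def residuals_and_gradients(f, g, y, w):
    """Resample the object image f and its gradients at the warped
    coordinates y (an ndim x N array), returning the residuals
    e_i = f(y_i) - w g(x_i) and the resampled gradients."""
    f_y = map_coordinates(f, y, order=1)                 # trilinear interpolation
    grad_f = np.gradient(f)                              # finite differences on the lattice
    grad_y = [map_coordinates(d, y, order=1) for d in grad_f]
    e = f_y - w * g                                      # g already sampled at the x_i
    return e, grad_y

# A column of A then follows from the chain rule above: for example, the
# column for t_{j,1} is grad_y[0] * b_j, where b_j holds the j-th basis
# function evaluated at the sampled points x_i.
```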

The notation diag(−∇1 f_{:,m}) B1 simply means multiplying each element of row i of B1 by −∇1 f_{i,m}, and the symbol "⊗" refers to the Kronecker tensor product. If B2 is a matrix of order M × J, and B1 is a second matrix, then:

$$B_2 \otimes B_1 = \begin{pmatrix} b_{2_{1,1}} B_1 & \cdots & b_{2_{1,J}} B_1 \\ \vdots & \ddots & \vdots \\ b_{2_{M,1}} B_1 & \cdots & b_{2_{M,J}} B_1 \end{pmatrix}.$$

that it utilizes some of the useful properties of Kro-

Figure 2.For the two-dimensional case, the deformation field consists of two scalar fields: one for horizontaldeformations, and the other for vertical deformations. Images at left show the deformation fields as alinear combination of the basis images (see Fig. 1). Center column: Deformations in a more intuitivesense. The deformation field is applied by overlaying it on the object image, and resampling (right).

r Nonlinear Spatial Normalization r

r 259 r

Page 7: Nonlinear Spatial Normalization Using Basis Functionsweb.mit.edu/swg/ImagingPubs/Normalization/ash... · normalization of both high-quality magnetic resonance images, and low-resolution

necker tensor products. This is especially importantwhen the algorithm is implemented in three dimen-sions. The performance enhancement results from areordering of a set of operations such as (B3 ^ B2 ^ B1))T

(B3 ^ B2 ^ B1), to the equivalent (B3T B3) ^ (B2

T B2) ^

(B1T B1). Assuming that the matrices B3, B2, and B1 all

have order M 3 J, then the number of floating-pointoperations is reduced from M3J3(J3 1 2) to approxi-mately 3M(J2 1 J) 1 J6. If M equals 32, and J equals 4,we expect a performance increase of about a factor of20,000. The limiting factor to the algorithm is no longerthe time taken to create the curvature matrix (ATA),but is now the amount of memory required to store it,and the time taken to invert it.
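
The reordering relies on a standard property of Kronecker products and can be checked numerically with small, arbitrary matrices (a sketch; the sizes are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B1 = rng.standard_normal((6, 3))
B2 = rng.standard_normal((5, 3))
B3 = rng.standard_normal((4, 2))

K = np.kron(B3, np.kron(B2, B1))
lhs = K.T @ K
rhs = np.kron(B3.T @ B3, np.kron(B2.T @ B2, B1.T @ B1))
print(np.allclose(lhs, rhs))   # True: the two orderings give the same matrix
```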

A maximum a posteriori solution

Without regularization, it is possible to introduce unnecessary deformations that only reduce the residual sum of squares by a tiny amount. This could potentially make the algorithm very unstable. Regularization is achieved using Bayesian statistics.

Bayes' rule can be expressed in the continuous form as p(p|b) ∝ p(b|p) p(p), where p(p) is the prior probability of p being true, p(b|p) is the conditional probability that b is observed given that p is true, and p(p|b) is the posterior probability of p being true, given that measurement b has been made. The maximum a posteriori (MAP) estimate for parameters p is the mode of p(p|b). For our purposes, p(p) represents a known prior probability distribution from which the parameters are drawn, p(b|p) is the likelihood of obtaining the data b given the parameters (the maximum likelihood estimate), and p(p|b) is the function to be maximized.

A probability is related to its Gibbs form by p(a) ∝ e^{−H(a)}. Therefore, the posterior probability is maximized when its Gibbs form is minimized. This is equivalent to minimizing H(b|p) + H(p). In this expression, H(b|p) is related to the residual sum of squares. If we assume that the parameters are drawn from a zero mean, multinormal distribution described by a covariance matrix C_0, then H(p) is simply given by p^T C_0^{-1} p.

We previously illustrated a method of incorporating knowledge about a priori parameter distributions into the Gauss-Newton optimization scheme [Ashburner et al., 1997]. For a zero mean prior distribution, the iterative scheme is:

$$p^{(n+1)} = (C_0^{-1}\sigma^2 + A^T A)^{-1} (A^T A\, p^{(n)} - A^T e) \qquad (3)$$

where:

$$\sigma^2 = \sum_{i=1}^{I} e_i(p)^2 / \nu \qquad (4)$$

ν refers to the number of degrees of freedom after correction for correlations between neighboring voxels.

Figure 3. Two-dimensional illustration of the fast algorithm for computing A^T A (a) and A^T e (b).
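
A minimal sketch of one iteration of Eqs. (3) and (4). C0_inv and nu are assumed to be supplied (the construction of C_0^{-1} is described in the next subsection), and the function name is ours.

```python
import numpy as np

def map_update(p, A, e, C0_inv, nu):
    """One iteration of Eq. (3), with sigma^2 estimated from the current
    residuals as in Eq. (4); nu is the effective number of degrees of freedom."""
    sigma2 = float(e @ e) / nu                     # Eq. (4)
    AtA = A.T @ A
    return np.linalg.solve(C0_inv * sigma2 + AtA,  # (C0^-1 s^2 + A'A)^-1 ...
                           AtA @ p - A.T @ e)      # ... (A'A p - A'e)
```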

The a priori distribution

The first requirement for a MAP approach is to define some form of prior distribution for the parameters. For a simple linear approach, the priors consist of an a priori estimate of the mean of the parameters (assumed to be zero), and also a covariance matrix describing the distribution of the parameters about this mean. There are many possible forms for modeling these priors, each of which refers to some type of "energy" term. The form of regularization described here is based upon the membrane energy or Laplacians of the deformation field [Amit et al., 1991; Gee et al., 1997]. Two other types of linear regularization (bending energy and linear-elastic energy) are described in the Appendix. None of these schemes enforce a strict one-to-one mapping between the object and template images, but this makes little difference for the small deformations that we are interested in here.

In three dimensions, the membrane energy of the deformation field u is:

$$\sum_i \sum_{j=1}^{3} \sum_{k=1}^{3} \lambda \left( \frac{\partial u_{j,i}}{\partial x_{k,i}} \right)^2$$

where λ is simply a scaling constant. The membrane energy can be computed from the coefficients of the basis functions by t1^T H t1 + t2^T H t2 + t3^T H t3, where t1, t2, and t3 refer to vectors containing the parameters describing translations in the three dimensions. The matrix H is defined by:

$$H = \lambda\, (B_3'^T B_3') \otimes (B_2^T B_2) \otimes (B_1^T B_1) + \lambda\, (B_3^T B_3) \otimes (B_2'^T B_2') \otimes (B_1^T B_1) + \lambda\, (B_3^T B_3) \otimes (B_2^T B_2) \otimes (B_1'^T B_1')$$

where the notation B1' refers to the first derivatives of B1.

Assuming that the parameters consist of (t1^T t2^T t3^T w)^T, matrix C_0^{-1} from Eq. (3) can be constructed from H by:

$$C_0^{-1} = \begin{pmatrix} H & 0 & 0 & 0 \\ 0 & H & 0 & 0 \\ 0 & 0 & H & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \qquad (5)$$

H is all zeros, except for the diagonal. Elements on the diagonal represent the reciprocal of the a priori variance of each parameter, and each is given by:

$$h_{j+J(k+Jl)} = \lambda \pi^2 M^{-2} \left( (j-1)^2 + (k-1)^2 + (l-1)^2 \right)$$

where M is the dimension of the DCT (see Eq. 2), and J is the number of low-frequency coefficients used in any dimension.
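
A sketch of how these diagonal penalties and the block-diagonal C_0^{-1} of Eq. (5) might be assembled. The flattening order of the (j, k, l) coefficients is a convention choice of this sketch, and the helper names are ours.

```python
import numpy as np

def membrane_penalties(M, J, lam):
    """Diagonal of H: lam * pi^2 * M^-2 * ((j-1)^2 + (k-1)^2 + (l-1)^2) for
    each coefficient with DCT frequencies (j, k, l)."""
    f = np.arange(J)                                  # (j-1), (k-1), (l-1)
    jj, kk, ll = np.meshgrid(f, f, f, indexing='ij')
    return (lam * np.pi ** 2 / M ** 2 * (jj ** 2 + kk ** 2 + ll ** 2)).ravel()

def C0_inverse(h):
    """Block-diagonal C0^-1 of Eq. (5): one copy of H per direction, plus a
    zero entry for the unconstrained intensity-scaling parameter w."""
    n = h.size
    C = np.zeros((3 * n + 1, 3 * n + 1))
    for b in range(3):
        idx = slice(b * n, (b + 1) * n)
        C[idx, idx] = np.diag(h)
    return C
```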

Values of λ that are too large will provide too much regularization and result in underestimated deformations. If the values are too small, there will not be enough regularization and the resulting deformations will include a large amount of noise. There is no simple way of estimating what the best values for these constants should be.

In the absence of known priors, the membrane energy provides a useful model in which we assume that the probability of a particular set of parameters arising is inversely related to the membrane energy associated with that set. Clearly this model is somewhat ad hoc, but it is a useful and sensible one. If the true prior distribution of the parameters is known (derived from a large number of subjects), then C_0 could be an empirically determined covariance matrix describing this distribution. This approach would have the advantage that the resulting deformations are more typically "brain-like," and so increase the face validity of the approach.

Templates and intensity transformations

So far, only a single intensity scaling parameter (w) has been considered. This is most effective when there is a linear relationship between the images. However, for spatially normalizing some images, it is necessary to include other parameters describing intensity transformations.

The optimization can be assumed to minimize two sets of parameters: those that describe spatial transformations (p_t), and those for describing intensity transformations (p_w). This means that the difference function can be expressed in the generic form:

$$e_i(p) = f(t(x_i, p_t)) - w(x_i, p_w)$$

where f is the object image, t() is a vector function describing the spatial transformations based upon parameters p_t, and w() is a scalar function describing intensity transformations based on parameters p_w. x_i represents the coordinates of the ith sampled point.

The intensities could vary spatially (e.g., due to inhomogeneities in the MRI scanner). Linear variations in intensity over the field of view can be accounted for by optimizing a function of the form:

inhomogeneities in the MRI scanner). Linear varia-tions in intensity over the field of view can be ac-counted for by optimizing a function of the form:

oi

(f(xi, pt) 2 (pw1g(xi) 1 pw2x1ig(xi)

1 pw3x2ig(xi) 1 pw4x3ig(xi)))2.
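
With the spatial parameters held fixed, the four intensity parameters of this expression can be estimated by ordinary linear least squares. The following is a sketch under that assumption; the helper name and argument layout are ours.

```python
import numpy as np

def fit_intensity_params(f_warped, g, coords):
    """Least-squares estimates of p_w1..p_w4 for the linear intensity model
    above.  f_warped and g are flattened image/template samples, and coords
    is an N x 3 array of the x_1, x_2, x_3 coordinates of those samples."""
    X = np.column_stack([g,
                         coords[:, 0] * g,    # template modulated by x_1
                         coords[:, 1] * g,    # template modulated by x_2
                         coords[:, 2] * g])   # template modulated by x_3
    pw, *_ = np.linalg.lstsq(X, f_warped, rcond=None)
    return pw
```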

More complex variations could be included by modulating with other basis functions (such as the DCT basis function set described earlier) [Friston et al., 1995]. Information on the smoothness of the inhomogeneities could be incorporated by appropriate modifications to the matrix C_0^{-1}.

Another important idea is that a given image can be matched not to one reference image, but to a series of images that all conform to the same space. The idea here is that (ignoring the spatial differences) any given image can be expressed as a linear combination of a set of reference images. For example, these reference images might include different modalities (e.g., positron emission tomography (PET), single photon emission computed tomography (SPECT), 18F-DOPA, 18F-deoxy-glucose, T1-weighted MRI, and T2*-weighted MRI) or different anatomical tissues (e.g., gray matter, white matter, and cerebrospinal fluid (CSF) segmented from the same T1-weighted MRI) or different anatomical regions (e.g., cortical gray matter, subcortical gray matter, and cerebellum) or finally any combination of the above. Any given image, irrespective of its modality, could be approximated with a function of these images. A simple example using two images would be:

$$\sum_i \left( f(Mx_i) - \left( p_{w1}\, g_1(x_i) + p_{w2}\, g_2(x_i) \right) \right)^2.$$

Again, some form of model describing the likely a priori distributions of the parameters could be included.

EVALUATION

This section provides an anecdotal evaluation of the techniques presented in the previous section. We compare spatial normalization both with and without nonlinear deformations, and compare nonlinear deformations with and without the use of Bayesian priors.

T1-weighted MR images of 12 subjects were spatially normalized to the same anatomical space. The normalizations were performed twice, first using only 12-parameter affine transformations and then using affine transformations followed by nonlinear transformations. The nonlinear transformation used 392 (7 × 8 × 7) parameters to describe deformations in each of the directions, and four parameters to model a linear scaling and simple linear image intensity inhomogeneities (making a total of 1,180 parameters in all). The basis functions were those of a three-dimensional DCT, and the regularization minimized the membrane energy of the deformation fields (using a value of 0.01 for λ). Twelve iterations of the nonlinear registration algorithm were performed.

Figure 4 shows pixel-by-pixel means and standard deviations of the normalized images. The mean image from the nonlinear normalization shows more contrast and has edges that are slightly sharper. The standard deviation image for the nonlinear normalization shows decreased intensities, demonstrating that the intensity differences between the images have been reduced. However, the differences tend to reflect changes in the global shape of the heads, rather than differences between cortical anatomy.

This evaluation should illustrate the fact that the nonlinear normalization clearly reduces the sum of squared intensity differences between the images. The amount of residual variance could have been reduced further by decreasing the amount of regularization. However, this may lead to some very unnatural-looking distortions being introduced, due to an overestimation of the a priori variability. Evaluations like this tend to show more favorable results for less heavily regularized algorithms. With less regularization, the optimum solution is based more upon minimizing the difference between the images, and less upon knowledge of the a priori distribution of the parameters. This is illustrated for a single subject in Figure 5, where the distortions of gyral anatomy clearly have a very low face validity (Fig. 5, lower right).

DISCUSSION

The criteria for "good" spatial transformations can be framed in terms of validity, reliability, and computational efficiency. The validity of a particular transformation device is not easy to define or measure and indeed varies with the application. For example, a rigid body transformation may be perfectly valid for realignment but not for spatial normalization of an arbitrary brain into a standard stereotactic space. Generally the sorts of validity that are important in spatial transformations can be divided into (1) face validity, established by demonstrating that the transformation does what it is supposed to do, and (2) construct validity, assessed by comparison with other techniques or constructs. Face validity is a complex issue in functional mapping. At first glance, face validity might be equated with the coregistration of anatomical homologues in two images. This would be complete and appropriate if the biological question referred to structural differences or modes of variation. In other circumstances, however, this definition of face validity is not appropriate. For example, the purpose of spatial normalization (either within or between subjects) in functional mapping studies is to maximize the sensitivity to neurophysiological change elicited by experimental manipulation of a sensorimotor or cognitive state. In this case, a better definition of a valid normalization is that which maximizes condition-dependent effects with respect to error (and if relevant, intersubject) effects. This will probably be effected when functional anatomy is congruent. This may or may not be the same as registering structural anatomy.

Figure 4. Means and standard deviations of spatially normalized T1-weighted images from 12 subjects. Images at left were derived using only affine registration. Those at right used nonlinear registration in addition to affine registration.

Figure 5. Top left: Object or template image. Top right: An image that has been registered with it using a 12-parameter affine registration. Bottom left: Same image registered using the 12-parameter affine registration, followed by a regularized global nonlinear registration (using 1,180 parameters, 12 iterations, and a λ of 0.01). It should be clear that the shape of the image approaches that of the template much better after nonlinear registration. Bottom right: Image after the same affine transformation and nonlinear registration, but this time without using any regularization. The mean squared difference between the image and template after the affine registration was 472.1. After regularized nonlinear registration, this was reduced to 302.7. Without regularization, a mean squared difference of 287.3 was achieved, but this was at the expense of introducing a lot of unnecessary warping.


Limitations of nonlinear registration

Because deformations are only defined by a few hundred parameters, the nonlinear registration method described here does not have the potential precision of some other methods. High-frequency deformations cannot be modeled, since the deformations are restricted to the lowest spatial frequencies of the basis functions. This means that the current approach is unsuitable for attempting exact matches between fine cortical structures.

The method is relatively fast, taking on the order of 30 sec per iteration, depending upon the number of basis functions used. The speed is partly a result of the small number of parameters involved, and the simple optimization algorithm that assumes an almost quadratic error surface. Because the images are first matched using a simple affine transformation, there is less "work" for the algorithm to do, and a good registration can be achieved with only a few iterations (about 20). The method does not rigorously enforce a one-to-one match between the brains being registered. However, by estimating only the lowest-frequency deformations and by using appropriate regularization, this constraint is rarely broken.

When higher spatial frequency warps are to be fitted, more DCT coefficients are required to describe the deformations. There are practical problems that occur when more than about the 8 × 8 × 8 lowest-frequency DCT components are used. One problem is that of storing and inverting the curvature matrix (A^T A). Even with deformations limited to 8 × 8 × 8 coefficients, there are at least 1,537 unknown parameters, requiring a curvature matrix of about 18 Mbytes (using double-precision floating-point arithmetic). Other methods which search for more parameters should be used when more precision is required. These include the method of Collins et al. [1994a], the high-dimensional linear-elasticity model [Miller et al., 1993], and the viscous fluid models [Christensen et al., 1996; Thompson and Toga, 1996].

In practice, however, it may be meaningless to even attempt an exact match between brains beyond a certain resolution. There is not a one-to-one relationship between the cortical structures of one brain and those of another, so any method that attempts to match brains exactly must be folding the brain to create sulci and gyri that do not really exist. Even if an exact match is possible, because the registration problem is not convex, the solutions obtained by high-dimensional warping techniques may not be truly optimum. These methods are very good at registering gray matter with gray matter (for example), but there is no guarantee that the registered gray matter arises from homologous cortical features.

Also, structure and function are not always tightly linked. Even if structurally equivalent regions can be brought into exact register, it does not mean that the same is true for regions that perform the same or similar functions. For intersubject averaging, an assumption is made that functionally equivalent regions lie in approximately the same parts of the brain. This led to the current rationale for smoothing images from multisubject studies prior to performing analyses. Constructive interference of the smeared activation signals then has the effect of producing a signal that is roughly in an average location. In order to account for substantial fine-scale warps in a spatial normalization, it is necessary for some voxels to increase their volumes considerably, and for others to shrink to an almost negligible size. The contribution of the shrunken regions to the smoothed images is tiny, and the sensitivity of the tests for detecting activations in these regions is reduced. This is another argument in favor of registering only on a global scale.

The constrained normalization described here assumes that the template resembles a warped version of the image. Modifications are required in order to apply the method to diseased or lesioned brains. One possible approach is to assume different weights for different brain regions. Lesioned areas could be assigned lower weights, so that they would have much less influence on the final solution.

CONCLUSIONS

Consider the deformation fields required to map brain images to a common stereotactic space. Fourier-transforming the fields reveals that most of the variance is low-frequency, even when the deformations have been determined using good high-dimensional nonlinear registration methods. Therefore, an efficient representation of the fields can be obtained from the low-frequency coefficients of the transform. The current approach to spatial normalization utilizes this compression. Rather than estimating warps based upon literally millions of parameters, only a few hundred parameters are used to represent the deformations as a linear combination of a few low-frequency basis functions.

The method we have developed is automatic and nonlabel-based. A maximum a posteriori (MAP) approach is used to regularize the optimization. However, the main difference between this and other MAP approaches is that an estimate of the errors is derived from the data per se. This estimate also includes a correction for local correlations between voxels. An implication of this is that the approach is suitable for spatially normalizing a wide range of different image modalities. High-quality MR images, and also low-resolution noisy PET images, can be treated the same way.

The spatial normalization converges rapidly, because it uses an optimization strategy with fast local convergence properties. Each iteration of the algorithm requires the computation of a Hessian matrix (A^T A). The straightforward computation of this matrix would be prohibitively time-consuming. However, this problem has been solved by developing an extremely fast method of computing this matrix that relies on the separable properties of the basis functions. A performance increase of several orders of magnitude is achieved in this way.

REFERENCES

Amit Y, Grenander U, Piccioni M. 1991. Structural image restoration through deformable templates. J Am Stat Assoc 86:376–387.

Ashburner J, Friston KJ. 1997. Multimodal image coregistration and partitioning—a unified framework. Neuroimage 6:209–217.

Ashburner J, Neelin P, Collins DL, Evans AC, Friston KJ. 1997. Incorporating prior knowledge into image registration. Neuroimage 6:344–352.

Bookstein FL. 1989. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans Pattern Anal Machine Intelligence 11:567–585.

Bookstein FL. 1997a. Landmark methods for forms without landmarks: morphometrics of group differences in outline shape. Med Image Anal 1:225–243.

Bookstein FL. 1997b. Quadratic variation of deformations. In: Duncan J, Gindi G, editors. Information processing in medical imaging. New York: Springer-Verlag. p 15–28.

Christensen GE. 1994. Deformable shape models for anatomy. Doctoral thesis. Sever Institute, Washington University.

Christensen GE, Rabbitt RD, Miller MI. 1994. 3D brain mapping using a deformable neuroanatomy. Phys Med Biol 39:609–618.

Christensen GE, Rabbitt RD, Miller MI. 1996. Deformable templates using large deformation kinematics. IEEE Trans Image Process 5:1435–1447.

Collignon A, Maes F, Delaere D, Vandermeulen D, Suetens P, Marchal G. 1995. Automated multi-modality image registration based on information theory. In: Bizais Y, Barillot C, Di Paola R, editors. Information processing in medical imaging. Dordrecht: Kluwer Academic Publishers. p 263–274.

Collins DL, Peters TM, Evans AC. 1994a. An automated 3D non-linear image deformation procedure for determination of gross morphometric variability in human brain. In: Proc. Conference on Visualisation in Biomedical Computing. p 180–190.

Collins DL, Neelin P, Peters TM, Evans AC. 1994b. Automatic 3D intersubject registration of MR volumetric data in standardized Talairach space. J Comput Assist Tomogr 18:192–205.

Collins DL, Evans AC, Holmes C, Peters TM. 1995. Automatic 3D segmentation of neuro-anatomical structures from MRI. In: Bizais Y, Barillot C, Di Paola R, editors. Information processing in medical imaging. Dordrecht: Kluwer Academic Publishers. p 139–152.

Fox PT. 1995. Spatial normalization origins: objectives, applications, and alternatives. Hum Brain Mapp 3:161–164.

Friston KJ, Ashburner J, Frith CD, Poline J-B, Heather JD, Frackowiak RSJ. 1995. Spatial registration and normalization of images. Hum Brain Mapp 2:165–189.

Gee JC, Haynor DR, Le Briquer L, Bajcsy RK. 1997. Advances in elastic matching theory and its implementation. In: Goos G, Hartmanis J, Van Leeuwen J, editors. CVRMed-MRCAS'97. New York: Springer-Verlag.

Mazziotta JC, Toga AW, Evans A, Fox P, Lancaster J. 1995. A probabilistic atlas of the human brain: theory and rationale for its development. Neuroimage 2:89–101.

Miller MI, Christensen GE, Amit Y, Grenander U. 1993. Mathematical textbook of deformable neuroanatomies. Proc Natl Acad Sci USA 90:11944–11948.

Pelizzari CA, Chen GTY, Spelbring DR, Weichselbaum RR, Chen CT. 1988. Accurate three-dimensional registration of CT, PET and MR images of the brain. J Comput Assist Tomogr 13:20–26.

Studholme C, Hill DLG, Hawkes DJ. 1995. Multiresolution voxel similarity measures for MR-PET coregistration. In: Bizais Y, Barillot C, Di Paola R, editors. Information processing in medical imaging. Dordrecht: Kluwer Academic Publishers. p 287–298.

Talairach J, Tournoux P. 1988. Coplanar stereotaxic atlas of the human brain. New York: Thieme Medical.

Thompson PM, Toga AW. 1996. Visualization and mapping of anatomic abnormalities using a probabilistic brain atlas based on random fluid transformations. In: Höhne KH, Kikinis R, editors. Proceedings of the International Conference on Visualization in Biomedical Computing. New York: Springer-Verlag. p 383–392.

Woods RP, Cherry SR, Mazziotta JC. 1992. Rapid automated algorithm for aligning and reslicing PET images. J Comput Assist Tomogr 16:620–633.

APPENDIX: OTHER LINEAR PRIORS

As an alternative to the membrane energy prior already discussed, we now show how two other priors can easily be incorporated into the model. For simplicity, they will only be shown for the two-dimensional case.

Bending energy

Bookstein's thin-plate splines [1997a,b] minimize the bending energy of the deformations. For the two-dimensional case, the bending energy of the deformation field is defined by:

$$\lambda \sum_i \left[ \left( \frac{\partial^2 u_{1,i}}{\partial x_{1,i}^2} \right)^2 + \left( \frac{\partial^2 u_{1,i}}{\partial x_{2,i}^2} \right)^2 + 2 \left( \frac{\partial^2 u_{1,i}}{\partial x_{1,i} \partial x_{2,i}} \right)^2 \right] + \lambda \sum_i \left[ \left( \frac{\partial^2 u_{2,i}}{\partial x_{1,i}^2} \right)^2 + \left( \frac{\partial^2 u_{2,i}}{\partial x_{2,i}^2} \right)^2 + 2 \left( \frac{\partial^2 u_{2,i}}{\partial x_{1,i} \partial x_{2,i}} \right)^2 \right].$$


This can be computed by:

$$\lambda\, t_1^T (B_2'' \otimes B_1)^T (B_2'' \otimes B_1)\, t_1 + \lambda\, t_1^T (B_2 \otimes B_1'')^T (B_2 \otimes B_1'')\, t_1 + 2\lambda\, t_1^T (B_2' \otimes B_1')^T (B_2' \otimes B_1')\, t_1$$
$$+\ \lambda\, t_2^T (B_2'' \otimes B_1)^T (B_2'' \otimes B_1)\, t_2 + \lambda\, t_2^T (B_2 \otimes B_1'')^T (B_2 \otimes B_1'')\, t_2 + 2\lambda\, t_2^T (B_2' \otimes B_1')^T (B_2' \otimes B_1')\, t_2$$

where the notation B1' and B1'' refer to the first and second derivatives of B1. This is simplified to t1^T H t1 + t2^T H t2, where:

$$H = \lambda \left( (B_2''^T B_2'') \otimes (B_1^T B_1) + (B_2^T B_2) \otimes (B_1''^T B_1'') + 2\, (B_2'^T B_2') \otimes (B_1'^T B_1') \right).$$

Matrix C_0^{-1} from Eq. (3) can be constructed from H as described by Eq. (5), but this time values on the diagonals are given by:

$$h_{j+kJ} = \lambda \left( \left( \frac{\pi (j-1)}{M} \right)^4 + \left( \frac{\pi (k-1)}{M} \right)^4 + 2 \left( \frac{\pi (j-1)}{M} \right)^2 \left( \frac{\pi (k-1)}{M} \right)^2 \right).$$
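
For comparison with the membrane-energy sketch given earlier, the corresponding diagonal penalties for the bending-energy prior (two-dimensional case) might be computed as follows; the indexing convention is ours.

```python
import numpy as np

def bending_penalties(M, J, lam):
    """Diagonal values following the closed form above:
    lam * ((pi(j-1)/M)^4 + (pi(k-1)/M)^4 + 2 (pi(j-1)/M)^2 (pi(k-1)/M)^2)."""
    w = np.pi * np.arange(J) / M                 # pi (j - 1) / M for j = 1..J
    wj, wk = np.meshgrid(w, w, indexing='ij')
    return (lam * (wj ** 4 + wk ** 4 + 2 * wj ** 2 * wk ** 2)).ravel()
```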

Linear elasticity

The linear-elastic energy [Miller et al., 1993] of a two-dimensional deformation field is:

$$\sum_{j=1}^{2} \sum_{k=1}^{2} \sum_i \frac{\lambda}{2} \left( \frac{\partial u_{j,i}}{\partial x_{j,i}} \right) \left( \frac{\partial u_{k,i}}{\partial x_{k,i}} \right) + \frac{\mu}{4} \left( \frac{\partial u_{j,i}}{\partial x_{k,i}} + \frac{\partial u_{k,i}}{\partial x_{j,i}} \right)^2$$

where λ and µ are the Lamé elasticity constants. The elastic energy of the deformations can be computed by:

$$(\mu + \lambda/2)\, t_1^T (B_2 \otimes B_1')^T (B_2 \otimes B_1')\, t_1 + (\mu + \lambda/2)\, t_2^T (B_2' \otimes B_1)^T (B_2' \otimes B_1)\, t_2$$
$$+\ (\mu/2)\, t_1^T (B_2' \otimes B_1)^T (B_2' \otimes B_1)\, t_1 + (\mu/2)\, t_2^T (B_2 \otimes B_1')^T (B_2 \otimes B_1')\, t_2$$
$$+\ (\mu/2)\, t_1^T (B_2' \otimes B_1)^T (B_2 \otimes B_1')\, t_2 + (\mu/2)\, t_2^T (B_2 \otimes B_1')^T (B_2' \otimes B_1)\, t_1$$
$$+\ (\lambda/2)\, t_1^T (B_2 \otimes B_1')^T (B_2' \otimes B_1)\, t_2 + (\lambda/2)\, t_2^T (B_2' \otimes B_1)^T (B_2 \otimes B_1')\, t_1.$$

A regularization based upon this model requires an inverse covariance matrix (C_0^{-1}) that is not a simple diagonal matrix. This matrix is constructed as follows:

$$C_0^{-1} = \begin{pmatrix} H_1 & H_3 & 0 \\ H_3^T & H_2 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

where

$$H_1 = (\mu + \lambda/2)\, (B_2^T B_2) \otimes (B_1'^T B_1') + (\mu/2)\, (B_2'^T B_2') \otimes (B_1^T B_1)$$
$$H_2 = (\mu + \lambda/2)\, (B_2'^T B_2') \otimes (B_1^T B_1) + (\mu/2)\, (B_2^T B_2) \otimes (B_1'^T B_1')$$
$$H_3 = (\lambda/2)\, (B_2^T B_2') \otimes (B_1'^T B_1) + (\mu/2)\, (B_2'^T B_2) \otimes (B_1^T B_1').$$
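
A sketch of assembling this C_0^{-1} with np.kron. B1 and B2 are the basis matrices and dB1, dB2 their first derivatives; how those derivative matrices are obtained is not specified here, so they are assumed to be supplied, and the function name is ours.

```python
import numpy as np

def elastic_C0_inverse(B1, B2, dB1, dB2, lam, mu):
    """Build H1, H2, H3 as defined above and place them in the block
    structure of C0^-1 (the final zero row/column corresponds to the
    intensity-scaling parameter)."""
    H1 = (mu + lam / 2) * np.kron(B2.T @ B2, dB1.T @ dB1) \
         + (mu / 2) * np.kron(dB2.T @ dB2, B1.T @ B1)
    H2 = (mu + lam / 2) * np.kron(dB2.T @ dB2, B1.T @ B1) \
         + (mu / 2) * np.kron(B2.T @ B2, dB1.T @ dB1)
    H3 = (lam / 2) * np.kron(B2.T @ dB2, dB1.T @ B1) \
         + (mu / 2) * np.kron(dB2.T @ B2, B1.T @ dB1)
    n = H1.shape[0]
    C = np.zeros((2 * n + 1, 2 * n + 1))
    C[:n, :n] = H1
    C[n:2 * n, n:2 * n] = H2
    C[:n, n:2 * n] = H3
    C[n:2 * n, :n] = H3.T
    return C
```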
