13
Computer Methods and Programs in Biomedicine (2004) 73, 101—113 Parallel programming of gradient-based iterative image reconstruction schemes for optical tomography Andreas H. Hielscher a,b, * , Sebastian Bartel b,1 a Departments of Biomedical Engineering and Radiology, Columbia University, 500 West 120th Street, MC 8904, New York, NY 10027, USA b Department of Pathology, Downstate Medical Center, State University of New York, 450 Clarkson Avenue, Brooklyn, NY 11203, USA Received 1 March 2002; accepted 1 February 2003 KEYWORDS Optical tomography; Near-infrared (NIR); Heterogeneous cluster; Parallel computing Summary Optical tomography (OT) is a fast developing novel imaging modality that uses near-infrared (NIR) light to obtain cross-sectional views of optical properties in- side the human body. A major challenge remains the time-consuming, computational- intensive image reconstruction problem that converts NIR transmission measurements into cross-sectional images. To increase the speed of iterative image reconstruction schemes that are commonly applied for OT, we have developed and implemented several parallel algorithms on a cluster of workstations. Static process distribution as well as dynamic load balancing schemes suitable for heterogeneous clusters and varying machine performances are introduced and tested. The resulting algorithms are shown to accelerate the reconstruction process to various degrees, substantially reducing the computation times for clinically relevant problems. © 2003 Elsevier Ireland Ltd. All rights reserved. 1. Introduction Optical tomography (OT) is a fast developing new medical imaging modality that uses near-infrared (NIR) light (650 < < 900 nm) to probe various parts of the body. In general, laser diodes deliver light through optical fibers to several locations around the body part under investigation and transmitted light intensities are recorded. The technology for making sensitive light-transmission measurement Corresponding author. Tel.: +1-212-854-5080; fax: +1-212-854-8725. E-mail address: [email protected] (A.H. Hielscher). 1 Current address: Laser Zentrum Hannover e.V., Hollerithallee 8, D-30419, Hannover, Germany. through the human body is now-a-days readily available [1—10]. Using this instrumentation, sev- eral pilot studies have proven the applicability of OT in medicine and its potential in functional tissue diagnostics is widely recognized. The most promising applications are monitoring of blood oxy- genation [11—13], hemorrhage detection [14,15], functional imaging of brain activities [16—23], Alzheimer diagnosis [24,25], early diagnosis of rheumatic disease in joints [26—28], and breast cancer detection [29—33]. All these applications employ the fact that various disease processes and other physiological changes affect the opti- cal properties (mainly absorption coefficient a and scattering coefficient s ) in biological tissue. Therefore, different optical properties of different 0169-2607/$ — see front matter © 2003 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/S0169-2607(03)00020-8

Parallelprogrammingofgradient-basediterative imagereconstructionschemesforoptical ...orion.bme.columbia.edu/optical-tomography/resources/... · 2009-11-23 · 104 A.H. Hielscher,

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Computer Methods and Programs in Biomedicine (2004) 73, 101—113

Parallel programming of gradient-based iterativeimage reconstruction schemes for opticaltomography

Andreas H. Hielschera,b,*, Sebastian Bartelb,1

a Departments of Biomedical Engineering and Radiology, Columbia University,500 West 120th Street, MC 8904, New York, NY 10027, USAb Department of Pathology, Downstate Medical Center, State University of New York,450 Clarkson Avenue, Brooklyn, NY 11203, USA

Received 1 March 2002; accepted 1 February 2003

KEYWORDSOptical tomography;Near-infrared (NIR);Heterogeneous cluster;Parallel computing

Summary Optical tomography (OT) is a fast developing novel imaging modality thatuses near-infrared (NIR) light to obtain cross-sectional views of optical properties in-side the human body. A major challenge remains the time-consuming, computational-intensive image reconstruction problem that converts NIR transmission measurementsinto cross-sectional images. To increase the speed of iterative image reconstructionschemes that are commonly applied for OT, we have developed and implementedseveral parallel algorithms on a cluster of workstations. Static process distributionas well as dynamic load balancing schemes suitable for heterogeneous clusters andvarying machine performances are introduced and tested. The resulting algorithmsare shown to accelerate the reconstruction process to various degrees, substantiallyreducing the computation times for clinically relevant problems.© 2003 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Optical tomography (OT) is a fast developing newmedical imaging modality that uses near-infrared(NIR) light (650< � < 900 nm) to probe various partsof the body. In general, laser diodes deliver lightthrough optical fibers to several locations aroundthe body part under investigation and transmittedlight intensities are recorded. The technology formaking sensitive light-transmission measurement

∗ Corresponding author. Tel.: +1-212-854-5080;fax: +1-212-854-8725.

E-mail address: [email protected] (A.H. Hielscher).1 Current address: Laser Zentrum Hannover e.V.,

Hollerithallee 8, D-30419, Hannover, Germany.

through the human body is now-a-days readilyavailable [1—10]. Using this instrumentation, sev-eral pilot studies have proven the applicabilityof OT in medicine and its potential in functionaltissue diagnostics is widely recognized. The mostpromising applications are monitoring of blood oxy-genation [11—13], hemorrhage detection [14,15],functional imaging of brain activities [16—23],Alzheimer diagnosis [24,25], early diagnosis ofrheumatic disease in joints [26—28], and breastcancer detection [29—33]. All these applicationsemploy the fact that various disease processesand other physiological changes affect the opti-cal properties (mainly absorption coefficient �aand scattering coefficient �s) in biological tissue.Therefore, different optical properties of different

0169-2607/$ — see front matter © 2003 Elsevier Ireland Ltd. All rights reserved.doi:10.1016/S0169-2607(03)00020-8

102 A.H. Hielscher, S. Bartel

tissues supply the contrast of this new imagingtechnology.

A major challenge in OT is the development ofefficient numerical schemes that transform thesemeasurements into useful cross-sectional images ofthe interior. Unlike X-rays, NIR photons do not crossthe medium on a straight line from the source tothe detector. In addition to being absorbed, light isstrongly scattered throughout the body. The propa-gation of light through tissue can be approximatedas a diffusion process in which any directional in-formation of incident photons is lost within severalmillimeters. Therefore, standard backprojectiontechniques [34] as applied in X-ray tomographyhave been of limited success [35,36]. Most ofthe currently employed algorithms are so calledmodel-based iterative image reconstruction (MO-BIIR) schemes [37—49]. These schemes use a for-ward model to simulate the light propagation in thesample and provide a theoretical predication forthe detector readings. Based on an initial guess ofoptical properties inside the medium, the predicteddetector readings are compared with experimen-tal data using an appropriately defined objectivefunction. The true distribution of optical proper-ties inside the medium is determined by iterativelyupdating the initial guess of this distribution, andby performing new forward calculation until thepredicted data agrees within a given error with thedetector readings. The final distribution of opticalproperties is displayed as an image. This iterativeprocess generally requires substantial computa-tional resources. Depending on the specific forwardmodel, the number of light sources used, the updat-ing scheme, and the desired numerical accuracy,computation times can easily reach several hours oreven days on a single processor. This makes OT cur-rently impractical for many clinical applications.

To overcome this problem first studies haveemerged that suggested the use of parallel pro-gramming technique for the image reconstructionalgorithms [50,51]. However, a systematic analysisof the problem has not yet been presented. In thiswork we analyze and modify a previously developedMOBIIR algorithm to allow for parallel execution ina cluster of heterogeneous workstations. We firstreview the structure this gradient MOBIIR schemefor the analysis of optical tomographic data. Thevarious parts of the existing algorithms are subse-quently analyzed regarding their time consump-tion. Based on this analysis, we develop severalparallel algorithms suited for execution on an arbi-trary number of UNIX-type machines in a local areanetwork. These networks are widely available inacademic and medical environments and a varietyof standardized protocols for distributed and paral-

lel computing are available, which we adapted forthe particular problem of OT. The results are testedfor efficiency and stability in a 10/100 MB Ethernetusing the TCP/IP protocol under various conditions.

2. Methods

2.1. Gradient-based iterative imagereconstruction

In this work we concentrate on the paralleliza-tion of so-called gradient-based iterative im-age reconstructions (GIIR) schemes, which forma subclass MOBIIR algorithms. In recent years,several groups have embraced the concept ofGIIR schemes and it has become the method ofchoice among many groups in the field of OT[37—39,52—56]. Gradient-based schemes em-ploy information about the gradient of the ob-jective function with respect to the opticalproperties of the sample to find updates ofthe initial guess. In this way the reconstruc-tion problem is viewed as an optimization prob-lem. For example, steepest-gradient-descent andconjugated-gradient schemes are well-establishedtechniques in optimization theory [57]. Theseschemes have the advantage over also widely usedNewton-type algorithms that a full Jacobian matrixJ neither needs to be explicitly created nor repeat-edly inverted. For a more detailed comparison ofdifferent schemes in OT see for example [37,38].

As in all algorithms for OT, the goal of the GIIRscheme is to reconstruct the distribution of the op-tical properties inside a medium, from a given setMof measurements on the circumference, ∂�, of themedium. We can divide the GIIR scheme in three dif-ferent major components (see Fig. 1): (1) Forward

Fig. 1 Parallelization outline; the forward model is splitup into single-source forward calculations (FC) and ad-joint differentations (AD). It is called repeatedly in theinner iteration and once in the outer iteration.

Parallel programming of gradient-based iterative image reconstruction schemes 103

Model: this model is a theory or algorithm that pre-dicts a set of measured signals, MP, based on theposition of the light source and the spatial distribu-tion of optical properties � = [�a(r), �s(r)]. In thisstudy we use a finite-difference scheme (FD) thatis based on the time-dependent diffusion equation[37]. (2) Analysis Scheme: here an objective func-tion, �, is defined, which describes the differencebetween the measured, M, and predicted data, P.In this work we define the objective function asleast-square error norm, which is given by

�(ζ) = χ2(ζ) ≡∑s

∑d

(Ms,d − Ps,d(ζ))2

2η2s,d

(1)

The indices s and d identify different sources anddetector positions. The parameter ηs,d is a nor-malization constant, which for example can beset to Ms,d(ζ). In this case the error norm mini-mizes the sum of the squared percentage differ-ence between measured and theoretical data. (3)Updating Scheme: once the objective function isdefined, the task becomes to minimize �, startingfrom an initial guess ζ0(r). This is accomplished intwo substeps: (3a) First, the gradient of the ob-jective function d�(ζ0)/dζ is calculated by meansof the so-called adjoint-differentiation method.(3b) Second, given the gradient an iterative lineminimization in the direction of the gradient isperformed. This step is labeled inner iteration inFig. 1 and consists of several forward calculationsin which the optical parameters � are varied. Oncethe minimum along the line is found, a new gradi-ent is calculated at this minimum (outer iteration)and another line-minimization is performed, nowalong a different direction in the �-space. Thesesteps are repeated until a distribution � is found forwhich �(�) is smallest. A more detailed descriptionof the gradient MOBIIR scheme used in this workcan be found in [37].

2.2. Parallelization

The appropriate decomposition of the originalproblem into smaller subtasks is most crucial toevery parallel implementation of a given algo-rithm. We identified the finite differences (FD)algorithm that is used to calculated the predicteddetector readings P as the most time-consumingportion of our GIIR scheme. Here, the objectivefunction �(�) is determined for the current guessof optical properties � of the medium. The gradi-ent calculation ∂�(�)/∂�i is also performed in thispart of the code by adjoint differentiation, whichconsumes about the same time as one forwardcalculation. Approximately 95% of the computa-

tional time are spent in the finite-differences andadjoint-differentiation schemes. The exact frac-tion depends on the number of sources employed inthe experimental setup, since both, the objectivefunction �(�) and the gradient ∂�(�)/∂�i are givenas sums over all source-detector combinations.The experimental data consists of several sets ofsimultaneous detector readings corresponding todifferent source positions. One could activate allsources simultaneously, however, it proofs to beadvantageous to switch through all source positionssequentially to increase the amount of informationgained. To match these experimental readings withthe simulation and evaluate the objective func-tion, each source present in the experiment re-quires a separate forward calculation. Hence, theproblem of calculating Eq. (1) for N sources quitenaturally breaks up into N mutually independentsingle source forward-calculations (SSFCs). We canrewrite the objective function (1) as

�(ζ) =∑s

∑d

(Ms,d(ζ) − Ps,d(ζ))2

2ηs,d2=

∑s

�s(ζ) (2)

Thus, instead of handling N sources sequentially asin the experiment, we may, therefore, share the to-tal load amongst up to N processors, by assigningeach a disjunct set of sources to evaluate. The ob-jective function is then obtained by summing overthe partial results (Fig. 1).

One SSFC typically requires on the order of 0.01—1 s depending on the size of the FD grid. This makesthem appropriate candidates for sub-tasks to be dis-tributed over the network. Shorter tasks would in-creasingly sense the latency times of the network,because TCP/IP protocol yields latency times of sev-eral milliseconds.

Another consideration is the network bandwidth,which in our case is ideally 100 Mb/s but may de-crease to several 10 Kb/s in times of high traffic. Ifwe assume a minimum transfer rate of 50 Kb/s, theamount of data transferred per SSFC should, there-fore, not exceed 500 Bits to avoid unnecessary wait-ing time. This does, however, not pose a problem:All processors possess their private copy of all datastructures involved in the reconstruction and areprincipally able to perform any SSFC autonomously.When the parallel code forks into different branchesno data, other than the distribution scheme has tobe communicated across the network.

Upon completion of the sub-tasks, when a globalsum over all processors is formed, two data struc-tures are to be transferred over the network tobe available to all participating processors. First,the addends �s of the objective function (Eq. (2)),i.e. single floating point values, are transferred.

104 A.H. Hielscher, S. Bartel

Secondly, the gradient matrix, represents thelargest chunk of data to be moved across thenetwork–—it consists of n×m values, where n andm fix the size of the FD grid. Typically, the gridsize ranges between 20× 20 and 50× 50 pixels, sothat a maximum of 80 Kb have to be moved acrossthe cluster. However, the gradient calculation onlyoccurs once per outer iteration, in about 1/15 ofthe invocations of the forward model.

Although the SSFC is an appropriate partitioningunit for parallelization, this choice leads to certainlimitations. First of all, no more processors thansources may effectively be used. Moreover, certainconfigurations will lead to unfavorable load balanc-ing. For instance, consider four SSFCs to be sharedby three identical processors. Obviously, the 4thforward calculation will leave two processors idle.These limitations are unavoidable in the currentsplitting scheme and have to be considered whentesting or using the algorithm.

3. Implementation

Parallel execution of a program on multiple proces-sors requires a physical network and a communi-cation interface between the nodes. All processorshave to be aware of one another and be able toexchange data securely. The Message Passing In-terface (MPI) is a platform independent standard,which was introduced by the Message Passing Inter-face Forum, in 1994 [58,59]. This interface definesa set of functions for inter-processor communica-tion and data manipulation. MPI hides details of theunderlying hardware and network protocol fromthe developer and provides a transparent interfacebetween all participating nodes (it operates on topof protocols such as TCP/IP, e.g. above layer 4 inthe OSI 7 Layer Network Model). Several imple-mentations of the MPI for different architecturesare freely available. We used the MPICH-packagesuitable for LINUX, IRIX, and SOLARIS, which includesC-libraries and several scripts for compiling andexecuting binaries in a parallel environment.

MPI is designed to support the distributedmemorymodel, so all processors execute identical copiesof the binary in their own memory space and haveno direct access to data on other nodes. It is ac-tually more appropriate to speak of different pro-cesses rather than of processors, since MPICH doesnot distinguish between physically different pro-cessors and multiple instances of a program on thesame processor. If several processes are launched onone node, communication still uses the machine’sTCP/IP interface but is routed directly through theinternal loop-back. In the remainder of this paper

we will use the termini processor and process syn-onymous for instances of a program, independentof their location.

When executing the parallel version of a binary,MPI assigns a unique identification number to eachprocess through which it can be addressed by oth-ers. Communication can take place between twoprocessors or groups of processors, within so-calledcommunicators.

To take advantage of a parallel execution thecode must of course be adapted to behave dif-ferently on each processor and handle disjunctportions of the total computational expense. TheMPI provides functions to retrieve a processor’sidentification at run-time, so that the code cantake different branches depending on which nodeit is executing. In the remainder of this section wewill refer to respective processors IDs by nmy pe anddenote the total number of processors with npe.

Following the arguments given above, the to-tal number of sources nS is to be divided into npesubsets. Both, the objective function �(�) and thegradient ∂�(�)/∂�i are sums over all source contri-butions as indicated by Eq. (2). Within the actualimplementation, these contributions are added upin a loop:

for (i = 0; i < ns; i+ +)

{�+ = �i;grad�+ = grad�i;}

(3)

Therefore, our task is to break down this loopand to assign only part of the total intervali∈[0. . . nS[= [0. . . n1[, [n1. . . n2[, . . . , [nS−1. . . nS[ toeach processor. Let nmy pe∈[0. . . npe] be the index ofthe executing processor, then the following codefragment causes all processors to evaluate thecommands {. . . } for disjunct sub-intervals only:

if (nmype = = 0)

for (i= 0; i < n1; i+ +)

{. . . }else if (nmype = = 1)

for (i=n1; i < n2; i+ +)

{. . . }...

else if (nmype = =npe)for (i=ns&0x00AD;1; i < ns; i+ +)

{. . . }

(4)

This code example merely illustrates the conceptof assigning disjunct intervals to different proces-sors. The actual implementation may, however,look somewhat different from the example above.In the following subsections we will introduce three

Parallel programming of gradient-based iterative image reconstruction schemes 105

approaches for choosing the npe subsets appropri-ately and depict their implementation.

3.1. Static load distribution

In our first approach, the total number of sources[0. . . nS[ is broken into npe equal parts (if possible).Hence, if npe is not an integer denominator of nS,the size of respective intervals will differ by 1. Thecorresponding implementation can be cast in a verycompact form if we rewrite the loop (3) such that

for (i=nmype; i < ns; i+ =npe){� + = �Source(JobId);grad� + = grad�Source(JobId);}

(5)

where �Source(JobId) and gradSource(JobId)represent function calls to a single source for-ward/gradient calculation (SSFC). Hence, proces-sor 0 evaluates the sources 0, npe, 2npe, . . . , proces-sor 1 evaluates sources 1, npe+1, 2npe+1,. . . andso forth. Fig. 2 depicts the splitting mechanism ina graphical form. Obviously, the respective subsetsare always disjunct and completely fill the interval[0. . . nS[.

The major advantage of this approach, is thatit produces virtually no overhead and requires noadditional variables to define the boundaries of therespective intervals. In a homogeneous cluster ofprocessors with npe being a denominator of nS theloop (Eq. (5)) will execute exactly npe times as fastas on a single processor.

Fig. 2 Static distribution scheme. The total number ofsources nS is shared evenly across n PE’s.

If the sources cannot be split up evenly across thecluster, some processors will be idle for the timenecessary for a single forward calculation, leadingto a decay in performance. The worst-case scenariooccurs if the reconstruction is based on a singlesource only. In this case, all but processor 0 willbe idle and no increase in performance will be ob-served. However, this is not a disadvantage of anyparticular distribution scheme but inherent to ourapproach to parallelization. Since one source cal-culation constitutes the smallest possible sub-taskthat can be assigned to a processor it is obviouslyuseless to employ more than nS processors in thereconstruction. While finer levels of subdivision canbe envisioned, we will defer the discussion of cor-responding schemes to Section 4.

The major disadvantage of the static load distri-bution becomes apparent in inhomogeneous clus-ters. Since the total number of sources is spreadevenly across all participating processors, indepen-dent of their performance, slower processors willrequire more time to complete their share, forc-ing the rest of the cluster to wait. Consequently,the overall speed in a cluster of processors of per-formance Pi is determined by the slowest processorin the ensemble. The maximum performance P� ofthe cluster is given by

P� = npe · min{Pi} (6)

3.2. Dynamic job assignment

To use the available processor performance in het-erogeneous clusters more effectively we developeda flexible scheme to assign subtasks to individualprocessors. The scheme employs one dedicated pro-cessor, referred to as PE0, to manage the distribu-tion of SSFC to the cluster. This PE0 keeps track ofthe total number of sources and what fraction ofthem has been processed by any of the other pro-cessors. It does, however, not take part in the ac-tual forward calculation.

All other processors, upon entering the forwardmodel, post a job request to the dedicated proces-sor and wait for an available source to evaluate.PE0 responds to a request by sending out the nextsource index and marking this index as processed.Whenever a processor has finished a forward calcu-lation, it sends another request to PE0 to obtain thenext, unprocessed source.

If no more sources are pending, the dedicatedprocessor broadcasts a termination signal to thecluster and stops listening to requests. The notifi-cation to terminate the reconstruction loop passesthrough the same messaging channel as regular jobassignments. But instead of a valid source index, a

106 A.H. Hielscher, S. Bartel

Fig. 3 Job-request/assignment. Processor PE0 managesthe distribution of SSFCs and assigns source IDs uponrequest by other PE’s.

predefined negative constant is broadcast to all pro-cessors and interpreted there accordingly. Fig. 3 il-lustrates the algorithm in detail. The essential codefragment has the following form:

if (nmype!= 0) // on all but PE0, request source ID{ // and perform SSFCRequestJob (& JobId);while (JobId!=NO JOB) // unless termination signal is sent . . .

{�+ = �source(JobId); //do SSFC &grad� + = grad�source(JobId); // gradient calculationRequestJob (& JobId); // request next source ID}}else // on PE0 assign source ID upon request

// on PE0 listen to requests and{ // assign jobs to processors . . .

SrcCount= 0;while (SrcCount < =ns) //while we have unprocessed SSFCs {WaitRequest(& PeNo); //wait for requests from any PEAssignJob(PeNo, SrcCount) // assign next SSFC to calling PE}BeastTermination() // . . . broadcast termination of all PE’s}

(7)

The concept of this job-request scheme is read-ily understood: Initially, each processor picks upa source index and performs the correspondingforward- or gradient calculation. Faster processorswill complete their task in a shorter period of timethan their slower counterparts and consequentlybe able to post another job request earlier. Onaverage, if we consider many sources (ns�npe),each processor will request a fraction n(i)/ns of allsources, that is proportional to its speed. Here, n(i)is the number of sources handled by processor i.

Obviously this second scheme is much less sus-ceptible to dawdling candidates than the staticload distribution. Extremely slow processors willreceive only a single job and have no more oppor-tunity to request a second one. Furthermore, therequest/assignment scheme adapts to changingprocessor performances during the image recon-struction automatically. Each participant will ob-tain as many sources to evaluate as it can handleat the time.

This advantage only becomes apparent, however,if more sources than processors are available, sinceeven the slowest processor will handle at least oneSSFC. The following considerations yield an esti-mate of the minimal number of SSFC, ns, necessaryto observe an improvement during parallel execu-tion.

Consider a single processor PEj with only a frac-tion 1/q of the performance p of all other mem-bers of the cluster. Let nj be the number of sourceshandled by the slower candidate, compared with nsources handled by each of the other processors.Then the following relations hold:

Parallel programming of gradient-based iterative image reconstruction schemes 107

nj 1qn; nj ≥ 1 (8)

nS = n · (npe − 1) + nj (9)

From this we get:

nj =nS

q(npe − 1) + 1!≥1 (10)

Solving for nS, the minimum number of SSFCs isgiven by

nS ≥ q(npe − 1) + 1 (11)

Thus, approximately npeq sources must be presentto prevent processor j from slowing down the re-construction unduly.

We note further, that some other pathologicalcases may be constructed, e.g. if all but one sourceposition have been evaluated and the slowest pro-cessor manages to snatch the final source shortlybefore a faster colleague posts a request. In thisunfortunate case, the cluster is forced to wait dis-proportionately long, while a faster processor couldhave completed the task earlier.

Again, as for the first scheme, the difficultiesmentioned do not pose much of a problem, if wehave ns � npe. They originate from the relativelycoarse partitioning of the reconstruction process.

However, two drawbacks are specific to this par-ticular job distribution scheme. First, one proces-sor, PE0 is lost for the actual reconstruction process,since it is employed to manage the job assignment.The second and most important drawback, as willbe shown later, is the communicational overhead in-volved with the request/assignment mechanism. Atotal of 2ns + npe messages have to passed across thenetwork during each forward calculation. Althougha single message is only some 10 bytes in size, thehigh latency time of TCP/IP networks causes signif-icant idle times at each job request.

3.3. Dynamic load distribution

The third algorithm developed for the parallel im-age reconstruction combines the flexibility of thedynamic scheme introduced above with the mod-est communicational overhead of the static, firstscheme. The key point of this concept is to deter-mine individual processor speeds in terms of recon-structions per unit time at run-time. This informa-tion is then used to redistribute the total load acrossthe cluster.

To keep track of the current partitioning of thecomplete set of sources [1...ns[ a static source listis maintained on all processors. It is defined as

static int SourceList[NumberOfSources]; (12)

where SourceList[i] contains the ID of the proces-sor handling source position i. Initially, before anyforward calculations are done, all processors areconsider to be equal and the SourceList is populatedevenly, similar to Eq. (3).

In the following, each processor scans theSourceList for elements with its own ID and per-forms the corresponding single-source forwardcalculations. The time ti, required for these cal-culations is recorded, along with the number ofsources ni assigned to this processor, so that wecan determine the effective speed, defined as:

vi =niti

(13)

Once all processors have finished their share ofsources, they broadcast their current speed to allmembers of the cluster. The final step is to updatethe SourceList in view of the different vi of individ-ual processors, so that faster candidates will handlea bigger fraction of all sources in future. If we de-note the number of sources to be processed on PEiwith ni, and the time necessary with ti, we are facedwith the constrained minimization problem in T:

T(n0, n1, . . . , nnpe−1) =max{t0, t1, . . . , tnpe−1}

=max

{n0

v0,n1v1

, . . . ,nnpe−1

vnpe−1

}; with nS =

npe−1∑i−0

ni

(14)

We employ the following algorithm to populateSourceList.

First, the processor speeds are normalized so that∑vi = ns. If we now set

ni = vi ∀i ∈ [0 . . . npe

[; nj ∈ R (15)

both, the minimization and the constraint are sat-isfied, since

(i)n0

v0= n1

v1= . . . = nnpe−1

vnpe−1

(ii)npe−1∑i= 0

ni =npe−1∑i= 0

vi = nS (16)

Note, that the ni thus obtained are real, whereaswe can only assign an integer number of sources toindividual processors. To overcome this difficulty,we use a fast and simple method to generate inte-ger values close to the optimal solution: The ni areiteratively increased by a common factor q, suchthat ni→(1+q)ni, until their round-down values addup to the total number of sources. Typically, the in-crement q is set to 0.05, which usually yields cor-rect ni within only five iterations.

108 A.H. Hielscher, S. Bartel

Finally, the ni are used to populate and updateSourceList. Note, that the calculations above areperformed on all processors simultaneously, sothere is no need for further communication. Oncethe information about processor speeds is sharedacross the cluster, the minimum solution of Eq. (14)is fully determined and requires no more interac-tion. Fig. 4 and the code below (Eq. (17)) summa-rize the essentials of the dynamic load distributionscheme.

Fig. 4 Dynamic scheme: all processors determine theirindividual speed after completion of their share of for-ward calculations. After each iteration, the speeds arecommunicated to the cluster to re-evaluate the parti-tioning of all sources.

PunchTime(); //mark current timeMyCount= 0; // reset source counterfor (i= 0; i < ns; i+ +) // scan SourceList for Jobs to do{ // on this processorif (SourceList[i]= =nmype){� + = �source(i); //do SSFC &grad� + = grad�source(i); // gradient calculationMyCount+ +;}}MyTime=PunchTime(); // record time required andMySpeed= (double)MyCount/MyTime; //determine processor speedUpdateSourceList(MySpeed, SourceList); // redistribute the sources

// according to processors speed

(17)

Two minor modifications, not shown in the cod-ing example, were included to improve the stabil-ity of this algorithm: the current speed is biasedby the average speed of all past forward calcula-tions to dampen oscillations that tend to occur onmulti-processor machines (we found that on thesemachines, single tasks are not tied to a proces-sor but shifted around by the operating system.This resulted in a competition between differentprocesses and oscillations in their performance.So-called processors (PE’s) in the MPI-schemes arenot necessarily identical to physical processors).Secondly, if a processor was not assigned at leaston source, the last recorded speed is adoptedrather than zero speed. This prevents individualprocessors from dropping out of the competitioncompletely. Hence, the speed is determined as:

if (MyCount)MySpeed= (double)MyCount/MyTime+ 0.5 ∗MyAverageSpeed;

elseMySpeed=MyLastSpeed;

(18)

This last scheme proofs to be the most efficientdistribution algorithm. It adapts to large discrep-ancies between different processors and may evenexclude slow performers completely. However, itrequires no more than one joint data exchange be-tween all processors, after each forward calcula-tion. The additional calculations do not consumesignificant computational time.

4. Results and performance

To test the performance of our distribution schemesquantitatively, we used an example from our re-sent research concerning optical tomographic jointimaging (Fig. 5). Based on cross-sectional imagesobtained from magnetic resonance imaging (MRI)of a human proximal-interphalangial (PIP) finger

Parallel programming of gradient-based iterative image reconstruction schemes 109

Fig. 5 Example used for testing the different parallel computing schemes. (a) Magnetic resonance image of a humanPIP finger joint; (b) segmented image of the joint that shows the different tissue types (bone, muscle, tendonand ligament) and there different optical properties. This map was used to generate synthetic input data for themodel-based iterative image reconstruction MOBIIR code; (c) result of the reconstruction of diffusion coefficientsafter 200 iterations.

joint we designed a numerical model that was usedto generate synthetic data. We assigned opticalproperties to different tissues visible in the MRIcross-section images of the joint (see Fig. 5a andb). Since not all tissue types are clearly distin-guishable by MRI data alone, standard anatomicalinformation [60] was additionally used to uniquelyidentify the various positions of different tissues.Segmenting the MRI images in this way we obtaineda two-dimensional slice of the optical proper-ties in a finger joint. These images had a size of44× 44 pixels, with each pixel having dimensions of0.4× 0.4 mm. In order to simulate measurementsof the finger joint, we placed 60 detectors and 60sources around the joint and performed forwardcalculations that yielded detector readings for dif-ferent source positions. This synthetic data wasinput to our gradient MOBIIR algorithm, which wasstarted with a spatially homogeneous initial guessof the optical properties. The result of the recon-struction can be seen in Fig. 5c. For more detailsconcerning this example see [27].

The results of a parallel implementation of theimage reconstruction code do, in noway, differ fromthe serial execution. This is an obvious necessity ofany parallel scheme and was testes thoroughly dur-ing the developing process. Therefore, the qualityof our gradient-based image reconstruction is notsubject to discussion at this point, and the reader

with interest in quality of optical tomographic imag-ing is referred to the previously cited publication[37—56] that address this problem.

In this study we are interested in the performanceof the various parallel algorithm in terms of com-putational speed. In a first test, we determined thegain in reconstruction speed with increasing num-ber of identical processors. Ideally, one would ex-pect the total speed vt(npe) to grow linearly withthe number of processors, according to

vt(n) = γ · npe · vpe (19)

where, vpe is the single processor speed andγ = 0. . . 1 the degree of parallelization. The latterquantifies the fraction of computational expensethat is actually shared across the cluster. It be-comes 1, if 100% of the code have been parallelizedand no more sequential execution is done. In ourcase, given the large number of sources involved,we found that more than 99% of the time is spentwithin the forward model. Under the assumptionthat an optimal splitting of the total work is theo-retically possible, i.e.nSnpe

= n ∈ N (20)

we can estimate �≈1. Any deviation from Eq. (19)is, therefore, caused by overhead or idle time in-troduces by the parallelization scheme.

110 A.H. Hielscher, S. Bartel

Fig. 6 Performance increase with increasing number ofidentical processors. The solid line indicates the perfor-mance of an idealized parallel scheme. Note that thejob-request scheme requires at least two pe’s since thefirst processor manages the distribution of forward cal-culations to the cluster.

Fig. 6 shows the dependency of vt(npe) on thenumber of processors for all three schemes aswell as the idealized case of γ = 1. As expected,the static assignment of sources performs best,since it involves virtually no overhead and requiresonly minimal inter-processor communication. Al-most equally good results are observed for thedynamic distribution scheme. The job-requestscheme, however, performs poorly and yieldsonly 50% of the maximum possible speed. More-over, a gain due to parallelization only sets infor more than two processors, since one pro-cessor is necessary to handle the administrativetasks.

The second test was designed to study the al-gorithms’ ability to perform in heterogeneousclusters, i.e. to adapt to different processorspeeds. Ideally, in generalization of Eq. (16), themaximum speed of a parallel scheme is givenby

vt(v0, v1, . . . , vnpe−1) = γ ·npe−1∑i= 0

vi (21)

In other words, the number of forward calcula-tions per unit time performed on individual proces-sors simply adds up to the total number of forwardcalculations of the cluster. Hence, slow processorsshould only affect the overall reconstruction speedinsofar as they contribute to a lesser degree to thetotal work.

To quantify the performance of all three ap-proaches, we employed 5 processors of identicalspeed but slowed down one processor (pe 1) ar-tificially, by inserting waiting intervals of length

Fig. 7 Effect of a single slower processor (PE0) in acluster of five PE’s. The ideally achievable speed of anoptimal parallel scheme is given by the solid graph.

Twait into the SSFC. In Fig. 7, the abscissa indi-cates the length of these intervals, relative tothe actual computing time TSSFC. Hence, a decayof 100% corresponds to Twait =TSSFC, renderingthe processor half as fast as the normal. In thelimit of Twait→∞, only four processors partici-pate in the reconstruction and the optimal speedapproaches 0.8.

The results clearly state the superiority of thedynamic scheme over the other approaches. Overthe whole range it yields more than 90% of themaximum possible optimal performance. Obvi-ously, the algorithm is able to find an optimaldistribution scheme according to each proces-sor’s capabilities. The speed of the static scheme,on the other hand quickly drops as pe 1 slowsdown. In fact, the overall speed is governed bythe slowest processor and depends on its decayd like

vi = vpe 1 ∼ 11 + d/100

(22)

It is interesting to note that the job-request/assignment algorithm behaves quite stable, asit does not decay to more than 40% relative tothe optimal case (graph with solid line). We con-clude that in spite of the overhead, the schemeadapts to the heterogeneous environment prop-erly. In lower latency networks it might, there-fore, provide a reasonable alternative to dynamicschemes.

5. Discussion

Given the results, the dynamic load distributionappears most suitable for parallel computationof problems involving optical tomographic image

Parallel programming of gradient-based iterative image reconstruction schemes 111

reconstruction in a heterogeneous network. The dy-namic scheme adapts to varying processors speedsand requires only little communicational overheadduring the reconstruction process. This outcome isnot surprising since the dynamic algorithm com-bines the positive aspects of the static-assignmentand job-request approaches. Under realistic con-ditions, in a busy network the dynamic scheme isable to utilize more than 90% of each processor’stime for the reconstruction process. Under thepremise that an optimal distribution of all sourcesacross the cluster is theoretically possible, theoverall parallelization � is, therefore, greater than0.9. Single slower participants will contribute toa lesser degree but never affect the overall speednegatively.

There are no exceptional requirements on theexecution environment to achieve this perfor-mance. The algorithm is, therefore, suitable formost of nowadays network architectures. In thisstudy we executed the parallel reconstruction in acluster of three Pentium-based dual-processor ma-chines (450 and 750 MHz), running LINUX and threeSGI-workstations across at least two passive hubs.Active switches would most likely raise the overallspeed by several percent but are no necessity asour results show.

The algorithm yielded reconstructed images ona 44× 44 grid with up to 12 sources—positions inapproximately 30 s. This benchmark makes in situimage reconstruction in a medical environmentfeasible and will expand the range of clinical ap-plications for NIR diagnostics.

As already noted in previous sections, the numberof source-positions currently presents a limit for theparallelization. In typical experiments as pursuedby our group, the setup includes 12—32 sources,which may be increased up to the total number ofavailable processors, without loss in performance.In spite of this inherent limitation, the possibility toemploy more sources in the reconstruction processposes a significant advantage. By raising the num-ber of sources and/or detectors in a OT setup, theamount of information gained from the medium canbe significantly increased. Due to the high costs ofsensitive and highly dynamic detectors it is princi-pally easier to use more sources rather than detec-tors [8]. However, on the reconstruction side, thischoice amounts to additional computations of thelight distribution originating from these sources.Previously, the time necessary to generate imagesfrom the experimental data rose accordingly. Withparallel schemes as presented in this work the over-all reconstruction time may be kept constant byutilizing additional processors in the reconstructionprocess.

6. Summary

OT is a novel emerging medical imaging modal-ity that has shown great promise in a variety ofclinical pilot studies. This new modality is basedon measurements of transmitted intensities ofnon-ionizing, NIR light. Unlike X-rays, which tra-verses human tissue on a straight line, NIR lightis strongly scattered in tissue and the image re-construction problem involves computationally in-tensive, iterative, model-based algorithms. Longcomputation times have so far limited the prac-tical clinical application of powerful model-basediterative image reconstruction (MOBIIR) scheme.In this paper we addressed this problem by ana-lyzing, implementing, and testing various parallelexecution schemes, which can be executed on anarbitrary number of UNIX-type machines in a localarea network.

We found that in commonly employed MOBIIRschemes over 90% of the computational time isspent in forward solvers, which make prediction ofthe light intensities on the surface of the medium,given a guess of the optical properties inside themedium. Since typically many light sources are usedin OT, and for each light source a forward problemhas to be solved, the algorithm can be divided inton sub-tasks, where n is the number of light sourcesused in the experimental setup. These sub-tasks arethen executed in parallel on distributed processorsto accelerate the image reconstruction process, inthe optimal case, by a factor close to n.

Three different parallel schemes (static load dis-tribution, dynamic job assignment, dynamic loaddistribution) were tested in a heterogeneous clus-ter of networked PCs and workstations. The staticload sharing makes almost optimal use of the avail-able resources if used in a homogeneous clusterand the number of sources being a multiple ofthe number of processors. In heterogeneous net-works, the dynamic scheme performs best, since ittakes different processor speeds into account andavoids unnecessary waiting for slower participantsto complete their share. This parallel algorithmis also insensitive to varying network loads. More-over, it is suitable for arbitrary network architec-tures, as it based on the protocol-independent MPIstandard. However, the associated communicationoverhead can be justified only in reconstructionproblems that are sufficiently large so that thetime necessary to complete a task is much largerthan the latency for communicating the result.Current and future demands on image recon-struction certainly meet this condition and makethe dynamic allocation scheme the preferablealternative.

112 A.H. Hielscher, S. Bartel

Acknowledgements

This work was supported in part by the National In-stitute of Arthritis and Musculoskeletal and Skin Dis-eases, a division of the National Institute of Health(grant # R01 AR46255-01), and the Whitaker Foun-dation (grant # 98-0244).

References

[1] C.H. Schmitz, M. Löcker, J.M. Lasker, A.H. Hielscher, R.L.Barbour, Instrumentation for fast functional optical to-mography, Review of Scientific Instruments 73 (2) (2002)429—439.

[2] F.E.W. Schmidt, M.E. Fry, E.M.C. Hillman, J.C. Hebden, D.T.Delpy, A 32-channel time-resolved instrument for medicaloptical tomography, Review of Scientific Instruments 71(2000) 256—265.

[3] C.H. Schmitz, H.L. Graber, H. Luo, I. Arif, J. Hira, Y. Pei,A. Bluestone, S. Zhong, R. Andronica, I. Soller, N. Ramirez,S.—L.S. Barbour, R.L. Barbour, Instrumentation and cali-bration protocol for imaging dynamic features in dense-scattering media by optical tomography, Applied Optics 39(2000) 6466—6485.

[4] S.B. Colak, M.B. van der Mark, G.W. ‘t Hooft, J.H. Hoogen-raad, E.S. van der Linden, F.A. Kuijpers, Clinical opticaltomography and NIR spectroscopy for breast cancer detec-tion, IEEE Journal of Selected Topics in Quantum Electron-ics 5 (1999) 1143—1158.

[5] V. Ntziachristos, X.H. Ma, A.G. Yodh, B. Chance, Multichan-nel photon counting instrument for spatially resolved nearinfrared spectroscopy, Review of Scientific Instruments 70(1) (1999) 193—201.

[6] A.M. Siegel, J.J.A. Marota, D.A. Boas, Design and evalua-tion of a continuous-wave diffuse optical tomography sys-tem, Optics Express 4 (1999) 287—298.

[7] B.W. Pogue, M. Testorf, T. Mcbride, U. Österberg, K.Paulsen, Instrumentation and design of a frequency-domaindiffuse optical tomography imager for breast cancer de-tection, Optics Express 1 (1997) 391—403.

[8] J.C. Hebden, S.R. Arridge, D.T. Delpy, Optical in imaging inmedicine I: experimental techniques, Physics in Medicineand Biology 42 (1997) 825—840.

[9] H. Jess, H. Erdl, K.T. Moesta, S. Fantini, M.A. Franceschini,E. Gratton, M. Kaschke, Intensity modulated breast imag-ing: technology and clinical pilot study results, in: R.R.Alfano, J.G. Fujimoto (Eds.), Advances in Optical Imagingand Photon Migration, OSA Trends in Optics and Photonics,Vol. II, Optical Society of America, Washington, DC, 1996,pp. 126—129.

[10] J.P. Vanhouten, D.A. Benaron, S. Spilman, D.K. Stevenson,Imaging brain injury using time-resolved near-infrared lightscanning, Pediatric Research 39 (1996) 470—476.

[11] P.J. Kirkpatrick, Use of near-infrared spectroscopy in theadult, Philosophical Transactions of the Royal Societyof London-Series B: Biological Sciences 352 (1997) 701—705.

[12] C. Hock, K. Villringer, F. Muller-Spahn, R. Wenzel, H.Heekeren, S. Schuh-Hofer, M. Hofmann, S. Minoshima,M. Schwaiger, U. Dirnagl, A. Villringer, Decrease in pari-etal cerebral hemoglobin oxygenation during performanceof a verbal fluency task inpatients with Alzheimer’s dis-ease monitored by means of near-infrared spectroscopy

(NIRS)–—correlation with simultaneous CBF-PET measure-ments, Brain Research 755 (1997) 293—303.

[13] T. Noriyuki, H. Ohdan, S. Yoshioka, Y. Miyata, T. Asahara,K. Dohi, Near-infrared spectroscopic method for assessingthe tissue oxygenation state of living lung, American Jour-nal of Respiratory and Critical Care Medicine 156 (1997)1656—1661.

[14] S.R. Hintz, W.F. Cheong, J.P. van Houten, D.K. Stevenson,D.A. Benaron, Bedside imaging of intracranial hemorrhagein the neonate using light: comparison with ultrasound,computed tomography, and magnetic resonance imaging,Pediatric Research 45 (1999) 60—65.

[15] S.P. Gopinath, C.S. Robertson, C.F. Contant, R.K. Narayan,R.G. Grossman, B. Chance, Early detection of delayed trau-matic intracranial hematomas using near-infrared spec-troscopy, Journal of Neurosurgery 83 (1995) 438—444.

[16] D.A. Benaron, S.R. Hintz, A. Villringer, D. Boas, A. Klein-schmidt, J. Frahm, C. Hirth, H. Obrig, J.C. Van Houten,E.L. Kermit, W. Cheong, D.K. Stevenson, Noninvasive func-tional imaging of human brain using light, Journal of Cere-bral Blood Flow and Metabolism 20 (2000) 469—477.

[17] M.A. Franceschini, E. Gratton, S. Fantini, V. Toronov, M. Fil-iaci, On-line optical imaging of the human brain with 160-ms temporal resolution, Optics Express 6 (2000) 49—57.

[18] L.C. Henson, C. Calalang, J.A. Temp, D.S. Ward, Accu-racy of a cerebral oximeter in healthy volunteers underconditions of isocapnic hypoxia, Anesthesiology 88 (1998)58—65.

[19] M. Tamura, Y. Hoshi, F. Okada, Localized near-infraredspectroscopy and functional optical imaging of brain ac-tivity, Philosophical Transactions of the Royal Society ofLondon-Series B: Biological Sciences 352 (1354) (1997)737—742.

[20] G. Gratton, M. Fabiani, P.M. Corballis, D.C. Hood, M.R.Goodman-Wood, J. Hirsch, K. Kim, D. Friedman, E. Grat-ton, Fast and localized event-related optical signals (EROS)in the human occipital cortex: comparisons with the visualevoked potential and fMRI, Neuroimage 6 (1997) 168—180.

[21] A. Maki, Y. Yamashita, E. Watanabe, H. Koizumi, Visualiz-ing human motor activity by using non-invasive optical to-pography, Frontiers of Medical and Biological Engineering7 (1996) 285—297.

[22] C. Hirth, H. Obrig, J. Valdueza, U. Dirnagl, A. Villringer, Si-multaneous assessment of cerebral oxygenation and hemo-dynamics during a motor task. A combined near infraredand transcranial Doppler sonography study, Advances inExperimental Medicine and Biology 411 (1997) 461—469.

[23] R. Wenzel, H. Obrig, J. Ruben, K. Villringer, A. Thiel, J.Bernardinger, U. Dirnagl, A. Villringer, Cerebral blood oxy-genation changes induced by visual stimulation in humans,Journal of Biomedical Optics 4 (1996) 399—404.

[24] C. Hock, K. Villringer, F. Muller-Spahn, M. Hofmann, S.Schuh-Hofer, H. Heekeren, R. Wenzel, U. Dirnagl, A. Vill-ringer, Near infrared spectroscopy in the diagnosis ofAlzheimer’s disease, Annals of the New York Academy ofSciences 777 (1996) 22—29.

[25] A.J. Fallgatter, M. Roesler, L. Sitzmann, A. Heidrich, T.J.Mueller, W.K. Strik, Loss of functional hemispheric asym-metry in Alzheimer’s dementia assessed with near-infraredspectroscopy, Brain Research, Cognitive Brain Research 6(1997) 67—72.

[26] U. Netz, J. Beuthan, H.J. Capius, H.C. Koch, A.D. Klose,A.H. Hielscher, Imaging of rheumatoid arthritis in fingerjoints by sagittal optical tomography, Medical Laser Appli-cation 16 (2001) 306—310.

[27] A. Klose, A.H. Hielscher, K.M. Hanson, J. Beuthan, Three-dimensional optical tomography of a finger joint model for

Parallel programming of gradient-based iterative image reconstruction schemes 113

diagnostic of rheumatoid arthritis, in: D.A. Benaron, B.Chance, M. Ferrari, M. Kohl (eds.), Photon Propagation inTissue IV, vol. 3566, Proceedings of the SPIE–—The Interna-tional Society for Optical Engineering, 1998, pp. 151—160.

[28] A. Klose, V. Prapavat, O. Minet, J. Beuthan, G. Mueller,RA diagnostics applying optical tomography in frequency-domain, Proceedings of the SPIE-The International Societyfor Optical Engineering, vol. 3196, 1997, pp. 194—204.

[29] A.E. Cerussi, A.J. Berger, F. Bevilacqua, N. Shah, D.Jakubowski, J. Butler, R.F. Holcombe, B.J. Tromberg,Sources of absorption and scattering contrast for near-infrared optical mammography, Academic Radiology 8 (3)(2001) 211—218.

[30] S. Nioka, Y. Yung, M. Shnall, S. Zhao, S. Orel, C. Xie, B.Chance, L. Solin, Optical imaging of breast tumor by meansof continuous waves, Advances in Experimental Medicineand Biology 411 (1997) 227—232.

[31] B.J. Tromberg, O. Coquoz, J.B. Fishkin, T. Pham, E.R. An-derson, J. Butler, M. Cahn, J.D. Gross, V. Venugopalan, D.Pham, Non-invasive measurements of breast tissue opti-cal properties using frequency-domain photon migration,Philosophical Transactions of the Royal Society of London-Series B: Biological Sciences 352 (1997) 661—668.

[32] R.R. Alfano, S.G. Demos, S.K. Gayen, Advances in opti-cal imaging of biomedical media, Annals of the New YorkAcademy of Sciences 820 (1997) 248—270 (Discussion 271).

[33] M.A. Franceschini, K.T. Moesta, S. Fantini, G. Gaida, E.Gratton, H. Jess, W.W. Mantulin, M. Seeber, P.M. Schlag,M. Kaschke, Frequency-domain techniques enhance opti-cal mammography: initial clinical results, Proceedings ofthe National Academy of Sciences of the United States ofAmerica 94 (1997) 6468—6473.

[34] J.A. Parker, Image Reconstruction in Radiology, CRC Press,Boca Raton, FL, 1990.

[35] S.A. Walker, S. Fantini, E. Gratton, Image reconstructionby backprojection from frequency domain optical mea-surements in highly scattering media, Applied Optics 36(1997) 170—179.

[36] S.B. Colak, D.G. Papaioannou, G.W. ’t Hooft, M.B. vander Mark, H. Schomberg, J.C.J. Paasschens, J.B.M. Melis-sen, N.A.A.J. van Asten, Tomographic image reconstructionfrom optical projections in light-diffusing media, AppliedOptics 36 (1997) 180—213.

[37] A.H. Hielscher, A.D. Klose, K.M. Hanson, Gradient-basediterative image reconstruction scheme for time resolvedoptical tomography, IEEE Transactions on Medical Imaging18 (3) (1999) 262—271.

[38] S.R. Arridge, Optical tomography in medical imaging, In-verse Problems 15 (2) (1999) R41—R93.

[39] A.D. Klose, A.H. Hielscher, Iterative reconstruction schemefor optical tomography based on the equation of radiativetransfer, Medical Physics 26 (8) (1999) 1698—1707.

[40] R. Roy, E.M. Sevick-Muraca, Truncated Newton’s optimiza-tion scheme for absorption and fluorescence optical to-mography: Part I theory and formulation, Optics Express 4(10) (1999) 353—371.

[41] O. Dorn, A transport—backtransport method for opticaltomography, Inverse Problems 14 (1998) 1107—1130.

[42] Y.Q. Yao, Y. Wang, Y.L. Pei, W.W. Zhu, R.L. Barbour,Frequency-domain optical imaging of absorption and scat-tering distributions by Born iterative method, Journal ofthe Optical Society of America A 14 (1997) 325—342.

[43] H. Jiang, K.D. Paulsen, U.L. Österberg, Optical imagereconstruction using DC data: simulations and experi-

ments, Physics in Medicine and Biology 41 (1996) 1483—1498.

[44] H.B. Jiang, K.D. Paulsen, U.L. Osterberg, P.W. Pogue,M.S. Patterson, Optical image reconstruction using fre-quency domain data: simulations and experiments, Jour-nal of the Optical Society of America A 13 (1996) 253—266.

[45] R.L. Barbour, H.L. Graber, J.W. Chang, S.L.S. Barbour, P.C.Koo, R. Aronson, MRI-guided optical tomography: prospectsand computation for a new imaging method, IEEE Compu-tational Science and Engineering 2 (1995) 63—77.

[46] K.D. Paulsen, H. Jiang, Spatially varying optical propertyreconstruction using a finite element diffusion equationapproximation, Medical Physics 22 (1995) 691—701.

[47] M.A. O’Leary, D.A. Boas, B. Chance, A.G. Yodh, Experimen-tal images of heterogeneous turbid media by frequency-domain diffusion-photon tomography, Optics Letters 20(1995) 426—428.

[48] D.Y. Paithankar, A.U. Chen, B.W. Pogue, M.S. Patterson,E.M. Sevick-Muraca, Imaging of fluorescent yield and life-time from multiply scattered light reemitted from randommedia, Applied Optics 36 (1997) 2260—2272.

[49] M.V. Klibanov, T.R. Lucas, R.M. Frank, A fast and accurateimaging algorithm in optical diffusion tomography, InverseProblems 13 (1997) 1341—1361.

[50] S. Bartel, G. Abdoulaev, A.H. Hielscher, Parallelization ofgradient-based image reconstruction schemes, in: Biomed-ical Topical Meetings, OSA Technical Digests, Optical Soci-ety of America, Washington, DC, 2000, pp. 433—435.

[51] M.J. Holboke, B.J. Tromberg, X. Li, N. Shah, J. Fishkin, D.Kidney, J. Butler, B. Chance, A.G. Yodh, Three-dimensionaldiffuse optical mammography with ultrasound localizationin a human subject, Journal of Biomedical Optics 5 (2)(2000) 237—247.

[52] S.R. Arridge, M.A. Schweiger, A gradient-based optimisa-tion scheme for optical tomography, Optics Express 2 (6)(1998) 213—226.

[53] A.J. Davies, D.B. Christianson, L.C.W. Dixon, R. Roy, P.van der Zee, Reverse differentiation and the inverse diffu-sion problem, Advances in Engineering Software 28 (1997)217—221.

[54] R. Roy, E.M. Sevick-Muraca, Truncated Newton’s optimiza-tion scheme for absorption and fluorescence optical to-mography: Part I theory and formulation, Optics Express 4(10) (1999) 353—371.

[55] A.D. Klose, U. Netz, J. Beuthan, A.H. Hielscher, Opticaltomography using the time-independent equation of ra-diative transfer. Part 1: forward model, Journal of Quan-titative Spectroscopy and Radiative Transfer 72 (5) (2002)691—713.

[56] A.D. Klose, A.H. Hielscher, Optical tomography using thetime-independent equation of radiative transfer. Part 2:inverse model, Journal of Quantitative Spectroscopy andRadiative Transfer 72 (5) (2002) 715—732.

[57] D.G. Luenberger, Linear and Nonlinear Programming,Addison-Wesley, Boston, MA, 1984.

[58] P.S. Pacheco, Parallel Programming with MPI, Morgan Kauf-mann Publishers, San Francisco, CA, 1997.

[59] Special Issue: MPI–—A Message Passing Interface Standard.International Journal of Supercomputer Applications andHigh Performance Computing 8 (1994).

[60] D.W. Stoller, Magnetic Resonance Imaging in Orthopaedicsand Sports Medicine, second ed, Lippincott-Raven, NewYork, 1997.