Source: users.ece.utexas.edu/~yilmaz/CEM_APPLICATIONS/yilmaz_uiuc11.pdf

Fast Integral Equation Methods for BIOEM in the Petascale Era:
Kilo Cores, Mega Codes, Giga Equations, Tera Bytes, and Peta Multiplications

Ali E. YILMAZ

Department of Electrical & Computer Engineering
University of Texas at Austin

University of Illinois at Urbana-Champaign, Urbana, IL, Apr. 19, 2011

Outline

• Motivation
- Body-Centric Wireless Systems
- Historical Perspective: Computational Bioelectromagnetics
- Computing Trends

• Background
- FDTD vs. MOM
- Pre-corrected FFT/Adaptive Integral Method
- Computational Complexity

• Parallelization
- General Limits of Parallelization
- Traditional Scheme and Its Limits
- Proposed Parallelization
- Comparison

• BIOEM Problems
- High-Resolution Simulations
- AustinMan Anthropomorphic Human Model

Body-Centric Wireless Systems

- More functionality near-, on-, in-body
- Wide-spread use
- Marketing
- Health concerns

Images from news articles in newscientist.com, switched.com, pulse2.com

Computational BIOEM Requirements

• Phantoms (“tissue-equivalent” liquids)
  IEEE Std. C95.3-1991/2002 and 1528-2003
- Resolve wavelength/skin depth in material (900 MHz): $\lambda/10 \approx 5.4$ mm, $\delta/10 \approx 3.6$ mm

• Anthropomorphic Human Models
- Resolve wavelength/skin depth in tissues: $\lambda/10 \approx 3.8$ mm, $\delta/10 \approx 1.9$ mm
  Finite element size ~10-100 mm^3, $N \sim 10\text{-}100 \times 10^{6}$
- Resolve tissue boundaries
  Finite element size ~0.1-1 mm^3, $N \sim 1\text{-}10 \times 10^{9}$

Gabriel, Lau, Gabriel, "The dielectric properties of biological tissues: III. Parametric models for the dielectric spectrum of tissues," Phys. Med. Biol., Nov. 1996. http://niremf.ifac.cnr.it/tissprop/htmlclie/htmlclie.htm

conformity.com/artman/publish/printer_74.shtml

Tissue Name          Conductivity [S/m]   Relative Permittivity   Loss Tangent   Wavelength [mm]   Skin Depth [mm]
BoneCortical         0.14331              12.454                  0.22984        93.782            131.57
BoneMarrow           0.040208             5.5043                  0.1459         141.61            310.58
BrainGrayMatter      0.94227              52.725                  0.35694        45.182            41.536
CerebroSpinalFluid   2.4126               68.638                  0.70202        38.147            19.215
Muscle               0.94294              55.032                  0.34222        44.277            42.355
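The wavelength and skin depth columns of the tissue table can be cross-checked from the conductivity and relative permittivity. The sketch below assumes a non-magnetic, plane-wave model at 900 MHz; the function name `wavelength_and_skin_depth` and the constants are illustrative and do not come from the slides.

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity [F/m]
MU0 = 4e-7 * math.pi     # vacuum permeability [H/m]


def wavelength_and_skin_depth(sigma, eps_r, freq_hz):
    """Return (loss tangent, wavelength [mm], skin depth [mm]) of a lossy dielectric.

    Assumes a uniform plane wave in a non-magnetic medium (mu = mu0).
    """
    omega = 2.0 * math.pi * freq_hz
    loss_tangent = sigma / (omega * EPS0 * eps_r)
    root = math.sqrt(1.0 + loss_tangent ** 2)
    k_lossless = omega * math.sqrt(MU0 * EPS0 * eps_r)
    beta = k_lossless * math.sqrt((root + 1.0) / 2.0)   # phase constant [rad/m]
    alpha = k_lossless * math.sqrt((root - 1.0) / 2.0)  # attenuation constant [Np/m]
    return loss_tangent, 1e3 * 2.0 * math.pi / beta, 1e3 / alpha


# Muscle row of the table: sigma = 0.94294 S/m, eps_r = 55.032
tan_d, lam_mm, delta_mm = wavelength_and_skin_depth(0.94294, 55.032, 900e6)
print(f"loss tangent {tan_d:.5f}, wavelength {lam_mm:.3f} mm, skin depth {delta_mm:.3f} mm")
```

Dividing the returned wavelength and skin depth by 10 gives the kind of resolution targets quoted on the slide.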

Computational BIOEM
Historical Perspective: 1970s

CDC 6600: 2-3 MFlops

Chen, Guru, “Internal EM field and absorbed power density in human torsos induced by 1-500 MHz EM waves,” IEEE Trans. MTT, Sep. 1977.
“Our numerical computation was performed with a CDC 6500 computer which has a maximum capacity of inverting a 120x120 matrix. This limits the maximum number of cells to 40…A cell size of 10 cm^3 seems appropriate.”

Hagmann, Gandhi, Durney, “Numerical calculation of electromagnetic energy deposition for a realistic model of man,” IEEE Trans. MTT, Sep. 1979.
“A total of 180 cubical cells of various sizes was used to obtain a best fit of the contour on diagrams of the 50th percentile standard man…the matrix is 270 by 270…”

Computational BIOEM
80s and 90s: CG-FFT and FDTD

Borup, Gandhi, “Fast-Fourier-Transform method for calculation of SAR distributions in finely discretized inhomogeneous models of biological bodies,” IEEE Trans. MTT, Apr. 1984.
“It is planned to extend the FFT method to 3-D problems. Thus far, the only technique available…for models of man…is the MOM.”
“An obvious difficulty…is the creation of the model…Another problem to be solved is the presentation of the results.”

Borup, Sullivan, Gandhi, “Comparison of the FFT conjugate gradient method and the finite-difference time-domain method for the 2-D absorption problem,” IEEE Trans. MTT, Apr. 1987.
“…serious errors exist in solutions for the TE polarization…fictitious line charge sources…have been removed…these modifications prevent…the use of the FFT-CGM and thus require O(N^3) operations.”
“In contrast, FD-TD…yields excellent solutions…FD-TD has great potential for solving the high-resolution models needed in bioelectromagnetics.”

Computational BIOEM
90s and 00s: FDTD Dominance

Sullivan, Gandhi, Taflove, “Use of the finite-difference time-domain method for calculating EM absorption in man models,” IEEE Trans. Biomed. Eng., Mar. 1988.
“FDTD is used to calculate the detailed SAR within the human body…Comparison…with the method of moments.”

[Figure: SEMCAD X]

Hombach, Meier, Burkhards, Kuhn, Kuster, “The dependence of EM energy absorption upon human head modeling at 900 MHz,” IEEE Trans. MTT, Oct. 1996.
“Different numerical head phantoms based on MRI scans of three different adults were used with voxel sizes down to 1 mm^3…up to 13 tissue types. The numerical results are compared with those of measurements in a multitissue phantom…”

Wiart, Hadjem, Wong, Bloch, “Analysis of RF exposure in the head tissues of children and adults,” Phys. Med. Biology, IEEE Trans. MTT, June 2008.
“…major efforts…to improve the numerical dosimetry…in particular the FDTD has been improved and adapted to this domain.”

Computational BIOEM
90s and 00s: FDTD Controversies

Lin, “Cellular mobile telephones and children,” IEEE Antennas Propagat. Mag., Oct. 2002.
“Are children more vulnerable to the microwave radiation from cellular mobile telephones?…Numerical dosimetry…one of the most productive research areas during the past decade…Instead of providing a consensus, these SAR studies have spawned diverse opinions.”

Mason, Hurt, Walters, D’Andrea, Gajsek, Ryan, Nelson, Smith, Ziriax, “Effects of frequency, permittivity, and voxel size on predicted specific absorption rate values in biological tissue during electromagnetic-field exposure,” IEEE Trans. MTT, Nov. 2000.
“Comparing localized SAR values determined experimentally...with those obtained using the FDTD method demonstrate good agreement, except when the FDTD method calculates “hot or cold” spots (relative to whole-body average SAR).”

Gjonaj, Bartsch, Clemens, Schupp, Weiland, “High-resolution human anatomy models for advanced electromagnetic field computations,” IEEE Trans. Mag., Mar. 2002.
“…induced SAR…in the human head due to cellular phone radiation…for heterogeneous and homogeneous HUGO models. The discrepancy of the results, concerning the location of hot spots…necessity of using high-resolution anatomy models for realistic simulations.”

Wust, Nadobny, Seebass, Stalling, Gellermann, Hege, Deuflhard, Felix, “Influence of patient models and numerical methods on predicted power deposition patterns,” Int. J. Hyperthermia, Jan. 1999.
“CT data sets are transformed…FDTD-method is then applied on a cubic lattice (voxel model) or a tetrahedron grid (region-based segmentation)…For comparison, the VSIE method is performed on the same tetrahedron grid…power deposition patterns in the interior of the patient depended strongly on the models…Due to its shorter calculation time, the FDTD method is currently used in the clinic. Predictions…without prior corrections of tissue specifications are not always supported by clinical experiences.”

Kainz, Christ, Kellom, Seidman, Nikoloski, Beard, Kuster, “Dosimetric comparison of the SAM to 14 anatomical head models using a novel definition for the mobile phone positioning,” Phys. Med. Biol., Nov. 2000.
“Numerous numerical dosimetric studies have been published about the exposure of mobile phone users which concluded with conflicting results. However, many of these studies lack reproducibility due to shortcomings in the description of the phone position.”

In the meantime…

Massoudi, Durney, Iskander, “Limitations of the cubical block model of man in calculating SAR distributions,” IEEE Trans. MTT, Aug. 1984.
“…while the moment-method, using pulse basis functions, gives good values for whole-body average SAR, the convergence of the solutions for SAR distributions is questionable…Pulse functions should be replaced by a better approximation.”

Schaubert, Wilton, Glisson, “A tetrahedral modeling method for electromagnetic scattering by arbitrarily shaped inhomogeneous dielectric bodies,” IEEE Trans. AP, Jan. 1984.
“Special basis functions are defined within tetrahedral volume elements to insure that the normal electric field satisfies the correct jump condition at interfaces between different dielectric media.”

Zwamborn, Vandenberg, “The 3-D weak form of the conjugate-gradient FFT method for solving scattering problems,” IEEE Trans. MTT, Sep. 1992.
“A weak form of the integral equation…is obtained by testing it with appropriate testing functions. In view of the partial derivatives...the volumetric rooftop functions are chosen as testing and expansion functions.”

Jin, Chen, Chew, Gan, Magin, Dimbylow, “Computation of electromagnetic fields for high-frequency magnetic resonance imaging applications,” Phys. Med. Biol., Dec. 1996.
“The use of pulse basis functions yielded slow convergence and poor results when dealing with materials with high dielectric contrast. Better formulations were later proposed…most use mixed-order basis functions and Galerkin’s testing.”
“…BCG-FFT method reduces the memory requirement to O(N) and computational complexity to O(N_iter N log N), making the problem solvable on a workstation. For lossy material…N_iter is found to be small.”

[Figure: 9 tissues, 8 mm^3 resolution]

Historical Perspective: Supercomputing
Top Supercomputers (top500.org)

• 1. Tian-He 1A: Intel X5670, 6 Core, 2930 MHz
• 2. Jaguar: AMD x86_64 Opteron, 6 Core, 2600 MHz
• 15. Ranger: AMD x86_64 Opteron, 4 Core, 2300 MHz
• Blue Waters (this year): IBM POWER7 processor, 8 Core

[Figure: photos of Jaguar, Tian-He 1A, and Ranger]

Historical Perspective: Limits of Serial Computing

BACKGROUND

- FDTD vs. MOM

- Pre-corrected FFT/Adaptive Integral Method

- Computational Complexity

FDTD vs. MOM for BIOEM

• FDTD
+ Historical momentum/easy to code
+ Matrix free
- Uniform grids/staircasing
- Special grid truncation
- Extended grids
- Courant stability condition
-+ Phase dispersion
-+ Time-domain (transients)
Simulation Time: $O(N_t N')$
Memory: $O(N')$

• MOM+FFT-based Algorithm
- Rarer/harder to code
- Dense matrix (+ fast algorithm)
+ Arbitrary mesh
+ Exact radiation condition
+ Never mesh free space
+ Unknowns only in space
+- Incorrect Green function phase speed
+- Frequency-domain (dispersive materials)
Simulation Time: $O(N_{\mathrm{iter}} N \log N)$
Memory: $O(N)$

For the same accuracy: $N' \gg N$
For $N = N'$: err(FDTD) $\gg$ err(MOM)

FDTD vs. MOM for BIOEM

[Figure: error vs. time/memory cost. Labels: Utopia, Controversial results, Computational Research, Increases in Raw Computer Power, Increased expectations, Curiosity, New applications]

Volume Electric Field Integral Equation

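• Time-harmonic VEFIE for Inhomogeneous and Lossy Dielectric Scatterer

[Figure: scatterer of volume $V$ with $\epsilon(\mathbf{r}), \sigma(\mathbf{r}), \mu_0$ in a background medium $\epsilon_0, \mu_0$, illuminated by $\mathbf{E}^{\mathrm{inc}}$; induced current $\mathbf{J}(\mathbf{r})$]

$$\tilde{\epsilon}(\mathbf{r}) \equiv \epsilon(\mathbf{r}) + \frac{\sigma(\mathbf{r})}{j\omega}, \qquad \mathbf{D}(\mathbf{r}) = \tilde{\epsilon}(\mathbf{r})\,\mathbf{E}(\mathbf{r}), \qquad \mathbf{J}(\mathbf{r}) = j\omega \underbrace{\left(1 - \frac{\epsilon_0}{\tilde{\epsilon}(\mathbf{r})}\right)}_{\kappa(\mathbf{r})} \mathbf{D}(\mathbf{r})$$

$$\mathbf{E}^{\mathrm{inc}}(\mathbf{r}) = \mathbf{E}(\mathbf{r}) - \mathbf{E}^{\mathrm{sca}}(\mathbf{r}) = \mathbf{E}(\mathbf{r}) + j\omega \mathbf{A}(\mathbf{r}) + \nabla\phi(\mathbf{r})$$

$$\begin{Bmatrix} \mathbf{A}(\mathbf{r}) \\ \phi(\mathbf{r}) \end{Bmatrix} = \iiint_V \begin{Bmatrix} \mu_0\,\mathbf{J}(\mathbf{r}') \\ -\nabla' \cdot \mathbf{J}(\mathbf{r}') / (j\omega\epsilon_0) \end{Bmatrix} G(\mathbf{r}, \mathbf{r}')\,dv', \qquad G(\mathbf{r}, \mathbf{r}') = \frac{e^{-jk_0|\mathbf{r}-\mathbf{r}'|}}{4\pi|\mathbf{r}-\mathbf{r}'|}$$

$$\mathbf{E}^{\mathrm{inc}}(\mathbf{r}) = \frac{\mathbf{D}(\mathbf{r})}{\tilde{\epsilon}(\mathbf{r})} - \omega^2\mu_0 \iiint_V \kappa(\mathbf{r}')\,\mathbf{D}(\mathbf{r}')\,G(\mathbf{r}, \mathbf{r}')\,dv' - \nabla \iiint_V \frac{\nabla' \cdot [\kappa(\mathbf{r}')\,\mathbf{D}(\mathbf{r}')]}{\epsilon_0}\,G(\mathbf{r}, \mathbf{r}')\,dv', \qquad \mathbf{r} \in V$$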

Method of Moments

• MOM

- Discretize & Galerkin test

$$\mathbf{D}(\mathbf{r}) \cong \sum_{k'=1}^{N} I_{k'}\,\mathbf{V}_{k'}(\mathbf{r}), \qquad \mathbf{J}(\mathbf{r}) \cong j\omega \sum_{k'=1}^{N} I_{k'}\,\kappa(\mathbf{r})\,\mathbf{V}_{k'}(\mathbf{r})$$

$$\iiint_{V_k} \mathbf{V}_k(\mathbf{r}) \cdot \{\text{VEFIE}\}\,dv \quad \text{for } k = 1, \ldots, N$$

$$\mathbf{V}_k(\mathbf{r}) = \begin{cases} \dfrac{a_k}{3 v_k^{\pm}}\,\mathbf{p}_k^{\pm}, & \mathbf{r} \in V_k^{\pm} \\ 0, & \text{elsewhere} \end{cases}$$

[Figure: basis function $\mathbf{V}_k$ on a pair of tetrahedra $V_k^{+}$ and $V_k^{-}$ (permittivities $\epsilon_k^{+}$, $\epsilon_k^{-}$) sharing the face $a_k$, with vectors $\mathbf{p}_k^{+}$ and $\mathbf{p}_k^{-}$]

Schaubert, Wilton, Glisson, “A tetrahedral modeling method for electromagnetic scattering by arbitrarily shaped inhomogeneous dielectric bodies,” IEEE Trans. AP, Jan. 1984.

- Fill matrix & solve

$$\mathbf{Z}\,\mathbf{I} = \mathbf{F}^{\mathrm{inc}}, \qquad \mathbf{Z} \text{ is } N \times N$$

• Computational Cost

Matrix fill time: $O(N^2)$
Time per iteration: $O(N^2)$
Memory: $O(N^2)$
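To make the $O(N^2)$ memory cost concrete, the sketch below estimates the storage of a dense $N \times N$ impedance matrix. It assumes 16 bytes per entry (double-precision complex), which is an assumption and not a figure from the slides; the sample values of $N$ are the approximate unknown counts listed for the benchmark problems.

```python
def dense_mom_memory_bytes(n_unknowns, bytes_per_entry=16):
    """Bytes needed to store a dense n x n matrix (assumed 16 B per complex entry)."""
    return n_unknowns ** 2 * bytes_per_entry


# Approximate unknown counts taken from the benchmark problems in this talk.
for n in (10_000, 2_500_000, 160_000_000):
    tib = dense_mom_memory_bytes(n) / 2 ** 40
    print(f"N = {n:>11,d}: {tib:,.3f} TiB")
```

Under these assumptions the largest case is far beyond the memory of any machine, which is why the dense matrix is paired with a fast algorithm.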

Pre-corrected FFT/Adaptive Integral Method

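Embed in a regular grid with $N_C$ nodes

Step 1: Project/Anterpolate
Step 2: Propagate
Step 3: Interpolate
Step 4: Correct

$$\mathbf{Z}\,\mathbf{I} \approx \mathbf{Z}^{\mathrm{near}}\,\mathbf{I} + \boldsymbol{\Lambda}^{T}\,\mathrm{IFFT}\{\mathrm{FFT}[\mathbf{G}] \cdot \mathrm{FFT}[\boldsymbol{\Lambda}\,\mathbf{I}]\}$$

Correction region size: $\gamma = 1$

[Figure: scatterer volume $V$ embedded in the regular grid, with the operators $\boldsymbol{\Lambda}$, $\mathbf{G}$, $\boldsymbol{\Lambda}^{T}$ and the correction $\mathbf{Z}^{\mathrm{near}}$]

• Computational Cost

Matrix fill time: $O(N^{\mathrm{near}} + N_C + pN)$
Time per iteration: $O(N^{\mathrm{near}} + pN + N_C \log N_C + pN)$
Memory: $O(N^{\mathrm{near}} + N_C + pN)$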
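The propagation step relies on the grid Green function being translation invariant, so its matrix-vector product is a discrete convolution that FFTs can evaluate. The sketch below shows this for a 1-D toy grid using NumPy; the grid size, the spacing, the zeroed self term, and the name `toeplitz_matvec_fft` are illustrative assumptions, and the actual method works on a 3-D grid.

```python
import numpy as np


def toeplitz_matvec_fft(kernel, x):
    """Compute y[m] = sum_n kernel[|m - n|] * x[n] with FFTs (circulant embedding)."""
    n = len(x)
    # First column of a 2n-point circulant matrix that embeds the Toeplitz matrix.
    circ = np.concatenate([kernel, [0.0], kernel[:0:-1]])
    x_pad = np.concatenate([x, np.zeros(n, dtype=complex)])
    y = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(x_pad))
    return y[:n]


n_c = 64                                   # number of grid nodes (assumed)
spacing = 0.005                            # grid spacing in meters (assumed)
k0 = 2.0 * np.pi * 900e6 / 299792458.0     # free-space wavenumber at 900 MHz
r = spacing * np.arange(n_c)
g = np.zeros(n_c, dtype=complex)
g[1:] = np.exp(-1j * k0 * r[1:]) / (4.0 * np.pi * r[1:])  # self term left at zero

rng = np.random.default_rng(0)
x = rng.standard_normal(n_c) + 1j * rng.standard_normal(n_c)

idx = np.arange(n_c)
dense = g[np.abs(idx[:, None] - idx[None, :])]  # explicit O(N_C^2) Toeplitz matrix
y_direct = dense @ x
y_fft = toeplitz_matvec_fft(g, x)
print(np.max(np.abs(y_fft - y_direct)))
```

The FFT route costs $O(N_C \log N_C)$ instead of $O(N_C^2)$, which is where the $N_C \log N_C$ term in the time per iteration comes from.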

Benchmark Problems

CAD Sphere Phantom
ΔV (mm^3)   l (mm)   λ/l   δ/l   N
1030        21       2.4   1.7   ~10^4
15          5        9.8   7.0   ~6×10^5
0.3         1.4      36    26    ~3.2×10^7

Voxel-Based Sphere Phantom
ΔV (mm^3)   l (mm)   λ/l   δ/l   N
683         20       2.5   1.8   ~1.5×10^4
11          5        9.9   7.1   ~9×10^5
0.2         1.3      40    29    ~5.6×10^7

AustinMan Partial Body
ΔV (mm^3)   l (mm)   N
683         20       ~4×10^4
11          5        ~2.5×10^6
0.2         1.3      ~1.6×10^8

[Figure: the three benchmark geometries, each illuminated by $\mathbf{E}^{\mathrm{inc}}$, with 50 mm scale marks. Sphere phantoms: radius = 104 mm, $\epsilon_r = 41.5$, $\sigma = 0.97$ S/m. AustinMan partial body: 500 × 300 × 350 mm^3.]

f = 900 MHz

Definitions for Post-Processing

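Time-averaged absorbed power density:

$$\langle P_{\mathrm{absorbed}}(\mathbf{r}, f) \rangle = f \int_{0}^{1/f} \sigma(\mathbf{r})\,\mathbf{E}(\mathbf{r}, t) \cdot \mathbf{E}(\mathbf{r}, t)\,dt = \frac{1}{2}\,\sigma(\mathbf{r})\,\mathbf{E}(\mathbf{r}, f) \cdot \mathbf{E}^{*}(\mathbf{r}, f)$$

Relative RCS Error:

$$\frac{\displaystyle\int_{0}^{2\pi}\!\!\int_{0}^{\pi} \left|\sigma_{\theta\theta}^{\mathrm{AIM}} - \sigma_{\theta\theta}^{\mathrm{MIE}}\right|^{2} \sin\theta\,d\theta\,d\phi}{\displaystyle\int_{0}^{2\pi}\!\!\int_{0}^{\pi} \left|\sigma_{\theta\theta}^{\mathrm{MIE}}\right|^{2} \sin\theta\,d\theta\,d\phi}$$

[Figure: the three benchmark geometries, each illuminated by $\mathbf{E}^{\mathrm{inc}}$, with 50 mm scale marks. Sphere phantoms: radius = 104 mm, $\epsilon_r = 41.5$, $\sigma = 0.97$ S/m. AustinMan partial body: 500 × 300 × 350 mm^3. f = 900 MHz.]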
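The absorbed power definition equates a time average over one period with a phasor expression. The sketch below checks that identity numerically for an arbitrary field phasor, assuming an $e^{j\omega t}$ time convention; the function names and sample values are illustrative only.

```python
import numpy as np


def time_averaged_power(sigma, e_phasor, freq_hz, n_samples=64):
    """Average sigma * E(t) . E(t) over one period, with E(t) = Re{E e^{j w t}}."""
    t = np.arange(n_samples) / (n_samples * freq_hz)  # uniform samples of one period
    e_t = np.real(e_phasor[None, :] * np.exp(2j * np.pi * freq_hz * t)[:, None])
    return sigma * np.mean(np.sum(e_t ** 2, axis=1))


def phasor_power(sigma, e_phasor):
    """Frequency-domain form: 0.5 * sigma * E . E*."""
    return 0.5 * sigma * np.real(np.vdot(e_phasor, e_phasor))


sigma = 0.97                                   # S/m, sphere phantom value
e_field = np.array([1.0 + 0.5j, -0.3j, 2.0])   # arbitrary field phasor [V/m]
p_time = time_averaged_power(sigma, e_field, 900e6)
p_freq = phasor_power(sigma, e_field)
print(p_time, p_freq)
```

With the field known only as a frequency-domain solution, the phasor form is the one that can be evaluated directly per mesh element.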

Accuracy

[Figure: $\langle P_{\mathrm{absorbed}} \rangle$ for ΔV = 1030, 15, and 0.3 mm^3 (CAD sphere) and ΔV = 683, 11, and 0.2 mm^3 (voxel-based sphere)]

Computational Costs

[Figure: computational costs for ΔV = 683 mm^3 and ΔV = 0.2 mm^3]

Need for Parallelization: Minimum P for Different Constraints

Parallelization

- General Limits

- Traditional Scheme and Its Limitations

- Proposed Scheme

- Comparison

General Limits of Parallelization

• (Hard) Scalability Limitations
1. Sequential part of the code (Amdahl’s Law)
2. Load imbalance/Can’t divide computations further (run out of parallelism)
3. Communication latency/bandwidth

[Figure: wall clock time or memory vs. P, with the three limitations marked 1, 2, 3]
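The first limitation can be quantified with Amdahl's law: if a fraction of the run time is sequential, the speedup on P processes is bounded by the inverse of that fraction. A minimal sketch, with an illustrative 1% sequential fraction that is not taken from the slides:

```python
def amdahl_speedup(serial_fraction, p):
    """Ideal speedup on p processes when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


for p in (16, 256, 4096, 65536):
    print(p, round(amdahl_speedup(0.01, p), 1))
```

Even a 1% sequential part caps the speedup below 100 no matter how many processes are used, which motivates removing every size-N serial array or operation.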

Traditional Parallelization Scheme
• Single Workload Distribution Strategy [1]

1. Project/Anterpolate: 1-D Slab
2. Propagate: 1-D Slab
3. Interpolate: 1-D Slab
4. Correct: 1-D Slab
5. Communicate Near

$P = P^{xy} = 6$

$P^{1D}_{\max} = \min(N_{cx}, N_{cy})$

[Figure: 1-D slab decomposition of the grid among P1–P6; axes x, y, z]

[1] Aluru, Nadkarni, White, “A parallel precorrected FFT based capacitance extraction program for signal integrity analysis,” Proc. Design Automat. Conf., 1996.


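Traditional Parallelization Scheme
• Single Workload Distribution Strategy

1. Project/Anterpolate: 1-D Slab
2. Propagate: 1-D Slab
   a) FFT_ZY – Transpose – FFT_X
   b) Multiply
   c) IFFT_X – Transpose – IFFT_YZ
3. Interpolate: 1-D Slab
4. Correct: 1-D Slab
5. Communicate Near

$$\mathbf{Z}\,\mathbf{I} \approx \mathbf{Z}^{\mathrm{near}}\,\mathbf{I} + \boldsymbol{\Lambda}^{T}\,\mathrm{IFFT}\{\mathrm{FFT}[\mathbf{G}] \cdot \mathrm{FFT}[\boldsymbol{\Lambda}\,\mathbf{I}]\}$$

$$O(N^{\mathrm{near}} + pN + N_C \log N_C + pN)$$

[Memory, CPU time per iteration, and comm time per iteration: expressions in $N^{\mathrm{near}}$, $N_C$, $pN$, $C^{\mathrm{near}}$, $P$, and $P^{1D}_{\max}$]

[Figure: 1-D slab decomposition of the grid among P1–P6, before and after the transpose; axes x, y, z]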
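The slab-based propagation step (FFT_ZY, transpose, FFT_X) can be emulated serially to see where the grid limit comes from. The sketch below is a NumPy emulation, not an MPI implementation; `slab_fft3d` and the grid size are illustrative assumptions, and the regrouping step stands in for the all-to-all transpose.

```python
import numpy as np


def slab_fft3d(grid, p):
    """Emulate a 1-D slab-decomposed 3-D FFT on p processes (serial emulation)."""
    n_x, n_y, _ = grid.shape
    if p > min(n_x, n_y):
        raise ValueError("p exceeds the 1-D slab limit min(N_cx, N_cy)")
    # Each process owns a slab of x-planes and does local 2-D FFTs over (y, z).
    x_slabs = [np.fft.fftn(s, axes=(1, 2)) for s in np.array_split(grid, p, axis=0)]
    # Transpose: regroup the data so each process owns a slab of y-planes with all x.
    full = np.concatenate(x_slabs, axis=0)
    y_slabs = [np.fft.fft(s, axis=0) for s in np.array_split(full, p, axis=1)]
    return np.concatenate(y_slabs, axis=1)


rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 6, 4)) + 1j * rng.standard_normal((8, 6, 4))
result = slab_fft3d(grid, p=6)
reference = np.fft.fftn(grid)
print(np.max(np.abs(result - reference)))
```

Each process needs at least one x-plane before the transpose and one y-plane after it, so the process count cannot exceed $\min(N_{cx}, N_{cy})$ in this scheme.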







Limits of Traditional Parallelization Scheme

1. Anterpolation, Correction, Interpolation computations not balanced

2. “Grid limited”: $P^{1D}_{\max} = \min(N_{cx}, N_{cy})$



Limits of Traditional Parallelization: Maximum P and Maximum N

[Figure: maximum P and maximum N for the sphere phantom, radius = 104 mm, $\epsilon_r = 41.5$, $\sigma = 0.97$ S/m]

Proposed Parallelization Scheme
• Two Workload Distribution Strategies [1]

1. Project/Anterpolate: 2-D Pencil
2. Propagate: 2-D Pencil
3. Interpolate: 2-D Pencil
3.5. Redistribute
4. Correct: 3-D Block
5. Communicate Near/Redistribute

[Figure: 2-D pencil decomposition of the grid among P1–P6; axes x, y, z]

[1] Wei, Yılmaz, “A 2-D decomposition based parallelization of AIM for 3-D BIOEM problems,” IEEE APS Symp., 2011.

Proposed Parallelization Scheme
• Two Workload Distribution Strategies

1. Project/Anterpolate: 2-D Pencil
2. Propagate: 2-D Pencil
3. Interpolate: 2-D Pencil
3.5. Redistribute
4. Correct: 3-D Block
   a) Shift Balance in X Dimension
   b) Shift Balance in Y Dimension
   c) Shift Balance in Z Dimension
5. Communicate Near/Redistribute

[Figure: 3-D block decomposition among P1–P6 with shifted block boundaries; axes x, y, z]




$P = P^{xy} \times P^{yz} = 3 \times 2$

$P^{2D}_{\max} = P^{xy}_{\max} \times P^{yz}_{\max} = \min(N_{cx}, N_{cy}) \times \min(N_{cy}, N_{cz})$




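Proposed Parallelization Scheme
• Two Workload Distribution Strategies

1. Project/Anterpolate: 2-D Pencil
2. Propagate: 2-D Pencil
   a) FFT_Z – ZY Transpose – FFT_Y – YX Transpose – FFT_X
   b) Multiply
   c) IFFT_X – XY Transpose – IFFT_Y – YZ Transpose – IFFT_Z
3. Interpolate: 2-D Pencil
3.5. Redistribute
4. Correct: 3-D Block
5. Communicate Near/Redistribute

$$\mathbf{Z}\,\mathbf{I} \approx \mathbf{Z}^{\mathrm{near}}\,\mathbf{I} + \boldsymbol{\Lambda}^{T}\,\mathrm{IFFT}\{\mathrm{FFT}[\mathbf{G}] \cdot \mathrm{FFT}[\boldsymbol{\Lambda}\,\mathbf{I}]\}$$

$$O(N^{\mathrm{near}} + pN + N_C \log N_C + pN)$$

[Memory, CPU time per iteration, and comm time per iteration: expressions in $N^{\mathrm{near}}$, $N_C$, $pN$, $C^{\mathrm{near}}$, $C^{\mathrm{rd}}$, $P$, $P^{xy}_{\max}$, and $P^{yz}_{\max}$]

[Figure: 2-D pencil decomposition of the grid among P1–P6 through the successive transposes; axes x, y, z]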
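The pencil-based propagation step (FFT_Z, ZY transpose, FFT_Y, YX transpose, FFT_X) can also be emulated serially. In the NumPy sketch below the transposes are implicit: the global array is simply re-partitioned before each stage, which is an emulation shortcut and not how a distributed code would move data. The names `pencil_fft3d` and `_fft_on_blocks` and the grid size are illustrative assumptions.

```python
import numpy as np


def _fft_on_blocks(a, axis, split_axes, parts):
    """Apply 1-D FFTs along `axis`, one pencil (block of the two other axes) at a time."""
    out = np.empty_like(a, dtype=complex)
    ax0, ax1 = split_axes
    for i0 in np.array_split(np.arange(a.shape[ax0]), parts[0]):
        for i1 in np.array_split(np.arange(a.shape[ax1]), parts[1]):
            sl = [slice(None)] * 3
            sl[ax0] = slice(i0[0], i0[-1] + 1)
            sl[ax1] = slice(i1[0], i1[-1] + 1)
            out[tuple(sl)] = np.fft.fft(a[tuple(sl)], axis=axis)
    return out


def pencil_fft3d(grid, p_xy, p_yz):
    """Emulate a 2-D pencil-decomposed 3-D FFT on p_xy * p_yz processes (serial emulation)."""
    n_x, n_y, n_z = grid.shape
    if p_xy > min(n_x, n_y) or p_yz > min(n_y, n_z):
        raise ValueError("process grid exceeds the 2-D pencil limit")
    parts = (p_xy, p_yz)
    a = _fft_on_blocks(grid, axis=2, split_axes=(0, 1), parts=parts)  # FFT_Z on z-pencils
    a = _fft_on_blocks(a, axis=1, split_axes=(0, 2), parts=parts)     # after ZY transpose: FFT_Y
    a = _fft_on_blocks(a, axis=0, split_axes=(1, 2), parts=parts)     # after YX transpose: FFT_X
    return a


rng = np.random.default_rng(1)
grid = rng.standard_normal((6, 4, 5)) + 1j * rng.standard_normal((6, 4, 5))
result = pencil_fft3d(grid, p_xy=3, p_yz=2)
reference = np.fft.fftn(grid)
print(np.max(np.abs(result - reference)))
```

Because two grid dimensions are split at every stage, the process count is bounded by a product of two grid sizes instead of a single one.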






Traditional vs. Proposed Scheme

1. Correction computations better balanced

2. Grid limitation extended: $P^{2D}_{\max} = \min(N_{cx}, N_{cy}) \min(N_{cy}, N_{cz})$
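The two grid limits can be compared directly for a given FFT grid. In the sketch below the function names and the 128 × 96 × 64 grid are hypothetical and chosen only for illustration.

```python
def p_max_1d(n_cx, n_cy, n_cz):
    """Grid limit of the traditional 1-D slab scheme (n_cz does not enter)."""
    return min(n_cx, n_cy)


def p_max_2d(n_cx, n_cy, n_cz):
    """Grid limit of the proposed 2-D pencil scheme."""
    return min(n_cx, n_cy) * min(n_cy, n_cz)


n_cx, n_cy, n_cz = 128, 96, 64  # hypothetical grid dimensions
print("1-D slab limit:  ", p_max_1d(n_cx, n_cy, n_cz))
print("2-D pencil limit:", p_max_2d(n_cx, n_cy, n_cz))
```

For this hypothetical grid the pencil scheme raises the usable process count from 96 to 6144.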

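Traditional vs. Proposed: Maximum P and Maximum N

[Figure: maximum P and maximum N for the sphere phantom, radius = 104 mm, $\epsilon_r = 41.5$, $\sigma = 0.97$ S/m]

P vs. N (Relative Efficiency=90%)

[Figure: P vs. N on Ranger for the sphere phantom (radius = 104 mm, $\epsilon_r = 41.5$, $\sigma = 0.97$ S/m); trend lines labeled $N^{0.83}$, $N^{0.59}$, and $N^{0.68}$]

P vs. N (Relative Efficiency=50%)

[Figure: P vs. N on Ranger for the sphere phantom (radius = 104 mm, $\epsilon_r = 41.5$, $\sigma = 0.97$ S/m); trend lines labeled $N^{0.43}$ and $N^{0.66}$]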


• (Hard) Scalability Limitations
1. Sequential part of the code (Amdahl’s Law)
   - No size N arrays or operations
2. Load imbalance/Can’t divide computations further (run out of parallelism)
   - Two workload distribution strategies
   - 2D pencil decomposition based 3D FFTs [1]
3. Communication latency/bandwidth
   - Hybrid MPI/OpenMP [2]

[Figure: wall clock time or memory vs. P, with the three limitations marked 1, 2, 3]

[1] Wei, Yılmaz, “A 2-D decomposition based parallelization of AIM for 3-D BIOEM problems,” IEEE APS Symp., 2011.

[2] Wei, Yılmaz, “A hybrid message passing/shared memory parallelization of the adaptive integral method for multi-core clusters,” Parallel Comput., 2011.

BIOEM Problems

- High-Resolution Simulations

- AustinMan Anthropomorphic Model

High-Resolution Simulations

[Figure: $P_{\mathrm{absorbed}}$ for ΔV = 1030, 15, and 0.3 mm^3 and for ΔV = 683, 11, and 0.2 mm^3]

Sample Tissues: 11 mm^3 vs. 0.2 mm^3

[Figure: $P_{\mathrm{absorbed}}$ in Muscle + Tendon, Skin Dry + Fat, and Bone Cortical + Bone Marrow]

[Figure: $P_{\mathrm{absorbed}}$ in the numbered tissues below, panels 1–4 and 5–8]

1. White+Gray Matter
2. Teeth
3. Tongue
4. Gland
5. Spinal Cord
6. Mucous Membrane
7. Vitreous Humor
8. Blood Vessel

AustinMan Anthropomorphic Model

[Figure: CAT, MRI, and cross-section images]

• Semi-automated
+ Automated: Color based identification
- Manual: Knowledge of anatomy

• Boundary&Region Identification Software Kit (BRISkit)
+ Create masks
+ Region growing
- MATLAB graphical user interface
- Identify regions
- Edit boundaries

[Figure: AustinMan Model]

AustinMan Anthropomorphic Model

Sub-mm pixel resolution (1/3 mm)

AustinMan Anthropomorphic Model
Download at: web2.corral.tacc.utexas.edu/AustinManEMVoxels/

Acknowledgements

Fangzhou Wei
Madison Ball
Jackson Massey
Cemil Geyik
Dr. Tahir Malas
Dr. Victor Eijkhout
Prof. Paul Findell
Prof. John Pearce
Prof. Leszek Demkowicz
Prof. Robert van de Geijn

[Figure: Normalized E-Field, color scale from 0 to -40, for ΔV = 15, 11, 0.3, and 0.2 mm^3]
