Fast Integral Equation Methods for BIOEM in the Petascale Era:
Kilo Cores, Mega Codes, Giga Equations, Tera Bytes, and Peta Multiplications
Ali E. YILMAZ
Department of Electrical & Computer Engineering
University of Texas at Austin
University of Illinois at Urbana-Champaign, Urbana, IL, Apr. 19, 2011
Outline
• Motivation
- Body-Centric Wireless Systems
- Historical Perspective: Computational Bioelectromagnetics
- Computing Trends
• Background
- FDTD vs. MOM
- Pre-corrected FFT/Adaptive Integral Method
- Computational Complexity
• Parallelization
- General Limits of Parallelization
- Traditional Scheme and Its Limits
- Proposed Parallelization
- Comparison
• BIOEM Problems
- High-Resolution Simulations
- AustinMan Anthropomorphic Human Model
Body-Centric Wireless Systems
- More functionality near-, on-, in-body
- Wide-spread use
- Marketing
- Health concerns
Images from news articles in newscientist.com, switched.com, pulse2.com
Computational BIOEM Requirements
• Phantoms (“tissue-equivalent” liquids), IEEE Std. C95.3-1991/2002 and 1528-2003
- Resolve wavelength/skin depth in material (900 MHz): λ/10 ≈ 5.4 mm, δ/10 ≈ 3.6 mm
• Anthropomorphic Human Models
- Resolve wavelength/skin depth in tissues: λ/10 ≈ 3.8 mm, δ/10 ≈ 1.9 mm
  Finite element size ~ 10-100 mm³, N ~ 10-100 × 10⁶
- Resolve tissue boundaries
  Finite element size ~ 0.1-1 mm³, N ~ 1-10 × 10⁹

Tissue Name          Conductivity [S/m]  Relative Permittivity  Loss Tangent  Wavelength [mm]  Skin Depth [mm]
BoneCortical         0.14331             12.454                 0.22984       93.782           131.57
BoneMarrow           0.040208            5.5043                 0.1459        141.61           310.58
BrainGrayMatter      0.94227             52.725                 0.35694       45.182           41.536
CerebroSpinalFluid   2.4126              68.638                 0.70202       38.147           19.215
Muscle               0.94294             55.032                 0.34222       44.277           42.355

Gabriel, Lau, Gabriel, "The dielectric properties of biological tissues: III. Parametric models for the dielectric spectrum of tissues," Phys. Med. Biol., Nov. 1996. http://niremf.ifac.cnr.it/tissprop/htmlclie/htmlclie.htm
conformity.com/artman/publish/printer_74.shtml
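The wavelength and skin depth columns of the tissue table follow from the conductivity and relative permittivity through the propagation constant of a lossy medium. The sketch below is only an illustration: it assumes f = 900 MHz and a nonmagnetic medium, and the function name `tissue_wave_params` is made up for this example rather than taken from the talk.

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity [F/m]
C0 = 299792458.0         # speed of light in vacuum [m/s]

def tissue_wave_params(sigma, eps_r, freq_hz=900e6):
    """Return (loss tangent, wavelength [mm], skin depth [mm]).

    Assumes a homogeneous, nonmagnetic, lossy medium (illustrative helper).
    """
    omega = 2.0 * math.pi * freq_hz
    loss_tangent = sigma / (omega * EPS0 * eps_r)
    root = math.sqrt(1.0 + loss_tangent ** 2)
    k_lossless = omega * math.sqrt(eps_r) / C0           # wavenumber if sigma were zero
    beta = k_lossless * math.sqrt((root + 1.0) / 2.0)    # phase constant [rad/m]
    alpha = k_lossless * math.sqrt((root - 1.0) / 2.0)   # attenuation constant [Np/m]
    return loss_tangent, 1e3 * 2.0 * math.pi / beta, 1e3 / alpha

# Muscle row of the table: sigma = 0.94294 S/m, eps_r = 55.032
tan_d, wavelength_mm, skin_depth_mm = tissue_wave_params(sigma=0.94294, eps_r=55.032)
print(f"muscle: tan d = {tan_d:.4f}, lambda = {wavelength_mm:.1f} mm, delta = {skin_depth_mm:.1f} mm")
```

Dividing the returned wavelength and skin depth by 10 gives the kind of element-size targets quoted on the slide.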
Computational BIOEM
Historical Perspective: 1970s
CDC 6600: 2-3 MFlops

Chen, Guru, “Internal EM field and absorbed power density in human torsos induced by 1-500 MHz EM waves,” IEEE Trans. MTT, Sep. 1977.
“Our numerical computation was performed with a CDC 6500 computer which has a maximum capacity of inverting a 120x120 matrix. This limits the maximum number of cells to 40…A cell size of 10 cm³ seems appropriate.”

Hagmann, Gandhi, Durney, “Numerical calculation of electromagnetic energy deposition for a realistic model of man,” IEEE Trans. MTT, Sep. 1979.
“A total of 180 cubical cells of various sizes was used to obtain a best fit of the contour on diagrams of the 50th percentile standard man…the matrix is 270 by 270…”
Computational BIOEM
80s and 90s: CG-FFT and FDTD

Borup, Gandhi, “Fast-Fourier-Transform method for calculation of SAR distributions in finely discretized inhomogeneous models of biological bodies,” IEEE Trans. MTT, Apr. 1984.
“It is planned to extend the FFT method to 3-D problems. Thus far, the only technique available…for models of man…is the MOM.”
“An obvious difficulty…is the creation of the model…Another problem to be solved is the presentation of the results.”

Borup, Sullivan, Gandhi, “Comparison of the FFT conjugate gradient method and the finite-difference time-domain method for the 2-D absorption problem,” IEEE Trans. MTT, Apr. 1987.
“…serious errors exist in solutions for the TE polarization…fictitious line charge sources…have been removed…these modifications prevent…the use of the FFT-CGM and thus require O(N³) operations.”
“In contrast, FD-TD…yields excellent solutions…FD-TD has great potential for solving the high-resolution models needed in bioelectromagnetics.”
Computational BIOEM
90s and 00s: FDTD Dominance

Sullivan, Gandhi, Taflove, “Use of the finite-difference time-domain method for calculating EM absorption in man models,” IEEE Trans. Biomed. Eng., Mar. 1988.
“FDTD is used to calculate the detailed SAR within the human body…Comparison…with the method of moments.”

Hombach, Meier, Burkhardt, Kuhn, Kuster, “The dependence of EM energy absorption upon human head modeling at 900 MHz,” IEEE Trans. MTT, Oct. 1996.
“Different numerical head phantoms based on MRI scans of three different adults were used with voxel sizes down to 1 mm³…up to 13 tissue types. The numerical results are compared with those of measurements in a multitissue phantom…”

Wiart, Hadjem, Wong, Bloch, “Analysis of RF exposure in the head tissues of children and adults,” Phys. Med. Biology, IEEE Trans. MTT, June 2008.
“…major efforts…to improve the numerical dosimetry…in particular the FDTD has been improved and adapted to this domain.”

[Image: SEMCAD X]
Computational BIOEM
90s and 00s: FDTD Controversies

Lin, “Cellular mobile telephones and children,” IEEE Antennas Propagat. Mag., Oct. 2002.
“Are children more vulnerable to the microwave radiation from cellular mobile telephones?…Numerical dosimetry…one of the most productive research areas during the past decade…Instead of providing a consensus, these SAR studies have spawned diverse opinions.”

Wust, Nadobny, Seebass, Stalling, Gellermann, Hege, Deuflhard, Felix, “Influence of patient models and numerical methods on predicted power deposition patterns,” Int. J. Hyperthermia, Jan. 1999.
“CT data sets are transformed…FDTD-method is then applied on a cubic lattice (voxel model) or a tetrahedron grid (region-based segmentation)…For comparison, the VSIE method is performed on the same tetrahedron grid…power deposition patterns in the interior of the patient depended strongly on the models…Due to its shorter calculation time, the FDTD method is currently used in the clinic. Predictions…without prior corrections of tissue specifications are not always supported by clinical experiences.”

Mason, Hurt, Walters, D’Andrea, Gajsek, Ryan, Nelson, Smith, Ziriax, “Effects of frequency, permittivity, and voxel size on predicted specific absorption rate values in biological tissue during electromagnetic-field exposure,” IEEE Trans. MTT, Nov. 2000.
“Comparing localized SAR values determined experimentally...with those obtained using the FDTD method demonstrate good agreement, except when the FDTD method calculates “hot or cold” spots (relative to whole-body average SAR).”

Kainz, Christ, Kellom, Seidman, Nikoloski, Beard, Kuster, “Dosimetric comparison of the SAM to 14 anatomical head models using a novel definition for the mobile phone positioning,” Phys. Med. Biol., Nov. 2000.
“Numerous numerical dosimetric studies have been published about the exposure of mobile phone users which concluded with conflicting results. However, many of these studies lack reproducibility due to shortcomings in the description of the phone position.”

Gjonaj, Bartsch, Clemens, Schupp, Weiland, “High-resolution human anatomy models for advanced electromagnetic field computations,” IEEE Trans. Mag., Mar. 2002.
“…induced SAR…in the human head due to cellular phone radiation…for heterogeneous and homogeneous HUGO models. The discrepancy of the results, concerning the location of hot spots…necessity of using high-resolution anatomy models for realistic simulations.”
In the meantime…

Massoudi, Durney, Iskander, “Limitations of the cubical block model of man in calculating SAR distributions,” IEEE Trans. MTT, Aug. 1984.
“…while the moment-method, using pulse basis functions, gives good values for whole-body average SAR, the convergence of the solutions for SAR distributions is questionable…Pulse functions should be replaced by a better approximation.”

Schaubert, Wilton, Glisson, “A tetrahedral modeling method for electromagnetic scattering by arbitrarily shaped inhomogeneous dielectric bodies,” IEEE Trans. AP, Jan. 1984.
“Special basis functions are defined within tetrahedral volume elements to insure that the normal electric field satisfies the correct jump condition at interfaces between different dielectric media.”

Zwamborn, Vandenberg, “The 3-D weak form of the conjugate-gradient FFT method for solving scattering problems,” IEEE Trans. MTT, Sep. 1992.
“A weak form of the integral equation…is obtained by testing it with appropriate testing functions. In view of the partial derivatives...the volumetric rooftop functions are chosen as testing and expansion functions.”

Jin, Chen, Chew, Gan, Magin, Dimbylow, “Computation of electromagnetic fields for high-frequency magnetic resonance imaging applications,” Phys. Med. Biol., Dec. 1996.
“The use of pulse basis functions yielded slow convergence and poor results when dealing with materials with high dielectric contrast. Better formulations were later proposed…most use mixed-order basis functions and Galerkin’s testing.”
“…BCG-FFT method reduces the memory requirement to O(N) and computational complexity to O(N_iter N log N), making the problem solvable on a workstation. For lossy material…N_iter is found to be small.”

[Image: 9 tissues, 8 mm³ resolution]
Historical Perspective: Supercomputing
Top Supercomputers (top500.org)
• 1. Tian-He 1A: Intel X5670, 6 Core, 2930 MHz
• 2. Jaguar: AMD x86_64 Opteron, 6 Core, 2600 MHz
• 15. Ranger: AMD x86_64 Opteron, 4 Core, 2300 MHz
• Blue Waters (this year): IBM POWER7 processor, 8 Core
[Images: Jaguar, Tian-He 1A, Ranger]

Historical Perspective: Limits of Serial Computing
BACKGROUND
- FDTD vs. MOM
- Pre-corrected FFT/Adaptive Integral Method
- Computational Complexity
FDTD vs. MOM for BIOEM
• FDTD
+ Historical momentum/easy to code
+ Matrix free
- Uniform grids/staircasing
- Special grid truncation
- Extended grids
- Courant stability condition
-+ Phase dispersion
-+ Time-domain (transients)
Simulation Time: $O(N_t N')$
Memory: $O(N')$

• MOM + FFT-based Algorithm
- Rarer/harder to code
- Dense matrix (+ fast algorithm)
+ Arbitrary mesh
+ Exact radiation condition
+ Never mesh free space
+ Unknowns only in space
+- Incorrect Green function phase speed
+- Frequency-domain (dispersive materials)
Simulation Time: $O(N_{\rm iter} N \log N)$
Memory: $O(N)$

For the same accuracy: $N' \gg N$
For $N = N'$: err(FDTD) $\gg$ err(MOM)
FDTD vs. MOM for BIOEM
[Figure: Error versus Time/Memory Cost. Labels: Computational Research; Increases in Raw Computer Power; Increased expectations; Curiosity; New applications; Controversial results; Utopia.]
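Volume Electric Field Integral Equation
• Time-harmonic VEFIE for Inhomogeneous and Lossy Dielectric Scatterer
[Figure: scatterer V with ε(r), σ(r), μ0 in a background ε0, μ0, illuminated by E^inc; equivalent volume current J(r) in ε0, μ0.]

$\varepsilon(\mathbf{r}) \equiv \varepsilon(\mathbf{r}) + \dfrac{\sigma(\mathbf{r})}{j\omega}$ (complex permittivity)

$\mathbf{D}(\mathbf{r}) = \varepsilon(\mathbf{r})\,\mathbf{E}(\mathbf{r})$

$\mathbf{J}(\mathbf{r}) = j\omega\,\kappa(\mathbf{r})\,\mathbf{D}(\mathbf{r}), \qquad \kappa(\mathbf{r}) = 1 - \dfrac{\varepsilon_0}{\varepsilon(\mathbf{r})}$

$\mathbf{E}^{\rm inc}(\mathbf{r}) = \mathbf{E}(\mathbf{r}) - \mathbf{E}^{\rm sca}(\mathbf{r}) = \mathbf{E}(\mathbf{r}) + j\omega\,\mathbf{A}(\mathbf{r}) + \nabla\phi(\mathbf{r})$

$\mathbf{A}(\mathbf{r}) = \mu_0 \iiint_V G(\mathbf{r},\mathbf{r}')\,\mathbf{J}(\mathbf{r}')\,dv'$

$\phi(\mathbf{r}) = -\dfrac{1}{j\omega\varepsilon_0} \iiint_V G(\mathbf{r},\mathbf{r}')\,\nabla'\cdot\mathbf{J}(\mathbf{r}')\,dv'$

$G(\mathbf{r},\mathbf{r}') = \dfrac{e^{-jk_0|\mathbf{r}-\mathbf{r}'|}}{4\pi|\mathbf{r}-\mathbf{r}'|}$

$\mathbf{E}^{\rm inc}(\mathbf{r}) = \dfrac{\mathbf{D}(\mathbf{r})}{\varepsilon(\mathbf{r})} - \omega^2\mu_0 \iiint_V G(\mathbf{r},\mathbf{r}')\,\kappa(\mathbf{r}')\,\mathbf{D}(\mathbf{r}')\,dv' - \nabla \iiint_V G(\mathbf{r},\mathbf{r}')\,\dfrac{\nabla'\cdot\left(\kappa(\mathbf{r}')\,\mathbf{D}(\mathbf{r}')\right)}{\varepsilon_0}\,dv', \qquad \mathbf{r} \in V$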
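Method of Moments
• MOM
$\mathbf{E}^{\rm inc}(\mathbf{r}) = \dfrac{\mathbf{D}(\mathbf{r})}{\varepsilon(\mathbf{r})} - \omega^2\mu_0 \iiint_V G(\mathbf{r},\mathbf{r}')\,\kappa(\mathbf{r}')\,\mathbf{D}(\mathbf{r}')\,dv' - \nabla \iiint_V G(\mathbf{r},\mathbf{r}')\,\dfrac{\nabla'\cdot\left(\kappa(\mathbf{r}')\,\mathbf{D}(\mathbf{r}')\right)}{\varepsilon_0}\,dv', \qquad \mathbf{r} \in V$

- Discretize & Galerkin test
$\mathbf{D}(\mathbf{r}) \cong \sum_{k'=1}^{N} I_{k'}\,\mathbf{V}_{k'}(\mathbf{r}), \qquad \mathbf{J}(\mathbf{r}) \cong j\omega \sum_{k'=1}^{N} I_{k'}\,\kappa(\mathbf{r})\,\mathbf{V}_{k'}(\mathbf{r})$

$\iiint_V \mathbf{V}_k(\mathbf{r}) \cdot \{\text{VEFIE}\}\,dv \quad \text{for } k = 1,\dots,N$

$\mathbf{V}_k(\mathbf{r}) = \begin{cases} a_k\,\mathbf{p}_k^{\pm} / (3 v_k^{\pm}), & \mathbf{r} \in V_k^{\pm} \\ 0, & \text{elsewhere} \end{cases}$

[Figure: basis function k on tetrahedra V_k^+ and V_k^-, with labels a_k, p_k^+, p_k^-, ε_k^+, ε_k^-.]
Schaubert, Wilton, Glisson, “A tetrahedral modeling method for electromagnetic scattering by arbitrarily shaped inhomogeneous dielectric bodies,” IEEE Trans. AP, Jan. 1984.

- Fill matrix & solve
$\mathbf{Z}\,\mathbf{I} = \mathbf{F}^{\rm inc}$, with $\mathbf{Z}$ of size $N \times N$

• Computational Cost
Matrix fill time: $O(N^2)$
Memory: $O(N^2)$
Time per iteration: $O(N^2)$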
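To make the O(N²) memory cost concrete, the arithmetic below estimates the storage of a dense N × N impedance matrix. It is a rough illustration only: it assumes double-precision complex entries (16 bytes each), and the helper name `dense_matrix_bytes` is hypothetical.

```python
def dense_matrix_bytes(n_unknowns, bytes_per_entry=16):
    """Storage for an N x N matrix, assuming 16-byte complex entries."""
    return n_unknowns ** 2 * bytes_per_entry

# Example problem sizes (illustrative values only).
for n in (10**4, 10**6, 10**8):
    print(f"N = {n:.0e}: {dense_matrix_bytes(n) / 1e12:.3g} TB")
```

Under these assumptions a million unknowns already needs 16 TB for the matrix alone, which is why a fast algorithm with lower memory complexity is needed for high-resolution models.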
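Pre-corrected FFT/Adaptive Integral Method
Step 1: Project/Anterpolate (embed in a regular grid with $N_C$ nodes)
Step 2: Propagate
Step 3: Interpolate
Step 4: Correct

$\mathbf{Z}\,\mathbf{I} \approx \mathbf{Z}^{\rm near}\,\mathbf{I} + \Lambda\,\mathrm{IFFT}\{\mathrm{FFT}[\mathbf{G}] \cdot \mathrm{FFT}[\Lambda^T \mathbf{I}]\}$

$\mathbf{Z}^{\rm near} = \mathbf{Z} - \Lambda\,\mathbf{G}\,\Lambda^T$ (within the correction region)

Correction region size: $\gamma = 1$
[Figure: regular grid with spacing Δ and correction region.]

• Computational Cost
Matrix fill time: $O(N^{\rm near} + pN + N_C)$
Time per iteration: $O(N^{\rm near} + pN + N_C \log N_C + pN)$
Memory: $O(N^{\rm near} + pN + N_C)$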
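The propagate step relies on the grid-to-grid interaction being a convolution, so it can be evaluated with FFTs instead of a dense matrix-vector product. The following is a minimal 1-D toy sketch of that idea, not the 3-D implementation from the talk: it assumes the projection matrix Λ is the identity, sets the self term of the Green function to zero, and uses made-up values for the grid spacing and wavenumber.

```python
import numpy as np

def toeplitz_matvec_fft(g, x):
    """Multiply the Toeplitz matrix T[m, n] = g[|m - n|] by x using zero-padded FFTs."""
    n = len(x)
    # Embed the symmetric Toeplitz kernel in a circulant sequence of length 2n.
    c = np.concatenate([g, [0.0], g[:0:-1]])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, 2 * n))
    return y[:n]

# Hypothetical 1-D grid: 64 nodes, 1 mm spacing, 50 mm wavelength.
n, spacing, k0 = 64, 1.0e-3, 2.0 * np.pi / 0.05
dist = spacing * np.arange(n)
g = np.zeros(n, dtype=complex)
g[1:] = np.exp(-1j * k0 * dist[1:]) / (4.0 * np.pi * dist[1:])  # self term left at zero

rng = np.random.default_rng(0)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Reference: explicit dense matrix-vector product, O(n^2) work and storage.
idx = np.arange(n)
dense = g[np.abs(idx[:, None] - idx[None, :])]
print(np.allclose(dense @ x, toeplitz_matvec_fft(g, x)))
```

The FFT version stores only the first row of the kernel, which is where the N_C and N_C log N_C terms in the cost estimates come from.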
Benchmark Problems

Model                        ΔV (mm³)  l (mm)  λ/l   δ/l   N
CAD Sphere Phantom           1030      21      2.4   1.7   ~10⁴
CAD Sphere Phantom           15        5       9.8   7.0   ~6 × 10⁵
CAD Sphere Phantom           0.3       1.4     36    26    ~3.2 × 10⁷
Voxel-Based Sphere Phantom   683       20      2.5   1.8   ~1.5 × 10⁴
Voxel-Based Sphere Phantom   11        5       9.9   7.1   ~9 × 10⁵
Voxel-Based Sphere Phantom   0.2       1.3     40    29    ~5.6 × 10⁷
AustinMan Partial Body       683       20      -     -     ~4 × 10⁴
AustinMan Partial Body       11        5       -     -     ~2.5 × 10⁶
AustinMan Partial Body       0.2       1.3     -     -     ~1.6 × 10⁸

Sphere phantoms: radius = 104 mm, ε_r = 41.5, σ = 0.97
AustinMan partial body: 500 × 300 × 350 mm³
f = 900 MHz
[Figure: the three models illuminated by E^inc; label: 50 mm.]
Definitions for Post-Processing

$\langle P_{\rm absorbed}(\mathbf{r}, f) \rangle = f \int_0^{1/f} \sigma(\mathbf{r})\,\mathbf{E}(\mathbf{r},t)\cdot\mathbf{E}(\mathbf{r},t)\,dt = \dfrac{1}{2}\,\sigma(\mathbf{r})\,\mathbf{E}(\mathbf{r},f)\cdot\mathbf{E}^{*}(\mathbf{r},f)$

Relative RCS Error $= \dfrac{\int_0^{2\pi}\!\int_0^{\pi} \left|\sigma_{\theta\theta}^{\rm AIM} - \sigma_{\theta\theta}^{\rm MIE}\right| \sin\theta\,d\theta\,d\phi}{\int_0^{2\pi}\!\int_0^{\pi} \left|\sigma_{\theta\theta}^{\rm MIE}\right| \sin\theta\,d\theta\,d\phi}$

Accuracy
[Figure: ⟨P_absorbed⟩ for the CAD sphere phantom (ΔV = 1030 mm³, 15 mm³, 0.3 mm³) and the voxel-based sphere phantom (ΔV = 683 mm³, 11 mm³, 0.2 mm³).]
Computational Costs
[Figure: computational costs for ΔV = 683 mm³ and ΔV = 0.2 mm³.]

Need for Parallelization: Minimum P for Different Constraints
Parallelization- General Limits
- Traditional Scheme and Its Limitations
- Proposed Scheme
- Comparison
General Limits of Parallelization
• (Hard) Scalability Limitations
1. Sequential part of the code (Amdahl’s Law)
2. Load imbalance/Can’t divide computations further (run out of parallelism)
3. Communication latency/bandwidth
[Figure: memory and wall clock time versus P for limitations 1, 2, and 3.]
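The first limitation, Amdahl's Law, can be illustrated with a few lines of arithmetic. The sketch below assumes a fixed problem size and a perfectly parallel remainder; the 1% sequential fraction is an arbitrary illustrative value, not a measurement from the talk.

```python
def amdahl_speedup(serial_fraction, num_procs):
    """Ideal speedup on num_procs processes when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_procs)

# Illustrative: 1% sequential code caps the speedup near 100 regardless of P.
for p in (1, 16, 256, 4096):
    print(p, round(amdahl_speedup(0.01, p), 1))
```

Load imbalance and communication cost are not modeled here; they reduce the achievable speedup further.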
Traditional Parallelization Scheme
• Single Workload Distribution Strategy [1]
1. Project/Anterpolate: 1-D Slab
2. Propagate: 1-D Slab
3. Interpolate: 1-D Slab
4. Correct: 1-D Slab
5. Communicate Near

$P = P^{xy} = 6$
$P_{\max}^{\rm 1D} = \min(N_{cx}, N_{cy})$
[Figure: grid divided into 1-D slabs owned by P1 to P6.]

[1] Aluru, Nadkarni, White, “A parallel precorrected FFT based capacitance extraction program for signal integrity analysis,” Proc. Design Automat. Conf., 1996.
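Traditional Parallelization Scheme
• Single Workload Distribution Strategy
1. Project/Anterpolate: 1-D Slab
2. Propagate: 1-D Slab
a) FFT_ZY – Transpose – FFT_X
b) Multiply
c) IFFT_X – Transpose – IFFT_YZ
3. Interpolate: 1-D Slab
4. Correct: 1-D Slab
5. Communicate Near

$\mathbf{Z}\,\mathbf{I} \approx \mathbf{Z}^{\rm near}\,\mathbf{I} + \Lambda\,\mathrm{IFFT}\{\mathrm{FFT}[\mathbf{G}] \cdot \mathrm{FFT}[\Lambda^T \mathbf{I}]\}$

Memory: $O\!\left(\dfrac{N^{\rm near}}{??} + \dfrac{pN}{??} + \dfrac{N_C}{P_{\max}^{\rm 1D}} + \dfrac{pN}{??}\right)$
CPU time per iteration: $O\!\left(\dfrac{N^{\rm near}}{??} + \dfrac{pN}{??} + \dfrac{N_C \log N_C}{P_{\max}^{\rm 1D}} + \dfrac{pN}{??}\right)$
Comm time per iteration: $O\!\left(C^{\rm near} + P_{\max}^{\rm 1D}\right)$
[Figure: 1-D slabs owned by P1 to P6.]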
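The slab-decomposed propagate step (FFT_ZY, transpose, FFT_X) can be mimicked serially to show why it reproduces a full 3-D FFT. The sketch below is an assumption-laden illustration: "processes" are just array blocks, the global transpose is modeled as a concatenate-and-resplit rather than an MPI all-to-all, and the function name `slab_fft3d` is invented for this example.

```python
import numpy as np

def slab_fft3d(grid, num_procs):
    """3-D FFT of grid[x, y, z] organized as in a 1-D slab decomposition."""
    # Each "process" owns a contiguous block of x-planes (a slab).
    slabs = np.array_split(grid, num_procs, axis=0)
    # FFT_ZY: local 2-D FFTs over (y, z); no communication needed.
    slabs = [np.fft.fft2(s, axes=(1, 2)) for s in slabs]
    # Transpose: redistribute so that each process owns a block of y and all of x.
    full = np.concatenate(slabs, axis=0)
    slabs = np.array_split(full, num_procs, axis=1)
    # FFT_X: local 1-D FFTs along x.
    slabs = [np.fft.fft(s, axis=0) for s in slabs]
    return np.concatenate(slabs, axis=1)

rng = np.random.default_rng(1)
grid = rng.standard_normal((4, 6, 5)) + 1j * rng.standard_normal((4, 6, 5))
print(np.allclose(slab_fft3d(grid, 3), np.fft.fftn(grid)))
```

Because the data are split along x before the transpose and along y after it, the number of useful processes in this layout cannot exceed the smaller of the two grid dimensions.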
Limits of Traditional Parallelization Scheme
1. Anterpolation, Correction, Interpolation computations not balanced
2. “Grid limited”: $P_{\max}^{\rm 1D} = \min(N_{cx}, N_{cy})$
Limits of Traditional Parallelization: Maximum P and Maximum N
[Figure: results for the sphere phantom, radius = 104 mm, ε_r = 41.5, σ = 0.97.]
Proposed Parallelization Scheme
• Two Workload Distribution Strategies [1]
1. Project/Anterpolate: 2-D Pencil
2. Propagate: 2-D Pencil
3. Interpolate: 2-D Pencil
3.5. Redistribute
4. Correct: 3-D Block
5. Communicate Near/Redistribute
[Figure: grid divided into 2-D pencils owned by P1 to P6.]

[1] Wei, Yılmaz, “A 2-D decomposition based parallelization of AIM for 3-D BIOEM problems,” IEEE APS Symp., 2011
Proposed Parallelization Scheme
• Two Workload Distribution Strategies
1. Project/Anterpolate: 2-D Pencil
2. Propagate: 2-D Pencil
3. Interpolate: 2-D Pencil
3.5. Redistribute
4. Correct: 3-D Block
a) Shift Balance in X Dimension
b) Shift Balance in Y Dimension
c) Shift Balance in Z Dimension
5. Communicate Near/Redistribute
[Figure: 3-D blocks owned by P1 to P6.]
Proposed Parallelization Scheme
• Two Workload Distribution Strategies
1. Project/Anterpolate: 2-D Pencil
2. Propagate: 2-D Pencil
3. Interpolate: 2-D Pencil
3.5. Redistribute
4. Correct: 3-D Block
5. Communicate Near/Redistribute

$P = P^{xy} \times P^{yz} = 3 \times 2$
$P_{\max}^{\rm 2D} = P_{\max}^{xy} \times P_{\max}^{yz} = \min(N_{cx}, N_{cy}) \times \min(N_{cy}, N_{cz})$
[Figure: 2-D pencils owned by P1 to P6.]
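The grid limits of the two decompositions, P_max^1D = min(N_cx, N_cy) for slabs and P_max^2D = min(N_cx, N_cy) × min(N_cy, N_cz) for pencils, can be compared directly. In the sketch below the function names and the grid dimensions are hypothetical, chosen only to show the scale of the difference.

```python
def max_procs_slab(n_cx, n_cy):
    """Largest usable process count for a 1-D slab decomposition."""
    return min(n_cx, n_cy)

def max_procs_pencil(n_cx, n_cy, n_cz):
    """Largest usable process count for a 2-D pencil decomposition."""
    return min(n_cx, n_cy) * min(n_cy, n_cz)

# Hypothetical auxiliary grid of 256 x 128 x 512 nodes.
n_cx, n_cy, n_cz = 256, 128, 512
print("slab limit:  ", max_procs_slab(n_cx, n_cy))
print("pencil limit:", max_procs_pencil(n_cx, n_cy, n_cz))
```

For this example grid the pencil decomposition raises the limit from 128 to 16384 processes.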
Proposed Parallelization Scheme
• Two Workload Distribution Strategies
1. Project/Anterpolate: 2-D Pencil
2. Propagate: 2-D Pencil
a) FFT_Z – ZY Transpose – FFT_Y – YX Transpose – FFT_X
b) Multiply
c) IFFT_X – XY Transpose – IFFT_Y – YZ Transpose – IFFT_Z
3. Interpolate: 2-D Pencil
3.5. Redistribute
4. Correct: 3-D Block
5. Communicate Near/Redistribute

$\mathbf{Z}\,\mathbf{I} \approx \mathbf{Z}^{\rm near}\,\mathbf{I} + \Lambda\,\mathrm{IFFT}\{\mathrm{FFT}[\mathbf{G}] \cdot \mathrm{FFT}[\Lambda^T \mathbf{I}]\}$

Memory: $O\!\left(\dfrac{N^{\rm near}}{??} + \dfrac{pN}{??} + \dfrac{N_C}{P_{\max}^{xy} P_{\max}^{yz}} + \dfrac{pN}{??}\right)$
CPU time per iteration: $O\!\left(\dfrac{N^{\rm near}}{??} + \dfrac{pN}{??} + \dfrac{N_C \log N_C}{P_{\max}^{xy} P_{\max}^{yz}} + \dfrac{pN}{??}\right)$
Comm time per iteration: $O\!\left(C^{\rm near} + C^{\rm rd} + \dfrac{P}{P_{\max}^{yz}} + \dfrac{P}{P_{\max}^{xy}}\right)$
[Figure: 2-D pencils owned by P1 to P6.]
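The pencil-decomposed propagate step (FFT_Z, ZY transpose, FFT_Y, YX transpose, FFT_X) can also be mimicked serially. This is a sketch under stated assumptions: the 3 × 2 process grid matches the slide's example, the transposes are modeled by re-splitting the array rather than by MPI communication, and the helper names `fft_on_pencils` and `pencil_fft3d` are invented for illustration.

```python
import numpy as np

def fft_on_pencils(data, fft_axis, parts):
    """Apply 1-D FFTs along fft_axis while the other two axes are split over a 2-D process grid."""
    ax_a, ax_b = [ax for ax in range(3) if ax != fft_axis]
    rows = []
    for block_a in np.array_split(data, parts[0], axis=ax_a):
        # Each sub-block is one "pencil": complete along fft_axis, partial along the other two axes.
        row = [np.fft.fft(pencil, axis=fft_axis)
               for pencil in np.array_split(block_a, parts[1], axis=ax_b)]
        rows.append(np.concatenate(row, axis=ax_b))
    return np.concatenate(rows, axis=ax_a)

def pencil_fft3d(grid, parts=(3, 2)):
    """3-D FFT of grid[x, y, z] as three rounds of local 1-D FFTs."""
    out = fft_on_pencils(grid, 2, parts)  # FFT_Z on z-pencils
    out = fft_on_pencils(out, 1, parts)   # ZY transpose, then FFT_Y on y-pencils
    out = fft_on_pencils(out, 0, parts)   # YX transpose, then FFT_X on x-pencils
    return out

rng = np.random.default_rng(2)
grid = rng.standard_normal((6, 4, 5)) + 1j * rng.standard_normal((6, 4, 5))
print(np.allclose(pencil_fft3d(grid), np.fft.fftn(grid)))
```

Each round needs only one grid dimension to be local, which is why the process count can grow to a product of two grid dimensions instead of a single one.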
Parallelization
- General Limits
- Traditional Scheme and Its Limitations
- Proposed Scheme
- Comparison
Traditional vs. Proposed Scheme
1. Correction computations better balanced
2. Grid limitation extended: $P_{\max}^{\mathrm{2D}} = \min(N_{cx}, N_{cy})\,\min(N_{cy}, N_{cz})$
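As a small worked example of the grid limit on this slide, the sketch below evaluates min(N_cx, N_cy) · min(N_cy, N_cz) for a sample grid. It assumes that N_cx, N_cy, N_cz are the numbers of auxiliary grid points along x, y, z; the function name `max_procs_2d` and the grid sizes are hypothetical illustrations, not values from the talk.

```python
def max_procs_2d(n_cx, n_cy, n_cz):
    """Largest process count allowed by the 2-D pencil grid limit.

    Implements P_max = min(n_cx, n_cy) * min(n_cy, n_cz), where the arguments
    are assumed to be grid point counts along x, y, and z.
    """
    if min(n_cx, n_cy, n_cz) < 1:
        raise ValueError("grid dimensions must be positive")
    return min(n_cx, n_cy) * min(n_cy, n_cz)

if __name__ == "__main__":
    # Hypothetical grid: 64 x 128 x 256 points.
    print(max_procs_2d(64, 128, 256))
```

The product of two minima grows roughly with the square of the smallest grid dimensions, which is why the limit sits well above what a single grid dimension would allow.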
Traditional vs. Proposed: Maximum P and Maximum N
[Plot: P vs. N (Relative Efficiency=90%), on Ranger; radius=104 mm, ε_r=41.5, σ=0.97; curve labels 0.83N, 0.59N, 0.68N]
[Plot: P vs. N (Relative Efficiency=50%), on Ranger; radius=104 mm, ε_r=41.5, σ=0.97; curve labels 0.43N, 0.66N]
General Limits of Parallelization
• (Hard) Scalability Limitations
1. Sequential part of the code (Amdahl's Law)
- No size N arrays or operations
2. Load imbalance/Can't divide computations further (run out of parallelism)
- Two workload distribution strategies
- 2D pencil decomposition based 3D FFTs [1]
3. Communication latency/bandwidth
- Hybrid MPI/OpenMP [2]
[Plots: Wall Clock Time or Memory vs. P, with regions 1, 2, 3 marked]
[1] Wei, Yılmaz, "A 2-D decomposition based parallelization of AIM for 3-D BIOEM problems," IEEE APS Symp., 2011.
[2] Wei, Yılmaz, "A hybrid message passing/shared memory parallelization of the adaptive integral method for multi-core clusters," Parallel Comput., 2011.
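The first limit, the sequential part of the code, is usually quantified with Amdahl's Law: with a sequential fraction s, the speedup on P processes is at most 1 / (s + (1 - s) / P), which saturates at 1 / s. The sketch below evaluates that bound; the function name `amdahl_speedup` and the 1% sequential fraction are hypothetical illustrations, not measurements from the talk.

```python
def amdahl_speedup(seq_fraction, n_procs):
    """Upper bound on speedup from Amdahl's Law.

    seq_fraction: fraction of the run time that cannot be parallelized (0..1).
    n_procs: number of processes.
    """
    if not 0.0 <= seq_fraction <= 1.0:
        raise ValueError("seq_fraction must lie in [0, 1]")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n_procs)

if __name__ == "__main__":
    s = 0.01  # assumed sequential fraction, for illustration only
    for p in (1, 16, 256, 4096):
        bound = amdahl_speedup(s, p)
        print(f"P={p:5d}  speedup <= {bound:7.2f}  efficiency <= {bound / p:.3f}")
```

Even a small sequential fraction caps the achievable speedup, which is consistent with the slide's emphasis on avoiding any size N arrays or operations on a single process.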
BIOEM Problems
- High-Resolution Simulations
- AustinMan Anthropomorphic Model
High-Resolution Simulations
[Figure: P_absorbed at three resolutions; panel labels 683 mm³ and 1030 mm³, 11 mm³ and 15 mm³, 0.2 mm³ and 0.3 mm³]
Sample Tissues: 11 mm³ vs. 0.2 mm³
[Figure: P_absorbed; panels: Muscle + Tendon, Skin Dry + Fat, Bone Cortical + Bone Marrow]
Sample Tissues: 11 mm³ vs. 0.2 mm³
[Figure: P_absorbed; panels numbered 1–8]
1. White+Gray Matter
2. Teeth
3. Tongue
4. Gland
5. Spinal Cord
6. Mucous Membrane
7. Vitreous Humor
8. Blood Vessel
AustinMan Anthropomorphic Model
[Figure labels: CAT, MRI, Cross Sections, AustinMan Model]
• Semi-automated
+ Automated: Color based identification
- Manual: Knowledge of anatomy
• Boundary&Region Identification Software Kit (BRISkit)
+ Create masks
+ Region growing
- MATLAB graphical user interface
- Identify regions
- Edit boundaries
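Region growing, listed above as one of the BRISkit features, can be sketched generically as a flood fill from a seed pixel. The example below is not BRISkit's implementation (which is a MATLAB tool whose internals are not described here); it assumes NumPy, a 2-D grayscale image, 4-connectivity, and a fixed intensity tolerance, and the name `region_grow` is hypothetical.

```python
from collections import deque

import numpy as np

def region_grow(image, seed, tol):
    """Boolean mask of pixels 4-connected to `seed` whose value is within
    `tol` of the seed pixel's value."""
    rows, cols = image.shape
    seed_val = float(image[seed])
    mask = np.zeros(image.shape, dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not mask[nr, nc]:
                if abs(float(image[nr, nc]) - seed_val) <= tol:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask

if __name__ == "__main__":
    # Toy cross section: value 10 marks one tissue, 50 another.
    img = np.array([[10, 10, 50],
                    [10, 50, 50],
                    [50, 50, 10]])
    print(region_grow(img, (0, 0), tol=5).astype(int))
```

The isolated pixel in the bottom-right corner has the same value as the seed but is not connected to it, so it stays outside the mask; this is the behavior that lets a user separate tissues of similar color by choosing seeds.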
AustinMan Anthropomorphic Model
Sub-mm pixel resolution (1/3 mm)
Download at: web2.corral.tacc.utexas.edu/AustinManEMVoxels/
Acknowledgements
Fangzhou Wei
Madison Ball
Jackson Massey
Cemil Geyik
Dr. Tahir Malas
Dr. Victor Eijkhout
Prof. Paul Findell
Prof. John Pearce
Prof. Leszek Demkowicz
Prof. Robert van de Geijn
[Figure: Normalized E-Field, color scale from 0 to -40; panel labels 11 mm³ and 15 mm³, 0.2 mm³ and 0.3 mm³]