Research Article An Optimized Parallel FDTD Topology for ...downloads.hindawi.com/journals/ijap/2015/690510.pdf · Although there are many publications on parallel FDTD, few of them

Research ArticleAn Optimized Parallel FDTD Topology for ChallengingElectromagnetic Simulations on Supercomputers

Shugang Jiang Yu Zhang Zhongchao Lin and Xunwang Zhao

School of Electronic Engineering Xidian University Xirsquoan Shaanxi 710071 China

Correspondence should be addressed to Shugang Jiang zaishuiyifang131126com

Received 23 March 2015 Accepted 14 May 2015

Academic Editor Giuseppe Mazzarella

Copyright copy 2015 Shugang Jiang et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

It may not be a challenge to run a Finite-Difference Time-Domain (FDTD) code for electromagnetic simulations on a super-computer with more than 10 thousands of CPU cores however to make FDTD code work with the highest efficiency is a challengeIn this paper the performance of parallel FDTD is optimized through MPI (message passing interface) virtual topology based onwhich a communication model is established The general rules of optimal topology are presented according to the model Theperformance of the method is tested and analyzed on three high performance computing platforms with different architectures inChina Simulations including an airplane with a 700-wavelength wingspan and a complex microstrip antenna array with nearly2000 elements are performed very efficiently using a maximum of 10240 CPU cores

1 Introduction

The principle of FDTD is that the calculation region is dis-cretized by Yee grid to make the components of E and H bedistributed at time and space alternately [1] Then there arefourH (or E) components around each E (orH) componentThis character makes the algorithm parallel in nature andusing this the Maxwell equation can be transferred to a setof difference equations The electromagnetic fields can besolved at time axis step by step Then the electromagneticfields distribution in each time step later can be obtained bythe original values and boundary condition [2]

Researches on MPI based parallel FDTD for simulatingcomplicated models have been published in the past decadeIn 2001 Volakis et al presented a parallel FDTD algorithmusing the MPI library where they raised an MPI Cartesian2D topology [3] Andersson developed parallel FDTD with3D MPI topology in the same year [4] In 2005 the authorsstudied the optimum virtual topology for MPI based parallelconformal FDTD algorithm on PC clusters [5ndash7] In 2008 Yuet al tested the parallel efficiency of the parallel FDTD [8] onBlueGeneL Supercomputer successfully and gave a parallelefficiency of 4000 cores under the case of balanced loads

Although there are many publications on parallel FDTDfew of them involve parallel FDTD simulations utilizing

more than 10000 cores Most of the papers focused on loadbalancingwhen parallel efficiencywas concerned in additionto a more precise rule to achieve the best performance thatneeds to be given especially in the case of simulations usingtens of thousands of CPU cores on supercomputers

With these concerned in this paper the influence ofdifferent virtual topology schemes on parallel performanceof FDTD is studied through a theory model analysis Thensome tests are made on National Supercomputer Center inTianjin (NSCC-TJ) and National Supercomputing Center inShenzhen (NSCC-SZ) to verify the feasibility of theory Withthe proposed theory model some electrically large problemswhose parallel scales up to 10240 cores are provided in thispaper And the parallel efficiency is nearly 80 when 10240cores of SSC were utilized for an array with nearly 2000elements To the best of our knowledge the proposedmethodachieves one of the best efficiencies ever reached using morethan 10 thousands of CPU cores

2 Computation Resourcesfrom Supercomputers

The program is tested on different clusters in three super-computer centers National Supercomputer Center in Tian-jin (NSCC-TJ) [9] National Supercomputing Center in

Hindawi Publishing CorporationInternational Journal of Antennas and PropagationVolume 2015 Article ID 690510 10 pageshttpdxdoiorg1011552015690510

2 International Journal of Antennas and Propagation

Table 1 Parameters of computation resources

Platform CPU Memorynode Clock speed Coresnode Total cores used in this paper

SSC AMD 8347HE 64-bit four-core 64G 19GHz 16 800032G 4800

NSCC-TJ Intel Xeon 5670 six-core 24G 293GHz 12 120NSCC-SZ Intel Xeon 5650 six-core 24G 256GHz 12 512

Shenzhen (NSCC-SZ) [10] and Shanghai SupercomputerCenter (SSC) [11] The parameters of computation resourcesin this paper are listed in Table 1

3 Communication Model for Parallel FDTD

Communication is the main factor which affects the par-allel performance of parallel codes Therefore reducing theamount of communication in FDTD by adjusting the virtualtopology is selected as the optimization target

Assume that the communication time in one time step is

119879 = 120572119862+120573 sdot 2119871 (1)

where 120572 is communication delay time 119862 is communicationnumber 120573 is transmission speed and 119871 is the communicationdata amount of E or H The calculated equation of eachparameter is as follows

119862 = 6119875119909119875119910119875119911minus 2 (119875

119909119875119910+119875119910119875119911+119875119911119875119909) (2)

119871 = (119875119909minus 1)119873

119910119873119911+ (119875119910minus 1)119873

119911119873119909

+ (119875119911minus 1)119873

119909119873119910

(3)

where 119875119909 119875119910 and 119875

119911are the topology values in three direc-

tions and119873119909119873119910 and119873

119911are the grids number in 119909 119910 and 119911

directionsFrom (1) it is known that when the total communication

data amount is the same the different topology scheme maybring the different communication number 119862 which comesto different total time 119879

Take Dawning 5000A as example the parameters 120572 =18 ussim25 us and 120573 = 1(16563Gbs) [12] Assume that thetotal grids are 1000 times 1000 times 1000 and total cores are 1000now the total communication delay time (972ms) is aboutless an order ofmagnitude than the total communication time(121ms) Under this cores scale the communication delaytime is the secondary factor

The communication amount of single process is

ave

=119871

119875119909119875119910119875119911

=

(119875119909minus 1)119873

119910119873119911+ (119875119910minus 1)119873

119911119873119909+ (119875119911minus 1)119873

119909119873119910

119875119909119875119910119875119911

(4)

Divided by a constant119873119909sdot 119873119910sdot 119873119911 (4) will be

ave1015840

=

(119875119909minus 1)119873

119910119873119911+ (119875119910minus 1)119873

119911119873119909+ (119875119911minus 1)119873

119909119873119910

119875119909119875119910119875119911sdot 119873119909119873119910119873119911

=1119875119909119875119910119875119911

sdot (119875119909minus 1119873119909

+

119875119910minus 1119873119910

+119875119911minus 1119873119911

)

(5)

From (5) it is known when and only (119875119909minus 1)119873

119909=

(119875119910minus 1)119873

119910= (119875119911minus 1)119873

119911 namely the topology conformal

as calculation region the communication amount of singleprocess is the least Generally the equation above cannot besatisfied absolutely So the topology distribution is requiredthat it is divided along the direction which is conformal ascalculation region as possible to make (5) the least

Generally speaking the communication time is lessbetween processes in one node than that between processeswhich belong to different nodes [12 13] that is the onebyte data communication time factor 120573 is different betweenprocesses in one node and across nodes So when the factors119862 and 119871 are the same between two different topologiesthe communication amount across nodes is needed to beconsidered

For certain grids the total memory requirement (called119872) keeps the same for different topologies The memorydistribution of each process (called119898) is as follows

119898 =119872

119875119909119875119910119875119911

(6)

Equation (6) indicates that the memory distribution of eachprocess is unrelated to the virtual topologies

From the analysis above it is known that the commu-nication surface area varies from different virtual topologyscheme for certain grids The communication time will bechanged associated with the virtual topology scheme whilethe memory distribution of each process remains the sameThus the communication amount is the main factor whichaffects the parallel performance

4 Discussions on Parallel Performance

41 Simulation Model Based on the theory above a four-element microstrip antenna array is used as the model forbenchmark The parallel FDTD code is run to analyze the

International Journal of Antennas and Propagation 3

x

y

z

h

y

x

Substrate

Topology distribution at y direction

Topology distribution at x direction

Topology distribution at z direction

xp

yp

xf

yd

xd

xp

yp

120576r

yg

xg

Figure 1 2 times 2 microstrip array

virtual topology schemes on two supercomputer center plat-forms National Supercomputer Center in Tianjin (NSCC-TJ)and National Supercomputing Center in Shenzhen (NSCC-SZ) as listed in Table 1

The array model is shown in Figure 1 The parameters ofthis array are as follows Central frequency is 497GHz 119909

119901=

14mm 119910119901= 96mm 119909

119889= 15mm 119910

119889= 15mm 120576

119903= 434

ℎ = 08mm 119909119892= 60mm 119910

119892= 60mm and 119909

119891= 36mm

The size of grid is 119889119909 = 119889119910 = 119889119911 = 04mmActually the amount of total grids is just 200 times 200 times 50

However to test the influence of different virtual topologyschemes on parallel performance of parallel FDTD thecomputational space needs to be extended So in this test theamount of total grids is set as 1200 times 1200 times 300

The radiation patterns of the microstrip array areshown in Figure 2 compared with the results obtained fromHFSSThefigure shows that there is awell agreement betweenthem

42 Discussion of Parallel Performance Here we select sev-eral groups of virtual topology schemes to be tested Thefollowing are the test results on the two supercomputer centerplatforms

421 NSCC-TJ Table 2 is the comparison of total computa-tion time (in seconds) with different CPU cores and differentvirtual topology schemes The maximum number of CPUcores used for test is 120

In Table 2 virtual topology schemes are described as (119909times119910 times 119911) for all three communication patterns If the value is1 in some direction it implies that there is no topology inthis direction For example 2 times 1 times 1 means that there is notopology in 119910 and 119911 directions respectively thus the virtualtopology is actually in one dimension Similarly 8 times 8 times 1means that there is no topology in 119911 direction thus the virtual

Table 2 Comparisons of virtual topology amount of communica-tion and computation time on NSCC-TJ

CPUcores

Virtualtopology(119909 times 119910 times 119911)

Amount ofcommunication

Computationtime (inseconds)

12 3 times 2 times 2 2520000 64666816 4 times 2 times 2 2880000 61405632 4 times 4 times 2 3600000 285897

648 times 8 times 1 5040000 1376858 times 4 times 2 5040000 13799216 times 2 times 2 7200000 182437

96

8 times 6 times 2 5760000 946528 times 4 times 3 6480000 15005412 times 4 times 2 6480000 15519516 times 3 times 2 7560000 167942

120

6 times 10 times 2 6480000 7021610 times 6 times 2 6480000 808655 times 12 times 2 6840000 7136712 times 5 times 2 6840000 863655 times 6 times 4 7560000 724906 times 5 times 4 7560000 10764115 times 4 times 2 7560000 109652

topology is actually in two dimensions In our work oneprocess uses one CPU core

The speedup and parallel efficiency of the code are shownin Figure 3 From Figure 3 it can be seen that the parallelefficiency reaches up to 80 on NSCC-TJ

From Table 2 it is obvious that increasing the numberof CPU cores can bring us the reduction of the computationtime rapidly But different virtual topology schemes will costdifferent computation time even if the code is run with thesame number of processes Next the parallel performance ofthe parallel FDTD will be discussed

Here the cases of 96 and 120 cores are taken as theexamples From (3)

119871 = (119875119909minus 1)119873

119910119873119911+ (119875119910minus 1)119873

119911119873119909

+ (119875119911minus 1)119873

119909119873119910

(7)

The following is known

(a) 96 Cores Consider

8times 6times 2 (8 minus 1) times (1200 times 300) + (6 minus 1) times (1200 times 300) +(2 minus 1) times (1200 times 1200) = 5760000 (94652 s)(05 GBprocess)



0

30

60

90

120

150

180

210

240

270

300

3300

minus10

minus20

minus30

minus40

minus40

minus30

minus20

minus10

0

HFSSFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

3300

minus10

minus20

minus30

minus40

minus30

minus20

minus10

0

HFSSFDTD

(b) 119910119900119911 plane

Figure 2 The radiation patterns of the 2 times 2 microstrip antenna array

Spee

dup

128

112

96

80

64

48

32

16

0

Number of CPU cores0 16 32 48 64 80 96 112 128

Ideal speedupThis paperrsquos approach

(a) Speedup


Ideal parallel efficiencyThis paperrsquos approach

Para

llel e

ffici

ency

12

10

08

06

04

02

00

(b) Parallel efficiency

Figure 3 The speedup and parallel efficiency of the code from 12CPU cores to 120CPU cores on NSCC-TJ

(b) 120 Cores Consider

5times 6times 4 (5 minus 1) times (1200 times 300) + (6 minus 1) times (1200 times 300) +(4 minus 1) times (1200 times 1200) = 7560000 (72490 s)(04GBprocess)


From above it is known that for the virtual topologieswith the same dimensions the total grids at interfaces less(the amount of communication data less) can save calculationtime effectivelyMeanwhile it is obvious that thememory dis-tribution of each process is the same for different topologieswith the same number of CPU cores

But from above it also can be seen that for the sameamount of grids the calculation time has certain differences


xy

z

1 1 1 2 22

33

34 4

4

5 55

6 66

77

78 8

8

9 99

10 1010

(a) 5 times 6 times 4

1

x

y

z

1 1 2 2

23 3 3 4

4 45 5

5

66 6 7

7

7 8 8 8 9

9 9 10 10 10

(b) 6 times 5 times 4

Figure 4 The diagram of communication mode across nodes between the virtual topologies 5 times 6 times 4 and 6 times 5 times 4

For example the topologies 6 times 5 times 4 and 5 times 6 times 4with the same grids number and the calculation time are107641 seconds and 72490 seconds respectively Generallywe believe that the consumption time between processes inone node is less and the more time consumed between onesacross nodes [12 13] So here we speculate that the differentamount of the communication grids across nodes causes thedifference between the two cases

For 5 times 6 times 4 it has 4 times (1200 times 300) + 1 times (1200 times 300) =1800000 FDTD grids needed to communicate across nodesand its calculation time is 72490 seconds while for 6 times 5 times4 it has 5 times (1200 times 300) + (1 + 26) times (1200 times 300) = 2280000FDTD grids needed to communicate across nodes and itscalculation time is 107641 seconds The calculation way ofcommunication data above is shown in Figure 4

In Figure 4 every two adjacent figures with differentcolors have data communication Figures 1 to 10 present tennodes and the adjacent figures with the same colors presentthat they are in the same node For (b) the adjacent columnswill transfer ((1 + 26) times (1200 times 300)) data in 119909119900119911 plane andthe adjacent rows will transfer (5 times (1200 times 300)) data in 119910119900119911plane

422 NSCC-SZ Table 3 is the comparison of total computa-tion time (in seconds) with different CPU cores and differentvirtual topology schemes The maximum number of CPUcores used for test is 480

In Table 3 virtual topology schemes are described as(119909 times 119910 times 119911) for all three communication patterns and themeaning of each figure in each topology is the same with thedescription in Section 421 above

The speedup and parallel efficiency of the code are shownin Figure 5 From Figure 5 it can be seen that the parallelefficiency reaches up to 80 on NSCC-SZ

FromTable 3 it can be seen that for the virtual topologieswith the same dimensions the total grids at interfaces less(the amount of communication data less) can save calculation

Table 3 Comparisons of virtual topology amount of communica-tion and computation time on NSCC-SZ

CPUcores




48 4 times 6 times 2 4320000 1271026 times 4 times 2 4320000 125612

60


96


120


240


360


480

10 times 12 times 4 11520000 1474812 times 10 times 4 11520000 1565315 times 8 times 4 11880000 1575715 times 16 times 2 11880000 1625412 times 8 times 5 12240000 161858 times 12 times 5 12240000 1653


Number of CPU cores


Spee

dup

480

384

288

192

96

0

0 48 96 144 192 240 288 336 384 432 480 528

(a) Speedup

Para

llel e

ffici

ency

12

10

08

06

04

02

00


Number of CPU cores0 48 96 144 192 240 288 336 384 432 480 528


Figure 5 The speedup and parallel efficiency of the code from 48CPU cores to 480CPU cores on NSCC-SZ

time effectively This conclusion coincided with the case onNSCC-TJ

The amount of communication grids of each virtualtopology is calculated by (3) while for certain topologieswith the same number of grids it is found that the calculationtime is less with the more crossing-node communicationanalyzed by the way of NSCC-TJ It is contrary to the spec-ulation theory above Therefore we speculate that whether itis caused by the reason that the communication amount ofnodes with the main communication is great in the topologyNext the cases of 10 times 2 times 3 and 3 times 5 times 4 in 60 cores aretaken as the examples to analyze the speculation (the amountof communication is 6480000)

For 10 times 2 times 3 the crossing-node communication is 4 times(1200 times 300) + 1 times (1200 times 300) = 1800000 while for 3 times 5 times 4the one is 2 times (1200 times 300) + 1612 times (1200 times 300) = 1200000But the consumption time for 10 times 2 times 3 is 111717 seconds andthe one for 3 times 5 times 4 is 121409 seconds This does not agreewith the speculation of crossing-node theory on NSCC-TJTo further explore the reason the heaviest communicationand the lightest communication of the related processes fromthese two virtual topology schemes are listed in Table 4The heaviest communication load in the virtual topologyscheme 10 times 2 times 3 is (1200 times 300)30 + (1200 times 300) times26 + (1200 times 1200) times 220 = 276000 while in the virtualtopology scheme 3 times 5 times 4 the one is (1200 times 300) times 212 +(1200 times 300) times 220 + (1200 times 1200) times 215 = 288000 Theheaviest communication loads are assigned to the processeslocated at the center of the process grid Similarly the lightestcommunication loads that are set to the processes at thecorner of the process grid can be calculated The differencebetween the heaviest communication load and the lightestcommunication load in the virtual topology scheme 3 times 5 times4 is larger than the one in 10 times 2 times 3 which results indifferent computation timeThis indicates that when the total

z

yx

Figure 6 The model of the microstrip antenna array

Table 4 Comparisons of communication load

Topology The heaviest The lightest Difference10 times 2 times 3 276000 144000 1320003 times 5 times 4 288000 144000 144000

amount of communication is the same a virtual topologyschemewith amore balanced communication loadmay bringa better performance although with a more crossing-nodecommunication

43 The General Rules of Optimal Virtual Topology Gen-erally when the amount of FDTD cells is the same theMPI virtual topology by the way where the amount oftransferred data is less can save the computation time forthe same dimensional virtual topologyThe best performanceof a parallel FDTD code can be obtained by optimizing theMPI virtual topology scheme The general rules for a betterperformance are as follows

(a) Select MPI virtual topology scheme to make the totalcommunication 119871 (equation (3)) the smallest


0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus20

minus10

0

MoMFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus30

minus20

minus10

0

MoMFDTD

(b) 119910119900119911 plane

Figure 7 The radiation patterns of the array

y

xz

Figure 8 The model of the microstrip array

(b) When the total communication 119871 is the same set thetopology as with less crossing-node communication

(c) When the amount of crossing-node communicationof different topologies is approximately the sameselect the topology with a more balanced communi-cation load

5 Applications Using 10 Thousands ofCPU Cores

Based on the optimal virtual topology scheme above theparallel FDTD code is applied to analyze some complicatedEM problems and they are run on Shanghai SupercomputerCenter (SSC) platform

51 Radiation of Microstrip Antenna Array

511 Validation of the FDTD Code To validate the FDTDcode a microstrip antenna array with hundreds of the same

Para

llel e

ffici

ency

12

10

08

06

04

02

00

Number of CPU cores4096 6144 8192 10240


Figure 9 The parallel efficiency of 4096sim10240 cores

antenna units is analyzed and the results are compared withthe ones provided by MoM The model is shown in Figure 6The amount of total grids is 786 times 1224 times 54 It is calculatedon PC with the virtual topology scheme 4 times 3 times 2

The radiation patterns of the array are shown in Fig-ure 7 compared with the ones provided by MoM FromFigure 7 one can see that there are very well agreementsbetween them


0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus40

minus40

minus20

0

xoz

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus60

minus40

minus40

minus60

minus80

minus20

0

yoz

(b) 119910119900119911 plane

Figure 10 The radiation patterns of the ellipse microstrip antenna array

z y

x

Figure 11 The model of the airplane

512 Radiation of an Ellipse Antenna Array An ellipsemicrostrip antenna array with nearly 2000 elements is shownin Figure 8 In this array the amount of total grids is 6016 times1160 times 54 (about 04 billion) It is tested on ShanghaiSupercomputer Center (SSC) using 10240 cores and theparallel efficiency from 4096 to 10240 cores is tested

The parallel efficiency of this code is shown in Figure 9And the radiation patterns of this array are shown inFigure 10

From Figure 9 it is known that the parallel efficiencyreaches to nearly 80 at 10240 cores with 4096 cores asbenchmark which achieves one of the best efficiencies everreached using more than 10 thousands of CPU cores

52 RCS of an Electrically Large Airplane

521 Simulation Model The airplane is shown in Fig-ure 11 The parameters are as follows Incident wave fre-quency is 40GHz The direction of incident wave is+119910 The polarization is E

119911 The size of the airplane is

524256m times 210312m times 2944213m Electric size is about700120582times 280120582times 39120582 The size of grid is 119889119909 = 119889119910 = 119889119911 =00075m The amount of total grids is 7020 times 2840 times 420(about 84 billion) It is tested on Shanghai SupercomputerCenter (SSC) using 10240 cores

522 Result The computation time of this code is 603686seconds The RCS of the airplane is shown in Figure 12

6 Conclusion

A guideline is presented for using parallel FDTD on super-computers with more than 10 thousands of CPU coresbased on a theoretical communication model given in thispaper The benchmarks obtained on two supercomputersvalidated the optimal virtual topology rules Radiation froma large microstrip antenna array and scattering from anelectrically large airplane are simulated successfully whichindicate the capability of the method presented in thispaper for those types of real-life EM problems


0

30

60120

150

180

210

240

270

300

330

0

40

80

80

60

60

40

20

20

minus20

minus40

minus20

0

xoy

90

(a) 119909119900119910 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

60

40

20

20

minus20

0

xoz

(b) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

80

60

80

40

20

20

minus20

0

yoz

(c) 119910119900119911 plane

Figure 12 The RCS of the airplane

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is supported by the National High TechnologyResearch andDevelopment Program of China (863 Program)(2012AA01A308) the NSFC (61301069 61072019) the Projectwith Contract no 2013KJXX-67 and the Program for New

Century Excellent Talents in University of China (NCET-13-0949) The computational resources utilized in this researchare provided by the National Supercomputer Center inTianjin (NSCC-TJ) National Supercomputing Center inShenzhen (NSCC-SZ) and Shanghai Supercomputer Center(SSC)

References

[1] A Taflove Computational Electrodynamics The Finite-Differ-ence Time-Domain Method Artech House Norwood MassUSA 2000


[2] D Ge and Y Yan Finite-Difference Time-Domain Method forElectromagnetic Waves Version 3 Xidian University PressXirsquoan China 2011 (Chinese)

[3] J L Volakis D B Davidson C Guiffaut and K Mahdjoubi ldquoAparallel FDTD algorithm using theMPI libraryrdquo IEEE Antennasand Propagation Magazine vol 43 no 2 pp 94ndash103 2001

[4] U Andersson Time-domain methods for the Maxwell equations[PhD thesis] Royal Institute of Technology Stockholm Swe-den 2001

[5] Y Zhang J Song and C H Liang ldquoMPI-based parallelizedlocally conformal fdtd for modeling slot antennas and newperiodic structures in microstriprdquo Journal of ElectromagneticWaves and Applications vol 18 no 10 pp 1321ndash1335 2004

[6] Y Zhang J Song and C Liang ldquoStudy on the parallel modifiedlocally conformal FDTD algorithm on cluster of PCs for PBGstructuresrdquo Acta Electronica Sinica vol 31 no 12A pp 2142ndash2144 2003

[7] Z Yu D Wei and C Liang ldquoAnalysis of parallel performanceof MPI based parallel FDTD on PC clustersrdquo in Proceedings ofthe Asia-Pacific Conference Proceedings Microwave ConferenceProceedings (APMC rsquo05) vol 4 December 2005

[8] W Yu X Yang Y Liu et al ldquoA new direction in computationalelectromagnetics solving large problems using the parallelFDTD on the BlueGeneL supercomputer providing teraflop-level performancerdquo IEEE Antennas and Propagation Magazinevol 50 no 2 pp 26ndash44 2008

[9] httpwwwnscc-tjgovcn[10] httpwwwnsccszgovcn[11] httpwwwsscnetcn[12] W Chen and J Zhai Preliminary Analysis on Communication

Performance of Dawning 5000A 863 High Performance Com-puter Testing Center of Tsinghua University 2008

[13] Intel Corporation Intel MPI Library for Linux OS ReferenceManual Intel Corporation 2011 httpssoftwareintelcomsitesproductsdocumentationhpcmpilinuxreference manualpdf

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Active and Passive Electronic Components

Control Scienceand Engineering

Journal of



RotatingMachinery


Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics


Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation



DistributedSensor Networks



Table 1 Parameters of computation resources

Platform CPU Memorynode Clock speed Coresnode Total cores used in this paper

SSC AMD 8347HE 64-bit four-core 64G 19GHz 16 800032G 4800

NSCC-TJ Intel Xeon 5670 six-core 24G 293GHz 12 120NSCC-SZ Intel Xeon 5650 six-core 24G 256GHz 12 512

Shenzhen (NSCC-SZ) [10] and Shanghai SupercomputerCenter (SSC) [11] The parameters of computation resourcesin this paper are listed in Table 1

3 Communication Model for Parallel FDTD

Communication is the main factor which affects the par-allel performance of parallel codes Therefore reducing theamount of communication in FDTD by adjusting the virtualtopology is selected as the optimization target

Assume that the communication time in one time step is

119879 = 120572119862+120573 sdot 2119871 (1)

where 120572 is communication delay time 119862 is communicationnumber 120573 is transmission speed and 119871 is the communicationdata amount of E or H The calculated equation of eachparameter is as follows

119862 = 6119875119909119875119910119875119911minus 2 (119875

119909119875119910+119875119910119875119911+119875119911119875119909) (2)

119871 = (119875119909minus 1)119873

119910119873119911+ (119875119910minus 1)119873

119911119873119909

+ (119875119911minus 1)119873

119909119873119910

(3)

where 119875119909 119875119910 and 119875

119911are the topology values in three direc-

tions and119873119909119873119910 and119873

119911are the grids number in 119909 119910 and 119911

directionsFrom (1) it is known that when the total communication

data amount is the same the different topology scheme maybring the different communication number 119862 which comesto different total time 119879

Take Dawning 5000A as example the parameters 120572 =18 ussim25 us and 120573 = 1(16563Gbs) [12] Assume that thetotal grids are 1000 times 1000 times 1000 and total cores are 1000now the total communication delay time (972ms) is aboutless an order ofmagnitude than the total communication time(121ms) Under this cores scale the communication delaytime is the secondary factor

The communication amount of single process is

ave

=119871

119875119909119875119910119875119911

=

(119875119909minus 1)119873

119910119873119911+ (119875119910minus 1)119873

119911119873119909+ (119875119911minus 1)119873

119909119873119910

119875119909119875119910119875119911

(4)

Divided by a constant119873119909sdot 119873119910sdot 119873119911 (4) will be

ave1015840

=

(119875119909minus 1)119873

119910119873119911+ (119875119910minus 1)119873

119911119873119909+ (119875119911minus 1)119873

119909119873119910

119875119909119875119910119875119911sdot 119873119909119873119910119873119911

=1119875119909119875119910119875119911

sdot (119875119909minus 1119873119909

+

119875119910minus 1119873119910

+119875119911minus 1119873119911

)

(5)

From (5) it is known when and only (119875119909minus 1)119873

119909=

(119875119910minus 1)119873

119910= (119875119911minus 1)119873

119911 namely the topology conformal

as calculation region the communication amount of singleprocess is the least Generally the equation above cannot besatisfied absolutely So the topology distribution is requiredthat it is divided along the direction which is conformal ascalculation region as possible to make (5) the least

Generally speaking the communication time is lessbetween processes in one node than that between processeswhich belong to different nodes [12 13] that is the onebyte data communication time factor 120573 is different betweenprocesses in one node and across nodes So when the factors119862 and 119871 are the same between two different topologiesthe communication amount across nodes is needed to beconsidered

For certain grids the total memory requirement (called119872) keeps the same for different topologies The memorydistribution of each process (called119898) is as follows

119898 =119872

119875119909119875119910119875119911

(6)

Equation (6) indicates that the memory distribution of eachprocess is unrelated to the virtual topologies

From the analysis above it is known that the commu-nication surface area varies from different virtual topologyscheme for certain grids The communication time will bechanged associated with the virtual topology scheme whilethe memory distribution of each process remains the sameThus the communication amount is the main factor whichaffects the parallel performance

4 Discussions on Parallel Performance

41 Simulation Model Based on the theory above a four-element microstrip antenna array is used as the model forbenchmark The parallel FDTD code is run to analyze the


x

y

z

h

y

x

Substrate




xp

yp

xf

yd

xd

xp

yp

120576r

yg

xg




119901=

14mm 119910119901= 96mm 119909

119889= 15mm 119910

119889= 15mm 120576

119903= 434

ℎ = 08mm 119909119892= 60mm 119910

119892= 60mm and 119909

119891= 36mm








CPUcores






96


120






119871 = (119875119909minus 1)119873

119910119873119911+ (119875119910minus 1)119873

119911119873119909

+ (119875119911minus 1)119873

119909119873119910

(7)






0

30

60

90

120

150

180

210

240

270

300

3300

minus10

minus20

minus30

minus40

minus40

minus30

minus20

minus10

0

HFSSFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

3300

minus10

minus20

minus30

minus40

minus30

minus20

minus10

0

HFSSFDTD

(b) 119910119900119911 plane


Spee

dup

128

112

96

80

64

48

32

16

0



(a) Speedup



Para

llel e

ffici

ency

12

10

08

06

04

02

00









xy

z

1 1 1 2 22

33

34 4

4

5 55

6 66

77

78 8

8

9 99

10 1010


1

x

y

z

1 1 2 2

23 3 3 4

4 45 5

5

66 6 7

7

7 8 8 8 9

9 9 10 10 10











CPUcores





60


96


120


240


360


480



Number of CPU cores


Spee

dup

480

384

288

192

96

0

0 48 96 144 192 240 288 336 384 432 480 528

(a) Speedup

Para

llel e

ffici

ency

12

10

08

06

04

02

00








z

yx








0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus20

minus10

0

MoMFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus30

minus20

minus10

0

MoMFDTD

(b) 119910119900119911 plane


y

xz








Para

llel e

ffici

ency

12

10

08

06

04

02

00







0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus40

minus40

minus20

0

xoz

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus60

minus40

minus40

minus60

minus80

minus20

0

yoz

(b) 119910119900119911 plane


z y

x










6 Conclusion



0

30

60120

150

180

210

240

270

300

330

0

40

80

80

60

60

40

20

20

minus20

minus40

minus20

0

xoy

90

(a) 119909119900119910 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

60

40

20

20

minus20

0

xoz

(b) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

80

60

80

40

20

20

minus20

0

yoz

(c) 119910119900119911 plane




Acknowledgments



References















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










x

y

z

h

y

x

Substrate




xp

yp

xf

yd

xd

xp

yp

120576r

yg

xg




119901=

14mm 119910119901= 96mm 119909

119889= 15mm 119910

119889= 15mm 120576

119903= 434

ℎ = 08mm 119909119892= 60mm 119910

119892= 60mm and 119909

119891= 36mm








CPUcores






96


120






119871 = (119875119909minus 1)119873

119910119873119911+ (119875119910minus 1)119873

119911119873119909

+ (119875119911minus 1)119873

119909119873119910

(7)






0

30

60

90

120

150

180

210

240

270

300

3300

minus10

minus20

minus30

minus40

minus40

minus30

minus20

minus10

0

HFSSFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

3300

minus10

minus20

minus30

minus40

minus30

minus20

minus10

0

HFSSFDTD

(b) 119910119900119911 plane


Spee

dup

128

112

96

80

64

48

32

16

0



(a) Speedup



Para

llel e

ffici

ency

12

10

08

06

04

02

00









xy

z

1 1 1 2 22

33

34 4

4

5 55

6 66

77

78 8

8

9 99

10 1010


1

x

y

z

1 1 2 2

23 3 3 4

4 45 5

5

66 6 7

7

7 8 8 8 9

9 9 10 10 10











CPUcores





60


96


120


240


360


480



Number of CPU cores


Spee

dup

480

384

288

192

96

0

0 48 96 144 192 240 288 336 384 432 480 528

(a) Speedup

Para

llel e

ffici

ency

12

10

08

06

04

02

00








z

yx








0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus20

minus10

0

MoMFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus30

minus20

minus10

0

MoMFDTD

(b) 119910119900119911 plane


y

xz








Para

llel e

ffici

ency

12

10

08

06

04

02

00







0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus40

minus40

minus20

0

xoz

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus60

minus40

minus40

minus60

minus80

minus20

0

yoz

(b) 119910119900119911 plane


z y

x










6 Conclusion



0

30

60120

150

180

210

240

270

300

330

0

40

80

80

60

60

40

20

20

minus20

minus40

minus20

0

xoy

90

(a) 119909119900119910 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

60

40

20

20

minus20

0

xoz

(b) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

80

60

80

40

20

20

minus20

0

yoz

(c) 119910119900119911 plane




Acknowledgments



References















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










0

30

60

90

120

150

180

210

240

270

300

3300

minus10

minus20

minus30

minus40

minus40

minus30

minus20

minus10

0

HFSSFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

3300

minus10

minus20

minus30

minus40

minus30

minus20

minus10

0

HFSSFDTD

(b) 119910119900119911 plane


Spee

dup

128

112

96

80

64

48

32

16

0



(a) Speedup



Para

llel e

ffici

ency

12

10

08

06

04

02

00









xy

z

1 1 1 2 22

33

34 4

4

5 55

6 66

77

78 8

8

9 99

10 1010


1

x

y

z

1 1 2 2

23 3 3 4

4 45 5

5

66 6 7

7

7 8 8 8 9

9 9 10 10 10











CPUcores





60


96


120


240


360


480



Number of CPU cores


Spee

dup

480

384

288

192

96

0

0 48 96 144 192 240 288 336 384 432 480 528

(a) Speedup

Para

llel e

ffici

ency

12

10

08

06

04

02

00








z

yx








0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus20

minus10

0

MoMFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus30

minus20

minus10

0

MoMFDTD

(b) 119910119900119911 plane


y

xz








Para

llel e

ffici

ency

12

10

08

06

04

02

00







0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus40

minus40

minus20

0

xoz

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus60

minus40

minus40

minus60

minus80

minus20

0

yoz

(b) 119910119900119911 plane


z y

x










6 Conclusion



0

30

60120

150

180

210

240

270

300

330

0

40

80

80

60

60

40

20

20

minus20

minus40

minus20

0

xoy

90

(a) 119909119900119910 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

60

40

20

20

minus20

0

xoz

(b) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

80

60

80

40

20

20

minus20

0

yoz

(c) 119910119900119911 plane




Acknowledgments



References















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










xy

z

1 1 1 2 22

33

34 4

4

5 55

6 66

77

78 8

8

9 99

10 1010


1

x

y

z

1 1 2 2

23 3 3 4

4 45 5

5

66 6 7

7

7 8 8 8 9

9 9 10 10 10











CPUcores





60


96


120


240


360


480



Number of CPU cores


Spee

dup

480

384

288

192

96

0

0 48 96 144 192 240 288 336 384 432 480 528

(a) Speedup

Para

llel e

ffici

ency

12

10

08

06

04

02

00








z

yx








0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus20

minus10

0

MoMFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus30

minus20

minus10

0

MoMFDTD

(b) 119910119900119911 plane


y

xz








Para

llel e

ffici

ency

12

10

08

06

04

02

00







0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus40

minus40

minus20

0

xoz

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus60

minus40

minus40

minus60

minus80

minus20

0

yoz

(b) 119910119900119911 plane


z y

x










6 Conclusion



0

30

60120

150

180

210

240

270

300

330

0

40

80

80

60

60

40

20

20

minus20

minus40

minus20

0

xoy

90

(a) 119909119900119910 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

60

40

20

20

minus20

0

xoz

(b) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

80

60

80

40

20

20

minus20

0

yoz

(c) 119910119900119911 plane




Acknowledgments



References















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










Number of CPU cores


Spee

dup

480

384

288

192

96

0

0 48 96 144 192 240 288 336 384 432 480 528

(a) Speedup

Para

llel e

ffici

ency

12

10

08

06

04

02

00








z

yx








0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus20

minus10

0

MoMFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus30

minus20

minus10

0

MoMFDTD

(b) 119910119900119911 plane


y

xz








Para

llel e

ffici

ency

12

10

08

06

04

02

00







0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus40

minus40

minus20

0

xoz

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus60

minus40

minus40

minus60

minus80

minus20

0

yoz

(b) 119910119900119911 plane


z y

x










6 Conclusion



0

30

60120

150

180

210

240

270

300

330

0

40

80

80

60

60

40

20

20

minus20

minus40

minus20

0

xoy

90

(a) 119909119900119910 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

60

40

20

20

minus20

0

xoz

(b) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

80

60

80

40

20

20

minus20

0

yoz

(c) 119910119900119911 plane




Acknowledgments



References















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus20

minus10

0

MoMFDTD

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

30

30

20

20

10

10

minus10

minus20

minus30

minus20

minus10

0

MoMFDTD

(b) 119910119900119911 plane


y

xz








Para

llel e

ffici

ency

12

10

08

06

04

02

00







0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus40

minus40

minus20

0

xoz

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus60

minus40

minus40

minus60

minus80

minus20

0

yoz

(b) 119910119900119911 plane


z y

x










6 Conclusion



0

30

60120

150

180

210

240

270

300

330

0

40

80

80

60

60

40

20

20

minus20

minus40

minus20

0

xoy

90

(a) 119909119900119910 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

60

40

20

20

minus20

0

xoz

(b) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

80

60

80

40

20

20

minus20

0

yoz

(c) 119910119900119911 plane




Acknowledgments



References















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus40

minus40

minus20

0

xoz

(a) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

40

20

20

minus20

minus60

minus40

minus40

minus60

minus80

minus20

0

yoz

(b) 119910119900119911 plane


z y

x










6 Conclusion



0

30

60120

150

180

210

240

270

300

330

0

40

80

80

60

60

40

20

20

minus20

minus40

minus20

0

xoy

90

(a) 119909119900119910 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

60

40

20

20

minus20

0

xoz

(b) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

80

60

80

40

20

20

minus20

0

yoz

(c) 119910119900119911 plane




Acknowledgments



References















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










0

30

60120

150

180

210

240

270

300

330

0

40

80

80

60

60

40

20

20

minus20

minus40

minus20

0

xoy

90

(a) 119909119900119910 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

60

40

20

20

minus20

0

xoz

(b) 119909119900119911 plane

0

30

60

90

120

150

180

210

240

270

300

330

0

40

60

80

60

80

40

20

20

minus20

0

yoz

(c) 119910119900119911 plane




Acknowledgments



References















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation






















RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation











RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation









Documents

Research Article An Optimized Parallel FDTD Topology for ...downloads.hindawi.com/journals/ijap/2015/690510.pdf · Although there are many publications on parallel FDTD, few of them