
Research Article
An Optimized Parallel FDTD Topology for Challenging Electromagnetic Simulations on Supercomputers

Shugang Jiang, Yu Zhang, Zhongchao Lin, and Xunwang Zhao

School of Electronic Engineering, Xidian University, Xi'an, Shaanxi 710071, China

Correspondence should be addressed to Shugang Jiang; zaishuiyifang1311@126.com

Received 23 March 2015; Accepted 14 May 2015

Academic Editor: Giuseppe Mazzarella

Copyright © 2015 Shugang Jiang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

It may not be a challenge to run a Finite-Difference Time-Domain (FDTD) code for electromagnetic simulations on a supercomputer with more than ten thousand CPU cores; however, making an FDTD code work with the highest efficiency is a challenge. In this paper, the performance of parallel FDTD is optimized through the MPI (message passing interface) virtual topology, based on which a communication model is established. The general rules of the optimal topology are presented according to the model. The performance of the method is tested and analyzed on three high performance computing platforms with different architectures in China. Simulations including an airplane with a 700-wavelength wingspan and a complex microstrip antenna array with nearly 2000 elements are performed very efficiently using a maximum of 10240 CPU cores.

1. Introduction

The principle of FDTD is that the calculation region is discretized by the Yee grid so that the components of E and H are distributed alternately in time and space [1]. Then there are four H (or E) components around each E (or H) component. This character makes the algorithm parallel in nature, and using it the Maxwell equations can be transformed into a set of difference equations. The electromagnetic fields can be solved step by step along the time axis. Then the electromagnetic field distribution at each later time step can be obtained from the original values and the boundary conditions [2].

Research on MPI-based parallel FDTD for simulating complicated models has been published over the past decade. In 2001, Volakis et al. presented a parallel FDTD algorithm using the MPI library, where they raised an MPI Cartesian 2D topology [3]. Andersson developed parallel FDTD with a 3D MPI topology in the same year [4]. In 2005, the authors studied the optimum virtual topology for the MPI-based parallel conformal FDTD algorithm on PC clusters [5-7]. In 2008, Yu et al. successfully tested the parallel efficiency of parallel FDTD [8] on the BlueGene/L supercomputer and reported the parallel efficiency at 4000 cores under balanced loads.

Although there are many publications on parallel FDTD, few of them involve parallel FDTD simulations utilizing more than 10000 cores. Most of the papers focused on load balancing when parallel efficiency was concerned; in addition, a more precise rule for achieving the best performance needs to be given, especially for simulations using tens of thousands of CPU cores on supercomputers.

With these concerns, in this paper, the influence of different virtual topology schemes on the parallel performance of FDTD is studied through a theoretical model analysis. Then tests are made at the National Supercomputer Center in Tianjin (NSCC-TJ) and the National Supercomputing Center in Shenzhen (NSCC-SZ) to verify the feasibility of the theory. With the proposed theory model, some electrically large problems whose parallel scale is up to 10240 cores are provided in this paper. The parallel efficiency is nearly 80% when 10240 cores of SSC are utilized for an array with nearly 2000 elements. To the best of our knowledge, the proposed method achieves one of the best efficiencies ever reached using more than ten thousand CPU cores.

2. Computation Resources from Supercomputers

The program is tested on different clusters in three supercomputer centers: the National Supercomputer Center in Tianjin (NSCC-TJ) [9], the National Supercomputing Center in

Hindawi Publishing Corporation, International Journal of Antennas and Propagation, Volume 2015, Article ID 690510, 10 pages. http://dx.doi.org/10.1155/2015/690510


Table 1: Parameters of computation resources.

Platform | CPU                         | Memory/node | Clock speed | Cores/node | Total cores used in this paper
SSC      | AMD 8347HE 64-bit four-core | 64 GB/32 GB | 1.9 GHz     | 16         | 8000/4800
NSCC-TJ  | Intel Xeon 5670 six-core    | 24 GB       | 2.93 GHz    | 12         | 120
NSCC-SZ  | Intel Xeon 5650 six-core    | 24 GB       | 2.56 GHz    | 12         | 512

Shenzhen (NSCC-SZ) [10], and the Shanghai Supercomputer Center (SSC) [11]. The parameters of the computation resources used in this paper are listed in Table 1.

3. Communication Model for Parallel FDTD

Communication is the main factor affecting the performance of parallel codes. Therefore, reducing the amount of communication in FDTD by adjusting the virtual topology is selected as the optimization target.

Assume that the communication time in one time step is

T = α C + β · 2L,    (1)

where α is the communication delay time, C is the communication number, β is the time to transmit one unit of data (the reciprocal of the transmission speed), and L is the communication data amount of E or H. Each parameter is calculated as follows:

C = 6 P_x P_y P_z − 2 (P_x P_y + P_y P_z + P_z P_x),    (2)

L = (P_x − 1) N_y N_z + (P_y − 1) N_z N_x + (P_z − 1) N_x N_y,    (3)

where P_x, P_y, and P_z are the topology values in the three directions and N_x, N_y, and N_z are the grid numbers in the x, y, and z directions.

From (1) it is known that, when the total communication data amount is the same, different topology schemes may bring different communication numbers C, which leads to different total times T.
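As a quick sketch, (2) and (3) can be evaluated for any candidate topology; the function names below are ours, not from the paper:

```python
# Communication model of Section 3: C of eq. (2) and L of eq. (3) for a
# virtual topology (Px, Py, Pz) on an Nx x Ny x Nz FDTD grid.

def comm_count(Px, Py, Pz):
    """Number of interface exchanges C per time step, eq. (2)."""
    return 6 * Px * Py * Pz - 2 * (Px * Py + Py * Pz + Pz * Px)

def comm_grids(Px, Py, Pz, Nx, Ny, Nz):
    """Total FDTD grids on subdomain interfaces L, eq. (3)."""
    return (Px - 1) * Ny * Nz + (Py - 1) * Nz * Nx + (Pz - 1) * Nx * Ny

# The 1200 x 1200 x 300 benchmark grid of Section 4, topology 3 x 2 x 2:
print(comm_count(3, 2, 2))                   # 40
print(comm_grids(3, 2, 2, 1200, 1200, 300))  # 2520000, as in Table 2
```

The same `comm_grids` value is what the tables in Section 4 list in the "amount of communication" column.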

Take Dawning 5000A as an example, with parameters α = 1.8 µs~2.5 µs and β = 1/(1.6563 Gb/s) [12]. Assume that the total grid is 1000 × 1000 × 1000 and the total number of cores is 1000; then the total communication delay time (9.72 ms) is about an order of magnitude less than the total communication time (121 ms). At this scale of cores, the communication delay time is a secondary factor.

The communication amount of a single process is

L_ave = L / (P_x P_y P_z)
      = [(P_x − 1) N_y N_z + (P_y − 1) N_z N_x + (P_z − 1) N_x N_y] / (P_x P_y P_z).    (4)

Divided by the constant N_x · N_y · N_z, (4) becomes

L_ave′ = [(P_x − 1) N_y N_z + (P_y − 1) N_z N_x + (P_z − 1) N_x N_y] / (P_x P_y P_z · N_x N_y N_z)
       = [1 / (P_x P_y P_z)] · [(P_x − 1)/N_x + (P_y − 1)/N_y + (P_z − 1)/N_z].    (5)

From (5) it is known that if and only if (P_x − 1)/N_x = (P_y − 1)/N_y = (P_z − 1)/N_z, namely, when the topology is conformal to the calculation region, the communication amount of a single process is the least. Generally the equation above cannot be satisfied exactly, so the topology should be divided along the directions so as to be as conformal to the calculation region as possible, making (5) the smallest.
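In practice the most conformal decomposition can be found by brute force over all factorizations of the process count; a small sketch (helper names are ours):

```python
# Enumerate all topologies with Px * Py * Pz == P and keep the one that
# minimizes L of eq. (3); by (5) this is the most "conformal" choice.

def topologies(P):
    """All ordered triples (Px, Py, Pz) whose product is P."""
    for px in (d for d in range(1, P + 1) if P % d == 0):
        rest = P // px
        for py in (d for d in range(1, rest + 1) if rest % d == 0):
            yield px, py, rest // py

def comm_grids(Px, Py, Pz, Nx, Ny, Nz):
    return (Px - 1) * Ny * Nz + (Py - 1) * Nz * Nx + (Pz - 1) * Nx * Ny

def best_topology(P, Nx, Ny, Nz):
    return min(topologies(P), key=lambda t: comm_grids(*t, Nx, Ny, Nz))

print(best_topology(96, 1200, 1200, 300))   # (6, 8, 2)
print(best_topology(120, 1200, 1200, 300))  # (6, 10, 2)
```

For the 1200 × 1200 × 300 test grid of Section 4, the minimizers found this way, 6 × 8 × 2 at 96 cores and 6 × 10 × 2 at 120 cores, coincide with the fastest entries measured in Tables 3 and 2, respectively.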

Generally speaking, the communication time between processes in one node is less than that between processes belonging to different nodes [12, 13]; that is, the one-byte communication time factor β differs between processes in one node and across nodes. So when the factors C and L are the same for two different topologies, the amount of communication across nodes needs to be considered.

For a certain grid, the total memory requirement (called M) is the same for different topologies. The memory distribution of each process (called m) is

m = M / (P_x P_y P_z).    (6)

Equation (6) indicates that the memory distribution of each process is unrelated to the virtual topology.

From the analysis above it is known that the communication surface area varies with the virtual topology scheme for a given grid. The communication time changes with the virtual topology scheme, while the memory distribution of each process remains the same. Thus the communication amount is the main factor affecting parallel performance.

4. Discussions on Parallel Performance

4.1. Simulation Model. Based on the theory above, a four-element microstrip antenna array is used as the model for benchmarking. The parallel FDTD code is run to analyze the


Figure 1: 2 × 2 microstrip array.

virtual topology schemes on two supercomputer center platforms, the National Supercomputer Center in Tianjin (NSCC-TJ) and the National Supercomputing Center in Shenzhen (NSCC-SZ), as listed in Table 1.

The array model is shown in Figure 1. The parameters of this array are as follows. The central frequency is 4.97 GHz, x_p = 14 mm, y_p = 9.6 mm, x_d = 15 mm, y_d = 15 mm, ε_r = 4.34, h = 0.8 mm, x_g = 60 mm, y_g = 60 mm, and x_f = 3.6 mm. The size of the grid is dx = dy = dz = 0.4 mm.

Actually the amount of total grids is just 200 × 200 × 50. However, to test the influence of different virtual topology schemes on the parallel performance of parallel FDTD, the computational space needs to be extended. So in this test the amount of total grids is set as 1200 × 1200 × 300.

The radiation patterns of the microstrip array are shown in Figure 2, compared with the results obtained from HFSS. The figure shows that there is good agreement between them.

4.2. Discussion of Parallel Performance. Here we select several groups of virtual topology schemes to be tested. The following are the test results on the two supercomputer center platforms.

4.2.1. NSCC-TJ. Table 2 compares the total computation time (in seconds) for different numbers of CPU cores and different virtual topology schemes. The maximum number of CPU cores used for the test is 120.

In Table 2, virtual topology schemes are described as (x × y × z) for all three communication patterns. If the value is 1 in some direction, there is no decomposition in that direction. For example, 2 × 1 × 1 means that there is no decomposition in the y and z directions; thus the virtual topology is actually one-dimensional. Similarly, 8 × 8 × 1 means that there is no decomposition in the z direction; thus the virtual

Table 2: Comparisons of virtual topology, amount of communication, and computation time on NSCC-TJ.

CPU cores | Virtual topology (x × y × z) | Amount of communication | Computation time (s)
12        | 3 × 2 × 2                    | 2520000                 | 646.668
16        | 4 × 2 × 2                    | 2880000                 | 614.056
32        | 4 × 4 × 2                    | 3600000                 | 285.897
64        | 8 × 8 × 1                    | 5040000                 | 137.685
64        | 8 × 4 × 2                    | 5040000                 | 137.992
64        | 16 × 2 × 2                   | 7200000                 | 182.437
96        | 8 × 6 × 2                    | 5760000                 | 94.652
96        | 8 × 4 × 3                    | 6480000                 | 150.054
96        | 12 × 4 × 2                   | 6480000                 | 155.195
96        | 16 × 3 × 2                   | 7560000                 | 167.942
120       | 6 × 10 × 2                   | 6480000                 | 70.216
120       | 10 × 6 × 2                   | 6480000                 | 80.865
120       | 5 × 12 × 2                   | 6840000                 | 71.367
120       | 12 × 5 × 2                   | 6840000                 | 86.365
120       | 5 × 6 × 4                    | 7560000                 | 72.490
120       | 6 × 5 × 4                    | 7560000                 | 107.641
120       | 15 × 4 × 2                   | 7560000                 | 109.652

topology is actually two-dimensional. In our work, one process uses one CPU core.

The speedup and parallel efficiency of the code are shown in Figure 3. From Figure 3 it can be seen that the parallel efficiency reaches up to 80% on NSCC-TJ.

From Table 2 it is obvious that increasing the number of CPU cores rapidly reduces the computation time. But different virtual topology schemes cost different computation time, even when the code is run with the same number of processes. Next, the parallel performance of the parallel FDTD is discussed.

Here the cases of 96 and 120 cores are taken as examples. From (3),

L = (P_x − 1) N_y N_z + (P_y − 1) N_z N_x + (P_z − 1) N_x N_y.    (7)

The following is known.

(a) 96 Cores. Consider

8 × 6 × 2: (8 − 1) × (1200 × 300) + (6 − 1) × (1200 × 300) + (2 − 1) × (1200 × 1200) = 5760000 (94.652 s) (0.5 GB/process);

8 × 4 × 3: (8 − 1) × (1200 × 300) + (4 − 1) × (1200 × 300) + (3 − 1) × (1200 × 1200) = 6480000 (150.054 s) (0.5 GB/process).


Figure 2: The radiation patterns of the 2 × 2 microstrip antenna array, compared with HFSS: (a) xoz plane; (b) yoz plane.

Figure 3: The speedup and parallel efficiency of the code from 12 CPU cores to 120 CPU cores on NSCC-TJ: (a) speedup; (b) parallel efficiency.

(b) 120 Cores. Consider

5 × 6 × 4: (5 − 1) × (1200 × 300) + (6 − 1) × (1200 × 300) + (4 − 1) × (1200 × 1200) = 7560000 (72.490 s) (0.4 GB/process);

6 × 5 × 4: (6 − 1) × (1200 × 300) + (5 − 1) × (1200 × 300) + (4 − 1) × (1200 × 1200) = 7560000 (107.641 s) (0.4 GB/process).

From the above it is known that, for virtual topologies with the same dimensionality, fewer total grids at interfaces (a smaller amount of communication data) save calculation time effectively. Meanwhile, it is obvious that the memory distribution of each process is the same for different topologies with the same number of CPU cores.
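The four interface counts above follow mechanically from (3); a short numeric check, assuming the 1200 × 1200 × 300 test grid:

```python
# Reproduce the interface-grid counts of the 96- and 120-core examples.

def comm_grids(Px, Py, Pz, Nx=1200, Ny=1200, Nz=300):
    """Total FDTD grids on subdomain interfaces, eq. (3)."""
    return (Px - 1) * Ny * Nz + (Py - 1) * Nz * Nx + (Pz - 1) * Nx * Ny

for topo in [(8, 6, 2), (8, 4, 3), (5, 6, 4), (6, 5, 4)]:
    print(topo, comm_grids(*topo))
# (8, 6, 2) 5760000
# (8, 4, 3) 6480000
# (5, 6, 4) 7560000
# (6, 5, 4) 7560000
```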

But from the above it also can be seen that, for the same amount of communication grids, the calculation time shows certain differences.


Figure 4: The diagram of the communication mode across nodes for the virtual topologies (a) 5 × 6 × 4 and (b) 6 × 5 × 4.

For example, the topologies 6 × 5 × 4 and 5 × 6 × 4 have the same number of communication grids, but the calculation times are 107.641 seconds and 72.490 seconds, respectively. Generally we believe that less time is consumed between processes in one node and more time is consumed between processes across nodes [12, 13]. So here we speculate that the different amounts of communication grids across nodes cause the difference between the two cases.

For 5 × 6 × 4, 4 × (1200 × 300) + 1 × (1200 × 300) = 1800000 FDTD grids need to communicate across nodes, and its calculation time is 72.490 seconds; while for 6 × 5 × 4, 5 × (1200 × 300) + (1 + 2/6) × (1200 × 300) = 2280000 FDTD grids need to communicate across nodes, and its calculation time is 107.641 seconds. The way these communication data are calculated is shown in Figure 4.

In Figure 4, every two adjacent numbers with different colors have data communication. The numbers 1 to 10 represent ten nodes, and adjacent numbers with the same color are in the same node. For (b), the adjacent columns transfer ((1 + 2/6) × (1200 × 300)) data in the xoz plane, and the adjacent rows transfer (5 × (1200 × 300)) data in the yoz plane.
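The crossing-node counts above can be reproduced by walking the process grid explicitly. The sketch below assumes what Figure 4 implies: ranks are laid out in row-major Cartesian order, rank = (x · P_y + y) · P_z + z, and every 12 consecutive ranks share one NSCC-TJ node; the helper name is ours.

```python
def cross_node_grids(Px, Py, Pz, Nx, Ny, Nz, cores_per_node=12):
    """FDTD grids exchanged between processes on different nodes."""
    node = lambda x, y, z: ((x * Py + y) * Pz + z) // cores_per_node
    fx = (Ny // Py) * (Nz // Pz)   # grids on one x-face of a subdomain
    fy = (Nz // Pz) * (Nx // Px)   # one y-face
    fz = (Nx // Px) * (Ny // Py)   # one z-face
    total = 0
    for x in range(Px):
        for y in range(Py):
            for z in range(Pz):
                if x + 1 < Px and node(x, y, z) != node(x + 1, y, z):
                    total += fx
                if y + 1 < Py and node(x, y, z) != node(x, y + 1, z):
                    total += fy
                if z + 1 < Pz and node(x, y, z) != node(x, y, z + 1):
                    total += fz
    return total

print(cross_node_grids(5, 6, 4, 1200, 1200, 300))  # 1800000
print(cross_node_grids(6, 5, 4, 1200, 1200, 300))  # 2280000
```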

4.2.2. NSCC-SZ. Table 3 compares the total computation time (in seconds) for different numbers of CPU cores and different virtual topology schemes. The maximum number of CPU cores used for the test is 480.

In Table 3, virtual topology schemes are described as (x × y × z) for all three communication patterns, and the meaning of each figure in each topology is the same as described in Section 4.2.1 above.

The speedup and parallel efficiency of the code are shown in Figure 5. From Figure 5 it can be seen that the parallel efficiency reaches up to 80% on NSCC-SZ.

From Table 3 it can be seen that, for virtual topologies with the same dimensionality, fewer total grids at interfaces (a smaller amount of communication data) save calculation

Table 3: Comparisons of virtual topology, amount of communication, and computation time on NSCC-SZ.

CPU cores | Virtual topology (x × y × z) | Amount of communication | Computation time (s)
48        | 4 × 6 × 2                    | 4320000                 | 127.102
48        | 6 × 4 × 2                    | 4320000                 | 125.612
60        | 5 × 6 × 2                    | 4680000                 | 105.502
60        | 3 × 10 × 2                   | 5400000                 | 104.447
60        | 3 × 5 × 4                    | 6480000                 | 121.409
60        | 10 × 2 × 3                   | 6480000                 | 111.717
96        | 6 × 8 × 2                    | 5760000                 | 59.699
96        | 8 × 4 × 3                    | 6480000                 | 63.709
96        | 6 × 4 × 4                    | 7200000                 | 60.482
96        | 4 × 6 × 4                    | 7200000                 | 64.113
120       | 6 × 10 × 2                   | 6480000                 | 49.296
120       | 10 × 6 × 2                   | 6480000                 | 75.248
120       | 8 × 5 × 3                    | 6840000                 | 51.025
120       | 5 × 8 × 3                    | 6840000                 | 56.613
240       | 10 × 12 × 2                  | 8640000                 | 28.612
240       | 8 × 10 × 3                   | 8640000                 | 29.648
240       | 10 × 8 × 3                   | 8640000                 | 32.807
240       | 12 × 10 × 2                  | 8640000                 | 41.73
360       | 12 × 10 × 3                  | 10080000                | 19.746
360       | 10 × 12 × 3                  | 10080000                | 27.873
360       | 8 × 15 × 3                   | 10440000                | 18.146
360       | 12 × 15 × 2                  | 10440000                | 29.734
480       | 10 × 12 × 4                  | 11520000                | 14.748
480       | 12 × 10 × 4                  | 11520000                | 15.653
480       | 15 × 8 × 4                   | 11880000                | 15.757
480       | 15 × 16 × 2                  | 11880000                | 16.254
480       | 12 × 8 × 5                   | 12240000                | 16.185
480       | 8 × 12 × 5                   | 12240000                | 16.53


Figure 5: The speedup and parallel efficiency of the code from 48 CPU cores to 480 CPU cores on NSCC-SZ: (a) speedup; (b) parallel efficiency.

time effectively. This conclusion coincides with the case on NSCC-TJ.

The amount of communication grids of each virtual topology is calculated by (3). For certain topologies with the same number of communication grids, however, it is found that the calculation time is less even with more crossing-node communication, when analyzed in the way used for NSCC-TJ. This is contrary to the speculation above. Therefore we ask whether it is caused by the process carrying the heaviest communication load in the topology. Next, the cases of 10 × 2 × 3 and 3 × 5 × 4 with 60 cores are taken as examples to analyze this speculation (the amount of communication is 6480000 for both).

For 10 × 2 × 3 the crossing-node communication is 4 × (1200 × 300) + 1 × (1200 × 300) = 1800000, while for 3 × 5 × 4 it is 2 × (1200 × 300) + (16/12) × (1200 × 300) = 1200000. But the computation time for 10 × 2 × 3 is 111.717 seconds and that for 3 × 5 × 4 is 121.409 seconds. This does not agree with the crossing-node speculation made for NSCC-TJ. To explore the reason further, the heaviest and the lightest communication loads of the related processes in these two virtual topology schemes are listed in Table 4. The heaviest communication load in the virtual topology scheme 10 × 2 × 3 is (1200 × 300)/30 + (1200 × 300) × 2/6 + (1200 × 1200) × 2/20 = 276000, while in the virtual topology scheme 3 × 5 × 4 it is (1200 × 300) × 2/12 + (1200 × 300) × 2/20 + (1200 × 1200) × 2/15 = 288000. The heaviest communication loads belong to the processes located at the center of the process grid. Similarly, the lightest communication loads, which belong to the processes at the corners of the process grid, can be calculated. The difference between the heaviest and the lightest communication load in the virtual topology scheme 3 × 5 × 4 is larger than that in 10 × 2 × 3, which results in different computation times. This indicates that, when the total

Figure 6: The model of the microstrip antenna array.

Table 4: Comparisons of communication load.

Topology   | The heaviest | The lightest | Difference
10 × 2 × 3 | 276000       | 144000       | 132000
3 × 5 × 4  | 288000       | 144000       | 144000

amount of communication is the same, a virtual topology scheme with a more balanced communication load may bring better performance, even with more crossing-node communication.
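The Table 4 values can be checked by computing each process's load from its neighbor count (interior processes have two neighbors per decomposed direction, corner processes one); the helper names are ours:

```python
def load_extremes(Px, Py, Pz, Nx=1200, Ny=1200, Nz=300):
    """(heaviest, lightest) per-process communication load, in grids."""
    fx = (Ny // Py) * (Nz // Pz)   # grids on one x-face of a subdomain
    fy = (Nz // Pz) * (Nx // Px)   # one y-face
    fz = (Nx // Px) * (Ny // Py)   # one z-face
    nbrs = lambda i, P: (i > 0) + (i < P - 1)   # neighbors along one axis
    loads = [nbrs(x, Px) * fx + nbrs(y, Py) * fy + nbrs(z, Pz) * fz
             for x in range(Px) for y in range(Py) for z in range(Pz)]
    return max(loads), min(loads)

print(load_extremes(10, 2, 3))  # (276000, 144000): difference 132000
print(load_extremes(3, 5, 4))   # (288000, 144000): difference 144000
```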

4.3. The General Rules of Optimal Virtual Topology. Generally, when the amount of FDTD cells is the same, the MPI virtual topology that transfers less data saves computation time, among virtual topologies of the same dimensionality. The best performance of a parallel FDTD code can be obtained by optimizing the MPI virtual topology scheme. The general rules for better performance are as follows.

(a) Select the MPI virtual topology scheme that makes the total communication L (equation (3)) the smallest.


Figure 7: The radiation patterns of the array, compared with MoM: (a) xoz plane; (b) yoz plane.

Figure 8: The model of the microstrip array.

(b) When the total communication L is the same, choose the topology with less crossing-node communication.

(c) When the amount of crossing-node communication of different topologies is approximately the same, select the topology with a more balanced communication load.
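Rules (a)-(c) can be applied in order as a lexicographic minimization over all factorizations of the core count. The sketch below is our own composition of the three rules (node placement is assumed row-major with `cores_per_node` consecutive ranks per node, as in Section 4.2.1):

```python
def topologies(P):
    """All ordered triples (Px, Py, Pz) whose product is P."""
    for px in (d for d in range(1, P + 1) if P % d == 0):
        rest = P // px
        for py in (d for d in range(1, rest + 1) if rest % d == 0):
            yield px, py, rest // py

def metrics(Px, Py, Pz, Nx, Ny, Nz, cores_per_node):
    fx = (Ny // Py) * (Nz // Pz)          # face areas of one subdomain
    fy = (Nz // Pz) * (Nx // Px)
    fz = (Nx // Px) * (Ny // Py)
    # Rule (a): total interface grids L, eq. (3).
    total = (Px - 1) * Ny * Nz + (Py - 1) * Nz * Nx + (Pz - 1) * Nx * Ny
    # Rule (b): grids exchanged across node boundaries.
    node = lambda x, y, z: ((x * Py + y) * Pz + z) // cores_per_node
    cross = sum(
        fx * (x + 1 < Px and node(x, y, z) != node(x + 1, y, z))
        + fy * (y + 1 < Py and node(x, y, z) != node(x, y + 1, z))
        + fz * (z + 1 < Pz and node(x, y, z) != node(x, y, z + 1))
        for x in range(Px) for y in range(Py) for z in range(Pz))
    # Rule (c): load imbalance (heaviest minus lightest process).
    nbrs = lambda i, P: (i > 0) + (i < P - 1)
    loads = [nbrs(x, Px) * fx + nbrs(y, Py) * fy + nbrs(z, Pz) * fz
             for x in range(Px) for y in range(Py) for z in range(Pz)]
    return total, cross, max(loads) - min(loads)

def choose_topology(P, Nx, Ny, Nz, cores_per_node=12):
    return min(topologies(P),
               key=lambda t: metrics(*t, Nx, Ny, Nz, cores_per_node))

print(choose_topology(96, 1200, 1200, 300))  # (6, 8, 2)
```

For the 1200 × 1200 × 300 test grid at 96 cores this picks 6 × 8 × 2, the fastest 96-core scheme measured in Table 3.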

5. Applications Using Ten Thousand CPU Cores

Based on the optimal virtual topology scheme above, the parallel FDTD code is applied to analyze some complicated EM problems on the Shanghai Supercomputer Center (SSC) platform.

5.1. Radiation of a Microstrip Antenna Array

5.1.1. Validation of the FDTD Code. To validate the FDTD code, a microstrip antenna array with hundreds of the same

Figure 9: The parallel efficiency of 4096-10240 cores.

antenna units is analyzed, and the results are compared with the ones provided by MoM. The model is shown in Figure 6. The amount of total grids is 786 × 1224 × 54. It is calculated on a PC with the virtual topology scheme 4 × 3 × 2.

The radiation patterns of the array are shown in Figure 7, compared with the ones provided by MoM. From Figure 7 one can see very good agreement between them.


Figure 10: The radiation patterns of the ellipse microstrip antenna array: (a) xoz plane; (b) yoz plane.

Figure 11: The model of the airplane.

5.1.2. Radiation of an Ellipse Antenna Array. An ellipse microstrip antenna array with nearly 2000 elements is shown in Figure 8. In this array the amount of total grids is 6016 × 1160 × 54 (about 0.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores, and the parallel efficiency from 4096 to 10240 cores is measured.

The parallel efficiency of the code is shown in Figure 9, and the radiation patterns of the array are shown in Figure 10.

From Figure 9 it is known that the parallel efficiency reaches nearly 80% at 10240 cores, with 4096 cores as the benchmark, which is one of the best efficiencies ever reached using more than ten thousand CPU cores.

5.2. RCS of an Electrically Large Airplane

5.2.1. Simulation Model. The airplane is shown in Figure 11. The parameters are as follows. The incident wave frequency is 4.0 GHz. The direction of the incident wave is +y. The polarization is E_z. The size of the airplane is 52.4256 m × 21.0312 m × 2.944213 m; the electric size is about 700λ × 280λ × 39λ. The size of the grid is dx = dy = dz = 0.0075 m. The amount of total grids is 7020 × 2840 × 420 (about 8.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores.

5.2.2. Result. The computation time of this code is 603.686 seconds. The RCS of the airplane is shown in Figure 12.

6. Conclusion

A guideline is presented for using parallel FDTD on supercomputers with more than ten thousand CPU cores, based on the theoretical communication model given in this paper. The benchmarks obtained on two supercomputers validated the optimal virtual topology rules. Radiation from a large microstrip antenna array and scattering from an electrically large airplane are simulated successfully, which indicates the capability of the presented method for those types of real-life EM problems.


Figure 12: The RCS of the airplane: (a) xoy plane; (b) xoz plane; (c) yoz plane.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National High TechnologyResearch andDevelopment Program of China (863 Program)(2012AA01A308) the NSFC (61301069 61072019) the Projectwith Contract no 2013KJXX-67 and the Program for New

Century Excellent Talents in University of China (NCET-13-0949) The computational resources utilized in this researchare provided by the National Supercomputer Center inTianjin (NSCC-TJ) National Supercomputing Center inShenzhen (NSCC-SZ) and Shanghai Supercomputer Center(SSC)

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 2000.


[2] D. Ge and Y. Yan, Finite-Difference Time-Domain Method for Electromagnetic Waves, Version 3, Xidian University Press, Xi'an, China, 2011 (Chinese).

[3] J. L. Volakis, D. B. Davidson, C. Guiffaut, and K. Mahdjoubi, "A parallel FDTD algorithm using the MPI library," IEEE Antennas and Propagation Magazine, vol. 43, no. 2, pp. 94-103, 2001.

[4] U. Andersson, Time-Domain Methods for the Maxwell Equations [Ph.D. thesis], Royal Institute of Technology, Stockholm, Sweden, 2001.

[5] Y. Zhang, J. Song, and C. H. Liang, "MPI-based parallelized locally conformal FDTD for modeling slot antennas and new periodic structures in microstrip," Journal of Electromagnetic Waves and Applications, vol. 18, no. 10, pp. 1321-1335, 2004.

[6] Y. Zhang, J. Song, and C. Liang, "Study on the parallel modified locally conformal FDTD algorithm on cluster of PCs for PBG structures," Acta Electronica Sinica, vol. 31, no. 12A, pp. 2142-2144, 2003.

[7] Z. Yu, D. Wei, and C. Liang, "Analysis of parallel performance of MPI based parallel FDTD on PC clusters," in Proceedings of the Asia-Pacific Microwave Conference (APMC '05), vol. 4, December 2005.

[8] W. Yu, X. Yang, Y. Liu et al., "A new direction in computational electromagnetics: solving large problems using the parallel FDTD on the BlueGene/L supercomputer providing teraflop-level performance," IEEE Antennas and Propagation Magazine, vol. 50, no. 2, pp. 26-44, 2008.

[9] http://www.nscc-tj.gov.cn/

[10] http://www.nsccsz.gov.cn/

[11] http://www.ssc.net.cn/

[12] W. Chen and J. Zhai, Preliminary Analysis on Communication Performance of Dawning 5000A, 863 High Performance Computer Testing Center of Tsinghua University, 2008.

[13] Intel Corporation, Intel MPI Library for Linux OS Reference Manual, Intel Corporation, 2011, https://software.intel.com/sites/products/documentation/hpc/mpi/linux/reference_manual.pdf.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 2: Research Article An Optimized Parallel FDTD Topology for ...downloads.hindawi.com/journals/ijap/2015/690510.pdf · Although there are many publications on parallel FDTD, few of them

2 International Journal of Antennas and Propagation

Table 1: Parameters of computation resources.

Platform   CPU                           Memory/node   Clock speed   Cores/node   Total cores used in this paper
SSC        AMD 8347HE 64-bit four-core   64G / 32G     1.9 GHz       16           8000 / 4800
NSCC-TJ    Intel Xeon 5670 six-core      24G           2.93 GHz      12           120
NSCC-SZ    Intel Xeon 5650 six-core      24G           2.56 GHz      12           512

Shenzhen (NSCC-SZ) [10], and Shanghai Supercomputer Center (SSC) [11]. The parameters of the computation resources used in this paper are listed in Table 1.

3. Communication Model for Parallel FDTD

Communication is the main factor affecting the performance of parallel codes. Therefore, reducing the amount of communication in FDTD by adjusting the virtual topology is chosen as the optimization target.

Assume that the communication time in one time step is

    T = αC + 2βL,    (1)

where α is the communication latency, C is the number of communications, β is the transmission time per unit of data (the reciprocal of the transmission speed), and L is the amount of communication data of E or H. Each parameter is calculated as follows:

    C = 6 P_x P_y P_z − 2 (P_x P_y + P_y P_z + P_z P_x),    (2)

    L = (P_x − 1) N_y N_z + (P_y − 1) N_z N_x + (P_z − 1) N_x N_y,    (3)

where P_x, P_y, and P_z are the topology values in the three directions, and N_x, N_y, and N_z are the numbers of grids in the x, y, and z directions.

From (1) it is known that, even when the total amount of communication data is the same, different topology schemes may produce different communication numbers C, and therefore different total times T.

Take Dawning 5000A as an example, with the parameters α = 1.8 µs to 2.5 µs and β = 1/(1.6563 Gb/s) [12]. Assume that the total grids are 1000 × 1000 × 1000 and the total number of cores is 1000; then the total communication latency (9.72 ms) is about an order of magnitude less than the total communication time (121 ms). At this scale of cores, the communication latency is therefore a secondary factor.
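As a numerical sketch of the model (the helper names are ours, not from the paper; the grid, core count, and the α = 1.8 µs value are taken from the Dawning 5000A example above), C from (2) and L from (3) can be evaluated directly:

```python
def comm_count(px, py, pz):
    # C from (2): number of point-to-point exchanges per time step
    return 6 * px * py * pz - 2 * (px * py + py * pz + pz * px)

def comm_amount(px, py, pz, nx, ny, nz):
    # L from (3): total number of grid cells lying on process interfaces
    return (px - 1) * ny * nz + (py - 1) * nz * nx + (pz - 1) * nx * ny

# 1000 x 1000 x 1000 grid on 1000 cores arranged as 10 x 10 x 10
alpha = 1.8e-6                      # per-exchange latency, seconds
C = comm_count(10, 10, 10)
delay_ms = alpha * C * 1e3          # total latency per time step, ms
L = comm_amount(10, 10, 10, 1000, 1000, 1000)
print(C, delay_ms, L)               # 5400 exchanges, about 9.72 ms, 27000000 cells
```

With α = 1.8 µs this reproduces the 9.72 ms latency figure quoted above.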

The average communication amount of a single process is

    L_ave = L / (P_x P_y P_z)
          = [(P_x − 1) N_y N_z + (P_y − 1) N_z N_x + (P_z − 1) N_x N_y] / (P_x P_y P_z).    (4)

Divided by the constant N_x N_y N_z, (4) becomes

    L′_ave = [(P_x − 1) N_y N_z + (P_y − 1) N_z N_x + (P_z − 1) N_x N_y] / (P_x P_y P_z · N_x N_y N_z)
           = [1 / (P_x P_y P_z)] · [(P_x − 1)/N_x + (P_y − 1)/N_y + (P_z − 1)/N_z].    (5)

From (5) it is known that if and only if (P_x − 1)/N_x = (P_y − 1)/N_y = (P_z − 1)/N_z, namely, when the topology is conformal to the calculation region, the communication amount of a single process is the least. Generally this equation cannot be satisfied exactly, so the topology should be divided so as to be as conformal to the calculation region as possible, making (5) the least.
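To make the rule concrete, one can enumerate every factorization P_x × P_y × P_z of the core count and rank the candidates by the total interface amount L of (3). The sketch below (hypothetical helper names, not code from the paper) does this for 96 cores on the 1200 × 1200 × 300 benchmark grid used in Section 4:

```python
def factor_triples(p):
    # all ordered factorizations p = px * py * pz
    for px in range(1, p + 1):
        if p % px:
            continue
        for py in range(1, p // px + 1):
            if (p // px) % py:
                continue
            yield px, py, p // (px * py)

def total_L(topo, grid):
    (px, py, pz), (nx, ny, nz) = topo, grid
    # L from (3): grid cells lying on all process interfaces
    return (px - 1) * ny * nz + (py - 1) * nz * nx + (pz - 1) * nx * ny

grid = (1200, 1200, 300)
best = min(factor_triples(96), key=lambda t: total_L(t, grid))
print(best, total_L(best, grid))    # a 6 x 8 x 2 split, L = 5760000
```

The minimizer agrees with the measurements reported later: the 96-core schemes with L = 5760000 are the fastest ones in Tables 2 and 3.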

Generally speaking, the communication time between processes within one node is less than that between processes on different nodes [12, 13]; that is, the per-byte communication time factor β differs between intra-node and inter-node communication. So when the factors C and L are the same for two different topologies, the amount of communication across nodes needs to be considered.

For a certain grid, the total memory requirement (called M) is the same for different topologies. The memory footprint of each process (called m) is

    m = M / (P_x P_y P_z).    (6)

Equation (6) indicates that the memory footprint of each process is unrelated to the virtual topology.

From the analysis above, it is known that for a certain grid the communication surface area varies with the virtual topology scheme. The communication time changes with the virtual topology scheme, while the memory footprint of each process remains the same. Thus the communication amount is the main factor affecting the parallel performance.

4. Discussions on Parallel Performance

4.1. Simulation Model. Based on the theory above, a four-element microstrip antenna array is used as the model for benchmarking. The parallel FDTD code is run to analyze the


Figure 1: 2 × 2 microstrip array, showing the substrate (relative permittivity ε_r, thickness h), the patch size x_p × y_p, the feed offset x_f, the element spacings x_d and y_d, the ground plane x_g × y_g, and the topology distributions in the x, y, and z directions.

virtual topology schemes on two supercomputer center platforms, the National Supercomputer Center in Tianjin (NSCC-TJ) and the National Supercomputing Center in Shenzhen (NSCC-SZ), as listed in Table 1.

The array model is shown in Figure 1. The parameters of this array are as follows: the central frequency is 4.97 GHz, x_p = 14 mm, y_p = 9.6 mm, x_d = 15 mm, y_d = 15 mm, ε_r = 4.34, h = 0.8 mm, x_g = 60 mm, y_g = 60 mm, and x_f = 3.6 mm. The grid size is dx = dy = dz = 0.4 mm.

Actually, the amount of total grids is just 200 × 200 × 50. However, to test the influence of different virtual topology schemes on the parallel performance of parallel FDTD, the computational space needs to be extended. So in this test the amount of total grids is set as 1200 × 1200 × 300.

The radiation patterns of the microstrip array are shown in Figure 2, compared with the results obtained from HFSS. The figure shows that the two sets of results agree well.

4.2. Discussion of Parallel Performance. Here we select several groups of virtual topology schemes to be tested. The following are the test results on the two supercomputer center platforms.

4.2.1. NSCC-TJ. Table 2 compares the total computation time (in seconds) for different numbers of CPU cores and different virtual topology schemes. The maximum number of CPU cores used in this test is 120.

In Table 2, virtual topology schemes are described as (x × y × z) for all three communication patterns. If the value is 1 in some direction, there is no topology in that direction. For example, 2 × 1 × 1 means that there is no topology in the y and z directions, so the virtual topology is actually one-dimensional. Similarly, 8 × 8 × 1 means that there is no topology in the z direction, so the virtual

Table 2: Comparisons of virtual topology, amount of communication, and computation time on NSCC-TJ.

CPU cores   Virtual topology (x × y × z)   Amount of communication   Computation time (s)
12          3 × 2 × 2                      2520000                   6466.68
16          4 × 2 × 2                      2880000                   6140.56
32          4 × 4 × 2                      3600000                   2858.97
64          8 × 8 × 1                      5040000                   1376.85
64          8 × 4 × 2                      5040000                   1379.92
64          16 × 2 × 2                     7200000                   1824.37
96          8 × 6 × 2                      5760000                   946.52
96          8 × 4 × 3                      6480000                   1500.54
96          12 × 4 × 2                     6480000                   1551.95
96          16 × 3 × 2                     7560000                   1679.42
120         6 × 10 × 2                     6480000                   702.16
120         10 × 6 × 2                     6480000                   808.65
120         5 × 12 × 2                     6840000                   713.67
120         12 × 5 × 2                     6840000                   863.65
120         5 × 6 × 4                      7560000                   724.90
120         6 × 5 × 4                      7560000                   1076.41
120         15 × 4 × 2                     7560000                   1096.52

topology is actually two-dimensional. In our work one process uses one CPU core.

The speedup and parallel efficiency of the code are shown in Figure 3. From Figure 3 it can be seen that the parallel efficiency reaches up to 80% on NSCC-TJ.

From Table 2 it is obvious that increasing the number of CPU cores rapidly reduces the computation time. But different virtual topology schemes cost different computation times, even when the code is run with the same number of processes. Next, the parallel performance of the parallel FDTD is discussed.

Here the cases of 96 and 120 cores are taken as examples. From (3),

    L = (P_x − 1) N_y N_z + (P_y − 1) N_z N_x + (P_z − 1) N_x N_y.    (7)

The following is known.

(a) 96 Cores. Consider

8 × 6 × 2: (8 − 1) × (1200 × 300) + (6 − 1) × (1200 × 300) + (2 − 1) × (1200 × 1200) = 5760000 (946.52 s) (0.5 GB/process);

8 × 4 × 3: (8 − 1) × (1200 × 300) + (4 − 1) × (1200 × 300) + (3 − 1) × (1200 × 1200) = 6480000 (1500.54 s) (0.5 GB/process).


Figure 2: The radiation patterns of the 2 × 2 microstrip antenna array, computed by HFSS and FDTD: (a) xoz plane; (b) yoz plane.

Figure 3: The speedup and parallel efficiency of the code from 12 CPU cores to 120 CPU cores on NSCC-TJ: (a) speedup; (b) parallel efficiency.

(b) 120 Cores. Consider

5 × 6 × 4: (5 − 1) × (1200 × 300) + (6 − 1) × (1200 × 300) + (4 − 1) × (1200 × 1200) = 7560000 (724.90 s) (0.4 GB/process);

6 × 5 × 4: (6 − 1) × (1200 × 300) + (5 − 1) × (1200 × 300) + (4 − 1) × (1200 × 1200) = 7560000 (1076.41 s) (0.4 GB/process).

From the above it is known that, for virtual topologies with the same dimensions, fewer total grids at the interfaces (i.e., a smaller amount of communication data) effectively save calculation time. Meanwhile, it is obvious that the memory footprint of each process is the same for different topologies with the same number of CPU cores.
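The interface totals in Table 2 can be reproduced directly from (3). The short check below (a sketch of ours, not part of the original code) recomputes L for the four 96-core schemes:

```python
NX, NY, NZ = 1200, 1200, 300   # benchmark grid

def total_L(px, py, pz):
    # L from (3) for the benchmark grid
    return (px - 1) * NY * NZ + (py - 1) * NZ * NX + (pz - 1) * NX * NY

for topo in [(8, 6, 2), (8, 4, 3), (12, 4, 2), (16, 3, 2)]:
    print(topo, total_L(*topo))
# 5760000, 6480000, 6480000, 7560000, matching Table 2
```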

But it can also be seen that, for the same amount of communicated grids, the calculation times still have certain differences.


Figure 4: The diagram of the communication mode across nodes for the virtual topologies (a) 5 × 6 × 4 and (b) 6 × 5 × 4.

For example, the topologies 6 × 5 × 4 and 5 × 6 × 4 have the same number of communicated grids, but their calculation times are 1076.41 seconds and 724.90 seconds, respectively. Generally, the time consumed between processes within one node is less than that consumed between processes on different nodes [12, 13]. So here we speculate that the different amounts of communicated grids across nodes cause the difference between the two cases.

For 5 × 6 × 4, 4 × (1200 × 300) + 1 × (1200 × 300) = 1800000 FDTD grids need to be communicated across nodes, and its calculation time is 724.90 seconds, while for 6 × 5 × 4, 5 × (1200 × 300) + (1 + 2/6) × (1200 × 300) = 2280000 FDTD grids need to be communicated across nodes, and its calculation time is 1076.41 seconds. The way the communication data above are calculated is shown in Figure 4.

In Figure 4, every two adjacent blocks with different colors have data communication. The numbers 1 to 10 denote ten nodes, and adjacent blocks with the same number are in the same node. For (b), the adjacent columns transfer (1 + 2/6) × (1200 × 300) data in the xoz plane, and the adjacent rows transfer 5 × (1200 × 300) data in the yoz plane.
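The crossing-node counts above can also be reproduced programmatically. The sketch below is ours, not the paper's code; it assumes a uniform block decomposition, 12 processes per node, and the default MPI Cartesian rank ordering with the last dimension varying fastest, which matches the layout of Figure 4. It sums the interface cells whose two owner processes fall on different nodes:

```python
NX, NY, NZ = 1200, 1200, 300

def crossing_node(px, py, pz, cores_per_node=12):
    # per-process interface sizes (uniform block decomposition assumed)
    sx = (NY // py) * (NZ // pz)
    sy = (NZ // pz) * (NX // px)
    sz = (NX // px) * (NY // py)
    rank = lambda ix, iy, iz: (ix * py + iy) * pz + iz  # last dim fastest
    node = lambda ix, iy, iz: rank(ix, iy, iz) // cores_per_node
    total = 0
    for ix in range(px):
        for iy in range(py):
            for iz in range(pz):
                n = node(ix, iy, iz)
                if ix + 1 < px and node(ix + 1, iy, iz) != n:
                    total += sx
                if iy + 1 < py and node(ix, iy + 1, iz) != n:
                    total += sy
                if iz + 1 < pz and node(ix, iy, iz + 1) != n:
                    total += sz
    return total

print(crossing_node(5, 6, 4))   # 1800000, as derived for 5 x 6 x 4
print(crossing_node(6, 5, 4))   # 2280000, as derived for 6 x 5 x 4
```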

4.2.2. NSCC-SZ. Table 3 compares the total computation time (in seconds) for different numbers of CPU cores and different virtual topology schemes. The maximum number of CPU cores used in this test is 480.

In Table 3, virtual topology schemes are described as (x × y × z) for all three communication patterns, and the meaning of each value in each topology is the same as described in Section 4.2.1 above.

The speedup and parallel efficiency of the code are shown in Figure 5. From Figure 5 it can be seen that the parallel efficiency reaches up to 80% on NSCC-SZ.

From Table 3 it can be seen that, for virtual topologies with the same dimensions, fewer total grids at the interfaces (a smaller amount of communication data) can save calculation

Table 3: Comparisons of virtual topology, amount of communication, and computation time on NSCC-SZ.

CPU cores   Virtual topology (x × y × z)   Amount of communication   Computation time (s)
48          4 × 6 × 2                      4320000                   1271.02
48          6 × 4 × 2                      4320000                   1256.12
60          5 × 6 × 2                      4680000                   1055.02
60          3 × 10 × 2                     5400000                   1044.47
60          3 × 5 × 4                      6480000                   1214.09
60          10 × 2 × 3                     6480000                   1117.17
96          6 × 8 × 2                      5760000                   596.99
96          8 × 4 × 3                      6480000                   637.09
96          6 × 4 × 4                      7200000                   604.82
96          4 × 6 × 4                      7200000                   641.13
120         6 × 10 × 2                     6480000                   492.96
120         10 × 6 × 2                     6480000                   752.48
120         8 × 5 × 3                      6840000                   510.25
120         5 × 8 × 3                      6840000                   566.13
240         10 × 12 × 2                    8640000                   286.12
240         8 × 10 × 3                     8640000                   296.48
240         10 × 8 × 3                     8640000                   328.07
240         12 × 10 × 2                    8640000                   417.30
360         12 × 10 × 3                    10080000                  197.46
360         10 × 12 × 3                    10080000                  278.73
360         8 × 15 × 3                     10440000                  181.46
360         12 × 15 × 2                    10440000                  297.34
480         10 × 12 × 4                    11520000                  147.48
480         12 × 10 × 4                    11520000                  156.53
480         15 × 8 × 4                     11880000                  157.57
480         15 × 16 × 2                    11880000                  162.54
480         12 × 8 × 5                     12240000                  161.85
480         8 × 12 × 5                     12240000                  165.30


Figure 5: The speedup and parallel efficiency of the code from 48 CPU cores to 480 CPU cores on NSCC-SZ: (a) speedup; (b) parallel efficiency.

time effectively. This conclusion coincides with the case on NSCC-TJ.

The amount of communicated grids of each virtual topology is calculated by (3). However, for certain topologies with the same number of grids, it is found that the calculation time can be less even with more crossing-node communication, analyzed in the same way as for NSCC-TJ. This is contrary to the speculation above. Therefore, we speculate whether it is caused by the process with the heaviest communication load in the topology. Next, the cases of 10 × 2 × 3 and 3 × 5 × 4 with 60 cores are taken as examples to analyze this speculation (the amount of communication is 6480000 for both).

For 10 × 2 × 3, the crossing-node communication is 4 × (1200 × 300) + 1 × (1200 × 300) = 1800000, while for 3 × 5 × 4 it is 2 × (1200 × 300) + (16/12) × (1200 × 300) = 1200000. But the time consumed for 10 × 2 × 3 is 1117.17 seconds, and the one for 3 × 5 × 4 is 1214.09 seconds. This does not agree with the crossing-node speculation made for NSCC-TJ. To further explore the reason, the heaviest and lightest communication loads of the related processes in these two virtual topology schemes are listed in Table 4. The heaviest communication load in the virtual topology scheme 10 × 2 × 3 is (1200 × 300)/30 + (1200 × 300) × (2/6) + (1200 × 1200) × (2/20) = 276000, while in the virtual topology scheme 3 × 5 × 4 it is (1200 × 300) × (2/12) + (1200 × 300) × (2/20) + (1200 × 1200) × (2/15) = 288000. The heaviest communication loads are carried by the processes located at the center of the process grid. Similarly, the lightest communication loads, which belong to the processes at the corners of the process grid, can be calculated. The difference between the heaviest and the lightest communication load in the virtual topology scheme 3 × 5 × 4 is larger than the one in 10 × 2 × 3, which results in the different computation times. This indicates that, when the total

Figure 6: The model of the microstrip antenna array.

Table 4: Comparisons of communication load.

Topology     The heaviest   The lightest   Difference
10 × 2 × 3   276000         144000         132000
3 × 5 × 4    288000         144000         144000

amount of communication is the same, a virtual topology scheme with a more balanced communication load may bring better performance, even with more crossing-node communication.
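Table 4 can be reproduced with a short load calculation. The sketch below (a hypothetical helper of ours, assuming a uniform block split) counts the interface cells of a process at the center of the process grid (the heaviest load) and at a corner (the lightest):

```python
NX, NY, NZ = 1200, 1200, 300

def load(px, py, pz, corner=False):
    sx = (NY // py) * (NZ // pz)   # one x-interface of a process block
    sy = (NZ // pz) * (NX // px)   # one y-interface
    sz = (NX // px) * (NY // py)   # one z-interface
    def touched(p):
        # interfaces along one axis: 0 if the axis is undivided, 1 at a
        # corner or when only two processes share the axis, else 2
        if p == 1:
            return 0
        return 1 if (corner or p == 2) else 2
    return sx * touched(px) + sy * touched(py) + sz * touched(pz)

for topo in [(10, 2, 3), (3, 5, 4)]:
    heavy, light = load(*topo), load(*topo, corner=True)
    print(topo, heavy, light, heavy - light)
# (10, 2, 3): 276000 144000 132000
# (3, 5, 4):  288000 144000 144000
```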

4.3. The General Rules of Optimal Virtual Topology. Generally, when the amount of FDTD cells is the same, an MPI virtual topology that transfers less data saves computation time among virtual topologies of the same dimension. The best performance of a parallel FDTD code can be obtained by optimizing the MPI virtual topology scheme. The general rules for better performance are as follows.

(a) Select the MPI virtual topology scheme that makes the total communication L (equation (3)) the smallest.


Figure 7: The radiation patterns of the array, computed by MoM and FDTD: (a) xoz plane; (b) yoz plane.

Figure 8: The model of the microstrip array.

(b) When the total communication L is the same, choose the topology with less crossing-node communication.

(c) When the amounts of crossing-node communication of different topologies are approximately the same, select the topology with a more balanced communication load.
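Rules (a) to (c) can be combined into a single lexicographic ranking: minimize L first, then the crossing-node traffic, then the load imbalance. The self-contained sketch below is ours, not the paper's code; it reuses the interface formulas of (3) and assumes a uniform block split, 12 cores per node, and last-dimension-fastest rank placement. It picks a 120-core topology for the benchmark grid:

```python
NX, NY, NZ = 1200, 1200, 300
CORES_PER_NODE = 12

def faces(px, py, pz):
    return ((NY // py) * (NZ // pz),     # x-interface size
            (NZ // pz) * (NX // px),     # y-interface size
            (NX // px) * (NY // py))     # z-interface size

def total_L(px, py, pz):
    # rule (a): total interface cells, equation (3)
    return (px - 1) * NY * NZ + (py - 1) * NZ * NX + (pz - 1) * NX * NY

def crossing_node(px, py, pz):
    # rule (b): interface cells whose owners sit on different nodes
    sx, sy, sz = faces(px, py, pz)
    node = lambda ix, iy, iz: ((ix * py + iy) * pz + iz) // CORES_PER_NODE
    total = 0
    for ix in range(px):
        for iy in range(py):
            for iz in range(pz):
                n = node(ix, iy, iz)
                total += sx * (ix + 1 < px and node(ix + 1, iy, iz) != n)
                total += sy * (iy + 1 < py and node(ix, iy + 1, iz) != n)
                total += sz * (iz + 1 < pz and node(ix, iy, iz + 1) != n)
    return total

def imbalance(px, py, pz):
    # rule (c): heaviest (center) minus lightest (corner) per-process load
    sx, sy, sz = faces(px, py, pz)
    touched = lambda p, corner: 0 if p == 1 else (1 if corner or p == 2 else 2)
    heavy = sum(s * touched(p, False) for s, p in zip((sx, sy, sz), (px, py, pz)))
    light = sum(s * touched(p, True) for s, p in zip((sx, sy, sz), (px, py, pz)))
    return heavy - light

def factor_triples(p):
    return [(px, py, p // (px * py))
            for px in range(1, p + 1) if p % px == 0
            for py in range(1, p // px + 1) if (p // px) % py == 0]

best = min(factor_triples(120),
           key=lambda t: (total_L(*t), crossing_node(*t), imbalance(*t)))
print(best, total_L(*best))
```

For 120 cores this selects 6 × 10 × 2, which is indeed the fastest 120-core scheme measured in Table 2 (the tie with 10 × 6 × 2 on L is broken by the crossing-node term).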

5. Applications Using Ten Thousand CPU Cores

Based on the optimal virtual topology scheme above, the parallel FDTD code is applied to analyze some complicated EM problems, run on the Shanghai Supercomputer Center (SSC) platform.

5.1. Radiation of a Microstrip Antenna Array

5.1.1. Validation of the FDTD Code. To validate the FDTD code, a microstrip antenna array with hundreds of identical

Figure 9: The parallel efficiency from 4096 to 10240 cores.

antenna units is analyzed, and the results are compared with the ones provided by MoM. The model is shown in Figure 6. The amount of total grids is 786 × 1224 × 54. It is calculated on a PC with the virtual topology scheme 4 × 3 × 2.

The radiation patterns of the array are shown in Figure 7, compared with the ones provided by MoM. From Figure 7 one can see that the two sets of results agree very well.


Figure 10: The radiation patterns of the ellipse microstrip antenna array: (a) xoz plane; (b) yoz plane.

Figure 11: The model of the airplane.

5.1.2. Radiation of an Ellipse Antenna Array. An ellipse microstrip antenna array with nearly 2000 elements is shown in Figure 8. In this array, the amount of total grids is 6016 × 1160 × 54 (about 0.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores, and the parallel efficiency from 4096 to 10240 cores is measured.

The parallel efficiency of the code is shown in Figure 9, and the radiation patterns of the array are shown in Figure 10.

From Figure 9 it is known that the parallel efficiency reaches nearly 80% at 10240 cores, with 4096 cores as the benchmark, which is among the best efficiencies ever reached using more than ten thousand CPU cores.
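Here the efficiency is measured relative to the 4096-core run rather than a single-core run, since a problem of this size does not fit on a single node. A minimal sketch of that definition (the timings below are hypothetical placeholders of ours, not measured values from the paper):

```python
def parallel_efficiency(t_ref, p_ref, t, p):
    # efficiency of a run on p cores relative to a baseline on p_ref cores
    return (t_ref * p_ref) / (t * p)

# e.g. hypothetical run times: 1.00 s at 4096 cores, 0.50 s at 10240 cores
print(parallel_efficiency(1.00, 4096, 0.50, 10240))   # 0.8
```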

5.2. RCS of an Electrically Large Airplane

5.2.1. Simulation Model. The airplane is shown in Figure 11. The parameters are as follows. The incident wave frequency is 4.0 GHz, the direction of incidence is +y, and the polarization is E_z. The size of the airplane is 52.4256 m × 21.0312 m × 2.944213 m; the electric size is about 700λ × 280λ × 39λ. The grid size is dx = dy = dz = 0.0075 m. The amount of total grids is 7020 × 2840 × 420 (about 8.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores.

5.2.2. Result. The computation time of this simulation is 6036.86 seconds. The RCS of the airplane is shown in Figure 12.

6. Conclusion

A guideline is presented for using parallel FDTD on supercomputers with more than ten thousand CPU cores, based on the theoretical communication model given in this paper. The benchmarks obtained on two supercomputers validate the optimal virtual topology rules. Radiation from a large microstrip antenna array and scattering from an electrically large airplane are simulated successfully, which indicates the capability of the presented method for these types of real-life EM problems.


Figure 12: The RCS of the airplane: (a) xoy plane; (b) xoz plane; (c) yoz plane.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National High Technology Research and Development Program of China (863 Program) (2012AA01A308), the NSFC (61301069, 61072019), the Project with Contract no. 2013KJXX-67, and the Program for New Century Excellent Talents in University of China (NCET-13-0949). The computational resources utilized in this research are provided by the National Supercomputer Center in Tianjin (NSCC-TJ), the National Supercomputing Center in Shenzhen (NSCC-SZ), and the Shanghai Supercomputer Center (SSC).

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 2000.
[2] D. Ge and Y. Yan, Finite-Difference Time-Domain Method for Electromagnetic Waves, Version 3, Xidian University Press, Xi'an, China, 2011 (Chinese).
[3] J. L. Volakis, D. B. Davidson, C. Guiffaut, and K. Mahdjoubi, "A parallel FDTD algorithm using the MPI library," IEEE Antennas and Propagation Magazine, vol. 43, no. 2, pp. 94–103, 2001.
[4] U. Andersson, Time-domain methods for the Maxwell equations [Ph.D. thesis], Royal Institute of Technology, Stockholm, Sweden, 2001.
[5] Y. Zhang, J. Song, and C. H. Liang, "MPI-based parallelized locally conformal FDTD for modeling slot antennas and new periodic structures in microstrip," Journal of Electromagnetic Waves and Applications, vol. 18, no. 10, pp. 1321–1335, 2004.
[6] Y. Zhang, J. Song, and C. Liang, "Study on the parallel modified locally conformal FDTD algorithm on cluster of PCs for PBG structures," Acta Electronica Sinica, vol. 31, no. 12A, pp. 2142–2144, 2003.
[7] Z. Yu, D. Wei, and C. Liang, "Analysis of parallel performance of MPI based parallel FDTD on PC clusters," in Proceedings of the Asia-Pacific Microwave Conference (APMC '05), vol. 4, December 2005.
[8] W. Yu, X. Yang, Y. Liu, et al., "A new direction in computational electromagnetics: solving large problems using the parallel FDTD on the BlueGene/L supercomputer providing teraflop-level performance," IEEE Antennas and Propagation Magazine, vol. 50, no. 2, pp. 26–44, 2008.
[9] http://www.nscc-tj.gov.cn
[10] http://www.nsccsz.gov.cn
[11] http://www.ssc.net.cn
[12] W. Chen and J. Zhai, Preliminary Analysis on Communication Performance of Dawning 5000A, 863 High Performance Computer Testing Center of Tsinghua University, 2008.
[13] Intel Corporation, Intel MPI Library for Linux OS Reference Manual, Intel Corporation, 2011, https://software.intel.com/sites/products/documentation/hpc/mpi/linux/reference_manual.pdf.


Page 3: Research Article An Optimized Parallel FDTD Topology for ...downloads.hindawi.com/journals/ijap/2015/690510.pdf · Although there are many publications on parallel FDTD, few of them

International Journal of Antennas and Propagation 3

x

y

z

h

y

x

Substrate

Topology distribution at y direction

Topology distribution at x direction

Topology distribution at z direction

xp

yp

xf

yd

xd

xp

yp

120576r

yg

xg

Figure 1 2 times 2 microstrip array

virtual topology schemes on two supercomputer center plat-forms National Supercomputer Center in Tianjin (NSCC-TJ)and National Supercomputing Center in Shenzhen (NSCC-SZ) as listed in Table 1

The array model is shown in Figure 1 The parameters ofthis array are as follows Central frequency is 497GHz 119909

119901=

14mm 119910119901= 96mm 119909

119889= 15mm 119910

119889= 15mm 120576

119903= 434

ℎ = 08mm 119909119892= 60mm 119910

119892= 60mm and 119909

119891= 36mm

The size of grid is 119889119909 = 119889119910 = 119889119911 = 04mmActually the amount of total grids is just 200 times 200 times 50

However to test the influence of different virtual topologyschemes on parallel performance of parallel FDTD thecomputational space needs to be extended So in this test theamount of total grids is set as 1200 times 1200 times 300

The radiation patterns of the microstrip array areshown in Figure 2 compared with the results obtained fromHFSSThefigure shows that there is awell agreement betweenthem

42 Discussion of Parallel Performance Here we select sev-eral groups of virtual topology schemes to be tested Thefollowing are the test results on the two supercomputer centerplatforms

4.2.1. NSCC-TJ. Table 2 compares the total computation time (in seconds) for different numbers of CPU cores and different virtual topology schemes. The maximum number of CPU cores used for this test is 120.

In Table 2, virtual topology schemes are described as (x × y × z) for all three communication patterns. A value of 1 in some direction implies that there is no topology in that direction. For example, 2 × 1 × 1 means that there is no topology in the y and z directions, so the virtual topology is actually one-dimensional. Similarly, 8 × 8 × 1 means that there is no topology in the z direction, so the virtual topology is actually two-dimensional. In our work one process uses one CPU core.

Table 2: Comparisons of virtual topology, amount of communication, and computation time on NSCC-TJ.

CPU cores | Virtual topology (x × y × z) | Amount of communication | Computation time (s)
12  | 3 × 2 × 2   | 2520000 | 6466.68
16  | 4 × 2 × 2   | 2880000 | 6140.56
32  | 4 × 4 × 2   | 3600000 | 2858.97
64  | 8 × 8 × 1   | 5040000 | 1376.85
64  | 8 × 4 × 2   | 5040000 | 1379.92
64  | 16 × 2 × 2  | 7200000 | 1824.37
96  | 8 × 6 × 2   | 5760000 | 946.52
96  | 8 × 4 × 3   | 6480000 | 1500.54
96  | 12 × 4 × 2  | 6480000 | 1551.95
96  | 16 × 3 × 2  | 7560000 | 1679.42
120 | 6 × 10 × 2  | 6480000 | 702.16
120 | 10 × 6 × 2  | 6480000 | 808.65
120 | 5 × 12 × 2  | 6840000 | 713.67
120 | 12 × 5 × 2  | 6840000 | 863.65
120 | 5 × 6 × 4   | 7560000 | 724.90
120 | 6 × 5 × 4   | 7560000 | 1076.41
120 | 15 × 4 × 2  | 7560000 | 1096.52

The speedup and parallel efficiency of the code are shown in Figure 3. From Figure 3 it can be seen that the parallel efficiency reaches up to 80% on NSCC-TJ.

From Table 2 it is obvious that increasing the number of CPU cores rapidly reduces the computation time. But different virtual topology schemes cost different computation time even when the code is run with the same number of processes. Next, the parallel performance of the parallel FDTD is discussed.

Here the cases of 96 and 120 cores are taken as examples. From (3),

L = (P_x − 1) N_y N_z + (P_y − 1) N_z N_x + (P_z − 1) N_x N_y,   (7)

the following is known.

(a) 96 Cores. Consider

8 × 6 × 2: (8 − 1) × (1200 × 300) + (6 − 1) × (1200 × 300) + (2 − 1) × (1200 × 1200) = 5760000 (946.52 s) (0.5 GB/process);

8 × 4 × 3: (8 − 1) × (1200 × 300) + (4 − 1) × (1200 × 300) + (3 − 1) × (1200 × 1200) = 6480000 (1500.54 s) (0.5 GB/process).
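The interface totals above can be checked numerically with a short script (a sketch of ours, not the paper's code; the 1200 × 1200 × 300 grid of this test is assumed):

```python
# Communication model of equation (7): total FDTD grids exchanged at the
# subdomain interfaces for a Px x Py x Pz virtual topology on an
# Nx x Ny x Nz grid (defaults match the 1200 x 1200 x 300 test grid).
def comm_amount(px, py, pz, nx=1200, ny=1200, nz=300):
    return (px - 1) * ny * nz + (py - 1) * nz * nx + (pz - 1) * nx * ny

# The 96-core examples above:
print(comm_amount(8, 6, 2))  # 5760000
print(comm_amount(8, 4, 3))  # 6480000
```

The same function reproduces every entry in the "Amount of communication" column of Table 2.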


Figure 2: The radiation patterns of the 2 × 2 microstrip antenna array, computed by FDTD and compared with HFSS: (a) xoz plane; (b) yoz plane.

Figure 3: The speedup and parallel efficiency of the code from 12 CPU cores to 120 CPU cores on NSCC-TJ: (a) speedup; (b) parallel efficiency.

(b) 120 Cores. Consider

5 × 6 × 4: (5 − 1) × (1200 × 300) + (6 − 1) × (1200 × 300) + (4 − 1) × (1200 × 1200) = 7560000 (724.90 s) (0.4 GB/process);

6 × 5 × 4: (6 − 1) × (1200 × 300) + (5 − 1) × (1200 × 300) + (4 − 1) × (1200 × 1200) = 7560000 (1076.41 s) (0.4 GB/process).

From the above it is known that, for virtual topologies with the same dimensions, the one with fewer total grids at the interfaces (a smaller amount of communication data) saves calculation time effectively. Meanwhile, it is obvious that the memory distribution of each process is the same for different topologies with the same number of CPU cores.

But it can also be seen that, for the same amount of communication grids, the calculation times show certain differences.


Figure 4: The diagram of the communication mode across nodes for the virtual topologies (a) 5 × 6 × 4 and (b) 6 × 5 × 4. The numbers 1 to 10 mark the node to which each process belongs.

For example, the topologies 6 × 5 × 4 and 5 × 6 × 4 have the same number of communication grids, yet the calculation times are 1076.41 seconds and 724.90 seconds, respectively. Generally it is believed that communication between processes within one node consumes less time, while communication between processes across nodes consumes more [12, 13]. So here we speculate that the different amount of communication grids across nodes causes the difference between the two cases.

For 5 × 6 × 4, there are 4 × (1200 × 300) + 1 × (1200 × 300) = 1800000 FDTD grids that need to communicate across nodes, and its calculation time is 724.90 seconds; while for 6 × 5 × 4, there are 5 × (1200 × 300) + (1 + 2/6) × (1200 × 300) = 2280000 FDTD grids that need to communicate across nodes, and its calculation time is 1076.41 seconds. The way this communication data is calculated is shown in Figure 4.

In Figure 4, every two adjacent numbers with different colors have data communication. The numbers 1 to 10 represent ten nodes, and adjacent numbers with the same color belong to the same node. For (b), the adjacent columns transfer (1 + 2/6) × (1200 × 300) data in the xoz plane and the adjacent rows transfer 5 × (1200 × 300) data in the yoz plane.

4.2.2. NSCC-SZ. Table 3 compares the total computation time (in seconds) for different numbers of CPU cores and different virtual topology schemes. The maximum number of CPU cores used for this test is 480.

In Table 3, virtual topology schemes are described as (x × y × z) for all three communication patterns, and each entry has the same meaning as described in Section 4.2.1 above.

The speedup and parallel efficiency of the code are shown in Figure 5. From Figure 5 it can be seen that the parallel efficiency reaches up to 80% on NSCC-SZ.

From Table 3 it can be seen that, for the virtual topologies with the same dimensions, the one with fewer total grids at the interfaces (a smaller amount of communication data) saves calculation time effectively. This conclusion coincides with the case on NSCC-TJ.

Table 3: Comparisons of virtual topology, amount of communication, and computation time on NSCC-SZ.

CPU cores | Virtual topology (x × y × z) | Amount of communication | Computation time (s)
48  | 4 × 6 × 2   | 4320000  | 1271.02
48  | 6 × 4 × 2   | 4320000  | 1256.12
60  | 5 × 6 × 2   | 4680000  | 1055.02
60  | 3 × 10 × 2  | 5400000  | 1044.47
60  | 3 × 5 × 4   | 6480000  | 1214.09
60  | 10 × 2 × 3  | 6480000  | 1117.17
96  | 6 × 8 × 2   | 5760000  | 596.99
96  | 8 × 4 × 3   | 6480000  | 637.09
96  | 6 × 4 × 4   | 7200000  | 604.82
96  | 4 × 6 × 4   | 7200000  | 641.13
120 | 6 × 10 × 2  | 6480000  | 492.96
120 | 10 × 6 × 2  | 6480000  | 752.48
120 | 8 × 5 × 3   | 6840000  | 510.25
120 | 5 × 8 × 3   | 6840000  | 566.13
240 | 10 × 12 × 2 | 8640000  | 286.12
240 | 8 × 10 × 3  | 8640000  | 296.48
240 | 10 × 8 × 3  | 8640000  | 328.07
240 | 12 × 10 × 2 | 8640000  | 417.3
360 | 12 × 10 × 3 | 10080000 | 197.46
360 | 10 × 12 × 3 | 10080000 | 278.73
360 | 8 × 15 × 3  | 10440000 | 181.46
360 | 12 × 15 × 2 | 10440000 | 297.34
480 | 10 × 12 × 4 | 11520000 | 147.48
480 | 12 × 10 × 4 | 11520000 | 156.53
480 | 15 × 8 × 4  | 11880000 | 157.57
480 | 15 × 16 × 2 | 11880000 | 162.54
480 | 12 × 8 × 5  | 12240000 | 161.85
480 | 8 × 12 × 5  | 12240000 | 165.3

Figure 5: The speedup and parallel efficiency of the code from 48 CPU cores to 480 CPU cores on NSCC-SZ: (a) speedup; (b) parallel efficiency.

The amount of communication grids of each virtual topology is calculated by (3). For certain topologies with the same number of communication grids, however, it is found that the calculation time can be less even with more crossing-node communication, when analyzed in the way used for NSCC-TJ. This is contrary to the speculation above. Therefore we speculate whether it is caused by the size of the heaviest per-process communication load in the topology. Next the cases of 10 × 2 × 3 and 3 × 5 × 4 with 60 cores are taken as examples to analyze this speculation (the amount of communication is 6480000 for both).

For 10 × 2 × 3 the crossing-node communication is 4 × (1200 × 300) + 1 × (1200 × 300) = 1800000, while for 3 × 5 × 4 it is 2 × (1200 × 300) + 16/12 × (1200 × 300) = 1200000. But the consumption time for 10 × 2 × 3 is 1117.17 seconds and the one for 3 × 5 × 4 is 1214.09 seconds. This does not agree with the crossing-node speculation drawn on NSCC-TJ. To further explore the reason, the heaviest and the lightest communication loads of the related processes in these two virtual topology schemes are listed in Table 4. The heaviest communication load in the virtual topology scheme 10 × 2 × 3 is (1200 × 300)/30 + (1200 × 300) × 2/6 + (1200 × 1200) × 2/20 = 276000, while in the virtual topology scheme 3 × 5 × 4 it is (1200 × 300) × 2/12 + (1200 × 300) × 2/20 + (1200 × 1200) × 2/15 = 288000. The heaviest communication loads belong to the processes located at the center of the process grid. Similarly, the lightest communication loads, which belong to the processes at the corners of the process grid, can be calculated. The difference between the heaviest and the lightest communication load in the virtual topology scheme 3 × 5 × 4 is larger than that in 10 × 2 × 3, which results in the different computation time. This indicates that when the total amount of communication is the same, a virtual topology scheme with a more balanced communication load may bring better performance, even with more crossing-node communication.

Table 4: Comparisons of communication load.

Topology   | Heaviest | Lightest | Difference
10 × 2 × 3 | 276000   | 144000   | 132000
3 × 5 × 4  | 288000   | 144000   | 144000

Figure 6: The model of the microstrip antenna array.
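The heaviest and lightest per-process loads above can be reproduced by summing, for each axis, the number of neighbors times the corresponding face area of a subdomain. The sketch below is our reconstruction of that counting, not the authors' code, and assumes the 1200 × 1200 × 300 grid of this test:

```python
# Per-process communication load for a process with the given number of
# neighbours along each axis of a Px x Py x Pz topology.
def process_load(px, py, pz, nbrs, nx=1200, ny=1200, nz=300):
    """nbrs = (x-neighbours, y-neighbours, z-neighbours), each 0, 1, or 2."""
    face_x = (ny // py) * (nz // pz)  # area of a subdomain face normal to x
    face_y = (nz // pz) * (nx // px)  # area of a subdomain face normal to y
    face_z = (nx // px) * (ny // py)  # area of a subdomain face normal to z
    return nbrs[0] * face_x + nbrs[1] * face_y + nbrs[2] * face_z

def heaviest(px, py, pz):
    # a central process has 2 neighbours along any axis split into >= 3
    # parts, and 1 along an axis split into exactly 2
    return process_load(px, py, pz, tuple(min(p - 1, 2) for p in (px, py, pz)))

def lightest(px, py, pz):
    # a corner process has at most 1 neighbour along each partitioned axis
    return process_load(px, py, pz, tuple(min(p - 1, 1) for p in (px, py, pz)))

print(heaviest(10, 2, 3), lightest(10, 2, 3))  # 276000 144000
print(heaviest(3, 5, 4), lightest(3, 5, 4))    # 288000 144000
```

The printed differences (132000 versus 144000) match Table 4.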

4.3. The General Rules of Optimal Virtual Topology. Generally, when the amount of FDTD cells is the same, the MPI virtual topology that transfers less data saves computation time among virtual topologies of the same dimension. The best performance of a parallel FDTD code can be obtained by optimizing the MPI virtual topology scheme. The general rules for better performance are as follows.

(a) Select the MPI virtual topology scheme that makes the total communication L (equation (3)) the smallest.


Figure 7: The radiation patterns of the array, computed by FDTD and compared with MoM: (a) xoz plane; (b) yoz plane.

Figure 8: The model of the microstrip array.

(b) When the total communication L is the same, select the topology with less crossing-node communication.

(c) When the amount of crossing-node communication of different topologies is approximately the same, select the topology with a more balanced communication load.
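Rule (a) can be applied mechanically: enumerate every factorization of the process count into P_x × P_y × P_z and keep the schemes that minimize L. A minimal sketch (ours, not the paper's code; the 1200 × 1200 × 300 test grid is assumed) recovers the best 120-core schemes of Table 2:

```python
# Brute-force search over all Px x Py x Pz factorizations of the core
# count, ranking them by the total communication L of equation (3)/(7).
def total_comm(px, py, pz, nx=1200, ny=1200, nz=300):
    return (px - 1) * ny * nz + (py - 1) * nz * nx + (pz - 1) * nx * ny

def best_topologies(cores, nx=1200, ny=1200, nz=300):
    schemes = [(px, py, cores // (px * py))
               for px in range(1, cores + 1) if cores % px == 0
               for py in range(1, cores // px + 1) if (cores // px) % py == 0]
    best = min(total_comm(*s, nx, ny, nz) for s in schemes)
    return best, [s for s in schemes if total_comm(*s, nx, ny, nz) == best]

print(best_topologies(120))  # (6480000, [(6, 10, 2), (10, 6, 2)])
```

The minimizers 6 × 10 × 2 and 10 × 6 × 2 are exactly the two 120-core schemes with the smallest communication amount in Table 2; rules (b) and (c) then break the tie between them.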

5. Applications Using 10 Thousands of CPU Cores

Based on the optimal virtual topology scheme above, the parallel FDTD code is applied to analyze some complicated EM problems, which are run on the Shanghai Supercomputer Center (SSC) platform.

5.1. Radiation of Microstrip Antenna Array

5.1.1. Validation of the FDTD Code. To validate the FDTD code, a microstrip antenna array with hundreds of identical antenna units is analyzed, and the results are compared with those provided by MoM. The model is shown in Figure 6. The amount of total grids is 786 × 1224 × 54. It is calculated on a PC with the virtual topology scheme 4 × 3 × 2.

Figure 9: The parallel efficiency of 4096~10240 cores.

The radiation patterns of the array are shown in Figure 7, compared with those provided by MoM. From Figure 7 one can see very good agreement between them.


Figure 10: The radiation patterns of the ellipse microstrip antenna array: (a) xoz plane; (b) yoz plane.

Figure 11: The model of the airplane.

5.1.2. Radiation of an Ellipse Antenna Array. An ellipse microstrip antenna array with nearly 2000 elements is shown in Figure 8. In this array the amount of total grids is 6016 × 1160 × 54 (about 0.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores, and the parallel efficiency from 4096 to 10240 cores is measured.

The parallel efficiency of this code is shown in Figure 9, and the radiation patterns of this array are shown in Figure 10.

From Figure 9 it is known that the parallel efficiency reaches nearly 80% at 10240 cores with 4096 cores as the benchmark, which is among the best efficiencies ever reached using more than 10 thousands of CPU cores.

5.2. RCS of an Electrically Large Airplane

5.2.1. Simulation Model. The airplane is shown in Figure 11. The parameters are as follows. The incident wave frequency is 4.0 GHz. The direction of the incident wave is +y, and the polarization is E_z. The size of the airplane is 52.4256 m × 21.0312 m × 2.944213 m, so the electric size is about 700λ × 280λ × 39λ. The size of the grid is dx = dy = dz = 0.0075 m. The amount of total grids is 7020 × 2840 × 420 (about 8.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores.
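The problem-size figures quoted above follow from simple arithmetic, sketched here as a check (ours, not the authors' code): a λ/10 cell at 4 GHz gives the 0.0075 m grid, and dividing the airplane extent by the wavelength gives the electric size.

```python
# Check of the quoted electric size and grid count for the airplane case.
wavelength = 3e8 / 4.0e9                 # 0.075 m at 4 GHz; cell = lambda/10
size = (52.4256, 21.0312, 2.944213)      # airplane extent in metres
electric_size = [round(s / wavelength) for s in size]
print(electric_size)                     # [699, 280, 39], i.e. ~700 x 280 x 39 wavelengths

total_cells = 7020 * 2840 * 420          # grid is padded beyond the airplane
print(total_cells)                       # 8373456000, about 8.4 billion cells
```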

5.2.2. Result. The computation time of this code is 6036.86 seconds. The RCS of the airplane is shown in Figure 12.

6. Conclusion

A guideline is presented for using parallel FDTD on supercomputers with more than 10 thousands of CPU cores, based on the theoretical communication model given in this paper. The benchmarks obtained on two supercomputers validate the optimal virtual topology rules. Radiation from a large microstrip antenna array and scattering from an electrically large airplane are simulated successfully, which indicates the capability of the presented method for these types of real-life EM problems.


Figure 12: The RCS of the airplane: (a) xoy plane; (b) xoz plane; (c) yoz plane.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National High Technology Research and Development Program of China (863 Program) (2012AA01A308), the NSFC (61301069, 61072019), the Project with Contract no. 2013KJXX-67, and the Program for New Century Excellent Talents in University of China (NCET-13-0949). The computational resources utilized in this research are provided by the National Supercomputer Center in Tianjin (NSCC-TJ), the National Supercomputing Center in Shenzhen (NSCC-SZ), and the Shanghai Supercomputer Center (SSC).

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 2000.

[2] D. Ge and Y. Yan, Finite-Difference Time-Domain Method for Electromagnetic Waves, Version 3, Xidian University Press, Xi'an, China, 2011 (Chinese).

[3] J. L. Volakis, D. B. Davidson, C. Guiffaut, and K. Mahdjoubi, "A parallel FDTD algorithm using the MPI library," IEEE Antennas and Propagation Magazine, vol. 43, no. 2, pp. 94–103, 2001.

[4] U. Andersson, Time-domain methods for the Maxwell equations [Ph.D. thesis], Royal Institute of Technology, Stockholm, Sweden, 2001.

[5] Y. Zhang, J. Song, and C. H. Liang, "MPI-based parallelized locally conformal FDTD for modeling slot antennas and new periodic structures in microstrip," Journal of Electromagnetic Waves and Applications, vol. 18, no. 10, pp. 1321–1335, 2004.

[6] Y. Zhang, J. Song, and C. Liang, "Study on the parallel modified locally conformal FDTD algorithm on cluster of PCs for PBG structures," Acta Electronica Sinica, vol. 31, no. 12A, pp. 2142–2144, 2003.

[7] Z. Yu, D. Wei, and C. Liang, "Analysis of parallel performance of MPI based parallel FDTD on PC clusters," in Proceedings of the Asia-Pacific Microwave Conference (APMC '05), vol. 4, December 2005.

[8] W. Yu, X. Yang, Y. Liu, et al., "A new direction in computational electromagnetics: solving large problems using the parallel FDTD on the BlueGene/L supercomputer providing teraflop-level performance," IEEE Antennas and Propagation Magazine, vol. 50, no. 2, pp. 26–44, 2008.

[9] http://www.nscc-tj.gov.cn/

[10] http://www.nsccsz.gov.cn/

[11] http://www.ssc.net.cn/

[12] W. Chen and J. Zhai, Preliminary Analysis on Communication Performance of Dawning 5000A, 863 High Performance Computer Testing Center of Tsinghua University, 2008.

[13] Intel Corporation, Intel MPI Library for Linux OS Reference Manual, Intel Corporation, 2011, https://software.intel.com/sites/products/documentation/hpc/mpi/linux/reference_manual.pdf.


Figure 12 The RCS of the airplane

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is supported by the National High TechnologyResearch andDevelopment Program of China (863 Program)(2012AA01A308) the NSFC (61301069 61072019) the Projectwith Contract no 2013KJXX-67 and the Program for New

Century Excellent Talents in University of China (NCET-13-0949) The computational resources utilized in this researchare provided by the National Supercomputer Center inTianjin (NSCC-TJ) National Supercomputing Center inShenzhen (NSCC-SZ) and Shanghai Supercomputer Center(SSC)

References

[1] A Taflove Computational Electrodynamics The Finite-Differ-ence Time-Domain Method Artech House Norwood MassUSA 2000

10 International Journal of Antennas and Propagation

[2] D Ge and Y Yan Finite-Difference Time-Domain Method forElectromagnetic Waves Version 3 Xidian University PressXirsquoan China 2011 (Chinese)

[3] J L Volakis D B Davidson C Guiffaut and K Mahdjoubi ldquoAparallel FDTD algorithm using theMPI libraryrdquo IEEE Antennasand Propagation Magazine vol 43 no 2 pp 94ndash103 2001

[4] U Andersson Time-domain methods for the Maxwell equations[PhD thesis] Royal Institute of Technology Stockholm Swe-den 2001

[5] Y Zhang J Song and C H Liang ldquoMPI-based parallelizedlocally conformal fdtd for modeling slot antennas and newperiodic structures in microstriprdquo Journal of ElectromagneticWaves and Applications vol 18 no 10 pp 1321ndash1335 2004

[6] Y Zhang J Song and C Liang ldquoStudy on the parallel modifiedlocally conformal FDTD algorithm on cluster of PCs for PBGstructuresrdquo Acta Electronica Sinica vol 31 no 12A pp 2142ndash2144 2003

[7] Z Yu D Wei and C Liang ldquoAnalysis of parallel performanceof MPI based parallel FDTD on PC clustersrdquo in Proceedings ofthe Asia-Pacific Conference Proceedings Microwave ConferenceProceedings (APMC rsquo05) vol 4 December 2005

[8] W Yu X Yang Y Liu et al ldquoA new direction in computationalelectromagnetics solving large problems using the parallelFDTD on the BlueGeneL supercomputer providing teraflop-level performancerdquo IEEE Antennas and Propagation Magazinevol 50 no 2 pp 26ndash44 2008

[9] httpwwwnscc-tjgovcn[10] httpwwwnsccszgovcn[11] httpwwwsscnetcn[12] W Chen and J Zhai Preliminary Analysis on Communication

Performance of Dawning 5000A 863 High Performance Com-puter Testing Center of Tsinghua University 2008

[13] Intel Corporation Intel MPI Library for Linux OS ReferenceManual Intel Corporation 2011 httpssoftwareintelcomsitesproductsdocumentationhpcmpilinuxreference manualpdf

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 5: Research Article An Optimized Parallel FDTD Topology for ...downloads.hindawi.com/journals/ijap/2015/690510.pdf · Although there are many publications on parallel FDTD, few of them

International Journal of Antennas and Propagation 5

Figure 4: The diagram of the communication mode across nodes for the virtual topologies 5 × 6 × 4 and 6 × 5 × 4. (a) 5 × 6 × 4; (b) 6 × 5 × 4.

For example, the topologies 6 × 5 × 4 and 5 × 6 × 4 have the same number of communication grids, yet their calculation times are 107.641 seconds and 72.490 seconds, respectively. It is generally accepted that communication between processes within one node costs less time than communication between processes on different nodes [12, 13]. We therefore speculate that the difference between the two cases is caused by the different amounts of grid data communicated across nodes.

For 5 × 6 × 4, 4 × (1200 × 300) + 1 × (1200 × 300) = 1,800,000 FDTD grids need to be communicated across nodes, and its calculation time is 72.490 seconds, while for 6 × 5 × 4, 5 × (1200 × 300) + (1 + 2/6) × (1200 × 300) = 2,280,000 FDTD grids need to be communicated across nodes, and its calculation time is 107.641 seconds. The way this communication data is counted is shown in Figure 4.

In Figure 4, every two adjacent numbers with different colors exchange data. The numbers 1 to 10 denote the ten nodes, and adjacent numbers with the same color lie on the same node. For (b), the adjacent columns transfer (1 + 2/6) × (1200 × 300) data in the xoz plane, and the adjacent rows transfer 5 × (1200 × 300) data in the yoz plane.
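The interface-grid count above can be sketched as a short routine. This is a minimal sketch, assuming the 1200 × 1200 × 300 global test grid implied by the per-face terms (1200 × 300 and 1200 × 1200) quoted in the text; `comm_total` is a hypothetical helper name, not from the paper.

```python
def comm_total(grid, topo):
    """Total FDTD cells exchanged across all subdomain interfaces
    for a process topology (px, py, pz) over an Nx x Ny x Nz grid:
    each interior interface plane contributes one full cross-section."""
    nx, ny, nz = grid
    px, py, pz = topo
    return ((px - 1) * ny * nz    # interfaces normal to x
            + (py - 1) * nx * nz  # interfaces normal to y
            + (pz - 1) * nx * ny) # interfaces normal to z

grid = (1200, 1200, 300)  # assumed global grid of the NSCC-SZ test
print(comm_total(grid, (4, 6, 2)))  # 4320000, matching Table 3
print(comm_total(grid, (3, 5, 4)))  # 6480000
```

Note that permuting the first two factors (e.g., 4 × 6 × 2 versus 6 × 4 × 2) leaves this total unchanged here because the grid's x and y extents are equal, which is why several rows of Table 3 share one communication amount.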

4.2.2. NSCC-SZ. Table 3 compares the total computation time (in seconds) for different numbers of CPU cores and different virtual topology schemes. The maximum number of CPU cores used in this test is 480.

In Table 3, virtual topology schemes are denoted (x × y × z) for all three communication patterns, and the meaning of each factor is the same as described in Section 4.2.1 above.

The speedup and parallel efficiency of the code are shown in Figure 5, from which it can be seen that the parallel efficiency reaches up to 80% on NSCC-SZ.

Table 3: Comparisons of virtual topology, amount of communication, and computation time on NSCC-SZ.

CPU cores | Virtual topology (x × y × z) | Amount of communication | Computation time (s)
48  | 4 × 6 × 2   | 4,320,000  | 127.102
48  | 6 × 4 × 2   | 4,320,000  | 125.612
60  | 5 × 6 × 2   | 4,680,000  | 105.502
60  | 3 × 10 × 2  | 5,400,000  | 104.447
60  | 3 × 5 × 4   | 6,480,000  | 121.409
60  | 10 × 2 × 3  | 6,480,000  | 111.717
96  | 6 × 8 × 2   | 5,760,000  | 59.699
96  | 8 × 4 × 3   | 6,480,000  | 63.709
96  | 6 × 4 × 4   | 7,200,000  | 60.482
96  | 4 × 6 × 4   | 7,200,000  | 64.113
120 | 6 × 10 × 2  | 6,480,000  | 49.296
120 | 10 × 6 × 2  | 6,480,000  | 75.248
120 | 8 × 5 × 3   | 6,840,000  | 51.025
120 | 5 × 8 × 3   | 6,840,000  | 56.613
240 | 10 × 12 × 2 | 8,640,000  | 28.612
240 | 8 × 10 × 3  | 8,640,000  | 29.648
240 | 10 × 8 × 3  | 8,640,000  | 32.807
240 | 12 × 10 × 2 | 8,640,000  | 41.73
360 | 12 × 10 × 3 | 10,080,000 | 19.746
360 | 10 × 12 × 3 | 10,080,000 | 27.873
360 | 8 × 15 × 3  | 10,440,000 | 18.146
360 | 12 × 15 × 2 | 10,440,000 | 29.734
480 | 10 × 12 × 4 | 11,520,000 | 14.748
480 | 12 × 10 × 4 | 11,520,000 | 15.653
480 | 15 × 8 × 4  | 11,880,000 | 15.757
480 | 15 × 16 × 2 | 11,880,000 | 16.254
480 | 12 × 8 × 5  | 12,240,000 | 16.185
480 | 8 × 12 × 5  | 12,240,000 | 16.53

Figure 5: The speedup and parallel efficiency of the code from 48 to 480 CPU cores on NSCC-SZ. (a) Speedup; (b) Parallel efficiency.

From Table 3 it can be seen that, for virtual topologies with the same dimensions, the scheme with fewer total grids at the interfaces (i.e., a smaller amount of communication data) saves calculation time effectively. This conclusion coincides with the case on NSCC-TJ.

The amount of communication grids of each virtual topology is calculated by (3). However, for certain topologies with the same number of communication grids, the calculation time turns out to be less for the topology with more crossing-node communication, when analyzed in the same way as on NSCC-TJ. This is contrary to the speculation above. We therefore ask whether it is caused by the most heavily loaded processes carrying too much of the communication in the topology. Next, the cases of 10 × 2 × 3 and 3 × 5 × 4 on 60 cores are taken as examples to examine this speculation (the amount of communication of both is 6,480,000).

For 10 × 2 × 3, the crossing-node communication is 4 × (1200 × 300) + 1 × (1200 × 300) = 1,800,000, while for 3 × 5 × 4 it is 2 × (1200 × 300) + (16/12) × (1200 × 300) = 1,200,000. Yet the consumption time for 10 × 2 × 3 is 111.717 seconds and that for 3 × 5 × 4 is 121.409 seconds. This does not agree with the crossing-node speculation drawn on NSCC-TJ. To explore the reason further, the heaviest and the lightest communication loads of the processes in these two virtual topology schemes are listed in Table 4. The heaviest communication load in the virtual topology scheme 10 × 2 × 3 is (1200 × 300)/30 + (1200 × 300) × 2/6 + (1200 × 1200) × 2/20 = 276,000, while in the virtual topology scheme 3 × 5 × 4 it is (1200 × 300) × 2/12 + (1200 × 300) × 2/20 + (1200 × 1200) × 2/15 = 288,000. The heaviest communication loads are carried by the processes located at the center of the process grid; similarly, the lightest communication loads, carried by the processes at the corners of the process grid, can be calculated. The difference between the heaviest and the lightest communication load in the virtual topology scheme 3 × 5 × 4 is larger than that in 10 × 2 × 3, which results in the different computation times. This indicates that when the total amount of communication is the same, a virtual topology scheme with a more balanced communication load may bring a better performance, even with more crossing-node communication.

Figure 6: The model of the microstrip antenna array.

Table 4: Comparisons of communication load.

Topology   | Heaviest | Lightest | Difference
10 × 2 × 3 | 276,000  | 144,000  | 132,000
3 × 5 × 4  | 288,000  | 144,000  | 144,000
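The per-process loads behind Table 4 can be reproduced with a small sketch: each process exchanges one subdomain face of cells per neighbor, so interior processes carry up to six faces while corner processes carry only three. This assumes the 1200 × 1200 × 300 global grid inferred from the face terms quoted in the text; `process_load` is a hypothetical helper name.

```python
def process_load(grid, topo, pos):
    """Cells the process at index pos = (i, j, k) of the process grid
    exchanges with its face neighbors."""
    nx, ny, nz = [g // p for g, p in zip(grid, topo)]  # local subdomain size
    faces = [ny * nz, nx * nz, nx * ny]  # face areas normal to x, y, z
    load = 0
    for d in range(3):
        # 1 face at a boundary of the process grid, 2 faces in the interior
        neighbors = (pos[d] > 0) + (pos[d] < topo[d] - 1)
        load += neighbors * faces[d]
    return load

grid = (1200, 1200, 300)  # assumed global grid of the NSCC-SZ test
print(process_load(grid, (10, 2, 3), (5, 0, 1)))  # heaviest: 276000
print(process_load(grid, (3, 5, 4), (1, 2, 2)))   # heaviest: 288000
print(process_load(grid, (10, 2, 3), (0, 0, 0)))  # lightest: 144000
```

The heaviest minus lightest spread (132,000 for 10 × 2 × 3 versus 144,000 for 3 × 5 × 4) is the imbalance measure used in the comparison above.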

4.3. The General Rules of Optimal Virtual Topology. Generally, when the amount of FDTD cells is the same, the MPI virtual topology that transfers less data saves computation time among virtual topologies of the same dimensions. The best performance of a parallel FDTD code can therefore be obtained by optimizing the MPI virtual topology scheme. The general rules for better performance are as follows.

(a) Select the MPI virtual topology scheme that makes the total communication L (equation (3)) the smallest.


Figure 7: The radiation patterns of the array, computed by FDTD and compared with MoM. (a) xoz plane; (b) yoz plane.

Figure 8: The model of the microstrip array.

(b) When the total communication L is the same, select the topology with less crossing-node communication.

(c) When the amount of crossing-node communication of different topologies is approximately the same, select the topology with a more balanced communication load.
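Rule (a) can be sketched as a small selection routine that enumerates every factorization of the core count into (x, y, z) and keeps the schemes with the smallest total communication; rules (b) and (c) would then break ties among the survivors using the node layout and the load balance. This is a sketch under the assumption of the 1200 × 1200 × 300 test grid; `best_topologies` is a hypothetical helper name, not part of the paper's code.

```python
def comm_total(grid, topo):
    """Total interface cells for topology (px, py, pz), as in equation (3)."""
    nx, ny, nz = grid
    px, py, pz = topo
    return (px - 1) * ny * nz + (py - 1) * nx * nz + (pz - 1) * nx * ny

def best_topologies(grid, cores):
    """All (x, y, z) factorizations of `cores` with the minimal comm_total."""
    topos = [(x, y, cores // (x * y))
             for x in range(1, cores + 1) if cores % x == 0
             for y in range(1, cores // x + 1) if (cores // x) % y == 0]
    lowest = min(comm_total(grid, t) for t in topos)
    return lowest, [t for t in topos if comm_total(grid, t) == lowest]

L, winners = best_topologies((1200, 1200, 300), 60)
print(L, winners)  # 4680000 [(5, 6, 2), (6, 5, 2)]
```

For the 60-core case this recovers 5 × 6 × 2 (and its mirror 6 × 5 × 2) as the rule-(a) optimum, consistent with the fastest 60-core rows of Table 3; rule (b) would then distinguish the two mirrors by their crossing-node traffic.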

5. Applications Using 10 Thousand CPU Cores

Based on the optimal virtual topology rules above, the parallel FDTD code is applied to analyze some complicated EM problems, which are run on the Shanghai Supercomputer Center (SSC) platform.

5.1. Radiation of a Microstrip Antenna Array

5.1.1. Validation of the FDTD Code. To validate the FDTD code, a microstrip antenna array with hundreds of identical antenna units is analyzed, and the results are compared with those provided by MoM. The model is shown in Figure 6. The amount of total grids is 786 × 1224 × 54. It is calculated on a PC with the virtual topology scheme 4 × 3 × 2.

The radiation patterns of the array, compared with those provided by MoM, are shown in Figure 7, from which one can see very good agreement between them.

Figure 9: The parallel efficiency from 4096 to 10240 cores (ideal parallel efficiency versus this paper's approach).


Figure 10: The radiation patterns of the ellipse microstrip antenna array. (a) xoz plane; (b) yoz plane.

Figure 11: The model of the airplane.

5.1.2. Radiation of an Ellipse Antenna Array. An ellipse microstrip antenna array with nearly 2000 elements is shown in Figure 8. For this array, the amount of total grids is 6016 × 1160 × 54 (about 0.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores, and the parallel efficiency from 4096 to 10240 cores is measured.

The parallel efficiency of the code is shown in Figure 9, and the radiation patterns of the array are shown in Figure 10.

From Figure 9, the parallel efficiency reaches nearly 80% at 10240 cores with 4096 cores as the benchmark, which is one of the best efficiencies ever reported using more than 10 thousand CPU cores.
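The 80% figure uses a 4096-core run, not a serial run, as the baseline; efficiency is then the measured speedup divided by the ideal (linear) speedup relative to that baseline. A minimal sketch, with hypothetical wall-clock times chosen only to illustrate the arithmetic:

```python
def parallel_efficiency(n_base, t_base, n, t):
    """Efficiency of an n-core run measured against an n_base-core baseline."""
    speedup = t_base / t   # measured speedup versus the baseline run
    ideal = n / n_base     # ideal (linear) speedup for the core ratio
    return speedup / ideal

# Hypothetical timings in seconds, for illustration only:
eff = parallel_efficiency(4096, 250.0, 10240, 125.0)
print(f"{eff:.2f}")  # 0.80
```

With these illustrative numbers the 10240-core run is 2.0× faster than the 4096-core baseline against an ideal of 2.5×, giving the 80% efficiency quoted above.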

5.2. RCS of an Electrically Large Airplane

5.2.1. Simulation Model. The airplane is shown in Figure 11. The parameters are as follows. The incident wave frequency is 4.0 GHz, the direction of incidence is +y, and the polarization is E_z. The size of the airplane is 52.4256 m × 21.0312 m × 2.944213 m, that is, an electrical size of about 700λ × 280λ × 39λ. The grid size is dx = dy = dz = 0.0075 m, so the amount of total grids is 7020 × 2840 × 420 (about 8.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores.

5.2.2. Result. The computation time is 603.686 seconds. The RCS of the airplane is shown in Figure 12.

6. Conclusion

A guideline is presented for using parallel FDTD on supercomputers with more than 10 thousand CPU cores, based on the theoretical communication model given in this paper. The benchmarks obtained on two supercomputers validate the optimal virtual topology rules. Radiation from a large microstrip antenna array and scattering from an electrically large airplane are simulated successfully, which indicates the capability of the presented method for these types of real-life EM problems.


Figure 12: The RCS of the airplane. (a) xoy plane; (b) xoz plane; (c) yoz plane.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National High Technology Research and Development Program of China (863 Program) (2012AA01A308), the NSFC (61301069, 61072019), the Project with Contract no. 2013KJXX-67, and the Program for New Century Excellent Talents in University of China (NCET-13-0949). The computational resources utilized in this research are provided by the National Supercomputer Center in Tianjin (NSCC-TJ), the National Supercomputing Center in Shenzhen (NSCC-SZ), and the Shanghai Supercomputer Center (SSC).

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 2000.

[2] D. Ge and Y. Yan, Finite-Difference Time-Domain Method for Electromagnetic Waves, Version 3, Xidian University Press, Xi'an, China, 2011 (Chinese).

[3] J. L. Volakis, D. B. Davidson, C. Guiffaut, and K. Mahdjoubi, "A parallel FDTD algorithm using the MPI library," IEEE Antennas and Propagation Magazine, vol. 43, no. 2, pp. 94–103, 2001.

[4] U. Andersson, Time-Domain Methods for the Maxwell Equations [Ph.D. thesis], Royal Institute of Technology, Stockholm, Sweden, 2001.

[5] Y. Zhang, J. Song, and C. H. Liang, "MPI-based parallelized locally conformal FDTD for modeling slot antennas and new periodic structures in microstrip," Journal of Electromagnetic Waves and Applications, vol. 18, no. 10, pp. 1321–1335, 2004.

[6] Y. Zhang, J. Song, and C. Liang, "Study on the parallel modified locally conformal FDTD algorithm on cluster of PCs for PBG structures," Acta Electronica Sinica, vol. 31, no. 12A, pp. 2142–2144, 2003.

[7] Z. Yu, D. Wei, and C. Liang, "Analysis of parallel performance of MPI based parallel FDTD on PC clusters," in Proceedings of the Asia-Pacific Microwave Conference (APMC '05), vol. 4, December 2005.

[8] W. Yu, X. Yang, Y. Liu, et al., "A new direction in computational electromagnetics: solving large problems using the parallel FDTD on the BlueGene/L supercomputer providing teraflop-level performance," IEEE Antennas and Propagation Magazine, vol. 50, no. 2, pp. 26–44, 2008.

[9] http://www.nscc-tj.gov.cn/

[10] http://www.nsccsz.gov.cn/

[11] http://www.ssc.net.cn/

[12] W. Chen and J. Zhai, Preliminary Analysis on Communication Performance of Dawning 5000A, 863 High Performance Computer Testing Center of Tsinghua University, 2008.

[13] Intel Corporation, Intel MPI Library for Linux OS Reference Manual, Intel Corporation, 2011, https://software.intel.com/sites/products/documentation/hpc/mpi/linux/reference_manual.pdf.


Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 7: Research Article An Optimized Parallel FDTD Topology for ...downloads.hindawi.com/journals/ijap/2015/690510.pdf · Although there are many publications on parallel FDTD, few of them

International Journal of Antennas and Propagation 7

Figure 7: The radiation patterns of the array in the (a) xoz plane and (b) yoz plane (FDTD results compared with MoM).

Figure 8: The model of the microstrip array.

(b) When the total communication L is the same, choose the topology with less crossing-node communication.

(c) When the amount of crossing-node communication of different topologies is approximately the same, select the topology with the more balanced communication load.
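The rules above can be illustrated with a small search over candidate MPI virtual topologies. The sketch below is ours, not the authors' code: it scores every factorization of the core count by the total communication L, modeled as the cells exchanged per time step across all internal cut planes; rules (b) and (c) would be applied as tie-breakers on top of this.

```python
def factor_triples(p):
    """All (px, py, pz) with px * py * pz == p."""
    triples = []
    for px in range(1, p + 1):
        if p % px:
            continue
        q = p // px
        for py in range(1, q + 1):
            if q % py == 0:
                triples.append((px, py, q // py))
    return triples

def total_communication(dims, grid):
    """Total communication L: cells exchanged per time step, counting two
    one-cell-thick faces per internal cut plane of a px*py*pz split."""
    (px, py, pz), (nx, ny, nz) = dims, grid
    return 2 * ((px - 1) * ny * nz + (py - 1) * nx * nz + (pz - 1) * nx * ny)

def best_topology(cores, grid):
    """The factorization of `cores` that minimizes L for this grid."""
    return min(factor_triples(cores), key=lambda d: total_communication(d, grid))
```

For a cubic region the balanced split wins, for example best_topology(8, (100, 100, 100)) gives (2, 2, 2); for a flat grid such as the 6016 × 1160 × 54 array of Section 5.1.2, cuts along the thin z axis are penalized and an elongated topology in x and y is preferred.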

5. Applications Using Tens of Thousands of CPU Cores

Based on the optimal virtual topology scheme above, the parallel FDTD code is applied to analyze several complicated EM problems, which are run on the Shanghai Supercomputer Center (SSC) platform.

5.1. Radiation of a Microstrip Antenna Array

5.1.1. Validation of the FDTD Code. To validate the FDTD code, a microstrip antenna array with hundreds of identical antenna units is analyzed, and the results are compared with those provided by MoM. The model is shown in Figure 6. The total number of grids is 786 × 1224 × 54. It is calculated on a PC with the virtual topology scheme 4 × 3 × 2.

The radiation patterns of the array are shown in Figure 7 together with those provided by MoM. From Figure 7, one can see very good agreement between the two methods.

Figure 9: The parallel efficiency of 4096–10240 cores (ideal parallel efficiency versus this paper's approach).
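The 4 × 3 × 2 virtual topology amounts to mapping each MPI rank to a coordinate in the process grid and assigning it a subdomain slab per axis. A pure-Python sketch of that bookkeeping follows; the function names are ours, and a real code would obtain the same mapping from MPI_Cart_create and MPI_Cart_coords.

```python
def cart_coords(rank, dims):
    """Rank -> (x, y, z) process coordinates, row-major as in
    MPI_Cart_create's default ordering."""
    px, py, pz = dims
    assert 0 <= rank < px * py * pz
    return (rank // (py * pz), (rank // pz) % py, rank % pz)

def local_range(coord, nprocs, ncells):
    """Half-open cell range [lo, hi) owned along one axis when ncells
    cells are split across nprocs processes as evenly as possible."""
    base, rem = divmod(ncells, nprocs)
    lo = coord * base + min(coord, rem)
    return lo, lo + base + (1 if coord < rem else 0)
```

With dims = (4, 3, 2) and the 786 × 1224 × 54 validation grid, the last rank (23) gets coordinates (3, 2, 1) and owns x-cells [590, 786); the per-axis ranges tile the grid with load imbalance of at most one cell plane.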


Figure 10: The radiation patterns of the ellipse microstrip antenna array in the (a) xoz plane and (b) yoz plane.

Figure 11: The model of the airplane.

5.1.2. Radiation of an Ellipse Antenna Array. An ellipse microstrip antenna array with nearly 2000 elements is shown in Figure 8. In this array, the total number of grids is 6016 × 1160 × 54 (about 0.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores, and the parallel efficiency from 4096 to 10240 cores is measured.

The parallel efficiency of this code is shown in Figure 9, and the radiation patterns of the array are shown in Figure 10.

From Figure 9, the parallel efficiency reaches nearly 80% at 10240 cores with 4096 cores as the benchmark, which is among the best efficiencies ever reported using more than 10 thousand CPU cores.
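Parallel efficiency relative to a baseline run is E = (N_ref · T_ref) / (N · T_N), so 80% at 10240 cores against a 4096-core baseline means a speedup of about 2.0 where 2.5 would be ideal. A minimal sketch, with made-up timings for illustration only (the paper does not report these raw times):

```python
def parallel_efficiency(n_ref, t_ref, n, t):
    """Efficiency of an n-core run of wall time t against an n_ref-core
    baseline of wall time t_ref; 1.0 means ideal strong scaling."""
    return (n_ref * t_ref) / (n * t)

# Hypothetical numbers: a 4096-core run taking 1000 s, and a 10240-core
# run taking 500 s instead of the ideal 400 s, gives 80% efficiency.
eff = parallel_efficiency(4096, 1000.0, 10240, 500.0)
```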

5.2. RCS of an Electrically Large Airplane

5.2.1. Simulation Model. The airplane is shown in Figure 11. The parameters are as follows. The incident wave frequency is 4.0 GHz, the direction of incidence is +y, and the polarization is E_z. The size of the airplane is 52.4256 m × 21.0312 m × 2.944213 m, that is, an electric size of about 700λ × 280λ × 39λ. The grid size is dx = dy = dz = 0.0075 m, and the total number of grids is 7020 × 2840 × 420 (about 8.4 billion). It is tested on the Shanghai Supercomputer Center (SSC) using 10240 cores.
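The grid counts above follow directly from the physical extent and the λ/10 cell size (52.4256 m / 0.0075 m ≈ 6990 cells, with the remaining cells of the 7020 accounting for absorbing-boundary and free-space padding). A quick sanity-check sketch; the memory model of six field components in single precision is our assumption, not a figure from the paper:

```python
def cell_count(extent_m, dx, pad=0):
    """Cells along one axis: physical extent over cell size, plus padding
    cells for the absorbing boundary and free-space margin."""
    return int(round(extent_m / dx)) + pad

def field_memory_gb(nx, ny, nz, components=6, bytes_per_value=4):
    """Rough storage for the E and H field arrays alone (assumed: six
    components stored as 4-byte floats); real codes also hold coefficients."""
    return nx * ny * nz * components * bytes_per_value / 2**30

cells = (7020, 2840, 420)              # the airplane grid from the text
total = cells[0] * cells[1] * cells[2]  # about 8.4 billion cells
```

Under these assumptions the field arrays alone need close to 190 GB, which is why the problem is distributed over thousands of cores rather than run on a single node.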

5.2.2. Result. The computation time of this simulation is 603686 seconds. The RCS of the airplane is shown in Figure 12.

6. Conclusion

A guideline is presented for using parallel FDTD on supercomputers with more than 10 thousand CPU cores, based on the theoretical communication model given in this paper. The benchmarks obtained on two supercomputers validate the optimal virtual topology rules. Radiation from a large microstrip antenna array and scattering from an electrically large airplane are simulated successfully, which indicates the capability of the method presented in this paper for these types of real-life EM problems.

Figure 12: The RCS of the airplane in the (a) xoy plane, (b) xoz plane, and (c) yoz plane.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National High Technology Research and Development Program of China (863 Program) (2012AA01A308), the NSFC (61301069, 61072019), the Project with Contract no. 2013KJXX-67, and the Program for New Century Excellent Talents in University of China (NCET-13-0949). The computational resources utilized in this research are provided by the National Supercomputer Center in Tianjin (NSCC-TJ), the National Supercomputing Center in Shenzhen (NSCC-SZ), and the Shanghai Supercomputer Center (SSC).

References

[1] A. Taflove, Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, Norwood, Mass, USA, 2000.

[2] D. Ge and Y. Yan, Finite-Difference Time-Domain Method for Electromagnetic Waves, Version 3, Xidian University Press, Xi'an, China, 2011 (Chinese).

[3] J. L. Volakis, D. B. Davidson, C. Guiffaut, and K. Mahdjoubi, "A parallel FDTD algorithm using the MPI library," IEEE Antennas and Propagation Magazine, vol. 43, no. 2, pp. 94–103, 2001.

[4] U. Andersson, Time-domain methods for the Maxwell equations [Ph.D. thesis], Royal Institute of Technology, Stockholm, Sweden, 2001.

[5] Y. Zhang, J. Song, and C. H. Liang, "MPI-based parallelized locally conformal FDTD for modeling slot antennas and new periodic structures in microstrip," Journal of Electromagnetic Waves and Applications, vol. 18, no. 10, pp. 1321–1335, 2004.

[6] Y. Zhang, J. Song, and C. Liang, "Study on the parallel modified locally conformal FDTD algorithm on cluster of PCs for PBG structures," Acta Electronica Sinica, vol. 31, no. 12A, pp. 2142–2144, 2003.

[7] Z. Yu, D. Wei, and C. Liang, "Analysis of parallel performance of MPI based parallel FDTD on PC clusters," in Proceedings of the Asia-Pacific Microwave Conference (APMC '05), vol. 4, December 2005.

[8] W. Yu, X. Yang, Y. Liu, et al., "A new direction in computational electromagnetics: solving large problems using the parallel FDTD on the BlueGene/L supercomputer providing teraflop-level performance," IEEE Antennas and Propagation Magazine, vol. 50, no. 2, pp. 26–44, 2008.

[9] http://www.nscc-tj.gov.cn.

[10] http://www.nsccsz.gov.cn.

[11] http://www.ssc.net.cn.

[12] W. Chen and J. Zhai, Preliminary Analysis on Communication Performance of Dawning 5000A, 863 High Performance Computer Testing Center of Tsinghua University, 2008.

[13] Intel Corporation, Intel MPI Library for Linux OS Reference Manual, Intel Corporation, 2011, https://software.intel.com/sites/products/documentation/hpc/mpi/linux/reference_manual.pdf.

