

CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 9(10), 975–987 (OCTOBER 1997)

Parallel computation for natural convection

P. WANG AND R. D. FERRARO

Jet Propulsion Laboratory, California Institute of Technology, MS 168-522, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA

SUMMARY
Parallel computation for two-dimensional convective flows in cavities with adiabatic horizontal boundaries, driven by differential heating of the two vertical end walls, is investigated using the Intel Paragon, Intel Touchstone Delta, Cray T3D and IBM SP2. The numerical scheme, including a parallel multigrid solver, and the domain decomposition techniques for parallel computing are discussed in detail. Performance comparisons are made for the different parallel systems, and numerical results using various numbers of processors are discussed. © 1997 by John Wiley & Sons, Ltd.

NOMENCLATURE
h        height of cavity
l        length of cavity
g        acceleration due to gravity
L = l/h  aspect ratio of cavity
R        Rayleigh number
T        non-dimensional temperature
x, z     non-dimensional co-ordinates
u, w     non-dimensional velocity components
NPES     number of processors

Greek symbols
β        coefficient of thermal expansion
ψ        non-dimensional stream function
κ        thermal diffusivity
ν        kinematic viscosity
σ        Prandtl number
ω        non-dimensional vorticity function

1. INTRODUCTION

Convective motions driven by lateral temperature gradients in cavities are important in many areas of interest in industry and in nature. Applications include the temperature control of circuit board components under natural convection in the electronics industry, heating and ventilation control in building design and construction, cooling systems for nuclear reactors in the nuclear industry, flows and heat transfer associated with all stages of the power generation process, solar-energy collectors in the power industry, and atmospheric and fluvial dispersion in the environment.

Due to the wide range of applications, studies of natural convection flow and heat transfer have been pursued vigorously for many years. A typical model of convection driven by a lateral thermal gradient consists of a two-dimensional rectangular cavity with the two vertical end walls held at different constant temperatures. In order to determine the flow structure and heat transfer across cavities with different physical properties, numerous analytical, experimental and computational techniques have been used.

Experimental investigations of cavity flows driven by lateral heating have been reported in [1,2]. In general, these flows consist of a main circulation in which fluid rises at the hot wall, sinks at the cold wall and travels laterally across the intervening core region. The flow structure is dependent on three non-dimensional parameters: a Rayleigh number R based on the height of the cavity and the temperature difference across the end walls, the Prandtl number σ of the fluid, and the aspect ratio L (length/height). For cavities of different aspect ratios, Rayleigh numbers and Prandtl numbers, extensive numerical results have been obtained by researchers during the past 25 years [3–5].

For a shallow cavity (L → ∞) and Rayleigh numbers R ≪ L the flow is again dominated by conduction and consists of a Hadley cell driven by the constant horizontal temperature gradient set up between the end walls. Cormack, Leal and Imberger[6] predicted that the flow consists of two distinct parts: a parallel flow in the core region which extends for most of the length of the cavity and a second, non-parallel flow near the ends. Non-linear convective effects first become significant at the ends of the cavity, where the flow is turned, when R1 = R/L = O(1). The non-linear end-zone structure has been studied in [7–10].

The present study focuses on parallel computation for convective flows in cavities with L ≫ 1. The problem formulation and numerical approach are given in Section 2, and the parallel computing techniques are discussed in Section 3. The numerical results obtained on different parallel systems with various numbers of processors are discussed in Section 4. Finally, conclusions are outlined in Section 5.

2. FORMULATION AND NUMERICAL APPROACH

A closed rectangular cavity of length l and height h is considered in which two-dimensional motions are generated by maintaining the vertical end walls at different fixed temperatures T0 and T0 + ΔT. The appropriate governing equations, subject to the Boussinesq approximation, can be written in non-dimensional form as

σ⁻¹(∂ω/∂t + J(ω, ψ)) = ∇²ω + R ∂T/∂x    (1)

∇²ψ = −ω    (2)

∂T/∂t + J(T, ψ) = ∇²T    (3)

where

σ = ν/κ,    R = gβΔTh³/(κν)    (4)

are the Prandtl number and the Rayleigh number, respectively. Here the pressure has been eliminated and ω, ψ and T are the vorticity, stream function and temperature, respectively, with the two Jacobians given by

J(ω, ψ) = (∂ω/∂x)(∂ψ/∂z) − (∂ω/∂z)(∂ψ/∂x),    J(T, ψ) = (∂T/∂x)(∂ψ/∂z) − (∂T/∂z)(∂ψ/∂x)    (5)


The top and bottom walls are insulating. The equations possess an exact parallel-flow solution[6,8] for a shallow cavity,

T = x + c + R1 F(z),    ψ = R1 F′(z)    (6)

where

F(z) = z⁵/120 − z⁴/48 + z³/72 − 1/1440    (7)

R1 = R/L, and c is a constant which is determined by a full solution of the end-zone problems[10].

The finite difference method is used for the whole computation. The Dufort–Frankel scheme[11], an explicit three-time-level method with second-order accuracy, is used for the vorticity (1) and energy (3) equations. The multigrid method[12], based on a five-point scheme with successive over-relaxation, is used for the Poisson equation (2) and has proven to be an effective and fast method. In the present code, a complete V-cycle scheme on four-level grids is used.
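As an illustration of the time-stepping, the following NumPy sketch shows one Dufort–Frankel step for the energy equation (3) on a uniform grid; the function name, the array layout and the central-difference treatment of the Jacobian are illustrative assumptions rather than the authors' code. The same three-level update, with the buoyancy term R ∂T/∂x added to the right-hand side, applies to the vorticity equation (1).

import numpy as np

def dufort_frankel_T(T_old, T_now, psi, dx, dz, dt):
    """Return T at time level k+1 from levels k (T_now) and k-1 (T_old).

    Dufort-Frankel form of dT/dt + J(T, psi) = lap(T): the centre value of
    the Laplacian is replaced by the average of the k+1 and k-1 levels,
    which keeps the scheme explicit and three-level.
    """
    T_new = T_now.copy()                       # boundary values are kept

    # central-difference Jacobian J(T, psi) = T_x psi_z - T_z psi_x
    T_x   = (T_now[2:, 1:-1] - T_now[:-2, 1:-1]) / (2.0 * dx)
    T_z   = (T_now[1:-1, 2:] - T_now[1:-1, :-2]) / (2.0 * dz)
    psi_x = (psi[2:, 1:-1]   - psi[:-2, 1:-1])   / (2.0 * dx)
    psi_z = (psi[1:-1, 2:]   - psi[1:-1, :-2])   / (2.0 * dz)
    jac = T_x * psi_z - T_z * psi_x

    ax, az = dt / dx**2, dt / dz**2
    num = ((1.0 - 2.0 * ax - 2.0 * az) * T_old[1:-1, 1:-1]
           + 2.0 * ax * (T_now[2:, 1:-1] + T_now[:-2, 1:-1])
           + 2.0 * az * (T_now[1:-1, 2:] + T_now[1:-1, :-2])
           - 2.0 * dt * jac)
    T_new[1:-1, 1:-1] = num / (1.0 + 2.0 * ax + 2.0 * az)
    return T_new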

3. PARALLEL COMPUTING TECHNIQUES

3.1. Computing systems

In the following, the four parallel systems that served as the major computing platforms for the present work are introduced.

3.1.1. Intel Touchstone Delta

The Delta at the Concurrent Supercomputing Consortium at the California Institute of Technology is a message-passing MIMD (multiple instruction multiple data) multicomputer, consisting of an ensemble of individual and autonomous nodes that communicate across a two-dimensional mesh interconnection network. It has 512 computational i860 nodes, each with 16 Mbytes of memory and a peak speed of 60 MFLOPS. A Concurrent File System (CFS) with a total of 95 Gbytes of formatted disk space is attached to the nodes. The operating system is Intel's Node Executive for the mesh (NX/M).

3.1.2. Intel Paragon XP/S

This MIMD distributed-memory machine has a 2D mesh topology, with faster processors and a faster network than the Delta system. The one at the Jet Propulsion Laboratory is currently configured with 56 compute nodes, each with a peak speed of 75 MFLOPS and 32 Mbytes of memory. The operating system is Paragon OSF/1, based on the OSF/1 operating system from the Open Software Foundation. The NX communication library was used for the present study, which makes the code portable to the Delta machine.

3.1.3. Cray T3D

The Cray T3D at JPL, currently one of the most powerful MIMD computers available, has 256 compute nodes, each with a peak performance of 150 MFLOPS and 64 Mbytes of memory. Logically it presents a shared memory; physically the memory is distributed, with each block associated with a processor. It uses a three-dimensional torus as the interconnect network. A message-passing library is available, based on the PVM3 software developed by the Oak Ridge National Laboratory, the University of Tennessee and Emory University.

3.1.4. IBM SP2

The IBM SP2 is a 160-node MIMD parallel computer composed of IBM RS6000/590 workstations connected by a fast network, with message-passing software that makes it look like a single parallel computer. Each node has 128 Mbytes of memory, and the individual nodes are very powerful. System reliability is expected to be good because the SP2 uses commodity workstation hardware and software; it bridges the gap between workstations and supercomputers. Several message-passing libraries, such as MPI, MPL and PVMe, are available on the SP2. MPI (the message-passing interface) is a new message-passing standard developed by a group of representatives from industry, government and academia (including IBM and NAS); MPI is to message passing as HPF (High Performance Fortran) is to data-parallel languages, and it is intended to solve the problems of portability and performance caused by the lack of a standard. MPL is IBM's native message-passing library, formerly known as EUI-H, and is a fairly robust library. PVMe is a rewritten and optimized version of the Parallel Virtual Machine message-passing library from Oak Ridge National Laboratory and the University of Tennessee; it is currently compatible with PVM 3.2. Our present study on the SP2 uses MPI as the message-passing library.

Numerical results from these systems are discussed later.

3.2. Domain decomposition

In order to implement a parallel code with the Dufort–Frankel–multigrid method for natural convective flow problems in rectangular cavities, the original two-dimensional fine mesh is partitioned into blocks of consecutive columns (L ≫ 1) and distributed onto a logically linear array of processors (Figure 1). Each processor owns one subdomain (Figure 2), so the whole computational job is divided into n subjobs when n processors are used. This is a natural data partitioning for the above geometries, since it minimizes the communication among subdomains. Once the partition structure is set up, the next concern is how to update the information held at neighboring subdomains so that the computation remains continuous over the whole computational domain.

Figure 1. An example of a 4 × 16 computational mesh with 4 processors

Figure 2. Data partition on the computational domain
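As a sketch of the bookkeeping behind this partition, the fragment below computes the block of columns owned by each processor and allocates the local arrays with one ghost column on each side for the neighbors' boundary values; the names and the assumption that the number of columns divides evenly among the processors are illustrative, not taken from the paper's code.

import numpy as np

def local_partition(nx_global, nz, rank, npes):
    """Return the owned column range [x_start, x_end) and the local arrays."""
    nx_local = nx_global // npes        # assume an even division, as in Figure 1
    x_start = rank * nx_local
    x_end = x_start + nx_local
    # One ghost column on each side holds the neighboring subdomain's
    # boundary values (see Figures 2 and 4).
    shape = (nx_local + 2, nz)
    T, psi, omega = (np.zeros(shape) for _ in range(3))
    return x_start, x_end, T, psi, omega

# Example matching Figure 1: a 4 x 16 mesh on 4 processors, so processor 1
# owns columns 4-7 and stores 4 + 2 ghost columns of 4 points each.
x_start, x_end, T, psi, omega = local_partition(nx_global=16, nz=4, rank=1, npes=4)
print(x_start, x_end, T.shape)          # -> 4 8 (6, 4)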

3.3. Parallel Dufort–Frankel–multigrid algorithms

One of the major concerns in the design of a parallel code is communication. The major part of the communication is the exchange of information between each subdomain and its neighbors, which is done by message passing at each iteration level on the Intel Paragon, the Delta and the Cray T3D. Since only the values at the boundaries of each subdomain need to be updated at each iteration, the total amount of communication is still relatively small compared with the whole computation. The other part of the communication occurs in I/O operations. The main structure of the parallel algorithm for the whole computation is briefly summarized by the following steps (a minimal sketch of steps 3 and 8 is given after the list):

1. Partition the computational mesh.
2. Set initial conditions on each subdomain.
3. Exchange edge values with neighboring processors.
4. Perform the Dufort–Frankel update of T^(k)_{i,j} locally on each processor.
5. Perform the Dufort–Frankel update of ω^(k)_{i,j} locally at all interior grid points on each processor.
6. Perform the parallel multigrid update of ψ^(k)_{i,j} on all processors.
7. Perform boundary calculations for ω^(k)_{i,j} on all processors involving boundaries.
8. Check the conditions for a steady-state solution. If they are satisfied, stop. Otherwise, advance the time step, k ← k + 1, and go to 3.
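The sketch below, using mpi4py, indicates how steps 3 and 8 might look on the 1D processor array: each processor swaps its edge columns with its left and right neighbors, and the steady-state test is completed with a global maximum reduction. The routine names are illustrative assumptions; the production code used the NX, PVM and MPI libraries of the respective machines.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, npes = comm.Get_rank(), comm.Get_size()
left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < npes - 1 else MPI.PROC_NULL

def exchange_edges(f):
    """Step 3: swap boundary columns with the two neighbors (1D partition).
    f has shape (nx_local + 2, nz); rows 0 and -1 are the ghost columns."""
    comm.Sendrecv(np.ascontiguousarray(f[1, :]),  dest=left,  sendtag=0,
                  recvbuf=f[-1, :], source=right, recvtag=0)
    comm.Sendrecv(np.ascontiguousarray(f[-2, :]), dest=right, sendtag=1,
                  recvbuf=f[0, :],  source=left,  recvtag=1)

def steady_state(T_new, T_now, w_new, w_now, eps1=1.0e-6, eps2=1.0e-6):
    """Step 8: global maximum change in T and omega over all processors."""
    dT = comm.allreduce(np.max(np.abs(T_new[1:-1] - T_now[1:-1])), op=MPI.MAX)
    dw = comm.allreduce(np.max(np.abs(w_new[1:-1] - w_now[1:-1])), op=MPI.MAX)
    return dT < eps1 and dw < eps2

# Inside the time loop (steps 4-7 between these two calls):
#     exchange_edges(T); exchange_edges(omega); exchange_edges(psi)
#     ... local Dufort-Frankel updates and the parallel multigrid solve ...
#     if steady_state(T_new, T, omega_new, omega): break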

The parallel multigrid solver is summarized as follows. For the Poisson equation ∇²ψ = −ω, the discrete form can be written as A_h ψ_h = ω_h, and the parallel V-cycle scheme ψ_h ← Mψ_h(ψ_h, ω_h) with total grid levels = N is outlined as:

1. Do k = 1, N − 1
   Relax n1 times on A_h ψ_h = ω_h with a given initial guess φ_h and, after each relaxation iteration, exchange edge values with neighbors
   ω_2h ← I_h^2h (ω_h − A_h ψ_h), ψ_2h ← 0
   Enddo

2. k = N (the coarsest grid); solve A_h ψ_h = ω_h

3. Do k = N − 1, 1
   Correct ψ_h ← ψ_h + I_2h^h ψ_2h
   Relax n2 times on A_h ψ_h = ω_h with initial guess φ_h and, after each relaxation iteration, exchange edge values with neighbors
   Enddo

Here I_h^2h and I_2h^h are the restriction and interpolation operators, respectively; in the present study, injection and linear interpolation are used. An example of the multigrid structure with a partitioning on four processors is given in Figure 3. The data distribution on each processor on one grid level is illustrated in Figure 4, and a similar strategy is used for the rest of the grid levels. Here each processor needs to store its own subdomain data and its neighbor boundary data as well.

Figure 3. Data partition of the 4 × 16 fine grid and the derived coarse grids with four processors

Figure 4. Data distribution on each processor for the 4 × 16 grid
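For completeness, a compact serial skeleton of this V-cycle is sketched below in NumPy: point SOR relaxation on the five-point scheme, restriction by injection and prolongation by linear interpolation, written recursively and assuming grid dimensions of the form 2^m + 1 on every level. In the parallel code each relaxation sweep would be followed by the edge exchange shown earlier; all names and defaults here are illustrative assumptions rather than the authors' implementation.

import numpy as np

def relax(psi, rhs, h, sweeps, w_sor=1.5):
    """Point SOR sweeps on the five-point scheme A_h psi = rhs, A_h = -laplacian.
    In the parallel code an edge exchange with the neighboring processors
    follows each sweep."""
    for _ in range(sweeps):
        for i in range(1, psi.shape[0] - 1):
            for j in range(1, psi.shape[1] - 1):
                gs = 0.25 * (h * h * rhs[i, j] + psi[i + 1, j] + psi[i - 1, j]
                             + psi[i, j + 1] + psi[i, j - 1])
                psi[i, j] = (1.0 - w_sor) * psi[i, j] + w_sor * gs
    return psi

def residual(psi, rhs, h):
    """r = rhs - A_h psi on the interior points (zero on the boundary)."""
    r = np.zeros_like(psi)
    r[1:-1, 1:-1] = rhs[1:-1, 1:-1] - (4.0 * psi[1:-1, 1:-1]
                                       - psi[2:, 1:-1] - psi[:-2, 1:-1]
                                       - psi[1:-1, 2:] - psi[1:-1, :-2]) / h**2
    return r

def prolong(e2h):
    """Linear interpolation of the coarse-grid correction to the fine grid."""
    nx, nz = 2 * (e2h.shape[0] - 1) + 1, 2 * (e2h.shape[1] - 1) + 1
    e = np.zeros((nx, nz))
    e[::2, ::2] = e2h
    e[1::2, ::2] = 0.5 * (e2h[:-1, :] + e2h[1:, :])
    e[::2, 1::2] = 0.5 * (e2h[:, :-1] + e2h[:, 1:])
    e[1::2, 1::2] = 0.25 * (e2h[:-1, :-1] + e2h[1:, :-1]
                            + e2h[:-1, 1:] + e2h[1:, 1:])
    return e

def v_cycle(psi, rhs, h, level, n_levels, n1=2, n2=2):
    """One V-cycle for A_h psi = rhs (here rhs = omega, since lap(psi) = -omega)."""
    psi = relax(psi, rhs, h, n1)
    if level < n_levels:
        r2h = residual(psi, rhs, h)[::2, ::2]      # restriction by injection
        e2h = v_cycle(np.zeros_like(r2h), r2h, 2.0 * h, level + 1, n_levels, n1, n2)
        psi += prolong(e2h)                        # coarse-grid correction
    else:
        psi = relax(psi, rhs, h, 50)               # 'solve' on the coarsest grid
    return relax(psi, rhs, h, n2)

# e.g. psi = v_cycle(psi, omega, h, level=1, n_levels=4) gives one complete
# four-level V-cycle of the kind used in the present code.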


4. RESULTS AND DISCUSSION

Various numerical experiments have been carried out on the Intel Delta, the Intel Paragon, the Cray T3D and the IBM SP2. A model with R1 = 10,000, σ = 0.733, L = 16 and a 64 × 1024 mesh was tested on those machines. The computation was stopped when the following conditions were satisfied, corresponding to the attainment of a steady-state solution:

max_{i,j} |T^(k+1)_{i,j} − T^(k)_{i,j}| < ε1

max_{i,j} |ω^(k+1)_{i,j} − ω^(k)_{i,j}| < ε2

where k is the time level index and ε1 and ε2 are usually taken to be 10⁻⁶.

In order to compare the performance on each system, 16 processors were used for the parallel code with the above numerical model. The computational results are shown in Table 1, which lists the total wallclock time on the test problem for the four systems. For this problem with a fixed 64 × 1024 grid, the IBM SP2 gives the best performance, the Cray T3D performs better than the Paragon, and the Delta ranks last in terms of speed. However, the ranking might change when a different grid size and a different number of processors are used; further comparisons are discussed later. Through various tests on the parallel systems and comparisons with some previous results, the parallel code has proven to be numerically stable, efficient and reliable.

Table 1. Wallclock times (seconds) using 16 processors on the Delta, the Paragon, the Cray T3D and the IBM SP2 for the problem with parameters as noted in the table

Parameters: Mesh = 64 × 1024, R1 = 10⁴, σ = 0.733, L = 16

System                    Wallclock time, s
Intel Touchstone Delta    1099
Intel Paragon              824
Cray T3D                   197
IBM SP2                    150

Numerical results were obtained for various Rayleigh and Prandtl numbers. Here results for the flow in water are discussed. The solution for R1 = 2000 and σ = 6.983 in the steady state is illustrated in Figure 5 by the contour plots of stream function, vorticity and temperature. The results have been compared with the asymptotic structures predicted by boundary-layer theory[9]. According to the asymptotic predictions, for σ = 0.733, c0 = 5.2849 × 10⁻⁵, and the c in (6) satisfies c ∼ R1^(7/5) c0(σ). Our numerical computation gives c0 = 4.9605 × 10⁻⁵, which is in excellent agreement with the theoretical prediction. Since the theory is based on a high Rayleigh number, numerical simulation for a higher Rayleigh number in the case of water will be considered in future work.

For low Prandtl numbers with a small aspect ratio, numerical computation for the whole cavity is considered. Figure 6 shows the contour plots of stream function, vorticity and temperature for σ = 0.005, R1 = 40,000 and L = 4 with a mesh size equal to 64 × 1024. The results show the existence of secondary flow consistent with the linear stability analysis in [7,8].


Figure 5. Contours of the steady-state solution for (a) stream function, (b) vorticity and (c) temperature, for σ = 6.983, R1 = 2 × 10³ and x∞ = 16, using a 64 × 1024 computational grid

In order to demonstrate that the scheme achieves the goal of load balance, more numerical experiments have been carried out on the Cray T3D, the Paragon and the Delta with various numbers of processors. Figures 7–9 show the scaling performance of the parallel computation code on the Paragon, the Cray T3D and the IBM SP2, respectively. Various meshes have been used with a test model of R1 = 400, σ = 0.733, L = 128 for a total of ten time steps. The largest problem has a global grid of 512 × 65,536 distributed on 256 processors, with a total of 167,506,944 unknowns.

Figure 6. Contour plots of (a) stream function, (b) vorticity and (c) temperature for σ = 0.005, R = 4 × 10⁴ and L = 4, using a 64 × 1024 computational grid for the whole cavity

For the results on the Paragon, Figure 7 illustrates the execution time versus the number of processors. It is evident that, for a small problem size, the code does not give good scaling performance, and it starts to slow down when more processors come into play. This is mainly because the computational load for a small problem is relatively small and the communication accounts for a much larger fraction of the whole code when a large number of processors is used. Once the grid size increases to 64 × 8192, the performance is improved, and the larger the grid size used, the better the speed-up obtained. When the grid size reaches 128 × 16,384, a nearly linear speed-up is obtained. For the results on the Cray T3D, Figure 8 shows a similar pattern of scaling performance, with some differences. First, the total time is smaller than that on the Paragon, owing to the hardware differences discussed previously. For a small grid size the T3D loses more speed-up than the Paragon, but it recovers when the grid size reaches 64 × 8192, and excellent speed-ups are obtained through parallelism when larger grid sizes are applied. For the results on the IBM SP2, Figure 9 gives the performance for larger grid sizes. As the IBM SP2 is faster than the other three systems, the total time for a small problem with a total of ten time steps is very small, so we started our computation with a 64 × 8192 grid and took a 512 × 65,536 grid as the largest problem. The performance on the IBM SP2 shows some differences from the other systems; here a nearly linear speed-up is obtained when the grid size reaches 256 × 32,768.

Since the IBM SP2 is faster than the other systems, the ratio of computation time to communication time changes, so a similar scaling performance is obtained only by using a larger grid size.

The difference in speed-up for a small problem on these systems is mainly due to the hardware differences in the internal communication networks and the different communication libraries used, which make a significant contribution for a code that has a larger proportion of communication than computation. For a large problem, the computation becomes the dominant part of the code, and the communication no longer plays a major role. Hence, the code achieves good speed-up on all of these systems.
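This reasoning can be made concrete with a rough per-time-step cost model for the 1D column partition; the cost constants below are illustrative assumptions, not measured values, but the resulting ratio shows why the communication overhead fades as the grid grows.

# Rough per-time-step cost model for the 1D column partition (illustrative).
# Each processor updates nx_local * nz points and exchanges 2 columns of nz
# points with its neighbors, so the communication-to-computation ratio
# scales like npes / nx_global, independently of nz.
def comm_to_comp_ratio(nx_global, nz, npes, t_flop=1.0, t_word=50.0):
    nx_local = nx_global / npes
    comp = nx_local * nz * t_flop          # local stencil updates
    comm = 2 * nz * t_word                 # two ghost-column exchanges
    return comm / comp

print(comm_to_comp_ratio(1024, 8, 16))     # small grid: overhead dominates
print(comm_to_comp_ratio(16384, 128, 16))  # large grid: overhead is small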

Figure 7. Scaling performance on the Paragon with grid sizes 8 × 1024, 16 × 2048, 32 × 4096, 64 × 8192, 128 × 16,384 and 256 × 32,768 (distinguished by plot symbol)

Figure 8. Scaling performance on the Cray T3D with grid sizes 8 × 1024, 16 × 2048, 32 × 4096, 64 × 8192, 128 × 16,384 and 256 × 32,768 (distinguished by plot symbol)

Figure 9. Scaling performance on the IBM SP2 with grid sizes 128 × 16,384, 256 × 32,768 and 512 × 65,536 (distinguished by plot symbol)

Figure 10. Comparison on the Delta, the Paragon and the Cray T3D (distinguished by line style) for the grid sizes 16 × 16 × NPES, 64 × 64 × NPES and 256 × 256 × NPES (distinguished by plot symbol)

More intensive studies of the scaling performance on the different parallel systems have been carried out, and in Figure 10 a comparison is made for various grid sizes and numbers of processors on the Cray T3D, the Paragon and the Delta. Here a model with R1 = 1000, σ = 0.733 and L equal to the number of processors, run for ten time steps, was tested on these machines. In theory, each line in Figure 10 should be constant, since the total workload on each processor is the same if the communication is negligible. In our study, three grid sizes, 16 × 16 × NPES, 64 × 64 × NPES and 256 × 256 × NPES, were chosen for comparison over numbers of processors ranging from 2 to 256; here NPES is the number of processors. The curves for 16 × 16 × NPES show that the execution time increases gradually with increasing numbers of processors, with slightly faster increases on the Paragon and the Delta than on the Cray T3D. Each of these curves has a slight jump from two processors to four processors and then increases smoothly, because the communication structure of the code with two processors is slightly different from that with more than two processors. When two processors are used, each processor exchanges information across only one side boundary (with the 1D partition), but when more than two processors are used each processor has to exchange information across two side boundaries, except for the first and the last processors, which only exchange information across one side boundary. Since the 16 × 16 × NPES grid is very small, the communication time is much larger than the computation time, and the difference between the two-processor and four-processor communication structures therefore shows up in each curve. For 64 × 64 × NPES, the curves tend towards constant lines as the computation time increases. For the 256 × 256 × NPES case, the three curves have very similar behaviour and all give straight lines with very little variation; in this case the numerical results are very close to the theoretical prediction. Here again, it is interesting to note the total execution time for each system in each grid case. For each grid case, the Delta takes nearly double the total time of the Paragon, while the Cray T3D has a varying speed-up compared with the other systems. It is evident that the larger the grid size used, the faster the Cray T3D performs relative to the other systems. This is because, for a large grid size, the communication part has little effect on the whole performance, so the speed-up is mainly determined by the hardware of the systems. From these studies we can say that the code, based on the domain decomposition method and an iterative scheme, is scalable if the grid size is greater than a certain value. In other words, this kind of problem can take advantage of parallel systems if a large grid size is used. More numerical results for high Rayleigh numbers and large grids on parallel systems are given in the forthcoming paper[13].

5. CONCLUSIONS

In this paper, a detailed numerical study of natural convection on different parallel systems has been described. A domain decomposition technique has been used efficiently in solving the Poisson equation with a multigrid method and the other two time-dependent equations with the Dufort–Frankel scheme. The parallel code shows good convergence and efficiency, which indicates the potential of parallel systems for solving large, complicated fluid dynamics problems. The present numerical solutions appear to be in good agreement with theoretical predictions. The discrepancy in wallclock times for the whole computation on the four systems is due to the differences in hardware and in the network connections used. In terms of speed, the IBM SP2 ranks first, and the Cray T3D is much faster than the Intel Paragon and Delta.

ACKNOWLEDGEMENTS

The research described in this paper was carried out by the Jet Propulsion Laboratory, California Institute of Technology, and was sponsored by the National Research Council and the National Aeronautics and Space Administration while one of the authors (P. Wang) held an NRC-JPL Research Associateship.

Reference herein to any specific commercial product, process or service by trade name, trademark, manufacturer or otherwise does not constitute or imply its endorsement by the United States Government, the National Research Council or the Jet Propulsion Laboratory, California Institute of Technology.

This research was performed in part using the CSCC parallel computer system operated by Caltech on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided by NASA. The Cray supercomputer used in this investigation was provided by funding from the NASA Offices of Mission to Planet Earth, Aeronautics, and Space Science. Access to the Intel Paragon at the Jet Propulsion Laboratory, California Institute of Technology, was provided by the NASA High Performance Computing and Communications Office.

REFERENCES

1. P. G. Simpkins and K. S. Chen, 'Convection in horizontal cavities', J. Fluid Mech., 166, 21–39 (1986).
2. J. C. Patterson and S. W. Armfield, 'Transient features of natural convection in a cavity', J. Fluid Mech., 219, 469–497 (1990).
3. D. E. Cormack, L. G. Leal and J. H. Seinfeld, 'Natural convection in a shallow cavity with differentially heated end walls. Part 2. Numerical solutions', J. Fluid Mech., 65, 231–246 (1974).
4. A. Bejan and C. L. Tien, 'Laminar natural convection heat transfer in a horizontal cavity with different end temperatures', Trans. ASME: J. Heat Transf., 100, 641–647 (1978).
5. J. E. Drummond and S. A. Korpela, 'Natural convection in a shallow cavity', J. Fluid Mech., 182, 543–564 (1987).
6. D. E. Cormack, L. G. Leal and J. Imberger, 'Natural convection in a shallow cavity with differentially heated end walls. Part 1. Asymptotic theory', J. Fluid Mech., 65, 209–229 (1974).
7. J. E. Hart, 'Low Prandtl number convection between differentially heated end walls', Int. J. Heat Mass Transf., 26, 1069–1074 (1983).
8. P. G. Daniels, P. A. Blythe and P. G. Simpkins, 'Onset of multicellular convection in a shallow laterally heated cavity', Proc. R. Soc. Lond. A, 411, 327–350 (1987).
9. P. G. Daniels, 'High Rayleigh number thermal convection in a shallow laterally heated cavity', Proc. R. Soc. Lond. A, 440, 273–289 (1993).
10. P. Wang and P. G. Daniels, 'Numerical solutions for the flow near the end of a shallow laterally heated cavity', J. Eng. Math., 28, 211–226 (1994).
11. P. J. Roache, Computational Fluid Dynamics, Hermosa, Albuquerque, NM, USA, 1976.
12. A. Brandt, 'Multi-level adaptive solutions to boundary-value problems', Math. Comput., 31, 333–390 (1977).
13. P. Wang and R. D. Ferraro, 'A numerical study of high Rayleigh number thermal convection in shallow cavities', in preparation.