
Page 1: Parallelizing Highly Dynamic N-Body Simulations

PARALLELIZING HIGHLY DYNAMIC N-BODY SIMULATIONS

Joachim Stadel ([email protected])

University of Zurich

Institute for Theoretical Physics

Page 2: Parallelizing Highly Dynamic N-Body Simulations

Meaning of “highly dynamic”

The push to ever higher resolution means there is an ever-widening range of timescales that needs to be captured by the simulation.

In rocky planet formation simulations the range spans from the outermost orbital timescale down to the timescale of a collision between planetesimals (years to hours).

Including super massive black holes in galaxy simulations also creates a huge range in time scales.

Galaxy formation simulations with hydro, cooling, star formation, feedback etc. also have enormous ranges in dynamical times (as well as large numbers of particles).

Page 3: Parallelizing Highly Dynamic N-Body Simulations

Despite the fact that it is not symplectic, the KDK multi-stepping integrator has been found to work quite well for cosmological applications. For planet formation problems one needs to be much more careful and use a high-order Hermite, SyMBA-like, or Mercury-like integration scheme with the tree code.

Multiple time-stepping integrators work quite well.
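As a reminder of the building block being multi-stepped, a minimal single-rung KDK (kick-drift-kick) step for one particle might look like the sketch below; the Particle type and the caller-supplied accel() routine are illustrative assumptions, and the real multi-stepping scheme applies such steps recursively over a hierarchy of rungs.

/* Minimal KDK (kick-drift-kick) leapfrog step of size dt for one particle.
 * accel() is a caller-supplied force routine; this is only a sketch. */
typedef struct { double r[3], v[3]; } Particle;

void kdkStep(Particle *p, double dt,
             void (*accel)(const double r[3], double a[3]))
{
    double a[3];
    accel(p->r, a);                                            /* forces at start   */
    for (int i = 0; i < 3; i++) p->v[i] += 0.5 * dt * a[i];    /* K: half kick      */
    for (int i = 0; i < 3; i++) p->r[i] += dt * p->v[i];       /* D: full drift     */
    accel(p->r, a);                                            /* forces at new pos */
    for (int i = 0; i < 3; i++) p->v[i] += 0.5 * dt * a[i];    /* K: half kick      */
}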

Page 4: Parallelizing Highly Dynamic N-Body Simulations

There isn’t time to go into detail on the method of deciding an appropriate time-step for a particle. Traditionally dt ~ sqrt(eps/acc) has been used in cosmology, but we really want something that is representative of the dynamical time.

Time-step criteria, old and new.
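For concreteness, the two flavors of criterion could be sketched as follows (eta, eps, acc, G and rho are generic symbols for this illustration, not pkdgrav2 parameter names); the dynamical-time flavor uses dt ~ 1/sqrt(G*rho) with the local density rho.

#include <math.h>

/* Traditional criterion: dt = eta * sqrt(eps / |acc|), with softening eps. */
double dtAccel(double eta, double eps, double acc)
{
    return eta * sqrt(eps / acc);
}

/* Dynamical-time flavor: dt ~ eta / sqrt(G * rho), with local density rho. */
double dtDynamical(double eta, double G, double rho)
{
    return eta / sqrt(G * rho);
}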

Page 5: Parallelizing Highly Dynamic N-Body Simulations

This is a fairly modest example. The red line shows where half of the theoretical work lies if high-density particles take the same amount of time per force calculation as low-density particles. The real half-work line can lie even deeper in the time-step hierarchy.

Rung distribution as function of time.

Page 6: Parallelizing Highly Dynamic N-Body Simulations

One billion particles of 1000 solar masses each form a galactic-sized dark matter halo with a wealth of substructure. This simulation required 1.6 million CPU hours (2008) and ran on 1000 cores.

The Central 50 kpc of the GHALO Simulation

Page 7: Parallelizing Highly Dynamic N-Body Simulations

Each higher resolution seems to converge on the Einasto profile, which becomes shallower than r^-1 toward the center and does not converge to a fixed power law.

Density profiles of successively zoomed simulation.

Page 8: Parallelizing Highly Dynamic N-Body Simulations

Small differences in accuracy parameters set during the simulation will, however, cause slightly different orbits in the smaller halos (the system is chaotic, after all).

Keeping the same large-scale modes allows a 1-to-1 comparison of the simulation at different resolutions.

Page 9: Parallelizing Highly Dynamic N-Body Simulations

These refinements are created by the codes of Doug Potter (Uni Zurich), who has parallelized both the GRAFIC1 and, more significantly, the GRAFIC2 codes of Bertschinger.

Creating such initial conditions is still a very difficult job, but it is becoming more automated now.

Actual refinement strategy in 2 cases: GHALO & SR

Page 10: Parallelizing Highly Dynamic N-Body Simulations

No mixing of low-res particles inside of the halo.

Page 11: Parallelizing Highly Dynamic N-Body Simulations

On big steps where N_active is large all is good.

Page 12: Parallelizing Highly Dynamic N-Body Simulations

On smaller sub-steps there is a lot of imbalance.

Page 13: Parallelizing Highly Dynamic N-Body Simulations

The lower-left panel shows the case where the network protocol was not working properly, so the dark green bars are not representative. Scaling is only just acceptable at 2000 processors after a lot of tuning of the code. Note that at 2000 processors the light green bars exceed the dark red of step 0 (forces on all particles)!

Where all the time goes for each rung.

Page 14: Parallelizing Highly Dynamic N-Body Simulations

…how it is getting more parallel…and how to deal with this.

HPC Hardware

Page 15: Parallelizing Highly Dynamic N-Body Simulations

HPC Hardware Paradigms

Multi-core nodes connected to an HPC network: 8 to 24 cores/node, 1000s of nodes!

Multi-GPU nodes connected to an HPC network: 512 cores/GPU (Fermi), 4 GPUs/node, 1000s of nodes?

Each level (core, node, HPC network) comes with a bandwidth bottleneck.

Page 16: Parallelizing Highly Dynamic N-Body Simulations

Load Balancing Techniques 1: MIMD

DOMAIN DECOMPOSITION:
- High degree of data locality.
- Very good computation/communication ratio if a lot of particles are active in each domain.
- Difficult to balance computing and memory usage.
- Adapting to changing computation is expensive.

THREAD SCHEDULING:
- Adapting to changing computation is cheap.
- Can balance memory use and computation.
- Lower degree of data locality.
- Overheads for switching threads.
- Duplication of work when a small number of particles are active.

Page 17: Parallelizing Highly Dynamic N-Body Simulations

Load Balancing Techniques 1: MIMD

DOMAIN DECOMPOSITION: pkdgrav2 & gasoline. Typically MPI code; ORB or space-filling curve decomposition (a small space-filling-curve sketch follows below).

THREAD SCHEDULING: ChaNGa (CHARM++). Objects called chares, which each have O(100-1000) particles including all parent tree cells. These “tree pieces” are scheduled dynamically. Very good scaling beyond 1000 CPUs (e.g. Blue Gene). Particularly good scaling on other tasks such as neighbor searching!
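As an aside, domain decomposition along a space-filling curve can be sketched very compactly. The Morton (Z-order) key routine below is a generic illustration, not code taken from pkdgrav2 or gasoline, and it assumes coordinates already normalized to [0,1); particles sorted by this key can then be cut into contiguous, spatially local chunks, one per processor.

#include <stdint.h>

/* Spread the low 21 bits of x so there are two zero bits between each bit. */
static uint64_t spreadBits(uint64_t x)
{
    x &= 0x1fffff;                                   /* keep 21 bits */
    x = (x | x << 32) & 0x1f00000000ffffULL;
    x = (x | x << 16) & 0x1f0000ff0000ffULL;
    x = (x | x <<  8) & 0x100f00f00f00f00fULL;
    x = (x | x <<  4) & 0x10c30c30c30c30c3ULL;
    x = (x | x <<  2) & 0x1249249249249249ULL;
    return x;
}

/* 63-bit Morton key from coordinates normalized to [0,1). */
uint64_t mortonKey(double x, double y, double z)
{
    uint64_t ix = (uint64_t)(x * 2097152.0);         /* 2^21 cells per dimension */
    uint64_t iy = (uint64_t)(y * 2097152.0);
    uint64_t iz = (uint64_t)(z * 2097152.0);
    return spreadBits(ix) | spreadBits(iy) << 1 | spreadBits(iz) << 2;
}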

Page 18: Parallelizing Highly Dynamic N-Body Simulations

Load Balancing Techniques 2: SIMD

For on-chip SSE instructions and for the cores of a GPU we need a different approach. Each core should execute exactly the same instruction stream, but use different data. Load balance is very good as long as not too much dummy “filler” data needs to be used. Vectors need to be a multiple of nCore in length (see the padding sketch below).

How can we do this in the context of a tree code, where branching (different instructions) seems to be inherent in the tree-walk algorithm?
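A tiny sketch of the “filler” idea follows; the Particle type, SIMD_WIDTH and padList are hypothetical names for this illustration. Zero-mass dummy entries pad an interaction list up to a multiple of the vector width so that every lane executes the same instructions while contributing no force.

#define SIMD_WIDTH 4   /* e.g. 4 floats per SSE vector */

typedef struct { float x, y, z, m; } Particle;

/* Round n up to a multiple of SIMD_WIDTH and fill the tail with zero-mass
 * dummies placed far from the origin. The caller must have allocated room
 * for the padding. Returns the padded length. */
int padList(Particle *list, int n)
{
    int nPad = (n + SIMD_WIDTH - 1) / SIMD_WIDTH * SIMD_WIDTH;
    Particle dummy = { 1e30f, 1e30f, 1e30f, 0.0f };  /* mass 0 => no force */
    for (int i = n; i < nPad; i++)
        list[i] = dummy;
    return nPad;
}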

Page 19: Parallelizing Highly Dynamic N-Body Simulations

I won’t talk about the details of domain decomposition or thread scheduling, since these are quite complicated and also of little use for GPU programming.

We will see that no branches are needed to implement a tree code!

SIMD Parallelizing the Tree Code

Page 20: Parallelizing Highly Dynamic N-Body Simulations
Page 21: Parallelizing Highly Dynamic N-Body Simulations
Page 22: Parallelizing Highly Dynamic N-Body Simulations
Page 23: Parallelizing Highly Dynamic N-Body Simulations
Page 24: Parallelizing Highly Dynamic N-Body Simulations
Page 25: Parallelizing Highly Dynamic N-Body Simulations

[Figure: tree of cells a–p being walked, showing the checklist (C), the stack (S), and the CC and PP interaction lists; the same diagram is updated on the following slides.]

Cell c is larger than b.

Page 26: Parallelizing Highly Dynamic N-Body Simulations

Cell b is larger than f, but g is far enough away. Cell f stays on the checklist while g is moved to the CC list.

Page 27: Parallelizing Highly Dynamic N-Body Simulations

The checklist has been processed and we add to the local expansion at b due to g. This is the main calculation in the FMM code (usually hundreds of cells).

Page 28: Parallelizing Highly Dynamic N-Body Simulations

Cell f is far enough away, but cell e is too close and larger.

Page 29: Parallelizing Highly Dynamic N-Body Simulations

Cell l is far enough away; cell k is too close but smaller than d.

Page 30: Parallelizing Highly Dynamic N-Body Simulations

Again we have completely processed the checklist for cell d. We evaluate the local expansions for f and l about d and add this to the current L’.

Page 31: Parallelizing Highly Dynamic N-Body Simulations

This process continues, but let us assume that h, j and k are all leaf cells. This means they must interact by P-P.

Page 32: Parallelizing Highly Dynamic N-Body Simulations

The checklist is now empty, as it should be, and we can process the PP interactions. Then we pop the stack and go to the other child of d.

Page 33: Parallelizing Highly Dynamic N-Body Simulations

We are again at leaf cell j.

Page 34: Parallelizing Highly Dynamic N-Body Simulations

Again we can perform the interactions, followed by popping the stack.

Page 35: Parallelizing Highly Dynamic N-Body Simulations

We now proceed in the same manner from cell e. Note again that we have also taken the old value of L from the stack as a starting point, which included the interaction with cell g.

Page 36: Parallelizing Highly Dynamic N-Body Simulations

There are 4 cases in this example:

Case 0: Cell stays on the checklist.
Case 1: Cell is opened and its children are placed at the end of the checklist.
Case 2: Cell is moved to the PP list.
Case 3: Cell is moved to the CC list.

An opening value of 0 to 3 can be calculated arithmetically from the current cell and the cell on the checklist, i.e., no if-then-else! (A sketch of such a branchless test follows below.)

Suppose we could magically have the mapping from checklist entries to each of the 4 processing lists (based on the particular opening value). Then each processor could independently move entries to its lists using the mapping.

All cores then do case 1, then case 2, and finally case 3.
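As a rough illustration of such a branchless test: in the sketch below the Cell fields, the simple distance criterion and the name openingValue are assumptions for this example, nothing like the real ~80-operation pkdgrav2 criterion described on a later slide. Comparisons yield 0 or 1 and are combined arithmetically, so every element follows the same instruction stream.

/* Branchless classification of one checklist entry against the current cell.
 * Case 0: keep on checklist (too close, but smaller than the current cell)
 * Case 1: open the checklist cell (too close and larger)
 * Case 2: move to the PP list (a leaf/bucket that is too close)
 * Case 3: move to the CC list (well separated) */
typedef struct {
    double x, y, z;   /* cell center */
    double rOpen;     /* opening radius */
    int    isLeaf;    /* 1 if the cell is a leaf (bucket) */
} Cell;

int openingValue(const Cell *cur, const Cell *chk)
{
    double dx = cur->x - chk->x;
    double dy = cur->y - chk->y;
    double dz = cur->z - chk->z;
    double d2 = dx*dx + dy*dy + dz*dz;
    double sum = cur->rOpen + chk->rOpen;

    int far    = (d2 > sum*sum);            /* well separated           */
    int larger = (chk->rOpen > cur->rOpen); /* checklist cell is larger */
    int leaf   = chk->isLeaf;

    /* far -> 3, near leaf -> 2, near & larger -> 1, near & smaller -> 0 */
    return far*3 + (1-far)*(leaf*2 + (1-leaf)*larger);
}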

Page 37: Parallelizing Highly Dynamic N-Body Simulations

For simplicity assume 3 cases and 3 cores.

[Figure: 9 checklist entries with case values 1 0 0 | 2 1 2 | 0 1 0 assigned to the 3 threads/cores, then mapped into three per-case lists of 4 slots each (4, 3 and 2 real entries, the remaining slots filled with dummies).]

We have had perfect load balancing in the determination of cases, and have had to process 3 dummy data elements out of 12 in the second part.

This map can also be calculated in parallel if each thread counts the number of occurrences of each case. The map is found by doing a running sum (a parallel SCAN operation) over the case counts and threads: O(log P). (A sketch follows below.)
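A minimal OpenMP sketch of this count, scan and scatter idea is given below; the names (buildMap, caseOf, listStart) and the serial scan inside the single region are illustrative simplifications, not the pkdgrav2 implementation, which targets SSE/OpenMP/MPI and a true O(log P) scan.

#include <omp.h>
#include <stdlib.h>

#define NCASES 4

/* caseOf[i] in 0..NCASES-1 for each of the n checklist entries.
 * On return, outIndex[i] is entry i's slot in the concatenated per-case
 * output lists, and listStart[c] is where case c's list begins. */
void buildMap(const int *caseOf, int n, int *outIndex, int listStart[NCASES])
{
    int nThreads = omp_get_max_threads();
    /* counts[c*nThreads + t] = number of case-c entries seen by thread t */
    int *counts = calloc((size_t)NCASES * nThreads, sizeof(int));

    #pragma omp parallel
    {
        int t = omp_get_thread_num();

        /* Pass 1: each thread counts the cases in its static chunk. */
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            counts[caseOf[i]*nThreads + t]++;

        /* Pass 2: exclusive scan over (case, thread) counts -> offsets.
         * Done serially here for clarity. */
        #pragma omp single
        {
            int run = 0;
            for (int c = 0; c < NCASES; c++) {
                listStart[c] = run;
                for (int tt = 0; tt < nThreads; tt++) {
                    int tmp = counts[c*nThreads + tt];
                    counts[c*nThreads + tt] = run;
                    run += tmp;
                }
            }
        }

        /* Pass 3: scatter; each thread writes at its reserved offsets. */
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) {
            int c = caseOf[i];
            outIndex[i] = counts[c*nThreads + t]++;
        }
    }
    free(counts);
}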

Page 38: Parallelizing Highly Dynamic N-Body Simulations

Pkdgrav2: more cases to deal with.

In reality there are 5 cases even without softening, and in pkdgrav2 we actually have 9 cases.

There are about 80 operations (floating point and logical) to calculate the opening value.

Typically each cell in the tree requires the processing of 1000s of checklist elements and 100s to 1000s of elements on the processing lists.

Even if only a single particle is active we get significant speed-up in the simulation as long as the simulation is not too small.

Page 39: Parallelizing Highly Dynamic N-Body Simulations

It is still early days in implementing code like this, but I am confident that this will allow big speedups in tough, highly dynamic simulations.

Prospects

Page 40: Parallelizing Highly Dynamic N-Body Simulations

Prospects for pkdgrav2

Currently I am developing this method for use with SSE/OpenMP/MPI within pkdgrav2.

I also plan to later try thread scheduling by popping work off the stack, and also pushing work when lists get too long.

I find speedups of around 6 on 8 cores with as few as 100 checklist elements; below this the speedup drops quickly.

Page 41: Parallelizing Highly Dynamic N-Body Simulations

Prospects for GPUs and GPU clusters

I have started exploring the implementation of this method on GPUs, where the host currently does all tree building and the GPU only does the walk algorithm (calculates forces).

We want to have the entire code on the GPU so that we can simulate a modest number of particles, O(1 million), over very many timesteps entirely within GPU memory, e.g. rocky planet formation simulations.

In the long run we would like the host to handle the fuzzy edges of the domains, while the GPU is given the majority of the work in the interior of the domain.

Page 42: Parallelizing Highly Dynamic N-Body Simulations

Thank you.

Questions?

Page 43: Parallelizing Highly Dynamic N-Body Simulations

Discussion Session – Baryons?

Baryons and precision cosmology: the mass function of clusters is uncertain.

Stanek and Rudd (http://arxiv.org/abs/0809.2805): hydro simulations of the cluster mass function with and without gas cooling and preheating; this implies ~10% uncertainty in the M500 mass function. Teyssier and Davide's paper (http://arxiv.org/abs/1003.4744) discusses AGN effects on the cluster mass profile.

Rudd's paper (http://arxiv.org/abs/0703741) on baryons and the matter power spectrum.

A precision solution for baryon effects involves cooling and feedback, so the solution may require modeling star formation correctly over the entire cosmic history, beginning with the first stars. Baryon physics might explain many CDM challenges.

Page 44: Parallelizing Highly Dynamic N-Body Simulations

Discussion Session – basic assumptions

The expansion of an inhomogeneous universe versus the global expansion factor used in simulations (back-reaction and related issues): some people are very worried about it.

Buchert: http://arxiv.org/abs/0707.2153

In principle, this could create an apparent accelerating expansion without dark energy. The consensus in the field appears to be that the effect is too small, but it may be important for precision dark energy measurements.

What about the finite speed of light in the very large volume simulations being performed now for surveys?

Page 45: Parallelizing Highly Dynamic N-Body Simulations

Discussion session – numerical issues?

Lack of numerical standards for precision simulations: previous standards become obsolete as particle numbers and resolution capabilities improve. Every simulation requires convergence tests to be trusted, and few groups do them. The starting redshift is a good example.

Dark matter detection in a cosmological context: if dark matter is detected, simulations will be needed (including any baryon influences) to constrain the dark matter particle properties. If small-scale dark matter structure is important, simulations will need to resolve down to the microhalo scale (~10^20 particles/halo).

To what extent can we trust simulations of quite different models? Some examples include warm dark matter simulations, f(R) gravity models, or non-Gaussian fluctuations (f_NL).

Can we make any use of GRID computing, or is it all hype?