
Page 1: Parallelizing Highly Dynamic N-Body Simulations

PARALLELIZING HIGHLY DYNAMIC N-BODY SIMULATIONS

Joachim Stadel ([email protected])

University of Zurich

Institute for Theoretical Physics

Page 2: Parallelizing Highly Dynamic N-Body Simulations

Meaning of “highly dynamic”

The push to ever higher resolution means there is an ever-widening range of timescales that needs to be captured by the simulation.

In rocky planet formation simulations the range spans from the outermost orbital timescale down to the timescale of a collision between planetesimals (years to hours).

Including super massive black holes in galaxy simulations also creates a huge range in time scales.

Galaxy formation simulations with hydro, cooling, star formation, feedback etc. also have enormous ranges in dynamical times (as well as large numbers of particles).

Page 3: Parallelizing Highly Dynamic N-Body Simulations

Despite the fact that it is not symplectic, the KDK multi-stepping integrator has been found to work quite well for cosmological applications. For planet formation problems one needs to be much more careful and use a high-order Hermite, SyMBA-like, or Mercury-like integration scheme with the tree code.

Multiple time-stepping integrators work quite well.
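As a reminder of the building block being multi-stepped, a minimal single-rung KDK (kick-drift-kick) step for one particle might look like the sketch below; the Particle type and the caller-supplied accel() routine are illustrative assumptions, and the real multi-stepping scheme applies such steps recursively over a hierarchy of rungs.

/* Minimal KDK (kick-drift-kick) leapfrog step of size dt for one particle.
 * accel() is a caller-supplied force routine; this is only a sketch. */
typedef struct { double r[3], v[3]; } Particle;

void kdkStep(Particle *p, double dt,
             void (*accel)(const double r[3], double a[3]))
{
    double a[3];
    accel(p->r, a);                                            /* forces at start   */
    for (int i = 0; i < 3; i++) p->v[i] += 0.5 * dt * a[i];    /* K: half kick      */
    for (int i = 0; i < 3; i++) p->r[i] += dt * p->v[i];       /* D: full drift     */
    accel(p->r, a);                                            /* forces at new pos */
    for (int i = 0; i < 3; i++) p->v[i] += 0.5 * dt * a[i];    /* K: half kick      */
}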

Page 4: Parallelizing Highly Dynamic N-Body Simulations

There isn’t time to go into detail on the method of deciding an appropriate time-step for a particle. Traditionally dt ~ sqrt(eps/acc) has been used in cosmology, but we really want something that is representative of the dynamical time.

Time-step criteria, old and new.
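For concreteness, the two flavors of criterion could be sketched as follows (eta, eps, acc, G and rho are generic symbols for this illustration, not pkdgrav2 parameter names); the dynamical-time flavor uses dt ~ 1/sqrt(G*rho) with the local density rho.

#include <math.h>

/* Traditional criterion: dt = eta * sqrt(eps / |acc|), with softening eps. */
double dtAccel(double eta, double eps, double acc)
{
    return eta * sqrt(eps / acc);
}

/* Dynamical-time flavor: dt ~ eta / sqrt(G * rho), with local density rho. */
double dtDynamical(double eta, double G, double rho)
{
    return eta / sqrt(G * rho);
}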

Page 5: Parallelizing Highly Dynamic N-Body Simulations

This is a fairly modest example. The red line shows where half of the theoretical work lies if high-density particles take the same amount of time per force calculation as low-density particles. The real half-work line can lie even deeper in the time-step hierarchy.

Rung distribution as function of time.

Page 6: Parallelizing Highly Dynamic N-Body Simulations

One billion particles of 1000 solar masses each form a galactic-sized dark matter halo with a wealth of substructure. This simulation required 1.6 million CPU hours (2008) and ran on 1000 cores.

The Central 50 kpc of the GHALO Simulation

Page 7: Parallelizing Highly Dynamic N-Body Simulations

Each higher resolution seems to converge on the Einasto profile, which becomes shallower than r^-1 toward the center and does not converge to a fixed power law.

Density profiles of successively zoomed simulation.

Page 8: Parallelizing Highly Dynamic N-Body Simulations

Small differences in accuracy parameters set during the simulation will, however, cause slightly different orbits in the smaller halos (the system is chaotic, after all).

Keeping the same large-scale modes allows a 1-to-1 comparison of the simulation at different resolutions.

Page 9: Parallelizing Highly Dynamic N-Body Simulations

These refinements are created by the codes of Doug Potter (Uni Zurich), who has parallelized both the GRAFIC1 and, more significantly, the GRAFIC2 codes of Bertschinger.

Creating such initial conditions is still a very difficult job, but it is becoming more automated now.

Actual refinement strategy in 2 cases: GHALO & SR

Page 10: Parallelizing Highly Dynamic N-Body Simulations

No mixing of low-res particles inside of the halo.

Page 11: Parallelizing Highly Dynamic N-Body Simulations

On big steps where N_active is large all is good.

Page 12: Parallelizing Highly Dynamic N-Body Simulations

On smaller sub-steps there is a lot of imbalance.

Page 13: Parallelizing Highly Dynamic N-Body Simulations

The lower-left panel shows the case where the network protocol was not working properly, so the dark green bars are not representative. Scaling is only just acceptable at 2000 processors after a lot of tuning of the code. Note that at 2000 processors the light green bars exceed the dark red of step 0 (forces on all particles)!

Where all the time goes for each rung.

Page 14: Parallelizing Highly Dynamic N-Body Simulations

…how it is getting more parallel…and how to deal with this.

HPC Hardware

Page 15: Parallelizing Highly Dynamic N-Body Simulations

HPC Hardware Paradigms

Multi-core nodes connected to an HPC network: 8 to 24 cores/node, 1000s of nodes!

Multi-GPU nodes connected to an HPC network: 512 cores/GPU (Fermi), 4 GPUs/node, 1000s of nodes?

Each level (core, node, HPC network) comes with a bandwidth bottleneck.

Page 16: Parallelizing Highly Dynamic N-Body Simulations

Load Balancing Techniques 1: MIMD

DOMAIN DECOMPOSITION:
- High degree of data locality.
- Very good computation/communication ratio if a lot of particles are active in each domain.
- Difficult to balance computing and memory usage.
- Adapting to changing computation is expensive.

THREAD SCHEDULING:
- Adapting to changing computation is cheap.
- Can balance memory use and computation.
- Lower degree of data locality.
- Overheads for switching threads.
- Duplication of work when a small number of particles are active.

Page 17: Parallelizing Highly Dynamic N-Body Simulations

Load Balancing Techniques 1: MIMD

DOMAIN DECOMPOSITION: pkdgrav2 & gasoline. Typically MPI code; ORB or space-filling curve decomposition (a small space-filling-curve sketch follows below).

THREAD SCHEDULING: ChaNGa (CHARM++). Objects called chares, which each have O(100-1000) particles including all parent tree cells. These “tree pieces” are scheduled dynamically. Very good scaling beyond 1000 CPUs (e.g. Blue Gene). Particularly good scaling on other tasks such as neighbor searching!
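As an aside, domain decomposition along a space-filling curve can be sketched very compactly. The Morton (Z-order) key routine below is a generic illustration, not code taken from pkdgrav2 or gasoline, and it assumes coordinates already normalized to [0,1); particles sorted by this key can then be cut into contiguous, spatially local chunks, one per processor.

#include <stdint.h>

/* Spread the low 21 bits of x so there are two zero bits between each bit. */
static uint64_t spreadBits(uint64_t x)
{
    x &= 0x1fffff;                                   /* keep 21 bits */
    x = (x | x << 32) & 0x1f00000000ffffULL;
    x = (x | x << 16) & 0x1f0000ff0000ffULL;
    x = (x | x <<  8) & 0x100f00f00f00f00fULL;
    x = (x | x <<  4) & 0x10c30c30c30c30c3ULL;
    x = (x | x <<  2) & 0x1249249249249249ULL;
    return x;
}

/* 63-bit Morton key from coordinates normalized to [0,1). */
uint64_t mortonKey(double x, double y, double z)
{
    uint64_t ix = (uint64_t)(x * 2097152.0);         /* 2^21 cells per dimension */
    uint64_t iy = (uint64_t)(y * 2097152.0);
    uint64_t iz = (uint64_t)(z * 2097152.0);
    return spreadBits(ix) | spreadBits(iy) << 1 | spreadBits(iz) << 2;
}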

Page 18: Parallelizing Highly Dynamic N-Body Simulations

Load Balancing Techniques 2: SIMD

For on-chip SSE instructions and for the cores of a GPU we need a different approach. Each core should execute exactly the same instruction stream, but use different data. Load balance is very good as long as not too much dummy “filler” data needs to be used. Vectors need to be a multiple of nCore in length (see the padding sketch below).

How can we do this in the context of a tree code, where branching (different instructions) seems to be inherent in the tree-walk algorithm?
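A tiny sketch of the “filler” idea follows; the Particle type, SIMD_WIDTH and padList are hypothetical names for this illustration. Zero-mass dummy entries pad an interaction list up to a multiple of the vector width so that every lane executes the same instructions while contributing no force.

#define SIMD_WIDTH 4   /* e.g. 4 floats per SSE vector */

typedef struct { float x, y, z, m; } Particle;

/* Round n up to a multiple of SIMD_WIDTH and fill the tail with zero-mass
 * dummies placed far from the origin. The caller must have allocated room
 * for the padding. Returns the padded length. */
int padList(Particle *list, int n)
{
    int nPad = (n + SIMD_WIDTH - 1) / SIMD_WIDTH * SIMD_WIDTH;
    Particle dummy = { 1e30f, 1e30f, 1e30f, 0.0f };  /* mass 0 => no force */
    for (int i = n; i < nPad; i++)
        list[i] = dummy;
    return nPad;
}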

Page 19: Parallelizing Highly Dynamic N-Body Simulations

I won’t talk about the details of domain decomposition or thread scheduling, since these are quite complicated and also of little use for GPU programming.

We will see that no branches are needed to implement a tree code!

SIMD Parallelizing the Tree Code

Page 20: Parallelizing Highly Dynamic N-Body Simulations
Page 21: Parallelizing Highly Dynamic N-Body Simulations
Page 22: Parallelizing Highly Dynamic N-Body Simulations
Page 23: Parallelizing Highly Dynamic N-Body Simulations
Page 24: Parallelizing Highly Dynamic N-Body Simulations
Page 25: Parallelizing Highly Dynamic N-Body Simulations

[Figure: tree of cells a–p being walked, showing the checklist (C), the stack (S), and the CC and PP interaction lists; the same diagram is updated on the following slides.]

Cell c is larger than b.

Page 26: Parallelizing Highly Dynamic N-Body Simulations

Cell b is larger than f, but g is far enough away. Cell f stays on the checklist while g is moved to the CC list.

Page 27: Parallelizing Highly Dynamic N-Body Simulations

The checklist has been processed and we add to the local expansion at b due to g. This is the main calculation in the FMM code (usually hundreds of cells).

Page 28: Parallelizing Highly Dynamic N-Body Simulations

Cell f is far enough away, but cell e is too close and larger.

Page 29: Parallelizing Highly Dynamic N-Body Simulations

Cell l is far enough away; cell k is too close but smaller than d.

Page 30: Parallelizing Highly Dynamic N-Body Simulations

Again we have completely processed the checklist for cell d. We evaluate the local expansions for f and l about d and add this to the current L’.

Page 31: Parallelizing Highly Dynamic N-Body Simulations

This process continues, but let us assume that h, j and k are all leaf cells. This means they must interact by P-P.

Page 32: Parallelizing Highly Dynamic N-Body Simulations

The checklist is now empty, as it should be, and we can process the PP interactions. Then we pop the stack and go to the other child of d.

Page 33: Parallelizing Highly Dynamic N-Body Simulations

We are again at leaf cell j.

Page 34: Parallelizing Highly Dynamic N-Body Simulations

Again we can perform the interactions, followed by popping the stack.

Page 35: Parallelizing Highly Dynamic N-Body Simulations

We now proceed in the same manner from cell e. Note again that we have also taken the old value of L from the stack as a starting point, which included the interaction with cell g.

Page 36: Parallelizing Highly Dynamic N-Body Simulations

There are 4 cases in this example:

Case 0: Cell stays on the checklist.
Case 1: Cell is opened and its children are placed at the end of the checklist.
Case 2: Cell is moved to the PP list.
Case 3: Cell is moved to the CC list.

An opening value of 0 to 3 can be calculated arithmetically from the current cell and the cell on the checklist, i.e., no if-then-else! (A sketch of such a branchless test follows below.)

Suppose we could magically have the mapping from checklist entries to each of the 4 processing lists (based on the particular opening value). Then each processor could independently move entries to its lists using the mapping.

All cores then do case 1, then case 2, and finally case 3.
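As a rough illustration of such a branchless test: in the sketch below the Cell fields, the simple distance criterion and the name openingValue are assumptions for this example, nothing like the real ~80-operation pkdgrav2 criterion described on a later slide. Comparisons yield 0 or 1 and are combined arithmetically, so every element follows the same instruction stream.

/* Branchless classification of one checklist entry against the current cell.
 * Case 0: keep on checklist (too close, but smaller than the current cell)
 * Case 1: open the checklist cell (too close and larger)
 * Case 2: move to the PP list (a leaf/bucket that is too close)
 * Case 3: move to the CC list (well separated) */
typedef struct {
    double x, y, z;   /* cell center */
    double rOpen;     /* opening radius */
    int    isLeaf;    /* 1 if the cell is a leaf (bucket) */
} Cell;

int openingValue(const Cell *cur, const Cell *chk)
{
    double dx = cur->x - chk->x;
    double dy = cur->y - chk->y;
    double dz = cur->z - chk->z;
    double d2 = dx*dx + dy*dy + dz*dz;
    double sum = cur->rOpen + chk->rOpen;

    int far    = (d2 > sum*sum);            /* well separated           */
    int larger = (chk->rOpen > cur->rOpen); /* checklist cell is larger */
    int leaf   = chk->isLeaf;

    /* far -> 3, near leaf -> 2, near & larger -> 1, near & smaller -> 0 */
    return far*3 + (1-far)*(leaf*2 + (1-leaf)*larger);
}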

Page 37: Parallelizing Highly Dynamic N-Body Simulations

For simplicity assume 3 cases and 3 cores.

[Figure: 9 checklist entries with case values 1 0 0 | 2 1 2 | 0 1 0 assigned to the 3 threads/cores, then mapped into three per-case lists of 4 slots each (4, 3 and 2 real entries, the remaining slots filled with dummies).]

We have had perfect load balancing in the determination of cases, and have had to process 3 dummy data elements out of 12 in the second part.

This map can also be calculated in parallel if each thread counts the number of occurrences of each case. The map is found by doing a running sum (a parallel SCAN operation) over the case counts and threads: O(log P). (A sketch follows below.)
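A minimal OpenMP sketch of this count, scan and scatter idea is given below; the names (buildMap, caseOf, listStart) and the serial scan inside the single region are illustrative simplifications, not the pkdgrav2 implementation, which targets SSE/OpenMP/MPI and a true O(log P) scan.

#include <omp.h>
#include <stdlib.h>

#define NCASES 4

/* caseOf[i] in 0..NCASES-1 for each of the n checklist entries.
 * On return, outIndex[i] is entry i's slot in the concatenated per-case
 * output lists, and listStart[c] is where case c's list begins. */
void buildMap(const int *caseOf, int n, int *outIndex, int listStart[NCASES])
{
    int nThreads = omp_get_max_threads();
    /* counts[c*nThreads + t] = number of case-c entries seen by thread t */
    int *counts = calloc((size_t)NCASES * nThreads, sizeof(int));

    #pragma omp parallel
    {
        int t = omp_get_thread_num();

        /* Pass 1: each thread counts the cases in its static chunk. */
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            counts[caseOf[i]*nThreads + t]++;

        /* Pass 2: exclusive scan over (case, thread) counts -> offsets.
         * Done serially here for clarity. */
        #pragma omp single
        {
            int run = 0;
            for (int c = 0; c < NCASES; c++) {
                listStart[c] = run;
                for (int tt = 0; tt < nThreads; tt++) {
                    int tmp = counts[c*nThreads + tt];
                    counts[c*nThreads + tt] = run;
                    run += tmp;
                }
            }
        }

        /* Pass 3: scatter; each thread writes at its reserved offsets. */
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) {
            int c = caseOf[i];
            outIndex[i] = counts[c*nThreads + t]++;
        }
    }
    free(counts);
}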

Page 38: Parallelizing Highly Dynamic N-Body Simulations

Pkdgrav2: more cases to deal with.

In reality there are 5 cases even without softening, and in pkdgrav2 we actually have 9 cases.

There are about 80 operations (floating point and logical) to calculate the opening value.

Typically each cell in the tree requires the processing of 1000s of checklist elements and 100s to 1000s of elements on the processing lists.

Even if only a single particle is active we get significant speed-up in the simulation as long as the simulation is not too small.

Page 39: Parallelizing Highly Dynamic N-Body Simulations

It is still early days in implementing code like this, but I am confident that this will allow big speedups in tough, highly dynamic simulations.

Prospects

Page 40: Parallelizing Highly Dynamic N-Body Simulations

Prospects for pkdgrav2

Currently I am developing this method for use with SSE/OpenMP/MPI within pkdgrav2.

I also plan to later try thread scheduling by popping work off the stack, and also pushing work when lists get too long.

I find speedups of around 6 on 8 cores with as few as 100 checklist elements; below this the speedup drops quickly.

Page 41: Parallelizing Highly Dynamic N-Body Simulations

Prospects for GPUs and GPU clusters

I have started exploring the implementation of this method on GPUs, where the host currently does all tree building and the GPU only does the walk algorithm (calculates forces).

We want to have the entire code on the GPU so that we can simulate a modest number of particles, O(1 million), over very many timesteps entirely within GPU memory, e.g. rocky planet formation simulations.

In the long run we would like the host to handle the fuzzy edges of the domains, while the GPU is given the majority of the work in the interior of the domain.

Page 42: Parallelizing Highly Dynamic N-Body Simulations

Thank you.

Questions?

Page 43: Parallelizing Highly Dynamic N-Body Simulations

Discussion Session – Baryons?

Baryons and precision cosmology: the mass function of clusters is uncertain.

Stanek and Rudd (http://arxiv.org/abs/0809.2805): hydro simulations of the cluster mass function with and without gas cooling and preheating; this implies ~10% uncertainty in the M500 mass function. Teyssier and Davide's paper (http://arxiv.org/abs/1003.4744) discusses AGN effects on the cluster mass profile.

Rudd's paper (http://arxiv.org/abs/0703741) on baryons and the matter power spectrum.

A precision solution for baryon effects involves cooling and feedback, so the solution may require modeling star formation correctly over the entire cosmic history, beginning with the first stars. Baryon physics might explain many CDM challenges.

Page 44: Parallelizing Highly Dynamic N-Body Simulations

Discussion Session – basic assumptions

The expansion of an inhomogeneous universe versus the global expansion factor used in simulations (back-reaction and related issues): some people are very worried about it.

Buchert: http://arxiv.org/abs/0707.2153

In principle, this could create an apparent accelerating expansion without dark energy. The consensus in the field appears to be that the effect is too small, but it may be important for precision dark energy measurements.

What about the finite speed of light in the very large volume simulations being performed now for surveys?

Page 45: Parallelizing Highly Dynamic N-Body Simulations

Discussion session – numerical issues?

Lack of numerical standards for precision simulations: previous standards become obsolete as particle numbers and resolution capabilities improve. Every simulation requires convergence tests to be trusted, and few groups do them. The starting redshift is a good example.

Dark matter detection in a cosmological context: if dark matter is detected, simulations will be needed (including any baryon influences) to constrain the dark matter particle properties. If small-scale dark matter structure is important, simulations will need to resolve down to the microhalo scale (~10^20 particles/halo).

To what extent can we trust simulations of quite different models? Some examples include warm dark matter simulations, f(R) gravity models, or non-Gaussian fluctuations (f_NL).

Can we make any use of GRID computing, or is it all hype?