Computer Science X - System Simulation Group Harald Köstler ([email protected])
Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters
H. Köstler, Ch. Feichtinger
2nd International Symposium “Computer Simulations on GPU”, Freudenstadt, 29.05.2013
Contents
Motivation
waLBerla software concepts
LBM simulations on Tsubame
Future Work
Computational Science and Engineering @ LSS
Applications: multiphysics; fluid, structure; medical imaging; laser
Applied Math: LBM; multigrid; FEM; numerics
Computer Science: HPC / hardware; performance engineering; software engineering
USE_SweepSection( getLBMsweepUID() )
USE_Sweep()
swUseFunction("LBM",sweep::LBMsweep,FSUIDSet::all(),hsCPU,BSUIDSet::all());
USE_After() //Communication
Problems
Hardware: Modern HPC clusters are massively parallel at the intra-core, intra-node, and inter-node levels
Software: Applications become more complex with increasing computational power
More complex (physical) models
Code development in interdisciplinary teams
Algorithm: Many variants exist; components and parameters depend on the computational domain or grid, the type of problem, …
WALBERLA Applications
waLBerla: parallel block-structured grid framework
waLBerla @ GPU
Geometric multigrid solver on Tsubame
Computational Steering (VIPER)
CFD, fluid-structure interaction
[Plot: runtime in ms over unknowns in millions]
Boltzmann equation
Mesoscopic approach to solving the Navier-Stokes equations
The Boltzmann equation describes the statistical distribution of one particle in a fluid
f is the probability distribution function (PDF), ζ the particle velocity, and Ω(f) the change due to collisions
Models behavior of fluids in statistical physics
Lattice Boltzmann Method (LBM) solves the discrete Boltzmann equation
$\partial_t f + \zeta \cdot \nabla f = \Omega(f)$
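As a sketch of what "discrete" means here: assuming a single-relaxation-time (BGK) collision operator, which the slides do not specify, the lattice Boltzmann update over a finite set of velocities is commonly written as follows. The symbols $f_i$, $\mathbf{c}_i$, $\omega$, and $f_i^{\mathrm{eq}}$ are notation introduced for this illustration only.

```latex
f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\; t + \Delta t)
  = f_i(\mathbf{x}, t)
  - \omega \left( f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\rho, \mathbf{u}) \right),
  \qquad i = 0, \dots, q - 1
```

The right-hand side is the local collide step; moving the result to the neighbouring cell at $\mathbf{x} + \mathbf{c}_i \Delta t$ is the stream step.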
Particulate Flow Simulation
D3Q19 LBM cell
Collide and stream
$F = m \cdot a$, $M = J \cdot \alpha$
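To make the D3Q19 cell and the collide step more concrete, here is a minimal sketch for a single cell. It assumes a BGK collision operator and the standard second-order equilibrium in lattice units; the slides name neither, and all identifiers (`C`, `W`, `Cell`, `collide`, ...) are hypothetical and not taken from waLBerla. The stream step (copying each value to the neighbouring cell in direction `C[i]`) is omitted.

```cpp
#include <array>
#include <cassert>

constexpr int Q = 19;  // number of discrete velocities in D3Q19

// Lattice velocities: rest, 6 axis-aligned, 12 edge diagonals.
constexpr int C[Q][3] = {
    { 0, 0, 0},
    { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
    { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},
    { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},
    { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}
};

// Lattice weights: 1/3 (rest), 1/18 (axis-aligned), 1/36 (edge diagonals).
constexpr double W[Q] = {
    1.0 / 3.0,
    1.0 / 18.0, 1.0 / 18.0, 1.0 / 18.0, 1.0 / 18.0, 1.0 / 18.0, 1.0 / 18.0,
    1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0,
    1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0,
    1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0
};

using Cell = std::array<double, Q>;  // the 19 PDF values of one cell
using Vec3 = std::array<double, 3>;

// Second-order equilibrium distribution for density rho and velocity u.
Cell equilibrium(double rho, const Vec3& u)
{
    const double uu = u[0] * u[0] + u[1] * u[1] + u[2] * u[2];
    Cell feq{};
    for (int i = 0; i < Q; ++i) {
        const double cu = C[i][0] * u[0] + C[i][1] * u[1] + C[i][2] * u[2];
        feq[i] = W[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * uu);
    }
    return feq;
}

// Zeroth moment: macroscopic density.
double density(const Cell& f)
{
    double rho = 0.0;
    for (int i = 0; i < Q; ++i) rho += f[i];
    return rho;
}

// First moment divided by density: macroscopic velocity.
Vec3 velocity(const Cell& f, double rho)
{
    assert(rho > 0.0);
    Vec3 u{0.0, 0.0, 0.0};
    for (int i = 0; i < Q; ++i)
        for (int d = 0; d < 3; ++d) u[d] += C[i][d] * f[i];
    for (int d = 0; d < 3; ++d) u[d] /= rho;
    return u;
}

// BGK collide step: relax every PDF towards its local equilibrium.
void collide(Cell& f, double omega)
{
    assert(omega > 0.0 && omega < 2.0);
    const double rho = density(f);
    const Cell feq = equilibrium(rho, velocity(f, rho));
    for (int i = 0; i < Q; ++i) f[i] -= omega * (f[i] - feq[i]);
}
```

A cell that is already in equilibrium is a fixed point of `collide`, which gives a quick sanity check for the velocity set and the weights.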
WALBERLA CPU-GPU cluster software concepts
waLBerla framework
Main goal: provide a massively parallel and efficient software framework for multi-physics simulations
waLBerla is mainly designed for HPC clusters
waLBerla (C++): code management, standard implementations
Low-level kernels for optimized, architecture-specific computations (in C++, CUDA, Assembler)
waLBerla: Block concept
waLBerla: Sweep concept
Challenges on heterogeneous clusters I
Problem: Description of the heterogeneous compute resources
Solution: Description of all compute components per compute node in the input file
Problem: Management of the communication and compute kernels for each architecture
Solution: Kernel management based on metadata
Challenges on heterogeneous clusters II
Problem: Common communication interface
Solution: Data exchange via communication buffers, also for intra-node communication
Problem: Minimization of the heterogeneous communication overhead
Solution: Overlapping of work and communication, non-uniform domain decomposition, and intra-node communication in z-dimension
waLBerla: Communication concept
Overlapping of work and communication
waLBerla: Subblocks
Assumption: A block corresponds to a (shared-memory) compute node
Can possibly be heterogeneous (CPU + GPU)
Distributed memory communication (via MPI) is not required within one block
Divide one block into subblocks of different sizes for (static) load balancing
Subblocks map to (local) devices
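The static load balancing described above can be sketched as a proportional split of one block along z. This is an illustration only: the function name `splitLayers`, the idea of describing each device by a single relative throughput number, and the rounding rule are assumptions, not waLBerla's actual scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Split the nz cell layers of one block among the local devices (CPU, GPUs)
// in proportion to their relative throughput (assumed to be known in advance).
// Returns the number of layers assigned to each subblock.
std::vector<std::size_t> splitLayers(std::size_t nz,
                                     const std::vector<double>& throughput)
{
    assert(!throughput.empty());
    const double total =
        std::accumulate(throughput.begin(), throughput.end(), 0.0);
    assert(total > 0.0);

    std::vector<std::size_t> layers(throughput.size());
    std::size_t assigned = 0;
    std::size_t fastest = 0;
    for (std::size_t i = 0; i < throughput.size(); ++i) {
        // Round down so that no more than nz layers are handed out.
        layers[i] = static_cast<std::size_t>(nz * (throughput[i] / total));
        assigned += layers[i];
        if (throughput[i] > throughput[fastest]) fastest = i;
    }
    assert(assigned <= nz);

    // Layers lost to rounding go to the fastest device.
    layers[fastest] += nz - assigned;
    return layers;
}
```

For example, `splitLayers(100, {1.0, 3.0})` gives the slower device 25 layers and the faster one 75. Splitting along a single dimension keeps the subblock interfaces planar, in line with the intra-node communication in z-dimension mentioned earlier.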
Domain decomposition on one compute node
RESULTS LBM Simulations on Tsubame 2.0
Tsubame 2.0 in Japan
Compute nodes: 1442
Processor: Intel Xeon X5670
GPU: 3 x Nvidia Tesla M2050
Peak performance: 2.2 PFlop/s
Memory bandwidth: 633 TB/s
LINPACK performance: 1.2 PFlop/s
Power consumption: 1.4 MW
Interconnect: QDR InfiniBand
Performance Engineering
Create performance model
Identify performance bottlenecks
Create problem-specific, hardware-dependent, and highly efficient kernels
Integrate them into the software framework
[Diagram labels: Algorithm, Hardware]
Performance Model I
Input
Algorithm: LBM kernel
Generic implementation
Hardware information (bandwidth, peak performance)
Assumption
Computation time limited by memory bandwidth and instruction throughput
Communication time limited by network bandwidth and latency (for direct and collective communication)
$t_{\mathrm{total}} = t_{\mathrm{comp,outer}} + \max\left(t_{\mathrm{comp,inner}},\; t_{\mathrm{buffer}} + t_{\mathrm{comm,CPU\text{-}GPU}} + t_{\mathrm{comm,MPI}}\right)$
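A small sketch of how such a total-time model can be evaluated. It assumes the grouping in which the outer computation runs first and the inner computation then overlaps with buffer handling, the CPU-GPU transfer, and MPI communication; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Per-time-step contributions in seconds (all names are illustrative).
struct StepTimes {
    double compOuter;   // kernel on the outer (boundary) region
    double compInner;   // kernel on the inner region
    double buffer;      // packing/unpacking of communication buffers
    double commCpuGpu;  // CPU-GPU transfer
    double commMpi;     // MPI communication
};

// Assumed model: the outer computation is not overlapped, while the inner
// computation and the whole communication chain run concurrently.
double totalTime(const StepTimes& t)
{
    assert(t.compOuter >= 0.0 && t.compInner >= 0.0 && t.buffer >= 0.0 &&
           t.commCpuGpu >= 0.0 && t.commMpi >= 0.0);
    const double communication = t.buffer + t.commCpuGpu + t.commMpi;
    return t.compOuter + std::max(t.compInner, communication);
}
```

Under this model communication is hidden completely as long as it takes no longer than the inner computation; beyond that point it determines the step time.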
Performance Model II
Single node performance on Tsubame
Machine balance: $B_m = \frac{\text{sustainable bandwidth}}{\text{peak performance}}$
Code balance: $B_c = \frac{\text{no. of bytes loaded and stored}}{\text{no. of FLOPs executed}} = \frac{304}{200}$
Lightspeed estimate (if $l < 1$, the code is bandwidth limited): $l = \min\left(1, \frac{B_m}{B_c}\right)$
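The three quantities above can be evaluated with a few lines of code. The sketch below uses the 304 bytes and 200 FLOPs per cell update from the slide; reading the 304 bytes as 19 double-precision PDFs loaded and stored once per update is an interpretation, not something the slide states. The function names are hypothetical, and the hardware numbers in the usage note are placeholders, not Tsubame measurements.

```cpp
#include <algorithm>
#include <cassert>

// Machine balance B_m in bytes per FLOP.
double machineBalance(double sustainableBandwidth, double peakPerformance)
{
    assert(sustainableBandwidth > 0.0 && peakPerformance > 0.0);
    return sustainableBandwidth / peakPerformance;
}

// Code balance B_c in bytes per FLOP.
double codeBalance(double bytesLoadedAndStored, double flopsExecuted)
{
    assert(bytesLoadedAndStored > 0.0 && flopsExecuted > 0.0);
    return bytesLoadedAndStored / flopsExecuted;
}

// Lightspeed estimate l = min(1, B_m / B_c): the fraction of peak performance
// that is reachable when memory bandwidth is the only limit.
double lightspeed(double bm, double bc)
{
    assert(bm > 0.0 && bc > 0.0);
    return std::min(1.0, bm / bc);
}

// Bandwidth-bound upper limit on cell updates per second.
double maxCellUpdatesPerSecond(double sustainableBandwidth, double bytesPerCell)
{
    assert(sustainableBandwidth > 0.0 && bytesPerCell > 0.0);
    return sustainableBandwidth / bytesPerCell;
}
```

With placeholder values of 100 GB/s sustainable bandwidth and 500 GFlop/s peak performance, `machineBalance` gives 0.2 bytes/FLOP, `codeBalance(304, 200)` gives 1.52 bytes/FLOP, and the lightspeed estimate is about 0.13, so such a kernel would be clearly bandwidth limited.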
Single Compute Node Performance I
Single Compute Node Performance II
Single Compute Node Performance III
Single Compute Node Performance IV
Communication model
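$t_{\mathrm{total}} = t_{\mathrm{comp,outer}} + \max\left(t_{\mathrm{comp,inner}},\; t_{\mathrm{buffer}} + t_{\mathrm{comm,CPU\text{-}GPU}} + t_{\mathrm{comm,MPI}}\right)$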
Communication time for one message depends on
the message size s
the number of messages x that are concurrently transferred over the communication link (communication pattern)
the type of communication link ω
the relative position p of the communication partners, e.g. intra- or inter-node communication
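One common simplification of such a dependency is a latency-bandwidth model, sketched below. This is not necessarily the model used in the talk, which may rely on measured values. Here the link type ω and the partner position p are assumed to select a hypothetical `Link` parameter set, and x concurrent messages are assumed to share the link bandwidth evenly.

```cpp
#include <cassert>

// Hypothetical link description. Both values would have to be measured per
// link type (e.g. PCIe, InfiniBand) and per partner position (intra-/inter-node).
struct Link {
    double latency;    // seconds per message
    double bandwidth;  // bytes per second
};

// Estimated time for one message of s bytes while x messages of the same size
// are transferred concurrently over the same link.
double messageTime(double s, int x, const Link& link)
{
    assert(s >= 0.0 && x >= 1);
    assert(link.latency >= 0.0 && link.bandwidth > 0.0);
    return link.latency + (x * s) / link.bandwidth;
}
```

For small messages the latency term dominates, for large messages the bandwidth term does, which is why the communication pattern (the value of x) matters mainly for large ghost-layer exchanges.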
Weak scaling, 3 GPUs per node
Strong scaling, 3 GPUs per node
Test case: Packed bed of hollow cylinders
Porous media: 100x100x1536, 1D domain decomposition
Porous media: 100x100x1536, 1D domain decomposition
Porous media: 100x100x1536, 1D/2D/3D
Porous media: 256x256x3600, 1D/2D
Current Work
The current focus in waLBerla is on JUQUEEN and SuperMUC
Future Work
Tests on Nvidia Kepler cluster
Programming paradigms on future HPC clusters?
Code generation techniques to improve portability
Dynamic load balancing