Lattice Boltzmann algorithm in 3D with MPI
Faisal Shahzad
27-01-2011
Overview
• A quick overview of the 1st presentation (OpenMP implementation)
– LBM
– Performance modeling
– Performance achieved
• MPI Implementation for LBM3D
– Different MPI program approaches, Program flow
– What to communicate between processes
– How communication is done
• Performance measurements for LBM3D_MPI
– Performance graphs
– Scalability Results
27/01/2011 MuCoSim WS10: LBM in 3D with MPI
Lattice Boltzmann theory
• Domain discretization: time and space (D3Q19 lattice)
– Fluid particles are positioned at certain lattice sites
– May move only in certain, fixed directions
• Distribution functions define the probability of movement in a certain direction
[Figure: D3Q15 and D3Q19 lattices]
• Discretized form of time and space
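The two lattices named above can be enumerated directly. This is a minimal sketch, assuming the usual convention that every velocity component is -1, 0 or +1. The ordering of the vectors is arbitrary and the names `velocity_set`, `D3Q15` and `D3Q19` are illustrative; the numbering used in the actual program is not given in the slides.

```python
from itertools import product


def velocity_set(allowed_sq_lengths):
    """Return all vectors with components in {-1, 0, 1} whose squared length is allowed."""
    return [c for c in product((-1, 0, 1), repeat=3)
            if sum(x * x for x in c) in allowed_sq_lengths]


# Squared length 0: rest vector, 1: six axis neighbors,
# 2: twelve edge diagonals, 3: eight cube corners.
D3Q15 = velocity_set({0, 1, 3})
D3Q19 = velocity_set({0, 1, 2})

print(len(D3Q15), len(D3Q19))
```

Each cell therefore stores one distribution function per vector: 15 values for D3Q15 and 19 for D3Q19.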
Lattice Boltzmann theory
• Stream step
- Pull Scheme
- Push Scheme
• Collide step
- New distribution functions
• Boundary Conditions:
- No slip (real wall with friction)
- Free slip (symmetry boundary condition)
- Moving no slip (movement of wall involving friction => induces flow)
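The pull and push schemes of the stream step differ only in whether a cell writes to its neighbor or reads from it. The sketch below illustrates this on a toy one-dimensional periodic lattice with three velocities, which is an assumption made to keep the example short; the presentation itself uses D3Q19. The function names are hypothetical.

```python
# Toy 1D lattice with three velocities: rest, +1 and -1 (periodic boundaries).
VELOCITIES = (0, 1, -1)


def stream_push(f):
    """Push scheme: each cell writes its value into the neighbor it moves to."""
    n = len(f[0])
    out = [[0.0] * n for _ in VELOCITIES]
    for q, c in enumerate(VELOCITIES):
        for x in range(n):
            out[q][(x + c) % n] = f[q][x]
    return out


def stream_pull(f):
    """Pull scheme: each cell reads the value arriving from its upstream neighbor."""
    n = len(f[0])
    out = [[0.0] * n for _ in VELOCITIES]
    for q, c in enumerate(VELOCITIES):
        for x in range(n):
            out[q][x] = f[q][(x - c) % n]
    return out
```

Both functions produce the same result; the choice affects whether the scattered memory accesses are loads (pull) or stores (push).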
Implementation of LBM
• Two types of data layouts
• Assume domain >> cache
Array of Structures (AoS)
• Collision optimized
• Layout: one row per cell (Cell 0 … Cell N), each holding f0 f1 f2 … f18
• While colliding, all data in one cache line
• While streaming: accessing 19 neighbors; data is evicted from cache before full cache line usage
• Performance less than expected from memory bandwidth
Structure of Arrays (SoA)
• Streaming/propagation optimized
• Layout: one row per distribution function (f0 … f18), each holding cell0 cell1 cell2 … cellN
• While colliding, 19 cache lines are loaded and all elements are used for collision in the next cells
• While streaming: 19 cache lines are loaded (+ RFO); all elements are used for streaming in the next cells
• For in-memory domains, performance is better than with AoS
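The difference between the two layouts comes down to how a (cell, direction) pair maps to a position in one flat array. The following is a minimal sketch with hypothetical function names; it assumes cells are numbered linearly and that the 19 distribution functions are indexed 0 to 18.

```python
Q = 19  # distribution functions per cell in D3Q19


def aos_index(cell, q):
    """Array of Structures: the Q values of one cell are adjacent in memory."""
    return cell * Q + q


def soa_index(cell, q, n_cells):
    """Structure of Arrays: the same direction of consecutive cells is adjacent."""
    return q * n_cells + cell


n_cells = 1000
# Stepping to the next cell moves Q elements in AoS but only 1 element in SoA.
print(aos_index(1, 0) - aos_index(0, 0))
print(soa_index(1, 0, n_cells) - soa_index(0, 0, n_cells))
```

With SoA, a sweep over consecutive cells touches 19 separate contiguous streams, which is why every loaded cache line is fully used.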
Performance modeling
[Figure: node topologies of the three test systems]
• Woody node: Xeon 5160, dual socket
• TinyBlue node: Xeon 5550, dual socket
• Lima node: Xeon 5650, dual socket

Node      Peak performance  Peak memory BW  STREAM (Scale) BW  STREAM (Scale) BW-RFO
          (GFlops)          (GB/s)          (GB/s)             (GB/s)
Woody     48                21              7                  4
TinyBlue  85                63              37                 27
Lima      127               63              40                 26
Performance modeling
Performance estimates:
1 lup: ≈ 250 Flops, 19*8*2 bytes, 19*8*3 bytes (RFO)

Node      Based on peak Flops  Based on peak memory BW  Based on STREAM BW RFO
          (MLUPs)              (MLUPs)                  (MLUPs)
Woody     208                  46                       15
TinyBlue  370                  140                      88
Lima      555                  140                      88

Maximum achieved performance on one node:

Node           Maximum achieved performance, OpenMP (MLUPs)
Woody Node     14
TinyBlue Node  67
Lima Node      75
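The bandwidth-based estimate divides the available memory bandwidth by the bytes moved per lattice update. The sketch below applies this to the peak memory bandwidth figures, assuming 19*8*3 bytes per update (two streams plus RFO). The function name is hypothetical. For Woody this gives about 46 MLUPs, and for 63 GB/s about 138 MLUPs, which the slides appear to round to 140. The flop-based column is not reproduced here because it depends on the exact flop count per update, which the slides give only approximately.

```python
Q = 19               # D3Q19 distribution functions per cell
BYTES_PER_VALUE = 8  # double precision


def mlups_from_bandwidth(bandwidth_gb_s, with_rfo=True):
    """Upper bound on lattice updates per second (in millions) for a given bandwidth."""
    streams = 3 if with_rfo else 2  # load + store, plus read-for-ownership
    bytes_per_lup = Q * BYTES_PER_VALUE * streams
    return bandwidth_gb_s * 1e9 / bytes_per_lup / 1e6


# Peak memory bandwidth per node in GB/s, as listed in the slides.
peak_bw = {"Woody": 21, "TinyBlue": 63, "Lima": 63}
for node, bw in peak_bw.items():
    print(f"{node}: {mlups_from_bandwidth(bw):.0f} MLUPs")
```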
MPI Implementation
• MPI
– Single program, multiple data (SPMD): each process runs the same program with different data
– Two approaches
• Master-Worker
• All Workers
MPI Implementation-Program Flow
MPI Implementation for LBM: Domain Decomposition
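As an illustration of domain decomposition, the sketch below splits the x-extent of the domain into contiguous slabs, one per process. This is an assumption: the slides do not state whether the program decomposes along one, two or three axes, and `slab_range` is a hypothetical helper, not a function of the presented code.

```python
def slab_range(n_cells, n_procs, rank):
    """Return the half-open range [start, stop) of x-planes owned by `rank`."""
    base, extra = divmod(n_cells, n_procs)
    # The first `extra` ranks take one additional plane each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Example: 200 planes distributed over 8 processes.
for rank in range(8):
    print(rank, slab_range(200, 8, rank))
```

Each process would then allocate its slab plus one ghost layer on every side that borders another process.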
MPI Implementation for LBM
Boundary condition at walls
[Figure: boundary condition, followed by the stream step in the following iteration]
Boundary condition at ghost layers: communication required
Communication after each stream-collide step
MPI Implementation for LBM: How communication is done
• In the D3Q19 scheme, 5 distribution functions per cell need to be communicated in each direction.
• pack() function:
– To fill the buffer with the necessary distribution functions
– Takes care of the direction in which to communicate
• Send-Recv
• unpack() function:
– To extract distribution functions from the received buffer
– Takes care of the direction from which the buffer is received
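The pack and unpack steps described above can be sketched without MPI by treating the buffer as a plain list. The velocity ordering, the per-cell list layout and the function signatures are assumptions made for illustration; they are not taken from the presented program. The sketch also shows why exactly 5 of the 19 directions cross any given face.

```python
from itertools import product

# D3Q19: vectors with components in {-1, 0, 1} and at most two non-zero entries.
VELOCITIES = [c for c in product((-1, 0, 1), repeat=3) if sum(map(abs, c)) <= 2]


def directions_crossing(axis, sign):
    """Indices of the distribution functions that leave through the given face."""
    return [q for q, c in enumerate(VELOCITIES) if c[axis] == sign]


def pack(face_cells, axis, sign):
    """Copy only the outgoing distribution functions of each face cell into a buffer."""
    qs = directions_crossing(axis, sign)
    return [cell[q] for cell in face_cells for q in qs]


def unpack(buffer, ghost_cells, axis, sign):
    """Write a received buffer into the matching slots of the ghost-layer cells."""
    qs = directions_crossing(axis, sign)
    values = iter(buffer)
    for cell in ghost_cells:
        for q in qs:
            cell[q] = next(values)
```

In an MPI program the list returned by `pack` would be handed to a send-receive call, and the received buffer passed to `unpack` on the neighboring process.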
MPI Implementation for LBM
Example: domain of 200 x 200 x 200
• Naïve implementation: communicate all distributions
19*8*2 bytes per cell => ≈ 12.2 MB per iteration (comm face)
• Communicate only necessary distributions:
5*8*2 bytes per cell => 3.2 MB per iteration (comm face)
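The message sizes for one communication face follow from the number of face cells times the bytes per cell. The sketch below reproduces that arithmetic for a 200 x 200 face; the function name is hypothetical, and the factor 2 is taken as given from the slide (it presumably accounts for sending and receiving). The two cases evaluate to roughly 12.2 MB and 3.2 MB.

```python
def face_message_bytes(ny, nz, values_per_cell, bytes_per_value=8, factor=2):
    """Bytes exchanged across one face of ny * nz cells per iteration."""
    return ny * nz * values_per_cell * bytes_per_value * factor


naive = face_message_bytes(200, 200, 19)    # all 19 distributions
reduced = face_message_bytes(200, 200, 5)   # only the 5 crossing the face

print(f"naive:   {naive / 1e6:.2f} MB")
print(f"reduced: {reduced / 1e6:.2f} MB")
print(f"ratio:   {naive / reduced:.1f}x")
```

Sending only the necessary distributions cuts the volume by a factor of 19/5, independent of the face size.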
Performance measurement - Woody
likwid-mpirun -np 1 -pin N:0 ./lbm params.dat
likwid-mpirun -np 2 -pin N:0_1 ./lbm params.dat
likwid-mpirun -np 4 -NperDomain S:2 ./lbm params.dat
likwid-pin -t intel -c S0:0-1@S1:0-1 ./lbm params.dat
Performance measurement - Tinyblue
likwid-mpirun -np 1 -pin N:0 ./lbm params.dat
likwid-mpirun -np 2 -pin N:0_1 ./lbm params.dat
likwid-mpirun -np 4 -NperDomain S:4 ./lbm params.dat
likwid-mpirun -np 8 -NperDomain S:4 ./lbm params.dat
likwid-pin -t intel -c S0:0-3@S1:0-3 ./lbm params.dat
Performance measurement - Lima
likwid-mpirun -np 1 -pin N:0 ./lbm params.dat
likwid-mpirun -np 2 -pin N:0_1 ./lbm params.dat
likwid-mpirun -np 4 -pin N:0_1_2_3 ./lbm params.dat
likwid-mpirun -np 6 -NperDomain S:6 ./lbm params.dat
likwid-mpirun -np 12 -NperDomain S:4 ./lbm params.dat
likwid-pin -t intel -c S0:0-5@S1:0-5 ./lbm params.dat
Performance measurement – Scaling Results on TinyBlue
likwid-mpirun -np 8 -NperDomain S:4 ./lbm params.dat
likwid-mpirun -np 16 -NperDomain S:4 ./lbm params.dat
likwid-mpirun -np 32 -NperDomain S:4 ./lbm params.dat
likwid-mpirun -np 64 -NperDomain S:4 ./lbm params.dat
Performance measurement – Scaling Results on Lima
likwid-mpirun -np 12 -NperDomain S:4 ./lbm params.dat
likwid-mpirun -np 24 -NperDomain S:4 ./lbm params.dat
likwid-mpirun -np 48 -NperDomain S:4 ./lbm params.dat
likwid-mpirun -np 96 -NperDomain S:4 ./lbm params.dat
likwid-mpirun -np 192 -NperDomain S:4 ./lbm params.dat
likwid-mpirun -np 384 -NperDomain S:4 ./lbm params.dat
Thank you!
Questions and comments are welcome
• Special thanks:
– Mr. Johannes Habich