High-resolution GPU codes for
direct numerical simulation of turbulence

Alberto Vela-Martin, Jose I. Cardesa, Miguel P. Encinar and
Javier Jiménez
E.T.S.I. Aeronáutica, UPM
Turbulence and high performance computing

- Turbulence is a common phenomenon in fluid mechanics.
  Practical importance: industrial processes, energy and
  aeronautics.
- Related to energy saving and efficiency in transportation.
- About 5% of the world's total energy is spent in turbulent
  friction.
Turbulence and high performance computing

- Highly complex and chaotic phenomenon: a high level of detail
  is required.
- Direct numerical simulation (DNS): turbulence simulated with
  all its relevant details, N_DOF ~ Re^(9/4) degrees of freedom.
- A DNS with N ~ 10^9 degrees of freedom takes on the order of
  30 million CPU-hours.
- Large DNSs computed in our group:
  - Hoyas and Jiménez 2006, channel at Re_τ = 2000 (6x10^6
    CPU-hours, Marenostrum).
  - Sillero et al. 2013, boundary layer at Re_θ = 6600.
  - Lozano-Durán et al. 2014, time-resolved channel at
    Re_τ = 4000.
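The Re^(9/4) scaling above is easy to check numerically; the sketch below shows that a hypothetical Re = 10^4 already gives the ~10^9 degrees of freedom quoted for the large DNS (the O(1) prefactor is omitted).

```python
# Degrees of freedom of a DNS scale as N_DOF ~ Re^(9/4)
# (sketch; the O(1) prefactor is omitted).

def dns_dof(Re):
    """Estimated degrees of freedom for a DNS at Reynolds number Re."""
    return Re ** (9 / 4)

# Re = 10^4 already needs ~10^9 degrees of freedom, consistent with
# the 30-million-CPU-hour simulation quoted above.
print(f"{dns_dof(1e4):.1e}")
```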
DNS on GPUs

- Regular domains and boundary conditions.
- Simple but highly efficient and scalable codes.
- CFD on single GPUs: a simple homogeneous isotropic turbulence
  code (3 periodic directions) and a channel code (2 periodic
  directions). Outstanding performance, no device-to-host
  communications.
- Adding MPI to the single-GPU codes: MPI all-to-all
  communications from host memory. Penalty caused by D2H and
  H2D memory transfers.
- Optimization: asynchronous GPU-CPU execution. Overlapping.
DNS on GPUs

- Two basic configurations: isotropic turbulence and the
  turbulent channel.
Turbulent channel flow

- Flow between two parallel walls.
- Periodic boundary conditions in x and z, no-slip condition at
  both walls.
- Mean pressure gradient in x -> mean velocity profile U.
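Away from the wall, the mean velocity profile U sustained by the pressure gradient follows the classical logarithmic law. A minimal sketch, using the textbook constants kappa ≈ 0.41 and B ≈ 5.2 (typical values, not taken from these slides):

```python
import math

def u_plus_log_law(y_plus, kappa=0.41, B=5.2):
    """Mean velocity in wall units from the log law U+ = ln(y+)/kappa + B.
    kappa and B are typical textbook values (assumption, not from the code)."""
    return math.log(y_plus) / kappa + B

print(u_plus_log_law(100.0))  # roughly 16.4
```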
Turbulent channel flow

- Formulation in ω_y and ∇²v (Kim, Moin and Moser 1987).
- Fully dealiased pseudospectral method in x and z (CUFFT
  library).
- High-resolution compact finite differences in y (7-point
  stencil on a non-uniform grid): inversion of heptadiagonal
  matrices (custom CUDA kernels).
- Temporal integration with a 3rd-order low-storage Runge-Kutta.
- Domain decomposition: y-z planes. MPI transpose to x-z planes.
- Core of the code in single precision (some parts in double).
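The low-storage Runge-Kutta idea (only the solution and one extra register in memory) can be sketched in a few lines. The Williamson 2N-storage RK3 coefficients below are a common choice in spectral DNS codes; the slides do not say which set this code uses.

```python
import math

# Williamson-type 2N-storage RK3 coefficients (a common choice;
# the actual coefficients in the channel code may differ).
A = (0.0, -5.0 / 9.0, -153.0 / 128.0)
B = (1.0 / 3.0, 15.0 / 16.0, 8.0 / 15.0)

def rk3_step(u, t, dt, rhs):
    """Advance u by one low-storage RK3 step: only u and one register q."""
    q = 0.0
    for a, b in zip(A, B):
        q = a * q + dt * rhs(u, t)
        u = u + b * q
    return u

# Integrate du/dt = -u from u(0) = 1 to t = 1; exact answer is exp(-1).
u, dt = 1.0, 0.01
for n in range(100):
    u = rk3_step(u, n * dt, dt, lambda u, t: -u)
print(u, math.exp(-1.0))
```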
Channel code

- The non-linear terms are the most expensive:

      ∂u/∂t = -(u·∇)u - ∇p + ν∇²u

- 6 complex-to-real FFTs + 5 MPI transposes (global
  communications).
- 3 real-to-complex FFTs + 3 MPI transposes (global
  communications).
- Optimization: asynchronous GPU-CPU execution. Overlapping.
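Full dealiasing of the quadratic term is typically done with the 3/2 (zero-padding) rule. A 1-D numpy sketch of the idea (the production code does this in 2-D with the CUFFT library):

```python
import numpy as np

def dealiased_product(u_hat, v_hat):
    """Dealiased product of two periodic fields given by their N-mode FFTs,
    using the 3/2 zero-padding rule (1-D sketch of the standard technique)."""
    N = u_hat.size
    M = 3 * N // 2

    def pad(a):
        b = np.zeros(M, dtype=complex)
        b[: N // 2] = a[: N // 2]        # non-negative wavenumbers
        b[-(N // 2):] = a[-(N // 2):]    # negative wavenumbers
        return b

    # Transform on the finer grid, multiply pointwise, transform back.
    u = np.fft.ifft(pad(u_hat)) * (M / N)
    v = np.fft.ifft(pad(v_hat)) * (M / N)
    w = np.fft.fft(u * v) * (N / M)

    w_hat = np.zeros(N, dtype=complex)   # truncate back to N modes
    w_hat[: N // 2] = w[: N // 2]
    w_hat[-(N // 2):] = w[-(N // 2):]
    return w_hat

# sin(x) * cos(x) = 0.5 sin(2x), recovered without aliasing errors.
N = 16
x = 2 * np.pi * np.arange(N) / N
w_hat = dealiased_product(np.fft.fft(np.sin(x)), np.fft.fft(np.cos(x)))
w = np.real(np.fft.ifft(w_hat))
print(np.max(np.abs(w - 0.5 * np.sin(2 * x))))
```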
Non-linear convolution: overlapping

Compute stream                         | D2H stream         | H2D stream           | Host stream
---------------------------------------+--------------------+----------------------+---------------------
calculate u and w                      | copy v to host     |                      |
calculate ∂y u                         | copy u to host     |                      | MPI transp. v
calculate ∂y w                         | copy w to host     | copy v to device     | MPI transp. u
calculate ∂yy ∇²v                      | copy ∂y u to host  | copy u to device     | MPI transp. w
calculate ∂yy ω_y                      | copy ∂y w to host  | copy w to device     | MPI transp. ∂y u
FFT to real v                          |                    | copy ∂y u to device  | MPI transp. ∂y w
FFT to real u                          |                    | copy ∂y w to device  |
FFT to real w                          |                    |                      |
calculate ω_y and FFT to real          |                    |                      |
calculate ω_x and FFT to real          |                    |                      |
calculate ω_z and FFT to real          |                    |                      | calculate statistics
calculate H1 and FFT to complex H1     |                    |                      |
calculate H3 and FFT to complex H3     | copy H1 to host    |                      |
calculate H2 and FFT to complex H2     | copy H3 to host    |                      | MPI transp. H1
1st RK step for ∇²v                    | copy H2 to host    | copy H1 to device    | MPI transp. H3
1st RK step for ω                      |                    | copy H3 to device    | MPI transp. H2
non-linear RHS for ω_y and 2nd RK step |                    | copy H2 to device    |
implicit step for ω_y                  |                    |                      |
non-linear RHS for ∇²v and 2nd RK step |                    |                      |
implicit step for ∇²v                  |                    |                      |
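The point of the schedule above is to keep the GPU computing while copies and MPI transposes run on other streams. That pattern can be mimicked in plain Python with a background thread standing in for the copy/transpose streams (a toy analogy only; the real code uses CUDA streams and asynchronous memory copies):

```python
import threading

def pipelined(chunks, compute, transfer):
    """Process chunks so that transfer(i) overlaps with compute(i+1),
    the way D2H/H2D/MPI work overlaps GPU kernels in the table above."""
    results = []
    worker = None
    for c in chunks:
        out = compute(c)                  # "GPU" work for this chunk
        if worker is not None:
            worker.join()                 # previous transfer must finish
        worker = threading.Thread(target=lambda o=out: results.append(transfer(o)))
        worker.start()                    # "copy/MPI" runs in the background
    if worker is not None:
        worker.join()
    return results

# Toy stand-ins for a kernel and a D2H copy + MPI transpose.
print(pipelined([1, 2, 3], compute=lambda x: 2 * x, transfer=lambda x: x + 1))
```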
Non-linear convolution: overlapping

- 8 MPI transposes, 8 D2H copies and 8 H2D copies per step.

[Figure: execution profile on 32 M2090 GPUs in Minotauro at BSC,
showing MPI transposes and host work on the CPU timeline, and
H2D transfers, GPU execution and D2H transfers on the GPU
timeline.]
Scaling in PizDaint

    Nx × Ny × Nz        | Ngpus (min-max) | η_min | τ_min (ns)
------------------------+-----------------+-------+-----------
(*) 1024 × 1024 × 256   | 16-256          |  67%  | 60
(+) 2048 × 2048 × 512   | 64-512          |  82%  | 63
(o) 4096 × 4096 × 1024  | 512-1024        | 100%  | 65
(x) 6144 × 4096 × 1024  | 512-1024        | 100%  | 64

Time per degree of freedom and GPU:

    τ = time · Ngpus / DoF

Efficiency, relative to the smallest run (time_0 on Ngpus_0),
so that perfect scaling gives 100%:

    η = (time_0 · Ngpus_0) / (time · Ngpus) · 100
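The two figures of merit in the table follow directly from wall-clock timings; the sketch below just exercises the formulas with hypothetical numbers (not the measured ones):

```python
def tau_ns(time_s, n_gpus, dof):
    """Time per degree of freedom and GPU, in nanoseconds."""
    return time_s * n_gpus / dof * 1e9

def efficiency(time_s, n_gpus, time0_s, n0_gpus):
    """Parallel efficiency in %, relative to a baseline run (time0 on n0 GPUs)."""
    return time0_s * n0_gpus / (time_s * n_gpus) * 100.0

# Hypothetical timings: perfect strong scaling from 16 to 32 GPUs.
print(efficiency(5.0, 32, 10.0, 16))   # 100.0
print(tau_ns(10.0, 16, 1024 * 1024 * 256))
```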
Scaling in PizDaint

[Figure: time per step (sec) versus Ngpus (16 to 1024), log-log,
one curve per mesh in the table, with parallel efficiencies
between 67% and 109% marked along the curves.]
Scaling in PizDaint

[Figure: time × Ngpus / DoF (3 to 10 × 10^-8 sec) versus Ngpus
(16 to 1024), one curve per mesh in the table.]
Future projects: what we would like to do

Future goal for the next generation of GPUs (Pascal and Volta):

- Re_τ = 10,000 in a large 8π × 3π box.
- Mesh: Nx × Ny × Nz = 20,480 × 2048 × 15,360.
- Total GPU memory: ~10-15 TB.
- ~500 hours per eddy-turnover time on 2048 GPUs (PizDaint).
- ~10,000,000 node-hours for a 15 eddy-turnover-time simulation.
- Generate on-the-fly compressed time-resolved data.
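The 10-15 TB memory estimate can be reproduced with back-of-the-envelope arithmetic. The count of five single-precision 3-D arrays below is an illustrative assumption, not from the slides (the actual number depends on the formulation and work buffers):

```python
def gpu_memory_tb(nx, ny, nz, n_fields=5, bytes_per_value=4):
    """Rough total memory for n_fields single-precision 3-D arrays, in TB.
    n_fields = 5 is an illustrative assumption, not the code's actual count."""
    return nx * ny * nz * n_fields * bytes_per_value / 1e12

# The 20,480 x 2048 x 15,360 target mesh lands in the 10-15 TB range.
print(gpu_memory_tb(20480, 2048, 15360))
```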
Present projects: what we can do now

Current project at PizDaint (Pascal):

- Re_τ = 5,000 in a large 8π × 3π box (low resolution).
- Mesh: Nx × Ny × Nz = 6140 × 1024 × 4196.
- Total GPU memory: ~1-1.5 TB.
- ~22 hours per eddy-turnover time on 1048 GPUs (Tesla).
- ~1,600,000 node-hours for a 50 eddy-turnover-time simulation.
- Generate on-the-fly compressed time-resolved data.
Homogeneous isotropic turbulence

- Flow in a triply periodic box.
- Optimization strategy similar to the channel flow.
- Good scalability up to 64 GPUs on Minotauro (BSC).
The turbulence cascade in 5D

- DECI-13 COSIT project in MinoTauro.
- ~500,000 CPU-hours on M2090 NVIDIA GPUs.
- Long run (~60 eddy-turnover times).
- High temporal resolution (Kolmogorov time-scale).
- ~26,000 snapshots / ~100 TB.
The turbulence cascade in 5D

- Project to study the turbulence cascade in 3 spatial
  coordinates, scale and time (5D).
- Time-tracking algorithms at different scales.
- Results in Cardesa, Vela-Martin & Jiménez 2017, Science.
- Database and GPU code available at
  https://torroja.dmt.upm.es
Questions