High-resolution GPU codes for
direct numerical simulation of turbulence

Alberto Vela-Martin, Jose I. Cardesa, Miguel P. Encinar and
Javier Jiménez
E.T.S.I. Aeronáutica, UPM
Turbulence and high performance computing

- Turbulence is a common phenomenon in fluid mechanics.
  Practical importance: industrial processes, energy and
  aeronautics.
- Related to energy saving and efficiency in transportation.
- About 5% of the world's total energy is spent in turbulent
  friction.
Turbulence and high performance computing

- Highly complex and chaotic phenomenon: a high level of detail
  is required.
- Direct numerical simulation (DNS): turbulence simulated with
  all its relevant details, N_DOF ~ Re^(9/4) degrees of freedom.
- A DNS with N ~ 10^9 degrees of freedom takes on the order of
  30 million CPU-hours.
- Large DNSs computed in our group:
  - Hoyas and Jiménez 2006, channel at Re_τ = 2000 (6x10^6
    CPU-hours, Marenostrum).
  - Sillero et al. 2013, boundary layer at Re_θ = 6600.
  - Lozano-Durán et al. 2014, time-resolved channel at
    Re_τ = 4000.
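The Re^(9/4) scaling above is easy to check numerically; the sketch below shows that a hypothetical Re = 10^4 already gives the ~10^9 degrees of freedom quoted for the large DNS (the O(1) prefactor is omitted).

```python
# Degrees of freedom of a DNS scale as N_DOF ~ Re^(9/4)
# (sketch; the O(1) prefactor is omitted).

def dns_dof(Re):
    """Estimated degrees of freedom for a DNS at Reynolds number Re."""
    return Re ** (9 / 4)

# Re = 10^4 already needs ~10^9 degrees of freedom, consistent with
# the 30-million-CPU-hour simulation quoted above.
print(f"{dns_dof(1e4):.1e}")
```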
DNS on GPUs

- Regular domains and boundary conditions.
- Simple but highly efficient and scalable codes.
- CFD on single GPUs: a simple homogeneous isotropic turbulence
  code (3 periodic directions) and a channel code (2 periodic
  directions). Outstanding performance, no device-to-host
  communications.
- Adding MPI to the single-GPU codes: MPI all-to-all
  communications from host memory. Penalty caused by D2H and
  H2D memory transfers.
- Optimization: asynchronous GPU-CPU execution. Overlapping.
DNS on GPUs

- Two basic configurations: isotropic turbulence and the
  turbulent channel.
Turbulent channel flow

- Flow between two parallel walls.
- Periodic boundary conditions in x and z, no-slip condition at
  both walls.
- Mean pressure gradient in x -> mean velocity profile U.
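Away from the wall, the mean velocity profile U sustained by the pressure gradient follows the classical logarithmic law. A minimal sketch, using the textbook constants kappa ≈ 0.41 and B ≈ 5.2 (typical values, not taken from these slides):

```python
import math

def u_plus_log_law(y_plus, kappa=0.41, B=5.2):
    """Mean velocity in wall units from the log law U+ = ln(y+)/kappa + B.
    kappa and B are typical textbook values (assumption, not from the code)."""
    return math.log(y_plus) / kappa + B

print(u_plus_log_law(100.0))  # roughly 16.4
```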
Turbulent channel flow

- Formulation in ω_y and ∇²v (Kim, Moin and Moser 1987).
- Fully dealiased pseudospectral method in x and z (CUFFT
  library).
- High-resolution compact finite differences in y (7-point
  stencil on a non-uniform grid): inversion of heptadiagonal
  matrices (custom CUDA kernels).
- Temporal integration with a 3rd-order low-storage Runge-Kutta.
- Domain decomposition: y-z planes. MPI transpose to x-z planes.
- Core of the code in single precision (some parts in double).
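The low-storage Runge-Kutta idea (only the solution and one extra register in memory) can be sketched in a few lines. The Williamson 2N-storage RK3 coefficients below are a common choice in spectral DNS codes; the slides do not say which set this code uses.

```python
import math

# Williamson-type 2N-storage RK3 coefficients (a common choice;
# the actual coefficients in the channel code may differ).
A = (0.0, -5.0 / 9.0, -153.0 / 128.0)
B = (1.0 / 3.0, 15.0 / 16.0, 8.0 / 15.0)

def rk3_step(u, t, dt, rhs):
    """Advance u by one low-storage RK3 step: only u and one register q."""
    q = 0.0
    for a, b in zip(A, B):
        q = a * q + dt * rhs(u, t)
        u = u + b * q
    return u

# Integrate du/dt = -u from u(0) = 1 to t = 1; exact answer is exp(-1).
u, dt = 1.0, 0.01
for n in range(100):
    u = rk3_step(u, n * dt, dt, lambda u, t: -u)
print(u, math.exp(-1.0))
```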
Channel code

- The non-linear terms are the most expensive:

      ∂u/∂t = -(u·∇)u - ∇p + ν∇²u

- 6 complex-to-real FFTs + 5 MPI transposes (global
  communications).
- 3 real-to-complex FFTs + 3 MPI transposes (global
  communications).
- Optimization: asynchronous GPU-CPU execution. Overlapping.
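Full dealiasing of the quadratic term is typically done with the 3/2 (zero-padding) rule. A 1-D numpy sketch of the idea (the production code does this in 2-D with the CUFFT library):

```python
import numpy as np

def dealiased_product(u_hat, v_hat):
    """Dealiased product of two periodic fields given by their N-mode FFTs,
    using the 3/2 zero-padding rule (1-D sketch of the standard technique)."""
    N = u_hat.size
    M = 3 * N // 2

    def pad(a):
        b = np.zeros(M, dtype=complex)
        b[: N // 2] = a[: N // 2]        # non-negative wavenumbers
        b[-(N // 2):] = a[-(N // 2):]    # negative wavenumbers
        return b

    # Transform on the finer grid, multiply pointwise, transform back.
    u = np.fft.ifft(pad(u_hat)) * (M / N)
    v = np.fft.ifft(pad(v_hat)) * (M / N)
    w = np.fft.fft(u * v) * (N / M)

    w_hat = np.zeros(N, dtype=complex)   # truncate back to N modes
    w_hat[: N // 2] = w[: N // 2]
    w_hat[-(N // 2):] = w[-(N // 2):]
    return w_hat

# sin(x) * cos(x) = 0.5 sin(2x), recovered without aliasing errors.
N = 16
x = 2 * np.pi * np.arange(N) / N
w_hat = dealiased_product(np.fft.fft(np.sin(x)), np.fft.fft(np.cos(x)))
w = np.real(np.fft.ifft(w_hat))
print(np.max(np.abs(w - 0.5 * np.sin(2 * x))))
```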
Non-linear convolution: overlapping

Compute stream                         | D2H stream         | H2D stream           | Host stream
---------------------------------------+--------------------+----------------------+---------------------
calculate u and w                      | copy v to host     |                      |
calculate ∂y u                         | copy u to host     |                      | MPI transp. v
calculate ∂y w                         | copy w to host     | copy v to device     | MPI transp. u
calculate ∂yy ∇²v                      | copy ∂y u to host  | copy u to device     | MPI transp. w
calculate ∂yy ω_y                      | copy ∂y w to host  | copy w to device     | MPI transp. ∂y u
FFT to real v                          |                    | copy ∂y u to device  | MPI transp. ∂y w
FFT to real u                          |                    | copy ∂y w to device  |
FFT to real w                          |                    |                      |
calculate ω_y and FFT to real          |                    |                      |
calculate ω_x and FFT to real          |                    |                      |
calculate ω_z and FFT to real          |                    |                      | calculate statistics
calculate H1 and FFT to complex H1     |                    |                      |
calculate H3 and FFT to complex H3     | copy H1 to host    |                      |
calculate H2 and FFT to complex H2     | copy H3 to host    |                      | MPI transp. H1
1st RK step for ∇²v                    | copy H2 to host    | copy H1 to device    | MPI transp. H3
1st RK step for ω                      |                    | copy H3 to device    | MPI transp. H2
non-linear RHS for ω_y and 2nd RK step |                    | copy H2 to device    |
implicit step for ω_y                  |                    |                      |
non-linear RHS for ∇²v and 2nd RK step |                    |                      |
implicit step for ∇²v                  |                    |                      |
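The point of the schedule above is to keep the GPU computing while copies and MPI transposes run on other streams. That pattern can be mimicked in plain Python with a background thread standing in for the copy/transpose streams (a toy analogy only; the real code uses CUDA streams and asynchronous memory copies):

```python
import threading

def pipelined(chunks, compute, transfer):
    """Process chunks so that transfer(i) overlaps with compute(i+1),
    the way D2H/H2D/MPI work overlaps GPU kernels in the table above."""
    results = []
    worker = None
    for c in chunks:
        out = compute(c)                  # "GPU" work for this chunk
        if worker is not None:
            worker.join()                 # previous transfer must finish
        worker = threading.Thread(target=lambda o=out: results.append(transfer(o)))
        worker.start()                    # "copy/MPI" runs in the background
    if worker is not None:
        worker.join()
    return results

# Toy stand-ins for a kernel and a D2H copy + MPI transpose.
print(pipelined([1, 2, 3], compute=lambda x: 2 * x, transfer=lambda x: x + 1))
```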
Non-linear convolution: overlapping

- 8 MPI transposes, 8 D2H copies and 8 H2D copies per step.

[Figure: execution profile on 32 M2090 GPUs in Minotauro at BSC,
showing MPI transposes and host work on the CPU timeline, and
H2D transfers, GPU execution and D2H transfers on the GPU
timeline.]
Scaling in PizDaint

    Nx × Ny × Nz        | Ngpus (min-max) | η_min | τ_min (ns)
------------------------+-----------------+-------+-----------
(*) 1024 × 1024 × 256   | 16-256          |  67%  | 60
(+) 2048 × 2048 × 512   | 64-512          |  82%  | 63
(o) 4096 × 4096 × 1024  | 512-1024        | 100%  | 65
(x) 6144 × 4096 × 1024  | 512-1024        | 100%  | 64

Time per degree of freedom and GPU:

    τ = time · Ngpus / DoF

Efficiency, relative to the smallest run (time_0 on Ngpus_0),
so that perfect scaling gives 100%:

    η = (time_0 · Ngpus_0) / (time · Ngpus) · 100
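The two figures of merit in the table follow directly from wall-clock timings; the sketch below just exercises the formulas with hypothetical numbers (not the measured ones):

```python
def tau_ns(time_s, n_gpus, dof):
    """Time per degree of freedom and GPU, in nanoseconds."""
    return time_s * n_gpus / dof * 1e9

def efficiency(time_s, n_gpus, time0_s, n0_gpus):
    """Parallel efficiency in %, relative to a baseline run (time0 on n0 GPUs)."""
    return time0_s * n0_gpus / (time_s * n_gpus) * 100.0

# Hypothetical timings: perfect strong scaling from 16 to 32 GPUs.
print(efficiency(5.0, 32, 10.0, 16))   # 100.0
print(tau_ns(10.0, 16, 1024 * 1024 * 256))
```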
Scaling in PizDaint

[Figure: time per step (sec) versus Ngpus (16 to 1024), log-log,
one curve per mesh in the table, with parallel efficiencies
between 67% and 109% marked along the curves.]
Scaling in PizDaint

[Figure: time × Ngpus / DoF (3 to 10 × 10^-8 sec) versus Ngpus
(16 to 1024), one curve per mesh in the table.]
Future projects: what we would like to do

Future goal for the next generation of GPUs (Pascal and Volta):

- Re_τ = 10,000 in a large 8π × 3π box.
- Mesh: Nx × Ny × Nz = 20,480 × 2048 × 15,360.
- Total GPU memory: ~10-15 TB.
- ~500 hours per eddy-turnover time on 2048 GPUs (PizDaint).
- ~10,000,000 node-hours for a 15 eddy-turnover-time simulation.
- Generate on-the-fly compressed time-resolved data.
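The 10-15 TB memory estimate can be reproduced with back-of-the-envelope arithmetic. The count of five single-precision 3-D arrays below is an illustrative assumption, not from the slides (the actual number depends on the formulation and work buffers):

```python
def gpu_memory_tb(nx, ny, nz, n_fields=5, bytes_per_value=4):
    """Rough total memory for n_fields single-precision 3-D arrays, in TB.
    n_fields = 5 is an illustrative assumption, not the code's actual count."""
    return nx * ny * nz * n_fields * bytes_per_value / 1e12

# The 20,480 x 2048 x 15,360 target mesh lands in the 10-15 TB range.
print(gpu_memory_tb(20480, 2048, 15360))
```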
Present projects: what we can do now

Current project at PizDaint (Pascal):

- Re_τ = 5,000 in a large 8π × 3π box (low resolution).
- Mesh: Nx × Ny × Nz = 6140 × 1024 × 4196.
- Total GPU memory: ~1-1.5 TB.
- ~22 hours per eddy-turnover time on 1048 GPUs (Tesla).
- ~1,600,000 node-hours for a 50 eddy-turnover-time simulation.
- Generate on-the-fly compressed time-resolved data.
Homogeneous isotropic turbulence

- Flow in a triply periodic box.
- Optimization strategy similar to the channel flow.
- Good scalability up to 64 GPUs on Minotauro (BSC).
The turbulence cascade in 5D

- DECI-13 COSIT project in MinoTauro.
- ~500,000 CPU-hours on M2090 NVIDIA GPUs.
- Long run (~60 eddy-turnover times).
- High temporal resolution (Kolmogorov time-scale).
- ~26,000 snapshots / ~100 TB.
The turbulence cascade in 5D

- Project to study the turbulence cascade in 3 spatial
  coordinates, scale and time (5D).
- Time-tracking algorithms at different scales.
- Results in Cardesa, Vela-Martin & Jiménez 2017, Science.
- Database and GPU code available at
  https://torroja.dmt.upm.es
Questions