
Page 1

High Performance Computational Fluid-Thermal Sciences & Engineering Lab

GenIDLEST Co-Design

Virginia Tech

AFOSR-BRI Workshop, Feb 7, 2014

Danesh Tafti & Amit Amritkar

Collaborators: Wu-chun Feng, Paul Sathre, Kaixi Hou, Sriram Chivukula, Hao Wang, Eric de Sturler, Kasia Swirydowicz

Page 2

Outline  

•  GenIDLEST Code
   •  Features and capabilities
•  Initial code porting
   •  Strategy
•  Co-design optimization
   •  Validation
   •  Application
      •  Bat wing flapping
•  Solver co-design
•  Future work


Page 3

Generalized Incompressible Direct and Large-Eddy Simulations of Turbulence (GenIDLEST)

•  Background
   •  Finite-volume multi-block body-fitted mesh topology
   •  Grid has structured (i,j,k) connectivity, but blocks can have arbitrary (unstructured) connectivity
   •  Pressure-based fractional-step algorithm
   •  MPI, OpenMP, or hybrid MPI-OpenMP parallelism
   •  OpenMP scalability as good as MPI (tested up to 256 cores)
•  Capability
   •  Wall-resolved and wall-modeled LES in complex geometries
      •  Dynamic Smagorinsky with zonal two-layer wall model
   •  Hybrid URANS-LES
      •  k-ω based
   •  Dynamic grids (ALE)
   •  Immersed Boundary Method (IBM) on general grids
   •  Discrete Element Method (DEM)

Page 4

Generalized Incompressible Direct and Large-Eddy Simulations of Turbulence (GenIDLEST)

•  Dynamic stall on a pitching airfoil (LES + ALE): 45 million cells, 304 cores

Page 5

Generalized Incompressible Direct and Large-Eddy Simulations of Turbulence (GenIDLEST)

•  Flapping bat wing (IBM): 25 million cells, 60 cores

Page 6

Generalized Incompressible Direct and Large-Eddy Simulations of Turbulence (GenIDLEST)

•  DEM in a fluidized bed: 5.3 million particles, 1 million fluid grid cells, 256 cores

Page 7

GenIDLEST Flow Chart

[Flow chart] Main loop: START → read user-defined input and grid → initialize velocity and temperature fields → set up system matrices → advance time step → solve turbulence model equations for turbulent viscosity → solve momentum equations for intermediate velocity (linear solve) → calculate fluxes → solve pressure Poisson equation (linear solve) → correct intermediate velocities → if T > Tmax, END (post-processing); otherwise advance to the next time step.

ALE (rezoning phase, per time step): forced boundary node movement → corner displacement using spring analogy → TFI for corner → edge → face → volume displacement → calculate new system matrices.

IBM: read surface grid → identify solid, fluid and IB nodes → apply velocity boundary condition at IB nodes → correct mass fluxes → apply boundary conditions for velocity and pressure at IB nodes → post-process data at the immersed boundary → move the immersed boundary to its new location.

DEM (sub-stepped until Ts > Tsmax): calculate forces and heat flux on particles → update particle position, velocity and temperature → calculate volume fraction → calculate momentum and temperature source terms.
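Read as code, the main path of the flow chart is roughly the structural sketch below; every step is only a comment, and all names and time values are placeholders rather than GenIDLEST interfaces.

program fractional_step_sketch
   ! Structural sketch of the time-integration loop in the flow chart above.
   implicit none
   real :: t, dt, tmax
   t = 0.0; dt = 1.0e-3; tmax = 1.0

   ! One-time setup: read user-defined input and grid, initialize velocity and
   ! temperature fields, set up system matrices (plus surface grid and particle
   ! data when IBM / DEM are active).

   do while (t < tmax)
      t = t + dt                  ! advance time step
      ! (ALE) rezoning: boundary node movement, spring analogy, TFI, new matrices
      ! (IBM) identify solid/fluid/IB nodes, apply IB velocity boundary conditions
      ! (DEM) particle sub-stepping: forces, positions, volume fraction, sources
      ! 1. solve turbulence model equations for turbulent viscosity
      ! 2. solve momentum equations for intermediate velocity    (linear solve)
      ! 3. calculate mass fluxes
      ! 4. solve pressure Poisson equation                       (linear solve)
      ! 5. correct intermediate velocities and fluxes
   end do
   ! END: post-processing
end program fractional_step_sketch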

Page 8

GenIDLEST – Linear Systems

•  Sparse linear systems
   •  Sparsity pattern depends on the boundary conditions
   •  Non-orthogonal grid: non-symmetric, 18 off-diagonal elements per row
   •  Orthogonal grid: symmetric, 6 off-diagonal elements per row
•  CG is used for symmetric systems; BiCGSTAB or GMRES(m) for non-symmetric (non-orthogonal) systems
•  Preconditioners: domain decomposition – additive Schwarz with Richardson or SSOR iterations applied to sub-structured overlapping "cache" blocks – SIMD parallel (see the matrix-vector sketch below)
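For the orthogonal-grid case (symmetric, 6 off-diagonal entries per row), the matrix-vector product can be written directly against the structured (i,j,k) connectivity rather than a general sparse format. The sketch below is a minimal illustration assuming one ghost layer and illustrative coefficient names (ap, ae, aw, an, as, at, ab); it is not the actual GenIDLEST data layout.

subroutine stencil_matvec(ni, nj, nk, ap, ae, aw, an, as, at, ab, phi, res)
   ! res = A*phi for a 7-point stencil on one structured (i,j,k) block.
   ! ap holds the diagonal; ae/aw/an/as/at/ab the six off-diagonal coefficients.
   ! phi carries one ghost layer so the loop needs no boundary tests.
   implicit none
   integer, intent(in)  :: ni, nj, nk
   real(8), intent(in)  :: ap(ni,nj,nk), ae(ni,nj,nk), aw(ni,nj,nk)
   real(8), intent(in)  :: an(ni,nj,nk), as(ni,nj,nk), at(ni,nj,nk), ab(ni,nj,nk)
   real(8), intent(in)  :: phi(0:ni+1, 0:nj+1, 0:nk+1)
   real(8), intent(out) :: res(ni,nj,nk)
   integer :: i, j, k

   !$omp parallel do collapse(2) private(i)
   do k = 1, nk
      do j = 1, nj
         do i = 1, ni
            res(i,j,k) = ap(i,j,k)*phi(i,j,k)                            &
                       + ae(i,j,k)*phi(i+1,j,k) + aw(i,j,k)*phi(i-1,j,k) &
                       + an(i,j,k)*phi(i,j+1,k) + as(i,j,k)*phi(i,j-1,k) &
                       + at(i,j,k)*phi(i,j,k+1) + ab(i,j,k)*phi(i,j,k-1)
         end do
      end do
   end do
   !$omp end parallel do
end subroutine stencil_matvec

The non-orthogonal (19-point) case adds the twelve cross-derivative neighbours in the same pattern.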

Page 9

GenIDLEST – Time-Consuming Routines on CPU

Page 10

Data Structures and Mapping to Co-Processor Architectures

[Diagram] The grid hierarchy (global grid → nodal grid → mesh blocks → cache blocks) is mapped onto a compute node: mesh blocks are distributed over the CPUs via MPI and/or OpenMP, work is offloaded from the CPUs to the co-processors (GPU or Intel MIC), and cache blocks are the finest-grained unit of work. A sketch of the CPU-side traversal follows.
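As a rough illustration of the CPU side of this mapping, the sketch below loops over the mesh blocks owned by one MPI rank and lets OpenMP threads work concurrently on the cache blocks inside each mesh block; all names (sweep_all_blocks, smooth_cache_block, the block counts) are placeholders rather than GenIDLEST routines.

subroutine sweep_all_blocks(nmesh_blocks, ncache_blocks)
   ! MPI rank -> mesh blocks -> OpenMP threads over cache blocks.
   implicit none
   integer, intent(in) :: nmesh_blocks, ncache_blocks
   integer :: mb, cb

   do mb = 1, nmesh_blocks              ! mesh blocks owned by this MPI rank
      !$omp parallel do schedule(static)
      do cb = 1, ncache_blocks          ! cache blocks: the thread/SIMD-parallel unit
         call smooth_cache_block(mb, cb)
      end do
      !$omp end parallel do
   end do

contains

   subroutine smooth_cache_block(mb, cb)
      ! Placeholder for any per-cache-block kernel (e.g. an SSOR/Richardson sweep).
      integer, intent(in) :: mb, cb
   end subroutine smooth_cache_block

end subroutine sweep_all_blocks

On the co-processor, the same cache blocks become the unit of work for GPU thread blocks (or MIC threads), which is what the offload paths in the diagram represent.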

Page 11

Challenges in GPU Co-Processing Implementation

•  The large memory footprint of problems of practical interest limits GPU use to a subset of the code
   •  Added overhead of getting data on and off the GPU inside the time-integration loop
•  All computations on the GPU have to be data parallel – anything that deviates from this paradigm is hard to implement
•  Message passing for global block-boundary synchronization between processors requires additional data transfers to and from the GPU (sketched below)
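The extra transfers in the last bullet typically take the form sketched below: boundary data packed on the GPU is staged through pinned host buffers, exchanged with MPI, and copied back to the GPU for unpacking. Buffer and routine names are illustrative assumptions, not the GenIDLEST interfaces.

subroutine exchange_boundary(nbuf, d_sendbuf, d_recvbuf, neighbor, comm)
   ! GPU block-boundary exchange staged through the host.
   ! d_sendbuf is assumed to have been packed by a GPU kernel beforehand.
   use cudafor
   use mpi
   implicit none
   integer, intent(in) :: nbuf, neighbor, comm
   real(8), device     :: d_sendbuf(nbuf), d_recvbuf(nbuf)
   real(8), allocatable, pinned :: h_send(:), h_recv(:)
   integer :: ierr, status(MPI_STATUS_SIZE)

   allocate(h_send(nbuf), h_recv(nbuf))

   h_send = d_sendbuf                       ! device -> host copy
   call MPI_Sendrecv(h_send, nbuf, MPI_DOUBLE_PRECISION, neighbor, 0, &
                     h_recv, nbuf, MPI_DOUBLE_PRECISION, neighbor, 0, &
                     comm, status, ierr)
   d_recvbuf = h_recv                       ! host -> device copy
   ! a GPU kernel would then unpack d_recvbuf into the ghost cells

   deallocate(h_send, h_recv)
end subroutine exchange_boundary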


Page 12

GPU  IMPLEMENTATION  

•  Implementation strategy
   •  Port large functional blocks of subroutines to the GPU without overwhelming GPU memory, while keeping CPU-GPU and GPU-CPU data transfers to a minimum
•  Over 75 CUDA Fortran kernels written
   •  Calculation of diffusion coefficients for all transport equations
   •  Solution of all linear systems in the transport and pressure equations, which consume 50-90% of the computation time on the CPU
      •  2 global sparse matrix-vector multiplies, 4 global inner products (global reductions), 2 preconditioner calls and 4 global block-boundary updates per BiCGSTAB iteration (see the annotated skeleton below)
   •  Message-passing packing and unpacking
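To make that per-iteration cost pattern concrete, here is a minimal single-block BiCGSTAB program with the two matrix-vector products, the global reductions, the preconditioner applications and the halo-update points marked in comments. A 1D Poisson matrix stands in for the 7/19-point stencils and an identity preconditioner stands in for additive Schwarz; it is a sketch for orientation only, not the GenIDLEST solver.

program bicgstab_sketch
   implicit none
   integer, parameter :: n = 64
   real(8) :: x(n), b(n), r(n), rhat(n), p(n), v(n), s(n), t(n), phat(n), shat(n)
   real(8) :: rho, rho_old, alpha, beta, omega, resid
   integer :: it

   b = 1.0d0;  x = 0.0d0
   call matvec(x, r);  r = b - r
   rhat = r;  p = 0.0d0;  v = 0.0d0
   rho = 1.0d0;  alpha = 1.0d0;  omega = 1.0d0

   do it = 1, 200
      rho_old = rho
      rho   = dot_product(rhat, r)               ! global reduction 1
      beta  = (rho/rho_old)*(alpha/omega)
      p     = r + beta*(p - omega*v)
      phat  = p                                  ! preconditioner call 1 (identity here)
      ! block-boundary / halo update of phat would go here in parallel runs
      call matvec(phat, v)                       ! sparse matrix-vector multiply 1
      alpha = rho / dot_product(rhat, v)         ! global reduction 2
      s     = r - alpha*v
      if (sqrt(dot_product(s, s)) < 1.0d-12) then
         x = x + alpha*phat                      ! early convergence guard
         exit
      end if
      shat  = s                                  ! preconditioner call 2 (identity here)
      ! block-boundary / halo update of shat would go here in parallel runs
      call matvec(shat, t)                       ! sparse matrix-vector multiply 2
      omega = dot_product(t, s) / dot_product(t, t)   ! global reductions 3 and 4
      x     = x + alpha*phat + omega*shat
      r     = s - omega*t
      resid = sqrt(dot_product(r, r))            ! convergence check
      if (resid < 1.0d-10) exit
   end do
   print *, 'iterations:', it

contains

   subroutine matvec(u, au)
      ! 1D Laplacian (3-point stencil), a tiny stand-in for the structured
      ! 7/19-point stencil operators used on orthogonal/non-orthogonal grids.
      real(8), intent(in)  :: u(n)
      real(8), intent(out) :: au(n)
      integer :: i
      au(1) = 2.0d0*u(1) - u(2)
      au(n) = 2.0d0*u(n) - u(n-1)
      do i = 2, n-1
         au(i) = 2.0d0*u(i) - u(i-1) - u(i+1)
      end do
   end subroutine matvec
end program bicgstab_sketch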

Page 13

Co-design with computer science team

•  Approach
   1.  The program was profiled to identify performance bottlenecks.
   2.  Accelerator-architecture-aware optimizations were applied, including coalesced memory access, shared-memory usage, double buffering with pointer swapping, and removal of redundant synchronizations.
   3.  Kernels were modified to suit the GPUs, e.g. eliminating trivial memory accesses by computing quantities in situ, and making dot products and reductions ghost-cell aware (see the sketch below).
•  GPU code optimization improved performance from 3x to 6x
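As one illustration of the ghost-cell-aware reductions in item 3, the CUDA Fortran sketch below computes per-block partial dot products while skipping the ghost layers inside the kernel, so no separate interior copy is needed. The module and array names, the single ghost layer, and the fixed 256-thread blocks are assumptions for illustration; this is not the actual GenIDLEST kernel.

module gpu_reductions
   use cudafor
   implicit none
contains

   attributes(global) subroutine dot_interior(a, b, partial, ni, nj, nk)
      ! Partial dot product over interior cells only; arrays are dimensioned
      ! with one ghost layer (0:n+1) which the interior index map never touches.
      integer, value  :: ni, nj, nk
      real(8), device :: a(0:ni+1, 0:nj+1, 0:nk+1)
      real(8), device :: b(0:ni+1, 0:nj+1, 0:nk+1)
      real(8), device :: partial(*)          ! one partial sum per thread block
      real(8), shared :: cache(256)          ! assumes blocks of 256 threads
      integer :: tid, gid, stride, n_int, i, j, k, m
      real(8) :: s

      n_int  = ni*nj*nk
      tid    = threadIdx%x
      gid    = (blockIdx%x - 1)*blockDim%x + tid
      stride = blockDim%x*gridDim%x

      s = 0.0d0
      do m = gid, n_int, stride              ! grid-stride loop over interior cells
         i = mod(m-1, ni) + 1
         j = mod((m-1)/ni, nj) + 1
         k = (m-1)/(ni*nj) + 1
         s = s + a(i,j,k)*b(i,j,k)
      end do

      cache(tid) = s                         ! shared-memory block reduction
      call syncthreads()
      m = blockDim%x/2
      do while (m >= 1)
         if (tid <= m) cache(tid) = cache(tid) + cache(tid+m)
         call syncthreads()
         m = m/2
      end do
      if (tid == 1) partial(blockIdx%x) = cache(1)   ! host sums the partials
   end subroutine dot_interior

end module gpu_reductions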


Page 14

Turbulent Channel flow

•  Turbulent flow through a channel at Reτ = 180
•  Orthogonal mesh
   •  7-point stencil
   •  64³ grid nodes
   •  Single mesh block

Page 15

Channel flow – validation

•  BiCGSTAB solver
   •  Additive Schwarz block preconditioner
   •  Richardson iterative smoothing
•  Time-averaged turbulence statistics
   •  Root-mean-square flow velocities
•  Single vs. double precision
   •  More SP cores on GPUs

DNS study: Moser, R. D., Kim, J., and Mansour, N. N., 1999, "Direct numerical simulation of turbulent channel flow up to Reτ = 590," Physics of Fluids, 11, p. 943.

Page 16

Channel flow – performance results

•  Cache block configurations
   •  3D data blocking
   •  25 iterations per time step for the pressure Poisson equation
•  Different GPUs
   •  200 time steps
   •  Time measured in the time-integration loop

Time in the time-integration loop (s):
   CPU, DP: 504.5
   Nvidia Tesla K20c: 39.7 (SP), 63.1 (DP)
   Nvidia Tesla C2050: 54.2 (SP), 96.3 (DP)

Cache-block thread layout vs. time (see the launch-configuration sketch below):
   Layout    Total threads   Time (s)
   8x4x4     128             98.7
   8x8x2     128             97.9
   8x8x4     256             101.1
   16x8x2    256             103.6
   32x4x2    256             104
   64x2x2    256             103.5
   64x4x2    512             96.3
   32x8x2    512             97.5
   16x16x2   512             98.4
   8x8x8     512             107.5
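For orientation, a layout such as 64x4x2 in the table above corresponds to the shape of a CUDA thread block; a hypothetical CUDA Fortran launch configuration might look like the sketch below (ni/nj/nk and the kernel call are illustrative, not the GenIDLEST launch code).

program launch_layout_sketch
   use cudafor
   implicit none
   integer, parameter :: ni = 64, nj = 64, nk = 64      ! 64^3 channel-flow block
   type(dim3) :: tBlock, grid

   tBlock = dim3(64, 4, 2)                              ! 64*4*2 = 512 threads per block
   grid   = dim3((ni+63)/64, (nj+3)/4, (nk+1)/2)        ! enough blocks to cover the mesh block
   print *, 'threads per block =', tBlock%x * tBlock%y * tBlock%z
   ! call stencil_kernel<<<grid, tBlock>>>(...)         ! hypothetical kernel launch
end program launch_layout_sketch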

Page 17

Turbulent Pipe flow

•  Turbulent flow through a pipe at Reτ = 338
•  Non-orthogonal body-fitted mesh
   •  19-point stencil
   •  1.78 million grid nodes
   •  36 mesh blocks
•  Krylov solvers
   •  Sensitive to round-off
   •  Single-precision calculations
      •  Relaxed convergence criterion
      •  Restricted maximum number of iterations

Page 18

Pipe flow – validation

[Plots] Mean velocity profiles (U+ vs. R/D and U vs. y+), RMS velocities rms(U,V) vs. y+, and shear stress (total, laminar, turbulent, experimental) vs. r/D, comparing experimental data with double-precision and single-precision numerical results.

Experiments by: den Toonder, J. M. J., and Nieuwstadt, F. T. M., 1997, "Reynolds number effects in a turbulent pipe flow for low to moderate Re," Physics of Fluids, 9(11), pp. 3398-3409.

Page 19

Pipe flow – performance results

•  Cache block sizes
   •  The 'i' direction has the most threads
•  Strong scaling
   •  CPU and GPU cache block sizes are optimized

[Table] Total threads in each cache block vs. time (s) for configurations 32x38x40, 32x40x40 and 32x39x40:
   512 512 416 255 304 200 208 263 416 400 380 270 190 200 293 293

Strong scaling (time in s):
   Number of CPU cores or GPUs   CPU (DP)   C2050 GPU (DP)   C2050 GPU (SP)
   2                             8492       1975             1353
   4                             4220       1290             806
   12                            1426       837              492
   36                            511        386              255

Page 20

Close up of Bat wing motion


Page 21

Wing flapping – simulation details

•  Flapping of a bat wing using the IBM
   •  25 million nodes in the background grid
   •  9,400 surface mesh elements for the bat wing
   •  Re = 433
•  Runs on HokieSpeed at Virginia Tech
   •  2 Nvidia M2050/C2050 GPUs
   •  Dual-socket Intel Xeon E5645 CPUs

Page 22

Wing flapping – geometry


Page 23

Comparison of results (GPU vs CPU)

•  Lift force over one cycle of bat wing flapping

Page 24

GPU (60 GPUs) vs. CPU (60 CPU cores)

Page 25

Solver co-design with Math team

•  Solution of the pressure Poisson equation
   •  Most time-consuming function (50 to 90% of total time)
•  Solving the linear system Ax = b
   •  'A' remains constant from one time step to the next in many CFD calculations
•  GCROT algorithm
   •  Recycling of basis vectors from one time step to the next (see the sketch below)

Page 26

Current/Future  work  

•  Use GPU-aware MPI
   •  GPUDirect v2 gives about a 25% performance improvement (see the sketch below)
•  GCROT algorithm
   •  Optimization
   •  Implementation of GCROT on GPUs
•  Non-orthogonal case (pipe flow)
   •  Optimization
•  Use of co-processing (Intel MIC) for offloading
   •  Direct offloading with pragmas
   •  Combination of CPU and co-processor
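With a CUDA-aware MPI library (for example MVAPICH2-GDR or a CUDA-enabled Open MPI build), the host staging used in the earlier exchange sketch can be dropped and device buffers handed to MPI directly, with GPUDirect moving the data. The snippet below is a hypothetical sketch of that pattern, not the planned GenIDLEST implementation.

subroutine exchange_boundary_gpudirect(nbuf, d_sendbuf, d_recvbuf, neighbor, comm)
   ! With CUDA-aware MPI, device buffers go straight into the MPI call,
   ! removing the explicit device<->host copies (roughly the ~25% gain quoted above).
   use cudafor
   use mpi
   implicit none
   integer, intent(in) :: nbuf, neighbor, comm
   real(8), device     :: d_sendbuf(nbuf), d_recvbuf(nbuf)
   integer :: ierr, status(MPI_STATUS_SIZE)

   call MPI_Sendrecv(d_sendbuf, nbuf, MPI_DOUBLE_PRECISION, neighbor, 0, &
                     d_recvbuf, nbuf, MPI_DOUBLE_PRECISION, neighbor, 0, &
                     comm, status, ierr)
end subroutine exchange_boundary_gpudirect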

Page 27

Publications – in progress

•  Accelerating Bio-Inspired MAV Computations using GPUs
   •  Amit Amritkar, Danesh Tafti, Paul Sathre, Kaixi Hou, Sriram Chivukula, Wu-chun Feng
   •  AIAA Aviation and Aeronautics Forum and Exposition 2014, 16-20 June 2014, Atlanta, Georgia
•  CFD computations using preconditioned Krylov solver on GPUs
   •  Amit Amritkar, Danesh Tafti
   •  Proceedings of FEDSM 2014, August 3-7, 2014, Chicago, Illinois, USA
•  Recycling Krylov subspaces for CFD application
   •  Katarzyna Swirydowicz, Amit Amritkar, Eric de Sturler, Danesh Tafti
   •  FEDSM 2014, August 3-7, 2014, Chicago, Illinois, USA

Page 28

Fruit bat wing motion

Based on work by: Viswanath, K., et al., "Straight-Line Climbing Flight Aerodynamics of a Fruit Bat," Physics of Fluids, accepted.

Page 29

TAU profile data – 10 time steps

•  Mpi_allreduce
•  Gpu_mpi_sendrecv
•  Ibm_movement
•  Ibm_write_tecplot
•  Gpu_pc2
•  Mpi_waitall
•  Mpi_sendrecv
•  Read_grid
•  Gpu_dot_prd
•  Ibm_buffer_alloc
•  Gpu_exchange_var
•  Ibm_projection
•  Alloc_field_var
•  Ibm_locate_ijk

Total = 157 s, communication = 70 s, initial/end (one-time) = 25 s