AceCAST – High Performance CUDA based Weather Research Forecasting (WRF) Model
Allen HUANG ([email protected]), Tempo Quest, Inc., and scientists from the Space Science and Engineering Center, University of Wisconsin–Madison
Introduction
WRF System Components
Code Validation
GPU Technology Conference, Silicon Valley, April 2016
Summary
Jimy Dudhia: WRF physics options
Figure: potential temperatures [K]; difference between potential temperatures on CPU and GPU [K]
Performance Profile of WRF
Jan. 2000 30 km workload profiling by John Michalakes, “Code restructuring to improve performance in WRF model physics on Intel Xeon Phi”, Workshop on Programming Weather, Climate, and Earth-System Models on Heterogeneous Multi-core Platforms, September 20, 2013
Performance Comparison
• ~35% of the widely used WRF code base ported to CUDA-C
• Highly skilled core GPU programming team established
• High-performance WRF demonstrated in ~70 papers and independently validated by NVIDIA (CUDA-based modules achieved speedups ranging from 105x to 1311x)
• Lessons learned with CUDA-C can be readily applied to OpenACC 2.0 / OpenMP 4.0 optimization of WRF modules for GPUs / Intel MIC
• Currently funding, not science or technology, is the only barrier to commercialization (time-to-solution is only ~12 months)
The WRF physics components are microphysics, cumulus parametrization, planetary boundary layer (PBL), land-surface model and shortwave/longwave radiation.
• AceCAST is a proprietary version of WRF, the mesoscale and global Weather Research and Forecasting model
• Designed for both operational forecasters and atmospheric researchers; widely used by commercial, government, and institutional users around the world, in >150 countries
• WRF is suitable for a broad spectrum of applications across domain scales ranging from meters to hundreds of kilometers
• Increases in computational power enable:
  - Increased vertical as well as horizontal resolution
  - More timely delivery of forecasts
  - Probabilistic forecasts based on ensemble methods, with much improved forecast accuracy
• Why an accelerated AceCAST?
  - High-resolution accuracy and cost performance
  - Need for strong scaling
  - Greatly improved profits for weather-sensitive industries
GPU Speedups
Parallel Execution of WRF on GPU
Single-threaded, non-vectorized CPU code is compiled with gfortran 4.4.6
WRF model integration procedure
• Fused multiply-add was turned off (--fmad=false)
• The GNU C math library was used on the GPU, i.e. powf(), expf(), sqrt(), and logf() were replaced by routines from the GNU C library → bit-exact output compared to the gfortran-compiled CPU code
• Small output differences remain with -fast-math (shown below)
Equations describing YSU PBL scheme are executed in one thread for each grid point
Mapping of the CONUS domain onto one GPU thread-block-grid domain
Implementation of YSU PBL in GPUs with CUDA Program
              CPU runtime   GPU runtime   Speedup
One CPU core    1800.0 ms
Non-coalesced                   50.0 ms     36.0x
Coalesced                       48.0 ms     37.5x
Improvement of YSU PBL in GPU-Based Parallelism
Three configurations of shared memory vs. L1 cache are available:
(1) 48 KB shared memory, 16 KB L1 cache (default)
(2) 32 KB shared memory, 32 KB L1 cache
(3) 16 KB shared memory, 48 KB L1 cache
→ selected by applying “cudaFuncCachePreferL1”
→ After increasing the L1 cache with “cudaFuncCachePreferL1”, the GPU runtime drops and the speedup increases:

              CPU runtime   GPU runtime   Speedup
One CPU core    1800.0 ms
Non-coalesced                   48.0 ms     37.5x
Coalesced                       45.0 ms     40.0x
Through scalarization, the number of temporary arrays is reduced from 68 down to 14
→ this greatly reduces global memory accesses
              CPU runtime   GPU runtime   Speedup
One CPU core    1800.0 ms
Non-coalesced                   39.0 ms     46.2x
Coalesced                       35.0 ms     51.4x
GPU Runtime & Speedups with Multi-GPU Implementations for YSU PBL Module
TQI AceCAST's high performance has been validated by peer-reviewed publications and independently verified by NVIDIA:
Ø Peer review of the software implementation and optimization by multiple anonymous experts, validating TQI's claimed performance
Ø Independent evaluation and implementation of the source code (7 targeted CUDA-C modules achieve the published GPU performance within a full WRF model run)
Ø Official presentations of TQI performance results in international professional and scientific conferences
ü WRF model physics performance comparison between Intel Xeon Phi and NVIDIA Kepler K20/K40 (with/without boost mode)
ü Results from the NVIDIA PSG cluster (HQ, USA): >10x speedups
ü NVIDIA independent evaluation of WRF GPU CUDA-C acceleration : WSM6 – 64x speedups
All runs were made on the NVIDIA PSG cluster with a user-provided WRF namelist
• CPU-only and CPU+GPU hybrid results are based on standard NCAR WRF 3.6.1 plus the CUDA WRF 3.6.1 modules developed by SSEC
• CPU-only results: 2 x IVB CPUs, 20 cores total
• CPU+GPU results: 2 x IVB CPUs (20 cores), using only 1 of the 2 GPUs on a Tesla K80
• System software: CUDA 7.0, PGI 15.7, running on CentOS 6
Contact: Valeriu Codreanu ([email protected])
P6338
Category: Earth System Modelling - ESM01