An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model
Dr. Richard Loft*
Director, Technology Development
CISL/NCAR
*National Center for Atmospheric Research
GTC, San Jose, CA March 26, 2018
Outline
• Origins: backstory
• The MPAS Model
• Team
• Tools and Design
• Status
Project began with research based on student projects
• Two years of student projects in NCAR's Summer Internships in Parallel Computational Science (SIParCS) program focused on architectural inter-comparison.
• Projects focused on optimizing atmospheric numerical PDE solvers for both
CPUs and GPUs with performance portability in mind.
• Architectures compared:
o Xeon Broadwell, Haswell;
o Xeon Phi KNL;
o NVIDIA Tesla P100 → V100.
Benchmark Problem
• Shallow Water Equations (SWE) – a set of non-linear partial differential equations (PDEs)
– Capture features of atmospheric flow around the Earth
• Radial basis function-generated finite difference (RBF-FD) methods
[Figure: RBF-FD solution to the SWE test case “Flow over an isolated mountain” (cone-shaped mountain) using 655,532 points, shown at Day 1 and Day 15; inset shows an example 75-point stencil on a sphere, with stencil and non-stencil points marked, where the differential operator D is evaluated at every point [1]]
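As a concrete illustration of the method (a minimal sketch with synthetic data, not the presenters' code): RBF-FD approximates a differential operator D at each node as a weighted sum of field values over that node's stencil, with the weights precomputed from radial basis functions. Here the weights are random placeholders; only the shape of the computation is shown.

```python
import numpy as np

def apply_rbf_fd(field, stencil_idx, weights):
    """Apply a precomputed RBF-FD operator.

    field       : (N,) field values at all nodes on the sphere
    stencil_idx : (N, n) indices of each node's n stencil neighbors
    weights     : (N, n) RBF-FD differentiation weights per node
    Returns an (N,) approximation of D(field) at every node.
    """
    # Gather each node's stencil values and take the weighted sum.
    return np.sum(weights * field[stencil_idx], axis=1)

rng = np.random.default_rng(0)
N, n = 1000, 75                        # 75-point stencils, as in the slide
field = rng.standard_normal(N)
stencil_idx = rng.integers(0, N, size=(N, n))   # placeholder neighbor lists
weights = rng.standard_normal((N, n))           # placeholder RBF-FD weights
d_field = apply_rbf_fd(field, stencil_idx, weights)
```

Every node's operator evaluation is independent, which is what makes the scheme a good fit for both wide-SIMD CPUs and GPUs.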
Optimizing Stencils for different architectures
Directive-based portability in the RBF-FD shallow water equations (2-D unstructured stencil)
• The computational-intensity (CI) roofline model generally predicts performance well, even for more complicated algorithms.
• Xeon performance drops to the DRAM bandwidth limit when the cache size is exceeded, with some state reuse.
• Xeon Phi (KNL) HBM memory is less sensitive to problem size than Xeon, and saturates at the CI-predicted figure.
• NVIDIA Pascal P100 performance fits the CI model, but GPUs require higher levels of parallelism to reach saturation.
[Chart: performance (GFLOPS, 0–350) vs. workload size for Broadwell, KNL, and P100, spanning regimes of insufficient and sufficient workload parallelism]
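The CI roofline bound referenced in the bullets can be written in one line: attainable performance is the minimum of the compute peak and CI times memory bandwidth. The sketch below uses illustrative machine parameters (assumed, not measurements from the talk).

```python
def roofline_gflops(ci, peak_gflops, bw_gb_s):
    """Roofline model: attainable GFLOPS is capped by either the compute
    peak or by computational intensity (FLOPs/byte) x memory bandwidth."""
    return min(peak_gflops, ci * bw_gb_s)

# Illustrative (assumed) peak GFLOPS and sustained bandwidth in GB/s:
machines = {
    "Broadwell (DRAM)": (1200.0, 77.0),
    "KNL (HBM)":        (3000.0, 450.0),
    "P100 (HBM2)":      (4700.0, 732.0),
}
ci = 0.5  # assumed FLOPs/byte for a bandwidth-bound unstructured stencil
for name, (peak, bw) in machines.items():
    print(f"{name}: {roofline_gflops(ci, peak, bw):.0f} GFLOPS attainable")
```

At such a low CI, every machine lands on the bandwidth-limited slope of its roofline, which is why HBM-equipped parts (KNL, P100) pull ahead once enough parallel work is exposed.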
What is MPAS? – The Model for Prediction Across Scales
NCAR's global meteorological/climate model; ~100,000 SLOC
Simulation of 2012 Tropical Cyclones at 4 km Resolution – Courtesy of Falko Judt, NCAR
Weather and Climate Alliance (WACA):
• NCAR
• NVIDIA Corporation
• IBM Corporation/The Weather Company
• University of Wyoming, CE&EE Department
• Korean Institute of Science and Technology Information (KISTI)
Initial Divide and Conquer Strategy
[Diagram: the work is split between MPAS Dynamics and MPAS Physics, with the partners exchanging problem reports and support in one direction, and ideas and results in the other]
Weather and Climate Alliance (WACA): A Collaboration for Earth System Model Acceleration
• NCAR (2+4)
  o Dr. Rich Loft, Director, TDD
  o Dr. Raghu Raj Kumar, Project Scientist, TDD
  o Clint Olson, TDD
  o Bill Skamarock, Senior Scientist, MMM
  o Michael Duda, Software Engineer, MMM
  o Dave Gill, Software Engineer, MMM
• KISTI (2+1)
  o Minsu Joh, KISTI Director, Disaster Management Research Center
  o Dr. Ji-Sun Kang, Senior Researcher
  o Jae-Youp Kim, GRA
• NVIDIA/PGI (1+3)
  o Greg Branch, NVIDIA, Sales
  o Dr. Carl Ponder, Senior Applications Engineer
  o Brent Leback, PGI Compiler Engineering Manager
  o Craig Tierny, Solutions Architect
• University of Wyoming (1+5)
  o Dr. Suresh Muknahallipatna, Professor, E&CE, UW
  o Supreeth Suresh, Pranay Reddy, Sumathi Lakshmiranganathan, Cena Miller, Bradley Riotto – GRAs
~6 PIs + 13 technical staff · Started in September 2016 (18 months) · ~9 FTE-years
Since September: added IBM and The Weather Company
• IBM/TWC participants (1+2)
  o Jaime Moreno
  o Todd Hutchinson
  o Constantinos Evangelinos
Tools for Accelerating Code Optimization
• Kernel GENerator (KGEN)
o Extracts kernels from Fortran applications
o Creates:
• Standalone source code
• Input and output state for verification
• Added support for code coverage and representation
o Broad user community
• 8 domestic institutions
• 5 international institutions
• 1 company
o Available on Github:
https://github.com/NCAR/KGen
KGEN is a useful tool for accelerating code porting and optimization
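The verification idea behind KGEN-style kernel extraction can be sketched generically (this is not KGEN's actual output or API; the function and data are made up for illustration): the standalone driver replays captured input state through the extracted kernel and compares the result against the output state captured from the full model.

```python
import numpy as np

def kernel(x):
    """Stand-in for an extracted kernel (hypothetical toy arithmetic)."""
    return 2.0 * x + 1.0

# In a real workflow the instrumented model writes these state files;
# here we fabricate both for illustration.
input_state = np.linspace(0.0, 1.0, 5)
reference_output = 2.0 * input_state + 1.0   # what the full model produced

# Standalone driver: replay the captured input and verify the kernel.
result = kernel(input_state)
max_err = float(np.max(np.abs(result - reference_output)))
verified = max_err <= 1e-12
print(f"max error = {max_err:.1e}, verified = {verified}")
```

Because the kernel runs against recorded state outside the full model, it can be optimized and re-verified in seconds rather than re-running a whole simulation.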
MPAS Synchronous and Asynchronous Execution
[Diagram: per-timestep (𝛥t) execution. Each step runs dynamics and physics followed by the land surface scheme; LW and SW radiation runs either in line or lagged/overlapped with subsequent steps, while asynchronous I/O writes to disk concurrently.]
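One way to picture the lagged-radiation pattern (a sketch of the scheduling idea only, not the MPAS implementation; the functions, interval, and arithmetic are invented): radiation for the current state is launched in the background, dynamics and physics keep stepping using the previous radiation tendency, and the new result is collected one radiation interval later.

```python
from concurrent.futures import ThreadPoolExecutor

RAD_INTERVAL = 4  # radiation recomputed every 4 dynamics steps (assumed)

def dynamics_and_physics(state):
    # Fast per-step update (on the GPU in MPAS); toy arithmetic here.
    return state + 1.0

def radiation(state):
    # Slow LW/SW radiation (stays on the CPU in MPAS); toy arithmetic here.
    return 0.1 * state

def run(n_steps):
    state, rad_tendency = 0.0, 0.0
    future = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(n_steps):
            if step % RAD_INTERVAL == 0:
                if future is not None:
                    # Apply radiation computed from the state one interval ago.
                    rad_tendency = future.result()
                # Launch radiation for the current state in the background.
                future = pool.submit(radiation, state)
            state = dynamics_and_physics(state) + rad_tendency
    return state

final = run(12)
```

The lag trades a slightly stale radiation tendency for full overlap of the CPU radiation work with GPU dynamics, hiding its cost entirely when it fits within one interval.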
Phase 2: pushing on to a full MPAS port
• Status of GPU-based model components
  o Ported, optimized, verified:
    • Dry dynamical core
    • GPU-direct implementation of MPAS halo exchanges
  o Ported, optimized:
    • Moist dynamics (tracer transport)
    • Xu-Randall cloud fraction
  o Ported, undergoing optimization:
    • WSM6 microphysics
    • YSU boundary layer scheme
  o Awaiting porting:
    • Scale-insensitive Tiedtke convection scheme
    • Monin-Obukhov surface layer scheme
• CPU-based components
  o Overlapping SW and LW RRTMG radiation (lagged radiation)
  o NOAH land surface model (synchronous, remains on CPU)
  o SIONlib I/O subsystem
IBM/TWC MPAS Objectives
• MPAS grid with local refinement
24-hour global forecasts
• 12 km global grid
• 3 km refinement over selected regions.
• 32.8 M horizontal points
• 56 layers
Forecast requirement
• Complete a 20-hour simulation…
• …in 45 minutes
• xRe = 26.7
• For 𝛥t = 18 sec, the timestep budget is 0.674 seconds
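A quick check shows the requirement numbers are internally consistent (variable names are mine; 2,700 s of wallclock over 4,000 steps gives 0.675 s, matching the slide's quoted ~0.674 s up to rounding):

```python
# Arithmetic check of the forecast requirement from the slide.
sim_hours = 20            # simulated hours to complete
wallclock_min = 45        # allowed wallclock minutes
dt = 18                   # model timestep in seconds

speedup = sim_hours * 3600 / (wallclock_min * 60)  # required real-time ratio, xRe
n_steps = sim_hours * 3600 / dt                    # timesteps in the simulation
budget = wallclock_min * 60 / n_steps              # wallclock seconds per timestep

print(f"xRe    = {speedup:.1f}")   # ~26.7
print(f"steps  = {n_steps:.0f}")   # 4000
print(f"budget = {budget:.3f} s")  # ~0.675 s per timestep
```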
Refined grids can be generated anywhere desired.
Dr. Kumar will show next that as few as 800 V100s could achieve this goal…