An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model
Dr. Richard Loft*
Director, Technology Development
CISL/NCAR
*National Center for Atmospheric Research
GTC, San Jose, CA March 26, 2018
Outline
• Origins: backstory
• The MPAS Model
• Team
• Tools and Design
• Status
Project began with research based on student projects
• Two years of student projects in NCAR's Summer Internships in Parallel Computational Science (SIParCS) program focused on architectural inter-comparison.
• Projects focused on optimizing atmospheric numerical PDE solvers for both
CPUs and GPUs with performance portability in mind.
• Architectures compared:
o Xeon Broadwell, Haswell;
o Xeon Phi KNL;
o NVIDIA Tesla P100 → V100.
Benchmark Problem
• Shallow Water Equations (SWE) – a set of non-linear partial differential equations (PDEs)
– Capture features of atmospheric flow around the Earth
• Radial basis function-generated finite difference (RBF-FD) methods
[Figure: RBF-FD solution to the SWE test case “Flow over an isolated mountain” (cone-shaped mountain) using 655,532 points, shown at Day 1 and Day 15; inset shows an example 75-point stencil on a sphere, with stencil and non-stencil points marked, where the differential operator D is evaluated at every point [1]]
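As a concrete illustration of the method (a minimal sketch with synthetic data, not the presenters' code): RBF-FD approximates a differential operator D at each node as a weighted sum of field values over that node's stencil, with the weights precomputed from radial basis functions. Here the weights are random placeholders; only the shape of the computation is shown.

```python
import numpy as np

def apply_rbf_fd(field, stencil_idx, weights):
    """Apply a precomputed RBF-FD operator.

    field       : (N,) field values at all nodes on the sphere
    stencil_idx : (N, n) indices of each node's n stencil neighbors
    weights     : (N, n) RBF-FD differentiation weights per node
    Returns an (N,) approximation of D(field) at every node.
    """
    # Gather each node's stencil values and take the weighted sum.
    return np.sum(weights * field[stencil_idx], axis=1)

rng = np.random.default_rng(0)
N, n = 1000, 75                        # 75-point stencils, as in the slide
field = rng.standard_normal(N)
stencil_idx = rng.integers(0, N, size=(N, n))   # placeholder neighbor lists
weights = rng.standard_normal((N, n))           # placeholder RBF-FD weights
d_field = apply_rbf_fd(field, stencil_idx, weights)
```

Every node's operator evaluation is independent, which is what makes the scheme a good fit for both wide-SIMD CPUs and GPUs.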
Optimizing Stencils for different architectures
Directive-based portability in the RBF-FD shallow water equations (2-D unstructured stencil)
• The computational-intensity (CI) roofline model generally predicts performance well, even for more complicated algorithms.
• Xeon performance drops to the DRAM bandwidth limit when the cache size is exceeded, with some state reuse.
• Xeon Phi (KNL) HBM memory is less sensitive to problem size than Xeon, and saturates at the CI-predicted figure.
• NVIDIA Pascal P100 performance fits the CI model, but GPUs require higher levels of parallelism to reach saturation.
[Chart: performance (GFLOPS, 0–350) vs. workload size for Broadwell, KNL, and P100, spanning regimes of insufficient and sufficient workload parallelism]
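The CI roofline bound referenced in the bullets can be written in one line: attainable performance is the minimum of the compute peak and CI times memory bandwidth. The sketch below uses illustrative machine parameters (assumed, not measurements from the talk).

```python
def roofline_gflops(ci, peak_gflops, bw_gb_s):
    """Roofline model: attainable GFLOPS is capped by either the compute
    peak or by computational intensity (FLOPs/byte) x memory bandwidth."""
    return min(peak_gflops, ci * bw_gb_s)

# Illustrative (assumed) peak GFLOPS and sustained bandwidth in GB/s:
machines = {
    "Broadwell (DRAM)": (1200.0, 77.0),
    "KNL (HBM)":        (3000.0, 450.0),
    "P100 (HBM2)":      (4700.0, 732.0),
}
ci = 0.5  # assumed FLOPs/byte for a bandwidth-bound unstructured stencil
for name, (peak, bw) in machines.items():
    print(f"{name}: {roofline_gflops(ci, peak, bw):.0f} GFLOPS attainable")
```

At such a low CI, every machine lands on the bandwidth-limited slope of its roofline, which is why HBM-equipped parts (KNL, P100) pull ahead once enough parallel work is exposed.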
What is MPAS? – The Model for Prediction Across Scales
NCAR's global meteorological/climate model; ~100,000 SLOC
Simulation of 2012 Tropical Cyclones at 4 km Resolution – Courtesy of Falko Judt, NCAR
Weather and Climate Alliance (WACA):
• NCAR
• NVIDIA Corporation
• IBM Corporation/The Weather Company
• University of Wyoming, CE&EE Department
• Korean Institute of Science and Technology Information (KISTI)
Initial Divide and Conquer Strategy
[Diagram: the work is split between MPAS Dynamics and MPAS Physics, with the partners exchanging problem reports and support in one direction, and ideas and results in the other]
Weather and Climate Alliance (WACA): A Collaboration for Earth System Model Acceleration
• NCAR (2+4)
  o Dr. Rich Loft, Director, TDD
  o Dr. Raghu Raj Kumar, Project Scientist, TDD
  o Clint Olson, TDD
  o Bill Skamarock, Senior Scientist, MMM
  o Michael Duda, Software Engineer, MMM
  o Dave Gill, Software Engineer, MMM
• KISTI (2+1)
  o Minsu Joh, KISTI Director, Disaster Management Research Center
  o Dr. Ji-Sun Kang, Senior Researcher
  o Jae-Youp Kim, GRA
• NVIDIA/PGI (1+3)
  o Greg Branch, NVIDIA, Sales
  o Dr. Carl Ponder, Senior Applications Engineer
  o Brent Leback, PGI Compiler Engineering Manager
  o Craig Tierny, Solutions Architect
• University of Wyoming (1+5)
  o Dr. Suresh Muknahallipatna, Professor, E&CE, UW
  o Supreeth Suresh, Pranay Reddy, Sumathi Lakshmiranganathan, Cena Miller, Bradley Riotto – GRAs
~6 PIs + 13 technical staff · Started in September 2016 (18 months) · ~9 FTE-years
Since September: added IBM and The Weather Company
• IBM/TWC participants (1+2)
  o Jaime Moreno
  o Todd Hutchinson
  o Constantinos Evangelinos
Tools for Accelerating Code Optimization
• Kernel GENerator (KGEN)
o Extracts kernels from Fortran applications
o Creates:
• Standalone source code
• Input and output state for verification
• Added support for code coverage and representation
o Broad user community
• 8 domestic institutions
• 5 international institutions
• 1 company
o Available on Github:
https://github.com/NCAR/KGen
KGEN is a useful tool for accelerating code porting and optimization
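The verification idea behind KGEN-style kernel extraction can be sketched generically (this is not KGEN's actual output or API; the function and data are made up for illustration): the standalone driver replays captured input state through the extracted kernel and compares the result against the output state captured from the full model.

```python
import numpy as np

def kernel(x):
    """Stand-in for an extracted kernel (hypothetical toy arithmetic)."""
    return 2.0 * x + 1.0

# In a real workflow the instrumented model writes these state files;
# here we fabricate both for illustration.
input_state = np.linspace(0.0, 1.0, 5)
reference_output = 2.0 * input_state + 1.0   # what the full model produced

# Standalone driver: replay the captured input and verify the kernel.
result = kernel(input_state)
max_err = float(np.max(np.abs(result - reference_output)))
verified = max_err <= 1e-12
print(f"max error = {max_err:.1e}, verified = {verified}")
```

Because the kernel runs against recorded state outside the full model, it can be optimized and re-verified in seconds rather than re-running a whole simulation.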
MPAS Synchronous and Asynchronous Execution
[Diagram: per-timestep (𝛥t) execution. Each step runs dynamics and physics followed by the land surface scheme; LW and SW radiation runs either in line or lagged/overlapped with subsequent steps, while asynchronous I/O writes to disk concurrently.]
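One way to picture the lagged-radiation pattern (a sketch of the scheduling idea only, not the MPAS implementation; the functions, interval, and arithmetic are invented): radiation for the current state is launched in the background, dynamics and physics keep stepping using the previous radiation tendency, and the new result is collected one radiation interval later.

```python
from concurrent.futures import ThreadPoolExecutor

RAD_INTERVAL = 4  # radiation recomputed every 4 dynamics steps (assumed)

def dynamics_and_physics(state):
    # Fast per-step update (on the GPU in MPAS); toy arithmetic here.
    return state + 1.0

def radiation(state):
    # Slow LW/SW radiation (stays on the CPU in MPAS); toy arithmetic here.
    return 0.1 * state

def run(n_steps):
    state, rad_tendency = 0.0, 0.0
    future = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(n_steps):
            if step % RAD_INTERVAL == 0:
                if future is not None:
                    # Apply radiation computed from the state one interval ago.
                    rad_tendency = future.result()
                # Launch radiation for the current state in the background.
                future = pool.submit(radiation, state)
            state = dynamics_and_physics(state) + rad_tendency
    return state

final = run(12)
```

The lag trades a slightly stale radiation tendency for full overlap of the CPU radiation work with GPU dynamics, hiding its cost entirely when it fits within one interval.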
Phase 2: pushing on to a full MPAS port
• Status of GPU-based model components
  o Ported, optimized, verified:
    • Dry dynamical core
    • GPU-direct implementation of MPAS halo exchanges
  o Ported, optimized:
    • Moist dynamics (tracer transport)
    • Xu-Randall cloud fraction
  o Ported, undergoing optimization:
    • WSM6 microphysics
    • YSU boundary layer scheme
  o Awaiting porting:
    • Scale-insensitive Tiedtke convection scheme
    • Monin-Obukhov surface layer scheme
• CPU-based components
  o Overlapping SW and LW RRTMG radiation (lagged radiation)
  o NOAH land surface model (synchronous, remains on CPU)
  o SIONlib I/O subsystem
IBM/TWC MPAS Objectives
• MPAS grid with local refinement
24-hour global forecasts
• 12 km global grid
• 3 km refinement over selected regions.
• 32.8 M horizontal points
• 56 layers
Forecast requirement
• Complete a 20-hour simulation…
• …in 45 minutes
• xRe = 26.7
• For 𝛥t = 18 sec, the timestep budget is 0.674 seconds
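A quick check shows the requirement numbers are internally consistent (variable names are mine; 2,700 s of wallclock over 4,000 steps gives 0.675 s, matching the slide's quoted ~0.674 s up to rounding):

```python
# Arithmetic check of the forecast requirement from the slide.
sim_hours = 20            # simulated hours to complete
wallclock_min = 45        # allowed wallclock minutes
dt = 18                   # model timestep in seconds

speedup = sim_hours * 3600 / (wallclock_min * 60)  # required real-time ratio, xRe
n_steps = sim_hours * 3600 / dt                    # timesteps in the simulation
budget = wallclock_min * 60 / n_steps              # wallclock seconds per timestep

print(f"xRe    = {speedup:.1f}")   # ~26.7
print(f"steps  = {n_steps:.0f}")   # 4000
print(f"budget = {budget:.3f} s")  # ~0.675 s per timestep
```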
Refined grids can be generated anywhere desired.
Dr. Kumar will show next that as few as 800 V100s could achieve this goal…