November 2010
Using GPUs to Run Weather Prediction Models
Mark Govett, Tom Henderson, Jacques Middlecoff,
Paul Madden, Jim Rosinski
GPU / Multi‐core Technology
• GPU (2008): 933 GFlops, 150 W
• CPU (2008): ~45 GFlops, 160 W
• NVIDIA Tesla (2008) ‐> NVIDIA Fermi (2010):
– 1.2 TeraFlops
– 8x increase in double precision
– 2x increase in memory bandwidth
– Error‐correcting memory
• NVIDIA: Fermi chip first to support HPC
– Formed partnerships with Cray, IBM on HPC systems
– #1, #3 systems on TOP500 (Fermi, China)
• AMD/ATI: Primarily graphics currently
– #7 system on TOP500 (AMD‐Radeon, China)
– Fusion chip in 2011 (5 TeraFlops)
• Intel: Knights Ferry (2011), 32‐64 cores
CPU – GPU Comparison

Chip type                     CPU: Nehalem    GPU: NVIDIA Tesla    GPU: NVIDIA Fermi
Cores                         4               240                  480
Parallelism                   Medium grain    Fine grain           Fine grain
Performance (single/double)   85 GFlops       933 / 60 GFlops      1040 / 500 GFlops
Power consumption             90-130 W        150 W                220 W
Transistors                   730 million     1.4 billion          3.0 billion
CPU – GPU Comparison
• CPUs focus on per‐core performance
– Chip real estate devoted to cache, speculative logic
– 4 cores
• GPUs focus on parallel execution
– Fermi: 512 cores, 16 SMs
CPU: Nehalem (2009)
GPU: Fermi (2010)
Next Generation Weather Models
• Models being designed for global cloud‐resolving scales (3‐4 km)
• Requires PetaFlop Computers
DOE Jaguar System
– 2.3 PetaFlops
– 250,000 CPUs
– 284 cabinets
– 7‐10 MW power
– ~$100 million
– Reliability in hours

Equivalent GPU System
– 2.3 PetaFlops
– 2000 Fermi GPUs
– 20 cabinets
– 1.0 MW power
– ~$10 million
– Reliability in weeks
• Large CPU systems (>100 thousand cores) are unrealistic for operational weather forecasting
– Power, cooling, reliability, cost
– Application scaling
[Photo: Valmont Power Plant, Boulder, CO – ~200 megawatts]
Application Performance
• 20‐50x speedup is possible on highly scalable codes
• Efficient use of memory is critical to good performance
– 1‐2 cycles to access shared memory
– Hundreds of cycles to access global memory
[Figure: GPU multi‐layer memory – CPU host and GPU device; global, constant, and texture memory visible to all blocks; shared memory per block; registers and local memory per thread]
Memory     Tesla    Fermi
Shared     16 KB    64 KB
Constant   16 KB    64 KB
Global     1‐2 GB   4‐6 GB
Programming GPUs
• Challenging
• Languages
– CUDA‐C: available from NVIDIA
– OpenCL: industry standard (NVIDIA, AMD, Apple, etc)
– Fortran: PGI, CAPS, F2C‐ACC compilers
• Fine grain (loop level) parallelism (see the sketch below)
– Code modifications needed to get good performance
– Code restructuring may also be required
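As a rough illustration of the fine‐grain, loop‐level parallelism these languages expose, the sketch below maps a Fortran‐style column loop onto CUDA threads. The kernel, array names, and sizes are hypothetical and not taken from NIM.

  // Minimal CUDA-C sketch of fine-grain (loop-level) parallelism:
  // one thread per vertical level, one block per horizontal column.
  // All names and sizes here are illustrative, not taken from NIM.
  #include <cuda_runtime.h>
  #include <stdio.h>

  #define NVL   96      /* vertical levels (hypothetical)   */
  #define NPTS  2562    /* horizontal points (hypothetical) */

  __global__ void clamp_field(float *q)
  {
      int k = threadIdx.x;              /* inner (vertical) loop index   */
      int i = blockIdx.x;               /* outer (horizontal) loop index */
      if (k < NVL && i < NPTS)
          q[i * NVL + k] = fminf(q[i * NVL + k], 1.0f);
  }

  int main(void)
  {
      float *d_q;
      cudaMalloc(&d_q, sizeof(float) * NVL * NPTS);
      clamp_field<<<NPTS, NVL>>>(d_q);  /* replaces the nested k/i loops on the CPU */
      cudaDeviceSynchronize();
      cudaFree(d_q);
      printf("kernel finished\n");
      return 0;
  }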
Execution Flow‐control (Accelerator Approach)
[Figure: routines r1‐r4 execute on the CPU; at run time a selected routine is offloaded to the GPU, with data copied between CPU and GPU around the call]
– Copy between CPU and GPU is non‐trivial (see the sketch below)
• Performance benefits can be overshadowed by the copy
• WRF demonstrated ~6x for one subroutine including data transfers (Michalakes, 2009)
– ~ 10x without data transfers
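A rough sketch of the per‐call data movement this approach implies is shown below; the routine, array name, and sizes are hypothetical. The two cudaMemcpy calls bracket every accelerated routine, which is why they can overshadow the kernel speedup.

  // Accelerator-style offload of a single routine (sketch, hypothetical names).
  // The host copies inputs in, launches the kernel, and copies results back
  // every call, so transfer time competes with compute time.
  #include <cuda_runtime.h>

  __global__ void physics_kernel(float *t, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) t[i] += 1.0f;          /* placeholder physics update */
  }

  void physics_step(float *h_t, int n)
  {
      float *d_t;
      size_t bytes = sizeof(float) * n;

      cudaMalloc(&d_t, bytes);
      cudaMemcpy(d_t, h_t, bytes, cudaMemcpyHostToDevice);   /* copy in  */

      physics_kernel<<<(n + 255) / 256, 256>>>(d_t, n);       /* compute  */

      cudaMemcpy(h_t, d_t, bytes, cudaMemcpyDeviceToHost);   /* copy out */
      cudaFree(d_t);
  }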
Execution Flow‐control (run everything on GPUs)
[Figure: init, dynamics, physics, and output stages – dynamics and physics run as GPU code, while the CPU handles init and output]
– Eliminates the copy every model time step (see the sketch below)
– CPU‐GPU copies only needed for input/output and inter‐process communications
– JMA: ASUCA model, reported a 70x performance improvement
• Rewrote the code in CUDA
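A minimal sketch of this GPU‐resident pattern is shown below: the model state stays on the device across time steps and the host copy happens only at output intervals. Kernel and variable names are hypothetical, not taken from ASUCA or NIM.

  // GPU-resident time stepping (sketch, hypothetical names): dynamics and physics
  // kernels run every step; data returns to the CPU only when output is written.
  #include <cuda_runtime.h>

  __global__ void dynamics_step(float *state, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) state[i] *= 0.999f;             /* placeholder dynamics */
  }

  __global__ void physics_step(float *state, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) state[i] += 0.001f;             /* placeholder physics  */
  }

  void run_model(float *h_state, int n, int nsteps, int output_interval)
  {
      float *d_state;
      size_t bytes = sizeof(float) * n;
      cudaMalloc(&d_state, bytes);
      cudaMemcpy(d_state, h_state, bytes, cudaMemcpyHostToDevice);   /* init: one copy in */

      int blocks = (n + 255) / 256;
      for (int step = 1; step <= nsteps; step++) {
          dynamics_step<<<blocks, 256>>>(d_state, n);
          physics_step<<<blocks, 256>>>(d_state, n);
          if (step % output_interval == 0)        /* copy out only when output is written */
              cudaMemcpy(h_state, d_state, bytes, cudaMemcpyDeviceToHost);
      }
      cudaFree(d_state);
  }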
Non‐hydrostatic Icosahedral Model (NIM) (Lee, MacDonald)
– Global Weather Forecast Model
– Under development at NOAA Earth System Research Laboratory
• Dynamics complete, physics integration in progress
– Non‐hydrostatic
– Uniform, hexagonal‐based, icosahedral grid
– Plan to run tests at 3.5 km global resolution in late 2010
• Cloud‐resolving scale
• Model validation using AquaPlanet
Code Parallelization (2009)
• Developed the Fortran‐to‐CUDA compiler (F2C‐ACC)
– Commercial compilers were not available in 2008
– Converts Fortran 90 into C or CUDA‐C
– Some hand tuning was necessary
• Parallelized NIM model dynamics
– Tesla chip, Intel Harpertown (2008)
– Results for a single GPU
– Communications only needed for I/O
[Figure: single‐GPU configuration – CPU#1 handles input and output; computation runs on GPU#1]
NIM Dynamics (version 160) run times, with speedup over the CPU in parentheses

Resolution    Horiz. points   Harpertown   Tesla           Nehalem   Fermi
G4 (480 km)   2562            2.13         0.079 (26.9x)   1.45      0.054 (26.7x)
G5 (240 km)   10242           8.81         0.262 (33.5x)   5.38      0.205 (26.2x)
Model Parallelization (2010)
[Figure: two‐GPU configuration – GPU#1 and GPU#2 communicate through their host CPUs (CPU#1, CPU#2) using SMS]
• Updated NIM model parallelization
– Active model development
– Code optimizations on‐going
• Run with Fermi
• Evaluate Fortran GPU compilers
– Use F2C‐ACC results as a benchmark
• Run on multiple GPUs
– Modified F2C‐ACC GPU compiler
– Uses MPI‐based Scalable Modeling System (SMS)
– Testing on 10 Tesla GPUs
Fortran GPU Compilers
• General features
– Do not support all Fortran language constructs
– Convert Fortran into CUDA for further compilation
• CAPS – HMPP
– Extensive set of parallelization directives to guide compiler analysis and optimization
– Optionally generates OpenCL
• PGI
– ACCELERATOR: directive‐based accelerator model
– CUDA Fortran: Fortran + language extensions to support kernel calls, GPU memory, etc.
• F2C‐ACC
– Developed at NOAA for our models
– Requires hand tuning for optimal performance
CAPS‐HMPP Compiler
• Multi‐core Fortran
– CUDA, OpenCL code generation
• Extensive set of directives
– Parallelization
– Optimization
• A minimal set of directives to get a working code
– Compiler will do what it safely can and provide diagnostic information
• Our evaluation began ~4 months ago
– Good documentation & support
http://www.caps-entreprise.com
PGI Directives
• Directives are placed directly in the code body
– Define an accelerated region
  !$acc region ( [ copy | copyout | copyin ] ) begin
  !$acc region end
– User defines loop‐level parallelism
  !$acc do [ vector | parallel | unroll ]
  (vector = thread, parallel = thread block)
F2C‐ACC Compiler
• Developed at NOAA to support our Fortran codes
• Parsing
– All standard language features for Fortran 77, 90, 95
• Code translation (added as needed)
– if, do, assignments, declarations, modules, include files
– Does not support I/O, common blocks, derived types
– Handles 98% of code translations (serial & parallel)
– Support for thread optimizations and GPU memory types still needed
Example diagnostic:
  F2C‐ACC ERROR: "fctprs.f90" line 23:10: Language construct not currently supported
Code Generation
• Uses macros to resolve array references (see the sketch below)
• Generates a Fortran‐to‐C driver routine
– cudaMemcpy for each variable (intent IN, OUT, INOUT)
– Block and grid declarations
– Calls to GPU kernel routines
• Generates kernel routines for each ACC$REGION BEGIN / END pair
– Declares each variable used in the kernel
– Consistency checks between kernels (proper intent)

Fortran source:
  real deldp(nvl, npp, npn)
  deldp(ivl, isn, ipn) = 1.0

Generated CUDA‐C:
  __device__ float deldp[nvl*npp*npn];
  deldp[FTNREF3D(ivl,isn,ipn,nvl,npp,1,1,2)] = 1.0;
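The slide does not show how FTNREF3D is defined; below is a plausible sketch of such a column‐major indexing macro and of the generated driver pattern. All names, bounds, and the argument order are assumptions for illustration, not F2C‐ACC's actual output.

  // Hypothetical sketch of an FTNREF3D-style macro: maps Fortran's column-major,
  // arbitrary-lower-bound indexing a(i,j,k) onto a flat C array.
  // (Argument order and semantics are assumptions, not the real F2C-ACC macro.)
  #define FTNREF3D(i, j, k, d1, d2, lb1, lb2, lb3) \
      (((k) - (lb3)) * (d1) * (d2) + ((j) - (lb2)) * (d1) + ((i) - (lb1)))

  // Generated-driver pattern (sketch): launch the kernel, then copy an
  // intent(out) array back to the host.
  #include <cuda_runtime.h>

  #define NVL 32
  #define NPP  6
  #define NPN 64

  __global__ void deldp_kernel(float *deldp)
  {
      int ivl = threadIdx.x + 1;                 /* Fortran-style 1-based index */
      int ipn = blockIdx.x + 1;
      for (int isn = 1; isn <= NPP; isn++)
          deldp[FTNREF3D(ivl, isn, ipn, NVL, NPP, 1, 1, 1)] = 1.0f;
  }

  void deldp_driver(float *h_deldp)
  {
      float *d_deldp;
      size_t bytes = sizeof(float) * NVL * NPP * NPN;
      cudaMalloc(&d_deldp, bytes);
      deldp_kernel<<<NPN, NVL>>>(d_deldp);                          /* blocks = ipn, threads = ivl */
      cudaMemcpy(h_deldp, d_deldp, bytes, cudaMemcpyDeviceToHost);  /* intent(out) copy-back */
      cudaFree(d_deldp);
  }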
F2C‐ACC Directives
• Directives are required for kernel generation
– Supports multiple kernels in a routine
• Define accelerated regions
  ACC$REGION (<Threads>, <Blocks>) BEGIN
  ACC$REGION END
• Define loop‐level parallelism
  ACC$DO VECTOR(dim)   - thread‐level parallelism
  ACC$DO PARALLEL(dim) - block‐level parallelism
• Data movement
  ACC$REGION (<Threads>, <Blocks> [, <DATA>]) BEGIN
Achieving High Performance
#1 Optimize the CPU code
– Modifications often help GPU performance too
• Performance profiling identified a matrix solver that accounted for 40 percent of the runtime
– Called 150,000 times per timestep (G4), 3.6 million times at 3.5 km
– Replaced the BLAS routine with a hand‐coded solution, giving a 3x performance increase for the entire model
– Loop unrolling improved overall performance by another 40 percent
» Developed test kernels to study CPU & GPU performance
– Organize arrays so the inner dimension is used for thread calculations
• Improves loads & stores from GPU global memory
– Re‐order calculations to improve data reuse
Achieving High Performance
#2 Optimize GPU kernels (major issues)
– Use the performance profiler to identify bottlenecks
• Occupancy
– Largely determined by register and shared memory usage
• Coalesced loads and stores (see the sketch below)
– Data alignment with adjacent threads can yield big performance gains
» e.g., threading on "k" for a(k,i,j)
• Use shared memory where there is data reuse
– Reuse must be greater than ~3 to show a benefit (2 extra loads & stores are needed to move data from global to shared memory)
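A rough CUDA‐C sketch of these two points is given below. The array layout mirrors a Fortran a(k,i) field so that threading on k produces coalesced accesses; shared memory then holds a column that several threads reuse. Names and sizes are illustrative only.

  // Coalesced loads: k (the fastest-varying Fortran dimension) maps to threadIdx.x,
  // so adjacent threads touch adjacent memory words.  Shared memory holds one
  // column of data that is read up to three times (k-1, k, k+1) per output element.
  // All names and sizes are hypothetical.
  #include <cuda_runtime.h>

  #define NVL 128                      /* vertical levels per column */

  __global__ void smooth_column(const float *a, float *b, int npts)
  {
      __shared__ float col[NVL];       /* one column per thread block    */

      int k = threadIdx.x;             /* inner dimension -> threads     */
      int i = blockIdx.x;              /* one horizontal point per block */
      if (i >= npts) return;

      col[k] = a[i * NVL + k];         /* coalesced global load          */
      __syncthreads();

      /* each element of col is reused by up to three threads */
      float lo = (k > 0)       ? col[k - 1] : col[k];
      float hi = (k < NVL - 1) ? col[k + 1] : col[k];
      b[i * NVL + k] = 0.25f * lo + 0.5f * col[k] + 0.25f * hi;   /* coalesced store */
  }

  /* launch with exactly NVL threads per block:
     smooth_column<<<npts, NVL>>>(d_a, d_b, npts);               */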
Run Times for Single GPU vs. Single Nehalem CPU Core (2562 points)

Routine      Harpertown CPU   Tesla GPU (speedup)   Nehalem CPU   Fermi GPU (speedup)
Total        190.5            --                    106.6         --
vdmints      89.2             4.28 (20.8x)          50.6          2.66 (19.0x)
vdmintv      38.8             2.48 (15.6x)          23.3          1.02 (22.8x)
flux         16.0             0.75 (21.3x)          10.4          0.35 (29.7x)
vdn          12.6             0.56 (22.5x)          4.6           0.56 (8.2x)
diag         4.0              0.09 (44.4x)          4.0           0.09 (44.4x)
force        5.3              0.11 (48.2x)          3.4           0.11 (30.9x)
trisol       8.6              1.45 (5.8x)           2.0           1.28 (1.5x)
input/init   1.0              1.0                   1.0           1.0
output       0.2              0.2                   0.2           0.2

Results are for F2C‐ACC generated CUDA‐C code, with no optimizations.
GPU Compilers
• Reliance on NVIDIA, AMD compilers
– Register allocation is inefficient
• Fermi cache helps offload register pressure
– Data reuse limiting performance
– "Normal" optimizations not done
– Requires code changes to achieve good results
[Figure: compiler toolchain – PGI, F2C‐ACC, and CAPS translate Fortran into CUDA, OpenCL, and multi‐core code, which the NVIDIA and AMD compilers then build]
Parallel Performance
                    G4       G5      G6      G7      G8      G9       G10        G11
Resolution          480 km   240 km  120 km  60 km   30 km   15 km    7 km       3.5 km
Horizontal points   2.5K     10K     40K     160K    640K    2560K    10,000K    40,000K
Memory              0.25 GB  1 GB    4 GB    16 GB   64 GB
Tesla speedup       26x      33x
Fermi speedup       26x      26x     ??
# GPUs              1        1       4       16      64      256      1000       4000
• A doubling of model resolution implies:
• 4x increase in horizontal points
• 2x increase in the number of model time steps
• 4x increase in memory
• GPU memory is the limiting factor
• Currently about 10K horizontal points fit on each GPU
Communication Bottleneck
• Application scaling will be limited by the fraction of time spent doing inter‐process communications
• CPU communications time is about 5% of compute time
• Using GPUs, a ~20x speedup in computation time means communications becomes roughly 50 percent of the runtime
• Will need to reduce communications time
– Minimize data volume and frequency
– Overlap communications with computations
• Move inter‐process communications from just before the data is needed to just after it is computed
[Figure: CPU vs. GPU timelines – CPU: ~100 sec of compute followed by ~5 sec of inter‐process communication and input/output per step; GPU: compute shrinks to ~5 sec while the ~5 sec of communication remains]
Conclusion
• HPC transitions about every decade
– Vector ‐> MPP ‐> COTS clusters ‐> GPUs
• 20‐50x cost savings: hardware, power, infrastructure
• Tools and compilers are not mature
– CUDA register allocation policies can restrict performance
– Commercial Fortran compilers need to do more analysis
• Similar to early vector compilers, which required lots of user involvement
• GPU roadmaps look strong
– Focus on power, performance
• Ready for operations?
• Useful for research?
[Images: Tesla, Fermi, and Nehalem chips]