WRF Benchmarks

Performance Analysis and Optimizations of the Weather Research and Forecasting (WRF) Model

Dixit Patel¹, Davide Del Vento², Akira Kyle³, Brian Vanderwende², Negin Sobhani²

University of Colorado Boulder¹ - National Center for Atmospheric Research² - Carnegie Mellon University³

SIParCS 2018


Contents

• Introduction

• Compilers and MPI details

• Scaling Summary

• Hybrid Parallelization

• Hyperthreading Results

• Conclusion & Future work


Introduction

The Weather Research & Forecasting (WRF) Model

• Mesoscale numerical weather prediction system

• Designed for both atmospheric research and operational forecasting needs

• Used by over 30,000 scientists around the world

Cheyenne Supercomputer

• 4032 dual-socket nodes

• 2.3-GHz Intel E5-2697v4 processors (Broadwell)

• 18 cores/socket

• 313 TB memory

• Partial 9D Enhanced Hypercube single-plane interconnect topology (25 GBps), Mellanox EDR InfiniBand


Motivation

[Figure: stages of a program: Model, Algorithm, High-Level Code, Compiled Program, Execution]

Reproduced from: https://cvw.cac.cornell.edu/Optimization/overview

Motivation

[Figure: the stages Model, Algorithm, High-Level Code, Compiled Program, Execution, annotated with the optimization areas "Parallelism, Scalability", "Libraries, Compiler Options" and "Profiling, Tuning"]



Scaling WRF

• Motivation

– How to optimize performance?

– Can I solve a problem of a given size in a timely manner?

– How many core-hours would I need for my problem size?


Scaling Summary

• Compilers: Intel (17, 18), GNU (6.3.0, 8.1.0)

• MPI: MPT (2.18), MVAPICH2, IMPI (2018)

• Case: Katrina 1km, Katrina 3km

• Domain:

• Physics suite = ‘Conus’:

• Larger problem sizes make the model unstable (CFL violations)

Case      Resolution   Grid Size        Total Grid Points   Time Step (sec)   Simulation
Katrina   1km          512 x 512 x 35   262,144             6                 12 Hours
Katrina   3km          800 x 900 x 35   720,000             12                12 Hours
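
The figures in the table can be cross-checked with a little arithmetic. The Python sketch below recomputes the grid point counts and derives the number of time steps implied by a 12-hour simulation. The 3D point counts and step counts are derived here rather than taken from the slides, and the function and variable names are illustrative only.

```python
# Benchmark cases as listed in the table:
# (nx, ny, nz, time step in seconds, simulated hours).
cases = {
    "Katrina 1km": (512, 512, 35, 6, 12),
    "Katrina 3km": (800, 900, 35, 12, 12),
}


def summarize(nx, ny, nz, dt_s, hours):
    """Return (horizontal points, 3D points, number of time steps)."""
    horizontal = nx * ny
    volume = horizontal * nz
    steps = hours * 3600 // dt_s
    return horizontal, volume, steps


for name, spec in cases.items():
    horizontal, volume, steps = summarize(*spec)
    print(f"{name}: {horizontal:,} horizontal points, "
          f"{volume:,} 3D points, {steps:,} time steps")
```

The "Total Grid Points" column matches nx x ny, that is, the horizontal point count; the 35 vertical levels are not included in it.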

Scaling Summary

[Figures: scaling results, panels "All cases" and "Katrina Cases"]

• Intel 18.0.1 + SGI MPT 2.18 gives the best performance

• GNU + MVAPICH2 < GNU + SGI MPT

• IMPI does not do well at high node counts

Scalability Summary

[Figures: panels "Cheyenne" and "Yellowstone", with regions marked "Compute Bound (Strong Scaling)" and "MPI Bound"]

• Strong scaling: with the problem size kept fixed, increasing the core count increases performance while approximately the same number of core-hours is consumed
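
The strong-scaling statement above can be made concrete with the usual speedup and efficiency ratios. The Python sketch below is a minimal illustration; the timings in it are hypothetical placeholders, not measurements from these slides, and the function name is an assumption.

```python
def strong_scaling(base_cores, base_time_s, cores, time_s):
    """Speedup, parallel efficiency and core-hours relative to a baseline run."""
    speedup = base_time_s / time_s
    ideal = cores / base_cores          # speedup under perfect scaling
    efficiency = speedup / ideal
    core_hours = cores * time_s / 3600.0
    return speedup, efficiency, core_hours


# Hypothetical (cores, wall time in seconds) pairs for a fixed problem size.
runs = [(36, 3600.0), (72, 1800.0), (144, 1000.0)]
base_cores, base_time = runs[0]

for cores, time_s in runs:
    s, e, ch = strong_scaling(base_cores, base_time, cores, time_s)
    print(f"{cores:4d} cores: speedup {s:.2f}, "
          f"efficiency {e:.2f}, {ch:.1f} core-hours")
```

While efficiency stays near 1.0, the core-hours stay roughly constant as cores are added, which is the compute-bound regime. Once efficiency drops, each additional core costs more core-hours than it saves in wall time, which is consistent with the region the figure labels "MPI Bound".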


Scalability Summary

[Figures: panels "Cheyenne" and "Yellowstone"]

• Much better initialization and I/O, due to improvements in WRF


Compiler Optimizations

• GNU: -Ofast

• Intel: -xHost -fp-model fast=2

• Use option 66 or 67 for the Intel compiler:

– -xHost -fp-model fast=2 -xCORE-AVX2

– -no-heap-arrays -no-prec-div -no-prec-sqrt -fno-common


Hybrid Parallelism

• MPI (Distributed Memory) + OpenMP (Shared Memory)

• Why hybrid?

– Eliminates domain decomposition at the node level

– Better memory coherency and less data movement within a node

• Cheyenne node layout: [figure]

Hybrid Parallelism

• WRF Tiling:

– Domain decomposition to divide work

– Domain first broken into pieces called patches

– Each patch can be further subdivided into tiles for shared-memory parallelism

[Figure: domain split into patches (labelled "MPI"), each patch subdivided further (labelled "Shared Memory")]
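
The two-level decomposition described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WRF's actual algorithm: the patch grid (px by py), the choice to cut each patch into strips along y, and all function names are hypothetical.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, near-equal pieces."""
    size, extra = divmod(stop - start, parts)
    bounds = []
    lo = start
    for i in range(parts):
        hi = lo + size + (1 if i < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def decompose(nx, ny, px, py, tiles_per_patch):
    """Map each patch (one per MPI task) to its list of tiles (one per thread).

    Patches form a px-by-py grid over the horizontal domain. Each patch is
    cut into strips along y; this strip layout is an assumption made for
    illustration. Tiles are (x0, x1, y0, y1) half-open index ranges.
    """
    layout = {}
    for j, (y0, y1) in enumerate(split_range(0, ny, py)):
        for i, (x0, x1) in enumerate(split_range(0, nx, px)):
            layout[(i, j)] = [
                (x0, x1, ty0, ty1)
                for ty0, ty1 in split_range(y0, y1, tiles_per_patch)
            ]
    return layout


# Example: the 512 x 512 Katrina grid on 4 MPI tasks with 9 threads each.
layout = decompose(nx=512, ny=512, px=2, py=2, tiles_per_patch=9)
for patch, tiles in sorted(layout.items()):
    print(patch, "first tile:", tiles[0], "tiles:", len(tiles))
```

Only patch boundaries require message passing between MPI tasks; tiles inside a patch share the memory of the task that owns it.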

Hybrid Parallelism

• How many OpenMP threads and MPI tasks?

– 4 MPI tasks + 9 OpenMP threads or 6 MPI tasks + 6 OpenMP threads seems to work well

– Better scaling than pure MPI

[Figure labels: "Good", "Undersubscribed", "Over-subscribed", "MPI Bound", "Compute Bound"]
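
Both recommended combinations multiply out to 36, the number of physical cores on a Cheyenne node (2 sockets x 18 cores). The Python sketch below enumerates every task/thread split that fills a node exactly and labels other splits. Reading the counts as per-node values, and reading "Undersubscribed" and "Over-subscribed" as tasks x threads falling below or above the core count, are assumptions; the function names are illustrative.

```python
CORES_PER_NODE = 36  # 2 sockets x 18 cores on a Cheyenne node


def full_node_layouts(cores=CORES_PER_NODE):
    """All (mpi_tasks, omp_threads) pairs that use every core exactly once."""
    return [(m, cores // m) for m in range(1, cores + 1) if cores % m == 0]


def classify(mpi_tasks, omp_threads, cores=CORES_PER_NODE):
    """Label a per-node layout by how it fills the physical cores."""
    used = mpi_tasks * omp_threads
    if used < cores:
        return "undersubscribed"
    if used > cores:
        return "over-subscribed"
    return "fully subscribed"


for mpi_tasks, omp_threads in full_node_layouts():
    print(f"{mpi_tasks:2d} MPI x {omp_threads:2d} OMP")

print(classify(4, 9), "|", classify(4, 6), "|", classify(6, 9))
```

The enumeration only narrows the candidates; which of them performs best still has to be measured, as the slides do.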

Core Affinity/Binding

• The runs above use core affinity

– omplace -vv


Hyperthreading Results

• Hyperthreading on Cheyenne:

– A single physical core acts like two logical cores

– Model performance was significantly reduced

– We do not recommend using more than 36 cores per node


Conclusion

• We recommend running WRF with the latest Intel compiler (18.0.1) and the MPT (2.18) library. This combination consistently gave better performance than the other options.

• For Intel compilation, turn on the -xHost -fp-model fast=2 -xCORE-AVX2 options. For GNU, use -Ofast.

• For hybrid runs, thread affinity significantly affects performance. Use omplace, or dplace if you want to specify the CPU mapping explicitly.

• Hyperthreaded runs lowered model performance.


Future Work

1. Investigate MVAPICH environment settings

   1. E.g., inter-node communication: Eager vs. Rendezvous protocol (MV2_VBUF_TOTAL_SIZE, MV2_IBA_EAGER_THRESHOLD, MV2_SMP_EAGERSIZE)

   2. WRF crashes around a size of 2k and halts at 128k (64 nodes)

2. Explore different patching/tiling strategies for domain decomposition

3. Explore different core binding strategies for hybrid runs to understand their impact on performance at high node counts

4. Personal: tools to profile the application (WRF advection code)


Acknowledgements

• Mentors :

– Davide Del Vento

– Brian Vanderwende

– Negin Sobhani

– Alessandro Fanfarillo

• Project Partner :

– Akira Kyle

• SIParCS Team

– AJ Lauer, Jenna Preston, Eliot Foust, Shilo Hall

– Rich Loft

– and fellow interns!