89
Performance Engineering for Legacy Codes on a Cray XC40 with Intel Xeon Phi (KNL) Matthias Noack ([email protected]), Florian Wende, Thomas Steinke, Alexander Reinefeld Zuse Institute Berlin 1 / 44 2017-06-22, Performance Engineering for HPC: Implementation, Processes & Case Studies at ISC’17

Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Performance Engineering for Legacy Codeson a Cray XC40 with Intel Xeon Phi (KNL)

Matthias Noack ([email protected]), Florian Wende,Thomas Steinke, Alexander Reinefeld

Zuse Institute Berlin

1 / 442017-06-22, Performance Engineering for HPC: Implementation, Processes & Case Studies at ISC’17

Page 2: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

“Konrad” (Berlin, ZIB)- 1872 Xeon nodes- 44928 cores

“Gottfried” (Hanover, LUIS)- 1680 Xeon nodes- 40320 cores + 64 SMP servers, 256/512 GB

TDS (Berlin, ZIB)- 16 KNC nodes (until July 2016)- 80 KNL nodes (since July 2016)- data warp nodes

Applications on the HLRN-III- many pure MPI codes- some MPI+OpenMP- some vectorized

10 Gbps (243 km linear distance)

North-German Supercomputing Alliance (HLRN)

2 / 44

Page 3: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

1987Cray X-MP

471 MFlops

1994Cray T3D38 GFlops

1997Cray T3E

486 GFlops

2002 (HLRN-I)IBM p6902,5 TFlops

1984Cray 1M

160 MFlops

2008 (HLRN-II)SGI ICE, XE150 TFlops

2013 (HLRN-III)Cray XC30/XC40

1,3 PFlops

[peak performance]

2

192DEC Alpha

256DEC Alpha

384IBM Power4

10.240Intel Harpertown,

Nehalem

44.928Intel Ivy Bridge, Haswell

1

[cores]

Supercomputer at ZIB

3 / 44

Page 4: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

1987Cray X-MP

471 MFlops

1994Cray T3D38 GFlops

1997Cray T3E

486 GFlops

2002 (HLRN-I)IBM p6902,5 TFlops200 kWatt

10 M€1984Cray 1M

160 MFlops

2008 (HLRN-II)SGI ICE, XE150 TFlops

2013 (HLRN-III)Cray XC30/XC40

1,3 PFlops

[peak performance]

2

192DEC Alpha

256DEC Alpha

384IBM Power4

10.240Intel Harpertown,

Nehalem

44.928Intel Ivy Bridge, Haswell

1

[cores]

Xeon Phi KNL, 2016

Intel Xeon Phi 729072 cores (288 threads)3 TFLOPS245 Watt6662 € (6/2017)

Y

Supercomputer at ZIB

4 / 44

Page 5: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Applications

• GLAT (atomistic thermodynamics)

• VASP (electronic structure)

• BQCD (high-energy physics)

• HEOM (photo-active processes)

• BOSS (time series analysis, phase 2)

• PALM (fluid dynamics, phase 2)

• PHOENIX3D (astrophysics, phase 2)

Challenges

• Adapting data structures for enabling SIMD

• Vectorising complex code structures

• Transition to hybrid MPI + OpenMP

• (Offload with Intel LEO vs. OpenMP 4.x)

OBJECTIVE

Many-Core High-Performance Computing

RESEARCH

Programming Models,

Runtime Libraries

APPLICATIONS

Modernization,

OpenMP/MPI,

Scalability

Intel Parallel Compute Center (IPCC)

Research Center for Many-Core HPC at ZIB

5 / 44

Page 6: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Knights Landing (KNL): Intel’s 2. Many Integrated Core (MIC) Architecture1)

self-booting CPU, optionall with integrated OmniPath fabric 64+ core (based on Intel Atom Silvermont architecture, x86-64)

4-way hardware-threading 512-bit SIMD vector processing (AVX-512)

On-Chip Multi-Channel (MC) DRAM: 16 GiB DDR4 main memory: up to 384 GiB

1) A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro vol. 6, April 2016

Intel Xeon Phi (Knights Landing) – Architecture

6 / 44

Page 7: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL CPU

PCIe 3 DMI

Tile

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

Misc

up to 36 active tilesconnected via 2D-Mesh-

Interconnect

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

Intel Xeon Phi (Knights Landing) – Architecture

7 / 44

Page 8: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL CPU Tile

PCIe 3 DMI

Tile

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

Misc

up to 36 active tilesconnected via 2D-Mesh-

Interconnect

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

Core Core

2 VPUs 2 VPUs

1 MiB L2 Cache

CHA

2 out-of-order cores 2 VPUs per core (AVX-512) 1 MiB shared L2-Cache Caching/Home-Agent (CHA)

interface to 2D-Mesh-Interconnect

distributed tag directory

(MESIF cache coherency protocol)

Intel Xeon Phi (Knights Landing) – Architecture

8 / 44

Page 9: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

2 memory controller 6 DDR4 Channels

KNL CPU DDR4 Memory (up to 384 GiB)

PCIe 3 DMI

Tile

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

Misc

up to 36 active tilesconnected via 2D-Mesh-

Interconnect

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

Intel Xeon Phi (Knights Landing) – Architecture

9 / 44

Page 10: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

2 memory controller 6 DDR4 Channels

MCDRAM (16 GiB) 8 on-chip units, each 2 GiB

KNL CPU DDR4 Memory (up to 384 GiB)

PCIe 3 DMI

Tile

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

Misc

up to 36 active tilesconnected via 2D-Mesh-

Interconnect

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

Intel Xeon Phi (Knights Landing) – Architecture

10 / 44

Page 11: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Flat-ModeDDR4 and MCDRAMin same address space

Cache-ModeMCDRAM = direct-mapped Cachefor DDR4

Hybrid-Mode8 | 4 GiB MCDRAMin Cache-Mode,remainder in Flat-Mode

KNL CPU Memory Modes:

PCIe 3 DMI

Tile

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

DDR MC

EDC EDC

EDC EDC

3 D

DR

4 C

han

nel

s

Misc

up to 36 active tilesconnected via 2D-Mesh-

Interconnect

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

DDR4

16 GiBMCDRAM

16 GiBMCDRAM

DDR4

DDR4

8 | 12 GiBMCDRAM

8 | 4 GiBMCDRAM

phy

s. a

dd

ress

sp

ace

phy

s. a

dd

ress

sp

ace

Intel Xeon Phi (Knights Landing) – Architecture

11 / 44

Page 12: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL in the Roofline Model (Samuel Williams, 2008)

12 / 44

Arithmetic Intensity

[FLOP/Byte]1 2 4 8 16 32 64 128 256

Attainable

FLOPS

[GFLOPS]

1

4

16

64

256

1024

4096

Page 13: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL in the Roofline Model - adding peak FLOPS

12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s

Arithmetic Intensity

[FLOP/Byte]1 2 4 8 16 32 64 128 256

Attainable

FLOPS

[GFLOPS]

1

4

16

64

256

1024

4096 peak FLOPS

Page 14: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL in the Roofline Model - adding peak DRAM bandwidth

12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s

Arithmetic Intensity

[FLOP/Byte]1 2 4 8 16 32 64 128 256

Attainable

FLOPS

[GFLOPS]

1

4

16

64

256

1024

4096 peak FLOPS

DRAM BW

Page 15: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL in the Roofline Model

12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s

Arithmetic Intensity

[FLOP/Byte]1 2 4 8 16 32 64 128 256

Attainable

FLOPS

[GFLOPS]

1

4

16

64

256

1024

4096 peak FLOPS

DRAM BW

Page 16: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL in the Roofline Model

12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s

Arithmetic Intensity

[FLOP/Byte]1 2 4 8 16 32 64 128 256

Attainable

FLOPS

[GFLOPS]

1

4

16

64

256

1024

4096 peak FLOPS

DRAM BW

AttainableFLOPS

= min

PeakFLOPS

MemoryBandwidth

× ArithmeticIntensity

Page 17: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL in the Roofline Model

12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s

Arithmetic Intensity

[FLOP/Byte]1 2 4 8 16 32 64 128 256

Attainable

FLOPS

[GFLOPS]

1

4

16

64

256

1024

4096 peak FLOPS

DRAM BW

AttainableFLOPS

= min

PeakFLOPS

MemoryBandwidth

× ArithmeticIntensity

⇐ memory bound compute bound ⇒

Page 18: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL in the Roofline Model

12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s

Arithmetic Intensity

[FLOP/Byte]1 2 4 8 16 32 64 128 256

Attainable

FLOPS

[GFLOPS]

1

4

16

64

256

1024

4096 peak FLOPS

DRAM BW

⇐ memory bound compute bound ⇒

=

PeakFLOPS

MemoryBandwidth

Page 19: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL in the Roofline Model

12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/sNumbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s DDR, 490 GB/s MCDRAM; Xeon E5-2680v3: 480 GFLOPS, 68 GiB/s

Arithmetic Intensity

[FLOP/Byte]1 2 4 8 16 32 64 128 256

Attainable

FLOPS

[GFLOPS]

1

4

16

64

256

1024

4096 peak FLOPS

DRAM BW

=

PeakFLOPS

MemoryBandwidth

MCDRAM

BW

MinimalArithmetic Intensitynecessary forpeak FLOPS(HLRN nodes):KNL DRAM: 22.7KNL MCDRAM: 5.3Haswell: 7.1

Page 20: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Refining the Ceilings - FLOPS

Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP

• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS

Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA

• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD

• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)

13 / 44

Page 21: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Refining the Ceilings - FLOPS

Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP

• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS

• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS

Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA

• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD

• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)

13 / 44

Page 22: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Refining the Ceilings - FLOPS

Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP

• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load

⇒ actual peak: 2611.2 GFLOPS

Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA

• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD

• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)

13 / 44

Page 23: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Refining the Ceilings - FLOPS

Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP

• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS

Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA

• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD

• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)

13 / 44

Page 24: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Refining the Ceilings - FLOPS

Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP

• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS

Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA

• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)

• without ILP, and without SIMD• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)

13 / 44

Page 25: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Refining the Ceilings - FLOPS

Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP

• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS

Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA

• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD

• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)

13 / 44

Page 26: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL in the Roofline Model

14 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s

Arithmetic Intensity

[FLOP/Byte]1 2 4 8 16 32 64 128 256

Attainable

FLOPS

[GFLOPS]

1

4

16

64

256

1024

4096 peak FLOPS

DRAM BW

MCDRAM

BW

w/o ILP

w/o ILP+SIMD

Upper computebound in GFLOPS:peak: 2611.2w/o ILP: 652.8w/o ILP+SIMD: 84.6

Page 27: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Case study I: Atmospheric Research with PALM

DUSTDEVILS

WIND ENERGY

CLOUD PHYSIC

TURBULENCE EFFECTS ON AIRCRAFT

URBAN METEOROLOGY AND CITY PLANING

15 / 44https://palm.muk.uni-hannover.de

Page 28: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

The PALM Code

• continuously developed since 1997 by the PALM group (Siegfried Raasch et al.)

16 / 44https://palm.muk.uni-hannover.de

Page 29: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

The PALM Code

• continuously developed since 1997 by the PALM group (Siegfried Raasch et al.)

• Fortran 95/2003.• hybrid MPI + OpenMP code• 140 kLOC, 79 modules and 171 source files• highly scalable, tested for up to 43,200 cores

16 / 44https://palm.muk.uni-hannover.de

Page 30: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

The PALM Code

• continuously developed since 1997 by the PALM group (Siegfried Raasch et al.)

• Fortran 95/2003.• hybrid MPI + OpenMP code• 140 kLOC, 79 modules and 171 source files• highly scalable, tested for up to 43,200 cores

• runs on the HLRN supercomputing facilities at Berlin (ZIB) and Hannover (LUIS)• modernisation target within the Intel Parallel Computing Center at ZIB

16 / 44https://palm.muk.uni-hannover.de

Page 31: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

PALM Hackathon at ZIB

17 / 44left to right: Matthias Noack, Florian Wende, Helge Knoop, Matthias Sühring, Tobias Gronemeier; behind the camera: Thomas Steinke

Page 32: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Parallelization Strategy

18 / 44https://palm.muk.uni-hannover.de

Page 33: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Parallelization Strategy

2D domain decomposition

21

0

54

3

87

6

x

y

18 / 44https://palm.muk.uni-hannover.de

Page 34: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Parallelization Strategy

2D domain decomposition

21

0

54

3

87

6

x

y

outer of 3 nested loops threaded

n• different variants for different

targets• e.g. cache optimised with

decomposed inner loop

• no vectorisation⇒ use of floating point

exceptions preventsautomatic vectorisation

⇒ no explicit SIMD constructs

18 / 44https://palm.muk.uni-hannover.de

Page 35: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

19 / 44

getoperationalon KNL

Page 36: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

19 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

Page 37: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

19 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

Page 38: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

19 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

Page 39: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

19 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

measurebaselineon Xeon

Page 40: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Benchmark Systems and Haswell Baseline

HLRN-III prod. system, 1872 nodes (Konrad)• 2 × Intel Xeon E5-2680v3 (Haswell)

• 2 × 12 cores at 2.5 GHz⇒ 960 GFLOPS per node• 64 GiB DDR3, 136 GiB/s

20 / 44

Page 41: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Benchmark Systems and Haswell Baseline

HLRN-III prod. system, 1872 nodes (Konrad)• 2 × Intel Xeon E5-2680v3 (Haswell)

• 2 × 12 cores at 2.5 GHz⇒ 960 GFLOPS per node• 64 GiB DDR3, 136 GiB/s

Haswell Resultsnodes runtime

small 4 131.3 smedium 8 147.6 slarge 16 164.6 s

20 / 44

Page 42: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Benchmark Systems and Haswell Baseline

HLRN-III prod. system, 1872 nodes (Konrad)• 2 × Intel Xeon E5-2680v3 (Haswell)

• 2 × 12 cores at 2.5 GHz⇒ 960 GFLOPS per node• 64 GiB DDR3, 136 GiB/s

HLRN KNL TDS node, 80 nodes• 1 × Intel Xeon Phi 7250 (KNL)

• 68 cores at 1.4 GHz⇒ 2611.2 GFLOPS with AVX clock

"3.05 TFLOPS"• 96 GiB DDR4, 115.2 GiB/s• 16 GiB MCDRAM, 490 GiB/s

Haswell Resultsnodes runtime

small 4 131.3 smedium 8 147.6 slarge 16 164.6 s

20 / 44

Page 43: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Benchmark Systems and Haswell Baseline

HLRN-III prod. system, 1872 nodes (Konrad)• 2 × Intel Xeon E5-2680v3 (Haswell)

• 2 × 12 cores at 2.5 GHz⇒ 960 GFLOPS per node• 64 GiB DDR3, 136 GiB/s

HLRN KNL TDS node, 80 nodes• 1 × Intel Xeon Phi 7250 (KNL)

• 68 cores at 1.4 GHz⇒ 2611.2 GFLOPS with AVX clock

"3.05 TFLOPS"• 96 GiB DDR4, 115.2 GiB/s• 16 GiB MCDRAM, 490 GiB/s

• Upper bounds for speed-up:⇒ compute bound: 2.7×⇒ memory bound: 3.6× (MCDRAM)

Haswell Resultsnodes runtime

small 4 131.3 smedium 8 147.6 slarge 16 164.6 s

20 / 44

Page 44: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

21 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

measurebaselineon Xeon

X

Page 45: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

21 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

measurebaselineon Xeon

X

MCDRAMand

boot mode

Page 46: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

MCDRAM Usage and Boot Mode

boot flat mode

run in DDR

run in MCDRAM

boot cache mode

run again

- problem < 16 GiB- upper bound forgain from MCDRAM

- good enough?⇒ decide about

explicit placement

22 / 44

Page 47: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

MCDRAM Usage and Boot Mode

boot flat mode

run in DDR

run in MCDRAM

boot cache mode

run again

- problem < 16 GiB- upper bound forgain from MCDRAM

- good enough?⇒ decide about

explicit placement

• 25 - 41% gain from MCDRAM• ≤ 3% loss from Cache-Mode⇒ no need for explicit placement

• 2 MiB pages worked best

22 / 44

0

20

40

60

80

100

120

140

160

small medium large

runtime[s]

MCDRAM Usage Comparison

1.41

1.411.25

1.38

1.391.22

quad_�at, DDR4quad_�at, MCDRAMquad_cache, both

Page 48: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

23 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

measurebaselineon Xeon

X

MCDRAMand

boot mode

⇒ quad_cache

Page 49: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

23 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

measurebaselineon Xeon

X

MCDRAMand

boot mode

⇒ quad_cache

. . .

. . .MPI processes

vs.OpenMP threads

Page 50: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

MPI processes vs. OpenMP threads

Tuning run for optimal per-node config

ranks 64 32 16 8 4 2 1threads 1 2 4 8 16 32 64small X

medium Xlarge X

≤ 2% off from best≤ 15% off from best

worse

X fastest

24 / 44quad_cache mode

Page 51: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

MPI processes vs. OpenMP threads

Tuning run for optimal per-node config

ranks 64 32 16 8 4 2 1threads 1 2 4 8 16 32 64small X

medium Xlarge X

≤ 2% off from best≤ 15% off from best

worse

X fastest

Conclusion• fastest⇒ small: 16 ranks × 4 thread⇒ medium: 8 ranks × 8 threads⇒ large: 32 × 2 threads

• 16 × 4 performs for all setups• fastest vs. slowest config: 2.3×

24 / 44quad_cache mode

Page 52: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

MPI processes vs. OpenMP threads

Tuning run for optimal per-node config

ranks 64 32 16 8 4 2 1threads 1 2 4 8 16 32 64small X

medium Xlarge X

≤ 2% off from best≤ 15% off from best

worse

X fastest

Conclusion• fastest⇒ small: 16 ranks × 4 thread⇒ medium: 8 ranks × 8 threads⇒ large: 32 × 2 threads

• 16 × 4 performs for all setups• fastest vs. slowest config: 2.3×

• very low impact on speedupsbetween MCDRAM usage models

• absolute numbers vary largely⇒ do memory first

24 / 44quad_cache mode

Page 53: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

MPI processes vs. OpenMP threads

Tuning run for optimal per-node config

ranks 64 32 16 8 4 2 1threads 1 2 4 8 16 32 64small X

medium Xlarge X

≤ 2% off from best≤ 15% off from best

worse

X fastest

Conclusion• fastest⇒ small: 16 ranks × 4 thread⇒ medium: 8 ranks × 8 threads⇒ large: 32 × 2 threads

• 16 × 4 performs for all setups• fastest vs. slowest config: 2.3×

• very low impact on speedupsbetween MCDRAM usage models

• absolute numbers vary largely⇒ do memory first

• always check pinning/affinity

24 / 44quad_cache mode

Page 54: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

25 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

measurebaselineon Xeon

X

MCDRAMand

boot mode

⇒ quad_cache

. . .

. . .MPI processes

vs.OpenMP threads

⇒ 16 × 4

Page 55: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

25 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

measurebaselineon Xeon

X

MCDRAMand

boot mode

⇒ quad_cache

. . .

. . .MPI processes

vs.OpenMP threads

⇒ 16 × 4

startoptimisationworkflow

Page 56: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

25 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

measurebaselineon Xeon

X

MCDRAMand

boot mode

⇒ quad_cache

. . .

. . .MPI processes

vs.OpenMP threads

⇒ 16 × 4

startoptimisationworkflow

Page 57: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL Optimisation Strategy

25 / 44

getoperationalon KNL

- build scripts- job scripts- bug fixing

definesome

benchmarks

- small- medium- large

measurebaselineon Xeon

X

MCDRAMand

boot mode

⇒ quad_cache

. . .

. . .MPI processes

vs.OpenMP threads

⇒ 16 × 4

startoptimisationworkflow

- compiler optimisation reports- Intel VTune Amplifier XE, Advisor XE

Page 58: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

First Code Changes

Floating point exception-handling

• prevents vectorisation• fp-model-strict → fp-model-source• remove exception handling• add NaN/Inf-tests⇒ when writing checkpoints

⇒ no significant auto vectorisation⇒ suboptimal memory layout

26 / 44

Page 59: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

First Code Changes

Floating point exception-handling

• prevents vectorisation• fp-model-strict → fp-model-source• remove exception handling• add NaN/Inf-tests⇒ when writing checkpoints

⇒ no significant auto vectorisation⇒ suboptimal memory layout

Using MKL FFT

• small gain for benchmarks⇒ might be significant for larger setups

26 / 44

Page 60: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

First Code Changes

Floating point exception-handling

• prevents vectorisation• fp-model-strict → fp-model-source• remove exception handling• add NaN/Inf-tests⇒ when writing checkpoints

⇒ no significant auto vectorisation⇒ suboptimal memory layout

Using MKL FFT

• small gain for benchmarks⇒ might be significant for larger setups

CONTIGUOUS keyword

• from Fortran 2008• tell the compiler about contiguously

allocated arrays

26 / 44

0

50

100

150

200

small medium large

runtime[s]

Current Results, Intel Compiler 17.0.0

1.001.00

1.00

1.091.13

1.01

1.24

1.07

0.96

HSW, BaselineHSW, CurrentKNL, Current

Page 61: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Cray Specialities

Hardware/Software

• Cray Aries network• Cray MPI• Cray Performance Tools• Cray Compiler

27 / 44

Page 62: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Cray Specialities

Hardware/Software

• Cray Aries network• Cray MPI• Cray Performance Tools• Cray Compiler

Rank Reordering

• instrumented application run• optimised mapping of MPI

ranks to cores and nodes⇒ no improvement on KNL

27 / 44

Page 63: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Cray Specialities

Hardware/Software

• Cray Aries network• Cray MPI• Cray Performance Tools• Cray Compiler

Rank Reordering

• instrumented application run• optimised mapping of MPI

ranks to cores and nodes⇒ no improvement on KNL

Cray Compiler

• initial KNL support• crash with OpenMP⇒ 64 MPI ranks per KNL

27 / 44

0

50

100

150

200

small medium large

runtime[s]

Current Results, Intel 17.0.0 vs. Cray Compiler 8.5.3

1.001.00

1.00

1.091.13

1.01

1.24

1.07

0.96

1.29

1.17

1.12

HSW, Baseline, IntelHSW, Current, IntelKNL, Current, IntelKNL, Current, Cray

Page 64: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Projected Production Run Performance• benchmark runs: ≈ 5 min, productions runs: ≈ 12 hours⇒ serial initialisation becomes negligible⇒ plot speedup based on ttotal − tinit

28 / 44

Page 65: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Projected Production Run Performance• benchmark runs: ≈ 5 min, productions runs: ≈ 12 hours⇒ serial initialisation becomes negligible⇒ plot speedup based on ttotal − tinit

28 / 44

0

0.5

1

1.5

2

small medium large

speedup

Projected Speedups (without init)

1.00 1.00 1.001.10 1.14

1.00

1.251.11 1.06

1.45

1.25 1.23

HSW, BaselineHSW, Current, IntelKNL, Current, IntelKNL, Current, Cray

Page 66: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

PALM - Final RemarksBottom Line

• getting started on KNL was easy⇒ way easier than KNC and offloading

• good initial performance (cache-mode)• scalar parts hurt⇒ initialisation⇒ . . .

29 / 44https://palm.muk.uni-hannover.de

Page 67: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

PALM - Final RemarksBottom Line

• getting started on KNL was easy⇒ way easier than KNC and offloading

• good initial performance (cache-mode)• scalar parts hurt⇒ initialisation⇒ . . .

• Cray Fortran compiler up to 16.5%faster than Intel on KNL

29 / 44https://palm.muk.uni-hannover.de

Page 68: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

PALM - Final RemarksBottom Line

• getting started on KNL was easy⇒ way easier than KNC and offloading

• good initial performance (cache-mode)• scalar parts hurt⇒ initialisation⇒ . . .

• Cray Fortran compiler up to 16.5%faster than Intel on KNL

• speedup over dual HSW HLRN node:• benchmark: up to 1.29ו production (projected): up to 1.45×⇒ even without vectorisation

29 / 44https://palm.muk.uni-hannover.de

Page 69: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

PALM - Final RemarksBottom Line

• getting started on KNL was easy⇒ way easier than KNC and offloading

• good initial performance (cache-mode)• scalar parts hurt⇒ initialisation⇒ . . .

• Cray Fortran compiler up to 16.5%faster than Intel on KNL

• speedup over dual HSW HLRN node:• benchmark: up to 1.29ו production (projected): up to 1.45×⇒ even without vectorisation

What’s next• VTune Results:⇒ increase concurrency⇒ reduce L2 misses on KNL

• adapt data layout for SIMD⇒ twice the potential on KNL

29 / 44https://palm.muk.uni-hannover.de

Page 70: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Case study II: Material Science with VASPVASP – Vienna Ab-Initio Simulation Package

• Plan wave electronic structure code to model atomic scale materials from firstprincipals: Hψ = Eψ

• widely used in material sciences• historically grown, large MPI-only Fortran code base

• different approximations (DFT, hybrid functionals, . . . ) to tackle the physics• library dependencies: FFTW, BLAS, scaLAPACK, ELPA

Collaboration of:

30 / 44ZIB contact: Florian Wende ([email protected])

Page 71: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

SIMD – Single Instruction Multiple Data Multiple words are processed at once sharing 1 program counter SIMD registers become increasingly larger: currently 512 bit with AVX-512

Slightly increased logic on the chip, but heavily increased arithmetic throughput Xeon Phi KNL w/ and w/o SIMD: 3 TFLOPS vs. 0.37 TFLOPS

SIMD Introduction

31 / 44

Page 72: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

SIMD – Single Instruction Multiple Data Multiple words are processed at once sharing 1 program counter

for (i = 0; i < N; ++i)y[i] = log(x[i]);

...

tim

eSIMD Introduction

32 / 44

Page 73: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

SIMD – Single Instruction Multiple Data Multiple words are processed at once sharing 1 program counter

for (i = 0; i < N; ++i) for (i = 0; i < N; i += 8)y[i] = log(x[i]); y[i + 0] = log(x[i + 0]);

... y[i] = vlog(x[i])y[i + 7] = log(x[i + 7]);

...

...

tim

e

8 times faster execution with SIMD

2

SIMD Introduction

33 / 44

Page 74: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

4 times faster execution with SIMD

SIMD – Single Instruction Multiple Data Multiple words are processed at once sharing 1 program counter Control flow divergences can hurt SIMD performance significantly

for (i = 0; i < N; ++i) for (i = 0; i < N; i += 8)if (p[i]) y[i] = log(x[i]); if (m← p[i]) y[i] = vlog_mask(y[i], m, x[i]);else y[i] = exp(x[i]); else y[i] = vexp_mask(y[i], ~m, x[i]);

...

...

tim

eSIMD Introduction

34 / 44

Page 75: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

OpenMP 4.x compiler directives portability across compilers low code invasiveness no SIMD intrinsics for Fortran

combine OpenMP 4.x SIMD with “high-level vectors” (loop chunking) to increase flexibility and expressiveness

SIMD Vectorisation in VASP

35 / 44

Page 76: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Non-vectorizable loop split into parts to enable SIMD vectorization

idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;

d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);

}

C-version of the Codes(not optimized)

SIMD Vectorisation in VASP - Example

36 / 44

Page 77: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Non-vectorizable loop split into parts to enable SIMD vectorization

idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;

d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);

}

nj rather small: not a candidate for SIMD vectorization

SIMD Vectorisation in VASP - Example

37 / 44

Page 78: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Non-vectorizable loop split into parts to enable SIMD vectorization

idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;

d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);

}

loop iterations are not independent: idx

SIMD Vectorisation in VASP - Example

38 / 44

Page 79: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Non-vectorizable loop split into parts to enable SIMD vectorization

idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;

d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);

}

idx = 0;for (i = 0; i < ni; i += CHUNKSIZE) {ii_max = min(CHUNKSIZE, ni – i);for (ii = 0; ii < ii_max; ++ii) {while (“some condition”)

++idx;vidx[ii] = idx;

}...

}

Compute idx-values in advance to enable SIMD vectorization afterwards!

Loop-Chunking, e.g. CHUNKSIZE=32

SIMD Vectorisation in VASP - Example

39 / 44

Page 80: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Non-vectorizable loop split into parts to enable SIMD vectorization

idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;

d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);

}

idx = 0;for (i = 0; i < ni; i += CHUNKSIZE) {ii_max = min(CHUNKSIZE, ni – i);for (ii = 0; ii < ii_max; ++ii) {while (“some condition”)

++idx;vidx[ii] = idx;

}for (ii = 0; ii < ii_max; ++ii)vd[ii] = data[vidx[ii]];

...}

Load data in a separate loop: leave it to the compilerto vectorize or not

SIMD Vectorisation in VASP - Example

40 / 44

Page 81: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Non-vectorizable loop split into parts to enable SIMD vectorization

idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;

d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);

}

idx = 0;for (i = 0; i < ni; i += CHUNKSIZE) {ii_max = min(CHUNKSIZE, ni – i);for (ii = 0; ii < ii_max; ++ii) {while (“some condition”)

++idx;vidx[ii] = idx;

}for (ii = 0; ii < ii_max; ++ii)vd[ii] = data[vidx[ii]];

for (j = 0; j < nj; ++j)#pragma omp simdfor (ii = 0; ii < ii_max; ++ii)

res[j] += vd[ii] * (...);}

SIMD Vectorisation in VASP - Example

41 / 44

Page 82: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Non-vectorizable loop split into parts to enable SIMD vectorization

0

5

10

15

20

25

30

35

Tim

e [s

]

GW0 subroutine only

no-SIMD (KNL) SIMD (KNL)

no-SIMD (2x Haswell CPU) SIMD (2x Haswell CPU)

0

10

20

30

40

50

60

70

80

90

Whole program

further optimization needed!

Xeon Phi nodes: quadrant mode, all data in MCDRAM

SIMD Vectorisation in VASP - Results

42 / 44

Page 83: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL SummaryXeon Phi (KNL) has a low entry barrier. . .

• no offloading• well-known CPU toolchains and workflows

. . . but getting performance is challenging

• ease of use can be misleading towards quick fixes• code needs to be re-thought and re-written for SIMD• effort pays off with significant speed-up for hot-spots⇒ Xeon benefits from Xeon Phi optimisations as well

• overall application performance suffers from low single-thread performance⇒ working on a few hotspots is not sufficient

With AVX-512 in Xeon and Xeon Phi, SIMD can no longer be neglected.

43 / 44

Page 84: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL SummaryXeon Phi (KNL) has a low entry barrier. . .

• no offloading• well-known CPU toolchains and workflows

. . . but getting performance is challenging

• ease of use can be misleading towards quick fixes• code needs to be re-thought and re-written for SIMD• effort pays off with significant speed-up for hot-spots⇒ Xeon benefits from Xeon Phi optimisations as well

• overall application performance suffers from low single-thread performance⇒ working on a few hotspots is not sufficient

With AVX-512 in Xeon and Xeon Phi, SIMD can no longer be neglected.

43 / 44

Page 85: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

KNL SummaryXeon Phi (KNL) has a low entry barrier. . .

• no offloading• well-known CPU toolchains and workflows

. . . but getting performance is challenging

• ease of use can be misleading towards quick fixes• code needs to be re-thought and re-written for SIMD• effort pays off with significant speed-up for hot-spots⇒ Xeon benefits from Xeon Phi optimisations as well

• overall application performance suffers from low single-thread performance⇒ working on a few hotspots is not sufficient

With AVX-512 in Xeon and Xeon Phi, SIMD can no longer be neglected.

43 / 44

Page 86: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Overall Conclusions

A deep knowledge of each hardware platform is necessary to fully exploit itscomputing power.

Code modernisation for KNL within IPCCs world-wide is just one example of theeffort it takes to keep the huge amount of legacy code in HPC usable.

Within the more and more diverse HPC hardware landscape, larger shares ofHPC centre’s budgets will have to be allocated for code modernisation work inorder to utilise future machines efficiently.

44 / 44

- EoP

Page 87: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Overall Conclusions

A deep knowledge of each hardware platform is necessary to fully exploit itscomputing power.

Code modernisation for KNL within IPCCs world-wide is just one example of theeffort it takes to keep the huge amount of legacy code in HPC usable.

Within the more and more diverse HPC hardware landscape, larger shares ofHPC centre’s budgets will have to be allocated for code modernisation work inorder to utilise future machines efficiently.

44 / 44

- EoP

Page 88: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Overall Conclusions

A deep knowledge of each hardware platform is necessary to fully exploit itscomputing power.

Code modernisation for KNL within IPCCs world-wide is just one example of theeffort it takes to keep the huge amount of legacy code in HPC usable.

Within the more and more diverse HPC hardware landscape, larger shares ofHPC centre’s budgets will have to be allocated for code modernisation work inorder to utilise future machines efficiently.

44 / 44

- EoP

Page 89: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing

Overall Conclusions

A deep knowledge of each hardware platform is necessary to fully exploit itscomputing power.

Code modernisation for KNL within IPCCs world-wide is just one example of theeffort it takes to keep the huge amount of legacy code in HPC usable.

Within the more and more diverse HPC hardware landscape, larger shares ofHPC centre’s budgets will have to be allocated for code modernisation work inorder to utilise future machines efficiently.

44 / 44

- EoP