Performance Engineering for Legacy Codes on a Cray XC40 with Intel Xeon Phi (KNL)

Matthias Noack ([email protected]), Florian Wende, Thomas Steinke, Alexander Reinefeld
Zuse Institute Berlin

2017-06-22, Performance Engineering for HPC: Implementation, Processes & Case Studies at ISC’17
North-German Supercomputing Alliance (HLRN)

• “Konrad” (Berlin, ZIB): 1872 Xeon nodes, 44,928 cores
• “Gottfried” (Hannover, LUIS): 1680 Xeon nodes, 40,320 cores, plus 64 SMP servers with 256/512 GiB
• TDS (Berlin, ZIB): 16 KNC nodes (until July 2016), 80 KNL nodes (since July 2016), DataWarp nodes
• both sites connected via 10 Gbps (243 km linear distance)

Applications on the HLRN-III: many pure MPI codes, some MPI+OpenMP, some vectorized
Supercomputers at ZIB

[Timeline figure: peak performance and core count per system]
• 1984: Cray 1M, 160 MFlops, 1 core
• 1987: Cray X-MP, 471 MFlops, 2 cores
• 1994: Cray T3D, 38 GFlops, 192 cores (DEC Alpha)
• 1997: Cray T3E, 486 GFlops, 256 cores (DEC Alpha)
• 2002 (HLRN-I): IBM p690, 2.5 TFlops, 384 cores (IBM Power4), 200 kWatt, 10 M€
• 2008 (HLRN-II): SGI ICE, XE, 150 TFlops, 10,240 cores (Intel Harpertown, Nehalem)
• 2013 (HLRN-III): Cray XC30/XC40, 1.3 PFlops, 44,928 cores (Intel Ivy Bridge, Haswell)
• 2016: Intel Xeon Phi 7290 (KNL): 72 cores (288 threads), 3 TFLOPS, 245 Watt, 6662 € (6/2017)
Intel Parallel Computing Center (IPCC): Research Center for Many-Core HPC at ZIB

OBJECTIVE: Many-Core High-Performance Computing
RESEARCH: Programming Models, Runtime Libraries
APPLICATIONS: Modernization, OpenMP/MPI, Scalability

Applications
• GLAT (atomistic thermodynamics)
• VASP (electronic structure)
• BQCD (high-energy physics)
• HEOM (photo-active processes)
• BOSS (time series analysis, phase 2)
• PALM (fluid dynamics, phase 2)
• PHOENIX3D (astrophysics, phase 2)

Challenges
• adapting data structures to enable SIMD
• vectorising complex code structures
• transition to hybrid MPI + OpenMP
• (offload with Intel LEO vs. OpenMP 4.x)
Intel Xeon Phi (Knights Landing) – Architecture

Knights Landing (KNL): Intel’s second Many Integrated Core (MIC) architecture¹
• self-booting CPU, optionally with integrated Omni-Path fabric
• 64+ cores (based on the Intel Atom Silvermont architecture, x86-64)
• 4-way hardware threading
• 512-bit SIMD vector processing (AVX-512)
• on-chip Multi-Channel DRAM (MCDRAM): 16 GiB
• DDR4 main memory: up to 384 GiB

1) A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro, vol. 36, no. 2, April 2016
[Diagram: KNL CPU floor plan; up to 36 active tiles on a 2D mesh, 2 DDR memory controllers with 3 DDR4 channels each, 8 MCDRAM devices, PCIe 3, DMI, misc. I/O]

• up to 36 active tiles, connected via a 2D mesh interconnect
• per tile: 2 out-of-order cores with 2 VPUs each (AVX-512), 1 MiB shared L2 cache, and a Caching/Home Agent (CHA) as interface to the mesh and part of the distributed tag directory (MESIF cache coherency protocol)
• DDR4 main memory (up to 384 GiB): 2 memory controllers, 6 DDR4 channels in total
• MCDRAM (16 GiB): 8 on-chip units of 2 GiB each

Memory modes:
• Flat-Mode: DDR4 and MCDRAM in the same physical address space
• Cache-Mode: MCDRAM acts as a direct-mapped cache for DDR4
• Hybrid-Mode: 8 or 4 GiB of MCDRAM in Cache-Mode, the remainder in Flat-Mode
KNL in the Roofline Model (Samuel Williams, 2008)

[Plot: attainable FLOPS [GFLOPS] over arithmetic intensity [FLOP/Byte] on log-log axes (1–4096 GFLOPS, 1–256 FLOP/Byte), with ceilings for peak FLOPS, DRAM bandwidth, and MCDRAM bandwidth]

\[
\mathrm{FLOPS_{attainable}} = \min\left(\mathrm{FLOPS_{peak}},\; \mathrm{BW_{mem}} \times \mathrm{AI}\right)
\]

• left of the ridge point: memory bound; right of it: compute bound
• minimal arithmetic intensity necessary for peak FLOPS (HLRN nodes):
  ⇒ KNL DRAM: 22.7 FLOP/Byte
  ⇒ KNL MCDRAM: 5.3 FLOP/Byte
  ⇒ Haswell: 7.1 FLOP/Byte

Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s DDR, 490 GB/s MCDRAM; Xeon E5-2680v3: 480 GFLOPS, 68 GiB/s
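The ridge points follow directly from the model: peak FLOPS are reached at

\[
\mathrm{AI_{ridge}} = \frac{\mathrm{FLOPS_{peak}}}{\mathrm{BW_{mem}}}: \qquad
\frac{2611.2}{115.2} \approx 22.7, \qquad
\frac{2611.2}{490} \approx 5.3, \qquad
\frac{480}{68} \approx 7.1 \quad \mathrm{FLOP/Byte}
\]

for KNL DDR4, KNL MCDRAM, and one Haswell socket, respectively.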
Refining the Ceilings – FLOPS

Realistic peak FLOPS
• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
  • 1.4 GHz × 68 cores × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS
  • the AVX frequency is only 1.2 GHz and might throttle down further under heavy load
  ⇒ actual peak: 2611.2 GFLOPS

Adding more FLOPS ceilings
• without instruction-level parallelism (ILP), i.e. without dual VPUs and FMA:
  • 1.2 GHz × 68 cores × 8 SIMD = 652.8 GFLOPS (25%)
• without ILP and without SIMD:
  • 1.2 GHz × 68 cores = 81.6 GFLOPS (3.1%)
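Written out at the 1.2 GHz AVX clock, the three ceilings are:

\[
\begin{aligned}
\text{peak:} &\quad 1.2\,\mathrm{GHz} \times 68 \times 8 \times 2 \times 2 = 2611.2\ \mathrm{GFLOPS}\\
\text{w/o ILP:} &\quad 1.2\,\mathrm{GHz} \times 68 \times 8 = 652.8\ \mathrm{GFLOPS} \approx 25\%\\
\text{w/o ILP+SIMD:} &\quad 1.2\,\mathrm{GHz} \times 68 = 81.6\ \mathrm{GFLOPS} \approx 3.1\%
\end{aligned}
\]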
KNL in the Roofline Model

[Plot as before, with the additional compute ceilings “w/o ILP” and “w/o ILP+SIMD”]

Upper compute bound in GFLOPS:
• peak: 2611.2
• w/o ILP: 652.8
• w/o ILP+SIMD: 81.6
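The model behind these plots is easily evaluated; a minimal sketch in plain C, using the Xeon Phi 7250 numbers from the slides (nothing here is measured):

    #include <math.h>
    #include <stdio.h>

    /* Roofline model: attainable GFLOPS at a given arithmetic intensity (AI). */
    static double attainable_gflops(double ai, double peak_gflops, double bw_gb_s) {
        return fmin(peak_gflops, bw_gb_s * ai);
    }

    int main(void) {
        const double peak = 2611.2;      /* GFLOPS at the AVX clock */
        const double bw_ddr = 115.2;     /* GiB/s, DDR4 */
        const double bw_mcdram = 490.0;  /* GB/s, MCDRAM */
        for (double ai = 1.0; ai <= 256.0; ai *= 2.0)
            printf("AI %6.1f FLOP/Byte: DDR %7.1f GFLOPS, MCDRAM %7.1f GFLOPS\n",
                   ai, attainable_gflops(ai, peak, bw_ddr),
                   attainable_gflops(ai, peak, bw_mcdram));
        return 0;
    }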
Case study I: Atmospheric Research with PALM

Application areas: dust devils, wind energy, cloud physics, turbulence effects on aircraft, urban meteorology and city planning

https://palm.muk.uni-hannover.de
The PALM Code

• continuously developed since 1997 by the PALM group (Siegfried Raasch et al.)
• Fortran 95/2003
• hybrid MPI + OpenMP code
• 140 kLOC, 79 modules and 171 source files
• highly scalable, tested on up to 43,200 cores
• runs on the HLRN supercomputing facilities at Berlin (ZIB) and Hannover (LUIS)
• modernisation target within the Intel Parallel Computing Center at ZIB

https://palm.muk.uni-hannover.de
PALM Hackathon at ZIB

[Photo; left to right: Matthias Noack, Florian Wende, Helge Knoop, Matthias Sühring, Tobias Gronemeier; behind the camera: Thomas Steinke]
Parallelization Strategy

[Figure: 2D domain decomposition of the x–y plane across MPI ranks 0–8]

• 2D domain decomposition across MPI ranks
• the outer of 3 nested loops is OpenMP-threaded
• different loop variants for different targets, e.g. cache-optimised with a decomposed inner loop
• no vectorisation:
  ⇒ the use of floating point exceptions prevents automatic vectorisation
  ⇒ no explicit SIMD constructs

https://palm.muk.uni-hannover.de
KNL Optimisation Strategy

• get operational on KNL: build scripts, job scripts, bug fixing
• define some benchmarks: small, medium, large
• measure the baseline on Xeon
• settle MCDRAM usage and boot mode ⇒ quad_cache
• tune MPI processes vs. OpenMP threads ⇒ 16 ranks × 4 threads
• start the optimisation workflow: compiler optimisation reports, Intel VTune Amplifier XE, Advisor XE
Benchmark Systems and Haswell Baseline

HLRN-III production system, 1872 nodes (Konrad)
• 2 × Intel Xeon E5-2680v3 (Haswell)
• 2 × 12 cores at 2.5 GHz ⇒ 960 GFLOPS per node
• 64 GiB DDR3, 136 GiB/s

HLRN KNL TDS, 80 nodes
• 1 × Intel Xeon Phi 7250 (KNL), advertised with “3.05 TFLOPS”
• 68 cores at 1.4 GHz ⇒ 2611.2 GFLOPS at the AVX clock
• 96 GiB DDR4, 115.2 GiB/s
• 16 GiB MCDRAM, 490 GiB/s
• upper bounds for the per-node speed-up:
  ⇒ compute bound: 2.7×
  ⇒ memory bound: 3.6× (MCDRAM)

Haswell results:
            nodes   runtime
  small       4     131.3 s
  medium      8     147.6 s
  large      16     164.6 s
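The speed-up bounds are simply the ratios of the corresponding roofline ceilings:

\[
\frac{2611.2\ \mathrm{GFLOPS}}{960\ \mathrm{GFLOPS}} \approx 2.7\,, \qquad
\frac{490\ \mathrm{GiB/s}}{136\ \mathrm{GiB/s}} \approx 3.6
\]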
MCDRAM Usage and Boot Mode

Procedure
• boot flat mode: run in DDR, then run in MCDRAM
  - requires the problem to fit into 16 GiB
  - gives an upper bound for the gain from MCDRAM
• boot cache mode: run again
  - good enough? ⇒ decide about explicit placement

Results
• 25–41% gain from MCDRAM
• ≤ 3% loss from Cache-Mode ⇒ no need for explicit placement
• 2 MiB pages worked best

[Chart: MCDRAM Usage Comparison; runtime in s for small/medium/large. Speedup over quad_flat with DDR4: quad_flat with MCDRAM 1.41/1.41/1.25, quad_cache 1.38/1.39/1.22]
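Had cache mode not been good enough, explicit placement in flat mode would typically go through the memkind library's hbwmalloc interface. A minimal sketch (hbw_check_available, hbw_malloc and hbw_free are the actual memkind calls; the array size is made up):

    #include <hbwmalloc.h>   /* memkind's high-bandwidth memory (MCDRAM) interface */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const size_t n = 1u << 20;   /* example size, not from the slides */
        int have_hbw = (hbw_check_available() == 0);
        /* Allocate from MCDRAM if present, otherwise fall back to DDR. */
        double *a = have_hbw ? hbw_malloc(n * sizeof *a) : malloc(n * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < n; ++i)   /* touch the pages */
            a[i] = (double)i;
        printf("allocated %zu doubles in %s\n", n, have_hbw ? "MCDRAM" : "DDR");
        if (have_hbw) hbw_free(a); else free(a);
        return 0;
    }

Compile and link against memkind, e.g. cc demo.c -lmemkind.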
MPI Processes vs. OpenMP Threads

Tuning run for the optimal per-node configuration (quad_cache mode):

    ranks      64   32   16    8    4    2    1
    threads     1    2    4    8   16   32   64
    small                 X
    medium                      X
    large            X

    (X = fastest; the remaining configurations were ≤ 2% off, ≤ 15% off from the best, or worse)

Conclusions
• fastest: small 16 ranks × 4 threads, medium 8 ranks × 8 threads, large 32 ranks × 2 threads
• 16 × 4 performs well for all setups
• fastest vs. slowest config: 2.3×
• very low impact of the MCDRAM usage model on these speedups
• absolute numbers vary largely ⇒ optimise memory usage first
• always check pinning/affinity
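For reference, a minimal hybrid MPI + OpenMP skeleton of the kind being tuned here; this is not PALM code, and the rank/thread counts come from the job launcher (e.g. 16 ranks × 4 threads per node):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, nranks;
        /* Request threaded MPI; FUNNELED suffices if only the master thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        {
            #pragma omp single   /* print once per rank */
            printf("rank %d of %d runs %d OpenMP threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

On the XC40, such a binary is launched with the ranks-per-node and OMP_NUM_THREADS settings from the tuning table above.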
First Code Changes

Floating point exception handling
• prevents vectorisation
• -fp-model strict → -fp-model source
• remove the exception handling; instead, add NaN/Inf tests when writing checkpoints
⇒ still no significant auto-vectorisation: the memory layout is suboptimal

Using the MKL FFT
• small gain for the benchmarks
⇒ might be significant for larger setups

CONTIGUOUS keyword
• from Fortran 2008
• tells the compiler about contiguously allocated arrays
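A sketch of the kind of NaN/Inf test that replaces trapping floating point exceptions; PALM does this in Fortran at checkpoint time, so this C version is purely illustrative:

    #include <math.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Scan a field before writing a checkpoint; returns the number of bad values. */
    static size_t count_non_finite(const double *field, size_t n) {
        size_t bad = 0;
        for (size_t i = 0; i < n; ++i)
            if (!isfinite(field[i]))   /* catches NaN as well as +/-Inf */
                ++bad;
        return bad;
    }

    int main(void) {
        double field[4] = { 1.0, nan(""), INFINITY, 2.0 };   /* demo data */
        printf("non-finite values: %zu\n", count_non_finite(field, 4));
        return 0;
    }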
[Chart: Current Results, Intel Compiler 17.0.0; runtime in s for small/medium/large. Speedup over the Haswell baseline: HSW Current 1.09/1.13/1.01, KNL Current 1.24/1.07/0.96]
Cray Specialities

Hardware/Software
• Cray Aries network
• Cray MPI
• Cray Performance Tools
• Cray Compiler

Rank Reordering
• instrumented application run
• optimised mapping of MPI ranks to cores and nodes
⇒ no improvement on KNL

Cray Compiler
• initial KNL support
• crash with OpenMP ⇒ 64 MPI ranks per KNL instead
[Chart: Current Results, Intel 17.0.0 vs. Cray Compiler 8.5.3; speedup over the Haswell baseline: HSW/Intel 1.09/1.13/1.01, KNL/Intel 1.24/1.07/0.96, KNL/Cray 1.29/1.17/1.12 for small/medium/large]
Projected Production Run Performance
• benchmark runs: ≈ 5 min, production runs: ≈ 12 hours
⇒ serial initialisation becomes negligible
⇒ plot speedup based on t_total − t_init
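With the speedup taken relative to the Haswell baseline, the projected figure is (the exact normalisation is an assumption, the slide leaves it implicit):

\[
S = \frac{t_{\mathrm{total}}^{\mathrm{HSW}} - t_{\mathrm{init}}^{\mathrm{HSW}}}{t_{\mathrm{total}}^{\mathrm{KNL}} - t_{\mathrm{init}}^{\mathrm{KNL}}}
\]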
[Chart: Projected Speedups (without init) over the Haswell baseline; HSW Current 1.10/1.14/1.00, KNL/Intel 1.25/1.11/1.06, KNL/Cray 1.45/1.25/1.23 for small/medium/large]
PALM – Final Remarks

Bottom line
• getting started on KNL was easy ⇒ much easier than KNC and offloading
• good initial performance (cache-mode)
• scalar parts hurt ⇒ initialisation, ...
• Cray Fortran compiler up to 16.5% faster than Intel on KNL
• speedup over a dual-socket HSW HLRN node:
  • benchmark: up to 1.29×
  • production (projected): up to 1.45×
  ⇒ even without vectorisation

What’s next
• VTune results ⇒ increase concurrency, reduce L2 misses on KNL
• adapt the data layout for SIMD ⇒ twice the potential on KNL

https://palm.muk.uni-hannover.de
Case study II: Material Science with VASP

VASP – Vienna Ab-initio Simulation Package
• plane-wave electronic structure code to model atomic-scale materials from first principles: Hψ = Eψ
• widely used in materials science
• historically grown, large MPI-only Fortran code base
• different approximations (DFT, hybrid functionals, ...) to tackle the physics
• library dependencies: FFTW, BLAS, ScaLAPACK, ELPA

Collaboration of: [partner logos]

ZIB contact: Florian Wende ([email protected])
SIMD Introduction

SIMD – Single Instruction Multiple Data
• multiple words are processed at once, sharing one program counter
• SIMD registers become increasingly larger: currently 512 bit with AVX-512
• slightly increased logic on the chip, but heavily increased arithmetic throughput
• Xeon Phi KNL with and without SIMD: 3 TFLOPS vs. 0.37 TFLOPS

Scalar loop:

    for (i = 0; i < N; ++i)
        y[i] = log(x[i]);

SIMD loop (pseudocode, 8 doubles per 512-bit vector):

    for (i = 0; i < N; i += 8)
        y[i:i+7] = vlog(x[i:i+7]);   /* one vector log replaces 8 scalar calls */

⇒ up to 8 times faster execution with SIMD

Control flow divergences can hurt SIMD performance significantly; with SIMD, both branches are executed under a mask.

Scalar version with a branch:

    for (i = 0; i < N; ++i)
        if (p[i]) y[i] = log(x[i]);
        else      y[i] = exp(x[i]);

Masked SIMD version (pseudocode):

    for (i = 0; i < N; i += 8) {
        m = (p[i:i+7] != 0);
        y[i:i+7] = vlog_mask(y[i:i+7],  m, x[i:i+7]);
        y[i:i+7] = vexp_mask(y[i:i+7], ~m, x[i:i+7]);
    }

⇒ only about 4 times faster execution when both branches have to be executed
SIMD Vectorisation in VASP

OpenMP 4.x compiler directives
• portability across compilers
• low code invasiveness
• no SIMD intrinsics available for Fortran
• combine OpenMP 4.x SIMD with “high-level vectors” (loop chunking) to increase flexibility and expressiveness
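As a minimal illustration of the directive itself, an OpenMP 4.x SIMD loop in C; the slides apply the same construct inside VASP's Fortran, so this toy loop is not VASP code:

    #include <math.h>
    #include <stdio.h>

    #define N 1024

    int main(void) {
        double x[N], y[N];
        for (int i = 0; i < N; ++i)
            x[i] = 1.0 + i;

        /* Ask the compiler to vectorise this loop explicitly. */
        #pragma omp simd
        for (int i = 0; i < N; ++i)
            y[i] = log(x[i]);

        printf("y[0] = %f, y[N-1] = %f\n", y[0], y[N - 1]);
        return 0;
    }

Compiled with e.g. icc -qopenmp-simd or gcc -fopenmp-simd the pragma is honoured; otherwise it is simply ignored.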
SIMD Vectorisation in VASP – Example

A non-vectorizable loop is split into parts to enable SIMD vectorization.
C version of the code (not optimized):

    idx = 0;
    for (i = 0; i < ni; ++i) {
        while ("some condition")
            ++idx;
        d = data[idx];
        for (j = 0; j < nj; ++j)
            res[j] += d * (...);
    }

• nj is rather small: the inner loop is not a candidate for SIMD vectorization
• the outer loop iterations are not independent: idx carries a dependency

Loop chunking (e.g. CHUNKSIZE = 32): compute the idx values in advance to enable SIMD vectorization afterwards; load the data in a separate loop and leave it to the compiler whether to vectorize it or not:

    idx = 0;
    for (i = 0; i < ni; i += CHUNKSIZE) {
        ii_max = min(CHUNKSIZE, ni - i);
        for (ii = 0; ii < ii_max; ++ii) {   /* 1: precompute indices (scalar) */
            while ("some condition")
                ++idx;
            vidx[ii] = idx;
        }
        for (ii = 0; ii < ii_max; ++ii)     /* 2: gather into a contiguous chunk */
            vd[ii] = data[vidx[ii]];
        for (j = 0; j < nj; ++j)            /* 3: SIMD across the chunk */
            #pragma omp simd
            for (ii = 0; ii < ii_max; ++ii)
                res[j] += vd[ii] * (...);
    }
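To make the pattern concrete, here is a self-contained toy version; the predicate standing in for “some condition” and the weights standing in for “(...)” are invented, only the chunking structure matches the slides:

    #include <stdio.h>

    #define NI 1000
    #define NJ 4
    #define CHUNKSIZE 32

    static int min_int(int a, int b) { return a < b ? a : b; }

    int main(void) {
        static double data[4 * NI], res[NJ] = { 0.0 }, w[NI];
        double vd[CHUNKSIZE];
        int vidx[CHUNKSIZE];

        for (int i = 0; i < 4 * NI; ++i) data[i] = 0.001 * i;
        for (int i = 0; i < NI; ++i) w[i] = 1.0 / (1.0 + i);

        int idx = 0;
        for (int i = 0; i < NI; i += CHUNKSIZE) {
            int ii_max = min_int(CHUNKSIZE, NI - i);
            /* phase 1: scalar index computation (invented stand-in predicate) */
            for (int ii = 0; ii < ii_max; ++ii) {
                while ((i + ii + idx) % 3 != 0)
                    ++idx;
                vidx[ii] = idx;
            }
            /* phase 2: gather into a contiguous chunk */
            for (int ii = 0; ii < ii_max; ++ii)
                vd[ii] = data[vidx[ii]];
            /* phase 3: SIMD-friendly arithmetic over the chunk */
            for (int j = 0; j < NJ; ++j) {
                double acc = 0.0;
                #pragma omp simd reduction(+ : acc)
                for (int ii = 0; ii < ii_max; ++ii)
                    acc += vd[ii] * w[i + ii];   /* stand-in for "(...)" */
                res[j] += acc / (1.0 + j);
            }
        }
        printf("res[0] = %f\n", res[0]);
        return 0;
    }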
SIMD Vectorisation in VASP – Results

[Charts: runtime in s for no-SIMD vs. SIMD on KNL and on 2 × Haswell CPU.
Left: GW0 subroutine only. Right: whole program, where further optimization is needed.
Xeon Phi nodes: quadrant mode, all data in MCDRAM]
KNL Summary

Xeon Phi (KNL) has a low entry barrier...
• no offloading
• well-known CPU toolchains and workflows

...but getting performance is challenging
• ease of use can mislead towards quick fixes
• code needs to be re-thought and re-written for SIMD
• the effort pays off with significant speed-ups for hot spots
  ⇒ Xeon benefits from Xeon Phi optimisations as well
• overall application performance suffers from low single-thread performance
  ⇒ working on a few hotspots is not sufficient

With AVX-512 in Xeon and Xeon Phi, SIMD can no longer be neglected.
Overall Conclusions

Deep knowledge of each hardware platform is necessary to fully exploit its computing power.

Code modernisation for KNL within the IPCCs world-wide is just one example of the effort it takes to keep the huge amount of legacy code in HPC usable.

Within the increasingly diverse HPC hardware landscape, larger shares of HPC centres' budgets will have to be allocated to code modernisation work in order to utilise future machines efficiently.