82
1 HOW Series: Knights Landing Hands-On Workshop on Performance Optimization for Intel® Xeon Phi Processor Family x200 Colfax International @colfaxintl September 2016 - Rev. 3.0 colfaxresearch.com/knl-webinar Welcome © Colfax International, 2013–2016

HOW Series: Knights Landing

Embed Size (px)

Citation preview

Page 1: HOW Series: Knights Landing

1

HOW Series: Knights LandingHands-On Workshop on Performance Optimization for

Intel® Xeon Phi™ Processor Family x200

Colfax International — @colfaxintl

September 2016 - Rev. 3.0

colfaxresearch.com/knl-webinar Welcome © Colfax International, 2013–2016

Page 2: HOW Series: Knights Landing

2

About This Document

This document represents the ma-terials of a Web-based training “In-troduction to 2nd Generation Intel®

Xeon Phi™ Processors: Developer’sGuide to Knights Landing” developedand run by Colfax International.

© Colfax International, 2013–2016

colfaxresearch.com/knl-webinar/

colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016

Page 3: HOW Series: Knights Landing

3

Disclaimer

While best efforts have been used in preparing this training, Colfax International makes norepresentations or warranties of any kind and assumes no liabilities of any kind with respect tothe accuracy or completeness of the contents and specifically disclaims any implied warrantiesof merchantability or fitness of use for a particular purpose. The publisher shall not be heldliable or responsible to any person or entity with respect to any loss or incidental orconsequential damages caused, or alleged to have been caused, directly or indirectly, by theinformation or programs contained herein. No warranty may be created or extended by salesrepresentatives or written sales materials.

colfaxresearch.com/knl-webinar About This Document © Colfax International, 2013–2016

Page 4: HOW Series: Knights Landing

4

Resources

colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016

Page 5: HOW Series: Knights Landing

5

Colfax Research

http://colfaxresearch.com/

colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016

Page 6: HOW Series: Knights Landing

6

HOW Series

Interested? Sign-up at:

colfaxresearch.com/how-seriescolfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016

Page 7: HOW Series: Knights Landing

7

Developer Access Program (DAP)

Can’t wait to get your hands on Knights Landing?

Find out more at dap.xeonphi.comor contact us at [email protected].

colfaxresearch.com/knl-webinar Resources © Colfax International, 2013–2016

Page 8: HOW Series: Knights Landing

8

§2. Intel Architecture: Today andTomorrow

colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016

Page 9: HOW Series: Knights Landing

9

Intel Architecture

colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016

Page 10: HOW Series: Knights Landing

10

Servers with Intel Xeon Phi Processors

Bootable Host Processor

RHEL/CentOS/SUSE/Win

64 cores × 4 HT, 1.3 GHz

≤ 384 GiB DDR4, > 90 GB/s

16 GiB HBM, > 400 GB/s

PCIe bus for network

4 systems in 2U chassis

dap.xeonphi.com

colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016

Page 11: HOW Series: Knights Landing

11

Workstations with Intel Xeon Phi Processors

Bootable Host Processor

RHEL/CentOS/SUSE/Win

64 cores × 4 HT, 1.3 GHz

≤ 384 GiB DDR4, > 90 GB/s

16 GiB HBM, > 400 GB/s

PCIe bus for add-in cards

Liquid- or air-cooled

dap.xeonphi.com

colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016

Page 12: HOW Series: Knights Landing

12

Future Form-Factors

KNLF: KNL with FabricFabric integrated on CPU

Ï Intel OmniPath Architecture

Socket mount processor*KNC image

KNL CoprocessorPCIe add-in card

Ï Requires host

Multiple KNLs in a system

colfaxresearch.com/knl-webinar Intel Architecture: Today and Tomorrow © Colfax International, 2013–2016

Page 13: HOW Series: Knights Landing

13

§3. Out-of-the-Box Performance

colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016

Page 14: HOW Series: Knights Landing

14

Intel® MKL

Intel® Math Kernel Library (MKL) — standard mathematical functionsoptimized for Intel architecture.

BLAS

LAPACK

Sparse solvers

ScaLAPACK

Linear AlgebraFast FourierTransform Vector Math

Vector RandomNumber Generators

SummaryStatistics

Data Fitting

Multidimentional

(up to 7D)

FFTW interfaces

Cluster FFT

Trigonometric

Hyperbolic

Exponential

Logarithmic

Power/Root

Rounding

Congruential

Recursive

Wichmann-Hill

Mersenne Twister

Sobol

Neiderreiter

Non-deterministic

Kurtosis

Variation

coefficitent

Quantiles, order

statistics

Min/max

Variance-

covariance

Splines

Interpolation

Cell search

colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016

Page 15: HOW Series: Knights Landing

15

DGEMM in MKLC/C++

1 #include <mkl.h>2 ...3 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,4 n, n, n, 1.0, A, n, B, n, 0.0, C, n);

Python

from scipy.linalg.blas import dgemm as dgemm...C=dgemm(alpha=1.0, a=A, b=B, overwrite_c=1, trans_b=1)

Rrequire(Matrix)...c <- a %*% b

colfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016

Page 16: HOW Series: Knights Landing

16

Python Benchmark

N=500 1000 1200 1500 2000 3000 4000 50000

500

1000

1500

2000

Per

form

ance

(GFL

OP/

s)

10 11 11 11 12 12 12 12 60 69 71 72 77 75 73 83

280

701 745

968

1270

1530

1730 1850

colfaxresearch.com DGEMM on Knights Landing Processors

CPython, SciPy CPython, NumPy Intel Python, SciPy

For more information on KNL with Intel Python, see this papercolfaxresearch.com/knl-webinar Out-of-the-Box Performance © Colfax International, 2013–2016

Page 17: HOW Series: Knights Landing

17

§4. Importance of Code Optimization

colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016

Page 18: HOW Series: Knights Landing

18

Python BenchmarkWhat we did not tell you in the previous slide...

LUDecomposition

CholeskyDecomposition

Singular ValueDecomposition

DGEMM0

20

40

60

80

100

120

140

160

180

Rel

ativ

e Pe

rfor

man

ce

1.0 1.0 1.0 1.0 3.5 3.6 1.1 7.0

29.0 17.0

8.3

154.0

colfaxresearch.com Intel Python on Knights Landing Processors (N=5000)

CPython, SciPy CPython, NumPy Intel Python, SciPy

It takes good software to unlock the performance of KNL!colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016

Page 19: HOW Series: Knights Landing

19

Optimization Areas

Scalar Tuningwhat goes on in the pipeline?

Threadingdo cores cooperate efficiently?

Vectorizationis SIMD parallelism used well?

Memoryis cache usage maximized or

RAM access streamlined?

Communicationcan coordination in a distributed or

heterogeneous system be improved?

colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016

Page 20: HOW Series: Knights Landing

20

Application

1 Astrophysics:Ï planetary systemsÏ galaxiesÏ cosmological structures

2 Electrostatic systems:Ï moleculesÏ crystals

This work: “toy model” with all-to-allO

(n2

)algorithm. Practical N-body sim-

ulations may use tree algorithms withO

(n logn

)complexity.

Source: APOD, credit: Debra MeloyElmegreen (Vassar College) et al., & the

Hubble Heritage Team (AURA/ STScI/ NASA)

colfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016

Page 21: HOW Series: Knights Landing

21

N-body Simulation on Intel ArchitectureDetails in Chapter 23 of this book

Step 0:Initial

Step 1:Multi-threaded

Step 2:Scalar Tuning

Step 3:Vectorized w/SoA

Step 4:Memory Optimization

0

500

1000

1500

2000

2500

3000

Per

form

ance

, GFL

OP

/s

5.9160

261

918 976

0.8124 196 288

1712

2.1

316443

1612

2875

N-Body Simulation Performance

Intel Xeon E5-2697 v3 (Haswell)

Intel Xeon Phi 7120A (1st gen, KNC)

Intel Xeon Phi 7220 (2nd gen, KNL)

The best way to prepare your code for KNL: optimize for Xeon or KNCcolfaxresearch.com/knl-webinar Importance of Code Optimization © Colfax International, 2013–2016

Page 22: HOW Series: Knights Landing

22

§5. Details of Architecture

colfaxresearch.com/knl-webinar Details of Architecture © Colfax International, 2013–2016

Page 23: HOW Series: Knights Landing

23

Cores in Intel Xeon Phi Processors

colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016

Page 24: HOW Series: Knights Landing

24

KNL Die Organization: Tiles

Up to 36 tiles, each with 2 physical cores (72 total).

Distributed L2 cache across a mesh interconnect.

CORE

L2

CORE CORE

L2

CORE CORE

L2

CORE CORE

L2

CORE CORE

L2

CORE

DD

R4

CO

NT

RO

LL

ER

DD

R4

CO

NT

RO

LL

ER

CORE

L2

CORE

CORE

L2

CORE

CORE

L2

CORE

CORE

L2

CORE

CORE

L2

CORE

CORE

L2

CORE

CORE

L2

CORE CORE

L2

CORECORE

L2

CORE CORE

L2

CORE

CORE

L2

CORE

CORE

L2

CORE

CORE

L2

CORE CORE

L2

CORE

MCDRAM PCIe

≤ 38

4 G

iB s

yste

m D

DR

4, ~

90 G

B/s

CORE

L2

CORE

≤ 16 GiB on-package MCDRAM, ~ 400 GB/s

MCDRAM

MCDRAM MCDRAM

CORE

L2

CORE CORE

L2

CORE

CORE

L2

CORECORE

L2

CORE

...36 TILES

72 CORES

colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016

Page 25: HOW Series: Knights Landing

25

KNL Cores

4-way hyper-threading (up to 4×72 = 288 logical processors)

L2 cache shared by 2 cores on a tile.

I-cache

Decode+ Retire

Vector

ALU

VectorALU

Legacy

32 KiB L1 D-cache

1 MiBL2

D-cache

I-cache

Decode+ Retire

Vector

ALU

VectorALU

Legacy

32 KiB L1 D-cache

CORE CORETILE

to Mesh

HW+ SWprefetch

HW+ SWprefetch

Out-of-order

Out-of-order

colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016

Page 26: HOW Series: Knights Landing

26

More Forgiving Cores than First Generation (KNC)

Based on Intel® Atom cores (Silvermont microarchitecture)Out-of-order cores:Better latency masking from long latency operations

Back-to-back instructions from single thread:Thread count requirement reduced to ≈ 70 (from ≈ 120 for KNC)

Advanced branch prediction:Fewer cycles wasted to branch misprediction

Generally more forgiving to non-optimized code.

colfaxresearch.com/knl-webinar Cores in Intel Xeon Phi Processors © Colfax International, 2013–2016

Page 27: HOW Series: Knights Landing

27

Vector Instruction Support

colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016

Page 28: HOW Series: Knights Landing

28

Short Vector Support

Vector instructions – one of the implementations of SIMD (SingleInstruction Multiple Data) parallelism.

+ =+ =+ =+ =

+ =

Vec

tor

Len

gth

Scalar Instructions Vector Instructions

4 1 5

0 3 3

-2 8 6

9 -7 2

4 1 5

0 3 3

-2 8 6

9 -7 2

colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016

Page 29: HOW Series: Knights Landing

29

Dual VPU

Each core on KNL has two Vector Processing Units (VPUs).

Penalty from non-vectorized code

SP →512 bit registers /32 bits ×2 VPUs = 32 SIMD lanes

DP →512 bit registers /64 bits ×2 VPUs = 16 SIMD lanes

VPU 0512-bit registers = 16 SP or 8 DP

ZMM registersx32 per thread VPU 1

Decode

I-Cache

32/16 SIMD lanes (SP/DP)

colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016

Page 30: HOW Series: Knights Landing

30

Supported Instruction Sets on KNL

Intel® Advanced VectorExtensions 512 (AVX-512)

Ï 512-bit vector registers.Ï Hardware gather/scatter, DP

transcendental functions supportand more.

Ï Supported by non-Intel compilerslike GCC.

≤ Intel® AVX2Ï Legacy mode operation.Ï Binary compatibility with Xeon.Ï Does not include IMCI (from KNC).

E5-2600(SNB)

E5-2600v3(HSW)

KNL(Xeon Phi)

AVX-512PF

AVX-512ER

AVX-512CD

AVX-512F

BMI

AVX2

AVX

SSE

x87/MMX

TSX

BMI

AVX2

AVX

SSE

x87/MMX

AVX

SSE

x87/MMX

Leg

acy

Sup

port

no TSX

colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016

Page 31: HOW Series: Knights Landing

31

AVX-512 Features

Knights Landing: first AVX-512 processorAVX-512F (Fundamentals)

Ï Extension of most AVX2 instructions to 512-bit vector registers.

AVX-512CD (Conflict Detection)Ï Efficient conflict detection (application: binning).

AVX-512ER (Exponential and Reciprocal)Ï Transcendental function (exp, rcp and rsqrt) support.

AVX-512PF (Prefetch)Ï Prefetch for scatter and gather.

White Paper: http://colfaxresearch.com/avx-512/

colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016

Page 32: HOW Series: Knights Landing

32

Performance Considerations

Even if your code is vectorized, tuning may unlock more performance.

link to paper

Providing enough parallelism.Ï More consecutive vector

operations required toovercome vectorization latency.

Loop pipelining and unrolling.Ï Double the pipeline stages to

populate.

Better vectorization patterns.Ï Avoid long latency operations

with unit-stride and unmaskedoperations.

colfaxresearch.com/knl-webinar Vector Instruction Support © Colfax International, 2013–2016

Page 33: HOW Series: Knights Landing

33

High-Bandwidth Memory

colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016

Page 34: HOW Series: Knights Landing

34

KNL Memory Organization

Direct access to on-package MCDRAM and system DDR4 (socket)

Use MCDRAM as cache, in flat mode, or as hybrid

CORE

CORE

RE

GIS

TE

RS

RE

GIS

TE

RS

L1cache

L1cache

L2cache

MCDRAM(on package

RAM)DDR4RAM

(system memory)

Up to 16 GiBover 400 GB/s

Up to 384 GiB~ 90 GB/s (STREAM)

...more cores

Intel Xeon PhiProcessor

Flat,Cache

orHybrid

colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016

Page 35: HOW Series: Knights Landing

35

MCDRAM Memory ModesFlat Mode

MCDRAM treated as aNUMA node

Users control whatgoes to MCDRAM

CPU

DDR4

AddressableMCDRAM

NUMANode 0

NUMANode 1

Cache ModeMCDRAM treated as aLast Level Cache (LLC)

MCDRAM is usedautomatically

CPU

DDR4

MCDRAMas cache

Hybrid ModeCombination of Flatand Cache

Ratio can be chosen inthe BIOS

AddressableMCDRAM

NUMANode 0

NUMANode 1

DDR4

MCDRAMas cache

CPU

colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016

Page 36: HOW Series: Knights Landing

36

Running Applications in HBM with numactlFinding information about the NUMA nodes in the system.

user@knl% # In Flat mode with All-to-Alluser@knl% numactl -Havailable: 2 nodes (0-1)node 0 cpus: ... all cpus ...node 0 size: 98207 MBnode 0 free: 94798 MBnode 1 cpus:node 1 size: 16384 MBnode 1 free: 15991 MB

Binding the application to MCDRAM (Flat/Hybrid)

user@knl% gcc myapp.c -o runme -mavx512f -O2user@knl% numactl --membind 1 ./runme// ... Application running on MCDRAM ... //

colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016

Page 37: HOW Series: Knights Landing

37

Allocation in HBM with Memkind LibraryManual allocation to MCDRAM possible with hbwmalloc and Memkind Library.

1 #include <hbwmalloc.h>2 const int n = 1<<10;3 // Allocation to MCDRAM4 double* A = (double*) hbw_malloc(sizeof(double)*n);5 // No replacement for _mm_malloc. Use posix_memalign6 double* B;7 int ret = hbw_posix_memalign((void*) B, 64, sizeof(double)*n);8 .....9 // Free with hbw_free

10 hbw_free(A); hbw_free(b);

Fortran Allocations.

1 REAL, ALLOCATABLE :: A(:)2 !DEC$ ATTRIBUTES FASTMEM :: A3 ALLOCATE (A(1:1024))

colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016

Page 38: HOW Series: Knights Landing

38

Compilation with Memkind Library and hbwmallocTo compile C/C++ applications:user@knl% icpc -lmemkind foo.cc -o runmeuser@knl% g++ -lmemkind foo.cc -o runme

To compile Fortran applications:user@knl% ifort -lmemkind foo.f90 -o runmeuser@knl% gfortran -lmemkind foo.f90 -o runme

Open source distribution of Memkind library can be found at:http://memkind.github.io/memkind/

Colfax Publication on using MCDRAM and Memkind Library:http://colfaxresearch.com/knl-mcdram/colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016

Page 39: HOW Series: Knights Landing

39

Flow Chart for Bandwidth-Bound Applications

Does your Application fit in 16 GB?

Can you partition ≤16 GB of BW-critical memory?

No

Yes Yes

No

Start

numactl Memkind Cache modeSimply run the wholeprogram in MCDRAM

No code modificationrequired

Manually allocateBW-critical memory toMCDRAM

Memkind calls need tobe added.

Allow the OS to figureout how to useMCDRAM

No code modificationrequired

colfaxresearch.com/knl-webinar High-Bandwidth Memory © Colfax International, 2013–2016

Page 40: HOW Series: Knights Landing

40

Clustering Modes on KNL

colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016

Page 41: HOW Series: Knights Landing

41

Clustering Modes: All-to-All

No affinity between the distributed Tag Directory (TD) and memory.

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

NU

MA

nod

e 0

MC

DR

AM

MC

DR

AM

MC

DR

AM

MC

DR

AM

colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016

Page 42: HOW Series: Knights Landing

42

Clustering Modes: Quadrant/Hemisphere

Tag Directory (TD) and memory reside in the same quadrant.

tileTD

MC

DR

AM

MC

DR

AM

MC

DR

AM

MC

DR

AM

tileTD

tileTD

tileTD

NU

MA

nod

e 0

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016

Page 43: HOW Series: Knights Landing

43

Clustering Modes: SNC-4/SNC-2

Will appear as 4 NUMA nodes (similar to quad-socket system).

tileTD

MC

DR

AM

MC

DR

AM

MC

DR

AM

MC

DR

AM

tileTD

tileTD

tileTD

NU

MA

nod

e 0

NU

MA

nod

e 1

NU

MA

nod

e 2

NU

MA

nod

e 3

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

tileTD

colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016

Page 44: HOW Series: Knights Landing

44

How to Use Clustering modesThread affinity to match the memory/directory affinity.

Nested OpenMP

1 #pragma omp parallel2 {3 // ...4 #pragma omp parallel5 {6 // ...7 }8 }

OpenMP Threads

Quadrants

KNL

user@knl% OMP_NUM_THREADS=4,72user@knl% OMP_NESTED=1

MPI+OpenMP

1 stat = MPI_Init();2 // ...3 #pragma omp parallel4 {5 // ...6 }7 // ...8 MPI_Finalize();

NUMA node 0

MPI rank 0OpenMP

NUMA node 1

MPI rank 1OpenMP

NUMA node 2

MPI rank 2OpenMP

NUMA node 3

MPI rank 3OpenMP

user@knl% mpirun -host knl \> -np 4 ./myparallel_app

White Paper: http://colfaxresearch.com/knl-numa/colfaxresearch.com/knl-webinar Clustering Modes on KNL © Colfax International, 2013–2016

Page 45: HOW Series: Knights Landing

45

“Get Ready for Intel’s Knights Landing” Guide

colfaxresearch.com/knl-webinar “Get Ready for Intel’s Knights Landing” Guide © Colfax International, 2013–2016

Page 46: HOW Series: Knights Landing

46

Get Ready For Intel® Xeon Phi processors (Codenamed:Knights Landing)

colfaxresearch.com/knl-ready

colfaxresearch.com/knl-webinar “Get Ready for Intel’s Knights Landing” Guide © Colfax International, 2013–2016

Page 47: HOW Series: Knights Landing

47

§6. Hands-On Demonstrations

colfaxresearch.com/knl-webinar Hands-On Demonstrations © Colfax International, 2013–2016

Page 48: HOW Series: Knights Landing

48

Running the STREAM Benchmark

colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016

Page 49: HOW Series: Knights Landing

49

STREAM Benchmark

Industry-standard tool formemory bandwidthmeasurement

4 tests: COPY, ADD,SCALE and TRIAD

Download from Dr. JohnMcCalpin’s site:www.cs.virginia.edu/stream/

Expected for KNL: ≈400 GB/s

colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016

Page 50: HOW Series: Knights Landing

50

STREAM Benchmark on Tuning

Set 1 thread per core on all platforms (-1 for offload)

Set affinity “scatter”: white paper (Colfax Research)

Tune prefetching: white paper (Intel)

In addition, secret sauce for your own STREAM-like application:

Parallel first touch (see NUMA discussion in Session 08)

Essential element – streaming stores: discussion

colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016

Page 51: HOW Series: Knights Landing

51

HBM in Knights Landing

Finding HBM (MCDRAM) in an Intel Xeon Phi processor x200 (KNL):

user@knl% # In Flat mode with All-to-All or Quadrantuser@knl% numactl -Havailable: 2 nodes (0-1)node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... 249 250 251 252 253 254 255node 0 size: 98207 MBnode 0 free: 94798 MBnode 1 cpus:node 1 size: 16384 MBnode 1 free: 15991 MB

Binding the application to HBM (Flat/Hybrid)

user@knl% numactl --membind 1 ./runme// ... Application running in HBM ... //

colfaxresearch.com/knl-webinar Running the STREAM Benchmark © Colfax International, 2013–2016

Page 52: HOW Series: Knights Landing

52

Vectorization with AVX-512

colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016

Page 53: HOW Series: Knights Landing

53

Demo: AVX-512CD and Histograms1 // worker.cc from Colfax lab 4.042 void Histogram(const float* age, int* const hist,3 const int n, const float group_width,4 const int m) {5

6 for (int i = 0; i < n; i++) {7 const int j = (int) ( age[i] / group_width );8 hist[j]++;9 }

10 }

56.8 52.1 21.1 46.9 23.8 16.5 96.3 19.4

20.0 20.0 20.0 20.0 20.0 20.0 20.0 20.0

2.84 2.60 1.05 2.34 1.19 0.82 4.81 0.97

int

2 2 1 2 1 0 4 0

0 1 2 3 4hist:

=

÷

user@knl% cat worker.optrpt....remark ...: vectorization support: scatter was generated for the variable hist:remark ...: vectorization support: gather was generated for the variable hist:remark #15300: LOOP WAS VECTORIZED

colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016

Page 54: HOW Series: Knights Landing

54

Intel Compiler support for AVX-512Intel Compiler versions ≥ 15.0 supports AVX-512 instruction set.user@knl% icc -vicc version 16.0.1 (gcc version 4.8.5 compatibility)user@knl% icc -help// ... truncated output ... //-x<code>

...MIC-AVX512CORE-AVX512COMMON-AVX512

-xMIC-AVX512 : for KNL (supports F, CD, ER, PF)

-xCORE-AVX512 : for future Xeon (supports F, CD, DQ, BW, VL)

-xCOMMON-AVX512 : common to KNL and Xeon (supports F, CD)

colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016

Page 55: HOW Series: Knights Landing

55

Strip-Mining for VectorizationVectorization improves performance on both platforms. However, more work isneeded to take advantage of the MIC architecture. See materials on multi-threading.

Scalar Serial Code Strength Reduction Vectorized Serial Code(Strip-Mining)

0.0

0.2

0.4

0.6

0.8

1.0

Per

form

ance

, bill

ion

valu

es/s

(hig

her i

s bet

ter)

0.373

0.588

0.937

0.017 0.045

0.126 0.105 0.119 0.117

Intel Xeon Processor E5-2697 V2 Intel Xeon Phi Coprocessor 7120P (KNC) Intel Xeon Phi Processor 7210 (KNL)

colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016

Page 56: HOW Series: Knights Landing

56

Using Reduction instead of Synchronization

Vectorized Serial Code(Strip-Mining)

Vectorized Parallel Code(Atomic Operations)

Vectorized Parallel Code(Private Variables)

0

5

10

15

20

Per

form

ance

, bill

ion

valu

es/s

(hig

her i

s bet

ter)

0.937 0.046

11.2

0.126 0.035

13.9

0.117 0.017

16.4

Intel Xeon Processor E5-2697 V2 Intel Xeon Phi Coprocessor 7120P (KNC) Intel Xeon Phi Processor 7210 (KNL)

colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016

Page 57: HOW Series: Knights Landing

57

Memory Traffic Optimization

Vectorized Parallel Code(Private Variables)

Parallel Initialization(First-Touch Allocation)

0

5

10

15

20

25

Per

form

ance

, bill

ion

valu

es/s

(hig

her i

s bet

ter)

11.2

22.1

13.9 13.6

16.4 16.4

Intel Xeon processor E5-2697 V2 Intel Xeon Phi coprocessor 7120P (KNC) Intel Xeon Phi processor 7210 (KNL)

colfaxresearch.com/knl-webinar Vectorization with AVX-512 © Colfax International, 2013–2016

Page 58: HOW Series: Knights Landing

58

Tuning with Intel Math Kernel Library

colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016

Page 59: HOW Series: Knights Landing

59

Batch Processing

Feed multiple signals to MKL

Let the library decide on the parallel strategy.

1 MKL_Complex16* data =2 (MKL_Complex16*) malloc(sizeof(MKL_Complex16)*fft_size*num_fft);3

4 DFTI_DESCRIPTOR_HANDLE handle;5 DftiCreateDescriptor(&handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, (MKL_LONG)fft_size);6 DftiSetValue(handle, DFTI_NUMBER_OF_TRANSFORMS, num_fft);7 DftiSetValue(handle, DFTI_INPUT_DISTANCE, fft_size);8 DftiSetValue(handle, DFTI_OUTPUT_DISTANCE, fft_size);9 DftiSetValue(handle, DFTI_PLACEMENT, DFTI_INPLACE);

10 DftiCommitDescriptor(handle);

colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016

Page 60: HOW Series: Knights Landing

60

Nested Parallelism

... ... ...

Fine-grainedparallelism

Coarse-grainedparallelism

Nestedparallelism

throwing all threadson one MKL function

using one threadper MKL function

putting teams of threadson several MKL functions

colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016

Page 61: HOW Series: Knights Landing

61

Thread Affinity Tuning

Xeon Xeon Phi

MKL uses 1 thread/core MKL uses 4 threads/core

OMP_NUM_THREADS=#phys. cores OMP_NUM_THREADS=4×#cores

KMP_AFFINITY=scatter KMP_AFFINITY=scatter

KMP_AFFINITY=compact (HT off) KMP_AFFINITY=compact

KMP_AFFINITY=compact,1 (HT on)

See also HOW series Session 8.

colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016

Page 62: HOW Series: Knights Landing

62

Thread Affinity: Scatter PatternGenerally beneficial for bandwidth-bound applications.

HT 0 HT 1 HT 0 HT 1

Core 0 Core 1

Socket 0

HT 0 HT 1 HT 0 HT 1

Core 0 Core 1

Socket 1

Threads:

Cores:

0 2 1 3

HT 0 HT 1 HT 0 HT 1

Core 0 Core 1

Socket 0

HT 0 HT 1 HT 0 HT 1

Core 0 Core 1

Socket 1

Threads:

Cores:

0

HT 0 HT 1 HT 0 HT 1

Core 0

Xeon Phi chip

HT 0 HT 1 HT 0 HT 1

Core 1

Threads:

Cores:

0 61 122 183 1 62 123 184

...

MKLon

Xeon:

MKLon

Xeon Phi:

KMP_AFFINITY=scatter

KMP_AFFINITY=scatter

colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016

Page 63: HOW Series: Knights Landing

63

Data Container Tweaks

Parallel first-touch (see also HOW series Session 8):1 #pragma omp parallel for2 for(int i = 0; i < num_fft; i++)3 for(int j = 0; j < fft_size; j++) {4 data[i*fft_size+j].real = cos(k*j);5 data[i*fft_size+j].imag = sin(k*j);6 }

Special allocator (alignment – see also HOW series Session 6 andallocation in HBM on KNL)

1 MKL_Complex16* data =2 (MKL_Complex16*) mkl_malloc(sizeof(MKL_Complex16)*fft_size*num_fft, 64);

colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016

Page 64: HOW Series: Knights Landing

64

Complex-to-Complex 1D FFT Performance

Initial BatchMode

AffinityTuning

ParallelFirst Touch

MKLAllocator

0

50

100

150

200

250

300

350

400

450

Dou

ble

Prec

isio

n G

FLO

P/s

10

56 65

145 156

1

148 155 155

239

5

113 113 113

383

1D Fast Fourier Transform Performance

Intel Xeon processor E5-2697 v3 Intel Xeon Phi coprocessor 7120P Intel Xeon Phi processor 7210

2 ·105 FFTs of size 2048.colfaxresearch.com/knl-webinar Tuning with Intel Math Kernel Library © Colfax International, 2013–2016

Page 65: HOW Series: Knights Landing

65

Machine Learning on Intel Architecture

colfaxresearch.com/knl-webinar Machine Learning on Intel Architecture © Colfax International, 2013–2016

Page 66: HOW Series: Knights Landing

66

Machine Learning: Optimized Middleware

http://colfaxresearch.com/isc16-neuraltalkcolfaxresearch.com/knl-webinar Machine Learning on Intel Architecture © Colfax International, 2013–2016

Page 67: HOW Series: Knights Landing

67

More Information: “HOW” Series

colfaxresearch.com/knl-webinar More Information: “HOW” Series © Colfax International, 2013–2016

Page 68: HOW Series: Knights Landing

68

HOW Series

Need more tutorials? Sign-up at:

colfaxresearch.com/how-seriescolfaxresearch.com/knl-webinar More Information: “HOW” Series © Colfax International, 2013–2016

Page 69: HOW Series: Knights Landing

69

§7. Closing Words

colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016

Page 70: HOW Series: Knights Landing

70

Bottom Line

KNL is a highly-parallel successor of KNC, withimprovements in both performance and ease-of-use.

KEEP CALM

GO PARALLELAND

colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016

Page 71: HOW Series: Knights Landing

71

Get Ready For Intel® Xeon Phi processors (Codenamed:Knights Landing)

colfaxresearch.com/knl-ready

colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016

Page 72: HOW Series: Knights Landing

72

Colfax Research

http://colfaxresearch.com/

colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016

Page 73: HOW Series: Knights Landing

73

Developer Access Program (DAP)

Can’t wait to get your hands on Knights Landing?

Find out more at dap.xeonphi.comor contact us at [email protected].

colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016

Page 74: HOW Series: Knights Landing

74

Thank you for Tuning In!

colfaxresearch.com/knl-webinar Closing Words © Colfax International, 2013–2016

Page 75: HOW Series: Knights Landing

75

§8. Back-up Slides

colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016

Page 76: HOW Series: Knights Landing

76

Loop Collapse: PrincipleIdea: combine iterations spaces of the inner loop and the outer loop.

1 #pragma omp parallel for collapse(2)2 for (int i = 0; i < m; i++)3 for (int j = 0; j < n; j++) {4 // ...5 // ...6 }

1 #pragma omp parallel for2 for (int c = 0; c < m*n; c++) {3 i = c / n;4 j = c % n;5 // ...6 }

colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016

Page 77: HOW Series: Knights Landing

77

Exposing Parallelism: Strip-Mining and Loop Collapse1 void sum_stripmine(const int m, const int n, long* M, long* s){2 const int STRIP=1024;3 assert(n%STRIP==0);4 s[0:m]=0;5 #pragma omp parallel6 {7 long total[m]; total[0:m]=0;8 #pragma omp for collapse(2) schedule(guided)9 for (int i=0; i<m; i++)

10 for (int jj=0; jj<n; jj+=STRIP)11 #pragma vector aligned12 for (int j=jj; j<jj+STRIP; j++)13 total[i]+=M[i*n+j];14 for (int i=0; i<m; i++) // Reduction15 #pragma omp atomic16 s[i]+=total[i];17 } }

colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016

Page 78: HOW Series: Knights Landing

78

Checking for AVX-512 support

Check /proc/cpuinfo for AVX-512 flags

user@knl% cat /proc/cpuinfoflags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lmconstant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tscaperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fmacx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avxf16c rdrand lahf_lm abm 3dnowprefetch arat epb xsaveopt pln pts dthermtpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2erms avx512f rdseed adx avx512pf avx512er avx512cd

Also possible through C/C++ function calls: see this blog.

colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016

Page 79: HOW Series: Knights Landing

79

Using AVX-512: Two Approaches

Automatic Vectorization:Vectorization w/ compiler.

Portable: just recompile.

Tuning with directives.

1 double A[vec_width], B[vec_width];2 // ...3

4

5 // This loop will be auto-vectorized6 for(int i = 0; i < vec_width; i++)7 A[i]+=B[i];

1 double A[vec_width], B[vec_width];2 // ...3 // This is explicitly vectorized4 __m512d A_vec = _mm512_load_pd(A);5 __m512d B_vec = _mm512_load_pd(B);6 A_vec = _mm512_add_pd(A_vec,B_vec);7 _mm512_store_pd(A,A_vec);

Explicit Vectorization:Vectorization w/ intrinsics

Full control over instructions

Limited portability

colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016

Page 80: HOW Series: Knights Landing

80

GCC support for AVX-512GCC ≥ 4.9.1 supports AVX-512 instruction set.user@knl% g++ -vgcc version 4.9.2 (GCC)user@knl% g++ foo.cc -mavx512f -mavx512er -mavx512cd -mavx512pf

Basic automatic vectorization support: add -O2 or -O3.1 // ... foo.cc ... //2 for(int i = 0; i < n; i++)3 B[i] = A[i] + B[i];

user@knl% g++ -s foo.cc -mavx512f -03user@knl% cat foo.s...

vmovapd -16432(%rbp,%rax), %zmm0vaddpd -8240(%rbp,%rax), %zmm0, %zmm0vmovapd %zmm0, -8240(%rbp,%rax)

colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016

Page 81: HOW Series: Knights Landing

81

HOW Series

Interested? Sign-up at:

colfaxresearch.com/how-seriescolfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016

Page 82: HOW Series: Knights Landing

82

TextbookISBN: 978-0-9885234-0-1 (508 pages, Electronic or Print)

Parallel Programmingand Optimization with

Intel® Xeon Phi™

Coprocessors

Handbook on the Development andOptimization of Parallel Applications

for Intel® Xeon® Processorsand Intel® Xeon Phi™ Coprocessors

© Colfax International, 2015

P A R A L L E L P R O G R A M M I N GA N D O P T I M I Z A T I O N W I T H

HANDBOOK ON THE

DEVELOPMENT AND

OPTIMIZATION OF

PARALLEL

APPLICATIONS FOR

INTEL XEON

PROCESSORS

AND INTEL

XEON PHI

COPROCESSORS

INTEL XEON PHICOPROCESSORS

TMR

SECO

ND E

DITI

ON

C O L F A X I N T E R N A T I O N A LA N D R E Y V L A D I M I R O V | R Y O A S A I | V A D I M K A R P U S E N K O

http://xeonphi.com/book

colfaxresearch.com/knl-webinar Back-up Slides © Colfax International, 2013–2016