
Intel® Xeon Phi™ – Architecture, Programming Models, Optimization

Dmitry Prokhorov, Dmitry Ryabtsev – Intel

Agenda

• What and Why
  • Intel Xeon Phi – TOP 500 insights, roadmap, architecture
• How
  • Programming models – positioning and spectrum
• How Fast
  • Optimization and Tools


What and Why: HPC

High-Performance Computing – the use of supercomputers and parallel processing techniques for solving complex computational problems.


What and Why: TOP 500 – "Today's Future" of tomorrow's mainstream HPC


What and Why: TOP 500 Highlights – Performance Projection


What and Why: TOP 500 Highlights – Top 10 list


What and Why: TOP 500 Highlights – Accelerators in Power Efficiency


What and Why: TOP 500 Highlights – Accelerators/Coprocessors



What and Why: Intel Many Integrated Core (MIC) architecture

Larrabee + Teraflops Research Chip + competition with NVidia on accelerators


What and Why: Parallelization and vectorization

[Figure: four execution modes compared – Scalar, Vector, Parallel, Parallel + Vector.]
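To make the scalar / vector / parallel / parallel + vector distinction concrete, here is a minimal sketch (not from the slides; function and array names are illustrative) of the same loop written scalar and then with OpenMP threads plus SIMD lanes:

// parallel_vector.cpp - illustrative sketch; compile e.g. with: icpc -qopenmp -O2
#include <cstddef>
#include <vector>

// Scalar: one element per iteration on one thread.
void saxpy_scalar(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] += a * x[i];
}

// Parallel + vector: threads across cores, SIMD lanes within each core.
void saxpy_parallel_vector(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = x.size();
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}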

What and Why: Xeon vs. Xeon Phi


KNL Mesh Interconnect: All-to-All

Addresses are uniformly hashed across all distributed directories.

Typical read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor


[Die diagram: 2D mesh of tiles with EDC, iMC, IIO/PCIe, OPIO and DDR blocks; callouts 1–4 mark the read-miss path.]

KNL Mesh Interconnect: Quadrant

Chip divided into four quadrants. The directory for an address resides in the same quadrant as the memory location. SW transparent.

Typical read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor


[Die diagram: same KNL mesh as above; callouts 1–4 mark the read-miss path.]

KNL Mesh Interconnect: Sub-NUMA Clustering

Each quadrant (cluster) is exposed as a separate NUMA domain to the OS (see the sketch below). Analogous to a 4-socket Xeon. SW visible.

Typical read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor


[Die diagram: same KNL mesh as above; callouts 1–4 mark the read-miss path.]
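Since sub-NUMA clustering makes each quadrant a separate NUMA node, memory and the threads using it have to be kept together explicitly. A minimal libnuma sketch (illustrative, not part of the original slides; the node number is an assumption, check the actual layout with numactl -H):

// snc_alloc.cpp - illustrative sketch for sub-NUMA clustering; link with -lnuma.
#include <numa.h>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma: NUMA is not available on this system\n");
        return 1;
    }
    std::printf("NUMA nodes visible to the OS: %d\n", numa_max_node() + 1);

    // Allocate 64 MB on node 0 (one quadrant in SNC-4); threads touching this
    // buffer should be pinned to cores of the same quadrant.
    const std::size_t bytes = 64UL << 20;
    void* buf = numa_alloc_onnode(bytes, 0);
    if (buf == nullptr) {
        std::fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    // ... bandwidth-bound work on buf goes here ...

    numa_free(buf, bytes);
    return 0;
}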


• Cori supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE) became the first publicly announced Knights Landing based system, with over 9,300 nodes slated to be deployed in mid-2016.

• "Trinity" supercomputer at NNSA (National Nuclear Security Administration) is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in both late 2015 and 2016.

• Expecting over 50 system providers for the KNL host processor, in addition to many more PCIe*-card based solutions.

• >100 petaFLOPS of committed customer deals to date.

• The DOE* and Argonne* awarded Intel contracts for two systems (Theta and Aurora) as part of the CORAL* program, with a combined value of over $200 million. Intel is teaming with Cray* on both systems. Scheduled for 2016, Theta is the first system, with greater than 8.5 petaFLOP/s and more than 2,500 nodes, featuring the Intel® Xeon Phi™ processor (Knights Landing), the Cray* Aries* interconnect and Cray's* XC* supercomputing platform. Scheduled for 2018, Aurora is the second and largest system, with 180–450 petaFLOP/s and approximately 50,000 nodes, featuring the next-generation Intel® Xeon Phi™ processor (Knights Hill), 2nd-generation Intel® Omni-Path fabric, Cray's* Shasta* platform, and a new memory hierarchy composed of Intel Lustre, Burst Buffer Storage, and persistent memory through high-bandwidth on-package memory.

How: Programming models


How: Positioning works – Adoption for Coprocessors in TOP 500


How: Positioning works – Adoption speed for Coprocessors in TOP 500


How: KNL positioning

Out-of-box performance on throughput workloads is "about the same" as Xeon, with potential for > 2X performance when optimized for vectors, threads and memory bandwidth.

Same programming model, tools, compilers and libraries as Xeon. Boots a standard OS, runs all legacy code.

Massive thread and data parallelism and massive memory bandwidth with good single-thread performance in an ISA-compatible standard CPU form factor.

[Figure: Xeon and KNL share the same programming model, compilers, tools & libraries, and code base.]

How: Programming models on Xeon Phi

• Native (Xeon Phi)
• Offload (Xeon -> Xeon Phi)


How: Programming models on Xeon Phi – native

• Recompilation with -xMIC-AVX512
• Vectorization: increased efficiency, use of new instructions
• MCDRAM and memory tuning: tile, 1 GB pages (see the sketch below)
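For the MCDRAM part, one common approach in flat mode is the memkind hbwmalloc API; a minimal sketch (illustrative, not from the slides; the array size is arbitrary):

// hbw_alloc.cpp - illustrative sketch of placing a hot array into MCDRAM (flat mode);
// link with -lmemkind.
#include <hbwmalloc.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t n = 1UL << 26;   // number of doubles, arbitrary illustrative size
    const bool have_hbw = (hbw_check_available() == 0);

    // Fall back to regular DDR if no high-bandwidth memory is exposed.
    double* a = have_hbw ? static_cast<double*>(hbw_malloc(n * sizeof(double)))
                         : static_cast<double*>(std::malloc(n * sizeof(double)));
    if (a == nullptr) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (std::size_t i = 0; i < n; ++i)
        a[i] = 1.0;                    // bandwidth-bound work would go here

    if (have_hbw) hbw_free(a); else std::free(a);
    return 0;
}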


How: Offload programming model


How: Offload with pragma target in OpenMP 4.0
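A minimal sketch of the pragma target offload this slide illustrated (array names and sizes are illustrative; built with an offload-capable compiler such as icpc -qopenmp):

// omp_target.cpp - illustrative OpenMP 4.0 offload sketch.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // x is copied to the device, y is copied in and back; the loop runs on the
    // target device (the Xeon Phi coprocessor) when one is present.
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += 2.0f * x[i];

    std::printf("y[0] = %f\n", y[0]);
    return 0;
}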


How: Programming models on Xeon Phi – offload

• Applicable mostly to coprocessor cards
  • Cost for data transfers
• Three ways to use:
  • OpenMP 4.0 "target" directives
  • MKL Automatic Offload (see the sketch after this list)
  • Direct calls to the offload APIs (COI) and those built on it (e.g., hStreams)
• Offload over fabric implementation for self-boot
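For the MKL Automatic Offload path, a hedged sketch: AO can also be enabled without code changes via the MKL_MIC_ENABLE=1 environment variable, and the mkl_mic_* calls below are from MKL's AO interface as I understand it, so treat exact names and behavior as version-dependent.

// mkl_ao.cpp - illustrative sketch of MKL Automatic Offload for a large DGEMM;
// compile with the Intel compiler and -mkl.
#include <mkl.h>
#include <cstddef>
#include <vector>

int main() {
    const MKL_INT n = 4096;            // large enough that offload can pay off
    std::vector<double> a(static_cast<std::size_t>(n) * n, 1.0);
    std::vector<double> b(static_cast<std::size_t>(n) * n, 1.0);
    std::vector<double> c(static_cast<std::size_t>(n) * n, 0.0);

    mkl_mic_enable();                  // ask MKL to offload eligible calls automatically

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);

    mkl_mic_disable();                 // back to host-only execution
    return 0;
}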


How Fast: Optimization BKMs

Optimization techniques are the same as for Xeon and help both:

• Loop unrolling to feed vectorization
• Loop reorganization to avoid strides
  • Be careful with "no dependency" pragmas
• Data layout changes for more efficient cache usage
• Moving from pure MPI to hybrid MPI+OpenMP
  • Avoids data replication, intra-node communication and increased MPI buffer sizes
• NUMA awareness for the sub-NUMA clustering mode
  • MPI/thread pinning with parallel data initialization
• Eliminating syncs on barriers where possible (see the sketch below)
  • The more threads, the higher the barrier cost
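As a sketch of the last bullet, the implicit barrier between two independent worksharing loops can be dropped with nowait (loop bodies are illustrative, not from the slides):

// nowait.cpp - illustrative sketch: removing an implicit barrier between two
// independent OpenMP worksharing loops.
#include <cstddef>

void update(float* a, const float* b, float* c, const float* d, std::size_t n) {
    #pragma omp parallel
    {
        // The second loop does not read anything the first loop writes, so the
        // implicit barrier after the first loop can be elided with nowait.
        #pragma omp for nowait
        for (std::size_t i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];

        #pragma omp for
        for (std::size_t i = 0; i < n; ++i)
            c[i] += d[i];
    }   // the single implicit barrier at the end of the parallel region remains
}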


How Fast: Tools: Vector Advisor – explore vectorization

1. Compiler diagnostics + performance data + SIMD efficiency information
2. Guidance: detect the problem and recommend how to fix it
3. "Accurate" trip counts: understand parallelism granularity and overheads
4. Loop-carried dependency analysis
5. Memory access patterns analysis

"Intel® Advisor's Vectorization Advisor fills a gap in code performance analysis. It can guide the informed user to better exploit the vector capabilities of modern processors and coprocessors."
– Dr. Luigi Iapichino, Scientific Computing Expert, Leibniz Supercomputing Centre

How Fast: Tools: VTune Amplifier – explore threading/CPU utilization

• Is the serial time of my application significant enough to prevent scaling?
• How efficient is my parallelization compared to ideal parallel execution?
• How much theoretical gain can I get if I invest in tuning?
• Which regions are the most promising to invest in?
  • Links to the grid view for more details on inefficiency
• Which region is inefficient? Is the potential gain worth it?
• Why is it inefficient? Imbalance? Scheduling? Lock spinning?

Intel® Xeon Phi™ systems supported.


Deep Dive into OpenMP* for Efficiency and Scalability at Region/Barrier Level

See the wall-clock impact of inefficiencies and identify their cause.

[Timeline diagram labels: Actual Elapsed Time, Ideal Time, Fork, Join, Potential Gain, Lock Spinning, Imbalance, Scheduling.]

Memory Profiling with VTune Amplifier XE: Memory Access Analysis

• Memory-related PMU events + tracing of memory allocations
• Metrics by function: CPU Time, Memory Bound, KNL Bandwidth Estimate (NDA)
  • KNL Bandwidth Estimate is per core and should be multiplied by the number of KNL cores
• Metrics by memory object: Loads, Stores, LLC Misses, Remote DRAM and Remote Cache accesses
  • Memory objects are identified by allocation source line and call stack
• Allows identifying the structures on the high-bandwidth path that should be placed into MCDRAM
• Group by 'Function / Memory Object / Allocation Stack' -> sort by the 'KNL Bandwidth Estimate' metric -> expand to see Memory Objects -> sort by Loads

Memory Profiling with VTune Amplifier XE: Memory Access Analysis – Bandwidth

Bandwidth data for DDR and MCDRAM can be analyzed in VTune.


Summary

• Many-core-based architectures play the main role in reaching Exascale and beyond
• Intel Many Integrated Core (MIC) offers competitive performance with well-known HPC programming models
• KNL is a step forward in this direction, with:
  • More cores, faster single-thread performance
  • High-bandwidth memory
  • Self-boot operation with better performance/Watt and no data transfer cost


