
Intel® Xeon Phi™ – Architecture, Programming Models, Optimization

Dmitry Prokhorov, Dmitry Ryabtsev – Intel

Agenda

• What and Why
  • Intel Xeon Phi – TOP 500 insights, roadmap, architecture
• How
  • Programming models – positioning and spectrum
• How Fast
  • Optimization and Tools


What and Why: HPC

High-Performance Computing – the use of supercomputers and parallel processing techniques for solving complex computational problems.


What and Why: TOP 500 – "Today's Future" of tomorrow's mainstream HPC


What and Why: TOP 500 Highlights – Performance Projection


What and Why: TOP 500 Highlights – Top 10 list


What and Why: TOP 500 Highlights – Accelerators in Power Efficiency


What and Why: TOP 500 Highlights – Accelerators/Coprocessors



What and Why: Intel Many Integrated Core (MIC) architecture

Larrabee + Teraflops Research Chip + competition with NVidia on accelerators


What and Why: Parallelization and vectorization

[Figure: four execution modes compared – Scalar, Vector, Parallel, Parallel + Vector.]
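To make the scalar / vector / parallel / parallel + vector distinction concrete, here is a minimal sketch (not from the slides; function and array names are illustrative) of the same loop written scalar and then with OpenMP threads plus SIMD lanes:

// parallel_vector.cpp - illustrative sketch; compile e.g. with: icpc -qopenmp -O2
#include <cstddef>
#include <vector>

// Scalar: one element per iteration on one thread.
void saxpy_scalar(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] += a * x[i];
}

// Parallel + vector: threads across cores, SIMD lanes within each core.
void saxpy_parallel_vector(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = x.size();
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}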

What and Why: Xeon vs. Xeon Phi


KNL Mesh Interconnect: All-to-All

Addresses are uniformly hashed across all distributed directories.

Typical read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor


[Die diagram: 2D mesh of tiles with EDC, iMC, IIO/PCIe, OPIO and DDR blocks; callouts 1–4 mark the read-miss path.]

KNL Mesh Interconnect: Quadrant

Chip divided into four quadrants. The directory for an address resides in the same quadrant as the memory location. SW transparent.

Typical read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor


[Die diagram: same KNL mesh as above; callouts 1–4 mark the read-miss path.]

KNL Mesh Interconnect: Sub-NUMA Clustering

Each quadrant (cluster) is exposed as a separate NUMA domain to the OS (see the sketch below). Analogous to a 4-socket Xeon. SW visible.

Typical read L2 miss:
1. L2 miss encountered
2. Send request to the distributed directory
3. Miss in the directory; forward to memory
4. Memory sends the data to the requestor


[Die diagram: same KNL mesh as above; callouts 1–4 mark the read-miss path.]
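Since sub-NUMA clustering makes each quadrant a separate NUMA node, memory and the threads using it have to be kept together explicitly. A minimal libnuma sketch (illustrative, not part of the original slides; the node number is an assumption, check the actual layout with numactl -H):

// snc_alloc.cpp - illustrative sketch for sub-NUMA clustering; link with -lnuma.
#include <numa.h>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma: NUMA is not available on this system\n");
        return 1;
    }
    std::printf("NUMA nodes visible to the OS: %d\n", numa_max_node() + 1);

    // Allocate 64 MB on node 0 (one quadrant in SNC-4); threads touching this
    // buffer should be pinned to cores of the same quadrant.
    const std::size_t bytes = 64UL << 20;
    void* buf = numa_alloc_onnode(bytes, 0);
    if (buf == nullptr) {
        std::fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    // ... bandwidth-bound work on buf goes here ...

    numa_free(buf, bytes);
    return 0;
}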


• Cori supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE) became the first publicly announced Knights Landing based system, with over 9,300 nodes slated to be deployed in mid-2016.

• "Trinity" supercomputer at NNSA (National Nuclear Security Administration) is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in both late 2015 and 2016.

• Expecting over 50 system providers for the KNL host processor, in addition to many more PCIe*-card based solutions.

• >100 petaFLOPS of committed customer deals to date.

• The DOE* and Argonne* awarded Intel contracts for two systems (Theta and Aurora) as part of the CORAL* program, with a combined value of over $200 million. Intel is teaming with Cray* on both systems. Scheduled for 2016, Theta is the first system, with greater than 8.5 petaFLOP/s and more than 2,500 nodes, featuring the Intel® Xeon Phi™ processor (Knights Landing), the Cray* Aries* interconnect and Cray's* XC* supercomputing platform. Scheduled for 2018, Aurora is the second and largest system, with 180–450 petaFLOP/s and approximately 50,000 nodes, featuring the next-generation Intel® Xeon Phi™ processor (Knights Hill), 2nd-generation Intel® Omni-Path fabric, Cray's* Shasta* platform, and a new memory hierarchy composed of Intel Lustre, Burst Buffer Storage, and persistent memory through high-bandwidth on-package memory.

How: Programming models


How: Positioning works – Adoption for Coprocessors in TOP 500


How: Positioning works – Adoption speed for Coprocessors in TOP 500


How: KNL positioning

Out-of-box performance on throughput workloads is "about the same" as Xeon, with potential for > 2X performance when optimized for vectors, threads and memory bandwidth.

Same programming model, tools, compilers and libraries as Xeon. Boots a standard OS, runs all legacy code.

Massive thread and data parallelism and massive memory bandwidth with good single-thread performance in an ISA-compatible standard CPU form factor.

[Figure: Xeon and KNL share the same programming model, compilers, tools & libraries, and code base.]

How: Programming models on Xeon Phi

• Native (Xeon Phi)
• Offload (Xeon -> Xeon Phi)


How: Programming models on Xeon Phi – native

• Recompilation with -xMIC-AVX512
• Vectorization: increased efficiency, use of new instructions
• MCDRAM and memory tuning: tile, 1 GB pages (see the sketch below)
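For the MCDRAM part, one common approach in flat mode is the memkind hbwmalloc API; a minimal sketch (illustrative, not from the slides; the array size is arbitrary):

// hbw_alloc.cpp - illustrative sketch of placing a hot array into MCDRAM (flat mode);
// link with -lmemkind.
#include <hbwmalloc.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t n = 1UL << 26;   // number of doubles, arbitrary illustrative size
    const bool have_hbw = (hbw_check_available() == 0);

    // Fall back to regular DDR if no high-bandwidth memory is exposed.
    double* a = have_hbw ? static_cast<double*>(hbw_malloc(n * sizeof(double)))
                         : static_cast<double*>(std::malloc(n * sizeof(double)));
    if (a == nullptr) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (std::size_t i = 0; i < n; ++i)
        a[i] = 1.0;                    // bandwidth-bound work would go here

    if (have_hbw) hbw_free(a); else std::free(a);
    return 0;
}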


How: Offload programming model


How: Offload with pragma target in OpenMP 4.0
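A minimal sketch of the pragma target offload this slide illustrated (array names and sizes are illustrative; built with an offload-capable compiler such as icpc -qopenmp):

// omp_target.cpp - illustrative OpenMP 4.0 offload sketch.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // x is copied to the device, y is copied in and back; the loop runs on the
    // target device (the Xeon Phi coprocessor) when one is present.
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += 2.0f * x[i];

    std::printf("y[0] = %f\n", y[0]);
    return 0;
}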


How: Programming models on Xeon Phi – offload

• Applicable mostly to coprocessor cards
  • Cost for data transfers
• Three ways to use:
  • OpenMP 4.0 "target" directives
  • MKL Automatic Offload (see the sketch after this list)
  • Direct calls to the offload APIs (COI) and those built on it (e.g., hStreams)
• Offload over fabric implementation for self-boot
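For the MKL Automatic Offload path, a hedged sketch: AO can also be enabled without code changes via the MKL_MIC_ENABLE=1 environment variable, and the mkl_mic_* calls below are from MKL's AO interface as I understand it, so treat exact names and behavior as version-dependent.

// mkl_ao.cpp - illustrative sketch of MKL Automatic Offload for a large DGEMM;
// compile with the Intel compiler and -mkl.
#include <mkl.h>
#include <cstddef>
#include <vector>

int main() {
    const MKL_INT n = 4096;            // large enough that offload can pay off
    std::vector<double> a(static_cast<std::size_t>(n) * n, 1.0);
    std::vector<double> b(static_cast<std::size_t>(n) * n, 1.0);
    std::vector<double> c(static_cast<std::size_t>(n) * n, 0.0);

    mkl_mic_enable();                  // ask MKL to offload eligible calls automatically

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);

    mkl_mic_disable();                 // back to host-only execution
    return 0;
}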


How Fast: Optimization BKMs

Optimization techniques are the same as for Xeon and help both:

• Loop unrolling to feed vectorization
• Loop reorganization to avoid strides
  • Be careful with "no dependency" pragmas
• Data layout changes for more efficient cache usage
• Moving from pure MPI to hybrid MPI+OpenMP
  • Avoids data replication, intra-node communication and increased MPI buffer sizes
• NUMA awareness for the sub-NUMA clustering mode
  • MPI/thread pinning with parallel data initialization
• Eliminating syncs on barriers where possible (see the sketch below)
  • The more threads, the higher the barrier cost
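As a sketch of the last bullet, the implicit barrier between two independent worksharing loops can be dropped with nowait (loop bodies are illustrative, not from the slides):

// nowait.cpp - illustrative sketch: removing an implicit barrier between two
// independent OpenMP worksharing loops.
#include <cstddef>

void update(float* a, const float* b, float* c, const float* d, std::size_t n) {
    #pragma omp parallel
    {
        // The second loop does not read anything the first loop writes, so the
        // implicit barrier after the first loop can be elided with nowait.
        #pragma omp for nowait
        for (std::size_t i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];

        #pragma omp for
        for (std::size_t i = 0; i < n; ++i)
            c[i] += d[i];
    }   // the single implicit barrier at the end of the parallel region remains
}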


How Fast: Tools: Vector Advisor – explore vectorization

1. Compiler diagnostics + performance data + SIMD efficiency information
2. Guidance: detect the problem and recommend how to fix it
3. "Accurate" trip counts: understand parallelism granularity and overheads
4. Loop-carried dependency analysis
5. Memory access patterns analysis

"Intel® Advisor's Vectorization Advisor fills a gap in code performance analysis. It can guide the informed user to better exploit the vector capabilities of modern processors and coprocessors."
– Dr. Luigi Iapichino, Scientific Computing Expert, Leibniz Supercomputing Centre

How Fast: Tools: VTune Amplifier – explore threading/CPU utilization

• Is the serial time of my application significant enough to prevent scaling?
• How efficient is my parallelization compared to ideal parallel execution?
• How much theoretical gain can I get if I invest in tuning?
• Which regions are the most promising to invest in?
  • Links to the grid view for more details on inefficiency
• Which region is inefficient? Is the potential gain worth it?
• Why is it inefficient? Imbalance? Scheduling? Lock spinning?

Intel® Xeon Phi™ systems supported.


Deep Dive into OpenMP* for Efficiency and Scalability at Region/Barrier Level

See the wall-clock impact of inefficiencies and identify their cause.

[Timeline diagram labels: Actual Elapsed Time, Ideal Time, Fork, Join, Potential Gain, Lock Spinning, Imbalance, Scheduling.]

Memory Profiling with VTune Amplifier XE: Memory Access Analysis

• Memory-related PMU events + tracing of memory allocations
• Metrics by function: CPU Time, Memory Bound, KNL Bandwidth Estimate (NDA)
  • KNL Bandwidth Estimate is per core and should be multiplied by the number of KNL cores
• Metrics by memory object: Loads, Stores, LLC Misses, Remote DRAM and Remote Cache accesses
  • Memory objects are identified by allocation source line and call stack
• Allows identifying the structures on the high-bandwidth path that should be placed into MCDRAM
• Group by 'Function / Memory Object / Allocation Stack' -> sort by the 'KNL Bandwidth Estimate' metric -> expand to see Memory Objects -> sort by Loads

Memory Profiling with VTune Amplifier XE: Memory Access Analysis – Bandwidth

Bandwidth data for DDR and MCDRAM can be analyzed in VTune.


Summary

• Many-core-based architectures play the main role in reaching Exascale and beyond
• Intel Many Integrated Core (MIC) offers competitive performance with well-known HPC programming models
• KNL is a step forward in this direction, with:
  • More cores, faster single-thread performance
  • High-bandwidth memory
  • Self-boot operation with better performance/Watt and no data transfer cost


