
Code Modernization Workshop

Intel Software Technologies for Developers: What’s New?

Ralph de Wargny, Intel Software & Services Group

May 2015

Intel Software & Services Group

Intel Paragon 1993


What are the chances?

What are the chances a code written on this CPU (probably in FORTRAN or C) will work well on these?

Multi-core & SIMD

Many-core & SIMD


Why Code Modernization?

[in HPC] the “hardware first” ethic is changing. Hardware retains the glamour, but there is now the stark realization that the newest parallel supercomputers will not realize their full potential without reengineering the software code to efficiently divide computational problems among the thousands of processors that comprise next-generation many-core computing platforms.

This process is referred to as parallelization, code optimization or code modernization.

As systems move toward exascale levels of performance, the problem of outdated code will only grow in urgency.

From: http://www.scientificcomputing.com/articles/2014/12/hpc-community-experts-weigh-code-modernization


Executing to Moore’s Law

Predictable Silicon Track Record – well and alive at Intel. Enabling new devices with higher performance and functionality while controlling power, cost, and size.

Transforming the Economics of HPC

[Figure: process technology timeline. Labels: Hi-K Metal Gate; 3D Tri-Gate; 2nd Gen Tri-Gate; 14nm; 2013; 10nm (R&D**); 7nm (R&D**).]

**Future options are forecasts and subject to change without notice.


Parallel is the Path Forward

Intel® Xeon® and Intel® Xeon Phi™ Product Families are both going parallel: more cores, more threads, wider vectors.

Processor*                                                        Cores   Threads   SIMD width (bits)
Intel® Xeon® processor 64-bit                                     1       2         128
Intel® Xeon® processor 5100 series                                2       2         128
Intel® Xeon® processor 5500 series                                4       8         128
Intel® Xeon® processor 5600 series                                6       12        128
Intel® Xeon® processor code-named Sandy Bridge EP                 8       16        256
Intel® Xeon® processor code-named Ivy Bridge EP                   12      24        256
Intel® Xeon® processor code-named Haswell EP                      18      36        256
Intel® Xeon Phi™ coprocessor Knights Corner                       61      244       512
Intel® Xeon Phi™ processor & coprocessor Knights Landing (1)      60+     240+      512

*Product specification for launched and shipped products available on ark.intel.com.
(1) Not launched or in planning.


How much potential lies untapped today?

[Chart: Binomial Options SP (Higher is Better). Vertical axis: Options Per Sec, 0 to 160.000. Horizontal axis: Intel® Xeon® processors by year: 2007 X5472 (formerly codenamed Harpertown); 2009 X5570 (formerly codenamed Nehalem); 2010 X5680 (formerly codenamed Westmere); 2012 E5-2600 family (formerly codenamed Sandy Bridge); 2013 E5-2600 v2 family (formerly codenamed Ivy Bridge); 2014 E5-2600 v3 family (formerly codenamed Haswell). Annotation: 179x.]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Chart legend labels: Parallelized Vectorized; Scalar; Single Thread; Single Thread Scalar.

Parallel + Vectorized is much faster than either one alone
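To make the combination of threading and vectorization concrete, here is a minimal sketch of one loop that asks for both at once through the OpenMP 4.0 combined construct `parallel for simd`. The function name `saxpy` and its signature are illustrative assumptions, not code from the benchmark shown on the slide, and it assumes `y` holds at least as many elements as `x`.

```cpp
#include <cstddef>
#include <vector>

// Computes y[i] = a * x[i] + y[i] for every element of x.
// With OpenMP enabled, iterations are divided among threads and each
// thread's chunk may be executed with SIMD instructions. Without OpenMP
// the pragma is ignored and the loop runs serially with the same result.
// Assumption: y.size() >= x.size().
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    const float* xp = x.data();
    float* yp = y.data();
    #pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

How much each level contributes depends on the hardware, the compiler flags and the problem size, so any speedup should be measured rather than assumed.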



Performance Technologies - Parallelism on all Levels

• SIMD: Vectorization
• CORES: Multi-Threading
• NODES: Messaging

Hardware levels shown: Core; Multi-Core (CPU); Many-Core (CPU); Nodes + Fabric (CLUSTER)


Intel® Xeon Phi™ Product Family: Industry and User Momentum

Knights Corner
• 1 TFLOPS (peak F.P.-DP)
• Intel® Xeon Phi™ Coprocessor Applications and Solutions Catalog: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-applications-and-solutions-catalog

Knights Landing
• 3+ TFLOPS (peak F.P.-DP)
• Bootable processor
• On-package high BW memory
• Integrated Omni-Path fabric
• H2’15: first commercial systems
• >50 systems providers expected (1)
• Many more card based systems
• >100 PFLOPS customer system compute commits to date (1)

Knights Hill
• 3rd Generation Intel® Xeon Phi™ Product Family
• 2nd Generation Intel® Omni-Path Architecture
• 10nm Process Technology

(1) Intel internal estimate


Knights Landing: Next Generation Intel® Xeon Phi™

Server Processor
• 60+ cores
• 3+ TFLOPS DP peak, up to 3x single thread
• 2-D core mesh, SMP cache coherency
• Integrated PCIe 3.0
• Intel® Omni-Path Fabric
• Binary compatible with Intel® Xeon® processors

Memory
• High performance on-package memory (in partnership with [partner name not preserved]): up to 16GB, ~5x STREAM performance over DDR4, NUMA support, high bandwidth
• DDR4: high capacity

Cores based on Intel® Atom™ (Silvermont) microarchitecture with HPC enhancements:
• 14nm
• AVX-512: 512-bit SIMD (VPU)
• 4 threads/core
• deep out-of-order buffers
• gather/scatter
• better branch prediction
• higher cache bandwidth
• . . .

Potential future options, subject to change without notice. Codenames. All timeframes, features, products and dates are preliminary forecasts and subject to change without further notification.


Intel HPC Midrange Roadmap

Forecast and Estimations, in Planning & Targets. Timeline: 2015, 2016, FUTURE.

XEON® E5
• Haswell-EP: 22nm, up to 18c, AVX-2, DDR4, PCIe3
• Broadwell-EP: 14nm, AVX-2, DDR4, PCIe3
• Future Xeon-EP: ≤14nm . . .

XEON PHI™
• Knights Corner: 22nm coprocessor, KNI, up to 61c, GDDR5, PCIe card
• Knights Landing: 14nm, AVX-512, MCDRAM, DDR4, PCIe3, Omni-Path 1 on-pckg option
• Knights Hill: 10nm, Omni-Path 2
• Future Knights . . .

FABRIC
• True Scale: 40Gb, 80Gb Dual-Rail, PSM SW-Stack
• Omni-Path Gen1: 100Gb/s, PSM SW-Stack, up to 48-port switches, Silicon Photonics
• Omni-Path Gen2 . . .
• Future Omni-Path . . .

Not drawn to scale, for illustration only. Potential future options, subject to change without notice. Codenames, for illustration only.

All timeframes, features, products and dates are targets and preliminary forecasts and subject to change without further notification.


Portable & Scalable Parallel Programming On a Higher Abstracted Level

Cluster: Process-Parallelism
• Message Passing: MPI, IP-based

Node (Multicore, Many-Core): Thread/Task-Parallelism
• Multi-Threading: OpenMP*, TBB, Cilk™ Plus, OpenCL, pthreads

SIMD (128b, 256b, 512b): Data-Parallelism
• Vectorization: Automatic, Directives/Pragmas, Libraries

Product editions shown: 2015 Professional Edition; 2015 Cluster Edition


(c) 2013 Jim Jeffers and James Reinders.

Single software architecture valid for all Intel hardware targets

Standards-based Intel parallel programming models

Intel® Parallel Studio XE 2015: C/C++/Fortran - OpenMP - MPI


What are you developing software for?

Technical Computing & Performance
• Sciences / HPC, Enterprise apps, Big Data, Servers / Clusters
• Performance through Parallel Processing: Vectorization, Threading, Message Passing

Video
• Encoding / Decoding, Streaming
• Performance through hardware acceleration: MPEG4, etc., HEVC

Responsiveness
• Cross-Platform Multimedia performance: Android, Windows, OS X

Embedded System
• Internet of Things
• Hardware-based embedded programming: BIOS/UEFI/FW, Kernel/OS, Drivers, Embedded Applications

Web Multi-Platform
• Cross Device – Multiple APP stores: Mobile Apps, HTML5 technology, Write-once

Intel® Software Development Products

• Technical Computing & Performance: application performance, scalability & reliability
• Video: Video streaming performance
• Responsiveness: Immersive interactivity for multimedia apps
• Embedded System: Fast, efficient embedded & mobile devices/systems
• Web Multi-Platform: Deploy apps on multiple platforms using one codebase

Intel® Parallel Studio XE 2016: What’s New

Launching Aug 25th 2015


Use One Software Architecture Today. Scale Forward Tomorrow.

• Code: Compiler, Libraries, Parallel Models
• Targets: Multicore (Multicore CPU); Many-core (Intel® MIC Architecture Co-processor); Cluster
• Code Reusability


Faster Code Faster: Intel® Parallel Studio XE 2015

• Simplifies building, debugging and tuning parallel code
• Integrated C++ and Fortran tool suite
• Drops into development environment, e.g., Visual Studio*
• Windows*, Linux* & OS X*

Faster Code

• Performance without compromise through optimizations for current and future processors: Compilers, Libraries
• Profilers simplify tuning parallel code for best performance

Code Faster

• Compilers with high level parallelism features including OpenMP* 4.0
• Parallelism prototyping assistant
• Advanced parallel models and libraries, simple update with relink
• Graphical profilers visualize bottlenecks
• Memory, thread and MPI error checkers help remove errors


Intel® Parallel Studio XE 2016 Suites

Vectorization – Boost Performance By Utilizing Vector Instructions / Units

Intel® Advisor XE - Vectorization Advisor identifies new vectorization opportunities as well as improvements to existing vectorization and highlights them in your code. It makes actionable coding recommendations to boost performance and estimates the speedup.

Scalable MPI Analysis – Fast & Lightweight Analysis for 32K+ Ranks

Intel® Trace Analyzer and Collector adds the MPI Performance Snapshot feature for easy-to-use, scalable MPI statistics collection and analysis of large MPI jobs to identify areas for improvement.

Big Data Analytics – Easily Build IA Optimized Data Analytics Applications

Intel® Data Analytics Acceleration Library (DAAL) will help data scientists speed through big data challenges with optimized IA functions.

Standards – Scaling Development Efforts Forward

Intel® Compilers & performance libraries support the evolution of the industry standards OpenMP, MPI, TBB, Fortran and C++.

Launching Aug 25th, 2015



Intel® Advisor XE - Vectorization Advisor: Data Driven Vectorization Design

Have you:
• Recompiled with AVX2, but seen little benefit?
• Wondered where to start adding vectorization?
• Recoded intrinsics for each new architecture?
• Struggled with cryptic compiler vectorization messages?

Breakthrough for vectorization design:
• What vectorization will pay off the most?
• What is blocking vectorization and why?
• Are my loops vector friendly?
• Will reorganizing data increase performance?
• Is it safe to just use pragma simd?

More Performance, Fewer Machine Dependencies
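The question of whether a SIMD pragma is safe comes down to whether loop iterations depend on each other. The sketch below contrasts one loop where forcing vectorization is safe with one where it is not. It uses the OpenMP 4.0 `#pragma omp simd` directive as a stand-in for the Intel-specific `pragma simd` named on the slide, and the function names are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Safe: each iteration reads and writes only element i, so iterations are
// independent and may be executed together in vector lanes.
void scale(std::vector<double>& v, double factor) {
    double* p = v.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    #pragma omp simd
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        p[i] *= factor;
    }
}

// Not safe to force: iteration i reads the value that iteration i - 1 just
// wrote (a loop-carried dependence). The loop is deliberately left without
// a SIMD pragma, because asserting independence here would be incorrect.
void running_sum(std::vector<double>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```

A pragma of this kind tells the compiler to trust the programmer, which is why a dependency check before adding it is worthwhile.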


Intel® Advisor XE – Vectorization Advisor: Provides the data you need for high impact vectorization

• Compiler diagnostics + Performance Data = All the data you need in one place
  - Find “hot” un-vectorized or “under vectorized” loops.
  - Trip counts
• Recommendations – How do I fix it?
• Correctness via dependency analysis
  - Is it safe to vectorize?
• Memory Access Patterns analysis
  - Unit stride vs Non-unit stride access, Unaligned memory access, etc.
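As an illustration of the unit-stride versus non-unit-stride distinction, the following sketch sums the same field from two data layouts: an array of structures, where consecutive values of one field are separated by the other fields, and a structure of arrays, where they are adjacent. The types `ParticleAoS` and `ParticlesSoA` and both functions are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <vector>

// Array of structures: the x values of neighbouring particles are
// sizeof(ParticleAoS) bytes apart, so a loop over x has non-unit stride.
struct ParticleAoS {
    float x, y, z, mass;
};

// Structure of arrays: all x values are stored contiguously, so a loop
// over x has unit stride.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// Non-unit stride access: only one float in every four is used.
float sum_x_aos(const std::vector<ParticleAoS>& p) {
    float s = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i) {
        s += p[i].x;
    }
    return s;
}

// Unit stride access: every loaded float is used.
float sum_x_soa(const ParticlesSoA& p) {
    float s = 0.0f;
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        s += p.x[i];
    }
    return s;
}
```

Both functions return the same value; the difference lies in how well vector loads and cache lines are used, which is the kind of question a memory access pattern analysis is meant to answer.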


Data Driven Threading Design: Intel® Advisor XE – Thread Prototyping

Have you:
• Tried threading an app, but seen little performance benefit?
• Hit a “scalability barrier”? Performance gains level off as you add cores?
• Delayed a release that adds threading because of synchronization errors?

Breakthrough for threading design:
• Quickly prototype multiple options
• Project scaling on larger systems
• Find synchronization errors before implementing threading
• Separate design and implementation - Design without disrupting development
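One common way to reason about why gains level off as cores are added is Amdahl's law. The slides do not name it, so the sketch below is offered only as a standard back-of-the-envelope model for projecting scaling, not as the method the tool uses. The function name and parameters are illustrative.

```cpp
// Projected speedup on `cores` cores when a fraction `parallel_fraction`
// (between 0 and 1) of the serial run time can be parallelized perfectly
// and the remainder stays serial (Amdahl's law).
double amdahl_speedup(double parallel_fraction, int cores) {
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For example, with 90% of the run time parallelized, the projected speedup stays below 10x no matter how many cores are used, because the serial 10% dominates.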

Add Parallelism with Less Effort, Less Risk and More Impact

http://intel.ly/advisor-xe

Part of Intel® Parallel Studio

For Windows* and Linux*, from $1,599

“Intel® Advisor XE has allowed us to quickly prototype ideas for parallelism, saving developer time and effort”

Simon Hammond, Senior Technical Staff, Sandia National Laboratories


Compiler diagnostics + Performance Data: Find “hot” un-vectorized or “under vectorized” loops

All of the information you require to vectorize available on one screen!

Gives estimated expected gain!

Gain estimates – Gives recommendations and the gain you can expect by using a different vector instruction or rewriting the control flow of your program.


Convince the compiler to vectorize: Unvectorized loops / “under vectorized” loops

• Assumed dependencies
• Control structures preventing vectorization.
• Rewrite loops to vectorize – remove conditions, breaks and returns and many other techniques.
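The loop-rewriting advice above can be sketched with a small search function. The first version returns from inside the loop, so its trip count is unknown in advance. The second removes the early exit and turns the test into a reduction. Both function names are hypothetical, and whether a given compiler vectorizes either form is an assumption to check against its vectorization report.

```cpp
#include <cstddef>
#include <vector>

// Early exit: the loop may stop at any iteration, which commonly prevents
// automatic vectorization.
bool any_above_early_exit(const std::vector<float>& v, float limit) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] > limit) {
            return true;
        }
    }
    return false;
}

// Rewritten: fixed trip count and no break or return in the loop body.
// Each comparison yields 0 or 1, combined with a bitwise-OR reduction.
bool any_above_no_exit(const std::vector<float>& v, float limit) {
    const float* p = v.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    int found = 0;
    #pragma omp simd reduction(|:found)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        found |= (p[i] > limit) ? 1 : 0;
    }
    return found != 0;
}
```

The rewritten version always scans the whole array, so it trades the chance of an early exit for a vector-friendly loop shape; which one is faster depends on the data.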


Summary: Vector Advisor, 4 Analysis Features for Efficient Vectorization

1. Compiler diagnostics with Performance Data
2. Recommendations on how to improve vectorization
3. Correctness Dependency Analysis
4. Memory Access Patterns Analysis


Intel® Advisor XE is part of Intel® Parallel Studio XE

Intel® Parallel Studio XE 2015 editions (x = included):

Component                                    Composer   Professional   Cluster
Intel® C++ Compiler                          x          x              x
Intel® Fortran Compiler                      x          x              x
Intel® Threading Building Blocks             x          x              x
Intel® Integrated Performance Primitives     x          x              x
Intel® Math Kernel Library                   x          x              x
Intel® Cilk™ Plus                            x          x              x
Intel® OpenMP*                               x          x              x
Intel® Advisor XE                                       x              x
Intel® Inspector XE                                     x              x
Intel® VTune™ Amplifier XE                              x              x
Intel® MPI Library                                                     x
Intel® Trace Analyzer and Collector                                    x

For more information: http://intel.ly/perf-tools


MPI Performance Snapshot

• Lightweight: low overhead profiling for 32K+ ranks
• Scalability: performance variation at scale can be detected sooner
• Identifying key metrics: shows PAPI counters and MPI/OpenMP imbalances


Intel® Cluster Checker 3.0 – What’s New

[Diagram: Data Collectors; Cluster Database; Expert System (Diagnostic Data Analysis, Checking for Issues, Suggesting Remedies); Results]

Provides Assistance
• Cluster health checks (on-demand, background)
• Diagnoses and remedies for common issues
• Compliance with Intel® Cluster Ready spec

Simplifies Cluster Computing Platforms
• Reduces need for specialized expertise
• Enables cluster health checks by applications
• Extensible and customizable, API


What is Intel DAAL?

New library targeting data analytics market

Customers: analytics solution providers, system integrators, and application developers (FSI, Telco, Retail, Grid, etc.)

Key benefits: improved time-to-value, forward-scaling performance and parallelism on IA, advanced analytics building blocks

Key features
• Building blocks highly optimized for IA to support all data analysis stages.
• Support batch, streaming, and distributed processing with easy connectors to popular platforms (Hadoop, Spark) and tools (R, Python, Matlab).
• Flexible interfaces for handling different data sources (CSV, MySQL, HDFS, RDD (Spark)).
• Rich set of operations to handle sparse and noisy data.
• C++ and Java APIs.

6 releases of Tech Preview in 2014.

First Beta in Feb’15. First gold release in Aug’15.
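To show what streaming processing means for a building block such as statistical moments, here is a minimal plain C++ sketch that updates a mean and variance one observation at a time using Welford's method. It does not use or imitate the Intel DAAL API; the class `RunningMoments` and its members are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>

// Running mean and sample variance updated one observation at a time
// (Welford's method), so the full data set never has to be in memory.
class RunningMoments {
public:
    // Folds one new observation into the running statistics.
    void add(double x) {
        ++count_;
        const double delta = x - mean_;
        mean_ += delta / static_cast<double>(count_);
        m2_ += delta * (x - mean_);
    }

    std::size_t count() const { return count_; }

    double mean() const { return mean_; }

    // Sample variance; defined here as 0 for fewer than two observations.
    double variance() const {
        return count_ > 1 ? m2_ / static_cast<double>(count_ - 1) : 0.0;
    }

private:
    std::size_t count_ = 0;
    double mean_ = 0.0;
    double m2_ = 0.0;  // Sum of squared deviations from the current mean.
};
```

In batch mode the same statistics would be computed in one pass over a complete table; the streaming form lets results be refreshed as each block of data arrives.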

Analysis
• PCA
• Variance-Covariance Matrix
• Distances
• Matrix decompositions (SVD, QR, Cholesky)
• EM for GMM
• Uni-/multi-variate outlier detection
• Statistical moments

Machine learning
• Linear regression
• Apriori
• K-Means clustering
• Naïve Bayes
• LogitBoost, BrownBoost, AdaBoost
• SVM

Intel® Data Analytics Acceleration Library – a C++ and Java API library of optimized analytics building blocks for all data analysis stages, from data acquisition to data mining and machine learning. Essential for engineering high performance Big Data applications.

Important features offered in the initial Beta
• Data layouts: AOS, SOA, homogeneous, CSR
• Data sources: CSV, MySQL, HDFS/RDD
• Compression/decompression: ZLIB, LZO, RLE, BZIP2
• Serialization/deserialization

Data Processing

Optimized analytics building blocks for all data analysis stages, from data acquisition to data mining and machine learning.

Data Modeling

Data structures for model representation, and operations to derive model-based predictions and conclusions.

Data Management

Interfaces for data representation and access. Connectors to a variety of data sources and data formats, such as HDFS, SQL, CSV, ARFF, and user-defined data source/format.

Components: Data Sources; Numeric Tables; Outlier Detection; Compression / Decompression; Serialization / Deserialization


Data Analytics in the Age of Big Data

Problem: Big data needs high performance computing. Many big data applications leave performance on the table because they are not optimized for the underlying hardware.

Solution: A performance library provides building blocks that can be easily integrated into big data analytics workflows.

Volume, Velocity, Variety, Value


Intel® Data Analytics Acceleration Library

An industry leading end-to-end IA-based data analytics acceleration library of fundamental algorithms covering all data analysis stages.

Stages: Pre-processing, Transformation, Analysis, Modeling, Validation, Decision Making

Algorithms:
• (De-)Compression, Outlier detection
• PCA, Statistical moments, Var-Covar matrix, Matrix decompositions, Apriori, K-Means Clustering, EM for GMM
• Linear regression, Decision trees, Naïve Bayes, Multi-Class SVM, Boosting

Domains: Scientific/Engineering, Web/Social, Business


Who Should use Intel DAAL?

• Software developers who need optimized implementations of fundamental numerical algorithms in their analytics applications, but do not have the resources or expertise to do the optimizations themselves.

• Data scientists who build and execute math models for domain-specific knowledge discovery, and need to speed up the performance-critical parts of their models.

• Data analytics ISVs who want to gain competitive advantages by making their solutions run faster on Intel architectures.

• Big Data system integrators who want to beef up their product portfolio by providing performance-enhanced alternatives to popular open-source analytics tools.


What Are We Releasing?

Intel DAAL 2016 Beta

Available to selected partners in Feb 2015.

Public beta starting in April 2015.

Intel DAAL 2016 product release

Available in Q3 2015.

• Support IA-32 and Intel64 architectures.

• C++, Java APIs.

• Static and dynamic linking.

• A standalone library, and also bundled in Intel PSXE Cluster Edition 2016.

Note: Bundled version is not available on OS X*.


Intel® C/C++ and Fortran Compilers 16.0: What’s New

• More of C++ 2014, generic lambdas, member initializers and aggregates

• More of C11, _Static_assert, _Generic, _Noreturn, and more

• OpenMP 4.0 C++ User Defined Reductions, Fortran Array Reductions

• OpenMP 4.1 asynchronous offloading, simdlen, simd ordered

• F2008 Submodules, Impure Elemental Functions

• F2015 TYPE(*), DIMENSION(..), RANK intrinsic, attributes for args with BIND

• Significant improvement in alignment analysis, vectorization robustness

• Much improved Neighboring Gather optimization
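To illustrate the OpenMP 4.0 user-defined reductions mentioned in the list above, here is a minimal sketch that reduces over a small struct holding a minimum and a maximum. The type `MinMax` and the functions `merge` and `min_max` are hypothetical names for this example; `declare reduction`, `omp_in`, `omp_out`, `omp_priv` and `omp_orig` are the standard OpenMP identifiers.

```cpp
#include <cstddef>
#include <vector>

struct MinMax {
    double lo;
    double hi;
};

// Combines two partial results into one.
inline MinMax merge(const MinMax& a, const MinMax& b) {
    return MinMax{a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi};
}

// User-defined reduction: states how two partial MinMax values are combined
// and how each thread's private copy is initialized (from the original
// variable, which is safe here because merging is idempotent).
#pragma omp declare reduction(minmax : MinMax : omp_out = merge(omp_out, omp_in)) \
    initializer(omp_priv = omp_orig)

// Returns the smallest and largest element of v.
// Precondition: v is non-empty.
MinMax min_max(const std::vector<double>& v) {
    MinMax r{v[0], v[0]};
    const double* p = v.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    #pragma omp parallel for reduction(minmax : r)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        r = merge(r, MinMax{p[i], p[i]});
    }
    return r;
}
```

Without OpenMP enabled the pragmas are ignored and the loop runs serially with the same result, so the sketch can be checked on any C++11 compiler.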


Intel® MKL 11.3: What’s New

• Additional Sparse Matrix Vector Multiplication API: new two-stage API for Sparse BLAS level 2 and 3 routines.
• MKL MPI wrappers: all MPI implementations are API-compatible, but they are not ABI-compatible. The MKL MPI wrapper solves this problem by providing an MPI-independent ABI to MKL.
• Support for small matrix multiplication: a single call executes independent ?GEMM operations simultaneously.
• Support for Philox4x32-10 and ARS5 RNG: two new pseudorandom number generators with a period of 2^128, highly optimized for multithreaded environments.
• Sparse Solver SMP improvements: significantly improved overall scalability for Intel Xeon Phi coprocessors and scalability of the solving step for Intel Xeon processors.


Intel® VTune™ Amplifier XE 2016 Beta: What’s New

• New OS and IDE support: Visual Studio* 2015 & Windows* 10 Threshold
• GPU profiling
  - GPU Architecture Annotation Diagram
  - GPU profiling on Linux (OpenCL, Media SDK)
• Microarchitecture tuning
  - General Exploration analysis with confidence indication
  - Driverless ‘perf’ EBS with stacks


Intel® VTune™ Amplifier XE 2016 Beta: Improved Hybrid Support

• Intel OpenMP analysis enhancements
  - Precise trace-based imbalance calculation that is especially useful for profiling of small region instances
  - Classification and issue highlighting of potential gains, e.g., imbalance, lock contention, creation overhead, etc.
  - Detailed analysis of barrier-to-barrier region segments
• MPI+OpenMP: multi-rank analysis on a compute node
  - Per-rank OpenMP potential gain and serial time metrics
  - Per-rank Intel MPI communication busy wait time detection




Recommended books

• High performance parallelism pearls: multi-core and many-core approaches, by James Reinders and Jim Jeffers, Morgan Kaufmann, 2014

• Introduction to high-performance scientific computing (2nd edition), by Victor Eijkhout, Lulu, 2015

• Introduction to high performance computing for scientists and engineers, by Georg Hager and Gerhard Wellein, CRC Press, 2011

• Parallel programming with Intel® Parallel Studio XE, by Stephen Blair-Chappell and Andrew Stokes, Wrox Press, 2012

Thank you!
