Parallel Programming Models
Andrey Bokhanko, Intel
More and More Transistors!
Parallelism in hardware has only “just” started. Much more is to come, for many years ahead.
Do NOT make programming harder
One version of each algorithm, one source code, feeds all forms of parallelism (cores, co-processors, SIMD).
Different programmers want different levels of control over how their program executes.
Program in “tasks”, not “threads”
[Diagram: without an abstraction layer, the programmer manages threads and maps them onto the hardware directly.]
Program in “tasks”, not “threads”
[Diagram: an abstraction layer maps the programmer’s tasks onto threads, and the threads onto the hardware.]
Parallelization vs. Vectorization
• Vectorization is DLP (Data-Level Parallelism): it uses wide registers (e.g., SSE) to apply one instruction to multiple data elements.
• Parallelization is TLP (Thread-Level Parallelism): it uses processor cores (and hardware threads) to run parts of a serial code concurrently.
• A possible pitfall of parallelization: concurrent access by multiple threads to the same memory location (a data race!).
Combine the POWER of both!
A Family of Parallel Programming Models: Developer Choice

Intel® Threading Building Blocks
Widely used C++ template library for parallelism
Open sourced; also an Intel product

Domain-Specific Libraries
Intel® Integrated Performance Primitives
Intel® Math Kernel Library

Established Standards
Message Passing Interface (MPI)
OpenMP*
Coarray Fortran
OpenCL*

Research / Development
Intel® Concurrent Collections
Offload Extensions
Intel® Array Building Blocks
Intel® SPMD Parallel Compiler

A choice of high-performance parallel programming models, with libraries for pre-optimized and parallelized functionality:
• Intel® Cilk™ Plus and Intel® Threading Building Blocks support composable parallelization of a wide variety of applications.
• OpenCL* addresses the needs of customers in specific segments and gives developers an additional choice to maximize application performance.
• MPI supports distributed computation and combines with other models on the nodes.
Intel® Cilk™ Plus
C/C++ language extensions to simplify parallelism
Open sourced; also an Intel product
Intel® Cilk™ Plus
Keywords (task parallelism): a set of keywords for expressing task parallelism:
  cilk_spawn, cilk_sync, cilk_for
Reducers: reliable access to nonlocal variables without races:
  cilk::reducer_opadd<int> sum;
CEAN (C/C++ Extensions for Array Notation; data parallelism): operations on sections of arrays or whole arrays:
  a[:] = b[:] * c[:];
Elemental functions: define actions that can be applied to whole or parts of arrays, or to scalars:
  __declspec (vector)
#pragma simd: extended vector parallelism using SIMD hardware registers
Execution parameters: runtime system APIs, environment variables, pragmas

cilkplus.org
Cilk Plus - Serial Semantics
• A deterministic Cilk Plus program will have the same semantics as its serialization.
• Easier regression testing
• Easier to debug:
  • Run with one core
  • Run serialized
• Strong analysis tools (Cilk Plus-specific versions will be posted on WhatIf):
  • Race detector
  • Parallelism analyzer
Cilk Plus Keywords
• Cilk Plus adds three keywords to C and C++: _Cilk_spawn, _Cilk_sync, _Cilk_for
• If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.
• The Cilk Plus runtime controls thread creation and scheduling. A thread pool is created prior to the first use of a Cilk Plus keyword.
• The number of threads matches the number of cores by default, but can be controlled by the user.
Cilk Plus – Sample: cilk_spawn

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);  // Execution can continue while
                   // fib(n-1) is running.
    cilk_sync;     // The asynchronous call must
                   // complete before using x.
    return x + y;
}
Cilk Plus – Sample: Hyperobject
int result = 0;
for (std::size_t i = 0; i < n; ++i)
{
result += compute(myArray[i]);
}
std::cout << "The result is: " << result << std::endl;
Cilk Plus – Sample: Hyperobject
int result = 0;
cilk_for (std::size_t i = 0; i < n; ++i)
{
result += compute(myArray[i]); // data race!!!
}
std::cout << "The result is: " << result << std::endl;
Cilk Plus – Sample: Hyperobject

cilk::reducer_opadd<int> result;
cilk_for (std::size_t i = 0; i < ARRAY_SIZE; ++i)
{
    result += compute(myArray[i]); // reducer hyperobject avoids the race
}
std::cout << "The result is: "
          << result.get_value() // value extracted
          << std::endl;
Cilk Plus - Array Notations
• Array Notations provide a syntax to specify sections of arrays on which to perform operations
• Syntax: [<lower bound> : <length> : <stride>]
• Simple example:
  • a[0:N] = b[0:N] * c[0:N];
  • a[:] = b[:] * c[:]; // if a, b, c are declared with size N
• The Intel® C++ Compiler’s automatic vectorization can use this information to apply single operations to multiple elements of the array using Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® Advanced Vector Extensions (Intel® AVX)
  • The default target is SSE2. Use compiler options (/Qx, /arch, /Qax) to change it.
• More advanced example:
  • x[0:10:10] = sin(y[20:10:2]);
Array Section Notation
• Syntax:
  <array base> [ <lower bound> : <length> [: <stride>] ]
               [ <lower bound> : <length> [: <stride>] ] …
• Note that the second field is a length, not an upper bound as in Fortran’s [lower bound : upper bound].

A[:]        // All elements of vector A
B[2:6]      // Elements 2 to 7 of vector B
D[0:3:2]    // Elements 0, 2, 4 of vector D
E[0:3][0:4] // 12 elements, from E[0][0] to E[2][3]

float B[10];
B[2:6] = …  // writes elements 2 through 7
Cilk Plus - Array Notations Example

void foo(double *a, double *b, double *c, double *d, double *e, int n) {
    for (int i = 0; i < n; i++)
        a[i] *= (b[i] - d[i]) * (c[i] + e[i]);
}

void goo(double *a, double *b, double *c, double *d, double *e, int n) {
    a[0:n] *= (b[0:n] - d[0:n]) * (c[0:n] + e[0:n]);
}

icl -Qvec-report3 -c test-array-notations.cpp
test-array-notations.cpp(2) (col. 2): remark: loop was not vectorized: existence of vector dependence.
test-array-notations.cpp(3) (col. 3): remark: vector dependence: assumed FLOW dependence between a line 3 and e line 3.
<…>
test-array-notations.cpp(7) (col. 6): remark: LOOP WAS VECTORIZED.
Cilk Plus - Reductions
• A reduction combines array section elements to generate a scalar result
• Ten built-in reduction functions supporting the basic C data types:
  • add, mul, max, max_ind, min, min_ind, all_zero, all_nonzero, any_zero, any_nonzero
• Supports user-defined reduction functions:

int a[] = {1, 2, 3, 4};
sum = __sec_reduce_add(a[:]); // sum is 10

type fn(type in1, type in2); // scalar reduction function
out = __sec_reduce(fn, identity_value, in[x:y:z]);
Cilk Plus - Function Maps with Array Sections (“Elemental Functions”)
The compiler can convert a user-supplied scalar function to a vector function when it is called with array notation arguments.
The compiler automatically maps the function across multiple array elements (in the example, the call becomes “A * X[:] + Y[:]”).

// Plain C scalar function declared with __declspec(vector)
__declspec(vector) float saxpy (float a, float x, float y) {
    return (a * x + y);
}

Z[:] = saxpy(A, X[:], Y[:]); // Call the scalar function with
                             // array notation arguments
Cilk Plus - Elemental Functions
• The compiler can’t assume that user-defined functions are safe for vectorization.
• Marking a function as elemental tells the compiler that the function can safely be applied to multiple elements of an array in parallel.
• Specify __declspec(vector) on both function declarations and definitions, as this affects name mangling.
Cilk Plus - Elemental Functions Example

double user_function(double x);
__declspec(vector) double elemental_function(double x);

void foo(double *a, double *b, int n) {
    a[0:n] = user_function(b[0:n]);
}

void goo(double *a, double *b, int n) {
    a[0:n] = elemental_function(b[0:n]);
}

icl /Qvec-report3 /c test-elemental-functions.cpp
test-elemental-functions.cpp(4) (col. 39): remark: routine skipped: no vectorization candidates.
test-elemental-functions.cpp(9) (col. 2): remark: LOOP WAS VECTORIZED.
Positioning of SIMD Vectorization (a Cilk Plus Feature)
From most programmer control to greatest ease of use:
• ASM code (addps)
• Vector intrinsics (_mm_add_ps())
• SIMD intrinsic classes (F32vec4 add)
• User-mandated vectorization (SIMD directive)
• Auto-vectorization hints (#pragma ivdep)
• Fully automatic vectorization
Cilk Plus - SIMD Directive Notation
C/C++: #pragma simd [clause [,clause] …]
Fortran: !DIR$ SIMD [clause [,clause] …]

Without any clause, the directive enforces vectorization of the (innermost) loop.

Sample:
void add_fl(float *a, float *b, float *c, float *d, float *e, int n)
{
#pragma simd
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}

Without the SIMD directive, vectorization fails because there are too many pointer references to do a run-time check for overlapping (a compiler heuristic).
Cilk Plus - Clauses of the SIMD Directive
vectorlength(n1 [,n2] …)
• n1, n2, … must be 2, 4, 8, or 16: the compiler can assume vectorization for a vector length of n1, n2, … to be safe.
private(v1, v2, …)
• Variables private to each iteration; the initial value is broadcast to all private instances, and the final value is copied out from the last iteration instance.
linear(v1:step1, v2:step2, …)
• For every iteration of the original scalar loop, v1 is incremented by step1, etc. It is therefore incremented by step1 * (vector length) per iteration of the vectorized loop.
reduction(operator:v1, v2, …)
• v1, etc. are reduction variables for the operation “operator”.
[no]assert
• Reaction in case vectorization fails: print a warning only (noassert, the default), or treat the failure as an error and stop compilation (assert).
Cilk Plus - References and Contact Information
• Cilk homepage: www.cilk.com
• Intel® Parallel Building Blocks and Intel® Cilk™ Plus pages: http://software.intel.com/en-us/articles/intel-parallel-building-blocks
• User forums:
  • http://software.intel.com/en-us/forums/intel-cilk-plus/
  • http://software.intel.com/en-us/forums/intel-parallel-studio
Intel® Threading Building Blocks
Widely used C++ template library for parallelism
Open sourced; also an Intel product

A C++ library providing:
• Parallel tasks
• Parallel algorithms
• Concurrent containers
• Synchronization primitives
• A scalable memory allocator

Outfits C++ for parallelism. Introduced in 2006 and now in its 5th year, it is the most widely used abstraction for parallelism.

threadingbuildingblocks.org
Intel® Threading Building Blocks (TBB) Extends C++ for Parallelism
• A kind of “STL for parallel C++ programming”
• You specify tasks (that can run concurrently) instead of threads
  • The library maps user-defined logical tasks onto physical threads, efficiently using the cache and balancing load
  • Full support for nested parallelism
• Targets threading for scalable performance
  • Portable across Linux*, Mac OS*, Windows*, and Solaris*
• Compatible with other threading packages
  • Can be used in concert with native threads and OpenMP*

A flexible, scalable solution with a high amount of control at minimum overhead.

threadingbuildingblocks.org
TBB - Generic Parallel Algorithms

Loop parallelization:
parallel_for, parallel_reduce
- Load-balanced parallel execution
- Fixed number of independent iterations
parallel_scan
- Computes a parallel prefix: y[i] = y[i-1] op x[i]

Parallel algorithms for streams:
parallel_do
- Use for an unstructured stream or pile of work
- Can add additional work to the pile while running
parallel_for_each
- parallel_do without an additional work feeder
pipeline / parallel_pipeline
- Linear pipeline of stages
- Each stage can be parallel, serial in-order, or serial out-of-order
- Uses the cache efficiently

Parallel function invocation:
parallel_invoke
- Parallel execution of a number of user-specified functions

Parallel sorting:
parallel_sort

Computational graph:
flow::graph
- Implements dependencies between tasks
- Passes messages between tasks
Intel® TBB Components

Generic parallel algorithms: parallel_for(range); parallel_reduce; parallel_for_each(begin, end); parallel_do; parallel_invoke; pipeline; parallel_sort; parallel_scan
Concurrent containers: concurrent_hash_map; concurrent_queue; concurrent_bounded_queue; concurrent_vector
Task scheduler: task_group; task_structured_group; task_scheduler_init; task_scheduler_observer
Synchronization primitives: atomic; mutex; recursive_mutex; spin_mutex; spin_rw_mutex; queuing_mutex; queuing_rw_mutex; null_mutex; null_rw_mutex
Memory allocation: tbb_allocator; cache_aligned_allocator; scalable_allocator; zero_allocator
Threads: tbb_thread
Thread-local storage: enumerable_thread_specific; combinable
Flow graph: flow::graph
Miscellaneous: tick_count

threadingbuildingblocks.org
Scaling with abstraction: powerful, and proven highly scalable.
threadingbuildingblocks.org
Domain-Specific Libraries
Intel® Integrated Performance Primitives
Intel® Math Kernel Library
Intel® Integrated Performance Primitives (IPP) 7.0
Multicore Power for Multimedia and Data Processing

Features:
• Rapid application development
• Cross-platform compatibility and code re-use
• Highly optimized functions from 15 domains:
  • Images and video
  • Communications and signal processing
  • Data processing
• Performance optimizations for the latest Intel processors, incl. Core i7 and AES-NI (incl. 2nd Gen, AVX), and Atom processors
Intel® Integrated Performance Primitives 7.0

Applications: Digital Media | Web/Enterprise Data | Embedded Communications | Scientific/Technical
16 function domains; optimized 32-bit and 64-bit multicore performance; high-level APIs and codecs; interfaces and code samples; cross-platform C/C++ API for code re-use.

Multimedia:
• Image Processing
• Color Conversion
• JPEG/JPEG2000
• Video Coding
• Computer Vision
• Realistic Rendering

Signal Processing:
• Signal Processing
• Audio Coding
• Speech Coding
• Speech Recognition
• Vector Operations

Data Processing:
• Data Compression
• Data Integrity
• Cryptography
• String Processing
• Matrix Operations
Intel® Math Kernel Library 10.3: Flagship Math Processing Library

Features:
• Multi-core ready, with excellent scaling
• Highly optimized, extensively threaded math routines for science, engineering, and financial applications, for maximum performance
• Automatic runtime processor detection ensures great performance on whatever processor your application is running on
• Support for C and Fortran
• Optimizations for the latest Intel processors, including 2nd-gen Core processors
Application Areas That Could Use MKL
• Energy: reservoir simulation, seismic, electromagnetics, etc.
• Finance: options pricing, mortgage pricing, financial portfolio management, etc.
• Manufacturing: CAD, FEA, etc.
• Applied mathematics: linear programming, quadratic programming, boundary value problems, nonlinear parameter estimation, homotopy calculations, curve and surface fitting, numerical integration, fixed-point methods, partial and ordinary differential equations, statistics, optimal control and system theory
• Physics and computer science: spectroscopy, fluid dynamics, optics, geophysics, seismology and hydrology, electromagnetism, neural network training, computer vision, motion estimation and robotics
• Chemistry: physical chemistry, chemical engineering, study of transition states, chemical kinetics, molecular modeling, crystallography, mass transfer, speciation
• Engineering: structural engineering, transportation analysis, energy distribution networks, radar applications, modeling and mechanical design, circuit design
• Biology and medicine: magnetic resonance applications, rheology, pharmacokinetics, computer-aided diagnostics, optical tomography
• Economics and sociology: random utility models, game theory and international negotiations, financial portfolio management
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable
product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2012. Intel Corporation.
http://intel.com/software/products