Parallel Programming Models
Andrey Bokhanko, Intel
More and More Transistors!
Parallelism in hardware has only “just” started. Much more is to come, for many years ahead.
Do NOT make programming harder
One version of each algorithm, one source code, feeds all forms of parallelism (cores, co-processors, SIMD).
Different programmers want different levels of control over how their program executes.
Program in “tasks”, not “threads”
[Diagram: without an abstraction layer, the programmer manages threads and maps them onto the hardware directly.]
Program in “tasks”, not “threads”
[Diagram: an abstraction layer maps the programmer’s tasks onto threads, and the threads onto the hardware.]
Parallelization vs. Vectorization
• Vectorization is DLP (Data-Level Parallelism): it uses wide registers (e.g., SSE) to apply one instruction to multiple data elements.
• Parallelization is TLP (Thread-Level Parallelism): it uses processor cores (and hardware threads) to run parts of a serial code concurrently.
• A possible pitfall of parallelization: concurrent access by multiple threads to the same memory location (a data race!).
Combine the POWER of both!
A Family of Parallel Programming Models: Developer Choice

Intel® Threading Building Blocks
Widely used C++ template library for parallelism
Open sourced; also an Intel product

Domain-Specific Libraries
Intel® Integrated Performance Primitives
Intel® Math Kernel Library

Established Standards
Message Passing Interface (MPI)
OpenMP*
Coarray Fortran
OpenCL*

Research / Development
Intel® Concurrent Collections
Offload Extensions
Intel® Array Building Blocks
Intel® SPMD Parallel Compiler

A choice of high-performance parallel programming models, with libraries for pre-optimized and parallelized functionality:
• Intel® Cilk™ Plus and Intel® Threading Building Blocks support composable parallelization of a wide variety of applications.
• OpenCL* addresses the needs of customers in specific segments and gives developers an additional choice to maximize application performance.
• MPI supports distributed computation and combines with other models on the nodes.
Intel® Cilk™ Plus
C/C++ language extensions to simplify parallelism
Open sourced; also an Intel product
Intel® Cilk™ Plus
Keywords (task parallelism): a set of keywords for expressing task parallelism:
  cilk_spawn, cilk_sync, cilk_for
Reducers: reliable access to nonlocal variables without races:
  cilk::reducer_opadd<int> sum;
CEAN (C/C++ Extensions for Array Notation; data parallelism): operations on sections of arrays or whole arrays:
  a[:] = b[:] * c[:];
Elemental functions: define actions that can be applied to whole or parts of arrays, or to scalars:
  __declspec (vector)
#pragma simd: extended vector parallelism using SIMD hardware registers
Execution parameters: runtime system APIs, environment variables, pragmas

cilkplus.org
Cilk Plus - Serial Semantics
• A deterministic Cilk Plus program will have the same semantics as its serialization.
• Easier regression testing
• Easier to debug:
  • Run with one core
  • Run serialized
• Strong analysis tools (Cilk Plus-specific versions will be posted on WhatIf):
  • Race detector
  • Parallelism analyzer
Cilk Plus Keywords
• Cilk Plus adds three keywords to C and C++: _Cilk_spawn, _Cilk_sync, _Cilk_for
• If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.
• The Cilk Plus runtime controls thread creation and scheduling. A thread pool is created prior to the first use of a Cilk Plus keyword.
• The number of threads matches the number of cores by default, but can be controlled by the user.
Cilk Plus – Sample: cilk_spawn

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);  // Execution can continue while
                   // fib(n-1) is running.
    cilk_sync;     // The asynchronous call must
                   // complete before using x.
    return x + y;
}
Cilk Plus – Sample: Hyperobject
int result = 0;
for (std::size_t i = 0; i < n; ++i)
{
result += compute(myArray[i]);
}
std::cout << "The result is: " << result << std::endl;
Cilk Plus – Sample: Hyperobject
int result = 0;
cilk_for (std::size_t i = 0; i < n; ++i)
{
result += compute(myArray[i]); // data race!!!
}
std::cout << "The result is: " << result << std::endl;
Cilk Plus – Sample: Hyperobject

cilk::reducer_opadd<int> result;
cilk_for (std::size_t i = 0; i < ARRAY_SIZE; ++i)
{
    result += compute(myArray[i]); // reducer hyperobject avoids the race
}
std::cout << "The result is: "
          << result.get_value() // value extracted
          << std::endl;
Cilk Plus - Array Notations
• Array Notations provide a syntax to specify sections of arrays on which to perform operations
• Syntax: [<lower bound> : <length> : <stride>]
• Simple example:
  • a[0:N] = b[0:N] * c[0:N];
  • a[:] = b[:] * c[:]; // if a, b, c are declared with size N
• The Intel® C++ Compiler’s automatic vectorization can use this information to apply single operations to multiple elements of the array using Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® Advanced Vector Extensions (Intel® AVX)
  • The default target is SSE2. Use compiler options (/Qx, /arch, /Qax) to change it.
• More advanced example:
  • x[0:10:10] = sin(y[20:10:2]);
Array Section Notation
• Syntax:
  <array base> [ <lower bound> : <length> [: <stride>] ]
               [ <lower bound> : <length> [: <stride>] ] …
• Note that the second field is a length, not an upper bound as in Fortran’s [lower bound : upper bound].

A[:]        // All elements of vector A
B[2:6]      // Elements 2 to 7 of vector B
D[0:3:2]    // Elements 0, 2, 4 of vector D
E[0:3][0:4] // 12 elements, from E[0][0] to E[2][3]

float B[10];
B[2:6] = …  // writes elements 2 through 7
Cilk Plus - Array Notations Example

void foo(double *a, double *b, double *c, double *d, double *e, int n) {
    for (int i = 0; i < n; i++)
        a[i] *= (b[i] - d[i]) * (c[i] + e[i]);
}

void goo(double *a, double *b, double *c, double *d, double *e, int n) {
    a[0:n] *= (b[0:n] - d[0:n]) * (c[0:n] + e[0:n]);
}

icl -Qvec-report3 -c test-array-notations.cpp
test-array-notations.cpp(2) (col. 2): remark: loop was not vectorized: existence of vector dependence.
test-array-notations.cpp(3) (col. 3): remark: vector dependence: assumed FLOW dependence between a line 3 and e line 3.
<…>
test-array-notations.cpp(7) (col. 6): remark: LOOP WAS VECTORIZED.
Cilk Plus - Reductions
• A reduction combines array section elements to generate a scalar result
• Ten built-in reduction functions supporting the basic C data types:
  • add, mul, max, max_ind, min, min_ind, all_zero, all_nonzero, any_zero, any_nonzero
• Supports user-defined reduction functions:

int a[] = {1, 2, 3, 4};
sum = __sec_reduce_add(a[:]); // sum is 10

type fn(type in1, type in2); // scalar reduction function
out = __sec_reduce(fn, identity_value, in[x:y:z]);
Cilk Plus - Function Maps with Array Sections (“Elemental Functions”)
The compiler can convert a user-supplied scalar function to a vector function when it is called with array notation arguments.
The compiler automatically maps the function across multiple array elements (in the example, the call becomes “A * X[:] + Y[:]”).

// Plain C scalar function declared with __declspec(vector)
__declspec(vector) float saxpy (float a, float x, float y) {
    return (a * x + y);
}

Z[:] = saxpy(A, X[:], Y[:]); // Call the scalar function with
                             // array notation arguments
Cilk Plus - Elemental Functions
• The compiler can’t assume that user-defined functions are safe for vectorization.
• Marking a function as elemental tells the compiler that the function can safely be applied to multiple elements of an array in parallel.
• Specify __declspec(vector) on both function declarations and definitions, as this affects name mangling.
Cilk Plus - Elemental Functions Example

double user_function(double x);
__declspec(vector) double elemental_function(double x);

void foo(double *a, double *b, int n) {
    a[0:n] = user_function(b[0:n]);
}

void goo(double *a, double *b, int n) {
    a[0:n] = elemental_function(b[0:n]);
}

icl /Qvec-report3 /c test-elemental-functions.cpp
test-elemental-functions.cpp(4) (col. 39): remark: routine skipped: no vectorization candidates.
test-elemental-functions.cpp(9) (col. 2): remark: LOOP WAS VECTORIZED.
Positioning of SIMD Vectorization (a Cilk Plus Feature)
From most programmer control to greatest ease of use:
• ASM code (addps)
• Vector intrinsics (_mm_add_ps())
• SIMD intrinsic classes (F32vec4 add)
• User-mandated vectorization (SIMD directive)
• Auto-vectorization hints (#pragma ivdep)
• Fully automatic vectorization
Cilk Plus - SIMD Directive Notation
C/C++: #pragma simd [clause [,clause] …]
Fortran: !DIR$ SIMD [clause [,clause] …]

Without any clause, the directive enforces vectorization of the (innermost) loop.

Sample:
void add_fl(float *a, float *b, float *c, float *d, float *e, int n)
{
#pragma simd
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}

Without the SIMD directive, vectorization fails because there are too many pointer references to do a run-time check for overlapping (a compiler heuristic).
Cilk Plus - Clauses of the SIMD Directive
vectorlength(n1 [,n2] …)
• n1, n2, … must be 2, 4, 8, or 16: the compiler can assume vectorization for a vector length of n1, n2, … to be safe.
private(v1, v2, …)
• Variables private to each iteration; the initial value is broadcast to all private instances, and the final value is copied out from the last iteration instance.
linear(v1:step1, v2:step2, …)
• For every iteration of the original scalar loop, v1 is incremented by step1, etc. It is therefore incremented by step1 * (vector length) per iteration of the vectorized loop.
reduction(operator:v1, v2, …)
• v1, etc. are reduction variables for the operation “operator”.
[no]assert
• Reaction in case vectorization fails: print a warning only (noassert, the default), or treat the failure as an error and stop compilation (assert).
Cilk Plus - References and Contact Information
• Cilk homepage: www.cilk.com
• Intel® Parallel Building Blocks and Intel® Cilk™ Plus pages: http://software.intel.com/en-us/articles/intel-parallel-building-blocks
• User forums:
  • http://software.intel.com/en-us/forums/intel-cilk-plus/
  • http://software.intel.com/en-us/forums/intel-parallel-studio
Intel® Threading Building Blocks
Widely used C++ template library for parallelism
Open sourced; also an Intel product

A C++ library providing:
• Parallel tasks
• Parallel algorithms
• Concurrent containers
• Synchronization primitives
• A scalable memory allocator

Outfits C++ for parallelism. Introduced in 2006 and now in its 5th year, it is the most widely used abstraction for parallelism.

threadingbuildingblocks.org
Intel® Threading Building Blocks (TBB) Extends C++ for Parallelism
• A kind of “STL for parallel C++ programming”
• You specify tasks (that can run concurrently) instead of threads
  • The library maps user-defined logical tasks onto physical threads, efficiently using the cache and balancing load
  • Full support for nested parallelism
• Targets threading for scalable performance
  • Portable across Linux*, Mac OS*, Windows*, and Solaris*
• Compatible with other threading packages
  • Can be used in concert with native threads and OpenMP*

A flexible, scalable solution with a high amount of control at minimum overhead.

threadingbuildingblocks.org
TBB - Generic Parallel Algorithms

Loop parallelization:
parallel_for, parallel_reduce
- Load-balanced parallel execution
- Fixed number of independent iterations
parallel_scan
- Computes a parallel prefix: y[i] = y[i-1] op x[i]

Parallel algorithms for streams:
parallel_do
- Use for an unstructured stream or pile of work
- Can add additional work to the pile while running
parallel_for_each
- parallel_do without an additional work feeder
pipeline / parallel_pipeline
- Linear pipeline of stages
- Each stage can be parallel, serial in-order, or serial out-of-order
- Uses the cache efficiently

Parallel function invocation:
parallel_invoke
- Parallel execution of a number of user-specified functions

Parallel sorting:
parallel_sort

Computational graph:
flow::graph
- Implements dependencies between tasks
- Passes messages between tasks
Intel® TBB Components

Generic parallel algorithms: parallel_for(range); parallel_reduce; parallel_for_each(begin, end); parallel_do; parallel_invoke; pipeline; parallel_sort; parallel_scan
Concurrent containers: concurrent_hash_map; concurrent_queue; concurrent_bounded_queue; concurrent_vector
Task scheduler: task_group; task_structured_group; task_scheduler_init; task_scheduler_observer
Synchronization primitives: atomic; mutex; recursive_mutex; spin_mutex; spin_rw_mutex; queuing_mutex; queuing_rw_mutex; null_mutex; null_rw_mutex
Memory allocation: tbb_allocator; cache_aligned_allocator; scalable_allocator; zero_allocator
Threads: tbb_thread
Thread-local storage: enumerable_thread_specific; combinable
Flow graph: flow::graph
Miscellaneous: tick_count

threadingbuildingblocks.org
Scaling with abstraction: powerful, and proven highly scalable.
threadingbuildingblocks.org
Domain-Specific Libraries
Intel® Integrated Performance Primitives
Intel® Math Kernel Library
Intel® Integrated Performance Primitives (IPP) 7.0
Multicore Power for Multimedia and Data Processing

Features:
• Rapid application development
• Cross-platform compatibility and code re-use
• Highly optimized functions from 15 domains:
  • Images and video
  • Communications and signal processing
  • Data processing
• Performance optimizations for the latest Intel processors, incl. Core i7 and AES-NI (incl. 2nd Gen, AVX), and Atom processors
Intel® Integrated Performance Primitives 7.0

Applications: Digital Media | Web/Enterprise Data | Embedded Communications | Scientific/Technical
16 function domains; optimized 32-bit and 64-bit multicore performance; high-level APIs and codecs; interfaces and code samples; cross-platform C/C++ API for code re-use.

Multimedia:
• Image Processing
• Color Conversion
• JPEG/JPEG2000
• Video Coding
• Computer Vision
• Realistic Rendering

Signal Processing:
• Signal Processing
• Audio Coding
• Speech Coding
• Speech Recognition
• Vector Operations

Data Processing:
• Data Compression
• Data Integrity
• Cryptography
• String Processing
• Matrix Operations
Intel® Math Kernel Library 10.3: Flagship Math Processing Library

Features:
• Multi-core ready, with excellent scaling
• Highly optimized, extensively threaded math routines for science, engineering, and financial applications, for maximum performance
• Automatic runtime processor detection ensures great performance on whatever processor your application is running on
• Support for C and Fortran
• Optimizations for the latest Intel processors, including 2nd-gen Core processors
Application Areas That Could Use MKL
• Energy: reservoir simulation, seismic, electromagnetics, etc.
• Finance: options pricing, mortgage pricing, financial portfolio management, etc.
• Manufacturing: CAD, FEA, etc.
• Applied mathematics: linear programming, quadratic programming, boundary value problems, nonlinear parameter estimation, homotopy calculations, curve and surface fitting, numerical integration, fixed-point methods, partial and ordinary differential equations, statistics, optimal control and system theory
• Physics and computer science: spectroscopy, fluid dynamics, optics, geophysics, seismology and hydrology, electromagnetism, neural network training, computer vision, motion estimation and robotics
• Chemistry: physical chemistry, chemical engineering, study of transition states, chemical kinetics, molecular modeling, crystallography, mass transfer, speciation
• Engineering: structural engineering, transportation analysis, energy distribution networks, radar applications, modeling and mechanical design, circuit design
• Biology and medicine: magnetic resonance applications, rheology, pharmacokinetics, computer-aided diagnostics, optical tomography
• Economics and sociology: random utility models, game theory and international negotiations, financial portfolio management
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable
product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2012. Intel Corporation.
http://intel.com/software/products