OpenMP, TBB, PPL, MPI, OpenCL, OpenACC, CUDA, C++ AMP, Renderscript, Cilk Plus, GCD

Parallelism in the Standard C++: What to Expect in C++ 17

Artur Laksberg

Microsoft Corp.

May 8th, 2014

Agenda

Fundamentals: Task regions

Parallel Algorithms: Parallelization, Vectorization

Part 1: The Fundamentals

OpenMP, TBB, PPL

MPI, OpenCL, OpenACC

CUDA, C++ AMP

Renderscript

Cilk Plus, GCD

Parallelism in C++11/14

Fundamentals: memory model, atomics

Basics: thread, mutex, condition_variable, async, future

Quicksort: Serial

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

Quicksort: Use Threads

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        std::thread t1([&] { quicksort(v, start, pivot - 1); });
        std::thread t2([&] { quicksort(v, pivot + 1, end); });
        t1.join();
        t2.join();
    }
}

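Problem 1: Expensive (every recursive call creates two new threads)

Problem 2: Fork-join not enforced (nothing but the programmer's discipline guarantees the joins)

Problem 3: Exceptions? (destroying a std::thread that is still joinable calls std::terminate)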

Quicksort: Fork-Join Parallelism

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        // parallel region
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

Quicksort: Using Task Regions (N3832)

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        task_region([&] (auto& r) {    // parallel region
            r.run([&] { quicksort(v, start, pivot - 1); });
            r.run([&] { quicksort(v, pivot + 1, end); });
        });
    }
}
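The fork-join shape of the task region version can be approximated in standard C++11. The following is a minimal sketch, not the N3832 API: `mini_task_region`, `mini_region_handle`, `partition_range` and `kSerialCutoff` are hypothetical names introduced here for illustration. Each child is launched with `std::async`, and all children are joined before the region returns. The serial cutoff is an assumption added to keep the number of threads small.

```cpp
#include <algorithm>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed task_region handle.
class mini_region_handle {
public:
    template <class F>
    void run(F f) {
        // Assumption: one std::async task per child, no work stealing.
        children_.push_back(std::async(std::launch::async, std::move(f)));
    }
    void join_all() {
        for (auto& child : children_) child.get();  // rethrows a child's exception
        children_.clear();
    }
private:
    std::vector<std::future<void>> children_;
};

// Every task started inside body() has finished when this returns.
template <class Body>
void mini_task_region(Body body) {
    mini_region_handle r;
    body(r);
    r.join_all();
}

// Illustrative Lomuto partition over the inclusive range [start, end].
int partition_range(int* v, int start, int end) {
    int pivot_value = v[end];
    int i = start;
    for (int j = start; j < end; ++j)
        if (v[j] < pivot_value) std::swap(v[i++], v[j]);
    std::swap(v[i], v[end]);
    return i;
}

const int kSerialCutoff = 1024;  // assumed threshold, not from the slides

void quicksort(int* v, int start, int end) {
    if (start >= end) return;
    if (end - start < kSerialCutoff) {
        std::sort(v + start, v + end + 1);  // small ranges: stay serial
        return;
    }
    int pivot = partition_range(v, start, end);
    mini_task_region([&](mini_region_handle& r) {
        r.run([&] { quicksort(v, start, pivot - 1); });
        r.run([&] { quicksort(v, pivot + 1, end); });
    });
}
```

Because futures returned by `std::async` block in their destructor, the children are still joined if `body` or an earlier `get()` throws, which is the fork-join guarantee the raw `std::thread` version lacks.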

Under The Hood…

Work Stealing Scheduling

[Figure sequence: work queues for proc 1, proc 2, proc 3 and proc 4. Each queue holds old items at one end and new items at the other; a “thief” processor takes work from another processor's queue.]
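The split between old items and new items can be sketched as a double-ended queue per processor. This is a minimal illustration under stated assumptions, not how any particular runtime implements it: `work_queue` is a hypothetical name, and a mutex is used for clarity where production schedulers typically use lock-free deques. The owner pushes and pops at the new end, while a thief removes from the old end.

```cpp
#include <deque>
#include <mutex>
#include <utility>

// Illustrative per-processor work queue (hypothetical, mutex-protected).
template <class Task>
class work_queue {
public:
    // Owner: newly spawned work goes to the "new" end.
    void push(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(std::move(t));
    }

    // Owner: take the newest item. Returns false if the queue is empty.
    bool pop(Task& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return false;
        out = std::move(items_.back());
        items_.pop_back();
        return true;
    }

    // Thief: take the oldest item. Returns false if the queue is empty.
    bool steal(Task& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex m_;
    std::deque<Task> items_;
};
```

In a recursive fork-join computation the oldest items tend to represent the largest remaining pieces of work, so a thief that takes from the old end usually needs to steal less often.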

Fork-Join Parallelism and Work Stealing

task_region([] (auto& r) {
    r.run(f);
    g();
});
h();

Q1: What thread runs f?

Q2: What thread runs g?

Q3: What thread runs h?

Work Stealing Design Choices

What Thread Executes After a Spawn?
Child Stealing
Continuation (parent) Stealing

What Thread Executes After a Join?
Stalling: initiating thread waits
Greedy: the last thread to reach join continues

task_region([] (auto& r) {
    for (int i = 0; i < n; ++i)
        r.run(f);
});

Part 2: The Algorithms

Alex Stepanov: Start With The Algorithms

Inspiration

Performing Parallel Operations On Containers

Intel Threading Building Blocks

Microsoft Parallel Patterns Library, C++ AMP

Nvidia Thrust

Parallel STL

Just like STL, only parallel… Can be faster

If you know what you’re doing

Two Execution Policies: std::par, std::vec

Parallelization: What’s the Big Deal?

Why not already parallel?

std::sort(begin, end, [](int a, int b) { return a < b; });

User-provided closures must be thread safe:

int comparisons = 0;
std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; });

But also special-member functions, std::swap etc.
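The counting comparator above modifies a plain `int` from whatever threads the algorithm uses, which is a data race. One possible fix, assuming only the total count matters, is to make the counter a `std::atomic`. To stay within standard C++11, this sketch stands in for a parallel sort by sorting the two halves of a vector on two threads that share one comparator; `sort_halves_counting` is a hypothetical name.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Sorts each half of `data` on its own thread with a shared comparator.
// Returns the total number of comparisons performed by both threads.
long sort_halves_counting(std::vector<int>& data) {
    std::atomic<long> comparisons{0};

    // Safe to call concurrently: the only shared state is atomic.
    auto counting_less = [&comparisons](int a, int b) {
        comparisons.fetch_add(1, std::memory_order_relaxed);
        return a < b;
    };

    auto mid = data.begin() + static_cast<std::ptrdiff_t>(data.size() / 2);
    std::thread left([&] { std::sort(data.begin(), mid, counting_less); });
    std::thread right([&] { std::sort(mid, data.end(), counting_less); });
    left.join();
    right.join();

    return comparisons.load();
}
```

The atomic increment removes the race but adds contention on one memory location, so a per-thread counter that is summed at the end may scale better.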

It’s a Contract

What the user can do
What the implementer can do

Asymptotic Guarantees: std::sort: O(n*log(n)), std::stable_sort: O(n*log^2(n)), what about parallel sort?

What is a valid implementation? (see next slide)

Chaos Sort

template<typename Iterator, typename Compare>
void chaos_sort(Iterator first, Iterator last, Compare comp) {
    auto n = last - first;
    std::vector<char> c(n);
    for (;;) {
        bool flag = false;
        for (size_t i = 1; i < n; ++i) {
            c[i] = comp(first[i], first[i-1]);
            flag |= c[i];
        }
        if (!flag) break;
        for (size_t i = 1; i < n; ++i)
            if (c[i])
                std::swap(first[i-1], first[i]);
    }
}

Execution Policies

Built-in Execution Policies:

extern const sequential_execution_policy seq;
extern const parallel_execution_policy par;
extern const vector_execution_policy vec;

Dynamic Execution Policy:

class execution_policy
{
public:
    // ...
    const type_info& target_type() const;
    template<class T> T *target();
    template<class T> const T *target() const;
};

Using Execution Policy To Write Parallel Code

std::vector<int> v = ...

// standard sequential sort
std::sort(v.begin(), v.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, v.begin(), v.end());

// permitting parallel execution
sort(par, v.begin(), v.end());

// permitting vectorization as well
sort(vec, v.begin(), v.end());

Picking Execution Policy Dynamically

size_t threshold = ...

execution_policy exec = seq;

if (vec.size() > threshold) {
    exec = par;
}

sort(exec, vec.begin(), vec.end());

Exception Handling

In C++ philosophy, no exception is silently ignored
Exception list: container of exception_ptr objects

try {
    r = std::inner_product(std::par, a.begin(), a.end(), b.begin(), 0, func1, func2);
} catch (const exception_list& list) {
    for (auto& exptr : list) {
        // process exception pointer exptr
    }
}
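The idea of an exception list can be tried out in standard C++11 without the proposed `exception_list` type. In this minimal sketch, `exception_ptr_list`, `run_all` and `count_runtime_errors` are hypothetical names, and a plain `std::vector<std::exception_ptr>` stands in for the proposed container. Every task runs through `std::async`, and each exception is captured instead of being dropped.

```cpp
#include <exception>
#include <functional>
#include <future>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the proposed exception_list.
using exception_ptr_list = std::vector<std::exception_ptr>;

// Runs every task concurrently and returns all exceptions that were thrown.
exception_ptr_list run_all(const std::vector<std::function<void()>>& tasks) {
    std::vector<std::future<void>> pending;
    for (const auto& task : tasks)
        pending.push_back(std::async(std::launch::async, task));

    exception_ptr_list errors;
    for (auto& p : pending) {
        try {
            p.get();  // rethrows whatever the task threw
        } catch (...) {
            errors.push_back(std::current_exception());
        }
    }
    return errors;
}

// Example of processing the list: count the std::runtime_error entries.
int count_runtime_errors(const exception_ptr_list& errors) {
    int count = 0;
    for (const auto& e : errors) {
        try {
            std::rethrow_exception(e);
        } catch (const std::runtime_error&) {
            ++count;
        } catch (...) {
            // some other exception type; not counted here
        }
    }
    return count;
}
```

Rethrowing each stored pointer inside its own try block is the usual way to recover the dynamic type of an `exception_ptr`.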

Vectorization: A Tale From Agriculture

Idea: Fewer Tractors, Wider Plows

Vectorization: What’s the Big Deal?

int a[n] = ...;
int b[n] = ...;
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c;
}

movdqu  xmm1, XMMWORD PTR _b$[esp+eax+132]
movdqu  xmm0, XMMWORD PTR _a$[esp+eax+132]
paddd   xmm1, xmm2
paddd   xmm1, xmm0
movdqu  XMMWORD PTR _a$[esp+eax+132], xmm1

a[i:i+3] = b[i:i+3] + c;
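The `a[i:i+3]` notation describes one step that handles four elements at once. A minimal sketch of the same idea in plain C++ strip-mines the loop by four and finishes with a scalar remainder loop. The names `add_scalar` and `add_by_fours` and the fixed width of 4 are assumptions for illustration; a real compiler chooses the width from the target's vector registers and emits SIMD instructions for the inner step.

```cpp
#include <cstddef>

// Scalar reference loop: one element per iteration.
void add_scalar(int* a, const int* b, int c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + c;
}

// The same computation, strip-mined into blocks of four elements.
void add_by_fours(int* a, const int* b, int c, std::size_t n) {
    const std::size_t width = 4;  // assumed number of vector lanes
    std::size_t i = 0;

    // Main loop: each pass corresponds to a[i:i+3] = b[i:i+3] + c.
    for (; i + width <= n; i += width) {
        for (std::size_t lane = 0; lane < width; ++lane)
            a[i + lane] = b[i + lane] + c;
    }

    // Remainder loop: the last n % 4 elements, one at a time.
    for (; i < n; ++i)
        a[i] = b[i] + c;
}
```

The two functions produce identical results only because the iterations are independent, which is exactly what the compiler must prove, or the programmer must promise, before vectorizing.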

Vector Lane is not a Thread!

Taking locks
Thread with thread_id x takes a lock…
Then another “thread” with the same thread_id enters the lock…
Deadlock!!!

Exceptions
Can we unwind 1/4th of the stack?

Vectorization: Not So Easy Any More…

void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

mov   ecx, DWORD PTR _b$[esp+esi+140]
add   ecx, edi
add   DWORD PTR _a$[esp+esi+140], ecx
call  func

Aliasing?
Side effects?
Dependence?
Exceptions?

Vectorization Hazard: Locks

for (int i = 0; i < n; ++i) {
    lock.enter();
    a[i] = b[i] + c;
    lock.release();
}

for (int i = 0; i < n; i += 4) {
    for (int j = 0; j < 4; ++j)
        lock.enter();

    a[i:i+3] = b[i:i+3] + c;

    for (int j = 0; j < 4; ++j)
        lock.release();
}

This transformation is not safe!

Consider: f takes a lock, g releases the lock:

How Do We Get This?

void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

for (int i = 0; i < n; i += 4) {
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j)
        func();
}

Need a helping hand from the programmer, because…

Vector Loop with Parallel STL

void f(int* a, int* b) {
    integer_iterator begin {0};
    integer_iterator end {n};

    std::for_each(std::vec, begin, end, [&](int i) {
        a[i] = b[i] + c;
        func();
    });
}

Parallelization vs. Vectorization

Parallelization: Threads; Stack; Good for divergent code; Relatively heavy-weight

Vectorization: Vector lanes; No stack; Lock-step execution; Very light-weight

When To Vectorize

std::par: No race conditions; No aliasing

std::vec: Same as std::par, plus: No exceptions; No locks; No/little divergence

References

N3832: Task Region
N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing
N3724: A Parallel Algorithms Library
N3850: Working Draft, Technical Specification for C++ Extensions for Parallelism
parallelstl.codeplex.com
