Task and Data Parallelism


DESCRIPTION

Presentation from DevWeek 2014 on task and data parallelism. This session explains the TPL APIs and then covers various scenarios for extracting concurrency, reducing synchronization, putting thresholds on parallelization, and other topics.


Sasha Goldshtein
CTO, Sela Group


Agenda

• Multicore machines have been a cheap commodity for >10 years
• Adoption of concurrent programming is still slow
• Patterns and best practices are scarce
• We discuss the APIs first…
• …and then turn to examples, best practices, and tips

TPL Evolution

Tasks

• A task is a unit of work
– May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
– Much more than threads, and yet much cheaper

Task<string> t = Task.Factory.StartNew(
    () => { return DnaSimulation(…); });
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();

try {
    //The C# 5.0 version
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
} catch (Exception ex) {
    Show(ex);
}

Parallel Loops

• Ideal for parallelizing work over a collection of data
• Easy porting of for and foreach loops
– Beware of inter-iteration dependencies!

Parallel.For(0, 100, i => {
    ...
});

Parallel.ForEach(urls, url => {
    webClient.Post(url, options, data);
});

Parallel LINQ

• Mind-bogglingly easy parallelization of LINQ queries
• Can introduce ordering into the pipeline, or preserve order of original elements

var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;

query.ForAll(monster => Move(monster));

Measuring Concurrency

•Visual Studio Concurrency Visualizer to the rescue

Recursive Parallelism Extraction

• Divide-and-conquer algorithms are often parallelized through the recursive calls
– Be careful with the parallelization threshold and watch out for dependencies

void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        //Combine the two halves in O(n) time
    }
}

Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2));

DEMO: Recursive parallel QuickSort
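A minimal sketch of what the demo covers: QuickSort whose two partitions are independent, so they can run as parallel tasks, with a fall-back to plain recursion below a size threshold. The names `ParallelQuickSort` and `Threshold` and the threshold value are illustrative, not taken from the talk.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelQuickSort
{
    private const int Threshold = 1024; // below this, task overhead outweighs the gain

    public static void Sort(int[] items) => Sort(items, 0, items.Length - 1);

    private static void Sort(int[] items, int left, int right)
    {
        if (left >= right) return;

        int pivot = Partition(items, left, right);

        if (right - left < Threshold)
        {
            // Small range: sequential recursion, no tasks
            Sort(items, left, pivot - 1);
            Sort(items, pivot + 1, right);
        }
        else
        {
            // Large range: the two partitions are independent, run them in parallel
            Parallel.Invoke(
                () => Sort(items, left, pivot - 1),
                () => Sort(items, pivot + 1, right));
        }
    }

    private static int Partition(int[] items, int left, int right)
    {
        int pivot = items[right], store = left;
        for (int i = left; i < right; ++i)
        {
            if (items[i] < pivot)
            {
                (items[i], items[store]) = (items[store], items[i]);
                ++store;
            }
        }
        (items[store], items[right]) = (items[right], items[store]);
        return store;
    }
}
```

Without the threshold, the recursion spawns a task per tiny sub-array and the scheduling overhead dominates; this is the "parallelization threshold" caveat from the previous slide.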

Symmetric Data Processing

• For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
• Inter-iteration dependencies complicate things (think in-place blur)

Parallel.For(0, image.Rows, i => {
    for (int j = 0; j < image.Cols; ++j) {
        destImage.SetPixel(i, j, PixelBlur(image, i, j));
    }
});

Uneven Work Distribution

• With non-uniform data items, use custom partitioning or manual distribution
– Primes: 7 is easier to check than 10,320,647

var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start+chunk*n, start+chunk*(n+1))));
Task.WaitAll(work.ToArray());

versus

Parallel.ForEach(
    Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2));

DEMO: Uneven workload distribution

Complex Dependency Management

• Must extract all dependencies and incorporate them into the algorithm
– Typical scenarios: 1D loops, dynamic algorithms
– Edit distance: each task depends on 2 predecessors; wavefront computation

C = x[i-1] == y[j-1] ? 0 : 1;
D[i, j] = min(
    D[i-1, j] + 1,
    D[i, j-1] + 1,
    D[i-1, j-1] + C);

(Diagram: the wavefront sweeps the matrix diagonally from cell (0,0) to cell (m,n).)

DEMO: Dependency management
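The wavefront idea can be sketched as follows: cells on the same anti-diagonal (i + j == d) depend only on earlier diagonals, so each diagonal is computed with a parallel loop. This is a sketch of the technique, not the demo's actual code; `Wavefront.EditDistance` is an illustrative name.

```csharp
using System;
using System.Threading.Tasks;

static class Wavefront
{
    public static int EditDistance(string x, string y)
    {
        int m = x.Length, n = y.Length;
        var D = new int[m + 1, n + 1];
        for (int i = 0; i <= m; ++i) D[i, 0] = i; // deletions only
        for (int j = 0; j <= n; ++j) D[0, j] = j; // insertions only

        // Sweep anti-diagonals d = i + j from (1,1) toward (m,n);
        // cells within one diagonal are independent of each other
        for (int d = 2; d <= m + n; ++d)
        {
            int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
            if (iMin > iMax) continue;
            Parallel.For(iMin, iMax + 1, i =>
            {
                int j = d - i;
                int c = x[i - 1] == y[j - 1] ? 0 : 1;
                D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                                   D[i - 1, j - 1] + c);
            });
        }
        return D[m, n];
    }
}
```

Note the trade-off: early and late diagonals are short, so in practice a real implementation would parallelize only diagonals above some length threshold.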

Synchronization > Aggregation

• Excessive synchronization brings parallel code to its knees
– Try to avoid shared state
– Aggregate thread- or task-local state and merge

Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),          //initial local state
    (range, pls, localPrimes) => {  //aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i)) localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => {                //combiner
        lock (primes)
            primes.AddRange(localPrimes);
    });

DEMO: Aggregation

Creative Synchronization

• We implement a collection of stock prices, initialized with 10^5 name/price pairs
– 10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day
– Many reader threads, many writer threads

GET(key):
    if safe contains key then
        return safe[key]
    lock { return unsafe[key] }

PUT(key, value):
    if safe contains key then
        safe[key] = value
    else lock { unsafe[key] = value }
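A rough C# rendering of the pseudocode above, assuming the scheme is: the hot path (reads and updates of the initial keys) goes through a lock-free "safe" dictionary, while the rare "add" path takes a lock around a separate "unsafe" dictionary. `StockPrices` and the field names are illustrative; a `ConcurrentDictionary` stands in for the safe map so that concurrent updates stay correct.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

class StockPrices
{
    // Holds the initial 10^5 pairs; reads and updates need no user lock
    private readonly ConcurrentDictionary<string, double> _safe = new();
    // Rarely-touched overflow for keys added after initialization
    private readonly Dictionary<string, double> _unsafe = new();

    public double Get(string key)
    {
        if (_safe.TryGetValue(key, out var price))
            return price;            // common case: lock-free
        lock (_unsafe)
            return _unsafe[key];     // rare case: locked
    }

    public void Put(string key, double value)
    {
        if (_safe.ContainsKey(key))
        {
            _safe[key] = value;      // "update" write: lock-free
            return;
        }
        lock (_unsafe)
            _unsafe[key] = value;    // "add" write: locked, ~10^3/day
    }
}
```

The point of the slide is the asymmetry: synchronization cost is paid only on the path taken a thousand times a day, not the one taken ten million times a second.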

Lock-Free Patterns (1)

• Try to avoid Windows synchronization and use hardware synchronization
– Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange
– Retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms

int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;
        r = t * y;
    } while (Interlocked.CompareExchange(ref x, r, t) != t);
    return r;
}

(Diagram: Interlocked.CompareExchange stores the new value only if the location still holds the comparand, i.e. the old value.)

Lock-Free Patterns (2)

• User-mode spinlocks (SpinLock class) can replace locks you acquire very often, which protect tiny computations

class __DontUseMe__SpinLock {
    private volatile int _lck;
    public void Enter() {
        while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
    }
    public void Exit() {
        _lck = 0;
    }
}

Miscellaneous Tips (1)

• Don’t mix several concurrency frameworks in the same process
• Some parallel work is best organized in pipelines – TPL DataFlow
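A minimal TPL DataFlow pipeline sketch (the `System.Threading.Tasks.Dataflow` package): a parallel `TransformBlock` stage linked to an `ActionBlock` stage, with completion propagated down the chain. The stage bodies and `PipelineDemo` are placeholders, not code from the talk.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

static class PipelineDemo
{
    public static async Task<List<int>> RunAsync(IEnumerable<string> inputs)
    {
        var results = new List<int>();

        // Stage 1: parse strings, up to 4 items in flight; output order is preserved
        var parse = new TransformBlock<string, int>(
            s => int.Parse(s),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Stage 2: consume sequentially (default DOP = 1), so the list append is safe
        var square = new ActionBlock<int>(n => results.Add(n * n));

        parse.LinkTo(square, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var s in inputs) parse.Post(s);
        parse.Complete();               // no more input
        await square.Completion;        // wait for the whole pipeline to drain
        return results;
    }
}
```

Each stage owns its own concurrency settings, which is exactly what makes pipelines a better fit than a single parallel loop when stages have different costs.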

Miscellaneous Tips (2)

•Some parallel work can be offloaded to the GPU – C++ AMP

void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float,1> avX(n, x), avY(n, y);
    array_view<float,1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}

Miscellaneous Tips (3)

• Invest in SIMD parallelization of heavy math or data-parallel algorithms
– Already available on Mono (Mono.Simd)
• Make sure to take cache effects into account, especially on MP systems

START:
    movups xmm0, [esi+4*ecx]
    addps  xmm0, [edi+4*ecx]
    movups [ebx+4*ecx], xmm0
    sub    ecx, 4
    jns    START
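The SSE snippet above adds two float arrays four lanes per instruction. A managed counterpart can be sketched with `System.Numerics.Vector<float>` (a later API than the Mono.Simd mentioned on the slide, used here as an assumption for a self-contained example); `Vector<float>.Count` is the hardware lane width.

```csharp
using System;
using System.Numerics;

static class SimdAdd
{
    public static void Add(float[] x, float[] y, float[] z)
    {
        int w = Vector<float>.Count, i = 0;

        // Vectorized body: each iteration adds w floats at once
        for (; i <= x.Length - w; i += w)
        {
            var sum = new Vector<float>(x, i) + new Vector<float>(y, i);
            sum.CopyTo(z, i);
        }

        // Scalar tail for the leftover elements
        for (; i < x.Length; ++i)
            z[i] = x[i] + y[i];
    }
}
```

Because the loop walks the arrays sequentially, it is also cache-friendly, which ties into the cache-effects point above.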

Summary

• Avoid shared state and synchronization
• Parallelize judiciously and apply thresholds
• Measure and understand performance gains or losses
• Concurrency and parallelism are still hard
• A body of best practices, tips, patterns, and examples is being built

Additional References

THANK YOU!

Sasha Goldshtein
CTO, Sela Group
blog.sashag.net
@goldshtn
