Task and Data Parallelism


DESCRIPTION

Presentation from DevWeek 2014 on task and data parallelism. This session explains the TPL APIs and then covers various scenarios for extracting concurrency, reducing synchronization, putting thresholds on parallelization, and other topics.


Sasha Goldshtein
CTO, Sela Group


Agenda

• Multicore machines have been a cheap commodity for >10 years
• Adoption of concurrent programming is still slow
• Patterns and best practices are scarce
• We discuss the APIs first…
• …and then turn to examples, best practices, and tips

TPL Evolution

Tasks

• A task is a unit of work
– May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
– Much more than threads, and yet much cheaper

Task<string> t = Task.Factory.StartNew(
    () => { return DnaSimulation(…); });
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();

try {
    //The C# 5.0 version
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
} catch (Exception ex) {
    Show(ex);
}

Parallel Loops

• Ideal for parallelizing work over a collection of data
• Easy porting of for and foreach loops
– Beware of inter-iteration dependencies!

Parallel.For(0, 100, i => {
    ...
});

Parallel.ForEach(urls, url => {
    webClient.Post(url, options, data);
});

Parallel LINQ

• Mind-bogglingly easy parallelization of LINQ queries
• Can introduce ordering into the pipeline, or preserve order of original elements

var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;

query.ForAll(monster => Move(monster));

Measuring Concurrency

•Visual Studio Concurrency Visualizer to the rescue

Recursive Parallelism Extraction

• Divide-and-conquer algorithms are often parallelized through the recursive calls
– Be careful with the parallelization threshold and watch out for dependencies

void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        //Combine the two halves in O(n) time
    }
}

Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2));

DEMO: Recursive parallel QuickSort
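A minimal sketch of what the demo covers: QuickSort whose two partitions are independent, so they can run as parallel tasks, with a fall-back to plain recursion below a size threshold. The names `ParallelQuickSort` and `Threshold` and the threshold value are illustrative, not taken from the talk.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelQuickSort
{
    private const int Threshold = 1024; // below this, task overhead outweighs the gain

    public static void Sort(int[] items) => Sort(items, 0, items.Length - 1);

    private static void Sort(int[] items, int left, int right)
    {
        if (left >= right) return;

        int pivot = Partition(items, left, right);

        if (right - left < Threshold)
        {
            // Small range: sequential recursion, no tasks
            Sort(items, left, pivot - 1);
            Sort(items, pivot + 1, right);
        }
        else
        {
            // Large range: the two partitions are independent, run them in parallel
            Parallel.Invoke(
                () => Sort(items, left, pivot - 1),
                () => Sort(items, pivot + 1, right));
        }
    }

    private static int Partition(int[] items, int left, int right)
    {
        int pivot = items[right], store = left;
        for (int i = left; i < right; ++i)
        {
            if (items[i] < pivot)
            {
                (items[i], items[store]) = (items[store], items[i]);
                ++store;
            }
        }
        (items[store], items[right]) = (items[right], items[store]);
        return store;
    }
}
```

Without the threshold, the recursion spawns a task per tiny sub-array and the scheduling overhead dominates; this is the "parallelization threshold" caveat from the previous slide.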

Symmetric Data Processing

• For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
• Inter-iteration dependencies complicate things (think in-place blur)

Parallel.For(0, image.Rows, i => {
    for (int j = 0; j < image.Cols; ++j) {
        destImage.SetPixel(i, j, PixelBlur(image, i, j));
    }
});

Uneven Work Distribution

• With non-uniform data items, use custom partitioning or manual distribution
– Primes: 7 is easier to check than 10,320,647

var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start+chunk*n, start+chunk*(n+1))));
Task.WaitAll(work.ToArray());

versus

Parallel.ForEach(
    Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2));

DEMO: Uneven workload distribution

Complex Dependency Management

• Must extract all dependencies and incorporate them into the algorithm
– Typical scenarios: 1D loops, dynamic algorithms
– Edit distance: each task depends on 2 predecessors; wavefront computation

C = x[i-1] == y[j-1] ? 0 : 1;
D[i, j] = min(
    D[i-1, j] + 1,
    D[i, j-1] + 1,
    D[i-1, j-1] + C);

(Diagram: the wavefront sweeps the matrix diagonally from cell (0,0) to cell (m,n).)

DEMO: Dependency management
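The wavefront idea can be sketched as follows: cells on the same anti-diagonal (i + j == d) depend only on earlier diagonals, so each diagonal is computed with a parallel loop. This is a sketch of the technique, not the demo's actual code; `Wavefront.EditDistance` is an illustrative name.

```csharp
using System;
using System.Threading.Tasks;

static class Wavefront
{
    public static int EditDistance(string x, string y)
    {
        int m = x.Length, n = y.Length;
        var D = new int[m + 1, n + 1];
        for (int i = 0; i <= m; ++i) D[i, 0] = i; // deletions only
        for (int j = 0; j <= n; ++j) D[0, j] = j; // insertions only

        // Sweep anti-diagonals d = i + j from (1,1) toward (m,n);
        // cells within one diagonal are independent of each other
        for (int d = 2; d <= m + n; ++d)
        {
            int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
            if (iMin > iMax) continue;
            Parallel.For(iMin, iMax + 1, i =>
            {
                int j = d - i;
                int c = x[i - 1] == y[j - 1] ? 0 : 1;
                D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                                   D[i - 1, j - 1] + c);
            });
        }
        return D[m, n];
    }
}
```

Note the trade-off: early and late diagonals are short, so in practice a real implementation would parallelize only diagonals above some length threshold.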

Synchronization > Aggregation

• Excessive synchronization brings parallel code to its knees
– Try to avoid shared state
– Aggregate thread- or task-local state and merge

Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),          //initial local state
    (range, pls, localPrimes) => {  //aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i)) localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => {                //combiner
        lock (primes)
            primes.AddRange(localPrimes);
    });

DEMO: Aggregation

Creative Synchronization

• We implement a collection of stock prices, initialized with 10^5 name/price pairs
– 10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day
– Many reader threads, many writer threads

GET(key):
    if safe contains key then
        return safe[key]
    lock { return unsafe[key] }

PUT(key, value):
    if safe contains key then
        safe[key] = value
    else lock { unsafe[key] = value }
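A rough C# rendering of the pseudocode above, assuming the scheme is: the hot path (reads and updates of the initial keys) goes through a lock-free "safe" dictionary, while the rare "add" path takes a lock around a separate "unsafe" dictionary. `StockPrices` and the field names are illustrative; a `ConcurrentDictionary` stands in for the safe map so that concurrent updates stay correct.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

class StockPrices
{
    // Holds the initial 10^5 pairs; reads and updates need no user lock
    private readonly ConcurrentDictionary<string, double> _safe = new();
    // Rarely-touched overflow for keys added after initialization
    private readonly Dictionary<string, double> _unsafe = new();

    public double Get(string key)
    {
        if (_safe.TryGetValue(key, out var price))
            return price;            // common case: lock-free
        lock (_unsafe)
            return _unsafe[key];     // rare case: locked
    }

    public void Put(string key, double value)
    {
        if (_safe.ContainsKey(key))
        {
            _safe[key] = value;      // "update" write: lock-free
            return;
        }
        lock (_unsafe)
            _unsafe[key] = value;    // "add" write: locked, ~10^3/day
    }
}
```

The point of the slide is the asymmetry: synchronization cost is paid only on the path taken a thousand times a day, not the one taken ten million times a second.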

Lock-Free Patterns (1)

• Try to avoid Windows synchronization and use hardware synchronization
– Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange
– Retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms

int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;
        r = t * y;
    } while (Interlocked.CompareExchange(ref x, r, t) != t);
    return r;
}

(Diagram: Interlocked.CompareExchange stores the new value only if the location still holds the comparand, i.e. the old value.)

Lock-Free Patterns (2)

• User-mode spinlocks (SpinLock class) can replace locks you acquire very often, which protect tiny computations

class __DontUseMe__SpinLock {
    private volatile int _lck;
    public void Enter() {
        while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
    }
    public void Exit() {
        _lck = 0;
    }
}

Miscellaneous Tips (1)

• Don’t mix several concurrency frameworks in the same process
• Some parallel work is best organized in pipelines – TPL DataFlow
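A minimal TPL DataFlow pipeline sketch (the `System.Threading.Tasks.Dataflow` package): a parallel `TransformBlock` stage linked to an `ActionBlock` stage, with completion propagated down the chain. The stage bodies and `PipelineDemo` are placeholders, not code from the talk.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

static class PipelineDemo
{
    public static async Task<List<int>> RunAsync(IEnumerable<string> inputs)
    {
        var results = new List<int>();

        // Stage 1: parse strings, up to 4 items in flight; output order is preserved
        var parse = new TransformBlock<string, int>(
            s => int.Parse(s),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Stage 2: consume sequentially (default DOP = 1), so the list append is safe
        var square = new ActionBlock<int>(n => results.Add(n * n));

        parse.LinkTo(square, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var s in inputs) parse.Post(s);
        parse.Complete();               // no more input
        await square.Completion;        // wait for the whole pipeline to drain
        return results;
    }
}
```

Each stage owns its own concurrency settings, which is exactly what makes pipelines a better fit than a single parallel loop when stages have different costs.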

Miscellaneous Tips (2)

•Some parallel work can be offloaded to the GPU – C++ AMP

void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float,1> avX(n, x), avY(n, y);
    array_view<float,1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}

Miscellaneous Tips (3)

• Invest in SIMD parallelization of heavy math or data-parallel algorithms
– Already available on Mono (Mono.Simd)
• Make sure to take cache effects into account, especially on MP systems

START:
    movups xmm0, [esi+4*ecx]
    addps  xmm0, [edi+4*ecx]
    movups [ebx+4*ecx], xmm0
    sub    ecx, 4
    jns    START
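The SSE snippet above adds two float arrays four lanes per instruction. A managed counterpart can be sketched with `System.Numerics.Vector<float>` (a later API than the Mono.Simd mentioned on the slide, used here as an assumption for a self-contained example); `Vector<float>.Count` is the hardware lane width.

```csharp
using System;
using System.Numerics;

static class SimdAdd
{
    public static void Add(float[] x, float[] y, float[] z)
    {
        int w = Vector<float>.Count, i = 0;

        // Vectorized body: each iteration adds w floats at once
        for (; i <= x.Length - w; i += w)
        {
            var sum = new Vector<float>(x, i) + new Vector<float>(y, i);
            sum.CopyTo(z, i);
        }

        // Scalar tail for the leftover elements
        for (; i < x.Length; ++i)
            z[i] = x[i] + y[i];
    }
}
```

Because the loop walks the arrays sequentially, it is also cache-friendly, which ties into the cache-effects point above.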

Summary

• Avoid shared state and synchronization
• Parallelize judiciously and apply thresholds
• Measure and understand performance gains or losses
• Concurrency and parallelism are still hard
• A body of best practices, tips, patterns, and examples is being built

Additional References

THANK YOU!

Sasha Goldshtein
CTO, Sela Group
blog.sashag.net
@goldshtn
