
Task and Data Parallelism


DESCRIPTION

Presentation from DevWeek 2014 on task and data parallelism. This session explains the TPL APIs and then covers various scenarios for extracting concurrency, reducing synchronization, putting thresholds on parallelization, and other topics.


Page 1: Task and Data Parallelism

Sasha Goldshtein
CTO, Sela Group

Task and Data Parallelism

Page 2: Task and Data Parallelism

Agenda

• Multicore machines have been a cheap commodity for >10 years
• Adoption of concurrent programming is still slow
• Patterns and best practices are scarce
• We discuss the APIs first…
• …and then turn to examples, best practices, and tips

Page 3: Task and Data Parallelism

TPL Evolution

Page 4: Task and Data Parallelism

Tasks

• A task is a unit of work
  – May be executed in parallel with other tasks by a scheduler (e.g. the Thread Pool)
  – Much more than threads, and yet much cheaper

Task<string> t = Task.Factory.StartNew(
    () => { return DnaSimulation(…); });
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();

try {
    //The C# 5.0 version
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
} catch (Exception ex) {
    Show(ex);
}

Page 5: Task and Data Parallelism

Parallel Loops

• Ideal for parallelizing work over a collection of data
• Easy porting of for and foreach loops
  – Beware of inter-iteration dependencies!

Parallel.For(0, 100, i => {
    ...
});

Parallel.ForEach(urls, url => {
    webClient.Post(url, options, data);
});

Page 6: Task and Data Parallelism

Parallel LINQ

• Mind-bogglingly easy parallelization of LINQ queries
• Can introduce ordering into the pipeline, or preserve the order of the original elements

var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;

query.ForAll(monster => Move(monster));
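The query above uses orderby to impose an ordering inside the parallel pipeline. If you only need to keep the original element order, AsOrdered() does that; a minimal sketch, reusing the same monsters collection and SimulateMovement helper from the query above:

//Preserve the source order through the parallel pipeline with AsOrdered();
//results come out in the original order, at some buffering cost
var ordered = monsters.AsParallel()
                      .AsOrdered()
                      .Where(m => m.IsAttacking)
                      .Select(m => SimulateMovement(m));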

Page 7: Task and Data Parallelism

Measuring Concurrency

• Visual Studio Concurrency Visualizer to the rescue

Page 8: Task and Data Parallelism

Recursive Parallelism Extraction

• Divide-and-conquer algorithms are often parallelized through the recursive call
  – Be careful with the parallelization threshold and watch out for dependencies

void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        //Combine the two halves in O(n) time
    }
}

Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2));

Page 9: Task and Data Parallelism

DEMO: Recursive parallel QuickSort
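The demo code itself is not reproduced in this transcript. A minimal sketch of what a recursive parallel QuickSort with a parallelization threshold might look like (Partition is an assumed in-place partition helper returning the pivot index, and the threshold value is arbitrary):

const int Threshold = 4096;

//Recurse in parallel only while the range is large enough;
//below the threshold, sequential recursion avoids task overhead
void ParallelQuickSort(int[] a, int left, int right) {
    if (left >= right) return;
    int pivot = Partition(a, left, right);  //assumed helper
    if (right - left < Threshold) {
        ParallelQuickSort(a, left, pivot - 1);
        ParallelQuickSort(a, pivot + 1, right);
    } else {
        Parallel.Invoke(
            () => ParallelQuickSort(a, left, pivot - 1),
            () => ParallelQuickSort(a, pivot + 1, right));
    }
}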

Page 10: Task and Data Parallelism

Symmetric Data Processing

• For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
• Inter-iteration dependencies complicate things (think in-place blur)

Parallel.For(0, image.Rows, i => {
    for (int j = 0; j < image.Cols; ++j) {
        destImage.SetPixel(i, j, PixelBlur(image, i, j));
    }
});

Page 11: Task and Data Parallelism

Uneven Work Distribution

• With non-uniform data items, use custom partitioning or manual distribution
  – Primes: 7 is easier to check than 10,320,647

var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start + chunk*n, start + chunk*(n+1))));
Task.WaitAll(work.ToArray());

versus

Parallel.ForEach(
    Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2));

Page 12: Task and Data Parallelism

DEMO: Uneven workload distribution

Page 13: Task and Data Parallelism

Complex Dependency Management

• Must extract all dependencies and incorporate them into the algorithm
  – Typical scenarios: 1D loops, dynamic programming algorithms
  – Edit distance: each task depends on 2 predecessors, wavefront computation

C = x[i-1] == y[i-1] ? 0 : 1;
D[i, j] = min(
    D[i-1, j] + 1,
    D[i, j-1] + 1,
    D[i-1, j-1] + C);

[Diagram: the D table is filled from cell (0,0) to cell (m,n), with parallelism along each anti-diagonal wavefront]
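A minimal sketch of the wavefront approach described above: cells on the same anti-diagonal are independent of each other, so each diagonal can be computed with a parallel loop while the diagonals themselves run in order (x and y are the assumed input strings; a real implementation would chunk each diagonal rather than parallelize per cell):

int EditDistance(string x, string y) {
    int m = x.Length, n = y.Length;
    var D = new int[m + 1, n + 1];
    for (int i = 0; i <= m; ++i) D[i, 0] = i;
    for (int j = 0; j <= n; ++j) D[0, j] = j;
    for (int d = 2; d <= m + n; ++d) {          //anti-diagonal: all (i, j) with i + j == d
        int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
        if (iMin > iMax) continue;
        Parallel.For(iMin, iMax + 1, i => {     //cells on one diagonal are independent
            int j = d - i;
            int c = x[i - 1] == y[j - 1] ? 0 : 1;
            D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                               D[i - 1, j - 1] + c);
        });
    }
    return D[m, n];
}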

Page 14: Task and Data Parallelism

DEMO: Dependency management

Page 15: Task and Data Parallelism

Synchronization > Aggregation

• Excessive synchronization brings parallel code to its knees
  – Try to avoid shared state
  – Aggregate thread- or task-local state and merge

Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),              //initial local state
    (range, pls, localPrimes) => {      //aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i))
                localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => {                    //combiner
        lock (primes)
            primes.AddRange(localPrimes);
    });

Page 16: Task and Data Parallelism

DEMO: Aggregation

Page 17: Task and Data Parallelism

Creative Synchronization

• We implement a collection of stock prices, initialized with 10^5 name/price pairs
  – 10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day
  – Many reader threads, many writer threads

GET(key):
  if safe contains key then return safe[key]
  lock { return unsafe[key] }

PUT(key, value):
  if safe contains key then safe[key] = value
  else lock { unsafe[key] = value }
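A hedged C# sketch of the pattern above, assuming the key set of the pre-populated dictionary never changes after construction (so lock-free lookups and value overwrites of existing keys are acceptable for this workload, as in the pseudocode) and that the rare additions go through a lock:

using System.Collections.Generic;

class StockPrices {
    private readonly Dictionary<string, double> safe;       //pre-populated; key set never changes
    private readonly Dictionary<string, double> @unsafe =   //rare "add" writes, protected by a lock
        new Dictionary<string, double>();
    private readonly object sync = new object();

    public StockPrices(IDictionary<string, double> initialPrices) {
        safe = new Dictionary<string, double>(initialPrices);
    }

    public double Get(string key) {
        double price;
        if (safe.TryGetValue(key, out price)) return price; //no lock on the hot path
        lock (sync) { return @unsafe[key]; }
    }

    public void Put(string key, double value) {
        if (safe.ContainsKey(key)) { safe[key] = value; return; } //overwrite does not change structure
        lock (sync) { @unsafe[key] = value; }
    }
}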

Page 18: Task and Data Parallelism

Lock-Free Patterns (1)

• Try to avoid Windows synchronization and use hardware synchronization
  – Primitive operations such as Interlocked.Increment, Interlocked.CompareExchange
  – Retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms

int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;
        r = t * y;
    } while (Interlocked.CompareExchange(ref x, r, t) != t);
    return r;
}

[Diagram: Interlocked.CompareExchange operands — old value, new value, comparand]

Page 19: Task and Data Parallelism

Lock-Free Patterns (2)

• User-mode spinlocks (the SpinLock class) can replace locks that you acquire very often and that protect tiny computations

class __DontUseMe__SpinLock {
    private volatile int _lck;
    public void Enter() {
        while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
    }
    public void Exit() {
        _lck = 0;
    }
}

Page 20: Task and Data Parallelism

Miscellaneous Tips (1)

• Don’t mix several concurrency frameworks in the same process
• Some parallel work is best organized in pipelines – TPL DataFlow
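A minimal sketch of such a pipeline with TPL DataFlow (requires the System.Threading.Tasks.Dataflow package; ParseRecord, Record, and Save are hypothetical stand-ins for your own stages):

using System.Threading.Tasks.Dataflow;

//Stage 1 parses lines in parallel; stage 2 consumes the parsed records
var parse = new TransformBlock<string, Record>(
    line => ParseRecord(line),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });
var store = new ActionBlock<Record>(record => Save(record));

parse.LinkTo(store, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var line in File.ReadLines("input.txt"))
    parse.Post(line);
parse.Complete();
store.Completion.Wait();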

Page 21: Task and Data Parallelism

Miscellaneous Tips (2)

• Some parallel work can be offloaded to the GPU – C++ AMP

void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float,1> avX(n, x), avY(n, y);
    array_view<float,1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}

Page 22: Task and Data Parallelism

Miscellaneous Tips (3)

• Invest in SIMD parallelization of heavy math or data-parallel algorithms
  – Already available on Mono (Mono.Simd)
• Make sure to take cache effects into account, especially on MP systems

START: movups xmm0, [esi+4*ecx]
       addps  xmm0, [edi+4*ecx]
       movups [ebx+4*ecx], xmm0
       sub    ecx, 4
       jns    START
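For comparison, a hedged sketch of the same vector addition written with Mono.Simd (assumes a reference to Mono.Simd.dll and an array length that is a multiple of four; production code would load vectors directly from the arrays instead of constructing them element by element):

using Mono.Simd;

//Add four floats at a time; Vector4f maps to an SSE register under Mono's JIT
static void VectorAdd(float[] x, float[] y, float[] z, int n) {
    for (int i = 0; i < n; i += 4) {
        var vx = new Vector4f(x[i], x[i+1], x[i+2], x[i+3]);
        var vy = new Vector4f(y[i], y[i+1], y[i+2], y[i+3]);
        Vector4f vz = vx + vy;
        z[i] = vz.X; z[i+1] = vz.Y; z[i+2] = vz.Z; z[i+3] = vz.W;
    }
}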

Page 23: Task and Data Parallelism

Summary

• Avoid shared state and synchronization
• Parallelize judiciously and apply thresholds
• Measure and understand performance gains or losses
• Concurrency and parallelism are still hard
• A body of best practices, tips, patterns, and examples is being built

Page 24: Task and Data Parallelism

Additional References

Page 25: Task and Data Parallelism

THANK YOU!

Sasha Goldshtein
CTO, Sela Group
blog.sashag.net
@goldshtn