Attacking the programming model wall
Marco Danelutto
Dept. Computer Science, Univ. of Pisa
Belfast, February 28th 2013
Setting the scenario (HW)
Market pressure, HW advances, power wall.

Market pressure
• Moore law: from components to cores
• New needs: gesture/voice interaction, 3D graphics
• Supercomputing: new applications, larger data sizes
Multicores
Moore law: from components to cores. Simpler cores, shared memory, cache coherent, full interconnect.

Name       Cores  Contexts  Sockets  Cores x board
AMD 6176     12      1         4          48
E5-4650       8      2         4          64
SPARC T4      8      8         4         512
Manycores
Even simpler cores, shared memory, cache coherent, regular interconnection, co-processors (via PCIe). Options for cache coherence, more complex inter-core communication protocols.

Name         Core   Cores  Contexts  Intercon  Mem controllers
TileraPro64  VLIW     64      64       Mesh          4
Intel PHI    IA-64    60     240       Ring       8 (2 way)
GPUs
ALUs + instruction sequencers, large and fast memory access, co-processors (via PCIe). Data parallel computations only.

Name          Cores  Mem interface  Mem bandwidth
nVidia C2075    448     384 bit      144 GB/s
nVidia K20X    2688     384 bit      250 GB/s
FPGA
Low scale manufacturing. Accelerators (mission critical sw), GP computing (PCIe co-processors, CPU socket replacement), possibly hosting GP CPU/cores. Non-standard programming tools.

Name        Cells      Block RAM  Mem bandwidth
Artix 7       215,000    13 Mb     1,066 MB/s
Virtex 7    2,000,000    68 Mb     1,866 MB/s
Power wall
• Power cost > hw cost
• Thermal dissipation cost
• FLOP/Watt is "the must"
Power wall (2)
• Reducing idle costs
◦ E4 CARMA CLUSTER: ARM + nVIDIA, spare Watt → GPU
• Reducing the cooling costs
◦ Eurotech AURORA TIGON: Intel technology, water cooling, spare Watt → CPU
Setting the scenario (SW)
• Close to metal programming models
• SIMD / MIMD abstraction programming models
• High level, high productivity programming models

Programming models
Low abstraction level vs. high abstraction level:
Low abstraction level
Pros
◦ Performance / efficiency
◦ Heterogeneous hw targeting
Cons
◦ Huge application programmer responsibilities
◦ Portability (functional, performance)
◦ Quantitative parallelism exploitation

High abstraction level
Pros
◦ Expressive power
◦ Separation of concerns
◦ Qualitative parallelism exploitation
Cons
◦ Performance / efficiency
◦ Hw targeting
Separation of concerns
Functional
◦ What has to be computed
◦ Function from input data to output data
◦ Domain specific
◦ Application dependent
Non functional
◦ How the result is computed
◦ Parallelism, power management, security, fault tolerance, …
◦ Target hw specific
◦ Factorizable
◦ Supported programming paradigms
Current programming frameworks
• OpenCL
• OpenMP
• MPI
• CILK
• TBB
[Chart: frameworks positioned by expressive power; market pressures and HW advances push toward low level programming models]
Urgencies
Need for:
a) Parallel programming models
b) Parallel programmers
Structured parallel programming

Algorithmic skeletons (from the HPC community)
◦ Started in the early '90s (M. Cole's PhD thesis)
◦ Pre-defined parallel patterns, exposed to programmers as programming constructs/lib calls

Parallel design patterns (from the SW engineering community)
◦ Started in the early '00s
◦ "Recipes" to handle parallelism (name, problem, forces, solutions, …)

Both rely on compiling tools + run time systems, a clear programming model, high level abstractions.

Similarities
• Common, parametric, reusable parallelism exploitation patterns (from HPC community)
• Exposed to programmers as constructs, library calls, objects, higher order functions, components, ...
• Composable
◦ Two tier model: "stream parallel" skeletons with inner "data parallel" skeletons

Algorithmic skeletons
Sample classical skeletons
Stream parallel
◦ Parallel computation of different items from an input stream
◦ Task farm (master/worker), pipeline
Data parallel
◦ Parallel computation on (possibly overlapped) partitions of the same input data
◦ Map, stencil, reduce, scan, mapreduce
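As a sequential-semantics illustration (my example, not from the slides), the data parallel skeletons above can be read as higher order functions: map applies a function independently to every element, reduce folds all elements with an associative operator. A minimal C++ sketch:

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Sequential semantics of two data parallel skeletons.
// skel_map: apply f independently to every element (parallelizable per element).
template <typename T, typename F>
std::vector<T> skel_map(const std::vector<T>& in, F f) {
    std::vector<T> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
    return out;
}

// skel_reduce: combine all elements with an associative operator op.
template <typename T, typename Op>
T skel_reduce(const std::vector<T>& in, T init, Op op) {
    return std::accumulate(in.begin(), in.end(), init, op);
}
```

A parallel implementation is free to split the input into partitions precisely because each map application is independent and the reduce operator is associative.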
Evolution of the concept
'90 • Complex patterns, no composition • Targeting clusters • Mostly libraries (RTS)
'00 • Simple data/stream parallel patterns • Composable, targeting COW/NOW • Libraries + first compilers
'10 • Optimized, composable building blocks • Targeting clusters of heterogeneous multicores • Quite complex tool chain (compiler + RTS)
Evolution of the concept (2)
'90 • Cole's PhD thesis skeletons • P3L (Pisa) • SCL (Imperial College London)
'00 • Lithium/Muskel (Pisa), Skandium (INRIA) • Muesli (Muenster), SkeTo (Tokyo) • OSL (Orleans), Mallba (La Laguna)
'10 • SkePU (Linkoping) • FastFlow (Pisa/Torino) • TBB? (Intel), TPL? (Microsoft)
Implementing skeletons

Template based
◦ Skeleton implemented by instantiating a "concurrent activity graph template"
◦ Performance models used to instantiate quantitative parameters
◦ P3L, Muesli, SkeTo, FastFlow

Macro Data Flow based
◦ Skeleton program compiled to macro data flow graphs
◦ Rewriting/refactoring compiling process
◦ Parallel MDF graph interpreter
◦ Muskel, Skipper, Skandium
Refactoring skeletons
Formally proven rewriting rules:
Farm(Δ) = Δ
Pipe(Δ1, Δ2) = SeqComp(Δ1, Δ2)
Pipe(Map(Δ1), Map(Δ2)) = Map(Pipe(Δ1, Δ2))
Sample refactoring: normal form
• Pipe(Farm(Δ1), Δ2): service time = max_{i=1,2} { stage_i }; Nw = nw(farm) + 1
• Pipe(Δ1, Δ2): higher service time; Nw = 2
• SeqComp(Δ1, Δ2): sequential service time; Nw = 1
• Farm(SeqComp(Δ1, Δ2)): service time < original, with fewer resources (normal form)
Performance modelling
Sample performance models:
• Pipeline service time: max_{i=1..k} { serviceTime(Stage_i) }
• Pipeline latency: Σ_{i=1..k} { serviceTime(Stage_i) }
• Farm service time: max { taskSchedTime, resGathTime, workerTime / #workers }
• Map latency: partitionTime + workerTime + gatherTime
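These models are simple enough to evaluate directly. A minimal C++ sketch (function names are mine, not from the talk), term by term after the formulas above:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Pipeline service time: in steady state the slowest stage dominates throughput.
double pipeServiceTime(const std::vector<double>& stage) {
    return *std::max_element(stage.begin(), stage.end());
}

// Pipeline latency: a single item traverses every stage in sequence.
double pipeLatency(const std::vector<double>& stage) {
    return std::accumulate(stage.begin(), stage.end(), 0.0);
}

// Farm service time: bounded by task scheduling, result gathering,
// or the aggregate worker throughput (workerTime / #workers).
double farmServiceTime(double sched, double gather, double worker, int nw) {
    return std::max({sched, gather, worker / nw});
}
```

Note how the farm model explains the normal form refactoring: increasing nw lowers worker / nw until scheduling or gathering becomes the bottleneck.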
Key strengths
• Full parallel structure of the application exposed to the skeleton framework
◦ Exploited by optimizations, support for autonomic non functional concern management
• Framework responsibility for architecture targeting
◦ Write once, run everywhere code, with architecture specific compiler and back end (run time) tools
• Only functional debugging required of application programmers
Assessments
• Expressive power: reduces time to deploy
• Parallel structure exposed: guarantees performance
• Separation of concerns: ideally, the application programmer handles the WHAT, the system programmer the HOW
• Inversion of control: structure suggested, interpreted by tools
• Performance: close to hand coded programs, at a fraction of the development time
Parallel design patterns
• Carefully describe a parallelism exploitation pattern, including:
◦ Applicability
◦ Forces
◦ Possible implementations/problem solutions
• As text, at different levels of abstraction
Pattern spaces
• Finding concurrency
• Algorithm space
• Supporting structure
• Implementation mechanism

Collapsed in algorithmic skeletons:
◦ application programmer → concurrency and algorithm spaces
◦ skeleton implementation (system programmer) → support structures and implementation mechanisms
Structured parallel programming: design patterns
Problem → design patterns (follow, learn, use) → programming tools → low level code

Structured parallel programming: skeletons
Problem → skeleton library (instantiate, compose) → high level code

Structured parallel programming
Problem → design patterns (use knowledge to instantiate, compose) → skeletons → high level code
Working unstructured
Tradeoffs:
◦ CPU/GPU threads
◦ Processes/threads
◦ Coarse/fine grain tasks
Target architecture dependent decisions on the concurrent activity set: threads/processes, synchronization, memory.
Thread/processes
• Creation
◦ Thread pool vs. on-the-fly creation
• Pinning
◦ Operating system dependent effectiveness
• Memory management
◦ Embarrassingly parallel patterns may benefit from process memory space separation (see the Memory slide, next)
Memory
• Cache friendly algorithms
◦ Minimization of cache coherency traffic
◦ Data alignment/padding
• Memory wall
◦ 1-2 memory interfaces per 4-8 cores
◦ 4-8 memory interfaces per 60-64 cores (+ internal routing)
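To illustrate why data alignment/padding reduces coherency traffic (my example, not from the slides): two per-thread counters that share a cache line force the cores to invalidate each other's line on every update (false sharing). A common C++ remedy is to pad each counter to a cache line; 64 bytes is assumed as the typical line size:

```cpp
#include <cstddef>

// Pad each per-thread counter to a (typical) 64-byte cache line, so two
// threads updating neighbouring counters do not invalidate each other's
// cache line on every write (false sharing).
struct alignas(64) PaddedCounter {
    long value = 0;
};
```

With alignas(64), sizeof(PaddedCounter) is rounded up to 64, so adjacent array elements land on distinct cache lines.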
Synchronization
• High level, general purpose mechanisms
◦ Passive wait
◦ High latency
• Low level mechanisms
◦ Active wait
◦ Smaller latency
• Eventually
◦ Synchronization on memory (fences)
Devising parallelism degree
• Ideally
◦ As many parallel activities as necessary to sustain the input data rate
• Base measures
◦ Estimated input pressure & task processing time, communication overhead
• Compile vs. run time choices
◦ Try to devise statically some optimal values
◦ Adjust initial settings dynamically based on observations
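A back-of-the-envelope version of the static choice (the formula and names are mine): to sustain one new task every interArrival time units, enough workers are needed that their aggregate service rate matches the input rate.

```cpp
#include <cmath>

// Minimum number of parallel activities so that nw workers, each taking
// taskTime per task, sustain an input rate of one task per interArrival
// time units: nw = ceil(taskTime / interArrival).
int parDegree(double taskTime, double interArrival) {
    return static_cast<int>(std::ceil(taskTime / interArrival));
}
```

A run time adjustment would start from this static estimate and then add or remove workers based on the observed inter-arrival and service times.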
NUMA memory exploitation
• Auto scheduling
◦ Idle workers require tasks from a "global" queue
◦ Far nodes require fewer than near ones
• Affinity scheduling
◦ Tasks scheduled on the producing cores
• Round robin allocation of dynamically allocated chunks
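The round robin allocation policy can be sketched in a few lines (my illustration, assuming chunks are simply interleaved across NUMA nodes so that no single memory controller becomes the bottleneck):

```cpp
#include <vector>

// Interleave dynamically allocated chunks across NUMA nodes round robin:
// chunk i is placed on node i % nNodes, spreading the memory traffic
// evenly over the available memory controllers.
std::vector<int> roundRobinPlacement(int nChunks, int nNodes) {
    std::vector<int> node(nChunks);
    for (int i = 0; i < nChunks; ++i)
        node[i] = i % nNodes;
    return node;
}
```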
More separation of concerns
• Algorithmic code: computing the application results out of the input data
• Non functional code: programming performance, security, fault tolerance, power management
Behavioural skeletons
Structured parallel algorithm code (e.g. a Pipe whose stages include Seq nodes and a Map of Seqs) exposes sensors & actuators; an NFC autonomic manager reads them and runs an ECA rule based program.
• Autonomic manager: executes a MAPE loop. At each iteration, an ECA (Event Condition Action) rule system is executed using monitored values and possibly operating actions on the structured parallel pattern.
• Sensors: determine what can be perceived of the computation.
• Actuators: determine what can be affected/changed in the computation.
Sample rules
• Event: inter-arrival time changes. Condition: faster than service time. Action: increase the parallelism degree.
• Event: fault at worker. Condition: service time low. Action: recruit new worker resource.
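One iteration of such a rule system can be sketched in plain C++ (all names are mine; actual behavioural skeleton managers are richer than this): each rule fires its action when its event occurs and its condition holds over the monitored values.

```cpp
#include <functional>
#include <vector>

// Monitored values fed to the rule system at each MAPE iteration.
struct Monitor { double interArrival; double serviceTime; int nw; };

// An ECA rule: when the event fires and the condition holds, run the action.
struct Rule {
    std::function<bool(const Monitor&)> event;
    std::function<bool(const Monitor&)> condition;
    std::function<void(Monitor&)> action;
};

// The Analyze/Plan/Execute part of one MAPE loop iteration.
void mapeStep(Monitor& m, const std::vector<Rule>& rules) {
    for (const auto& r : rules)
        if (r.event(m) && r.condition(m))
            r.action(m);
}
```

The first sample rule above becomes: event = arrivals observed, condition = inter-arrival time below service time, action = increment the parallelism degree (here the actuator is just the nw field).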
BS assessments
Yes, nice, but then?
We have MPI, OpenMP, CUDA, OpenCL …
FastFlow
Full C++, skeleton based, streaming parallel processing framework.
• Streaming network patterns: Pipeline, Farm; composable, customizable
• Arbitrary streaming networks: lock free SPMC, MPSC & MPMC queues
• Simple streaming networks: lock free SPSC queue, general threading model
http://mc-fastflow.sourceforge.net
Bring skeletons to your desk
• Full POSIX/C++ compliancy
◦ g++, make, gprof, gdb, pthread, …
• Reuse existing code
◦ Proper wrappers
• Run from laptops to clusters & clouds
◦ Same skeleton structure
Basic abstraction: ff_node

class RedEye: public ff_node {
  …
  int svc_init() { … }
  void svc_end() { … }
  void * svc(void * task) {
    Image * in = (Image *) task;
    Image * out = ….
    return ((void *) out);
  }
  …
};

An ff_node reads tasks from its input channel, processes them in svc, and delivers results on its output channel.
Basic stream parallel skeletons
• Farm(Worker, Nw)
◦ Embarrassingly parallel computations on streams
◦ Computing Worker in parallel (Nw copies)
◦ Emitter + string of workers + Collector implementation
• Pipeline(Stage1, …, StageN)
◦ StageK processes the output of Stage(K-1) and delivers to Stage(K+1)
• Feedback(Skel, Cond)
◦ Routes back results from Skel to the input, or forward to the output, depending on Cond
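To make the farm structure concrete, here is a minimal plain C++ sketch (my code, not FastFlow: std::thread and a mutex-protected queue stand in for FastFlow's lock free channels): an emitter loads the stream into a task queue, Nw workers compute in parallel, and results are gathered as by a collector.

```cpp
#include <algorithm>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A toy Farm(worker, nw): emitter -> shared task queue -> nw workers -> results.
// FastFlow uses lock free SPSC queues; a single mutex keeps this sketch short.
std::vector<int> farm(const std::vector<int>& stream,
                      int (*worker)(int), int nw) {
    std::queue<int> tasks;
    for (int t : stream) tasks.push(t);         // emitter: load the stream
    std::mutex m;
    std::vector<int> results;
    std::vector<std::thread> pool;
    for (int i = 0; i < nw; ++i)
        pool.emplace_back([&] {
            for (;;) {
                int t;
                {
                    std::lock_guard<std::mutex> lk(m);
                    if (tasks.empty()) return;  // no more tasks: worker exits
                    t = tasks.front(); tasks.pop();
                }
                int r = worker(t);              // compute outside the lock
                std::lock_guard<std::mutex> lk(m);
                results.push_back(r);           // collector: gather the result
            }
        });
    for (auto& th : pool) th.join();
    return results;
}
```

Results arrive in nondeterministic order, which is exactly why stream parallel frameworks either tolerate out-of-order results or add an ordering collector.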
Setting up a pipeline

ff_pipeline myImageProcessingPipe;

ff_node * startNode = new Reader(…);
ff_node * redEye = new RedEye();
ff_node * light = new LightCalibration();
ff_node * sharpen = new Sharpen();
ff_node * endNode = new Writer(…);

myImageProcessingPipe.addStage(startNode);
myImageProcessingPipe.addStage(redEye);
myImageProcessingPipe.addStage(light);
myImageProcessingPipe.addStage(sharpen);
myImageProcessingPipe.addStage(endNode);

myImageProcessingPipe.run_and_wait_end();
Refactoring (farm introduction)

ff_farm<> thirdStage;
std::vector<ff_node *> w;
for (int i = 0; i < nworkers; ++i)
  w.push_back(new Sharpen());
thirdStage.add_workers(w);
…
// myImageProcessingPipe.addStage(sharpen);   // the sequential stage …
myImageProcessingPipe.addStage(&thirdStage);  // … is replaced by the farm
…
Refactoring (map introduction)

ff_farm<> thirdStage;
std::vector<ff_node *> w;
for (int i = 0; i < nworkers; ++i)
  w.push_back(new Sharpen());
thirdStage.add_workers(w);
Emitter em;    // scatters data to the workers
Collector co;  // collects results from the workers
thirdStage.add_emitter(&em);
thirdStage.add_collector(&co);
…
// myImageProcessingPipe.addStage(sharpen);   // the sequential stage …
myImageProcessingPipe.addStage(&thirdStage);  // … is replaced by the map
…
FastFlow accelerator
1. Create a suitable skeleton accelerator
2. Offload tasks from the main (sequential) business logic code
3. The accelerator exploits the "spare cores" on your machine
ff_farm<> farm(true);   // create the accelerator
std::vector<ff_node *> w;
for (int i = 0; i < nworkers; ++i)
  w.push_back(new Worker);
farm.add_workers(w);
farm.add_collector(new Collector);
farm.run_then_freeze();

while (…) {
  …
  farm.offload(x);   // offload tasks
  …
}
…
while (farm.load_result(&res)) {
  …   // eventually process results
}
Scalability (vs. PLASMA): Farm vs. Macro Data Flow
[Plots for 0.5 ms, 5 ms and 50 ms tasks]
FastFlow evolution
• Data parallelism
◦ Generalized map
◦ Reduce
• Distributed version
◦ Transport level channels
◦ Distributed orchestration of FF applications
• Heterogeneous hw targeting
◦ Simple map & reduce offloading
◦ Fair scheduling to CPU/GPU cores
Cloud offloading
The application runs on a COW, but with not enough throughput. It is monitored on a single cloud resource; a performance model then gives the number of necessary cloud resources, yielding a distributed version (local + cloud) with the required throughput.

Cloud offloading (2)
Performance modelling of percentage of tasks to offload (in a map or in a reduce)
GPU offloading
GPU map offloading
FastFlow on Tilera Pro64
• FastFlow ported to Tilera Pro64 (paper this afternoon)
• With minimal intervention to ensure functional portability
• And some more changes made to support better hw optimizations

Moving to MIC
Conclusions
Thanks to Marco Aldinucci, Massimo Torquati, Peter Kilpatrick, Sonia Campa, Giorgio Zoppi, Daniele Buono, Silvia Lametti, Tudor Serban
Any questions?
marcod@di.unipi.it