Dataflow and higher level abstractions for parallel programming
Jeronimo Castrillon
Chair for Compiler Construction (CCC)
TU Dresden, Germany
Cyber-Physical Systems Summer School 2019
Alghero, Sardinia, Italy. September 25, 2019
Evolution of Hardware
© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019
[Figure: microprocessor trend data: transistors (thousands), single-thread performance, frequency (MHz), typical power (W), core count]
M. Horowitz, F. Labonte, et al. Dotted-line by C. Moore, “Data processing in exascale-class computer systems,” The Salishan Conference on High Speed Computing, 2011
Evolution of Hardware: Great complexity
- Heterogeneous many-cores, scalable platforms, complex memory hierarchies, domain-specific accelerators, …
Panda, P. R., Dutt, N. D., & Nicolau, A. Memory issues in embedded systems-on-chip: optimizations and exploration. Springer Science & Business Media. 1999
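[Figure: from a 1999 embedded system-on-chip to a 2019 heterogeneous platform as seen by the programmer: A15 cores with L1 caches and a shared L2, VLIW DSPs with L1/L2, network processor, packet DMA, HW queues, DMAs, semaphores, PMU, peripherals, communication support, memory subsystem, NoC. Application domains: smart health, aerospace, AI/ML/NLP, smart transport, infotainment systems, virtual 3D, control systems, robotics, HCI.]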
Models: Von Neumann
- Program: sequence of instructions in memory
[Figure: Central Processing Unit (CPU) with a control unit (instruction register, program counter) and a processing unit (register file, ALU), connected to memory & I/O]
Good abstractions:
- “High-level” languages (C, …)
- Efficient HW implementation
Models: Von Neumann and Modern HW
- Different sources of parallelism
  - SIMD, threads, GPUs, …
- Fundamentally different resources
  - Dataflow accelerators, tensor processing units, …
→ Lots of effort in auto-parallelization and vectorization
[Figure: sequential code → compiler → parallel code (OpenMP, Pthreads, …)]
Theorem (Allen/Kennedy): Any reordering transformation that preserves every dependence in a program preserves the meaning of that program
Static/dynamic control/data analysis: Sample results
[Figure: sample results. 2008: experimental multi-core from Tokyo Institute of Technology (Prof. Isshiki) [DAC’08]. 2018: Polybench on a real Jetson TX1 [Aguilar’18].]
Problems for auto-parallelizing compilers
1) Find all dependencies?
    for (i = 1; i <= 100; i++)
      for (j = 1; j <= 100; j++) {
        S1: X[i][j] = X[i][j] + Y[i-1][j];
        S2: Y[i][j] = Y[i][j] + X[i][j-1];
      }
[Figure: iteration space with axes i and j]
More often than not, impossible!
2) Coding style and the illusion of infinite shared memory
    struct a { struct s *a; int *pi; struct t *p; float *p; … }
Successful example: Polyhedral Compilation
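As a minimal sketch of the Allen/Kennedy theorem applied to the loop nest above, the following Python snippet executes S1 and S2 in three iteration orders on a small 4x4 instance. The array initial values, the problem size and the helper name `run` are assumptions made for illustration only. Loop interchange keeps both dependences (S2 at (i-1, j) before S1 at (i, j), and S1 at (i, j-1) before S2 at (i, j)) in order, whereas reversing the i loop violates the first one.

```python
def run(order, n=4):
    """Execute S1 and S2 for every (i, j) in `order` and return the final arrays."""
    # Row 0 and column 0 only hold boundary values and are never written.
    X = [[i + 2 * j for j in range(n + 1)] for i in range(n + 1)]
    Y = [[3 * i + j + 1 for j in range(n + 1)] for i in range(n + 1)]
    for i, j in order:
        X[i][j] = X[i][j] + Y[i - 1][j]  # S1: reads Y written by S2 at (i-1, j)
        Y[i][j] = Y[i][j] + X[i][j - 1]  # S2: reads X written by S1 at (i, j-1)
    return X, Y


N = 4
original = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]
# Loop interchange: j becomes the outer loop. Both dependences stay in order.
interchanged = [(i, j) for j in range(1, N + 1) for i in range(1, N + 1)]
# Reversing the i loop runs (i, j) before (i-1, j) and breaks the Y dependence.
reversed_i = [(i, j) for i in range(N, 0, -1) for j in range(1, N + 1)]

reference = run(original)
print("interchange preserves result:", run(interchanged) == reference)
print("i-reversal preserves result: ", run(reversed_i) == reference)
```

A compiler has to prove such orderings statically, for all inputs, which is what makes the general problem hard.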
Problems for auto-parallelizing compilers (2)
1) Find all dependencies?
2) Coding style and the illusion of infinite shared memory
3) Dependencies can sometimes be violated! (They are artifacts of style.)
[Edler’18]
Models: Von Neumann and Modern HW
- Different sources of parallelism
  - SIMD, threads, GPUs, …
- Fundamentally different resources
  - Dataflow accelerators, tensor processing units, …
- Von Neumann and sequential programming
  - Difficult to automatically extract parallelism
  - Difficult to recognize high-level operations
  - Difficult to ensure correctness (functional, timing, …)
Models and Programming
- Especially important for CPS!
  - Domain-specific
  - Interconnected systems
  - Interaction with the physical world (reactive)
  - Real-time (WCET, predictable HW)
  - Dynamically changing conditions
  - …
[Figure: application domains: smart health, aerospace, AI/ML/NLP, smart transport, infotainment systems, virtual 3D, control systems, robotics, HCI]
In this lecture
- Classic dataflow modeling
- Current work on extensions: dynamic, adaptable, … (CPS)
- Raising the level of abstraction: domain-specific languages
Dataflow programming
- Graph representation of applications: long history
  - Implicit repetitive execution of tasks
  - Good model for streaming applications
  - Good match for signal processing & multimedia
- The why
  - Explicit parallelism
  - Often: determinism
  - Better analyzability (scheduling, mapping, optimization)
    #include "PnTransform.h"
    #include "PnTVPUtg.h"
    #include "PnTVPUmap.h"
    #include "clang/AST/ASTContext.h"
    #include "PnStreamFactory.h"
    using namespace clang;

    void clang::PnTransform(TransTarget transTarget, bool traces,
                            const std::string &strMappingFileName,
                            ASTContext &Ctx, Sema &S,
                            const llvm::sys::Path &BasePath) {
      assert(transTarget != TransInvalid);
      TranslationUnitDecl *D = Ctx.getTranslationUnitDecl();
      PnTransformSizeof(D, S);
      PnTransformTask(D, S);
      switch (transTarget) {
      case TransSystemC:
      case TransVPUtg:
      case TransVPUmap:
        PnTCopying(D, S);
        break;
      default:
        ;
      }
      PnTransformSdfToKpn(D, S);
      PnTransformToArrayAccess(D, S);
      CollectChannelAccessRanges(D, S);
      PropagateChannelAccessRanges(D, S);
      PnStreamFactory streamFactory(BasePath);
      switch (transTarget) {
      case TransMVP:
        PnTransformTemplateInstantiate(D, S);
        ErasePnProcessTemplates(D);
        PnPrintForMVP(D, S);
        break;
      case TransPthread:
        PnTransformPthreads(D, S, traces);
        ErasePnDefs(D);
        break;
      case TransSystemC:
        PrintForSystemC(D, S, traces, streamFactory);
        ErasePnDefs(D);
        break;
      case TransVPUtg:
        PrintForVPUtg(D, S, streamFactory);
        ErasePnDefs(D);
        break;
      case TransVPUmap:
        PrintForVPUmap(D, S, strMappingFileName, streamFactory);
        ErasePnDefs(D);
        break;
      case TransInvalid:
        assert(false);
        break;
      }
    }
Example: Synchronous dataflow graph (SDF)
Def.: An SDF is an annotated multi-graph $G = (V, E, W)$. $V$ is the set of actors and $E \subseteq V \times V$ the set of channels. $W = \{w_1, \dots, w_{|E|}\} \subset \mathbb{N}^3$ is a set of annotations: for every channel $e = (a_1, a_2)$, $w_e = (p_e, c_e, d_e)$, where $p_e$ is the number of tokens produced by a firing of $a_1$, $c_e$ the number of tokens consumed by a firing of $a_2$, and $d_e$ the number of delay tokens.
Originally in [Lee’87]
Example:
[Figure: SDF graph with actors a1, a2, a3 and channels e1 (a1 → a2, produces 3, consumes 1), e2 (a1 → a2, produces 6, consumes 2), e3 (a2 → a3, produces 2, consumes 3) and e4 (a3 → a1, produces 1, consumes 2)]
What next?
Topology matrix
Def.: Given an SDF $G = (V, E, W)$, its topology matrix $\Gamma$ has a row for every channel and a column for every actor, with $\Gamma_{ik} = p_{ik} - c_{ik}$ (i.e., the number of tokens produced minus the number consumed by actor $k$ on/from channel $i$).
Example (rows $e_1, \dots, e_4$; columns $a_1, a_2, a_3$):
\[
\Gamma = \begin{pmatrix} 3 & -1 & 0 \\ 6 & -2 & 0 \\ 0 & 2 & -3 \\ -2 & 0 & 1 \end{pmatrix}
\]
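The definition can be sketched in a few lines of Python. The tuple layout of `CHANNELS`, the names `ACTORS` and `topology_matrix`, and the two delay tokens on e4 (read off the initial state used in the example) are choices of this sketch, not part of the original slides.

```python
# Channels of the example graph:
# (name, producer, consumer, tokens produced, tokens consumed, delay tokens).
CHANNELS = [
    ("e1", "a1", "a2", 3, 1, 0),
    ("e2", "a1", "a2", 6, 2, 0),
    ("e3", "a2", "a3", 2, 3, 0),
    ("e4", "a3", "a1", 1, 2, 2),
]
ACTORS = ["a1", "a2", "a3"]


def topology_matrix(channels, actors):
    """Build one row per channel and one column per actor (entry = p - c)."""
    column = {actor: k for k, actor in enumerate(actors)}
    gamma = []
    for _name, src, dst, produced, consumed, _delay in channels:
        row = [0] * len(actors)
        row[column[src]] += produced   # producer adds tokens to the channel
        row[column[dst]] -= consumed   # consumer removes tokens from it
        gamma.append(row)
    return gamma


GAMMA = topology_matrix(CHANNELS, ACTORS)
INITIAL_STATE = [delay for *_rest, delay in CHANNELS]

for row in GAMMA:
    print(row)
print("initial state:", INITIAL_STATE)
```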
Topology matrix (2)
- State of an SDF: tokens in the queues
- The topology matrix can be used to update the state after a firing
Example (firing $a_1$ from $s_0$):
\[
s_1 = s_0 + \Gamma \cdot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}
    = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 2 \end{pmatrix} + \begin{pmatrix} 3 \\ 6 \\ 0 \\ -2 \end{pmatrix}
    = \begin{pmatrix} 3 \\ 6 \\ 0 \\ 0 \end{pmatrix},
\qquad
\Gamma = \begin{pmatrix} 3 & -1 & 0 \\ 6 & -2 & 0 \\ 0 & 2 & -3 \\ -2 & 0 & 1 \end{pmatrix}
\]
Topology matrix (3)
Example (continued, firing $a_2$ from $s_1$):
\[
s_2 = s_1 + \Gamma \cdot \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
    = \begin{pmatrix} 3 \\ 6 \\ 0 \\ 0 \end{pmatrix} + \begin{pmatrix} -1 \\ -2 \\ 2 \\ 0 \end{pmatrix}
    = \begin{pmatrix} 2 \\ 4 \\ 2 \\ 0 \end{pmatrix},
\qquad
\Gamma = \begin{pmatrix} 3 & -1 & 0 \\ 6 & -2 & 0 \\ 0 & 2 & -3 \\ -2 & 0 & 1 \end{pmatrix}
\]
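The two state updates can be replayed with a short Python sketch. The function name `fire` is hypothetical, and treating "no queue becomes negative" as the enabling condition is an assumption that holds for this example because no actor has a self-loop.

```python
GAMMA = [
    [3, -1, 0],   # e1
    [6, -2, 0],   # e2
    [0, 2, -3],   # e3
    [-2, 0, 1],   # e4
]


def fire(state, actor):
    """Return the state after one firing of `actor` (a column index of GAMMA)."""
    new_state = [tokens + row[actor] for tokens, row in zip(state, GAMMA)]
    if min(new_state) < 0:
        raise ValueError(f"actor a{actor + 1} is not enabled in state {state}")
    return new_state


s0 = [0, 0, 0, 2]
s1 = fire(s0, 0)   # fire a1
s2 = fire(s1, 1)   # fire a2
print(s1)
print(s2)
```

Trying `fire(s0, 1)` raises `ValueError`, since a2 has no input tokens in the initial state.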
Topology matrix and complete cycles
- Complete cycle: sequence of firings that brings the SDF back to its original state
- What for: automatic unrolling to express parallelism (semantics-preserving)
\[
s_n = s_0 + \Gamma \cdot (v_1 + v_2 + \dots + v_n) = s_0
\quad\Rightarrow\quad
\Gamma \cdot (v_1 + v_2 + \dots + v_n) = \Gamma \cdot \vec{r} = 0
\]
For the example: $\vec{r} = (1, 3, 2)^T$
[Figure: the example SDF graph and its unrolled version with one firing of a1, three firings of a2 and two firings of a3]
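One way to obtain the repetition vector is to solve the balance equations $p_e \cdot r_{src} = c_e \cdot r_{dst}$ channel by channel with exact fractions. The sketch below assumes a connected graph and uses hypothetical names (`CHANNELS`, `repetition_vector`); it is an illustration rather than the algorithm used in any particular tool. It requires Python 3.9 or later for `math.lcm`.

```python
from fractions import Fraction
from math import lcm

# (producer, consumer, tokens produced, tokens consumed); actors are 0-based.
CHANNELS = [(0, 1, 3, 1), (0, 1, 6, 2), (1, 2, 2, 3), (2, 0, 1, 2)]


def repetition_vector(channels, num_actors):
    """Smallest positive integer solution of the balance equations."""
    ratio = {0: Fraction(1)}          # firing rate relative to actor 0
    changed = True
    while changed:                    # propagate rates along the channels
        changed = False
        for src, dst, produced, consumed in channels:
            if src in ratio and dst not in ratio:
                ratio[dst] = ratio[src] * produced / consumed
                changed = True
            elif dst in ratio and src not in ratio:
                ratio[src] = ratio[dst] * consumed / produced
                changed = True
    for src, dst, produced, consumed in channels:
        if ratio[src] * produced != ratio[dst] * consumed:
            raise ValueError("inconsistent SDF graph: no complete cycle exists")
    scale = lcm(*(r.denominator for r in ratio.values()))
    return [int(ratio[actor] * scale) for actor in range(num_actors)]


print(repetition_vector(CHANNELS, 3))
```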
Maximum cycle mean (MCM)
- An unrolled SDF is an HSDF
- The MCM relates to the maximum throughput of an HSDF
- Given an HSDF $G = (V, E)$
  - Let $O(G)$ be the set of all cycles
  - A cycle $C = ((v_i, v_{i+1}), (v_{i+1}, v_{i+2}), \dots, (v_{i+n-1}, v_i))$
\[
\mathrm{MCM}(G) = \max_{C \in O(G)} \left( \frac{\sum_{v : (u,v) \vee (v,u) \in C} \rho(v)}{\sum_{e \in C} w(e)} \right)
\]
where $\rho(v)$ is the WCET of actor $v$ and $w(e)$ is the number of delay tokens on channel $e$.
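For small graphs the MCM can be evaluated directly from its definition by enumerating all simple cycles. The following sketch does exactly that; the three-actor HSDF, its WCET values and the function names are made up for illustration, and the enumeration is exponential in general, so it is not meant as a practical algorithm.

```python
from fractions import Fraction


def simple_cycles(edges):
    """Yield every simple cycle as a list of edge indices.

    Each cycle is reported once, rooted at its smallest actor id.
    """
    successors = {}
    for index, (src, dst, _tokens) in enumerate(edges):
        successors.setdefault(src, []).append((dst, index))

    def extend(root, node, visited, path):
        for nxt, index in successors.get(node, []):
            if nxt == root:
                yield path + [index]
            elif nxt > root and nxt not in visited:
                yield from extend(root, nxt, visited | {nxt}, path + [index])

    for root in sorted(successors):
        yield from extend(root, root, {root}, [])


def maximum_cycle_mean(wcet, edges):
    """Maximum over all cycles of (sum of WCETs) / (sum of delay tokens)."""
    best = None
    for cycle in simple_cycles(edges):
        time = sum(wcet[edges[index][0]] for index in cycle)
        tokens = sum(edges[index][2] for index in cycle)
        if tokens == 0:
            raise ValueError("cycle without delay tokens: the graph deadlocks")
        mean = Fraction(time, tokens)
        best = mean if best is None else max(best, mean)
    return best


# Hypothetical HSDF: WCET per actor and edges (src, dst, delay tokens).
WCET = {0: 2, 1: 3, 2: 4}
EDGES = [(0, 1, 0), (1, 0, 1), (1, 2, 0), (2, 0, 2)]
print(maximum_cycle_mean(WCET, EDGES))
```

Here the cycle 0 → 1 → 0 has mean 5/1 and the cycle 0 → 1 → 2 → 0 has mean 9/2, so the first one is the critical cycle.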
Other models
- Trade-off: expressivity vs. analyzability
- Dynamics: compromise on analyzability
[Stuijk‘11]
Dynamic models and language support
- Kahn Process Networks (KPNs)
  - Simple syntax for programming
  - Less model information: more analysis effort
- Via tracing, compilers can
  - Identify general communication patterns
  - Exploit event correlations (“dependencies”)
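To make the KPN model concrete, here is a minimal sketch using Python threads as processes and `queue.Queue` objects as unbounded FIFO channels with blocking reads. The process names and the `None` end-of-stream marker are conventions of this sketch only; they do not reflect the syntax of CPN or any other KPN language mentioned in the lecture.

```python
import queue
import threading


def producer(out):
    """Source process: writes five tokens, then an end-of-stream marker."""
    for value in range(5):
        out.put(value)
    out.put(None)


def scale(inp, out, factor):
    """Worker process: blocking read on `inp`, write result to `out`."""
    while (value := inp.get()) is not None:
        out.put(factor * value)
    out.put(None)


def run_network():
    a, b = queue.Queue(), queue.Queue()   # unbounded FIFO channels
    threads = [
        threading.Thread(target=producer, args=(a,)),
        threading.Thread(target=scale, args=(a, b, 10)),
    ]
    for thread in threads:
        thread.start()
    result = []
    while (value := b.get()) is not None:  # the caller acts as the sink
        result.append(value)
    for thread in threads:
        thread.join()
    return result


print(run_network())
```

Because processes only communicate through FIFOs with blocking reads, the output sequence does not depend on how the threads are scheduled.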
Programming flow: Overview
[Figure: programming flow. Inputs: dynamic dataflow application (CPN), non-functional specification, architecture model, property models (timing, energy, errors, …). Steps: analysis, synthesis, code generation. Target: heterogeneous platform with A15 cores (L1 caches, shared L2), VLIW DSPs (L1, L2), network processor, packet DMA, HW queues, DMAs, semaphores, PMU, peripherals, communication support, memory subsystem and NoC.] [Springer’14]
Generated code excerpt (lines are cropped on the slide):
    PNargs_ifft_r.ID = 6U;
    PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_right
    PNargs_ifft_r.PNnum_freq_coef = 0U;
    PNargs_ifft_r.PNchannel_time_coef = sink_right
    PNargs_ifft_r.channel = 1;
    sink_left = IPCllmrf_open(3, 1, 1);
    sink_right = IPCllmrf_open(7, 1, 1);
    PNargs_sink.ID = 7U;
    PNargs_sink.PNchannel_in_left = sink_left
    PNargs_sink.PNnum_in_left = 0U;
    PNargs_sink.PNchannel_in_right = sink_right
    PNargs_sink.PNnum_in_right = 0U;
    taskParams.arg0 = (xdc_UArg)&PNargs_src
    taskParams.priority = 1;
    ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtr&taskParams, &eb);
    glob_proc_cnt++;
    hasProcess = 1;
    taskParams.arg0 = (xdc_UArg)&PNargs_fft_l
    taskParams.priority = 1;
    ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrft_Templ, &taskParams, &eb);
    glob_proc_cnt++;
    hasProcess = 1;
    taskParams.arg0 = (xdc_UArg)&PNargs_ifft_r
    taskParams.priority = 1;
    ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrfft_Templ, &taskParams, &eb);
    glob_proc_cnt++;
    hasProcess = 1;
    taskParams.arg0 = (xdc_UArg)&PNargs_sink
    taskParams.priority = 1;
Programming flows
- Many: CAL, DOL, CPN, Daedalus, Ptolemy, CAPH, …
- Information
  - Dataflow model: rates, states, actions, traces, …
  - Architecture model: resources, interconnect, costs, …
- Optimization: mapping application to hardware
  - Multi-objective optimization
  - High-level simulation / cost models
[Figure: MJPEG application (processes src, init, vle, dct, qnt, snk) mapped onto a platform with two RISC cores and four DSPs, each with local memory (LM), plus an ARM core and shared memory (SM), connected by an AHB bus]
Automatic mapping to heterogeneous many-cores
- Lots of research: heuristics and meta-heuristics
- Sample results: effort-quality trade-off
[Figure: sample results for multimedia apps on TI Keystone II] [MCSoC’16]
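The effort-quality trade-off can be illustrated with a toy mapping problem. Everything in the sketch below is hypothetical: the execution times in `COST`, the two processing elements, and the cost model (makespan as the maximum load per processing element, ignoring communication). It compares an exhaustive search over all 2^5 mappings with a simple greedy list heuristic.

```python
from itertools import product

# Hypothetical execution times per task on (PE 0, PE 1), e.g. a RISC and a DSP.
COST = {
    "src": (2, 4),
    "dct": (9, 3),
    "qnt": (6, 2),
    "vle": (5, 6),
    "snk": (1, 2),
}
NUM_PES = 2


def makespan(mapping):
    """Cost model of this sketch: the load of the busiest processing element."""
    load = [0] * NUM_PES
    for task, pe in mapping.items():
        load[pe] += COST[task][pe]
    return max(load)


def exhaustive():
    """High effort: evaluate every possible mapping."""
    tasks = list(COST)
    candidates = (
        dict(zip(tasks, pes)) for pes in product(range(NUM_PES), repeat=len(tasks))
    )
    return min(candidates, key=makespan)


def greedy():
    """Low effort: place the longest tasks first on the PE that finishes earliest."""
    mapping, load = {}, [0] * NUM_PES
    for task in sorted(COST, key=lambda t: -min(COST[t])):
        pe = min(range(NUM_PES), key=lambda p: load[p] + COST[task][p])
        mapping[task] = pe
        load[pe] += COST[task][pe]
    return mapping


print("exhaustive:", makespan(exhaustive()))
print("greedy:    ", makespan(greedy()))
```

The exhaustive search is optimal by construction but scales exponentially with the number of tasks, while the heuristic evaluates each task once.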
Dataflow and CPSs
- Adaptive, predictable, multi-app, reactive, …
System dynamics: Hybrid mappings
- Applications not so static anymore!
- Hybrid DSE: a compile-time and run-time approach
  - Enable adaptivity: malleable, multi-variant
  - Run-time predictability, robustness & isolation
[Figure: application domains (smart health, aerospace, AI/ML/NLP, smart transport, infotainment systems, virtual 3D, control systems, robotics, HCI); EUF Pareto front @ compile-time]
Data-level parallelism: Scalable and adaptive
- Change parallelism from the application specification
  - Static code analysis to identify possible transformations (or via annotations)
  - Implementation in FIFO library (semantics-preserving)
  - Similar to AdaPNet or parameterized SDFs
[Figure: process network P1, P2, P3 with meta-information & configurations; changes at run-time turn it into one of two variants with replicated processes]
[PARMA-DITAM’18]
[Schor’14, Desnos’13]
Flexible mappings: Generalized Tetris
- Given multiple canonical configurations by the compiler, select one at run-time
- Exploit mapping equivalences and similarities
- Related approach: produce mapping constraints at compile-time
[Figure: several mappings with their configurations are generated at compile-time; the run-time system selects a mapping and configuration]
[ACM TACO’17, SCOPES’17b]
[Schwarzer’17]
Implemented as actual game for dissemination!
Sample symmetries for MJPEG
- Multimedia benchmarks on Adapteva (16 Epiphany cores)
- Not always very intuitive
[Figure: sample mappings on the core array. Legend: best result; result “simple”; pruned (equiv. to “simple”); pruned (equiv. to best)]
[ACM TACO’17]
Flexible mappings: Run-time analysis
- Linux kernel: symmetry-aware
- Target: Odroid XU4 (big.LITTLE)
- Multi-application scenarios: audio filter (AF) and MIMO
  - 1 x AF
  - 4 x AF
  - 2 x AF + 2 x MIMO
- 3 mappings to two processors
  - T1: Best CPU time
  - T2: Best wall-clock time
  - T3: GBM heuristic
[Figure: single AF. Box plots of CPU time [s], energy [J] and wall-clock time [s] for CFS, Dyn, T1, T2 and T3]
[Castrill12]
[SCOPES’17b]
Flexible mappings: Multi-application results (1)
[Figure: multi-application scenarios. Box plots of wall-clock time [s] and CPU time [s] for CFS, Dyn, T1, T2 and T3, and of CPU time [s] and wall-clock time [s] per application (AF, MIMO)]
- More predictable performance
- Comparable performance to dynamic mapping
[SCOPES’17b]
Robustness
- Static mappings, transformed or not, provide good predictability
- However: many things are out of control
  - Application data, unexpected interrupts, unexpected OS decisions
→ Can we reason about the robustness of a mapping to external factors?
[Figure: the run-time system selects a mapping and configuration; a perturbation leads to modified behavior (correct?)]
Design centering
- Design centering: find a mapping that can better tolerate variations while staying feasible
- Studied field, e.g., in biology, circuit design or manufacturing systems
- Currently
  - Using a bio-inspired algorithm
  - Robust against OS changes to the mapping
[Figure: feasible region in the (x1, x2) parameter space, contrasting the optimum with the design center inside the feasible region]
[SCOPES’17a]
Design centering: Algorithmic
- Intuition: find the center and the shape of a region in which parameters deliver a correct solution
- Formally
  - A: set of correct solutions
  - P: hitting probability
  - L: generic metric space
- Searching: allow annealing (dynamically change P)
[SCOPES’17a]
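The intuition can be sketched on a toy problem. The code below is not the bio-inspired algorithm of [SCOPES’17a]; it is a brute-force illustration under made-up assumptions: a triangular feasible region on an integer grid, a hypothetical cost function whose optimum lies on the boundary, and a hitting probability estimated as the fraction of fixed perturbations that stay feasible.

```python
from itertools import product


def feasible(x1, x2):
    # Hypothetical feasible region: triangle with corners (0,0), (10,0), (0,10).
    return x1 >= 0 and x2 >= 0 and x1 + x2 <= 10


def cost(x1, x2):
    # Hypothetical objective; its optimum lies on the boundary of the region.
    return -(x1 + x2)


# Fixed set of perturbations: all grid offsets within distance 2 per axis.
OFFSETS = [d for d in product(range(-2, 3), repeat=2) if d != (0, 0)]


def hitting_probability(x1, x2):
    """Fraction of perturbed points that remain feasible."""
    hits = sum(feasible(x1 + dx, x2 + dy) for dx, dy in OFFSETS)
    return hits / len(OFFSETS)


CANDIDATES = [p for p in product(range(11), repeat=2) if feasible(*p)]
optimum = min(CANDIDATES, key=lambda p: cost(*p))
center = max(CANDIDATES, key=lambda p: hitting_probability(*p))

print("optimum", optimum, hitting_probability(*optimum))
print("center ", center, hitting_probability(*center))
```

The cost-optimal point sits on the edge of the region and most perturbations push it out, whereas the design center tolerates every perturbation in this toy setting.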
Evaluation
- Analyze how robust the center really is
  - Perturb mappings and check how often the constraints are missed
  - Signal processing applications on a clustered ARM many-core and a NoC many-core (16)
[Figure: results for MIMO-OFDM]
[SCOPES’17a]
Ongoing work: Improve representations
- Work on embeddings: architectures → real numbers
- Novel mapping representations: faster heuristics & more efficient heuristics?
- Example: t-SNE visualization of the mapping space (8 tasks on Odroid XU4)
[MCSoC’18]
Domain-specific abstractions
Higher-level Abstractions
- Dataflow: nice abstraction across domains (maybe too overloaded)
- Need to match domain-specific accelerators with domain-specific abstractions
https://www.hpcwire.com/2017/04/10/nvidia-responds-google-tpu-benchmarking/
Domain-specific language:
    var input A : matrix
    var input u : tensorIN
    v = (A # A # A # u . [[5 8] [3 7] [1 6]])
[RWDSL’18]
Example: CFDlang
- Origin: spectral element methods in computational fluid dynamics
- Simple expression language, formal semantics, domain-expert optimizations
Fortran embedding:
    source = type matrix : [mp np] &
             type tensorIN : [np np np ne] &
             type tensorOUT : [mp mp mp me] &
             &
             var input A : matrix &
             var input u : tensorIN &
             var input output v : tensorOUT &
             var input alpha : [] &
             var input beta : [] &
             &
             v = alpha * (A # A # A # u . [[5 8] [3 7] [1 6]]) + beta * v
[GPCE’17, RWDSL’18]
Tensor languages (hype): Machine learning
- High-level dataflow graph, low-level tensor operators
- Many existing frameworks, e.g., TVM, Tensor Comprehensions, TensorFlow, …
- Recent work
  - Cross-domain tensor transformations
  - Formal semantics for safety checks (particularly important for CPS!)
  - Optimization for emerging memories
[Figure: example flow: TVM] [Chen’18]
[GPCE’18]
[Array’19]
[LCTES’19]
Correct by construction
- Similar to formal MoCs
  - Correct by design
  - No abstraction leaks
- Current efforts in formal semantics for safe code generation
[Array’19]
    A = placeholder((m,h), name='A')
    B = placeholder((n,h), name='B')
    k = reduce_axis((0, h), name='k')
    C = compute((m, n), lambda i, j:
                sum(A[k, i] * B[k, j], axis=k))
\[
C_{ij} = \sum_{k=1}^{h} A_{ki} B_{kj}
\]
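As a plain-Python illustration of the contraction $C_{ij} = \sum_k A_{ki} B_{kj}$ together with an explicit shape check, consider the sketch below. It assumes that both operands are stored with the reduction index k first, i.e. with shapes (h, m) and (h, n); the function name `contract` and the sample data are hypothetical, and the check only hints at the kind of error a type-safe tensor language could reject before code generation.

```python
def contract(A, B):
    """Return C with C[i][j] = sum over k of A[k][i] * B[k][j]."""
    h = len(A)
    if len(B) != h:
        # The reduction extents of both operands must agree.
        raise ValueError(f"reduction extents differ: {h} vs {len(B)}")
    m, n = len(A[0]), len(B[0])
    return [
        [sum(A[k][i] * B[k][j] for k in range(h)) for j in range(n)]
        for i in range(m)
    ]


A = [[1, 2], [3, 4], [5, 6]]   # shape (h, m) = (3, 2)
B = [[1, 0], [0, 1], [1, 1]]   # shape (h, n) = (3, 2)
C = contract(A, B)
print(C)
```

Calling `contract(A, [[1, 0]])` raises `ValueError` because the reduction extents (3 and 1) do not match.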
Emerging architectures
- Lots of research on tensor computations on tensor accelerators!
- Recently looking into emerging memory architectures
  - Hybrid architectures STT/ReRAM
  - Racetrack memories
- Racetrack memories: predicted extreme density at low latency
  - Multiple bits stored in nano-wires with magnetic domains
  - Non-volatile: energy efficiency!
e.g., [Kim’19]
[TVLSI’18, TVLSI’19]
Need to exploit sequentiality on top of locality
[IEEE CAL’19]
Architecture and data layout optimization
- Architecture-software co-optimization
- Embedded system for inference: RTM as scratchpad with pre-shifting and other optimizations
[LCTES’19]
Architecture and data layout optimization
- Data layout: reduce the number of shifts
[Figure: placement of the tensors in racetrack memory: Ã in Bank-0 (DBC:0 to DBC:n-1), B̃ in Bank-1 (DBC:n to DBC:2n-1) and C in Bank-2 (DBC:2n to DBC:3n-1), each with rows R0 to Rn-1; the layouts are compared by their compulsory shifts and overhead shifts]
[LCTES’19]
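Why the data layout matters can be seen with a deliberately simplified shift model: a single track with a single access port, where moving from one stored element to another costs as many shifts as the distance between their positions. This model, the 4x4 operand and all names below are assumptions of the sketch and are far simpler than the architecture evaluated in [LCTES’19].

```python
def shifts(trace, position):
    """Total shift distance for visiting `trace` in order on one track."""
    port = position[trace[0]]   # assume the port starts at the first element
    total = 0
    for element in trace:
        total += abs(position[element] - port)
        port = position[element]
    return total


n = 4
elements = [(r, c) for r in range(n) for c in range(n)]
row_major = {(r, c): r * n + c for r, c in elements}
col_major = {(r, c): c * n + r for r, c in elements}

# Column-wise traversal, as needed for the second operand of a matrix product.
trace = [(r, c) for c in range(n) for r in range(n)]

print("row-major layout:   ", shifts(trace, row_major))
print("column-major layout:", shifts(trace, col_major))
```

With the layout matched to the access order, every access is one shift away from the previous one, so only the unavoidable sequential shifts remain.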
Latency comparison vs SRAM
- Unoptimized, naïve mapping: even worse latency than SRAM
- 24% average improvement (even with very conservative circuit simulation)
[Figure: normalized latency vs. tensor size (4 to 2048, and average) for SRAM, RTM-naïve, RTM-opt and RTM-opt-ps]
[LCTES’19]
Energy comparison vs SRAM
- Higher savings due to less leakage power
- 74% average improvement
[Figure: normalized energy vs. tensor size (4 to 2048) for SRAM, RTM-naïve, RTM-opt and RTM-opt-ps, broken down into leakage energy, read/write energy and shift energy]
[LCTES’19]
Summary
- Principled methodologies for programming are a must! (Preaching to the choir)
  - Advances in auto-parallelization (polyhedral, dynamic analysis)
  - Explicit parallel MoC-based programming models
  - Quest for more adaptivity while retaining time predictability
  - Higher-level abstractions: example for tensor-based computations for accelerators
- Lots of challenges remain (thankfully)
  - Cost models and characterization of trade-offs (vs. blind searches)
  - Understand the impact of emerging technologies
  - Syntax and semantics (for correctness): lots of open questions
  - Time semantics in programming and execution environments
References
[DAC’08] Ceng, J., et al. “MAPS: an integrated framework for MPSoC application parallelization”, Proceedings of the 45th annual Design Automation Conference, DAC’08.
[Aguilar’18] Aguilar, M. A. “Software Parallelization and Distribution for Heterogeneous Multi-Core Embedded Systems”, RWTH Aachen University, 2018, 181 pp.
[Edler’18] T. J. K. Edler von Koch, S. Manilov, C. Vasiladiotis, M. Cole, and B. Franke. “Towards a Compiler Analysis for Parallel Algorithmic Skeletons”, CC’18.
[Lee’87] E. A. Lee and D. G. Messerschmitt. “Synchronous data flow”, Proceedings of the IEEE, 1987.
[Springer’14] J. Castrillon, R. Leupers. “Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap”, Springer, pp. 258, 2014.
[MCSoC’16] A. Goens, R. Khasanov, J. Castrillon, S. Polstra, A. Pimentel. “Why Comparing System-level MPSoC Mapping Approaches is Difficult: a Case Study”, MCSoC-16.
[Schor’14] Schor, L., Bacivarov, I., Yang, H. & Thiele, L. “AdaPNet: Adapting Process Networks in Response to Resource Variations”, ESWEEK’14, 2014.
[Stuijk’11] Stuijk, S., Geilen, M., Theelen, B. & Basten, T. “Scenario-aware dataflow: Modeling, analysis and implementation of dynamic applications”, SAMOS’11, pp. 404-411.
[Desnos’13] Desnos, K., et al. “PiMM: Parameterized and interfaced dataflow meta-model for MPSoCs runtime reconfiguration”, SAMOS, 2013.
[PARMA-DITAM’18] R. Khasanov, et al. “Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap”, PARMA-DITAM 18, ACM, pp. 20-25, Jan 2018.
[ACM TACO’17] Goens, A., et al. “Symmetry in Software Synthesis”, ACM Transactions on Architecture and Code Optimization (TACO), 2017.
[SCOPES’17a] G. Hempel, et al. “Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering”, SCOPES’17.
[SCOPES’17b] Goens, A., et al. “TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings”, SCOPES’17.
[DAC’12] J. Castrillon, A. Tretter, R. Leupers, G. Ascheid. “Communication-aware Mapping of KPN Applications Onto Heterogeneous MPSoCs”, DAC 2012, pp. 1266-1271.
[Schwarzer’17] T. Schwarzer, et al. “Symmetry-eliminating design space exploration for hybrid application mapping on many-core architectures”, IEEE TCAD 37.2 (2017): 297-310.
[MCSoC’18] A. Goens, C. Menard, J. Castrillon. “On the Representation of Mappings to Multicores”, MCSoC-18, pp. 184-191, Hanoi, Vietnam, Sep 2018.
[CyPhy’19] M. Lohstroh, et al. “Reactors: A Deterministic Model for Composable Reactive Systems”, Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019), Oct 2019.
[RWDSL’18] N. A. Rink, et al. “CFDlang: High-level code generation for high-order methods in fluid dynamics”, RWDSL’18.
[GPCE’17] A. Susungi, N. A. Rink, J. Castrillon, et al. “Towards Compositional and Generative Tensor Optimizations”, GPCE’17, Oct. 2017, pp. 169-175.
[Chen’18] Chen, Tianqi, et al. “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning”, 13th OSDI, 2018.
[GPCE’18] A. Susungi, N. A. Rink, A. Cohen, J. Castrillon, and C. Tadonki. “Meta-programming for cross-domain tensor optimizations”, GPCE’18, Nov. 2018, pp. 79-92.
[Array’19] N. A. Rink and J. Castrillon. “TeIL: a type-safe imperative Tensor Intermediate Language”, ARRAY, ACM, 2019, pp. 57-68.
[Kim’19] J. Kim, et al. “A code generator for high-performance tensor contractions on GPUs”, Proceedings of the Symposium on Code Generation and Optimization, IEEE Press, 2019.
[IEEE CAL’19] Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon. “RTSim: A Cycle-accurate Simulator for Racetrack Memories”, IEEE Computer Architecture Letters, Feb 2019.
[LCTES’19] Khan, A. A., Rink, N. A., Hameed, F., Castrillon, J. “Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads”, Proceedings of LCTES, Jun 2019, pp. 5-18.