Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
SoC programming in the era of the Internet of Things, machine learning and emerging technologies
Jeronimo CastrillonChair for Compiler Construction (CCC)TU Dresden, Germany
Groupement De Recherche SOC2: System On Chip, Systémes embarqués et Objets ConnectéMontpellier, France. June 20 2019
Systems on Chip (SoC): Evolution
q SoCs: Long history of specialization and interaction with environmentq Incredible evolution over the last decades
3 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Panda, P. R., Dutt, N. D., & Nicolau, A. Memory issues in embedded systems-on-chip: optimizations and exploration. Springer Science & Business Media. 1999
Smart health
Aerospace
AI, MLNLP
Smart transport
Infotainment system
Virtual 3D
Control systems
i/po/p
Robotics HCI
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2
SoC design and programming: Handling complexity
q Advances in design methodologies
q Model-based programming
4 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Transistors
Gate level
Register transfer level
Electronic system level
Application Architecture
Results
Optimization
SoC design and programming: Handling complexity
q Advances in design methodologies
q Model-based programming
5 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Transistors
Gate level
Register transfer level
Electronic system level
Application Architecture
Results
Optimization
Exciting innovations in q Modeling languagesq Programming languages and compilers
q Costs models of hardwareq System simulators
q Design space exploration (DSE) methodologies
New challengesq System dynamics (e.g., IoT)q Ubiquity of machine learning workloads
q Complexity of emerging technologies
Dataflow and Hybrid DSE
© Prof. J. Castrillon. GdR SoC2. Montpellier, 20196
Dataflow programming
q Graph representation of applicationsq Implicit repetitive execution of tasksq Good model for streaming applications
q Good match for signal processing & multi-media
q The whyq Explicit parallelism
q Often: Determinism q Better analyzability (scheduling, mapping, optimization)
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);
PropagateChannelAccessRanges(D, S);PnStreamFactorystreamFactory(BasePath);
switch (transTarget) {case TransM VP:
PnTransformTemplateInstantiate(D, S);
ErasePnProcessTemplates(D);
PnPrintForM VP(D, S);break;
case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;
case TransSystemC:PrintForSystemC(D, S, traces,
streamFactory);ErasePnDefs(D);
break;case TransVPUtg:
PrintForVPUtg(D, S, streamFactory);
ErasePnDefs(D);break;
case TransVPUmap:PrintForVPUmap(D, S,
strM appingFileName, streamFactory);
ErasePnDefs(D);break;
case TransInvalid:assert(false);break;
}}
#include "PnTransform.h"#include "PnTVPUtg.h"
#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;
voidclang::PnTransform(TransTarget transTarget, bool traces,
const std::string & strM appingFileName, ASTContext
&Ctx,Sema &S, const
llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =
Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {
case TransSystemC:
case TransVPUtg:case TransVPUmap:
PnTCopying(D, S);break;
default:;
}
PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;
case TransSystemC:PrintForSystemC(D, S, traces,
streamFactory);ErasePnDefs(D);break;
case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;
case TransVPUmap:
PrintForVPUmap(D, S, strM appingFileName, streamFactory);
ErasePnDefs(D);break;
case TransInvalid:assert(false);break;
&BasePath) {assert(transTarget != TransInvalid);
TranslaYonUnitDecl *D = Ctx.getTranslaYonUnitDecl();
PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {
case TransSystemC:case TransVPUtg:case TransVPUmap:
PnTCopying(D, S);break;
default:;
}
7
Dataflow compilation
q Plenty of research onq Language, compiler and mapping algorithmsq Hardware modeling, performance estimation
q Runtime systemsq Code generation for heterogeneous multicores
8 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
LM
RISC
LM
RISC
LM
DSP
LM
DSP
LM
DSP
LM
DSP
ARM
SM
AHB Bus
LM: Local memorySM: Shared memory
qnt
src snk
qnt
qntdct
qnt
dctvle
dct
dct init
MJPEG application
[Springer14]
Example: HW acceleration and SW-defined radio
q Application: MIMO OFDM receiverq Hardware
q Platform 1: Baseline software q Platform 2: Optimized software
q Platform 3: Optimized SW + HW
9 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
ARM Mem
VLIW VLIW VLIW
src sink
Library 1: SW implementations
Library 2: SW optimized
MemMem
FFT
ARM Mem
VLIW FFT Viterbi
src sink
Custom Interconnect
Demap Library 3: SW+HW accel.7680
128000
1758241,758
1,00E+00
1,00E+02
1,00E+04
1,00E+06
1)bsp1 (sw,
unoptimized)
2)bsp2 (sw,optimized)
3)bsp3 (hw)
Rate
(bps
)
Achieved rate @ 100 MHz
[SDR’10, ALOG’11]
System dynamics
q Applications not so static anymore!
q Hybrid DSE: a compile and run-time approachq Enable adaptivity: malleable, multi-variant
q Run-time predictability, robustness & isolation
11 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Smart health
Aerospace
AI, MLNLP
Smart transport
Infotainment system
Virtual 3D
Control systems
i/po/p
Robotics HCI
EUF Pareto front @ compile-time
Data-level parallelism: Scalable and adaptive
q Change parallelism from the application specificationq Static code analysis to identify possible transformations (or via annotations)q Implementation in FIFO library (semantics preserving)
12 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
P1 P2 P3
Meta-information & configurations
Changes @ runtime
P1
P12
P3P22
P32
P1
P12
P3P32
P42
or
P22
[PARMA-DITAM’18]
Exploiting symmetries
q Intuitionq SW: Some tasks/processes/actors may do the sameq HW: Symmetric latencies (CoreX çè CoreY)
q Symmetry: Allows transformations w/o changing the outcome
è No need to analyze all possible mappings (prune search space)
è Work on formalization via inverse semi-groups and efficient algorithms
13 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
=?
15
Symmetries in Odroid: Example
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
PE7
PE6PE5
PE4PE3
PE2PE1
PE0
PE7
PE6PE5
PE4PE3
PE2PE1
PE0D
A B C
DA
B
C
ϕ
PE1
PE2
PE3
PE4
PE5PE6
PE7
PE0
PE1
PE2
PE3
PE4
PE5
PE6PE7
PE0
Mappings Architecture Subgraphs
Cortex A7 Cortex A15
Equivalent mappings
Graph isomorphism
Mappings Architecture subgraphs
[IESS’15, ACM TACO’17]
Flexible mappings: Generalized Tetris
q Given multiple canonical configs by compiler, select one at run-time
q Exploit mapping equivalences and similarities
16 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Mapping 3 configuration1Mapping 2
configuration1Mapping configuration1
RuntimeMapping
configuration*
[SCOPES’17b]
Flexible mappings: Run-time analysis
q Linux kernel: symmetry-awareq Target: Odroid XU4 (big.LITTLE)q Multi-application scenarios: audio filter (AF) and MIMO
q 1x AF
q 4 x AF q 2 x AF + 2 x MIMO
q 3 mappings to two processors q T1: Best CPU time
q T2: Best wall-clock timeq T3: GBM heuristic
17 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
●●●
12
14
16
18
CFS Dyn T1 T2 T3
CPU
tim
e [s
]
●
●
●
●●
●●
●
●
10
12
14
16
CFS Dyn T1 T2 T3
Ener
gy [J
]
●
●
●
●●●●
●●
4
5
6
7
8
CFS Dyn T1 T2 T3
Wal
l−cl
ock
time
[s]
Single AF
[Castrill12]
[SCOPES’17b]
Flexible mappings: Multi-application results (1)
18 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●● ●
●
●
●
●
●
●●
●
●●
●●
●
●
●●
6.5
7.0
7.5
8.0
8.5
9.0
9.5
CFS Dyn T1 T2 T3
Wal
l−cl
ock
time
[s]
●●
●●●●●●
●●●
●●●●●●●●●●●●●
●●●
●
●●●●●●
●●●●●●●
10
20
30
40
AF MIMO
CPU
time
[s]
●●●● ●●
●● ●●●●●●●
●●●
●●●●
●● ●●
10
20
30
AF MIMO
Wal
l−clo
ck ti
me
[s] More
predictable performance
Comparable performance to dynamic mapping
●
●
●
●
●●●●●●
●
●●●●
●
●●●
●
● ● ● ●
● ●
●●
●
●●
●
●●
● ● ● ●
12.5
15.0
17.5
20.0
CFS Dyn T1 T2 T3
CPU
tim
e [s
]
[SCOPES’17b]
Robustness
q Static mappings, transformed or not, provide good predictabilityq However: Many things out of control
q Application data, unexpected interrupts, unexpected OS decisions
è Can we reason about robustness of mapping to external factors?
20 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Mapping 3 configuration1Mapping 2
configuration1Mapping configuration1
RuntimeMapping
configuration* Perturb Modified behavior(correct?)
Design centering
q Design centering: Find a mapping that can better tolerate variations while staying feasible
q Studied field, in e.g., biology, circuit design or manufacturing systems.
q Currentlyq Using a bio-inspired algorithmq Robust against OS changes to the mapping
21 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
optimum
design center in feasible region
x1
x2
feasible region
[SCOPES’17a]
Evaluation
q Analyze how robust the center really isq Perturbate mappings and check how often the constraints are missedq Signal processing applications on clustered ARM manycore and NoC manycore (16)
23 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
MIMO-OFDM
[SCOPES’17a]
Machine Learning & emerging technologies
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
ML revolution: Frameworks and architectures
q Many existing frameworks, e.g., TVM, Tensor Comprehensions, TensorFlow, …
q Lots of traction in hardware architectures: TPU, V100, …
q Lot’s of resources for training, less on inference on edge-devices!
© Prof. J. Castrillon. GdR SoC2. Montpellier, 201926
[Chen, OSDI’18]Example flow: TVM
Domain-specific abstractions
q Commonality: Tensor expression languages
q Increase programmer’s productivityq From compiler perspective: No abstraction toll
q Easier access to informationq Larger score for optimization
27 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
var input A : matrix &var input u : tensorIN &
v = (A # A # A # u . [[5 8] [3 7] [1 6]])
[RWDSL’18]
vs
Correct by construction
q Especially important in embedded systemsq Correct by designq No abstraction leaks
q Current efforts in formal semantics for safe code generation
28 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
[Array’19]
A = placeholder((m,h), name='A')
B = placeholder((n,h), name='B')
k = reduce_axis((0, h), name='k')
C = compute((m, n), lambda i, j:
sum(A[k, i] * B[k, j], axis=k))
!"# = %&'(
)*&"+&#
TeML: Results
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
q Extra control allows for new optimization (vs pluto): changing shapesq General tensor semantics allows covering more benchmarks than TensorFlow
29
[GPCE’18]
Emerging technologies: Racetrack memories
q Racetrack memories: one of many future alternativesq Cf. STT/Re-RAM, hybrid architectures
q Predicted extreme density at low latencyq 3D nano-wires with magnetic domains
q One port shared for many bitsq Domains move at high speeds (1000 ms-1)
q Sequential: Game changer for current HW/SW stackq Memory management
q Integration with other memory architectures q Data layout and allocation
[Parkin-Nature’15]
© Prof. J. Castrillon. GdR SoC2. Montpellier, 201930
[TVLSI’18, TVLSI’19]
Compiler research on placement
q Compiler pass to reorganize placement of instructionsq Instruction fetch is naturally sequential!q Layout instructions to reduce shift operations
q Compiler pass to reorganize data for higher-level objectsq Variable allocation
q Tensor allocation and program transformation
31
[ISLPED’19]
[ArXiv’19]
[LCTES’19]
Simulating RTMs
q RTSim: Configurable racetrack simulatorq Allows running software benchmarksq Built on stop of other simulator technology: NVMAIN 2, Gem5, SystemC, …
© Prof. J. Castrillon. GdR SoC2. Montpellier, 201932
https://github.com/tud-ccc/RTSim[IEEE CAL’19]
Architecture and data layout optimization
q Architecture – software co-optimizationq Embedded system for inference: RTM as scratchpad with pre-shifting and other
optimizations
33[LCTES’19]
Architecture and data layout optimization
q Data-layout: Reduce the number of shifts
34[LCTES’19]
n
DB
C:2n
DB
C:2n+1
DB
C:3n-1
R0
R1
Rn-1
C00 C01
n
R0
R1
Rn-1
DB
C:0
DB
C:1
DB
C:n-1
A00 A01 A0n-1
A10 A1n-1
An-1n-1
C0 Cn-1DBC:n DBC:n+1 DBC:2n-1
B00
B10
Bn-10
A00 A01 A0n-1
compulsory shifts
overhead shifts
B00 B10 Bn-10
compulsory shifts
overhead shifts
C1
B01
B11
Bn-11
C (Bank-2)Ã (Bank-0) ~B (Bank-1)
Latency comparison vs SRAM
q Un-optimized and naïve mapping: Even worse latency than SRAMq 24% average improvement (even with very conservative circuit simulation)
35
0
0.5
1
1.5
2
2.5
3
4 8 16 32 64 128 256 512 1024 2048 Avg
Nor
mal
ized
late
ncy
Tensor size
SRAM RTM-naïve RTM-opt RTM-opt-ps
[LCTES’19]
Energy comparison vs SRAM
q Higher savings due to less leakage powerq 74% average improvement
36
0
0.2
0.4
0.6
0.8
1
1.2
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
SRA
M
RTM
-naï
ve
RTM
-opt
RTM
-opt
-ps
4 8 16 32 64 128 256 512 1024 2048
Nor
mal
ized
ene
rgy
Tensor size
Leakage Energy Read/Write Energy Shift Energy
[LCTES’19]
Discussion
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
Summary
q System dynamics: More complex in distributed IoT scenariosq Hybrid compile-runtime methodologies: Difficult balance, new interfacesq Strive at retaining time predictability
q New workloads (e.g., ML) + new techs: Harness domain-specific abstractionsq More complex decision making (e.g., het. Memory systems)
q Expression DSLs to ease high-level manipulation and transformation
q Exciting times for DSE, SW/HW co-design formal languages for emerging platforms (example: RTM-scratchpads)
© Prof. J. Castrillon. GdR SoC2. Montpellier, 201938
© Prof. J. Castrillon. GdR SoC2. Montpellier, 2019
References
[Springer’14] J. Castrillon, R. Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap" , Springer, pp. 258, 2014. [SDR’10] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable SDR,” Wireless Innovation Conference and Product Exposition (SDR), 2010[ALOG’11] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable software defined radio”, Analog Integrated Circuits and Signal Processing, vol. 69, no. 2–3, pp. 173–190, 2011
[ACM TACO’17] Goens, A. et al. “Symmetry in Software Synthesis”. In: ACM Transactions on Architecture and Code Optimization (TACO) (2017).
[IESS’15] A. Goens and J. Castrillon, “Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs”, In IFIP International Embedded Systems Symposium (IESS), 2015, Foz do Iguaçu, Brazil, 2015.
[PARMA-DITAM’18] R. Khasanov, et al, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap" , PARMA-DITAM 18, ACM, pp. 20–25, Jan 2018.[SCOPES’17a] G. Hempel, et al, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering" SCOPES’17.[SCOPES’17b] Goens, A. et al. “TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings”, SCOPES’17.[Chen, OSDI’18] Chen, Tianqi, et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 2018.[RWDSL’18] N. A. Rink, et al. “CFDlang: High-level code generation for high-order methods in fluid dynamics”. RWDSL’18.[Array’19] N.A. Rink, N. A. and Jeronimo Castrillon. “TeIL: a type-safe imperative Tensor Intermediate Language”,
ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), ACM, 2019, 57-68[GPCE’17] A. Susungi, et al. "Towards Compositional and Generative Tensor Optimizations" GPCE’17, 169-175.
[GPCE’18] A. Susungi, et al. " "Meta-programming for cross-domain tensor optimizations" GPCE’18, 79-92.
[TVLSI’18] F. Hameed, A. A. Khan, J. Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018.
[TVLSI’19] F. Hameed, J. Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), pp. 13pp, Jun 2019.
[Parkin-Nature’15] S. Parkin, and See-Hun Yang. "Memory on the racetrack." Nature nanotechnology 10.3 (2015): 195.
[IEEE CAL’19] A. A. Khan, F. Hameed, R. Bläsing, S. Parkin, J.Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories" In IEEE Computer Architecture Letters, Feb 2019.
[ISPLED’19] J. Multanen, A. A. Khan, P. Jääskeläinen, F. Hameed, J. Castrillon, “SHRIMP: Efficient Instruction Delivery with Domain Wall Memory”, International Symposium on Low Power Electronics and Design (ISLPED), ACM, 2019, 10pp
[LCTES’19] Khan, A. A., Rink, N. A., Hameed, F., Castrillon, J. "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads”, Proceedings of LCTES’19, ACM, pp. 12pp, New York, NY, USA, Jun 2019
[ArXiv’19] Khan, A. A., Hameed, F., Bläsing, R., Parkin, S., Castrillon, J. “ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0”, ArXiv arXiv:1903.03597, Mar 2019.
41