SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and

SoC programming in the era of the Internet of Things, machine learning and emerging technologies

Jeronimo CastrillonChair for Compiler Construction (CCC)TU Dresden, Germany

Groupement De Recherche SOC2: System On Chip, Systémes embarqués et Objets ConnectéMontpellier, France. June 20 2019

Systems on Chip (SoC): Evolution

q SoCs: Long history of specialization and interaction with environmentq Incredible evolution over the last decades

3 © Prof. J. Castrillon. GdR SoC2. Montpellier, 2019

Panda, P. R., Dutt, N. D., & Nicolau, A. Memory issues in embedded systems-on-chip: optimizations and exploration. Springer Science & Business Media. 1999

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

SoC design and programming: Handling complexity

q Advances in design methodologies

q Model-based programming


Transistors

Gate level

Register transfer level

Electronic system level

Application Architecture

Results

Optimization

SoC design and programming: Handling complexity

q Advances in design methodologies

q Model-based programming


Transistors

Gate level

Register transfer level

Electronic system level

Application Architecture

Results

Optimization

Exciting innovations in q Modeling languagesq Programming languages and compilers

q Costs models of hardwareq System simulators

q Design space exploration (DSE) methodologies

New challengesq System dynamics (e.g., IoT)q Ubiquity of machine learning workloads

q Complexity of emerging technologies

Dataflow and Hybrid DSE

© Prof. J. Castrillon. GdR SoC2. Montpellier, 20196

Dataflow programming

q Graph representation of applicationsq Implicit repetitive execution of tasksq Good model for streaming applications

q Good match for signal processing & multi-media

q The whyq Explicit parallelism

q Often: Determinism q Better analyzability (scheduling, mapping, optimization)


PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);

PropagateChannelAccessRanges(D, S);PnStreamFactorystreamFactory(BasePath);

switch (transTarget) {case TransM VP:

PnTransformTemplateInstantiate(D, S);

ErasePnProcessTemplates(D);

PnPrintForM VP(D, S);break;

case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);

break;case TransVPUtg:

PrintForVPUtg(D, S, streamFactory);

ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S,

strM appingFileName, streamFactory);


case TransInvalid:assert(false);break;

}}

#include "PnTransform.h"#include "PnTVPUtg.h"

#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;

voidclang::PnTransform(TransTarget transTarget, bool traces,

const std::string & strM appingFileName, ASTContext

&Ctx,Sema &S, const

llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:

case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

}

PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);break;

case TransVPUtg:PrintForVPUtg(D, S, streamFactory);ErasePnDefs(D);break;

case TransVPUmap:

PrintForVPUmap(D, S, strM appingFileName, streamFactory);


case TransInvalid:assert(false);break;

&BasePath) {assert(transTarget != TransInvalid);

TranslaYonUnitDecl *D = Ctx.getTranslaYonUnitDecl();

PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

}

7

Dataflow compilation

q Plenty of research onq Language, compiler and mapping algorithmsq Hardware modeling, performance estimation

q Runtime systemsq Code generation for heterogeneous multicores


LM

RISC

LM

RISC

LM

DSP

LM

DSP

LM

DSP

LM

DSP

ARM

SM

AHB Bus

LM: Local memorySM: Shared memory

qnt

src snk

qnt

qntdct

qnt

dctvle

dct

dct init

MJPEG application

[Springer14]

Example: HW acceleration and SW-defined radio

q Application: MIMO OFDM receiverq Hardware

q Platform 1: Baseline software q Platform 2: Optimized software

q Platform 3: Optimized SW + HW


ARM Mem

VLIW VLIW VLIW

src sink

Library 1: SW implementations

Library 2: SW optimized

MemMem

FFT

ARM Mem

VLIW FFT Viterbi

src sink

Custom Interconnect

Demap Library 3: SW+HW accel.7680

128000

1758241,758

1,00E+00

1,00E+02

1,00E+04

1,00E+06

1)bsp1 (sw,

unoptimized)

2)bsp2 (sw,optimized)

3)bsp3 (hw)

Rate

(bps

)

Achieved rate @ 100 MHz

[SDR’10, ALOG’11]

System dynamics

q Applications not so static anymore!

q Hybrid DSE: a compile and run-time approachq Enable adaptivity: malleable, multi-variant

q Run-time predictability, robustness & isolation


Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

EUF Pareto front @ compile-time

Data-level parallelism: Scalable and adaptive

q Change parallelism from the application specificationq Static code analysis to identify possible transformations (or via annotations)q Implementation in FIFO library (semantics preserving)


P1 P2 P3

Meta-information & configurations

Changes @ runtime

P1

P12

P3P22

P32

P1

P12

P3P32

P42

or

P22

[PARMA-DITAM’18]

Exploiting symmetries

q Intuitionq SW: Some tasks/processes/actors may do the sameq HW: Symmetric latencies (CoreX çè CoreY)

q Symmetry: Allows transformations w/o changing the outcome

è No need to analyze all possible mappings (prune search space)

è Work on formalization via inverse semi-groups and efficient algorithms


=?

15

Symmetries in Odroid: Example


PE7

PE6PE5

PE4PE3

PE2PE1

PE0

PE7

PE6PE5

PE4PE3

PE2PE1

PE0D

A B C

DA

B

C

ϕ

PE1

PE2

PE3

PE4

PE5PE6

PE7

PE0

PE1

PE2

PE3

PE4

PE5

PE6PE7

PE0

Mappings Architecture Subgraphs

Cortex A7 Cortex A15

Equivalent mappings

Graph isomorphism

Mappings Architecture subgraphs

[IESS’15, ACM TACO’17]

Flexible mappings: Generalized Tetris

q Given multiple canonical configs by compiler, select one at run-time

q Exploit mapping equivalences and similarities


Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration*

[SCOPES’17b]

Flexible mappings: Run-time analysis

q Linux kernel: symmetry-awareq Target: Odroid XU4 (big.LITTLE)q Multi-application scenarios: audio filter (AF) and MIMO

q 1x AF

q 4 x AF q 2 x AF + 2 x MIMO

q 3 mappings to two processors q T1: Best CPU time

q T2: Best wall-clock timeq T3: GBM heuristic


●●●

12

14

16

18

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

●

●

●

●●

●●

●

●

10

12

14

16

CFS Dyn T1 T2 T3

Ener

gy [J

]

●

●

●

●●●●

●●

4

5

6

7

8

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

Single AF

[Castrill12]

[SCOPES’17b]

Flexible mappings: Multi-application results (1)


●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●● ●

●

●

●

●

●

●●

●

●●

●●

●

●

●●

6.5

7.0

7.5

8.0

8.5

9.0

9.5

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

●●

●●●●●●

●●●

●●●●●●●●●●●●●

●●●

●

●●●●●●

●●●●●●●

10

20

30

40

AF MIMO

CPU

time

[s]

●●●● ●●

●● ●●●●●●●

●●●

●●●●

●● ●●

10

20

30

AF MIMO

Wal

l−clo

ck ti

me

[s] More

predictable performance

Comparable performance to dynamic mapping

●

●

●

●

●●●●●●

●

●●●●

●

●●●

●

● ● ● ●

● ●

●●

●

●●

●

●●

● ● ● ●

12.5

15.0

17.5

20.0

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

[SCOPES’17b]

Robustness

q Static mappings, transformed or not, provide good predictabilityq However: Many things out of control

q Application data, unexpected interrupts, unexpected OS decisions

è Can we reason about robustness of mapping to external factors?


Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration* Perturb Modified behavior(correct?)

Design centering

q Design centering: Find a mapping that can better tolerate variations while staying feasible

q Studied field, in e.g., biology, circuit design or manufacturing systems.

q Currentlyq Using a bio-inspired algorithmq Robust against OS changes to the mapping


optimum

design center in feasible region

x1

x2

feasible region

[SCOPES’17a]

Evaluation

q Analyze how robust the center really isq Perturbate mappings and check how often the constraints are missedq Signal processing applications on clustered ARM manycore and NoC manycore (16)


MIMO-OFDM

[SCOPES’17a]

Machine Learning & emerging technologies


ML revolution: Frameworks and architectures

q Many existing frameworks, e.g., TVM, Tensor Comprehensions, TensorFlow, …

q Lots of traction in hardware architectures: TPU, V100, …

q Lot’s of resources for training, less on inference on edge-devices!


[Chen, OSDI’18]Example flow: TVM

Domain-specific abstractions

q Commonality: Tensor expression languages

q Increase programmer’s productivityq From compiler perspective: No abstraction toll

q Easier access to informationq Larger score for optimization


var input A : matrix &var input u : tensorIN &

v = (A # A # A # u . [[5 8] [3 7] [1 6]])

[RWDSL’18]

vs

Correct by construction

q Especially important in embedded systemsq Correct by designq No abstraction leaks

q Current efforts in formal semantics for safe code generation


[Array’19]

A = placeholder((m,h), name='A')

B = placeholder((n,h), name='B')

k = reduce_axis((0, h), name='k')

C = compute((m, n), lambda i, j:

sum(A[k, i] * B[k, j], axis=k))

!"# = %&'(

)*&"+&#

TeML: Results


q Extra control allows for new optimization (vs pluto): changing shapesq General tensor semantics allows covering more benchmarks than TensorFlow

29

[GPCE’18]

Emerging technologies: Racetrack memories

q Racetrack memories: one of many future alternativesq Cf. STT/Re-RAM, hybrid architectures

q Predicted extreme density at low latencyq 3D nano-wires with magnetic domains

q One port shared for many bitsq Domains move at high speeds (1000 ms-1)

q Sequential: Game changer for current HW/SW stackq Memory management

q Integration with other memory architectures q Data layout and allocation

[Parkin-Nature’15]


[TVLSI’18, TVLSI’19]

Compiler research on placement

q Compiler pass to reorganize placement of instructionsq Instruction fetch is naturally sequential!q Layout instructions to reduce shift operations

q Compiler pass to reorganize data for higher-level objectsq Variable allocation

q Tensor allocation and program transformation

31

[ISLPED’19]

[ArXiv’19]

[LCTES’19]

Simulating RTMs

q RTSim: Configurable racetrack simulatorq Allows running software benchmarksq Built on stop of other simulator technology: NVMAIN 2, Gem5, SystemC, …


https://github.com/tud-ccc/RTSim[IEEE CAL’19]

https://www.researchgate.net/deref/https:/github.com/tud-ccc/RTSim

Architecture and data layout optimization

q Architecture – software co-optimizationq Embedded system for inference: RTM as scratchpad with pre-shifting and other

optimizations

33[LCTES’19]

Architecture and data layout optimization

q Data-layout: Reduce the number of shifts

34[LCTES’19]

n

DB

C:2n

DB

C:2n+1

DB

C:3n-1

R0

R1

Rn-1

C00 C01

n

R0

R1

Rn-1

DB

C:0

DB

C:1

DB

C:n-1

A00 A01 A0n-1

A10 A1n-1

An-1n-1

C0 Cn-1DBC:n DBC:n+1 DBC:2n-1

B00

B10

Bn-10

A00 A01 A0n-1

compulsory shifts

overhead shifts

B00 B10 Bn-10

compulsory shifts

overhead shifts

C1

B01

B11

Bn-11

C (Bank-2)Ã (Bank-0) ~B (Bank-1)

Latency comparison vs SRAM

q Un-optimized and naïve mapping: Even worse latency than SRAMq 24% average improvement (even with very conservative circuit simulation)

35

0

0.5

1

1.5

2

2.5

3

4 8 16 32 64 128 256 512 1024 2048 Avg

Nor

mal

ized

late

ncy

Tensor size

SRAM RTM-naïve RTM-opt RTM-opt-ps

[LCTES’19]

Energy comparison vs SRAM

q Higher savings due to less leakage powerq 74% average improvement

36

0

0.2

0.4

0.6

0.8

1

1.2

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

4 8 16 32 64 128 256 512 1024 2048

Nor

mal

ized

ene

rgy

Tensor size

Leakage Energy Read/Write Energy Shift Energy

[LCTES’19]

Discussion


Summary

q System dynamics: More complex in distributed IoT scenariosq Hybrid compile-runtime methodologies: Difficult balance, new interfacesq Strive at retaining time predictability

q New workloads (e.g., ML) + new techs: Harness domain-specific abstractionsq More complex decision making (e.g., het. Memory systems)

q Expression DSLs to ease high-level manipulation and transformation

q Exciting times for DSE, SW/HW co-design formal languages for emerging platforms (example: RTM-scratchpads)



References

[Springer’14] J. Castrillon, R. Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap" , Springer, pp. 258, 2014. [SDR’10] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable SDR,” Wireless Innovation Conference and Product Exposition (SDR), 2010[ALOG’11] J. Castrillon, S. Schu ̈rmans, A. Stulova, W. Sheng, T. Kempf, R. Leupers, G. Ascheid, and H. Meyr, “Component-based waveform development: The nucleus tool flow for efficient and portable software defined radio”, Analog Integrated Circuits and Signal Processing, vol. 69, no. 2–3, pp. 173–190, 2011

[ACM TACO’17] Goens, A. et al. “Symmetry in Software Synthesis”. In: ACM Transactions on Architecture and Code Optimization (TACO) (2017).

[IESS’15] A. Goens and J. Castrillon, “Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs”, In IFIP International Embedded Systems Symposium (IESS), 2015, Foz do Iguaçu, Brazil, 2015.

[PARMA-DITAM’18] R. Khasanov, et al, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap" , PARMA-DITAM 18, ACM, pp. 20–25, Jan 2018.[SCOPES’17a] G. Hempel, et al, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering" SCOPES’17.[SCOPES’17b] Goens, A. et al. “TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings”, SCOPES’17.[Chen, OSDI’18] Chen, Tianqi, et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 2018.[RWDSL’18] N. A. Rink, et al. “CFDlang: High-level code generation for high-order methods in fluid dynamics”. RWDSL’18.[Array’19] N.A. Rink, N. A. and Jeronimo Castrillon. “TeIL: a type-safe imperative Tensor Intermediate Language”,

ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), ACM, 2019, 57-68[GPCE’17] A. Susungi, et al. "Towards Compositional and Generative Tensor Optimizations" GPCE’17, 169-175.

[GPCE’18] A. Susungi, et al. " "Meta-programming for cross-domain tensor optimizations" GPCE’18, 79-92.

[TVLSI’18] F. Hameed, A. A. Khan, J. Castrillon, "Performance and Energy Efficient Design of STT-RAM Last-Level-Cache" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 26, no. 6, pp. 1059–1072, Jun 2018.

[TVLSI’19] F. Hameed, J. Castrillon, "A Novel Hybrid DRAM/STT-RAM Last-Level-Cache Architecture for Performance, Energy and Endurance Enhancement" , In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), pp. 13pp, Jun 2019.

[Parkin-Nature’15] S. Parkin, and See-Hun Yang. "Memory on the racetrack." Nature nanotechnology 10.3 (2015): 195.

[IEEE CAL’19] A. A. Khan, F. Hameed, R. Bläsing, S. Parkin, J.Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories" In IEEE Computer Architecture Letters, Feb 2019.

[ISPLED’19] J. Multanen, A. A. Khan, P. Jääskeläinen, F. Hameed, J. Castrillon, “SHRIMP: Efficient Instruction Delivery with Domain Wall Memory”, International Symposium on Low Power Electronics and Design (ISLPED), ACM, 2019, 10pp

[LCTES’19] Khan, A. A., Rink, N. A., Hameed, F., Castrillon, J. "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads”, Proceedings of LCTES’19, ACM, pp. 12pp, New York, NY, USA, Jun 2019

[ArXiv’19] Khan, A. A., Hameed, F., Bläsing, R., Parkin, S., Castrillon, J. “ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0”, ArXiv arXiv:1903.03597, Mar 2019.

41

Documents

SoC programming in the era of the Internet of Things, machine learning and emerging ... · 2019-08-23 · SoC programming in the era of the Internet of Things, machine learning and