A Component-Based Multicore Programming Environment
Pierre Paulin, Director, SoC Platform Automation, STMicroelectronics (Canada) Inc.
Workshop on Fundamentals of Component-Based Design, ESWeek 2010, Scottsdale, AZ, 24 Oct. 2010
Core’s Law (for Embedded SoCs)

[Chart: cores per chip vs. year, 2002-2016, log scale from 2 to 256. The number of embedded cores in SoC products is doubling every ~2 years, progressing from dual-core through multi-core to many-core and beyond (???). Source: casual observation.]
Core’s Law: What’s Next?

[Chart: the same cores-per-chip trend, extrapolated past many-core to tongue-in-cheek eras labeled Corezilla, Corrama, and Coremporium, contrasted with a conventional bus-based MCU SoC block diagram (MCU, RISC, DSP, H/W blocks, I/O, memory).]
Lessons of History

50 years of sequential programming has taken us to the edge of the abyss. With parallel programming, we will all take a huge leap forward.
Platform Programming Models for Multi-Processor SoCs (Control, Audio, Video)

[Diagram: a programming model layered over a many-core platform: a NoC connecting clusters of PEs and H/W blocks, each cluster with shared L1 memory, a controller, and DMAs, contrasted with a conventional bus-based MCU/RISC/DSP SoC.]
Outline

- Platform 2012 Multicore Fabric
- Platform 2012 Programming Environment
  - Component-based programming models
  - Component-aware debug and visualization tools
- Case Studies
  - Video High-Quality Rescaling
    - Mapped to S/W platform
    - Mapped to H/W-S/W platform
  - VC1 codec
The P2012 Scalable Tile

[Block diagram: a host on the system bus, with DDR and peripherals, connected to the P2012 fabric: a fabric controller (with DMA and bridge) plus multiple clusters, each with a network interface (NI), DMA, L1 memory, and cluster controller (CC). Each cluster holds STxP70 cores with optional configurable instruction set, bit-stream processing, VECx extension, FPU, and OCE, and can carry H/W accelerators (HW 1-HW 4) on the SoC carrier. An L2 with its own controller backs the clusters.]

- The P2012 fabric can integrate up to 32 clusters
- 1-16 configurable cores per cluster
- Optional H/W processing elements
- A few hundred GOPS to several TOPS
P2012 Cluster Overview

[Block diagram: the cluster attaches to the fabric interconnect (ANoC, an asynchronous NoC) through a GALS interface and network interface (NI). It contains a cluster controller (STxP70 with I-$, TCDM, OCE, ITC, and DMA), an ENCore<n> block of processing elements PE 0..PE n sharing an L1 I$ and shared DMEM with a H/W synchronizer and DMAs, and hardware PEs HW PE #0..#K, each in its own clock and power domain, attached via slave interfaces and stream-in/stream-out (SI/SO) ports on a local interconnect (also an ANoC). A JTAG debug & test unit completes the cluster.]
P2012 Design Flow

[Diagram: an application (filters F1-F4) plus runtime S/W libraries and drivers are mapped by the tools onto a family of targets built from a customizable core (STxP70 with optional configurable instruction set, bit-stream processing, VECx extension, FPU, OCE) and a customizable tile (NI, DMA, L1, CC): a S/W-only cluster, a HW/SW cluster, or a HW cluster, all over a common S/W fabric with L2 and cluster controller.]
Outline

- Platform 2012 Multicore Fabric
- Platform 2012 Programming Environment
  - Component-based programming models
  - Component-aware debug, visualization and analysis tools
- Case Studies
  - Video High-Quality Rescaling
    - Mapped to S/W platform
    - Mapped to H/W-S/W platform
  - VC1 codec
Software Development Kit Stack

[Layer diagram: at the top, the programming environment (component-based, language-based, and API-based capture, with analysis, optimization, and mapping) and a component repository of programming patterns. Below it, the programming models: standard models (OpenMP, OpenCL), advanced models (streaming: PEDF, DDF; parallel threads), and the native programming layer (NPL API). These sit on the system infrastructure and runtime (dynamic deployment, execution engines, QoS, power management) and on platform models (TLM performance, power, functional). Cross-cutting services handle execution engine configuration and trace/debug info management.]
Programming Tools Outline

- MIND Component Infrastructure
  - Component-based programming models
  - Programming tools flow
  - Runtime
- Apex Application Modeling
- Trace, Visualization and Analysis
- Component-aware Debug

Component-based Programming Models

- Encapsulation and interfaces: good for distributed memory
- Binding through link components: a direct call when components share a memory (µProc/Mem), or RPC with marshalling/demarshalling across memories (µProc1/Mem1 to µProc2/Mem2); supports heterogeneity
- Control interface: introspection, observability
- Semantically neutral: can be used to support multiple programming models

[Diagram: S/W architecture expressed as components, interfaces, and bindings (link components).]
S/W Components: Support of Multiple Communication/Execution Semantics

- Synchronous call: direct call or RPC between components
- Asynchronous call: mediated by a scheduler (priority-based, preemptive, non-preemptive, real-time)
- Push model (streaming): components connected through FIFOs

S/W architecture, components, and interfaces are captured with ADL/IDL tools.

Component Application Capture

- Explicit description of the application architecture
  - ADL: Architecture Description Language
  - IDL: Interface Description Language
- Built on the Fractal-based MIND component infrastructure
- Open source (LGPL), available on OW2 (mind.ow2.org)

[Diagram: a captured application mixing dataflow filters (F1-F5), data-level-parallelism patterns (identical worker tasks T2, T2', T2''), dynamic task pools (T1, T3, T5), and communication PPPs.]
MIND Toolchain

[Flow: ADL and IDL descriptions plus C implementations are parsed by mindc, which generates C; the MIND preprocessor (mpp) and gcc then produce an executable or loadable binary.]

- ADL: Architecture Description Language
- IDL: Interface Definition Language
- mindc: MIND ADL/IDL parser and C code generator
- mpp: MIND PreProcessor

Example inputs for a filter pipeline (F1-F5):

interface comm.QueueWriter {
  void createQ(size_t sz, int nb);
  void destroyQueue();
  void * readNext();
}

int METH(itf, func)(void) {
  while ((in = CALL(input, readNext)()) != 0) {
    compute_function(in, out, PRIVATE.arg);
  }
}
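For completeness, here is a minimal sketch of what the ADL side of such a pipeline can look like in MIND-style syntax. The definition names (video.Pipeline, video.F1, comm.Queue) and the bindings are hypothetical illustrations of the ADL concepts, not the actual P2012 sources:

composite video.Pipeline {
  contains video.F1 as f1;       // filter instances
  contains video.F2 as f2;
  contains comm.Queue as q12;    // link component binding the filters
  binds f1.output to q12.writer;
  binds f2.input to q12.reader;
}

primitive video.F1 {
  requires comm.QueueWriter as output;  // interface from the IDL above
  source f1.c;                          // C implementation using METH/CALL
}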
Programming Tools & Runtime Outline

- MIND Component Infrastructure
  - Component-based programming models
  - Programming tools flow
  - Runtime
- Apex Application Modeling
- Trace, Visualization and Analysis
- Component-aware Debug

Programming Model Objectives

- Efficiency: maximum parallelism with minimum overhead
- Productivity: abstraction, ease of debugging, high-level analysis
- Scalability: more resources bring more performance
- Platform independence: patterns designed from the application's perspective
PPM Development: Dual Approach

- Build a basic set of Parallel Programming Patterns (PPPs)
  - For exploiting data-level (DLP) and task-level parallelism (TLP)
  - Communication, synchronization, and memory management patterns
  - Constructs for thread-based programming: thread creation/assignment, synchronization, message passing
  - Constructs for dataflow programming (streaming): execution engines (schedulers), filter templates, queues
- Refine PPPs from application experience
  - Video codecs (VC1, H.264)
  - Image quality improvement (HQR, TMNR, TNR, MC-DEI)
  - Image analysis (pedestrian recognition)
Parallel Programming Pattern (PPP) Library

[Layer diagram: applications (codec, IQI, modem, ...) are built on the parallel programming models: high-level threading for data-level parallelism (GCD-core, thread pool) and streaming (DDF, SDF for S/W and HW/SW). These compose execution-model patterns (PEDF, SDF, DDF, thread, run-to-completion), communication/synchronization patterns (exchanger, synchronized buffer, FIFO, queue iterator), and memory access patterns (prefetch, async prefetch, ...), all layered on the Native Programming Layer (NPL) and HAL for abstraction.]
Parallel Programming Patterns

- High-level set of patterns used to parallelize applications
- Exploiting different types of parallelism
- Interchangeable implementations (e.g., L1 only, L1-L2, L1-L2 via CDMA)
- Component-based (MIND framework)

[Diagram: application PPPs (Pattern A, Pattern B) are drawn from a component repository and composed by the component framework over the NPL/HAL, intra-cluster and inter-cluster. A Flynn-style taxonomy (single/multiple instruction x single/multiple data: SISD, SIMD, MISD, MIMD, extended with single/multiple program: SPMD, MPMD) situates the patterns: data-level parallelism around SIMD/SPMD, task-level parallelism around MIMD/MPMD.]
Communication Patterns (examples)

- Exchanger: buf = exchange(buf)
  - Swaps buffers between two participants (T1, T2); see the sketch after this list
- Queue iterator: buf = writeNext(buf); buf = readNext()
  - Iteration-based communication between producer(s) and consumer(s)
  - Single queue, or split / join / broadcast variants (e.g., T1 splitting to workers T1..Tn, then joining)
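To make the exchange(buf) contract concrete, here is a minimal portable sketch of a two-party exchanger. It uses pthreads for brevity, so the type and locking scheme are illustrative assumptions, not the P2012 PPP library implementation:

#include <pthread.h>

/* Two-party exchanger (sketch): each participant deposits its buffer
 * and leaves with the other participant's buffer. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t  cond;
  void *slot;       /* buffer left behind by the first arriver */
  int   occupied;   /* 1 while a buffer is waiting to be taken */
} ppp_exchanger_t;

void *exchange(ppp_exchanger_t *ex, void *buf) {
  void *other;
  pthread_mutex_lock(&ex->lock);
  if (!ex->occupied) {
    ex->slot = buf;                 /* first arriver: deposit and wait */
    ex->occupied = 1;
    while (ex->occupied)
      pthread_cond_wait(&ex->cond, &ex->lock);
    other = ex->slot;               /* partner's buffer, left for us */
  } else {
    other = ex->slot;               /* second arriver: swap and signal */
    ex->slot = buf;
    ex->occupied = 0;
    pthread_cond_broadcast(&ex->cond);
  }
  pthread_mutex_unlock(&ex->lock);
  return other;
}

In the HQR mapping shown later, this is essentially how neighboring PEs trade border-pixel buffers without any global synchronization.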
Communication Patterns (examples, cont'd)

- Synchronized buffer: request/release{Read,Write}(buf)
  - Synchronized read/write on a shared buffer (T1, T2)
- Sliced synchronized buffer
  - Specialization for accessing a large buffer in smaller slices
- FIFO: push(); pop(); peek()
  - Packet-based streaming between producer and consumer (see the sketch after this list)
  - Buffer copy or buffer-pointer passing
  - Single queue, or split / join / broadcast variants
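A minimal single-producer/single-consumer sketch of the FIFO contract above, passing buffer pointers through a bounded ring. The depth, type names, and the omission of the platform's memory barriers are simplifying assumptions, not the PPP library code:

#include <stddef.h>

#define FIFO_DEPTH 8u   /* illustrative; power of two */

typedef struct {
  void *ring[FIFO_DEPTH];
  volatile unsigned head;   /* advanced by the consumer */
  volatile unsigned tail;   /* advanced by the producer */
} ppp_fifo_t;

int fifo_push(ppp_fifo_t *f, void *buf) {         /* producer side */
  if (f->tail - f->head == FIFO_DEPTH) return 0;  /* full */
  f->ring[f->tail % FIFO_DEPTH] = buf;            /* pointer passing */
  f->tail++;
  return 1;
}

void *fifo_peek(ppp_fifo_t *f) {                  /* consumer side */
  return (f->head == f->tail) ? NULL : f->ring[f->head % FIFO_DEPTH];
}

void *fifo_pop(ppp_fifo_t *f) {
  void *buf = fifo_peek(f);
  if (buf) f->head++;
  return buf;
}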
Memory Access Patterns

- Prefetcher: key = load(ptrL3, size); ptrL1 = get(key); free(key)
  - Prefetches data from L3 to L1 (usage sketch after this list)
  - load() programs the prefetch and returns a key without waiting for the transfer
  - get() returns a pointer into L1, blocking until the transfer has completed
  - Also supports 2D arrays: key = load(ptr, width, height)
- Async prefetcher: load(ptr, size, handler(key)); free(key)
  - Enables efficient thread-pool execution engines
  - load() programs the prefetch and returns without waiting for the transfer
  - handler(key) is invoked by the execution engine when the transfer has completed
  - Also supports 2D arrays: load(ptr, width, height, handler(key))
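The value of the key-based API is that it enables double buffering: the next transfer is programmed before the current tile is consumed, overlapping DMA with compute. A usage sketch assuming the signatures above (free is renamed free_key here only to avoid clashing with the C library); process_tile() is a hypothetical per-tile workload:

#include <stddef.h>

/* Assumed prefetcher API, following the slide: */
extern int   load(void *src, size_t size);  /* program L3->L1, return key */
extern void *get(int key);                  /* block until done, return L1 ptr */
extern void  free_key(int key);             /* release slot; free() on the slide */

extern void process_tile(void *l1_ptr, size_t size);  /* hypothetical */

void process_all(char *l3_base, size_t tile_sz, int n_tiles) {
  int next = load(l3_base, tile_sz);               /* start transfer 0 */
  for (int i = 0; i < n_tiles; i++) {
    int cur = next;
    if (i + 1 < n_tiles)                           /* overlap the next DMA */
      next = load(l3_base + (size_t)(i + 1) * tile_sz, tile_sz);
    process_tile(get(cur), tile_sz);               /* blocks for tile i only */
    free_key(cur);                                 /* release the L1 slot */
  }
}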
Execution Patterns: Static Threads

- Static mapping of kernels onto a set of platform resources
- Minimal runtime overhead
  - No kernel multiplexing required
- Manual load balancing
  - Best when each kernel has similar computational requirements
- Example usage: Video High-Quality Rescaling (HQR) mapped to S/W

[Diagram: a bootstrap main() on each PE creates the threads; component constructors C1-C3 instantiate the kernels.]
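A minimal sketch of the static-thread bootstrap: one kernel is pinned to one PE for the whole run, so no multiplexing or dynamic scheduling is needed. npl_thread_create_on() is a hypothetical NPL-style call; the actual NPL API is not shown in these slides:

typedef void (*kernel_fn)(void *);

extern void npl_thread_create_on(int pe_id, kernel_fn fn, void *arg); /* assumed */

/* The component constructors (C1..C3 above) have already instantiated
 * the kernels; bootstrap only binds one kernel per processing element. */
void bootstrap(kernel_fn kernels[], void *args[], int n_pe) {
  for (int pe = 0; pe < n_pe; pe++)
    npl_thread_create_on(pe, kernels[pe], args[pe]);
}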
Execution Patterns: Thread Pool

- Dynamic dispatch of kernels onto a set of platform resources
- Some runtime overhead
  - Multiplexes K kernels on R resources
- Dynamic load balancing
  - Different heuristics offered: job stealing, cache awareness
- Example usage: H.264 motion estimator mapped to S/W

[Diagram: an application main controller submits kernels #1..#N to the thread pool through a registration interface; each kernel exposes a runnable interface.]
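A sketch of the registration/runnable split implied by the diagram, with placeholder names (tp_submit, runnable_itf) rather than the actual P2012 interfaces; the H.264-style example submits one motion-estimation job per macro-block row:

/* Thread-pool dispatch (sketch). Kernels implement a "runnable"
 * interface and are registered with the pool, which multiplexes
 * them over the PEs it owns. All names are illustrative. */
typedef struct {
  void (*run)(void *self);   /* the kernel's entry point */
  void *self;                /* per-kernel private state  */
} runnable_itf;

extern void tp_submit(runnable_itf job);   /* assumed pool API */

typedef struct { int mb_row; } me_job_t;

static void me_run(void *self) {
  me_job_t *job = (me_job_t *)self;
  /* ... search motion vectors for job->mb_row ... */
  (void)job;
}

void submit_frame(me_job_t jobs[], int n_rows) {
  for (int r = 0; r < n_rows; r++) {
    jobs[r].mb_row = r;
    tp_submit((runnable_itf){ .run = me_run, .self = &jobs[r] });
  }
}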
Execution Patterns: Dataflow (PEDF)

- Predicated Execution Data-Flow (PEDF): filters F1-F3 under a mode controller, with host communication
- Host communication component: models the part of the application that communicates with the host
- Mode controller: configures control parameters, steps the pipeline
- Filters: perform the actual data computation
- Example use: Video HQR mapped to H/W-S/W
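As a rough illustration of the controller/filter split (the concrete PEDF interfaces are not shown here, so all names are placeholders): the mode controller first configures each filter's control parameters, then steps the pipeline by repeatedly firing filters until none has work left:

/* Dataflow step loop (sketch of the PEDF idea, not the real API). */
typedef struct filter {
  void (*configure)(struct filter *f, const void *params);
  int  (*fire)(struct filter *f);   /* one firing; returns 0 when idle */
} filter_t;

void mode_controller_step(filter_t *filters[], int n,
                          const void *params[]) {
  for (int i = 0; i < n; i++)
    filters[i]->configure(filters[i], params[i]);  /* set the mode */
  int active;
  do {                                             /* step the pipeline */
    active = 0;
    for (int i = 0; i < n; i++)
      active |= filters[i]->fire(filters[i]);
  } while (active);
}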
Programming Tools & Runtime Outline

- MIND Component Infrastructure
  - Component-based programming models
  - Programming tools flow
  - Runtime
- Apex Application Modeling
- Trace, Visualization and Analysis
- Component-aware Debug
P2012 Mapping Flow

[Flow diagram: component-based application capture feeds analysis and optimization in the programming tools. The captured application targets the programming models (streaming: PEDF, DDF; parallel threads; OpenMP; OpenCL; API-based; native PM over the NPL API), backed by the PPP library, with automatic communication generation. Mapped code runs either on the Apex functional simulation or on the P2012 platform (clusters of PE0..PEn with CC, DMA, HWS, L1/L2 memories) for performance analysis, supported by the runtime (execution engines, deployment, QoS and power manager). Trace synthesis, debug info generation, and execution engine configuration feed the trace visualization/debug tools.]
Application-to-Platform Mapping

[Diagram: the captured application, mixing dataflow filters (F1-F5), data-level-parallelism patterns (T2, T2', T2''), dynamic task pools (T1, T3, T5), and communication PPPs, is mapped onto the platform's clusters.]

- Mapping tools: assignment, transformations, communication generation, runtime link
- Runtime: dynamic deployment, assignment, and scheduling
Platform 2012 Software Runtime Architecture

[Layer diagram: the parallel programming models sit on the Native Programming Layer, which provides three groups of services over the HAL:
- Execute: threading execution engine, streaming execution engine, synchronization, queues, thread/communication/memory allocation
- Deploy: component dynamic deployment, code loading
- Platform QoS: power management, monitoring, fault tolerance
Below, a fabric controller coordinates clusters 1..N, each with per-PE synchronization/control and monitoring.]
Apex: Application Modeling & Simulation Environment

- Bridges the gap between the component-based models and the P2012 mapping tools
- Apex objectives
  - Capture and validation of high-level sequential and parallel descriptions
  - Verification of parallel correctness
- Refinement path: C reference model, Apex sequential model, Apex parallel model, firmware on the P2012 TLM platform (L2, shared L1, PEs, CDMA)
- Apex benefits
  - No need for a TLM platform
  - Fast simulation: >100x TLM speed
  - High-level validation of parallelism
  - Compatible with the mapping tools
Visualization and Analysis Flow

[Flow: low-level traces from the SoC (host, P2012 fabric controller and clusters 0..N, collected via the System Trace Module) are gathered into a trace database, which the visualization and analysis tools (in STWorkbench) consume. The mapping tools contribute the correspondence between the application description and platform resources: application mapping results, used resources, and symbolic names / physical IDs.]
Programming-Model-Aware Visualization

- Displays broadcast, exchanger, split/join, synchronized buffer, and execution patterns

[Trace screenshot: cache fill/flush, exchange, join, and split events rendered on the timeline.]
NPL Visualization

[Trace screenshot: mutex ownership changes, mutex critical sections, nested functions, and barriers rendered on the timeline.]
Component-Aware Multi-core Debug

- Component-aware debug
  - Print state variables (attributes, private data)
  - Break on a component method; conditional breakpoint on a specific instance
  - Jump over compiler-generated interface stubs
  - Print the instance hierarchy of the application
  - Print the current location in the hierarchy
- Multicore features
  - Single cockpit controlling multiple debugger instances
  - Support for identical program images
  - Support for heterogeneous program images planned
P2012 Debug Flow

[Flow: the mapping tools emit a component hierarchy description, consumed by component-awareness extensions (Python) layered on standard GDB v7. Debug sessions run from a terminal or from STWorkbench with multi-core support, attached to the SoC (host, P2012 fabric controller, clusters 0..N) through the Debug & Test Unit.]
Outline

- Platform 2012 Multicore Fabric
- Platform 2012 Programming Environment
  - Component-based programming models
  - Component-aware debug, visualization and analysis tools
- Case Studies
  - Video High-Quality Rescaling
    - Mapped to S/W platform
    - Mapped to H/W-S/W platform
  - VC1 video codec
HQR (High-Quality Rescaling)

- HD 1080p, 120 fps
- SDF model variant
  - One “token” in/out per link per filter firing, or simple static multi-rate
  - Tokens are typically a line of pixel data
- Multiple modes (on a frame-by-frame basis)
- Some dynamic control flow and exceptions
  - E.g., dynamic bypass of a filter

[Dataflow graph: Y and U,V inputs flow through Gaussian (GA), Gradient Approx, Horizontal Antialiasing (AAL-H), Vertical Antialiasing (AAL-V), Horizontal NewSpline, and Vertical NewSpline filters, with per-link delays (D = 2 to 9).]
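Under this SDF contract, each filter reduces to a firing function that consumes one line-token per input link and produces one per output link. A hedged sketch of a vertical 3-tap filter in that style; the line width, tap weights, and names are illustrative, not the HQR code:

/* One SDF firing of a line-based filter (sketch): consume exactly one
 * token (a line of pixels) per input link, produce one per output link.
 * A vertical 3-tap filter needs a 3-line window, hence the delay (D)
 * of buffered lines on its input link, as in the HQR graph above. */
typedef struct { unsigned short px[1920]; } line_t;

void aal_v_fire(const line_t *above, const line_t *cur,
                const line_t *below, line_t *out) {
  for (int x = 0; x < 1920; x++)
    out->px[x] = (unsigned short)((above->px[x] + 2 * cur->px[x]
                                   + below->px[x]) / 4);
}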
Two Mapping Approaches

- Map to S/W-based platform: data-level parallelism
  - Structured threading model (threads & DLP)
  - Multi-processor & SIMD
  - All tasks for a given data element assigned to a single PE
- Map to H/W-dominated platform: task-level parallelism
  - Dataflow programming model (PEDF: In, F1, F2, F3, ..., F4, Out)
  - Software-based control
  - Tasks assigned to a single H/W processing unit
  - DLP inside each H/W PU
S/W Mapping: HQR Example

- Data-level parallelism
  - Each image line is split into stripes (stripe width W; pixels accessed = W + 2 + 2, for border pixels)
  - Each PE runs all filters (F1-F6) for a stripe (Task1-Task4 on PE1-PE4)
  - SIMD optimization of each filter
- Parallel programming patterns
  - Data iterator split and join patterns
  - Synchronization between PEs using the “exchanger” pattern (for border pixels)

[Diagram: the input image is divided into four stripes; an iterator split feeds Task1-Task4 (each running the full F1-F6 filter chain on PE1-PE4), with data exchange between neighbors and an iterator join at the output.]
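A sketch of one PE's stripe loop under this mapping, reusing the readNext/writeNext and exchanger APIs from the pattern slides; run_filters(), acquire_buffer(), and the border handling are illustrative placeholders:

/* Per-PE stripe worker (sketch). Each PE pulls its stripe of every
 * line from the iterator-split queue, exchanges border pixels with
 * its neighbors, runs the whole filter chain, and pushes results to
 * the iterator-join queue. */
extern void *readNext(void);                /* iterator split */
extern void *writeNext(void *buf);          /* iterator join: submit, get fresh */
extern void *exchange_left(void *border);   /* exchanger instances, assumed */
extern void *exchange_right(void *border);
extern void *acquire_buffer(void);          /* hypothetical buffer alloc */
extern void  run_filters(void *in, void *out, void *lb, void *rb); /* F1..F6 */

void stripe_worker(void) {
  void *in;
  void *out = acquire_buffer();
  void *lb = acquire_buffer(), *rb = acquire_buffer();
  while ((in = readNext()) != 0) {          /* one stripe of one line */
    lb = exchange_left(lb);                 /* trade 2 border px per side */
    rb = exchange_right(rb);
    run_filters(in, out, lb, rb);
    out = writeNext(out);                   /* recycle the output buffer */
  }
}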
HQR PPP Profiling Results

[Diagram: PE4 runs the split (writeNext(0..3)), workers W0-W3 on PE0-PE3 each do readNext()/writeNext() and exchange border data with their neighbors, and PE5 runs the join (readNext(0..3)).]

Measured pattern overheads:
- readNext(): 50 instr./call
- writeNext(): 65 instr./call
- exchange: 65-80 instr./call

Concrete impact on performance in the case of HQR: less than 1%.
Prog. Tools: HQR Mapping Results

Vectorization results (16-way VECx EFU, standalone CA-ISS):
- Average vector unit utilization: 79%

Parallel processing results (1 vs. 4 PEs, on a cycle-approximate TLM platform):

  Configuration                   Cycles       Speedup
  Single CPU (cache enabled)      19,179,956   -
  4 CPU (initial)                  9,406,673   2x
  4 CPU (burst communication)      5,731,049   3.3x
  4 CPU (buffer dimensioning)      4,811,882   3.9x
HQR Optimization Process

[Chart: number of PEs required (8-wide EFU), by optimization step:
- Reference code, frame-based (no vectorization): 697
- Line-based (vectorization unoptimized): 200
- Line/column switch before the horizontal spline filter (simplifies processing): 26
- Line/column switch before the last filter (vectorization optimized): 16
- New specialized SIMD instructions: 12]

The final S/W solution is ~5x the cost of the H/W solution.
H/W Mapping: HQR Example

[Dataflow graph as before (Gaussian (GA), Gradient Approx, AAL-H, AAL-V, Horizontal/Vertical NewSpline, with per-link delays D = 2 to 9), annotated with the PE assignment: the Y,U,V front end on PE0, the Y path on PE1, U on PE2, V on PE3.]

- Task-level parallelism
  - Each filter is assigned to a H/W processing unit
  - Highly communicating PUs are grouped onto a single PE
- In contrast with the S/W mapping, where
  - Data-level parallelism is exploited (multi-PE and SIMD)
  - Each PE performs all tasks
PEDF Dataflow Programming Model

- Predicated Execution Data Flow: filters F1-F3 under a mode controller, with host communication and an auto-generated iteration controller
- Host communication component: gets the host request parameters
  - Frame data
  - Required processing type
  - Processing parameters
- Mode controller: configures control parameters, steps the pipeline
- Filters: actual data computation
Overview of PEDF Mapping Flow

- Abstract PEDF application capture: a single PEDF description (host communication, mode controller, filters F1-F3)
- Mapped to the host (native execution) with the Apex tool
- Mapped to the TLM platform via the mapping tools:
  - Control code on the STxP70 (cluster controller ISS, host communication)
  - Microcode on the data-streaming ports of the H/W PEs (PE0-PE3 running filters F1..Fz over the NoC model, with shared MEM)
HW-SW Interaction in P2012

[Diagram: HW PEs chained through stream-in/stream-out (SI/SO) ports carry the streaming data flow (hardware processing). A data streamer moves data between L1 memory and the stream ports, and load/store paths reach L1 and L2 memory (data mover). The STxP70 handles the configuration and control flow (cluster control).]
Mapping to HW/SW Platform

- Application capture using the dataflow-variant programming model (configuration controller, iteration controller, filters 0..N, connectivity “glue”)
- Host execution (Apex) of the functional model for functional validation
- Automatic control code generation for TLM/COSIM (STxP70 ISS, PE0-PE3 over the NoC model) for performance analysis
TNR Performance Comparison: FlexMap vs. Hand-Coded (STxP70 instructions/frame)

- >50% reduction in execution time
- More abstract capture allows for more optimization
Multiple Programming Models

- Top level: dynamic dataflow pipeline (e.g., MV Pred, TMNR, and HQR stages connected by FIFOs)
- Interchangeable Threads/DLP and PEDF (In, F1, F2, F3, ..., F4, Out) implementations of HQR
- Processing functions use vectorial (SIMD) data parallelism
- Components act as a semantic-neutral structuring mechanism
VC1 Overview: Task-Level Parallelism

- Application components: pipelined task-level parallelism
  - Decode pipeline: Bit Parse, VLD Intra / VLD Inter, AC/DC IZZ, IQ, IDCT, Smooth, MC, Deblock, Range mapping, under a frame/slice control task (Ctrl) read by all stages
  - Further tasks: intra prediction, motion compensation (with intensity compensation), IDCT reconstruction, loop filter, MV prediction
- Processes one segment of macro-blocks at a time (dataflow); queue tokens carry motion vectors plus data
- Two decoded reference frames are kept in L3 as a circular buffer; a prefetch manager (using the CDMA) provides the interface to obtain reference macro-blocks; one decoded frame goes to display
- Communication components
  - Available in a library; provide the binding between parallel components
  - Simple queues, macro-block queues, two-line queues, an MV Pred queue, and a QueueIterator broadcast
Component-based Programming Models: Conclusions

- Demonstrated value of components
- Supports multiple programming models
  - DLP/threads for the S/W mapping
  - PEDF for the HW/SW mapping
- Multiple and evolving execution targets
  - H/W-S/W partitioning
  - Multiple simulation environments
- Platform independence
  - Abstraction of communication
- No overhead in practical use
  - S/W mapping: <1% overhead
  - H/W-S/W mapping: 50% execution-time reduction (vs. hand-coded)
Multi-Processor SoCs for Smart People!

Programming Models:
- Higher productivity
- More platform independence