A Component-Based Multicore Programming Environment
Pierre Paulin, Director, SoC Platform Automation, STMicroelectronics (Canada) Inc.
Workshop on Fundamentals of Component-Based Design, ESWeek 2010, Scottsdale, AZ, 24 Oct. 2010
Core’s Law (for Embedded SoCs)

[Chart: cores per chip vs. year, 2002-2016, log scale from 2 to 256. The number of embedded cores in SoC products is doubling every ~2 years, progressing from dual-core through multi-core to many-core and beyond (???). Source: casual observation.]
Core’s Law: What’s Next?

[Chart: the same cores-per-chip trend, extrapolated past many-core to tongue-in-cheek eras labeled Corezilla, Corrama, and Coremporium, contrasted with a conventional bus-based MCU SoC block diagram (MCU, RISC, DSP, H/W blocks, I/O, memory).]
Lessons of History

50 years of sequential programming has taken us to the edge of the abyss. With parallel programming, we will all take a huge leap forward.
Platform Programming Models for Multi-Processor SoCs (Control, Audio, Video)

[Diagram: a programming model layered over a many-core platform: a NoC connecting clusters of PEs and H/W blocks, each cluster with shared L1 memory, a controller, and DMAs, contrasted with a conventional bus-based MCU/RISC/DSP SoC.]
Outline

- Platform 2012 Multicore Fabric
- Platform 2012 Programming Environment
  - Component-based programming models
  - Component-aware debug and visualization tools
- Case Studies
  - Video High-Quality Rescaling
    - Mapped to S/W platform
    - Mapped to H/W-S/W platform
  - VC1 codec
The P2012 Scalable Tile

[Block diagram: a host on the system bus, with DDR and peripherals, connected to the P2012 fabric: a fabric controller (with DMA and bridge) plus multiple clusters, each with a network interface (NI), DMA, L1 memory, and cluster controller (CC). Each cluster holds STxP70 cores with optional configurable instruction set, bit-stream processing, VECx extension, FPU, and OCE, and can carry H/W accelerators (HW 1-HW 4) on the SoC carrier. An L2 with its own controller backs the clusters.]

- The P2012 fabric can integrate up to 32 clusters
- 1-16 configurable cores per cluster
- Optional H/W processing elements
- A few hundred GOPS to several TOPS
P2012 Cluster Overview

[Block diagram: the cluster attaches to the fabric interconnect (ANoC, an asynchronous NoC) through a GALS interface and network interface (NI). It contains a cluster controller (STxP70 with I-$, TCDM, OCE, ITC, and DMA), an ENCore<n> block of processing elements PE 0..PE n sharing an L1 I$ and shared DMEM with a H/W synchronizer and DMAs, and hardware PEs HW PE #0..#K, each in its own clock and power domain, attached via slave interfaces and stream-in/stream-out (SI/SO) ports on a local interconnect (also an ANoC). A JTAG debug & test unit completes the cluster.]
P2012 Design Flow

[Diagram: an application (filters F1-F4) plus runtime S/W libraries and drivers are mapped by the tools onto a family of targets built from a customizable core (STxP70 with optional configurable instruction set, bit-stream processing, VECx extension, FPU, OCE) and a customizable tile (NI, DMA, L1, CC): a S/W-only cluster, a HW/SW cluster, or a HW cluster, all over a common S/W fabric with L2 and cluster controller.]
Outline

- Platform 2012 Multicore Fabric
- Platform 2012 Programming Environment
  - Component-based programming models
  - Component-aware debug, visualization and analysis tools
- Case Studies
  - Video High-Quality Rescaling
    - Mapped to S/W platform
    - Mapped to H/W-S/W platform
  - VC1 codec
Software Development Kit Stack

[Layer diagram: at the top, the programming environment (component-based, language-based, and API-based capture, with analysis, optimization, and mapping) and a component repository of programming patterns. Below it, the programming models: standard models (OpenMP, OpenCL), advanced models (streaming: PEDF, DDF; parallel threads), and the native programming layer (NPL API). These sit on the system infrastructure and runtime (dynamic deployment, execution engines, QoS, power management) and on platform models (TLM performance, power, functional). Cross-cutting services handle execution engine configuration and trace/debug info management.]
Programming Tools Outline

- MIND Component Infrastructure
  - Component-based programming models
  - Programming tools flow
  - Runtime
- Apex Application Modeling
- Trace, Visualization and Analysis
- Component-aware Debug

Component-based Programming Models

- Encapsulation and interfaces: good for distributed memory
- Binding through link components: a direct call when components share a memory (µProc/Mem), or RPC with marshalling/demarshalling across memories (µProc1/Mem1 to µProc2/Mem2); supports heterogeneity
- Control interface: introspection, observability
- Semantically neutral: can be used to support multiple programming models

[Diagram: S/W architecture expressed as components, interfaces, and bindings (link components).]
S/W Components: Support of Multiple Communication/Execution Semantics

- Synchronous call: direct call or RPC between components
- Asynchronous call: mediated by a scheduler (priority-based, preemptive, non-preemptive, real-time)
- Push model (streaming): components connected through FIFOs

S/W architecture, components, and interfaces are captured with ADL/IDL tools.

Component Application Capture

- Explicit description of the application architecture
  - ADL: Architecture Description Language
  - IDL: Interface Description Language
- Built on the Fractal-based MIND component infrastructure
- Open source (LGPL), available on OW2 (mind.ow2.org)

[Diagram: a captured application mixing dataflow filters (F1-F5), data-level-parallelism patterns (identical worker tasks T2, T2', T2''), dynamic task pools (T1, T3, T5), and communication PPPs.]
MIND Toolchain

[Flow: ADL and IDL descriptions plus C implementations are parsed by mindc, which generates C; the MIND preprocessor (mpp) and gcc then produce an executable or loadable binary.]

- ADL: Architecture Description Language
- IDL: Interface Definition Language
- mindc: MIND ADL/IDL parser and C code generator
- mpp: MIND PreProcessor

Example inputs for a filter pipeline (F1-F5):

interface comm.QueueWriter {
  void createQ(size_t sz, int nb);
  void destroyQueue();
  void * readNext();
}

int METH(itf, func)(void) {
  while ((in = CALL(input, readNext)()) != 0) {
    compute_function(in, out, PRIVATE.arg);
  }
}
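For completeness, here is a minimal sketch of what the ADL side of such a pipeline can look like in MIND-style syntax. The definition names (video.Pipeline, video.F1, comm.Queue) and the bindings are hypothetical illustrations of the ADL concepts, not the actual P2012 sources:

composite video.Pipeline {
  contains video.F1 as f1;       // filter instances
  contains video.F2 as f2;
  contains comm.Queue as q12;    // link component binding the filters
  binds f1.output to q12.writer;
  binds f2.input to q12.reader;
}

primitive video.F1 {
  requires comm.QueueWriter as output;  // interface from the IDL above
  source f1.c;                          // C implementation using METH/CALL
}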
Programming Tools & Runtime Outline

- MIND Component Infrastructure
  - Component-based programming models
  - Programming tools flow
  - Runtime
- Apex Application Modeling
- Trace, Visualization and Analysis
- Component-aware Debug

Programming Model Objectives

- Efficiency: maximum parallelism with minimum overhead
- Productivity: abstraction, ease of debugging, high-level analysis
- Scalability: more resources bring more performance
- Platform independence: patterns designed from the application's perspective
PPM Development: Dual Approach

- Build a basic set of Parallel Programming Patterns (PPPs)
  - For exploiting data-level (DLP) and task-level parallelism (TLP)
  - Communication, synchronization, and memory management patterns
  - Constructs for thread-based programming: thread creation/assignment, synchronization, message passing
  - Constructs for dataflow programming (streaming): execution engines (schedulers), filter templates, queues
- Refine PPPs from application experience
  - Video codecs (VC1, H.264)
  - Image quality improvement (HQR, TMNR, TNR, MC-DEI)
  - Image analysis (pedestrian recognition)
Parallel Programming Pattern (PPP) Library

[Layer diagram: applications (codec, IQI, modem, ...) are built on the parallel programming models: high-level threading for data-level parallelism (GCD-core, thread pool) and streaming (DDF, SDF for S/W and HW/SW). These compose execution-model patterns (PEDF, SDF, DDF, thread, run-to-completion), communication/synchronization patterns (exchanger, synchronized buffer, FIFO, queue iterator), and memory access patterns (prefetch, async prefetch, ...), all layered on the Native Programming Layer (NPL) and HAL for abstraction.]
Parallel Programming Patterns

- High-level set of patterns used to parallelize applications
- Exploiting different types of parallelism
- Interchangeable implementations (e.g., L1 only, L1-L2, L1-L2 via CDMA)
- Component-based (MIND framework)

[Diagram: application PPPs (Pattern A, Pattern B) are drawn from a component repository and composed by the component framework over the NPL/HAL, intra-cluster and inter-cluster. A Flynn-style taxonomy (single/multiple instruction x single/multiple data: SISD, SIMD, MISD, MIMD, extended with single/multiple program: SPMD, MPMD) situates the patterns: data-level parallelism around SIMD/SPMD, task-level parallelism around MIMD/MPMD.]
Communication Patterns (examples)

- Exchanger: buf = exchange(buf)
  - Swaps buffers between two participants (T1, T2); see the sketch after this list
- Queue iterator: buf = writeNext(buf); buf = readNext()
  - Iteration-based communication between producer(s) and consumer(s)
  - Single queue, or split / join / broadcast variants (e.g., T1 splitting to workers T1..Tn, then joining)
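To make the exchange(buf) contract concrete, here is a minimal portable sketch of a two-party exchanger. It uses pthreads for brevity, so the type and locking scheme are illustrative assumptions, not the P2012 PPP library implementation:

#include <pthread.h>

/* Two-party exchanger (sketch): each participant deposits its buffer
 * and leaves with the other participant's buffer. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t  cond;
  void *slot;       /* buffer left behind by the first arriver */
  int   occupied;   /* 1 while a buffer is waiting to be taken */
} ppp_exchanger_t;

void *exchange(ppp_exchanger_t *ex, void *buf) {
  void *other;
  pthread_mutex_lock(&ex->lock);
  if (!ex->occupied) {
    ex->slot = buf;                 /* first arriver: deposit and wait */
    ex->occupied = 1;
    while (ex->occupied)
      pthread_cond_wait(&ex->cond, &ex->lock);
    other = ex->slot;               /* partner's buffer, left for us */
  } else {
    other = ex->slot;               /* second arriver: swap and signal */
    ex->slot = buf;
    ex->occupied = 0;
    pthread_cond_broadcast(&ex->cond);
  }
  pthread_mutex_unlock(&ex->lock);
  return other;
}

In the HQR mapping shown later, this is essentially how neighboring PEs trade border-pixel buffers without any global synchronization.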
Communication Patterns (examples, cont'd)

- Synchronized buffer: request/release{Read,Write}(buf)
  - Synchronized read/write on a shared buffer (T1, T2)
- Sliced synchronized buffer
  - Specialization for accessing a large buffer in smaller slices
- FIFO: push(); pop(); peek()
  - Packet-based streaming between producer and consumer (see the sketch after this list)
  - Buffer copy or buffer-pointer passing
  - Single queue, or split / join / broadcast variants
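A minimal single-producer/single-consumer sketch of the FIFO contract above, passing buffer pointers through a bounded ring. The depth, type names, and the omission of the platform's memory barriers are simplifying assumptions, not the PPP library code:

#include <stddef.h>

#define FIFO_DEPTH 8u   /* illustrative; power of two */

typedef struct {
  void *ring[FIFO_DEPTH];
  volatile unsigned head;   /* advanced by the consumer */
  volatile unsigned tail;   /* advanced by the producer */
} ppp_fifo_t;

int fifo_push(ppp_fifo_t *f, void *buf) {         /* producer side */
  if (f->tail - f->head == FIFO_DEPTH) return 0;  /* full */
  f->ring[f->tail % FIFO_DEPTH] = buf;            /* pointer passing */
  f->tail++;
  return 1;
}

void *fifo_peek(ppp_fifo_t *f) {                  /* consumer side */
  return (f->head == f->tail) ? NULL : f->ring[f->head % FIFO_DEPTH];
}

void *fifo_pop(ppp_fifo_t *f) {
  void *buf = fifo_peek(f);
  if (buf) f->head++;
  return buf;
}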
Memory Access Patterns

- Prefetcher: key = load(ptrL3, size); ptrL1 = get(key); free(key)
  - Prefetches data from L3 to L1 (usage sketch after this list)
  - load() programs the prefetch and returns a key without waiting for the transfer
  - get() returns a pointer into L1, blocking until the transfer has completed
  - Also supports 2D arrays: key = load(ptr, width, height)
- Async prefetcher: load(ptr, size, handler(key)); free(key)
  - Enables efficient thread-pool execution engines
  - load() programs the prefetch and returns without waiting for the transfer
  - handler(key) is invoked by the execution engine when the transfer has completed
  - Also supports 2D arrays: load(ptr, width, height, handler(key))
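The value of the key-based API is that it enables double buffering: the next transfer is programmed before the current tile is consumed, overlapping DMA with compute. A usage sketch assuming the signatures above (free is renamed free_key here only to avoid clashing with the C library); process_tile() is a hypothetical per-tile workload:

#include <stddef.h>

/* Assumed prefetcher API, following the slide: */
extern int   load(void *src, size_t size);  /* program L3->L1, return key */
extern void *get(int key);                  /* block until done, return L1 ptr */
extern void  free_key(int key);             /* release slot; free() on the slide */

extern void process_tile(void *l1_ptr, size_t size);  /* hypothetical */

void process_all(char *l3_base, size_t tile_sz, int n_tiles) {
  int next = load(l3_base, tile_sz);               /* start transfer 0 */
  for (int i = 0; i < n_tiles; i++) {
    int cur = next;
    if (i + 1 < n_tiles)                           /* overlap the next DMA */
      next = load(l3_base + (size_t)(i + 1) * tile_sz, tile_sz);
    process_tile(get(cur), tile_sz);               /* blocks for tile i only */
    free_key(cur);                                 /* release the L1 slot */
  }
}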
Execution Patterns: Static Threads

- Static mapping of kernels onto a set of platform resources
- Minimal runtime overhead
  - No kernel multiplexing required
- Manual load balancing
  - Best when each kernel has similar computational requirements
- Example usage: Video High-Quality Rescaling (HQR) mapped to S/W

[Diagram: a bootstrap main() on each PE creates the threads; component constructors C1-C3 instantiate the kernels.]
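A minimal sketch of the static-thread bootstrap: one kernel is pinned to one PE for the whole run, so no multiplexing or dynamic scheduling is needed. npl_thread_create_on() is a hypothetical NPL-style call; the actual NPL API is not shown in these slides:

typedef void (*kernel_fn)(void *);

extern void npl_thread_create_on(int pe_id, kernel_fn fn, void *arg); /* assumed */

/* The component constructors (C1..C3 above) have already instantiated
 * the kernels; bootstrap only binds one kernel per processing element. */
void bootstrap(kernel_fn kernels[], void *args[], int n_pe) {
  for (int pe = 0; pe < n_pe; pe++)
    npl_thread_create_on(pe, kernels[pe], args[pe]);
}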
Execution Patterns: Thread Pool

- Dynamic dispatch of kernels onto a set of platform resources
- Some runtime overhead
  - Multiplexes K kernels on R resources
- Dynamic load balancing
  - Different heuristics offered: job stealing, cache awareness
- Example usage: H.264 motion estimator mapped to S/W

[Diagram: an application main controller submits kernels #1..#N to the thread pool through a registration interface; each kernel exposes a runnable interface.]
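A sketch of the registration/runnable split implied by the diagram, with placeholder names (tp_submit, runnable_itf) rather than the actual P2012 interfaces; the H.264-style example submits one motion-estimation job per macro-block row:

/* Thread-pool dispatch (sketch). Kernels implement a "runnable"
 * interface and are registered with the pool, which multiplexes
 * them over the PEs it owns. All names are illustrative. */
typedef struct {
  void (*run)(void *self);   /* the kernel's entry point */
  void *self;                /* per-kernel private state  */
} runnable_itf;

extern void tp_submit(runnable_itf job);   /* assumed pool API */

typedef struct { int mb_row; } me_job_t;

static void me_run(void *self) {
  me_job_t *job = (me_job_t *)self;
  /* ... search motion vectors for job->mb_row ... */
  (void)job;
}

void submit_frame(me_job_t jobs[], int n_rows) {
  for (int r = 0; r < n_rows; r++) {
    jobs[r].mb_row = r;
    tp_submit((runnable_itf){ .run = me_run, .self = &jobs[r] });
  }
}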
Execution Patterns: Dataflow (PEDF)

- Predicated Execution Data-Flow (PEDF): filters F1-F3 under a mode controller, with host communication
- Host communication component: models the part of the application that communicates with the host
- Mode controller: configures control parameters, steps the pipeline
- Filters: perform the actual data computation
- Example use: Video HQR mapped to H/W-S/W
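As a rough illustration of the controller/filter split (the concrete PEDF interfaces are not shown here, so all names are placeholders): the mode controller first configures each filter's control parameters, then steps the pipeline by repeatedly firing filters until none has work left:

/* Dataflow step loop (sketch of the PEDF idea, not the real API). */
typedef struct filter {
  void (*configure)(struct filter *f, const void *params);
  int  (*fire)(struct filter *f);   /* one firing; returns 0 when idle */
} filter_t;

void mode_controller_step(filter_t *filters[], int n,
                          const void *params[]) {
  for (int i = 0; i < n; i++)
    filters[i]->configure(filters[i], params[i]);  /* set the mode */
  int active;
  do {                                             /* step the pipeline */
    active = 0;
    for (int i = 0; i < n; i++)
      active |= filters[i]->fire(filters[i]);
  } while (active);
}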
Programming Tools & Runtime Outline

- MIND Component Infrastructure
  - Component-based programming models
  - Programming tools flow
  - Runtime
- Apex Application Modeling
- Trace, Visualization and Analysis
- Component-aware Debug
P2012 Mapping Flow

[Flow diagram: component-based application capture feeds analysis and optimization in the programming tools. The captured application targets the programming models (streaming: PEDF, DDF; parallel threads; OpenMP; OpenCL; API-based; native PM over the NPL API), backed by the PPP library, with automatic communication generation. Mapped code runs either on the Apex functional simulation or on the P2012 platform (clusters of PE0..PEn with CC, DMA, HWS, L1/L2 memories) for performance analysis, supported by the runtime (execution engines, deployment, QoS and power manager). Trace synthesis, debug info generation, and execution engine configuration feed the trace visualization/debug tools.]
Application-to-Platform Mapping

[Diagram: the captured application, mixing dataflow filters (F1-F5), data-level-parallelism patterns (T2, T2', T2''), dynamic task pools (T1, T3, T5), and communication PPPs, is mapped onto the platform's clusters.]

- Mapping tools: assignment, transformations, communication generation, runtime link
- Runtime: dynamic deployment, assignment, and scheduling
Platform 2012 Software Runtime Architecture

[Layer diagram: the parallel programming models sit on the Native Programming Layer, which provides three groups of services over the HAL:
- Execute: threading execution engine, streaming execution engine, synchronization, queues, thread/communication/memory allocation
- Deploy: component dynamic deployment, code loading
- Platform QoS: power management, monitoring, fault tolerance
Below, a fabric controller coordinates clusters 1..N, each with per-PE synchronization/control and monitoring.]
Apex: Application Modeling & Simulation Environment

- Bridges the gap between the component-based models and the P2012 mapping tools
- Apex objectives
  - Capture and validation of high-level sequential and parallel descriptions
  - Verification of parallel correctness
- Refinement path: C reference model, Apex sequential model, Apex parallel model, firmware on the P2012 TLM platform (L2, shared L1, PEs, CDMA)
- Apex benefits
  - No need for a TLM platform
  - Fast simulation: >100x TLM speed
  - High-level validation of parallelism
  - Compatible with the mapping tools
Visualization and Analysis Flow

[Flow: low-level traces from the SoC (host, P2012 fabric controller and clusters 0..N, collected via the System Trace Module) are gathered into a trace database, which the visualization and analysis tools (in STWorkbench) consume. The mapping tools contribute the correspondence between the application description and platform resources: application mapping results, used resources, and symbolic names / physical IDs.]
Programming-Model-Aware Visualization

- Displays broadcast, exchanger, split/join, synchronized buffer, and execution patterns

[Trace screenshot: cache fill/flush, exchange, join, and split events rendered on the timeline.]
NPL Visualization

[Trace screenshot: mutex ownership changes, mutex critical sections, nested functions, and barriers rendered on the timeline.]
Component-Aware Multi-core Debug

- Component-aware debug
  - Print state variables (attributes, private data)
  - Break on a component method; conditional breakpoint on a specific instance
  - Jump over compiler-generated interface stubs
  - Print the instance hierarchy of the application
  - Print the current location in the hierarchy
- Multicore features
  - Single cockpit controlling multiple debugger instances
  - Support for identical program images
  - Support for heterogeneous program images planned
P2012 Debug Flow

[Flow: the mapping tools emit a component hierarchy description, consumed by component-awareness extensions (Python) layered on standard GDB v7. Debug sessions run from a terminal or from STWorkbench with multi-core support, attached to the SoC (host, P2012 fabric controller, clusters 0..N) through the Debug & Test Unit.]
Outline

- Platform 2012 Multicore Fabric
- Platform 2012 Programming Environment
  - Component-based programming models
  - Component-aware debug, visualization and analysis tools
- Case Studies
  - Video High-Quality Rescaling
    - Mapped to S/W platform
    - Mapped to H/W-S/W platform
  - VC1 video codec
HQR (High-Quality Rescaling)

- HD 1080p, 120 fps
- SDF model variant
  - One “token” in/out per link per filter firing, or simple static multi-rate
  - Tokens are typically a line of pixel data
- Multiple modes (on a frame-by-frame basis)
- Some dynamic control flow and exceptions
  - E.g., dynamic bypass of a filter

[Dataflow graph: Y and U,V inputs flow through Gaussian (GA), Gradient Approx, Horizontal Antialiasing (AAL-H), Vertical Antialiasing (AAL-V), Horizontal NewSpline, and Vertical NewSpline filters, with per-link delays (D = 2 to 9).]
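Under this SDF contract, each filter reduces to a firing function that consumes one line-token per input link and produces one per output link. A hedged sketch of a vertical 3-tap filter in that style; the line width, tap weights, and names are illustrative, not the HQR code:

/* One SDF firing of a line-based filter (sketch): consume exactly one
 * token (a line of pixels) per input link, produce one per output link.
 * A vertical 3-tap filter needs a 3-line window, hence the delay (D)
 * of buffered lines on its input link, as in the HQR graph above. */
typedef struct { unsigned short px[1920]; } line_t;

void aal_v_fire(const line_t *above, const line_t *cur,
                const line_t *below, line_t *out) {
  for (int x = 0; x < 1920; x++)
    out->px[x] = (unsigned short)((above->px[x] + 2 * cur->px[x]
                                   + below->px[x]) / 4);
}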
Two Mapping Approaches

- Map to S/W-based platform: data-level parallelism
  - Structured threading model (threads & DLP)
  - Multi-processor & SIMD
  - All tasks for a given data element assigned to a single PE
- Map to H/W-dominated platform: task-level parallelism
  - Dataflow programming model (PEDF: In, F1, F2, F3, ..., F4, Out)
  - Software-based control
  - Tasks assigned to a single H/W processing unit
  - DLP inside each H/W PU
S/W Mapping: HQR Example

- Data-level parallelism
  - Each image line is split into stripes (stripe width W; pixels accessed = W + 2 + 2, for border pixels)
  - Each PE runs all filters (F1-F6) for a stripe (Task1-Task4 on PE1-PE4)
  - SIMD optimization of each filter
- Parallel programming patterns
  - Data iterator split and join patterns
  - Synchronization between PEs using the “exchanger” pattern (for border pixels)

[Diagram: the input image is divided into four stripes; an iterator split feeds Task1-Task4 (each running the full F1-F6 filter chain on PE1-PE4), with data exchange between neighbors and an iterator join at the output.]
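A sketch of one PE's stripe loop under this mapping, reusing the readNext/writeNext and exchanger APIs from the pattern slides; run_filters(), acquire_buffer(), and the border handling are illustrative placeholders:

/* Per-PE stripe worker (sketch). Each PE pulls its stripe of every
 * line from the iterator-split queue, exchanges border pixels with
 * its neighbors, runs the whole filter chain, and pushes results to
 * the iterator-join queue. */
extern void *readNext(void);                /* iterator split */
extern void *writeNext(void *buf);          /* iterator join: submit, get fresh */
extern void *exchange_left(void *border);   /* exchanger instances, assumed */
extern void *exchange_right(void *border);
extern void *acquire_buffer(void);          /* hypothetical buffer alloc */
extern void  run_filters(void *in, void *out, void *lb, void *rb); /* F1..F6 */

void stripe_worker(void) {
  void *in;
  void *out = acquire_buffer();
  void *lb = acquire_buffer(), *rb = acquire_buffer();
  while ((in = readNext()) != 0) {          /* one stripe of one line */
    lb = exchange_left(lb);                 /* trade 2 border px per side */
    rb = exchange_right(rb);
    run_filters(in, out, lb, rb);
    out = writeNext(out);                   /* recycle the output buffer */
  }
}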
HQR PPP Profiling Results

[Diagram: PE4 runs the split (writeNext(0..3)), workers W0-W3 on PE0-PE3 each do readNext()/writeNext() and exchange border data with their neighbors, and PE5 runs the join (readNext(0..3)).]

Measured pattern overheads:
- readNext(): 50 instr./call
- writeNext(): 65 instr./call
- exchange: 65-80 instr./call

Concrete impact on performance in the case of HQR: less than 1%.
Prog. Tools: HQR Mapping Results

Vectorization results (16-way VECx EFU, standalone CA-ISS):
- Average vector unit utilization: 79%

Parallel processing results (1 vs. 4 PEs, on a cycle-approximate TLM platform):

  Configuration                   Cycles       Speedup
  Single CPU (cache enabled)      19,179,956   -
  4 CPU (initial)                  9,406,673   2x
  4 CPU (burst communication)      5,731,049   3.3x
  4 CPU (buffer dimensioning)      4,811,882   3.9x
HQR Optimization Process

[Chart: number of PEs required (8-wide EFU), by optimization step:
- Reference code, frame-based (no vectorization): 697
- Line-based (vectorization unoptimized): 200
- Line/column switch before the horizontal spline filter (simplifies processing): 26
- Line/column switch before the last filter (vectorization optimized): 16
- New specialized SIMD instructions: 12]

The final S/W solution is ~5x the cost of the H/W solution.
H/W Mapping: HQR Example

[Dataflow graph as before (Gaussian (GA), Gradient Approx, AAL-H, AAL-V, Horizontal/Vertical NewSpline, with per-link delays D = 2 to 9), annotated with the PE assignment: the Y,U,V front end on PE0, the Y path on PE1, U on PE2, V on PE3.]

- Task-level parallelism
  - Each filter is assigned to a H/W processing unit
  - Highly communicating PUs are grouped onto a single PE
- In contrast with the S/W mapping, where
  - Data-level parallelism is exploited (multi-PE and SIMD)
  - Each PE performs all tasks
PEDF Dataflow Programming Model

- Predicated Execution Data Flow: filters F1-F3 under a mode controller, with host communication and an auto-generated iteration controller
- Host communication component: gets the host request parameters
  - Frame data
  - Required processing type
  - Processing parameters
- Mode controller: configures control parameters, steps the pipeline
- Filters: actual data computation
Overview of PEDF Mapping Flow

- Abstract PEDF application capture: a single PEDF description (host communication, mode controller, filters F1-F3)
- Mapped to the host (native execution) with the Apex tool
- Mapped to the TLM platform via the mapping tools:
  - Control code on the STxP70 (cluster controller ISS, host communication)
  - Microcode on the data-streaming ports of the H/W PEs (PE0-PE3 running filters F1..Fz over the NoC model, with shared MEM)
HW-SW Interaction in P2012

[Diagram: HW PEs chained through stream-in/stream-out (SI/SO) ports carry the streaming data flow (hardware processing). A data streamer moves data between L1 memory and the stream ports, and load/store paths reach L1 and L2 memory (data mover). The STxP70 handles the configuration and control flow (cluster control).]
Mapping to HW/SW Platform

- Application capture using the dataflow-variant programming model (configuration controller, iteration controller, filters 0..N, connectivity “glue”)
- Host execution (Apex) of the functional model for functional validation
- Automatic control code generation for TLM/COSIM (STxP70 ISS, PE0-PE3 over the NoC model) for performance analysis
TNR Performance Comparison: FlexMap vs. Hand-Coded (STxP70 instructions/frame)

- >50% reduction in execution time
- More abstract capture allows for more optimization
Multiple Programming Models

- Top level: dynamic dataflow pipeline (e.g., MV Pred, TMNR, and HQR stages connected by FIFOs)
- Interchangeable Threads/DLP and PEDF (In, F1, F2, F3, ..., F4, Out) implementations of HQR
- Processing functions use vectorial (SIMD) data parallelism
- Components act as a semantic-neutral structuring mechanism
VC1 Overview: Task-Level Parallelism

- Application components: pipelined task-level parallelism
  - Decode pipeline: Bit Parse, VLD Intra / VLD Inter, AC/DC IZZ, IQ, IDCT, Smooth, MC, Deblock, Range mapping, under a frame/slice control task (Ctrl) read by all stages
  - Further tasks: intra prediction, motion compensation (with intensity compensation), IDCT reconstruction, loop filter, MV prediction
- Processes one segment of macro-blocks at a time (dataflow); queue tokens carry motion vectors plus data
- Two decoded reference frames are kept in L3 as a circular buffer; a prefetch manager (using the CDMA) provides the interface to obtain reference macro-blocks; one decoded frame goes to display
- Communication components
  - Available in a library; provide the binding between parallel components
  - Simple queues, macro-block queues, two-line queues, an MV Pred queue, and a QueueIterator broadcast
Component-based Programming Models: Conclusions

- Demonstrated value of components
- Supports multiple programming models
  - DLP/threads for the S/W mapping
  - PEDF for the HW/SW mapping
- Multiple and evolving execution targets
  - H/W-S/W partitioning
  - Multiple simulation environments
- Platform independence
  - Abstraction of communication
- No overhead in practical use
  - S/W mapping: <1% overhead
  - H/W-S/W mapping: 50% execution-time reduction (vs. hand-coded)
Multi-Processor SoCs for Smart People!

Programming Models:
- Higher productivity
- More platform independence