He-P2012: Architectural Heterogeneity Exploration on
a Scalable Many-Core Platform
Francesco Conti*, Chuck Pilkington†, Andrea Marongiu*‡, Luca Benini*‡
*University of Bologna, Italy †STMicroelectronics, Ottawa, Canada
‡ETH Zurich, Switzerland
EU FP7 projects FP7-318013, FP7-611016
Dark silicon & heterogeneity
• Moore’s Law gives us the possibility to integrate more and more Processing Elements (PEs) on chip…
• …but the utilization wall makes it impossible to keep them all powered on at the same time!
• One possible solution: architectural heterogeneity, i.e. HW IPs
  – complementary/alternative to PEs
  – more power-efficient than PEs
[Figure: a many-core fabric of clusters, each with PEs and shared memory banks; under the utilization wall, many PEs must be switched off. In the heterogeneous variant, some PEs in each cluster are replaced by HW IPs.]
Outline: 1. Introduction  2. HW architecture: He-P2012  3. HW/SW codesign: SIDL  4. Using He-P2012: CV apps  5. Conclusions
State of the Art
[Taxonomy of accelerator coupling along two axes: tighter control by the processor, and data nearer to the processor]
• ASIP (Tensilica, Synopsys Processor Designer, Movidius…)
• Dataflow (Maxeler, CEA Magali…)
• L2-coupled (accelerator-rich CMPs [1]…)
• L1-coupled (GreenDroid [2], our work…)
[1] J. Cong et al. Architecture support for accelerator-rich CMPs. Proceedings of the 49th Design Automation Conference.
[2] N. Goulding-Hotta et al. The GreenDroid Mobile Application Processor: An Architecture for Silicon’s Dark Future. IEEE Micro.
Heterogeneous P2012 cluster
[Figure: He-P2012 cluster. 32 memory banks (#0–#31) form a shared Tightly-Coupled Data Memory (TCDM), an L1 scratchpad reached by 16 STxP70 PEs and by the HWPEs (each behind a wrapper) through a Low-Latency Interconnection (LIC). A Peripheral Interconnection (PIC) connects timers, the HW synchronizer (HWS), EXT2MEM/EXT2PER/ENC2EXT bridges, two DMA channels, and the global interconnection interface. Data plane: zero-copy communication through shared memory.]
• STMicroelectronics P2012 is a fabric of tightly-coupled shared-memory clusters linked by a scalable NoC
• Control plane: lightweight control by pointer exchange
• A general mechanism: natural support for many programming models; easy exploration of HW/SW deployment
HWPE Architecture
• HWPEs are drop-in replacements for SW → we want to be able to generate them with HLS (e.g. Calypto Catapult System-Level Synthesis)
• HW IPs generated by Catapult use their own memory space → they must be encapsulated in a wrapper
• The wrapper provides facilities for communication with the shared memory and for control by the PEs
[Figure: the HWPE wrapper around the HW IP. Data plane: address translation toward the L1 shared memory (TCDM) over the Low-Latency Interconnect (LIC). Control plane: register file + control logic FSM, reached by the Processing Elements over the Peripheral Interconnect (PIC).]
[3] F. Conti, A. Marongiu, L. Benini. Synthesis-friendly techniques for tightly-coupled integration of hardware accelerators into shared-memory multi-core clusters. Proceedings of the 2013 International Conference on Hardware/Software Codesign and System Synthesis.
[4] P. Burgio, A. Marongiu, D. Heller, C. Chavet, P. Coussy, L. Benini. OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters. Proceedings of the 2012 15th Euromicro Conference on Digital System Design.
HWPE Codesign Flow
1. Write a P2012 application
(currently OpenCL or OpenMP)
2. Select a hot function and
define the interface and the
implementation of its HWPE
3. The SIDL compiler uses the
interface to generate HWPE
API and HLS scripts +
templates
4. Calypto Catapult is used to
generate the HWPE models
5. The P2012 compiler is used to
generate the P2012 app
6. Performance measurements with the P2012 GEPOP simulator + area/power estimates with Synopsys Design Compiler
Data structure + hot function (original SW application):

    typedef struct {
      unsigned short int size_i;
      unsigned short int size_j;
      unsigned short int size_k;
    } sizes_t;

    void matrixMul(char* a, char* b, char* c, sizes_t* s) {
      int i, j, k;
      for(i=0; i<s->size_i; i++)
        for(j=0; j<s->size_j; j++)
          for(k=0; k<s->size_k; k++)
            c[i*s->size_j+j] += a[i*s->size_k+k] * b[k*s->size_j+j];
    }

    void foo() {
      #pragma omp parallel
      {
        // ...
        for(int i=0; i<10; i++) {
          matrixMul(&a_list[i], &b_list[i], &c_list[i], &size);
        }
        // ...
      }
    }
Accelerator HLS implementation with labeled loops:

    void matrixMul(char* a, char* b, char* c, sizes_t* s) {
      int i, j, k;
      loop_i: for(i=0; i<s->size_i; i++)
        loop_j: for(j=0; j<s->size_j; j++)
          loop_k: for(k=0; k<s->size_k; k++)
            c[i*s->size_j+j] += a[i*s->size_k+k] * b[k*s->size_j+j];
    }

Accelerator HLS implementation with ‘smart’ pointers (same body):

    void matrixMul(hwpe_ptr_char a, hwpe_ptr_char b,
                   hwpe_ptr_char c, hwpe_ptr_sizes_t s) {
      // ... identical loop nest ...
    }

Accelerator interface + HLS directives (SIDL):

    struct sizes_t {
      unsigned short int i;
      unsigned short int j;
      unsigned short int k;
    };
    #pragma sidl hw clk period 2.5
    class hwpe {
    public:
      #pragma sidl directive core/main/loop_i -UNROLL 2
      #pragma sidl directive core/main/loop_j -UNROLL 2
      virtual void matrixMul(char* a, char* b, char* c, sizes_t* s) = 0;
    };
Heterogeneous application: data structure and matrixMul are unchanged; the hot call is replaced by the HWPE API call:

    void foo() {
      #pragma omp parallel
      {
        // ...
        for(int i=0; i<10; i++) {
          hwpe_matrixMul(&a_list[i], &b_list[i], &c_list[i], &size);
        }
        // ...
      }
    }
Generated HWPE APIs:

    // blocking HWPE call (same signature as the SW function)
    inline void hwpe_matrixMul(volatile char *a, volatile char *b,
                               volatile char *c, volatile struct sizes_t *s);
    // non-blocking HWPE call (same signature as the SW function)
    inline int  hwpe_async_matrixMul(volatile char *a, volatile char *b,
                                     volatile char *c, volatile struct sizes_t *s);
    // synchronization
    inline void hwpe_wait_matrixMul();
    inline int  hwpe_check_matrixMul();
    inline int  hwpe_free_matrixMul();
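A minimal sketch of how the generated non-blocking API can be used to overlap SW work with an HWPE job. The hwpe_* bodies below are hypothetical stand-ins for the SIDL-generated implementations (a real one would program the wrapper’s register file and poll it); they exist only so the control flow can be followed end to end.

```cpp
#include <cassert>

struct sizes_t { unsigned short size_i, size_j, size_k; };

static bool job_running = false;

// non-blocking call: start the HWPE job and return a job id immediately.
// Stub: "the accelerator" computes c += a*b right away, in the background
// conceptually.
inline int hwpe_async_matrixMul(volatile char *a, volatile char *b,
                                volatile char *c, volatile sizes_t *s) {
  for (int i = 0; i < s->size_i; i++)
    for (int j = 0; j < s->size_j; j++)
      for (int k = 0; k < s->size_k; k++)
        c[i*s->size_j + j] = c[i*s->size_j + j]
                           + a[i*s->size_k + k] * b[k*s->size_j + j];
  job_running = true;
  return 0;  // job id
}

// synchronization: block until the job has finished
inline void hwpe_wait_matrixMul() { job_running = false; }

int overlap_example() {
  char a[4] = {1, 2, 3, 4};   // 2x2 operands in "shared memory"
  char b[4] = {1, 0, 0, 1};   // identity matrix
  char c[4] = {0, 0, 0, 0};
  sizes_t s = {2, 2, 2};
  hwpe_async_matrixMul(a, b, c, &s);  // offload by pointer exchange
  // ... the PE can do other SW work here while the HWPE runs ...
  hwpe_wait_matrixMul();              // join before consuming c
  return c[0] + c[3];                 // c equals a here, since b is identity
}
```

Because the HWPE sees the same shared memory, no data is copied: only the four pointers cross the control plane.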
[5] A. Marongiu, A. Capotondi, G. Tagliavini, L. Benini. Improving the programmability of STHORM-based heterogeneous systems with offload-enabled OpenMP. Proceedings of the First International Workshop on Many-core Embedded Systems.
[6] D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley, G. Haugou, F. Clermidy, D. Dutoit. Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications. Proceedings of the 49th Design Automation Conference.
Data structures access
    typedef struct {
      unsigned short int size_i;
      unsigned short int size_j;
      unsigned short int size_k;
    } sizes_t;
    sizes_t s[2];

[Figure: s[2] occupies three 32-bit words in shared memory (0x00, 0x04, 0x08); inside the HWPE it is backed by sc_uint<36> backingStore[3], each entry holding a 32-bit DATA WORD plus byte-enable (BE) bits 35:32. Accesses such as s[0].i_size = 1; p = s[1].k_size; s[1].i_size = s[0].k_size; are resolved statically.]
Statically resolved C++ template library (smart pointer → smart struct → smart basic type, over the sc_uint<36> backing store):

    // smart basic type: performs the actual memory access (operator =)
    class hwpe_uint16_t {
      sc_uint<36> *backingStore;
      int byteOffset;
    public:
      const uint16_t operator=(const uint16_t data) {
        int backOffset = byteOffset / sizeof(int);
        int subIdx = (byteOffset % sizeof(int)) / sizeof(uint16_t);
        if(subIdx == 0)
          backingStore[backOffset] = 0x300000000 | data;          // lower half, BE = 0x3
        else
          backingStore[backOffset] = 0xC00000000 | (data << 16);  // upper half, BE = 0xC
        return data;
      }
    };

    // smart struct: resolves struct element access (operator .)
    class hwpe_sizes_t {
      sc_uint<36> *backingStore;
      int byteOffset;
    public:
      hwpe_uint16_t i_size;
      hwpe_uint16_t j_size;
      hwpe_uint16_t k_size;
    };

    // smart pointer: resolves array access (operator [])
    class hwpe_ptr_sizes_t {
      sc_uint<36> *backingStore;
      int byteOffset;
    public:
      hwpe_sizes_t operator[](int idx) {
        int newOffset = byteOffset + (idx * SIZES_T_STRUCT_SIZE);
        return hwpe_sizes_t(backingStore, newOffset);
      }
    };
In practice: just substitute ‘dumb’ pointers with smart ones in the source of the HWPE implementation.
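The word format used by the smart basic type can be checked in plain C++, with sc_uint<36> emulated by uint64_t: bits 31:0 hold the data word, bits 35:32 the byte-enable mask (0x3 = lower half-word, 0xC = upper half-word). This mirrors hwpe_uint16_t::operator= above; the function name is ours.

```cpp
#include <cassert>
#include <cstdint>

// Store a 16-bit field at a given byte offset into the emulated
// 36-bit backing store, tagging the word with its byte-enable mask.
void store_uint16(uint64_t *backingStore, int byteOffset, uint16_t data) {
  int backOffset = byteOffset / 4;    // which 32-bit word
  int subIdx = (byteOffset % 4) / 2;  // which half-word inside it
  if (subIdx == 0)
    backingStore[backOffset] = 0x300000000ULL | data;                    // BE = 0011
  else
    backingStore[backOffset] = 0xC00000000ULL | ((uint64_t)data << 16);  // BE = 1100
}
```

For sizes_t (three 16-bit fields, 6 bytes per element), s[0].i_size lands in the lower half of word 0, while s[1].i_size (byte offset 6) lands in the upper half of word 1, matching the layout figure.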
Applications
Face Detection (OpenCL)
FAST Corner Detection (OpenMP)
Color Tracking (OpenMP)
• Execution time measured by GEPOP simulation on whole applications, with minimal changes from the homogeneous P2012
• Area and power consumption of the cluster estimated through Design Compiler synthesis
• Simple energy model, summing:
  – average cluster energy consumption (DMAs, controller, etc.)
  – SW PE energy consumption
  – HWPE energy consumption
  Pessimistic model: includes polling for HWPEs
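The three contributions above can be written out as (our reconstruction of the slide’s model; the symbols are ours):

```latex
E \;\approx\; P_{\mathrm{cluster}}\, T_{\mathrm{tot}}
  \;+\; \sum_{p \,\in\, \mathrm{PEs}} P_{\mathrm{PE}}\, T_{\mathrm{active}}^{(p)}
  \;+\; \sum_{h \,\in\, \mathrm{HWPEs}} P_{\mathrm{HWPE}}\, T_{\mathrm{active}}^{(h)}
```

The model is pessimistic because a PE that merely polls a HWPE still counts as active.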
Color Tracking
2.HW architecture: He-P2012 3.HW/SW codesign: SIDL 4.Using He-P2012: CV apps 1.Introduction 5.Conclusions
CSC 0 (parallel sw) TH+MOM 0
(parallel sw) CSC 1 (parallel sw)
TH+MOM 1
(parallel sw)
sw
sw + 1x sync hwpe
sw + 1x async hwpe
DMA out
0
DMA in
0
DMA in
1
CSC 0
(async hwpe)
DMA in
2
CSC 1
(async hwpe)
TH+MOM 0
(parallel sw)
DMA in
3
CSC 2
(async hwpe)
TH+MOM 1
(parallel sw)
DMA out
0
DMA in
4
CSC 3
(async hwpe)
TH+MOM 2
(parallel sw)
DMA out
1
DMA in
5
CSC 4
(async hwpe)
TH+MOM 3
(parallel sw)
DMA out
2
TIME
DMA in
0
DMA in
1
DMA in
2
CSC 0 (parallel sw) TH+MOM 0
(parallel sw) CSC 1 (parallel sw)
TH+MOM 1
(parallel sw)
DMA out
0
DMA in
0
DMA in
1
DMA in
2
CSC 0 (parallel sw) TH+MOM 0
(parallel sw) CSC 1 (parallel sw)
TH+MOM 1
(parallel sw)
DMA out
0
CSC 0
(sync hwpe)
CSC 0
(sync hwpe)
DMA in
3
CSC 2
(sync hwpe)
TH+MOM 2
(parallel sw)
DMA out
1
fine-grain
pixel level
parallelism
Results: Color Tracking
Our model: base cluster of 1, 2, 4, 8, or 16 PEs; HWPEs are added to the cluster; PEs stay on while waiting for HWPEs.
[Plot: better performance ↑, lower energy consumption →; Performance / Area / Energy]
• BW limitation: the speedup does not keep pace with the increased power consumption
• 1x HWPE is less performant than 16x PEs (BW-limited)
• Best performance speedup: 99% of the Amdahl limit
• Best energy-delay tradeoff
• ~45x P/A/E improvement; ~5x P/A/E improvement
Face Detection
[Figure: execution timelines, exploiting coarse-grain stripe-level parallelism.
• sw: DMA in → INT IMG (parallel sw) → CASCADE INT IMG WINDOW (parallel sw) → DMA out.
• sw + 1x hw coarse (accelerates the whole cascade): HWPE contention, many threads competing on a single accelerator.
• sw + 1x/2x hw finer (accelerates the cascades except for the first 2 stages): a less-used HWPE → less contention; 2 HWPEs → almost no contention.
• sw + 2x hw finer + 4x hw int: the integral-image HWPE is needed to extract more performance.]
Results: Face Detection
[Plot: better performance ↑, lower energy consumption →; Performance / Area / Energy]
• Best energy-delay tradeoff
• ~5x P/A/E improvement; ~90x P/A/E improvement
FAST Corner Detection
[Figure: execution timelines, exploiting fine-grain pixel-level parallelism.
• sw: DMA in → CIRCULAR DETECTION (parallel sw) → DMA out.
• sw + 4x hw async: FAST HWPE jobs overlap with the DMA transfers.]
“Unlucky” parallelization: feature size ~ parallel chunk size.
Results: FAST Corner Detection
[Plot: better performance ↑, lower energy consumption →; Performance / Area / Energy]
• Best energy-delay tradeoff: ~21x P/A/E improvement
Conclusions
• A methodology to enable shared-memory HW acceleration in tightly-coupled multicore clusters
  – minimal modifications to existing software
  – supports standard tools (HLS) or a custom flow
• A flow for architectural heterogeneity exploration
  – allows relatively effortless exploration of HW/SW deployment schemes
  – supports multiple programming models on the SW side thanks to the flexible, lightweight HWPE control
• An embodiment of our methodology on the He-P2012 architecture
  – acceleration up to 99% of the Amdahl limit
  – up to 5x energy improvement vs. the best SW implementation
  – up to 21x P/A/E improvement vs. the best SW implementation
Thanks for your attention. Any questions?

Backup slides
Heterogeneous P2012 cluster
[Figure: the same cluster as before, annotated with an OpenMP execution: inside #pragma omp parallel, the master thread and slave threads run on the PEs while hwpe jobs run on the HWPEs; all join at #pragma omp barrier. Data plane: zero-copy communication through shared memory; control plane: lightweight control by pointer exchange.]
• STMicroelectronics P2012 is a fabric of tightly-coupled shared-memory clusters (shared L1 scratchpad memory) linked by a scalable NoC
• Natural support for many programming models; easy exploration of HW/SW deployment
Applications
Face Detection (OpenCL), FAST Corner Detection (OpenMP), Color Tracking (OpenMP), Removed Object Detection (OpenMP). Image taken from www.ekintechnology.com.
• Performance measured by GEPOP simulation
• Area and power consumption of the cluster estimated through Design Compiler synthesis
• Simple energy model (pessimistic: includes polling for HWPEs): average cluster energy consumption (DMAs, controller, etc.) + SW PE energy consumption + HWPE energy consumption
Data structures access
[Same memory-layout figure and smart-pointer library as before, with the byte-enable values annotated: s[0].i_size = 1 writes 0x0001 into the lower half-word of word 0 with BE = 0x3; an upper half-word write carries BE = 0xC.]
Results: Color Tracking
[Plot: better performance ↑, lower energy consumption →; Energy consumption; Performance / Area / Energy]
• BW limitation: the speedup does not keep pace with the increased power consumption (BW-limited)
• Remember the model: HWPEs are added to the cluster; PEs stay on while waiting for HWPEs
• Best performance speedup: 99% of the Amdahl limit
• Best energy-delay tradeoff: ~2x energy improvement
• ~45x P/A/E improvement; ~5x P/A/E improvement
Results: Face Detection
[Plot: better performance ↑, lower energy consumption →; Energy consumption; Performance / Area / Energy]
• HWPE contention: many threads competing on a single accelerator; a less-used HWPE → less contention; 2 HWPEs → almost no contention
• The integral-image HWPE is needed to extract more performance
• Best energy-delay tradeoff: ~2.5x energy improvement
• ~5x P/A/E improvement; ~90x P/A/E improvement
Results: FAST Corner Detection
[Plot: better performance ↑, lower energy consumption →; Energy consumption; Performance / Area / Energy]
• “Unlucky” parallelization: feature size ~ parallel chunk size
• Best energy-delay tradeoff: ~6x energy improvement
• ~21x P/A/E improvement
Removed Object Detection
[Figure: execution timelines, exploiting coarse-grain stripe-level parallelism.
• sw: DMA in → NORMALIZED CROSS-CORRELATION (parallel sw) → DMA out.
• hw coarse (no parallelism): a single coarse NCC HWPE driven by single-thread sw; no scaling, the PEs just wait for the HWPE.
• hw finer (finer-grain HWPE called from parallel threads): HWPE contention, more threads (doing nothing but control) than accelerators.]
The HWPE provides no performance advantage but better energy consumption: ~2x energy improvement.
Results: Removed Object Detection
[Plot: better performance ↑, lower energy consumption →; Energy consumption; Performance / Area / Energy]
• ~5x P/A/E improvement; ~50x P/A/E improvement
Color Tracking
[Figure: execution timelines for the three deployments (sw, sw + 1x sync hwpe, sw + 1x async hwpe), exploiting fine-grain pixel-level parallelism. In the sync variant each CSC k (sync hwpe) alternates with TH+MOM k (parallel sw); in the async variant CSC k (async hwpe) overlaps with TH+MOM k-1 (parallel sw) and with the DMA transfers.]
STMicroelectronics P2012
• P2012 is a fabric of tightly-coupled shared-memory clusters linked by a scalable NoC
L1-Coupled Acceleration
• Our approach to heterogeneity is L1-coupled shared-memory acceleration: accelerators inside the clusters
  – zero-copy communication
  – flexible programming model: same memory abstraction used in SW, same address space
  – high-level view of the HW accelerator as a HW thread of execution (sw threads + hw threads)
P. Burgio, A. Marongiu, D. Heller, C. Chavet, P. Coussy, and L. Benini. OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters. 15th Euromicro Conference on Digital System Design: Architectures, Methods & Tools, Turkey, pages 751–758, Sept. 2012.
M. Dehyadegari, A. Marongiu, M. R. Kakoee, L. Benini, S. Mohammadi, and N. Yazdani. A tightly-coupled multi-core cluster with shared-memory HW accelerators. 2012 International Conference on Embedded Computer Systems (SAMOS), pages 96–103, July 2012.
Accelerator Job Offload
[Figure: the HWPE wrapper holds multiple register-file contexts (Job 0…Job 3), each UNLOCKED or LOCKED. PE 0 and PE 1 offload jobs concurrently:
1. acquire a context: hwpe_acquire(hwpe_id) returns a job id, or -2 (LOCKED) if none is free
2. write the parameters: hwpe_setIOReg(hwpe_id, reg, value) / hwpe_setGenericReg(hwpe_id, reg, value)
3. trigger: hwpe_trigger(hwpe_id) starts the job
4. at the end of the job, the wrapper broadcasts an event notifying all PEs]
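The acquire/params/trigger sequence can be sketched as a toy model. The hwpe_* names and the -2 (LOCKED) return value follow the slide’s API; the register-file model and the instant job completion are our own stubs, kept only so the protocol can be exercised outside the real wrapper.

```cpp
#include <cassert>

const int N_CONTEXTS = 4;    // register-file contexts (Job 0..3)
const int HWPE_LOCKED = -2;  // hwpe_acquire result when all contexts are busy

static bool busy[N_CONTEXTS] = {false, false, false, false};
static int  params[N_CONTEXTS][8] = {};
static int  current_job = HWPE_LOCKED;

// acquire a free register-file context, or -2 (LOCKED) if none is free
int hwpe_acquire(int /*hwpe_id*/) {
  for (int j = 0; j < N_CONTEXTS; j++)
    if (!busy[j]) { busy[j] = true; current_job = j; return j; }
  return HWPE_LOCKED;
}

// write a job parameter into the most recently acquired context
void hwpe_setIOReg(int /*hwpe_id*/, int reg, int value) {
  params[current_job][reg] = value;
}

// start the job; in this stub it "finishes" at once and frees its context
void hwpe_trigger(int /*hwpe_id*/) {
  busy[current_job] = false;
}
```

Multiple contexts let several PEs queue jobs on one HWPE without blocking each other; only when all contexts are locked does acquisition fail.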
Accelerator Interface Definition
From the interface, the SIDL compiler generates the smart-pointer classes (hwpe_ptr_sizes_t, hwpe_sizes_t, hwpe_uint16_t, here in a variant that assembles the word with sc_uint range() assignments) and the flattened access pattern. In general:

    stage_t s[N];  p = s[i].var;  s[i].var = q;
    // becomes
    uint34_t s[N*SIZE_stage_t];
    p = (TYPE_var) ((0x0ffffffff & s[i*SIZE_stage_t + OFFSET_var]) BEOP_var);
    s[i*SIZE_stage_t + OFFSET_var] = q | BE_var << 32;

For example, for the j_size field of sizes_t:

    hwpe_ptr_sizes_t s;  sizes_t p, q;
    p = s[i].j_size;  s[i].j_size = q;
    // becomes
    uint34_t s[N*3];
    p = (short) ((0x0ffffffff & s[i*3 + 0x0]) >> 16);
    s[i*3 + 0x0] = q | 1 << 32;

[Figure: a 34-bit word = 32-bit data word + BE bits 33:32; s[0].i_size and s[0].j_size share word 0x00, s[0].k_size is in word 0x04.]
HWPE address translation
[Figure: the address translator in the wrapper maps pointers between the cluster address space (TCDM), e.g. *in0, *in1, *out in the range 0x0000–0x1400, and the accelerator address space starting at 0x10000000.]
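A base-plus-offset translation of this kind can be sketched as follows. The region table, its values, and all names below are ours for illustration (the slide only fixes the 0x10000000 accelerator base), not the wrapper’s actual registers.

```cpp
#include <cassert>
#include <cstdint>

// Toy address translator: each accelerator-side buffer is remapped onto
// its cluster-side (TCDM) address by a base/size/target triple, in the
// spirit of the wrapper's address translator.
struct Region { uint32_t acc_base, size, tcdm_base; };

// hypothetical mapping: accelerator space starts at 0x10000000
const Region regions[] = {
  {0x10000000, 0x400, 0x0000},  // *in0
  {0x10000400, 0x400, 0x0800},  // *in1
  {0x10000800, 0x400, 0x1000},  // *out
};

// translate an accelerator address into a TCDM address
// (returns 0xFFFFFFFF for an unmapped address)
uint32_t translate(uint32_t acc_addr) {
  for (const Region &r : regions)
    if (acc_addr >= r.acc_base && acc_addr < r.acc_base + r.size)
      return r.tcdm_base + (acc_addr - r.acc_base);
  return 0xFFFFFFFF;
}
```

This lets the HLS-generated IP keep its private, contiguous view of its buffers while the data actually lives scattered across the shared TCDM banks.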
Data structures in STHORM
[Figure: memory layouts of struct_t s[16] for two field orders.

    typedef struct {     typedef struct {
      int16_t alpha;       int16_t alpha;
      int16_t beta;        int32_t gamma;
      int32_t gamma;       int16_t beta;
      int32_t delta;       int32_t delta;
    } struct_t;          } struct_t;

With alpha and beta adjacent (left), each element packs into three 32-bit words; with the half-word fields interleaved among the word fields (right), padding stretches each element to four words. For Catapult, an array of data structures is just a collection of arrays.]
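The padding cost of the second field order can be checked directly; the exact sizes assume a typical ABI where int32_t has 4-byte alignment (e.g. common ILP32/LP64 targets).

```cpp
#include <cstdint>

// Field order determines padding: keeping the two int16_t fields adjacent
// packs the struct into 12 bytes (3 words, SIZE_stage_t = 3); interleaving
// them with the int32_t fields inserts 2 bytes of padding after each
// int16_t, growing the struct to 16 bytes (4 words, SIZE_stage_t = 4).
typedef struct {
  int16_t alpha;
  int16_t beta;
  int32_t gamma;
  int32_t delta;
} packed_t;

typedef struct {
  int16_t alpha;
  int32_t gamma;
  int16_t beta;
  int32_t delta;
} padded_t;
```

The same 12-vs-16-byte difference is what makes the packed order preferable for the HWPE backing store: one fewer 32-bit word per element to transfer and address.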
Data structure mapping
[Figure: the two layouts of struct_t s[] side by side: 12 bytes per element (SIZE_stage_t = 3) vs. 16 bytes per element (SIZE_stage_t = 4), with the half-word fields alpha/beta either sharing word +0x0 or occupying separate words.]
Each struct is associated to a set of #define’s:

    SIZE_stage_t = 3                  SIZE_stage_t = 4
    var    OFFSET  BE  BEOP           var    OFFSET  BE  BEOP
    alpha  0x0     00  & 0x0000ffff   alpha  0x0     00  & 0x0000ffff
    beta   0x0     01  >> 16          beta   0x8     00  & 0x0000ffff
    gamma  0x4     -   -              gamma  0x4     -   -
    delta  0x8     -   -              delta  0xc     -   -

These drive the generated accesses (34-bit words: 32 data bits + BE bits 33:32):

    stage_t s[N];  p = s[i].var;  s[i].var = q;
    // becomes
    uint34_t s[N*SIZE_stage_t];
    p = (TYPE_var) ((0x0ffffffff & s[i*SIZE_stage_t + OFFSET_var]) BEOP_var);
    s[i*SIZE_stage_t + OFFSET_var] = q | BE_var << 32;
    // e.g. for beta in the packed layout:
    p = (short) ((0x0ffffffff & s[i*3 + 0x0]) >> 16);
    s[i*3 + 0x0] = q | 1 << 32;
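The read side of this #define-driven access can be mimicked in plain C++ (uint34_t emulated by uint64_t; the values are those of the packed layout above, where beta shares word +0x0 with alpha and sits in the upper half; function names are ours).

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the generated read for beta (SIZE_stage_t = 3, OFFSET = 0x0,
// BEOP = ">> 16"): mask off the BE bits, then shift out the half-word.
int16_t read_beta(const uint64_t *s, int i) {
  return (int16_t)((0xFFFFFFFFULL & s[i*3 + 0x0]) >> 16);
}

// alpha lives in the lower half of the same word (BEOP = "& 0x0000ffff")
int16_t read_alpha(const uint64_t *s, int i) {
  return (int16_t)(0xFFFFULL & s[i*3 + 0x0]);
}
```

Because OFFSET, BE, and BEOP are compile-time constants per field, Catapult can turn every struct access into a fixed word index plus a fixed mask/shift, with no runtime address arithmetic beyond i*SIZE_stage_t.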
Performance remarks
• Loss of performance due to reduced bandwidth usage is
visible in non-clock-divided configurations
• Clock division hides reduced bandwidth usage under
performance loss of reduced frequency
• Time-multiplexing recovers performance in most cases
– up to 81% of the original HWPE performance
– in VJ (Viola–Jones face detection) this does not happen, because the internal parallelism of the HWPE does not scale well with the number of ports
HWPE Architecture
• HWPEs use the same memory abstraction as SW → very suitable for generation through HLS (e.g. Calypto Catapult System-Level Synthesis)
• HW IPs generated by Catapult use their own memory space → they must be encapsulated in a wrapper
• The wrapper provides facilities for communication with the shared memory and for control by the PEs

    void foo(int *in0, char *in1, int *out) {
      int i;
      for(i=0; i<1024; i++) {
        if(in1[i] == 100)
          return;
        else if(in0[i+1] < 10)
          out[i] = in0[i] * in1[i];
        else
          out[i] = in0[i];
      }
    }

A PE calls hwpe_foo(in0, in1, out) instead of foo(in0, in1, out); the HWPE_foo wrapper performs the address translation.
HW acceleration approaches
• HW accelerators can couple with PEs in different ways:
  – loosely coupled: private memory and address space (e.g. GPUs)
  – (very) tightly coupled: shared register file (e.g. ISA extensions, ASIPs)
• Distinguished also by communication:
  – shared memory: accelerator and CPU share a level of the memory hierarchy
  – message passing: no memory sharing, communication via messages (e.g. dataflow accelerators)
[Figure: three coupling schemes along the “more coupling / more memory sharing” axes: a loosely-coupled HW accelerator with local memory reached via DMA/FIFO at the L2/L3 level; an accelerator sharing the L1 memory with the PE; and an accelerator sharing the PE’s register file.]
Cutting combinational paths
[Figure: the shared-memory HWPE wrapper between the cluster side (shared-memory interconnect: cluster request/response, cluster address, slave module, address translator, register file with regfile cfg, controller module, sequencer, start/done/event signals) and the Catapult accelerator side (accelerator request/response, accelerator address, request buffer, response ready, the hardware IP itself). Three configurations for the request/response paths:
a) unregistered: fully combinational → low cluster frequency
b) forward-registered: a register (with enable) cuts the forward path
c) both-paths-registered: registers on both paths → high cluster frequency]