He-P2012: Architectural Heterogeneity Exploration on
a Scalable Many-Core Platform
Francesco Conti*, Chuck Pilkington†, Andrea Marongiu*‡, Luca Benini*‡
*University of Bologna, Italy †STMicroelectronics, Ottawa, Canada
‡ETH Zurich, Switzerland
EU FP7 projects FP7-318013, FP7-611016
Dark silicon & heterogeneity
• Moore’s Law gives us the possibility to integrate more and more Processing Elements (PEs) on chip…
• …but the utilization wall makes it impossible to keep them all powered on at the same time!
• One possible solution: architectural heterogeneity, i.e. HW IPs
  – complementary/alternative to PEs
  – more power-efficient than PEs
[Figure: a many-core fabric of clusters, each with PEs and shared memory banks; under the utilization wall, many PEs must be switched off. In the heterogeneous variant, some PEs in each cluster are replaced by HW IPs.]
Outline: 1. Introduction  2. HW architecture: He-P2012  3. HW/SW codesign: SIDL  4. Using He-P2012: CV apps  5. Conclusions
State of the Art
[Taxonomy of accelerator coupling along two axes: tighter control by the processor, and data nearer to the processor]
• ASIP (Tensilica, Synopsys Processor Designer, Movidius…)
• Dataflow (Maxeler, CEA Magali…)
• L2-coupled (accelerator-rich CMPs [1]…)
• L1-coupled (GreenDroid [2], our work…)
[1] J. Cong et al. Architecture support for accelerator-rich CMPs. Proceedings of the 49th Design Automation Conference.
[2] N. Goulding-Hotta et al. The GreenDroid Mobile Application Processor: An Architecture for Silicon’s Dark Future. IEEE Micro.
Heterogeneous P2012 cluster
[Figure: He-P2012 cluster. 32 memory banks (#0–#31) form a shared Tightly-Coupled Data Memory (TCDM), an L1 scratchpad reached by 16 STxP70 PEs and by the HWPEs (each behind a wrapper) through a Low-Latency Interconnection (LIC). A Peripheral Interconnection (PIC) connects timers, the HW synchronizer (HWS), EXT2MEM/EXT2PER/ENC2EXT bridges, two DMA channels, and the global interconnection interface. Data plane: zero-copy communication through shared memory.]
• STMicroelectronics P2012 is a fabric of tightly-coupled shared-memory clusters linked by a scalable NoC
• Control plane: lightweight control by pointer exchange
• A general mechanism: natural support for many programming models; easy exploration of HW/SW deployment
HWPE Architecture
• HWPEs are drop-in replacements for SW → we want to be able to generate them with HLS (e.g. Calypto Catapult System-Level Synthesis)
• HW IPs generated by Catapult use their own memory space → they must be encapsulated in a wrapper
• The wrapper provides facilities for communication with the shared memory and for control by the PEs
[Figure: the HWPE wrapper around the HW IP. Data plane: address translation toward the L1 shared memory (TCDM) over the Low-Latency Interconnect (LIC). Control plane: register file + control logic FSM, reached by the Processing Elements over the Peripheral Interconnect (PIC).]
[3] F. Conti, A. Marongiu, L. Benini. Synthesis-friendly techniques for tightly-coupled integration of hardware accelerators into shared-memory multi-core clusters. Proceedings of the 2013 International Conference on Hardware/Software Codesign and System Synthesis.
[4] P. Burgio, A. Marongiu, D. Heller, C. Chavet, P. Coussy, L. Benini. OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters. Proceedings of the 2012 15th Euromicro Conference on Digital System Design.
HWPE Codesign Flow
1. Write a P2012 application
(currently OpenCL or OpenMP)
2. Select a hot function and
define the interface and the
implementation of its HWPE
3. The SIDL compiler uses the
interface to generate HWPE
API and HLS scripts +
templates
4. Calypto Catapult is used to
generate the HWPE models
5. The P2012 compiler is used to
generate the P2012 app
6. Performance measurements with the P2012 GEPOP simulator + area/power estimates with Synopsys Design Compiler
Data structure + hot function (original SW application):

    typedef struct {
      unsigned short int size_i;
      unsigned short int size_j;
      unsigned short int size_k;
    } sizes_t;

    void matrixMul(char* a, char* b, char* c, sizes_t* s) {
      int i, j, k;
      for(i=0; i<s->size_i; i++)
        for(j=0; j<s->size_j; j++)
          for(k=0; k<s->size_k; k++)
            c[i*s->size_j+j] += a[i*s->size_k+k] * b[k*s->size_j+j];
    }

    void foo() {
      #pragma omp parallel
      {
        // ...
        for(int i=0; i<10; i++) {
          matrixMul(&a_list[i], &b_list[i], &c_list[i], &size);
        }
        // ...
      }
    }
Accelerator HLS implementation with labeled loops:

    void matrixMul(char* a, char* b, char* c, sizes_t* s) {
      int i, j, k;
      loop_i: for(i=0; i<s->size_i; i++)
        loop_j: for(j=0; j<s->size_j; j++)
          loop_k: for(k=0; k<s->size_k; k++)
            c[i*s->size_j+j] += a[i*s->size_k+k] * b[k*s->size_j+j];
    }

Accelerator HLS implementation with ‘smart’ pointers (same body):

    void matrixMul(hwpe_ptr_char a, hwpe_ptr_char b,
                   hwpe_ptr_char c, hwpe_ptr_sizes_t s) {
      // ... identical loop nest ...
    }

Accelerator interface + HLS directives (SIDL):

    struct sizes_t {
      unsigned short int i;
      unsigned short int j;
      unsigned short int k;
    };
    #pragma sidl hw clk period 2.5
    class hwpe {
    public:
      #pragma sidl directive core/main/loop_i -UNROLL 2
      #pragma sidl directive core/main/loop_j -UNROLL 2
      virtual void matrixMul(char* a, char* b, char* c, sizes_t* s) = 0;
    };
Heterogeneous application: data structure and matrixMul are unchanged; the hot call is replaced by the HWPE API call:

    void foo() {
      #pragma omp parallel
      {
        // ...
        for(int i=0; i<10; i++) {
          hwpe_matrixMul(&a_list[i], &b_list[i], &c_list[i], &size);
        }
        // ...
      }
    }
Generated HWPE APIs:

    // blocking HWPE call (same signature as the SW function)
    inline void hwpe_matrixMul(volatile char *a, volatile char *b,
                               volatile char *c, volatile struct sizes_t *s);
    // non-blocking HWPE call (same signature as the SW function)
    inline int  hwpe_async_matrixMul(volatile char *a, volatile char *b,
                                     volatile char *c, volatile struct sizes_t *s);
    // synchronization
    inline void hwpe_wait_matrixMul();
    inline int  hwpe_check_matrixMul();
    inline int  hwpe_free_matrixMul();
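A minimal sketch of how the generated non-blocking API can be used to overlap SW work with an HWPE job. The hwpe_* bodies below are hypothetical stand-ins for the SIDL-generated implementations (a real one would program the wrapper’s register file and poll it); they exist only so the control flow can be followed end to end.

```cpp
#include <cassert>

struct sizes_t { unsigned short size_i, size_j, size_k; };

static bool job_running = false;

// non-blocking call: start the HWPE job and return a job id immediately.
// Stub: "the accelerator" computes c += a*b right away, in the background
// conceptually.
inline int hwpe_async_matrixMul(volatile char *a, volatile char *b,
                                volatile char *c, volatile sizes_t *s) {
  for (int i = 0; i < s->size_i; i++)
    for (int j = 0; j < s->size_j; j++)
      for (int k = 0; k < s->size_k; k++)
        c[i*s->size_j + j] = c[i*s->size_j + j]
                           + a[i*s->size_k + k] * b[k*s->size_j + j];
  job_running = true;
  return 0;  // job id
}

// synchronization: block until the job has finished
inline void hwpe_wait_matrixMul() { job_running = false; }

int overlap_example() {
  char a[4] = {1, 2, 3, 4};   // 2x2 operands in "shared memory"
  char b[4] = {1, 0, 0, 1};   // identity matrix
  char c[4] = {0, 0, 0, 0};
  sizes_t s = {2, 2, 2};
  hwpe_async_matrixMul(a, b, c, &s);  // offload by pointer exchange
  // ... the PE can do other SW work here while the HWPE runs ...
  hwpe_wait_matrixMul();              // join before consuming c
  return c[0] + c[3];                 // c equals a here, since b is identity
}
```

Because the HWPE sees the same shared memory, no data is copied: only the four pointers cross the control plane.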
[5] A. Marongiu, A. Capotondi, G. Tagliavini, L. Benini. Improving the programmability of STHORM-based heterogeneous systems with offload-enabled OpenMP. Proceedings of the First International Workshop on Many-core Embedded Systems.
[6] D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley, G. Haugou, F. Clermidy, D. Dutoit. Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications. Proceedings of the 49th Design Automation Conference.
Data structures access
    typedef struct {
      unsigned short int size_i;
      unsigned short int size_j;
      unsigned short int size_k;
    } sizes_t;
    sizes_t s[2];

[Figure: s[2] occupies three 32-bit words in shared memory (0x00, 0x04, 0x08); inside the HWPE it is backed by sc_uint<36> backingStore[3], each entry holding a 32-bit DATA WORD plus byte-enable (BE) bits 35:32. Accesses such as s[0].i_size = 1; p = s[1].k_size; s[1].i_size = s[0].k_size; are resolved statically.]
Statically resolved C++ template library (smart pointer → smart struct → smart basic type, over the sc_uint<36> backing store):

    // smart basic type: performs the actual memory access (operator =)
    class hwpe_uint16_t {
      sc_uint<36> *backingStore;
      int byteOffset;
    public:
      const uint16_t operator=(const uint16_t data) {
        int backOffset = byteOffset / sizeof(int);
        int subIdx = (byteOffset % sizeof(int)) / sizeof(uint16_t);
        if(subIdx == 0)
          backingStore[backOffset] = 0x300000000 | data;          // lower half, BE = 0x3
        else
          backingStore[backOffset] = 0xC00000000 | (data << 16);  // upper half, BE = 0xC
        return data;
      }
    };

    // smart struct: resolves struct element access (operator .)
    class hwpe_sizes_t {
      sc_uint<36> *backingStore;
      int byteOffset;
    public:
      hwpe_uint16_t i_size;
      hwpe_uint16_t j_size;
      hwpe_uint16_t k_size;
    };

    // smart pointer: resolves array access (operator [])
    class hwpe_ptr_sizes_t {
      sc_uint<36> *backingStore;
      int byteOffset;
    public:
      hwpe_sizes_t operator[](int idx) {
        int newOffset = byteOffset + (idx * SIZES_T_STRUCT_SIZE);
        return hwpe_sizes_t(backingStore, newOffset);
      }
    };
In practice: just substitute ‘dumb’ pointers with smart ones in the source of the HWPE implementation.
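The word format used by the smart basic type can be checked in plain C++, with sc_uint<36> emulated by uint64_t: bits 31:0 hold the data word, bits 35:32 the byte-enable mask (0x3 = lower half-word, 0xC = upper half-word). This mirrors hwpe_uint16_t::operator= above; the function name is ours.

```cpp
#include <cassert>
#include <cstdint>

// Store a 16-bit field at a given byte offset into the emulated
// 36-bit backing store, tagging the word with its byte-enable mask.
void store_uint16(uint64_t *backingStore, int byteOffset, uint16_t data) {
  int backOffset = byteOffset / 4;    // which 32-bit word
  int subIdx = (byteOffset % 4) / 2;  // which half-word inside it
  if (subIdx == 0)
    backingStore[backOffset] = 0x300000000ULL | data;                    // BE = 0011
  else
    backingStore[backOffset] = 0xC00000000ULL | ((uint64_t)data << 16);  // BE = 1100
}
```

For sizes_t (three 16-bit fields, 6 bytes per element), s[0].i_size lands in the lower half of word 0, while s[1].i_size (byte offset 6) lands in the upper half of word 1, matching the layout figure.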
Applications
Face Detection (OpenCL)
FAST Corner Detection (OpenMP)
Color Tracking (OpenMP)
• Execution time measured by GEPOP simulation on whole applications, with minimal changes from the homogeneous P2012
• Area and power consumption of the cluster estimated through Design Compiler synthesis
• Simple energy model, summing:
  – average cluster energy consumption (DMAs, controller, etc.)
  – SW PE energy consumption
  – HWPE energy consumption
  Pessimistic model: includes polling for HWPEs
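The three contributions above can be written out as (our reconstruction of the slide’s model; the symbols are ours):

```latex
E \;\approx\; P_{\mathrm{cluster}}\, T_{\mathrm{tot}}
  \;+\; \sum_{p \,\in\, \mathrm{PEs}} P_{\mathrm{PE}}\, T_{\mathrm{active}}^{(p)}
  \;+\; \sum_{h \,\in\, \mathrm{HWPEs}} P_{\mathrm{HWPE}}\, T_{\mathrm{active}}^{(h)}
```

The model is pessimistic because a PE that merely polls a HWPE still counts as active.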
Color Tracking
2.HW architecture: He-P2012 3.HW/SW codesign: SIDL 4.Using He-P2012: CV apps 1.Introduction 5.Conclusions
CSC 0 (parallel sw) TH+MOM 0
(parallel sw) CSC 1 (parallel sw)
TH+MOM 1
(parallel sw)
sw
sw + 1x sync hwpe
sw + 1x async hwpe
DMA out
0
DMA in
0
DMA in
1
CSC 0
(async hwpe)
DMA in
2
CSC 1
(async hwpe)
TH+MOM 0
(parallel sw)
DMA in
3
CSC 2
(async hwpe)
TH+MOM 1
(parallel sw)
DMA out
0
DMA in
4
CSC 3
(async hwpe)
TH+MOM 2
(parallel sw)
DMA out
1
DMA in
5
CSC 4
(async hwpe)
TH+MOM 3
(parallel sw)
DMA out
2
TIME
DMA in
0
DMA in
1
DMA in
2
CSC 0 (parallel sw) TH+MOM 0
(parallel sw) CSC 1 (parallel sw)
TH+MOM 1
(parallel sw)
DMA out
0
DMA in
0
DMA in
1
DMA in
2
CSC 0 (parallel sw) TH+MOM 0
(parallel sw) CSC 1 (parallel sw)
TH+MOM 1
(parallel sw)
DMA out
0
CSC 0
(sync hwpe)
CSC 0
(sync hwpe)
DMA in
3
CSC 2
(sync hwpe)
TH+MOM 2
(parallel sw)
DMA out
1
fine-grain
pixel level
parallelism
Results: Color Tracking
Our model: base cluster of 1, 2, 4, 8, or 16 PEs; HWPEs are added to the cluster; PEs stay on while waiting for HWPEs.
[Plot: better performance ↑, lower energy consumption →; Performance / Area / Energy]
• BW limitation: the speedup does not keep pace with the increased power consumption
• 1x HWPE is less performant than 16x PEs (BW-limited)
• Best performance speedup: 99% of the Amdahl limit
• Best energy-delay tradeoff
• ~45x P/A/E improvement; ~5x P/A/E improvement
Face Detection
[Figure: execution timelines, exploiting coarse-grain stripe-level parallelism.
• sw: DMA in → INT IMG (parallel sw) → CASCADE INT IMG WINDOW (parallel sw) → DMA out.
• sw + 1x hw coarse (accelerates the whole cascade): HWPE contention, many threads competing on a single accelerator.
• sw + 1x/2x hw finer (accelerates the cascades except for the first 2 stages): a less-used HWPE → less contention; 2 HWPEs → almost no contention.
• sw + 2x hw finer + 4x hw int: the integral-image HWPE is needed to extract more performance.]
Results: Face Detection
[Plot: better performance ↑, lower energy consumption →; Performance / Area / Energy]
• Best energy-delay tradeoff
• ~5x P/A/E improvement; ~90x P/A/E improvement
FAST Corner Detection
[Figure: execution timelines, exploiting fine-grain pixel-level parallelism.
• sw: DMA in → CIRCULAR DETECTION (parallel sw) → DMA out.
• sw + 4x hw async: FAST HWPE jobs overlap with the DMA transfers.]
“Unlucky” parallelization: feature size ~ parallel chunk size.
Results: FAST Corner Detection
[Plot: better performance ↑, lower energy consumption →; Performance / Area / Energy]
• Best energy-delay tradeoff: ~21x P/A/E improvement
Conclusions
• A methodology to enable shared-memory HW acceleration in tightly-coupled multicore clusters
  – minimal modifications to existing software
  – supports standard tools (HLS) or a custom flow
• A flow for architectural heterogeneity exploration
  – allows relatively effortless exploration of HW/SW deployment schemes
  – supports multiple programming models on the SW side thanks to the flexible, lightweight HWPE control
• An embodiment of our methodology on the He-P2012 architecture
  – acceleration up to 99% of the Amdahl limit
  – up to 5x energy improvement vs. the best SW implementation
  – up to 21x P/A/E improvement vs. the best SW implementation
Thanks for your attention. Any questions?

Backup slides
Heterogeneous P2012 cluster
[Figure: the same cluster as before, annotated with an OpenMP execution: inside #pragma omp parallel, the master thread and slave threads run on the PEs while hwpe jobs run on the HWPEs; all join at #pragma omp barrier. Data plane: zero-copy communication through shared memory; control plane: lightweight control by pointer exchange.]
• STMicroelectronics P2012 is a fabric of tightly-coupled shared-memory clusters (shared L1 scratchpad memory) linked by a scalable NoC
• Natural support for many programming models; easy exploration of HW/SW deployment
Applications
Face Detection (OpenCL), FAST Corner Detection (OpenMP), Color Tracking (OpenMP), Removed Object Detection (OpenMP). Image taken from www.ekintechnology.com.
• Performance measured by GEPOP simulation
• Area and power consumption of the cluster estimated through Design Compiler synthesis
• Simple energy model (pessimistic: includes polling for HWPEs): average cluster energy consumption (DMAs, controller, etc.) + SW PE energy consumption + HWPE energy consumption
Data structures access
[Same memory-layout figure and smart-pointer library as before, with the byte-enable values annotated: s[0].i_size = 1 writes 0x0001 into the lower half-word of word 0 with BE = 0x3; an upper half-word write carries BE = 0xC.]
Results: Color Tracking
[Plot: better performance ↑, lower energy consumption →; Energy consumption; Performance / Area / Energy]
• BW limitation: the speedup does not keep pace with the increased power consumption (BW-limited)
• Remember the model: HWPEs are added to the cluster; PEs stay on while waiting for HWPEs
• Best performance speedup: 99% of the Amdahl limit
• Best energy-delay tradeoff: ~2x energy improvement
• ~45x P/A/E improvement; ~5x P/A/E improvement
Results: Face Detection
[Plot: better performance ↑, lower energy consumption →; Energy consumption; Performance / Area / Energy]
• HWPE contention: many threads competing on a single accelerator; a less-used HWPE → less contention; 2 HWPEs → almost no contention
• The integral-image HWPE is needed to extract more performance
• Best energy-delay tradeoff: ~2.5x energy improvement
• ~5x P/A/E improvement; ~90x P/A/E improvement
Results: FAST Corner Detection
[Plot: better performance ↑, lower energy consumption →; Energy consumption; Performance / Area / Energy]
• “Unlucky” parallelization: feature size ~ parallel chunk size
• Best energy-delay tradeoff: ~6x energy improvement
• ~21x P/A/E improvement
Removed Object Detection
[Figure: execution timelines, exploiting coarse-grain stripe-level parallelism.
• sw: DMA in → NORMALIZED CROSS-CORRELATION (parallel sw) → DMA out.
• hw coarse (no parallelism): a single coarse NCC HWPE driven by single-thread sw; no scaling, the PEs just wait for the HWPE.
• hw finer (finer-grain HWPE called from parallel threads): HWPE contention, more threads (doing nothing but control) than accelerators.]
The HWPE provides no performance advantage but better energy consumption: ~2x energy improvement.
Results: Removed Object Detection
[Plot: better performance ↑, lower energy consumption →; Energy consumption; Performance / Area / Energy]
• ~5x P/A/E improvement; ~50x P/A/E improvement
Color Tracking
[Figure: execution timelines for the three deployments (sw, sw + 1x sync hwpe, sw + 1x async hwpe), exploiting fine-grain pixel-level parallelism. In the sync variant each CSC k (sync hwpe) alternates with TH+MOM k (parallel sw); in the async variant CSC k (async hwpe) overlaps with TH+MOM k-1 (parallel sw) and with the DMA transfers.]
STMicroelectronics P2012
• P2012 is a fabric of tightly-coupled shared-memory clusters linked by a scalable NoC
L1-Coupled Acceleration
• Our approach to heterogeneity is L1-coupled shared-memory acceleration: accelerators inside the clusters
  – zero-copy communication
  – flexible programming model: same memory abstraction used in SW, same address space
  – high-level view of the HW accelerator as a HW thread of execution (sw threads + hw threads)
P. Burgio, A. Marongiu, D. Heller, C. Chavet, P. Coussy, and L. Benini. OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters. 15th Euromicro Conference on Digital System Design: Architectures, Methods & Tools, Turkey, pages 751–758, Sept. 2012.
M. Dehyadegari, A. Marongiu, M. R. Kakoee, L. Benini, S. Mohammadi, and N. Yazdani. A tightly-coupled multi-core cluster with shared-memory HW accelerators. 2012 International Conference on Embedded Computer Systems (SAMOS), pages 96–103, July 2012.
Accelerator Job Offload
[Figure: the HWPE wrapper holds multiple register-file contexts (Job 0…Job 3), each UNLOCKED or LOCKED. PE 0 and PE 1 offload jobs concurrently:
1. acquire a context: hwpe_acquire(hwpe_id) returns a job id, or -2 (LOCKED) if none is free
2. write the parameters: hwpe_setIOReg(hwpe_id, reg, value) / hwpe_setGenericReg(hwpe_id, reg, value)
3. trigger: hwpe_trigger(hwpe_id) starts the job
4. at the end of the job, the wrapper broadcasts an event notifying all PEs]
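The acquire/params/trigger sequence can be sketched as a toy model. The hwpe_* names and the -2 (LOCKED) return value follow the slide’s API; the register-file model and the instant job completion are our own stubs, kept only so the protocol can be exercised outside the real wrapper.

```cpp
#include <cassert>

const int N_CONTEXTS = 4;    // register-file contexts (Job 0..3)
const int HWPE_LOCKED = -2;  // hwpe_acquire result when all contexts are busy

static bool busy[N_CONTEXTS] = {false, false, false, false};
static int  params[N_CONTEXTS][8] = {};
static int  current_job = HWPE_LOCKED;

// acquire a free register-file context, or -2 (LOCKED) if none is free
int hwpe_acquire(int /*hwpe_id*/) {
  for (int j = 0; j < N_CONTEXTS; j++)
    if (!busy[j]) { busy[j] = true; current_job = j; return j; }
  return HWPE_LOCKED;
}

// write a job parameter into the most recently acquired context
void hwpe_setIOReg(int /*hwpe_id*/, int reg, int value) {
  params[current_job][reg] = value;
}

// start the job; in this stub it "finishes" at once and frees its context
void hwpe_trigger(int /*hwpe_id*/) {
  busy[current_job] = false;
}
```

Multiple contexts let several PEs queue jobs on one HWPE without blocking each other; only when all contexts are locked does acquisition fail.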
Accelerator Interface Definition
From the interface, the SIDL compiler generates the smart-pointer classes (hwpe_ptr_sizes_t, hwpe_sizes_t, hwpe_uint16_t, here in a variant that assembles the word with sc_uint range() assignments) and the flattened access pattern. In general:

    stage_t s[N];  p = s[i].var;  s[i].var = q;
    // becomes
    uint34_t s[N*SIZE_stage_t];
    p = (TYPE_var) ((0x0ffffffff & s[i*SIZE_stage_t + OFFSET_var]) BEOP_var);
    s[i*SIZE_stage_t + OFFSET_var] = q | BE_var << 32;

For example, for the j_size field of sizes_t:

    hwpe_ptr_sizes_t s;  sizes_t p, q;
    p = s[i].j_size;  s[i].j_size = q;
    // becomes
    uint34_t s[N*3];
    p = (short) ((0x0ffffffff & s[i*3 + 0x0]) >> 16);
    s[i*3 + 0x0] = q | 1 << 32;

[Figure: a 34-bit word = 32-bit data word + BE bits 33:32; s[0].i_size and s[0].j_size share word 0x00, s[0].k_size is in word 0x04.]
HWPE address translation
[Figure: the address translator in the wrapper maps pointers between the cluster address space (TCDM), e.g. *in0, *in1, *out in the range 0x0000–0x1400, and the accelerator address space starting at 0x10000000.]
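A base-plus-offset translation of this kind can be sketched as follows. The region table, its values, and all names below are ours for illustration (the slide only fixes the 0x10000000 accelerator base), not the wrapper’s actual registers.

```cpp
#include <cassert>
#include <cstdint>

// Toy address translator: each accelerator-side buffer is remapped onto
// its cluster-side (TCDM) address by a base/size/target triple, in the
// spirit of the wrapper's address translator.
struct Region { uint32_t acc_base, size, tcdm_base; };

// hypothetical mapping: accelerator space starts at 0x10000000
const Region regions[] = {
  {0x10000000, 0x400, 0x0000},  // *in0
  {0x10000400, 0x400, 0x0800},  // *in1
  {0x10000800, 0x400, 0x1000},  // *out
};

// translate an accelerator address into a TCDM address
// (returns 0xFFFFFFFF for an unmapped address)
uint32_t translate(uint32_t acc_addr) {
  for (const Region &r : regions)
    if (acc_addr >= r.acc_base && acc_addr < r.acc_base + r.size)
      return r.tcdm_base + (acc_addr - r.acc_base);
  return 0xFFFFFFFF;
}
```

This lets the HLS-generated IP keep its private, contiguous view of its buffers while the data actually lives scattered across the shared TCDM banks.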
Data structures in STHORM
[Figure: memory layouts of struct_t s[16] for two field orders.

    typedef struct {     typedef struct {
      int16_t alpha;       int16_t alpha;
      int16_t beta;        int32_t gamma;
      int32_t gamma;       int16_t beta;
      int32_t delta;       int32_t delta;
    } struct_t;          } struct_t;

With alpha and beta adjacent (left), each element packs into three 32-bit words; with the half-word fields interleaved among the word fields (right), padding stretches each element to four words. For Catapult, an array of data structures is just a collection of arrays.]
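The padding cost of the second field order can be checked directly; the exact sizes assume a typical ABI where int32_t has 4-byte alignment (e.g. common ILP32/LP64 targets).

```cpp
#include <cstdint>

// Field order determines padding: keeping the two int16_t fields adjacent
// packs the struct into 12 bytes (3 words, SIZE_stage_t = 3); interleaving
// them with the int32_t fields inserts 2 bytes of padding after each
// int16_t, growing the struct to 16 bytes (4 words, SIZE_stage_t = 4).
typedef struct {
  int16_t alpha;
  int16_t beta;
  int32_t gamma;
  int32_t delta;
} packed_t;

typedef struct {
  int16_t alpha;
  int32_t gamma;
  int16_t beta;
  int32_t delta;
} padded_t;
```

The same 12-vs-16-byte difference is what makes the packed order preferable for the HWPE backing store: one fewer 32-bit word per element to transfer and address.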
Data structure mapping
[Figure: the two layouts of struct_t s[] side by side: 12 bytes per element (SIZE_stage_t = 3) vs. 16 bytes per element (SIZE_stage_t = 4), with the half-word fields alpha/beta either sharing word +0x0 or occupying separate words.]
Each struct is associated to a set of #define’s:

    SIZE_stage_t = 3                  SIZE_stage_t = 4
    var    OFFSET  BE  BEOP           var    OFFSET  BE  BEOP
    alpha  0x0     00  & 0x0000ffff   alpha  0x0     00  & 0x0000ffff
    beta   0x0     01  >> 16          beta   0x8     00  & 0x0000ffff
    gamma  0x4     -   -              gamma  0x4     -   -
    delta  0x8     -   -              delta  0xc     -   -

These drive the generated accesses (34-bit words: 32 data bits + BE bits 33:32):

    stage_t s[N];  p = s[i].var;  s[i].var = q;
    // becomes
    uint34_t s[N*SIZE_stage_t];
    p = (TYPE_var) ((0x0ffffffff & s[i*SIZE_stage_t + OFFSET_var]) BEOP_var);
    s[i*SIZE_stage_t + OFFSET_var] = q | BE_var << 32;
    // e.g. for beta in the packed layout:
    p = (short) ((0x0ffffffff & s[i*3 + 0x0]) >> 16);
    s[i*3 + 0x0] = q | 1 << 32;
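The read side of this #define-driven access can be mimicked in plain C++ (uint34_t emulated by uint64_t; the values are those of the packed layout above, where beta shares word +0x0 with alpha and sits in the upper half; function names are ours).

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the generated read for beta (SIZE_stage_t = 3, OFFSET = 0x0,
// BEOP = ">> 16"): mask off the BE bits, then shift out the half-word.
int16_t read_beta(const uint64_t *s, int i) {
  return (int16_t)((0xFFFFFFFFULL & s[i*3 + 0x0]) >> 16);
}

// alpha lives in the lower half of the same word (BEOP = "& 0x0000ffff")
int16_t read_alpha(const uint64_t *s, int i) {
  return (int16_t)(0xFFFFULL & s[i*3 + 0x0]);
}
```

Because OFFSET, BE, and BEOP are compile-time constants per field, Catapult can turn every struct access into a fixed word index plus a fixed mask/shift, with no runtime address arithmetic beyond i*SIZE_stage_t.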
Performance remarks
• Loss of performance due to reduced bandwidth usage is
visible in non-clock-divided configurations
• Clock division hides reduced bandwidth usage under
performance loss of reduced frequency
• Time-multiplexing recovers performance in most cases
– up to 81% of the original HWPE performance
– in VJ (Viola–Jones face detection) this does not happen, because the internal parallelism of the HWPE does not scale well with the number of ports
HWPE Architecture
• HWPEs use the same memory abstraction as SW → very suitable for generation through HLS (e.g. Calypto Catapult System-Level Synthesis)
• HW IPs generated by Catapult use their own memory space → they must be encapsulated in a wrapper
• The wrapper provides facilities for communication with the shared memory and for control by the PEs

    void foo(int *in0, char *in1, int *out) {
      int i;
      for(i=0; i<1024; i++) {
        if(in1[i] == 100)
          return;
        else if(in0[i+1] < 10)
          out[i] = in0[i] * in1[i];
        else
          out[i] = in0[i];
      }
    }

A PE calls hwpe_foo(in0, in1, out) instead of foo(in0, in1, out); the HWPE_foo wrapper performs the address translation.
HW acceleration approaches
• HW accelerators can couple with PEs in different ways:
  – loosely coupled: private memory and address space (e.g. GPUs)
  – (very) tightly coupled: shared register file (e.g. ISA extensions, ASIPs)
• Distinguished also by communication:
  – shared memory: accelerator and CPU share a level of the memory hierarchy
  – message passing: no memory sharing, communication via messages (e.g. dataflow accelerators)
[Figure: three coupling schemes along the “more coupling / more memory sharing” axes: a loosely-coupled HW accelerator with local memory reached via DMA/FIFO at the L2/L3 level; an accelerator sharing the L1 memory with the PE; and an accelerator sharing the PE’s register file.]
Cutting combinational paths
[Figure: the shared-memory HWPE wrapper between the cluster side (shared-memory interconnect: cluster request/response, cluster address, slave module, address translator, register file with regfile cfg, controller module, sequencer, start/done/event signals) and the Catapult accelerator side (accelerator request/response, accelerator address, request buffer, response ready, the hardware IP itself). Three configurations for the request/response paths:
a) unregistered: fully combinational → low cluster frequency
b) forward-registered: a register (with enable) cuts the forward path
c) both-paths-registered: registers on both paths → high cluster frequency]