
  • He-P2012: Architectural Heterogeneity Exploration on a Scalable Many-Core Platform

    Francesco Conti*, Chuck Pilkington†, Andrea Marongiu*‡, Luca Benini*‡

    *University of Bologna, Italy   †STMicroelectronics, Ottawa, Canada   ‡ETH Zurich, Switzerland

    FP7-318013 FP7-611016

  • Dark silicon & heterogeneity

    • Moore’s Law gives us the possibility to integrate more and more Processing Elements (PEs) on chip…

    • …but the utilization wall makes it impossible to keep them all powered on together!

    • One possible solution, architectural heterogeneity: HW IPs
      – complementary/alternative to PEs
      – more power efficient than PEs

    [Figure: a many-core chip drawn as a grid of clusters, each pairing PEs with a Memory Bank. Under the utilization wall part of the homogeneous grid must be switched off; in the heterogeneous variant each cluster replaces some of its PEs with a HW IP.]

  • State of the Art

    [Figure: accelerator taxonomy plotted on two axes, "tighter control by processor" vs. "data nearer to processor".]

    • ASIP (Tensilica, Synopsys Processor Designer, Movidius…)
    • Dataflow (Maxeler, CEA Magali…)
    • L2-coupled (Accelerator-rich CMPs [1]…)
    • L1-coupled (GreenDroid [2], our work, …)

    [1] J. Cong et al. Architecture Support for Accelerator-Rich CMPs. Proceedings of the 49th Design Automation Conference.
    [2] N. Goulding-Hotta et al. The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future. IEEE Micro.

  • Heterogeneous P2012 cluster

    • STMicroelectronics P2012 is a fabric of tightly-coupled shared-memory clusters linked by a scalable NoC

    [Figure: one cluster. Sixteen STxP70 cores (PE #0–#15) and two HWPEs, each behind a wrapper, share 32 banks (#0–#31) of L1 scratchpad, the Shared Tightly-Coupled Data Memory (TCDM), through a Low-Latency Interconnection (LIC). A Peripheral Interconnection (PIC) connects the PEs to timers, the HWS, two DMA channels (#0, #1) and the Global Interconnection IF (EXT2MEM, EXT2PER, ENC2EXT).]

    • Data plane: zero-copy communication through the shared L1 scratchpad memory

    • Control plane: lightweight control by pointer exchange (sketched below)

    • A general mechanism: natural support for many programming models, easy exploration of HW/SW deployment
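
    Conceptually, a PE offloads to an HWPE by handing it pointers into the shared TCDM instead of copying buffers. A minimal C++ sketch of this pattern, under assumed names (hwpe_job_t, hwpe_submit and hwpe_wait are illustrative, not the P2012 API):

    // Pointer-exchange offload over shared L1 memory (illustrative sketch).
    struct hwpe_job_t {
        const char *a;   // input operands, already resident in TCDM
        const char *b;
        char       *c;   // output buffer, also in TCDM
    };

    void hwpe_submit(hwpe_job_t *job);  // hypothetical driver entry points
    void hwpe_wait(hwpe_job_t *job);

    void offload(char *a, char *b, char *c) {
        hwpe_job_t job = { a, b, c };  // no data is copied...
        hwpe_submit(&job);             // ...only pointers cross the control plane
        hwpe_wait(&job);               // the PE blocks, or does other useful work
        // results are directly visible through c: zero-copy by construction
    }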

  • HWPE Architecture

    • HWPEs are drop-in replacements for SW → we want to be able to generate them with HLS (e.g. Calypto Catapult System Level Synthesis)

    • HW IPs generated by Catapult use their own memory space → they must be encapsulated in a wrapper

    • The wrapper provides facilities for communication with shared memory and for control by the PEs (see the register sketch below)

    [Figure: HWPE wrapper block diagram. The HW IP reaches the L1 Shared Memory (TCDM) through Address Translation and the Low-Latency Interconnect (LIC) on the data plane, and is driven through a Register File + Control Logic with an FSM attached to the Peripheral Interconnect (PIC) and the Processing Elements on the control plane.]
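
    On the control plane the wrapper appears to the PEs as a handful of memory-mapped registers on the PIC. A hedged C++ sketch of how a PE might program them; the base address and register offsets are invented for illustration, not the real wrapper memory map:

    #include <cstdint>

    // Hypothetical memory map of one HWPE wrapper on the PIC.
    constexpr uintptr_t HWPE_BASE   = 0x10200000u;
    constexpr uintptr_t HWPE_REG_A  = HWPE_BASE + 0x0;  // pointer to operand a
    constexpr uintptr_t HWPE_REG_B  = HWPE_BASE + 0x4;  // pointer to operand b
    constexpr uintptr_t HWPE_REG_C  = HWPE_BASE + 0x8;  // pointer to result c
    constexpr uintptr_t HWPE_REG_GO = HWPE_BASE + 0xC;  // write 1 to start the FSM

    static inline void reg_write(uintptr_t addr, uint32_t val) {
        *reinterpret_cast<volatile uint32_t *>(addr) = val;
    }

    void hwpe_start(void *a, void *b, void *c) {
        reg_write(HWPE_REG_A, static_cast<uint32_t>(reinterpret_cast<uintptr_t>(a)));
        reg_write(HWPE_REG_B, static_cast<uint32_t>(reinterpret_cast<uintptr_t>(b)));
        reg_write(HWPE_REG_C, static_cast<uint32_t>(reinterpret_cast<uintptr_t>(c)));
        reg_write(HWPE_REG_GO, 1);  // kick the wrapper's control FSM
    }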

    [3] F. Conti, A. Marongiu, L. Benini. Synthesis-Friendly Techniques for Tightly-Coupled Integration of Hardware Accelerators into Shared-Memory Multi-Core Clusters. Proceedings of the 2013 International Conference on Hardware/Software Codesign and System Synthesis.
    [4] P. Burgio, A. Marongiu, D. Heller, C. Chavet, P. Coussy, L. Benini. OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters. Proceedings of the 2012 15th Euromicro Conference on Digital System Design.

  • HWPE Codesign Flow

    1. Write a P2012 application (currently OpenCL or OpenMP)
    2. Select a hot function and define the interface and the implementation of its HWPE
    3. The SIDL compiler uses the interface to generate the HWPE API and HLS scripts + templates
    4. Calypto Catapult is used to generate the HWPE models
    5. The P2012 compiler is used to generate the P2012 app
    6. Performance measurements with the P2012 Gepop simulator + area/power estimates with Synopsys DC

    Example (matrix multiplication). Original software:

    typedef struct {
        unsigned short int size_i;
        unsigned short int size_j;
        unsigned short int size_k;
    } sizes_t;

    void matrixMul(char* a, char* b, char* c, sizes_t* s) {
        int i, j, k;
        for (i = 0; i < s->size_i; i++)
            for (j = 0; j < s->size_j; j++)
                for (k = 0; k < s->size_k; k++)
                    c[i*s->size_j + j] += a[i*s->size_k + k] * b[k*s->size_j + j];
    }

    void foo() {
        #pragma omp parallel
        {
            // ...
        }
    }

    Accelerator interface + HLS directives:

    struct sizes_t {
        unsigned short int i;
        unsigned short int j;
        unsigned short int k;
    };

    #pragma sidl hw clk period 2.5
    class hwpe {
    public:
        #pragma sidl directive core/main/loop_i -UNROLL 2
        #pragma sidl directive core/main/loop_j -UNROLL 2
        virtual void matrixMul(char* a, char* b, char* c, sizes_t* s) = 0;
    };

    Accelerator HLS implementation with ‘smart’ pointers:

    void matrixMul(hwpe_ptr_char a, hwpe_ptr_char b, hwpe_ptr_char c, hwpe_ptr_sizes_t s) {
        int i, j, k;
        loop_i: for (i = 0; i < s->size_i; i++)
            loop_j: for (j = 0; j < s->size_j; j++)
                loop_k: for (k = 0; k < s->size_k; k++)
                    c[i*s->size_j + j] += a[i*s->size_k + k] * b[k*s->size_j + j];
    }
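
    After step 3, foo() can call the accelerator through the generated HWPE API instead of the SW hot function. A sketch of what that call site might look like; hwpe_handle_t, hwpe_open, hwpe_matrixMul and hwpe_close are hypothetical names, not actual SIDL compiler output:

    typedef int hwpe_handle_t;                    // placeholder handle type
    hwpe_handle_t hwpe_open();                    // hypothetical generated API
    void hwpe_matrixMul(hwpe_handle_t, char*, char*, char*, sizes_t*);
    void hwpe_close(hwpe_handle_t);

    void foo(char* a, char* b, char* c, sizes_t* s) {
        #pragma omp parallel
        {
            // ...
            #pragma omp single   // one thread programs the HWPE
            {
                hwpe_handle_t h = hwpe_open();   // acquire the accelerator
                hwpe_matrixMul(h, a, b, c, s);   // same signature as the SW function
                hwpe_close(h);
            }
        }
    }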

  • Data structures access

    C structs in shared memory are packed into 32-bit data words; each backing-store word carries the DATA WORD in bits 31:0 plus a 4-bit byte enable (BE) in bits 35:32:

    typedef struct {
        unsigned short int size_i;
        unsigned short int size_j;
        unsigned short int size_k;
    } sizes_t;

    sizes_t s[2];                  // in shared memory
    sc_uint<36> backingStore[3];   // 32-bit DATA WORD + 4-bit BE per word

    [Figure: layout of s[2] in shared memory. Word 0 at 0x00 holds s[0].i_size/s[0].j_size, word 1 at 0x04 holds s[0].k_size/s[1].i_size, word 2 at 0x08 holds s[1].j_size/s[1].k_size. Example accesses: s[0].i_size = 1; p = s[1].k_size; s[1].i_size = s[0].k_size;]

    Statically resolved C++ template library:

    class hwpe_uint16_t {
        sc_uint<36> *backingStore;
        int byteOffset;
    public:
        const uint16_t operator=(const uint16_t data) {
            int backOffset = byteOffset / sizeof(int);                   // word index
            int subIdx = (byteOffset % sizeof(int)) / sizeof(uint16_t);  // halfword within word
            if (subIdx == 0)
                backingStore[backOffset] = 0x300000000 | data;           // BE=0x3: low halfword
            else
                backingStore[backOffset] = 0xC00000000 | (data << 16);   // BE=0xC: high halfword
                                                                         // (tail reconstructed; truncated in source)
            return data;
        }
    };

    class hwpe_sizes_t {
        sc_uint<36> *backingStore;
        int byteOffset;
    public:
        hwpe_uint16_t i_size;
        hwpe_uint16_t j_size;
        hwpe_uint16_t k_size;
    };

    class hwpe_ptr_sizes_t {
        sc_uint<36> *backingStore;
        int byteOffset;
    public:
        hwpe_sizes_t operator[](int idx) {
            int newOffset = byteOffset + (idx * SIZES_T_STRUCT_SIZE);
            return hwpe_sizes_t(backingStore, newOffset);  // constructor omitted on the slide
        }
    };
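
    For illustration, an assignment through these wrapper types statically resolves, at HLS time, into a single masked word write. Assuming the 6-byte sizes_t layout above (the expansion shown is a sketch, not verified Catapult output):

    hwpe_ptr_sizes_t s;   // points at the start of the backing store
    s[1].i_size = 42;     // byte offset 6 → word 1, upper halfword
    // resolves to roughly:
    //   backingStore[1] = 0xC00000000 | (42 << 16);   // BE=0xC selects bytes 3:2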

  • Applications