20
1/20 Andrea Michelotti - Atmel Toolchain Overview & Demo hArtes 2010 Final Review Toolchain Overview & Demo Andrea Michelotti, Atmel WP2 Leader

Andrea Michelotti - Atmel Toolchain Overview & Demo 1/20 hArtes 2010 Final Review Toolchain Overview & Demo Andrea Michelotti, Atmel WP2 Leader

Embed Size (px)

Citation preview

1/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

hArtes 2010 Final Review

Toolchain Overview & Demo

Andrea Michelotti, Atmel

WP2 Leader

2/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Agenda• Key Achievements

• Developing an Application for a Heterogeneous Platform

– Initialization

– Sharing resources

– Calling a DSP function

– Targeting Execution

– Expressing parallelism

• Conclusions

3/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Key Achievements (1)

The hArtes toolchain…

• dramatically minimizes learning curve

• hides heterogeneous complexity: no expert knowledge of the platform,

tools or new programming languagesuse ‘C’ or tools that produce ‘C’

(Scilab/Nutech)

• automatically speeds-up application (in respect to GPP-only execution) by exploiting Processor Element capabilities

4/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Key Achievements (2)

Easily retarget new platforms:basic OMAP porting took ~1 monthother popular platforms like GPUs can be

targeted as wellmust conform to Molen machine

Quickly evaluate how an application behaves on different architectures

important for time to market

5/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Targeted Platforms/Architectures

Heterogeneous Platforms:Atmel DEB (Diopsis Evaluation Board):

ARM9 GPP + mAgicV DSP

UniFE’s hArtes HW Platform: ARM9 GPP + mAgicV DSP + Xilinx FPGA

Scaleo’s hArtes Emulation Platform: ARM9 GPP + mAgicV DSP + Altera FPGA

TI Experimenter OMAPL138: ARM9 GPP + C67 TI DSP

6/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Developing an Application for a Heterogeneous Platform in a nutshell

Before hArtes, toolchains for heterogeneous hardware (Diopsis and OMAP) consist of separate subchains: one for GPP and one for DSP.

MAIN PROBLEMS:

• High learning curve: target tools, architecture, APIs….

• Without knowledge of the underlying software/hardware it’s difficult to use and to benefit from the existence of PEs;

• Code maintainability: two distinct projects must be kept aligned;

• Code portability: usually GPP code contains specific APIs to load, execute and access PEs resources; An identical C-code cannot be produce correct result across all the platforms;

• Debugging: not an unified image, not an unified debugger.

7/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Writing an application for a heterogeneous platform in a nutshell

Suppose we want to port our legacy code on a powerful but heterogeneous architecture like OMAP or DIOPSIS…

void main(){

unsigned *shared_array=malloc(SIZE);

my_fft(int param,shared_array…);

another_kernel(…);

}

C code running on my host PC

Seems easy, but, after a while, becomes a nightmare! …

8/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

InitializationOMAP toolchain (through dsplink)OMAP toolchain (through dsplink)

int main(int argc,char**argv){… if (DSP_SUCCEEDED (status)) { status = PROC_setup (NULL) ; }if (DSP_SUCCEEDED (status)) { status = PROC_attach (processorId, NULL) ; if (DSP_FAILED (status)) { RDWR_1Print ("PROC_attach failed. Status: [0x%x]\n", status) ; } } else { RDWR_1Print ("PROC_setup failed. Status: [0x%x]\n", status) ; } if (DSP_SUCCEEDED (status)) { args [0] = strNumIterations ; { status = PROC_load (processorId, dspExecutable_myfft, NUM_ARGS, args) ; } if (DSP_FAILED (status)) { RDWR_1Print ("PROC_load failed. Status: [0x%x]\n", status) ; } } if (DSP_SUCCEEDED (status)) { status = PROC_start (processorId) ; if (DSP_FAILED (status)) { RDWR_1Print ("PROC_start failed. Status: [0x%x]\n", status) ; } }

Use of specific API to access DSP.

The DSP image is loaded as a file.

The DSP has its own main

GPP code

void myfft();int main(int argc,char**argv){

myfft();}

DSP code

9/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Initialization Diopsis toolchain caseDiopsis toolchain case

int main (int argc, char *argv[]){

....

ret = mAgicV_load_PM("myfft.bin",_m_fd_extm);if(ret!=0) return ret;

ret = mAgicV_load_DM(“myfft_datamem.bin");if(ret!=0) return ret;

ret = mAgicV_load_XM(“myfft_extmem.bin",(0x365890));if(ret!=0) return ret;

ret = mAgicV_init_PMU();if(ret!=0) return ret; magicV_start();..magicV_wait();.... }

GPP code

void myfft();

int main(int argc,char**argv){…myfft();…}

Very low API to access DSP.

The DSP image is binary loaded as a file. The

DSP has its own main

DSP code

10/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Initialization hArtes toolchain casehArtes toolchain case

int main (int argc, char *argv[]){

....}

GPP/Application codeThe hArtes Toolchain

and hArtes Runtime take care of hiding all the initialization details.

JUST one code.

11/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Sharing resources OMAP toolchain caseOMAP toolchain case Use of specific API

to create a shared area, translate the address for DSP and then pass the translated address

to DSP via messages.

Memory Layout must be configured

by recompiling drivers!!

int main(int argc,char**argv){ SMAPOOL_Attrs poolAttrs poolAttrs.bufSizes = (Uint32 *) &size ; poolAttrs.numBuffers = (Uint32 *) &numBufs ; poolAttrs.numBufPools = NUM_BUF_SIZES ; poolAttrs.exactMatchReq = TRUE ; volatile unsigned* my_shared_array; status = POOL_open (POOL_makePoolId(processorId, SAMPLE_POOL_ID), &poolAttrs) ; if (DSP_FAILED (status)) { MPCSXFER_1Print ("POOL_open () failed. Status = [0x%x]\n", status) ; } } if (DSP_SUCCEEDED (status)) { status = POOL_alloc (POOL_makePoolId(processorId, SAMPLE_POOL_ID), &my_shared_array, SIZE, DSPLINK_BUF_ALIGN)) ; /* Get the translated DSP address to be sent to the DSP. */ if (DSP_SUCCEEDED (status)) { status = POOL_translateAddr ( POOL_makePoolId(processorId, SAMPLE_POOL_ID), &dspCtrlBuf, AddrType_Dsp, (Void *) &my_shared_array_from_dsp, AddrType_Usr) ;

GPP code

int main(int argc,char**argv){

DSPlink_init();….}

DSP code

12/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Sharing resources Diopsis toolchain caseDiopsis toolchain case

Very raw access, sharing directly

addresses that are manually mapped,

using local copy and write back.

Use of specific APIs and

Specific compiler directives.

NO LINKER used, many problems of source alignment,

debug

#define MY_SHARED_ARRAY_ADDR 2

int main(int argc,char**argv){....unsigned local_my_shared_array[];mAgicV_read_buff(local_my_shared_array,MY_SHARED_ARRAY_ADDR,sizeof(my_shared_array));....// modify local copy, write backmAgicV_write_buff(local_my_shared_array,MY_SHARED_ARRAY_ADDR,sizeof(my_shared_array));…

GPP code

#define MY_SHARED_ARRAY_ADDR 2volatile long chess_storage(DATA:MY_SHARED_ARRAY_ADDR) my_shared_array;volatile long chess_storage(DATA:MY_SHARED_ARRAY_ADDR+SIZEOF(my_shared_array) my_other_variable;

int main(int argc,char**argv){ // access the variable}

DSP code

13/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Sharing resources hArtes toolchain casehArtes toolchain case

Very natural access

int main(int argc,char**argv){

....

Unsigned* my_shared_array;my_shared_array = malloc(MYSIZE);#pragma map call_hw dsp0 dsp_func(my_shared_array);

GPP codeIntuitive and portable.

DSE turns automatically malloc into hmalloc (hArtes API), that allocates and traces memory in a shared physical space of the target platform.

14/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Calling a DSP routineOMAP/Diopsis toolchain casesOMAP/Diopsis toolchain cases

int main(int argc,char**argv){… // initialization, see main

if (DSP_SUCCEEDED (status)) { status = PROC_start (processorId) ; if (DSP_FAILED (status)) { RDWR_1Print ("PROC_start failed. Status: [0x%x]\n", status) ; } }…

GPP code

DSP code

void my_fft(int pp, float*…);

int main(int argc,char**argv){

… // initialization, see mainmy_fft();…}

int main(int argc,char**argv){… // initialization, see main

mAgicV_start()…

GPP code

There is not concept of DSP call from GPP, the GPP can start a DSP process that executes the desired function.

The call can be inefficient or maybe cannot be executed correctly on the DSP. The programmer must know the underlying architecture!

For example the DSP in the DIOPSIS architecture the type int is 16 bit wide. Typically is 32 bit.

15/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Calling a DSP routinehArtes toolchain casehArtes toolchain case

GPP code

void my_fft(…){…}int main(int argc,char**argv){

..#pragma call_hw dsp 0my_fft();..

}

Intuitive and portable.

DSE checks if the my_fft function can be executed on the target DSP (checks parameters, used stack memory). It also estimates the cost of the call to decide if it’s convenient to move the execution on the DSP.

To call the DSP function, the DSE adds a pragma to the function call and generate a C-source that can be compiled by the DSP toolchain.

16/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Expressing Parallelism OMAP/Diopsis toolchainOMAP/Diopsis toolchain

NOT KNOWN/NOT IMPLEMENTED

17/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Expressing Parallelism hArtes toolchain casehArtes toolchain case

Void main(){

#pragma omp parallel sections

{

#pragma omp section

{

#pragma call_hw dsp 0

my_fft();

}

#pragma omp section

{

another_kernel(…);

}

}

}

Intuitive and portable.

hArtes supports some openMP construct to express parallelism.

DSE in some case automatically detects kernels that can go in parallel and adds openMP annotations to the c-source.

The parallelism can be also explicited via POSIX threads

#pragma call_hw dsp 0void my_fft(…);

int main(int argc,char**argv){… // initialization, see hthread_create(my_fft()…);Another_kernel();hthread_join();}

18/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Target Execution (under Linux) OMAP/Diopsis toolchainOMAP/Diopsis toolchain

Two separate binaries, not common symbols, not unified debugger, I/O

messages (printf) often relies on jtag connection.

RUN and HOPE!

bash$./my_fft_arm.elf <my_fft_dsp.bin>

19/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Target Execution (under Linux) hArtes toolchainhArtes toolchain

Single ELF binary, common symbols, unified debugger, I/O messages (printf)

on the target.

RUN!

$bash ./my_fft.elf

20/20

An

dre

a M

ich

elo

tti -

Atm

elT

oo

lch

ain

Ove

rvie

w &

Dem

o

Conclusions

• Although the original “Brain to Bit” (B2B) objective was very ambitious, the hArtes toolchain fulfilled its original promise: to support software development of heterogeneous hardware without expert knowledge of the target platform, and therefore allowing developers to achieve high-performance applications through complete automated solutions and by abstracting low-level hardware details.

• Areas of improvement regards mainly data flow analysis, automatic parallelization, AET integration in Eclipse, debugging capabilities.