Overview

An Introduction to 3L Diamond on Sundance Hardware

Some slides have extra information as notes.

What is 3L Diamond?

Diamond is a set of tools and other components that work together with the TI C compiler and linker to support applications using multiprocessor hardware.

Sundance hardware is well suited to Diamond’s way of dealing with multiprocessors, and the combination provides the most rapid way to get your application running efficiently.

Why Diamond?

The first response of many people when offered Diamond is:

“We do not need any extra software.

Code Composer Studio provides everything we need to write multiprocessor applications.”

Is this really true?

The Hardware

The structure of Sundance hardware is a good place to start. Sundance provides modular hardware that allows you to build complex multiprocessor systems. Modules include an FPGA that is used to implement interprocessor links that allow pairs of processors to communicate. These include comports and SDBs.

A Sundance Module

[Figure: block diagram of a Sundance module showing the C6000 DSP, FPGA, Flash ROM, external memory, comports, SDBs, JTAG in, JTAG out, and reset.]

Typical Hardware

[Figure: a typical system: a host PC with SMT395, SMT374, and SMT362 DSP modules and an SMT390 ADC module, joined by comport and SDB links.]

Scaling

Sundance hardware scales:
• There are no shared resources
• Adding processors adds communication
• No contention for shared memory or busses

How to Develop Applications

Given hardware like this, the first thought will be that Code Composer from TI is ideal for developing applications.

We shall now investigate this thought.

Code Composer Studio

•A good platform for single-processor work.

•No real support for multiprocessors.
  • CCS is really a single-processor system
  • You have to treat each processor separately.

•You build separate programs for each processor as follows:

Building with CCS

[Figure: for each processor, source files are compiled into object files and linked by the Texas Instruments compiler and linker into a separate Executable.out, giving one .out file per processor.]

Problem: Specification

•You have to divide your application into separate programs for each processor.
  • Modularity should be driven by the program structure.
  • You should not use the hardware structure.
•Difficult to use several developers:
  • only one program for each processor
•Difficult to test components
  • hard to make each processor work in isolation

How do you load the application?

•You have to load using JTAG

•JTAG is very slow (0.2MB/s)

•You have all the parts of your application as separate .out files, one for each processor.

•You have to load these, one at a time.
  • it is very easy to load the wrong processor
  • it is very easy to forget to load a processor
  • instructions for your users are complicated

Problem: Loading

•Customers need CCS (or load from ROM)

•Difficult to develop your own host program

•You can’t use JTAG from a program.

•You must use a separate mechanism to allow processors to communicate.
  • This means you have to maintain two unrelated networks:
    • JTAG chain for loading
    • I/O network for communication

Problem: Host integration

•Host communication is with JTAG.
  • very slow
  • very difficult to add your own host code
•Need to use other devices
  • need to write host driver code
  • how to start the host code & DSP code?

Problem: Communication

•How do the processors communicate?
  • No support for Sundance peripherals
  • Need to write device drivers
    • Learn device details
    • Manage EDMA
    • Deal with EDMA coherency problems
    • Manage interrupts
    • Learn the tricks to make them run fast

Problem: Message routing

If two processors want to exchange data but there is no direct connection between them, the data will have to be routed through intermediate nodes.

• How do you do this?

• How do you construct routing tables?
  • by hand?
  • build in knowledge of the processor network?
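As an illustration of what constructing routing tables involves when done by hand, here is a minimal sketch in C. It is not Diamond's routing code; the names `build_next_hop`, `links`, `MAX_NODES`, and `NO_ROUTE` are assumptions made for this example. It uses a breadth-first search from each processor to record the first hop on a shortest path to every other processor.

```c
#define MAX_NODES 8
#define NO_ROUTE  (-1)

/* Hypothetical network description: links[a][b] != 0 means processors
   a and b are joined by a direct link (for example, a comport).
   On return, next_hop[src][dst] is the neighbour of src to which data
   for dst should be forwarded, or NO_ROUTE if dst is unreachable. */
void build_next_hop(int n, int links[MAX_NODES][MAX_NODES],
                    int next_hop[MAX_NODES][MAX_NODES])
{
    for (int src = 0; src < n; src++) {
        int queue[MAX_NODES], head = 0, tail = 0;
        int first[MAX_NODES];       /* first hop used to reach each node */

        for (int i = 0; i < n; i++)
            first[i] = NO_ROUTE;
        first[src] = src;
        queue[tail++] = src;

        /* Breadth-first search outwards from src. */
        while (head < tail) {
            int cur = queue[head++];
            for (int nb = 0; nb < n; nb++) {
                if (!links[cur][nb] || first[nb] != NO_ROUTE)
                    continue;
                /* A direct neighbour is its own first hop; any other
                   node inherits the first hop of the node we came from. */
                first[nb] = (cur == src) ? nb : first[cur];
                queue[tail++] = nb;
            }
        }
        for (int dst = 0; dst < n; dst++)
            next_hop[src][dst] = first[dst];
    }
}
```

Even this small sketch has the network wired into its input table, so any change of topology means regenerating and redistributing the tables. It also says nothing about avoiding deadlock.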

Problem: Deadlock

A problem with all message routing systems is deadlocking.

This is when sending data from one processor to another has to wait for data to be transmitted between another pair of processors, but that transmission needs to wait for the first to complete!
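The circular wait described above can be modelled as a cycle in a "waits for" relation between transfers. The following is a minimal sketch under that assumption; `has_deadlock`, `waits_for`, and `NO_WAIT` are names invented for this example and are not part of Diamond or CCS.

```c
#define NO_WAIT (-1)

/* waits_for[i] is the index of the transfer that transfer i is waiting
   on, or NO_WAIT if transfer i can proceed. Returns 1 if some chain of
   waits leads back to its starting transfer (a deadlock), otherwise 0. */
int has_deadlock(const int waits_for[], int n)
{
    for (int start = 0; start < n; start++) {
        int cur = waits_for[start];
        /* A chain longer than n steps cannot reach start for the
           first time, so n steps are enough. */
        for (int steps = 0; steps < n && cur != NO_WAIT; steps++) {
            if (cur == start)
                return 1;
            cur = waits_for[cur];
        }
    }
    return 0;
}
```

With two transfers that each wait on the other (`{1, 0}`), the function reports a deadlock. With a simple chain (`{1, 2, NO_WAIT}`) it does not.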

Deadlock prevention options

•Use a proven deadlock-free system.

•Make the user stop the program and change parameters each time a deadlock happens.

•Hope it never happens.

The most common technique is:

•Be completely unaware deadlock can happen.

Problem: The Cache

•There are problems with cache coherency
  • The cache cannot maintain coherence between:
    • external memory
    • EDMA transfers
•Transfers must handle cache coherency
  • you cannot turn the cache off
  • cache errors are very hard to find

•You have to sort out all these problems.
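To show why these errors are hard to find, here is a toy model in plain C. It is not C6000 or EDMA code; `ToySystem`, `cpu_read`, `dma_write`, and `cache_invalidate` are hypothetical names. One cache line shadows a block of external memory, and a DMA-style write goes to memory behind the cache.

```c
#include <string.h>

#define LINE_WORDS 4

/* A toy model: one cache line shadowing one block of external memory. */
typedef struct {
    int memory[LINE_WORDS];   /* external memory                      */
    int cache[LINE_WORDS];    /* the CPU's cached copy                */
    int valid;                /* non-zero if the cache line is filled */
} ToySystem;

/* CPU read: served from the cache whenever the line is valid. */
int cpu_read(ToySystem *s, int i)
{
    if (!s->valid) {
        memcpy(s->cache, s->memory, sizeof s->cache);   /* line fill */
        s->valid = 1;
    }
    return s->cache[i];
}

/* DMA-style write: goes straight to external memory, behind the cache. */
void dma_write(ToySystem *s, int i, int value)
{
    s->memory[i] = value;
}

/* What the software has to remember to do before reading DMA'd data. */
void cache_invalidate(ToySystem *s)
{
    s->valid = 0;
}
```

In this model, a `cpu_read` after a `dma_write` keeps returning the old value until `cache_invalidate` is called. Nothing fails visibly; the program simply computes with stale data.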

Why loading may fail

JTAG loading assumes the cache is clear. This is not true with Sundance hardware. After reset, a bootloader is loaded from ROM and executed. This initialises the processor and configures the FPGA to implement the inter-processor communication links. The code for the bootloader gets into the cache. JTAG loads behind the cache, leading to inconsistencies that prevent programs running.

Problem: Making changes

•How do you change the network?
  • Rewrite sections of your code
  • Are there enough EDMA channels?
  • only 4 external interrupt lines for synchronisation
  • what if you use more than 4 devices?
    • host comport (2 devices)
    • comport to another processor (2 devices)
    • SDB to another processor (2 devices)
    • that is already 6 devices

Problem: Changing Devices

•How do you change processors?
  • different device addresses
  • different memory sizes
  • different memory addresses
  • different initialisation requirements

•With CCS: rewrite sections of your code.

Problem: Choosing devices

• Comports
• Sundance Digital Bus (SDB)
• Rocket I/O

•You need to learn how to use them.

•You need to write & maintain device drivers.

•You need to change your code to use them.

Before you start coding…

•Be certain you know how to partition the problem.

•Be certain you know how much memory you need.

•Be certain you know which modules you need.

•Be certain of the system topology.

… because it will be very hard to change.

The advantage of CCS

•You have complete control of everything…

•… because you have to do everything yourself

… and this takes a lot of time and experience.

CCS: Summary

•CCS works well with single processors

•It was not designed for multiple processors

•You have to do all the hard work

•Knowledge gets built into the application:
  • processor types
  • memory layout
  • I/O devices being used
  • connections between processors

•It is very hard to make significant changes.

Diamond

•Originally designed in 1987
  • tried and tested
  • proven model

•Designed for multiprocessor systems

•Designed for simplicity

•Designed for efficiency
  • during development
  • during execution

Some advantages of Diamond

•Easy to use

•Gives you flexibility: late binding
  • easy to change topology
  • easy to change modules
•Reduces housekeeping
  • memory usually allocated for you
  • interrupts handled for you
  • loading managed for you
  • communication details managed for you
  • processor issues handled for you

What Diamond is not

•Diamond is not a compiler
  • we use the standard TI compiler and linker
•Diamond is not a simulator or an interpreter
  • real, optimised code is generated
•Diamond is not DSP/BIOS
  • it has its own optimised kernel, designed for multiprocessor operation
  • it does not have or need a large API

Building with Diamond

•You partition the application into tasks:
  • modularity determined by the needs of the application; you ignore processors here.
•Diamond adds an extra configuration step.
•The configurer:
  • can see the whole application
  • can optimise communication and device access.
  • builds a single output file; nothing can get lost.
  • arranges to load from this single file.

Building with CCS

[Figure: the CCS build flow: for each processor, source files are compiled into object files and linked by the Texas Instruments compiler and linker into a separate Executable.out.]

Building with Diamond

[Figure: the Diamond build flow: source files are compiled into object files and linked by the Texas Instruments compiler and linker into relocatable task files (.tsk); the 3L Diamond configurer combines the task files into a single application file (.app).]

With Diamond…

•The application is in a single file.
•Nothing can get lost.
•You cannot get loading wrong.
•Loading is easy
•load from the host
•no need for ROM during development
•development is fast

Diamond…

•is designed for multiprocessor systems.

•has its own small, efficient microkernel.

•has a small but effective API.

•is optimised for target hardware:
  • it knows about different modules
  • it automatically inserts optimised device drivers
  • it handles interrupts
  • it handles memory and the cache

•is very good at communication

•leaves you free to concentrate on your code.

Sundance TIMs

[Figure: block diagram of a Sundance TIM: C6000 DSP, memory, ROM, and FPGA joined by the EMIF, with comport links and SDB links.]

Dual-Processor Module

[Figure: two C6000 DSPs, each with its own memory, together with an FPGA, comports, SDBs, and internal comports.]

Identical to two separate modules; there are no shared resources.

The Diamond Model

Diamond builds applications from independent tasks that send data to other tasks using channels.

This model is based upon CSP: Communicating Sequential Processes.

CSP: Communicating Sequential Processes

[Figure: four tasks connected by channels.]

Forget about processors

A Diamond application is…

•Tasks
  • complete C programs
  • start at a main function
  • fully linked (but relocatable)
  • input & output ports for connecting channels
    • unlimited number of ports
  • multi-threaded
•Channels
  • data transfer mechanisms
  • transfer data from one task to one other
  • blocking: both ends wait for completion

Channels

•Many possible implementations
  • memcpy – between tasks on one processor
  • I/O – between adjacent processors
    • comports
    • SDBs
    • Rapid IO links
  • Routed I/O – between remote processors
    • software routing
    • guaranteed deadlock-free
    • any task can communicate with any other task

Diamond will choose the best implementation.

The Hardware

[Figure: a module containing a C6000, an FPGA, and the EMIF, with comports and SDBs, and a host PC.]

A Sundance Network

[Figure: a host PC connected to a network of C62, C64, and C67 processors.]

Ideal Hardware

•No shared resources
  • Simplifies hardware
  • Simplifies software
  • Scales: more processors = more power
•Connected by communication links
  • Add processors = add bandwidth
•Designing multiprocessor hardware:
  • Speak to 3L first.

Tasks & Channels

Map onto hardware

A simple task

[Figure: the AddOne task. Words come in on DATA_IN (input channel) and incremented words go out on DATA_OUT (output channel). The task's input ports and output ports are each numbered 0, 1, 2.]

A simple task

#include <chan.h>

INPUT_PORT(0, DATA_IN)
OUTPUT_PORT(0, DATA_OUT)

main()
{
    int n;

    for (;;) {
        chan_in_word (&n, &DATA_IN);      /* wait for the next word   */
        chan_out_word(n+1, &DATA_OUT);    /* send it on, incremented  */
    }
}

Team Working

•Tasks are self-contained

•They are developed separately

•Communication between tasks:
  • is a contract
  • allows test systems to be built

•Ideal for team working
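The point that channel contracts allow test systems to be built can be illustrated on an ordinary PC. The sketch below does not use the Diamond API: `MockChan`, `mock_in_word`, `mock_out_word`, and `add_one_step` are hypothetical stand-ins with a call shape similar to the AddOne task, backed by plain arrays so that the logic can be checked without any DSP hardware.

```c
/* Host-side stand-ins for channels. These are NOT the Diamond API,
   only array-backed mocks with a similar call shape. */
typedef struct {
    int data[16];
    int count;   /* number of words stored    */
    int next;    /* index of next word to read */
} MockChan;

static void mock_in_word(int *dst, MockChan *c)
{
    *dst = c->data[c->next++];
}

static void mock_out_word(int value, MockChan *c)
{
    c->data[c->count++] = value;
}

/* One iteration of the AddOne loop, written against the mocks. */
void add_one_step(MockChan *in, MockChan *out)
{
    int n;
    mock_in_word(&n, in);
    mock_out_word(n + 1, out);
}
```

A test system would preload the input mock with known words, call `add_one_step` once per word, and compare the output mock with the expected values. Another developer can do the same for a neighbouring task, relying only on the agreed channel contract.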

Design Flow

•Network
  • Tasks
  • Channels
•Code tasks
•Compile & Link
•Configuration File
•Configure
•Load & Run

[Figure: the flow from sources to tasks, configuration file, application file, and processor network.]

Running an application

Demonstration Hardware

SMT365

SMT370

SMT374

SMT361

Only the SMT365 and the SMT361 will be used in the examples.

A Correlator Example

[Figure: task network for the correlator example: tasks Example2, Correlator, UI, Disp_cor, Disp_raw, and MainCtrl, joined by control channels and data channels.]

Code Each Task

OUTPUT_PORT(2, COR_DATA)
INPUT_PORT (1, COR_RESULT)
. . .
main()
{
    printf("3L Diamond Example\n");
    for (;;) {
        . . .
        chan_out_message(BYTES, Data, &COR_DATA);
        chan_in_message(BYTES, Result, &COR_RESULT);
        . . .
    }
}

Configuration

Write a configuration file to:

• Describe the hardware
  • processors
  • connections between processors
• Describe the software
  • tasks
  • channels connecting tasks
• Map the software onto the hardware
  • place tasks on processors

Task names

TASK example2

TASK mainctrl

TASK disp_raw

TASK disp_cor

TASK UI

TASK correlator

Task ports

TASK example2 INS=3 OUTS=7

TASK mainctrl INS=1 OUTS=1

TASK disp_raw INS=2 OUTS=0

TASK disp_cor INS=2 OUTS=0

TASK UI INS=1 OUTS=1

TASK correlator INS=1 OUTS=1

Task stack & heap

TASK example2 INS=3 OUTS=7 DATA=500K

TASK mainctrl INS=1 OUTS=1 DATA=200K

TASK disp_raw INS=2 OUTS=0 DATA=200K

TASK disp_cor INS=2 OUTS=0 DATA=200K

TASK UI INS=1 OUTS=1 DATA=200K

TASK correlator INS=1 OUTS=1 DATA=32K

Task starting priorities

TASK example2 urgent INS=3 OUTS=7 DATA=500K

TASK mainctrl INS=1 OUTS=1 DATA=200K

TASK disp_raw INS=2 OUTS=0 DATA=200K

TASK disp_cor INS=2 OUTS=0 DATA=200K

TASK UI urgent INS=1 OUTS=1 DATA=200K

TASK correlator priority=2 INS=1 OUTS=1 DATA=32K

! The starting priority is 1 unless explicitly stated.

Channel creation

!        channel  output port    input port
!        =======  ===========    ==========
CONNECT  C1       UI[0]          example2[0]
CONNECT  C2       example2[5]    mainctrl[0]
CONNECT  C3       mainctrl[0]    example2[2]
CONNECT  C4       example2[0]    disp_raw[0]
CONNECT  C5       example2[1]    disp_raw[1]
CONNECT  C6       example2[2]    correlator[0]
CONNECT  C7       correlator[0]  example2[1]
CONNECT  C8       example2[3]    disp_cor[0]
CONNECT  C9       example2[4]    disp_cor[1]
CONNECT  C10      example2[6]    UI[0]
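The CONNECT statements have to agree with the INS and OUTS values given in the TASK statements. The sketch below checks that agreement for the example, assuming ports are numbered from 0 up to INS-1 and OUTS-1, which is consistent with the slides but is not stated in them. The types `TaskDecl` and `Connect` and the function `count_bad_connections` are invented for this illustration and are not part of Diamond.

```c
#include <string.h>

typedef struct { const char *name; int ins, outs; } TaskDecl;
typedef struct { const char *from; int out_port;
                 const char *to;   int in_port; } Connect;

/* TASK declarations copied from the configuration slides. */
static const TaskDecl tasks[] = {
    {"example2", 3, 7}, {"mainctrl", 1, 1}, {"disp_raw",   2, 0},
    {"disp_cor", 2, 0}, {"UI",       1, 1}, {"correlator", 1, 1},
};

/* CONNECT statements C1..C10 copied from the configuration slides. */
static const Connect connects[] = {
    {"UI",         0, "example2",   0}, {"example2",   5, "mainctrl",   0},
    {"mainctrl",   0, "example2",   2}, {"example2",   0, "disp_raw",   0},
    {"example2",   1, "disp_raw",   1}, {"example2",   2, "correlator", 0},
    {"correlator", 0, "example2",   1}, {"example2",   3, "disp_cor",   0},
    {"example2",   4, "disp_cor",   1}, {"example2",   6, "UI",         0},
};

static const TaskDecl *find_task(const TaskDecl *t, int nt, const char *name)
{
    for (int i = 0; i < nt; i++)
        if (strcmp(t[i].name, name) == 0)
            return &t[i];
    return NULL;
}

/* Returns the number of connections that name an undeclared task or
   use a port index outside the declared INS/OUTS range. */
int count_bad_connections(const TaskDecl *t, int nt,
                          const Connect *c, int nc)
{
    int bad = 0;
    for (int i = 0; i < nc; i++) {
        const TaskDecl *src = find_task(t, nt, c[i].from);
        const TaskDecl *dst = find_task(t, nt, c[i].to);
        if (!src || !dst ||
            c[i].out_port < 0 || c[i].out_port >= src->outs ||
            c[i].in_port  < 0 || c[i].in_port  >= dst->ins)
            bad++;
    }
    return bad;
}
```

Calling `count_bad_connections(tasks, 6, connects, 10)` on the tables above finds no mismatches: example2 uses output ports 0 to 6 (OUTS=7) and input ports 0 to 2 (INS=3).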

The processor & placement

PROCESSOR Root SMT365_8_1

…

PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root

Processor types

Diamond supports all of the Sundance TIMs. The ProcType utility will display them all.

A note about memory

• With CCS you need to:
  • specify memory explicitly.
  • know which “sections” are used by the compiler
  • allocate memory explicitly at the start
• Diamond can do all memory allocation
  • available memory determined automatically
  • no linker command files
  • but, you can tell Diamond how to use memory
  • this is an optimisation once the code is working.
  • ignore it until the program’s needs are understood.

Building & Running

•Compile each task with the command: 3L C

•Link each task with the command: 3L T

•Configure with the command: 3L A

•Execute with the command: 3L X

Making it run faster

[Figure: the correlator task network: Example2, Correlator, UI, Disp_cor, Disp_raw, and MainCtrl, joined by control channels and data channels.]

Use a second processor

We shall use TIM1 (SMT365) and TIM4 (SMT361) connected by comports 0 & 3 respectively.

Demonstration Hardware

SMT365

SMT370

SMT374

SMT361



…

PLACE mainctrl Root

PLACE example2 Root

PLACE disp_raw Root

PLACE disp_cor Root

PLACE UI Root

PLACE correlator Root



PROCESSOR Node SMT361

…

PLACE mainctrl Root

PLACE example2 Root

PLACE disp_raw Root

PLACE disp_cor Root

PLACE UI Root




PROCESSOR Node SMT361

WIRE W1 Root[CP:0] Node[CP:3]

…

PLACE mainctrl Root

PLACE example2 Root

PLACE disp_raw Root

PLACE disp_cor Root

PLACE UI Root




PROCESSOR Node SMT361

WIRE W1 Root[CP:0] Node[CP:3]

…

PLACE mainctrl Root

PLACE example2 Root

PLACE disp_raw Root

PLACE disp_cor Root

PLACE UI Root

PLACE correlator Node

Notes

•The two tasks have not changed in any way.

•Their connections have not changed.

•No need to recompile them or relink them.

•All we changed to move the tasks onto a second processor was the configuration file.

•We just built a new application by running the configuration command again (3L A).

•Loading the two processors is automatic.

Making it go even faster

[Figure: a module containing a C6000, an FPGA, and the EMIF, with comports and SDBs, and a host PC.]

Use the FPGA on the SMT365

PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA

…

PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root

The FPGA is already being used

•The FPGA is also used to support functions on the SMT365 DSP.

•Attaching the FPGA to its processor allows the configurer to include all the necessary logic to support the needed functions.

Use the FPGA

PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA ATTACH=Root

…

PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root

Use the FPGA

PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA ATTACH=Root

WIRE W1 Root[SDB:0] F[SDB_DEVICE:0]

…

PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root

Use the FPGA

PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA ATTACH=Root

WIRE W1 Root[SDB:0] F[SDB_DEVICE:0]

…

PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator F

FPGA Tasks

•Placing a task on an FPGA instructs the configurer to look for an FPGA version of the task.

•This can be written using:
  • VHDL
  • Xilinx System Generator
  • Handel-C (Celoxica)
  • Any other method you like.

Building with FPGA

•The configurer will construct a Xilinx project for the FPGA

•It will call the Xilinx tools to build a complete bitstream.

•The bitstream will be included in the single application file.

•The FPGA will be configured automatically as the application is loaded.

Conclusion

•Diamond does a lot of the work for you.

•Diamond allows you to change your mind and alter processors and topology.

•Diamond gives a structured model for developing efficient applications.

•The Diamond model is the same for any number and any combination of processors: DSP or FPGA.

•Diamond simplifies developing multiprocessor applications.
