Uploaded by: hendriv

Page 1: Overview
Page 2: Overview

An Introduction to 3L Diamond

on Sundance Hardware

Some slides have extra information as notes.

Page 3: Overview

What is 3L Diamond?

Diamond is a set of tools and other components that work together with the TI C compiler and linker to support applications using multiprocessor hardware.

Sundance hardware is well suited to Diamond’s way of dealing with multiprocessors, and the combination provides the most rapid way to get your application running efficiently.

Page 4: Overview

Why Diamond?

The first response of many people when offered Diamond is:

“We do not need any extra software.

Code Composer Studio provides everything we need to write multiprocessor applications.”

Is this really true?

Page 5: Overview

The Hardware

The structure of Sundance hardware is a good place to start. Sundance provides modular hardware that allows you to build complex multiprocessor systems. Modules include an FPGA that is used to implement interprocessor links that allow pairs of processors to communicate. These include comports and SDBs.

Page 6: Overview

A Sundance Module

C6000 DSP

FPGA

Flash ROM

External Memory

Comports

JTAG out

JTAG in

Reset

SDBs

Page 7: Overview

Typical Hardware

[Diagram: a host PC and a set of Sundance modules joined by comport and SDB links. Module labels: SMT395 (DSP), SMT374 (DSP, shown twice), SMT362 (DSP), SMT390 (ADC).]

Page 8: Overview

Scaling

Sundance hardware scales:

• There are no shared resources
• Adding processors adds communication
• No contention for shared memory or busses

Page 9: Overview

How to Develop Applications

Given hardware like this, the first thought will be that Code Composer from TI is ideal for developing applications.

We shall now investigate this thought.

Page 10: Overview

Code Composer Studio

•A good platform for single-processor work.

•No real support for multiprocessors.
  • CCS is really a single-processor system
  • You have to treat each processor separately.

•You build separate programs for each processor as follows:

Page 11: Overview

Building with CCS

[Diagram: for each processor, a group of source files is compiled into object files by the Texas Instruments compiler and linked by the Texas Instruments linker into its own Executable.out file. Three processors give three separate .out files.]

Page 12: Overview

Problem: Specification

•You have to divide your application into separate programs for each processor.
  • Modularity should be driven by the program structure.
  • You should not use the hardware structure.

•Difficult to use several developers:
  • only one program for each processor

•Difficult to test components
  • hard to make each processor work in isolation

Page 13: Overview

How do you load the application?

•You have to load using JTAG

•JTAG is very slow (0.2 MB/s)

•You have all the parts of your application as separate .out files, one for each processor.

•You have to load these, one at a time.
  • it is very easy to load the wrong processor
  • it is very easy to forget to load a processor
  • instructions for your users are complicated
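To put the 0.2 MB/s figure in perspective, here is a minimal sketch of the arithmetic for loading several .out files one after another. The image sizes are hypothetical and 1 MB is taken as 1,000,000 bytes; only the link speed comes from the slide.

```c
#include <assert.h>

/* Estimated time in seconds to load image_bytes over a link that
 * runs at mb_per_s megabytes per second (1 MB = 1,000,000 bytes). */
double load_seconds(double image_bytes, double mb_per_s)
{
    assert(mb_per_s > 0.0);
    return image_bytes / (mb_per_s * 1e6);
}

/* Total time when each processor's image is loaded in turn,
 * as happens when loading separate .out files one at a time. */
double sequential_load_seconds(const double *image_bytes, int count,
                               double mb_per_s)
{
    double total = 0.0;
    for (int i = 0; i < count; i++)
        total += load_seconds(image_bytes[i], mb_per_s);
    return total;
}
```

With three assumed images of 1 MB, 2 MB and 1 MB, `sequential_load_seconds(images, 3, 0.2)` gives about 20 seconds for every load during development.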

Page 14: Overview

Problem: Loading

•Customers need CCS (or load from ROM)

•Difficult to develop your own host program

•You can’t use JTAG from a program.

•You must use a separate mechanism to allow processors to communicate.
  • This means you have to maintain two, unrelated networks:
    • JTAG chain for loading
    • I/O network for communication

Page 15: Overview

Problem: Host integration

•Host communication is with JTAG.
  • very slow
  • very difficult to add your own host code

•Need to use other devices
  • need to write host driver code
  • how to start the host code & DSP code?

Page 16: Overview

Problem: Communication

•How do the processors communicate?
  • No support for Sundance peripherals
  • Need to write device drivers
    • Learn device details
    • Manage EDMA
    • Deal with EDMA coherency problems
    • Manage interrupts
    • Learn the tricks to make them run fast

Page 17: Overview

Problem: Message routing

If two processors want to exchange data but there is no direct connection between them, the data will have to be routed through intermediate nodes.

• How do you do this?

• How do you construct routing tables?
  • by hand?
  • build in knowledge of the processor network?
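As an illustration of what constructing routing tables involves, the sketch below derives a next-hop table from a link matrix with a breadth-first search from each destination. It is a generic shortest-path construction under assumed names (`link`, `next_hop`, `MAX_NODES`), not a description of how any particular tool builds its tables, and it does not address deadlock.

```c
#include <assert.h>

#define MAX_NODES 16
#define NO_ROUTE  (-1)

/* link[a][b] is nonzero when processor a can send directly to b.
 * On return, next_hop[src][dst] is the neighbour that src should
 * forward to in order to reach dst by a shortest path, or NO_ROUTE. */
void build_routing_table(int n, int link[MAX_NODES][MAX_NODES],
                         int next_hop[MAX_NODES][MAX_NODES])
{
    assert(n > 0 && n <= MAX_NODES);
    for (int dst = 0; dst < n; dst++) {
        int queue[MAX_NODES];
        int seen[MAX_NODES] = {0};
        int head = 0, tail = 0;

        for (int src = 0; src < n; src++)
            next_hop[src][dst] = NO_ROUTE;

        /* Search outwards from the destination: the node through
         * which v is first reached is v's next hop towards dst. */
        seen[dst] = 1;
        next_hop[dst][dst] = dst;
        queue[tail++] = dst;

        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < n; v++) {
                if (link[v][u] && !seen[v]) {
                    seen[v] = 1;
                    next_hop[v][dst] = u;
                    queue[tail++] = v;
                }
            }
        }
    }
}
```

Every change to the network means rebuilding this table, which is the maintenance burden the slide is pointing at.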

Page 18: Overview

Problem: Deadlock

A problem with all message routing systems is deadlocking.

This is when sending data from one processor to another has to wait for data to be transmitted between another pair of processors, but that transmission needs to wait for the first to complete!
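The circular wait described above can be modelled as a cycle in a "waits for" relation between transfers. The following is a minimal sketch under the simplifying assumption that each transfer waits on at most one other; the names `waits_for` and `NOT_WAITING` are illustrative only.

```c
#include <assert.h>

#define MAX_TRANSFERS 16
#define NOT_WAITING   (-1)

/* waits_for[i] is the index of the transfer that transfer i is
 * blocked on, or NOT_WAITING if it can proceed. Returns 1 if some
 * transfer ends up waiting (directly or indirectly) on itself. */
int has_deadlock(const int waits_for[], int n)
{
    assert(n >= 0 && n <= MAX_TRANSFERS);
    for (int start = 0; start < n; start++) {
        int cur = start;
        /* A cycle through start is found within n steps, if at all. */
        for (int steps = 0; steps < n; steps++) {
            cur = waits_for[cur];
            if (cur == NOT_WAITING)
                break;
            if (cur == start)
                return 1;
        }
    }
    return 0;
}
```

The two-transfer case on the slide corresponds to `waits_for = {1, 0}`: transfer 0 waits for transfer 1, which waits for transfer 0.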

Page 19: Overview

Deadlock prevention options

•Use a proven deadlock-free system.

•Make the user stop the program and change parameters each time a deadlock happens.

•Hope it never happens.

The most common technique is:

•Be completely unaware deadlock can happen.

Page 20: Overview

Problem: The Cache

•There are problems with cache coherency
  • The cache cannot maintain coherence between:
    • external memory
    • EDMA transfers

•Transfers must handle cache coherency
  • you cannot turn the cache off
  • cache errors are very hard to find

•You have to sort out all these problems.

Page 21: Overview

Why loading may fail

JTAG loading assumes the cache is clear. This is not true with Sundance hardware. After reset, a bootloader is loaded from ROM and executed. This initialises the processor and configures the FPGA to implement the inter-processor communication links. The code for the bootloader gets into the cache. JTAG loads behind the cache, leading to inconsistencies that prevent programs running.

Page 22: Overview

Problem: Making changes

•How do you change the network?
  • Rewrite sections of your code
  • Are there enough EDMA channels?
  • only 4 external interrupt lines for synchronisation
  • what if you use more than 4 devices?
    • host comport (2 devices)
    • comport to another processor (2 devices)
    • SDB to another processor (2 devices)
    • that is already 6 devices
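The device count on this slide is simple arithmetic, sketched below. The two constants are taken from the slide (4 external interrupt lines, each link counted as 2 devices); the function names are illustrative.

```c
#include <assert.h>

#define EXT_INTERRUPT_LINES 4   /* from the slide */
#define DEVICES_PER_LINK    2   /* from the slide: each link counts as 2 devices */

/* Number of devices that need synchronisation for a given number of links. */
int devices_needed(int links)
{
    assert(links >= 0);
    return links * DEVICES_PER_LINK;
}

/* How many devices are left without an interrupt line of their own. */
int lines_short(int links)
{
    int devices = devices_needed(links);
    return devices > EXT_INTERRUPT_LINES ? devices - EXT_INTERRUPT_LINES : 0;
}
```

For the three links listed (host comport, comport, SDB), `devices_needed(3)` is 6 and `lines_short(3)` is 2, so interrupt lines must be shared or the design changed.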

Page 23: Overview

Problem: Changing Devices

•How do you change processors?
  • different device addresses
  • different memory sizes
  • different memory addresses
  • different initialisation requirements

•With CCS: rewrite sections of your code.

Page 24: Overview

Problem: Choosing devices

• Comports
• Sundance Digital Bus (SDB)
• Rocket I/O

•You need to learn how to use them.

•You need to write & maintain device drivers.

•You need to change your code to use them.

Page 25: Overview

Before you start coding…

•Be certain you know how to partition the problem.

•Be certain you know how much memory you need.

•Be certain you know which modules you need.

•Be certain of the system topology.

… because it will be very hard to change.

Page 26: Overview

The advantage of CCS

•You have complete control of everything…

•… because you have to do everything yourself

… and this takes a lot of time and experience.

Page 27: Overview

CCS: Summary

•CCS works well with single processors

•It was not designed for multiple processors

•You have to do all the hard work

•Knowledge gets built into the application:
  • processor types
  • memory layout
  • I/O devices being used
  • connections between processors

•It is very hard to make significant changes.

Page 28: Overview

Diamond

•Originally designed in 1987
  • tried and tested
  • proven model

•Designed for multiprocessor systems

•Designed for simplicity

•Designed for efficiency
  • during development
  • during execution

Page 29: Overview

Some advantages of Diamond

•Easy to use

•Gives you flexibility: late binding
  • easy to change topology
  • easy to change modules

•Reduces housekeeping
  • memory usually allocated for you
  • interrupts handled for you
  • loading managed for you
  • communication details managed for you
  • processor issues handled for you

Page 30: Overview

What Diamond is not

•Diamond is not a compiler
  • we use the standard TI compiler and linker

•Diamond is not a simulator or an interpreter
  • real, optimised code is generated

•Diamond is not DSP/BIOS
  • it has its own optimised kernel, designed for multiprocessor operation
  • it does not have or need a large API

Page 31: Overview

Building with Diamond

•You partition the application into tasks:
  • modularity determined by the needs of the application; you ignore processors here.

•Diamond adds an extra configuration step.

•The configurer:
  • can see the whole application
  • can optimise communication and device access.
  • builds a single output file; nothing can get lost.
  • arranges to load from this single file.

Page 32: Overview

Building with CCS

[Diagram: for each processor, a group of source files is compiled into object files by the Texas Instruments compiler and linked by the Texas Instruments linker into its own Executable.out file. Three processors give three separate .out files.]

Page 33: Overview

Building with Diamond

[Diagram: each group of source files is compiled into object files and linked, using the Texas Instruments compiler and linker, into a relocatable .tsk task file. The 3L Diamond configurer then combines the .tsk files into a single .app application file.]

Page 34: Overview

With Diamond…

• The application is in a single file.
  • Nothing can get lost.
  • You cannot get loading wrong.

• Loading is easy
  • load from the host
  • no need for ROM during development
  • development is fast

Page 35: Overview

Diamond…

•is designed for multiprocessor systems.

•has its own small, efficient microkernel.

•has a small but effective API.

•is optimised for target hardware:
  • it knows about different modules
  • it automatically inserts optimised device drivers
  • it handles interrupts
  • it handles memory and the cache

•is very good at communication

•leaves you free to concentrate on your code.

Page 36: Overview

Sundance TIMs

[Diagram: block diagram of a Sundance TIM. Labels: C6000 DSP, Memory, ROM, FPGA, EMIF (on each bus connection), Comport Links, SDB Links.]

Page 37: Overview

Dual-Processor Module

Memory

C6000 DSP

Comports

Memory

C6000 DSP

FPGA

SDBs

Internal comports

Identical to two separate modules; there are no shared resources.

Page 38: Overview

The Diamond Model

Diamond builds applications from independent tasks that send data to other tasks using channels.

This model is based upon CSP: Communicating Sequential Processes.

Page 39: Overview

CSP: Communicating Sequential Processes

Task Task

Task Task

Channel

Forget about processors

Page 40: Overview

A Diamond application is…

•Tasks
  • complete C programs
  • start at a main function
  • fully linked (but relocatable)
  • input & output ports for connecting channels
    • unlimited number of ports
  • Multi-threaded

•Channels
  • data transfer mechanisms
  • transfer data from one task to one other
  • blocking: both ends wait for completion

Page 41: Overview

Channels

•Many possible implementations
  • memcpy – between tasks on one processor
  • I/O - between adjacent processors
    • comports
    • SDBs
    • Rapid IO links
  • Routed I/O – between remote processors
    • software routing
    • guaranteed deadlock-free
    • any task can communicate with any other task

Diamond will choose the best implementation.
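The three cases on this slide can be summarised as a simple decision on where the two tasks are placed. The sketch below is only an illustration of that rule; the enum, the `wired` matrix and the function name are hypothetical and do not describe the configurer's internals.

```c
#include <assert.h>

#define MAX_PROCS 8

typedef enum { CHAN_MEMCPY, CHAN_DIRECT_LINK, CHAN_ROUTED } chan_kind;

/* wired[a][b] is nonzero when processors a and b share a physical
 * link (for example a comport or an SDB). proc_a and proc_b are the
 * processors on which the two communicating tasks are placed. */
chan_kind choose_channel(int proc_a, int proc_b,
                         int wired[MAX_PROCS][MAX_PROCS])
{
    assert(proc_a >= 0 && proc_a < MAX_PROCS);
    assert(proc_b >= 0 && proc_b < MAX_PROCS);

    if (proc_a == proc_b)
        return CHAN_MEMCPY;        /* same processor: copy in memory */
    if (wired[proc_a][proc_b] || wired[proc_b][proc_a])
        return CHAN_DIRECT_LINK;   /* adjacent processors: use the link */
    return CHAN_ROUTED;            /* remote processors: route the data */
}
```

The point for the task author is that the task code is the same in all three cases; only the placement differs.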

Page 42: Overview

The Hardware

Module

C6000

FPGA

EMIF

comports

SDBs

Host PC

Page 43: Overview

A Sundance Network

[Diagram: a host PC connected to a network of seven processors of mixed types: C62, C64 and C67.]

Page 44: Overview

Ideal Hardware

•No shared resources
  • Simplifies hardware
  • Simplifies software
  • Scales: more processors = more power

•Connected by communication links
  • Add processors = add bandwidth

•Designing multiprocessor hardware:
  • Speak to 3L first.

Page 45: Overview

Tasks & Channels

Page 46: Overview

Map onto hardware

Page 47: Overview

A simple task

[Diagram: the AddOne task. Words come in on DATA_IN (input channel) at input port 0, and incremented words go out on DATA_OUT (output channel) from output port 0. The task's input ports and output ports are each numbered 0, 1, 2.]

Page 48: Overview

A simple task

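#include <chan.h>

INPUT_PORT(0, DATA_IN)      /* input port 0 is referred to as DATA_IN */
OUTPUT_PORT(0, DATA_OUT)    /* output port 0 is referred to as DATA_OUT */

main()
{
    int n;

    for (;;) {
        chan_in_word (&n, &DATA_IN);      /* wait for the next word to arrive */
        chan_out_word(n+1, &DATA_OUT);    /* send the incremented word on */
    }
}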

Page 49: Overview

Team Working

•Tasks are self-contained

•They are developed separately

•Communication between tasks:
  • is a contract
  • allows test systems to be built

•Ideal for team working
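One way to read "allows test systems to be built" is that a task's logic can be exercised on its own by feeding its input and capturing its output. The sketch below does this on a desktop compiler for the AddOne logic. The `mock_chan` type and the `mock_*` functions are hypothetical stand-ins written for this example; they are not part of Diamond's `chan.h`, and unlike a real channel the mock reports "no more data" instead of blocking.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_CAPACITY 16

/* Hypothetical stand-in for a channel: a fixed buffer of words. */
typedef struct {
    int    words[MOCK_CAPACITY];
    size_t count;   /* number of words written so far */
    size_t next;    /* index of the next word to read */
} mock_chan;

/* Append one word to the mock channel. */
void mock_out_word(int w, mock_chan *c)
{
    assert(c->count < MOCK_CAPACITY);
    c->words[c->count++] = w;
}

/* Read one word; returns 0 when the mock channel has run dry. */
int mock_in_word(int *w, mock_chan *c)
{
    if (c->next >= c->count)
        return 0;
    *w = c->words[c->next++];
    return 1;
}

/* The AddOne logic, with the endless loop replaced by
 * "until the test input runs out" so that the test terminates. */
void add_one_until_empty(mock_chan *in, mock_chan *out)
{
    int n;
    while (mock_in_word(&n, in))
        mock_out_word(n + 1, out);
}
```

Feeding the words 0, 1, 2 into `in` leaves 1, 2, 3 in `out`, which is the contract the task promises to whoever writes the task on the other end of the channel.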

Page 50: Overview

Design Flow

•Network
  • Tasks
  • Channels

Page 51: Overview

Sources

Design Flow

•Network

•Code tasks

Page 52: Overview

Tasks

Design Flow

•Network

•Code tasks

•Compile & Link

Page 53: Overview

configuration file

Design Flow

•Network

•Code tasks

•Compile & Link

•Configuration File

Page 54: Overview

application file

Design Flow

•Network

•Code tasks

•Compile & Link

•Configuration File

•Configure

Page 55: Overview

application file

processor network

Design Flow

•Network

•Code tasks

•Compile & Link

•Configuration File

•Configure

•Load & Run

Page 56: Overview

Running an application

Page 57: Overview

Demonstration Hardware

SMT365

SMT370

SMT374

SMT361

Only the SMT365 and the SMT361 will be used in the examples.

Page 58: Overview

A Correlator Example

[Diagram: the tasks Example2, Correlator, UI, Disp_raw, Disp_cor and MainCtrl, joined by control channels and data channels.]

Page 59: Overview

Code Each Task

OUTPUT_PORT(2, COR_DATA)
INPUT_PORT (1, COR_RESULT)
. . .
main()
{
    printf("3L Diamond Example\n");
    for (;;) {
        . . .
        chan_out_message(BYTES, Data, &COR_DATA);
        chan_in_message(BYTES, Result, &COR_RESULT);
        . . .
    }
}

Page 60: Overview

Configuration

Write a configuration file to:

• Describe the hardware
  • processors
  • connections between processors

• Describe the software
  • tasks
  • channels connecting tasks

• Map the software onto the hardware
  • place tasks on processors

Page 61: Overview

Task names

TASK example2

TASK mainctrl

TASK disp_raw

TASK disp_cor

TASK UI

TASK correlator

Page 62: Overview

Task ports

TASK example2 INS=3 OUTS=7

TASK mainctrl INS=1 OUTS=1

TASK disp_raw INS=2 OUTS=0

TASK disp_cor INS=2 OUTS=0

TASK UI INS=1 OUTS=1

TASK correlator INS=1 OUTS=1

Page 63: Overview

Task stack & heap

TASK example2 INS=3 OUTS=7 DATA=500K

TASK mainctrl INS=1 OUTS=1 DATA=200K

TASK disp_raw INS=2 OUTS=0 DATA=200K

TASK disp_cor INS=2 OUTS=0 DATA=200K

TASK UI INS=1 OUTS=1 DATA=200K

TASK correlator INS=1 OUTS=1 DATA=32K

Page 64: Overview

Task starting priorities

TASK example2 urgent INS=3 OUTS=7 DATA=500K

TASK mainctrl INS=1 OUTS=1 DATA=200K

TASK disp_raw INS=2 OUTS=0 DATA=200K

TASK disp_cor INS=2 OUTS=0 DATA=200K

TASK UI urgent INS=1 OUTS=1 DATA=200K

TASK correlator priority=2 INS=1 OUTS=1 DATA=32K

! The starting priority is 1 unless explicitly stated.

Page 65: Overview

Channel creation

!       channel  output port    input port
!       =======  ===========    ==========
CONNECT C1       UI[0]          example2[0]
CONNECT C2       example2[5]    mainctrl[0]
CONNECT C3       mainctrl[0]    example2[2]
CONNECT C4       example2[0]    disp_raw[0]
CONNECT C5       example2[1]    disp_raw[1]
CONNECT C6       example2[2]    correlator[0]
CONNECT C7       correlator[0]  example2[1]
CONNECT C8       example2[3]    disp_cor[0]
CONNECT C9       example2[4]    disp_cor[1]
CONNECT C10      example2[6]    UI[0]
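The CONNECT statements have to agree with the INS and OUTS counts declared for each task. As a rough sketch of that consistency check, the C below copies the declarations from the slides into tables and verifies that every port index is in range and used only once. The struct and function names are made up for this example, and treating a doubly connected port as an error is this sketch's assumption (based on channels joining one task to one other), not a statement about the configurer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PORTS 8   /* large enough for the 7 outputs of example2 */

typedef struct { const char *name; int ins; int outs; } task_decl;
typedef struct { const char *from; int out_port;
                 const char *to;   int in_port; } connect_decl;

/* Copied from the TASK and CONNECT statements on the slides. */
static const task_decl tasks[] = {
    {"example2", 3, 7}, {"mainctrl", 1, 1}, {"disp_raw", 2, 0},
    {"disp_cor", 2, 0}, {"UI", 1, 1},       {"correlator", 1, 1},
};
static const connect_decl connects[] = {
    {"UI", 0, "example2", 0},         {"example2", 5, "mainctrl", 0},
    {"mainctrl", 0, "example2", 2},   {"example2", 0, "disp_raw", 0},
    {"example2", 1, "disp_raw", 1},   {"example2", 2, "correlator", 0},
    {"correlator", 0, "example2", 1}, {"example2", 3, "disp_cor", 0},
    {"example2", 4, "disp_cor", 1},   {"example2", 6, "UI", 0},
};
#define NTASKS    (sizeof tasks / sizeof tasks[0])
#define NCONNECTS (sizeof connects / sizeof connects[0])

/* Index of the named task in tasks[], or -1 if it is not declared. */
static int find_task(const char *name)
{
    for (size_t i = 0; i < NTASKS; i++)
        if (strcmp(tasks[i].name, name) == 0)
            return (int)i;
    return -1;
}

/* Returns the number of connections that name an unknown task, use an
 * out-of-range port, or reuse a port. The number of declared ports
 * left without a connection is stored in *unconnected. */
int check_connections(int *unconnected)
{
    int in_used[NTASKS][MAX_PORTS] = {{0}};
    int out_used[NTASKS][MAX_PORTS] = {{0}};
    int errors = 0;

    assert(unconnected != NULL);
    for (size_t i = 0; i < NCONNECTS; i++) {
        const connect_decl *c = &connects[i];
        int f = find_task(c->from);
        int t = find_task(c->to);
        if (f < 0 || t < 0 ||
            c->out_port < 0 || c->out_port >= tasks[f].outs ||
            c->in_port < 0  || c->in_port >= tasks[t].ins) {
            errors++;
            continue;
        }
        if (out_used[f][c->out_port]++) errors++;
        if (in_used[t][c->in_port]++)   errors++;
    }

    *unconnected = 0;
    for (size_t i = 0; i < NTASKS; i++) {
        for (int p = 0; p < tasks[i].ins; p++)
            if (!in_used[i][p]) (*unconnected)++;
        for (int p = 0; p < tasks[i].outs; p++)
            if (!out_used[i][p]) (*unconnected)++;
    }
    return errors;
}
```

For the configuration shown on the slides the check finds no errors and no unconnected ports: the ten channels use exactly the ten declared output ports and the ten declared input ports.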

Page 66: Overview

The processor & placement

PROCESSOR Root SMT365_8_1

PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root

Page 67: Overview

Processor types

Diamond supports all of the Sundance TIMs. The ProcType utility will display them all.

Page 68: Overview

A note about memory

• With CCS you need to:
  • specify memory explicitly.
  • know which “sections” are used by the compiler
  • allocate memory explicitly at the start

• Diamond can do all memory allocation
  • available memory determined automatically
  • no linker command files
  • but, you can tell Diamond how to use memory
  • this is an optimisation once the code is working.
  • ignore it until the program’s needs are understood.

Page 69: Overview

Building & Running

•Compile each task with the command: 3L C

•Link each task with the command: 3L T

•Configure with the command: 3L A

•Execute with the command: 3L X

Page 70: Overview

Making it run faster

[Diagram: the tasks Example2, Correlator, UI, Disp_raw, Disp_cor and MainCtrl, joined by control channels and data channels.]

Page 71: Overview

Use a second processor

We shall use TIM1 (SMT365) and TIM4 (SMT361) connected by comports 0 & 3 respectively.

Page 72: Overview

Demonstration Hardware

SMT365

SMT370

SMT374

SMT361

Page 73: Overview

Use a second processor

PROCESSOR Root SMT365_8_1

PLACE mainctrl Root

PLACE example2 Root

PLACE disp_raw Root

PLACE disp_cor Root

PLACE UI Root

PLACE correlator Root

Page 74: Overview

Use a second processor

PROCESSOR Root SMT365_8_1

PROCESSOR Node SMT361

PLACE mainctrl Root

PLACE example2 Root

PLACE disp_raw Root

PLACE disp_cor Root

PLACE UI Root

PLACE correlator Root

Page 75: Overview

Use a second processor

PROCESSOR Root SMT365_8_1

PROCESSOR Node SMT361

WIRE W1 Root[CP:0] Node[CP:3]

PLACE mainctrl Root

PLACE example2 Root

PLACE disp_raw Root

PLACE disp_cor Root

PLACE UI Root

PLACE correlator Root

Page 76: Overview

Use a second processor

PROCESSOR Root SMT365_8_1

PROCESSOR Node SMT361

WIRE W1 Root[CP:0] Node[CP:3]

PLACE mainctrl Root

PLACE example2 Root

PLACE disp_raw Root

PLACE disp_cor Root

PLACE UI Root

PLACE correlator Node

Page 77: Overview

Notes

•The two tasks have not changed in any way.

•Their connections have not changed.

•No need to recompile them or relink them.

•All we changed to move the tasks onto a second processor was the configuration file.

•We just built a new application by running the configuration command again (3L A).

•Loading the two processors is automatic.

Page 78: Overview

Making it go even faster

Module

C6000

FPGA

EMIF

comports

SDBs

Host PC

Page 79: Overview

Use the FPGA on the SMT365

PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root

Page 80: Overview

The FPGA is already being used

•The FPGA is also used to support functions on the SMT365 DSP.

•Attaching the FPGA to its processor allows the configurer to include all the necessary logic to support the needed functions.

Page 81: Overview

Use the FPGA

PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA ATTACH=Root
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root

Page 82: Overview

Use the FPGA

PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA ATTACH=Root
WIRE W1 Root[SDB:0] F[SDB_DEVICE:0]
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root

Page 83: Overview

Use the FPGA

PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA ATTACH=Root
WIRE W1 Root[SDB:0] F[SDB_DEVICE:0]
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator F

Page 84: Overview

FPGA Tasks

•Placing a task on an FPGA instructs the configurer to look for an FPGA version of the task.

•This can be written using:
  • VHDL
  • Xilinx System Generator
  • Handel-C (Celoxica)
  • Any other method you like.

Page 85: Overview

Building with FPGA

•The configurer will construct a Xilinx project for the FPGA

•It will call the Xilinx tools to build a complete bitstream.

•The bitstream will be included in the single application file.

•The FPGA will be configured automatically as the application is loaded.

Page 86: Overview

Conclusion

•Diamond does a lot of the work for you.

•Diamond allows you to change your mind and alter processors and topology.

•Diamond gives a structured model for developing efficient applications.

•The Diamond model is the same for any number and any combination of processors: DSP or FPGA.

•Diamond simplifies developing multiprocessor applications.

Page 87: Overview