Introduction to circuit design using Celoxica’s Handel-C

Presenter: Mr. David Sanders

Co-Sponsored by: The Internet Innovation Centre and IEEE Computer Society


Agenda

FPGA overview

Purpose of Handel-C

Comparison with ANSI C

Handel-C: data types, parallelism, special Handel-C constructs and data types

Hardware implementation of (some) Handel-C constructs

Optimization and retiming features of Celoxica’s design tool

FPGAs

Field Programmable Gate Array: a user-programmable logic device with a collection of Look-Up Tables (LUTs), routing resources, and Input/Output Blocks (IOBs).

The number of LUT inputs varies with the vendor and technology used. LUTs usually have at least 4 inputs, but state-of-the-art LUTs can have up to 8 (Altera’s 8-input fracturable LUT).

Modern FPGAs also include dedicated RAM blocks, ALUs, multipliers, and hard or soft-core processors (e.g. ARM, NIOS II, MicroBlaze, PPC).

So what does Handel-C do for us?

Those who have programmed in VHDL/Verilog know that you must think in terms of a state machine, and write the code accordingly.

Handel-C is one level of abstraction higher than an HDL. The compiler deals with the state machine generation automatically.

The compiler will create a netlist, but not the FPGA programming files, so the FPGA vendor tools are still required.

It does, however, provide scripts for automating place and route and bitstream creation for the Xilinx and Altera tools.


Still no free lunch!!

Just as in Professor McLeod’s talk last time, there is no free lunch.

The state machine generated by the Handel-C compiler uses One-Hot Encoding.

Not necessarily optimal for every design, but still gives good results in practice.
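To make the one-hot trade-off concrete, here is a minimal sketch in plain ANSI C (not Handel-C, and not the compiler's actual output). It assumes a machine with fewer than 32 states, and all function names are hypothetical. One-hot encoding spends one flip-flop per state, whereas a dense binary encoding needs only ceil(log2(n)) flip-flops but more decoding logic.

```c
#include <stdint.h>

/* Encode state number `state` (0-based, must be < 32) as a one-hot word:
   only bit `state` is set. */
uint32_t one_hot_encode(unsigned state)
{
    return (uint32_t)1u << state;
}

/* A valid one-hot word has exactly one bit set. */
int is_one_hot(uint32_t word)
{
    return word != 0u && (word & (word - 1u)) == 0u;
}

/* Flip-flops needed for num_states states with one-hot encoding:
   one per state. */
unsigned one_hot_flip_flops(unsigned num_states)
{
    return num_states;
}

/* Flip-flops needed with a dense binary encoding: ceil(log2(num_states)). */
unsigned binary_flip_flops(unsigned num_states)
{
    unsigned bits = 0;
    while ((1u << bits) < num_states)
        bits++;
    return bits;
}
```

For a 5-state machine this gives 5 flip-flops with one-hot encoding against 3 with binary encoding.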

Other Capabilities of Handel-C Design Suite

Provides support for Altera, Xilinx, and Actel FPGAs.

The Handel-C compiler can take advantage of technology available in a particular device (RAMs, ALUs, multipliers, etc.).

The compiler can provide output in different formats:

  Netlist (EDIF)

  VHDL code from C code

  Debug (can be used with a SystemC or ANSI C front-end for verification)

Provides a Platform Abstraction Layer (PAL): a set of common utilities for hardware devices commonly found on development boards (video, keyboard, mouse, Ethernet, RS-232, LED output, general I/O, etc.).

Provides support for integration with company-specific tools and/or intellectual property (Quartus II, SOPC Builder, NIOS II processor, MicroBlaze processor).

Handel-C Data Types

Handel-C supports all of the primitive integral types provided by ANSI C (signed and unsigned): char, int, short, long.

Variables are implemented as registers. The depth of an array must be specified at compile time.

Variables of arbitrary width from 1 to 128 bits can also be declared, e.g.

    unsigned 8 myVariable;
    signed 25 myVariable2[15];

There are no native floating-point types or calculations in the current version. The course instructor claims they will be included in the next release.

Operators

All the operators from ANSI C, plus a few others:

Relational: !=, ==, <, >, <=, >= (GT and LT are expensive to evaluate with combinational logic). Operands must have the same width. The result is a 1-bit value.

Logical: &&, ||, ! These take 1-bit unsigned operands; however, for wider operands the compiler will take x || y as x!=0 || y!=0.

Bitwise: ^, |, &, ~ Operands must have equal width.

Shift: <<, >> For a << b, b must have a width of ceil(log2(width(a)+1)).

Macros provided by the Platform Developers Kit (PDK).
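The shift-width rule can be worked through with a small helper. This is a sketch in plain ANSI C that only evaluates the formula ceil(log2(width(a)+1)) from the slide; the function name is hypothetical and is not one of the PDK macros.

```c
/* Width in bits required for the shift amount b in `a << b`,
   given the width of a: ceil(log2(width_a + 1)).
   Computed with integers: the smallest `bits` with 2^bits >= width_a + 1. */
unsigned shift_amount_width(unsigned width_a)
{
    unsigned bits = 0;
    while ((1u << bits) < width_a + 1u)
        bits++;
    return bits;
}
```

For an 8-bit operand the shift amount must be 4 bits wide, because it has to represent every shift from 0 to 8 inclusive.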

…the others

Bit manipulation: Take: <- Drop: \\

    a = 1 0 0 1 1
    b = a <- 3 = 0 1 1    (take the 3 least significant bits)
    c = a \\ 3 = 1 0      (drop the 3 least significant bits)

Very cheap in hardware since these operators are implemented as wires.

Range selection: expression[n:m] (bits n to m)

    a[3:1] = 0 0 1

Concatenation: expression1 @ expression2

    d = a @ a[3:1] = 1 0 0 1 1 0 0 1
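The four operators above can be mimicked in plain ANSI C with shifts and masks. The sketch below is only a software model under the assumption that every value is at most 31 bits wide; the `bits_t` type and the function names are hypothetical.

```c
#include <stdint.h>

/* A value together with its width in bits (width must be < 32 here). */
typedef struct {
    uint32_t value;
    unsigned width;
} bits_t;

static uint32_t low_mask(unsigned width)
{
    return ((uint32_t)1u << width) - 1u;
}

/* a <- n : keep the n least significant bits. */
bits_t bits_take(bits_t a, unsigned n)
{
    bits_t r;
    r.value = a.value & low_mask(n);
    r.width = n;
    return r;
}

/* a \\ n : discard the n least significant bits. */
bits_t bits_drop(bits_t a, unsigned n)
{
    bits_t r;
    r.value = a.value >> n;
    r.width = a.width - n;
    return r;
}

/* a[hi:lo] : bits hi down to lo, inclusive. */
bits_t bits_range(bits_t a, unsigned hi, unsigned lo)
{
    bits_t r;
    r.width = hi - lo + 1u;
    r.value = (a.value >> lo) & low_mask(r.width);
    return r;
}

/* hi @ lo : hi becomes the most significant part of the result. */
bits_t bits_concat(bits_t hi, bits_t lo)
{
    bits_t r;
    r.value = (hi.value << lo.width) | lo.value;
    r.width = hi.width + lo.width;
    return r;
}
```

With a = 1 0 0 1 1 (0x13, width 5), the model gives 0 1 1 for the take, 1 0 for the drop, 0 0 1 for a[3:1] and 1 0 0 1 1 0 0 1 (0x99) for the concatenation, matching the values on the slide.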


Parallelism

Since logic circuit operation is highly parallel by nature, it is necessary for a design tool to support parallelism.

Accomplished in Handel-C by using a par statement, as opposed to a seq statement, where the code is executed sequentially.

    static unsigned 8 a = 2;
    static unsigned 8 b = 1;

    par
    {
        a++;
        b = a + 10;
    }

Each Handel-C assignment takes 1 clock cycle. Both statements begin execution at the same time, so together they take only 1 clock cycle. Each operation uses the value that the variable held at the end of the previous clock cycle.

Results: a = 3, b = 12

    static unsigned 8 a = 2;
    static unsigned 8 b = 1;

    seq
    {
        a++;
        b = a + 10;
    }

The seq block operates in the same manner as you would expect from an ANSI C program.

Results: a = 3, b = 13
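The difference between the two blocks can be modelled in plain ANSI C. In this sketch (hypothetical names, not Handel-C), the par version computes every new register value from a snapshot of the old values, while the seq version updates the registers one statement at a time.

```c
#include <stdint.h>

typedef struct {
    uint8_t a;
    uint8_t b;
} regs_t;

/* Model of: par { a++; b = a + 10; }
   Both right-hand sides read the values held before the clock edge. */
regs_t run_par(regs_t old)
{
    regs_t next = old;
    next.a = (uint8_t)(old.a + 1u);
    next.b = (uint8_t)(old.a + 10u);  /* uses the OLD value of a */
    return next;
}

/* Model of: seq { a++; b = a + 10; }
   The second statement sees the result of the first. */
regs_t run_seq(regs_t r)
{
    r.a = (uint8_t)(r.a + 1u);
    r.b = (uint8_t)(r.a + 10u);       /* uses the NEW value of a */
    return r;
}
```

Starting from a = 2, b = 1, run_par yields a = 3, b = 12 and run_seq yields a = 3, b = 13, as on the slide.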


Signals

However, occasionally we need to use a value immediately after assigning it in a par block.

This can be done by declaring a variable as a signal.

The value of a signal lasts only for the duration of the current clock cycle.

    signal unsigned 8 a;
    static unsigned 8 b;

    par
    {
        a = 7;
        b = a;
    }

Results: a = 0, b = 7

    signal unsigned 8 a;
    static unsigned 8 b;

    seq
    {
        a = 7;
        b = a;
    }

Results: a = 0, b = 0
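One way to picture these results is to treat a signal as a wire that carries a value only during the cycle in which it is driven. The plain ANSI C sketch below is a rough model only: it assumes an undriven signal reads as 0 (which is what the results above imply), and the names are hypothetical.

```c
#include <stdint.h>

/* Assumed value of a signal in a cycle where nothing drives it. */
#define SIGNAL_DEFAULT 0u

/* Model of: par { a = 7; b = a; }
   Both statements share one clock cycle, so b latches the driven value. */
uint8_t signal_par_result_b(void)
{
    uint8_t a_wire = 7u;      /* a is driven with 7 during this cycle */
    uint8_t b = a_wire;       /* b reads the wire in the same cycle */
    return b;
}

/* Model of: seq { a = 7; b = a; }
   The statements run in consecutive cycles, so the 7 is gone when b reads a. */
uint8_t signal_seq_result_b(void)
{
    uint8_t a_wire;
    uint8_t b;
    a_wire = 7u;              /* cycle 1: a is driven with 7 */
    a_wire = SIGNAL_DEFAULT;  /* end of cycle 1: the value is not retained */
    b = a_wire;               /* cycle 2: b reads the undriven signal */
    return b;
}
```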

Nesting seq and par

Can be nested as in the following example:

    par
    {
        seq
        {
            /* some statements to be executed sequentially */
        }
        seq
        {
            /* these statements are executed sequentially,
               but in parallel with previous seq block */
        }
    }

par will not return until all of the statements/sub-blocks have completed.

Special Data Types

Input/Output: obviously there must be a mechanism for performing I/O with the FPGA.

Handel-C has data types for buses or interfaces (input, output, tri-state).

It also supports ports: I/O between modules/components in a design, not a physical pin.

I/O Declaration Examples

Input interface prototype:

    interface bus_in(type portName) Name() with {data = {Pin List}};

Input interface usage:

    interface bus_in(unsigned 2 val) myInput() with {data = {"P1","P2"}};

    unsigned 2 inData;

    inData = myInput.val; //read the value {P1 P2}

Examples cont’d

Output interface prototype:

    interface bus_out() Name(type portName=Expression) with {data = {Pin List}};

Output interface usage:

    static unsigned 8 counter = 0;
    interface bus_out() CountOut(unsigned 8 outVal=counter+1);
    while(1)
    {
        counter++;
    }

RAM and ROM

There is no such thing as malloc() on an FPGA. Instead, Handel-C allows you to store variables in the FPGA’s dedicated RAM blocks.

    ram int 9 myRam[256]; /* a RAM block that holds 256 9-bit integers */

    static rom int 9 myRom[3] = {100,200,300}; /* must be static or global */

These differ from arrays because declaring an array is the same as declaring multiple variables. This means that an array’s elements can be accessed simultaneously; a RAM’s cannot, because a RAM only has 1 or 2 ports.

    myRam[25]++; /* read, modify, write = undefined results */

    par /* 2 writes during the same cycle -> this also won’t work */
    {
        myRam[0] = 100;
        myRam[2] = 498;
    }

If/Else

Handel-C if/else syntax is almost the same as in ANSI C.

The exception: the condition of the if() must take 0 clock cycles to evaluate. This implies that there cannot be any variable assignment in the condition expression.

    if( (z = x + y) == 6) //legal in ANSI C, but not in Handel-C

Loops

while(), for(), do…while()

All have the same syntax as in ANSI C. The same limitation applies to the conditions as with if/else.

When programming a PC, it is good practice to use a for loop when the context calls for it. When writing C code for circuits, it’s almost never good practice to use for() loops at all: there is one clock cycle of overhead per iteration.

While Loop Optimization

The limitations of a for() loop can be avoided by updating a counter variable in parallel with the body of a while() loop.

    static unsigned 4 x = 15;

    do
    {
        par
        {
            //do something
            x--;
        }
    } while(x != 0);
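The saving can be estimated with a back-of-the-envelope cycle count. The plain ANSI C sketch below assumes that a for() loop costs one extra clock cycle per iteration for its counter update (as stated on the Loops slide) and ignores the loop initialisation; the function names are hypothetical.

```c
/* Cycles for a for() loop: each iteration pays for the body
   plus one cycle for the counter update. */
unsigned long for_loop_cycles(unsigned long iterations,
                              unsigned long body_cycles)
{
    return iterations * (body_cycles + 1ul);
}

/* Cycles when the counter is updated in parallel with the body:
   the update is hidden inside the body's cycles. */
unsigned long parallel_counter_cycles(unsigned long iterations,
                                      unsigned long body_cycles)
{
    return iterations * body_cycles;
}
```

Under these assumptions, 15 iterations of a one-cycle body cost 30 cycles in a for() loop but only 15 when the counter is updated in parallel with the body.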

Macros, Channels, Prialt, and Semaphores

Scenario: Suppose you need to design a circuit that calculates pixel values in a frame buffer, and that each calculation takes 4 or 5 clock cycles. However, you need to calculate one pixel every clock cycle to meet a display timing constraint.

Possible Solution: Duplicate the calculation code 5 times, and have each block store values in the proper place in the frame buffer.

Macros

Macros can be used to implement parameterizable code, or to provide code re-use.

A macro is like a regular function without parameter types. For the solution to our scenario, declaring a macro would look like:

    macro proc myCalculation(dataSource)
    {
        //receive data from source
        //Perform 3-5 clock cycles worth of calculations
    }

Channels

Handel-C provides a channel type to allow for synchronization or communication between parallel processes.

Declaration: chan <type> <channelName>

Data can then be sent over the channel, or received from it, but only in one direction.

The channel must be declared with global scope. Each channel operation will block if the other party is not ready.

Two parallel blocks of code:

    chan unsigned 8 dataPipe;
    static unsigned 8 someData = 5;
    …
    dataPipe ! someData;
    …

    static unsigned 8 recvData;
    …
    dataPipe ? recvData;
    …


Prialt

Now suppose we have 5 of our ‘worker’ processes running in parallel. How do we use them to achieve our goal?

Each operation will complete in 3-5 cycles, so we don’t know which of the 5 will be free to perform the next pixel calculation.

But if we send data down a channel sequentially to each of the 5 processes, we might block on one of them, when another is not doing anything…wasted clock cycles.

Prialt is the solution for this.

Prialt

Similar to a case statement that chooses the first channel able to receive data. In other words, it gives a priority to each channel.

    prialt
    {
        case channel1 ! data ;
            break;
        case channel2 ! data ;
            break;
        default:
            break;
    }

If default is not used, then prialt will block on the last case statement if a prior one was not taken.

Need to be careful that processes aren’t starved (wasted resources).
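The selection rule can be sketched in plain ANSI C as a priority scan over "ready" flags. This is only a software model of the choice prialt makes (it does not model blocking or the channel transfer itself), and the function name and the -1 return convention are hypothetical.

```c
/* Return the index of the first ready channel, scanning in priority order
   (index 0 has the highest priority). Return -1 when no channel is ready,
   which corresponds to taking the default branch. */
int prialt_select(const int ready[], int num_channels)
{
    int i;
    for (i = 0; i < num_channels; i++) {
        if (ready[i])
            return i;
    }
    return -1;
}
```

Because the scan always starts from index 0, a channel late in the list is only chosen when every earlier one is busy, which is why starvation needs attention.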


Semaphores

Once a process has finished its computation we need to update the frame buffer (FB), which is typically implemented in a RAM block for FPGA area efficiency.

Recall that a RAM block typically only has one write port, therefore we can’t have each process write to the frame buffer because we can’t guarantee that simultaneous access will not happen.

One solution is to have each process send the result down a separate channel to another process that deals with FB access.

But this is a section on semaphores, so we’ll go with them instead.

Semaphores

Semaphores can be used to guard critical sections of code against parallel access. They behave more like a mutex from POSIX threads.

trysema() is used to check whether the critical section is free (and to claim it), and releasesema() is used to free it again, e.g.

    sema fbGuard;

    while(trysema(fbGuard)==0)
        delay; /* loop until semaphore is free */

    /* critical section of code, i.e. frame buffer access */

    releasesema(fbGuard); /* skipping this step could result in deadlock */
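The take/release protocol can be illustrated with a tiny single-threaded model in plain ANSI C. It only mirrors the return-value convention used above (non-zero when the semaphore was taken, 0 when it was busy); the type and function names are hypothetical and this is not how the hardware implements a sema.

```c
/* Software model of a binary semaphore. */
typedef struct {
    int taken;
} sema_model;

/* Try to take the semaphore: returns 1 on success, 0 if it is already held. */
int model_trysema(sema_model *s)
{
    if (s->taken)
        return 0;
    s->taken = 1;
    return 1;
}

/* Release the semaphore so that another process can take it. */
void model_releasesema(sema_model *s)
{
    s->taken = 0;
}
```

A second attempt to take the semaphore fails until the holder releases it, which is exactly the situation the polling loop waits out.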

Putting it all together…

    #define NUM_CHANNELS 5
    #define SCR_WIDTH 4
    #define SCR_HEIGHT 4

    set clock = external;

    typedef struct point //just as in ANSI C
    {
        unsigned 2 x;
        unsigned 2 y;
    } point;

    sema fbGuard;

    //you can even send structures over channels
    chan point dataChannels[NUM_CHANNELS];

    ram unsigned 8 frameBuffer[SCR_WIDTH*SCR_HEIGHT];

    macro proc increment(p)
    {
        if(p.x==SCR_WIDTH-1)
        {
            par
            {
                p.x=0;
                p.y++;
            }
        }
        else
            p.x++;
    }

    macro proc coordGen()
    {
        point pGen;
        pGen.x = 0;
        pGen.y = 0;
        while(1)
        {
            prialt
            {
                case dataChannels[0] ! pGen: increment(pGen); break;
                case dataChannels[1] ! pGen: increment(pGen); break;
                case dataChannels[2] ! pGen: increment(pGen); break;
                case dataChannels[3] ! pGen: increment(pGen); break;
                case dataChannels[4] ! pGen: increment(pGen); break;
                default: delay; break;
            }
        }
    }


    void main()
    {
        par
        {
            //create the coord generator and the worker processes
            coordGen();
            worker(dataChannels[0]);
            worker(dataChannels[1]);
            worker(dataChannels[2]);
            worker(dataChannels[3]);
            worker(dataChannels[4]);

            //will never return because at
            //least 1 process has an infinite loop
        }
    }

    macro proc worker(channel)
    {
        point p;
        static unsigned 8 pixel = 0;

        //loop forever waiting for data to compute pixels with
        while(1)
        {
            channel ? p;

            if((p.x <- 1) == 0 && (p.y <- 1) == 0) //x, y are even
            {
                pixel = 2;
                delay;
                delay;
            }
            else if((p.x <- 1) == 1 && (p.y <- 1) == 1) //both odd
            {
                pixel = 1;
                delay;
            }
            else //x is even/odd and y is odd/even
                pixel = 3;

            //critical section
            while(trysema(fbGuard) == 0)
                delay;
            frameBuffer[p.y @ p.x] = pixel;
            releasesema(fbGuard);
        }
    }
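The worker's arithmetic can be checked with a plain ANSI C model. This sketch assumes that the frame-buffer index is the 4-bit concatenation of the 2-bit y and x coordinates (y in the upper bits), and it reproduces only the parity test and the index calculation, not the channels, delays or semaphore. The function names are hypothetical.

```c
#include <stdint.h>

/* Frame-buffer index for a 4x4 screen: y in bits 3:2, x in bits 1:0
   (assumed to be the concatenation of y and x). */
uint8_t fb_index(uint8_t x, uint8_t y)
{
    return (uint8_t)(((y & 3u) << 2) | (x & 3u));
}

/* Pixel value chosen by the worker from the parity of the coordinates. */
uint8_t pixel_value(uint8_t x, uint8_t y)
{
    uint8_t x_lsb = (uint8_t)(x & 1u);  /* least significant bit of x */
    uint8_t y_lsb = (uint8_t)(y & 1u);  /* least significant bit of y */

    if (x_lsb == 0u && y_lsb == 0u)     /* both even */
        return 2u;
    if (x_lsb == 1u && y_lsb == 1u)     /* both odd */
        return 1u;
    return 3u;                          /* one even, one odd */
}
```

With this layout the index equals y * SCR_WIDTH + x, so the 16 coordinates map onto the 16 frame-buffer entries without gaps.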

Mapping Handel-C to Logic

Ultimately, the statements you write in Handel-C must be mapped to logic by the compiler.

The following slides show the mapping for some of the constructs discussed so far:

  assignment

  seq and par

  if

  while

  do…while

The following logic circuits are taken from the course notes from Celoxica’s DK training course.


Assignment

a = b;

Sequential Statements

    seq
    {
        statement1;
        statement2;
    }

Parallel Statements

    par
    {
        statement1;
        statement2;
    }

If Statements

    if (Condition)
        statement2;

While Loops

    while (Condition)
    {
        statement2;
    }

    do
    {
        statement2;
    } while(Condition);


Automatic Retiming


Why Retime?

Many designs will require the use of a multiplier, divider, or other large combinational logic circuit. The propagation delay through deep logic can be quite long.

Having even one path in the design with a long delay could cause the maximum clock rate to drop significantly to the point where timing constraints cannot be met.

Retiming involves moving/adding flip-flops around the data path to reduce the depth of logic, and ultimately reduce the critical path delay.

Simple Example [1]

    x = a+b+c+d;

The result is calculated through two adder stages. However, we can pipeline the calculation by inserting registers at intermediate locations.

The adder stages are split with two registers. This reduces the propagation delay of each stage, allowing a higher clock frequency.

The consequence is that the result is delayed by one cycle.

[1] Example adapted from Celoxica’s Handel-C and DK training course notes.
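The one-cycle delay can be seen in a small plain ANSI C model of the pipelined adder. The sketch assumes the two registers hold the intermediate sums a+b and c+d; each call to the step function stands for one clock edge. The type and function names are hypothetical.

```c
#include <stdint.h>

/* The two pipeline registers between the adder stages. */
typedef struct {
    uint16_t sum_ab;  /* registered a + b */
    uint16_t sum_cd;  /* registered c + d */
} adder_pipeline;

/* One clock cycle: the second stage adds the values registered in the
   previous cycle, while the first stage registers the new partial sums. */
uint16_t adder_pipeline_step(adder_pipeline *p,
                             uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    uint16_t result = (uint16_t)(p->sum_ab + p->sum_cd);
    p->sum_ab = (uint16_t)(a + b);
    p->sum_cd = (uint16_t)(c + d);
    return result;
}
```

Feeding in (1, 2, 3, 4) and then (5, 6, 7, 8) returns 0 on the first step and 10 on the second: each sum appears one cycle after its inputs.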

Programming for Retiming

Retiming is not a trivial task. It is extremely time-consuming to do by hand, especially for large designs.

Handel-C design tools can perform retiming automatically if the code is written properly.

The compiler will add/remove/move flip-flops as necessary, but will not alter the timing of the design.

Therefore to use retiming, the design must be pipelined, or have extra pipelining stages built-in.

The compiler can then shift logic and flip-flops around without altering the timing of the design.

Programming Example

Example: x = a*b+c*d;

    unsigned 8 x[3]; //3 retiming stages

    interface bus_out() sumOut(unsigned 8 out = x[2])
        with {data = {"P2","P3","P4","P5","P6","P7","P8","P9"}};
    interface bus_clock_in(unsigned 8 in) input()
        with {data = {"P10","P11","P12","P13","P14","P15","P16","P17"}};

    void main()
    {
        unsigned 8 data[4];

        while(1)
        {
            par
            {
                //get the input and shift the previous inputs
                data[0] = input.in;
                data[1] = data[0];
                data[2] = data[1];
                data[3] = data[2];
                x[0] = data[0]*data[1] + data[2]*data[3];
                x[1] = x[0]; //extra stages
                x[2] = x[1];
            }
        }
    }

The output (x[2]) is the last of the retiming stages.

The calculation of x[0] is coded as it would be without retiming.

The result is shifted through the retiming registers x[1] and x[2].

FIR Example

One of the exercises at the training course was to code a nine-tap FIR filter that was pipelined and retimed automatically.

The filter performs nine multiplications of data and coefficients, followed by summation of the nine products: very deep logic.

A Xilinx Spartan™ 3 chip was targeted. The fmax results were recorded for various numbers of extra retiming stages.
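For reference, the arithmetic of a nine-tap FIR filter can be written in plain ANSI C as below. This is a behavioural sketch only, not the training-course solution: it computes one output sample per call with no pipelining or retiming, and the type name, function name and data widths are assumptions.

```c
#include <stdint.h>

#define FIR_TAPS 9

/* The last FIR_TAPS input samples, newest first. */
typedef struct {
    int16_t delay_line[FIR_TAPS];
} fir_state;

/* Shift in one new sample and return the sum of the nine
   coefficient-by-sample products. */
int32_t fir_step(fir_state *s, const int16_t coeff[FIR_TAPS], int16_t sample)
{
    int32_t acc = 0;
    int i;

    /* Shift the delay line by one position, oldest sample falls off. */
    for (i = FIR_TAPS - 1; i > 0; i--)
        s->delay_line[i] = s->delay_line[i - 1];
    s->delay_line[0] = sample;

    /* Nine multiplications followed by the summation of the products. */
    for (i = 0; i < FIR_TAPS; i++)
        acc += (int32_t)coeff[i] * s->delay_line[i];

    return acc;
}
```

In hardware the nine multipliers and the adder chain form one long combinational path, which is why the design benefits from extra retiming stages.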

[Chart: Fmax (MHz) versus number of retiming stages (1 to 6); frequency axis from 0 to 160 MHz.]

[Chart: Flip-flop usage before and after retiming versus number of retiming stages (1 to 6); flip-flop count axis from 0 to 1000; series: FF Before, FF After.]

Final Notes

There was not enough time to cover everything Handel-C has to offer, e.g. pointers and macro expressions.

There are ways to create parameterizable code. This allows the designer to easily vary the number of worker processes, or pipeline/retiming stages, for example.

More information is available at www.celoxica.com


Thank-You!

Questions?