NDA Confidential. Copyright ©2005, Nallatech.

Implementation of Floating-Point VSIPL Functions on FPGA-Based Reconfigurable Computers Using High-Level Languages

Dr. Malachy Devlin, Robin Bruce and Dr. Stephen Marshall

Institute for System Level Integration


Overview

» This presentation details the initial stages of a project whose aim is to provide developers with an FPGA-based implementation of the VSIPL API

» Research to date has centred on the following key areas of FPGA design:
    » Implementation of pipelined floating-point algorithms in FPGA fabric
    » Use of novel high-level language tools to realise HDL
    » Resource sharing between functional units

» The example application is a basic 4096-point floating-point pulse compression architecture
    » Composed of FFT/IFFT and complex multiply


VSIPL – The Vector, Signal and Image Processing API

» VSIPL is a collaboration between industrial, academic and governmental partners
    » VSIPL is meant as a standard API that enables write-once, run-anywhere possibilities for HPEC developers
    » It was developed as a C-based API targeting a single microprocessor
    » VSIPL implementations are based on a profile or subset of the full API
    » Implementations can have radically different underlying hardware and memory management systems and still allow applications to be ported between them

» This makes FPGA-implemented VSIPL possible


FPGAs & Floating Point

» VSIPL is best regarded as a floating-point arithmetic API, offering both single and double precision

» FPGA-based floating point has come of age
    » High gate density of the latest generation of FPGAs allows for the instantiation of multiple floating-point arithmetic modules
    » Up to 25 GFLOPS of peak floating-point performance is possible
    » Need for high dynamic range and precision
    » Fixed-point solutions are becoming difficult to justify in terms of design time needed and designer expertise required in the face of increasing resource density

» Floating-point algorithms can be programmed in C and compiled to HDL code, giving pipelined architectures


Challenges with FPGA VSIPL

» 1.) Finite size of hardware-implemented algorithms vs. conceptually boundless vectors.

» 2.) Runtime reconfiguration options offered by microprocessor VSIPL.

» 3.) Vast scope of VSIPL API vs. limited fabric resource.

» 4.) Communication bottleneck arising from software-hardware partition.

» 5.) Difficulty in leveraging parallelism inherent in expressions presented as discrete operations.


Architectural Possibilities

[Figure: three host/FPGA partitioning diagrams. 1.) Function stubs: Initialisation, Function 1-3 Stubs and Termination on one side, Function 1-3 on the other. 2.) Full application algorithm: Initialisation, Data Transfer, Data Transfer and Termination alongside Function 1-3. 3.) FPGA master: Data IO, Initialisation, Function 1-3, Termination, Data IO.]

1.) Function ‘stubs’: Each function selected for implementation on the fabric transfers data from the host to the FPGA before each operation of the function and brings it back again to the host before the next operation. This model (figure 4) is the simplest to implement, but excessive data transfer greatly hinders performance. The work presented here is based on this first model.

2.) Full Application Algorithm Implementation: Rather than utilising the FPGA to process functions separately and independently, we can utilise high-level compilers to implement the application’s complete algorithm in C, making calls to FPGA VSIPL C functions. This can significantly reduce the overhead of data communications and FPGA configuration, and allows cross-function optimisations.

3.) FPGA Master Implementation: Traditional systems are considered to be processor-centric; however, FPGAs now have enough capability to create FPGA-centric systems. In this approach the complete application can reside on the FPGA, including program initialisation and the VSIPL application algorithm. The FPGA VSIPL-based application can use the host computer as an I/O engine, or the FPGA can have direct I/O attachments.
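The cost difference between the function-stub model and the full-algorithm model can be illustrated with a small host-side estimate. This is only a sketch under simplifying assumptions: one data vector, a fixed per-transfer time and a fixed per-operation compute time. The function names and the timing model are hypothetical and are not part of VSIPL or the FUSE API.

```c
#include <stddef.h>

/* Hypothetical cost model (not from the source): every operation works on
 * one vector, each host<->FPGA transfer takes t_transfer microseconds and
 * each operation takes t_compute microseconds on the fabric. */

/* Function-stub model: each operation uploads its input and downloads its
 * result, so every operation pays for two transfers. */
double total_time_stub(size_t n_ops, double t_transfer, double t_compute)
{
    return (double)n_ops * (2.0 * t_transfer + t_compute);
}

/* Full-algorithm model: the vector is uploaded once, all operations run on
 * data that stays on the fabric, and the result is downloaded once. */
double total_time_fused(size_t n_ops, double t_transfer, double t_compute)
{
    if (n_ops == 0)
        return 0.0;
    return 2.0 * t_transfer + (double)n_ops * t_compute;
}
```

With three operations, a transfer time of 100 and a compute time of 10 (arbitrary illustrative units), the stub model gives 630 and the fused model 230, which is the kind of gap that motivates keeping data on the fabric.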


4096-Point Pulse Compressor

» Single unit can carry out three different operations
» Data stored in block RAMs to which the host has access
» Data stays in place on the fabric while different modes are executed
» Host loads data and asserts control signals to start each mode
» Complex multiply reuses resources from the FFT/IFFT, its presence adding only multiplexer logic

[Figure: Tri-mode Functional Unit. Mode 1 = FFT, Mode 2 = IFFT, Mode 3 = Complex Multiply. Ports: RealA/Result, ImagA/Result, RealB, ImagB, Roots_u1, Roots_u2, size, log2 size, mode, scale.]


VSIPL use of HW Functions

» The functional unit is accessed from the host via the FUSE API
    » The hardware functions can be accessed by C functions on the host
    » The 1D FFT, IFFT and complex multiply VSIPL functions can access the hardware in a manner which is invisible to the application programmer

» Pulse compression can be performed in one of two ways:
    » As discrete VSIPL function calls to the hardware
    » As a custom function designed to make most efficient use of the hardware, leaving the data on the fabric until all operations have been performed

» The second approach is taken here
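The custom-function approach can be pictured with a software stand-in for the tri-mode unit. The sketch below is an assumption-laden model, not the hardware design and not the FUSE or VSIPL API: the type `tri_mode_unit`, the functions `unit_run` and `pulse_compress`, the meaning given to `scale` (divide by the size) and the assumption that operand B already holds a frequency-domain reference are all hypothetical. The buffer names loosely follow the port names on the functional-unit slide, and the twiddle factors are supplied by the caller as a table of forward roots.

```c
#include <stddef.h>

typedef enum { MODE_FFT = 1, MODE_IFFT = 2, MODE_CMUL = 3 } unit_mode;

/* Hypothetical software model of the unit's buffers. */
typedef struct {
    float *real_a, *imag_a;          /* operand A, overwritten by the result */
    const float *real_b, *imag_b;    /* operand B, read only in MODE_CMUL    */
    const float *root_re, *root_im;  /* forward roots exp(-2*pi*i*k/size), k < size/2 */
    size_t size;                     /* vector length, a power of two        */
} tri_mode_unit;

/* Reorder A into bit-reversed index order, as radix-2 DIT requires. */
static void bit_reverse(float *re, float *im, size_t n)
{
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            float t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }
}

/* Shared butterfly datapath: FFT and IFFT differ only in the sign of the
 * imaginary part of the root, mirroring the resource sharing on the slide. */
static void transform(tri_mode_unit *u, int inverse)
{
    size_t n = u->size;
    float *re = u->real_a, *im = u->imag_a;
    bit_reverse(re, im, n);
    for (size_t half = 1; half < n; half <<= 1) {
        size_t stride = n / (2 * half);
        for (size_t b = 0; b < n / 2; b++) {
            size_t k = b % half;
            size_t i = (b / half) * 2 * half + k, j = i + half;
            float wr = u->root_re[k * stride];
            float wi = inverse ? -u->root_im[k * stride] : u->root_im[k * stride];
            float tr = wr * re[j] - wi * im[j];
            float ti = wr * im[j] + wi * re[j];
            re[j] = re[i] - tr; im[j] = im[i] - ti;
            re[i] += tr;        im[i] += ti;
        }
    }
}

/* Run one mode on data that stays in the unit's buffers. */
void unit_run(tri_mode_unit *u, unit_mode mode, int scale)
{
    size_t n = u->size;
    if (mode == MODE_CMUL) {
        for (size_t i = 0; i < n; i++) {
            float ar = u->real_a[i], ai = u->imag_a[i];
            u->real_a[i] = ar * u->real_b[i] - ai * u->imag_b[i];
            u->imag_a[i] = ar * u->imag_b[i] + ai * u->real_b[i];
        }
        return;
    }
    transform(u, mode == MODE_IFFT);
    if (scale)
        for (size_t i = 0; i < n; i++) {
            u->real_a[i] /= (float)n;
            u->imag_a[i] /= (float)n;
        }
}

/* Pulse compression as one custom function: three modes back to back,
 * with no host transfer between them. Assumes B holds the reference
 * already transformed to the frequency domain. */
void pulse_compress(tri_mode_unit *u)
{
    unit_run(u, MODE_FFT, 0);
    unit_run(u, MODE_CMUL, 0);
    unit_run(u, MODE_IFFT, 1);
}
```

In this model the host would fill the A and B buffers once, call `pulse_compress`, and read the result back from A, whereas the discrete-call alternative would move A across the host/fabric boundary around every one of the three steps.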


FFT Algorithm

» FFT algorithm was originally a radix-2 decimation-in-time with 3 nested loops
    » Only the innermost loop was pipelined
    » 1024-point pulse compression took 181026 clock cycles

» The two inner loops were combined into a single loop
    » With the inner loop pipelined, 1024-point pulse compression takes only 19140 clock cycles, a reduction by a factor of about 9.5

» With the algorithm maximally pipelined, the execution time increases linearly with the length of the data vectors
    » Contrasts with the exponential increases for the 3-loop case
    » The maximum size of pulse compression possible is limited only by the size of the memory structures used to store data. In this case we are limited to 4096-point, requiring 120 BlockRAMs
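One way to picture the loop merging is to compare the butterfly schedule produced by the three-loop nest with the schedule produced when the group and butterfly loops are flattened into one loop of size/2 iterations per stage. The sketch below is an illustration only; the index arithmetic is an assumption about how such a merge can be written, not the authors' source code, and the names `butterfly`, `schedule_nested` and `schedule_merged` are hypothetical.

```c
#include <stddef.h>

/* One butterfly: indices of the two operands and of the twiddle factor. */
typedef struct { size_t top, bottom, twiddle; } butterfly;

/* Three nested loops: stage -> group -> butterfly. Only the innermost loop
 * has a trip count that a pipeline can stream through, and that count is
 * as small as 1 in the first stage. Returns the number of butterflies. */
size_t schedule_nested(size_t n, butterfly *out)
{
    size_t count = 0;
    for (size_t half = 1; half < n; half <<= 1)                /* stage     */
        for (size_t start = 0; start < n; start += 2 * half)   /* group     */
            for (size_t k = 0; k < half; k++) {                /* butterfly */
                out[count].top = start + k;
                out[count].bottom = start + k + half;
                out[count].twiddle = k * (n / (2 * half));
                count++;
            }
    return count;
}

/* Two loops: the group and butterfly loops are merged into one loop that
 * always runs n/2 iterations, so a pipelined body stays full for a whole
 * stage. Group and position are recovered from the flat index b. */
size_t schedule_merged(size_t n, butterfly *out)
{
    size_t count = 0;
    for (size_t half = 1; half < n; half <<= 1)                /* stage */
        for (size_t b = 0; b < n / 2; b++) {                   /* flat  */
            size_t group = b / half;   /* a shift when half is 2^s */
            size_t k = b % half;       /* a mask when half is 2^s  */
            out[count].top = group * 2 * half + k;
            out[count].bottom = out[count].top + half;
            out[count].twiddle = k * (n / (2 * half));
            count++;
        }
    return count;
}
```

Both functions emit the same butterflies in the same order; the merged form simply removes the short inner trip counts that force a pipeline to drain and refill at every group boundary.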


Resource Consumption

Size    External IOBs  (LOC'd)  MULT18X18s  RAMB16s  SLICEs  BUFGMUXs  DCMs
1024    39             39       38          48       16070   2         2
2048    39             39       38          74       16068   2         2
4096    39             39       38          126      16294   2         2
8192    39             39       38          230      16600   2         2
16384   39             39       38          438      16900   2         2
32768   39             39       38          854      17200   2         2
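The RAMB16s column in the table follows a simple linear pattern: each doubling of the size doubles the increment (26, 52, 104, ...), which corresponds to 22 + 26 × (size / 1024). The helper below just encodes that fit. It is an observation drawn from the tabulated numbers, not a formula given by the authors, and the function name is hypothetical.

```c
/* Linear fit to the RAMB16s column of the table: 22 + 26 per 1024 points.
 * Derived from the tabulated values only; the hardware meaning of the two
 * constants is not stated in the source. */
unsigned ramb16_estimate(unsigned size)
{
    return 22u + 26u * (size / 1024u);
}
```

For the tabulated sizes this reproduces the listed counts, for example 48 for 1024 points and 854 for 32768 points.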

[Figure: No. of BlockRAM versus Size (0 to 35000), y-axis 0 to 900; series: Actual RAMB16s, Projected RAMB16s.]

[Figure: Slices versus Size (0 to 35000), y-axis 16000 to 17400; series: Actual Slices, Projected Slices.]

» Size of FFT limited only by available blockRAM
    » Using a different approach to memory access could allow for practically limitless FFTs, complex multiplies and pulse compression

» Implemented on a Virtex II 6000 FPGA on a Nallatech BenNUEY PCI card


Pipelined Hardware vs. PC Host

» Host-to-fabric communication delays dominate total calculation time
    » Illustrates need to limit host-to-fabric communication

» Without pipelining to limit it, software calculation time increases exponentially with vector size
    » Shows benefits of custom hardware to implement functions

[Figure: Time Comparisons between Pulse Compression Sizes. x-axis: Size (64, 128, 256, 512, 1024, 2048, 4096); y-axis: Time (µs), 0 to 12000; series: HW Only, HW & Host, Host Only.]

» Design is clocked at 120 MHz

» HW Only: data transfer time not included

» HW & Host: HW time + data transfer

» Host Only: software-only pulse compression
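Cycle counts convert to time by dividing by the clock frequency: cycles / MHz gives microseconds. The helper below is a minimal sketch of that arithmetic; applying it to the 1024-point cycle counts from the FFT Algorithm slide assumes those counts were taken on the same 120 MHz design, which the slides do not state explicitly. The function name is hypothetical.

```c
/* Convert a clock-cycle count to microseconds for a given clock in MHz. */
double cycles_to_us(unsigned long cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}
```

Under that assumption, 19140 cycles at 120 MHz corresponds to 159.5 µs, and 181026 cycles to roughly 1508.6 µs.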


Conclusion & Results

» Multiple VSIPL functions can be implemented in a single unit that capitalises on pipelining possibilities and uses resource sharing to minimise hardware requirements

» Rapid development of other VSIPL functions is possible

» Scalable architectures can tackle the problem of large vector sizes

» Floating-point algorithms can be rapidly developed and efficiently implemented

» 4096-point single-precision floating-point pulse compression in 641.5 µs
