Upload
ashley-smith
View
225
Download
4
Tags:
Embed Size (px)
Citation preview
NDA Confidential. Copyright ©2005, Nallatech.1
Implementation of Floating-Point VSIPL Functions on FPGA-
Based Reconfigurable Computers Using High-
Level Languages
Dr. Malachy Devlin, Robin Bruce and Dr. Stephen Marshall
In s titu te S y s te m L e v e lIn teg ra tio n
fo r
NDA Confidential. Copyright ©2005, Nallatech.2
Overview
» This presentation details the initial stages of a project whose aim is to provide developers with an FPGA-implemented implementation of the VSIPL API
» Implementation of research to date has centred around the following key areas of FPGA design:» Implementation of pipelined floating-point algorithms in FPGA
fabric» Use of novel high-level language tools to realise HDL» Resource-sharing between functional units
» Example application given is that of a basic 4096 point floating point Pulse Compression architecture» Composed of FFT/IFFT and complex multiply
NDA Confidential. Copyright ©2005, Nallatech.3
VSIPL – The Vector, Signal and Image Processing API
» VSIPL is a collaboration between industrial, academic and governmental partners» VSIPL is meant as a standard API that enables write once,
run anywhere possibilities to HPEC developers.» It was developed as a C-based API targeting a single
microprocessor» VSIPL implementations are based on a profile or subset of
the full API» Implementations can have radically different underlying
hardware and memory management systems and still allow applications to be ported between them
» This makes FPGA-implemented VSIPL possible
NDA Confidential. Copyright ©2005, Nallatech.4
FPGAs & Floating Point
» VSIPL is best regarded as floating-point arithmetic API, offering both single and double precision
» FPGA-based floating point has come of age» High gate density of latest generation of FPGAs allows for the
instantiation of multiple floating-point arithmetic modules» Up to 25GFlops/s of peak floating-point performance is possible» Need for high dynamic range and precision» Fixed-point solutions are becoming difficult to justify in terms of
design time needed and designer expertise required in the face of increasing resource density
» Floating point algorithms can be programmed in C and compiled to HDL code giving pipelined architectures
NDA Confidential. Copyright ©2005, Nallatech.5
Challenges with FPGA VSIPL
» 1.) Finite size of hardware-implemented algorithms vs. conceptually boundless vectors.
» 2.) Runtime reconfiguration options offered by microprocessor VSIPL.
» 3.) Vast scope of VSIPL API vs. limited fabric resource.
» 4.) Communication bottleneck arising from software-hardware partition.
» 5.) Difficulty in leveraging parallelism inherent in expressions presented as discrete operations.
NDA Confidential. Copyright ©2005, Nallatech.6
Architectural Possibilities
Initialisation
Function 1 Stub
Function 2 Stub
Function 3 Stub
Termination
Function 1
Function 2
Function 3
Initialisation
Data Transfer
Data Transfer
Termination
Function 1
Function 2
Function 3
Data IO
Data IO
Initialisation
Function 1
Function 2
Function 3
Termination
3.) FPGA Master Implementation: Traditional systems are considered to be processor centric, however FPGAs now have enough capability to create FPGA centric systems. In this approach the complete application can reside on the FPGA, including program initialisation and VSIPL application algorithm. The FPGA VSIPL-based application can use the host computer as a I/O engine, or the FPGA can have direct I/O attachments.
1.) Function ‘stubs’: Each function selected for implementation on the fabric transfers data from the host to the FPGA before each operation of the function and brings it back again to the host before the next operation. This model (figure 4) is the simplest to implement, but excessive data transfer greatly hinders performance. The work presented here is based on this first model.
2.) Full Application Algorithm Implementation: Rather than utilising the FPGA to process functions separately and independently, we can utilise high level compilers to implement the application’s complete algorithm in C making calls to FPGA VSIPL C Functions. This can significantly reduce the overhead of data communications, FPGA configuration and cross function optimizations.
NDA Confidential. Copyright ©2005, Nallatech.7
4096-Point Pulse Compressor
» Single unit can carry out three different operations» Data stored in block RAMs to which host has access» Data stays in place on fabric while different modes are executed » Host loads data and asserts control signals to start each mode» Complex Multiply reuses resources from the FFT/IFFT, its
presence adding only multiplexer logic
Tri-mode Functional UnitMode 1 = FFTMode 2 = IFFT
Mode 3 = Complex Multiply
RealA/Result
ImagA/Result
RealB ImagB Roots_u1 Roots_u2
sizelog2 size
mode
scale
NDA Confidential. Copyright ©2005, Nallatech.8
VSIPL use of HW Functions
» The functional unit is accessed from the host via the FUSE API» The hardware functions can be accessed by C functions on the
host» The 1D FFT, IFFT and complex multiply VSIPL functions can
access the hardware in a manner which is invisible to the application programmer
» Pulse compression can be performed in one of two ways:» As discrete VSIPL function calls to the hardware» As a custom function designed to make most efficient use of
the hardware, leaving the data on the fabric until all operations have been performed
» The second approach is taken here
NDA Confidential. Copyright ©2005, Nallatech.9
FFT Algorithm
» FFT Algorithm was originally a radix-2 decimation-in-time with 3 nested loops» Only innermost loop was pipelined» 1024 point pulse compression took 181026 clock cycles
» The two inner loops were combined into a single loop» With inner loop pipelined 1024 point pulse compression takes only
19140 clock cycles, a reduction by a factor of 9.5x
» With the algorithm maximally pipelined, the execution time increases linearly with the data vectors» Contrasts with the exponential increases for the 3-loop case» The maximum size of pulse compression possible is limited only by
the size of the memory structures used to store data. In this case we are limited to 4096-point, requiring 120 BlockRAMs
NDA Confidential. Copyright ©2005, Nallatech.10
Resource Consumption
Size External IOBs(LOC'd) MULT18X18sRAMB16s SLICESs BUFGMUXsDCMs
1024 39 39 38 48 16070 2 22048 39 39 38 74 16068 2 24096 39 39 38 126 16294 2 28192 39 39 38 230 16600 2 2
16384 39 39 38 438 16900 2 232768 39 39 38 854 17200 2 2 0
100
200
300
400
500
600
700
800
900
0 5000 10000 15000 20000 25000 30000 35000
Size
No. of B
lock
RA
M
Actual RAMB16s
Projected RAMB16s
16000
16200
16400
16600
16800
17000
17200
17400
0 5000 10000 15000 20000 25000 30000 35000
Size
Slice
s
Actual Slices
Projected Slices
» Size of FFT limited only by available blockRAM» Using a different approach to
memory access could allow for practically limitless FFTs, complex multiplies and pulse compression.
» Implemented on a Virtex II 6000 FPGA on Nallatech BenNUEY PCI Card
NDA Confidential. Copyright ©2005, Nallatech.11
Pipelined Hardware vs. PC Host
» Host-to-fabric communication delays dominate total calculation time. » Illustrates need to limit host-to-fabric communication
» Without pipelining to limit it, software calculation time increases exponentially with vector size» Shows benefits of custom hardware to implement functions
Time Comparisons between Pulse Compression Sizes
0
2000
4000
6000
8000
10000
12000
64 128 256 512 1024 2048 4096
Size
Tim
e (us)
HW Only
HW & Host
Host Only
» Design is clocked at 120MHz
» HW Only:» Data transfer time not included
» HW&Host:» HW time + data transfer
» Host Only:» Software only pulse compression
NDA Confidential. Copyright ©2005, Nallatech.12
Conclusion & Results
» Multiple VSIPL functions can be implemented in a single unit that capitalises on pipelining possibilities and uses resource sharing to minimise hardware requirements
» Rapid development of other VSIPL functions is possible
» Scalable architectures can tackle the problem of large vector sizes
» Floating-point algorithms can be rapidly developed and efficiently implemented
» 4096-point single-precision floating-point pulse compression in 641.5us