NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Thesis Title: Phase Unwrapping On Reconfigurable Hardware
Author: Sherman Braganza
Department: Electrical and Computer Engineering
Approved for Thesis Requirements of the Master of Science Degree
Thesis Advisor: Prof. Miriam Leeser Date
Thesis Reader: Prof. Charles DiMarzio Date
Thesis Reader: Prof. Laurie King Date
Department Chair: Prof. Ali Abur Date
Graduate School Notified of Acceptance:
Dean: Prof. Yaman Yener Date
Copy Deposited in Library:
Reference Librarian Date
Phase Unwrapping on Reconfigurable Hardware
A Thesis Presented
by
Sherman Braganza
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements for the degree of
Master of Science
in
Electrical Engineering
in the field of
Computer Engineering
Northeastern University
Boston, Massachusetts
December 2007
© Copyright 2007 by Sherman Braganza
All Rights Reserved
Acknowledgement
Very Very soon.
Abstract
Coming soon!
Contents
1 Introduction 13
2 Background 14
2.1 Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 NVIDIA GPUs - G80 . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 The Keck Fusion Microscope . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Phase Unwrapping - Algorithms and Selection . . . . . . . . . . . . . 22
2.3.1 Path Following Algorithms . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Minimum Norm Algorithms . . . . . . . . . . . . . . . . . . . 29
2.3.3 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . 36
2.4 Bitwidth Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.2 GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 Implementation 45
3.1 Timing Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 The Discrete Cosine Transform . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 The One Dimensional Implementation . . . . . . . . . . . . . . . . . 49
3.3.1 Shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.2 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . 51
3.3.3 Rebuild rotate . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.4 Dynamic Scaling . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.5 Data Type Conversion . . . . . . . . . . . . . . . . . . . . . . 56
3.4 The Two-Dimensional Extension . . . . . . . . . . . . . . . . . . . . 56
3.4.1 The SRAM Controller . . . . . . . . . . . . . . . . . . . . . . 57
3.4.2 Calculating The Transpose . . . . . . . . . . . . . . . . . . . . 57
3.5 Division And Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 GPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6.1 FPGA equivalent . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6.2 Full implementation . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 Data transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7.1 Programmed IO . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7.2 Direct Memory Access (DMA) . . . . . . . . . . . . . . . . . . 59
4 Results 60
4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.1 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.2 Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.1 Experiment Parameters . . . . . . . . . . . . . . . . . . . . . 60
4.2.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Conclusion and Future Work 61
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures
2.1 Virtex II Pro - Architecture . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 NVIDIA G80 core - Architecture . . . . . . . . . . . . . . . . . . . . 17
2.3 Optical Quadrature Microscopy setup . . . . . . . . . . . . . . . . . . 22
2.4 A raster unwrap using Matlab's 'unwrap' routine . . . . . . . . . . . 23
2.5 Goldstein's algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Quality Mapped algorithm . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Mask Cut algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Flynn's algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 Preconditioned Conjugate Gradient Algorithm pseudo-code . . . . . . 32
2.10 Preconditioned Conjugate gradient algorithm results . . . . . . . . . 33
2.11 Minimum LP Norm Algorithm pseudo-code . . . . . . . . . . . . . . 34
2.12 The Minimum LP Norm algorithm . . . . . . . . . . . . . . . . . . . 35
2.13 The Multi-grid algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.14 Image produced by a double-precision unwrap . . . . . . . . . . . . . 38
2.15 Using a bitwidth of 27 . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.16 Using a bitwidth of 28 . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.17 Using a bitwidth of 29 . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1 Even extension around n=-0.5 and n=N-0.5 . . . . . . . . . . . . . . 47
3.2 Components and dataflow for the forward DCT transform . . . . . . 50
3.3 Components and dataflow for the inverse DCT transform . . . . . . . 51
3.4 Re-ordering pattern in a forward shuffle . . . . . . . . . . . . . . . . . 52
3.5 The rebuild component - Forward Transform . . . . . . . . . . . . . . 54
3.6 The 1D transform including dynamic scaling . . . . . . . . . . . . . . 55
3.7 Implementation of the floating point divide and scale logic . . . . . . 59
List of Tables
Chapter 1
Introduction
Chapter 2
Background
2.1 Platforms
The concept of using an external accelerator for application speedup is not a new
one. In the early days of PCs, it was common to have an empty socket for a math
coprocessor [15] that could be used to accelerate floating point computations. More
recently, platforms such as the Cell Broadband Engine have been used as accelerators
in petaflop supercomputers [6] such as the Roadrunner machine in order to reach new
levels of computing performance. FPGAs have also been used in machines such as
the Cray XD1 [7] and GPUs in systems like the Bull Novascale supercomputer [28].
Such systems are general accelerators in the sense that they apply to a wide range
of application domains. More specialized accelerators, such as Ageia's PhysX physics
accelerator [29], target only a restricted domain.
In the work presented in this master's thesis, a phase-unwrapping algorithm has
been implemented on three separate platforms: Field Programmable Gate Arrays
(FPGAs), Graphics Processing Units (GPUs) and general-purpose processors.
2.1.1 FPGAs
Implementation of application-specific functionality in hardware can be performed
on either Application Specific Integrated Circuits (ASICs) or FPGAs. ASIC implementation
is expensive in terms of initial cost and is thus reserved for high-volume
production runs; ASICs cannot be reprogrammed and are usually mass produced.
Gate arrays, on the other hand, are programmable and exist in both volatile and
non-volatile flavors. Non-volatile types, such as those that are mask-programmed,
are likewise not feasible for prototypes or low-volume production. Reprogrammable
gate arrays are thus ideal for prototyping applications.
SRAM-based programmable FPGAs (which are the target for this implementa-
tion) use look-up tables (LUTs) to implement combinational logic. These LUTs,
along with other primitives such as registers, multipliers and memory, are arranged
in a regular pattern in hardware with each component being connected by a pro-
grammable interconnect. This is shown in Figure 2.1. This architecture allows for
the implementation of arbitrary functionality, limited only by available area, while
exploiting hardware-based optimizations such as parallelization and pipelining. Specifically,
we use a Virtex II Pro, which has hardware 18-bit multipliers, on-chip RAM
elements called BlockRAMs, and two embedded PowerPCs that remain unused in
this implementation.
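To make the LUT idea concrete, here is a minimal software model (an illustrative Python sketch, not vendor tooling): an n-input LUT is simply a 2^n-entry truth table that is "configured" once and then evaluated by lookup alone.

```python
def make_lut4(fn):
    """Model a 4-input LUT: 'configure' it by precomputing fn's truth
    table into 16 stored bits, then evaluate by table lookup only."""
    table = [fn((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
             for i in range(16)]

    def lut(a, b, c, d):
        # Hardware does exactly this: the inputs select one stored bit.
        return table[(a << 3) | (b << 2) | (c << 1) | d]

    return lut

# Any 4-input combinational function fits in one LUT,
# e.g. a full adder's sum bit (fourth input unused here):
sum_bit = make_lut4(lambda a, b, cin, unused: a ^ b ^ cin)
```

Because evaluation is a single table lookup regardless of the function, arbitrary combinational logic maps onto a regular fabric of identical LUTs.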
Figure 2.1: Virtex II Pro - Architecture
2.1.2 NVIDIA GPUs - G80
The introduction of DirectX 10 necessitated the development of a new generation of
Graphics Processing Units (GPUs) that broke the old graphics pipeline composed
of specific shader and fragment units in favor of unified computation units capable
of handling either. This also allowed GPUs to effectively enter the field of High
Performance Computing (HPC).
Hardware
NVIDIA’s first foray into DX10 compatible hardware was the G80 architecture de-
picted in Figure 2.2. It consists of 128 processing elements, each capable of operating
on a separate datum in parallel with the others with full single precision floating
point. Groups of eight elements form a multiprocessor, with each multiprocessor
having its own shared memory. This shared memory is also used by threads
on the same multiprocessor to share data. Each multiprocessor also has its own set
of registers, a constant cache and a texture cache. These different memory types are
all optimized for different access types. There is also hardware thread control that
enables rapid thread switching and thus the hardware is optimized for dealing with
thousands of threads in parallel. This approach allows for the rapid computation of
massively parallel algorithms that exhibit high arithmetic intensity.
Figure 2.2: NVIDIA G80 core - Architecture
Application Programming Interface (API)
The paradigm around which General Purpose GPUs are based is the kernel-stream
concept [4] [27] [24]. This approach maps very well to highly parallel applications.
Here, a kernel running on multiple processors operates on separate data independently.
For example, think of a matrix multiplied by a scalar. Each data point in
the matrix can be read, multiplied and written back independently.
Stream processors are usually arranged in SIMD-type arrays. In their API, called
the Compute Unified Device Architecture (CUDA), NVIDIA espoused a similar
method, which they labeled Single Instruction Multiple Thread (SIMT); the difference
is that SIMT instructions do not contain information about the width of the
processor array. For the above example of scalar-matrix multiplication, this would
mean that each data point is operated on by its own thread.
The CUDA API operates on a large number of threads, broken up into warps,
blocks and grids. A warp consists of 32 threads that are managed together on a
multiprocessor. A thread block is a larger group of threads that only executes on a
single multiprocessor and it typically consists of multiple warps. A grid is a collection
of blocks and it operates on many multiprocessors.
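The indexing arithmetic behind this hierarchy can be illustrated with a small host-side sketch (the names follow CUDA's blockIdx/blockDim/threadIdx convention, but this is only an illustrative Python model, not the CUDA API):

```python
import math

def launch_config(width, height, block_dim=(16, 16)):
    """Grid dimensions (in blocks) needed to cover a width x height image."""
    bx, by = block_dim
    return (math.ceil(width / bx), math.ceil(height / by))

def global_index(block_idx, block_dim, thread_idx):
    """The pixel a given thread owns: blockIdx * blockDim + threadIdx."""
    return (block_idx[0] * block_dim[0] + thread_idx[0],
            block_idx[1] * block_dim[1] + thread_idx[1])
```

With 16 x 16 = 256 threads per block (eight 32-thread warps), a 512 x 512 image is covered by a 32 x 32 grid of blocks, and each thread computes its own pixel coordinate from its block and thread indices.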
The API also exposes two levels: a higher, abstract level and the driver API.
The two levels operate in a mutually exclusive manner. The higher level provides
simpler abstractions, whereas the driver API gives the programmer access to
lower-level aspects of the GPU. All implementations discussed in this thesis use
the high-level API.
2.2 The Keck Fusion Microscope
2.2.1 Modalities
The Keck microscope utilizes multiple modalities in order to generate a complete
fused image of the target. The modalities are summarized below with a full descrip-
tion available in [17].
Differential Interference Contrast (DIC)
In DIC, two waves propagate through a phase object with a sub-pixel displacement
created by a beam-splitting prism. The phase object must be transparent, with low
amounts of scattering and absorbance (typical of a live cell). The waves are delayed
by different amounts if the optical path length through the specimen, at the focal
region, varies in the shear direction. The two waves are later combined to create a
differential interference. Thus, the source of contrast in a DIC microscope is the phase
gradient of the object, transverse to the optical beam, measured by the interference
of the two beams.
Reflectance Confocal Microscopy (RCM)
The advent of the laser scanning confocal microscope created a means to record
digitally the image created in a confocal microscope. The RCM detects light that is
backscattered into the objective lens. Like other confocal microscopes, only the parts
of the cell that are in ”focus” are detected. The light passes through a beamsplitter,
to another lens that focuses the light onto a specimen. The backscattered light then
retraces its path, is re-collimated, and then reflects off the beamsplitter. The light is
then focused onto a pinhole that is in the same focal plane as the image. Backscattered
light from the in-focus plane passes through the pinhole, while the out-of-focus light
is rejected. This yields an image formed only from in-focus backscattered light.
Laser Scanning Confocal Microscopy (LSCM)
LSCM uses specific laser illumination to excite an electron of a fluorophore from its
ground state to a metastable state. When the electron of the fluorophore relaxes
from the metastable state to the ground state, a new photon with equal energy to
the difference in energy level between the metastable state and the ground state is
given off. This light is collected in the same manner as in RCM.
Two-photon Microscopy (TPLSM)
As with LSCM, TPLSM uses the principle that an electron excited to the metastable
state will release an excited photon when relaxing to the ground state. The differ-
ence between TPLSM and LSCM is that the fluorophore simultaneously absorbs two
photons to reach the excited state. To illustrate, suppose that a fluorophore needs
one blue photon, with an energy equivalent to "1", to be excited to its metastable
state (and then release a green photon). Two-photon microscopy uses the principle
that two red photons, with energy equivalents of "0.5" each, can simultaneously excite
the fluorophore to its metastable state (and then release the same green photon
when relaxing). To observe this effect, a large amount of intense near-infrared light is
necessary, and this is achieved with a high-power titanium-sapphire laser.
Optical Quadrature Microscopy (OQM)
Optical quadrature microscopy is a detection technique for measuring phase and
amplitude changes to a sinusoidal signal. A signal from a HeNe laser is split into
two components, reference and unknown. The unknown signal passes through the
sample. The known reference signal is split, with one component being phase shifted
by 90 degrees. The unknown signal is then mixed separately with both components
of the known reference signal. The merged signal consisting of unknown and non-
phase shifted reference is referred to as the I channel, or the in-phase signal, while the
unknown signal mixed with the 90 degree phase-shifted reference signal is referred to
as the Q channel, or the quadrature signal. By interpreting the I and Q signals as
real and imaginary values of a complex number, it is possible to find the amplitude
and phase of the unknown signal. A diagram of the setup is shown in Figure 2.3.
These concepts of quadrature detection are applied to microscopy to create the
OQM mode of the Keck 3DFM. Since coherent (HeNe laser) detection provides an
effective gain of |E_ref||E_sig|, low levels of light can be used for illumination,
minimizing cell exposure/damage.
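The I/Q-to-phase step described above is a direct complex-number conversion; a minimal sketch (an illustrative helper, not the actual Keck processing code):

```python
import numpy as np

def iq_to_amplitude_phase(I, Q):
    """Interpret the in-phase (I) and quadrature (Q) channels as the real
    and imaginary parts of a complex signal; return amplitude and phase."""
    z = np.asarray(I) + 1j * np.asarray(Q)
    return np.abs(z), np.angle(z)   # phase comes back wrapped into (-pi, pi]
```

Note that the recovered phase is wrapped modulo 2*pi, which is precisely what motivates the phase unwrapping problem discussed in the next section.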
OQM forms the motivation behind the phase unwrapping acceleration project. A
software implementation running in Matlab takes nearly two minutes to unwrap a
single frame. Speeding up the processing would make this modality much more useful
for processing large stacks of images.
Figure 2.3: Optical Quadrature Microscopy setup
2.3 Phase Unwrapping - Algorithms and Selection
As a preliminary step to the implementation of a phase-unwrapping algorithm on
hardware (in this case a Field Programmable Gate Array or FPGA), it is necessary
to verify the choice of phase unwrapping algorithm used. This section focuses on
analyzing the properties and the tradeoffs between the different algorithms described
by Ghiglia and Pritt [10] where more complete descriptions can be found.
The idea behind most phase-unwrapping algorithms is that the correct unwrapped
phase varies slowly, such that the gradient between pixels is less than a half-cycle, or π
radians. If this assumption holds true, a wrapped signal may be unwrapped by simply
summing (integrating in the continuous domain) until a gradient whose absolute value
exceeds π is reached, at which point an integer multiple of 2π is added to the phase
and the summation continues. If the data is noisy, however, phase gradients
greater than π can lead to image corruption over large segments of the data. Lower
levels of noise (i.e., below π) also lead to an accumulation of error that eventually
results in large deviations near the end of the accumulation. Residues (discussed
further on in this chapter) also contribute to incorrect unwraps. An example with
an embryo and the Matlab unwrap routine is shown below in Figure 2.4. To solve
this problem, various phase unwrapping algorithms have been developed, each with
differing tradeoffs in terms of quality and performance.
Figure 2.4: A raster unwrap using Matlab's 'unwrap' routine
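In one dimension, this naive integration (the idea behind a raster unwrap) can be sketched in a few lines; this is a minimal illustration, not the thesis implementation:

```python
import numpy as np

def unwrap_1d(psi):
    """Itoh's method: wrap each neighboring difference into [-pi, pi) and
    integrate. Correct only while the true gradient stays below pi."""
    d = np.diff(psi)
    d = (d + np.pi) % (2 * np.pi) - np.pi   # wrapped phase differences
    return psi[0] + np.concatenate(([0.0], np.cumsum(d)))
```

On clean data whose true gradient is below pi everywhere, this recovers the original signal exactly; a single noisy gradient above pi corrupts every subsequent sample, which is the failure mode described above.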
2.3.1 Path Following Algorithms
Path following algorithms solve the noisy data problem by selecting the path over
which to integrate. Goldstein's algorithm, one of the most common path-following
algorithms, operates by identifying residues (points where the integral over a closed
four-pixel loop is non-zero) and connecting them via branch cuts: paths which the
integration path may not cross.
One problem with Goldstein's algorithm is that it does not utilize all the data
available to guide the generation of branch cuts. By generating a map indicating the
quality of the data over the image, it is possible to unwrap instances that cannot
be handled by Goldstein's method. These quality maps may be user-supplied or
automatically generated using the pseudo-correlation, the variance of the phase
derivatives or the maximum phase gradient.
Quality maps may be combined with branch cuts, as in Goldstein's algorithm, to
form a hybrid mask-cut method, where the quality map is used to guide the placement
of branch cuts. Another approach, proposed by Flynn, detects discontinuities,
joins them into loops and adds the correct multiple of 2π if the action removes
more discontinuities than it adds. Flynn's minimum discontinuity solution can also
be used with a quality map to generate higher quality solutions.
Goldstein's Algorithm
The simplest algorithm in terms of computational complexity is Goldstein’s Branch
Cut Algorithm. It operates in the following way:
Step 1. Identify residues: This step is accomplished by integrating around a four-pixel
loop starting at pixel p0. If the sum is 2π then p0 is marked as a
positive residue, and if the sum is −2π then it is marked as a negative residue.
Step 2. Create Branch Cuts: This step operates by connecting residues together
by branch cuts until the sum of residue charges is zero.
Step 3. Integrate: Integration of the image is performed using a Breadth First
Search (BFS) exploration of all the pixels in the image. As each pixel is en-
countered, it is unwrapped, unless it lies on a branch cut. Next, segments of
the image that may have been isolated by branch cuts are unwrapped similarly.
Pixels lying on branch cuts are unwrapped separately at the end.
As can be inferred from the steps above, Goldstein's algorithm operates in O(N^2)
time while consuming O(N^2) space, where the input image is N × N.
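Residue detection (Step 1) can be sketched in a few lines of NumPy: sum the wrapped differences around every 2 x 2 pixel loop and divide by 2*pi. This is an illustrative sketch of the idea, not the thesis code:

```python
import numpy as np

def wrap(d):
    """Wrap a phase difference into [-pi, pi)."""
    return (d + np.pi) % (2 * np.pi) - np.pi

def residues(psi):
    """Integrate wrapped differences around every closed four-pixel loop.
    Returns +1 for positive residues, -1 for negative, 0 elsewhere."""
    d1 = wrap(psi[:-1, 1:] - psi[:-1, :-1])   # top edge
    d2 = wrap(psi[1:, 1:] - psi[:-1, 1:])     # right edge
    d3 = wrap(psi[1:, :-1] - psi[1:, 1:])     # bottom edge
    d4 = wrap(psi[:-1, :-1] - psi[1:, :-1])   # left edge
    return np.rint((d1 + d2 + d3 + d4) / (2 * np.pi)).astype(int)
```

A spiral (vortex) phase field produces exactly one residue at its center, which is the classic test case for this routine.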
The result of a phase unwrapping using Goldstein’s algorithm is shown in Figure
2.5. The problems with this method are immediately apparent. The segments of
the image that are of interest are mostly still wrapped with the amplitudes being
incorrect by a large margin.
Quality Maps
Quality maps are based on the concept of a user-supplied or auto-generated array that
defines the goodness of each phase value. These can be used to guide the unwrapping
since corrupted phase and residues usually have low quality values.
Unwrapping using quality maps works by first taking as an input the phase array
and the quality map. The quality map is either user input or based on the variance of
phase derivatives or the maximum phase gradient. The unwrapping is then performed
in a similar fashion as the Goldstein algorithm’s BFS exploration, except that the
adjoin list is not explored in FIFO order but according to quality. This leaves the
low-quality pixels to be unwrapped at the very end, thus effectively doing the same
thing as Goldstein's algorithm.

Figure 2.5: Goldstein's algorithm
The results of the quality mapped unwrapping are shown in Figure 2.6. There
are significant failures noticeable in the center of the lower embryo as well as around
the edges of both.
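The quality-driven ordering described above can be sketched with a priority queue, where the adjoin list is a heap keyed on negated quality so the best pixel is always unwrapped next (an illustrative sketch, not the thesis code):

```python
import heapq
import numpy as np

def quality_guided_unwrap(psi, quality):
    """Flood-fill unwrap that always grows from the highest-quality pixel
    on the adjoin list, so low-quality pixels are unwrapped last."""
    M, N = psi.shape
    phi = psi.astype(float).copy()
    done = np.zeros((M, N), dtype=bool)
    seed = np.unravel_index(np.argmax(quality), quality.shape)
    done[seed] = True
    heap = []  # the adjoin list, keyed on negated quality (max-heap)

    def push_neighbors(i, j):
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < M and 0 <= nj < N and not done[ni, nj]:
                heapq.heappush(heap, (-quality[ni, nj], ni, nj, i, j))

    push_neighbors(*seed)
    while heap:
        _, i, j, pi, pj = heapq.heappop(heap)
        if done[i, j]:
            continue
        # Unwrap relative to the already-unwrapped neighbor (pi, pj).
        k = np.round((psi[i, j] - phi[pi, pj]) / (2 * np.pi))
        phi[i, j] = psi[i, j] - 2 * np.pi * k
        done[i, j] = True
        push_neighbors(i, j)
    return phi
```

With a uniform quality map this degrades to an ordinary flood fill; with a meaningful map, corrupted pixels are deferred so their errors cannot propagate into good regions.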
Mask Cut Algorithm
The Quality Map method does not explicitly use all the information available (i.e.
residues). A hybrid approach called the Mask Cut Algorithm also exists. It uses
both quality maps and residues to guide the placement of branch cuts. It operates
as follows:
Figure 2.6: Quality Mapped algorithm
Step 1. Identify residues: This is performed as described in Section 2.3.1.
Step 2. Create mask cuts: This operates on the lowest quality pixels in the
image, gradually growing outwards using a BFS exploration technique and
marking the low quality pixels as being part of a mask cut once a residue is
encountered. The mask cut continues growing until the charge is balanced.
Step 3. Thin the mask cuts: Mask cuts tend to be thicker than branch cuts and
need to be thinned before integration. This step clears the mask on all mask
pixels that do not lie next to a residue and can be safely removed without
changing mask connectivity.
Step 4. Integrate: This is performed as described in Section 2.3.1.
For our datasets, the mask cut algorithm performs poorly, as can be seen in
Figure 2.7. The diagonal flecking and inconsistent phase changes render this technique
unusable for the OQM microscope.
Figure 2.7: Mask Cut algorithm
Flynn’s Algorithm
One method of phase-unwrapping is to segment the image along lines of discontinuity
into regions where each region has the same integer multiple of 2π associated with
it. This approach fails in the presence of high noise values or residues. Flynn’s
algorithm only segments regions along lines of discontinuity if the process of doing so
and adding the appropriate 2π multiple removes more discontinuities than it adds.
The algorithm works as follows:
Step 1. Compute jump counts: Here horizontal and vertical jump counts are
computed. A jump count is the integer associated with a region's 2π multiple.
Step 2. Scan nodes: Go over the set of nodes, adding edges and removing loops.
Terminate when no more changes are made.
Step 3. Compute unwrapped solution: The wrap counts are added to the
input phase data to get the final unwrapped solution.
There are various other optimizations performed on the image, such as the integration
of quality data, that are not described here. From Figure 2.8 it can be seen
that Flynn's algorithm provides a high-quality solution, the best amongst the path-following
algorithms discussed.
2.3.2 Minimum Norm Algorithms
These methods of phase-unwrapping seek to generate an unwrapped phase whose local
phase derivatives match the measured derivatives as closely as possible. This comparison
can be defined as the difference between the two, raised to some power p.
The simplest case is the unweighted least squares method, where p = 2. This family
of methods uses Fourier or DCT techniques to solve the least squares problem but
is vulnerable to noise. The pre-conditioned conjugate gradient (PCG) technique
overcomes this problem by using quality maps to zero-weight noisy regions so that
the unwrapped solution is not corrupted. There also exists a weighted multi-grid
algorithm that uses a combination of fine and coarse grids to converge on a solution.

Figure 2.8: Flynn's algorithm
Finally, the Minimum LP Norm algorithm solves the phase unwrapping problem for
p = 0. In this situation, the algorithm minimizes the number of discontinuities in the
unwrapped solution without concern for the magnitude of these discontinuities. This
value of p generally produces the best solution. This algorithm can be used with or
without user-supplied weights and for p < 2, it also generates its own data-dependent
weights. It iterates the PCG algorithm, which in turn iterates the DCT algorithm.
This results in the Minimum LP Norm algorithm having among the highest costs of
all the algorithms in terms of runtime and memory.
Preconditioned Conjugate Gradient
The Preconditioned Conjugate Gradient (PCG) algorithm iterates the unweighted
least squares algorithm in order to perform a weighted phase unwrap. This un-
weighted least squares technique minimizes the difference between the discrete par-
tial derivatives of the wrapped phase data and the discrete partial derivatives of the
unwrapped solution. The solution φ_{i,j} that minimizes
\[
\varepsilon^2 = \sum_{i=0}^{M-2}\sum_{j=0}^{N-2}\left(\phi_{i+1,j}-\phi_{i,j}-\Delta^{x}_{i,j}\right)^{2} + \sum_{i=0}^{M-2}\sum_{j=0}^{N-2}\left(\phi_{i,j+1}-\phi_{i,j}-\Delta^{y}_{i,j}\right)^{2} \qquad (2.1)
\]
forms the final solution. This solution can be reduced to the discretized Poisson
equation given by:
\[
\left(\phi_{i+1,j}-2\phi_{i,j}+\phi_{i-1,j}\right)+\left(\phi_{i,j+1}-2\phi_{i,j}+\phi_{i,j-1}\right)=\rho_{i,j} \qquad (2.2)
\]
where
\[
\rho_{i,j}=\left(\Delta^{x}_{i,j}-\Delta^{x}_{i-1,j}\right)+\left(\Delta^{y}_{i,j}-\Delta^{y}_{i,j-1}\right)
\]
This can be solved in the frequency domain by constructing reflections in the
x and y directions (in order to fulfill boundary condition requirements) and then
applying a two-dimensional Fourier transform to the input data. Alternatively, a two-
dimensional Discrete Cosine Transform (DCT) could be used. Applying the Fourier
transform to Equation 2.2 gives:
\[
\Phi_{m,n}=\frac{P_{m,n}}{2\cos(\pi m/M)+2\cos(\pi n/N)-4} \qquad (2.3)
\]
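This frequency-domain solve of the discretized Poisson equation can be sketched using SciPy's DCT routines (an assumption made here for illustration; the thesis implementation builds its DCT from an FFT instead):

```python
import numpy as np
from scipy.fft import dctn, idctn

def solve_poisson_dct(rho):
    """Solve the discretized Poisson equation (Eq. 2.2) by transforming,
    dividing by the denominator of Eq. 2.3, and inverse-transforming."""
    M, N = rho.shape
    P = dctn(rho, type=2, norm="ortho")          # forward 2-D DCT of rho
    m = np.arange(M).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    denom = 2 * np.cos(np.pi * m / M) + 2 * np.cos(np.pi * n / N) - 4
    denom[0, 0] = 1.0                            # avoid 0/0 at the DC term
    Phi = P / denom
    Phi[0, 0] = 0.0                              # the mean is unconstrained; pin it
    return idctn(Phi, type=2, norm="ortho")
```

The DCT implicitly imposes the reflective (Neumann) boundary conditions mentioned above, so the solution is recovered up to an arbitrary additive constant.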
The PCG algorithm uses the Conjugate Gradient (CG) method to solve the discretized
Poisson equation. The CG technique has rapid and robust convergence and is
guaranteed to converge in N iterations for an N × N matrix (barring roundoff error).
However, the actual number of iterations depends on the condition of the original
matrix: if the original matrix is close to the identity matrix, the iteration converges
rapidly. In order to achieve this condition, a preconditioning step is applied that
solves an approximate problem, the unweighted least-squares phase unwrapping.
After this, the usual CG steps follow. The algorithm is laid out in the pseudocode below:
Compute residual r_k of weighted phase Laplacians
Initialize solution phi to zero
for (k = 0 to MaxNumberOfIterations - 1)
    Solve P z_k = r_k using unweighted phase unwrapping to get z_k
    Use the CG method to solve for phi using z_k
    if solution lies within predefined convergence bounds
        exit loop
end
Figure 2.9: Preconditioned Conjugate Gradient Algorithm pseudo-code
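The CG kernel that PCG iterates can be sketched generically; the version below works on a dense symmetric positive-definite system in plain NumPy (the thesis applies the same iteration to the weighted Laplacian operator instead):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A by conjugate gradient."""
    n = len(b)
    max_iter = max_iter or n       # guaranteed convergence in n steps (exact arithmetic)
    x = np.zeros(n)
    r = b - A @ x                  # initial residual
    p = r.copy()                   # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # optimal step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # converged within bounds
            break
        p = r + (rs_new / rs) * p  # new A-conjugate search direction
        rs = rs_new
    return x
```

Preconditioning amounts to replacing the raw residual with the solution of the approximate (unweighted) problem at each step, which improves the effective condition of the system exactly as described above.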
As can be seen in Figure 2.10, the PCG algorithm produces a smooth, continuous
result, but the image has a gradually greater magnitude going from right to left. This
affects the foreground as well as the background, rendering the technique of
limited use. However, the conjugate gradient method presented here will be used
later for the Minimum LP Norm algorithm.
Figure 2.10: Preconditioned Conjugate gradient algorithm results
Minimum LP Norm
The Minimum LP Norm method is similar to the PCG method in that it also aims to
minimize the difference in gradients between the measured and calculated phases.
However, PCG sets p = 2, the least squares norm, whereas the Minimum LP Norm
algorithm uses p = 0. This means that the Minimum LP Norm algorithm minimizes
the number of points where the gradients of the measured data differ from the
calculated solution, whereas the PCG algorithm minimizes the square of the differences,
which spreads the error such that the measured gradients match the solution
exactly almost nowhere.
For the Minimum LP Norm algorithm, we are trying to solve Equation 2.4.
\[
(\phi_{i+1,j}-\phi_{i,j})U(i,j)+(\phi_{i,j+1}-\phi_{i,j})V(i,j)-(\phi_{i,j}-\phi_{i-1,j})U(i-1,j)-(\phi_{i,j}-\phi_{i,j-1})V(i,j-1)=c(i,j) \qquad (2.4)
\]
where
\[
c(i,j)=\Delta^{x}_{i,j}U(i,j)-\Delta^{x}_{i-1,j}U(i-1,j)+\Delta^{y}_{i,j}V(i,j)-\Delta^{y}_{i,j-1}V(i,j-1)
\]
and U and V are data-dependent weights and c is the weighted phase Laplacian.
This equation can be rewritten as a matrix equation, as in Equation 2.5:
\[
Q\phi = c \qquad (2.5)
\]
which is solvable by the PCG method discussed in Section 2.3.2. A description
of the algorithm is given below in pseudocode.
Initialize solution phi_0 to zero
for (k = 0 to maxIterations)
    Compute residual R
    If R has no residues, exit
    Compute data-dependent weights U and V
    Compute weighted phase Laplacian c
    Subtract c from the weighted phase Laplacian of the current solution
        (the left-hand side of the Lp norm equation)
    Solve Q phi = c with PCG
end
if (no residues in residual)
    Unwrap using Goldstein's algorithm
else
    Apply post-processing congruency operation
end
Figure 2.11: Minimum LP Norm Algorithm pseudo-code
The full implementation also applies user-input or dynamically generated maps.
The results of the Minimum LP Norm algorithm are displayed below. As can
be seen, it produces the best quality images thus far, slightly better than Flynn's
method. There are, however, several incorrect areas, such as the cell boundaries in the
lower embryos. This is partly because the algorithm reached its maximum number
of iterations without eliminating all residues from the residual.
Figure 2.12: The Minimum LP Norm algorithm
Multi-Grid
Multi-grid methods enable the rapid solution of PDEs on large grids. They usually
operate as fast as Fourier methods but are more size-generic (i.e. they can handle
non power-of-two sized arrays). This algorithm, while theoretically operating as fast
as or faster than the PCG algorithm, fails to produce meaningful results, as shown in
Figure 2.13, and is not discussed further.
Figure 2.13: The Multi-grid algorithm
2.3.3 Summary and Conclusions
The primary criterion by which these algorithms were evaluated was the quality of their
unwraps. The path-following algorithms operated quickly, but often had isolated
sections that were unwrapped poorly. The minimum norm algorithms produced smooth
solutions, but often had large errors. However, the Minimum LP Norm algorithm
produced the best overall solution at the expense of the greatest computation time.
2.4 Bitwidth Analysis
Implementing an algorithm with full floating point accuracy is usually not feasible
on either Digital Signal Processors (DSPs) or on Field Programmable Gate Arrays
(FPGAs) due to either a lack of support on the former or the formidable size
requirements on the latter platform. Thus, before implementing any new floating point
algorithm on these platforms, it is important both to verify that the data can be
converted to fixed point while the algorithm still operates accurately, and to discover
the minimum bitwidth that can be used. Finding this minimum bitwidth can result
in large area savings in hardware. It is less important for the GPU implementation
since GPUs support floating point natively in hardware.
2.4.1 Implementation
The operations performed by the DCT and the intermediate floating point operations
are all scalable, with a possible loss of precision. For example, if f(x) represents
the DCT and Poisson calculation and f(x) = V, then f(Cx) = CV. This allows the
fixed point implementation to be performed by multiplying the single precision
input data x : −1 < x < 1 by a scaling factor C = 2^p, where p represents the number
of bits to be shifted. After the scaling, a cast to an integer data type truncates all
data after the decimal point. After the calculation f is performed, the result is then
cast to floating point and scaled down.
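The scale-truncate-rescale procedure above can be sketched as follows; the transform f here is a toy linear stand-in for the DCT and Poisson calculation, since any linear f satisfies f(Cx) = Cf(x).

```python
# Sketch of the fixed-point conversion: multiply input data in (-1, 1) by
# C = 2**p, truncate to integers, run the (linear) computation, then cast
# back and scale down.  f is a toy stand-in for the DCT/Poisson step.

def f(data):
    # pairwise sum/difference; any linear f satisfies f(C*x) = C*f(x)
    return [data[0] + data[1], data[0] - data[1]]

def fixed_point_f(x, p):
    C = 2 ** p
    scaled = [int(v * C) for v in x]       # scale by 2**p and truncate
    result = f(scaled)                     # integer-only computation
    return [v / C for v in result]         # cast back to float, scale down

x = [0.7071, -0.25]
exact = f(x)
approx = fixed_point_f(x, 24)              # p = 24, a 24-bit-style scale
err = max(abs(a - b) for a, b in zip(exact, approx))
```

Shrinking p widens the error, which is exactly the trade-off the bitwidth experiments below explore.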
The FFT used in the DCT implementation was KISS FFT [21], a simple open-
source implementation that supports both fixed and floating point.
2.4.2 Results
The quality of the images produced by the different bitwidths was analyzed by visual
quality, by the lack of unwrapped sections and discontinuous jumps, and by the
difference from the full double-precision implementation shown in Figure 2.14.
Figure 2.14: Image produced by a double-precision unwrap
The double embryo image shown in this analysis was just one of the benchmark
images used to determine the optimal bitwidth. However, this image presents a
challenging unwrap that does not converge completely before reaching the maximum
number of iterations. Hence it is among the most difficult real-world images.
The 27-bit unwrap shown in Figure 2.15 has small isolated areas of very low
phase. This stretches the overall magnitude range of the data and causes wild visual
variation between this and the double-precision version.
The 28-bit and 29-bit versions in Figure 2.16 and Figure 2.17 do not have the
isolated low phase regions and so present an unwrap very similar to that of the
Figure 2.15: Using a bitwidth of 27
double-precision version. One noticeable difference is that the lines of discontinuity
(present partly because the unwrap did not converge to zero residues before reaching
the maximum number of iterations) shift from version to version. These lines can be
noticed running through cells in the lower embryo. Either precision is suitable for
the needs of this implementation.
Figure 2.16: Using a bitwidth of 28
Figure 2.17: Using a bitwidth of 29
2.5 Related Work
The related work sections on FPGAs and GPUs discuss slightly different implementations.
The FPGA implementation presented in this paper performs the preconditioning
step of the PCG algorithm, which, while the most computationally intensive part,
is only a small section of the overall algorithm. This preconditioning step involves a
DCT and some floating point computation. The GPU implementation, on the other
hand, implements the entire conjugate gradient calculation.
2.5.1 FPGAs
The popularity of the JPEG and MPEG standards has resulted in a proliferation
of hardware implementations of the DCT targeted towards specific sizes, most com-
monly the 8x8 2D implementation. This is usually accomplished by multiplying an
8x8 block from the image data by an 8x8 coefficient matrix, resulting in an output
matrix that contains the component frequencies. This is a fairly expensive operation,
requiring 4096 additions and another 4096 multiplications. There have been many
publications detailing the implementation of such 2D matrix multiplications using
distributed arithmetic (DA) which reduces the number of multiplications required,
but which still results in relatively large, low-latency hardware. Woods et al. [30]
used a combination of 1D DCTs, distributed arithmetic, and transpose buffers on a
Xilinx XC6264 to generate such a design that utilized 30 percent of total chip area.
Bukhari et al. [18] investigated the implementation of a modified Loeffler algorithm
for 8x8 DCTs on various FPGAs. Siu and Chan proposed a multiplierless VLSI
implementation[5] for an 11x11 2D transform and many other variations on small 2D
DCTs exist.
Larger 2D DCTs can be implemented using 1D DCTs by taking advantage of the
DCT’s separability property. This is accomplished by first taking the DCT of all the
rows and then of all the columns. However, there do not exist many implementations
of large 1D DCTs for reconfigurable systems. In [31], an 8 to 32 point core with a
maximum of 24-bit precision was implemented using distributed arithmetic for the
vector-coefficient matrix multiplication. A 32 point, 24 bit instance of this core has
a latency of only 32 cycles, but consumes 10588 LUTs which makes the approach
impractical for large designs. In contrast, our approach is much more compact and
therefore enables larger transform sizes at the cost of higher latency. Leong [19]
implements a variable radix, bit serial DCT using a systolic array but only describes
area requirements for designs up to N=32 which consumes between 457 and 1363
adders and has a high worst case error. In comparison, our approach supports much
larger transform sizes with demonstrated designs of up to 1024 points.
The Spiral project uses a heuristic algorithm to explore the DFT design space with
performance feedback to generate a hardware-software DFT implementation, but the
comparison is with a software implementation on the embedded PowerPC which is
severely outclassed by a modern desktop processor. The project does however include
a customizable 1D DCT implemented in Verilog that is available from their website
and is the only available large 1D-DCT found[8].
Shirazi et al. implemented a 2-D FFT, complex multiply and a 2-D IFFT on the
Splash-2 computing engine using a non-standard 18-bit floating point representation.
The hardware they target is the Xilinx 4010 and, for the purposes of that application,
they use 34 FPGA chips to achieve an adequate throughput. The nature of
their application also does not require sending data back to the host[25]. Our imple-
mentation on the other hand uses just a single, more modern FPGA and a higher
accuracy, semi-floating point representation. Dillon implements a high performance
floating point 2D FFT that would greatly improve our own results if integrated into
our design [9]. Bouridane et al. implement a high performance 2D FFT as well but
on images half the size of ours and with similar performance [14].
2.5.2 GPUs
With the introduction of unified shaders with DirectX 10 and above, the GPGPU
community has witnessed an explosion in the number of suitable applications and the
papers implementing such algorithms on graphics platforms. The papers discussed
here reflect only those closely related to the higher level conjugate gradient kernel or
preconditioning of the matrix which is what was implemented on the GPU in this
thesis.
Karasev et al. [16] implement 2D phase unwrapping on NVIDIA GPUs using Cg
and achieve a 35x speedup. They implement the weighted least
squares algorithm, similar to the PCG and multigrid algorithms discussed previously
and apply it to Interferometric Synthetic Aperture Radar (IFSAR) data. They chose
to use multigrid and Gauss-Seidel iterations to solve the minimization problem and
compare their results to C and Matlab implementations. However, multigrid tech-
niques do not work on our datasets as has been shown previously in Cary Smith’s
research [26] as well as in the experiments executed as part of this thesis. Their algo-
rithm also requires a very high number of iterations to converge (on the order of tens
of thousands), a known result of using Gauss-Seidel, unlike the PCG or Minimum
LP norm algorithm which require tens or hundreds of iterations. Thus their total
computation time is greater than ours.
Similar in some respects to the work of Karasev, Bolz et al. [3] implement
sparse matrix conjugate gradient solvers and regular grid multi-grid solvers on GPUs.
They achieve only modest speedups over CPUs as they work with the ATI 9700 and
Geforce FX generation of hardware and OpenGL and are thus limited to working
within the constraints of the classical graphics pipeline rather than the unified shader
architecture of the DirectX 10+ compatible video cards.
The 2D DCT required by the preconditioning step uses a 2D FFT, a complex
multiply and some reordering. NVIDIA provides a high performance FFT library
called CUFFT [22] that has been benchmarked by HP and shown to provide a 3x
speedup for large transform sizes[11] as measured against a highly optimized software
implementation on a multicore HP server available as part of Intel's MKL library [13].
The breakeven point at which a GPU implementation becomes feasible is between
the 512× 512 and the 1024× 1024 matrix sizes which see a speedup of about 0.9 and
3 respectively in real-world scenarios. Our input data set is 1024×512 and so should
see some improvement.
Chapter 3
Implementation
3.1 Timing Profile
3.2 The Discrete Cosine Transform
The Discrete Cosine Transform (DCT) is used in a wide variety of applications such as
image and audio processing due to its compaction of energy into the lower frequencies.
This property is exploited to produce efficient frequency-based compression methods
in various image and audio codecs such as JPEG and MPEG. However, the DCT
is also used in other applications that require larger sized transforms such as those
using the Preconditioned Conjugate Gradient (PCG) technique in applications like
adaptive filtering [12] and phase unwrapping [10]. This paper presents an algorithm,
first developed by Makhoul [20], and an implementation of it that utilizes a Fast
Fourier Transform (FFT) core to compute a DCT without significantly increasing
overall latency as compared to just a FFT core. The advantage of this approach is
the ready availability of a large number of FFT cores in both fixed-point [32] and
floating-point [1] formats which can be easily dropped in with minimal modifications
CHAPTER 3. IMPLEMENTATION 46
to the overall design.
3.2.1 Algorithm
The general algorithm presented here was first discussed by Makhoul[20]. It is an
indirect algorithm for computing DCTs using FFTs. The steps are presented as well
as their correspondence to the computation done in hardware.
Given an input signal x(n), the DFT of that signal is given by:
X(k) = \sum_{n=0}^{N-1} x_n e^{-j2\pi nk/N}, \qquad k = 0 \ldots N-1
The cosine transform can be viewed as the real part of X(k), which is equivalent
to saying that it is the Fourier transform of the even extension of x(n), given that
x(n) is causal (i.e. x(n) = 0, n < 0). This is the inspiration for the usual technique
for implementing the DCT, which is by mirroring the set of real inputs and taking
the real DFT of the resulting sequence. This mirroring can be performed in any of
four ways: around the n=-0.5 and n=N-0.5 sample points, around n=0 and n=N,
around n=-0.5 and n=N, and finally around n=0 and n=N-0.5. All of these methods
result in slightly different DCTs. The most commonly used even extension is the one
depicted in Figure 3.1 and this will be the focus of the algorithm and implementation
presented.
This category of DCT, obtained by taking the DFT of a 2N point even extension,
is known as a DCT Type II or DCT-II and is defined as:
X(k) = 2\sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)\pi k}{2N}, \qquad k = 0 \ldots N-1
Figure 3.1: Even extension around n=-0.5 and n=N-0.5
with the even extension defined as:
x'(n) = \begin{cases} x(n) & n = 0 \ldots N-1 \\ x(2N-n-1) & n = N \ldots 2N-1 \end{cases}
The DCT-II can be shown to be solvable via DFT by noting that:
\begin{aligned}
X(k) &= \sum_{n=0}^{2N-1} x'_n e^{-j\pi nk/N} \\
&= \sum_{n=0}^{N-1} x_n e^{-j\pi nk/N} + \sum_{n=N}^{2N-1} x'_n e^{-j\pi nk/N} \\
&= e^{j\pi k/2N} \sum_{n=0}^{N-1} x_n \left[ e^{-j\pi(2n+1)k/2N} + e^{j\pi(2n+1)k/2N} \right] \\
&= 2 e^{j\pi k/2N} \sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)\pi k}{2N}
\end{aligned}
which is identical to the definition of the DCT above except for a multiplicative
factor of e^{j\pi k/2N}. A similar method can be used to write an IDCT in terms of a length
2N complex IDFT. Full details can be found elsewhere [20].
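The identity above can be checked numerically: the first N bins of the 2N-point DFT of the even extension, with the e^{jπk/2N} factor removed, equal the DCT-II. The Python below uses naive direct sums (no FFT) and is illustrative only.

```python
import cmath
import math

# Check: DCT-II of x equals exp(-j*pi*k/2N) times the 2N-point DFT of the
# even extension x'(n), taking the real part.  Naive sums, small N.

def dct2_direct(x):
    N = len(x)
    return [2 * sum(x[n] * math.cos((2 * n + 1) * math.pi * k / (2 * N))
                    for n in range(N)) for k in range(N)]

def dct2_via_dft(x):
    N = len(x)
    ext = x + x[::-1]                      # even extension x'(n), length 2N
    X = []
    for k in range(N):
        dft_k = sum(ext[n] * cmath.exp(-1j * math.pi * n * k / N)
                    for n in range(2 * N))
        X.append((cmath.exp(-1j * math.pi * k / (2 * N)) * dft_k).real)
    return X

x = [0.5, -1.0, 2.0, 0.25]
err = max(abs(a - b) for a, b in zip(dct2_direct(x), dct2_via_dft(x)))
```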
The performance of the DCT in terms of latency and area can be further im-
proved upon such that an N point real DFT/IDFT may be used. The method for
accomplishing this is outlined below.
A sequence v(n) can be constructed from x(n) such that it follows the restriction:
v(n) = \begin{cases} x(2n) & n = 0 \ldots \frac{N-1}{2} \\ x(2N-2n-1) & n = \frac{N+1}{2} \ldots N-1 \end{cases} \qquad (3.1)
When the DFT of v(n) is computed and the result multiplied by 2e^{-j\pi k/2N}, the
resulting sequence can be written as:

X(k) = 2\sum_{n=0}^{N-1} v_n \cos\frac{(4n+1)\pi k}{2N}, \qquad k = 0 \ldots N-1
which is an alternative version of the DCT based on v(n).
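This N-point formulation can also be verified numerically, again with naive DFT sums rather than an FFT; the reordering follows Equation 3.1 (even samples in order, then odd samples reversed).

```python
import cmath
import math

# Check of the N-point formulation: reorder x into v(n), take an N-point
# DFT, rotate by 2*exp(-j*pi*k/2N); the real part is the DCT-II of x.
# Naive direct sums, illustrative only.

def dct2_via_vn(x):
    N = len(x)
    v = x[0::2] + x[1::2][::-1]            # v(n) reordering (Eq. 3.1)
    X = []
    for k in range(N):
        V_k = sum(v[n] * cmath.exp(-2j * math.pi * n * k / N)
                  for n in range(N))
        X.append((2 * cmath.exp(-1j * math.pi * k / (2 * N)) * V_k).real)
    return X

x = [0.5, -1.0, 2.0, 0.25]
direct = [2 * sum(x[n] * math.cos((2 * n + 1) * math.pi * k / (2 * len(x)))
                  for n in range(len(x))) for k in range(len(x))]
err = max(abs(a - b) for a, b in zip(direct, dct2_via_vn(x)))
```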
Again, for the IDCT, the real sequence X(k) can be rearranged to form a complex
Hermitian symmetric sequence V(k), where Hermitian symmetry is defined as
V(N-k) = V^*(k) and the sequence itself as:

V(k) = \frac{1}{2} e^{j\pi k/2N} \left[ X(k) - jX(N-k) \right], \qquad k = 0 \ldots N-1
An IDFT on V (k) generates the v(n) sequence described earlier, which can then be
rearranged to form x(n). This is the method used in the implementation discussed
in the later sections of this paper. However, for both the size N DCT and IDCT
it should be noted that the input sequences are either entirely real or Hermitian
symmetric and can thus be computed using FFTs with a point size of N/2 [23, 20].
This can be done by setting alternating elements of v(n) to the real and imaginary
parts of a new sequence t. That is,

t(n) = v(2n) + jv(2n+1), \qquad n = 0 \ldots \frac{N}{2} - 1
The DFT of this sequence can then be computed and the original V (k) extracted
by using:
V(k) = \frac{1}{2}\left[ T(k) + T^*\!\left(\frac{N}{2}-k\right) \right] - \frac{j}{2}\, e^{-j2\pi k/N} \left[ T(k) - T^*\!\left(\frac{N}{2}-k\right) \right].
This gives the original V (k) which can subsequently be used to generate X(k).
A similar method can be applied to the real IFFT to realize similar savings.
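The packing trick can likewise be checked numerically; the sketch below packs a real length-N sequence into an N/2-point complex DFT and reconstructs V(k) with the formula above. Indices into T are taken modulo N/2, since the DFT is periodic. Naive sums, illustrative only.

```python
import cmath
import math

# Check of the packing identity: pack real v into complex t of length N/2,
# take its DFT T, and rebuild V(k) = DFT_N(v)(k) for k = 0..N/2-1.

def dft(seq):
    M = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * math.pi * n * k / M)
                for n in range(M)) for k in range(M)]

v = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.125, 3.0]   # real, N = 8
N = len(v)
t = [complex(v[2 * n], v[2 * n + 1]) for n in range(N // 2)]
T = dft(t)

V_packed = []
for k in range(N // 2):
    Tc = T[(N // 2 - k) % (N // 2)].conjugate()       # T*(N/2 - k)
    V_packed.append(0.5 * (T[k] + Tc)
                    - 0.5j * cmath.exp(-2j * math.pi * k / N) * (T[k] - Tc))

V_direct = dft(v)                                     # full N-point DFT
err = max(abs(a - b) for a, b in zip(V_packed, V_direct[:N // 2]))
```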
The implementation described in this paper does not use the last FFT optimiza-
tion (that reduces required transform size from N to N/2) since it involves an extra
multiplication step that would reduce the accuracy of the results, since the data being
used is fixed-point. However, for a floating-point or low precision implementation,
this would be a feasible optimization.
3.3 The One Dimensional Implementation
The description of the algorithm for both the forward and inverse DCT lends itself to
a clearly defined component breakdown in terms of hardware. For the DCT, the first
component is the one that creates v(n) by re-ordering the input sequence and writing
it to memory. The second component is the actual FFT that transforms the shuffled
input data into the frequency domain. Last of all is the component that multiplies
the output V(k) by 2e^{-j\pi k/2N} and extracts the desired output values from the complex
FFT output. Roughly the same components are required for the inverse DCT but in
reverse order. First of all, a multiplication of a re-arranged sequence Y(k) by 0.5e^{j\pi k/2N}
is performed, where Y(k) = X(k) - jX(N-k). Then the data is passed through an
inverse FFT of size N, followed by the mapping of v(n) to x(n). This organization of
the components is depicted in Figure 3.2 and Figure 3.3 for the forward and inverse
transforms respectively.
Figure 3.2: Components and dataflow for the forward DCT transform
3.3.1 Shuffle
As input data is sent sequentially into the DCT core, the first stage of processing
that occurs is the generation of the v(n) sequence. This occurs within the shuffle
component. The shuffle has a latency of one clock cycle and calculates output indices
based on the input index according to Equation 3.1. Since the shuffle component only
affects index values, all addition and subtraction performed within it is of bitwidth
log2N . Based on these output indices, the sample value is written to block RAM in
shuffled order as shown in Figure 3.4. This step of writing to block RAM is necessary
Figure 3.3: Components and dataflow for the inverse DCT transform
since the FFT component takes in input in sequential order but the shuffle produces
output non-sequentially.
For an inverse DCT shuffle, the FFT output data is re-arranged in the opposite
direction, forming x(n) from v(n). This is also written to block RAM before trans-
mitting back to the host since the data will not be generated in sequential order.
3.3.2 Fast Fourier Transform
The complex FFT used was provided by Xilinx LogicCore and generated using Core-
gen [32]. It allows for a range of options, including a parameterized bit-width of 8 to
24 bits for both input and phase factors, the use of either block RAM or distributed
RAM, the choice of algorithm and rounding used and the ability to set the output
Figure 3.4: Re-ordering pattern in a forward shuffle
ordering.
For the applications envisaged for larger DCTs, it was necessary to set the
bitwidth to a large size to maximize precision. To this end a 24 bit signed input
was used along with a block floating-point exponent for each 1D transform com-
pleted. This exponent field reduces the need to increase output bitwidth after each
FFT. Block RAM, a radix-4 block transform and bit reversed ordering were also
selected.
Since the algorithm optimization for computing a real or Hermitian symmetric
FFT using a transform of length N/2, as mentioned in the previous section, wasn’t
used due to precision issues, the imaginary input for the FFT was tied to zero. In
addition, the FFT core was set up to support both forward and inverse transforms
simultaneously as well as to have run-time configurable transform length.
3.3.3 Rebuild rotate
The rebuild rotate component implements the multiplication by 2e^{-j\pi k/2N} for the
forward transform and 0.5e^{j\pi k/2N} for the inverse. These complex exponentials can be
converted to a format consisting of sines and cosines by using Euler's formula. For
example, the forward transform is equivalent to 2(\cos(-\pi k/2N) + j\sin(-\pi k/2N)).
The Coregen Sine Cosine Lookup Table 5.0 component used has a mapping between
the input integer angle T and the calculated \theta of \theta = T \cdot 2\pi / 2^{T\_WIDTH}. Thus, for
\theta = -\pi k/2N, and noting that 2^{T\_WIDTH} = N, the input angle works out to be -k/4 and
k/4 for the forward and inverse transforms respectively. The sine and cosine of the index k are
generated and then multiplied by the results of the FFT using a complex multiply
with a latency of six cycles. This generates a 48-bit output, of which only the first
24 bits are retained.
The overall dataflow of this component is depicted in Figure 3.5. Note that for
the forward DCT transform only the real output of the complex multiplication is
used. The full functionality of the complex multiplier is retained however, since the
inverse transform requires it for the initial step as shown in Figure 3.3.
3.3.4 Dynamic Scaling
The hardware implementation was required to be as close as possible to a floating
point software implementation as detailed in the bitwidth analysis section. In order
to achieve this level of accuracy with a fixed point FFT, it was necessary to scale the
input data on the fly to maximize available dynamic range. The expanded dataflow
Figure 3.5: The rebuild component - Forward Transform
diagram in Figure 3.6 shows the components used in this procedure.
The max tracker component receives incoming streaming floating point data
from SRAM and records the maximum exponent of the 1D frame. It does this by
using a comparator to see if the internally registered data is less than the incoming
value. If it is, the internal register is overwritten with the new value. The final
output of this component is what the entire data frame must be shifted by. This
output is calculated as 23− (MAX−126). The value 23 is used since the fixed point
FFT uses 24 bit data and the float to fixed point conversion will round numbers less
than one to zero. The 126 is to compensate for the exponent bias in IEEE compliant
floating point representation. Converting the register MAX into two’s complement
and simplifying, the calculation is performed as not(MAX) + 150.
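The two's-complement identity above can be checked exhaustively over the 8-bit exponent field; the function names below are illustrative.

```python
# Check: with MAX an 8-bit IEEE-754 single-precision exponent field,
# 23 - (MAX - 126) equals not(MAX) + 150 in 8-bit two's-complement
# arithmetic (since -MAX = not(MAX) + 1).

def shift_direct(max_exp):
    return (23 - (max_exp - 126)) & 0xFF   # 149 - MAX, modulo 2**8

def shift_hw(max_exp):
    return ((max_exp ^ 0xFF) + 150) & 0xFF # not(MAX) + 150, modulo 2**8

mismatches = sum(shift_direct(m) != shift_hw(m) for m in range(256))
```

Replacing the subtraction with a bitwise inversion and a constant add is cheaper in hardware than a two-input subtractor.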
Figure 3.6: The 1D transform including dynamic scaling
The scale component takes in data from BRAM, and adjusts the floating point
exponent field according to the MAX value calculated above. Similarly, the rescale
component adjusts the data frame back to its original range by subtracting the MAX
value. The rescale component also adds the block exponent produced by the FFT
as well as a constant multiplicative power of two introduced by the algorithm. This
can be summarized as:
EXP_{output} = EXP_{input} + EXP_{block} - EXP_{scale} + EXP_{constant}

where EXP_{constant} is either 4 or 3 if the transform length is 1024 or 512 respectively.
3.3.5 Data Type Conversion
The unwrapping software deals with all its data as single-precision 32-bit floating
point values. However, given the limitation of only having a fixed point 24-bit FFT
available, some form of conversion between the two formats is necessary.
The RCL VFLOAT library [2] contains parameterizable float to fixed point and
fixed to floating point units capable of performing the necessary conversions. Data
coming out of the scale component is streamed into the float to fixed point com-
ponent and then transferred to the FFT. The output of the FFT is converted back
to floating point and stored. The data-type conversion encapsulates the core DCT
logic, which is further encapsulated by the dynamic scaling logic.
3.4 The Two-Dimensional Extension
The DCT, similar to the FFT, is separable. This means that a two-dimensional DCT
can be constructed by performing the 1D DCT for each of the rows followed by a 1D
DCT of the columns of the resulting matrix. This technique is called the row-column
decomposition method.
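The row-column decomposition can be demonstrated with a small sketch, using a naive 1D DCT-II (as defined in Section 3.2.1) and comparing against the direct 2D double sum. Illustrative only; the hardware uses the FFT-based 1D DCT core instead of these direct sums.

```python
import math

# Separability check: 1D DCT-II of every row, then of every column,
# matches the direct 2D DCT-II.

def dct1(x):
    N = len(x)
    return [2 * sum(x[n] * math.cos((2 * n + 1) * math.pi * k / (2 * N))
                    for n in range(N)) for k in range(N)]

def dct2d(img):
    rows = [dct1(r) for r in img]               # 1D DCT of each row
    cols = [dct1(list(c)) for c in zip(*rows)]  # then of each column
    return [list(r) for r in zip(*cols)]        # transpose back

img = [[1.0, 2.0], [3.0, 4.0]]
out = dct2d(img)

# direct 2D DCT-II for comparison
N1, N2 = len(img), len(img[0])
direct = [[4 * sum(img[n1][n2]
                   * math.cos((2 * n1 + 1) * math.pi * k1 / (2 * N1))
                   * math.cos((2 * n2 + 1) * math.pi * k2 / (2 * N2))
                   for n1 in range(N1) for n2 in range(N2))
           for k2 in range(N2)] for k1 in range(N1)]
err = max(abs(out[i][j] - direct[i][j]) for i in range(N1) for j in range(N2))
```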
The key to extending the one-dimensional DCT discussed earlier into
two dimensions is exploiting the onboard SRAM banks to store entire images at a
time and calculating the transpose of the matrix after each DCT iteration.
3.4.1 The SRAM Controller
3.4.2 Calculating The Transpose
The purpose of the transpose component is to calculate the write address of data
coming out of the DCT component. These addresses should flip the matrix along
the diagonal. This can be accomplished by switching the row and column indices,
or rather, since the SRAM memory is linearly addressed, by multiplying the column
index by the length of a row and adding the row index. In equation form:
write_addr = (col_addr / 2) * row_length + row_addr

where row_length = 512 or 1024 depending on orientation.
The column address is divided by 2 by dropping the last bit. This truncation is
necessary for the data-packing of two output values into each SRAM word to occur.
Additionally, since row length is a power of two, the multiplication and addition is
accomplished by appending the row index to the column index. This arrangement
requires minimal resources and is accomplished within a single cycle.
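The bit-concatenation shortcut can be checked against the arithmetic form; the sketch below uses the 512-wide orientation, and the function names are illustrative.

```python
# Check: with a power-of-two row length, (col_addr / 2) * row_length +
# row_addr reduces to dropping the LSB of the column index and appending
# the row index.  512-wide orientation shown.

ROW_LENGTH = 512
ROW_BITS = ROW_LENGTH.bit_length() - 1     # log2(512) = 9

def write_addr_arith(col_addr, row_addr):
    return (col_addr >> 1) * ROW_LENGTH + row_addr

def write_addr_concat(col_addr, row_addr):
    return ((col_addr >> 1) << ROW_BITS) | row_addr

mismatches = sum(write_addr_arith(c, r) != write_addr_concat(c, r)
                 for c in range(1024) for r in range(ROW_LENGTH))
```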
3.5 Division And Scaling
The section of code that performs the division and scaling that occurs between the
forward DCT and inverse DCT calculations is as follows:
for (j = 0; j < 1024; j++)
    for (i = 0; i < 512; i++)
    {
        if (i == 0 && j == 0)
            array[0] = 0;
        else
            array[j*512+i] = array[j*512+i] / (4 - 2*cos(i*pi/511) - 2*cos(j*pi/1023));
    }
This segment of code scales the image by the factor 4 - 2cos(i\pi/511) - 2cos(j\pi/1023).
There are two efficient ways to perform this calculation. The first
is precomputing the factor for the entire 1024 by 512 image. This is too large to
load into BRAM and so must be stored in SRAM. It will also have to be loaded
into memory at startup, although this added latency can be amortized over multiple
transforms, as long as the FPGA accelerator board is not reset. This requires added
complexity to the SRAM memory design, but does not require the two floating point
add units which are needed for the second method detailed below.
The second approach is to precompute only the 2cos(i\pi/511) and 2cos(j\pi/1023)
terms and store them into BRAM, since they occupy a relatively small amount of
terms and store them into BRAM, since they occupy a relatively small amount of
space. These initial values can be integrated into the FPGA bitstream and thus do
not need to be loaded onto the board after initialization. As mentioned above, the
drawback to this is the addition of two floating point adders. This method was chosen
because of the relatively low area requirements as well as the lower complexity of the
required controller.
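The chosen table-based approach can be sketched as follows; scale_pixel is a hypothetical name for the per-pixel operation, with the two cosine tables standing in for the BRAM contents, and the table sizes following the loop bounds in the code above.

```python
import math

# Sketch of the second (chosen) approach: precompute two small 1D cosine
# tables (small enough for BRAM) and combine them per pixel, instead of
# storing the full 1024x512 factor.  scale_pixel is a hypothetical name.

COS_I = [2 * math.cos(i * math.pi / 511) for i in range(512)]
COS_J = [2 * math.cos(j * math.pi / 1023) for j in range(1024)]

def scale_pixel(value, i, j):
    if i == 0 and j == 0:
        return 0.0                         # DC term is zeroed, not divided
    return value / (4 - COS_I[i] - COS_J[j])
```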
Both approaches require a floating point divide and the second requires floating
point addition. This is provided by the Xilinx floating point operator[33] which
supports both functions. Two add units and one divide unit were instantiated and
connected as detailed in Figure 3.7.
Figure 3.7: Implementation of the floating point divide and scale logic
3.6 GPU Implementation
3.6.1 FPGA equivalent
3.6.2 Full implementation
3.7 Data transfer
3.7.1 Programmed IO
3.7.2 Direct Memory Access (DMA)
Chapter 4
Results
4.1 Experimental Setup
The platform for which the FPGA implementation was designed was the Annapolis
WildStar II Pro.
4.1.1 Verification
4.1.2 Benchmark Suite
4.2 Performance
4.2.1 Experiment Parameters
4.2.2 Experiments
4.2.3 Results
4.3 Summary
Chapter 5
Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
Bibliography
[1] 4DSP Inc. IEEE-754 compliant floating-point FFT core for FPGA. http://www.4dsp.com/fft.htm, Last accessed March 2007.
[2] P. Belanovic. Library of Parameterized Hardware Modules for Floating-Point Arithmetic with An Example Application. PhD thesis, Northeastern University, June 2002.
[3] J. Bolz, I. Farmer, E. Grinspun, and P. Schroder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics, 22(3):917–924, July 2003.
[4] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Transactions on Graphics, 23(3):777–786, August 2004.
[5] Y. Chan and W. Siu. On the realization of discrete cosine transform using the distributed arithmetic. IEEE Transactions on Circuits and Systems, 39(9):705–712, Sept 1992.
[6] C. H. Crawford, P. Henning, M. Kistler, and C. Wright. Accelerating Computing with the Cell Broadband Engine Processor. In Proceedings of the 2008 Conference on Computing Frontiers, pages 3–12, 2008.
[7] Cray. Cray XD1 datasheet. http://www.cray.com/downloads/Cray XD1Datasheet.pdf, Last accessed July 2008.
[8] P. D'Alberto, P. Milder, A. Sandryhaila, F. Franchetti, J. Hoe, J. Moura, and M. Puschel. Generating FPGA-accelerated DFT libraries. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'07), pages 173–184, 2007.
[9] T. Dillon. Two Virtex-II FPGAs deliver fastest, cheapest, best high-performance image processing system. In Xilinx Xcell J., pages 70–73, 2001.
[10] D. C. Ghiglia and M. D. Pritt. Two-Dimensional Phase Unwrapping: Theory, Algorithms and Software. Wiley Inter-Science, 605 Third Avenue, New York, NY, 10158-0012, 1998.
[11] HP. Accelerating HPC using GPUs. http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/accelerating-HPCUsing-GPUs.pdf, Last accessed July 2008.
[12] A. Hull and W. Jenkins. Preconditioned conjugate gradient methods for adaptive filtering. In IEEE International Symposium on Circuits and Systems, pages 540–543, June 1991.
[13] Intel. Intel Math Kernel Library 10.0. http://www.intel.com/cd/software/products/asmo-na/eng/307757.htm, Last accessed July 2008.
[14] I. S. Uzun, A. Amira, and A. Bouridane. FPGA Implementations Of Fast Fourier Transform For Real-Time Signal And Image Processing. In Proceedings of the IEEE Conference On Vision, Image And Signal Processing, volume 152, pages 283–296, June 2005.
[15] W. E. F. Jr. Selecting math coprocessors. IEEE Spectrum, pages 38–41, July 1991.
[16] P. Karasev, D. Campbell, and M. Richards. Obtaining a 35x speedup in 2D phase unwrapping using commodity graphics processors. In Radar Conference, 2007 IEEE, pages 574–578, April 2007.
[17] J. Kerimo. The W. M. Keck three-dimensional fusion microscope. http://www.keck3dfm.neu.edu/, Last accessed June 2008.
[18] K. Bukhari, G. Kuzmanov, and S. Vassiliadis. DCT and IDCT implementations on different FPGA technologies. In Program for Research on Integrated Systems and Circuits (ProRISC), pages 232–235, November 2002.
[19] M. P. Leong and P. H. W. Leong. A variable-radix digit-serial design methodology and its application to the discrete cosine transform. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11(1):90–104, Feb 2003.
[20] J. Makhoul. A fast cosine transform in one and two dimensions. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(1):27–34, February 1980.
[21] M. Borgerding. KISS FFT. http://sourceforge.net/projects/kissfft/, Last accessed July 2008.
[22] NVIDIA. CUFFT library. developer.download.nvidia.com/compute/cuda/11/CUFFT Library 1.1.pdf, Last accessed July 2008.
[23] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986.
[24] Rapidmind. Rapidmind: Product resources. http://www.rapidmind.net/resources.php, Last accessed July 2008.
[25] N. Shirazi, A. Abbot, and P. Athanas. Implementation of a 2-D Fast Fourier Transform on FPGA-Based Custom Computing Machines. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'95), pages 155–163, April 1995.
[26] C. Smith. Phase unwrapping algorithms. PhD thesis, Northeastern University, 2004.
[27] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction, volume 2304, pages 179–196, 2002.
[28] T. Valich. GPU supercomputer: Nvidia Tesla cards to debut in Bull system. http://www.tomshardware.com/news/nvidia-graphics-supercomputer,5219.html, Last accessed July 2008.
[29] S. Wasson. Ageia's PhysX physics processing unit. http://techreport.com/articles.x/10223, Last accessed July 2008.
[30] R. Woods and D. T. J. Heron. Applying an XC6200 to real-time image processing. IEEE Design and Test of Computers, 15(1):30–38, Jan-Mar 1998.
[31] Xilinx Inc. 1-D discrete cosine transform (DCT) v2.1. http://www.xilinx.com/ipcenter/catalog/logicore/docs/da 1d dct.pdf, Last accessed March 2007.
[32] Xilinx Inc. Fast Fourier Transform 3.2. http://www.xilinx.com/ipcenter/catalog/logicore/docs/xfft.pdf, Last accessed March 2007.
[33] Xilinx Inc. Floating-point operator v1.0. http://www.xilinx.com/bvdocs/ipcenter/data sheet/floating point.pdf, Last accessed October 2007.