NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Thesis Title: Phase Unwrapping On Reconfigurable Hardware
Author: Sherman Braganza
Department: Electrical and Computer Engineering
Approved for Thesis Requirements of the Master of Science Degree
Thesis Advisor: Prof. Miriam Leeser Date
Thesis Reader: Prof. Charles DiMarzio Date
Thesis Reader: Prof. Laurie King Date
Department Chair: Prof. Ali Abur Date
Graduate School Notified of Acceptance:
Dean: Prof. Yaman Yener Date
Copy Deposited in Library:
Reference Librarian Date
Phase Unwrapping on Reconfigurable Hardware
A Thesis Presented
by
Sherman Braganza
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements for the degree of
Master of Science
in
Electrical Engineering
in the field of
Computer Engineering
Northeastern University
Boston, Massachusetts
December 2007
© Copyright 2007 by Sherman Braganza
All Rights Reserved
Acknowledgement
Very Very soon.
Abstract
Coming soon!
Contents
1 Introduction 13
2 Background 14
2.1 Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 NVIDIA GPUs - G80 . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 The Keck Fusion Microscope . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Phase Unwrapping - Algorithms and Selection . . . . . . . . . . . . . 22
2.3.1 Path Following Algorithms . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Minimum Norm Algorithms . . . . . . . . . . . . . . . . . . . 29
2.3.3 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . 36
2.4 Bitwidth Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.2 GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 Implementation 45
3.1 Timing Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 The Discrete Cosine Transform . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 The One Dimensional Implementation . . . . . . . . . . . . . . . . . 49
3.3.1 Shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.2 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . 51
3.3.3 Rebuild rotate . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.4 Dynamic Scaling . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.5 Data Type Conversion . . . . . . . . . . . . . . . . . . . . . . 56
3.4 The Two-Dimensional Extension . . . . . . . . . . . . . . . . . . . . 56
3.4.1 The SRAM Controller . . . . . . . . . . . . . . . . . . . . . . 57
3.4.2 Calculating The Transpose . . . . . . . . . . . . . . . . . . . . 57
3.5 Division And Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 GPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6.1 FPGA equivalent . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6.2 Full implementation . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 Data transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7.1 Programmed IO . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7.2 Direct Memory Access (DMA) . . . . . . . . . . . . . . . . . . 59
4 Results 60
4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.1 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.2 Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.1 Experiment Parameters . . . . . . . . . . . . . . . . . . . . . 60
4.2.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Conclusion and Future Work 61
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures
2.1 Virtex II Pro - Architecture . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 NVIDIA G80 core - Architecture . . . . . . . . . . . . . . . . . . . . 17
2.3 Optical Quadrature Microscopy setup . . . . . . . . . . . . . . . . . . 22
2.4 A raster unwrap using Matlab's 'unwrap' routine . . . . . . . . . . . 23
2.5 Goldstein's algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Quality Mapped algorithm . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Mask Cut algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Flynn's algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 Preconditioned Conjugate Gradient Algorithm pseudo-code . . . . . . 32
2.10 Preconditioned Conjugate gradient algorithm results . . . . . . . . . 33
2.11 Minimum LP Norm Algorithm pseudo-code . . . . . . . . . . . . . . 34
2.12 The Minimum LP Norm algorithm . . . . . . . . . . . . . . . . . . . 35
2.13 The Multi-grid algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.14 Image produced by a double-precision unwrap . . . . . . . . . . . . . 38
2.15 Using a bitwidth of 27 . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.16 Using a bitwidth of 28 . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.17 Using a bitwidth of 29 . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1 Even extension around n=-0.5 and n=N-0.5 . . . . . . . . . . . . . . 47
3.2 Components and dataflow for the forward DCT transform . . . . . . 50
3.3 Components and dataflow for the inverse DCT transform . . . . . . . 51
3.4 Re-ordering pattern in a forward shuffle . . . . . . . . . . . . . . . . . 52
3.5 The rebuild component - Forward Transform . . . . . . . . . . . . . . 54
3.6 The 1D transform including dynamic scaling . . . . . . . . . . . . . . 55
3.7 Implementation of the floating point divide and scale logic . . . . . . 59
List of Tables
Chapter 1
Introduction
Chapter 2
Background
2.1 Platforms
The concept of using an external accelerator for application speedup is not a new
one. In the early days of PCs, it was common to have an empty socket for a math
coprocessor [15] that could be used to accelerate floating point computations. More
recently, platforms such as the Cell Broadband Engine have been used as accelerators
in petaflop supercomputers [6] such as the Roadrunner machine in order to reach new
levels of computing performance. FPGAs have also been used in machines such as
the Cray XD1 [7] and GPUs in systems like the Bull Novascale supercomputer [28].
Such systems are general accelerators in the sense that they apply to a wide range
of application domains. More specialized accelerators, such as Ageia's PhysX physics
accelerator [29], target only a restricted domain.
In the work presented in this master's thesis, a phase-unwrapping algorithm has
been implemented on three separate platforms: Field Programmable Gate Arrays
(FPGAs), Graphics Processing Units (GPUs) and general-purpose processors.
2.1.1 FPGAs
Implementation of application-specific functionality in hardware can be performed
on either Application Specific Integrated Circuits (ASICs) or FPGAs. ASIC implementation
is expensive in terms of initial cost and is thus reserved for high-volume
production runs; ASICs cannot be reprogrammed and are usually mass produced.
Gate arrays, on the other hand, are programmable and exist in both volatile and
non-volatile flavors. Non-volatile types, such as those that are mask-programmed,
are likewise not feasible for prototypes or low-volume production. Reprogrammable
gate arrays are thus ideal for prototyping applications.
SRAM-based programmable FPGAs (which are the target for this implementa-
tion) use look-up tables (LUTs) to implement combinational logic. These LUTs,
along with other primitives such as registers, multipliers and memory, are arranged
in a regular pattern in hardware with each component being connected by a pro-
grammable interconnect. This is shown in Figure 2.1. This architecture allows for
the implementation of arbitrary functionality, limited only by available area, while
exploiting hardware-based optimizations such as parallelization and pipelining. Specifically,
we use a Virtex II Pro, which has hardware 18-bit multipliers, on-chip RAM
elements called BlockRAMs, and two embedded PowerPCs that remain unused in
this implementation.
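To make the LUT idea concrete, here is a minimal software model (an illustrative Python sketch, not vendor tooling): an n-input LUT is simply a 2^n-entry truth table that is "configured" once and then evaluated by lookup alone.

```python
def make_lut4(fn):
    """Model a 4-input LUT: 'configure' it by precomputing fn's truth
    table into 16 stored bits, then evaluate by table lookup only."""
    table = [fn((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
             for i in range(16)]

    def lut(a, b, c, d):
        # Hardware does exactly this: the inputs select one stored bit.
        return table[(a << 3) | (b << 2) | (c << 1) | d]

    return lut

# Any 4-input combinational function fits in one LUT,
# e.g. a full adder's sum bit (fourth input unused here):
sum_bit = make_lut4(lambda a, b, cin, unused: a ^ b ^ cin)
```

Because evaluation is a single table lookup regardless of the function, arbitrary combinational logic maps onto a regular fabric of identical LUTs.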
Figure 2.1: Virtex II Pro - Architecture
2.1.2 NVIDIA GPUs - G80
The introduction of DirectX 10 necessitated the development of a new generation of
Graphics Processing Units (GPUs) that broke the old graphics pipeline composed
of specific shader and fragment units in favor of unified computation units capable
of handling either. This also allowed GPUs to effectively enter the field of High
Performance Computing (HPC).
Hardware
NVIDIA’s first foray into DX10 compatible hardware was the G80 architecture de-
picted in Figure 2.2. It consists of 128 processing elements, each capable of operating
on a separate datum in parallel with the others with full single precision floating
point. Groups of eight elements form a multiprocessor, with each multiprocessor
having its own shared memory. This shared memory is also used by threads
on the same multiprocessor to share data. Each multiprocessor also has its own set
of registers, a constant cache and a texture cache. These different memory types are
all optimized for different access types. There is also hardware thread control that
enables rapid thread switching and thus the hardware is optimized for dealing with
thousands of threads in parallel. This approach allows for the rapid computation of
massively parallel algorithms that exhibit high arithmetic intensity.
Figure 2.2: NVIDIA G80 core - Architecture
Application Programming Interface (API)
The paradigm around which General Purpose GPUs are based is the kernel-stream
concept [4] [27] [24]. This approach maps very well to highly parallel applications.
Here, a kernel running on multiple processors operates on separate data independently.
For example, think of a matrix multiplied by a scalar. Each data point in
the matrix can be read, multiplied and written back independently.
Stream processors are usually arranged in SIMD-type arrays. In their API, called
the Compute Unified Device Architecture (CUDA), NVIDIA espoused a similar
method, which they labeled Single Instruction Multiple Thread (SIMT); the difference
is that SIMT instructions do not contain information about the width of the
processor array. For the above example of scalar-matrix multiplication, this would
mean that each data point is operated on by its own thread.
The CUDA API operates on a large number of threads, broken up into warps,
blocks and grids. A warp consists of 32 threads that are managed together on a
multiprocessor. A thread block is a larger group of threads that only executes on a
single multiprocessor and it typically consists of multiple warps. A grid is a collection
of blocks and it operates on many multiprocessors.
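The indexing arithmetic behind this hierarchy can be illustrated with a small host-side sketch (the names follow CUDA's blockIdx/blockDim/threadIdx convention, but this is only an illustrative Python model, not the CUDA API):

```python
import math

def launch_config(width, height, block_dim=(16, 16)):
    """Grid dimensions (in blocks) needed to cover a width x height image."""
    bx, by = block_dim
    return (math.ceil(width / bx), math.ceil(height / by))

def global_index(block_idx, block_dim, thread_idx):
    """The pixel a given thread owns: blockIdx * blockDim + threadIdx."""
    return (block_idx[0] * block_dim[0] + thread_idx[0],
            block_idx[1] * block_dim[1] + thread_idx[1])
```

With 16 x 16 = 256 threads per block (eight 32-thread warps), a 512 x 512 image is covered by a 32 x 32 grid of blocks, and each thread computes its own pixel coordinate from its block and thread indices.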
The API also exposes two levels: a higher, abstract level and the driver API.
The two levels operate in a mutually exclusive manner. The higher level provides
simpler abstractions, whereas the driver API gives the programmer access to
lower-level aspects of the GPU. All implementations discussed in this thesis use
the high-level API.
2.2 The Keck Fusion Microscope
2.2.1 Modalities
The Keck microscope utilizes multiple modalities in order to generate a complete
fused image of the target. The modalities are summarized below with a full descrip-
tion available in [17].
Differential Interference Contrast (DIC)
In DIC, two waves propagate through a phase object with a sub-pixel displacement
created by a beam-splitting prism. The phase object must be transparent, with low
amounts of scattering and absorbance (typical of a live cell). The waves are delayed
by different amounts if the optical path length through the specimen, at the focal
region, varies in the shear direction. The two waves are later combined to create a
differential interference. Thus, the source of contrast in a DIC microscope is the phase
gradient of the object, transverse to the optical beam, measured by the interference
of the two beams.
Reflectance Confocal Microscopy (RCM)
The advent of the laser scanning confocal microscope created a means to record
digitally the image created in a confocal microscope. The RCM detects light that is
backscattered into the objective lens. Like other confocal microscopes, only the parts
of the cell that are in ”focus” are detected. The light passes through a beamsplitter,
to another lens that focuses the light onto a specimen. The backscattered light then
retraces its path, is re-collimated, and then reflects off the beamsplitter. The light is
then focused onto a pinhole that is in the same focal plane as the image. Backscattered
light from the in-focus plane passes through the pinhole, while the out-of-focus light
is rejected. This yields an image formed only from in-focus backscattered light.
Laser Scanning Confocal Microscopy (LSCM)
LSCM uses specific laser illumination to excite an electron of a fluorophore from its
ground state to a metastable state. When the electron of the fluorophore relaxes
from the metastable state to the ground state, a new photon with equal energy to
the difference in energy level between the metastable state and the ground state is
given off. This light is collected in the same manner as in RCM.
Two-photon Microscopy (TPLSM)
As with LSCM, TPLSM uses the principle that an electron excited to the metastable
state will release an excited photon when relaxing to the ground state. The differ-
ence between TPLSM and LSCM is that the fluorophore simultaneously absorbs two
photons to reach the excited state. To illustrate, suppose that a fluorophore needs
one blue photon, with an energy equivalent to "1", to be excited to its metastable
state (and then release a green photon). Two-photon microscopy uses the principle
that two red photons, with energy equivalents of "0.5" each, can simultaneously excite
the fluorophore to its metastable state (and then release the same green photon
when relaxing). To observe this effect, a large amount of intense near-infrared light is
necessary, and this is achieved with a high-power titanium-sapphire laser.
Optical Quadrature Microscopy (OQM)
Optical quadrature microscopy is a detection technique for measuring phase and
amplitude changes to a sinusoidal signal. A signal from a HeNe laser is split into
two components, reference and unknown. The unknown signal passes through the
sample. The known reference signal is split, with one component being phase shifted
by 90 degrees. The unknown signal is then mixed separately with both components
of the known reference signal. The merged signal consisting of unknown and non-
phase shifted reference is referred to as the I channel, or the in-phase signal, while the
unknown signal mixed with the 90 degree phase-shifted reference signal is referred to
as the Q channel, or the quadrature signal. By interpreting the I and Q signals as
real and imaginary values of a complex number, it is possible to find the amplitude
and phase of the unknown signal. A diagram of the setup is shown in Figure 2.3.
These concepts of quadrature detection are applied to microscopy to create the
OQM mode of the Keck 3DFM. Since coherent (HeNe laser) detection provides an
effective gain of |E_ref||E_sig|, low levels of light can be used for illumination,
minimizing cell exposure/damage.
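The I/Q-to-phase step described above is a direct complex-number conversion; a minimal sketch (an illustrative helper, not the actual Keck processing code):

```python
import numpy as np

def iq_to_amplitude_phase(I, Q):
    """Interpret the in-phase (I) and quadrature (Q) channels as the real
    and imaginary parts of a complex signal; return amplitude and phase."""
    z = np.asarray(I) + 1j * np.asarray(Q)
    return np.abs(z), np.angle(z)   # phase comes back wrapped into (-pi, pi]
```

Note that the recovered phase is wrapped modulo 2*pi, which is precisely what motivates the phase unwrapping problem discussed in the next section.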
OQM forms the motivation behind the phase unwrapping acceleration project. A
software implementation running in Matlab takes nearly two minutes to unwrap a
single frame. Speeding up the processing would make this modality much more useful
for processing large stacks of images.
Figure 2.3: Optical Quadrature Microscopy setup
2.3 Phase Unwrapping - Algorithms and Selection
As a preliminary step to the implementation of a phase-unwrapping algorithm on
hardware (in this case a Field Programmable Gate Array or FPGA), it is necessary
to verify the choice of phase unwrapping algorithm used. This section focuses on
analyzing the properties and the tradeoffs between the different algorithms described
by Ghiglia and Pritt [10] where more complete descriptions can be found.
The idea behind most phase-unwrapping algorithms is that the correct unwrapped
phase varies slowly, such that the gradient between pixels is less than a half-cycle, or π
radians. If this assumption holds true, a wrapped signal may be unwrapped by simply
summing (integrating in the continuous domain) until a gradient whose absolute value
exceeds π is reached, at which point an integer multiple of 2π is added to the phase
and the summation continues. If the data is noisy, however, phase gradients
greater than π can lead to image corruption over large segments of the data. Lower
levels of noise (i.e., below π) also lead to an accumulation of error that eventually
results in large deviations near the end of the accumulation. Residues (discussed
further on in this chapter) also contribute to incorrect unwraps. An example with
an embryo and the Matlab unwrap routine is shown below in Figure 2.4. To solve
this problem, various phase unwrapping algorithms have been developed, each with
differing tradeoffs in terms of quality and performance.
Figure 2.4: A raster unwrap using Matlab's 'unwrap' routine
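In one dimension, this naive integration (the idea behind a raster unwrap) can be sketched in a few lines; this is a minimal illustration, not the thesis implementation:

```python
import numpy as np

def unwrap_1d(psi):
    """Itoh's method: wrap each neighboring difference into [-pi, pi) and
    integrate. Correct only while the true gradient stays below pi."""
    d = np.diff(psi)
    d = (d + np.pi) % (2 * np.pi) - np.pi   # wrapped phase differences
    return psi[0] + np.concatenate(([0.0], np.cumsum(d)))
```

On clean data whose true gradient is below pi everywhere, this recovers the original signal exactly; a single noisy gradient above pi corrupts every subsequent sample, which is the failure mode described above.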
2.3.1 Path Following Algorithms
Path following algorithms solve the noisy data problem by selecting the path over
which to integrate. Goldstein's algorithm, one of the most common path-following
algorithms, operates by identifying residues (points where the integral over a closed
four-pixel loop is non-zero) and connecting them via branch cuts: paths which the
integration path may not cross.
One problem with Goldstein's algorithm is that it does not utilize all the data
available to guide the generation of branch cuts. By generating a map indicating the
quality of the data over the image, it is possible to unwrap instances that cannot
be handled by Goldstein's method. These quality maps may be user-supplied or
automatically generated using the pseudo-correlation, the variance of the phase
derivatives or the maximum phase gradient.
Quality maps may be combined with branch cuts, as in Goldstein's algorithm, to
form a hybrid mask-cut method, where the quality map is used to guide the placement
of branch cuts. Another approach, proposed by Flynn, detects discontinuities,
joins them into loops and adds the correct multiple of 2π if the action removes
more discontinuities than it adds. Flynn's minimum discontinuity solution can also
be used with a quality map to generate higher quality solutions.
Goldstein's Algorithm
The simplest algorithm in terms of computational complexity is Goldstein’s Branch
Cut Algorithm. It operates in the following way:
Step 1. Identify residues: This step is accomplished by integrating around a four-pixel
loop starting at pixel p0. If the sum is 2π then p0 is marked as a
positive residue, and if the sum is −2π then it is marked as a negative residue.
Step 2. Create Branch Cuts: This step operates by connecting residues together
by branch cuts until the sum of residue charges is zero.
Step 3. Integrate: Integration of the image is performed using a Breadth First
Search (BFS) exploration of all the pixels in the image. As each pixel is en-
countered, it is unwrapped, unless it lies on a branch cut. Next, segments of
the image that may have been isolated by branch cuts are unwrapped similarly.
Pixels lying on branch cuts are unwrapped separately at the end.
As can be inferred from the steps above, Goldstein's algorithm operates in O(N^2)
time while consuming O(N^2) space, where the input image is N × N.
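Residue detection (Step 1) can be sketched in a few lines of NumPy: sum the wrapped differences around every 2 x 2 pixel loop and divide by 2*pi. This is an illustrative sketch of the idea, not the thesis code:

```python
import numpy as np

def wrap(d):
    """Wrap a phase difference into [-pi, pi)."""
    return (d + np.pi) % (2 * np.pi) - np.pi

def residues(psi):
    """Integrate wrapped differences around every closed four-pixel loop.
    Returns +1 for positive residues, -1 for negative, 0 elsewhere."""
    d1 = wrap(psi[:-1, 1:] - psi[:-1, :-1])   # top edge
    d2 = wrap(psi[1:, 1:] - psi[:-1, 1:])     # right edge
    d3 = wrap(psi[1:, :-1] - psi[1:, 1:])     # bottom edge
    d4 = wrap(psi[:-1, :-1] - psi[1:, :-1])   # left edge
    return np.rint((d1 + d2 + d3 + d4) / (2 * np.pi)).astype(int)
```

A spiral (vortex) phase field produces exactly one residue at its center, which is the classic test case for this routine.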
The result of a phase unwrapping using Goldstein’s algorithm is shown in Figure
2.5. The problems with this method are immediately apparent. The segments of
the image that are of interest are mostly still wrapped with the amplitudes being
incorrect by a large margin.
Quality Maps
Quality maps are based on the concept of a user-supplied or auto-generated array that
defines the goodness of each phase value. These can be used to guide the unwrapping
since corrupted phase and residues usually have low quality values.
Unwrapping using quality maps works by first taking as an input the phase array
and the quality map. The quality map is either user input or based on the variance of
phase derivatives or the maximum phase gradient. The unwrapping is then performed
in a similar fashion as the Goldstein algorithm’s BFS exploration, except that the
adjoin list is not explored in FIFO order but according to quality. This leaves the
low-quality pixels to be unwrapped at the very end, thus effectively doing the same
thing as Goldstein's algorithm.

Figure 2.5: Goldstein's algorithm
The results of the quality mapped unwrapping are shown in Figure 2.6. There
are significant failures noticeable in the center of the lower embryo as well as around
the edges of both.
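The quality-driven ordering described above can be sketched with a priority queue, where the adjoin list is a heap keyed on negated quality so the best pixel is always unwrapped next (an illustrative sketch, not the thesis code):

```python
import heapq
import numpy as np

def quality_guided_unwrap(psi, quality):
    """Flood-fill unwrap that always grows from the highest-quality pixel
    on the adjoin list, so low-quality pixels are unwrapped last."""
    M, N = psi.shape
    phi = psi.astype(float).copy()
    done = np.zeros((M, N), dtype=bool)
    seed = np.unravel_index(np.argmax(quality), quality.shape)
    done[seed] = True
    heap = []  # the adjoin list, keyed on negated quality (max-heap)

    def push_neighbors(i, j):
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < M and 0 <= nj < N and not done[ni, nj]:
                heapq.heappush(heap, (-quality[ni, nj], ni, nj, i, j))

    push_neighbors(*seed)
    while heap:
        _, i, j, pi, pj = heapq.heappop(heap)
        if done[i, j]:
            continue
        # Unwrap relative to the already-unwrapped neighbor (pi, pj).
        k = np.round((psi[i, j] - phi[pi, pj]) / (2 * np.pi))
        phi[i, j] = psi[i, j] - 2 * np.pi * k
        done[i, j] = True
        push_neighbors(i, j)
    return phi
```

With a uniform quality map this degrades to an ordinary flood fill; with a meaningful map, corrupted pixels are deferred so their errors cannot propagate into good regions.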
Mask Cut Algorithm
The Quality Map method does not explicitly use all the information available (i.e.
residues). A hybrid approach called the Mask Cut Algorithm also exists. It uses
both quality maps and residues to guide the placement of branch cuts. It operates
as follows:
Figure 2.6: Quality Mapped algorithm
Step 1. Identify residues: This is performed as described in Section 2.3.1.
Step 2. Create mask cuts: This operates on the lowest quality pixels in the
image, gradually growing outwards using a BFS exploration technique and
marking the low quality pixels as being part of a mask cut once a residue is
encountered. The mask cut continues growing until the charge is balanced.
Step 3. Thin the mask cuts: Mask cuts tend to be thicker than branch cuts and
need to be thinned before integration. This step clears the mask on all mask
pixels that do not lie next to a residue and can be safely removed without
changing mask connectivity.
Step 4. Integrate: This is performed as described in Section 2.3.1.
For our datasets, the mask cut algorithm performs poorly, as can be seen in
Figure 2.7. The diagonal flecking and inconsistent phase changes render this technique
unusable for the OQM microscope.
Figure 2.7: Mask Cut algorithm
Flynn’s Algorithm
One method of phase-unwrapping is to segment the image along lines of discontinuity
into regions where each region has the same integer multiple of 2π associated with
it. This approach fails in the presence of high noise values or residues. Flynn’s
algorithm only segments regions along lines of discontinuity if the process of doing so
and adding the appropriate 2π multiple removes more discontinuities than it adds.
The algorithm works as follows:
Step 1. Compute jump counts: Here horizontal and vertical jump counts are
computed. A jump count is the integer associated with a region's 2π multiple.
Step 2. Scan nodes: Go over the set of nodes, adding edges and removing loops.
Terminate when no more changes are made.
Step 3. Compute unwrapped solution: The wrap counts are added to the
input phase data to get the final unwrapped solution.
There are various other optimizations performed on the image, such as the integration
of quality data, that are not described here. From Figure 2.8 it can be seen
that Flynn's algorithm provides a high-quality solution, the best amongst the path-following
algorithms discussed.
2.3.2 Minimum Norm Algorithms
These methods of phase-unwrapping seek to generate an unwrapped phase whose local
phase derivatives match the measured derivatives as closely as possible. This comparison
can be defined as the difference between the two, raised to some power p.
The simplest case is the unweighted least squares method, where p = 2. This family
of methods uses Fourier or DCT techniques to solve the least squares problem but
is vulnerable to noise. The pre-conditioned conjugate gradient (PCG) technique
overcomes this problem by using quality maps to zero-weight noisy regions so that
the unwrapped solution is not corrupted. There also exists a weighted multi-grid
algorithm that uses a combination of fine and coarse grids to converge on a solution.

Figure 2.8: Flynn's algorithm
Finally, the Minimum LP Norm algorithm solves the phase unwrapping problem for
p = 0. In this situation, the algorithm minimizes the number of discontinuities in the
unwrapped solution without concern for the magnitude of these discontinuities. This
value of p generally produces the best solution. This algorithm can be used with or
without user-supplied weights and for p < 2, it also generates its own data-dependent
weights. It iterates the PCG algorithm, which in turn iterates the DCT algorithm.
This results in the Minimum LP Norm algorithm having among the highest costs of
all the algorithms in terms of runtime and memory.
Preconditioned Conjugate Gradient
The Preconditioned Conjugate Gradient (PCG) algorithm iterates the unweighted
least squares algorithm in order to perform a weighted phase unwrap. This un-
weighted least squares technique minimizes the difference between the discrete par-
tial derivatives of the wrapped phase data and the discrete partial derivatives of the
unwrapped solution. The solution φ_{i,j} that minimizes
\[
\varepsilon^2 = \sum_{i=0}^{M-2}\sum_{j=0}^{N-2}\left(\phi_{i+1,j}-\phi_{i,j}-\Delta^{x}_{i,j}\right)^{2} + \sum_{i=0}^{M-2}\sum_{j=0}^{N-2}\left(\phi_{i,j+1}-\phi_{i,j}-\Delta^{y}_{i,j}\right)^{2} \qquad (2.1)
\]
forms the final solution. This solution can be reduced to the discretized Poisson
equation given by:
\[
\left(\phi_{i+1,j}-2\phi_{i,j}+\phi_{i-1,j}\right)+\left(\phi_{i,j+1}-2\phi_{i,j}+\phi_{i,j-1}\right)=\rho_{i,j} \qquad (2.2)
\]
where
\[
\rho_{i,j}=\left(\Delta^{x}_{i,j}-\Delta^{x}_{i-1,j}\right)+\left(\Delta^{y}_{i,j}-\Delta^{y}_{i,j-1}\right)
\]
This can be solved in the frequency domain by constructing reflections in the
x and y directions (in order to fulfill boundary condition requirements) and then
applying a two-dimensional Fourier transform to the input data. Alternatively, a two-
dimensional Discrete Cosine Transform (DCT) could be used. Applying the Fourier
transform to Equation 2.2 gives:
\[
\Phi_{m,n}=\frac{P_{m,n}}{2\cos(\pi m/M)+2\cos(\pi n/N)-4} \qquad (2.3)
\]
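This frequency-domain solve of the discretized Poisson equation can be sketched using SciPy's DCT routines (an assumption made here for illustration; the thesis implementation builds its DCT from an FFT instead):

```python
import numpy as np
from scipy.fft import dctn, idctn

def solve_poisson_dct(rho):
    """Solve the discretized Poisson equation (Eq. 2.2) by transforming,
    dividing by the denominator of Eq. 2.3, and inverse-transforming."""
    M, N = rho.shape
    P = dctn(rho, type=2, norm="ortho")          # forward 2-D DCT of rho
    m = np.arange(M).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    denom = 2 * np.cos(np.pi * m / M) + 2 * np.cos(np.pi * n / N) - 4
    denom[0, 0] = 1.0                            # avoid 0/0 at the DC term
    Phi = P / denom
    Phi[0, 0] = 0.0                              # the mean is unconstrained; pin it
    return idctn(Phi, type=2, norm="ortho")
```

The DCT implicitly imposes the reflective (Neumann) boundary conditions mentioned above, so the solution is recovered up to an arbitrary additive constant.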
The PCG algorithm uses the Conjugate Gradient (CG) method to solve the discretized
Poisson equation. The CG technique has rapid and robust convergence and is
guaranteed to converge in N iterations for an N × N matrix (barring roundoff error).
However, the actual number of iterations depends on the condition of the original
matrix: if the original matrix is close to the identity matrix, the iteration converges
rapidly. In order to achieve this condition, a preconditioning step is applied that
solves an approximate problem, the unweighted least-squares phase unwrapping.
After this, the usual CG steps follow. The algorithm is laid out in the pseudocode below:
Compute residual r_k of weighted phase Laplacians
Initialize solution phi to zero
for (k = 0 to MaxNumberOfIterations - 1)
    Solve P z_k = r_k using unweighted phase unwrapping to get z_k
    Use the CG method to solve for phi using z_k
    if solution lies within predefined convergence bounds
        exit loop
end
Figure 2.9: Preconditioned Conjugate Gradient Algorithm pseudo-code
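The CG kernel that PCG iterates can be sketched generically; the version below works on a dense symmetric positive-definite system in plain NumPy (the thesis applies the same iteration to the weighted Laplacian operator instead):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A by conjugate gradient."""
    n = len(b)
    max_iter = max_iter or n       # guaranteed convergence in n steps (exact arithmetic)
    x = np.zeros(n)
    r = b - A @ x                  # initial residual
    p = r.copy()                   # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # optimal step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # converged within bounds
            break
        p = r + (rs_new / rs) * p  # new A-conjugate search direction
        rs = rs_new
    return x
```

Preconditioning amounts to replacing the raw residual with the solution of the approximate (unweighted) problem at each step, which improves the effective condition of the system exactly as described above.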
As can be seen in Figure 2.10, the PCG algorithm produces a smooth, continuous
result, but the image has a gradually greater magnitude going from right to left. This
affects the foreground as well as the background, rendering the technique of
limited use. However, the conjugate gradient method presented here will be used
later for the Minimum LP Norm algorithm.
Figure 2.10: Preconditioned Conjugate gradient algorithm results
Minimum LP Norm
The Minimum LP Norm method is similar to the PCG method in that it also aims to
minimize the difference in gradients between the measured and calculated phases.
However, PCG sets p = 2, the least squares norm, whereas the Minimum LP Norm
algorithm uses p = 0. This means that the Minimum LP Norm algorithm minimizes
the number of points where the gradients of the measured data differ from the
calculated solution, whereas the PCG algorithm minimizes the square of the differences,
which spreads the error such that the measured gradients match the solution
exactly almost nowhere.
For the Minimum LP Norm algorithm, we are trying to solve Equation 2.4.
\[
(\phi_{i+1,j}-\phi_{i,j})U(i,j)+(\phi_{i,j+1}-\phi_{i,j})V(i,j)-(\phi_{i,j}-\phi_{i-1,j})U(i-1,j)-(\phi_{i,j}-\phi_{i,j-1})V(i,j-1)=c(i,j) \qquad (2.4)
\]
where
\[
c(i,j)=\Delta^{x}_{i,j}U(i,j)-\Delta^{x}_{i-1,j}U(i-1,j)+\Delta^{y}_{i,j}V(i,j)-\Delta^{y}_{i,j-1}V(i,j-1)
\]
and U and V are data-dependent weights and c is the weighted phase Laplacian.
This equation can be rewritten as a matrix equation, as in Equation 2.5:
\[
Q\phi = c \qquad (2.5)
\]
which is solvable by the PCG method discussed in Section 2.3.2. A description
of the algorithm is given below in pseudocode.
Initialize solution phi_0 to zero
for (k = 0 to maxIterations)
    Compute residual R
    If R has no residues, exit
    Compute data-dependent weights U and V
    Compute weighted phase Laplacian c
    Subtract c from the weighted phase Laplacian of the current solution
        (the left-hand side of the Lp norm equation)
    Solve Q phi = c with PCG
end
if (no residues in residual)
    Unwrap using Goldstein's algorithm
else
    Apply post-processing congruency operation
end
Figure 2.11: Minimum LP Norm Algorithm pseudo-code
The full implementation also applies user-input or dynamically generated maps.
The results of the Minimum LP Norm algorithm are displayed below. As can
be seen, it produces the best quality images thus far, slightly better than Flynn's
method. There are, however, several incorrect areas, such as the cell boundaries in the
lower embryos. This is partly because the algorithm reached its maximum number
of iterations without eliminating all residues from the residual.
Figure 2.12: The Minimum LP Norm algorithm
Multi-Grid
Multi-grid methods enable the rapid solution of PDEs on large grids. They usually
operate as fast as Fourier methods but are more size-generic (i.e. they can handle
non power-of-two sized arrays). This algorithm, while theoretically operating as fast
as or faster than the PCG algorithm, fails to produce meaningful results, as shown in
Figure 2.13, and is not discussed further.
Figure 2.13: The Multi-grid algorithm
2.3.3 Summary and Conclusions
The primary criterion by which these algorithms were evaluated was the quality of their
unwraps. The path-following algorithms operated quickly, but often had isolated
sections that were unwrapped poorly. The minimum norm algorithms produced smooth
solutions, but often had large errors. However, the Minimum LP Norm algorithm
produced the best overall solution at the expense of the greatest computation time.
2.4 Bitwidth Analysis
Implementing an algorithm with full floating point accuracy is usually not feasible
on either Digital Signal Processors (DSPs) or on Field Programmable Gate Arrays
(FPGAs) due to either a lack of support on the former or the formidable size
requirements on the latter platform. Thus, before implementing any new floating point
algorithm on these platforms, it is important both to verify that the data can be
converted to fixed point while the algorithm still operates accurately, and to discover
the minimum bitwidth that can be used. Finding this minimum bitwidth can result
in large area savings in hardware. It is less important for the GPU implementation
since GPUs support floating point natively in hardware.
2.4.1 Implementation
The operations performed by the DCT and the intermediate floating point operations
are all scalable, with a possible loss of precision. For example, if f(x) represents
the DCT and Poisson calculation and f(x) = V, then f(Cx) = CV. This allows the
fixed point implementation to be performed by multiplying the single precision
input data x : −1 < x < 1 by a scaling factor C = 2^p, where p represents the number
of bits to be shifted. After the scaling, a cast to an integer data type truncates all
data after the decimal point. After the calculation f is performed, the result is then
cast to floating point and scaled down.
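The scale-truncate-rescale procedure above can be sketched as follows; the transform f here is a toy linear stand-in for the DCT and Poisson calculation, since any linear f satisfies f(Cx) = Cf(x).

```python
# Sketch of the fixed-point conversion: multiply input data in (-1, 1) by
# C = 2**p, truncate to integers, run the (linear) computation, then cast
# back and scale down.  f is a toy stand-in for the DCT/Poisson step.

def f(data):
    # pairwise sum/difference; any linear f satisfies f(C*x) = C*f(x)
    return [data[0] + data[1], data[0] - data[1]]

def fixed_point_f(x, p):
    C = 2 ** p
    scaled = [int(v * C) for v in x]       # scale by 2**p and truncate
    result = f(scaled)                     # integer-only computation
    return [v / C for v in result]         # cast back to float, scale down

x = [0.7071, -0.25]
exact = f(x)
approx = fixed_point_f(x, 24)              # p = 24, a 24-bit-style scale
err = max(abs(a - b) for a, b in zip(exact, approx))
```

Shrinking p widens the error, which is exactly the trade-off the bitwidth experiments below explore.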
The FFT used in the DCT implementation was KISS FFT [21], a simple open-
source implementation that supports both fixed and floating point.
2.4.2 Results
The quality of the images produced by the different bitwidths was analyzed by visual
quality, by the lack of unwrapped sections and discontinuous jumps, and by the
difference from the full double-precision implementation shown in Figure 2.14.
Figure 2.14: Image produced by a double-precision unwrap
The double embryo image shown in this analysis was just one of the benchmark
images used to determine the optimal bitwidth. However, this image presents a
challenging unwrap that does not converge completely before reaching the maximum
number of iterations. Hence it is among the most difficult real-world images.
The 27-bit unwrap shown in Figure 2.15 has small isolated areas of very low
phase. This stretches the overall magnitude range of the data and causes wild visual
variation between this and the double-precision version.
The 28-bit and 29-bit versions in Figure 2.16 and Figure 2.17 do not have the
isolated low phase regions and so present an unwrap very similar to that of the
Figure 2.15: Using a bitwidth of 27
double-precision version. One noticeable difference is that the lines of discontinuity
(present partly because the unwrap did not converge to zero residues before reaching
the maximum number of iterations) shift from version to version. These lines can be
noticed running through cells in the lower embryo. Either precision is suitable for
the needs of this implementation.
Figure 2.16: Using a bitwidth of 28
Figure 2.17: Using a bitwidth of 29
2.5 Related Work
The related work sections on FPGAs and GPUs discuss slightly different implementations.
The FPGA implementation presented in this paper performs the preconditioning
step of the PCG algorithm, which, while the most computationally intensive part,
is only a small section of the overall algorithm. This preconditioning step involves a
DCT and some floating point computation. The GPU implementation, on the other
hand, implements the entire conjugate gradient calculation.
2.5.1 FPGAs
The popularity of the JPEG and MPEG standards has resulted in a proliferation
of hardware implementations of the DCT targeted towards specific sizes, most com-
monly the 8x8 2D implementation. This is usually accomplished by multiplying an
8x8 block from the image data by an 8x8 coefficient matrix, resulting in an output
matrix that contains the component frequencies. This is a fairly expensive operation,
requiring 4096 additions and another 4096 multiplications. There have been many
publications detailing the implementation of such 2D matrix multiplications using
distributed arithmetic (DA) which reduces the number of multiplications required,
but which still results in relatively large, low-latency hardware. Woods et al. [30]
used a combination of 1D DCTs, distributed arithmetic, and transpose buffers on a
Xilinx XC6264 to generate such a design that utilized 30 percent of total chip area.
Bukhari et al. [18] investigated the implementation of a modified Loeffler algorithm
for 8x8 DCTs on various FPGAs. Siu and Chan proposed a multiplierless VLSI
implementation[5] for an 11x11 2D transform and many other variations on small 2D
DCTs exist.
Larger 2D DCTs can be implemented using 1D DCTs by taking advantage of the
DCT’s separability property. This is accomplished by first taking the DCT of all the
rows and then of all the columns. However, there do not exist many implementations
of large 1D DCTs for reconfigurable systems. In [31], an 8 to 32 point core with a
maximum of 24-bit precision was implemented using distributed arithmetic for the
vector-coefficient matrix multiplication. A 32 point, 24 bit instance of this core has
a latency of only 32 cycles, but consumes 10588 LUTs which makes the approach
impractical for large designs. In contrast, our approach is much more compact and
therefore enables larger transform sizes at the cost of higher latency. Leong [19]
implements a variable radix, bit serial DCT using a systolic array but only describes
area requirements for designs up to N=32 which consumes between 457 and 1363
adders and has a high worst case error. In comparison, our approach supports much
larger transform sizes with demonstrated designs of up to 1024 points.
The Spiral project uses a heuristic algorithm to explore the DFT design space with
performance feedback to generate a hardware-software DFT implementation, but the
comparison is with a software implementation on the embedded PowerPC which is
severely outclassed by a modern desktop processor. The project does however include
a customizable 1D DCT implemented in Verilog that is available from their website
and is the only available large 1D-DCT found[8].
Shirazi et al. implemented a 2-D FFT, complex multiply and a 2-D IFFT on the
Splash-2 computing engine using a non-standard 18-bit floating point representation.
The hardware they target is the Xilinx 4010 and, for the purposes of that application,
they use 34 FPGA chips to achieve an adequate throughput. The nature of
their application also does not require sending data back to the host[25]. Our imple-
mentation on the other hand uses just a single, more modern FPGA and a higher
accuracy, semi-floating point representation. Dillon implements a high performance
floating point 2D FFT that would greatly improve our own results if integrated into
our design [9]. Bouridane et al. implement a high performance 2D FFT as well but
on images half the size of ours and with similar performance [14].
2.5.2 GPUs
With the introduction of unified shaders with DirectX 10 and above, the GPGPU
community has witnessed an explosion in the number of suitable applications and the
papers implementing such algorithms on graphics platforms. The papers discussed
here reflect only those closely related to the higher level conjugate gradient kernel or
preconditioning of the matrix which is what was implemented on the GPU in this
thesis.
Karasev et al. [16] implement 2D phase unwrapping on NVIDIA GPUs using Cg
and achieve a 35x speedup. They implement the weighted least
squares algorithm, similar to the PCG and multigrid algorithms discussed previously
and apply it to Interferometric Synthetic Aperture Radar (IFSAR) data. They chose
to use multigrid and Gauss-Seidel iterations to solve the minimization problem and
compare their results to C and Matlab implementations. However, multigrid tech-
niques do not work on our datasets as has been shown previously in Cary Smith’s
research [26] as well as in the experiments executed as part of this thesis. Their algo-
rithm also requires a very high number of iterations to converge (on the order of tens
of thousands), a known result of using Gauss-Seidel, unlike the PCG or Minimum
LP norm algorithm which require tens or hundreds of iterations. Thus their total
computation time is greater than ours.
Similar in some respects to the work of Karasev, Bolz et al. [3] implement
sparse matrix conjugate gradient solvers and regular grid multi-grid solvers on GPUs.
They achieve only modest speedups over CPUs as they work with the ATI 9700 and
Geforce FX generation of hardware and OpenGL and are thus limited to working
within the constraints of the classical graphics pipeline rather than the unified shader
architecture of the DirectX 10+ compatible video cards.
The 2D DCT required by the preconditioning step uses a 2D FFT, a complex
multiply and some reordering. NVIDIA provides a high performance FFT library
called CUFFT [22] that has been benchmarked by HP and shown to provide a 3x
speedup for large transform sizes[11] as measured against a highly optimized software
implementation on a multicore HP server available as part of Intel's MKL library [13].
The breakeven point at which a GPU implementation becomes feasible is between
the 512× 512 and the 1024× 1024 matrix sizes which see a speedup of about 0.9 and
3 respectively in real-world scenarios. Our input data set is 1024×512 and so should
see some improvement.
Chapter 3
Implementation
3.1 Timing Profile
3.2 The Discrete Cosine Transform
The Discrete Cosine Transform (DCT) is used in a wide variety of applications such as
image and audio processing due to its compaction of energy into the lower frequencies.
This property is exploited to produce efficient frequency-based compression methods
in various image and audio codecs such as JPEG and MPEG. However, the DCT
is also used in other applications that require larger sized transforms such as those
using the Preconditioned Conjugate Gradient (PCG) technique in applications like
adaptive filtering [12] and phase unwrapping [10]. This paper presents an algorithm,
first developed by Makhoul [20], and an implementation of it that utilizes a Fast
Fourier Transform (FFT) core to compute a DCT without significantly increasing
overall latency as compared to just a FFT core. The advantage of this approach is
the ready availability of a large number of FFT cores in both fixed-point [32] and
floating-point [1] formats which can be easily dropped in with minimal modifications
CHAPTER 3. IMPLEMENTATION 46
to the overall design.
3.2.1 Algorithm
The general algorithm presented here was first discussed by Makhoul[20]. It is an
indirect algorithm for computing DCTs using FFTs. The steps are presented as well
as their correspondence to the computation done in hardware.
Given an input signal x(n), the DFT of that signal is given by:
X(k) = \sum_{n=0}^{N-1} x_n e^{-j2\pi nk/N}, \qquad k = 0 \ldots N-1
The cosine transform can be viewed as the real part of X(k), which is equivalent
to saying that it is the Fourier transform of the even extension of x(n), given that
x(n) is causal (i.e. x(n) = 0, n < 0). This is the inspiration for the usual technique
for implementing the DCT, which is by mirroring the set of real inputs and taking
the real DFT of the resulting sequence. This mirroring can be performed in any of
four ways: around the n=-0.5 and n=N-0.5 sample points, around n=0 and n=N,
around n=-0.5 and n=N, and finally around n=0 and n=N-0.5. All of these methods
result in slightly different DCTs. The most commonly used even extension is the one
depicted in Figure 3.1 and this will be the focus of the algorithm and implementation
presented.
This category of DCT, obtained by taking the DFT of a 2N point even extension,
is known as a DCT Type II or DCT-II and is defined as:
X(k) = 2\sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)\pi k}{2N}, \qquad k = 0 \ldots N-1
Figure 3.1: Even extension around n=-0.5 and n=N-0.5
with the even extension defined as:
x'(n) = \begin{cases} x(n) & n = 0 \ldots N-1 \\ x(2N-n-1) & n = N \ldots 2N-1 \end{cases}
The DCT-II can be shown to be solvable via DFT by noting that:
\begin{aligned}
X(k) &= \sum_{n=0}^{2N-1} x'_n e^{-j\pi nk/N} \\
&= \sum_{n=0}^{N-1} x_n e^{-j\pi nk/N} + \sum_{n=N}^{2N-1} x'_n e^{-j\pi nk/N} \\
&= e^{j\pi k/2N} \sum_{n=0}^{N-1} x_n \left[ e^{-j\pi(2n+1)k/2N} + e^{j\pi(2n+1)k/2N} \right] \\
&= 2 e^{j\pi k/2N} \sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)\pi k}{2N}
\end{aligned}
which is identical to the definition of the DCT above except for a multiplicative
factor of e^{j\pi k/2N}. A similar method can be used to write an IDCT in terms of a length
2N complex IDFT. Full details can be found elsewhere [20].
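The identity above can be checked numerically: the first N bins of the 2N-point DFT of the even extension, with the e^{jπk/2N} factor removed, equal the DCT-II. The Python below uses naive direct sums (no FFT) and is illustrative only.

```python
import cmath
import math

# Check: DCT-II of x equals exp(-j*pi*k/2N) times the 2N-point DFT of the
# even extension x'(n), taking the real part.  Naive sums, small N.

def dct2_direct(x):
    N = len(x)
    return [2 * sum(x[n] * math.cos((2 * n + 1) * math.pi * k / (2 * N))
                    for n in range(N)) for k in range(N)]

def dct2_via_dft(x):
    N = len(x)
    ext = x + x[::-1]                      # even extension x'(n), length 2N
    X = []
    for k in range(N):
        dft_k = sum(ext[n] * cmath.exp(-1j * math.pi * n * k / N)
                    for n in range(2 * N))
        X.append((cmath.exp(-1j * math.pi * k / (2 * N)) * dft_k).real)
    return X

x = [0.5, -1.0, 2.0, 0.25]
err = max(abs(a - b) for a, b in zip(dct2_direct(x), dct2_via_dft(x)))
```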
The performance of the DCT in terms of latency and area can be further im-
proved upon such that an N point real DFT/IDFT may be used. The method for
accomplishing this is outlined below.
A sequence v(n) can be constructed from x(n) such that it follows the restriction:
v(n) = \begin{cases} x(2n) & n = 0 \ldots \frac{N-1}{2} \\ x(2N-2n-1) & n = \frac{N+1}{2} \ldots N-1 \end{cases} \qquad (3.1)
When the DFT of v(n) is computed and the result multiplied by 2e^{-j\pi k/2N}, the
resulting sequence can be written as:

X(k) = 2\sum_{n=0}^{N-1} v_n \cos\frac{(4n+1)\pi k}{2N}, \qquad k = 0 \ldots N-1
which is an alternative version of the DCT based on v(n).
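This N-point formulation can also be verified numerically, again with naive DFT sums rather than an FFT; the reordering follows Equation 3.1 (even samples in order, then odd samples reversed).

```python
import cmath
import math

# Check of the N-point formulation: reorder x into v(n), take an N-point
# DFT, rotate by 2*exp(-j*pi*k/2N); the real part is the DCT-II of x.
# Naive direct sums, illustrative only.

def dct2_via_vn(x):
    N = len(x)
    v = x[0::2] + x[1::2][::-1]            # v(n) reordering (Eq. 3.1)
    X = []
    for k in range(N):
        V_k = sum(v[n] * cmath.exp(-2j * math.pi * n * k / N)
                  for n in range(N))
        X.append((2 * cmath.exp(-1j * math.pi * k / (2 * N)) * V_k).real)
    return X

x = [0.5, -1.0, 2.0, 0.25]
direct = [2 * sum(x[n] * math.cos((2 * n + 1) * math.pi * k / (2 * len(x)))
                  for n in range(len(x))) for k in range(len(x))]
err = max(abs(a - b) for a, b in zip(direct, dct2_via_vn(x)))
```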
Again, for the IDCT, the real sequence X(k) can be rearranged to form a complex
Hermitian symmetric sequence V(k), where Hermitian symmetry is defined as
V(N-k) = V^*(k) and the sequence itself as:

V(k) = \frac{1}{2} e^{j\pi k/2N} \left[ X(k) - jX(N-k) \right], \qquad k = 0 \ldots N-1
An IDFT on V (k) generates the v(n) sequence described earlier, which can then be
rearranged to form x(n). This is the method used in the implementation discussed
in the later sections of this paper. However, for both the size N DCT and IDCT
it should be noted that the input sequences are either entirely real or Hermitian
symmetric and can thus be computed using FFTs with a point size of N/2 [23, 20].
This can be done by setting alternating elements of v(n) to the real and imaginary
parts of a new sequence t. That is,

t(n) = v(2n) + jv(2n+1), \qquad n = 0 \ldots \frac{N}{2} - 1
The DFT of this sequence can then be computed and the original V (k) extracted
by using:
V(k) = \frac{1}{2}\left[ T(k) + T^*\!\left(\frac{N}{2}-k\right) \right] - \frac{j}{2}\, e^{-j2\pi k/N} \left[ T(k) - T^*\!\left(\frac{N}{2}-k\right) \right].
This gives the original V (k) which can subsequently be used to generate X(k).
A similar method can be applied to the real IFFT to realize similar savings.
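The packing trick can likewise be checked numerically; the sketch below packs a real length-N sequence into an N/2-point complex DFT and reconstructs V(k) with the formula above. Indices into T are taken modulo N/2, since the DFT is periodic. Naive sums, illustrative only.

```python
import cmath
import math

# Check of the packing identity: pack real v into complex t of length N/2,
# take its DFT T, and rebuild V(k) = DFT_N(v)(k) for k = 0..N/2-1.

def dft(seq):
    M = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * math.pi * n * k / M)
                for n in range(M)) for k in range(M)]

v = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.125, 3.0]   # real, N = 8
N = len(v)
t = [complex(v[2 * n], v[2 * n + 1]) for n in range(N // 2)]
T = dft(t)

V_packed = []
for k in range(N // 2):
    Tc = T[(N // 2 - k) % (N // 2)].conjugate()       # T*(N/2 - k)
    V_packed.append(0.5 * (T[k] + Tc)
                    - 0.5j * cmath.exp(-2j * math.pi * k / N) * (T[k] - Tc))

V_direct = dft(v)                                     # full N-point DFT
err = max(abs(a - b) for a, b in zip(V_packed, V_direct[:N // 2]))
```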
The implementation described in this paper does not use the last FFT optimiza-
tion (that reduces required transform size from N to N/2) since it involves an extra
multiplication step that would reduce the accuracy of the results, since the data being
used is fixed-point. However, for a floating-point or low precision implementation,
this would be a feasible optimization.
3.3 The One Dimensional Implementation
The description of the algorithm for both the forward and inverse DCT lends itself to
a clearly defined component breakdown in terms of hardware. For the DCT, the first
component is the one that creates v(n) by re-ordering the input sequence and writing
it to memory. The second component is the actual FFT that transforms the shuffled
input data into the frequency domain. Last of all is the component that multiplies
the output V(k) by 2e^{-j\pi k/2N} and extracts the desired output values from the complex
FFT output. Roughly the same components are required for the inverse DCT but in
reverse order. First of all, a multiplication of a re-arranged sequence Y(k) by 0.5e^{j\pi k/2N}
is performed, where Y(k) = X(k) - jX(N-k). Then the data is passed through an
inverse FFT of size N, followed by the mapping of v(n) to x(n). This organization of
the components is depicted in Figure 3.2 and Figure 3.3 for the forward and inverse
transforms respectively.
Figure 3.2: Components and dataflow for the forward DCT transform
3.3.1 Shuffle
As input data is sent sequentially into the DCT core, the first stage of processing
that occurs is the generation of the v(n) sequence. This occurs within the shuffle
component. The shuffle has a latency of one clock cycle and calculates output indices
based on the input index according to Equation 3.1. Since the shuffle component only
affects index values, all addition and subtraction performed within it is of bitwidth
log2N . Based on these output indices, the sample value is written to block RAM in
shuffled order as shown in Figure 3.4. This step of writing to block RAM is necessary
Figure 3.3: Components and dataflow for the inverse DCT transform
since the FFT component takes in input in sequential order but the shuffle produces
output non-sequentially.
For an inverse DCT shuffle, the FFT output data is re-arranged in the opposite
direction, forming x(n) from v(n). This is also written to block RAM before trans-
mitting back to the host since the data will not be generated in sequential order.
3.3.2 Fast Fourier Transform
The complex FFT used was provided by Xilinx LogicCore and generated using Core-
gen [32]. It allows for a range of options, including a parameterized bit-width of 8 to
24 bits for both input and phase factors, the use of either block RAM or distributed
RAM, the choice of algorithm and rounding used and the ability to set the output
Figure 3.4: Re-ordering pattern in a forward shuffle
ordering.
For the applications envisaged for larger DCTs, it was necessary to set the
bitwidth to a large size to maximize precision. To this end a 24 bit signed input
was used along with a block floating-point exponent for each 1D transform com-
pleted. This exponent field reduces the need to increase output bitwidth after each
FFT. Block RAM, a radix-4 block transform and bit reversed ordering were also
selected.
Since the algorithm optimization for computing a real or Hermitian symmetric
FFT using a transform of length N/2, as mentioned in the previous section, wasn’t
used due to precision issues, the imaginary input for the FFT was tied to zero. In
addition, the FFT core was set up to support both forward and inverse transforms
simultaneously as well as to have run-time configurable transform length.
3.3.3 Rebuild rotate
The rebuild rotate component implements the multiplication by 2e^{-j\pi k/2N} for the
forward transform and 0.5e^{j\pi k/2N} for the inverse. These complex exponentials can be
converted to a format consisting of sines and cosines by using Euler's formula. For
example, the forward transform is equivalent to 2(\cos(-\pi k/2N) + j\sin(-\pi k/2N)).
The Coregen Sine Cosine Lookup Table 5.0 component used has a mapping between
the input integer angle T and the calculated \theta of \theta = T \cdot 2\pi / 2^{T\_WIDTH}. Thus, for
\theta = -\pi k/2N, and noting that 2^{T\_WIDTH} = N, the input angle works out to be -k/4 and
k/4 for the forward and inverse transforms respectively. The sine and cosine of the index k are
generated and then multiplied by the results of the FFT using a complex multiply
with a latency of six cycles. This generates a 48-bit output, of which only the first
24 bits are retained.
The overall dataflow of this component is depicted in Figure 3.5. Note that for
the forward DCT transform only the real output of the complex multiplication is
used. The full functionality of the complex multiplier is retained however, since the
inverse transform requires it for the initial step as shown in Figure 3.3.
3.3.4 Dynamic Scaling
The hardware implementation was required to be as close as possible to a floating
point software implementation as detailed in the bitwidth analysis section. In order
to achieve this level of accuracy with a fixed point FFT, it was necessary to scale the
input data on the fly to maximize available dynamic range. The expanded dataflow
Figure 3.5: The rebuild component - Forward Transform
diagram in Figure 3.6 shows the components used in this procedure.
The max tracker component receives incoming streaming floating point data
from SRAM and records the maximum exponent of the 1D frame. It does this by
using a comparator to see if the internally registered data is less than the incoming
value. If it is, the internal register is overwritten with the new value. The final
output of this component is what the entire data frame must be shifted by. This
output is calculated as 23− (MAX−126). The value 23 is used since the fixed point
FFT uses 24 bit data and the float to fixed point conversion will round numbers less
than one to zero. The 126 is to compensate for the exponent bias in IEEE compliant
floating point representation. Converting the register MAX into two’s complement
and simplifying, the calculation is performed as not(MAX) + 150.
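The two's-complement identity above can be checked exhaustively over the 8-bit exponent field; the function names below are illustrative.

```python
# Check: with MAX an 8-bit IEEE-754 single-precision exponent field,
# 23 - (MAX - 126) equals not(MAX) + 150 in 8-bit two's-complement
# arithmetic (since -MAX = not(MAX) + 1).

def shift_direct(max_exp):
    return (23 - (max_exp - 126)) & 0xFF   # 149 - MAX, modulo 2**8

def shift_hw(max_exp):
    return ((max_exp ^ 0xFF) + 150) & 0xFF # not(MAX) + 150, modulo 2**8

mismatches = sum(shift_direct(m) != shift_hw(m) for m in range(256))
```

Replacing the subtraction with a bitwise inversion and a constant add is cheaper in hardware than a two-input subtractor.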
Figure 3.6: The 1D transform including dynamic scaling
The scale component takes in data from BRAM, and adjusts the floating point
exponent field according to the MAX value calculated above. Similarly, the rescale
component adjusts the data frame back to its original range by subtracting the MAX
value. The rescale component also adds the block exponent produced by the FFT
as well as a constant multiplicative power of two introduced by the algorithm. This
can be summarized as:
EXP_{output} = EXP_{input} + EXP_{block} - EXP_{scale} + EXP_{constant}

where EXP_{constant} is either 4 or 3 if the transform length is 1024 or 512 respectively.
3.3.5 Data Type Conversion
The unwrapping software deals with all its data as single-precision 32-bit floating
point values. However, given the limitation of only having a fixed point 24-bit FFT
available, some form of conversion between the two formats is necessary.
The RCL VFLOAT library [2] contains parameterizable float to fixed point and
fixed to floating point units capable of performing the necessary conversions. Data
coming out of the scale component is streamed into the float to fixed point com-
ponent and then transferred to the FFT. The output of the FFT is converted back
to floating point and stored. The data-type conversion encapsulates the core DCT
logic, which is further encapsulated by the dynamic scaling logic.
3.4 The Two-Dimensional Extension
The DCT, similar to the FFT, is separable. This means that a two-dimensional DCT
can be constructed by performing the 1D DCT for each of the rows followed by a 1D
DCT of the columns of the resulting matrix. This technique is called the row-column
decomposition method.
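The row-column decomposition can be demonstrated with a small sketch, using a naive 1D DCT-II (as defined in Section 3.2.1) and comparing against the direct 2D double sum. Illustrative only; the hardware uses the FFT-based 1D DCT core instead of these direct sums.

```python
import math

# Separability check: 1D DCT-II of every row, then of every column,
# matches the direct 2D DCT-II.

def dct1(x):
    N = len(x)
    return [2 * sum(x[n] * math.cos((2 * n + 1) * math.pi * k / (2 * N))
                    for n in range(N)) for k in range(N)]

def dct2d(img):
    rows = [dct1(r) for r in img]               # 1D DCT of each row
    cols = [dct1(list(c)) for c in zip(*rows)]  # then of each column
    return [list(r) for r in zip(*cols)]        # transpose back

img = [[1.0, 2.0], [3.0, 4.0]]
out = dct2d(img)

# direct 2D DCT-II for comparison
N1, N2 = len(img), len(img[0])
direct = [[4 * sum(img[n1][n2]
                   * math.cos((2 * n1 + 1) * math.pi * k1 / (2 * N1))
                   * math.cos((2 * n2 + 1) * math.pi * k2 / (2 * N2))
                   for n1 in range(N1) for n2 in range(N2))
           for k2 in range(N2)] for k1 in range(N1)]
err = max(abs(out[i][j] - direct[i][j]) for i in range(N1) for j in range(N2))
```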
The key to extending the one-dimensional DCT discussed earlier into
two dimensions is exploiting the onboard SRAM banks to store entire images at a
time and calculating the transpose of the matrix after each DCT iteration.
3.4.1 The SRAM Controller
3.4.2 Calculating The Transpose
The purpose of the transpose component is to calculate the write address of data
coming out of the DCT component. These addresses should flip the matrix along
the diagonal. This can be accomplished by switching the row and column indices,
or rather, since the SRAM memory is linearly addressed, by multiplying the column
index by the length of a row and adding the row index. In equation form:
write_addr = (col_addr / 2) * row_length + row_addr

where row_length = 512 or 1024 depending on orientation.
The column address is divided by 2 by dropping the last bit. This truncation is
necessary for the data-packing of two output values into each SRAM word to occur.
Additionally, since row length is a power of two, the multiplication and addition is
accomplished by appending the row index to the column index. This arrangement
requires minimal resources and is accomplished within a single cycle.
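The bit-concatenation shortcut can be checked against the arithmetic form; the sketch below uses the 512-wide orientation, and the function names are illustrative.

```python
# Check: with a power-of-two row length, (col_addr / 2) * row_length +
# row_addr reduces to dropping the LSB of the column index and appending
# the row index.  512-wide orientation shown.

ROW_LENGTH = 512
ROW_BITS = ROW_LENGTH.bit_length() - 1     # log2(512) = 9

def write_addr_arith(col_addr, row_addr):
    return (col_addr >> 1) * ROW_LENGTH + row_addr

def write_addr_concat(col_addr, row_addr):
    return ((col_addr >> 1) << ROW_BITS) | row_addr

mismatches = sum(write_addr_arith(c, r) != write_addr_concat(c, r)
                 for c in range(1024) for r in range(ROW_LENGTH))
```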
3.5 Division And Scaling
The section of code that performs the division and scaling that occurs between the
forward DCT and inverse DCT calculations is as follows:
for (j = 0; j < 1024; j++)
    for (i = 0; i < 512; i++)
    {
        if (i == 0 && j == 0)
            array[0] = 0;
        else
            array[j*512+i] = array[j*512+i] / (4 - 2*cos(i*pi/511) - 2*cos(j*pi/1023));
    }
This segment of code scales the image by the factor 4 - 2cos(i\pi/511) - 2cos(j\pi/1023).
There are two efficient ways to perform this calculation. The first
is precomputing the factor for the entire 1024 by 512 image. This is too large to
load into BRAM and so must be stored in SRAM. It will also have to be loaded
into memory at startup, although this added latency can be amortized over multiple
transforms, as long as the FPGA accelerator board is not reset. This requires added
complexity to the SRAM memory design, but does not require the two floating point
add units which are needed for the second method detailed below.
The second approach is to precompute only the 2cos(i\pi/511) and 2cos(j\pi/1023)
terms and store them into BRAM, since they occupy a relatively small amount of
terms and store them into BRAM, since they occupy a relatively small amount of
space. These initial values can be integrated into the FPGA bitstream and thus do
not need to be loaded onto the board after initialization. As mentioned above, the
drawback to this is the addition of two floating point adders. This method was chosen
because of the relatively low area requirements as well as the lower complexity of the
required controller.
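The chosen table-based approach can be sketched as follows; scale_pixel is a hypothetical name for the per-pixel operation, with the two cosine tables standing in for the BRAM contents, and the table sizes following the loop bounds in the code above.

```python
import math

# Sketch of the second (chosen) approach: precompute two small 1D cosine
# tables (small enough for BRAM) and combine them per pixel, instead of
# storing the full 1024x512 factor.  scale_pixel is a hypothetical name.

COS_I = [2 * math.cos(i * math.pi / 511) for i in range(512)]
COS_J = [2 * math.cos(j * math.pi / 1023) for j in range(1024)]

def scale_pixel(value, i, j):
    if i == 0 and j == 0:
        return 0.0                         # DC term is zeroed, not divided
    return value / (4 - COS_I[i] - COS_J[j])
```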
Both approaches require a floating point divide and the second requires floating
point addition. This is provided by the Xilinx floating point operator[33] which
supports both functions. Two add units and one divide unit were instantiated and
connected as detailed in Figure 3.7.
Figure 3.7: Implementation of the floating point divide and scale logic
3.6 GPU Implementation
3.6.1 FPGA equivalent
3.6.2 Full implementation
3.7 Data transfer
3.7.1 Programmed IO
3.7.2 Direct Memory Access (DMA)
Chapter 4
Results
4.1 Experimental Setup
The platform for which the FPGA implementation was designed was the Annapolis
WildStar II Pro.
4.1.1 Verification
4.1.2 Benchmark Suite
4.2 Performance
4.2.1 Experiment Parameters
4.2.2 Experiments
4.2.3 Results
4.3 Summary
Chapter 5
Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
Bibliography
[1] 4DSP Inc. IEEE-754 compliant floating-point FFT core for FPGA. http://www.4dsp.com/fft.htm, Last accessed March 2007.
[2] P. Belanovic. Library of Parameterized Hardware Modules for Floating-Point Arithmetic with An Example Application. PhD thesis, Northeastern University, June 2002.
[3] J. Bolz, I. Farmer, E. Grinspun, and P. Schroder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics, 22(3):917–924, July 2003.
[4] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Transactions on Graphics, 23(3):777–786, August 2004.
[5] Y. Chan and W. Siu. On the realization of discrete cosine transform using the distributed arithmetic. IEEE Transactions on Circuits and Systems, 39(9):705–712, Sept 1992.
[6] C. H. Crawford, P. Henning, M. Kistler, and C. Wright. Accelerating Computing with the Cell Broadband Engine Processor. In Proceedings of the 2008 Conference on Computing Frontiers, pages 3–12, 2008.
[7] Cray. Cray XD1 datasheet. http://www.cray.com/downloads/Cray XD1Datasheet.pdf, Last accessed July 2008.
[8] P. D'Alberto, P. Milder, A. Sandryhaila, F. Franchetti, J. Hoe, J. Moura, and M. Puschel. Generating FPGA-accelerated DFT libraries. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'07), pages 173–184, 2007.
[9] T. Dillon. Two Virtex-II FPGAs deliver fastest, cheapest, best high-performance image processing system. In Xilinx Xcell J., pages 70–73, 2001.
[10] D. C. Ghiglia and M. D. Pritt. Two-Dimensional Phase Unwrapping: Theory, Algorithms and Software. Wiley Inter-Science, 605 Third Avenue, New York, NY, 10158-0012, 1998.
[11] HP. Accelerating HPC using GPUs. http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/accelerating-HPCUsing-GPUs.pdf, Last accessed July 2008.
[12] A. Hull and W. Jenkins. Preconditioned conjugate gradient methods for adaptive filtering. In IEEE International Symposium on Circuits and Systems, pages 540–543, June 1991.
[13] Intel. Intel Math Kernel Library 10.0. http://www.intel.com/cd/software/products/asmo-na/eng/307757.htm, Last accessed July 2008.
[14] I. S. Uzun, A. Amira, and A. Bouridane. FPGA Implementations Of Fast Fourier Transform For Real-Time Signal And Image Processing. In Proceedings of the IEEE Conference On Vision, Image And Signal Processing, volume 152, pages 283–296, June 2005.
[15] W. E. F. Jr. Selecting math coprocessors. IEEE Spectrum, pages 38–41, July 1991.
[16] P. Karasev, D. Campbell, and M. Richards. Obtaining a 35x speedup in 2D phase unwrapping using commodity graphics processors. In Radar Conference, 2007 IEEE, pages 574–578, April 2007.
[17] J. Kerimo. The W. M. Keck three-dimensional fusion microscope. http://www.keck3dfm.neu.edu/, Last accessed June 2008.
[18] K. Bukhari, G. Kuzmanov, and S. Vassiliadis. DCT and IDCT implementations on different FPGA technologies. In Program for Research on Integrated Systems and Circuits (ProRISC), pages 232–235, November 2002.
[19] M. P. Leong and P. H. W. Leong. A variable-radix digit-serial design methodology and its application to the discrete cosine transform. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11(1):90–104, Feb 2003.
[20] J. Makhoul. A fast cosine transform in one and two dimensions. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(1):27–34, February 1980.
[21] M. Borgerding. KISS FFT. http://sourceforge.net/projects/kissfft/, Last accessed July 2008.
[22] NVIDIA. CUFFT library. developer.download.nvidia.com/compute/cuda/11/CUFFT Library 1.1.pdf, Last accessed July 2008.
[23] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986.
[24] Rapidmind. Rapidmind: Product resources. http://www.rapidmind.net/resources.php, Last accessed July 2008.
[25] N. Shirazi, A. Abbot, and P. Athanas. Implementation of a 2-D Fast Fourier Transform on FPGA-Based Custom Computing Machines. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'95), pages 155–163, April 1995.
[26] C. Smith. Phase unwrapping algorithms. PhD thesis, Northeastern University, 2004.
[27] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction, volume 2304, pages 179–196, 2002.
[28] T. Valich. GPU supercomputer: Nvidia Tesla cards to debut in Bull system. http://www.tomshardware.com/news/nvidia-graphics-supercomputer,5219.html, Last accessed July 2008.
[29] S. Wasson. Ageia's PhysX physics processing unit. http://techreport.com/articles.x/10223, Last accessed July 2008.
[30] R. Woods and D. T. J. Heron. Applying an XC6200 to real-time image processing. IEEE Design and Test of Computers, 15(1):30–38, Jan-Mar 1998.
[31] Xilinx Inc. 1-D discrete cosine transform (DCT) v2.1. http://www.xilinx.com/ipcenter/catalog/logicore/docs/da 1d dct.pdf, Last accessed March 2007.
[32] Xilinx Inc. Fast Fourier Transform 3.2. http://www.xilinx.com/ipcenter/catalog/logicore/docs/xfft.pdf, Last accessed March 2007.
[33] Xilinx Inc. Floating-point operator v1.0. http://www.xilinx.com/bvdocs/ipcenter/data sheet/floating point.pdf, Last accessed October 2007.