Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers

Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

1 The George Washington University, 2 George Mason University

{esam, mtaher, tarek}@gwu.edu, [email protected]

El-Araby, 1008 / MAPLD 2005

Introduction

What are Reconfigurable Computers (RCs)?

RCs are computing systems based on the close system-level integration of one or more general-purpose processors and one or more Field Programmable Gate Array (FPGA) chips

Benefits of RCs

A trade-off between traditional hardware and software

Hardware-like performance with software-like flexibility

Hardware can be modified on the fly

The programming model is aimed at shielding programmers from the details of the hardware description

Orders of magnitude performance improvements over traditional systems


Introduction (cont'd)

Status of RCs

An important research subject due to the recent fast growth of FPGA technology

Evolved from "glue logic" between components, to accelerator boards, to stand-alone general-purpose RCs, to parallel reconfigurable computers

However, multiple challenges remain to be resolved


Challenges

Performance:

I/O bandwidth

Significant configuration latency: some systems spend 25% to 98% of their execution time performing reconfiguration

Need for efficient OS and run-time reconfiguration management: reconfiguration methods in current systems are not fully dynamic

Ease of use:

Compilers/languages: HDLs (VHDL and Verilog) are hard for application scientists to use

HLLs and simple interfaces

Debuggers


SRC Architecture (Hi-Bar™ Based Systems)

Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier

Up to 256 input and 256 output ports with two tiers of switch

Common Memory (CM) has a controller with DMA capability

The controller can perform other functions such as scatter/gather

Up to 8 GB DDR SDRAM supported per CM node

[Figure: SRC-6 Hi-Bar based system. Microprocessor (P) and memory nodes attach through SNAP™ to the SRC Hi-Bar switch, together with MAP® processors and Common Memory nodes. MAPs are chained through GPIO. PCI-X, Gigabit Ethernet, and similar links connect to the customers' existing networks (storage area network, local area network, wide area network, disk).]

Source: [SRC]
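The quoted Hi-Bar figures can be turned into a rough transfer-time estimate. The sketch below is only a back-of-the-envelope model, not a measurement: it assumes 1 GB = 10^9 bytes, that latency simply adds once per switch tier, and a hypothetical payload of one 512 x 512 image at 1 byte per pixel (the pixel depth is not stated in the slides).

```python
# Figures quoted above for the SRC Hi-Bar switch.
PORT_BANDWIDTH_BYTES_PER_S = 1.4e9   # 1.4 GB/s per port (assuming 1 GB = 10**9 bytes)
TIER_LATENCY_S = 180e-9              # 180 ns latency per switch tier


def transfer_time_s(payload_bytes, tiers=2):
    """Latency through `tiers` switch tiers plus streaming time at the sustained rate."""
    return tiers * TIER_LATENCY_S + payload_bytes / PORT_BANDWIDTH_BYTES_PER_S


# Hypothetical payload: a 512 x 512 image at 1 byte per pixel (assumed pixel depth).
image_bytes = 512 * 512
estimate = transfer_time_s(image_bytes)
print(f"{estimate * 1e6:.1f} microseconds")
```

Under these assumptions the streaming term dominates: the two-tier latency contributes well under one microsecond, so for image-sized payloads the per-port bandwidth matters far more than the switch latency.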


SRC Programming Environment

[Figure: SRC compilation flow. Application sources (.c or .f files) go through the µP Compiler and the MAP Compiler, producing object files (.o files). User macro sources written in HDL (.vhd or .v files) go through logic synthesis (.edf files) and place & route (.bin files). The linker combines the results into the application executable, which includes the configuration bitstreams.]


SRC Application Simulation Process

[Figure: SRC application simulation process. The HLL source passes through the compiler "front-end" to a CFG and DFG, then through the Verilog generator, synthesis, and place and route to logic.bin. Simulation paths shown: DFG behavioral simulation, using macro emulation C code (info file), and user chip level simulation, using the generated Verilog and the macro definition Verilog code. The macro Verilog code is marked optional.]

Source: [SRC]


Steps to Final Logic

DFG Simulation

Verifies memory allocation

Verifies data movement

Uses real run time environment

Emulates the CM & OBM relationships

Simulates User Logic

[Figure: flow from HLL Source to DFG Simulation]

Source: [SRC]



DFG Simulation

User Logic Simulation

Test user developed macros

The application becomes the “test bench”

Push the generated logic one step closer to the actual hardware implementation

Requires “logic designer mentality” for debugging

Gives full visibility into the logic



[Figure: flow stage UL Simulation]

Source: [SRC]



DFG Emulation

User Logic Simulation

MAP Hardware Execution

Full execution using ComList and User Logic on MAP

[Figure: flow from UL Simulation to MAP Execution]

Source: [SRC]


SGI Systems (System Architecture)

[Figure: SGI system architecture. Compute, visualization, I/O, and RASC nodes attached to the NUMAlink system interconnect.]

Legend:

R: NUMAlink system interconnect

C: General-purpose compute nodes

IO: Peer-attached general purpose I/O

V: Integrated graphics/visualization

RASC: Reconfigurable Application Specific Computing


RASC Architecture


RASC Architecture (cont'd)


Design Flow (HDLs)

[Figure: HDL design flow, reconstructed from the diagram labels]

On the IA-32 Linux machine (with design iterations):

Design Entry (Verilog, VHDL): .v, .vhd

Design Synthesis (Synplify Pro, Amplify): .edf

Design Implementation (ISE): .ncd, .pcf, .bin

Design Verification: Behavioral Simulation (VCS, Modelsim) and Static Timing Analysis (ISE Timing Analyzer)

Metadata Processing (Python): .v, .vhd to .cfg

On the Altix:

Device Programming (RASC Abstraction Layer, Device Manager, Device Driver)

Real-time Verification (gdb): .c

Design Flow (HLLs)

[Figure: HLL design flow, reconstructed from the diagram labels]

On the IA-32 Linux machine:

HLL Design Entry (Handel-C, Impulse C, Mitrion C, Viva)

RTL Generation and Integration with Core Services: .v, .vhd

Design Synthesis (Synplify Pro, Amplify): .edf

Design Implementation (ISE): .ncd, .pcf, .bin

Design Verification: Behavioral Simulation (VCS, Modelsim) and Static Timing Analysis (ISE Timing Analyzer)

Metadata Processing (Python): .v, .vhd to .cfg

On the Altix:

Device Programming (RASC Abstraction Layer, Device Manager, Device Driver)

Real-time Verification (gdb): .c


Application Programming Interface

Rasclib:

Resource allocation in conjunction with the RASC Device Manager

Data movement to/from the COP via DMA engines

Algorithm control (start, stop, single step, step N)

Automatic scaling across multiple devices

Interfaces necessary for debugging


Abstraction Layer: Algorithm API

The Abstraction Layer's algorithm API mirrors the COP API with a few additions that enable wide scaling and deep scaling.

[Figure: an application's input data, algorithm, and output data mapped onto a single COP and onto multiple COPs]


RASC Debugging

Based on the open source GNU Debugger (GDB)

Uses extensions to the current command set

Can debug the host application and the FPGA

Provides notification when the FPGA starts or stops

Supplies information on FPGA characteristics

Can "single-step" or "run N steps" of the algorithm

Dumps data regarding the set of "registers" that are visible when the FPGA is active


Applications of DWT

Pattern recognition

Feature extraction: Metallurgy: characterization of rough surfaces

Trend detection: Finance: exploring variation of stock prices

Perfect reconstruction: Communications: wireless channel signals

De-noising noisy data

FBI fingerprint compression

Detecting self-similarity in a time series

Video compression – JPEG 2000

Hyperspectral Dimension Reduction

Image Registration


Multi-Resolution DWT Decomposition (Mallat Algorithm)

The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two, resulting in two "column-decimated" images, L and H

Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two

This decomposition results in four images: LL, LH, HL, and HH

The LL image is taken as the new input to perform the next level of decomposition
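The decomposition steps above can be sketched in software as follows. This is a minimal pure-Python illustration, not the hardware design from the slides: the function names are hypothetical, the image edges are handled by periodic extension (an assumption, since the slides do not state a boundary rule), the filter is applied as `sum(taps[k] * x[i + k])` (one of several common indexing conventions), and the sub-band naming (first letter for the row filter, second for the column filter) is also an assumption.

```python
import math


def filter_and_decimate(signal, taps):
    """Filter a 1-D sequence with `taps` (periodic extension), keeping every second sample."""
    n = len(signal)
    return [
        sum(t * signal[(i + k) % n] for k, t in enumerate(taps))
        for i in range(0, n, 2)
    ]


def transpose(image):
    return [list(col) for col in zip(*image)]


def dwt2_level(image, low, high):
    """One Mallat decomposition level; returns the four sub-images (LL, LH, HL, HH)."""
    # Step 1: filter along each row, keeping every second column.
    rows_l = [filter_and_decimate(row, low) for row in image]
    rows_h = [filter_and_decimate(row, high) for row in image]

    def along_columns(img, taps):
        # Step 2: filter along each column, keeping every second row.
        return transpose([filter_and_decimate(col, taps) for col in transpose(img)])

    return (along_columns(rows_l, low), along_columns(rows_l, high),
            along_columns(rows_h, low), along_columns(rows_h, high))


def dwt2(image, low, high, levels):
    """Multi-level decomposition: LL is fed back as the input of the next level."""
    details = []
    ll = image
    for _ in range(levels):
        ll, lh, hl, hh = dwt2_level(ll, low, high)
        details.append((lh, hl, hh))
    return ll, details


# Haar (Daub1) analysis filters and a constant 4 x 4 test image.
s = 1 / math.sqrt(2)
HAAR_LOW, HAAR_HIGH = [s, s], [s, -s]
image = [[1.0] * 4 for _ in range(4)]
ll, details = dwt2(image, HAAR_LOW, HAAR_HIGH, levels=2)
```

For a constant image, all detail sub-images (LH, HL, HH) are zero up to rounding and each level halves both dimensions of LL, which makes the example a convenient sanity check for the row/column ordering.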


DWT Implementation (Top-Level)

[Figure: top-level DWT block diagram. Ports: data_in_H, data_in_L, data_out_H, data_out_L, clk, rst, out_en. Register signals: D1, D2, Q1, Q2, D, Q, en.]


FIR Module (Transposed Form)
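The slide's diagram is not recoverable here, so the following is a generic software model of a transposed-form FIR filter rather than the authors' circuit. In this structure each input sample is multiplied by every coefficient at once, and the partial sums travel through a chain of registers toward the output. The function names and the example coefficients are illustrative assumptions.

```python
def fir_transposed(samples, taps):
    """Transposed-form FIR: broadcast each input to all taps, accumulate through registers."""
    state = [0.0] * (len(taps) - 1)   # registers between the adders, initially cleared
    out = []
    for x in samples:
        products = [t * x for t in taps]          # all multipliers see the same input
        y = products[0] + (state[0] if state else 0.0)
        # Register k loads product k+1 plus the old value of register k+1.
        for k in range(len(state)):
            carry = state[k + 1] if k + 1 < len(state) else 0.0
            state[k] = products[k + 1] + carry
        out.append(y)
    return out


def fir_direct(samples, taps):
    """Reference direct-form FIR: y[n] = sum over k of taps[k] * x[n - k]."""
    return [
        sum(t * samples[n - k] for k, t in enumerate(taps) if n - k >= 0)
        for n in range(len(samples))
    ]


# An impulse input reproduces the coefficients, and both forms agree.
impulse_response = fir_transposed([1, 0, 0, 0], [1, 2, 3])
```

Both forms compute the same output; the transposed form is often preferred in FPGA designs because the adder chain is pipelined by the registers, which keeps the critical path to one multiplier and one adder regardless of filter length.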


DWT End-to-End Throughput (SRC-6 & SGI-RASC vs. P4)

Filter Type     SRC-6 (MB/sec)   SGI RASC (MB/sec)   P4 (MB/sec)

Daub1 (Haar)    199              130                 12

Daub2           199              130                 9.98

Daub3           199              130                 8.73

Image Size = 512 x 512 pixels
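The speed-up over the P4 follows directly from the throughput table as a ratio. The short sketch below derives those ratios from the tabulated numbers; the ratios are computed here and are not figures quoted from the slides.

```python
# End-to-end throughput in MB/sec, copied from the table above.
throughput_mb_s = {
    "Daub1 (Haar)": {"SRC-6": 199, "SGI RASC": 130, "P4": 12},
    "Daub2": {"SRC-6": 199, "SGI RASC": 130, "P4": 9.98},
    "Daub3": {"SRC-6": 199, "SGI RASC": 130, "P4": 8.73},
}


def speedups(row):
    """Ratio of each reconfigurable system's throughput to the P4 throughput."""
    return {name: row[name] / row["P4"] for name in ("SRC-6", "SGI RASC")}


for filter_type, row in throughput_mb_s.items():
    ratios = speedups(row)
    print(f"{filter_type}: SRC-6 {ratios['SRC-6']:.1f}x, SGI RASC {ratios['SGI RASC']:.1f}x")
```

Because the hardware throughput stays constant across the three filters while the P4 slows down as the filters get longer, the derived speed-up grows with filter length.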


Conclusions

DWT is implemented on both SRC-6 and SGI-RASC systems

Similarities and differences are analyzed with regard to:

System hardware architecture

Ease of programming: programming model, development time, hardware/software libraries

Performance

The speed-up vs. microprocessor is reported

Primary bottlenecks limiting the performance of both systems are recognized

The capability to share and port applications between the SRC and SGI systems is explored
