Upload
preston-wilcox
View
219
Download
0
Tags:
Embed Size (px)
Citation preview
Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers
Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers
Esam El-Araby1, Mohamed Taher1, Tarek El-Ghazawi1, and Kris Gaj2
1The George Washington University,2George Mason University
{esam, mtaher, tarek}@gwu.edu, [email protected]
Esam El-Araby1, Mohamed Taher1, Tarek El-Ghazawi1, and Kris Gaj2
1The George Washington University,2George Mason University
{esam, mtaher, tarek}@gwu.edu, [email protected]
El-Araby 2 1008 / MAPLD2005
Introduction
What are Reconfigurable Computers (RCs)? RCs are computing systems based
on the close system-level integration of one or more general-purpose processors and one or more Field Programmable Gate Array (FPGA) chips
Benefits of RCs A trade-off between traditional hardware and software Hardware-like performance with software-like flexibility Hardware can be modified on-the-fly The programming model is aimed at shielding programmers from the details
of the hardware description Orders of magnitude performance improvements over traditional systems
El-Araby 3 1008 / MAPLD2005
Introduction (cnt’d) Status of RCs
An important research subject due to the recent fast growth of FPGAs technology
Evolved from: “glue logic” between components to Accelerator boards to Stand-alone general-purpose RCs to Parallel reconfigurable computers
However, there exist multiple challenges that must be resolved
El-Araby 4 1008 / MAPLD2005
Challenges Performance
I/O Bandwidth Significant Configuration Latency
Some systems spend 25% to 98% of their execution time performing reconfiguration
Need for Efficient OS and Run-Time Reconfiguration Management Reconfiguration methods in current systems are not fully dynamic
Ease of Use Compilers/Languages
HDLs (VHDL and Verilog) are hard to use by application scientists HLLs and simple interfaces
Debuggers
El-Araby 5 1008 / MAPLD2005
Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
Up to 256 input and 256 output ports with two tiers of switch
Common Memory (CM) has controller with DMA capability
Controller can perform other functions such as scatter/gather
Up to 8 GB DDR SDRAM supported per CM node
SRC Architecture(Hi-BarTM Based Systems)
Storage Area Storage Area Network Network
Local Area Local Area Network Network
Wide Area Wide Area Network Network DiskDisk
Customers’ Existing NetworksCustomers’ Existing Networks
PCI-XPCI-XPCI-XPCI-X
MAPMAP®®
SRC-6SRC-6
MAPMAP
PP
MemoryMemory
SNAPSNAP™™
PP
MemoryMemory
SNAPSNAP
Gig EthernetGig Ethernetetc.etc.
Common Common MemoryMemory
ChainingChainingGPIOGPIO
Common Common MemoryMemory
SRC Hi-Bar SwitchSRC Hi-Bar Switch
Source: [SRC]
El-Araby 6 1008 / MAPLD2005
SRC Programming Environment
Objectfiles
Application sources
MAP CompilerP Compiler
Logic synthesis
Place & Route
Linker.bin files
.edf files
.o files .o files
Applicationexecutable
Configurationbitstreams
HDLsources.c or .f files .vhd or .v files
Objectfiles
Application sourcesUser
Macro sources
MAP CompilerP Compiler
Logic synthesis
Place & Route
Linker
.edf files
.bin files
. files
.o files .o files
Applicationexecutable
Configurationbitstreams
HDL
.c or .f files .vhd or .v files
.v files
El-Araby 7 1008 / MAPLD2005
SRC Application Simulation Process
HLLHLLsourcesource
CFGCFGDFGDFG VerilogVerilogGeneratorGenerator
CompilerCompiler““Front-end”Front-end”
SynthesisSynthesis
Place Place and and
RouteRoute
logic.binlogic.bin
MacroMacroDefinitionDefinition
VerilogVerilog
DFGDFGBehavioralBehavioralSimulationSimulation
User Chip User Chip LevelLevel
SimulationSimulation
MacroMacroEmulationEmulation
C CodeC Code(Info File)(Info File)
MacroMacroVerilogVerilogCodeCode
OptionalOptional
MacroMacroVerilog CodeVerilog Code
Source: [SRC]
El-Araby 8 1008 / MAPLD2005
Steps to Final Logic
DFG Simulation
Verifies memory allocation
Verifies data movement
Uses real run time environment
Emulates the CM & OBM relationships
Simulates User Logic
HLL SourceHLL Source
DFG SimulationDFG Simulation
Source: [SRC]
El-Araby 9 1008 / MAPLD2005
Steps to Final Logic
DFG Simulation
User Logic Simulation
Test user developed macros
The application becomes the “test bench”
Push the generated logic one step closer to the actual hardware implementation
Requires “logic designer mentality” for debugging
Gives full visibility into the logic
HLL SourceHLL Source
DFG SimulationDFG Simulation
UL SimulationUL Simulation
Source: [SRC]
El-Araby 10 1008 / MAPLD2005
Steps to Final Logic
DFG Emulation User Logic Simulation
MAP Hardware Execution
Full execution using ComList and User Logic on MAP
MAP ExecutionMAP Execution
HLL SourceHLL Source
DFG SimulationDFG Simulation
UL SimulationUL Simulation
Source: [SRC]
El-Araby 11 1008 / MAPLD2005
SGI Systems(System Architecture)
C
C
C
C
C
C
C
C
C
C
C
V
RR
RR
IOIO
IO IORASC RASC
RASCRASC
NUMAlink system interconnect
General-purpose compute nodes
Peer-attached general purpose I/O
Integrated graphics/visualization
Reconfigurable Application Specific Computing
R
C
IO
RASC
V
El-Araby 12 1008 / MAPLD2005
RASC Architecture
El-Araby 13 1008 / MAPLD2005
RASC Architecture (cnt’d)
El-Araby 14 1008 / MAPLD2005
Design Flow (HDLs)
IA-32 Linux
Machine
Design iterations
Design Entry(Verilog, VHDL)
Design Synthesis(Synplify Pro,
Amplify)
Design Implementation
(ISE)
Design Verification
Behavioral Simulation(VCS, Modelsim)
Static Timing Analysis(ISE Timing Analyzer)
.v, .vhd.v, .vhd
.edf
.ncd, .pcf
.bin
MetadataProcessing
(Python)
.v, .vhd
.cfg
Altix Device Programming(RASC Abstraction Layer,
Device Manager, Device Driver)
Real-time Verification
(gdb)
.c
El-Araby 15 1008 / MAPLD2005
Design Flow (HLLs)
IA-32 Linux
Machine
RTL Generation and Integration with Core Services
Design Synthesis(Synplify Pro,
Amplify)
Design Verification
Behavioral Simulation(VCS, Modelsim)
Static Timing Analysis(ISE Timing Analyzer)
.v, .vhd
.v, .vhd
.edf
.ncd, .pcf
.bin
MetadataProcessing
(Python)
.v, .vhd
.cfg
Altix Device Programming(RASC Abstraction Layer,
Device Manager, Device Driver)
Real-time Verification
(gdb)
.c
Design Implementation(ISE)
HLL Design Entry(Handel-C, Impulse C, Mitrion C, Viva)
El-Araby 16 1008 / MAPLD2005
Application Programming Interface
Rasclib: Resource allocation in conjunction with the RASC Device
Manager Data movement to/from the COP via DMA engines Algorithm control (start, stop, single step, stepN) Automatic scaling across multiple devices Interfaces necessary for debugging
El-Araby 17 1008 / MAPLD2005
Abstraction Layer: Algorithm API
and deep scaling.
The Abstraction Layer’s algorithm API mirrors the COP API with a few additions that enable wide scaling,
Output Data
Application
COP
COP
COP
Input DataAlgorithm
COP
Input Data Output DataAlgorithm
Application COP
El-Araby 18 1008 / MAPLD2005
Based on Open Source Gnu Debugger (GDB)
Uses extensions to current command set
Can debug host application and FPGA
Provides notification when FPGA starts or stops
Supplies information on FPGA characteristics
Can “single-step” or “run N steps” of the algorithm
Dumps data regarding the set of “registers” that are visible when the FPGA is active
RASC Debugging
El-Araby 19 1008 / MAPLD2005
Applications of DWT
Pattern recognition
Feature extraction Metallurgy: characterization of rough surfaces
Trend detection: Finance: exploring variation of stock prices
Perfect reconstruction Communications: wireless channel signals
De-noising noisy data
FBI fingerprint compression
Detecting self-similarity in a time series
Video compression – JPEG 2000
Hyperspectral Dimension Reduction
Image Registration
El-Araby 20 1008 / MAPLD2005
The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two resulting in two "column-decimated" images L and H
Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two
This decomposition results into four images, LL, LH, HL and HH
The LL image is taken as the new input to perform the next level of decomposition
Multi-Resolution DWT Decomposition (Mallat Algorithm)
El-Araby 21 1008 / MAPLD2005
DWT Implementation (Top-Level)
data_in_H
data_in_L
data_out_H
data_out_LD1
D2
Q1
Q2
clkrst
rstclk
D Q
en
clkrst
out_enQ
El-Araby 22 1008 / MAPLD2005
FIR Module(Transposed Form)
El-Araby 23 1008 / MAPLD2005
DWT End-to-End Throughput (SRC-6 & SGI-RASC vs. P4)
Filter TypeSRC-6
(MB/sec)
SGI RASC
(MB/sec)P4
(MB/sec)
Daub1(Haar) 199 130 12
Daub2 199 130 9.98
Daub3 199 130 8.73
Image Size = 512 X 512 pixels
El-Araby 24 1008 / MAPLD2005
Conclusions DWT is implemented on both SRC-6 and SGI-RASC
systems Similarities and differences are analyzed with regard to:
System hardware architecture Ease of programming
Programming model Development time Hardware/software libraries
Performance The speed-up vs. microprocessor is reported
Primary bottlenecks limiting the performance of both systems are recognized
The capability to share and port applications between the SRC and SGI systems is explored