Lecture 15: Multi-FPGA System Software I November 1, 2004
ECE 697F
Reconfigurable Computing
Lecture 15
Mid-term Review
SRAM-based FPGA
• SRAM bits can be programmed many times
• Each programming bit takes up five transistors
• Larger device area reduces speed versus EPROM and antifuse.
[Figure: SRAM programming bit (Read or Write, Data, Q outputs) and a 2-input LUT with inputs I1 and I2, programming bits P1 to P4, and output Out]
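A minimal sketch of how a 2-input LUT behaves: the two inputs form an address that selects one of the four programming bits. The function name `lut2` and the address-to-bit ordering are assumptions made for illustration, not taken from the slide.

```python
def lut2(program_bits, i1, i2):
    """Evaluate a 2-input LUT.

    program_bits holds the four SRAM programming bits (P1..P4 in the figure).
    The ordering of bits by address is an assumed convention.
    """
    if len(program_bits) != 4:
        raise ValueError("a 2-input LUT needs exactly 4 programming bits")
    address = (i1 << 1) | i2  # the inputs act as a 2-bit address
    return program_bits[address]


# Reprogramming the same hardware yields a different logic function.
AND_BITS = [0, 0, 0, 1]
XOR_BITS = [0, 1, 1, 0]
```

Because the bits live in SRAM, switching from AND to XOR only requires rewriting the four stored values.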
Field Programmable Gate Array
Connection Box Flexibility
• Fc: how many tracks does an input pin connect to?
• If the logic cluster is small, Fc is large (Fc = W)
• If the logic cluster is large, Fc can be smaller.
- Approximately 0.2W for Xilinx XC4000EX, Virtex
[Figure: logic cluster I/O pin connecting to tracks T0, T1, T2; Out reaches all three tracks, so Fc = 3]
Switchbox Flexibility
• Switch box provides optimized interconnection area.
• Switch box flexibility (Fs) found to be less important than Fc
• Six transistors needed for Fs = 3
[Figure: switch box joining tracks 0 and 1 on each side]
Switchbox Issues
Fine-grained Approach
• A 4-input LUT holds 16 bits of information
• LUTs can be chained together through the programmable network.
• Address decoding and output multiplexing are an issue.
• Flexibility is a key aspect.
[Figure: two 16x1 LUTs (LUT1, LUT2) sharing the address lines Addr, each with address (A) and data (D) ports]
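The chaining idea can be sketched as follows: two 16x1 LUTs form a 32x1 memory, with the extra address bit acting as the decoder on writes and as the multiplexer select on reads. The class names `Lut16x1` and `Mem32x1` are hypothetical, and this is only an illustration of why the decoder and multiplexer cost extra logic.

```python
class Lut16x1:
    """A 4-input LUT used as a 16x1 memory."""

    def __init__(self):
        self.bits = [0] * 16

    def write(self, addr, value):
        self.bits[addr & 0xF] = value & 1

    def read(self, addr):
        return self.bits[addr & 0xF]


class Mem32x1:
    """32x1 memory built from two chained 16x1 LUTs."""

    def __init__(self):
        self.luts = [Lut16x1(), Lut16x1()]

    def write(self, addr, value):
        select = (addr >> 4) & 1  # decoder: top address bit picks the LUT
        self.luts[select].write(addr & 0xF, value)

    def read(self, addr):
        # Both LUTs see the low four address bits.
        outputs = [lut.read(addr & 0xF) for lut in self.luts]
        return outputs[(addr >> 4) & 1]  # multiplexer on the top address bit
```

In a real device the decoder and the multiplexer would themselves consume LUTs and routing, which is the overhead the slide points to.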
Growth Rate of Memory
• Approximately 2400 transistors per CLB
- (1200 per LUT) for XC4000-like implementation (32x1 SRAM)
• Six transistors per cell for Altera SRAM (2K bits per EAB)

         Altera 10K            Xilinx 4000E
Size     EABs   transistors    CLBs   transistors
32x1     1      12288          1      2400
32x8     1      12288          8      19200
128x8    1      12288          32     76800
512x8    2      24576          128    307200

For a 512x8 memory, the fine-grained implementation requires more than 10X the transistors (307200 versus 24576).
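The table entries can be reproduced with a few lines of arithmetic. This sketch assumes that the number of EABs or CLBs depends only on the total bit count, and it ignores the decoder and multiplexer logic; the function names are hypothetical.

```python
import math

EAB_BITS = 2048              # 2K bits per Altera EAB (from the slide)
TRANSISTORS_PER_CELL = 6     # Altera SRAM cell (from the slide)
CLB_BITS = 32                # one XC4000-like CLB acts as a 32x1 SRAM
TRANSISTORS_PER_CLB = 2400   # from the slide


def eab_cost(words, width):
    """Return (EAB count, transistor count) for a words x width memory."""
    eabs = math.ceil(words * width / EAB_BITS)
    return eabs, eabs * EAB_BITS * TRANSISTORS_PER_CELL


def clb_cost(words, width):
    """Return (CLB count, transistor count) for a words x width memory."""
    clbs = math.ceil(words * width / CLB_BITS)
    return clbs, clbs * TRANSISTORS_PER_CLB


for words, width in [(32, 1), (32, 8), (128, 8), (512, 8)]:
    print(f"{words}x{width}", eab_cost(words, width), clb_cost(words, width))
```

For the 512x8 case the ratio of the two transistor counts is 307200 / 24576 = 12.5.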
Toward Computational Comparison
DeHon metrics:
Computational density of a device, in 4-input gate-evaluations per $\lambda^2 \cdot s$
Processor: $\dfrac{2 \times N_{ALU} \times W_{ALU}}{A_{proc} \times t_{cycle}}$
FPGA: $\dfrac{N_{4lut}}{A_{array} \times t_{cycle}}$
Degradation
• FPGA can’t really be clocked at 1/7 ns due to interconnect.
• Consider the Bubblesort block from the first class.
if (A > B) { H = A; L = B; } else { H = B; L = A; }

Ci  0 0 0 0 1 1 1 1
A   0 0 1 1 0 0 1 1
B   0 1 0 1 0 1 0 1
S   0 1 1 0 1 0 0 1
Co  0 0 0 1 0 1 1 1

[Figure: compare block with inputs A and B]
H requires 33 LUT delays
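As an illustration, the S and Co rows of the truth table can be loaded directly into two 3-input LUTs and chained bit by bit. The helper names and the ripple-carry arrangement are assumptions for this sketch; the slide does not spell out how the compare block is mapped. The point is that the carry passes through one LUT per bit, so delay grows with word width.

```python
S_BITS = "01101001"   # S row of the truth table, indexed by (Ci, A, B)
CO_BITS = "00010111"  # Co row of the truth table


def lut3(bits, ci, a, b):
    """Look up one output bit using (Ci, A, B) as a 3-bit address."""
    return int(bits[(ci << 2) | (a << 1) | b])


def full_adder(ci, a, b):
    return lut3(S_BITS, ci, a, b), lut3(CO_BITS, ci, a, b)


def ripple_add(a, b, width):
    """Add two width-bit values; the carry ripples through one LUT per bit."""
    carry, total = 0, 0
    for i in range(width):
        s, carry = full_adder(carry, (a >> i) & 1, (b >> i) & 1)
        total |= s << i
    return total, carry


def sort_pair(a, b):
    """The bubblesort block: returns (H, L)."""
    return (a, b) if a > b else (b, a)
```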
Single-Instruction Multiple Data
• Same instruction distributed to fine-grained cells.
• Typically organized as 2-D array
• Ideal for image processing
• Typically fixed hardware located in cell
[Figure: cell applying a common op to multi-bit data]
Computation Unit for SIMD
• Performs different operation on every cycle
• Easy to distribute instructions on device (use global lines)
• Some local storage for data in each tile
[Figure: computation unit with inputs from local state or other array elements and outputs to local state or other array elements; a global instruction is common to all elements]
Computation Unit for FPGA
• Performs same operation on every cycle
• No global distribution of instructions at all (stored locally)
• Also has local storage for data.
[Figure: computation unit with inputs from local state or other array elements and outputs to local state or other array elements; a static instruction is distinct for each array element]
Hybrid Architecture
• Configuration selects operation of computation unit
• Context identifier changes over time to allow change in functionality
• DPGA – Dynamically Programmable Gate Array
[Figure: computation unit (LUT) with in and out ports; the context identifier drives the address inputs of the instruction store. Programming may differ for each element.]
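A rough sketch of the hybrid idea: each cell stores several configurations locally, and a context identifier selects which one is active on a given cycle. The class name `DpgaCell` is hypothetical, and a 2-input LUT is used here only to keep the tables short.

```python
class DpgaCell:
    """LUT cell that stores one truth table per context."""

    def __init__(self, contexts):
        # contexts[k] is the 4-entry truth table used when context k is active
        self.contexts = contexts

    def evaluate(self, context_id, a, b):
        table = self.contexts[context_id]  # context identifier picks the instruction
        return table[(a << 1) | b]         # inputs address the selected table


# One cell programmed as AND in context 0, OR in context 1, XOR in context 2.
cell = DpgaCell([[0, 0, 0, 1], [0, 1, 1, 1], [0, 1, 1, 0]])
```

Changing `context_id` over time changes the function of the cell without reloading its configuration, which sits between the SIMD unit (global instruction) and the FPGA unit (one static instruction).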
In-Place Partitioning
• Recursively bipartition the netlist into regions of the device.
[Figure: device divided into regions a, b, c, d]
Enhanced Mincut
• Terminal propagation takes previous cuts into account during partitioning.
• Effectively creates node “anchors”
• Helps minimize wire length
[Figure: regions a, b, c, d]
Formulating Force Equations
Use Hooke’s Law
Modules 1, 2, … N
$m_i$: mass of module i
$x_i$: x position of module i
$K_{ij}$: attractive constant between module i and j
$F_i$: net force on module i from rest of modules

$$m_i \frac{d^2 x_i}{dt^2} = F_i = \sum_{j=1,\ j \ne i}^{N} K_{ij}\,(x_j - x_i)$$
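One common way to use these equations is to look for the rest position, where the net force on each movable module is zero, which gives $x_i = \sum_j K_{ij} x_j / \sum_j K_{ij}$. The sketch below iterates that update in one dimension. The function names, the `fixed` set (for example, I/O pads), and the sign convention (force pulls module i toward module j) are assumptions of this example.

```python
def net_force(i, x, K):
    """Net spring force on module i from all other modules."""
    return sum(K[i][j] * (x[j] - x[i]) for j in range(len(x)) if j != i)


def relax(x, K, fixed, iterations=100):
    """Move each free module to its zero-force position, repeatedly."""
    x = list(x)
    n = len(x)
    for _ in range(iterations):
        for i in range(n):
            if i in fixed:
                continue
            total_k = sum(K[i][j] for j in range(n) if j != i)
            if total_k == 0:
                continue  # unconnected module: nothing pulls on it
            x[i] = sum(K[i][j] * x[j] for j in range(n) if j != i) / total_k
    return x


# Module 1 is tied equally to two fixed modules at 0.0 and 10.0.
K = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
positions = relax([0.0, 7.0, 10.0], K, fixed={0, 2})
```

Here module 1 settles midway between the two fixed modules; a stronger spring to one side would pull it closer to that side.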
Hill Climbing Algorithms
• To avoid getting trapped in local minima, consider a “hill-climbing” approach
• Need to accept worse solutions or make “bad” moves to reach the global minimum.
• Acceptance is probabilistic. Only accept cost-increasing moves some of the time.
[Figure: cost versus solution space]
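A minimal sketch of probabilistic acceptance. It uses the Metropolis rule from simulated annealing, where a cost-increasing move is accepted with probability exp(-delta / T); the slide does not name a specific rule, so this choice, the `temperature` parameter, and the function name are assumptions.

```python
import math
import random


def accept_move(delta_cost, temperature, rng=random):
    """Decide whether to accept a move that changes the cost by delta_cost."""
    if delta_cost <= 0:
        return True   # improving (or neutral) moves are always taken
    if temperature <= 0:
        return False  # no hill climbing once fully cooled
    # Worse moves are accepted only some of the time.
    return rng.random() < math.exp(-delta_cost / temperature)
```

At a high temperature most bad moves are accepted, which lets the search leave a local minimum; as the temperature drops, the search behaves like plain greedy improvement.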
Maze Routing
• Evaluate shortest feasible paths based on a cost function
• Like global routing for row-based devices, allocates channel bandwidth, not specific solutions.
• Formulate the cost function as needed to address the desired goal.
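The bullets above can be sketched as a lowest-cost wavefront expansion over a grid. This example uses Dijkstra's algorithm with a per-cell entry cost standing in for the cost function; the grid encoding (`None` for blocked cells) and the function name are assumptions of this sketch.

```python
import heapq


def maze_route(grid_cost, source, sink):
    """Return (total_cost, path) for the cheapest route, or None if unroutable.

    grid_cost[r][c] is the cost of entering a cell; None marks a blocked cell.
    """
    rows, cols = len(grid_cost), len(grid_cost[0])
    best = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell == sink:
            path = [cell]
            while cell in prev:       # walk back to the source
                cell = prev[cell]
                path.append(cell)
            return cost, path[::-1]
        if cost > best.get(cell, float("inf")):
            continue                  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid_cost[nr][nc] is not None:
                new_cost = cost + grid_cost[nr][nc]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (new_cost, (nr, nc)))
    return None
```

Raising the cost of a congested cell steers later routes around it, which is how the cost function can be shaped toward a desired goal.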
Routing Tradeoffs
• Bias router to find first, best route.
• Vary number of node expansions using:
$pcost_i = (1 - a) \times pcost_{i-1} + ncost_i + a \times dist_i$
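The effect of the parameter a can be seen by evaluating the formula directly. In this sketch the reading of the terms is an assumption: `prev_pcost` is the path cost up to the previous node, `ncost` the cost of the new node, and `dist` an estimate of the remaining distance to the target.

```python
def path_cost(prev_pcost, ncost, dist, a):
    """pcost_i = (1 - a) * pcost_{i-1} + ncost_i + a * dist_i"""
    return (1 - a) * prev_pcost + ncost + a * dist


# a = 0: only accumulated cost matters, so the search expands broadly.
print(path_cost(10, 2, 5, 0))
# a = 1: accumulated cost is ignored and distance to the target dominates,
# so the search heads for the target and expands fewer nodes.
print(path_cost(10, 2, 5, 1))
```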
Architectural Limitation
• Routing architecture necessitates domain selection.
• Bigger effect for multi-fanout nets
Pathfinder
• Use a non-decreasing history value to represent congestion.
• Similarities to multi-commodity flow
• Can be implemented efficiently but does require substantial run time
• Only update the history value after an iteration.
$c_n = (1 + h_n \times h_{fac}) \times (1 + p_n \times p_{fac}) + b_{n,n-1}$
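A small sketch of the cost expression and of a history update applied once per routing iteration. The interpretation of the symbols is an assumption: `h_n` is the history value of node n, `p_n` its present congestion, and `b` the base cost of moving from node n-1 to node n. The overuse-based update rule and the function names are also hypothetical.

```python
def node_cost(h_n, p_n, b, h_fac, p_fac):
    """c_n = (1 + h_n * h_fac) * (1 + p_n * p_fac) + b"""
    return (1 + h_n * h_fac) * (1 + p_n * p_fac) + b


def update_history(history, occupancy, capacity):
    """Return new history values after one complete routing iteration.

    History grows by the amount of overuse and never decreases.
    """
    return {
        node: history.get(node, 0) + max(0, occupancy.get(node, 0) - capacity[node])
        for node in capacity
    }
```

Because history only accumulates, a node that keeps being overused becomes steadily more expensive, and nets with alternatives eventually move away from it.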
DP-FPGA
• Break FPGA into datapath and control sections
• Save storage for LUTs and connection transistors
• Key issue is grain size
• Cherepacha/Lewis – U. Toronto
RaPiD
• Reconfigurable Pipelined Datapath
• Ebeling – University of Washington
• Uses hard-coded functional units (ALU, Memory, multiply)
• Good for signal processing
• Linear array of processing elements.
[Figure: linear array of cells]
Basic Functional Unit
• Two inputs from adjacent blocks.
• Local memory for instructions, data.
Chess Basic Block
• Switchbox memory can be used as storage
• ALU core for computation
FPICs
• High internal connectivity
• Not always cost effective
Reconfigurable Processing
From Hauck: Role of FPGAs
• Many places to put reconfigurable computing components
• Most implementations involve multiple discrete devices
• How should these devices be connected together?
Emulation Software Steps
[Figure: emulation software flow. Stages shown: Netlist Translation; Partitioner (divide netlist into fixed-sized chunks); Global Placer (locate an FPGA for a chunk); Global Router (make connections between devices); FPGA-specific P+R (Xilinx P+R), producing FPGA bitstreams; Technology Mapping. Many of these are dependent on device interconnect topology.]
Network Routing
• FPGAs popular in network hardware
• New protocols implemented directly in silicon
• Easy to upgrade in the field
• Washington University Gigabit Switch (WUGS)
- Switch provides up to 160 Gbps of bandwidth.
Programmable Active Memory
• Developed by DEC Paris Research Group (1988-1993)
• Attached to DEC workstation via Turbochannel bus interface for burst transfers.
• Total of 12 manufactured and distributed worldwide.
• Flexible software environment.
Hybrid Architecture
• Buses connect groups of FPGAs to SRAM
• Extra devices used for RAM controller and map to external interface.
Logic Emulation
• Emulation takes a sizable amount of resources
• Compilation time can be large due to FPGA compiles
• Emulation is one application, but it also has direct ties to other FPGA computing applications.
Are Meshes Realistic?
• The number of wires leaving a partition grows with Rent’s Rule
$P = K G^B$
• Perimeter grows as $G^{0.5}$ but unfortunately most circuits grow at $G^B$ where $B > 0.5$
• Effectively, devices are highly pin limited
• What does this mean for meshes?
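To make the mismatch concrete, this sketch compares how pin demand and perimeter scale when the gate count is quadrupled. In Rent's Rule, P is the pin count, G the gate count, K a constant, and B the Rent exponent; the value B = 0.75 used below is purely illustrative and not taken from the slides.

```python
def rent_pins(gates, k, b):
    """Rent's Rule: P = K * G**B."""
    return k * gates ** b


def growth_when_quadrupled(b):
    """Factor by which a G**b quantity grows when G is multiplied by 4."""
    return 4 ** b


perimeter_growth = growth_when_quadrupled(0.5)   # perimeter scales as G**0.5
pin_growth = growth_when_quadrupled(0.75)        # hypothetical Rent exponent
print(perimeter_growth, pin_growth)
```

Whenever B exceeds 0.5, the wires a partition needs grow faster than the perimeter available to carry them, so the gap widens as designs get larger.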
Virtual Wires
• Overcome pin limitations by multiplexing pins and signals
• Schedule when communication will take place.
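A toy sketch of pin multiplexing: logical signals are assigned to (time slot, physical pin) pairs so that more signals than pins can cross between devices. The round-robin assignment and the function name are assumptions; a real scheduler would also have to respect dependencies between signals, which this example ignores.

```python
def schedule_virtual_wires(signals, physical_pins):
    """Assign each logical signal a (time_slot, pin) pair, round-robin."""
    if physical_pins <= 0:
        raise ValueError("need at least one physical pin")
    schedule = {}
    for index, name in enumerate(signals):
        schedule[name] = (index // physical_pins, index % physical_pins)
    slots = -(-len(signals) // physical_pins)  # ceiling division
    return schedule, slots


# Five logical signals share two physical pins over three time slots.
schedule, slots = schedule_virtual_wires(["a", "b", "c", "d", "e"], 2)
```

The cost of the extra connectivity is time: each emulated clock cycle now takes several pin-level time slots.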
A Simple Example
[Figure: four FPGAs, FPGA 1 to FPGA 4]
KLFM Partitioning
• Identify nodes to swap to reduce overall cut size
• Lock moved nodes
• Algorithm continues until no un-locked node can be moved without violating size constraints
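The steps above can be sketched as a single simplified pass. This version works on two-pin edges rather than hypergraph nets, recomputes gains from scratch instead of using gain buckets, and keeps the best partition seen during the pass; all of those simplifications, and the names used, are assumptions of this example.

```python
def cut_size(edges, side):
    return sum(1 for u, v in edges if side[u] != side[v])


def gain(node, edges, side):
    """Reduction in cut size if node moves to the other bin."""
    g = 0
    for u, v in edges:
        if u != v and node in (u, v):
            other = v if u == node else u
            g += 1 if side[other] != side[node] else -1
    return g


def klfm_pass(edges, side, max_bin_size):
    """One pass: move the best unlocked node, lock it, repeat; keep the best cut."""
    side = dict(side)
    best_side, best_cut = dict(side), cut_size(edges, side)
    locked = set()
    while True:
        candidates = []
        for node in side:
            if node in locked:
                continue
            target = 1 - side[node]
            target_size = sum(1 for s in side.values() if s == target)
            if target_size + 1 <= max_bin_size:  # size constraint
                candidates.append((gain(node, edges, side), node))
        if not candidates:
            break  # no unlocked node can move without violating the constraint
        _, node = max(candidates)
        side[node] = 1 - side[node]
        locked.add(node)  # a moved node stays put for the rest of the pass
        current = cut_size(edges, side)
        if current < best_cut:
            best_side, best_cut = dict(side), current
    return best_side, best_cut


edges = [("a", "b"), ("c", "d"), ("b", "c")]
start = {"a": 0, "c": 0, "b": 1, "d": 1}
result_side, result_cut = klfm_pass(edges, start, max_bin_size=3)
```

Note that moves with zero or negative gain are still made during the pass; only the best intermediate partition is kept, which lets the pass step through worse states to reach a better one.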
[Figure: nodes divided between Bin 1 and Bin 2]
Clustering
• Technology mapping before partitioning is typically ineffective since frequently area is secondary to interconnect
• Frequently bipartitioning continues after unclustering as well.
[Figure: flow of cluster, KLFM, uncluster, KLFM]
• This allows for additional fine-grain moves.
Higher-level Gains
• Effectively look-ahead to try to anticipate next move
• Look-ahead of 3 considered best tradeoff
Are Meshes Really Realistic?
• The number of wires leaving a partition grows with Rent’s Rule
• Perimeter grows as $G^{0.5}$ but unfortunately most circuits grow at $G^B$ where $B > 0.5$
• Effectively, devices are highly pin limited
• What does this mean for meshes?
$P = K G^B$