Lecture 15: Multi-FPGA System Software I November 1, 2004 ECE 697F Reconfigurable Computing Lecture 15 Mid-term Review



SRAM-based FPGA

• SRAM bits can be programmed many times

• Each programming bit takes up five transistors

• Larger device area reduces speed versus EPROM and antifuse.

[Figure: SRAM programming bit cell (Read or Write, Data, Q outputs) and a 2-input LUT with inputs I1 and I2, programming bits P1 to P4, and output Out]
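A minimal sketch of how the 2-input LUT in the figure can be read: the four programming bits P1 to P4 hold a truth table, and the inputs I1 and I2 select one entry. The function name `lut2` and the bit ordering (I1 as the high select bit) are assumptions made for illustration, not taken from the slide.

```python
def lut2(program_bits, i1, i2):
    """Evaluate a 2-input LUT: the inputs select one of four stored bits."""
    if len(program_bits) != 4:
        raise ValueError("a 2-input LUT needs exactly 4 programming bits")
    index = (i1 << 1) | i2  # assumed ordering: I1 is the high select bit
    return program_bits[index]


# Program the LUT as XOR: entries for (I1, I2) = 00, 01, 10, 11.
xor_bits = [0, 1, 1, 0]
xor_table = [lut2(xor_bits, a, b) for a in (0, 1) for b in (0, 1)]
```

Reprogramming the same four bits (for example `[0, 0, 0, 1]` for AND) changes the function without changing the hardware, which is the point of SRAM-based programming.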


Field Programmable Gate Array


Connection Box Flexibility

• Fc: how many tracks does an input pin connect to?

• If the logic cluster is small, Fc is large (Fc = W).

• If the logic cluster is large, Fc can be smaller.

- Approximately 0.2W for Xilinx XC4000EX, Virtex

[Figure: logic cluster with an I/O pin and an output (Out) connecting to tracks T0, T1, T2; Fc = 3]


Switchbox Flexibility

• Switch box provides optimized interconnection area.

• Switch box flexibility (Fs) was found to be less important than Fc

• Six transistors needed for Fs = 3

[Figure: switch box with tracks 0 and 1 on each side]


Switchbox Issues


Fine-grained Approach

• A 4-input LUT holds 16 bits of information.

• LUTs can be chained together through the programmable network.

• The decoder and multiplexer are an issue.

• Flexibility is a key aspect.

[Figure: two 16x1 LUTs (LUT1, LUT2) with address (A) and data (D) ports and a shared address input (Addr)]


Growth Rate of Memory

• Approximately 2400 transistors per CLB

- (1200 per LUT) for XC4000-like implementation (32x1 SRAM)

• Six transistors per cell for Altera SRAM (2K per EAB)

Size     Altera 10K             Xilinx 4000E
         EABs   Transistors     CLBs   Transistors
32x1     1      12288           1      2400
32x8     1      12288           8      19200
128x8    1      12288           32     76800
512x8    2      24576           128    307200

For 512x8, the fine-grained implementation requires more than 10X the transistors (307200 versus 24576).
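The table rows can be reproduced with a short calculation. This is a sketch under the slide's own figures (2K bits and 6 transistors per cell for an EAB, 32 bits and roughly 2400 transistors per CLB); the function name `memory_cost` and the dictionary keys are illustrative.

```python
import math

EAB_BITS = 2048            # 2K bits per Altera EAB (from the slide)
EAB_TRANSISTORS = 2048 * 6 # six transistors per SRAM cell -> 12288
CLB_BITS = 32              # one XC4000-like CLB used as a 32x1 SRAM
CLB_TRANSISTORS = 2400     # approximate figure from the slide


def memory_cost(depth, width):
    """Return block and transistor counts for a depth x width memory."""
    bits = depth * width
    eabs = math.ceil(bits / EAB_BITS)
    clbs = math.ceil(bits / CLB_BITS)
    return {
        "eabs": eabs,
        "eab_transistors": eabs * EAB_TRANSISTORS,
        "clbs": clbs,
        "clb_transistors": clbs * CLB_TRANSISTORS,
    }


big = memory_cost(512, 8)
ratio = big["clb_transistors"] / big["eab_transistors"]  # fine-grained vs. EAB
```

For the 512x8 case the ratio works out to 12.5, consistent with the "more than 10X" observation.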


Toward Computational Comparison

DeHon metrics:

Computational density of a device:

$$\frac{\text{4-input gate-evaluations}}{\lambda^2 \times s}$$

Processor:

$$\frac{2 \times N_{ALU} \times W_{ALU}}{A_{proc} \times t_{cycle}}$$

FPGA:

$$\frac{N_{4lut}}{A_{array} \times t_{cycle}}$$
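The two density expressions translate directly into code. A small sketch, assuming area is given in units of λ² and cycle time in seconds; the function names and the sample numbers are hypothetical and are not measurements of any real device.

```python
def processor_density(n_alu, w_alu, area_lambda2, t_cycle_s):
    """Gate-evaluations per (lambda^2 * s) for a processor: 2 * N_ALU * W_ALU ops per cycle."""
    return 2 * n_alu * w_alu / (area_lambda2 * t_cycle_s)


def fpga_density(n_4lut, area_lambda2, t_cycle_s):
    """Gate-evaluations per (lambda^2 * s) for an FPGA: one evaluation per 4-LUT per cycle."""
    return n_4lut / (area_lambda2 * t_cycle_s)


# Hypothetical numbers, chosen only to exercise the formulas.
proc = processor_density(n_alu=1, w_alu=32, area_lambda2=64.0, t_cycle_s=1.0)
fpga = fpga_density(n_4lut=100, area_lambda2=50.0, t_cycle_s=2.0)
```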


Degradation

• FPGA can’t really be clocked at 1/7 ns due to interconnect.

• Consider the Bubblesort block from the first class.

if (A > B) { H = A; L = B; } else { H = B; L = A; }

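Full-adder truth table (one column per input combination):

Ci  0 0 0 0 1 1 1 1
A   0 0 1 1 0 0 1 1
B   0 1 0 1 0 1 0 1
S   0 1 1 0 1 0 0 1
Co  0 0 0 1 0 1 1 1

[Figure: compare block with inputs A and B]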

H requires 33 LUT delays
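A short sketch that regenerates the Ci, A, B, S, Co rows on this slide from the full-adder equations and models the compare-and-swap block. The function names are illustrative; the code describes behavior only and says nothing about the LUT delay count.

```python
def full_adder(a, b, ci):
    """One bit slice: sum and carry-out of a, b, and carry-in."""
    s = a ^ b ^ ci
    co = (a & b) | (ci & (a ^ b))
    return s, co


def compare_swap(a, b):
    """Bubblesort block: return (H, L), the larger and smaller of the inputs."""
    return (a, b) if a > b else (b, a)


# Enumerate inputs in the slide's column order: Ci slowest, B fastest.
rows = [(ci, a, b) for ci in (0, 1) for a in (0, 1) for b in (0, 1)]
s_row = "".join(str(full_adder(a, b, ci)[0]) for ci, a, b in rows)
co_row = "".join(str(full_adder(a, b, ci)[1]) for ci, a, b in rows)
```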


Single-Instruction Multiple Data

• Same instruction distributed to fine-grained cells.

• Typically organized as 2-D array

• Ideal for image processing

• Typically fixed hardware located in cell

[Figure: array cell performing a multi-bit operation (op)]


Computation Unit for SIMD

• Performs different operation on every cycle

• Easy to distribute instructions on device (use global lines)

• Some local storage for data in each tile

[Figure: SIMD computation unit. Inputs come from local state or other array elements; outputs go to local state or other array elements; a global instruction is common to all elements.]


Computation Unit for FPGA

• Performs same operation on every cycle

• No global distribution of instructions at all (stored locally)

• Also has local storage for data.

[Figure: FPGA computation unit. Inputs come from local state or other array elements; outputs go to local state or other array elements; a static instruction is distinct for each array element.]
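The contrast between the SIMD and FPGA computation units can be sketched in a few lines: SIMD broadcasts one instruction to every element each cycle, while an FPGA element applies its own locally stored, static instruction. The operation set `OPS` and the function names are hypothetical.

```python
import operator

# Hypothetical operation set for a multi-bit cell.
OPS = {"add": operator.add, "sub": operator.sub, "and": operator.and_}


def simd_step(instruction, a_values, b_values):
    """SIMD: one global instruction, applied by every array element this cycle."""
    op = OPS[instruction]
    return [op(a, b) for a, b in zip(a_values, b_values)]


def fpga_step(cell_instructions, a_values, b_values):
    """FPGA: each element applies its own static, locally stored instruction."""
    return [OPS[name](a, b)
            for name, a, b in zip(cell_instructions, a_values, b_values)]


simd_out = simd_step("add", [1, 2, 3], [10, 20, 30])
fpga_out = fpga_step(["add", "sub", "and"], [1, 5, 6], [10, 2, 3])
```

In the SIMD case the instruction argument may change on every call; in the FPGA case `cell_instructions` stays fixed until the device is reconfigured.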


Hybrid Architecture

• Configuration selects operation of computation unit

• Context identifier changes over time to allow change in functionality

• DPGA – Dynamically Programmable Gate Array

[Figure: DPGA computation unit (LUT) with inputs (in) and output (out). The context identifier supplies the address inputs of the instruction store; programming may differ for each element.]
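A minimal model of a multi-context LUT in the DPGA style: the context identifier selects which stored configuration the computation unit uses on a given cycle. The class name, the 4-input width, and the input bit ordering are assumptions for illustration.

```python
class MultiContextLUT:
    """A 4-input LUT whose truth table is selected by a context identifier."""

    def __init__(self, contexts):
        # contexts: one 16-entry truth table per context (the instruction store)
        for table in contexts:
            if len(table) != 16:
                raise ValueError("each context needs 16 entries for a 4-input LUT")
        self.contexts = contexts

    def evaluate(self, context_id, inputs):
        # Assumed ordering: inputs[0] is the least significant address bit.
        index = sum(bit << pos for pos, bit in enumerate(inputs))
        return self.contexts[context_id][index]


and4 = [0] * 15 + [1]   # context 0: 4-input AND
or4 = [0] + [1] * 15    # context 1: 4-input OR
lut = MultiContextLUT([and4, or4])
```

Calling `lut.evaluate(0, ...)` on one cycle and `lut.evaluate(1, ...)` on the next reuses the same hardware for two different functions.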


In-Place Partitioning

• Recursively bipartition the netlist into regions of the device.

[Figure: device divided into regions a, b, c, d]


Enhanced Mincut

• Terminal propagation takes previous cuts into account during partitioning.

• Effectively creates node “anchors”

• Helps minimize wire length

[Figure: regions a, b, c, d]


Formulating Force Equations

Use Hooke’s Law

Modules 1, 2, …, N

m_i: mass of module i

x_i: x position of module i

K_ij: attractive constant between modules i and j

F_i: net force on module i from the rest of the modules

$$m_i \frac{d^2 x_i}{dt^2} = F_i = \sum_{j=1,\, j \neq i}^{N} K_{ij}\,(x_j - x_i)$$
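One way to use the force equation for placement is to look for the zero-force position of each movable module, where x_i equals the K-weighted average of its neighbors' positions. The sketch below assumes a one-dimensional layout, a symmetric matrix of attractive constants, and some fixed modules (for example, pads); the function names and the relaxation scheme are illustrative, not the specific method from the lecture.

```python
def net_force(i, x, k):
    """Hooke's-law net force on module i: sum of K_ij * (x_j - x_i)."""
    return sum(k[i][j] * (x[j] - x[i]) for j in range(len(x)) if j != i)


def relax(x, k, fixed, sweeps=100):
    """Move each free module to its zero-force position, repeated for several sweeps."""
    x = list(x)
    for _ in range(sweeps):
        for i in range(len(x)):
            if i in fixed:
                continue
            total_k = sum(k[i][j] for j in range(len(x)) if j != i)
            if total_k > 0:
                weighted = sum(k[i][j] * x[j] for j in range(len(x)) if j != i)
                x[i] = weighted / total_k
    return x


# Modules 0 and 2 are fixed at x = 0 and x = 10; module 1 is free.
k = [[0, 3, 0],
     [3, 0, 1],
     [0, 1, 0]]
placed = relax([0.0, 3.0, 10.0], k, fixed={0, 2})
```

Module 1 settles closer to module 0 because its connection to module 0 is three times stronger.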


Hill Climbing Algorithms

• To avoid getting trapped in local minima, consider a “hill-climbing” approach

• Need to accept worse solutions, or make “bad” moves, to reach the global minimum.

• Acceptance is probabilistic. Only accept cost-increasing moves some of the time.

[Figure: cost versus solution space]
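Probabilistic acceptance can be sketched with the Metropolis criterion commonly used in simulated annealing. The slide does not name a specific acceptance function, so the exponential form, the `temperature` parameter, and the function name are assumptions.

```python
import math
import random


def accept_move(delta_cost, temperature, rng=random):
    """Always accept improvements; accept a worse move with probability exp(-delta/T)."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)


# Estimate the acceptance rate for a cost increase of 1 at temperature 1.
rng = random.Random(0)
trials = 10000
accepted = sum(accept_move(1.0, 1.0, rng) for _ in range(trials))
rate = accepted / trials  # should be close to exp(-1), about 0.37
```

At high temperature most "bad" moves are accepted, which lets the search climb out of local minima; as the temperature drops, the search behaves more like greedy descent.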


Maze Routing

• Evaluate shortest feasible paths based on a cost function

• Like global routing for row-based devices, allocates channel bandwidth, not specific solutions.

• Formulate the cost function as needed to address the desired goal.

[Figure: routing example with blocks labeled L, L, C, S]
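A minimal maze-router sketch: a lowest-cost wavefront expansion over a grid, where each cell carries the cost of routing through it. The grid representation, the use of `None` for blocked cells, and the function name are assumptions; a real router would expand over the device's routing-resource graph rather than a plain grid.

```python
import heapq


def maze_route(grid_cost, source, sink):
    """Return (cost, path) of the cheapest route, or (None, None) if unroutable."""
    rows, cols = len(grid_cost), len(grid_cost[0])
    best = {source: 0}
    parent = {}
    heap = [(0, source)]
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell == sink:
            break
        if cost > best[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid_cost[nr][nc] is not None:
                new_cost = cost + grid_cost[nr][nc]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    parent[(nr, nc)] = cell
                    heapq.heappush(heap, (new_cost, (nr, nc)))
    if sink not in best:
        return None, None
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[sink], path[::-1]


# None marks a blocked cell; the 5 models a congested (expensive) cell.
grid = [[1, 1, 1],
        [1, None, 1],
        [1, 5, 1]]
cost, path = maze_route(grid, (0, 0), (2, 2))
```

Changing the per-cell costs changes which route wins, which is how the cost function can be tuned toward a desired goal such as avoiding congestion.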


Routing Tradeoffs

• Bias router to find first, best route.

• Vary number of node expansions using:

$$pcost_i = (1 - a) \times pcost_{i-1} + ncost_i + a \times dist_i$$
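The effect of the parameter a can be seen by evaluating the cost expression at its extremes. This sketch reads pcost as the accumulated path cost, ncost as the cost of the node being added, and dist as the remaining distance to the target; those readings are assumptions, since the slide does not define the symbols.

```python
def path_cost(a, prev_pcost, ncost, dist):
    """pcost_i = (1 - a) * pcost_{i-1} + ncost_i + a * dist_i, as written on the slide."""
    return (1 - a) * prev_pcost + ncost + a * dist


breadth_like = path_cost(0.0, prev_pcost=4.0, ncost=1.0, dist=6.0)  # ignores distance
directed = path_cost(1.0, prev_pcost=4.0, ncost=1.0, dist=6.0)      # ignores history
balanced = path_cost(0.5, prev_pcost=4.0, ncost=1.0, dist=6.0)
```

With a = 0 the distance term drops out and the expansion is driven by accumulated cost alone; with a = 1 the accumulated cost drops out and the router heads toward the target, typically expanding fewer nodes.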


Architectural Limitation

• Routing architecture necessitates domain selection.

• Bigger effect for multi-fanout nets


Pathfinder

• Use a non-decreasing history value to represent congestion.

• Similarities to multi-commodity flow

• Can be implemented efficiently but does require substantial run time

• Only update after an iteration.

$$c_n = (1 + h_n \times h_{fac}) \times (1 + p_n \times p_{fac}) + b_{n,n-1}$$
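A sketch of the node cost and of a non-decreasing history update applied once per routing iteration. Reading h_n as accumulated history, p_n as present overuse, and b as a cost tied to the step from node n-1 to n is an assumption, as are the function names and the unit node capacity.

```python
def node_cost(h_n, p_n, b, h_fac, p_fac):
    """Cost of using node n: (1 + h_n * h_fac) * (1 + p_n * p_fac) + b."""
    return (1 + h_n * h_fac) * (1 + p_n * p_fac) + b


def update_history(history, occupancy, capacity=1):
    """After an iteration, add each node's overuse to its history (never decreases)."""
    nodes = set(history) | set(occupancy)
    return {n: history.get(n, 0) + max(0, occupancy.get(n, 0) - capacity)
            for n in nodes}


cost = node_cost(h_n=2, p_n=1, b=0.5, h_fac=0.5, p_fac=1.0)
new_history = update_history({"n1": 1}, {"n1": 3, "n2": 1})
```

Because the history term only grows, a node that is repeatedly overused becomes steadily more expensive, which pushes later iterations toward other routes.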


DP-FPGA

• Break FPGA into datapath and control sections

• Save storage for LUTs and connection transistors

• Key issue is grain size

• Cherepacha/Lewis – U. Toronto


Rapid

• Reconfigurable Pipeline Datapath

• Ebeling – University of Washington

• Uses hard-coded functional units (ALU, Memory, multiply)

• Good for signal processing

• Linear array of processing elements.

[Figure: linear array of cells]


Basic Functional Unit

• Two inputs from adjacent blocks.

• Local memory for instructions, data.


Chess Basic Block

• Switchbox memory can be used as storage

• ALU core for computation


FPICs

• High internal connectivity

• Not always cost effective


Reconfigurable Processing

From Hauck: Role of FPGAs

• Many places to put reconfigurable computing components

• Most implementations involve multiple discrete devices

• How should these devices be connected together?



Emulation Software Steps

Many of these steps are dependent on the device interconnect topology.

Steps shown in the figure:

• Netlist Translation

• Technology Mapping

• Partitioner: divide the netlist into fixed-sized chunks

• Global Placer: locate an FPGA for a chunk

• Global Router: make connections between devices

• FPGA-specific P+R: Xilinx P+R

• Output: FPGA bitstreams


Network Routing

• FPGAs popular in network hardware

• New protocols implemented directly in silicon

• Easy to upgrade in the field

• Washington University Gigabit Switch (WUGS)

- Switch provides up to 160 Gbps of bandwidth.


Programmable Active Memory

• Developed by DEC Paris Research Group (1988-1993)

• Attached to DEC workstation via Turbochannel bus interface for burst transfers.

• Total of 12 manufactured and distributed worldwide.

• Flexible software environment.


Hybrid Architecture

• Buses connect groups of FPGAs to SRAM

• Extra devices are used for the RAM controller and to map to the external interface.


Logic Emulation

• Emulation takes a sizable amount of resources

• Compilation time can be large due to FPGA compiles

• Emulation is one application, but it also has direct ties to other FPGA computing applications.


Are Meshes Realistic?

• The number of wires leaving a partition grows with Rent’s Rule

$$P = K G^B$$

• Perimeter grows as $G^{0.5}$, but unfortunately most circuits grow at $G^B$ where $B > 0.5$

• Effectively devices highly pin limited

• What does this mean for meshes?
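The pin-limitation argument can be checked numerically by comparing Rent's Rule demand with perimeter-limited supply as the partition grows. The values K = 1 and B = 0.6 below are hypothetical, as are the function names.

```python
def rent_pins(gates, k, b):
    """Rent's Rule: wires leaving a partition of `gates` gates, P = K * G^B."""
    return k * gates ** b


def perimeter_wires(gates, wires_per_unit_edge=1.0):
    """Wires a mesh region can offer if supply scales with perimeter, G^0.5."""
    return wires_per_unit_edge * gates ** 0.5


# Hypothetical Rent parameters, chosen only to show the trend.
K, B = 1.0, 0.6
shortfall_small = rent_pins(100, K, B) / perimeter_wires(100)
shortfall_large = rent_pins(10000, K, B) / perimeter_wires(10000)
```

With B > 0.5 the demand-to-supply ratio grows as G^(B - 0.5), so larger partitions on a mesh become increasingly pin limited.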


Virtual Wires

• Overcome pin limitations by multiplexing pins and signals

• Schedule when communication will take place.
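A toy schedule shows the multiplexing idea: each logical signal is assigned a (time slot, physical pin) pair, so a few pins carry many signals over several slots. This sketch ignores signal dependencies, which a real virtual-wires scheduler must respect; the function name and signal names are hypothetical.

```python
def schedule_virtual_wires(signals, physical_pins):
    """Assign each logical signal a (time_slot, pin) pair in round-robin order."""
    if physical_pins < 1:
        raise ValueError("need at least one physical pin")
    return {signal: (index // physical_pins, index % physical_pins)
            for index, signal in enumerate(signals)}


signals = ["s0", "s1", "s2", "s3", "s4"]
schedule = schedule_virtual_wires(signals, physical_pins=2)
slots_needed = 1 + max(slot for slot, _ in schedule.values())
```

Five signals over two pins need three time slots, which trades emulation speed for pin count.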


A Simple Example

[Figure: four FPGAs (FPGA 1, FPGA 2, FPGA 3, FPGA 4)]


KLFM Partitioning

• Identify nodes to swap to reduce overall cut size

• Lock moved nodes

• Algorithm continues until no un-locked node can be moved without violating size constraints

[Figure: nodes divided between Bin 1 and Bin 2]
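A simplified single pass in the KLFM style: repeatedly move the unlocked node with the best gain that respects the size limit, lock it, and keep the best partition seen. This sketch assumes two-pin nets given as edges with no self-loops, moves one node at a time, and recomputes gains from scratch rather than using gain buckets; the names are illustrative.

```python
def cut_size(edges, side):
    """Number of edges whose endpoints lie in different bins."""
    return sum(1 for u, v in edges if side[u] != side[v])


def gain(node, edges, side):
    """Reduction in cut size if `node` moves to the other bin."""
    g = 0
    for u, v in edges:
        if node in (u, v):
            other = v if u == node else u
            g += 1 if side[other] != side[node] else -1
    return g


def klfm_pass(edges, side, max_bin_size):
    """Run one pass; return the best partition found and its cut size."""
    side = dict(side)
    best_side, best_cut = dict(side), cut_size(edges, side)
    locked = set()
    while True:
        candidates = []
        for node in side:
            if node in locked:
                continue
            target = 1 - side[node]
            target_size = sum(1 for s in side.values() if s == target)
            if target_size + 1 <= max_bin_size:
                candidates.append((gain(node, edges, side), node))
        if not candidates:
            break  # no unlocked node can move without violating the size limit
        _, node = max(candidates)
        side[node] = 1 - side[node]
        locked.add(node)  # a moved node stays locked for the rest of the pass
        current = cut_size(edges, side)
        if current < best_cut:
            best_side, best_cut = dict(side), current
    return best_side, best_cut


edges = [("a", "b"), ("c", "d"), ("b", "c")]
start = {"a": 0, "b": 1, "c": 0, "d": 1}  # initial cut size is 3
best_side, best_cut = klfm_pass(edges, start, max_bin_size=3)
```

Moves with zero or negative gain are still taken during the pass, which is what lets the algorithm escape some local minima; only the best intermediate partition is kept.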


Clustering

• Technology mapping before partitioning is typically ineffective, since area is frequently secondary to interconnect

• Frequently, bipartitioning continues after unclustering as well.

• This allows for additional fine-grain moves.

Flow: cluster → KLFM → uncluster → KLFM


Higher-level Gains

• Effectively looks ahead to try to anticipate the next move

• A look-ahead of 3 is considered the best tradeoff


Are Meshes Really Realistic?

• The number of wires leaving a partition grows with Rent’s Rule

$$P = K G^B$$

• Perimeter grows as $G^{0.5}$, but unfortunately most circuits grow at $G^B$ where $B > 0.5$

• Effectively devices highly pin limited

• What does this mean for meshes?
