FPGA Logic Cluster Design

Dr. Philip Brisk

Department of Computer Science and Engineering

University of California, Riverside

CS 223

How Much Logic Should Go in an FPGA Logic Block?

Vaughn Betz, Jonathan Rose

IEEE Design & Test of Computers 15(1): 10-15 (1998)

Three Questions

• How many inputs should the FPGA routing provide to a cluster of LUTs? (I)

– Routing flexibility vs. area

• As the number of LUTs in a logic cluster changes, how should the FPGA’s routing architecture change? (Fc)

• How many LUTs should be included in a cluster? (N)

Experimental Methodology

• 20 MCNC Benchmarks

– Well-established

– A bit old, even by 1998 standards

– Sadly, still in use

• 4-LUT Architecture

• Fs = 3

– Vary other parameters to see what works best

Area Model

• Count the number of min-width transistors required to implement a benchmark circuit in an FPGA architecture

• Normalized Area = (Num min-width transistors used) / (Num BLEs used)
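
The area metric above can be sketched in a few lines. This is only an illustration of the ratio on the slide; the function name and the example counts are hypothetical and do not come from the paper.

```python
def normalized_area(min_width_transistors: float, bles_used: int) -> float:
    """Area per BLE, measured in minimum-width transistor areas."""
    if bles_used <= 0:
        raise ValueError("bles_used must be positive")
    return min_width_transistors / bles_used


# Hypothetical counts, for illustration only (not measured data).
example = normalized_area(1_200_000, 1_000)
print(example)  # 1200.0
```

Dividing by the number of BLEs used lets architectures with different cluster sizes be compared on the same circuit.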

How many cluster inputs do we need?

We hit near 100% utilization when I = 50-60% of the total number of BLE inputs

We can pack BLEs together to share common inputs

Re-use locally generated outputs

Works because the packing algorithm was effective!

Input sharing and output re-use within a logic cluster

Visual Depiction

I ≈ 0.6KN is pretty good

Use the feedbacks!

Fanout

The Packer was Effective!

It packed BLEs together to share common inputs

It re-used locally generated outputs via the feedbacks
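
A minimal sketch of why sharing and feedback reduce the number of cluster inputs. It assumes a cluster is described as a map from each BLE's output net to the nets feeding its LUT; this representation, the function name, and the net names are all made up for illustration and are not the packer used in the paper.

```python
from typing import Dict, List, Set


def external_inputs(cluster: Dict[str, List[str]]) -> Set[str]:
    """Return the nets that must enter the cluster from the routing.

    A net used by several BLEs is counted once (input sharing), and a net
    produced inside the cluster is reached through the local feedback
    paths, so it needs no cluster input (output re-use).
    """
    local_outputs = set(cluster)
    needed: Set[str] = set()
    for input_nets in cluster.values():
        needed.update(input_nets)
    return needed - local_outputs


# Hypothetical 3-BLE cluster of 4-LUTs.
cluster = {
    "x": ["a", "b", "c", "d"],
    "y": ["a", "b", "x", "e"],  # shares a and b, re-uses x locally
    "z": ["x", "y", "c"],       # re-uses x and y locally, shares c
}
used_pins = sum(len(nets) for nets in cluster.values())
ext = external_inputs(cluster)
print(len(ext), "external inputs serve", used_pins, "used LUT pins")
```

In this toy cluster, 11 used LUT pins are served by 5 distinct external inputs, which is the effect the packer exploits.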

Cluster inputs vs. Cluster size

Approx. (2N + 2)

N = 1 BLE uses 3.5 / 4 inputs (on average)

N = 16 BLEs use 19.7 / 64 inputs, on average

Commercial FPGAs

• Altera Flex 8000 FPGA uses a cluster of size N=8 with I=24

– Results suggest reducing I to 18 (saves area)

• Xilinx 5200 FPGA uses a cluster of size N=4 with I=16

– Results suggest reducing I to 10 (saves area)
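
The suggested values follow from the I ≈ 2N + 2 rule for 4-LUT clusters. The short check below reproduces them; the function name is hypothetical, and the device figures are the ones quoted on this slide.

```python
def cluster_inputs_4lut(n: int) -> int:
    """Rule of thumb from these slides for 4-LUT clusters: I = 2N + 2."""
    return 2 * n + 2


# (device, cluster size N, shipped number of cluster inputs I)
devices = [("Altera Flex 8000", 8, 24), ("Xilinx 5200", 4, 16)]
for name, n, shipped_i in devices:
    suggested = cluster_inputs_4lut(n)
    print(f"{name}: N={n}, shipped I={shipped_i}, suggested I={suggested}")
```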

Routing Flexibility vs. Cluster Size

• Set Fc = W/N

– Each routing track is driven by one LUT output pin in the cluster
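
One way to picture Fc = W/N: split the W tracks of a channel evenly among the N BLE outputs so that no track has two drivers. The contiguous split below is an assumption made for illustration; the slides state only the ratio, not the actual connection pattern, and the function name is hypothetical.

```python
from typing import List


def output_track_assignment(w: int, n: int) -> List[List[int]]:
    """Give each of n BLE outputs Fc = w // n tracks, with no track shared."""
    if n <= 0 or w % n != 0:
        raise ValueError("this sketch assumes W is a positive multiple of N")
    fc = w // n
    return [list(range(i * fc, (i + 1) * fc)) for i in range(n)]


# Hypothetical channel width W = 24 and cluster size N = 4, so Fc = 6.
assignment = output_track_assignment(24, 4)
for ble, tracks in enumerate(assignment):
    print(f"BLE {ble} output drives tracks {tracks}")
```

As N grows, each output reaches fewer tracks, yet every track can still be reached from the cluster.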

Area Efficiency vs. Cluster Size

I is set to achieve 98% logic utilization

N=2 BLEs introduces intra-cluster routing

Reduce routing between logic blocks

Area efficiency rapidly degrades beyond this point

Conclusions

• I = 2N + 2 for N < 16

– Slow, linear growth

• Reduce Fc

– Works because LUT inputs are equivalent

• Cluster area efficiency is within 10% for 1 < N < 8

• Large clusters reduce the size of the placement problem and increase FPGA speed

The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density

Elias Ahmed, Jonathan Rose

IEEE Transactions on VLSI Systems 12(3): 288-298 (2004)

Contributions

• Vary LUT size (K) from 2 to 7

• Vary cluster size (N) from 1 to 10 LUTs

– Experimentally determine the number of cluster inputs (I) as a function of K and N

– Clustering small LUTs (K=2,3) produces good area results, but bad performance (~2x worse)

– LUTs of size (K=4,5,6), clusters of size (N=3…10) yield the best area-delay product

CAD Flow

Inputs Req.’d for 98% Area Utilization

I = ½K(N+1)
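
A small sketch that evaluates this formula and compares it with the total number of LUT pins, K·N. The function name is hypothetical, and the result is left unrounded because the slides do not say how fractional values are handled. For K = 4 the formula reduces to 2N + 2, the rule from the earlier paper.

```python
def cluster_inputs(k: int, n: int) -> float:
    """Cluster inputs for about 98% utilization: I = K(N + 1) / 2."""
    return k * (n + 1) / 2


for k in (4, 5, 6):
    for n in (1, 4, 8):
        i = cluster_inputs(k, n)
        print(f"K={k} N={n}: I={i} of {k * n} LUT pins")
```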

Total Area

• LUT sizes of K = 4,5 are the most area efficient for all cluster sizes

• Reduction in total area as cluster size increases from 1-3 for all LUT sizes

• As clusters are made larger (N > 4) there is little impact on total FPGA area

• Intra-cluster routing area is 25-35% of the total area

Total Intra-cluster Routing Area

The rate of increase in cluster size far outweighs the rate of decrease in the number of clusters: hence the upward trend

#Clusters and Area/Cluster vs. K

25-35%

N = 1 BLE per Cluster

LUT area vs. Intra-cluster Mux Area

Intra-cluster routing area is 25-35% of logic cluster area

LUT area dominates

Intra-cluster Routing Area as a Function of LUT Size

Total intra-cluster routing area decreases near-linearly from K = 3 to 7

Total Intra-cluster Routing Area

The product of these two curves gives the total inter-cluster routing area.

Routing area decreases linearly with LUT size

• Increasing LUT size decreases the number of clusters used faster than the rate of increase in routing area per cluster

• Depends on good CAD tools

Critical Path Delay vs. LUT Size

Increasing both N and K has a positive effect

• Benefits saturate as N and K get large

As N and K increase

• LUT delay and the delay through a single cluster increase

• The number of LUTs and clusters in series on the critical path decreases

• Reduced global routing delay

Intra-cluster Delay vs. LUT Size

Intra-cluster delay decreases as K increases

• Reduction in number of BLE levels on critical path

Intra-cluster delay increases as N increases

• Larger intra-cluster muxes are slower

• The delay through these muxes is still much smaller than global routing delay

BLE Delay vs. K

BLE delay increases linearly as K increases (intuitive)

Number of BLEs on the critical path decreases quadratically as K increases

• Fewer, but larger, BLEs

Global Routing Delay vs. K

As K increases

• Fewer LUTs on the critical path

• Fewer global routing links

As N increases

• More opportunities to use faster intra-cluster routing

Critical Path Delay (K = 4)

• K remains constant

– No reduction in number of BLEs on critical path

• N increases

– BLE and intra-cluster routing delay increase

– More logic implemented internally within clusters

– Can use faster intra-cluster routing instead of global routing
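
The trade-off above can be sketched as a sum of three delay components. This additive model, the function name, and every number below are placeholders in arbitrary units chosen to mirror the qualitative trend only; none of them are measurements from the paper.

```python
def critical_path_delay(num_bles, t_ble, local_hops, t_local, global_hops, t_global):
    """Sum BLE, intra-cluster, and global routing delay along one path."""
    return num_bles * t_ble + local_hops * t_local + global_hops * t_global


# K is fixed, so the number of BLEs on the path stays at 10 in both cases.
# Small N: few connections stay inside a cluster.
small_n = critical_path_delay(num_bles=10, t_ble=200, local_hops=2, t_local=100,
                              global_hops=8, t_global=500)
# Large N: BLE and local delays are slightly higher, but more hops are local.
large_n = critical_path_delay(num_bles=10, t_ble=210, local_hops=6, t_local=150,
                              global_hops=4, t_global=500)
print(small_n, large_n)
```

With these placeholder values the larger cluster wins because each global hop it avoids costs more than the extra local delay it adds.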

Critical Path Delay vs. LUT Size (Recap)

Increasing N beyond 3 has minimal effects

• Limited effectiveness of clustering

• Architectural weakness?

• Semi-effective CAD tools?

Number of Logic Clusters on Critical Path

The number of logic levels decreases with increasing N and K

• For a given K, most of the reduction is from N = 1 to 3

• The majority of the critical path delay was reduced in this range

• Increasing N is less effective when K is large

BLE Fanout vs. LUT Size

Smaller LUTs have better response to increasing N because each LUT has a relatively small fanout

• Adding an extra BLE to the cluster guaranteed some reduction in the number of logic levels

Larger LUTs have larger average fanout

• Harder to ensure that increasing N will result in fewer cluster levels on the critical path

Area-Delay Product

Large Delays

• Many BLEs on critical path

• Slightly larger area requirement

Large area cost for K=7 outweighs marginal delay improvement
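
A minimal sketch of how an area-delay comparison picks a winner. The candidate values are placeholders in arbitrary units, shaped only to echo the trend described on this slide (small K is slow, K = 7 is large); they are not results from the paper, and the function name is hypothetical.

```python
def best_area_delay(candidates):
    """Return the candidate with the smallest area x delay product."""
    return min(candidates, key=lambda c: c["area"] * c["delay"])


# Placeholder values only, in arbitrary integer units.
candidates = [
    {"k": 2, "area": 110, "delay": 200},  # many BLEs on the critical path
    {"k": 4, "area": 100, "delay": 100},
    {"k": 7, "area": 160, "delay": 90},   # small delay gain, large area cost
]
best = best_area_delay(candidates)
print("lowest area-delay product at K =", best["k"])
```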

Caveats

• Quality of CAD tools

• Mix of benchmark circuits

• Limited exploration of routing parameter design space

– Parameters were derived from N = K = 4

Best Overall Results and Summary

• To achieve 98% LUT utilization, set I = ½K(N+1)

• Small LUT sizes are not area efficient and have poor performance characteristics

• Future challenges

– Reduce number of BLEs on critical path without resorting to larger LUTs

– Reduce intra-cluster routing delays
