Architectural Improvement for Field Programmable Counter Array: Enabling Efficient Synthesis of Fast Compressor Trees on FPGA
Alessandro Cevrero¹˒², Panagiotis Athanasopoulos¹˒², Hadi Parandeh-Afshar², Paolo Ienne², Yusuf Leblebici¹, Ajay K. Verma², Philip Brisk², Frank K. Gurkaynak¹
1 2
16th ACM/SIGDA International Symposium on FPGAs
Monterey, California, USA, February 26, 2008
Motivation and Contribution
Goal: Improve FPGA performance for arithmetic circuits.
Field Programmable Counter Array (FPCA) [Brisk et al., DAC 2007]:
• Programmable IP core to accelerate compressor trees
• Hybrid FPGA/FPCA device
Contributions:
• Completely new FPCA architecture
• Reduced routing delay
• More flexibility and better mapping
• Simplified integration process
FPGA Commentary
Logic cells with dedicated addition circuitry and fast carry chains
• Support for ternary addition [Altera Stratix II/III, Xilinx Virtex-5]
• Parallel accumulation uses adder trees
ASIC designers use compressor trees!
• Compressor tree synthesis on FPGAs via GPC mapping [Parandeh-Afshar et al., ASPDAC 2008, DATE 2008]
• Faster than ternary adder trees
IP cores
• DSP48, BlockRAM, etc. [Xilinx, Altera]
• FP cores [Beauchamp et al., TVLSI 2008]
• Mismatches in bitwidth limit gains [Kuon and Rose, FPGA 2006, TCAD 2007]
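To make the building block behind GPC mapping concrete, the following is a minimal behavioral sketch of a generalized parallel counter (GPC): it takes input bits grouped by rank (column weight) and returns the binary value of their weighted sum. The function name `gpc` and the example shapes are illustrative assumptions, not taken from the cited papers.

```python
def gpc(bits_by_rank):
    """Behavioral model of a generalized parallel counter (GPC).

    bits_by_rank[r] is a list of 0/1 input bits of weight 2**r.
    Returns the output bits, least significant first.
    """
    total = sum(bit << rank
                for rank, bits in enumerate(bits_by_rank)
                for bit in bits)
    # The largest value the GPC can produce fixes its output width.
    max_total = sum(len(bits) << rank for rank, bits in enumerate(bits_by_rank))
    width = max(1, max_total.bit_length())
    return [(total >> i) & 1 for i in range(width)]


# A 3:2 counter (full adder) is the single-rank special case: 1 + 1 + 1 = 3.
full_adder_out = gpc([[1, 1, 1]])
# Three rank-0 bits and two rank-1 bits: (1 + 0 + 1) + 2 * (1 + 1) = 6.
mixed_out = gpc([[1, 0, 1], [1, 1]])
```

A single-rank GPC is an ordinary m:n counter, so the same model also covers the 15:4, 4:3, and 3:2 counters mentioned later in the talk.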
Methodology and Solution
1. Transform circuit to merge disparate addition and multiplication operations to expose compressor trees
• [Verma and Ienne, ICCAD 2004]
2. Synthesize compressor tree onto FPCA
• [Brisk et al., DAC 2007]
3. Map everything else onto traditional FPGA
• Standard approach
4. Integrate FPGA+FPCA onto same die
• Ongoing research at EPFL
FPCA : programmable compressor tree
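Step 1 can be pictured with a small sketch: a multiplication followed by an addition becomes a single array of weighted bits, and that array is what a compressor tree reduces. This is only an illustration of the idea, not the transformation of [Verma and Ienne, ICCAD 2004]; the function names are hypothetical.

```python
def multiply_add_columns(a, b, c, width):
    """Return columns[k], the list of bits of weight 2**k whose sum is a*b + c.

    a, b, and c are unsigned integers of `width` bits.
    """
    columns = [[] for _ in range(2 * width)]
    for i in range(width):              # bit i of a
        for j in range(width):          # bit j of b
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    for k in range(width):              # bits of the addend join the same columns
        columns[k].append((c >> k) & 1)
    return columns


def column_value(columns):
    """Weighted sum of all bits in the column array."""
    return sum(sum(col) << k for k, col in enumerate(columns))


cols = multiply_add_columns(a=13, b=11, c=7, width=4)
heights = [len(col) for col in cols]    # number of bits to compress per column
```

The per-column heights are the input that any compressor tree synthesis step, on an FPCA or elsewhere, has to work with.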
Previous Work
Initial FPCA architecture [Brisk et al., DAC 2007]
Routing network delay
• Performance bottleneck
Poor area utilization
• Many resources unused
• Large counters implement the functionality of smaller counters
"Pitch matching" problem
• FPCA routing channels must align with FPGA routing channels
• Leads to unnecessarily large counters
Recurring Patterns in Compressor Tree Synthesis
[Figure: recurring pattern. Column heights 15, 4, 3, 2 are reduced in turn by 15:4, 4:3, and 3:2 counters, followed by a CPA.]
New FPCA architecture: Counter Slice (CSlice)
• Compress one column at a time
• Propagate carry bits to neighboring CSlices
Eliminates FPGA-style routing network
• No routing delay between counters
• Pitch matching problem disappears
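The column-at-a-time idea can be simulated in a few lines. The sketch below is a behavioral model only, under the assumption that bit r of each counter's output is passed to the column r positions to the left; it does not reproduce the actual CSlice wiring, and the function names are hypothetical.

```python
def count_column_bits(columns, max_inputs=15):
    """Apply one level of counters to a column array.

    Each column is counted in groups of at most max_inputs bits, and bit r of
    each count is appended to column k + r of the result.
    """
    result = [[] for _ in range(len(columns) + max_inputs.bit_length())]
    for k, column in enumerate(columns):
        for start in range(0, len(column), max_inputs):
            group = column[start:start + max_inputs]
            count = sum(group)
            # A counter with n inputs needs n.bit_length() output bits.
            for r in range(len(group).bit_length()):
                result[k + r].append((count >> r) & 1)
    return result


def compressor_tree_sum(columns, max_inputs=15):
    """Compress until every column holds at most two bits, then add."""
    levels = 0
    while max(len(column) for column in columns) > 2:
        columns = count_column_bits(columns, max_inputs)
        levels += 1
    # Final carry-propagate addition (CPA) of the remaining bits.
    total = sum(sum(column) << k for k, column in enumerate(columns))
    return total, levels


# Eight columns of fifteen 1-bits each: value 15 * (2**8 - 1).
total, levels = compressor_tree_sum([[1] * 15 for _ in range(8)])
```

For this input the model needs three counter levels before the CPA, which mirrors the 15:4, 4:3, 3:2 sequence of the recurring pattern.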
CSlice Architecture
[Figure: FPCA v2.0 built from adjacent CSlices S_i, S_i+1, S_i+2, S_i+3. Block labels inside each CSlice: Configurable GPC, 15:4, 4:3, 3:2, CPA. Additional label: Area Utilization.]
FPCA V2.0 Mapping Heuristic
FPCA synthesis heuristic:
• Map columns of input bits onto FPCA
• Minimize the height of the compressor tree
• Avoid vertical configurations, when possible
[Figure: Multi-FPCA Configurations, horizontal and vertical, with routing delay marked.]
CSlice Synthesis
CSlice V2.0 rank-3 with 16 input bits per CSlice
90nm Artisan standard cell library

CSlice            Rank-1   Rank-2   Rank-3
Area [µm²]        1240     2347     2770
Delay [ns]        0.40     0.71     0.73
CPA delay [ns]    0.04     0.05     0.07

FPCA Synthesis:
Rank-3 CSlices used in experiments
8 CSlices per FPCA
Similar to dimensions of a DSP block in current FPGAs
Simplifies integration process
DFFs store configuration bitstream
Semi-custom design
Standard cells are predominant
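As a quick back-of-the-envelope check, the CSlice area of one FPCA follows from the table and the count of 8 rank-3 CSlices. The sketch assumes the total is simply the sum of the CSlice areas and ignores anything the table does not list.

```python
# Areas copied from the CSlice table, in square micrometres.
CSLICE_AREA_UM2 = {"rank-1": 1240, "rank-2": 2347, "rank-3": 2770}
CSLICES_PER_FPCA = 8

# Assumption: FPCA CSlice area is the plain sum of its rank-3 CSlices.
fpca_cslice_area_um2 = CSLICES_PER_FPCA * CSLICE_AREA_UM2["rank-3"]
fpca_cslice_area_mm2 = fpca_cslice_area_um2 / 1_000_000
```

This gives 22160 µm², or about 0.022 mm², for the CSlices of one FPCA.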
FPCA Delay Extraction
Methodology:
• Each FPCA instance is replaced with an F* instance (same I/O)
• Extract the delay between F* instances
• Combine these delays with the combinational delay extracted for the FPCA
[Figure: SUM operations between input pins and output pins.]
Define a pre-placed soft IP core: F*
• Same dimensions and I/O as FPCA
• Replace all sum operations with F*
Map onto Stratix II FPGA
• Extract critical path delay
Map compressor tree onto FPCA
• Configuration DFF values set to constant values; not optimized
• Measure critical path delay
For each compressor tree in the circuit
• Subtract delay of F*
• Add FPCA delay
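The final substitution step amounts to simple arithmetic, sketched below. The function name and the numbers are hypothetical and for illustration only; they are not measurements from the talk. The sketch also assumes the same path stays critical after the substitution.

```python
def delay_with_fpca(path_delay_with_fstar_ns, trees_on_path):
    """Estimate a path delay after swapping F* placeholders for FPCAs.

    trees_on_path: list of (fstar_delay_ns, fpca_delay_ns) pairs, one per
    compressor tree that lies on the path.
    """
    delay = path_delay_with_fstar_ns
    for fstar_delay_ns, fpca_delay_ns in trees_on_path:
        delay -= fstar_delay_ns   # remove the placeholder's contribution
        delay += fpca_delay_ns    # add the delay measured for the mapped FPCA
    return delay


# Hypothetical numbers: a 10 ns path through two compressor trees.
estimate_ns = delay_with_fpca(10.0, [(2.0, 1.5), (2.0, 1.0)])
```

With these made-up inputs the estimate is 10 - 2 + 1.5 - 2 + 1 = 8.5 ns.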
[Figure: each SUM operation is replaced by an F* instance, and each F* instance by an FPCA.]
Experimental Results
Comparison
• GPC Mapping [Parandeh-Afshar et al., ASP-DAC 2008]
• FPCA mapping (6 FPCAs per device)
[Figure: bar chart "FPCA Speedup Over GPC Mapping"; vertical axis from 0 to 3; series GPC Mapping and FPCA; labeled speedups 2.40x and 1.60x.]
Conclusion
New FPCA architecture
• Hardwired connections between counters
• Counters of multiple sizes organized into CSlices
• Carry chains between CSlices
Avg./Max. speedups of 1.60x/2.40x compared to GPC mapping
Future Work
Add pipeline registers to FPCA
• Increases latency, but also clock frequency and throughput
Demonstrator chip taped out in October 2007
• Returned from the foundry in January 2008; PCBs ready next week
• Measure power consumption, clock frequency, I/O interface, etc.
Demonstrator Chip