CONFIDENTIAL UP TO AND INCLUDING 12/31/2014 - DO NOT COPY, DISTRIBUTE OR MAKE PUBLIC IN ANY WAY

Berg Severens

Technology Mapping for Logic in FPGA Routing

Academic year 2013-2014
Faculty of Engineering and Architecture
Department of Electronics and Information Systems
Chairman: Prof. dr. ir. Jan Van Campenhout

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering

Supervisor: Prof. dr. ir. Dirk Stroobandt
Counsellors: Ir. Elias Vansteenkiste, Ir. Karel Heyse

CONFIDENTIAL UP TO AND INCLUDING 31/12/2014

IMPORTANT

This Master Dissertation may contain confidential information and/or confidential research results proprietary to Ghent University or third parties. It is strictly forbidden to publish, cite or make public in any way this Master Dissertation or any part thereof without the express written permission of Ghent University. Under no circumstance this Master Dissertation may be communicated to or put at the disposal of third parties. Photocopying or duplicating it in any other way is strictly prohibited. Disregarding the confidential nature of this Master Dissertation may cause irremediable damage to Ghent University.

Permission of usage

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of the copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Berg Severens, May 31, 2014

Acknowledgements

This master's thesis is the completion of five years of studying engineering at the University of Ghent. It would have been impossible without the help and support of several people. Therefore I would like to thank:

• All people of the HES group: Elias, Tom, Karel, Brahim, Alexandra and Professor Stroobandt. I would like to thank them for the constructive feedback during the weekly meetings and the presentations.

• In particular Elias gave me excellent feedback and provided a number of useful ideas.

• My parents. Without their support these studies would not have been possible. I am very grateful that I received this opportunity.

• My girlfriend Jolien for her patience and support.

• Sebastiaan, Christof and Jente for the pleasant atmosphere while working on the thesis.

Berg Severens, May 31, 2014

Technology Mapping for Logic in FPGA Routing

by

Berg Severens

Master thesis submitted to obtain the academic degree of

Master of Science: Electrical Engineering

Academic year 2013–2014

Promotor: Prof. Dr. Ir. D. Stroobandt

Supervisors: Ir. E. Vansteenkiste, Ir. K. Heyse

Faculty of engineering

University of Ghent

Department Electronics and Information Systems

President: Prof. Dr. Ir. J. Van Campenhout

Summary

A field-programmable gate array (FPGA) is a programmable digital chip. In this thesis we try to reduce its delay and area by changing the architecture. State-of-the-art FPGAs consist of routing and logic blocks: logic blocks contain small digital functions, whereas the routing connects the appropriate logic blocks. We combine these functionalities by adding 2-input logic components to the routing.

To estimate the performance of the new architecture, we provide a technology mapping algorithm. It yields the number of components on the critical path and the FPGA's resource usage.

Keywords

FPGA, Architecture, Depth cost, AND gate, Technology Mapping

Technology Mapping for Logic in FPGA Routing

Berg Severens

Supervisors: Elias Vansteenkiste, Karel Heyse, Dirk Stroobandt

Abstract—The choice of an appropriate architecture is crucial to design high-performance FPGAs. Therefore extensive research has been done to optimize both routing and logic blocks. However, to our knowledge the combination of logic and routing has never been considered. In this paper we add 2-input logic components to the routing. In this way many degrees of freedom can be used to put small logic units in strategic places. A depth-optimal technology mapping algorithm with area recovery has been designed for both LUTs and 2-input logic components in the routing. For the AND gate a depth reduction of 27% is observed without an increase in LUT usage.

Keywords— FPGA, architecture, technology mapping, AND gate, COFFE

I. INTRODUCTION

The choice of an appropriate architecture is crucial to design high-performance FPGAs. Therefore extensive research has been done to optimize both routing and logic blocks. However, to our knowledge the combination of logic and routing has never been considered. In this paper we add 2-input logic components to the routing. In this way many degrees of freedom can be used to put small logic units in strategic places. We wanted to estimate the performance of this new architecture as accurately as possible. Therefore we explored both the hardware and the software.

For the hardware we collaborated with three students during their ‘Hardware Design Project’ [2]. They used COFFE [5] to estimate the effects on the area and delay of the whole FPGA when the new components are physically added.

In the conventional tool flow HDL code is converted into a bitstream to program the FPGA. This translation is done in five steps. First, during logic synthesis, the code is written as a network of AND and NOT gates, called an ‘and-inverter graph’ (AIG). Then this network is mapped onto a network of LUTs; this step is called ‘technology mapping’. In our case we have to map the AIG to a network of both LUTs and the considered 2-input components.
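To make the notion of an AIG concrete, the following minimal sketch represents a small and-inverter graph and evaluates it. The `Aig` class, its method names and the XOR example are hypothetical and chosen only for illustration; they are not taken from the tool flow used in this work.

```python
from itertools import product


class Aig:
    """Tiny and-inverter graph: every node is a primary input or a 2-input AND.

    A literal is a pair (node_id, inverted); inversion lives on the edges.
    All names are illustrative assumptions, not an existing tool's API.
    """

    def __init__(self):
        self.fanins = []  # None for a primary input, (lit0, lit1) for an AND

    def add_input(self):
        self.fanins.append(None)
        return (len(self.fanins) - 1, False)

    def add_and(self, lit0, lit1):
        self.fanins.append((lit0, lit1))
        return (len(self.fanins) - 1, False)

    @staticmethod
    def invert(lit):
        node, inverted = lit
        return (node, not inverted)

    def evaluate(self, lit, input_values):
        """Evaluate a literal; input_values maps input node ids to booleans."""
        node, inverted = lit
        if self.fanins[node] is None:
            value = input_values[node]
        else:
            lit0, lit1 = self.fanins[node]
            value = (self.evaluate(lit0, input_values)
                     and self.evaluate(lit1, input_values))
        return value != inverted  # flip the value on an inverted edge


def build_xor():
    """XOR written with AND nodes and inverted edges only."""
    aig = Aig()
    a = aig.add_input()
    b = aig.add_input()
    a_and_not_b = aig.add_and(a, Aig.invert(b))
    not_a_and_b = aig.add_and(Aig.invert(a), b)
    # a XOR b = NOT( NOT(a AND NOT b) AND NOT(NOT a AND b) )
    neither = aig.add_and(Aig.invert(a_and_not_b), Aig.invert(not_a_and_b))
    return aig, a, b, Aig.invert(neither)


if __name__ == "__main__":
    aig, a, b, out = build_xor()
    for va, vb in product([False, True], repeat=2):
        print(va, vb, aig.evaluate(out, {a[0]: va, b[0]: vb}))
```

Storing the inversion on the edges rather than as separate NOT nodes keeps every internal node a 2-input AND, which is the property the mapping step relies on.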

We did this for several components. The AND gate turned out to be the most promising one to implement. For the AND gate a 27% reduction in depth (geometric mean over the MCNC20 and VTR benchmarks) is observed if we consider the delay of a gate to be zero.

After technology mapping the LUTs and connections are physically placed on the FPGA. This is done by the packing, placement and routing algorithms. However, these algorithms were outside the scope of this thesis.

II. HARDWARE

In the VPR [6] architecture multiplexers are built with pass-gate transistors and two buffers. Our modified configuration is depicted in Figure 1. From left to right we see two multiplexers consisting of pass gates. After each multiplexer there is a buffer with a level restorer. Then we placed a NAND gate. After the NAND gate we again used a buffer with a level restorer, and at the end there is a larger buffer, which can drive a higher output load. Several remarks can be made:

• Any signal of the first multiplexer can be combined with any signal of the second multiplexer. However, it is not possible to combine two signals of the same multiplexer.

• With this device it is possible to select a single signal. To do so, choose the desired input signal of the appropriate multiplexer and choose the ground signal for the other multiplexer. It can be verified that the desired signal will be present at the output.

• A NAND gate with inverted inputs and a doubly inverted output is in fact an OR gate. It can be shown that the added functionality of the OR gate is equivalent to the added functionality of the AND gate in FPGAs.

Fig. 1. The multiplexing device with AND gate functionality.
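The last two remarks can be checked with a small truth-table enumeration. The sketch below is a purely logic-level reading of those remarks (inverted inputs, a NAND gate, two output inversions); the function names are hypothetical and the model says nothing about the electrical behaviour of the circuit.

```python
from itertools import product


def nand(a, b):
    return not (a and b)


def routing_gate(x, y):
    """Logic-level model assumed from the remarks: NAND with inverted inputs
    and a doubly inverted output."""
    z = nand(not x, not y)
    return not (not z)  # two inversions cancel each other


if __name__ == "__main__":
    # De Morgan: NOT(NOT x AND NOT y) equals x OR y for every input pair.
    for x, y in product([False, True], repeat=2):
        assert routing_gate(x, y) == (x or y)
    # Tying the second input to ground passes the first signal unchanged.
    for x in (False, True):
        assert routing_gate(x, False) == x
    print("both remarks hold for all input combinations")
```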

The sizes of the multiplexer buffers are optimized using COFFE [5]. This tool also takes into account the varying optimal buffer sizes of other components, such as LUTs. The resulting delay of our component did not change significantly compared to the state-of-the-art multiplexer. This can be explained by the reduced input loads due to the use of smaller multiplexers: the signal arrives faster at the first buffer thanks to the lower input load, but it has to traverse more gates. Overall, the delay does not change significantly.

Furthermore, the area increases because of the increased use of SRAM cells and buffers. If we replace all multiplexers of the switch blocks, the FPGA's area increases by 11%.

III. TECHNOLOGY MAPPING

In the technology mapping algorithm, we convert a network of AND and NOT gates into a network of LUTs and the considered 2-input components. This conversion is depicted in Figure 2: the big circles of the input network are AND gates, and the small circles on top of the big circles mark inverted signals. This is called an ‘and-inverter graph’ (AIG). In this case the AIG is translated to a network of 3-LUTs and AND gates.

We considered the technology mapping algorithm DAOmap [1]. This is a depth-optimal algorithm with area recovery. Mathematical induction is used to calculate the minimal depth for each node. LUTs offer more degrees of freedom than some 2-input components. Therefore we had to deal with some additional constraints. With a minimal number of modifications we succeeded in designing a depth-optimal mapping algorithm with area recovery.

Fig. 2. The input and output networks of our technology mapping algorithm.
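The induction over nodes can be illustrated with a cut-based depth labelling. The sketch below is a simplified illustration for K-input LUTs only, assuming K is at least 2. It is not DAOmap itself: it omits area recovery and the 2-input routing components, and it ignores inverters on the assumption that a LUT can absorb them. The graph encoding and the function names are assumptions made for this example.

```python
from itertools import product


def enumerate_cuts(fanins, k):
    """Return, per node, the set of K-feasible cuts (frozensets of leaf nodes).

    `fanins` maps each node to a tuple of its two fanin nodes, or to () for a
    primary input. Nodes must be listed in topological order.
    """
    cuts = {}
    for node, ins in fanins.items():
        trivial = frozenset([node])
        if not ins:
            cuts[node] = {trivial}
            continue
        merged = {trivial}
        # Every cut of an AND node is the union of one cut of each fanin.
        for c0, c1 in product(cuts[ins[0]], cuts[ins[1]]):
            cut = c0 | c1
            if len(cut) <= k:
                merged.add(cut)
        cuts[node] = merged
    return cuts


def depth_labels(fanins, k):
    """Label every node with the smallest LUT depth over its enumerated cuts."""
    cuts = enumerate_cuts(fanins, k)
    label = {}
    for node, ins in fanins.items():
        if not ins:
            label[node] = 0  # primary inputs arrive at depth 0
            continue
        # Induction step: one LUT on top of the deepest leaf of the best cut.
        label[node] = min(
            1 + max(label[leaf] for leaf in cut)
            for cut in cuts[node]
            if cut != frozenset([node])
        )
    return label


if __name__ == "__main__":
    # Chain of three ANDs over the inputs a, b, c and d.
    chain = {
        "a": (), "b": (), "c": (), "d": (),
        "n1": ("a", "b"),
        "n2": ("n1", "c"),
        "n3": ("n2", "d"),
    }
    for k in (2, 3, 4):
        print(k, depth_labels(chain, k)["n3"])
```

Because the nodes are processed in topological order, the label of every leaf is already known when a node is visited, which is exactly the inductive argument.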

IV. CHOICE 2-INPUT COMPONENT

We chose to use the AND gate because it yields a high depth reduction and the overhead is low. Furthermore, the AND gate is commutative. This gives us more flexibility in the floorplanning and routing steps [4].

V. RESULTS TECHNOLOGY MAPPING

Before we discuss the results of the technology mapping, a remark has to be made: the delay is estimated by the depth, i.e. the number of LUTs that the longest clocked signal in the network has to traverse. When we consider two components with different delays (for instance LUTs and AND gates, where AND gates are faster than LUTs), we have to use two different delay estimates when calculating the resulting depth. We have no information about the routing delay when using AND gates; to obtain it, the packing, placement and routing algorithms would have to be implemented. As the delay of the added gates is unknown, we cannot predict the resulting total delay. However, we can give the best-case scenario, in which we consider the depth of the gates to be zero. This case is depicted in Figure 3 for the VTR benchmarks. If we also consider the MCNC20 benchmarks, the geometric mean of the best-case depth reduction is 27%. The LUT usage does not change significantly.
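The best-case estimate can be sketched as a longest-path computation in which a LUT costs one level and an AND gate costs zero, followed by a geometric mean over benchmarks. The network encoding, the cost table and the small example network below are hypothetical and serve only as illustration; they are not data or results from this work.

```python
import math


def weighted_depth(nodes, cost):
    """Longest input-to-output path, where each node adds the cost of its type.

    `nodes` maps a node name to (type, fanins) in topological order;
    `cost` maps a type to its depth cost (for example LUT: 1, AND: 0).
    """
    depth = {}
    for name, (kind, fanins) in nodes.items():
        arrival = max((depth[f] for f in fanins), default=0)
        depth[name] = arrival + cost[kind]
    return max(depth.values(), default=0)


def geometric_mean(ratios):
    """Geometric mean of strictly positive ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Hypothetical mapped network: LUT -> AND gate -> LUT.
    network = {
        "a": ("IN", ()), "b": ("IN", ()), "c": ("IN", ()),
        "l1": ("LUT", ("a", "b")),
        "g1": ("AND", ("l1", "c")),
        "l2": ("LUT", ("g1",)),
    }
    best_case = weighted_depth(network, {"IN": 0, "LUT": 1, "AND": 0})
    worst_case = weighted_depth(network, {"IN": 0, "LUT": 1, "AND": 1})
    print(best_case, worst_case)
```

Changing the cost assigned to the AND type is the only thing needed to move between the best-case scenario and a more pessimistic one.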

VI. CONCLUSION

We designed several technology mapping algorithms with a maximal depth reduction of 27% (geometric mean over the MCNC20 and VTR benchmarks, when considering AND gates) without an increase in LUT usage. Furthermore, we estimated the impact on the hardware in collaboration with three other students. The FPGA's area increases by 11% if we replace all multiplexers in the switch blocks, while the delay does not change significantly.

Fig. 3. The depth reduction for the VTR benchmarks: mapping with 6-LUTs and AND gates compared to mapping with only 6-LUTs.

VII. FUTURE WORK

In order to estimate the performance of the new architecture more accurately, the next algorithms in the conventional tool flow should be implemented.

ACKNOWLEDGEMENTS

The author would like to thank Elias Vansteenkiste for his excellent guidance and Pieter De Vloed, Yann Laoureux and Dries Vercruyce for their contribution to the hardware estimations.

REFERENCES

[1] D. Chen, J. Cong, DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs, Proceedings of the 2004 IEEE/ACM International Conference on Computer-aided Design, pp. 752–759, 2004

[2] P. De Vloed, Y. Laoureux, D. Vercruyce, E. Vansteenkiste, D. Stroobandt, Designing Drivers for new FPGA Architectures, Hardware Design Project at the University of Ghent, 2014

[3] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Norwell, MA, US: Kluwer Academic Publishers, 1999

[4] E. Vansteenkiste, B. Al Farisi, K. Bruneel and D. Stroobandt, TPaR: Place and Route Tools for the Dynamic Reconfiguration of the FPGA’s Interconnect Network, Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions, pp. 370–383, 2014

[5] C. Chiasson, V. Betz, COFFE: Fully-automated transistor sizing for FPGAs, Field-Programmable Technology (FPT), 2013 International Conference, pp. 34–41, 2013

[6] V. Betz and J. Rose, VPR: A New Packing, Placement and Routing Tool for FPGA Research, 1997

Technology Mapping for Logic in FPGA Routing

Berg Severens

Supervisors: Elias Vansteenkiste, Karel Heyse, Dirk Stroobandt

Abstract—The choice of a suitable architecture is crucial for designing high-performance FPGAs. Therefore extensive research has already been done to optimize both the routing and the logic blocks in the FPGA. To our knowledge, however, the combination of logic and routing has never been investigated. In this paper we add components with two inputs to the routing. This gives us many degrees of freedom to use small logic units in strategic places. A depth-optimal technology mapping algorithm with area optimization was designed for LUTs in combination with these components. For the AND gate we achieved a depth reduction of 27% without causing an increase in the number of LUTs.

Keywords— FPGA, architecture, technology mapping, AND gate, COFFE

I. INTRODUCTION

The choice of a suitable architecture is crucial for designing high-performance FPGAs. Therefore extensive research has already been done to optimize both the routing and the logic blocks. To our knowledge, however, the combination of logic and routing has never been investigated. In this paper we add components with two inputs to the routing. This gives us many degrees of freedom to use small logic units in strategic places. We wanted to estimate the performance of this architecture as precisely as possible. Therefore we investigated both the hardware and the software.

For the hardware research we collaborated with three students during their ‘Hardware Design Project’ [2]. They used COFFE [5] to estimate the area and delay of the whole FPGA when the new components are added.

In the conventional tool flow HDL code is converted into a bitstream that is used to program the FPGA. This compilation is done in five steps: first the code is written as a network of AND and NOT gates. Next, this network is translated into a network of LUTs. This step is called ‘technology mapping’. In our case the output network will consist of LUTs and 2-input gates.

We designed algorithms for several components. The AND gate seemed to us the most promising component to implement. A depth reduction of 27% (geometric mean over the MCNC20 and VTR benchmarks) was measured for the AND gate, if we assume that the gate has a negligible delay.

After technology mapping the LUTs and the connections are placed on the FPGA. This is done in the packing, placement and routing steps. These algorithms, however, are outside the scope of this thesis.

II. HARDWARE

In the architecture of COFFE [5], multiplexers are built from pass-gate transistors and two buffers. Our modified configuration is shown in Figure 1. From left to right we see two multiplexers built from pass gates. After each multiplexer there is a buffer with a level restorer. Behind these a NAND gate was placed. Finally we placed another buffer with a level restorer and an extra large buffer, suitable for driving a higher output impedance. Several remarks can be made:

• Any signal of the first multiplexer can be combined with any signal of the second multiplexer. However, it is not possible to combine two signals of the same multiplexer.

• With this configuration it is possible to select a single signal. To do so, select the desired signal at the corresponding multiplexer and, at the other multiplexer, the input signal that is connected to ground. It can be verified that the desired signal will be present at the output.

• A NAND gate with inverted inputs and a doubly inverted output is an OR gate. It can be shown that the added functionality of an OR gate is equivalent to the added functionality of an AND gate in FPGAs.

Fig. 1. The configuration of the new multiplexer with AND gate functionality.

The buffer sizes are optimized by COFFE [5]. This program simultaneously optimizes all buffers in an FPGA: not only the routing buffers but also, for example, the output buffers of LUTs. The delay of the new FPGA architecture did not change significantly when we used AND gates in the multiplexers. This can be explained as follows: because we use smaller multiplexers, the input capacitance decreases. As a result the input signal can charge the first buffer faster. However, the signal now has to propagate through more buffers. Overall, the operation does not change the delay significantly.

Furthermore, a larger area is used because of increased buffer and SRAM cell usage. If we change all multiplexers in the switch blocks, the area of the FPGA increases by 11%.

III. TECHNOLOGY MAPPING

In the technology mapping algorithm we have to map a network of AND and NOT gates onto a network of LUTs and the considered extra components. This conversion is depicted in Figure 2: the big circles of the input network are AND gates, and the small circles mark that an input signal is inverted. The figure shows how an and-inverter graph can be mapped onto a network of 3-input LUTs. For this thesis we considered the technology mapping algorithm DAOmap [1]. This is a depth-optimal algorithm with area optimization. Mathematical induction is used to compute the minimal depth for each node. LUTs offer more degrees of freedom than some of the considered 2-input components. Therefore we had to add some constraints. With a minimum of modifications we succeeded in designing a depth-optimal mapping algorithm with area optimization.

Fig. 2. An input and output network of our technology mapping algorithm.

IV. CHOICE OF THE ADDED COMPONENTS

We chose to use the AND gate because of (1) a strong reduction in depth and (2) the small added delay and area in the hardware. Furthermore, commutativity can be exploited. This gives us more flexibility in the placement and routing steps [4].

V. RESULTS TECHNOLOGY MAPPING

Before we discuss the results of the technology mapping, the following remark has to be made: the delay is estimated by means of the depth. The depth is the number of LUTs through which the longest clocked signal in the network has to propagate. When we consider two components with different delays (for example the LUT and the AND gate, where the AND gate is faster than the LUT), we also have to use two different delays to compute the resulting depth. However, we have no information about the delay of the AND gate, because we do not know its effects in the routing. For that, the next steps of the tool flow would have to be implemented. Since the delay of the new components is unknown, we cannot predict the total delay either. We can, however, discuss the best possible case, in which we assume that the delay of the AND gate is negligible. This case is depicted in Figure 3 for the VTR benchmark networks. If we also consider the MCNC20 networks, the geometric mean of the depth reduction becomes 27%. The number of LUTs does not change significantly.

Fig. 3. The depth reduction for the VTR networks: mapping with 6-LUTs and AND gates compared to mapping with only 6-LUTs.

VI. CONCLUSION

We designed several depth-optimal technology mapping algorithms with area optimization. For the AND gate a geometric mean of 27% was measured for the maximal reduction in depth over the MCNC20 and VTR networks. The number of LUTs did not change significantly. In addition, we estimated the impact on the hardware. This was done in collaboration with three other students. The total area of the FPGA increases by 11% if we change all multiplexers in the switch blocks. The delay in the hardware does not change significantly.

VII. FUTURE WORK

To estimate the performance of the new architecture more precisely, the next steps of the conventional tool flow have to be implemented.

ACKNOWLEDGEMENTS

The author would like to thank Elias Vansteenkiste for his excellent guidance and Pieter De Vloed, Yann Laoureux and Dries Vercruyce for their contribution of the hardware simulations.

REFERENCES

[1] D. Chen, J. Cong, DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs, Proceedings of the 2004 IEEE/ACM International Conference on Computer-aided Design, pp. 752–759, 2004

[2] P. De Vloed, Y. Laoureux, D. Vercruyce, E. Vansteenkiste, D. Stroobandt, Designing Drivers for new FPGA Architectures, Hardware Design Project at the University of Ghent, 2014

[3] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Norwell, MA, US: Kluwer Academic Publishers, 1999

[4] E. Vansteenkiste, B. Al Farisi, K. Bruneel and D. Stroobandt, TPaR: Place and Route Tools for the Dynamic Reconfiguration of the FPGA’s Interconnect Network, Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions, pp. 370–383, 2014

[5] C. Chiasson, V. Betz, COFFE: Fully-automated transistor sizing for FPGAs, Field-Programmable Technology (FPT), 2013 International Conference, pp. 34–41, 2013

[6] V. Betz and J. Rose, VPR: A New Packing, Placement and Routing Tool for FPGA Research, 1997

Contents

Acknowledgements
Overview
Extended abstract
Contents
Abbreviations

1 Introduction
1.1 Problem
1.2 Goal
1.3 Overview

2 Background
2.1 FPGA
2.1.1 What is an FPGA?
2.1.2 Economical point of view
2.1.3 Island-style architecture
2.1.4 Speed and area
2.2 Conventional toolflow
2.2.1 Logic Synthesis
2.2.2 Technology mapping
2.2.3 Packing
2.2.4 Placement
2.2.5 Routing
2.3 Technology Mapping
2.3.1 Cone Enumeration
2.3.2 Cone Ranking
2.3.3 Cone Selection
2.3.4 Area recovery

3 Logic in the nodes on interconnection network
3.1 Changes in the hardware
3.1.1 Original multiplexer
3.1.2 Fully connected AND gate multiplexer
3.1.3 Half connected AND gate multiplexer
3.1.4 Half connected AIC multiplexer
3.1.5 Comparison of the three configurations
3.1.6 Reliability COFFE
3.1.7 Replaced switch box multiplexers
3.1.8 Optimization SRAM cells
3.1.9 General ratio use of SRAM cells
3.1.10 Conclusion
3.2 Changes in the tool flow
3.2.1 Logic Synthesis
3.2.2 Technology mapping
3.2.3 Packing
3.2.4 Placement
3.2.5 Routing
3.2.6 Conclusion

4 Candidate components
4.1 Non-configurable components
4.1.1 Proof of equivalence
4.1.2 Enumeration
4.2 Configurable components
4.2.1 2-LUT
4.2.2 2-AIC
4.2.3 Other components
4.2.4 Remark
4.2.5 Conclusion

5 Technology Mapping
5.1 Environment
5.1.1 Starting point
5.1.2 Benchmarks
5.1.3 Verification
5.2 Configurable components
5.2.1 2-LUT
5.2.2 2-AIC
5.2.3 X(N)OR
5.2.4 Comparison
5.3 Non-configurable components
5.3.1 The horizontal and vertical dilemma
5.3.2 The AND gate
5.3.3 The NAND gate
5.3.4 The ANDN gate
5.3.5 Comparison candidate components
5.4 Conclusion

6 Results Technology Mapping
6.1 Introduction
6.2 Depth
6.2.1 AND gate
6.2.2 2-AIC
6.3 Area
6.3.1 AND
6.3.2 2-AIC
6.4 Fan-in LUTs
6.5 Run-time

7 Conclusion and future work
7.1 Delay AND gate
7.2 Area AND gate
7.3 Area delay product
7.4 Future work
7.4.1 Considering an other 2-input component
7.4.2 Other definition depth cost
7.4.3 Other objective function
7.4.4 Improvement of the resulting graphs
7.4.5 Implementing AND gates in the connection boxes only
7.4.6 Implementing pack, place and route algorithms
7.5 Conclusion

Bibliography

List of Figures

Abbreviations

AIC And-Inverter-Cone

AIG And-Inverter-Graph

ASIC Application-Specific Integrated Circuit

BLE Basic Logic Element

CB Connection Block / Connection Box

CLB Configurable Logic Block

DAOmap Depth-optimal technology mapping with Area Optimization

FF Flip-Flop

FPGA Field-Programmable Gate Array

HDL Hardware Description Language

HES Hardware and Embedded Systems

LUT Lookup Table

NRE Non-Recurring Engineering

SB Switch Block / Switch Box

SRAM Static Random-Access Memory

Chapter 1

Introduction

The title ‘Technology Mapping for Logic in FPGA Routing’ contains three different thoughts:

• Extra logic will be added to an FPGA architecture (an FPGA is a programmable chip consisting of logic blocks and the wires that connect them, the routing).

• The added logic will be implemented in the routing (in contrast to standard FPGA architectures, where logic and routing are strictly separated).

• We started by adapting the technology mapping step of the FPGA compilation tool flow. The results of the technology mapping will allow us to draw conclusions about the feasibility of new FPGA architectures with logic in the routing.

1.1 Problem

The market of these programmable and high-speed chips, called FPGAs, is growing fast. The revenues were estimated at more than 4.5 billion dollars a year in 2011 (see figure 1.1). In order to further increase the value of these chips, the performance should be enhanced. The two most important characteristics of an FPGA are cost and speed.

The cost is proportional to the silicon area. An important added value could be created if the same functionality could be implemented on smaller FPGAs.

Furthermore, increasing the design speed of FPGAs would make them more attractive as a platform: there are many applications with strict timing constraints. With an increased design speed, more applications would be feasible using FPGAs. For the existing applications the NRE cost, necessary to meet the timing constraints, could be decreased.

1.2 Goal

The goal of this thesis is to add extra 2-input components into the routing of the FPGA architecture. The conclusion will be whether or not these components should be implemented in a real FPGA in order to increase its market value. Note that there are different options to do so: for some applications, a very fast FPGA would be useful and cost is of less importance. For other applications a cheap FPGA is chosen rather than a fast but expensive FPGA. Thus, the best case scenario is to increase both speed and the amount of functionality per area, but a significant improvement of one of the two suffices to succeed.

Figure 1.1: FPGA Sales Growth [1]

It is a very complex problem to prove that an FPGA will be faster with the use of an extra component. Manufacturing a commercially suitable FPGA with a new architecture has a high cost. This is clearly not an option. In this thesis we will only describe the hardware in software. For many benchmarks, an answer will be given on how many components are needed (proportional to the area and thus cost) and in which sequence they appear (this will be important for the speed). This is done in the technology mapping algorithm.

Next to this thesis, a ‘Hardware Design Project’ is executed by three students of the first master year. In this project the hardware implementation of the extra components is simulated. It provides numbers for the area and speed of the added blocks. Knowing the amounts of needed components and the numbers for area and delay in hardware, an estimate will be made of the final result.

1.3 Overview

In chapter 2 the necessary background to understand the principles used in this thesis will be discussed.

Chapter 3 explains the goal of this thesis in more detail. Furthermore, some hardware results will be provided.

In chapter 4 we consider all possible 2-input components and we will show that the AND gate and the 2-AIC will probably be the best components to use.

Chapter 5 deals with the designed technology mapping algorithms.

The results of the technology mapping algorithms are discussed in chapter 6.

Finally, in chapter 7 a conclusion will be formulated, based on the hardware results of chapter 3 and the technology mapping results of chapter 6.

Chapter 2

Background

2.1 FPGA

2.1.1 What is an FPGA?

FPGA is the abbreviation of Field Programmable Gate Array. It is a programmable digital chip. This means that the customer can configure the chip in such a way that the FPGA implements a desired digital function. An FPGA needs three important components to be able to do so. A schematic representation can be seen in Figure 2.1.

• The chip needs input and output connections.

• Logic blocks with both sequential and combinatorial functionality are needed. Flipflops are used in order to maintain the states, whereas look-up tables (LUTs) seem to be the best components to use for combinatorial functionality. Both flipflops and LUTs are put together into these logic blocks.

• There are a lot of wires and multiplexers between all I/O blocks and logic blocks to interconnect these blocks. This is called the routing.

Both the logic blocks and the routing can be configured: the customer can choose what functions to implement in the logic blocks and how the blocks are connected with each other. In practice, the designer will write code in an HDL (Hardware Description Language), such as VHDL or Verilog. Then the designer uses a compilation tool flow to translate the code into a bitstream. Subsequently this bitstream is used to configure the FPGA. As such the FPGA is able to calculate several digital functions in parallel, so it can be relatively fast.

2.1.2 Economical point of view

FPGAs are part of the market of physical media that calculate digital functions. In this market, two extremes can be found:

• The processor calculates everything in series. It is slow, but also very flexible as it can read instructions instantly.

• The ASIC (application-specific integrated circuit) can use parallelism. It is therefore very fast. However only one digital function can be implemented.

Figure 2.1: FPGA architecture [2]

The FPGA is a compromise between the two media. It can use parallelism, which makes it relatively fast. Furthermore it is reprogrammable: any HDL design can be implemented on it. So when a digital function is needed in a high volume or high speed device, one will leave the processor behind and take an FPGA or an ASIC. The NRE (non-recurring engineering) cost is higher for the ASIC, but the cost per unit is higher for the FPGA (because it has to provide many more physical connections than strictly necessary). A comparison is made in Figure 2.2.

2.1.3 Island-style architecture

An FPGA consists of LUTs, flipflops, multiplexers and I/O blocks. This thesis assumes an island-style architecture, which was represented in Figure 2.1. Furthermore we assume that no other blocks (like adders, multipliers, etc.) are implemented. This is the easiest playground to execute experiments. If the introduced components tend to be useful in this architecture, the results can be extrapolated to more realistic architectures.

Now we take a closer look at the island-style architecture. The set-up of the architecture starts with the configurable logic blocks (CLB). A CLB typically consists (depending on the manufacturer) of eight basic logic elements (BLE). The structure of a BLE is given in Figure 2.3. In the box a 4-input LUT is depicted. The inputs are located at the top while the configurable bits are at the left. The working principle is as follows: a certain combination of inputs is given at the top. Then a signal is chosen from left to right by using multiplexers. The choice of the multiplexers is determined by the inputs. In this way a LUT of N inputs can realize all 2^(2^N) Boolean functions of N inputs: for each of the 2^N combinations of inputs one output bit is configured. Next, the output signal of the LUT is split: one signal goes directly to the multiplexer at the end, another signal goes via a flipflop. In this way the customer can choose to use the BLE as a combinatorial or sequential component.

Figure 2.2: FPGA - ASIC comparison [3]

Figure 2.3: A BLE with a 4-input LUT and a flipflop [2]

After the logic blocks, we move on to the routing concept. We can basically divide all multiplexers into two different routing block categories: the switch blocks (SB) and the connection blocks (CB), see Figure 2.4. SBs connect wires on intersections of the routing matrix, CBs are the link between the routing wires and the CLBs.

Those blocks consist of a high number of multiplexers. A simplified implementation of the switch block is shown in the first picture of Figure 2.5. The connection block is depicted in the second picture. Furthermore the third picture shows the concept of unidirectionality [4]. This means that each wire can drive a signal in only one direction. In this case there is only one buffer per wire, otherwise there are two of them.

In Figure 2.6 a multiplexer on transistor level is depicted. Note that this structure is almost the same as for LUTs, but for a LUT the left signals are the configured signals and the signals on the top are the inputs. In case of a multiplexer, it is reversed. In other words, a multiplexer has 2^N inputs of which one signal is chosen. The choice is configured with N bits. A LUT has N inputs, which means there are 2^N combinations of inputs. For each combination there is one output bit to be configured.
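The reversed roles of a multiplexer and a LUT can be illustrated with a minimal Python sketch. It is only a behavioural model and not the transistor-level structure of Figure 2.6; the function names `mux` and `lut` and the XOR configuration are chosen here for illustration.

```python
from itertools import product

def mux(data, select):
    """Tree multiplexer: `select` (N bits, MSB first) picks one of 2**N data signals."""
    assert len(data) == 2 ** len(select)
    index = 0
    for bit in select:
        index = (index << 1) | bit
    return data[index]

def lut(config_bits, inputs):
    """N-input LUT: the same tree, but the data pins hold the configured bits
    and the select pins are driven by the logic inputs."""
    return mux(config_bits, inputs)

# A 2-input LUT configured as XOR: one output bit per input combination 00, 01, 10, 11
xor_config = [0, 1, 1, 0]
table = {bits: lut(xor_config, bits) for bits in product((0, 1), repeat=2)}

# A 4-input LUT has 2**4 configuration bits, hence 2**(2**4) possible functions
n = 4
num_functions = 2 ** (2 ** n)
```

In this model a 4-input LUT stores 16 configuration bits, whereas a 16-input multiplexer needs only 4 configuration bits to select one of its 16 signals.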

2.1.4 Speed and area

The FPGA’s performance depends on a number of properties: cost, speed, area, power consumption, etc. The designer of FPGAs has to make trade-offs between these characteristics. For example: if one calculation block is used very often, the block can be replicated and one can use parallelism. In that case the throughput increases, but the area cost increases as well. In our case we focus primarily on speed and area. The reason we do so is that the cost is proportional to the area. Power consumption remains a relevant goal, but this is very complex and it is outside the scope of this thesis. Furthermore we will only describe the FPGA architecture in software and simulate what will be the best area and speed we can get with a specific architecture.

First of all we want to have an FPGA that is as fast as possible. The clock frequency is the reciprocal of the clock period. In its turn, the minimal clock period is determined by the longest path a clocked signal has to traverse in the circuit. The delay of a signal is proportional to the number of LUTs it traverses and the length of the connections between the LUTs. Typically between 30% and 40% of the critical path delay is caused by travelling through the LUTs [6]. The routing is responsible for the remaining delay. Sometimes one very long signal can slow down the whole circuit. This is the reason why algorithms focus on the critical path to choose which LUTs should be connected with each other and via which multiplexers.
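As a small illustration of how the critical path determines the clock frequency, the following sketch computes the longest path in a tiny timing graph. The graph, the node names and all delay values are hypothetical and only serve the example.

```python
from functools import lru_cache

LUT_DELAY_PS = 300  # assumed delay of one LUT, in picoseconds

# node -> list of (successor, routing delay in ps); "ff_out"/"ff_in" are flipflop pins
EDGES = {
    "ff_out": [("lut_a", 500), ("lut_b", 400)],
    "lut_a": [("lut_c", 600)],
    "lut_b": [("lut_c", 200)],
    "lut_c": [("ff_in", 500)],
    "ff_in": [],
}

@lru_cache(maxsize=None)
def longest_delay(node):
    """Longest delay from `node` to any endpoint, counting LUT and routing delay."""
    own = LUT_DELAY_PS if node.startswith("lut") else 0
    tails = [wire + longest_delay(succ) for succ, wire in EDGES[node]]
    return own + max(tails, default=0)

critical_path_ps = longest_delay("ff_out")
max_clock_mhz = 1e6 / critical_path_ps  # f = 1 / T, with T in picoseconds
```

With these assumed numbers the critical path runs through lut_a and lut_c. Two LUTs contribute 600 ps of the total, and the routing accounts for the rest.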




Figure 2.4: The routing structure of the island-style FPGA [2]

Figure 2.5: The left figure is the Wilton switch block [5]. The second one represents a connectionblock [2]. The third picture shows the concept of unidirectionality [2].

Figure 2.6: The multiplexer on transistor level


Secondly the area is of high importance as it is proportional to the cost. In the academic VPR architecture about 30% of the FPGA area is covered by LUTs. The remaining 70% is used by routing wires and multiplexers [6]. The number of routing channels between the CLBs can be 128 or even 256. Furthermore we have to take into account that there is a multiplexer for each single wire. For the VLX50 there are 800,000 routing multiplexers and only 22,800 LUTs (these results were found in the HES research group at Ghent University). Therefore optimizing area is accomplished by minimizing the routing.

Conclusion: area is important for the price of an FPGA, while the delay is important for the demand for FPGAs (because more applications can be implemented on FPGAs). In the following we will focus on both of these performance measures. Sometimes we will have to make trade-offs between the two.

2.2 Conventional toolflow

In a conventional FPGA tool flow, HDL code is translated into a network of LUTs, flipflops and multiplexers. The output is a bitstream, which is used to configure the FPGA. This tool flow consists of 5 steps [2]:

2.2.1 Logic Synthesis

The logic synthesis translates the HDL code into a network of AND and NOT gates. This is called an and-inverter graph (AIG). An example of an AIG is given in Figure 2.7. At the top we see the inputs of the circuit, at the bottom the outputs. The AND gates are represented as big circles, the inverters are represented by the smaller circles at the inputs of the AND gates. For instance the output of the lowest AND gate (so, the lowest circle) will be one when its left input is 1 and its right input is 0. Otherwise the output of the considered AND gate will be zero. It can be proven that any combinatorial circuit can be represented by an AIG.

Note that the presence of flipflops in this circuit was not mentioned. Of course they did not disappear. They are hidden as inputs and outputs: the output of a flipflop can be considered as an input of the combinatorial circuit and the input of a flipflop can be seen as an output of the AIG. This also means that the logic synthesis determines what will be the states in the flipflops. After this step there will only be room to optimize the combinatorial aspect of the implementation.

Figure 2.7: An example of an AIG

Figure 2.8: An example of a network of LUTs

The goal for the designers of the logic synthesis tools is to minimize the number of AND gates (area equivalent). Next to this they will try to accomplish a depth that is as low as possible (delay equivalent). The depth of an AIG is the maximum number of AND gates that any signal has to traverse in the circuit.
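To make the notions of an AIG, its number of AND gates and its depth concrete, a minimal sketch follows. The three-gate circuit is hypothetical (it is not the circuit of Figure 2.7), and the representation with a dictionary of `And` nodes is only one possible choice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class And:
    """AND node; each fanin is a pair (node_name, inverted)."""
    left: tuple
    right: tuple

# Hypothetical AIG over the primary inputs a, b and c
AIG = {
    "n1": And(("a", False), ("b", True)),   # a AND (NOT b)
    "n2": And(("b", False), ("c", False)),  # b AND c
    "n3": And(("n1", True), ("n2", True)),  # (NOT n1) AND (NOT n2)
}

def evaluate(node, values):
    """Value (0 or 1) of `node` for the given primary input values."""
    if node not in AIG:
        return values[node]
    gate = AIG[node]
    left = evaluate(gate.left[0], values) ^ gate.left[1]
    right = evaluate(gate.right[0], values) ^ gate.right[1]
    return left & right

def depth(node):
    """Maximum number of AND gates on any path from a primary input to `node`."""
    if node not in AIG:
        return 0
    gate = AIG[node]
    return 1 + max(depth(gate.left[0]), depth(gate.right[0]))

area_equivalent = len(AIG)      # number of AND gates
delay_equivalent = depth("n3")  # depth of the output node
```

The inverters are stored as flags on the edges, so they add neither to the gate count nor to the depth, which matches the definitions used above.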

2.2.2 Technology mapping

Technology mapping is the most important step for this thesis. This algorithm maps an AIG to a network of LUTs. Again, the goal is to minimize both the depth and the number of resources. In Figure 2.8 the output of the technology mapping algorithm is shown for the network of Figure 2.7. In this case 3-input LUTs were used. More details about technology mapping are given in section 2.3.

2.2.3 Packing

In the packing algorithm LUTs are grouped into clusters (see Figure 2.9). Later on these clusters will be put onto CLBs. In the packing it is important to minimize the routing overhead. Therefore the packing algorithm will focus on common inputs of different LUTs: in that case a shared signal has to be routed only once per CLB.

2.2.4 Placement

In the placement step the clusters are placed on physical CLBs on the target FPGA. It is important to take into account how the clusters are connected to each other (see Figure 2.10). In the placement step there can be different goals. In a wirelength-driven algorithm the placer will try to minimize resource usage and thus the area. A timing-driven algorithm will focus on the timing of the critical path.



Figure 2.9: The packing algorithm [2]



Figure 2.10: Minimizing distances in the placement algorithm [2]

2.2.5 Routing

A matrix of CLBs and a graph of connections between them can now be used to find a routing solution. Basically this is done by executing Dijkstra's algorithm for each connection. However, this is not sufficient, as certain multiplexers would be used by more than one signal. Therefore several iterations are performed: each time a connection passes a multiplexer, the virtual cost of using this multiplexer is increased if the multiplexer is overused. After a certain number of iterations, a solution is found [7]. An example of a routed circuit can be found in Figure 2.11.
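The iterative idea of raising the cost of overused resources can be sketched as follows. This is a strongly simplified illustration and not the router of [7]: the routing graph, the connection list and the cost formula `(1 + history) * (1 + usage)` are assumptions made for this example, and every node is assumed to have a capacity of one signal.

```python
import heapq

# Hypothetical routing graph: node -> reachable nodes (each node carries one signal)
GRAPH = {
    "s1": ["m", "x"], "s2": ["m"],
    "m": ["t1", "t2"], "x": ["y"], "y": ["t1"],
    "t1": [], "t2": [],
}
CONNECTIONS = [("s1", "t1"), ("s2", "t2")]

def dijkstra(src, dst, node_cost):
    """Cheapest path from src to dst, where entering node n costs node_cost(n)."""
    queue = [(0.0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in GRAPH[node]:
            heapq.heappush(queue, (cost + node_cost(nxt), nxt, path + [nxt]))
    raise ValueError("no path")

def route(max_iterations=10):
    history = {n: 0.0 for n in GRAPH}  # grows each time a node is overused
    for _ in range(max_iterations):
        usage = {n: 0 for n in GRAPH}
        paths = []
        for src, dst in CONNECTIONS:
            # A node is more expensive if it is occupied now or was overused before
            path = dijkstra(src, dst, lambda n: (1 + history[n]) * (1 + usage[n]))
            for n in path:
                usage[n] += 1
            paths.append(path)
        overused = [n for n, used in usage.items() if used > 1]
        if not overused:
            return paths
        for n in overused:
            history[n] += 1.0
    raise RuntimeError("no legal routing found")

paths = route()
```

In this toy graph both connections initially prefer node m. Its virtual cost grows over the iterations until the first connection takes the detour via x and y, while the second connection, which has no alternative, keeps m.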

2.3 Technology Mapping

In this section the technology mapping algorithm DAOmap [8] (with the area optimization of [9]) is explained. First the pseudocode is given in Algorithm 1. Then the algorithm is discussed in more detail with the help of an example. This example can be found in Figure 2.12. When discussing this circuit the numbers of the nodes of this figure will be used as a reference.

Figure 2.11: An example of a routed circuit [2]

DAOmap stands for “Depth-optimal Area Optimization mapping algorithm”. First a depth-optimal solution of the circuit is calculated with flowmap. This is an algorithm with polynomial time complexity. It consists of three steps: “cone enumeration”, “cone ranking” and “cone selection”. After flowmap has provided a depth-optimal solution, DAOmap focusses on the non-critical paths and tries to find implementations that can decrease the number of LUTs. This step is called “area recovery”. All four steps are described in the next paragraphs.

input : AIG
output: Network of 6-LUTs

// Cone enumeration
for allNodesFromInputTowardsOutput do
    findAllPossibleCones(Node);
end

// Cone ranking
for allNodesFromInputTowardsOutput do
    for allConesOfNode do
        Cone.depth = maximumDepth(Cone.inputs) + 1;
    end
    Node.bestCone = chooseLowestDepthCone(allConesOfNode);
end

// Cone selection
for allOutputs do
    setVisible(Output);
end
for allVisibleNodesFromOutputToInput do
    for VisibleNode.Cone.inputNodes do
        setVisible(inputNode);
    end
end

// Area recovery
for allNodesFromOutputToInput do
    Node.requiredDepth = calculateRequiredDepth(Node);
end
for allNodesFromInputToOutput do
    findBestGlobalAreaUnderRequiredDepthConstraint(Node);
end
for allNodesFromInputToOutput do
    findBestLocalAreaUnderRequiredDepthConstraint(Node);
end

Algorithm 1: Technology Mapping

Before we start with the description of the algorithm, the concept of visibility is introduced: when converting a network of AND and NOT gates into a network of LUTs, several gates will be implemented by one single LUT. This also means that several signals of the AIG will virtually disappear: they are only present within the LUT. A signal that is the output of a LUT is called visible. In that case the signal can be used as the input of another LUT.


Figure 2.12: The example input of the technology mapping algorithm

2.3.1 Cone Enumeration

A cone is a subnetwork that is defined by its inputs and output. It represents the subnetwork that can be implemented by a LUT. As an N-LUT can implement any Boolean function of N inputs, we only consider the input nodes and output node in technology mapping. The corresponding Boolean function that has to be implemented is calculated afterwards.

In the cone enumeration step we run over all nodes. For each node all possible implementations by LUTs are enumerated. In the explanation an example with 3-LUTs is used. In Figure 2.13 node 6 is chosen, for which all possible cones are depicted. As 3-LUTs are considered in the example, the cones can have at most 3 inputs.
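A possible way to enumerate the cones is sketched below: the cones of a node are obtained by combining the cones of its two fanins and keeping only the combinations with at most K inputs. The small AIG is hypothetical (it is not the circuit of Figure 2.12), inverters are ignored because they do not change the set of inputs, and the node numbers are assumed to be in topological order.

```python
from itertools import product

K = 3  # LUT size used in the example

# Hypothetical AIG: AND node -> (fanin, fanin); nodes 1..4 are primary inputs
AIG = {
    5: (1, 2),
    6: (3, 4),
    7: (5, 6),
}
PRIMARY_INPUTS = {1, 2, 3, 4}

def enumerate_cones(aig, primary_inputs, k):
    """Return {node: set of cones}, each cone given as a frozenset of input nodes."""
    cones = {pi: {frozenset([pi])} for pi in primary_inputs}
    for node in sorted(aig):  # assumes node numbers follow a topological order
        left, right = aig[node]
        found = set()
        for cone_left, cone_right in product(cones[left], cones[right]):
            merged = cone_left | cone_right
            if len(merged) <= k:
                found.add(merged)
        # The trivial cone {node} lets later nodes use this node as a cone input
        cones[node] = found | {frozenset([node])}
    return cones

cones = enumerate_cones(AIG, PRIMARY_INPUTS, K)
real_cones_of_7 = cones[7] - {frozenset([7])}
```

Node 7 depends on four primary inputs, so with 3-LUTs there is no cone for it that uses primary inputs only. The three remaining cones each use at least one internal node as an input.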

2.3.2 Cone Ranking

In the second step the best possible cone for each node is chosen. In flowmap this choice is based on the depth (see figure 2.14). If the depths of two cones are equal, the expected area is used. The expected area is an estimate of how many LUTs would be needed when implementing that particular cone. Unfortunately finding the best global solution is NP-hard [10].

By using mathematical induction the optimal depth can be calculated in polynomial time: we start at the top and give the input nodes a depth equal to zero. Next we start descending from the top. For all cones of each node the depth is calculated by looking at the depth of the cone inputs: the depth of the cone is the maximal depth of its inputs plus one. So, for the first level below the inputs all depths are equal to one, as there is always a cone that only uses circuit inputs (depth 0) and then adds one. For the rows below, the different cones will show differences. In this case, there are three nodes in the second row for which it is still possible to find a cone with only circuit inputs. However for node 8 it is impossible, because the signal depends on four different circuit inputs, while here 3-LUTs are considered. The best possible cone has a depth of 2 in this case. Following the method of mathematical induction, the best possible depths in the rest of the circuit are calculated.

Figure 2.13: The cone enumeration, example for node 6
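The depth calculation described above can be sketched in a few lines. The cones are those of a hypothetical circuit with primary inputs 1 to 4, not of the circuit in Figure 2.12, and ties are simply broken by taking the first cone instead of using the expected area.

```python
# Cones per node for a hypothetical circuit with primary inputs 1..4
CONES = {
    5: [{1, 2}],
    6: [{3, 4}],
    7: [{5, 6}, {1, 2, 6}, {3, 4, 5}],
}
PRIMARY_INPUTS = [1, 2, 3, 4]

def rank_cones(cones, primary_inputs):
    """Return (depth per node, best cone per node), going from inputs to outputs."""
    depth = {pi: 0 for pi in primary_inputs}
    best = {}
    for node in sorted(cones):  # assumes node numbers follow a topological order
        # Depth of a cone = maximal depth of its inputs plus one
        best_cone = min(cones[node], key=lambda cone: max(depth[i] for i in cone))
        best[node] = best_cone
        depth[node] = 1 + max(depth[i] for i in best_cone)
    return depth, best

depth, best = rank_cones(CONES, PRIMARY_INPUTS)
```

Nodes 5 and 6 get depth 1 because they have a cone with circuit inputs only. Every cone of node 7 contains at least one input of depth 1, so its best possible depth is 2.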

2.3.3 Cone Selection

The cone selection algorithm makes all necessary signals visible. It starts at the bottom of the network and sets all output signals visible, as they will certainly be needed. For each visible node, it sets the inputs of its best cone visible. This is done iteratively until there are no chosen cones left with hidden inputs. The result can be seen in figure 2.14, where the circled nodes are made visible.
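The cone selection step amounts to a traversal from the outputs towards the inputs. A minimal sketch, with a hypothetical set of best cones as input, could look as follows.

```python
def select_cones(best_cone, outputs, primary_inputs):
    """Return the set of visible nodes, i.e. the nodes that get their own LUT."""
    visible = set()
    stack = list(outputs)  # all outputs are certainly needed
    while stack:
        node = stack.pop()
        if node in visible or node in primary_inputs:
            continue
        visible.add(node)
        stack.extend(best_cone[node])  # the inputs of the chosen cone become visible
    return visible

# Hypothetical result of the cone ranking: node -> inputs of its best cone
BEST = {5: {1, 2}, 6: {3, 4}, 7: {1, 2, 6}}
visible = select_cones(BEST, outputs=[7], primary_inputs={1, 2, 3, 4})
```

In this example node 5 stays hidden: its AND gate is absorbed into the LUT of node 7, so only two LUTs are needed.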

2.3.4 Area recovery

After the cone selection, we have a depth-optimal solution. Now we will try to reduce the number of LUTs by considering area recovery. First we calculate the required depth for each node. The required depth is the maximum allowed depth a node can have without increasing the critical path depth. Secondly we run over all nodes again and calculate for each cone the expected effect on the number of LUTs in the global solution. Then we take for each node the best possible cone under the constraint of the required depth. Thirdly we repeat the previous iteration, but for an optimal local solution. The area recovery step reduces the number of LUTs by about 10% on average (based on our own experiments).
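The first part of the area recovery, the calculation of the required depths, can be sketched as below. The mapping is hypothetical and the helper name `required_depths` is chosen for this example; the actual area estimation of [9] is not reproduced here.

```python
def required_depths(best_cone, depth, outputs):
    """Propagate the required depth from the outputs towards the inputs."""
    critical = max(depth[o] for o in outputs)
    required = {o: critical for o in outputs}
    for node in sorted(best_cone, reverse=True):  # reverse topological order assumed
        if node not in required:
            continue  # node is not visible in the current mapping
        for inp in best_cone[node]:
            # An input must be ready one LUT level before the node itself
            required[inp] = min(required.get(inp, critical), required[node] - 1)
    return required

# Hypothetical mapping with two outputs: node 7 (depth 2) and node 8 (depth 1)
BEST = {5: {1, 2}, 6: {3, 4}, 7: {5, 6}, 8: {1, 2}}
DEPTH = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 2, 8: 1}
required = required_depths(BEST, DEPTH, outputs=[7, 8])
slack = {node: required[node] - DEPTH[node] for node in required}
```

Output 8 has a slack of one level, so a deeper cone with a lower area cost could be chosen for it without increasing the critical path depth. Node 7 lies on the critical path and has no slack.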


Figure 2.14: The depths of the nodes (cone ranking) and the chosen nodes (cone selection)

Chapter 3

Logic in the nodes on interconnection network

In this chapter the general concept is described. Therefore we reformulate the research question in more detail:

“What are the best 2-input components to add into the routing of an FPGA architecture? Is the performance enhanced compared with the state-of-the-art FPGAs?”

First we give some more background about the new architecture. Secondly we discuss the changes in the tool flow.

3.1 Changes in the hardware

The state-of-the-art architecture of an FPGA consists of two parts: functional blocks and routing. In the past, a lot of research has been done on choosing the best architecture for the logic blocks. Most of the area of an FPGA is taken by the routing infrastructure (68% in the hardware simulation tool COFFE [11]). Therefore the functionality of a logic block is optimized, as the size of such a block is small compared to the routing.

We want to add small components into the routing. The area of the components will be optimized because they will be implemented in a lot of multiplexers, which are highly responsible for the total FPGA area. Also the functionality will be optimized. Adding new functionality can reduce both area and delay: the area can decrease when a lower number of LUTs is necessary because a part of the functionality is implemented by the new 2-input components. The delay can decrease under two conditions: the new components are faster than LUTs and a part of the functionality of the critical path can be implemented by these new components.

We will add components in multiplexers. Therefore we consider a normal multiplexer and we divide it into two multiplexers (see Figure 3.1). At the outputs of the two multiplexers we add a 2-input component. Now there are some trade-offs that have to be made. We want to have:

• many possible combinations of inputs.

• the possibility of using the device for passing a single signal.

• a small area overhead.

3.1.1 Original multiplexer

In this subsection we have a look at the currently used multiplexers. In commercial FPGAs the number of inputs of the used multiplexers varies a lot. Therefore we cannot give a complete overview of what has to be changed. However we can get some insight by considering a specific example. Here we discuss the case of a multiplexer with 10 inputs. This example is depicted in Figure 3.2. At the left side we see the 10-input multiplexer. The squares at the top of the multiplexer represent SRAM cells. An SRAM cell contains one configurable bit. These bits are used to program the multiplexer. One SRAM cell consists of 6 transistors. It is a primitive flipflop where the information bit can be determined by using a shift register. After the multiplexer there are two inverters which act as buffers. Both are relatively big compared to the multiplexer itself in order to provide a stable signal. Furthermore the second inverter is bigger than the first one. This is necessary to combine a high input impedance (a small load for the multiplexer) with a low output impedance (a strong drive for the wire that follows).

The multiplexer

A more detailed scheme of the state-of-the-art multiplexer is given in Figure 3.3. The left side of the picture represents the multiplexer, the right side depicts the two buffers. The multiplexer consists of inputs, pass gates and SRAM cells. The principle of the pass gates is as follows: if the gate of the transistor is driven high, the transistor will conduct the signal. If the gate of the transistor is driven low, the transistor does not pass the signal. Here a problem occurs: the output signal does not fully saturate, as there will always be a gate-source voltage drop. Therefore the output voltage of the SRAM cells is chosen slightly higher than the general supply voltage in the circuit. In this way the voltage drop is no problem. This is called pass gate boosting. Buffers are used to fully saturate the signal afterwards.

This kind of multiplexer is called a two-level multiplexer. This means that the desired signal is chosen in two different stages: first one signal is chosen out of each group of three (or two) inputs. In the second stage one signal is chosen out of the four remaining signals. In this way one can share SRAM cells between several transistors: in the first stage it is not important which signal is chosen in a group if the output signal of that group is not connected to the final output. Reducing the number of SRAM cells is important for the area, as each SRAM cell takes 6 transistors.
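The saving in SRAM cells can be estimated with a short calculation. The sketch below assumes one SRAM cell per select line, a first stage whose select lines are shared by all groups, and a grouping of the 10 inputs into groups of 3, 3, 2 and 2; these assumptions are made for illustration and are not read from Figure 3.3. Buffers and the level restorer are not counted.

```python
SRAM_TRANSISTORS = 6  # one SRAM cell takes 6 transistors

def two_level_mux_cost(group_sizes):
    """Return (SRAM cells, transistors) of a two-level pass-gate multiplexer."""
    first_stage_sram = max(group_sizes)   # select lines shared by all groups
    second_stage_sram = len(group_sizes)  # one select line per group
    pass_transistors = sum(group_sizes) + len(group_sizes)
    sram_cells = first_stage_sram + second_stage_sram
    return sram_cells, pass_transistors + sram_cells * SRAM_TRANSISTORS

def one_level_mux_cost(n_inputs):
    """Return (SRAM cells, transistors) when every input has its own SRAM cell."""
    return n_inputs, n_inputs + n_inputs * SRAM_TRANSISTORS

two_level = two_level_mux_cost([3, 3, 2, 2])
one_level = one_level_mux_cost(10)
```

Under these assumptions the two-level multiplexer needs 7 SRAM cells instead of 10. The four extra pass transistors of the second stage are more than compensated by the three SRAM cells that are saved.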

The buffers

After the multiplexer there are two buffers. The buffers are inverters where the second inverter is larger than the first one. This is done in order to keep the load on the multiplexer small and the output drive strong without introducing a long delay. As we invert the signal twice, the output signal is equal to the input signal.

Figure 3.1: Each multiplexer is replaced by two multiplexers with a 2-input component.

Figure 3.2: The state-of-the-art 10-input multiplexer.

Figure 3.3: The state-of-the-art 10-input multiplexer on transistor level.

Note that the first buffer has an additional PMOS transistor, called a level restorer. Its main purpose is to pull the input signal to the full supply voltage via a feedback loop: when the input signal starts at 0 V and increases, the output signal of the first inverter starts at VDD and decreases. Normally the input signal would not reach VDD, as there is a small voltage drop over the pass gates. Now the falling output switches on the PMOS, which connects the input to VDD. In this way the input signal is fully restored.

3.1.2 Fully connected AND gate multiplexer

In this paragraph we give a first example of how to change the architecture of the multiplexers. This example was simulated by three students during their ‘Hardware Design Project’ [11]. The simulation was done in COFFE [6], a tool that automatically calculates the best possible buffer sizes. It gives a first estimate of the delay and area overhead. The technology can be chosen by giving the corresponding model to COFFE. During the ‘Hardware Design Project’ simulations were executed for the 22 nm technology.

The original multiplexer has 10 inputs. Now the multiplexer is replaced by two multiplexers of 10 inputs each (see Figure 3.4). The inputs of the second multiplexer are the same as the inputs of the first multiplexer. In this way any combination of two of the 10 inputs can be made. Furthermore we can still use this component to pass a single signal: in chapter 4 it will be shown that the best 2-input components are the AND gate and the AIC (and-inverter-cone, an AND gate where each input can be inverted separately). For both components it is possible to pass a single signal by applying it to both inputs.

In the Hardware Design Project several architectures have been considered. The best placement of inverters and level restorers turned out to be the configuration of Figure 3.5. Note that a NAND gate was used instead of an AND gate and that the inputs of the NAND gate are inverted. It will be shown in paragraph 4.1.1 that this is an equivalent logic configuration. In this simulation all multiplexers of the switch boxes were replaced. The multiplexers of the connection boxes were not replaced. For this configuration a 24% increase in total FPGA area and a 6% increase in average delay were observed. For further technical details we refer to the report of the Hardware Design Project [11].

3.1.3 Half connected AND gate multiplexer

The increases in both delay and area are relatively high. Therefore we consider the configuration of Figure 3.6. In this configuration each input is connected to only one multiplexer. The observed results are an FPGA area increase of 15% and a small decrease in average delay of 0.5%. A possible explanation for this remarkable result is that the input capacitance of the multiplexers decreases when the number of inputs of each multiplexer decreases.


Figure 3.4: Schematic fully connected AND gate multiplexer

Figure 3.5: The simulated fully connected AND gate multiplexer


In this configuration the inputs of the first multiplexer can be combined with the inputs of the second multiplexer. However it is impossible to combine two inputs of the same multiplexer. This is a disadvantage compared to the fully connected AND gate implementation.

Further we want to keep the possibility of using the multiplexer to pass single signals. Therefore we added an extra input to each multiplexer. This input is connected to ground. If we want to select a certain input, we choose it at the appropriate multiplexer. For the other multiplexer we choose the ground signal. Then the desired signal appears at the output. This can be seen as follows: the NAND gate sees at one input the inverted ground, which is a logic 1. The output of the NAND gate is only a logic 0 when both of its inputs are equal to logic 1. So the resulting Boolean function of the NAND gate is the inverse of its second input. After the NAND gate this signal is inverted two more times. Conclusion: the desired signal is inverted four times, so it appears unchanged at the output.
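The argument above can be checked with a small truth-table sketch. It models only the logic described in the text (one inverter on each NAND input, followed by two inverting buffers) and ignores all electrical effects; the function names are hypothetical and are not taken from COFFE or from the Hardware Design Project.

```python
from itertools import product


def inv(x: int) -> int:
    """Logic inverter on a single bit."""
    return 1 - x


def nand(x: int, y: int) -> int:
    """2-input NAND gate."""
    return inv(x & y)


def half_connected_output(a: int, b: int) -> int:
    """Output for the signals a and b selected at the two multiplexers.

    Assumed structure (from the text): both selected signals are inverted,
    fed to a NAND gate, and the result passes two inverting buffers.
    """
    gate = nand(inv(a), inv(b))
    return inv(inv(gate))


GROUND = 0

# Selecting ground at one multiplexer passes the other signal unchanged.
for s in (0, 1):
    assert half_connected_output(s, GROUND) == s
    assert half_connected_output(GROUND, s) == s

# Full truth table of the structure for two real signals.
truth_table = {(a, b): half_connected_output(a, b)
               for a, b in product((0, 1), repeat=2)}
print(truth_table)
```

For two real signals this structure computes the OR of its inputs, which by De Morgan's law is an AND gate with inverted inputs and output.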

3.1.4 Half connected AIC multiplexer

After the AND gate we also consider the AIC. The component is depicted in Figure 3.7. The and-inverter-cone is an AND gate where each input can be inverted separately. This component causes an increase in delay of 5.2% while the FPGA area increases by 21%.

3.1.5 Comparison of the three configurations

In paragraph 3.2.5 it will be explained that the AIC is a complex component to work with. Therefore this thesis focuses on the AND gates in the first place. Now we have to make a choice between the fully and half connected AND gate multiplexer: high overhead and many input combinations, or low overhead but only a few input combinations. According to the Pareto rules (see Figure 3.8, left side) we cannot choose a configuration based on these results alone.

We will show that the half connected configuration performs better than the fully connected configuration by describing a case where the right side of Figure 3.8 holds. Consider the situation where the half and fully connected configurations have two multiplexers of N inputs each (in Figure 3.9 the case N = 6 is depicted). We will show that for any N the right side of Figure 3.8 holds.

First we show that the overhead is slightly lower for the half connected configuration than for the fully connected configuration. If we consider Figure 3.9, at first sight it seems that the configuration of the multiplexers and the gate is exactly the same. However the sizes of the buffers are important as well. These depend on the loads seen at the input and output. Note that in an FPGA the load seen at the output depends on the input capacitance of the multiplexers. This can be seen as follows: assume that every multiplexer in the routing has exactly the same configuration. Each multiplexer is connected to other multiplexers. Then a driving multiplexer sees an output load equal to the number of multiplexers its output signal is connected to, multiplied by the input capacitance of a multiplexer. So the lower the input capacitance of a multiplexer, the lower the overhead, since the output buffers can decrease in size.

Figure 3.6: The multiplexer with implemented AND gate where the inputs are not shared.

Figure 3.7: The configuration of the and-inverter-cone.

Figure 3.8: Pareto comparison for the half and fully connected configuration

Consider Figure 3.10. At the left side a switch box is depicted with k arrivals and k departures. The meaning of a departure is shown at the right side: for each output signal, two multiplexers and an AND gate are used to choose signals from the arrivals of the switch box. Now we count the average number of multiplexer inputs that an arriving signal has to drive. In the case of the fully connected multiplexer there are k output signals, each with two multiplexers of N inputs. So 2kN multiplexer inputs have to be fed by the k arriving signals, and on average an arriving signal sees 2kN/k = 2N multiplexer inputs at its output buffer. For the half connected multiplexer only 2k(N − 1) inputs have to be fed, as one input of each multiplexer is connected to ground. Therefore each arriving signal sees on average 2(N − 1) multiplexer inputs at its output. The output load for the half connected multiplexer is lower than for the fully connected multiplexer, as 2(N − 1) < 2N. So the overhead for the half connected multiplexer is lower than for the fully connected multiplexer.
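The counting argument can be written down as a short sketch. The function name and the chosen values of k and N are illustrative assumptions; the only modelling assumption is the one from the text, namely that the k arriving signals share all non-grounded multiplexer inputs evenly.

```python
def average_fanout(k: int, n: int, grounded_inputs_per_mux: int = 0) -> float:
    """Average number of multiplexer inputs driven by one arriving signal.

    Each of the k departures uses two multiplexers of n inputs; grounded
    inputs do not load any arriving signal.
    """
    routed_inputs = 2 * k * (n - grounded_inputs_per_mux)
    return routed_inputs / k


# Example values (assumed): k = 8 arrivals and departures, N = 6 as in Figure 3.9.
fully = average_fanout(k=8, n=6)                             # 2N
half = average_fanout(k=8, n=6, grounded_inputs_per_mux=1)   # 2(N - 1)
print(fully, half)
```

The result does not depend on k, which matches the expressions 2N and 2(N − 1) in the text.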

Secondly we show that the added functionality of the half connected multiplexer is higher than for the fully connected multiplexer. Assume that any combination of inputs is of equal importance for us. In that case the highest number of possible input combinations will yield the best configuration. The number of input combinations for the half connected configuration is $(N-1)^2$: each input signal of the first multiplexer can be combined with each input signal of the second multiplexer. There are $2N-2$ possibilities to route a single signal, as one can choose any signal of the two multiplexers, but the ground signals are useless. For the fully connected configuration there are $\frac{N(N-1)}{2}$ input combinations possible: we exclude the cases where equal inputs are chosen from both multiplexers and where two inputs are swapped. There are $N$ possibilities of routing a single signal. As $(N-1)^2 \geq \frac{N(N-1)}{2}$ and $2N-2 \geq N$ for $N \geq 2$, we always take the half connected configuration. Note that $N \geq 2$ always holds, as a multiplexer of only one input is useless. So the functionality of the half connected multiplexer is always at least as good as that of the fully connected multiplexer if we may assume that any input combination is of equal value.
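The two counts can be verified by brute-force enumeration. The sketch below is illustrative only: the function names are hypothetical, and inputs are represented by plain integers.

```python
from itertools import combinations


def fully_connected(n: int) -> tuple[int, int]:
    """(pair combinations, single-signal routes) when both muxes share n inputs."""
    inputs = range(n)
    pairs = len(list(combinations(inputs, 2)))  # unordered pairs of distinct inputs
    singles = n
    return pairs, singles


def half_connected(n: int) -> tuple[int, int]:
    """Same counts when each mux has n - 1 private inputs plus one ground input."""
    first = range(n - 1)
    second = range(n - 1, 2 * (n - 1))
    pairs = len([(a, b) for a in first for b in second])
    singles = len(first) + len(second)
    return pairs, singles


for n in range(2, 11):
    half_pairs, half_singles = half_connected(n)
    full_pairs, full_singles = fully_connected(n)
    assert half_pairs == (n - 1) ** 2
    assert full_pairs == n * (n - 1) // 2
    assert half_pairs >= full_pairs and half_singles >= full_singles

print(half_connected(6), fully_connected(6))
```

For N = 2 both configurations give the same counts, so the inequalities hold with equality in that case.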

Note that it is more difficult to design a routing algorithm for the half connected configuration than for the fully connected configuration. A relatively easy prototype algorithm can first be designed for the fully connected multiplexers, but afterwards a more complex routing algorithm has to be designed for the half connected multiplexers.

Figure 3.9: Comparison between the half and fully connected configuration for equal N


Figure 3.10: Number of output loads in a switch box for each arriving signal.

3.1.6 Reliability COFFE

COFFE is an academic tool. Its results are therefore only an estimate of the real hardware performance, as the exact architecture of a commercial FPGA is known only to its manufacturer.

As COFFE is also a relatively recently developed tool, we decided to verify some results. We checked the optimal number of added buffers. When we added three (instead of two) buffers after the original multiplexer, the delay increased and the area decreased. For four buffers the opposite happened: the delay was shorter than the original delay while the area increased. Furthermore the area-delay product decreased in this case. As COFFE tries to minimize the area-delay product, this is a remarkable result. Our hypothesis is that the transistor sizes of the first buffer are chosen too large. This causes a high output load for the previous driving signal. In this way the total delay increases.

COFFE works with a grid search: the sizes of the buffers are changed in several iterations and the area-delay product is minimized. Possibly the iteration intervals are too coarse for the case with two buffers. This may be the reason why no better solution is found. More research on this topic is necessary to obtain reliable results.

3.1.7 Replaced switch box multiplexers

Only the multiplexers of the switch boxes were changed in the considered simulations. We would also like to know the area and delay for the case where all multiplexers (so also the multiplexers of the connection boxes) are replaced. Therefore we consider the following extrapolation: when changing the multiplexers of the switch boxes, the switch box area increases by 40% while the other blocks do not change significantly (the impact is less than 2% on the total area). Now we assume that the area of the connection box would increase by 40% as well. The contribution of the connection box to the total area is 20% in the original case. Thus, changing the multiplexers in the connection boxes adds another 8% to the total area. So a total area increase of 23% can be expected when changing all multiplexers in the FPGA.
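The extrapolation is simple arithmetic and can be sketched as follows. The variable names are hypothetical. Taking the 15% total increase of the half connected AND gate multiplexer (paragraph 3.1.3) as the switch box contribution is an assumption about how the 23% is composed; it is consistent with the figures quoted above.

```python
# All values are percentages of the original total FPGA area unless noted.
switch_box_total_increase = 15.0        # assumed: result of paragraph 3.1.3
connection_box_share = 20.0             # connection box share of the total area
assumed_connection_box_growth = 40.0    # % growth of the connection box itself

# 40 % growth of a block that accounts for 20 % of the total area.
connection_box_total_increase = (
    connection_box_share * assumed_connection_box_growth / 100
)
total_increase = switch_box_total_increase + connection_box_total_increase

print(connection_box_total_increase, total_increase)
```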


3.1.8 Optimization SRAM cells

The number of SRAM cells in the 6-input multiplexer part of our new configuration can be optimized (see Figure 3.11). Originally 5 SRAM cells are used. However an SRAM cell contains both an inverted and a non-inverted information bit, so taking the inverted signal causes no overhead. In the second stage we can use this opportunity directly: exactly one of the two signals is always chosen, so either the first or the second pass transistor needs a high voltage at its gate. This choice can be fully determined with one SRAM cell. For the first stage the situation is more complicated: the third pass transistor has to be driven high if the first two pass transistors are driven low. Therefore we can take the inverted outputs of the first two SRAM cells and connect them to an AND gate. Normally AND gates are not used due to the bulk effect which makes them slow. However in this case it is not a problem as the AND gate is used in a static way. By replacing an SRAM cell by an AND gate we save two transistors, as an SRAM cell consists of six transistors and an AND gate consists of four transistors.

COFFE assumes that SRAM cells are optimized to an area equivalent of four transistors. This assumption is made because SRAM cells are used a lot in FPGAs, so the need to optimize these cells is high. So in this paragraph we assume the manufacturers succeeded in reducing the area of the six SRAM cell transistors to an equivalent area of four transistors. If this assumption is correct, the AND gate optimization is not useful, as this component occupies an area of four transistors as well. However it might be possible that some SRAM configuration resources can be saved and that the AND gate is indeed smaller. The second case, where we use one SRAM cell instead of two, saves the area of one SRAM cell anyway. So for each multiplexing device we can save the area of two SRAM cells, since the 6-input multiplexer is used twice. This area saving reduces the area overhead of the switch box multiplexer from 40.5% to 29.1%. Taking into account that the switch box is responsible for 34.1% of a tile, the remaining added overhead of the replaced switch box multiplexers is 11% instead of 15%. Since the multiplexers in the connection boxes have more inputs, this optimization cannot be used for the connection box multiplexers. So we expect the total area overhead to be 19%.
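The step from 15% to 11% and the final 19% can be reproduced with the sketch below. It rests on an assumption that is not stated in the text: the switch box contribution to the total area is taken to scale linearly with the overhead of the switch box multiplexer. This is one reading that reproduces the rounded figures; the variable names are hypothetical.

```python
mux_overhead_before = 40.5      # % overhead of the switch box multiplexer
mux_overhead_after = 29.1       # % overhead after saving two SRAM cells
switch_box_increase_before = 15.0   # % of total area, before the optimization
connection_box_increase = 8.0       # % of total area, extrapolated in 3.1.7

# Assumption: the total-area contribution scales with the multiplexer overhead.
switch_box_increase_after = (
    switch_box_increase_before * mux_overhead_after / mux_overhead_before
)
total_increase = switch_box_increase_after + connection_box_increase

print(round(switch_box_increase_after), round(total_increase))
```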

Figure 3.11: Optimization for the 6-input multiplexer part of the half connected multiplexer.


3.1.9 General ratio use of SRAM cells

In this paragraph the area overhead caused by SRAM cells as a function of N is discussed. Here N is the number of inputs of the original multiplexer. In the original case we need $\lceil 2\sqrt{N} \rceil$ SRAM cells. When we implement this in a half connected AND gate multiplexer, we need $2\lceil 2\sqrt{N/2 + 1} \rceil$ SRAM cells. If we consider the optimization of the previous paragraph, more complex functions are needed. Figure 3.12 depicts the number of SRAM cells needed as a function of N for the original multiplexer and for the new multiplexer configuration, with the optimization taken into account. The ratio of the number of SRAM cells needed by the new configuration to the number needed by the original configuration is depicted in Figure 3.13. This ratio varies from 1 to 2. The most extreme ratios are found for low N. When N increases, the ratio is about 1.5.

The choice of N will have an important impact, since the SRAM cells are responsible for about 40% of the area of the original multiplexer. Our simulated configuration was for N = 10, which has a ratio of about 1.14. This means that for many other multiplexers the added area will be significantly higher than in our considered case. In order to overcome this problem, the AND gates should only be used in the cases with a favorable N.
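The two SRAM cell counts of this paragraph can be evaluated with a short sketch. The function names are hypothetical. The text does not give the optimized count in closed form; for N = 10 the sketch simply subtracts the two SRAM cells saved according to paragraph 3.1.8, which is an assumption that applies to this case only.

```python
import math


def sram_original(n: int) -> int:
    """SRAM cells of the original two-level multiplexer with n inputs."""
    return math.ceil(2 * math.sqrt(n))


def sram_half_connected(n: int) -> int:
    """SRAM cells of the half connected AND gate multiplexer, not optimized.

    Each of the two multiplexers has n / 2 routed inputs plus one ground input.
    """
    return 2 * math.ceil(2 * math.sqrt(n / 2 + 1))


n = 10
original = sram_original(n)
new = sram_half_connected(n)
optimized = new - 2          # assumed: one cell saved per 6-input multiplexer
ratio = optimized / original

print(original, new, optimized, round(ratio, 2))
```

For N = 10 this gives 7 cells for the original multiplexer, 5 cells per 6-input multiplexer in the new configuration, and a ratio of about 1.14 after the optimization, in line with the values quoted in the text.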

Further the total delay is a weighted sum of the delays of two kinds of multiplexers. However as the delay of the multiplexer does not change significantly, the total delay will not significantly increase or decrease.

3.1.10 Conclusion

COFFE is a tool that helps us to estimate the effects on both area and delay when changing the architecture. The total FPGA area increases by 19% and the delay does not change significantly. However we have to take into account that COFFE only provides a first-order approximation.

3.2 Changes in the tool flow

A new architecture requires a new tool flow. In this section we will have a look at the modifications in the algorithms.

3.2.1 Logic Synthesis

The output of the logic synthesis is a network of AND and NOT gates¹. This does not seem to be related to the architecture. However in the synthesis there are optimizations that focus on the use of LUTs [12]. Nevertheless, the impact of these changes is rather low. Furthermore we keep using LUTs. The only difference is that we can also use some 2-input components. In short, there is a link between the logic synthesis and the considered FPGA architecture, but in general the impact will be rather small. If the extra components were used commercially in the future, these optimizations should be considered, but in this phase of the exploration the expected improvements are too low.

¹ In real FPGA architectures it is more complicated: blocks other than LUTs are also used, such as RAM, DSP or carry chains.

Figure 3.12: Absolute comparison between the needed number of SRAM cells for the original and the new multiplexer. (Horizontal axis: N; vertical axis: number of SRAM cells; curves: SRAM ori and SRAM new.)

Figure 3.13: Relative comparison between the needed number of SRAM cells for the original and the new multiplexer. (Horizontal axis: N; vertical axis: SRAM new / SRAM ori.)

3.2.2 Technology mapping

In the technology mapping step, the AIG is converted into a network of LUTs. It is clear that this step will have to be changed significantly: instead of only using LUTs, also 2-input components are made available now. All details will be given in chapter 5.

3.2.3 Packing

In the packing algorithm certain choices will have to be made that strongly depend on where the AND gates are placed. Consider Figure 3.14, which shows a tile of an FPGA for 4-LUTs. A tile consists of a switch box, a connection box and a CLB. An FPGA is built up of many tiles. On this tile a switch box is depicted in the lower left corner. If we go up, we see a connection box which provides links to the CLB. In the CLB the inputs are all directed to each BLE. The output of each BLE is directed to the connection box below the CLB. Further consider Figure 3.15. Here a BLE is shown in more detail: 4 out of 6 inputs are chosen via several multiplexers and led to a 4-input LUT. After the LUT the signal is directed to a flip-flop or immediately to the output. This is still a simplified representation of reality, but it is detailed enough to get insight into the problem.

The question is now where to place the AND gates. There are several problems. Consider the case where we implement the AND gates in the multiplexers of the BLE itself. Then the possibilities are limited by the inputs of the CLB and the order in which the inputs of the CLB appear. This order will be the same for the other BLEs. When two BLEs require a different order to implement the AND functionality, this case is not feasible anymore. Another case can be considered: we add AND gates before the BLE, in the connection boxes or switch boxes. Then the output of an AND gate appears at the input of a CLB. In this case the BLEs that need this output signal of the AND gate should be packed together in this CLB.

Depending on the previous choice, either the output or the input signals of the AND gate at the input of the LUT should be considered when comparing the inputs of different LUTs. Furthermore many other considerations have to be taken into account. For instance one could ask how many AND gates will be implemented in series. These aspects are discussed in chapter 6. We leave these choices for the future work section.

3.2.4 Placement

When using extra 2-input components, certain LUTs will have an increased fanin. The placement algorithm should consider the threat of routing congestion when the fanin is high. Therefore more room could be allocated for the LUTs with many connections: the blocks surrounding the considered LUT should be empty or should have a low fanin and fanout in order to avoid routing congestion.

Figure 3.14: An example of a tile of an FPGA.

Figure 3.15: A BLE in more detail.

3.2.5 Routing

It is a difficult problem how to choose which physical AND gates should be used. An example is depicted in Figure 3.16. At the left we see the circuit to be mapped: LUTs A, B, and C are connected via AND gates to O. There are many possibilities to choose which AND gates to use. These AND gates will be located in the SBs and CBs. Two possibilities are given. This decision will influence the rest of the routing as it will occupy resources that can otherwise be used for other connections.

Fortunately a number of similar problems have been solved. In [13] a place and route algorithm is discussed where connections branch out. Working out the principle of the routing algorithm is probably not a difficult task, as the hardest subproblems are already solved. A necessary condition for the algorithm is commutativity of the used components. The AND gate satisfies this condition, but the AIC does not. Therefore it would be an extensive task to design a routing algorithm for the AIC.

Before a routing algorithm can be implemented, a structural change in the architecture will be needed. This is a very complex problem as an architecture can consist of hundreds of thousands of multiplexers. Only a well chosen part of these multiplexers will have to be replaced by the multiplexers with added logic. This is the only way to estimate the delay of the added gates accurately.

3.2.6 Conclusion

Implementing the routing is a very time-consuming task. Therefore the focus of this thesis is to estimate the performance of the extra components. If this thesis shows that the new FPGA would be significantly faster and/or smaller, this research should be continued. In the other case, where the components turn out not to be useful, the high cost of the extra implementation should be avoided. For this reason, all attention is devoted to developing a high-performance technology mapping algorithm. In this way, we will be able to draw reliable conclusions.

Figure 3.16: Routing example with AND gates.

Chapter 4

Candidate components

When suggesting a component, three questions should be asked:

• What is the added functionality?

• What is the delay overhead?

• What is the area overhead?

A part of the circuit can be implemented by the new 2-input components, so the number of LUTs might decrease. The circuit delay could decrease as well. Next to this gain, we also have to take the extra delay and area overhead of adding a particular component into account. The gain has to outweigh the overhead.

This chapter deals with all possible 2-input components. Further we assume that only one type of component will be added. As mentioned before, there is a routing algorithm available for commutative components. Therefore we will use only one single type of component. Furthermore, if one uses multiple types of components there are fewer degrees of freedom and routing congestion can occur. This is the phenomenon where connections cannot reach a sink anymore because the routing network is too densely occupied. First the non-configurable gates are considered, then the configurable components are described.

4.1 Non-configurable components

In this section, we will go over all possible boolean functions of two inputs that might be useful. In the first paragraph, we prove the functional equivalence of components with inverted inputs and output. Secondly we enumerate all possible non-configurable gates and we remove the useless components. In the next chapter we will give algorithms for the remaining components. Based on results of the technology mapping algorithms and the characteristics (such as commutativity and symmetry), we will make a choice at the end of this chapter.


4.1.1 Proof of equivalence

In this subsection we give a proof of equivalence with respect to the performance of the technology mapping. In this context equivalence means that two output networks of the mapping algorithm have the same number of LUTs and the same depth.

Theorem: Inverting the inputs and output of a 2-input component does not affect the technology mapping performance.

Proof: If we invert the inputs and output of all 2-input components in a circuit, several signals are inverted. Now we invert all inputs and outputs of the LUTs as well. This does not affect the performance, since the boolean function implemented by a LUT can be chosen freely. Also the primary inputs and outputs (the inputs and outputs of the circuit) are inverted. Since every signal is a connection between two of those four kinds of devices (primary input (PI), primary output (PO), LUT or 2-input component), every signal is inverted twice. When a signal is inverted twice, it retains its original state. ∎

A remark has to be made: in the proof, it was assumed that the inversion of the primary inputs and outputs could be chosen. In the final algorithms this opportunity is not used. For practical reasons it is more convenient for the customer to work with an FPGA where the inputs and outputs do not have to be inverted in the outside world. Nevertheless we are interested in big circuits, and there the differences will be negligible as the number of LUTs is large compared to the number of I/O blocks.

4.1.2 Enumeration

In Figure 4.2 all possible 2-input boolean functions are given. Now we remove all useless options. In the following enumeration we give one reason for each exclusion. For some functions several arguments can be found.

• Functions 1 to 6 do not depend on both inputs.

• Function 9 is equivalent to function 8: they are equal to each other when the inputs are swapped.

• Functions 13 to 16 are equivalent to functions 7 to 10 (because of inverted inputs and output).

The only remaining relevant boolean functions are given in Table 4.1. These will be the functions to explore for the case where non-configurable gates are used.
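The two transformations used in this enumeration can be illustrated with a short sketch. It represents each 2-input function by its truth table over the rows AB = 00, 01, 10, 11, the row order of Table 4.1. The function names are hypothetical and the numbering of Figure 4.2 is not reproduced here.

```python
from itertools import product

# Truth table rows in the order (A, B) = 00, 01, 10, 11.
ROWS = list(product((0, 1), repeat=2))


def depends_on_both(table: tuple) -> bool:
    """True if the function changes with A for some B and with B for some A."""
    f = dict(zip(ROWS, table))
    on_a = any(f[(0, b)] != f[(1, b)] for b in (0, 1))
    on_b = any(f[(a, 0)] != f[(a, 1)] for a in (0, 1))
    return on_a and on_b


def swap_inputs(table: tuple) -> tuple:
    """Truth table of the same gate with its two inputs exchanged."""
    f = dict(zip(ROWS, table))
    return tuple(f[(b, a)] for a, b in ROWS)


def invert_inputs_and_output(table: tuple) -> tuple:
    """Truth table after inverting both inputs and the output (paragraph 4.1.1)."""
    f = dict(zip(ROWS, table))
    return tuple(1 - f[(1 - a, 1 - b)] for a, b in ROWS)


all_functions = list(product((0, 1), repeat=4))
two_input = [t for t in all_functions if depends_on_both(t)]
print(len(all_functions), len(two_input))

AND = (0, 0, 0, 1)
OR = (0, 1, 1, 1)
ANDN = (0, 1, 0, 0)   # (not A) and B, column 8 of Table 4.1

assert invert_inputs_and_output(AND) == OR
assert swap_inputs(ANDN) == (0, 0, 1, 0)   # A and (not B)
```

Of the 16 functions, 6 do not depend on both inputs, which leaves 10 candidates before the equivalences are applied.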

4.2 Configurable components

4.2.1 2-LUT

In this section all useful configurable 2-input components will be enumerated. We start with an upper bound for the functionality. This is done by writing a technology mapping algorithm that is able to map on 2-LUTs (see Figure 4.3) and 6-LUTs. This is an upper bound since the 2-LUT can implement any boolean function of two inputs. Then 19 VTR benchmarks are tested with this algorithm. The resulting performance is measured by counting the number of 6-LUTs and by calculating the depth of the circuit. The 2-LUT yields an upper bound for the functionality, but on the other hand, it is a relatively slow and large component.

Figure 4.1: In this example it is clear that each signal is inverted twice when all sinks and sources are inverted. Conclusion: inverting the inputs and output of the added component yields an equivalent component.

Figure 4.2: All possible 2-input boolean functions

A  B     7     8     10    11    12
0  0     1     0     0     1     0
0  1     0     1     0     0     1
1  0     0     0     0     0     1
1  1     0     0     1     1     0
         NAND  ANDN  AND   XNOR  XOR

Table 4.1: Remaining useful boolean functions

4.2.2 2-AIC

The next step is to consider the 2-input and-inverter cone (2-AIC). This is an AND gate where each input can be inverted (a schematic is drawn in Figure 4.3). In section 5.2 it is shown that the difference in functionality with the upper bound is negligible. Therefore we can remove the 2-LUT from our collection of useful components as it causes more delay and area overhead.

4.2.3 Other components

At this point an upper bound for the functionality (the 2-LUT) is known. Also a lower bound for the overhead (the NAND gate, which is considered to be the fastest 2-input gate) is available. Thus we are searching for a device that offers more functionality than any non-configurable component and that is smaller and/or faster than the 2-AIC.

There are some more components possible. An example is the X(N)OR gate, which will be discussed in paragraph 5.2.3. It turns out that this component's functionality is not often used in the VTR benchmarks. Next we considered the option of taking the AND, NAND or ANDN gate with the choice to invert the output. The AND and NAND gate are equivalent in this case. Depending on the implementation, one can choose the better of the two. Both the (N)AND and (N)ANDN component are depicted in Figure 4.3. Clearly the delay of both components would be comparable with the delay of an AIC. However the area would be smaller. We did not write algorithms for these components as it turns out there will not be an area decrease (see chapter 6). Therefore the focus of the thesis is the design speed. If it can be shown that the 2-AIC enhances the speed of the FPGA, algorithms for both the (N)AND and (N)ANDN components should be designed. It might be possible that the contained functionality is comparable while the component is smaller.

Figure 4.3: The first component is a 2-LUT, the second one is a 2-AIC. The third and fourth components are respectively the (N)AND and (N)ANDN gates.


4.2.4 Remark

Boolean functions can sometimes be implemented by different kinds of components, due to the inherent architecture. Consider the following example: one makes use of NAND gates and makes sure every single signal is inverted through each multiplexer. In that case one can choose between an inverted or non-inverted input signal by routing via a path with an odd or even number of hops. As such, using the NAND gate is equivalent with the 2-AIC for technology mapping. At this point it is difficult to estimate the routing overhead of choosing between odd and even paths. Therefore we will continue with the 2-AIC. When implementing the routing algorithm a final decision can be made.

4.2.5 Conclusion

The 2-LUT provided an upper bound for the functionality. As the 2-AIC almost reaches the upper bound while it is smaller and faster, we keep the 2-AIC as the best possible choice. The X(N)OR gate turns out not to be useful. The (N)AND and the (N)ANDN gate are not implemented in an algorithm. The added functionality of these components cannot be higher than the added functionality of the 2-AIC. However the speed is about the same and the area is smaller than that of the 2-AIC. Because the focus of this thesis is to explore the most important opportunity, which is the speed, the design of these algorithms is postponed.

Chapter 5

Technology Mapping

For the remaining candidate components both the working principle and some basic results will be explained in this chapter. First the programming environment will be described. Then the configurable and non-configurable components will be discussed respectively.

5.1 Environment

5.1.1 Starting point

The conventional language to write CAD tools for FPGAs is C, for example in the environment of ABC. This environment contains the different parts of the tool flow and a number of useful extensions. This tool was used for some purposes in this thesis, such as verification and visualization of networks. However there was another platform available for implementing the new technology mapping algorithm.

In the HES department, a tool called LogicMap was designed, written in Java. This language is more suitable for experiments, because it focuses on ease of programming (for instance through automatic memory management) rather than on a short run-time. As we wanted to focus on the quality of the results, the Java framework was chosen. In this way, more experiments could be done.

This starting point provided an algorithm that was suitable for 6-LUTs (and for all other LUT sizes, but 6 is the most common number). The implemented algorithm is called DAOmap: depth-optimal mapping with area optimization [8]. However the area optimization is not from DAOmap, but from [9], where some improvements for DAOmap were given. To our knowledge this is the best academic depth-optimal technology mapping algorithm available.

5.1.2 Benchmarks

The input of the technology mapping algorithm is an AIG. The output is a network of LUTs. In this thesis the output will be a network of both LUTs and 2-input components. The input AIGs used are called the benchmarks.

The chosen benchmarks must be representative for two reasons. First, they should behave comparably to most commercial circuits, so that results can be compared in a representative way. Secondly, the benchmarks are used to verify whether the output network is equivalent to the input network. Therefore the benchmarks should contain all possible circuit exceptions. Clearly this is not possible. Since we have to use a finite number of benchmarks, we are never completely certain that an implementation of an algorithm is free of bugs. The choice of benchmarks is therefore very important. In this thesis the 19 VTR benchmarks were used. Furthermore, the MCNC198 benchmarks were used for additional verification. The VTR benchmarks are well-known benchmarks in today's literature. The MCNC198 benchmarks are also used, particularly the MCNC20 benchmarks, which contain the largest circuits of the MCNC198 benchmarks. However, the MCNC20 benchmarks are small compared to the VTR benchmarks. The MCNC20 benchmarks were used in the older literature, when computers were not yet able to map bigger circuits within a feasible run-time.

5.1.3 Verification

The output of the technology mapping contains three kinds of data:

• The new network representation consisting of 6-LUTs and 2-input components.

• The number of used components.

• The circuit depth.

During this thesis an extensive effort was spent to make sure the results were reliable. All given results were verified:

• For the network representation an output file (with extension .blif) is used. The Boolean equivalence of two circuits can be verified with the help of the verification tools of ABC.

• Furthermore, the number of components can be checked by having ABC print the statistics of the network. However, ABC sees every component as a LUT, so it is not able to distinguish between LUTs and 2-input components. Therefore an additional Python script was written to count the 2-input components externally. Afterwards the verification is done by comparing the numbers output by the Java program with the externally counted number of 2-input components and with the total number of components counted by ABC.

• The last verification is the depth. Due to varying depth costs (see section 5.2.1) it is not possible to use the ABC verification tools to verify the total depth. Therefore an external script was written to recalculate the depth. Of course this second calculation is done in a different way than in the Java program.

Now the correct conclusion can be drawn: even though the 198 MCNC benchmarks and the 19 VTR benchmarks were verified correctly, it is still possible that another benchmark would expose a bug. Nevertheless, all of the 217 benchmarks have passed the verification test, so the chance of bugs is relatively small.
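The external component count described above could look roughly like the following sketch. It is not the script used in this thesis, which is not reproduced here. It assumes that every component appears as a `.names` entry in the .blif file, that the last signal on that line is the output, and that an entry with exactly two inputs is a 2-input component. The function name `count_components` and the key names are hypothetical.

```python
from collections import Counter


def count_components(blif_text):
    """Count .names entries in BLIF text, split by number of inputs.

    Assumption: an entry with exactly two inputs is a 2-input component,
    any other entry with at least one input counts as a LUT.
    """
    # A backslash at the end of a line continues the entry on the next line.
    joined = blif_text.replace("\\\n", " ")
    counts = Counter()
    for line in joined.splitlines():
        tokens = line.split("#", 1)[0].split()  # drop trailing comments
        if not tokens or tokens[0] != ".names":
            continue
        n_inputs = len(tokens) - 2  # minus ".names" and the output signal
        if n_inputs == 2:
            counts["two_input"] += 1
        elif n_inputs > 0:
            counts["other"] += 1
        # entries without inputs are constants and are not counted
    counts["total"] = counts["two_input"] + counts["other"]
    return counts
```

The value under `total` is the number that would be compared with the component count reported by ABC, and `two_input` with the number reported by the mapper.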


5.2 Configurable components

In the following subsections, technology mapping algorithms are discussed that map an AIG to a network of 6-LUTs and the considered configurable 2-input components.

5.2.1 2-LUT

The general principle of the algorithm is given in pseudocode (see Algorithm 2). Note that only the cone enumeration and the cone ranking are modified to map the AIG to 2-LUTs and 6-LUTs. In the next paragraphs the algorithm is explained in more detail, using the example of section 2.3. In this example we consider 3-LUTs as baseline. In reality 6-LUTs are used, but the principle is similar.

Cone Enumeration

The algorithm starts with cone enumeration. Figure 5.1 is taken from section 2.3. Here we define all possible cones¹ for one node. All possibilities with at most three inputs (because we consider 3-LUTs in the example) are enumerated. Here we add the first change: we mark all cones consisting of only two inputs as a 2-LUT. We can do this because a 2-LUT can realize any Boolean function of two inputs.
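The enumeration and marking step can be illustrated with a minimal sketch. It uses the usual bottom-up merging of the cones of the two fanins of each AND node, which is an assumption about how the enumeration is done and not a copy of the LogicMap code. The small AIG in the usage note, the function names and the data layout are hypothetical.

```python
from itertools import product


def enumerate_cones(fanins, inputs, k):
    """Return {node: set of cones}, each cone being a frozenset of input nodes.

    fanins maps every AND node to its two fanin nodes; node ids are assumed
    to be in topological order. Only cones with at most k inputs are kept.
    """
    cones = {pi: {frozenset([pi])} for pi in inputs}
    for node in sorted(fanins):
        a, b = fanins[node]
        found = set()
        for cone_a, cone_b in product(cones[a], cones[b]):
            merged = cone_a | cone_b
            if len(merged) <= k:
                found.add(merged)
        # The trivial cone {node} lets larger cones stop at this node.
        found.add(frozenset([node]))
        cones[node] = found
    return cones


def two_input_cones(cones, node):
    """Cones of `node` that would be marked as a 2-LUT."""
    return {cone for cone in cones[node] if len(cone) == 2}
```

For a hypothetical AIG with inputs 1, 2, 3 and nodes 4 = AND(1, 2), 5 = AND(4, 3), 6 = AND(4, 5), calling `enumerate_cones({4: (1, 2), 5: (4, 3), 6: (4, 5)}, [1, 2, 3], 3)` gives node 6 the cones {1, 2, 3}, {1, 2, 5}, {3, 4} and {4, 5}, of which the last two are marked as 2-LUT.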

Depth

The concept of ‘depth’ is of high importance in the cone ranking, so first some additional information is given. Section 2.3 explains that the depth of a signal is the number of 6-LUTs it has to traverse from input to output². The depth of a circuit is equal to the depth of the longest signal. This depth is an estimate of the delay. It is not perfect, because the routing has an important impact on the delay.

Consider now Figure 5.2. At the left side, we see an example of a path in a circuit. In this example the depth is equal to 6 as there are six 6-LUTs to traverse. We said that the delay depends on the routing. Therefore it is important to reduce the fanin and fanout of LUTs. In that way better placement solutions can be found and more routing resources will be available. Therefore the first 6-LUT of Figure 5.2 will probably add a lower delay than the second 6-LUT. So we can define the depth cost as the most likely delay added by a 6-LUT, including the routing to and from the considered LUT.

¹ The definition of a cone is taken from section 2.3: a cone is a subnetwork that is defined by its inputs and output. It represents the subnetwork that can be implemented by a LUT. As an N-LUT can implement any Boolean function of N inputs, we only consider the input nodes and output node in technology mapping. The corresponding Boolean function that has to be implemented is calculated afterwards.

² The clock speed is determined by the longest combinatorial path present in the circuit. Therefore we define the start of a path as a primary input or the output of a flip-flop. Similarly, the end of a path is by definition a primary output or the input of a flip-flop.


input : AIG
output: Network of 6-LUTs and 2-LUTs

//ConeEnumeration;
for allNodes do
    findAllPossibleCones(Node);
    for allPossibleCones do
        if twoInputCone then
            markAs2LUT(Cone);
        end
    end
end

//ConeRanking;
for allNodesFromInputToOutput do
    for allConesOfNode do
        if is6LUT(Cone) then
            Cone.depth = maximumDepth(Cone.inputs) + 1;
        else
            Cone.depth = maximumDepth(Cone.inputs) + depthCost2LUT;
        end
    end
    Node.bestCone = chooseLowestDepthCone(allConesOfNode);
end

//ConeSelection;
for allOutputs do
    setVisible(Output);
end
for allVisibleNodesFromOutputToInput do
    for VisibleNode.Cone.inputNodes do
        setVisible(inputNode);
    end
end

//AreaRecovery;
for allNodesFromOutputToInput do
    Node.requiredDepth = calculateRequiredDepth(Node);
end
for allNodesFromInputToOutput do
    findBestGlobalAreaUnderRequiredDepthConstraint(Node);
end
for allNodesFromInputToOutput do
    findBestLocalAreaUnderRequiredDepthConstraint(Node);
end

Algorithm 2: Technology Mapping


Figure 5.1: The cone enumeration for a maximum of 3 inputs, example for node 6

Figure 5.2: Two signal paths in a network for depth comparison.


Consider now the right side of Figure 5.2. A critical path is given with 6-LUTs and 2-LUTs (or any other 2-input component). The question is now: which of the two paths is the longest? The first path contains six 6-LUTs, the second path contains three 6-LUTs and five 2-LUTs. For the algorithm it is necessary to know the delay impact of a 2-LUT compared to a 6-LUT. We know that the delay of a 2-LUT will be the lower of the two, since a 2-LUT is faster than a 6-LUT. Furthermore, the routing will offer more degrees of freedom for using 2-LUTs than for 6-LUTs. However, it is not possible to estimate the depth cost of a 2-LUT accurately compared to the depth cost of a 6-LUT. The only way to find it is by implementing a detailed architecture and a corresponding routing algorithm. As this information is not available at this moment, we use the depth cost of the added component as an input parameter of the algorithm. Results will be discussed for a range of the depth cost between 0 and 1: the depth cost is positive and lower than the depth cost of a 6-LUT (which is by definition equal to 1).
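As a worked illustration of the comparison between the two paths of Figure 5.2, write c for the depth cost of a 2-LUT (the symbols c, d_left and d_right are introduced here only for this example, and the block assumes the amsmath package). The left path has six 6-LUTs, the right path has three 6-LUTs and five 2-LUTs:

```latex
\begin{align*}
d_{\text{left}} &= 6 \cdot 1 = 6, \\
d_{\text{right}}(c) &= 3 \cdot 1 + 5c, \\
d_{\text{right}}(c) < d_{\text{left}} &\iff 3 + 5c < 6 \iff c < 0.6 .
\end{align*}
```

So under this simple model the right path is the shorter one only when the depth cost of the 2-LUT is below 0,6, which shows why the answer depends on a parameter that cannot be determined at this stage.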

Cone Ranking

In the cone ranking we start at the inputs of the circuit. Every input has a depth of zero. Then we descend one level and calculate the depths of all possible cones for each node at that level. This is calculated as follows: consider all input nodes of the particular cone, take the maximum depth of these input nodes and add 1. At this point we modify the cone ranking: instead of always adding 1, we differentiate between the LUTs. If the cone represents a 3-LUT (or a 6-LUT in real FPGAs), 1 is added at the end of the calculation. If the cone represents a 2-LUT, a relative depth cost (chosen between 0 and 1) is added. If we assume that we use the correct depth cost, the algorithm is still depth-optimal. In Figure 5.3 an example of cone ranking is given for an equivalent depth cost of 0,6.
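The modified depth calculation for one node can be sketched as follows. This is a simplified illustration and not the LogicMap implementation: a cone is represented only by the tuple of its input nodes, every cone with exactly two inputs is treated as a 2-LUT, and the names `rank_cones` and `depth_cost_2lut` are hypothetical.

```python
def rank_cones(cones, depth, depth_cost_2lut):
    """Return (best_depth, best_cone) for one node.

    cones: list of tuples with the input nodes of each candidate cone.
    depth: dict mapping already ranked nodes to their depth.
    """
    best = None
    for cone in cones:
        # A two-input cone is a 2-LUT and adds the relative depth cost,
        # any larger cone is a regular LUT and adds 1.
        cost = depth_cost_2lut if len(cone) == 2 else 1.0
        cone_depth = max(depth[node] for node in cone) + cost
        if best is None or cone_depth < best[0]:
            best = (cone_depth, cone)
    return best
```

With a depth cost of 0,6 and a node that can be covered either by a 3-input cone on circuit inputs or by a 2-LUT behind another 2-LUT, `rank_cones([(1, 2, 3), (4, 3)], {1: 0, 2: 0, 3: 0, 4: 0.6}, 0.6)` selects the 3-input cone with depth 1, because the chain of two 2-LUTs would reach depth 1,2. For a lower depth cost the choice flips to the 2-LUT.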

Now consider the real depth cost and the estimated depth cost. We do not know the real depth cost, so we have to use an estimated depth cost as input parameter for the algorithm. The choice of the estimated depth cost is important: when it gets lower, the cone ranking will tend to choose more 2-LUTs over 6-LUTs and vice versa. When working with a wrong estimated depth cost, the decisions made in the technology mapping step are not depth-optimal anymore. So when the pack, place and route tools are implemented, the real depth cost should be measured as follows: take a specified depth cost, run the tool flow and measure the delay of the implementation. Execute this procedure for several benchmarks and manually defined depth costs. The real depth cost is best approximated by the depth cost that was used in the case where the measured delay was the lowest for all benchmarks.

In the following we will often use the concept of the depth cost. We will assume that the estimated depth cost is equal to the real depth cost. We will always use ‘depth cost’ when we mean ‘the relative delay caused by the considered 2-input component compared to the delay of a 6-LUT, including the routing delay necessary to use the particular component’.

Cone Selection and Area Recovery

In the cone selection nothing has to be changed. Furthermore, the area recovery can be used as is, except for a small modification in the calculation of the required depth, which has to take the depth cost of the 2-LUTs into account.


Figure 5.3: The principle of cone ranking when using both 2-LUTs and 3-LUTs. The depth costs are respectively 0,6 and 1.

Figure 5.4: The cone selection. Compared with Figure 2.14 we see that the resulting depth is 2,2 instead of 3. Furthermore three 2-LUTs and three 3-LUTs are used. This is cheaper than using six 3-LUTs, as 2-LUTs are smaller than 3-LUTs.


Results

Results of this algorithm can be found in Figure 5.5. Here 6-LUTs are considered in combination with 2-LUTs. On the x-axis the depth cost is varied from 0 to 1. The y-axis shows the relative circuit depth and the relative number of LUTs used. These values are sums over the 19 VTR benchmarks. This means we added the numbers of 6-LUTs of all benchmarks. Then we compared this sum to the sum for the state-of-the-art case and normalized the results to that sum. The number of 2-LUTs added is compared to the same sum. We also made a similar sum for the depth and compared it to the depth of the state-of-the-art case.

When the depth cost is considered to be zero, all functionality is implemented by 2-LUTs. This is not a surprise: the cone ranking sees these components as free to use and will always favor them. As the output network only consists of 2-LUTs and the depth cost is considered to be zero, the circuit depth is zero. Furthermore, the number of 6-LUTs is zero in this case. At the right side of the graph, the depth cost is considered to be 1. In this case not a single 2-LUT is used: when the cone ranking has to choose between a 6-LUT and a 2-LUT of equal cost, it will take the 6-LUT because it can contain more functionality. In fact this point is the normal DAOmap algorithm, as only 6-LUTs are considered. So this is our point of reference.

If we now look at the number of 2-LUTs, we see that in the case of zero depth cost the relative number is 1,2. This means that when 2-LUTs are used instead of 6-LUTs, 1,2 times more LUTs are needed. This is only a small difference and one could ask why 6-LUTs are used instead of 2-LUTs. The answer is that the depth can be made significantly smaller with 6-LUTs. Furthermore, for the MCNC20 benchmarks the result is a factor of 3,6. Clearly, for these benchmarks 6-LUTs offer an even larger advantage over 2-LUTs. With an appropriate depth cost a good trade-off can be found between the number of 2-LUTs and 6-LUTs.

We explained the meaning of the extreme right and left points on the graph. In between there are still some questions we have to answer. First we have a look at the depth of the circuit. It increases from left to right, caused by an increasing depth cost. Two effects can be observed:

• The critical path in a circuit consists of $k_2$ 2-LUTs and $k_6$ 6-LUTs, and its delay is calculated as follows:

$\text{delay}_{\text{critical path}} = k_2 \cdot \text{depth cost} + k_6$

So when the depth cost increases, the calculated depth of the circuit increases.

• If the depth cost increases, the algorithm will increasingly favor the 6-LUTs. So the network will change depending on the depth cost.

The first effect explains the linear fragments of the graph. This will be shown in more detail in section 6.2.1. However, the graph is not completely linear: it converges to one, thanks to the second effect.
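The interplay of the two effects can be made concrete with a small sketch. It is a strong simplification: each alternative mapping is reduced to the number of 2-LUTs and 6-LUTs on its critical path, and the mapper is assumed to pick the alternative with the lowest depth. The candidate list in the usage note and the name `circuit_depth` are made up for illustration and are not measured data.

```python
def circuit_depth(candidates, depth_cost):
    """Lowest critical-path depth over a set of alternative mappings.

    candidates: (k2, k6) pairs, the number of 2-LUTs and 6-LUTs on the
    critical path of each alternative mapping (hypothetical values).
    """
    return min(k2 * depth_cost + k6 for k2, k6 in candidates)
```

For the hypothetical candidates `[(12, 0), (5, 3), (0, 6)]` the depth is 12 times the depth cost for small depth costs, switches to the line 3 + 5 times the depth cost at about 0,43, and stays at 6 (the pure 6-LUT mapping) from 0,6 onwards. Each selected mapping contributes one linear fragment, and the change of mapping bends the curve towards the reference value.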


[Plot: relative number of 6-LUTs, relative number of 2-LUTs and relative circuit depth versus depth cost (0 to 1).]

Figure 5.5: The results of combining 2-LUTs and 6-LUTs in DAOmap for the VTR benchmarks.

Theoretically the results for the 2-LUT are of high importance. This component contains all possible functionality for 2 inputs, so it provides an upper bound for the added functionality. Better results for another component after Technology Mapping are not possible with DAOmap. Nevertheless it is important to explore the other components as well, as they are smaller and faster.

5.2.2 2-AIC

The 2-AIC is an AND gate with the choice to invert each input separately. The algorithm is nearly identical to the previous one: in the cone enumeration we mark the cones consisting of a single AND gate (possibly with inverted inputs) as a 2-AIC. This component also receives an equivalent depth cost, which the cone ranking again takes into account. In Figure 5.6 the results of the 2-AIC and the 2-LUT are compared. The results are very similar.

The upper bound is approached very closely for depth costs greater than 0,1. This is not a surprise: after the Logic Synthesis, the nodes are connected in such a way that the number of AND gates is minimized. In other words, the input network is already an optimized configuration for implementing 2-AICs.

As a 2-LUT needs a larger area and causes more delay than the 2-AIC, we can say that the 2-AIC is better than the 2-LUT. The focus of this thesis is to increase the speed, and we observe that both circuit depths are very similar. As the 2-AIC is faster, it will yield the fastest FPGA.


[Plot: relative numbers of 6-LUTs, 2-AICs and 2-LUTs and relative circuit depth versus depth cost (0 to 0,5), shown for both the 2-AIC and the 2-LUT.]

Figure 5.6: The results for the 2-AIC compared with the 2-LUT for the VTR benchmarks.

5.2.3 X(N)OR

The X(N)OR gate is represented in Figure 5.7. It is an XOR gate where one can choose to invert the output. It will be shown in this section that this is a useless component: it is larger and slower than an AND gate. On top of this, the added functionality is very low.

Again the same principle is used in the algorithm: in the cone enumeration the possible X(N)OR cones are marked. We mark every cone that consists of two inputs where neither input is a direct input of the considered root node in the AIG. Consider Figure 5.8 (a). We enumerate all possible cones for node 5 according to the two properties. The first condition is that the cone must have two inputs. This is the case for cones³ {{1, 2}, 5} and {{3, 4}, 5}. The second condition is that neither input may be directly connected to the output node in the AIG. There are no edges from node 1 to node 5 or from node 2 to node 5, so the condition holds for the first cone. However, the condition does not hold for the second cone, as there is an edge from node 3 to node 5 and from node 4 to node 5. So only cone {{1, 2}, 5} would be marked in this case.
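The two marking conditions can be written down as a short filter. This is a sketch under assumptions: the fanins of nodes 3 and 4 are taken to be nodes 1 and 2, which is one reading of the description of Figure 5.8 (a), and the function name `xnor_candidates` and the data layout are hypothetical.

```python
def xnor_candidates(cones, root, fanins):
    """Cones of `root` that would be marked as X(N)OR.

    cones: candidate cones of `root`, each a tuple of input nodes.
    fanins: dict mapping every AND node to its two direct fanin nodes.
    """
    direct = set(fanins[root])
    return [
        cone
        for cone in cones
        # Condition 1: exactly two inputs.
        # Condition 2: no input is a direct fanin of the root node.
        if len(cone) == 2 and not (set(cone) & direct)
    ]
```

For the assumed structure `{3: (1, 2), 4: (1, 2), 5: (3, 4)}`, calling `xnor_candidates([(1, 2), (3, 4)], 5, ...)` keeps only the cone with inputs 1 and 2, as in the example above.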

³ A cone is described as follows: {{input1, input2, ..., inputN}, output}

Figure 5.7: The X(N)OR gate


A proof will be given that this is a correct marking technique. We show that for any cone where both conditions hold, the network can be optimized if the cone does not represent an X(N)OR function. As the network is optimized during the Logic Synthesis, and these optimizations can be done locally (so with low run-time), we can assume that no unoptimized subnetworks are still present in the considered AIGs. An example of an optimization is given in Figure 5.8 (b): the first subnetwork does not represent an X(N)OR implementation, thus it can be optimized to a smaller and faster network. Now the proof is given:

Figure 5.8: An example of a cone representing a XOR function (a) and an example of an optimization (b)

Theorem: Cone C consists of two inputs and these inputs are not directly connected to the output node in the AIG ⇔ Cone C can be implemented by one X(N)OR.

Proof: For the ⇒ case we consider again the table with all possible Boolean functions (see Figure 5.9).

Now we have to prove that every considered cone (a cone of two inputs where the inputs are not directly connected to the output) is case 11 or 12. We continue this proof by excluding all other options:

• Cases 1 to 6 do not depend on both inputs. Such cases are excluded in the Logic Synthesis as they are useless.

[Table legend: AIC; inverted AIC; do not depend on both inputs.]

Figure 5.9: All boolean 2-input functions.


• Cases 7 to 10 are implementations of a 2-AIC. In the Logic Synthesis the number of 2-AICs and the depth are minimized. If one wants to implement a signal that depends on two inputs and can be represented by a 2-AIC, both the depth and the number of 2-AICs are minimized by adding the 2-AIC directly after the inputs. So each time this Boolean function is used, the inputs of the cone are directly connected to the root node.

• Cases 13 to 16 are inverted 2-AICs. These functions are not used in Logic Synthesis: if they are needed, a non-inverted 2-AIC is used. The inversion of the signal can be solved by also inverting the signal at the inputs of the components where the signal is needed.

The only remaining functions are 11 and 12⁴. Now we still have to prove the ⇐ direction. Cones are excluded in two cases:

• The cone does not consist of two inputs.

• The inputs of the cone are the nodes that are directly connected to the root node.

In the first case it is clearly impossible that an X(N)OR implementation is excluded. In the second case we exclude the implementation of an AND gate with possibly inverted inputs. As the truth table of an AND gate contains only one 1, this is not an X(N)OR implementation.
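The case distinction used in the proof can be checked by enumerating all sixteen 2-input Boolean functions. The following sketch does not reproduce the case numbering of Figure 5.9. It only confirms the group sizes: six functions that do not depend on both inputs, four 2-AIC functions, four inverted 2-AIC functions and two X(N)OR functions. The function name `classify` and the group labels are chosen for this example.

```python
from collections import Counter
from itertools import product


def classify(truth_table):
    """Classify a 2-input function given its outputs for
    (a, b) = (0, 0), (0, 1), (1, 0), (1, 1)."""
    f = dict(zip(product((0, 1), repeat=2), truth_table))
    depends_on_a = any(f[(0, b)] != f[(1, b)] for b in (0, 1))
    depends_on_b = any(f[(a, 0)] != f[(a, 1)] for a in (0, 1))
    if not (depends_on_a and depends_on_b):
        return "not dependent on both inputs"
    ones = sum(truth_table)
    if ones == 1:
        return "2-AIC"            # AND with possibly inverted inputs
    if ones == 3:
        return "inverted 2-AIC"   # complement of such an AND
    return "X(N)OR"               # two ones and dependent on both inputs


def group_sizes():
    """Number of 2-input functions in each group."""
    return Counter(classify(tt) for tt in product((0, 1), repeat=4))
```

The single 1 in the truth table of every 2-AIC function is exactly the property used in the ⇐ direction of the proof: an X(N)OR function has two ones, so it can never coincide with a 2-AIC.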

The results for this component are given in Figure 5.10. Here a depth cost of zero was considered. These results are bad. Only a few benchmarks show a very small improvement in depth. This kind of functionality is clearly not often used in circuits. So we no longer consider the X(N)OR gate. As the functionality of the X(N)OR component is an upper bound for the functionality of the XOR and XNOR components, we can immediately remove these candidate components as well.

5.2.4 Comparison

It is clear that the X(N)OR gate is not a useful component. Furthermore, the added functionality of the 2-AIC approaches the upper bound, which is provided by the 2-LUT. As the 2-LUT is slower and larger, only the 2-AIC is kept as a possible candidate component.

5.3 Non-configurable components

When considering non-configurable 2-input components, a problem occurs: the inversion of the inputs and output cannot be chosen. This is a problem because there are many inverted edges in an AIG. In several cases this problem can be partially solved by inverting certain inputs or outputs of LUTs in the circuit⁵. However, this is a complex problem: often choices have to be made where the implementations of some gates are favored over other gate implementations. In the first subsection we discuss the horizontal and vertical dilemma. After this explanation the algorithm for the AND gate is discussed. Then the algorithms for the NAND gate and the ANDN gate are given.

⁴ In the 198 MCNC benchmarks exceptions were found for both the first and second statement of the proof. Thus, local improvements for the implementation of the synthesis are still possible. As these exceptions occur very seldom, we can state that the influence on the results is negligible.

[Plot: relative number of LUTs and relative depth for each of the 19 VTR benchmarks.]

Figure 5.10: The results of the X(N)OR gate for the VTR benchmarks.

5.3.1 The horizontal and vertical dilemma

Horizontal dilemma

Consider Figure 5.11. It is impossible to implement node 7 with an AND gate, as its right input is inverted and the inputs of an AND gate cannot be inverted. However, if we implement node 3 by using a LUT and invert its output, it becomes possible. On the other hand, when we invert node 3 it is no longer possible to implement node 9 by an AND gate, because its input must not be inverted. So we have to choose whether to give the possibility to node 7 or to node 9. In general a horizontal dilemma occurs when the output signal of a node is connected to both an inverted and a non-inverted input.

Vertical dilemma

Consider again Figure 5.11. Assume node 3 is implemented inverted by a LUT and node 7 is implemented by an AND gate. The output of an AND gate cannot be inverted. This means that the signal of node 10 will certainly need a LUT (where the inputs can be inverted). Again one has to choose whether to favor the AND gate implementation of node 7 or of node 10. In general the vertical dilemma occurs at each inverted input of a node: the algorithm then has to choose whether to make it possible to use an AND gate for the root node or for the input node.

⁵ Inverting inputs or outputs of a LUT causes no overhead: any Boolean function can be implemented by a LUT. When an input or output is inverted, only the Boolean function of the LUT has to be modified.


Figure 5.11: An example of a numbered AIG.

Remark: the vertical and the horizontal problem occur very often. Furthermore, when there is a horizontal problem, there is also a vertical problem: the horizontal problem only occurs when a node has an inverted input, and the presence of an inverted input is the only condition for a vertical problem. In the next subsection a depth-optimal algorithm is described that deals with these choices.
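Both dilemmas can be detected directly from the edge list of the AIG, as the following sketch shows. The edge list in the usage note is only loosely modelled on the description of Figure 5.11 and is an assumption, as are the function names and the `(source, sink, inverted)` edge format.

```python
def horizontal_dilemmas(edges):
    """Nodes whose output feeds both an inverted and a non-inverted input.

    edges: iterable of (source, sink, inverted) tuples.
    """
    polarities = {}
    for source, _sink, inverted in edges:
        polarities.setdefault(source, set()).add(bool(inverted))
    return sorted(node for node, seen in polarities.items() if len(seen) == 2)


def vertical_dilemmas(edges):
    """(source, sink) pairs of all inverted edges: at each of them either
    the source or the sink can be an AND gate, but not both."""
    return sorted((source, sink) for source, sink, inverted in edges if inverted)
```

With the hypothetical edges `[(1, 3, False), (2, 3, False), (3, 7, True), (3, 9, False), (7, 10, True)]`, node 3 is reported as a horizontal dilemma and the edges (3, 7) and (7, 10) as vertical dilemmas. Every node with a horizontal dilemma is also the source of at least one vertical dilemma, in line with the remark above.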

5.3.2 The AND gate

In short, the horizontal and vertical problems are tackled by searching for the best possible depth (where we consider an inverted and a non-inverted depth for each node) and by duplicating nodes when problems occur. Duplicating nodes increases the LUT usage, but on average only 1,2% of all implemented nodes are duplicated. The pseudocode is given in Algorithm 3.

Cone enumeration

In the cone enumeration all cones that can be implemented by a 2-AIC are copied. The copies are marked as an AND gate. The original cones remain LUT implementations, as we still want to have the possibility to implement any node with an inverted output (thus with a LUT implementation).

Cone ranking

In the cone ranking two ‘best cones’ are calculated: one for the case where we want to implement the output of the node in the ‘inverted state’, one for the ‘non inverted state’. The difference between these two calculations is that in the first case no use is made of the AND gate (it is impossible to invert its output). In the second case, the cone ranking can use the possibility of adding an AND gate.

input : AIG
output: Network of 6-LUTs and AND gates

//ConeEnumeration;
for allNodes do
    findAllPossibleCones(Node);
    for allPossibleCones do
        if hasTwoDirectInputs(Cone) then
            copyAnd = copy(Cone);
            markAsAnd(copyAnd);
        end
    end
end

//ConeRanking;
for allNodesFromInputToOutput do
    //CalculateNonInvertedDepth;
    for allConesOfNode do
        if is6LUT(Cone) then
            Cone.depth = maximumDepthOfAllInputs(min(InvDepth, NonInvDepth)) + 1;
        else
            Cone.depth = maximumDepthOfAllInputs(appropriateDepthOfInputNodes) + depthCostAnd;
        end
    end
    //CalculateInvertedDepth;
    for allConesOfNodeButAndCone do
        //invertedCaseAlwaysImplementedByLUT;
        Cone.depthInverted = maximumDepthOfAllInputs(min(InvDepth, NonInvDepth)) + 1;
    end
    Node.bestCone = chooseLowestDepthCone(allConesOfNode);
    Node.bestConeInverted = chooseLowestInvertedDepthCone(allConesOfNodeButAndCone);
end

//ConeSelection;
for allOutputs do
    setNonInvertedVisible(Output);
end
for allVisibleNodesFromOutputToInput do
    for VisibleNode.Cone.inputNodes do
        // The setVisible method takes into account whether it is invertedVisible or nonInvertedVisible;
        setVisible(inputNode);
    end
end

Algorithm 3: Technology mapping with AND gates

An example is given in Figure 5.12. In the example it is assumed that inputs of the circuit are never inverted. For nodes 1 to 4 it is not possible to implement any node with an AND gate. Node 5 can use an AND gate as both of its inputs are not inverted. In the example the equivalent depth cost is considered to be zero. So, for node 5 the left number (non inverted depth) is zero and the right number (inverted depth) is equal to one.

For nodes 6, 7 and 9 the best choice is to take a 3-LUT that reaches from the circuit inputs. For node 8 this is not the case: this node depends on 4 different inputs. As we consider 3-LUTs, we cannot find a feasible cone that reaches the circuit inputs. A possibility is to take both inputs of the node as a cone. Then we have to take the maximum depth of its inputs, which equals one. If we implement the node by using a LUT (for the inverted case), 1 is added, so the depth of the cone is 2. When we want to implement the non inverted version of the signal, we can use an AND gate. We take again the maximum depth of both inputs of the AND gate, which is still equal to one. Then we add zero, because this is the equivalent depth cost of an AND gate in the example. The result is a depth of one. The reasoning for the other depths is similar.
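The calculation of the two depths can be sketched for the cone formed by the two direct inputs of a node. This is a simplification of the cone ranking, which considers all enumerated cones. The names `and_node_depths` and `depth_cost_and` are hypothetical, and a circuit input is modelled with an infinite inverted depth to express that it is never available in inverted form.

```python
INF = float("inf")


def and_node_depths(fanins, depth, depth_cost_and):
    """Return (non_inverted_depth, inverted_depth) of one node.

    fanins: two (node, edge_inverted) pairs, the direct inputs of the node.
    depth: dict mapping a node to its (non_inverted, inverted) depths.
    """
    # A LUT can invert its inputs, so it takes the best polarity of each.
    lut = max(min(depth[node]) for node, _ in fanins) + 1
    # An AND gate needs each input in the polarity of the AIG edge.
    gate = max(depth[node][1 if inv else 0] for node, inv in fanins) + depth_cost_and
    # The inverted output is only available from the LUT implementation.
    return min(lut, gate), lut
```

With a depth cost of zero, a node with two non-inverted circuit inputs, such as node 5 in the example, gets the depths (0, 1). A node whose inputs have depths (0, 1) and (1, 1), comparable to node 8, gets (1, 2).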

Cone selection

In the cone selection (see Figure 5.13) we start with making nodes 11 and 12 ‘visible’ as these are used as outputs. Then we start from node 12. For its cone (an AND gate), node 10 has to be made ‘non inverted visible’ and node 11 ‘inverted visible’. Note that node 11 was already set ‘visible’. Now it is updated from ‘visible’ to ‘inverted visible’. When it is not important whether a node is ‘inverted visible’ or ‘non inverted visible’ (when the node is used by a LUT, the inversion of the LUT inputs can be chosen), the node is set ‘visible’. When a node is implemented by using an AND gate there is no choice anymore for its inputs: they are updated to ‘inverted visible’ or ‘non inverted visible’, which is more specific than ‘visible’. Sometimes it might happen (as a consequence of the horizontal dilemma) that a node has to be both inverted and non inverted visible. This case is allowed: the signal is then duplicated.

Remark: when a node is updated from ‘visible’ to ‘inverted visible’ or ‘non inverted visible’, it can happen that the optimal depth is not taken. Therefore, at the end of the cone selection it is calculated for each node whether it is on the critical path or not. If a node is on the critical path, we make sure that the best possible depth is always implemented. For this it will sometimes be necessary to duplicate certain nodes.

After node 12 we continue with the inputs of nodes 10 and 11. The inputs of the cone of node 10 are inputs of the circuit itself. The inputs of the best cone of node 11 are nodes 2, 4 and 9. It does not matter whether the inverted or the non inverted signal is taken, as the depth is the same for the three nodes. By default the non inverted signal is chosen.

Compared to the result of section 2.3, we have now implemented the same circuit with five 3-LUTs and an AND gate instead of six 3-LUTs. Furthermore, the depth is 2 instead of 3 (when considering a zero depth cost).

Figure 5.12: A representation of the cone ranking when using 3-LUTs and AND gates.

Figure 5.13: A representation of the cone selection when using 3-LUTs and AND gates. In this case only the node at the bottom is represented by using an AND gate.

Results

The results for the 19 VTR benchmarks can be found in Figure 5.14 for the AND gate. Not all combinatorial functionality can be implemented in AND gates, so even for a depth cost of zero, the circuit depth is not zero: there are still LUTs on the critical path. A significant reduction of both area and depth is observed, but the gain is not as high as for the 2-AIC. However, the AND gate is smaller and faster than the 2-AIC (as was discussed in section 3.1). Based on these results we have to keep both components, according to the Pareto rules. Now we prove that the given algorithm is depth-optimal.

Theorem: The proposed technology mapping algorithm for the AND gate is depth-optimal.

Proof: We first prove by mathematical induction that in the cone ranking every node gets an optimal depth annotation. Subsequently we show that a depth-optimal solution is found in the cone selection with the information of the cone ranking.

Start condition: the depths of circuit inputs are zero.

Next we consider level N in the network. Assume that the depths (both inverted and non inverted) of all nodes of all previous levels are calculated correctly. In the cone enumeration all possible cones are enumerated. Now we only have to prove that the depths are calculated correctly, that is, that no lower depth than the annotated depth can be achieved.

The depth of a LUT is calculated as follows: for each input of the representing cone take the minimum of its ‘inverted depth’ and ‘non inverted depth’, take the maximum of these values over all inputs and add 1. Taking the maximum depth of all inputs and adding 1 is correct as it is literally copied from the existing DAOmap algorithm. Furthermore, we can choose between the ‘inverted depth’ and the ‘non inverted depth’ because the inputs of a LUT can be inverted.

The depth of an AND gate is calculated as follows: take the maximum of the appropriate depths of both of its inputs and add the depth cost. For both the LUT and the AND gate the following statement applies: if the previous nodes are depth-optimal, we find the best possible implementation for the considered node.

Now we still have to make the distinction between the ‘inverted depth’ and the ‘non inverted depth’: for the ‘inverted depth’ take the LUT implementation (we cannot use an AND gate as we cannot invert its output), for the ‘non inverted depth’ take the minimum depth of the AND gate implementation and the LUT implementation. We conclude that the cone ranking indeed gives depth-optimal results.

In the cone selection the best depth is not always provided at the input of a LUT, as sometimes the ‘non inverted implementation’ is chosen instead of the ‘inverted implementation’, for instance. However, after the cone selection the depth-optimal cones of the nodes on the critical path are taken. If all critical paths of the network are depth-optimal, the network is depth-optimal.


[Plot: relative number of 6-LUTs, relative number of AND gates and relative circuit depth versus depth cost (0 to 0,5).]

Figure 5.14: The results of the AND gate for the 19 VTR benchmarks

5.3.3 The NAND gate

For the NAND gate, the same principle is used: in the cone ranking two different depths are calculated. Here the roles are reversed: the NAND gate can be used for the inverted signal, while this is not the case for the non-inverted signal. The results can be found in Figure 5.15.

5.3.4 The ANDN gate

Again the same principle is used. Here, however, an extra degree of freedom is added: by swapping the inputs another Boolean function is obtained. The results when using the ANDN gate can be found in Figure 5.16.

5.3.5 Comparison candidate components

The results of the AND, NAND and ANDN gates are depicted in Figures 5.14, 5.15 and 5.16, respectively. We see a maximal depth reduction of 18%, 30% and 50%, respectively.

It seems that we should take the ANDN gate, since the depth decreases by 50% for a zero depth cost, while the resulting depths are significantly higher for the AND and NAND gates. However, the ANDN gate is a complex gate to use in the routing: the order of its inputs may not be swapped. Furthermore, if we want to use an ANDN gate as a normal multiplexer, the signal might be inverted if it arrives at the inverted input. So, by using the ANDN gate we lose many degrees of freedom in the routing. As the routing is the bottleneck of resource usage in today’s FPGAs, we believe that the negative impact of the ANDN gate will be too high to favor it.



Figure 5.15: The results of the NAND gate for the 19 VTR benchmarks


Figure 5.16: The results of the ANDN gate for the 19 VTR benchmarks


The second best case is the NAND gate: it has a depth reduction of 30% for a zero depth cost. However, the same problem occurs when we want to use the gate to pass a single signal: the signal will be inverted. Again we have to give up some degrees of freedom in that case.

An additional argument to favor the AND gate over the ANDN gate and the NAND gate is found when we explore the results for the MCNC20 benchmarks (see Figures 5.17, 5.18 and 5.19). There the depth reductions for the AND, ANDN and NAND gates are 31%, 36% and 20%, respectively, for a depth cost of zero. The results strongly depend on the considered benchmarks. For the MCNC20 benchmarks the AND gate performs better than the NAND gate and approaches the performance of the ANDN gate. Furthermore, for higher depth costs it is even better than the ANDN gate. Therefore we consider the AND gate the best gate to continue with.


Figure 5.17: The results of the AND gate for the 20 MCNC benchmarks

5.4 Conclusion

The only remaining gates are the AND gate and the 2-AIC. The 2-AIC performs better after the technology mapping. However, the delay and area of the physical 2-AIC cause more overhead than those of the AND gate. Neither component dominates the other, so we keep both, in line with the Pareto criterion.



Figure 5.18: The results of the NAND gate for the 20 MCNC benchmarks


Figure 5.19: The results of the ANDN gate for the 20 MCNC benchmarks

Chapter 6

Results Technology Mapping

6.1 Introduction

In this chapter we discuss the results of technology mapping in more detail. We do this only for the two best components: the AND gate and the 2-AIC. Before we discuss the results, this section makes a number of general remarks about the benchmarks.

In Figure 6.1 the number of LUTs is given for each of the 19 VTR benchmarks used in this chapter, for conventional mapping to 6-LUTs. The size of the benchmarks clearly varies a lot. When we compare this graph with Figure 6.2, we observe a correlation between the number of LUTs and the depth of the benchmarks when mapping to 6-LUTs. However, the variation in depth is more modest than the variation in number of LUTs. Furthermore, the run-time of the conventional mapping is depicted in Figure 6.3. Run-time is important, but it is not the main goal of this thesis. The main goal is to provide an algorithm that reduces the depth and/or area. We designed a mapping algorithm that influences the run-time only marginally.

In Figure 6.3 we can observe that the run-time grows with the number of LUTs, and that the variation in run-time is higher than the variation in number of LUTs. This suggests that the complexity of SimpleMap (the original algorithm) lies between O(N) and O(N²): in each cone enumeration or cone ranking step all nodes are visited once, but when the circuit is more complex there are more possible cones for each node. Furthermore, larger benchmarks need more, and thus slower, memory.
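As an illustration of how such a complexity estimate could be checked, the exponent can be read off as the slope of a straight-line fit on a log-log scale. The sketch below is not part of SimpleMap; the function name and the measurement values are hypothetical and only serve to show the calculation.

```python
import math


def scaling_exponent(sizes, runtimes):
    """Least-squares slope of log(runtime) versus log(size).

    A slope of 1 corresponds to O(N), a slope of 2 to O(N^2).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical measurements (number of LUTs, run-time in ms); not thesis data.
sizes = [1_000, 10_000, 100_000]
runtimes = [20.0, 632.0, 20_000.0]
exponent = scaling_exponent(sizes, runtimes)
```

For these made-up values the fitted exponent lies between 1 and 2, which is the kind of behaviour described above.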

The number of connections between LUTs in the circuit is given in Figure 6.4. It is obtained by counting the sinks (the endpoints of connections), excluding the sinks at the circuit outputs and at the flip-flop inputs, as these connections cannot be changed. The number of considered connections is a good estimate of the area, as 65% - 70% of the FPGA is covered by routing [6].
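The counting rule can be sketched as follows. The netlist representation (a dictionary from each sink node to its source nodes, plus a dictionary of node kinds) is an assumption made for this example and not the data structure used by the mapper.

```python
def count_connections(sinks, kinds):
    """Count the connections whose sink can still be changed by the mapper.

    sinks: dict mapping each sink node name to the list of its source nodes.
    kinds: dict mapping each node name to 'lut', 'output' or 'flipflop'
           (hypothetical labels).
    Connections ending in a circuit output or a flip-flop input are skipped.
    """
    fixed = {"output", "flipflop"}
    return sum(len(sources)
               for node, sources in sinks.items()
               if kinds[node] not in fixed)


# Small hypothetical netlist: two LUTs, one circuit output, one flip-flop.
example_sinks = {
    "lut1": ["a", "b"],
    "lut2": ["lut1", "c", "d"],
    "out": ["lut2"],
    "ff": ["lut1"],
}
example_kinds = {"lut1": "lut", "lut2": "lut", "out": "output", "ff": "flipflop"}
```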

In this section all absolute numbers were given for the original cases. In the following we will discuss only relative numbers, where each time a comparison is made with the numbers given


Figure 6.1: The number of LUTs for each benchmark.


Figure 6.2: The depth for each benchmark



Figure 6.3: The run-time for each benchmark


Figure 6.4: The number of connections for each benchmark


in this section. There are four different topics we will discuss: the depth, the area, the fan-in and the run-time. For each of these variables the results are discussed for both the AND gate and the 2-AIC.

6.2 Depth

6.2.1 AND gate

General results

The circuit depth strongly depends on the applied depth cost, as was discussed in Chapter 5. Therefore we cannot give the exact results of the technology mapping, as we do not know the real depth cost at this moment. For the AND gate we give an upper bound on the performance by assuming that the depth cost is equal to zero. In Figure 6.5 the relative depth reductions are given for the 19 VTR benchmarks when using a zero depth cost for an AND gate. There are significant differences: for some benchmarks the reduction exceeds 25%, for other benchmarks the depth does not change at all. The average depth reduction is 16,8%. However, our main interest is to make slow circuits faster. When we add up the depths of all benchmarks, the reduction of this sum is 18,5%. This is in fact a weighted average of the benchmarks in which the slow benchmarks are more important than the fast ones.
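The difference between the two summary numbers can be made concrete with a short sketch. The function names and the depth values are hypothetical; they are not the benchmark results, and only illustrate why the reduction of the summed depths weights deep (slow) circuits more heavily than the plain average of the per-benchmark reductions.

```python
def mean_relative_reduction(old_depths, new_depths):
    """Unweighted average of the per-benchmark relative depth reductions."""
    reductions = [(old - new) / old for old, new in zip(old_depths, new_depths)]
    return sum(reductions) / len(reductions)


def reduction_of_sum(old_depths, new_depths):
    """Relative reduction of the summed depths (deep circuits weigh more)."""
    return (sum(old_depths) - sum(new_depths)) / sum(old_depths)


# Hypothetical depths before and after mapping with AND gates.
old_depths = [4, 10, 20]
new_depths = [4, 8, 15]
```

In this made-up case the shallow circuit does not improve at all, so the unweighted average is lower than the reduction of the sum.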

Now consider Figure 6.6, where the depths are given as a function of the depth cost for the 19 VTR benchmarks and the MCNC20 benchmarks¹. Furthermore, the following case is plotted: the circuit is mapped using a depth cost of zero, and subsequently we increase the depth cost of the AND gates without changing the mapped LUT/AND network. This case is called ‘Depth MCNC20 DC = 0’ in the legend. We can observe five characteristics:

• When increasing the depth cost, the circuit depth increases. For the VTR benchmarks the reduction is no longer significant when the depth cost exceeds 0,3.

• The MCNC20 benchmarks are smaller benchmarks that were used in the literature before the VTR benchmarks. The results for these smaller benchmarks are better than for the VTR benchmarks.

• Consider the ‘Depth (MCNC20) DC = 0’ curve. It turns out that the results change significantly when the depth cost exceeds 0,2: from that point on, it is better not to use too many AND gates, and to use LUTs instead in some places. This curve shows the importance of using the depth cost as an input parameter of the technology mapping for the AND gate.

• Now we can further analyse this curve. From 0,3 to 1 on the x-axis the curve behaves linearly. The function we see is the outcome of the following formula, where $k_2$ is the number of AND gates and $k_6$ is the number of 6-LUTs on the critical path:

$$\text{length of critical path} = k_2 \cdot \text{depth cost} + k_6$$

¹ In this case we considered the geometric means of the benchmark results instead of the weighted average. The geometric mean is used more frequently in the literature. However, both approaches have drawbacks, as it is not possible to summarize all information perfectly in a single number.



Figure 6.5: The relative depth reduction per benchmark.

When varying the depth cost, the slope of the curve is thus $k_2$, the relative number of AND gates on the critical path. As the slope is 1,27, the new critical path contains on average 127% as many AND gates as there were LUTs on the critical path of the old algorithm. In addition, about 50% of the LUTs are still present.

• For depth costs below 0,3 the blue curve is not linear. This is because the critical path changes: a circuit contains many parallel paths. Consider the following example: a circuit has two parallel paths, one consisting of 5 LUTs and one consisting of 4 LUTs and 3 AND gates. If the delay of an AND gate were zero, the first path would be the critical path, with a depth of 5. If using an AND gate causes a significant delay (caused by detours in the routing, for instance), the cost of using an AND gate is higher. Assume it is 0,4. In this case the second path is the critical path, as it has a depth of 5,2. This is what happens in the non-linear part of the curve: at first, paths with only a few AND gates are critical. When the depth cost increases, other paths become critical, as the slope of their curves is higher. From a certain point on (a depth cost of about 0,3), the critical paths no longer change.
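The two-path example from the last item can be reproduced with a few lines of code. This is only a sketch: the function names and the representation of a path as a (number of LUTs, number of AND gates) pair are assumptions made for the illustration.

```python
def path_length(num_luts, num_ands, depth_cost):
    """Length of a path: every LUT counts 1, every AND gate counts depth_cost."""
    return num_luts + num_ands * depth_cost


def critical_path(paths, depth_cost):
    """Return the (num_luts, num_ands) pair of the longest path."""
    return max(paths, key=lambda p: path_length(p[0], p[1], depth_cost))


# The two parallel paths from the example: 5 LUTs versus 4 LUTs and 3 AND gates.
paths = [(5, 0), (4, 3)]

zero_cost_critical = critical_path(paths, 0.0)   # the path with 5 LUTs
high_cost_critical = critical_path(paths, 0.4)   # the path with 4 LUTs and 3 ANDs
```

Each path is a straight line in the depth cost with a slope equal to its number of AND gates, and the circuit depth is the maximum over all these lines. This is why the curve bends as long as the critical path keeps changing and becomes linear once it no longer does.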

Variance in results

The depth reduction varies significantly between benchmarks. In this section we try to find out why. Consider Figure 6.7, where the benchmark cm163a is depicted as an AIG. This is the smallest benchmark of the MCNC198 benchmarks for which it is advantageous to use AND gates. In the AIG the outputs can be found at the top and the inputs at the bottom, with AND gates in between. A solid line is a non-inverted signal and a dashed line an inverted signal. If this circuit is mapped to 6-LUTs, the resulting network is the one given in Figure 6.8. The resulting depth is 2, and 7 LUTs are used. Now consider Figure 6.9, where the circuit is mapped to AND gates and LUTs. The resulting depth is (1 + 2·depth cost). This is achieved by using six 6-LUTs and 7 AND gates.



Figure 6.6: The depth as a function of the depth cost for the VTR benchmarks and the MCNC20 benchmarks. A curve is also given for which the depth cost was not used as an input parameter.

We can make four observations when we compare the conventional mapping solution with the mapping solution for AND gates (zero depth cost). We describe the differences as if we started from the first solution and then worked towards the second. The real mapping algorithm calculates the second solution directly, but this way of discussing the results gives more insight.

• The LUT at the right is replaced by four AND gates. As the critical path has a depth of two, this does not decrease the circuit depth, but it can decrease the area, depending on the area cost of an AND gate, which is very similar to the depth cost. However, the area cost has less impact on the decisions of the depth-optimal algorithm than the depth cost: depth is always favored over area. Only when a decision has to be made that does not influence the depth does the algorithm take the area cost into account.

• LUT 36 will use LUT 22 and input pj instead of LUT 23. In the conventional solution, LUTs 36, 29 and 50 use LUT 23 as an input signal. Therefore, the depth of the circuit is two and there are three critical paths. When we can use AND gates, we can remove LUT 23 and make node 22 visible instead. If a LUT needs the information of node 23, it can use node 22 and input pj, which is the other input of node 23. Note that every LUT that uses node 23 can replace it by nodes 22 and pj, which means one additional input for this LUT. For LUT 36 this is not a problem, as it originally has only 5 inputs. In the network with AND gates, LUT 36 uses node 22 and pj, which results in a total of 6 used inputs.

• LUT 50 also uses node 23 as an input. Unfortunately LUT 50 cannot replace input node 23 by pj and 22, as it already has 6 inputs. Therefore we replace node 50 by

[AIG drawing: network structure visualized by ABC for benchmark "top" (Fri May 16 15:12:02 2014). The network contains 33 logic nodes and 0 latches.]

Figure 6.7: AIG of benchmark cm163a.


Figure 6.8: LUT network of cm163a. LUTs are marked in orange. Also the input numbers of the LUTs are provided.


Figure 6.9: Network of LUTs and AND gates for cm163a. LUTs are marked in orange, AND gates are marked in green. Also the input numbers of the LUTs are provided.


an AND gate and make both inputs of node 50 visible. Node 49 was already visible; for node 47 a LUT is added. Now this path consists of one LUT and two AND gates instead of two LUTs.

• For node 43 a similar problem occurs. Again an AND gate is added and the inputs of the AND gate are made visible by using LUTs. Note that an extra LUT is introduced here to lower the depth of the path.
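The input-count argument used for LUT 36 and LUT 50 can be sketched as a simple feasibility check. The function name is hypothetical, and so are the input names other than 22, 23 and pj: the source only states how many inputs these LUTs have, not which ones.

```python
K = 6  # number of inputs of a 6-LUT


def replace_input(lut_inputs, removed, replacements, k=K):
    """Try to replace one LUT input by the inputs of the node behind it.

    Returns the new input set if it still fits in a k-input LUT, else None.
    """
    new_inputs = (set(lut_inputs) - {removed}) | set(replacements)
    return new_inputs if len(new_inputs) <= k else None


# A LUT with 5 inputs (like LUT 36): replacing node 23 by 22 and pj fits.
lut_with_5 = ["23", "i1", "i2", "i3", "i4"]
fits = replace_input(lut_with_5, "23", ["22", "pj"])

# A LUT with 6 inputs (like LUT 50): the same replacement needs 7 inputs.
lut_with_6 = ["23", "i1", "i2", "i3", "i4", "i5"]
does_not_fit = replace_input(lut_with_6, "23", ["22", "pj"])
```

When the check fails, the alternative described above is to implement the node itself as an AND gate and to make its inputs visible.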

Conclusion: in this example a reduction in depth of several paths can be observed. Note that all critical paths have to be shortened in order to reduce the circuit depth. Concerning the area we observe both an increase (2 LUTs needed for implementing node 43) and a decrease (LUT 54 replaced by AND gates). This explains why the number of LUTs can increase or decrease when changing the depth cost.

We want to stress that all critical paths have to be shortened before the circuit depth decreases. Therefore we expect networks with many critical paths to be more difficult to improve than networks with only a few critical paths.

When comparing the performance of different benchmarks, the problem arises that AIGs are very complex for humans: realistic circuits consist of thousands of LUTs, and each LUT can contain tens of AIG nodes. So it is hard to gain good insight into this problem. Only some small examples (like Figure 6.7) can be explored. In these examples we note that good performance depends on the coincidental presence of certain combinations of opportunities in the network.

6.2.2 2-AIC

Next to the AND gate we consider the 2-AIC. Again it is difficult to give a correct picture of the results, as these strongly depend on the applied depth cost. If we take the depth cost to be zero, as we did for the AND gate, the result for the 2-AIC is always a 100% depth reduction: the technology mapping algorithm uses only 2-AICs, as they are free to use and can cover any desired combinational function. As their depth cost is zero, the depth of the whole circuit is zero.

In Figure 6.10 the reductions in depth are given for the VTR benchmarks for a depth cost of 0,2. To allow a comparison, the reductions for the AND gate are also shown for the same depth cost. The 2-AIC always performs the same as or better than the AND gate, as it can contain more functionality. This choice of depth cost results in a weighted average reduction of 8,9% for the 2-AIC. For the AND gate only a reduction of 6% remains at a depth cost of 0,2.

6.3 Area

6.3.1 AND

In this section we discuss the effects on the area. This is a difficult task, as the output of the technology mapper is all we can use. There are three types of data we can use:



Figure 6.10: The relative depth reduction per benchmark for the AIC and the AND gate. The considered depth cost is 0,2.

• First of all, we can use the number of 6-LUTs. This is the number most commonly used in the literature to estimate the area of a circuit (examples: [14, 15, 8]).

• However, LUTs cover only 30% - 35% of the area [6]. Therefore we also give the number of connections made in the circuit, as this influences the routing resource usage.

• The third variable is the number of added AND gates.

We start by giving the number of LUTs as a function of the depth cost. This is depicted in Figure 6.11. Both the VTR and MCNC20 benchmarks are plotted. For the MCNC20 benchmarks a reduction in LUTs of 8% can be noted for a low depth cost, whereas the VTR benchmarks show an increase of 2%. An increase is possible because the focus of the algorithm is to find a depth-optimal solution. If a solution exists that uses more resources (both LUTs and AND gates) but has a lower depth, the technology mapping algorithm will favor this solution.

Some other conclusions can be drawn. The area does not depend as strongly on the depth cost of the AND gate as the depth does. This can be explained as follows: the algorithm reduces the depth by using AND gates. Sometimes there is no change in LUT usage, in other cases fewer LUTs are needed, and in yet other cases extra signals have to be made visible by using LUTs (the inputs of the AND gates, for instance). An example was discussed in section 6.2.1.

Next, consider the number of AND gates in both cases: for the MCNC20 benchmarks more AND gates are used than for the VTR benchmarks. This is consistent with the fact that the MCNC20 benchmarks show a better reduction in depth and number of LUTs.

Now we have a closer look at the case of the zero depth cost for the VTR benchmarks. The increase and decrease of the number of needed components is depicted in Figure 6.12. There is



Figure 6.11: The needed LUTs and ANDs for the VTR and MCNC20 benchmarks.

always an increase in the number of AND gates compared to the original case, as there are no AND gates in the original circuit. The percentage reflects the number of added AND gates relative to the original number of 6-LUTs. The change in the number of 6-LUTs, on the other hand, can be positive or negative: sometimes AND gates can be used to reduce the area, in other cases more resources are used to reduce the depth. On average an increase of 2% is noted.

The remaining variable to explore is the number of connections in the circuit. These results are depicted in Figure 6.13. We considered the number of connections (or edges) that end in a 6-LUT and the number of connections that end in an AND gate. We did not display the number that end in an output or in the input of a flip-flop, because we cannot change this number of connections by adding new components to the architecture.

There is an important difference between a connection that ends in a LUT and one that ends in an AND gate. Consider Figure 6.14. This is the most frequently used topology (76% of the cases where an AND gate is used, see paragraph 6.4). Now the question is: do the three connections depicted in the figure have a length comparable to that of a connection between two LUTs? The AND gate will probably be implemented physically very close to the input of the last LUT. In this way these three connections have a resource consumption comparable to that of two normal connections. However, other cases are possible. Consider the case where the two LUTs at the left side are placed very close together and the third LUT is far away on the circuit. Then it can be advantageous to use an AND gate close to the first two LUTs, and the three connections will occupy the area of only one connection, as the longest part of the connection can be shared. The opposite case is possible as well: if there are not many AND gates, detours might be necessary. Normally there are more degrees of freedom in using an AND gate than in using a LUT, so the area will not exceed the equivalent of three connections. Thus, the area of the three considered connections will lie between the average area of one and of three connections. The most likely number is two, as in most cases we will place the AND gates close to the output LUT.



Figure 6.12: The decrease and increase in number of 6-LUTs and AND gates of the VTR benchmarks, depth cost zero.

The difference between connections ending in a LUT and in an AND gate is depicted in Figure 6.13 as follows: the lower bound is given by the edges that end in LUTs (in that case, the impact of the configuration of Figure 6.14 is equal to one normal connection). The upper bound is given by the sum of the edges that end in LUTs and in AND gates. So the resulting number of needed connection resources will lie somewhere between the two, probably close to the middle. If we use the midpoint of the upper and lower bound, the average number of edges increases by 3,4%.
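The bounds can be written down as a small calculation. The function names and the edge counts below are hypothetical and chosen only to show how the lower bound, the upper bound and the midpoint relate to the conventional mapping.

```python
def connection_estimate(lut_edges, and_edges):
    """Return (lower bound, upper bound, midpoint) of the connection resources.

    lower bound: only the edges that end in a LUT are counted;
    upper bound: edges ending in LUTs plus edges ending in AND gates.
    """
    lower = lut_edges
    upper = lut_edges + and_edges
    return lower, upper, (lower + upper) / 2


def relative_change(estimate, original_edges):
    """Change of an estimate relative to the conventional mapping."""
    return (estimate - original_edges) / original_edges


# Hypothetical edge counts for one benchmark; not thesis data.
original_edges = 1000
lower, upper, middle = connection_estimate(lut_edges=980, and_edges=120)
```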

6.3.2 2-AIC

For the 2-AIC we only provide some basic results. In Figure 6.15 the numbers of used 6-LUTs and 2-AICs are depicted for both the VTR and MCNC20 benchmarks as a function of the depth cost. Again there are significant differences: if we implemented the MCNC20 benchmarks using only 2-AICs, we would need 3,5 times as many 2-AICs as the original number of 6-LUTs (top-left point, where only 2-AICs are used as their depth cost is zero). If we did the same for the VTR benchmarks, 1,5 times the original number of 6-LUTs would be needed. On the other hand, for relatively high depth costs the MCNC20 benchmarks perform better than the VTR benchmarks, as the number of LUTs is significantly lower.

6.4 Fan-in LUTs

In this section we discuss the size of the subnetworks of AND gates that are present in the LUT/AND networks. To this end we look at the fan-in of the LUTs. In Figure 6.16 a histogram is displayed for the MCNC198 benchmarks: for each LUT input the number of LUTs connected to that single input is given. The most common case is that one LUT is connected directly to the input of the other LUT. If there are AND gates, multiple LUTs are indirectly connected to the input. An example network of LUTs and AND gates is given in



Figure 6.13: The number of connections that end in a 6-LUT or an AND gate. These results are compared with the number of connections for the conventional mapping.

Figure 6.14: In 76 % of the cases an AND gate is implemented between three LUTs.


Figure 6.15: The number of 6-LUTs and 2-AICs as a function of the depth cost for the VTR benchmarks and MCNC20 benchmarks.


Figure 6.17. Here we consider the case where LUTs need signals of the network of AND gates. The considered input of the first LUT has a fan-in of three: the input is connected to an AND gate, which in turn is connected to a LUT and to another AND gate with another two LUTs. In the same way the second and third considered inputs have a fan-in of four each. Note that their depths are not equal, so the fan-in of an input is not strictly related to its depth. The fourth considered input has a fan-in of two. Also note that if we count the fan-ins, several parts of subnetworks will be counted twice: the subnetwork of the fourth LUT is covered by the subnetwork of the third LUT, for instance.
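The fan-in of a LUT input, as used in this section, can be computed by walking back through the AND gates until LUTs (or other non-AND sources) are reached. The sketch below assumes a hypothetical representation in which every AND gate is a dictionary entry listing its two drivers; the small network is loosely modelled on the first input of the example and is not a copy of Figure 6.17.

```python
def fan_in_sources(driver, and_gates):
    """Set of LUTs (or other non-AND sources) that reach `driver` through ANDs.

    and_gates: dict mapping an AND gate name to its (left, right) drivers.
    Anything that is not a key of and_gates is treated as a source.
    """
    if driver not in and_gates:
        return {driver}
    left, right = and_gates[driver]
    return fan_in_sources(left, and_gates) | fan_in_sources(right, and_gates)


def fan_in(driver, and_gates):
    """Number of sources indirectly connected to a LUT input."""
    return len(fan_in_sources(driver, and_gates))


# An AND gate fed by one LUT and by a second AND gate with two more LUTs.
example_and_gates = {
    "and1": ("lutA", "and2"),
    "and2": ("lutB", "lutC"),
}
```

An input driven directly by a LUT has a fan-in of one under this definition, and the fan-in of "and2" is contained in that of "and1", which is the double counting mentioned above.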

In Figure 6.16 we see the numbers of occurring fan-ins for the MCNC198 benchmarks. The x-axis represents the fan-in. The y-axis shows for how many inputs this fan-in occurred. Note that the scale is logarithmic. In 88% of the cases, the input is directly connected to another LUT, a circuit input or a flip-flop output. The remaining 12% consists for 76% of a fan-in of two (this is the configuration of Figure 6.14). Furthermore, a fan-in of three occurs in 12% of the cases, while a fan-in of four is present for 5% of the inputs. So we can conclude that the fan-in is relatively low in most cases. However, exceptional cases of more than 200 indirectly connected LUTs were found. Such an exceptional case can be seen as an opportunity to drastically reduce the number of LUTs and the depth, but on the other hand it might be hard to implement such an extended AND gate subnetwork in the routing. If it turns out to be too difficult to route large AND gate subnetworks, one could limit the size of AND gate subnetworks in the technology mapping.

In Figure 6.18 the results are plotted for the 2-AIC (depth cost of 0,2). In general similar characteristics are observed, but there is a difference: more AICs are used and relatively high fan-ins occur more often. Only in 77% of the cases is there a single direct fan-in. The remaining 33% consists for 42% of the configuration with a fan-in of two. A fan-in of three is noted in 26% of the cases where an AIC is used, and a fan-in of four accounts for 8% of the cases. Beyond four the numbers decrease significantly, but not as strongly as for the AND gate. Exceptions with fan-ins of more than 500 were found. These results are not really a surprise, as an AIC can contain more functionality than an AND gate. Again, high fan-ins can be limited by some modifications to the technology mapping algorithm.

6.5 Run-time

The run-time of our new technology mapping algorithm should not increase significantly compared to the original technology mapping algorithm. In Figure 6.19 a breakdown of the run-time of the SimpleMap algorithm can be found: the most time-consuming part of the mapping is the cone enumeration, and the most time-consuming part of the cone enumeration is finding all feasible cones. To the feasible cones we only add a small modification: we mark the cones that represent a specific 2-input component. Furthermore, we calculate the depths twice for the non-configurable components. For this reason, the run-time of both the cone ranking and the area recovery will double. However, as the cone enumeration is by far the most time-consuming step, this modification adds only a small contribution to the run-time.

Some basic run-time experiments have been executed. The experiments showed an average



Figure 6.16: Fan-in of inputs of LUTs for the AND gate

Figure 6.17: Example network fan-in of inputs of LUTs



Figure 6.18: Example fan-in of inputs of LUTs for the AIC

increase (geometric mean) in run-time of about 8,6%. As the circumstances of the experiments were not ideal (only one iteration, executed on different computers, ...), we can only draw the following conclusion: the run-time possibly increases slightly, but we can assume that there is no severe increase. Accurate run-time experiments with several iterations under ideal circumstances are not provided, as this is not the main goal of this thesis.
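For reference, a geometric-mean increase of this kind is obtained from the per-benchmark run-time ratios. The sketch below uses the standard-library function statistics.geometric_mean (Python 3.8 or later); the ratios are hypothetical and are not the measured values.

```python
from statistics import geometric_mean


def mean_runtime_increase(old_runtimes, new_runtimes):
    """Geometric mean of the run-time ratios, expressed as a relative increase."""
    ratios = [new / old for old, new in zip(old_runtimes, new_runtimes)]
    return geometric_mean(ratios) - 1.0


# Hypothetical run-times in seconds (original mapper, modified mapper).
old_runtimes = [10.0, 100.0, 1000.0]
new_runtimes = [10.0, 110.0, 1210.0]
increase = mean_runtime_increase(old_runtimes, new_runtimes)
```

Because the ratios are multiplied rather than added, a single very slow benchmark does not dominate the result.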

[Pie chart segments: cone enumeration, reading AAG, cone ranking, cone selection, area recovery]

Figure 6.19: The run-time of mapping the VTR benchmarks with SimpleMap.

Chapter 7

Conclusion and future work

In the previous chapters results were given for different aspects of implementing 2-input components in an FPGA. In this chapter we combine these results and draw a conclusion about the final performance. First we discuss the delay, then the area. After those two sections we deal with the area-delay product, which is a conventional performance measure. Then we give some additional ideas to further improve the results, and finally a conclusion is given.

7.1 Delay AND gate

We do not want to discuss the depth, but the delay in this section. Our real goal is to reducethe FPGA delay. The depth is only a variable to easily estimate the delay during technologymapping. So in this section we have to answer the following question: ‘What will be theimplementation delay if we use the new architecture?’.

To answer it we need two results: the delay of the new hardware components and the expected implementation delay of the technology mapping. According to the results discussed in paragraph 3.1.10, no increase in hardware delay is expected. Thus, we can estimate the delay as the depth of the technology mapping. Note that the reduction in depth depends on the depth cost and that this depth cost is not known at this moment. Therefore we can only give the final delay as a function of the depth cost. The choice of benchmarks is important: the results for the MCNC20 benchmarks were more promising than for the 19 VTR benchmarks. We want to estimate the performance of any circuit that one would like to implement on an FPGA. Therefore we take the weighted average reduction of the 19 VTR and MCNC20 benchmarks as a function of the depth cost. The resulting graph is depicted in Figure 7.1. By taking the weighted average, the VTR benchmarks have more impact, as these are larger benchmarks. The maximal depth reduction is 20,4% for this combination of benchmarks while using AND gates.
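The weighting step can be illustrated with a small Python sketch. It assumes that each benchmark is weighted by a size measure such as its number of LUTs; the text only states that larger benchmarks have more impact, so the exact weight is an assumption. The function name and the two example benchmarks are hypothetical and do not correspond to measured results.

```python
from typing import Sequence


def weighted_average(values: Sequence[float], weights: Sequence[float]) -> float:
    """Average of `values`, each one weighted by the matching entry of `weights`."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("the sum of the weights must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


if __name__ == "__main__":
    # Hypothetical data: relative depth after mapping (1.0 = no reduction)
    # and circuit size used as weight (assumption: size in LUTs).
    relative_depth = [0.90, 0.70]   # large benchmark, small benchmark
    size_in_luts = [9000, 1000]
    print(f"weighted relative depth: {weighted_average(relative_depth, size_in_luts):.3f}")
```

With these made-up numbers the large benchmark dominates: the weighted result lies much closer to 0.90 than the plain mean of 0.80 would.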


Figure 7.1: The weighted average depth, number of AND gates and number of LUTs of the 19 VTR and MCNC20 benchmarks.

7.2 Area AND gate

The impact on the area can be estimated by multiplying the total resource usage by the overhead per resource (i.e. multiplexers). The resource usage was discussed in section 6.3. In section 3.1 the overhead of the new multiplexing devices was discussed.

According to paragraph 3.1.10 there is an increase of 19 % in area if we replace all multiplexers in the FPGA. However, we do not have to replace all multiplexers of the FPGA. By replacing only a percentage p of all multiplexers, the area overhead can be reduced. On the other hand, by reducing p, the number of degrees of freedom decreases in the place and route step. Because of this, more detours will have to be made and the average depth cost will increase. Thus the parameter p and the depth cost are inversely related. An appropriate trade-off will be necessary. However, this can only be made by implementing the packing, place and route algorithms.

The number of LUTs does not change significantly (see Figure 7.1). The number of connections does not change much either: an increase of 3,4 % was observed for the VTR benchmarks in section 6.3 (note that this result assumes the presence of a reasonable number of AND gates in the circuit). The MCNC20 benchmarks need fewer resources, so when we take the weighted average, the increase in connection usage is negligible. We still have to consider the area impact of the AND gates. However, this is indirectly included in both the hardware area overhead and the number of connections: the hardware results include the number of physical AND gates as a function of p, and the number of connections indirectly includes the extra resources needed to implement the AND gates. Looking at the number of connections and the number of LUTs, we can conclude that the resource usage does not change significantly.

Conclusion: there is no significant change in LUT or routing resource usage for any depth cost.


However, there is an area overhead for each AND gate, which depends on the percentage p of replaced multiplexers. We now make the following assumption: the added area of replacing a multiplexer is equal for all multiplexers. This is not entirely correct, as was discussed in section 3.1.9, but we accept this assumption in order to provide a first-order model. We can then assume that the added area varies linearly with p, as shown in Figure 7.2.

7.3 Area delay product

Hardware designers expect different kinds of FPGAs where certain aspects (such as area, delay or power) are emphasized. However, the most common performance measure is the area delay product [16]. The delay as a function of the depth cost is known, and the area as a function of the parameter p is known. In Figure 7.3 we combined both results in one graph. The x-axis displays the depth cost and the y-axis the parameter p. The colour represents the normalized increase or decrease in area delay product compared to the state-of-the-art architecture, drawn with contour lines: every line connects points with an equal area delay product. The black line is the state-of-the-art performance. Left of this line the area delay product decreases, while to the right of this line it increases. We can use this graph to explore the solution space: the depth cost and the parameter p are related to each other (they are inversely related). If one finds several combinations of p and depth cost (by implementing pack, place and route algorithms), the best combination can be determined with this graph. However, we must take into account that this graph is only an estimate and contains several first-order approximations.
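The way the two results are combined can be sketched in a few lines of Python. The sketch assumes the first-order linear area model of section 7.2 (19 % overhead when all multiplexers are replaced) and takes the relative delay as an input that would be read from the depth curve of Figure 7.1. The function names are chosen for illustration only, and the example point is hypothetical: it pairs p = 1 with the maximal depth reduction of 20,4 %, which is optimistic because that reduction requires the lowest depth cost.

```python
AREA_OVERHEAD_FULL = 0.19  # area increase when all multiplexers are replaced (section 7.2)


def relative_area(p: float) -> float:
    """First-order area model: the overhead grows linearly with the replaced fraction p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return 1.0 + AREA_OVERHEAD_FULL * p


def adp_change(p: float, relative_delay: float) -> float:
    """Normalized change in area delay product versus the baseline (negative is better)."""
    return relative_area(p) * relative_delay - 1.0


if __name__ == "__main__":
    # Hypothetical, optimistic point: all multiplexers replaced and the
    # maximal depth reduction of 20,4 % (relative delay 0.796).
    print(f"change in area delay product: {adp_change(1.0, 1.0 - 0.204):+.3f}")
```

A negative value corresponds to the region left of the black line in Figure 7.3, a positive value to the region on its right.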

Now we provide a lower bound for the depth cost of an AND gate (this lower bound is valid for any 2-input component). Consider the situation where several AND gates are used in multiplexers that are placed directly behind each other (see Figure 7.4). If we have to implement many AND gates after each other (and it is not possible to use parallelism), this is the best-case scenario because no detours are considered. In this case the depth cost is the delay of one hop. In COFFE the impact of one hop is considered to be 27 % of the total added delay (routing + LUT delay) when using a LUT (see footnote 1). Therefore consider

1 The total added delay per used LUT is 496 ps (routing to the LUT included). The delay of one switch box

[Plot: area overhead (in %, axis from 0 to 20) as a function of p (axis from 0 to 1).]

Figure 7.2: The area overhead as a function of the percentage of replaced multiplexers if we assume that all multiplexers cause the same overhead.


[Contour plot: depth cost on the x-axis (0 to 0.8), p on the y-axis (0 to 0.9); the colour scale gives the performance change from −0.15 to 0.15; the marked curves are "Original" and "Lower bound".]

Figure 7.3: The area delay product as a function of the depth cost and the percentage p of multiplexers augmented with an AND gate.

the vertical line in Figure 7.3. This is the lower bound for the depth cost, as we considered no detours. In reality detours will have to be made, so the depth cost will be higher. As the solution space gets narrower (only the region between the black and blue curves remains), the maximal gain for the objective function is about 5 %, and we know that this maximum is still optimistic. Furthermore, the effort of implementing extra components in an architecture is very high, as there will be significant changes in the hardware. So at this point the idea does not seem feasible from an economic point of view.
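The value of this lower bound follows directly from the two COFFE delays quoted in footnote 1 (496 ps per used LUT including routing, 136 ps per switch box hop). A minimal Python sketch of that arithmetic, with function and constant names chosen here for illustration:

```python
# Delays quoted in footnote 1 (COFFE results).
LUT_STAGE_DELAY_PS = 496.0  # total added delay per used LUT, routing included
HOP_DELAY_PS = 136.0        # delay of one switch box hop


def depth_cost_lower_bound(hop_delay_ps: float, lut_stage_delay_ps: float) -> float:
    """Depth cost of a single hop, expressed as a fraction of one LUT stage."""
    if lut_stage_delay_ps <= 0:
        raise ValueError("the LUT stage delay must be positive")
    return hop_delay_ps / lut_stage_delay_ps


if __name__ == "__main__":
    bound = depth_cost_lower_bound(HOP_DELAY_PS, LUT_STAGE_DELAY_PS)
    print(f"lower bound on the depth cost: {bound:.2f}")  # prints 0.27
```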

7.4 Future work

Fortunately there are some other ways to deal with this problem. In the following six subsections some proposals are given to enhance the economic value of the idea.

7.4.1 Considering another 2-input component

In chapter 4 we showed that there were only two components whose benefits could be significantly higher than those of the other components: the AND gate and the 2-AIC. Therefore the only remaining component to discuss in this subsection is the AIC. If we do the same analysis as for the AND gate, we obtain the graph of Figure 7.5, where the vertical line is the

hop (which has to be used to be able to place several AND gates after each other) is 136 ps. Thus the impact of one hop is 27 % of the total delay of using a LUT. This is relatively high because COFFE considers the critical path: LUTs on the critical path are placed close together, so only a few hops are needed. Therefore the total added delay of adding LUTs on the critical path is short, and the relative impact of an extra hop is high.


Figure 7.4: Several AND gates placed behind each other without detours.

same lower bound as for the AND gate. Unfortunately the solution space is again very narrow, though not empty.

7.4.2 Other definition of the depth cost

Until now we represented the depth cost as the cost of using an AND gate without taking into account the network surrounding the AND gate. However, this network influences the impact of the added AND gate on the critical path length. Consider the lower bound of section 7.3, where we considered many AND gates in series. Under this condition the lower bound for the depth cost was 0,27. Now consider the configuration of an AND gate between three LUTs, as depicted in Figure 7.6. Suppose that in the original situation LUT 1 and LUT 2 have to be connected to LUT 3. Then the depth is determined by the longest connection, which is the signal from LUT 1 to LUT 3. Now suppose that we add an AND gate between LUTs 1, 2 and 3. The depth cost is now defined as the delay added to the network. The delay is equal to the longest connection, which is still the connection from LUT 1 to LUT 3, and this connection still has the same delay. Thus, the depth cost of the implemented AND gate is zero in this case. However, it will not always be zero, as the added delay strongly depends on the locations of the implemented LUTs. Therefore the placement will be an important step in the tool flow.

Conclusion: the average depth cost of placing an AND gate between three LUTs may be smaller than that of AND gates in series. Therefore a technology mapping algorithm could be designed that only implements this type of configuration. The gain of the technology mapping will be lower in that case, but the overhead in the routing will be lower as well. If this option were chosen, the AND gate would perform better than the 2-AIC, as the inversion of the inputs is not necessary: the inputs of the AND gate are always LUTs, so the output inversion of the LUT can be chosen to obtain a logically equivalent network.
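This alternative definition can be illustrated with a small Python sketch. It models the depth cost as the delay added at the input of LUT 3, normalized by the 496 ps per LUT stage from footnote 1. The function name, the parameter names and all path delays in the example are hypothetical and only serve to show when the cost becomes zero.

```python
LUT_STAGE_DELAY_PS = 496.0  # total added delay per used LUT (footnote 1)


def depth_cost(lut1_to_and_ps: float, lut2_to_and_ps: float, and_to_lut3_ps: float,
               lut1_to_lut3_ps: float, lut2_to_lut3_ps: float) -> float:
    """Added delay caused by the AND gate, as a fraction of one LUT stage.

    Before: LUT 1 and LUT 2 are routed directly to LUT 3.
    After:  both signals are routed to the AND gate, whose output goes to LUT 3.
    """
    before = max(lut1_to_lut3_ps, lut2_to_lut3_ps)
    after = max(lut1_to_and_ps, lut2_to_and_ps) + and_to_lut3_ps
    return (after - before) / LUT_STAGE_DELAY_PS


if __name__ == "__main__":
    # Hypothetical case 1: the AND gate lies on the existing LUT 1 -> LUT 3 route
    # (264 ps + 136 ps = 400 ps) and LUT 2 reaches it earlier: no added delay.
    print(depth_cost(264.0, 200.0, 136.0, 400.0, 250.0))
    # Hypothetical case 2: LUT 2 needs a detour to reach the AND gate,
    # so its path becomes the longest one and the cost is positive.
    print(depth_cost(264.0, 400.0, 136.0, 400.0, 250.0))
```

In the first case the longest connection is unchanged and the depth cost is zero; in the second case the detour of LUT 2 adds one hop of delay, which is why the placement matters.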


[Contour plot: depth cost on the x-axis (0 to 0.8), p on the y-axis (0 to 0.9); the colour scale gives the performance change from −0.8 to 0.4; the marked curves are "Original" and "Lower bound".]

Figure 7.5: The area delay product in function of p and the depth cost for the AIC architecture.

Figure 7.6: An example of a zero depth cost.


7.4.3 Other objective function

The area delay product is considered the conventional performance measure. It reduces the cost per #calculations/s: the cost is proportional to the area, and the speed is inversely proportional to the delay. However, this assumes that the considered application can use unlimited parallelism. In that case the customer who wants to implement the application wants to minimize cost(#calculations/s). However, if one cannot use unlimited parallelism and there are strict timing constraints, the delay becomes more important. So our FPGA can be of higher benefit for applications that have to be fast, where there is limited parallelism (but there has to be some parallelism, otherwise a processor is a better solution) and where an altered cost is not a problem (but if cost is no problem at all, the ASIC is still better than the FPGA). So by exploring the economic needs for combinations of variables, one could try to find a specific FPGA for which this architecture could enhance the performance even more.
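The link between this cost measure and the area delay product can be written out explicitly. In the short derivation below, C (cost), A (area), D (delay) and T (throughput in calculations per second) are symbols introduced here for illustration, and unlimited parallelism is assumed as in the text:

```latex
\[
C \propto A, \qquad T \propto \frac{1}{D}
\quad\Longrightarrow\quad
\frac{C}{T} \propto A \cdot D
\]
```

Minimizing the cost per unit of throughput is therefore equivalent to minimizing the area delay product, but only as long as throughput can be bought with extra area.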

7.4.4 Improvement of the resulting graphs

The graphs of Figures 7.3 and 7.5 are based on results of COFFE and technology mapping. Three different things can be done to improve or check these results:

• COFFE is only a first-order approach, so the reality might be better than our results. There were still some inconsistent results, so further exploring this area might give more insight.

• The technology mapping is depth-optimal, but not area-optimal. Maybe some additional area improvements are possible. However, this will be difficult, as we added only rather small modifications to the state-of-the-art technology mapping algorithm. So to improve the technology mapping algorithm, one has to find a solution better than the state-of-the-art technology mapping algorithm.

• In order to make the graph of Figure 7.2, the assumption was made that every multiplexer causes the same area overhead. This is only an approximation: smart choices can be made, as discussed in paragraph 3.1.8. So the real performance is slightly better than displayed on the graph.

7.4.5 Implementing AND gates in the connection boxes only

In section 3.1 we discussed the simulation results for the replaced multiplexers in the switch boxes. However, implementing the AND gates in the connection boxes might be even more favorable: an advantage of the new multiplexing device is the decreased input impedance. The disadvantage is the increased number of buffers, as we need an even number of buffers after the NAND gate to maintain the correct functionality. In the connection box the output loads are lower, so it might be possible that no extra buffers are needed after the AND gate. Also, the buffers of the switch box multiplexers that connect their outputs to the inputs of the connection boxes can be made smaller and faster, as the parasitic loads caused by the connection box multiplexers are lower. So implementing the AND gates in the connection boxes could be interesting to consider.


7.4.6 Implementing pack, place and route algorithms

The only way to gain accurate insight into the depth cost and the parameter p is by implementing the pack, place and route algorithms. Then feasible points can be placed on the graph of Figure 7.3. After this step it can be decided what the easiest way is to further improve the results: changing the parameter p and the depth cost, or further improving the algorithms used.

7.5 Conclusion

We reformulate the research question of this thesis:

‘What are the best 2-input components to add into the routing of an FPGA architecture? Is the performance enhanced compared with the state-of-the-art FPGAs?’

The best way to answer these questions in a period of one year was to design a suitable technology mapping algorithm. The results turned out to be promising. More information was obtained by simulating the new components in the hardware (executed by three students for their ‘Hardware Design Project’ [11]) with the help of COFFE.

We considered all possible 2-input components as far as we know. We concluded that the AND gate and the AIC would probably be the best components to use, and formulated several arguments for this in chapter 4. Further, we succeeded in designing several depth-optimal algorithms with area recovery. For these algorithms we needed the depth cost of the components, which is impossible to estimate at this moment. In the best-case scenario the resulting circuit depths could be reduced by 27 % (geometric mean for the VTR and MCNC20 benchmarks) while using the AND gates. The LUT and routing resource usage did not change significantly. Furthermore, the run-time of the technology mapping increased only slightly.

In section 3.1 we showed that the half connected configuration of the multiplexing devices would be the best one to use. Furthermore, some more explanation was given about the impact of the number of inputs of the used multiplexers. If all switch block multiplexers were replaced by new multiplexers with added AND gates, the FPGA area would increase by about 11% while the delay would not change significantly.

When we combine all these results, we get the graph of Figure 7.3 for the AND gate implementation and Figure 7.5 for the AIC implementation. It turns out that the maximal increase in performance is about 5% for both. This is only a minor improvement; however, several solutions are possible and we leave these for future work. It is difficult to estimate the real performance at this moment, as we do not know the depth cost and the parameter p. The only way to estimate the performance accurately is by implementing the next steps of the tool flow.

Bibliography

[1] D. Technologies, “Investment opportunity.” www.dynamizetech.com/investors.html, 2012.

[2] U. Farooq, Z. Marrakchi, and H. Mehrez, Tree-based Heterogeneous FPGA Architecture, pp. 7–48. Springer, 2012.

[3] B. Kirk, “Hybrid process converts FPGAs to structured ASICs.” http://www.eetimes.com/document.asp?doc_id=1148532, 2004.

[4] G. Lemieux, E. Lee, M. Tom, and A. Yu, “Directional and single-driver wires in FPGA interconnect,” in Field-Programmable Technology, 2004. Proceedings. 2004 IEEE International Conference on, pp. 41–48, Dec 2004.

[5] M. Ayodhyawasi and K. Digari, “Interconnect structure and method in programmable devices,” July 6 2010. US Patent 7,750,673.

[6] C. Chiasson and V. Betz, “COFFE: Fully-automated transistor sizing for FPGAs,” in Field-Programmable Technology (FPT), 2013 International Conference on, pp. 34–41, Dec 2013.

[7] L. McMurchie and C. Ebeling, “Pathfinder: A negotiation-based performance-driven router for FPGAs,” in Field-Programmable Gate Arrays, 1995. FPGA ’95. Proceedings of the Third International ACM Symposium on, pp. 111–117, 1995.

[8] D. Chen and J. Cong, “DAOmap: A depth-optimal area optimization mapping algorithm for FPGA designs,” in Proceedings of the 2004 IEEE/ACM International Conference on Computer-aided Design, ICCAD ’04, (Washington, DC, USA), pp. 752–759, IEEE Computer Society, 2004.

[9] A. Mishchenko, S. Chatterjee, and R. K. Brayton, “Improvements to technology mapping for LUT-based FPGAs,” in FPGA’06, pp. 41–49, 2006.

[10] A. Farrahi and M. Sarrafzadeh, “Complexity of the lookup-table minimization problem for FPGA technology mapping,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 13, pp. 1319–1332, Nov 1994.

[11] P. De Vloed, Y. Laoureux, D. Vercruyce, E. Vansteenkiste, and D. Stroobandt, “Designing drivers for new FPGA architectures.” Hardware Design Project at the University of Ghent, May 2014.

[12] N. Vemuri, P. Kalla, and R. Tessier, “BDD-based logic synthesis for LUT-based FPGAs,” ACM Trans. Des. Autom. Electron. Syst., vol. 7, pp. 501–525, Oct. 2002.


[13] E. Vansteenkiste, B. Al Farisi, K. Bruneel, and D. Stroobandt, “Tpar: Place and route tools for the dynamic reconfiguration of the FPGA’s interconnect network,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 33, pp. 370–383, March 2014.

[14] J. Cong and Y. Ding, “On area/depth trade-off in LUT-based FPGA technology mapping,” in Proceedings of the 30th International Design Automation Conference, DAC ’93, (New York, NY, USA), pp. 213–218, ACM, 1993.

[15] J. Cong and Y. Ding, “Flowmap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 13, pp. 1–12, Jan 1994.

[16] I. Kuon and J. Rose, “Area and delay trade-offs in the circuit and architecture design of FPGAs,” in Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, FPGA ’08, (New York, NY, USA), pp. 149–158, ACM, 2008.

List of Figures

1.1 FPGA Sales Growth [1]

2.1 FPGA architecture [2]

2.2 FPGA - ASIC comparison [3]

2.3 A BLE with a 4-input LUT and a flipflop [2]

2.4 The routing structure of the island-style FPGA [2]

2.5 The left figure is the Wilton switch block [5]. The second one represents a connection block [2]. The third picture shows the concept of unidirectionality [2].

2.6 The multiplexer on transistor level

2.7 An example of an AIG

2.8 An example of a network of LUTs

2.9 The packing algorithm [2]

2.10 Minimizing distances in the placement algorithm [2]

2.11 An example of a routed circuit [2]

2.12 The example input of the technology mapping algorithm

2.13 The cone enumeration, example for node 6

2.14 The depths of the nodes (cone ranking) and the chosen nodes (cone selection)

3.1 Each multiplexer is replaced by two multiplexers with a 2-input component.

3.2 The state-of-the-art 10-input multiplexer.

3.3 The state-of-the-art 10-input multiplexer on transistor level.

3.4 Schematic fully connected AND gate multiplexer

3.5 The simulated fully connected AND gate multiplexer

3.6 The multiplexer with implemented AND gate where the inputs are not shared.


3.7 The configuration of the and-inverter-cone.

3.8 Pareto comparison for the half and fully connected configuration

3.9 Comparison between the half and fully connected configuration for equal N

3.10 Number of output loads in a switch box for each arriving signal.

3.11 Optimization for the 6-input multiplexer part of the half connected multiplexer.

3.12 Absolute comparison between the needed number of SRAM cells for the original and the new multiplexer.

3.13 Relative comparison between the needed number of SRAM cells for the original and the new multiplexer.

3.14 An example of a tile of an FPGA.

3.15 Absolute comparison between the needed number of SRAM cells for the original and the new multiplexer.

3.16 Routing example with AND gates.

4.1 In this example it is clear that each signal is inverted twice when all sinks and sources are inverted. Conclusion: inverting the inputs and output of the added component yields an equivalent component.

4.2 All possible 2-input boolean functions

4.3 The first component is a 2-LUT, the second one is a 2-AIC. The third and fourth components are respectively the (N)AND and (N)ANDN gates.

5.1 The cone enumeration for a maximum of 3 inputs, example for node 6

5.2 Two signal paths in a network for depth comparison.

5.3 The principle of cone ranking when using both 2-LUTs and 3-LUTs. The depth costs are respectively 0,6 and 1.

5.4 The cone selection. Compared with Figure 2.14 we see that the resulting depth is 2,2 instead of 3. Furthermore three 2-LUTs and three 3-LUTs are used. This is cheaper than using six 3-LUTs, as 2-LUTs are smaller than 3-LUTs.

5.5 The results of combining 2-LUTs and 6-LUTs in DAOmap for the VTR benchmarks.

5.6 The results for the 2-AIC compared with the 2-LUT for the VTR benchmarks.

5.7 The X(N)OR gate

5.8 An example of a cone representing a XOR function (a) and an example of an optimization (b)

5.9 All boolean 2-input functions.


5.10 The results of the X(N)OR gate for the VTR benchmarks.

5.11 An example of a numbered AIG.

5.12 A representation of the cone ranking when using 3-LUTs and AND gates.

5.13 A representation of the cone selection when using 3-LUTs and AND gates. In this case only the node at the bottom is represented by using an AND gate.

5.14 The results of the AND gate for the 19 VTR benchmarks

5.15 The results of the NAND gate for the 19 VTR benchmarks

5.16 The results of the ANDN gate for the 19 VTR benchmarks

5.17 The results of the ANDN gate for the 20 MCNC benchmarks

6.1 The number of LUTs for each benchmark.

6.2 The depth for each benchmark

6.3 The run-time for each benchmark

6.4 The number of connections for each benchmark

6.5 The relative depth reduction per benchmark.

6.6 The depth as a function of the depth cost for the VTR benchmarks and the MCNC20 benchmarks. Also a result is given where we did not use the input parameter of the depth cost.

6.7 AIG of benchmark cm163a.

6.8 LUT network of cm163a. LUTs are marked in orange. Also the input numbers of the LUTs are provided.

6.9 Network of LUTs and AND gates for cm163a. LUTs are marked in orange, AND gates are marked in green. Also the input numbers of the LUTs are provided.

6.10 The relative depth reduction per benchmark for the AIC and the AND gate. The considered depth cost is 0,2.

6.11 The needed LUTs and ANDs for the VTR and MCNC20 benchmarks.

6.12 The decrease and increase in number of 6-LUTs and AND gates of the VTR benchmarks, depth cost zero.

6.13 The number of connections that ends in a 6-LUT or an AND gate. These results are compared with the number of connections for the conventional mapping.

6.14 In 76 % of the cases an AND gate is implemented between three LUTs.


6.15 The number of 6-LUTs and 2-AICs as a function of the depth cost for the VTR benchmarks and MCNC20 benchmarks.

6.16 Fan-in of inputs of LUTs for the AND gate

6.17 Example network fan-in of inputs of LUTs

6.18 Example fan-in of inputs of LUTs for the AIC

6.19 The run-time of mapping the VTR benchmarks with SimpleMap.

7.1 The weighted average depth, number of AND gates and number of LUTs of the 19 VTR and MCNC20 benchmarks.

7.2 The area overhead as a function of the percentage of replaced multiplexers if we assume that all multiplexers cause the same overhead.

7.3 The area delay product as a function of the depth cost and the percentage p of multiplexers augmented with an AND gate.

7.4 Several AND gates placed behind each other without detours.

7.5 The area delay product as a function of p and the depth cost for the AIC architecture.

7.6 An example of a zero depth cost.
