
Body bias aware digital design: a design strategy for area- and performance-efficient CMOS integrated circuits

Citation for published version (APA):
Meijer, R. I. M. P. (2011). Body bias aware digital design: a design strategy for area- and performance-efficient CMOS integrated circuits. Technische Universiteit Eindhoven. https://doi.org/10.6100/IR719493

DOI: 10.6100/IR719493

Document status and date: Published: 01/01/2011

Document Version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)



https://research.tue.nl/en/publications/body-bias-aware-digital-design--a-design-strategy-for-area-and-performanceefficient-cmos-integrated-circuits(ee8aa4cf-ced6-474e-8a6a-a4a36e074f5a).html

Body Bias Aware Digital Design

A Design Strategy for Area- and

Performance-Efficient CMOS Integrated Circuits

DISSERTATION

to obtain the degree of doctor at Eindhoven University of Technology (Technische Universiteit Eindhoven), on the authority of the rector magnificus, prof.dr.ir. C.J. van Duijn, to be defended in public before a committee appointed by the Doctorate Board (College voor Promoties)

on Wednesday 7 December 2011 at 16:00

by

Rinze Ida Mechtildis Peter Meijer

born in Montfort


This dissertation has been approved by the promoters:
prof.dr. J. Pineda de Gyvez
and
prof.dr.ir. R.H.J.M. Otten

A catalogue record is available from the Eindhoven University of Technology Library

Meijer, Rinze Ida Mechtildis Peter
Body Bias Aware Digital Design / by Rinze Ida Mechtildis Peter Meijer.
Eindhoven: Technische Universiteit Eindhoven, 2011.
ISBN: 978-90-386-2920-9
NUR: 959
Trefw.: digitale CMOS schakelingen / logische synthese / circuit optimalisatie / circuit tuning
Subject headings: digital CMOS circuits / logic synthesis / circuit optimization / circuit tuning


Members of the dissertation committee:

prof.dr. J. Pineda de Gyvez, Eindhoven University of Technology and NXP Semiconductors (first promoter)
prof.dr.ir. R.H.J.M. Otten, Eindhoven University of Technology (second promoter)
prof.dr. H. Corporaal, Eindhoven University of Technology
prof.dr. Y. Leblebici, École Polytechnique Fédérale de Lausanne
prof.dr.ir. N.P. van der Meijs, Delft University of Technology
prof.dr.ir. G.J.M. Smit, University of Twente
prof.dr.ir. A.C.P.M. Backx, Eindhoven University of Technology (chairman)

The work described in this thesis was carried out at Philips Research Laboratories, Eindhoven, The Netherlands, and NXP Semiconductors, Eindhoven, The Netherlands.

Copyright © 2011 by R.I.M.P. Meijer

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author.

Cover design by M.J. Meijer, inspired by M. Heijsen
Printing by Printservice, Eindhoven University of Technology


Summary

Modern integrated circuits have become sensitive to variations in process parameters, supply voltage and temperature. Digital designers account for such variability during the design phase of the integrated circuit. They follow a worst-case design approach to guarantee chip operation across all Process-Voltage-Temperature (PVT) corners. However, the extreme corner conditions rarely occur in fabricated chips. By pursuing worst-case design, one penalizes performance in order to cover low-probability problems. Excessive design margining limits the maximum speed specification and costs additional power because the circuit area is over-dimensioned during synthesis. Worst-case design not only constrains high-performance digital circuits, but also affects low-power digital circuits that operate from a reduced (minimum) supply voltage. To continue the success of digital design in nanometer CMOS, cost-effective variation-tolerant design approaches are needed that guarantee circuit robustness in the presence of variability while avoiding over-dimensioning of the design.

This thesis presents research on a novel body bias driven (BBD) design strategy for digital CMOS circuits that relies on post-silicon tuning with forward body biasing (FBB) to achieve variation-resilient operation. Unlike prior art, which uses body biasing only to tune circuit speed and power at silicon-time, this research also uses FBB as an integral part of the design process to reduce design margins and thereby constrain area over-dimensioning. Such a BBD design approach is not by itself sufficient to avoid area over-dimensioning. Therefore, a design optimization strategy was developed that is based on a performance-per-area (PPA) metric. A maximum PPA design represents the fastest design possible without area over-dimensioning. This thesis provides an in-depth analysis of the PPA design theory for both high-performance and ultra-low-power digital CMOS circuits.
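As an illustration of the maximum PPA idea, the following minimal Python sketch selects the synthesis point with the highest performance-per-area from a sweep of clock-period targets. It assumes that PPA can be taken as clock frequency divided by area, i.e. 1/(T x A); the summary itself does not spell out the formula. The class name `SynthesisPoint`, the function `max_ppa_point`, and all numbers in the sweep are hypothetical and serve only to show the shape of the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisPoint:
    """One synthesis run: target clock period and resulting cell area.

    Hypothetical container; units are arbitrary but must be consistent.
    """
    clock_period_ns: float
    area_mm2: float

    @property
    def ppa(self) -> float:
        # Assumed definition: performance (1/T) divided by area (A).
        return 1.0 / (self.clock_period_ns * self.area_mm2)


def max_ppa_point(points):
    """Return the synthesis point with the highest performance-per-area."""
    return max(points, key=lambda p: p.ppa)


# Hypothetical sweep: tightening the clock period first costs little area,
# then area grows steeply as the synthesis tool starts to over-dimension.
sweep = [
    SynthesisPoint(clock_period_ns=4.0, area_mm2=1.00),
    SynthesisPoint(clock_period_ns=3.0, area_mm2=1.05),
    SynthesisPoint(clock_period_ns=2.5, area_mm2=1.20),
    SynthesisPoint(clock_period_ns=2.2, area_mm2=1.60),
    SynthesisPoint(clock_period_ns=2.1, area_mm2=2.20),
]

best = max_ppa_point(sweep)
print(f"max PPA at T = {best.clock_period_ns} ns, A = {best.area_mm2} mm2")
```

In this made-up sweep the two fastest points are faster in absolute terms, but they pay for the last fraction of a nanosecond with a disproportionate area increase, so their PPA is lower than that of the 2.5 ns point.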
With the PPA design theory available, designers have a means to judge how efficiently a circuit design is implemented in terms of speed and area, and implicitly power consumption.

The benefits of BBD design depend on the silicon tuning range available in a given CMOS technology. The experimental work explored the technological boundaries of supply voltage scaling and body bias tuning through silicon experiments in 90nm, 65nm, and 45nm Low-Power (LP) CMOS processes. In particular, this research shows how much power saving can be expected, which power-performance trade-offs can be made, and to what extent process-dependent performance compensation can be accomplished.

BBD design relies on the chip having silicon tuning capabilities. The trend towards higher integration densities in modern chips favours a fully integrated silicon tuning solution. For this purpose, a new FBB generator design for high-performance digital circuits was developed as part of this research. The design has been implemented in 90nm CMOS, and its operation has been experimentally verified.

To achieve a target circuit performance, it is not necessary to forward-body-bias all digital gates in the design. To complement BBD design, a body bias clustering method was therefore investigated as well. This method is based on the design hierarchies that contain the timing-critical circuit parts at design-time. A greedy algorithm was developed for assigning design hierarchies to body bias clusters. This approach partitions the circuit into a body-biased part and a non-body-biased part, while preventing signal routing congestion. A heuristic algorithm was formulated that supports the automated implementation of body bias islands in the layout.

The BBD digital design strategy was validated on industrial processor designs in 90nm LP-CMOS. For standard-Vth implementations, PPA improvements of up to 40%, area and leakage reductions of up to 30%, and dynamic power savings of up to 10% were observed without performance penalties. The benefits are larger for high-Vth implementations: PPA improvements of up to 90%, area and leakage reductions of up to 40%, and dynamic power savings of up to 25% were observed without performance penalties.

Extending BBD designs with hierarchy-based body bias clustering enabled the application of FBB to timing-critical circuit parts only. For a 90nm standard-Vth LP-CMOS industrial processor design, body bias clustering resulted in up to 4.5x fewer forward-body-biased gates and leakage reductions of up to 2.6x at similar circuit speed, compared with applying FBB to the whole design. The proposed physical design approach for implementing body bias clustered BBD designs showed minimal area and routing overhead compared with a nominal-body-biased design.

Finally, the BBD design strategy with body bias clustering and the proposed FBB generator was deployed in a mixed-signal system-chip design in 90nm LP-CMOS. The test-chip was designed to operate at the maximum PPA point. The die size is 3.98mm x 3.98mm. At nominal VDD, a 25% clock frequency improvement with a total energy increase of only 3% was observed at 0.5V FBB. The research described in this thesis has proven the effectiveness of BBD design with body bias clusters in a realistic integrated circuit vehicle. The methodology is applicable to both high-performance and low-power digital CMOS circuits.


Samenvatting

Modern integrated circuits have become sensitive to variations in fabrication parameters, supply voltage and temperature. Digital designers account for such variability during the design process of an integrated circuit. They apply a design approach with worst-case design margins to guarantee the functionality of the circuit for all corner values of the fabrication process, the supply voltage and the temperature. However, these extreme corner values rarely occur in fabricated chips. In addition, the worst-case design approach limits the speed at which an integrated circuit can operate, because cases that rarely occur are also taken into account. The excessive use of design margins not only limits the maximum speed specification, but also leads to higher power consumption due to over-dimensioning of the circuit, and hence of the chip area, during the synthesis process. Worst-case design not only constrains high-speed digital circuits, but also affects circuits that are designed for low power consumption and that operate from a reduced (minimum) supply voltage. To continue the success of digital design in nanometer CMOS, a cost-effective variation-tolerant design approach is needed that makes circuits robust under the influence of variations while avoiding over-dimensioning.

This thesis presents research results on a new variation-tolerant body bias driven (BBD) design approach for digital CMOS circuits that takes into account that forward body bias (FBB) can be applied after fabrication of the circuit. In contrast to existing approaches, which apply body biasing only to adjust the speed and power consumption of a circuit after fabrication, this research also uses FBB as an integral part of the design flow to reduce excessive design margins and to counter area over-dimensioning of the circuit. Having such a BBD design approach is, however, not sufficient to avoid over-dimensioning. Therefore, a design optimization strategy was devised that is based on a performance-per-area (PPA) criterion. A maximum PPA design gives the circuit the highest speed without area over-dimensioning. This thesis contains an in-depth analysis of the PPA design theory for high-speed digital CMOS circuits and for CMOS circuits designed for extremely low power consumption. With the PPA design theory, designers can judge how efficiently the circuit has been designed with respect to speed and area, and implicitly with respect to power consumption.

The benefits of BBD design depend on the CMOS technology and on the range over which the silicon can be tuned after fabrication. The experimental work explored the technological boundaries of supply voltage and body bias adjustment by means of silicon experiments in 90nm, 65nm and 45nm Low-Power (LP) CMOS technologies. This research mainly shows how much power saving can be expected, what the trade-off is between power consumption on the one hand and speed on the other, and to what extent the spread in speed caused by the fabrication process can be compensated.

BBD design assumes that the integrated circuit is equipped with the ability to tune the silicon after fabrication. The trend towards higher integration density in modern integrated circuits favours a fully integrated on-silicon solution. For this purpose, a new FBB generator was designed as part of the research, which can be applied in digital circuits that operate at high speed. This generator has been implemented in a 90nm CMOS circuit, and its operation has been demonstrated experimentally.

To let the circuit reach a desired speed, it is not necessary to provide all digital cells with FBB. Complementary to BBD design, a body bias clustering method was investigated as well. This method is based on the circuit hierarchies that contain the slowest circuit parts at design time. A greedy algorithm was developed for assigning circuit hierarchies to body bias clusters. In principle, this approach enables a partitioning of the circuit into a body-biased part and a non-body-biased part, while preventing congestion of signal wiring. A heuristic algorithm was formulated that makes it possible to implement body bias islands automatically in the layout.

The BBD design strategy was validated on industrial processor circuits in 90nm LP-CMOS. For standard-Vth implementations, PPA improvements of up to 40%, area and leakage reductions of up to 30%, and dynamic power savings of up to 10% were observed without speed penalty, compared with the traditional worst-case design approach. The benefits for high-Vth implementations are larger. In that case, PPA improvements of up to 90%, area and leakage reductions of up to 40%, and dynamic power savings of up to 25% were observed without speed penalties as the added value of the proposed BBD design strategy.

Extending BBD design with hierarchy-based body bias clustering makes it possible to use FBB for the slowest circuit parts only. For an industrial processor design in 90nm standard-Vth CMOS, up to 4.5x fewer digital cells with FBB and a leakage saving of up to 2.6x were observed at a comparable circuit speed, relative to the case in which FBB was applied to the whole circuit. The proposed physical implementation approach for body bias clustered BBD circuits incurs minimal additional area and wiring cost compared with a circuit with nominal body bias.

Finally, the BBD design strategy, including body bias clustering and the proposed FBB generator, was applied to the design of a mixed-signal system chip in 90nm CMOS. This test circuit was designed for maximum PPA. The silicon area of this integrated circuit is 3.98mm x 3.98mm. A 25% higher clock frequency with a total energy increase of only 3% was observed when operating from a nominal supply voltage with 0.5V FBB. The research described in this thesis has demonstrated the effectiveness of BBD design in combination with body bias clusters through its application in a realistic integrated circuit. The method is applicable both to digital CMOS circuits designed for high speed and to circuits designed for low power consumption.


Acknowledgements

First and foremost I would like to thank José Pineda de Gyvez, my first promoter, for offering me the opportunity to pursue a Ph.D. in his group. I am greatly indebted to you for your encouragement and support during the course of this work, helping me to push it forward whenever it was necessary. I enjoyed the many inspiring discussions and the valuable comments on the manuscripts, and appreciate the patience you showed with me. I also wish to express my appreciation to Ralph Otten, my second promoter. Despite the fact that we did not have much interaction, you were always a supporter of the research. Thank you very much for your valuable suggestions and the constructive comments on the dissertation. Moreover, I would also like to acknowledge the other members of the dissertation committee for the careful reading of the dissertation and for the insightful comments.

This research would not have been possible without the support of Philips Research Laboratories and NXP Semiconductors. In particular, I gratefully acknowledge the support of Ad Ten Berg, my former group leader at Philips, and Leo Warmerdam, my group leader at NXP. Many thanks go to Rutger van Veen, a former colleague. I am much indebted to Rutger for his indispensable help during testing of the various monitoring circuits, and for developing the automated test flows. I am very grateful to Aatish Kumar, a former colleague, for making me familiar with digital library characterization and for his help in setting up an initial library characterization flow. I gratefully thank Cas Groot and Leo Sevat from NXP Semiconductors. Without their help it would not have been possible to design and implement the body bias driven mixed-signal system-chip test vehicle in such a short time. It has been a pleasure collaborating with you. I would also like to thank Ben Kup and Maarten Vertregt for their valuable help during the design phase of the forward body bias generator. Peter Bastiaansen and Marco Lammers are acknowledged for the final design implementation. Furthermore, I would like to thank Hamidreza Hashempour and Tianyuan Wu from NXP Semiconductors for their help with the latch-up test-structure design and for performing the latch-up experiments under forward body bias conditions. During this work I have collaborated with many colleagues at Philips Research Laboratories and NXP Semiconductors, and I wish to extend my warmest thanks to all those who have supported me in some way.

Several students also made contributions to the research. I would like to thank Bo Liu from the Technical University of Eindhoven for his evaluation of various cell-based body bias clustering approaches. I would like to thank Amparo Correa Flores for developing the initial back-end scripting to implement body bias islands in the circuit layout.

Last but not least, I would like to thank my wife Sandra for her understanding and love during the past years. Her support and encouragement were in the end what made this dissertation possible. Many thanks also go to my daughter Nikki and my son Jarno for their continuous love and support. My parents, Rinus and Riny, receive my deepest gratitude and love for their dedication and the many years of support during my undergraduate studies that provided the foundation for this work.


Contents

Summary
Samenvatting
Acknowledgements
Contents
List of Figures
List of Tables
Glossary

Chapter 1 Introduction
1.1 Integrated Circuit Design in Nanometer CMOS
1.2 Conventional Corner-Based Digital Design
1.3 Techniques for Mitigating Variability and Design Margin
1.3.1 Design-Time Techniques
1.3.2 Silicon-Time Techniques
1.4 Thesis Contributions

Chapter 2 Models for Body Biased Digital Design
2.1 Circuit Models with Body Bias
2.1.1 Propagation Delay Model
2.1.2 Leakage Current Model
2.1.3 Power Consumption Model
2.2 Circuit Area Modelling
2.3 Discussion

Chapter 3 Technology Boundaries of Post-Silicon Tuning
3.1 Prior Art Analysis
3.2 Test Circuits and Scaling Conventions
3.3 Frequency Scaling and Tuning
3.4 Power and Frequency Tuning
3.5 Leakage Power Control
3.6 Tuning Ranges for Different Vth Options
3.7 Process-Dependent Timing Variability Decomposition
3.8 Performance-Spread Compensation
3.9 Discussion

Chapter 4 Embedded Forward Body Bias Generation
4.2 Load Characteristics
4.2.1 N-well and P-well Behaviour
4.2.2 Load Modelling and Analysis
4.2.3 Latch-Up Sensitivity Analysis
4.3 FBB Generator Design
4.3.1 Concept
4.3.2 Design
4.3.3 Layout
4.4 FBB Generator Experimental Results
4.5 Discussion

Chapter 5 Body Bias Driven Design for Area-Efficient High-Performance Circuits
5.2 Maximum PPA Design
5.3 Body Bias Driven Design Concept
5.4 Optimum Design Space for High-Performance Circuits
5.4.1 Performance-per-Area Trends
5.4.2 Power Consumption Trends
5.4.3 Impact of Technology Scaling
5.5 Model Validation
5.6 Benchmarked Results
5.6.1 Design Synthesis for Maximum PPA
5.6.2 Design Synthesis for Minimum Area
5.7 Discussion

Chapter 6 Body Bias Driven Ultra-Low-Power Digital Circuits
6.2 Body-Biased Ultra-Low-Power Design
6.2.1 Circuit Models for Energy and Delay
6.2.2 ULP Circuit Optimization with Body Biasing
6.2.3 Process Variability Implications for ULP Digital Circuits
6.2.4 Utilization of Body Bias Driven Design
6.3 Optimum Design Space for ULP Digital Circuits
6.4 Body Bias Selection and Generation
6.5 Synthesized Design Example
6.6 Discussion

Chapter 7 Body Bias Clustering and Physical Design
7.1 Prior Art Analysis
7.2 Design-Time Body Bias Clustering
7.2.1 Synthesis-Based Body Bias Clustering Exploration
7.2.2 Candidate Path Selection for FBB
7.2.3 Hierarchy-Based Body Bias Clustering
7.3 Physical Design with Body Bias Clusters
7.3.1 Body Bias Islands
7.3.2 Balanced Track Utilization for Inter-Island Signal Routing
7.3.3 Automated Implementation of Body Bias Islands
7.4 Forward-Body-Bias Integration into a Mixed-Signal System-Chip Design
7.5 Discussion

Chapter 8 Concluding Remarks
8.1 Research Contribution and Results
8.2 Outlook and Suggestions for Future Work

References
List of Publications
Curriculum Vitae
Reader’s Notes


List of Figures

Figure 1.1 VDD and Vth scaling across various CMOS technology nodes. .................... 2

Figure 1.2 Slow-PVT margin breakdown for a FO4 inverter. ...................................... 4

Figure 1.3 Fast-PVT margin breakdown for a FO4 inverter. ....................................... 5

Figure 1.4 Impact of process and timing margin on circuit area and clock period.

Figure 1.5 Measured performance spread of a 90nm LP-CMOS ring-oscillator.

Figure 1.6 Conceptual diagram of an adaptive body bias control system.

Figure 1.7 Concept of frequency binning with body biasing.

Figure 1.8 RazorII flip-flop schematic and the corresponding timing diagram.

Figure 1.9 Body bias driven versus conventional worst-case design synthesis.

Figure 2.1 NMOS Vth versus body biasing in 90nm SVT LP-CMOS.

Figure 2.2 PMOS Vth versus body biasing in 90nm SVT LP-CMOS.

Figure 2.3 Inverter intrinsic capacitance versus body biasing.

Figure 2.4 FO1 inverter propagation delay versus body biasing.

Figure 2.5 Considered leakage mechanisms in deep-submicron transistors.

Figure 2.6 Inverter leakage current versus body biasing.

Figure 2.7 Ring-oscillator power consumption versus body biasing.

Figure 2.8 Area and clock period trade-off for a generic digital logic circuit.

Figure 3.1 Schematic diagram of the ring-oscillator monitoring circuit.

Figure 3.2 Die photograph of the 45nm LP-CMOS ring-oscillator test-chip.

Figure 3.3 Die photograph of the 90nm LP-CMOS shift-register test-chip.

Figure 3.4 Voltage scaling and body biasing operations.

Figure 3.5 Frequency scaling and tuning for the 65nm LP-CMOS SVT ringo.

Figure 3.6 Frequency versus N-well and P-well biasing in 65nm LP-CMOS.

Figure 3.7 Frequency dependency on supply voltage and body bias.

Figure 3.8 Frequency versus total power in CMOS 65nm.

Figure 3.9 Trading-off frequency for total power consumption in CMOS 65nm.

Figure 3.10 Total power of a logic core versus body biasing in CMOS 90nm.

Figure 3.11 Total power correlation between a logic core and a ringo.

Figure 3.12 Leakage versus VDD for a 65nm LP HVT NMOS device.

Figure 3.13 Leakage versus temperature for a 65nm LP HVT NMOS device.

Figure 3.14 Leakage reduction in 65nm SVT LP-CMOS using VS and BB.

Figure 3.15 Temperature-dependent leakage reduction in 65nm LP-CMOS SVT.

Figure 3.16 Frequency versus VDD for different Vth-options in CMOS 45nm.

Figure 3.17 Frequency versus leakage for a BB-tuned ringo in CMOS 45nm.

Figure 3.18 Leakage versus body biasing in CMOS 45nm.

Figure 3.19 Oscillation period correlation plot of layout-identical ringo’s.

Figure 3.20 Systematic delay spread versus logic depth under body biasing.

Figure 3.21 Random delay spread versus logic depth under body biasing.

Figure 3.22 Estimated delay spread for the 21-stage ringo in CMOS 45nm.

Figure 3.23 Estimated oscillation period for a 21-stage ringo in CMOS 45nm.

Figure 3.24 Ringo frequency and leakage wafer spread for 40 dies in CMOS 65nm.

Figure 3.25 Process-dependent performance compensation with body biasing.

Figure 3.26 Performance compensation strategies in 65nm LP-CMOS SVT.

Figure 3.27 Performance compensation in 45nm LP-CMOS SVT.

Figure 4.1 CMOS inverter cross-section with junction diodes displayed.

Figure 4.2 P-well current versus body bias experiments in 90nm LP-CMOS.


Figure 4.3 N-well current versus body bias experiments in 90nm LP-CMOS.

Figure 4.4 Layout implementation example of a body biased digital circuit.

Figure 4.5 Well current versus FBB and process corner in 90nm LP-CMOS.

Figure 4.6 Well current versus FBB at two VDD’s in 90nm LP-CMOS.

Figure 4.7 Well current versus FBB and temperature in 90nm LP-CMOS.

Figure 4.8 Well capacitance versus FBB and process corner in 90nm LP-CMOS.

Figure 4.9 Well capacitance versus FBB at two VDD’s in 90nm LP-CMOS.

Figure 4.10 Well capacitance versus FBB and temperature in 90nm LP-CMOS.

Figure 4.11 Parasitic thyristor structure in CMOS circuits.

Figure 4.12 Equivalent circuit of the parasitic thyristor structure.

Figure 4.13 Schematic I-V characteristics of the parasitic thyristor structure [37].

Figure 4.14 Cross-section of the latch-up test-structure.

Figure 4.15 Latch-up experimental results for 90nm LP-CMOS test-structure.

Figure 4.16 Conceptual circuit diagram of the proposed FBB generator.

Figure 4.17 Detailed block diagram of the proposed FBB generator.

Figure 4.18 Simplified circuit diagram of the reference circuit.

Figure 4.19 Simplified circuit diagram of a resistor element of the RDAC.

Figure 4.20 Circuit diagram of the pre-driver of the voltage buffer.

Figure 4.21 Circuit diagram of the p-well output stage of the voltage buffer (left), and generator bandwidth vs. digital IP block dimension (right).

Figure 4.22 Layout implementation of the FBB generator in 90nm LP-CMOS.

Figure 4.23 Measured INL of the 90nm FBB generator.

Figure 4.24 Measured transient response of the FBB generator for 0.5V FBB.

Figure 4.25 Zoomed-in transient response of the FBB generator for 0.5V FBB.

Figure 4.26 Measured transient response for the FBB generator under voltage scaling conditions of the digital IP load.

Figure 4.27 Measured FBB generator load regulation at 0.5V FBB.

Figure 5.1 Greedy algorithm to obtain fitting parameters of expression (5.1).

Figure 5.2 FBB utilization under body bias driven design.

Figure 5.3 Nominal-process performance under body bias driven design.

Figure 5.4 Performance tuning for a minimum area body bias driven design.

Figure 5.5 Area, speed, and PPA trade-off for a generic digital logic circuit.

Figure 5.6 PPA versus clock period for a generic digital logic circuit.

Figure 5.7 Area, speed, and power trade-off for a generic digital logic circuit.

Figure 5.8 Area, speed, and power trade-off curves for a generic digital logic circuit implemented in different technology nodes.

Figure 5.9 Area versus speed and PPA of the 90nm microprocessor.

Figure 5.10 Area versus speed and total power of the 90nm microprocessor.

Figure 6.1 Energy versus VDD for a generic 90nm HVT digital circuit.

Figure 6.2 Energy versus clock frequency for a generic 90nm digital circuit.

Figure 6.3 Trading-off energy and performance under process variability.

Figure 6.4 Area and performance trade-offs under process variability.

Figure 6.5 Energy versus performance in CMOS 90nm with body biasing.

Figure 6.6 Area, clock period, and energy trends in CMOS 90nm at VDD=0.5V.

Figure 6.7 Energy versus FBB in CMOS 90nm at VDD=0.5V.

Figure 6.8 EDP, area and clock period trends in CMOS 90nm at VDD=0.5V.

Figure 6.9 PPA versus total energy in CMOS 90nm at VDD=0.5V.

Figure 6.10 Conceptual circuit diagram of the body bias power delivery.

Figure 6.11 Area versus clock period for the 90nm microprocessor at VDD=0.5V.

Figure 6.12 PPA versus energy for the 90nm microprocessor at VDD=0.5V.


Figure 7.1 Frequency vs. leakage for the microprocessor with body bias clusters.

Figure 7.2 Body-biased cells per cluster and PPA for the microprocessor.

Figure 7.3 Path delay distribution of the 90nm SVT microprocessor.

Figure 7.4 Algorithm for selecting candidate design hierarchies for FBB.

Figure 7.5 Frequency versus body-biased reference gates for the microprocessor with hierarchy-based body bias clustering.

Figure 7.6 Layout example of a body bias island.

Figure 7.7 Schematic of track utilization for inter-island signal routing.

Figure 7.8 Track utilization exploration for the 90nm SVT microprocessor.

Figure 7.9 Heuristic algorithm to determine the preferred floorplan solution of a design utilizing body bias clusters.

Figure 7.10 Heuristic algorithm for preferred body bias island width and height.

Figure 7.11 Heuristic algorithm for preferred location of the body bias islands.

Figure 7.12 Heuristic algorithm results for body bias island integration into the 90nm SVT microprocessor with one 0.5V FBB island.

Figure 7.13 Floorplanning results for the 90nm SVT microprocessor with one 0.5V FBB island by using the heuristic algorithms.

Figure 7.14 Layout of the 90nm SVT microprocessor with one 0.5V FBB island.

Figure 7.15 Digital layout area for the FBB-enabled mixed-signal SoC test-chip.

Figure 7.16 Maximum performance versus FBB experiments for the mixed-signal SoC test-chip under a reference benchmark application.

Figure 7.17 Relative total power and energy increase versus FBB experiments for the mixed-signal SoC test-chip under a reference benchmark application.


List of Tables

Table 3.1 Voltage conventions for scaling operations.

Table 3.2 Frequency tuning ranges for various CMOS nodes (SVT).

Table 3.3 Power-frequency tuning ranges for various CMOS nodes (SVT).

Table 3.4 Leakage current savings for various CMOS nodes (SVT) at T=25°C.

Table 4.1 Reference gate characteristics in 90nm SVT LP-CMOS.

Table 4.2 Measured FBB generator characteristics in 90nm LP-CMOS.

Table 5.1 Technology scaling results of maximum PPA designs at T=85°C.

Table 5.2 Model fitting parameters for the 90nm microprocessor design.

Table 5.3 Industrial processor designs for maximum PPA in 90nm LP-CMOS.

Table 5.4 Example gate count of three industrial microprocessor designs.

Table 5.5 Industrial processor designs for minimum area in 90nm LP-CMOS.

Table 6.1 Simulated ring-oscillator performance and leakage increase versus FBB for 90nm HVT LP-CMOS under slow-process corner and 25°C operation.

Table 6.2 Technology scaling of maximum PPA design at VDD=0.5V and T=25°C.

Table 6.3 Parameter values of expression (5.1) for the industrial microprocessor design in 90nm HVT LP-CMOS at VDD=0.5V.

Table 6.4 Microprocessor designs at VDD=0.5V in 90nm HVT LP-CMOS.

Table 6.5 Microprocessor designs at VDD=0.7V in 90nm HVT LP-CMOS.

Table 7.1 Body bias clustered microprocessor designs in 90nm SVT LP-CMOS.

Table 7.2 Hierarchy-based body bias clustered microprocessor design characteristics in 90nm SVT LP-CMOS.

Table 7.3 Main design characteristics of the mixed-signal SoC test-chip.


Glossary

ABB    Adaptive body bias
AVS    Adaptive voltage scaling
BB     Body bias
BBD    Body bias driven
BTBT   Band to band tunnelling
CGU    Clock generation unit
CMOS   Complementary metal oxide semiconductor
D2D    Die-to-die
DIBL   Drain induced barrier lowering
DVS    Dynamic voltage scaling
EDA    Electronic Design Automation
EDP    Energy delay product
FBB    Forward body bias
FO     Fanout
GIDL   Gate induced drain leakage
GP     General purpose
HVT    High threshold voltage
IC     Integrated Circuit
INL    Integral non-linearity
IP     Intellectual Property
LP     Low power
LVT    Low threshold voltage
MEP    Minimum energy point
MPPAP  Maximum performance-per-area point
MOS    Metal oxide semiconductor
NBB    Nominal body bias
NMOS   n-channel metal oxide semiconductor
OCV    On-chip variation
PMOS   p-channel metal oxide semiconductor
PPA    Performance per area
PVT    Process-Voltage-Temperature
RBB    Reverse body bias
RDAC   Resistive digital-to-analog converter
RTL    Register transfer level
SoC    System-on-chip
SRAM   Static random access memory
SRH    Shockley Read Hall
STA    Static timing analysis
SVT    Standard threshold voltage
TAT    Trap assisted tunnelling
ULP    Ultra low power
VS     Voltage scaling
WCD    Worst case design
WID    Within-die


Chapter 1

Introduction

MICRO-ELECTRONICS in the form of Integrated Circuits underpins much of our society and economy today. It is at the heart of a new generation of devices that are changing our daily life fundamentally. Digital ubiquity, together with growing semiconductor use in consumer electronics and the miniaturization of equipment, are key issues in attaining digital convergence. Integrated Circuits will become increasingly intelligent. More computational performance will be needed to satisfy application requirements and enhance user experience. At the same time, the power consumption should stay similar, or preferably become lower, to help achieve environmental objectives towards “Green-ICT” solutions. In this chapter, an overview of the technological implications for next-generation Integrated Circuits is presented in terms of power efficiency and performance variability, followed by a review of existing design approaches and silicon tuning practices. The chapter closes with the contributions of this work and the organization of the thesis.

1.1 Integrated Circuit Design in Nanometer CMOS

According to Moore’s Law, the integration density of Integrated Circuits doubles approximately every two years [1]. This trend was predicted by Gordon Moore in 1965, and still holds today. In 1974, Dennard et al. described the MOSFET scaling rules for obtaining simultaneous improvements in transistor density, switching speed and power consumption [2]. This transistor scaling theory, also known as constant-field scaling, underlies Moore’s Law and the evolution of Integrated Circuits over several decades. It consists of reducing all dimensions by a factor s (≈1.4), enabling higher integration density. In the constant-field scaling scenario, the circuit speed theoretically increases by the scaling factor s. Constant-field scaling has known benefits such as lower power per circuit, constant power density, and a power-delay product that improves by s^3. However, for CMOS technology, over the last ten years, it has been impossible to scale the power supply voltage (VDD) while maintaining speed because of the constraints on the threshold voltage (Vth) of the transistors [3]. The VDD and Vth scaling trends are illustrated in Figure 1.1 for Low-Power CMOS technology nodes. Because leakage currents increase in scaled devices, Vth is not lowered, in order to avoid significant static power consumption. Therefore, the electric field rises in proportion to s. In turn, this results in an almost constant circuit power despite scaling, a power density that increases by s^2, and a power-delay product that improves by a factor of s only.

Power consumption of conventional electronic devices is a major concern because dense devices produce a significant amount of heat, which imposes constraints on circuit performance and IC packaging. The case for portable devices is obvious: the goal is to maximize battery time. Integrated Circuits fabricated in Low-Power CMOS technologies are typically used for consumer applications, such as mobile phones, where the emphasis is on the lowest possible static power consumption.

[Plot: VDD and Vth versus CMOS technology node; vertical axis from 0 to 3.5.]

Figure 1.1 VDD and Vth scaling across various CMOS technology nodes.
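The scaling trends quoted above can be reproduced with a first-order calculation. The sketch below is only an illustration: it assumes dynamic power P ∝ C·VDD²·f, capacitance shrinking by s, area shrinking by s², and (as an assumption chosen to match the figures quoted in the text) a delay that improves by s in both scenarios. The function name `scaling_trends` and its arguments are hypothetical.

```python
import math

def scaling_trends(s, scale_voltage):
    """Relative change (new/old) of first-order circuit metrics for one scaling step."""
    capacitance = 1.0 / s                        # linear dimensions shrink by s
    area = 1.0 / s**2
    delay = 1.0 / s                              # assumption: speed improves by s in both cases
    voltage = 1.0 / s if scale_voltage else 1.0  # constant-field vs. constant-voltage
    frequency = 1.0 / delay
    power = capacitance * voltage**2 * frequency  # P ~ C * V^2 * f
    return {
        "power": power,
        "power_density": power / area,
        "power_delay_product": power * delay,
    }

S = 1.4
constant_field = scaling_trends(S, scale_voltage=True)
constant_voltage = scaling_trends(S, scale_voltage=False)

for name, result in (("constant field", constant_field),
                     ("constant voltage", constant_voltage)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

With the supply voltage scaled, the power density stays at 1 and the power-delay product shrinks by s^3; with the supply voltage held constant, the circuit power stays at 1, the power density grows by s^2, and the power-delay product shrinks by s only.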

Given the fact that modern system chips are power limited, the lowest-power design is not interesting for a wide range of applications since it has insufficient performance. Conversely, the highest-performance design is not interesting either because it consumes too much power. The focus should be on constrained circuit optimizations for power and performance, e.g. the highest performance for a given power budget, or the minimum power consumption for a given operating frequency. Designing Integrated Circuits for low power will be a key practical and competitive advantage in the coming decade.

Modern Integrated Circuits are hampered by excessive environmental and process variability that causes electrical parameters to vary [4][5][6]. Environmental variations are caused by changes in supply voltage and temperature conditions, which are both dynamic in nature. Supply voltage variations are caused by activity changes in the circuit and lead to resistive (IR) and inductive (L·dI/dt) voltage drops due to non-ideal power supply generation and delivery. A reduced power supply lowers the drive strengths of digital gates, and degrades circuit speed. Temperature variations may be due to changes in ambient temperature or due to local hot spots on the chip during operation. Temperature influences carrier mobility, µ, and Vth of transistors, thereby impacting circuit speed and leakage. Local hot spots are particularly a problem in high-performance microprocessors, where within-die temperature fluctuations are a major challenge not only for performance, but also for packaging. The operating temperature range of Integrated Circuits is driven by application requirements, e.g. 0°C to 70°C for commercial applications, -40°C to 85°C for industrial applications, and -55°C to 125°C for military applications. Identifying the worst-case condition for VDD and temperature variation is very difficult. Designers focus on limiting VDD and temperature variation across the chip, e.g. by using a traditional VDD margin equal to ±10% deviation from the nominal VDD, or by using thermal throttling [7].

Finally, variations in process parameters are due to non-idealities during silicon fabrication, and are static in nature. As process geometries continue to shrink, controlling critical device parameters is becoming increasingly difficult [6][8]. This has resulted in significant dopant fluctuations, variations in oxide thickness, and variations in device geometries due to the resolution limitations of lithography hardware. Process variability influences circuit speed, power consumption and leakage. It can be categorized into two main classes: 1) die-to-die (D2D) variations, which occur between different chip samples and affect all devices on the same chip in the same way, and 2) within-die (WID) variations, which occur within the same chip and may affect different devices on that chip in different ways. WID variation is becoming more prominent with ongoing technology scaling. The increasing magnitude of process variability for a scaled CMOS technology could lower the performance of a circuit by one technology generation [5], and could even lead to design failure [9]. Designers must ensure circuit robustness against process variability to obtain a sufficiently high yield of the Integrated Circuit design.

Nowadays, there is a trend to achieve low power consumption by operating the Integrated Circuit at a reduced VDD, i.e. sub-threshold or near-threshold design approaches [10][11]. When operating at a reduced VDD, the circuit becomes even more sensitive to process variability than at nominal VDD. Consequently, variation-resilient design practices are important for achieving operational robustness in both high-performance and low-power Integrated Circuits.

The pressure to drive cost and die size down, while increasing the performance and functionality embedded in electronic systems, motivates the need for system chips. System chips can integrate heterogeneous functionality into the same Integrated Circuit. They may comprise digital, memory, analog, mixed-signal, and often radio-frequency functions. These different functions exhibit quite different behaviours as technology is scaled. The general design trend is to implement more and more functionality in the digital domain¹. This design trend is also visible in analog mixed-signal circuits, where digital transistors are used for digital-assist logic, e.g. for calibration and differential-pair offset calibration. The main reason for increased digital integration is that digital circuit implementations exhibit superior technology scaling properties in terms of reduced area and better noise immunity. In addition, digital design is supported by highly automated electronic design tools that enable high designer productivity and reduced time-to-market. Therefore, digital circuit modules are generally the most dominant power consumers in system chips, while they also generally determine the system chip performance.

1.2 Conventional Corner-Based Digital Design

Traditionally, digital designers use case files, or corner files, during the design and verification stages of the Integrated Circuit. Such files describe the specification limits for deviations caused by the fabrication process, supply voltage range and temperature conditions. In this way, a guard-banding design approach is followed to guarantee circuit operation across Process-Voltage-Temperature (PVT) corners. This is often referred to as corner-based or Worst-Case Design (WCD). Under WCD, design synthesis is performed at the PVT corner that provides the minimum circuit speed (e.g. slow process, VDDnom-10%, 125°C). This PVT corner is referred to as the slow PVT corner. Design verification is performed for a range of PVT corners through static timing analysis (STA), which tells the designer whether the targeted circuit speed (set-up timing) is achieved, whether functionality (hold timing) is guaranteed, and whether leakage power specifications are met. One of the important PVT corners for checking hold timing and leakage power is the fast PVT corner, e.g. fast process, VDDnom+10%, -40°C. However, process corners lack the detail of WID variation, which is unrealistic. This leads to a pessimistic representation of the device performance, since it is assumed that all devices within the die perform worst-case under slow process conditions. Consequently, designers apply an on-chip variation (OCV) margin to model WID variations during timing analysis, accounting for a possible speed difference between different paths in the design.

1 ITRS Roadmap 2009 Edition. [Online]. Available: http://www.itrs.net/Links/2009ITRS/Home2009.htm

Let us now consider an example that demonstrates the impact of PVT margins on the performance of a Fanout-4 (FO4) inverter across technology nodes. For this purpose, Spectre circuit simulations have been performed for this circuit using a MOS Model 11 transistor model that has been calibrated for the respective technology node. Figure 1.2 shows the simulated performance decrease due to PVT margining for the slow PVT corner. Conversely, Figure 1.3 shows the simulated performance increase for the fast PVT corner. Both figures illustrate the performance impact of changing individual variability sources (Process, Voltage, Temperature) and their collective impact (All) with respect to nominal PVT operation (e.g. nominal process, VDDnom, 25°C). Observe in both figures that the performance impact due to process is nearly constant at around 20%. This implies that process variability is well controlled for the analyzed technology nodes. Kuhn et al. reached a similar conclusion for 90nm, 65nm and 45nm technologies from Intel [12]. Also observe that the performance impact of VDD becomes more important in scaled CMOS technologies. This increasing importance of VDD is mainly due to the decreasing transistor over-drive voltage, VDD-Vth, across technology nodes. Furthermore, observe that the temperature impact on performance decreases with ongoing technology scaling. This is because the zero-temperature-coefficient (ZTC) bias point of transistors is located closer to the nominal VDD in a newer technology node [13]. At the ZTC bias point, the temperature dependencies of Vth and carrier mobility cancel each other over a specified temperature range, thereby making the transistor drain current independent of temperature. Finally, observe in both figures that the collective impact of PVT on performance is different from the sum of the individual contributions. This is because the effects of process, voltage and temperature are interdependent. For example, the impact of process variability on performance is larger when operating at VDD-10% for the slow PVT corner case, while the impact of VDD is larger as well for the slow process case. The collective impact of PVT on performance can be up to 60% and up to 32% for the slow and fast PVT corner, respectively.

[Bar chart: performance decrease (0% to 70%) for the 0.18µm, 0.13µm, 90nm, 65nm and 45nm nodes. Series: All; Process: slow; Voltage: Vdd-10%; Temperature: 125°C. Values are w.r.t. the nominal PVT.]

Figure 1.2 Slow-PVT margin breakdown for a FO4 inverter.

[Bar chart: performance increase (0% to 70%) for the 0.18µm, 0.13µm, 90nm, 65nm and 45nm nodes. Series: All; Process: fast; Voltage: Vdd+10%; Temperature: -40°C. Values are w.r.t. the nominal PVT.]

Figure 1.3 Fast-PVT margin breakdown for a FO4 inverter.
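The observation that the collective PVT impact differs from the sum of the individual contributions can be illustrated with a simple analytical delay model. The sketch below is not the Spectre/MOS Model 11 setup used for Figures 1.2 and 1.3; it assumes the alpha-power-law approximation, speed ∝ (VDD−Vth)^α / VDD, ignores temperature, and uses purely hypothetical values for VDD, Vth, the slow-process Vth shift and α.

```python
def relative_speed(vdd, vth, alpha=1.3):
    """Alpha-power-law speed estimate: f ~ (VDD - Vth)^alpha / VDD (assumed model)."""
    return (vdd - vth) ** alpha / vdd

def decrease(speed, reference):
    """Fractional performance decrease of `speed` with respect to `reference`."""
    return 1.0 - speed / reference

# Hypothetical values, chosen for illustration only.
VDD_NOM, VTH_NOM = 1.2, 0.40
VTH_SLOW = 0.46            # slow-process threshold voltage
VDD_LOW = 0.9 * VDD_NOM    # VDD - 10%

nominal = relative_speed(VDD_NOM, VTH_NOM)
process_only = decrease(relative_speed(VDD_NOM, VTH_SLOW), nominal)
voltage_only = decrease(relative_speed(VDD_LOW, VTH_NOM), nominal)
combined = decrease(relative_speed(VDD_LOW, VTH_SLOW), nominal)

# Process impact evaluated at the reduced supply instead of the nominal one.
process_at_low_vdd = decrease(relative_speed(VDD_LOW, VTH_SLOW),
                              relative_speed(VDD_LOW, VTH_NOM))

print(f"process only:        {process_only:.1%}")
print(f"voltage only:        {voltage_only:.1%}")
print(f"sum of the two:      {process_only + voltage_only:.1%}")
print(f"combined:            {combined:.1%}")
print(f"process at VDD-10%:  {process_at_low_vdd:.1%}")
```

In this model the slow-process penalty grows when the over-drive voltage VDD−Vth is already reduced by the supply margin, which is the interaction described in the text.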

There exists a fundamental trade-off between circuit area, performance and power consumption of digital circuit designs. For example, high-performance digital circuits are larger and consume more power due to the application of speed optimization techniques such as logic re-structuring and remapping, buffering and resizing. This trade-off between area, performance and power is influenced by operating conditions as well as by the applied design margining. Design margining includes PVT margins and extra timing margins like clock uncertainty and OCV. Figure 1.4 shows a qualitative example, for a generic digital circuit design, of how the area-clock period trade-off curve is affected by different process and timing margins. Observe the circuit area dependency on the process margin. The up-sizing of a slow-process design starts at a lower clock frequency than for the other process conditions. This is simply because of the lower intrinsic speed of digital gates in the slow-process case. If a lower process margin can be tolerated without impacting parametric yield, circuit performance can be further increased before area up-scaling sets in. Also observe that a timing margin like clock uncertainty can have a significant impact on circuit area in high-performance circuits. To avoid spending area unnecessarily, it is important to set timing margins as low as possible while still guaranteeing that the circuit meets timing under all corner cases. Finally, area over-dimensioning increases silicon costs as well as the overall circuit power, and should therefore be avoided.

[Plot: circuit area versus clock period for slow, nominal and fast process at constant VDD and T. Annotations: target performance, process margin, timing margin, WCD.]

Figure 1.4 Impact of process and timing margin on circuit area and clock period.


1.3 Techniques for Mitigating Variability and Design Margin

1.3.1 Design-Time Techniques

WCD can effectively deal with process and environmental variability through design margining. However, excessive use of design margining makes design specifications harder to meet due to the associated over-dimensioning of the design. Over-dimensioning leads to a larger silicon footprint, higher power consumption and larger leakage. While safe, WCD is prohibitively conservative, since it assumes full correlation of all variability sources by setting all parameter values to their worst-case (or best-case) values. However, many design parameters are totally uncorrelated. For example, transistor channel length variations (which are lithography based) are uncorrelated to Vth variations (which are doping based). Moreover, it is rare that all process parameter values are worst-case or best-case. Such extreme corners hardly occur in fabricated chips. Instead, most chips will display a performance centered around the nominal-process design. By pursuing WCD, one penalizes performance to cover low-probability problems. To illustrate this, Figure 1.5 shows experimental results on the process-dependent spread in oscillation frequency for a 205-stage Fanout-1 (FO1) inverter-based ring-oscillator in 90nm HVT LP-CMOS. The measurements have been performed for a production volume of 3.3 million die samples at VDD=1.2V and room temperature. Circuit simulations revealed corner frequencies of 79.5MHz, 101.1MHz, and 130MHz for the slow, nominal and fast process corners, respectively. Observe in Figure 1.5 that the measured frequency distribution is centered around the nominal frequency. The 3σ values of 89.8MHz and 112.4MHz are located at a significant distance from the specified slow and fast process corners. This demonstrates the pessimism in timing performance when utilizing WCD.

[Plot: measured ring-oscillator frequency distribution. Annotations: slow process corner frequency: 79.5MHz; fast process corner frequency: 130MHz.]

Figure 1.5 Measured performance spread of a 90nm LP-CMOS ring-oscillator.

Data for a production volume of 3.3 million samples at VDD=1.2V and room temperature.
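The distance between the process corners and the measured spread can be quantified from the numbers quoted above. The sketch below assumes, purely for illustration, that the measured distribution is Gaussian and that 89.8MHz and 112.4MHz are its −3σ and +3σ points; the real distribution in Figure 1.5 need not be exactly normal.

```python
from statistics import NormalDist

# Simulated process corner frequencies quoted in the text (MHz).
SLOW_CORNER, NOMINAL_CORNER, FAST_CORNER = 79.5, 101.1, 130.0
# Measured 3-sigma frequencies quoted in the text (MHz).
MINUS_3SIGMA, PLUS_3SIGMA = 89.8, 112.4

# Assumption: symmetric Gaussian distribution between the two 3-sigma points.
mean = (MINUS_3SIGMA + PLUS_3SIGMA) / 2
sigma = (PLUS_3SIGMA - MINUS_3SIGMA) / 6

slow_distance = (mean - SLOW_CORNER) / sigma   # corner distance in units of sigma
fast_distance = (FAST_CORNER - mean) / sigma

# Fraction of dies expected to be slower than the slow corner under this assumption.
fraction_below_slow = NormalDist(mean, sigma).cdf(SLOW_CORNER)

print(f"estimated mean:  {mean:.1f} MHz, sigma: {sigma:.2f} MHz")
print(f"slow corner at {slow_distance:.1f} sigma, fast corner at {fast_distance:.1f} sigma")
print(f"fraction of dies below the slow corner: {fraction_below_slow:.1e}")
```

Under these assumptions the estimated mean coincides with the simulated nominal corner of 101.1MHz, and both corners lie well beyond the 3σ points of the measured population.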

In recent years, statistical static timing analysis (SSTA) has emerged as a potential candidate to replace traditional STA for maximizing product yield and improving timing accuracy [14][15]. In SSTA, the circuit delay and the arrival times are considered as random variables. The technique uses cumulative probability distribution functions to model arrival times, and probability density functions to model gate delays. These probability distributions can be used to model variations in transistors and interconnects that are not handled by traditional STA. As a result, SSTA provides “timing yields”, i.e. probabilities for a chip to meet its timing specifications. Compared to STA, the technique can model the factors affecting process variation in a single analysis run. This not only eliminates the need for corners but also removes much of their inherent pessimism. Despite these benefits, SSTA has not fully found its way into industrial practice. Reasons include, among others, the moving average of process parameters, the flexibility of fabricating the same chip in multiple foundries, the small returns compared to an intelligent corner selection approach, and the lack of appropriate Electronic Design Automation (EDA) tools for statistical logic synthesis.

1.3.2 Silicon-Time Techniques

Post-silicon tuning refers to a change in the operating conditions of the Integrated Circuit after fabrication. In this area, the main ideas for reducing variability effects are related to adaptive supply voltage and body biasing approaches [16][17][18][19]. Voltage scaling (VS) refers to the modification of VDD, while body bias (BB) refers to the adaptation of the Vth of the transistors. Both techniques impact device current, and accordingly, the performance and power consumption of the Integrated Circuit. The three most influential techniques for silicon tuning over the last 10 years are highlighted in the remainder of this section by way of introduction to the topic.

In 1998, Miyazaki et al. proposed the use of adaptive body bias (ABB) control to keep circuit delay constant irrespective of D2D process variability [18]. Body biasing was applied to the whole digital Integrated Circuit. The proposed control scheme makes use of a delay line as silicon speed sensor, a control function that implements compare and decode functions, and body bias generators, as depicted in Figure 1.6. Like the digital circuit, the delay line is body-bias controlled. The control scheme operates as follows. An external reference clock is provided as input. The comparator measures the delay difference between this reference clock and its delayed version; the decoder then converts the amount of delay into a register address. Based on the decoder output, the body bias generators provide body bias voltages that keep the delay of the delay line constant. It is assumed that the delay of the digital circuit is proportional to the delay of the delay line. In this way, the body bias of the circuit is adapted to correct the influence of process variability on circuit timing.

Figure 1.6 Conceptual diagram of an adaptive body bias control system.

© IEEE [18]

In 2002, Tschanz et al. proposed a similar solution while applying body biasing to local circuit parts [19]. This enabled not only the reduction of D2D process variability, but also a partial reduction of WID process variability. They utilized body biasing to increase the operating frequency of microprocessor chips such that more chips can be placed in the highest frequency bin. Figure 1.7 shows experimental results of frequency versus leakage for 62 die samples of a 150nm CMOS processor test-chip that utilizes body biasing. The process-induced frequency variation, σ/µ, could be significantly reduced when applying body bias to the whole design (ABB in Figure 1.7). It resulted in a larger number of accepted dies and more dies in the highest frequency bin than in the case of nominal body bias (NBB). The application of multiple body bias voltages within the design (WID-ABB in Figure 1.7) was shown to be even more effective. It resulted in a further reduction in frequency variation, while again all dies were accepted and most of them fell in the highest frequency bin. Although body biasing impacts leakage, it was shown that this impact was smallest for WID-ABB. By using multiple body bias voltages per die, precise control over the die frequency and leakage was found to be possible.

Figure 1.7 Concept of frequency binning with body biasing.

Frequency bin improvement for ABB (left) and WID-ABB (right). © IEEE [19]

In general, it has been shown that VS can be used during silicon testing to improve product binning yields with an effectiveness comparable to that of BB schemes [16]. VS requires changing the fixed supply voltage to an adjustable one, and the addition of optimization time during silicon testing. Combining voltage scaling and body biasing can also be very effective at increasing yields. Researchers have proposed several voltage scaling approaches ranging from open-loop dynamic voltage scaling (DVS) to closed-loop adaptive voltage scaling (AVS) [20][21].

More recently, a new technique to eliminate worst-case PVT margins has been proposed [22][23]. It relies on supply voltage management for voltage-scalable digital circuits, based on in-situ error detection and correction. This technique, coined Razor, is based on a timing-error-detecting flip-flop that is part of speed-critical circuit paths. The simplified circuit diagram of the second-generation Razor flip-flop and its corresponding timing diagram are illustrated in Figure 1.8. The RazorII flip-flop is based on a latch, and is equipped with a detection clock generator and a transition detector. Timing errors are detected by monitoring the internal latch node for spurious transitions. A valid transition occurs when data is set up at the latch input before the rising edge of the clock. The data appears at the output node of the latch after its clock-to-Q delay. A short negative pulse on the detection clock is used to disable the transition detector for at least the duration of the clock-to-Q delay of the latch after the rising edge of the clock. An invalid transition is a transition that occurs while both clock and detection clock are at logic ‘1’. In this case, an error signal is generated, which can be used to engage a mechanism to restore the correct state. The work in [23] demonstrated the application of the RazorII flip-flop combined with a conventional architectural replay mechanism to recover from timing errors in a voltage-scalable high-performance processor.


Figure 1.8 RazorII flip-flop schematic and the corresponding timing diagram.

© IEEE [23]

The main benefit of the Razor approach is that it can compensate at run-time for the timing impact of worst-case design margining. Irrespective of process and environmental variability, it scales the supply voltage to the point of first timing failure for a given chip at a given operating frequency. This results in significant energy savings. In addition, the supply voltage can be scaled even lower than the first failure point, deliberately tolerating a targeted timing error rate. In this way, Razor enables trading off the overhead of error correction against the additional energy savings due to sub-critical operation. Drawbacks of the Razor approach are the considerable area and architectural overhead required to achieve the error correction, and the need to set a minimum delay on the logic stages of a pipeline. The downside of silicon tuning approaches is that they do not address the over-dimensioning of the design.

1.4 Thesis Contributions

To continue the digital design success in nanometer CMOS, cost-effective variation-tolerant design approaches are needed that guarantee circuit robustness in the presence of variability influences while avoiding over-dimensioning of the design. In this thesis a new gate-level synthesis strategy for digital CMOS circuits is proposed. It makes use of forward body biasing (FBB), and is referred to as the body bias driven (BBD) design strategy. BBD design is a design centering methodology consisting of two design phases, namely pre-silicon design optimization and post-silicon tuning. BBD design is used to reduce the impact of process parameter spread on design behaviour, thereby enabling design with reduced design margins. Consequently, pre-silicon design optimization is done by selecting the appropriate synthesis point in between worst-case and nominal process conditions, given a silicon tuning range based on body biasing. Post-silicon tuning includes on-chip capabilities to correct performance deviations due to fabrication outcome. This thesis presents the front-end and back-end aspects of the proposed BBD design strategy and its application to high-performance and ultra-low-power applications.


[Plot: axes clock period and circuit area; curves for body bias driven design and conventional worst-case design at constant VDD and T, with the target performance marked. Arrow A: improving performance. Arrow B: reducing area.]

Figure 1.9 Body bias driven versus conventional worst-case design synthesis.

BBD design synthesis enables circuit design with reduced design margining by accounting for a given FBB tuning range. This can be translated into improved circuit performance, lower power operation, or reduced circuit area. Figure 1.9 gives a qualitative example of how BBD design can be utilized to improve circuit performance and/or circuit area as compared to conventional WCD. For a given circuit area, BBD design can achieve higher speed than WCD by utilizing FBB for achieving excess performance (arrow A). BBD design can also achieve smaller circuit solutions at a given clock period by utilizing FBB for constraining circuit upsizing (arrow B). In this work, the benefits and implications of BBD design have been investigated in different application scenarios: high-performance and ultra-low-power designs.

The benefits of BBD design depend on the available silicon tuning range. Although researchers have already proven the benefits of supply voltage scaling and body biasing, the effectiveness of these techniques in state-of-the-art CMOS technologies was unclear. The experimental work explored the technological boundaries of supply voltage scaling and body bias tuning through silicon experiments in 90nm, 65nm, and 45nm CMOS processes. In particular, it has been investigated how much power savings can be expected, which power-performance tradeoffs can be made, and to what extent process-dependent performance compensation can be accomplished. This is discussed in Chapter 3.

BBD design relies on the chip having silicon tuning capabilities. The trend towards higher integration densities in modern chips favours a fully integrated silicon tuning solution. For this purpose, a new FBB generator concept for high-performance digital circuits was developed under this research activity. The design has been implemented in 90nm CMOS, and its operation has been experimentally verified. This is addressed in Chapter 4.

Having such a BBD design approach available is not sufficient to avoid over-dimensioning of the design. Therefore, a design optimization strategy was developed that is based on a performance-per-area (PPA) metric. A maximum PPA design represents the fastest design possible without circuit over-dimensioning. An in-depth analysis of the PPA design theory is presented in this thesis. This theory allows us to predict the design’s maximum PPA with a minimum number of synthesis trials. With the PPA design theory available, designers have a means to judge how efficiently the circuit design is implemented in terms of speed and area. This new theory is presented in Chapter 5 for high-performance digital CMOS circuits and in Chapter 6 for ultra-low-power digital CMOS circuits.

To successfully achieve a target circuit performance, it is not necessary that all digital gates in the design are body-biased. In this research activity, it has been investigated how to effectively partition the design into a body-biased part and a non-body-biased part. The motivation for such partitioning is to relax the design constraints of the embedded FBB generator design. A physical design approach for the implementation of body bias islands in the layout has been developed as well. This body bias clustering approach is presented in Chapter 7.

Finally, the proposed BBD design strategy was used for the implementation of a mixed-signal system-on-chip (SoC) test-chip in 90nm CMOS. Experimental results are provided that demonstrate the effectiveness of BBD design with body bias clustering in a realistic chip vehicle. This work is also shown in Chapter 7.


Chapter 2

Models for Body Biased Digital Design

Introducing digital design styles that exploit the presence of body biasing requires detailed understanding of body biased circuit behaviour. Circuit level models that use (non-)linear derating functions were developed to account for the presence of body biasing. In other words, the influence of body biasing on physical MOS transistor parameters impacting circuit performance and power consumption has been analyzed. Next to this, a circuit model that can bind silicon area to circuit speed was developed too, as will be seen throughout the chapter. The models presented in this chapter are used for the design of the FBB generator (Chapter 4), and the optimization techniques explained in Chapter 5 and Chapter 6.

2.1 Circuit Models with Body Bias

In this section body biasing is introduced into delay, leakage and power consumption models in the form of (non-)linear derating functions that can easily be fitted to a technology node.

2.1.1 Propagation Delay Model

The Vth of a MOSFET is usually defined as the gate voltage at which an inversion layer forms at the interface between the insulating layer (oxide) and the transistor’s body. Body biasing impacts Vth by changing the transistor’s body voltage with respect to its source. Under body bias conditions, the Vth of an NMOS transistor can be approximated by using the well-known Shichman-Hodges model, which is associated with the formation of a conducting inversion layer at the source end [24]:

$$V_{th} = V_{th0} + \Delta V_{th} = V_{th0} + \gamma\left(\sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F}\right) \tag{2.1}$$

where Vth0 is the threshold voltage without body bias applied, and ∆Vth is the change in Vth due to body biasing. γ is the body effect coefficient, 2φF is the surface potential in strong inversion, and VSB is the source-to-body voltage (VSB>0 for RBB, and VSB<0 for FBB). A similar expression holds for the PMOS transistor. Although the Shichman-Hodges model is a simplified first-order model suitable for long-channel transistors only, it serves to illustrate that Vth depends non-linearly on the applied body bias. The Vth reduces with FBB and, conversely, increases with RBB. Vth is strongly process dependent. Figure 2.1 and Figure 2.2 show Vth as a function of body bias voltage for an NMOS and a PMOS transistor in 90nm standard-Vth (SVT) Low-Power (LP) CMOS, respectively. The results are shown for the traditional process corner conditions and have been obtained through Spectre circuit simulations, using the MOS Model 11 transistor model for the respective technology node to account for all short-channel effects. Observe that the actual value of Vth and its sensitivity to body bias strongly depend on the process corner: fast, nominal, or slow. For the nominal NMOS device, body biasing from 0.5V (FBB) down to -1.2V (RBB) spans a Vth range of about 255mV. This range is somewhat smaller for PMOS devices (~237mV). In the next chapter, the impact of these Vth ranges on circuit power-performance tuning will be quantified for various process technology nodes.
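The non-linear dependence of expression (2.1) on body bias can be made concrete with a minimal Python sketch. The parameter values (Vth0 = 0.40 V, γ = 0.3 √V, 2φF = 0.8 V) are illustrative assumptions, not values fitted to the 90nm process discussed here, and the first-order model ignores the short-channel effects that the MOS Model 11 simulations capture.

```python
import math


def vth_body_effect(vth0, gamma, two_phi_f, v_sb):
    """First-order threshold voltage of expression (2.1).

    v_sb > 0 corresponds to RBB, v_sb < 0 to FBB (NMOS convention of the text).
    """
    if two_phi_f + v_sb < 0:
        raise ValueError("model undefined for V_SB below -2*phi_F")
    return vth0 + gamma * (math.sqrt(two_phi_f + v_sb) - math.sqrt(two_phi_f))


# Hypothetical parameters, chosen only for illustration.
VTH0, GAMMA, TWO_PHI_F = 0.40, 0.3, 0.8
for v_sb in (-0.5, 0.0, 0.5, 1.2):
    vth = vth_body_effect(VTH0, GAMMA, TWO_PHI_F, v_sb)
    print(f"V_SB = {v_sb:+.1f} V -> Vth = {vth:.3f} V")
```

With these assumed parameters, 0.5 V of FBB lowers Vth by more than 0.5 V of RBB raises it, which is the square-root non-linearity of the model.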

[Plot: threshold voltage [V] (0 to 0.7) versus body-to-source voltage [V] (-1.2 to 0.6) for the slow, nominal and fast corners, with the RBB and FBB regions marked. 90nm SVT LP-CMOS, NMOS W/L = 1µm/0.1µm, VDS=50mV, VGS=1.2V, T=25°C.]

Figure 2.1 NMOS Vth versus body biasing in 90nm SVT LP-CMOS.

[Plot: threshold voltage [V] (0 to 0.7) versus body-to-source voltage [V] (-0.6 to 1.2) for the slow, nominal and fast corners, with the FBB and RBB regions marked. 90nm SVT LP-CMOS, PMOS W/L = 1µm/0.1µm, VDS=-50mV, VGS=-1.2V, T=25°C.]

Figure 2.2 PMOS Vth versus body biasing in 90nm SVT LP-CMOS.

Another important parameter for propagation delay is the intrinsic capacitance of a digital gate. The intrinsic capacitance consists of two components, namely junction capacitance and gate-drain(source) capacitance of both PMOS and NMOS transistors. Only the junction capacitance portion is body bias dependent. It concerns the capacitance of the drain-to-body and body-to-source junction diodes. The voltage dependence of the capacitance of a generic junction diode can be expressed as follows [25]:

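$$C_j = \frac{A_j C_{j0}}{\left(1 - \dfrac{V_{diode}}{\phi_0}\right)^{m}}, \qquad C_{j0} = \sqrt{\frac{\varepsilon_{si}\, q}{2}\,\frac{N_a N_d}{N_a + N_d}\,\frac{1}{\phi_0}}, \qquad \forall\, V_{diode} < \phi_0 \tag{2.2}$$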

where Cj0 is the junction capacitance at zero bias, Aj is the junction area, Vdiode is the anode-cathode diode voltage, i.e. the bulk-source (bulk-drain) voltage VBS (VBD), φ0 is the built-in voltage across the junction, and m is the grading coefficient of the junction, for example m=1/2 for abrupt and m=1/3 for linearly graded junction profiles. Parameter εsi is the permittivity of silicon, and Na and Nd are the doping concentrations in the p and n regions of the junction, respectively. Observe from (2.2) that the junction capacitance increases with FBB (Vdiode>0). This is because the depletion layer narrows when the junction is forward biased. Conversely, the junction capacitance reduces with RBB (Vdiode<0).

Figure 2.3 shows the simulated intrinsic (output node) capacitance of a minimum-sized CMOS inverter as a function of body biasing in 90nm SVT LP-CMOS. As before, the simulation has been performed with Spectre using the MOS Model 11 transistor model. In line with the junction capacitance behaviour described above, the intrinsic capacitance increases with FBB: it rises from 3.3fF to 4.7fF when applying 0.5V FBB with respect to a nominal body biased device. This corresponds to a capacitance increase of about 40%. The capacitance decreases from 3.3fF to 3fF when 1.2V RBB is applied. Nevertheless, the total (output node) capacitance change is smaller because of the presence of additional capacitances in a circuit that are not body bias dependent, such as interconnect and fan-in capacitance. For a FO4 inverter, the total capacitance increases by about 8% and decreases by about 2% for 0.5V FBB and 1.2V RBB, respectively.
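As a rough illustration of expression (2.2), the sketch below evaluates the junction capacitance for a normalized junction and recomputes the relative capacitance changes quoted in the text. The built-in voltage of 0.7 V and the abrupt-junction grading coefficient are assumptions chosen for the example; only the 3.3fF, 4.7fF and 3fF figures come from the text.

```python
def junction_capacitance(c_j0, area, v_diode, phi0, m=0.5):
    """Junction capacitance of expression (2.2); valid for v_diode < phi0."""
    if v_diode >= phi0:
        raise ValueError("expression (2.2) requires V_diode < phi0")
    return area * c_j0 / (1.0 - v_diode / phi0) ** m


def relative_change(before, after):
    """Relative change of a quantity, e.g. 0.10 for +10%."""
    return (after - before) / before


# Normalized junction (c_j0 = area = 1) with an assumed built-in voltage.
PHI0 = 0.7
print(junction_capacitance(1.0, 1.0, 0.5, PHI0))    # FBB: above 1
print(junction_capacitance(1.0, 1.0, -1.2, PHI0))   # RBB: below 1

# Simulated inverter values quoted in the text (in fF).
print(f"{relative_change(3.3, 4.7):+.1%} for 0.5V FBB")
print(f"{relative_change(3.3, 3.0):+.1%} for 1.2V RBB")
```

The single-junction model changes far more strongly than the simulated inverter capacitance, since the inverter's intrinsic capacitance also contains gate-drain(source) components that are not body bias dependent.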

[Plot: intrinsic capacitance [fF] (0 to 12) versus body bias voltage [V] (-1.2 to 0.6) for the slow, nominal and fast corners, with the RBB and FBB regions marked. Minimum-size CMOS inverter, 90nm SVT LP-CMOS, VDD=1.2V, T=25°C.]

Figure 2.3 Inverter intrinsic capacitance versus body biasing.

The propagation delay of a digital gate is dependent on both Vth and intrinsic capacitance of the digital gate. Since these parameters are body bias dependent, also the propagation delay is body bias dependent. Conventionally, the propagation delay of a digital gate can be modelled by using Sakurai’s empirical alpha-power model [26]:


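$$d_{gate} = \frac{C_{load} V_{DD}}{I_{drive}} = \frac{\left(x C_{intr} + C_{extr}\right) V_{DD}}{x \beta \left(V_{DD} - V_{th}\right)^{\alpha}}, \qquad \forall\, V_{DD} > V_{th} \tag{2.3}$$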

where Cintr and Cextr are the intrinsic and extrinsic load capacitance of the gate, respectively, x is the gate sizing factor (x≥1), β(VDD-Vth)^α is the average driving current of the gate, and α is a parameter that models velocity saturation. The model analyzes NMOS drain-source current while neglecting the PMOS drain-source current for a rising input transition, and vice versa for a falling input transition; these assumptions are valid for fast switching circuits [26]. Expression (2.3) is valid as long as transistors are in super-threshold operation (VDD>Vth). Recall that FBB increases Cintr and decreases Vth. The gate delay decreases with FBB because the impact of Vth reduction on the delay dominates the impact of Cintr increase. Contrarily, the gate delay increases with RBB. This is illustrated in Figure 2.4 in which gate delay measurements for a CMOS inverter are plotted as function of body biasing. The results have been extracted from measurements of a 101-stage inverter-based ring-oscillator in 90nm SVT LP-CMOS. In total, 69 chip samples have been measured. The symbols in Figure 2.4 correspond to the results obtained for the median chip sample. The error bars indicate the fastest and slowest chip sample of the measured sample set.
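A small sketch of expression (2.3) may help to see how Vth and the sizing factor x enter the delay. All numeric values below (capacitances, β, α) are hypothetical and carry no process meaning; the function name `alpha_power_delay` is likewise an assumption of this example.

```python
def alpha_power_delay(c_intr, c_extr, vdd, vth, beta, alpha, x=1.0):
    """Gate delay of expression (2.3), valid for super-threshold operation."""
    if vdd <= vth:
        raise ValueError("expression (2.3) requires VDD > Vth")
    if x < 1.0:
        raise ValueError("gate sizing factor x must be >= 1")
    c_load = x * c_intr + c_extr                 # total switched capacitance
    i_drive = x * beta * (vdd - vth) ** alpha    # average driving current
    return c_load * vdd / i_drive


# Hypothetical inverter: 3.3 fF intrinsic, 10 fF extrinsic load.
base = dict(c_intr=3.3e-15, c_extr=10e-15, vdd=1.2, beta=1e-4, alpha=1.3)
d_nominal = alpha_power_delay(vth=0.40, **base)
d_low_vth = alpha_power_delay(vth=0.30, **base)       # Vth lowered, as by FBB
d_upsized = alpha_power_delay(vth=0.40, x=4.0, **base)
print(d_low_vth / d_nominal)   # below 1: lower Vth shortens the delay
print(d_upsized / d_nominal)   # below 1: upsizing helps with an extrinsic load
```

The sketch keeps Cintr fixed when Vth changes; under real FBB, Cintr rises as well, which partly offsets the gain, as noted in the text.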

[Plot: propagation delay [ps] (0 to 30) versus body bias (-1.2 to 0.6 V), with the RBB and FBB regions marked. Minimum-sized FO1 inverter, 90nm SVT LP-CMOS, 69 chip samples, VDD=1.2V, room temperature. Model: delay ∝ (1+kVBB).]

Figure 2.4 FO1 inverter propagation delay versus body biasing.

Inverter delay extracted from ring-oscillator measurements.

Symbols: median sample, error bars: fastest and slowest sample, solid line: model.

Observe from Figure 2.4 that the gate delay depends linearly on body biasing. The sensitivity of gate delay to body biasing is somewhat process-dependent: it is highest for the slowest sample and lowest for the fastest sample. Based on this insight, one can model the delay of a digital gate as a function of body biasing as follows:

$$d_{gate} = \left(d_0 + \frac{d_1}{x}\right)\left(1 + k V_{BB}\right) \tag{2.4}$$

The first term represents the intrinsic gate delay (d0=Cintr·VDD·Idrive^-1) and extrinsic (or fan-out dependent) gate delay (d1=Cextr·VDD·Idrive^-1) [27]. This term can be obtained from expression (2.3) at nominal body bias conditions. The second term models the impact of body biasing on gate delay by a linear function. VBB represents the body bias voltage value: VBB=Vpwell=VDD-Vnwell. Parameter k is a fitting parameter, which depends on process, VDD and Vth-option. Moreover, k can be different for each digital gate. The delay of a minimum-sized CMOS inverter in 90nm SVT LP-CMOS for nominal process, VDD and temperature conditions was simulated for analysis purposes. Under these conditions, one can observe a maximum error below 2% when using expression (2.4) for a body bias range from 1.2V RBB up to 0.5V FBB. Expression (2.4) has also been calibrated for the gate delay measurements, shown in Figure 2.4. Observe the close match between the model and experimental results (k = -5.84 ps/V).
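The linear derating of expression (2.4) is simple enough to sketch directly, together with the sum of gate delays along a path. The delays d0, d1 and the relative coefficient k (in 1/V) below are hypothetical; note that the text quotes k as an absolute slope in ps/V, so treating k as a relative coefficient here is an assumption about normalization.

```python
def derated_gate_delay(d0, d1, x, k, v_bb):
    """Gate delay of expression (2.4); v_bb > 0 is FBB, v_bb < 0 is RBB."""
    return (d0 + d1 / x) * (1.0 + k * v_bb)


def path_delay(gates, v_bb):
    """Sum of derated gate delays along one path.

    gates: iterable of (d0, d1, x, k) tuples, one per gate on the path.
    """
    return sum(derated_gate_delay(d0, d1, x, k, v_bb) for d0, d1, x, k in gates)


# Hypothetical two-gate path, delays in ps, k in 1/V.
path = [(5.0, 10.0, 1.0, -0.3), (5.0, 10.0, 2.0, -0.3)]
for v_bb in (-1.2, 0.0, 0.5):
    print(f"VBB = {v_bb:+.1f} V -> path delay = {path_delay(path, v_bb):.2f} ps")
```

A negative k makes the delay fall under FBB and rise under RBB, matching the measured trend.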

Based on expression (2.4), the path delays of a CMOS digital logic circuit are modelled as follows:

$$D_j = \sum_{i \in j} \left(d_{0i} + \frac{d_{1i}}{x_i}\right)\left(1 + k_i V_{BB}\right), \qquad \forall\, j \in \Psi \tag{2.5}$$

where i is an index that runs over all gates in the circuit, j is an index that runs over all paths in the circuit, Dj is the delay of path j, and Ψ is the collection of all paths in the circuit.

2.1.2 Leakage Current Model

The leakage current of a digital gate consists of various leakage components [28]. In the following, four different types of leakage current are discussed, namely: 1) sub-threshold leakage, 2) gate-oxide leakage, 3) gate-induced drain leakage, and 4) junction leakage. These four leakage mechanisms are illustrated in Figure 2.5. Other leakage components were found to have a small impact on the overall transistor leakage, even under body bias conditions.

[Diagram: cross-section of an NMOS transistor (gate, source, drain, n+ regions, P-well) with the leakage paths I1: sub-threshold leakage, I2: gate-oxide leakage, I3: gate induced drain leakage, I4: junction leakage.]

Figure 2.5 Considered leakage mechanisms in deep-submicron transistors.

Sub-threshold leakage (I1)

The dominant leakage component in modern short-channel MOS transistors is sub-threshold leakage, which is the current that flows between the drain and source of a MOSFET when the transistor operates in weak inversion (VGS<Vth). The sub-threshold leakage depends on transistor size, VDD, Vth and temperature. The following expression relates the sub-threshold leakage to other device parameters [28]:


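$$I_{sub\text{-}threshold} = I_0\, e^{\frac{V_{GS} - V_{th} + \eta V_{DS}}{m U_T}}\left(1 - e^{-\frac{V_{DS}}{U_T}}\right), \qquad I_0 = \mu C_{ox} \frac{W}{L} U_T^{2}\, e^{1.8} \tag{2.6}$$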

where η is the Drain Induced Barrier Lowering (DIBL) coefficient, m is the sub-threshold slope factor, UT is the thermal voltage, and µ denotes carrier mobility. Recall that body biasing impacts the Vth of transistors. Observe in expression (2.6) the exponential dependency between sub-threshold leakage and Vth. This shows that body biasing has a large impact on sub-threshold leakage.

Gate-oxide leakage (I2)

Gate-oxide leakage can significantly contribute to the overall leakage current of a MOS transistor. The current is caused by direct tunnelling as a result of the high field across the thin gate-oxide. When the transistor is in the on-state, the gate-oxide leakage is largest and appears between the transistor’s gate and channel. When the transistor is in the off-state, the gate-oxide leakage appears between the transistor’s drain and gate, in case of a voltage difference between these two terminals. A thinner oxide and a higher VDD enhance the electric field and therefore increase gate-oxide leakage. Gate-oxide leakage is not affected by body biasing.

Gate induced drain leakage (I3)

Another prominent leakage component is gate induced drain leakage (GIDL). GIDL is due to the high electric field at the drain side of the MOS transistor; it causes depletion at the drain region below the gate-drain overlap region. GIDL occurs at a low VGS and high VDS bias and generates carriers into the substrate and drain from surface traps or band-to-band tunnelling (BTBT). As with gate-oxide leakage, a thinner oxide and a higher VDD (higher potential between gate and drain) increase GIDL. GIDL is also known as surface BTBT current. The BTBT current strongly depends on RBB, as will be shown in the next paragraph.

Junction leakage (I4)

The contribution of junction leakage to the overall leakage current can normally be neglected at NBB conditions. However, this may be no longer the case when body biasing is applied. The junction leakage in a modern CMOS technology consists mainly of two different components: 1) ideal diode current and 2) non-ideal diode current [29]. The ideal diode current can be described with Shockley’s diode equation, which relates the diode current Idiode to the diode voltage Vdiode:

$$I_{diode} = I_S\left(e^{\frac{V_{diode}}{U_T}} - 1\right) \tag{2.7}$$

Parameter IS represents the saturation current of the diode; it is proportional to the area of the diode. UT is the thermal voltage of about 26mV at room temperature. The ideal diode current starts growing exponentially under forward biased conditions, Vdiode>0, e.g. FBB. The -1 term can be ignored when Vdiode >> UT. For the reverse bias case, the ideal diode current becomes equal to the diode’s saturation current, IS.
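The two limiting cases of expression (2.7) can be checked numerically with a short sketch. The saturation current is normalized to 1 and the thermal voltage is taken as the 26mV quoted above; the function name is an assumption of this example.

```python
import math


def ideal_diode_current(i_s, v_diode, u_t=0.026):
    """Shockley diode current of expression (2.7).

    math.expm1(x) computes exp(x) - 1 without losing precision near x = 0.
    """
    return i_s * math.expm1(v_diode / u_t)


# Normalized saturation current (I_S = 1).
print(ideal_diode_current(1.0, 0.5))    # forward bias (FBB): exponential growth
print(ideal_diode_current(1.0, -1.2))   # reverse bias (RBB): close to -I_S
```

Under reverse bias the current saturates at a magnitude of IS (the sign only reflects the current direction), while 0.5 V of forward bias already exceeds IS by many orders of magnitude.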


The non-ideal diode current can be described by the sum of Shockley-Read-Hall generation/recombination (SRH), trap-assisted tunnelling (TAT), and band-to-band tunnelling (BTBT) currents [29][30][31]. These are the physical current components that are responsible for the increasing junction leakage in modern scaled CMOS technologies. The SRH generation and recombination of charge carriers at depletion layer traps gives rise to deviations from the ideal diode behaviour. The SRH current density of a p-n junction can be described as follows [30]:

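$$J_{SRH} \approx \frac{q\, n_i W}{2\tau}\, e^{\frac{V_{diode}}{2 U_T}} \tag{2.8}$$

$$W = \sqrt{\frac{2\varepsilon_{si}}{q}\,\frac{N_a + N_d}{N_a N_d}\left(\phi_0 - V_{diode}\right)} \tag{2.9}$$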

where q is the electron charge, W is the width of the depletion region, which is a function of the applied bias voltage, ni is the intrinsic carrier concentration, τ is the carrier lifetime, εsi is the permittivity of silicon, Na and Nd are the doping concentrations in the p and n region, respectively, and φ0 is the built-in voltage across the junction. In forward bias mode, SRH current is based on a net recombination rate, which leads to a non-ideal additional current at low forward bias. In reverse bias mode, SRH current is based on a net generation rate and causes additional leakage. At higher fields, SRH is enhanced by a TAT current due to tunnelling of electrons via trap states in the depletion region to empty band states at the other side of the junction under the influence of the applied electric field. The TAT current density of a p-n junction can be described as follows [30][31]:

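$$J_{TAT} = \frac{q^{2} m^{*} M^{2} W N_T}{8\pi\hbar^{3}\left(E_g - E_T\right)}\;\xi\; e^{-\frac{4\sqrt{2m^{*}}\left(E_g - E_T\right)^{3/2}}{3 q \xi \hbar}} \tag{2.10}$$

$$\xi = \sqrt{\frac{2q}{\varepsilon_{si}}\,\frac{N_a N_d}{N_a + N_d}\left(\phi_0 - V_{diode}\right)} \tag{2.11}$$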

where Eg is the silicon band-gap energy, ET is the energy (in eV) corresponding to the trap centers, measured from the top of the valence band, m* is the effective mass, M is the matrix element associated with the trap potential, NT is the density of traps occupied by electrons, ħ is Planck’s constant divided by 2π, and ξ is the maximum electric field across the depletion region, which is bias voltage dependent. The TAT current increases for an increasing reverse bias across the junction. At even higher fields, the BTBT current will become the most important junction leakage component in reverse biased junctions. It is due to the direct tunnelling of, e.g., electrons from the valence band of the p-region to the conduction band of the n-region, thereby causing the generation of holes in the p-region. The BTBT current density of a p-n junction can be described by [30]:

$$J_{BTBT} = \frac{\sqrt{2m^{*}}\, q^{3}\, \xi\, V_{diode}}{4\pi^{3}\hbar^{2}\sqrt{E_g}}\; e^{-\frac{4\sqrt{2m^{*}}\, E_g^{3/2}}{3 q \xi \hbar}} \tag{2.12}$$


BTBT current has a dominant exponential dependence on diode bias voltage through the electric field across the junction. A reverse bias across the junction increases BTBT current by increasing the electric field across the junction. Conversely, a forward bias reduces BTBT current. Typically, sub-threshold leakage dominates the junction leakage, especially the SRH and TAT currents. Junction leakage can become important at large FBB voltages (ideal diode current) and large RBB voltages (BTBT current). This makes RBB less effective at reducing circuit leakage at large RBB values, because RBB reduces sub-threshold leakage only [32].

Leakage modelling

After having discussed the important leakage current components, a normalized leakage current model that accounts for body biasing, as used in this work, will be presented. There exists an exponential dependency between body biasing and sub-threshold leakage, GIDL and junction leakage. Therefore, the following non-linear derating function is used to account for the impact of body biasing on leakage:

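$$leakage_{norm} = \begin{cases} e^{l_1 V_{BB}} + l_2\left(e^{l_3 V_{BB}} - 1\right) & \forall\, V_{BB} \geq 0 \\ e^{l_1 V_{BB}} + l_4\left(e^{l_5 V_{BB}} - 1\right) & \forall\, V_{BB} < 0 \end{cases} \tag{2.13}$$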

where l1,l2,l3,l4,l5 are polynomial coefficients which depend on process, VDD and Vth-option as well as temperature. These coefficients can be different for each type of digital gate. The first term of expression (2.13) models the sub-threshold leakage dependency on body biasing. The second term concerns either the junction leakage under FBB (forward biased diode current), or the junction leakage under RBB and GIDL. Expression (2.13) can be utilized for a single digital gate, or a digital circuit.
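The piecewise derating function (2.13) can be evaluated with a few lines of Python. The sketch below uses the coefficients that this section reports for the fit to the median chip sample (l1=2.26, l2=0.08, l3=10.27, l4=0.05, l5=-0.93); the function and constant names are assumptions of the example, and VBB > 0 is taken as FBB.

```python
import math


def leakage_norm(v_bb, l1, l2, l3, l4, l5):
    """Normalized leakage of expression (2.13); equals 1 at v_bb = 0."""
    sub_threshold = math.exp(l1 * v_bb)          # first term: sub-threshold leakage
    if v_bb >= 0:
        # FBB branch: forward biased junction diode current
        return sub_threshold + l2 * (math.exp(l3 * v_bb) - 1.0)
    # RBB branch: junction leakage under RBB and GIDL
    return sub_threshold + l4 * (math.exp(l5 * v_bb) - 1.0)


# Coefficients reported in the text for the median chip sample.
MEDIAN_FIT = dict(l1=2.26, l2=0.08, l3=10.27, l4=0.05, l5=-0.93)
for v_bb in (-1.2, -0.6, 0.0, 0.25, 0.5):
    print(f"VBB = {v_bb:+.2f} V -> {leakage_norm(v_bb, **MEDIAN_FIT):.3f}")
```

Both branches return exactly 1 at VBB = 0, so the function is continuous across the NBB point, and the second term is what flattens the leakage reduction at large RBB.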

[Plot: normalized leakage current (logarithmic axis, 0.01 to 100) versus body bias (-1.2 to 0.6 V) for a minimum-sized FO1 inverter. The model curve of expression (2.13) is shown together with its components e^(l1·VBB), l2(e^(l3·VBB)-1) and l4(e^(l5·VBB)-1).]

Figure 2.6 Inverter leakage current versus body biasing.

Inverter leakage extracted from ring-oscillator measurements.


Figure 2.6 shows the leakage current of a CMOS inverter as a function of body biasing. The results have been extracted from measurements of the same circuit as before. The symbols in Figure 2.6 correspond to the median chip sample, while the error bars correspond to the fastest and slowest chip sample of the measured sample set. The solid line indicates the results of the non-linear derating function, which was fitted to the median sample results (l1=2.26, l2=0.08, l3=10.27, l4=0.05, l5=-0.93). A maximum error below 10% was observed between the results from the median sample and expression (2.13) across a body bias range from 1.2V RBB up to 0.5V FBB.

2.1.3 Power Consumption Model

The total power consumption of a digital gate can be modelled by the sum of dynamic and leakage power consumption:

$$P_{gate} = P_{dyn} + P_{leak} = a\left(x C_{intr} + C_{extr}\right)V_{DD}^{2} f_{ck} + x I_{leak} V_{DD} \tag{2.14}$$

where a is the switching activity of the gate, which is the average number of transitions (0→1 or 1→0) of a signal per unit of time, and fck is the operating frequency (=1/Tck). Parameter Cintr is the intrinsic capacitance of the gate, which is body bias dependent. Parameter Ileak is the leakage current of a gate, which depends on VDD, Vth and VBB. Expression (2.14) assumes that all transitions are full-swing. Glitches due to small delay differences at the gate inputs may have partial swings that cannot be correctly modelled by (2.14).

The body bias dependency of the intrinsic capacitance of a digital gate has already been discussed in section 2.1.1 and illustrated in Figure 2.3. One can model this dependency with the help of non-linear regression techniques. In the performed experiments, intrinsic capacitance values were extracted from dynamic power consumption simulations for the same 90nm LP-CMOS inverter as used before. It was found that the intrinsic capacitance dependency on body biasing can be modelled by a power fitting function that is inspired by expression (2.2). As before, the intrinsic capacitance model has been normalized to the NBB case.

$$C_{intr,norm} = \frac{1}{\left(1 - m_1 V_{BB}\right)^{m_2}} \tag{2.15}$$

Parameters m1 and m2 are fitting coefficients, which are different for each digital gate. For the simulated CMOS inverter, a maximum error below 5% was observed when using expression (2.15) to model the body bias impact on dynamic power for a body bias range from 1.2V RBB up to 0.5V FBB. By combining (2.13), (2.14) and (2.15), one can model the total power consumption of a generic CMOS digital logic circuit as:

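$$P_{total} = \sum_{i=1}^{N} a_i\left(\frac{x_i C_{intr,i}}{\left(1 - m_{1,i} V_{BB}\right)^{m_{2,i}}} + C_{extr,i}\right)V_{DD}^{2} f_{ck} + \begin{cases} \displaystyle\sum_{i=1}^{N} x_i V_{DD} I_{leak,i}\left(e^{l_{1,i} V_{BB}} + l_{2,i}\left(e^{l_{3,i} V_{BB}} - 1\right)\right) & \forall\, V_{BB} \geq 0 \\ \displaystyle\sum_{i=1}^{N} x_i V_{DD} I_{leak,i}\left(e^{l_{1,i} V_{BB}} + l_{4,i}\left(e^{l_{5,i} V_{BB}} - 1\right)\right) & \forall\, V_{BB} < 0 \end{cases} \tag{2.16}$$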

where i is an index that runs over all digital gates in the circuit, and N is the number of digital gates in the circuit.


Figure 2.7 shows experimental results of power consumption versus body biasing for a 101-stage inverter-based ring-oscillator circuit in 90nm SVT LP-CMOS. The symbols correspond to the median chip sample, while the error bars correspond to the fastest and slowest chip sample of the 69 measured samples. The spread in power consumption at a given body bias is mainly due to the difference in oscillation frequency of the different chip samples. The solid line in Figure 2.7 indicates the results of the calibrated expression (2.16) where the polynomial coefficients obtained to fit leakage were used. The fitting coefficients for the intrinsic capacitance are: m1=1.82, m2=0.12. Notice the close match between the measured and calculated results. Namely, a maximum error below 3% is observed when using expression (2.16) for a body bias range from 1.2V RBB up to 0.5V FBB.
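To show how the capacitance derating (2.15) and the leakage derating (2.13) combine in the power model, the sketch below evaluates the power of a single gate. The fitting coefficients are the ones quoted in the text (m1=1.82, m2=0.12 and the l1 to l5 values of the leakage fit); the gate parameters (capacitances, activity, frequency, leakage current) are hypothetical, and the function names are assumptions of this example.

```python
import math

M_FIT = dict(m1=1.82, m2=0.12)                                  # from the text
L_FIT = dict(l1=2.26, l2=0.08, l3=10.27, l4=0.05, l5=-0.93)     # from the text


def c_intr_norm(v_bb, m1, m2):
    """Normalized intrinsic capacitance of expression (2.15)."""
    base = 1.0 - m1 * v_bb
    if base <= 0:
        raise ValueError("expression (2.15) requires 1 - m1*VBB > 0")
    return 1.0 / base ** m2


def leakage_norm(v_bb, l1, l2, l3, l4, l5):
    """Normalized leakage of expression (2.13)."""
    if v_bb >= 0:
        return math.exp(l1 * v_bb) + l2 * (math.exp(l3 * v_bb) - 1.0)
    return math.exp(l1 * v_bb) + l4 * (math.exp(l5 * v_bb) - 1.0)


def gate_power(a, x, c_intr, c_extr, vdd, f_ck, i_leak, v_bb):
    """One summand of expression (2.16): dynamic plus leakage power of a gate."""
    c_switched = x * c_intr * c_intr_norm(v_bb, **M_FIT) + c_extr
    p_dyn = a * c_switched * vdd ** 2 * f_ck
    p_leak = x * vdd * i_leak * leakage_norm(v_bb, **L_FIT)
    return p_dyn + p_leak


# Hypothetical gate: values chosen only to exercise the model.
gate = dict(a=0.1, x=1.0, c_intr=3.3e-15, c_extr=10e-15,
            vdd=1.2, f_ck=500e6, i_leak=1e-9)
for v_bb in (-1.2, 0.0, 0.5):
    print(f"VBB = {v_bb:+.1f} V -> P = {gate_power(v_bb=v_bb, **gate):.3e} W")
```

The total power of a circuit follows by summing `gate_power` over all gates, each with its own coefficients where these differ per gate.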

[Figure 2.7: plot of power consumption [µW] versus body bias (axis from -1.2 to 0.6) for the FO1 101-stage inverter-based ring-oscillator. The model curve follows expression (2.16), evaluated with the oscillation frequency fosc.]

Figure 2.7 Ring-oscillator power consumption versus body biasing.
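As a rough numerical illustration of the capacitance derating, the sketch below evaluates the normalized intrinsic capacitance with the ring-oscillator coefficients quoted above (m1=1.82, m2=0.12). It assumes that expression (2.15) has the form 1/(1 - m1·VBB)^m2 and that VBB is positive for FBB and negative for RBB. The function name and the sign convention are illustrative assumptions, not taken from the original work.

```python
def c_intr_norm(v_bb, m1=1.82, m2=0.12):
    """Normalized intrinsic capacitance versus body bias (1.0 at NBB).

    Assumed form of expression (2.15): 1 / (1 - m1 * v_bb) ** m2,
    with v_bb > 0 for forward and v_bb < 0 for reverse body bias.
    """
    base = 1.0 - m1 * v_bb
    if base <= 0.0:
        raise ValueError("model is only defined for 1 - m1 * v_bb > 0")
    return 1.0 / base ** m2


if __name__ == "__main__":
    # Sweep the body bias range used in the experiments.
    for v_bb in (-1.2, -0.6, 0.0, 0.25, 0.5):
        print(f"VBB = {v_bb:+.2f} V -> C_intr,norm = {c_intr_norm(v_bb):.3f}")
```

Under these assumptions the capacitance decreases under RBB and increases under FBB relative to the NBB case, which is the direction one would expect from a junction-capacitance-like dependency.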


2.2 Circuit Area Modelling

The area of a digital circuit is related to the targeted circuit timing. In fact, the circuit area is largest when a high operating speed is required, and smallest for circuits with unconstrained timing. The area is affected by the application of design optimization techniques, such as gate sizing, buffer insertion, logic re-structuring and logic decomposition, which are utilized to enhance circuit speed. The area of a digital logic circuit can be modelled by the sum of areas of all gates in the circuit:

$$A_{total} = \sum_{i=1}^{N} x_i A_i \qquad (2.17)$$

where Ai is the minimum area of gate i, and N is the number of digital gates in the circuit. The gate sizing factor x depends on the clock period constraint of the synthesized circuit, i.e. x=f(Tck). Figure 2.8 shows a typical trade-off curve for circuit area and clock period of a given generic digital logic circuit. The curve is constructed from a multitude of synthesis runs in which the same design meets distinct clock period constraints. In Figure 2.8 the area and clock period have been normalized to the best performing design (Amax, Tmin). This design is obtained by constraining the gate sizing of all digital gates to their maximum size in the digital library. The faster designs (Tck<Tmin) are obtained for unconstrained gate sizing, i.e. when the gate sizing of the digital gates is not limited to their maximum size in the digital library. Observe that high-performance circuits consume more area than slow circuits. This is due to the use of design optimization techniques during the logic synthesis phase, such as gate upsizing and logic re-ordering, to speed up critical circuit paths.

[Figure 2.8: plot of relative area versus relative clock period, with the normalization point (Amax,Tmin) marked.]

Figure 2.8 Area and clock period trade-off for a generic digital logic circuit.

The trend shown in Figure 2.8 can be modelled by a rational function with χ, δ, and η as independent fitting parameters of the following expression:

$$A_{total} = \frac{\chi}{T_{ck} + \delta} + \eta \qquad (2.18)$$

The general form of expression (2.18) describes a rectangular hyperbola. Parameters δ and η model the shift from the origin, and χ models the hyperbola scale factor. The vertical asymptote of the hyperbola is located at a clock period of Tck=-δ. This clock period represents the minimum clock period of the design that is theoretically possible. In practice, this minimum clock period will never be achieved due to area constraints. The horizontal asymptote of the hyperbola is located at a circuit area of Atotal=η. This area represents the minimum area of the design, i.e. the area of a design that has not been optimized for timing. Finally, the hyperbola scale factor, χ, accounts for the impact of design optimization techniques such as gate upsizing and logic re-structuring. The fitting parameters of expression (2.18) depend on VDD, the Vth option, and the amount of body biasing accounted for at design-time. Moreover, they also depend on the design margins used during the design synthesis stage. Finally, due to its generic nature, expression (2.18) can be fitted to any digital circuit design, including designs that contain hard IP blocks such as memories or analog/mixed-signal IPs. The circuit area of hard IP blocks can be accounted for by parameter η, and parameter χ can model any impact of hard IP blocks on circuit timing. Chapter 5 will show how expression (2.18) can be used to enable fast reconstruction of the area-clock period trade-off curve of a circuit design. Its modelling accuracy will also be discussed.
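To make the role of the three fitting parameters concrete, the following sketch recovers χ, δ and η from three (clock period, area) samples. It reads expression (2.18) as A = χ/(Tck + δ) + η, which is consistent with the asymptotes described above. The sample values and function names are hypothetical; this is not the fitting procedure of Chapter 5. The sketch uses the fact that (A - η)(T + δ) = χ is linear in δ, η and c = ηδ + χ.

```python
def area_model(t_ck, chi, delta, eta):
    """Circuit area according to the assumed form A = chi / (T_ck + delta) + eta."""
    if t_ck + delta <= 0.0:
        raise ValueError("clock period must exceed the vertical asymptote -delta")
    return chi / (t_ck + delta) + eta


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_three_points(points):
    """Solve chi, delta, eta exactly from three (T_ck, A) samples.

    Each sample gives A*delta - T*eta - c = -A*T with c = eta*delta + chi,
    a linear system that is solved here with Cramer's rule.
    """
    rows = [[a, -t, -1.0] for t, a in points]
    rhs = [-a * t for t, a in points]
    d = det3(rows)
    if abs(d) < 1e-12:
        raise ValueError("samples do not determine a unique hyperbola")
    solution = []
    for col in range(3):
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    delta, eta, c = solution
    return c - eta * delta, delta, eta


if __name__ == "__main__":
    # Hypothetical normalized synthesis results (relative clock period, relative area).
    samples = [(0.9, 1.2), (1.0, 1.0), (1.3, 0.8)]
    chi, delta, eta = fit_three_points(samples)
    print(f"chi = {chi:.3f}, delta = {delta:.3f}, eta = {eta:.3f}")
    print(f"minimum theoretical clock period: {-delta:.3f}")
    print(f"minimum area: {eta:.3f}")
    print(f"area at T_ck = 1.1: {area_model(1.1, chi, delta, eta):.3f}")
```

With more than three synthesis runs, the same linearization can be solved in a least-squares sense instead of exactly.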

2.3 Discussion

(Non-)linear derating functions for modelling performance, leakage and power consumption of a digital circuit with body biasing were presented in this chapter. These models give digital designers quick insight into the behaviour of the circuit at design-time or at run-time when body biasing is used. The models were validated through correlation against experimental results for a digital ring-oscillator based circuit in 90nm LP-CMOS at VDD=1.2V and room temperature. For a body bias ranging from 1.2V RBB up to 0.5V FBB, a maximum error of about 2%, 10% and 5% was observed for the proposed delay model, leakage model and power consumption model, respectively. The proposed models can easily be fitted to a given design, PVT condition or technology node. With the models presented in this chapter, the characteristics of a new design style that makes use of the presence of body biasing can be explored. Before doing so, however, one first needs to understand the power-performance tuning ranges that are available with post-silicon tuning techniques such as power supply voltage scaling and/or body biasing. These tuning ranges are a function of technology node, process and operating conditions. The technological boundaries of VDD scaling and body bias tuning in modern CMOS technologies will be presented in the next chapter.

Chapter 3

Technology Boundaries of Post-Silicon Tuning

This chapter concentrates on quantitative technology pointers for VDD scaling (VS) and body bias (BB) tuning in modern CMOS digital designs. In particular, it shows the amount of power savings that can be expected, the power-performance trade-offs that can be made, and the extent to which process-dependent performance compensation can be used. For this purpose, various process technologies were experimentally evaluated to determine technological boundaries for VS and BB when applied to digital logic circuits. This evaluation is based on an extensive analysis of test-circuits fabricated in 90nm general-purpose (GP), 90nm LP, 65nm LP and 45nm LP triple-well CMOS processes.

3.1 Prior Art Analysis

Many researchers have identified VS and BB as effective silicon tuning knobs for lowering power consumption, reducing leakage current, increasing circuit performance, and achieving process-spread compensation [16][17][18][19]. However, only a few works provide quantitative pointers for using such know-how in deep submicron technologies, especially across various technology nodes. This section reviews the prior art and includes a gap analysis, which is the motivation for this work. Several researchers have analyzed the scalability of BB across technology nodes. Huang et al. demonstrated that FBB generally offers better technology scalability than RBB [34]. They performed an analysis for technology nodes ranging from 0.18µm down to 50nm. Chatterjee et al. showed that RBB offers higher leakage savings than the use of non-minimum channel length transistors, VS, or the stack effect [35]. Their analysis concentrated on 0.13µm, 0.1µm and 70nm CMOS technologies. Hokazono et al. illustrated that FBB is a promising approach for realizing optimum Vth scaling in the era when gate dielectric thicknesses can no longer be scaled down [36][37]. They also showed that the body effect factor (γ) improves in a next-generation CMOS technology. These observations were made from device experiments in 120nm, 65nm and 45nm CMOS, and device simulations in 10nm CMOS. Although the aforementioned works demonstrated technology scalability of BB techniques, only very limited results were provided on the impact of VS and BB on digital circuit power and performance.


Other researchers concentrated on the application of VS and BB for a given technology node. Chen and Naffziger examined the application of VS and BB for improving parametric product yield and bin distributions in terms of frequency and power [16]. Statistical circuit simulations were performed for a combinational logic circuit in 0.1µm CMOS. The authors concluded that VS and BB are sufficiently effective in trading off performance against power that the overall parametric yield can be improved dramatically. The weakness, though, is that their analysis is based on circuit simulations only. Von Arnim et al. presented the efficiency of BB for reducing leakage or for improving performance in 90nm CMOS [38][39]. They showed experimental results for small digital circuits. However, they quantified neither the power-performance trade-off ranges for VS or BB, nor their effectiveness in compensating the impact of process parameter spreads. Yet other researchers looked into optimizing process technologies for body biasing. Imai et al. presented a 65nm CMOS technology with high-K gate dielectric for which the body-effect factor was enhanced in order to have a larger RBB range [40]. They showed a two times higher body-effect factor for their optimized process technology. Yasuda et al. presented a technology modification for equalizing body bias sensitivity in 65nm CMOS multi-Vth transistors [41]. The authors demonstrated a significant leakage reduction with RBB using the optimized process technology. Since both works pursued technology optimization in 65nm CMOS, one may infer that the body bias sensitivity of an un-optimized technology is limited. However, a more detailed analysis is required to understand the technological boundaries of industrial CMOS processes. The motivation of this work is to address the aforementioned gaps. In this work, the effectiveness of the VS and BB post-silicon tuning knobs on the three key digital circuit parameters (operational performance, power consumption and leakage current) is investigated experimentally, including scalability across technology nodes.

3.2 Test Circuits and Scaling Conventions

In this work, two monitoring circuits have been developed. The first monitoring circuit is a clock generation unit (CGU) that consists of multiple independent ring-oscillators and corresponding selection circuitry. Its purpose is to provide experimental results on the possible trade-offs between power and performance, and on the leakage reduction and process spread compensation opportunities for combinational CMOS circuits as a function of VS and BB. This will demonstrate how such post-silicon tuning knobs can adapt the intrinsic digital circuit behaviour.

[Figure 3.1: schematic with the blocks Ring-oscillator and Periphery, an enable input, a ÷1024 divider and an output (out); supply and bias pins labelled VDD, VNWELL, VSS, VPWELL, VDDP and VSS.]

Figure 3.1 Schematic diagram of the ring-oscillator monitoring circuit.

[Figure 3.2: die photograph with labels Core A, Core B, Core C, SVT, LVT, HVT, Selection circuit and Ring-oscillator.]

Figure 3.2 Die photograph of the 45nm LP-CMOS ring-oscillator test-chip.

A simplified circuit diagram of a ring-oscillator is illustrated in Figure 3.1. It uses minimum-sized standard-cell inverters as delay elements, and a nand-2 gate for enabling control. The oscillation frequency is divided for low-frequency read-out. By means of a tri-state buffer, one can enable multiple ring-oscillators to share the same output pin. The CGU has been implemented in 90nm, 65nm and 45nm technology nodes as a full-custom design built from digital standard cells. It contains two copies of 10 inverter-based ring-oscillators (ringos) with different chain lengths, which are part of the same core. Layout-identical ringo instances are placed at a short distance from each other (within 100-200µm depending on the process node). The test-chip contains different cores with different threshold voltage options. Each core has independent power supply voltage (VDD) pads for current measurements. The ground pads (VSS), and the independent body bias voltage pads for PMOS (VNWELL) and NMOS (VPWELL) devices, are common to all cores. Body biasing is enabled for N-well and P-well independently through triple-well isolation. Figure 3.2 shows a die photograph of the 45nm LP-CMOS test-chip.

[Figure 3.3: die photograph with the shift-register core marked.]

Figure 3.3 Die photograph of the 90nm LP-CMOS shift-register test-chip.

The second monitoring circuit is a circular shift-register, which has only been laid out in 90nm LP-CMOS. Its purpose is to provide experimental power consumption results for clocked CMOS circuits as a function of VS and BB. Understanding the behaviour of clocked CMOS circuits is relevant, because they represent the majority of today’s digital circuits. This analysis will demonstrate whether the results obtained with the first monitoring circuit can be generally applied to clocked CMOS circuits as well. The shift-register design contains 8K flip-flops and 50K logic gates. The logic gates are connected as delay lines between two consecutive flip-flop stages, with an average logic depth of six cells. One can emulate the activity of any digital core with this circular shift register by shifting in a sequence of zeros and ones. Like the CGU, it has independent bias control over supply voltage, N-well and P-well biasing. The CGU provides the clock to the shift-register. In this case, the CGU has been implemented by using a commercial place-and-route tool. The shift-register is used to perform correlated measurements against the ring-oscillators in the CGU for validation purposes. Figure 3.3 shows a die photograph of the 90nm LP-CMOS shift-register test-chip.

Figure 3.4 shows a graph of frequency versus power as a function of VS, BB, or both. The thick line shows the nominal trend when the supply voltage is varied from its maximum to its minimum value. A VS operation consists of sweeping the supply voltage while maintaining a constant nominal body bias. BB is essentially the opposite approach: the supply voltage is kept constant and the body bias is swept. Here, frequency and power have an almost linear negative dependence on the threshold voltage. The result is a “cloud” of frequency-power points for a given supply voltage. Finally, VS+BB corresponds to the case when both supply voltage and body biasing are swept.

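[Figure 3.4: sketch of frequency versus power. The VS trend line runs between min VDD, nom VDD and max VDD; at a given supply, BB spreads the operating points between max Vth (RBB) and min Vth (FBB) around nom Vth.]

Figure 3.4 Voltage scaling and body biasing operations.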

The test-circuits have been fabricated in 90nm GP, 90nm LP, 65nm LP and 45nm LP triple-well CMOS processes. Devices in 90nm GP-CMOS operate at a nominal VDD of 1V, their counterparts in 90nm/65nm LP-CMOS operate at 1.2V, and 45nm LP-CMOS devices operate at 1.1V. On average, the nominal standard-Vth is about 0.27V, 0.37V, 0.43V, and 0.35V for 90nm GP, 90nm LP, 65nm LP, and 45nm LP-CMOS, respectively. Body biasing enables adaptation of these nominal Vth values. Table 3.1 presents the voltage ranges that have been employed during the measurements. Observe that the wells were forward biased by at most 0.5V and reverse biased by up to 1.2V. Forward biasing is constrained by the turn-on voltage of the transistors’ body-source junction diode. Reverse biasing is essentially unconstrained, but high reverse biasing voltages result in increased band-to-band tunnelling (BTBT) current. All measurements have been performed using a Verigy 93K SoC test system in a temperature-controlled environment. The temperature is controlled by a Temptronic thermostream.


Table 3.1 Voltage conventions for scaling operations.

| Operation | Voltage | 90nm GP | 90nm/65nm LP | 45nm LP |
|---|---|---|---|---|
| VS | VDD | [0.5,1.0]V | [0.6,1.2]V | [0.6,1.1]V |
| BB | VNWELL | [VDD-0.5,VDD+1.0]V | [VDD-0.5,VDD+1.2]V | [VDD-0.5,VDD+1.1]V |
| BB | VPWELL | [-1.0,0.5]V | [-1.2,0.5]V | [-1.1,0.5]V |

In the next sections it will be shown how these techniques can be used to alter the power-performance of integrated circuits. Please note that in the next sections the term “ringo” will be used to refer to the ring-oscillators in the CGU.
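The conventions of Table 3.1 can be sketched as a small helper that converts a body bias setting into well voltages. The sketch assumes a single body bias value VBB that is positive for FBB and negative for RBB, applied symmetrically so that Vnwell = VDD - Vpwell (the symmetric biasing convention used later in this chapter). The dictionary and function names are illustrative only.

```python
# Maximum reverse body bias per process, taken from the VPWELL row of Table 3.1.
RBB_LIMIT_V = {"90nm GP": 1.0, "90nm LP": 1.2, "65nm LP": 1.2, "45nm LP": 1.1}
# Maximum forward body bias, identical for all processes in Table 3.1.
FBB_LIMIT_V = 0.5


def well_voltages(process, v_dd, v_bb):
    """Return (VNWELL, VPWELL) for a symmetric body bias v_bb.

    v_bb > 0 is forward body bias, v_bb < 0 is reverse body bias (assumed
    sign convention). The N-well moves in the opposite direction with
    respect to VDD, so that Vnwell = VDD - Vpwell.
    """
    rbb_limit = RBB_LIMIT_V[process]
    if not -rbb_limit <= v_bb <= FBB_LIMIT_V:
        raise ValueError(f"body bias {v_bb} V is outside the range of Table 3.1")
    return v_dd - v_bb, v_bb


if __name__ == "__main__":
    for v_bb in (-1.2, 0.0, 0.5):
        v_nwell, v_pwell = well_voltages("65nm LP", 1.2, v_bb)
        print(f"VBB = {v_bb:+.1f} V -> VNWELL = {v_nwell:.1f} V, VPWELL = {v_pwell:+.1f} V")
```

For 65nm LP-CMOS at VDD=1.2V this reproduces the end points of Table 3.1: VNWELL spans VDD-0.5 to VDD+1.2 while VPWELL spans -1.2V to 0.5V.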

3.3 Frequency Scaling and Tuning

Most applications do not need peak performance all the time. In those cases, voltage scaling can be employed to reduce power consumption at the cost of a lower computing performance of the core. In fact, maximum operating frequency and supply voltage for a circuit design are coupled, as discussed in Chapter 2. Supply voltage reduction is the most effective technique for reducing dynamic power consumption, as can be deduced from expression (2.14).

[Figure 3.5: plot of frequency [Hz] (logarithmic axis, 1E+6 to 1E+9) versus power supply voltage [V] (0.5 to 1.3), showing the VS trend and the BB spread between maxVth and minVth at each supply voltage.]

Figure 3.5 Frequency scaling and tuning for the 65nm LP-CMOS SVT ringo.

Let us now investigate the frequency scaling and tuning ranges offered by VS and BB in 65nm LP-CMOS. For this purpose, the dynamic range of a 101-stage SVT ringo that is part of the CGU was determined first. To a first-order approximation, the oscillation frequency of a ringo under NBB conditions can be calculated as

$$f_{osc} = \frac{1}{2 N d_{gate}} = \frac{\beta \left(V_{DD} - V_{th}\right)^{\alpha}}{2 N C_{load} V_{DD}} \quad \forall\, V_{DD} > V_{th} \qquad (3.1)$$

where N is the number of delay cells in the ringo, and dgate is the propagation delay of a digital gate as shown in expression (2.3). Recall from expression (2.3) that α is a process dependent parameter that takes into account velocity saturation. In the case of velocity saturated devices, α is close to 1. Because of the low α factor, it follows that frequency scales almost linearly with VDD. Expression (2.4) can be used to account for the impact of body biasing on ringo frequency in expression (3.1). Figure 3.5 shows the ringo frequency as a function of power supply voltage as obtained through silicon experiments. Each cloud of dots is associated with a unique supply voltage. Each dot in a cloud corresponds to a unique N-well and P-well bias combination, and the line joining the clouds indicates the nominal trend. The ringo frequency is 327MHz at nominal supply (VDD=1.2V), and 16.2MHz at minimum supply (VDD=0.6V). This results in a VS tuning range of about 310MHz. Recall that the Vth is about 0.43V on average for this technology at nominal VDD. When operating at reduced VDD, the Vth increases because of the lower impact of DIBL; it increases by about 100mV at VDD=0.6V. The large frequency reduction with VS occurs because the supply voltage comes close to the Vth. At those low VDD values, the transistors are no longer velocity saturated (α=2). For the applied VDD range, VS renders an approximately 20x frequency reduction. If the lower bound of VS were set to 0.7V, the frequency would reduce by about 7x.
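The measured 20x reduction can be put next to the first-order model of expression (3.1) with a short calculation. The sketch below takes the frequency to be proportional to (VDD - Vth)^α / VDD, keeps β, N and Cload constant, and uses a single α at both supply voltages, which is a deliberate simplification because the text notes that α changes between the two operating points. The Vth values are the ones quoted above; the function names are illustrative.

```python
# Measured 65nm LP-CMOS SVT ringo values quoted in the text.
F_NOM_MHZ, F_MIN_MHZ = 327.0, 16.2
VDD_NOM, VDD_MIN = 1.2, 0.6
VTH_NOM, VTH_MIN = 0.43, 0.53  # Vth rises by about 100 mV at VDD = 0.6 V


def relative_frequency(v_dd, v_th, alpha):
    """Frequency up to a constant factor: (VDD - Vth)^alpha / VDD."""
    if v_dd <= v_th:
        raise ValueError("the first-order model only holds for VDD > Vth")
    return (v_dd - v_th) ** alpha / v_dd


def predicted_ratio(alpha):
    """Ratio of nominal-supply to minimum-supply frequency for a fixed alpha."""
    return (relative_frequency(VDD_NOM, VTH_NOM, alpha)
            / relative_frequency(VDD_MIN, VTH_MIN, alpha))


measured_ratio = F_NOM_MHZ / F_MIN_MHZ

if __name__ == "__main__":
    print(f"VS tuning range: {F_NOM_MHZ - F_MIN_MHZ:.1f} MHz")
    print(f"measured frequency ratio: {measured_ratio:.1f}x")
    for alpha in (1.0, 2.0):
        print(f"alpha = {alpha:.0f}: predicted ratio {predicted_ratio(alpha):.1f}x")
```

Under these assumptions the measured ratio falls between the predictions for α=1 and α=2, in line with the observation that the devices leave velocity saturation as VDD approaches Vth.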

[Figure 3.6: contour plot of frequency versus N-well bias voltage [V] (0.7 to 2.4) and P-well bias voltage [V] (-1.2 to 0.5) at VDD=1.2V, with the nominal point marked.]

Figure 3.6 Frequency versus N-well and P-well biasing in 65nm LP-CMOS.

SVT ringo with nominal frequency at 327MHz. The contours are at 20MHz intervals.

One can now analyze the impact of BB as a frequency tuning mechanism at each VDD point. Notice that the relative tuning range is not the same for all VDD values. In particular, frequency spans of approximately -87% to +279% at VDD=0.6V and of approximately -22% to +25% at VDD=1.2V were measured with respect to the nominal frequencies. The larger tuning range of BB at reduced supply voltages can be explained by the fact that the threshold voltage is a larger portion of the gate drive of the transistors. At such low gate drive, the frequency becomes very sensitive to changes in Vth. Notice that a tuning range of -87% at VDD=0.6V implies an 8.1x lower frequency for RBB. In fact, at VDD=0.6V the circuit operates in the sub-threshold region for strong reverse body biasing conditions. In this case, the current is exponentially related to the gate drive voltage, and the frequency is much lower than in the case of nominal body biasing. For the measured silicon, BB gives an absolute tuning range of 155MHz for the chosen N-well and P-well voltages when operating at VDD=1.2V. At VDD=0.6V this tuning range is around 60MHz. Figure 3.6 shows a contour plot of the BB scaling operation at VDD=1.2V. The contours are at 20MHz intervals, and the nominal frequency is 327MHz. Notice that it is possible to change the Vth of the PMOS and NMOS transistors independently and still attain the same frequency. Obviously, the choice of Vth has a significant impact on leakage power consumption, as will be shown later in this chapter. Figure 3.7 shows the frequency tuning for the BB scaling operation as a function of a symmetrical well bias (Vnwell=VDD-Vpwell) for various supply voltages. Notice that the frequency saturates for strong reverse body biasing due to the limited Vth control range.
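The relation between the relative and absolute BB tuning ranges quoted above is simple arithmetic, sketched below with the measured nominal frequencies. Only the rounded percentages from the text are used, so small deviations from the quoted 155MHz, 60MHz and 8.1x figures are to be expected. The function name is illustrative.

```python
def tuning_window(f_nom_mhz, low_pct, high_pct):
    """Return (f_min, f_max, span) in MHz for a relative tuning range in percent."""
    f_min = f_nom_mhz * (1.0 + low_pct / 100.0)
    f_max = f_nom_mhz * (1.0 + high_pct / 100.0)
    return f_min, f_max, f_max - f_min


if __name__ == "__main__":
    # (nominal frequency in MHz, lower bound in %, upper bound in %) per supply.
    cases = {"VDD=1.2V": (327.0, -22.0, 25.0), "VDD=0.6V": (16.2, -87.0, 279.0)}
    for label, (f_nom, low, high) in cases.items():
        f_min, f_max, span = tuning_window(f_nom, low, high)
        slowdown = f_nom / f_min
        print(f"{label}: {f_min:.1f} to {f_max:.1f} MHz, "
              f"span {span:.1f} MHz, RBB slowdown {slowdown:.1f}x")
```

The large relative range at VDD=0.6V therefore corresponds to a smaller absolute span than the modest relative range at VDD=1.2V, because the nominal frequency is about 20x lower.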

[Figure 3.7: plot of frequency [MHz] versus body bias voltage [V] (-1.2 to 0.5, from RBB to FBB) for the 65nm LP-CMOS SVT ring-oscillator with Vnwell=VDD-Vpwell, with one curve per supply voltage from VDD=0.6V to VDD=1.2V in 0.1V steps and the nominal point marked.]

Figure 3.7 Frequency dependency on supply voltage and body bias.

The same analysis has been performed for ringos in 90nm and 45nm CMOS. Table 3.2 shows a summary of the measured frequency scaling and tuning ranges for the different process technologies. VDD_low and VDD_high correspond to the lower and higher VDD limit for a given process technology, as indicated in Table 3.1. Notice the large frequency scaling ranges for the different process technologies as well as the large frequency tuning range at reduced VDD. For large reverse body biasing the threshold voltage saturates, which results in an asymptotic limit on the lowest possible operating frequency. Observe that GP-CMOS shows a lower dependence on VDD and Vth than LP-CMOS, primarily because the threshold voltage of the former technology is lower.

Table 3.2 Frequency tuning ranges for various CMOS nodes (SVT).

| | 90nm GP | 90nm LP | 65nm LP | 45nm LP |
|---|---|---|---|---|
| VS | 3.4x | 6.7x | 20.1x | 13.4x |
| BB at VDD_low | [-28,34]% | [-87,123]% | [-87,279]% | [-72,208]% |
| BB at VDD_high | [-8,9]% | [-28,18]% | [-22,25]% | [-19,27]% |
| VS+BB | 5.2x | 61.3x | 204.3x | 59.5x |


3.4 Power and Frequency Tuning

The ultimate use of the VS and BB schemes is performance tuning, where performance is understood as the optimal combination of frequency and power, i.e. the lowest power for a given frequency. To investigate the available power-frequency tuning range offered by VS and BB in 65nm LP-CMOS, one can consider the same ring oscillator as before. Figure 3.8 presents a plot of the ringo frequency as a function of the total power of the CGU, i.e. both the static power of the CGU and the dynamic power consumption of the ringo. In these experiments static power takes into account all sources of leakage, e.g. sub-threshold leakage, gate-oxide leakage, etc.

[Figure 3.8: plot of frequency [MHz] versus power consumption [µW] for the 65nm LP-CMOS SVT ring-oscillator, with one BB cloud per supply voltage from VDD=0.6V to VDD=1.2V bounded by maxVth and minVth, and the VS trend through nomVth.]

Figure 3.8 Frequency versus total power in CMOS 65nm.

[Figure 3.9: zoom-in of frequency [MHz] versus power consumption [µW] for the 65nm LP-CMOS SVT ring-oscillator at VDD=1.0V, 1.1V and 1.2V, with arrows A and B marking the trade-offs discussed in the text.]

Figure 3.9 Trading-off frequency for total power consumption in CMOS 65nm.

The plot of Figure 3.8 allows us to evaluate power savings and tuning range control of VS and BB. Measurement results indicate 82x power savings at a 20.1x lower frequency when VS scales VDD from 1.2V down to 0.6V. The use of BB at VDD=1.2V results in ±33% power and ±25% frequency tuning with respect to the nominal operating point. At VDD=0.6V one can observe a power tuning range that spans from -78% to +342% and a frequency tuning range from -87% to +279% with respect to no BB. The combination of VS and BB yields ~500x power savings with ~204x frequency scaling from the highest possible frequency (minimum Vth) to the lowest one (maximum Vth). These results show the strength of the combined use of VS and BB. Let us now explore possible power-performance trade-offs by using VS and BB. Figure 3.9 shows a zoom-in of Figure 3.8 around VDD=1.2V. If VS and BB are applied such that the nominal VDD becomes 1.1V instead of 1.2V, and the Vth’s are pulled to a smaller value as indicated by arrow A in Figure 3.9, one can see that it is possible to achieve ~14% power savings with no frequency penalty. A more aggressive VDD downscaling to 1.0V, while pulling the Vth’s to their minimum value, results in 34% power savings at about 10% frequency penalty, as indicated by arrow B. Similar results have been found for 90nm and 45nm LP-CMOS. The figures for 90nm LP-CMOS are: 16% power savings with no frequency penalty at VDD=1.1V, and 33% power savings with 6% frequency penalty at VDD=1.0V. For 45nm LP-CMOS, the observed figures are: 16% power savings with no frequency penalty at VDD=1.0V, and 39% power savings with 14% frequency penalty. For a limited VDD range, the benefits of combined VS+BB are not found to be technology-node dependent for the considered LP-CMOS process technologies. For 90nm GP-CMOS, however, a slightly larger voltage dependency of performance was observed. Downscaling from its nominal VDD of 1.0V to 0.9V, and lowering the Vth’s to a minimum, results in ~19% power savings with ~4% frequency penalty. At VDD=0.8V and minimum Vth’s, ~45% power savings are achieved with a frequency penalty of only ~19%. This indicates that there exists a lower frequency tuning range with BB for GP-CMOS.
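One way to compare the trade-off points above is to convert each pair of power saving and frequency penalty into energy per clock cycle, using E = P / f. This derived metric is not reported in the text; the sketch below is only an illustration based on the rounded percentages quoted above, and the labels and function name are hypothetical.

```python
def energy_per_cycle_ratio(power_saving_pct, freq_penalty_pct):
    """Energy per cycle relative to the nominal operating point (E = P / f)."""
    relative_power = 1.0 - power_saving_pct / 100.0
    relative_freq = 1.0 - freq_penalty_pct / 100.0
    return relative_power / relative_freq


# (power saving in %, frequency penalty in %) as quoted in the text.
OPERATING_POINTS = {
    "65nm LP, arrow A (VDD=1.1V)": (14, 0),
    "65nm LP, arrow B (VDD=1.0V)": (34, 10),
    "90nm LP (VDD=1.1V)": (16, 0),
    "90nm LP (VDD=1.0V)": (33, 6),
    "90nm GP (VDD=0.9V)": (19, 4),
    "90nm GP (VDD=0.8V)": (45, 19),
}

if __name__ == "__main__":
    for label, (saving, penalty) in OPERATING_POINTS.items():
        ratio = energy_per_cycle_ratio(saving, penalty)
        print(f"{label}: energy per cycle = {ratio:.2f} of nominal")
```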

[Figure 3.10: plot of total core power [mW] versus P-well bias voltage [V] (-1.2 to 0.6) for 90nm LP-CMOS at VDD=1.2V. The points at each P-well bias correspond to different N-well biasing conditions, between maxVth and minVth.]

Figure 3.10 Total power of a logic core versus body biasing in CMOS 90nm.


Let us investigate the properties of BB in 90nm LP-CMOS on the shift register. Figure 3.10 shows the core’s total power for a given circuit activity and VDD=1.2V. Each dot in the clouds is associated to an N-well biasing condition. The line joining the clouds indicates the case when symmetric well biasing is applied. Observe that the well biasing allows a total power tuning range of about 42mW; this represents about 60% of the nominal power consumption.

[Figure 3.11: plot of total core power [mW] versus total ringo power [mW], with one BB cloud per supply voltage from VDD=0.6V to VDD=1.2V along the VS trend.]

Figure 3.11 Total power correlation between a logic core and a ringo.

Figure 3.11 shows the power consumption correlation between the shift register and the ringo for different VDD values. The same conventions as before were used in this plot, i.e. each cloud is associated with a unique VDD value and each point in the cloud corresponds to a unique N-well and P-well bias combination. The shift register operates at the same VDD as the CGU, while its operating frequency is provided by the CGU. The circuit activity of the shift register is kept constant. The dynamic power dominates the total power in both circuit blocks, and therefore their total power can be estimated by P ≈ aC⋅VDD²⋅f, where aC represents the switching circuit capacitance. Since both circuit blocks operate at the same supply voltage and frequency, their power consumption is linearly related by a ratio determined by the switching circuit capacitance. This can be observed in Figure 3.11, where the power consumption of the circuit blocks remains linearly correlated while applying VS and/or BB. Table 3.3 puts in perspective the power-frequency ranges for the ringos in the considered process technologies. Notice that there exist large power-frequency ranges for each process technology. For the cases of VS only, or VS+BB, the ratio of power and frequency shows a factor of 4x energy savings when scaling from the nominal VDD to half of its value. This indicates that the total ringo power is dominated by dynamic power consumption. Furthermore, observe that LP-CMOS offers a larger power and frequency tuning range than GP-CMOS when utilizing BB alone. The frequency tuning range of GP-CMOS is about 3x lower.


Table 3.3 Power-frequency tuning ranges for various CMOS nodes (SVT).

| | | 90nm GP | 90nm LP | 65nm LP | 45nm LP |
|---|---|---|---|---|---|
| VS | Power savings | 13.7x | 27.7x | 82.0x | 44.0x |
| VS | Frequency penalty | 3.4x | 6.7x | 20.1x | 13.4x |
| BB at VDD_low | Power tuning | [-29,40]% | [-80,119]% | [-78,342]% | [-75,235]% |
| BB at VDD_low | Frequency tuning | [-28,34]% | [-81,123]% | [-87,279]% | [-72,208]% |
| BB at VDD_high | Power tuning | [-9,14]% | [-30,24]% | [-25,33]% | [-20,34]% |
| BB at VDD_high | Frequency tuning | [-8,9]% | [-27,18]% | [-22,25]% | [-19,27]% |
| VS+BB | Power savings | 21.8x | 183.1x | 500.5x | 230.2x |
| VS+BB | Frequency penalty | 5.2x | 61.3x | 204.3x | 59.5x |
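The energy argument made for Table 3.3 can be checked with a few lines of arithmetic. With P ≈ aC⋅VDD²⋅f, the energy per cycle P/f scales with VDD², so the ratio of power savings to frequency penalty under VS should be close to (VDD_high/VDD_low)². The sketch below uses the VS rows of Table 3.3 and the VDD limits of Table 3.1; the assignment of the table columns to process nodes is assumed to follow the order of Table 3.2.

```python
# node: (VDD_high [V], VDD_low [V], VS power savings, VS frequency penalty)
VS_ROWS = {
    "90nm GP": (1.0, 0.5, 13.7, 3.4),
    "90nm LP": (1.2, 0.6, 27.7, 6.7),
    "65nm LP": (1.2, 0.6, 82.0, 20.1),
    "45nm LP": (1.1, 0.6, 44.0, 13.4),
}


def energy_saving(power_saving, freq_penalty):
    """Energy-per-cycle saving: power reduction divided by frequency reduction."""
    return power_saving / freq_penalty


def quadratic_prediction(v_high, v_low):
    """Energy saving expected if all power is dynamic, i.e. E proportional to VDD^2."""
    return (v_high / v_low) ** 2


if __name__ == "__main__":
    for node, (v_high, v_low, p_save, f_pen) in VS_ROWS.items():
        print(f"{node}: measured {energy_saving(p_save, f_pen):.2f}x, "
              f"VDD^2 scaling {quadratic_prediction(v_high, v_low):.2f}x")
```

The measured energy savings stay within a few percent of the quadratic prediction for all four nodes, which supports the statement that the ringo power is dominated by dynamic power. For 45nm LP-CMOS the prediction is below 4x because VDD is scaled from 1.1V to 0.6V rather than to half of its nominal value.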

3.5 Leakage Power Control

Leakage power is one of the main concerns in deep sub-micron technologies. In fact, VS and BB are often used for leakage reduction purposes. For older process technologies, leakage current is dominated by sub-threshold conduction. Sub-threshold leakage for a given device strongly depends on threshold voltage choice, process condition, supply voltage and temperature. For sub-100nm CMOS, other leakage components have become increasingly important [28]. The most prominent ones are direct tunnelling currents through the thin gate-oxide, and band-to-band tunnelling currents (mainly GIDL). Both leakage components are strongly VDD dependent. Figure 3.12 and Figure 3.13 put in perspective leakage current as a function of power supply voltage and temperature for a high-Vth NMOS device in 65nm LP-CMOS technology. These results are obtained through circuit simulations for a typical process condition. Observe in Figure 3.12 that sub-threshold leakage, gate-oxide tunnelling, and BTBT currents are of the same order of magnitude at nominal process-voltage-temperature conditions. Both Figure 3.12 and Figure 3.13 show that the dominant component of the total leakage depends on the operating condition.

[Figure 3.12: plot of leakage current [A/µm] (logarithmic axis, 10E-15 to 100E-12) versus power supply voltage [V] (0.4 to 1.4) for a minimum-channel length device at T=25°C, with curves for total leakage, subthreshold, gate oxide tunneling and BTBT.]

Figure 3.12 Leakage versus VDD for a 65nm LP HVT NMOS device.

[Figure 3.13: plot of leakage current [A/µm] (logarithmic axis, 10E-15 to 1E-9) versus temperature [°C] (-50 to 150) for a minimum-channel length device at VDD=1.2V, with curves for total leakage, subthreshold, gate oxide tunneling and BTBT.]

Figure 3.13 Leakage versus temperature for a 65nm LP HVT NMOS device.

Figure 3.14 shows the impact of VS and BB on the leakage current of the CGU in 65nm LP-CMOS at T=25°C. The plot shows measured leakage current versus body bias for three distinct values of the power supply voltage. Body biasing is applied symmetrically to the N-well and P-well. The forward and reverse body biasing ranges are indicated. Figure 3.14 clearly shows that the leakage current grows exponentially when forward body biasing is applied. This is because of the increased sub-threshold leakage when lowering the Vth’s. In reverse body biasing operation, the leakage current reaches a minimum value around 500mV RBB. For stronger reverse body biasing, BTBT dominates the leakage current, eliminating the ability of BB to reduce leakage. Observe in Figure 3.14 that applying an RBB of 300mV at VDD=1.2V is as effective as lowering VDD by that same amount. For larger RBB at VDD=1.2V, VS becomes more effective in reducing leakage. This is because BTBT current and gate-oxide leakage are strongly reduced at lower VDD.

[Figure 3.14: plot of CGU leakage current [A] (logarithmic axis, 1E-9 to 1E-6) versus body bias voltage [V] (-1.4 to 0.6, from RBB to FBB) for the median sample at T=25°C, with curves for VDD=1.2V, VDD=0.9V and VDD=0.6V.]

Figure 3.14 Leakage reduction in 65nm SVT LP-CMOS using VS and BB.

For the measured die sample, leakage reduces by 5.1x when VDD is scaled down from 1.2V to 0.6V. When using RBB alone at VDD=1.2V, leakage decreases by only 2.9x. This low impact of RBB is caused by the high level of BTBT current, as explained before. When using RBB alone at VDD=0.6V, leakage decreases by 6.8x. The combination of VS with RBB renders a leakage reduction of 34.6x. Forward body biasing by 0.4V at VDD=1.2V, 0.9V or 0.6V increases the leakage current by 7.4x, 10.2x, or 13.7x, respectively. The actual leakage savings obtained with VS and RBB are impacted by temperature. At elevated temperatures, the Vth’s become lower, causing sub-threshold leakage to become a bigger part of the total leakage current. BTBT current depends only weakly on temperature, and gate-oxide leakage is not temperature dependent. The temperature dependence of the leakage current was also measured for various die samples, in order to quantify its impact on the potential of VS and RBB to reduce leakage. Figure 3.15 shows experimental results for leakage reduction versus temperature for the same die sample as before. Observe that VS becomes less effective in reducing leakage with increasing temperature. This is because the leakage increases exponentially with a reducing Vth, while VS cannot compensate for such a leakage increase. RBB can reduce leakage slightly more effectively when temperature increases, because the total leakage current becomes fully dominated by sub-threshold leakage. At very high temperatures, i.e. T=100°C, the Vth is lowered so much that RBB cannot further reduce leakage because of the constrained body bias range used in the experiments. The trend of VS+RBB shows the collective effect of reducing leakage by VS and RBB. In this case, leakage savings are about constant for temperatures up to 75°C.

[Figure 3.15: leakage reduction factor versus temperature [°C]. Data labels:

Temperature [°C]     25      50      75      100
VS                   5.1     4.0     3.2     2.4
RBB (VDD=1.2V)       2.8     3.5     3.5     2.6
RBB (VDD=0.6V)       6.8     8.9     9.7     7.2
VS+RBB               34.6    35.8    30.8    17.4]

Figure 3.15 Temperature-dependent leakage reduction in 65nm LP-CMOS SVT.
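The collective effect of VS and RBB can be checked numerically. The following is a minimal sketch, assuming that the combined reduction is approximately the product of the VS factor and the RBB factor at the reduced supply voltage; the values are the data labels of Figure 3.15, while the function and variable names are illustrative only.

```python
# Leakage reduction factors read from the data labels of Figure 3.15
# (65nm LP-CMOS SVT, single die sample). Names are illustrative.
temperatures_c = [25, 50, 75, 100]
vs_only = [5.1, 4.0, 3.2, 2.4]           # VDD scaled from 1.2V to 0.6V
rbb_at_low_vdd = [6.8, 8.9, 9.7, 7.2]    # RBB alone at VDD=0.6V
vs_plus_rbb = [34.6, 35.8, 30.8, 17.4]   # measured combination


def combined_estimate(vs_factor, rbb_factor):
    """Estimate the VS+RBB reduction as the product of both factors."""
    return vs_factor * rbb_factor


def relative_error(estimate, measured):
    """Relative deviation of the estimate from the measured value."""
    return abs(estimate - measured) / measured


for t, vs, rbb, measured in zip(temperatures_c, vs_only,
                                rbb_at_low_vdd, vs_plus_rbb):
    estimate = combined_estimate(vs, rbb)
    print(f"T={t:3d}C  VS x RBB = {estimate:5.1f}x  "
          f"measured = {measured:5.1f}x  "
          f"error = {100 * relative_error(estimate, measured):.1f}%")
```

For these data the product stays within about one percent of the measured VS+RBB factor at all four temperatures.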

The actual leakage savings achieved by VS and RBB are also impacted by process parameter variations as well as by the Vth option. Sub-threshold leakage strongly depends on process skew, while gate-oxide leakage and BTBT current are only weakly dependent. The dominant leakage components determine whether VS or RBB is more effective. The leakage current of the CGU has been measured for 40 die samples from the same silicon wafer at 25°C. Leakage currents ranging from 17.3nA up to 322.6nA were observed depending on the die sample. This corresponds to a leakage current variation of about 18.7x.


Table 3.4 shows the average leakage current savings for 65nm LP-CMOS obtained for the measured 40 die samples. The reduction factors for 90nm GP-/LP-CMOS and 45nm LP-CMOS technologies are also shown in this table. The product of leakage savings with VS and RBB yields substantial benefits as indicated in row VS+RBB.

Table 3.4 Leakage current savings for various CMOS nodes (SVT) at T=25°C.

VS                  5.3x     3.3x     5.6x     4.1x
RBB (VDD_low)       4.1x     6.6x     4.5x     3.8x
RBB (VDD_high)      1.2x     3.5x     2.5x     2.0x
VS+RBB              21.6x    21.5x    24.8x    15.3x

3.6 Tuning Ranges for Different Vth options

Traditionally, a number of Vth options are available for the transistor devices in a given CMOS technology node to enable designers to optimize their designs for performance, or leakage. Low-Vth (LVT) is offered to provide a performance advantage over the standard-Vth (SVT) at the expense of higher device leakage. Contrarily, high-Vth (HVT) is offered to provide a lower device leakage as compared to SVT at the cost of a lower performance. Let us address now the impact of Vth choice on the tuning ranges with VS and BB for a 45nm LP-CMOS technology.

[Figure 3.16: ring-oscillator frequency [Hz] (logarithmic axis) versus power supply voltage [V] for HVT, SVT and LVT, with frequency annotations 401MHz, 212MHz and 310MHz; 45nm LP-CMOS, 101-stage ring-oscillator, 57 die samples, nominal body bias, T=25°C.]

Figure 3.16 Frequency versus VDD for different Vth-options in CMOS 45nm.

Figure 3.16 shows the frequency distributions versus supply voltage for 57 die samples of an LVT, SVT, and HVT 101-stage ringo, respectively. The symbols indicate the frequency of the median die sample. The ringo frequencies at the nominal supply voltage (VDD=1.1V) were measured at 410MHz (LVT), 310MHz (SVT), and 212MHz (HVT). When comparing the median die samples, LVT is about 28% faster than SVT, while HVT is about 48% slower than SVT at the nominal VDD point. A frequency downscaling of 7.8x (LVT), 13.4x (SVT), and 33.3x (HVT) was observed when VDD reduces from 1.1V to 0.6V. For the SVT median sample, the use of BB at VDD=1.1V provided a large frequency tuning range from -19% to +27% with respect to the nominal operating point. The frequency tuning ranges are -11% (-31%) up to +17% (+41%) for an LVT (HVT) median die sample. This shows that the frequency tuning range is lowest for LVT and highest for HVT. Despite the reduced range for LVT, its tuning range is still significant. BB tuning can effectively improve circuit performance or reduce circuit leakage.

Let us now probe whether BB-tuned SVT circuits can eliminate the use of LVT or HVT masks in 45nm LP-CMOS. Figure 3.17 shows the oscillation frequency versus leakage current for ringo's with different Vth options under body bias conditions. The solid symbols indicate the nominal body bias point. The median die samples yielded an LVT leakage about 5.3x higher than that of SVT at VDD=1.1V and 25°C, while being about 28% faster, as mentioned before. For HVT, about 3.1x lower leakage than SVT was measured while being about 48% slower. At VDD=1.1V the experiments confirm that SVT with 0.5V FBB can achieve LVT performance. However, this gives a 3.6x higher leakage than LVT for the median samples. VS is not preferred for achieving LVT performance due to the large power penalty associated with VDD up-scaling. Furthermore, it was observed that SVT with RBB alone cannot achieve nominal HVT leakage. This is due to the small body factor (γ) available, and the presence of BTBT current (mainly GIDL) at large RBB values. Alternatively, VS alone or combined with RBB enables SVT circuits to effectively achieve HVT leakage.

[Figure 3.17: normalized frequency versus normalized leakage (logarithmic axis) for HVT, SVT and LVT under RBB and FBB, with the 0.5V FBB and 1.1V RBB points marked; 45nm LP-CMOS, 101-stage ring-oscillator, median die sample, VDD=1.1V, T=25°C.]

Figure 3.17 Frequency versus leakage for a BB-tuned ringo in CMOS 45nm.
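To place the quoted leakage ratios on a single scale, they can be normalized to the nominal SVT operating point. The sketch below is illustrative only: it assumes that the ratios quoted in the text for the median samples can simply be multiplied, and the constant and function names are not taken from the measurements.

```python
# Ratios quoted in the text for the 45nm LP-CMOS median die samples
# (VDD=1.1V, T=25C). Everything is normalized to nominal SVT.
LVT_LEAKAGE_VS_SVT = 5.3       # LVT leaks about 5.3x more than SVT
HVT_LEAKAGE_REDUCTION = 3.1    # HVT leaks about 3.1x less than SVT
SVT_FBB_LEAKAGE_VS_LVT = 3.6   # SVT with 0.5V FBB leaks 3.6x more than LVT


def normalized_leakage():
    """Return the leakage of each option relative to nominal SVT."""
    return {
        "SVT": 1.0,
        "LVT": LVT_LEAKAGE_VS_SVT,
        "HVT": 1.0 / HVT_LEAKAGE_REDUCTION,
        # Assumption: the two quoted ratios compose multiplicatively.
        "SVT + 0.5V FBB": SVT_FBB_LEAKAGE_VS_LVT * LVT_LEAKAGE_VS_SVT,
    }


for option, leakage in normalized_leakage().items():
    print(f"{option:>15s}: {leakage:6.2f}x nominal SVT leakage")
```

Under this assumption, an SVT circuit tuned to LVT speed with 0.5V FBB leaks roughly 19x more than nominal SVT, which illustrates why FBB is best applied only while the circuit is active.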

Figure 3.18 shows the SVT leakage current distributions versus body biasing for two distinct VDD values at 25°C. The symbols indicate the results for the median sample. At VDD=1.1V, leakage savings of 1.5x-2.8x were measured using optimal RBB settings. Reducing VDD from 1.1V down to 0.6V is more effective (4.0x-4.5x). Combined VS+RBB provided 10x-22x leakage savings. The actual leakage savings are strongly temperature dependent, as shown before in Figure 3.15.


[Figure 3.18: normalized leakage (logarithmic axis) versus body bias (RBB to FBB) at VDD=1.1V and VDD=0.6V, with the HVT-min and HVT-max levels, the VS step and a 1.2µA annotation marked; 45nm LP-CMOS, SVT ring-oscillator core, 57 die samples, T=25°C.]

Figure 3.18 Leakage versus body biasing in CMOS 45nm.

3.7 Process-Dependent Timing Variability Decomposition

The influence of process parameter spread on circuit behaviour keeps increasing. For instance, in older technologies with feature sizes above 0.18µm, a Vth spread of, say, 50mV on a nominal Vth of 450mV was not that crucial. In nanometer technologies with a nominal Vth of 250mV, the same variation can make circuit operation quite difficult. The influence of process variability in digital integrated circuits shows up as variation in maximum operating speed and in power consumption. This section concentrates on process-dependent timing variations in 45nm LP-CMOS.

The contributions of systematic and random process variability to ringo timing were evaluated first. For this purpose, pairs of layout-identical ringo's that are placed on the same die sample are used. The two identical instances that constitute a ringo-pair are located at a short distance of 125µm from each other in a matched environment, so that both experience the same systematic process variability within the die. The impact of systematic and random process variability on ringo timing has been determined by correlating the oscillation periods of the two ringo's in the ringo-pair for each die sample. Figure 3.19 shows the clock period correlation plot of the ringo-pair for the 57 available die samples from the same wafer. Each point relates to the two ringo's of one ringo-pair. This representation makes it possible to separate random process variability effects from the total process variability, as indicated in Figure 3.19. If only systematic process variability were present, the point that corresponds to both ringo's of a given ringo-pair would be located on the correlation line. This implies that any difference between the oscillation periods of the two ringo's of a ringo-pair is due to random process variability. In Figure 3.19, the timing variations due to random effects appear in the direction perpendicular to the correlation line. The total variation, which contains both systematic and random process variability, appears along the correlation line.


[Figure 3.19: oscillation period of Ring-B [ns] versus oscillation period of Ring-A [ns], with the direction of random variation indicated; 45nm LP-CMOS, 101-stage SVT ring-oscillator, 57 pairs of oscillators, VDD=1.1V, T=25°C.]

Figure 3.19 Oscillation period correlation plot of layout-identical ringo’s.

Statistical delay variations due to systematic and random effects have been calculated as follows. Let the ring delay d be defined as half of the oscillation period of a given ringo, d = Tosc/2, i.e. the delay through all of its delay stages. Each ringo-pair, with ring delays d1 and d2 for the first and second ringo, respectively, has been used in the calculations. The mean ring delay (µring) that accounts for all ringo-pairs can be calculated as

$$\mu_{ring} = \frac{1}{n}\sum_{i=1}^{n} \bar{d}_i = \frac{1}{n}\sum_{i=1}^{n} \frac{d_{1,i} + d_{2,i}}{2} \tag{3.2}$$

where n is the total number of ringo-pairs, and $\bar{d}_i$ is the mean ring delay of the two ringo's in the i-th ringo-pair. The variance ($\sigma^2_{ring}$), which includes the total process variability, can be calculated as

$$\sigma^2_{ring} = \frac{1}{n}\sum_{i=1}^{n} \left(\bar{d}_i - \mu_{ring}\right)^2 \tag{3.3}$$

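The variance due to random effects ($\sigma^2_{RND,ring}$) can be obtained from

$$\sigma^2_{RND,ring} = \frac{1}{n}\sum_{i=1}^{n} \left[\left(d_{1,i} - \bar{d}_i\right)^2 + \left(d_{2,i} - \bar{d}_i\right)^2\right] \tag{3.4}$$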

And finally, the variance due to systematic effects ($\sigma^2_{SYS,ring}$) can be calculated as

$$\sigma^2_{SYS,ring} = \sigma^2_{ring} - \sigma^2_{RND,ring} \tag{3.5}$$

For homogeneous ring-oscillator structures, the mean delay of a single cell in a ringo (µcell) can be determined by dividing the mean ring delay (µring) by the number of delay cells in the ringo (N)


$$\mu_{cell} = \frac{\mu_{ring}}{N} \tag{3.6}$$

Similarly, one can determine the variances due to random and systematic effects per cell ($\sigma^2_{RND,cell}$, $\sigma^2_{SYS,cell}$). The impact of random process variability on each cell is not correlated. Contrarily, the impact of systematic process variability is fully correlated for each cell in a ringo. Therefore, the variances per cell can be calculated as follows

$$\sigma^2_{RND,cell} = \frac{\sigma^2_{RND,ring}}{N} \tag{3.7}$$

$$\sigma^2_{SYS,cell} = \frac{\sigma^2_{SYS,ring}}{N^2} \tag{3.8}$$
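The calculation steps of expressions (3.2)-(3.8) can be sketched in a few lines of code. This is an illustrative sketch only: the function names are hypothetical, and the input data are synthetic values generated under assumed spreads (one shared offset per die, plus independent offsets per ringo), not measured silicon results.

```python
import random
from statistics import fmean


def decompose_ring_variance(pairs, n_cells):
    """Apply expressions (3.2)-(3.8) to a list of (d1, d2) ring delays."""
    n = len(pairs)
    pair_means = [(d1 + d2) / 2 for d1, d2 in pairs]
    mu_ring = fmean(pair_means)                                    # (3.2)
    var_ring = sum((m - mu_ring) ** 2 for m in pair_means) / n     # (3.3)
    var_rnd = sum((d1 - m) ** 2 + (d2 - m) ** 2
                  for (d1, d2), m in zip(pairs, pair_means)) / n   # (3.4)
    var_sys = var_ring - var_rnd                                   # (3.5)
    return {
        "mu_ring": mu_ring,
        "mu_cell": mu_ring / n_cells,                              # (3.6)
        "var_ring": var_ring,
        "var_rnd_ring": var_rnd,
        "var_sys_ring": var_sys,
        "var_rnd_cell": var_rnd / n_cells,                         # (3.7)
        "var_sys_cell": var_sys / n_cells ** 2,                    # (3.8)
    }


def synthetic_pairs(n_pairs, mean, sigma_sys, sigma_rnd, seed=0):
    """Hypothetical ring delays [ps]: a shared offset per die sample
    plus an independent offset per ringo of the pair."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        shared = rng.gauss(0.0, sigma_sys)
        pairs.append((mean + shared + rng.gauss(0.0, sigma_rnd),
                      mean + shared + rng.gauss(0.0, sigma_rnd)))
    return pairs


# Assumed example values: 57 pairs of 101-stage ringo's, delays in ps.
pairs = synthetic_pairs(n_pairs=57, mean=1600.0, sigma_sys=50.0, sigma_rnd=15.0)
for name, value in decompose_ring_variance(pairs, n_cells=101).items():
    print(f"{name:>13s} = {value:12.3f}")
```

The per-cell results follow directly from the ring-level variances, so the same function can be reused for ringo's of different length by changing `n_cells`.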

By using expressions (3.2)-(3.8), the statistical delay spread of an 11-, 21-, 31-, 41- and 101-stage ringo has been calculated for three BB values at VDD=1.1V. Figure 3.20 and Figure 3.21 show the results; the symbols relate to the 3σ systematic and random delay spread, respectively, as obtained from the 11-, 21-, 31- and 41-stage ringo. The trend lines are extrapolated from the 101-stage ringo and closely match the results from the other ringo's. For a given BB voltage, the random variability is a significant portion of the total variance for a small number of delay cells. For an increasing number of cells, the random variability becomes relatively less important due to the averaging effect associated with its statistically independent nature in different cells ($\sigma_{RND,ring} \propto \sqrt{N}$ and $\sigma_{SYS,ring} \propto N$). For the considered silicon under NBB, the systematic and random delay spreads are equally large for about four delay cells, while the systematic delay spread is dominant for a larger number of delay cells. Both systematic and random delay spreads reduce consistently when FBB is applied, while they increase for RBB. This is because of the magnitude of Vth relative to the gate drive of the transistor devices, VDD-Vth. Under FBB, Vth is the smallest fraction of the gate drive, so Vth variations have less impact on gate drive variations and thus cause less delay variation. The opposite holds true for RBB.

[Figure 3.20: systematic 3σ delay spread [ps] versus number of delay cells for 1.1V RBB, NBB and 0.5V FBB; 45nm LP-CMOS, SVT ring-oscillators, VDD=1.1V, T=25°C.]

Figure 3.20 Systematic delay spread versus logic depth under body biasing.

[Figure 3.21: random 3σ delay spread [ps] versus number of delay cells for 1.1V RBB, NBB and 0.5V FBB; 45nm LP-CMOS, SVT ring-oscillators, VDD=1.1V, T=25°C.]

Figure 3.21 Random delay spread versus logic depth under body biasing.

Figure 3.22 shows a more detailed analysis for the 21-stage ringo. The total delay spread is about 2x lower for 0.5V FBB with respect to the nominal BB case. Contrarily, the spread is about 2x higher for 1.1V RBB. Observe that FBB can significantly reduce both systematic and random delay spread in 45nm LP-CMOS.

[Figure 3.22: 3σ delay spread [ps] versus body bias; 45nm LP-CMOS, 21-stage SVT ring-oscillator, VDD=1.1V, T=25°C. Data labels:

Body bias [V]        -1.1     0        0.5
Total spread         71.50    34.36    18.59
Systematic spread    67.68    31.43    16.52
Random spread        23.05    13.88    8.52]

Figure 3.22 Estimated delay spread for the 21-stage ringo in CMOS 45nm.
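The numbers of Figure 3.22 can be cross-checked against expression (3.5) and against the scaling of both spreads with the number of delay cells. The sketch below is illustrative: it assumes that the systematic spread scales with N and the random spread with the square root of N, and the function names are hypothetical.

```python
import math

# 3-sigma delay spreads [ps] read from Figure 3.22 (21-stage SVT ringo).
N_STAGES = 21
SPREADS = {
    "1.1V RBB": {"total": 71.50, "systematic": 67.68, "random": 23.05},
    "NBB":      {"total": 34.36, "systematic": 31.43, "random": 13.88},
    "0.5V FBB": {"total": 18.59, "systematic": 16.52, "random": 8.52},
}


def total_from_components(systematic, random_part):
    """Combine both spreads in quadrature, since variances add (3.5)."""
    return math.hypot(systematic, random_part)


def crossover_cells(systematic, random_part, n_stages):
    """Number of cells at which both spreads are equally large, assuming
    the systematic spread scales with N and the random one with sqrt(N)."""
    sys_per_cell = systematic / n_stages
    rnd_per_cell = random_part / math.sqrt(n_stages)
    return (rnd_per_cell / sys_per_cell) ** 2


for bias, s in SPREADS.items():
    total = total_from_components(s["systematic"], s["random"])
    cells = crossover_cells(s["systematic"], s["random"], N_STAGES)
    print(f"{bias:>8s}: total = {total:5.2f} ps "
          f"(figure: {s['total']:5.2f} ps), crossover at {cells:.1f} cells")
```

For the NBB column this gives a crossover close to four delay cells, in line with the observation made for Figure 3.20 and Figure 3.21.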

Figure 3.23 puts in perspective the oscillation period mean (symbols) and overall 3σ-spread (error bars) versus body biasing for the 21-stage ringo. A ±11% spread in oscillation period was observed at the nominal operating point. This spread could be fully compensated through BB tuning using up to 0.2V FBB for slow die samples, and up to 0.7V RBB for fast die samples. This shows that BB enables compensation of process-dependent performance spread, as will be discussed in the next section.


[Figure 3.23: normalized oscillation period versus body bias (RBB to FBB); 45nm LP-CMOS, 21-stage SVT ring-oscillator, VDD=1.1V, T=25°C.]

Figure 3.23 Estimated oscillation period for a 21-stage ringo in CMOS 45nm.

3.8 Performance-Spread Compensation

Understanding the tradeoffs in performance and power is not sufficient to ensure a successful outcome of the IC. The basic problem is that deep sub-micron process technologies fail to maintain constant process tolerances, which opens avenues for new, challenging low-power process options and emerging design technologies. As the variation of fundamental parameters such as channel length, threshold voltage, thin-oxide thickness and interconnect dimensions goes well beyond acceptable limits, “on the fly” performance compensation is becoming necessary. In this section, post-silicon tuning strategies are shown that enable compensation of process-dependent performance spread, using 65nm LP-CMOS and 45nm LP-CMOS examples.

[Figure 3.24: ringo frequency [MHz] versus CGU leakage current [nA] for the measured die samples, with the slow, fast, nominal and unbalanced process corners indicated. Corner results: fast 427MHz, 430nA; fnsp 337MHz, 144nA; nominal 336MHz, 71nA; snfp 335MHz, 88nA; slow 270MHz, 17nA.]

Figure 3.24 Ringo frequency and leakage wafer spread for 40 dies in CMOS 65nm.


Figure 3.24 shows the measured ringo frequency and CGU leakage current at the nominal VDD of 1.2V and the nominal temperature of 25°C for 40 die samples coming from the same 65nm LP-CMOS SVT wafer. The five process corner specifications, as determined from circuit simulations, are indicated as well. The fast process corner gives the lowest Vth value within the process window for both PMOS and NMOS transistors. Contrarily, the slow process corner gives the highest Vth value for both transistors. The unbalanced process corners give a high Vth for one transistor and a low Vth for the other one. The nominal process corner gives the nominal Vth value within the process window for both PMOS and NMOS transistors. The total frequency and leakage spread of the measured sample set is found to be about 100MHz and 305nA, respectively. This translates into a relative frequency spread of ~36%, and a relative leakage spread of ~18.7x. The samples with frequencies below “nominal” are considered yield losses, while samples above “nominal” consume unnecessary extra power. The leakage of a “fast” corner sample is about 6.1x higher than the “nominal” reference. Contrarily, the leakage of a “slow” corner sample is about 4.2x lower.

Let us now discuss three strategies for compensating the undesired process-dependent frequency and leakage spread by means of post-silicon tuning. A first strategy is to perform post-silicon tuning with body biasing only. The tuning ranges for “fast” and “slow” samples were determined from experiments. Figure 3.25 shows the potential of body biasing to compensate performance for the same die samples as shown before. With 0.4V FBB, a 21% frequency increment from the slow corner renders a target frequency of 327MHz, and likewise, a 14% adjustment with 1.2V RBB from the fast corner results in a target frequency of 366MHz. At the same time, the leakage current increases by ~9.8x (from 17nA to 170nA) for a “slow” corner sample, and reduces by ~2.5x (from 430nA to 177nA) for a “fast” corner sample. Observe that in both cases, that is, from slow to nominal and from fast to nominal, the leakage current of the tuned device is approximately 2.4x higher than the “nominal” reference. For the available die sample set it was shown that the application of BB improves the parametric yield to basically 100%. In addition, the leakage spread can be reduced to a factor of ~3.8x, as indicated in Figure 3.25 by the dotted line at a nominal frequency of 336MHz.

[Figure 3.25: ringo frequency [MHz] versus CGU leakage current [nA] with the same samples and corner results as in Figure 3.24. The FBB-tuned slow corner reaches 327MHz at 170nA; the RBB-tuned fast corner reaches 366MHz at 177nA.]

Figure 3.25 Process-dependent performance compensation with body biasing.
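The effect of BB tuning on the corner samples can be expressed relative to the nominal corner. The sketch below simply normalizes the corner values quoted for Figure 3.24 and Figure 3.25; the dictionary layout and the function name are illustrative choices.

```python
# Corner values quoted for Figures 3.24 and 3.25 (65nm LP-CMOS SVT):
# (ringo frequency [MHz], CGU leakage current [nA]).
CORNERS = {
    "fast": (427.0, 430.0),
    "nominal": (336.0, 71.0),
    "slow": (270.0, 17.0),
    "slow + 0.4V FBB": (327.0, 170.0),
    "fast + 1.2V RBB": (366.0, 177.0),
}


def normalize(corners, reference="nominal"):
    """Express frequency and leakage relative to the reference corner."""
    f_ref, i_ref = corners[reference]
    return {name: (f / f_ref, i / i_ref) for name, (f, i) in corners.items()}


for name, (rel_f, rel_i) in normalize(CORNERS).items():
    print(f"{name:>16s}: frequency = {rel_f:4.2f}, leakage = {rel_i:5.2f}")
```

After tuning, both corner samples end up within roughly 10% of the nominal frequency, at a leakage of about 2.4x to 2.5x the nominal reference.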


A second strategy for compensating frequency and leakage spread is based on using BB and VS independently. FBB is used to increase the performance of “slow” samples, as explained before. VS is not used for “slow” samples because it would require a higher supply voltage than nominal, which may lead to reliability issues for the silicon. Therefore, VS is only used to reduce the frequency and total power of “fast” samples. This approach is more power-efficient than using RBB alone, because now both dynamic and leakage power are reduced. For a “fast” corner sample, VS can lower VDD by about 124mV, which reduces its switching energy by ~19.6% while the nominal frequency specification is still met. The leakage current reduces less than when using BB alone; it reduces by ~1.1x (from 430nA to 386nA) for a “fast” corner sample. Consequently, the leakage current of the tuned device is about 5.5x higher than the “nominal” reference.

A third and last strategy consists of setting VS+FBB jointly. Again, FBB alone is used to increase the performance of “slow” samples. “Fast” samples are biased using VS+FBB to meet the nominal frequency specification while saving power. FBB is used to reduce Vth such that VS can reduce VDD further than in the case without FBB, thereby enabling further overall power savings. Combined VS+FBB for a “fast” corner sample can lower VDD by about 219mV, which reduces the switching energy by about 33.3%. However, this comes at the penalty of an increased leakage current. For a “fast” corner sample with 0.4V FBB, the leakage increases by about 3.7x (it becomes 1600nA) as compared to the “fast” corner with no FBB. When compared against the “nominal” reference, the leakage current is about 23x higher.

Figure 3.26 puts in perspective the previous results for compensating process-dependent performance spread in 65nm LP-CMOS. The values for frequency, power supply voltage, and leakage current are plotted for the reference and tuned process corners. The indicated numbers are normalized to the “nominal” corner reference. Notice that BB can effectively reduce frequency and leakage spread, while VS can trade off higher operating frequency for improved power efficiency. Further total power savings can be achieved with VS+FBB at the expense of increased leakage.
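The quoted switching energy savings follow from the quadratic dependence of switching energy on the supply voltage. The sketch below assumes that the switching energy is proportional to VDD squared with an unchanged switched capacitance; the function name is hypothetical.

```python
VDD_NOMINAL = 1.2  # nominal supply voltage of the 65nm LP-CMOS examples [V]


def switching_energy_saving(vdd_drop, vdd_nominal=VDD_NOMINAL):
    """Fractional switching energy saving when VDD is lowered by vdd_drop,
    assuming the switching energy is proportional to VDD squared."""
    ratio = (vdd_nominal - vdd_drop) / vdd_nominal
    return 1.0 - ratio ** 2


for label, drop in (("VS", 0.124), ("VS+FBB", 0.219)):
    saving = switching_energy_saving(drop)
    print(f"{label:>6s}: VDD = {VDD_NOMINAL - drop:.3f}V, "
          f"switching energy saving = {100 * saving:.1f}%")
```

Under this assumption, lowering VDD by 124mV and by 219mV yields savings of about 19.6% and about 33%, respectively, which is consistent with the values reported for the “fast” corner sample.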

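[Figure 3.26: relative frequency, relative supply voltage and relative leakage, normalized to the nominal corner, for the reference corners and the tuned corners. Data labels:

                            Fast    Nominal  Slow    FBB     RBB     VS      VS+FBB
Relative frequency          1.27    1        0.8     0.97    1.09    1       1
Relative supply voltage     1       1        1       1       1       0.9     0.82
Relative leakage            6.06    1        0.24    2.39    2.49    5.44    22.54

Fast, Nominal and Slow are the reference corners; FBB is the slow corner compensation; RBB, VS and VS+FBB are the fast corner compensations.]

Figure 3.26 Performance compensation strategies in 65nm LP-CMOS SVT.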

Finally, the capability of body biasing for achieving performance compensation in 45nm LP-CMOS is shown. Frequency and leakage were measured for the 57 available die samples from the same wafer. For each die sample, it was possible to tune its frequency to the nominal target specification through BB tuning. A BB range from 0.1V RBB up to 0.2V FBB is required. This basically enhances the parametric yield to 100% for the sample set. A 32% frequency increase with 0.5V FBB was measured for the slowest die sample at a 26x leakage penalty. This offers sufficient tuning range for compensating process-dependent performance spread.

[Figure 3.27: BB-tuned die samples plotted against normalized leakage (logarithmic axis) with the nominal frequency specification marked, together with the number of samples per body bias setting from -0.1V to 0.2V; 45nm LP-CMOS, 101-stage SVT ring-oscillator, 57 die samples, VDD=1.1V, T=25°C.]

Figure 3.27 Performance compensation in 45nm LP-CMOS SVT.

3.9 Discussion

In this chapter experimental results have been presented that show the extent to which voltage scaling and body bias are useful for power and delay tuning in state-of-the-art CMOS technologies. It was shown that VDD scaling is primarily beneficial for low-power operation, and body biasing for performance tuning. For instance, for a state-of-the-art 65nm LP-CMOS technology, power savings are in the order of 82x through 20x frequency downscaling. Contrary to the belief that high-Vth has a considerable impact on leakage power reduction, it was found that RBB alone reduces leakage by only 2.5x at VDD=1.2V. At a lower supply voltage (VDD=0.6V), one can observe a larger leakage reduction of 6.8x. However, combined VDD scaling and body biasing yields ~25x leakage reduction. With the increased impact of process variability on circuit design, body biasing turns out to be a good design technology to keep parametric yield under control. In particular, it provides the means to tune devices with characteristics in the slow or fast process corners to the performance specifications of a nominal process corner. While at VDD=1.2V a ±20% frequency and a ±22% power tuning range of body biasing may look limited, the frequency tuning range proves to be effective for process-dependent performance compensation. In fact, one can observe continuous frequency tuning despite the wide frequency spread. These tuning indices show that the combined use of VDD scaling and body biasing offers significant performance control. Of course this tuning comes at the price of increased static power consumption. In the achieved results this static power increase is in the order of 2.4x to meet the required specs.


Chapter 4

Embedded Forward Body Bias Generation

The convergence of multiple applications into a single device drives integrated circuit solutions that are both high performance and power efficient. The application of post-silicon tuning in integrated circuits enables trading off the chip power and performance per die sample. The trend towards higher integration densities in modern chips favours a fully integrated solution of the required components for post-silicon tuning to enable more cost-effective system solutions. In this chapter, the design and implementation of an embedded FBB generator for high-performance digital circuits is presented. Experimental results will be provided for a standalone FBB generator circuit that is implemented in a 90nm LP-CMOS technology.


In this work the focus is on applying forward body biasing to improve digital circuit performance. When a circuit is active, FBB is preferred over VDD scaling to enhance performance due to its lower dynamic power penalty, as discussed in section 3.6. The joint use of FBB and VDD reduction is preferred over VDD reduction alone for achieving low-power operation. When a circuit is in standby, FBB should not be applied because it increases leakage power. This motivates the application of FBB dynamically at runtime [42]. FBB requires a voltage generator circuit to generate the required N-well and P-well bias, respectively. From an industrial perspective the generator should comply with the following requirements: 1) it should be digitally controllable to simplify system integration, 2) the FBB voltage generation should be transparent to any voltage scaling approach, i.e. the amount of applied FBB should be constant relative to the supply voltage of the digital circuit, 3) the FBB generator should be powered from the available core supply, and finally, 4) it should have low power consumption and small area occupation.

Several FBB generators have been proposed in the literature, but none of them meets all of the aforementioned requirements. Tschanz et al. presented an adaptive body bias (ABB) voltage generator [19]. The main drawback of their implementation is that FBB is applied only to PMOS transistors to avoid the use of a triple-well technology. In addition, the FBB voltage is VDD-dependent, and a voltage level higher than VDD is needed. Choi and Shin proposed a more sophisticated solution for providing body bias voltages to multiple macros in the design [43]. However, their solution also requires voltage levels higher than the core VDD and lower than VSS, mainly for generating RBB, and also for this design the FBB voltage is VDD-dependent. Sumita et al. presented another ABB generator [44]. However, it has similar constraints as the one proposed in [43]. Komatsu et al. proposed a FBB generator for enabling self-adjusted FBB [45]. Their solution cannot dynamically control the FBB voltage, while the generated FBB voltage is highly sensitive to VDD and strongly temperature dependent. Other publications imply the use of a FBB generator without discussing its implementation in detail [46][47][48]. In this work a FBB generator has been designed that meets all four aforementioned requirements.

4.2 Load Characteristics

Before presenting the approach to FBB voltage generation, let us first address the N-well and P-well electrical characteristics of a body-biased digital CMOS circuit. Such information is essential for the design specification of the FBB generator, because it provides insight into the behaviour of its load.

4.2.1 N-well and P-well Behaviour

Figure 4.1 shows a cross-section example of a simple digital circuit, namely a CMOS inverter. The potentials of the different circuit nodes have been indicated for a body-biased case; the junction diodes have been indicated as well.

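[Figure 4.1: cross-section of a CMOS inverter showing the P+ and N+ diffusions, the P-well, the N-well, the Deep N-well and the P- substrate, with the nodes Vin, Vout, VDD, Vpwell and Vnwell indicated.]

Figure 4.1 CMOS inverter cross-section with junction diodes displayed.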

Typically, all junction diodes are either reverse biased or non-biased (zero voltage drop across the diode). Under FBB conditions, this is no longer the case. Let us consider the example of a 90nm LP-CMOS technology. At nominal VDD, the junction diodes between P-well and N-well/Deep-N-well remain reverse biased, even under FBB conditions. At reduced VDD's, however, these junctions may become forward biased when Vpwell>Vnwell. The transistor body-source (P-well-N+) junctions become forward biased when Vpwell>0V or Vnwell<VDD. The same holds true for the body-drain junctions, depending on the drain voltage value. Finally, the P-substrate/Deep-N-well junction diode remains reverse biased, since the P-substrate is at ground potential.

As presented in Chapter 2, the junction leakage in a modern CMOS technology consists mainly of two different components: 1) classical diode current (see expression (2.7)), and 2) band-to-band tunnelling current (see expression (2.12)). The well current is the sum of all junction currents related to the respective N-well or P-well. Under FBB conditions, the well current is dominated by diode current. Contrarily, BTBT current dominates the junction current under RBB conditions.

[Figure 4.2: magnitude of P-well current [A] (logarithmic axis) versus P-well bias voltage [V] (RBB to FBB); 90nm LP-CMOS, HVT ring-oscillators, 3 die samples, VDD=1.2V, T=125°C. Under RBB, band-to-band tunneling is dominant (current is sourced); under FBB, classical diode current is dominant (current is sunk).]

Figure 4.2 P-well current versus body bias experiments in 90nm LP-CMOS.

[Figure 4.3: magnitude of N-well current [A] (logarithmic axis) versus (VDD - Vnwell) bias voltage [V] (RBB to FBB); 90nm LP-CMOS, HVT ring-oscillators, 3 die samples, VDD=1.2V, T=125°C. Under RBB, band-to-band tunneling is dominant (current is sunk); under FBB, classical diode current is dominant (current is sourced).]

Figure 4.3 N-well current versus body bias experiments in 90nm LP-CMOS.

Experimental results of P-well and N-well current have been obtained for the CGU design in 90nm LP-CMOS. The CGU design has been described in section 3.1. Figure 4.2 and Figure 4.3 present the results for P-well and N-well, respectively. The well current under RBB conditions is dominated by BTBT current. A well current increase of up to three orders of magnitude was observed for a RBB range up to 1.2V RBB. Under FBB conditions, the well current is dominated by the diode current of the forward biased junctions. In this case, a well current increase of up to four orders of magnitude is observed for a FBB range up to 0.5V FBB. A FBB generator needs to provide sufficient output drive current capability for supplying the diode current of the forward biased junctions.

Another important parameter for the FBB generator design is the capacitance of the wells. The well capacitance consists of two components: 1) junction capacitance (see expression (2.2)), and 2) well-to-channel or well-to-gate capacitance. The overall junction capacitance is the sum of the capacitances of the junction diodes related to the respective N-well or P-well; the junction diodes are illustrated in Figure 4.1. The well-to-channel capacitance is a depletion capacitance that exists when the MOS transistor operates in the linear or saturation regime. Like the junction capacitance, this capacitance is voltage dependent. The well-to-gate capacitance is relevant for those MOS devices that operate in the cut-off regime. This capacitance is not voltage dependent. For digital CMOS logic, the well-to-channel capacitance (MOS in linear region) and the well-to-gate capacitance (MOS in cut-off region) can be expressed by:

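$$C_{well\text{-}to\text{-}channel} = W_{eff} L_{eff} \sqrt{\frac{q\,\varepsilon_{si} N_a}{2\left(\phi_0^* - V_{diode}\right)}} \qquad \forall\; V_{diode} < \phi_0^* \tag{6.1}$$

$$C_{well\text{-}to\text{-}gate} = \frac{\varepsilon_{si} W_{eff} L_{eff}}{t_{ox}} \tag{6.2}$$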

where $W_{eff}L_{eff}$ is the effective transistor gate area, $\phi_0^*$ is the built-in voltage of the well-to-channel junction, $V_{diode}$ is the voltage across this junction, and $t_{ox}$ is the gate oxide thickness of the MOS transistor. Observe that the well-to-channel capacitance depends on FBB, while the well-to-gate capacitance does not. Therefore, the overall well capacitance consists of a FBB dependent and a non-FBB dependent part. The maximum well capacitance is obtained for the case of maximum FBB. The (maximum) load capacitance of the FBB generator should be determined at this condition.

4.2.2 Load Modelling and Analysis

A digital CMOS circuit consists of a multitude of digital logic cells that are placed within standard-cell rows. Figure 4.4 shows a layout implementation example of a body-biased digital standard-cell circuit. All body-biased PMOS transistors within this circuit block experience the same amount of FBB, and the same holds true for the NMOS devices. In this way, body-biased digital cells can share the same physical N-well and P-well. Deep N-well isolation is added for separating the P-well of the body-biased NMOS devices from the P-substrate. Only 2µm extra is needed for the Deep N-well at each side of the body-biased circuit part in 90nm LP-CMOS. The N-well and P-well connections are made through dedicated well tap cells; the tap cells have been inserted in columns at a maximum pitch of 60µm. This maximum pitch is a design rule from the fab to prevent latch-up in the circuit.


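[Figure 4.4: layout of a body-biased standard-cell block with the Vpwell and Vnwell connections; the legend identifies tap cells, N-well, P-well and Deep N-well, with a tap cell pitch of ≤ 60µm and a Deep N-well extension of ~2µm.]

Figure 4.4 Layout implementation example of a body biased digital circuit.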

The total well current and well capacitance of the circuit are the sums of the well current and well capacitance of each digital cell in the circuit. Different digital CMOS circuits contain different numbers and types of digital cells. Therefore, and without loss of generality, one can model a digital circuit by means of reference gates for estimating the overall circuit leakage. A reference gate is a virtual gate that is based on a combination of a 2-input NAND and a 2-input NOR gate with a standard (single) drive capability. The electrical characteristics of a reference gate are determined from the average of the eight input combinations for the NAND and NOR gates. Table 4.1 summarizes the reference gate characteristics in a 90nm SVT LP-CMOS process under nominal PVT conditions and 0.5V FBB, as obtained through circuit simulations.

Table 4.1 Reference gate characteristics in 90nm SVT LP-CMOS.
Conditions: nominal process corner, VDD=1.2V, T=25°C, and 0.5V FBB.

Parameter                 Reference gate      1mm2 circuit
Cell area [µm2]           4.39                1000000
N-well current [A]        -109.92⋅10^-12      -25.04⋅10^-6
P-well current [A]        155.84⋅10^-12       35.50⋅10^-6
N-well capacitance [F]    6.33⋅10^-15         1.44⋅10^-9
P-well capacitance [F]    3.76⋅10^-15         0.86⋅10^-9

The reference gate has been used to represent all gates in the digital circuit. The translation between a given logic gate and a reference gate is based on cell area comparison. For example, a flip-flop with a cell area of about 14.3µm2 is represented by 3.25 reference gates. In this way, one obtains a digital circuit that consists of reference gates only. Such a circuit has been used to determine the total well current and total well capacitance, as required for the design of the FBB generator. Table 4.1 summarizes these characteristics for a 1mm2 digital circuit (using reference gates), for the same conditions as before.
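The area-based translation can be checked with a few lines of arithmetic. The following is a minimal sketch that scales the reference gate values of Table 4.1 to a circuit of a given area; the function and dictionary names are illustrative and not taken from the original design flow.

```python
# Reference gate values from Table 4.1
# (90nm SVT LP-CMOS, nominal process, VDD=1.2V, T=25C, 0.5V FBB).
REF_GATE = {
    "area_um2": 4.39,
    "nwell_current_A": -109.92e-12,
    "pwell_current_A": 155.84e-12,
    "nwell_cap_F": 6.33e-15,
    "pwell_cap_F": 3.76e-15,
}


def equivalent_reference_gates(cell_area_um2):
    """Number of reference gates that represent a cell of the given area."""
    return cell_area_um2 / REF_GATE["area_um2"]


def circuit_well_load(circuit_area_mm2):
    """Total well currents and capacitances for a circuit built of reference gates."""
    n_gates = equivalent_reference_gates(circuit_area_mm2 * 1e6)  # mm2 -> um2
    load = {key: value * n_gates
            for key, value in REF_GATE.items() if key != "area_um2"}
    load["n_gates"] = n_gates
    return load


if __name__ == "__main__":
    print(round(equivalent_reference_gates(14.3), 2))  # flip-flop example
    for name, value in circuit_well_load(1.0).items():
        print(name, value)
```

For a 1mm2 circuit this reproduces the right-hand column of Table 4.1 to within rounding.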


[Plot: magnitude of well current [A] (log scale, 1.0E-18 to 1.0E-06) versus forward body bias [V] (0.0 to 0.6). 90nm LP-CMOS, SVT reference gate, VDD=1.2V, T=25°C. P-well and N-well current curves for process corners snsp, nom, fnfp.]

Figure 4.5 Well current versus FBB and process corner in 90nm LP-CMOS.

[Plot: magnitude of well current [A] (log scale, 1.0E-18 to 1.0E-06) versus forward body bias [V] (0.0 to 0.6). 90nm LP-CMOS, SVT reference gate, nominal process, T=25°C. P-well and N-well current curves for VDD=1.2V and VDD=0.5V.]

Figure 4.6 Well current versus FBB at two VDD’s in 90nm LP-CMOS.

[Plot: magnitude of well current [A] (log scale, 1.0E-18 to 1.0E-06) versus forward body bias [V] (0.0 to 0.6). 90nm LP-CMOS, SVT reference gate, VDD=1.2V, nominal process. P-well and N-well current curves for -40°C, 25°C and 125°C.]

Figure 4.7 Well current versus FBB and temperature in 90nm LP-CMOS.


Figure 4.5 shows the simulated well current of a reference gate as a function of FBB for different process conditions in 90nm LP-CMOS. Observe that both P-well and N-well current are strongly FBB-dependent. A current increase of six orders of magnitude has been observed for a FBB voltage range of up to 0.55V. This well current is due to the diode current of the forward biased junctions, as explained before in section 4.2.1. There is also a weak dependence on process state through the diode saturation current. The well current shows a minimum around a FBB of 0.1V, which is the point at which the well current changes direction. The P-well (N-well) sinks (sources) current for FBB voltages higher than 0.1V. Below 0.1V FBB, the P-well (N-well) sources (sinks) current. The sourcing well current is due to junction leakage of reverse-biased MOS transistor junctions (ideal, SRH/TAT) [30][33]. At the minimum well current point, the currents through the forward- and reverse-biased junctions are balanced.

Figure 4.6 presents the simulation results of the well current of a reference gate as a function of FBB for two distinct VDD voltages in 90nm LP-CMOS. The well current dependence on VDD is small. However, observe that the minimum well current point shifts towards a smaller FBB voltage when VDD is reduced. This is because the leakage of the reverse-biased junctions decreases with the lower reverse bias (∝VDD) that results from VDD reduction.

Figure 4.7 shows the well current of a reference gate as a function of FBB for different operating temperatures in 90nm LP-CMOS. Observe that temperature has a large impact on well current; a current increase of seven orders of magnitude has been observed for a temperature increase from -40°C to 125°C. This is because the built-in potential of a pn-junction is temperature dependent, and the diode current depends exponentially on it, as shown in expression (2.7).
Thus, it is essential to consider the well current at the largest FBB voltage and the highest temperature for the design specification of the FBB generator. Finally, notice that the minimum well current point shifts to a lower FBB value at a higher operating temperature. The opposite holds for the low operating temperature case. This is because the current through forward-biased junctions changes more strongly in magnitude than the current through reverse-biased junctions; it is a direct consequence of the biasing condition of the junction (Vdiode in expression (2.7)).

Figure 4.8 shows the simulation results of well capacitance versus FBB for different process conditions in 90nm LP-CMOS. The N-well capacitance is consistently larger than the P-well capacitance because the PMOS dimensions are larger than the NMOS ones. Observe that the well capacitance does not have the same dynamic range as the well current. However, it can increase by up to about 60% when FBB is increased to 0.55V. The process state shows a similar impact on well capacitance; this sensitivity is caused by variations in the number of dopants of the pn-junctions.

Figure 4.9 presents the simulation results of the well capacitance of a reference gate as a function of FBB for two distinct VDD voltages in 90nm LP-CMOS. The well capacitance dependence on VDD is low, because those junctions with a VDD voltage at one of their terminals are reverse biased, and thus contribute only a small part of the overall well capacitance.

Figure 4.10 shows the well capacitance of a reference gate as a function of FBB for different operating temperatures in 90nm LP-CMOS. Observe that temperature has a similar impact on well capacitance as FBB or process state. The temperature dependence arises because the built-in potential of a pn-junction is temperature dependent.
From a well capacitance perspective, it is essential to consider the largest FBB voltage, slow process state, and high temperature for setting the design specification of the FBB generator.
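The FBB dependence of the depletion term in expression (6.1) can be illustrated numerically. The sketch below evaluates only the bias-dependent factor 1/sqrt(1 - Vdiode/φ0*), normalized to zero bias. Two assumptions are not taken from the text: the built-in voltage of 0.9V is a placeholder value, and Vdiode is set equal to the applied FBB. The simulated well capacitance of Figures 4.8 to 4.10 also contains bias-independent and reverse-biased contributions, so this factor is not a substitute for those results.

```python
import math


def depletion_cap_factor(v_fbb, phi0=0.9):
    """Depletion capacitance at forward bias v_fbb relative to its zero-bias value.

    Implements the bias-dependent part of expression (6.1).
    phi0 is an ASSUMED built-in voltage in volts, not a value from the text.
    """
    if v_fbb >= phi0:
        raise ValueError("expression (6.1) is only valid for Vdiode < phi0")
    return 1.0 / math.sqrt(1.0 - v_fbb / phi0)


if __name__ == "__main__":
    for v in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.55):
        print(f"FBB = {v:.2f} V -> relative capacitance {depletion_cap_factor(v):.3f}")
```

The factor grows monotonically with FBB, which is why the maximum FBB sets the worst-case capacitive load of the generator.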


[Plot: well capacitance [fF] (0 to 10) versus forward body bias [V] (0.0 to 0.6). 90nm LP-CMOS, SVT reference gate, VDD=1.2V, T=25°C. P-well and N-well capacitance curves for process corners snsp, nom, fnfp.]

Figure 4.8 Well capacitance versus FBB and process corner in 90nm LP-CMOS.

[Plot: well capacitance [fF] (0 to 10) versus forward body bias [V] (0.0 to 0.6). 90nm LP-CMOS, SVT reference gate, nominal process, T=25°C. P-well and N-well capacitance curves for VDD=0.5V and VDD=1.2V.]

Figure 4.9 Well capacitance versus FBB at two VDD’s in 90nm LP-CMOS.

[Plot: well capacitance [fF] (0 to 10) versus forward body bias [V] (0.0 to 0.6). 90nm LP-CMOS, SVT reference gate, VDD=1.2V, nominal process. P-well and N-well capacitance curves for 125°C, 25°C and -40°C.]

Figure 4.10 Well capacitance versus FBB and temperature in 90nm LP-CMOS.


4.2.3 Latch-Up Sensitivity Analysis

With the reduction of VDD in next-generation CMOS nodes, the latch-up sensitivity of digital CMOS circuits is reduced. However, the application of body biasing can increase the sensitivity of CMOS circuits to latch-up again. This motivates a more thorough analysis. The cause of latch-up can be found in a parasitic thyristor structure that is formed by bipolar devices inherently present in CMOS circuits, as shown in Figure 4.11. In latch-up conditions, a current flows from VDD to VSS due to triggering of this parasitic thyristor. In many cases, latch-up is destructive for the circuit. In those cases where latch-up is not destructive, the latch-up current can only be eliminated by turning off VDD.

[Cross-section drawing: P+ and N+ diffusions in P-well and N-well above a Deep N-well in the P- substrate; terminals Vpwell, Vnwell, VDD, Vin and Vout; well resistances Rnw and Rpw.]

Figure 4.11 Parasitic thyristor structure in CMOS circuits.

Figure 4.12 shows the equivalent circuit diagram of the parasitic thyristor structure. IA represents the anode current. Let us first discuss a number of design practices that have been identified in prior art works for alleviating latch-up sensitivity in CMOS circuits. Latch-up cannot occur when βnpnβpnp<1 [49]. The bipolar current gain, β, can be reduced by increasing the bipolar base length, e.g. a higher distance between p+ drain junction of the PMOS transistor and the n+ drain junction of the NMOS transistor. Also, minimizing the well supply connection impedance (Rnw, Rpw) effectively reduces the latch-up sensitivity of the circuit. The fab has specified a design rule to bound the maximum distance between a well tap and a transistor region, i.e. a maximum distance of 30µm in 90nm LP-CMOS. Another means to reduce latch-up sensitivity is by using guard-rings between drain junctions and the N-well/P-well edge. However, this measure is typically not used in digital logic standard-cells due to its large area cost.

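[Circuit diagram: cross-coupled pnp and npn bipolar transistors between VDD and ground with anode current IA; well resistances Rnw and Rpw; bias sources Vpwell and VDD-Vnwell.]

Figure 4.12 Equivalent circuit of the parasitic thyristor structure.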


Digital CMOS circuits may experience a higher sensitivity to latch-up depending on the operating conditions [50]. A higher VDD operation increases the latch-up sensitivity due to higher voltage potentials across the terminals of the bipolar devices. However, this is not much of a concern for digital circuits implemented in modern CMOS technologies, since the VDD needs to be substantially higher than its nominal value before latch-up related issues are experienced. A higher operating temperature also increases latch-up sensitivity. This is because the base-emitter voltage of the bipolar devices decreases with temperature, thereby making it easier for the bipolars to be turned on. Therefore, the worst-case condition for latch-up occurs at the highest VDD and highest operating temperature. Body biasing can further increase the latch-up sensitivity of the digital CMOS circuit. Since electron injection from P-well to N-well becomes the trigger event, FBB degrades latch-up immunity and RBB improves latch-up immunity. Consequently, the maximum FBB is limited by the maximum tolerable P-well-to-N-well junction current. Hokazono et al. demonstrated varying latch-up characteristics under body biasing conditions [37]. Figure 4.13 presents their experimental results of I-V characteristics of parasitic thyristor test-structures in 45nm CMOS while operating at VDD=1.0V. Under NBB conditions, each base junction of the thyristor is shorted to the anode (=VDD) and cathode (=VSS) so that the two emitter junctions are at zero bias. Therefore, the current flow from anode to cathode is blocked, which results in the forward blocking region, i.e. the low-current region of the I-V curve until the voltage snapback occurs. The snapback point indicates the triggering of the thyristor. Beyond this point the thyristor is in the conduction state, i.e. the high-current region of the I-V curve shown in Figure 4.13.
Latch-up occurs and is sustained when the anode voltage equals or exceeds the holding voltage. Figure 4.13 shows that FBB (VF in Figure 4.13) decreases the holding voltage while the opposite holds for RBB (VR in Figure 4.13). Hokazono et al. concluded that latch-up was no issue for designs fabricated in the considered 45nm CMOS technology, since the holding voltage was higher than VDD [37].

Figure 4.13 Schematic I-V characteristics of the parasitic thyristor structure [37].


Let us now consider the latch-up sensitivity of FBB in a 90nm LP-CMOS technology. A set of dedicated test-structures for latch-up testing purposes was implemented. The cross-section of a test-structure is provided in Figure 4.14. Observe that it consists of the typical N-well-P-well-Deep-N-well organization as found in body-biased digital CMOS circuits. The N+ junction in P-well models the drain/source node of an NMOS device, while the P+ junction in N-well models the drain/source node of a PMOS device. These junctions have been placed at the minimum distance allowed by the design rules to maximize latch-up sensitivity: this makes the base lengths of the thyristor as short as possible, thus maximizing the current gain. The well taps have been placed at about 10µm from the other N+/P+ junctions. All dimensions of the metal interconnections are upsized to handle AC currents up to 100mA. The latch-up testing has been performed on supply and body bias pins while the circuit is powered on. The testing has been done for a limited set of VDD and FBB values, where FBB has been applied symmetrically to N-well and P-well. The VDD biasing has been swept from 1.2V up to 2V, and FBB has been swept from 0V up to 0.5V. In both cases, a step size of 0.1V has been used. Finally, the latch-up testing has been done for three different temperatures: 25°C, 125°C and 150°C.

[Cross-section drawing: P+ and N+ diffusions in P-well and N-well above a Deep N-well in the P- substrate; terminals Vpwell, VSS, VDD and Vnwell.]

Figure 4.14 Cross-section of the latch-up test-structure.

Figure 4.15 presents the experimental latch-up test results for the aforementioned test-structure. The triangle symbols relate to the results obtained at 150°C, while the circles relate to 125°C. The symbols indicate the highest FBB value for a given VDD bias at which no latch-up has been detected. Recall that a FBB step size of 0.1V was used. In practice, it may be possible to further increase FBB slightly (<0.1V) for obtaining the maximum FBB that defines the boundary for a latch-up insensitive circuit. In other words, the VDD value at this maximum FBB value equals the holding voltage of the thyristor. The trend lines of Figure 4.15 provide a graphical impression of the maximum FBB value as a function of VDD. Observe from Figure 4.15 that the circuit is immune to latch-up when operating at VDD≤1.2V while applying 0.5V FBB to both N-well and P-well. At 125°C operation, it is still safe to operate up to VDD=1.3V with 0.5V FBB. At higher VDD, the FBB value should be limited to prevent the circuit from becoming sensitive to latch-up for 125°C and 150°C operation. This is because the holding voltage is temperature dependent [50]. For the considered 90nm LP-CMOS, the maximum allowed VDD value equals VDD=1.5V. At higher VDD, the integrity of the gate oxide can no longer be guaranteed over chip lifetime. At VDD=1.5V, 0.3V FBB and 0.2V FBB are found to result in a latch-up immune circuit at 125°C and 150°C operation, respectively. The same measurements were carried out for an operating temperature of 25°C. These experiments revealed that no latch-up was detected for VDD ranging from 1.2V up to 2V while applying 0.5V FBB. From these tests, one can conclude that latch-up is not a concern in this application in 90nm LP-CMOS when operating at VDD≤1.3V, T≤125°C and up to 0.5V FBB. Therefore, no special measures were taken to protect the circuit against latch-up in the design of the embedded FBB generator circuit.

[Plot: vertical axis 0 to 0.6 versus power supply voltage [V] (0.8 to 2). Curves for T=150°C and T=125°C. Regions labelled "VDD scaling", "Allowed VDD range" and "Maximum VDD violation".]

Figure 4.15 Latch-up experimental results for 90nm LP-CMOS test-structure.

4.3 FBB Generator Design

Having discussed the characteristics of the load, this section addresses the concept, design and implementation of the proposed FBB generator. The design was implemented in 90nm LP-CMOS. Recall from section 4.1 the FBB generator requirements: 1) digital control to simplify system integration, 2) FBB generation transparent to VDD scaling, 3) powered from the available core supply, and 4) low power and small area. FBB needs to be applied to a digital circuit with a size of 1mm2, but optionally to larger circuit sizes as well. For a 1mm2 digital circuit, the estimated P-well capacitance is 0.93nF and the estimated N-well capacitance is 1.56nF, with 3.1mA of P-well current and -1.7mA of N-well current at 0.5V FBB. The respective process and operating conditions are: nominal process, VDDD=1.2V, and T=85°C. The design of the FBB generator is presented in the next sections.

4.3.1 Concept

Figure 4.16 presents a conceptual view of the proposed FBB generator circuit. The digital circuit represents the circuit portion to which body biasing is applied. It is supplied between supply voltage VDDD and ground VSSD. In order to have a digital control interface, the body bias generation has been implemented by digital-to-analog (D/A) converters; the reference voltage Vref should be equal to or larger than the maximum required FBB voltage. A constant reference current Iref is supplied through the resistive D/A converter (RDAC) for implementing a floating voltage source. In this way, one is able to generate a P-well and N-well voltage that is referenced to VSSD and VDDD, respectively, for supporting transparency to possible voltage scaling conditions. Finally, the output voltage of the RDAC is buffered by means of a voltage buffer for providing a low-ohmic output.


Figure 4.16 Conceptual circuit diagram of the proposed FBB generator.

Figure 4.17 shows a more detailed representation of the proposed FBB generator. The Vnwell and Vpwell are the outputs of the FBB generator. There are two supply pairs: (VDDA,VSSA) and (VDDD, VSSD). VDDA and VSSA are the nominal supply voltage and ground of the system, respectively. VDDD and VSSD are the supply voltage and ground of the digital circuit portion under FBB control, respectively. The n-bit BBnw and BBpw digital input signals are decoded to match the RDAC control signals. The reference circuit creates the constant current reference for the RDAC’s. The FBB generator contains two control signals ENA and MODE that: are paired to the standby or active modes of the digital circuit portion under FBB control, select the internal or external reference voltage, and select the bypass switches when the circuit is in standby.

Figure 4.17 Detailed block diagram of the proposed FBB generator.


4.3.2 Design

Let us now address the details of the building blocks that constitute the FBB generator. First, the reference circuit will be discussed; its simplified circuit diagram is presented in Figure 4.18. The purpose of the reference circuit is to generate a constant reference current that flows through the RDACs. The circuit has been implemented as follows. A feedback circuit derives the reference current, Iref, from the reference voltage, Vref, of 700mV. Vref can be internally generated by a resistor tree, or it can be externally generated by, e.g., a bandgap circuit. The operational amplifier has been implemented by a differential pair. The MODE signal selects the internal or external reference voltage. The reference resistor, Rref, is matched to the RDAC resistors. The reference current, Iref, is mirrored to create the current reference for the RDACs. The ENA signal can turn off the resistor tree and the amplifier to minimize the static current consumption when the FBB generator is in standby.

Figure 4.18 Simplified circuit diagram of the reference circuit.

The required FBB voltage is generated by the resistor tree of the RDAC. There is a voltage drop of about 540mV across the resistor tree, which corresponds to a maximum FBB of about 540mV. Each resistor tree consists of 64 poly resistors, thus, the smallest possible FBB step is about 8.5mV which translates into 6-bit resolution. This resolution is sufficiently high to enable fine-tuned FBB generation. The resistor tree is referenced to VDDD or VSSD, respectively. The reference circuit supplies a constant bias current through the resistor tree. The resistor tree has been implemented by an array of resistor elements. The circuit diagram of one resistor element is shown in Figure 4.19. Each resistor element contains two poly resistors and corresponding switch functionality to connect a given node to the output line. In total, there exist 8 horizontal select (HS) and 8 vertical select (VS) bit lines to select a given node in the tree. The resistor elements are connected such that the resistor elements form a chain of 64 poly resistors. An input decoder is used for converting a 6-bit input signal of the FBB generator (BBnw or BBpw) to enable a single horizontal-vertical bit line pair of the RDAC.
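The relation between the 6-bit input code, the selected tap and the resulting FBB voltage can be sketched as follows. The text does not specify which tap a given code selects or how the six bits are split over the HS and VS lines, so both are assumptions here: code k is taken to select the tap after k+1 resistors (so that code 63 gives the full tree voltage), the upper three bits drive HS and the lower three bits drive VS. All names are illustrative.

```python
N_RESISTORS = 64      # poly resistors per resistor tree (from the text)
V_TREE_MV = 540.0     # approximate voltage drop across the tree (from the text)
N_LINES = 8           # 8 horizontal and 8 vertical select lines (from the text)


def rdac_step_mv():
    """Ideal FBB step for an evenly divided resistor tree."""
    return V_TREE_MV / N_RESISTORS


def rdac_fbb_mv(code):
    """Ideal FBB magnitude for a 6-bit code.

    ASSUMPTION: code k taps the chain after k+1 resistors. A design with a
    zero-FBB tap at code 0 would drop the '+ 1'.
    """
    if not 0 <= code < N_RESISTORS:
        raise ValueError("code must be a 6-bit value (0..63)")
    return (code + 1) * V_TREE_MV / N_RESISTORS


def decode_select_lines(code):
    """One-hot HS and VS line patterns for a 6-bit code.

    ASSUMPTION: upper 3 bits select the HS line, lower 3 bits the VS line.
    """
    if not 0 <= code < N_RESISTORS:
        raise ValueError("code must be a 6-bit value (0..63)")
    row, col = divmod(code, N_LINES)
    hs = [1 if i == row else 0 for i in range(N_LINES)]
    vs = [1 if i == col else 0 for i in range(N_LINES)]
    return hs, vs


if __name__ == "__main__":
    print(rdac_step_mv())               # ideal step in mV
    print(rdac_fbb_mv(63))              # full-scale FBB in mV
    print(decode_select_lines(0b101011))
```

The ideal step of 540mV/64 is slightly below 8.5mV, consistent with the "about 8.5mV" quoted above.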


Figure 4.19 Simplified circuit diagram of a resistor element of the RDAC.

[Circuit diagram: supplies VDDA and VSSA; inputs inp and inn; outputs outp and outn; enable signal ENA.]

Figure 4.20 Circuit diagram of the pre-driver of the voltage buffer.

The voltage buffer is implemented by a Miller-compensated class-AB amplifier configured as a unity-gain buffer. It is powered from VDDA and VSSA. The buffer consists of two stages, the pre-driver and an expandable output stage. The pre-driver contains the input stage and a gain stage. Figure 4.20 shows the circuit diagram of the pre-driver. The input stage is implemented by a double input pair to cover the wide input voltage range provided by the RDAC, especially when the digital IP has a voltage scalable supply. A cascoded gain stage is used to achieve high gain. The pre-driver can be turned off by the ENA signal. In this case, the outputs outp and outn are clamped to VDDA and VSSA, respectively. The expandable drive unit is implemented by a rail-to-rail class-AB output stage, which maintains a small current in steady state and is able to deliver a large current during a transient. Such an output stage is very convenient for driving large capacitive loads due to its current source/sink capability. Figure 4.21 (left) shows a circuit diagram of a drive stage for providing the P-well bias to the digital IP. The output stage for providing the N-well bias is similar, except that the switches are connected to VDDA and VDDD, respectively. Circuit stability is accomplished using a Miller compensation scheme embedded in the output stage unit. When multiple drive units are used, output stages are placed in parallel, which keeps the ratio between maximum load capacitance and Miller capacitance constant, thereby ensuring stability. In Figure 4.21, the switches are indicated along with their control signals. The switches are used to clamp the output to fixed potentials when the voltage buffer is turned off. This ensures that the digital IP block is always properly body biased.

[Right panel plot: relative bandwidth (0 to 1) versus size of digital IP block load [sq.mm] (1 to 5).]

Figure 4.21 Circuit diagram of the p-well output stage of the voltage buffer (left), and generator bandwidth vs. digital IP block dimension (right).

Multiple output stages can be connected to the pre-driver. The number of output stages to be used depends on the size of the digital IP block. The pre-driver with one output stage is suitable for driving a digital IP block size of 1mm2. Two output stages can drive a digital IP block of up to 2mm2, and so on. In this way, an expandable output stage, and thus a re-usable FBB generator solution, has been created to drive digital IP blocks of different sizes. The collection of one output stage for P-well and one for N-well is referred to as a drive unit. Figure 4.21 (right) shows the relative bandwidth of the generator as a function of the digital IP block dimension and number of drive units. Observe that the bandwidth reduces for larger digital IP blocks that require more drive units, which is a characteristic of the used Miller compensation scheme.

4.3.3 Layout

The FBB generator design has been implemented in a 90nm LP-CMOS technology. Figure 4.22 shows the circuit layout where each building block has been indicated. The base unit contains the reference circuit, the RDAC and decoders, and the pre-driver. The drive unit is connected to the base unit by abutment. The total area of the base unit and drive unit is 250µm by 125µm. The reference circuit, RDAC and decoders, pre-driver and output stage consume 24%, 30%, 14%, and 32% of the total area, respectively. The area of the drive unit alone is 80µm by 125µm. Additional drive units can be connected to each other by abutment. Alternatively, they can be spatially distributed in the overall chip layout while hooked up to the base unit. The base unit and one drive unit can drive a digital circuit size of up to 1mm2. The FBB generator area is about 0.03mm2. Observe that the circuit area of the FBB generator is only a small fraction (~3%) of the digital circuit area of 1mm2.
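The area figures quoted above can be cross-checked, and extended to larger IP blocks, with a short calculation. This is a sketch only: it assumes that every additional drive unit adds exactly its own 80µm by 125µm footprint (abutment without extra routing area) and that one drive unit is used per 1mm2 of digital load. Neither assumption is a measured result.

```python
BASE_PLUS_ONE_DRIVE_UM2 = 250.0 * 125.0   # base unit + one drive unit (from the text)
DRIVE_UNIT_UM2 = 80.0 * 125.0             # one drive unit alone (from the text)


def generator_area_mm2(n_drive_units=1):
    """Estimated generator area; extra drive units are ASSUMED to add only their own area."""
    if n_drive_units < 1:
        raise ValueError("at least one drive unit is required")
    area_um2 = BASE_PLUS_ONE_DRIVE_UM2 + (n_drive_units - 1) * DRIVE_UNIT_UM2
    return area_um2 * 1e-6


def area_overhead_percent(load_area_mm2):
    """Generator area relative to the load, ASSUMING one drive unit per started mm2."""
    n_units = max(1, -(-int(load_area_mm2 * 1000) // 1000))  # ceiling on mm2
    return 100.0 * generator_area_mm2(n_units) / load_area_mm2


if __name__ == "__main__":
    print(generator_area_mm2(1))                       # about 0.03 mm2
    print(DRIVE_UNIT_UM2 / BASE_PLUS_ONE_DRIVE_UM2)    # output stage share, 0.32
    for size in (1.0, 2.0, 5.0):
        print(size, round(area_overhead_percent(size), 2))
```

Under these assumptions the relative overhead shrinks for larger blocks, because the base unit is shared by all drive units.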


Figure 4.22 Layout implementation of the FBB generator in 90nm LP-CMOS.

4.4 FBB Generator Experimental Results

Measurements have been performed on the FBB generator that was fabricated in 90nm LP-CMOS. All supply (VDDD, VDDA), ground (VSSD, VSSA) and I/O connections of the circuit are available at the package pins. Unless mentioned otherwise, the experiments have been performed at VDDD=VDDA=1.2V, VSSD=VSSA=0V and room temperature, while using the internal reference voltage (MODE=logic ’1’). No external reference voltage has been connected. The outputs of the generator, Vnwell and Vpwell, are capacitively loaded with about 1.5nF when accounting for the external capacitor and pad/package parasitics. Also, both outputs have a current load of about 3mA at 0.5V FBB through a resistive load (current is sunk from the Vnwell output, and sourced to the Vpwell output). The load on the outputs of the FBB generator is considered to be representative of a 1mm2 digital circuit. Table 4.2 summarizes the main design characteristics of the FBB generator in 90nm LP-CMOS, as obtained through silicon measurements.

Table 4.2 Measured FBB generator characteristics in 90nm LP-CMOS.
Conditions: VDDD=VDDA=1.2V, VSSD=VSSA=0V, MODE=’1’, room temperature.

Parameter                        Unit      Base Unit + 1 Drive Unit
Maximum FBB on N-well            mV        541
Maximum FBB on P-well            mV        543
Active current, Idd 1)           µA        191.2
Standby current, Iddq            nA        99
Bandwidth 2)                     kHz       263
Slew rate, P-well FBB rise       mV/µs     307.7
Slew rate, N-well FBB rise       mV/µs     317.5
Slew rate, P-well FBB fall       mV/µs     235.3
Slew rate, N-well FBB fall       mV/µs     250.0

1) Idd at nominal BB, 2) Bandwidth at 0.5V FBB

Observe that up to 541mV FBB and 543mV FBB can be applied to N-well and P-well, respectively. The considered configuration consumes about 191µA in active mode. In standby, it leaks about 99nA at room temperature. Circuit simulations revealed that an additional drive unit increases the active and standby current by about 90µA and 54nA, respectively. The FBB generator has a bandwidth of 263kHz and a worst-case slew rate of about 235mV/µs. From a digital systems perspective, the FBB generator bandwidth indicates how often the IP block can change its FBB voltage, while the slew rate indicates how fast the FBB voltage becomes available. The FBB generator can change the body bias voltage within a few microseconds. This makes the circuit suitable for both dynamic and adaptive body biasing applications.
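The figures of Table 4.2 and the simulated per-drive-unit increments can be combined into a simple estimate of supply current and slew-limited transition time. This is a rough sketch: it assumes that the current increments add linearly per drive unit, and the transition time ignores the intrinsic turn-on delay and any final settling of the buffer. Function names are illustrative.

```python
ACTIVE_UA_ONE_UNIT = 191.2     # measured, base unit + 1 drive unit (Table 4.2)
STANDBY_NA_ONE_UNIT = 99.0     # measured, base unit + 1 drive unit (Table 4.2)
ACTIVE_UA_PER_EXTRA = 90.0     # simulated increment per additional drive unit
STANDBY_NA_PER_EXTRA = 54.0    # simulated increment per additional drive unit


def supply_current(n_drive_units):
    """(active current in uA, standby current in nA), ASSUMING linear scaling."""
    if n_drive_units < 1:
        raise ValueError("at least one drive unit is required")
    extra = n_drive_units - 1
    return (ACTIVE_UA_ONE_UNIT + extra * ACTIVE_UA_PER_EXTRA,
            STANDBY_NA_ONE_UNIT + extra * STANDBY_NA_PER_EXTRA)


def slew_limited_time_us(step_mv, slew_mv_per_us):
    """Time to traverse a voltage step when only the slew rate limits the transition."""
    return step_mv / slew_mv_per_us


if __name__ == "__main__":
    print(supply_current(2))
    # 500mV step at the measured P-well rise and fall slew rates.
    print(slew_limited_time_us(500.0, 307.7))
    print(slew_limited_time_us(500.0, 235.3))
```

With the measured slew rates, a 500mV step takes roughly 1.6µs when rising and slightly more than 2µs at the slowest falling slew rate, in line with the "few microseconds" stated above.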

[Plot: N-well voltage [V] (0.6 to 1.2) and P-well voltage [V] (0 to 0.6) versus RDAC code (0 to 65). VDDA=VDDD=1.2V.]

Figure 4.23 Measured INL of the 90nm FBB generator.

Figure 4.23 presents the measurement results of the integral non-linearity (INL). Both Vpwell and Vnwell are plotted against the 6-bit RDAC code. The INL shows how much the RDAC transfer characteristic deviates from the ideal one, which is a straight line. A mean step size of 8.5mV was observed for both RDACs, with an average and maximum deviation from the ideal transfer characteristic of 4.25mV and 5mV, respectively. These deviations are caused by random process variability. Figure 4.24 shows the measured transient when generating a 0.5V FBB for the N-well and P-well. Initially, the FBB generator is disabled (ENA=’0’); a nominal well bias voltage is provided on the Vnwell and Vpwell outputs, since these outputs are connected directly to the supply or ground, respectively, through the clamp transistors shown in Figure 4.17. After the ENA signal is asserted, the FBB generator is turned on and its outputs are set to provide 0.5V FBB to both N-well and P-well. Observe the fast charging and discharging of the well voltages. Figure 4.25 shows a zoomed-in view of the transient response of the FBB generator after it has been turned on. Notice the high slew rates; slew rates of about 317mV/µs and 307mV/µs were measured for N-well and P-well, respectively. Such slew rates enable the charging of the wells to 0.5V FBB within 2µs. The delay between the assertion of the ENA signal and the time at which the generator outputs start changing is caused by the intrinsic delay of the FBB generator.
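An INL evaluation of this kind can be sketched in a few lines. The example below takes the ideal characteristic to be a straight line through the origin with the nominal 8.5mV step, which is one possible definition and an assumption here; the measurement procedure actually used is not described in the text. The sample voltages are made-up illustration values, not measured data.

```python
def inl_mv(measured_mv, ideal_step_mv=8.5):
    """Deviation of each measured output from the ideal straight line, in mV.

    ASSUMPTION: the ideal output for code k is k * ideal_step_mv.
    """
    return [v - k * ideal_step_mv for k, v in enumerate(measured_mv)]


def inl_summary_mv(measured_mv, ideal_step_mv=8.5):
    """(mean absolute deviation, maximum absolute deviation) in mV."""
    deviations = [abs(d) for d in inl_mv(measured_mv, ideal_step_mv)]
    return sum(deviations) / len(deviations), max(deviations)


if __name__ == "__main__":
    # Hypothetical output voltages for codes 0..3, for illustration only.
    sample_mv = [0.0, 9.0, 16.5, 26.0]
    print(inl_mv(sample_mv))
    print(inl_summary_mv(sample_mv))
```

Applied to a full 64-code sweep, the two returned numbers correspond to the average and maximum deviation reported for the RDACs.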


[Oscilloscope capture. Traces: 1: Vpwell, 2: Vnwell, 3: ENA. Reference level: VSS.]

Figure 4.24 Measured transient response of the FBB generator for 0.5V FBB.

[Oscilloscope capture. Traces: 1: Vpwell (slew rate 307mV/µs), 2: Vnwell (slew rate 317mV/µs), 3: ENA. Reference level: VSS.]

Figure 4.25 Zoomed-in transient response of the FBB generator for 0.5V FBB.

Figure 4.26 demonstrates the operation of the FBB generator along with a digital circuit that makes use of VDD scaling (i.e. VDDD=scaled, VSSD=VSS). Observe that Vnwell follows VDDD to maintain 0.5V FBB when VDDD is reduced from 1.2V down to 0.8V. Vpwell is also maintained at 0.5V FBB. This shows that the proposed voltage generator is suitable for use in power-managed digital circuit designs.


[Oscilloscope capture. Traces: 1: Vpwell, 2: Vnwell, 3: VDDD. Reference level: VSS. Both well voltages are annotated with 500mV FBB.]

Figure 4.26 Measured transient response for the FBB generator under voltage scaling conditions of the digital IP load.

The magnitude of the well currents depends mainly on the size of the digital circuit under control and on the temperature. The dependence between N-well/P-well voltage and N-well/P-well current has been considered as well. For this purpose, the base unit with one drive unit was used for FBB generation. Figure 4.27 plots the obtained well voltages against the well current magnitude. This indicates the operational range of the FBB generator. Observe that both N-well and P-well voltages remain constant at 0.5V FBB for well current magnitudes up to about 15mA. Such well currents are about 5x and 8x larger than the expected maximum P-well and N-well current for a 1mm2 digital IP block, respectively (P-well: 3.1mA and N-well: -1.7mA at 85°C).
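The quoted margins follow directly from the ratio between the regulated current range and the expected load. The short sketch below reproduces that arithmetic; the 15mA limit is the approximate value read from Figure 4.27, and the function name is illustrative.

```python
REGULATED_LIMIT_MA = 15.0    # approximate limit of the flat region in Figure 4.27
EXPECTED_PWELL_MA = 3.1      # expected P-well current, 1mm2 block at 85C
EXPECTED_NWELL_MA = -1.7     # expected N-well current, 1mm2 block at 85C


def current_margin(limit_ma, expected_ma):
    """Ratio between the regulated current limit and the expected load magnitude."""
    return limit_ma / abs(expected_ma)


if __name__ == "__main__":
    print(round(current_margin(REGULATED_LIMIT_MA, EXPECTED_PWELL_MA), 1))
    print(round(current_margin(REGULATED_LIMIT_MA, EXPECTED_NWELL_MA), 1))
```

The ratios come out slightly below 5 for the P-well and between 8 and 9 for the N-well.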

[Plot: well voltage [V] (0 to 1.2) versus magnitude of well current [mA] (0 to 20). VDDA=VDDD=1.2V. Curves: Vnwell and Vpwell.]

Figure 4.27 Measured FBB generator load regulation at 0.5V FBB.


4.5 Discussion

In this chapter a new fully-integrated forward body bias (FBB) generator was presented that holds its voltage constant relative to the (scalable) power supply of a digital IP. The generator is modular and can drive distinct digital IP block sizes in multiples of up to 1mm2. The design has been implemented in 90nm LP-CMOS. Its basic unit for driving digital IP blocks up to 1mm2 occupies a silicon area of 0.03mm2. The proposed generator completes a 500mV FBB voltage step within 2µs. The bandwidth of the design has been measured at 263kHz. The active current is measured to be about 191µA, while the standby current is 99nA at room temperature. With the new FBB generator available, FBB can now be seamlessly integrated into integrated circuit designs. Due to its small circuit area, it adds only a marginal area overhead to the design. The digital control interface allows the FBB generator to be easily integrated as part of a dynamic body bias control scheme. A main feature of the new FBB generator is that it can provide a FBB relative to the digital supply voltage, which makes it transparent to use in any power-managed, voltage scalable digital circuit.


Chapter 5

Body Bias Driven Design for Area-Efficient High-Performance Circuits

Worst-case design uses extreme process corner conditions which rarely occur. This limits maximum speed specifications and costs additional power due to area over-dimensioning during synthesis. In this chapter a new design synthesis strategy for digital CMOS circuits is presented that makes use of forward body biasing. The proposed approach consistently renders a better performance per area ratio by constraining circuit over-dimensioning without sacrificing circuit performance. An in-depth analysis of the new body bias driven design theory is provided. It is complemented by an algorithm that enables fast reconstruction of the area-clock period trade-off curve of the design. These new concepts were validated through industrial processor designs in 90nm LP-CMOS.


Conventional and well-established digital design practices are based on a worst-case design (WCD) style to guarantee that chips meet timing specifications across the process corners [51]. The circuit is designed in the slow-process corner to meet frequency specifications, while the maximum leakage target is verified in the fast-process corner. However, such extreme process corners rarely occur in fabricated chips. Moreover, WCD makes high performance specifications harder to meet due to over-dimensioning of the design. Over-dimensioning leads to a larger silicon footprint, higher power consumption and larger leakage. Statistical circuit design has long been seen as a viable way to avoid the use of worst-case parameters [52][53]. Yet these approaches have not fully found their way into industrial practice. This is because of, among other reasons, the moving average of process parameters, the flexibility of fabricating the same chip in multiple foundries, and the lack of appropriate EDA tools for statistical logic synthesis. Alternatively, post-silicon tuning has been proposed for improving product-binning yields and for trading off power against performance [16][19][42], but it does not eliminate the problem of area over-dimensioning. A joint design-time and post-silicon tuning optimization strategy for minimizing leakage under delay constraints was proposed in [54]. This approach relies on detailed process variability inputs, and is capable of reducing process-dependent delay spread. However, it considers neither a timing speed-up nor a circuit area reduction as an outcome. Others propose body bias clustering at design-time for minimizing leakage under delay constraints [55][56], or for enhancing circuit performance [57]. These approaches do not consider a (joint) design-time optimization for improving performance or reducing area of the circuit.

Vth assignment and gate sizing during the design synthesis phase is a known problem [58][59]. Vth assignment has been used for reducing leakage or for reducing dynamic power consumption. Leakage power of digital IP blocks is mostly a concern when the circuit is in standby mode. High-performance circuits typically use LVT devices to speed up critical delay paths at the penalty of a higher intrinsic device leakage [58]. This higher leakage is unacceptable for portable applications since it increases standby power, and thereby reduces battery life. The use of body biasing offers several advantages: 1) it offers a continuum of Vth values, 2) it is independent of the Vth technology flavour, 3) it can be used on top of multi-Vth assignments, and 4) it can be applied dynamically or adaptively. FBB can achieve LVT performance during operation, while it can be turned off to achieve low leakage in standby [42]. LVT circuits with RBB cannot achieve such low leakage, as seen in section 3.5. Like multi-Vth and gate sizing assignments, body bias driven designs result in a smaller footprint area than WCD. Unlike multi-Vth and gate sizing assignments, the Vth choice is not technology constrained, since it is possible to characterize a standard cell library with FBB targeting a given Vth value within a certain range of Vth values. In the following sections a new design optimization method is presented that leverages FBB to improve the performance per area ratio of digital CMOS circuits. This approach is coined body bias driven (BBD) design.

5.2 Maximum PPA Design

The area and clock period trade-off curve of a digital CMOS logic circuit can be modelled by a rational function with χ, δ, and η as independent fitting parameters, as shown by expression (2.18) and repeated here for convenience.

$$A_{total} = \frac{\chi}{T_{ck} + \delta} + \eta \qquad (5.1)$$

The area-clock period trade-off curve can be reconstructed through an iterative process that requires a sufficient number of data points for correct reconstruction. Collecting data points evenly spaced across the clock period range requires many synthesis runs, which are time-consuming for large designs. Fortunately, the trade-off curve is a continuous function over a closed clock period interval. A reconstruction is possible with only three specific data points, namely a first point at a small clock period and large circuit area, a second point at a large clock period and small circuit area, and a third point where the slope of the curve is -1. Recall that the general form of expression (5.1) describes a rectangular hyperbola. The clock period TA at which the slope of the hyperbola equals -1, and the corresponding circuit area, AA, can be determined as follows.

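$$T_A = -\delta + \sqrt{\chi} \qquad \forall\, T_{ck} \ge T_{min} \wedge \chi \ge 0 \qquad (5.2)$$

$$A_A = \eta + \sqrt{\chi} \qquad \forall\, T_{ck} \ge T_{min} \wedge \chi \ge 0 \qquad (5.3)$$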

where Tmin is the clock period of the best-performing design. This design is obtained by constraining gate sizing of digital gates to their maximum size in the digital library.
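For reference, expressions (5.2) and (5.3) follow from differentiating expression (5.1). The short derivation below is added as a worked step and assumes χ ≥ 0 and Tck + δ > 0. Note that the numerical value of a slope depends on the units in which area and clock period are expressed, so the -1 condition applies in the units used for the fit.

```latex
\begin{aligned}
\frac{dA_{total}}{dT_{ck}} &= -\frac{\chi}{(T_{ck}+\delta)^{2}} = -1
\;\Longrightarrow\; (T_A+\delta)^{2} = \chi
\;\Longrightarrow\; T_A = -\delta + \sqrt{\chi}, \\
A_A &= \frac{\chi}{T_A+\delta} + \eta = \sqrt{\chi} + \eta .
\end{aligned}
```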


The proposed approach to curve reconstruction is based on a greedy search algorithm. The algorithm searches iteratively for the clock period value at which the slope of the area-clock period trade-off curve equals -1 (Tck=TA). This is similar to the Newton-Raphson approach, which is known for its fast convergence [60]. Instead of calculating the derivative of the area-clock period curve explicitly, expression (5.2) is used to determine Tck=TA. With every iteration, the fitting parameters of expression (5.1) are (re-)calculated until the difference between the old and new TA value is within a certain error bound, ε. Let us now address the proposed algorithm as described in Figure 5.1. As a first step, the design is synthesized at the minimum clock period bound, Tlow. The synthesis tool returns the actual clock period, T1, and circuit area, A1. Note that T1>Tlow when Tlow<Tmin; otherwise T1≈Tlow. Next, the design is synthesized at the maximum clock period bound, Thigh. Thigh should be chosen large enough to ensure that the clock period range in which area over-dimensioning occurs is captured. A Thigh value of 10T1 was employed. The synthesis tool returns the actual clock period, T2, and circuit area, A2. The third synthesis point is chosen by bisection, Tcur=(T1+T2)/2, to ensure proper conditioning of the three-point curve fit according to expression (5.1). The least-squares fitting method was used to determine the fitting parameters χ, δ, and η. The next clock period, Tnext=TA, is the point at which the slope of the fitted area-clock period trade-off curve equals -1, obtained from expression (5.2). When the difference between the current clock period, Tcur, and the new one, Tnext, is larger than the tolerated error, ε, a new synthesis run is executed at Tcur=Tnext. Again, the least-squares method is used to re-calculate the fitting parameters using all available synthesis results. This process is repeated until |Tnext-Tcur| is smaller than ε. Once this condition is met, the final fitting parameters χ, δ, and η for reconstructing the area-clock period trade-off curve of the design have been determined. The proposed algorithm fails to converge when the clock period interval Tck=[T1,T2] does not contain the clock period TA. Therefore, one should ensure a proper choice of the clock period starting points Tlow and Thigh.
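The procedure described above (Algorithm 1 in Figure 5.1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: `synthesize` is a hypothetical stand-in for a call to the synthesis tool, `toy_synthesize` emulates it with an invented noise-free curve in normalized units, and the least-squares step is done with a simple one-dimensional search over δ (for fixed δ the model is linear in χ and η) rather than with any particular fitting library.

```python
import math

def fit_curve(points, grid=4000, refine=80):
    """Least-squares fit of A = chi/(T + delta) + eta to (T, A) pairs."""
    ts = [t for t, _ in points]
    areas = [a for _, a in points]
    n = len(points)

    def linear_fit(delta):
        # For a fixed delta the model is linear in chi and eta.
        u = [1.0 / (t + delta) for t in ts]
        su, sa = sum(u), sum(areas)
        suu = sum(x * x for x in u)
        sua = sum(x * a for x, a in zip(u, areas))
        chi = (n * sua - su * sa) / (n * suu - su * su)
        eta = (sa - chi * su) / n
        sse = sum((chi * x + eta - a) ** 2 for x, a in zip(u, areas))
        return sse, chi, eta

    lo, hi = -min(ts) + 1e-3, max(ts)        # keep T + delta > 0 for all points
    step = (hi - lo) / grid
    best = min((lo + k * step for k in range(grid + 1)),
               key=lambda d: linear_fit(d)[0])
    a, b = max(lo, best - step), min(hi, best + step)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(refine):                  # golden-section refinement on [a, b]
        c, d = b - g * (b - a), a + g * (b - a)
        if linear_fit(c)[0] < linear_fit(d)[0]:
            b = d
        else:
            a = c
    delta = 0.5 * (a + b)
    _, chi, eta = linear_fit(delta)
    return chi, delta, eta

def find_fitting_parameters(synthesize, t_low, t_high, eps, max_runs=20):
    """Greedy search for the slope -1 point; returns (chi, delta, eta, runs)."""
    points = [synthesize(t_low), synthesize(t_high)]     # steps 1-4
    t_cur = 0.5 * (points[0][0] + points[1][0])          # step 5 (bisection)
    while len(points) < max_runs:
        points.append(synthesize(t_cur))                 # step 7
        chi, delta, eta = fit_curve(points)              # step 8
        if chi < 0:
            raise ValueError("fit returned chi < 0; check Tlow and Thigh")
        t_next = -delta + math.sqrt(chi)                 # step 9, expression (5.2)
        if abs(t_next - t_cur) <= eps:                   # step 10
            return chi, delta, eta, len(points)
        t_cur = t_next                                   # steps 11-13
    raise RuntimeError("no convergence within max_runs synthesis runs")

# Hypothetical stand-in for the synthesis tool (invented, normalized values).
TRUE_CHI, TRUE_DELTA, TRUE_ETA, T_MIN = 113.8, -4.65, 278.0, 5.5

def toy_synthesize(t_target):
    t_actual = max(t_target, T_MIN)          # the tool cannot go below Tmin
    return t_actual, TRUE_CHI / (t_actual + TRUE_DELTA) + TRUE_ETA

if __name__ == "__main__":
    print(find_fitting_parameters(toy_synthesize, 3.0, 55.0, 0.25))
```

With noise-free data the three initial points already determine the curve, so the sketch stops after the fourth run, which merely confirms the slope -1 point. With real synthesis results the number of runs depends on how well expression (5.1) describes the design.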

Algorithm 1: Find_Fitting_Parameters(Tlow, Thigh, ε)
 1. Choose minimum clock period bound Tlow
 2. Synthesize design at Tck=Tlow: obtain (T1,A1)
 3. Choose maximum clock period bound Thigh (Thigh ≥ 10T1)
 4. Synthesize design at Tck=Thigh: obtain (T2,A2)
 5. Calculate Tcur = (T1 + T2)/2
 6. Set iteration index i = 3
 7. Synthesize design at Tck=Tcur: obtain (Ti,Ai)
 8. Perform least-squares regression by using expression (5.1) and all available (T,A) pairs as data points: obtain χ, δ, and η
 9. Calculate Tck=Tnext using expression (5.2)
10. if |Tnext-Tcur| > ε then
11.     Tcur = Tnext
12.     Increase iteration index: i=i+1
13.     Go to Line 7
14. end if
15. return fitting parameters χ, δ, η

Figure 5.1 Greedy algorithm to obtain fitting parameters of expression (5.1).

Circuit performance and area are key metrics for digital circuit designers. Let us introduce a new metric, the performance per area ratio (PPA), to qualify how effectively the design achieves high performance while accounting for the impact of area scaling. The performance of a digital circuit is usually defined by its operating frequency. When the circuit operates at its maximum operating frequency, the minimum clock period equals the delay of the slowest circuit path, i.e. the critical path. Let fck=1/Tck=1/max(Dj), where max(Dj) is the critical path delay (see expression (2.5) for reference). Now, the PPA is defined as

$$PPA = \frac{f_{ck}}{A_{total}} = \frac{1}{T_{ck}\,A_{total}} \qquad (5.4)$$

The PPA metric depends on the technology node, process and operating conditions, the technology’s Vth option, and the standard cells available for circuit synthesis. A higher PPA value indicates that the circuit design utilizes silicon area more effectively to achieve a high performance. By combining (5.1) and (5.4), one obtains the following closed-form expression for PPA with clock period as only variable:

$$PPA = \frac{T_{ck} + \delta}{\eta T_{ck}^{2} + (\chi + \delta\eta)\,T_{ck}} \qquad (5.5)$$

There exists a design point at which a maximum PPA occurs. This point indicates the optimum performance without circuit over-dimensioning. This point is the desired design point when designing for high-performance while avoiding circuit over-dimensioning. The clock period value at which the maximum PPA occurs (Tbest), can be determined by making the derivative of PPA with respect to Tck equal to zero. By solving the equation for Tck, one obtains a closed-form expression for Tbest, namely

$$T_{best} = -\delta + \frac{\sqrt{-\delta\chi\eta}}{\eta} \qquad \forall\, T_{ck} \ge T_{min} \wedge \delta\chi\eta \le 0 \qquad (5.6)$$

A clock period Tck>Tbest yields circuits without area over-dimensioning, whereas Tck<Tbest leads to over-dimensioned circuits. Therefore, Tbest identifies the minimum possible clock period without circuit over-dimensioning. The maximum PPA at Tck=Tbest is obtained by substituting expression (5.6) into expression (5.5):

$$\max(PPA) = \frac{1}{\chi - \delta\eta + 2\sqrt{-\delta\chi\eta}} \qquad \forall\, T_{ck} \ge T_{min} \wedge \delta\chi\eta \le 0 \qquad (5.7)$$
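The step from expression (5.5) to expressions (5.6) and (5.7) can be made explicit as follows. Maximizing PPA is equivalent to minimizing the product of clock period and area. The sketch assumes η > 0, χ ≥ 0, δ ≤ 0 and Tck + δ > 0.

```latex
\begin{aligned}
g(T_{ck}) &= T_{ck}\,A_{total} = \frac{\chi T_{ck}}{T_{ck}+\delta} + \eta T_{ck},
\qquad
\frac{dg}{dT_{ck}} = \frac{\chi\delta}{(T_{ck}+\delta)^{2}} + \eta = 0 \\
&\Longrightarrow\;
T_{best} = -\delta + \sqrt{\frac{-\delta\chi}{\eta}}
         = -\delta + \frac{\sqrt{-\delta\chi\eta}}{\eta}, \\
g(T_{best}) &= \chi - \delta\eta + 2\sqrt{-\delta\chi\eta}
\;\Longrightarrow\;
\max(PPA) = \frac{1}{g(T_{best})} .
\end{aligned}
```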

In those cases where Tbest is too large to meet the target frequency specification of high-performance designs, over-dimensioning cannot be avoided, thereby worsening PPA. The maximum PPA design can be obtained from design synthesis for a clock period of Tck=Tbest, while using the fitting parameters χ, δ, and η as returned by the algorithm shown in Figure 5.1. A normalized representation of PPA is used in the forthcoming analysis. The normalization is done against the highest performance design under WCD (fck=fmax=1/Tmin, Atotal=Amax).

$$PPA_{norm} = \frac{T_{min}\,A_{max}}{T_{ck}\,A_{total}} \qquad (5.8)$$


5.3 Body Bias Driven Design Concept

Under WCD, digital CMOS circuits are implemented to meet timing specifications for slow process conditions. Recall, however, that FBB enhances circuit speed. Bearing this in mind, one does not need to pursue WCD. Instead, it is possible to design the circuit in between the worst and nominal process corners provided that the IC has FBB capabilities to correct performance deviations due to fabrication outcome. This creates opportunities for more cost-effective solutions without sacrificing performance specifications and parametric yield. The amount of FBB required can be calibrated at test-time, or during boot of the chip.

[Figure: left-hand side, relative circuit area versus clock period with one curve per FBB value (0V, 0.1V, 0.2V, 0.3V, 0.4V, 0.5V); right-hand side, clock period versus forward bias voltage [V].]

Experimental results for 90nm LP-CMOS ring-oscillators at VDD=1.1V and T=85°C for a slow die sample

FBB [V]   Performance increase [%]   Rel. leakage increase factor
          SVT      HVT               SVT      HVT
0         0        0                 1.0      1.0
0.1       4        7                 1.6      1.8
0.2       9        14                2.7      3.5
0.3       14       22                4.9      7.7
0.4       19       31                9.9      20.4
0.5       24       40                24.4     80.0

Figure 5.2 FBB utilization under body bias driven design.
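As a small worked example of how the ring-oscillator measurements summarized in Figure 5.2 can be used, the sketch below estimates the FBB needed for a given speed-up and the leakage increase that comes with it. The function names are illustrative, and piecewise-linear interpolation between the measured points is an assumption made here for simplicity (leakage in particular grows roughly exponentially, so linear interpolation is only a coarse estimate).

```python
# Measured values from Figure 5.2 (90nm LP-CMOS, VDD=1.1V, T=85C, slow die).
FBB_V = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
SPEEDUP_PCT = {"SVT": [0, 4, 9, 14, 19, 24], "HVT": [0, 7, 14, 22, 31, 40]}
LEAKAGE_FACTOR = {"SVT": [1.0, 1.6, 2.7, 4.9, 9.9, 24.4],
                  "HVT": [1.0, 1.8, 3.5, 7.7, 20.4, 80.0]}

def interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("value outside the measured range")
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def fbb_for_speedup(vth, speedup_pct):
    """FBB [V] needed to reach a given performance increase [%]."""
    return interp(speedup_pct, SPEEDUP_PCT[vth], FBB_V)

def leakage_factor(vth, fbb_v):
    """Relative leakage increase at a given FBB [V]."""
    return interp(fbb_v, FBB_V, LEAKAGE_FACTOR[vth])

if __name__ == "__main__":
    for vth in ("SVT", "HVT"):
        v = fbb_for_speedup(vth, 12)
        print(f"{vth}: 12% speed-up needs about {v:.2f} V FBB, "
              f"leakage x{leakage_factor(vth, v):.1f}")
```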

Figure 5.2 illustrates the parameters that are under control with body bias driven (BBD) design. The right-hand side of Figure 5.2 plots the dependency between clock period and FBB. A higher FBB value enables faster circuit operation. The amount of speed-up depends on the process technology, the transistor threshold voltage option used, and the design's power supply voltage. The left-hand side of Figure 5.2 plots the relationship between circuit area and clock period. For increasing FBB values, the curve shifts in linear proportion to the reduction in clock period. Notice that a performance increase by FBB can be traded off against a performance decrease due to a smaller circuit area. In this way, it is possible to maximize the PPA ratio of the circuit at design-time, while meeting a target performance. The effectiveness of BBD design depends on the performance tuning range available with FBB. The experimental results that were obtained for a set of SVT and HVT ring-oscillator test-structures in 90nm LP-CMOS are briefly summarized in Figure 5.2. A 24% and 40% performance increase is observed for the SVT and HVT ring-oscillator test-structures, respectively, when 0.5V FBB is applied to both N- and P-wells simultaneously. In contrast, leakage increases by up to about 25x and 80x, respectively. The leakage increase is more severe for HVT. This is because the forward-biased junction leakage at 0.5V FBB dominates over the sub-threshold leakage. To alleviate the leakage penalty with FBB, one can disable FBB in standby or low-performance use-cases. In fact, this is dynamic FBB. Also, let us focus on the use of SVT and HVT in the remainder of this work. This is because the intrinsic leakage of LVT is about 10x higher than that of SVT, which has a large impact on power consumption in standby or low-activity use-cases.

Let us take a look at the following optimization problems for BBD design. Let Ψ represent all paths in the circuit. Dj expresses the propagation delay of a path j ∈ Ψ. There are q gates in the circuit. The gate sizing factor of gate i ∈ j is represented by parameter xi. For each type of gate, there are m different gate sizes available. Ptotal is the total power consumption of the circuit. Let VBB=Vpwell=VDD-Vnwell represent the amount of FBB applied to the digital core. Then, the maximum PPA optimization is as follows:

$$\begin{aligned}
\text{maximize} \quad & PPA \\
\text{subject to} \quad & P_{total} \le P_{max} \\
& 1 \le x_i \quad \forall i \in \{1, 2, \ldots, q\} \\
& x_i \in \{1, 2, \ldots, m\} \\
& V_{BB} \in [0, 0.5]\,\mathrm{V}
\end{aligned}$$

The result of this optimization is the highest performing design without area over-dimensioning that meets a maximum power requirement, Pmax. Figure 5.3 shows an example of how BBD design can be utilized to improve a slow-process performance to achieve a nominal-process performance. FBB needs only to be applied to those die samples with a lower speed than that of a nominal process outcome. In contrast to the use of post-silicon tuning only, BBD design implements the best area-efficient high-performance design, i.e. the maximum PPA design.

[Figure: number of die samples versus frequency for conventional "worst-case" design (WCD target) and the proposed "body bias driven" design (BBD target), obtained through design optimization with FBB enabled.]

Figure 5.3 Nominal-process performance under body bias driven design.

Another optimization strategy can be pursued to achieve a minimum area design that meets performance requirements as follows:

$$\begin{aligned}
\text{minimize} \quad & A \\
\text{subject to} \quad & D_j \le T_{ck} \quad \forall j \in \Psi \\
& P_{total} \le P_{max} \\
& 1 \le x_i \quad \forall i \in \{1, 2, \ldots, q\} \\
& x_i \in \{1, 2, \ldots, m\} \\
& V_{BB} \in [0, 0.5]\,\mathrm{V}
\end{aligned}$$

The result of this optimization is the smallest design that meets the targeted performance and power requirements when a given body bias is applied to the circuit. Figure 5.4 shows an example of how BBD design can be utilized to achieve performance specifications while reducing circuit area. FBB enhances performance, while the smaller circuit area reduces circuit speed. Like before, FBB needs only to be applied to those die samples with a lower speed than the nominal process outcome.

[Figure: number of die samples versus frequency for conventional "worst-case" design and the proposed "body bias driven" design with a common target, obtained through design optimization with FBB enabled.]

Figure 5.4 Performance tuning for a minimum area body bias driven design.

Let us now discuss the relationship between the fitting parameters of expression (5.1) for WCD and BBD design styles. Parameter η is identical for both design styles because it represents the circuit area under very relaxed or unconstrained timing. Now, notice that if both WCD and BBD circuits were optimized in the same way over the entire clock period range, then χ would be the same. However, this is not true in general since BBD libraries are faster than conventional libraries, i.e. the gate drive of a forward body biased cell is larger than that of the same cell without FBB. Let us now take a look at speed and area trade-offs between WCD and BBD design styles. Suppose first that a given circuit area (Atotal_bbd=λAtotal_wcd) is desired, then the clock period of the BBD circuit can be obtained from the WCD clock period:

$$T_{ck\_bbd} = -\delta_{bbd} + \frac{\chi_{bbd}\,(T_{ck\_wcd} + \delta_{wcd})}{\lambda\chi_{wcd} + (\lambda\eta_{wcd} - \eta_{bbd})(T_{ck\_wcd} + \delta_{wcd})} \qquad (5.9)$$

where λ represents the fraction Atotal_bbd/Atotal_wcd. Parameter λ equals 1 for a constant circuit area between both design styles. Alternatively, suppose now that a given clock period (Tck_bbd=σTck_wcd) is pursued, then the circuit area of the BBD circuit can be obtained from the WCD circuit area as follows

$$A_{total\_bbd} = \eta_{bbd} + \frac{\chi_{bbd}\,(A_{total\_wcd} - \eta_{wcd})}{\sigma\chi_{wcd} + (\delta_{bbd} - \sigma\delta_{wcd})(A_{total\_wcd} - \eta_{wcd})} \qquad (5.10)$$

where σ represents the fraction Tck_bbd/Tck_wcd, and equals 1 for a constant clock period. Notice from expression (5.9) that the speed advantage of the BBD circuit depends only on the difference between the δ values, provided that χbbd=χwcd. The smaller area of BBD in expression (5.10) is also due to the difference between the δ values. These results are expected since digital gates with FBB have a greater output drive than without FBB. Consequently, smaller gates are employed in BBD designs. Expressions (5.9) and (5.10) enable designers to estimate the effectiveness of BBD over WCD in trading off circuit speed against area. Design and process technology alternatives can be compared once the parameter values for χ, δ, and η are known. These parameters are design dependent because the number and type of digital cells used, as well as the logic implementation, differ from design to design. Moreover, they are also process technology dependent because circuit area, performance and body bias sensitivity depend on technology scaling. For example, a given digital logic circuit will be smaller (lower η, different χ) and faster (lower |δ|, different χ) when implemented in a next-generation CMOS technology.
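Expressions (5.9) and (5.10) follow from evaluating expression (5.1) for the WCD curve and inverting it for the BBD curve. The sketch below illustrates this with the fitting parameters reported for the 90nm microprocessor in Table 5.2. The class and function names are illustrative, and the chosen operating point of 6ns is an assumption made only for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Curve:
    """Area-clock period curve A = chi/(T + delta) + eta, expression (5.1)."""
    chi: float    # [um^2 * ns]
    delta: float  # [ns]
    eta: float    # [um^2]

    def area(self, t_ck):
        return self.chi / (t_ck + self.delta) + self.eta

    def clock_period(self, area):
        return -self.delta + self.chi / (area - self.eta)

def bbd_clock_period(wcd, bbd, t_ck_wcd, lam=1.0):
    """Expression (5.9): BBD clock period for A_bbd = lam * A_wcd."""
    s = t_ck_wcd + wcd.delta
    return -bbd.delta + bbd.chi * s / (lam * wcd.chi + (lam * wcd.eta - bbd.eta) * s)

def bbd_area(wcd, bbd, a_wcd, sigma=1.0):
    """Expression (5.10): BBD area for T_bbd = sigma * T_wcd."""
    d = a_wcd - wcd.eta
    return bbd.eta + bbd.chi * d / (sigma * wcd.chi + (bbd.delta - sigma * wcd.delta) * d)

# Fitting parameters as listed in Table 5.2.
WCD = Curve(chi=11.38e4, delta=-4.65, eta=2.78e5)
BBD = Curve(chi=8.64e4, delta=-3.65, eta=2.75e5)

if __name__ == "__main__":
    t_wcd = 6.0                       # example WCD clock period [ns]
    a_wcd = WCD.area(t_wcd)
    print("iso-area BBD clock period [ns]:", bbd_clock_period(WCD, BBD, t_wcd))
    print("iso-speed BBD/WCD area ratio  :", bbd_area(WCD, BBD, a_wcd) / a_wcd)
```

For this operating point the model predicts a shorter BBD clock period at equal area and a smaller BBD area at equal clock period, in line with the discussion above.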

5.4 Optimum Design Space for High-Performance Circuits

Let us explore area, performance and power trends for WCD and BBD design styles by using the previously presented models. For this purpose, let us take a generic digital logic circuit with calibrated technology parameters for 90nm LP-CMOS. The analysis was done at VDD=1.1V and T=85°C. For the BBD design, the utilized maximum FBB is 0.5V to explore the limits of a PPA driven design. All results relate to the slow-process corner. Let us also discuss technology scaling implications by analyzing the same circuit in 65nm and 45nm LP-CMOS. The same process and operating conditions have been used as before with the exception of VDD=1V for the 45nm case.

5.4.1 Performance-per-Area Trends

[Plot: relative area versus clock period for WCD and BBD design, with iso-PPA curves as overlay.]

Figure 5.5 Area, speed, and PPA trade-off for a generic digital logic circuit. Solid line: WCD, dashed line: BBD, overlay: PPA.


Figure 5.5 shows the design exploration space for circuit area, clock period and PPA. The area-clock period trend curves are plotted for WCD (solid line) and BBD (dashed line) design. The iso-PPA curves are plotted as overlay. The intersection with the area-clock period curves represents the normalized PPA ratio of the design as defined by expression (5.8). Logic synthesis usually aims at achieving a given target speed. By way of example, all PPA values of Figure 5.5 have been normalized to the maximum frequency of operation under WCD (Tck=Tmin). This reference point is highlighted by the triangle symbol in Figure 5.5. The triangle is located at a clock period of Tmin, while the circles relate to Tbest, which are the corresponding best PPA points. Observe from Figure 5.5 that for a given circuit area, BBD design achieves higher performance than its WCD counterpart. Alternatively, BBD design enables lower area designs for a given clock period. Any FBB of less than 0.5V results in area-clock period curves located in between the two curves plotted in Figure 5.5. Therefore, it makes most sense to use BBD design with the maximum possible FBB to obtain the best PPA ratio.

Figure 5.6 highlights the PPA and clock period trends under WCD and BBD design. Notice that BBD design achieves a better PPA ratio than WCD under all circumstances. From large clock periods towards smaller ones, the PPA increases to a maximum value irrespective of the chosen design style. The PPA increases because the decrease in clock period is greater than the increase in circuit area. This trend is reversed after the maximum PPA has been reached, due to area over-dimensioning. Observe from Figure 5.6 that the maximum PPA can be significantly higher than the PPA at the maximum frequency under WCD. At large clock periods, the PPA of the BBD and WCD circuits is similar (not shown in Figure 5.6).

[Plot: relative PPA versus clock period for WCD and BBD design.]

Figure 5.6 PPA versus clock period for a generic digital logic circuit. Solid line: WCD, dashed line: BBD.

5.4.2 Power Consumption Trends

Figure 5.7 shows the design exploration space for circuit area, clock period and power consumption. The trend lines for total power and dynamic power have been indicated under body bias conditions. The total power includes both dynamic and leakage power consumption. The iso-power curves are plotted as overlay: the solid and dotted lines correspond to the total power and dynamic power, respectively. The intersection of the total power curves with the area-clock period curves represents the power consumed by the design. Observe in Figure 5.7 that the power increases for decreasing clock periods due to a larger circuit area and a higher frequency of operation. The dynamic power increases linearly with operating frequency. Recall that the same power supply voltage of 1.1V is used for both WCD and BBD design. Only when FBB is applied does the leakage power become noticeable in the total power consumption. In this case, a difference occurs between the iso-total-power and iso-dynamic-power curves due to FBB under BBD design. This difference becomes larger for larger FBB values. The snapback point of the total power trends defines the maximum FBB value to be applied from a power point of view. In the considered case, this point occurs at an FBB value of 0.5V. Notice that BBD enables lower power operation at a constant clock period. This is because of the lower circuit area for BBD design. For a given power target, BBD design offers better performance and area figures. However, BBD design consumes more power than WCD for the same circuit area. This is not only because of the higher operating frequency, but also due to the higher junction capacitance and leakage power associated with the application of FBB.

[Plot: relative area versus clock period for WCD and BBD design, with iso-power curves as overlay.]

Figure 5.7 Area, speed, and power trade-off for a generic digital logic circuit. Solid line: WCD, dashed line: BBD, overlay: total power consumption (solid) and dynamic power consumption (dotted).

5.4.3 Impact of Technology Scaling

Figure 5.8 shows the design exploration space for circuit area, clock period and total power for the same generic digital logic circuit in different process technology nodes. Three groups of iso-power curves are plotted as overlay, each representing a given technology node. The symbols represent the maximum PPA designs, for which the results are summarized in Table 5.1. All values have been normalized to the maximum PPA design under WCD for 90nm LP-CMOS. Observe in Figure 5.8 the same area-clock period trends for each technology node. BBD design consistently outperforms WCD. The maximum PPA design is faster and smaller in a next-generation technology. Consequently, the PPA increases with technology scaling, as illustrated in Table 5.1. BBD design achieves a similar PPA increase in each technology, because the performance increase with FBB is nearly constant (see section 3.2). Also observe in Table 5.1 the opposing total power trends under WCD and BBD design. For WCD, the maximum PPA design operates at lower power in a scaled technology, despite the higher clock speed and the increasing leakage. This is no longer the case for BBD design, mainly due to the amplified leakage with FBB, which is more pronounced in a scaled technology.

Table 5.1 Technology scaling results of maximum PPA designs at T=85°C.

Design   Technology      Relative PPA   (header missing in source)   Relative Total Power
WCD      90nm LP-CMOS    1              1                            1
WCD      65nm LP-CMOS    1.45           0.85                         0.87
WCD      45nm LP-CMOS    2.49           0.61                         0.74
BBD      90nm LP-CMOS    1.27           0.79                         1.75
BBD      65nm LP-CMOS    1.82           0.68                         1.81
BBD      45nm LP-CMOS    3.17           0.46                         2.12
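The two trends discussed for Table 5.1 can be checked with a few lines of arithmetic. The sketch below, with illustrative variable names, divides the BBD values by the WCD values per technology node: the PPA gain stays close to constant, while the total power ratio grows with scaling.

```python
# Values from Table 5.1, ordered 90nm, 65nm, 45nm LP-CMOS.
NODES = ["90nm", "65nm", "45nm"]
REL_PPA = {"WCD": [1.0, 1.45, 2.49], "BBD": [1.27, 1.82, 3.17]}
REL_TOTAL_POWER = {"WCD": [1.0, 0.87, 0.74], "BBD": [1.75, 1.81, 2.12]}

def bbd_over_wcd(table):
    """Per-node ratio of the BBD value to the WCD value."""
    return [b / w for b, w in zip(table["BBD"], table["WCD"])]

ppa_gain = bbd_over_wcd(REL_PPA)
power_ratio = bbd_over_wcd(REL_TOTAL_POWER)

if __name__ == "__main__":
    for node, g, p in zip(NODES, ppa_gain, power_ratio):
        print(f"{node}: PPA gain x{g:.2f}, total power ratio x{p:.2f}")
```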

[Plot: relative area versus clock period for WCD and BBD design in 90nm, 65nm and 45nm LP-CMOS, with total power curves as overlay and the maximum PPA designs marked by symbols.]

Figure 5.8 Area, speed, and power trade-off curves for a generic digital logic circuit implemented in different technology nodes. The values have been normalized to the maximum PPA WCD design in 90nm CMOS. Solid line: WCD, dashed line: BBD, overlay: total power, symbols: maximum PPA designs.

5.5 Model Validation

WCD and BBD design have been analyzed and compared for a commercial microprocessor design in 90nm LP-CMOS. The circuit contains 3764 flip-flops and about 31K combinational gates. It makes use of SVT devices only. This section presents correlated results obtained from logic synthesis and the models presented in this chapter. As before, the analysis has been performed for slow-process conditions, VDD=1.1V, and T=85°C. BBD design makes use of a maximum FBB of 0.5V. Commercial synthesis tools can target area optimization subject to delay constraints. To validate the new approach, BBD synthesis was carried out in Cadence RTL Compiler². Digital cell libraries have been re-characterized to account for FBB in 90nm LP-CMOS using Altos' Liberate library characterizer³. The library characterization uses the effective current source model (ECSM) for timing, noise and power modelling. To enable BBD synthesis, FBB-characterized timing views have been created, utilizing 0.5V FBB for PMOS and NMOS transistors. Both WCD and BBD digital cell libraries have been characterized for slow process conditions, VDD=1.1V, and T=125°C settings. Such digital cell libraries also enable static timing verification of BBD circuits. The design synthesis targeted reconstruction of the area-clock period trade-off curve. The greedy algorithm presented in section 5.2 was used for this purpose. For each design style, only four synthesis runs were required for reconstruction. The algorithm received the following inputs: Tlow=3ns, Thigh=30ns, and ε=250ps. Table 5.2 summarizes the fitting parameters of expression (5.1) for the microprocessor design as well as TA and Tbest for WCD and BBD design styles. One more synthesis run was required to obtain the optimum PPA design.

² Cadence RTL Compiler. [Online]. Available: http://www.cadence.com/products/ld/rtl_compiler
³ Altos Liberate. [Online]. Available: http://www.altos-da.com/pdfs/liberate-ds.pdf

Table 5.2 Model fitting parameters for the 90nm microprocessor design.

       χ [µm²⋅ns]    δ [ns]    η [µm²]      TA [ns]    Tbest [ns]
WCD    11.38⋅10⁴     -4.65     2.78⋅10⁵     15.3       6.0
BBD    8.64⋅10⁴      -3.65     2.75⋅10⁵     13.0       4.7

Figure 5.9 shows the design exploration space for circuit area and clock period for the given microprocessor design. The synthesis results have been indicated by circles and triangles for WCD and BBD design, respectively. The filled symbols correspond to the four synthesis cases based on the proposed algorithm. The open symbols are additional synthesis cases for trend verification purposes only. The solid and dotted lines show the calculated trade-off curves for WCD and BBD design, respectively, by using expression (5.1). The fitting parameters of the model are given in Table 5.2.

[Plot: circuit area [sq.um] versus clock period [ns]; each synthesis point is labelled with its normalized PPA ratio.]

Figure 5.9 Area versus speed and PPA of the 90nm microprocessor. Lines: WCD (solid) and BBD (dashed) model, symbols: synthesis results. The normalized PPA ratio is indicated for each design. T=85°C.


Observe from Figure 5.9 the close match between the modelled and the synthesized area-clock period trends. The RMS-error between the calculated curves and the location of each synthesis result is within 1.5%. After completing the fourth synthesis run, it was possible to calculate TA and Tbest values which are given in Table 5.2. The PPA value for each synthesis point has been indicated in Figure 5.9 normalized to a Tmin of 5.5ns under WCD. Observe the existence of a maximum PPA design for both WCD (PPAnorm=1.01 @ 6ns) and BBD design (PPAnorm=1.27 @4.7ns). The calculated Tbest values match within 5% of the values obtained through synthesis (WCD: 6ns, BBD: 4.9ns). As expected, BBD design not only gives a better performance but also better area utilization as indicated by the PPA value.
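The Tbest values quoted above can be reproduced from the fitting parameters in Table 5.2 with expressions (5.5) to (5.7). The sketch below is an illustration with names chosen here. It also compares the calculated Tbest with the clock periods of the best synthesized designs quoted in the text (WCD: 6ns, BBD: 4.9ns).

```python
import math

def ppa(t_ck, chi, delta, eta):
    """Expression (5.5): PPA as a function of the clock period."""
    return (t_ck + delta) / (eta * t_ck ** 2 + (chi + delta * eta) * t_ck)

def t_best(chi, delta, eta):
    """Expression (5.6): clock period of the maximum PPA design."""
    if delta * chi * eta > 0:
        raise ValueError("expression (5.6) requires delta*chi*eta <= 0")
    return -delta + math.sqrt(-delta * chi * eta) / eta

def max_ppa(chi, delta, eta):
    """Expression (5.7): PPA at Tck = Tbest."""
    return 1.0 / (chi - delta * eta + 2.0 * math.sqrt(-delta * chi * eta))

# (chi [um^2*ns], delta [ns], eta [um^2]) as listed in Table 5.2.
FIT = {"WCD": (11.38e4, -4.65, 2.78e5), "BBD": (8.64e4, -3.65, 2.75e5)}
# Clock periods [ns] of the best synthesized designs, as quoted in the text.
SYNTHESIS_OPTIMUM_NS = {"WCD": 6.0, "BBD": 4.9}

def relative_deviation(style):
    """Deviation of the modelled Tbest from the synthesized optimum."""
    t_syn = SYNTHESIS_OPTIMUM_NS[style]
    return abs(t_best(*FIT[style]) - t_syn) / t_syn

if __name__ == "__main__":
    for style, params in FIT.items():
        print(f"{style}: Tbest = {t_best(*params):.2f} ns, "
              f"deviation from synthesis = {100 * relative_deviation(style):.1f}%")
```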

[Plot: circuit area versus clock period [ns]; each synthesis point is labelled with its normalized total power.]

Figure 5.10 Area versus speed and total power of the 90nm microprocessor. Lines: WCD (solid) and BBD (dashed) model, symbols: synthesis results. The normalized total power is indicated for each synthesized design. T=85°C.

Figure 5.10 presents the same area and clock period curve as before, but now the symbols indicate the normalized power consumption of each synthesis run. For the given microprocessor design, BBD design provides lower power operation than WCD at the same clock period. Contrarily, BBD design consumes more power at the same circuit area.

5.6 Benchmarked Results

BBD and WCD results are shown in this section for three industrial processor designs in 90nm LP-CMOS. Logic synthesis, physical implementation and power analysis have been done using Cadence's RTL Compiler, First Encounter, and Encounter Timing System, respectively. All results have been obtained for a slow process corner, VDD=1.1V, and T=85°C. BBD design utilizes a maximum FBB of 0.5V. All area results account for layout effects including the overhead for Deep-N-well isolation.


Design | Vth | Clock period WCD [ns] | Clock period BBD [ns] | Area WCD [µm2] | Area BBD (rel.) | PPA WCD | PPA BBD | Dynamic power WCD [mW] | Dynamic power BBD (rel.) | Leakage power WCD [µW] | Leakage power BBD (rel.)
Digital signal processor | SVT | 6.5 | 5.0 | 32656 | 1.04 | 1.10 | 1.38 | 2.9 | 2.16 | 9.1 | 21.9
Microprocessor | SVT | 6.0 | 4.9 | 342705 | 0.97 | 1.01 | 1.25 | 19.1 | 1.60 | 67 | 21.4
Multimedia processor | SVT | 3.8 | 3.0 | 3047184 | 0.95 | 1.03 | 1.35 | 67.2 | 1.72 | 519 | 22.3
Digital signal processor | HVT | 10.5 | 6.5 | 34386 | 1.01 | 1.19 | 1.90 | 1.4 | 2.28 | 0.8 | 216
Microprocessor | HVT | 8.5 | 6.0 | 374440 | 0.93 | 1 | 1.52 | 13.8 | 1.69 | 6.7 | 230
Multimedia processor | HVT | 5.8 | 3.9 | 3447298 | 0.89 | 1.02 | 1.73 | 42.0 | 1.83 | 50.7 | 274

Table 5.3 Industrial processor designs for maximum PPA in 90nm LP-CMOS. Relative values are shown w.r.t. WCD for the given Vth option. Conditions: slow process corner, VDD=1.1V and T=85°C.


Table 5.4 Example gate count of three industrial microprocessor designs.

Design | #flip-flops | #logic gates
Digital signal processor | 227 | 4416
Microprocessor | 3764 | 34390
Multimedia processor | 41749 | 252759

Each processor design has been implemented in both SVT and HVT flavours. Table 5.4 shows the gate count summary. Two synthesis cases have been investigated, namely: i) a maximum PPA design, and ii) a maximum frequency design under WCD. In the latter case, BBD design is utilized to operate at the same speed at a lower area cost to improve the PPA ratio. The synthesis cases reflect the optimization problems formulated in section 5.3.

5.6.1 Design Synthesis for Maximum PPA

Table 5.3 presents the processor design results targeting a maximum PPA design. Five circuit parameters are presented, namely clock period, circuit area, PPA, dynamic power and leakage power. The BBD design results are presented relative to the WCD results. The PPA ratio has been normalized to the maximum performance under WCD (Tck=Tmin).

Let us first consider the SVT results. Observe that the PPA ratio differs for each design. This depends on circuit characteristics such as circuit size, path delay distribution, and logic depth. Under WCD, the PPA ratio ranges from 1.01 to 1.10. The maximum PPA point for small circuits (low η value) tends to be located at larger clock period values (Tbest>Tmin), as can be inferred from expression (5.6). This explains the high PPA value of 1.10 for the digital signal processor. For large circuits (high η value), the maximum PPA value is located closer to, or equal to, the minimum clock period, Tbest≈Tmin. The path delay distribution of the multimedia processor is the reason for its better PPA value w.r.t. the microprocessor design (1.03 instead of 1.01). The multimedia processor has many (nearly) critical delay paths, which are largely responsible for the area over-dimensioning of the design when high performance is required. BBD design enables significant improvements in maximum PPA as compared to WCD, mainly due to higher clock speeds. The maximum PPA of the BBD designs ranges between 1.25 and 1.38.

Let us now look into the HVT results. The same PPA trends are observed as in the SVT case, but the increase in maximum PPA is much larger (maximum PPA: 1.52-1.90). This is because FBB has a larger impact on circuit speed for HVT. It is worth noticing that the HVT BBD processors can operate at the same speed as the SVT WCD equivalents. However, their PPA values are slightly lower due to a higher circuit area. Irrespective of the Vth option used, BBD design always provides a higher maximum PPA ratio than WCD. All BBD circuits operate faster than their WCD counterparts, while circuit area is comparable.

Table 5.3 also shows the dynamic power and leakage power consumption for each processor design. Notice that dynamic power dominates leakage power, even at a high operating temperature of 85°C and when FBB is applied. The ratio between dynamic and leakage power is in the range of 100-300 for SVT WCD (800-2100 for HVT WCD) for the considered processor designs. Under BBD design, this ratio is reduced to 10-30 for SVT BBD design, and 5-20 for HVT BBD design. Observe that the dynamic power for BBD is generally higher than under WCD. There are two reasons for this, namely 1) the higher clock speed, and 2) the higher junction capacitance due to FBB. Next, the BBD leakage power is significantly higher than


Design | Vth | Clock period WCD, BBD [ns] | Area WCD [µm2] | Area BBD (rel.) | PPA WCD | PPA BBD | Dynamic power WCD [mW] | Dynamic power BBD (rel.) | Leakage power WCD [µW] | Leakage power BBD (rel.)
Digital signal processor | SVT | 5.5 | 42457 | 0.74 | 1 | 1.35 | 3.7 | 0.98 | 14 | 12.5
Microprocessor | SVT | 5.5 | 370932 | 0.85 | 1 | 1.18 | 22.5 | 0.93 | 78 | 17.1
Multimedia processor | SVT | 3.6 | 3233781 | 0.89 | 1 | 1.17 | 71.3 | 0.95 | 569 | 19.8
Digital signal processor | HVT | 8.8 | 48621 | 0.60 | 1 | 1.72 | 2.0 | 0.75 | 1.4 | 152
Microprocessor | HVT | 8.5 | 374440 | 0.80 | 1 | 1.50 | 12.6 | 0.75 | 6.7 | 184
Multimedia processor | HVT | 5.7 | 2857787 | 0.84 | 1 | 1.36 | 43.3 | 0.87 | 55 | 219

Table 5.5 Industrial processor designs for minimum area in 90nm LP-CMOS. Relative values are shown w.r.t. WCD for the given Vth option. Conditions: slow process corner, VDD=1.1V and T=85°C.


the WCD leakage when FBB is utilized. FBB turns on the transistor’s junction diodes, which leads to a high additional leakage current, especially at higher operating temperatures. This is also reflected in the total power, which is the sum of the dynamic and leakage power components. However, recall that FBB is only applied to those chip samples whose frequency is lower than the targeted one due to the process outcome. Such slow samples already have an intrinsically low leakage current. Slow process-corner samples receive the maximum FBB, while the other slow samples receive a lower FBB. When no FBB is applied, the BBD leakage power is proportional to the circuit area scaling (not shown in Table 5.3). In addition, recall that FBB is applied dynamically during chip operation. In this way one avoids the leakage penalty associated with FBB during standby operation.

5.6.2 Design Synthesis for Minimum Area

Table 5.5 presents the processor design results targeting a maximum WCD performance. The BBD circuits were designed to match the WCD performance. In this case, BBD circuits can enable significant area savings, irrespective of the Vth option used. Area reductions between 11% and 26% w.r.t. their WCD versions were observed for the SVT BBD designs. The benefits for the HVT processors are larger due to the stronger FBB dependence (14%-40% area savings). The reduced circuit area comes mostly from the area scaling of the combinatorial logic. In general, BBD circuits have fewer logic gates than WCD ones, while the number of flip-flops is the same. The largest area savings have been obtained for the digital signal processor, which has about 19x more logic gates than flip-flops. The ratio between logic gates and flip-flops is lower for the other circuits, as can be derived from Table 5.4. The lowest ratio is found for the multimedia processor, namely about 6x more logic gates than flip-flops. This explains the area scaling trends observed.

The PPA for the BBD processors is not optimal, because the BBD operating frequency is not fully utilized. However, it is significantly higher than for their WCD equivalents irrespective of the Vth option used. BBD design renders consistently lower dynamic power than WCD when operating at the same maximum WCD frequency. The power reduction comes from the reduced circuit area, despite the increasing junction capacitances with FBB. The dynamic power reduces by up to 7% for the SVT processors, while HVT processors achieve up to 25% dynamic power reductions. It was noticed that BBD design primarily affects logic gates in the data path; the clock power is not much reduced. Thus, the dynamic power savings are larger for circuits with higher data activities.

As before, the BBD leakage power is much higher than the WCD leakage when FBB is utilized. The leakage power increases by up to 20x for the SVT processors, and by up to 219x for the HVT ones. Recall that this leakage increase is of no concern since FBB is disabled during standby operation. The leakage power for BBD design without FBB enabled decreases by the same factor as the circuit area (not shown in Table 5.5). For samples that do not need FBB to achieve performance, a leakage reduction of up to 26% and up to 40% is possible in the case of SVT and HVT, respectively.
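The PPA entries of Table 5.3 can be cross-checked against the tabulated clock periods and areas. The following is a minimal sketch that assumes PPA is proportional to 1/(Tck·area) and is normalized to the WCD design at Tmin taken from Table 5.5; the formal definition is given in section 5.2 and is not repeated here, so the helper `ppa_norm` and this proportionality are illustrative assumptions. The numbers are those of the SVT digital signal processor.

```python
# Values transcribed from Tables 5.3 and 5.5 (digital signal processor, SVT).
T_MIN_WCD = 5.5        # [ns] fastest WCD design (Table 5.5)
AREA_TMIN_WCD = 42457  # [um^2] WCD area at Tmin (Table 5.5)


def ppa_norm(t_ck, area, t_ref=T_MIN_WCD, area_ref=AREA_TMIN_WCD):
    """PPA relative to the reference design, assuming PPA ~ 1 / (Tck * area)."""
    return (t_ref * area_ref) / (t_ck * area)


wcd_area = 32656            # [um^2] maximum-PPA WCD design (Table 5.3)
bbd_area = 1.04 * wcd_area  # BBD area is tabulated relative to WCD

ppa_wcd = ppa_norm(6.5, wcd_area)
ppa_bbd = ppa_norm(5.0, bbd_area)
print(f"WCD: {ppa_wcd:.2f}  BBD: {ppa_bbd:.2f}")  # WCD: 1.10  BBD: 1.38
```

Under this assumption the sketch reproduces the tabulated PPA values of 1.10 (WCD) and 1.38 (BBD) for this design; because the tabulated inputs are rounded, other rows may differ in the last digit.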


5.7 Discussion

In this chapter a new design synthesis strategy for digital CMOS integrated circuits was presented that makes use of forward body biasing. The proposed approach consistently renders a better performance-per-area ratio by constraining circuit over-dimensioning without sacrificing circuit performance. An in-depth analysis of body bias driven design was provided, which enables designers to predict the design’s optimum performance per area with a minimum number of synthesis runs. These new concepts were validated through industrial processor designs in 90nm LP-CMOS. Performance-per-area improvements of up to 40%, area and leakage reductions of up to 30%, and dynamic power savings of up to 10% without performance penalties were observed for SVT implementations. The benefits are larger for HVT implementations. In this case, performance-per-area improvements of up to 90%, area and leakage reductions of up to 40%, and dynamic power savings of up to 25% without performance penalties were observed as a benefit of the proposed BBD design strategy.

BBD design enables the implementation of digital circuits with smaller design margins than conventional WCD approaches. This enables circuit designers to realize more cost-effective, higher performance or lower power solutions. The over-dimensioning of the design can be avoided by following the performance-per-area design theory. The maximum PPA design identifies the fastest design possible without area over-dimensioning. BBD design is particularly effective for high-performance digital circuits or those circuits that are performance constrained. It remains effective as long as a sufficient silicon tuning range is available in a given technology node.

Chapter 6

Body Bias Driven Ultra-Low-Power

Digital Circuits

The quest for more energy-efficient integrated circuit solutions has opened avenues for ultra-low-power enabling design styles. The effectiveness of body bias driven designs to enable low-area performance-efficient ultra-low-power (ULP) digital circuits is addressed in this chapter. It will be shown that this new approach renders significant improvements in circuit performance, area occupation, and energy consumption in the presence of process parameter variations. The focus is on low-energy performance-efficient design strategies applicable to consumer portable applications with target operating frequencies in the range of 1MHz to 50MHz.

Reducing the supply voltage (VDD) is the most favoured approach to achieve ultra-low-power operation. For very low speed and energy-starved applications such as autonomous sensor nodes, RF-ID tags, and biomedical applications, sub-threshold design has become extremely popular [61][62][63]. In sub-threshold operation, VDD is set below the threshold voltage of the transistors (Vth), thereby minimizing energy consumption. Operation at minimum energy is achieved in the sub-threshold regime [64][65]. Figure 6.1 illustrates the theoretical minimum energy point (MEP) for a generic digital circuit design in 90nm LP-CMOS. However, operating in sub-threshold comes at a large performance penalty; clock frequencies are limited to a few hundred kilohertz only. As a result, sub-threshold design is suited only to niche applications where performance is a secondary priority.

Many digital circuit applications require clock frequencies beyond a few hundred kilohertz. This motivated the introduction of near-threshold design, where VDD is raised to a level just above the threshold voltage of the transistors [66]. In this way, circuit performance can be increased into the MHz range while sacrificing minimum energy operation. Near-threshold design can achieve up to 10x higher energy-efficiency as compared to nominal VDD operation [66]. Typical applications benefitting from near-threshold operation are portable applications. To satisfy more performance-demanding applications, near-threshold design has been combined with circuit-level parallelism and wide-range voltage scaling [11][67]. Although this approach has demonstrated its effectiveness in achieving the required performance levels, its disadvantages are the increase in circuit area and the fact that not every circuit is suited to parallelization.

[Figure 6.1: normalized energy per cycle (logarithmic axis) versus supply voltage [V] (0.1V to 1.2V), showing Edyn=αC·VDD², Eleak=∫VDD·Ileak dt and Etotal=Edyn+Eleak, with the MEP marked. T=25°C, α=0.1%.]

Figure 6.1 Energy versus VDD for a generic 90nm HVT digital circuit. MEP refers to the minimum energy point. Vth=0.6V.
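The existence of a minimum energy point can be illustrated with a small numerical sketch. It assumes a purely sub-threshold model in which the circuit runs at its maximum frequency, so that the clock period scales with C·VDD/Idrive and the ratio Ileak/Idrive equals exp(-VDD/(m·UT)). The slope factor, the lumped leakage factor `K_LEAK` and the unit capacitance are hypothetical illustration values, not parameters of the circuit in Figure 6.1.

```python
import math

U_T = 0.0257    # thermal voltage near 25 C [V]
M_SLOPE = 1.5   # assumed sub-threshold slope factor m
ALPHA = 0.001   # activity factor of 0.1 %, as annotated in Figure 6.1
K_LEAK = 1.0    # hypothetical lumped factor weighting leakage against switching


def energy_per_cycle(v_dd):
    """Normalized Etotal = Edyn + Eleak with the switched capacitance set to 1."""
    e_dyn = ALPHA * v_dd ** 2
    # Eleak = VDD * Ileak * Tck, with Tck ~ VDD / Idrive and
    # Ileak / Idrive = exp(-VDD / (m * U_T)) in sub-threshold.
    e_leak = K_LEAK * v_dd ** 2 * math.exp(-v_dd / (M_SLOPE * U_T))
    return e_dyn + e_leak


# Sweep VDD from 0.150 V to 0.600 V in 1 mV steps and pick the lowest energy.
grid = [mv / 1000 for mv in range(150, 601)]
v_mep = min(grid, key=energy_per_cycle)
print(f"MEP at VDD = {v_mep:.3f} V")
```

Below the MEP the leakage energy grows because the cycle time increases exponentially; above it the quadratic dynamic term dominates. The exact location depends strongly on the assumed activity and leakage factors.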

A key issue for ultra-low-power circuits is the increased sensitivity of circuit performance to process parameter variations when operating at reduced VDD supplies. Circuit design with minimum size devices was suggested to operate at the theoretical MEP [65]. However, device upsizing may be required to achieve operational robustness against process variability, at the expense of a higher energy consumption and larger area occupation [68][69]. Achieving process balance, e.g. an equal NMOS and PMOS on-current, has been shown to be essential for minimum energy operation [70]. Moreover, industrial integrated circuits must be dimensioned such that specifications are met across the entire process window as defined by the foundry to ensure high production yield; this motivates a corner-based design approach. VDD scaling and body biasing have been proposed to counteract process variation in sub-threshold or near-threshold operation [71][72][73][74]. Such approaches have been applied in the form of post-silicon circuit tuning. Finally, Vth assignment during the design synthesis phase is a known problem (see section 5.1). Vth assignment can be utilized to achieve low energy operation, but the impact of process parameter variations on circuit performance remains an issue. It follows that the state of the art uses either post-silicon tuning for performance compensation, or multi-Vth assignment for low energy operation. Still, these two approaches are mutually exclusive. In the following sections the body bias driven design approach will be utilized for ultra-low-power digital circuits. Thus, it copes with the shortcomings of speed and energy mentioned above.

6.2 Body-Biased Ultra-Low-Power Design

Let us consider now the analytical circuit models for energy, delay and area of ULP digital circuits as well as the corresponding optimization formulations. The circuit models for propagation delay, power consumption and area occupation have been provided in Chapter 2 with a focus on super-threshold operation. In this section, these models will be complemented by focusing on sub- and near-threshold operation. A model for energy consumption will be introduced, which is a typical metric for ULP digital circuits due to its relationship to battery time. Finally, process variability implications on circuit performance and energy consumption are covered in this section as well.

6.2.1 Circuit Models for Energy and Delay

The total energy consumption of a digital gate can be modelled as the sum of the active (or dynamic) energy, and the leakage energy consumed during the clock period. By taking expression (2.14) and dividing it by clock frequency, fck, one obtains:

\[
E_{gate} = E_{dyn} + E_{leak} = \left( x C_{intr} + C_{extr} \right) V_{DD}^{2} + x I_{leak} T_{ck} V_{DD} \tag{6.1}
\]

where x is the gate sizing factor (x≥1), Cintr and Cextr are the switching intrinsic and extrinsic capacitance of a gate, respectively, and Tck is the operating clock period. The leakage current of the gate depends on gate size, VDD and Vth: $I_{leak} \propto e^{-V_{th}/(m U_T)}$, where m is the sub-threshold slope factor and UT is the thermal voltage. Based on expression (2.3), the delay of a digital logic gate can be modelled as:

\[
d_{gate} = \frac{\left( x C_{intr} + C_{extr} \right) V_{DD}}{x I_{drive}} \tag{6.2}
\]

where Idrive is the current drive of a gate. In sub-threshold, the drive current is exponentially dependent on VDD and Vth: $I_{drive} \propto e^{(V_{DD}-V_{th})/(m U_T)}$, as can be deduced from expression (2.6) for VGS=VDD and VDS=0. In super-threshold, the drive current can be expressed by $I_{drive} \propto (V_{DD}-V_{th})^{\alpha}$, as shown before in expression (2.3). In sub-/near-threshold, the circuit speed is reduced substantially due to a low gate drive voltage, VDD-Vth. Now, reducing Vth with FBB has a large impact on gate delay because Vth is a larger portion of the gate drive voltage. This makes FBB attractive for use in ultra-low-power circuits. In fact, FBB is more energy-efficient than VDD upscaling to increase the gate drive voltage when dynamic energy dominates the overall energy. The contrary holds true for leakage dominant situations.

In Chapter 2 non-linear derating functions for a reference circuit or logic gate have been introduced to account for the impact of body biasing on intrinsic capacitance, Cintr (expression (2.15)), leakage current, Ileak (expression (2.13)) and a linear derating function for path delay, Dj (expression (2.5)). Such functions can be easily calibrated for a given process technology by means of circuit simulations. Except for the path delay, the derating functions also hold for the sub-/near-threshold operation regime. They are repeated here for convenience.

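\[
C_{intr,norm} = \frac{1}{\left( 1 - m_1 V_{BB} \right)^{m_2}} \tag{6.3}
\]

\[
leakage_{norm} = e^{l_1 V_{BB}} + l_2 \left( e^{l_3 V_{BB}} - 1 \right) \quad \forall\, V_{BB} \ge 0 \tag{6.4}
\]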

Recall that l1, l2, l3, m1 and m2 are fitting parameters, and VBB represents the symmetrical FBB value: VBB=VDD-Vnwell=Vpwell. The left term in (6.4) models the sub-threshold leakage increase due to FBB, while the right term models the current due to forward-biased junctions. In sub-threshold operation, the drive current is exponentially dependent on VDD and Vth. In contrast to super-threshold operation, a linear derating function no longer gives satisfactory results when accounting for the impact of body biasing on delay. Therefore, the following exponential derating function is used.

\[
delay_{norm} = e^{k V_{BB}} \tag{6.5}
\]

where k is a fitting parameter. Simulation results for a CMOS 90nm ring-oscillator confirmed the accuracy of expressions (6.3)-(6.5). A maximum error of 3%, 15% and 6% was observed for (6.3), (6.4), and (6.5), respectively, when applying up to 0.5V FBB for a VDD range from 0.5V to 0.7V in the performed experiments. Based on the models presented above, the total energy and delay of a CMOS digital circuit design under FBB can be modelled as:

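\[
E_{total} = V_{DD} \sum_{i=1}^{N} \left[ a \left( \frac{x_i C_{intr,i}}{\left( 1 - m_1 V_{BB} \right)^{m_2}} + C_{extr,i} \right) V_{DD} + x_i T_{ck} I_{leak,i} \left( e^{l_{1,i} V_{BB}} + l_{2,i} \left( e^{l_{3,i} V_{BB}} - 1 \right) \right) \right] \quad \forall\, V_{BB} \ge 0 \tag{6.6}
\]

\[
D_j = V_{DD} \sum_{i \in j} \left( x_i C_{intr,i} + C_{extr,i} \right) x_i^{-1} I_{drive,i}^{-1} e^{k_i V_{BB}} \le T_{ck} \quad \forall\, j \in \Psi \tag{6.7}
\]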

where i is an index that runs over all gates in the circuit, N is the total number of gates in the circuit, a is the average circuit activity factor, j is an index that runs over all circuit paths, Dj is the delay of path j, and Ψ is the collection of all paths in the circuit. Expression (6.7) constrains the delay of each circuit path to be less than the targeted clock period, Tck. The dependence of Cintr,i on FBB is accounted for through fitting parameter ki. At a given VDD, the lowest energy design is obtained when no gates are up-sized, i.e. $x_i = 1$ for all gates i. However, this also leads to the slowest design, as can be inferred from (6.7).

The area of a CMOS circuit design is approximately equal to the sum of all gate areas, as expressed by (2.17). The actual circuit area depends on the targeted operating frequency. The smallest design is obtained when no gates are up-sized. Both the dynamic and the leakage energy consumed by a circuit are proportional to the circuit area. Notice that $C_{intr,i} \propto A_i$ and $I_{leak,i} \propto A_i$. By combining (2.17) and (6.6), one obtains the following expression:

\[
E_{total} = \left( \lambda_C V_{DD}^{2} + \lambda_L V_{DD} T_{ck} \right) A_{total} + C_{load,total} V_{DD}^{2} \tag{6.8}
\]

where λC is the switching capacitance-area proportionality factor, λL is the leakage-area proportionality factor, and Cload,total is the summed switching load capacitance of each gate. Observe from (6.8) the linear dependence between energy and circuit area. For a given supply voltage and body bias assignment, the dependence between circuit area and clock period can be expressed by a rational function, as proposed in section 2.2. The approach for reconstructing the area-clock period trade-off curve has been presented in section 5.2, as well as the definition of the PPA metric for binding silicon area to circuit speed.

6.2.2 ULP Circuit Optimization with Body Biasing

Two optimization problems for ultra-low-power digital circuits are formulated for this analysis. Both circuit optimizations will be used in this chapter. The first optimization problem has been formulated to obtain a minimum energy design. Let VBB=Vpwell=VDD-Vnwell represent the amount of FBB applied to the digital core. Then, the minimum energy design optimization is as follows:

\[
\begin{aligned}
\text{minimize} \quad & E_{total} \\
\text{subject to} \quad & D_j \le T_{ck} \quad \forall\, j \in \Psi \\
& x_i = 1 \quad \forall\, i \in \{1, 2, \ldots, q\} \\
& V_{DD} \in [0, 1.2]\,\text{V} \\
& V_{BB} \in [0, 0.5]\,\text{V}
\end{aligned}
\]

The result of this optimization is the smallest and minimum energy design that meets a targeted performance under scaled supply voltage and body bias conditions. The second optimization problem has been formulated to obtain the maximum PPA design. The maximum PPA optimization is as follows:

\[
\begin{aligned}
\text{maximize} \quad & PPA \\
\text{subject to} \quad & E_{total} \le E_{max} \\
& 1 \le x_i \quad \forall\, i \in \{1, 2, \ldots, q\} \\
& V_{DD} \in [0, 1.2]\,\text{V} \\
& V_{BB} \in [0, 0.5]\,\text{V}
\end{aligned}
\]

The result of this optimization is the highest performing design without area over-dimensioning that meets a maximum energy requirement under scaled supply voltage and body bias conditions.

[Figure 6.2: energy versus clock frequency on logarithmic axes for the SVT and HVT options, with VDD annotated along the curves from 1.2V down to 0.2V and the leakage-energy-dominated and dynamic-energy-dominated regions marked. T=25°C, α=0.1%.]

Figure 6.2 Energy versus clock frequency for a generic 90nm digital circuit. The curves relate to SVT and HVT options, respectively.
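The minimum energy formulation of section 6.2.2 can be made concrete with a brute-force sketch built on expressions (6.6) and (6.7). It is not the synthesis flow used in this work: it assumes identical, unsized gates (xi=1), a single critical path, a sub-threshold drive current with an arbitrary prefactor, and a VDD grid restricted to 0.20V-0.60V where that drive model applies. Every numerical parameter below, including the negative sign chosen for the delay exponent, is a hypothetical illustration value.

```python
import math

# All numbers below are hypothetical illustration values, not data from this work.
U_T = 0.0257                    # thermal voltage near 25 C [V]
M_SLOPE = 1.5                   # sub-threshold slope factor m
V_TH = 0.6                      # threshold voltage [V]
N_GATES = 1000                  # number of gates N, all with x_i = 1
DEPTH = 20                      # gates on the critical path
ACTIVITY = 0.1                  # average activity factor a
C_INTR = C_EXTR = 1e-15         # intrinsic / extrinsic capacitance per gate [F]
I_ON_AT_VTH = 1e-6              # drive current at VDD = Vth [A]
M1, M2 = 1.0, 0.5               # capacitance derating parameters of (6.3)
L1, L2, L3 = 8.8, 1.1e-4, 30.0  # leakage derating parameters of (6.4)
K_DELAY = -5.7                  # delay exponent of (6.5); negative: FBB speeds up
T_CK = 1e-6                     # target clock period [s]


def drive_current(v_dd):
    """Sub-threshold drive current, Idrive ~ exp((VDD - Vth) / (m * U_T))."""
    return I_ON_AT_VTH * math.exp((v_dd - V_TH) / (M_SLOPE * U_T))


def path_delay(v_dd, v_bb):
    """Critical-path delay following (6.7) for identical, unsized gates."""
    per_gate = (C_INTR + C_EXTR) * v_dd / drive_current(v_dd)
    return DEPTH * per_gate * math.exp(K_DELAY * v_bb)


def total_energy(v_dd, v_bb):
    """Energy per cycle following (6.6) for identical, unsized gates."""
    c_sw = C_INTR / (1.0 - M1 * v_bb) ** M2 + C_EXTR
    i_leak = I_ON_AT_VTH * math.exp(-V_TH / (M_SLOPE * U_T))
    leak_derate = math.exp(L1 * v_bb) + L2 * (math.exp(L3 * v_bb) - 1.0)
    per_gate = ACTIVITY * c_sw * v_dd + T_CK * i_leak * leak_derate
    return v_dd * N_GATES * per_gate


def minimum_energy_design(v_bb_grid):
    """Exhaustive search over a VDD grid (0.20 V to 0.60 V) and the given VBB grid."""
    best = None
    for v_bb in v_bb_grid:
        for step in range(20, 61):
            v_dd = step / 100
            if path_delay(v_dd, v_bb) <= T_CK:
                energy = total_energy(v_dd, v_bb)
                if best is None or energy < best[0]:
                    best = (energy, v_dd, v_bb)
    return best  # (energy [J], VDD [V], VBB [V]) or None if infeasible


no_fbb = minimum_energy_design([0.0])
with_fbb = minimum_energy_design([j / 100 for j in range(0, 51, 5)])
print("no FBB  :", no_fbb)
print("with FBB:", with_fbb)
```

With these assumed parameters the search settles on an intermediate FBB value and a lower VDD than the design without FBB: FBB buys back the speed lost by lowering VDD, while the junction term of (6.4) penalizes the largest FBB settings.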


6.2.3 Process Variability Implications for ULP Digital Circuits

This section presents the implications of process variability on digital circuit performance and energy consumption. However, let us first discuss the impact of the Vth choice on circuit behaviour. For this purpose, energy and performance trade-offs for a generic digital circuit design in 90nm LP-CMOS are presented first. The design has been optimized for minimum energy while meeting a given clock frequency constraint under variable VDD and nominal body bias.

Figure 6.2 illustrates the energy and performance trade-off for a generic digital circuit design in 90nm LP-CMOS for two Vth options. All results relate to a slow-process corner; the Vth of an HVT device is about 0.6V, while the difference between SVT and HVT is about 100mV. As expected, there exists a MEP for both design cases. The location of the MEP depends on circuit activity, process and temperature conditions. Observe that both Vth’s are suboptimal for a narrow range of performance. Under identical conditions, the MEP of the SVT design is located at a (~10x) higher operating frequency than in the case of the HVT design. This implies that SVT is the preferred choice for performance-constrained ultra-low-power designs, but at a larger leakage overhead. However, the choice of Vth depends on circuit characteristics (e.g. maximum operating frequency, circuit activity) and operating conditions (e.g. temperature). Without loss of generality, let us focus on ultra-low-power digital designs implemented in a 90nm HVT LP-CMOS technology in the remainder of this chapter.
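The order of magnitude of this frequency shift can be checked with the sub-threshold drive-current proportionality of section 6.2.1. As a rough sketch, assume that both designs reach their MEP at a similar VDD in sub-threshold, that the clock frequency scales with Idrive, and take the illustrative values m=1.5 and UT≈25.7mV (these are assumptions, not values from this work). With a Vth difference of 100mV this gives:

\[
\frac{f_{SVT}}{f_{HVT}} \approx \frac{e^{(V_{DD}-V_{th,SVT})/(m U_T)}}{e^{(V_{DD}-V_{th,HVT})/(m U_T)}} = e^{\Delta V_{th}/(m U_T)} = e^{0.1/(1.5 \cdot 0.0257)} \approx 13
\]

which is of the same order as the roughly 10x shift observed in Figure 6.2.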

[Figure 6.3: normalized energy versus normalized clock frequency on logarithmic axes for slow (µ-3σ), nominal (µ) and fast (µ+3σ) process corners, with VDD annotated along the curves from 1.2V down to 0.2V and the chip frequency spread at constant VDD indicated. T=25°C, α=0.1%.]

Figure 6.3 Trading-off energy and performance under process variability. VDD scaling trends for a generic minimum-energy 90nm HVT digital circuit.

The energy consumption and the maximum performance of a circuit are sensitive to variations in process parameters. Figure 6.3 illustrates this sensitivity for a generic digital circuit in 90nm HVT LP-CMOS. Each curve corresponds to a distinct process corner, namely slow, nominal, and fast process. The same optimization has been used as before: the design has been optimized for minimum energy while meeting a given clock frequency constraint under variable VDD and nominal body bias. This means that each point on a curve corresponds to the lowest energy design that meets a given clock frequency constraint under variable VDD. The energy and performance values have been normalized to the slow-process corner design at a nominal VDD of 1.2V.

Observe the increasing variation in clock frequency at reduced VDD’s. While a frequency spread of 20% has been observed at VDD=1.2V between slow and nominal process corners, this spread increases to about 4x at VDD=0.5V. Eventually, the spread saturates at very low VDD’s. A MEP can be reached irrespective of process condition; however, it is located at different supply voltages and clock frequencies. The MEP is located at a VDDopt of 0.36V, 0.32V and 0.31V for slow, nominal and fast process corners, respectively. The clock frequency at which the MEP occurs can differ by about one order of magnitude depending on process condition.

At a given energy target, the clock frequency of the design is limited by the slow process corner. In this case, all transistors in the logic gates of the critical path are in the slow-process corner. Since practical systems are implemented such that the dynamic energy is much larger than the leakage energy, the slow process corner also gives the largest energy design at a given clock frequency due to a higher required VDD. This motivates design optimization under slow process conditions for improving both performance and energy-efficiency of the design.

[Figure 6.4: normalized area (1 to 2) versus performance on a logarithmic axis for slow, nominal and fast process corners. VDD=0.5V, T=25°C, α=0.1%.]

Figure 6.4 Area and performance trade-offs under process variability. Trends are for a generic 90nm HVT digital circuit. An area of ‘1’ corresponds to a minimum-energy design.

Figure 6.4 shows the conventional behaviour of today’s synthesis tools under a fixed VDD and Vth assignment. Essentially, conventional synthesizers upsize the circuit area to meet speed constraints. The area versus performance curves under different process conditions are plotted in Figure 6.4 when operating at VDD=0.5V. The curves are constructed from a multitude of points, each corresponding to a design with unique speed requirements. As indicated before, the worst circuit performance relates to the slow process condition. Observe that circuit sizing is a weak parameter to increase performance for sub-threshold circuits. Instead, VDD scaling is more effective for performance increase. Therefore, the amount of circuit sizing for achieving higher performance or operational robustness against process parameter variations should be limited to avoid spending excessive area and energy.

6.2.4 Utilization of Body Bias Driven Design

Traditionally, digital CMOS circuits are implemented to meet timing specifications for slow process conditions. Ultra-low-power circuits are dimensioned such that they can operate at the minimum possible VDD at which timing can just be met. In this way, the total energy of the circuit is minimized. Under BBD design, FBB can be utilized to enable smaller and less energy consuming circuit solutions without sacrificing performance specifications and parametric yield. FBB needs to be applied to slow die samples only, to reduce the process-dependent performance spread. Notice that a performance increase by FBB can also be traded off against a performance decrease due to smaller area or lower energy circuit solutions. The amount of FBB required depends on the circuit performance increase needed to achieve the target clock frequency.

[Figure 6.5: energy versus performance on logarithmic axes for the conventional design, BBD + 0.25V FBB and BBD + 0.4V FBB, with VDD annotated along the curves from 1.2V down to 0.2V. T=25°C, α=0.1%.]

Figure 6.5 Energy versus performance in CMOS 90nm with body biasing. Trends relate to a generic 90nm HVT digital circuit under conventional and body bias driven design.

Figure 6.5 shows the energy and performance trade-off for conventional and body bias driven minimum energy designs in 90nm HVT LP-CMOS for a VDD scaling scenario. The indicated trend lines relate to slow-process corner conditions. The same normalization has been used as before. By way of example, two trade-off curves have been illustrated for BBD design, each with a different FBB value. Observe that BBD designs achieve lower energy per cycle than the conventional design over a wide frequency range. This is because FBB enables operation at a lower VDD than the designs without body bias. Since BBD design makes use of FBB, one could expect a significant leakage energy penalty for BBD design. However, the impact of BBD design on leakage energy is rather limited, primarily because of the increased operating frequency with FBB. Moreover, BBD design operates at similar or better circuit timing than the conventional design, resulting in better EDP figures.


Finally, BBD design can effectively speed up slow-process samples to achieve nominal performance, as can be deduced by comparing Figure 6.3 and Figure 6.5. The amount of FBB required to achieve such performance spread reduction is VDD dependent. Circuit simulations on ring-oscillator circuits in 90nm HVT LP-CMOS were carried out to quantify the effectiveness of FBB in sub-threshold/near-threshold. The circuit descriptions contain all layout parasitics including the bulk diffusion diodes. Slow process conditions and 25°C operating temperature were applied. At these conditions, the NMOS and PMOS Vth are at 0.6V and 0.55V, respectively. The performed simulations used a VDD of 0.5V and 0.7V as a reference for sub-threshold and near-threshold designs, respectively. Using FBB, the Vth is lowered, which moves the transistor’s biasing point towards the strong-inversion regime. The applied FBB voltage is limited to 0.5V to avoid turning on the junction diodes. The FBB is applied symmetrically to PMOS and NMOS transistors, i.e. Vpwell=VDD-Vnwell.

Table 6.1 Simulated ring-oscillator performance and leakage increase versus FBB for 90nm HVT LP-CMOS under slow-process corner and 25°C operation.

FBB [V] | Relative circuit performance increase, VDD=0.5V | Relative circuit performance increase, VDD=0.7V | Relative leakage current increase, VDD=0.5V | Relative leakage current increase, VDD=0.7V
0 | 1 | 1 | 1 | 1
0.1 | 1.8 | 1.3 | 2.4 | 2.2
0.2 | 3.3 | 1.7 | 5.9 | 5.2
0.25 | 4.4 | 1.9 | 9.7 | 8.3
0.3 | 5.8 | 2.1 | 16 | 14
0.35 | 7.7 | 2.4 | 27 | 23
0.4 | 9.8 | 2.7 | 51 | 42
0.5 | 16 | 3.3 | 436 | 237

From Table 6.1, notice the large impact of FBB to increase circuit performance at the applied VDD levels. With 0.5V FBB, the performance increases by ~16x for VDD=0.5V, and ~3x for VDD=0.7V. Contrarily, the respective leakage increase is ~436x and ~237x. At such large FBB, the leakage increase is because the transistor junctions become forward-biased. Obviously a judicious choice of FBB is needed. One advantage of using FBB is that it can be turned off in standby mode operation or low-performance use-cases to avoid the large leakage penalty.
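The exponential trend behind expression (6.5) can be checked directly against the performance columns of Table 6.1. The sketch below fits ln(gain) versus VBB with a least-squares line forced through the origin, since the gain is 1 at VBB=0. It assumes that the relative performance increase is the reciprocal of the normalized delay, so the fitted exponent would correspond to -k in (6.5); the helper names and the fitting choice are illustrative and are not the calibration procedure used in this work.

```python
import math

# Table 6.1: relative performance increase versus FBB (rows with FBB > 0).
FBB = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5]  # [V]
PERF_VDD_0V5 = [1.8, 3.3, 4.4, 5.8, 7.7, 9.8, 16.0]
PERF_VDD_0V7 = [1.3, 1.7, 1.9, 2.1, 2.4, 2.7, 3.3]


def fit_exponent(v_bb, gain):
    """Least-squares slope of ln(gain) versus VBB for a line through the origin."""
    num = sum(v * math.log(g) for v, g in zip(v_bb, gain))
    den = sum(v * v for v in v_bb)
    return num / den


def required_fbb(speedup, exponent):
    """FBB that yields a given speed-up under gain = exp(exponent * VBB)."""
    return math.log(speedup) / exponent


k_0v5 = fit_exponent(FBB, PERF_VDD_0V5)
k_0v7 = fit_exponent(FBB, PERF_VDD_0V7)
print(f"fitted exponent at VDD=0.5V: {k_0v5:.2f} 1/V")
print(f"fitted exponent at VDD=0.7V: {k_0v7:.2f} 1/V")
print(f"FBB for a 4x speed-up at VDD=0.5V: {required_fbb(4.0, k_0v5):.2f} V")
```

The last line estimates the FBB needed to close the roughly 4x slow-to-nominal frequency spread reported at VDD=0.5V in section 6.2.3; under this fit it lands close to the 0.25V row of Table 6.1, where a 4.4x increase is tabulated.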

6.3 Optimum Design Space for ULP Digital Circuits

Let us now consider the design space of energy, area, and timing for the WCD and BBD design styles by using the models presented in section 6.2.1. The analysis has been performed on a generic digital circuit design in 90nm HVT LP-CMOS. In particular, and without loss of generality, let us focus on a ULP design at VDD=0.5V. All results relate to the slow-process corner and a 25°C operating temperature. This implies sub-threshold operation, since Vth=0.6V for the applied conditions. Finally, let us discuss technology scaling implications by analyzing the same circuit in 65nm and 45nm LP-CMOS.

Figure 6.6 shows the results from today’s synthesis tools under a fixed VDD and Vth assignment. The area vs. clock-period curves of the WCD (solid line) and the BBD design (dashed line) are plotted. The curves are constructed from a multitude of points, each corresponding to a design with unique speed requirements. Iso-energy curves are plotted as overlay. By way of example, all values in Figure 6.6 have been normalized to the WCD at the maximum PPA point (MPPAP: Tck=Tbest), as highlighted by the circle symbol. The MPPAP indicates the upper area bound for a design that is not over-dimensioned in area to achieve a higher circuit speed. Consider now a constant circuit area; one can see that BBD design then achieves a higher performance than the WCD (arrow A in Figure 6.6). The curve for a BBD design with a higher FBB value will be located at smaller clock period values. Notice also that the circuit’s energy consumption is nearly constant over a large clock period range; this implies that dynamic energy is dominant. However, energy also increases at large FBB values, because both junction capacitance and leakage increase with FBB. Alternatively, BBD design enables smaller circuit designs for a fixed clock period (arrow B). In this case BBD design can lower the circuit’s energy significantly; the amount of energy savings is design dependent. In general, one can say that for a given energy budget, BBD design offers better performance and area figures.

[Figure 6.6: plot of relative area versus relative clock period for the conventional design and for BBD + 0.1V FBB, with iso-energy contours as overlay; arrows A and B and the MPPAP are marked. Conditions: VDD=0.5V, T=25°C, α=0.1%.]

Figure 6.6 Area, clock period, and energy trends in CMOS 90nm at VDD=0.5V. Trends relate to a generic 90nm HVT digital circuit. Solid line: WCD, dashed line: BBD design + 0.1V FBB, overlay: total energy. MPPAP refers to the maximum PPA point.

Figure 6.7 presents the total energy versus the amount of FBB. The solid-line curve indicates a multitude of BBD designs optimized for distinct FBB values and with the smallest possible design area and energy consumption. The dashed line relates to the BBD designs with a maximum PPA. The normalization is at the maximum PPA point (MPPAP, circle symbol). Notice that energy can be reduced by more than 25% by walking along the iso-clock “1” line when using ~40mV FBB for BBD design (star symbol). This operating condition also yields the minimum-energy design. The minimum energy BBD design with 0.4V FBB achieves about 7.7x higher operating speed for the same energy consumption as the WCD at maximum PPA (triangle symbol). Alternatively, the maximum PPA BBD design with 0.4V FBB achieves about 9.2x higher operating speed at an energy penalty of about 28% with respect to the WCD at maximum PPA. One can also observe that the increase in energy is rather limited up to 0.4V FBB at the considered operating temperature of 25°C. For larger FBB values, the leakage current due to forward-biased junctions increases significantly, and consequently so does the energy.

[Figure 6.7: plot of relative energy versus FBB (0 to 0.5V), with iso-clock-period contours as overlay and the MPPAP marked. Conditions: VDD=0.5V, T=25°C, α=0.1%.]

Figure 6.7 Energy versus FBB in CMOS 90nm at VDD=0.5V. Trends relate to a generic 90nm HVT digital circuit. Solid line: minimum energy BBD, dashed line: maximum PPA BBD, overlay: relative clock period.

[Figure 6.8: plot of relative EDP versus relative area, with iso-clock-period contours as overlay and the MPPAP marked. Conditions: VDD=0.5V, T=25°C, α=0.1%.]

Figure 6.8 EDP, area and clock period trends in CMOS 90nm at VDD=0.5V. Trends relate to a generic 90nm HVT digital circuit. Solid line: WCD, dashed line: BBD + 0.4V FBB, overlay: relative clock period.


Figure 6.8 plots the EDP versus circuit area for the sub-threshold design. The solid line relates to the WCD, while the dashed line relates to various BBD design cases. The circle symbol highlights the reference design without body biasing, which is designed at its maximum PPA point. The star symbol indicates the minimum area design with a small FBB (~40mV) that has the same operating speed as the reference design. The iso-clock period lines have been plotted as overlay. In general, the smallest designs achieve the best (i.e. minimum) EDP. BBD design consistently achieves a better EDP than the conventional design, irrespective of the FBB applied. Without loss of generality, consider now the particular example of BBD design with a 0.4V FBB. Observe that the BBD design can run at 0.13x the clock period of the WCD, and that the corresponding EDP is 7.7x lower (as indicated by the triangle symbol in Figure 6.8). In other words, the minimum area BBD design can improve EDP by almost one order of magnitude with respect to the WCD. Also notice that the sensitivity of EDP to circuit sizing reduces under BBD design. This implies that, under BBD design, circuit sizing becomes more effective for trading off circuit speed against energy consumption without seriously affecting energy-efficient operation.
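The relation between clock period and EDP quoted above can be checked with a short calculation. The sketch below assumes the usual definition EDP = energy per cycle × clock period and works with values normalized to the reference WCD (energy 1, clock period 1); the function names are illustrative only.

```python
def edp(energy, clock_period):
    """Energy-delay product for a given energy per cycle and clock period."""
    return energy * clock_period


def edp_improvement(rel_energy, rel_clock_period):
    """Factor by which EDP improves over a reference with energy=1, period=1."""
    return edp(1.0, 1.0) / edp(rel_energy, rel_clock_period)


if __name__ == "__main__":
    # Equal energy at 0.13x the clock period (triangle symbol).
    print(f"{edp_improvement(1.0, 0.13):.1f}x")
    # Derived, not quoted in the text: 9.2x speed at a 28% energy penalty.
    print(f"{edp_improvement(1.28, 1 / 9.2):.1f}x")
```

At equal energy, a clock period of 0.13x gives 1/0.13 ≈ 7.7x lower EDP, which matches the value read from the figure.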

[Figure 6.9: plot of relative PPA (logarithmic axis) versus relative energy for the conventional design and for BBD with 0.1V, 0.2V, 0.3V and 0.4V FBB, with iso-clock-period contours as overlay and the MPPAP marked. Conditions: VDD=0.5V, T=25°C, α=0.1%.]

Figure 6.9 PPA versus total energy in CMOS 90nm at VDD=0.5V. Trends relate to a generic 90nm HVT digital circuit. Solid line: WCD, dashed line: BBD.

Figure 6.9 highlights the PPA and total energy trends. The iso-clock period lines have been plotted as overlay. As before, the red circle indicates the MPPAP of the WCD, the star symbol points to the minimum area BBD design with the same performance, and the triangle symbol relates to the minimum area BBD design with the same energy consumption as the maximum PPA conventional design. One can observe that BBD design exhibits a noticeably higher PPA than the WCD under all circumstances. This higher PPA comes primarily from the increased circuit performance and reduced circuit area. FBB is simply a more energy-efficient approach for speeding up the circuit than conventional gate up-sizing. Observe also that for each BBD design, there exists an MPPAP at which the design utilizes circuit area most effectively, as indicated by the circle symbols. In other words, it is the area-speed point for which performance is the highest before the circuit is over-dimensioned. Over-dimensioning comes at the price of a significant energy penalty in return for an insignificant performance increase. Consequently, circuit sizing should be constrained by the MPPAP of the design. When a higher performance is needed, the threshold voltage of the transistors should be lowered, or alternatively the VDD of the circuit can be increased.

Table 6.2 illustrates the impact of technology scaling on maximum PPA designs for the same generic digital circuit design in different process technology nodes. All values have been normalized to the maximum PPA design under WCD in 90nm LP-CMOS. Observe that BBD design consistently outperforms WCD. A maximum PPA design is smaller in a next-generation technology; consequently, the PPA increases with technology scaling. The total energy consumption reduces in a next-generation technology due to reduced circuit capacitances. The leakage energy is more pronounced in 45nm LP-CMOS, which can be observed from the energy increase of the BBD design case. In a next-generation technology, the speed increase with BBD design is somewhat lower, mainly because of a lower device threshold voltage. In general, one can state that BBD design remains effective with scaling in the analyzed process technology nodes.

Table 6.2 Technology scaling of maximum PPA design at VDD=0.5V and T=25°C. BBD utilizes an FBB of 0.25V.

                                                90nm LP-CMOS    65nm LP-CMOS    45nm LP-CMOS
WCD    Relative PPA                             1               1.93            3.97
       Relative clock period (label inferred)   1               0.96            0.97
       Relative Total Energy                    1               0.74            0.48
BBD    Relative PPA                             4.00            6.14            11.52
       Relative clock period (label inferred)   0.25            0.30            0.33
       Relative Total Energy                    1.05            0.76            0.58

6.4 Body Bias Selection and Generation

Body biasing requires deep N-well isolation, power delivery networks and voltage generation. Let us address body bias selection and generation with a particular focus on application in ULP digital circuits. The physical design aspects of BBD design will be presented in Chapter 7. It is possible to use a fully programmable body bias generator to provide the required FBB values for post-silicon tuning. Alternatively, and without loss of generality, let us consider the implementation of a fixed FBB voltage. A single FBB voltage of VDD/2 has been selected for joint N-well and P-well biasing of the body biased circuit when operating at reduced supply voltages (VDD≤1V). This FBB voltage is found to be sufficient for slow-to-nominal process corner performance compensation at VDD=0.5V or 0.7V. More FBB can further increase performance, but at a leakage penalty, as shown in Table 6.1. By using a single FBB voltage, one can limit both the area and energy overhead associated with the use and generation of multiple body bias voltages. Figure 6.10 shows a conceptual circuit diagram of the proposed body bias power delivery. The FBB generator consists of a VDD/2 voltage generator and a switch box to support dynamic FBB. The voltage generator can be implemented by, for example, a resistor divider, a low-power low-dropout regulator or a switched-capacitor converter unit. The FBB generator only needs to supply the difference between the well currents, because the N-well current sources the P-well current. Therefore, the generator only needs a limited output current drive capability.

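[Figure 6.10: circuit diagram showing a 0.5·VDD body bias generator connected through switches controlled by the enable signal (ena) to the Vnwell and Vpwell nodes, with VDD as supply.]

Figure 6.10 Conceptual circuit diagram of the body bias power delivery.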
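The choice of VDD/2 can be illustrated with the symmetric biasing relation VBB=Vpwell=VDD-Vnwell used in this chapter. The minimal sketch below computes both well voltages; the function name and the `enable_fbb` flag (standing in for the ena signal of Figure 6.10) are illustrative assumptions, as is the convention that a disabled generator returns the wells to nominal body bias (P-well at ground, N-well at VDD).

```python
import math


def well_voltages(vdd, enable_fbb=True):
    """Return (vpwell, vnwell) for symmetric FBB of VDD/2, or NBB if disabled.

    Symmetric FBB raises the P-well by VBB above ground and lowers the
    N-well by VBB below VDD, i.e. VBB = Vpwell = VDD - Vnwell.
    """
    vbb = vdd / 2 if enable_fbb else 0.0
    vpwell = vbb
    vnwell = vdd - vbb
    return vpwell, vnwell


if __name__ == "__main__":
    for vdd in (0.5, 0.7):
        vp, vn = well_voltages(vdd)
        # With VBB = VDD/2 both wells sit at the same potential.
        print(f"VDD={vdd}V: Vpwell={vp:.2f}V, Vnwell={vn:.2f}V")
        assert math.isclose(vp, vn)
```

With VBB=VDD/2, both wells end up at the same potential, so one generated voltage can bias the N-well and the P-well jointly.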

6.5 Synthesized Design Example

Commercial synthesis tools target area optimization subject to delay constraints. Synthesis results for an industrial microprocessor design in 90nm HVT LP-CMOS are shown in this section. Design synthesis has been performed for slow-process and 25°C temperature conditions. A BBD design was utilized with an FBB voltage of VDD/2. Digital cell timing and power have been characterized using Altos’ Liberate library characterizer. Logic synthesis has been done with Cadence RTL Compiler. Synthesis has been directed towards the smallest design with minimum leakage and dynamic power that meets the timing constraints. The circuit area accounts for layout effects such as row utilization, FBB power delivery networks and Deep N-well isolation. The microprocessor contains 3764 flip-flops and about 31K combinational gates. It makes use of HVT devices only. For both the WCD and BBD design styles, the design synthesis has been done at VDD=0.5V and VDD=0.7V. These relate to a sub-threshold and a near-threshold design, respectively.

Figure 6.11 presents the area and clock period trends for the sub-threshold design. The results obtained from synthesis have been indicated by circles and triangles for the WCD and BBD designs, respectively. The trend lines are calculated from expression (5.1) by using the fitting parameters shown in Table 6.3. The minimum energy and maximum PPA designs have been indicated in Figure 6.11 by diamond and circle symbols, respectively. The key results have been highlighted in Table 6.4.

Table 6.3 Parameter values of expression (5.1) for the industrial microprocessor design in 90nm HVT LP-CMOS at VDD=0.5V.

        χ [µm²⋅µs]     δ [µs]     η [µm²]
WCD     3.94⋅10^4      -1.04      2.74⋅10^5
BBD     0.98⋅10^4      -0.26      2.72⋅10^5


[Figure 6.11: plot of circuit area (2.6⋅10^5 to 4.2⋅10^5) versus clock period [ns] (0 to 4500) for the conventional design and for BBD + 0.25V FBB, with the MPPAP of each design marked. Conditions: VDD=0.5V, T=25°C.]

Figure 6.11 Area versus clock period for the 90nm microprocessor at VDD=0.5V. Lines: expression (4.1), symbols: HVT synthesis results.

Notice in Figure 6.11 that the maximum PPA conventional design has a significantly higher performance than its minimum energy counterpart. The minimum energy design is located at Tck=4µs (fck=250kHz), while the maximum PPA design is located at Tck=1.33µs (fck=751kHz). Observe in Figure 6.11 the large performance increase that can be achieved with BBD design. The maximum PPA BBD design can be found at Tck=~350ns (fck=~2.9MHz). Comparing the maximum PPA designs, BBD design achieves up to 3.8x higher performance w.r.t. the WCD. In addition, it achieves a 9% smaller circuit area. This gives a PPA improvement of 4.2x for the BBD design over the WCD. Moreover, the overall energy consumption of the BBD design is about 10% lower, and the EDP is improved by 4.2x. At the same clock frequency as the maximum PPA WCD, BBD design reduces circuit area by 29%, which results in a PPA improvement of 39%, an overall energy reduction of 36%, and an EDP improvement of 36%. Comparing the minimum energy designs, BBD design achieves up to 3x higher performance w.r.t. the WCD at the same energy consumption. Moreover, the BBD design achieves a 3% smaller circuit area, and PPA and EDP improvements of 3.1x and 3.3x, respectively.

Figure 6.12 illustrates the PPA versus energy trends for this sub-threshold design. Notice that the same trends as shown before in Figure 6.9 are observed from this experimental synthesis. BBD design improves the maximum PPA value by about 4.2x w.r.t. the maximum PPA WCD. If operation at maximum PPA is not selected, one could also trade off the energy per cycle of the design against a smaller circuit area, but at a significant performance penalty. For the WCD, this would result in operation at 85pJ/cycle for 26% lower area and 3x lower frequency. Typically, in conventional design methods, the VDD is raised to compensate for this performance loss, but this comes at the price of a higher energy consumption.


Table 6.4 Microprocessor designs at VDD=0.5V in 90nm HVT LP-CMOS.
Conditions: Slow Process Corner, VDD=0.5V and T=25°C. BBD Design utilizes an FBB value of 0.25V.

                        Minimum Energy Design                      Maximum PPA Design
                        WCD          BBD          Relative         WCD          BBD          Relative
Unit                    absolute     absolute     improvement      absolute     absolute     improvement
Circuit Area [µm²]      289851       288202       3%               393112       356585       9%
Performance [kHz]       250          1333         3.0x             751          2857         3.8x
PPA [MHz/µm²]           8.6⋅10^-7    2.7⋅10^-6    3.1x             1.9⋅10^-6    8.0⋅10^-6    4.2x
Energy [pJ/cycle]       85           77           1.1x             121          110          1.1x
EDP [pJ⋅µs]             339          103          3.3x             162          38           4.2x
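The derived figures of merit in Table 6.4 follow directly from area, clock frequency and energy per cycle. As a minimal sketch, the Python fragment below recomputes PPA (frequency divided by area) and EDP (energy per cycle times clock period) for the two maximum PPA designs; the class and attribute names are illustrative, and the input numbers are the tabulated, rounded values, so small deviations from the table are to be expected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    area_um2: float    # circuit area [um^2]
    freq_khz: float    # clock frequency [kHz]
    energy_pj: float   # energy per cycle [pJ]

    @property
    def ppa_mhz_per_um2(self):
        """Performance per area: clock frequency [MHz] over area [um^2]."""
        return (self.freq_khz / 1000.0) / self.area_um2

    @property
    def edp_pj_us(self):
        """Energy-delay product: energy per cycle [pJ] times period [us]."""
        period_us = 1000.0 / self.freq_khz
        return self.energy_pj * period_us


# Maximum PPA columns of Table 6.4.
WCD_MAX_PPA = Design(area_um2=393112, freq_khz=751, energy_pj=121)
BBD_MAX_PPA = Design(area_um2=356585, freq_khz=2857, energy_pj=110)

if __name__ == "__main__":
    ppa_gain = BBD_MAX_PPA.ppa_mhz_per_um2 / WCD_MAX_PPA.ppa_mhz_per_um2
    edp_gain = WCD_MAX_PPA.edp_pj_us / BBD_MAX_PPA.edp_pj_us
    print(f"PPA improvement: {ppa_gain:.1f}x, EDP improvement: {edp_gain:.1f}x")
```

Both ratios come out at about 4.2x, in line with the relative improvements listed for the maximum PPA designs.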


The previous analysis has also been performed for a near-threshold design (VDD=0.7V). Table 6.5 summarizes the results. Notice the 20x higher performance than the sub-threshold design for maximum PPA (fck,near= 15.7MHz vs. fck,sub = 751kHz). For the near-threshold maximum PPA designs, the BBD design achieves almost 2x higher performance, up to 8% smaller circuit area, 2.1x higher PPA, and 2x improved EDP w.r.t. the WCD. At the same clock period, BBD design reduces circuit area by 23%, increases PPA by 30%, consumes 29% less energy, and improves EDP by 29%. By comparing the minimum energy designs, BBD design achieves up to 50% higher performance w.r.t. the WCD at the same energy consumption. In this case, the BBD design achieves a comparable circuit area, a PPA improvement of 53%, and a 33% improved EDP.

[Figure 6.12: plot of PPA [MHz/µm²] (0 to 9⋅10^-6) versus energy per cycle [pJ] (70 to 150) for the conventional design and for BBD + 0.25V FBB, with the MPPAP of each design marked. Conditions: VDD=0.5V, T=25°C.]

Figure 6.12 PPA versus energy for the 90nm microprocessor at VDD=0.5V. Symbols: HVT synthesis results.

6.6 Discussion

In this chapter a new body bias driven design synthesis strategy for (near-) sub-threshold digital designs has been presented. This approach yields low-area, energy-efficient designs without the major performance penalties associated with conventional sub-threshold design. Its effectiveness was demonstrated on a microprocessor designed for sub-threshold operation in 90nm HVT LP-CMOS. Up to 3.8x higher performance, a 4.2x improved energy-delay product, 10% lower energy per cycle, 9% area reduction, and a 4.2x performance-per-area improvement have been observed as a result of the proposed design strategy. A near-threshold design of the same microprocessor showed similar but somewhat lower benefits. Finally, it was shown that the proposed approach offers more energy-efficient designs than today’s practice when targeting the frequency range attractive for many consumer electronics applications.


Table 6.5 Microprocessor designs at VDD=0.7V in 90nm HVT LP-CMOS.
Conditions: Slow Process Corner, VDD=0.7V and T=25°C. BBD Design utilizes an FBB value of 0.35V.

                        Minimum Energy Design                      Maximum PPA Design
                        WCD          BBD          Relative         WCD          BBD          Relative
Unit                    absolute     absolute     improvement      absolute     absolute     improvement
Circuit Area [µm²]      286401       280328       2%               378921       347325       8%
Performance [MHz]       6.7          10.0         50%              15.7         30.3         1.9x
PPA [MHz/µm²]           23.3⋅10^-6   35.7⋅10^-6   53%              41.5⋅10^-6   87.3⋅10^-6   2.1x
Energy [pJ/cycle]       162          162          0%               254          253          0%
EDP [pJ⋅µs]             24.3         16.2         33%              16.2         8.3          2.0x


It was also shown that BBD design is effective when utilized for the design of ultra-low-power digital circuits. Such circuits are extremely sensitive to process variability due to the low supply voltage operation. With the conventional WCD approach, ultra-low-power circuits need to be significantly up-sized to become variation-tolerant. BBD design relaxes the circuit up-sizing, while offering a higher intrinsic performance. This results not only in more economically attractive circuit solutions, but also in faster ones. Consequently, BBD design can enable ultra-low-power operation for consumer portable applications with target operating frequencies in the range of 1MHz to 50MHz.


Chapter 7

Body Bias Clustering and Physical Design

The utilization of body biasing in digital integrated circuits enables power-performance tuning and leakage control. Body biasing is typically employed to reduce circuit leakage through RBB, or to increase circuit performance through FBB. Depending on application needs, different circuit sections may require a different body bias. Ideally, one would like to apply a unique body bias to individual digital gates to obtain the best trade-off between performance and leakage. However, such an approach is impractical due to physical design overhead such as well separation, body bias supply delivery networks, and body bias generation. The design and implementation aspects of body bias clustering for digital CMOS circuits are presented in this chapter. The approach is based on hierarchy-based body bias clustering, in which FBB is applied to speed-critical circuit portions only. Both front-end and back-end design aspects will be addressed. Finally, the new concepts were validated on an industrial microprocessor design in 90nm LP-CMOS, and demonstrated in a mixed-signal SoC design.


Researchers have explored the potential of body bias clustering for the purpose of active leakage reduction or performance improvement [55][56][57][73][75][77][78]. Kulkarni et al. presented one of the first works, in which body bias clustering was proposed to minimize active leakage in digital circuit blocks for a required circuit performance [55]. Their approach requires probability distributions of the optimum body bias for each digital gate in the presence of process variability. Clusters of gates are formed based upon these body-bias probability distributions and their correlations, and each cluster is tuned to minimize power while meeting the delay target. The authors showed that their approach can achieve 38%-68% leakage savings at similar circuit performance as compared to a dual-Vth design style in CMOS 90nm. They concluded that only a small number of body bias clusters (2-4) is required. However, the drawback of this approach is the required characterization of probability density functions for each individual digital gate in the circuit. These functions depend on the rise and fall times of the gate’s input signals, its output load capacitance, and the moving average of process parameters. Other statistical variation-aware body bias clustering approaches have the same drawback [77][78].

Teodorescu et al. proposed another approach that is based on fine-grained independent body bias regions [56]. Body bias regions were chosen based on the physical contours of micro-architectural modules such as caches, registers, or execution units. Adaptive body biasing was applied to each body bias region to continuously adapt to variations in process and operating conditions. With 144 body bias regions available, the authors showed 69% leakage savings at constant frequency, or alternatively 16% higher speed as compared to nominal body biasing. The drawback of this approach is the high number of body bias regions required, which significantly complicates physical design and gives an unacceptable area overhead due to well spacings.

Gregg et al. presented an approach for applying N-well biasing to circuit regions on a well-by-well basis [75]. Only the N-wells of digital gates on speed-critical paths are under control, to reduce the area overhead of Deep N-well isolation. For two test circuits in CMOS 90nm, the authors observed that the parametric yield related to power-performance constraints improved from 12% up to 73%. A main drawback is the limited speed-up with body biasing due to N-well biasing only.

The aforementioned works do not show the details of a feasible layout implementation of their proposed methodology. In fact, the layout example presented in [55] does not seem to include a realistic well separation, while the irregular cell placement seems to be challenging for the implementation of a proper body bias routing grid. Sathanur et al. proposed a physically clustered FBB design scheme for a standard-cell layout style [57]. Selective FBB is applied to those standard-cell rows that contain the most critical gates in order to reduce the leakage overhead. Although the authors show a feasible layout implementation of body bias regions, the drawback is the large area increase due to Deep N-well separation between standard-cell rows that operate at different body biases. Hamamoto et al. suggested body bias clustering based on the layout of the design with a fixed cell placement [73]. The proposed clustering approach targets minimum active leakage overhead for reaching a given speed target by utilizing the swapped body bias approach proposed in [76]. The main drawbacks of this approach are: 1) it relies on swapped body biasing, which can only be used in sub-threshold circuits, and 2) it requires fine-grained, spatially distributed body bias clusters, which give area and routing overheads (routing overhead occurs when a swapped body bias is not utilized).

In this work the motivation for a clustering approach is two-fold. Primarily, one wants to reduce the number of digital gates that receive FBB, in order to relax the constraints on the embedded FBB generator in terms of load current and capacitive load. This has not been considered in prior art. Secondly, as indicated in prior art, one wants to reduce the leakage caused by FBB by applying FBB to fewer logic gates while operating in active mode. This is especially beneficial under fast process and high temperature conditions. Recall that dynamic FBB is used to reduce leakage power when the circuit is in standby.

7.2 Design-Time Body Bias Clustering

Prior art solutions perform body bias clustering at the netlist level of the design [55][56][57][73][75][77][78]. In contrast, the body bias clustering approach proposed here starts at logic synthesis, as an extension of the BBD design approach. In this way, one is able to optimize the circuit design by applying FBB only to speed-critical circuit sections. Speed non-critical circuit sections receive a nominal body bias voltage. This design optimization aims at the use of a minimum number of forward body biased digital gates for achieving performance, power consumption and/or area targets.


7.2.1 Synthesis-Based Body Bias Clustering Exploration

Let us explore the trade-off between body bias cluster size (i.e. the number of digital gates that receive FBB) and the number of body bias clusters (i.e. the use of multiple FBB voltages). Without loss of generality, a high-performance industrial microprocessor design in 90nm SVT LP-CMOS is used for illustration purposes. The design is similar to the one used in Chapter 5; it contains about 3.8k flip-flops and 34k logic gates. The design synthesis is performed for slow process conditions, VDD=1.1V and a temperature of 85°C. The commercial logic synthesis tool from Cadence, RTL Compiler, has been used for synthesizing the design. Digital cell libraries have been re-characterized for timing under different FBB conditions by using Altos’ Liberate library characterizer. The libraries are at NBB and at 0.1, 0.2, 0.25, 0.3, 0.4 and 0.5V FBB. One can use the same optimization problem for obtaining the maximum PPA design as expressed in section 5.3, repeated here for convenience. Let Ψ represent all paths in the circuit. Let Dj express the propagation delay of a path j ∈ Ψ. There are q gates in the circuit; the gate sizing factor of gate i ∈ j is represented by parameter xi. For each type of gate, there are m different gate sizes available in the digital standard-cell library. Ptotal is the total power consumption of the circuit. Let VBB=Vpwell=VDD-Vnwell represent the amount of FBB applied to the digital core. Then, the maximum PPA optimization is as follows:

\[
\begin{aligned}
\text{maximize}\quad & \mathrm{PPA} \\
\text{subject to}\quad & P_{total} \le P_{max} \\
& 1 \le x_i \quad \forall i \in \{1, 2, \dots, q\} \\
& x_i \in \{1, 2, \dots, m\} \\
& V_{BB} \in [0, 0.5]\,\mathrm{V}
\end{aligned}
\]

Timing information is extracted by using a commercial static timing analysis tool, e.g. Cadence Encounter Timing System. The clock period Tbest at which the maximum PPA occurs is different for the NBB design and the FBB designs. Under FBB conditions, Tbest was obtained for the maximum FBB of 0.5V. After this, the design is re-synthesized to explore body bias clustering. The optimization problem that drives the logic optimizer is now changed to use the fewest FBB cells:

\[
\begin{aligned}
\text{minimize}\quad & \text{total leakage} \\
\text{subject to}\quad & D_j \le T_{best} \quad \forall j \in \Psi \\
& 1 \le x_i \quad \forall i \in \{1, 2, \dots, q\} \\
& x_i \in \{1, 2, \dots, m\} \\
& V_{BBi} \in \{V_{BB(1)}, \dots, V_{BB(n)}\}
\end{aligned}
\]

where VBB(1),…,VBB(n) refers to a discrete body bias set that corresponds to the number of body bias clusters in the design. The result of this optimization is a design with a minimum utilization of FBB that meets the given timing and power requirements.
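The structure of the second optimization can be illustrated on a toy problem. The sketch below is not the commercial synthesis flow: it is a brute-force search, under stated assumptions, that assigns one body bias from a discrete set to each gate so that every path meets Tbest at minimum total leakage. Gate sizing is left out, and all gate delays, leakage values, paths and per-bias scaling factors are hypothetical numbers chosen only for illustration.

```python
from itertools import product

# Hypothetical scaling factors relative to NBB (not characterized library
# data): FBB shortens gate delay and multiplies gate leakage.
BIAS_OPTIONS = {
    0.0: {"delay": 1.00, "leak": 1.0},   # NBB
    0.5: {"delay": 0.80, "leak": 27.0},  # 0.5V FBB
}

# Hypothetical gates: name -> (nominal delay [ns], nominal leakage [a.u.]).
GATES = {"g1": (1.0, 1.0), "g2": (1.2, 2.0), "g3": (0.6, 1.0), "g4": (0.5, 1.0)}

# Hypothetical path set (Psi); each path is a list of gate names.
PATHS = [["g1", "g2"], ["g3", "g4"], ["g1", "g4"]]


def path_delay(path, assignment):
    """Delay D_j of one path under a gate -> body bias assignment."""
    return sum(GATES[g][0] * BIAS_OPTIONS[assignment[g]]["delay"] for g in path)


def total_leakage(assignment):
    """Sum of gate leakages under a gate -> body bias assignment."""
    return sum(GATES[g][1] * BIAS_OPTIONS[assignment[g]]["leak"] for g in GATES)


def min_leakage_assignment(t_best):
    """Exhaustively find the lowest-leakage assignment with D_j <= t_best.

    Returns (leakage, assignment), or None if no assignment meets timing.
    """
    best = None
    names = list(GATES)
    for biases in product(BIAS_OPTIONS, repeat=len(names)):
        assignment = dict(zip(names, biases))
        if all(path_delay(p, assignment) <= t_best for p in PATHS):
            leak = total_leakage(assignment)
            if best is None or leak < best[0]:
                best = (leak, assignment)
    return best


if __name__ == "__main__":
    print(min_leakage_assignment(2.05))
```

In this toy case only the gate whose speed-up is cheapest in leakage receives FBB, which mirrors the two-cluster idea of biasing speed-critical gates only. An exhaustive search scales exponentially with the number of gates and is shown purely to make the constraints explicit.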


[Figure 7.1: plot of relative clock frequency (1.0 to 1.4) versus relative leakage power (0 to 30) for 90nm SVT LP-CMOS at slow process, VDD=1.1V, T=85°C. Data points: 1 cluster: NBB; 1 cluster: 0.5V FBB; 2 clusters: NBB, 0.5V FBB; 3 clusters: NBB, 0.25, 0.5V FBB; 3 clusters: NBB, 0.4, 0.5V FBB; 4 clusters: NBB, 0.1, 0.3, 0.5V FBB; 6 clusters: NBB, 0.1, 0.2, 0.3, 0.4, 0.5V FBB.]

Figure 7.1 Frequency vs. leakage for the microprocessor with body bias clusters.

Figure 7.1 shows the trade-off between maximum clock speed and leakage power of the microprocessor design for different numbers of body bias clusters. The clock frequency and leakage power have been normalized to the NBB case. Observe a frequency increase of about 22% when applying 0.5V FBB to the whole design. This frequency increase comes at the penalty of about 27x higher leakage versus the NBB design. The leakage penalty can be reduced to 11-14x when utilizing body bias clustering. This follows from the nature of timing-driven optimization: by selectively applying FBB, the synthesis tool works towards a balanced design that contains many nearly timing-critical circuit paths. The lowest leakage penalty is obtained for the six cluster case as well as the three cluster case with asymmetric body biasing (NBB, 0.4V FBB, 0.5V FBB). The leakage penalty is somewhat larger for the other body bias clustering cases, e.g. the two cluster case gives a leakage penalty of almost 14x. The primary reason for the somewhat higher leakage penalty is that more 0.5V FBB cells are used than in the former two cases. Furthermore, observe the higher circuit speed when utilizing only the higher body bias values: NBB, 0.4V FBB, 0.5V FBB and NBB, 0.5V FBB. This is because, in the other cases, the logic optimizer drives towards a higher usage of lower-FBB cells to reduce leakage, which impacts speed.

Figure 7.2 shows the breakdown of the cells per body bias cluster as well as the PPA value obtained. Each column shows the FBB value applied per cluster. The PPA is normalized to the NBB case. The design consists of about 66k gate equivalents when 0.5V FBB is applied to the whole design. Observe that the number of body-biased cells reduces to about 20k for the two cluster case. More body-biased cells are needed when more clusters are utilized. This is because the optimizer steers towards the use of lower FBB to reduce leakage, and consequently more cells need to be body biased to achieve the targeted clock speed. Also observe that the PPA reduces when body bias clustering is done. This is because the optimizer utilizes, among others, gate sizing to compensate for the speed loss of gates when a lower FBB than 0.5V is applied, which increases area. The two cluster case provides the best PPA results as compared to the other clustering cases. Additionally, only a single FBB generator is needed for the single or two cluster cases, which is advantageous for the physical design integration of body biasing.


[Figure 7.2: stacked bar chart of the number of body-biased cells [kRefGates] (0 to 70) per clustering case in 90nm SVT LP-CMOS, with the FBB value of each cluster annotated per column. Normalized PPA per case: 1 cluster (FBB): 1.26; 2 clusters (NBB, FBB): 1.16; 3 clusters (NBB, FBB): 1.11 and 1.10; 4 clusters (NBB, FBB): 1.08; 6 clusters (NBB, FBB): 1.08.]

Figure 7.2 Body-biased cells per cluster and PPA for the microprocessor.

Table 7.1 provides a summarized overview of the microprocessor design characteristics when utilizing body bias clustering. The following circuit parameters can be compared: clock period, circuit area, PPA, total power, leakage power and the number of reference gates. In addition to the previous observations, notice that the circuit area becomes larger when more body bias clusters are utilized. This is because digital gates need to be up-sized to achieve a given gate drive capability when a lower amount of body biasing is applied. From a leakage point of view, an up-sized digital gate has a lower leakage current than a gate with FBB. As a result, the optimizer has driven towards a higher utilization of up-sized digital gates to which a lower amount of FBB is applied. Primarily because of the area impact, the PPA is degraded when more body bias clusters are utilized. Furthermore, observe that there is a tendency towards utilization of higher FBB values in the body bias clustering cases. This is because the logic synthesis aims to achieve a balanced design in terms of timing performance.

The body bias cluster exploration revealed that the use of only two body bias clusters is preferred to implement a design with the smallest number of digital gates with FBB. One cluster should receive NBB, while the other cluster should receive the maximum FBB required. From an area and PPA point of view, it should be clear that fewer body bias clusters are preferred. Limiting the number of body bias clusters to a maximum of two is also preferred because only one FBB voltage needs to be generated and supplied to the circuit. However, the exploration also revealed that the circuit area and PPA can degrade when utilizing body bias clusters at design synthesis time. The logic synthesis tool is constrained differently to avoid this degradation, as will be explained in section 7.2.3.

A final point of observation is that the logic synthesis tool may assign a different body bias to consecutive digital gates in a circuit path. This is not a preferred approach, since it can result in many signal routings between body bias clusters, which may raise routing congestion issues in the place-and-route stage of the design. An approach that tends to minimize the number of signal routings between body bias clusters is presented in section 7.2.3.


| Design | Body bias | Clock period [ns] | Circuit area [µm²] | PPA [MHz/µm²] | Total power [mW] | Leakage power [µW] | Number of reference gates |
|---|---|---|---|---|---|---|---|
| Design-1: Reference | NBB | 4.85 | 374675 | 550·10⁻⁶ | 25.4 | 55 | 68278 |
| Design-2: Single body bias | FBB ∈ 0.5V | 3.98 | 362235 | 694·10⁻⁶ | 31.0 | 1468 | 66011 |
| Design-3: 6 body bias clusters | NBB, FBB ∈ 0.1, 0.2, 0.3, 0.4, 0.5V | 4.03 | 418770 | 592·10⁻⁶ | 30.1 | 633 | NBB: 43425 (57%); 0.1V FBB: 3409 (5%); 0.2V FBB: 4066 (5%); 0.3V FBB: 4688 (6%); 0.4V FBB: 7744 (10%); 0.5V FBB: 12982 (17%); Total: 76314 (100%) |
| Design-4: 4 body bias clusters | NBB, FBB ∈ 0.1, 0.3, 0.5V | 4.04 | 416155 | 595·10⁻⁶ | 30.1 | 761 | NBB: 44051 (58%); 0.1V FBB: 2993 (4%); 0.3V FBB: 9651 (13%); 0.5V FBB: 19142 (25%); Total: 75837 (100%) |
| Design-5: 3 body bias clusters | NBB, FBB ∈ 0.25, 0.5V | 4.02 | 411909 | 605·10⁻⁶ | 29.6 | 767 | NBB: 45902 (61%); 0.25V FBB: 9091 (12%); 0.5V FBB: 20071 (27%); Total: 75064 (100%) |
| Design-6: 3 body bias clusters | NBB, FBB ∈ 0.4, 0.5V | 3.98 | 410631 | 613·10⁻⁶ | 29.6 | 626 | NBB: 45552 (61%); 0.4V FBB: 18900 (25%); 0.5V FBB: 10389 (14%); Total: 74831 (100%) |
| Design-7: 2 body bias clusters | NBB, FBB ∈ 0.5V | 3.96 | 395690 | 638·10⁻⁶ | 31.6 | 756 | NBB: 52017 (72%); 0.5V FBB: 20091 (28%); Total: 72108 (100%) |

Table 7.1 Body bias clustered microprocessor designs in 90nm SVT LP-CMOS.

BBD design utilizes body bias clusters. Conditions: slow process corner, VDD=1.1V and T=85°C.
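The PPA column of Table 7.1 appears consistent with clock frequency divided by circuit area. The following minimal sketch recomputes it under that assumption; the function name `ppa_mhz_per_um2` is hypothetical and only the clock periods and areas are taken from the table.

```python
def ppa_mhz_per_um2(clock_period_ns: float, area_um2: float) -> float:
    """Assumed definition: clock frequency [MHz] divided by circuit area [um^2]."""
    freq_mhz = 1e3 / clock_period_ns  # 1 / ns = GHz, so 1e3 / ns = MHz
    return freq_mhz / area_um2


# (clock period [ns], circuit area [um^2]) as listed in Table 7.1
designs = {
    "Design-1 (NBB)": (4.85, 374675),
    "Design-2 (0.5V FBB)": (3.98, 362235),
    "Design-7 (NBB + 0.5V FBB)": (3.96, 395690),
}

for name, (period_ns, area_um2) in designs.items():
    ppa = ppa_mhz_per_um2(period_ns, area_um2)
    print(f"{name}: {ppa * 1e6:.0f}e-6 MHz/um^2")
```

Because the area sits in the denominator, a design that is both slower and larger than the single-cluster FBB design loses PPA on both counts, which is the trend discussed for the multi-cluster cases.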


7.2.2 Candidate path selection for FBB

Let Ψ represent all paths in the circuit; Ω contains the speed-critical circuit paths only, where Ω ⊆Ψ. Dj expresses the propagation delay of a path j ∈Ψ. The speed-critical circuit paths are those circuit paths that fulfil the following requirement:

$$\sigma_j D_{max} \le D_j \le D_{max} \quad \forall j \in \Omega \subseteq \Psi \qquad (7.1)$$

$$\sigma_j = \begin{cases} 1 + k_j V_{BB} & \forall\, V_{DD} > V_{th} \\ e^{k_j V_{BB}} & \forall\, V_{DD} \le V_{th} \end{cases} \qquad (7.2)$$

where Dmax is the delay of the critical path of the circuit design at nominal body bias. σj is the speed-up factor of the circuit path j due to FBB, which has been shown before in expressions (2.5) and (6.7). The fitting parameter kj is a function of process, supply voltage, temperature, and Vth-option. The path group Ω is obtained by performing two distinct timing analysis runs. The first run is based on the same FBB-characterized digital cell timing libraries as used during design synthesis (Tck=Tbest_fbb=σDmax). The second run is based on the use of the NBB-characterized digital cell timing libraries only (Tck=Tbest_nbb=Dmax). Figure 7.3 provides the visual representation of the candidate path selection for FBB, as explained above.

Figure 7.3 Path delay distribution of the 90nm SVT microprocessor. The candidate paths for FBB have been indicated. (Histogram: number of circuit paths versus path delay in [ns]; the candidate paths lie between σDmax and Dmax, where σDmax marks the maximum speed-up with FBB.)

Let us formulate now the time constraint for all paths in the circuit:

$$D_j \le \sigma_j D_{max} = T_{best\_fbb} \quad \forall j \in \Psi \qquad (7.3)$$

The speed-critical circuit paths that violate the clock period constraint of the BBD design are assigned to path group Ω. They are the candidate paths for application of FBB. All other circuit paths receive NBB. Speed-critical circuit paths are identified by applying a maximum FBB voltage (e.g. 0.5V FBB) under BBD design.
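The selection rule can be sketched in a few lines. This is an illustrative sketch only: the path names, the delays and the value σ = 0.82 are hypothetical, and the function `select_candidate_paths` is not part of any tool flow described here. It takes NBB path delays, derives Tbest_fbb = σDmax, and returns the paths that exceed it.

```python
def select_candidate_paths(nbb_delays_ns, sigma):
    """Return (omega, t_best_fbb) for a dict of NBB path delays in ns.

    omega holds the paths whose NBB delay exceeds sigma * Dmax, i.e. the
    paths that cannot meet the FBB clock period without forward body bias.
    """
    d_max = max(nbb_delays_ns.values())      # critical path delay at NBB
    t_best_fbb = sigma * d_max               # clock period with full FBB
    omega = {path for path, delay in nbb_delays_ns.items()
             if delay > t_best_fbb}
    return omega, t_best_fbb


# Hypothetical NBB path delays [ns] and speed-up factor
delays = {"p0": 4.85, "p1": 4.20, "p2": 3.90, "p3": 2.10}
omega, t_ck = select_candidate_paths(delays, sigma=0.82)
print(sorted(omega), round(t_ck, 3))
```

Paths outside Ω already meet the FBB clock period at NBB, so they can stay in the NBB cluster without affecting the achievable clock speed.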


7.2.3 Hierarchy-Based Body Bias Clustering

This section introduces a new body bias clustering concept that is based on design hierarchies. A hierarchical digital design is assumed, i.e. the RTL design is divided into multiple lower-level modules or sub-modules, which is valid for most of today’s digital designs. In the context of this work, design hierarchies refer to the lowest-level RTL modules that cover a given circuit path (or a section of the given circuit path) in the design. In this approach, all digital gates within a given design hierarchy level receive either NBB or a given maximum FBB value, i.e. two body bias clusters are utilized. The given design hierarchy level may contain sub-modules that receive either NBB or FBB.

The motivation for using design hierarchies/soft macros instead of circuit paths is as follows. Firstly, speed-critical micro-architectural modules (execution units, buses, memory hierarchies) are typically implemented as soft macros. Secondly, the number of input/output ports of micro-architectural modules is limited. This limits the number of signal nets that cross body bias cluster boundaries, and therefore minimizes the chance of routing congestion due to such nets. Thirdly, once the design hierarchies that should receive FBB are known, one can account for them during a re-synthesis step for further timing optimization. Also, the identified design hierarchies may be re-used to perform body bias clustering in circuit designs with a similar chip architecture.

A disadvantage of hierarchy-based body bias clustering is that it uses more FBB cells than optimally required, and as a consequence the obtained circuit solution is not leakage optimal.

Let us now address the approach to obtain the hierarchy set, H, that contains the candidate hierarchies for receiving FBB. For this purpose, circuit timing is obtained as explained in the previous section.
After BBD design synthesis, a first timing run is performed using the same FBB-characterized digital cell timing libraries as used during logic synthesis. From this, one obtains the reference clock period, Tck=σDmax, for the case in which FBB is applied to the whole design. Next, a second timing run is performed on the same design, now using the NBB-characterized digital cell timing libraries instead. In this case, the minimum clock period degrades to Tck=Dmax, since no FBB is applied. From these timing runs, it is possible to extract the FBB speed-up factor, σ.

The candidate design hierarchies for FBB are obtained from the algorithm described in Figure 7.4, where hi denotes design hierarchy i in the design. As a first step, speed-critical circuit paths that violate the clock period constraint of the BBD design (Tck=Tbest_fbb) are assigned to path group Ω. They are the candidate paths for application of FBB. This process is shown in lines 2 to 9 of Figure 7.4. Information on the timing slack of path j, the design hierarchies (hi : j ∈ hi) to which the path belongs, and the composition of the path delay per design hierarchy is maintained for each circuit path j∈Ω. The aforementioned design hierarchies refer to the lowest hierarchy levels in the design needed for covering circuit path j. These design hierarchies may have sub-hierarchies in the other body bias cluster. The union of hierarchies, h1∪h2∪…∪hn, constitutes the hierarchy set, H, that contains all circuit paths of group Ω.

As a second step, the candidate design hierarchies for FBB are determined. This process is shown in lines 10 to 21 of Figure 7.4. The circuit paths of Ω are sorted by increasing timing slack. Starting with the critical path, the timing slack is compared against a tolerable percentage slack value, ε. Parameter ε is the amount of tolerable relative negative slack for a design with body bias clusters instead of full FBB (ε ≤ 0). When Slack(1)/(σDmax)≤ε, the RTL leaf module with the largest impact on path delay is assigned to hierarchy set H, and thus should receive FBB. This design hierarchy is denoted as hx, for which D1(hx)=maxD1(hi) ∀ hi ∈ Hierarchy(1) ∧ hx∉H. The delay and slack matrices are updated to account for FBB for those design hierarchies that have been assigned to H. When the critical path contains more design hierarchies at NBB, a further iteration assigns the next design hierarchy to hierarchy set H. Once all design hierarchies of the critical path have been assigned to H, or once the critical path meets the timing constraint, the process is repeated until all circuit paths j ∈ Ω have been processed. Finally, the algorithm returns the hierarchy set H containing the candidate hierarchies for application of FBB.

Figure 7.4 Algorithm for selecting candidate design hierarchies for FBB.

The design is re-synthesized after the FBB candidate hierarchy set, H, has been obtained. During synthesis, hierarchy set H is assigned to one power domain (FBB cluster) while the rest of the design is assigned to another power domain (NBB cluster). One can make use of power domain aware synthesis with Cadence RTL Compiler for hierarchy assignment. BBD design has been employed to optimize the body-bias clustered design under body bias conditions to meet the design targets. The same high-performance microprocessor design as before is used to evaluate hierarchy-based body bias clustering. The following conditions for design synthesis are applied: slow process conditions, VDD=1.1V and a temperature of 85°C. Further, two body bias clusters are used: NBB and 0.5V FBB. For evaluation purposes, five different hierarchy choices for body bias clustering have been compared for the BBD microprocessor; each design uses a different ε value.

Algorithm 2: Find_FBB_Hierarchies(σ, Dmax, ε)

1. Delay = ∅; Hierarchy = ∅
2. foreach circuit path j ∈ Ψ do
3.     if Dj > σDmax then
4.         Assign path j to Ω
5.         Slack(j) = σDmax - Dj
6.         Hierarchy(j) = hi : j ∈ hi
7.         Delay(j) = Dj(hi) ∀ hi ∈ Hierarchy(j)
8.     end if
9. end foreach
10. for n = 1 to |Ω|
11.     Sort slack, hierarchy, and delay matrices for increasing timing slack
12.     if Slack(1)/(σDmax) ≤ ε then
13.         Assign most timing-critical hierarchy hx to H: $H = H \cup h_x$
14.         foreach circuit path j ∈ Ω do
15.             Update delay matrix components: $D_j(h_x) = \sigma D_j(h_x) \quad \forall h_x \in H$
16.             Re-calculate slack: $Slack(j) = \sigma D_{max} - \sum_{\forall h_i \in Hierarchy(j)} D_j(h_i)$
17.         end foreach
18.         if (Hierarchy(1) \ H) ≠ ∅ then
19.             Go to Line 12
20.         end if
21.     end if
22.     Increase path counter: n=n+1
23. end for
24. return hierarchy set H
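To make the control flow of Algorithm 2 concrete, here is a minimal Python sketch. It is one interpretation under stated assumptions, not the implementation used in this work: each path is given as a hypothetical dictionary of per-hierarchy NBB delay contributions, the delay update of line 15 is applied only to the newly selected hierarchy hx, and the module names and numbers in the example are invented.

```python
def find_fbb_hierarchies(path_delays, sigma, d_max, eps):
    """Return the set H of hierarchies that should receive FBB.

    path_delays maps each path name to {hierarchy: NBB delay contribution}.
    """
    t_ck = sigma * d_max  # clock period target with FBB on the whole design

    # Lines 2-9: keep only the paths that violate the target at NBB.
    delay = {j: dict(parts) for j, parts in path_delays.items()
             if sum(parts.values()) > t_ck}
    slack = {j: t_ck - sum(parts.values()) for j, parts in delay.items()}

    selected = set()  # hierarchy set H
    for _ in range(len(delay)):                 # line 10
        worst = min(slack, key=slack.get)       # line 11: least slack first
        while slack[worst] / t_ck <= eps:       # line 12
            at_nbb = {h: d for h, d in delay[worst].items()
                      if h not in selected}
            if not at_nbb:                      # line 18: nothing left at NBB
                break
            hx = max(at_nbb, key=at_nbb.get)    # largest delay contribution
            selected.add(hx)                    # line 13
            for j, parts in delay.items():      # lines 14-17
                if hx in parts:
                    parts[hx] *= sigma          # line 15 (new hx only)
                slack[j] = t_ck - sum(parts.values())  # line 16
    return selected


# Hypothetical paths: per-hierarchy NBB delay contributions in ns
paths = {
    "pA": {"alu": 3.0, "regfile": 1.2, "decode": 0.8},
    "pB": {"alu": 2.5, "lsu": 2.0},
    "pC": {"decode": 1.5, "lsu": 1.0},
}
print(find_fbb_hierarchies(paths, sigma=0.8, d_max=5.0, eps=-0.05))
print(find_fbb_hierarchies(paths, sigma=0.8, d_max=5.0, eps=-0.2))
```

In this toy example, tolerating more negative slack (ε = -0.2) selects a single hierarchy, while a tighter tolerance (ε = -0.05) pulls a second hierarchy into H.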


Figure 7.5 Frequency versus body-biased reference gates for the microprocessor with hierarchy-based body bias clustering. (Plot: relative clock frequency versus number of body-biased reference gates [kRefGates], 90nm SVT LP-CMOS. Data points: 1 cluster NBB; 1 cluster 0.5V FBB; 2 clusters NBB, 0.5V FBB (reference); hierarchy-based body bias clustering with ε = -0.20, -0.175, -0.15, -0.1 and -0.05.)

Figure 7.5 shows the trade-off between maximum clock speed and the number of FBB cells for two hierarchy-based body bias clusters. The clock frequency has been normalized to the NBB case. The single-cluster cases and the two-cluster case from section 7.2.1 are indicated for reference purposes. Observe in Figure 7.5 that hierarchy-based body bias clustering (case: ε= -0.05) can achieve clock speeds comparable to the case in which 0.5V FBB is applied to the whole design. At the same time, the required number of body-biased cells reduces significantly, from ~66k for the single-cluster case to ~38k (ε= -0.05) or ~17k (ε= -0.2) for hierarchy-based body bias clustering. The number of FBB cells increases when ε increases towards zero, because more design hierarchies receive FBB. At the same clock speed, hierarchy-based body bias clustered designs require more cells with FBB than the two-cluster reference case from section 7.2.1 (case: reference in Figure 7.5). The maximum clock speed is slightly lower (~2-3%) for the same amount of body-biased cells.

Table 7.2 shows a more detailed overview of the body-bias clustered designs. Design-1 and Design-2 are the single-cluster reference designs for NBB and 0.5V FBB, respectively. Design-3 is the two-cluster reference obtained from the synthesis-based body bias clustering of section 7.2.1. In addition to the previous observations, hierarchy-based body bias clustering can reduce the leakage overhead from 27x down to 10x with respect to the NBB case. For the same number of body-biased cells, the leakage overhead is lower than for the two-cluster reference case, at a small 2-3% frequency penalty. The circuit area of body bias clustered designs is larger than for the single-cluster 0.5V FBB case. This is because the logic optimizer drives towards smaller gate sizing solutions when FBB is utilized. In other words, the more gates that receive FBB, the smaller the circuit design. This also explains why hierarchy-based body bias clustering is able to offer a smaller-area solution than synthesis-based body bias clustering. The PPA and total power trends for the different designs can also be explained by the area scaling trends. Notice that hierarchy-based body bias clustered designs give 7.1x fewer inter-cluster signal routings than the two-cluster reference case for about the same amount of FBB cells; routing congestion is therefore less likely to occur for hierarchy-based body-bias clustered designs.
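The ratios quoted in this discussion can be recomputed directly from the Table 7.2 entries. The short sketch below does so; the dictionary layout and key names are illustrative choices, while the numbers are the leakage, FBB gate count and inter-island routing values listed in the table.

```python
# (leakage [uW], gates at 0.5V FBB, inter-island routings) from Table 7.2
table_7_2 = {
    "Design-1 NBB":             (55, 0, None),
    "Design-2 0.5V FBB":        (1468, 66011, None),
    "Design-3 synthesis-based": (756, 20091, 10166),
    "Design-4 eps=-0.2":        (568, 17142, 623),
    "Design-5 eps=-0.175":      (746, 24645, 1418),
}

nbb_leakage = table_7_2["Design-1 NBB"][0]

# Leakage overhead of each design relative to the NBB reference
for name, (leak_uw, _, _) in table_7_2.items():
    print(f"{name}: {leak_uw / nbb_leakage:.1f}x NBB leakage")

# Inter-cluster routing ratio: synthesis-based reference vs. eps=-0.175
routing_ratio = (table_7_2["Design-3 synthesis-based"][2]
                 / table_7_2["Design-5 eps=-0.175"][2])
print(f"routing ratio: {routing_ratio:.2f}x")
```

The comparison between Design-3 and Design-5 is the one with roughly similar FBB gate counts (about 20k versus about 25k), which is why it is used here for the routing ratio.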


Through BBD design, one can alleviate the maximum clock frequency penalty through design optimization, even when a large negative slack is tolerated during the candidate hierarchy selection process (ε is set to be more negative). For ε = -0.2, a limited frequency penalty of only 3% was found for the microprocessor design, while requiring 1.7x fewer body-biased cells and about 2.6x lower leakage than a single-cluster 0.5V FBB design. In this case, only the ALU hierarchy of the processor has been assigned to the FBB domain (the ALU is identified as the most speed-critical part of the processor). Notice that one can trade off PPA, clock speed and the number of body-biased cells for a given design when utilizing BBD design with hierarchy-based body bias clustering.

Hierarchy-based body bias clustering can be applied to any size of IP or system design. Although the concept was demonstrated by assigning sub-hierarchies of a processor design to a given body bias cluster, the approach scales to larger systems as well. However, a system chip may contain multiple processor sub-systems that require distinct body bias clusters because of uncorrelated power-performance needs. Hierarchy-based body bias clustering can easily be extended to support distinct body bias clusters for given IPs in the system. For example, candidate design hierarchies for FBB may be assigned to a unique body bias cluster per sub-system. This can be supported by applying the algorithm presented in Figure 7.4 at the sub-system level.

7.3 Physical Design with Body Bias Clusters

In this section the physical design aspects of body bias clustered designs are addressed. A novel layout implementation approach is presented for body bias clusters. Without loss of generality, the approach is demonstrated by incorporating body bias clusters into the same commercial microprocessor as used before. For this purpose, the microprocessor design in 90nm SVT LP-CMOS is used that utilizes two hierarchy-based body bias clusters, namely NBB and 0.5V FBB, for the case of ε= -0.175 (see Table 7.2). This design has about 34% of its logic area in the 0.5V FBB body bias cluster. The proposed approach has been mapped onto the Encounter Digital Implementation System from Cadence to implement the design.

7.3.1 Body Bias Islands

A body bias island is a physical layout region that contains the digital gates of a body bias cluster to which FBB is applied. One cluster may consist of one or more islands. Figure 7.6 shows a layout example of a body bias island embedded into NBB digital standard cells; it fits onto the digital standard-cell row template. An island is surrounded by an N-well ring and isolated from the substrate through a Deep-N-well. A spacing is needed between the standard cells inside and outside the island; this spacing is as large as 2µm in 90nm LP-CMOS, including the N-well ring. Furthermore, the N-well and P-well connections are made through dedicated well tap cells; the tap cells are inserted in columns at a maximum pitch of 60µm. This maximum pitch is a design rule from the fab to prevent latch-up in the circuit. A body bias supply grid is connected to the tap cells of the island, while the power supply (VDD, VSS) is connected to the tap cells of the NBB circuit part.


| Design | Body bias and clustering | Clock period [ns] | ε | Circuit area [µm²] | PPA [MHz/µm²] | Total power [mW] | Leakage power [µW] | Number of inter-island routings | Number of reference gates |
|---|---|---|---|---|---|---|---|---|---|
| Design-1: Reference | NBB | 4.85 | n.a. | 374675 | 550·10⁻⁶ | 25.4 | 55 | n.a. | 68278 |
| Design-2: Single body bias | FBB ∈ 0.5V | 3.98 | n.a. | 362235 | 694·10⁻⁶ | 31.0 | 1468 | n.a. | 66011 |
| Design-3: 2 body bias clusters | NBB, FBB ∈ 0.5V; synthesis-based clustering | 3.96 | n.a. | 395690 | 638·10⁻⁶ | 31.6 | 756 | 10166 | NBB: 52017 (72%); 0.5V FBB: 20091 (28%); Total: 72108 (100%) |
| Design-4: 2 body bias clusters | NBB, FBB ∈ 0.5V; hierarchy-based clustering | 4.09 | -0.2 | 423213 | 578·10⁻⁶ | 34.9 | 568 | 623 | NBB: 59981 (78%); 0.5V FBB: 17142 (22%); Total: 77123 (100%) |
| Design-5: 2 body bias clusters | NBB, FBB ∈ 0.5V; hierarchy-based clustering | 4.05 | -0.175 | 401255 | 616·10⁻⁶ | 33.6 | 746 | 1418 | NBB: 58476 (66%); 0.5V FBB: 24645 (34%); Total: 73121 (100%) |
| Design-6: 2 body bias clusters | NBB, FBB ∈ 0.5V; hierarchy-based clustering | 4.06 | -0.15 | 394978 | 623·10⁻⁶ | 33.5 | 843 | 1815 | NBB: 43737 (61%); 0.5V FBB: 28241 (39%); Total: 71978 (100%) |
| Design-7: 2 body bias clusters | NBB, FBB ∈ 0.5V; hierarchy-based clustering | 4.04 | -0.1 | 383521 | 645·10⁻⁶ | 32.1 | 928 | 2190 | NBB: 38033 (54%); 0.5V FBB: 31857 (46%); Total: 69890 (100%) |
| Design-8: 2 body bias clusters | NBB, FBB ∈ 0.5V; hierarchy-based clustering | 4.01 | -0.05 | 375046 | 665·10⁻⁶ | 31.5 | 1020 | 2979 | NBB: 30489 (45%); 0.5V FBB: 37856 (55%); Total: 68345 (100%) |

Table 7.2 Hierarchy-based body bias clustered microprocessor design characteristics in 90nm SVT LP-CMOS.

BBD design utilizes hierarchy-based body bias clustering. Conditions: slow process corner, VDD=1.1V and T=85°C.


The area overhead of a body bias island is determined by the required spacing between the Deep-N-well and the N-well outside the body bias island. Let this spacing be defined as ∆ (which corresponds to 2µm in 90nm LP-CMOS). For a rectangular body bias island of width W and height H, and assuming ∆ to be accommodated at all four sides of the body bias island, the area overhead can be determined as follows.

$$A_{overhead} = 2\Delta (W + H) + 4\Delta^2 \qquad (7.4)$$

This area overhead can be reduced by placing the body bias island at the side of a digital standard-cell block. However, designers have to be careful that this does not raise wire congestion issues during the routing stage since the sides of the body bias island that are at the digital block boundary are effectively lost for signal routing.
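As a worked example of expression (7.4), the sketch below evaluates the overhead for a 350x350µm2 island with ∆ = 2µm and compares it with the same cell area split into four smaller islands. The function name and the four-island split are illustrative assumptions; only ∆ and the island dimensions come from the text.

```python
def island_area_overhead(width_um, height_um, delta_um=2.0):
    """Area overhead [um^2] of one rectangular island per expression (7.4)."""
    return 2 * delta_um * (width_um + height_um) + 4 * delta_um ** 2


one_island = island_area_overhead(350, 350)
# Same total cell area split into four 175 x 175 um islands (hypothetical)
four_islands = 4 * island_area_overhead(175, 175)

# Cross-check against the geometric definition (W + 2D)(H + 2D) - W*H
assert one_island == (350 + 4) * (350 + 4) - 350 * 350

print(one_island, four_islands)
```

Splitting the island doubles the perimeter term for the same enclosed cell area, which is why many small islands carry a noticeably larger overhead than one large island.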

Figure 7.6 Layout example of a body bias island. (Legend: tap cell, N-well, P-well, Deep N-well; body bias supplies Vpwell and Vnwell; tap cell column pitch ≤ 60 µm.)

Finally, the area overhead of body bias islands increases linearly with the number of islands. Many small (spatially distributed) body bias islands give rise to a non-negligible area overhead. Therefore, designers have to carefully trade off the number of body bias islands against the area overhead.

7.3.2 Balanced Track Utilization for Inter-Island Signal Routing

The dimensioning of the body bias island depends on the routing resources available for inter-island routing. Routing congestion is highly likely in case of too many inter-island routings. Fortunately, hierarchy-based body bias clustering has an intrinsically low number of inter-island routings, as shown in Table 7.2, which alleviates routing congestion. To further prevent routing congestion, the aim is to balance the inter-island routings that enter or leave the island in the horizontal and vertical directions. Such a balance is achieved by selecting an appropriate island size (aspect ratio) and placement within the layout. In this way, one can alleviate possible wire congestion issues at a given side of the body bias island. For a rectangular body bias island, the total number of available routing tracks for a given island side in the vertical and horizontal direction is defined as follows.


$$t_{vertical} = \frac{n_{vertical}\, W}{r} \qquad (7.5)$$

$$t_{horizontal} = \frac{n_{horizontal}\, H}{r} \qquad (7.6)$$

where nvertical and nhorizontal are the number of available metal layers for signal routing in the vertical and horizontal direction, respectively, and r is the signal wire routing pitch. The track utilization describes the percentage occupation of routing tracks, and is defined as

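$$U_{t\_v\_top} = \frac{u_{t\_v\_top}}{t_{vertical}}; \quad U_{t\_v\_bottom} = \frac{u_{t\_v\_bottom}}{t_{vertical}} \qquad (7.7)$$

$$U_{t\_h\_left} = \frac{u_{t\_h\_left}}{t_{horizontal}}; \quad U_{t\_h\_right} = \frac{u_{t\_h\_right}}{t_{horizontal}} \qquad (7.8)$$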

where ut_v_top and ut_v_bottom are the numbers of utilized tracks in the vertical direction for the top and bottom body bias island boundaries, respectively, and ut_h_left and ut_h_right are the numbers of utilized tracks in the horizontal direction for the left and right body bias island boundaries, respectively. Figure 7.7 shows a graphical example of signal wires crossing the island boundaries. Note that the track utilization can be different for each boundary of the body bias island.
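Expressions (7.5) to (7.8) translate into two small helper functions. The sketch below is illustrative only: the number of vertical routing layers (3), the routing pitch (0.25µm) and the used-track count are hypothetical values chosen for the example, not data from the 90nm design.

```python
def available_tracks(n_layers, side_length_um, pitch_um):
    """Available routing tracks across one island side, per (7.5)/(7.6)."""
    return n_layers * side_length_um / pitch_um


def track_utilization(used_tracks, tracks_available):
    """Fraction of occupied routing tracks on one side, per (7.7)/(7.8)."""
    return used_tracks / tracks_available


# Hypothetical example: 350 um wide island, 3 vertical routing layers,
# 0.25 um routing pitch, 1680 wires crossing the top boundary.
t_vertical = available_tracks(n_layers=3, side_length_um=350, pitch_um=0.25)
u_top = track_utilization(used_tracks=1680, tracks_available=t_vertical)
print(t_vertical, u_top)
```

Widening the island raises t_vertical and therefore lowers the vertical utilization for the same number of crossing wires, which is the lever used when the island is reshaped.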

Figure 7.7 Schematic of track utilization for inter-island signal routing. (The body bias island is shown with the track utilizations Ut_v_top, Ut_v_bottom, Ut_h_left and Ut_h_right at its four boundaries.)

Let us investigate the layout implementation of the commercial microprocessor design using one body bias island with FBB. The layout area of the processor design is about 630x630µm2 in 90nm LP-CMOS, which corresponds to an overall row utilization of about 80%. The body bias clustering case of ε= -0.175 was utilized. For this design, about 34% of the processor cell area is in the FBB body bias island (Design-5 in Table 7.2). The body bias island is square and occupies about 350x350µm2. The layout contains the power supply grid for the whole design and body bias networks for the body bias island only. The seven-metal-layer back-end-of-line configuration of the 90nm LP-CMOS technology was used. More than 200 floorplans with different island placements and a fixed island size have been implemented to evaluate the influence of balanced track utilization. Timing-driven place-and-route was utilized. The island placement is based on Monte-Carlo sampling, which provides random island placements across the layout. The total signal wire length of the design is used as the quality metric for evaluating the impact of balanced track utilization, i.e. a minimum wire length is considered the best result. In this context, track utilization is balanced when a similar number of signal wires cross the island boundary in the horizontal and vertical directions.
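The random placement step could be sketched as follows. This is only an assumption about how such sampling might be set up: the text states that Monte-Carlo sampling provides random island placements, while the uniform distribution, the lower-left corner convention, the fixed seed and the function name `sample_island_placements` are hypothetical.

```python
import random


def sample_island_placements(count, floorplan_um=630.0, island_um=350.0,
                             seed=0):
    """Draw lower-left island corners uniformly so the island stays inside.

    Assumes a square floorplan and a square island of fixed size.
    """
    rng = random.Random(seed)            # fixed seed for repeatable runs
    span = floorplan_um - island_um      # legal range for the corner
    return [(rng.uniform(0.0, span), rng.uniform(0.0, span))
            for _ in range(count)]


placements = sample_island_placements(200)
print(len(placements), placements[0])
```

Each sampled corner would then drive one place-and-route run, after which the total wire length and the per-side track utilizations are recorded for that floorplan.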

Figure 7.8 Track utilization exploration for the 90nm SVT microprocessor. Row utilization of about 80%, one 0.5V FBB island. (Scatter plot: total wirelength [µm] versus average track utilization, with points grouped by track balance: 75% to 100%, 50% to 75%, 25% to 50%, and 0% to 25%.)

Figure 7.8 plots the total signal wire length versus the average track utilization for the body-bias clustered microprocessor design. The average track utilization is the average of the track utilizations of the four sides of the body bias island. Each dot in the plot relates to a placed-and-routed processor design with a unique island placement. Observe that the total wire length tends to be smaller for designs with balanced track utilization. Moreover, the average track utilization also tends to be lower for designs with balanced track utilization. With lower track utilization, possible routing congestion is alleviated and the router can find shorter wire solutions for signal net routing. Therefore, it can be concluded that designs with balanced track utilization for the body bias island are preferred and tend to prevent routing congestion.

7.3.3 Automated Implementation of Body Bias Islands

This section introduces the heuristic algorithm for finding a suitable floorplan solution for digital designs that utilize body bias clusters. The algorithm has been implemented in Tcl (Tool Command Language), which is supported by the Cadence Encounter Digital Implementation System.


Figure 7.9 presents the proposed algorithm, which is based on the perturbation of the body bias island characteristics. An initial floorplan places the body bias island in the center of the floorplan. The quality of this floorplan is improved by perturbation of the island characteristics and re-floorplanning. The total wire length of the design is the main cost function of the algorithm. The total floorplan area remains constant during the optimization phase, regardless of the insertion of additional body bias islands. There are two possible perturbations for a body bias island, which are executed in sequence. The first is the resizing and reshaping of the island, which aims to find the island size and shape for which the total wire length of the design is minimum. The second is the relocation of the island in the layout, which aims to find the location of the body bias island at which the total wire length is minimum. The last phase of the algorithm increases the number of body bias islands. The perturbation of multiple islands is done in the same way as for a single island. The algorithm stops when the maximum number of iterations has been reached, when the maximum number of body bias islands has been reached, or when a multiple-island solution does not improve the total wire length of the design.

Figure 7.9 Heuristic algorithm to determine the preferred floorplan solution of a design utilizing body bias clusters.

Algorithm 3: bodybiasisland_floorplanning()
/* imax is the maximum number of iterations */
/* kmax is the maximum number of body bias islands */
/* s is a given floorplan solution with resized and relocated body bias islands */
/* ωbest is the total wirelength of the best floorplan solution */

1. Set initial value for minimum wire length: ωbest = ∞
2. Set iteration index: i=1
3. Set number of body bias islands: k=1
4. while (i ≤ imax || k ≤ kmax) do
5.     s = relocate_bodybiasisland(size_shape_bodybiasisland(k))
6.     if (wirelength(s) < ωbest) then
7.         ωbest = wirelength(s)
8.         Increase number of body bias islands: k=k+1
9.         Set iteration index: i=1
10.     else
11.         Stop algorithm: k=kmax
12.     end if
13.     Increase iteration index: i=i+1
14. end while
15. return s

Figure 7.10 shows the algorithm for resizing and reshaping the body bias islands to minimize the total wire length of the design. The body bias island that is part of the initial floorplan starts with a row utilization of 100%. At such a high row utilization one faces wire congestion issues for the inter-island routings. The algorithm decreases the row utilization based on the balance between the horizontal and vertical routing track utilization of the island. When the horizontal track utilization exceeds the vertical track utilization, the height of the island is increased to free more horizontal routing resources. The amount of island height increase is determined by a minimum step size x scaled by the track utilization ratio Ut_h/Ut_v. When the vertical track utilization is higher, the width of the island is increased to free more vertical routing resources. The island width increase is determined by a minimum step size x scaled by the track utilization ratio Ut_v/Ut_h. In case of balanced track utilization, both the island width and height are increased by the minimum step size x. This process is repeated for all body bias islands in the design as long as either the row utilization of the island is higher than a minimum set value, ϕ, or if the total wire length increases. The algorithm returns the floorplan solution with the minimum total wire length of the design.

Figure 7.10 Heuristic algorithm for preferred body bias island width and height.

Algorithm 4: size_shape_bodybiasisland(k)
/* imax is the maximum number of iterations */
/* Ut_h(j) is the average horizontal track utilization for body bias island j */
/* Ut_v(j) is the average vertical track utilization for body bias island j */
/* ϕ is the minimum row utilization for all body bias islands */
/* x is the minimum step size for increasing body bias island width and height */
/* s is a given floorplan solution with resized body bias islands */
/* S is the set of floorplan solutions with resized body bias islands */

1. Set initial value for minimum wire length: ωbest = ∞
2. Set iteration index: i=1
3. Set maximum row utilization for all body bias islands: Uc_bbi(j,i)=1
4. while (i ≤ imax || Uc_bbi(j,i) > ϕ) do
5.     Place-and-route design: obtain Ut_v, Ut_h, Uc_bbi, s
6.     $S = \bigcup_{n=1}^{i} s_n$
7.     for (j = 1 to k) do /* foreach body_bias_island j do */
8.         if (Ut_v(j,i) > Ut_h(j,i)) then
9.             Increase body bias island width: W(j,i)= W(j,i)+(Ut_v(j,i)/Ut_h(j,i))x
10.         elseif (Ut_v(j,i) < Ut_h(j,i)) then
11.             Increase body bias island height: H(j,i)= H(j,i)+(Ut_h(j,i)/Ut_v(j,i))x
12.         else
13.             Increase body bias island width: W(j,i)= W(j,i)+x
14.             Increase body bias island height: H(j,i)= H(j,i)+x
15.         end if
16.     end for
17.     if (wirelength(s) < ωbest ) then
18.         ωbest = wirelength(s)
19.     else
20.         Stop algorithm: i=imax
21.     end if
22.     Increase iteration index: i=i+1
23. end while
24. return s for which total wirelength is minimum
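The per-island resize rule of lines 8 to 15 can be isolated as a small pure function. The following is a minimal sketch of that single step, with hypothetical utilization values in the example; it does not model place-and-route, the row utilization bound ϕ, or the wire length check, and it assumes both utilizations are non-zero.

```python
def resize_island(width_um, height_um, ut_v, ut_h, step_um):
    """One resize step for a single island (lines 8-15 of Algorithm 4).

    ut_v and ut_h are the average vertical and horizontal track
    utilizations of the island; both are assumed to be greater than zero.
    """
    if ut_v > ut_h:
        # Vertical tracks are more congested: widen the island.
        width_um += (ut_v / ut_h) * step_um
    elif ut_v < ut_h:
        # Horizontal tracks are more congested: make the island taller.
        height_um += (ut_h / ut_v) * step_um
    else:
        # Balanced utilization: grow both dimensions by the minimum step.
        width_um += step_um
        height_um += step_um
    return width_um, height_um


# Hypothetical utilizations: vertical tracks twice as loaded as horizontal
print(resize_island(350.0, 350.0, ut_v=0.6, ut_h=0.3, step_um=5.0))
```

Scaling the step by the utilization ratio means a strongly imbalanced island is corrected in fewer place-and-route iterations than a fixed step would need.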


Figure 7.11 Heuristic algorithm for preferred location of the body bias islands.

The relocation optimizer is initiated after resizing and reshaping of the body bias islands. This optimizer attempts to find suitable island locations for which the total wire length is minimum. Basically, these are the locations for which the track utilizations of the island are more or less balanced, and they are spread all over the floorplan. It is very likely that multiple island locations with balanced track utilization are found in close proximity to each other. In addition, it is also very likely that one can find multiple island solutions with a minimum total wire length if there exist many different locations in the floorplan for which the body bias island shows balanced track utilization.

Algorithm 5: relocate_bodybiasisland(s)
/* Ut_h(j) is the average horizontal track utilization for body bias island j */
/* Ut_v(j) is the average vertical track utilization for body bias island j */
/* Θ is the threshold value for balanced routing resources */
/* imax is the maximum allowed iterations per body bias island */
/* m is the number of balanced islands */
/* mmax is the maximum number of balanced islands (reference value) */
/* q is the quadrant number */
/* s is a given floorplan solution with relocated body bias islands */
/* S is the set of floorplan solutions with relocated body bias islands */

1.  Initialize set of floorplan solutions: S = ∅
2.  Set number of balanced islands: m=0
3.  Move islands to the first quadrant
4.  Set quadrant number: q=1
5.  while (q≤4 && m < mmax) do
6.    Set iteration index: i=1
7.    while (i<imax && m < mmax) do
8.      Place-and-route design: obtain Ut_v, Ut_h, Uc_bbi, s
9.      S = S ∪ {s}
10.     for (j = 1 to k) do /* foreach body_bias_island j do */
11.       Calculate balance of routing resources:
            Ut_v_bal(j,i)=|Ut_v_top(j,i)-Ut_v_bottom(j,i)|/max(Ut_v_top(j,i),Ut_v_bottom(j,i))
            Ut_h_bal(j,i)=|Ut_h_right(j,i)-Ut_h_left(j,i)|/max(Ut_h_left(j,i),Ut_h_right(j,i))
            Ut_bal(j,i)= (Ut_v_bal(j,i) + Ut_h_bal(j,i))/2
12.       if (Ut_bal(j,i) < Θ) then
13.         Increase number of balanced islands: m=m+1
14.       else
            if (Ut_v_top(j,i) != Ut_v_bottom(j,i)) then
15.           X(j,i)=X(j,i)·Ut_v(j,i)/Ut_v_top(j,i)
16.         end if
17.         if (Ut_h_left(j,i) != Ut_h_right(j,i)) then
18.           Y(j,i)=Y(j,i)·Ut_h(j,i)/Ut_h_right(j,i)
19.         end if
20.       end if
21.     end for
22.     Increase iteration index: i=i+1
23.   end while
24.   Move islands to the next quadrant
25.   Increase quadrant number: q=q+1
26. end while
27. return s for which total wirelength is minimum
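The balance metric of step 11 and the position update of steps 12 to 20 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the argument order and the guard against a zero denominator are choices made here and are not part of the algorithm listing.

```python
def imbalance(a, b):
    """Relative imbalance |a - b| / max(a, b); 0.0 means perfectly balanced."""
    largest = max(a, b)
    # Guard for the degenerate case of two unused track groups (assumption).
    return abs(a - b) / largest if largest > 0 else 0.0


def island_balance(ut_v_top, ut_v_bottom, ut_h_left, ut_h_right):
    """Ut_bal: mean of the vertical and horizontal track imbalance."""
    ut_v_bal = imbalance(ut_v_top, ut_v_bottom)
    ut_h_bal = imbalance(ut_h_left, ut_h_right)
    return (ut_v_bal + ut_h_bal) / 2


def relocate_step(x, y, ut_v, ut_h,
                  ut_v_top, ut_v_bottom, ut_h_left, ut_h_right, theta):
    """Return (x, y, balanced) for one island after one iteration.

    A balanced island (Ut_bal < theta) keeps its position; otherwise each
    coordinate is scaled by the ratio given in steps 15 and 18.
    """
    if island_balance(ut_v_top, ut_v_bottom, ut_h_left, ut_h_right) < theta:
        return x, y, True
    if ut_v_top != ut_v_bottom:
        x = x * ut_v / ut_v_top
    if ut_h_left != ut_h_right:
        y = y * ut_h / ut_h_right
    return x, y, False
```

The caller would increment the balanced-island counter m whenever the third return value is true.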


Figure 7.11 shows the algorithm for relocation of the body bias islands. The floorplan is explored in a systematic manner; for this purpose the layout plane is divided into quadrants: 1) bottom-left, 2) top-left, 3) top-right, and 4) bottom-right. The island moves to a different quadrant when either a given number of balanced island solutions has been found, or a maximum number of iterations has been reached for exploring a given quadrant. The algorithm searches for body bias island locations with balanced track utilization using an adaptive step. For a body bias island with imbalanced track utilization, a farther location is selected for the next iteration to explore the layout. Conversely, a closer location is selected for the next iteration in case of a balanced island. The algorithm is halted when more candidate locations for balanced islands have been found than a given threshold value, mmax, and it then returns the best floorplan solution. The algorithm moves to a different quadrant after completing a given number of iterations for a quadrant. If all four quadrants have been explored, the algorithm is halted irrespective of the number of balanced islands found, and the best floorplan solution is returned.

Let us now briefly discuss the case of optimizing multiple body bias islands. Two body bias islands replace a single island after its total wire length optimization has been completed. Next, the same procedure is started as for the single island case, as depicted in Figure 7.9. Resizing and relocation of each island is done based on its own track utilization. After the total wire length optimization for the two island case is completed, one of the two body bias islands is replaced by two islands, resulting in three body bias islands in total. Next, the algorithm iterates again to determine the size, shape and location of the body bias islands. The iterative process of increasing the number of islands is repeated until there is no improvement in the total wire length of the design or until the maximum number of iterations has been reached. The algorithm returns the best solution found.
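The outer iteration over the number of islands can be sketched as below. The callback `optimize_with` is a hypothetical stand-in for a complete resizing and relocation run with a given island count, returning the resulting total wire length. It is assumed here that the search stops at the first island count that brings no improvement.

```python
def optimize_island_count(optimize_with, max_islands):
    """Increase the island count until the total wire length stops improving.

    optimize_with(count) -> total wire length after resizing and relocation
    with `count` body bias islands (hypothetical callback).
    Returns (best_count, best_wirelength).
    """
    best_count = 1
    best_wirelength = optimize_with(1)
    for count in range(2, max_islands + 1):
        wirelength = optimize_with(count)
        if wirelength >= best_wirelength:
            # No improvement from splitting one more island: stop searching.
            break
        best_count, best_wirelength = count, wirelength
    return best_count, best_wirelength
```

With a callback that reports a longer total wire length for two islands than for one, the function returns the single-island solution without trying three islands.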

Figure 7.12 Heuristic algorithm results for body bias island integration into the 90nm SVT microprocessor with one 0.5V FBB island. [Plot: wire length overhead [%] (axis from 0% to 20%) versus iteration (0 to 15), covering the resizing and relocation phases.]


Figure 7.13 Floorplanning results for the 90nm SVT microprocessor with one 0.5V FBB island by using the heuristic algorithms. [Layout view with labelled FBB island and NBB island.]

Figure 7.14 Layout of the 90nm SVT microprocessor with one 0.5V FBB island.


Next, let us explore the effectiveness of the heuristic algorithm for integrating body bias islands into the floorplan of the aforementioned microprocessor design in 90nm LP-CMOS. The same design settings have been used as before, namely two body bias clusters (NBB and 0.5V FBB), a total layout area of about 630x630µm2 (80% overall row utilization), and 34% of the processor cell area in the FBB island. In the initial floorplan the FBB island is placed in the center of the floorplan, occupying a layout area of 330x330µm2 (~100% row utilization for the island). The algorithm starts with the island resizing and reshaping phase, followed by the island relocation phase.

Figure 7.12 shows the total wire length overhead for each iteration of the physical design while using the proposed heuristic algorithm. The overhead is measured against a physical design with nominal body bias, i.e. only one cluster at NBB for the whole design (Design-1 in Table 7.2). The results relate to a microprocessor design with a single body bias island. After the resizing phase, an optimum row utilization of 85% for the FBB island (W=330µm, H=354µm) has been obtained, as compared to 78% for the NBB island. Initially, the design faced wire congestion issues at 100% row utilization of the FBB island. The wire congestion was resolved during the resizing phase by lowering the row utilization of the FBB island. After the relocation phase, the overhead in total wire length has decreased from a large initial value of about 18% to a small overhead of about 3%. Figure 7.13 and Figure 7.14 provide a layout view of the obtained solution.

In a next phase, the algorithm divided the single FBB island into two islands for further total wire length optimization by exploring multiple island solutions. It was observed that the algorithm assigns a maximum number of FBB cells to the largest FBB island during the resizing phase. This trend continues until the largest FBB island contains most cells of the body bias cluster, while at the same time the smallest FBB island becomes nearly empty. This behaviour can be explained by the characteristics of timing-driven placement optimization, in which timing-related circuit parts are placed in close proximity. This holds for all design hierarchies in a body bias cluster, which have been selected based on their timing relationship. Consequently, the placement tool attempts to converge all FBB cells into a single FBB island solution. Hence, the use of a single FBB island for each body bias cluster is preferred from the integration point of view.

Finally, let us explore the impact of resizing at the preferred FBB island location, as output by the heuristic algorithm. For this purpose, an FBB island of size 330x330µm2 (~100% row utilization for the island) is placed at the preferred location. Next, the resizing algorithm is started to investigate whether resizing can reduce the total wire length overhead below 3%. An optimum row utilization of 80% for the FBB island (W=337µm, H=369µm) is obtained. The total wire length overhead was comparable to that found before; a wiring overhead of 2.7% was observed as compared to an NBB solution. However, the FBB island is slightly larger than before (80% row utilization instead of 85%). This result shows that not much improvement in total wire length can be expected from an additional resizing phase after execution of the proposed heuristic algorithm. Moreover, a wiring overhead of about 3% versus the NBB design is relatively small. Therefore, it can be concluded that the proposed algorithm successfully dimensions and places the FBB island within the floorplan of the design. Since the algorithm converges in a few iterations, it is much more efficient than a Monte Carlo sampling of island size and position across the whole layout. Despite the heuristic nature of the algorithm, it achieves attractive results for supporting an automated body bias island integration methodology.


Figure 7.15 Digital layout area for the FBB-enabled mixed-signal SoC test-chip.

Table 7.3 Main design characteristics of the mixed-signal SoC test-chip.

Technology              90nm LP-CMOS, triple-well, mixed SVT/HVT, 7 metal layer process
Die Size                3.98mm x 3.98mm
Gate Count              Logic – Total: 1.3MRefGates
                        Logic – FBB island: 214kRefGates
                        SRAM: 1.4Mbit
Supply Voltage          3.3V (I/O), 1.2V (Core)
Clock Frequency         250MHz
Body Bias Technology    Body bias driven design synthesis, body bias islands, embedded FBB generation


7.4 Forward-Body-Bias Integration into a Mixed-Signal System-chip Design

The BBD design approach including body bias islands has been deployed for the design of a mixed-signal system-chip Integrated Circuit in 90nm LP-CMOS. This test-chip consists of analog sensor and interfacing functions, digital processing engines, embedded memories, as well as clock generation and power management units. FBB has been utilized for speed-critical sections of the digital logic circuits. These circuit parts have been clustered into a central body bias island, the FBB island. The digital gates in the island receive an independent body bias for PMOS and NMOS transistors, ranging from nominal body bias up to 0.5V FBB. The FBB is generated by the FBB generator presented in Chapter 4. During standby operation, FBB is always disabled to save leakage power.

The test-chip has been designed based on the BBD design approach presented in Chapter 5. The design goal was to achieve maximum performance without area over-dimensioning, i.e. a maximum-PPA design. FBB has been applied to timing-critical circuit sections only. For this purpose, hierarchy-based body bias clustering has been applied as presented earlier in this chapter. All FBB-enabled digital gates have been clustered into a single FBB island. FBB has been applied to about 17% of the digital core area. Figure 7.15 illustrates the chip layout. The digital core region, which includes the body bias islands, has been highlighted, as well as the FBB generator building block. Table 7.3 shows an overview of the main design characteristics of the 90nm LP-CMOS test-chip design.

Figure 7.16 Maximum performance versus FBB experiments for the mixed-signal SoC test-chip under a reference benchmark application. [Plot: relative maximum performance (axis from 100% to 130%) versus forward body bias (0 to 600 mV), with the benchmark application running.]

Figure 7.16 shows experimental results of the relative maximum operating performance as a function of the applied FBB voltage. FBB has been applied symmetrically to both N-well and P-well regions of the FBB island. Up to 25% clock frequency improvement is observed at 0.5V FBB when operating at VDD=1.2V and room temperature, while running a reference benchmark application. This demonstrates the effectiveness of BBD design including body bias clustering in a system-chip design. Complementarily, the relative active power and energy increase as a function of FBB voltage is illustrated in Figure 7.17. The total power increase is about 28% at 0.5V FBB when operating at VDD=1.2V and room temperature. This power increase is primarily due to the higher frequency of operation when applying FBB. The total power also increases due to an increased junction capacitance of forward body biased transistor devices, as explained in Chapter 2. To illustrate its impact, the relative active energy against the applied FBB voltage is plotted as well. A total energy increase of up to 3% is observed at 0.5V FBB when operating at VDD=1.2V and room temperature. This energy increase is directly proportional to the capacitance increase of the forward body biased junctions. The selective application of FBB through body bias clustering reduces the energy increase significantly versus the case when FBB is applied to the full design.
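As a rough consistency check on the power and energy figures above, the relation between them can be worked out numerically. The sketch assumes that the benchmark executes a fixed number of clock cycles, so that its run time scales with 1/f and its energy with P/f. That assumption and the function name are made here for illustration and are not stated in the measurements.

```python
def relative_energy_increase(power_increase, frequency_increase):
    """Relative energy change for a fixed-cycle workload, with E ~ P / f.

    Both arguments are fractions, e.g. 0.28 for a 28% increase.
    """
    return (1.0 + power_increase) / (1.0 + frequency_increase) - 1.0


# Rounded values quoted in the text for 0.5V FBB at VDD=1.2V.
increase = relative_energy_increase(0.28, 0.25)
print(f"{increase:.1%}")  # prints 2.4%
```

With the rounded inputs of about 28% more power and up to 25% higher clock frequency, the estimate is roughly 2.4%, which is in line with the measured energy increase of up to 3%.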

Figure 7.17 Relative total power and energy increase versus FBB experiments for the mixed-signal SoC test-chip under a reference benchmark application. [Plot: relative power increase (axis from 0% to 30%) and relative energy increase (axis from 0.0% to 3.5%) versus forward body bias (0 to 600 mV), with the benchmark application running.]

7.5 Discussion

In this chapter a new hierarchy-based body bias clustering approach for digital CMOS integrated circuits has been proposed, in which FBB is applied to speed-critical circuit portions only. The focus is on reducing the number of digital gates with FBB to relax the loading constraints of the embedded FBB generator. In contrast to prior art, body bias clustering is done during the logic synthesis phase. This enables designers to perform body bias driven design optimization for achieving performance, power consumption, and area targets in the presence of body bias clusters. In addition, a physical design strategy for body bias clustered designs has been proposed which is based on body bias islands. These new concepts were validated through an industrial processor design in 90nm LP-CMOS. The use of two body bias clusters (NBB and 0.5V FBB) was found to be the optimal choice. Moreover, a physical design based on only two body bias islands for integrating these clusters provided the best results. The proposed clustering approach resulted in up to 4.5x fewer digital gates that require FBB, and up to 2.6x lower leakage than when applying FBB to the whole design. Furthermore, a small (~3%) signal routing overhead w.r.t. an NBB design was observed. Finally, the body bias clustering was demonstrated in a mixed-signal SoC design in 90nm LP-CMOS.


Hierarchy-based body bias clustering offers designers a flexible tool-supported solution to reduce the number of digital gates that should receive FBB. Due to its generic nature, it can be applied at different levels in the design, e.g. IP-block, sub-system, or system-level. Moreover, disjoint design hierarchies may have independent body bias clusters. Although BBD designs with body bias clusters can reach the same performance targets as conventional BBD designs, their use should be limited to cases in which the FBB generator is heavily loaded. This is because the maximum PPA, area occupation, and total power consumption are impacted during design synthesis when FBB is applied to a smaller number of digital gates.


Chapter 8

Concluding Remarks

This chapter summarizes the research contributions of this thesis, highlights the achieved results, and offers suggestions for future work.

8.1 Research Contribution and Results

Conventional digital design practices are based on a worst-case design style to guarantee that chip operation meets timing specifications across the process corners. Worst-case design makes high performance specifications harder to meet due to circuit over-dimensioning, which leads to a larger silicon footprint, higher power consumption and larger leakage. This thesis presents research work on a new design synthesis strategy for digital CMOS circuits that makes use of forward body biasing (FBB). Unlike prior art that only uses silicon tuning for improving product-binning yields and for trading off power against performance, a body bias driven (BBD) design synthesis strategy is proposed that constrains circuit area over-dimensioning. The main research contributions of this work include:

• Quantification of post-silicon tuning technological boundaries for digital circuits through experiments in state-of-the-art 90nm, 65nm and 45nm low-power (LP) CMOS. The results enable designers to judge how much silicon tuning range can be utilized for improving power-performance, reducing leakage and controlling performance variability.

• The introduction of a new figure-of-merit parameter: performance-per-area (PPA). The PPA ratio defines how effectively a circuit design achieves high performance while accounting for the impact of area scaling. There exists a maximum PPA point for every design, which is the desired design point when designing for high performance while avoiding circuit over-dimensioning.

• A body bias driven gate-level optimization method that leverages FBB to improve the PPA ratio of digital CMOS circuits. An in-depth analysis of the BBD design theory has been provided for both high-performance and ultra-low-power digital CMOS circuits. This theory allows designers to predict the design's maximum PPA with a minimum number of synthesis trials.

• A new embedded FBB generator design that holds its FBB output voltage constant relative to the (scalable) power supply of a digital circuit. A modular generator solution has been presented that can drive distinct digital IP block sizes in multiples of up to 1mm2.

• A body bias clustering method based on design hierarchies of timing-critical circuit parts. A greedy algorithm has been provided for assigning design hierarchies to body bias clusters at design-time. The proposed approach enables separation of the circuit into a body biased part and a non-body biased part, while preventing signal routing congestion issues.

• A physical design approach for BBD designs with body bias clusters. The proposed approach is based on utilization of body bias islands in the circuit layout. A heuristic algorithm has been provided that supports automated implementation of body bias islands.

The new BBD design concept has been validated through industrial processor designs in 90nm LP-CMOS. For standard-Vth implementations, PPA improvements of up to 40%, area and leakage reductions of up to 30%, and dynamic power savings of up to 10% without performance penalties were observed. The benefits are larger for high-Vth implementations. In this case, PPA improvements of up to 90%, area and leakage reductions of up to 40%, and dynamic power savings of up to 25% without performance penalties were observed as a benefit of the proposed BBD design strategy.

Extending BBD designs with hierarchy-based body bias clustering enabled the application of FBB to timing-critical circuit parts only. For a 90nm standard-Vth LP-CMOS industrial processor design, the body bias clustering concept yielded up to 4.5x fewer FBB digital gates and leakage reductions of up to 2.6x at a similar circuit speed, as compared to applying FBB to the whole design. The proposed physical design approach for implementing body bias clustered BBD designs showed minimum area and routing overheads as compared to a nominal-body-biased design.

Finally, the BBD design strategy with body bias clustering was deployed in a mixed-signal system-chip design in 90nm LP-CMOS. The test-chip has been designed for operating at the maximum PPA point. The die size is 3.98mm x 3.98mm. At nominal VDD operation, a 25% clock frequency improvement with a total energy increase of only 3% at 0.5V FBB was observed when executing a reference benchmark application. The research described in this thesis has demonstrated the effectiveness of BBD design with body bias clusters in a realistic Integrated Circuit vehicle.

8.2 Outlook and Suggestions for Future work

BBD design enables new opportunities for design optimization of Integrated Circuits by considering body biasing as an additional design parameter during the synthesis process. In this thesis it has been demonstrated that BBD design can be implemented with commercial Electronic Design Automation (EDA) tools. In Chapter 7 the BBD design approach has been extended with hierarchy-based body bias clustering. Although hierarchy-based body bias clustering can also be implemented with commercial EDA tools, there is definitely room for improvement. More research is needed to make body bias clustering an integral part of logic synthesis, e.g. to identify candidate design hierarchies that should receive FBB at synthesis-time instead of at netlist-level. Also, research is needed towards automated body bias cluster integration into the design layout; this concerns automated sizing, placement and splitting approaches for body bias islands. Design tools with such capabilities can significantly improve the design turn-around time of body-bias clustered designs, and may help body-bias clustered designs become a more standard design practice.

From another viewpoint, more research is needed to explore the use of BBD design in functions other than digital logic. One candidate is the static random access memory (SRAM). SRAMs are key Intellectual Property blocks in modern system chip designs. There are opportunities to introduce BBD design into SRAMs. Research is needed to explore whether the peripheral section of the SRAM can benefit from BBD design in the same manner as digital logic circuitry. The body bias of the SRAM periphery can potentially be shared with the body bias of the digital logic part to achieve collective benefits. Additionally, more research is required to investigate whether BBD design can also be applied to the matrix section of the SRAM, e.g. the memory (bit-) cells. BBD design can enable trading off read/write margins of bit-cells against speed and area of the memory matrix.

Besides SRAMs, the use of BBD design concepts should be explored for analog / mixed-signal IP blocks. It is well-known that analog / mixed-signal circuits do not scale as well as digital circuits across process technology nodes. Research is needed to investigate the use of BBD design in analog / mixed-signal IP to improve design characteristics such as performance, power, circuit area and/or circuit robustness in state-of-the-art and next-generation CMOS processes.

The technological boundaries of body biasing were discussed in Chapter 3. For various process technologies, the impact of body biasing on digital circuit performance, power consumption and leakage current has been shown. It has been demonstrated that the application of 0.5V FBB at nominal VDD can provide an additional performance increase of 18%-27% in 90nm, 65nm, and 45nm LP-CMOS. The benefits of BBD design can be further enhanced by following a design/technology co-design strategy. More research is needed to improve the sensitivity of device performance to body bias in state-of-the-art and next-generation CMOS processes, in particular the sensitivity to FBB. Such research could be in the same direction as proposed by Imai et al. for improving RBB sensitivity in CMOS 65nm [40]. Designers can exploit a higher FBB sensitivity in BBD designs in different ways, for example, to achieve higher-performance designs with larger maximum PPA values, or to implement lower-power designs at a given operating frequency when combined with VDD reduction. Moreover, such a design strategy can extend the use of conventional CMOS processes to reach speed and power targets before new technologies, like FinFETs or multi-gate devices, are required. This will be particularly interesting for today's Integrated Circuits and Systems that follow a “More-than-Moore” system integration trend. Such Integrated Circuits and Systems focus on function diversification rather than on increasing transistor density alone, and thereby aim at getting the most out of “older” process technology nodes for mixed-signal function integration. Hence, I believe that the utilization of BBD design in the context of “More-than-Moore” Integrated Circuits and Systems is a seamless fit for enabling a new analog-assisted digital concept.

One of the characteristics of BBD design is the need for FBB post-silicon tuning of individual die samples to correct performance deviations due to fabrication outcome. Such tuning could be part of Integrated Circuit production testing as proposed by [79]. This approach has a few drawbacks: the testing costs increase due to the extra test-time for the silicon tuning tests, and non-volatile storage is needed on-chip for storing the FBB settings that are obtained at testing time. To alleviate these drawbacks, more research is needed to enable automated calibration of Integrated Circuits to compensate for process-dependent performance spreads at application-time, i.e. as part of a chip start-up sequence. Suitable approaches are needed to ensure circuit robustness against D2D and WID variability and to achieve a sufficiently high yield of the Integrated Circuit design. The benefits of a spatial approach for silicon tuning, i.e. the application of different FBB settings for different regions of the Integrated Circuit, should be explored. Furthermore, application-time calibration should also be explored for its potential to compensate for the impact of one or more dynamic effects (e.g. VDD variations, temperature variations, lifetime effects) on chip performance when utilized during run-time of the Integrated Circuit. This work should investigate suitable body bias controllers for offering adaptive body biasing control solutions. Researchers should explore the use of tunable Integrated Circuits in self-healing design concepts, e.g. Integrated Circuits that adjust themselves dynamically to the environment in which they operate to ensure a certain level of functionality and operational performance. Finally, I expect that the world will see the first examples of adaptive Integrated Circuits in production with self-calibration features for compensating against performance variability influences in the next 5 to 10 years.


References

[1] G. E. Moore, “The Microprocessor: Engine of the Technology Revolution,” Communications of the ACM, Vol.40, No.2, February 1997, pp.112-114.

[2] R. Dennard, F. Gaensslen, V. Rideout, E. Bassous, and A. LeBlanc, “Design of Ion-implanted MOSFETs with Very Small Physical Dimensions,” IEEE Journal of Solid State Circuits, Vol. SC-9, No.5, October 1974, pp. 256-268.

[3] D. J. Frank, “Power constrained CMOS scaling limits,” IBM Journal of Research & Development, Vol. 46, No. 2/3, March/May 2002, pp. 235-244.

[4] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, “Parameter Variations and Impact on Circuits and Microarchitecture,” Proceedings of DAC, Anaheim, CA, USA, June 2003, pp. 338-342.

[5] K. A. Bowman, S. G. Duvall, and J. D. Meindl, “Impact of Die-to-Die and Within-Die Parameter Fluctuations on the Maximum Clock Frequency Distribution for Gigascale Integration,” IEEE Journal of Solid State Circuits, Vol. 37, February 2002, pp. 183–190.

[6] S. R. Nassif, “Design for Variability in DSM Technologies,” Proceedings of IEEE ISQED, San Jose, CA, USA, March 2000, pp. 451-454.

[7] D. Brooks and M. Martonosi, “Dynamic Thermal Management for High-Performance Microprocessors,” Proceedings of the Int. Symposium on High-Performance Computer Architecture, Monterrey, Mexico, January 2001, pp. 171-182.

[8] S. Borkar, “Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation,” IEEE Micro, Vol. 25, No. 6, pp. 10-16, November-December 2005.

[9] P. S. Zuchowski, P. A. Habitz, J. D. Hayes, and J. H. Oppold, “Process and Environmental Variation Impacts on ASIC Timing,” Proceedings of ICCAD, San Jose, CA, USA, November 2004, pp.336-342.

[10] A. Wang and A. Chandrakasan, “A 180-mV Subthreshold FFT Processor Using a Minimum Energy Design Methodology,” IEEE Journal of Solid State Circuits, Vol. 40, No. 1, January 2005, pp. 310-319.

[11] B. Zhai et al., “Energy efficient near-threshold chip multi-processing,” Proceedings of ISLPED, Portland, OR, USA, August 2007, pp.32-37.

[12] K. Kuhn, C. Kenyon, A. Kornfeld, M. Liu, A. Maheshwari, W-K Shih, S. Sivakumar, G. Taylor, P. VanDerVoorn, and K. Zawadzki, “Managing Process Variation in Intel’s 45nm CMOS Technology,” Intel Technology Journal, Vol. 12, Issue 2, June 2008, pp. 93-110.

[13] A. Bellaouar et al., “Supply Voltage Scaling for Temperature Insensitive CMOS Circuit Operation,” IEEE Transactions on Circuits and Systems – II: Analog and Digital Signal Processing, Vol.45, No.3, March 1998, pp. 415-417.

[14] C. Visweswariah, K. Ravindran, K. Kalafala, S.G. Walker, S. Narayan, D.K. Beece, J. Piaget, N. Venkateswaran, and J.G. Hemmett, “First-Order Incremental Block-Based Statistical Timing Analysis,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25, No. 10, October 2006, pp. 2170-2180.

[15] J. Jess, K. Kalafala, S. R. Naidu, R. Otten, and C. Visweswariah, “Statistical Timing for Parametric Yield Prediction of Digital Integrated Circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25, No. 11, November 2006, pp. 2376-2392.


[16] T. Chen and S. Naffziger, “Comparison of Adaptive Body Bias (ABB) and Adaptive Supply Voltage (ASV) for Improving Delay and Leakage Under the Presence of Process Variation,” IEEE Transactions on VLSI Systems, Vol.11, No.5, October 2003, pp.888-899.

[17] T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano, M. Norishima, M. Murota, M. Kako, M. Kinugawa, M. Kakumu, and T. Sakurai, “A 0.9-V, 150-MHz, 10-mW, 4 mm2, 2-D Discrete Cosine Transform Core Processor with Variable Threshold-Voltage (VT) Scheme,” IEEE Journal of Solid-State Circuits, Vol.31, No.11, November 1996, pp. 1770-1779.

[18] M. Miyazaki, H. Mizuno, and K. Ishibashi, “A Delay Distribution Squeezing Scheme with Speed-Adaptive Threshold-Voltage CMOS (SA-Vt CMOS) for Low Voltage LSIs,” Proceedings of ISLPED, Monterey, CA, USA, August 1998, pp. 48-53.

[19] J. Tschanz, J. Kao, S. Narendra, R. Nair, D. Antoniadis, A. Chandrakasan, and V. De, “Adaptive Body Bias for Reducing Impacts of Die-to-Die and Within-Die Parameter Variations on Microprocessor Frequency and Leakage,” ISSCC Digest of Technical Papers, San Francisco, CA, USA, February 2002, pp. 344-345.

[20] D. Lackey et al., “Managing power and performance for system-on-chip designs using Voltage Islands,” Proceedings of ICCAD, San Jose, CA, USA, November 2002, pp. 195-202.

[21] L. Nielsen et al., “Low-power operation using self-timed circuits and adaptive scaling of the supply voltage,” IEEE Transactions on VLSI Systems, Vol.2, No.4, December 1994, pp. 391-397.

[22] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” IEEE Micro, Vol. 24, No. 6, November-December 2004, pp. 10-20.

[23] S. Das, C. Tokunaga, S. Pant, W-H Ma, S. Kalaiselvan, K. Lai, D. Bull, and D. Blaauw, “RazorII: In Situ Error Detection and Correction for PVT and SER Tolerance,” IEEE Journal of Solid-State Circuits, Vol. 44, No. 1, January 2009, pp. 32-48.

[24] H. Shichman, D. A. Hodges, “Modeling and Simulation of Insulated-Gate Field-Effect Transistor Circuits,” IEEE Journal of Solid-State Circuits, Vol. SC-3, No. 3, September 1968, pp. 285-289.

[25] B. Razavi, Design of Analog CMOS Integrated Circuits, New York: McGraw-Hill Inc., 2000.

[26] T. Sakurai and R. Newton, “Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas”, IEEE Journal of Solid-State Circuits, Vol.25, No.2, April 1990, pp. 584-593.

[27] I. Sutherland, B. Sproull, D. Harris, Logical Effort: Designing Fast CMOS Circuits, San Francisco: Morgan Kaufmann, 1999.

[28] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicron CMOS Circuits,” Proceedings of the IEEE, Vol. 91, No. 2, February 2003, pp. 305-327.

[29] A. Scholten et al., “The Physical Background of JUNCAP2,” IEEE Transactions on Electron Devices, Vol.53, No.9, September 2006, pp. 2098-2107.

[30] S. M. Sze, Physics of Semiconductor Devices, New York: Wiley, 1981.

[31] P. Chakrabarti, A. Gawarikar, V. Mehta and D. Garg, “Effect of Trap-assisted Tunneling (TAT) on the Performance of Homojunction Mid-Infrared Photodetectors based on InAsSb,” Journal of Microwaves and Optoelectronics, Vol.5, No.1, June 2006, pp.1-14.

[32] A. Keshavarzi et al., “Technology Scaling Behaviour of Optimum Reverse Body Bias for Standby Leakage Power Reduction in CMOS IC’s,” Proceedings of ISLPED, San Diego, CA, USA, August 1999, pp.252-254.

[33] Y. Taur and T. H. Ning, Fundamentals of Modern VLSI Devices, Cambridge Univ. Press, New York, 1998, Chapter 2, pp. 94–95.

[34] S. Huang, C. Wann, Y. Huang, C. Lin, T. Schafbauer, S. Cheng, Y. Cheng, D. Vietzke, M. Eller, C. Lin, Q. Ye, N. Rovedo, S. Biesemans, P. Nguyen, R. Dennard, and B. Chen, “Scalability and Biasing Strategy for CMOS with Active Well Bias,” Proceedings of Symposium on VLSI Technology, Kyoto, Japan, June 2001, pp.107-108.

[35] B. Chatterjee, M. Sachdev, S. Hsu, R. Krishnamurthy, and S. Borkar, “Effectiveness and Scaling Trends of Leakage Control Techniques for Sub-130nm CMOS Technologies,” Proceedings of ISLPED, Seoul, Korea, August 2003, pp. 122-127.

[36] A. Hokazono, S. Balasubramanian, K. Ishimaru, H. Ishiuchi, T. King Liu, and C. Hu, “MOSFET Design for Forward Body Biasing Scheme,” IEEE Electron Device Letters, Vol.27, No.5, May 2006, pp. 387-389.

[37] A. Hokazono, S. Balasubramanian, K. Ishimaru, H. Ishiuchi, C. Hu, and T. King Liu, “Forward Body Biasing as a Bulk-Si CMOS Technology Scaling Strategy,” IEEE Transactions on Electron Devices, Vol.55, No.10, October 2008, pp. 2657-2664.

[38] K. von Arnim, E. Borinski, P. Seegebrecht, H. Fiedler, R. Brederlow, R. Thewes, J. Berthold, and C. Pacha, “Efficiency of Body Biasing in 90 nm CMOS for Low Power Digital Circuits,” Proceedings of ESSCIRC, Leuven, Belgium, September 2004, pp.175-178.

[39] K. von Arnim, E. Borinski, P. Seegebrecht, H. Fiedler, R. Brederlow, R. Thewes, J. Berthold, and C. Pacha, “Efficiency of Body Biasing in 90-nm CMOS for Low-Power Digital Circuits,” IEEE Journal of Solid-State Circuits, Vol.40, No.7, July 2005, pp.1549-1556.

[40] K. Imai, Y. Yamagata, S. Masuoka, N. Kimuzuka, Y. Yasuda, M. Togo, M. Ikeda, and Y. Nakashiba, “Device Technology for Body Biasing Scheme,” Proceedings of ISCAS, Kobe, Japan, May 2005, pp. 13-16.

[41] Y. Yasuda, Y. Akiyama, Y. Yamagata, Y. Goto, and K. Imai, “Design Methodology of Body-Biasing Scheme for Low Power System LSI With Multi-Vth Transistors,” IEEE Transactions on Electron Devices, Vol.54, No. 11, November 2007, pp.2946-2956.

[42] A. Keshavarzi et al., “Forward Body Bias for Microprocessors in 130nm Technology Generation and Beyond,” Symp. on VLSI Circuits Digest of Technical Papers, Honolulu, HI, USA, June 2002, pp.125-128.

[43] B. Choi and Y. Shin, “Lookup Table-Based Adaptive Body Biasing of Multiple Macros,” Proceedings of ISQED, San Jose, CA, USA, March 2007, pp.533-538.

[44] M. Sumita, S. Sakiyama, M. Kinoshita, Y. Araki, Y. Ikeda, and K. Fukuoka, “Mixed Body-Bias Techniques with Fixed Vt and Ids Generation Circuits,” IEEE Journal of Solid-State Circuits, Vol.40, No.1, January 2005, pp.60-66.

[45] Y. Komatsu, K. Ishibashi, M. Yamamoto, T. Tsukada, K. Shimazaki, M. Fukazawa, and M. Nagata, “Substrate-Noise and Random-Fluctuations Reduction with Self-Adjusted Forward Body Bias,” Proceedings of CICC, San Jose, CA, USA, September 2005, pp. 35-38.

[46] G. Ono, M. Miyazaki, K. Watanabe, and T. Kawahara, “An LSI System with Locked in Temperature Insensitive State Achieved by Using Body Bias Technique,” Proceedings of ISCAS, Kobe, Japan, May 2005, pp.632-635.

[47] K. Kim, and Y-B. Kim, “Optimal Body Biasing for Minimum Leakage Power in Standby Mode,” Proceedings of ISCAS, New Orleans, LA, USA, May 2007, pp.1161-1164.

[48] M. Miyazaki, G. Ono, and T. Kawahara, “Optimum Threshold-Voltage Tuning for Low-Power, High-Performance Microprocessor,” Proceedings of ISCAS, Kobe, Japan, May 2005, pp. 17-20.

[49] A. Ochoa, Jr. and P.V. Dressendorfer, “A Discussion of the Role of Distributed Effects in Latchup,” IEEE Transactions on Nuclear Science, Vol. NS-28, No. 6, December 1981, pp. 4292-4294.

[50] J. Dooley and R. Jaeger, “Temperature dependence of latchup in CMOS circuits,” IEEE Electron Device Letters, Vol. EDL-5, February 1984, pp. 41-43.

[51] J. Zhang, “Worst Case Design of Digital Integrated Circuits,” Proceedings of ISCAS, London, UK, June 1994, pp.153-156.

[52] S. Duvall, “A Practical Methodology for the Statistical Design of Complex Logic Products for Performance,” IEEE Trans. on VLSI Systems, Vol.3, No.1, March 1995, pp.112-123.

[53] A. Nardi et al., “Impact of Unrealistic Worst Case Modeling on the Performance of VLSI Circuits in Deep Submicron CMOS Technologies,” IEEE Transactions on Semiconductor Manufacturing, Vol.12, No.4, November 1999, pp.396-403.

[54] M. Mani et al., “Joint Design-Time and Post-Silicon Minimization of Parametric Yield Loss using Adjustable Robust Optimization,” Proceedings of ICCAD, San Jose, CA, USA, November 2006, pp.19-26.

[55] S. Kulkarni et al., “A Statistical Framework for Post-Silicon Tuning through Body Bias Clustering,” Proceedings of ICCAD, San Jose, CA, USA, November 2006, pp.39-46.

[56] R. Teodorescu et al., “Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing,” Proceedings of MICRO-40, Chicago, IL, USA, December 2007, pp.27-39.

[57] A. Sathanur et al., “Physically Clustered Forward Body Biasing for Variability Compensation in Nanometer CMOS design,” Proceedings of DATE, Nice, France, April 2009, pp.154-159.

[58] M. Hirabayashi et al., “Design Methodology and Optimization Strategy for Dual-VTH Scheme using Commercially Available Tools,” Proceedings of ISLPED, Huntington Beach, CA, USA, August 2001, pp.283-286.

[59] Y. Liu and J. Hu, “A New Algorithm for Simultaneous Gate Sizing and Threshold Voltage Assignment,” IEEE Trans. on Computer-Aided Design of Integrated Circuits, Vol. 29, No. 2, February 2010, pp. 223-234.

[60] J.F. Bonnans, J.C. Gilbert, C. Lemarechal, C.A. Sagastizabal, Numerical Optimization: Theoretical and Practical Aspects, Second edition, Berlin Heidelberg New York: Springer-Verlag, 2006.

[61] B. Zhai et al., “A 2.60pJ/Inst Subthreshold Sensor Processor for Optimal Energy Efficiency,” Proceedings of Symposium on VLSI Circuits, Honolulu, Hawaii, USA, June 2006, pp. 154–155.

[62] D. Friedman et al., “A low-power CMOS integrated circuit for field-powered radio frequency identification tags,” Proceedings of ISSCC, San Francisco, CA, USA, February 1997, pp. 294–295.

[63] C. Kim, H. Soeleman, K. Roy, “Ultra-Low-Power DLMS Adaptive Filter for Hearing Aid Applications,” IEEE Transactions on VLSI Systems, Vol. 11, No. 6, December 2003, pp. 1058-1067.

[64] A. Wang, A. Chandrakasan, and S. Kosonocky, “Optimal Supply and Threshold Scaling for Subthreshold CMOS Circuits,” Proceedings of Symposium on VLSI Circuits, April 2002, pp.7

[65] B. Calhoun et al., “Device Sizing for Minimum Energy Operation in Subthreshold Circuits,” Proceedings of CICC, Orlando, FL, USA, May 2004, pp. 95-98.

[66] R. Dreslinski et al., “Near-Threshold Computing: Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits,” Proceedings of the IEEE, Vol. 98, No. 2, February 2010, pp.253-266.

[67] Y. Pu et al., “An Ultra Low-Energy/Frame Multi-standard JPEG Co-processor in 65nm CMOS with Sub/Near Threshold Power Supply,” Proceedings of ISSCC, San Francisco, CA, USA, February 2009, pp. 146-147.

[68] B. Zhai et al., “Analysis and Mitigation of Variability in Subthreshold Design,” Proceedings of ISLPED, San Diego, CA, USA, August 2005, pp. 20-25.

[69] J. Kwong and A. Chandrakasan, “Variation Driven Device Sizing for Minimum Energy Subthreshold Circuits,” Proceedings of ISLPED, Tegernsee, Germany, October 2006, pp. 8-13.

[70] J.F. Ryan et al., “Analyzing and Modeling Process Balance for Sub-threshold Circuit Design,” Proceedings of Great Lakes Symposium on VLSI, Stresa-Lago Maggiore, Italy, March 2007, pp. 275-280.

[71] B. Mishra et al., “Variation Resilient Adaptive Controller for Subthreshold Circuits,” Proceedings of DATE, Nice, France, September 2009, pp. 142-147.

[72] N. Jayakumar and S.P. Khatri, “A Variation-tolerant Sub-threshold Design Approach,” Proceedings of DAC, Anaheim, CA, USA, June 2005, pp. 529-534.

[73] K. Hamamoto et al., “Tuning-Friendly Body Bias Clustering for Compensating Random Variability in Subthreshold Circuits,” Proceedings of ISLPED, San Francisco, CA, USA, August 2009, pp.51-56.

[74] D. Markovic et al., “Ultralow-Power Design in Near-Threshold Region,” Proceedings of the IEEE, Vol. 98, No.2, February 2010, pp. 237-252.

[75] J. Gregg and T. Chen, “Post Silicon Power/Performance Optimization in the Presence of Process Variations Using Individual Well-Adaptive Body Biasing,” IEEE Transactions on VLSI Systems, Vol. 15, No.3, March 2007, pp. 366-376.

[76] S. Narendra et al., “Ultra-Low Voltage Circuits and Processor in 180nm to 90nm Technologies with a Swapped-Body Biasing Technique,” ISSCC Digest of Technical Papers, San Francisco, CA, USA, February 2004, pp. 156-157.

[77] C. Zhuo, D. Blaauw, and D. Sylvester, “Variation Aware Gate Sizing and Clustering for Post-Silicon Optimized Circuits,” Proceedings of ISLPED, Bangalore, India, August 2008, pp. 105-110.

[78] C. Zhuo, J-H Chang, D. Sylvester, and D. Blaauw, “Design time body bias selection for parametric yield improvement,” Proceedings of ASPDAC, Taipei, Taiwan, January 2010, pp. 681-688.

[79] S. Kumar et al., “Body Bias Voltage Computations for Process and Temperature Compensation,” IEEE Transactions on VLSI Systems, Vol. 16, No. 3, March 2008, pp. 249-262.

List of Publications

1. M. Meijer, F. Pessolano and J. Pineda de Gyvez, “Technology Exploration for Adaptive Power and Frequency Scaling in 90nm CMOS,” Proceedings of ISLPED, Newport Beach, CA, USA, August 2004, pp. 14-19.

2. M. Meijer, F. Pessolano and J. Pineda de Gyvez, “Limits to Performance Spread Tuning using Adaptive Voltage and Body Biasing,” Proceedings of ISCAS, Kobe, Japan, May 2005, pp. 23-26.

3. M. Meijer and J. Pineda de Gyvez, “Technological Boundaries of Voltage and Frequency Scaling for Power Performance Tuning,” in Adaptive Techniques for Dynamic Processor Optimization, A. Wang and S. Naffziger, Eds., Springer, 2008, pp.25-47.

4. M. Meijer, B. Liu, R. van Veen and J. Pineda de Gyvez, “Post-Silicon Tuning Capabilities of 45nm Low-Power CMOS Digital Circuits,” Proceedings of Symposium on VLSI Circuits, Kyoto, Japan, June 2009, pp.110-111.

5. M. Meijer, and J. Pineda de Gyvez, “Body Bias Driven Design Synthesis for Optimum Performance Per Area,” Proceedings of ISQED, San Jose, CA, USA, March 2010, pp. 472-477.

6. R.I.M.P. Meijer and J. J. Pineda de Gyvez, “Body Biasing,” WIPO Patent Application WO 2010052607, May 14, 2010.

7. R.I.M.P. Meijer, “Integrated Circuit,” WIPO Patent Application WO 2010073166, July 1, 2010.

8. M. Meijer, J. Pineda de Gyvez, B. Kup, B. van Uden, P. Bastiaansen, M. Lammers, and M. Vertregt, “A Forward Body Bias Generator for Digital CMOS Circuits with Supply Voltage Scaling,” Proceedings of ISCAS, Paris, France, June 2010, pp. 2482-2485.

9. M. Meijer, J. Pineda de Gyvez, and A. Kapoor, “Ultra-Low-Power Digital Design with Body Biasing for Low Area and Performance-Efficient Operation,” Journal of Low Power Electronics, Vol.6, No. 4, 2011, pp. 1-12.

10. M. Meijer and J. Pineda de Gyvez, “Body Bias Driven Design Strategy for Area and Performance Efficient CMOS Circuits,” IEEE Transactions on VLSI Systems, accepted for publication.

Curriculum Vitae

R.I.M.P. (Maurice) Meijer was born on March 8, 1975 in Montfort, The Netherlands. He received the B.Eng. degree (cum laude) in electrical engineering from Eindhoven Polytechnic, The Netherlands, in 1999 and the M.Sc. degree in electrical engineering from Eindhoven University of Technology, The Netherlands, in 2004. He has been working towards the Ph.D. degree at the Faculty of Electrical Engineering, Eindhoven University of Technology. He expects to receive this degree based on the work presented in this thesis on December 7, 2011. From February 1999 to August 2006, he was a research scientist with the Digital Design and Test group of Philips Research Laboratories, The Netherlands, where he worked on signal integrity aspects of digital integrated circuits in deep-submicron CMOS technologies, low-power digital circuit design, and adaptive power management for System-on-Chip applications. Currently, R.I.M.P. Meijer is a senior scientist with the Central Research and Development division of NXP Semiconductors, The Netherlands. His research interests are in the areas of (ultra-) low-power design, design for variation-resilient circuit operation, and system’s power management for high-performance mixed-signal integrated circuits.
