

Adaptive and Low-Complexity
Microarchitectures for Power Reduction

Jaume Abella Ferrer

2005

Advisor: Antonio González Colás

A thesis submitted in fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY / DOCTOR PER LA UPC

Departament d'Arquitectura de Computadors
Universitat Politècnica de Catalunya


Hail, daughters of Zeus! Grant me the spell of your song.
Celebrate the sacred lineage of the everlasting immortals,
those who were born of Gaia and starry Uranus,
those who were born of dark Night, and those the briny Pontus reared.
(...) And inspire me with this, Muses,
you who from the beginning dwell in the Olympian mansions,
and tell me which of these came first.

Hesiod – "Theogony"


To you who taught me to tie my shoes,
to you who taught me to play chess,
wherever you may be,
thank you


Abstract

Technology and microarchitecture evolution is driving microprocessors towards higher clock frequencies and higher integration scale. These two factors translate into higher power density, which calls for more sophisticated and expensive cooling systems. Reducing power dissipation can be very beneficial, not only in terms of cooling cost, but also for saving energy, increasing performance for a given thermal solution, or extending battery life.

Processors are often designed to achieve high performance for a wide range of applications with different resource requirements. Thus, it is often the case that the resources are underutilized, and energy can be saved because resources waste energy while they are idle. In general, structures are sized in such a way that making them larger hardly increases performance, but making them smaller may harm performance for some programs or for some parts of some programs. Thus, there is room to dynamically adapt these structures to cut the energy consumption of those parts that do not contribute to performance. Additionally, this type of worst-case design requires complex, power-hungry structures.

This thesis presents new microarchitectural techniques to reduce the energy consumption and complexity of the main microprocessor structures. We propose new cache memory, issue logic, load/store queue and clustered microarchitecture designs, as well as techniques to dynamically resize these structures. We show that the proposals presented in this dissertation significantly reduce dynamic and leakage energy by means of low-complexity structures and resizing mechanisms.


Acknowledgements

It has been a long road, and many of you have helped me along the way. How can I thank you all? A good question, which these lines attempt to answer. First of all, I want to thank my thesis advisor, Antonio González, for all his support during these years and for his patience with my outbursts, especially at the beginning, when the only thing conferences told me was "rejected". I also want to thank my parents and my sister for providing me with that much-needed refuge. I have always had their support and their confidence in me.

Now I look back and remember the day I joined this department, on the MHAOTEU project. There I met my first true friends, the ones who know me better than I know myself and who have taken care of me all this time. Xavi has always been a friend and, often, a big brother. I can never thank him enough for everything he has taught me as a person, nor for how much he has helped me professionally. It has always been a pleasure to work with him, and I hope to keep doing so for a long time to come. When I think of Xavi I also think of Nerina, and even if it annoys them, whenever I remember one I remember the other. If there is anyone I can say I truly love, it is those two. Nerina has often been the counterweight that helped me keep my feet on the ground and protected me from myself. She is the friend I turn to when the ground gives way under my feet. Someone once drew up my astral chart, and it said I am a Capricorn with Libra rising. As it happens, Xavi is a Capricorn and Nerina a Libra. Perhaps that is why they are two basic pillars of my life. For me they are always there, and I try to help them whenever they let me, even if I am not very good at it. Rest assured that most of the good things in me are their fault.

It would be unfair now not to thank Josep Maria for the many conversations on the way to and from the "súper", and for a complicity that few people can offer. May Coca-Cola help you stay just as you are for many years!! (although I reserve the right to veto your jokes!!).

Chronologically, it is now time to talk about Alex Pajuelo, another Capricorn. Good people, those Capricorns ;-). I think we may have shared a thousand coffees, and often conversations that have contributed to our friendship.

Here I have also made another friend for life: Fran. He is a Libra... one of life's coincidences? He is "papá pato", and his friends are his ducklings. They sometimes say that true friends can be counted on the fingers of one hand. He is one of my fingers.

And forgive me if I merely list the rest of you, but I would need another thesis to tell you how important you are to me: Xavi Verdú (always radiating warmth!!), Germán (and his passion for the PP), Oliver (alias transformer), Ayose (such intimate moments in Madrid and Denver!!), Carmelo (in his day), Marco (who at six in the evening still says "pronto"), Ale, Eduard, Enric, Ramon, Pedro, Llorenç, Suso, Pepe, Fernando, Daniel J., Daniel O., Jordi G., Alex A., Raimir, Rubén González, Rubén Gran,... and so many others I have left out.

Incredible as it may seem, I am still not done, because there is also life outside the university!! I want to thank Gazmira for showing me how important the little things are and for opening her heart to me so sincerely. Thanks also to Carla and Fàtima for staying close and listening when I needed it. Thanks to Carmina, Nay and Liz for trusting me so much.

The road is long, and this thesis is only one stage of it. The time has come to continue along the road. Here you have the fruit of these years of work.

And let me not forget the official acknowledgements: this work has been partially supported by the Ministry of Education and Science under grants AP2002-3677, TIN2004-07739-C02-01 and TIN2004-03072, the CICYT project TIC2001-0995-C02-01, Feder funds, and Intel Corporation. We would like to thank the anonymous reviewers of the papers in this thesis for their comments.

Contents

Abstract  viii
Acknowledgments  x
Contents  xiii
List of Figures  xvii
List of Tables  xxi

1 Introduction  1
  1.1 Sources of Power Dissipation  3
  1.2 Motivation  4
  1.3 Power Efficiency Metrics  7
  1.4 Cache Memories  7
    1.4.1 Fast and Slow L1 Data Cache  8
    1.4.2 Low Leakage L2 Cache  8
    1.4.3 Heterogeneous Way-Size Caches  8
  1.5 Issue Logic  9
    1.5.1 Adaptive Issue Queue and Register File  9
    1.5.2 Low-Complexity Floating-Point Issue Logic  10
  1.6 Load/Store Queue  10
  1.7 Clustered Microarchitectures  11
  1.8 Organization  12

2 Evaluation Framework  13
  2.1 Benchmarks  15
  2.2 Tools and simulators  17
    2.2.1 Simplescalar  17
    2.2.2 Wattch  18
    2.2.3 CACTI  19

3 Cache Memories  21
  3.1 Related Work  24
    3.1.1 Low Miss Rate Schemes  24
    3.1.2 Pseudo-Associative Caches  24
    3.1.3 Non-Resizing Low Power Schemes  25
    3.1.4 Resizing Low Power Schemes  25
  3.2 Fast and Slow L1 Data Cache  26
    3.2.1 Energy and Delay Models in CMOS Circuits  26
    3.2.2 Criticality  27
    3.2.3 Cache Organizations  28
    3.2.4 Performance Evaluation  30
    3.2.5 Conclusions  39
  3.3 IATAC: Low Leakage L2 Cache  40
    3.3.1 Predictors for L2 Caches  40
    3.3.2 Performance Evaluation  49
    3.3.3 Conclusions  59
  3.4 Heterogeneous Way-Size Caches  60
    3.4.1 Heterogeneous Way-Size Cache (HWS Cache)  60
    3.4.2 HWS Cache Evaluation  65
    3.4.3 Dynamically Adaptive HWS cache (DAHWS cache)  74
    3.4.4 DAHWS Cache Evaluation  78
    3.4.5 Conclusions  85

4 Issue Logic  87
  4.1 Related Work  89
    4.1.1 Basic CAM-based Approaches  90
    4.1.2 Matrix-based Approaches  92
    4.1.3 Issue Logic Based on Dynamic Code Pre-Scheduling  93
    4.1.4 Issue Logic Based on Dependence Tracking  93
  4.2 Adaptive Issue Queue and Register File  94
    4.2.1 Baseline Microarchitecture  94
    4.2.2 Adaptive Schemes  98
    4.2.3 Performance Evaluation  102
    4.2.4 Conclusions  112
  4.3 Low-Complexity Floating-Point Issue Logic  112
    4.3.1 Proposed Issue Logic Design  113
    4.3.2 Performance Evaluation  122
    4.3.3 Conclusions  127

5 Load/Store Queues  129
  5.1 Related Work  131
  5.2 SAMIE-LSQ: A New LSQ Organization for Low Power and Low Complexity  132
    5.2.1 SAMIE-LSQ  133
    5.2.2 Performance Evaluation  141
    5.2.3 Conclusions  150

6 Clustered Microarchitectures  151
  6.1 Related Work  154
  6.2 Ring Clustered Microarchitecture  155
    6.2.1 Ring Clustered Processor  156
    6.2.2 Performance Evaluation  163
    6.2.3 Conclusions  173

7 Conclusions  175
  7.1 Contributions  177
  7.2 Future Work  178

Bibliography  181

List of Figures

1.1 Power evolution for Intel microprocessors  5
1.2 Power evolution for AMD microprocessors  5
1.3 Cooling system cost with respect to the power dissipation [55]  6
1.4 Temperature distribution of a microprocessor [55]  6

2.1 Pipeline for sim-outorder  18

3.1 Power dissipation compared to the baseline technology for different VTH and VDD values  29
3.2 Load criticality distribution for different cache sizes of the 2-way set-associative slow cache configuration  32
3.3 IPC loss of criticality-based cache for the guided and the random versions w.r.t. the baseline  32
3.4 Performance loss of locality-based and criticality-based organizations w.r.t. the baseline  33
3.5 Miss ratio breakdown of critical and non-critical loads for the 16KB baseline cache  34
3.6 Miss ratio breakdown of critical and non-critical loads for the 32KB baseline cache  35
3.7 Miss ratio breakdown of critical and non-critical loads for the 64KB baseline cache  35
3.8 Dynamic energy consumption for a 16K baseline cache  37
3.9 Dynamic energy consumption for a 32K baseline cache  37
3.10 Dynamic energy consumption for a 64K baseline cache  38
3.11 Leakage energy requirements for the different cache organizations and sizes  39
3.12 Average time between hits to cache lines (time between hits) and the average time from the last access to replacement (time before miss) against varying numbers of accesses. The results correspond to four representative programs (note logarithmic scale)  42
3.13 Structures required for the IATAC mechanism for a 4MB (512KB) L2 cache  43
3.14 Algorithm of the IATAC mechanism  44
3.15 Mechanism to update the decay interval for the adaptive mode control  49
3.16 IPC degradation for the different mechanisms for 512KB and 4MB L2 caches  51
3.17 L2 turn off cache line ratio for the different mechanisms for 512KB and 4MB L2 caches  51
3.18 L2 miss ratio for the different mechanisms for 512KB and 4MB L2 caches  52
3.19 IPC loss for the SPEC CPU2000 benchmarks and a 512KB L2 cache  53
3.20 Number of misses for the SPEC CPU2000 benchmarks and a 512KB L2 cache  54
3.21 Decay interval coefficient of variation  55
3.22 Energy consumption for the different mechanisms for 512KB and 4MB L2 caches  58
3.23 EDP for the different mechanisms for 512KB and 4MB L2 caches  58
3.24 ED2P for the different mechanisms for 512KB and 4MB L2 caches  59
3.25 Associativity utilization for the L1 data cache  61
3.26 Associativity utilization for the L1 instruction cache  62
3.27 Associativity utilization for the L2 unified cache  62
3.28 Indexing functions for a conventional cache (left) and a HWS cache (right)  63
3.29 Example of RIT update for a HWS cache  64
3.30 Number of cache configurations for associativity ranging from 2 to 8, and number of different way sizes ranging from 2 to 6  66
3.31 Hit rate for 2-way set-associative L1 Dcaches  67
3.32 Hit rate for 3-way set-associative L1 Dcaches  68
3.33 Hit rate for 4-way set-associative L1 Dcaches  69
3.34 Example of better behavior of HWS cache with respect to a conventional cache  71
3.35 Signature Size resizing algorithm [38]  75
3.36 DAHWS cache resizing algorithm  77
3.37 Miss rate, percentage of active lines, IPC and number of reconfigurations for the different L1 data cache resizing schemes. The number of reconfigurations is split according to the number of ways whose size is changed  81
3.38 Miss rate, percentage of active lines, IPC and number of reconfigurations for the different L1 instruction cache resizing schemes. The number of reconfigurations is split according to the number of ways whose size is changed  82
3.39 Miss rate, percentage of active lines, IPC and number of reconfigurations for the L2 unified maxDAHWS cache. The number of reconfigurations is split according to the number of ways whose size is changed  84

4.1 Issue logic for an entry of a CAM/RAM-array  90
4.2 Multiple-banked issue queue  95
4.3 Scheme of a read operation  96
4.4 Scheme of a write operation  96
4.5 Heuristic to resize the reorder buffer and the issue queue  100
4.6 IPC for different interval lengths  104
4.7 Reorder buffer occupancy reduction for different interval lengths  104
4.8 IPC loss for the different techniques  106
4.9 Issue queue dynamic energy savings  107
4.10 Issue queue leakage energy savings  107
4.11 Dynamic energy savings for the integer register file and rename buffers w.r.t. the baseline  109
4.12 Leakage energy savings for the integer register file and rename buffers w.r.t. the baseline  109
4.13 Dynamic energy savings for the FP register file and rename buffers w.r.t. the baseline  110
4.14 Leakage energy savings for the FP register file and rename buffers w.r.t. the baseline  110
4.15 Reduction in number of dispatched instructions  111
4.16 IPC loss of IssueFIFO technique w.r.t. the unbounded conventional issue queue  115
4.17 Issue time computation for LatFIFO scheme  116
4.18 IPC loss of LatFIFO technique w.r.t. unbounded conventional issue queue for the FP benchmarks  118
4.19 Example of selection  120
4.20 IPC loss of MixBUFF technique w.r.t. unbounded conventional issue queue for the FP benchmarks  121
4.21 Performance for the integer benchmarks  123
4.22 Performance for the FP benchmarks  124
4.23 Energy breakdown for the different schemes  125
4.24 Normalized power dissipation  126
4.25 Normalized energy consumption  126
4.26 Normalized EDP  126
4.27 Normalized ED2P  126

5.1 IPC of ARB with respect to an ideal unbounded LSQ. Configurations with different number of banks and addresses per bank are shown  134
5.2 SAMIE-LSQ organization  135
5.3 Average number of entries occupied in an unbounded SharedLSQ for different configurations of the DistribLSQ  139
5.4 Number of programs that do not use the AddrBuffer during 99% of their execution for a varying number of SharedLSQ entries  140
5.5 IPC loss of SAMIE-LSQ with respect to the 128-entry conventional LSQ  145
5.6 Number of deadlock-avoidance pipeline flushes per million cycles for SAMIE-LSQ  145
5.7 Dynamic energy consumption for the LSQ  146
5.8 Dynamic energy consumption breakdown for the SAMIE-LSQ  147
5.9 Dynamic energy consumption for the L1 data cache  148
5.10 Dynamic energy consumption for the data TLB  148
5.11 Accumulated active area in mm2 for the LSQ  149
5.12 Active area breakdown for the SAMIE-LSQ  149

6.1 Ring clustered microarchitecture  157
6.2 Steering algorithm for the ring clustered microprocessor  158
6.3 Example of the steering algorithm  159
6.4 Placement alternatives for 8 clusters  160
6.5 High level layout for cluster modules  161
6.6 High level layout for cluster modules with integer and FP independent rings  163
6.7 Steering algorithm for the conventional clustered microprocessor  164
6.8 Speedup of Ring over Conv  166
6.9 Average number of communications per instruction  167
6.10 Average distance per communication  168
6.11 Average delay per communication due to bus contention  168
6.12 Workload imbalance using NREADY figure  169
6.13 Distribution of the dispatched instructions across the clusters  170
6.14 Speedup of Ring over Conv for different bus latencies  171
6.15 Simple Steering algorithm for both Ring and Conv processors  171
6.16 Speedup of Ring+SSA over Conv+SSA  172
6.17 Workload imbalance using NREADY figure with the Simple Steering Algorithm  173

List of Tables

2.1 Compile and run commands for the SPEC CPU2000  16
2.2 Fast forwarded instructions for the SPEC CPU2000  17

3.1 Cache sizes used in the comparison  30
3.2 Processor configuration  31
3.3 Processor configuration  50
3.4 Energy model  56
3.5 Energy breakdown for IATAC mechanism  57
3.6 Feasible configurations for a HWS cache with associativity 2 or 3 and capacity ranging from 8KB to 32KB  65
3.7 3-way set-associative L1 Dcaches (conventional caches are represented in italics)  68
3.8 4-way set-associative L1 Dcaches (conventional caches are represented in italics)  70
3.9 3-way and 4-way set-associative L1 Icaches (conventional caches are represented in italics)  72
3.10 3-way and 4-way set-associative L2 caches (conventional caches are represented in italics)  73
3.11 Processor configuration  79

4.1 Delay and energy for the different components of a multiple-banked register file design  97
4.2 Delays and energy for read/write operations in the sequential and parallel schemes  97
4.3 Processor configuration  103
4.4 Reorder buffer size reduction  106
4.5 Summary of results  112
4.6 Processor configuration  114

5.1 Access time of conventional cache accesses and access time when the physical cache line is known for different cache configurations. The number of bytes per line is 32 in all configurations  141
5.2 Processor configuration  142
5.3 SAMIE-LSQ configuration  142
5.4 Energy consumption of the different types of accesses to a 128-entry conventional LSQ  143
5.5 Energy consumption for the different activities of the SAMIE-LSQ  143
5.6 Area of the different components of the conventional LSQ and SAMIE-LSQ  144

6.1 Area of the main cluster's blocks  161
6.2 Processor configuration  165
6.3 Evaluated configurations  165

CHAPTER 1

INTRODUCTION

Technology and microarchitecture evolution is driving microprocessors towards higher clock frequencies and higher integration scale. These two factors translate into higher power density, which calls for more sophisticated and expensive cooling systems. Reducing power dissipation can be very beneficial not only in terms of cooling cost reduction, but also for saving energy, increasing performance for a given thermal solution, or extending battery life.

Energy consumption can be classified into two basic categories: dynamic energy consumption and leakage energy (or leakage for short). Dynamic energy consumption is produced by circuit activity (transitions from 1 to 0, and vice versa), whereas leakage is caused by any powered-on circuit due to the intrinsic characteristics of the CMOS technology and its fabrication process.

Temperature is strongly related to energy consumption. Reducing the energy consumption in the hottest spots of the processor is crucial to prevent the chip from reaching excessively high temperatures. If the temperature is too high, the processor may slow down or even stop the execution of programs. Thus, it may be beneficial to sacrifice some performance to keep the temperature at bearable levels and prevent such temperature emergencies from happening. Temperature emergencies can also be prevented by distributing the activity across larger areas. If we succeed in distributing the activity homogeneously, the peak temperature reached in the chip will be lower and fewer stalls will be required.

    1.1 Sources of Power Dissipation

The sources of energy consumption on a CMOS chip can be classified as dynamic and static power dissipation. The dominant component of energy consumption in CMOS is dynamic power consumption, caused by changes in the state of the circuit. A first-order approximation of the dynamic power consumption of CMOS circuitry is given by the formula:

P = C · V_DD² · f

where P is the power, C is the effective switched capacitance, V_DD is the supply voltage, and f is the frequency of operation. Dynamic power dissipation arises from two main sources:

  • 4 · Chapter 1. Introduction

• The main source is the charging and discharging of the circuit's parasitic capacitances. Every low-to-high and high-to-low logic transition in a node incurs a change of charge in the associated parasitic capacitance, dissipating power, which translates into heat.

• Short-circuit power dissipation is caused by short-circuit currents. During a transition on the input of a CMOS gate, both the p- and n-channel devices may conduct simultaneously, briefly establishing a short circuit from the supply voltage to ground. Short-circuit power dissipation is usually not very significant.
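As a numeric illustration of the first-order formula above, the following C fragment simply evaluates it; the constants are purely illustrative and do not correspond to any particular processor.

    #include <stdio.h>

    /* First-order dynamic power: P = C * VDD^2 * f.
       The constants below are illustrative only. */
    int main(void) {
        double C   = 20e-9; /* effective switched capacitance (F) */
        double VDD = 1.2;   /* supply voltage (V) */
        double f   = 2e9;   /* clock frequency (Hz) */

        double P = C * VDD * VDD * f;
        printf("Dynamic power: %.1f W\n", P); /* prints 57.6 W */
        return 0;
    }

Note the quadratic dependence on V_DD: lowering the supply voltage is the most effective lever, which is why adjusting supply and threshold voltages reappears in section 3.2.1.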

Static energy consumption is basically caused by leakage currents; thus, static energy is often referred to as leakage energy. Leakage power is growing in CMOS chips. Until recently it was a second-order effect; however, the total amount of leakage grows exponentially with every new technology generation. Different studies [112, 120, 125] predict that it can be as significant as dynamic power in near-future technologies.

    1.2 Motivation

Energy consumption is a concern in current and future microprocessors. As technology evolves, power dissipation, energy consumption and chip temperature become more and more critical. Designing low-power, high-performance structures is basic to making processors more powerful. To show how significant power consumption is, we use figures 1.1 and 1.2 from [106], which show the evolution of power dissipation over recent years for Intel and AMD processors. Power dissipation grows from generation to generation and, hence, the cooling system becomes more expensive, as shown in figure 1.3 [55]. We observe that the Pentium 4 maximum power dissipation is around 100W, which is beyond the limit of cheap cooling systems. Further increases in power dissipation imply a huge cooling system cost. Thus, saving energy is crucial for future designs. Additionally, saving energy increases the battery life of laptops and embedded systems.

Another issue strongly related to power dissipation is the temperature reached in the different parts of the chip. Dissipating power increases the temperature of the chip, which requires complex cooling systems to keep it at a bearable level. Moreover, the temperature is not distributed homogeneously across the chip, as shown in figure 1.4 [55]. Some structures, like the issue logic, reach higher temperatures than others due to their high power density. Thus, reducing the power dissipation of these structures has a deeper impact on the cooling solution than saving energy in other structures.

Summing up, saving energy is beneficial to extend battery life. The cooling system cost is also reduced by saving energy, especially if we save energy in the hottest spots of the chip, since that helps to reduce the maximum temperature.

Fig. 1.1: Power evolution for Intel microprocessors

Fig. 1.2: Power evolution for AMD microprocessors


    Fig. 1.3: Cooling system cost with respect to the power dissipation [55]

    Fig. 1.4: Temperature distribution of a microprocessor [55]


We can save energy because processors are often designed for (almost) the worst case, and most of the time many resources are overdesigned. In general, the structures are sized in such a way that making them larger hardly increases the performance, but making them smaller may harm the performance for some programs or for some parts of some programs. Thus, in this thesis we investigate schemes to dynamically adapt these structures to cut down the energy consumption of those parts that do not contribute to increasing the performance. The following sections describe how these issues have been addressed for different structures of the chip whose complexity, dynamic and/or leakage energy consumption are significant.

    1.3 Power Efficiency Metrics

Different metrics related to power efficiency have been proposed to compare different schemes [20]. Depending on what the constraints are, different metrics should be used. Power is adequate when heat is the main constraint, whereas energy is used for comparing schemes where battery lifetime is the strongest constraint. Other metrics like energy-delay product (EDP) and energy-delay² (ED2P) are more appropriate when execution time is also important, as it usually is.

Different systems based on a superscalar processor can have different limitations. For instance, laptops have limitations in their cooling systems, so heat is a constraint (strongly related to power). Additionally, battery lifetime (energy) is also a constraint that must be considered in laptop processor design. If the processor is to be used in a desktop or mainframe, then execution time is a significant factor (EDP). For the highest-performance server-class machines, it may be appropriate to weight the delay part even more (ED2P).
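To make the choice of metric concrete, the following C sketch ranks two hypothetical design points; the energy and time values are invented for the example.

    #include <stdio.h>

    /* Compare two schemes under the metrics of this section.
       E is total energy (J), T is execution time (s); both made up. */
    static void report(const char *name, double E, double T) {
        printf("%-9s E=%5.2f J  EDP=%5.2f J*s  ED2P=%5.2f J*s^2\n",
               name, E, E * T, E * T * T);
    }

    int main(void) {
        report("baseline",  10.0, 1.00);
        report("low-power",  8.0, 1.15); /* saves energy, 15% slower */
        return 0;
    }

With these numbers the low-power scheme wins on energy (8 vs. 10) and EDP (9.2 vs. 10) but loses on ED2P (10.58 vs. 10), showing how a delay-heavier metric can reverse the ranking.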

    1.4 Cache Memories

The first structures we deal with are the cache memories. First-level (L1 for short) data and instruction caches have a significant dynamic and leakage energy consumption. Their dynamic energy is very important since L1 caches are accessed very frequently; it is not rare to access the L1 data and instruction caches more than once per cycle. Their contribution to the energy consumption of the chip varies from one microarchitecture to another but, as an example, the 21464 Alpha processor [127] was expected to devote 26% of its dynamic energy to caches. Additionally, cache memories are the structures that occupy most of the area of the chip, especially the L2 cache, and hence they contribute significantly to the total leakage of the chip.

We have proposed different mechanisms to adapt the caches to the program requirements. The following subsections describe the contributions of this thesis to reducing the energy consumption of the cache memories.


    1.4.1 Fast and Slow L1 Data Cache

As stated before, processors are often designed for (almost) the worst case. For instance, all data cache accesses are served as soon as possible, although some of them could tolerate large latencies. Since fast caches consume significant dynamic and leakage energy, using such caches for all accesses is a waste of energy. Our contributions [4] to deal with this issue are as follows:

• We propose using different L1 data cache modules with different latency and energy characteristics to serve the accesses depending on their latency tolerance. The organization of the modules is also studied: we propose schemes with both a flat and a hierarchical organization.

• The memory instructions are classified dynamically to make them access the most suitable module. Thus, we have some degree of control over performance and energy consumption, because non-critical instructions can be served from a slow module whose energy consumption is low (a sketch of this kind of classification follows this list).
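As an illustration of the kind of dynamic classification involved, here is a minimal C sketch assuming a hypothetical PC-indexed table of 2-bit saturating counters; it is not the exact classification mechanism of section 3.2.

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024 /* illustrative table size */

    /* One 2-bit saturating counter per (hashed) load PC. */
    static uint8_t critical_ctr[TABLE_SIZE];

    /* Loads predicted critical access the fast module; others, the slow one. */
    static bool predict_critical(uint64_t pc) {
        return critical_ctr[(pc >> 2) % TABLE_SIZE] >= 2;
    }

    /* Train once the load's criticality is known, e.g. from whether its
       consumers stalled waiting for the loaded value. */
    static void train(uint64_t pc, bool was_critical) {
        uint8_t *c = &critical_ctr[(pc >> 2) % TABLE_SIZE];
        if (was_critical) { if (*c < 3) (*c)++; }
        else              { if (*c > 0) (*c)--; }
    }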

    1.4.2 Low Leakage L2 Cache

Most current microprocessors have an on-chip L2 cache. A big budget of transistors is devoted to this structure, which occupies a large area of the chip. Thus, even if low-leakage transistors are used to implement it, its leakage is noticeable. We have studied state-of-the-art techniques that turn off those cache lines whose contents are not expected to be reused; these techniques dynamically adapt the cache size. Previous techniques were proposed for L1 caches and do not work well for L2 caches, because L2 behavior differs a lot from L1 behavior. Our contributions [9] to save L2 cache leakage are the following:

• We have found that local predictors do not work well to predict when an L2 cache line can be turned off; hence, global prediction is required.

• We have investigated the relation between the number of accesses to a cache line and the time interval after which it can be turned off. The result is a prediction mechanism that effectively predicts when cache lines can be safely turned off (a simplified decay sketch follows this list).

• We propose an implementation of such a predictor and show that it achieves significant leakage energy savings, close to an oracle mechanism, with negligible performance degradation.
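For intuition, the following C sketch shows the simplest form of time-based decay: lines untouched for a fixed global interval are gated off. IATAC itself predicts a per-line interval from the observed number of accesses, so this is only a simplified illustration.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     on;               /* line powered on? */
        uint32_t last_access_tick; /* coarse global time of last access */
    } line_state;

    /* Called periodically: gate off lines idle for decay_interval ticks.
       Gating loses the contents, so dirty lines must be written back first. */
    static void decay_tick(line_state *lines, int nlines,
                           uint32_t now, uint32_t decay_interval) {
        for (int i = 0; i < nlines; i++)
            if (lines[i].on && now - lines[i].last_access_tick >= decay_interval)
                lines[i].on = false;
    }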

    1.4.3 Heterogeneous Way-Size Caches

Most caches are set-associative, which provides a certain degree of associativity for all the sets. On the other hand, we experimentally observe that only a few cache sets require some associativity. Thus, conventional set-associative caches are overdesigned because their ways have homogeneous sizes. Additionally, this kind of cache lacks the flexibility to be adapted dynamically, since all the ways have to be resized in concert. To tackle these limitations we propose the following contributions [7]:

• We propose a heterogeneous way-size cache design (HWS cache for short) that better fits the program requirements. The HWS cache enables associativity for all sets, but the sets partially share the required space (an indexing sketch follows this list).

• The HWS cache is shown to perform similarly to a conventional cache of higher associativity and/or greater capacity. Thus, an HWS cache achieves the same performance but requires less dynamic and/or leakage energy.

• An algorithm is proposed to dynamically resize the HWS cache. It is shown to be much more adaptable than a conventional cache.
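The key structural idea can be sketched as follows, assuming hypothetical way sizes: each way has its own number of sets, so it takes its own slice of the block address as index (and, correspondingly, a longer or shorter tag). This is an illustration only, not the exact HWS indexing of section 3.4.

    #include <stdint.h>

    #define LINE_BITS 5 /* 32-byte lines (illustrative) */

    typedef struct { uint32_t sets; } way_cfg; /* sets per way */

    /* Each way indexes with its own modulo, unlike a conventional cache
       where all ways share a single index. */
    static uint32_t way_index(uint64_t addr, way_cfg w) {
        return (uint32_t)((addr >> LINE_BITS) % w.sets);
    }

    /* Example: a 3-way HWS-style cache with 8KB, 4KB and 2KB ways
       (256, 128 and 64 sets of 32-byte lines). */
    static way_cfg ways[3] = { {256}, {128}, {64} };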

    1.5 Issue Logic

We study the issue logic [1] because it is one of the most important structures in terms of energy and complexity. This structure has a lot of activity, since all instructions are placed in it, and every time an instruction is issued, all the instructions in this structure have to be checked for dependences on the produced register. The issue logic consumes a lot of energy due to its many comparisons and is one of the main hotspots of superscalar processors. Its contribution to the energy consumption of the chip varies from one microarchitecture to another but, as an example, the 21464 Alpha processor [127] devotes 46% of its dynamic energy to the issue logic and register files.

Furthermore, the register file, whose energy consumption and complexity are also high, is coupled to the issue logic, which increases the energy consumption of this small area of the chip. Thus, reducing the energy consumption and/or complexity of these structures is crucial to reduce the maximum temperature of the chip.

We have designed different mechanisms to adapt the issue logic to the program requirements and reduce its complexity. The following subsections detail the contributions of this thesis in this area.

    1.5.1 Adaptive Issue Queue and Register File

The issue queue is sized to provide high ILP for all types of programs. However, many of the instructions placed in the issue queue often do not increase performance, because they have been dispatched too early. Hence, the issue queue can be dynamically resized to fit the program requirements. The idea of resizing the issue queue is not new: some works have proposed turning off the empty parts of the queue, or even turning off parts that would be used if they were turned on but that are not expected to increase performance. However, state-of-the-art techniques have some limitations that we try to overcome with a new mechanism to dynamically resize the issue queue. Our contributions [3, 2] in this area are as follows:

  • 10 · Chapter 1. Introduction

• We show that there is a strong relation between the issue queue and reorder buffer occupancies that can be exploited to effectively resize the former structure (a simplified heuristic is sketched after this list).

• We propose a mechanism based on this relation to resize the issue queue, which is shown to perform better than state-of-the-art approaches in terms of performance and energy.

• Delaying the dispatch of some instructions has a beneficial effect on the register file, since its pressure is reduced. We take advantage of this feature and make the register file adaptable as well; it is resized in accordance with the issue queue, producing further energy savings.
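A minimal C sketch of such an interval-based resizing heuristic follows; the interval length, thresholds and bank sizes are invented, and the actual heuristic of section 4.2 is more refined.

    #define INTERVAL_CYCLES 10000 /* illustrative interval length */

    static int iq_size = 32; /* current issue queue size (banks of 16) */

    /* Called at the end of each interval with the average reorder buffer
       occupancy observed during that interval. */
    static void resize_issue_queue(double avg_rob_occ, int rob_size) {
        if (avg_rob_occ > 0.9 * rob_size && iq_size < 128)
            iq_size += 16;  /* ROB nearly full: enable another bank */
        else if (avg_rob_occ < 0.5 * rob_size && iq_size > 16)
            iq_size -= 16;  /* plenty of slack: turn a bank off */
    }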

    1.5.2 Low-Complexity Floating-Point Issue Logic

The complexity of the fully-associative issue logic is a concern because it is directly related to its high energy consumption. In the literature there are many attempts to reduce the complexity of this structure by lowering the issue logic associativity. State-of-the-art solutions that effectively reduce the complexity and energy consumption of the issue logic have been shown to perform well for integer programs, but their performance for FP programs is low.

State-of-the-art mechanisms are based on using either the dependences or the latencies of the instructions to classify them into simple structures. FP programs perform poorly with low-complexity structures based on dependences, but their performance is not much better when using latencies, which require higher-complexity hardware. We contribute [5] to tackling this problem as follows:

• We propose a new issue logic design that fits FP program requirements with low-complexity structures that provide high performance.

• Our solution is based on using both the dependences and the latencies in different stages to achieve a low-complexity design that provides high performance.

• The functional units can be distributed across the small queues of the proposed issue logic, which further reduces the complexity of the design as well as its energy requirements.

    1.6 Load/Store Queue

The load/store queue (LSQ for short), similarly to the issue queue, requires many fully-associative lookups to check for dependences. While the issue queue keeps track of register dependences, the LSQ keeps track of memory dependences. The LSQ entries are allocated at dispatch. Every time an address is computed, it is compared with those of many other memory instructions, and the complexity and energy required to do all these comparisons is significant.


Most state-of-the-art approaches focus on reducing the energy consumption of this structure by filtering accesses or pipelining them. A few works propose alternative approaches that require huge structures to achieve moderate reductions in complexity and dynamic energy. Our contributions [8] in this area are:

• We propose a set-associative design of the LSQ, where the associativity is very low to reduce the energy consumption and the complexity, while the performance remains similar to that of a fully-associative LSQ (the lookup cost is sketched after this list).

• We enable LSQ entries to hold several memory instructions to further reduce the number of required comparisons.

• Each entry of the LSQ is extended with some fields that record information about the position of the data in the L1 data cache and the address translation of the TLB, to save energy in these two structures.

• The different components of the new LSQ are dynamically resized to keep the leakage energy requirements low.
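The lookup-cost argument can be sketched in C as follows; the set/way sizes and structure names are hypothetical, not the actual SAMIE-LSQ organization.

    #include <stdint.h>

    #define SETS 16 /* illustrative */
    #define WAYS 4

    typedef struct { uint64_t addr; int valid; } lsq_entry;
    static lsq_entry lsq[SETS][WAYS];

    /* A conventional LSQ compares a newly computed address against every
       entry; here only the WAYS entries of one set are examined. */
    static int matches_in_set(uint64_t addr) {
        int set = (int)((addr >> 3) % SETS); /* 8-byte granularity */
        int n = 0;
        for (int w = 0; w < WAYS; w++)
            if (lsq[set][w].valid && (lsq[set][w].addr >> 3) == (addr >> 3))
                n++;
        return n; /* 4 comparisons instead of, e.g., 128 for a 128-entry LSQ */
    }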

    1.7 Clustered Microarchitectures

Distributing resources is a well-known technique to save energy and reduce complexity and delays. A clustered microarchitecture has a subset of the resources in each cluster. However, conventional clustered microarchitectures must trade workload balance against communications to achieve high performance. The minimum number of communications is achieved by sending all instructions to the same cluster, but this results in a totally imbalanced scenario. On the other hand, achieving a perfect workload balance requires a very high number of communications, because producers and consumers tend to be in different clusters. Thus, they are opposite objectives.

Conventional clustered microarchitectures send the instructions to a few clusters to reduce the number of required communications, until some degree of imbalance is reached. Then, instructions are forced to go to other clusters, which results, in general, in extra communications. Thus, the activity is high in a few clusters during some cycles, and some performance is lost due to the forced communications. In this area we contribute [6] as follows:

• We propose organizing the clusters in a ring, in such a way that fast bypasses are set up between each cluster and the following one in the ring, instead of forwarding the data back to the producing cluster itself (see the sketch after this list).

• We show how a steering algorithm similar to that of the conventional microarchitecture inherently achieves better activity distribution for the ring clustered microarchitecture, and increases performance.

• We study simple steering algorithms and show that the ring clustered microarchitecture dynamically distributes the activity across all clusters with low communication requirements. In contrast, the conventional microarchitecture is unable to balance the workload and concentrates the activity in a few clusters during long periods, which is very negative in terms of temperature. Thus, the ring microarchitecture succeeds in dynamically distributing the activity while keeping the performance high.
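The communication pattern can be sketched as follows; the cluster count is an arbitrary example, and the code only illustrates the distance notion, not the full steering algorithm of chapter 6.

    #define NCLUSTERS 4 /* illustrative */

    /* Hops along the ring from the producing cluster to the consuming one.
       In the ring design, distance 1 (the next cluster) uses the fast
       bypass; in a conventional design the cheap case is distance 0. */
    static int ring_distance(int producer, int consumer) {
        return (consumer - producer + NCLUSTERS) % NCLUSTERS;
    }

Steering a consumer to the cluster after its producer is therefore cheap, which naturally rotates activity around the ring instead of piling it onto one cluster.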

    1.8 Organization

The rest of this thesis is organized as follows. Chapter 2 presents the evaluation framework. It details the simulators and tools that have been used and how they have been modified, and also describes the benchmarks used.

Chapter 3 presents the contributions on cache memories; the different approaches as well as the related work are described. Chapter 4 details our proposals on low-power adaptive issue logic schemes. Chapters 5 and 6 present our contributions on load/store queues and clustered microarchitectures, respectively. Finally, chapter 7 presents the main conclusions of this dissertation and points out some ideas for future work.

CHAPTER 2

EVALUATION FRAMEWORK

This chapter describes the evaluation framework that we have used in this thesis. The conclusions and results presented in this dissertation have been obtained with the benchmarks, simulators and other tools presented in the following sections.

    2.1 Benchmarks

The focus of this thesis is reducing the energy consumption and complexity of superscalar processors. Hence, the most suitable benchmarks are those for high-performance systems. We have chosen SPEC CPU2000, an industry-standard CPU-intensive benchmark suite [115]. These benchmarks, developed from real user applications, measure the performance of the processor, memory and compiler on the tested system. The benchmark suite consists of 12 integer programs and 14 floating-point programs. For the sake of generality, we have used both the integer and FP programs throughout the thesis.

Table 2.1 presents the parameters used to compile each benchmark, as well as the input files used to run them. We have used the ref input data set, but we provide the names of the input files because some of the benchmarks have several inputs. The table is divided into two parts: the first part corresponds to the FP benchmarks, whereas the second part presents the integer benchmarks.

The benchmarks have been compiled using the native HP/Alpha compiler. All flags used to compile are shown in the table except the -non_shared flag, which has been used for all benchmarks. This flag is required to simulate the benchmarks with the Simplescalar simulator [23] used in this thesis.

We have used execution-driven simulations. To simulate significant parts of the programs, we have measured the number of instructions to be skipped in our binaries. The simulations have been run after fast-forwarding a given number of instructions and warming up the caches and tables with 100 million instructions. In the rest of the thesis, the default number of simulated instructions is 100 million unless otherwise stated. Table 2.2 provides the number of fast-forwarded instructions for each benchmark. We skip 200 million instructions for most benchmarks, except those that require larger fast-forwards.

FP programs

Benchmark  Compile options                                Input files
ammp       cc -O3 -lm -DSPEC_CPU2000                      ammp.in, init_cond.run.1, init_cond.run.2, init_cond.run.3
applu      f77 -O4                                        applu.in
apsi       f77 -O4                                        apsi.in
art        cc -O3 -lm                                     c756hel.in, a10.img, hc.img
equake     cc -O3 -lm                                     inp.in
facerec    f90 -O4                                        ref.in, ar1.asc, ar2.asc, pk2.asc, graphPars.dat, imagePars.dat, matchPars.dat, ref-albumPars.dat, ref-probePars.dat, trafoPars.dat
fma3d      f90 -O4                                        fma3d.in
galgel     f90 -O4 -fixed                                 galgel.in
lucas      f90 -O4                                        lucas2.in
mesa       cc -O3 -lm                                     mesa.in, numbers
mgrid      f77 -O4                                        mgrid.in
sixtrack   f77 -O4                                        fort.16, fort.2, fort.3, fort.7, fort.8, inp.in
swim       f77 -O4                                        swim.in
wupwise    f77 -O4                                        wupwise.in

INT programs

Benchmark  Compile options                                Input files
bzip2      cxx -O3                                        input.source
crafty     cc -O3 -lm -DSPEC_CPU2000 -DALPHA              crafty.in
eon        cxx -O2 -lm -I. -DNDEBUG                       chair.camera, chair.control.cook, chair.control.kajiya, chair.control.rushmeier, chair.surfaces, eon.dat, materials, spectra.dat
gap        cc -O3 -lm -DSYS_IS_BSD -DSPEC_CPU2000_LP64 -DSYS_HAS_CALLOC_PROTO -DSYS_HAS_MALLOC_PROTO -DSYS_HAS_TIME_PROTO   ref.in, all *.g files
gcc        cxx -O3 -lm -Dalloca=__builtin_alloca          166.i
gzip       cc -O3                                         input.source
mcf        cc -O3 -lm -DWANT_STDC_PROTO                   inp.in
parser     cc -O3 -lm -DSPEC_CPU2000                      ref.in, 2.1.dict, all words/* files
perlbmk    cxx -O3 -lm -DSPEC_CPU2000_DUNIX               perfect.pl, all lib/* files
twolf      cc -O3 -lm -DSPEC_CPU2000                      ref.blk, ref.cel, ref.net, ref.par
vortex     cc -O3 -lm -DSPEC_CPU2000_LP64                 lendian.rnv, lendian.wnv, lendian1.raw, lendian2.raw, lendian3.raw, persons.1k
vpr        cxx -O3 -lm -DSPEC_CPU2000                     arch.in, net.in, place.in

Table 2.1: Compile and run commands for the SPEC CPU2000

FP programs                              INT programs
Benchmark  Fast-forwarded                Benchmark  Fast-forwarded
           instructions (millions)                  instructions (millions)
ammp       1,400                         bzip2      200
applu      200                           crafty     200
apsi       200                           eon        200
art        200                           gap        200
equake     3,400                         gcc        200
facerec    200                           gzip       200
fma3d      200                           mcf        3,400
galgel     200                           parser     200
lucas      3,400                         perlbmk    200
mesa       500                           twolf      500
mgrid      200                           vortex     200
sixtrack   500                           vpr        200
swim       500
wupwise    3,400

Table 2.2: Fast forwarded instructions for the SPEC CPU2000

    2.2 Tools and simulators

Three different simulators and tools have been used in this thesis. They are described in the following sections.

    2.2.1 Simplescalar

The base simulator that we have used for the different works is the Simplescalar toolset [23]. This simulator has been chosen because it is widely used by the computer architecture community and offers easily modifiable, well-organized source files. Two Simplescalar simulators have been used: sim-outorder and sim-cache.

sim-outorder is a detailed performance simulator of modern superscalar microprocessors. We have modified the source code to adapt the simulator to our requirements. The main enhancements we have made to the baseline microarchitecture are the separation of the reorder buffer and the issue queue, and the modeling of the ports of the register file.

The pipeline for sim-outorder is shown in figure 2.1. Instructions are fetched from the instruction cache, accessing the branch predictor if required. Then, the instructions are decoded and dispatched to the reorder buffer and the issue queue. Load and store instructions are split into an address computation, which is placed in the issue queue and the reorder buffer, and a memory access, which is placed in the load/store queue. Instructions stay in the issue queue until they are issued to the functional units, and in the load/store queue until they commit. When instructions are issued, they wake up the instructions depending on them. Instructions write back their results when they finish their execution. At this point, unresolved branches find out whether they were mispredicted; in case of misprediction, the pipeline is flushed. Finally, the instructions commit and leave the pipeline. Store instructions access memory at this stage.

Fig. 2.1: Pipeline for sim-outorder (stages: Fetch, Dispatch, Scheduler/Memory scheduler, Exec/Mem, Writeback, Commit; backed by the Icache, Dcache, ITLB, DTLB, L2 cache and virtual memory)

sim-cache is a cache simulator that provides cache statistics (hits, misses, replacements, etc.) with much faster simulations than sim-outorder, which simulates the whole processor.

    2.2.2 Wattch

We have also used the Wattch simulator [21], an architecture-level power and performance simulator based on Simplescalar. Wattch adds activity counters to the sim-outorder simulator and estimates the energy consumption of the different structures using the CACTI tool [111].

    The main processor units that Wattch models fall into four categories:

• Array structures: data and instruction caches, cache tag arrays, all register files, register alias table, branch predictors, the reorder buffer, and large portions of the issue queue and the load/store queue.

• Fully-associative content-addressable memories: issue queue wakeup logic, load/store queue address checks, and TLBs (if they are configured as fully-associative).

• Combinational logic and wires: functional units, issue queue selection logic, dependency check logic at the decode stage, and result buses.

• Clocking: clock buffers, clock wires, etc.

We have enhanced Wattch in the same way as the sim-outorder simulator. The power model has been extended accordingly to keep track of the power dissipation of the modified structures.
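The essence of this activity-counter approach can be written in a few lines of C; the structure list and per-access energies below are illustrative, not Wattch's actual tables.

    enum { ICACHE, DCACHE, REGFILE, ISSUEQ, NSTRUCT };

    /* Incremented by the timing simulator whenever a structure is used. */
    static unsigned long accesses[NSTRUCT];

    /* Per-access energies, as a CACTI-like model would supply (made up). */
    static const double energy_nj[NSTRUCT] = { 0.4, 0.5, 0.2, 0.6 };

    static double total_energy_nj(void) {
        double e = 0.0;
        for (int i = 0; i < NSTRUCT; i++)
            e += accesses[i] * energy_nj[i];
        return e;
    }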


    2.2.3 CACTI

Finally, cache statistics such as delay, energy per access and area have been drawn from the CACTI tool [111], which is a timing, power and area model for cache memories. This tool has also been used to estimate the energy of other structures, like the register file, by isolating some components and sizing them properly.

    The CACTI model has the following parameters:

    • Cache size

    • Associativity (direct-mapped, set-associative or fully-associative)

    • Line size (number of bytes per line)

    • Number of ports of each type (read, write and read/write ports)

    • Technology

    • Number of banks

CACTI presents results in terms of area, energy consumption and delay for the decoders, bitlines, wordlines, comparators, sense amplifiers, routing buses, output driver, etc.
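These parameters map naturally onto a configuration record; the struct below is a hypothetical sketch for illustration, not CACTI's actual interface.

    /* Hypothetical configuration record mirroring the parameter list above. */
    typedef struct {
        unsigned size_bytes;   /* total cache size */
        unsigned assoc;        /* 1 = direct-mapped; 0 = fully-associative */
        unsigned line_bytes;   /* bytes per line */
        unsigned read_ports, write_ports, rw_ports;
        double   tech_um;      /* technology (feature size) */
        unsigned banks;
    } cache_cfg;

    /* e.g. a 16KB, 2-way, 32B-line cache at 0.09um with one r/w port: */
    static const cache_cfg l1d_example = { 16384, 2, 32, 0, 0, 1, 0.09, 1 };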


CHAPTER 3

CACHE MEMORIES

The relevance of cache memories increases in current and future microprocessors. The latency and energy consumption of caches increase from generation to generation due to different factors.

The fraction of chip area devoted to caches grows with every new processor generation, mainly due to the L2 cache, and it seems reasonable to expect that in a few years L3 caches will often be on-chip. Thus, even though low-leakage transistors are used for caches, their large area makes them one of the leakiest structures on the chip.

Cache memories also contribute significantly to the dynamic energy of the chip. The dynamic energy of caches is especially high for the L1 data and instruction caches because they are accessed very frequently. For instance, a 4-way superscalar processor may require around one access per cycle to the L1 instruction cache to fetch instructions, and one or two accesses per cycle to the data cache, given that around 1/3 of the instructions tend to be memory accesses (4 instructions/cycle x 1/3 gives roughly 1.3 data cache accesses per cycle).

Cache latency is another key point. Dynamic and leakage energy of caches can be significantly reduced by adjusting the supply and threshold voltages of the transistors, but at the cost of a higher latency. If latency increases, some performance is lost and the total processor energy consumption may grow. Thus, voltages must be adjusted carefully to avoid counterproductive effects. Some cache energy can also be saved by reducing the cache size or associativity, but then the miss rate may increase. In general, the higher the miss rate, the longer the execution time. Hence, reducing the cache size or associativity saves some energy in the cache itself, but may cause higher energy consumption in the rest of the chip.

Reducing the cache energy consumption without increasing the cache latency or the miss rate is a challenge. The following sections of this chapter present the approaches that we have proposed to tackle this problem. Section 3.1 reviews state-of-the-art techniques that address cache energy and performance. Then, sections 3.2, 3.3 and 3.4 present the cache memory proposals of this thesis.


    3.1 Related Work

This section reviews related work, classified into different categories for the sake of readability. The literature on cache architectures is very abundant; here we just outline the works closest to our proposals.

    3.1.1 Low Miss Rate Schemes

Several approaches to reducing the miss rate and/or complexity of conventional caches have been proposed. Conventional caches use a subset of the address bits to index the tag and data arrays, as dictated by the modulo function. Several authors have shown that other functions provide lower miss rates because they reduce the number of conflicts. Topham et al. [121] propose an implementation of a polynomial modulo function to index the cache. Kharbutli et al. [73] propose index functions based on prime numbers. Both works show a significant reduction in conflict misses with respect to the conventional modulo function, but require some extra hardware and delay to access the cache, which may have an impact especially for L1 caches.
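To give a flavor of such index functions, the sketch below contrasts the conventional bit-select (modulo) index with a simple XOR-folding index. This is only an illustration of hashing the block address to spread conflicting blocks; it is not the exact polynomial scheme of [121] nor the prime-based scheme of [73], and the cache geometry is arbitrary.

    /* Conventional modulo (bit-select) index vs. a toy XOR-folding index
     * for a cache with 128 sets of 32-byte lines. */
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 5                 /* 32-byte lines */
    #define SET_BITS  7                 /* 128 sets */
    #define NSETS     (1u << SET_BITS)

    static uint32_t modulo_index(uint32_t addr) {
        return (addr >> LINE_BITS) & (NSETS - 1);
    }

    static uint32_t xor_index(uint32_t addr) {
        uint32_t block = addr >> LINE_BITS;
        /* fold higher-order bit fields onto the conventional index */
        return (block ^ (block >> SET_BITS) ^ (block >> (2 * SET_BITS)))
               & (NSETS - 1);
    }

    int main(void) {
        /* A power-of-two stride maps every block to set 0 under modulo,
         * while the XOR-folded index spreads the blocks across sets. */
        for (uint32_t i = 0; i < 4; i++) {
            uint32_t a = i * (NSETS << LINE_BITS);
            printf("addr %08x: modulo=%u xor=%u\n",
                   a, modulo_index(a), xor_index(a));
        }
        return 0;
    }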

Other approaches to reducing the cache miss rate focus on the replacement function. Different replacement policies have been proposed to improve on LRU for L2 caches [130], to take into account the cost of replacements in L2 caches [67], and to let the compiler provide hints to the cache system so that it replaces lines that are unlikely to be reused soon [126]. Kim et al. [74] propose a non-uniform cache architecture (NUCA) for wire-delay-dominated on-chip caches, together with a data placement scheme that obtains low miss rates and reduces latency. Chishti et al. [36] present an improved version of NUCA.

    3.1.2 Pseudo-Associative Caches

Different approaches try to achieve the miss rate of a set-associative cache with the latency and power dissipation of a direct-mapped one. Some implementations focus on reducing the access time [11, 17, 69, 109, 135], while others predict the way where the data is stored [30, 65, 95]. One option [65] is to use a way predictor so that only one way of a set-associative cache is accessed. Powell et al. [95] go further and propose combining way prediction with selective direct-mapping for non-conflicting accesses. Their approach builds on the performance-oriented schemes of Batson and Vijaykumar [17] and of Calder et al. [30].
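The following sketch captures the way-prediction idea for a 2-way set-associative cache in the spirit of [65]: on a correct prediction only one way is probed, approaching the energy of a direct-mapped access, while a misprediction costs an extra probe and retrains the predictor. The structures and names are our own illustration, not any of the cited designs.

    /* Way prediction for a 2-way set-associative cache (illustrative). */
    #include <stdbool.h>
    #include <stdint.h>

    #define NSETS 128
    #define WAYS  2

    typedef struct { uint32_t tag; bool valid; } line_t;

    static line_t  cache[NSETS][WAYS];
    static uint8_t predicted_way[NSETS];  /* could also be indexed by load PC */

    static bool probe(uint32_t set, uint32_t tag, int way) {
        return cache[set][way].valid && cache[set][way].tag == tag;
    }

    /* Returns true on a hit; *ways_probed reflects the energy spent. */
    static bool cache_access(uint32_t set, uint32_t tag, int *ways_probed) {
        int w = predicted_way[set];
        *ways_probed = 1;
        if (probe(set, tag, w))
            return true;                  /* fast path: one way accessed */
        *ways_probed = 2;
        if (probe(set, tag, w ^ 1)) {
            predicted_way[set] = w ^ 1;   /* retrain on misprediction */
            return true;                  /* slower hit: extra probe */
        }
        return false;                     /* miss in both ways */
    }

    int main(void) {
        cache[3][1] = (line_t){ .tag = 0xABC, .valid = true };
        int probes;
        bool hit = cache_access(3, 0xABC, &probes);  /* mispredicted way */
        return (hit && probes == 2) ? 0 : 1;
    }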

Power can also be reduced at the expense of latency by accessing the ways sequentially [72].


    3.1.3 Non-Resizing Low Power Schemes

Several works [52, 66, 78] have investigated the effects of supply and threshold voltages on performance and power. Heo et al. [59] present a circuit technique that reduces bitline leakage by means of leakage-biased bitlines. Transmission line caches [18] reduce delay and power for L2 caches by using on-chip transmission lines instead of conventional RC wires.

There are also approaches based on changing the organization of the cache system. Kin et al. [77] propose a small filter cache placed before the conventional L1 cache to serve most accesses without accessing the L1 data cache. Several works [51, 64, 105, 122] propose cache organizations with different specialized modules for different types of accesses.

Compression has also been used to save power in caches. Yang and Gupta [131] present a data cache design where frequent values are encoded to reduce power dissipation. Canal et al. [33] and Villa et al. [124] propose compressing zero-valued bytes. Alameldeen and Wood [13] propose an adaptive compression mechanism for L2 caches.

The physical organization and layout of caches has also been studied. Ghose and Kamble [49] study the effects of subbanking, multiple line buffers and bitline segmentation on the dynamic power dissipation of caches. Su and Despain [119] investigate vertical and horizontal cache partitioning, as well as Gray-code addressing, to reduce dynamic power. Hezavei et al. [60] study the effectiveness of low-power SRAM design strategies such as divided bitlines, pulsed wordlines and isolated bitlines.

Other approaches focus on reducing the number and/or complexity of cache accesses. For instance, Witchel et al. [129] use compile-time information to allow some loads and stores to access the data cache without a tag check whenever it can be guaranteed that the access falls in the same line as an earlier one. Memik et al. [82] present filters that detect early whether a cache miss will also miss in the following cache levels; these filters reduce the miss penalty and the power consumption, since the energy they spend is much lower than that of accessing the corresponding cache. Buyuktosunoglu et al. [27] propose an algorithm that gates fetch and issue to reduce the number of accesses to the instruction cache and other structures.

    3.1.4 Resizing Low Power Schemes

Energy can be saved by resizing the cache to fit the program requirements. Powell et al. [96] propose gating the supply voltage (VDD) or the ground path of those SRAM cells of the cache whose contents are unlikely to be required; the contents of such cache lines are lost, but the cells practically do not leak. Agarwal et al. [10] propose a gated-ground scheme that turns off cache lines while still preserving their contents. This kind of circuit requires only one supply voltage, but the authors point out that, for technologies below 100 nm, small variations in the threshold voltage may destroy the contents of the cell. Thus, if a very high-precision threshold voltage cannot be achieved during fabrication, the stability of the cells cannot be guaranteed, making the technique non-viable. Flautner et al. [44] propose reducing the supply voltage of some cache lines by putting them in drowsy mode (a kind of sleep mode), which reduces their leakage without losing their contents; this kind of circuit requires two supply voltages. To address this limitation, Kim et al. [76] propose super-drowsy caches, which behave like drowsy caches but require only one VDD. The main drawback of this approach is that cells in the drowsy state are much more susceptible to soft errors [80].

Heuristics have been proposed to decide when and which individual cache lines should be turned off using these techniques. Kaxiras et al. [71] and Zhou et al. [137] propose techniques that reduce leakage by switching off cache lines whose contents are not expected to be reused, using the gated-VDD approach. Kim et al. [75] present a different heuristic based on drowsy caches. Li et al. [81] observe that L2 cache subblocks that are present in L1 can be turned off.
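As an illustration of this family of heuristics, the sketch below implements a decay-style policy in the spirit of Kaxiras et al. [71]: each line carries a small saturating counter that a coarse-grained global tick increments and that any access resets; a line whose counter saturates is assumed dead and its supply is gated off, losing its contents. The counter width and structure here are illustrative assumptions, not the published design.

    /* Decay-style leakage heuristic (illustrative): a 2-bit counter per
     * line, incremented on a coarse global tick and reset on every access;
     * a saturated counter gates the line's supply off (contents are lost). */
    #include <stdbool.h>
    #include <stdint.h>

    #define NLINES    4096
    #define DECAY_MAX 3          /* 2-bit saturating counter */

    typedef struct { bool powered; uint8_t decay; } line_state_t;

    static line_state_t lines[NLINES];

    static void on_access(int i) {     /* hit or fill: the line is live */
        lines[i].powered = true;
        lines[i].decay   = 0;
    }

    static void on_decay_tick(void) {  /* e.g., every few thousand cycles */
        for (int i = 0; i < NLINES; i++) {
            if (!lines[i].powered)
                continue;
            if (lines[i].decay < DECAY_MAX)
                lines[i].decay++;
            else
                lines[i].powered = false;  /* gated-VDD: stop leaking */
        }
    }

    int main(void) {
        on_access(0);
        for (int t = 0; t < 5; t++)
            on_decay_tick();           /* line 0 decays and is turned off */
        return lines[0].powered ? 1 : 0;
    }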

Energy savings can also be achieved by reconfiguring, dynamically [15, 39] or statically [14, 134, 136], cache characteristics such as size, associativity and the number of active ways. Yang et al. [132] study several static and dynamic resizing mechanisms and propose a hybrid one. Dhodapkar and Smith [38] study a variety of dynamic cache resizing algorithms and propose a simple mechanism that effectively resizes L1 caches, saving significant power with a low number of reconfigurations.

    3.2 Fast and Slow L1 Data Cache

This section proposes cache organizations that significantly reduce dynamic and leakage energy consumption with a small performance loss. The study aims to guide processor designers in choosing the cache organization with the best trade-off between performance and energy consumption.

This section is organized as follows. Section 3.2.1 introduces the model used to choose supply and threshold voltages. Section 3.2.2 details the definition of criticality that guides some of the evaluated cache systems. The experimental cache organizations are presented in section 3.2.3, and their results are shown in section 3.2.4. Finally, section 3.2.5 draws the main conclusions of this work.

    3.2.1 Energy and Delay Models in CMOS Circuits

CMOS power dissipation [104, 103] is given by equation 3.1, where dynamic power (P_dyn) and leakage power (P_leak) can be expressed as shown in equations 3.2 and 3.3 respectively.

P = P_dyn + P_leak    (3.1)

P_dyn = p_t · C_L · V_DD² · f_CLK    (3.2)

P_leak = I_0 · 10^(−V_TH/S) · V_DD    (3.3)


In these equations, p_t is the switching probability, C_L is the load capacitance (wiring and device capacitance), V_DD is the supply voltage and f_CLK is the clock frequency. I_0 is a function of the reverse saturation current, the diode voltage and the temperature. V_TH is the threshold voltage. Finally, S is the subthreshold slope, typically about 100 mV/decade. From equation 3.3 it can be observed that leakage power decreases by a factor of 10 for every 0.1 V increase in V_TH.

CMOS propagation delay can be approximated by the following simple α-power model [104]¹, where k is a proportionality constant specific to a given technology.

Delay = k · (C_L · V_DD) / (V_DD − V_TH)^α    (3.4)

The exponent α reflects the fact that transistors may be velocity saturated. α lies in the range [1, 2], where α = 1 implies complete velocity saturation and α = 2 implies no velocity saturation. For the 0.18 µm technology assumed in this work, α is typically 1.3.

From equations 3.2 and 3.3 it can be concluded that decreasing V_DD reduces both dynamic and leakage power dissipation, and that slightly increasing V_TH reduces leakage drastically; but both adjustments increase the propagation delay, as equation 3.4 shows. Thus, there is a trade-off between reducing power dissipation and increasing the propagation delay with minimum performance loss.
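Plugging numbers into equations 3.2-3.4 makes this trade-off concrete. The sketch below evaluates the slow-cache technology chosen later in this section (V_DD = 1.28 V, V_TH = 0.57 V) against the baseline (V_DD = 2.0 V, V_TH = 0.55 V), with α = 1.3 and S = 100 mV/decade as stated above. The constants p_t, C_L, f_CLK, I_0 and k cancel when taking ratios.

    /* Ratios of equations 3.2-3.4 for the slow-cache voltages w.r.t. the
     * baseline; technology-dependent constants cancel out. */
    #include <math.h>
    #include <stdio.h>

    static double pdyn(double vdd)              { return vdd * vdd; }
    static double pleak(double vdd, double vth) { return pow(10.0, -vth / 0.1) * vdd; }
    static double delay(double vdd, double vth) { return vdd / pow(vdd - vth, 1.3); }

    int main(void) {
        printf("Pdyn  ratio: %.2f\n", pdyn(1.28) / pdyn(2.0));               /* ~0.41 */
        printf("Pleak ratio: %.2f\n", pleak(1.28, 0.57) / pleak(2.0, 0.55)); /* ~0.40 */
        printf("delay ratio: %.2f\n", delay(1.28, 0.57) / delay(2.0, 0.55)); /* ~1.6  */
        return 0;
    }

Both power sources drop to roughly 40% of the baseline, which is why this pair of voltages is chosen below, while the propagation delay grows by a factor of about 1.6, which fits within the 2-cycle budget of the slow cache.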

    3.2.2 Criticality

In modern superscalar processors, where multiple instructions are processed in parallel, deciding when a given resource should be assigned to an instruction is a well-known problem. For instance, when two ready instructions require the same functional unit, only one of them can be chosen. Existing processors use different policies to make these decisions, but they are quite simple and inaccurate. Some studies [43, 98, 118] have proposed techniques that heuristically obtain more accurate information and use it to increase performance. Load instructions are especially harmful when they have high latencies and are on the critical path [107]. Thus, the criticality of load instructions is important information for handling them efficiently.

An exact computation of the criticality of each load instruction is not feasible due to its complexity, so we work with an approximation and propose an accurate predictor of criticality according to our definition. For the proposed criticality-based cache organization we only need to classify loads into two categories, critical and non-critical, so we need a mechanism that decides when a load can be delayed one or more cycles and when delaying it would significantly degrade performance.

¹ The subthreshold current is considered to be constant and transistors are assumed to be in the current saturation mode.


    Criticality Estimation

In order to decide whether an instruction is critical or not, we keep track of the number of cycles elapsed from the moment an instruction finishes its execution until it commits. If this number of cycles is greater than a given threshold N, the instruction is considered non-critical. Intuitively, this criterion indicates that the instruction belongs to a chain of dependent instructions that is not the longest one, or that some other instruction stalls the commit process (for instance, a load that misses in the L1 cache), so this chain could take a few more cycles without degrading performance. In our experiments, after evaluating different values of N, we observed that N = 4 cycles gives the best results for the chosen cache organizations.

With the previous criterion, the last instruction of every dependence chain is correctly classified, but criticality must also be propagated upwards along the dependence chains. Thus, we consider whether the data produced by an instruction (if any) is immediately used by at least one critical instruction. Under this criterion, only instructions belonging to a chain of dependent instructions that execute as soon as possible are considered critical.

The criticality predictor has been implemented as an untagged table of 2048 entries, where each entry is a 2-bit saturating counter whose most significant bit is the prediction. Initially the table indicates that all instructions are critical. The table is updated by every instruction that commits: if the committing instruction has been waiting to commit for less than N cycles (N = 4 in our experiments), or its produced data (if any) is forwarded through a bypass to a critical instruction that issues immediately, the corresponding 2-bit counter is incremented; otherwise it is decremented.

    The evaluation section describes how this criticality predictor has been validated.
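This predictor maps naturally onto a small array of saturating counters. The sketch below mirrors the update rule described above; indexing the table with the low-order bits of the instruction address is our assumption, as the text does not specify the hash.

    /* Criticality predictor: 2048 untagged 2-bit saturating counters,
     * prediction = most significant bit, all entries initially critical. */
    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES 2048
    #define N_WAIT  4      /* threshold N from the text */

    static uint8_t table[ENTRIES];  /* 2-bit counters, 0..3 */

    static void init_predictor(void) {
        for (int i = 0; i < ENTRIES; i++)
            table[i] = 3;           /* all instructions start critical */
    }

    static bool predict_critical(uint32_t pc) {
        return (table[pc % ENTRIES] & 0x2) != 0;  /* MSB is the prediction */
    }

    /* Update at commit. waited: cycles between finishing execution and
     * committing. fed_critical: the result was bypassed to a critical
     * instruction that issued immediately. */
    static void update_predictor(uint32_t pc, int waited, bool fed_critical) {
        uint8_t *c = &table[pc % ENTRIES];
        if (waited < N_WAIT || fed_critical) {
            if (*c < 3) (*c)++;     /* reinforce critical */
        } else {
            if (*c > 0) (*c)--;     /* drift toward non-critical */
        }
    }

    int main(void) {
        init_predictor();
        update_predictor(0x1234, 10, false);      /* waited long at commit */
        update_predictor(0x1234, 10, false);
        return predict_critical(0x1234) ? 1 : 0;  /* now non-critical */
    }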

    3.2.3 Cache Organizations

This section describes cache organizations that are compared to a baseline monolithic L1 cache with 1-cycle latency. Our proposals are based on two L1 cache modules implemented with different technology parameters. The first is a 1-cycle latency cache implemented with the same technology as the baseline; it will be referred to as the Fast Cache in the rest of this work. The second is a 2-cycle latency cache implemented with a lower V_DD and a higher V_TH than the baseline, in order to reduce both dynamic and leakage power dissipation at the expense of a longer access time; it will be referred to as the Slow Cache.

According to the formulas of section 3.2.1, we are interested in decreasing V_DD and increasing V_TH as much as possible, subject to two limitations: the parameters must be technologically feasible, and the resulting latency must be at most twice that of the baseline cache.


Fig. 3.1: Power dissipation compared to the baseline technology for different V_TH and V_DD values. [Chart omitted: P_dyn and P_leak, as a percentage of the baseline (0-100%), for V_TH/V_DD pairs ranging from 0.53/1.18 to 0.60/1.34 V.]

Leakage energy consumption can be estimated analytically, but dynamic energy depends on the program, so optimal generic values for V_DD and V_TH cannot be computed. To guide the selection, figure 3.1 shows several valid combinations of these values and the expected dynamic and leakage power dissipation compared to the baseline technology. The assumed parameters for the baseline technology are V_DD = 2.0 V and V_TH = 0.55 V [28]; the rest of the parameters are described in section 3.2.1.

We have chosen V_TH = 0.57 V and V_DD = 1.28 V for the Slow Cache because this combination reduces both sources of power dissipation to the same percentage.

    Proposed Cache Organizations

Two cache organizations are proposed. The first is a hierarchical, locality-based cache system in which the fast cache is the first-level data cache, the slow cache is the second level, and the baseline's second-level cache becomes the third level. In this organization the slow cache should be larger than the fast cache to be useful.

The second is a flat, criticality-based cache system with no inclusion (some data contained in the fast cache may not be in the slow cache and vice versa). Both the fast and the slow caches are always accessed in parallel. If a critical load hits in the slow cache and misses in the fast cache, the cache line is copied from the slow to the fast cache. If a critical load misses in both caches, the data fetched from the next cache level is allocated only in the fast cache. If a non-critical load hits in at least one of the caches, the data is not copied from one cache to the other. If a non-critical load misses in both caches, the data fetched from the next cache level is allocated only in the slow cache. Finally, if a store hits in at least one cache there is no data copy; if it misses, assuming a write-allocate policy, the data is fetched into the fast or the slow cache depending on the criticality of the store instruction.

                 Hierarchical /          Hierarchical /
                 Criticality-based       Criticality-based
                 (3-way slow)            (2-way slow)
    Baseline L1  Fast      Slow          Fast      Slow
    16K          4K        12K           4K        8K
    32K          8K        24K           8K        16K
    64K          16K       48K           16K       32K

Table 3.1: Cache sizes used in the comparison

Another important consideration is the cache size. Most existing processors have data caches whose size is in the range [16K : 64K]. Table 3.1 describes the different cache sizes used to compare the alternatives. All the caches described have 32-byte lines and 2 read/write ports. The baseline and fast caches are 2-way set-associative.

As table 3.1 shows, there are two versions of each proposal. In the first one, the total size is the same as the baseline (the slow cache is 3-way set-associative); in the second one, the total size is smaller than the baseline but the slow cache has the same associativity as the fast one (2-way set-associative). The cache parameters of both proposals (the hierarchical and the criticality-based one) are exactly the same, so their performance and power dissipation are directly comparable.
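The copy and allocation rules of the flat criticality-based organization reduce to a small decision procedure, sketched below. The helper functions are hypothetical stand-ins for the cache model's operations, implemented as trivial stubs so the sketch compiles.

    /* Decision procedure of the flat criticality-based organization. */
    #include <stdbool.h>
    #include <stdint.h>

    /* Stubs standing in for the cache model (illustrative only). */
    static bool hits_fast(uint32_t a)         { (void)a; return false; }
    static bool hits_slow(uint32_t a)         { (void)a; return false; }
    static void copy_slow_to_fast(uint32_t a) { (void)a; }
    static void fill_fast(uint32_t a)         { (void)a; } /* from next level */
    static void fill_slow(uint32_t a)         { (void)a; } /* from next level */

    static void handle_load(uint32_t addr, bool critical) {
        if (hits_fast(addr))
            return;                            /* fast hit: nothing to move */
        if (hits_slow(addr)) {
            if (critical)
                copy_slow_to_fast(addr);       /* promote for critical loads */
            return;                            /* non-critical: leave in slow */
        }
        if (critical) fill_fast(addr);         /* miss in both L1 caches */
        else          fill_slow(addr);
    }

    static void handle_store(uint32_t addr, bool critical) {
        if (hits_fast(addr) || hits_slow(addr))
            return;                            /* store hit: no data copy */
        if (critical) fill_fast(addr);         /* write-allocate by criticality */
        else          fill_slow(addr);
    }

    int main(void) {
        handle_load(0x1000, true);    /* critical load miss: goes to fast */
        handle_store(0x2000, false);  /* non-critical store miss: goes to slow */
        return 0;
    }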

    3.2.4 Performance Evaluation

This section evaluates the accuracy of the criticality predictor, and the performance and power dissipation of the different cache organizations in a superscalar processor.

    Experimental Framework

Our power dissipation and performance results are derived from Wattch [21], as described in section 2.2. The programs used are the whole SPEC CPU2000 benchmark suite [115]. Table 3.2 shows the processor parameters.

    Criticality Evaluation

This section describes the experiments carried out to verify the effectiveness of the criticality detection mechanism. Figure 3.2 shows the distribution of critical loads for the criticality-based configuration with a 2-way set-associative slow cache. The 3-way set-associative slow cache configuration, the locality-based organizations and the baseline show similar results (they differ by less than 3% in all cases), so their critical load distributions are not depicted.


    Parameter                             Value
    Fetch, Decode, Issue, Commit width:   4 instructions/cycle
    Issue queue size:                     40 entries
    Reorder buffer size:                  64 entries
    Register file:                        80 INT + 80 FP
    IntALUs:                              3 (1 cycle)
    IntMult/Div:                          1 (3-cycle pipelined mult, 20-cycle non-pipelined div)
    FP ALUs:                              2 (2 cycles, pipelined)
    FP Mult/Div:                          1 (4-cycle pipelined mult, 12-cycle non-pipelined div)
    Memory ports:                         2
    Branch predictor:                     Hybrid: 2K-entry gshare, 2K-entry bimodal and 1K-entry metatable
    BTB:                                  2048 entries, 4-way
    L1 Icache:                            64KB, 2-way, 32-byte lines, 1-cycle latency
    L1 Dcache:                            2-way, 32-byte lines
    L2 unified cache:                     512KB, 4-way, 64-byte lines, 10-cycle latency
    Memory:                               50 cycles, 2 cycles interchunk
    Data TLB:                             128 entries, 30-cycle miss penalty
    Instruction TLB:                      128 entries, 30-cycle miss penalty

Table 3.2: Processor configuration

The percentage of critical loads is higher for integer programs than for floating-point programs because integer applications exhibit less ILP. As the figure shows, SpecINT2000 has nearly twice the percentage of critical loads (approximately 60%) of SpecFP2000 (approximately 30%) across all cache configurations.

After studying what percentage of loads is considered critical, the next step is to verify that those loads really are critical. To this end, we compare the execution of the criticality-based 2-way slow cache organization against the baseline in two ways:

• The loads classified as critical or non-critical are treated as critical or non-critical, respectively.

• The same percentage of loads that were considered critical in the previous simulation are considered critical, but this time they are chosen randomly.

As figure 3.3 shows, the criticality scheme achieves significantly higher performance than the random scheme across all cache sizes: when loads are classified by the criticality criterion, the IPC loss is much lower than with randomly chosen loads. We conclude that the criticality criterion produces a good classification of loads, which can be used to guide the criticality-based cache organization.


Fig. 3.2: Load criticality distribution for different cache sizes of the 2-way set-associative slow cache configuration. [Chart omitted: percentage of critical vs. non-critical loads for SPECINT, SPECFP and SPEC at the 4K+8K, 8K+16K and 16K+32K configurations.]

Fig. 3.3: IPC loss of the criticality-based cache for the guided and the random versions w.r.t. the baseline. [Chart omitted: IPC loss (-1% to 5%) of the criticality scheme vs. the random scheme for SPECINT, SPECFP and SPEC at the 4K+8K, 8K+16K and 16K+32K configurations.]


Fig. 3.4: Performance loss of the locality-based and criticality-based organizations w.r.t. the baseline. [Chart omitted: IPC loss (-1% to 3%) of locality 3-way, criticality 3-way, locality 2-way and criticality 2-way for SPECINT, SPECFP and SPEC at the 4K+12K (8K), 8K+24K (16K) and 16K+48K (32K) configurations.]

    Cache Organizations Comparison

The locality-based and criticality-based cache organizations are compared against the baseline using several metrics: performance (IPC), miss ratio, dynamic energy consumption and leakage energy consumption.

Performance. Figure 3.4 shows the IPC loss of both cache organizations with respect to the baseline; 2way and 3way stand for a 2-way and a 3-way set-associative slow cache respectively. The SpecINT2000, SpecFP2000 and SPEC2000 percentages have been computed from the harmonic means of the IPCs. The locality-based scheme works better than the criticality-based scheme for SpecINT2000, whereas the criticality-based scheme achieves better results for SpecFP2000. We have observed that, although loads can be classified as critical or non-critical, it is common for data fetched by a non-critical load to be reused by a critical one and vice versa. Consequently, when the fast cache is not capacity-limited, it is better to fetch all data into the fast cache than to place some data in the slow cache only to move it to the fast cache later on behalf of a critical load.

In general, integer applications have small working sets, so the locality-based scheme, which always fetches data into the fast cache, works better than the criticality-based scheme. For floating-point applications with huge working sets, however, the performance lost by critical loads that find their data in the slow cache instead of the fast cache is compensated: data that will be reused by critical loads stays in the fast cache for more cycles, instead of being replaced by data that will only be used by non-critical loads during that period.


Fig. 3.5: Miss ratio breakdown of critical and non-critical loads for the 16KB baseline cache. [Charts omitted: panels (a) critical loads and (b) non-critical loads; bars for the baseline, locality 3-way, criticality 3-way, locality 2-way and criticality 2-way organizations over SPECINT, SPECFP and SPEC, broken down into hit fast / hit slow / miss, with the scale starting at 50%.]

For FP programs, the criticality-based scheme may even achieve better results than the baseline, thanks to the beneficial effect of not placing data fetched by non-critical loads in the fast cache.

Another observation is that the larger the caches, the smaller the difference between the two organizations. In fact, for a 16KB fast cache with a 32KB (or 48KB) slow cache, both schemes perform similarly.

Miss Ratios. Figures 3.5, 3.6 and 3.7 show the miss ratios of critical and non-critical loads for the different cache sizes. For the baseline, loads are classified into L1 hits and misses. For the other organizations, loads are classified into three groups: those that hit in the fast cache, those that miss in the fast cache but hit in the slow cache, and those that miss in both L1 caches; in the figures these are labeled hit fast, hit slow and miss respectively. Note that the scale of all the figures begins at 50% to show the hit/miss distribution better, since in all cases the fast cache hit ratio is above 50%.

We observe that, in general, for SpecFP2000 the fast cache hit ratio of critical loads is slightly higher in the criticality-based schemes than in the locality-based ones. For SpecINT2000, the locality-based schemes achieve higher fast cache hit ratios because the working sets are small and critical loads reuse data that non-critical loads have fetched into the fast cache.


Fig. 3.6: Miss ratio breakdown of critical and non-critical loads for the 32KB baseline cache. [Charts omitted: same layout as figure 3.5, for the 32KB baseline.]

Fig. 3.7: Miss ratio breakdown of critical and non-critical loads for the 64KB baseline cache. [Charts omitted: same layout as figure 3.5, for the 64KB baseline.]