
Eindhoven University of Technology
Faculty of Electrical Engineering
Electronic Systems Group

Modeling and Analysis of a Cache Coherent Interconnect

Thesis Report

submitted in partial fulfillment
of the requirements for the degree of

Master of Science
in
Embedded Systems

by:
Uri Wiener

Supervisors:
Prof. Kees Goossens
Dr. Andreas Hansson

v2.2, August 2, 2012


Abstract

System on Chip (SoC) designs integrate heterogeneous components, increasing in number, performance, and memory-related requirements. Such components share data using distributed shared memory, and rely on caches for improving performance and reducing power consumption. Complex interconnect designs function as the heart of these SoCs, gluing together components using a common interface protocol such as AMBA, with hardware-managed coherency a key feature for improving system performance. Such an interconnect is ARM's CCI-400, the first to implement the AMBA 4 AXI Coherence Extensions (ACE). Capturing the behavior of such a SoC-oriented interconnect and its impact on the performance of a system requires accurate modeling, simulation using adequate workloads, and precise analysis.

In this work we model ACE and a CCI-like interconnect using gem5, a Transaction-Level-Modeling (TLM) simulation framework. We extend gem5's memory system to support ACE-like transactions. In addition, we improve the temporal behavior of gem5's interconnect model by better representing resource contention. As such, this work intertwines two challenging computer architecture topics: coherence protocols and on-chip interconnects. Performance analysis is demonstrated on a multi-cluster mobile-device-like compute subsystem model, using both PARSEC benchmarks and BBench, a web-page rendering benchmark. We show how the system-level coherency extensions introduced to gem5's memory system provide insights about modeling coherent single- and multi-hop SoC interconnects. Our simulation results demonstrate that the impact of snoop response latency on overall system performance is highly dependent on the amount of inter-cluster sharing. The changes introduced significantly improve the interconnect's observability, enabling better characterization of a workload's memory requirements and sharing patterns.



You start off as fast as you can,

and very slowly increase the pace.

-Alon ‘Krembo’ Sagiv, “Operation Grandma”


Acknowledgments

Chronologically, those who made it happen should be credited first. Had it not been for Dr. Benny Akesson and Prof. Kees Goossens's match-making efforts, I would have never crossed the English Channel. For that I give them my full appreciation. While Kees bravely became my supervisor, Andreas had the questionable pleasure of directly supervising my work. For their help and cooperation throughout the past year, I take off my hat.

While supervising is done from a bird's-eye view, the man who was affected the most by my presence must be Sascha Bischoff. Thank you for all your help, knowledge sharing, patience, and for translating Dr. Taylor's English when times were tough.

Many many thanks to my friendly colleagues at ARM - Akash, Charlie, Djordje, Hugo, Matt E., Matt H., Radhika, Peter, Rene, Robert and William (alphabetical order, guys, nothing else) - for their companionship and support.

My appreciation goes also to the gifted people behind ACE and CCI in Cambridge and Sheffield, who cooperated, spared time and shared insights whenever I asked.

In addition, I'd like to thank all gem5 developers, and especially Ali, Dam and Chris, for providing tips, patches, feedback and plenty of the time required for making it all work.

From a personal perspective, I'd like to thank my family and friends in Israel, the Netherlands and Germany for all their support during my stay in the UK.

But above all, my love and endless appreciation to Tali, who supported me in making this move and bore having the English Channel between us for half a year.

Uri Wiener
Cambridge, The United Kingdom
August 2nd, 2012


Contents

List of Figures

List of Tables

List of Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Hardware-based system coherency
  1.3 Goals
    1.3.1 Problem Statement
    1.3.2 Thesis Goals
  1.4 Contribution of this work
  1.5 Thesis organization

2 Related Work
  2.1 System coherence protocols
  2.2 Coherent interconnects
  2.3 Memory system performance

3 Background
  3.1 System coherency and AMBA 4 ACE
    3.1.1 Memory Consistency
    3.1.2 Cache Coherence
    3.1.3 System-level coherency
    3.1.4 AMBA 4 AXI Coherence Extensions
    3.1.5 ACE transactions
  3.2 Interconnect modeled: CCI++
    3.2.1 CoreLink CCI-400
    3.2.2 CCI++: extending ACE and CCI to multi-hop memory systems
  3.3 The Modeling Challenge
  3.4 Simulation and modeling framework: gem5
    3.4.1 gem5
    3.4.2 Simulation flow
    3.4.3 Test-System Example
  3.5 The gem5 memory system
    3.5.1 The basic elements: SimObjects, MemObjects, Ports, Requests and Packets
    3.5.2 Request's lifecycle example
    3.5.3 Intermezzo: Events and the Event Queue
    3.5.4 The bus model
    3.5.5 A simple request from a master
    3.5.6 A simple response from a slave
    3.5.7 Receiving a snoop request
    3.5.8 Receiving a snoop response
    3.5.9 The bridge model
    3.5.10 Caches and coherency in gem5
    3.5.11 gem5's cache line states
    3.5.12 Main cache scenarios
  3.6 Target platform
    3.6.1 gem5 building block for performance analysis
    3.6.2 Simulator performance
  3.7 Differences between ACE and gem5's memory system
    3.7.1 System-coherency modeling

4 Interconnect model
  4.1 Temporal transaction transport modeling
  4.2 Resource contention modeling
  4.3 ACE transaction modeling
    4.3.1 ReadOnce
    4.3.2 ReadNoSnoop
    4.3.3 WriteNoSnoop
    4.3.4 MakeInvalid
    4.3.5 WriteLineUnique
  4.4 Modeling inaccuracies, optimizations and places for improvement
  4.5 Bus performance observability
  4.6 Conclusions

5 Implementation and Verification Framework
  5.1 Memtest
  5.2 ACE-transactions development
  5.3 ACE transaction verification
    5.3.1 ReadOnce
    5.3.2 ReadNoSnoop
    5.3.3 WriteNoSnoop
    5.3.4 WriteLineUnique
  5.4 Conclusions

6 Performance Analysis
  6.1 Metrics and method
  6.2 Hypotheses
  6.3 Small-scale workloads
    6.3.1 MemTest experiments setup
    6.3.2 MemTest experiments results
  6.4 Large-scale workloads
    6.4.1 PARSEC experiments
    6.4.2 PARSEC results
    6.4.3 Why do the missing results prove our hypothesis
    6.4.4 BBench
    6.4.5 Insight from Bus Stats
  6.5 Results analysis and conclusions
  6.6 Contribution

7 Conclusions and Future Work
  7.1 Technical conclusions
    7.1.1 Interconnect observability
    7.1.2 Modeling SoC platforms
    7.1.3 The impact of delaying snoop responses
    7.1.4 Truly evaluating ACE-like transactions
  7.2 Reflections, main hurdles and difficulties faced
  7.3 Future work

A Contributions to gem5


List of Figures

3.1 Example top-level system using CCI-400. Source: [8]
3.2 big.LITTLE architecture concept. Source: ARM
3.3 big.LITTLE task migration. Source: [12]
3.4 ACE cache line states. Source: [12]
3.5 AXI and ACE channels. Source: [12]
3.6 ACE transaction groups. Source: [12]
3.7 CCI-400 internal architecture. Source: [8]
3.8 gem5 example systems with typical CCI-like interconnects
3.9 gem5 Speed vs. Accuracy Spectrum. Source: [19]
3.10 gem5 simulation inputs, outputs and runtime interfaces
3.11 gem5 2-core test system example
3.12 Master to slave transaction sequence diagram
3.13 Abstracted diagram of the bus
3.14 Target platform block diagram. Source: ARM technical symposium 2011
3.15 gem5 target initial and final target platform overview
3.16 gem5 target initial and final target platform overview

4.1 The bus-model's inheritance diagram
4.2 Added bus outgoing traffic queuing scenarios
4.3 Abstracted diagram of the layered bus

5.1 A 3:2:1 old (unintuitive) MemTest system
5.2 A 3:2:1 new MemTest system
5.3 MemTest 3:1 system
5.4 ReadOnce for an M-state line
5.5 ReadNoSnoop for an E-state line
5.6 MemTest 3 system
5.7 WriteNoSnoop in an ACE context
5.8 WriteNoSnoop for an E-state line in a multi-hop interconnect
5.9 WriteLineUnique for an O-state line in an ACE context
5.10 WriteLineUnique in a multi-hop context

6.1 Small-scale snoop-response test system
6.2 Simulation results for MemTest-based snoop-response latency experiments
6.3 Normalized PARSEC communication patterns
6.4 General simulation and simulator-performance stats for PARSEC benchmarks
6.5 Total DRAM bandwidth for PARSEC benchmarks
6.6 Timing-model CPU statistics for PARSEC benchmarks
6.7 L1 and L2 cache miss rates for PARSEC benchmarks
6.8 Bus stats for PARSEC benchmarks
6.9 PARSEC execution times vs. snoop response penalty
6.10 PARSEC normalized execution times vs. snoop response penalty
6.11 bodytrack-failing system
6.12 PARSEC bodytrack 4 cores 4t simsmall cache occupancy in percent
6.13 BBench Gingerbread and Ice Cream Sandwich snapshots
6.14 BBench transaction distribution demonstration


List of Tables

3.1 Projection of ACE cache line state to MOESI naming. Source: [12]
3.2 Correlation of ACE and gem5 interface semantics
3.3 Main port interface methods
3.4 Projection of gem5 cache line state space to MOESI

4.1 Bus outgoing traffic queuing


List of Abbreviations

ACE AXI Coherence Extensions

ACP Accelerator Coherence Port

AMBA Advanced Microcontroller Bus Architecture

AXI Advanced eXtensible Interface

BBench Browser-Bench

BFM Bus-Functional Model (driver/stub)

BLM Bandwidth and Latency Monitor

CMP Chip Multi-Processor

CPU Central Processing Unit

CSS Compute Sub-System

DMA Direct Memory Access

DMC Dynamic Memory Controller

DRAM Dynamic Random Access Memory

DVM Distributed Virtual Memory

FSM Finite State Machine

GIC Global Interrupt Controller

GPGPU General-Purpose computing on a GPU

GPU Graphics Processing Unit

I/O Input/Output

ICS Ice-Cream Sandwich (Android)

IP Intellectual Property

IPC Instructions Per Cycle

ISA Instruction-Set Architecture

kIPS kilo Instructions Per Second

LLC Last-Level Cache

LQV Latency Regulation by Quality Value

LRU Least-Recently Used

LSQ Load/Store Queue


MIPS Million Instructions Per Second

MPSoC Multi-Processor System on Chip

MSHR Miss Status Handling Registers

NoC Network on Chip

O3 Out of Order

OCP Open Core Protocol

OS Operating System

OT Outstanding Transactions

PIO Programmed I/O

PMU Performance Monitoring Unit

PoS Point of Serialization

PQV Periodic Quality of Service Value Regulation

PV Programmer’s View

QoS Quality of Service

QVN Quality of Service Virtual Networks

RT Real-Time

RTL Register Transfer Level

SCU Snoop Control Unit

SoC System on Chip

SWMR Single-Writer Multiple-Reader

TLB Translation Look-aside Buffer

TTM Time to Market

UFS Universal Flash Storage

VNC Virtual Network Computing


Chapter 1

Introduction

1.1 Motivation

The latest consumer-electronics MPSoCs combine many less-powerful CPUs instead of a single very powerful CPU for power efficiency. Such MPSoCs commonly include a GPU for graphics and general-purpose computing, and various application-specific accelerators. This mix of hardware components aims at providing compute power in the most efficient way, as power and performance are critical requirements. As the number of cores and IPs increases, so does the importance of communication. In most cases, these devices are not safety-critical. Chip makers spend huge efforts trying to improve average cases for better system performance: reducing access latencies, increasing throughputs, reducing off-chip accesses, etc. Introducing caches is a common approach for such average-case oriented designs. In a multi-core or multi-cluster SoC, caches must be orchestrated in order to maintain system-wide coherency. This challenge becomes more acute as not just CPUs, but also GPUs make use of caches, and as systems contain a mix of coherent and non-coherent blocks.

System coherency can be established in many ways. Software-based coherency is a conventional and common solution in which any access to shared data must be known up-front, and managed using synchronization mechanisms such as mutexes and semaphores. However, this comes at the expense of significant run-time overheads (e.g. the need to busy-wait for a lock), as well as additional programming challenges. Software-based coherence is infamous for lurking bugs and the additional efforts required to avoid them - directly impacting the critical Time-To-Market (TTM). In addition, software-based coherency requires consumers and producers to interact via the main memory. Each DRAM (off-chip) access is orders of magnitude more expensive (both power- and latency-wise) compared to an on-chip data access (e.g. SRAMs used by caches). In fact, we could squeeze in more than 1000 logical operations on chip before offsetting a DRAM access.
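To put rough numbers behind this claim (illustrative figures, not measurements from this work): if an off-chip DRAM access costs on the order of 100 ns while a simple on-chip logical operation takes on the order of 0.1 ns, the ratio is 100 / 0.1 = 1000 operations per avoided DRAM access, before even accounting for the energy gap.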

1.2 Hardware-based system coherency

ARM, a semiconductor and software company dominant in the field of mobile phone chips, develops a wide range of IPs such as RISC CPUs, GPUs, VPUs, high-speed connectivity products, etc. It introduced the Advanced Microcontroller Bus Architecture (AMBA), a de facto standard for on-chip communication, and develops various AMBA IPs. AMBA 4, the most recent version of the AMBA protocol family, contains the AXI Coherence Extensions (ACE) [6], aiming at providing hardware-based system coherency. Among ARM's latest products, the CoreLink CCI-400 Cache Coherent Interconnect [8] is a high-performance, power-efficient interconnect designed to interface between various processors, peripherals, dynamic memory controllers and so forth. CCI-400 is the first interconnect implementation which supports AMBA 4 ACE, thus enabling hardware-managed coherency for cache-coherent compute subsystems such as ARM Cortex clusters combined with GPUs and coherent I/O devices.

AMBA 4 ACE, and CCI-400 supporting this protocol, aim at hardware-based system coherency in order to improve on software-based coherency: reducing off-chip accesses, avoiding buggy software, shortening TTM and so forth. However, the assumptions at the base of ACE are yet to be proven. Investigating the true impact of hardware-based coherency is a challenging problem: it requires both developing realistic models of the hardware which supports ACE, and suitable workloads: software which actually utilizes this new infrastructure.

Recently ARM presented its big.LITTLE [7] architecture of SoCs, combining heterogeneous sets of processing clusters. In such an architecture, hardware-based coherency is key to allowing fast and inexpensive migration of tasks between clusters. Similarly, GPUs can take advantage of the additional instructions provided to directly communicate with the CPUs, saving off-chip activity and various synchronization overheads. However, these scenarios have not yet been tested on real systems, with dedicated software: there is currently no evidence that ACE indeed is advantageous.

This work aims at answering some of these questions, and at providing better tools to understand the true impact of ACE-like protocols in a modern compute subsystem. This research involves the entire spectrum of a system: how decisions at the architectural and micro-architectural levels affect global system performance. Correct modeling, appropriate experiments and valid conclusions require in-depth understanding across this spectrum.

1.3 Goals

The main objective of this research is to investigate the impact of a hardware-based system-coherent memory system, and in particular an interconnect which establishes system coherence. This is a challenging task for several reasons:

• Conducting system-level performance analysis in general, and interconnect performance analysis in particular, is both complex and tricky. Firstly, a full-system simulator is required: one that provides sufficiently realistic hardware models, capable of running real-world workloads. At the same time, the simulator has to provide a reasonable level of abstraction, both for enabling high-level models and for performance. The use of RTL simulations is limited in several respects:

– Implementation availability: a sufficiently mature implementation of the platform under investigation is required - which is rarely the case when exploration has to be done.

– Simulation speed: RTL simulations have notably crippling performance, which severely limits the workloads being used.

– Flexibility and composability: exploring configurations is very limited, and in many cases impractical for large design-space explorations.

Currently available computer-system modeling infrastructures, such as SimpleScalar [14], usually provide limited capabilities, making them less appropriate for full-system evaluation. For example, it is mostly not possible to use them for booting Linux or Android OS, or to run full-system workloads; mostly their IP library (e.g. GPU, ARM cores, other devices) is not diverse enough for composing realistic complete platforms. Simics [32], for instance, is such a full-system simulation platform, yet it is outdated, unmaintained, and no longer available, as it was commercialized in 1998.

• Availability of appropriate hardware models: there are currently no models for composing a platform which implements hardware-based system coherency. This can in theory be achieved using gem5's Ruby [19] memory model, yet that only remotely resembles CCI.

• Currently there are no workloads (hence software and drivers) that utilize or stress I/O coherency. Having the right hardware model is worthless if it is not exercised to demonstrate its full capabilities. There are, however, workloads that stress multi-core/multi-thread platforms (e.g. PARSEC [17] and SPLASH-2 [44]).

• The underlying assumptions made when designing AMBA 4 ACE seem well-based, and expectations regarding the potential benefit are high, yet this is a fairly uncharted research territory, with very limited relevant research done in this domain. There are no (published) existing SoC-oriented cache-coherent interconnects available other than CCI. As such, there is no shared knowledge about the impact different design options have on performance and costs.

1.3.1 Problem Statement

As with many other complex IPs, CCI-400 has to be evaluated for its performance and system-wide implications under various scenarios. Performing rigorous analysis and exploration on an RTL implementation is infeasible. Full-system models for simulation with a real operating system and realistic benchmarks are now within reach. However, they lack adequate modeling of system-coherence support, and more importantly a cache-coherent interconnect model. Hence, a high-level, functionally correct and performance-reflecting transaction-level model of the cache coherent interconnect is required.


1.3.2 Thesis Goals

• Capture the behavior of the ARM AMBA4-family Cache Coherent Interconnect in a transaction-level model.

• Develop a strategy to reason bottom-up about the model's fidelity. This includes defining an evaluation method, performance metrics, generation of stimuli, etc.

• Identify sharing behaviors in full-system simulation using realistic benchmarks. Analyze the interaction between the CPUs, GPUs, memory system and interconnect. Investigate the performance impact of CCI in various systems in terms of total latency per workload, cache statistics (e.g. hit/miss rates, blocking time), interconnect throughput, etc.

1.4 Contribution of this work

This work introduces a set of system-coherency transactions to gem5's memory system. The added value of this is multi-fold:

• It enables followers (e.g. gem5 users) to utilize these transactions with various workloads, and to analyze the impact of hardware-based cache coherence in a full-system context.

• The transactions implemented are inspired by AMBA 4 ACE transactions. Yet while ACE is intended to support a single-hop interconnect, gem5's bus model must support being part of a multi-hop interconnect. This is because gem5 provides flexible connectivity and easy composition of systems of many shapes and kinds. As such, each transaction that was added to gem5 was designed to comply with the ACE specifications when the bus is directly connected to masters and slaves, yet also to handle intricate situations which are not considered by ACE. Therefore, the extended transactions serve as leading examples and solve dilemmas raised in multi-hop cache-coherent interconnects. This topic is further discussed in Section 4.3.

• In order to develop, verify and evaluate these extensions in a fine-grained manner, a suitable infrastructure was created. It is based on gem5's MemTest, as described in Section 5.2. This framework is a convenient vehicle for small-scale experiments, as done in Section 6.3, for automated functional validation of the memory system, and a friendly means for introducing new transaction types to gem5.

In addition, this work was part of a joint effort in making gem5's bus model more realistic in the structural and temporal aspects. This included:

• Improved resource contention modeling, by splitting the bus into AXI-like layers replacing a unary resource. This is a leap forward from a tri-state-like bus model towards a realistic crossbar model with separate request, response, and snoop channels. Further details are provided in Section 4.2.

• Improving the temporal modeling of snoop traffic, from a loosely-timed approach towards approximately-timed snoop responses. This aims at overcoming modeling abstractions made by gem5's memory system at the expense of simulation performance, and is dealt with in Section 4.1.

The interconnect is the heart of its hosting SoC, arbitrating and transporting data in various patterns. As such, being able to collect data from the bus model is essential for gathering insight about the system. In order to improve the bus's observability, statistics of various bus internals were added. These statistics justified their necessity during the experiments stage, as an enabling tool for in-depth analysis of the system behavior (e.g. of cross-cluster sharing patterns, or the lack of sharing).

The experiments provided in Chapter 6 offer insights about system bottlenecks under various multi-threaded workloads, and mostly reasoning about the impact the interconnect has, in such circumstances, on the system's performance.

Apart from the aforementioned contributions, which are directly related to the investigation at hand, indirectly-related contributions to gem5 have been made. These are described in Appendix A.

1.5 Thesis organization

The rest of this report is organized as follows: Chapter 2 discusses recent publications dealing mostly with various approaches for implementing system-level coherency. Chapter 3 contains an in-depth description of system coherency, ACE, the modeled CCI-like interconnect, and gem5 - the simulation framework used for modeling CCI. This includes an overview of gem5's memory system's building blocks and semantics. This overview is required to understand the modeling challenges and inaccuracies dealt with next. In Chapter 4 the main drawbacks of the previous bus model are discussed, along with the changes that had to be implemented for the bus to become a more realistic coherent-interconnect model. Chapter 5 provides lower-level details of the implementation flow, the challenges in extending gem5 to support new transaction types, and a description of the verification process. The infrastructure and methodology developed are described, as these have great value for any future work on gem5's memory system. Chapter 6 describes the experiments performed for gaining insight about the performance of a coherent compute subsystem under various workloads, followed by a discussion of the results. Small-scale experiments were used to clearly demonstrate the impact of varying snoop-response latency on a system's performance. Large-scale experiments provide a better understanding of how dependent this impact is on a workload's sharing patterns. Chapter 7 provides general conclusions from this research, as well as a collection of related challenges that either were beyond the scope of this work, or are now within reach. Appendix A contains a list of contributions to gem5 the author made which were not necessarily related to the problem under investigation, yet were part of a joint open-community effort to improve gem5 as an enabling architectural-exploration platform.


Chapter 2

Related Work

The niche covered by this project intersects the following general domains: on-chip interconnects, cache and I/O coherency, coherent-device sharing patterns, and memory system performance. As SoC-oriented I/O-coherent interconnects are cutting-edge technology, very little published related work exists, either due to the topic's novel nature, or because vendors avoid sharing their proprietary knowledge.

2.1 System coherence protocols

• ARM's latest cache-coherent interconnect, CoreLink CCI-400, is the first to support hardware-managed coherency as defined by AMBA 4 ACE [6]. This is a differentiating feature under investigation, with a promising performance gain wherever interaction between CPU and I/O is a bottleneck. I/O data coherence deals with maintaining coherence of data shared by a CPU and a peripheral device. Typically, CPU and I/O devices interact according to a consumer-producer model, each signaling to the other when data is ready or actions have to be taken, using some signaling mechanism. As consumer devices, such as set-top boxes, contain a growing number of peripherals, enforcing I/O coherence becomes a challenging topic. In [16], four types of solutions for this challenge are presented:

– Software-managed coherence is typical for traditional embedded architectures. The CPU's caches are explicitly managed by software, for instance by forcing write-backs or performing uncached write operations.

– Hardware-managed coherence is a typical solution in general-purpose computers, and is in general transparent to software. It may significantly improve average-case performance for some workloads, yet it might cause negative effects (e.g. consume snoop bandwidth in vain when data is not shared).

– Scratchpad-based: in this approach, instead of transferring data through the main memory, I/O data passes via private memories which must be completely managed by software. This solution is inappropriate when sharing data across several CPUs, as each CPU can only access its own scratchpad.

– A hybrid solution that combines all previous solutions: data can flow via more than one channel between devices, making coherence management a tougher task, aiming at harvesting the fruits of all approaches.

These four solutions are compared by their potential performance, design complexity (directly affecting the product's time-to-market), silicon costs and suitable applications. Amongst these solutions, the optimal one depends on application and system characteristics. In this work, the underlying assumption is that hardware-based system coherence can provide sufficient advantages at run time to justify the hardware's complexity. In addition, this work contains a more in-depth quantitative evaluation of the impact of hardware-supported system coherence, providing more insight. The avid reader can find in [16] an extensive explanation of each of the possible solutions for providing system-level coherency.

• The Open Core Protocol (OCP) [40, 30] is an openly-licensed alternative to ARM's AMBA family of protocols, both aiming at system-level integration of IPs. OCP is an international partnership, driven by leading semiconductor companies such as MIPS, Nokia, and Philips. The OCP open-community alternative to ARM's proprietary system coherence protocol is a system-level cache coherence extension to OCP, presented in [5]. The proposed extension is a backwards-compatible coherent OCP interface. The authors discuss the design challenges and implications introduced by this extension. Similarly to ACE's cache line states terminology described in Section 3.1.4, the OCP extension supports a range of coherence protocols and schemes - hence both are orthogonal to any specific coherence protocol. The authors describe how a snoopy bus-based scheme can be specified, as well as a directory-based scheme. The correctness of the OCP extension was verified using NuSMV [22], a symbolic model-checking tool. In contrast, in this work, correctness is verified by means of simulation-based functional testing.

• Enabling support for ACE transactions, as for any coherence protocol, involves much more than changes to the bus. Hardware support for system coherence requires changes to the entire memory system to support new transactions, including changes to the bus, the cache controllers, support for multi-layered hierarchies, etc. Such challenges are discussed in [24], providing both theoretical examples and a real-world multiprocessor system - the SGI Challenge board. While no performance analysis or simulation results are provided in [24], in this work a quantitative study is also provided, in Chapter 6.

• The concept of hardware cache-coherent I/O is not a recent notion outside the context of mobile devices. In [31], the HP PA-RISC architecture was introduced as part of the HP 9000 J/K-class program. Its cache coherency scheme allows the participation of I/O hardware, reducing coherence overheads on the memory system and processors. The concept of hardware-based cache-coherent I/O is presented, including implications for software aiming to utilize it. I/O data transfers are commonly of two kinds: DMA or PIO. DMA transactions take place without the intervention of the host processor. PIO requires the host processor to move data by reading and writing each address. In the presented architecture, PIO transactions are not part of the memory address space, and thus are not a coherency concern. Only DMA transactions are treated as part of the memory coherency challenge. The heart of the proposed solution is an I/O Adapter which bridges between the processor-memory bus (the Runway bus, which all processors use for snooping) and the HP-HSC I/O bus. A set of limiting assumptions on the transactions made by an I/O master is provided, hence unlike an ACE or ACE-Lite master, it is not just another master. Very limited quantitative data regarding the benefits of this solution is provided. In gem5, as in ACE, any I/O master can participate in the cache coherence scheme like any other master, with no limitations on the transactions made by the I/O master. In addition, allowing ACE-Lite masters (see Section 3.1.4 for more details) enables relieving the master from handling incoming snoops, in cases where this is either redundant or software coherence is used.

2.2 Coherent interconnects

• The tight coupling of interconnect ordering properties to cache coherence schemes is discussed in [33]. While a shared snooping bus has until now been adequate for multi-core designs, the integration of CPU clusters and more coherent clients on a single chip requires a different type of coherent interconnect. A greedy snooping algorithm is presented, including its applicability to an arbitrary unordered interconnect topology. While a bus enforces total ordering and atomicity, modern interconnects such as the CCI-400 might not. The presented solution requires each cache controller to broadcast coherence requests to all other nodes in the system. A processor first sends its coherence request message to an ordering point to establish the total order. The ordering point then broadcasts the request message to all nodes. The ordering point also blocks on that address, preventing any subsequent coherence requests for the same address from racing with a request already in progress. CCI enforces memory consistency by ordering transactions to the same address using a Point of Serialization (PoS) per master interface. Simply put, this additional hardware stalls any incoming request if another transaction to the same address is still in flight beyond the PoS (a code sketch of this stalling behavior is given after this list). The gem5 bus model, however, does not limit or stall in-flight requests to the same address, thus implementing a weaker consistency model. Similarly to the solution presented, gem5's coherence protocol requires each shareable request to trigger a snoop broadcast to all snoopable masters, as discussed in Chapter 4.

• The system coherence support provided by ACE enables hardware blocks across the system to snoop each other. Different workloads can demonstrate very different sharing patterns, e.g. how much data is actually transferred on-chip as a result of inter-IP snooping, and how much snooping traffic was in vain. This work provides quantitative insight for various workloads, analyzing a major part of the cost of system coherence. Redundant snoop traffic can be reduced by introducing snoop filters. In [4], a method for filtering snoops based on dynamic detection of sharing patterns is presented. This enables more scalable cache coherence, reducing the impact of the broadcasts at the heart of snoopy protocols. AMBA 4 ACE specifies optional support for external snoop filters in the form of dedicated transactions and possible cache line state transitions. For example, a master must broadcast an Evict message for each cache line that it evicts, to notify snoop filters about the change.

• Moving from an atomic bus to split buses decouples the request and response phases of a transaction, enabling new requests to be issued before previous ones have been responded to. This potentially complicates coherence, as conflicting requests can be issued. In [26], two techniques for overcoming this hurdle are proposed: a retry technique and a notification technique. As in [24], the authors provide the SGI Challenge system as a motivating example, in which requests which might create a conflict are delayed. This solution is impractical when a processor cannot snoop data responses intended for other processors and foresee a conflict. The two proposed techniques are evaluated using simulations. Similarly, the gem5 bus model, which was initially a unary resource, has been split into AXI-like layers (as described in Section 4.2) for realistically modeling an ARM interconnect. The modified gem5 bus model does not limit the transactions in flight, hence conflicting requests can be issued. Coherence is enforced by the cache controllers, and by utilizing express snoops and instantaneous upstream requests, the need to implement transitional cache-line states and handle race conditions is avoided. As a side-effect, snoop responses occur instantaneously, and therefore snoop responses are punished to compensate for the temporal inaccuracy of snoop requests. The method used for improving the modeling of snoop temporal behavior is further discussed in Section 4.1.

• A major difference between ACE and previous non-coherent AMBA protocols is the addition of dedicated AXI channels for snoop requests and responses (namely an address channel - AC, a response channel - CR, and a data channel - CD). A different approach to establishing a coherent interconnect is presented in [38]. The authors observe that control messages are short, while data messages typically carry cache-block-sized payloads. To exploit this duality, the proposed interconnect architecture consists of two asymmetric networks to separate requests and responses, and to reduce the channel width of the request network. The authors used FLEXUS [43], a cycle-accurate full-system simulator, for simulating a tiled CMP, and ORION 2.0 [29] for power estimations. Their experiments demonstrated reduced power consumption with minor performance degradation.

• Similarly to the approach in [38], a NoC which differentiates short control signals from long data messages is presented in [20]. The differentiation is implemented using traffic prioritization, and its purpose is to reduce cache access delays. However, the proposed architecture is aimed at larger-scale CMPs and therefore utilizes a directory-based coherence protocol. Here again, the multiplexing of coherence traffic on the same channels motivates the added complexity. While priorities are a typical solution in a multi-hop network (and are reflected in a NoC), the single-hop CCI is aimed at sufficiently small SoCs, where a crossbar interconnect is most adequate.
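To make the Point of Serialization described in the first item of this section concrete, here is a minimal C++ sketch; the class, names, and 64-byte line size are our assumptions for illustration, not CCI's actual implementation. A request to a cache line with a transaction still in flight beyond the PoS is stalled until that transaction completes.

#include <cstdint>
#include <unordered_set>

// Per-master-interface Point of Serialization (PoS): stalls an incoming
// request when another transaction to the same cache line is still in
// flight beyond the PoS.
class PointOfSerialization {
public:
    // Returns true if the request may proceed; false means "stall and retry".
    bool tryIssue(std::uint64_t addr) {
        const std::uint64_t line = addr & kLineMask;
        if (inFlight.count(line))
            return false;             // conflict: hold this request back
        inFlight.insert(line);
        return true;
    }

    // Called once the transaction's response has drained past the PoS.
    void complete(std::uint64_t addr) { inFlight.erase(addr & kLineMask); }

private:
    static constexpr std::uint64_t kLineMask = ~std::uint64_t{63}; // 64-byte lines assumed
    std::unordered_set<std::uint64_t> inFlight; // lines with a transaction in flight
};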

2.3 Memory system performance

• A significant part of this work aimed at improving the temporal modeling of the gem5 memory system, and the ability to measure its performance. In [35], a similar study was performed, aiming at gaining insight into Intel's Nehalem multiprocessor system. Specifically, that work tried to quantify the performance of the memory system at different locations, as it is composed of multi-level caches, an integrated memory controller, and the Quick Path Interconnect. This work provides similar insights: the transportation of transactions through the bus is modeled more accurately, the bus's resource-contention modeling is improved, and new statistics are introduced to the bus model and its layers, significantly increasing the system's observability.


Chapter 3

Background

3.1 System coherency and AMBA 4 ACE

At the heart of this project lies the CCI-400, a cache-coherent interconnect. Understanding its implications on the system requires familiarization with memory consistency models and cache coherence.

3.1.1 Memory Consistency

Firstly, in order to motivate why these are needed, a simple example is given in Listing 1.

// assume initial values of the following global variables
// bool flag = false;
// int data = OLD_VAL;

void thread_1() {
    data = NEW_VAL;
    flag = true;
}

void thread_2() {
    int local_data;
    while (!flag) {};   // spin until thread_1 sets the flag
    local_data = data;
}

Listing 1: A 2-core shared-memory example

Assume a multi-core system, where thread_1() and thread_2() are simultaneously invoked. One would expect local_data's value to be NEW_VAL at the end of each invocation, yet this is not the case: it depends on the execution order of the load and store instructions issued. In this example, a shared memory is used by two independent processes; only if all instructions are executed sequentially in program order will local_data be guaranteed to eventually contain NEW_VAL. Such ordering is one feature of a memory consistency model. The reader can find an extensive introduction to memory consistency and cache coherence in [39], including motivation, formal proofs, high-level concepts, theoretical solutions, and real-world examples.
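For contrast, a minimal sketch of how Listing 1 can be made well-defined, assuming C++11 atomics rather than the plain globals above (this reworking is ours, not part of the thesis's listing): declaring flag as an atomic with the default sequentially consistent ordering guarantees that thread_2 observes NEW_VAL.

#include <atomic>

constexpr int OLD_VAL = 0, NEW_VAL = 42;

std::atomic<bool> flag(false);   // sequentially consistent by default
int data = OLD_VAL;

void thread_1() {
    data = NEW_VAL;
    flag.store(true);            // ordered after the store to data
}

void thread_2() {
    int local_data;
    while (!flag.load()) {};     // spin until flag is observed as true
    local_data = data;           // now guaranteed to read NEW_VAL
}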

A memory consistency model defines both what behaviors programmers can expect when running programs, and which optimizations system implementors may use for improving performance or reducing costs. The three leading models presented in [39] are:

• Sequential Consistency (SC), the most straightforward model, requiring the "memory order" (a total order on the memory operations of all cores) not to conflict with any "program order" (the order of per-core operations).

• Total Store Order (TSO), a more relaxed model, which enables the use of issued writes, hence write-buffers, under some limitations. Whenever explicit ordering is required, "FENCE" instructions are used.


• eXample relaxed Consistency model (XC), which enables out-of-order scheduling under most circumstances, unless ordering is explicitly enforced using "FENCE" instructions.
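As an illustration of such "FENCE" instructions, the following sketch (ours; it maps the FENCEs to C++11 atomic_thread_fence as an assumed analogy) recovers the ordering of Listing 1 with otherwise relaxed accesses:

#include <atomic>

std::atomic<bool> ready(false);
int payload = 0;

void producer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // "FENCE": no earlier write may pass this point
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) {};
    std::atomic_thread_fence(std::memory_order_acquire);  // "FENCE": no later read may be hoisted above this point
    int seen = payload;                                   // guaranteed to read 42
    (void)seen;
}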

Each model raises implementation challenges. A common example is how atomic instructions can be realized. In general, models are evaluated using the 4 Ps:

• Programmability: how easy it is to understand the model, and to write multi-threaded programs which conform to it.

• Performance: the trade-offs the model allows between performance and costs.

• Portability: the complexity of modifying code from one model to another.

• Precision: how precisely the model is defined, mostly in formal representation and not only using natural language (although the latter is common in practice).

3.1.2 Cache Coherence

The use of caches for improving average-case performance comes at the price of having to manage all copies of a memory location, to avoid making use of obsolete values. The goal of a coherence protocol is to maintain coherence by enforcing the following invariants:

• Single-Writer, Multiple-Reader (SWMR): at any time, for a given memory location, either one core may write (and read) the location, or one or more cores may only read it.

• The Data-Value Invariant: updates to the memory location are passed on correctly, so that cached copies of the memory location always contain the most recent version.

To implement these invariants, each storage structure (e.g. cache, LLC/memory) is associated with a finite state machine called a coherence controller. The controllers create a distributed system, exchanging messages to ensure these invariants. This interaction is specified by the coherence protocol. Each controller implements an FSM per block, and processes events for each block that might change its current state. For each cache, each block B is characterized by its validity (holding a valid/invalid copy of B), dirtiness (a modified B), exclusivity (permission to modify B), and ownership (responsibility for answering queries regarding B). An example of a primitive protocol is one with states V (valid) and I (invalid), and a transient state IV^D, stating that B is invalid and becomes valid once data D arrives. Most coherence protocols support a subset of the {M(odified), O(wned), E(xclusive), S(hared), I(nvalid)} set of states. While adding more states enables better performance, it comes at a price of design complexity.
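To make the per-block controller notion concrete, here is a minimal sketch of the primitive VI protocol just described (our illustration; the event names are invented), with the transient state IV^D modeled explicitly:

enum class State { I, V, IV_D };  // IV_D: invalid, waiting for data D to become valid
enum class Event { CoreRead, DataArrived, Invalidate };

struct BlockController {
    State state = State::I;

    void handle(Event e) {
        switch (state) {
        case State::I:
            if (e == Event::CoreRead)
                state = State::IV_D;   // request issued; wait in the transient state
            break;
        case State::IV_D:
            if (e == Event::DataArrived)
                state = State::V;      // data D arrived: the copy is now valid
            break;                     // a fuller protocol must also handle races here
        case State::V:
            if (e == Event::Invalidate)
                state = State::I;      // another core wants to write: drop the copy
            break;
        }
    }
};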

Coherence protocols are generally divided into two groups:

• Snooping Coherence Protocols: a cache controller initiates a request for a block by broadcasting a request message to all other coherence controllers. The coherence controllers collectively do the right thing. Such protocols rely on the interconnection network to deliver the broadcast messages in a consistent order to all cores, and hence require the use of a shared logical bus with atomic request handling.

• Directory Coherence Protocols: scalable alternatives based on uni/multicasting queries instead of broadcasting. A cache controller initiates a request for a block by unicasting it to the memory controller that is the home for that block. Each memory controller maintains a directory that holds state about each block in the LLC/memory. When the owner controller receives the forwarded request, it completes the transaction by sending a data response to the requestor. Hence, although scalable, transactions might take more time to complete.
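A hedged sketch of the per-block state such a home directory might keep (the field names and system size are our assumptions):

#include <bitset>

constexpr int kMaxCores = 16;       // assumed number of coherence controllers

struct DirectoryEntry {
    enum class State { Invalid, Shared, Modified } state = State::Invalid;
    std::bitset<kMaxCores> sharers; // caches holding a read-only copy
    int owner = -1;                 // controller responsible for supplying the data
};

With this state, the home memory controller can forward a request to the current owner or invalidate exactly the sharers, instead of broadcasting to all cores.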

CCI-400, for instance, implements a fully-connected snoop topology, enabling each slave interface to snoop the others (see Section 3.2.1 for details). As with memory consistency models, implementation challenges (such as using non-atomic buses) are abundant. Yet while memory consistency models may complicate a programmer's work, hardware-managed cache coherence should be transparent to the programmer.


3.1.3 System-level coherency

Caches have proven to be an efficient means of improving average-case performance in general-purpose computing. They provide low-latency accesses to local copies of the memory, while eliminating many off-chip accesses. As such, the use of caches has spread to other SoC components such as GPUs and accelerators. While the benefits and costs of caches in CPUs are by now well established, this is not the case for systems with caches in multiple devices.

System-level coherency extends the same principles used with CPU caches to the entire system's scope: the contents of any memory address may reside in more than a single place in the system, as long as all copies are of the most recent data. This avoids making use of stale copies of this address. At any moment, at most one local copy can be modified. To obtain this privilege (write access), no other cache may hold a copy of that line.

Maintaining memory consistency and cache coherence can be implemented either in software or in dedicated hardware. Furthermore, sharing of resources (such as the main memory) requires an interconnect, ranging from a simple bus to a complex network-on-chip. In addition to these typical systems, smartphones, tablets, and other mobile devices are all by now cache-coherent systems. A top-level block diagram of such a system is provided in Figure 3.1.

Figure 3.1: Example top-level system using CCI-400. Source: [8]

The system's main processing units are the two coherent Cortex-A15 clusters and the Mali-T604 GPU, sharing distributed memory (including a DDR and a wide I/O DRAM). The interconnect of this system is the CCI-400, gluing together various clients using the AMBA 4 ACE protocol. The MMU-400 [10] is a System Memory Management Unit, which controls address translation and access permissions; it serves a single master, such as a GPU. Other key components are the DMC-400 [13] (a dynamic memory controller), the NIC-400 (a network interconnect), and the GIC-400 [9] (a Generic Interrupt Controller). The system's main interconnect, the CCI-400, establishes cross-system consistency and coherence based on the ACE protocol. In this example, slave interfaces S4 and S3 (both connected to Cortex-A15 processors) support the ACE protocol; full coherency and sharing of data is managed between these processors. The CCI-400 is further described in Section 3.2.1.

Advantages over software-based coherency

Software-based coherency requires explicit orchestration between sharing components, and is based on off-chip shared data structures. This is commonly a complex and error-prone task. In addition, running the software required for managing coherency consumes processing time and, as such, also power. Invalidation of cache lines is a must when passing data between caches, introducing off-chip accesses. In general, as the number of caches and their sizes increase, a more efficient solution is needed.

In addition to the conventional advantages of using caches, ARM has recently presented its multi-clustered big.LITTLE [7] architecture, depicted in Figure 3.2. This architecture introduces a new scenario in which system-level hardware-based coherency has a promising advantage.

Figure 3.2: big.LITTLE architecture concept. Source: ARM

In this scenario, depicted in Figure 3.3, a task is migrated from the outbound processor to the inbound processor, for instance in order to allow it to run on a slower and more power-efficient processor, or vice versa. System coherence allows seamless transfer of data from the outbound processor's cache to the inbound processor's cache, directly on-chip via the interconnect.

Figure 3.3: big.LITTLE task migration. Source: [12]

This scenario assumes that most accesses by the inbound cluster will, due to temporal and spatial locality, be to cache lines currently residing in the outbound processor's cache. Once the thread has been migrated to execute on its new cluster, the outbound processor can be powered off. After a period of time, or depending on cache hit-rate changes, the outbound cache can be flushed and powered down.

3.1.4 AMBA 4 AXI Coherence Extensions

As its abbreviation states, ACE [6] is an extension of AXI for providing hardware-based system-level coherency. ACE aims to keep traffic on-chip as much as possible, allowing multiple up-to-date cached copies of the same address. It aims at supporting various types of caches, and not only the typical processor caches. For instance, an I/O coherent device might have a write-through cache, and hence cannot accept a dirty cache line when issuing a read request.

Table 3.1: Projection of ACE cache line state to MOESI naming. Source: [12]

ACE notation | MOESI abbreviation | MOESI meaning | ACE meaning
Unique Dirty | M | Modified  | Not shared, dirty, must be written back to memory
Shared Dirty | O | Owned     | Shared, dirty, must be written back to memory. Only one copy can be in Shared Dirty state.
Unique Clean | E | Exclusive | Not shared, clean
Shared Clean | S | Shared    | Shared, no need to write back, may be clean or dirty
Invalid      | I | Invalid   | Invalid

ACE is designed for effective scaling, supporting snoop-based, directory-based, and hybrid coherency mechanisms (using snoop filters). In fact, ACE is protocol-agnostic, and is based on five cache-line state terms:

• Unique: the cache line resides only in this cache.

• Shared: the cache line may be in another cache. A line held in more than one cache must be in Shared state, and contain the same data.

• Clean: the cache controller does not have to update the main memory.

• Dirty: the cache controller is responsible for updating the main memory.

• Invalid: the cache line must not be considered valid.

The state-space of possible combinations of these indicators is depicted in Figure 3.4.

Figure 3.4: ACE cache line states. Source: [12]

The ACE cache line states are designed to support components using any protocol of the MOESI family. Devices complying with ACE are not required to support all five cache states. For example, an ARM Cortex-A15 internally implements a MESI protocol. As an example, the mapping of ACE to MOESI is described in Table 3.1.
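For illustration, the projection of Figure 3.4 and Table 3.1 can be expressed as a function of the three indicators; the following is a trivial Python restatement of the table, not any real API:

    # Restatement of Figure 3.4 / Table 3.1: derive the ACE state name,
    # and its MOESI equivalent, from the validity/uniqueness/dirtiness flags.
    def ace_state(valid, unique, dirty):
        if not valid:
            return "Invalid"                                  # MOESI: I
        if unique:
            return "UniqueDirty" if dirty else "UniqueClean"  # MOESI: M / E
        return "SharedDirty" if dirty else "SharedClean"      # MOESI: O / S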

In order to accommodate coherence transactions, AXI was extended with three new channels, as depicted in Figure 3.5:

• AC channel - Coherent address channel: ACADDR, used for sending the address of a snoop request to a cached master, accompanied by control signals.

• CR channel - Coherent response channel: CRRESP, used by the master to respond to each snoop address request. A narrow 5-bit response indicating data transfer, dirty data, and sharing.

• CD channel - Coherent data channel: CDDATA, used by the master to provide data in response to a snoop. Optional for write-through caches.

The ACE specification defines two types of ACE master interfaces:

• Full-ACE, which contains all ACE channels, as depicted in Figure 3.5. Hence a full-ACE master can issue snoop requests and can be snooped by the interconnect. An example of such a master could be an ARM Cortex-A15 cluster.


Figure 3.5: AXI and ACE channels. Source: [12]

• ACE-lite, which does not include the AC, CR, and CD channels, yet has the additional signals on the existing AXI channels; this enables it to issue snoop requests, but it cannot be snooped. An example of such a master can be a GPU, or a coherent I/O device.

In ACE, unlike AXI, snoop requests must be responded to in order, as they do not have an ID. The interconnect controls the progress of all transactions, and may either issue snoops to masters in parallel or serialize them. Accesses to external memory can be issued upon a snoop miss, or speculatively before snoop responses have arrived. However, if a response from the external memory arrives prior to a snoop response, it is discarded, and the fetch might be re-issued if necessary. Such a scenario is possible, for instance, if one of the caches being snooped is in a low-power state and is therefore slow to respond.

3.1.5 ACE transactions

At the heart of the ACE protocol are the transaction types, which are aimed at minimizing off-chip traffic at the expense of additional on-chip transactions, added logic, and the resulting power consumption. The ACE transaction bank can be categorized according to several criteria:

• Shareability, which is a generalization of cacheability. A shareable address can reside in more than one cache simultaneously. To clarify the difference between non-shareable and uncacheable addresses: an uncacheable address cannot reside in any cache in the system, whereas a non-shareable address can reside in a cache, yet not in more than one.

• Shareability domain, denoting a group of masters that might share an address range. Each shareable transaction contains its shareability domain, in order to avoid issuing redundant snoops to masters which cannot contain this address.

• Bufferability, or more specifically, whether or not a response has to come from the final destination, and a write request has to be made visible at the final destination. For instance, a normal non-cacheable bufferable read request can be responded to by any master, yet the data should not be stored locally for future use. This means received data should not cause allocation of the line in the local cache.

• Cache Maintenance transactions, such as invalidation broadcasts.

• Write-Back transactions: either requiring the written line to stay allocated or not. In addition, an Evict notification transaction is optional, and is only required for supporting snoop filters.

• Response data criteria, such as:


– ownership-passing: whether the response can or cannot be a dirty line. A requesting master might not be able to accept ownership, thus must be provided with a clean response.

– uniqueness: whether the response can be shared with other caches or must be unique.

Figure 3.6 provides an overview of all ACE transactions, including an indication of whether a transaction is supported by ACE-lite masters or only by full-ACE masters. The complete specification of all ACE transactions is available in [6].

Figure 3.6: ACE transaction groups. Source: [12]

The selected transactions that were modeled are further described as part of Section 4.3.

3.2 Interconnect modeled: CCI++

3.2.1 CoreLink CCI-400

"CoreLink" [11] is a family of system IPs developed by ARM, which includes interconnects, memory controllers, system controllers, and accompanying tools. It supports ARM's latest AMBA 4 interface specifications, including the AXI Coherency Extensions (ACE) interface protocols.

The CoreLink product range includes an SoC-oriented interconnect, the CCI-400 Cache Coherent Interconnect [8], which combines interconnect and coherency functions in a single module. The CCI-400 supports connectivity for up to two ACE [12] masters (e.g. Cortex-A15 multicore processors) and three ACE-Lite masters (e.g. Mali-T604 GPUs or coherent I/O devices), and provides three ACE-Lite master interfaces (e.g. for memory controllers and system peripherals). Additional features of this high-bandwidth, cross-bar interconnect are:

• A fully-connected snoop topology: each ACE/ACE-Lite slave interface can snoop the other ACE slave interfaces. This provides higher snoop bandwidth compared to ARM's previous I/O coherency solution, the Accelerator Coherency Port (ACP), which allowed external processors to snoop inside the caches of Cortex-A15 clusters. Also, ACP required the entire snooped cluster to be powered up in order to perform snooping from a remote cluster.

• Quality-of-Service (QoS) regulation for shaping traffic profiles, based on latency, throughput, or average outstanding transactions.

• A Performance Monitoring Unit (PMU), which consists of an event bus, event counters, a clock cycle counter, etc. It enables counting performance-related events, such as measuring the snoop hit rate for read requests on a particular slave interface.

• A Programmer's View (PV) to control the coherency and interconnect functionality.

• Speculative fetch support: each master interface can issue a fetch in parallel with issuing a snoop request, thus hiding snoop latencies in case of a snoop miss.


• Support for all types of AMBA 4 barrier transactions. These are broadcast from each slave interface to each master interface, ensuring that intermediate transaction source and sink points observe the barrier correctly. This is commonly used in order to guarantee execution order in less strict memory models, as discussed in Section 3.1.1.

• Support for transporting Distributed Virtual Memory (DVM) messages between masters. DVM transactions enable invalidating other clusters' TLBs. The DVM support is based on a separate physical network, yet is part of the hardware-offloading effort, potentially replacing software-based TLB invalidation.

• An independent Point-of-Serialization (PoS) (referred to as a serialization point in [39], or point of coherence) per connected slave. All transactions to any single address in the system have to be serialized. As such, the interconnect arbitrates which request is accepted, and thus orders all requests. Speculative fetches bypass the PoS, yet in case the response from a slave arrives before all snoop responses, it is ignored, as it might contain a stale copy.

Figure 3.7 depicts the micro-architecture of the CCI-400.

Figure 3.7: CCI-400 internal architecture. Source: [8]

Due to the commercial nature of this product, apart from the above-mentioned information (originating from its Technical Reference Manual [8]), the internals are mostly undisclosed.

As CCI-400 is the first commercial SoC-oriented interconnect with system-level coherency support, it sets itself up as a useful vehicle for investigating system-level coherency in a real-world compute subsystem context. In addition, CCI-400 was planned to be the interconnect of a compute subsystem which is further discussed in Section 3.6, enabling correlation of simulation results with real hardware.


3.2.2 CCI++: extending ACE and CCI to multi-hop memory systems

ARM's CCI-400 is a state-of-the-art SoC-oriented interconnect. However, when performing such modeling-based research, one should focus on the most interesting and relevant aspects:

• The research is aimed at investigating the impact of hardware-based system-level coherence and does not aim at functional verification of a product. Rather, the product is a vehicle for understanding technology capabilities and trends. The product provides a correlation point for aligning simulations with reality on a small scale, in order to justify large-scale experiments which cannot be performed otherwise.

• Creating a cycle-accurate model which fully covers the product's features and temporal behavior has a very high price tag:

– Achieving such an accurate level of modeling requires several man-years.

– A cycle-accurate model will cripple simulation performance, and thus will significantly limit the feasible set of workloads.

• Similarly, all hardware models currently available in gem5 provide some level of abstraction, which means they are not a one-to-one representation of a product. As such, some of the abstractions made limit the ability to realistically model CCI's behavior, or will exhibit different performance. Examples of such differences are discussed in Section 3.7.

• Investigations should focus on general concepts and should not be limited to any specific implementation. The model should avoid limitations which are platform- or product-specific, and serve a broader audience.

As such, the changes introduced to gem5 as part of this work should not be measured on the same scale as CCI-400, as they provide much more:

• On the one hand, we limited the research to the most interesting and representative ACE transactions supported by CCI-400 (as elaborated in Section 4.3), which resulted in gaining many insights about the challenges in implementing hardware-based system-level coherency. Implementing all ACE transactions would, in many cases, be more of the same, when aiming at a qualitative analysis.

• On the other hand, the ACE specifications, and the CCI that implements them, were designed to provide a single-hop interconnect solution. That is, ACE transactions can only exist directly between an ACE master, the CCI, and ACE slaves. However, as systems scale up and contain more components, the use of buses and bridges becomes a must. Multi-layered hierarchies of coherent components will require a scalable solution, one that does not involve a single, central interconnect. Such complex systems are expected to utilize multi-hop interconnects, in which end-to-end transactions pass through several relays. As such, the gem5 coherent bus model was extended to support ACE-like transactions, such that:

– when the bus model is instantiated in a classic CCI context, where it is directly connected only to ACE-like masters and slaves, it functions according to the ACE protocol. Such a system is depicted in Figure 3.8(a): the system contains two CPUs, each with a private cache. Each cache is connected to the main, single interconnect. From the interconnect's perspective, these caches are ACE masters, as they issue ACE-like transactions. In addition, the interconnect is connected to a DRAM controller, representing an ACE-like slave.

– when the bus model is instantiated in a multi-layered system containing several buses, e.g. for arbitrating between two level-1 caches and a level-2 cache, all such intermediate buses must be able to handle ACE-like transactions. However, these intermediate buses must differentiate between slaves which correlate to ACE slaves, and slaves which may lead to ACE-like masters. An example of such a system is depicted in Figure 3.8(b). This system consists of four CPUs in a two-cluster formation. Each cluster has its own level-2 cache, shared by two level-1 caches. Therefore each cluster requires its own CCI-like bus to arbitrate access to the level-2 cache. For instance, in such a system, an intermediate bus might receive a downstream snoop request from its slave port, a use-case which is not possible in the ACE architecture. Such complex systems will most probably exhibit inter-cluster sharing (depending on the workloads at hand). As such, they are extremely important when investigating the impact of hardware-managed system-level coherency. Therefore, the ACE-like transactions modeled were extended beyond the definition of the ACE specifications, to support multi-hop interconnects. To avoid any confusion or misinterpretation of the gem5 bus model's implementation of memory transactions, it is better treated as an extended version of CCI, hence a CCI++, as it provides much more in the context of multi-hop interconnects.


Figure 3.8: gem5 example systems with typical CCI-like interconnects. (a) A gem5 system with an ACE-like interconnect; (b) a gem5 system with interconnects supported by CCI++.


As such, each ACE transaction that was implemented required extrapolating the original requirements to a broader context. The impact of these changes is not limited to the bus model only, but extends to the entire gem5 memory system, as described in Section 3.5. For instance, an ACE-like transaction issued by a CPU must be interpreted by a cache, which should either handle the request on its own, forward it onwards, or generate a different type of transaction and dispatch it to its slave (downstream).

We capitalize on the flexible connectivity provided by gem5 to investigate complex systems, rather than limiting the scope of this research to systems with a single, central coherent interconnect.

Figures 3.8(a) and 3.8(b) were automatically generated using gem5. This feature, including the interpretation of such diagrams, is further described in Appendix A as one of this project's indirect contributions.

3.3 The Modeling Challenge

A full-system simulation framework enables design-space exploration. It allows performing functional feasibility analysis, performance analysis, and power and cost estimation. Performing such simulations using an actual hardware implementation (e.g. RTL simulation) may be very accurate, but is less convenient for exploring new designs. On the other hand, many simulators lack adequate models for existing components. Models should be as realistic as possible, without compromising on simulation speed. Quantifying "realistic" is a tricky challenge. When analyzing the behavior of full systems, the ability to utilize workloads that run on actual hardware is critical for correlating models. It enables extrapolating from the correlated set of points to any modeled platform, and reasoning about the conclusions made. To that end, gem5 is a simulation framework which offers convenient exploration, can run real-world workloads, and has been shown to correlate with real hardware.


3.4 Simulation and modeling framework: gem5

3.4.1 gem5

gem5 [19] is a flexible and highly configurable full-system event-driven simulation framework. gem5 has a detailed and flexible memory system, useful for exploring state-of-the-art interconnects and protocols, such as those this project involves. As depicted in Figure 3.9, it offers a wide spectrum of simulation speed vs. accuracy, based on the following orthogonal set of capabilities:

Figure 3.9: gem5 Speed vs. Accuracy Spectrum. Source: [19]


• Four CPU models: AtomicSimple (a single-IPC model), TimingSimple (which includes timing of memory references), InOrder (a pipelined in-order CPU), and O3 (a pipelined out-of-order CPU).

• Two system modes: System-call Emulation (SE), which emulates most system-level services, thus eliminating the need to model devices and an operating system, and Full System (FS), which provides a bare-metal environment, including peripheral devices and an operating system.

• Two memory systems: Classic, a fast and easily configurable system, and Ruby, for studying new coherency mechanisms in networks-on-chip. In this work we extend the Classic memory system to capture the behavior of the CCI-400.

• Support for major popular ISAs, including ARM, ALPHA, MIPS, Power, SPARC, and x86.

• The ability to boot Linux and Android OSs using the ARM, ALPHA, and x86 ISAs.

The flexibility provided by gem5 enables creating test systems with multiple CPUs and GPUs (as demonstrated in Section 6). Note that not all combinations of the above-mentioned capabilities are currently supported.

gem5 is developed by a wide community [3], including both academia and industry, and has been used in hundreds of publications. Unlike SimpleScalar [1], gem5 is released under an open-source license. gem5's community is very active, keeping gem5 changing: adding more models, supporting more workloads, and fixing more bugs, as in any large-scale software project. Revision control is distributed and based on Mercurial [37], hence based on differential changes from a given revision. Each change proposal is reviewed by the community prior to being committed.

A key aspect of gem5 is its object-oriented design. gem5 utilizes standard interfaces such as the port interface (used to connect two memory objects together) and the message buffer interface. Although constantly changing, the port and interface semantics have become more TLM-like. These SystemC [28] and TLM-like semantics are essential in establishing connectivity between independent models.

gem5 is continuously being correlated with the actual hardware being modeled, such as development boards containing ARM CPUs (e.g. the Snowball SKY-S9500-ULP-C01, as published in [21]).

3.4.2 Simulation flow

The core of gem5 is an event-driven simulation engine which tightly combines C++ and Python sources. Each component in the simulation is represented by a SimObject, reflected simultaneously as a C++ object and as a Python object. The purpose of these two worlds is to enable easy and flexible composition of any system, made possible with Python. The simulator is compiled for a specific architecture (e.g. ARM, x86) and verbosity/optimization level. Figure 3.10 depicts simulation inputs, outputs, and runtime interfaces. Generally speaking, a simulation flow contains the following stages:


Figure 3.10: gem5 simulation inputs, outputs and runtime interfaces

• Creation of the test-system: the simulator is invoked with several command-line arguments, the most important one being a Python script which composes the test-system and specifies what stimuli the test will use. The test-system is generated in a sequential manner, instantiating its components (CPUs, memories, peripherals) and connecting them into a single entity. In addition, the script specifies what software will be run on the test-system. For a full-system simulation, this includes specifying which Operating System (OS) kernel to use, a disk image, a boot loader, and a script which will be invoked by the guest system (the system being simulated) once it has finished booting. The main test script utilizes other Python scripts for common tasks such as instantiating caches and CPUs. The test script may accept input arguments, such as the number of CPUs, the CPU model to be used, cache sizes, cache associativity, and so forth. At the end of this process, a config.ini file is generated. This file contains a complete listing of the test-system's components and settings, and hence fully reflects what is being simulated. In addition, we have added automatic visualization of the system in the form of a block diagram. This output is further discussed in Appendix A.

• Event-driven simulation: it is either limited by a maxtick parameter, or self-terminated by the guest system. During a full-system simulation, the following outputs are produced:

– simout and simerr: the standard output and error streams generated by the host (simulator)

– system.terminal: the output of the simulated system's terminal

– framebuffer.bmp: the latest contents of the simulated system’s display

In addition to these outputs, it is possible to interact with the simulated system using:

– A telnet connection: enabling text-based control and visibility of the test-system.

– A Virtual Network Computing (VNC) session: the test-system runs a VNC server, making it possible to interact with the active session using keyboard and mouse inputs from the outside world. The display is equivalent to the contents of framebuffer.bmp.

The simulation stage includes booting the test-system, followed by a test-specific scenario specified in an .rcS script.


• Post-simulation: including the reporting of statistics gathered during simulation. These statistics are provided in stats.txt. Note that statistics can be reported and reset during simulation, either once or periodically; for instance, when benchmarking a specific scenario, it might be misleading to include statistics gathered during the booting stage, which can be avoided as in the sketch below. The gem5 term for such statistics is Stats, following the name of the class which implements all required functionality.
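For instance, a test script can exclude boot-time activity from the reported numbers by resetting the counters once the region of interest is reached. A minimal sketch using gem5's Python statistics hooks; the tick counts are illustrative assumptions:

    import m5

    # Sketch: keep boot-time activity out of the reported statistics.
    boot_ticks = int(1e12)    # illustrative durations, in simulation ticks
    roi_ticks = int(5e11)
    m5.simulate(boot_ticks)   # run through booting
    m5.stats.reset()          # discard the statistics gathered so far
    m5.simulate(roi_ticks)    # run the benchmark region of interest
    m5.stats.dump()           # append a snapshot to stats.txt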

3.4.3 Test-System Example

An example of a test-system is depicted in Figure 3.11. For clarity, this architecture block-diagram is partial.

Figure 3.11: gem5 2-core test system example

The complete list of components, their configuration, and their connectivity is listed in the config.ini file created by gem5 prior to the main simulation stage. This example system includes two CPU clusters. Each cluster contains an ARM AtomicSimple CPU, a data cache, an instruction cache, an instruction TLB, and a data TLB. Both clusters share a level-2 cache. Multiplexing the level-2 cache requires arbitration; this is why an instance of gem5's bus model, named tol2bus, sits between the cache and its masters. The system also includes an interrupt controller, various timers, a UART, and components for realizing a terminal and a VNC session. Most of these peripherals have been omitted from Figure 3.11 for clarity.
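To illustrate how such a system is composed, the following is a minimal sketch of a test script in the spirit of Figure 3.8(a). The class and port names (CoherentBus, MemTest, BaseCache, SimpleMemory, test, cpu_side, mem_side) follow the gem5 version used in this work; all parameter values are illustrative assumptions, not configurations used in this thesis:

    # Minimal gem5 test-system script in the spirit of Figure 3.8(a):
    # two MemTest masters, each behind a private cache, one CCI-like
    # CoherentBus, and a SimpleMemory as the ACE-like slave.
    import m5
    from m5.objects import *

    system = System()
    system.CCI = CoherentBus()                 # the coherent interconnect
    system.physmem = SimpleMemory()
    system.physmem.port = system.CCI.master

    system.cpu0 = MemTest()                    # traffic-generating masters
    system.cpu1 = MemTest()
    system.l1c0 = BaseCache(size='32kB', assoc=2, mshrs=4, tgts_per_mshr=8)
    system.l1c1 = BaseCache(size='32kB', assoc=2, mshrs=4, tgts_per_mshr=8)
    system.cpu0.test = system.l1c0.cpu_side    # master port to cache
    system.cpu1.test = system.l1c1.cpu_side
    system.l1c0.mem_side = system.CCI.slave    # caches are the bus's masters
    system.l1c1.mem_side = system.CCI.slave

    # The system port gives the simulator functional (backdoor) access.
    system.system_port = system.CCI.slave

    root = Root(full_system=False, system=system)
    m5.instantiate()                           # emits config.ini
    exit_event = m5.simulate()
    print('Exiting @ tick %d: %s' % (m5.curTick(), exit_event.getCause()))

Depending on the gem5 revision, BaseCache may require further parameters (e.g. block size and latencies); the config.ini emitted by m5.instantiate() lists the full set actually used.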

3.5 The gem5 memory system

3.5.1 The basic elements: SimObjects, MemObjects, Ports, Requests and Packets

Each model instance in gem5 is a SimObject. A particularly interesting type of SimObject is the MemObject. These are objects which can communicate with other objects via their TLM-like Ports. A MemObject can have a Master port, a Slave port, or both. Typically:

• A master is a model which generates transactions, namely requests. Examples of such models are CPUs, GPUs, and traffic generators.

• A slave is a model which can only respond to requests it receives. A memory model or a peripheral device, such as a UFS model, are typical slaves.


• In addition, there are devices which have both a master and a slave port. Such devices include caches, bridges, buses, and communication monitors, which are gem5 shims used to monitor activity between a master and a slave.

The collection of such devices, together with all memories, will be referred to in this work as the gem5 memory system.

The gem5 port semantics enforce clear connectivity and responsibilities. Each port must be connected upon simulation start, and can only be connected to a single port of the opposite type. Buses make use of port vectors, which enable them to connect to more than one master and more than one slave. The implication of these connectivity semantics is that each system forms a directed connected graph of MemObjects, with edges leading from masters to slaves. Figure 3.8(a) is a simple example of such a graph: each directed edge represents a connection between a Master port and a Slave port. We denote a master's slave and a slave's master as its peer.

In order to communicate, a master creates a Request. A request contains an abundance of parameters, with the destination address being one of the most important ones. The request is conveyed to the slave in an enclosing Packet. Throughout the request's lifecycle it could be conveyed using different packets, and eventually it will reach its destination slave. In case a request requires a response, the original packet will be converted to a response, and will be transported back to its initiating master in a similar manner, conveyed in packets. Each packet contains a command field, indicating the purpose of the request. Trivial examples of such command types are a read request, a read response, and a write request, but there are also less common ones, such as a load-conditional response.

Each master port can be either snoopable, and hence can accept snoop requests from its slave peer, or non-snoopable. The collection of the port semantics presented thus far is sufficient to model ACE and ACE-lite interfaces, using the mapping provided in Table 3.2. Note that a slave port is snoopable if its peer's port is snoopable.

Table 3.2: Correlation of ACE and gem5 interface semantics

ACE-naming        | gem5-naming | gem5-snoopable
(full) ACE master | master      | yes
ACE-lite master   | master      | no
ACE slave         | slave       | no

A packet is considered to be a snoop if it is either marked as an express snoop, or if it is transported in a direction opposite to the conventional one. Hence a read request that a master receives is considered to be a snoop read request, since a master can only issue read requests and does not handle them.

The interaction between a master and a slave port in order to pass on a packet is done using the function calls listed in Table 3.3.

Table 3.3: Main port interface methods

Master port method    | Slave port method
send request          | receive request
receive response      | send response
send snoop response   | receive snoop response
receive snoop request | send snoop request

Each of these functions has three variants:

• Functional: used by gem5 only for performing backdoor operations, such as loading the kernel binary into the guest-system's memory. Hence, a functional necessity for bypassing the complexities of the simulated memory system.

• Atomic: fully functional with minimal time tracking, used mainly for fast-forwarding, e.g. when using AtomicSimple CPUs, in order to quickly reach an interesting point in time before switching to a detailed mode.

• Timing: fully functional with detailed temporal modeling, at the cost of degraded simulation performance.

Changes to the memory system must be reflected in all three modes, as switching between atomic and timing modes can occur at any time, while functional mode can co-exist with both. The toy model below illustrates the calling discipline of the timing variant.
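As a toy illustration of the hand-off in Table 3.3, the following plain-Python model mirrors the timing variant of the request/response interaction; it is not gem5's actual C++ port classes, and all names are illustrative:

    # Toy model of a connected master/slave port pair (timing variant only).
    class MasterPort:
        def __init__(self):
            self.peer = None                   # the connected SlavePort

        def send_timing_req(self, pkt):
            return self.peer.recv_timing_req(pkt)

        def recv_timing_resp(self, pkt):
            print("master received response:", pkt)

    class SlavePort:
        def __init__(self, owner):
            self.owner = owner                 # e.g. a memory model
            self.peer = None                   # the connected MasterPort

        def recv_timing_req(self, pkt):
            self.owner.handle_request(pkt)     # owner turns request into response
            return True                        # False would mean "retry later"

        def send_timing_resp(self, pkt):
            self.peer.recv_timing_resp(pkt)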


3.5.2 Request’s lifecycle example

In order to explain how transactions occur in gem5's memory system, several use-cases are provided in the following sections. These use-cases are listed starting from the simplest transaction and progressing to more complex scenarios. Instead of providing a single complex scenario, isolated segments demonstrating how each model interacts with the memory system are given.

The simplest possible system contains only a single master and a single slave: for instance, a CPU (the initiator, a master) directly connected to a memory (a slave, the responder). Alternatively, it could be a CPU issuing a read request to a cache, resulting in a hit. Such an example is depicted in Figure 3.12.

Figure 3.12: Master to slave transaction sequence diagram

The general sequence of actions throughout the request's lifecycle is as follows:

• Assuming the need to send a request has arisen in the CPU, it will first create a request and enclose it in a packet.

• The master will call its (master) port's send request method. Depending on the circumstances, this could be any of the send functional/atomic/timing request methods; for now we abstract from any mode-related details.

• The master port calls its (slave) peer's receive request method, passing the packet with the original request to the slave's port.

• The slave port calls its owner’s receive request function to handle the request.

• The slave (the memory in this case) receives the packet, handles the request (for example, reads from some address and attaches the data to the packet), and transforms the request into a response. This is merely a change in the request's attributes.

• The slave calls its port’s send response method.

• The slave port calls its peer (master) port's receive response method.

• The master handles the response, and de-allocates both the packet and the response.

In this simple case, the same packet and request instances were used during the entire transaction. In more complex cases, the request may be passed on enclosed in different packets along its route.

From here onwards, the explanations will abstract from the owner-to-port details and will regard them as a single entity, for clarity.

3.5.3 Intermezzo: Events and the Event Queue

The difference between atomic and timing modes is crucial, and represents a typical tradeoff between modeling accuracy and simulation speed. gem5 is an event-driven simulator which maintains a central event queue for scheduling events at any moment in the future. Each model can make use of this mechanism and register events, which requires providing:

• the time (simulation tick) at which this event should occur

• an event handler (process() function) that is called once the event is due.


As such, modeling a sequence of events in gem5 can be done using one of two general methods:

• A function call, as done in atomic mode, annotating estimated temporal progress and letting the caller deal with actual delays in simulation. This resembles SystemC's loosely-timed notion.

• Scheduling an event to happen later on, as done in timing mode, causing time to actually progress along the sequence of actions. Hence a sequence can be split into several phases which can occur at different times. The result is a much more accurate and realistic behavior, which slows down simulation. This resembles SystemC's approximately-timed notion.

One of the changes made as part of this work improves the temporal behavior of the response flow through the memory system by switching from a loosely-timed to an approximately-timed approach. This is further discussed in Section 4.1.

The same applies to any of the stages described in Section 3.5.2 and depicted in Figure 3.12: each function call could be, and often is, replaced with the queuing of an event at a future time, thus modeling time progress and resource latency. The sketch below shows the essence of such an event queue.
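The event mechanism itself is generic; the following is plain Python using heapq, illustrative only and not gem5's EventQueue implementation:

    import heapq

    # Essence of an event-driven simulation queue: (tick, handler) pairs
    # executed in tick order; handlers may schedule further events.
    class EventQueue:
        def __init__(self):
            self.cur_tick = 0
            self._events = []
            self._seq = 0                 # tie-breaker for equal ticks

        def schedule(self, when, process):
            heapq.heappush(self._events, (when, self._seq, process))
            self._seq += 1

        def run(self):
            while self._events:
                when, _, process = heapq.heappop(self._events)
                self.cur_tick = when      # simulated time jumps to the event
                process()

    eq = EventQueue()
    eq.schedule(100, lambda: print("handled at tick", eq.cur_tick))
    eq.run()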

3.5.4 The bus model

The gem5 bus model is the sole component that can be connected to multiple masters and multiple slaves. The name bus may at times be misleading, since its internal implementation can model any type of interconnect. Figure 3.13 depicts an abstraction of the bus prior to the start of this work.

Figure 3.13: Abstracted diagram of the bus

The bus was modeled as a unary resource that can be either occupied or not. As such, each incoming transaction attempt would first check whether the bus is available to provide service. In case the bus is indeed busy, the requesting port is added to a retry list. An exception was made in the case of snoop requests, i.e. requests coming from a slave, which are passed in zero time regardless of the bus's state; this modeling peculiarity is discussed in Section 4.1. Once the bus becomes available after a period in which it was occupied, the first port in the retry list is granted the right to utilize the bus. Each slave declares its address range to the bus upon pre-simulation system composition. This is required so that the bus can dispatch requests from a master to the correct slave (the one responsible for that address). Each request that is received and requires a response is added to the outstanding requests list. This list is used to distinguish responses (which need to be sent upstream to the initiating master) from snoop responses (which need to be sent downstream towards the snooping bus). The sketch following this paragraph illustrates the occupancy check.
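The occupancy check described above amounts to the following sketch (illustrative Python; the member and method names are assumptions, not gem5's code):

    # Illustrative occupancy logic of the bus as a unary resource.
    class Bus:
        def __init__(self):
            self.busy_until = 0           # tick until which the bus is occupied
            self.retry_list = []          # ports waiting for the bus to free up

        def try_send(self, port, pkt, cur_tick, occupancy):
            if pkt.is_snoop:
                self.forward_to_snoopers(pkt)   # snoops bypass the busy check
                return True
            if cur_tick < self.busy_until:
                self.retry_list.append(port)    # busy: caller must retry
                return False
            self.dispatch_to_slave(pkt)         # route by destination address
            self.busy_until = cur_tick + occupancy
            return True

        def forward_to_snoopers(self, pkt):
            pass                          # stub: send to all snoopable masters

        def dispatch_to_slave(self, pkt):
            pass                          # stub: address-range lookup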

In the next sections, the four main use-cases are presented. In order to provide descriptions which are not obsolete, the interaction sequences describe the bus in its current state, after being split into a layered bus. The layered bus is further described in Section 4.2.


3.5.5 A simple request from a master

Once a bus receives a request from a master:

1. The bus first checks whether its request layer is idle. Snoop requests are serviced regardless of the resource's availability, yet regular requests may be refused if the request layer is busy.

2. The bus forwards the packet upstream to all snoopable masters (hence issuing a snoop request).

3. The master will receive the packet and handle the snoop. The packet is forwarded upstream recursively to all masters. In case of a snoop hit, the cache will do the right thing and act according to the coherence protocol, marking the packet as MemInhibited, stating that a master has taken responsibility for providing a response. Subsequent masters will be aware of this, thus understand that this snoop caused a snoop hit, and act accordingly.

4. The bus sends the request packet to its destination slave according to the packet's destination address.

5. In case the packet is marked as MemInhibited, the slave will delete the packet, as some master has committed to providing a response. Otherwise, it will turn the packet into a response and schedule the response.

6. The bus marks the request layer as occupied for the duration of the packet's sending time, plus a fixed bus latency.

The entire process is done at the same simulation tick, hence in zero time. This introduces several peculiarities, such as the fact that a slave is aware of the snoop result (hit or miss) already at the time of the request's arrival at the bus. The slave will only react if needed; hence the result of this sequence is equivalent to modeling perfect speculative prefetching. Snoop latency is thus hidden, as snoop miss responses practically come at zero time with zero cost.

3.5.6 A simple response from a slave

Once a bus receives a response from a slave:

1. The bus first checks whether its response layer is idle. Responses may be refused if the response layer is busy.

2. The response is sent upstream to its appropriate master.

3. The bus marks the response layer as occupied for the duration of the packet's sending time, plus a fixed bus latency.

3.5.7 Receiving a snoop request

Snoop requests pass through instantly. As such, once a bus receives a snoop request (from a slave), the received packet is forwarded to all snoopable masters.

3.5.8 Receiving a snoop response

Once a bus receives a snoop response:

1. The bus first checks whether its snoop response layer is idle. Snoop responses may be refused if the snoop response layer is busy.

2. In case the snoop response is due to a snoop request the bus forwarded, the snoop response is forwarded to the snoop origin (downstream).

3. In case the snoop response is due to a snoop request the bus issued (as a result of receiving a request), a response is sent to the requesting master (upstream).

4. The bus marks the snoop response layer as occupied for the duration of the packet's sending time, plus a fixed bus latency.


3.5.9 The bridge model

Currently, gem5 does not support connecting two buses directly to each other, due to the assumption that a response from a bus must always be accepted, which might fail if the receiving side is also a bus and is occupied. The sole functionality of a bridge is to buffer requests in both directions in a finite queue. In case its queue is full, the packet will be marked as not-acknowledged (nacked) and sent back, as in the sketch below.
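The buffering behavior amounts to a bounded queue per direction, roughly as follows (illustrative Python; the queue depth is an assumed value):

    # Illustrative bridge buffering: a finite queue; a full queue nacks.
    class Bridge:
        def __init__(self, depth=16):     # depth is an assumed value
            self.depth = depth
            self.queue = []

        def receive(self, pkt):
            if len(self.queue) >= self.depth:
                return "nack"             # sender will re-send the packet later
            self.queue.append(pkt)
            return "ack"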

3.5.10 Caches and coherency in gem5

gem5's cache model implements a MOESI-like coherence protocol. This protocol aims at minimizing the amount of off-chip activity, which might not always realistically represent a system. The term cache model is actually an abstraction which describes both the cache and the cache controller. The cache model is based on three main data structures:

• A TagStore, which holds the actual cached data and tags, stored as a list of blocks. The TagStore can implement any replacement policy, yet the one used by default is Least-Recently Used (LRU). Replacements and evictions are determined by the TagStore during the handling of a request.

• A Miss Status Handling Registers (MSHR) block, which contains tracking data regarding cached read and write requests for cache lines that are not in the TagStore and are pending a response.

• A Write Buffer for uncacheable, evicted, and dirty lines that need to be written downstream.

3.5.11 gem5’s cache line states

Each cache line stored in the TagStore contains the following state indicators:

• Valid: indicates whether the data stored is valid.

• Readable: indicates whether the data stored can be read. An example of a situation where a line can be valid yet not readable is a write miss to a previously-readable line (e.g. shared and pending an upgrade to exclusive state).

• Writable: indicates that the cache is permitted to make changes to its copy.

• Dirty: indicates that the copy stored has to be updated in the main memory.

Examples of optimizations utilized in gem5's cache model which are implementation-dependent, and might not realistically model a system, are:

• A read request for a line which does not reside in any other cache will automatically result in exclusive state. This optimization eliminates the need to request an upgrade (to exclusive state) upon receiving a write request, yet does not necessarily realistically represent a system.

• Ownership passing is supported, yet only when a dirty line is snooped by a read-exclusive request. The impact of ownership-passing policies is further discussed in Section 7.3.

• Read requests to a modified line never result in a write-back of the line, which is the classically expected behavior. Data is passed on-chip, which might not truly represent a system.

• Snoop requests from the slave are handled and forwarded in zero time. This major inaccuracy is intended to avoid race conditions in the memory system, and mostly the need to implement transient states in the cache controller.

Table 3.4 projects gem5's cache line state-space onto MOESI. The purpose of this projection is to provide a complete picture which enables correlation with the ACE cache-line state space provided in Table 3.1.

3.5.12 Main cache scenarios

In the following sections, the main possible scenarios are provided.


Table 3.4: Projection of gem5 cache line state space to MOESI

MOESI state | Writable | Dirty | Valid
M           | 1        | 1     | 1
O           | 0        | 1     | 1
E           | 1        | 0     | 1
S           | 0        | 0     | 1
I           | 0        | 0     | 0

A read request

Upon a read request (from the master):

• In case of a hit, the cache line is provided to the master (upstream).

• In case of a miss, a cache-line-sized read request is generated and sent to the slave (downstream). Once a response arrives, it is stored locally (except in cases such as an uncacheable read or a full cache), and a response to the original request is provided to the master (upstream).

A write request

Upon a write request (from the master):

• In case of a hit, the cache line is updated.

• In case of a miss, a read-exclusive request is generated and sent to the slave (downstream). Once a response arrives, the new data is stored locally (here, too, there are exceptions according to the coherence protocol, which are beyond the scope of this abstracted description). A write response is sent to the master (upstream).

• In case of an uncacheable write request, the write request is added to a write-back buffer to be sent downstream.

A snoop request

Upon a snoop request (from the slave):

• The cache will forward the snoop request upstream in zero time. This ensures all caches are updated instantaneously, thus eliminating the need for transitional cache states, and various race conditions.

• The cache controller follows the coherence protocol and provides a response. E.g., a snoop read request for a valid cache line will be responded to with the data, and the packet will be marked as MemInhibited, stating that this cache is responsible for scheduling a snoop response. The cache line state will be updated according to the coherence protocol.

• In case of a miss, the packet will not be modified.

A downstream snoop response

In case of a snoop response from a master:

• If the response is to a request which the cache forwarded, it is forwarded downstream.

• Otherwise, it is treated as any response to a read or write request, as described in the read- and write-request scenarios above.

3.6 Target platform

The CCI-400 was intended to be the interconnect in a compute subsystem developed by ARM. This would have enabled correlation of the model with real hardware, running the same workloads both on hardware and in the simulator. This platform aims to model hand-held mobile devices. A block diagram of the target platform is provided in Figure 3.14.


[Figure 3.14: Target platform block diagram of the Columbus-based high-performance tablet SoC (tablet/smartphone and high-end tablet configurations). Source: ARM technical symposium 2011.]


Modeling the target platform in gem5 required:

• composing an equivalent platform from the available components,

• tweaking gem5’s models to match the target platform’s configuration as much as possible (e.g. clock frequen-cies, cache sizing and latencies)

• preparing workloads which can be run both on the target platform and on gem5. As a key workload, BrowserBench (BBench), a web-rendering benchmark, was selected. BBench is further discussed in Section 6.4.4.

The gem5 target platform was initially designed as an abstracted version of the planned target platform. Its high-level design is provided in Figure 3.15(a).

[Figure 3.15: gem5 initial and final target platform overview. (a) gem5 simplified target platform; (b) gem5 target platform (end goal).]

The gem5 target platform:

• Is based on gem5’s RealView ARM development board configuration, peripherals and memory map, hencecapable of running same software/OS including BBench, as a currently-available hardware product,

• Is composed of two clusters of two ARM Cortex-A15-like out-of-order (O3) CPUs, each CPU with a private level-1 cache, and each cluster with a shared level-2 cache.

• Includes a trace player / random traffic generator as a starting point to mimic GPU behavior, generating read and write traffic to a private memory (extendedMem) as well as read requests to the main memory (physmem).

• Includes a CCI-like interconnect, capable of issuing ACE-like system-level coherent transactions.

• Includes communication monitors (described in Section 3.6.1) on each of the interconnect's ports, for improved observability of traffic around the interconnect. The communication monitors are depicted as discs on each of the interconnect's ports.

This model was implemented and tested. However, during testing of the target platform model, simulation performance issues (discussed in Section 3.6.2) were detected, which required re-evaluating the investigation approach.

A model which more closely resembles the actual target platform, containing a GPU model (instead of a traffic generator) and additional real-time-sensitive video and display traffic generators utilizing a separate interconnect, is depicted in Figure 3.15(b). This model was not realized, due to the performance issues discussed in Section 3.6.2 and the lack of an adequate GPU model, which was essential for investigating sharing patterns between CPUs and the GPU.

3.6.1 gem5 building block for performance analysis

A fundamental feature in any simulation framework is observability. In order to analyze a system's performance, the simulator must provide infrastructure for monitoring events and collecting statistics during a simulation. gem5 provides two main tools for this matter:

• Stats is a class which provides convenient means of collecting scalar or vector inputs upon sampling events. These can either simply record values, or perform more complex arithmetic operations (Formulas).


• Communication monitors enable in-depth analysis of the accumulated traffic between a master and a slave port. The communication monitor acts as a non-functional shim, sniffing all port activity. Example statistics provided by the monitor are bandwidth and latency statistics, transaction distribution, and so forth.

In both cases, statistics can be reset or output to a file, either once or periodically.
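As an illustration, a model registers such statistics as members that are named and described during registration, and formulas derive quantities from them. The following minimal C++ sketch follows gem5's classic Stats API; the class and member names are hypothetical, and the snippet assumes gem5's headers and base classes:

    // Hypothetical bus model collecting a scalar and a derived formula.
    class MyBus : public MemObject
    {
      private:
        Stats::Scalar dataThroughBus;   // total bytes serviced by the bus
        Stats::Formula throughput;      // derived bytes per second

      public:
        void regStats()
        {
            MemObject::regStats();
            dataThroughBus
                .name(name() + ".data_through_bus")
                .desc("Total data through the bus (bytes)");
            throughput
                .name(name() + ".throughput")
                .desc("Bus throughput (bytes/s)");
            // Formulas are evaluated when statistics are dumped or reset.
            throughput = dataThroughBus / simSeconds;
        }
    };

In the packet-servicing path, the model then simply accumulates, e.g. dataThroughBus += pkt->getSize().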

3.6.2 Simulator performance

The process of testing the target platform model raised questions regarding the simulator's performance when running a test system which utilizes multiple O3 CPUs. Repeated testing and analysis of the simulator's progress, based on the reported kIPS, implied that simulating a real-world workload would require an unreasonable amount of time for the scope of this project. For example, we discovered that it would take several weeks to run a BBench simulation for a system with four O3 CPUs.

In order to verify these findings, an in-depth investigation using Valgrind [36] and Callgrind [42] was performed. Callgrind is a profiling tool which logs the function- and library-call history at runtime, including the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the number of such calls. Gaining such insight comes at the price of a significantly degraded run-time. Results for a partial BBench run using the target platform model described in Section 3.6 are provided in Figure 3.16.

The simulator was found to be spending a major part of the time performing memory allocation and de-allocation as part of the O3-model CPUs' operation. The impact on overall simulation performance was acute, as the target platform contained four instances of the O3 model.

The investigation’s findings spurred several changes in gem5 which resulted in 12% performance improvement,yet concluded that full simulations of real-world workloads cannot be performed using the O3 model. Alternativesare discussed as part of Chapter 6.

3.7 Differences between ACE and gem5’s memory system

When approaching the task of modeling the ACE protocol in gem5, we first need to understand the gap between the current state of gem5's memory system and what ACE tries to achieve. The purpose of this section is to provide motivation for these differences using a leading example.

As stated before, the ACE protocol is aimed at single-hop interconnects, while gem5's connectivity requires supporting multi-hop interconnects. An example of ACE's intention is the listing of all expected cache-line after-states per transaction. For instance, the ACE specifications for a WriteNoSnoop request (discussed in Section 4.3.3 and in [6]) require the end-state of the cache line to never be dirty (modified).

• In a single-hop interconnect, the interconnect would receive a WriteNoSnoop request and forward it directly to the slave, assuming it is a memory device that will be updated with the new data; therefore, any line containing this address could not be dirty once the transaction is over.

• However, in gem5, a CPU may issue a WriteNoSnoop to any type of peer slave. In case this is a cache, a WriteNoSnoop does not require updating the end-memory responsible for this address, but simply specifies that the address is not shareable, avoiding unnecessary snooping. As such, the receiving cache should:

– In case of a hit to a writable line, perform a write-hit

– In case of a miss or a hit to a readable line, issue a ReadNoSnoop downstream, which is inherently also a read-exclusive request.

– Only once a lowest-level bus receives a WriteNoSnoop request will it perform the request as defined by the ACE specifications.

This subtlety is a typical example of how an ACE transaction should be extended, and cannot be modeled as-is in gem5. ACE transactions influence the entire memory system, including caches and buses, and not only the bus.

Additional differences are discussed as part of Section 4.3.

3.7.1 System-coherency modeling

gem5's system-level coherency relies on a set of conventions which keep the system coherent at any point in time:


• All snoop requests received by a bus or a cache from its slave are forwarded to all upstream masters, as described in Section 3.4:

– The bus forwards all requests arriving from any of its masters to all snoopable masters (upstream).

– The bus forwards all snoop requests originating from its slaves towards its masters.

– The cache forwards all snoop requests originating from its slave to its master.

• The memory system relies on all upstream requests being forwarded in zero time in order to keep the contents of all caches coherent without the need for transitional states, and without the hassle of race conditions.

Hence, in gem5, cache-line state changes occur unrealistically fast compared to ACE. However, as both gem5 and ACE utilize a MOESI-like coherence protocol, the after-states modeled in gem5 in the CCI context must be the same. This leading requirement is fulfilled per implemented ACE transaction, as described in Section 4.3.

Compliance can be demonstrated by issuing the modeled ACE transactions in a gem5 system which resembles the CCI context (as in Figure 3.8(a)). Extending each ACE transaction should guarantee that system-level coherence is maintained in any form of interconnect and memory system.

Furthermore, the notions of shareability and bufferability discussed in Section 3.1.5 are missing in gem5 and are essential for modeling ACE transactions.

The above-mentioned requirements for modeling extended, ACE-like, hardware-managed system-level coherency have been implemented and are further discussed in Chapter 4.


[Figure 3.16: Callgrind profiling results for a partial BBench run. (a) Call-graph statistics summary; (b) call-graph visualization of O3 activity.]


Chapter 4

Interconnect model

The gem5 bus model provides a mature ground for modeling realistic system-level coherent interconnects. However, some key points in the model should be dealt with in order to increase its correlation with a CCI-like interconnect. In this chapter, the gaps between the old gem5 bus model and the desired model are discussed. Solutions for bridging these gaps are provided, some by means of implementation and some as guidelines for future work.

During this project the bus model has been modified by other members of the gem5 community, and mostly by my supervisor, Dr. Andreas Hansson. Although these changes are not fruits of this work, they brought the bus model much closer to an ACE-like interconnect.

Problems in the bus model that have been dealt with are:

• Connectivity semantics: gem5's memory system lacked a strict notion of master and slave ports; a port could be connected to any other port. Recent work modified all ports to be either master or slave ports, enforcing strict directed connectivity with distinguished responsibilities. The introduced semantics made gem5 ports AXI-like. An additional change, where each master port can be either snoopable or not, was another step closer towards ACE-like interfaces.

• The bus model was previously used both in connectivities which should handle coherent traffic and in those which should not. The bus functionality was redesigned around a BaseBus base class, from which two types of models inherit, as depicted in Figure 4.1:

[Figure 4.1: The bus model's inheritance diagram: SimObject (an EventManager and Serializable) is the base of MemObject, from which BaseBus derives; CoherentBus and NoncoherentBus derive from BaseBus, and DetailedCoherentBus derives from CoherentBus.]

– A NonCoherentBus class, which does not support any coherent traffic (hence can neither snoop nor be snooped) and better resembles non-coherent interconnects than the original bus,

– A CoherentBus class, which supports both coherent and non-coherent traffic, and resembles a CCI-likeinterconnect.


– A more timing-accurate bus model was designed, the DetailedCoherentBus, which is derived from the CoherentBus and is a step closer to a realistic interconnect. It is further described in Section 4.1.

• The bus had utilized a unary resource for modeling contention, as described in Section 3.5.4. This type of contention is suitable for modeling tri-state-like buses, whereas modern medium-scale interconnects are typically crossbar-based. An improved resource contention mechanism, in the form of layers, was introduced as part of the bus-class split into coherent and non-coherent buses. This solution is further discussed in Section 4.2.

• The bus model supported a very limited set of coherent transactions. As such, support for ACE-like transactions, extended to suit both CCI-like and multi-hop interconnects, has been added; it is discussed in Section 4.3.

• The bus model did not contain any performance monitoring capabilities, which crippled the ability to analyze its performance and its impact on the entire guest (simulated) system. In order to provide such capabilities, statistics were added to all bus models, as described in Section 4.5.

• The memory system as a whole does not support barrier transactions, while these are part of the ACE specifications. Although barriers (of both data and sync types) may have a significant impact on a system's performance, they are currently implemented within gem5's CPU models. The effort estimated to remedy the situation was beyond the scope of this work. As barriers are an interesting feature with potentially significant performance impact, they are further discussed in Section 7.3.

4.1 Temporal transaction transport modeling

The gem5 memory model utilizes an approach which is a mix of the loosely-timed and approximately-timed approaches. Hence, in some cases function calls are used, passing on the responsibility of modeling temporal behavior, and in other cases event queuing is used, better modeling the progress of time in each model.

As discussed in Section 3.4, in order to avoid race conditions and transitional states in caches, snoop requests are passed on in zero time. In order to compensate for this approximation, gem5's memory system tries to penalize snoop responses with longer latencies, aiming to provide a reasonable aggregate delay for the entire transaction.

However, snoop responses, as well as several other cases which will be covered in this section, pass through the bus in a loosely-timed manner, passing on the responsibility to mimic time progress to the receiving end.

In order to improve the bus model's temporal-behavior accuracy, transaction scheduling using queued ports was implemented. As explained in Section 3.5.3, gem5 provides the infrastructure to delay execution of events until a future point in time. This infrastructure was used to delay outgoing traffic from the bus to its destination port, instead of letting the bus directly send the packet with timing annotations that would have to be dealt with by the receiving end. The change is based on queued ports, which maintain an event queue, enabling activity to be scheduled in addition to the basic functionality of a port.
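The scheduling idea can be illustrated with a small, self-contained sketch: instead of handing a packet onward immediately with a timing annotation, the port buffers it together with a release tick and sends it when simulated time catches up. The types below are simplified stand-ins, not gem5's actual queued-port classes:

    #include <cstdint>
    #include <queue>
    #include <vector>

    using Tick = uint64_t;
    struct Packet { /* address, command, payload, ... */ };

    struct QueuedPort {
        struct Entry { Tick when; Packet *pkt; };
        struct Later {   // order the queue by ascending send time
            bool operator()(const Entry &a, const Entry &b) const
            { return a.when > b.when; }
        };
        std::priority_queue<Entry, std::vector<Entry>, Later> sendQueue;

        // Called by the bus: delay the packet by the bus traversal
        // latency instead of annotating it for the receiving end.
        void schedSend(Packet *pkt, Tick when) { sendQueue.push({when, pkt}); }

        // Called from the event loop once simulated time reaches 'now'.
        void processSendEvent(Tick now) {
            while (!sendQueue.empty() && sendQueue.top().when <= now) {
                deliver(sendQueue.top().pkt);   // actual port send
                sendQueue.pop();
            }
        }

        void deliver(Packet *) { /* send to the connected peer (elided) */ }
    };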

Table 4.1 lists the outgoing traffic cases which now utilize queuing.

Table 4.1: Bus outgoing traffic queuing

    Bus port      Traffic direction                       Scenario                    Figure
    Slave port    upstream (from the bus to a master)     response from a slave       4.2(a)
    Master port   downstream (from the bus to a slave)    request from above          4.2(b)
    Master port   downstream (from the bus to a slave)    snoop response from above   4.2(c)

While the first two cases are trivial to grasp, the third scenario is not: it requires a system in which there are buses in between two levels of caches, and a situation where a snoop is responded to and the response flows downstream through the bus. This flow of response, from a master to a slave, cannot occur in classic ACE but only in such a multi-hop interconnect. All three scenarios are depicted in Figure 4.2. A description of all three scenarios follows:

• In Figure 4.2(a), the CPU issues a request which traverses downstream through the bus to the memory. The request traversal is marked as a green arrow. The response, marked with a blue arrow, is queued (hence scheduled for future sending) at the bus's slave port (marked by a red Q sign).

• In Figure 4.2(b), the same scenario occurs. Queuing has also been added for request traffic going downstream towards the slave. The same marking method is used.


[Figure 4.2: Added bus outgoing traffic queuing scenarios. (a) response from a slave; (b) request from above; (c) snoop response from above.]

• In Figure 4.2(c), a read request is issued by CPU 3. Assuming the cache line only resides in the level-1 cache of CPU 1, the request will be sent downstream to the main interconnect (membus), which will forward a snoop request upstream (marked in blue), through level-2 cache 0 and through an intermediate bus. The snoop response (marked in purple) will enter a queued master port en route from the intermediate bus to the level-2 cache.

This set of changes included shifting the time-delaying of the packet from being the cache's responsibility to a split responsibility which better models a real system. In the updated implementation, the cache adds its own latency on top of the arrival moment, and not on top of a timing annotation provided by the bus. In addition, the semantics of the annotated time have been updated such that the bus provides the delay taken for the packet to traverse the bus, and not absolute delivery times.

In order to provide backwards compatibility, the CoherentBus was not modified; rather, a new class which derives from the CoherentBus is used: the DetailedCoherentBus. One reason for not forcefully applying this change to the CoherentBus is that the additional queuing has a negative performance impact. Hence it is essential to maintain a less accurate yet faster bus model.

Note that the added queuing mechanism is not applied to any packets marked as express snoops, as they must be passed on in zero time.

In addition to the queuing of these three scenarios, a dedicated snoop response latency parameter was added to the DetailedCoherentBus in order to investigate the impact of penalizing snoop responses on the system's performance.

Evaluation of the above-mentioned changes is extensively discussed in Chapter 6.

4.2 Resource contention modeling

The bus model, in its previous form, was presented in Section 3.5.4. The bus's availability was determined by a unary shared resource, regardless of the initiating port and the type of transaction (whether it is a request, a response, a snoop request or a snoop response). To better model modern on-chip buses, the notion of layers was introduced. A layer represents a unary resource. Each layer has a state which represents its current availability:

• IDLE, indicating that the layer is not occupied.

• BUSY, indicating that the layer is occupied for some duration due to an in-flight transaction.


• RETRY, indicating that the layer has finished serving a port and is now checking whether any other port has been waiting for the bus to service its request.

Each bus now utilizes layers for modeling contention over its resources:

• Both the non-coherent and the coherent bus have separate request and response layers. The bus will only serve a request if its request layer is idle; similarly, the bus will only serve a response if its response layer is idle.

• The coherent bus has an additional layer, a snoop response layer, for modeling contention over a snoop-response channel. This added channel brings the bus a step closer towards an ACE-like interconnect.

• Conceptually, the coherent bus also has a virtual snoop request layer with infinite capacity and zero latency, due to the gem5 memory system's peculiarity of avoiding intermediate cache-line states, which forces the bus to serve snoop requests at any time.

An abstracted overview of the layered coherent bus is available in Figure 4.3. As a reference, recall the unary-resource-based bus depicted in Figure 3.13.

[Figure 4.3: Abstracted diagram of the layered bus, with a request layer, a response layer, and a snoop response layer shared by the bus's master and slave ports, along with a retry list and outstanding-request tracking.]

This set of improvements was a result of the group's work, as part of joint parallel efforts to enable more accurate modeling of the memory system.

Following this change, the bus can easily be extended to a true crossbar interconnect by having a request layer per slave port and a response layer per master port, which is the equivalent of the two multiplexer sets that implement a crossbar.
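To make the layer semantics concrete, the following simplified C++ sketch captures the IDLE/BUSY/RETRY behavior and the retry list described above. It is an illustration of the mechanism, not the actual gem5 layer implementation:

    #include <deque>

    struct Port;   // a master or slave port (details elided)

    struct Layer {
        enum class State { IDLE, BUSY, RETRY } state = State::IDLE;
        std::deque<Port *> retryList;   // ports waiting for this layer

        // A port asks to use the layer; if it is occupied, the port is
        // recorded and asked to retry later.
        bool tryAcquire(Port *p) {
            if (state != State::IDLE) {
                retryList.push_back(p);
                return false;
            }
            state = State::BUSY;        // occupied by an in-flight packet
            return true;
        }

        // Called when the occupancy period of the current packet ends.
        void release() {
            if (retryList.empty()) {
                state = State::IDLE;
                return;
            }
            state = State::RETRY;       // now serving a pending port
            Port *next = retryList.front();
            retryList.pop_front();
            sendRetry(next);            // the port re-issues and re-acquires
        }

        void sendRetry(Port *) { /* notify the port (elided) */ }
    };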

4.3 ACE transaction modeling

A key contribution of this work lies in the set of ACE-like transactions that have been added to gem5's memory system. The ACE specifications introduce coherence-related transactions and attributes (as discussed in Section 3.1.5), yet gem5 already supports system-level coherency for its current set of transactions. As such, the transactions selected for modeling need to provide added value, rather than simply slow down the simulator. Furthermore, it is beyond the scope of this work to model the complete set of ACE transactions, nor is this a cost-efficient approach when analyzing a system.

In order to evaluate which ACE transactions are most important to model, a study was conducted. The most useful input for this study would have been the ACE transaction distribution in real workloads. However, there is currently neither hardware nor any software utilizing these transactions. As such, we consulted ARM's CCI-400 design and verification team. This included analyzing which transactions are expected to be most influential, most common, and for what type of master. As a reference point, we were aided by the transaction distributions used throughout the validation process of the product, knowing that these estimations might be misleading.

Note that each modeled transaction is described in detail starting in Section 4.3.1. We expect the following distribution patterns:

• CPUs will mostly issue shareable transactions, a few unshared transactions, and cache maintenance transactions, which are expected to be an enabler in the big.LITTLE architecture.

• GPUs will mostly issue ReadOnce read requests, and WriteLineUnique or WriteLine write requests. This assumption was also confirmed by a throughput-computing R&D specialist conducting research in that domain. However, WriteLine requires cache-line (byte-enable) strobes, which currently exist only in a side branch of gem5, and thus WriteLine will not be modeled.

• Video controllers will exhibit a distribution similar to the GPU’s.

• Coherent I/O devices are expected to issue unshared reads, ReadOnce reads, unshared writes, and a combination of WriteLineUnique and WriteUnique.

• Barriers and DVM transactions were expected to be issued in relatively minor quantities, yet have a non-negligible impact. However, as discussed earlier in this chapter, due to gem5 limitations this topic is beyond the scope of this research.

When prioritizing the transactions to be modeled, we also considered the expected implementation effort required for each transaction.

gem5’s coherence protocol has a fixed ownership-passing and on-chip data passing policies, as described inSection 3.5.10. Every non-exclusive read request issued in gem5 is responded with a clean copy. As such, ReadClean,ReadNotSharedDirty and ReadShared will all result in the same end-state if modeled as-is in gem5. As such, sinceReadShared is already implemented, ReadClean and ReadNotSharedDirty are not supported. In order to addmeaningful support for these two transactions, one must add the notion of capability of receiving ownership. Henceeach request must contain an additional attribute stating whether the responsibility for this address can or cannotbe passed on to the initiator. An example for such a case is I/O-coherent devices with Write-Through caches, andcannot accept dirty data. This further research possibility is raised as part of Section 7.3.

The validation process of all ACE-like transactions is described in Section 5.3, including simulation examples demonstrating each transaction's impact on the system.

In the following sections, each ACE transaction that was added to gem5 is described, including:

• Its ACE semantics, including a typical use-case.

• How the ACE specifications were extended to fit a multi-hop interconnect, while complying with the ACE definition when a single-hop interconnect is modeled.

• Implementation details that provide insight as to how similar transactions can be extended and modeled in gem5.

4.3.1 ReadOnce

ReadOnce is a read transaction that is used in a region of memory that is shareable with other masters; hence the requested cache line might reside in a cache. This transaction is used when a snapshot of the data is required. The location must not be cached locally for future use. The data passed must therefore be clean, meaning ownership of this cache line must not be passed on to the requesting master.

A typical use-case for ReadOnce is a GPU needing to read configuration data, or contents that it will not require for any future use. The shareable domain in such a case would contain the CPUs and the GPU.

The main differences between ReadOnce and ReadShared (or simply Read in gem5 terminology) are:

• The response should not result in the allocation of a cache line. This is not the same as an uncacheable address, which is memory-region specific and not request-specific.

• The cache-line owner can keep its line in a writable state, even if it is dirty.


• The cache-line owner must not pass ownership.

Therefore, the main benefit of a ReadOnce compared to a ReadShared is that a cache line in M-state will not be forced to O-state, nor will a write-back operation or a read-from-memory be performed, saving two off-chip operations. A cache containing an M-state line snooped by a ReadOnce must supply its data but does not need to modify its line state.

This type of transaction must not be generated randomly without any constraints, as it assumes the line is not in the request initiator's cache. A failing scenario would consist of the line being in M-state in the initiator's cache, followed by a ReadOnce issued by the same initiator; the result would be the fetching of a stale copy from the main memory.

In order to support ReadOnce in gem5, a new packet attribute had to be introduced: noAllocate. This indication implies that the response must not cause an allocation in any cache along the traversal of the request.

No changes to the bus are required, as ReadOnce is forwarded to all snoopable masters. A more accurate implementation would involve the notion of shareability domains, limiting the set of masters being snooped to those of the domain specified in the request. Extending ReadOnce to work in a multi-hop interconnect requires modifying each cache snooped by a ReadOnce to forward the snoop as-is, without any modifications to its cache lines. In case of a cache hit, the cache is responsible for providing a response without making any changes to the line state.
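A minimal sketch of how such an attribute can be honored on the response path follows; the flag and helper names are simplified stand-ins for the actual implementation:

    struct Packet {
        unsigned long long addr = 0;
        bool noAllocate = false;   // set when the request is a ReadOnce
    };

    void allocateLine(const Packet &) { /* normal fill path (elided) */ }
    void respondUpstream(const Packet &) { /* forward response (elided) */ }

    // Handling a fill response in a cache along the request's path.
    void handleFillResponse(const Packet &pkt)
    {
        if (!pkt.noAllocate)
            allocateLine(pkt);     // allocate only when permitted
        respondUpstream(pkt);      // the data is forwarded either way
    }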

4.3.2 ReadNoSnoop

ReadNoSnoop is a read transaction that is used in a region of memory that is not shareable with other masters. Note that this does not mean uncacheable: the line can be cached, but not in other masters' caches. The response must not contain shared or dirty data, hence ownership-passing is not permitted.

The purpose of this transaction is to avoid issuing snoop requests in vain, in cases where there is a priori knowledge that this line will not reside in any other master's cache. Thus, issuing ReadNoSnoop transactions cannot be transparent to the programmer and is derived from the system architecture.

To support this type of transaction in gem5, a new packet attribute has to be added, introducing the notion of shareability. Shareability can be interpreted as a generalization of cacheability, since shareable implies cacheable, but non-shareable may be either cacheable or not.

A ReadNoSnoop request packet should be marked as non-shareable. The bus model should not forward this request to any other master as a snoop request.
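The resulting bus behavior can be sketched as follows (a simplified illustration; the attribute and helper names are hypothetical):

    struct Packet {
        bool shareable = true;   // false for ReadNoSnoop / WriteNoSnoop
    };

    void forwardSnoopToSnoopableMasters(Packet &) { /* elided */ }
    void forwardToSlave(Packet &) { /* elided */ }

    // The coherent bus decides whether an incoming request must be
    // snooped upstream based on its shareability attribute.
    void handleRequest(Packet &pkt)
    {
        if (pkt.shareable)
            forwardSnoopToSnoopableMasters(pkt);  // e.g. Read, ReadOnce
        forwardToSlave(pkt);   // non-shareable requests skip snooping
    }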

Insights about ReadNoSnoop and its possible extension to multi-hop interconnects:

• Since the requested line must not be shared, this is in fact a special case of a ReadExclusive request, where no snooping is performed beyond the transaction's route from the initiating master to either a containing cache or a slave.

• As the request is for an exclusive state, caches along the transaction's traversal can contain the line only in states M (if inclusive) or I. This requirement has to be implemented in the cache model.

ReadNoSnoop cannot be randomly generated, as it assumes a line does not reside in any cache outside the port chain leading from the requesting master to the destination slave. In case the line does reside in another cache, the initiating master might read a stale copy from the destination slave.

A typical scenario for ReadNoSnoop would be a coherent I/O device issuing a ReadNoSnoop request to an address which is cacheable (e.g. a cache line-fill request) yet not shared with any other master, such as a private buffer in the main memory.

4.3.3 WriteNoSnoop

A WriteNoSnoop transaction is used in a region of memory that is not shareable with other masters. A WriteNoSnoop transaction can result from a program action (e.g. a store), or from an update of the main memory for a cache line that is in a non-shareable region of memory. WriteNoSnoop responses must never be shared or dirty.

Similarly to ReadNoSnoop, the original purpose of this request is to avoid issuing snoop requests in vain when the cache line cannot reside in a different cache in the system.

WriteNoSnoop is a useful methodological example for demonstrating the difference between the ACE specification and its extension to a multi-hop memory system:


• In ACE, a master which issues a WriteNoSnoop is directly connected to the main interconnect (CCI). The request is then directly forwarded to the appropriate slave, thus performing a memory write operation.

• In gem5, a master issuing a WriteNoSnoop might be connected to multiple layers of caches and buses along the path to the destination slave. Assuming the master is connected to a cache:

– In case of a cache miss, the cache will issue a ReadNoSnoop request to exclusively obtain the cache line. Once the cache line is obtained, the line is modified and a write response is provided to the issuing CPU.

– In case of a cache hit, as the line is not shareable, it is either in E-state or M-state. The cache will update the line and provide a write response to the issuing CPU.

– Upon eviction of the cache line, since it is a non-shareable line, a WriteBack request can be issued downstream without performing any snooping.

Thus, in the context of gem5, a WriteNoSnoop can also trigger ReadNoSnoop and non-snooping WriteBack requests.

4.3.4 MakeInvalid

A MakeInvalid transaction is used in a region of memory that is shareable with other masters. The MakeInvalid transaction ensures that the cache line can be held in a unique state, by broadcasting an invalidation message to all possible sharers. This permits the master to carry out a store operation to the cache line, but the transaction does not obtain a copy of the data for the master. All other copies of the cache line are invalidated. The request must be of full cache-line size.

This request type is a member of the cache maintenance set of transactions provided by ACE. An example of a transaction utilizing it is WriteLineUnique, which was implemented and is further described in Section 4.3.5.

MakeInvalid is the first transaction introduced to gem5 that broadcasts a request: it does not require a response, yet should be forwarded to all snoopable masters. As the gem5 memory system did not support broadcasts, broadcasting had to be implemented prior to implementing MakeInvalid, as described below.

In ACE, a master issues a MakeInvalid request, which is forwarded by the interconnect to all other snoopable masters. In gem5, the broadcast message reaches each coherent bus, which in turn forwards this invalidation-snoop message upstream. The process continues until all snoopable masters in the system have received the invalidation message.

Broadcast support

gem5 already supported transaction types which do not require a response, such as gem5's WriteBack, which blindly writes data downstream with no need for a response. However, such transactions are aimed at a specific, single destination. Once a packet reaches its destination, the request conveyed by the packet is de-allocated.

In a broadcast, however, any receiving end must not attempt to free the conveyed request. On the other hand, the memory system must implement some sink for eventually freeing the original request.

Since this infrastructure is transaction-independent, it has been developed and posted separately. To implement broadcasting, a new packet attribute was added, which denotes that the packet contains a forwarded request and thus should not be freed at the receiving master's end, but forwarded on upstream (e.g. to a bus's masters, or to a cache's master port). It is the responsibility of the broadcast initiator to free the request once the broadcast is over.

This mechanism resembles the isMemInhibit mechanism used to instantaneously flag (hence broadcast, sharing knowledge between components) that a request has been responded to.

This need is a result of the possible multi-level memory systems gem5 supports, and is a generalization of ACEand CCI.
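A sketch of the receiving side of such a broadcast follows; the isForwarded flag stands in for the new packet attribute described above, and all names are simplified stand-ins:

    struct Request { /* allocated and owned by the broadcast initiator */ };
    struct Packet {
        Request *req = nullptr;
        bool isForwarded = true;   // do not free req at the receiver
    };

    bool hasUpstreamMasters() { return false; }   // topology query (elided)
    void invalidateLocalCopy(Packet &) {}         // cache maintenance action
    void forwardUpstream(Packet &) {}             // propagate the broadcast

    void recvInvalidationSnoop(Packet &pkt)
    {
        invalidateLocalCopy(pkt);
        if (hasUpstreamMasters())
            forwardUpstream(pkt);   // keep the broadcast going
        // No delete here: only the broadcast initiator frees the
        // request, once the broadcast has terminated at all sinks.
    }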

4.3.5 WriteLineUnique

A WriteLineUnique transaction is used in a region of memory that is shareable with other masters. A single write occurs, which is required to propagate to main memory. A WriteLineUnique transaction must be a full cache-line store, and all bytes within the cache line must be updated.

According to ACE, a master issues a WriteLineUnique towards CCI. In turn, CCI performs two operations:

• Sends the write request downstream to a slave,


• Sends MakeInvalid requests to all other snoopable masters.

In gem5, each master port receiving the MakeInvalid broadcast must forward it upstream towards all snoopable masters. The MakeInvalid broadcast is sunk once it reaches a slave port which does not have any further downstream port. This solution is similar to the method used to sink requests in a slave when the request has already been responded to by some cache in the system.

However, the major difference between ACE and the gem5 extension concerns the case where the request results in a hit to an exclusive or modified cache line. In such cases no broadcast has to be performed, as the line is guaranteed not to reside in any other cache in the system. Otherwise, the bus behavior in gem5 is the same as CCI's: the data should be written to main memory and an invalidation should be broadcast to all other snoopable masters.

The purpose of a WriteLineUnique is to inform all masters that might contain a line that its entire contents will be overwritten, and as such there is no reason to perform a WriteBack, as would be required when a WriteLine is issued.

4.4 Modeling inaccuracies, optimizations and places for improvement

Although this work made extensive changes to the gem5 memory system, there are still several modeling inaccuracies which should be revised in order to make the memory system more realistic:

• Perfect access speculation: currently, a snoop hit results in the memory controller not handling the request but deleting it. This means a speculative fetch to the DRAM is magically avoided. This harms simulation accuracy, as the memory controller should have been occupied; the bandwidth, latency, and resource contention would all have been different had the speculative fetch been performed. This oddity is a result of the express-snooping mechanism. Another inaccuracy caused by this mechanism is the parallel zero-time snooping; a real interconnect might serialize snoop requests, or implement different fetch speculation.

• In addition, snoop misses effectively come at zero cost.

• The memory system adopts an optimization which provides a cache line in an exclusive state upon a read request which did not result in any snoop hit. This is not magical modeling, yet it is not a by-the-book implementation of the MOESI protocol.

• Due to all of the afore-mentioned reasons, it seems inevitable to re-design the snooping mechanism such that no zero-time or magical knowledge sharing can be done. The effect of such a change would be noticeable in workloads that generate sufficiently high inter-cluster snoop hit rates (passing data between clusters) while a memory-intensive client is simultaneously draining the DRAM's bandwidth. It would be best to correlate gem5's memory system with existing hardware using such a mixed workload.

• The cache-line ownership-passing policy has a significant impact on the reported performance. For instance, data is passed on-chip when a read-exclusive snoop request results in a hit to a modified line. The by-the-book implementation would require performing a write-back to update the main memory with the modified line prior to providing the requesting master with a clean, exclusive copy. Instead, data is passed on-chip together with ownership of the line. Such assumptions can significantly change overall system performance, and the policy might therefore be redesigned to support other protocols.

• It is due to the afore-mentioned trade-offs that the actual benefit of the implemented ACE transactions cannot be demonstrated using the current gem5 memory system. An elaborate discussion with a per-transaction-type explanation is provided in Section 7.1.4. Nevertheless, the insight gathered during the process of implementing and extrapolating ACE transactions to gem5 is a significant contribution on its own.

• The bus model now utilizes layers to better model resource contention. However, it still lacks a mechanism for limiting the number of outstanding transactions. The work done on the bus brought it significantly closer to a modern interconnect, yet this key feature is still missing.

• Currently, barriers are implemented in the CPU models. Offloading barriers to the memory system could have a significant impact. The avid reader can find an informal description of ARM barriers in [34], or the exact ACE definition in [6].


• The bus model does not model any PoS or QoS features, nor address striping. These features should be investigated for their impact; e.g., QoS support should be tested to verify the assumptions made during the interconnect's design.

All of the afore-mentioned topics have been troubling several members of the active gem5 community. They range from challenging to daunting and are left as future work for well-justified reasons. Nevertheless, important computer-architecture and SoC-design conclusions can be drawn based on these features, and as such they should be carefully considered as part of future research. Further topics proposed as future work are provided in Section 7.3.

4.5 Bus performance observability

In order to analyze the interconnect’s performance, and evaluate the changes made to the bus and the memorysystem, statistics were added to model. In Section 3.6.1 the main performance-analysis building blocks gem5offers were presented. The main advantage of utilizing stats in the bus models and in each layer is the ease ofachieving insight from just several new stats. The current alternative would be to utilize multiple communicationmonitors, one between each two ports which are of interest from a system perspective, and post-processing theirdata to get a system-wide an overview. Utilizing communication monitors also comes at a high performance pricewhen comparing to stats, as communication monitors result in deeper call stacks, and additional allocation andde-allocation of SenderState structs which are attached to each packet passing through a communication monitor.

Since statistics collection and dumping have a negative impact on performance, statistics have to be carefully considered. To improve observability with a moderate performance impact, a small set of statistics was added.

Global bus statistics

The following statistics are collected at the bus level, per bus type:

• Throughput: the bandwidth (bytes/sec) provided by the bus as a result of servicing packets of any kind, aggregated over all ports. In case of a coherent bus, the throughput is the sum of the data (regular requests and responses) and snoop traffic throughput.

• Transaction distribution: per transaction type (e.g. read request, read-exclusive request, write response, upgrade request), the number of transactions serviced by the bus, aggregated over all ports.

• A per-port transaction distribution: per master/slave port, the number of transactions of any type serviced by the bus.

• In both the coherent and the non-coherent bus, the aggregate data through the bus is monitored and used for calculating the bus's throughput.

• In the coherent bus, the aggregate snoop data through the bus is additionally monitored.

The above-mentioned statistics are calculated by aggregating the packet data and type collected upon servicing each incoming packet. The packet's request type and payload size (when applicable) are recorded and averaged over the sampling period.

Layer statistics

The following statistics are collected per bus layer, for gaining resource-specific insight:

• Occupancy: the average number of master or slave ports waiting for the layer to become available; more specifically, ports pending in the layer's retry list.

• Utilization: the percent of the time the layer was occupied (not in IDLE state).

• Average waiting ports: the average number of ports pending for the layer to become available for their service.

• Average waiting time in the retry list: the average time a port waits, from its arrival at the retry list until the layer becomes available for service.

• A per-port occupancy: reflecting the time each port waits till the layer becomes available for its service.

The above-mentioned statistics are based on book-keeping done upon any layer state change.


4.6 Conclusions

• Making changes to gem5’s memory system requires in-depth understanding of a vast amount of code. Specif-ically, the bus and caches’ mechanisms are far from trivial and require a very intensive study period beforechanging a single character.

• In order to make any changes to such complex systems, it is best to have proper documentation, convenient observability, and fine-grained test rigs for playing around with changes before trying to invoke a real workload such as BBench.

• Such debugging infrastructure has been added to better visualize complex interactions in the memory system.

• Such a fine-grained testing infrastructure has been created, based on MemTest, a unit-test-like tester which is further described in Section 5.1.

• Applying real-world specifications to an exploration infrastructure introduces very challenging design questions. These questions can lead to insight about how a system or interconnect should be designed. Some of the questions raised are provided as suggested future-work topics in Sections 4.4 and 7.3.


Chapter 5

Implementation and Verification Framework

Before commencing any code changes to gem5's memory system, in-depth knowledge of transaction flows and the implemented coherence protocol had to be acquired. gem5 includes a convenient verbosity system which enables triggering informative messages according to need, yet these were not sufficient for bridging the knowledge gap. Furthermore, no workloads were available that were small enough to isolate sequences and interactions. For this reason, additional debug information was added to the existing mechanism, and a small-scale testbench named MemTest was used. In this chapter MemTest is described, focusing on improvements made that are useful for testing and changing the memory system. It is then used for validating ACE transactions. Lastly, statistics that have been added to the bus model for improved observability are discussed.

5.1 MemTest

This project required making several types of changes to gem5's memory system, all throughout its hierarchy: in buses, caches, packet and request methods, and so forth. As gem5's memory system and the interactions which occur in a system are very intricate, a fine-grained tool for learning and testing small-scale sets of events was required. gem5 already had a unit-test-like infrastructure called MemTest, yet it suffered from various problems and was in general obsolete.

The main changes included:

• The old MemTest was based solely on random generation of requests. However, it did not generate any true sharing, as each CPU would generate requests to a specific byte in a cache line, according to its unique CPU ID. The updated MemTest generates truly random requests, and thus true sharing. As such, MemTest poses itself as a functional tester or small-scale regression for the memory system.

• In order to enable fine-grained manual injection of interesting sequences of instructions into the memory system, scenario support was added to MemTest. Each scenario contains a list of instructions to be injected, including basic details such as the CPU that should initiate the request, the type of the request, the time at which the request should be initiated, and so forth. This boosted development and verification progress and is demonstrated in Section 5.2.

• The original MemTest composed test systems in a recursive manner, which eventually created awkward systems. Figure 5.1 contains a '3:2:1' system (hence three L2 caches, each connected to two L1 caches, each connected to a CPU) using the original MemTest. Figure 5.2 contains a functionally equivalent '3:2:1' system using the improved MemTest. The visual difference makes it much easier to understand the simulated system.

• Supporting any test-system hierarchy: a MemTest system is defined using a fanout-like, colon-separated tree string specification, where each level is defined by the number of caches (or CPUs at the last level) per cache in the previous level. For example, a '2:1' specification describes a system with two caches, each connected to a CPU. A '2:3:1' specification describes a system with two L2 caches, each connected to three L1 caches, each of which is connected to one CPU. While the original MemTest supported a limited set of specifications, the modified MemTest supports any specification; a sketch of parsing such a specification is given below. Note that the purpose of the funcmem is the functional checking of each performed request; this memory is a scoreboard and not part of the simulated system.
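As an illustration of the specification format, the following self-contained C++ sketch parses such a string into per-level fanouts and computes the total number of CPUs (gem5's MemTest systems are composed in Python configuration scripts; this sketch only illustrates the format):

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Parse a colon-separated tree specification such as "3:2:1" into
    // per-level fanout counts.
    std::vector<int> parseTreeSpec(const std::string &spec)
    {
        std::vector<int> fanout;
        std::stringstream ss(spec);
        std::string level;
        while (std::getline(ss, level, ':'))
            fanout.push_back(std::stoi(level));
        return fanout;   // e.g. "3:2:1" -> {3, 2, 1}
    }

    int main()
    {
        // "3:2:1": 3 L2 caches, 2 L1 caches per L2, 1 CPU per L1,
        // i.e. 3 * 2 * 1 = 6 CPUs in total.
        int total = 1;
        for (int n : parseTreeSpec("3:2:1"))
            total *= n;
        std::cout << "CPUs: " << total << "\n";   // prints "CPUs: 6"
        return 0;
    }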


[Figure 5.1: A 3:2:1 old (unintuitive) MemTest system.]

[Figure 5.2: A 3:2:1 new MemTest system.]

5.2 ACE-transactions development

A major part of this project involved adding support for ACE-like transactions to gem5's memory system. Development and testing of these transactions was heavily based on MemTest, as it allowed injecting any stimuli and covering corner cases easily.

Each newly-introduced transaction was tested using:

• Scenarios in which other caches in the system contain the same accessed cache line in states M/O/E/S/I, in order to verify that the caches were correctly modified and the cache-line after-state is as required. Each scenario contained a warm-up stage meant to bring the system to the required situation before the transaction under test is issued.

• Simple and complex systems, for instance systems with two or four CPUs, with or without level-2 caches, to verify snooping is forwarded and handled correctly; hence systems with either single- or multi-hop interconnects.

• Using either Atomic or Timing memory modes, to verify functional correctness in both modes.

5.3 ACE transaction verification

The following section describes how each of the ACE-like transactions was tested. Since the verification process covering all situations for all test systems is exhaustive and lengthy, only illustrative traces will be provided. All provided examples are for the Timing memory mode. The provided snapshots contain the injected scenario and snippets from gem5's log which contain any initiation or completion of a request, as well as main cache activities such as cache-line state changes. Note that all addresses specified in the input scenario are offsets that are added during simulation on top of a base address (0x10000 in this case); thus the logs will contain a different address (e.g. scenario offset 0 appears in the log as address 0x10000). Time units are simulation ticks, which are equivalent to 1 ps. Requests are always issued with a sufficiently long break between consecutive transactions such that there is never more than one transaction in flight.

5.3.1 ReadOnce

This section contains a simple scenario for testing a ReadOnce request to a line which resides in a cache in the system in M-state. The test system is a simple '3:1' MemTest system, depicted in Figure 5.3.

[Figure 5.3: MemTest 3:1 system.]

The test scenario is trivial:

• A Write request to address 0 is issued by CPU 0 (system.cpu0) at time 0. As a result, CPU 0's cache (system.l1c0) will issue a read-exclusive request to obtain the cache line in E-state, and then modify the line (updating it to M-state) due to the write request.

• A ReadOnce request to the same address is issued by CPU 2 (system.cpu2) at time 3000. The request is sent to the bus and forwarded (as a snoop read) to all other caches; it is handled as a snoop hit by system.l1c0, which later provides a snoop response with the data while keeping its line in M-state. The test ends once the ReadOnce response arrives at CPU 2.

The scenario and a snippet of the simulation log are provided in Figure 5.4.

5.3.2 ReadNoSnoop

This section contains a test scenario for testing a ReadNoSnoop request to a line which resides in a cache in the system in E-state. The test system is the simple '3:1' MemTest system depicted in Figure 5.3.

The test scenario is very simple:


Figure 5.4: ReadOnce for an M-state line

• A ReadEx (read-exclusive) request to address 0 is issued by CPU 0 (system.cpu0) at time 0. As a result, CPU 0's cache (system.l1c0) will issue a read-exclusive request to obtain the cache line in E-state.

• A ReadNoSnoop request to the same address is issued by CPU 2 (system.cpu2) at time 3000. The request is sent to the bus, yet it is not forwarded to any other master but sent directly to the main memory, which provides the response. CPU 2's cache handles the response and allocates a cache line. In the end state, the cache line is in E-state.

This scenario obviously breaks coherency in the system, yet it is injected for testing and demonstration purposes. The purpose of ReadNoSnoop is to avoid snooping in vain, and what we demonstrated is that the bus indeed does not forward ReadNoSnoop requests upstream as snoops (a sketch of this routing rule follows Figure 5.5).

Figure 5.5: ReadNoSnoop for an E-state line
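The behavior demonstrated here, and again for WriteNoSnoop in Section 5.3.3, boils down to a simple routing rule on the bus. The sketch below is illustrative pseudocode of that rule, not gem5's actual bus code.

# Illustrative bus-side routing rule for no-snoop transactions.
NO_SNOOP_CMDS = {'ReadNoSnoop', 'WriteNoSnoop'}

def route_request(bus, pkt):
    if pkt.cmd in NO_SNOOP_CMDS:
        bus.send_downstream(pkt)      # straight to memory, no snoop phase
    else:
        bus.forward_snoops(pkt)       # snoop all other cacheable masters
        bus.send_downstream(pkt)      # then route towards memory as usual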

5.3.3 WriteNoSnoop

This section contains two test scenarios for testing a WriteNoSnoop request:

• A scenario demonstrating WriteNoSnoop as defined in the ACE specification, in a system which utilizes a single-hop interconnect, as depicted in Figure 5.6:

[System diagram: three cacheless MemTest CPUs connected directly through membus (a CoherentBus) to physmem (a SimpleMemory), with funcmem serving the functional ports.]

Figure 5.6: MemTest 3 system

In this scenario, CPU 2 issues a WriteNoSnoop request to address 0 at time 0. As a result, no snooping is performed and the bus sends the request directly to the memory, which provides a completion response. The scenario and a snippet of the simulation log are provided in Figure 5.7.

Figure 5.7: WriteNoSnoop in an ACE context

• A scenario for demonstrating WriteNoSnoop in a simple multi-hop interconnect, as depicted in Figure 5.3.

Figure 5.8: WriteNoSnoop for an E-state line in a multi-hop interconnect

– CPU 0 initiates a ReadEx (read-exclusive) request to address 0 at time 0. CPU 0's cache receives the line and stores it in E-state.

– CPU 2 initiates a WriteNoSnoop request to address 0 at time 3000. CPU 2's cache initiates a ReadNoSnoop request, receives the line, stores it in E-state and then modifies it with the write request's data. As such, the cache-line state changes from E to M. The bus does not issue any snooping during the write operation.

In this case too, the toy example breaks coherency, as two caches in the system now contain different values for the same address. Yet this illegal scenario is for demonstration purposes only.

5.3.4 WriteLineUnique

WriteLineUnique triggers the generation of MakeInvalid in its ACE context, and thus MakeInvalid is also covered in this section. Similarly to Section 5.3.3, two scenarios are provided: one demonstrating an ACE implementation, and another a multi-hop interpretation. Both scenarios make use of a two-cluster system as depicted in Figure 3.8(b).

• In its ACE implementation, when the bus receives a WriteLineUnique request, it issues a MakeInvalid broadcast to all other masters to invalidate their copies of the cache line being accessed. Such a scenario is provided in Figure 5.9.

– The first two instructions are issued to create a situation with a line in O-state: CPU 0 issues a write request to address 0 at time 0, eventually leading to the line being in M-state in its cache (system.l1c0). A subsequent read request from CPU 1 at time 1000 leads to system.l1c0 moving from M-state to O-state and passing a copy to CPU 1's cache. The end result of these first two instructions is a system with two copies of the same cache line, one in O-state (system.l1c0) and one in S-state (system.l1c1).

– At time 3000, CPU 3 issues a WriteLineUnique to address 0. As this write instruction is of cache-line size, its level-1 cache allocates the line and updates it to M-state. An invalidation (MakeInvalid) broadcast is sent downstream and forwarded to all snoopable masters. The copies in caches system.l1c0 and system.l1c1 are both invalidated and the transaction ends.

• In the gem5 context, a WriteLineUnique issued by a master can be received by an intermediate cache. In such a case, if the cache line is in the Unique-Clean (E) or Unique-Dirty (M) state, there is no need to broadcast an invalidation request, as the coherence protocol ensures the line resides neither in any other cache downstream in the system nor in any other cluster (a sketch of this decision appears after this list).


Figure 5.9: WriteLineUnique for an O-state line in an ACE context

Figure 5.10: WriteLineUnique in a multi-hop context


– The system wakes up, and at time 3000 CPU 3 issues a Read request to address 0. This eventually leads to the cache line being in E-state rather than, as one might have expected, in S-state, due to an optimization described in Section 4.4: the line does not reside in any other cache in the system at that moment.

– At time 4000, CPU 3 issues a WriteLineUnique. While in the classic ACE context this would have triggered an invalidation broadcast, here, since the line is already in E-state, there is no need to issue one. The cache-line state is updated to M-state and the transaction ends.
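The decision the intermediate cache makes in the gem5 context can be summarized by the following sketch (illustrative pseudocode, not gem5's actual cache code):

# Sketch of the invalidation-suppression optimization for WriteLineUnique.
UNIQUE_STATES = {'E', 'M'}   # Unique-Clean / Unique-Dirty

def handle_write_line_unique(cache, line):
    if line.state not in UNIQUE_STATES:
        # Other copies may exist: invalidate them, ACE-style.
        cache.broadcast_downstream('MakeInvalid', line.addr)
    # In E/M the protocol guarantees no other copy exists, so the
    # broadcast is skipped and the line is simply dirtied.
    line.state = 'M'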

5.4 Conclusions

• MemTest proved essential for studying small-scale interactions in gem5's memory system, and an enabler for rapidly introducing new types of transactions. It has been demonstrated to be an easy tool for creating any sought-after situation in the memory system, even a coherency-breaking one.

• All ACE-like transactions were verified using MemTest.

• Automatically visualizing test systems makes verifying a test system's structure trivial. This feature is further discussed in Appendix A.

• The aforementioned changes are useful for a large crowd of gem5 users. Specifically, they provide users with more convenient tools and improved system observability.


Chapter 6

Performance Analysis

6.1 Metrics and method

Throughout the experiments, the following indicators will be used to investigate a system’s performance:

• General system, CPU and cache statistics: test execution time (guest time), CPU idle-cycle percentage, CPU memory references, L1 I-cache miss rate, L1 D-cache miss rate, L2 miss rate.

• Memory stats: DRAM read / total bandwidth, aggregate total / read data through the bus

• Bus stats: throughput; request / response / snoop response layer utilization; average time in the snoop response layer retry list; aggregate data and snoop data through the bus.

• Simulator performance: simulation time (wall clock), host instruction rate, host tick rate. This is meant for evaluating how reasonable it is to run workloads of various sizes.

All information gathered will be based on gem5's stats mechanism, which was described in Section 3.6.1. A small helper for extracting such indicators from gem5's stats output is sketched below.
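The sketch below parses gem5's plain-text stats output; the stat names passed to it depend on the simulated configuration, and the two names printed at the end are gem5's global simulator-performance counters.

# Minimal parser for gem5's stats.txt ("<name> <value> ... # <description>").
def read_stats(path):
    stats = {}
    with open(path) as f:
        for line in f:
            if line.startswith('-') or not line.strip():
                continue  # skip the Begin/End banners and blank lines
            parts = line.split()
            try:
                stats[parts[0]] = float(parts[1])
            except (IndexError, ValueError):
                pass      # non-scalar lines are ignored in this sketch
    return stats

stats = read_stats('m5out/stats.txt')
print(stats.get('host_inst_rate'), stats.get('sim_seconds'))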

6.2 Hypotheses

The importance of a hypotheses section engineered after the experiments is mostly methodological. However, the purpose of this section is to stress the expected trends and behavior throughout the simulations covered in this chapter.

Regarding the impact of snoop response latency, and queued responses in general:

• Little if any change should be seen when sweeping the snoop response latency upwards for all tests with a single core. There might be some impact, as the system also contains an I/O cache which can be snooped, yet the sharing patterns will most probably be negligible.

• For larger systems, we expect all system indicators to degrade as the snoop response latency knob is scaled up.

• For all experiments, we expect simulations utilizing the CoherentBus (and not the DetailedCoherentBus) to consistently perform better, as its temporal behavior is less accurate, in an optimistic way.

• These hypotheses are based on an underlying assumption that the PARSEC benchmarks exhibit intensive inter-core and inter-cluster sharing patterns. As inter-cluster sharing decreases, so does the implication of the snoop response latency: if there is no sharing between clusters, there is no snoop-hit traffic on the interconnect, and no responses are penalized.

• Due to the acute simulation performance problem discussed in Section 3.6.2, running lengthy simulations with multiple A15 O3 cores would not be feasible. For instance, a BBench run on a four-core platform would require weeks of processor time.

Regarding the impact of splitting the bus into layers for improved resource-contention modeling:

• There is no question about this move's necessity: the previous single-resource bus was an outdated model. Since the bus has been changing rapidly, we cannot establish a fair old-versus-new comparison.


Regarding the evaluation of ACE transactions:

• As briefly described in Section 4.4, and as will be elaborated in Section 7.1.4, the current state of gem5's memory system does not provide a reference system for conducting an unbiased comparison.

6.3 Small-scale workloads

We are interested in evaluating the impact of the snoop response service latency and of bus resource contention on a system's performance. We aim to demonstrate how the added bus statistics enable observing what the system bottlenecks are. In order to stress the bus such that the snoop response latency makes a difference, we have to utilize workloads which are inherently multi-threaded with intensive sharing patterns, since a lack of sharing means no snoop hits will occur, and thus no snoop response traffic will be observed.

In order to evaluate a set of features, it is wise to start with small-scale workloads. In such workloads, oriented to stress specific features, there is less noise in the system; in a full system running the Linux OS, for instance, it is harder to reason about small-scale effects. Furthermore, small-scale simulations are also a form of verification of the work done, reassuring us that the implementation is correct and functions as required.

A convenient tool for small-scale explorations is MemTest, which was discussed in Section 5.1. MemTest can be used to conveniently generate any arbitrary symmetric system, and can easily be configured to inject transactions into the memory system in interesting patterns.

6.3.1 MemTest experiments setup

The most primitive yet representative system that can be used for an experiment which deals with snoop traffic on the bus is depicted in Figure 6.1.

[System diagram: two MemTest CPUs, each with a private L1 BaseCache, connected through membus (a DetailedCoherentBus) to physmem (a SimpleMemory), with funcmem serving the functional ports.]

Figure 6.1: Small-scale snoop-response test system

The test system:

• The system and stimuli were designed to provide controllable and significant sharing behavior and snoop traffic through the bus.

• Contains two MemTest CPU transaction generators. Each of them is configured to generate 10,000 random read or write transactions, with a 65% probability of issuing a read request.


• The memory size is configurable. Each generated request is issued to a random destination address within this memory range. As a result, the smaller the memory, the higher the likelihood that an access results in a snoop hit (triggering a snoop response).

• Each CPU has a private 32 KB cache.

• Both caches are connected to the main interconnect. The interconnect's snoop response latency can be adjusted.

• Each system was simulated with memory sizes ranging from 128 B to 32 MB. Beyond that range, the chances of randomly sharing a cache line are not interesting in this context, as their impact on system performance will be negligible.

• Each system was simulated using a range of snoop penalty latencies (from 10 ns to 90 ns), and in addition using the previous coherent bus model (CoherentBus) as a reference (see the sweep sketch below).
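A hypothetical driver along the following lines reproduces the sweep. The config script path and its option names are illustrative assumptions, not gem5's actual command line.

# Hypothetical sweep over memory sizes and snoop response latencies.
import itertools
import subprocess

mem_sizes = ['128B', '1kB', '8kB', '64kB', '512kB', '4MB', '32MB']
snoop_lats = ['10ns', '30ns', '50ns', '70ns', '90ns']

for size, lat in itertools.product(mem_sizes, snoop_lats):
    outdir = 'm5out/memtest_%s_%s' % (size, lat)
    subprocess.check_call([
        './build/ARM/gem5.opt', '--outdir=' + outdir,
        'configs/example/memtest.py',        # assumed script
        '--memsize=' + size,                 # assumed option names
        '--snoop-latency=' + lat,
    ])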

6.3.2 MemTest experiments results

Results for the MemTest-based set of experiments are provided in Figure 6.2. In all observed metrics, the results exhibit four distinct phases:

• At the lowest range of the x-axis, which represents the memory address range to which random requests were issued (roughly 10^2 to 10^3 bytes), execution times (depicted in Figure 6.2(a)) decrease as the shared memory area increases. At this point the aggregate snoop data through the bus (depicted in Figure 6.2(c)) is maximal and decreasing, as is the miss rate. This can be attributed to the tiny shared address space: the likelihood that one cache contains an address required by the other cache is at its highest. The relatively high execution times are due to the higher miss rates (depicted in Figure 6.2(g)) and the additional snoop traffic through the bus. The increased miss rate stems from the high probability that cache 0 will try to write to a line which resides in cache 1; such an access invalidates cache 1's line, thus increasing cache 1's miss rate on subsequent accesses.

• As the shared range increases towards 10^4.5 bytes, the larger shared memory range causes fewer such cross-invalidations (observable in Figure 6.2(b)), resulting in lower cache miss rates and thus shorter execution times.

• In the range around the cache size we observe a sharp change in all parameters. Obviously, once the shared memory range is larger than the cache size, the miss rate rises sharply, degrading performance in all metrics. This is also observable in the sharp drop in aggregate snoop data, seen in Figure 6.2(c).

• Beyond 10^5 bytes the sharing patterns decay, and the system becomes limited by the physical memory, as the saturating DRAM throughput in Figure 6.2(f) shows.

The results are all consistent with our hypotheses:

• As seen in Figure 6.2(a), the reference bus always outperforms the detailed coherent bus.

• In systems which utilize the DetailedCoherentBus, as the snoop penalty rises, the system's performance decreases.

The set of metrics used provided insight into the system's bottlenecks for each of the configurations.

6.4 Large-scale workloads

The small-scale experiments provided almost absolute observability regarding the impact of the bus's snoop response latency on system performance as a function of sharing: how much the clusters actually snoop-hit each other. Although the small-scale experiments used toy systems, they provided meaningful insight, and the confidence level in the changes made has been partially established. In order to evaluate the same phenomena on larger-scale full systems with real-life workloads, we use the PARSEC [17] benchmark suite.


[Plot panels, each versus shared memory area size (bytes), for the CoherentBus reference and snoop latencies of 10 ns to 90 ns: (a) execution time; (b) snoop response layer utilization; (c) aggregate snoop data through the bus; (d) aggregate data (non-snoop) through the bus; (e) bus throughput (data and snoop); (f) DRAM throughput; (g) cache miss rate.]

Figure 6.2: Simulation results for MemTest-based snoop-response latency experiments


6.4.1 PARSEC experiments

PARSEC is a benchmark suite composed of multi-threaded programs. The suite was designed to be representative of next-generation shared-memory programs for chip multiprocessors. Initial exploration of PARSEC benchmarks performed at ARM on systems with two CPUs showed little sensitivity to the L2 hierarchy (whether shared or private), as well as insensitivity to snoop latencies. Our hypothesis was that the modeling performed was partial, and that the current bus model would demonstrate variations in the results as a function of the snoop response latency.

All experiments are run on the target platform gem5 model as presented in Section 3.6, which contains four cores, or on slimmed-down versions with one or two cores. Since invoking a simulation with four A15 O3 cores was seen to be problematic performance-wise (as discussed in Section 3.6.2), simulations were run using Timing CPUs.

The following PARSEC benchmarks were used:

• Blackscholes: a financial analysis model: an option-pricing kernel that uses the Black-Scholes partial differential equation (PDE). Involves a large number of read requests and very few write requests. Exhibits a light and regular communication pattern (each core has a consumer-producer relation with a single other core: neighborhood communication).

• Bodytrack: a computer vision application: a body-tracking application that locates and follows a marker-less person. Exhibits a very light (sparse) communication pattern (only a few of the threads have a consumer-producer relation with a different core).

• Canneal: an engineering workload: a cache-aware simulated-annealing kernel which optimizes the routing cost of a chip design. Consumes a very high DRAM bandwidth and exhibits a very high IPC and high cache and DTLB miss rates. Exhibits an extremely heavy sharing pattern (each core communicates intensively with the rest of the cores).

• Fluidanimate: fluid dynamics animation: a fluid dynamics application that simulates physics for animation purposes with the Smoothed Particle Hydrodynamics (SPH) method. Exhibits high extra-CPU utilization and a light, regular communication pattern (similar to Blackscholes).

• Swaptions: financial analysis: prices a portfolio of swaptions with the Heath-Jarrow-Morton (HJM) framework. High extra-CPU utilization; a dense but irregular communication pattern.

• Streamcluster: data mining: a data-parallel stream-clustering algorithm used for finding medians. Exhibits low sharing behavior and uses shared memory mostly as read-only. Its communication pattern is similar to Blackscholes' (neighborhood communication), yet less intensive.

• Vips: media processing: an image-processing application; exhibits an extremely high L2 miss rate (around 50%) and a high DRAM write bandwidth. Exhibits an irregular communication pattern.

• x264: media processing: an H.264 video-encoding application; spawns many threads (no more than four simultaneously), stresses the I-side (I-cache, branch prediction, ITLB), and exhibits an irregular communication pattern.

The provided analysis and description of each of the benchmarks is based on [15, 17, 18] and previous work conducted at ARM. As PARSEC is multi-threaded, it is important to understand the communication patterns between CPUs in each benchmark, as each benchmark may exhibit different sharing or consumer-producer patterns. Figure 6.3 provides insight into the sharing patterns each PARSEC benchmark exhibits on a multi-core platform. The gray level of each cell in a matrix represents how intense the communication between two CPUs is, with darker shades representing intensive communication. The axis units are CPU IDs.

The PARSEC benchmark suite offers six standardized input sets. According to [18], larger input sets guarantee the same properties as all smaller input sets. As such, to enable exploration of a large set of configurations, the simsmall input set was used. According to [17], the PARSEC benchmark set is diverse in the synchronization methods its benchmarks use, their working sets, locality, data sharing and off-chip traffic. This makes PARSEC a representative workload bundle for our needs.

6.4.2 PARSEC results

The results provided in the following section are based on statistics collected during the benchmark run. Simulation requires the Linux OS to be booted prior to starting each test, yet the boot process might severely affect the results, especially when using small datasets whose run time is short compared to the boot time.


[Plot: per-benchmark matrices of producer-CPU versus consumer-CPU communication intensity (darker is more intense) for blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, streamcluster, swaptions, vips, and x264.]

Figure 6.3: Normalized communication between different CPUs during the entire parallel phase of the program, for the PARSEC benchmark suite. Source: [15]


Statistics were therefore reset after the boot process, and the results shown cover the benchmark period only.

The selected set of results aims at providing insight regarding system performance and how memory-intensive each benchmark is.

In case normalized results are provided, normalization is done per configuration (number of cores, number of threads, and benchmark used) using the result of the CoherentBus as a reference point. Hence, the normalized values presented are the ratio between each of the DetailedCoherentBus's statistics and that of the CoherentBus; for example, a normalized execution time of 1.2 means the DetailedCoherentBus run took 20% longer than the reference run. Results are only provided if a reference value exists.

Execution time and simulator performance

From a system-level point of view, the two most important metrics are execution time and energy. There is ongoing work on adding power modeling to gem5, yet it is currently unavailable. The execution times presented are for a parallel execution, hence represent guest time and not an aggregation of the processing times of all CPUs.

This section provides an overview of PARSEC execution (guest system) times and simulator performance. Results are for the CoherentBus (and not the DetailedCoherentBus), which is the reference bus in the rest of the provided results.

[Plot panels, per benchmark and core/thread configuration: (a) execution times overview; (b) normalized execution times; (c) simulation time (host seconds); (d) host instruction rate; (e) host tick rate.]

Figure 6.4: General simulation and simulator-performance stats for PARSEC benchmarks

An overview of the execution times demonstrates the rich variety even when using the simsmall dataset: tests take from less than a second to more than a minute of guest time to finish. The benchmarks differ significantly in their behavior:

• most of the benchmarks, e.g. blackscholes, canneal, and swaptions, demonstrate decreasing execution times as more cores are available. This can be observed both in Figure 6.4(a) and in the normalized execution times in Figure 6.4(b).

• in the case of streamcluster, using two threads per core significantly crippled performance. It should be stated that the CPU model used models an in-order core, and results are expected to differ significantly using a multi-threading-capable CPU.

Simulator performance results are provided to give the reader a glimpse of gem5's capabilities, not as part of a scientific investigation. They provide insight as to which activities slow down the simulation. The host instruction rate results provided in Figure 6.4(d) demonstrate that for the benchmarks and CPU model used, the impact of simulating multiple cores is negligible, which is a positive piece of information. This of course comes at the price of accuracy, as the more accurate O3 model has demonstrated severe degradation as the number of CPUs increases. The host instruction rate can be seen as the simulator's horsepower: how much computation can be done in a unit of time. The simulator's tick rate in Figure 6.4(e), on the other hand, represents an imaginary clock frequency: the rate at which guest time is simulated relative to host time. A tick represents one picosecond, so a host tick rate of 2x10^9 ticks/s corresponds to simulating 2 ms of guest time per host second. Since the simulated system contains components which utilize different clock frequencies, this is obviously an over-simplification. The simulator's performance is therefore a combination of the two. During simulations of the reference bus, the simulator exhibited an average of roughly 1.2 MIPS. canneal was the main culprit, demonstrating that intensive inter-core communication slows down the simulator significantly; this observation is most visible in Figure 6.8(a) and by examining canneal's communication pattern in Figure 6.3.

DRAM performance

This section deals with the observed usage behavior of gem5's main memory model, which represents both a memory controller and the dynamic/external memory itself. These results provide insight regarding off-chip memory accesses, which typically impact both performance and power consumption severely.

[Plot panels, per benchmark and core/thread configuration: (a) total DRAM bandwidth; (b) DRAM read bandwidth; (c) aggregate data read from the DRAM; (d) normalized aggregate data read from the DRAM; (e) aggregate data written to the DRAM.]

Figure 6.5: Total DRAM bandwidth for PARSEC benchmarks

The observed total and read DRAM bandwidths, appearing in Figures 6.5(a) and 6.5(b) respectively, are very much benchmark-dependent. Canneal demonstrates a much higher bandwidth than all other benchmarks. All benchmarks demonstrate scalable bandwidth, dependent either on the number of threads (as in vips) or only on the number of CPUs (as in fluidanimate).

To demonstrate the claimed scalability and provide a clearer comparison, the normalized aggregate data read from the DRAM is depicted in Figure 6.5(d). In most benchmarks there is a distinct scaling: the read bandwidth scales up while the amount of data read scales down. We assume that the parallelizable workloads issue roughly the same number of read requests per time unit per CPU as a single-CPU system would, causing the increase in total bandwidth. The down-scaling of the aggregate data read is due to the added caches in systems with more CPUs. This behavior demonstrates that the gem5 memory system currently behaves as a distributed shared cache, and that snoop traffic overheads might not be realistic enough. This pitfall was dealt with in this work, yet the added snoop response latency is just one link in a chain of changes required to model such systems more realistically. On average each benchmark reads about 1.5 GB of data, albeit with different patterns and execution times.

Regarding written data, depicted in Figure 6.5(e): unlike the aggregate data read, the benchmarks vary significantly in the total amount of data written. At most 0.4 GB is written (in vips), and in some benchmarks almost no data is written. Only in vips does the amount of data written scale down as the number of threads increases. Again, these differences are reasonable, as the benchmarks differ significantly.

CPU and cache stats

The statistics presented in this chapter focus on Timing CPU and cache statistics currently available in gem5. The abundance of available statistics is impressive, yet here we focus on the statistics most relevant to this investigation, in order to identify system bottlenecks in the memory hierarchy. The importance of these statistics is the insight they provide about the characteristics of each benchmark, such as how memory-intensive it is.

[Plot panels, per benchmark and core/thread configuration: (a) CPU idle cycles percentage; (b) CPU memory references; (c) CPU memory references per second.]

Figure 6.6: Timing-model CPU statistics for PARSEC benchmarks

Figure 6.6(a) is provided under the assumption that a CPU in such benchmarks is idle mostly while waiting for a memory operation to finish. As our CPU model is not an O3 model, this should provide a clear indication of how memory-performance-limited the system is. There is a visible scaling of the idle-cycle percentage as more cores are added, yet not necessarily a linear increase.

Another means of understanding the impact of the memory system is to understand the qualities of the benchmark. By observing the rate of memory references in Figure 6.6(c) we can see how memory-intensive each benchmark is, and how its dependence on the memory system scales as more CPUs are added. There are several distinct patterns: for instance, for blackscholes we can observe a linear decrease as a function of the number of CPUs in the system. The swaptions benchmark, on the other hand, exhibits a fairly constant rate of memory references. In streamcluster, adding a second thread per core significantly increases the number of memory references.

To complete the picture of how a benchmark depends on the memory system, we must also observe the cache and interconnect performance.

[Plot panels, per benchmark and core/thread configuration: (a) L1 I-cache miss rate; (b) L1 D-cache miss rate; (c) L2 miss rate.]

Figure 6.7: L1 and L2 cache miss rates for PARSEC benchmarks

Caches are designed to decrease access times under two main assumptions: temporal and spatial locality. While high I-cache miss rates are related to code sparsity, D-cache miss rates indicate how scattered the accessed data is. As such, workloads can exhibit very different I-cache and D-cache miss rates, as code and data are orthogonal.

For instance, the x264 benchmark demonstrates a relatively high I-cache miss rate compared to other benchmarks (Figure 6.7(a)) and a low D-cache miss rate (Figure 6.7(b)). A higher miss rate moves the performance bottleneck lower in the memory system hierarchy, towards the L2 cache, the main interconnect, and the external memory.

L2 miss rates are provided in Figure 6.7(c) and demonstrate again how memory-intensive canneal is. Comparing Figures 6.7(c) and 6.6(c) enables us to learn not just how memory-intensive a workload is, but how sparse and inconsistent its memory accesses are. For instance, while x264 is among the most memory-intensive benchmarks in terms of memory reference rate, its miss rate is amongst the lowest in the benchmark set.

Neither L1 cache exhibits any exceptional patterns. However, the L2 miss rates of blackscholes, bodytrack and swaptions in Figure 6.7(c) demonstrate a 3x to 6x increase in four-CPU configurations compared to the other configurations. This can be attributed to the fact that four-CPU configurations have two L2 caches, so data that was once stored in a single shared L2 cache is now split between two caches. This may be a useful indicator for characterizing inter-cluster sharing patterns in a workload.

Obviously, the L2 miss rates correlate strongly with the DRAM bandwidth provided in Figure 6.5(a). All miss rates provided are the ratio between total misses and total accesses performed. For systems with more than a single cache at a given level, the value shown is that of the lowest-indexed cache (hence the left-most in a system block diagram such as Figure 6.11) and not an average. This decision is mostly for post-processing convenience, under the assumption that thread dispatching by the Linux SMP scheduler is uniform; a sketch of the corresponding stats lookup is given below.
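With the stats parsed as sketched in Section 6.1, picking the lowest-indexed cache of a level is a one-liner. The helper below is a sketch; the stat name pattern is an assumption based on the l20/l21 naming visible in Figure 6.11.

# Report the lowest-indexed cache of a level instead of an average
# (stat name pattern is an assumption, not gem5's guaranteed naming).
def level_miss_rate(stats, prefix='system.l2'):
    keys = sorted(k for k in stats
                  if k.startswith(prefix)
                  and k.endswith('overall_miss_rate::total'))
    return stats[keys[0]] if keys else None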

BUS stats

One of the contributions of this work is the improved bus observability due to the stats added to the bus models, as described in Section 4.5. The stats used in this section are a subset of these added stats.

[Plot panels, per benchmark and core/thread configuration: (a) bus throughput; (b) bus request layer utilization; (c) bus response layer utilization; (d) bus snoop response average time in the retry list; (e) aggregate snoop data through the bus; (f) aggregate non-snoop data through the bus.]

Figure 6.8: Bus stats for PARSEC benchmarks

The bus throughput provided in Figure 6.8(a) represents the total data and snoop traffic in bytes per second. Here too, canneal stands out, demonstrating high bus bandwidth. The benchmarks can be categorized into three general groups:


• Fixed bandwidth regardless of the test system (number of cores/threads), as exhibited by streamcluster.

• Bandwidth that scales up with the number of cores, as in fluidanimate and bodytrack.

• Bandwidth that scales up with the number of threads, as in vips and x264.

All bandwidth measurements correlate strongly with the DRAM bandwidth from Figure 6.5(a), and hence reflect little snoop data through the bus (which is validated in Figure 6.8(e)).

The bus request layer utilization statistics provided in Figure 6.8(b) strongly correlate with the total bandwidth. The bus snoop response layer demonstrated close-to-zero utilization throughout all the benchmarks, which is a good indication of very low cross-cluster sharing. The aggregate snoop data through the bus and the average time a snoop response was delayed due to a busy response layer are depicted in Figures 6.8(e) and 6.8(d).

PARSEC benchmarks execution time for various snoop penalties

In Section 6.3, a set of small-scale experiments demonstrated the impact of delaying snoop responses on a (toy) system's performance. One of the major reasons for conducting the PARSEC set of experiments was to test larger systems, using larger multi-threaded workloads, for their sensitivity to snoop performance. For each benchmark and system, we invoked simulations with snoop-response latencies ranging from 10 ns to 90 ns. While in the small-scale experiments the effects of our modifications were visible and significant, the PARSEC simulations demonstrated little to no impact.

Execution times per PARSEC benchmark are depicted in Figure 6.9. These results reinforce our findings based on the aggregate snoop data through the bus from Figure 6.8(e): the volume of snoop data passing through the bus is minor in all the tests performed. Minor differences can be seen when normalizing the execution times, as depicted in Figure 6.10. Some of the benchmarks, such as blackscholes' simulations on two-CPU systems, demonstrate the expected sensitivity to snoop response latency; however, most of the benchmarks do not exhibit such sensitivity. Note that results for most of the tests which make use of a four-CPU system (either with four or eight threads) are missing; this is further discussed in Section 6.4.3. As the same patterns of insensitivity to snoop response latency were consistently observed for the other statistics, no further figures are provided, as they would not add new insights.

6.4.3 Why the missing results prove our hypothesis

Throughout Section 6.4.2, the vast majority of four-core simulation results are missing. The absolute majority of the missing results are those which utilize the DetailedCoherentBus, which makes use of queued ports for delaying responses, and specifically for delaying snoop responses.

The missing results are tests which terminated prematurely due to gem5 causing a segmentation fault, a bug indicator which is the bread and butter of every programmer. As gem5 is a bleeding-edge simulation infrastructure, such bugs are not necessarily due to a recent change.

In most cases where the DetailedCoherentBus caused simulations to end prematurely, the reference bus, the CoherentBus, did not.

In the search for the root cause of this bug, it was found to lie not in the DetailedCoherentBus, but rather in gem5's cache model. Simply put, the new bus pushed the simulated system into a dark corner which was not supported by the simulator.

This finding is very useful:

• It demonstrates that a system is sensitive to the new method of delaying snoop responses.

• This bug was only discovered in systems with four CPUs. Of all the systems simulated, only the four-CPU systems are composed of two independent clusters connected to the main bus, each cluster having its own level-2 cache. Hence in such systems two level-2 caches can communicate via the bus, while in one- and two-CPU systems only a single level-2 cache exists.

In order to better understand the root cause, we re-simulated a failing setup, a four-core four-threaded bodytrack test, using a 90 ns snoop response latency. The system is depicted in Figure 6.11.

As stated earlier, a reasonable guess was that the problem was related to a level-2 cache, as one of the main differences between the tests that passed and those that did not was the two-cluster formation. Investigation led to the root cause: a full MSHR list in one of the level-2 caches.


[Plot panels, execution time versus system configuration (1-4 cores, 1-8 threads) for the CoherentBus reference and snoop latencies of 10 ns to 90 ns: (a) Blackscholes; (b) Bodytrack; (c) Canneal; (d) Fluidanimate; (e) Streamcluster; (f) Swaptions; (g) vips; (h) x264.]

Figure 6.9: PARSEC execution (guest) times vs. snoop response penalty per configuration for PARSEC benchmarks


[Plot panels, normalized execution time versus system configuration for the same buses and snoop latencies: (a) Blackscholes; (b) Bodytrack; (c) Canneal; (d) Fluidanimate; (e) Streamcluster; (f) Swaptions; (g) vips; (h) x264.]

Figure 6.10: PARSEC normalized execution (guest) times vs. snoop response penalty per configuration for PARSECbenchmarks

[System diagram: a four-core LinuxArmSystem with two clusters; each cluster's two AtomicSimpleCPU cores (with private I/D caches and TLB walker caches) connect through a per-cluster DetailedCoherentBus (coretol2buses0/1) to a per-cluster L2 BaseCache (l20, l21); both L2s, the iocache, the bridge to the iobus, and physmem attach to the main DetailedCoherentBus (membus) alongside the RealView peripherals; the failing L2 (l21) is highlighted.]

Figure 6.11: bodytrack-failing system


Hence the cache became overly occupied while requests kept being issued to the memory system, until the simulation broke due to a software bug in the simulator.

Figure 6.12 compares the occupancy levels of all caches in the failing configuration and in an identical system using a 10 ns snoop response latency.

[Plot panels, average cache occupancy percentage over time for all caches in the system: (a) snoop latency 90 ns; (b) snoop latency 10 ns.]

Figure 6.12: PARSEC bodytrack 4 cores 4t simsmall cache occupancy in percent

The main difference between the two systems is the occupancy level of l21, i.e., level-2 cache number 1. This cache is also highlighted in Figure 6.11. In Figure 6.12(a), the occupancy level of level-2 cache number 1 rises until the cache is full. This eventually triggers the software bug, in which a response to a read request cannot be stored in the cache. Typically the data is passed on upstream without a cache line being allocated, yet in this case the bodytrack test hit a corner case that was not supported; a sketch of the intended fill path is given below.
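The corner case can be summarized by the following illustrative pseudocode of the intended fill path (not gem5's actual cache code):

# Sketch of the fill path a cache should take for an incoming response.
def handle_fill_response(cache, resp):
    if cache.can_allocate(resp.addr):
        cache.allocate_line(resp)        # normal case: fill the line
    else:
        # Cache full/blocked: pass the data upstream without allocating.
        # The combination hit by bodytrack fell outside the cases the
        # thesis-era cache model handled, causing the segmentation fault.
        cache.forward_upstream(resp)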

This analysis provided micro-level insight into the impact different snoop response latencies can have, and hence proved our hypothesis correct. Inevitably, a blocked level-2 cache can severely degrade execution times and overall system performance.

6.4.4 BBench

In order to perform full-system performance analysis, it is crucial to use real-world workloads that stress the entire system rather than mostly the CPUs. In the case of smartphones, web browsers are the most representative application. Browser Bench (BBench [25]) is an interactive smartphone benchmark suite that includes a web-browser page-rendering benchmark.

BBench is automated, repeatable, and fully self-contained (to avoid external dependencies such as network performance). It exercises not only the web browser, but also the underlying libraries and operating system. BBench is multi-threaded, in contrast to the single-threaded SPEC benchmark suite, and has a much higher misprediction rate due to its code size and sparseness. BBench also makes use of more shared libraries (due to high-level software abstractions) and system calls. Its core is the rendering of eleven of the most popular and complex sites on the web: Amazon, BBC, CNN, Craigslist, eBay, ESPN, Google, MSN, Slashdot, Twitter, and YouTube, making use of dynamic content, video, images, Flash, CSS, and HTML5. BBench's main metric is the load time of each website.

BBench Simulation

Running a BBench simulation, from system startup until BBench results are available, takes roughly 11.5 hours on a 2.7 GHz Intel Xeon server. This wall-clock time is for the fastest and least accurate CPU model (AtomicSimple); for more accurate models, simulation may take considerably longer.

The simulated guest time (i.e. the time it would take to run BBench on a real system matching gem5's ARM full-system model) is about 2 minutes, once booting has finished. During a full-system gem5 simulation, the display is dumped to a bitmap framebuffer. Snapshots from a BBench run are provided in Figure 6.13. The BBench results page is shown in Figure 6.13(d). The 'no network connection' dialog box is a known persistent error which


does not affect the benchmark result, as no network connection is required. All sources and online documentation are available at [2].

(a) Home screen - BBench on GB (b) BBC website - BBench on GB

(c) Home screen - BBench on ICS (d) Results page - BBench on GB

Figure 6.13: Snapshots taken during BBench simulations on Gingerbread and Ice Cream Sandwich Android OSs

6.4.5 Insight from Bus Stats

BBench was meant to be a leading workload during this research. Only when we discovered that it is not feasible to run BBench on a simulated platform containing four A15 O3 CPUs did BBench's importance start to fade away, clearing the stage for other, more appropriate alternatives such as PARSEC. PARSEC's scalable datasets were a key enabler for running multi-core simulations.

The most important metric provided by BBench is the benchmark's execution time. Yet running BBench to completion is currently beyond what gem5 can cope with in reasonable time. However, in order to demonstrate the usefulness of the added bus statistics, BBench was run using a Timing CPU. Figure 6.14 demonstrates one of the new bus capabilities: distributions of the transactions that have passed through the bus, both in aggregate and per master or slave port. Figure 6.14 visualizes the transaction distribution for the aggregate traffic through the bus during a BBench simulation; boot time has been omitted. Instead of plotting a set of curves, each of which represents the number of transactions of one type, we stack these plots as layers. The layers' stacking order is the aggregate number (volume) of transactions per type, and the legend follows the same order.
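To make the stacking procedure concrete, below is a minimal sketch of how such a layered distribution plot can be produced with Matplotlib, assuming the per-type transaction counts have already been extracted from the periodic bus statistics dumps (the data values and names here are purely illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input: transaction counts per sample second, per type,
# as would be extracted from the periodic bus statistics dumps.
time = np.linspace(20, 60, 200)                      # sample times [s]
counts = {
    'ReadResp':  2.5e7 + 1e6 * np.sin(time),
    'ReadReq':   2.4e7 + 1e6 * np.cos(time),
    'WriteReq':  0.8e7 * np.ones_like(time),
    'Writeback': 0.3e7 * np.ones_like(time),
}

# Order the layers by aggregate volume, largest first, so that the biggest
# contributor forms the bottom layer; the legend follows the same order.
order = sorted(counts, key=lambda t: counts[t].sum(), reverse=True)

# Stack by accumulating a running baseline and filling between baselines.
fig, ax = plt.subplots()
baseline = np.zeros_like(time)
for txn_type in order:
    top = baseline + counts[txn_type]
    ax.fill_between(time, baseline, top, label=txn_type)
    baseline = top

ax.set_xlabel('time [sec]')
ax.set_ylabel('transactions per sample second')
ax.legend(loc='upper right')
plt.show()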

6.5 Results analysis and conclusions

• The set of experiments discussed in this chapter provided insights, from the micro level to the macro level, regarding the possible impact of revised traffic scheduling, most notably of snoop responses.

• In addition, it demonstrated the need for increased observability at key points in a system, such as its main interconnect. Such observability enables easy detection of bottlenecks, which is crucial when designing systems and workloads.


[Plot: transactions per sample second (×10^7) versus time (20–60 s), stacked by transaction type; ordered by volume, the types include ReadResp, ReadReq, WriteResp, WriteReq, Writeback, ReadExResp, ReadExReq, and UpgradeResp, among others]

Figure 6.14: BBench transaction distribution demonstration

• The results demonstrate that a significant impact from delaying snoop responses in a gem5-like system is to be expected only when there are high levels of inter-cluster sharing. Currently, even in PARSEC, which is a typical benchmark for such relatively intensive multi-threaded workloads, the impact is not critical and in many cases hardly noticeable. According to [15], 'PARSEC benchmarks have significantly less communicating writes (4.2% on average) than Splash-2 applications (20.8% on average)'.

• One should take into consideration that the results shown cover the less interesting systems, namely those containing only one cluster. The experiments with multi-clustered systems should be rerun once the open bug in gem5's caches is fixed, in order to evaluate the impact in such systems.

• The bus's snoop-layer statistics, notably the utilization and aggregate data statistics, can determine how much inter-cluster sharing actually exists, as they represent snoop hits between the two remote clusters.

• The experiments should be run with more detailed CPU models and larger datasets to evaluate whether the results presented here can be extrapolated.

• In most cases additional threads had either no impact or a positive impact. However, in the streamcluster tests adding more threads significantly increased execution time; additional threads thus inevitably come with a price tag.

• The modeling of snoop requests in gem5 must be improved before realistic, meaningful conclusions can be drawn. In particular, the real cost of task migration currently cannot be realistically evaluated with gem5. This conclusion is based on an understanding of the current gem5 memory model and its optimizations: the current memory system is quite optimistic and behaves as a distributed shared cache with zero-time snoop-miss penalties.

• The expected impact of system coherency on workloads that are not inherently multi-threaded or cooperating (e.g. BBench, browsing, and typical applications) is currently minor. This is based on the low snoop traffic observed on the bus.

6.6 Contribution

• The experiments provided important insight regarding a multi-processor SoC-like model running multi-threaded workloads, whether large or small. They shed light on the PARSEC benchmark tests, their sharing patterns, and their memory requirements. This was a direct result of improving the system's observability by introducing statistics to the bus models.

• Specifically, the impact of snoop response latencies was investigated and shown to be meaningful in situations where sharing patterns are intensive.


• The entire flow, from invoking simulations to post-processing statistics directly into figures, has been automated, allowing any gem5 user to easily re-run diverse sets of PARSEC benchmarks on a multi-clustered SoC model. A sketch of this flow is given below.
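Conceptually, the automated flow is equivalent to the following sketch; the binary, script, and flag names here are placeholders for illustration, not the exact ones used:

import subprocess

BENCHMARKS = ['blackscholes', 'bodytrack', 'streamcluster']
THREADS = [1, 2, 4]

for bench in BENCHMARKS:
    for nthreads in THREADS:
        outdir = 'results/%s_%dt' % (bench, nthreads)
        # Run a full-system simulation of the multi-clustered SoC model
        # (binary and configuration script names are illustrative).
        subprocess.check_call(
            ['./build/ARM/gem5.fast', '-d', outdir,
             'configs/example/fs.py',
             '--benchmark=%s' % bench,
             '--num-threads=%d' % nthreads])
        # Post-process the raw statistics dump directly into figures
        # (plot_stats.py is a placeholder for the post-processing script).
        subprocess.check_call(
            ['python', 'util/plot_stats.py', '%s/stats.txt' % outdir])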


Chapter 7

Conclusions and Future Work

• The proposal for this research project started as a one-line bullet: 'Cache coherent interconnect and model fidelity: the aim of this project is to capture the behavior of the ARM AMBA4-family Cache Coherent Interconnect in a transaction-level model and develop a strategy (stimuli, metrics, evaluation method, etc.) to reason bottom-up about the model fidelity.' Only after months of study and hard work can one grasp the immense magnitude this short statement holds.

• This project required an intensive knowledge ramp-up in a long list of cutting-edge domains: AMBA, system-level coherence, ACE, CCI, transaction-level modeling, target platforms, gem5 at the user level, workloads, designing systems with gem5, and the gem5 memory model (bus, caches, and their interaction). Studying all these topics and unifying them was a challenging task on its own.

• gem5 on its own provided a sea of challenges: being bleeding-edge comes at the high price of breaking software, missing deliverables, a constantly changing patch queue, a severe need for documentation, and performance issues.

• On the same note, being a constantly changing platform made it possible to contribute and help shape gem5 into a better simulation framework. This included both directly related work as part of this project, as well as tangential contributions, which are discussed in Appendix A.

• The accumulated knowledge was a key factor in being able to bridge between ACE and gem5's memory system. Seeing both sides, each implementing system-level coherence for a different purpose, made it possible to see which pieces in both of these puzzles are missing. Some of these missing pieces are provided next.

7.1 Technical conclusions

7.1.1 Interconnect observability

• The introduced bus stats have proven to be critical in evaluating on-chip traffic characteristics. This is an enabler for easily finding system bottlenecks.

• Nevertheless, observability comes at a performance price. This tradeoff is a classic modeling challenge.

7.1.2 Modeling SoC platforms

• gem5 provides a convenient framework for exploring SoC architectures and modeling existing platforms.

• Such a platform was composed in order to correlate gem5 simulations with real hardware.

• Simulator performance currently significantly limits the possible workloads when using gem5 to model a multi-processor SoC with a detailed CPU model.

• Nevertheless, smart selection of smaller workloads which exhibit the same patterns, e.g. using SimPoints [41],can still yield very meaningful results.


7.1.3 The impact of delaying snoop responses

• We have demonstrated the impact of penalizing snoop responses with varying latencies on a system's performance for various workloads. We concluded that the impact heavily depends on the amount of inter-cluster sharing, and thus on the snoop traffic that passes through the main interconnect, as is the case in I/O-coherent systems. A configuration sketch for such experiments follows this list.

• gem5 currently exhibits the behavior of a distributed shared cache due to the modeling-dependent low overheads of system coherency.

• In order to properly analyze sharing patterns, cache snoop statistics are essential and should be added. Currently gem5 provides only statistics for traffic from the master side (e.g. hit/miss rates, but not snoop hit/miss rates).

• It is essential to deal with gem5’s MemInhibit-related challenges, to enable realistic modeling of snoop costs,speculative fetching, and resource contention.
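As a concrete illustration of the first bullet above, the following minimal sketch shows how the snoop response latency of the main bus could be swept across experiments. The parameter name and units are illustrative of the knob introduced on this work's DetailedCoherentBus, not a verbatim gem5 API:

# Sketch: sweep the snoop-response latency knob of the main coherent bus
# (parameter name and units are illustrative, not the exact gem5 API).
from m5.objects import DetailedCoherentBus

def make_main_bus(snoop_lat):
    # Only the snoop response latency is varied between experiments;
    # all other bus parameters keep their defaults.
    return DetailedCoherentBus(snoop_response_latency=snoop_lat)

# One system configuration per latency point, e.g. the 10 ns and 90 ns
# points of Figure 6.12 plus intermediate values.
buses = [make_main_bus(lat) for lat in ('10ns', '30ns', '60ns', '90ns')]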

7.1.4 Truly evaluating ACE-like transactions

As mentioned in Section 4.4, the current modeling peculiarities of the gem5 memory system prevent us from evaluating the true impact of ACE transactions:

• ReadOnce is meant to eliminate the cost of downgrading and then upgrading a cache line as a result of a snoop hit. However, in gem5 snoop hits result in on-chip data transfers, so the effect is minor. For a more realistic comparison, a non-coherent reference system should be provided, or a system which does not have a snoop channel on its bus.

• ReadNoSnoop and WriteNoSnoop are meant to avoid issuing snoop requests in vain to masters which are known not to hold a cache line in their caches. However, snoop misses currently come at zero cost.

• WriteLineUnique is meant to eliminate the cost of peer caches being required to perform a write-back when a snoop-write hits a modified line. This is done by broadcasting a MakeInvalid invalidation message. However, in gem5, in the case of a snoop-write hit to a modified line, the modified line is passed on-chip to the issuing master. Thus, where the ACE architects intended to save the cost of a write-back, in gem5 the eliminated cost would be that of a snoop response containing the dirty line. Obviously, the cost of an on-chip transaction is small compared to that of an off-chip transaction.

However:

• Understanding how to model ACE transactions in gem5 is of great value. It enables followers (e.g. whoever would like to add snoop-filter support to gem5 to evaluate its impact) to follow a recipe which we have written for them.

• The infrastructure created during this project for developing and testing changes in the memory system, discussed in Section 5.1, is an enabler for further similar investigations. In addition, it serves as a functional-verification regression tool, being able to effectively stress the memory system.

• Understanding the modeling peculiarities, flaws, pitfalls, or inaccuracies means we understand where we are today, what a more realistic model would be, and what should be done to bridge the gap. Understanding why the situation is what it is today is crucial: sometimes the current model describes a specific implementation, and at other times it might abstract from details for simulation-performance reasons.

• Understanding a simulator's performance means we know our limits: what can be evaluated and what cannot, as has been demonstrated with several fixes resulting from this work. Also, awareness of a problem is a key factor in solving it.

7.2 Reflections, main hurdles and difficulties faced

During the work on this project, challenges and surprises were waiting around every corner. Obviously, if there were no challenges, there would be no need for research in the first place. This section aims to provide useful practice guidelines, a postmortem, or lessons learnt, that might improve future projects.


• From the moment this project was born till the day it was submitted, it constantly evolved and diverted. While this process of re-evaluation and adaptation is fundamental, it comes with a high price tag. Most problematically, this project relied on external deliverables such as a realistic GPU model, corporate cooperation enabling correlation at the micro-benchmark level, and scalable simulation performance. Each of these links was essential for completing the chain of goals set for this project. It would be best to be more pessimistic when a project is as time-limited as this one was, since each diversion derails work. Eventually, time should be spent on the most fruitful and promising tasks, de-risking should be done as soon as possible, and external dependencies should be avoided as much as possible.

• The background knowledge required for this project was by all means extensive. The time required for such a study should be well planned, well accounted for, and preferably spent mostly during the preparation stage.

• Being part of a bleeding-edge software project means frequently changing code, environment, performance, and consistency. This is especially crippling when developing code in core models rather than end nodes. Furthermore, as discussed in Section 6.4.3, since this simulation infrastructure is still young, unexpected bugs can hold back work at any moment. While finding root causes and solving bugs is the best solution (as was the case a couple of times, described in Appendix A), it comes at the expense of other scheduled tasks that will not be accomplished.

Obviously, none of the above relieves the horses carrying the research wagon down the road of any responsibility (or blame).

7.3 Future work

During this work I gained vast knowledge and insight regarding system-level coherency, interconnects, and simulation challenges. Many interesting related topics drew my attention and sparked ideas for potentially promising investigations. This section describes those topics which seem both feasible and worth investing time in.

The most promising

• Ownership-passing policies: gem5 currently utilizes a fixed cache coherence protocol, in which ownership is passed only when an M-state cache line is hit by a snoop-write. However, the ACE specification has a broader range of capabilities. For instance, a request can state whether ownership can be passed as part of the transaction. One motivation for passing ownership is the ability to avoid performing write-backs: assuming the cache line will be required by some consumer soon, it is better to keep the line dirty on-chip for longer. On the other hand, some clients (e.g. I/O-coherent devices with write-through caches) might not be able to receive dirty data. In such a case it can either be the responsibility of the interconnect to perform the write-back, or the dirty response might also be sent back to the responding master. Eliminating write-backs can have a significant power, bandwidth, and performance impact. The underlying assumption, that the line will be needed again soon, is one of the fundamental principles of caches: temporal locality. Passing ownership to the requesting master as a hot potato can be seen as updating an LRU indicator for a line in a distributed shared cache. A toy sketch of such a policy decision is given after this list.

• gem5's memory system lacks the notion of QoS, which is crucial for multiplexing high-bandwidth consumers such as a GPU with latency-sensitive consumers such as CPUs or video controllers. This hot topic can provide plenty of insight into how on-chip traffic should be managed.

• While CCI-400 is a crossbar-based interconnect aimed at current mobile compute subsystems, future interconnects might adopt scalable topologies, such as network-based ones. gem5 provides a convenient ground for exploring such interconnects.

• Evaluating the true cost of migrating tasks is essential for a big.LITTLE architecture. For gem5 to accurately model these costs, several inaccuracies that have been discussed in Section 4.4 have to be resolved. This will also enable investigating the true price of speculative fetching, which can hide the latency of a snoop miss at the expense of potentially performing a redundant off-chip access. Its usefulness depends on the workload's sharing behavior. In order to investigate the effectiveness of such a feature, gem5 has to be modified, as currently an unrealistically perfect speculation is implemented.
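Returning to the first bullet of this list, the following is a toy sketch of an ownership-passing policy decision, assuming simplified MOESI-like line states; it illustrates the policy space discussed above, not gem5's actual implementation:

def pass_ownership(line_state, requester_accepts_dirty):
    """Toy policy: decide whether the snooped cache hands ownership of a
    dirty line to the requester instead of writing it back to memory."""
    dirty = line_state in ('M', 'O')    # Modified or Owned implies dirty data
    if not dirty:
        return False                    # nothing to gain: the line is clean
    if requester_accepts_dirty:
        # Keep the dirty line on-chip ("hot potato"): the requester becomes
        # the new owner and the write-back is avoided.
        return True
    # E.g. an I/O-coherent master with a write-through cache cannot hold
    # dirty data; the interconnect (or the responder) must handle the
    # write-back instead.
    return False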


Adding the missing pieces

• A realistic GPU model will enable investigating the sharing patterns between CPUs and a GPU. This insight would be very important for system architects.

• The target platform was described in Section 3.6. A simplified platform was eventually used due to a missing GPU model and performance issues. However, once these are dealt with, implementing the more complex system depicted in Figure 3.15 will provide a more realistic model of a hand-held device, which could be better correlated with a fabricated SoC.

• CCI-400 utilizes several interesting features which were conceived by experienced architects with limited exploration tools. gem5 provides a convenient ground for evaluating such features, amongst them:

– Address striping over several interconnect interfaces for load-balancing of external memory transactions

– Per-interface point of serialization (PoS): each shared address in a coherent system must have a PoS for all masters that share that address. In essence, the purpose of a PoS is to prevent same-address read or write transactions from being serviced out of order, and to enforce barriers (a toy model is sketched after this list).

– Limiting the number of outstanding transactions in the interconnect.

– Barrier support, which is currently implemented in the CPU models.
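As promised above, a toy model of a per-interface point of serialization is sketched below; it captures only the same-address ordering aspect and is in no way the CCI-400 implementation:

from collections import deque

class PointOfSerialization(object):
    """Toy PoS: same-address transactions are serviced strictly in arrival
    order, while transactions to independent addresses may proceed."""

    def __init__(self):
        self.pending = {}                  # cache-line address -> FIFO queue

    def issue(self, addr, txn):
        """Register a transaction; returns True if it may proceed now."""
        q = self.pending.setdefault(addr, deque())
        q.append(txn)
        return len(q) == 1                 # only the head of the queue runs

    def complete(self, addr):
        """Retire the head transaction; returns the next one to service."""
        q = self.pending[addr]
        q.popleft()
        if not q:
            del self.pending[addr]
            return None
        return q[0]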

Leverage what is already there

• Utilize ACE transactions: currently no piece of software makes use of ACE's system-coherency support. Now that gem5 supports ACE transactions, existing workloads can be modified to utilize ACE. In addition, data from the TLB can be used to determine the shareability domain of each request.

• As mentioned in Section 7.1.4, gem5's transaction bank can be further extended with additional ACE-like system-level coherency transactions.

• The bus model can be correlated with an existing hardware platform for improving its fidelity.


Appendix A

Contributions to gem5

As part of the work on this project, numerous contributions have been made. Some of these are directly related to or required for this research, and some were the result of a personal whim or an attempt to improve gem5.

The project-related contributions include:

• Introducing stats to the bus model.

• Creating the DetailedCoherentBus, which makes use of queued ports, and introducing the snoop-response latency knob.

• Adding support for a set of ACE-like transactions to the memory system.

• Creating a fixed-configuration system-composing script for modeling the target platform.

• Reviving, revising, and improving MemTest; adding support for scenario inputs and the ability to generate any hierarchy of test systems.

• Adding informative messaging throughout the memory system.

In addition, I had the privilege of contributing to the active gem5 community in several ways:

• DOT-based [23] automated system visualization. This added feature generates a block diagram of the simulated system upon invocation of the simulator. The diagram contains a connected directed graph in which each arrow points from a master port to a slave port. This output makes it trivial to comprehend a test system's hierarchy. Figure 5.2 is an example of such an automatically generated image. Each node represents an instance of a model, such as a CPU, a cache, or a bus; arrows connect nodes which represent master or slave ports.

• Introducing statistics post-processing scripts which make use of pickle (Python object serialization) to store statistics in a compact file format that enables convenient data retrieval. A minimal sketch follows this list.

• Demonstrating online statistics visualization in gem5 for live monitoring purposes, based on the interactive mode of Python's Matplotlib [27].

• Profiling gem5 and raising awareness of performance issues, as described in Section 3.6.2, demonstrating where time is spent during simulation.

• Wiki contributions related to the memory system and to the integrated use of Eclipse's CDT and PyDev for providing a complete integrated development and debug environment.

• Reporting bugs (e.g. in gem5's memory system and traffic generator), and providing fixes for some of them.

• Various utility scripts, e.g. for semantically comparing system description files (config.ini), and for verifying ACE-like transactions when a cache line is already present in the system in any of the MOESI states.
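As an example of the statistics post-processing mentioned above, a minimal sketch of parsing a gem5 stats.txt dump into a dictionary and serializing it with pickle could look as follows (scalar statistics only; the file paths are illustrative):

import pickle

def parse_stats(path):
    """Parse a gem5 stats.txt dump into {stat_name: value}; distributions
    and other non-scalar statistics are skipped in this sketch."""
    stats = {}
    with open(path) as f:
        for line in f:
            if line.startswith('-') or not line.strip():
                continue                   # skip separators and blank lines
            fields = line.split()
            try:
                stats[fields[0]] = float(fields[1])
            except (IndexError, ValueError):
                pass                       # ignore non-numeric entries
    return stats

# Store the parsed statistics compactly for convenient later retrieval.
stats = parse_stats('m5out/stats.txt')
with open('m5out/stats.pkl', 'wb') as f:
    pickle.dump(stats, f, protocol=pickle.HIGHEST_PROTOCOL)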


Bibliography

[1] SimpleScalar LLC. http://www.simplescalar.com/. [Online; accessed 23-January-2012].

[2] BBench. http://www.gem5.org/Bbench, 2011. [Online; accessed 12-December-2011].

[3] The gem5 simulator system. http://www.gem5.org, 2011. [Online; accessed 12-December-2011].

[4] N. Agarwal, L.S. Peh, and N.K. Jha. In-network coherence filtering: Snoopy coherence without broadcasts. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 232–243. ACM, 2009.

[5] K. Aisopos, C.C. Chou, and L.S. Peh. Extending Open Core Protocol to support system-level cache coherence. In Proceedings of the 6th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 167–172. ACM, 2008. [Online; accessed 18-July-2012].

[6] ARM. AMBA AXI and ACE Protocol Specification (free registration required). https://silver.arm.com/download/download.tm?pv=1198016, 2011. [Online; accessed 17-July-2012].

[7] ARM. big.LITTLE processing with ARM Cortex-A15 & Cortex-A7. http://renesasmobile.com/news-events/news/ARM-big.LITTLE-whitepaper.pdf, 2011. [Online; accessed 17-July-2012].

[8] ARM. CoreLink CCI-400 Cache Coherent Interconnect Technical Reference Manual. http://infocenter.arm.com/help/topic/com.arm.doc.ddi0470c/DDI0470C_cci400_r0p2_trm.pdf, 2011. [Online; accessed 12-December-2011].

[9] ARM. CoreLink GIC-400 Generic Interrupt Controller Technical Reference Manual. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0471a/index.html, 2011. [Online; accessed 1-August-2012].

[10] ARM. CoreLink MMU-400 System Memory Management Unit Technical Reference Manual. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0472a/index.html, 2011. [Online; accessed 1-August-2012].

[11] ARM. CoreLink System IP & Design Tools for AMBA. http://www.arm.com/products/system-ip/amba/index.php, 2011. [Online; accessed 12-December-2011].

[12] ARM. Introduction to AMBA 4 ACE. http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf, 2011. [Online; accessed 12-December-2011].

[13] ARM. CoreLink DMC-400 Dynamic Memory Controller Technical Reference Manual. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0466b/index.html, 2012. [Online; accessed 1-August-2012].

[14] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. Computer, 35(2):59–67, 2002.

[15] N. Barrow-Williams, C. Fensch, and S. Moore. A communication characterisation of Splash-2 and PARSEC. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 86–97. IEEE, 2009.

[16] T.B. Berg. Maintaining I/O data coherence in embedded multicore systems. Micro, IEEE, 29(3):10–19, 2009.

[17] C. Bienia, S. Kumar, J.P. Singh, and K. Li. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 72–81. ACM, 2008.


[18] C. Bienia and K. Li. Fidelity and scaling of the PARSEC benchmark inputs. In Workload Characterization (IISWC), 2010 IEEE International Symposium on, pages 1–10. IEEE, 2010.

[19] N. Binkert, B. Beckmann, G. Black, S.K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D.R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, 2011.

[20] E. Bolotin, Z. Guz, I. Cidon, R. Ginosar, and A. Kolodny. The power of priority: NoC based distributed cache coherency. In Networks-on-Chip, 2007. NOCS 2007. First International Symposium on, pages 117–126. IEEE, 2007.

[21] A. Butko, R. Garibotti, L. Ost, and G. Sassatelli. Accuracy evaluation of gem5 simulator system (to be published). In Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2012 7th International Workshop on. IEEE, 2012.

[22] A. Cimatti, E. Clarke, E. Giunchiglia, F. Giunchiglia, M. Pistore, M. Roveri, R. Sebastiani, and A. Tacchella. NuSMV 2: An opensource tool for symbolic model checking. In Computer Aided Verification, pages 241–268. Springer, 2002.

[23] J. Ellson, E. Gansner, L. Koutsofios, S. North, and G. Woodhull. Graphviz: open source graph drawing tools. In Graph Drawing, pages 594–597. Springer, 2002.

[24] M. Guher. Physical design of snoop-based cache coherence on multiprocessors.

[25] A. Gutierrez, R.G. Dreslinski, T.F. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver. Full-system analysis and characterization of interactive smartphone applications. 2011.

[26] W. Hlayhel, J. Collet, and L. Fesquet. Implementing snoop-coherence protocol for future SMP architectures. Euro-Par'99 Parallel Processing, pages 745–752, 1999.

[27] J.D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, pages 90–95, 2007.

[28] IEEE. IEEE draft standard for standard SystemC language reference manual. IEEE P1666/D3, May 2011, pages 1–674, 2011.

[29] A.B. Kahng, B. Li, L.S. Peh, and K. Samadi. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 423–428. European Design and Automation Association, 2009.

[30] P.D. Karandikar. Open Core Protocol (OCP): An introduction to interface specification. In 1st Workshop on SoC Architecture, Accelerators & Workloads, volume 10, 2010.

[31] T.J. Kjos, H. Nusbaum, M.K. Traynor, and B.A. Voge. Hardware cache coherent input/output. Hewlett-Packard Journal, 47:52–59, 1996.

[32] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, 2002.

[33] M.R. Marty. Cache coherence techniques for multicore processors. ProQuest, 2008.

[34] P.E. McKenney. Memory barriers: A hardware view for software hackers. Linux Technology Center, IBM Beaverton, 2010.

[35] D. Molka, D. Hackenberg, R. Schone, and M.S. Muller. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In Parallel Architectures and Compilation Techniques, 2009. PACT'09. 18th International Conference on, pages 261–270. IEEE, 2009.

[36] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. ACM Sigplan Notices, 42(6):89–100, 2007.

[37] B. O'Sullivan. Distributed revision control with Mercurial. Mercurial project, 2007.

[38] C. Seiculescu, S. Volos, N.K. Pour, B. Falsafi, and G. De Micheli. CCNoC: On-chip interconnects for cache-coherent manycore server chips. 2011.


[39] D.J. Sorin, M.D. Hill, and D.A. Wood. A primer on memory consistency and cache coherence. Synthesis Lectures on Computer Architecture, 6(3):1–212, 2011.

[40] Open Core Protocol Specification, Release 2.0. http://www.cvsi.fau.edu/download/attachments/852535/OpenCoreProtocolSpecification2.1.pdf?version=1, 2003. [Online; accessed 18-July-2012].

[41] V.M. Weaver and S.A. McKee. Using dynamic binary instrumentation to generate multi-platform SimPoints: Methodology and accuracy. In Proceedings of the 3rd International Conference on High Performance Embedded Architectures and Compilers, pages 305–319. Springer-Verlag, 2008.

[42] J. Weidendorfer. Sequential performance analysis with Callgrind and KCachegrind. Tools for High Performance Computing, pages 93–113, 2008.

[43] T.F. Wenisch, R.E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J.C. Hoe. SimFlex: Statistical sampling of computer system simulation. Micro, IEEE, 26(4):18–31, 2006.

[44] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In ACM SIGARCH Computer Architecture News, volume 23, pages 24–36. ACM, 1995.
