
  • The gem5 Simulator (ISCA 2011)

    Brad Beckmann1 Nathan Binkert2 Ali Saidi3 Joel Hestness4

    Gabe Black5 Korey Sewell6 Derek Hower7

    1 AMD Research  2 HP Labs  3 ARM, Inc.  4 University of Texas, Austin  5 Google, Inc.  6 University of Michigan, Ann Arbor

    7 University of Wisconsin, Madison

    June 5th, 2011

    1

  • Welcome!

    We're glad you're here! The gem5 simulator has been a multi-year effort, and a wide variety of institutions have participated.

    This tutorial is for you. Please ask questions! Don't save them for the break! We intend the focus to be audience-driven.

    2

  • Tutorial Goals and Timeline

    Tutorial goals: introduce you to the gem5 simulator and answer your development questions.

    Two halves:
    8:30-noon: overview of the simulator, features, components, and simple examples
    After lunch: birds-of-a-feather sessions and informal discussions of simulator internals

    3

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

    4


  • Introduction to gem5

    Introduction to gem5

    Brad Beckmann

    AMD Research

    5

  • Introduction to gem5

    What is gem5? The best parts of M5, the best parts of GEMS

    Overall goals, design principles, and capabilities

    6

  • What is gem5?

    The combination of M5 and GEMS into a new simulator

    Google Scholar statistics: M5 (IEEE Micro, CAECW): 440 citations; GEMS (CAN): 588 citations

    Best aspects of both glued together. M5: CPU models, ISAs, I/O devices, infrastructure. GEMS (essentially Ruby): cache coherence protocols, interconnect models.

    7

  • What else is new?

    Many other things have changed since previous tutorials, beyond GEMS+M5

    Some of the highlights: the world's most popular ISAs (ARM and x86), the in-order CPU model, new documentation

    Overall, gem5 offers a broad set of capabilities

    8

  • Android on ARM FS

    9


  • 64 Processor Linux on x86 FS

    10

  • What gem5 is Not

    A hardware design language. It operates at a higher level, for design-space exploration and simulation speed.

    A restrictive environment. It is just C++ and Python with an event queue and a bunch of APIs you can choose to ignore.

    Finished! There is always room for improvement...

    11

  • What We Would Like gem5 to Be

    Something that spares you the pain we've been through

    A community resource: modular enough to localize changes; contribute back, and spare others some pain

    A path to reproducible/comparable results: a common platform for evaluating ideas

    Let us know how we can help you contribute. The public wiki is up at http://www.gem5.org. Please submit patches and additional features; you can add modules with EXTRAS=. The more active the community is, the more successful gem5 will be!

    12

  • Two Views of gem5

    View #1 A framework for event-driven simulation

    Events, objects, statistics, configuration

    View #2 A collection of predefined object models

    CPUs, caches, busses, devices, etc.

    This tutorial focuses on #2. You may find #1 useful even if #2 is not; at least three other simulators have been created using #1.

    13

  • Main Goals

    Overall goal: an open-source community tool focused on architectural modeling

    Flexibility: multiple CPU models across the speed vs. accuracy spectrum; two execution modes (System-call Emulation & Full-system); two memory system models (Classic & Ruby). Once you learn it, you can apply it to a wide range of investigations.

    Availability: for both academic and corporate researchers; no dependence on proprietary code; BSD license.

    Collaboration: combined effort of many with different specialties; an active community leveraging collaborative technologies.

    14

  • Key Features

    Pervasive object-oriented design: provides modularity and flexibility; significantly leverages inheritance, e.g. SimObject

    Python integration: powerful front-end interface; provides initialization, configuration, & simulation control

    Domain-specific languages: ISA DSL defines ISA semantics; cache coherence DSL (a.k.a. SLICC) defines coherence logic

    Standard interfaces: Ports and MessageBuffers

    15

  • Capabilities

    Execution modes: System-call Emulation (SE) & Full-System (FS)

    ISAs: Alpha, ARM, MIPS, Power, SPARC, x86
    CPU models: AtomicSimple, TimingSimple, InOrder, and O3
    Cache coherence protocols: broadcast-based, directories, etc.
    Interconnection networks: Simple & Garnet (Princeton, MIT)
    Devices: NICs, IDE controller, etc.
    Multiple systems: communicate over TCP/IP

    16

  • Cross-Product Matrix

    Processor vs. memory system matrix: CPU models (Atomic Simple, Timing Simple, InOrder, O3) and system modes (SE, FS) are crossed against the memory systems (Classic; Ruby with the Simple or Garnet networks); the slide marks which combinations are supported.

    17

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

    18

  • Basics

    Basics

    Nate Binkert

    HP Labs

    19

  • Basics

    Compiling gem5; running gem5; a very brief overview of a few key concepts: objects, events, modes, ports, stats

    20

  • Building Executables

    Platforms: Linux, BSD, MacOS, Solaris, etc. Little-endian machines (some architectures support big endian); 64-bit machines help a lot.

    Tools:
    GCC/G++ 3.4.6+ (most frequently tested with 4.2-4.5)
    Python 2.4+
    SCons 0.98.1+ (we generally test versions 0.98.5 and 1.2.0) http://www.scons.org
    SWIG 1.3.31+ http://www.swig.org

    21

  • Compile Targets

    build/<config>/<binary>

    config: by convention, usually <ISA>_<MODE>, e.g. ALPHA_SE (Alpha syscall emulation) or ALPHA_FS (Alpha full system). Other ISAs: ARM, MIPS, POWER, SPARC, X86. Sometimes followed by a Ruby protocol: ALPHA_SE_MOESI_hammer. You can define your own configs.

    binary:
    gem5.debug: debug build, symbols, tracing, asserts
    gem5.opt: optimized build, symbols, tracing, asserts
    gem5.fast: optimized build, no debugging, no symbols, no tracing, no assertions
    gem5.prof: gem5.fast + profiling support

    22

  • Sample Compile

    blue% scons build/X86_FS/gem5.opt
    scons: Reading SConscript files ...
    Checking for leading underscore in global variables...no
    Checking for C header file Python.h... yes
    Checking for C library pthread... yes

    Reading /n/blue/z/binkert/work/m5/incoming/src/mem/ruby/SConsopts
    Reading /n/blue/z/binkert/work/m5/incoming/src/mem/protocol/SConsopts
    Reading /n/blue/z/binkert/work/m5/incoming/src/arch/arm/SConsopts

    Building in /n/blue/z/binkert/work/m5/incoming/build/X86_FS
    Variables file /n/blue/z/binkert/work/m5/incoming/build/variables/X86_FS not found,
      using defaults in /n/blue/z/binkert/work/m5/incoming/build_opts/X86_FS
    scons: done reading SConscript files.
    scons: Building targets ...
    [ CXX] X86_FS/sim/main.cc -> .o
    [ CXX] X86_FS/sim/async.cc -> .o
    [ CXX] X86_FS/sim/core.cc -> .o
    [ TRACING] -> X86_FS/debug/Event.hh
    Defining FAST_ALLOC_STATS as 0 in build/X86_FS/config/fast_alloc_stats.hh.
    Defining FORCE_FAST_ALLOC as 0 in build/X86_FS/config/force_fast_alloc.hh.
    Defining NO_FAST_ALLOC as 0 in build/X86_FS/config/no_fast_alloc.hh.
    [ CXX] X86_FS/sim/debug.cc -> .o
    [ TRACING] -> X86_FS/debug/Config.hh
    [ CXX] X86_FS/sim/eventq.cc -> .o
    [ CXX] X86_FS/sim/init.cc -> .o
    [ TRACING] -> X86_FS/debug/TimeSync.hh
    [SO PARAM] Root -> X86_FS/params/Root.hh
    [SO PARAM] SimObject -> X86_FS/params/SimObject.hh
    ...

    23

  • Running Simulations

    maize% ./build/ARM_FS/gem5.opt --help
    Usage
    =====
      gem5.opt [gem5 options] script.py [script options]

    gem5 is copyrighted software; use the --copyright option for details.

    Options
    =======
    --version                show program's version number and exit
    --help, -h               show this help message and exit
    --build-info, -B         Show build information
    --copyright, -C          Show full copyright information
    --readme, -R             Show the readme
    --outdir=DIR, -d DIR     Set the output directory to DIR [Default: /tmp/m5out]
    --redirect-stdout, -r    Redirect stdout (& stderr, without -e) to file
    --redirect-stderr, -e    Redirect stderr to file
    --stdout-file=FILE       Filename for -r redirection [Default: simout]
    --stderr-file=FILE       Filename for -e redirection [Default: simerr]
    --interactive, -i        Invoke the interactive interpreter after running the script
    --pdb                    Invoke the python debugger before running the script
    --path=PATH[:PATH], -p PATH[:PATH]
                             Prepend PATH to the system path when invoking the script
    --quiet, -q              Reduce verbosity
    --verbose, -v            Increase verbosity

    24

  • Running Simulations (cont)

    Statistics Options
    ------------------
    --stats-file=FILE        Sets the output file for statistics [Default: stats.txt]

    Configuration Options
    ---------------------
    --dump-config=FILE       Dump configuration output file [Default: config.ini]

    Debugging Options
    -----------------
    --debug-break=TIME[,TIME]
                             Cycle to create a breakpoint
    --debug-help             Print help on trace flags
    --debug-flags=FLAG[,FLAG]
                             Sets the flags for tracing (-FLAG disables a flag)
    --remote-gdb-port=REMOTE_GDB_PORT
                             Remote gdb base port (set to 0 to disable listening)

    Trace Options
    -------------
    --trace-start=TIME       Start tracing at TIME (must be in ticks)
    --trace-file=FILE        Sets the output file for tracing [Default: cout]
    --trace-ignore=EXPR      Ignore EXPR sim objects

    Help Options
    ------------
    --list-sim-objects       List all built-in SimObjects, their params and default values

    25

  • Sample Run

    maize% ./build/ARM_SE/gem5.opt configs/example/se.py
    gem5 Simulator System.  http://gem5.org
    gem5 is copyrighted software; use the --copyright option for details.

    gem5 compiled Jun 2 2011 17:39:30
    gem5 started Jun 3 2011 14:48:20
    gem5 executing on maize
    command line: ./build/ARM_SE/gem5.opt configs/example/se.py
    Global frequency set at 1000000000000 ticks per second
    0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000

    **** REAL SIMULATION ****
    info: Entering event queue @ 0. Starting simulation...
    Hello world!
    hack: be nice to actually delete the event here
    Exiting @ tick 3350000 because target called exit()

    26

  • Modes

    gem5 has two fundamental modes

    Full system (FS): for booting operating systems; models bare hardware, including devices; interrupts, exceptions, privileged instructions, fault handlers

    Syscall emulation (SE): for running individual applications, or a set of applications on MP/SMT; models the user-visible ISA plus common system calls; system calls are emulated, typically by calling the host OS; simplified address translation model, no scheduling

    Selected via a compile-time option; the vast majority of code is unchanged, though

    27

  • Objects

    Everything you care about is an object (C++/Python) Derived from SimObject base class

    Common code for creation, configuration parameters, naming, checkpointing, etc.

    Uniform method-based APIs for object types CPUs, caches, memory, etc. Plug-compatibility across implementations

    Functional vs. detailed CPU Conventional vs. indirect-index cache

    Easy replication: cores, multiple systems, . . .

    28

  • Events

    Standard event queue timing model: global logical time in ticks, with no fixed relation to real time (normally picoseconds in our examples)

    Objects schedule their own events, which gives flexibility for detail vs. performance trade-offs. E.g., a CPU typically schedules an event at regular intervals (every cycle or every n picoseconds) and won't schedule itself if stalled/idle (see the sketch below).

    29
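    A minimal sketch of the self-scheduling pattern described above. SimObject, EventWrapper, schedule(), curTick(), and startup() are real gem5 names, but MyModel, tick(), idle(), and the fixed period are illustrative only, and exact signatures differ between gem5 versions.

    // Hypothetical model that re-schedules its own tick event each cycle.
    class MyModel : public SimObject
    {
      private:
        Tick period;                                    // ticks between events
        EventWrapper<MyModel, &MyModel::tick> tickEvent;

      public:
        MyModel(const Params *p)
            : SimObject(p), period(1000), tickEvent(this)
        { }

        // startup() is called once before simulation begins; a common place
        // to schedule the first event.
        void startup() { schedule(tickEvent, curTick() + period); }

        void tick()
        {
            // ... one cycle of work ...
            if (!idle())                                // idle objects stop scheduling
                schedule(tickEvent, curTick() + period);
        }

        bool idle() const { return false; }             // placeholder for a real stall check
    };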

  • Ports src/mem/port.{hh,cc}

    Method for connecting MemObjects together. Each MemObject subclass has its own Port subclass(es), specialized to forward packets to the appropriate methods of the MemObject subclass.

    Each pair of MemObjects is connected via a pair of Ports (peers). Function pairs pass packets across ports: sendTiming() on one port calls recvTiming() on its peer.

    Result: class-specific handling with arbitrary connections and only a single virtual function call (see the sketch below).

    30
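    A simplified, illustrative sketch of the peer-port pattern described above: a send on one side invokes the corresponding recv on the other. The send/recv function names match the slides; everything else (SimplePort and the stand-in typedefs) is invented for illustration, and the real Port class in src/mem/port.hh carries much more machinery.

    class Packet;                       // stand-in for gem5's Packet
    typedef Packet *PacketPtr;
    typedef long long Tick;             // stand-in for gem5's Tick

    class SimplePort
    {
      private:
        SimplePort *peer;               // the other end of the connection

      public:
        SimplePort() : peer(0) {}
        virtual ~SimplePort() {}

        void setPeer(SimplePort *p) { peer = p; }

        // Called by the owning MemObject; each simply forwards to the peer.
        bool sendTiming(PacketPtr pkt)     { return peer->recvTiming(pkt); }
        Tick sendAtomic(PacketPtr pkt)     { return peer->recvAtomic(pkt); }
        void sendFunctional(PacketPtr pkt) { peer->recvFunctional(pkt); }

        // Overridden by each MemObject's port subclass to hand the packet
        // to its owner (e.g. a cache's cpu-side port calls into the cache).
        virtual bool recvTiming(PacketPtr pkt) = 0;
        virtual Tick recvAtomic(PacketPtr pkt) = 0;
        virtual void recvFunctional(PacketPtr pkt) = 0;
    };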

  • Access Modes

    Three access modes: Functional, Atomic, Timing. Selected by choosing a function on the initiating Port: sendFunctional(), sendAtomic(), sendTiming().

    Functional mode: just make it happen. Used for loading binaries, debugging, etc. Accesses happen instantaneously, updating data everywhere in the hierarchy; if devices contain queues of packets, those must be scanned and updated as well.

    31

  • Access Modes (contd)

    Atomic mode: requests complete before sendAtomic() returns. Models state changes (cache fills, coherence, etc.) and returns an approximate latency without contention or queuing delay. Used for fast simulation, fast-forwarding, or warming caches.

    Timing mode: models all timing/queuing in the memory system. Split transaction: sendTiming() just initiates the send of a request to the target; the target later calls sendTiming() to send the response packet.

    Atomic and Timing accesses cannot coexist in a system (see the sketch below).

    32
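    Continuing the simplified port sketch from the Ports slide, this shows how the initiator's choice of call selects the access mode; the issueRequest() helper and the mode enum are illustrative only (in real gem5 the mode is largely determined by the CPU model and configuration).

    enum AccessMode { Functional, Atomic, Timing };

    void issueRequest(SimplePort &port, PacketPtr pkt, AccessMode mode)
    {
        switch (mode) {
          case Functional:
            port.sendFunctional(pkt);   // completes immediately, no timing
            break;
          case Atomic:
            port.sendAtomic(pkt);       // returns an approximate latency
            break;
          case Timing:
            port.sendTiming(pkt);       // response arrives later via recvTiming()
            break;
        }
    }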

  • Statistics

    Scalar, Average, Vector, Formula, Histogram, Distribution, Vector Distribution

    33

  • Statistics Example hh file

    class MySimObject : public SimObject
    {
      private:
        Stats::Scalar txBytes;
        Stats::Formula txBandwidth;
        Stats::Vector syscall;

      public:
        void regStats();
    };

    34

  • Statistics Example cc file

    txBytes
        .name(name() + ".txBytes")
        .desc("Bytes Transmitted")
        .prereq(txBytes);

    txBandwidth
        .name(name() + ".txBandwidth")
        .desc("Transmit Bandwidth (bits/s)")
        .precision(0);

    txBandwidth = txBytes * Stats::constant(8) / simSeconds;

    syscall
        .init(SystemCalls::Number)
        .name(name() + ".syscall")
        .desc("number of syscalls executed")
        .flags(total | pdf | nozero | nonan);

    35
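    For context, the registration calls on this slide would typically live in the object's regStats() override, which gem5 invokes once at startup; a minimal hedged sketch (the base-class call and placement are conventional, not taken from the slides):

    void
    MySimObject::regStats()
    {
        SimObject::regStats();          // register any base-class stats first

        txBytes
            .name(name() + ".txBytes")
            .desc("Bytes Transmitted");

        // ... remaining .name()/.desc()/.flags() calls from this slide ...
    }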

  • Statistics Output

    client.tsunami.etherdev.txBandwidth        4302720
    client.tsunami.etherdev.txBytes              13446
    server.tsunami.etherdev.txBandwidth     4684921600
    server.tsunami.etherdev.txBytes           14640380
    sim_seconds                               0.025000
    server.cpu.kern.syscall                        492
    server.cpu.kern.syscall_1                      189     38.41%  38.41%
    server.cpu.kern.syscall_2                      249     50.61%  89.02%
    server.cpu.kern.syscall_3                       54     10.98% 100.00%

    36

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

    37

  • Debugging

    Debugging

    Ali Saidi

    ARM Research & Development

    38

  • Debugging Facilities

    Tracing: instruction tracing, diffing traces

    Using gdb to debug gem5: debugging C++ and gdb-callable functions; remote debugging

    Python Debugging

    Pipeline Viewer

    39

  • Tracing/Debugging src/base/trace.*

    printf() is a nice debugging tool

    Keep good printfs for tracing

    Lots of debug output is a very good thing

    Example flags: Fetch, Decode, Ethernet, Exec, TLB, DMA, Bus, Cache, Loader, O3CPUAll, etc.
    Print out all flags with the --debug-help option

    40

  • Enabling Tracing

    Selecting flags: --debug-flags=Cache,Bus --debug-flags=Exec,-ExecTicks

    Selecting destination: --trace-file=my_trace.out --trace-file=my_trace.out.gz

    Selecting start: --trace-start=23000000

    ./build/ARM_FS/gem5.opt --debug-flags=Cache,Bus --trace-start=2400 configs/example/fs.py

    41

  • Adding Debugging

    Print statements put in the source code. We encourage you to add them to your models, or to contribute ones you find particularly useful.

    Macros remove them from the gem5.fast and gem5.prof binaries, so you must be using gem5.debug or gem5.opt to get any output.

    Adding an extra tracing statement (see the sketch below):
    DPRINTF(Flag, "normal printf %s\n", arguments);

    Adding a new debug flag (in a SConscript):
    DebugFlag('MyNewFlag')

    42
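    A hedged sketch of what the DPRINTF above might look like inside a model's source file; MyCache and handleRequest() are hypothetical, and only DPRINTF itself, the SConscript DebugFlag() declaration, and the generated debug/<Flag>.hh header pattern come from gem5.

    #include "base/trace.hh"        // DPRINTF
    #include "debug/MyNewFlag.hh"   // header generated for DebugFlag('MyNewFlag')

    void
    MyCache::handleRequest(uint64_t addr, int size)
    {
        // Printed only in gem5.debug/gem5.opt builds, and only when the
        // flag is enabled via --debug-flags=MyNewFlag.
        DPRINTF(MyNewFlag, "Handling request for addr %#x, size %d\n",
                addr, size);

        // ... actual model behaviour ...
    }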

  • Instruction Tracing src/sim/insttracer.hh

    Separate from the general debug/trace facility, but both are enabled the same way.

    Per-instruction records are populated as the instruction executes: start with the PC and mnemonic; add argument and result values as they become known. Printed to the trace when the instruction completes. Flags control printing of the cycle, symbolic addresses, etc.

    4000: sys.cpu : @sym+776   : add  r3, r3, #8        : IntAlu  : D=0x00008358
    4500: sys.cpu : @sym+780   : sub  r3, r3, r7        : IntAlu  : D=0x40000000
    5000: sys.cpu : @sym+784   : add  r5, r5, r3        : IntAlu  : D=0x000173cc
    5500: sys.cpu : @sym+788   : add  r6, r6, r3        : IntAlu  : D=0x00017400
    6000: sys.cpu : @sym+792.0 : addi_uop r34, r5, #0   : IntAlu  : D=0x000173cc
    6500: sys.cpu : @sym+792.1 : ldr_uop  r3, [r34, #0] : MemRead : D=0x000f0000 A=0x173cc
    7000: sys.cpu : @sym+792.2 : ldr_uop  r4, [r34, #4] : MemRead : D=0x000f0000 A=0x173d0
    7500: sys.cpu : @sym+796   : and  r4, r4, r9        : IntAlu  : D=0x000f0000
    8000: sys.cpu : @sym+800   : teqs r3, r4            : IntAlu  : D=0x00000001

    43

  • Using GDB with gem5

    Several gem5 functions are designed to be called from GDB:
    schedBreakCycle() (also via --debug-break)
    setDebugFlag() / clearDebugFlag()
    dumpDebugStatus()
    eventqDump()
    SimObject::find()
    takeCheckpoint()

    44

  • Using GDB with gem5

    gdb --args ./build/ARM_FS/gem5.opt configs/example/fs.py
    GNU gdb Fedora (6.8-37.el5)
    ...
    (gdb) b main
    Breakpoint 1 at 0x4090b0: file build/ARM_FS/sim/main.cc, line 40.
    (gdb) run

    Breakpoint 1, main (argc=2, argv=0x7fffa59725f8) at build/ARM_FS/sim/main.cc:40
    40      main(int argc, char **argv)

    (gdb) call schedBreakCycle(1000000)
    (gdb) continue
    Continuing.

    gem5 Simulator System...
    0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000

    **** REAL SIMULATION ****
    info: Entering event queue @ 0. Starting simulation...

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x0000003ccb6306f7 in kill () from /lib64/libc.so.6

    (gdb) p _curTick
    $1 = 1000000

    45


  • Using GDB with gem5

    (gdb) call setDebugFlag("Exec")(gdb) call schedBreakCycle(1001000)(gdb) continueContinuing.

    1000000: system.cpu T0 : @_stext+148. 1 : addi_uop r0, r0, #4 : IntAlu : D=0x0000000000004c301000500: system.cpu T0 : @_stext+152 : teqs r0, r6 : IntAlu : D=0x0000000000000000

    Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6

    (gdb) print SimObject::find("system.cpu")$2 = (SimObject *) 0x19cba130(gdb) print (BaseCPU*)SimObject::find("system.cpu")$3 = (BaseCPU *) 0x19cba130(gdb) p $3->instCnt$4 = 431

    (gdb) call clearDebugFlag("Exec")(gdb) call takeCheckpoint(0)(gdb) call schedBreakCycle(1001500)(gdb) continueContinuing.Writing checkpointinfo: Entering event queue @ 1001001. Starting simulation...

    Program received signal SIGTRAP, Trace/breakpoint trap.0x0000003ccb6306f7 in kill () from /lib64/libc.so.6(gdb)

    46


  • Diffing Traces util/{rundiff,tracediff}

    Often useful to compare traces from two simulations, to find where known-good and modified simulators diverge.

    Standard diff works only on files (not pipes)... but you really don't want to run to completion.

    util/rundiff: Perl script for diffing two pipes on the fly.

    util/tracediff: handy wrapper for using rundiff to compare gem5 outputs. tracediff "a/gem5.opt|b/gem5.opt" --debug-flags=Exec compares instruction traces from two builds of gem5. See the comments for details.

    47

  • Advanced Trace Diffing

    Sometimes when you run into a nasty bug it's hard to compare apples-to-apples traces: different cycle counts, different code paths from interrupts/timers.

    Some mechanisms that can help:
    -ExecTicks: don't print out ticks
    -ExecKernel: don't print out kernel code
    -ExecUser: don't print out user code
    ExecAsid: print out the ASID of the currently running process

    Statetrace: a PTRACE program that runs the binary on a real system and compares it cycle-by-cycle to gem5. Supports ARM, x86, SPARC. See the wiki for more information.

    48

  • Remote Debugging

    ./build/ARM_FS/gem5.opt configs/example/fs.py
    gem5 Simulator System
    ...
    command line: ./build/ARM_FS/gem5.opt configs/example/fs.py
    Global frequency set at 1000000000000 ticks per second
    info: kernel located at: /chips/pd/randd/dist/binaries/vmlinux.arm
    Listening for system connection on port 5900
    Listening for system connection on port 3456
    0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
    info: Entering event queue @ 0. Starting simulation...

    Remote gdb connection listening on port 7000

    49


  • Remote Debugging

    GNU gdb (Sourcery G++ Lite 2010.09-50) 7.2.50.20100908-cvs
    Copyright (C) 2010 Free Software Foundation, Inc.
    ...
    (gdb) symbol-file /dist/binaries/vmlinux.arm
    Reading symbols from /dist/binaries/vmlinux.arm...done.
    (gdb) set remote Z-packet on
    (gdb) set tdesc filename arm-with-neon.xml
    (gdb) target remote 127.0.0.1:7000
    Remote debugging using 127.0.0.1:7000
    cache_init_objs (cachep=0xc7c00240, flags=3351249472) at mm/slab.c:2658
    (gdb) step
    sighand_ctor (data=0xc7ead060) at kernel/fork.c:1467
    (gdb) info registers
    r0             0xc7ead060   -940912544
    r1             0x520        1312
    r2             0xc002f1e4   -1073548828
    r3             0xc7ead060   -940912544
    r4             0x0          0
    r5             0xc7ead020   -940912608
    r6             0x0          0
    r7             0xc7ead03c   -940912580
    r8             0xc7c034a0   -943704928
    r9             0x100100     1048832
    r10            0xc7c0cee0   -943665440
    r11            0x200200     2097664
    r12            0xc0000000   -1073741824
    sp             0xc7c29e28   0xc7c29e28
    lr             0xc008ed98   -1073156712
    pc             0xc002f1e4   0xc002f1e4
    cpsr           0x13         19

    50


  • Python Debugging

    It is possible to drop into the Python interpreter (-i flag). This currently happens after the script file is run; if you want to do this before objects are instantiated, remove them from the script.

    It is possible to drop into the Python debugger (--pdb flag). This occurs just before your script is invoked and lets you use the debugger to debug your script code.

    The code that enables this is in src/python/m5/main.py, at the bottom of the main function. You can copy the mechanism directly into your scripts if it is in the wrong place for your needs:

    import pdb
    pdb.set_trace()

    51

  • O3 Pipeline Viewer: use --debug-flags=O3PipeView and util/o3-pipeview.py

    52

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

    53

  • Checkpointing and Fastforwarding

    Checkpointing and Fastforwarding

    Joel Hestness

    University of Texas, Austin

    54

  • Checkpointing and Fastforwarding

    The idea is simple: take a snapshot of the relevant system state; restore it later and/or on different CPUs or configurations.

    Provides flexibility: test numerous different system configurations from the exact same point in the benchmark; avoid re-simulating up to that point; avoid the non-determinism inherent in different configurations.

    55

  • Checkpointing and Fastforwarding

    Outline: constraints, checkpointing demo, checkpointing internals, instrumenting a benchmark, fastforwarding internals, fastforwarding demo

    56

  • Checkpointing and Fastforwarding

    Constraints: the original simulation and the test simulations must have the same ISA, the same number of cores, and the same memory size.

    Usually run the original sim with atomic (functional) CPUs.

    57

  • Checkpointing DEMO!

    Starting simulation

    58

  • Checkpointing DEMO!

    Starting simulation

    59

  • Checkpointing DEMO!

    Simulated system is running

    60

  • Checkpointing DEMO!

    Another terminal to control simulated system

    61

  • Checkpointing DEMO!

    Attach to simulated system

    62

  • Checkpointing DEMO!

    Simulated system has booted to shell

    63

  • Checkpointing DEMO!

    Run a quick application

    64

  • Checkpointing DEMO!

    Run a quick application

    65

  • Checkpointing DEMO!

    Drop a checkpoint

    66

  • Checkpointing DEMO!

    Exit simulation

    67

  • Checkpointing DEMO!

    Exit simulation

    68

  • Checkpointing DEMO!

    Restore from checkpoint into different simulated system

    69

  • Checkpointing DEMO!

    Simulated system is running

    70

  • Checkpointing DEMO!

    Attach to simulated system

    71

  • Checkpointing DEMO!

    Run a quick application

    72

  • Checkpointing DEMO!

    Slower execution: detailed v. functional simulation

    73

  • Checkpointing DEMO!

    Exit simulation

    74

  • Checkpointing Output

    cpt.6967183789500/
    m5.cpt: state of system components
    system.disk?.image.cow: modified state of the disk(s)
    system.physmem.physmem: state of memory

    75

  • Specifying State to Checkpoint

    To checkpoint a piece of state, serialize it To restore that state, unserialize it

    void
    serialize(std::ostream &os)
    {
        SERIALIZE_ARRAY(interrupts, NumInterruptLevels);
        SERIALIZE_SCALAR(intstatus);
    }

    void
    unserialize(Checkpoint *cp, const std::string &section)
    {
        UNSERIALIZE_ARRAY(interrupts, NumInterruptLevels);
        UNSERIALIZE_SCALAR(intstatus);
    }

    76

  • Checkpointing functionality status

    Classic memory model: Does not save state of caches

    Ruby memory model: Can save state of caches

    77

  • Instrumenting a Benchmark

    Copy files from ./util/m5/ into the source tree: m5op.h, m5ops.h, and the appropriate assembly file m5op_<ISA>.S

    Include m5op.h in source code that should take a checkpoint

    #include "m5op.h"

    ...

    // Take checkpoint in code
    m5_checkpoint(0, 0);

    1st param: number of ticks in the future to schedule the checkpoint
    2nd param: number of ticks between checkpoints (periodic)

    Compile and link against assembly file

    78
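    A hedged sketch of an instrumented benchmark using the m5 ops described above; only m5op.h and m5_checkpoint() come from the slides, while the surrounding benchmark code and the build line are illustrative.

    // Illustrative benchmark instrumentation (not taken verbatim from the slides).
    // m5op.h comes from util/m5/; link against the matching m5op_<ISA>.S file.
    #include <stdio.h>
    #include "m5op.h"

    static void region_of_interest(void)
    {
        /* ... the part of the benchmark you actually want to study ... */
    }

    int main(void)
    {
        /* Initialization / warm-up runs first. */
        printf("setup done, taking checkpoint\n");

        /* Take a checkpoint now (0 ticks in the future), non-periodic (0). */
        m5_checkpoint(0, 0);

        region_of_interest();
        return 0;
    }

    It would then be compiled and linked against the assembly file, for example with something like gcc -I util/m5 bench.c util/m5/m5op_arm.S -o bench (the exact file name depends on the target ISA).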

  • Checkpointing Functionality in Progress

    Current limitation: cache warm-up
    1. Take periodic checkpoints throughout execution
    2. Inspect statistics for interesting sections (think SimPoints)
    3. Choose interesting sections
    4. Create memory access traces for cache warm-up
    5. Restore from checkpoint:
       1. Start the simulated system
       2. Warm up the caches from the trace
       3. Restore the rest of the state
       4. Begin execution

    79

  • Fastforwarding

    Setup: Specify sets of CPUs

    cpu_class = AtomicSimpleCPUswitch_cpu_class = DerivO3CPUtest_sys.cpu = [cpu_class(cpu_id=i) for i in xrange(np)]switch_cpus = [switch_cpu_class(defer_registration=True, cpu_id=(np+i))

    for i in xrange(np)]switch_cpu_list = [(testsys.cpu[i], switch_cpus[i]) for i in xrange(np)]

    80

  • Fastforwarding DEMO!

    Starting simulation

    81

  • Fastforwarding DEMO!

    Starting simulation

    82

  • Fastforwarding DEMO!

    Simulated system is running

    83

  • Fastforwarding DEMO!

    Another terminal to control simulated system

    84

  • Fastforwarding DEMO!

    Attach to simulated system

    85

  • Fastforwarding DEMO!

    Simulated system has booted to shell

    86

  • Fastforwarding DEMO!

    Run a quick application

    87

  • Fastforwarding DEMO!

    Run a quick application

    88

  • Fastforwarding DEMO!

    Switch from functional to detailed CPUs

    89

  • Fastforwarding DEMO!

    Switch from functional to detailed CPUs

    90

  • Fastforwarding DEMO!

    Run a quick application

    91

  • Fastforwarding DEMO!

    Slower execution: detailed v. functional simulation

    92

  • Fastforwarding DEMO!

    Exit simulation

    93

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

    94

  • Break

    Break

    95

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

    96

  • Multiple Architecture Support

    Multiple Architecture Support

    Gabe Black

    Google, Inc.

    97

  • Overview

    Tour of the ISAs; parts of an ISA; decoding and instructions

    98

  • ISA Support

    Full-System & Syscall Emulation: Alpha, ARM, SPARC, x86

    Syscall Emulation only: MIPS, POWER

    99

  • Alpha

    Alpha 21264, including the BWX, MVI, FIX, and CIX extensions; 21164 PAL code.

    Syscall Emulation: Linux or Tru64 binaries; Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models.

    Full system: Linux or FreeBSD; Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models; four cores in a normal Tsunami system. Also gem5 "big Tsunami" support for 64 cores; custom PAL code and kernel patches required.

    100

  • ARM

    ARMv7-A, Thumb, Thumb2, MP, VFPv3, NEON. Doesn't (yet) include TrustZone, ThumbEE, Virtualization, LPAE.

    Syscall Emulation: EABI Linux binaries (no OABI); Simple Atomic, Simple Timing, Out-of-Order CPU models.

    Full system: Linux or Android; Simple Atomic, Simple Timing, Out-of-Order CPU models; four cores in a normal ARM RealView system; no kernel patches required. Also supports a frame buffer and control via VNC; can run X11, Android, web browsers, etc.

    101


  • MIPS

    32-bit, little endian.

    Syscall Emulation: Linux binaries; Simple Atomic, Simple Timing, In-Order, Out-of-Order CPU models.

    Full system: significant progress, but not actively developed.

    102

  • POWER

    POWER ISA v2.06 B Book, 32-bit, little endian. Most instructions available, but some FP missing; no vector support.

    Syscall Emulation: Linux binaries; Simple Atomic, Simple Timing, Out-of-Order CPU models.

    Full system: no current plans.

    103

  • SPARC

    UltraSPARC Architecture 2005.

    Syscall Emulation: Linux or Solaris binaries; Simple Atomic, Simple Timing, Out-of-Order CPU models.

    Full system: Solaris; a single core of an UltraSPARC T1 (Niagara) processor; Simple Atomic CPU model only. Significant progress on MP, but not actively developed.

    104

  • x86

    Generic x86 CPU with 64-bit, 3DNow!, & SSE extensions. Effort focused on modern features: no x87 floating point (compile 32-bit code with -msse2); no Windows support any time soon.

    Syscall Emulation: Linux binaries; Simple Atomic, Simple Timing, Out-of-Order CPU models.

    Full system: Linux; Simple Atomic, Simple Timing CPU models; MP support.

    105

  • Parts of an ISA

    Parameterization: number of registers, endianness, page size

    Specialized objects: TLBs, faults, control state, interrupt controller

    Instructions: the instructions themselves, and the decoding mechanism

    106

  • Instruction decode process

    [Diagram: instruction bytes from memory feed the Predecoder (together with context) to form an ExtMachInst; the Decoder then produces either a StaticInst or a Macroop, which expands into Microops.]

    107

  • ISA Description Language src/arch/isa_parser.py, src/arch/*/isa/*

    Custom domain-specific language. Defines the decoding & behavior of the ISA and generates C++ code: scads of StaticInst subclasses; a decodeInst() function that maps a machine instruction to a StaticInst instance; multiple scads of execute() methods (the cross-product of CPU models and StaticInst subclasses).

    108

  • Definitions etc.

    def bitfield OPCODE  <31:26>;
    def bitfield RA      <25:21>;
    def bitfield RB      <20:16>;
    def bitfield INTFUNC <11: 5>; // function code
    def bitfield RC      < 4: 0>; // dest reg

    def operands {{
        Ra: (IntReg, uq, PALMODE ? AlphaISA::reg_redir[RA] : RA,
             IsInteger, 1),
        Rb: (IntReg, uq, PALMODE ? AlphaISA::reg_redir[RB] : RB,
             IsInteger, 2),
        Rc: (IntReg, uq, PALMODE ? AlphaISA::reg_redir[RC] : RC,
             IsInteger, 3),
        Fa: (FloatReg, df, FA, IsFloating, 1),
        Fb: (FloatReg, df, FB, IsFloating, 2),
        Fc: (FloatReg, df, FC, IsFloating, 3),
    }}

    def format LoadAddress(code) {{
        // Python code here...
    }}

    def format IntegerOperate(code) {{
        // Python code here...
    }}

    109

  • Instruction Decode and Semantics

    decode OPCODE {
        format LoadAddress {
            0x08: lda({{ Ra = Rb + disp; }});
            0x09: ldah({{ Ra = Rb + (disp << 16); }});
        }
        ...
    }

    110

  • Microcode

    def macroop MOVS_E_M_M {
        and t0, rcx, rcx, flags=(EZF,), dataSize=asz
        br label("end"), flags=(CEZF,)

        # Find the constant we need to either add or subtract from rdi
        ruflag t0, 10
        movi t3, t3, dsz, flags=(CEZF,), dataSize=asz
        subi t4, t0, dsz, dataSize=asz
        mov t3, t3, t4, flags=(nCEZF,), dataSize=asz

    topOfLoop:
        ld t1, seg, [1, t0, rsi]
        st t1, es, [1, t0, rdi]

        subi rcx, rcx, 1, flags=(EZF,), dataSize=asz
        add rdi, rdi, t3, dataSize=asz
        add rsi, rsi, t3, dataSize=asz
        br label("topOfLoop"), flags=(nCEZF,)

    end:
        fault "NoFault"
    };

    111

  • Key Features

    Very compact representation: most instructions take one line of 'C' code. Alpha: 3437 lines of ISA description become 39K lines of C++ (15K generic decode, 12K for each of 2 CPU models).

    Characteristics are auto-extracted from the C code: source and dest regs, functional unit class, etc. The execute() code is customized for the CPU models.

    Thoroughly documented (for us, anyway); see the wiki pages.

    112

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

    113

  • CPU Modeling

    CPU Modeling

    Korey Sewell

    University of Michigan, Ann Arbor

    114

  • Overview

    High-Level View

    Supported CPU Models: AtomicSimpleCPU, TimingSimpleCPU, InOrderCPU, O3CPU

    CPU Model Internals: parameters, time buffers, key interfaces

    115

  • CPU Models - System Level View

    CPU models are designed to be hot-pluggable with arbitrary ISAs and memory systems

    116

  • Supported CPU Models src/cpu/*.hh,cc

    Simple CPUs: model a single-threaded, 1-CPI machine. Two types: AtomicSimpleCPU and TimingSimpleCPU. Common uses: fast, functional simulation (2.9 million and 1.2 million instructions per second, respectively, on the twolf benchmark); warming up caches; studies that do not require detailed CPU modeling.

    Detailed CPUs: parameterizable pipeline models with SMT support. Two types: InOrderCPU and O3CPU. Execute-in-execute, detailed modeling; slower than the Simple CPUs (200K instructions per second on the twolf benchmark). They model the timing of each pipeline stage, forcing both the timing and the execution of the simulation to be accurate; important for coherence, I/O, multiprocessor studies, etc.

    117


  • AtomicSimpleCPU src/cpu/simple/atomic/*.hh,cc

    On every CPU tick(), perform all necessary operations for an instruction

    Memory accesses are atomic

    Fastest functional simulation

    118

  • TimingSimpleCPU src/cpu/simple/timing/*.hh,cc

    Memory accesses use the timing path

    The CPU waits until the memory access returns

    Fast, provides some level of timing

    119

  • InOrder CPU Model src/cpu/inorder/*.hh,cc

    Detailed in-order CPU; InOrder is a new feature of the gem5 simulator

    Default 5-stage pipeline: Fetch, Decode, Execute, Memory, Writeback

    120


  • InOrder CPU Model src/cpu/inorder/*.hh,cc

    Detailed in-order CPU. Default 5-stage pipeline: Fetch, Decode, Execute, Memory, Writeback.

    Key resources: CacheUnit, ExecutionUnit, BranchPredictor, etc.

    Key parameters: pipeline stages, hardware threads.

    Implementation: a customizable set of pipeline components. Pipeline stages interact with a Resource Pool; the pipeline is defined through instruction schedules. Each instruction type defines what resources it needs in a particular stage; if an instruction can't complete all its resource requests in one stage, it blocks the pipeline.

    121


  • O3 CPU Model src/cpu/o3/*.hh,cc

    Detailed out-of-order CPU. Default 7-stage pipeline: Fetch, Decode, Rename, IEW, Commit (IEW = Issue, Execute, and Writeback). Model a varying number of pipeline stages by changing the delays between stages (e.g. fetchToDecodeDelay).

    Key resources: Physical Register (PR) file, IQ, LSQ, ROB, Functional Unit (FU) pool.

    Key parameters: interstage pipeline delays, hardware threads, IQ/LSQ/ROB/PR entries, FU delays.

    Other key features: support for CISC decoding (e.g. x86); renaming with a physical register file; functional units with varying latencies; branch prediction; memory dependence prediction.

    122


  • CPU Model Internals src/cpu/*

    A key reason that the CPU models are hot-pluggable into gem5 is that the CPUs share common components and interfaces within the simulator:

    Parameter definition

    Shared components: branch predictors, TLBs, ISA decoding, interrupt handlers

    TimeBuffer-based communication

    External interfaces: System: ThreadContext; ISA: StaticInst and DynInst; Memory: Ports, {send/recv}Timing

    123


  • CPU Internals - Parameters src/cpu/{simple,inorder,o3}/*.py

    Parameters are defined in a *.py file in each CPU's directory, e.g. the contents of src/cpu/inorder/InOrderCPU.py are shown below:

    class InOrderCPU(BaseCPU):
        type = 'InOrderCPU'
        ...
        cachePorts = Param.Unsigned(2, "Cache Ports")
        stageWidth = Param.Unsigned(4, "Stage width")
        ...
        icache_port = Port("Instruction Port")
        dcache_port = Port("Data Port")
        ...
        predType = Param.String("tournament", "Branch predictor type (local, tournament)")

    Use in your configuration scripts:

    ...
    cpu = InOrderCPU()
    cpu.stageWidth = 2
    ...

    124
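    On the C++ side, each Python parameter typically arrives through an auto-generated Params class passed to the model's constructor. A minimal hedged sketch follows; the member names mirror the Python definitions above, but the exact generated types and constructor plumbing vary across gem5 versions.

    // Sketch: consuming Python-defined parameters in the C++ model.
    // InOrderCPUParams would be generated from InOrderCPU.py by the build system.
    InOrderCPU::InOrderCPU(Params *params)
        : BaseCPU(params),
          cachePorts(params->cachePorts),   // Param.Unsigned(2, "Cache Ports")
          stageWidth(params->stageWidth)    // Param.Unsigned(4, "Stage width")
    {
        // ... construct pipeline stages, resources, etc. ...
    }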


  • CPU Internals - Time Buffers src/base/timebuf.hh

    Similar to queues; they are advance()d each CPU cycle.

    Each pipeline stage places information into the time buffer; the next stage reads from the time buffer by indexing into the appropriate cycle.

    Used for both forwards and backwards communication; avoids unrealistic interaction between pipeline stages.

    The time buffer class is templated; its template parameter is the communication struct between stages (see the sketch below).

    125
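
    To make the stage-to-stage handshake concrete, here is a minimal sketch of how two pipeline stages could communicate through a TimeBuffer. It assumes the TimeBuffer<T>(past, future), getWire(), and advance() interface from src/base/timebuf.hh; the FetchToDecode struct and the stage functions are made-up illustrations, not gem5 code.

      #include "base/timebuf.hh"

      // Hypothetical struct carried from fetch to decode.
      struct FetchToDecode {
          int numInsts;        // how many instructions fetch produced this cycle
      };

      // 4 cycles of history, no future entries.
      TimeBuffer<FetchToDecode> fetchToDecode(4, 0);

      // Each stage holds a "wire" into the buffer at a fixed offset.
      TimeBuffer<FetchToDecode>::wire toDecode  = fetchToDecode.getWire(0);   // producer writes "this cycle"
      TimeBuffer<FetchToDecode>::wire fromFetch = fetchToDecode.getWire(-1);  // consumer reads last cycle's entry

      void fetchStage()  { toDecode->numInsts = 2; }          // fetch publishes its results
      int  decodeStage() { return fromFetch->numInsts; }      // decode sees them one cycle later

      void tickCpu()
      {
          fetchStage();
          decodeStage();
          fetchToDecode.advance();   // shift the buffer once per CPU cycle
      }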

  • Time Buffer Communication

    Demonstrated on the out-of-order pipeline; the time buffers are shown in red

    (Figure: Fetch -> Decode -> Rename -> Issue/Execute/Writeback -> Commit, with time buffers between adjacent stages and backwards communication from later stages to earlier ones)

    126

  • CPU Interfaces - ThreadContext  src/cpu/thread_context.hh

    Interface for accessing the total architectural state of a single thread
      PC, register values, etc.
    Used to obtain pointers to key classes
      CPU, process, system, ITB, DTB, etc.
    Abstract base class
      Each CPU model must implement its own derived ThreadContext

    127
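
    As a rough illustration of how a model uses this interface, the sketch below reads architectural state and fetches pointers to the owning objects through a ThreadContext. The accessor names (readIntReg, getCpuPtr, getProcessPtr) follow the interface described above, but exact names and signatures vary across gem5 versions, so treat this as a sketch rather than the definitive API.

      #include "cpu/thread_context.hh"

      void inspectThread(ThreadContext *tc)
      {
          // Architectural state of this thread: e.g., an integer register by index.
          uint64_t r1 = tc->readIntReg(1);

          // Pointers to the key objects that own or service the thread.
          BaseCPU *cpu  = tc->getCpuPtr();
          Process *proc = tc->getProcessPtr();   // meaningful in syscall-emulation mode

          (void)r1; (void)cpu; (void)proc;       // placeholder: a real model would act on these
      }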

  • CPU Interfaces - StaticInst Class  src/cpu/static_inst.{hh,cc}

    Represents a decoded instruction
      Has classifications of the inst
      Corresponds to the binary machine inst
      Only has static information
    Has all the methods needed to execute an instruction
      Tells which regs are source and dest
      Contains the execute() function
      The ISA parser generates execute() for all insts

    128
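
    The sketch below is a hand-written stand-in for the kind of class the ISA parser emits: a concrete StaticInst subclass that records its source/destination registers and supplies an execute() body. The class name, constructor arguments, and execute() signature are schematic (the real generated code is ISA-specific and specialized per CPU model), so read it as an illustration rather than actual generated output.

      #include "cpu/static_inst.hh"

      // Illustrative only - not generated gem5 code.
      class AddExample : public StaticInst
      {
        public:
          AddExample(MachInst machInst)
              : StaticInst("add_example", machInst, IntAluOp)
          {
              _numSrcRegs  = 2;     // static info: two source registers...
              _numDestRegs = 1;     // ...and one destination register
          }

          // The ISA parser generates a body like this for every instruction.
          Fault execute(ExecContext *xc, Trace::InstRecord *trace) const
          {
              uint64_t a = xc->readIntRegOperand(this, 0);
              uint64_t b = xc->readIntRegOperand(this, 1);
              xc->setIntRegOperand(this, 0, a + b);
              return NoFault;
          }
      };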

  • CPU Interfaces - DynInst Class  src/cpu/base_dyn_inst.{hh,cc}

    Dynamic version of StaticInst
      Used to hold extra information for detailed CPU models
    BaseDynInst
      Holds PC, results, branch prediction status
      Interface for TLB translations
    InOrderDynInst - src/cpu/inorder/dyn_inst.{hh,cc}
      Holds the current status of an instruction's request to a resource
      Manages each instruction's pipeline schedule
    O3DynInst - src/cpu/o3/dyn_inst.{hh,cc}
      Holds the status of renamed registers
      Interfaces to the IQ, LSQ, and ROB

    129

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

    130

  • Ruby Memory System

    Ruby Memory System

    Derek Hower

    University of Wisconsin, Madison

    131

  • Outline

    Feature Overview
      Rich Configuration
      Rapid Prototyping
      SLICC
      Modular & Detailed Components

    Lifetime of a Ruby memory request

    132

  • Feature Overview

    Flexible Memory System
    Rich configuration - Just run it
      Simulate combinations of caches, coherence, interconnect, etc.
    Rapid prototyping - Just create it
      Domain-Specific Language (SLICC) for coherence protocols
      Modular components
    Detailed statistics
      e.g., request size/type distribution, state transition frequencies, etc.
    Detailed component simulation
      Network (fixed/flexible pipeline and simple)
      Caches (pluggable replacement policies)
      Memory (DDR2)

    133

  • Rich Configuration - Just run it

    Can build many different memory systems
      CMPs, SMPs, SCMPs
      1/2/3-level caches
      Pt2Pt/Torus/Mesh topologies
      MESI/MOESI coherence
    Each component is individually configurable
      Build heterogeneous cache architectures (new)
      Adjust cache sizes, bandwidth, link latencies, etc.
    Get research started without modifying code!

    134

  • Configuration Examples

    1 8-core CMP, 2-level caches, MESI protocol, 32K L1s, 8MB 8-banked L2, crossbar interconnect

      scons build/ALPHA_FS/gem5.opt PROTOCOL=MESI_CMP_directory RUBY=True
      ./build/ALPHA_FS/gem5.opt configs/example/ruby_fs.py -n 8 --l1i_size=32kB \
          --l1d_size=32kB --l2_size=8MB --num-l2caches=8 --topology=Crossbar --timing

    2 64-socket SMP, 2-level on-chip caches, MOESI protocol, 32K L1s, 8MB L2 per chip, mesh interconnect

      scons build/ALPHA_FS/gem5.opt PROTOCOL=MOESI_CMP_directory RUBY=True
      ./build/ALPHA_FS/gem5.opt configs/example/ruby_fs.py -n 64 --l1i_size=32kB \
          --l1d_size=32kB --l2_size=512MB --num-l2caches=64 --topology=Mesh --timing

    Many other configuration options
    Protocols only work with specific architectures (see wiki)

    135

  • Rapid Prototyping - Just create it

    Modular construction
      Coherence controller (SLICC)
      Cache (C++)
      Replacement policy (C++)
      DRAM (C++)
      Topology (Python)
      Network implementation (C++)
    Debugging support

    136

  • SLICC: Specification Language for Implementing Cache Coherence

    Domain-Specific Language
      Syntactically similar to C/C++
      Like HDLs, constrains operations to be hardware-like (e.g., no loops)
    Two generation targets
      C++ for simulation: the coherence controller object
      HTML for documentation
    Table-driven specification (State x Event -> Actions & next state)

    137

  • SLICC Protocol Structure

    Collection of Machines, e.g.
      L1 Controller
      L2 Controller
      DRAM Controller
    Machines are connected through network ports (different than MemPorts)
    Network can be an arbitrary topology

    138

  • Machine Structure

    Machines are (logically) per-block
    Consist of:
      Ports - interface to the world
      States - both stable and transient
      Events - triggered by incoming messages
      Transitions - old state x event -> new state
      Actions - occur atomically during a transition, e.g., send/receive messages from the network

    139

  • MI Example

    (Figure: CPUs, each with a private L1 cache, connected to directories over a point-to-point interconnect)

    Single-level coherence protocol
    2 controller types: Cache + Directory
    Cache Controller
      2 stable states: Modified (a.k.a. Valid), Invalid
    Directory Controller [not shown]
      2 stable states: Modified (present in a cache), Valid
    3 virtual networks (request, response, forward)
    See src/mem/ruby/protocols/MI_example.*

    140

  • MI Example - L1 Cache Controller: Machine Structure

    Machine Pseudo-Code

      machine(L1Cache, "MI Example L1 Cache")
          : Sequencer * sequencer,              // parameters to the machine object (set at initialization)
            CacheMemory * cacheMemory,
            int cache_response_latency = 12,
            int issue_latency = 2
      {
          // Message buffers & ports: 3 virtual channels to/from the network + connection to the CPU

          // States & events: M, I, & Load, Store, etc.

          // Event mapping: e.g., RequestMessage::GETX -> Fwd_GETX

          // Actions: e.g., issueRequest

          // Transitions: e.g., I x Store -> M
      }

    141

  • MI Example - L1 Cache Controller: Defining a Machine Interface

    Interface to the network

      // MessageBuffers - opaque C++ communication queues
      MessageBuffer requestFromCache, network=To, virtual_network=0, ordered=true;
      MessageBuffer responseFromCache, network=To, virtual_network=1, ordered=true;
      MessageBuffer forwardToCache, network=From, virtual_network=2, ordered=true;
      MessageBuffer responseToCache, network=From, virtual_network=1, ordered=true;

      // out_port - map request type to outgoing message buffer
      out_port(requestNetwork_out, RequestMsg, requestFromCache);
      out_port(responseNetwork_out, ResponseMsg, responseFromCache);

      // in_port - map request type to incoming message buffer
      // and produce code to accept incoming messages
      in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) { ... }
      in_port(responseNetwork_in, ResponseMsg, responseToCache) { ... }

    Interface to a CPU

      // The other end of mandatoryQueue attaches to the Sequencer
      MessageBuffer mandatoryQueue, ordered=false;
      in_port(mandatoryQueue_in, RubyRequest, mandatoryQueue, desc=...) { ... }
      // There is no corresponding out_port - handled with hitCallback

    142

  • MI Example - L1 Cache Controller: Declaring States

    State Declaration

      // STATES
      state_declaration(State, desc="Cache states") {
          // Stable States
          I, AccessPermission:Invalid, desc="Not Present/Invalid";
          M, AccessPermission:Read_Write, desc="Modified";

          // Transient States
          II, AccessPermission:Busy, desc="Not Present/Invalid, issued PUT";
          MI, AccessPermission:Busy, desc="Modified, issued PUT";
          MII, AccessPermission:Busy, desc="Modified, issued PUTX, received nack";
          IS, AccessPermission:Busy, desc="Issued request for LOAD/IFETCH";
          IM, AccessPermission:Busy, desc="Issued request for STORE/ATOMIC";
      }

    143

  • MI Example - L1 Cache Controller: Declaring Events

    Event Declaration

      // EVENTS
      enumeration(Event, desc="Cache events") {
          // From processor
          Load, desc="Load request from processor";
          Ifetch, desc="Ifetch request from processor";
          Store, desc="Store request from processor";

          // From network (directory)
          Data, desc="Data from network";
          Fwd_GETX, desc="Forward from network";
          Inv, desc="Invalidate request from dir";
          Writeback_Ack, desc="Ack from the directory for a writeback";
          Writeback_Nack, desc="Nack from the directory for a writeback";

          // Internally generated
          Replacement, desc="Replace a block";
      }

    144

  • MI Example - L1 Cache Controller: Mapping Messages to Events

    Mapping occurs in the in_port declaration
      peek(in_port, message_type) sets the variable in_msg to the message at the head of the in_port queue
      trigger(Event, address) fires the named event for the given address

    Event mapping

      in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {
          if (forwardRequestNetwork_in.isReady()) {
              peek(forwardRequestNetwork_in, RequestMsg) {
                  if (in_msg.Type == CoherenceRequestType:GETX) {
                      trigger(Event:Fwd_GETX, in_msg.Address);
                  }
                  ...
              }
          }
      }

    145

  • MI Example - L1 Cache Controller: Defining Transitions

    transition(Starting State(s), Event, [Ending State]) [ { Actions } ]

    Transition sequence for a new Store request

      transition(I, Store, IM) {
          v_allocateTBE;           // allocate TBE (a.k.a. MSHR) on transition to transient state
          i_allocateL1CacheBlock;
          a_issueRequest;
          m_popMandatoryQueue;
      }

      transition(IM, Data, M) {
          u_writeDataToCache;
          s_store_hit;
          w_deallocateTBE;         // deallocate TBE on transition back to stable state
          n_popResponseQueue;
      }
      ...

    146

  • MI Example - L1 Cache Controller: Defining Actions

    action(name, abbrev, [desc]) { implementation }
    Two special functions are available in an action
      peek(in_port, message_type) { use in_msg } - assigns in_msg to the message at the head of the port
      enqueue(out_port, message_type, [options]) { set out_msg } - enqueues out_msg on out_port
    The special variable address is available inside an action block
      Set to the address associated with the event that caused the calling transition

    Example Action Definition

      action(e_sendData, "e", desc="Send data from cache to requestor") {
          peek(forwardRequestNetwork_in, RequestMsg) {
              enqueue(responseNetwork_out, ResponseMsg, latency=cache_response_latency) {
                  out_msg.Address := address;
                  out_msg.Type := CoherenceResponseType:DATA;
                  out_msg.Sender := machineID;
                  out_msg.Destination.add(in_msg.Requestor);   // uses in_msg set by peek
                  out_msg.DataBlk := cacheMemory[address].DataBlk;
                  out_msg.MessageSize := MessageSizeType:Response_Data;
              }
          }
      }

    147

  • MI Example - L1 Cache Controller: Transition Table

    (Figure: the State x Event transition table for the MI example L1 cache controller, not reproduced here)

    148

  • MI Example: Connecting SLICC Machines with a Topology

    Creating the topology is not done in SLICC - src/mem/ruby/network/topologies/Pt2Pt.py

      # returns a SimObject for a Pt2Pt topology
      def makeTopology(nodes, options, IntLink, ExtLink, Router):
          # Create an individual router for each controller (node),
          # and connect them (ext_links)
          routers = [Router(router_id=i) for i in range(len(nodes))]
          ext_links = [ExtLink(link_id=i, ext_node=n, int_node=routers[i])
                       for (i, n) in enumerate(nodes)]
          link_count = len(nodes)

          # Connect routers all-to-all (int_links)
          int_links = []
          for i in xrange(len(nodes)):
              for j in xrange(len(nodes)):
                  if (i != j):
                      link_count += 1
                      int_links.append(IntLink(link_id=link_count,
                                               node_a=routers[i],
                                               node_b=routers[j]))

          # Return Pt2Pt Topology SimObject
          return Pt2Pt(ext_links=ext_links,
                       int_links=int_links,
                       routers=routers)

    149

  • Using C++ Objects in SLICC

    SLICC can be arbitrarily extended with C++ objects
      e.g., interface with a new message filter
    Steps:
      Create the class in C++
      Declare the interface in SLICC with structure(..., external=yes)
      Initialize the object in the machine
      Use!

    Extending SLICC

      // MessageFilter.h
      class MessageFilter {
        public:
          MessageFilter(int param1);

          // returns 1 if message should be filtered
          int filter(RequestMsg msg);
      };

      // MessageFilter.cc
      int MessageFilter::filter(RequestMsg msg)
      {
          ...
          return 0;
      }

      // MI_example-cache.sm
      structure(MessageFilter, external=yes) {
          int filter(RequestMsg);
      };

      MessageFilter requestFilter, constructor_hack=param;

      action(af_allocateUnlessFiltered, af) {
          if (requestFilter.filter(in_msg) != 1) {
              cacheMemory.allocate(address, new Entry);
          }
      }

    150

  • Detailed Component Simulation: Caches

    Set-Associative Caches
      Each CacheMemory object represents one bank of cache
      Configurable bit select for indexing
      Modular replacement policy
        Tree-based pseudo-LRU
        LRU
    See src/mem/ruby/system/CacheMemory.hh

    151
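
    Tree-based pseudo-LRU, mentioned above as one of the pluggable replacement policies, can be summarized in a few lines. The sketch below is a generic, self-contained illustration of the algorithm for one 8-way set; it is not the CacheMemory implementation itself.

      #include <cstdint>

      // Each internal node of a binary tree holds one bit; on an access the
      // bits along the path to the touched way are flipped to point *away*
      // from it, and the victim is found by following the bits downward.
      struct TreePLRUSet {
          static const int ways = 8;
          uint8_t bits[ways - 1] = {};      // 7 internal nodes for 8 ways

          void touch(int way) {             // mark 'way' as most recently used
              int node = 0;
              for (int bit = ways / 2; bit >= 1; bit /= 2) {
                  bool right = way & bit;
                  bits[node] = right ? 0 : 1;          // point away from the accessed half
                  node = 2 * node + (right ? 2 : 1);   // descend toward the accessed way
              }
          }

          int victim() const {              // follow the bits to the pseudo-LRU way
              int node = 0, way = 0;
              for (int bit = ways / 2; bit >= 1; bit /= 2) {
                  bool right = bits[node];             // bit points toward the colder half
                  if (right) way |= bit;
                  node = 2 * node + (right ? 2 : 1);
              }
              return way;
          }
      };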

  • Detailed Component Simulation: Memory

    Memory controller models a single-channel DDR2 controller
      Implements a closed-page policy
      Can configure ranks, tCAS, refresh, etc.
    See src/mem/ruby/system/MemoryController.hh

    152
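
    To illustrate what the closed-page policy implies for timing, the snippet below sketches a per-request latency under our simplifying assumption that every access must activate its row and the bank is precharged afterwards. tRCD, tCAS, and tRP are standard DRAM timing terms; the real controller's parameter names and bookkeeping differ.

      #include <cstdint>

      // Illustrative sketch, not the Ruby MemoryController code.
      struct DramTiming {
          uint32_t tRCD;   // activate-to-column-command delay (memory cycles)
          uint32_t tCAS;   // column access latency
          uint32_t tRP;    // precharge time before the bank's next activate
      };

      // Closed-page policy: the row is closed after every access, so each
      // request pays a full activate + column access; tRP is then paid as a
      // gap before the *next* activate to the same bank (an open-page policy
      // would skip tRCD on a row-buffer hit).
      uint32_t closedPageRequestLatency(const DramTiming &t)
      {
          return t.tRCD + t.tCAS;
      }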

  • Detailed Component Simulation: Network

    Simple Network
      Idealized routers - fixed latency, no internal resources
      Does model link bandwidth
    Garnet Network
      Detailed routers - both fixed and flexible pipeline models
      From Princeton, MIT
    See src/mem/ruby/network/*

    153

  • Ruby Debugging Support

    Random testing support
      Stresses the protocol by inserting random timing delays
    Support for coherence transition tracing
    Frequent assertions
    Deadlock detection

    154

  • Lifetime of a Ruby Memory Request

    1 Request enters through RubyPort::recvTiming, is converted to a RubyRequest, and passed to the Sequencer.

    2 Request enters the SLICC controllers through Sequencer::makeRequest via the mandatoryQueue.

    3 A message on the mandatoryQueue triggers an event in the L1 controller.

    4 Until the request is completed:
      1 (Event, State) is matched to a transition.
      2 Actions in the matched transition (optionally) send messages to the network & allocate a TBE.
      3 Responses from the network trigger more events.

    5 The last event causes an action that calls Sequencer::hitCallback & deallocates the TBE.

    6 The RubyRequest is converted back into a Packet & sent to the RubyPort.

    155
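
    The control flow above can be compressed into a few lines of schematic C++. The classes below are stripped-down stand-ins for the real RubyPort, Sequencer, and SLICC controller (the real interfaces take gem5 Packet pointers, return status codes, and so on); the sketch only shows who calls whom.

      #include <queue>

      struct Packet {};
      struct RubyRequest { Packet *pkt; };

      struct Sequencer {
          std::queue<RubyRequest> mandatoryQueue;      // drained by the L1 controller's in_port

          void makeRequest(Packet *pkt) {              // step 2: request enters SLICC
              mandatoryQueue.push(RubyRequest{pkt});
          }
          void hitCallback(RubyRequest &req) {         // step 5: protocol finished
              respond(req.pkt);                        // step 6: back into a Packet, toward the RubyPort
          }
          void respond(Packet *pkt) { (void)pkt; }
      };

      struct RubyPort {
          Sequencer sequencer;
          bool recvTiming(Packet *pkt) {               // step 1: request enters Ruby
              sequencer.makeRequest(pkt);
              return true;
          }
      };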

  • Outline

    1 Introduction to gem5

    2 Basics

    3 Debugging

    4 Checkpointing and Fastforwarding

    5 Break

    6 Multiple Architecture Support

    7 CPU Modeling

    8 Ruby Memory System

    9 Wrap-Up

    156

  • Wrap-Up

    Wrap-Up

    Brad Beckmann

    AMD Research

    157

  • Summary

    Reviewed the basics
      High-level features
      Debugging
      Checkpointing
    Highlighted new aspects
      ISA changes: x86 & ARM
      InOrder CPU model
      Ruby memory system
    Upcoming Computer Architecture News (CAN) article
      Summarizes goals, features, and capabilities
      Please cite it if you use gem5
    Overall, gem5 has a wide range of capabilities
      ...but not all combinations currently work

    158

  • Cross-Product Table

    Processor vs. memory-system support matrix

      CPU Model       System Mode   Classic   Ruby (Simple)   Ruby (Garnet)
      Atomic Simple   SE / FS
      Timing Simple   SE / FS
      InOrder         SE / FS
      O3              SE / FS

    Spectrum of choices (light = speed, dark = accuracy); the original slide shades each cell along this spectrum

    159

  • Matrix Examples: Alpha and x86

    Alpha
    (Table: the CPU model / system mode / memory system support matrix from the previous slide, filled in for Alpha)

    x86
    (Table: the corresponding support matrix for x86)