111
Scalable FPGA-accelerated Cycle-Accurate Hardware Simulation in the Cloud Introduction RISC-V Summit 2018 Tutorial Speakers: Sagar Karandikar, David Biancolin, Alon Amid https://fires.im @firesimproject

Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Scalable FPGA-accelerated Cycle-Accurate Hardware

Simulation in the Cloud

Introduction

RISC-V Summit 2018 Tutorial

Speakers: Sagar Karandikar, David Biancolin, Alon Amid

https://fires.im

@firesimproject

Page 2: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

What is this tutorial about?

• How to get started with FireSim • We’re going to keep things very practical today, treat this as a

tutorial • For more of the research background, see our papers/previous

talks

• We’re going to cram in a lot of info, but feel free to interrupt us with questions • We’ll put up the full slide deck online anyway

2

Page 3: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Today’s Agenda

• Intro, what is FireSim capable of? • (Sagar, 10 mins)

• Building/Deploying FireSim Simulations • (Sagar, 15 mins)

• Using FireSim to simulate your own custom HW • (David, 15 mins)

• Testing and Debugging your design in FireSim • (Alon, 15 mins)

• Community/Contributing/Conclusion • (David, 10 mins)

3

Page 4: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

What can FireSim be used for?

• Evaluating a custom hardware design • For example, building a custom accelerator and running real workloads on it • e.g. “A Hardware Accelerator for Tracing Garbage Collection” from ISCA 2018 and the

Hwacha project (see RV-Summit talk tomorrow)

• Debugging a Chisel design at FPGA-speeds • e.g. FireSim Debugging Docs and DESSERT from FPL 2018

• Rapidly Customizing/Evaluating RISC-V cores, • For example, Rocket Chip and BOOM, provided as default designs in FireSim. • FireSim can run all of SPECInt2017 with reference inputs on Rocket Chip in parallel on

10 cloud-FPGAs in ~1 day • e.g. “Composable Building Blocks to Open up Processor Design” from MICRO 2018

• Simulating thousand-node datacenter-scale systems with full control over the hardware • e.g. FireSim ISCA 2018 Paper • Implicitly includes the previous use case

4

Page 5: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

What can FireSim be used for?

• Evaluating a custom hardware design • For example, building a custom accelerator and running real workloads on it • e.g. “A Hardware Accelerator for Tracing Garbage Collection” from ISCA 2018 and the

Hwacha project (see RV-Summit talk tomorrow)

• Debugging a Chisel design at FPGA-speeds • e.g. FireSim Debugging Docs and DESSERT from FPL 2018

• Rapidly Customizing/Evaluating RISC-V cores, • For example, Rocket Chip and BOOM, provided as default designs in FireSim. • FireSim can run all of SPECInt2017 with reference inputs on Rocket Chip in parallel on

10 cloud-FPGAs in ~1 day • e.g. “Composable Building Blocks to Open up Processor Design” from MICRO 2018

• Simulating thousand-node datacenter-scale systems with full control over the hardware • e.g. FireSim ISCA 2018 Paper • Implicitly includes the previous use case

5

Page 6: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

How does FireSim simulate a datacenter?

• Model hardware at scale, cycle-accurately: • CPUs down to microarchitecture (automatically transformed from Chisel RTL)

• Fast networks, switches (SW models)

• Novel accelerators (also transformed from Chisel)

• Run real software: • Real OS, networking stack (Linux)

• Real frameworks/applications (not microbenchmarks)

• Be productive/usable: • Run on a commodity platform (Amazon EC2 F1)

• Want to encourage collaboration between systems, architecture: real HW/SW co-design research

6

Page 7: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Why do we want an FPGA-based simulator?

7

RTL on FPGA 100 MHz

DRAM

100ns latency

RTL taped-out

1 GHz

DRAM

100ns latency

FPGA Emulation/Prototyping SoC sees 10 cycle DRAM latency

Taped-out Design SoC sees 100 cycle DRAM latency

Page 8: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

How do we fix this? FAME-1 Xform in FIRRTL w/MIDAS

8

RTL Design

Physical DRAM

100ns

latency

<- Resp Queue

Req Queue ->

DRAM Model

100

cycle latency

Mem Channel

FPGA Fabric

“FAME-1” Transformed RTL Design

Page 9: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Comparing existing HW “simulation” systems

9

• Taping-out excels at: • Modeling reality: “single source of truth” • Scalability

• Hardware-accelerated simulators excel at: • Simulation rate • Ability to run real workloads (as fn. of sim rate)

• Software-based simulators excel at: • Ease-of-use • Ease-of-rebuild (time-to-first-cycle) • Commodity host platform • Cost • Introspection

Page 10: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Prior work on HW-accelerated simulators: DIABLO

• DIABLO, ASPLOS’15 [4]: • Simulated 3072 servers, 96 ToRs at ~2.7 MHz

• Booted Linux, ran apps like Memcached

• An example of the many projects from RAMP

• Need to hand-write abstract RTL models • Harder than writing “tapeout-ready” RTL

• Need to validate against real HW

• Tied to an expensive custom host-platform • $100k+ host platform, custom built

DIABLO Prototype

10

Page 11: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

FPGAs in the Cloud

Useful trends throughout the architect’s stack

High-Productivity Hardware Design

Language w/IR

Open, Silicon-Proven SoC Implementations

Open ISA

11

Page 12: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

FireSim at a high-level

Server Simulations • Or your own custom hardware

• Good fit for the FPGA

• We have tapeout-proven RTL: automatically FAME-1 transform

Network simulation • Or your own custom SW models

• Little parallelism in switch models (e.g. a thread per port)

• Need to coordinate all of our distributed server simulations

• So use CPUs + host network

12

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

CPU

Host PCIe

f1.16xlarge

Ho

st E

ther

net

(EC

2 N

etw

ork

)

Server Simulations

Swit

ch M

od

el

Page 13: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Now, let’s build a datacenter-scale FireSim simulation!

13

Page 14: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

14

Modeled System

- 4x RISC-V Rocket Cores @ 3.2 GHz

- 16K I/D L1$

- 256K Shared L2$

- 200 Gb/s Eth. NIC

Resource Util.

- < ¼ of an FPGA

Sim Rate

- N/A

Step 1: Server SoC in RTL

Page 15: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

15

Modeled System

- 4x RISC-V Rocket Cores @ 3.2 GHz

- 16K I/D L1$

- 256K Shared L2$

- 200 Gb/s Eth. NIC

Resource Util.

- < ¼ of an FPGA

Sim Rate

- N/A

Step 1: Server SoC in RTL

Page 16: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

16

Modeled System

- 4x RISC-V Rocket Cores @ 3.2 GHz

- 16K I/D L1$

- 256K Shared L2$

- 200 Gb/s Eth. NIC

- 16 GB DDR3

Resource Util.

- < ¼ of an FPGA

- ¼ Mem Chans

Sim Rate

- ~150 MHz

- ~40 MHz (netw)

Step 2: FPGA Simulation of one server blade

Page 17: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

17

Modeled System

- 4x RISC-V Rocket Cores @ 3.2 GHz

- 16K I/D L1$

- 256K Shared L2$

- 200 Gb/s Eth. NIC

- 16 GB DDR3

Resource Util.

- < ¼ of an FPGA

- ¼ Mem Chans

Sim Rate

- ~150 MHz

- ~40 MHz (netw)

Step 2: FPGA Simulation of one server blade

Page 18: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

18

Modeled System

- 4 Server Blades

- 16 Cores

- 64 GB DDR3

Resource Util.

- < 1 FPGA

- 4/4 Mem Chans

Sim Rate

- ~14.3 MHz (netw)

Step 3: FPGA Simulation of 4 server blades

Cost: $0.49 per hour (spot) $1.65 per hour (on-demand)

Page 19: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

19

Modeled System

- 4 Server Blades

- 16 Cores

- 64 GB DDR3

Resource Util.

- < 1 FPGA

- 4/4 Mem Chans

Sim Rate

- ~14.3 MHz (netw)

Step 3: FPGA Simulation of 4 server blades

Page 20: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

20

Modeled System

- 32 Server Blades

- 128 Cores

- 512 GB DDR3

- 32 Port ToR Switch

- 200 Gb/s, 2us links

Resource Util.

- 8 FPGAs =

- 1x f1.16xlarge

Sim Rate

- ~10.7 MHz (netw)

Step 4: Simulating a 32 node rack

Cost: $2.60 per hour (spot) $13.20 per hour (on-demand)

Page 21: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

21

Modeled System

- 32 Server Blades

- 128 Cores

- 512 GB DDR3

- 32 Port ToR Switch

- 200 Gb/s, 2us links

Resource Util.

- 8 FPGAs =

- 1x f1.16xlarge

Sim Rate

- ~10.7 MHz (netw)

Step 4: Simulating a 32 node rack

Cost: $2.60 per hour (spot) $13.20 per hour (on-demand)

Page 22: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

22

Modeled System

- 32 Server Blades

- 128 Cores

- 512 GB DDR3

- 32 Port ToR Switch

- 200 Gb/s, 2us links

Resource Util.

- 8 FPGAs =

- 1x f1.16xlarge

Sim Rate

- ~10.7 MHz (netw)

Step 4: Simulating a 32 node rack

Page 23: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

23

Modeled System

- 256 Server Blades

- 1024 Cores

- 4 TB DDR3

- 8 ToRs, 1 Aggr

- 200 Gb/s, 2us links

Resource Util.

- 64 FPGAs =

- 8x f1.16xlarge

- 1x m4.16xlarge

Sim Rate

- ~9 MHz (netw)

Step 5: Simulating a 256 node “aggregation pod”

Page 24: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

24

Modeled System

- 256 Server Blades

- 1024 Cores

- 4 TB DDR3

- 8 ToRs, 1 Aggr

- 200 Gb/s, 2us links

Resource Util.

- 64 FPGAs =

- 8x f1.16xlarge

- 1x m4.16xlarge

Sim Rate

- ~9 MHz (netw)

Step 5: Simulating a 256 node “aggregation pod”

Page 25: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

25

Modeled System

- 1024 Servers

- 4096 Cores

- 16 TB DDR3

- 32 ToRs, 4 Aggr, 1 Root

- 200 Gb/s, 2us links

Resource Util.

- 256 FPGAs =

- 32x f1.16xlarge

- 5x m4.16xlarge

Sim Rate

- ~6.6 MHz (netw)

Step 6: Simulating a 1024 node datacenter

Page 26: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

26

Modeled System

- 1024 Servers

- 4096 Cores

- 16 TB DDR3

- 32 ToRs, 4 Aggr, 1 Root

- 200 Gb/s, 2us links

Resource Util.

- 256 FPGAs =

- 32x f1.16xlarge

- 5x m4.16xlarge

Sim Rate

- ~6.6 MHz (netw)

Step 6: Simulating a 1024 node datacenter

Harnesses millions of dollars of FPGAs to simulate 1024 nodes cycle-exactly

with a cycle-accurate network simulation and global synchronization

at a cost-to-user of only 100s of dollars/hour

Page 27: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Open-source: Not just datacenter simulation

• An “easy” button for fast, FPGA-accelerated full-system simulation • Replace network endpoints with your own Chisel

designs

• One-click: Parallel FPGA builds, Simulation run/result collection, building target software

• Scales to a variety of use cases: • Networked (performance depends on scale)

• Non-networked (150+ MHz), limited by your budget

• firesim command line program • Like docker or vagrant, but for FPGA sims

• User doesn’t need to care about distributed magic happening behind the scenes

27 FireSim Developer Environment

Page 28: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Open-source: Not just datacenter simulation

• Scripts can call firesim to fully automate distributed FPGA sim • Reproducibility: included scripts to

reproduce ISCA 2018 results

• e.g. scripts to automatically run SPECInt2017 reference inputs in ≈1 day

• Many others

• 100+ pages of documentation: https://docs.fires.im

• AWS provides grants for researchers: https://aws.amazon.com/grants/

28

$ cd fsim/deploy/workloads

$ ./run-all.sh

Page 29: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Latest Updates

• Growing ecosystem!

• BOOM (Out-of-order RISC-V core) now available in FireSim

• Projects publicly releasing FireSim images at RISC-V Summit: • Hwacha vector accelerator • Keystone Secure Enclave

• Berkeley IceNet (photonic DC network) modeling

• New debugging features • Auto-ILA: Annotate Chisel and get an

ILA automatically wired-up in FireSim • Tracer widgets: Collect live instruction

traces from FireSim sims • Integrating DESSERT [8]

• Assertion Synthesis now on master

• First Academic User papers: • ISCA ‘18: Maas et. al. “A Hardware

Accelerator for Tracing Garbage Collection” (Berkeley)

• MICRO ‘18: Zhang et. al. “Composable Building Blocks to Open up Processor Design” (MIT)

29

Page 30: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Wrapping Up

• We can prototype scalable-systems built on arbitrary RTL at unprecedented scale

+ Mix software models when desired

• Simulation is automatically built and deployed

• Automatically deploy real workloads and collect results

• Open-source, runs on Amazon EC2 F1, no capex

30

SoC RTL

Other RTL

Network Topology

Automatically deployed, high-performance,

distributed simulation

SW Models

Full Work-load

Page 31: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Administrivia

• Everything we’re going to show you today is documented in excruciating detail at http://docs.fires.im/

• Use these slides as a high-level view of what’s possible

• We’ll also put the slides/videos online

31

Page 32: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Today’s Agenda

• Intro, what is FireSim capable of? • (Sagar, 10 mins)

• Building/Deploying FireSim Simulations • (Sagar, 15 mins)

• Using FireSim to simulate your own custom HW • (David, 15 mins)

• Testing and Debugging your design in FireSim • (Alon, 15 mins)

• Community/Contributing/Conclusion • (David, 10 mins)

32

☑ ️

Page 33: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Learn More: Web: https://fires.im GitHub: https://github.com/firesim @firesimproject

ISCA’18 Paper: sagark.org/assets/pubs/firesim-isca2018.pdf

The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, RISE Lab sponsor Amazon Web Services, and ADEPT Lab affiliates Google, Huawei, Siemens, SK Hynix, and Seagate. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

Questions?

Page 34: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

References

[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. OSDI'16 [2] Y. Lee et al., "An Agile Approach to Building RISC-V Microprocessors," in IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016. [3] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. EuroSys '14 [4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanovic, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15 [5] Tan, Z., Waterman, A., Cook, H., Bird, S., Asanović, K., & Patterson, D. A case for FAME: FPGA architecture model execution. ISCA ’10 [6] Evaluation of RISC-V RTL Designs with FPGA Simulation. Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach and Krste Asanovic. CARRV '17. [7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: fast and accurate sample-based energy simulation for arbitrary RTL. ISCA ‘16 [8] Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, and Krste Asanović, “DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of cycles”, FPL 2018

34

Page 35: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Building/Deploying Simulations

FireSim Tutorial

RISC-V Summit 2018 Tutorial

Speaker: Sagar Karandikar

Page 36: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

What will we cover?

•Building FireSim FPGA images for a set of targets

•Choosing targets at runtime

•Configuring target software

•Managing EC2 infrastructure for simulations

•Running a simulation

•Customizing workloads: SPEC example

36

Page 37: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Background Terminology

37

“AGFI”: FPGA Bitstream for F1

FPGAs

Page 38: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

First-time User Setup

•Won’t discuss this today, but we’ve documented the process of starting with AWS/FireSim from scratch here: http://docs.fires.im/en/latest/Initial-Setup/index.html • For the following walkthrough, we assume that you’ve

setup a Manager instance and cloned firesim into ~/firesim

38

Page 39: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Let’s simulate the following system:

Target Design:

• One Rocket Chip-based node (RTL) • 4 Rocket Cores

• 16K L1 I$, D$

• Block Device, UART, Serial Adapter

• No NIC

• 4 MB LLC (RTL Model)

• DDR3 Memory System

• No network

• Boot vanilla buildroot-Linux distro

Host Resources:

• One manager instance (c4.4xlarge)

• One F1 instance (f1.2xlarge, single FPGA)

39

☑ ️

Page 40: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Using the firesim manager command line

From inside ~/firesim, source sourceme-f1-manager.sh

This properly sets up your environment and puts firesim on your $PATH

This means that we can now call firesim from anywhere on the instance. It will always run from the directory:

your_firesim_clone/deploy/

in this case:

~/firesim/deploy/

40

Page 41: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Configuring the Manager. 4 files in firesim/deploy/

41

config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini

Page 42: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Setting up the manager to build FPGA images

42

config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini

Page 43: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Running builds

• Once we’ve configured what we want to build, let’s build it

$ firesim buildafi

• This completely automates the process. For each design, in-parallel: • Launch a build instance (c4.4xlarge) • Run FireSim generator (Chisel/FIRRTL/MIDAS) • Ship infrastructure to build instances, run Vivado FPGA

builds in parallel • Collect results back onto manager instance

• ~/firesim/deploy/results-workload/TIMESTAMP-config/ contains final results + QoR information + dcps to open in Vivado for further manual tuning

• Email you the entry to put into config_hwdb.ini • Terminate the build instance

43

Page 44: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Status - We have FPGA images for the simulator of our design

Target Design:

• One Rocket Chip-based node (RTL) • 4 Rocket Cores

• 16K L1 I$, D$

• Block Device, UART, Serial Adapter

• No NIC

• 4 MB LLC (RTL Model)

• DDR3 Memory System

• No network

• Boot vanilla buildroot-Linux distro

Host Resources:

• One manager instance (c4.4xlarge)

• One F1 instance (f1.2xlarge, single FPGA)

44

☑ ️

☑ ️☑ ️☑ ️

☑ ️

Page 45: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Now, let’s build our target-software

• We need: • bbl + vmlinux image: Berkeley BootLoader + Linux kernel image as payload • Root Filesystem disk image

• FireSim provides two levels of workload automation: • firesim-software repo for automatically building base-images for various

distros (e.g. buildroot or Fedora) that are compatible with Rocket / BOOM cd firesim/sw/firesim-software

./sw-manager.py -c br-disk.json build # simple buildroot distro

This lets us use the linux-uniform.json workload, which we will see later

• firesim manager / workload generation tool can bake custom workloads into rootfses • We’ll get to this later

45

Page 46: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Status

Target Design:

• One Rocket Chip-based node (RTL) • 4 Rocket Cores

• 16K L1 I$, D$

• Block Device, UART, Serial Adapter

• No NIC

• 4 MB LLC (RTL Model)

• DDR3 Memory System

• No network

• Boot vanilla buildroot-Linux distro

Host Resources:

• One manager instance (c4.4xlarge)

• One F1 instance (f1.2xlarge, single FPGA)

46

☑ ️

☑ ️☑ ️☑ ️☑ ️

☑ ️

Page 47: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Setting up the manager’s runtime configuration

47

config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini

Page 48: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Setting up the manager’s runtime configuration

48

config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini

Page 49: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Launching Simulation Instances

• Once we’ve configured what we want to simulate and what infrastructure we require, let’s launch our simulation (F1) instances

49

$ firesim launchrunfarm

FireSim Manager. Docs: http://docs.fires.im

Running: launchrunfarm

Waiting for instance boots: f1.16xlarges

Waiting for instance boots: m4.16xlarges

Waiting for instance boots: f1.2xlarges

i-0d6c29ac507139163 booted!

The full log of this run is: /home/centos/firesim-new/deploy/logs/2018-05-19--00-19-43-launchrunfarm-B4Q2ROAK.log

Page 50: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Status

Target Design:

• One Rocket Chip-based node (RTL) • 4 Rocket Cores

• 16K L1 I$, D$

• Block Device, UART, Serial Adapter

• No NIC

• 4 MB LLC (RTL Model)

• DDR3 Memory System

• No network

• Boot vanilla buildroot-Linux distro

Host Resources:

• One manager instance (c4.4xlarge)

• One F1 instance (f1.2xlarge, single FPGA)

50

☑ ️

☑ ️☑ ️☑ ️☑ ️

☑ ️☑ ️

Page 51: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Distributing Simulation Infrastructure to the Run Farm • Next, we need to setup our run farm instance with simulation infrastructure

$ firesim infrasetup FireSim Manager. Docs: http://docs.fires.im

Running: infrasetup

Building FPGA software driver for FireSimNoNIC-FireSimRocketChipQuadCoreConfig-FireSimDDR3FRFCFSLLC4MBConfig

[172.30.2.254] Executing task 'instance_liveness'

[172.30.2.254] Checking if host instance is up...

[172.30.2.254] Executing task 'infrasetup_node_wrapper'

[172.30.2.254] Copying FPGA simulation infrastructure for slot: 0.

[172.30.2.254] Installing AWS FPGA SDK on remote nodes. Upstream hash: 2fdf23ffad944cb94f98d09eed0f34c220c522fe

[172.30.2.254] Unloading EDMA Driver Kernel Module.

[172.30.2.254] Copying AWS FPGA EDMA driver to remote node.

[172.30.2.254] Clearing FPGA Slot 0.

[172.30.2.254] Checking for Cleared FPGA Slot 0.

[172.30.2.254] Flashing FPGA Slot: 0 with agfi: agfi-06b9b705ab9af1238.

[172.30.2.254] Checking for Flashed FPGA Slot: 0 with agfi: agfi-06b9b705ab9af1238.

[172.30.2.254] Loading EDMA Driver Kernel Module.

[172.30.2.254] Starting Vivado hw_server.

[172.30.2.254] Starting Vivado virtual JTAG.

The full log of this run is:

/home/centos/firesim/deploy/logs/2018-11-14--16-42-35-infrasetup-CHLESMNB83XFPZUR.log

51

Page 52: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Let’s run a simulation!

$ firesim runworkload

FireSim Manager. Docs: http://docs.fires.im

Running: runworkload

Creating the directory: /home/centos/firesim/deploy/results-workload/2018-11-14--16-47-17-linux-uniform/

[172.30.2.254] Executing task 'instance_liveness'

[172.30.2.254] Checking if host instance is up...

[172.30.2.254] Executing task 'boot_switch_wrapper'

[172.30.2.254] Executing task 'boot_simulation_wrapper'

[172.30.2.254] Starting FPGA simulation for slot: 0.

52

Page 53: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Monitoring

53

Page 54: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Manually Interacting with simulations

54

$ ssh 172.30.2.254

$ screen –r fsim0

Page 55: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Shutting down simulations/Capturing results

• Frequently, we want to capture results from simulations automatically

• FireSim’s workload system supports this! • More on how to specify what to capture in a few mins

• But let’s continue with our example – it’s configured just to capture each simulated system’s console output, once the simulation is powered off. If we do so, the manager will produce:

...

FireSim Simulation Exited Successfully. See results in:

/home/centos/firesim/deploy/results-workload/2018-11-14--16-52-49-linux-

uniform/

The full log of this run is:

/home/centos/firesim/deploy/logs/2018-11-14--16-52-49-runworkload-

HSMYYMD4JBA6BHT5.log

55

Page 56: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Finally, get rid of our run farm EC2 instances

• Easy!

$ firesim terminaterunfarm

FireSim Manager. Docs: http://docs.fires.im

Running: terminaterunfarm

IMPORTANT!: This will terminate the following instances:

f1.16xlarges

[]

m4.16xlarges

[]

f1.2xlarges

['i-033959e7513fcf928']

Type yes, then press enter, to continue. Otherwise, the operation will be cancelled.

yes

Instances terminated. Please confirm in your AWS Management Console.

The full log of this run is:

/home/centos/firesim/deploy/logs/2018-11-14--17-00-01-terminaterunfarm-RGWY68L5ICAYQTA3.log

56

Page 57: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Custom Workloads: linux-uniform workload

• Previously, we relied on linux-uniform.json as our workload

• These jsons live in firesim/deploy/workloads/

57

Page 58: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Let’s look at a more interesting example: SPEC

58

• JSONs are used for two things: • Building rootfs/binary

combos automatically, with benchmark infrastructure built-in

• Deploying simulations with the manager (like we saw previously) and knowing which results to collect at the end

Page 59: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Building/Deploying SPEC on Rocket/BOOM

• Building: We don’t have enough time to go into detail here. Look at the spec17-% target in firesim/deploy/workloads/Makefile

• At a high-level, you can just run make spec17-intrate, and you’ll get 10 rootfs/linux image combos that will automatically run the spec benchmarks in parallel

• Deploying: Set your workload to spec17-intrate.json in config_runtime.ini, set the # f1.2xlarges to 10, select the hardware config you want to benchmark, then firesim launchrunfarm/infrasetup/runworkload as usual

• At the end, all your performance results live in one directory on the manager: firesim/deploy/results-workload/TIMESTAMP-spec17-intrate-HASH/

• Instances get automatically terminated one-by-one as benchmarks complete – essentially zero cost to running in parallel on EC2, since you pay by the machine-second anyway

59

Page 60: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Summary

• Don’t fret if you didn’t catch everything, everything we showed you today is documented in excruciating detail at http://docs.fires.im

• We learned how to: • Build FireSim FPGA images for a set of targets

• http://docs.fires.im/en/latest/Building-a-FireSim-AFI.html

• Setup/Launch a simulation, including choosing targets, configuring target software, and managing EC2 infrastructure • http://docs.fires.im/en/latest/Running-Simulations-Tutorial/Running-a-Single-Node-

Simulation.html

• Customize workloads: SPEC example • http://docs.fires.im/en/latest/Advanced-Usage/Workloads/Defining-Custom-

Workloads.html • http://docs.fires.im/en/latest/Advanced-Usage/Workloads/SPEC-2017.html

60

Page 61: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Modifying the Target Design

FireSim Tutorial RISC-V Summit 2018

Speaker: David Biancolin

Page 63: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Important Submodules & Aliases

Submodules:

MIDAS: sim/midas

FIRRTL: sim/firrtl

Firechip: target-design/firechip

Rocket-Chip: target-design/firechip/rocket-chip

Chisel: target-design/firechip/rocket-chip/chisel

Aliases for this presentation:

FSCALA -> sim/src/main/scala

FCPP -> sim/src/main/cc/

MSCALA -> {in MIDAS}/src/main/scala/midas

MCPP -> {in MIDAS}/src/main/cc/

63

Page 64: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

• Represent the target as a synchronous-dataflow graph

• FireSim stitches together a larger graph to model a datacenter

Target as a Dataflow Graph

64

SW Model To be hosted on a CPU

Transformed RTL Model Derived from SoC RTL

To be hosted on an FPGA

Abstract RTL Model Cannot be taped out

To be hosted on an FPGA

Page 65: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Expressing the Target Graph in FireSim

• Target graph is implicit, currently not encoded in one place.

• There is only one transformed-RTL model • The CHIRRTL you passed to midas.Compiler

• In Simulation Mapping • 2+ entry queue channels are instantiated on Decoupled IO

• 1-cycle pipe channels are used everywhere else.

• SW and Custom RTL models are bound to IO bundles using midas.core.Endpoint

• Matches on a IO type, generates a model or endpoint

65

Page 66: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Parameterizing the RTL-Transformed Model

• FireSim provides four base Rocket-Chip “cakes” that form the base RTL-transformed model • DESIGN make variable

Two types of parameters:

1. Target: (see TargetConfigs.scala) Configure the generated RTL model • # of Cores, L1 $ Sizes, # of memory channels • TARGET_CONFIG make variable

2. Sim/Platform: (see SimConfigs.scala) Configure MIDAS, SW/Custom RTL model generation

• Configs for endpoints, Type of DRAM model, enable debug features etc… • PLATFORM_CONFIG make variable

66

Page 67: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Swapping out the RTL-Transformed Model

Two common approaches, based on the type of Target:

1. Project-template-based Rocket-chip targets (FireChip) • Point at the FireChip submodule at your fork

NB: MIDAS depends on Rocket-Chip; Rocket-Chip provides Chisel submodule

2. Everything else (ex. Your custom CGRA etc..) • Add scala sources for your target (modify sim/build.sbt)

• Define an App (Generator) that calls midas.Compiler(…)

• Add a target-specific makefrag (ex. sim/src/main/makefrag/firesim/Makefrag)

See the MIDAS examples for an simple demonstration of this.

67

Page 68: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

On Abstract RTL & SW Models.

FireSim & MIDAS provide the following models.

Abstract-RTL :

1. DefaultIOModel ($MSCALA/widgets/PeekPokeIO.scala) • Bound to I/O unbound by custom endpoints

2. DRAM & LLC model ($MSCALA/models/dram/MidasMemModel.scala) • Bound to AXI4 slave interfaces

SW-models:

1. NIC

2. SerialIO

3. Block-device

4. UART

These are all FireSim provided: See $FSCALA/endpoints and $FCPP/endpoints

68

Page 69: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Compile-time vs Runtime Configuration

Software models and custom RTL models are configured twice:

1. Compile/Generation time

2. Runtime

On Compile-time configuration:

• RTL-generation for Custom-RTL models (FPGA resynthesis required) • How: using different a PLATFORM_CONFIG; hacking on the Chisel sources

• CPP compilation for SW models (no FPGA resynthesis) • How: changing model & driver CPP sources.

69

Page 70: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Runtime Configuration

• Pass plus args to the simulator • SW models: does what you expect

• Custom-RTL models: driver writes to model’s configuration registers

• Ex. +blkdev_write_latency,

• How: Modify deploy/config_runtime.ini • Change ini-exposed variables

• Specify a customruntimeconfig in deploy/config_hwdb.ini

70

Page 71: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Runtime Configuration – config_runtime.ini

71

Page 72: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

HWDB entry: None -> Uses a default runtime.conf emitted at generation time

Runtime.conf -> plusArgs you want to pass to the simulator

Runtime Configuration – config_hwdb.ini

72

Page 73: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Adding a Custom Model – Chisel-side

1. Expose a Record/Bundle in your Target RTL’s IO with a specific type

2. Extend midas.Endpoint to: 1. Match on the correct Chisel type

• Implement Endpoint.matchType(Data)=> Boolean

2. Generate your SW model’s endpoint, or custom RTL-model Implement • Implement Endpoint.widget(Parameters)=> EndpointWidget

3. Add your endpoint to the EndpointMap Field in your PLATFORM_CONFIG

See:

• $FSCALA/endpoints/UARTWidget.scala

• $FSCALA/endpoints/BlockDevWidget.scala

• $FSCALA/firesim/SimConfigs.scala

73

Page 74: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Adding a Custom Model – Driver-side

1. Define an endpoint driver class • Use simif::{read, write, push, pull} to interact with FPGA-hosted endpoint

2. Register the endpoint driver in firesim_top.cc

See:

• $FCPP/endpoints/uart.{cc, h}

• $FCPP/endpoints/blockdev.{cc, h}

• $FCPP/firesim/firesim_top.cc

74

Page 75: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

DRAM Timing Models – Generation Time Conf

• Parameterized using the PLATFORM_CONFIG make variable • See $FSCALA/firesim/SimConfigs.scala

• Abstract timing models: • Latency-Bandwidth Pipe • Bank Conflict

• DDR3 Timing models: • First-Come First-Served (FCFS) • First-Ready, First-Come First-Served (FR-FCFS)

• All timing models can be composed with a LLC model.

• To be presented at FPGA2019

75

Page 76: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

DRAM Timing Models – Runtime Conf

• DDR timing models expose >30 runtime arguments • Many illegal settings, need to validate against generated model instance

• Ex. Setting that would overflow a timing register.

• Many invalid settings, need to validate against DDR3 part database • Ex. A non-standard JEDEC DDR3 timing, timings are density dependent

• A default runtime-configuration is emitted at generation time • See generated-src/f1/${TARGET_TUPLE}/runtime.conf

• To generate a custom runtime-config: • In sim/: make conf

76

Page 77: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

DRAM Timing Models – custom runtime.conf

77

Page 78: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Debugging, Testing, and Profiling

FireSim Tutorial RISC-V Summit 2018 Speaker: Alon Amid

Page 79: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Agenda

• FireSim Debugging Using Software Simulation

• FireSim Debugging Using Integrated Logic Analyzers

• Advanced FPGA-Based Debugging and Profiling Features

• The FireSim Vision for Debugging and Profiling

79

Page 80: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Debugging Using Software Simulation

80

Modifying internal simulated target hardware, no new external endpoints

Target-Level SW Simulation

What Am I

doing?

Simulator-Level SW Simulation

Adding/Modifying new interfaces and endpoints,

modifying simulation models

Midas-Level SW Simulation

FPGA-Level SW Simulation

My FireSim Simulation Is Not Working

Page 81: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Debugging Using Software Simulation

81

Target-Level Simulation

• Software Simulation

• Target Design Untransformed

• No Host-FPGA interfaces

MIDAS-Level Simulation

• Software Simulation

• Target Design Transformed by MIDAS

• Host-FPGA interfaces/shell emulated using abstract models

FPGA-Level Simulation

• Software Simulation

• Target Design Transformed by MIDAS

• Host-FPGA interfaces/shell simulated by the FPGA tools

Page 82: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

82

Remember this slide from

Sagar’s presentation?

Debugging Using Software Simulation

Page 83: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

83

RTL Design

Physical DRAM

100ns

latency

<- Resp Queue

Req Queue ->

DRAM Model

100

cycle latency

Mem Channel

“FAME-1” Transformed RTL Design

Target-Level SW Simulation

FPGA Fabric

Debugging Using Software Simulation

Page 84: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

84

RTL Design

Physical DRAM

100ns

latency

<- Resp Queue

Req Queue ->

DRAM Model

100

cycle latency

Mem Channel

“FAME-1” Transformed RTL Design

MIDAS-Level SW Simulation

FPGA Fabric

Abstract Model

Target-Level SW Simulation

Debugging Using Software Simulation

Page 85: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

85

RTL Design

Physical DRAM

100ns

latency

<- Resp Queue

Req Queue ->

DRAM Model

100

cycle latency

Mem Channel

“FAME-1” Transformed RTL Design

MIDAS-Level SW Simulation

FPGA Fabric

Abstract Model

Target-Level SW Simulation

FPGA-Level SW Simulation

Debugging Using Software Simulation

Page 86: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Debugging Using Software Simulation

86

Level Waves VCS Verilator XSIM

Target Off ~5 kHz ~5 kHz N/A

Target On ~1 kHz ~5 kHz N/A

MIDAS Off ~4 kHz ~2 kHz N/A

MIDAS On ~3 kHz ~1 kHz N/A

FPGA On ~2 Hz N/A ~0.5 Hz

Page 87: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

• Target-Level Simulation In firesim/target-design/firechip/vsim

$ make DESIGN=<YourDesign> CONFIG=<YourConfig> debug

$ ./simv-<YourDesign>-<YourConfig>-debug +max-cycles=50000 +vcdplusfile=<WaveformFileName>.vpd ../tests/<InputTest>.riscv

• MIDAS-level Simulation In firesim/sim

$ make <verilator|vcs>-debug

$ make EMUL=<verilator|vcs> DESIGN=FireSimNoNIC run-asm-test-debug

• FPGA-Level Simluation In firesim/sim

$ make xsim

$ make xsim-dut <VCS=1> &

$ make run-xsim SIM_BINARY=<PATH/TO/DRIVER/BINARY/FOR/TARGET/TO/RUN>

**FPGA-level simulation currently does not support DMA_PCIS (which is used for the NIC interface)

87

Debugging Using Software Simulation

Page 88: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

88

“Everything looks OK in SW simulation, but there is still a bug somewhere”

“My bug only appears after hours of running Linux on my simulated HW”

When SW Simulation Debugging is Not Enough…

Page 89: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Debugging Using Integrated Logic Analyzers

Integrated Logic Analyzers (ILAs)

• Common debugging feature provided by FPGA vendors

• Continuous recording of a sampling window of up to 1024 cycles. • Stores recorded samples in BRAM.

• Realtime trigger-based sampled output of probed signals • Multiple probes ports can be combined to a single trigger

• Trigger can be in any location within the sampling window

• On the AWS F1-Instances, ILA interfaced through a debug-bridge and server

89

From: aws-fpga cl_hello_world example

Page 90: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Debugging Using Integrated Logic Analyzers

AutoILA – Automation of ILA integration with FireSim

• Annotate requested signals and bundles in the Chisel source code

• Automatic configuration and generation of the ILA IP in the FPGA toolchain

• Automatic expansion and wiring of annotated signals to the top level of a design using a FIRRTL transform.

• Remote waveform and trigger setup from the manager instance

90

Page 91: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Debugging using Integrated Logic Analyzers

91

Pros:

• No emulated parts – what you see is what’s running on the FPGA

• FPGA simulation speed - O(MHz) compared to O(KHz) in software simulation

• Real-time trigger-based

Cons:

• Requires a full build to modify visible signals/triggers (takes several hours)

• Limited sampling window size

• Consumes FPGA resources

Page 92: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Advanced FPGA-Based Debugging Features

• FPGA-based simulation enables high simulation speed, which enables advanced debugging and profiling tools.

• Reach “deep” in simulation time, and obtain large levels of coverage and data

• Examples: • TracerV

• Synthesizable Assertions

92

Simulated Time

SW Simulation

FPGA-based Simulation

Page 93: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

TracerV

93

• Out-of-band full instruction execution trace

• MIDAS widget connected to target trace ports

• By default, large amount of info wired out of Rocket/BOOM, per-hart, per-cycle: • Instruction Address • Instruction • Privilege Level • Exception/Interrupt Status, Cause

• TracerV can rapidly generate several TB of data.

Page 94: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

TracerV

• Out-of-Band software-level debugging and profiling: Profiling does not perturb execution

• Useful for kernel and hypervisor level cycle-sensitive profiling

• Examples: • Co-Optimization of NIC and Network Driver • Keystone Secure Enclave Project • High-performance hardware-specific code

(supercomputing?)

• Requires large-scale analytics for insightful profiling and optimization.

94

Page 95: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

TracerV

95

Pros:

• Out-of-Band (no impact on workload execution)

• SW-centric method

• Large amounts of data

Cons:

• Slower simulation performance (40 MHz)

• No HW visibility

• Large amounts of data

Page 96: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Synthesizable Assertions

• Assertions – rapid error checking embedded in HW source code. • Commonly used in SW Simulation

• Halts the simulation upon a triggered assertion. Represented as a “stop” statement in FIRRTL

• By defaults, emitted as non-synthesizable SV functions ($fatal)

96

From: Trillion-Cycle Bug Finding Using FPGA-Accelerated Simulation Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, Krste Asanović. ADEPT Winter Retreat 2018

From: BROOM: An open-source Out-of-Order processor with resilient low-voltage operation in 28nm CMOS, Christopher Celio, Pi-Feng Chiu, Krste Asanovic, David Patterson and Borivoje Nikolic. HotChip 30, 2018

Page 97: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Synthesizable Assertions

• Synthesizable Assertions on FPGA • Transform FIRRTL stop statements into synthesizable logic

• Insert combinational logic and signals for the stop condition arguments

• Insert encodings for each assertion (for matching error statements in SW)

• Wire the assertion logic output to the Top-Level

• Generate timing tokens for cycle-exact assertions

• Assertion checker records the cycle and halts simulation when assertion is triggered

97

Page 98: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Synthesizable Assertions

• Previously a research feature presented in DESSERT [1]

• Helped Identify BOOM bugs trillions of cycles into execution

• Integrated into FireSim in the latest release.

98

[1] Kim, D., Celio, C., Karandikar, S., Biancolin, D., Bachrach, J. and Asanovic, K., DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of cycles. The International Conference on Field-Programmable Logic and Applications (FPL), 2018

Page 99: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Synthesizable Assertions

99

Pros:

• FPGA simulation speed

• Real-time trigger-based

• Consumes small amount of FPGA resources (compared to ILA)

• Key signals have pre-written assertions in re-usable components/libraries

Cons:

• Low visibility: No waveform/state

• Assertions are best added while writing source RTL rather than during “investigative” debugging

Page 100: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

The FireSim Vision: Speed and Visibility

• High-performance simulation

• Full application workloads

• Tunable visibility & resolution

• Unique data-based insights

100

Page 101: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Speed and Visibility – How do we get there?

• Integration of additional DESSERT features • Hardware state snapshot extraction from FPGA to software simulation using

arbitrary software triggers.

• Easy-to-use information transfer between execution traces and software hooks.

• Data-Processing pipeline for insights from large-scale traces • O(GB)-O(TB) out-of-band logs require big-data analysis methods for insights

(Golden model comparison is one potential method).

• Potential unique insights: can collect globally-cycle-accurate out-of-band instruction traces from a networked datacenter simulation.

101

Page 102: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Summary

• Debugging Using Software Simulation (docs) • Target-Level

• MIDAS-Level

• FPGA-Level

• Debugging Using Integrated Logic Analyzers (docs)

• Advanced Debugging and Profiling Features • TracerV (docs)

• Assertion Synthesis (docs)

• FireSim Debugging and Profiling Future Vision

102

Page 103: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Project Management, Contributing & Conclusion

FireSim Tutorial RISC-V Summit 2018

Speaker: David Biancolin

Page 104: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Development Model – Branches

• FireSim and its key submodules (FireChip, MIDAS, aws-fpga) have two write-controlled branches

• master • Stable

• Contains the most recent release of FireSim

• Can replicate ISCA2018 paper results

• dev • Main development branch

• Base branch for feature PRs and non-critical bug fixes

• Every merge commit has globally shared, regenerated AGFIs

104

Page 105: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Development Model – Merges & Releases

• To merge a feature branch into dev: • Pass all MIDAS-level tests (in sim/: `make test`)

• Regenerate all AGFIs if RTL changes were made

• Associated submodule PRs should be merged; submodule pointers should point at merge commit on dev

• Changes to dev will be summarized in a “Dev-tracking PR” • PR description will become the Changelog entry

• Dev will be periodically merged into master and tagged as a release • Changelog will be updated

• Submodule dev branches will also be merged into master

105

Page 106: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

How to Get Help

• FireSim and MIDAS adds a bunch of additional complexity on top of Rocket-Chip, Chisel, FIRRTL, Scala… • But there’s a team of us here to help!

• For general questions: • Post on the google groups: https://groups.google.com/forum/#!forum/firesim

• Check out the docs: https://docs.fires.im/en/latest/

• (Nascent) FAQ section: https://docs.fires.im/en/latest/Advanced-Usage/FAQs.html

106

Page 107: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Reporting Issues

• If you think you’ve encountered a bug (in FireSim or MIDAS): • Open an issue on FireSim’s github page

• Try your best to open an issue in the most relevant repo • Please don’t open duplicates across multiple repos!

• If you are unsure, default to FireSim and we’ll be happy to redirect you

107

Page 108: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Contributing

• We welcome PRs!

• Issues are labeled with “Good First Issue”

• Ask on the google groups for suggestions

108

Page 109: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Regarding Future Work

• There are many research projects underway at UCB-BAR using and improving FireSim

• Many will be merged into mainline FireSim

• There will be API breakages -> We apologize in advance

• Watch the changelogs, dev tracking PRs

Our goal: make FireSim the standard open way to do fast full-system simulation of Chisel-designed systems, especially ones based on Rocket-Chip.

Focus on:

• Generality (eg. support multi-clock target designs)

• Robustness (eg. More extensive testing in RTL-simulation; CI; regressions)

• Debuggability (eg. DESSERT features) 109

Page 110: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Two more talks from Berkeley Architecture Research

110

Both talks open-sourcing their work and releasing FireSim images!

Page 111: Scalable FPGA-accelerated Cycle-Accurate Hardware ... · Simulation(s) CPU Host PCIe f1.16xlarge ) Server Simulations Model. Now, let’s build a datacenter-scale FireSim simulation!

Conclusion

Key links: • Website: https://fires.im • Google Groups: https://groups.google.com/forum/#!forum/firesim • Docs: https://docs.fires.im/en/latest/ • Twitter: @firesimproject Acknowledgements: The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, RISE Lab sponsor Amazon Web Services, and ADEPT Lab affiliates Google, Huawei, Siemens, SK Hynix, and Seagate. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

111