Computer Architecture Lab at RAMP Retreat, June 2009 The Open Source ProtoFlex Simulator Eric S. Chung, Michael K. Papamichael, James C. Hoe, Babak Falsafi, Ken Mai


Page 1: The Open Source ProtoFlex Simulator

Computer Architecture Lab at

RAMP Retreat, June 2009

The Open Source ProtoFlex Simulator

Eric S. Chung, Michael K. Papamichael, James C. Hoe, Babak Falsafi, Ken Mai

Page 2: The Open Source ProtoFlex Simulator

The ProtoFlex Simulator

• History

– Project started (circa 2007) to build scalable, full-system multiprocessor simulators using FPGAs

• Key Features

– Functional simulator for N-way UltraSPARC III server (~50-90 MIPS)

– Using hybrid simulation, runs real server apps + Solaris OS

– Employs multithreading to virtualize # CPUs per FPGA core
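
To illustrate how one multithreaded core can stand in for several target CPUs, here is a minimal Python sketch. It assumes plain round-robin selection among contexts that still have work; the slides do not specify the core's actual scheduling policy, and the function name, context ids, and instruction strings are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def interleave(contexts: Dict[int, List[str]], cycles: int) -> List[Tuple[int, int, str]]:
    """Issue one instruction per cycle, rotating across CPU contexts.

    `contexts` maps a (hypothetical) context id to its pending instruction
    stream. Returns (cycle, context_id, instruction) tuples in issue order.
    """
    # Per-context queues of not-yet-issued instructions.
    queues: Dict[int, Deque[str]] = {cid: deque(ins) for cid, ins in contexts.items()}
    # Round-robin order over contexts that still have work (an assumption).
    ready: Deque[int] = deque(sorted(cid for cid, q in queues.items() if q))
    issued: List[Tuple[int, int, str]] = []
    for cycle in range(cycles):
        if not ready:
            break
        cid = ready.popleft()
        issued.append((cycle, cid, queues[cid].popleft()))
        if queues[cid]:
            ready.append(cid)  # context goes to the back of the rotation
    return issued


if __name__ == "__main__":
    streams = {0: ["add", "ld", "st"], 1: ["sub", "bne"], 2: ["or"]}
    for cycle, cid, ins in interleave(streams, cycles=8):
        print(f"cycle {cycle}: context {cid} issues {ins}")
```

In this sketch a single issue slot serves every context, so per-context throughput falls as contexts are added while the hardware cost of the pipeline stays fixed.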

[Figures: Hybrid Simulation; Virtualization]

Page 3: The Open Source ProtoFlex Simulator

Open Sourcing ProtoFlex

• Why open source?

– Demonstration of FPGAs as viable architecture research vehicle

– Facilitate adoption of hybrid simulation & host multithreading

– Encourage building on top of our work

• What are we releasing?

– Bluespec source HDL, Verilog, and pre-generated netlists for SPARCV9 CPU model + interfaces

– XUPV5 Reference Design for EDK 10.1

– Virtutech Simics plug-ins for hybrid simulation

– Top-level SW controller, user command-line interface

– Documentation through online wiki

Page 4: The Open Source ProtoFlex Simulator

Outline

• Motivation

• The ProtoFlex Simulator

– High level components

– UltraSPARC core model

• Using ProtoFlex

• XUPV5 Reference Design

• Distribution Details

Page 5: The Open Source ProtoFlex Simulator

The ProtoFlex Simulator

• User perceives familiar SW-like UltraSPARC III simulator

– Software user interface similar to Simics

– Applications load directly from Simics checkpoints

– Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring


Page 6: The Open Source ProtoFlex Simulator

The ProtoFlex Simulator


[Figure: system overview. Labels: FPGA (FPGA core, main memory, PowerPC or MicroBlaze), Linux PC (PFMON, Simics for I/O), Ethernet, user interface]

Page 7: The Open Source ProtoFlex Simulator

Our UltraSPARC III Core Model

• ISA Specifications

– 64-bit SPARCV9 ISA + US III extensions

– 8 register windows, 4 global register files

– 512-entry D-TLB, 128-entry I-TLB

• Implementation

– 14-stage, multi-threaded pipeline, switch context on each cycle

– On Virtex-5: ~148 MHz after XST synthesis, 100 MHz placed & routed

– Parameterized non-blocking caches

– FP + rare MMU instructions are SW-emulated by a nearby MicroBlaze

– Mirrors the Virtutech Simics model 100%
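
The division of labor in hybrid simulation can be pictured as a simple dispatch decision. The Python sketch below is illustrative only: it assumes three hosts (the FPGA pipeline, MicroBlaze software emulation, and Simics on the PC) and uses hypothetical class names such as `"fp"`, `"rare_mmu"`, and `"io"`. The slides state only that FP and rare MMU instructions are software-emulated and that Simics handles I/O.

```python
from enum import Enum


class Host(Enum):
    FPGA_PIPELINE = "FPGA pipeline"
    MICROBLAZE = "MicroBlaze software emulation"
    SIMICS = "Simics on the Linux PC"


# Hypothetical operation classes, used only for this illustration.
SOFTWARE_EMULATED = {"fp", "rare_mmu"}  # per the slide: FP + rare MMU instructions
PC_SIMULATED = {"io"}                   # per the slides: Simics handles I/O


def pick_host(op_class: str) -> Host:
    """Return where an operation class would execute under hybrid simulation."""
    if op_class in PC_SIMULATED:
        return Host.SIMICS
    if op_class in SOFTWARE_EMULATED:
        return Host.MICROBLAZE
    # Everything else is assumed to run directly in the FPGA core.
    return Host.FPGA_PIPELINE


if __name__ == "__main__":
    for op in ("alu", "load", "fp", "rare_mmu", "io"):
        print(f"{op:9s} -> {pick_host(op).value}")
```

The point of the split is that the common case stays in hardware, while infrequent or complex behavior is handled in software where it is cheaper to implement.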

[Figure: core pipeline block diagram. Blocks: Context Scheduler; I-Fetch Address Generate; I-Fetch Tag Check; I-TLB Stages 1-2; US III Decoder; 64-bit ALU Stages 1-2; D-Cache Address Generate; D-Cache Tag Check; D-TLB Stages 1-3; Multi-Cycle Instruction Unit; Writeback; Integer RF (BRAM); Nonblocking I-cache (BRAM); Nonblocking D-cache (BRAM); Arbiter to DDR Memory]

Page 8: The Open Source ProtoFlex Simulator

Core Design Statistics

• Runs at 100 MHz on Virtex-5 (V5)

– Synthesizes up to 148MHz using standard tools (ISE XST)

• Logic usage

– 23.5 KLUTs (11.3% LX330T)

• BRAM usage

– 120 BRAMs for 16-context configuration (37% LX330T)

• Future optimizations

– Paging structures out to SRAM or DRAM can significantly reduce BRAM usage

– Will be released in future updates

Page 9: The Open Source ProtoFlex Simulator

Outline

• Motivation

• The ProtoFlex Simulator

• Using ProtoFlex

• XUPv5 Reference Design

• Distribution


Page 10: The Open Source ProtoFlex Simulator

Using ProtoFlex

• Add passive monitors

– Counters, histograms

– Roll your own


• Trace-based simulation

– Collect dynamic traces

– Feed traces to functional-first timing model

• Sampled Program Monitoring

– Use MicroBlaze (or PowerPC) to monitor core/memory state

– Non-intrusive profiling without changes to target SW
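
As a software analogy for the passive monitors listed above, the following Python sketch observes a stream of trace records and maintains counters and a histogram without altering the stream. The record layout `(context_id, opcode, latency_cycles)`, the class name, and the bucket width are assumptions made for illustration; they are not the ProtoFlex trace format.

```python
from collections import Counter
from typing import Iterable, Tuple

# A trace record here is a hypothetical (context_id, opcode, latency_cycles) tuple.
TraceRecord = Tuple[int, str, int]


class PassiveMonitor:
    """Observes trace records without modifying them (counters + histogram)."""

    def __init__(self, bucket_width: int = 4) -> None:
        self.bucket_width = bucket_width
        self.opcode_counts: Counter = Counter()
        self.latency_histogram: Counter = Counter()

    def observe(self, record: TraceRecord) -> None:
        _, opcode, latency = record
        self.opcode_counts[opcode] += 1
        # Bucket latencies by lower bound: 0-3 -> 0, 4-7 -> 4, 8-11 -> 8, ...
        bucket = (latency // self.bucket_width) * self.bucket_width
        self.latency_histogram[bucket] += 1


def run(trace: Iterable[TraceRecord]) -> PassiveMonitor:
    """Feed every record of a trace to a fresh monitor and return it."""
    monitor = PassiveMonitor()
    for record in trace:
        monitor.observe(record)
    return monitor


if __name__ == "__main__":
    trace = [(0, "ld", 1), (1, "ld", 5), (0, "add", 1), (1, "st", 9)]
    m = run(trace)
    print(dict(m.opcode_counts))
    print(dict(m.latency_histogram))
```

The same records could instead be fed to a timing model, which is the trace-based usage the slide describes.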

[Figure labels: counters, histogram trackers, timing model, FPGA hard/soft core (PowerPC or MicroBlaze)]

Page 11: The Open Source ProtoFlex Simulator

Applications of ProtoFlex

• Examples

– Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS’09]

– Real-time stack trace profiling

– CMP interconnect model (in progress)

– Realistic CPU traffic generators (in progress)

• … running real 16-CPU server workloads

– Oracle TPC-C, IBM DB/2 TPC-C, TPC-H, SPEC2K

[Figure: Piranha CMP cache (first-order timing model); statistics + warmed coherency & tag states]

Page 12: The Open Source ProtoFlex Simulator

What does the RTL look like?

• We use Bluespec SystemVerilog (BSV), a high-level, synthesizable HDL

– 4-8 week learning curve for typical HDL users

– Once learned, easier to read/modify than conventional RTL

– Requires BSV compiler (free for academics)

– Paper in MEMOCODE’09 describes BSV coding/validation of core

• Sample code:

2-stage ALU:

    rule split_ALU_pipeline (True);
        …
        p1 = piperegs[DECODE];
        piperegs[ALU1] <= doALUStage1(p1, alu_ifc);   // registered between the two ALU stages
        p2 = piperegs[ALU1];
        piperegs[ALU2] <= doALUStage2(p2, alu_ifc);
        …
    endrule

1-stage ALU:

    rule merged_ALU_pipeline (True);
        …
        p1 = piperegs[DECODE];
        p_tmp1 = doALUStage1(p1, alu_ifc);            // intermediate value, no register
        p_tmp2 = doALUStage2(p_tmp1, alu_ifc);
        piperegs[ALU] <= p_tmp2;
        …
    endrule
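
The effect of merging the two ALU stages can be checked with a small software model. This Python sketch is an analogy, not a translation of the BSV: the stage functions are placeholders, and `run_pipeline` is a hypothetical helper that clocks a linear pipeline with one register after each stage. It shows that the split and merged versions compute the same results and differ only in latency.

```python
from typing import Callable, List, Optional


def do_alu_stage1(x: int) -> int:
    return x + 1  # placeholder logic for the first ALU stage


def do_alu_stage2(x: int) -> int:
    return x * 2  # placeholder logic for the second ALU stage


def run_pipeline(stages: List[Callable[[int], int]],
                 inputs: List[int]) -> List[Optional[int]]:
    """Clock a linear pipeline with one register after each stage.

    Returns the value of the last pipeline register after every cycle
    (None while that register holds a bubble).
    """
    regs: List[Optional[int]] = [None] * len(stages)
    outputs: List[Optional[int]] = []
    # Feed the inputs, then enough bubbles (None) to drain the pipeline.
    for item in list(inputs) + [None] * len(stages):
        prev = [item] + regs[:-1]  # values each stage reads this cycle
        regs = [None if v is None else f(v) for f, v in zip(stages, prev)]
        outputs.append(regs[-1])
    return outputs


if __name__ == "__main__":
    data = [1, 2, 3]
    split = run_pipeline([do_alu_stage1, do_alu_stage2], data)
    merged = run_pipeline([lambda x: do_alu_stage2(do_alu_stage1(x))], data)
    print("2-stage:", split)
    print("1-stage:", merged)
```

In hardware the trade-off is the usual one: the merged stage saves a pipeline register and a cycle of latency but lengthens the combinational path, which can lower the achievable clock frequency.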

Page 13: The Open Source ProtoFlex Simulator

Other Simulator Features

• Changing Core Parameters

– Number of CPU contexts

– Cache sizes

– Merge/split pipeline stages

– Enable/disable modules for profiling & debugging

– Clock frequency (tested @ 10 MHz – 100 MHz)

– Set optimal LUTRAM size (16 = V2P, 64 = V5)

– Choose LUTRAMs or BRAMs for any CPU state

• System Parameters

– UDP or TCP/IP (for PFMON-to-FPGA communication)

– Target platform: XUPv5 or BEE2


Page 15: The Open Source ProtoFlex Simulator

Platform Release: XUPV5

• Why XUPv5?

– Inexpensive (~$750), easily accessible

– Standard tool flows (EDK, ISE)

– Reference design portable to other platforms – just drop in our ‘pcores’

• Supporting other platforms

– Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP)

– Plan to release with future updates


Page 16: The Open Source ProtoFlex Simulator

Required Equipment

[Figure: FPGA board (BlueSPARC) connected over Ethernet to a Linux PC (PFMON + Simics)]


Page 18: The Open Source ProtoFlex Simulator


XUPv5 Overview

• Virtex-5 LX110T

• DDR2 Memory

– up to 2GB

• 1MB SRAM

• 1Gbps Ethernet

• 3Gbps SATA

• Serial Port

Page 19: The Open Source ProtoFlex Simulator

Reference Design Block Diagram

[Figure: XUPv5 (LX110T) reference design. Blocks: BlueSPARC, MicroBlaze, PLB, Ethernet, serial port, BRAMs, multi-port memory controller (DRAM), SRAM controller (SRAM)]

Page 20: The Open Source ProtoFlex Simulator

BlueSPARC

• EDK IP core

– connects to PLB & NPI

• Runs @ 100MHz

• 4 CPU contexts

• 64KB I&D L1 caches


Page 22: The Open Source ProtoFlex Simulator

BRAM

• 81% utilization

– Core 51% (76 out of 148)

– Rest 30% (45 out of 148)
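
The utilization figures above can be reproduced from the BRAM counts on the slide. The short Python check below uses only those counts (76, 45, and 148); the variable and function names are chosen here for illustration.

```python
TOTAL_BRAMS = 148  # from the slide: "out of 148"
CORE_BRAMS = 76    # BRAMs used by the core
REST_BRAMS = 45    # BRAMs used by the rest of the design


def percent(used: int, total: int) -> float:
    """Utilization as a percentage of the total."""
    return 100.0 * used / total


core = percent(CORE_BRAMS, TOTAL_BRAMS)
rest = percent(REST_BRAMS, TOTAL_BRAMS)
combined = percent(CORE_BRAMS + REST_BRAMS, TOTAL_BRAMS)

print(f"core {core:.1f}%  rest {rest:.1f}%  combined {combined:.1f}%")
```

The slide's 51%, 30%, and 81% correspond to these values with the fractional part dropped; computed directly, the combined figure (121 of 148 BRAMs) is closer to 82%.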


Page 24: The Open Source ProtoFlex Simulator

Ethernet

• 4 MB/sec bandwidth

• 350 usec RTT latency

• Socket abstraction

– using lwIP RAW interface


Page 26: The Open Source ProtoFlex Simulator

DDR2 Memory Controller (Multi-Port Memory Controller)

• 1.5GB/s peak BW

• 115ns latency

• Multiple ports/interfaces


Page 29: The Open Source ProtoFlex Simulator


Linux PC

• Software requirements

– SuSE Linux 10.1

– CAD tools + licenses (Bluespec compiler, Xilinx ISE/EDK)

– Simics 3.0.22

– Hybrid simulation plug-in modules

– ProtoFlex MONitor tool (PFMON)

Page 30: The Open Source ProtoFlex Simulator


Linux PC

• Runs PFMON (ProtoFlex MONitor)

– Orchestrates communication between Simics & BlueSPARC

– Provides a CLI to the simulator (like the Simics console)

• Runs Simics

– Handles I/O, plus FPGA core and memory initialization

[Figure: BlueSPARC connected to the Linux PC running PFMON and Simics]

Page 31: The Open Source ProtoFlex Simulator


From RTL to Running System

• Bluespec → Verilog

– Bluespec compiler

– ~30 minutes

• Verilog → Bitstream

– Xilinx EDK

– ~3 hours

• Bitstream → Working System

– Stream memory image over Ethernet

– ~5 minutes (for 512MB image)
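
The quoted times imply a few derived numbers worth checking. This Python sketch uses only figures from the slides (30 minutes, 3 hours, 5 minutes, a 512MB image, and the 4 MB/sec Ethernet bandwidth from the Ethernet slide); the variable names are illustrative, and the step times are approximate.

```python
COMPILE_MIN = 30        # Bluespec to Verilog, "~30 minutes"
BUILD_MIN = 3 * 60      # Verilog to bitstream, "~3 hours"
LOAD_MIN = 5            # bitstream to working system, "~5 minutes"

IMAGE_MB = 512          # memory image size from the slide
LINK_MB_PER_S = 4       # Ethernet bandwidth quoted on the Ethernet slide

total_min = COMPILE_MIN + BUILD_MIN + LOAD_MIN
effective_mb_per_s = IMAGE_MB / (LOAD_MIN * 60)   # implied image-streaming rate
ideal_load_s = IMAGE_MB / LINK_MB_PER_S           # load time at the quoted link rate

print(f"end-to-end: ~{total_min} minutes")
print(f"effective streaming rate: ~{effective_mb_per_s:.2f} MB/s")
print(f"load time at {LINK_MB_PER_S} MB/s: ~{ideal_load_s:.0f} s")
```

Under these figures the FPGA build dominates the turnaround time, and the image load runs at under half of the quoted Ethernet bandwidth, which suggests protocol or software overhead rather than the link itself is the limit.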

[Figure: three-step flow from Bluespec code (1) to Verilog code (2) to bitstream (3) to working system]

Page 32: The Open Source ProtoFlex Simulator

XUPv5 Caveats

• Limitations

– Due to BRAM limits, only 4 CPU contexts (compared to 16 on BEE2’s V2P70)

– Slow PC-to-FPGA latency via LwIP/Ethernet (1 key/sec in terminal)

– Limited DDR2 RAM capacity (up to 2GB)

• However…

– Our XUPv5 design still has much room for improvement

– Useful for getting familiar with the ProtoFlex tools

– Can still run multithreaded workloads + perform monitoring

– Easily portable to more powerful platforms



Page 34: The Open Source ProtoFlex Simulator

Distribution Details

• Release dates

– All source code will be available in 1 week (June)

– Send email to [email protected] for more info

• Licensing

– GNU GPL

• Support

– Documentation: www.ece.cmu.edu/~protoflex

– Support mailing list/forum will be available soon


Page 35: The Open Source ProtoFlex Simulator

Thank you!

• Acknowledgments

– Funding for this work provided by: NSF CCF-0811702, NSF CNS-0509356, C2S2 Marco Center, and Sun Microsystems

– Paul Hartke & Xilinx for FPGA donations & email support

• Topics & References

– Hybrid Simulation and FPGA Core Multithreading: "A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs" [FPGA’08]

– Functional-First, First-Order Timing Model for a CMP Cache: "ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs" [TRETS’09]

– FPGA Core Design & Validation Using Flight Data Recorder: "Implementing a High-performance Multithreaded Microprocessor: A Case Study in High-level Design and Validation" [MEMOCODE’09]