RAMP Retreat Summer 2006 Break Session Leaders & Questions Greg Gibeling, Derek Chiou, James Hoe, John Wawrzynek & Christos Kozyrakis 6/21/2006


Page 1: RAMP Retreat Summer 2006

RAMP Retreat Summer 2006

Break Session Leaders & Questions
Greg Gibeling, Derek Chiou, James Hoe, John Wawrzynek & Christos Kozyrakis
6/21/2006

Page 2: RAMP Retreat Summer 2006

Breakout Topics

- RDL & Design Infrastructure
- RAMP White
- Caches, Network & IO (Uncore)
- RAMP2 Hardware (BEE3)
- OS, VM and Compiler Software Stack

Page 3: RAMP Retreat Summer 2006

RDL & Design Infrastructure

Leader/Reporter: Greg Gibeling

Topics

- Features & Schedule
- Proposals
  - Multi-platform migration
- Languages
  - Which languages, priorities
  - Assignments for support
- Debugging – Models & Requirements
- Retargeting to ASICs (Platform Optimization)

Page 4: RAMP Retreat Summer 2006

RDL & DI Notes (1)

Languages

- Hardware: Verilog, BlueSpec (IBM uses VHDL)
- Software?

Multi-Platform

- Integration of hardware simulations
- Control of multiplexing
  - Needed for efficiency!
  - Possible through channel & link parameters

Features

- Meta-types
- Component (and unit) libraries

Page 5: RAMP Retreat Summer 2006

RDL & DI Notes (2)

Debugging

- Split target model
  - RDL target design exposed to a second level of RDL
  - Allows statistics aggregation
  - Modeling of noisy channels
- Integration with unit internals
  - Event & state extraction
  - Connection to processor debugging tools
- People clearly want this ASAP

Page 6: RAMP Retreat Summer 2006

RDL & DI Notes (3)

Debugging (Integrated)

- Message tracing
  - Causality diagrams
  - Framework to debug through units
- Checkpoints
- Injection
- Single stepping
  - May not be widely used, but cheap to implement
- Watch/breakpoints

Page 7: RAMP Retreat Summer 2006

RDL & DI Notes (4)

Why Java?

- Runs on various platforms (recompilation is generally pretty painful)
- Decent type system in Java 1.5
- Perfect for plugin infrastructure (e.g. OSGi)

When to use RDL

- Detailed timing model
- Great at abstracting inter-chip communication
- Perfect platform for partitioning designs
- Concise, logical specification
- Support for the debugging framework
- With standard interfaces, good for sharing

Page 8: RAMP Retreat Summer 2006

RDL & DI Notes (5)

Basic Infrastructure

- First system bringup
  - Interfaces with workstations
  - Initial board support
- Standard interfaces (RDL and otherwise)
- Processor replacements

Board Support

- Currently a heroic effort
- Solutions: standardized components? Generators?

Page 9: RAMP Retreat Summer 2006

RDL & DI Notes (6)

Timelines

- Greg’s goals
  - 10/2006 should see RCF/RDLC3
  - 11/2006 should see documentation
- Debugging (integrated) should be ASAP

Manpower

- Board support: first board bring-up
- RDL & RDLC users
- Standard interfaces
- Features & documentation

Page 10: RAMP Retreat Summer 2006

RAMP White

Leader/Reporter: Derek Chiou

Topics

- Two-day break-out; the first day should be pro/con
- Overall preliminary plan evaluation: who is doing exactly what?
- ISA for RAMP White: OpenSPARC, 32-bit Leon, PowerPC 405, processor agnosticism
- Implementation: reimplementation will be required; test suites from companies are very useful

Page 11: RAMP Retreat Summer 2006

RAMP White Notes (1)

Use the embedded PowerPC core first

- Available, debugged, and can run a full OS today
- FPGA chip space is already committed

PowerPC and Sparc are both candidates

- PowerPC pros: the embedded processor is PowerPC
- Sparc pros: 64b available today
- Wait and see on a soft core for RAMP White (notes from Derek go here)

Page 12: RAMP Retreat Summer 2006

RAMP White Notes (2)

- >= 256 processors (can buy 64 processors today)
- Reasonable speed: 10’s of MHz
- With 280K LUTs in Virtex-5, assume 50% for processors but 80% for ease of place-and-route
  - ~100K LUTs for processors
  - Need 4 per FPGA (16 per board, 16 boards)
  - 25K LUTs per processor
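The capacity arithmetic above can be checked with a quick back-of-the-envelope script. All constants are the slide's own figures; reading the 80% factor as the usable fraction after place-and-route headroom is an interpretation, not something the slide states explicitly:

```python
# Back-of-the-envelope check of the Virtex-5 LUT budget from this slide.
# Assumption: 80% of the fabric is usable after place-and-route headroom,
# and half of that usable fabric goes to processor cores.

TOTAL_LUTS = 280_000        # LUTs per Virtex-5 (slide figure)
USABLE_FRACTION = 0.80      # ease of place-and-route
PROCESSOR_FRACTION = 0.50   # share of usable fabric for processors

luts_for_processors = TOTAL_LUTS * USABLE_FRACTION * PROCESSOR_FRACTION
# 280,000 * 0.8 * 0.5 = 112,000, which the slide rounds to ~100K

CORES_PER_FPGA = 4
FPGAS_PER_BOARD = 4         # 4 cores x 4 FPGAs = 16 cores per board
BOARDS = 16

luts_per_core = 100_000 // CORES_PER_FPGA                 # 25K per processor
total_cores = CORES_PER_FPGA * FPGAS_PER_BOARD * BOARDS   # 256 processors

print(luts_for_processors, luts_per_core, total_cores)
```

The 256-core total is what makes the ">= 256 processors" goal on this slide line up with the 25K-LUT-per-processor budget.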

Page 13: RAMP Retreat Summer 2006

RAMP White Notes (3)

- Embedded PowerPC core (it’s there, and better performance than any soft core)
- Soft L1 data cache (no L2)
- Hard L1 instruction cache
- Emulation????
- Ring coherence (a la IBM)
- Linux on top of the embedded PowerPC core
- NFS mount for disk access
- Mark’s port of Peh’s and Dally’s router

To do:

- Ring coherence + L1 data cache + memory interface
- RDL for modules
- Software port
- Timing models for memory, ring, cache, processor?
- Integration

Page 14: RAMP Retreat Summer 2006

RAMP White Notes (4)

RAMP-White Greek-letter versions

- Beta
  - More general fabric using the same router
  - Still use ring coherence
- Gamma
  - James Hoe’s coherence engine
- Delta
  - Soft-core integration

Page 15: RAMP Retreat Summer 2006

Caches, Networks & IO (Uncore)

Leader/Reporter: James Hoe

Topics

- CPU, cache and memories
- Hybrid FPGA cosimulation
- Network
- Storage
- Interfaces
  - Especially with respect to interfaces: components, not sub-frameworks
- Phased uncore abilities

Page 16: RAMP Retreat Summer 2006

Uncore Notes (1)

A full system has more than just CPUs and memory

- I/O is very important

Getting RAMP to “work”

- Just like the real thing (from the SW and OS perspective)
- Software porting/development
- Performance studies

Someone has to build the “uncore”?

- Co-simulation
- Direct HW support for paravirtualization / VMs

Page 17: RAMP Retreat Summer 2006

Uncore Notes (2)

- Why make RAMP White generic?
- What is a more interesting target system? What is a more relevant target system?
- Building a system without an application in mind?
- Would anyone care about RAMP “vanilla”?

Page 18: RAMP Retreat Summer 2006

Uncore Notes (3)

Why insist on directory-based CC for 1000 nodes?

- Today’s large SMPs (at 100+ ways) are actually snoopy-based
- Plug in 8-core CMPs, and that is a 1000-node snoopy system (that the industry may be more interested in)

Page 19: RAMP Retreat Summer 2006

Uncore Notes (4)

Let’s pin down a reference system architecture (including the uncore)

- Minimum modules required?
- Optional modules supported?
- Fix standard interfaces between modules
- RDL script for RAMP White??

Need more than a block diagram for RAMP White

Page 20: RAMP Retreat Summer 2006

Uncore Notes (5)

Requests and Ideas for RDL

- Compensate for skewed raw performance of components (for timing measurements)
  - Large I/O bandwidth relative to CPU throughput
  - Need knobs to dial in different rates for experiments
- Some form of HW/SW co-simulation
- Built-in performance monitoring
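One way to picture such a rate knob (purely illustrative Python, not actual RDL; the function name and the example rates are made up): cycles spent on the emulation host are rescaled by the ratio of the modeled target clock to the raw host clock, so a component with outsized raw speed, such as an I/O block, can be dialed down relative to the CPU model:

```python
# Illustrative sketch of a timing "knob" (not actual RDL syntax).
# A component emulated at host_mhz is credited with target-clock cycles
# as if it ran at target_mhz, so raw performance skew between components
# cancels out of the timing measurements.

def to_target_cycles(host_cycles: int, host_mhz: float, target_mhz: float) -> int:
    """Re-express cycles spent on the emulation host in target-clock cycles."""
    return round(host_cycles * (target_mhz / host_mhz))

# Example: 1,000 host cycles on a 10 MHz emulation of a 2 GHz target
# correspond to 200,000 target cycles.
print(to_target_cycles(1_000, 10.0, 2_000.0))
```

Per-component ratios like this are the "knob": changing `target_mhz` for one component rescales its apparent speed in an experiment without touching the hardware.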

Page 21: RAMP Retreat Summer 2006

Uncore Notes (6)

Sanity Check

- 1000 processing nodes: no problem
- I/O: we can fake it somehow
- DRAM for 1000 processing nodes: not easy to cheat on this one
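To see why DRAM is the hard part, a rough sizing sketch, assuming the 1 GByte/processor target and the 8 GB DDR2 DIMMs mentioned in the BEE3 notes later in this deck, plus the 16-board system from the RAMP White notes:

```python
import math

# Rough DRAM sizing for 1000 processing nodes, using figures that appear
# elsewhere in this deck: 1 GByte/processor target, 8 GB DDR2 DIMMs
# "on the horizon", and a 16-board system.

NODES = 1000
GB_PER_NODE = 1     # stated target
DIMM_GB = 8         # 8 GB DDR2 DIMM capacity
BOARDS = 16

total_gb = NODES * GB_PER_NODE                # ~1 TB of DRAM total
dimms = math.ceil(total_gb / DIMM_GB)         # DIMMs system-wide
dimms_per_board = math.ceil(dimms / BOARDS)   # DIMMs on each board

print(total_gb, dimms, dimms_per_board)
```

Roughly a terabyte of DRAM and eight large DIMMs per board is why, unlike node count and I/O, this line item cannot be faked.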

Page 22: RAMP Retreat Summer 2006

RAMP2 Hardware (BEE3)

Leader/Reporter: Dan Burke & John Wawrzynek

Topics

- Follow-up to XUP
  - Should RAMP embrace XUP at the low end? Inexpensive small systems
- Size & scaling of the new platform: more than 40 FPGAs?
- Technical questions
  - Reconsider use of SRAM
  - DRAM capacity
  - Presence of on-board hard CPUs
  - On-board interfaces (PCI Express)
- Project questions
  - Timelines (definitely need one)
  - Packaging
  - Pricing (especially FPGAs): design for the largest FPGA, change part at solder time?
  - Evaluation of Chen Chang’s design

Page 23: RAMP Retreat Summer 2006

RAMP2 HW Notes (1)

Follow-up to XUP

- XUP has been useful to the project, particularly for early development efforts.
- Xilinx will continue to design and support new XUP boards
  - No V4 version planned; a V5 version will be out Q2 next year.
- For BEE3, we can't really count on V5 FX in Q2 next year.
  - Perhaps use a separate (AMCC) PowerPC processor chip.

Page 24: RAMP Retreat Summer 2006

RAMP2 HW Notes (2)

Size and scaling of the new platform

- Given the potential processor core density issue, we will need to plan on a system that can scale past 40 FPGAs.

Better compatibility with the new XUP is important

- e.g. DRAM standard (better sharing of memory controllers)
- USB: use the Cypress CY7300 for USB compatibility with the Xilinx core.

Our design and production of BEE3 is timed to the production of V5 parts.

- We need to better understand the RAMP team schedule for RAMP White.

Hope to be able to choose the package and have flexibility in part sizes, and ideally part feature set.

How about a daughterboard for the FPGA (DRC approach)?

Page 25: RAMP Retreat Summer 2006

RAMP2 HW Notes (3)

Technical Questions

- Reconsider use of SRAM: the group thought SRAM is a bad idea.
  - It is faster, smaller, and simpler to interface to; newer parts will make interfacing simpler.
  - Faster is not a big concern for RAMP; smaller is a big concern.
- 8GB DDR2 DIMM modules are on the horizon.
  - A target will be 1 GByte/processor.
- Presence of on-board hard CPUs
  - Are hard cores in FPGAs useful (e.g. the PPC405 in V2Pro)?
  - Would commodity chips on the PCB be useful (e.g. for management)?

Page 26: RAMP Retreat Summer 2006

RAMP2 HW Notes (4)

Enclosures

- Using a standard form factor will help with module packaging.
- Need to look carefully at the IBM blade center (adopted by IBM and Intel)
- ATCA is gaining momentum; power may be a problem

Can we accommodate custom ASIC integration (perhaps through a slight generalization of the DRAM interface)?

What does Google do for packaging in their data centers? Is it racks of 1U modules?

Page 27: RAMP Retreat Summer 2006

RAMP2 HW Notes (5)

Interesting idea from Chuck Thacker: "Design the new board based on the needs of RAMP White"! Previously suggested by others.

Can we estimate the logic capacity, memory BW, network BW, etc.?

Page 28: RAMP Retreat Summer 2006

OS, VM & Compiler

Leader/Reporter: Christos Kozyrakis

Topics

- Debugging HW and SW (RDL)
- Phased approach: proxy, full kernel, VMMs, hypervisor
- HW/SW schedule and dependencies
- High-level applications

Page 29: RAMP Retreat Summer 2006

Software Notes (1)

RAMP milestones

- Pick ISA
- Deploy basic VMM
- Deploy OS

Page 30: RAMP Retreat Summer 2006

Software Notes (2)

VMM approach: use a split VMM system (a la VMware/Xen)

- Run a full VMM on an x86 host that allows access to devices
- Run a simple VMM on RAMP that communicates with the host for device accesses through some network
- A timing model may be used if I/O performance is important
- Should talk with Sun & IBM about their VMM systems for Sparc and PowerPC; we may be able to port a very basic Xen system on our own

Questions

- Accurate I/O timing with para-virtualization (you also need repeatability)
- SW/system-level/I/O issues for a large-scale machine may be more important than coherence
- Related issue: do we want global cache coherence in White? Benefit vs. complexity (schedule etc.)

Page 31: RAMP Retreat Summer 2006

Software Notes (3)

Separate infrastructure from RAMP

- Example: RDL should not be tied to RAMP White
  - Note: this is in progress with some current RDL applications
- Same with the BEE3 design work
- Most of our tools are applicable to others

Page 32: RAMP Retreat Summer 2006

Software Notes (4)

Debugging support: RDL-scope

- Arbitrary conditions on RDL-level events to trigger debugging
- Get traces of messages
- Track lineage of messages
  - Traceability, accountability; relate events to program constructs
- Infinite checkpoints for instructions & data
- Checkpoint support
  - Swappable & observable designs
- Single step
  - Instruction, RDL, or cycle level
  - Note: not always a commonly used feature

Such features may attract people to RDL more than retiming

- Note: this is already the case with current RDL applications

Page 33: RAMP Retreat Summer 2006

Software Notes (5)

What is our schedule?

- What can we have up and running within 1 year?
- Does it have to be RAMP White?

Do we need to migrate RDL maintenance from Greg?

- Note: the work should be spread out at least.

Do we have enough manpower for this SW work?

- Compiler, VMMs, applications, etc…

Page 34: RAMP Retreat Summer 2006

Software Notes (6)

Application Domains

- Enterprise/desktop
  - Full-featured OS on all nodes
  - Running a JVM is a big plus here
  - Should be able to run webservers, middleware and DBs.
- Embedded
  - While eventually an app may directly control a number of nodes, it is easier to start with all nodes running the OS.
  - The base design should allow all nodes to run the OS; this is the easiest starting point for SW.
  - Various researchers may decide to run the OS on a subset of nodes, managing the rest of them directly
  - A simple runtime with app-specific policies is common in embedded systems

Page 35: RAMP Retreat Summer 2006

Software Notes (7)

A simple kernel for embedded systems should support

- Fast remapping of computation
- Protection across processes

Emulation of attached disk

- iSCSI + a timing model for disks

RAMP VMM uses:

- (a) Attract VMM researchers (might require x86)
- (b) Our own convenience
  - Get an OS running, access to devices, etc.
- We may achieve (b) without (a)

Some researchers will want to turn cache coherence off anyway!