RAMP Gold: ParLab InfiniCore Model Krste Asanovic UC Berkeley RAMP Retreat, January 16, 2008


Outline
• UCB Parallel Computing Laboratory (ParLab) overview
• InfiniCore: UCB’s manycore prototype architecture
• RAMP Gold: A RAMP model for InfiniCore

UCB Par Lab Overview
Easy to write correct software that runs efficiently on manycore

[Figure: Par Lab research stack. Layer labels: Applications, Productivity Layer, Efficiency Layer, Correctness, OS, Arch. Box labels: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser, Motifs/Dwarfs, Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Sketching, Autotuners, Schedulers, Communication & Synch. Primitives, Efficiency Languages, Efficiency Language Compilers, Legacy Code, Static Verification, Type Systems, Directed Testing, Dynamic Checking, Debugging with Replay, Legacy OS, OS Libraries + Services, Hypervisor, Multicore/GPGPU, InfiniCore/RAMP Gold.]

“Manycore” covers huge design space

[Figure: generic manycore chip. Labels: “Fat” cores; “Thin” cores; special-purpose cores; HW accelerators; CPU and L1 pairs; L2 interconnect; multiple on-chip L2 $/RAM banks; memory & I/O interconnect; multiple off-chip DRAM/Flash channels; fast serial I/O ports; many alternative memory hierarchies.]

Narrowing our search space
• Laptops/Handhelds => single-socket systems
  • Don’t expect >1 manycore chip per platform
  • Servers/HPC will probably use multiple single-socket blades
• Homogeneous, general-purpose cores
  • Presents most of the interesting design challenges
  • Resulting designs can later be specialized for improved efficiency
• “Simple” in-order cores
  • Want low energy/op floor
  • Want high performance/area ceiling
  • More predictable performance
• A “tiled” physical design
  • Reduces logical/physical design verification costs
  • Enables design reuse across large family of parts
  • Provides natural locality to reduce latency and energy/op
  • Natural redundancy for yield enhancement & surviving failures

InfiniCore
ParLab “strawman” manycore architecture
• A playground (punching bag?) for trying out architecture ideas
Highlights:
• Flexible hardware partitioning & protected communication
• Latency-tolerant CPUs
• Fast and flexible synchronization primitives
• Configurable memory hierarchy and user-level DMA
• Pervasive QoS and performance counters

InfiniCore Architecture Overview
Four separate on-chip network types:
• Control networks combine 1-bit signals in a combinational tree for interrupts & barriers
• Active message networks carry register-register messages between cores
• L2/Coherence network connects L1 caches to L2 slices and indirectly to memory
• Memory network connects L2 slices to memory controllers
I/O and accelerators potentially attach to all network types.
Flash replaces rotating disks. Only high-speed I/O is network & display.

[Figure: InfiniCore block diagram. Labels: Active Message Network; Control/Barrier Network; L2/Coherence Network; Memory Network; Core with L1D$ and L1I$; Accelerators and/or I/O interfaces; L2 RAM, L2 Tags, L2 Cntl.; MEMC; DRAM; Flash; I/O Pins.]
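As a rough illustration of the control-network bullet, the sketch below models a barrier as a tree that ANDs one 1-bit "arrived" signal per core. The function name `and_tree` and the fan-in of 2 are assumptions made for this example; the slides do not specify the tree shape or the combining function.

```python
def and_tree(bits, fan_in=2):
    """Reduce 1-bit signals with a tree of AND gates; return (result, depth)."""
    if not bits:
        raise ValueError("need at least one input signal")
    level = [bool(b) for b in bits]
    depth = 0
    while len(level) > 1:
        # Each gate combines up to `fan_in` signals from the level below.
        level = [all(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        depth += 1
    return level[0], depth


# Hypothetical 8-core partition: the barrier releases only when all have arrived.
arrived = [True] * 8
released, depth = and_tree(arrived)

arrived[3] = False  # core 3 has not reached the barrier yet
still_released, _ = and_tree(arrived)
```

The tree depth grows with the logarithm of the core count, which is one plausible reason a combinational tree suits barriers across many cores.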

Physical View of Tiled Architecture

[Figure: array of identical tiles, each containing a Core, L1D$, L1I$, an L2$ slice, and interconnect (Intercon.), together with DRAM, Flash, and I/O blocks.]

Core Internals
• RISC-style 64-bit instruction set
  • SPARC V9 used for pragmatic reasons
• In-order pipeline with decoupled single-lane (64-bit) vector unit (VU)
  • Integer control unit generates/checks addresses in-order to give precise exceptions on vector loads/stores
  • VU runs behind, executing queued instructions on queued load data
  • VU executes both scalar & vector, can mix (e.g., vector load plus scalar ALU)
  • Each VU cycle: 2 ALU, 1 load, 1 store (all 64b)
• Vector regfile configurable to trade reduced I-fetch for fewer register spills
  • 256 total registers (e.g., 32 regs. x 8 elements, or 8 regs. x 32 elements)
• Decoupling is a cheap way to tolerate memory latency inside thread (scalar & vector)
• Vectors increase performance, reduce energy/op, and increase effective decoupling queue size

[Figure: core block diagram. Labels: Control Processor (Int 64b); GPRs; L1I$; L1D$; TLB/PLB; Command Queue; Load Data Queues (Store Queues not shown); Vector Unit (Int/FP 64b); VRegs; Virtual Address; To outer levels of memory hierarchy. Annotations: “1-3 issue?” and “2x64b FLOPS/clock”.]
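The regfile trade-off can be made concrete with a small sketch. It lists ways to split the 256 registers into (registers x elements) shapes and counts how many vector instructions one operation over a fixed amount of data would need at each vector length. Only the total of 256 and the two example shapes come from the slide; the helper names, and the assumption that every integer factorization is a legal configuration, are illustrative.

```python
import math

TOTAL_REGISTERS = 256  # total stated on the slide


def regfile_configs(total=TOTAL_REGISTERS):
    """Return (num_regs, elements_per_reg) pairs whose product is `total`."""
    return [(r, total // r) for r in range(1, total + 1) if total % r == 0]


def vector_instructions(n_elements, elements_per_reg):
    """Vector instructions needed to apply one operation to n_elements values."""
    return math.ceil(n_elements / elements_per_reg)


configs = regfile_configs()

# One operation over 1024 elements under the two shapes named on the slide.
short_vectors = vector_instructions(1024, 8)   # 32 regs x 8 elements
long_vectors = vector_instructions(1024, 32)   # 8 regs x 32 elements
```

Longer vectors mean fewer instructions fetched for the same work, while more (shorter) registers leave more names available before values must be spilled.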

Cache Coherence
L1 cache coherence tracked at L2 memory managers (set of readers)
• All cases except write to currently read-shared line handled in pure hardware
• Writer gets trap on memory response, invokes handler
• Same process used for transactional memory (TM)
• Cache tags visible to user-level software in partition, useful for TM swapping

[Figure: InfiniCore block diagram. Labels: Active Message Network; Control/Barrier Network; L2/Coherence Network; Memory Network; Core with L1D$ and L1I$; Accelerators and/or I/O interfaces; L2 RAM, L2 Tags, L2 Cntl.; MEMC; DRAM; Flash; I/O Pins.]
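A toy model may help picture the reader-set tracking. In the sketch below, a directory at the L2 records which cores have read each line, and a write to a line that other cores are currently reading is reported as a trap rather than handled directly. The class name `L2Directory`, the string return values, and the choice to count the writer as the sole reader afterwards are all assumptions for illustration; the slides do not describe what the trap handler does.

```python
class L2Directory:
    """Toy directory: for each line address, the set of cores reading it."""

    def __init__(self):
        self.readers = {}  # line address -> set of core ids

    def read(self, core, line):
        # Reads simply add the core to the line's reader set.
        self.readers.setdefault(line, set()).add(core)
        return "hardware"

    def write(self, core, line):
        others = self.readers.get(line, set()) - {core}
        if others:
            # Line is currently read-shared: the writer would take a trap.
            return "trap"
        # No other readers: handled without software involvement.
        self.readers[line] = {core}
        return "hardware"


d = L2Directory()
results = [
    d.read(0, 0x40),   # first reader
    d.write(0, 0x40),  # sole reader writes
    d.read(1, 0x40),   # a second core starts reading
    d.write(0, 0x40),  # write to a read-shared line
]
```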

RAMP Gold: A Model of ParLab InfiniCore Target
• Target is single-socket tiled manycore system
  • Based on SPARC ISA (v8->v9)
  • Distributed coherent caches
  • Multiple on-chip networks (barrier, active message, coherence, memory)
  • Multiple DRAM channels
• Split timing/functional models, both in hardware
• Host multithreading of both timing and functional models
• Expect to model up to 1024 64-bit cores in system (8 BEE3 boards)
• Predict peak performance around 1-10 GIPS, with full timing models

Host Multithreading
(Zhangxi Tan (UCB), Chung (CMU))
• Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
• Hides emulation latencies (e.g., communicating across FPGAs)

[Figure: a target model with four CPUs (CPU1 to CPU4) maps onto a multithreaded host emulation engine (on FPGA): a single hardware pipeline with multiple copies of CPU state. Pipeline labels: PC, I$, IR, GPR, X, Y, D$.]
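The idea of one pipeline with several copies of CPU state can be sketched in software. The loop below stands in for the single host pipeline and picks a different target CPU's state on each host cycle in round-robin order. The three-field instruction tuples, the `addi` opcode, and the names `TargetCPU` and `run_host_pipeline` are hypothetical and exist only for this example; they are not the RAMP Gold design.

```python
from dataclasses import dataclass, field


@dataclass
class TargetCPU:
    """Per-target architectural state; the host keeps one copy per target CPU."""
    pc: int = 0
    gpr: list = field(default_factory=lambda: [0] * 4)


def run_host_pipeline(program, num_targets, host_cycles):
    """Interleave targets round-robin on one shared 'pipeline' (this loop)."""
    cpus = [TargetCPU() for _ in range(num_targets)]
    for cycle in range(host_cycles):
        cpu = cpus[cycle % num_targets]  # select this cycle's target state
        if cpu.pc >= len(program):
            continue                     # this target has already finished
        op, rd, value = program[cpu.pc]
        if op == "addi":                 # rd <- rd + immediate
            cpu.gpr[rd] += value
        cpu.pc += 1
    return cpus


program = [("addi", 1, 5), ("addi", 1, 7), ("addi", 2, 1)]
cpus = run_host_pipeline(program, num_targets=4, host_cycles=12)
```

Because consecutive host cycles serve different targets, a long-latency event for one target (for example, a cross-FPGA message) can overlap with useful work for the others.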

Split Functional/Timing Models
(HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))
• Functional model executes CPU ISA correctly, no timing information
  • Only need to develop functional model once for each ISA
• Timing model captures pipeline timing details, does not need to execute code
  • Much easier to change timing model for architectural experimentation
  • Without RTL design, cannot be 100% certain that timing is accurate
• Many possible splits between timing and functional model

[Figure: Functional Model and Timing Model blocks.]
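One possible split, sketched here under simplifying assumptions, has the functional model execute a toy instruction stream and emit a trace of opcodes, while the timing model only charges a latency per opcode and never touches register values. The toy ISA, the latency tables, and both function names are hypothetical; HASIM, FAST, and RAMP Gold each draw this boundary in their own way.

```python
def functional_model(program):
    """Execute a toy ISA correctly; return final registers and an opcode trace."""
    regs = [0] * 4
    trace = []
    for op, rd, value in program:
        if op == "addi":
            regs[rd] += value
        elif op == "load":
            regs[rd] = value  # stand-in for a memory read
        else:
            raise ValueError(f"unknown opcode: {op}")
        trace.append(op)
    return regs, trace


def timing_model(trace, latency):
    """Charge cycles per instruction class without executing anything."""
    return sum(latency[op] for op in trace)


program = [("load", 0, 10), ("addi", 0, 5), ("addi", 1, 1)]
regs, trace = functional_model(program)

# Two timing experiments reuse the same functional run.
fast_memory = timing_model(trace, {"addi": 1, "load": 2})
slow_memory = timing_model(trace, {"addi": 1, "load": 20})
```

Swapping the latency table changes the predicted cycle count while the functional results stay identical, which mirrors the slide's point that the timing model is the part that is easy to change.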


RAMP Gold Approach

Split (and decoupled) functional and timing models

Host multithreading of both functional and timing models

Multithreaded Func. & Timing Models
• MT-Unit multiplexes multiple target units on a single host engine
• MT-Channel multiplexes multiple target channels over a single host link

[Figure labels: Functional Model Pipeline; Arch State; Timing Model Pipeline; Timing State; MT-Unit; MT-Channels.]

RAMP Gold CPU Model (v0.1)

[Figure: CPU model pipeline. Functional blocks: PC/Fetch Func.; ALU Func.; Instruction Memory Interface; Data Memory Interface. Timing blocks: Decode/Issue Timing; Execute Timing; Commit Timing. State labels: PC; GPR; Timing State. Signal labels: Fetch Commands; Instructions; PC Values; Immediates; Exec. Comm.; Mem. Comm.; Addresses; Load; Store; Status.]

RAMP Gold Memory Model (v0.1)

[Figure labels: CPU Model (two shown); Memory Model (duplicate paths for Instruction and Data interface); Host DRAM Cache; BEE DRAM; GPR.]

Matching physical resources to utilization
Only implement sufficient functional units to match expected utilization, e.g.:
• For single-issue core, expected IPC ~0.6
• Regfile read ports (1.2 operands/instruction): 0.6*1.2 = 0.72 per timing model
• Regfile write ports (0.8 operands/instruction): 0.6*0.8 = 0.48 per timing model
• Instruction mix: Mem 0.3, FPU 0.1, Int 0.5, Branch 0.1
• Therefore only need (per timing model):
  • 0.6*0.3 = 0.18 memory ports
  • 0.6*0.1 = 0.06 FPUs
  • 0.6*0.5 = 0.30 Integer execution units
  • 0.6*0.1 = 0.06 Branch execution units
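The arithmetic on this slide is a product of the expected IPC and a per-instruction rate. The short sketch below reproduces it; the variable names are illustrative, and the numbers are the ones given on the slide.

```python
IPC = 0.6  # expected instructions per cycle for a single-issue core

# Fraction of instructions in each class, as given on the slide.
INSTRUCTION_MIX = {"Mem": 0.3, "FPU": 0.1, "Int": 0.5, "Branch": 0.1}

# Average operands per instruction, as given on the slide.
READ_OPERANDS = 1.2
WRITE_OPERANDS = 0.8

# Expected demand per timing model, in units (or ports) per cycle.
units_needed = {name: IPC * frac for name, frac in INSTRUCTION_MIX.items()}
read_ports = IPC * READ_OPERANDS
write_ports = IPC * WRITE_OPERANDS

for name, demand in units_needed.items():
    print(f"{name}: {demand:.2f} per timing model")
print(f"Regfile read ports: {read_ports:.2f}, write ports: {write_ports:.2f}")
```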

Balancing Resource Utilization

[Figure: functional units (one FPU, one Mem, five Int, one Branch) and four Regfiles joined by an Operand Interconnect.]

RAMP Gold Capacity Estimates
• For SPARC v8 (32-bit) pipeline
  • Purely functional, no timing model
  • Integer only
• For BEE3, predict 64 CPUs/engine, 8 engines/FPGA (LX110), or 512 CPUs/FPGA
• Throughput of 150 MHz * 8 engines = 1200 MIPS/FPGA
• 8 BEE3 boards * 4 FPGAs/board * 1200 MIPS/FPGA = ~38 GIPS/system
• Perhaps 4x reduction in capacity with v9, FPU, and timing models
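The capacity figures follow from a few multiplications, reproduced in the sketch below. It assumes each engine completes one target instruction per host cycle, which is what the 150 MHz * 8 engines figure implies, and it applies the "perhaps 4x" reduction to throughput only as an illustration; the slide does not say exactly which quantities that factor covers.

```python
CPUS_PER_ENGINE = 64
ENGINES_PER_FPGA = 8
HOST_CLOCK_MHZ = 150
FPGAS_PER_BOARD = 4
BOARDS = 8

cpus_per_fpga = CPUS_PER_ENGINE * ENGINES_PER_FPGA

# Assumption: one target instruction per engine per host cycle.
mips_per_fpga = HOST_CLOCK_MHZ * ENGINES_PER_FPGA

fpgas = BOARDS * FPGAS_PER_BOARD
system_mips = mips_per_fpga * fpgas

# Assumption: the "perhaps 4x" reduction is applied to throughput.
reduced_mips = system_mips // 4

print(f"{cpus_per_fpga} CPUs/FPGA, {mips_per_fpga} MIPS/FPGA")
print(f"{system_mips / 1000:.1f} GIPS/system, {reduced_mips / 1000:.1f} GIPS after 4x reduction")
```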
