
OpenPiton: An Open-Source Framework for EDA Tool Development

David Wentzlaff

FOSDEM 2020

Princeton University

http://openpiton.org

https://github.com/PrincetonUniversity/openpiton

OpenPiton: The world's first open source, general purpose, multithreaded manycore processor

•  Open source manycore
•  Written in Verilog RTL
•  Scales to ½ billion cores
•  Configurable core, uncore
•  Includes synthesis and back-end flow
•  Simulate in VCS, ModelSim, NCSim, Verilator, Icarus, Riviera
•  ASIC & FPGA verified
•  ASIC power and energy fully characterized [HPCA 2018]
•  Runs full stack multi-user Debian Linux

[Figure: OpenPiton tile, chip, and chipset]

OpenPiton+Ariane: The first open-source, SMP Linux-booting RISC-V system scaling from one to many cores

•  Collaboration with ETH Zurich
   –  Integrate OpenPiton and the Ariane RISC-V core
   –  Booted SMP Linux less than 6 months after starting to integrate the RTL


Silicon Proven Designs

•  Piton (25-core instance of OpenPiton)
   –  25-core modified 64-bit OpenSPARC T1 core
   –  3 P-Mesh NoCs
   –  P-Mesh directory-based cache system
   –  Taped out in IBM 32nm SOI
      •  6mm x 6mm; 460 million transistors; among the largest chips built in academia
   –  Target: 1GHz clock @ 900mV
   –  Received silicon and runs full-stack Debian in the lab
•  Ariane RISC-V (ETH-Z)
   –  Taped out in GlobalFoundries 22nm FDX
   –  Poseidon:
      •  Area: 0.23 mm² (175 kGE)
      •  0.2-1.7 GHz (0.5V-1.15V)
   –  Kosmodrom:
      •  RV64GCXsmallFloat, transprecision/vector FPU
      •  Ariane HP: 8T library, 0.8V, 1.3GHz; 55mW @ 1GHz
      •  Ariane LP: 7.5T ULP library, 0.5V, 250MHz; 5mW @ 200MHz

[Figure: Ariane die photos from ETH-Z, with labeled blocks including Quentin, Kerbin, Hyperdrive, Ariane, L2, and NTX]

FPGA Prototyping Platforms

Available:

•  Digilent Genesys2: $999 ($600 academic)
   –  1-2 cores at 66MHz
•  Xilinx VC707: $3500
   –  1-4 cores at 60MHz
•  Digilent Nexys Video: $500 ($250 academic)
   –  1 core at 30MHz
•  BittWare XUPP3R: $7000-8000
   –  >100MHz (12 cores)
•  Amazon AWS F1: ~$1.60/hr
   –  Rent by the hour
   –  12 cores
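One way to compare the purchase prices above with renting AWS F1 is a break-even calculation: the number of rental hours that cost as much as buying a board. The sketch below uses only the approximate prices listed on this slide; it is a rough estimate that ignores host machines, power, and resale value, and the names in it are illustrative.

```python
# Hypothetical rent-vs-buy estimate using the approximate prices quoted
# on the slide. Ignores host PCs, power, shipping, and resale value.
AWS_F1_USD_PER_HOUR = 1.60  # "~$1.60/hr" from the slide

BOARD_PRICES_USD = {
    "Digilent Genesys2": 999.0,
    "Digilent Genesys2 (academic)": 600.0,
    "Xilinx VC707": 3500.0,
    "Digilent Nexys Video": 500.0,
    "Digilent Nexys Video (academic)": 250.0,
}


def break_even_hours(board_price: float, hourly_rate: float = AWS_F1_USD_PER_HOUR) -> float:
    """Rental hours whose total cost equals a one-off board purchase."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return board_price / hourly_rate


if __name__ == "__main__":
    for name, price in BOARD_PRICES_USD.items():
        hours = break_even_hours(price)
        print(f"{name:<32} ${price:>7.2f} ~ {hours:7.1f} F1 hours ({hours / 24:.1f} days)")
```

For example, the academic Genesys2 price corresponds to about 375 hours of F1 time. Cost is only one axis: the F1 and XUPP3R configurations host 12 cores, while the smaller boards fit 1 to 4.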

OpenPiton System Overview

[Figure: OpenPiton system overview showing the chip and chipset: P-Mesh off-chip routers (3), chip bridge, P-Mesh chipset crossbars (3), DRAM, Wishbone SDHC, AXI, and I/O]


OpenPiton+Ariane System Overview

[Figure: OpenPiton+Ariane system overview showing tile and chipset]

Current Status of Free and Open Source Chip EDA Tools and Benchmarks

•  Growing use of Free & Open Source EDA tools
   –  Mainly used for verification of designs and with FPGAs
•  Free & Open Source EDA tools have relied on industrial hardware design releases for verification
•  Dependence on industry incurs limitations
   –  Designs, and the scale of designs, are outdated by the time they are released
      •  LEON3 (2004) is still one of the largest designs
   –  Lower-level information, such as that needed by placement and routing tools, is often obfuscated (ISPD, ICCAD benchmarks)

OpenPiton Contains Design Variety

[Figure: OpenPiton tiles built around different cores (OpenSPARC T1, PicoRV32, Ariane RISC-V, ao486), each with a private cache, last-level cache, and NoC router; memory system; chipset with NoC crossbar, DDR memory, Ethernet, PS/2, UART, MIAOW GPGPU, and Wishbone SDHC]

•  Big cores
•  Small cores
•  Caches
•  Interconnect
•  GPGPU (MIAOW)
•  I/O
•  Adding in additional accelerators

Configurability Options

Component | OpenSPARC T1 | Ariane 64-bit RISC-V
Cores (per chip) | Up to 65,536
Cores (per system) | Up to 500 million
Threads per Core | 1/2/4 | 1
Floating-Point Unit | FP64, FP32 | FP64, FP32, FP16, FP8, BFLOAT16
TLBs | 8/16/32/64 entries | Number of entries (16 entries)
L1 I-Cache | Number of sets, ways (16kB, 4-way)
L1 D-Cache | Number of sets, ways (8kB, 4-way)
L1.5 Cache | Number of sets, ways (8kB, 4-way)
L2 Cache | Number of sets, ways (64kB, 4-way)
Intra-chip Topologies | 2D Mesh, Crossbar
Inter-chip Topologies | 2D Mesh, 3D Mesh, Crossbar, Butterfly Network
Bootloading | SD/SDHC Card, UART, RISC-V JTAG Debug
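The cache rows expose sets and ways as the configurable parameters, with default capacities in parentheses. For a set-associative cache, sets = capacity / (ways × line size), and the index and offset fields of an address follow from that. The sketch below applies the relation to the default capacities; the line sizes (32 B, 16 B, 16 B, 64 B) and the reading of "kB" as 1024 bytes are assumptions for illustration, not values given on the slide.

```python
# Minimal sketch: derive sets and address-field widths from a cache's
# capacity, associativity, and line size. The line sizes below are
# illustrative assumptions; the slide only gives capacity and ways.
from dataclasses import dataclass


def log2_exact(n: int) -> int:
    """log2 of a positive power of two; rejects anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


@dataclass(frozen=True)
class CacheGeometry:
    name: str
    capacity_bytes: int
    ways: int
    line_bytes: int  # assumed, not from the slide

    @property
    def sets(self) -> int:
        bytes_per_set = self.ways * self.line_bytes
        if self.capacity_bytes % bytes_per_set:
            raise ValueError(f"{self.name}: capacity is not a multiple of ways * line size")
        return self.capacity_bytes // bytes_per_set

    @property
    def offset_bits(self) -> int:
        return log2_exact(self.line_bytes)

    @property
    def index_bits(self) -> int:
        return log2_exact(self.sets)


KIB = 1024  # treating the slide's "kB" as 1024 bytes (assumption)

# Default capacities and associativities from the table; line sizes assumed.
DEFAULTS = [
    CacheGeometry("L1 I-cache", 16 * KIB, 4, 32),
    CacheGeometry("L1 D-cache", 8 * KIB, 4, 16),
    CacheGeometry("L1.5 cache", 8 * KIB, 4, 16),
    CacheGeometry("L2 cache", 64 * KIB, 4, 64),
]

if __name__ == "__main__":
    for c in DEFAULTS:
        print(f"{c.name:<11} {c.sets:4d} sets, {c.index_bits} index bits, {c.offset_bits} offset bits")
```

Changing the sets or ways parameter then changes capacity proportionally; the divisibility check catches combinations that do not form a valid geometry.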


OpenPiton: Not Just Verilog

•  Verification testbenches
•  8000+ test cases
•  Power and thermal analysis
•  PCB design

https://parallel.princeton.edu/openpiton/piton_power_char.html

https://parallel.princeton.edu/openpiton/piton_pcb.html

Piton Test Setup

[Figure: Piton test setup: Piton chip with heat sink, chipset FPGA (Kintex 7) with DRAM and I/O, bridge FPGA (Spartan 6), power supply, bulk decoupling, and miscellaneous configuration]

[McKeown et al, Hot Chips 2016] [McKeown et al, IEEE MICRO 2017] [McKeown et al, HPCA 2018]

Piton Characterization

[Figure: Piton characterization board: PCB layers and power rails (VDD, VCS, VIO Chip, VIO Config, Gateway FPGA) with BF and AF Rsense measurement points]

[McKeown et al, HPCA 2018]

•  Sending data 8 hops on the NoC costs roughly the energy of one ALU op
   –  Debunks the conventional wisdom that NoCs dominate the energy of a manycore
•  1 LD == 3 ALU instructions (in energy)
   –  Re-computing a value may be more energy efficient than loading it
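A back-of-the-envelope reading of these ratios: if Piton's 25 tiles are arranged as a 5×5 mesh with dimension-ordered routing, the farthest tile pair is 4 + 4 = 8 hops apart, so even a corner-to-corner transfer costs on the order of one ALU op. The sketch below assumes that arrangement and a row-major tile-ID mapping (both hypothetical here, not stated on the slide) and uses only the two energy ratios quoted above.

```python
# First-order sketch, assuming the 25 Piton tiles sit in a 5x5 mesh with
# row-major tile IDs (a hypothetical mapping) and dimension-ordered routing,
# so a message's hop count is the Manhattan distance between tiles.
# Energy ratios are the slide's: ~8 NoC hops ~ 1 ALU op, 1 load ~ 3 ALU ops.
MESH_WIDTH = 5
MESH_HEIGHT = 5
HOPS_PER_ALU_OP = 8
ALU_OPS_PER_LOAD = 3


def tile_xy(tile_id: int) -> tuple[int, int]:
    """Map a tile ID to (x, y) under the assumed row-major layout."""
    if not 0 <= tile_id < MESH_WIDTH * MESH_HEIGHT:
        raise ValueError(f"tile_id {tile_id} is outside the mesh")
    return tile_id % MESH_WIDTH, tile_id // MESH_WIDTH


def hops(src: int, dst: int) -> int:
    """Link hops under XY routing, i.e. the Manhattan distance."""
    (sx, sy), (dx, dy) = tile_xy(src), tile_xy(dst)
    return abs(sx - dx) + abs(sy - dy)


def noc_energy_alu_equiv(src: int, dst: int) -> float:
    """Energy of moving one datum src -> dst, in units of one ALU op."""
    return hops(src, dst) / HOPS_PER_ALU_OP


def recompute_beats_load(alu_ops_to_recompute: int) -> bool:
    """True if recomputing a value costs less energy than loading it."""
    return alu_ops_to_recompute < ALU_OPS_PER_LOAD


if __name__ == "__main__":
    n_tiles = MESH_WIDTH * MESH_HEIGHT
    worst = max(hops(a, b) for a in range(n_tiles) for b in range(n_tiles))
    print(f"worst-case distance: {worst} hops ~ {worst / HOPS_PER_ALU_OP:.1f} ALU op")
    for n in range(1, 5):
        print(f"recompute with {n} ALU op(s) cheaper than a load? {recompute_beats_load(n)}")
```

The model deliberately ignores everything except these two ratios (cache hit rates, leakage, and where the loaded data lives), so it only illustrates the direction of the trade-off, not a measured result.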

Open Source Design and Data

[Figure: Piton design (die with Tile0 through Tile24 and the chip bridge) alongside its characterization data]

First detailed power and energy characterization of an open source manycore [McKeown et al, HPCA 2018]

https://parallel.princeton.edu/openpiton/piton_power_char.html

OpenPiton and Free & Open Source EDA Tools

•  The OpenPiton framework utilizes multiple open source tools
   –  Icarus Verilog
   –  Verilator
   –  Yosys
   –  FuseSoC
   –  SV2V
•  Not only use, but contribute too
   –  Bug reports and fixes
   –  Requests for new features
   –  Contributing new features

Need for a Free & Open Source EDA Flow

•  EDA tools are essential components, but a flow that connects them is equally important
•  The EDA flow from a chip builder's perspective is different from that of an EDA tool developer
   –  View across tools
   –  Interaction between tools
   –  Hierarchical synthesis
   –  Two passes
   –  Good support for gate-level verification
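To make "a flow that connects tools" concrete, the sketch below chains three of the open-source tools named earlier: sv2v converts SystemVerilog to Verilog, Yosys synthesizes it, and Icarus Verilog simulates the synthesized netlist against a testbench (a simple form of gate-level verification). It is a minimal illustration, not OpenPiton's own flow; the file names, top-module names, and directory layout are hypothetical, and by default it only builds the command lines (set `dry_run=False` to execute them with the tools installed).

```python
# Minimal sketch of glue code for an open-source flow: sv2v converts
# SystemVerilog to Verilog, Yosys synthesizes it, and Icarus Verilog
# simulates the resulting netlist against a testbench. File names, the top
# module, and the directory layout are hypothetical.
import shlex
import subprocess
from pathlib import Path


def sv2v_cmd(sv_files: list[str]) -> list[str]:
    # sv2v writes the converted Verilog to stdout.
    return ["sv2v", *sv_files]


def yosys_cmd(verilog_file: str, top: str, netlist_out: str) -> list[str]:
    script = (f"read_verilog {verilog_file}; synth -top {top}; "
              f"stat; write_verilog {netlist_out}")
    return ["yosys", "-q", "-p", script]


def iverilog_cmd(files: list[str], top: str, out: str) -> list[str]:
    # -s selects the root module; -o names the compiled simulation.
    return ["iverilog", "-s", top, "-o", out, *files]


def run_flow(sv_files: list[str], top: str, testbench: str, tb_top: str,
             workdir: str = "build", dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally run) sv2v -> Yosys -> gate-level Icarus steps."""
    wd = Path(workdir)
    converted = wd / f"{top}.v"
    netlist = wd / f"{top}_synth.v"
    sim = wd / f"{tb_top}.vvp"
    steps = [
        sv2v_cmd(sv_files),
        yosys_cmd(str(converted), top, str(netlist)),
        # Depending on write_verilog options, the netlist may also need
        # Yosys' simulation cell library; it is not added here.
        iverilog_cmd([testbench, str(netlist)], tb_top, str(sim)),
        ["vvp", str(sim)],
    ]
    if dry_run:
        return steps
    wd.mkdir(parents=True, exist_ok=True)
    with open(converted, "w") as fh:  # capture sv2v's stdout into a file
        subprocess.run(steps[0], stdout=fh, check=True)
    for cmd in steps[1:]:
        subprocess.run(cmd, check=True)
    return steps


if __name__ == "__main__":
    for cmd in run_flow(["rtl/top.sv"], "top", "tb/tb_top.v", "tb_top"):
        print(shlex.join(cmd))
```

Keeping each step as a plain command list makes the hand-offs between tools explicit and easy to log or rerun, which is the cross-tool view a chip builder needs and a single-tool developer may not see.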

What is missing for a full Free & Open Source Flow

•  The UCSD-led OpenROAD project provides the majority of chip implementation tools
•  UW has packaged a flow using OpenROAD with the Free45 library

Missing Components
•  Free/Open Source DRC tool
•  Strengthening Yosys
   –  Better SystemVerilog support
•  Free/Open Source parasitic extraction tool
•  Power grid and signal integrity issues

https://github.com/bsg-idea/uw_openroad_free45

https://github.com/The-OpenROAD-Project/OpenROAD-flow

Building Open Source Chips with Open Source EDA Tools

•  Princeton and UW are planning to tape out a billion-transistor chip
   –  GlobalFoundries 14/12nm
   –  Using Free & Open Source CAD tools from OpenROAD and the DARPA IDEA project

[Figure: planned chip with BaseJump manycore, medium RISC-V cores, deep learning accelerators, deep reinforcement learning supervisors (DRL ASIC), Turbo XAUI SerDes, debug, and BaseJump I/O; packaged in a SixPack SiP on a BaseJump PCB with FMC and DC/DC]

DECADES: an Open Source Heterogeneous Platform

•  Software Defined Hardware (SDH)
   –  Design runtime-reconfigurable hardware to accelerate data-intensive software applications
      •  Machine learning and data science
      •  Graph analytics and sparse linear algebra
•  DECADES: heterogeneous tile-based chip
   –  Combination of core, accelerator, and intelligent storage tiles
   –  Princeton/Columbia collaboration led by PIs Margaret Martonosi, David Wentzlaff, Luca Carloni
•  Our tools and hardware are open source!
   –  https://decades.cs.princeton.edu/

[Figure: DECADES flow and chip organization. Compile time: source code, compile-time optimizations (Graphicionado, DeSC), data segments definition, program executable, address-space mapping, initial set of in-memory operations, initial task mapping, DECADES tile configuration, interconnect bridges configuration. Run-time self-tuning: data migration (across DDR nodes), update in-memory operations, task migration (across tiles), DECADES tile reconfiguration, bridge ports re-routing. DECADES chip: Core Tiles (configurable core pipeline, data supply/compute threads, L1 cache, per-tile configurable on-chip memory), Accelerator Tiles (specialized, configurable data supply/compute accelerator, per-tile configurable on-chip memory), and Intelligent Storage (near-memory computation, across-chip configurable on-chip memory, configurable pattern-based prefetcher); every tile type has configurable interconnect shims and a DECADES monitor and run-time reconfiguration shim. DECADES TA1/TA2 FPGAs: off-chip memory system with in-memory computation.]

OpenPiton Community

•  Welcome community contributions
•  Homepage: http://openpiton.org
•  Google Group: https://groups.google.com/group/openpiton
•  Direct email: openpiton@princeton.edu
•  GitHub: https://github.com/PrincetonUniversity/openpiton

Team and Acknowledgements

Princeton Parallel Team
   –  Special thanks to Georgios Tziantzioulis for help with slides

[Photo: Team building, Grand Canyon, Summer 2019]

•  Ariane team as part of the PULP platform led by Prof. Luca Benini
•  Prof. Michael B. Taylor and Prof. Richard Shi groups at UW
•  OpenROAD team led by Prof. Andrew Kahng

Funding/Support

This material is based on research sponsored by the NSF under Grants No. CNS-1823222, CCF-1217553, CCF1453112, CCF-1823032, and CCF-1438980, AFOSR under Grant No. FA9550-14-1-0148, Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement No. FA8650-18-2-7846, FA8650-18-2-7852, and FA8650-18-2-7862 and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA), the NSF, AFOSR, or the U.S. Government.


David Wentzlaff (wentzlaf@princeton.edu)

