CoDEx: CoDesign for Exascale
David Donofrio, Cy Chan, Farzad Fatollahi-Fard, George Michelogiannakis, John Shalf
ModSim, August 13, 2014

Source: hpc.pnl.gov/modsim/2014/Presentations/Donofrio.pdf


Objective

Create a comprehensive architectural simulation platform to accelerate hardware/software co-design for exascale computing

CoDEx Objective

1. Code Analysis: static analysis, basic speeds and feeds, unbounded hardware models
2. Rapid Exploration: use software-based simulators and software kernels to explore the hardware parameter space
3. Synthesis of Point Design: use hardware emulation tools for full application optimization and extraction of accurate power and area estimates

CoDEx Tool Summary

Tools span from high-level, concept-exploring tools to physical circuit synthesis

‣  Concept: ExaSAT and spreadsheet models
‣  Software Simulation: OpenSoC, DRAMSim, NVRAMSim, HMC models, Chisel-generated components, Tensilica Xtensa models, SystemC-based models from industry
‣  Hardware Emulation: Gateware, Chisel-generated components, NoCSim, Tensilica Xtensa processors
‣  Circuit Synthesis: FPGAs, ASIC, power model parameter extraction


Conceptual Analysis

The Old Way of Analytic Modeling

| Module | Loop | Read-only | Write-only | RW (first read, then written) | WR (first written, then read) | Stencil? | Vars that can be read from registers after written |
|---|---|---|---|---|---|---|---|
| ctoprim | | | | | | | |
| ctoprim(t) | loop1 | U1-U5 | Q1,Q5,Q6 | | Q2,Q3,Q4 | No | Q2,Q3,Q4 |
| | loop2 | Q1-Q5 | | | | | |
| ctoprim(f) | loop1 | U1-U5 | Q1,Q5,Q6 | | Q2,Q3,Q4 | No | Q2,Q3,Q4 |
| diffterm | | | D1 | | | No | |
| ux,vx,wx | loop1 | Q2, Q3, Q4 | ux, vx, wx | | | Yes | D2-D4 across loops |
| uy,vy,wy | loop2 | Q2, Q3, Q4 | uy, vy, wy | | | Yes | |
| uz,vz,wz | loop3 | Q2, Q3, Q4 | uz, vz, wz | | | Yes | |
| imx | loop4 | Q2, vy, wz | D2 | | | Yes | |
| imy | loop5 | Q3, ux, wz | D3 | | | Yes | |
| imz | loop6 | Q4, ux, vy | D4 | | | Yes | |
| iene | loop7 | Q2-4, Q6, D2-4, ux-z, vx-z, wx-z | D5 | | | Yes | |
| hypterm | | | | | | | |
| | loop1 | U2-U5, Q2, Q5 | F1-F5 | | | Yes | F1-F5 across loops |
| | loop2 | U2-U5, Q3, Q5 | | F1-F5 | | Yes | |
| | loop3 | U2-U5, Q4, Q5 | | F1-F5 | | Yes | |
| CalcU | | | | | | | |
| N+1/3 | loop1 | U, D, F | Unew | | | No | |
| N+2/3 | loop2 | U, D, F | | Unew | | No | |
| N+1 | loop3 | Unew, D, F | | U | | No | |


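Memory access per box (per-point counts, per-box sizes in bytes, last column in MB):

| Module | Loop | # of reads /pt | # of reads w halo /pt | # of writes /pt | Reads w/o halo /box (bytes) | Reads w halo /box (bytes) | Writes /box (bytes) | Total memory access /box (bytes) | Total /box (MB) |
|---|---|---|---|---|---|---|---|---|---|
| ctoprim (courno=true) | loop1 | 5 | 0 | 6 | 1310720 | 0 | 1572864 | 2883584 | 2.75 |
| | loop2 | 5 | 0 | 0 | 1310720 | 0 | 0 | 1310720 | 1.25 |
| ctoprim (courno=false) | loop1 | 5 | 0 | 6 | 1310720 | 0 | 1572864 | 2883584 | 2.75 |
| diffterm | | 0 | 0 | 1 | 0 | 0 | 262144 | 262144 | 0.25 |
| ux,vx,wx | loop1 | 0 | 3 | 3 | 0 | 1376256 | 786432 | 2162688 | 2.0625 |
| uy,vy,wy | loop2 | 0 | 3 | 3 | 0 | 1376256 | 786432 | 2162688 | 2.0625 |
| uz,vz,wz | loop3 | 0 | 3 | 3 | 0 | 1376256 | 786432 | 2162688 | 2.0625 |
| imx | loop4 | 0 | 3 | 1 | 0 | 1376256 | 262144 | 1638400 | 1.5625 |
| imy | loop5 | 0 | 3 | 1 | 0 | 1376256 | 262144 | 1638400 | 1.5625 |
| imz | loop6 | 0 | 3 | 1 | 0 | 1376256 | 262144 | 1638400 | 1.5625 |
| iene | loop7 | 15 | 1 | 1 | 3932160 | 458752 | 262144 | 4653056 | 4.4375 |
| hypterm | loop1 | 0 | 6 | 5 | 0 | 2752512 | 1310720 | 4063232 | 3.875 |
| | loop2 | 5 | 6 | 5 | 1310720 | 2752512 | 1310720 | 5373952 | 5.125 |
| | loop3 | 5 | 6 | 5 | 1310720 | 2752512 | 1310720 | 5373952 | 5.125 |
| CalcU N+1/3 | loop1 | 15 | 0 | 5 | 3932160 | 0 | 1310720 | 5242880 | 5 |
| CalcU N+2/3 | loop2 | 20 | 0 | 5 | 5242880 | 0 | 1310720 | 6553600 | 6.25 |
| CalcU N+1 | loop3 | 20 | 0 | 25 | 5242880 | 0 | 6553600 | 11796480 | 11.25 |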
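
The per-box byte columns follow from the per-point counts by simple arithmetic. The sketch below reproduces three rows of the memory access table. It assumes each variable occupies 262,144 bytes per box without halo and 458,752 bytes with halo; both constants are inferred from the table itself (for example 1310720 / 5 and 1376256 / 3) and are not stated in the slides. The function and constant names are illustrative only.

```python
# Bytes touched per variable in one box. Both values are inferred from
# the table (assumption: the slides do not state the box size).
BYTES_PER_VAR = 262_144        # consistent with a 32^3 box of 8-byte values
BYTES_PER_VAR_HALO = 458_752   # the same variable including halo cells


def box_traffic(reads, halo_reads, writes):
    """Return (read bytes, halo read bytes, write bytes, total bytes, total MB)."""
    read_bytes = reads * BYTES_PER_VAR
    halo_bytes = halo_reads * BYTES_PER_VAR_HALO
    write_bytes = writes * BYTES_PER_VAR
    total = read_bytes + halo_bytes + write_bytes
    return read_bytes, halo_bytes, write_bytes, total, total / 2**20


# Per-point counts (reads, reads with halo, writes) copied from the table.
rows = {
    "ctoprim loop1": (5, 0, 6),
    "iene loop7": (15, 1, 1),
    "hypterm loop2": (5, 6, 5),
}
for name, counts in rows.items():
    print(name, box_traffic(*counts))
```

Doing this by hand for every loop is the tedious bookkeeping that "the old way" refers to.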

CoDEx Tools: ExaSAT

Compiler driven performance analysis framework

‣  Extracts key application statistics in HW independent fashion
    •  HW configs parameterized for code performance estimation
‣  Estimate performance benefits of code transforms without changing the code

[Figure: ExaSAT Framework. Labelled blocks: Input Code; ROSE Frontend; AST; Compiler Analysis (Function and Loop Attribute Analysis; Arithmetic Ops, Read/Write, Stencil Access Analysis); Code Description <XML>; User Parameters; Exascale Machine Config <XML>; Performance Model (Working Set and Memory Traffic Modeling; Loop Dependency Memory Footprint Analysis); Performance Spreadsheet; Dependency Graph; Optimized HW/SW configurations (reduced space).]
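
One way to picture "HW configs parameterized for code performance estimation" is a bound-based estimate that combines hardware-independent counts with a machine description. The sketch below is not ExaSAT's model; it assumes a simple roofline-style bound, and every name and number in it (`KernelStats`, `MachineConfig`, the peak rates) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelStats:
    """Hardware-independent counts for one kernel (hypothetical fields)."""
    flops: float        # arithmetic operations
    dram_bytes: float   # bytes moved to and from memory


@dataclass
class MachineConfig:
    """Machine parameters (hypothetical fields and units)."""
    peak_flops: float      # flop/s
    mem_bandwidth: float   # bytes/s


def estimate_seconds(k: KernelStats, m: MachineConfig) -> float:
    """Roofline-style bound: the slower of compute and memory sets the time."""
    return max(k.flops / m.peak_flops, k.dram_bytes / m.mem_bandwidth)


# The same kernel statistics evaluated against two made-up machines.
kernel = KernelStats(flops=4.0e9, dram_bytes=8.0e9)
low_bw = MachineConfig(peak_flops=1.0e12, mem_bandwidth=1.0e11)
high_bw = MachineConfig(peak_flops=1.0e12, mem_bandwidth=4.0e11)
print(estimate_seconds(kernel, low_bw))
print(estimate_seconds(kernel, high_bw))
```

Because the kernel statistics do not depend on the machine, sweeping the machine parameters needs no further code analysis.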

ExaSAT Analysis

Automate a lot of tedious analysis

[Figure: Cache Blocking on Dynamics Kernel (53 species). Bytes per Flop vs. Block Size, with curves for No Cache, 16 kB, 64 kB, 256 kB, 1 MB, 4 MB, and Unlimited Cache.]

[Figure: Chemistry kernel. Instruction Mix: adds 3620, muls 4303, divs 540, trans 710. Breakdown of CPU Time: adds 3%, muls 4%, divs 18%, trans 75%.]

[Figure: Dynamics kernel. Instruction Mix: adds 14509, muls 16447, divs 387, sqrt 1 (unit unclear). Breakdown of CPU Time: adds 34%, muls 34%, divs 32%.]

[Figure: Read/Write Ratio and Write Access Rate per state variable: U.iryn, Q.qv, U.iene, U.imy, mu, Q.qpres, Unew.iene, Unew.imy, Q.qtemp, Ddiag.n, vsm, dxe.n, ux, vy, U.irho, uy, vx, wx, Q.qrho, Fdif.imx, Fdif.imz, Fdif.irho, Fhyp.iene, Fhyp.imy, Fhyp.irho, Hg.iene, Hg.imy, Hg.iryn, Uprime.imx, Uprime.imz, Uprime.iryn, xi.]

[Figure: Memory Traffic (GB) vs. Cache Size (kB), comparing Pin Tool and ExaSAT.]

[Figure: ExaSAT State Variable Analysis. Number of Accesses vs. Unique Variable IDs (sorted by number of accesses), for 9, 21, 53, 71, and 107 species.]
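
The gap between an instruction mix and a CPU time breakdown comes from per-operation cost: a few expensive operations can dominate run time. The sketch below weights the chemistry kernel's operation counts from the figure (adds 3620, muls 4303, divs 540, trans 710) by per-operation cycle costs. The cycle costs are illustrative assumptions, not values from the slides, so the resulting shares should not be read as a reproduction of the figure.

```python
# Operation counts for the chemistry kernel, as labelled in the figure.
counts = {"adds": 3620, "muls": 4303, "divs": 540, "trans": 710}

# Cycles per operation: hypothetical values chosen only for illustration.
cycles_per_op = {"adds": 1, "muls": 1, "divs": 40, "trans": 120}


def time_share(counts, cycles_per_op):
    """Return each operation class's fraction of the total estimated cycles."""
    cycles = {op: n * cycles_per_op[op] for op, n in counts.items()}
    total = sum(cycles.values())
    return {op: c / total for op, c in cycles.items()}


shares = time_share(counts, cycles_per_op)
for op, share in shares.items():
    print(f"{op}: {share:.1%}")
```

With any cost table in which transcendentals are far more expensive than adds and multiplies, the small number of transcendental operations ends up with the largest share of time.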

Energy Efficiency Analysis

Illustrate the benefit of large L1 scratchpads

[Figure: Byte to Flop Ratios vs. Cache Size for Loop Fusion Scenarios ("best" block size). Byte to Flop Ratio vs. Cache Size (kB), with curves for Baseline, Simple, Aggressive, Baseline (no cache), and Fused (no cache). The plot appears twice; one copy also marks the "Best" Strategy.]

[Figure: two bar charts of power vs. cache size (4k to 256k), one with components FLOP, Cache Dynamic, and Cache Static, the other adding Memory Base. Captions: "Power consumed by large caches", "Power w/DRAM".]

[Figure: Baseline vs. Fusion for 32k and 256k caches. Caption: "Power Reduction Opportunity".]

Hardware and Software Models

Processor, network, and memory models in software and hardware emulation.

CoDEx Tools: Putting Chisel to work for DOE

A path to hardware and software models

‣  New hardware DSL
‣  Scala based
    •  Powerful generators
    •  Object-oriented constructs
‣  Generate C++ and Verilog models from single description
‣  Glue to existing infrastructure with SystemC

[Figure: Chisel (embedded in Scala) feeds two paths. Software Compilation leads to C++ Simulation and SystemC Simulation; Hardware Compilation leads to Verilog for FPGA and ASIC.]

CoDEx Interoperability

SystemC as a platform for re-use

‣  SystemC embraced by
    •  Industrial partners
        -  Intel, Micron, ARM, Cadence, Synopsys
    •  Other research efforts
        -  CAL, FastForward
‣  All CoDEx software simulation tools based on SystemC

[Figure: SystemC as the common layer linking CoDEx, SST, Chisel, and Industry Tools.]

CoDEx Tools: FPGA Emulation

Enabling full application optimization

‣  1000x faster than SW models
    •  Scales independent of processor count
‣  CoDEx software models have parallel HW emulation path
‣  Performance and energy counters provide SW model level of detail

CoDEx Tools: OpenSoC Fabric

Parameterized NoC generation tool

‣  Leverage Chisel to create highly parameterized, flexible model for NoC generation
    •  Dimensions, topology, VCs, etc. all configurable
    •  Fast, functional SW model with SystemC integration
    •  Verilog model for FPGA and ASIC flows
‣  AXI based interface
    •  Integrate with Tensilica as well as ARM based cores
‣  Builds on previous PhoenixSim network model

[Figure: OpenSoC Fabric connecting CPU(s), HMC, 10GbE, and PCIe blocks through AXI interfaces.]

OpenSoC Fabric

[Figure: flits moving through a mesh network.]
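
To make "flits moving through a mesh network" concrete, the sketch below walks one flit across a 2D mesh using dimension-order (XY) routing. This is a generic illustration, not OpenSoC Fabric's implementation or API; the routing policy, the `MeshConfig` fields, and the function names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshConfig:
    """Hypothetical NoC parameters; a real generator exposes many more."""
    cols: int
    rows: int
    virtual_channels: int = 2


def xy_route(cfg, src, dst):
    """Return the (x, y) routers a flit visits, travelling along X first."""
    for x, y in (src, dst):
        if not (0 <= x < cfg.cols and 0 <= y < cfg.rows):
            raise ValueError(f"router {(x, y)} is outside the mesh")
    x, y = src
    path = [(x, y)]
    while x != dst[0]:              # move along X until the column matches
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:              # then move along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


cfg = MeshConfig(cols=4, rows=4)
print(xy_route(cfg, (0, 0), (2, 3)))
```

Changing `cols` and `rows` changes the mesh without touching the routing function, which is the kind of parameterization the slides attribute to the generator.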

CoDEx Tools: NVRAMSim

Non-Volatile Memory Modeling

‣  Collaboration with Myoungsoo Jung, UT Austin
‣  Build on prior work of page/block addressable NVRAM simulator
    •  Extend to byte / word addressability
    •  Alternative memory cell architecture
‣  Dynamic energy model
    •  Aware of internal NVRAM components with variable clock frequencies
‣  Validated against hardware evaluation platform
‣  Integrates with CoDEx processor and NoC models
    •  Supplements existing CoDEx DRAM model (DRAMSim2)

NVRAMSim: Alternative Burst Buffer RRAM Organization

[Figure: Bandwidth (KB/sec), min/max, for RRAM configurations 1, 2, 4, 8, 16, 32, 64, and 128, with 32nm MLC NAND and 32nm SLC NAND min/max shown for reference.]

[Figure: Average Latency (ns) vs. Data Movement Sizes (KB), from 1 to 4096 KB, for SLC, MLC, and RRAM configurations 1 through 128.]

CoDEx Tools: Processor Models

Highly configurable Xtensa embedded core

‣  Tensilica Xtensa processor generator
    •  Fast configuration of cache, local store and easily extensible ISA
    •  Highly flexible FIFO based interfaces (TIEQueues)
    •  Rich performance counter and debug interface
‣  All custom cores generated with
    •  Fast functional model
    •  SystemC based performance model
    •  Verilog implementation ready for FPGA or ASIC flow
‣  Verified power model
‣  Integrates with all CoDEx hardware and software models

CoDEx Summary

‣  CoDEx tools span range of abstraction levels
‣  Embrace SystemC as common glue
    •  Enables interoperation with DOE and industrial models
‣  Validated processor model
‣  DRAM and NVRAM models
‣  Parameterized NoC generator
‣  Parallel modeling path includes
    •  Flexible software models
    •  High-speed FPGA based emulation

CoDEx: Summary

‣  Key Contribution:
    •  Providing an integrated environment from static analysis to hardware synthesis
‣  Interoperability Biggest Challenge
    •  CoDEx designed to be a collection of tools that stand alone or can be combined to create something more powerful
‣  Opportunities exist to leverage capabilities of other simulation environments