Upload
others
View
29
Download
0
Embed Size (px)
Citation preview
1
CoDEx: CoDesign for Exascale David Donofrio Cy Chan, Farzad Fatollahi-Fard, George Michelogiannakis, John Shalf ModSim August 13, 2014
2
Objective Create a comprehensive architectural simulation platform to accelerate hardware/software co-design for exascale computing
CoDEx Objective
3
Code Analysis Static analysis, basic speeds and feeds, unbounded hardware models
Rapid Exploration
Use software based simulators and software
kernels to explore hardware parameter
space Synthesis of Point Design Use hardware emulation tools for full application optimization and extraction of accurate power and area estimates
1
2
3
Create a comprehensive architectural simulation platform to accelerate hardware/software co-design for exascale computing
CoDEx Tool Summary
4
Tools span from high-level, concept exploring tools to physical circuit synthesis
Circuit Synthesis
Hardware Emulation
Concept
Software Simulation
ExaSAT and spreadsheet models
OpenSoC, DRAMSim, NVRAMSim, HMC Models, CHISEL generated components, Tensillica Xtensa models, SystemC based models from industry
Gateware, CHISEL generated components, NoCSim, Tensilica Xtensa processors
FPGAs, ASIC, Power model parameter extraction
5
Conceptual Analysis
The Old Way of Analytic Modeling
first read, first wri,en Vars that can be then wri,en then read read from registers Module Read-‐only Write-‐only RW WR Stencil? a2er wri4en ctoprim ctoprim(t) loop1 U1-‐U5 Q1,Q5,Q6 Q2,Q3,Q4 No Q2,Q3,Q4 loop2 Q1-‐Q5 ctoprim(f) loop1 U1-‐U5 Q1,Q5,Q6 Q2,Q3,Q4 No Q2,Q3,Q4 di<erm D1 No ux,vx,wx loop1 Q2, Q3, Q4 ux, vx, wx Yes D2-‐D4 across loops uy,vy,wy loop2 Q2, Q3, Q4 uy, vy, wy Yes uz,vz,wz loop3 Q2, Q3, Q4 uz, vz, wz Yes imx loop4 Q2, vy, wz D2 Yes imy loop5 Q3, ux, wz D3 Yes imz loop6 Q4, ux, vy D4 Yes iene loop7 Q2-‐4,Q6, D2-‐4 D5 Yes ux-‐z, vx-‐z, wx-‐z hypterm loop1 U2-‐U5, Q2, Q5 F1-‐F5 Yes F1-‐F5 acroos loops loop2 U2-‐U5, Q3, Q5 F1-‐F5 Yes loop3 U2-‐U5, Q4, Q5 F1-‐F5 Yes CalcU N+1/3 loop1 U, D, F Unew No N+2/3 loop2 U, D, F Unew No N+1 loop3 Unew, D, F U No
6
The Old Way of Analytic Modeling
7
in#bytes in#bytes in#bytes in#bytes in#MB
##of#Reads ##of#Reads ##of#Writes Reads#w/o#halo Reads#w#halo Writes Total#Memory#Access
Module pt w#halo/pt /pt /box /box /box /box /box
ctoprim-courno=true loop1 5 0 6 1310720 0 1572864 2883584 2.75
loop2 5 0 0 1310720 0 0 1310720 1.25
courno=false loop1 5 0 6 1310720 0 1572864 2883584 2.75
diffterm 0 0 1 0 0 262144 262144 0.25
ux,vx,wx loop1 0 3 3 0 1376256 786432 2162688 2.0625
uy,vy,wy loop2 0 3 3 0 1376256 786432 2162688 2.0625
uz,vz,wz loop3 0 3 3 0 1376256 786432 2162688 2.0625
imx loop4 0 3 1 0 1376256 262144 1638400 1.5625
imy loop5 0 3 1 0 1376256 262144 1638400 1.5625
imz loop6 0 3 1 0 1376256 262144 1638400 1.5625
iene loop7 15 1 1 3932160 458752 262144 4653056 4.4375
hyptermloop1 0 6 5 0 2752512 1310720 4063232 3.875
loop2 5 6 5 1310720 2752512 1310720 5373952 5.125
loop3 5 6 5 1310720 2752512 1310720 5373952 5.125
CalcUN+1/3 loop1 15 0 5 3932160 0 1310720 5242880 5
N+2/3 loop2 20 0 5 5242880 0 1310720 6553600 6.25
N+1 loop3 20 0 25 5242880 0 6553600 11796480 11.25
CoDEx Tools: ExaSAT
‣ Extracts key application statistics in HW independent fashion
• HW configs parameterized for code performance estimation
‣ Estimate performance benefits of code transforms without changing the code
8
Compiler driven performance analysis framework
Input&Code&
&Compiler&Analysis&
&&&&&
Func5on&and&Loop&&A7ribute&Analysis&
Arithme5c&Ops,&Read/Write,&Stencil&Access&Analysis&
Performance Spreadsheet
Dependency Graph
Exascale&Machine&Config&<XML>&
ROSE&Frontend&
AST
ExaSAT Framework
User&Parameters&
Code&Descrip5on&<XML>&
&Performance&Model&
&&&&&
Working&Set&and&Memory&Traffic&Modeling&
Loop&Dependency&Memory&Footprint&Analysis&
Optimized HW/SW configurations
(reduced space)
ExaSAT Analysis
9
Automate a lot of tedious analysis
0.5$
1$
2$
4$
2$ 4$ 8$ 16$ 32$ 64$ 128$
Bytes$p
er$Flop$
Block$Size$
Cache$Blocking$on$Dynamics$Kernel$(53$species)$
No$Cache$
16$kB$Cache$
64$kB$Cache$
Unlimited$Cache$
4$MB$Cache$
1$MB$Cache$
256$kB$Cache$
adds,%3620%
muls,%4303%
divs,%540%
trans,%710%
Instruc8on%Mix%
adds%3%% muls%
4%%divs%18%%
trans%75%%
Breakdown%of%CPU%Time%
adds,%14509%
muls,%16447%
divs,%387%
sqrt,1%%
Instruc8on%Mix%%
adds%34%%
muls%34%%
divs%32%%
Breakdown%of%CPU%Time%
Chemistry Dynamics
0.00%$
0.10%$
0.20%$
0.30%$
0.40%$
0.50%$
0.60%$
0.70%$
0.80%$
0.90%$
0$
1$
2$
3$
4$
5$
6$
7$
8$
9$
10$
U.iryn$
Q.qv$
U.iene
$U.im
y$mu$
Q.qpres$
Une
w.iene
$Une
w.im
y$Q.qtemp$
Ddiag.n$
vsm$
dxe.n$ ux$
vy$
U.irho
$uy$
vx$
wx$
Q.qrho$
Fdif.im
x$Fdif.im
z$Fdif.irh
o$Fhyp.iene
$Fhyp.im
y$Fhyp.irho
$Hg
.iene
$Hg
.imy$
Hg.iryn$
Uprim
e.im
x$Uprim
e.im
z$Uprim
e.iry
n$ xi$
!Write!Access!Rate!
!
Read
/Write!Ra
.o!
Read/Write$RaKo$ Write$Access$Rate$
2"
2.5"
3"
3.5"
4"
4.5"
5"
5.5"
6"
6.5"
8" 32" 128" 512" 2048" 8192" 32768"
Mem
ory'Traffi
c'(GB)'
Cache'Size'(kB)'
Pin"Tool"ExaSAT"
1"
4"
16"
64"
256"
1024"
4096"
1" 4" 16" 64" 256" 1024" 4096"
Num
ber"o
f"Accesses"
Unique"Variable"IDs"(sorted"by"number"of"accesses)"
ExaSAT'State'Variable'Analysis''
9"Species"
21"Species"
53"Species"
71"Species"
107"Species"
Energy Efficiency Analysis
0"
50"
100"
150"
200"
250"
300"
4k" 8k" 16k" 32k" 65k" 128k" 256k"
FLOP"
Cache"Dynamic"
Cache"Sta;c"
0.125&
0.25&
0.5&
1&
2&
4&
8&
2& 8& 32& 128& 512& 2048&
Byte%to
%Flop%Ra
,o%
Cache%Size%(kB)%
Byte%to%Flop%Ra,os%vs%Cache%Size%for%Loop%Fusion%Scenarios%("best"%block%size)%
Baseline&
Simple&
Aggressive&
Baseline&(no&cache)&
Fused&(no&cache)&
"Best"&Strategy&0.125&
0.25&
0.5&
1&
2&
4&
8&
2& 8& 32& 128& 512& 2048&
Byte%to
%Flop%Ra
,o%
Cache%Size%(kB)%
Byte%to%Flop%Ra,os%vs%Cache%Size%for%Loop%Fusion%Scenarios%("best"%block%size)%
Baseline&
Simple&
Aggressive&
Baseline&(no&cache)&
Fused&(no&cache)&
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
8000"
9000"
10000"
4k" 8k" 16k" 32k" 65k" 128k" 256k"
Memory"Base"
FLOP"
Cache"Dynamic"
Cache"StaBc"
Power consumed by large caches
0"
1"
2"
3"
4"
5"
6"
7"
8"
9"
10"
Baseline" Fusion"
32k"
256k"
Power w/DRAM
Power Reduction Opportunity
Illustrate the benefit of large L1 scratchpads
11
Hardware and Software Models Processor, network, and memory models in software and hardware emulation.
CoDEx Tools: Putting Chisel to work for DOE
12
‣ New hardware DSL
‣ Scala based • Powerful generators
• Obj Oriented constructs
‣ Generate C++ and Verilog models from single description
‣ Glue to existing infrastructure with SystemC
A path to hardware and software models
Verilog
FPGA ASIC
Hardware Compilation
Software Compilation
SystemC Simulation
C++ Simulation
Scala
Chisel
CoDEx Interoperability
13
‣ SystemC embraced by
• Industrial partners - Intel, Micron, ARM, Cadence,
Synopsis
• Other research efforts - CAL, FastForward
‣ All CoDEx software simulation tools based on SystemC
SystemC as a platform for re-use
SystemC
CoDEx
SST
Chisel
Industry Tools
CoDEx Tools: FPGA Emulation
14
‣ 1000x Faster than SW models • Scales independent of
processor count
‣ CoDEx Software models have parallel HW emulation path
‣ Performance and Energy counters provide SW model level of detail
Enabling full application optimization
CoDEx Tools: OpenSoC Fabric
15
‣ Leverage Chisel to create highly parameterized, flexible model for NoC generation • Dimensions, topology, VCs, etc. all
configurable
• Fast, functional SW model with SystemC integration
• Verilog model for FPGA and ASIC flows
‣ AXI Based Interface • Integrate with Tensilica as well as
ARM based cores
‣ Builds on previous PhoenixSim network model
Parameterized NoC generation tool
AXIOpenSoC
FabricCPU(s)
HMC
AXI
AXI
CPU(s)
AXI CPU(s)
AXI
CPU(s)
AXI
CPU(s)
AXI
AXI
10GbE
PCIe
OpenSoC Fabric
16
Flits moving through a mesh network
CoDEx Tools: NVRAMSim
‣ Collaboration with Myoungsoo Jung UT Austin
‣ Build on prior work of page/block addressable NVRAM simulator
• Extend to byte / word addressability
• Alternative memory cell architecture
‣ Dynamic energy model
• Aware of internal NVRAM components with variable clock frequencies
‣ Validated against hardware evaluation platform
‣ Integrates with CoDEx processor and NoC models
• Supplements existing CoDEx DRAM model (DRAMSim2)
17
Non-Volatile Memory Modeling
NRAMsim: Alternative Burst Buffer RRAM Organization
18
R RAM
$1
R RAM
$2
R RAM
$4
R RAM
$8
R RAM
$16
R RAM
$32
R RAM
$64
R RAM
$128
0
5000
10000
15000
20000
25000
30000
35000
32nm/ML C /NAND /(Max )
Ban
dwidth/(KB/sec
)
/Min/Max
32nm/ML C /NAND /(Min)
32nm%S L C %NAND %(Max )
32nm%S L C %NAND %(Min)
1 2 4 8 16 32 64 128256512102420484096
5.0x108
1.0x109
1.5x109
2.0x109
2.5x109
3.0x109
Ave
rage
2Laten
cy2(ns
)
D a ta 2Movement2S iz es 2(K B )
2S L C2ML C2R R AMF12R R AMF22R R AMF42R R AMF82R R AMF162R R AMF322R R AMF642R R AMF128
CoDEx Tools: Processor Models ‣ Tensilica XTensa processor generator
• Fast configuration of cache, local store and easily extensible ISA
• Highly flexible FIFO based interfaces (TIEQueues)
• Rich performance counter and debug interface
‣ All custom cores generated with • Fast functional model
• SystemC based performance model
• Verilog implementation ready for FPGA or ASIC flow
‣ Verified power model
‣ Integrates with all CoDEx hardware and software models 19
Highly configurable XTensa embedded core
CoDEx Summary
‣ CoDEx tools span range of abstraction levels
‣ Embrace SystemC as common glue • Enables interoperation with DOE and industrial models
‣ Validated processor model
‣ DRAM and NVRAM models
‣ Parameterized NoC generator
‣ Parallel modeling path includes • Flexible software models
• High-speed FPGA based emulation
20
CoDEx: Summary
‣ Key Contribution: • Providing an integrated environment from static analysis
to hardware synthesis
‣ Interoperability Biggest Challenge • CoDEx designed to be a collection of tools that stand
alone or can be combined to create something more powerful
‣ Opportunities exist to leverage capabilities of other simulation environments
21