[Figure: Model accuracy results on SPEC CPU2006. Y-axis: % error (0 to 10). X-axis: CMP - SMT configuration (1-1, 1-2, 1-4, 2-1, 2-2, 2-4, 4-1, 4-2, 4-4, 6-1, 6-2, 6-4, 8-1, 8-2, 8-4, Mean). Series: TD Micro, TD Random, TD SPEC, BU.]
[Figure: Model accuracy results. Y-axis: % error (0 to 20). X-axis: validation set (FXU High, FXU Low, L1 Loads, Main Memory, VSU High, VSU Low, Mean). Series: TD Micro, TD Random, TD SPEC, BU.]
62%
Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks
Ramon Bertran*† Alper Buyuktosunoglu† Meeta S. Gupta† Marc Gonzàlez* Pradip Bose†
*Barcelona Supercomputing Center †IBM T.J. Watson Research Center
{ramon.bertran,marc.gonzalez}@bsc.es {rbertra,alperb,mgupta,pbose}@us.ibm.com
• Incorporates specialized knowledge of the system
  • Instruction set architecture definition
  • Micro-architecture definition
  • Micro-architecture analytical models
• Flexible and adaptive
  • Definitions are user-provided text files
• Efficient
  • Analytical models avoid unnecessary design space explorations
Characterization
• Identify performance bottlenecks
• Power/thermal/noise issues
• Power virus generation
• Stressmarks for power management control loop
• di/dt stressmarks
• Reverse engineer hardware parameters
Verification
• Validate simulators and models
• Test compiler transformations
Synthesize workloads
• Fast evaluation of real benchmarks
• Clone proprietary workloads
• Time consuming and tedious
• Error-prone task (trial and error process)
• Several micro-benchmarks are required
• Deep expertise limited to few designers
• Detailed knowledge of the underlying architecture is required
[Figure: MicroProbe framework overview. User inputs: micro-benchmark generation policy, architecture definition files. Outputs: max power stressmark, micro-benchmark. External tools: real platforms, simulators, models.]
A novel productive micro-benchmark generation framework: MicroProbe
• Adaptive and flexible
• Micro-architecture semantics aware
• Integrated design space exploration
Research idea
Micro-benchmark generation policies (user-defined scripts)
• Loop stressing the floating point unit
• Sequence of loads hitting 50% L1 and 50% L2
• Generate a stressmark for each functional unit of the architecture
• Search for the sequence of 2 loads and 2 integer operations with maximum IPC
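The last policy above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use the MicroProbe API, the function names are hypothetical, and `toy_ipc` is a stand-in for a real measurement (it simply rewards alternating instruction kinds and models no actual processor).

```python
from itertools import permutations


def distinct_orderings(instructions):
    """Return every distinct ordering of a multiset of instruction kinds."""
    return sorted(set(permutations(instructions)))


def search_max_ipc(instructions, measure_ipc):
    """Exhaustively score each ordering and return the best (ordering, ipc)."""
    scored = [(measure_ipc(seq), seq) for seq in distinct_orderings(instructions)]
    best_ipc, best_seq = max(scored)
    return best_seq, best_ipc


def toy_ipc(seq):
    """Hypothetical stand-in for running the benchmark and reading counters."""
    switches = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return 1.0 + 0.5 * switches


kinds = ["load", "load", "int", "int"]
candidates = distinct_orderings(kinds)
best_seq, best_ipc = search_max_ipc(kinds, toy_ipc)
```

With two loads and two integer operations there are only six distinct orderings, so an exhaustive search is cheap; larger sequences are where a guided search becomes worthwhile.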
[Figure: MicroProbe Framework (Python API). Architecture module: ISA definitions, micro-architecture definitions, micro-architecture analytical models, automatic bootstrap process. Code generation module: micro-benchmark synthesizer, passes. Design space exploration module: search drivers. Also shown: properties, micro-benchmarks, external tools.]
Feature comparison: MicroProbe vs. previous works (parenthesized notes as they appear in the source table)
Design space exploration
- Customizable search (manual)
- Exhaustive search (manual)
- GA-based search
- Integrated (no)
Code generation
- Configurable passes (no)
- Skeleton and instruction definition passes, memory modeling pass, branch modeling pass, ILP definition pass
Micro-architecture models
- Set-associative cache model (no)
Micro-architecture queries
- Functional unit, latency, throughput, energy per instruction, average instruction power etc. (manual)
ISA queries
- Operand length, binary codification etc. (manual)
- Instruction type
[Figure: Architecture module class diagram. Target architecture: instruction set architecture (ISA), micro-architecture, micro-architecture models. ISA elements: Instruction, Register, Format, Field, Type, Operand (constant, register, immediate). Micro-architecture elements: Component (memory, register file, functional unit), Property, Register. Models: set-associative cache model, branch model.]
Automatic Bootstrap Process
• Micro-benchmark synthesizer drives the code generation process
• Compiler-like pass-based design
  • User controls the sequence of passes
• Micro-benchmarks are represented as a set of building blocks
  • Threads, functions, statements, instructions etc.
• Passes transform the micro-benchmark representation
  • Add/Remove/Change building blocks
• Passes have generic access to the architecture module
  • Architecture independent
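The pass-based design described above can be illustrated with a toy pipeline. This is a minimal sketch, not MicroProbe itself: every class name below is hypothetical, and the benchmark representation is reduced to a register map and a flat instruction list.

```python
import random


class Benchmark:
    """Toy micro-benchmark representation: registers plus an instruction list."""

    def __init__(self):
        self.registers = {}
        self.instructions = []


class InitRegisters:
    """Pass: give each named register a random initial value."""

    def __init__(self, names, rng):
        self.names, self.rng = names, rng

    def __call__(self, bench):
        for name in self.names:
            bench.registers[name] = self.rng.randrange(2 ** 16)


class SimpleLoop:
    """Pass: add a single loop body of `size` empty instruction slots."""

    def __init__(self, size):
        self.size = size

    def __call__(self, bench):
        bench.instructions = [None] * self.size


class FillInstructions:
    """Pass: assign an opcode from the given list to every slot."""

    def __init__(self, opcodes, rng):
        self.opcodes, self.rng = opcodes, rng

    def __call__(self, bench):
        bench.instructions = [self.rng.choice(self.opcodes)
                              for _ in bench.instructions]


class Synthesizer:
    """Applies the registered passes, in order, to a fresh benchmark."""

    def __init__(self):
        self.passes = []

    def add_pass(self, new_pass):
        self.passes.append(new_pass)

    def synthesize(self):
        bench = Benchmark()
        for apply_pass in self.passes:
            apply_pass(bench)
        return bench


rng = random.Random(0)
opcodes = ["add", "ld", "fmul"]  # illustrative opcode names only
synth = Synthesizer()
synth.add_pass(InitRegisters(["r1", "r2"], rng))
synth.add_pass(SimpleLoop(8))
synth.add_pass(FillInstructions(opcodes, rng))
bench = synth.synthesize()
```

Because each pass only sees the benchmark representation, reordering or swapping passes changes the generated code without touching the synthesizer, which is the property the compiler-like design is after.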
Automatic Bootstrap Support
Inputs:
• ISA definition
• Partial micro-architecture definition:
  - Functional units and their performance counters
  - IPC property definition (performance counters and formula)
For each instruction of the ISA, generate:
• Micro-benchmark A: endless loop of 4K instances of the instruction with dependency distance 1
• Micro-benchmark B: endless loop of 4K instances of the instruction without dependencies
Run, gather performance counters and power measurements:
• IPC, latency, throughput, average power, energy per instruction and the functional units used
Output:
• Complete micro-architecture definition:
  - Functional units ISA mapping
  - Instruction latency, throughput, EPI, power
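One plausible way to turn the two measurements into per-instruction properties is sketched below. The poster does not give the formulas, so this rests on assumptions: with dependency distance 1 each instruction waits for the previous one, so cycles per instruction in benchmark A approximates the latency; benchmark B's IPC approximates the sustained throughput; and energy per instruction is taken as average power times run time divided by instruction count, with no baseline power subtracted. The function name and the sample numbers are hypothetical.

```python
import math


def characterize(instr_a, cycles_a, instr_b, cycles_b, power_b_watts, freq_hz):
    """Derive per-instruction properties from the two bootstrap benchmarks."""
    ipc_dependent = instr_a / cycles_a      # benchmark A: dependency distance 1
    ipc_independent = instr_b / cycles_b    # benchmark B: no dependencies
    seconds_b = cycles_b / freq_hz          # run time of benchmark B
    return {
        "latency_cycles": 1.0 / ipc_dependent,
        "throughput_ipc": ipc_independent,
        "avg_power_watts": power_b_watts,
        # energy per instruction = power * time / instructions
        "epi_joules": power_b_watts * seconds_b / instr_b,
    }


# Made-up counter values, only to exercise the arithmetic.
props = characterize(
    instr_a=1_000_000, cycles_a=4_000_000,
    instr_b=2_000_000, cycles_b=1_000_000,
    power_b_watts=50.0, freq_hz=1e9,
)
```

For the made-up inputs this gives a latency of 4 cycles, a throughput of 2 instructions per cycle and roughly 25 nJ per instruction.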
• Statically ensure a particular hit/miss ratio on a given cache level to avoid time-consuming design space explorations
• Knowledge and control of the set used on each cache level
• Distribute the available cache sets among the different cache levels and generate the required activity on each cache level
• API to define the design space
  • Design space is a first class abstraction
• API to implement search drivers
  • Exhaustive, genetic algorithm or user-defined searches
  • Search drivers have access to the code generation and architecture modules
    • Guide the search using architecture information
    • Modify the code generation policy during the search
• API to evaluate the generated solutions
  • Interface to external tools
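The three pieces listed above (design space, search driver, evaluator) can be sketched as a small genetic search. This is an illustration under stated assumptions, not the MicroProbe API: the dimension names, the combiner and mutator strategies, and the `toy_score` evaluator are all hypothetical, and a real evaluator would generate the benchmark and call an external tool.

```python
import random

# Design space: each dimension is a discrete range of values (hypothetical).
SPACE = {
    "loads": range(0, 9),
    "int_ops": range(0, 9),
    "dep_distance": range(1, 17),
}


def random_point(rng):
    """Pick one value per dimension."""
    return {dim: rng.choice(values) for dim, values in SPACE.items()}


def combine(a, b, rng):
    """Combiner: take each dimension from one of the two parents."""
    return {dim: rng.choice((a[dim], b[dim])) for dim in SPACE}


def mutate(point, rng, rate=0.2):
    """Mutator: with probability `rate`, resample a dimension at random."""
    return {dim: rng.choice(SPACE[dim]) if rng.random() < rate else value
            for dim, value in point.items()}


def genetic_search(evaluate, generations=30, population=20, seed=0):
    """Search driver: keep the best half, refill with mutated offspring."""
    rng = random.Random(seed)
    pop = [random_point(rng) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=evaluate, reverse=True)
        parents = pop[: population // 2]
        children = [
            mutate(combine(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population - len(parents))
        ]
        pop = parents + children
    return max(pop, key=evaluate)


def toy_score(point):
    """Stand-in evaluator; not a model of any real processor."""
    return (-abs(point["loads"] - 2) - abs(point["int_ops"] - 2)
            + point["dep_distance"])


best = genetic_search(toy_score)
```

Since real evaluations are expensive, a practical driver would cache scores per point rather than re-evaluating on every sort as this sketch does.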
[Figure: Design space exploration module class diagram. Elements: SearchDriver (GeneticAlgorithm, Exhaustive), DesignSpace, Dimension (DiscreteRange, ContinuousRange, Boolean, Map, Distribution, Set of Value), Design Space Point, Evaluator (StaticEvaluator, DynamicEvaluator, PropertyEvaluator, Command Line Evaluator), Combiner and Mutator (Average, Intersect, Min, Max, Random, Extreme).]
[Figure: Code generation module class diagram. Elements: Benchmark Synthesizer, Pass, Micro-benchmark, Building block (Function, Thread, Loop, Statement, Instruction), Property.]
Pass 1: init registers with random values
Pass 2: add a single basic block loop of 4K instructions
Pass 3: assign random instructions of the architecture ISA
Pass 4: assign random memory access pattern
Pass 5: assign random dependency distances between instructions
Pass 6: assign random immediate operands
< variable declaration>
< register initialization>
< memory register initialization>
/* Loop starts here */
__asm__(" infloop: ");
__asm__(" addis 9,9,49 ");
__asm__(" addi 9,9,8288 ");
__asm__(" xvrdpi 62,63 ");
__asm__(" xvmaddmdp 60,61,35 ");
__asm__(" xvcmpeqdp. 34,60,57 ");
__asm__(" crorc 2,3,0 ");
__asm__(" vsrb 17,16,15 ");
__asm__(" dctdp 12,13 ");
__asm__(" fsel 10,11,16,17 ");
__asm__(" vcmpequh. 14,13,12 ");
__asm__(" addi 4,4,1170 ");
__asm__(" lwax 31,3,4 ");
__asm__(" addi 4,4,-15294 ");
__asm__(" lbzx 20,3,4 ");
__asm__(" fctiwz 14,15 ");
__asm__(" drdpq 18,8 ");
__asm__(" srawi 21,23,6 ");
__asm__(" xsnmsubmdp 56,55,54 ");
__asm__(" vcmpgtuw. 11,10,19 ");
__asm__(" dxexq 0,10 ");
__asm__(" xori 26,27,25460 ");
__asm__(" stwx 20,7,8 ");
__asm__(" vrefp 18,20 ");
__asm__(" mulldo 25,26,28 ");
__asm__(" vsububs 21,26,27 ");
__asm__(" fsel. 19,9,1,2 ");
< memory access guard instruction >
< e.g. check/set index/base registers>
__asm__(" b infloop ");
import microprobe.arch
import microprobe.code
import microprobe.passes.cfg
import microprobe.passes.dep
import microprobe.passes.init
import microprobe.passes.ins
import microprobe.passes.mem

# Output directory for the generated benchmarks (placeholder value)
outputdir = "."

# Get the architecture object
architecture = microprobe.arch.get_architecture("power7")

# Create the benchmark synthesizer
synth = microprobe.code.Synthesizer(architecture)

# Add the passes we want to apply in order to synthesize benchmarks
# Pass 1: Init registers to random values
synth.add_pass(microprobe.passes.init.INIT_RandomRegisters())
# Pass 2: Add a single basic block of size 4096
synth.add_pass(microprobe.passes.cfg.CFG_SimpleBasicBlock(4096))
# Pass 3: Fill the basic block using random instructions from the architecture
synth.add_pass(microprobe.passes.ins.INS_RandomInstruction(architecture))
# Pass 4: Add a random memory access pattern
synth.add_pass(microprobe.passes.mem.MEM_RandomMemoryModel())
# Pass 5: Set the dependency distance between instructions
synth.add_pass(microprobe.passes.dep.DEP_RandomDependency())
# Pass 6: Init immediate values to random values
synth.add_pass(microprobe.passes.init.INIT_RandomImmediates())

for name in range(0, 20):
    # Generate the benchmark (applies the passes)
    bench = synth.synthesize()
    # Save the benchmark
    bench.save("%s/random-%s" % (outputdir, name))
• Models trained using non-micro-architecture aware training sets (TD_Random/TD_SPEC) show high errors and variability
• Models trained using the micro-architecture aware training set (TD_Micro/BU) show acceptable error margins: