
Ramon Bertran*† Alper Buyuktosunoglu† Meeta S.Gupta† Marc … · 2012-12-08 · • Compiler-like pass-based design • User controls the sequence of passes • Micro-benchmarks


  • Model accuracy results on SPEC CPU2006

    [Figure: % error (y-axis, 0 to 10) per CMP - SMT configuration (x-axis: 1-1, 1-2, 1-4, 2-1, 2-2, 2-4, 4-1, 4-2, 4-4, 6-1, 6-2, 6-4, 8-1, 8-2, 8-4, Mean). Series: TD Micro, TD Random, TD SPEC, BU.]

  • Model accuracy results

    [Figure: % error (y-axis, 0 to 20) per validation set (x-axis: FXU High, FXU Low, L1 Loads, Main Memory, VSU High, VSU Low, Mean). Series: TD Micro, TD Random, TD SPEC, BU. One value is annotated 62%.]

    Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks

    Ramon Bertran*†, Alper Buyuktosunoglu†, Meeta S. Gupta†, Marc Gonzàlez*, Pradip Bose†

    *Barcelona Supercomputing Center, †IBM T.J. Watson Research Center
    {ramon.bertran,marc.gonzalez}@bsc.es, {rbertra,alperb,mgupta,pbose}@us.ibm.com

    • Incorporates specialized knowledge of the system
        • Instruction set architecture definition
        • Micro-architecture definition
        • Micro-architecture analytical models
    • Flexible and adaptive
        • Definitions are user-provided text files
    • Efficient
        • Analytical models avoid unnecessary design space explorations

    Characterization
    • Identify performance bottlenecks
    • Power/thermal/noise issues
        • Power virus generation
        • Stressmarks for power management control loop
        • di/dt stressmarks
    • Reverse engineer hardware parameters

    Verification
    • Validate simulators and models
    • Test compiler transformations

    Synthesize workloads
    • Fast evaluation of real benchmarks
    • Clone proprietary workloads

    • Time consuming and tedious
        • Error-prone task: a trial-and-error process
        • Several micro-benchmarks are required
    • Deep expertise limited to few designers
        • Detailed knowledge of the underlying architecture is required

    [Figure: MicroProbe framework overview. User inputs: micro-benchmark generation policy and architecture definition files. Outputs: micro-benchmarks (for example, a max power stressmark). External tools: real platforms, simulators, models.]

    Research idea

    A novel productive micro-benchmark generation framework: MicroProbe
    • Adaptive and flexible
    • Micro-architecture semantics aware
    • Integrated design space exploration

    Micro-benchmark generation policies (user-defined scripts), for example:
    • Loop stressing the floating point unit
    • Sequence of loads hitting 50% L1 and 50% L2
    • Generate a stressmark for each functional unit of the architecture
    • Search for the sequence of 2 loads and 2 integer operations with maximum IPC

    MicroProbe Framework (Python API)
    • Architecture module: ISA definitions, micro-architecture definitions, micro-architecture analytical models
    • Code generation module: micro-benchmark synthesizer, passes
    • Design space exploration module: search drivers
    • Micro-benchmarks and properties
    • Automatic bootstrap process
    • External tools

    Feature comparison: MicroProbe vs. previous works

    ISA queries
    • Instruction type
    • Operand length, binary codification etc. (previous works: manual)

    Micro-architecture queries
    • Functional unit, latency, throughput, energy per instruction, average instruction power etc. (previous works: manual)

    Micro-architecture models
    • Set-associative cache model (previous works: no)

    Code generation
    • Skeleton and instruction definition passes, memory modeling pass, branch modeling pass, ILP definition pass
    • Configurable passes (previous works: no)

    Design space exploration
    • Integrated (previous works: no)
    • GA-based search
    • Exhaustive search (previous works: manual)
    • Customizable search (previous works: manual)


    [Figure: architecture module class diagram. A Target Architecture comprises an Instruction Set Architecture (ISA), a Micro-architecture and Micro-architecture Models. ISA classes: Instruction, Register, Format, Field, Type, and Operand (Constant, Register, Immediate). Micro-architecture classes: Component (Memory, Register File, Functional Unit) and Property. Model classes: Set-Associative Cache Model, Branch Model.]


    • Micro-benchmark synthesizer drives the code generation process
        • Compiler-like pass-based design
        • User controls the sequence of passes
    • Micro-benchmarks are represented as a set of building blocks
        • Threads, functions, statements, instructions etc.
    • Passes transform the micro-benchmark representation
        • Add/Remove/Change building blocks
        • Passes have generic access to the architecture module
        • Architecture independent
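The pass-based design described above can be illustrated with a minimal, self-contained sketch. It is not the MicroProbe API: `Instruction`, `Benchmark`, `Synthesizer`, `add_basic_block` and `fill_with` are hypothetical names chosen for illustration, and the building blocks are reduced to a flat list of instructions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Instruction:
    """Smallest building block: one instruction slot (hypothetical model)."""
    mnemonic: str = "nop"


@dataclass
class Benchmark:
    """A micro-benchmark reduced to a flat list of instruction blocks."""
    instructions: List[Instruction] = field(default_factory=list)


# A pass is any callable that transforms the benchmark in place.
Pass = Callable[[Benchmark], None]


def add_basic_block(size: int) -> Pass:
    """Pass factory: append `size` empty instruction slots."""
    def run(bench: Benchmark) -> None:
        bench.instructions.extend(Instruction() for _ in range(size))
    return run


def fill_with(mnemonics: List[str]) -> Pass:
    """Pass factory: assign mnemonics to the slots in round-robin order."""
    def run(bench: Benchmark) -> None:
        for i, ins in enumerate(bench.instructions):
            ins.mnemonic = mnemonics[i % len(mnemonics)]
    return run


class Synthesizer:
    """Applies the user-ordered passes to a fresh benchmark."""

    def __init__(self) -> None:
        self._passes: List[Pass] = []

    def add_pass(self, p: Pass) -> None:
        self._passes.append(p)

    def synthesize(self) -> Benchmark:
        bench = Benchmark()
        for p in self._passes:  # the user controls this order
            p(bench)
        return bench


synth = Synthesizer()
synth.add_pass(add_basic_block(4))
synth.add_pass(fill_with(["add", "lwz"]))
bench = synth.synthesize()
print([ins.mnemonic for ins in bench.instructions])
```

Swapping the two `add_pass` calls would leave the slots unfilled, which is why the order of passes is left under user control in this sketch.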

    Automatic Bootstrap Support

    Inputs:
    • ISA definition
    • Partial micro-architecture definition:
        - Functional units and their performance counters
        - IPC property definition (performance counters and formula)

    For each instruction of the ISA, generate:
    • Micro-benchmark A: endless loop with 4K instances of the instruction with dependency distance 1
    • Micro-benchmark B: endless loop with 4K instances of the instruction without dependencies

    Run, gather performance counters and power measurements, and derive IPC, latency, throughput, average power, energy per instruction (EPI) and the functional units used.

    Output: complete micro-architecture definition:
        - Functional units ISA mapping
        - Instruction latency, throughput, EPI, power
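As a rough sketch of how the bootstrap measurements might be turned into per-instruction properties, the code below derives latency from the dependent loop (micro-benchmark A) and throughput and EPI from the independent loop (micro-benchmark B). The formulas are assumptions made for illustration, not taken from the poster: latency is approximated as cycles per instruction of the fully dependent chain, and EPI as (average power minus an optional idle power) times run time, divided by the instruction count. The `Measurement` fields and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Counters and power gathered from one run (assumed fields)."""
    instructions: int    # instructions completed
    cycles: int          # core cycles elapsed
    avg_power_w: float   # average measured power, in watts


def characterize(dep: Measurement, indep: Measurement, freq_hz: float,
                 idle_power_w: float = 0.0) -> dict:
    """Derive instruction properties from micro-benchmarks A (dep) and B (indep)."""
    # Run time of benchmark B, and the energy attributed to its instructions.
    seconds = indep.cycles / freq_hz
    energy_j = (indep.avg_power_w - idle_power_w) * seconds
    return {
        # Dependency distance 1: each instruction waits for the previous one.
        "latency_cycles": dep.cycles / dep.instructions,
        # No dependencies: IPC is bounded by the issue throughput.
        "throughput_ipc": indep.instructions / indep.cycles,
        "avg_power_w": indep.avg_power_w,
        "epi_nj": energy_j / indep.instructions * 1e9,
    }


# Hypothetical measurements for a single instruction type.
bench_a = Measurement(instructions=4096, cycles=8192, avg_power_w=45.0)
bench_b = Measurement(instructions=4096, cycles=2048, avg_power_w=50.0)
props = characterize(bench_a, bench_b, freq_hz=1e9, idle_power_w=30.0)
print(props)
```

Subtracting an idle power is one possible design choice; leaving `idle_power_w` at zero attributes all measured power to the instructions.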

    • Statically ensure a particular hit/miss ratio on a given cache level to avoid time-consuming design space explorations
    • Knowledge and control of the set used on each cache level
    • Distribute the available cache sets among the different cache levels and generate the required activity on each cache level
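The idea of fixing a hit/miss ratio statically by controlling which cache sets are touched can be sketched as follows. This is an illustrative model only: it assumes a single cache level with LRU replacement and physical indexing by `(address // line_size) % num_sets`, and the cache geometry and function names are hypothetical.

```python
from collections import OrderedDict
from typing import List


def lines_in_set(target_set: int, count: int, line_size: int,
                 num_sets: int) -> List[int]:
    """Return `count` distinct line addresses that all map to `target_set`."""
    stride = line_size * num_sets  # distance to the next line in the same set
    return [target_set * line_size + k * stride for k in range(count)]


def hit_ratio(trace: List[int], line_size: int, num_sets: int,
              ways: int) -> float:
    """Replay an address trace through an LRU set-associative cache model."""
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in trace:
        line = addr // line_size
        lru = sets[line % num_sets]
        if line in lru:
            hits += 1
            lru.move_to_end(line)        # mark as most recently used
        else:
            if len(lru) == ways:
                lru.popitem(last=False)  # evict the least recently used line
            lru[line] = True
    return hits / len(trace)


# Hypothetical cache: 64-byte lines, 64 sets, 8 ways.
LINE, SETS, WAYS = 64, 64, 8
hitting = lines_in_set(0, WAYS, LINE, SETS) * 100      # fits in set 0
missing = lines_in_set(1, WAYS + 1, LINE, SETS) * 100  # thrashes set 1

# Interleave one hitting and one missing access: close to a 50% hit ratio.
mixed = [addr for pair in zip(hitting, missing) for addr in pair]
print(hit_ratio(hitting, LINE, SETS, WAYS),
      hit_ratio(missing, LINE, SETS, WAYS),
      hit_ratio(mixed, LINE, SETS, WAYS))
```

Because the two streams use different sets, they do not disturb each other, so the ratio of the mix follows directly from how many accesses are drawn from each stream, without any search.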

    • API to define the design space
        • Design space is a first-class abstraction
    • API to implement search drivers
        • Exhaustive, genetic algorithm or user-defined searches
        • Search drivers have access to the code generation and architecture modules
        • Guide the search using architecture information
        • Modify the code generation policy during the search
    • API to evaluate the generated solutions
        • Interface to external tools
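The split into design space, search driver and evaluator listed above might look like the following minimal sketch. None of these names come from MicroProbe: `DesignSpace`, `exhaustive_search` and `toy_ipc` are hypothetical, and the evaluator is a stand-in formula where a real one would generate, run and measure a micro-benchmark.

```python
import itertools
from typing import Callable, Dict, Iterator, List, Tuple

Point = Dict[str, object]


class DesignSpace:
    """Named dimensions, each with a finite list of candidate values."""

    def __init__(self, dimensions: Dict[str, List[object]]) -> None:
        self.dimensions = dimensions

    def points(self) -> Iterator[Point]:
        """Enumerate every combination of dimension values."""
        names = list(self.dimensions)
        for values in itertools.product(*self.dimensions.values()):
            yield dict(zip(names, values))


def exhaustive_search(space: DesignSpace,
                      evaluate: Callable[[Point], float]) -> Tuple[Point, float]:
    """Search driver: score every point and return the best one."""
    best = max(space.points(), key=evaluate)
    return best, evaluate(best)


def toy_ipc(point: Point) -> float:
    """Stand-in evaluator with a made-up score (not a real IPC model)."""
    penalty = 0.5 if point["loads_adjacent"] else 0.0
    return 2.0 - penalty - 0.1 * abs(point["dep_distance"] - 4)


space = DesignSpace({
    "loads_adjacent": [True, False],
    "dep_distance": [1, 2, 4, 8],
})
best_point, best_score = exhaustive_search(space, toy_ipc)
print(best_point, best_score)
```

Keeping the evaluator as a plain callable means the same driver could score points with a static model or by calling an external tool, and a genetic-algorithm driver could reuse the same `DesignSpace` object.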

    [Figure: design space exploration module class diagram. Search Driver (Genetic Algorithm, Exhaustive); Design Space, made of Dimensions (Discrete Range, Continuous Range, Boolean, Map, Distribution, Set of Value) and Design Space Points; Evaluator (Static, Dynamic, Property, Command Line); Combiner and Mutator (Average, Intersect, Min, Max, Random, Extreme).]


    [Figure: code generation module class diagram. Benchmark Synthesizer, Pass, Micro-benchmark, Property, and Building block (Function, Thread, Loop, Statement, Instruction).]


    Pass 1: init registers with random values
    Pass 2: add a single basic block loop of 4K instructions
    Pass 3: assign random instructions of the architecture ISA
    Pass 4: assign random memory access pattern
    Pass 5: assign random dependency distances between instructions
    Pass 6: assign random immediate operands

    < variable declaration >
    < register initialization >
    < memory register initialization >
    /* Loop starts here */
    __asm__(" infloop: ");
    __asm__(" addis 9,9,49 ");
    __asm__(" addi 9,9,8288 ");
    __asm__(" xvrdpi 62,63 ");
    __asm__(" xvmaddmdp 60,61,35 ");
    __asm__(" xvcmpeqdp. 34,60,57 ");
    __asm__(" crorc 2,3,0 ");
    __asm__(" vsrb 17,16,15 ");
    __asm__(" dctdp 12,13 ");
    __asm__(" fsel 10,11,16,17 ");
    __asm__(" vcmpequh. 14,13,12 ");
    __asm__(" addi 4,4,1170 ");
    __asm__(" lwax 31,3,4 ");
    __asm__(" addi 4,4,-15294 ");
    __asm__(" lbzx 20,3,4 ");
    __asm__(" fctiwz 14,15 ");
    __asm__(" drdpq 18,8 ");
    __asm__(" srawi 21,23,6 ");
    __asm__(" xsnmsubmdp 56,55,54 ");
    __asm__(" vcmpgtuw. 11,10,19 ");
    __asm__(" dxexq 0,10 ");
    __asm__(" xori 26,27,25460 ");
    __asm__(" stwx 20,7,8 ");
    __asm__(" vrefp 18,20 ");
    __asm__(" mulldo 25,26,28 ");
    __asm__(" vsububs 21,26,27 ");
    __asm__(" fsel. 19,9,1,2 ");
    < memory access guard instruction >
    < e.g. check/set index/base registers >
    __asm__(" b infloop ");


    import microprobe  # assumes the package exposes its arch, code and passes submodules

    outputdir = "."  # output directory (placeholder; not defined in the original listing)

    # Get the architecture object
    architecture = microprobe.arch.get_architecture("power7")

    # Create the benchmark synthesizer
    synth = microprobe.code.Synthesizer(architecture)

    # Add the passes we want to apply in order to synthesize benchmarks
    # Pass 1: Init registers to random values
    synth.add_pass(microprobe.passes.init.INIT_RandomRegisters())
    # Pass 2: Add a single basic block of size 4096
    synth.add_pass(microprobe.passes.cfg.CFG_SimpleBasicBlock(4096))
    # Pass 3: Fill the basic blocks using random instructions from the architecture
    synth.add_pass(microprobe.passes.ins.INS_RandomInstruction(architecture))
    # Pass 4: Add random access pattern
    synth.add_pass(microprobe.passes.mem.MEM_RandomMemoryModel())
    # Pass 5: Set the dependency distance between instructions
    synth.add_pass(microprobe.passes.dep.DEP_RandomDependency())
    # Pass 6: Init immediate values to random values
    synth.add_pass(microprobe.passes.init.INIT_RandomImmediates())

    for name in range(0, 20):
        # Generate the benchmark (applies the passes)
        bench = synth.synthesize()
        # Save the benchmark
        bench.save("%s/random-%s" % (outputdir, name))


    • Models trained using non-micro-architecture-aware training sets (TD_Random/TD_SPEC) show high errors and variability

    • Models trained using the micro-architecture-aware training set (TD_Micro/BU) show acceptable error margins: