Dynamic Compilation at the System Level

Erik Altman
Michael Gschwind

IBM T.J. Watson Research Center

2006 CGO, New York City

Context

Motivation and examples will generally be in a DAISY or BOA context.
– i.e., Co-Designed Virtual Machines

Many ideas are applicable to other forms of dynamic optimization.

The DAISY and BOA Teams

Erik Altman
Arthur Bright
Al Chang
Kemal Ebcioglu
Michael Gschwind
Marty Hopkins
Krishnan Kailas
Steve Kosonocky
Sumedh Sathaye

Craig Agricola
Dave Appenzeller
Zak Filan
Jay LeBlanc
Paul Ledak

Jason Fritts (co-op)

BOA: Binary translation Optimized Architecture


Outline of Tutorial

8:00 – 8:15    Motivation
8:15 – 8:30    Background
8:30 – 8:45    Translation Groups
8:45 – 9:30    Dynamic Optimizations (DO)
9:30 – 10:00   Virtual Machines (VM)
10:00 – 10:30  Coffee Break
10:30 – 11:00  BOA Support for VM / DO
11:00 – 11:30  BOA Arch and µArch
11:30 – 11:50  BOA Performance
11:50 – 11:55  BOA - DAISY Comparison
11:55 – 12:00  Summary and Conclusion

Motivation

Why System Level Dynamic Compilation?

Seamless Dynamic Feedback
Cross boundary optimization scope:
– Shared Libraries
– Operating System
100% compatibility with existing ISAs
Can handle:
– Arbitrary entry points
– Self-modifying code
– Multi-threading
– etc.

Why System Level Dynamic Compilation?

Hardware cracking consumes time and / or transistors.
Microcode emulation limits exploitation of ILP.
Software translation avoids these problems but requires high instruction reuse.
– Most apps have high reuse.

Why BOA / DAISY?

Out of order superscalar processors achieve high performance
... But at the cost of high hardware complexity
– Predictors
– Complex decode
– Complex issue queues with wakeup and issue logic
– Register mapping tables
– ...


Out of order superscalar processors achieve high performance
... But at the cost of high power
– Many out of order components operate every cycle.
– Many components query a large set of data to operate on a single element.
– Same set of operations performed to get the same results.


Out of order superscalars achieve high performance
... But at the cost of deep pipelines
– Complex logic has long latency.
– To achieve high frequency with long latency, super pipelining is required.
– Deep pipelines require excellent branch predictors.
– Excellent branch predictors are complex.
– Complex logic has long latency ...

Out of order superscalar processors achieve high performance
... But at the cost of high verification and debug complexity
– schedule slips → performance slips

Schedule Slip    Relative Performance
1 month          4%
3 month          12%
6 month          26%
9 month          41%
12 month         59%
18 month         100%

Moore's Law in action
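The percentages in the schedule-slip table are consistent with performance doubling every 18 months. That doubling period is an assumption inferred from the numbers, not something the slide states. A minimal sketch of the arithmetic under that assumption:

```python
def relative_gain_percent(slip_months, doubling_months=18):
    """Percent performance the rest of the field gains during a schedule slip,
    assuming performance doubles every `doubling_months` (assumed rate)."""
    return round(100 * (2 ** (slip_months / doubling_months) - 1))

# Rebuild the table rows for the slips listed on the slide.
slip_table = {months: relative_gain_percent(months) for months in (1, 3, 6, 9, 12, 18)}
```

Under the 18-month assumption the computed values agree with every row of the table, which suggests the "Relative Performance" column is the ground lost to competitors while the design is late.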

What does BOA / DAISY offer?

Software Dynamic Optimization
– Adapt code to dynamic runtime behavior
    Scheduling
    Optimization
    Speculation
– Focus hardware design on fast execution
    Reduce hardware complexity
    Simpler logic → Faster logic
    Less logic → Less power

Simple(r) Architecture == Good Architecture

Reduce hardware complexity
– But no high performance general purpose processor will ever be “simple”.
– Dynamic optimization allows some reduction in complexity.

BOA is simpler than DAISY in many ways:
– Focus mostly on BOA

What BOA Offers Compilers

Note: “Compilers” here refers to Dynamic Optimizers
Simple, orthogonal architecture
Large register set
Seamless Dynamic Feedback
Cross boundary optimization scope:
– Shared Libraries
– Operating System

What BOA Offers Architects

Shorter pipelines for same frequency.
Fewer hardware predictors.
Simpler issue logic.
Less power by eliminating repetitive steps
– E.g., Crack and Schedule
Less debug and verification.
– At least for the hardware component…
Smaller chips and higher yield.

Background

[Figure: BOA / DAISY System]

[Figure: BOA / DAISY Memory]

Booting a BOA system

1. Reset starts executing in BOA Boot Flash
2. Initialize BOA environment
    Stack, heap, translation cache, internal data structures…
3. Start compiling, then executing PowerPC boot ROM code at PowerPC reset address (0xFFF00100)
4. …eventually transfers to boot loader and causes it to be translated,
5. …loads OS, transfers control to OS, and causes it to be translated
6. …loads applications, transfers to apps, and causes apps to be translated

Booting a BOA system

Simple in concept, harder in practice.
E.g. debugging:
– First part of firmware decompresses later part of firmware.
– Later part of firmware is actually a FORTH interpreter.
Debugging BOA is 3 levels removed from semantic actions being taken:
    Discovering devices
    Checking system integrity

Boot Time Oddities

Self-check of memory turns off memory banks (via writes to I/O ports).
Memory bank with BOA system code disabled.
BOA dies.
– Moral: Must virtualize the memory controller, not just the processor, to properly emulate system behavior.

Frame buffer expects 4-byte stores.
Code accessing frame buffer uses PowerPC stswi (store-string) instruction.
Cannot emulate stswi as sequence of store-byte instructions.
Machine dies with a bus error.
Such errors can be hard to isolate. (We know!)

Redundancy can make debugging more difficult.
Example: AIX boot sequence uses 3 techniques to establish IP address of machine.
If any technique succeeds, machine gets an IP address.
Bugs in one or two of the techniques can (and did) go undetected.
– And were much harder to find later.

ICBI Instructions

In PowerPC, when instructions are modified, must execute:
– ICBI (Instruction Cache Block Invalidate) instruction
BOA uses ICBI as signal to invalidate translations:
– Must be able to efficiently invalidate
    By age (if translation cache full)
    By address (for ICBI instructions)
Little self-modifying code in PowerPC, but ICBI also used:
– During program load.
– For JIT-compiled code.
To be safe, AIX executes ICBI for all blocks on a page whenever a new page is loaded.
Many NOP ICBI instructions executed.
Reduce BOA VM overhead by fast check for NOP ICBI instructions.
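One way to picture invalidation by address with a fast path for NOP ICBIs is a reverse index from source cache blocks to the translated groups built from them. The sketch below is only an illustration: the class, its method names, and the 32-byte `BLOCK_SIZE` are assumptions, and a "NOP ICBI" is taken here to mean an ICBI that hits a block with no translations.

```python
BLOCK_SIZE = 32  # assumed cache block size in bytes (hypothetical)

class TranslationCache:
    """Toy translation cache that can invalidate groups by source address."""

    def __init__(self):
        self.groups = {}    # group entry address -> set of source blocks it covers
        self.by_block = {}  # source block -> set of group entry addresses

    def add_group(self, entry, source_addresses):
        blocks = {addr // BLOCK_SIZE for addr in source_addresses}
        self.groups[entry] = blocks
        for block in blocks:
            self.by_block.setdefault(block, set()).add(entry)

    def icbi(self, address):
        """Invalidate every group built from the block; return how many were dropped."""
        block = address // BLOCK_SIZE
        entries = self.by_block.pop(block, None)
        if not entries:
            return 0  # fast path: no translation touches this block (NOP ICBI)
        for entry in entries:
            for other in self.groups.pop(entry):
                users = self.by_block.get(other)
                if users is not None:
                    users.discard(entry)
                    if not users:
                        del self.by_block[other]
        return len(entries)
```

With this layout the common case costs a single dictionary lookup, which matters when the OS issues an ICBI for every block of each newly loaded page.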

[Figure: ICBI Occurrences]

[Figure: Interrupt/Exception Frequencies]

BOA / Dynamic Compilation System Architecture

[Flowchart]
1. Interpret insn X (PowerPC).
2. X previously translated (entry point)?
    Yes: execute group @ X (BOA translation).
    No: update statistics.
3. Seen X 15 times?
    No: goto next insn X.
    Yes: form group @ X and translate PowerPC to BOA.
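The interpret / count / translate flow can be sketched as a small driver loop. The threshold of 15 comes from the flowchart. The class name and the `interpret` and `translate` callables are hypothetical stand-ins, and the exact ordering of the checks is an assumption.

```python
HOT_THRESHOLD = 15  # from the flowchart: translate once X has been seen 15 times

class DynamicCompiler:
    """Toy driver: interpret cold code, translate entry points that turn hot."""

    def __init__(self, interpret, translate):
        self.interpret = interpret  # callable: address -> next address
        self.translate = translate  # callable: address -> group (callable -> next address)
        self.counts = {}            # address -> times interpreted as a potential entry
        self.cache = {}             # address -> translated group

    def step(self, addr):
        group = self.cache.get(addr)
        if group is not None:
            return group()                      # execute group @ X
        next_addr = self.interpret(addr)        # interpret insn X
        self.counts[addr] = self.counts.get(addr, 0) + 1   # update statistics
        if self.counts[addr] >= HOT_THRESHOLD:  # seen X 15 times?
            self.cache[addr] = self.translate(addr)
        return next_addr
```

A usage example would feed `step` its own return value in a loop; after fifteen visits to an address, later visits bypass the interpreter entirely.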

Basic Dynamic Compilation Loop

[Flowchart boxes: Start; Transfer to Untranslated Code; Interpretation (optional); Compile Code Segment (Function, BB, etc); Store Translation Group in Translation Cache; Execute group; Translation Cache]

DAISY / BOA Additions

[Flowchart boxes: Start; Lookup Insn Address (hit: Branch to Translated Group; miss: Transfer to Untranslated Code); Interpretation (optional); Compile Code Segment (Function, BB, etc); Compile Until Stopping Pnt; Stopping Condition? (yes / no); Store Translation Group in Translation Cache; Execute group; Exception Handler; Translation Cache. Annotations: Full system translation; Unstructured binaries]

Unstructured binary code

HLL dynamic compilation usually has defined code body and “natural” compilation units
– Function, method, class,…
Dynamic compilation of binary code performs dynamic code discovery:
– Identify meaningful translation units
– Prevent “rediscovery” and duplication of already translated code

Exception handling

Many instructions can raise exceptions
– Usually invisible to user-code
    Small number can be reflected using UNIX signal() facility
Exceptions provide “invisible” control flow arcs from code
– Restricts optimizations across these arcs
– Infrequent
    but must be handled correctly for system function
– Must be able to reconstitute state for exception handler
    And provide exception code, registers, …
Exceptions vs. Interrupts
– Synchronous vs. Asynchronous

Full system aspects

Large number of synchronous exceptions poses a significant burden on full system compilation
– Many instructions can raise synchronous exceptions:
    Memory ops (page faults, protection)
    Floating point (IEEE compliance)
    Divide by zero
    Control flow across pages (page faults, protection)
    …
– Number of events dynamically low
    And not degrading common case is necessity for good performance
    But when events occur they need to be handled in compliance with architecture.

System safety and correctness

Instruction execution always under VMM control.
– Locus of control can be in:
    VMM Interpreter
    VMM Translator
    VMM Exception manager
    VMM Memory manager
– Locus of control can be in VMM-generated traces.
    Traces transfer control only to each other,
    Or to VMM.
– No way to inject native (i.e., uncontrolled) code into the system from the PowerPC layer
Protection is a combination of hardware primitives, and VMM-generated code

Translation Groups

Translation group structure

Multiple choices for structure of a translation group:
– Multiple-entry, multiple-exit block
    Mimic [May ’87]
– Tree: Single entry, multiple exit
    DAISY [ISCA ’97], [Europar ’99], [MICRO ’99]
– Trace: Single entry, multiple exit
    BOA [WBT-99] (similar to superblocks, or hardware trace cache)
– Atomic blocks
    BOA study [WCED ’02]

[Figure: Translated Code Size over Time: DAISY Tree Regions]

BOA Group Formation

Include ops from only a single path.
Always follow most likely direction of conditional branch.
If necessary, invert branch conditions so that all branches are expected to fall-thru.
Optionally go thru register branches:

    cmpi cr15,PPC_LR,0x1234
    bne  EXIT_GROUP
    # Translated Code from 0x1234
    ...
    EXIT_GROUP: b BOA Syscode

Stopping Points

Identify possible cut points in translation
Reduce code duplication
– By quantizing translation stop/start to a set of well-defined entry/exit points to translations
– Good choice of stopping point increases effectiveness by exploiting program structures
    Branch, function call, function exit, …
    Likely program join points map to group start
– Translation groups can only be entered at top
– Translated code only has join points at translation group boundary
“Hard stops”
– Stopping point & stopping condition
– Resource limits

Stopping Points

Avoid enumerating all possible substrings of the dynamic execution

            mtctr 5
    L..9:   add 3,3,0
            subfc 0,0,3
            bdnz L..9
            blr

    mtctr 5 ; add 3,3,0 ; subfc 0,0,3 ; bdnz L..9
    add 3,3,0 ; subfc 0,0,3 ; bdz L..9 ; add 3,3,0
    subfc 0,0,3 ; bdz L..9 ; add 3,3,0 ; subfc 0,0,3
    bdz L..9 ; add 3,3,0 ; subfc 0,0,3 ; bdz L..9

Stopping Points

Stopping points allow exploitation of program structure
– Statistically, without detailed (i.e., expensive) analysis

            mtctr 5
    L..9:   add 3,3,0
            subfc 0,0,3
            bdnz L..9
            blr

    add 3,3,0 ; subfc 0,0,3 ; bdnz L..9

or:

    add 3,3,0 ; subfc 0,0,3 ; bdz L..9 ; add 3,3,0 ; subfc 0,0,3 ; bdnz L..9

Group Stopping Conditions

Fickle branch
– E.g., go one direction less than 60% of time
– Termed Bias-9 → One way 9/15 times
Bias is a major knob to control group size
# of ops in group > Threshold (60 ops)
# of stores in group > Store Buffer Size
– 32 entries in store buffer
Register Branches (optionally)
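The stopping conditions above can be read as a simple loop over the ops on the most likely path. The following is a minimal sketch, not BOA's translator: the op representation (dicts with a `kind` field and branch counts) is hypothetical, and ending the group on the fickle branch itself, with that branch included, is an assumption.

```python
MAX_OPS = 60        # op threshold from the slide
STORE_BUFFER = 32   # store buffer entries from the slide

def form_group(path, bias=9):
    """Collect ops along the likely path until a stopping condition fires.

    path: list of dicts with 'kind' in {'alu', 'load', 'store', 'branch'};
    branch ops also carry 'taken' and 'fallthru' counts gathered while the
    code was interpreted (15 samples under the slide's Bias-n scheme)."""
    group, stores = [], 0
    for op in path:
        if len(group) + 1 > MAX_OPS:
            break                                # too many ops in group
        if op["kind"] == "store" and stores + 1 > STORE_BUFFER:
            break                                # would overflow the store buffer
        group.append(op)
        if op["kind"] == "store":
            stores += 1
        if op["kind"] == "branch" and max(op["taken"], op["fallthru"]) < bias:
            break                                # fickle branch ends the group
    return group
```

Raising `bias` makes more branches count as fickle and so shortens groups, which is why the slide calls bias a major knob for group size. With 15 samples one direction always occurs at least 8 times, so in this model Bias-8 never stops at a conditional branch.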

Synthetic path profiling

“Poor man’s path profiling”
– Limit data per profile:
    No program structure information, as in advanced profile collection methods.
– For each branch, collect profile information
    Profile (prev_branch_address, this_branch_address) → branch outcome
– Synthesize path information from 1 block deep path information

Branch prediction performance

[Figure: Conditional Branch Misprediction Rate. Y axis: Mispredict Rate (0 to 25). X axis: Benchmark (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, tpcc, average). Series: Basic Block Profiling; Path Prediction; Oracle Prediction, Basic Block Profiling; Dynamic predictor]

Group Quality

Group quality impacts performance
– Many ways to quantify “quality”
Fraction of early exits from a group?
Average static length of group?
Average dynamic length of group?
– Combines fraction and location of early exits, and average static length of group

Dynamic Group Length (Effective Window Size)

We find dynamic group length to be a good measure of quality.
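The three candidate metrics can be computed from per-group exit counts. This is a sketch under assumed inputs: `exits` is a hypothetical map from the number of ops executed before leaving the group to how often that happened.

```python
def group_quality(static_length, exits):
    """Summarize one group.

    exits: dict mapping ops executed before leaving the group
    (1..static_length) to the number of executions that left there."""
    total = sum(exits.values())
    executed = sum(position * count for position, count in exits.items())
    early = total - exits.get(static_length, 0)
    return {
        "early_exit_fraction": early / total,
        "static_length": static_length,
        "dynamic_length": executed / total,  # effective window size
    }

# A 40-op group that runs to completion half the time and leaves after 10 ops otherwise.
quality = group_quality(40, {40: 50, 10: 50})
```

In this example the dynamic length is 25 ops: it reflects both how often the group exits early and where, which the early-exit fraction alone does not capture.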

Group Length vs CPI

[Figure: X axis: Dynamic Group Length (0 to 60). Y axis: Infinite Resource PowerPC CPI (0 to 1.4)]

BOA Group Lengths

[Figure: Y axis: PowerPC Instructions (0 to 60). X axis: li, perl, m88k, go, ijpeg, vortex, gcc, compress, tpcc. Series: Static; Dynamic]

Group Quality (cont)

Different conclusions published by others:
– Results depend on dynamic optimizations exploited:
    Code Packing, Hot/Cold optimizations effective across group boundaries
        – Dynamo derives significant benefit from these
    ILP optimizations depend on speculating from correct path
        – BOA, DAISY exploit ILP optimizations
        – ILP exploitation limited on HP-PA

Group Length vs CPI: Theory

ILP is monotonically non-decreasing with group length:
– Consider N consecutive instructions on some path
– Split these N instructions into two groups, G1 and G2, each with N/2 instructions:
    Any pair of schedules derived by scheduling G1 and G2 separately can also be derived by scheduling all N instructions together.
– Additional schedules are also possible when scheduling all N instructions together.
– Thus, ILP with a larger window is always at least as high as with two smaller windows.
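The argument can be checked on a small example with an idealized scheduler. The sketch assumes unit latency, unlimited issue width, and that a split forces G2 to start only after G1 has finished; the function name and the dependence-map format are hypothetical.

```python
def schedule_length(deps, first, last):
    """Cycles needed to run ops first..last-1 with unit latency and unlimited
    issue width. deps maps an op index to the earlier op indices it reads from;
    producers outside the window are assumed to have already completed."""
    finish = {}
    for op in range(first, last):
        ready = max((finish[d] for d in deps.get(op, ()) if d >= first), default=0)
        finish[op] = ready + 1
    return max(finish.values(), default=0)

# Two independent 4-op chains: 0->1->2->3 and 4->5->6->7 (hypothetical example).
deps = {1: [0], 2: [1], 3: [2], 5: [4], 6: [5], 7: [6]}
whole = schedule_length(deps, 0, 8)                                  # one group, N = 8
split = schedule_length(deps, 0, 4) + schedule_length(deps, 4, 8)    # G1, then G2
```

Here the single 8-op window overlaps the two chains and finishes in half the cycles of the two 4-op windows run back to back; for a fully serial chain the two totals are equal, so the larger window never does worse in this model.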

Group Length vs CPI: Other Factors

A larger group size loses only if other penalties come into play:
– ICache pollution from duplicated code
– ITLB pollution from duplicated code
– Frequent translation cache overflows / flushes
– Roll back penalties
    Discard useful work done
    Potentially re-execute more slowly.

Percentage of Groups Completing All Instructions

[Figure: Y axis: % of groups completing all instructions (0 to 80). X axis: li, perl, m88k, go, ijpeg, vortex, gcc, compress, tpcc. Series: Bias-8; Bias-12; Bias-15]

Percent Early Exits vs CPI

No / negative correlation between early group exits and performance

[Figure: X axis: % Early Group Exits (0 to 100). Y axis: PowerPC CPI (0.0 to 2.5). Series: Bias 8; Bias 12; Bias 15, each with a linear fit]

Group Execution Alternatives

Atomic: All operations in a group execute or no operations in a group execute
– From architecture point of view.
Incremental: Architected state of emulated ISA updated incrementally in original program order as group executes.
– DAISY Approach
Combination: All operations in group up to point of exit update ISA state simultaneously:
– May exit group “early”
    Due to group not containing branch path taken in execution
    Due to exception, e.g., page fault.
– BOA Approach

Atomic Blocks

Published at [WCED ’02]

Atomic blocks put premium on group completion.

Allows far more aggressive optimization.

But the penalties of early exit are substantial:
– A rollback occurs, and all work must be discarded
– Slow mode is entered, with less aggressive groups (essentially basic blocks)

Benefits of Atomic and Combination Group Execution

Aggressive memory re-ordering
– Ability to resolve wrong assumptions about dependences
– Ability to maintain correct MP behavior in view of changed memory access ordering
Resolves precise exception challenges
– All-or-nothing
– No hidden exception arcs…
– Re-ordering of possible exceptions
Requires hardware support
    See later
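The all-or-nothing behavior can be modeled in software as a gated store buffer: stores made inside a group are held back until the group commits and are thrown away on rollback. This is only a toy model of the concept, not the hardware mechanism the tutorial covers later; the class and method names are hypothetical, and the default capacity reuses the 32-entry store buffer figure from the stopping conditions.

```python
class GatedStoreBuffer:
    """Toy model: stores made inside a group stay buffered until the group
    commits; a rollback discards them so memory is untouched."""

    def __init__(self, memory, capacity=32):
        self.memory = memory      # dict: address -> value (committed state)
        self.capacity = capacity
        self.pending = []         # (address, value) pairs in program order

    def store(self, address, value):
        if len(self.pending) >= self.capacity:
            raise OverflowError("group has more stores than the buffer holds")
        self.pending.append((address, value))

    def load(self, address):
        for addr, value in reversed(self.pending):  # newest buffered store wins
            if addr == address:
                return value
        return self.memory.get(address, 0)

    def commit(self):
        for address, value in self.pending:         # group completed: publish stores
            self.memory[address] = value
        self.pending.clear()

    def rollback(self):
        self.pending.clear()                        # group failed: nothing escapes
```

Because no store becomes visible before commit, operations inside a group can be reordered freely and a wrong assumption costs only a rollback, at the price of bounding the number of stores per group.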

[Figure: Block-Structured CPI]

[Figure: Overlapping BOA Groups]

Code duplication

Code duplication due to group formation
– Same static instruction may be in multiple groups:
    Group formation effectively performs tail duplication, inlining, unrolling, …
– Incremental code discovery and translation limits control of this duplication.
– Generated code can grow significantly compared to original code size.

Code Packing

Code packing:
– Within a group
– Across groups
Within group: Software Based Trace Caching
– Application-directed code compaction
– Similar concept to hardware trace cache
– Much simpler to implement
Across group:
– Statistical Pettis-Hanson, Hot/Cold
– Increase effectiveness of ICache and ITLB
– Very helpful in HP Dynamo performance

Effects of Code Packing on Working Set: HP Dynamo Performance

Code packing implicitly performed as part of DO exploits instruction cache and TLB more efficiently

Dynamic Optimization and Reoptimization

Code Re-optimization

Reduce compile time with tiered compilation
Translated code can be re-optimized
– If “very hot” regions are identified, to optimize more aggressively
    DAISY does this [ISCA ’97, Micro ’99]
– To avoid frequent upsets and recovery cost
    Misspeculation, code modification events,…
    Transmeta Crusoe implements this [CGO ’03]
Trigger re-translation using
– Profiling support in DO target
– In recovery code

Interpretation and Profiling

Experiments with interpretation and profiling to
– increase “translation quality”
– reduce translation cost
– code expansion
Interpret i times and profile
– form group if interpreted i times
– extend group beyond branch if branch shows bias of n/i
– explore different values for n, i
– problematic to form paths based on branch profiles
– path profiling potentially expensive
    “poor man’s path profiling”

Statistics

For each conditional branch, keep taken and fall-thru counts

For each register branch, keep list of register target values.

Could keep load values for value prediction

Profile-Based Improvements

Idea: Limit code bloat & compile time by confining aggressive optimizations to frequently executed regions of the code

Idea explored in DAISY context [ICS’00].

Details in next slide …

Optimizing Frequently Executed Code

Initial group formation
– ILP goal is modest, and window size is conservative
– e.g., 3 IPC infinite, 24 operations window
Group modification due to re-execution
– Profiling to determine reuse
    Hardware/software counter-based scheme (Details later).
    Timer-based profiling scheme
– Two growth mechanisms
    Multipath branch append
        – Frequently taken branch compiled into initial group as multipath
    Group reoptimization
        – Recompile block exit point with more aggressive goals
        – e.g., 10 IPC infinite, 250 operations window size

Group Extension by Multipath Append

[Figure: control-flow graph with blocks A to F, and translation groups built from blocks A, B, C, D, E whose exits lead to the translator, shown before and after the append]

Add path to an existing group

Group Extension by Re-Optimization

[Figure: control-flow graph with blocks A to F, and translation groups built from blocks A, B, C, E whose exits lead to the translator, shown before and after re-optimization]

Extend frequently executed existing group to increase ILP

Quasi-Static BOA Optimizations

Instruction Scheduling
Combining
Copy Propagation
Dead Code Elimination
Code packing
Load-Store Telescoping
Register Port arbitration
Replace “hard” ISA instructions with ops that are easier to schedule / execute
Improve predictability of execution path by code layout

Loop Index Sequentialize Iterations

    for (i = 0; i < N; i++) {      // C code
        a[i] = b[i] + c[i];
    }

    L1: lbz   r4,b_offset(r3)      # Assembly code
        lbz   r5,c_offset(r3)
        add   r6,r4,r5
        stb   r6,a_offset(r3)
        addi  r3,r3,1              # r3 dependence
        cmpwi cr0,r3,N             # can serialize code
        bne   cr0,L1

Combining to the Rescue

Combine increment of loop index variable with dependent ops in subsequent iterations, e.g. 2nd iteration:

        ...
        lbz   r63,b_offset+1(r3)
        lbz   r62,c_offset+1(r3)
        add   r61,r63,r62
        stb   r61,a_offset+1(r3)   # Wait till non-specul
        addi  r63,r3,1             # to store
        cmpwi cr31,r3,N-1
        beq   cr31,Loop_Exit

Now first and second (and later) iterations can execute in parallel – subject to resource constraints.
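As a rough illustration of why combining is legal, the sketch below models the byte loop twice: once with the serial index update, and once with the `+1` folded into the second iteration's addresses so that both loop bodies depend only on the same value of `r3`. The function names are hypothetical, and the exit test is done before the second iteration's loads because Python cannot speculate past the end of a list; in the slide the loads are issued speculatively and only the store waits.

```python
def loop_serial(b, c):
    """Each iteration reads the index produced by the previous one."""
    a = [0] * len(b)
    r3 = 0
    while r3 != len(b):
        a[r3] = (b[r3] + c[r3]) & 0xFF     # lbz / lbz / add / stb work on bytes
        r3 = r3 + 1                        # the next iteration depends on this add
    return a

def loop_combined(b, c):
    """Two iterations per pass; the increment is combined into the offsets."""
    n = len(b)
    a = [0] * n
    r3 = 0
    while r3 != n:
        first = (b[r3] + c[r3]) & 0xFF
        if r3 == n - 1:                    # cmpwi cr31,r3,N-1 ; beq Loop_Exit
            a[r3] = first
            break
        # Uses offset+1 with the old r3: no dependence on an updated index.
        second = (b[r3 + 1] + c[r3 + 1]) & 0xFF
        a[r3] = first
        a[r3 + 1] = second                 # stored once known to be non-speculative
        r3 = r3 + 2
    return a

inputs = [(list(range(250, 250 + n)), [10] * n) for n in range(6)]
results_match = all(loop_serial(b, c) == loop_combined(b, c) for b, c in inputs)
```

Both versions compute the same array for even and odd trip counts; the gain in the real machine comes from the two bodies no longer being chained through `addi r3,r3,1`.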

Load-Store Telescoping

        xor   r31,r9,r4
        stw   r31,8(r1)
        …
        lwz   r7,8(r1)
        …
        stw   r7,64(r22)
        …
        lwz   r14,64(r22)
        addi  r20,r14,6

Telescope loads and stores together
– Can execute addi one cycle after xor

Such telescoping patterns common in function prologs and epilogs.
– Save/Restore callee/r saved registers.
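The telescoping idea can be sketched as a forwarding pass over a straight-line op list. This is a simplified illustration, not BOA's algorithm: the tuple encoding of ops and the function name `telescope` are invented for the example, and the pass assumes that two memory operands refer to the same location only when their (offset, base) pairs are textually identical, which a real translator would have to verify or speculate on.

```python
def telescope(ops):
    """ops: list of (opcode, dest, sources); a memory operand is an
    (offset, base) tuple. For 'stw', dest names the register being stored.
    Returns the ops with register sources forwarded through store/load pairs."""
    stored = {}   # (offset, base) -> register whose value that location holds
    alias = {}    # register -> earlier register known to hold the same value
    out = []
    for opcode, dest, srcs in ops:
        # Rewrite register sources through known aliases; leave tuples/ints alone.
        srcs = tuple(alias.get(s, s) if isinstance(s, str) else s for s in srcs)
        if opcode == "stw":
            stored[srcs[0]] = alias.get(dest, dest)
            out.append((opcode, dest, srcs))
            continue
        # Any other op overwrites `dest`: drop facts that mention it.
        alias = {k: v for k, v in alias.items() if k != dest and v != dest}
        stored = {m: v for m, v in stored.items() if v != dest and m[1] != dest}
        if opcode == "lwz" and srcs[0] in stored:
            alias[dest] = stored[srcs[0]]      # load returns the stored value
        out.append((opcode, dest, srcs))
    return out

slide_code = [
    ("xor",  "r31", ("r9", "r4")),
    ("stw",  "r31", ((8, "r1"),)),
    ("lwz",  "r7",  ((8, "r1"),)),
    ("stw",  "r7",  ((64, "r22"),)),
    ("lwz",  "r14", ((64, "r22"),)),
    ("addi", "r20", ("r14", 6)),
]
telescoped = telescope(slide_code)
```

After the pass, the final op reads `r31` directly, so the `addi` no longer waits on the store/load chain. The loads and stores themselves are kept, since memory and the architected registers must still be updated.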

Dynamic Optimizations

Memory op speculation based on observed runtime dependence behavior
Smarter group formation based on discovery of actual paths
– Tree, superblock, other…
Value prediction based on observed data behavior
Lightweight optimizations [Micro ‘99]

Dynamic Optimization Example

System Level Optimization Challenges

Dynamic compilation at system level needs to be transparent
– Compatibility guarantee
– Same results as unoptimized original binary

Microprocessor mechanisms defy many traditional optimizations
– Analysis scope limited
    Runtime limited
    Dynamic code discovery
    e.g., liveness analysis

Precise Exceptions and Dynamic Optimization

Microprocessors usually offer precise exceptions to handle special conditions
– Occurs in all forms of binary compilation
    Process-level: signal() interface
    System-level: full range of architected exceptions
    – Demand paging
    – Divide by zero
    – Floating point

Must preserve semantics in presence of exceptions
– Machine state observability in unexpected locations
– Disabling optimizations will degrade performance

Detailed Example: Dead Code Elimination

Dead Code Example

Example Code Sequence
– (1) add r4,r3,r4 # DEAD!
– (2) lwz r3,0(r9)
– (3) add r4,r3,r3

But a page fault at (2) lwz makes the dead value of r4 visible to the exception handler.

If the handler bases any actions on the value of r4, the program may fail.

Approaches to dynamic compilation in the presence of exceptions

Severely restrict dead code elimination.

Include a safe mode which disables “unsafe” optimizations.

Rollback to a good state and interpret original code until exception is found

Improved code generation with dynamic state recovery [WBT2000, CC2002]

Limiting dead code elimination

Compute all dead results

Commit results in-order

Used in DAISY [ISCA1997]
– high-ILP architecture
– excess operations have less performance impact
– dead results eliminated in scope of single atomic VLIW
    on exception, rollback to beginning of VLIW

Safe mode

Safe mode uses only conservative optimizations
Use safe mode to translate critical programs or program regions
Critical code
– detected by heuristics
– specified by human intervention

Heuristics and humans can be wrong
– Incorrect execution if too aggressive
– Performance degradation if too conservative

Used in DYNAMO [HP1999]

Rollback to checkpoint

Take checkpoints on group transitions
Aggressively optimize within translation groups
On exception,
– rollback to checkpoint
– then interpret original binary conservatively

Rollback requires backing out of processor state and memory state changes
– special, complex hardware required
– memory rollback complex in MP

Used in Transmeta, BOA [Computer2000]
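The checkpoint discipline can be modelled in a few lines. The sketch below is a software analogy only: the slide says special hardware is required, and the class `CheckpointedState`, its gated store list, and its method names are assumptions made for illustration rather than a description of the Transmeta or BOA mechanisms.

```python
class CheckpointedState:
    """Registers are snapshotted and stores are held back until the next
    group transition, so a rollback can discard everything since then."""

    def __init__(self, regs, memory):
        self.regs = dict(regs)
        self.memory = dict(memory)
        self._saved_regs = dict(regs)
        self._pending_stores = []          # (addr, value) since last checkpoint

    def checkpoint(self):
        """Group transition: commit held stores and snapshot registers."""
        for addr, value in self._pending_stores:
            self.memory[addr] = value
        self._pending_stores.clear()
        self._saved_regs = dict(self.regs)

    def store(self, addr, value):
        self._pending_stores.append((addr, value))

    def load(self, addr):
        # Youngest matching pending store wins, else committed memory.
        for a, v in reversed(self._pending_stores):
            if a == addr:
                return v
        return self.memory[addr]

    def rollback(self):
        """Exception inside the group: back out register and memory changes."""
        self._pending_stores.clear()
        self.regs = dict(self._saved_regs)

state = CheckpointedState({"r3": 1}, {0x10: 5})
state.regs["r3"] = 99
state.store(0x10, 7)
seen_inside_group = state.load(0x10)
state.rollback()                           # then interpret the original binary
```

After `rollback()` the state equals the one at the last group transition, which is the point from which conservative interpretation of the original code would restart.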

Deferred State Materialization for Dynamic Optimization

Optimize for common performance case
– aggressive dead code elimination
    keep enough state to materialize full state when exceptions occur

State recovery to provide correct in-order state for exception processing
– dead values materialized only when exception occurs
    exceptions occur infrequently
    modest cost for materializing full state

Maximum performance during program execution
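A minimal sketch of the idea, using the three-instruction dead code example from the earlier slide: the optimized version omits the dead `add`, and recreates its result only if the load faults. The names `PageFault`, `load`, `run_original`, and `run_optimized` are hypothetical, and a missing dictionary key stands in for a page fault.

```python
class PageFault(Exception):
    pass

def load(memory, addr):
    if addr not in memory:
        raise PageFault(addr)
    return memory[addr]

def run_original(regs, memory):
    regs = dict(regs)
    try:
        regs["r4"] = regs["r3"] + regs["r4"]      # (1) dead unless (2) faults
        regs["r3"] = load(memory, regs["r9"])     # (2) may page fault
        regs["r4"] = regs["r3"] + regs["r3"]      # (3)
    except PageFault:
        return "fault", regs                      # state seen by the handler
    return "ok", regs

def run_optimized(regs, memory):
    regs = dict(regs)
    repair_note = lambda r: r["r3"] + r["r4"]     # elided (1), kept as metadata
    try:
        regs["r3"] = load(memory, regs["r9"])     # inputs r3, r4 still intact here
        regs["r4"] = regs["r3"] + regs["r3"]
    except PageFault:
        regs["r4"] = repair_note(regs)            # materialize only on exception
        return "fault", regs
    return "ok", regs

start = {"r3": 5, "r4": 7, "r9": 0x100}
mapped = {0x100: 21}
unmapped = {}
```

With `mapped` memory both versions finish with the same registers, and with `unmapped` memory both present the same state to the handler; the mainline path of the optimized version simply executes one op fewer.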

[Figure: Original CFG vs. Improved CFG]

Original CFG:
    add r4,r3,r4
    lwz r3,0(r9)      (unlikely edge: exception handler)
    add r4,r3,r3

Improved CFG:
    lwz r3,0(r9)      (unlikely edge: add r4,r3,r4, then exception handler)
    add r4,r3,r3

State Repair Concept

Precise exception framework

DAISY-like dynamic compilation environment

Unit of operation is tree region
– corresponds well to the mechanics of dynamic compilation
– keeps algorithms simple O(n) since no ϕ nodes

FG in static single assignment form
– simplifies overall algorithm
– in particular, simplifies handling live ranges

Algorithm steps

Tag instructions computing dead results (excl. exceptions)

Tagged instructions will not be emitted into generated code
– keep around as meta data ("repair notes")
– could recompute meta data on demand
    algorithm is deterministic

Ensure that all state can be recomputed
– by keeping information about elided instructions
– by keeping inputs to elided instructions alive
    until point where elided instructions are killed
    this can increase or decrease register pressure

Live Range Analysis

A register is dead if
– (1) it is no longer referenced by actual instructions
– (2) elided instructions that reference it are dead (w.r.t. exceptions)

Liveness of one symbolic register o can influence liveness of other registers i
– if register o is not materialized immediately
– if registers i are needed to materialize it
    Represented by liveness equivalence si ≡ <sj,sk>
– if si is live, then sj, sk are live
– Algorithm significantly simplified by SSA
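The liveness equivalence can be read as a closure rule: any register needed to materialize a live, unmaterialized register is itself live. A small sketch, with a hypothetical function name and a plain dictionary standing in for the equivalence table:

```python
def live_closure(live, equivalences):
    """live: set of symbolic registers known to be live.
    equivalences: {si: (sj, sk, ...)} meaning si ≡ <sj, sk, ...>.
    Returns the live set extended with every register needed to
    materialize a live, elided register."""
    result = set(live)
    worklist = list(live)
    while worklist:
        reg = worklist.pop()
        for needed in equivalences.get(reg, ()):
            if needed not in result:
                result.add(needed)
                worklist.append(needed)
    return result

# s4' is elided but still observable; it is recomputed from s3 and s4.
closed = live_closure({"s4'"}, {"s4'": ("s3", "s4")})
```

Chained equivalences are followed transitively, which is why the worklist re-examines each newly added register.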

Algorithm steps

    foreach operation op
        if dead (target (op))
            convert2repairnote (op)
            foreach instruction killing target (op)
                insert_use (target (op))
                insert_equivalence (target (op) ≡ sources (op))

Liveness analysis performed before algorithm
Register allocation performed after algorithm

Example: PowerPC Code

    and.  r4,r3,r4
    lwz   r3,0(r9)
    add   r4,r3,r3
    addi  r5,r3,80
    lwz   r3,0(r10)
    addi. r5,r3,1

Example: Intermediate Representation

    1 s4' = s3 & s4
    2 sc0' = (s3 & s4) cmp 0
    3 s3' = [s9]
    4 s4'' = s3' + s3'
    5 s5' = s3' + 80
    6 s3'' = [s10]
    7 s5'' = s3'' + 1
    8 sc0'' = (s3'' + 1) cmp 0

Intermediate Representation after Basic Algorithm

    { s4' = s3 & s4 }
    { sc0' = (s3 & s4) cmp 0 }
    s3' = [s9]
    s4'' = s3' + s3'
    use s4' ; s4' ≡ <s3, s4>
    { s5' = s3' + 80 }
    s3'' = [s10]
    s5'' = s3'' + 1
    use s5' ; s5' ≡ <s3'>
    sc0'' = (s3'' + 1) cmp 0
    use sc0' ; sc0' ≡ <s3, s4>
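The basic algorithm can be sketched on this example as follows. This is a simplified reading of the slides, not the published implementation: the `Op` record, the `arch` field tying each SSA name to its architected register, and the rule "dead means never read and not the last definition of its architected register in the group" are assumptions chosen so that the sketch reproduces the listing above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    target: str      # SSA name written, e.g. "s4'"
    arch: str        # architected register it renames, e.g. "r4"
    sources: tuple   # SSA names read
    expr: str        # right-hand side, used for printing only

def basic_repair_notes(ops):
    used = {s for op in ops for s in op.sources}
    last_def = {op.arch: idx for idx, op in enumerate(ops)}
    dead = set()
    killed_by = {}   # index of killing op -> dead ops it kills
    for idx, op in enumerate(ops):
        if op.target in used or last_def[op.arch] == idx:
            continue                      # result is needed or live out
        dead.add(idx)
        # The killer is the next op that redefines the same architected register.
        killer = next(j for j in range(idx + 1, len(ops)) if ops[j].arch == op.arch)
        killed_by.setdefault(killer, []).append(op)
    out = []
    for idx, op in enumerate(ops):
        line = f"{op.target} = {op.expr}"
        out.append("{ " + line + " }" if idx in dead else line)   # { } = repair note
        for d in killed_by.get(idx, []):
            out.append(f"use {d.target} ; {d.target} ≡ <{', '.join(d.sources)}>")
    return out

EXAMPLE = [
    Op("s4'", "r4", ("s3", "s4"), "s3 & s4"),
    Op("sc0'", "cr0", ("s3", "s4"), "(s3 & s4) cmp 0"),
    Op("s3'", "r3", ("s9",), "[s9]"),
    Op("s4''", "r4", ("s3'",), "s3' + s3'"),
    Op("s5'", "r5", ("s3'",), "s3' + 80"),
    Op("s3''", "r3", ("s10",), "[s10]"),
    Op("s5''", "r5", ("s3''",), "s3'' + 1"),
    Op("sc0''", "cr0", ("s3''",), "(s3'' + 1) cmp 0"),
]
notes = basic_repair_notes(EXAMPLE)
```

Each `use` pseudo-op sits right after the op that kills the elided value, which is what keeps the repair note's inputs alive until that point during register allocation.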

Some observations

Overly conservative
– only need to materialize state if a synchronous exception can actually happen
– only need to be able to materialize until the last synchronous exception which can observe state on any given path

Reduce number of repair notes
Reduce register pressure
– by killing otherwise dead input registers to repair notes


Improvement potential

After basic algorithm:
    s3' = [s9]
    s4'' = s3' + s3'
    use s4' ; s4' ≡ <s3, s4>
    { s5' = s3' + 80 }
    s3'' = [s10]
    s5'' = s3'' + 1
    use s5' ; s5' ≡ <s3'>
    sc0'' = (s3'' + 1) cmp 0
    use sc0' ; sc0' ≡ <s3, s4>

Improved:
    s3' = [s9]
    use s4' ; s4' ≡ <s3, s4>
    s4'' = s3' + s3'
    { s5' = s3' + 80 }
    s3'' = [s10]
    use s5' ; s5' ≡ <s3'>
    use sc0' ; sc0' ≡ <s3, s4>
    s5'' = s3'' + 1
    sc0'' = (s3'' + 1) cmp 0

Synergistic with Other Optimizations

Code Sinking
Unspeculation
Constant Propagation
Constant Folding
Commoning

Example: Application to other optimizations

Input code:
    s5 = s2 + 2
    s4 = s3 / s2

    s8 = 16
    s9 = s7 / s8
    s8 = ...

Code sinking:
    s5 = s2 + 2 (dead)
    s4 = s3 / s2
    s5' = s2 + 2

Constant propag.:
    s8 = 16 (dead)
    s9 = s7 / 16
    s8 = ...

With repair notes:
    { s5 = s2 + 2 }
    s4 = s3 / s2
    s5' = s2 + 2
    use s5 ; s5 ≡ <s2>

    { s8 = 16 }
    s9 = s7 / 16
    use s8 ; (extraneous)
    s8 = ...

Example: Application to other optimizations

Input code:
    s5 = s2 + s4
    s7 = [s5+10]
    s9 = s2 + s4
    s8 = [s9+20]
    s9 = ...

Commoning:
    s5 = s2 + s4
    s7 = [s5+10]
    s9 = s2 + s4 (dead)
    s8 = [s5+20]
    s9 = ...

With repair notes:
    s5 = s2 + s4
    s7 = [s5+10]
    { s9 = s2 + s4 }
    s8 = [s5+20]
    use s9 ; s9 ≡ <s2,s4>
    s9 = ...

Emitted Code and Recovery Information

                                   [identical mapping]
    0x00  lwz   R32,0(R9)          [r3 := R32]
    0x04  add   R3,R32,R32         [r4 := R3]
    0x08  lwz   R33,0(R10)         [r3 := R33]
    0x0C  addi  R5,R33,1           [ -unchanged- ]
    0x10  cmpi  CR0,R5,0           [ -unchanged- ]

    S0  = R3 & R4
    SC0 = (R3 & R4) cmp 0
    0x00  [r4 := S0; cr0 := SC0]
    S0  = R3 + 80
    0x08  [r5 := S0; cr0 := SC0]

Evaluation

Detailed results and analysis in [CC2002]

Reduction in number of operations
– Aggressive group formation increases optimization potential
– Approach only deletes dead code within group
    Avoid ϕ nodes and control flow merge

[Figure: PowerPC, % ops eliminated (axis 0 to 25) for compress, gcc, go, ijpeg, li, m88ksim, perl, tpcc, vortex. Key: (a) aggressive, (c) conservative group formation]

[Figure: zSeries, % ops eliminated (axis 0 to 25) for gcc, go, ijpeg, li, m88ksim, perl, system. Key: (a) aggressive, (c) conservative group formation]

Summary: Dead Code Elimination in DO

Complications in full system translation
– All observable state must match original architecture at any time
– Full data flow analysis not possible
    Exceptions represent potential control flow transfers
    Synchronous exceptions are problematic (page faults, IEEE FP,…)

Use variant of deferred materialization
– Traditional deferred materialization is no win because too expensive
    Executes instructions at runtime to record information about elided ops
– Deferred materialization at the translation group (superblock) level
    Statically record information about elided operations
    Extend live ranges of input operands to include live range of elided result
    If exception is raised, recreate result
– Cost of recovery only burdens infrequent exception case
– Mainline code executes at full speed

Virtualization

Definition of Virtualization

Virtualization is the efficient emulation of an architecture
– A machine may virtualize itself or another.
– Virtualization normally includes all system state, not just user state.

Certain ISA features simplify virtualizing a machine on itself:
– E.g., Not being able to view system state while in user state.

History of Virtualization

Virtualization has a long history, e.g.:
– IBM 360 VM
– Goldberg, CACM 1972 (Formal Definition)
– DAISY (Architecture as a layer of software)
– Transmeta
    Jim Smith and Ravi Nair call DAISY and Transmeta “Co-Designed Virtual Machines”
    This tutorial terms them “DAISY Hosts”
– VMware virtualization of x86
– Synthetic instruction set
    C Virtual Machine CVM (IBM) [Proc. IEEE ’01]
    Virtual Instruction Set Comp. VISC (Adve et al.)

VM Environments

Abstract execution environment
– Portable runtime environment
    Smalltalk, Java VM,…

Process abstraction
– Idealized virtual memory
– Fewer difficult system/kernel code issues

System-level
– No modifications to operating system
– More transparent
    Less danger of compatibility issues

Why Virtualization is Used in Architecture

Better Performance
Portability:
– VM as virtual execution environment
Migration
Architecture Simulation Environment

DO / Virtualization & Architecture Styles

Technique                 | OOO                 | DO+OOO                 | DO+IO                  | DO+VLIW
ISA                       | Base ISA            | Base ISA               | Base ISA               | New ISA
General Optimizations     | Too complex         | DO Optimizes           | DO Optimizes           | DO Optimizes
Path-Predictive Fetching  | IFetch Prediction   | DO Improves Prediction | DO Improves Prediction |
Code Compaction           | Trace Cache         | DO Performs Layout     | DO Performs Layout     | DO Performs Layout
Select Insns to Issue     | Wakeup/Select Logic | Wakeup/Select Logic    | DO Adapts at Exec Time | DO Adapts at Exec Time
Precise Exceptions        | Register Renaming   | Register Renaming      | SW Recovery Code       | SW Recovery + HW Support
Complex Insns             | Decoder Cracks      | Decoder Cracks         | DO or HW               | DO Cracks and Layers
Form Issue Groups         | Select Logic        | Select Logic           | Issue Logic            | DO Groups Packets


VM Targets for Dynamic Compilers

Same architecture.
– No architectural changes involved.

Specific instance of compatible architecture.
– Re-optimize for particular implementation, but considering specific parameters.

DAISY Host: Design for efficient hosting.
– Allow architectural simplification.
– Design host to ensure efficient mapping.

Migration and compatibility.
– Make old code run on new host.
– Efficient mapping not primary design constraint.
    But may be criterion: Itanium

Virtualization of Same ISA

Allows simpler implementation of the same architecture, e.g. Dynamo on PA-RISC.
Provides ability to bail out and revert to native execution:
    If overhead too high
    For hard-to-emulate sequences
    When no benefit of DO can be measured
    – Or DO actually degrades performance

Virtualization of Different ISA

Different architecture, e.g., RISC to VLIW
Advantages:
– Simplify architecture
– Reduce decoding overhead
– Add more registers, add new concepts
– Code packing / straightening.

Disadvantage:
– All code must be emulated. Can cause severe degradation if low reuse, e.g. WinStone.

Designing a DAISY Host

In designing a host as target for dynamic optimizer, consider
– Cost of emulating source architecture’s semantics
– Efficient mapping of architected state
– Providing resources for compiler for use in optimization
– Support dynamic optimization architecture

Matching Source Architecture Semantics

Data formats
– Representation of integers, floats, SIMD vectors, condition codes.

Definition and detection of boundary conditions
– Overflow, exceptions, carry in/out…

Not everything must have a 1:1 mapping
– Frequent/important idioms must have a high-performance solution
    Can be Hardware / Software hybrid

Bias towards maintaining compatible definition of many operations

Efficient Mapping of State

Each aspect of state must be stored somewhere
– and must be locatable

Frequently accessed state should be easy to retrieve
– Suggests a significant fraction of source architecture registers should be stored in registers

At control flow joins, must have same register mappings
– Bias assignment towards “home locations” or pre-existing assignments

Register Mapping at Joins

    if (a > b) {
        x = y + 1;
    }
    else {
        x = y - 1;
    }
    z = x + w;

x in “then” clause in r55
x in “else” clause in r55.

Ensures that z at control flow join gets correct value.
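One way to picture the join constraint is a reconciliation step that checks the two incoming mappings and, where they differ, falls back to a home location with a copy on the offending path. This is an illustrative sketch under assumptions: the function name, the dictionary encoding, and the register `r40` are invented for the example, and the slides only state that assignments are biased so that such copies are rarely needed.

```python
def reconcile_at_join(then_map, else_map, home):
    """Maps are {source variable: host register} at the end of each path.
    Returns (join_map, copies); copies[path] lists (dst, src) register moves
    to append to that path so that both paths agree at the join."""
    join_map = {}
    copies = {"then": [], "else": []}
    for var in sorted(then_map):
        t, e = then_map[var], else_map[var]
        target = t if t == e else home[var]   # prefer the existing assignment
        if t != target:
            copies["then"].append((target, t))
        if e != target:
            copies["else"].append((target, e))
        join_map[var] = target
    return join_map, copies

# Slide's case: both clauses already keep x in r55, so no copies are needed.
same = reconcile_at_join({"x": "r55"}, {"x": "r55"}, home={"x": "r55"})
# Hypothetical mismatch: the else clause left x in r40.
mixed = reconcile_at_join({"x": "r55"}, {"x": "r40"}, home={"x": "r55"})
```

In the matching case the code for `z = x + w` can read r55 unconditionally; in the mismatched case a copy into r55 would have to be added at the end of the else path.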

DAISY Host System Considerations

Address translation similar to source architecture
System architecture consistent with emulated systems
– I/O system, mem controller similar to source architecture
– Memory map consistent with source architecture
– Able to hide part of real memory from source architecture for use by VMM:
    VMM = Virtual Machine Monitor – the “OS” of the translator and optimizer.
– Timers consistent with source architecture

BOA Support for Virtualization and Dynamic Optimization

Use dynamic binary translation to map PowerPC code to code for high performance underlying machine.

BOA Goals

Execute existing PowerPC code 100% compatibly
– User and supervisor state.

Execute at high performance
– Good CPI and high frequency.

Benefits of BOA Optimization Layer

Eliminate performance-degrading ops
Avoid use of complex hardware idioms:
– E.g., condition register broadside read/write
– Replace with ops that are easier to schedule/execute
Exploit novel architecture concepts

BOA Support for DO / VM: Registers

64 integer registers vs 32 for PowerPC
– r0 to r31 same as PowerPC.
– r36 to r63: Used for renaming and scratch results
– r33: Hold PowerPC Ctr register
– r34: Hold PowerPC Link register
– r35: Hold constant 0
    Useful for PowerPC form lwz r3,8(r0), i.e. lwz r3,<AbsAddr 8>
    Allows hardware to treat all registers uniformly:
    – No special case for r0. Instead lwz r3,8(r35)

64 floating point registers vs 32 for PowerPC.
32 condition register fields vs 8 for PowerPC.

Example of Load Speculation

Original PowerPC Code:
– addi r4,r4,1
– xor  r5,r4,r9
– beq  cr0,L1
– lwz  r3,0(r6)

New code:
– addi r4,r4,1      lwz r63,0(r6)
– xor  r5,r4,r9     beq cr0,L1
– copy r3,r63

Additional BOA Support for DO / VM

Extra bits with registers to help renaming
– e.g., CA, OV bits

LRA = Load Real Address insn for crossing groups and pages
Ability to quash speculative I/O ops
Hardware counters for profiling
Store Order Buffer so can rollback to beginning of translation
Details to follow …

Implicit Architectural State
Support ability to rename implicit architectural state, i.e. state not named in the instruction opcode.
Example of Implicit State: PowerPC status bits:
– CA = Carry
– OV = Overflow
– SO = Summary Overflow (Did any op ever overflow?)
– FPSCR = Floating Point Status and Control Register
    Summary Denorm
    Summary NaN
    (Many more)

Implicit Architectural State
Example of Limitations from Implicit State
– addc <CA>,r3, r4,r5           Op1
– adde <CA>,r6, r3,r7,<CA>      Op2
– addc <CA>,r8, r9,r10          Op3
Dependences marked on the slide: RAW from Op1 to Op2 (through r3 and through CA); WAR from Op2 to Op3 (through CA).
If CA cannot be renamed, Op3 cannot be scheduled prior to Op1 or Op2.
BOA supports such re-ordering of operations updating and using implicit state.
– Extend each integer register with CA and OV bits

Implicit Architectural State

Extend each integer register with CA and OV bits:
– addc <r3.CA>,r3, r4,r5              Op1
– adde <r6.CA>,r6, r3,r7,<r3.CA>      Op2
– addc <r8.CA>,r8, r9,r10             Op3
Op3 can now be scheduled independently of Op1 and Op2.
Dependences marked on the slide: RAW from Op1 to Op2 remains; NO WAR.
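The two three-op sequences above can be compared mechanically. The sketch below is a generic pairwise hazard check, not the BOA scheduler; the tuple encoding of each op (name, resources written, resources read) is an assumption made for illustration. It also reports WAW hazards on the shared CA bit, which the slide does not draw.

```python
from itertools import combinations


def hazards(ops):
    """Return ordering constraints as (earlier, later, kind) tuples.

    Each op is (name, writes, reads); writes and reads are sets of
    resource names (registers or status bits), listed in program order.
    """
    found = set()
    for (n1, w1, r1), (n2, w2, r2) in combinations(ops, 2):
        if w1 & r2:
            found.add((n1, n2, "RAW"))   # later op reads what earlier wrote
        if r1 & w2:
            found.add((n1, n2, "WAR"))   # later op overwrites what earlier read
        if w1 & w2:
            found.add((n1, n2, "WAW"))   # both ops write the same resource
    return found


# Single shared CA bit, as in PowerPC.
shared_ca = [
    ("Op1", {"CA", "r3"}, {"r4", "r5"}),
    ("Op2", {"CA", "r6"}, {"r3", "r7", "CA"}),
    ("Op3", {"CA", "r8"}, {"r9", "r10"}),
]

# CA bit attached to each destination register, as in BOA.
renamed_ca = [
    ("Op1", {"r3.CA", "r3"}, {"r4", "r5"}),
    ("Op2", {"r6.CA", "r6"}, {"r3", "r7", "r3.CA"}),
    ("Op3", {"r8.CA", "r8"}, {"r9", "r10"}),
]
```

With the shared bit, Op3 is ordered after both earlier ops; with per-register CA bits, the only remaining constraint is the true dependence from Op1 to Op2.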

Branching Between Groups and Across Pages

Problem 1: When exiting one group of translated instructions, must know:
– If a successor group exists corresponding to the next PowerPC instruction.
– Location of that successor group.
– If that successor group is still valid.

Problem 2: When execution of a group crosses what was a page boundary in the original PowerPC code, must check if there is a PowerPC instruction page fault.

Branching Between Groups: LVIA
One solution to Problem 1:
LVIA = Load VLIW Instruction Address

LVIA Semantics:
– If a valid, translated group exists starting at an address corresponding to PowerPC address RY, load its address and branch to it.
– If no valid, translated group exists, the LVIA op returns the address of the translator.

The LVIA cache is backed up by a larger memory list of translations akin to page tables.
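A minimal sketch of the LVIA lookup described above, assuming a small cache in front of a larger in-memory table. The class name, the translator entry address, the cache capacity, and the eviction policy are all hypothetical; the slides do not specify them.

```python
TRANSLATOR_ENTRY = 0xFFFF0000  # hypothetical address of the translator


class LviaUnit:
    def __init__(self, cache_capacity=4):
        self.cache = {}   # small cache: PowerPC address -> VLIW group address
        self.table = {}   # larger memory-resident list of translations
        self.cache_capacity = cache_capacity

    def install(self, ppc_addr, vliw_addr):
        """Record a newly translated group."""
        self.table[ppc_addr] = vliw_addr

    def invalidate(self, ppc_addr):
        """Drop a translation that is no longer valid."""
        self.table.pop(ppc_addr, None)
        self.cache.pop(ppc_addr, None)

    def lvia(self, ppc_addr):
        """Return the branch target for PowerPC address ppc_addr."""
        if ppc_addr in self.cache:
            return self.cache[ppc_addr]
        vliw_addr = self.table.get(ppc_addr)
        if vliw_addr is None:
            return TRANSLATOR_ENTRY           # no valid group: go translate
        if len(self.cache) >= self.cache_capacity:
            self.cache.pop(next(iter(self.cache)))  # evict oldest entry (assumed policy)
        self.cache[ppc_addr] = vliw_addr
        return vliw_addr
```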

Direct Branching Between Groups

Drawback to LVIA approach:
– Must execute extra LVIA operations, one per exit point of group.
Alternative: Branch directly between groups:
– Advantages:
    Fast
    Use single LRA op at group start to verify correctness. (See next foil.)
    LRA op needed anyway. (See next foil.)
– Disadvantage: VMM must track invalidations of translated groups and fix other groups that branch to them.

LRA: Cross-Group / Cross-Page Branches
LRA = Load Real Address
LRA Semantics:
– Get real PowerPC address from PowerPC Virtual Addr
– Compare:
    Real address for Virtual Address right now
    Real address for Virtual Address when group was formed
– Trap if mismatch or translation fault
Put LRA insn:
– At every page crossing within a group.
– At start of each group.
    Exception: If group is reached only from other groups on same page, no LRA is needed at group start.
Can generally schedule other operations in same cycle as LRA.
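The LRA check can be sketched as a comparison against the mapping recorded at translation time. Here the PowerPC page table is modeled as a plain dict from virtual page to real page, and the trap is a Python exception; both are illustrative stand-ins, not the hardware interface.

```python
class LraTrap(Exception):
    """Raised when the LRA check fails and the VMM must take over."""


def lra(page_table, virtual_page, real_page_at_translation):
    """Check that virtual_page still maps where it did when the group was formed."""
    real_now = page_table.get(virtual_page)
    if real_now is None:
        raise LraTrap("translation fault")
    if real_now != real_page_at_translation:
        raise LraTrap("mapping changed since group was formed")
    return real_now
```

A group would run this once at its start and once at each page crossing, using the real page numbers the translator saw when it built the group.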

I/O
Must detect references to memory-mapped I/O:
– I/O references must be performed in-order.
– Cannot be executed speculatively:
    Unknown/undefined side effects
    Reads can affect behavior
– PowerPC ISA has WIMG bits for each page
    WIMG bits flag I/O references
    – Among other things

I/O
BOA hardware detects speculative I/O references:
– Prevents execution, generates trap.
– Recover in software:
    Last defense against incorrect I/O operation
– Performance Heuristics:
    Detect likely I/O references in initial profiling:
    – Compile these references without speculation and with non-trapping instructions
    Also recompile without speculation later, when a reference is detected to often refer to I/O space.

PowerPC Load/Store Difficulties

In addition to I/O accesses, LOADS and STORES cause two other major problems:

1. Referenced (R) and Changed (C) bits in the (PowerPC) page table must be updated.

2. Memory reserved for BOA must be inaccessible to PowerPC programs, even when PowerPC address translation is disabled and the PowerPC is executing in system/privileged mode.

To deal with these problems, BOA uses a special co-designed TLB.

Solving PowerPC Load/Store Difficulties with Co-Designed TLB

Update of PowerPC R and C bits.
– When a page is brought into the BOA TLB, the BOA VMM sets the R bit in the PowerPC page table.
– IF a page is brought into the BOA TLB by a STORE:
    The BOA VMM also sets the PowerPC C bit.
– ELSE if the page is brought in by a LOAD:
    The page is marked READ-ONLY in the BOA TLB.
    If there is a later STORE to the page:
    – A BOA TLB Miss occurs
    – The BOA VMM sets the PowerPC C bit.
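The R/C update rules above amount to a small state machine per page. The sketch below models the BOA TLB as a dict of page permissions and the PowerPC R and C bits as two sets; the class and attribute names are illustrative, and TLB capacity and eviction are ignored.

```python
class CoDesignedTlb:
    def __init__(self):
        self.entries = {}            # page -> "RO" or "RW" in the BOA TLB
        self.ppc_referenced = set()  # pages whose PowerPC R bit is set
        self.ppc_changed = set()     # pages whose PowerPC C bit is set

    def access(self, page, is_store):
        """Return "hit" or "miss"; a miss runs the VMM handler."""
        perm = self.entries.get(page)
        if perm is None or (is_store and perm == "RO"):
            self._vmm_miss_handler(page, is_store)
            return "miss"
        return "hit"

    def _vmm_miss_handler(self, page, is_store):
        self.ppc_referenced.add(page)     # R bit set whenever a page is brought in
        if is_store:
            self.ppc_changed.add(page)    # C bit set on the first STORE
            self.entries[page] = "RW"
        else:
            self.entries[page] = "RO"     # a later STORE will miss again
```

Marking load-filled pages read-only is what forces the first STORE through the VMM, so the C bit is never missed.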

Solving PowerPC Load/Store Difficulties with Co-Designed TLB

BOA Memory must be inaccessible to PowerPC programs:
– BOA devotes a READ-ONLY page, B, of its memory for “bad” PowerPC memory references.
– All locations in B contain the value 0xFFFFFFFF
– Any PowerPC LOAD/STORE that attempts to access BOA memory is remapped to page B by the BOA TLB.
    LOADS return the value 0xFFFFFFFF
    STORES act as a NOP.

Hardware Exit Counters for Profiling

DAISY uses hardware profiling:
– At each exit from a DAISY group put an instruction:
    count exitID, Cycles_On_Path
– exitID is unique among all exits from DAISY groups.
– Cycles_On_Path is the estimated number of cycles from the start of this group to this exit.
    DAISY dynamic optimizer computes this value.
– exitID is used to index a counter cache:
    If counter cache has no entry for exitID:
    – Counter cache entry is set to Cycles_On_Path
    – ELSE counter cache entry is incremented by Cycles_On_Path.

Hardware Exit Counters for Profiling
Asynchronously to the DAISY processor, the counter cache compares each of its entries to a threshold cycle count, C.

If the counter for an entry exceeds C, the counter cache signals an asynchronous exception to the DAISY VMM, along with the exitID for the entry.

The DAISY dynamic optimizer can then re-optimize or restructure the group along the path ending at exitID.

Hardware Exit Counters for Profiling

Note: The counter cache increments the threshold count C each cycle:
    An exception is signaled only if the time spent on a particular path exceeds a certain percentage of execution time, not an absolute amount of execution time.

We have found that an 8-way associative counter cache with 8K entries is almost as accurate as software profiling, but with far less software overhead / program slowdown.
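The counting scheme can be sketched as below. The slides say only that the threshold C is incremented each cycle; the starting value (base) and per-cycle growth (fraction) used here are assumptions, and associativity and eviction of the real counter cache are not modeled.

```python
class CounterCache:
    def __init__(self, base, fraction):
        self.counters = {}         # exitID -> accumulated cycles
        self.threshold = float(base)   # assumed initial value of C
        self.fraction = fraction       # assumed growth of C per elapsed cycle

    def tick(self, cycles=1):
        """Advance time; the threshold C grows with elapsed cycles."""
        self.threshold += self.fraction * cycles

    def count(self, exit_id, cycles_on_path):
        """Model the `count exitID, Cycles_On_Path` instruction."""
        if exit_id not in self.counters:
            self.counters[exit_id] = cycles_on_path
        else:
            self.counters[exit_id] += cycles_on_path

    def hot_exits(self):
        """Exits whose counter exceeds C; these would raise the exception."""
        return [e for e, c in self.counters.items() if c > self.threshold]
```

Because C grows with elapsed time, a path is flagged only when it keeps a large share of total cycles, which matches the percentage-based criterion in the note above.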

PowerPC State and Precise Exceptions in BOA

At BOA group entry, save PowerPC register state to shadow registers.
Save done in one cycle by hardware when branching to new group.
Copy values to PowerPC registers only at group exits.
On exception:
– Rollback to start of group
– Restore PowerPC shadow registers
– Interpret to find exception

[Figure: PowerPC Regs, Shadow Regs, and Scratch Regs at Group Start, at Group End, and on Exception]

PowerPC State and Precise Exceptions

Support for STORES
Problem: STORE executes but later instruction in group page faults.

Want to rollback to group start, but must rescind all executed STORES.

Solution: Store Order Buffer (SOB)
– Stored values go to SOB, not memory
– At group exit, all pending STORES in SOB are marked eligible for commit to memory.
– SOB writes eligible values to memory in order.
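A minimal sketch of the buffering discipline above, with memory as a dict and the SOB as two lists. Method names are made up for illustration, and load forwarding from the buffer and multiprocessor snooping are left out.

```python
class StoreOrderBuffer:
    def __init__(self, memory):
        self.memory = memory   # dict: address -> value
        self.pending = []      # stores from the group currently executing
        self.eligible = []     # stores from completed groups, not yet written

    def store(self, addr, value):
        """A STORE goes to the SOB, not to memory."""
        self.pending.append((addr, value))

    def group_exit(self):
        """At group exit, pending stores become eligible for commit."""
        self.eligible.extend(self.pending)
        self.pending.clear()

    def rollback(self):
        """On an exception, rescind every store of the current group."""
        self.pending.clear()

    def drain(self):
        """Write eligible stores to memory in program order."""
        for addr, value in self.eligible:
            self.memory[addr] = value
        self.eligible.clear()
```

Stores of a group that takes an exception never reach memory, so rolling back to the group start needs no memory repair.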

Store Order Buffer (SOB)

[Figure: animation of a BOA Group issuing successive STOREs into the SOB. Labels: GroupStart Ptr, GroupEnd Ptr, SOB, STORE, Exception, Rollback, BSHAD, Memory.]

MP implications of SOBs

Cannot release data to remote node
– Could be rolled back later
– Must tell requesting node to wait

Processor 1                          Processor 2
Group      Action                    Group      Action
Store A    store in S-CAM            Store B    store in S-CAM
Read B     cross-interrogate         Read A     cross-interrogate
           “wait for commit”                    “wait for commit”

DEADLOCK

SOB-Aware Protocol

Must break deadlocks
– Detect deadlocks
    May be complex, involving multiple nodes
– Avoid conditions which cause deadlock
Deadlock avoidance
– Break possible cycles
– No “wait for commit” is sufficient
    But not necessary

Avoiding “Wait for Commit”

Cannot deliver data due to possibility of roll back
Must respond
– not responding is “wait for commit”
Solutions:
– Roll back immediately, then return old value
– Tell remote host to rollback and re-execute
    Will prevent this host’s waiting on remote data
    May be preferable for lock-guarded structures
    – Snooping a lock is not useful
    – Writer of lock should exit section as quickly as possible

Livelocks and Starvation

There is a danger of livelock.
– Solution: Allow one processor to make progress:
    By picking a node to prioritize
    – Use token to distribute equitably and prevent starvation
    Exponential back-off
– Use performance monitor infrastructure to identify groups suffering excessive interference
    Recompile, e.g., using smaller groups to reduce probability of interference

Speculative Load Support
Use counter to assign sequence number to each LOAD and STORE in a group.
Sequence number part of opcode

On STORE, hardware checks:
– STORE address overlaps prev LOAD address?
– Prev LOAD addr sequence number > STORE sequence number?

If aliasing between a load and store:
– Rollback group to start and start interpretation
– Possibly retranslate to unspeculate LOAD

PowerPC Code        BOA Group
LOAD X              1 LOAD X
...
STORE Y             2 STORE Y
...                 ...
LOAD Z              3 LOAD Z
...

Speculative Load Support (1)

Use ctr to assign sequence number to each LOAD and STORE in a group.
Sequence number part of opcode:

BOA Group
1 LOAD X
3 LOAD Z
2 STORE Y
...

On STORE, hardware checks:
– STORE addr overlaps a prev LOAD addr?                        (Z aliases with Y)
– Prev LOAD addr sequence number > STORE sequence number?      (Seq #3 > Seq #2)

If aliasing:
– Rollback group to start and re-execute
– Possibly retranslate to unspeculate LOAD
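The STORE-time check can be sketched as follows. This is a simplification: "overlap" is reduced to equality of addresses, the address comparators are a plain list, and the class name is hypothetical.

```python
class AliasChecker:
    def __init__(self):
        self.loads = []   # (sequence number, address) of LOADs executed so far

    def load(self, seq, addr):
        """Record an executed LOAD with its program-order sequence number."""
        self.loads.append((seq, addr))

    def store(self, seq, addr):
        """Return True if the group must roll back.

        A violation needs both conditions from the slide: the STORE address
        matches an already executed LOAD, and that LOAD is later in program
        order (larger sequence number) than the STORE.
        """
        return any(l_addr == addr and l_seq > seq for l_seq, l_addr in self.loads)
```

For the reordered group above (1 LOAD X, 3 LOAD Z, 2 STORE Y), the check fires only when Z and Y are the same address; a match with LOAD X is harmless because X precedes the STORE in program order.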

BOA Architecture and Microarchitecture

Target BOA system

4 way chip multiprocessor (CMP)
– Building block for large SMPs
– We will only present uniprocessor performance
– System performance largely dependent on memory nest
Shared on-chip unified L2 cache

BOA Caches

              Size    Line Size   Assoc   Hit Latency
L1 - Insn     256K    256         4       1
L1 - Data     64K     128         2       4
L2 - Joint    4M      128         8       14
Memory                                    90

BOA CMP floorplan

[Figure: BOA CMP floorplan]

BOA ISA (1)

BOA is variable length VLIW machine.
BOA instructions (bundles) are 128 bits.
– Bundles have 3 primitive ops.
– Primitive ops have 39 bits plus stop bit.
– 8 bits of bundle reserved for future uses such as predication (3 × 40 + 8 = 128 bits).
Instruction Issue operates on instruction packets
– Up to 6 primitive ops are issued together.
– Only last op issued may have stop bit set.

[Figure: BOA Instruction Packet (dynamic abstraction) and BOA Instruction Bundle (static abstraction)]

BOA ISA (2)

64 Integer Registers
64 Float Registers
16 4-bit Condition Registers
Branches take 1 cycle:
– Branch mispredicts cost 7 cycles
– Static branch prediction using interpreter stats
– At most one branch per cycle
– Branch and checkpoint
    For compiled group transitions

BOA Architecture

PowerPC ops from single path in an atomic group.
6 Issue
Ops assigned to FUs in pipeline
Stall-on-use
Memop sequence #'s, Address Comparators
Predicated bundles of 3 ops
1 branch per cycle
Branch prediction

BOA Resources

6 Issue Slots
– Positional encoding simplifies issuing
2 LOAD / STORE units
– Each with own copy of register file
4 Integer units
– Each with own copy of register file
2 Float units
1 Branch unit
32-entry Load and Store Buffers
Register scoreboarding of LOAD values
– Stall when try to use loaded value

BOA Microarchitecture

Decoupled fetch-execute
Front-end autonomously fetches bundles
– Formats bundle-based encoding stream to packets
– Prepare-to-branch option to redirect instruction fetch
Disperses packet to per-unit issue queue
– Can issue up to 6 instructions to 9 units
Stall-free backend
– Traditional “stall” conditions handled using “recirculation”
    Quash & re-issue violating instruction and successors
    – No need for “instantaneous” communication
– Branch misprediction uses similar scheme
    – Quash and re-issue from correct path
– Exceptions are handled similarly
    – Quash and re-issue from exception vector address

BOA Pipelines

BOA Latencies
Integer ops take 1 cycle
– No bypass, so dependent ops must be 2 cycles apart
LOADs take 3 cycles
– No bypass, so dependent ops must be 4 cycles later

[Figure: Integer, STORE, and LOAD pipelines. Stage labels (the order within each pipeline is not recoverable from the extraction): Integer: Fetch 1, Fetch 2, Decode, Issue, GPR Rd, Execute, Broadcast, Writeback. STORE: Fetch 1, Fetch 2, Decode, Issue, GPR Rd, AGEN, TLB, S-CAM. LOAD: Fetch 1, Fetch 2, Decode, Issue, GPR Rd, AGEN, TLB, Mem 1, Mem 2, Broadcast, Writeback.]

BOA Pipelines: Recirculation

High Frequency: Do not send global stall signals. Recirculate insn instead.

[Figure: animated sequence of pipeline diagrams (stage labels Fetch 1, Fetch 2, Decode, Issue, GPR Rd, Execute, Broadcast, Writeback) with a Recirculation Buffer. Frames are annotated “Input Regs Ready”, “Pause”, “Quash”, and “Recirculate”.]

Scoreboarding & Signal Distribution

Distributed issue queue
– Lockstep operation (issue entire packet only)
Long running instructions scoreboard result
– Late in pipeline after other conditions have been resolved
    No need to “retract” a scoreboard dirty condition
– Reduces pressure on scoreboard signal distribution
– Operations: LOAD, SPR access…
    Optionally long running FP to eliminate vertical NOPs
Short latency operations do not interlock via scoreboard
– BOA dynamic compiler must schedule at proper distance
– All simple operations: ALU ops, shifter, store,…
All units issue aggressively
– Scoreboard accessed with GPR read
– If any operand not ready, cancel instruction during EX stage
    “QUASH” signal broadcast to all pipelines
– Re-issued from recirculation buffer

Issue Logic
Easy to schedule for known latencies in an in-order machine
Scheduling ops of unknown latency is problematic
– Stall-on-cache-miss / Stall-on-long-latency-op easy but penalizes performance
    Scoreboard variable latency ops
    Optionally scoreboard long latency ops to reduce vertical NOPs (FDIV and similar ops)
– Keys to performance
    All interesting variable latency ops are long latency
    Have enough window to take time for updates

BOA Performance

Benchmarks
Benchmarks
– SPECint95
– TPC-C
SPECint95 Sampling Method
– Uniformly Sampled PowerPC Traces
– 2 million instructions per sample
– 50 samples per benchmark
TPC-C Sampling Method
– Special-purpose hardware
– 170 million instruction trace

Factors in BOA Performance

Instruction Reuse Rate
# of times each instruction is translated
Translator CPI
Interpreter CPI
Statistics CPI
Synchronous Exception Rate
ICache flushing from translator
Average Group Length
# of times interpret before translating

Execution Time of Translated Code

Ignoring Cache Effects, Cycles for Each Path Through Group = Number of VLIW Instructions
Total Cycles Spent in Group:

    Σ over all group paths P of ( # of VLIW Ins in P ) x ( # Times P Executes )

Instruction Cache Cycles

Layout VLIW code for all groups
Index VLIW code by group exit points
Go thru exit points in execution order
– Iterate through all VLIW Instruction Addresses corresponding to each exit
– Feed Addresses to Multilevel ICache Simulator
– Simulator includes history-based prefetch

Data Cache Cycles

Modeling Speculative Loads Difficult in Trace-Based Environment
Addresses for Speculative Ops not on actual Execution Path are Unknown
Use LD/ST addresses from PowerPC trace as input to DCache/DTLB simulation
Multiply DCache/DTLB Stall Cycles by 1.7
1.7 = Increase in Execution-based DAISY
Simulated DCache not lockup-free
– BOA’s cache is lockup free

Translation Cycles

Measure of CPI adder due to time spent in translation
Average number of clocks required to translate an instruction
– CPI of translator
    translation - 2500 cycles
– more sophisticated optimizations increase this penalty
– delicate balance between translated code performance and translation overhead
Number of times an instruction gets retranslated
Reuse Rate
– Time spent in translator per instruction is amortized by the repetition rate of that instruction

Cycles spent translating 1 instruction
– (1 / Reuse Rate) * ( Translation CPI * Translations )
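The amortization formula above is easy to evaluate directly. In this sketch the function name is made up, and the reuse rate of one million executions per instruction is a hypothetical input chosen for illustration; only the 2500-cycle translation cost comes from the slide.

```python
def translation_cpi_adder(reuse_rate, translation_cpi, translations):
    """(1 / Reuse Rate) * (Translation CPI * Translations).

    reuse_rate      -- times each translated instruction is executed
    translation_cpi -- cycles spent translating one instruction
    translations    -- times the instruction is (re)translated
    """
    return (translation_cpi * translations) / reuse_rate


# Hypothetical reuse rate; 2500 cycles is the translation cost quoted above.
adder = translation_cpi_adder(reuse_rate=1_000_000,
                              translation_cpi=2500,
                              translations=1)
print(f"CPI adder from translation: {adder:.4f}")
```

At that assumed reuse rate the adder is 0.0025 cycles per executed instruction; it scales linearly with the number of retranslations and inversely with reuse.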

Overall CPI

Total Cycles for VLIW Execution:
– Infinite Cache Cycles +
– ICache Cycles +
– DCache Cycles +
– DTLB Cycles +
– Branch misprediction +
– Interpretation & Translation Overhead
CPI = Total VLIW Cycles / Orig PPC Ins
Translation Overhead Negligible
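The CPI breakdown is a sum of cycle components divided by the original PowerPC instruction count. The sketch below shows the arithmetic only; the function name is illustrative and the component values in the example are invented placeholders, not measured BOA results.

```python
def overall_cpi(cycle_components, ppc_instructions):
    """CPI = Total VLIW Cycles / Orig PPC Ins."""
    return sum(cycle_components.values()) / ppc_instructions


# Placeholder cycle counts for illustration only.
example_components = {
    "infinite_cache": 600,
    "icache": 100,
    "dcache": 150,
    "dtlb": 50,
    "branch_misprediction": 90,
    "interpretation_and_translation": 10,
}
example_cpi = overall_cpi(example_components, ppc_instructions=1000)
```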

BOA Baseline CPI

[Figure: stacked bar chart of PPC CPI (y-axis 0 to 2) for li, perl, m88k, go, ijpeg, vortex, gcc, compress, tpcc. Components: TLB, L2, L1-D, L1-I, Interpretation, Translation, Branch, Exception, Base.]

Effect of Bias on CPI

[Figure: bar chart of PPC CPI (y-axis 0 to 2.5) for Bias-8, Bias-12, and Bias-15.]

Static and Dynamic Group Length, as Function of Bias

[Figure: group length in PPC instructions (y-axis 0 to 60) for li, perl, m88ksim, go, ijpeg, vortex, gcc, compress, tpcc_db2. Series: static and dynamic group length at bias 8/15, 12/15, and 15/15.]

Oracle Static Branch Prediction

[Figure: PPC CPI (axis 0 to 2). Legend: Baseline, Oracle.]

BOA and DAISY CPI

[Figure: PPC CPI (axis 0 to 2). Legend: BOA, DAISY.]

Comparison of BOA to DAISY

BOA and DAISY Differences (1)

BOA
– PowerPC ops from single path
– Stopping conditions relate to code expansion, resource limits
– Hardware-based resource constrained
– Order buffers
– Rollback,…

DAISY
– PowerPC ops from multiple paths
– Stopping conditions are ILP-based. Re-optimization to increase ILP
– Software-based issue-BW constrained
– LV In-order commit,…


BOA
– 6 Issue
– Ops assigned to FUs in pipeline
– Stall-on-use
– Memop sequence #'s, Address Comparators

DAISY
– 8-16 Issue
– Mini-Icache maps fixed cache locations to FUs
– Stall-on-miss
– Load-Verify Instructions


BOA
– Predicated bundles of 3 ops
– 1 branch per cycle
– Branch prediction

DAISY
– Tree instructions
– Up to 3 branches per cycle
– Encode successor cache line in instruction
– Fetch known insn each cycle


BOA
– Exclusively targeted at PowerPC
– Architecture commonality

DAISY
– Research to target multiple architectures
– Architecture virtualization

Summary and Observations

Proof of Concept

System-level dynamic compilation demonstrated by:
– DAISY, BOA
– Transmeta
– FX!32
– IA32-EL

Optimization Opportunities and Challenges

New Optimization Opportunities:
– Load-Store Telescoping
– Non-conservative approaches to aliasing
– System-level optimization techniques, e.g., large pages “under the covers”

New Optimization Challenges:
– Dead Code Elimination
– Management of Translated Code

Hazelwood and Smith [CGO 2004]

New Paradigm

System-level dynamic compilation offers opportunities for paradigm shift
– Merged ISAs in one implementation
   PowerPC / z-Series
   x86 / IA-64

Lower development costs
Dynamically allocate fixed hardware to different ISAs in server farm
