IA-64 Architecture Innovations

John Crawford, Architect & Intel Fellow, Intel Corporation
Jerry Huck, Manager & Lead Architect, Hewlett Packard Co.

Agenda
Architecture Principles
Predication & Speculation
Branch Architecture
Software Pipelining

Traditional Architectures: Limited Parallelism
[Figure: Original Source Code -> Compiler -> Sequential Machine Code -> Hardware (multiple functional units), with "parallelized code" shown at the hardware stage]
Execution units available, used inefficiently
Today's processors often 60% idle

IA-64 Architecture: Explicit Parallelism
[Figure: Original Source Code -> Compile -> Parallel Machine Code -> Hardware (multiple functional units); the IA-64 compiler views a wider scope]
More efficient use of execution resources
Increases parallel execution

IA-64 Principles
Explicitly parallel:
  – Instruction level parallelism (ILP) in machine code
  – Compiler schedules across a wider scope
Enhanced ILP:
  – Predication, Speculation, Software pipelining, ...
Fully compatible:
  – Across all IA-64 family members
  – IA-32 in hardware and PA-RISC through instruction mapping
  – Inherently scalable
Massively resourced:
  – Many registers
  – Many functional units

Predication
[Figure: Traditional Architectures: cmp followed by a branch to separate "then" and "else" blocks. IA-64: cmp sets predicates p1 and p2; the "then" instructions are guarded by p1 and the "else" instructions by p2]
Removes branches, converts to predicated execution
  – Executes multiple paths simultaneously
Increases performance by exposing parallelism and reducing critical path
  – Better utilization of wider machines
  – Reduces mispredicted branches
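As a rough illustration of the if-conversion described on this slide, the sketch below contrasts a branching form with a predicated form of the same computation. The function names and the absolute-difference example are hypothetical and chosen only for illustration; Python's conditional expression stands in for an instruction that becomes a no-op when its predicate is false.

```python
def abs_diff_branching(a, b):
    # Traditional form: a compare followed by a branch to one of two blocks.
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    # The compare writes two complementary predicates.
    p1 = a > b
    p2 = not p1
    r = 0
    # Both guarded instructions are issued; an instruction whose predicate
    # is false leaves the destination unchanged.
    r = (a - b) if p1 else r   # (p1) sub r = a, b
    r = (b - a) if p2 else r   # (p2) sub r = b, a
    return r
```

Both functions return the same value for any pair of integers; the predicated form simply has no control-flow split between the "then" and "else" work.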

Predication Review
Two kinds of normal compares
  – Regular
  – Unconditional (nested IFs)

Regular: p3 is set just once
    (p1) p3 =
    (p2) p3 =
    (p3) ...

Unconditional: p3 and p4 are AND'ed with p2
    p1,p2 <- ...
    (p2) p3,p4 <- cmp.unc ...
    (p3) ...    (p4) ...        // p2&p3, p2&p4

Opportunity for Even More Parallelism

Introducing Parallel Compares
Three new types of compares:
  – AND: both target predicates set FALSE if compare is false
  – OR: both target predicates set TRUE if compare is true
  – ANDOR: if true, sets one TRUE, sets other FALSE
[Figure: blocks B, A, C, D in sequence, then B, A, C side by side followed by D]
Reduces Critical Path

Eight Queens Example
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Unconditional Compares
Cycle 1:  R1=&b[j]
          R3=&a[i+j]
          R5=&c[i-j+7]
Cycle 2:  ld R2=[R1]
          ld.s R4=[R3]
          ld.s R6=[R5]
Cycle 4:  P1,P2 <- cmp.unc(R2==true)
Cycle 5:  (p1) chk.s R4
          (p1) P3,P4 <- cmp.unc(R4==true)
Cycle 6:  (p3) chk.s R6
          (p3) P5,P6 <- cmp.unc(R6==true)
Cycle 7:  (P5) br then
          else

[Figure: 8 queens control flow; the compare outcomes P1/P2, P3/P4 and P5/P6 lead to the Then and Else blocks]

Eight Queens Example
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Parallel Compares
Cycle 1:  R1=&b[j]
          R3=&a[i+j]
          R5=&c[i-j+7]
          p1 <- true
Cycle 2:  ld R2=[R1]
          ld R4=[R3]
          ld R6=[R5]
Cycle 4:  p1,p2 <- cmp.and(R2==true)
          p1,p2 <- cmp.and(R4==true)
          p1,p2 <- cmp.and(R6==true)
Cycle 5:  (p1) br then
          else

[Figure: 8 queens control flow; P1=true leads to Then, P1=false leads to Else]
Reduced from 7 cycles to 5

Five Predicate Compare Types
(qp) p1,p2 <- cmp.relation
  – if(qp) {p1 = relation; p2 = !relation};
(qp) p1,p2 <- cmp.relation.unc
  – p1 = qp&relation; p2 = qp&!relation;
(qp) p1,p2 <- cmp.relation.and
  – if(qp & (relation==FALSE)) { p1=0; p2=0; }
(qp) p1,p2 <- cmp.relation.or
  – if(qp & (relation==TRUE)) { p1=1; p2=1; }
(qp) p1,p2 <- cmp.relation.or.andcm
  – if(qp & (relation==TRUE)) { p1=1; p2=0; }
Tbit (Test Bit) Also Sets Predicates
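The five update rules can be written as one small function. This is a minimal sketch that follows the slide's pseudocode line by line; the function names and the string labels for the compare kinds are hypothetical. The second function replays the Eight Queens sequence (p1 preset to true, then three cmp.and compares) to show that the order of the AND compares does not matter, which is why they can share a cycle.

```python
def cmp_update(kind, qp, relation, p1, p2):
    """Return the new (p1, p2) after one compare of the given kind."""
    if kind == "normal":      # if(qp) {p1 = relation; p2 = !relation}
        return (relation, not relation) if qp else (p1, p2)
    if kind == "unc":         # p1 = qp&relation; p2 = qp&!relation
        return (qp and relation, qp and not relation)
    if kind == "and":         # clears both targets when the relation is false
        return (False, False) if (qp and not relation) else (p1, p2)
    if kind == "or":          # sets both targets when the relation is true
        return (True, True) if (qp and relation) else (p1, p2)
    if kind == "or.andcm":    # sets p1 and clears p2 when the relation is true
        return (True, False) if (qp and relation) else (p1, p2)
    raise ValueError(f"unknown compare kind: {kind}")


def queens_condition(b_j, a_ij, c_ij):
    """p1 <- true, then three cmp.and compares, as in the Eight Queens code."""
    p1, p2 = True, True
    for loaded in (b_j, a_ij, c_ij):
        p1, p2 = cmp_update("and", True, loaded is True, p1, p2)
    return p1
```

Note that only the unconditional form writes its targets when qp is false; every other form leaves them unchanged in that case.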

Predication Benefits
Reduces branches and mispredict penalties
  – 50% fewer branches and 37% faster code*
Parallel compares further reduce critical paths
Greatly improves code with hard to predict branches
  – Large server apps: capacity limited
  – Sorting, data mining: large database apps
  – Data compression
Traditional architectures' "bolt-on" approach can't efficiently approximate predication
  – Cmove: 39% more instructions, 23% slower performance*
  – Instructions must all be speculative
* Source: S. Mahlke, 1995

Speculation Review
Memory latency is a major performance bottleneck in today's systems
  – CPU to memory gap increasing
Allows elevation of load, even above a branch

Traditional Architectures        IA-64
    instr 1                          ld.s
    instr 2                          instr 1
    . . .                            instr 2
    br                               br
    ---- Barrier ----
    Load                             chk.s
    use                              use

Hoisting Uses
The uses of speculative data can also be executed speculatively
  – distinguishes speculation from simple prefetch

IA-64
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use
Enables Further Parallelism

Introducing the NaT ("Not a Thing")
NaT is the GR's 65th bit that indicates:
  – whether or not an exception has occurred
  – branch to fixup code required
NaT set during ld.s, checked by chk.s

IA-64
    ld.s            ; Exception Detection
    instr 1
    instr 2
    br
    chk.s           ; Exception Delivery
    use
Propagate Exception

Propagation
All computation instructions propagate NaTs to reduce number of checks
Cmp propagates "false" when writing predicates
RISC architectures require more instructions for equivalent integrity
  – e.g., non faulting load

    ld8.s r3 = (r9)
    ld8.s r4 = (r10)
    add   r6 = r3, r4
    ld8.s r5 = (r6)
    p1,p2 = cmp(...)
    chk.s r5
    sub   r7 = r5,r2
Allows single chk on result
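The propagation chain on this slide can be modelled with a sentinel value standing in for the NaT bit. This is a minimal sketch under stated assumptions: memory is a plain dictionary, a missing key stands for a load whose exception is deferred, and a speculative load through a NaT address is assumed to produce NaT as well. All names are hypothetical.

```python
NAT = object()   # stands in for a register whose NaT bit is set


def ld_s(memory, address):
    """Speculative load: defer the exception by returning NaT instead of faulting."""
    if address is NAT or address not in memory:
        return NAT
    return memory[address]


def add(x, y):
    """Computation instructions propagate NaT from either source."""
    return NAT if (x is NAT or y is NAT) else x + y


def chk_s(value):
    """True means the check would branch to recovery code."""
    return value is NAT


def speculative_chain(memory, r9, r10):
    r3 = ld_s(memory, r9)      # ld8.s r3 = (r9)
    r4 = ld_s(memory, r10)     # ld8.s r4 = (r10)
    r6 = add(r3, r4)           # add r6 = r3, r4
    r5 = ld_s(memory, r6)      # ld8.s r5 = (r6)
    return r5                  # a single chk.s r5 covers all three loads
```

With `memory = {100: 8, 200: 16, 24: 7}`, `speculative_chain(memory, 100, 200)` yields 7 and the check passes; if either starting address is unmapped, the NaT reaches r5 and one check detects it.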

Exception Deferral: More Than Skin Deep
Deferral allows the efficient delay of costly exceptions
OS controlled deferral by hardware of:
  – Page faults
  – Protection violations
  – …
NaTs enable deferral with recovery
Efficiently support structured exception handling in C/C++

Home Block                       Recovery code
    ld.s                             ld
    instr 1                          uses
    instr 2                          br home
    uses
    br
    chk.s
Complete Solution for Exception Management

Control Speculation Summary
All loads have a speculative form that sets the NaT bit when deferring exceptions
Computational instructions propagate NaTs
OS controls deferral of faults but supported directly in HW - "no-fault speculation"
  – Minimizes overhead of data that is not used
Chk more effective than non-faulting load

Store Barrier
Traditional architectures limited by the Store Barrier

Traditional Architectures
    instr 1
    instr 2
    . . .
    Store (*)
    ---- Barrier ----
    Load (*)
    use

Introducing Data Speculation
Compiler can issue a load prior to a preceding, possibly-conflicting store
Unique feature to IA-64

Traditional Architectures        IA-64
    instr 1                          ld8.a
    instr 2                          instr 1
    . . .                            instr 2
    st8                              st8
    ---- Barrier ----
    ld8                              ld.c
    use                              use

Data Speculation
Uses can be hoisted

Load hoisted                     Load and use hoisted
    ld8.a                            ld8.a
    instr 1                          instr 1
    instr 2                          use
    st8                              instr 2
    ld.c                             st8
    use                              chk.a

Recovery code
    ld8
    uses
    br home
Synergy with control speculation yields greater performance

Advanced Load Address Table - ALAT
ld.a inserts entries.
Conflicting stores remove entries
  – Also: ld.c.clr, chk.a.clr
Presence of entry indicates success
  – chk.a branches when no entry is found
[Figure: the ALAT as a table of (reg #, Address) entries; "ld.a reg# = ..." inserts an entry, "st" is compared against the stored addresses, "chk.a reg#" looks up an entry]
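The ALAT rules on this slide (ld.a inserts, a conflicting store removes, presence means success) can be sketched as a small table keyed by register. This is a simplified model, not the hardware's behaviour: it matches addresses exactly and ignores access size and table capacity. The class, method and register names are hypothetical.

```python
class Alat:
    """Toy Advanced Load Address Table: target register -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, regs, mem, reg, address):
        # Advanced load: do the load early and record it in the table.
        regs[reg] = mem[address]
        self.entries[reg] = address

    def st(self, mem, address, value):
        # A store removes every entry whose address it conflicts with.
        mem[address] = value
        self.entries = {r: a for r, a in self.entries.items() if a != address}

    def ld_c(self, regs, mem, reg, address):
        # Check load: reload only if the entry is gone.
        if reg not in self.entries:
            regs[reg] = mem[address]

    def chk_a_fails(self, reg):
        # chk.a branches to recovery when no entry is found.
        return reg not in self.entries


def demo():
    mem, regs, alat = {0x10: 1, 0x20: 2}, {}, Alat()
    alat.ld_a(regs, mem, "r4", 0x10)       # load hoisted above the stores
    alat.st(mem, 0x20, 9)                  # different address: entry survives
    survived = not alat.chk_a_fails("r4")
    alat.st(mem, 0x10, 5)                  # same address: entry removed
    failed = alat.chk_a_fails("r4")
    alat.ld_c(regs, mem, "r4", 0x10)       # check load picks up the new value
    return survived, failed, regs["r4"]
```

In `demo()`, the advanced load survives the first store, is invalidated by the second, and the check load then returns the stored value 5.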

Architectural Support for Data Speculation
Instructions
  – ld.a - advanced loads
  – ld.c - check loads
  – chk.a - advance load checks
Speculative Advanced loads - ld.sa - is an advanced load with deferral
ALAT - HW structure containing outstanding advanced loads

Speculation Benefits
Reduces impact of memory latency
  – Study demonstrates performance improvement of 79% when combined with predication*
Greatest improvement to code with many cache accesses
  – Large databases
  – Operating systems
Scheduling flexibility enables new levels of performance headroom
* August et al., 1998

Agenda
Architecture Principles
Predication & Speculation
Branch Architecture
Software Pipelining

Branch Instruction
[Figure: 128-bit bundle, bits 127..0, with fields Instruction 1, Instruction 0, Template; a 41-bit branch instruction with fields Branch, IP-Offset (21 bits), QP]
Two basic branch formats
  – Relative: IP := IP + Offset21
  – Indirect: IP := BR[I]
  – 8 branch registers for efficient branch execution
  – Call/Return linking through branch registers
Loop branches with 64-bit loopcount register (LC)
  – Enables perfect branch prediction of counted loops
  – Traditional architectures always mispredict last iteration
  – Incurs misprediction stall costing many cycles
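The widths on this slide fit together as 5 + 3 x 41 = 128. The sketch below packs and unpacks a bundle on that basis. The slide itself only states the 128-bit, 41-bit and 21-bit widths; the 5-bit template in the low bits, the three instruction slots above it, and the qualifying predicate in the low 6 bits of each instruction are taken from the IA-64 architecture definition and should be read as assumptions here. The function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_COUNT = 3
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Combine a template and three 41-bit instruction slots into one integer."""
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for index, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(SLOT_COUNT)
    ]
    return template, slots


def qualifying_predicate(slot):
    """The QP field: the predicate register number that guards the instruction."""
    return slot & 0x3F
```

A 6-bit QP field is enough to name any of the 64 predicate registers, which is what lets every instruction, including a branch, carry its own guard.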

Branch Predicates
Conditional branches
    (p1) BR #label_A;
Unconditional branches
    (p0) BR #label_A;        // p0 is "always true"
[Figure: blocks A and B with edges labeled P1=true and P1=false]
Compiler directed static prediction augments dynamic prediction
  – Better predict highly correlated branches (always/never taken)
  – Frees space in H/W predictor
  – Can give hint for dynamic predictor

Queens Loop: Parallel Compares & Compare-branch
Cycle 1:  R1=&b[j]
          R3=&a[i+j]
          R5=&c[i-j+7]
          p1 <- true
Cycle 2:  ld R2=[R1]
          ld R4=[R3]
          ld R6=[R5]
Cycle 4:  p1,p2 <- cmp.and(R2==true)
          p1,p2 <- cmp.and(R4==true)
          p1,p2 <- cmp.and(R6==true)
          (p1) br then
          else
Compare & Branch in Same Cycle
From 5 Cycles Down to 4

Multi-way Branch
Multiway branches: more than 1 branch in a single cycle
Allows n-way branching

w/o Speculation
    ld8 r6 = (ra)
    (p1) br exit1
    ld8 r7 = (rb)
    (p3) br exit2
    ld8 r8 = (rc)
    (p5) br exit3

Hoisting Loads (3 branch cycles)
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    chk r6, rec0
    (p1) br exit1
    chk r7, rec1
    (p3) br exit2
    chk r8, rec2
    (p5) br exit3

IA-64 (1 branch cycle)
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    chk r6, rec0
    (p2) chk r7, rec1
    (p4) chk r8, rec2
    }{
    (p1) br exit1
    (p3) br exit2
    (p5) br exit3
    }

[Figure: control flow with predicate pairs P1/P2, P3/P4 and P5/P6]
Supports Aggressive Speculation

Software Pipelining
Overlapping execution of different loop iterations
[Figure: sequential loop iterations vs. overlapped loop iterations]
More iterations in same amount of time

Software Pipelining
IA-64 features that make this possible
  – Full Predication
  – Special branch handling features
  – Register rotation: removes loop copy overhead
  – Predicate rotation: removes prologue & epilogue
Traditional architectures use loop unrolling
  – High overhead: extra code for loop body, prologue, and epilogue
Especially Useful for Integer Code With Small Number of Loop Iterations

Basic Loop Example
Simple Non-overlapping iterations
  – 2 cycles per iteration
  – 3 operations in loop body

    for (i=0; i<n; i++) {
        *b++ = *a++;
    } /* MemCopy */

Basic Copy Loop (3 ops)
    // setup ra/rb/lc
    .label loop
    {
        ld8 r35 = [ra],8
    }{
        st8 [rb],8 = r35
        br.cloop #loop      // check n!=0
    }

Execution (cycles):  1     2              3     4              5     6              7     8
                     ld1   st1 br.cloop   ld2   st2 br.cloop   ld3   st3 br.cloop   ld4   st4 br.cloop

Loop Support: Unrolling
Overlapped iterations
  – 1 cycle per word
  – 1.6X performance improvement
  – 3.3X code expansion

Unrolled Copy Loop (10 ops)
    // Test for loop count 0,1
    ld8 r34 = [ra],8
    .label loop
    ld8 r35 = [ra],8
    st8 [rb],8 = r34
    br.cle #e-exit
    ld8 r34 = [ra],8
    st8 [rb],8 = r35
    br.cloop #loop
    st8 [rb],8 = r34
    br #thru
    .label e-exit
    st8 [rb],8 = r35
    .label thru

Execution cycles:  1          2                3                  4                5
                   ld1        st1 ld2 br.cle   st2 ld3 br.cloop   st3 ld4 br.cle   st4
                   Prologue   Main loop                                            Epilogue
Incurs Code Expansion Penalties

Software Register Renaming
Traditional Architecture
[Figure, five animation frames: registers R32, R33, R34, R35 with the instructions issued in each frame]
Frame 1:  ld1 r34
Frame 2:  st1 r34, ld2 r35
Frame 3:  st2 r35, ld3 r34
Frame 4:  st3 r34, ld4 r35
Frame 5:  st4 r35
PalmPalm SunnySunnyisisSpringsSprings
RRB=0RRB=0
Introducing Rotating Introducing Rotating RegistersRegisters GR 32-127, FR32-127 can rotateGR 32-127, FR32-127 can rotate Separate Rotating Register Base for each: GRs, FRsSeparate Rotating Register Base for each: GRs, FRs Loop branches decrement all register rotating bases (RRB)Loop branches decrement all register rotating bases (RRB) Instructions contain a “virtual” register number Instructions contain a “virtual” register number
– RRB + virtual register number = physical register number.RRB + virtual register number = physical register number.
ldld11 R35 R35
......
35:35:34:34:33:33:32:32:
36:36:
......
PalmPalm
®®
PalmPalm SunnySunnyisisSpringsSprings
IA-64IA-64
......
35:35:34:34:33:33:32:32:
36:36:
......
RRB=0RRB=0
Introducing Rotating Introducing Rotating RegistersRegisters GR 32-127, FR32-127 can rotateGR 32-127, FR32-127 can rotate Separate Rotating Register Base for each: GRs, FRsSeparate Rotating Register Base for each: GRs, FRs Loop branches decrement all register rotating bases (RRB)Loop branches decrement all register rotating bases (RRB) Instructions contain a “virtual” register number Instructions contain a “virtual” register number
– RRB + virtual register number = physical register number.RRB + virtual register number = physical register number.
PalmPalm
ldld22 R34 R34
stst11 R35 R35
SpringsSpringsPalmPalm
®®
PalmPalm SunnySunnyisisSpringsSprings
IA-64IA-64
......
34:34:33:33:32:32:127:127:
35:35:
......
RRB=-1RRB=-1
Introducing Rotating Introducing Rotating RegistersRegisters GR 32-127, FR32-127 can rotateGR 32-127, FR32-127 can rotate Separate Rotating Register Base for each: GRs, FRsSeparate Rotating Register Base for each: GRs, FRs Loop branches decrement all register rotating bases (RRB)Loop branches decrement all register rotating bases (RRB) Instructions contain a “virtual” register number Instructions contain a “virtual” register number
– RRB + virtual register number = physical register number.RRB + virtual register number = physical register number.
PalmPalm SpringsSprings
ldld33 R34 R34
stst22 R35 R35
isisSpringsSpringsPalmPalm
®®
PalmPalm SunnySunnyisisSpringsSprings
IA-64IA-64
......
33:33:32:32:127:127:126:126:
34:34:
......
RRB=-2RRB=-2
Introducing Rotating Introducing Rotating RegistersRegisters GR 32-127, FR32-127 can rotateGR 32-127, FR32-127 can rotate Separate Rotating Register Base for each: GRs, FRsSeparate Rotating Register Base for each: GRs, FRs Loop branches decrement all register rotating bases (RRB)Loop branches decrement all register rotating bases (RRB) Instructions contain a “virtual” register number Instructions contain a “virtual” register number
– RRB + virtual register number = physical register number.RRB + virtual register number = physical register number.
PalmPalm SpringsSprings
ldld44 R34 R34
stst33 R35 R35
SunnySunnyisisSpringsSprings
isis
®®
PalmPalm SunnySunnyisisSpringsSprings
IA-64IA-64
......
32:32:127:127:126:126:125:125:
33:33:
......
RRB=-3RRB=-3
Introducing Rotating Introducing Rotating RegistersRegisters GR 32-127, FR32-127 can rotateGR 32-127, FR32-127 can rotate Separate Rotating Register Base for each: GRs, FRsSeparate Rotating Register Base for each: GRs, FRs Loop branches decrement all register rotating bases (RRB)Loop branches decrement all register rotating bases (RRB) Instructions contain a “virtual” register number Instructions contain a “virtual” register number
– RRB + virtual register number = physical register number.RRB + virtual register number = physical register number.
PalmPalm SpringsSprings
stst44 R35 R35SunnySunnyisis
isis SunnySunny
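The mapping "RRB + virtual register number = physical register number" has to wrap around inside the rotating region, which is why register 127 appears once RRB goes negative. A minimal sketch of that arithmetic, assuming the full range GR 32-127 rotates as the slide states (96 registers); the function name is hypothetical.

```python
FIRST_ROTATING = 32
ROTATING_COUNT = 96    # GR 32-127, per the slide


def physical_register(virtual, rrb):
    """Map a virtual register number to its physical register for a given RRB."""
    if virtual < FIRST_ROTATING:
        return virtual     # registers below 32 do not rotate
    offset = (virtual - FIRST_ROTATING + rrb) % ROTATING_COUNT
    return FIRST_ROTATING + offset
```

For example, a value loaded into virtual R34 while RRB=0 sits in physical register 34; after the loop branch sets RRB=-1, that same physical register is reached as virtual R35, so the store in the next iteration finds the data without any copy instruction.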

Loop Support: Rotating Registers
Modulo Scheduled Iterations
  – 1 cycle per word
  – 1.6X performance improvement
  – additional upside for higher latency conditions
  – 1.7X code expansion

Software Pipelined Copy Loop (5 ops)
    // setup ra/rb/lc/ec, check n > 2
    {
        ld8 r35 = [ra],8
    }
    .label loop
    {
        ld8 r34 = [ra],8
        st8 [rb] = r35,8
        br.ctop #loop
    }{
        st8 [rb] = r35,8
    }

Execution cycles:  1          2                 3                 4                 5
                   ld1        st1 ld2 br.ctop   st2 ld3 br.ctop   st3 ld4 br.ctop   st4
                   Prologue   Main loop                                             Epilogue
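The improvement and expansion figures quoted for the copy-loop variants can be reproduced from the operation and cycle counts given with the loops. The sketch assumes four words are copied, as in the execution timelines: 8 cycles for the basic loop, 5 cycles for the overlapped versions, and static sizes of 3, 10 and 5 operations. The variable names are hypothetical.

```python
words = 4

basic_cycles = 2 * words         # basic loop: 2 cycles per iteration
overlapped_cycles = 1 + words    # one prologue cycle, then 1 cycle per word

speedup = basic_cycles / overlapped_cycles

ops = {"basic": 3, "unrolled": 10, "rotating": 5}
expansion = {name: count / ops["basic"] for name, count in ops.items()}

print(f"speedup: {speedup:.1f}X")
for name, factor in expansion.items():
    print(f"{name}: {factor:.1f}X code size")
```

Under these assumptions the unrolled and rotating-register loops reach the same cycle count, and the difference between them is only in code size.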
®®
Introducing Rotating Introducing Rotating Predicate RegistersPredicate Registers PR16-63 can rotate, with separate Rotating Register BasePR16-63 can rotate, with separate Rotating Register Base Loop branches decrement all register rotating base (RRB)Loop branches decrement all register rotating base (RRB) Instructions contain a “virtual” predicate register number Instructions contain a “virtual” predicate register number
– RRB + virtual register number = physical register number.RRB + virtual register number = physical register number.
(p16) ld R34(p16) ld R34(p17) st R35(p17) st R35
RRB=0RRB=0
LC=3LC=3EC=2EC=2
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
0000
00
CodeCode
(p16) ld R34(p16) ld R34
(p17) st R35(p17) st R35(p16) ld(p16) ld11 R34 R34
InitializeInitializeInitializeInitialize
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
11
00
00
®®
Introducing Rotating Introducing Rotating Predicate RegistersPredicate Registers PR16-63 can rotate, with separate Rotating Register BasePR16-63 can rotate, with separate Rotating Register Base Loop branches decrement all register rotating base (RRB)Loop branches decrement all register rotating base (RRB) Instructions contain a “virtual” predicate register number Instructions contain a “virtual” predicate register number
– RRB + virtual register number = physical register number.RRB + virtual register number = physical register number.
(p16) ld R34(p16) ld R34(p17) st R35(p17) st R35
LC=2LC=2EC=2EC=2 IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
0000
00
CodeCode
(p16) ld R34(p16) ld R34
(p17) st R35(p17) st R35(p16) ld(p16) ld11 R34 R34
Branch 1Branch 1Branch 1Branch 1
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
1100
00
RRB=-1RRB=-1
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
1111
00
IA-64IA-64
......
16:16:63:63:62:62:61:61:
17:17:
......
1100
11
00
00
(p17) st R35(p17) st R35 (p17) st(p17) st11 R35 R35(p16) ld(p16) ld22 R34 R34
®®
Introducing Rotating Introducing Rotating Predicate RegistersPredicate Registers PR16-63 can rotate, with separate Rotating Register BasePR16-63 can rotate, with separate Rotating Register Base Loop branches decrement all register rotating base (RRB)Loop branches decrement all register rotating base (RRB) Instructions contain a “virtual” predicate register number Instructions contain a “virtual” predicate register number
– RRB + virtual register number = physical register number.RRB + virtual register number = physical register number.
(p16) ld R34(p16) ld R34(p17) st R35(p17) st R35
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
0000
00
CodeCode
(p16) ld R34(p16) ld R34
(p17) st R35(p17) st R35(p16) ld(p16) ld11 R34 R34
Branch 2Branch 2Branch 2Branch 2
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
1100
00
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
1111
00
IA-64IA-64
......
16:16:63:63:62:62:61:61:
17:17:
......
1100
1100
00
(p17) st R35(p17) st R35 (p17) st(p17) st11 R35 R35(p16) ld(p16) ld22 R34 R34
LC=1LC=1EC=2EC=2 IA-64IA-64
......
16:16:63:63:62:62:61:61:
17:17:
......
1100
1111
00
RRB=-2RRB=-2
IA-64IA-64
......
63:63:62:62:61:61:60:60:
16:16:
......
1111
11
00
00
(p17) st(p17) st22 R35 R35(p16) ld(p16) ld33 R34 R34
®®
Introducing Rotating Introducing Rotating Predicate RegistersPredicate Registers PR16-63 can rotate, with separate Rotating Register BasePR16-63 can rotate, with separate Rotating Register Base Loop branches decrement all register rotating base (RRB)Loop branches decrement all register rotating base (RRB) Instructions contain a “virtual” predicate register number Instructions contain a “virtual” predicate register number
– RRB + virtual register number = physical register number.RRB + virtual register number = physical register number.
(p16) ld R34(p16) ld R34(p17) st R35(p17) st R35
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
0000
00
CodeCode
(p16) ld R34(p16) ld R34
(p17) st R35(p17) st R35(p16) ld(p16) ld11 R34 R34
Branch 3Branch 3Branch 3Branch 3
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
1100
00
IA-64IA-64
......
17:17:16:16:63:63:62:62:
18:18:
......
0000
1111
00
IA-64IA-64
......
16:16:63:63:62:62:61:61:
17:17:
......
1100
1100
00
(p17) st R35(p17) st R35 (p17) st(p17) st11 R35 R35(p16) ld(p16) ld22 R34 R34
IA-64IA-64
......
16:16:63:63:62:62:61:61:
17:17:
......
1100
1111
00
IA-64IA-64
......
63:63:62:62:61:61:60:60:
16:16:
......
1111
1100
00
(p17) st(p17) st22 R35 R35(p16) ld(p16) ld33 R34 R34
LC=0LC=0EC=2EC=2 IA-64IA-64
......
63:63:62:62:61:61:60:60:
16:16:
......
1111
1111
00
RRB=-3RRB=-3
IA-64IA-64
......
62:62:61:61:60:60:59:59:
63:63:
......
1111
11
00
00
(p17) st(p17) st33 R35 R35(p16) ld(p16) ld44 R34 R34
®®
Introducing Rotating Predicate Registers
Branch 4
[Animation: predicate register file, relabeled after each loop branch as RRB decrements]
Executed so far: ld1 | st1, ld2 | st2, ld3 | st3, ld4
After Branch 4: LC=0, EC=1, RRB=-4
Next pass: (p17) st R35 runs, (p16) ld R34 is predicated off
Introducing Rotating Predicate Registers
Fall Through
[Animation: predicate register file, relabeled after each loop branch as RRB decrements]
Executed: ld1 | st1, ld2 | st2, ld3 | st3, ld4 | st4
After the final loop branch: LC=0, EC=0, RRB=-5
The branch falls through and the loop exits.
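The LC, EC and RRB values shown across these animation frames can be reproduced with a small counter model of the loop branch. This is a hedged sketch: the function names `br_ctop` and `trace` are hypothetical, and the initial values LC = n - 1 and EC = number of pipeline stages are assumptions inferred from the counter values on the slides, not something the slides state.

```python
def br_ctop(lc: int, ec: int, rrb: int):
    """One execution of a counted-loop branch in a simplified model.

    Returns (lc, ec, rrb, value_written_to_pr63, taken).
    """
    if lc > 0:
        return lc - 1, ec, rrb - 1, 1, True   # more iterations to start
    if ec > 1:
        return lc, ec - 1, rrb - 1, 0, True   # draining the pipeline
    if ec == 1:
        return lc, 0, rrb - 1, 0, False       # last drain step, fall through
    return lc, ec, rrb, 0, False              # nothing left to rotate


def trace(n: int, stages: int):
    """Return (LC, EC, RRB, taken) after each loop branch for an n-word loop."""
    lc, ec, rrb = n - 1, stages, 0            # assumed setup values
    rows = []
    taken = True
    while taken:
        lc, ec, rrb, _, taken = br_ctop(lc, ec, rrb)
        rows.append((lc, ec, rrb, taken))
    return rows


if __name__ == "__main__":
    for row in trace(4, 2):
        print(row)
```

For four words and two stages, the third, fourth and fifth rows give LC=0/EC=2/RRB=-3, LC=0/EC=1/RRB=-4 and LC=0/EC=0/RRB=-5, the same values the Branch 3, Branch 4 and Fall Through frames show.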
Software Pipelined Copy Loop
// setup ra/rb/lc/ec, check n > 1
.label loop
{
  (p16) ld8 r34 = [ra],8
  (p17) st8 [rb] = r35,8
  br.ctop #loop
}

Main loop: 3 ops (ld, st, br.ctop)

Execution cycle   1        2        3        4        5
ld                ld1      ld2      ld3      ld4      ld
st                st       st1      st2      st3      st4
branch            br.ctop  br.ctop  br.ctop  br.ctop  br.ctop
(Unnumbered ops are predicated off.)
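To see why the load targets r34 while the store reads r35, the loop above can be modeled in Python. This is a minimal sketch under stated assumptions, not a cycle-accurate simulator: the function `pipelined_copy` is hypothetical, the rotating general-register region is assumed to be r32-r39, the setup is assumed to be LC = n - 1, EC = 2 and p16 = 1, a single rotating base stands in for the per-file bases (they move in lockstep here), and word indices replace byte addresses.

```python
def pipelined_copy(src):
    """Simplified model of: (p16) ld r34 / (p17) st r35 / br.ctop."""
    n = len(src)
    assert n >= 1
    dst = [None] * n
    gr = [None] * 8        # assumed rotating GR region r32..r39
    pr = [0] * 48          # rotating predicates PR16..PR63
    pr[0] = 1              # assumed setup: p16 = 1 enables the first load
    lc, ec = n - 1, 2      # assumed setup: LC = n - 1, EC = 2 stages
    rrb = 0                # one base for both register files
    ra = rb = 0            # word indices stand in for byte addresses
    schedule = []

    def g(v):              # virtual GR number -> slot in gr
        return (v - 32 + rrb) % 8

    def p(v):              # virtual PR number -> slot in pr
        return (v - 16 + rrb) % 48

    while True:
        ops = []
        if pr[p(16)]:      # (p16) ld8 r34 = [ra],8
            gr[g(34)] = src[ra]
            ra += 1
            ops.append(f"ld{ra}")
        if pr[p(17)]:      # (p17) st8 [rb] = r35,8
            dst[rb] = gr[g(35)]
            rb += 1
            ops.append(f"st{rb}")
        schedule.append(ops)
        # br.ctop: update counters, write PR63, then rotate.
        if lc > 0:
            lc, p63, taken = lc - 1, 1, True
        elif ec > 1:
            ec, p63, taken = ec - 1, 0, True
        else:
            ec, p63, taken = 0, 0, False
        pr[p(63)] = p63
        rrb -= 1
        if not taken:
            return dst, schedule


if __name__ == "__main__":
    copied, sched = pipelined_copy([10, 20, 30, 40])
    print(copied)
    print(sched)
```

Because the base decrements on every branch, the value loaded into r34 on one pass is visible as r35 on the next, so the store always sees the word loaded one pass earlier. For a four-word input the recorded schedule has five passes, matching the five execution cycles in the diagram.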
Efficient Loop, Efficient Code Size

Loop Support: Rotating Predicates
Software Pipelined MemCopy
– 1 cycle per word
– 1.6X performance improvement
– no code expansion
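As a rough check on the "1 cycle per word" bullet, assume (as the five-cycle diagram suggests) that one pass through the loop issues per cycle. The symbols N (words copied), S (pipeline stages) and T (total cycles) are introduced here for illustration and do not appear on the slide.

```latex
\[
  T(N) = N + (S - 1), \qquad
  \frac{T(N)}{N} = 1 + \frac{S - 1}{N} \;\longrightarrow\; 1
  \quad \text{as } N \to \infty .
\]
```

For the example on the slide, N = 4 and S = 2 give T = 5 cycles, and the per-word cost approaches one cycle as the copy gets longer.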
Software Pipelining Benefits
Loop pipelining maximizes performance; minimizes overhead
– Avoids code expansion of unrolling and code explosion of prologue and epilogue
– Smaller code means fewer cache misses
– Greater performance improvements in higher latency conditions
Reduced overhead allows S/W pipelining of small loops with unknown trip counts
– Typical of integer scalar codes
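The code-size point can be made concrete with an idealized count of static operations. This sketch assumes one operation per pipeline stage plus one loop branch, and compares a kernel-only loop (stages enabled by predicates) with a conventional layout that spells out the prologue and epilogue as partial copies of the kernel. The function names are hypothetical, and real compilers may add further expansion (for example, unrolling for register renaming) that this count ignores.

```python
def static_ops_kernel_only(stages: int) -> int:
    """Kernel-only loop: each stage's op appears once, plus the loop branch."""
    return stages + 1


def static_ops_explicit(stages: int) -> int:
    """Explicit prologue and epilogue around the same kernel."""
    prologue = sum(range(1, stages))   # passes with 1, 2, ..., S-1 ops
    epilogue = sum(range(1, stages))   # passes with S-1, ..., 2, 1 ops
    kernel = stages + 1                # full kernel plus the loop branch
    return prologue + kernel + epilogue


if __name__ == "__main__":
    for s in (2, 3, 4, 8):
        print(s, static_ops_kernel_only(s), static_ops_explicit(s))
```

For the two-stage copy loop this gives 3 operations for the kernel-only form, the same "3 ops" the copy-loop slide reports, against 5 for the explicit form. Under these assumptions the gap grows quadratically with the number of stages.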
Reviewing What’s New:
Parallel compares
Tbit
NaT bits
Deferral
Hoisting uses
Propagation
Branch instructions
Static prediction
Advanced loads
ALAT
Loop branches
LC register
EC register
Multiway branch
Branch registers
Register rotation
Predicate rotation
RRB
Summary
Speculation reduces memory latency impact
– IA-64 removes recovery from critical path
– Benefits applications with poor cache locality: server applications, OS
Predication removes branches
– Parallel compares increase parallelism
– Benefits complex control flow: large databases
S/W pipelining support with minimal overhead enables broad usage
– Performance for small integer loops with unknown trip counts as well as monster FP loops