EPIC Architecture (Explicitly Parallel Instruction Computing)
Yangyang Wen
CDA5160 -- Advanced Computer Architecture I
University of Central Florida
Outline
What is EPIC?
EPIC Philosophy
Architectural Features Supporting EPIC
Intel's IA-64 Architectural Features
IA-64's Key Technologies
Summary and Reference
Traditional Architectures: Limited Parallelism
[Figure: original source code -> compiler -> sequential machine code -> hardware -> parallelized code -> multiple functional units]
The available execution units are not used efficiently: today's processors are often 60% idle.
EPIC Architecture: Explicit Parallelism
[Figure: original source code -> compiler (the EPIC compiler views a wider scope) -> better parallel machine code -> hardware with multiple functional units]
Increases parallel execution
Gets more efficient use of execution resources
What is EPIC?
EPIC stands for Explicitly Parallel Instruction Computing. An EPIC architecture provides features that allow the compiler to take a proactive role in enhancing instruction-level parallelism (ILP) without unacceptable hardware complexity.
EPIC's Performance
EPIC Design Philosophy
EPIC gives the compiler advanced features for enhancing ILP, such as predication and speculation.
The compiler designs the plan of execution (POE) at compile time and communicates the POE to the hardware.
EPIC requires massive hardware resources for parallel execution.
Introducing IA-64
IA-64 comes from Intel and is Intel's first 64-bit architecture.
It is the first commercially available instance of an EPIC ISA.
It is the first architecture to bring these ILP features to general-purpose microprocessors.
IA-64's Architectural Basics
Explicit parallelism
Enhanced ILP
Compiler-oriented
Extremely large physical memory
A huge virtual address space for applications
64-bit computation
Extremely large register files
IA-64's Key Technologies
Instruction bundling
Predication
Control speculation
Data speculation
Software pipelining
Instruction Bundling
Uses a form of VLIW architecture.
Three instructions are combined into a 128-bit bundle.
Parallel instructions are executed in groups.
Template bits decode and route instructions and mark the end of groups of parallel instructions.
128-bit bundle layout (bit 127 down to bit 0):
Instruction 2 (41 bits) | Instruction 1 (41 bits) | Instruction 0 (41 bits) | Template (5 bits)
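The bundle layout can be illustrated with a small bit-packing sketch. It assumes the template occupies the low 5 bits (128 - 3 x 41 = 5) with the three 41-bit slots above it. The function names and the sample values are hypothetical, chosen only for illustration; this is not an encoder for real IA-64 instructions.

```python
SLOT_BITS = 41      # each instruction slot is 41 bits wide
TEMPLATE_BITS = 5   # 128 - 3 * 41 = 5 bits remain for the template


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        # Slot 0 sits just above the template, slot 2 ends at bit 127.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    )
    return template, slots
```

Packing and then unpacking returns the original fields, which is a quick way to confirm that the four fields fill exactly 128 bits.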
ILP Bottlenecks
Branches
– To deal with branches, use predication.
– Branch mispredictions cause a 20% to 30% loss in processor performance.
Memory latency
– Latency is the time it takes to get data from memory. The longer it takes to access memory for code and data, the longer the CPU sits idle.
– For memory latency, the loads are the big problem, not the stores.
Predication
Branching is a major cause of lost performance.
Source code:
    if A > B
        S += A
    else
        S += B
    end if
(a) Traditional branch prediction: at "if A > B" the processor predicts S += A. If the prediction is wrong, it throws away S += A and then executes S += B.
(b) IA-64 predication: S += A and S += B are both issued under the predicate A > B, and only the correct result reaches *P = S.
EPIC Predication Process
Branch candidate: the compiler finds which instructions to execute in parallel.
Instructions are marked with an ID.
Instructions are packed into bundles.
The processor executes both paths in parallel.
The processor checks the predicates and stores the correct results.
Predication Benefits
Reduce branches
Reduce misprediction penalties
Reduce critical paths
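A minimal sketch of the idea, using the S += A / S += B example from the slides: the compare produces a pair of complementary predicates, both updates are computed, and only the one whose predicate is true is committed. The helper names (`commit`, `predicated`, `branching`) are hypothetical, and the `commit` function merely models the hardware discarding a result whose predicate is false.

```python
def branching(a, b, s):
    """Reference version with an ordinary branch."""
    if a > b:
        s += a
    else:
        s += b
    return s


def commit(predicate, new_value, old_value):
    """Model a predicated write: the result is kept only if the predicate holds."""
    return new_value if predicate else old_value


def predicated(a, b, s):
    """Branch-free version: both arms are computed, predicates select the result."""
    p1 = a > b        # the compare sets a predicate ...
    p2 = not p1       # ... and its complement
    s = commit(p1, s + a, s)   # (p1) S += A
    s = commit(p2, s + b, s)   # (p2) S += B
    return s
```

Both functions return the same value for every input; the predicated form trades a possible misprediction for the cost of computing both arms.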
Control Speculation
Memory latency is a major performance bottleneck.

Traditional Architectures
    instr 1
    instr 2
    . . .
    br            ; barrier
    Load a[ ]
    use
Elevating the load above a branch is not possible.

IA-64 Architectures
    ld.s r8=a[ ]
    instr 1
    instr 2
    br
    chk.s r8
    use
Allows elevation of the load, even above a branch.
Introducing the Token Bit
IA-64
    ld.s r8=a[ ]      ; exception detection
    instr 1
    instr 2
    br
    chk.s r8          ; exception delivery
    use
The token bit propagates the exception from ld.s to chk.s.
When a load is elevated, exception detection happens at the ld.s.
If the load address is valid, the load completes normally.
If the load address is invalid, the hardware sets the token bit on the target register instead of raising the exception, and execution continues.
If execution reaches chk.s and chk.s detects the token bit, it jumps to fix-up code, which executes the load.
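The ld.s / chk.s protocol can be sketched as a toy model. It assumes memory is a dictionary and an "invalid address" is simply a missing key; `Reg`, `ld_s`, `chk_s` and `run` are hypothetical names for illustration, not a description of the real hardware.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    value: int = 0
    token: bool = False   # deferred-exception token bit


def ld_s(memory, addr):
    """Speculative load: on a bad address, set the token bit instead of faulting."""
    if addr in memory:
        return Reg(memory[addr], False)
    return Reg(0, True)


def chk_s(reg, fix_up):
    """Check: if the token bit is set, run the fix-up code; otherwise use the value."""
    if reg.token:
        return fix_up()
    return reg.value


def run(memory, addr, take_branch):
    r8 = ld_s(memory, addr)          # load elevated above the branch
    if take_branch:                  # br: the path that uses r8 is never reached
        return None
    # The fix-up code repeats the load non-speculatively, so a real fault
    # (here a KeyError) is delivered only on the path that needs the data.
    return chk_s(r8, lambda: memory[addr])
```

With a valid address the value flows straight through. With an invalid address the fault is deferred: it disappears if the branch is taken and surfaces at chk.s otherwise.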
Data Speculation
Traditional Architectures
    instr 1
    instr 2
    . . .
    store         ; barrier
    load
    use
The load cannot be elevated above the store, which prevents the instructions from being reordered.

IA-64
    ld.a
    instr 1
    instr 2
    store
    ld.c
    use
Allows the compiler to elevate the load even if it isn't sure whether the memory references overlap.
The advanced load is tracked in the ALAT and verified by the check (chk.a).
Advanced Load Address Table: ALAT
Each ALAT entry: reg # | Address
    ld.a reg# = ...
    store
    chk.a reg#
When ld.a is elevated, an entry is inserted into the ALAT.
On a store, records with overlapping addresses are removed from the ALAT.
On chk.a, if no address is found, there is a conflict, and execution jumps to fix-up code to re-execute the code.
Speculation Benefits
Reduces the impact of memory latency
A study demonstrates a performance improvement of 80% when combined with predication
Greatest improvement to code with many cache accesses
Scheduling flexibility enables new levels of performance headroom
Software Pipelining
[Figure: sequential loop execution vs. overlapped loop execution]
• Overlap the execution of different loop iterations
• Get more iterations in the same amount of time
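Software Pipelining Example
    for (i = 0; i < 1000; i++)
        x[i] = x[i] + s;

Original loop:
    Loop: Ld   f0,0(r1)      ; load x[i]
          Add  f0,f0,f1      ; x[i] + s
          Sd   f0,0(r1)      ; store x[i]
          Add  r1,r1,8       ; advance to the next element
          Subi r2,r2,1       ; decrement the loop counter
          Bnez r2,Loop

Software-pipelined loop (kernel only; prologue and epilogue omitted):
    Loop: Sd   f2,-8(r1)     ; store the result of iteration i-1
          Add  f2,f0,f1      ; add for iteration i
          Ld   f0,8(r1)      ; load for iteration i+1
          Add  r1,r1,8       ; advance to the next element
          Subi r2,r2,1       ; decrement the loop counter
          Bnez r2,Loop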
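The transformation can be checked with a short sketch that writes out the prologue, kernel and epilogue explicitly. Each kernel pass stores the result of one iteration, adds for the next, and loads for the one after that. The function names are hypothetical and the three-stage split (load, add, store) is an assumption made for illustration.

```python
def plain_loop(x, s):
    """Reference: x[i] = x[i] + s, one full iteration at a time."""
    out = list(x)
    for i in range(len(out)):
        out[i] = out[i] + s
    return out


def software_pipelined(x, s):
    """Same computation with load, add and store taken from different iterations."""
    out = list(x)
    n = len(out)
    if n < 3:                      # too short to fill the pipeline
        return plain_loop(out, s)
    # Prologue: start iterations 0 and 1.
    loaded = out[0]                # load for iteration 0
    summed = loaded + s            # add for iteration 0
    loaded = out[1]                # load for iteration 1
    # Kernel: one store, one add and one load per pass.
    for i in range(n - 2):
        out[i] = summed            # store for iteration i
        summed = loaded + s        # add for iteration i + 1
        loaded = out[i + 2]        # load for iteration i + 2
    # Epilogue: drain the last two iterations.
    out[n - 2] = summed
    out[n - 1] = loaded + s
    return out
```

The two functions produce identical results; the pipelined version simply reorders the work so that independent operations from neighboring iterations sit next to each other.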
Software Pipelining Advantages
Traditionally performed through loop unrolling
Less code than loop unrolling, with increased regularity
Smaller code means fewer cache misses
Especially useful for integer code with a small number of loop iterations
Software Pipelining Disadvantages
Requires many additional instructions to manage the loop
Without hardware support, the overhead may greatly increase code size
Typically used only in special technical computing applications
IA-64 Features Supporting Software Pipelining
Full predication
Circular buffer of general and FP registers
Loop branches decrement the RRBs (register rename bases)
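The circular register buffer and the rename base can be sketched as follows. The model assumes a logical register number maps to a physical slot by adding the rename base modulo the buffer size, and that each loop branch decrements the base by one. The class name, the buffer size of 8 and zero-based register numbering are hypothetical simplifications.

```python
class RotatingRegisters:
    """Toy circular register buffer addressed through a register rename base."""

    def __init__(self, size):
        self.size = size
        self.physical = [0] * size
        self.rrb = 0                      # register rename base

    def _index(self, logical):
        return (logical + self.rrb) % self.size

    def read(self, logical):
        return self.physical[self._index(logical)]

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def loop_branch(self):
        """Model a loop branch: decrement the rename base, wrapping around."""
        self.rrb = (self.rrb - 1) % self.size
```

After one loop branch, the value written to logical register 0 is visible as logical register 1, and after two as register 2. Each iteration can therefore write the same register name without overwriting values still in flight from earlier iterations, which removes the need for explicit copy instructions in the pipelined kernel.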
Summary
Predication removes branches
– Parallel compares increase parallelism
– Benefits complex control flow: large databases
Speculation reduces memory latency impact
– IA-64 removes recovery from the critical path
– Benefits applications with poor cache locality: server applications, OS
Software pipelining support with minimal overhead enables broad usage
– Performance for small integer loops with unknown trip counts as well as monster FP loops
Reference
M. S. Schlansker, "EPIC: Explicitly Parallel Instruction Computing", Computer, vol. ?, no. ?, pp. 37-45, 2000.
Jerry Huck et al., "Introducing the IA-64 Architecture", Sept.-Oct. 2000, pp. 12-23.
Carole Dulong, "The IA-64 Architecture at Work", Computing Practices.