EPIC Architecture (Explicitly Parallel Instruction Computing)


Yangyang Wen

CDA5160: Advanced Computer Architecture I
University of Central Florida

Outline

What is EPIC?
EPIC Philosophy
Architectural Features Supporting EPIC
Intel’s IA-64 Architectural Features
IA-64’s Key Technologies
Summary and Reference

Traditional Architectures: Limited Parallelism

[Figure labels: Original Source Code; Compiler; Sequential Machine Code; parallelized code; Hardware; multiple functional units]

Execution units are available, but they are not used efficiently.
Today’s processors are often 60% idle.

EPIC Architecture: Explicit Parallelism

[Figure labels: Original Source Code; Compile; Compiler; Better Parallel Machine Code; Hardware; multiple functional units]

Increases parallel execution
EPIC compiler views a wider scope
Gets more efficient use of execution resources

What is EPIC?

EPIC stands for Explicitly Parallel Instruction Computing. An EPIC architecture provides features that let the compiler take a proactive role in enhancing instruction-level parallelism (ILP) without unacceptable hardware complexity.

EPIC’s Performance

[Figure omitted]

EPIC Design Philosophy

EPIC gives the compiler advanced features for enhancing ILP, such as predication and speculation.

The compiler designs the plan of execution (POE) at compile time and communicates the POE to the hardware.

EPIC requires massive hardware resources for parallel execution.

Introducing IA-64

IA-64 comes from Intel and is Intel’s first 64-bit architecture.

It is the first commercially available instance of an EPIC ISA.

It is the first architecture to bring these ILP features to general-purpose microprocessors.

IA-64’s Architectural Basics

Explicit parallelism
Enhanced ILP
Compiler-oriented
Extremely large physical memory
A huge virtual address space for applications
64-bit computation
Extremely large register files

IA-64’s Key Technologies

Instruction bundling
Predication
Control speculation
Data speculation
Software pipelining

Instruction Bundling

Uses a form of VLIW architecture
Three instructions are combined into a 128-bit bundle
Parallel instructions are executed in groups
Template bits decode and route instructions and mark the end of groups of parallel instructions

128-bit bundle layout (bit 127 on the left, bit 0 on the right):

    | Instruction 2 | Instruction 1 | Instruction 0 | Template |
    |    41 bits    |    41 bits    |    41 bits    |  5 bits  |
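The bundle layout above can be illustrated with a small packing sketch. This is only a model of the bit layout (a 5-bit template in the low bits followed by three 41-bit slots); the function names `pack_bundle` and `unpack_bundle` are hypothetical, and the slot contents are treated as opaque integers rather than real IA-64 encodings.

```python
# Simplified model of a 128-bit bundle: a 5-bit template in the low bits,
# followed by three 41-bit instruction slots. Names are illustrative only.
TEMPLATE_BITS = 5
SLOT_BITS = 41
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction slot must fit in 41 bits")
        # Slot 0 sits just above the template, slot 2 at the high end.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, slot0, slot1, slot2)."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    )
    return (template,) + slots
```

The widths add up exactly: 5 + 3 x 41 = 128 bits, which is why the three slots and the template fill the bundle with no padding.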

ILP Bottlenecks

Branches
– Processors deal with branches by using prediction.
– Branch mispredictions cause a 20% to 30% loss in processor performance.

Memory latency
– Latency is the time it takes to get data from memory. The longer it takes to access memory for code and data, the longer the CPU sits idle.
– For memory latency, loads are the big problem, not stores.

Predication

Branching is a major cause of lost performance.

Source code:

    If A>B
        S+=A
    else
        S+=B
    end if
    *P=S

(a) Traditional prediction: the processor predicts the outcome of If A>B and speculatively executes S+=A. If the prediction is wrong, it throws away S+=A and executes S+=B.

(b) IA-64 predication: both S+=A and S+=B are executed under predicates, without a branch.

EPIC Predication Process

The compiler identifies a branch candidate.
The compiler finds which instructions to execute in parallel.
Instructions are marked with an ID.
Instructions are packed into bundles.
The processor executes both paths in parallel.
The processor checks the predicates and stores the correct results.
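A minimal sketch of the predication process above, applied to the If A>B example. It is a software model, not IA-64 code: the predicate names `p1` and `p2` and the function names are assumptions made for illustration. Both arms are computed from the same register snapshot (standing in for parallel execution), and only the arm whose predicate is true commits its result.

```python
def branchy(a, b, s):
    """Reference version with an ordinary branch."""
    if a > b:
        s += a
    else:
        s += b
    return s


def predicated(a, b, s):
    """If-converted version: both arms execute, predicates select the result."""
    regs = {"A": a, "B": b, "S": s}
    # The compare sets a predicate and its complement.
    preds = {"p1": a > b, "p2": not (a > b)}
    # Each instruction is (guarding predicate, destination, operation).
    program = [
        ("p1", "S", lambda r: r["S"] + r["A"]),
        ("p2", "S", lambda r: r["S"] + r["B"]),
    ]
    # Both arms read the same snapshot, modelling parallel execution.
    snapshot = dict(regs)
    results = [(pred, dest, op(snapshot)) for pred, dest, op in program]
    # Only results whose predicate is true are written back.
    for pred, dest, value in results:
        if preds[pred]:
            regs[dest] = value
    return regs["S"]
```

Because the two predicates are complementary, exactly one of the two writes to S is committed, so the result matches the branching version without any control transfer.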

Predication Benefits

Reduce branches

Reduce misprediction penalties

Reduce critical paths

Control Speculation

Memory latency is a major performance bottleneck.

Traditional Architectures:

    instr 1
    instr 2
    . . .
    br
    ---- Barrier ----
    Load a[ ]
    use

Elevating the load above a branch is not possible.

IA-64 Architectures:

    ld.s r8=a[ ]
    instr 1
    instr 2
    br
    chk.s r8
    use

Allows elevation of the load, even above a branch.

Introducing the Token Bit

IA-64:

    ld.s r8=a[ ]    ; Exception detection
    instr 1
    instr 2
    br
    chk.s r8        ; Exception delivery
    use

The exception is propagated from ld.s to chk.s.

When the load is elevated, ld.s performs exception detection.
If the load address is valid, execution proceeds normally.
If the load address is invalid, the processor sets the token bit on the target register and defers the exception. If the branch leaves this path, the exception is never delivered.
If execution reaches chk.s and chk.s detects the token bit, it jumps to fix-up code, which re-executes the load.
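The ld.s / chk.s behaviour described above can be sketched as a small software model. Everything here is an illustrative assumption: memory is a dictionary, an invalid address is a missing key, the token bit is represented by a sentinel object named `NAT`, and the function names are hypothetical.

```python
NAT = object()  # hypothetical sentinel standing in for the token bit


def ld_s(memory, address):
    """Speculative load: on a bad address, return the token instead of faulting."""
    if address in memory:
        return memory[address]
    return NAT


def chk_s(value, recovery):
    """Check: run the fix-up code only if the token is set."""
    if value is NAT:
        return recovery()
    return value


def guarded_use(memory, address, take_branch):
    r8 = ld_s(memory, address)   # load hoisted above the branch
    if take_branch:              # br: this path never uses r8
        return None
    # Fix-up code re-executes the load non-speculatively, so a genuine
    # fault is delivered here rather than at the hoisted load.
    r8 = chk_s(r8, lambda: memory[address])
    return r8 + 1                # use
```

In this model a bad address only raises an error (a `KeyError`) when the path that actually uses the value is taken, which mirrors deferring the exception from detection at ld.s to delivery at chk.s.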

Data Speculation

Traditional Architectures:

    instr 1
    instr 2
    . . .
    store
    ---- Barrier ----
    load
    use

The load cannot be elevated above the store, which prevents the instructions from being reordered.

IA-64:

    ld.a
    instr 1
    instr 2
    store
    ld.c (or chk.a)    ; checked against the ALAT
    use

Allows the compiler to elevate the load even when it isn’t sure whether the memory references overlap.

Advanced Load Address Table: ALAT

Each ALAT entry records a register number and an address:

    reg #   Address
    reg #   Address
    reg #   Address
    ...

    ld.a reg# = ...    ; inserts an entry
    store              ; removes overlapping entries
    chk.a reg#         ; looks up the entry

When a ld.a is elevated, it inserts an entry into the ALAT.
On a store, entries with overlapping addresses are removed from the ALAT.
On chk.a, if no entry is found, there was a conflict, and execution jumps to fix-up code to re-execute the code.
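The three ALAT rules above can be sketched as a small table keyed by register. This is a simplified model under stated assumptions: the class and method names are hypothetical, memory is a dictionary, and "overlap" is reduced to an exact address match rather than a comparison of byte ranges.

```python
class ALAT:
    """Toy Advanced Load Address Table: maps a register name to an address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, reg, address, memory):
        """Advanced load: record the address, then perform the load."""
        self.entries[reg] = address
        return memory[address]

    def store(self, address, value, memory):
        """Store: write memory and drop every entry that matches the address."""
        memory[address] = value
        for reg in [r for r, a in self.entries.items() if a == address]:
            del self.entries[reg]

    def chk_a(self, reg):
        """Check: True if the entry survived, i.e. no conflicting store occurred."""
        return reg in self.entries


def speculative_sequence(memory, load_addr, store_addr, store_value):
    alat = ALAT()
    r8 = alat.ld_a("r8", load_addr, memory)      # load hoisted above the store
    alat.store(store_addr, store_value, memory)  # may or may not alias the load
    if not alat.chk_a("r8"):                     # conflict: fix-up redoes the load
        r8 = memory[load_addr]
    return r8
```

When the store goes to a different address, the early load value is kept; when it aliases the load, the missing entry triggers the fix-up path and the freshly stored value is read instead.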

Speculation Benefits

Reduces the impact of memory latency
Study demonstrates a performance improvement of 80% when combined with predication
Greatest improvement to code with many cache accesses
Scheduling flexibility enables new levels of performance headroom

Software Pipelining

[Figure: two loop execution diagrams compared side by side ("vs.")]

• Overlap the execution of different loop iterations
• Get more iterations in the same amount of time

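Software Pipelining Example

Source loop:

    for (I = 0; I < 1000; I++)
        x[I] = x[I] + s;

Original loop:

    Loop: Ld   f0,0(r1)     ; load x[I]
          Add  f0,f0,f1     ; add s (held in f1)
          Sd   f0,0(r1)     ; store x[I]
          Add  r1,r1,8      ; advance to the next element
          Subi r2,r2,1      ; decrement the loop counter
          Bnez r2,loop

Software-pipelined loop (prologue and epilogue not shown; offsets assume the 8-byte stride of the original loop):

    Loop: Sd   f2,-8(r1)    ; store the result for iteration I-1
          Add  f2,f0,f1     ; add for iteration I
          Ld   f0,8(r1)     ; load for iteration I+1
          Add  r1,r1,8      ; advance to the next element
          Subi r2,r2,1      ; decrement the loop counter
          Bnez r2,loop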
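The loop transformation above can be checked with a small model. This is a sketch, not generated compiler output: the function names are hypothetical, and the variables `f0` and `f2` only mimic the roles of the registers in the example. It spells out the prologue and epilogue that the pipelined loop needs around its kernel.

```python
def add_scalar_simple(x, s):
    """Straightforward loop: load, add, store for each element in turn."""
    out = list(x)
    for i in range(len(out)):
        loaded = out[i]
        summed = loaded + s
        out[i] = summed
    return out


def add_scalar_pipelined(x, s):
    """Software-pipelined loop: each kernel pass mixes three iterations."""
    out = list(x)
    n = len(out)
    if n < 2:
        return add_scalar_simple(out, s)
    # Prologue: start the first two iterations to fill the pipeline.
    f0 = out[0]        # load for iteration 0
    f2 = f0 + s        # add for iteration 0
    f0 = out[1]        # load for iteration 1
    # Kernel: store for i-2, add for i-1, load for i.
    for i in range(2, n):
        out[i - 2] = f2
        f2 = f0 + s
        f0 = out[i]
    # Epilogue: drain the two iterations still in flight.
    out[n - 2] = f2
    out[n - 1] = f0 + s
    return out
```

Each kernel pass contains one store, one add, and one load that belong to three different iterations, so none of them depends on another within the same pass.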

Software Pipelining Advantages

Traditionally performed through loop unrolling

Less code than loop unrolling, with increased regularity

Smaller code means fewer cache misses

Especially useful for integer code with a small number of loop iterations

Software Pipelining Disadvantages

Requires many additional instructions to manage the loop

Without hardware support, the overhead may greatly increase code size

Typically only used in special technical computing applications

IA-64 Features Supporting Software Pipelining

Full predication

Circular Buffer of General and FP Registers

Loop Branches Decrement RRBs (register rename bases)
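The interaction between the circular register buffer and the register rename base can be sketched as follows. This is a simplified model: the class name `RotatingRegisters` is hypothetical, and logical register numbers are 0-based offsets within the rotating region rather than real IA-64 register names.

```python
class RotatingRegisters:
    """Toy circular register buffer addressed through a rename base (RRB)."""

    def __init__(self, size):
        self.physical = [0] * size
        self.rrb = 0

    def _index(self, logical):
        # A logical register number is offset by the rename base, modulo size.
        return (logical + self.rrb) % len(self.physical)

    def read(self, logical):
        return self.physical[self._index(logical)]

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def loop_branch(self):
        """Model of a loop branch: decrement the rename base."""
        self.rrb = (self.rrb - 1) % len(self.physical)
```

After each loop branch, a value written to logical register 0 is visible as logical register 1, then 2, and so on. This lets each stage of a software-pipelined loop name its input by a fixed register number without copy instructions.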

Summary

Predication removes branches
– Parallel compares increase parallelism
– Benefits complex control flow: large databases

Speculation reduces memory latency impact
– IA-64 removes recovery from critical path
– Benefits applications with poor cache locality: server applications, OS

S/W pipelining support with minimal overhead enables broad usage
– Performance for small integer loops with unknown trip counts as well as monster FP loops

Reference

M. S. Schlansker, "EPIC: Explicitly Parallel Instruction Computing", Computer, vol. ?, no. ?, pp. 37-45, 2000.

Jerry Huck et al., "Introducing the IA-64 Architecture", Sept.-Oct. 2000, pp. 12-23.

Carole Dulong, "The IA-64 Architecture at Work", Computing Practices.
