
The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms.

The processor employs EPIC (explicitly parallel instruction computing) design concepts for a tighter coupling between hardware and software. In this design style the hardware-software interface lets the software exploit all available compilation time information and efficiently deliver this information to the hardware. It addresses several fundamental performance bottlenecks in modern computers, such as memory latency, memory address disambiguation, and control flow dependencies.

EPIC constructs provide powerful architectural semantics and enable the software to make global optimizations across a large scheduling scope, thereby exposing available instruction-level parallelism (ILP) to the hardware. The hardware takes advantage of this enhanced ILP, providing abundant execution resources. Additionally, it focuses on dynamic runtime optimizations to enable the compiled code schedule to flow through at high throughput. This strategy increases the synergy between hardware and software, and leads to higher overall performance.

The processor provides a six-wide and 10-stage deep pipeline, running at 800 MHz on a 0.18-micron process. This combines both abundant resources to exploit ILP and high frequency for minimizing the latency of each instruction. The resources consist of four integer units, four multimedia units, two load/store units, three branch units, two extended-precision floating-point units, and two additional single-precision floating-point units (FPUs). The hardware employs dynamic prefetch, branch prediction, nonblocking caches, and a register scoreboard to optimize for compilation time nondeterminism. Three levels of on-package cache minimize overall memory latency. This includes a 4-Mbyte level-3 (L3) cache, accessed at core speed, providing over 12 Gbytes/s of data bandwidth. The system bus provides glueless multiprocessor support for up to four-processor systems and can be used as an effective building block for very large systems. The advanced FPU delivers over 3 Gflops of numeric capability (6 Gflops for single precision). The balanced core and memory subsystems provide

Harsh Sharangpani
Ken Arora

Intel

THE ITANIUM PROCESSOR EMPLOYS THE EPIC DESIGN STYLE TO EXPLOIT INSTRUCTION-LEVEL PARALLELISM. ITS HARDWARE AND SOFTWARE WORK IN CONCERT TO DELIVER HIGHER PERFORMANCE THROUGH A SIMPLER, MORE EFFICIENT DESIGN.

0272-1732/00/$10.00 © 2000 IEEE

ITANIUM PROCESSOR MICROARCHITECTURE


high performance for a wide range of applications ranging from commercial workloads to high-performance technical computing.

In contrast to traditional processors, the machine's core is characterized by hardware support for the key ISA constructs that embody the EPIC design style.1,2 This includes support for speculation, predication, explicit parallelism, register stacking and rotation, branch hints, and memory hints. In this article we describe the hardware support for these novel constructs, assuming a basic level of familiarity with the IA-64 architecture (see the "IA-64 Architecture Overview" article in this issue).

EPIC hardware

The Itanium processor introduces a number of unique microarchitectural features to support the EPIC design style.2 These features focus on the following areas:

• supplying plentiful fast, parallel, and pipelined execution resources, exposed directly to the software;

• supporting the bookkeeping and control for new EPIC constructs such as predication and speculation; and

• providing dynamic support to handle events that are unpredictable at compilation time so that the compiled code flows through the pipeline at high throughput.

Figure 1 presents a conceptual view of the EPIC hardware. It illustrates how the various EPIC instruction set features map onto the micropipelines in the hardware.

The core of the machine is the wide execution engine, designed to provide the computational bandwidth needed by ILP-rich EPIC code that abounds in speculative and predicated operations.

The execution control is augmented with a bookkeeping structure called the advanced load address table (ALAT) to support data speculation and, with hardware, to manage the deferral of exceptions on speculative execution. The hardware control for speculation is quite simple: adding an extra bit to the data path supports deferred exception tokens. The controls for both the register scoreboard and bypass network are enhanced to accommodate predicated execution.
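The extra data-path bit referred to above is the IA-64 NaT ("Not a Thing") deferred-exception token. As a rough functional sketch of control speculation with deferred exceptions, not the hardware implementation, the following Python model (register class and function names are ours) shows a faulting speculative load poisoning its destination, consumers propagating the token, and a chk.s-style check diverting to recovery only if the speculative value is actually used:

```python
# Illustrative model (not Itanium RTL): each register carries an extra
# NaT bit; a faulting speculative load defers the exception by setting
# NaT, ALU consumers propagate it, and a chk.s-style check raises or
# recovers only when the speculative result is actually consumed.

class Reg:
    def __init__(self, value=0, nat=False):
        self.value = value
        self.nat = nat  # deferred-exception token

def spec_load(address, memory):
    """ld.s-style speculative load: defer faults instead of raising."""
    if address not in memory:      # access that would have faulted
        return Reg(0, nat=True)    # poison the destination instead
    return Reg(memory[address])

def add(a, b):
    """ALU ops propagate the NaT bit through the data path."""
    return Reg(a.value + b.value, a.nat or b.nat)

def check(reg, recovery):
    """chk.s-style check: branch to recovery code if NaT is set."""
    return recovery() if reg.nat else reg.value

memory = {0x100: 42}
r1 = add(spec_load(0x100, memory), Reg(1))   # valid speculation
r2 = add(spec_load(0x999, memory), Reg(1))   # deferred fault
print(check(r1, lambda: -1))  # 43
print(check(r2, lambda: -1))  # -1 (recovery path taken)
```

The point of the design is visible in the model: mis-speculation costs nothing unless the value is used, so the compiler can hoist loads above branches freely.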

Operands are fed into this wide execution core from the 128-entry integer and floating-point register files. The register file addressing undergoes register remapping, in support of

SEPTEMBER–OCTOBER 2000

Figure 1. Conceptual view of EPIC hardware. GR: general register file; FR: floating-point register file. Compiler-programmed features: branch hints; register stack and rotation; data and control speculation; memory hints; predication; explicit parallelism and instruction templates. Hardware features: fetch (instruction cache, branch predictors); register handling (128 GR, 128 FR, register remap, stack engine); parallel resources (4 integer and 4 MMX units, 2 load/store units, 3 branch units, 32-entry ALAT, 2 + 2 FMACs); issue control (fast, simple 6-issue); bypasses and dependencies; speculation deferral management; memory subsystem (three levels of cache: L1, L2, L3).


register stacking and rotation. The register management hardware is enhanced with a control engine called the register stack engine that is responsible for saving and restoring registers that overflow or underflow the register stack.

An instruction dispersal network feeds the execution pipeline. This network uses explicit parallelism and instruction templates to efficiently issue fetched instructions onto the correct instruction ports, both eliminating complex dependency detection logic and streamlining the instruction routing network.

A decoupled fetch engine exploits advanced prefetch and branch hints to ensure that the fetched instructions will come from the correct path and that they will arrive early enough to avoid cache miss penalties. Finally, memory locality hints are employed by the cache subsystem to improve the cache allocation and replacement policies, resulting in a better use of the three levels of on-package cache and all associated memory bandwidth.

EPIC features allow software to more effectively communicate high-level semantic information to the hardware, thereby eliminating redundant or inefficient hardware and leading to a more effective design. Notably absent from this machine are complex hardware structures seen in dynamically scheduled contemporary processors. Reservation stations, reorder buffers, and memory ordering buffers are all replaced by simpler hardware for speculation. Register alias tables used for register renaming are replaced with the simpler and semantically richer register-remapping hardware. Expensive register dependency-detection logic is eliminated via the explicit parallelism directives that are precomputed by the software.

Using EPIC constructs, the compiler optimizes the code schedule across a very large scope. This scope of optimization far exceeds the limited hardware window of a few hundred instructions seen on contemporary dynamically scheduled processors. The result is an EPIC machine in which the close collaboration of hardware and software enables high performance with a greater degree of overall efficiency.

Overview of the EPIC core

The engineering team designed the EPIC core of the Itanium processor to be a parallel, deep, and dynamic pipeline that enables ILP-rich compiled code to flow through at high throughput. At the highest level, three important directions characterize the core pipeline:

• wide EPIC hardware delivering a new level of parallelism (six instructions/clock),

• deep pipelining (10 stages) enabling high frequency of operation, and

• dynamic hardware for runtime optimization and handling of compilation time indeterminacies.

New level of parallel execution

The processor provides hardware for these execution units: four integer ALUs, four multimedia ALUs, two extended-precision floating-point units, two additional single-precision floating-point units, two load/store units, and three branch units. The machine can fetch, issue, execute, and retire six instructions each clock cycle. Given the powerful semantics of the IA-64 instructions, this expands to many more operations being executed each cycle. The "Machine resources per port" sidebar enumerates the full processor execution resources.

Figure 2 illustrates two examples demonstrating the level of parallel operation supported for various workloads. For enterprise and commercial codes, the MII/MBB template combination in a bundle pair provides six instructions or eight parallel operations per

IEEE MICRO

Figure 2. Two examples illustrating supported parallelism. SP: single precision; DP: double precision. An MII/MBB bundle pair (six instructions) provides 8 parallel ops/clock for enterprise and Internet applications: 2 loads and 2 ALU ops (postincrement), 2 ALU ops, and 2 branch instructions. An MFI/MFI bundle pair (six instructions) provides 12 parallel ops/clock for scientific computing and 20 parallel ops/clock for digital content creation: 2 ALU ops, loads of 4 DP (8 SP) operands via 2 ldf-pair and 2 ALU ops (postincrement), and 4 DP flops (8 SP flops).


clock (two load/store, two general-purpose ALU operations, two postincrement ALU operations, and two branch instructions). Alternatively, an MIB/MIB pair allows the same mix of operations, but with one branch hint and one branch operation, instead of two branch operations.

For scientific code, the use of the MFI template in each bundle enables 12 parallel operations per clock (loading four double-precision operands to the registers and executing four double-precision floating-point, two integer ALU, and two postincrement ALU operations). For digital content creation codes that use single-precision floating point, the SIMD (single instruction, multiple data) features in the machine effectively enable up to 20 parallel operations per clock (loading eight single-precision operands, executing eight single-precision floating-point, two integer ALU, and two postincrementing ALU operations).

Deep pipelining (10 stages)

The cycle time and the core pipeline are balanced and optimized for the sequential execution in integer scalar codes, by minimizing the latency of the most frequent operations, thus reducing dead time in the overall computation. The high frequency (800 MHz) and careful pipelining enable independent operations to flow through this pipeline at high throughput, thus also optimizing vector numeric and multimedia computation. The cycle time accommodates the following key critical paths:

• single-cycle ALU, globally bypassed across four ALUs and two loads;

• two cycles of latency for load data returned from a dual-ported level-1 (L1) cache of 16 Kbytes; and

• scoreboard and dependency control to stall the machine on an unresolved register dependency.

The feature set, including aggressive branch prediction and three levels of on-package cache, complies with the high-frequency target and the degree of pipelining. The pipeline design employs robust, scalable circuit design techniques. We consciously attempted to manage interconnect lengths and pipeline away secondary paths.

Figure 3 illustrates the 10-stage core pipeline. The bold line in the middle of the core pipeline indicates a point of decoupling in the pipeline. The pipeline accommodates the decoupling buffer in the ROT (instruction rotation) stage, dedicated register-remapping hardware in the REN (register rename) stage, and pipelined access of the large register file across the WLD (word line decode) and REG (register read) stages. The DET (exception detection) stage accommodates delayed branch execution as well as memory exception management and speculation support.

Dynamic hardware for runtime optimization

While the processor relies on the compiler to optimize the code schedule based upon the deterministic latencies, the processor provides special support to dynamically optimize for several compilation time indeterminacies. These dynamic features ensure that the compiled code flows through the pipeline at high throughput. To tolerate additional latency on data cache misses, the data caches are nonblocking, a register scoreboard enforces dependencies, and the machine stalls only on encountering the use of unavailable data.

We focused on reducing sensitivity to branch and fetch latency. The machine employs hardware and software techniques beyond those used in conventional processors and provides aggressive instruction prefetch and advanced branch prediction through a hierarchy of

Figure 3. Itanium processor core pipeline (stages IPG, FET, ROT, EXP, REN, WLD, REG, EXE, DET, WRB). Front end: prefetch/fetch of 6 instructions/clock, hierarchy of branch predictors, decoupling buffer. Instruction delivery: dispersal of 6 instructions onto 9 issue ports, register remapping, register save engine. Operand delivery: register file read and bypass, register scoreboard, predicated dependencies. Execution core: 4 single-cycle ALUs, 2 load/stores, advanced load control, predicate delivery and branch, NaT/exceptions/retirement.


branch prediction structures. A decoupling buffer allows the front end to speculatively fetch ahead, further hiding instruction cache latency and branch prediction latency.

Block diagram

Figure 4 provides the block diagram of the Itanium processor. Figure 5 provides a die plot of the silicon database. A few top-level metal layers have been stripped off to create a suitable view.

Details of the core pipeline

The following describes details of the core processor microarchitecture.

Decoupled software-directed front end

Given the high execution rate of the processor (six instructions per clock), an aggressive front end is needed to keep the machine effectively fed, especially in the presence of disruptions due to branches and cache misses. The machine's front end is decoupled from the back end.

Acting in conjunction with sophisticated branch prediction and correction hardware, the machine speculatively fetches instructions from a moderate-size, pipelined instruction cache into a decoupling buffer. A hierarchy of branch predictors, aided by branch hints, provides up to four progressively improving instruction pointer resteers. Software-initiated prefetch probes for future misses in the instruction cache and then prefetches such target code from the level-2 (L2) cache into a streaming buffer and eventually into the


Figure 4. Itanium processor block diagram. Major blocks: L1 instruction cache with ITLB and fetch/prefetch engine; branch prediction; an 8-bundle decoupling buffer; IA-32 decode and control; dispersal onto the B, B, B, M, M, I, I, F, F issue ports; register stack engine/remapping; branch and predicate registers; 128 integer registers; 128 floating-point registers; branch units; integer and MM units; dual-port L1 data cache; ALAT; SIMD FMAC floating-point units; scoreboard, predicates, NaTs, and exceptions; 96-Kbyte L2 cache; 4-Mbyte L3 cache; bus controller.


instruction cache. Figure 6 illustrates the front-end microarchitecture.

Speculative fetches. The 16-Kbyte, four-way set-associative instruction cache is fully pipelined and can deliver 32 bytes of code (two instruction bundles or six instructions) every clock. The cache is supported by a single-cycle, 64-entry instruction translation look-aside buffer (TLB) that is fully associative and backed up by an on-chip hardware page walker.

The fetched code is fed into a decoupling buffer that can hold eight bundles of code. As a result of this buffer, the machine's front end can continue to fetch instructions into the buffer even when the back end stalls. Conversely, the buffer can continue to feed the back end even when the front end is disrupted by fetch bubbles due to branches or instruction cache misses.

Hierarchy of branch predictors. The processor employs a hierarchy of branch prediction structures to deliver high-accuracy and low-penalty predictions across a wide spectrum of workloads. Note that if a branch misprediction led to a full pipeline flush, there would be nine cycles of pipeline bubbles before the pipeline is full again. This would mean a heavy performance loss. Hence, we've placed significant emphasis on boosting the overall branch prediction rate as well as reducing the branch prediction and correction latency.

The branch prediction hardware is assisted by branch hint directives provided by the compiler (in the form of explicit branch predict, or BRP, instructions as well as hint specifiers on branch instructions). The directives provide branch target addresses, static hints on branch direction, as well as indications on when to use dynamic prediction. These directives are programmed into the branch prediction structures and used in conjunction with dynamic prediction schemes. The machine

Figure 5. Die plot of the silicon database (core processor die and 4 × 1-Mbyte L3 cache).

Figure 6. Processor front end (IPG, FET, ROT, and EXP stages): IP multiplexer; target address registers; adaptive multiway two-level predictor; branch target address cache; return stack buffer; loop exit corrector; branch address calculation 1 and 2; instruction cache, ITLB, and streaming buffers; decoupling buffer.


provides up to four progressive predictions and corrections to the fetch pointer, greatly reducing the likelihood of a full-pipeline flush due to a mispredicted branch.

• Resteer 1: Single-cycle predictor. A special set of four branch prediction registers (called target address registers, or TARs) provides single-cycle turnaround on certain branches (for example, loop branches in numeric code), operating under tight compiler control. The compiler programs these registers using BRP hints, distinguishing these hints with a special "importance" bit designator and indicating that these directives must get allocated into this small structure. When the instruction pointer of the candidate branch hits in these registers, the branch is predicted taken, and these registers provide the target address for the resteer. On such taken branches no bubbles appear in the execution schedule due to branching.

• Resteer 2: Adaptive multiway and return predictors. For scalar codes, the processor employs a dynamic, adaptive, two-level prediction scheme3,4 to achieve well over 90% prediction rates on branch direction. The branch prediction table (BPT) contains 512 entries (128 sets × 4 ways). Each entry, selected by the branch address, tracks the four most recent occurrences of that branch. This 4-bit value then indexes into one of 128 pattern tables (one per set). The 16 entries in each pattern table use a 2-bit, saturating, up-down counter to predict branch direction.

The branch prediction table structure is additionally enhanced for multiway branches with a 64-entry, multiway branch prediction table (MBPT) that employs a similar algorithm but keeps three history registers per bundle entry. A find-first-taken selection provides the first taken branch indication for the multiway branch bundle. Multiway branches are expected to be common in EPIC code, where multiple basic blocks are expected to collapse after use of speculation and predication.

Target addresses for this branch resteer are provided by a 64-entry target address cache (TAC). This structure is updated by branch hints (using BRP and move-BR hint instructions) and also managed dynamically. Having the compiler program the structure with the upcoming footprint of the program is an advantage and enables a small, 64-entry structure to be effective even on large commercial workloads, saving die area and implementation complexity. The BPT and MBPT cause a front-end resteer only if the target address for the resteer is present in the target address cache. In the case of misses in the BPT and MBPT, a hit in the target address cache also provides a branch direction prediction of taken.

A return stack buffer (RSB) provides predictions for return instructions. This buffer contains eight entries and stores return addresses along with corresponding register stack frame information.

• Resteers 3 and 4: Branch address calculation and correction. Once branch instruction opcodes are available (ROT stage), it's possible to apply a correction to predictions made earlier. The BAC1 stage applies a correction for the exit condition on modulo-scheduled loops through a special "perfect-loop-exit-predictor" structure that keeps track of the loop count extracted during the loop initialization code. Thus, loop exits should never see a branch misprediction in the back end. Additionally, in case of misses in the earlier prediction structures, BAC1 extracts static prediction information and addresses from branch instructions in the rightmost slot of a bundle and uses these to provide a correction. Since most templates will place a branch in the rightmost slot, BAC1 should handle most branches. BAC2 applies a more general correction for branches located in any slot.
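The two-level scheme of resteer 2 can be sketched functionally. This Python model uses the parameters from the text (128 sets × 4 ways of 4-bit per-branch histories, and one 16-entry pattern table of 2-bit saturating up-down counters per set); the index/hash choices and the initial counter values are illustrative assumptions, not the actual Itanium implementation:

```python
# Sketch of a two-level adaptive branch predictor with the sizes given
# in the text. First level: a 4-bit history of each branch's recent
# outcomes. Second level: that history selects one of 16 two-bit
# saturating counters in the set's pattern table.

class TwoLevelPredictor:
    def __init__(self, sets=128, ways=4, hist_bits=4):
        self.sets, self.ways, self.hist_bits = sets, ways, hist_bits
        # per-branch history: [set][way] -> last `hist_bits` outcomes
        self.history = [[0] * ways for _ in range(sets)]
        # one pattern table per set; counters start weakly taken (2)
        self.pattern = [[2] * (1 << hist_bits) for _ in range(sets)]

    def _index(self, pc):
        # simplified direct indexing; real hardware tag/hash differs
        return (pc >> 4) % self.sets, pc % self.ways

    def predict(self, pc):
        s, w = self._index(pc)
        return self.pattern[s][self.history[s][w]] >= 2  # predict taken?

    def update(self, pc, taken):
        s, w = self._index(pc)
        h = self.history[s][w]
        ctr = self.pattern[s][h]
        self.pattern[s][h] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << self.hist_bits) - 1
        self.history[s][w] = ((h << 1) | int(taken)) & mask

bp = TwoLevelPredictor()
for _ in range(8):            # train on an always-taken loop branch
    bp.update(0x4000, True)
print(bp.predict(0x4000))     # True
```

The two-bit counters give hysteresis (one anomalous outcome does not flip the prediction), while the per-branch history lets the predictor learn short repeating patterns such as loop exits.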

Software-initiated prefetch. Another key element of the front end is its software-initiated instruction prefetch. Prefetch is triggered by prefetch hints (encoded in the BRP instructions as well as in actual branch instructions) as they pass through the ROT stage. Instructions get prefetched from the L2 cache into an instruction-streaming buffer (ISB) containing eight 32-byte entries. Support exists to prefetch either a short burst of 64 bytes of



code (typically, a basic block residing in up to four bundles) or a long sequential instruction stream. Short burst prefetch is initiated by a BRP instruction hoisted well above the actual branch. For longer code streams, the sequential streaming ("many") hint from the branch instruction triggers a continuous stream of additional prefetch requests until a taken branch is encountered. The instruction cache filters prefetch requests. The cache tags and the TLB have been enhanced with an additional port to check whether an address will lead to a miss. Such requests are sent to the L2 cache.

The compiler can improve overall fetch performance by aggressive issue and hoisting of BRP instructions, and by issuing sequential prefetch hints on the branch instruction when branching to long sequential codes. To fully hide the latency of returns from the L2 cache, BRP instructions that initiate prefetch should be hoisted 12 fetch cycles ahead of the branch. Hoisting by five cycles breaks even with no prefetch at all. Every hoisted cycle above five cycles has the potential of shaving one fetch bubble. Although this kind of hoisting of BRP instructions is a tall order, it does provide a mechanism for the compiler to eliminate instruction fetch bubbles.
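The hoisting guidance above reduces to simple arithmetic. This toy Python function encodes our reading of the text (break-even at five cycles of hoisting, full latency hiding at twelve, one bubble shaved per extra cycle in between); it is an illustrative model, not a formula from the article:

```python
# Toy model of BRP-hoisting benefit: hoisting the prefetch hint 12
# fetch cycles ahead of the branch fully hides the L2 return latency,
# 5 cycles is break-even versus no prefetch, and each cycle beyond 5
# removes one fetch bubble. The clamp values come from the text; the
# linear interpolation between them is our assumption.

def fetch_bubbles_saved(hoist_cycles, break_even=5, full_hide=12):
    """Fetch bubbles removed relative to issuing no prefetch at all."""
    return max(0, min(hoist_cycles, full_hide) - break_even)

for h in (3, 5, 8, 12, 20):
    print(h, fetch_bubbles_saved(h))   # 0, 0, 3, 7, 7 bubbles saved
```

The model makes the compiler's trade-off concrete: hoisting below five cycles buys nothing, and hoisting beyond twelve buys nothing more.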

Efficient instruction and operand delivery

After instructions are fetched in the front end, they move into the middle pipeline that disperses instructions, implements the architectural renaming of registers, and delivers operands to the wide parallel hardware. The hardware resources in the back end of the machine are organized around nine issue ports. The instruction and operand delivery hardware maps the six incoming instructions onto the nine issue ports and remaps the virtual register identifiers specified in the source code onto physical registers used to access the register file. It then provides the source data to the execution core. The dispersal and renaming hardware exploits high-level semantic information provided by the IA-64 software, efficiently enabling greater ILP and reduced instruction path length.

Explicit parallelism directives. The instruction dispersal mechanism disperses instructions presented by the decoupling buffer to the processor's issue ports. The processor has a total of nine issue ports capable of issuing up to two


Machine resources per port

Tables A–C describe the vocabulary of operations supported on the different issue ports in the Itanium processor. The issue ports feed into memory (M), integer (I), floating-point (F), and branch (B) execution data paths.

Table A. Memory and integer execution resources (issue ports M0, M1, I0, I1). Instruction class: latency (clock cycles).

ALU (add, shift-add, logical, addp4, compare): 1
Sign/zero extend, move long: 1
Fixed extract/deposit, Tbit, TNaT: 1
Multimedia ALU (add/avg./etc.): 2
MM shift, avg, mix, pack: 2
Move to/from branch/predicates/ARs, packed multiply, pop count: 2
Load/store/prefetch/setf/break.m/cache control/memory fence: 2+
Memory management/system/getf: 2+

Table B. Floating-point execution resources (issue ports F0, F1). Instruction class: latency (clock cycles).

FMAC, SIMD FMAC: 5
Fixed multiply: 7
Fclrf: 1
Fchk: 1
Fcompare: 2
Floating-point logicals, class, min/max, pack, select: 5

Table C. Branch execution resources (issue ports B0, B1, B2). Instruction classes:

Conditional or unconditional branch
Call/return/indirect
Loop-type branch, BSW, cover
RFI
BRP (branch hint)


memory instructions (ports M0 and M1), two integer (ports I0 and I1), two floating-point (ports F0 and F1), and three branch instructions (ports B0, B1, and B2) per clock. The processor's 17 execution units are fed through the M, I, F, and B groups of issue ports.

The decoupling buffer feeds the dispersal in a bundle granular fashion (up to two bundles or six instructions per cycle), with a fresh bundle being presented each time one is consumed. Dispersal from the two bundles is instruction granular: the processor disperses as many instructions as can be issued (up to six) in left-to-right order. The dispersal algorithm is fast and simple, with instructions being dispersed to the first available issue port, subject to two constraints: detection of instruction independence and detection of resource oversubscription.

• Independence. The processor must ensure that all instructions issued in parallel are either independent or contain only allowed dependencies (such as a compare instruction feeding a dependent conditional branch). This question is easily dealt with by using the stop-bits feature of the IA-64 ISA to explicitly communicate parallel instruction semantics. Instructions between consecutive stop bits are deemed independent, so the instruction independence detection hardware is trivial. This contrasts with traditional RISC processors that are required to perform O(n²) (typically dozens) comparisons between source and destination register specifiers to determine independence.

• Oversubscription. The processor must also guarantee that there are sufficient execution resources to process all the instructions that will be issued in parallel. This oversubscription problem is facilitated by the IA-64 ISA feature of instruction bundle templates. Each instruction bundle not only specifies three instructions but also contains a 4-bit template field, indicating the type of each instruction: memory (M), integer (I), branch (B), and so on. By examining template fields from the two bundles (a total of only 8 bits), the dispersal logic can quickly determine the number of memory, integer, floating-point, and branch instructions incoming every clock. This is a hardware simplification resulting from the IA-64 instruction set architecture. Unlike conventional instruction set architectures, the instruction encoding itself doesn't need to be examined to determine the type of each operation. This feature removes decoders that would otherwise be required to examine many bits of the encoded instruction to determine the instruction's type and associated issue port.

A second key advantage of the template-based dispersal strategy is that certain instruction types can only occur in specific locations within any bundle. As a result, the dispersal interconnection network can be significantly optimized; the routing required from dispersal to issue ports is roughly only half of that required for a fully connected crossbar.

Table 1 illustrates the effectiveness of the dispersal strategy by enumerating the instruction bundles that may be issued at full bandwidth. As can be seen, a rich mix of instructions can be issued to the machine at high throughput (six per clock). The combination of stop bits and bundle templates, as specified in the IA-64 instruction set, allows the compiler to indicate the independence and instruction-type information directly and effectively to the dispersal hardware. As a result, the hardware is greatly simplified, thereby allowing an efficient implementation of instruction dispersal to a wide execution core.
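The template-driven dispersal described above can be sketched in software. In this Python model, only the slot types implied by the two template names are examined, never the instruction encodings; the left-to-right, first-free-port policy follows the text, while the template subset and the treatment of the long-immediate (L) slot are simplifying assumptions of ours (the real hardware enforces the additional pairing rules of Table 1):

```python
# Sketch of template-based dispersal: the slot types come straight
# from the 4-bit template fields (8 bits for a bundle pair), and each
# slot is assigned, in left-to-right order, to the first free issue
# port of its type. Port counts (2 M, 2 I, 2 F, 3 B) follow the text.

TEMPLATES = {  # template name -> per-slot instruction types
    "MII": "MII", "MLI": "MLI", "MFI": "MFI",
    "MIB": "MIB", "MBB": "MBB", "BBB": "BBB", "MFB": "MFB",
}
PORTS = {"M": ["M0", "M1"], "I": ["I0", "I1"],
         "F": ["F0", "F1"], "B": ["B0", "B1", "B2"]}

def disperse(bundle1, bundle2):
    """Assign up to six slots to the nine issue ports, in order."""
    free = {k: list(v) for k, v in PORTS.items()}
    assignment = []
    for slot_type in TEMPLATES[bundle1] + TEMPLATES[bundle2]:
        t = "I" if slot_type == "L" else slot_type  # assume L uses an I port
        if not free[t]:
            break   # oversubscribed: remaining slots wait a clock
        assignment.append((slot_type, free[t].pop(0)))
    return assignment

print(disperse("MII", "MBB"))   # all six slots get ports
print(disperse("MII", "MII"))   # four integer slots exceed the two I ports
```

Note how cheap the oversubscription check is: counting slot types needs only the template characters, which is the hardware simplification the text describes.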

Efficient register remapping. After dispersal, the next step in preparing incoming instructions for execution involves implementing the register stacking and rotation functions.


Table 1. Instruction bundles capable of full-bandwidth dispersal.

First bundle*     Second bundle
MIH               MLI, MFI, MIB, MBB, or MFB
MFI or MLI        MLI, MFI, MIB, MBB, BBB, or MFB
MII               MBB, BBB, or MFB
MMI               BBB
MFH               MII, MLI, MFI, MIB, MBB, or MFB

* B slots support branches and branch hints; H designates a branch hint operation in the B slot.


Register stacking is an IA-64 technique that significantly reduces function call and return overhead. It ensures that all procedural input and output parameters are in specific register locations, without requiring the compiler to perform register-register or memory-register moves. On procedure calls, a fresh register frame is simply stacked on top of existing frames in the large register file, without the need for an explicit save of the caller's registers. This enables low-overhead procedure calls, providing significant performance benefit on codes that are heavy in calls and returns, such as those in object-oriented languages.

Register rotation is an IA-64 technique that allows very low overhead, software-pipelined loops. It broadens the applicability of compiler-driven software pipelining to a wide variety of integer codes. Rotation provides a form of register renaming that allows every iteration of a software-pipelined loop to have a fresh copy of loop variables. This is accomplished by accessing the registers through an indirection based on the iteration count.

Both stacking and rotation require the hardware to remap the register names. This remapping translates the incoming virtual register specifiers onto outgoing physical register specifiers, which are then used to perform the actual lookup of the various register files. Stacking can be thought of as simply adding an offset to the virtual register specifier. In a similar fashion, rotation can also be viewed as an offset-modulo add. The remapping function supports both stacking and rotation for the integer register specifiers, but only register rotation for the floating-point and predicate register specifiers.

The Itanium processor efficiently supports the register remapping for both register stacking and rotation with a set of adders and multiplexers contained in the pipeline's REN stage. The stacking logic requires only one 7-bit adder for each specifier, and the rotation logic requires either one (predicate or floating-point) or two (integer) additional 7-bit adders. The extra adder on the integer side is needed due to the interaction of stacking with rotation. Therefore, for full six-syllable execution, a total of ninety-eight 7-bit adders and 42 multiplexers implement the combination of integer, floating-point, and predicate remapping for all incoming source and destination registers. The total area taken by this function is less than 0.25 square mm.
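The offset-add and offset-modulo-add view of remapping can be sketched in a few lines. This is an illustrative behavioral model only, not the Itanium RTL; the function name, the 96-register stacked region, and the parameter names are assumptions for the sketch.

```python
# Illustrative sketch of IA-64 integer register remapping (behavioral
# model, not actual hardware). Registers r0-r31 are static; r32 and up
# are stacked, and a leading portion of the stacked frame may rotate.

def remap_integer_reg(virt, frame_base, rrb, rotating_size):
    """Translate a virtual register specifier to a physical one.

    virt          -- virtual register number (0-127)
    frame_base    -- physical offset of the current stack frame (stacking)
    rrb           -- register rename base for rotation
    rotating_size -- size of the rotating region (0 if none)
    """
    if virt < 32:
        return virt                       # static registers: no remapping
    offset = virt - 32
    if offset < rotating_size:            # rotation: offset-modulo add
        offset = (offset + rrb) % rotating_size
    # Stacking: add the frame base, wrapping within the stacked registers.
    return 32 + (frame_base + offset) % 96

# With a frame based at physical offset 10 and no rotation, virtual r32
# lands on physical register 42.
assert remap_integer_reg(32, frame_base=10, rrb=0, rotating_size=0) == 42
```

The two adds mirror the article's description: one 7-bit add for stacking, plus an extra add for the stacking-rotation interaction on the integer side.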

The register-stacking model also requires special handling when software allocates more virtual registers than are currently physically available in the register file. A special state machine, the register stack engine (RSE), handles this case, termed stack overflow. This engine observes all stacked register allocation or deallocation requests. When an overflow is detected on a procedure call, the engine silently takes control of the pipeline, spilling registers to a backing store in memory until sufficient physical registers are available. In a similar manner, the engine handles the converse situation, termed stack underflow, when registers need to be restored from a backing store in memory. While these registers are being spilled or filled, the engine simply stalls instructions waiting on the registers; no pipeline flushes are needed to implement the register spill/restore operations.

Register stacking and rotation combine to provide significant performance benefits for a variety of applications, at the modest cost of a number of small adders, an additional pipeline stage, and control logic for a programmer-invisible register stack engine.
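The RSE's overflow and underflow handling can be modeled as simple bookkeeping. The sketch below is a toy model under stated assumptions (96 stacked physical registers, oldest-frame-first spilling, class and method names invented for illustration); the real engine works on register granularity in hardware while the pipeline stalls.

```python
# Toy model of the register stack engine (RSE); illustrative only.

PHYS_STACKED_REGS = 96  # assumed size of the stacked physical region

class RegisterStackEngine:
    def __init__(self):
        self.frames = []    # register count of each live frame, oldest first
        self.spilled = 0    # registers currently in the memory backing store

    def call(self, frame_size):
        """Procedure call: stack a fresh frame, spilling on overflow."""
        self.frames.append(frame_size)
        total = sum(self.frames)
        if total - self.spilled > PHYS_STACKED_REGS:
            # Overflow: silently spill the oldest frames' registers to
            # memory until the new frame fits (stall, never flush).
            self.spilled = total - PHYS_STACKED_REGS

    def ret(self):
        """Procedure return: pop the frame, filling on underflow."""
        self.frames.pop()
        total = sum(self.frames)
        top = self.frames[-1] if self.frames else 0
        if total - self.spilled < top:
            # Underflow: the caller's registers live in the backing
            # store; fill them back into the register file.
            self.spilled = total - top
```

A call chain that exceeds 96 stacked registers triggers spills transparently, and the matching returns trigger fills, with no involvement from the compiled code.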

Large, multiported register files. The processor provides an abundance of registers and execution resources. The 128-entry integer register file supports eight read ports and six write ports. Note that four ALU operations require eight read ports and four write ports from the register file, while pending load data returns need two additional write ports (two returns per cycle). The read and write ports can adequately support two memory and two integer instructions every clock. The IA-64 instruction set includes a feature known as postincrement. Here, the address register of a memory operation can be incremented as a side effect of the operation. This is supported by simply using two of the four ALU write ports. (These two ALUs and write ports would otherwise have been idle when memory operations are issued off their ports.)

The floating-point register file also consists of 128 registers, supports double extended-precision arithmetic, and can sustain two memory ports in parallel with two multiply-accumulate units. This combination of

SEPTEMBER–OCTOBER 2000


resources requires eight read and four write ports. The register write ports are separated into even and odd banks, allowing each memory return to update a pair of floating-point registers.

The other large register file is the predicate register file. This register file has several unique characteristics: each entry is 1 bit, it has many read and write ports (15 reads/11 writes), and it supports a broadside read or write of the entire register file. As a result, it has a distinct implementation, as described in the "Implementing predication elegantly" section.

High ILP execution core

The execution core is the heart of the EPIC implementation. It supports data-speculative and control-speculative execution, as well as predicated execution and the traditional functions of hazard detection and branch execution. Furthermore, the processor's execution core provides these capabilities in the context of the wide execution width and powerful instruction semantics that characterize the EPIC design philosophy.

Stall-based scoreboard control strategy. As mentioned earlier, the frequency target of the Itanium processor was governed by several key timing paths, such as the ALU plus bypass and the two-cycle data cache. All of the control paths within the core pipeline fit within the given cycle time; detecting and dealing with data hazards was one such key control path.

To achieve high performance, we adopted a nonblocking cache with a scoreboard-based stall-on-use strategy. This is particularly valuable in the context of speculation, in which certain load operations may be aggressively boosted to avoid cache miss latencies, and the resulting data may potentially not be consumed. For such cases, it is key that 1) the pipeline not be interrupted because of a cache miss, and 2) the pipeline only be interrupted if and when the unavailable data is needed.

Thus, to achieve high performance, the strategy for dealing with detected data hazards is based on stalls: the pipeline stalls only when unavailable data is needed, and stalls only as long as the data is unavailable. This strategy allows the entire processor pipeline to remain filled, and the in-flight dependent instructions to be immediately ready to continue as soon as the required data is available.

This contrasts with other high-frequency designs, which are based on flushing and require that the pipeline be emptied when a hazard is detected, resulting in reduced performance. On the Itanium processor, innovative techniques reap the performance benefits of a stall-based strategy and yet enable high-frequency operation on this wide machine.

The scoreboard control is also enhanced to support predication. Since most operations within the IA-64 instruction set architecture can be predicated, either the producer or the consumer of a given piece of data may be nullified by having a false predicate. Figure 7 illustrates an example of such a case. Note that if either the producer or consumer operation is nullified via predication, there are no hazards. The processor scoreboard therefore considers both the producer and consumer predicates, in addition to the normal operand availability, when evaluating whether a hazard exists. This hazard evaluation occurs in the REG (register read) pipeline stage.
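The predicate-aware hazard check amounts to ANDing the usual dependency condition with both predicates. A minimal sketch (an illustrative model with invented names, not the processor's actual scoreboard logic):

```python
# Simplified model of the predicate-aware scoreboard check performed in
# the REG stage (illustrative only).

def hazard_exists(producer_pred, consumer_pred, same_register, data_ready):
    """A stall is warranted only when a real dependency is live.

    producer_pred -- predicate of the instruction producing the value
    consumer_pred -- predicate of the instruction consuming it
    same_register -- consumer source matches producer destination
    data_ready    -- the producer's result is already available
    """
    if not same_register or data_ready:
        return False
    # If either side is nullified by a false predicate, no hazard exists.
    return producer_pred and consumer_pred

# As in the Figure 7 example: the load and the add conflict on r4 only
# when both of their predicates are true and the data is not yet back.
assert hazard_exists(True, True, True, False) is True
assert hazard_exists(True, False, True, False) is False
```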

Given the high frequency of the processor pipeline, there's not sufficient time to both compute the existence of a hazard and effect a global pipeline stall in a single clock cycle. Hence, we use a unique deferred-stall strategy. This approach allows any dependent consumer instructions to proceed from the REG into the EXE (execute) pipeline stage, where they are then stalled; hence the term deferred stall.

However, the instructions in the EXE stage no longer have read port access to the register file to obtain new operand data. Therefore, to ensure that the instructions in the EXE stage procure the correct data, the latches at the start of the EXE stage (which contain the source data values) continuously snoop all returning


Clock 1:  cmp.eq r1, r2 → p1, p3     (compute predicates p1, p3)
          cmp.eq r3, r4 → p2, p4 ;;  (compute predicates p2, p4)
Clock 2:  (p1) ld4 [r3] → r4 ;;      (load nullified if p1 = False: producer nullification)
Clock N:  (p4) add r4, r1 → r5       (add nullified if p4 = False: consumer nullification)

Note that a hazard exists only if (a) p1 = p4 = True, AND (b) the r4 result is not available when the add collects its source data.

Figure 7. Predicated producer-consumer dependencies.


data values, intercepting any data that the instruction requires. The logic used to perform this data interception is identical to the register bypass network used to collect operands for instructions in the REG stage. By noting that instructions observing a deferred stall in the REG stage don't require the use of the bypass network, the EXE stage instructions can usurp the bypass network for the deferred stall. By reusing existing register bypass hardware, the deferred stall strategy is implemented in an area-efficient manner. This allows the processor to combine the benefits of high frequency with stall-based pipeline control, thereby precluding the penalty of pipeline flushes due to replays on register hazards.
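The snoop-and-capture behavior of the EXE-stage latches can be modeled roughly as follows. This is an illustrative sketch with invented names; the real hardware does this with the reused bypass comparators rather than a dictionary.

```python
# Rough behavioral model of deferred-stall operand capture in the EXE
# stage (illustrative only; the hardware reuses the bypass network).

class DeferredStallLatch:
    def __init__(self, source_regs):
        # Source register IDs still awaiting data, mapped to None.
        self.pending = {r: None for r in source_regs}

    def snoop(self, returning_reg, value):
        """Each cycle, compare returning destination IDs against the
        stalled instruction's sources and intercept any match."""
        if returning_reg in self.pending and self.pending[returning_reg] is None:
            self.pending[returning_reg] = value

    def ready(self):
        """The stall ends once every source operand has been captured."""
        return all(v is not None for v in self.pending.values())

latch = DeferredStallLatch(source_regs=[4, 7])
latch.snoop(9, 123)   # unrelated return: ignored
latch.snoop(4, 55)    # r4 returns from the cache: captured
latch.snoop(7, 88)    # r7 returns: captured
assert latch.ready()
```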

Execution resources. The processor provides an abundance of execution resources to exploit ILP. The integer execution core includes two memory and two integer ports, with all four ports capable of executing arithmetic, shift-and-add, logical, compare, and most integer SIMD multimedia operations. The memory ports can also perform load and store operations, including loads and stores with postincrement functionality. The integer ports add the ability to perform the less-common integer instructions, such as test bit, look for zero byte, and variable shift. Additional uncommon instructions are implemented only on the first integer port.

See the earlier sidebar for a full enumeration of the per-port capabilities and associated instruction latencies on the processor. In general, we designed the method used to map instructions onto each port to maximize overall performance, by balancing the instruction frequency with the area and timing impact of additional execution resources.

Implementing predication elegantly. Predication is another key feature of the IA-64 architecture, allowing higher performance by eliminating branches and their associated misprediction penalties.5 However, predication affects several key aspects of the pipeline design. Predication turns a control dependency (branching on the condition) into a data dependency (execution and forwarding of data dependent upon the value of the predicate). If spurious stalls and pipeline disruptions get introduced during predicated execution, the benefit of branch misprediction elimination will be squandered. Care was taken to ensure that predicates are implemented transparently in the pipeline.

The basic strategy for predicated execution is to allow all instructions to read the register file and get issued to the hardware regardless of their predicate value. Predicates are used to configure the data-forwarding network, detect the presence of hazards, control pipeline advances, and conditionally nullify the execution and retirement of issued operations. Predicates also feed the branching hardware. The predicate register file is a highly multiported structure. It is accessed in parallel with the general registers in the REG stage. Since predicates themselves are generated in the execution core (from compare instructions, for


IA-32 compatibility

Another key feature of the Itanium processor is its full support of the IA-32 instruction set in hardware (see Figure A). This includes support for running a mix of IA-32 applications and IA-64 applications on an IA-64 operating system, as well as IA-32 applications on an IA-32 operating system, in both uniprocessor and multiprocessor configurations. The IA-32 engine makes use of the EPIC machine's registers, caches, and execution resources. To deliver high performance on legacy binaries, the IA-32 engine dynamically schedules instructions.1,2 The IA-64 Seamless Architecture is defined to enable running IA-32 system functions in native IA-64 mode, thus delivering native performance levels on the system functionality.

References

1. R. Colwell and R. Steck, "A 0.6-μm BiCMOS Microprocessor with Dynamic Execution," Proc. Int'l Solid-State Circuits Conf., IEEE Press, Piscataway, N.J., 1995, pp. 176-177.

2. D. Papworth, "Tuning the Pentium Pro Microarchitecture," IEEE Micro, Mar./Apr. 1996, pp. 8-15.

[Figure A shows IA-32 instruction fetch and decode, the IA-32 dynamic scheduler, and IA-32 retirement and exceptions, built around a shared instruction cache and TLB and the shared IA-64 execution core.]

Figure A. IA-32 compatibility microarchitecture.


example) and may be in flight when they're needed, they must be forwarded quickly to the specific hardware that consumes them.

Note that predication affects the hazard detection logic by nullifying either data producer or consumer instructions. Consumer nullification is performed after reading the predicate register file (PRF) for the predicate sources of the six instructions in the REG pipeline stage. Producer nullification is performed after reading the predicate register file for the predicate sources of the six instructions in the EXE stage.

Finally, three conditional branches can be executed in the DET pipeline stage; this requires reading three additional predicate sources. Thus, a total of 15 read ports are needed to access the predicate register file. From a write port perspective, 11 predicates can be written every clock: eight from four parallel integer compares, two from a floating-point compare, and one via the stage predicate write feature of loop branches. These read and write ports are in addition to a broadside read and write capability that allows a single instruction to read or write the entire 64-entry predicate register file into or from a single 64-bit integer register. The predicate register file is implemented as a single 64-bit latch, with 15 simple 64:1 multiplexers being used as the read ports. Similarly, the 11 write ports are efficiently implemented, with each being a 6:64 decoder, with an AND-OR structure used to update the actual predicate register file latch. Broadside reads and writes are easily implemented by reading or writing the contents of the entire 64-bit latch.
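The latch-plus-multiplexer structure has a compact behavioral equivalent: the whole predicate file is one 64-bit word, a read port is a bit select, a write port is a decode-and-merge, and the broadside operations move the whole word. A sketch of the behavior (not the circuit; class and method names are invented for illustration):

```python
# Behavioral sketch of the 64-entry, 1-bit predicate register file
# (illustrative only).

class PredicateRegisterFile:
    def __init__(self):
        self.latch = 0               # 64 predicate bits in a single word

    def read(self, index):
        """Read port: a 64:1 multiplexer selecting one bit."""
        return (self.latch >> index) & 1

    def write(self, index, value):
        """Write port: a 6:64 decoder plus AND-OR merge into the latch."""
        mask = 1 << index
        self.latch = (self.latch | mask) if value else (self.latch & ~mask)

    def broadside_read(self):
        """Move the entire predicate file into a 64-bit integer value."""
        return self.latch

    def broadside_write(self, word):
        """Load the entire predicate file from a 64-bit integer value."""
        self.latch = word & (2**64 - 1)

prf = PredicateRegisterFile()
prf.write(3, 1)
assert prf.read(3) == 1 and prf.broadside_read() == 0b1000
```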

In-flight predicates must be forwarded quickly after generation to the point of consumption. The costly bypass logic that would have been needed for this is eliminated by taking advantage of the fact that all predicate-writing instructions have deterministic latency. Instead, a speculative predicate register file (SPRF) is used and updated as soon as predicate data is computed. The source predicate of any dependent instruction is then read directly from this register file, obviating the need for bypass logic. A separate architectural predicate register file (APRF) is only updated when a predicate-writing instruction retires and is only then allowed to update the architectural state.

In case of an exception or pipeline flush, the SPRF is copied from the APRF in the shadow of the flush latency, undoing the effect of any misspeculative predicate writes. The combination of latch-based implementation and the two-file strategy allows an area-efficient and timing-efficient implementation of the highly ported predicate registers.

Figure 8 shows one of the six EXE stage predicates that allow or nullify data forwarding in the data-forwarding network. The other five predicates are handled identically. Predication control of the bypass network is implemented very efficiently by ANDing the predicate value with the destination-valid signal present in conventional bypass logic networks. Instructions with false predicates are treated as merely not writing to their destination register. Thus, the impact of predication on the operand-forwarding network is fairly minimal.

Optimized speculation support in hardware. With minimal hardware impact, the Itanium processor enables software to hide the latency of load instructions and their dependent uses by boosting them out of their home basic block. This is termed speculation. To perform effective speculation, two key issues must be addressed. First, any exceptions that are detected must be deferrable until an operation's home basic block is encountered; this is termed control speculation. Second, all stores between the boosted load and its home location must be checked for address overlap. If there is an


[Figure 8: the source bypass multiplexer selects between register file data and bypass-forwarded data to produce the "true" source data. A comparator matches the source RegID against the forwarded destination RegID, and its output is ANDed with the forwarded instruction predicate, the element added to support predication, to control the multiplexer.]

Figure 8. Predicated bypass control.


overlap, the latest store should forward the correct data; this is termed data speculation. The Itanium processor provides effective support for both forms of speculation.

In case of control speculation, normal exception checks are performed for a control-speculative load instruction. In the common case, no exception is encountered, and therefore no special handling is required. On a detected exception, the hardware examines the exception type, software-managed architectural control registers, and page attributes to determine whether the exception should be handled immediately (such as for a TLB miss) or deferred for future handling.

For a deferral, a special deferred exception

token called the NaT (Not a Thing) bit is retained for each integer register, and a special floating-point value, called NaTVal and encoded in the NaN space, is set for floating-point registers. This token indicates that a deferred exception was detected. The deferred exception token is then propagated into result registers when any of the source registers indicates such a token. The exception is reported when either a speculation check or nonspeculative use (such as a store instruction) consumes a register that is flagged with the deferred exception token. In this way, NaT generation leverages traditional exception logic simply, and NaT propagation uses straightforward data path logic.
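The generate/propagate/report flow can be sketched behaviorally. This is an illustrative model with invented names, using a plain flag in place of the register file's extra bit:

```python
# Sketch of deferred-exception (NaT) generation, propagation, and
# reporting for control speculation (illustrative only).

class SpecValue:
    def __init__(self, value=0, nat=False):
        self.value = value
        self.nat = nat           # deferred-exception token

def speculative_load(address, fault):
    """A boosted load defers its exception by returning a NaT'ed result."""
    return SpecValue(nat=True) if fault else SpecValue(value=address * 2)

def add(a, b):
    """Any NaT'ed source propagates the token into the result."""
    return SpecValue(a.value + b.value, nat=a.nat or b.nat)

def check(result):
    """A speculation check (or nonspeculative use) reports the deferral."""
    if result.nat:
        raise RuntimeError("deferred exception detected at home block")
    return result.value

r = add(speculative_load(0x10, fault=True), SpecValue(5))
assert r.nat    # the token flowed through the add untouched
```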

The existence of NaT bits and NaTVals also affects the register spill-and-fill logic. For explicit software-driven register spills and fills, special move instructions (store.spill and load.fill) are supported that don't take exceptions when encountering NaTed data. For floating-point data, the entire data is simply moved to and from memory. For integer data, the extra NaT bit is written into a special register (called UNaT, or user NaT) on spills, and is read back on the load.fill instruction. The UNaT register can also be written to memory if more than 64 registers need to be spilled. In the case of implicit spills and fills generated by the register save engine, the engine collects the NaT bits into another special register (called RNaT, or register NaT), which is then spilled (or filled) once for every 64 register save engine stores (or loads).

For data speculation, the software issues an advanced load instruction. When the hardware encounters an advanced load, it places the address, size, and destination register of the load into the ALAT structure. The ALAT then observes all subsequent explicit store instructions, checking for overlaps with the valid advanced load addresses present in the ALAT. In the common case, there's no match, the ALAT state is unchanged, and the advanced load result is used normally. In the case of an overlap, all address-matching advanced loads in the ALAT are invalidated.

After the last undisambiguated store prior to the load's home basic block, an instruction can query the ALAT and find that the advanced load was matched by an intervening store address. In this situation, recovery is needed. When only the load and no dependent instructions were boosted, a load-check (ld.c) instruction is used, and the load instruction is reissued down the pipeline, this time retrieving the updated memory data. As an important performance feature, the ld.c instruction can be issued in parallel with instructions dependent on the load result data. By allowing this optimization, the critical load uses can be issued immediately, allowing the ld.c to effectively be a zero-cycle operation. When the advanced load and its dependent uses were boosted, an advanced check-load (chk.a) instruction traps to a user-specified handler for special fix-up code that reissues the load instruction and the operations dependent on the load. Thus, support for data speculation was added to the pipeline





Floating-point feature set

The FPU in the processor is quite advanced. The native 82-bit hardware provides efficient support for multiple numeric programming models, including support for single, double, extended, and mixed-mode-precision computations. The wide-range 17-bit exponent enables efficient support for extended-precision library functions as well as fast emulation of quad-precision computations. The large 128-entry register file provides adequate register resources. The FPU execution hardware is based on the floating-point multiply-add (FMAC) primitive, which is an effective building block for scientific computation.1 The machine provides execution hardware for four double-precision or eight single-precision flops per clock. This abundant computation bandwidth is balanced with adequate operand bandwidth from the registers and memory subsystem. With judicious use of data prefetch instructions, as well as cache locality and allocation management hints, the software can effectively arrange the computation for sustained high utilization of the parallel hardware.

FMAC units

The FPU supports two fully pipelined, 82-bit FMAC units that can execute single, double, or extended-precision floating-point operations. This delivers a peak of 4 double-precision flops/clock, or 3.2 Gflops at 800 MHz. FMAC units execute FMA, FMS, FNMA, FCVTFX, and FCVTXF operations. When bypassed to one another, the latency of the FMAC arithmetic operations is five clock cycles.

The processor also provides support for executing two SIMD floating-point instructions in parallel. Since each instruction issues two single-precision FMAC operations (or four single-precision flops), the peak execution bandwidth is 8 single-precision flops/clock, or 6.4 Gflops at 800 MHz. Two supplemental single-precision FMAC units support this computation. (Since the read of an 82-bit register actually yields two single-precision SIMD operands, the second operand in each case is peeled off and sent to the supplemental SIMD units for execution.) The high computational rate on single precision is especially suitable for digital content creation workloads.

The divide operation is done in software and can take advantage of the twin fully pipelined FMAC hardware. Software-pipelined divide operations can yield high throughput on the division and square-root operations common in 3D geometry codes.

The machine also provides one hardware pipe for execution of FCMPs and other operations (such as FMERGE, FPACK, FSWAP, FLogicals, reciprocal, and reciprocal square root). Latency of the FCMP operations is two clock cycles; latency of the other floating-point operations is five clock cycles.

Operand bandwidth

Care has been taken to ensure that the high computational bandwidth is matched with operand feed bandwidth. See Figure B. The 128-entry floating-point register file has eight read and four write ports. Every cycle, the eight read ports can feed two extended-precision FMACs (each with three operands) as well as two floating-point stores to memory. The four write ports can accommodate two extended-precision results from the two FMAC units and the results from two load instructions each clock. To increase the effective write bandwidth into the FPU from memory, we divided the floating-point registers into odd and even banks. This enables the two physical write ports dedicated to load returns to be used to write four values per clock to the register file (two to each bank), using two ldf-pair instructions. The ldf-pair instructions must obey the restriction that the pair of consecutive memory operands being loaded in sends one operand to an even register and the other to an odd register for proper use of the banks.

The earliest cache level to feed the FPU is the unified L2 cache (96 Kbytes). Two ldf-pair instructions can load four double-precision values from the L2 cache into the registers. The latency of loads from this cache to the FPU is nine clock cycles. For data beyond the L2 cache, the bandwidth to the L3 cache is two double-precision operations/clock (one 64-byte line every four clock cycles).

Obviously, to achieve the peak rating of four double-precision floating-point operations per clock cycle, one needs to feed the FMACs with six operands per clock. The L2 memory can feed a peak of four operands per clock. The remaining two need to come from the register file. Hence, with the right amount of data reuse, and with appropriate cache management strategies aimed at ensuring that the L2 cache is well primed to feed the FPU, many workloads can deliver sustained performance at near the peak floating-point operation rating. For data without locality, use of the NT2 and NTA hints enables the data to appear to virtually stream into the FPU through the next level of memory.

FPU and integer core coupling

The floating-point pipeline is coupled to the integer pipeline. Register file read occurs in the REG stage, with seven stages of execution

[Figure B: the 128-entry, 82-bit floating-point register file, divided into even and odd banks, feeds six 82-bit operands per clock to the FMAC units and receives two 82-bit results. The L2 cache supplies four double-precision operands per clock (two ldf-pair loads) and accepts two stores per clock, backed by the 4-Mbyte L3 cache at two double-precision operations per clock.]

Figure B. FMAC units deliver 8 flops/clock.


in a straightforward manner, only needing management of a small ALAT in hardware.

As shown in Figure 9, the ALAT is implemented as a 32-entry, two-way set-associative structure. The array is looked up based on the advanced load's destination register ID, and each entry contains an advanced load's physical address, a special octet mask, and a valid bit. The physical address is used to compare against subsequent stores, with the octet mask bits used to track which bytes have actually been advance loaded. These are used in case of partial overlap or in cases where the load and store are different sizes. In case of a match, the corresponding valid bit is cleared. The later check instruction then simply queries the ALAT to examine whether a valid ALAT entry still exists for its register ID.
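The ALAT's behavior can be modeled roughly as below. This is a simplified sketch keyed and sized per Figure 9; the octet-mask byte tracking is omitted in favor of whole-range overlap checks, and the replacement policy and names are assumptions.

```python
# Rough behavioral model of the ALAT: 32 entries, two-way set-associative,
# indexed by the advanced load's destination register ID (illustrative;
# octet masks and partial-overlap tracking are simplified away).

class ALAT:
    def __init__(self):
        self.sets = [[None, None] for _ in range(16)]    # 16 sets x 2 ways

    def advanced_load(self, reg_id, phys_addr, size):
        entry = {"reg": reg_id, "addr": phys_addr, "size": size, "valid": True}
        ways = self.sets[reg_id & 0xF]                   # index on RegID[3:0]
        ways[1] = ways[0]                                # naive replacement
        ways[0] = entry

    def store(self, phys_addr, size):
        """Every explicit store invalidates overlapping valid entries."""
        for ways in self.sets:
            for e in ways:
                if e and e["valid"] and not (
                        phys_addr + size <= e["addr"] or
                        e["addr"] + e["size"] <= phys_addr):
                    e["valid"] = False

    def check(self, reg_id):
        """ld.c/chk.a: does a valid entry still exist for this register?"""
        return any(e and e["valid"] and e["reg"] == reg_id
                   for e in self.sets[reg_id & 0xF])

alat = ALAT()
alat.advanced_load(reg_id=35, phys_addr=0x1000, size=8)
alat.store(phys_addr=0x1004, size=4)    # intervening overlapping store
assert not alat.check(35)               # the check fails: recovery needed
```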

The combination of a simple ALAT for data speculation, in conjunction with NaT bits and small changes to the exception logic for control speculation, eliminates the two fundamental barriers that software has traditionally encountered when boosting instructions. By adding this modest hardware


extending beyond the REG stage, followed by floating-point write back. Safe instruction recognition (SIR) hardware enables delivery of precise exceptions on numeric computation. In the FP1 (or EXE) stage, an early examination of operands is performed to determine the possibility of numeric exceptions on the instructions being issued. If the instructions are unsafe (have potential for raising exceptions), a special form of hardware microreplay is incurred. This mechanism enables instructions in the floating-point and integer pipelines to flow freely in all situations in which no exceptions are possible.

The FPU is coupled to the integer data path via transfer paths between the integer and floating-point register files. These transfers (setf, getf) are issued on the memory ports and made to look like memory operations (since they need register ports on both the integer and floating-point registers). While setf can be issued on either the M0 or M1 port, getf can only be issued on the M0 port. Transfer latency from the FPU to the integer registers (getf) is two clocks. The latency for the reverse transfer (setf) is nine clocks, since this operation appears like a load from the L2 cache.

We enhanced the FPU to support integer multiply inside the FMAC hardware. Under software control, operands are transferred from the integer registers to the FPU using setf. After multiplication is complete, the result is transferred to the integer registers using getf. This sequence takes a total of 18 clocks (nine for setf, seven for fmul to write the registers, and two for getf). The FPU can execute two integer multiply-add (XMA) operations in parallel. This is very useful in cryptographic applications. The presence of twin XMA pipelines at 800 MHz allows for over 1,000 decryptions per second on a 1,024-bit RSA using private keys (server-side encryption/decryption).

FPU controls

The FPU controls for operating precision and rounding are derived from the floating-point status register (FPSR). This register also contains the numeric execution status of each operation. The FPSR also supports speculation in floating-point computation. Specifically, the register contains four parallel fields, or tracks, for both controls and flags to support three parallel speculative streams, in addition to the primary stream.

Special attention has been placed on delivering high performance for speculative streams. The FPU provides high throughput in cases where the FCLRF instruction is used to clear status from speculative tracks before forking off a fresh speculative chain. No stalls are incurred on such changes. In addition, the FCHKF instruction (which checks for exceptions on speculative chains on a given track) is also supported efficiently. Interlocks on this instruction are track-granular, so that no interlock stalls are incurred if floating-point instructions in the pipeline are only targeting the other tracks. However, changes to the control bits in the FPSR (made via the FSETC instruction or the MOV GR→FPSR instruction) have a latency of seven clock cycles.

FPU summary

The FPU feature set is balanced to deliver high performance across a broad range of computational workloads. This is achieved through the combination of abundant execution resources, ample operand bandwidth, and a rich programming environment.

References

1. B. Olsson et al., "RISC System/6000 Floating-Point Unit," IBM RISC System/6000 Technology, IBM Corp., 1990, pp. 34-42.

[Figure 9: the ALAT is organized as 16 sets with two ways (Way 0, Way 1); each entry holds a RegID tag, the physical address of the advanced load, and a valid bit, indexed by RegID[3:0].]

Figure 9. ALAT organization.


support for speculation, the processor allows the software to take advantage of the compiler's large scheduling window to hide memory latency, without the need for complex dynamic scheduling hardware.

Parallel zero-latency delay-executed branching. Achieving the highest levels of performance requires a robust control flow mechanism. The processor's branch-handling strategy is based on three key directions. First, branch semantics providing more program details are needed to allow the software to convey complex control flow information to the hardware. Second, aggressive use of speculation and predication will progressively lead to an emptying out of basic blocks, leaving clusters of branches. Finally, since the data flow from compare to dependent branch is often very tight, special care needs to be taken to enable high performance for this important case. The processor optimizes across all three of these fronts.

The processor efficiently implements the powerful branch vocabulary of the IA-64 instruction set architecture. The hardware takes advantage of the new semantics for improved branch handling. For example, the loop count (LC) register indicates the number of iterations in a For-type loop, and the epilogue count (EC) register indicates the number of epilogue stages in a software-pipelined loop.

By using the loop count information, high performance can be achieved by software pipelining all loops. Moreover, the implementation avoids pipeline flushes for the first and last loop iterations, since the actual number of iterations is effectively communicated to the hardware. By examining the epilogue count register information, the processor automatically generates correct stage predicates for the epilogue iterations of the software-pipelined loop. This step leverages the predicate-remapping hardware along with the branch prediction information from the loop count register-based branch predictor.
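The counted-loop branch behavior described above can be sketched as a small behavioral model. This follows the usual IA-64 counted-loop (br.ctop-style) rules as we understand them: kernel iterations repeat while LC is nonzero, then EC-1 epilogue stages drain the software pipeline with the injected stage predicate turned off. It is an illustration, not the hardware algorithm:

```python
def ctop_iterations(lc, ec):
    """Behavioral sketch of a counted-loop branch. Returns a trace of
    (phase, stage_predicate) pairs: while LC > 0 the kernel repeats and
    LC decrements; once LC reaches 0, the remaining epilogue stages
    drain the pipeline with the stage predicate off, so no new
    first-stage work is started."""
    trace = []
    while True:
        if lc > 0:
            trace.append(("kernel", True))     # predicate on: full stages
            lc -= 1
        elif ec > 1:
            trace.append(("epilogue", False))  # predicate off: drain only
            ec -= 1
        else:
            break                              # loop exits, no flush needed
    return trace

# A 5-iteration loop software-pipelined into 3 stages (EC = 3):
t = ctop_iterations(5, 3)
print(len([s for s, _ in t if s == "kernel"]))    # 5
print(len([s for s, _ in t if s == "epilogue"]))  # 2
```

Because the iteration and epilogue counts are architectural state, the branch outcome for every iteration is known in advance, which is what lets the hardware avoid pipeline flushes at the loop boundaries.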

Unlike conventional processors, the Itanium processor can execute up to three parallel branches per clock. This is implemented by examining the three controlling conditions (either predicates or the loop count/epilogue count values) for the three parallel branches, and performing a priority encode to determine the earliest taken branch. All side effects of later instructions are automatically squashed within the branch execution unit itself, preventing any architectural state update from branches in the shadow of a taken branch. Given that the powerful branch prediction in the front end contains tailored support for multiway branch prediction, minimal pipeline disruptions can be expected from this parallel branch execution.
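A minimal sketch of that priority encode, assuming the branches arrive in program order (the function name and data layout are ours, for illustration):

```python
def resolve_parallel_branches(branches):
    """Sketch of the priority encode described above. `branches` is an
    in-program-order list of (taken, target) pairs for up to three
    branches in one issue group. The earliest taken branch wins; all
    later branches are in its shadow and are squashed, so none of their
    architectural state updates survive."""
    for i, (taken, target) in enumerate(branches):
        if taken:
            return target, branches[i + 1:]  # redirect; rest squashed
    return None, []                          # no branch taken: fall through

target, squashed = resolve_parallel_branches(
    [(False, 0x100), (True, 0x200), (True, 0x300)])
print(hex(target))    # 0x200 -- second branch is the earliest taken
print(len(squashed))  # 1     -- the third branch is in its shadow
```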

Finally, the processor optimizes for the common case of a very short distance between the branch and the instruction that generates the branch condition. The IA-64 instruction set architecture allows a conditional branch to be issued concurrently with the integer compare that generates its condition; no stop bit is needed. To accommodate this important performance optimization, the processor pipelines the compare-branch sequence. The compare instruction is performed in the pipeline's EXE stage, with the results known by the end of the EXE clock. To accommodate the delivery of this condition to the branch hardware, the processor executes all branches in the DET stage. (Note that the presence of the DET stage isn't an overhead needed solely for branching. This stage is also used for exception collection and prioritization, and for the second clock of execution for integer-SIMD operations.) Thus, any branch issued in parallel with the compare that generates its condition will be evaluated in the DET stage, using the predicate results created in the previous (EXE) stage. In this manner, the processor can easily handle the case of compare and dependent branches issued in parallel.

As a result of branch execution in the DET stage, in the rare case of a full pipeline flush due to a branch misprediction, the processor will incur a branch misprediction penalty of nine pipeline bubbles. Note that we expect this to occur rarely, given the aggressive multitier branch prediction strategy in the front end. Most branches should be predicted correctly using one of the four progressive resteers in the front end.

The combination of enhanced branch semantics, three-wide parallel branch execution, and zero-cycle compare-to-branch latency allows the processor to achieve high performance on control-flow-dominated codes, in addition to its high performance on

ITANIUM PROCESSOR

IEEE MICRO


more computation-oriented, data-flow-dominated workloads.
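To put the nine-bubble penalty in perspective, a back-of-envelope estimate is easy to make. The 97% prediction accuracy and one-branch-in-five instruction mix below are illustrative assumptions of ours, not figures from this article:

```python
# Rough cost of the 9-bubble misprediction penalty quoted above, under
# assumed (not published) prediction accuracy and branch density.
PENALTY = 9             # pipeline bubbles on a full flush (DET-stage resolve)
accuracy = 0.97         # assumed fraction of branches predicted correctly
branch_fraction = 0.20  # assumed fraction of instructions that are branches

bubbles_per_branch = (1 - accuracy) * PENALTY
bubbles_per_instruction = branch_fraction * bubbles_per_branch
print(round(bubbles_per_branch, 2))       # 0.27
print(round(bubbles_per_instruction, 3))  # 0.054
```

Under these assumptions, the average overhead is well under a tenth of a cycle per instruction, which is why the design can afford to resolve branches late (in the DET stage) in exchange for the zero-cycle compare-to-branch latency.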

    Memory subsystem

In addition to the high-performance core, the Itanium processor provides a robust cache and memory subsystem, which accommodates a variety of workloads and exploits the memory hints of the IA-64 ISA.

Three levels of on-package cache

The processor provides three levels of on-package cache for scalable performance across a variety of workloads. At the first level, instruction and data caches are split, each 16 Kbytes in size, four-way set-associative, and with a 32-byte line size. The dual-ported data cache has a load latency of two cycles, is write-through, and is physically addressed and tagged. The L1 caches are effective on moderate-size workloads and act as a first-level filter for capturing the immediate locality of large workloads.

The second cache level is 96 Kbytes in size, is six-way set-associative, and uses a 64-byte line size. The cache can handle two requests per clock via banking. This cache is also the level at which ordering requirements and semaphore operations are implemented. The L2 cache uses a four-state MESI (modified, exclusive, shared, and invalid) protocol for multiprocessor coherence. The cache is unified, allowing it to service both instruction- and data-side requests from the L1 caches. This approach allows optimal cache use for both instruction-heavy (server) and data-heavy (numeric) workloads. Since floating-point workloads often have large data working sets and are used with compiler optimizations such as data blocking, the L2 cache is the first point of service for floating-point loads. Also, because floating-point performance requires high bandwidth to the register file, the L2 cache can provide four double-precision operands per clock to the floating-point register file, using two parallel floating-point load-pair instructions.

The third level of on-package cache is 4 Mbytes in size, uses a 64-byte line size, and is four-way set-associative. It communicates with the processor at core frequency (800 MHz) using a 128-bit bus. This cache serves the large workloads of server and transaction-processing applications, and minimizes cache traffic on the frontside system bus. The L3 cache also implements a MESI protocol for multiprocessor coherence.
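Collecting the cache parameters above in one place makes a quick sanity check possible: the set counts follow from size, associativity, and line size, and the peak bandwidths follow from the 800-MHz core clock. The arithmetic below is derived by us from the quoted parameters:

```python
# Cache geometry from the text, plus the peak bandwidths the quoted
# parameters imply at the 800-MHz core clock.
CORE_HZ = 800e6

caches = {
    #       size_bytes,        ways, line_bytes
    "L1I": (16 * 1024,         4,    32),
    "L1D": (16 * 1024,         4,    32),
    "L2":  (96 * 1024,         6,    64),
    "L3":  (4 * 1024 * 1024,   4,    64),
}

for name, (size, ways, line) in caches.items():
    sets = size // (ways * line)
    print(f"{name}: {sets} sets")

# L2 -> FP register file: two load-pair instructions deliver four
# double-precision (8-byte) operands per clock.
l2_fp_bw = 4 * 8 * CORE_HZ / 1e9   # Gbytes/s
# L3 talks to the core over a 128-bit (16-byte) bus at core frequency.
l3_bw = 16 * CORE_HZ / 1e9         # Gbytes/s
print(l2_fp_bw, l3_bw)             # 25.6 12.8
```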

A two-level hierarchy of TLBs handles virtual address translations for data accesses. The hierarchy consists of a 32-entry first-level and a 96-entry second-level TLB, backed by a hardware page walker.

Optimal cache management

To enable optimal use of the cache hierarchy, the IA-64 instruction set architecture defines a set of memory locality hints used for better managing the memory capacity at specific hierarchy levels. These hints indicate the temporal locality of each access at each level of the hierarchy. The processor uses them to determine allocation and replacement strategies for each cache level. Additionally, the IA-64 architecture allows a bias hint, indicating that the software intends to modify the data of a given cache line. The bias hint brings a line into the cache with ownership, thereby optimizing the MESI protocol latency.

Table 2 lists the hint bits and their mapping to cache behavior. If data is hinted to be nontemporal for a particular cache level, that data is simply not allocated to the cache. (On the L2 cache, to simplify the control logic, the processor implements this algorithm approximately. The data can be allocated to the cache, but the least recently used, or LRU, bits are modified to mark the line as the next target for replacement.) Note that the nearest cache level to feed

SEPTEMBER-OCTOBER 2000

Table 2. Implementation of cache hints.

Hint          Semantics                  L1 response        L2 response                     L3 response
NTA           Nontemporal (all levels)   Don't allocate     Allocate, mark as next replace  Don't allocate
NT2           Nontemporal (2 levels)     Don't allocate     Allocate, mark as next replace  Normal allocation
NT1           Nontemporal (1 level)      Don't allocate     Normal allocation               Normal allocation
T1 (default)  Temporal                   Normal allocation  Normal allocation               Normal allocation
Bias          Intent to modify           Normal allocation  Allocate into exclusive state   Allocate into exclusive state


the floating-point unit is the L2 cache. Hence, for floating-point loads, the behavior is modified to reflect this shift (an NT1 hint on a floating-point access is treated like an NT2 hint on an integer access, and so on).

Allowing the software to explicitly provide high-level semantics of the data usage pattern enables more efficient use of the on-chip memory structures, ultimately leading to higher performance for any given cache size and access bandwidth.
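Table 2 and the floating-point shift can be captured in a small lookup model. The extrapolation of the shift beyond NT1 (NT2 behaving like NTA for floating-point accesses) is our reading of the "and so on," marked as an assumption below:

```python
# Behavioral sketch of Table 2 plus the floating-point hint shift.
ALLOC = "normal allocation"
NO_ALLOC = "don't allocate"
NEXT_REPLACE = "allocate, mark as next replace"
EXCLUSIVE = "allocate into exclusive state"

HINTS = {
    #            L1         L2            L3
    "NTA":  (NO_ALLOC,  NEXT_REPLACE, NO_ALLOC),
    "NT2":  (NO_ALLOC,  NEXT_REPLACE, ALLOC),
    "NT1":  (NO_ALLOC,  ALLOC,        ALLOC),
    "T1":   (ALLOC,     ALLOC,        ALLOC),
    "Bias": (ALLOC,     EXCLUSIVE,    EXCLUSIVE),
}

# FP loads are serviced first by the L2, so each nontemporal hint acts
# one level "further out." NT2 -> NTA is our assumed extrapolation.
FP_SHIFT = {"NT1": "NT2", "NT2": "NTA", "NTA": "NTA",
            "T1": "T1", "Bias": "Bias"}

def cache_response(hint, is_fp=False):
    """Map a locality hint to the per-level allocation behavior."""
    if is_fp:
        hint = FP_SHIFT[hint]
    return dict(zip(("L1", "L2", "L3"), HINTS[hint]))

print(cache_response("NT1")["L1"])              # don't allocate
print(cache_response("NT1", is_fp=True)["L3"])  # normal allocation
```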

System bus

The processor uses a multidrop, shared system bus to provide four-way glueless multiprocessor system support. No additional bridges are needed for building up to a four-way system. Systems with eight or more processors are designed through clusters of these nodes using high-speed interconnects. Note that multidrop buses are a cost-effective way to build high-performance four-way systems for commercial transaction-processing and e-business workloads. These workloads often have highly shared writeable data and demand high throughput and low latency on transfers of modified data between the caches of multiple processors.

In a four-processor system, the transaction-based bus protocol allows up to 56 pending bus transactions (including 32 read transactions) on the bus at any given time. An advanced MESI coherence protocol helps in reducing bus invalidation transactions and in providing faster access to writeable data. The cache-to-cache transfer latency is further improved by an enhanced defer mechanism, which permits efficient out-of-order data transfers and out-of-order transaction completion on the bus. A deferred transaction on the bus can be completed without reusing the address bus. This reduces data return latency for deferred transactions and efficiently uses the address bus. This feature is critical for scalability beyond four-processor systems.

The 64-bit system bus uses source-synchronous data transfer to achieve 266 Mtransfers/s, which enables a bandwidth of 2.1 Gbytes/s. The combination of these features makes the Itanium processor system a scalable building block for large multiprocessor systems.
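The 2.1-Gbyte/s figure follows directly from the quoted bus parameters:

```python
# Peak system-bus bandwidth: a 64-bit (8-byte) data bus moving
# 266 million transfers per second.
bus_bytes = 64 // 8
transfers_per_s = 266e6
bandwidth_gb_s = bus_bytes * transfers_per_s / 1e9
print(round(bandwidth_gb_s, 3))   # 2.128
```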

The Itanium processor is the first IA-64 processor and is designed to meet the demanding needs of a broad range of enterprise and scientific workloads. Through its use of EPIC technology, the processor fundamentally shifts the balance of responsibilities between software and hardware. The software performs global scheduling across the entire compilation scope, exposing ILP to the hardware. The hardware provides abundant execution resources, manages the bookkeeping for EPIC constructs, and focuses on dynamic fetch and control flow optimizations to keep the compiled code flowing through the pipeline at high throughput. The tighter coupling and increased synergy between hardware and software enable higher performance with a simpler and more efficient design.

Additionally, the Itanium processor delivers significant value propositions beyond just performance. These include support for 64 bits of addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and multiprocessor platforms. MICRO






Harsh Sharangpani was Intel's principal microarchitect on the joint Intel-HP IA-64 EPIC ISA definition. He managed the microarchitecture definition and validation of the EPIC core of the Itanium processor. He has also worked on the 80386 and i486 processors, and was the numerics architect of the Pentium processor. Sharangpani received an MSEE from the University of Southern California, Los Angeles, and a BSEE from the Indian Institute of Technology, Bombay. He holds 20 patents in the field of microprocessors.

Ken Arora was the microarchitect of the execution and pipeline control of the EPIC core of the Itanium processor. He participated in the joint Intel/HP IA-64 EPIC ISA definition and helped develop the initial IA-64 simulation environment. Earlier, he was a designer and architect on the i486 and Pentium processors. Arora received a BS degree in computer science and a master's degree in electrical engineering from Rice University. He holds 12 patents in the field of microprocessor architecture and design.

Direct questions about this article to Harsh Sharangpani, Intel Corporation, Mail Stop SC 12-402, 2200 Mission College Blvd., Santa Clara, CA 95052; [email protected].

