Chapter 4 Itanium EPIC Processor Architecture

1

ECE 371 Microprocessors

Herbert G. Mayer, PSU

Status 11/5/2015

For use at CCUT Fall 2015

2

Syllabus

Introduction
Intel® Itanium® Architecture
Data and Memory
Itanium Registers
Instruction Set Architecture ISA
Assembler Source Program
Appendix
Bibliography

3

Photo of Itanium 2 Processor

https://en.wikipedia.org/wiki/File:KL_Intel_Itanium2.jpg

4

Itanium Processor Block Diagram

5

Introduction

The Itanium® processor is Intel’s first published, commercial 64-bit computer product, launched in 2001 and co-developed with HP Corp. IPF stands for Itanium Processor Family

Published means: Intel was at the same time developing a competing 64-bit processor, the extended version of its old x86 architecture, as a secret backup to hedge its risk

64-bit means that the logical address range spans 2^64 different memory bytes, and natural integer objects are 64 bits wide

The exact format of data objects is described in section Data and Memory

During its development at Intel, the first generation of Itanium processors was internally code-named Merced

The family is now officially called IPF, for Itanium Processor Family; early in its development it was referred to as IA-64, for Intel 64-bit architecture, a name that later invited confusion with the 64-bit extension of x86

6

Introduction

Intel’s Itanium architecture is radically different from the widely used 32-bit IA-32 architecture

IA-32 should be referred to as the x86 architecture, lest one incorrectly infer today that it is restricted to 32-bit addresses and 32-bit integer types

That limitation no longer exists since Intel introduced 64-bit versions about ½ year after AMD’s extension of IA-32 to 64 bits; see also EM64T

Imagine how Intel felt when AMD, a company that had produced CPUs compatible with Intel’s chips, suddenly had a more advanced, attractive x86 CPU!

7

Intel® Itanium® Architecture

Interestingly, IA-32 object code is executable on Itanium processors

More interesting yet, even Hewlett-Packard PA-RISC code is executable on this novel 64-bit IPF processor

HP and Intel were strategic partners in the definition, development, and cost sharing of the IPF, with HP having initiated the development

Be cautious about performance inferences! Just because IA-32 object code is executable on IPF, one should not deduce that such code executes on IPF as fast as on an x86 processor!

8

Intel® Itanium® Architecture

IPF is Intel’s and HP’s first instance of the novel EPIC architecture

EPIC stands for Explicitly Parallel Instruction Computing. It is Intel’s first launched 64-bit architecture; the second was launched later (1Q04) with EM64T, the first 64-bit version of the old x86 architecture

HP already had a 64-bit version of its Precision Architecture (PA) RISC processor at the time of the Itanium launch

Explicit means that the assembly language programmer (or the smart compiler) bears the intellectual burden of taking advantage of the parallelism in the architecture; see ref [8]

It is not the processor that automatically exploits the numerous parallel computing modules; the microprocessor needs to be told!

9

Intel® Itanium® Architecture

As a consequence, compilers for IPF are highly complex; see Donald Knuth’s comment, ref [7]

Compiler complexity is not desirable, as it means more errors and decreased object code quality, something a new architecture should avoid

On the other hand, the IPF provides explicit architectural features that enable implementing highly optimizing compilers

A case in point is architectural support for software pipelined loops (SW PL)

Certain source constructs let the compiler emit SW PL loops that need no prologue and no epilogue

Absence of prologue and epilogue renders the object code not only more compact, but also faster

10

Intel® Itanium® Architecture

Parallel means an Itanium processor gains speed not solely via high clock rates, but via simultaneous execution of multiple operations in one clock cycle

Key concepts refined, or newly introduced, in IPF include: predication, branch prediction, branch elimination, conditional move, speculation, parallel comparisons, and a large register file

The first implementation of the new 64-bit Intel + HP Itanium architecture implemented only 44 physical address bits of the 64 logical address bits

11

Intel® Itanium® Architecture

With 44 bits, the total initial address range of the first Itanium HW was only about a millionth of the logical address range (2^44 of 2^64 bytes), but still about 4000 times larger than that of earlier 32-bit architectures

In its second generation, 56 physical bits of the 64-bit logical address space were implemented in HW

Product name of that new version: Itanium® 2

Short-term, no severe limitations were expected with restricted 56-bit addresses

This is still about 16 million times larger than a 32-bit address space

Integer type operands are of course a full 64 bits wide
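The ratios quoted on this slide follow directly from powers of two. The sketch below simply redoes that arithmetic, taking the bit widths from the text (44, 56, 64 and 32 bits); the constant and function names are made up for illustration.

```python
# Address-range arithmetic for the bit widths quoted in the text.
LOGICAL_BITS = 64      # logical address width
MERCED_BITS = 44       # physical bits in the first generation, per the text
ITANIUM2_BITS = 56     # physical bits in the second generation, per the text
IA32_BITS = 32         # classic 32-bit addressing

def address_range(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits

logical_vs_merced = address_range(LOGICAL_BITS) // address_range(MERCED_BITS)
merced_vs_ia32 = address_range(MERCED_BITS) // address_range(IA32_BITS)
itanium2_vs_ia32 = address_range(ITANIUM2_BITS) // address_range(IA32_BITS)

print(logical_vs_merced)   # 1048576: 44 bits cover about a millionth of 2**64
print(merced_vs_ia32)      # 4096: the "about 4000 times" in the text
print(itanium2_vs_ia32)    # 16777216: the "about 16 million times" in the text
```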

12

Intel® Itanium® Architecture

Unlike earlier parallel VLIW architectures, EPIC does not use a fixed width instruction encoding

Instead, operations can be combined to execute in parallel, from a single instruction to as many as desired

What is critical in EPIC is that all code is written assuming parallel semantics within a group (to be explained later), and sequential semantics across groups

To be able to run in parallel, the machine is built with multiple execution modules that can all work at the same time

This allows a natural architecture migration from, say, 6 HW modules executing on today’s Itanium to as many as can be crammed into a future silicon microprocessor a few years from now

13

Intel® Itanium® Architecture

To illustrate with a sample taken from ref [1], consider 2 memory operands a and b to be swapped:

    temp := a;   // a, b, temp are memory locations
    a := b;
    b := temp;

The semicolon operator ‘;’ implies sequential semantics. On a machine with parallel semantics, it would be sufficient to write:

    a := b,      // operand latching needed
    b := a;      // operand latching needed

Here the comma operator ‘,’ implies parallel semantics, similar to syntactic conventions in the programming language Algol-68

This source snippet is just a generic example; NOT a sample of the Itanium assembly language
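As a rough analogy only (Python is not an EPIC machine), tuple assignment evaluates its whole right-hand side before storing anything, which is what "operand latching" asks for. The three function names below are invented for this sketch.

```python
def swap_sequential(a, b):
    """Sequential semantics: a temporary is required."""
    temp = a
    a = b
    b = temp
    return a, b

def swap_parallel(a, b):
    """Parallel semantics: both right-hand operands are read (latched)
    before either destination is written."""
    a, b = b, a
    return a, b

def swap_without_latching(a, b):
    """The same two assignments executed one after the other lose a value."""
    a = b
    b = a
    return a, b

print(swap_sequential(1, 2))        # (2, 1)
print(swap_parallel(1, 2))          # (2, 1)
print(swap_without_latching(1, 2))  # (2, 2)
```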

14

Data & Memory

15

Data and Memory

Native data types of IPF resemble those of conventional 32-bit architectures, except for the longer 64-bit integer and unsigned formats

An extension over IA-32 object code is the IPF bundle

Data types include integer, unsigned, floating-point, and pointer

Integers are of different widths: byte, word, double-word, or quad-word precision

Length in bits as well as min and max values are listed below:

16

Data and Memory, Min Max

Type              Byte   Word   Double-word   Quad-word
Integer [bits]    8      16     32            64
Unsigned [bits]   8      16     32            64
Pointer [bits]    NA     NA     Comp. 32      64
Float [bits]      NA     NA     32, 64        64, 80

Type          Byte   Word      Double-word      Quad-word
Minint        -128   -32,768   -2,147,483,648   -9,223,372,036,854,775,808
Maxint        127    32,767    2,147,483,647    9,223,372,036,854,775,807
Minunsigned   0      0         0                0
Maxunsigned   255    65,535    4,294,967,295    18,446,744,073,709,551,615
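The min and max values in the table can be recomputed from the bit widths alone. This is a small sketch assuming two's complement for the signed types; the dictionary and function names are illustrative.

```python
WIDTHS = {"byte": 8, "word": 16, "double-word": 32, "quad-word": 64}

def signed_range(bits: int) -> tuple:
    """(min, max) of a two's-complement integer that is `bits` wide."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def unsigned_range(bits: int) -> tuple:
    """(min, max) of an unsigned integer that is `bits` wide."""
    return 0, (1 << bits) - 1

for name, bits in WIDTHS.items():
    lo, hi = signed_range(bits)
    _, umax = unsigned_range(bits)
    print(f"{name:12} int: {lo:,} .. {hi:,}   unsigned: 0 .. {umax:,}")
```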

17

Data and Memory

Negative numbers are represented in two’s complement format, with the sign bit in the most significant position

Floating-point data use the IEEE 754 standard

Bits representing integer values are numbered from 0 in the least significant position (rightmost position) to higher values

For example, the most significant bit in a double word is in position 31 (note the unusual word definition on Intel architectures: 2 bytes)

The maximum address on the first generation Itanium processor (Merced) was only 17,592,186,044,415, or 2^44 - 1. It grew in the second generation to 56 bits, and is now a full 64 bits long

18

Data and Memory

Bytes are stored in little-endian order by default

It is possible to select little- or big-endian order programmatically, by setting the be bit in the user mask, a special status register

The be bit (for big-endian) does not affect how instructions are stored or fetched from memory

Object code is always represented in little-endian order; programmer-selected endianness only impacts data

In little-endian order, the least significant byte of a data object is stored at the lowest address; conversely, in big-endian order the most significant byte is stored at the lowest address

19

Data and Memory

Data quad-word 0x1102030455060708 would be stored as follows

Data stored in 8 adjacent bytes in memory in little-endian order:

addr:    0      1      2      3      4      5      6      7
value:   0x08   0x07   0x06   0x55   0x04   0x03   0x02   0x11

Same int value 0x1102030455060708 stored in big-endian order:

addr:    0      1      2      3      4      5      6      7
byte:    byte7  byte6  byte5  byte4  byte3  byte2  byte1  byte0
value:   0x11   0x02   0x03   0x04   0x55   0x06   0x07   0x08
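The two byte layouts can be reproduced on any host with Python's `int.to_bytes`. This only illustrates the byte orders; it does not model the Itanium be bit.

```python
value = 0x1102030455060708

# Index 0 of each bytes object corresponds to the lowest memory address.
little = value.to_bytes(8, byteorder="little")
big = value.to_bytes(8, byteorder="big")

print(little.hex(" "))  # 08 07 06 55 04 03 02 11
print(big.hex(" "))     # 11 02 03 04 55 06 07 08

# Reading the bytes back with the matching order recovers the value.
print(hex(int.from_bytes(little, byteorder="little")))  # 0x1102030455060708
```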

20

Itanium Registers

The Itanium processor has 128 general registers (GR), 128 floating-point registers (FR), 64 single-bit predicate registers (PR), 8 branch registers (BR), and 128 application registers (AR)

In addition, there are Performance Monitor Data registers (PMD), processor identifiers (CPUID), a Current Frame Marker register (CFM), a user mask (UM), and an instruction pointer register (IP)

GRs, FRs, BRs, ARs, CPUIDs, IP, and PMDs are 64 bits wide

PRs are 1 bit wide, while the UM holds 6 and the CFM 38 bits; depicted below:

21

Itanium Register File

GR               FR               PR          BR             AR
gr0     63…0     fr0     63…0     pr0    0    br0   63…0     ar0    KR0
gr1     63…0     fr1     63…0     pr1    0    br1   63…0     . . .
gr2     63…0     fr2     63…0     pr2    0    br2   63…0     ar7    KR7
gr3     63…0     fr3     63…0     pr3    0    br3   63…0     . . .
gr4     63…0     fr4     63…0     pr4    0    br4   63…0     ar16   RSC
gr5     63…0     fr5     63…0     pr5    0    br5   63…0     ar17   BSP
. . .            . . .            . . .       br6   63…0     ar18   BSPSTORE
gr16    63…0     fr16    63…0     pr10   0    br7   63…0     ar19   RNAT
. . .            . . .            . . .                      . . .
                                              ip    63…0     ar21   FCR
gr126   63…0     fr126   63…0     pr62   0                   . . .
gr127   63…0     fr127   63…0     pr63   0    cfm   37…0     ar30   FDR
                                              User Mask      ar32   CCV
CPUID                                         um    5…0      ar36   UNAT
cpuid0  63…0     PMD                                         ar40   FPSR
cpuid1  63…0     pmd0    63…0                                ar44   ITC
. . .            pmd1    63…0                                ar65   LC
cpuidn  63…0     . . .                                       ar66   EC
                 pmdm    63…0                                ar127

22

Itanium Registers GR

The 128 GR registers are the common workhorses during computation

They contain integer values being computed

It is possible to use these integer values as machine addresses, thus GRs can be used as pointers in load and store operations

All machine instructions can refer to these registers, for reading and writing values

In addition to the 64 data bits, each GR has an associated NAT bit, which stands for Not A Thing

NAT is 1 if the associated register has not been initialized with valid data

23

Itanium Registers GR

NATs support speculation

For example, if a speculative load is issued but aborted before the value arrives in its destination GR, the NAT state records that fact

This enables integrity of the machine’s exception process

There are 2 groups of GR registers:

The first 32, GR0 through GR31, are visible to all software, and are used to hold globally computed, intermediate values

However, GR0 is read-only, providing the constant 0, 64 bits long

24

Itanium Registers GR

The next 96, GR32 to GR127, are used to implement a small but frequently used portion of the top of the run-time stack; i.e. they work like a special-purpose top-of-stack cache

These stack registers are made available to SW by allocation of a register stack frame, and include from 0 to 96 registers

Registers not used from this subset are inaccessible to general SW

The stack frame portion implemented via GRs is further partitioned into subsections, one meant to hold local registers, the other output registers, i.e. results of the current function call

25

Sample Stack Frame, Generic

26

Itanium Predicate Registers PR

Execution of most IPF instructions can be predicated by one of the PRs

Value 1 in the PR means: the operation can be completed normally

PR value 0 means the result will not be posted (committed), even if it has been computed already; i.e. there will be no stores and no impact on any AR of the machine

An exception, i.e. an instruction that cannot be predicated, is the loop operation

27

Itanium Predicate Registers

The PRs are also partitioned into 2 sections: PR0 through PR15 are static PRs

The other 48 are so-called rotating PRs

PR0 is an exceptional register: it can only be read, and its value is always 1, meaning the predicate is true; thus PR0 denotes unconditional execution

The 48 rotating PRs are used to hold stage predicates, used during software pipelining

SW PL is to be discussed in advanced computer architecture

28

Branch Registers BR

IPF instructions are grouped in bundles, which are 16-byte aligned byte sequences holding executable code. Hence their rightmost 4 address bits will always be 0 due to alignment; these 4 address bits don’t need to be stored explicitly

Execution of an indirect branch requires an explicit operand

On the Itanium architecture this operand is a branch register; a branch register BR holds the branch destination

The machine then loads the value of the referenced BR into the IP register and execution continues from there; IP stands for Instruction Pointer

Executing branch-related instructions is about the only way to directly affect the value in the instruction pointer, the register that holds the address of the next bundle to be executed

29

Current Frame Marker Register CFM

Note: Frame Marker is often referred to as Stack Frame, and its fixed portion as the Stack Marker

Each function has a specific stack frame associated with it, which is created at function invocation; it is cleared at function return

If all the relevant data of a function’s stack frame do fit, they are placed in the stack of general registers; else the overflowing data must reside in memory

Either way, the current frame marker (CFM) holds the frame marker for the function that is currently active

Generally, most functions have small stack frames

30

Current Frame Marker Register CFM

Layout of the CFM:

CFM bits    37 .. 32   31 .. 25   24 .. 18   17 .. 14   13 .. 7   6 .. 0
Field       rrb.pr     rrb.fr     rrb.gr     sor        sol       sof

Meaning of Bits in CFM:

Name     Bit Field   Meaning
sof      0..6        Total size of stack frame
sol      7..13       Size of local part of stack frame, in words
sor      14..17      Size of rotating portion of stack frame; the number of rotating registers is 8 times the sor value
rrb.gr   18..24      Register rename base for GRs
rrb.fr   25..31      Register rename base for FRs
rrb.pr   32..37      Register rename base for PRs
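To make the field layout concrete, the following sketch packs and unpacks a 38-bit CFM value using exactly the bit positions listed in the table. The helper names and the sample field values are made up for illustration; this is not an API of any real tool.

```python
# (lowest bit, width in bits) for each CFM field, taken from the table above.
CFM_FIELDS = {
    "sof": (0, 7),
    "sol": (7, 7),
    "sor": (14, 4),
    "rrb.gr": (18, 7),
    "rrb.fr": (25, 7),
    "rrb.pr": (32, 6),
}

def decode_cfm(cfm: int) -> dict:
    """Split a CFM value into its named fields."""
    return {name: (cfm >> low) & ((1 << width) - 1)
            for name, (low, width) in CFM_FIELDS.items()}

def encode_cfm(fields: dict) -> int:
    """Build a CFM value from named fields; unnamed fields stay 0."""
    cfm = 0
    for name, value in fields.items():
        low, width = CFM_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        cfm |= value << low
    return cfm

# Hypothetical frame: 24 registers in total, 16 of them local, sor = 1.
sample = encode_cfm({"sof": 24, "sol": 16, "sor": 1})
decoded = decode_cfm(sample)
print(decoded["sof"], decoded["sol"], 8 * decoded["sor"])  # 24 16 8
```

The last printed number applies the rule from the table: the count of rotating registers is 8 times the sor value.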

31

Application Registers AR

Application Registers – t.b.d.:

Register      Mnemonic     Description of register
ar0 – ar7     KR0 – KR7    Kernel registers 0 .. 7
ar8 – ar15                 Reserved
ar16                       t.b.d.

32

Instruction Pointer IP

IPF instructions are fetched in units of bundles, which are chunks of 16 bytes, or 128 bits

Bundles are stored bundle-aligned

The ip can address 18,446,744,073,709,551,616 different bytes (but only at bundle addresses)

The rightmost 4 bits of the ip thus will always be zero, due to the bundle alignment

Hence these 4 bits don’t need to be stored on the microprocessor silicon

33

Performance Monitor Data Register

These are architecture-provided resources that record the use of hardware modules

Their contents are read-only for SW

But contrary to the performance monitor registers on Intel Pentium architectures, they are user visible on Itanium

34

Itanium ISA: Instruction Set Architecture

35

Instruction Set Architecture ISA

Parallelism, Dependences, and Groups

Itanium instructions packaged in groups can execute in parallel; allows fast execution, if HW is available!

Assembly programmer or compiler may craft groups as large as desired; the performance consequence is:

All operations embedded in a single group can be executed simultaneously, in parallel, saving time over the equivalent sequential execution

The physical silicon angle of this is: Of all operations that could be executed in parallel only those are actually performed in parallel, for which there exist HW resources

E.g. on an Itanium® 2 implementation of IPF, there are 6 units available to operate in parallel

36


If fewer actions are enclosed in a group, some HW will idle

If a group contains more actions than there are HW units, then all HW elements are active, yet some degree of possible parallelism will be lost on that implementation; future HW implementations may execute that same object code faster due to the higher degree of parallelism

Parallel execution is not feasible if dependencies exist between instructions

On Itanium these dependencies are not resolved by the machine

It is the human programmer or the optimizer that explicitly tracks what can be done in parallel and what must be done in sequence. The machine just runs it; the goal: TO BE FAST!

37


If a result has to be computed first before it can be read somewhere else (memory or register), a true dependence exists; AKA data dependence; conventional to say “dependence”

On Itanium we call this a RAW (Read after Write) dependence

If a result has to be read first before it can be re-computed, a false dependence is created, AKA anti-dependence

On Itanium this is named WAR (Write after Read) dependency

If a result has to be computed first before it can be computed again, assuming that an intermediate reference is possible, output dependence is created

Itanium calls this third dependence: WAW (Write after Write) dependence

38


In all these cases, the prior operation has to complete, before the dependent can be started; e.g.:

    ld8 r14 = [r3]        // load GR14 with 8 bytes addressed by GR3
    add r15 = r14, r16    // integer sum into GR15, RAW dep

This is an example of RAW dependence, AKA true dependence

The loading of an 8-byte value into (8-byte) register GR14 must complete first, before the addition of the 2 long integer values, held in GR14 and GR16, can be started

Note the assembler register names: r14, and not gr14. This is Intel and HP assembly language convention!

Another assembler may use different conventions
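The three dependence kinds can be detected mechanically from the registers each instruction reads and writes. The sketch below is a simplification that looks at register names only and ignores memory operands; the `Instr` class and its fields are invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    text: str            # assembly text, for display only
    writes: frozenset    # registers written by the instruction
    reads: frozenset     # registers read by the instruction

def dependences(first: Instr, second: Instr) -> set:
    """Dependences that force `second` to wait for `first`."""
    found = set()
    if first.writes & second.reads:
        found.add("RAW")   # true (data) dependence
    if first.reads & second.writes:
        found.add("WAR")   # anti-dependence
    if first.writes & second.writes:
        found.add("WAW")   # output dependence
    return found

load = Instr("ld8 r14 = [r3]", frozenset({"r14"}), frozenset({"r3"}))
add = Instr("add r15 = r14, r16", frozenset({"r15"}), frozenset({"r14", "r16"}))

print(dependences(load, add))  # {'RAW'}
```

An empty result means the two instructions could share a group as far as registers are concerned.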

39

Instruction Set Architecture ISA

Assembly Language Format

Format of an Itanium assembler instruction:

In meta-syntax, [ and ] brackets mean that the bracketed portion of the instruction is optional

In assembly syntax, square bracket pairs [] express: indirection

Careful not to get confused by 2 different contexts!

[(pr)] mnemonic[.comp] dest = src1 [, src2 [, src3 ] ]

Meaning of the various assembly language fields:

40

Instruction Set Architecture ISA

Syntax     Name                 Meaning
(pr)       Predicate register   Used to predicate execution; if the value is 0 (false), the result is not committed; if 1 (true), the result is committed. pr0 is always 1, hence the associated instructions are executed unconditionally
mnemonic   Instruction          Name of the instruction; tells the assembler which operation to perform
comp       Completer            Further qualifies or completes the instruction specification. There may be multiple completers per instruction; not all instructions have a completer
dest       Destination          Destination of the specified instruction. Choices are: register or memory
src1       Source one           Source operand. Not all instructions require a source. Some instructions allow multiple sources. Sources may be immediate operands or registers. Memory can be a source via indirection (through a register)
src2       Source two           Ditto
src3       Source three         Ditto

41

Instruction Set Architecture ISA

Assembly Language Format

A sample assembly language instruction is shown next:

    (p0) add r5 = r4, r3, 1 // (p0) can be skipped

This is an integer add instruction that sums the integer values in GR4 and GR3, and also adds the integer literal 1

It assigns the sum to register GR5. Since the predicate register used is PR0, which is always true, the commit of the sum to register GR5 is unconditional, as if no predicate qualifier had been given

Predicate registers, when listed, are enclosed in ( ) parentheses

Not all instructions allow or need a completer. Typical completers are shown below

Some instructions allow multiple completers, notably the memory access instructions, and branch instructions

42

Instruction Set Architecture ISA

Completer   Meaning
.a          For “advanced” load; check later if successful
.c          Check
.clr        If advanced load was not successful, clear the reg
.nc         No clear
.s          Speculative; e.g. for load; NOT allowed for store!
.many       t.b.d.
.few        t.b.d.
.excl       t.b.d.

Many more: .equ, .unc, etc.

43

Instruction Set Architecture ISA

Itanium Bundle Format

Executable code on Itanium comes in units of bundles. A bundle consists of 3 instructions, all grouped with an associated template

Template completes the instruction specification and above all, defines group boundaries

Boundary is also known as a stop. Stop defines where one group ends and another group starts

If no stop is included in a template, this means that the bundle will be part of a larger group, consisting of more instructions in the next bundle

44

Instruction Set Architecture ISA

Itanium Bundle Format

Each instruction is 41 bits long; a template consumes 5 bits, with one template per bundle

With 3 instructions per bundle, the overall bundle length is 3 * 41 + 5 = 128 bits, fitting into 16 bytes; all bundles are bundle-aligned, which is easily accomplished once the first bundle resides on a mod-16 memory boundary

From then on all will be aligned on 16-byte boundaries

With the memory bus being 128 bits wide (or wider on future IPF implementations) and bundles being bundle-aligned, fetching instruction memory is fast, requiring one single transfer on the bus

45

Instruction Set Architecture ISA

Itanium Bundle Format

General layout of a bundle is shown next, with bits ordered from 0 through 127, increasing from right to left

The template serves as a means for the compiler to communicate additional information about instructions 0, 1, and 2, without which they could be ambiguous

One such key piece of information is the placement of an instruction group stop, written ;; in assembler

Bits     127 .. 87       86 .. 46        45 .. 5         4 .. 0
Field    instruction 2   instruction 1   instruction 0   template
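The bundle layout amounts to shifting three 41-bit fields and one 5-bit field into a 128-bit integer. The sketch below follows the bit positions given above; the function names are illustrative and the slot contents are arbitrary placeholder numbers, not real instruction encodings.

```python
SLOT_BITS = 41        # width of one instruction slot
TEMPLATE_BITS = 5     # width of the template field
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1

def pack_bundle(template: int, slot0: int, slot1: int, slot2: int) -> int:
    """Combine a template and 3 instruction slots into one 128-bit bundle."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
    return (template
            | (slot0 << TEMPLATE_BITS)                     # bits 45 .. 5
            | (slot1 << (TEMPLATE_BITS + SLOT_BITS))       # bits 86 .. 46
            | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)))  # bits 127 .. 87

def unpack_bundle(bundle: int) -> tuple:
    """Return (template, slot0, slot1, slot2) of a 128-bit bundle."""
    template = bundle & TEMPLATE_MASK
    slot0 = (bundle >> TEMPLATE_BITS) & SLOT_MASK
    slot1 = (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK
    slot2 = (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK
    return template, slot0, slot1, slot2

bundle = pack_bundle(0x10, 0x111, 0x222, 0x333)  # placeholder slot values
print(unpack_bundle(bundle) == (0x10, 0x111, 0x222, 0x333))  # True
print(len(bundle.to_bytes(16, byteorder="little")))          # 16
```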

46

Instruction Set Architecture ISA

Itanium Bundle Format

A group stop can occur after instruction 2, or 1, or 0, indicating that an earlier group must complete execution before another starts

But the Itanium bundle templates allow at most 2 stops in a bundle

If 3 stops are needed, a NOOP must be packed into one of the instructions, to effectively create 2 physical groups, with the third being the NOOP, whose execution order does not matter

Compiler-generated code performs this work-around automatically

47

Instruction Set Architecture ISA

Itanium Bundle Format

The template specifies which types of instructions are assembled into slot 0, 1, and 2

IPF instructions are partitioned into the following 6 types:

Type    Meaning
A       ALU: integer or memory unit
I       Non-ALU: integer unit
M       Memory unit
F       Floating-point unit
B       Branch unit
L + X   Extended unit, or branch unit

48

Instruction Set Architecture ISA

Itanium Bundle Format

Providing such information in the template speeds up instruction decoding, improving execution speed

A list with the Instruction Set Architecture (ISA) templates and embedded stops is shown next

Note: at most 2 stops in any of the formats

On an architecture that aims to have large groups, it seems logical to have few stops (max 2) per bundle

49

Instruction Set Architecture ISA

Template #   Type     Slot 0           Slot 1                Slot 2
 0 = 0x00    MII      Memory unit      Integer unit          Integer unit
 1 = 0x01    MII_     Memory unit      Integer unit          Integer unit ;;
 2 = 0x02    MI_I     Memory unit      Integer unit ;;       Integer unit
 3 = 0x03    MI_I_    Memory unit      Integer unit ;;       Integer unit ;;
 4 = 0x04    MLX      Memory unit      L unit?               Extended unit
 5 = 0x05    MLX_     Memory unit      L unit?               Extended unit ;;
 6 = 0x06    reserved
 7 = 0x07    reserved
 8 = 0x08    MMI      Memory unit      Memory unit           Integer unit
 9 = 0x09    MMI_     Memory unit      Memory unit           Integer unit ;;
10 = 0x0a    M_MI     Memory unit ;;   Memory unit           Integer unit
11 = 0x0b    M_MI_    Memory unit ;;   Memory unit           Integer unit ;;
12 = 0x0c    MFI      Memory unit      Floating-point unit   Integer unit
13 = 0x0d    MFI_     Memory unit      Floating-point unit   Integer unit ;;
14 = 0x0e    MMF      Memory unit      Memory unit           Floating-point unit
15 = 0x0f    MMF_     Memory unit      Memory unit           Floating-point unit ;;
16 = 0x10    MIB      Memory unit      Integer unit          Branch unit
17 = 0x11    MIB_     Memory unit      Integer unit          Branch unit ;;
18 = 0x12    MBB      Memory unit      Branch unit           Branch unit
19 = 0x13    MBB_     Memory unit      Branch unit           Branch unit ;;
20 = 0x14    reserved
21 = 0x15    reserved
22 = 0x16    BBB      Branch unit      Branch unit           Branch unit
23 = 0x17    BBB_     Branch unit      Branch unit           Branch unit ;;
24 = 0x18    MMB      Memory unit      Memory unit           Branch unit
25 = 0x19    MMB_     Memory unit      Memory unit           Branch unit ;;
26 = 0x1a    reserved
27 = 0x1b    reserved
28 = 0x1c    MFB      Memory unit      Floating-point unit   Branch unit
29 = 0x1d    MFB_     Memory unit      Floating-point unit   Branch unit ;;
30 = 0x1e    reserved
31 = 0x1f    reserved
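The type column encodes both the slot types and the stop positions: a letter names the unit for a slot, and an underscore marks a stop after the preceding slot. A small sketch that decodes this notation follows; the function name is made up, and the underscore convention is the one used in this table, not necessarily that of other documentation.

```python
def parse_template(name: str) -> tuple:
    """Split a template type such as 'MI_I_' into slot types and stops.

    Returns (slots, stops): slots is a list of unit letters for slots 0..2,
    stops lists the slot numbers that are followed by a group stop (;;).
    """
    slots = []
    stops = []
    for ch in name:
        if ch == "_":
            stops.append(len(slots) - 1)  # stop after the slot just read
        else:
            slots.append(ch)
    return slots, stops

print(parse_template("MII"))    # (['M', 'I', 'I'], [])
print(parse_template("MI_I_"))  # (['M', 'I', 'I'], [1, 2])
print(parse_template("M_MI"))   # (['M', 'M', 'I'], [0])
```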

50

Instruction Set Architecture ISA

Itanium Bundle Format

The difference between the above templates 0x00 and 0x01, both being MII type operations, is: after instruction 2 in template 0x01 there is a stop, while in template 0x00 there is none

In other words, the next bundle after the one for template 0x00 will belong to the same group, and a higher degree of parallelism will be possible there

51

Instruction Set Architecture ISA

Itanium Assembly Code

A group is a sequence of 1 or more instructions delimited by a stop. The first instruction in a whole program is thought to be preceded by a stop

Similarly, the last instruction of a complete program is thought to be followed by a stop

All instructions placed into a single group can be executed in parallel. Whether or not they will depends on the number of hardware resources available. In the initial Itanium architecture only 6 resources were available

In a later implementation, more HW resources may become available, thus potentially speeding up execution of the same old, unchanged Itanium code on a future generation

The ;; indicates to the assembler where one group ends and thus the next group starts

52

Instruction Set Architecture ISA

Itanium Assembly Code

Some assembly language instructions follow:

    cmp.eq p1, p2 = r33, r34

This checks general purpose registers 33 and 34 for equality; if equal, predicate register 1 is set to true and predicate register 2 to false. Otherwise p1 is set to false and p2 to true. A more complicated case is:

    (p3) cmp.eq.unc p1, p2 = r33, r34

This checks whether predicate register 3 is true at the start. If so, and if registers GR33 and GR34 are equal, p1 is set to true and p2 to false; if they are unequal, the reverse

Else, i.e. if p3 is false a priori, predicate registers 1 and 2 are both set to false
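The behavior of the two compare forms can be mimicked with a small simulation over a list of 64 predicate bits. This is a sketch of the semantics described on this slide, not an emulator; the function names are invented, and the treatment of the plain form under a false qualifying predicate (targets left unchanged) is stated here as an assumption that goes beyond the slide text.

```python
def write_pr(pr: list, index: int, value: bool) -> None:
    """PR0 is read-only and always true, so writes to it are dropped."""
    if index != 0:
        pr[index] = value

def compare_eq(pr: list, qp: int, p_true: int, p_false: int,
               a: int, b: int, unc: bool = False) -> None:
    """Simulate '(qp) compare-equal p_true, p_false = a, b', optionally .unc."""
    if pr[qp]:
        write_pr(pr, p_true, a == b)
        write_pr(pr, p_false, a != b)
    elif unc:
        # .unc: a false qualifying predicate clears both targets.
        write_pr(pr, p_true, False)
        write_pr(pr, p_false, False)
    # Assumed for the plain form: a false qualifying predicate changes nothing.

pr = [True] + [False] * 63   # pr[0] is the always-true predicate
pr[1] = pr[2] = True         # stale values from earlier code
compare_eq(pr, 3, 1, 2, 5, 5, unc=True)   # p3 is false, so both are cleared
print(pr[1], pr[2])          # False False
```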

53

Assembler Source Program With & Without Stack Unwind Operations

From ref [8]

54

Assembler for Hello World, With

// hello_world.c assembly with unwind directive
// sample taken from ref [8]
// page 1/3
.file "hello.c"
.pred.safe_across_calls p1-p5, p16-p63
.section .rdata, "a", "progbits"
.align 8
.STRING1:
stringz "Hello World!!!\n"
.text
.align 16
.global hello#
.proc hello#
hello:
.prologue
.save ar.pfs, r34

55

Assembler for Hello World, With

// hello_world.c assembly with unwind directive
// sample taken from ref [8]
// page 2/3
alloc r34 = ar.pfs, 0, 4, 1, 0
.vframe r35
mov r35 = r12
.save rp, r33
mov r33 = b0 // load branch register into GR33
.body
addl r36 = @ltoff(.STRING1), gp
;;
ld8 r36 = [r36]
mov r32 = r1
br.call.sptk.many b0 = printf# // b0!
;;

56

Assembler for Hello World, With

// hello_world.c assembly with unwind directive
// sample taken from ref [8]
// page 3/3
mov r1 = r32
mov ar.pfs = r34
mov b0 = r33 // restore branch register
.restore sp
mov r12 = r35
br.ret.sptk.many b0
.endp hello#
.global printf#
.type printf#, @function

57

Assembler for Hello World, Without

// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// page 1/3
// The string is defined in the read only data section
.section .rdata, "a", "progbits"
.align 8
.STRING1:
stringz "Hello World!!!\n"
// definition of function hello is in text section
// Registers to be saved in local registers:
// gp = r1 - loc0 = r32
// rp = b0 - loc1 = r33
// ar.pfs - loc2 = r34
// sp = r12 - loc3 = r35

58

Assembler for Hello World, Without

// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// page 2/3
.text
.global hello
.proc hello
hello:
alloc loc2 = ar.pfs, 0, 4, 1, 0
mov loc3 = sp
mov loc1 = b0 // save branch register b0
addl out0 = @ltoff(.STRING1), gp
;;
ld8 out0 = [out0] // group of 3 instructions
mov loc0 = gp
br.call.sptk.many b0 = printf
;;

59

Assembler for Hello World, Without

// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// page 3/3
mov gp = loc0
mov ar.pfs = loc2
mov b0 = loc1
mov sp = loc3
br.ret.sptk.many b0
.endp hello
.global printf
.type printf, @function

60

Appendix: Some Definitions

61

Definitions

Branch Elimination

Replacing object code that has conditional branches with code that has a straightforward execution path, lacking branches

The second version with branches eliminated must be semantically equivalent to the original code with branches

Everything else being equal, the version without branches generally executes faster due to fewer cache misses

62

Definitions

Bundle

Group of 3 instructions plus a template, which all fit into a 16-byte long, 16-byte aligned section of instruction memory on Itanium

Total number of bits = 128

63

Definitions

Conditional Move

Move instruction that transfers bits from source to destination, but only if an associated condition is true

Otherwise the instruction operates like a noop

Such a move can serve as a special case of branch elimination. For example, the C source construct:

if ( a > 0 ) x = 99; -- HL source program

could be mapped into the conditional move:

cmov x, #99, a, #0, gt -- hypothetical asm

which has no branches. Source operand #99 is moved into memory location x only if the > condition holds between operands a and integer literal 0

64

Definitions

Endian, Endianness

A convention that defines in which order the bytes of a multi-byte data object are addressed

Can be programmed on Itanium with the be bit

If the higher-address byte holds the higher-valued (more significant) part, we call this little-endian ordering, typical on Intel x86 architecture

The other way around we call big-endian ordering, typical on IBM 370 architecture

65

Definitions

EPIC

Explicitly Parallel Instruction Computing, with IPF being the first commercial architecture that implements EPIC

Note IPF’s ability to also execute old Intel x86 and old HP PA object code

66

Definitions

Epilogue

When the steady state of a software pipelined loop completes, there may be operands yet to be used and operations yet to be computed that did not fit into the steady state

These last operands must be consumed, some even be generated during the epilogue, and ultimately the pipeline must be drained

This is accomplished in the object code after the steady state, and that portion of code is called the epilogue

See also prologue

67

Definitions

Group

A sequence of one or more instructions delimited by stops; the stops are encoded in the bundle templates

A group may end inside a bundle or span several bundles

The stop means that the hardware cannot start executing any subsequent group until the current group has completed

Syntax notation for stop in Itanium assembler is the double-semicolon ;;

68

Definitions

Parallel Comparison

A composite source program condition of the form:

    ( ( a > b ) && ( c <= d ) )

requires multiple steps to compute a boolean predicate

Generally, on a sequential architecture these multiple steps are combined via explicit instructions for anding and oring, or else the flow of control of execution selects a matching true label. All this takes time

The Itanium processor allows parallel evaluation of certain composite Boolean expressions in one single step

The result can be used as a predicate in subsequent instructions. Notice that such combined Boolean expressions must be side-effect free

Is not equivalent to C’s short-circuit evaluation of complex boolean expressions!

69

Definitions

Parallel Comparison, Cont’d

For example, another complex boolean expression ( fun( j, k ) && ( i < MAX ) ) cannot be mapped into a parallel EPIC comparison

One operand is a function call fun( j, k ), which may take a large number of parameters and may have a side-effect on one of the other operands, for example “i”, which is yet to be compared

This type of boolean expression is mapped into sequential code
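The difference between a side-effect free condition and one that contains a call can be sketched in C. All names here (both_hold, shared_i, fun, sequential, eager) are hypothetical and chosen only for this illustration; the & operator stands in for evaluating both comparisons independently, which is only an approximation of what the hardware does.

```c
#include <stdbool.h>

enum { MAX = 10 };

/* Side-effect free operands: both comparisons may be evaluated
   independently, in any order, and then combined. */
bool both_hold(int a, int b, int c, int d)
{
    bool p1 = (a > b);
    bool p2 = (c <= d);
    return p1 & p2;              /* & evaluates both sides, unlike && */
}

int shared_i = 0;                /* plays the role of "i" on the slide */

/* Hypothetical function with a side-effect on shared_i. */
bool fun(int j, int k)
{
    shared_i += 1;
    return j < k;
}

/* Sequential C semantics: fun() runs first, then the updated
   shared_i is compared. */
bool sequential(int j, int k)
{
    return fun(j, k) && (shared_i < MAX);
}

/* Evaluating the comparison independently of the call reads the
   value of shared_i from before the side-effect. */
bool eager(int j, int k)
{
    bool p2 = (shared_i < MAX);
    bool p1 = fun(j, k);
    return p1 & p2;
}
```

With shared_i set to 9, sequential( 1, 2 ) yields false because the call raises shared_i to 10 before the comparison, while eager( 1, 2 ) yields true. This mismatch is why such an expression has to stay sequential.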

70

Definitions

Predication

Is the association of a boolean condition with the execution of an instruction sequence. This allows the following:

Two instruction streams can be executed in parallel, clearly requiring multiple hardware modules, which EPIC provides

Both streams have a predicate associated with their operations. Only the stream with the true predicate is actually retired; the other is aborted and ignored

Abort can happen as soon as the predicate is known. This means the computation of the predicate can proceed in parallel with the execution of the two code streams, but must complete by the time these two streams wait to learn which of them is the winner

An ISA with predication requires instruction bits to name the predicate to use, and to select which direction (true or false) enables execution

Also, the discarded code path must not cause any side-effect, such as a write to memory!
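The idea of computing both paths and keeping only the one whose predicate is true can be sketched in plain C. This is a software analogy under stated assumptions, not Itanium code: the function names are hypothetical, and the arithmetic select at the end merely models the hardware retiring one stream and discarding the other.

```c
#include <stdint.h>

/* Branching form: only one path is executed. */
int64_t abs_diff_branch(int32_t a, int32_t b)
{
    if (a > b) {
        return (int64_t)a - b;
    } else {
        return (int64_t)b - a;
    }
}

/* Predicated form: the predicate and both paths are computed
   independently; the predicate then selects the surviving result.
   Neither path writes to memory, so discarding one is harmless. */
int64_t abs_diff_predicated(int32_t a, int32_t b)
{
    int64_t p        = (a > b);          /* predicate: 1 or 0     */
    int64_t then_val = (int64_t)a - b;   /* path guarded by p     */
    int64_t else_val = (int64_t)b - a;   /* path guarded by not p */
    return p * then_val + (1 - p) * else_val;
}
```

Both functions return the same value for every input pair; the predicated form simply contains no branch between the two paths.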

71

Definitions

Prologue

Before a software pipelined loop body can be initiated, hardware resources (e.g. registers) must be initialized; we say the loop must be primed

This is accomplished in the object code before the steady state, called the prologue

See also epilogue

72

Definitions

Register File

The IPF has a rich set of registers

This includes 128 general purpose registers (for integer operations), 128 floating-point, 64 predicate, 8 branch, and 128 so-called application registers

Also a variety of special purpose registers is visible; visible means accessible by the assembly language program

These include a user mask, stack marker (frame marker), ip, processor id, and performance monitoring registers

73

Definitions

Speculation

If it is suspected, but not certain, that operand o will be used in the future, and this operand is not readily available (not yet in a high-speed register), and it takes long to fetch o, a processor may initiate the fetch well before o is actually used

Advantage: by the time o is needed, it is already available without delay

Disadvantage: if the flow of control never reaches the place where o was thought to be needed, then the speculative fetch was superfluous

The speculative fetch may still be meaningful if a) no side-effects occurred that are harmful to program correctness, and b) the hardware resource required to fetch o was idle anyway; then there is no loss!
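A speculative fetch can be modeled in C as a load that is started before it is known whether the operand is needed, and that records a failure instead of faulting. The type spec_result and the functions spec_load and use_if_needed are hypothetical names for this sketch. The deferred flag is an assumption that loosely mirrors how a failed speculative load can be marked and checked later; here the recovery path simply returns a fallback value, which is a simplification.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of a speculative load. */
typedef struct {
    long value;
    bool deferred;   /* true if the load could not be performed */
} spec_result;

/* Speculative load: never faults; a bad address is only recorded. */
spec_result spec_load(const long *p)
{
    spec_result r = { 0, true };
    if (p != NULL) {
        r.value = *p;
        r.deferred = false;
    }
    return r;
}

long use_if_needed(bool needed, const long *p, long fallback)
{
    /* The fetch of operand o is initiated early, before it is
       known whether o will be used. */
    spec_result o = spec_load(p);

    /* ... independent work could overlap with the fetch here ... */

    if (!needed) {
        return fallback;   /* fetch was superfluous, but harmless */
    }
    if (o.deferred) {
        return fallback;   /* simplified recovery path */
    }
    return o.value;        /* o is available without further delay */
}
```

When needed is false, the early load has no visible effect on the result, which matches condition a) above: a superfluous speculative fetch must not harm program correctness.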

74

Definitions

Steady State

The software pipelined object code that is executed repeatedly, after the prologue has run and before the epilogue becomes active, is called the steady state

Each iteration of the Steady State makes some progress toward multiple iterations of the original source loop

See also prologue and epilogue
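The three phases of a software pipelined loop can be sketched by hand in C for a loop that computes y[i] = x[i] * 3 + 1. The split into two stages (multiply, then add and store) and the function names are assumptions made for this illustration; a real compiler would choose the stages and generate the phases itself.

```c
#include <stddef.h>

/* Original source loop. */
void scale_plain(const int *x, int *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = x[i] * 3 + 1;
    }
}

/* Hand-pipelined version with two stages per source iteration. */
void scale_pipelined(const int *x, int *y, size_t n)
{
    if (n == 0) {
        return;
    }

    /* Prologue: prime the pipeline with stage 1 of iteration 0. */
    int t = x[0] * 3;

    /* Steady state: each pass completes stage 2 of iteration i-1
       and starts stage 1 of iteration i. */
    for (size_t i = 1; i < n; i++) {
        y[i - 1] = t + 1;
        t = x[i] * 3;
    }

    /* Epilogue: drain the pipeline with stage 2 of the last iteration. */
    y[n - 1] = t + 1;
}
```

Each pass of the steady state advances two different source iterations at once, and both functions produce identical output arrays.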

75

Definitions

Syllable

Is the instruction-only portion of a bundle

A bundle always holds 3 instructions plus a template, the template specifying additional necessary information about the instructions

The instruction alone, without the needed template information, is a syllable
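The relation between bundle, template and syllable can be sketched as bit-field extraction in C. The field widths used here (a 128-bit bundle with a 5-bit template in the lowest bits, followed by three 41-bit instruction slots) are taken from the Itanium architecture manuals rather than from the slide text. The bundle struct and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)   /* 41 one-bits */

/* Hypothetical container: a 128-bit bundle as two 64-bit halves. */
typedef struct {
    uint64_t lo;   /* bundle bits  0..63  */
    uint64_t hi;   /* bundle bits 64..127 */
} bundle;

/* Template: bundle bits 0..4. */
unsigned bundle_template(bundle b)
{
    return (unsigned)(b.lo & 0x1F);
}

/* Syllable in slot 0 (bits 5..45), 1 (bits 46..86) or 2 (bits 87..127). */
uint64_t bundle_syllable(bundle b, int slot)
{
    assert(slot >= 0 && slot < 3);
    switch (slot) {
    case 0:
        return (b.lo >> 5) & SLOT_MASK;
    case 1:
        /* 18 bits come from lo, the remaining 23 bits from hi */
        return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    default:
        return b.hi >> 23;
    }
}

/* Inverse operation: assemble a bundle from template and syllables. */
bundle bundle_pack(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2)
{
    bundle b;
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    b.lo = (uint64_t)(tmpl & 0x1F) | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);
    return b;
}
```

Packing three syllables with bundle_pack and reading them back with bundle_syllable returns the original values, which shows that each syllable is recoverable on its own while the template is shared by the whole bundle.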

76

Bibliography

1. Triebel, Walter: “IA-64 Architecture for Software Developers”, Intel Press © 2000, 308 pages

2. http://www.intel.com/design/itanium2/manuals/25110901.pdf

3. http://h21007.www2.hp.com/portal/StaticDownload?attachment_ciid=c2d2e0aecd2b7110VgnVCM100000275d6e10RCRD&ciid=ce1fd701521c7110VgnVCM100000275d6e10RCRD

4. http://www.intel.com/design/itanium/downloads/245320.htm

5. http://www.intel.com/design/itanium/manuals/iiasdmanual.htm

6. http://download.intel.com/design/Itanium2/manuals/25111003.pdf

7. Donald Knuth: “Interview with Donald Knuth” 2008-04-25

8. Intel® Itanium® Architecture Assembly Reference Guide, © 2002, Intel order number 248801-004, at http://developer.intel.com
