ECE 371 Microprocessors
Chapter 4: Itanium EPIC Processor Architecture
Herbert G. Mayer, PSU
Status 11/5/2015
For use at CCUT Fall 2015
2
Syllabus
Introduction
Intel® Itanium® Architecture
Data and Memory
Itanium Registers
Instruction Set Architecture ISA
Assembler Source Program
Appendix
Bibliography
4
Itanium Processor Block Diagram
5
Introduction
The Itanium® processor is Intel’s first published, commercial 64-bit computer product, launched in 2001 and co-developed with HP Corp. IPF stands for Itanium Processor Family
Published means: at the same time Intel was quietly developing a competing 64-bit processor, the extended version of its older x86 architecture, as a secret backup to hedge its risk
64-bit means that the logical address range spans 2^64 different memory bytes, and natural integer objects are 64 bits wide
The exact format of data objects is described in section Data and Memory
During its development at Intel, the first generation of Itanium processors was internally code-named Merced
The family is now officially called IPF, for Itanium Processor Family, while early in its development it was referred to as IA-64, for Intel 64-bit architecture; that name was later easily confused with the 64-bit extension of x86
6
Introduction
Intel’s Itanium architecture is radically different from the widely used 32-bit IA-32 architecture
IA-32 is better referred to as the x86 architecture, lest one incorrectly infer today that it is restricted to 32-bit addresses and 32-bit integer types
That limitation no longer exists since the introduction of 64-bit versions, about half a year after AMD’s extension of IA-32 to 64 bits; see also EM64T
Imagine how Intel felt when AMD, the company that had produced CPUs compatible with Intel’s chips, suddenly had a more advanced, attractive x86 CPU!
7
Intel® Itanium® Architecture
Interestingly, IA-32 object code is executable on Itanium processors
More interesting yet, even Hewlett-Packard PA-RISC code is executable on this novel 64-bit IPF processor
HP and Intel were strategic partners in the definition, development, and cost sharing of the IPF, with HP having initiated the development
Be cautious about performance inferences! Just because IA-32 object code is executable on IPF, one should not deduce that such code executes on IPF as fast as on an x86 processor!
8
Intel® Itanium® Architecture
IPF is Intel’s and HP’s first instance of the novel EPIC architecture
EPIC stands for Explicitly Parallel Instruction Computing. It is Intel’s first launched 64-bit architecture; the second was launched later (Q1 2004) with EM64T, Intel’s first 64-bit version of the old x86 architecture
HP already had a 64-bit version of its Precision Architecture (PA) RISC processor at the time of the Itanium launch
Explicit means that the assembly language programmer (or the smart compiler) bears the intellectual burden of taking advantage of the parallelism in the architecture; see ref [8]
It is not the processor that automatically exploits the numerous parallel computing modules; the microprocessor needs to be told!
9
Intel® Itanium® Architecture
As a consequence, compilers for IPF are highly complex; see Donald Knuth’s comment, ref [7]
Compiler complexity is not desirable, as it means more errors and decreased object code quality, something a new architecture should avoid
On the other hand, the IPF provides explicit architectural features that enable implementing highly optimizing compilers
A case in point is architectural support for software pipelined loops (SW PL)
Certain source constructs let the compiler emit SW PL loops that need no prologue and epilogue
Absence of prologue and epilogue renders the object code not only more compact, but also faster
10
Intel® Itanium® Architecture
Parallel means an Itanium processor gains speed not solely via high clock rates, but via simultaneous execution of multiple operations in one clock cycle
Key concepts refined, or newly introduced, in IPF include: predication, branch prediction, branch elimination, conditional move, speculation, parallel comparisons, and a large register file
The first implementation of the new 64-bit Intel + HP Itanium architecture implemented only 44 physical bits of the 64 logical address bits
11
Intel® Itanium® Architecture
With 44 bits, the total initial address range of the first Itanium HW was only about a millionth of the logical address range, but still about 4000 times larger than that of the earlier 32-bit architecture
In its second generation, 56 physical bits of the 64-bit logical address space were implemented in HW
Product name of that new version: Itanium® 2
Short-term, no severe limitations were expected with restricted 56-bit addresses
Still about 16 million times larger than the 32-bit addressing space
Integer type operands are of course full 64 bits wide
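The ratios quoted above follow directly from powers of two. The short sketch below recomputes them; the bit counts are the ones stated in the slides, while the constant and function names are only illustrative.

```python
# Address-space sizes discussed above; bit counts are taken from the slide text.
LOGICAL_BITS = 64
FIRST_GEN_BITS = 44    # first Itanium implementation, per the text
SECOND_GEN_BITS = 56   # Itanium 2, per the text
IA32_BITS = 32


def address_space(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits


ratio_64_to_44 = address_space(LOGICAL_BITS) // address_space(FIRST_GEN_BITS)
ratio_44_to_32 = address_space(FIRST_GEN_BITS) // address_space(IA32_BITS)
ratio_56_to_32 = address_space(SECOND_GEN_BITS) // address_space(IA32_BITS)

print(ratio_64_to_44)  # 1048576, so 44 bits cover about a millionth of 2^64
print(ratio_44_to_32)  # 4096, the "about 4000 times" figure
print(ratio_56_to_32)  # 16777216, the "about 16 million times" figure
```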
12
Intel® Itanium® Architecture
Unlike earlier parallel VLIW architectures, EPIC does not use a fixed width instruction encoding
Instead, operational functions can be combined to operate in parallel, from a single instruction to as many instructions as desired
What is critical in EPIC is that all code is written assuming parallel semantics within a group (to be explained later), and sequential semantics across groups
To be able to run in parallel, the machine is built with multiple execution modules that can all work at the same time
This allows a natural architecture migration from, say, 6 HW modules executing on today’s Itanium, to as many as can be crammed into a future silicon microprocessor a few years from now
13
Intel® Itanium® Architecture
To illustrate with a sample taken from ref [1], consider 2 memory operands a and b to be swapped
temp := a; // a, b, temp are memory locs
a := b;
b := temp;
The semicolon operator ‘;’ implies sequential semantics. On a machine with parallel semantics, it would be sufficient to write
a := b, // operand latching needed
b := a; // operand latching needed
Here the comma operator ‘,’ implies parallel semantics, similar to syntactic conventions in the programming language Algol-68
This source snippet is just a generic example; NOT a sample of the Itanium assembly language
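The difference between sequential and parallel semantics can be tried out in a high-level language. The sketch below is only an analogy in Python, not Itanium code: Python's tuple assignment evaluates the whole right-hand side before storing anything, which plays the role of the operand latching mentioned above. All function names are made up for this illustration.

```python
def swap_sequential(a, b):
    """Swap with sequential semantics: a temporary is required."""
    temp = a
    a = b
    b = temp
    return a, b


def swap_naive_sequential(a, b):
    """Sequential execution of 'a := b; b := a' without a temporary loses a."""
    a = b
    b = a
    return a, b


def swap_parallel(a, b):
    """Both right-hand sides are read (latched) before either store happens."""
    a, b = b, a
    return a, b


print(swap_sequential(1, 2))        # (2, 1)
print(swap_naive_sequential(1, 2))  # (2, 2)
print(swap_parallel(1, 2))          # (2, 1)
```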
14
Data & Memory
15
Data and Memory
Native data types of IPF resemble those of conventional 32-bit architectures, except for the longer 64-bit integer and unsigned formats
An extension over IA-32 object code is the IPF bundle
Data types include integer, unsigned, floating-point, and pointer
Integers are of different widths: byte, word, double-word, or quad-word precision
Length in bits as well as min and max values are listed below:
16
Data and Memory, Min Max
Type              Byte   Word   Double-word   Quad-word
Integer [bits]    8      16     32            64
Unsigned [bits]   8      16     32            64
Pointer [bits]    NA     NA     Comp. 32      64
Float [bits]      NA     NA     32, 64        64, 80

Type           Byte   Word      Double-word      Quad-word
Min int        -128   -32,768   -2,147,483,648   -9,223,372,036,854,775,808
Max int        127    32,767    2,147,483,647    9,223,372,036,854,775,807
Min unsigned   0      0         0                0
Max unsigned   255    65,535    4,294,967,295    18,446,744,073,709,551,615
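The minimum and maximum values in the table follow from the bit widths. A small sketch, assuming two's complement for the signed types (as the next slide states); the dictionary and function names are illustrative only.

```python
# Bit widths of the four integer precisions named in the table above.
WIDTHS = {"byte": 8, "word": 16, "double-word": 32, "quad-word": 64}


def signed_range(bits: int) -> tuple:
    """(min, max) of a two's complement integer with `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_range(bits: int) -> tuple:
    """(min, max) of an unsigned integer with `bits` bits."""
    return 0, (1 << bits) - 1


for name, bits in WIDTHS.items():
    lo, hi = signed_range(bits)
    _, umax = unsigned_range(bits)
    print(f"{name:12} min int {lo:,}  max int {hi:,}  max unsigned {umax:,}")
```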
17
Data and Memory
Negative numbers are represented in two’s complement format, with the sign-bit in the most-significant position
Floating-point data use the IEEE 754 standard
Bits representing integer values are numbered from 0 in the least significant position (rightmost position) to higher values
For example, the most significant bit in a double word is in position indexed 31 (Note the unusual word definition on Intel architectures: 2 bytes)
Maximum address on the first generation Itanium processor (Merced) was only 17,592,186,044,415 or 2^44 - 1. It grew in the second generation to 56 bits, and is now a full 64 bits long
18
Data and Memory
Bytes are stored in little-endian order by default
It is possible to programmatically select little- or big-endian order, by setting the be bit in the user mask, a special status register
The be bit (for big-endian) does not affect how instructions are stored or fetched from memory
Object code is always represented in little-endian order; programmer selected endianness only impacts data
In little-endian order, the least significant byte of a data object is stored in the byte with the lowest address; conversely for big-endian order
19
Data and Memory
Data quad-word 0x1102030455060708 would be stored:
Data stored in 8 adjacent bytes in memory in little-endian order:
addr: 0   addr: 1   addr: 2   addr: 3   addr: 4   addr: 5   addr: 6   addr: 7
0x08      0x07      0x06      0x55      0x04      0x03      0x02      0x11
Same int value 0x1102030455060708 stored in big-endian order:
byte7     byte6     byte5     byte4     byte3     byte2     byte1     byte0
0x11      0x02      0x03      0x04      0x55      0x06      0x07      0x08
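The two byte orders can be reproduced with Python's built-in integer conversion. This is a sketch of the byte layout only; it says nothing about how the be bit is set on a real Itanium system.

```python
value = 0x1102030455060708

# Byte sequences in increasing address order (index 0 is the lowest address).
little = value.to_bytes(8, byteorder="little")
big = value.to_bytes(8, byteorder="big")

print(little.hex(" "))  # 08 07 06 55 04 03 02 11
print(big.hex(" "))     # 11 02 03 04 55 06 07 08

# Reading the bytes back with the matching order recovers the value.
assert int.from_bytes(little, byteorder="little") == value
assert int.from_bytes(big, byteorder="big") == value
```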
20
Itanium Registers
The Itanium processor has 128 general registers (GR), 128 floating-point registers (FR), 64 single-bit predicate registers (PR), 8 branch registers (BR), and 128 application registers (AR)
In addition, there are Performance Monitor Data registers (PMD), processor identifiers (CPUID), a Current Frame Marker register (CFM), a user mask (UM), and an instruction pointer register (IP)
GRs, FRs, BRs, ARs, CPUIDs, IP, and PMDs are 64 bits wide
PRs are 1 bit wide, while the UM holds 6 and the CFM 38 bits; depicted below:
21
Itanium Register File
GR:    gr0 … gr127, bits 63…0 each
FR:    fr0 … fr127, bits 63…0 each
PR:    pr0 … pr63, bit 0 each
BR:    br0 … br7, bits 63…0 each
AR:    ar0 … ar127; named entries: ar0 to ar7 KR0 to KR7, ar16 RSC, ar17 BSP, ar18 BSPSTORE, ar19 RNAT, ar21 FCR, ar30 FDR, ar32 CCV, ar36 UNAT, ar40 FPSR, ar44 ITC, ar65 LC, ar66 EC
ip:    bits 63…0
cfm:   bits 37…0
User Mask um: bits 5…0
CPUID: cpuid0 … cpuidn, bits 63…0 each
PMD:   pmd0 … pmdm, bits 63…0 each
22
Itanium Registers GR
The 128 GR registers are the common workhorses during computation
They contain integer values being computed
It is possible to use these integer values as machine addresses, thus GRs can be used as pointers in load- and store-operations
All machine instructions can refer to these registers, for reading and writing values
In addition to the 64 data bits, each GR has an associated NAT bit, which stands for Not A Thing
NAT is 1 if the associated register has not been initialized with valid data
23
Itanium Registers GR
NATs support speculation
For example, if a speculative load is issued but aborted before the value arrives in its destined GR, the NAT state records that fact
This enables integrity of the machine’s exception process
There are 2 groups of GR registers:
The first 32, GR0 through GR31, are visible to all software, and are used to hold globally computed, intermediate values
However, GR0 is read-only, providing the constant 0, 64 bits long
24
Itanium Registers GR
The next 96, GR32 to GR127, are used to implement a small but frequently used portion of the top of the run-time stack; i.e. they work like a special-purpose top-of-stack cache
These stack registers are made available to SW by allocation of a register stack frame, which includes from 0 to 96 registers
Registers not used from this subset are inaccessible to general SW
The stack frame portion implemented via GRs is further partitioned into subsections, one meant to hold local registers, the other output registers, i.e. values the current function passes on to the functions it calls
25
Sample Stack Frame, Generic
26
Itanium Predicate Registers PR
Execution of most IPF instructions can be predicated by one of the PRs
Value 1 in the PR means: the operation can be completed normally
PR value 0 means the result will not be posted (committed), even if it has been computed already. I.e. there will be no stores and no impact on any AR of the machine
An example of an instruction that cannot be predicated is the loop operation
27
Itanium Predicate Registers
The PRs are also partitioned into 2 sections:
PR0 through PR15 are static PRs
The other 48 are so-called rotating PRs
PR0 is an exceptional register: it can only be read, and its value is always 1, meaning the predicate is true; thus PR0 denotes unconditional execution
The 48 rotating PRs are used to hold stage predicates, used during software pipelining
SW PL to be discussed in advanced computer architecture
28
Branch Registers BR
IPF instructions are grouped in bundles, which are 16-byte aligned byte sequences holding executable code. Hence their rightmost 4 address bits will always be 0 due to alignment; these 4 address bits don’t need to be stored explicitly
Execution of an indirect branch requires an explicit operand
On the Itanium architecture this operand is a branch register; a branch register BR holds the branch destination
The machine then loads the value of the referenced BR into the IP register and execution continues from there; IP stands for Instruction Pointer
Executing branch-related instructions is about the only way to directly affect the value in the instruction pointer, the register that holds the address of the next bundle to be executed
29
Current Frame Marker Register CFM
Note: Frame Marker is often referred to as Stack Frame, and its fixed portion as the Stack Marker
Each function has a specific stack frame associated with it, which is created at function invocation; it is cleared at function return
If all the relevant data of a function’s stack frame do fit, they are placed in the stack of general registers; else the overflowing data must reside in memory
Either way, the current frame marker (CFM) holds the frame marker for the function that is currently active
Generally, most functions have small stack frames
30
Current Frame Marker Register CFM
Layout of the CFM:
Bits:    37..32   31..25   24..18   17..14   13..7   6..0
Field:   rrb.pr   rrb.fr   rrb.gr   sor      sol     sof

Meaning of Bits in CFM:
Name     Bits     Field meaning
sof      0..6     Total size of stack frame
sol      7..13    Size of local part of stack frame, in words
sor      14..17   Size of rotating portion of stack frame. The number of the rotating registers is 8 times the sor value
rrb.gr   18..24   Register rename base for GRs
rrb.fr   25..31   Register rename base for FRs
rrb.pr   32..37   Register rename base for PRs
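The bit fields listed above can be packed and unpacked with shifts and masks. The following is a minimal sketch that treats the CFM as a plain 38-bit integer; the field positions come from the table, while the helper names `encode_cfm` and `decode_cfm` and the sample frame sizes are assumptions made for illustration.

```python
# (lowest bit, width in bits) of each CFM field, as listed in the table above.
CFM_FIELDS = {
    "sof": (0, 7),
    "sol": (7, 7),
    "sor": (14, 4),
    "rrb_gr": (18, 7),
    "rrb_fr": (25, 7),
    "rrb_pr": (32, 6),
}


def encode_cfm(**fields: int) -> int:
    """Pack named field values into a 38-bit CFM image; missing fields are 0."""
    cfm = 0
    for name, value in fields.items():
        low, width = CFM_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        cfm |= value << low
    return cfm


def decode_cfm(cfm: int) -> dict:
    """Extract every field from a CFM image."""
    return {
        name: (cfm >> low) & ((1 << width) - 1)
        for name, (low, width) in CFM_FIELDS.items()
    }


# Hypothetical frame: 5 registers in total, 4 of them local, 8 rotating (sor = 1).
image = encode_cfm(sof=5, sol=4, sor=1)
fields = decode_cfm(image)
print(fields)
print("rotating registers:", 8 * fields["sor"])
```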
31
Application Registers AR
Application Registers (t.b.d.):
Register      Mnemonic     Description of register
ar0 – ar7     KR0 – KR7    Kernel registers 0 .. 7
ar8 – ar15                 Reserved
ar16                       t.b.d.
32
Instruction Pointer IP
IPF instructions are fetched in units of bundles, which are chunks of 16 bytes, or 128 bits
Bundles are stored bundle-aligned
The ip can address 18,446,744,073,709,551,616 different bytes (but only at bundle addresses)
The rightmost 4 bits of the ip thus will always be zero, due to the bundle-alignment
Hence these 4 bits don’t need to be stored on the microprocessor silicon
33
Performance Monitor Data Register
These are architecture-provided resources that record the use of hardware modules
Contents are read-only by SW
But contrary to the performance monitor registers on Intel Pentium architectures, they are user visible on Itanium
34
Itanium ISA
Instruction Set Architecture
35
Instruction Set Architecture ISA: Parallelism, Dependences, and Groups
Itanium instructions packaged in groups can execute in parallel; this allows fast execution, if HW is available!
Assembly programmer or compiler may craft groups as large as desired; the performance consequence is:
All operations embedded in a single group can be executed simultaneously, in parallel, saving time over the equivalent sequential execution
The physical silicon angle of this is: of all operations that could be executed in parallel, only those are actually performed in parallel for which HW resources exist
E.g. on an Itanium® 2 implementation of IPF, there are 6 units available to operate in parallel
36
Instruction Set Architecture ISA: Parallelism, Dependences, and Groups
If fewer actions are enclosed in a group than there are HW units, some HW will idle
If more actions are enclosed in a group than there are HW units, then all HW elements are active, yet some degree of possible parallelism will be lost on this implementation; future HW implementations may execute that same object code faster due to their higher degree of parallelism
Parallel execution is not feasible if dependencies exist between instructions
On Itanium these dependencies are not resolved by the machine
It is the human programmer or optimizer that explicitly tracks what can be done in parallel and what must be done in sequence. The machine just runs it; goal: TO BE FAST!
37
Instruction Set Architecture ISA: Parallelism, Dependences, and Groups
If a result has to be computed first before it can be read somewhere else (memory or register), a true dependence exists, AKA data dependence; conventionally just called “dependence”
On Itanium we call this a RAW (Read after Write) dependence
If a result has to be read first before it can be re-computed, a false dependence is created, AKA anti-dependence
On Itanium this is named WAR (Write after Read) dependence
If a result has to be computed first before it can be computed again, assuming that an intermediate reference is possible, an output dependence is created
Itanium calls this third dependence: WAW (Write after Write) dependence
38
Instruction Set Architecture ISA: Parallelism, Dependences, and Groups
In all these cases, the prior operation has to complete before the dependent one can be started; e.g.:
ld8 r14 = [r3]       // load GR14 w. 8 bytes addr. by GR3
add r15 = r14, r16   // integer sum into GR15, RAW dep
This is an example of RAW dependence, AKA true dependence
The loading of an 8-byte value into (8-byte) register GR14 must complete first, before the addition of the 2 long integer values, held in GR14 and GR16, can be started
Note the assembler register names: r14, and not gr14
This is Intel and HP assembly language convention!
Another assembler may use different conventions
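The three dependence kinds can be detected mechanically from the registers each instruction reads and writes. The sketch below is a simplified model, not a tool from the Itanium tool chain: the `Instr` record and the read/write sets are filled in by hand, and memory dependences are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    writes: frozenset  # registers written by the instruction
    reads: frozenset   # registers read by the instruction


def dependences(first: Instr, second: Instr) -> list:
    """Dependences that force `second` to wait for `first` (program order)."""
    found = []
    if first.writes & second.reads:
        found.append("RAW")  # true (data) dependence
    if first.reads & second.writes:
        found.append("WAR")  # anti-dependence
    if first.writes & second.writes:
        found.append("WAW")  # output dependence
    return found


load = Instr("ld8 r14 = [r3]", frozenset({"r14"}), frozenset({"r3"}))
add = Instr("add r15 = r14, r16", frozenset({"r15"}), frozenset({"r14", "r16"}))
overwrite = Instr("mov r14 = r5", frozenset({"r14"}), frozenset({"r5"}))

print(dependences(load, add))        # ['RAW']
print(dependences(add, overwrite))   # ['WAR']
print(dependences(load, overwrite))  # ['WAW']
```

Two instructions for which the function returns an empty list are candidates for the same group; any non-empty result means they must be separated.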
39
Instruction Set Architecture ISA: Assembly Language Format
Format of an Itanium assembler instruction:
[(pr)] mnemonic[.comp] dest = src1 [, src2 [, src3 ] ]
In meta-syntax, [ and ] brackets mean that the bracketed portion of the instruction is optional
In assembly syntax, square bracket pairs [] express indirection
Be careful not to get confused by the 2 different contexts!
Meaning of the various assembly language fields:
40
Instruction Set Architecture ISA
Syntax     Name                 Meaning
(pr)       Predicate register   Used to predicate execution; if value is 0, the result is not committed, if true, the result is committed. pr0 is always 1, hence the associated instructions are executed unconditionally
mnemonic   Instruction          Name of the instruction to tell the assembler which operation to perform
comp       Completer            Further qualifies or completes the instruction specification. There may be multiple completers per instruction; not all instructions have a completer
dest       Destination          Is the destination of the specified instruction. Choices are: register or memory
src1       Source one           Source operand. Not all instructions require a source. Some instructions allow multiple sources. Sources may be: immediate operands, or registers. Memory can be a source via indirection (through a register)
src2       Source two           Ditto
src3       Source three         Ditto
41
Instruction Set Architecture ISA: Assembly Language Format
A sample assembly language instruction is shown next:
(p0) add r5 = r4, r3, 1 // (p0) can be skipped
This is an integer add instruction that sums up the integer values in GR4 and GR3, and also adds integer literal 1
It assigns the sum to register GR5. Since the predicate register used is PR0, which is always true, the commit of the sum to register GR5 is unconditional, as if no predicate qualifier had been given
Predicate registers, when listed, are enclosed in ( ) parentheses
Not all instructions allow or need a completer. Typical completers are shown below
Some instructions allow multiple completers, notably the memory access instructions, and branch instructions
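To make the field structure concrete, the sketch below splits an instruction line into predicate, mnemonic, completers, destination, and sources. It is a toy parser written for this illustration and covers only the simple `dest = src` shape of the format above (no labels, directives, or stops); the names `PATTERN` and `parse` are hypothetical and not part of any assembler.

```python
import re

# [(pr)] mnemonic[.comp...] dest = src1 [, src2 [, src3]]
PATTERN = re.compile(
    r"^\s*(?:\((?P<pr>p\d+)\)\s*)?"        # optional qualifying predicate
    r"(?P<mnemonic>[a-z][a-z0-9]*)"         # instruction name
    r"(?P<completers>(?:\.[a-z0-9]+)*)\s+"  # zero or more completers
    r"(?P<dest>[^=]+?)\s*=\s*"              # destination(s), left of '='
    r"(?P<srcs>.+?)\s*$"                    # source operand(s)
)


def parse(line: str) -> dict:
    """Split one simple instruction line into its fields."""
    code = line.split("//", 1)[0]  # drop a trailing comment
    match = PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a simple 'dest = src' instruction: {line!r}")
    return {
        "pr": match.group("pr") or "p0",  # no predicate means unconditional
        "mnemonic": match.group("mnemonic"),
        "completers": [c for c in match.group("completers").split(".") if c],
        "dest": [d.strip() for d in match.group("dest").split(",")],
        "srcs": [s.strip() for s in match.group("srcs").split(",")],
    }


print(parse("(p0) add r5 = r4, r3, 1 // (p0) can be skipped"))
print(parse("ld8 r14 = [r3]"))
```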
42
Instruction Set Architecture ISA
Completer   Meaning
.a          For “advanced” load; check later if successful
.c          Check
.clr        If advanced load was not successful, clear the reg
.nc         No clear
.s          Speculative; e.g. for load; NOT allowed for store!
.many       t.b.d.
.few        t.b.d.
.excl       t.b.d.
Many more: .equ, .unc, etc.
43
Instruction Set Architecture ISA: Itanium Bundle Format
Executable code on Itanium comes in units of bundles. A bundle consists of 3 instructions, all grouped with an associated template
The template completes the instruction specification and, above all, defines group boundaries
A boundary is also known as a stop. A stop defines where one group ends and another group starts
If no stop is included in a template, this means that the bundle will be part of a larger group, consisting of more instructions in the next bundle
44
Instruction Set Architecture ISA: Itanium Bundle Format
Each instruction is 41 bits long; a template consumes 5 bits, one template per bundle
With 3 instructions per bundle, the overall bundle length is 3 * 41 + 5 = 128 bits, fitting into 16 bytes; all bundle-aligned, easily accomplished due to the first bundle residing on a mod-16 memory boundary
From then on all will be aligned on 16-byte boundaries
With the memory bus being 128 bits wide (or wider on future IPF implementations) and bundles being bundle-aligned, fetching instruction memory is fast, requiring one single transfer on the bus
45
Instruction Set Architecture ISA: Itanium Bundle Format
General layout of a bundle is shown next, with bits ordered from 0 through 127, increasing right to left
Bits:      127 .. 87       86 .. 46        45 .. 5         4 .. 0
Content:   instruction 2   instruction 1   instruction 0   template
The template serves as a means for the compiler to communicate additional information about instructions 0, 1, and 2, without which they could be ambiguous
One such key piece of information is the placement of an instruction group stop, in assembler ;;
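The bundle layout can be modelled as one 128-bit integer. The sketch below packs a template and three 41-bit slots at the bit positions given in the layout and unpacks them again; the slot contents are arbitrary numbers, not real instruction encodings, and the function names are illustrative.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template: int, slot0: int, slot1: int, slot2: int) -> int:
    """Build a 128-bit bundle: template in bits 4..0, slots above it."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template needs 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction slot needs 41 bits")
    return (
        template
        | slot0 << TEMPLATE_BITS                    # bits 45..5
        | slot1 << (TEMPLATE_BITS + SLOT_BITS)      # bits 86..46
        | slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)  # bits 127..87
    )


def unpack_bundle(bundle: int) -> tuple:
    """Return (template, slot0, slot1, slot2) of a 128-bit bundle."""
    return (
        bundle & TEMPLATE_MASK,
        (bundle >> TEMPLATE_BITS) & SLOT_MASK,
        (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK,
        (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK,
    )


bundle = pack_bundle(0x01, 0x111, 0x222, 0x333)
print(f"{bundle:032x}")
print(unpack_bundle(bundle))  # (1, 273, 546, 819)
```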
46
Instruction Set Architecture ISA: Itanium Bundle Format
A group stop can occur after instruction 2, or 1, or 0, indicating that an earlier group must complete execution before another starts
But the Itanium templates allow at most 2 stops in a bundle
If 3 stops are needed, a NOOP must be packed into one of the instructions, to effectively create 2 physical groups, with the third being the NOOP, whose execution order does not matter
Compiler-generated code performs this work-around automatically
47
Instruction Set Architecture ISA: Itanium Bundle Format
The template specifies which types of instructions are assembled into slot 0, 1, and 2
IPF instructions are partitioned into the following 6 types:
Type    Meaning
A       ALU: integer or memory unit
I       Non-ALU: integer unit
M       Memory unit
F       Floating-point unit
B       Branch unit
L + X   Extended unit, or Branch unit
48
Instruction Set Architecture ISA: Itanium Bundle Format
Providing such information in the template speeds up instruction decoding, improving execution speed
A list with the Instruction Set Architecture (ISA) templates and embedded stops is shown next
Note: at most 2 stops in any of the formats
On an architecture that aims to have large groups, it seems logical to have few stops (max 2) per bundle
49
Instruction Set Architecture ISA
Template #   Type    Slot 0           Slot 1                Slot 2
0 = 0x00     MII     Memory unit      Integer unit          Integer unit
1 = 0x01     MII_    Memory unit      Integer unit          Integer unit ;;
2 = 0x02     MI_I    Memory unit      Integer unit ;;       Integer unit
3 = 0x03     MI_I_   Memory unit      Integer unit ;;       Integer unit ;;
4 = 0x04     MLX     Memory unit      L unit?               Extended unit
5 = 0x05     MLX_    Memory unit      L unit?               Extended unit ;;
6 = 0x06     reserved
7 = 0x07     reserved
8 = 0x08     MMI     Memory unit      Memory unit           Integer unit
9 = 0x09     MMI_    Memory unit      Memory unit           Integer unit ;;
10 = 0x0a    M_MI    Memory unit ;;   Memory unit           Integer unit
11 = 0x0b    M_MI_   Memory unit ;;   Memory unit           Integer unit ;;
12 = 0x0c    MFI     Memory unit      Floating-point unit   Integer unit
13 = 0x0d    MFI_    Memory unit      Floating-point unit   Integer unit ;;
14 = 0x0e    MMF     Memory unit      Memory unit           Floating-point unit
15 = 0x0f    MMF_    Memory unit      Memory unit           Floating-point unit ;;
16 = 0x10    MIB     Memory unit      Integer unit          Branch unit
17 = 0x11    MIB_    Memory unit      Integer unit          Branch unit ;;
18 = 0x12    MBB     Memory unit      Branch unit           Branch unit
19 = 0x13    MBB_    Memory unit      Branch unit           Branch unit ;;
20 = 0x14    reserved
21 = 0x15    reserved
22 = 0x16    BBB     Branch unit      Branch unit           Branch unit
23 = 0x17    BBB_    Branch unit      Branch unit           Branch unit ;;
24 = 0x18    MMB     Memory unit      Memory unit           Branch unit
25 = 0x19    MMB_    Memory unit      Memory unit           Branch unit ;;
26 = 0x1a    reserved
27 = 0x1b    reserved
28 = 0x1c    MFB     Memory unit      Floating-point unit   Branch unit
29 = 0x1d    MFB_    Memory unit      Floating-point unit   Branch unit ;;
30 = 0x1e    reserved
31 = 0x1f    reserved
50
Instruction Set Architecture ISA: Itanium Bundle Format
The difference between the above templates 0x00 and 0x01, both being MII type operations, is: after instruction 2 in template 0x01 there is a stop, while in template 0x00 there is none
In other words, the next bundle after the one for template 0x00 will belong to the same group, and a higher degree of parallelism will be possible there
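The template list lends itself to a lookup table. In the sketch below each template number maps to its type string as given in the list, with `_` marking a stop after the preceding slot; `describe` separates unit letters from stop positions. The dictionary simply transcribes the slide's table, and the function name is made up for this illustration.

```python
# Template number -> type string from the table above; '_' means a stop
# after the preceding slot. Reserved template numbers are omitted.
TEMPLATES = {
    0x00: "MII", 0x01: "MII_", 0x02: "MI_I", 0x03: "MI_I_",
    0x04: "MLX", 0x05: "MLX_",
    0x08: "MMI", 0x09: "MMI_", 0x0A: "M_MI", 0x0B: "M_MI_",
    0x0C: "MFI", 0x0D: "MFI_", 0x0E: "MMF", 0x0F: "MMF_",
    0x10: "MIB", 0x11: "MIB_", 0x12: "MBB", 0x13: "MBB_",
    0x16: "BBB", 0x17: "BBB_", 0x18: "MMB", 0x19: "MMB_",
    0x1C: "MFB", 0x1D: "MFB_",
}


def describe(template: int) -> tuple:
    """Return (unit letters per slot, slots that are followed by a stop)."""
    name = TEMPLATES.get(template)
    if name is None:
        raise ValueError(f"template {template:#04x} is reserved")
    units, stops = [], []
    for ch in name:
        if ch == "_":
            stops.append(len(units) - 1)  # stop follows the last unit seen
        else:
            units.append(ch)
    return units, stops


print(describe(0x00))  # (['M', 'I', 'I'], [])
print(describe(0x01))  # (['M', 'I', 'I'], [2])
print(describe(0x03))  # (['M', 'I', 'I'], [1, 2])
print(describe(0x0A))  # (['M', 'M', 'I'], [0])
```

A bundle whose stop list does not contain slot 2 continues its last group into the next bundle, which is the difference between templates 0x00 and 0x01 described above.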
51
Instruction Set Architecture ISA: Itanium Assembly Code
A group is a sequence of 1 or more instructions delimited by a stop. The first instruction in a whole program is thought to be preceded by a stop
Similarly, the last instruction of a complete program is thought to be followed by a stop
All instructions placed into a single group can be executed in parallel. Whether or not they will depends on the number of hardware resources available. In the initial Itanium architecture only 6 resources were available
In a later implementation, more HW resources may become available, thus potentially speeding up execution of the same old, unchanged Itanium code on a future generation
The ;; indicates to the assembler where one group ends and thus the next group starts
52
Instruction Set Architecture ISA: Itanium Assembly Code
Some assembly language instructions follow:
cmp.eq p1, p2 = r33, r34
This checks general purpose registers 33 and 34 for equality; if equal, predicate register 1 is set to true, predicate register 2 to false. Otherwise p1 is set to false and p2 to true. A more complicated case is:
(p3) cmp.eq.unc p1, p2 = r33, r34
This checks if predicate register 3 is true at the start. If so, and if registers GR33 and GR34 are equal, register p1 is set to true and p2 to false, else the reverse
Else, i.e. if p3 is false a priori, predicate registers 1 and 2 are both set to false
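The behavior of the two compare forms can be mimicked with a small software model. This is a sketch under stated assumptions: the predicate file is a plain Python list, the function name `cmp_eq` is invented here, and the rule that the plain form leaves its targets untouched when the qualifying predicate is false is added as the usual meaning of a predicated-off instruction rather than taken from the slide.

```python
def cmp_eq(pr: list, qp: int, p1: int, p2: int, a: int, b: int, unc: bool = False) -> None:
    """Model of an equality compare writing two predicates.

    pr  : list of 64 booleans, the predicate registers
    qp  : index of the qualifying predicate, e.g. 3 for (p3)
    unc : True models the .unc completer
    """
    if pr[qp]:
        pr[p1] = (a == b)
        pr[p2] = (a != b)
    elif unc:
        # Qualifying predicate false and .unc given: clear both targets.
        pr[p1] = False
        pr[p2] = False
    # Assumption: without .unc, a false qualifying predicate changes nothing.


pr = [False] * 64
pr[0] = True  # PR0 is always true

cmp_eq(pr, 0, 1, 2, a=7, b=7)            # unconditional compare, operands equal
print(pr[1], pr[2])                      # True False

cmp_eq(pr, 3, 1, 2, a=7, b=7, unc=True)  # p3 is false, .unc form
print(pr[1], pr[2])                      # False False
```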
53
Assembler Source Program
With & Without Stack Unwind Operations
From ref [8]
54
Assembler for Hello World, With Unwind Directives
// hello_world.c assembly with unwind directive
// sample taken from ref [8]
// page 1/3
.file "hello.c"
.pred.safe_across_calls p1-p5, p16-p63
.section .rdata, "a", "progbits"
.align 8
.STRING1:
stringz "Hello World!!!\n"
.text
.align 16
.global hello#
.proc hello#
hello:
.prologue
.save ar.pfs, r34
55
Assembler for Hello World, With Unwind Directives
// hello_world.c assembly with unwind directive
// sample taken from ref [8]
// page 2/3
alloc r34 = ar.pfs, 0, 4, 1, 0
.vframe r35
mov r35 = r12
.save rp, r33
mov r33 = b0 // load branch register into GR33
.body
addl r36 = @ltoff(.STRING1), gp ;;
ld8 r36 = [r36]
mov r32 = r1
br.call.sptk.many b0 = printf# // b0!
;;
56
Assembler for Hello World, With Unwind Directives
// hello_world.c assembly with unwind directive
// sample taken from ref [8]
// page 3/3
mov r1 = r32
mov ar.pfs = r34
mov b0 = r33 // restore branch register
.restore sp
mov r12 = r35
br.ret.sptk.many b0
.endp hello#
.global printf#
.type printf#, @function
57
Assembler for Hello World, Without Unwind Directives
// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// page 1/3
// The string is defined in the read only data section
.section .rdata, "a", "progbits"
.align 8
.STRING1:
stringz "Hello World!!!\n"
// definition of function hello is in text section
// Registers to be saved in local registers:
// gp = r1 - loc0 = r32
// rp = b0 - loc1 = r33
// ar.pfs - loc2 = r34
// sp = r12 - loc3 = r35
58
Assembler for Hello World, Without Unwind Directives
// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// page 2/3
.text
.global hello
.proc hello
hello:
alloc loc2 = ar.pfs, 0, 4, 1, 0
mov loc3 = sp
mov loc1 = b0 // save branch register b0
addl out0 = @ltoff(.STRING1), gp ;;
ld8 out0 = [out0] // group of 3 instructions
mov loc0 = gp
br.call.sptk.many b0 = printf ;;
59
Assembler for Hello World, Without Unwind Directives
// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// page 3/3
mov gp = loc0
mov ar.pfs = loc2
mov b0 = loc1
mov sp = loc3
br.ret.sptk.many b0
.endp hello
.global printf
.type printf, @function
60
Appendix: Some Definitions
61
Definitions: Branch Elimination
Replacing object code that has conditional branches with code that has a straight-forward execution path, lacking branches
The second version, with branches eliminated, must be semantically equivalent to the original code with branches
Everything else equal, the version without branches generally executes faster due to fewer cache misses
62
Definitions: Bundle
Group of 3 instructions plus a template, which all fit into a 16-byte long, 16-byte aligned section of instruction memory on Itanium
Total number of bits = 128
63
Definitions: Conditional Move
Move instruction that transfers bits from source to destination, but only if an associated condition is true
Otherwise the instruction operates like a noop
Such a move can serve as a special case of branch elimination. For example, the C source construct:
if ( a > 0 ) x = 99; -- HL source program
could be mapped into the conditional move:
cmov x, #99, a, #0, gt -- hypothetical asm
which has no branches. Source operand #99 is moved into memory location x only if the > condition holds between operand a and integer literal 0
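The idea of replacing a branch by a data-dependent selection can be sketched in a high-level language as well. The function below computes the effect of the hypothetical `cmov` with bit masks instead of an `if`; it is an analogy only (the Python interpreter itself still branches internally), and the name `cmov_gt` is made up for this example.

```python
def cmov_gt(dest: int, src: int, a: int, b: int) -> int:
    """Return src if a > b, else dest, without an if statement.

    mask is -1 (all one bits) when the condition holds and 0 otherwise,
    so exactly one of the two AND terms survives.
    """
    mask = -int(a > b)
    return (src & mask) | (dest & ~mask)


x = 5
x = cmov_gt(x, 99, a=3, b=0)   # condition a > 0 holds, so x becomes 99
print(x)                       # 99

y = 5
y = cmov_gt(y, 99, a=-1, b=0)  # condition fails, y keeps its old value
print(y)                       # 5
```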
64
Definitions: Endian, Endianness
A convention that defines in which order the bytes of a multi-byte data object are addressed
Can be programmed on Itanium with the be bit
If the byte at the higher address holds the more significant part of the value, we call this little-endian; typical on Intel x86 architecture
The other way around we call big-endian ordering; typical on IBM 370 architecture
65
Definitions: EPIC
Explicitly Parallel Instruction Computing, with IPF being the first commercial architecture that implements EPIC
Note IPF’s ability to also execute old Intel x86 and old HP PA object code
66
Definitions: Epilogue
When the steady state of a software pipelined loop completes, there may be operands yet to be used and operations yet to be computed that did not fit into the steady state
These last operands must be consumed, some even generated, during the epilogue, and ultimately the pipeline must be drained
This is accomplished in the object code after the steady state, and that portion of code is called the epilogue
See also prologue
67
Definitions: Group
A sequence of instructions ended by a stop, which is defined in the template of a bundle
A group may extend over one bundle or more
The stop means that the hardware cannot start executing any subsequent group until the current group has completed
Syntax notation for stop in Itanium assembler is the double-semicolon ;;
68
Definitions: Parallel Comparison
A composite source program condition of the form:
( ( a > b ) && ( c <= d ) )
requires multiple steps to compute a boolean predicate
Generally, on a sequential architecture these multiple steps are combined via explicit instructions for anding and oring, or else the flow of control of execution selects a matching true label. All this takes time
The Itanium processor allows parallel evaluation of certain composite Boolean expressions in one single step
The result can be used as a predicate in subsequent instructions. Notice that such combined Boolean expressions must be side-effect free
This is not equivalent to C’s short-circuit evaluation of complex boolean expressions!
69
Definitions
Parallel Comparison, Cont’d
For example, another complex boolean expression
( fun( j, k ) && ( i < MAX ) )
cannot be mapped into a parallel EPIC comparison
The reason: one operand is a function call fun( j, k ), possibly with a large number of parameters, which may have a side-effect on one of the other operands, for example “i”, which is yet to be compared
This type of boolean expression is mapped into sequential code
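Why side effects rule out parallel comparison can be illustrated with a small Python model. Everything here is an assumption made for the sketch: fun is a hypothetical function that increments a shared i, and naive_parallel models "both operands evaluated at once" by reading i before the call takes effect.

```python
from itertools import product

MAX = 10
state = {"i": 9}

def fun(j, k):
    """Hypothetical function with a side effect: it increments the shared i."""
    state["i"] += 1
    return j < k

def sequential(j, k):
    # C semantics: fun runs first, then i is compared (only if fun was true).
    return fun(j, k) and (state["i"] < MAX)

def naive_parallel(j, k):
    # Both operands sampled "at once": i is read before fun's side effect lands.
    i_before = state["i"]
    left = fun(j, k)
    return left & (i_before < MAX)

def pure_sequential(a, b, c, d):
    return (a > b) and (c <= d)      # short-circuit evaluation

def pure_parallel(a, b, c, d):
    return (a > b) & (c <= d)        # both comparisons always evaluated

# Side-effect-free operands: both strategies agree on every input tried.
agree = all(pure_sequential(*v) == pure_parallel(*v)
            for v in product(range(3), repeat=4))
print(agree)

# With a side effect the two strategies can disagree.
state["i"] = 9
seq_result = sequential(1, 2)
state["i"] = 9
par_result = naive_parallel(1, 2)
print(seq_result, par_result)
```

The side-effect-free form gives the same answer either way, which is what makes it a candidate for a single-step comparison; the form with fun does not, so it has to stay sequential.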
70
Definitions
Predication
Is the association of a boolean condition with the execution of an instruction sequence. This allows the following:
Two instruction streams can be executed in parallel, clearly requiring multiple hardware modules; these are provided on EPIC
Both streams have a predicate associated with their operations. Only the stream with the true predicate is actually retired; the other will be aborted and ignored
Abort can happen as soon as the predicate is known. This means the computation of the predicate can proceed in parallel with the execution of the two code streams, but must complete by the time these 2 code streams wait to learn which one is the winner
An ISA with predication requires instruction bits to name the predicate to use, and to select which direction (true or false) enables execution
Also, the discarded code path must not cause any side-effect, such as a write to memory!
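A minimal Python sketch of the idea, assuming a compare that produces a complementary pair of predicates (the names p_true and p_false, and the value 99, are illustrative). Both arms are "issued", but only the arm whose predicate is true delivers its result.

```python
def branching(a, x):
    # Conventional code: a conditional branch skips the assignment.
    if a > 0:
        x = 99
    return x

def predicated(a, x):
    # If-conversion: a compare sets a predicate pair, and each
    # operation carries the predicate that guards it.
    p_true = a > 0
    p_false = not p_true
    candidates = [(p_true, 99), (p_false, x)]   # both arms are issued
    retired = [value for predicate, value in candidates if predicate]
    return retired[0]   # exactly one arm has a true predicate

for a in (-2, 0, 3):
    print(a, branching(a, 7), predicated(a, 7))
```

The predicated version contains no control-flow decision on a; the selection happens only when results are retired.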
71
Definitions
Prologue
Before a software pipelined loop body can be initiated, hardware resources (e.g. registers) must be initialized; we say the loop must be primed
This is accomplished in the object code before the steady state; that portion of code is called the prologue
See also epilogue
72
Definitions
Register File
The IPF has a rich set of registers
This includes 128 general purpose registers (for integer operations), 128 floating-point, 64 predicate, 8 branch, and 128 so-called application registers
Also a variety of special purpose registers is visible; visible means accessible by the assembly language program
These include a user mask, stack marker (frame marker), ip, processor id, and performance monitoring registers
73
Definitions
Speculation
If it is suspected (but not certain) that operand o will be used in the future, and this operand is not readily available (not yet in a high-speed register), and it takes long to fetch o, a processor may initiate the fetch well before o is actually used
Advantage: by the time o is needed, it is already available without delay
Disadvantage: if the flow of control never reaches the place where o was thought to be needed, then the speculative fetch was superfluous
The speculative fetch may still be worthwhile if a) no side-effects occurred that are harmful to program correctness, and b) the hardware resource required to fetch o was idle anyway; then no loss!
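The trade-off can be sketched in Python. The model is loosely patterned on Itanium's speculative load followed by a later check, which this slide does not spell out, so treat that mapping as an assumption; the tiny memory dictionary, the NAT marker and all function names are hypothetical.

```python
NAT = object()   # stand-in marker for a speculative load that could not complete

memory = {0x100: 42}   # hypothetical tiny memory: address -> value

def speculative_load(address):
    """Fetch early; on failure return a marker instead of raising."""
    return memory.get(address, NAT)

def normal_load(address):
    """Non-speculative load: a bad address is an error right away."""
    return memory[address]

def use_if_needed(address, needed):
    early = speculative_load(address)      # hoisted above the condition
    if not needed:
        return None                        # value never used; a failure never surfaces
    if early is NAT:                       # check at the point of use
        return normal_load(address)        # recovery: redo the load non-speculatively
    return early

print(use_if_needed(0x100, True))
print(use_if_needed(0x999, False))
```

A failed speculative fetch costs nothing here when the value turns out not to be needed, which corresponds to condition a) above: the early fetch itself has no harmful side effect.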
74
Definitions
Steady State
The software pipelined object code executed repeatedly, after the prologue has completed and before the epilogue becomes active, is called the steady state
Each iteration of the steady state makes some progress toward multiple iterations of the original source loop
See also prologue and epilogue
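Prologue, steady state and epilogue can be seen together in a hand-pipelined Python loop. This is a sketch under stated assumptions: the source loop is taken to be out[i] = data[i] * 2, split into three hypothetical stages (load, compute, store), and the statements inside the steady state are written in an order that lets sequential Python mimic stages that would run in parallel on the hardware.

```python
def pipelined_double(data):
    """Software-pipelined version of: out[i] = data[i] * 2, with 3 stages."""
    n = len(data)
    if n < 2:
        raise ValueError("sketch assumes at least 2 source iterations")
    out = [None] * n
    phases = []

    # Prologue: prime the pipeline before the steady state can start.
    loaded = data[0]                 # load(0)
    computed = loaded * 2            # compute(0)
    loaded = data[1]                 # load(1)
    phases.append("prologue")

    # Steady state: each pass advances three source iterations at once.
    for i in range(2, n):
        out[i - 2] = computed        # store(i-2)
        computed = loaded * 2        # compute(i-1)
        loaded = data[i]             # load(i)
        phases.append("steady")

    # Epilogue: consume the operands still in flight and drain the pipeline.
    out[n - 2] = computed            # store(n-2)
    computed = loaded * 2            # compute(n-1)
    out[n - 1] = computed            # store(n-1)
    phases.append("epilogue")
    return out, phases

result, phases = pipelined_double([1, 2, 3, 4, 5])
print(result)
print(phases)
```

For five source iterations the steady state runs three times; the remaining work is split between the prologue and the epilogue.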
75
Definitions
Syllable
Is the instruction-only portion of a bundle
A bundle always holds 3 instructions plus a template, the template specifying additional necessary information about the instructions
The instruction alone, without the needed template information, is a syllable
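The relation between bundle, template and syllable can be sketched as bit packing in Python. The field widths used here (a 5-bit template in the low bits followed by three 41-bit syllables, 128 bits in total) come from the Itanium architecture manuals rather than from this slide, and the syllable values are arbitrary placeholders, not real instruction encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit syllables into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3 or any(not 0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("need exactly three 41-bit syllables")
    bundle = template
    for index, syllable in enumerate(slots):
        bundle |= syllable << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit bundle back into its template and three syllables."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots

packed = pack_bundle(0x10, [0x111, 0x222, 0x333])   # placeholder values
print(hex(packed))
print(unpack_bundle(packed))
```

The widths add up as 5 + 3 × 41 = 128, so a bundle with every field at its maximum fills exactly 128 bits.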
76
Bibliography
1. Triebel, Walter: “IA-64 Architecture for Software Developers”, Intel Press © 2000, 308 pages
2. http://www.intel.com/design/itanium2/manuals/25110901.pdf
3. http://h21007.www2.hp.com/portal/StaticDownload?attachment_ciid=c2d2e0aecd2b7110VgnVCM100000275d6e10RCRD&ciid=ce1fd701521c7110VgnVCM100000275d6e10RCRD
4. http://www.intel.com/design/itanium/downloads/245320.htm
5. http://www.intel.com/design/itanium/manuals/iiasdmanual.htm
6. http://download.intel.com/design/Itanium2/manuals/25111003.pdf
7. Donald Knuth: “Interview with Donald Knuth” 2008-04-25
8. Intel® Itanium® Architecture Assembly Reference Guide, © 2002, Intel order number 248801-004, at http://developer.intel.com