ECE 172 Digital Systems
Chapter 3: Registers
Herbert G. Mayer, PSU
Status 7/12/2018
Syllabus
• Definitions, Introduction
• Register Transfer & RTL
• Register Shift Operations
• Register Windows
• Vector Registers
• Score Board
• Zero Register Architecture
• Register Dependencies
• Actual Register Sets
• Bibliography
Definitions
Topic: Register; AKA machine register; AKA processor register
• A machine register is an ISA-visible system resource holding data of a specific size; data can be accessed and processed fast. Data size is generally one word
• Getting data from memory into a register is referred to as loading; moving bits from a register to memory is called storing
• Generally, information in registers can be processed faster than any other data in digital systems
• Register size is dictated by the computer architecture; e.g. a 64-bit architecture has registers holding 64 bits of information; these may be data, addresses or other
• CPU may have many registers (Itanium) or few (x86)
• Registers are identified via index, i.e. their name
Definitions
• Register is a CPU resource holding operands for computation; different operands at different times
• Register is key resource in digital systems: to store information while powered on; register content is volatile
• CPU may have 0 (yes zero) or more user-visible registers
• Register operand may be source for computation
• Or destination, holding next result after operation
• Registers may be both source and destination, in which case computation changes the original source operand
• Type of architecture with registers: general purpose register architecture, GPRA
• Not all computer architectures have registers
• For example, stack machines hold operands on top of a stack; stack grows and shrinks; has 0 user-visible registers!
Definitions
• Early architectures, AKA von Neumann machines, or Princeton architecture, had but 1 register, known as accumulator
• AKA single-accumulator architecture SAA!
• Note: Architectural registers of CPU are user visible
• However, real HW implementations of any defined architecture may provide hidden registers
• For example, the ancient Intel x86 processor nowadays has many internal registers that are not user visible
• They are only available to the running HW, which may manage them to speed up execution
• Also, register windows on SPARC may have more internal (i.e. not visible) registers to speed up execution, by performing register saving and restoring at calls and returns; not detailed here
Definitions
[Figure: Von Neumann Architecture, © teach-ict]
Introduction
• Since accumulator on SAA was sole source + target of operations, instructions never needed to explicitly name that unique register: always implied!
• Modern architectures have multiple registers, e.g. for integer ops, floating-point ops, program counter, status, segment registers, stack addresses, etc.
• Early architectures were register-starved, e.g. Intel x86; yet hidden registers in modern versions of x86 alleviate slowness due to register shortage!
• Program status register holds ALU status of last op; is indirectly visible via branch conditions; e.g. branch_if_zero, or branch_if_less, etc.
• Recent architectures (e.g. Itanium) have a large number of visible, and also a large number of hidden (not user visible) registers, to speed up execution through hidden register renaming
Introduction
• An architecture ought ☺ to have many registers!
• Yet accessing any such HW resource requires naming (via indexing) a register, AKA addressing it
• Number of bits used in an instruction to address a register is ⌈log2( number-of-regs )⌉, increasing object code size! Where is the optimum?
• Way to reduce number of bits: partition registers into different classes: integer, float, branch, status
• Desirable for execution speed: ideally all data reside in registers! Yet not feasible!
• Data set is way larger than total size of register file, hence this ideal is impossible
• Architecture solution via memory hierarchy: register, slower registers, cache, slower cache, memory, etc.
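The register-addressing cost mentioned on this slide can be made concrete with a small sketch. The function names `register_index_bits` and `operand_field_bits`, and the choice of 3 register operands per instruction, are illustrative assumptions and not part of the slides.

```python
def register_index_bits(num_regs: int) -> int:
    """Bits needed to name one of num_regs registers: ceil(log2(num_regs))."""
    if num_regs < 1:
        raise ValueError("need at least one register")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1
    return (num_regs - 1).bit_length()


def operand_field_bits(num_regs: int, operands: int = 3) -> int:
    """Instruction bits spent on register names, assuming `operands` register fields."""
    return operands * register_index_bits(num_regs)


if __name__ == "__main__":
    for n in (8, 32, 128):
        print(n, register_index_bits(n), operand_field_bits(n))
```

Under these assumptions, going from 8 to 128 registers raises the cost of a three-operand instruction from 9 to 21 register-index bits, which is the code-size pressure the slide asks about.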
Introduction
[Figure: Memory hierarchy: registers & other HW modules holding data]
Introduction
[Figure: Another, common view of memory hierarchy: registers at top]
Logical Register
• Logical register (LR) is a hypothetical machine resource to hold data as operands for computation, addressing, indexing, decision making, etc.
• Convenient model for discussing architecture
• Logical registers are used as abstract design tool to explore or refine a computer architecture
• To propose code sequences for simulation etc.
• LR doesn’t suffer ☺ from physical constraints, such as slowness, limit of data size, number of units
• As eventual result of computer design process, LR may end up defining key attributes for a to-be-defined actual register of a digital system being built
Physical Register
• A physical or processor register (PR) is a machine resource holding data as operands for addressing, computing, decision making, etc.
• Each PR has a unique name, specific width, defined data types, and set of operations
• Width (number of bits) of a PR is defined by the architecture of the digital system for which the PR is a resource
• Or the width of a PR may be defined by the maximum precision of the data it is ever expected to compute
• Frequently, these two widths are identical; e.g. on a 32-bit architecture the maximum numeric precision for integer or floating-point data was also 32 bits
• But they may differ; e.g. during the evolution of 32-bit architectures, wider numeric precisions such as 64-bit integer or float data became commonplace
Physical Register
• Actual number of PRs dictates required number of bits in instructions that specify register source and destination of register operations
• Old Intel x86 architecture has 4 general purpose and 4 dedicated (segment) registers; 8 total! Not a typo!
• SPARC architecture has 32 visible physical registers; yet has large number of hidden registers, available to allow smooth use of circular register window, when limit of 32 is exceeded
• After consuming all 32 (register 0 .. 31), count of next register restarts at 0; yet old register 0 must be saved
• Physical registers are actually built from flip-flops, thus are clocked, hold data between clock pulses
• PRs need reset (or clear) function at start of computation
Physical Register
[Figure: three sample registers]
r1: generic complete register; bits unspecified
r2: some 64-bit register; bit indices 63 .. 0
r3: 32-bit, 4-byte register:

  byte3      byte2      byte1      byte0
  31 .. 24   23 .. 16   15 .. 8    7 .. 0

• Sample registers, with specified length: r2 having 64 bits; r3 only 32 bits, 4 bytes
• Register r3 shows bit indices right to left, byte addresses right to left within one word: Little Endian
• Else byte addresses increase left to right: Big Endian
• Generic register r1 with length unspecified
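As a small illustration of the r3 layout, the sketch below splits a 32-bit value into byte0 .. byte3 and shows how the same value would be laid out in memory under both byte orders. The helper name `byte_fields` and the sample value 0x11223344 are assumptions chosen for the example.

```python
def byte_fields(r3: int) -> list:
    """Split a 32-bit register value into [byte0, byte1, byte2, byte3].

    byte0 holds bits 7..0, byte3 holds bits 31..24, as in the r3 layout.
    """
    return [(r3 >> (8 * k)) & 0xFF for k in range(4)]


r3 = 0x11223344                      # sample register content (assumed)
little = r3.to_bytes(4, "little")    # byte0 stored at the lowest address
big = r3.to_bytes(4, "big")          # byte3 stored at the lowest address

if __name__ == "__main__":
    print([hex(b) for b in byte_fields(r3)])
    print(little.hex(), big.hex())
```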
Register Transfer & Register Transfer Language
Register Transfer Language RTL
• Register transfer language (RTL) is not a general purpose programming language (PL): a PL offers higher-level abstractions to discuss the computing environment
• RTL is not an assembly language (AL): executable operations of AL map directly onto CPU instructions; the level of abstraction for AL is higher than for RTL
• Instead, RTL is a low-level language, used to define digital systems including key processor components; RTL specifies operations on registers or between multiple registers
• RTL shares certain operations with high level PLs or ALs: e.g. transferring bits, copying bits, specifying bit ranges, zeroing bit fields, or shifting bits, etc.
• . . . where source or destination or both are registers
Register Transfer Language
• Reason for continued use of a hardware description language (HDL) is logic synthesis
• HDL description of a system can be written in an intermediate language, AKA RTL
• Logic synthesis tools can convert a HW description into an interconnection of simple components that implement such a circuit!
• . . . and can transform the RTL specification of a circuit in HDL into an equivalent netlist
• Optimized netlist with storage elements and with combinational logic
• Netlist can be mapped into actual IC layout
• Becoming basis for IC manufacturing
Register Micro Operations
Micro operations (micro-ops) are low-level (primitive) operations executed in digital systems, involving register operands, such as:
1. Register to register transfer micro-ops, moving bits from one register to another
2. Arithmetic micro-ops, performing arithmetic on numeric data in registers
3. Logical micro-ops, performing bit manipulations on non-numeric data in registers
4. Shift micro-ops, moving bits or bit fields inside a register
5. Setting micro-ops, clearing to 0 or setting to 1 the selected bit fields (or all bits) inside a register
Register Micro Operations
RTL allows various addressing modes, needed per operand location; permissible due to power of digital system HW versus more restrictive SW environment:
1. Immediate: literal is immediate in opcode, e.g. r1 ← #9
2. Implied: operand on stack architecture, e.g. add refers to the 2 top operands stack[top] and stack[top-1]
3. Register to register: natural instruction on GPR architecture, assign r1 the value of r2, e.g. r1 ← r2
4. Memory indirect, case 1: operand is in memory, memory address 1234 is an immediate operand, e.g. r1 ← mem[ #1234 ]
5. Register indirect, case 2: operand is in memory, address is in another register, e.g. r1 ← mem[ r2 ]
6. Indexed: operand is in memory, address is in register plus numeric offset n, e.g. r1 ← mem[ r2 + n ]
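The six addressing modes can be mimicked with a toy machine model. This is only a sketch: the class `Machine`, its method names, the two-register file, and the memory size are assumptions made for illustration, not an RTL definition from the slides.

```python
class Machine:
    """Toy model: a register file, a word-addressed memory, and an operand stack."""

    def __init__(self, mem_words: int = 4096):
        self.regs = {"r1": 0, "r2": 0}
        self.mem = [0] * mem_words
        self.stack = []

    def immediate(self, rd, literal):               # r1 <- #9
        self.regs[rd] = literal

    def implied_add(self):                          # add: stack[top], stack[top-1]
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(left + right)

    def reg_to_reg(self, rd, rs):                   # r1 <- r2
        self.regs[rd] = self.regs[rs]

    def memory_immediate_address(self, rd, addr):   # r1 <- mem[ #1234 ]
        self.regs[rd] = self.mem[addr]

    def register_indirect(self, rd, ra):            # r1 <- mem[ r2 ]
        self.regs[rd] = self.mem[self.regs[ra]]

    def indexed(self, rd, ra, n):                   # r1 <- mem[ r2 + n ]
        self.regs[rd] = self.mem[self.regs[ra] + n]


if __name__ == "__main__":
    m = Machine()
    m.mem[1234] = 77
    m.memory_immediate_address("r1", 1234)
    print(m.regs["r1"])
```

Each method differs only in where the operand or its address comes from, which is the point of the list above.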
Register Transfer
• Abstract view of register transfer is akin to assignment in SW, except RTL operands specify registers, as opposed to general program objects, e.g.:
    r1 = r2;                                    -- no predicate used: assign unconditionally!
    if ( predicate_p1 == true ) then r1 = r2;   -- with predicate!
• In RTL more tersely expressed as:
    r1 ← r2        -- no predicate: assign register to register!
    p1: r1 ← r2    -- with predicate, meaning: if p1 is true
• Digital circuits have one grand power ☺ other tools often lack: parallelism!
    p2: r3 ← r4, and r5 ← r6
• Multiple register transfers performed simultaneously, provided predicate p2 holds!
• In preparation for register transfer, we review flip-flops, specifically D flip-flops; registers are built from arrays of flip-flops, one flip-flop per register bit
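A predicated, simultaneous register transfer can be sketched in software by reading all sources before writing any destination, which is roughly what clocked flip-flops do on one clock edge. The function name `transfer` and the dictionary-based register file are assumptions for illustration.

```python
def transfer(regs: dict, predicate: bool, *moves) -> dict:
    """Apply all (dst, src) moves at once if predicate holds; else change nothing.

    All source values are sampled first, so `r1 <- r2, r2 <- r1` swaps the two
    registers instead of copying one value twice.
    """
    if not predicate:
        return dict(regs)
    sampled = {dst: regs[src] for dst, src in moves}   # read phase
    return {**regs, **sampled}                          # write phase


if __name__ == "__main__":
    regs = {"r3": 0, "r4": 4, "r5": 0, "r6": 6}
    print(transfer(regs, True, ("r3", "r4"), ("r5", "r6")))   # p2: r3 <- r4, r5 <- r6
```

The swap case shows the difference from sequential SW assignment, where a temporary would be needed.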
Register Shift Operations
Register Shift Op
• Register shift operations (AKA shifts) move bits of a source register a defined number of bit positions into a destination register
• Source and destination may be different registers, or could be the same
• Depending on shift type, there are side effects in addition to destination register change:
  • Bits are discarded into a large bit bucket ☺
  • 0-bits, 1-bits, sign-bits are pulled in
  • Flags are set (e.g. sign, overflow, zero, etc.)
• So called left shift moves bits of a register toward the most significant bit
• Right shift moves register bits toward least significant bit, AKA right hand side
Register Shift Op
• If bits in register are viewed as string of 0s and 1s, shift operation may lose bits without side-effect
  • AKA logical view of bit values
  • Shift right may pull in zeros on left hand side (high bit index)
• Or shift can be arithmetic, then sign bit value is to be considered:
  • Shift right extends (pulls in) sign bit on left hand side
  • Shift left may cause overflow on two's complement, if the leftmost 2 bits differ before a left shift by 1! Sign change!
• For arithmetic shifts, convenient to view left shift by 1 position as multiply by 2
• Or right shift to divide by 2: must extend the sign!
• Sample, pseudo shift operations below:
Register Shift Op
Pseudo shift instructions, not selected from any real computer architecture:
r1 ← lsl r1      -- logical shift left 1 bit, same reg
r1 ← lsr r1      -- logical shift right 1 bit, same reg
r1 ← lsl r2      -- logical shift left 1 bit, different reg
r1 ← lsr r2      -- logical shift right 1 bit
r1 ← asl r2      -- arithmetic shift left 1 bit
r1 ← asr r2      -- arithmetic shift right 1 bit
r1 ← 2 lsl r2    -- logical shift left 2 bits
r1 ← 3 lsr r2    -- logical shift right 3 bits
r1 ← 4 asl r2    -- arithmetic shift left 4 bits
r1 ← 5 asr r2    -- arithmetic shift right 5 bits
r1 ← rotl r2     -- rotate r2 left 1 bit, result in r1
Register Shift Op 8-bits

Operation          Pseudo op        r2 before   r1 after
Right logical      r1 ← lsr r2      1011,0111   0101,1011
Right arithmetic   r1 ← asr r2      1011,0111   1101,1011
Left logical       r1 ← 2 lsl r2    1011,0111   1101,1100

• Right logical: logical shift right 1 position: leftmost bit ‘1’ in r2 is not interpreted as sign bit, thus a ‘0’ is pulled in from left
• Right arithmetic: arithmetic shift right 1 position: leftmost bit ‘1’ in r2 is “interpreted” by HW as sign bit, thus a ‘1’ is pulled in, i.e. sign-extended
• Left logical: logical shift left, 2 positions: left 2 bits go into bit bucket; and from right side, two ‘0’s are pulled in
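The 8-bit samples can be checked with a short simulation. This is a sketch under assumptions: the function names follow the pseudo ops (lsl, lsr, asl, asr), the register width is fixed at 8 bits, and `asl` reports overflow as a second return value, which is one possible answer to the flagging question rather than anything the slides prescribe.

```python
WIDTH = 8                        # register width in bits, as in the 8-bit samples
MASK = (1 << WIDTH) - 1


def lsl(x: int, n: int = 1) -> int:
    """Logical shift left: high bits go to the bit bucket, '0's enter on the right."""
    return (x << n) & MASK


def lsr(x: int, n: int = 1) -> int:
    """Logical shift right: '0's enter on the left."""
    return (x & MASK) >> n


def asr(x: int, n: int = 1) -> int:
    """Arithmetic shift right: the sign bit is replicated on the left."""
    x &= MASK
    n = min(n, WIDTH)
    result = x >> n
    if x >> (WIDTH - 1):                          # negative in two's complement
        result |= (MASK << (WIDTH - n)) & MASK    # fill vacated bits with '1's
    return result


def asl(x: int, n: int = 1):
    """Arithmetic shift left for 1 <= n < WIDTH; returns (result, overflow)."""
    x &= MASK
    top = x >> (WIDTH - n - 1)                    # sign bit plus the n bits below it
    overflow = top not in (0, (1 << (n + 1)) - 1) # these bits must all be equal
    return (x << n) & MASK, overflow


if __name__ == "__main__":
    r2 = 0b1011_0111
    for value in (lsr(r2), asr(r2), lsl(r2, 2)):
        print(f"{value:08b}")
```

With r2 = 1011,0111 the three printed values correspond to the "r1 after" column for lsr, asr, and 2 lsl.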
Register Shift Arithmetic
• Arithmetic shift left asl pulls in ‘0’s from right side
• Views bits as signed two’s complement numbers
• Sign bit change causes overflow; to be flagged?
[Figure: arithmetic shift left, asl: high order sign bit lost; low order input: ‘0’s]
[Figure: arithmetic shift right, asr: sign bit is extended; low order bits lost]
• Arithmetic shift right asr extends sign bit
• In contrast, lsr pulls in ‘0’s from left hand side
Register Shift Logical
• Logical shift left views bits as string, not signed number
• Hence lsl pulls in ‘0’s from right hand side
[Figure: logical shift left, lsl: high order bits lost; low order input: ‘0’s]
[Figure: logical shift right, lsr: high order input: ‘0’s; low order bits lost]
• Logical shift right views bits as string, not signed number
• Hence lsr also pulls in ‘0’s, but from left hand side
Register Shift Circular
• Circular shift left sees bit string, not signed number
• No bits lost during csl, leftmost bits pulled in on right
• Circular shift right sees bit string, not signed number
• No bits lost during csr, rightmost bits pulled in on left
[Figure: circular shift left, csl: high order bits move to low end; low order bits move to left]
[Figure: circular shift right, csr: high order bits move to right; low order bits move to high end]
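Circular shifts are easy to sketch because no bit is lost: whatever leaves on one side re-enters on the other. The names `csl` and `csr` follow the slide; the default width of 8 bits is an assumption.

```python
def csl(x: int, n: int = 1, width: int = 8) -> int:
    """Circular shift left: bits leaving on the left re-enter on the right."""
    mask = (1 << width) - 1
    x &= mask
    n %= width
    return ((x << n) | (x >> (width - n))) & mask


def csr(x: int, n: int = 1, width: int = 8) -> int:
    """Circular shift right: bits leaving on the right re-enter on the left."""
    mask = (1 << width) - 1
    x &= mask
    n %= width
    return ((x >> n) | (x << (width - n))) & mask


if __name__ == "__main__":
    r2 = 0b1011_0111
    print(f"{csl(r2):08b} {csr(r2):08b}")
```

Since nothing is discarded, a csl by n followed by a csr by n restores the original register content.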
Register Windows
Register Window
• SPARC architecture popularized register windows
• All visible registers in set 0 are indexed 0_0 .. (n-1)_0; the subscript names the set
• When another register is needed from set 0 after registers up to n-1 have been used, restart using register 0_1, acting as if register 0_0 were free
• But if 0_0 is actually still needed, let HW transparently back it up in hidden register file, simultaneously with continuing execution
• Later to be restored, when the overflowing register 0_1 is no longer needed; thus 0_0 comes back to life
• Requires more physical registers, many invisible, thus needs more control HW, but does not extend the index bits that an instruction uses to encode a particular register beyond log2( n ), with n being number of registers: no code bloat! ☺
Register Window
[Figure: Cyclic register naming]
Vector Registers
Vector Register Architecture VRA
• Registers on VRA are implemented as a HW array of functionally identical registers, named vr_i[j], i = 0 .. n-1, and j = 0 .. m-1, AKA vector registers
• VRA may have scalar registers, named r0, r1, etc.
• Vector registers vr_i[*] can each load/store blocks of contiguous data
• Still in sequence, but overlapped; number of clocks to complete a full vector load/store depends on bus width
• Vector registers perform multiple identical operations on contiguous blocks of operands
Vector Register Architecture
VRA operates sequentially; but processes n ≥ 1 vector operands in n registers simultaneously: faster than n sequential, scalar ops!
Vector Register Architecture
[Figure: graph shows parallel data processing in one single operation, using multiple registers]
Vector Register Architecture
• Otherwise operations look similar to GPR architecture
• Sample vector operations, assuming 64-unit ops:
ldv    vr1, mem_i           -- loads 64 memory locs from [mem+i], i = 0..63
stv    vr2, mem_j           -- stores vr2[0..63] in 64 contig. locs
vadd   vr1, vr2, vr3        -- register-register vector add
cvaddf r0, vr1, vr2, vr3    -- semantics: condition via bit in r0

-- sequential equivalent:
for i = 0 to 63 do
  if bit i in r0 = 1 then
    vr1[i] = vr2[i] + vr3[i]
  else -- bit must be 0
    -- do not move corresponding bits into vr1[i]
  end if
end for

-- parallel syntax equivalent:
forall i = 0 to 63 doparallel   -- parallel semantics
  if bit i in r0 = 1 then
    vr1[i] = vr2[i] + vr3[i]
  end if
end parallel for
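The conditional vector add can be mirrored in a few lines of Python. This is a sequential sketch of the semantics only; the function name `cvaddf` follows the pseudo op, while representing vector registers as lists and taking the lane count from the list length are assumptions.

```python
def cvaddf(r0: int, vr1: list, vr2: list, vr3: list) -> list:
    """Return the new vr1: lane i becomes vr2[i] + vr3[i] only if bit i of r0 is 1.

    Lanes whose mask bit is 0 keep their old vr1 value.
    """
    return [
        vr2[i] + vr3[i] if (r0 >> i) & 1 else vr1[i]
        for i in range(len(vr1))
    ]


if __name__ == "__main__":
    # 4 lanes instead of 64 to keep the example readable
    print(cvaddf(0b0101, [0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40]))
```

Each lane depends only on its own inputs and its own mask bit, which is why real vector HW may evaluate all lanes at once.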
Score Board
• Score-board sb[*] supports out of order execution
• Is not user visible, hence not accessible to programmer! EE students need to know ☺
• Instead, score-board sb[*] is array of HW programmable bits, or single-bit registers named sb[*], each identified by index; not visible in ISA! Owned by processor HW!
• Score-board manages actual HW registers
• Is single-bit HW array sb[]
• Every bit sb[i] is associated with one of the real, specific registers: the one identified by index i, e.g. ri
Score Board
• Association by index: sb[i] belongs to reg ri
• Only if score board sb[i] = 0 does register ri hold valid data; else must wait! Do not access!
• Also a load into register ri may proceed if sb[i] = 0
• Or we can say, if sb[i] = 0, then register ri is currently NOT in the process of being written
• If bit i is set, i.e. if sb[i] = 1, that register ri is reserved, i.e. it is off limits for the moment; HW must wait, until sb[i] = 0
• Initially all sb[*] are free to use, i.e. all are set to: sb[i] = 0
Score Board
• Execution constraints, assume:
    rd ← rs op rt
• If either sb[s] or sb[t] is set: → RAW dependence, hence HW stalls computation; waits until both rs and rt are available, i.e. until sb[s] = 0 and sb[t] = 0
• If sb[d] is set: → WAW dependence, hence HW stalls the write; waits until rd has been used; processor or even SW (compiler) can sometimes determine to use another register instead of rd that is known to be free
• Else, if none of the 3 registers is in use, i.e. if all score board entries s, t, and d are 0, then HW can dispatch instruction immediately
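The dispatch rule for rd ← rs op rt can be sketched as a tiny class. The class name `ScoreBoard` and its methods `hazard`, `dispatch`, and `write_back` are hypothetical; real scoreboards track more state than one bit per register, so this only mirrors the three cases listed on the slide.

```python
class ScoreBoard:
    """One busy bit per register: sb[i] = 1 means ri is still being written."""

    def __init__(self, num_regs: int):
        self.sb = [0] * num_regs          # initially all registers are free

    def hazard(self, d: int, s: int, t: int):
        """Classify what blocks `rd <- rs op rt`, or return None if nothing does."""
        if self.sb[s] or self.sb[t]:
            return "RAW"                  # a source is not yet valid
        if self.sb[d]:
            return "WAW"                  # destination has a pending write
        return None

    def dispatch(self, d: int, s: int, t: int) -> bool:
        """Dispatch if possible and reserve rd; return False when HW must stall."""
        if self.hazard(d, s, t) is not None:
            return False
        self.sb[d] = 1
        return True

    def write_back(self, d: int) -> None:
        """The result has arrived in rd: release the register."""
        self.sb[d] = 0


if __name__ == "__main__":
    sb = ScoreBoard(8)
    print(sb.dispatch(3, 1, 2), sb.hazard(5, 3, 4), sb.hazard(3, 6, 7))
```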
Score Board & ooo Execution
To allow out of order (ooo) execution, by using any available ri and rj:
1. For uses (AKA references), HW may take any register i whose sb[i] is 0
2. For definitions (AKA assignments), HW may set any register j whose sb[j] is 0
3. Independent of the original order in which the source program was written, i.e. possibly ooo
4. Provided that in the end all ISA visible registers hold the intended, programmed results
Score Board & ooo Execution
• Out of order execution (ooo), AKA dynamic execution
• CDC supercomputers broke complex instruction (e.g. FP divide) into a semantically equivalent sequence of simpler FP sub-operations
• Each of which could be executed very swiftly
• On pipelined architecture, numerous sub-operations or multiple instructions are live and make progress in various phases of completion
  • First invented for CDC 6600 during the mid-1960s
  • IBM 360/91 during the late 1960s, Tomasulo’s genuine ooo algorithm
  • IBM POWER1 μP in 1990
  • Intel x86 family, since 1995 on Pentium Pro®
Score Board & ooo Execution
• Multiple sub-operations progress simultaneously, yet in any order, that’s why we name it: ooo
• As long as the retiring order is logically equivalent to sequential operation of original instruction sequence
• Detail of ooo execution paradigm:
  1. Fetch next instruction i
  2. Dispatch i to instruction queue, AKA reservation station
  3. Then i waits in queue until input operands are available
  4. When available, i can leave queue, and run possibly even before earlier, older instructions
  5. i is issued to appropriate functional unit for execution
  6. Results are queued up, to preserve original order
  7. Once older instructions have written back results to register file rn, then i’s result is written back to rd; this is called the retire stage, with rd being instruction i’s destination register, i.e. holding the result
Zero Register Architecture
Zero Registers?
• Zero register architectures are known as stack machines
• Semi-tongue-in-cheek claim is: “Registers are not needed for computing!”
• As long as a stack is available (stack just being a policy of accessing main memory) all computations can be done without registers
• Technically correct!
• Yet all such computations will be slow; as operations are completed solely with memory operands!
• Speed gap is several decimal orders of magnitude worse, to the heavy loss of the stack machine, i.e. to the loss of the zero register architecture!
Code For Stack Architecture
• Solution? Implement a few top of stack elements via HW shadow registers ⇒ Cache
• Let us compare equivalent code sequences with and without consideration of a cache
• The top-of-stack register “tos” points to the last (topmost) valid word on physical stack
• Two hidden shadow registers may hold 0, 1, or 2 true top of stack words
• Top of stack cache counter tcc specifies number of shadow registers actually used
• Thus tos plus tcc jointly specify the true top of stack
Abstract Stack Architecture
[Figure: two stack diagrams, each labeled: free; tcc = 0, 1, 2; 2 tos registers; stack; tos]
Code For Stack Architecture
• Timings for push, pushlit, add, pop operations depend on top of stack cache
• Operations in shadow registers are fast, typically 1 cycle; includes register access and the operation
• Generally, memory access adds numerous cycles
• To track dynamic changes of the stack, use some defined policy, say try to keep top 50% full
• Table below refines timings for stack with transparent shadow registers
• For example, pushing element mem[ x ] into top of stack cache: we arbitrarily define that this requires 2 cycles, due to the memory fetch
• Note: 2 cycles for memory access, highly idealized! In reality more likely multiple tens of cycles!
Code For Stack Architecture

operation    cycles   tcc before   tcc after   tos change   comment
add          1        tcc = 2      tcc = 1     no change
add          1+2      tcc = 1      tcc = 1     tos--        underflow?
add          1+2+2    tcc = 0      tcc = 1     tos -= 2     underflow?
push x       2        tcc = 0,1    tcc++       no change    tcc update in parallel
push x       2+2      tcc = 2      tcc = 2     tos++        overflow?
pushlit #3   1        tcc = 0,1    tcc++       no change
pushlit #3   1+2      tcc = 2      tcc = 2     tos++        overflow?
pop y        2        tcc = 1,2    tcc--       no change
pop y        2+2      tcc = 0      tcc = 0     tos--        underflow?
Code For Stack Architecture
• Code emission for source snippet on SA:
    a + b * c ^ ( d + e * f ^ g )
• Let + and * be commutative, conventional language rule
• Architecture here has 2 shadow registers
• Assembly language programmer (or HLL compiler) exploits this
• Assume initially empty 2-word cache
Code For Stack Architecture

#    Left-to-Right   cycles   Exploit Cache   cycles
1    push a          2        push f          2
2    push b          2        push g          2
3    push c          4        expo            1
4    push d          4        push e          2
5    push e          4        mult            1
6    push f          4        push d          2
7    push g          4        add             1
8    expo            1        push c          2
9    mult            3        expo            1
10   add             3        push b          2
11   expo            3        mult            1
12   mult            3        push a          2
13   add             3        add             1
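The cycle counts for the two code sequences can be reproduced with a small model of the 2-word top-of-stack cache. This is a sketch under assumptions: the names `count_cycles`, `CACHE_SIZE`, and `MEM_PENALTY` are made up here, mult and expo are assumed to be timed like add, and only the push and binary-op timing rules are modelled.

```python
CACHE_SIZE = 2      # two shadow registers
MEM_PENALTY = 2     # idealized cost of one extra memory access


def count_cycles(program) -> int:
    """Total cycles for a list of stack ops, starting with an empty cache."""
    tcc = 0         # number of shadow registers currently in use
    total = 0
    for op in program:
        name = op.split()[0]
        if name == "push":                        # operand fetched from memory
            total += 2
            if tcc < CACHE_SIZE:
                tcc += 1
            else:                                 # cache full: spill one word
                total += MEM_PENALTY
        elif name in ("add", "mult", "expo"):     # binary ops, timed like add
            total += 1 + MEM_PENALTY * (CACHE_SIZE - tcc)   # refill missing operands
            tcc = 1                               # the result stays in the cache
        else:
            raise ValueError(f"unmodelled operation: {op}")
    return total


LEFT_TO_RIGHT = ["push a", "push b", "push c", "push d", "push e", "push f",
                 "push g", "expo", "mult", "add", "expo", "mult", "add"]
EXPLOIT_CACHE = ["push f", "push g", "expo", "push e", "mult", "push d", "add",
                 "push c", "expo", "push b", "mult", "push a", "add"]

if __name__ == "__main__":
    print(count_cycles(LEFT_TO_RIGHT), count_cycles(EXPLOIT_CACHE))
```

Under this model the left-to-right order pays for a spill on every push after the second and for a refill on every operation after the first, while the cache-aware order never has more than two live words.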
Code For Stack Architecture
• Brute-force code emission costs 40 cycles; i.e. failing to take advantage of tcc knowledge
• Code emission with shadow register consideration costs 20 cycles
• True penalty for memory access is worse in practice
• Tremendous speed-up always possible when fixing system with severe flaws ☺
• Return on investment for 2 registers is double the original performance!
• Such strong speedup is an indicator that the starting architecture was poor in the first place!
• Stack Machine can be fast, if purity of top-of-stack memory-access is relaxed for performance
• Indexing, looping, indirection, call/return etc. are not addressed here
Register Dependencies
• Inter-instruction dependencies, in EE parlance also known as dependences, arise between registers or memory locations being defined (AKA assigned, or written) and used (AKA read, or referenced)
• One instruction computes a result into a register (or memory); another instruction needs that result from that same register (or same memory location)
• Or, one instruction uses a register; and after use the same register is newly recomputed (written)
• Dependences force sequential execution, lest the result be unpredictable
• On next page: op is any arithmetic/logical operation
Register Dependencies
True-Dependence, AKA Data Dependence (synonymous):
    r3 ← r1 op r2    1: Write r3; op is some arithmetic opcode
    r5 ← r3 op r4    2: Read r3 after Write, RAW

Anti-Dependence, not a true dependence; can be parallelized under the right condition:
    r3 ← r1 op r2    1: Read r1
    r1 ← r5 op r4    2: Write r1 after Read, WAR

Output Dependence, similar to Anti-Dependence, is not a true dependence:
    r3 ← r1 op r2    1: Write r3
    r5 ← r3 op r4    2: Read r3
    r3 ← r6 op r7    3: Write r3 after Write, WAW, with a use in between
Register Dependencies
Control Dependence:
// ri, i = 1..4 come in “live”
if ( condition1 ) {
    r3 = r1 op r2;
}else{                  // ← see the jump here?
    r5 = r3 op r4;
} // end if
write( r3 );
Register Renaming
• Only data dependence is a real dependence, hence called true dependence
• Other dependences are artifacts of insufficient resources, generally insufficient registers
• This means: if additional registers were available, then replacing some of these conflicting registers with other registers could make the conflict (dependence) disappear!
• Anti- and Output-Dependences are indeed such false dependences
Register Renaming
Original Code:
-- r2, r3, r5, r6, r7 come in “live” from code before
-- r1, r4 are not “live”, don’t have initial values
-- r1, r2, r3, r4, r5, r6, r7 must go out “live”
L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7
Initial Dependences:
Lx: Ly: x, y = 1..4, which dependence? Next page
Register Renaming
Original Code:
L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7
Initial Dependences: numerous!!
L1, L2 true-Dep with r1
L1, L3 output-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3
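The dependence list for the four-instruction sample can be derived mechanically. The sketch below is one simple way to do it, not a description of any particular HW or compiler: the function name `find_dependences` and the tuple encoding `(dest, src1, src2)` are assumptions, and a dependence on a register is only reported when no instruction in between overwrites that register.

```python
def find_dependences(code):
    """Return a set of (i, j, kind, reg) with 1-based instruction numbers i < j.

    code is a list of (dest, src1, src2) register names.
    kind is "true" (RAW), "anti" (WAR) or "output" (WAW).
    """
    deps = set()
    for i, (dest_i, *srcs_i) in enumerate(code):
        for j in range(i + 1, len(code)):
            dest_j, *srcs_j = code[j]
            # registers overwritten strictly between instruction i and j
            killed = {code[k][0] for k in range(i + 1, j)}
            if dest_i in srcs_j and dest_i not in killed:
                deps.add((i + 1, j + 1, "true", dest_i))
            if dest_i == dest_j and dest_i not in killed:
                deps.add((i + 1, j + 1, "output", dest_i))
            for reg in set(srcs_i):
                if reg == dest_j and reg not in killed:
                    deps.add((i + 1, j + 1, "anti", reg))
    return deps


ORIGINAL = [("r1", "r2", "r3"),    # L1: r1 <- r2 op r3
            ("r4", "r1", "r5"),    # L2: r4 <- r1 op r5
            ("r1", "r3", "r6"),    # L3: r1 <- r3 op r6
            ("r3", "r1", "r7")]    # L4: r3 <- r1 op r7

if __name__ == "__main__":
    for dep in sorted(find_dependences(ORIGINAL)):
        print(dep)
```

For the original code this yields the six dependences listed on the slide; the read of r1 in L4 is attributed to L3 only, because L3 overwrites the r1 value produced by L1.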
Register Renaming
• What could be changed and improved for better performance, if we had additional registers?
• Hidden or real (visible architecture) registers could be advantageous!
• Compute and use other temporaries via other registers to reduce dependences!
• May at times allow higher degree of parallelism, due to lower degree of dependence
• More parallelism ⇒ faster execution!
• Register renaming conducted by HW; invisible to assembly programmer or compiler
Register Renaming

      Original Code:        New Code, added regs, e.g. r30 instead of r3:
L1:   r1 ← r2 op r3         r10 ← r2 op r30     -- r30 instead
L2:   r4 ← r1 op r5         r4  ← r10 op r5     -- r10 instead
L3:   r1 ← r3 op r6         r1  ← r30 op r6
L4:   r3 ← r1 op r7         r3  ← r1 op r7

Dependences before:             Dependences after:
L1, L2 true-Dep with r1         L1, L2 true-Dep with r10
L1, L3 output-Dep with r1       L3, L4 true-Dep with r1
L1, L4 anti-Dep with r3         // ri, i = 1..7 are “live”
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3
Register Renaming
• With these additional, renamed regs, the new code could execute in half the time!
• First: Compute into free/hidden reg r10 instead of r1, but needs additional register r10; no time penalty!
• Also: Compute in preceding code into r30 instead of r3, if r30 available; also no time penalty!
• Then all 7 regs are live afterwards: r1, r3, r4, plus the non-modified ones! E.g. r2 came in live, must go out live!
• While r10 and r30 are don’t cares afterwards; free to use again by HW
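The "half the time" claim can be checked with a simple scheduling model. The sketch assumes unit latency per instruction, unlimited functional units, and that every listed dependence (true, anti, or output) forces the later instruction into a later step; the function name `min_steps` and the pair encoding are illustrative only.

```python
def min_steps(num_instr: int, deps) -> int:
    """Fewest time steps for instructions 1..num_instr under the given dependences.

    deps is a list of (i, j) pairs, i < j, meaning instruction j must run
    in a later step than instruction i.
    """
    step = [1] * (num_instr + 1)            # step[k] for instruction k; index 0 unused
    for j in range(1, num_instr + 1):       # predecessors always have smaller numbers
        for i, k in deps:
            if k == j:
                step[j] = max(step[j], step[i] + 1)
    return max(step[1:])


# Dependence pairs copied from the before/after lists of the renaming example
BEFORE = [(1, 2), (1, 3), (1, 4), (3, 4), (2, 3), (3, 4)]
AFTER = [(1, 2), (3, 4)]

if __name__ == "__main__":
    print(min_steps(4, BEFORE), min_steps(4, AFTER))
```

Before renaming the chain L1, L2, L3, L4 is fully serialized; after renaming L1 and L3 can run together, followed by L2 and L4.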
Actual Register Set: Architecture Examples
Intel x86 Registers
Intel x86 Registers 32-bit
• Intel x86 is infamous for being register-starved!
• Need for object code compatibility extended life of x86 architecture beyond anyone’s imagination
Intel x86 Registers 32-bit
• Intel x86 has mmx and xmm registers
• Each can be used as an array of 8-bit, 16-bit, 32-bit, etc. sub-registers
• Also referred to as SSE (Streaming SIMD Extensions)
Intel x86 Registers 64-bit
Itanium™ Registers
Itanium Registers
• Intel’s newer 64-bit Itanium™ processor has 128 general registers (GR), 128 floating-point registers (FR), 64 single-bit predicate registers (PR), 8 branch registers (BR), 128 application registers (AR)
• Also, there are Performance Monitor Data registers (PMD), processor identifiers (CPUID), a Current Frame Marker register (CFM), user mask (UM), and an instruction pointer register (IP)
• GRs, FRs, BRs, ARs, CPUIDs, IP, and PMDs are 64 bits wide
• PRs are 1 bit wide, while the UM holds 6 and the CFM 38 bits; depicted below:
Itanium Register File
[Figure: register file by class, with bit ranges as shown]
GR:     gr0 .. gr127, each 63…0
FR:     fr0 .. fr127, each 63…0
PR:     pr0 .. pr63, each bit 0 only
BR:     br0 .. br7, each 63…0
AR:     ar0 .. ar127; labeled entries: ar0 Kr0 .. ar7 Kr7, ar16 RSC, ar17 BSP, ar18 BSPSTO, ar19 RNAT, ar21 FCR, ar30 FDR, ar32 CCV, ar36 UNAT, ar40 FSPR, ar44 ITC, ar64 LC, ar66 EC
ip:     63…0
cfm:    37…0
um:     5…0 (User Mask)
CPUID:  cpuid0 .. cpuidn, each 63…0
PMD:    pmd0 .. pmdm, each 63…0
Itanium Register GR
• The 128 GR registers are the common workhorses during computation
• They contain integer values being computed
• It is possible to use these integer values as machine addresses, thus GRs can be used as pointers in load- and store-operations
• All machine instructions can refer to these registers, for reading and writing values
• In addition to the 64 data bits, each GR has an associated NAT bit, which stands for Not A Thing
• NAT is 1, if the associated register has not been initialized with valid data
Itanium Register GR
• NATs support speculation
• For example, if a speculative load is issued but aborted before the value arrives in its destined GR, the NAT state records that fact
• Enables integrity of the machine’s exception process
• There are 2 groups of GR registers:
• The first 32, GR0 through GR31, are visible to all software, and are used to hold globally computed, intermediate values
• However, GR0 is read-only, providing the constant 0, 64 bits long
Itanium Register GR
• The next 96 registers, GR32 to GR127, are used to implement a small but frequently used portion of the top of the run-time stack; i.e. work like a special-purpose top-of-stack cache
• These stack registers are made available to SW by allocation of a register stack frame, which includes between 0 and 96 registers
• Registers not used from this subset are inaccessible to general SW
• The stack frame portion implemented via GRs is further partitioned into subsections, one meant to hold local registers, the other output registers, i.e. results of the current function call
Itanium Predicate Registers PR
• Execution of most IPF (Itanium Processor Family) instructions can be predicated by a PR (predicate register)
• Value 1 in the PR means: the operation can be completed normally
• PR value 0 means the result will not be posted (committed), even if it has been computed already. I.e. there will be no stores and no impact on any AR of the machine
• An exception, i.e. an instruction that cannot be predicated, is the loop operation
Itanium Predicate Registers
• The PRs are also partitioned into 2 sections:
• PR0 through PR15 are static PRs
• The other 48 are so called rotating PRs
• PR0 is an exceptional register: it can only be read, and its value is always 1, meaning the predicate is true; thus PR0 denotes unconditional execution
• The remaining 48 PRs are used to hold stage predicates, used during software pipelining
78
Itanium Branch Registers BR
l IPF instructions are grouped in bundles, which are 16-byte aligned byte sequences holding executable code. Hence the rightmost 4 bits of a bundle address will always be 0 due to alignment; these 4 address bits don’t need to be stored explicitly
l Execution of an indirect branch requires an explicit operand
l On the Itanium architecture this operand is a branch register; a branch register BR holds the branch destination
l The machine then loads the value of the referenced BR into the IP register and execution continues from there; IP stands for Instruction Pointer
l Executing branch-related instructions is about the only way to directly affect the value in the instruction pointer, the register that holds the address of the next bundle to be executed
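A minimal sketch of an indirect branch through a branch register, in Python. The class BranchUnit and its methods are hypothetical names. The count of 8 branch registers is not stated on the slides (Itanium defines BR0 .. BR7), and clearing the low 4 bits of the target on the branch is an assumption made here to keep the IP bundle-aligned.

```python
MASK64 = (1 << 64) - 1
BUNDLE_BYTES = 16
ALIGN_MASK = MASK64 & ~(BUNDLE_BYTES - 1)   # clears the rightmost 4 bits


class BranchUnit:
    """Toy model: branch registers plus the instruction pointer."""

    def __init__(self):
        self.br = [0] * 8   # BR0..BR7
        self.ip = 0

    def set_br(self, n, target):
        self.br[n] = target & MASK64

    def indirect_branch(self, n):
        # Load the referenced BR into IP; execution continues from there.
        # Assumption: the low 4 bits are ignored, so IP stays bundle-aligned.
        self.ip = self.br[n] & ALIGN_MASK
        return self.ip
```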
79
Current Frame Marker Register CFM
Note: Frame Marker is often referred to as Stack Frame, and its fixed portion as the Stack Marker
l Each function has a specific stack frame associated with it, which is created at function invocation; it is cleared at function return
l If all the relevant data of a function’s stack frame do fit, they are placed in the stack of general registers; else the overflowing data must reside in memory
l Either way, the current frame marker (CFM) holds the frame marker for the function that is currently active
l Generally, most functions have small stack frames
80
Current Frame Marker Register CFM
Layout of the CFM:
Bits:    37..32   31..25   24..18   17..14   13..7   6..0
Field:   rrb.pr   rrb.fr   rrb.gr   sor      sol     sof
Meaning of Bits in CFM:
Name     Bit Field   Meaning
sof      0..6        Total size of stack frame
sol      7..13       Size of local part of stack frame, in words
sor      14..17      Size of rotating portion of stack frame. The number of rotating registers is 8 times the sor value
rrb.gr   18..24      Register rename base for GRs
rrb.fr   25..31      Register rename base for FRs
rrb.pr   32..37      Register rename base for PRs
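The bit layout above can be exercised with a small encode/decode sketch in Python. The function names and the dictionary CFM_FIELDS are hypothetical; the field positions come from the table. The derived value output_regs = sof - sol is an assumption that follows from the frame being partitioned into a local and an output part.

```python
# Field name -> (lowest bit, width in bits), taken from the CFM table
CFM_FIELDS = {
    "sof":    (0, 7),
    "sol":    (7, 7),
    "sor":    (14, 4),
    "rrb_gr": (18, 7),
    "rrb_fr": (25, 7),
    "rrb_pr": (32, 6),
}


def encode_cfm(**values):
    """Pack named field values into a 38-bit CFM integer."""
    cfm = 0
    for name, value in values.items():
        lo, width = CFM_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        cfm |= value << lo
    return cfm


def decode_cfm(cfm):
    """Unpack a CFM integer into its fields plus two derived sizes."""
    fields = {name: (cfm >> lo) & ((1 << width) - 1)
              for name, (lo, width) in CFM_FIELDS.items()}
    fields["rotating_regs"] = 8 * fields["sor"]             # per the table
    fields["output_regs"] = fields["sof"] - fields["sol"]   # assumption
    return fields
```

For example, a frame with sof=24, sol=16 and sor=2 decodes to 16 rotating registers and, under the stated assumption, 8 output registers.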
81
Itanium Application Registers AR
Application Registers – t.b.d.:
Register      Mnemonic     Description of register
ar0 – ar7     KR0 – KR7    Kernel registers 0 .. 7
ar8 – ar15                 Reserved
ar16          t.b.d.
82
Itanium Instruction Pointer IP
l IPF instructions are fetched in units of bundles: chunks of 16 bytes, or 128 bits
l Bundles are stored bundle-aligned
l The 64-bit ip spans 18,446,744,073,709,551,616 (2^64) different byte addresses, but only ever holds bundle-aligned ones
l The rightmost 4 bits of the ip thus will always be zero, due to the bundle-alignment
l Hence these 4 bits don’t need to be stored on the microprocessor silicon
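The arithmetic behind these numbers, as a short Python sketch. The helper names pack_ip and unpack_ip are hypothetical; they only illustrate that 60 stored bits suffice to reconstruct a bundle-aligned 64-bit address, and do not describe actual silicon.

```python
IP_BITS = 64
BUNDLE_BYTES = 16
ALIGN_BITS = BUNDLE_BYTES.bit_length() - 1          # log2(16) = 4

byte_addresses = 1 << IP_BITS                        # 2**64 byte addresses
bundle_addresses = byte_addresses // BUNDLE_BYTES    # 2**60 bundle addresses
stored_bits = IP_BITS - ALIGN_BITS                   # bits that must be kept


def pack_ip(ip):
    """Drop the 4 implied zero bits of a bundle-aligned address."""
    if ip % BUNDLE_BYTES != 0:
        raise ValueError("ip is not bundle-aligned")
    return ip >> ALIGN_BITS


def unpack_ip(stored):
    """Restore the full 64-bit address by re-appending 4 zero bits."""
    return stored << ALIGN_BITS
```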
83
Performance Monitor Data Register
l These are architecture-provided resources that record the use of HW modules
l Their contents are read-only for SW
l But contrary to the performance monitor registers on Intel Pentium architectures, they are user visible on Itanium
84
Alpha Registers
85
Alpha Registers
l On an MP Alpha system, each processor has its own, full complement of architecture registers
l The pc register always addresses the next instruction in the 4-byte aligned instruction stream
l The pc is 64 bits wide, yet the rightmost 2 bits are implied 0 and not explicitly stored, due to the 4-byte instruction alignment
l Alpha has 32 integer registers, each 64 bits wide, conventionally named R0 .. R31
l R31 has special meaning: R31 always supplies integer 0 as a source operand
l Clearly, R31 is not writeable
l Exceptions are not raised when R31 is specified as a destination for a load!
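The behavior of the zero register R31 can be sketched as a small Python register file. The class AlphaIntRegs and the helper add are hypothetical names for illustration; the model only captures what the slide states: R31 reads as 0, and writes to it are dropped without raising an exception.

```python
MASK64 = (1 << 64) - 1
ZERO_REG = 31


class AlphaIntRegs:
    """Toy model of R0..R31 with R31 hardwired to integer 0."""

    def __init__(self):
        self._r = [0] * 32

    def read(self, n):
        return 0 if n == ZERO_REG else self._r[n]

    def write(self, n, value):
        if n == ZERO_REG:
            return              # write is discarded, no exception raised
        self._r[n] = value & MASK64


def add(regs, dest, a, b):
    """dest = a + b, wrapping at 64 bits."""
    regs.write(dest, regs.read(a) + regs.read(b))
```

With this model, add(regs, 2, 1, 31) copies R1 into R2, since R31 supplies 0 as a source operand.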
86
Alpha Registers
l Alpha has 32 floating-point registers, named F0 .. F31
l Each float register is 64 bits wide
l Register F31 always holds the true 0.0 floating-point value as a constant; it cannot be written
l Note: An exception is not signaled for a load specifying F31 as destination!
l Float instructions computing single-precision data (only 32 bits wide) still write all 64 bits of their respective floating-point destination register; the 32-bit value is expanded to the 64-bit register format
87
Alpha Registers
l Alpha has 2 special registers, named lock-registers, LR0 and LR1; not further explained here
l Process Cycle Counter (PCC) register consists of two 32-bit fields; usable for performance monitoring
l Low order 32 bits (31..0), known as PCC_CNT, are used as an interval timer: an unsigned wrapping counter, tracking the number of nanoseconds of an event
l High order 32 bits (63..32), known as PCC_OFF, are operating-system dependent
l Suggested: use as cycle counter for process, thread
l PCC is read by the special RPCC instruction; for OS support
l FPCR (64-bit Floating Point Control Register) is used with IEEE 754 format; else FPCR is not visible; among other things, it selects one of four rounding modes
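Splitting a 64-bit PCC value into its two fields, and measuring an interval with the wrapping low field, can be sketched in Python as follows. The function names split_pcc and elapsed are hypothetical; the input stands for a value as it would be returned by RPCC.

```python
MASK32 = 0xFFFFFFFF


def split_pcc(pcc):
    """Return (PCC_OFF, PCC_CNT) from a 64-bit PCC value."""
    pcc_off = (pcc >> 32) & MASK32   # bits 63..32, OS dependent
    pcc_cnt = pcc & MASK32           # bits 31..0, unsigned wrapping counter
    return pcc_off, pcc_cnt


def elapsed(start_cnt, end_cnt):
    """Interval between two PCC_CNT samples, tolerating one wrap-around."""
    return (end_cnt - start_cnt) & MASK32
```

Masking the difference to 32 bits gives the correct interval even when the counter wrapped once between the two samples; longer intervals cannot be recovered from PCC_CNT alone.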
88
Alpha Registers
89
IBM 370 Registers
90
IBM 370 Registers
l IBM’s ancient 370 mainframe architecture preceded x86 and had a regular and relatively rich register set
l Various formats: half-word, word, extended formats
91
Bibliography
1. Morris M. Mano, et al.: Logic and Computer Design Fundamentals, Pearson 5th Edition, ISBN 978-0-13-376063-7
2. Shen, John Paul, and Mikko H. Lipasti: Modern Processor Design, Fundamentals of Superscalar Processors, McGraw Hill, © 2005
3. Nilsson, James W., and Susan A. Riedel: Electric Circuits, © 2015 Pearson Education Inc., ISBN 13: 9780-13-376003-3
4. Sparc: https://en.wikipedia.org/wiki/SPARC
5. http://www-03.ibm.com/ibm/history/exhibits/mainframe/mainframe_PP3158.html