
1

ECE 172 Digital Systems

Chapter 3 Registers

Herbert G. Mayer, PSU
Status 7/12/2018

2

Syllabus
• Definitions, Introduction
• Register Transfer & RTL
• Register Shift Operations
• Register Windows
• Vector Registers
• Score Board
• Zero Register Architecture
• Register Dependencies
• Actual Register Sets
• Bibliography

3

Definitions
Topic: Register; AKA machine register; AKA processor register

• A machine register is an ISA-visible system resource holding data of a specific size; data can be accessed and processed fast. Data size is generally one word
• Getting data from memory into a register is referred to as loading; moving bits from a register to memory is called storing
• Generally, information in registers can be processed faster than any other data in digital systems
• Register size is dictated by the computer architecture; e.g. a 64-bit architecture has registers holding 64 bits of information; may be data, addresses or other
• CPU may have many registers (Itanium) or few (x86)
• Registers are identified via index, i.e. their name

4

Definitions
• Register is a CPU resource holding operands for computation; different operands at different times
• Register is key resource in digital systems: to store information while powered on; register content is volatile
• CPU may have 0 (yes zero) or more user-visible registers
• Register operand may be source for computation
• Or destination, holding next result after operation
• Registers may be both, source and destination, in which case computation changes the original source operand
• Type of architecture with registers: general purpose register architecture, GPRA
• Not all computer architectures have registers
• For example, stack machines hold operands on top of a stack; stack grows and shrinks; has 0 user-visible registers!

5

Definitions
• Early architectures, AKA von Neumann machines, or Princeton architecture, had but 1 register, known as accumulator
• AKA single-accumulator architecture SAA!
• Note: Architectural registers of CPU are user visible
• However, real HW implementations of any defined architecture may provide hidden registers
• For example, the ancient Intel x86 processor nowadays has many internal registers that are not user visible
• They are only available to the running HW, which may manage them to speed up execution
• Also, register windows on SPARC may have more internal (i.e. not visible) registers to speed up execution, by performing register saving and restoring at calls and returns; not detailed here

6

Definitions

Von Neumann Architecture, © teach-ict

7

Introduction
• Since accumulator on SAA was sole source + target of operations, instructions never needed to explicitly name that unique register: always implied!
• Modern architectures have multiple registers, e.g. for integer ops, floating-point ops, program counter, status, segment registers, stack addresses, etc.
• Early architectures were register-starved, e.g. Intel x86; yet hidden registers in modern versions of x86 alleviate slowness due to register shortage!
• Program status register holds ALU status of last op; is indirectly visible via branch conditions; e.g. branch_if_zero, or branch_if_less, etc.
• Recent architectures (e.g. Itanium) have a large number of visible registers, and also a large number of hidden (not user visible) registers, to speed up execution through (hidden) register renaming

8

Introduction
• An architecture ought ☺ to have many registers!
• Yet accessing any such HW resource requires naming (via indexing) a register, AKA addressing it
• Number of bits used in instruction to address register is ⌈ log2( number-of-regs ) ⌉, increasing object code size! Where is the optimum?
• Way to reduce number of bits: to partition registers into different classes: integer, float, branch, status
• Desirable for execution speed: ideally all data reside in registers! Yet not feasible!
• Data set is way larger than total size of register file, hence this ideal is impossible
• Architecture solution via memory hierarchy: register, slower registers, cache, slower cache, memory, etc.
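
The register-addressing cost above can be made concrete with a minimal Python sketch. The function names and the three-operand instruction format are assumptions for illustration only, not part of any real ISA.

```python
def register_index_bits(num_regs: int) -> int:
    """Bits needed to name one of num_regs registers: ceil(log2(num_regs))."""
    if num_regs < 1:
        raise ValueError("need at least one register")
    # For num_regs >= 1, (num_regs - 1).bit_length() equals ceil(log2(num_regs)).
    return (num_regs - 1).bit_length()


def operand_field_bits(num_regs: int, operands: int = 3) -> int:
    """Bits spent on register names in one instruction (assumed 3-operand format)."""
    return operands * register_index_bits(num_regs)


if __name__ == "__main__":
    for n in (8, 32, 128):
        print(n, "registers:", register_index_bits(n), "bits per operand,",
              operand_field_bits(n), "bits per 3-operand instruction")
```

Going from 8 to 128 registers raises the register fields of a three-operand instruction from 9 to 21 bits, which is the code-size pressure the slide asks about.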

9

Introduction
Memory hierarchy: registers & other HW modules holding data

10

Introduction
Another, common view of memory hierarchy: Registers at top

11

Logical Register

• Logical register (LR) is a hypothetical machine resource to hold data as operands for computation, addressing, indexing, decision making, etc.
• Convenient model for discussing architecture
• Logical registers are used as abstract design tool to explore or refine a computer architecture
• To propose code sequences for simulation etc.
• LR doesn’t suffer ☺ from physical constraints, such as slowness, limit of data size, number of units
• As eventual result of computer design process, LR may end up defining key attributes for a to-be-defined actual register of a digital system being built

12

Physical Register
• A physical or processor register (PR) is a machine resource holding data as operands for addressing, computing, and decision making, etc.
• Each PR has a unique name, specific width, defined data types, and set of operations
• Width (number of bits) of a PR is defined by architecture of digital system, for which the PR is a resource
• Or a PR may be defined by the maximum precision of the data it is ever expected to compute
• Frequently, these two precisions are identical; e.g. on a 32-bit architecture the maximum numeric precision for integer or floating-point data was also 32 bits
• But they may differ; e.g. during the evolution of 32-bit architectures, wider numeric precisions such as 64-bit integer or float data became commonplace

13

Physical Register
• Actual number of PRs dictates required number of bits in instructions that specify register source and destination of register operations
• Old Intel x86 architecture has 4 general purpose and 4 dedicated (segment) registers; 8 total! Not a typo!
• SPARC architecture has 32 visible physical registers; yet has large number of hidden registers, available to allow smooth use of circular register window, when limit of 32 is exceeded
• After consuming all 32 (register 0 .. 31), count of next register restarts at 0; yet old register 0 must be saved
• Physical registers are actually built from flip-flops, thus are clocked, hold data between clock pulses
• PRs need reset (or clear) function at start of computation

14

Physical Register
[Figure: sample registers. r1: generic complete register, bits unspecified. r2: some 64-bit register, bits 63 .. 0. r3: 32-bit, 4-byte register with byte3 (bits 31 .. 24), byte2 (bits 23 .. 16), byte1 (bits 15 .. 8), byte0 (bits 7 .. 0)]

• Sample registers, with specified length: r2 having 64 bits; r3 only 32 bits, 4 bytes
• Register r3 shows bit indices right to left, byte addresses right to left within one word: Little Endian
• Else byte addresses increase left to right: Big Endian
• Generic register r1 with length unspecified

15

Register Transfer &

Register Transfer Language

16

Register Transfer Language RTL
• Register transfer language (RTL) is not a general purpose programming language (PL): PL offers higher-level abstractions to discuss computing environment
• RTL is not an assembly language (AL): Executable operations of AL map directly onto CPU instructions: level of abstraction for AL higher than for RTL
• Instead, RTL is a low-level language, used to define digital systems including key processor components; RTL specifies operations on registers or between multiple registers
• RTL shares certain operations with high level PLs or ALs: e.g. transferring bits, copying bits, specifying bit ranges, zeroing bit fields, or shifting bits, etc.
• . . . where source or destination or both are registers

17

Register Transfer Language

• Reason for continued use of HDL is logic synthesis
• HDL description of a system can be written in intermediate language, AKA RTL
• Logic synthesis tools can convert a HW description into interconnection of simple components that implement such circuit!
• . . . and can transform RTL specification of a circuit in HDL into an equivalent netlist
• Optimized netlist with storage elements and with combinational logic
• Netlist can be mapped into actual IC layout
• Becoming basis for IC manufacturing

18

Register Micro Operations
Micro operations (micro-ops) are low-level (primitive) operations executed in digital systems, involving register operands, such as:
1. Register to register transfer micro-ops, moving bits from one to another register
2. Arithmetic micro-ops, performing arithmetic on numeric data in registers
3. Logical micro-ops, performing bit manipulations on non-numeric data in registers
4. Shift micro-ops, moving bits or bit fields inside a register
5. Setting micro-ops, clearing to 0 or setting to 1 the selected bit fields (or all bits) inside a register

19

Register Micro Operations
RTL allows various addressing modes, needed per operand location; permissible due to power of digital system HW versus more restrictive SW environment:
1. Immediate: literal is immediate in opcode, e.g. r1 ← #9
2. Implied: Operand on stack architecture, e.g. add, refers to 2 top operands stack[top] and stack[top-1]
3. Register to register: natural instruction on GPR architecture, assign r1 value of r2, e.g. r1 ← r2
4. Memory indirect case 1: operand is in memory, memory address 1234 is an immediate operand, e.g. r1 ← mem[ #1234 ]
5. Register indirect case 2: operand is in memory, address is in another register, e.g. r1 ← mem[ r2 ]
6. Indexed: operand is in memory, address in register plus numeric offset n, e.g. r1 ← mem[ r2 + n ]
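
The addressing modes listed on this slide can be sketched as a small Python operand fetch. This is an illustrative model only: the mode names, the dictionary-based register file and memory, and the function `load` are assumptions, and the implied (stack) mode is left out because it needs no explicit operand.

```python
def load(regs, mem, dest, mode, operand, offset=0):
    """Model of 'dest <- operand' for five of the addressing modes above."""
    if mode == "immediate":            # r1 <- #9
        value = operand
    elif mode == "register":           # r1 <- r2
        value = regs[operand]
    elif mode == "memory":             # r1 <- mem[ #1234 ]
        value = mem[operand]
    elif mode == "register_indirect":  # r1 <- mem[ r2 ]
        value = mem[regs[operand]]
    elif mode == "indexed":            # r1 <- mem[ r2 + n ]
        value = mem[regs[operand] + offset]
    else:
        raise ValueError(f"unknown addressing mode: {mode}")
    regs[dest] = value
    return value


if __name__ == "__main__":
    regs = {"r1": 0, "r2": 100}
    mem = {1234: 7, 100: 8, 104: 9}
    load(regs, mem, "r1", "indexed", "r2", offset=4)
    print(regs["r1"])  # value stored at address r2 + 4
```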

20

Register Transfer
• Abstract view of register transfer is akin to assignment in SW, except RTL operands specify registers, as opposed to general program objects, e.g.:

r1 = r2; -- no predicate used: assign unconditionally!
if ( predicate_p1 == true ) then r1 = r2; -- with predicate!

• In RTL more tersely expressed as:

r1 ← r2 -- no predicate: assign register to register!
p1: r1 ← r2 -- with predicate, meaning: if p1 is true

• Digital circuits have one grand power ☺ other tools often lack: parallelism!

p2: r3 ← r4, and r5 ← r6

• Multiple register transfers performed simultaneously, provided predicate p2 holds!
• In preparation for register transfer, we review flip-flops, specifically D flip-flops; registers are built from arrays of flip-flops, one flip-flop per register bit
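
A predicated, simultaneous register transfer such as `p2: r3 ← r4, and r5 ← r6` can be modeled in software by reading all sources before writing any destination. The sketch below is a hypothetical Python model (the name `transfer` and the dictionary register file are assumptions); it mimics what clocked flip-flops do in one clock edge.

```python
def transfer(regs, predicate, moves):
    """Perform all 'dst <- src' moves at once if predicate holds.

    moves is a list of (dst, src) register names. All sources are sampled
    first, so the moves behave as if executed in the same clock cycle.
    """
    if not predicate:
        return False
    sampled = [regs[src] for _, src in moves]   # read phase
    for (dst, _), value in zip(moves, sampled):  # write phase
        regs[dst] = value
    return True


if __name__ == "__main__":
    regs = {"r1": 1, "r2": 2}
    # p: r1 <- r2, r2 <- r1 swaps the registers, which sequential
    # assignment could not do without a temporary.
    transfer(regs, True, [("r1", "r2"), ("r2", "r1")])
    print(regs)
```

The swap shows why simultaneity matters: executed one after the other, the two assignments would leave both registers holding the same value.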

21

Register Shift Operations

22

Register Shift Op
• Register shift operations (AKA shifts) move bits of a source register a defined number of bit positions into a destination register
• Source and destination may be different registers, or could be the same
• Depending on shift type, there are side effects in addition to destination register change:
  • Bits are discarded into a large bit bucket ☺
  • 0-bits, 1-bits, sign-bits are pulled in
  • Flags are set (e.g. sign, overflow, zero, etc.)
• So called left shift moves bits of a register toward the most significant bit
• Right shift moves register bits toward least significant bit, AKA right hand side

23

Register Shift Op
• If bits in register are viewed as string of 0s and 1s, shift operation may lose bits without side-effect
  • AKA logical view of bit values
  • Shift right may pull in zeros on left hand side (high bit index)
• Or shift can be arithmetic, then sign bit value is to be considered:
  • Shift right extends (pulls in) sign bit on left hand side
  • Shift left may cause overflow on two's complement, if leftmost 2 bits differ during left shift! Sign change!
• For arithmetic shifts, convenient to view left shift by 1 position as multiply by 2
• Or right shift to divide by 2: Must extend the sign!
• Sample, pseudo shift operations below:

24

Register Shift Op
Pseudo shift instructions, not selected from any real computer architecture:

r1 ← lsl r1 -- logical shift left 1 bit same reg

r1 ← lsr r1 -- logical shift right 1 bit same reg

r1 ← lsl r2 -- logical shift left 1 bit different

r1 ← lsr r2 -- logical shift right 1 bit

r1 ← asl r2 -- arithmetic shift left 1 bit

r1 ← asr r2 -- arithmetic shift right 1 bit

r1 ← 2 lsl r2 -- logical shift left 2 bits

r1 ← 3 lsr r2 -- logical shift right 3 bits

r1 ← 4 asl r2 -- arithmetic shift left 4 bits

r1 ← 5 asr r2 -- arithmetic shift right 5 bits

r1 ← rotl r2 -- rotate r2 left 1 bit, result in r1

25

Register Shift Op 8-bits

26

Register Shift Op 8-bits

Operation        Pseudo op        r2 before   r1 after
Right logical    r1 ← lsr r2      1011,0111   0101,1011
Logical shift right 1 position: leftmost bit ‘1’ in r2 is not interpreted as sign bit, thus ‘0’s pulled in from left

Operation        Pseudo op        r2 before   r1 after
Right arithmet.  r1 ← asr r2      1011,0111   1101,1011
Arithmetic shift right 1, leftmost bit ‘1’ in r2 “interpreted” by HW as sign bit, thus ‘1’s pulled in, i.e. sign-extended

Operation        Pseudo op        r2 before   r1 after
Left logical     r1 ← 2 lsl r2    1011,0111   1101,1100
Logical shift left, 2 positions, left 2 bits into bit bucket; and from right side, two ‘0’s are pulled in
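
The three 8-bit examples above can be checked with a short Python sketch. Python integers are unbounded, so the register width is modeled by masking; `WIDTH`, `MASK`, and the function names (borrowed from the pseudo ops) are assumptions for illustration.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1  # 0xFF for an 8-bit register


def lsl(x, n=1):
    """Logical shift left: high bits are lost, '0's pulled in on the right."""
    return (x << n) & MASK


def lsr(x, n=1):
    """Logical shift right: low bits are lost, '0's pulled in on the left."""
    return (x & MASK) >> n


def asr(x, n=1):
    """Arithmetic shift right: low bits are lost, sign bit is extended."""
    x &= MASK
    n = min(n, WIDTH)
    result = x >> n
    if x >> (WIDTH - 1):                        # sign bit set?
        result |= (MASK << (WIDTH - n)) & MASK  # fill top n bits with '1's
    return result


def asl_overflows(x):
    """1-bit arithmetic shift left overflows if the leftmost 2 bits differ."""
    x &= MASK
    return ((x >> (WIDTH - 1)) ^ (x >> (WIDTH - 2))) & 1 == 1


if __name__ == "__main__":
    r2 = 0b10110111
    print(f"{lsr(r2):08b} {asr(r2):08b} {lsl(r2, 2):08b}")
```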

27

Register Shift Arithmetic

• Arithmetic shift left asl pulls in ‘0’s from right side
• Views bits as signed two’s complement numbers
• Sign bit change causes overflow; to be flagged?

[Figure: arithmetic shift left, asl: high order sign bit lost; low order input: ‘0’s]
[Figure: arithmetic shift right, asr: sign bit is extended; low order bits lost]

• Arithmetic shift right asr extends sign bit
• In contrast, lsr pulls in ‘0’s from left hand side

28

Register Shift Logical

• Logical shift left (lsl) views bits as string, not signed number
• Hence lsl pulls in ‘0’s from right hand side

[Figure: logical shift left, lsl: high order bits lost; low order input: ‘0’s]
[Figure: logical shift right, lsr: high order input: ‘0’s; low order bits lost]

• Logical shift right (lsr) views bits as string, not signed number
• Hence lsr also pulls in ‘0’s, but from left hand side

29

Register Shift Circular

• Circular shift left sees bit string, not signed number
• No bits lost during csl, leftmost bits pulled in on right
• Circular shift right sees bit string, not signed number
• No bits lost during csr, rightmost bits pulled in on left

[Figure: circular shift left, csl: high order bits move to low end; low order bits move to left]
[Figure: circular shift right, csr: high order bits move to right; low order bits move to high end]
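
Circular shifts can be sketched in the same style. The functions `csl` and `csr` below are named after the slide's pseudo ops; the default 8-bit width is an assumption.

```python
def csl(x, n=1, width=8):
    """Circular shift left: bits leaving on the left re-enter on the right."""
    mask = (1 << width) - 1
    x &= mask
    n %= width
    return ((x << n) | (x >> (width - n))) & mask


def csr(x, n=1, width=8):
    """Circular shift right: bits leaving on the right re-enter on the left."""
    mask = (1 << width) - 1
    x &= mask
    n %= width
    return ((x >> n) | (x << (width - n))) & mask


if __name__ == "__main__":
    r2 = 0b10110111
    print(f"{csl(r2):08b} {csr(r2):08b}")
```

Because no bits are lost, `csr(csl(x, n), n)` returns `x` for every shift count.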

30

Register Windows

31

Register Window
• SPARC architecture popularized register windows
• All visible registers in set 0 are indexed from 0₀ .. (n-1)₀
• When another register is needed after all n registers of set 0 have been used, restart using register 0₁, acting as if register 0₀ were free
• But if 0₀ is actually still needed, let HW transparently back it up in hidden register file, simultaneously with continuing execution
• Later to be restored, when the overflowing register 0₁ is no longer needed; thus 0₀ comes back to life
• Requires more physical registers, many invisible, thus needs more control HW, but does not extend the index bits an instruction uses to encode a particular register beyond log2( n ), with n being number of registers: no code bloat! ☺

32

Register Window
Cyclic register naming

33

Vector Registers

34

Vector Register Architecture VRA
• Registers on VRA are implemented as a HW array of functionally identical registers, named: vri[j], i = 0 .. n-1, and j = 0 .. m-1, AKA vector registers
• VRA may have scalar registers, named r0, r1, etc.
• Vector registers vri[*] can each load/store blocks of contiguous data
• Still in sequence, but overlapped; number of clocks to complete a full vector load/store depends on bus width
• Vector registers perform multiple identical operations on contiguous blocks of operands

35

Vector Register Architecture
VRA operates sequentially; but processes n ≥ 1 vector operands in n registers simultaneously: faster than n sequential, scalar ops!

36

Vector Register Architecture
Graph shows parallel data processing in one single operation, using multiple registers

37

Vector Register Architecture
• Otherwise operations look similar to GPR architecture
• Sample vector operations, assuming 64-unit ops:

ldv vr1, memi             -- loads 64 memory locs from [mem+i], i = 0..63
stv vr2, memj             -- stores vr2[0..63] in 64 contig. locs
vadd vr1, vr2, vr3        -- register-register vector add
cvaddf r0, vr1, vr2, vr3  -- semantics: condition via bit in r0

-- sequential equivalent:
for i = 0 to 63 do
    if bit i in r0 = 1 then
        vr1[i] = vr2[i] + vr3[i]
    else -- must be 0
        -- do not move corresponding bits into vr1[i]
    end if
end for

-- parallel syntax equivalent:
forall i = 0 to 63 doparallel -- parallel semantics
    if bit i in r0 = 1 then
        vr1[i] = vr2[i] + vr3[i]
    end if
end parallel for
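
The conditional vector add can also be written as a runnable Python sketch. It follows the sequential equivalent above; the function name mirrors the pseudo op `cvaddf`, and the choice that bit i of r0 is counted from the least significant bit is an assumption the slide does not state.

```python
def cvaddf(r0, vr1, vr2, vr3):
    """Return the new vr1: vr2[i] + vr3[i] where bit i of r0 is 1, else old vr1[i].

    Assumption: bit i of the mask register r0 is counted from the
    least significant bit.
    """
    return [
        a + b if (r0 >> i) & 1 else old
        for i, (old, a, b) in enumerate(zip(vr1, vr2, vr3))
    ]


if __name__ == "__main__":
    vr1 = [0] * 64
    vr2 = list(range(64))
    vr3 = [100] * 64
    mask = 0b101  # only elements 0 and 2 are updated
    print(cvaddf(mask, vr1, vr2, vr3)[:4])
```

A real vector unit would perform the 64 element operations in parallel; the list comprehension only reproduces the result, not the timing.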

38

Score Board

39

Score Board
• Score-board sb[*] supports out of order execution
• Is not user visible, hence not accessible to programmer! EE students need to know ☺
• Instead, score-board sb[*] is array of HW programmable bits, or single-bit registers named sb[*], each identified by index; not visible in ISA! Owned by processor HW!
• Score-board manages actual HW registers
• Is single-bit HW array sb[]
• Every bit i in sb[i] is associated with one of the real, specific registers: the one identified by index i, e.g. ri

40

Score Board

• Association by index: sb[i] belongs to reg ri
• Only if score board sb[i] = 0 does register ri hold valid data; else must wait! Do not access!
• Also a load into register ri may proceed if sb[i] = 0
• Or we can say, if sb[i] = 0, then register ri is currently NOT in the process of being written
• If bit i is set, i.e. if sb[i] = 1, that register ri is reserved, i.e. it is off limits for the moment; HW must wait, until sb[i] = 0
• Initially all sb[*] are free to use, i.e. all are set to: sb[i] = 0

41

Score Board
• Execution constraints, assume:

rd ← rs op rt

• If either sb[s] or sb[t] is set → RAW dependence, hence HW stalls computation; wait until both rs and rt are available, i.e. until sb[s] = 0 and sb[t] = 0
• If sb[d] is set → WAW dependence, hence HW stalls the write; waits until rd has been used; processor or even SW (compiler) can sometimes determine to use another register instead of rd that is known to be free
• Else, if none of the 3 registers is in use, i.e. if all score board entries s, t, and d are 0, then HW can dispatch instruction immediately
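
The dispatch rules above can be sketched as a tiny Python model. The class `Scoreboard` and its method names are hypothetical; real hardware tracks far more state (functional units, pipeline stages), so this only illustrates the three cases for `rd ← rs op rt`.

```python
class Scoreboard:
    """One busy bit per register: sb[i] = 1 means ri is being written."""

    def __init__(self, num_regs):
        self.sb = [0] * num_regs  # initially all registers are free

    def check(self, d, s, t):
        """Classify 'rd <- rs op rt' against the current busy bits."""
        if self.sb[s] or self.sb[t]:
            return "stall: RAW"   # a source is still being produced
        if self.sb[d]:
            return "stall: WAW"   # destination has a pending write
        return "dispatch"

    def dispatch(self, d, s, t):
        """Dispatch if possible and reserve the destination register."""
        status = self.check(d, s, t)
        if status == "dispatch":
            self.sb[d] = 1
        return status

    def complete(self, d):
        """Result written back: the destination register is valid again."""
        self.sb[d] = 0


if __name__ == "__main__":
    sb = Scoreboard(8)
    print(sb.dispatch(3, 1, 2))  # r3 <- r1 op r2
    print(sb.dispatch(5, 3, 4))  # r5 <- r3 op r4 must wait for r3
    sb.complete(3)
    print(sb.dispatch(5, 3, 4))
```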

42

Score Board & ooo Execution
To allow out of order (ooo) execution, by using any available ri and rj

1.  For uses (AKA references), HW may take any register i, whose sb[i] is 0

2.  For definitions (AKA assignments), HW may set any register j, whose sb[j] is 0

3.  Independent of original order, in which source program was written, i.e. possibly ooo

4.  Provided, in the end all ISA visible registers hold the intended, programmed results

43

Score Board & ooo Execution
• Out of order execution (ooo), AKA dynamic execution
• CDC supercomputers broke complex instruction (e.g. FP divide) into a semantically equivalent sequence of simpler FP sub-operations
• Each of which could be executed very swiftly
• On pipelined architecture, numerous sub-operations or multiple instructions are live and make progress in various phases of completion
  • First invented for CDC 6600 in the mid-1960s
  • IBM 360/91 in the late 1960s, Tomasulo’s genuine ooo algorithm
  • IBM POWER1 μP in 1990
  • Intel x86 family, since 1995 on Pentium Pro®

44

Score Board & ooo Execution
• Multiple sub-operations progress simultaneously, yet in any order, that’s why we name it: ooo
• As long as the retiring order is logically equivalent to sequential operation of original instruction sequence
• Detail of ooo execution paradigm:
  1. Fetch next instruction i
  2. Dispatch i to instruction queue, AKA reservation station
  3. Then i waits in queue until input operands are available
  4. When available, then i can leave queue, and run possibly even before earlier, older instructions
  5. i is issued to appropriate functional unit for execution
  6. Results are queued up, to preserve original order
  7. Once older instructions have written back results to register file rn, then i’s result is written back to rd; this is called the retire stage, with rd being instruction i’s destination register, i.e. holding the result

45

Zero Register Architecture

46

Zero Registers?
• Zero register architectures are known as stack machines
• Semi-tongue-in-cheek claim is: “Registers are not needed for computing!”
• As long as a stack is available (stack just being a policy of accessing main memory) all computations can be done without registers
• Technically correct!
• Yet all such computations will be slow; as operations are completed solely with memory operands!
• Speed gap is several decimal orders of magnitude worse, to the heavy loss of the stack machine, i.e. to the loss of the zero register architecture!

47

Code For Stack Architecture
• Solution? Implement a few top of stack elements via HW shadow registers ⇒ Cache
• Let us compare equivalent code sequences with and without consideration of a cache
• The top-of-stack register “tos” points to the last (topmost) valid word on physical stack
• Two hidden shadow registers may hold 0, 1, or 2 true top of stack words
• Top of stack cache counter tcc specifies number of shadow registers actually used
• Thus tos plus tcc jointly specify the true top of stack

48

Abstract Stack Architecture

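[Figure: abstract stack architecture, shown twice: stack in memory with free space above, tos pointing at the topmost word on the physical stack, 2 tos (shadow) registers, and counter tcc with values 0, 1, 2]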

49

Code For Stack Architecture
• Timings for push, pushlit, add, pop operations depend on top of stack cache
• Operations in shadow registers are fast, typically 1 cycle; includes register access and the operation
• Generally, memory access adds numerous cycles
• To track dynamic changes of the stack, use some defined policy, say try to keep top 50% full
• Table below refines timings for stack with transparent shadow registers
• For example, pushing element mem[ x ] into top of stack cache, we arbitrarily define this requires 2 cycles; due to the memory fetch
• Note: 2 cycles for memory access, highly idealized! In reality more likely multiple tens of cycles!

50

Code For Stack Architecture

operation    cycles   tcc before   tcc after   tos change   comment
add          1        tcc = 2      tcc = 1     no change
add          1+2      tcc = 1      tcc = 1     tos--        underflow?
add          1+2+2    tcc = 0      tcc = 1     tos -= 2     underflow?
push x       2        tcc = 0,1    tcc++       no change    tcc update in parallel
push x       2+2      tcc = 2      tcc = 2     tos++        overflow?
pushlit #3   1        tcc = 0,1    tcc++       no change
pushlit #3   1+2      tcc = 2      tcc = 2     tos++        overflow?
pop y        2        tcc = 1,2    tcc--       no change
pop y        2+2      tcc = 0      tcc = 0     tos--        underflow?

51

Code For Stack Architecture
• Code emission for source snippet on SA:

a + b * c ^ ( d + e * f ^ g )

• Let + and * be commutative, conventional language rule
• Architecture here has 2 shadow registers
• Assembly language programmer (or HLL compiler) exploits this
• Assume initially empty 2-word cache

52

Code For Stack Architecture

#    1 Left-to-Right   cycles 1   2 Exploit Cache   cycles 2
1    push a            2          push f            2
2    push b            2          push g            2
3    push c            4          expo              1
4    push d            4          push e            2
5    push e            4          mult              1
6    push f            4          push d            2
7    push g            4          add               1
8    expo              1          push c            2
9    mult              3          expo              1
10   add               3          push b            2
11   expo              3          mult              1
12   mult              3          push a            2
13   add               3          add               1

53

Code For Stack Architecture
• Brute-force code emission costs 40 cycles; i.e. failing to take advantage of tcc knowledge
• Code emission with shadow register consideration costs 20 cycles
• True penalty for memory access is worse in practice
• Tremendous speed-up always possible when fixing system with severe flaws ☺
• Return of investment for 2 registers is double the original performance!
• Such strong speedup is an indicator that the starting architecture was poor in the first place!
• Stack Machine can be fast, if purity of top-of-stack memory-access is relaxed for performance
• Indexing, looping, indirection, call/return etc. are not addressed here
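
The 40-cycle and 20-cycle totals can be reproduced by replaying both instruction sequences against the timing table. The Python sketch below is an illustrative model under two assumptions: `mult` and `expo` are timed exactly like `add`, and a memory access costs the idealized 2 cycles; the names `op_cost` and `total_cycles` are hypothetical.

```python
MEM = 2  # idealized memory access cost in cycles, as on the slides


def op_cost(op, tcc):
    """Return (cycles, new_tcc) for one operation, following the timing table."""
    if op == "push":
        # Room in the 2-word cache: fetch only. Cache full: fetch plus spill.
        return (2, tcc + 1) if tcc < 2 else (2 + MEM, 2)
    if op in ("add", "mult", "expo"):  # assumption: all binary ops timed like add
        if tcc == 2:
            return (1, 1)
        if tcc == 1:
            return (1 + MEM, 1)        # one operand comes from memory
        return (1 + 2 * MEM, 1)        # both operands come from memory
    raise ValueError(f"unsupported operation: {op}")


def total_cycles(ops, tcc=0):
    """Sum cycles over a sequence, starting with an empty cache by default."""
    total = 0
    for op in ops:
        cycles, tcc = op_cost(op, tcc)
        total += cycles
    return total


LEFT_TO_RIGHT = ["push"] * 7 + ["expo", "mult", "add", "expo", "mult", "add"]
EXPLOIT_CACHE = ["push", "push", "expo", "push", "mult", "push", "add",
                 "push", "expo", "push", "mult", "push", "add"]

if __name__ == "__main__":
    print(total_cycles(LEFT_TO_RIGHT), total_cycles(EXPLOIT_CACHE))
```

The second sequence never lets tcc exceed 2 before consuming operands, so every push costs 2 cycles and every operation costs 1.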

54

Register Dependencies

55

Register Dependencies
• Inter-instruction dependencies, in EE parlance also known as dependences, arise between registers or memory locations being defined (AKA assigned, or written) and used (AKA read, or referenced)
• One instruction computes a result into a register (or memory); another instruction needs that result from that same register (or same memory location)
• Or, one instruction uses a register; and after use the same register is newly recomputed (written)
• Dependences cause sequential execution, lest the result is unpredictable
• On next page: op is any arithmetic/logical operation

56

Register Dependencies
True-Dependence, AKA Data Dependence: <- synonymous!
r3 ← r1 op r2    1: Write r3, op is some arithmetic opcode
r5 ← r3 op r4    2: Read r3 after Write, RAW

Anti-Dependence, not a true dependence; parallelize under right condition
r3 ← r1 op r2    1: Read r1
r1 ← r5 op r4    2: Write r1 after Read, WAR

Output Dependence, similar to Anti-Dependence, is not a true dependence
r3 ← r1 op r2    1: Write r3
r5 ← r3 op r4    2: Read r3
r3 ← r6 op r7    3: Write r3 after Write, WAW, use between

57

Register Dependencies
Control Dependence:

// ri, i = 1..4 come in “live”
if ( condition1 ) {
    r3 = r1 op r2;
}else{          ← see the jump here?
    r5 = r3 op r4;
} // end if
write( r3 );

58

Register Renaming
• Only data dependence is a real dependence, hence called true dependence
• Other dependences are artifacts of insufficient resources, generally insufficient registers
• This means: if additional registers were available, then replacing some of these conflicting registers with other registers could make the conflict (dependence) disappear!
• Anti- and Output-Dependences are indeed such false dependences

59

Register Renaming
Original Code:
-- r2, r3, r5, r6, r7 come in “live” from code before

-- r1, r4 are not “live”, don’t have initial values

-- r1, r2, r3, r4, r5, r6, r7 must go out “live”

L1: r1 ← r2 op r3

L2: r4 ← r1 op r5

L3: r1 ← r3 op r6

L4: r3 ← r1 op r7

Initial Dependences:
Lx: Ly: x, y = 1..4, which dependence? Next page

60

Register Renaming
Original Code:
L1: r1 ← r2 op r3

L2: r4 ← r1 op r5

L3: r1 ← r3 op r6

L4: r3 ← r1 op r7

Initial Dependences: numerous!!
L1, L2 true-Dep with r1

L1, L3 output-Dep with r1

L1, L4 anti-Dep with r3

L3, L4 true-Dep with r1



61

Register Renaming

• What could be changed and improved for better performance, if we had additional registers?
• Hidden or real (visible architecture) registers could be advantageous!
• Compute and use other temporaries via other registers to reduce dependences!
• May at times allow higher degree of parallelism, due to lower degree of dependence
• More parallelism ⇒ faster execution!
• Register renaming conducted by HW; invisible to assembly programmer or compiler

62

Register Renaming

Original Code:              New Code, added regs, r30 instead of r3:
L1: r1 ← r2 op r3           r10 ← r2 op r30   -- r30 instead
L2: r4 ← r1 op r5           r4 ← r10 op r5    -- r10 instead
L3: r1 ← r3 op r6           r1 ← r30 op r6
L4: r3 ← r1 op r7           r3 ← r1 op r7

Dependences before:         Dependences after:
L1, L2 true-Dep with r1     L1, L2 true-Dep with r10
L1, L3 output-Dep with r1   L3, L4 true-Dep with r1
L1, L4 anti-Dep with r3     // ri, i = 1..7 are “live”
L3, L4 true-Dep with r1



63

Register Renaming
•  With these additional, renamed regs, the new code could execute in half the time!

•  First: Compute into free/hidden reg r10 instead of r1, but needs additional register r10; no time penalty!

•  Also: Compute in preceding code into r30 instead of r3, if r30 available; also no time penalty!

•  Then all 7 regs are live afterwards: r1, r3, r4, plus the non-modified ones! E.g. r2 came in live, must go out live!

•  While r10 and r30 are don’t cares afterwards; free to use again by HW
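
The effect of renaming on the dependence lists can be checked mechanically. The Python sketch below is a simple pairwise scan, not the hardware's renaming logic; the tuple encoding `(dest, src1, src2)` and the function name `dependences` are assumptions. Note that on the original code such a scan also reports two anti-dependences (L2, L3 on r1 and L3, L4 on r3) that the slides' list does not enumerate.

```python
def dependences(code):
    """Find register dependences in straight-line code.

    code is a list of (dest, src1, src2) tuples for 'dest <- src1 op src2'.
    Returns a set of (i, j, kind, reg) with 1-based instruction numbers i < j.
    A pair is skipped if the register is redefined strictly between i and j.
    """
    found = set()
    for j, (dj, *srcs_j) in enumerate(code):
        for i in range(j):
            di, *srcs_i = code[i]
            redefined = {code[k][0] for k in range(i + 1, j)}
            if di in srcs_j and di not in redefined:
                found.add((i + 1, j + 1, "true", di))    # RAW
            if di == dj and di not in redefined:
                found.add((i + 1, j + 1, "output", di))  # WAW
            if dj in srcs_i and dj not in redefined:
                found.add((i + 1, j + 1, "anti", dj))    # WAR
    return found


ORIGINAL = [("r1", "r2", "r3"), ("r4", "r1", "r5"),
            ("r1", "r3", "r6"), ("r3", "r1", "r7")]
RENAMED = [("r10", "r2", "r30"), ("r4", "r10", "r5"),
           ("r1", "r30", "r6"), ("r3", "r1", "r7")]

if __name__ == "__main__":
    for name, code in (("original", ORIGINAL), ("renamed", RENAMED)):
        print(name, sorted(dependences(code)))
```

After renaming only the two true dependences remain, so L1/L2 and L3/L4 form two independent chains that can run side by side.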

64

Actual Register Set
Architecture Examples

65

Intel x86 Registers

66

Intel x86 Registers 32-bit
• Intel x86 is infamous for being register-starved!
• Need for object code compatibility extended life of x86 architecture beyond anyone’s imagination

67

Intel x86 Registers 32-bit

68

Intel x86 Registers 32-bit
• Intel x86 has mmx and xmm registers
• Can be used as array of 8, 16, 32, etc. sub registers
• Also referred to as SSE (Streaming SIMD Extension)

69

Intel x86 Registers 64-bit

70

Itanium™ Registers

71

Itanium Registers
• Intel’s newer 64-bit Itanium™ processor has 128 general registers (GR), 128 floating-point registers (FR), 64 single-bit predicate registers (PR), 8 branch registers (BR), 128 application registers (AR)
• Also, there are Performance Monitor Data registers (PMD), processor identifiers (CPUID), a Current Frame Marker register (CFM), user mask (UM), and instruction pointer registers (IP)
• GRs, FRs, BRs, ARs, CPUIDs, IP, and PMDs are 64 bits wide
• PRs are 1 bit wide, while the UM holds 6 and the CFM 38 bits; depicted below:

72

Itanium Register File

GR      bits    FR      bits    PR     bits   BR     bits    AR
gr0     63…0    fr0     63…0    pr0    0      br0    63…0    ar0    Kr0
gr1     63…0    fr1     63…0    pr1    0      br1    63…0    . . .
gr2     63…0    fr2     63…0    pr2    0      br2    63…0    ar7    Kr7
gr3     63…0    fr3     63…0    pr3    0      br3    63…0    . . .
gr4     63…0    fr4     63…0    pr4    0      br4    63…0    ar16   RSC
gr5     63…0    fr5     63…0    pr5    0      br5    63…0    ar17   BSP
. . .           . . .           . . .         br6    63…0    ar18   BSPSTO
gr16    63…0    fr16    63…0    pr10   0      br7    63…0    ar19   RNAT
. . .           . . .           . . .                        . . .
. . .           . . .           . . .         ip     63…0    ar21   FCR
gr126   63…0    fr126   63…0    pr62   0                     . . .
gr127   63…0    fr127   63…0    pr63   0      cfm    37…0    ar30   FDR
                                              User Mask      ar32   CCV
CPUID                                         um     5…0     ar36   UNAT
cpuid0  63…0                    PMD                          ar40   FSPR
cpuid1  63…0                    pmd0   63…0                  ar44   ITC
. . .                           pmd1   63…0                  ar64   LC
cpuidn  63…0                    . . .                        ar66   EC
                                pmdm   63…0                  ar127

73

Itanium Register GR
• The 128 GR registers are the common workhorses during computation
• They contain integer values being computed
• It is possible to use these integer values as machine addresses, thus GRs can be used as pointers in load- and store-operations
• All machine instructions can refer to these registers, for reading and writing values
• In addition to the 64 data bits, each GR has an associated NAT bit, which stands for Not A Thing
• NAT is 1, if the associated register has not been initialized with valid data

74

Itanium Register GR
• NATs support speculation
• For example, if a speculative load is issued but aborted, before the value arrives in its destined GR, the NAT state records that fact
• Enables integrity of the machine’s exception process
• There are 2 groups of GR registers:
• The first 32, GR0 through GR31, are visible to all software, and are used to hold globally computed, intermediate values
• However, GR0 is read-only, providing the constant 0, 64 bits long

75

Itanium Register GR
• The next 96 registers, GR32 to GR127, are used to implement a small but frequently used portion of the top of the run-time stack; i.e. work like a special-purpose top-of-stack cache
• These stack registers are made available to SW by allocation of a register stack frame, and include between 0 and 96 registers
• Registers not used from this subset are inaccessible to general SW
• The stack frame portion implemented via GRs is further partitioned into subsections, one meant to hold local registers, the other output registers, i.e. results of the current function call

76

Itanium Predicate Registers PR
• Execution of most IPF (Itanium Processor Family) instructions can be predicated by a PR (predicate register)
• Value 1 in the PR means: the operation can be completed normally
• PR value 0 means the result will not be posted (committed), even if it has been computed already. I.e. there will be no stores and no impact on any AR of the machine
• An exception, i.e. an instruction that cannot be predicated, is the loop operation

77

Itanium Predicate Registers

l  The PRs are also partitioned into 2 sections:

l  PR0 through PR15 are static PRs

l  The other 48 are so-called rotating PRs

l  PR0 is an exceptional register: it can only be read, and its value is always 1, meaning the predicate is true; thus PR0 denotes unconditional execution

l  The remaining 48 PRs are used to hold stage predicates, used during software pipelining
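How 16 static and 48 rotating PRs could be addressed through a rename base is sketched below. The mapping formula and the direction of rotation (decrementing the base modulo 48) are assumptions for illustration and are not stated in these slides; the function names `physical_pr` and `rotate` are hypothetical.

```python
NUM_STATIC = 16      # PR0 .. PR15
NUM_ROTATING = 48    # PR16 .. PR63


def physical_pr(logical, rrb_pr):
    """Map a logical predicate register number to a physical one."""
    if logical < NUM_STATIC:
        return logical  # static PRs are never renamed
    offset = (logical - NUM_STATIC + rrb_pr) % NUM_ROTATING
    return NUM_STATIC + offset


def rotate(rrb_pr):
    """Assumed rotation step: decrement the rename base modulo 48."""
    return (rrb_pr - 1) % NUM_ROTATING
```

Under these assumptions, a stage predicate written to logical PR16 is visible as logical PR17 after one rotation, which is the behavior software pipelining relies on to hand a predicate from one stage to the next.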

78

Itanium Branch Registers BR

l  IPF instructions are grouped in bundles, which are 16-byte aligned byte sequences holding executable code. Hence their rightmost 4 address bits will always be 0 due to alignment; these 4 address bits don’t need to be stored explicitly

l  Execution of an indirect branch requires an explicit operand

l  On the Itanium architecture this operand is a branch register; a branch register BR holds the branch destination

l  The machine then loads the value of the referenced BR into the IP register and execution continues from there; IP stands for Instruction Pointer

l  Executing branch-related instructions is about the only way to directly affect the value in the instruction pointer, the register that holds the address of the next bundle to be executed
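The implied low address bits and the indirect branch can be sketched in a few lines. This is an illustrative model only: the function names are hypothetical, and the count of 8 branch registers is taken from the Itanium architecture rather than from these slides.

```python
MASK64 = (1 << 64) - 1
BUNDLE_BYTES = 16    # one bundle = 16 bytes
IMPLIED_BITS = 4     # low address bits that are always 0


def pack_bundle_address(address):
    """Drop the 4 implied zero bits; only 60 bits need to be stored."""
    if address % BUNDLE_BYTES != 0:
        raise ValueError("bundle addresses must be 16-byte aligned")
    return (address & MASK64) >> IMPLIED_BITS


def unpack_bundle_address(stored):
    """Restore the full 64-bit address from the 60 stored bits."""
    return (stored << IMPLIED_BITS) & MASK64


def indirect_branch(branch_registers, br_index):
    """Return the new IP: the value held in the referenced branch register."""
    return branch_registers[br_index]


branch_registers = [0] * 8               # assumption: 8 branch registers
branch_registers[6] = 0x400000000010     # hypothetical branch target
ip = indirect_branch(branch_registers, 6)
stored = pack_bundle_address(ip)
```

Here `unpack_bundle_address(stored)` yields the original `ip` again, since no information is lost by omitting the 4 zero bits.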

79

Current Frame Marker Register CFM

Note: Frame Marker is often referred to as Stack Frame, and its fixed portion as the Stack Marker

l  Each function has a specific stack frame associated with it, which is created at function invocation; it is cleared at function return

l  If all the relevant data of a function’s stack frame do fit, they are placed in the stack of general registers; else the overflowing data must reside in memory

l  Either way, the current frame marker (CFM) holds the frame marker for the function that is currently active

l  Generally, most functions have small stack frames

80

Current Frame Marker Register CFM

Layout of the CFM:

Bits    37..32   31..25   24..18   17..14   13..7   6..0
Field   rrb.pr   rrb.fr   rrb.gr   sor      sol     sof

Meaning of Bits in CFM:

Name     Bit Field   Meaning
sof      0..6        Total size of stack frame
sol      7..13       Size of local part of stack frame, in words
sor      14..17      Size of rotating portion of stack frame; the number of rotating registers is 8 times the sor value
rrb.gr   18..24      Register rename base for GRs
rrb.fr   25..31      Register rename base for FRs
rrb.pr   32..37      Register rename base for PRs
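The CFM bit fields can be extracted with shifts and masks. The following is a minimal sketch that assumes the CFM is available as a plain integer; the names `CFM_FIELDS`, `decode_cfm`, and `encode_cfm` are hypothetical, and the field values in the example are made up.

```python
# Field name -> (lowest bit position, width in bits), per the CFM layout.
CFM_FIELDS = {
    "sof": (0, 7),
    "sol": (7, 7),
    "sor": (14, 4),
    "rrb_gr": (18, 7),
    "rrb_fr": (25, 7),
    "rrb_pr": (32, 6),
}


def decode_cfm(cfm):
    """Split a 38-bit CFM value into its named fields."""
    return {
        name: (cfm >> lsb) & ((1 << width) - 1)
        for name, (lsb, width) in CFM_FIELDS.items()
    }


def encode_cfm(**fields):
    """Build a CFM value from named fields; missing fields default to 0."""
    cfm = 0
    for name, (lsb, width) in CFM_FIELDS.items():
        value = fields.get(name, 0)
        if value >= (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        cfm |= value << lsb
    return cfm


cfm = encode_cfm(sof=24, sol=16, sor=1, rrb_gr=3, rrb_pr=5)
frame = decode_cfm(cfm)
rotating_registers = 8 * frame["sor"]   # sor counts units of 8 registers
```

In this example the frame has 24 registers in total, 16 of them local, and 8 rotating.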

81

Itanium Application Registers AR

Application Registers – t.b.d.:

Register      Mnemonic     Description of register
ar0 – ar7     KR0 – KR7    Kernel registers 0 .. 7
ar8 – ar15                 Reserved
ar16          t.b.d.

82

Itanium Instruction Pointer IP

l  IPF instructions are fetched in units of bundles: chunks of 16 bytes, or 128 bits

l  Bundles are stored bundle-aligned

l  The ip addresses 18,446,744,073,709,551,616 different bytes (aligned at bundle addresses)

l  The rightmost 4 bits of the ip thus will always be zero, due to the bundle alignment

l  Hence these 4 bits don’t need to be stored on the microprocessor silicon

83

Performance Monitor Data Register

l  These are architecture-provided resources that record the use of HW modules

l  Contents are read-only for SW

l  But contrary to the performance monitor registers on Intel Pentium architectures, they are user visible on Itanium

84

Alpha Registers

85

Alpha Registers

l  On an MP (multiprocessor) Alpha system, each processor has its own, full complement of architecture registers

l  The pc register always addresses the next instruction in a 4-byte aligned instruction stream

l  The pc is 64 bits wide, yet the rightmost 2 bits are implied 0 and not explicitly stored, due to the 4-byte instruction alignment

l  Alpha has 32 integer registers, each 64 bits wide, conventionally named R0 .. R31

l  R31 has special meaning: R31 always supplies integer 0 as a source operand

l  Clearly, R31 is not writeable

l  Exceptions are not raised when R31 is specified as a destination for a load!
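The behavior of R31 and the implied pc bits can be modeled as follows. This is a minimal sketch; the class `AlphaIntegerRegisters` and the helpers `stored_pc_bits` and `full_pc` are hypothetical names, and silently discarding a write to R31 is the model's way of expressing that no exception is raised.

```python
MASK64 = (1 << 64) - 1


class AlphaIntegerRegisters:
    """Toy model of the 32 integer registers R0 .. R31, each 64 bits wide."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n):
        """R31 always supplies integer 0 as a source operand."""
        return 0 if n == 31 else self._regs[n]

    def write(self, n, value):
        """Writes to R31 are discarded without raising an exception."""
        if n != 31:
            self._regs[n] = value & MASK64


def stored_pc_bits(pc):
    """Drop the 2 implied zero bits of a 4-byte aligned pc (62 bits remain)."""
    if pc % 4 != 0:
        raise ValueError("pc must be 4-byte aligned")
    return (pc & MASK64) >> 2


def full_pc(stored):
    """Restore the full 64-bit pc from the 62 stored bits."""
    return (stored << 2) & MASK64
```

For example, after `regs.write(31, 123)` a subsequent `regs.read(31)` still returns 0, and `full_pc(stored_pc_bits(0x1000))` gives back `0x1000`.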

86

Alpha Registers

l  Alpha has 32 floating-point registers, named F0 .. F31

l  Each float register is 64 bits wide

l  Register F31 always holds the true 0.0 floating-point value as a constant; it cannot be written

l  Note: An exception is not signaled for a load, specifying F31 as destination!

l  Float instructions computing single-precision data (only 32 bits wide) still write all 64 bits of their respective floating-point destination register, with the value expanded to the 64-bit register format!

87

Alpha Registers

l  Alpha has 2 special registers, named lock registers, LR0 and LR1; not further explained here

l  Process Cycle Counter (PCC) register consists of two 32-bit fields; usable for performance monitoring

l  Low-order 32 bits (31..0), known as PCC_CNT, are used as an interval timer: an unsigned wrapping counter, tracking the duration of an event on the order of nanoseconds

l  High-order 32 bits (63..32), known as PCC_OFF, are operating-system dependent

l  Suggested: use as cycle counter for process, thread

l  PCC is read by the special RPCC instruction; for OS support

l  FPCR (64-bit Floating Point Control Register) is used for IEEE 754 floating-point operations; otherwise FPCR is not visible; among other things, it selects one of four rounding modes
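Splitting a PCC value into its two 32-bit fields, and measuring an interval with the wrapping counter, can be sketched as follows. The function names `split_pcc` and `elapsed_count` are hypothetical, and the sample values are made up; the sketch assumes the counter wraps at most once between the two readings.

```python
MASK32 = (1 << 32) - 1


def split_pcc(pcc):
    """Return (PCC_CNT, PCC_OFF): bits 31..0 and bits 63..32 of the PCC."""
    pcc_cnt = pcc & MASK32
    pcc_off = (pcc >> 32) & MASK32
    return pcc_cnt, pcc_off


def elapsed_count(start_cnt, end_cnt):
    """Difference of two PCC_CNT readings of an unsigned wrapping counter."""
    return (end_cnt - start_cnt) & MASK32


cnt, off = split_pcc(0x0000000500000010)
interval = elapsed_count(0xFFFFFFF0, 0x00000010)   # counter wrapped once
```

Masking the difference to 32 bits gives the correct interval even though the second reading is numerically smaller than the first.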

88

Alpha Registers

89

IBM 370 Registers

90

IBM 370 Registers

l  IBM’s ancient 370 mainframe architecture preceded x86; it had a regular and relatively rich register set

l  Various formats: half-word, word, extended formats

91

Bibliography

1.  Morris M. Mano, et al.: Logic and Computer Design Fundamentals, Pearson 5th Edition, ISBN 978-0-13-376063-7

2.  Shen, John Paul, and Mikko H. Lipasti: Modern Processor Design, Fundamentals of Superscalar Processors, McGraw Hill, © 2005

3.  Nilsson, James W., and Susan A. Riedel: Electric Circuits, © 2015 Pearson Education Inc., ISBN 13: 9780-13-376003-3

4.  Sparc: https://en.wikipedia.org/wiki/SPARC

5.  http://www-03.ibm.com/ibm/history/exhibits/mainframe/mainframe_PP3158.html