CS 201 Computer Systems Programming

Chapter 3 “Architecture Overview”

Herbert G. Mayer, PSU CS

Status 1/28/2013

Syllabus

Computing History

Evolution of Microprocessor µP Performance

Processor Performance Growth

Key Architecture Messages

Code Sequences for Different Architectures

Dependencies, AKA Dependences

Score Board

References

Computing History

Before 1940

1643 Pascal’s Arithmetic Machine

About 1660 Leibniz Four Function Calculator

1710 - 1750 Punched Cards by Bouchon, Falcon, Jacquard

1810 Babbage Difference Engine, unfinished; the first programmer ever in the world was Ada, poet Lord Byron’s daughter, after whom the language Ada was named: Lady Ada Lovelace

1835 Babbage Analytical Engine, also unfinished

1890 Hollerith Tabulating Machine to help with the census in the USA

Computing History

Decade of 1940s

1939 - 1942 John Atanasoff built programmable, electronic computer at Iowa State University

1936 - 1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical computers based on relays; colleague advised use of “vacuum tubes”

1946 John von Neumann’s computer design of stored program

1946 Mauchly and Eckert built ENIAC, modeled after Atanasoff’s ideas, at the University of Pennsylvania: Electronic Numerical Integrator and Computer, a 30-ton monster

1980s John Atanasoff got acknowledgment and patent officially

Computing History

Decade of the 1950s

Univac Uniprocessor based on ENIAC, commercially viable, developed by John Mauchly and John Presper Eckert

Commercial systems sold by Remington Rand

Mark III computer

Decade of the 1960s

IBM’s 360 family co-developed with GE, Siemens, et al.

Transistor replaces vacuum tube

Burroughs stack machines compete with GPR architectures

All still von Neumann architectures

1969 ARPANET

Cache and VMM developed, first at Manchester University

Computing History

Decade of the 1970s

Birth of Microprocessor at Intel, see Gordon Moore

High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series

Architecture advances: Caches, virtual memories (VMM) ubiquitous, since real memories were expensive

Intel 4004, Intel 8080, single-chip microprocessors

Programmable controllers

Mini-computers, PDP 11, HP 3000 16-bit computer

Height of Digital Equipment Corp. (DEC)

Birth of personal computers, which DEC misses!

Decade of the 1980s

Decrease of mini-computer use

32-bit computing even on minis

Architecture advances: superscalar, faster caches, larger caches

Multitude of Supercomputer manufacturers

Compiler complexity: trace-scheduling, VLIW

Workstations common: Apollo, HP, DEC’s Ken Olsen trying to catch up, Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.

Decade of the 1990s

• Architecture advances: superscalar & pipelined, speculative execution, out-of-order (ooo) execution

• Powerful desktops

• End of mini-computer and of many super-computer manufacturers

• Microprocessor powerful as early supercomputers

• Consolidation of many computer companies into a few large ones

• End of Soviet Union marked the end of several supercomputer companies

Evolution of µP Performance (by: James C. Hoe @ CMU)

                            1970s       1980s       1990s        2000+
Transistor Count            10k-100k    100k-1M     1M-100M      1B
Clock Frequency             0.2-2 MHz   2-20 MHz    0.02-1 GHz   10 GHz
Instructions / cycle: ipc   < 0.1       0.1-0.9     0.9-2.0      > 10 (?)
MIPs, FLOPs                 < 0.2       0.2-20      20-2,000     100,000

Processor Performance Growth

Moore’s Law, from Webopedia 8/27/2004:

“The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future.

In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for another two decades.”

Others coin a more general law, stating that “the circuit density increases predictably over time.”

Processor Performance Growth

As of 2013, Moore’s Law has held true since ~1968.

Some Intel fellows believe that an end to Moore’s Law will be reached ~2018 due to physical limitations in the process of manufacturing transistors from semi-conductor material.

This phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:

Cars would travel at 2,400,000 mph and get 600,000 mpg

Air travel from LA to NYC would be at Mach 36,000, or take 0.5 seconds
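The effect of doubling every 18 months can be made concrete with a few lines of Python. This is only a sketch of the arithmetic: the function name `growth_factor` and the sample time spans are illustrative assumptions, and the slide does not state the baseline year behind its car and air-travel figures, so those numbers are not reproduced here.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Factor by which a quantity grows when it doubles every doubling_period_years."""
    return 2.0 ** (years / doubling_period_years)


# Sample spans (assumed for illustration): 3 years is two doublings,
# 45 years (roughly 1968 to 2013) is thirty doublings.
for years in (3, 15, 30, 45):
    print(f"{years:2d} years -> factor {growth_factor(years):,.0f}")
```

Thirty doublings already give a factor of about one billion, which is why the comparison with other industries looks so extreme.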

Message 1: Memory is Slow

The inner core of the processor, the CPU or the µP, is getting faster at a steady rate

Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on slow memories

It is not uncommon on an MP server that the processor has to wait >100 cycles before a memory access completes; that is one single memory access. On a Multi-Processor the bus protocol is more complex due to snooping, backing-off, and arbitration, thus the number of cycles to complete a memory access can grow high

IO simply compounds the problem of slow memory access

Message 1: Memory is Slow

Discarding conventional memory altogether, relying only on cache-like memories, is NOT an option for 64-bit architectures, due to the price/size/cost/power if you pursue full memory population with 2^64 bytes

Another way of seeing this: Using solely reasonably-priced cache memories (say at < 10 times the cost of regular memory) is not feasible: resulting physical address space would be too small, or price too high

Significant intellectual efforts in computer architecture focus on reducing the performance impact of fast processors accessing slow memories

All else, except IO, seems easy compared to this fundamental problem!

IO is even slower by orders of magnitude

Message 1: Memory is Slow

[Figure: Processor-Memory Performance Gap, 1980 to 2002. Performance on a logarithmic scale (1 to 1000) over time. CPU (µProc, “Moore’s Law”) improves 60%/yr.; DRAM improves 7%/yr.; the gap grows 50% / year. Source: David Patterson, UC Berkeley]
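The three rates in the figure are consistent with each other, which a short calculation shows. This is a sketch only; the function names are assumptions, and the rates are the 60%/yr. and 7%/yr. values from the figure.

```python
def yearly_gap_growth(cpu_rate=0.60, dram_rate=0.07):
    """Relative growth per year of the CPU/DRAM performance ratio."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


def gap_after(years, cpu_rate=0.60, dram_rate=0.07):
    """CPU/DRAM performance ratio after the given number of years, starting at 1."""
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


print(f"gap grows about {100 * yearly_gap_growth():.1f}% per year")
print(f"gap after 10 years: about {gap_after(10):.0f}x")
```

Dividing 1.60 by 1.07 gives roughly 1.495, i.e. the "grows 50% / year" label in the figure.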

Message 2: Events Tend to Cluster

A strange thing happens during program execution: Seemingly unrelated events tend to cluster

Memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, during some periods of time such clustering is observed. Intuitively, one memory access seems independent of another, but they both happen to fall onto the same page (or working set of pages)

We call this phenomenon Locality! Architects exploit locality to speed up memory access via Caches and increase the address range beyond physical memory via Virtual Memory Management. Distinguish spatial versus temporal locality

Message 2: Events Tend to Cluster

Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries

Incoming search key (say, a C++ program identifier) is mapped into an index, but the next, completely unrelated key happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential search

Programmer must watch out for the phenomenon of clustering, as it is undesired in hashing!
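Clustering in a hash table is easy to provoke with a weak hash function. The sketch below is illustrative only: the two hash functions, the table size of 8, and the list of identifiers are assumptions chosen for the demonstration, not taken from the lecture.

```python
def length_hash(key, table_size):
    """Weak hash: identifiers of equal length all collide."""
    return len(key) % table_size


def char_sum_hash(key, table_size):
    """Slightly better hash: sum of character codes."""
    return sum(ord(ch) for ch in key) % table_size


def longest_chain(keys, hash_fn, table_size=8):
    """Length of the longest collision chain after inserting all keys."""
    chains = [0] * table_size
    for key in keys:
        chains[hash_fn(key, table_size)] += 1
    return max(chains)


# Typical short program identifiers (hypothetical sample).
IDENTIFIERS = ["i", "j", "k", "n", "x", "y", "tmp", "len", "idx", "ptr"]

print("longest chain, length hash:  ", longest_chain(IDENTIFIERS, length_hash))
print("longest chain, char-sum hash:", longest_chain(IDENTIFIERS, char_sum_hash))
```

With the length-based hash, six of the ten identifiers land in one entry, so a lookup there degenerates toward a sequential search.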

Message 2: Events Tend to Cluster

Clustering happens in all diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by having a copy of frequently used data in a faster memory unit, it happens that a small cache suffices to speed up execution

Due to Data Locality (spatial and temporal). Data that have been accessed recently will again be accessed in the near future, or at least data that live close by will be accessed in the near future

Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon
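Why a small cache suffices can be seen with a toy cache model. The sketch assumes a fully associative cache with LRU replacement, 8 lines of 8 words each; these parameters, the function name `hit_rate`, and both address traces are assumptions for illustration, not a model of any real processor.

```python
from collections import OrderedDict


def hit_rate(addresses, num_lines=8, line_size=8):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()  # keys are line numbers, ordered from least to most recently used
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return hits / total if total else 0.0


# Clustered trace: loop 10 times over a 64-word array (spatial and temporal locality).
clustered = [i for _ in range(10) for i in range(64)]
# Scattered trace: every access touches a different line (no locality).
scattered = [i * 8 for i in range(640)]

print("clustered:", hit_rate(clustered))
print("scattered:", hit_rate(scattered))
```

The clustered trace misses only once per cache line on the first pass, while the scattered trace never hits, although both issue the same number of accesses.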

Message 3: Heat is Bad

Clocking a processor fast (e.g. > 3-5 GHz) can increase performance and thus generally “is good”

Other performance parameters, such as memory access speed, peripheral access, etc. do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable

Comes at the cost of higher current, thus more heat generated in the identical physical geometry (the real-estate) of the silicon processor or also the chipset

But the Silicon part acts like a heat-conductor, conducting better as it gets warmer (negative temperature coefficient resistor, or NTC). Since the power-supply is a constant-current source, a lower resistance causes lower voltage, shown as VDroop in the figure below

Message 3: Heat is Bad

[Figure: VDroop]

Message 3: Heat is Bad

This in turn means voltage must be increased artificially to sustain the clock rate, creating more heat, ultimately leading to self-destruction of the part

Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep-states of the Silicon part (processor as well as chip-set), and Turbo Boost mode, to contain heat generation while boosting clock speed just at the right time

Good that to date Silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Else CPUs would become larger, more expensive, and above all: hotter.

Message 4: Resource Replication

Architects cannot increase clock speed beyond physical limitations

One cannot decrease the die size beyond evolving technology

Yet speed improvements are desired, and achieved

This conflict can partly be overcome with replicated resources! But careful!

Key obstacle to parallel execution is data dependence in the SW under execution. A datum cannot be used before it has been computed

Compiler optimization technology calls this use-def dependence (short for use-before-definition, and definition-before-use dependence), AKA true dependence, AKA data dependence

Goal is to search for program portions that are independent of one another. This can be at multiple levels of focus

Message 4: Resource Replication

At the very low level of registers, at the machine level: done by HW; see also score board

At the low level of individual machine instructions: done by HW; see also superscalar architecture

At the medium level of subexpressions in a program: done by compiler; see CSE

At the higher level of several statements written in sequence in high-level language program: done by optimizing compiler or by programmer

Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results: done by the user running concurrent programs

Whenever program portions are independent of one another, they can be computed at the same time: in parallel

Architects provide resources for this parallelism

Compilers need to uncover opportunities for parallelism

If two actions are independent of one another, they can be computed simultaneously

Provided that HW resources exist, that the absence of dependence has been proven, and that independent execution paths are scheduled on these replicated HW resources

Code 1 for Different Architectures

Example 1: Object Code Sequence Without Optimization

Strict left-to-right translation, no smarts in mapping

Consider non-commutative subtraction and division operators

No common subexpression elimination (CSE), and no register reuse

Conventional operator precedence

For Single Accumulator SAA, Three-Address GPR, Stack Architectures

Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c

Code 1 for Different Architectures

No  Single-Accumulator  Three-Address GPR    Stack Machine
                        dest ← op1 op op2
 1  ld a                add r1, a, #3        push a
 2  add #3              mult r2, r1, b       pushlit #3
 3  mult b              add r3, a, #3        add
 4  st temp1            div r4, r3, c        push b
 5  ld a                sub d, r2, r4        mult
 6  add #3                                   push a
 7  div c                                    pushlit #3
 8  st temp2                                 add
 9  ld temp1                                 push c
10  sub temp2                                div
11  st d                                     sub
12                                           pop d
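The stack-machine column can be checked with a tiny interpreter. This is a minimal sketch under stated assumptions: the function `run_stack_code`, the dictionary used as memory, and the sample values for a, b, c are hypothetical; for `sub` and `div` the deeper stack element is taken as the left operand, which matches the left-to-right translation described above.

```python
def run_stack_code(program, memory):
    """Execute stack-machine instructions; memory maps variable names to values."""
    operations = {
        "add": lambda left, right: left + right,
        "sub": lambda left, right: left - right,
        "mult": lambda left, right: left * right,
        "div": lambda left, right: left / right,
    }
    stack = []
    for instruction in program:
        parts = instruction.split()
        opcode = parts[0]
        if opcode == "push":                 # push value of a variable
            stack.append(memory[parts[1]])
        elif opcode == "pushlit":            # push a literal such as #3
            stack.append(float(parts[1].lstrip("#")))
        elif opcode == "pop":                # store top of stack into a variable
            memory[parts[1]] = stack.pop()
        else:                                # binary operation on the two top elements
            right = stack.pop()
            left = stack.pop()
            stack.append(operations[opcode](left, right))
    return memory


# The 12 stack-machine instructions for d <- ( a + 3 ) * b - ( a + 3 ) / c
CODE_1_STACK = [
    "push a", "pushlit #3", "add", "push b", "mult",
    "push a", "pushlit #3", "add", "push c", "div",
    "sub", "pop d",
]

result = run_stack_code(CODE_1_STACK, {"a": 5.0, "b": 2.0, "c": 4.0})
print(result["d"])  # (5 + 3) * 2 - (5 + 3) / 4
```

Every `push` and `pop` in this model touches memory, which is the cost the following observations point out.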

Code 1 for Different Architectures

Three-address code looks shortest, w.r.t. number of instructions

Maybe optical illusion, must also consider number of bits for instructions

Must consider number of I-fetches, operand fetches, total number of stores

Numerous memory accesses on SAA (Single Accumulator Architecture) due to temporary values held in memory

Most memory accesses on SA (Stack Architecture), since everything requires a memory access

Three-Address architecture immune to commutativity constraint, since operands may be placed in registers in either order

No need for reverse-operation opcodes for Three-Address architecture

Decide in Three-Address architecture how to encode operand types

Code 2 for Different Architectures

This time we eliminate common subexpression (CSE)

Compiler handles left-to-right order for non-commutative operators on SAA

Better: d ← ( a + 3 ) * b - ( a + 3 ) / c

Code 2 for Different Architectures

No  Single-Accumulator  Three-Address GPR    Stack Machine
 1  ld a                add r1, a, #3        push a
 2  add #3              mult r2, r1, b       pushlit #3
 3  st temp1            div r1, r1, c        add
 4  div c               sub d, r2, r1        dup
 5  st temp2                                 push b
 6  ld temp1                                 mult
 7  mult b                                   xch
 8  sub temp2                                push c
 9  st d                                     div
10                                           sub
11                                           pop d

Single Accumulator Architecture (SAA) optimized still needs temporary storage; uses temp1 for common subexpression; has no other register!!

SAA could use negate instruction or reverse subtract

Register-use optimized for Three-Address architecture; but dup and xch are newly added instructions (on the Stack Machine)

Common subexpression optimized on Stack Machine by duplicating, exchanging, etc.

Instruction count reduced by 20% for Three-Address, 18% for SAA, only 8% for Stack Machine
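The percentages follow from the instruction counts of the unoptimized and the CSE-optimized sequences (11 to 9 for SAA, 5 to 4 for Three-Address, 12 to 11 for the Stack Machine). A short sketch of that arithmetic, with `reduction_percent` and `COUNTS` as illustrative names:

```python
def reduction_percent(before, after):
    """Percentage by which the instruction count shrinks."""
    return 100.0 * (before - after) / before


# (instructions without optimization, instructions with CSE)
COUNTS = {
    "Three-Address": (5, 4),
    "SAA": (11, 9),
    "Stack Machine": (12, 11),
}

for architecture, (before, after) in COUNTS.items():
    print(f"{architecture}: {reduction_percent(before, after):.0f}% fewer instructions")
```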

Code 3 for Different Architectures

Analyze similar source expressions but with reversed operator precedence

One operator sequence associates right-to-left, due to precedence

Compiler uses commutativity

The other left-to-right, due to explicit parentheses

Use simple-minded code model: no cache, no optimization

Will there be advantages/disadvantages due to architecture?

Expression 1 is : e ← a + b * c ^ d

Expression 1 is : e ← a + b * c ^ d

No  Single-Accumulator  Three-Address GPR    Stack Machine
                        dest ← op1 op op2    Implied Operands
 1  ld c                expo r1, c, d        push a
 2  expo d              mult r1, b, r1       push b
 3  mult b              add e, a, r1         push c
 4  add a                                    push d
 5  st e                                     expo
 6                                           mult
 7                                           add
 8                                           pop e

Expression 2 is : f ← ( ( g + h ) * i ) ^ j
Here the operators associate left-to-right due to parentheses

No  Single-Accumulator  Three-Address GPR    Stack Machine
                        dest ← op1 op op2    Implied operands
 1  ld g                add r1, g, h         push g
 2  add h               mult r1, i, r1       push h
 3  mult i              expo f, r1, j        add
 4  expo j                                   push i
 5  st f                                     mult
 6                                           push j
 7                                           expo
 8                                           pop f

Observations, Interaction of Precedence and Architecture

Software eliminates constraints imposed by precedence: looking ahead

Execution times identical for the 2 different expressions on the same architecture, unless blurred by secondary effect; see cache example below

Conclusion: all architectures handle arithmetic and logic operations well
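How software resolves precedence before the hardware ever sees the code can be sketched with a small expression-to-stack-code translator. This is an illustration only, not the compiler used in the course: it assumes precedence climbing as the parsing technique, `^` as a right-associative exponent operator, and the mnemonics from the tables; the assignment to the destination (the final `pop`) is left out, and there is no error handling.

```python
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}
MNEMONIC = {"+": "add", "-": "sub", "*": "mult", "/": "div", "^": "expo"}


def emit_stack_code(expression):
    """Translate an infix expression into stack-machine instructions."""
    tokens = re.findall(r"[A-Za-z_]\w*|[-+*/^()]", expression)
    position = 0
    code = []

    def parse_primary():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            parse_expression(1)
            position += 1                      # skip the closing ")"
        else:
            code.append(f"push {token}")

    def parse_expression(min_precedence):
        nonlocal position
        parse_primary()
        while (position < len(tokens)
               and tokens[position] in PRECEDENCE
               and PRECEDENCE[tokens[position]] >= min_precedence):
            operator = tokens[position]
            position += 1
            # A right-associative operator lets an equal-precedence operator bind first.
            if operator in RIGHT_ASSOCIATIVE:
                parse_expression(PRECEDENCE[operator])
            else:
                parse_expression(PRECEDENCE[operator] + 1)
            code.append(MNEMONIC[operator])

    parse_expression(1)
    return code


print(emit_stack_code("a + b * c ^ d"))
print(emit_stack_code("( ( g + h ) * i ) ^ j"))
```

Both expressions yield seven instructions (eight with the final `pop`), in line with the observation that the two take the same time on the same architecture.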

Code For Stack Architecture

Stack Machine with no register is inherently slow: Memory Accesses!!!

Implement a few top-of-stack elements via HW shadow registers: Cache

Measure equivalent code sequences with/without consideration for cache

Top-of-stack register tos points to last valid word on physical stack

Two shadow registers may hold 0, 1, or 2 true top words

Top of stack cache counter tcc specifies number of shadow registers in use

Thus tos plus tcc jointly specify true top of stack

Code For Stack Architecture

[Figure: stack in memory with top-of-stack pointer tos; 2 tos shadow registers, possibly free; counter tcc with value 0, 1, or 2]

Code For Stack Architecture

Timings for push, pushlit, add, pop operations depend on tcc

Operations in shadow registers fastest, typically 1 cycle, include register access and the operation itself

Generally, further memory access adds 2 cycles

For stack changes use some defined policy, e.g. keep tcc 50% full

Table below refines timings for stack with shadow registers

Note: push x into cache with free space requires 2 cycles: cache adjustment is done at the same time as memory fetch

operation   cycles  tcc before  tcc after  tos change  comment
add         1       tcc = 2     tcc = 1    no change
add         1+2     tcc = 1     tcc = 1    tos--       underflow?
add         1+2+2   tcc = 0     tcc = 1    tos -= 2    underflow?
push x      2       tcc = 0,1   tcc++      no change   tcc update in parallel
push x      2+2     tcc = 2     tcc = 2    tos++       overflow?
pushlit #3  1       tcc = 0,1   tcc++      no change
pushlit #3  1+2     tcc = 2     tcc = 2    tos++       overflow?
pop y       2       tcc = 1,2   tcc--      no change
pop y       2+2     tcc = 0     tcc = 0    tos--       underflow?

Code emission for: a + b * c ^ ( d + e * f ^ g )

Let + and * be commutative, by language rule

Architecture here has 2 shadow registers, compiler exploits this

Assume initially empty 2-word cache

 #  1: Left-to-Right  cycles 1  2: Exploit Cache        cycles 2
 1  push a            2         push f                  2
 2  push b            2         push g                  2
 3  push c            4         expo                    1
 4  push d            4         push e                  2
 5  push e            4         mult                    1
 6  push f            4         push d                  2
 7  push g            4         add                     1
 8  expo              1         push c                  2
 9  mult              3         r_expo = swap + expo    1
10  add               3         push b                  2
11  expo              3         mult                    1
12  mult              3         push a                  2
13  add               3         add                     1
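The cycle counts of both sequences can be reproduced by applying the timing table mechanically. The following is a sketch, assuming that `mult`, `expo` and `r_expo` are timed like `add` (the table lists only `add`, but the per-instruction cycles above agree with this) and that the shadow registers start empty; the function name `total_cycles` is hypothetical.

```python
BINARY_OPS = {"add", "mult", "expo", "r_expo"}


def total_cycles(program, tcc=0):
    """Sum the cycles of a stack program; tcc counts shadow registers in use (0..2)."""
    cycles = 0
    for instruction in program:
        opcode = instruction.split()[0]
        if opcode in BINARY_OPS:
            # 1 cycle for the operation, plus 2 cycles per operand fetched from memory
            cycles += 1 + 2 * (2 - tcc)
            tcc = 1
        elif opcode == "push":
            if tcc < 2:
                cycles += 2          # memory fetch, tcc update in parallel
                tcc += 1
            else:
                cycles += 2 + 2      # extra memory access to spill a shadow register
        elif opcode == "pushlit":
            if tcc < 2:
                cycles += 1
                tcc += 1
            else:
                cycles += 1 + 2
        elif opcode == "pop":
            if tcc > 0:
                cycles += 2
                tcc -= 1
            else:
                cycles += 2 + 2
        else:
            raise ValueError(f"unknown instruction: {instruction}")
    return cycles


LEFT_TO_RIGHT = ["push a", "push b", "push c", "push d", "push e", "push f", "push g",
                 "expo", "mult", "add", "expo", "mult", "add"]
EXPLOIT_CACHE = ["push f", "push g", "expo", "push e", "mult", "push d", "add",
                 "push c", "r_expo", "push b", "mult", "push a", "add"]

print(total_cycles(LEFT_TO_RIGHT), total_cycles(EXPLOIT_CACHE))
```

The second ordering never lets tcc exceed 2 before an operation consumes the operands, so no instruction pays for a spill or a refill.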

40

Code For Stack Architecture

Blind code emission costs 40 cycles; i.e. not taking advantage of tcc knowledge costs performance

Code emission with shadow register consideration costs 20 cycles

True penalty for memory access is worse in practice

Tremendous speed-up is always possible when fixing a system with severe flaws

Return on investment for 2 registers is twice the original performance

Such strong speedup is an indicator that the starting architecture was poor

Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performance

Note that indexing, looping, indirection, call/return are not addressed here
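
The two code sequences can be checked with a small sketch. It assumes a toy stack machine in which every binary operation takes the second-from-top entry as its left operand, and it copies the per-instruction cycle counts straight from the table; the function name `run`, the tuple layout, and the sample values in `ENV` are illustrative assumptions, not part of the slides.

```python
# Each entry: (operation, operand name or None, cycles taken from the table).
LEFT_TO_RIGHT = [
    ("push", "a", 2), ("push", "b", 2), ("push", "c", 4), ("push", "d", 4),
    ("push", "e", 4), ("push", "f", 4), ("push", "g", 4),
    ("expo", None, 1), ("mult", None, 3), ("add", None, 3),
    ("expo", None, 3), ("mult", None, 3), ("add", None, 3),
]
EXPLOIT_CACHE = [
    ("push", "f", 2), ("push", "g", 2), ("expo", None, 1),
    ("push", "e", 2), ("mult", None, 1),
    ("push", "d", 2), ("add", None, 1),
    ("push", "c", 2), ("r_expo", None, 1),   # r_expo = swap + expo
    ("push", "b", 2), ("mult", None, 1),
    ("push", "a", 2), ("add", None, 1),
]


def run(program, env):
    """Execute a program on a toy stack machine; return (result, total cycles)."""
    stack, cycles = [], 0
    for op, name, cost in program:
        cycles += cost
        if op == "push":
            stack.append(env[name])
            continue
        right = stack.pop()
        left = stack.pop()
        if op == "r_expo":            # reverse exponentiation: swap, then expo
            left, right = right, left
            op = "expo"
        if op == "add":
            stack.append(left + right)
        elif op == "mult":
            stack.append(left * right)
        elif op == "expo":
            stack.append(left ** right)
        else:
            raise ValueError(f"unknown operation: {op}")
    return stack.pop(), cycles


# Hypothetical sample values, chosen so that operand order matters for expo.
ENV = dict(a=1, b=2, c=2, d=1, e=1, f=2, g=1)

print(run(LEFT_TO_RIGHT, ENV))   # (17, 40)
print(run(EXPLOIT_CACHE, ENV))   # (17, 20)
```

Both orders compute the same value because + and * are commutative and the one non-commutative step is handled by r_expo; only the cycle totals differ.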

41

Register Dependencies

Inter-instruction dependencies, in CS parlance also known as dependences, arise between registers being defined and used

One instruction computes a result into a register (or memory), another instruction needs that result from that same register (or that memory location)

Or, one instruction uses a datum; and after such use the same item is reset, i.e. recomputed

42

Register Dependencies

True-Dependence, AKA Data Dependence: <- note synonym!

    r3 ← r1 op r2
    r5 ← r3 op r4        Read after Write, RAW

Anti-Dependence, not a true dependence; parallelize under right condition

    r3 ← r1 op r2
    r1 ← r5 op r4        Write after Read, WAR

Output Dependence

    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7        Write after Write, WAW, use in between
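
The three cases above can be told apart mechanically by comparing destination and source registers. The following is a minimal sketch, assuming each instruction is written as a tuple `(dest, src1, src2)`; the function name `classify` is hypothetical. It looks at one pair of instructions in isolation and ignores anything executed in between.

```python
def classify(first, second):
    """Name the register dependences from `first` to a later `second`.

    Each instruction is a tuple (dest, src1, src2).
    """
    dest1, *sources1 = first
    dest2, *sources2 = second
    kinds = []
    if dest1 in sources2:
        kinds.append("RAW")   # true dependence: second reads what first wrote
    if dest2 in sources1:
        kinds.append("WAR")   # anti-dependence: second overwrites what first read
    if dest1 == dest2:
        kinds.append("WAW")   # output dependence: both write the same register
    return kinds


print(classify(("r3", "r1", "r2"), ("r5", "r3", "r4")))   # ['RAW']
print(classify(("r3", "r1", "r2"), ("r1", "r5", "r4")))   # ['WAR']
print(classify(("r3", "r1", "r2"), ("r3", "r6", "r7")))   # ['WAW']
```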

43

Register Dependencies

Control Dependence:

if ( condition1 ) {
    r3 = r1 op r2;
}else{                    // see the jump here?
    r5 = r3 op r4;
} // end if
write( r3 );

44

Register Renaming

Only the data dependence is a real dependence, hence called true-dependence

Other dependences are artifacts of insufficient resources, generally not enough registers

This means: if additional registers were available, then replacing some of the conflicting registers with new ones could make the conflict disappear

Anti- and Output-Dependences are indeed such false dependences

45

Register Renaming

Original Dependences:          Renamed Situation, Dependences Gone:

L1: r1 ← r2 op r3              r10 ← r2 op r30   -- r30 has r3 copy
L2: r4 ← r1 op r5              r4 ← r10 op r5

The dependences before:        after:

L1, L2 true-Dep with r1        L1, L2 true-Dep with r10
L1, L3 output-Dep with r1      L3, L4 true-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1



46

Register Renaming

With these additional or renamed regs, the new code could possibly run in half the time!

First: Compute into r10 instead of r1, but you need to have the additional register

Also: Compute into r30, no added copy operations, just more registers a priori

Then the regs live afterwards are: r1, r3, r4

While r10 and r30 are don't cares
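
The effect of renaming can be illustrated with a sketch that lists the dependences in a short instruction sequence before and after renaming. L1 and L2 are taken from the slide. The operands of L3 and L4 are assumptions: only their destination registers and the use of r1 in L4 follow from the dependences listed, so r6, r7 and r8 are made up for illustration. The function name `dependences` and the tuple layout are likewise hypothetical.

```python
def dependences(program):
    """List (earlier, later, kind, register) for a straight-line program.

    Each instruction is (label, dest, sources).
    """
    last_writer = {}   # register -> label of its most recent writer
    readers = {}       # register -> labels that read it since that write
    found = []
    for label, dest, sources in program:
        for reg in sources:
            if reg in last_writer:
                found.append((last_writer[reg], label, "true", reg))
            readers.setdefault(reg, []).append(label)
        for reader in readers.get(dest, []):
            if reader != label:
                found.append((reader, label, "anti", dest))
        if dest in last_writer:
            found.append((last_writer[dest], label, "output", dest))
        last_writer[dest] = label
        readers[dest] = []
    return found


ORIGINAL = [
    ("L1", "r1", ("r2", "r3")),
    ("L2", "r4", ("r1", "r5")),
    ("L3", "r1", ("r6", "r7")),    # operands r6, r7 are assumed
    ("L4", "r3", ("r1", "r8")),    # operand r8 is assumed
]
RENAMED = [
    ("L1", "r10", ("r2", "r30")),  # r30 holds a copy of r3
    ("L2", "r4", ("r10", "r5")),
    ("L3", "r1", ("r6", "r7")),
    ("L4", "r3", ("r1", "r8")),
]

for dep in dependences(ORIGINAL):
    print("before:", dep)
for dep in dependences(RENAMED):
    print("after: ", dep)
```

Under these assumptions the renamed sequence keeps only the two true dependences (L1, L2 on r10 and L3, L4 on r1). For the original sequence the sketch also reports an anti-dependence between L2 and L3 on r1, in addition to the four dependences listed above; renaming r1 to r10 removes that one as well.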

47

Score Board

Score-board is an array of HW programmable bits sb[]

Manages other HW resources, specifically registers

Single-bit HW array, every bit i in sb[i] is associated with one specific, dedicated register r_i

Association is by index, i.e. by name: sb[i] belongs to reg r_i

Only if sb[i] = 0, does register i have valid data

If sb[i] = 0 then register r_i is NOT in process of being written

If bit i is set, i.e. if sb[i] = 1, then that register r_i has stale data

Initially all sb[*] are stale, i.e. set to 1

48

Score Board

Execution constraints:

r_d ← r_s op r_t

if sb[s] or if sb[t] is set → RAW dependence, hence stall the computation; wait until both r_s and r_t are available

if sb[d] is set → WAW dependence, hence stall the write; wait until r_d has been used; SW can sometimes determine to use another register instead of r_d

else dispatch instruction immediately
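
The execution constraints can be written down as a small sketch. It assumes a software model of the bit array; the class `ScoreBoard` and its method names are hypothetical. Two points are assumptions beyond the slide text: dispatching an instruction sets sb[d] (the register is then in the process of being written), and some earlier step, modelled here by `mark_valid`, has already cleared the bits of registers that hold valid data.

```python
class ScoreBoard:
    """One bit per register: 0 = valid data, 1 = stale or being written."""

    def __init__(self, num_regs):
        self.sb = [1] * num_regs          # initially all registers are stale

    def mark_valid(self, i):
        """Assumed helper: register i has received valid data."""
        self.sb[i] = 0

    def check(self, d, s, t):
        """Decide what happens to r_d <- r_s op r_t."""
        if self.sb[s] or self.sb[t]:
            return "stall: RAW"           # a source is not available yet
        if self.sb[d]:
            return "stall: WAW"           # destination is still being written
        return "dispatch"

    def dispatch(self, d, s, t):
        """Dispatch if allowed; mark r_d as being written (assumption)."""
        if self.check(d, s, t) != "dispatch":
            return False
        self.sb[d] = 1
        return True

    def complete(self, d):
        """The value of r_d has been computed: clear its bit."""
        self.sb[d] = 0


board = ScoreBoard(8)
for reg in (1, 2, 3, 4, 5):
    board.mark_valid(reg)

board.dispatch(3, 1, 2)                   # r3 <- r1 op r2 starts
print(board.check(5, 3, 4))               # stall: RAW
print(board.check(3, 1, 2))               # stall: WAW
board.complete(3)
print(board.check(5, 3, 4))               # dispatch
```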

49

Score Board

To allow out of order (ooo) execution, upon computing the value of r_d

Update r_d, and clear sb[d]

For uses (references), HW may use any register i, whose sb[i] is 0

For definitions (assignments), HW may set any register j, whose sb[j] is 0

Independent of original order, in which source program was written, i.e. possibly ooo
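
A rough sketch of how these rules let instructions leave program order: in every step, any pending instruction whose source and destination bits are all 0 is dispatched, whatever its position in the program. The function `issue_order`, the one-step latency, and the three-instruction program are assumptions made for illustration. Like the rules above, the sketch checks only RAW and WAW conditions, so the example program is chosen to contain no anti-dependences.

```python
def issue_order(program, valid_regs, num_regs=16):
    """Return instruction labels in the order they are dispatched.

    Each instruction is (label, d, s, t), meaning r_d <- r_s op r_t.
    Every result is assumed to complete one step after dispatch.
    """
    sb = [1] * num_regs                   # 1 = stale or being written
    for reg in valid_regs:
        sb[reg] = 0
    pending = list(program)
    in_flight = []                        # destinations being computed
    order = []
    while pending:
        for d in in_flight:               # results from the previous step arrive
            sb[d] = 0
        in_flight = []
        for instruction in list(pending):
            label, d, s, t = instruction
            if sb[s] or sb[t] or sb[d]:
                continue                  # RAW or WAW: this instruction stalls
            sb[d] = 1                     # r_d is now being written
            in_flight.append(d)
            order.append(label)
            pending.remove(instruction)
        if not in_flight:
            raise RuntimeError("no pending instruction can be dispatched")
    return order


PROGRAM = [
    ("I1", 3, 1, 2),                      # r3 <- r1 op r2
    ("I2", 5, 3, 4),                      # r5 <- r3 op r4, needs r3 from I1
    ("I3", 8, 6, 7),                      # r8 <- r6 op r7, independent
]
print(issue_order(PROGRAM, valid_regs=range(1, 9)))   # ['I1', 'I3', 'I2']
```

I3 overtakes I2 because I2 must wait for r3, while nothing prevents I3 from starting at once.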

50

