Upload
warren-hall
View
218
Download
0
Tags:
Embed Size (px)
Citation preview
1
CS 201Computer Systems Programming
Chapter 3“Architecture Overview”
Herbert G. Mayer, PSU CSHerbert G. Mayer, PSU CSStatus 1/28/2013Status 1/28/2013
2
Syllabus Computing HistoryComputing History
Evolution of Microprocessor µP PerformanceEvolution of Microprocessor µP Performance
Processor Performance GrowthProcessor Performance Growth
Key Architecture MessagesKey Architecture Messages
Code Sequences for Different ArchitecturesCode Sequences for Different Architectures
Dependencies, AKA DependencesDependencies, AKA Dependences
Score BoardScore Board
ReferencesReferences
3
Computing HistoryComputing HistoryBefore 1940Before 19401643 Pascal’s 1643 Pascal’s Arithmetic MachineArithmetic Machine
About 1660 Leibnitz About 1660 Leibnitz Four Function CalculatorFour Function Calculator
1710 -1750 1710 -1750 Punched CardsPunched Cards by Bouchon, Falcon, Jacquard by Bouchon, Falcon, Jacquard
1810 Babbage 1810 Babbage Difference EngineDifference Engine, unfinished; 1st programmer , unfinished; 1st programmer ever in the world was Ada, poet Lord Byron’s daughter, after ever in the world was Ada, poet Lord Byron’s daughter, after whom the language Ada was named: whom the language Ada was named: Lady Ada LovelaceLady Ada Lovelace
1835 Babbage 1835 Babbage Analytical EngineAnalytical Engine, also unfinished, also unfinished
1920 Hollerith 1920 Hollerith Tabulating MachineTabulating Machine to help with census in the USA to help with census in the USA
4
Computing HistoryComputing HistoryDecade of 1940sDecade of 1940s1939 – 1942 1939 – 1942 John Atanasoff John Atanasoff built programmable, electronic built programmable, electronic
computer at Iowa State Universitycomputer at Iowa State University
1936 - 1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical 1936 - 1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical computers based on relays; colleague advised use of computers based on relays; colleague advised use of “vacuum tubes”“vacuum tubes”
1946 1946 John von Neumann’s John von Neumann’s computer design of stored programcomputer design of stored program
1946 Mauchly and Eckert built 1946 Mauchly and Eckert built ENIACENIAC, modeled after Atanasoff’s , modeled after Atanasoff’s ideas, built at University of Pennsylvania: Electronic Numeric ideas, built at University of Pennsylvania: Electronic Numeric Integrator and Computer, 30 ton monsterIntegrator and Computer, 30 ton monster
1980s John Atanasoff got acknowledgment and patent officially 1980s John Atanasoff got acknowledgment and patent officially
5
Computing HistoryComputing HistoryDecade of the 1950sDecade of the 1950s Univac Uniprocessor based on ENIAC, commercially viable, Univac Uniprocessor based on ENIAC, commercially viable,
developed by developed by John Mauchly John Mauchly and John Presper Eckertand John Presper Eckert Commercial systems sold by Remington RandCommercial systems sold by Remington Rand Mark III computerMark III computer
Decade of the 1960s Decade of the 1960s IBM’s 360 family co-developed with GE, Siemens, et al.IBM’s 360 family co-developed with GE, Siemens, et al. Transistor replaces vacuum tubeTransistor replaces vacuum tube Burroughs stack machines, compete with GPR architecturesBurroughs stack machines, compete with GPR architectures All still All still von Neumannvon Neumann architectures architectures 1969 1969 ARPANETARPANET CacheCache and and VMMVMM developed, first at Manchester University developed, first at Manchester University
6
Computing HistoryComputing History
Decade of the 1970sDecade of the 1970sBirth of Microprocessor at Intel, Birth of Microprocessor at Intel, see see Gordon MooreGordon Moore
High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 seriesHigh-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series
Architecture advances: Caches, Architecture advances: Caches, virtual virtual memories (VMM) memories (VMM) ubiquitous, since ubiquitous, since realreal memories were expensive memories were expensive
Intel 4004, Intel 8080, single-chip microprocessorsIntel 4004, Intel 8080, single-chip microprocessors
Programmable controllersProgrammable controllers
Mini-computers, PDP 11, HP 3000 16-bit computerMini-computers, PDP 11, HP 3000 16-bit computer
Height of Digital Equipment Corp. (DEC)Height of Digital Equipment Corp. (DEC)
Birth of personal computers, which DEC misses!Birth of personal computers, which DEC misses!
7
Computing HistoryComputing History
Decade of the 1980sDecade of the 1980s
decrease of mini-computer usedecrease of mini-computer use
32-bit computing even on minis32-bit computing even on minis
Architecture advances: superscalar, faster caches, Architecture advances: superscalar, faster caches, larger cacheslarger caches
Multitude of Supercomputer manufacturersMultitude of Supercomputer manufacturers
Compiler complexity: trace-scheduling, VLIWCompiler complexity: trace-scheduling, VLIW
Workstations common: Apollo, HP, DEC’s Workstations common: Apollo, HP, DEC’s Ken Olsen Ken Olsen trying to catch up, Intergraph, Ardent, Sun, Three trying to catch up, Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.Rivers, Silicon Graphics, etc.
8
Computing HistoryComputing History
Decade of the 1990sDecade of the 1990s•Architecture advances: superscalar & pipelined, Architecture advances: superscalar & pipelined, speculative execution, ooo executionspeculative execution, ooo execution
•Powerful desktopsPowerful desktops
•End of mini-computer and of many super-computer End of mini-computer and of many super-computer manufacturersmanufacturers
•Microprocessor powerful as early supercomputersMicroprocessor powerful as early supercomputers
•Consolidation of many computer companies into a Consolidation of many computer companies into a few large onesfew large ones
•End of Soviet Union marked the end of several End of Soviet Union marked the end of several supercomputer companiessupercomputer companies
9
Evolution of µP Performance(by: James C. Hoe @ CMU)
1970s 1980s 1990s 2000+ Transistor Count 10k-100k 100k-1M 1M-100M 1B
Clock Frequency 0.2-2 MHz 2-20 MHz 0.02 – 1 GHz 10 GHz
Instructions / cycle: ipc < 0.1 0.1 – 0.9 0.9 – 2.0 > 10 (?)
MIPs, FLOPs < 0.2 0.2 - 20 20 – 2,000 100,000
10
Processor Performance GrowthMoore’s Law --from Webopedia 8/27/2004:Moore’s Law --from Webopedia 8/27/2004:
““The observation made in 1965 by Gordon Moore, co-founder of The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for invented. Moore predicted that this trend would continue for the foreseeable future.the foreseeable future.
In subsequent years, the pace slowed down a bit, but In subsequent years, the pace slowed down a bit, but data data density doubled approximately every 18 monthsdensity doubled approximately every 18 months, and this is , and this is the current definition of the current definition of Moore's LawMoore's Law, which , which Moore himself Moore himself has blessedhas blessed. Most experts, including Moore himself, expect . Most experts, including Moore himself, expect Moore's LawMoore's Law to hold for another two decades. to hold for another two decades.
Others coin a more general law, stating that Others coin a more general law, stating that “the circuit density “the circuit density increases predictably over time.”increases predictably over time.”
11
Processor Performance GrowthSo far in 2013, Moore’s Law is holding true since ~1968.So far in 2013, Moore’s Law is holding true since ~1968.
Some Intel fellows believe that an end to Moore’s Law will be Some Intel fellows believe that an end to Moore’s Law will be reached ~2018 due to physical limitations in the process of reached ~2018 due to physical limitations in the process of manufacturing transistors from semi-conductor material.manufacturing transistors from semi-conductor material.
This phenomenal growth is unknown in any other industry. For This phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have every 18 months, then by 2001 other industries would have achieved the following:achieved the following:
cars would travel at 2,400,000 Mph, and get 600,000 MpGcars would travel at 2,400,000 Mph, and get 600,000 MpG
Air travel from LA to NYC would be at 36,000 Mach, or take 0.5 Air travel from LA to NYC would be at 36,000 Mach, or take 0.5 secondsseconds
12
Message 1: Memory is Slow The inner core of the processor, the CPU or the µP, is The inner core of the processor, the CPU or the µP, is
getting faster at a steady rategetting faster at a steady rate
Access to memoryAccess to memory is also getting faster over time, but is also getting faster over time, but at a at a slower rateslower rate. This rate differential has existed for quite some . This rate differential has existed for quite some time, with the strange effect that fast processors have to rely time, with the strange effect that fast processors have to rely on slow memorieson slow memories
Not uncommon on MP server that processor has to wait Not uncommon on MP server that processor has to wait >100 cycles before a memory access completes; >100 cycles before a memory access completes; that is one that is one single memory accesssingle memory access. On a Multi-Processor the bus . On a Multi-Processor the bus protocol is more complex due to snooping, backing-off, protocol is more complex due to snooping, backing-off, arbitration, thus the number of cycles to complete a memory arbitration, thus the number of cycles to complete a memory access can grow highaccess can grow high
IO simply compounds the problem of slow memory accessIO simply compounds the problem of slow memory access
13
Message 1: Memory is Slow Discarding conventional memory altogether, relying only on cache-Discarding conventional memory altogether, relying only on cache-
like memories, is NOT an option for 64-bit architectures, due to the like memories, is NOT an option for 64-bit architectures, due to the price/size/cost/power if you pursue full memory population with 2price/size/cost/power if you pursue full memory population with 26464 bytesbytes
Another way of seeing this: Using solely reasonably-priced cache Another way of seeing this: Using solely reasonably-priced cache memories (say at < 10 times the cost of regular memory) is not memories (say at < 10 times the cost of regular memory) is not feasible: resulting physical address space would be too small, or feasible: resulting physical address space would be too small, or price too highprice too high
Significant intellectual efforts in computer architecture focuses on Significant intellectual efforts in computer architecture focuses on reducing the performance impact of fast processors accessing reducing the performance impact of fast processors accessing slow memoriesslow memories
All else except IO, seems easy compared to this fundamental All else except IO, seems easy compared to this fundamental problem!problem!
IO is even slower by orders of magnitudeIO is even slower by orders of magnitude
14
Message 1: Memory is Slow
µProc60%/yr.
DRAM7%/yr.
1
10
100
1000
1980
1981
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
DRAM
CPU1982
Processor-MemoryPerformance Gap:(grows 50% / year)
Time
“Moore’s Law”
Source: David Patterson, UC Berkeley
2001
2002
15
Message 2: Events Tend to Cluster A strange thing happens during program execution: A strange thing happens during program execution:
Seemingly Seemingly unrelated events tend to clusterunrelated events tend to cluster
memory accessesmemory accesses tend to concentrate a majority of their tend to concentrate a majority of their referenced addresses onto a small domain of the total referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, during address space. Even if all of memory is accessed, during some periods of time such clustering is observed. some periods of time such clustering is observed. Intuitively, one memory access seems independent of Intuitively, one memory access seems independent of another, but they both happen to fall onto the same page (or another, but they both happen to fall onto the same page (or working set working set of pages)of pages)
We call this phenomenon We call this phenomenon LocalityLocality! Architects exploit locality ! Architects exploit locality to speed up memory access via to speed up memory access via CachesCaches and increase the and increase the address range beyond physical memory via address range beyond physical memory via Virtual Memory Virtual Memory ManagementManagement. Distinguish . Distinguish spacialspacial versus versus temporaltemporal locality locality
16
Message 2: Events Tend to Cluster Similarly, hash functions tend to concentrate an Similarly, hash functions tend to concentrate an
unproportionally large number of keys onto a unproportionally large number of keys onto a small number of table entriessmall number of table entries
Incoming search key (say, a C++ program Incoming search key (say, a C++ program identifier) is mapped into an index, but the next, identifier) is mapped into an index, but the next, completely unrelated key, happens to map onto completely unrelated key, happens to map onto the same index. In an extreme case, this may the same index. In an extreme case, this may render a hash lookup slower than a sequential render a hash lookup slower than a sequential searchsearch
Programmer must Programmer must watch outwatch out for the phenomenon for the phenomenon of clustering, as it is undesired in hashing!of clustering, as it is undesired in hashing!
17
Message 2: Events Tend to Cluster Clustering happens in all diverse modules of the processor Clustering happens in all diverse modules of the processor
architecture. For example, when a data cache is used to architecture. For example, when a data cache is used to speed-up memory accesses by having a copy of frequently speed-up memory accesses by having a copy of frequently used data in a faster memory unit, it happens that a small used data in a faster memory unit, it happens that a small cache suffices to speed up executioncache suffices to speed up execution
Due to Due to Data Locality Data Locality (spatial and temporal). Data that have (spatial and temporal). Data that have been accessed recently will again be accessed in the near been accessed recently will again be accessed in the near future, or at least data that live close by will be accessed in future, or at least data that live close by will be accessed in the near futurethe near future
Thus they happen to reside in the same cache line. Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon clustering is a valuable phenomenon
18
Message 3: Heat is Bad Clocking a processor fast (e.g. > 3-5 GHz) can increase Clocking a processor fast (e.g. > 3-5 GHz) can increase
performance and thus generally “is good”performance and thus generally “is good”
Other performance parameters, such as memory access Other performance parameters, such as memory access speed, peripheral access, etc. do not scale with the clock speed, peripheral access, etc. do not scale with the clock speed. Still, increasing the clock to a higher rate is desirablespeed. Still, increasing the clock to a higher rate is desirable
Comes at the cost of higher current, thus more heat Comes at the cost of higher current, thus more heat generated in the identical physical geometry (the real-estate) generated in the identical physical geometry (the real-estate) of the silicon processor or also the chipsetof the silicon processor or also the chipset
But Silicon part acts like a heat-conductor, conducting But Silicon part acts like a heat-conductor, conducting better, as it gets warmer (negative temperature coefficient better, as it gets warmer (negative temperature coefficient resistor, or NTC). Since the power-supply is a constant-resistor, or NTC). Since the power-supply is a constant-current source, a lower resistance causes lower voltage, current source, a lower resistance causes lower voltage, shown as VDroop in the figure belowshown as VDroop in the figure below
19
Message 3: Heat is Bad
20
Message 3: Heat is Bad This in turn means, voltage must be increased artificially, to This in turn means, voltage must be increased artificially, to
sustain the clock rate, creating more heat, ultimately leading to sustain the clock rate, creating more heat, ultimately leading to self-destruction of the partself-destruction of the part
Great efforts are being made to increase the clock speed, Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep-states of the generation. Current technologies include sleep-states of the Silicon part (processor as well as chip-set), and Silicon part (processor as well as chip-set), and Turbo BoostTurbo Boost mode, to contain heat generation while boosting clock speed mode, to contain heat generation while boosting clock speed just at the right timejust at the right time
Good that to date Silicon manufacturing technologies allow the Good that to date Silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Else CPUs shrinking of transistors and thus of whole dies. Else CPUs would become larger, more expensive, and above all: hotter.would become larger, more expensive, and above all: hotter.
21
Message 4: Resource Replication
Architects cannot increase clock speed Architects cannot increase clock speed beyond physical limitationsbeyond physical limitations
One cannot decrease the die size beyond One cannot decrease the die size beyond evolving technologyevolving technology
Yet speed improvements are desired, and Yet speed improvements are desired, and achievedachieved
This conflict can partly be overcome with This conflict can partly be overcome with replicated resources! But careful!replicated resources! But careful!
22
Message 4: Resource Replication
Key obstacle to parallel execution is data Key obstacle to parallel execution is data dependence in the SW under execution. A dependence in the SW under execution. A datum cannot be used, before it has been datum cannot be used, before it has been computedcomputed
Compiler optimization technology calls this Compiler optimization technology calls this use-def dependence use-def dependence (short for use-before-(short for use-before-definition, and definition-before-use definition, and definition-before-use dependence), AKA true dependence, AKA dependence), AKA true dependence, AKA data dependencedata dependence
Goal is to search for program portions that Goal is to search for program portions that are independent of one another. This can be are independent of one another. This can be at multiple levels of focusat multiple levels of focus
23
Message 4: Resource Replication At the At the very low levelvery low level of registers, at the machine of registers, at the machine
level –done by HW; see also score boardlevel –done by HW; see also score board
At the At the low level low level of individual machine instructions of individual machine instructions –done by HW; see also superscalar architecture–done by HW; see also superscalar architecture
At the At the medium level medium level of subexpressions in a of subexpressions in a program –done by compiler; see CSEprogram –done by compiler; see CSE
At the At the higher level higher level of several statements written in of several statements written in sequence in high-level language program –done sequence in high-level language program –done by optimizing compiler or by programmerby optimizing compiler or by programmer
Or at the Or at the very high level very high level of different applications, of different applications, running on the same computer, but with running on the same computer, but with independent data, separate computations, and independent data, separate computations, and independent results –done by the user running independent results –done by the user running concurrent programsconcurrent programs
24
Message 4: Resource Replication
Whenever program portions are independent of Whenever program portions are independent of one another, they can be computed at the same one another, they can be computed at the same time: in paralleltime: in parallel
Architects provide resources for this parallelismArchitects provide resources for this parallelism Compilers need to uncover opportunities for Compilers need to uncover opportunities for
parallelismparallelism If two actions are independent of one another, they If two actions are independent of one another, they
can be computed simultaneouslycan be computed simultaneously Provided that HW resources exist, that the absence Provided that HW resources exist, that the absence
of dependence has been proven, that independent of dependence has been proven, that independent execution paths are scheduled on these replicated execution paths are scheduled on these replicated HW resourcesHW resources
25
Code 1 for Different ArchitecturesExample 1: Object Code Sequence Example 1: Object Code Sequence Without OptimizationWithout Optimization
Strict left-to-right translation, no smarts in mappingStrict left-to-right translation, no smarts in mapping
Consider non-commutative subtraction and division Consider non-commutative subtraction and division operatorsoperators
No common subexpression elimination (CSE), and no No common subexpression elimination (CSE), and no register reuseregister reuse
Conventional operator precedenceConventional operator precedence
For Single Accumulator SAA, Three-Address GPR, Stack For Single Accumulator SAA, Three-Address GPR, Stack ArchitecturesArchitectures
Sample source: Sample source: d d ( a + 3 ) * b - ( a + 3 ) / c ( a + 3 ) * b - ( a + 3 ) / c
26
Code 1 for Different Architectures
No Single-Accumulator
Three-Address GPR dest op1 op op2
Stack Machine
1 ld a add r1, a, #3 push a 2 add #3 mult r2, r1, b pushlit #3 3 mult b add r3, a, #3 add 4 st temp1 div r4, r3, c push b 5 ld a sub d, r2, r4 mult 6 add #3 push a 7 div c pushlit #3 8 st temp2 add 9 ld temp1 push c
10 sub temp2 div 11 st d sub 12 pop d
27
Code 1 for Different ArchitecturesThree-address code looks shortest, w.r.t. Three-address code looks shortest, w.r.t. number of instructionsnumber of instructions
Maybe optical illusion, must also consider Maybe optical illusion, must also consider number of bitsnumber of bits for for instructionsinstructions
Must consider number of I-fetches, operand fetches, total number Must consider number of I-fetches, operand fetches, total number of storesof stores
Numerous memory accesses on SAA (Single Accumulator Numerous memory accesses on SAA (Single Accumulator Architecture) due to temporary values held in memoryArchitecture) due to temporary values held in memory
Most memory accesses on SA (Stack Architecture), since Most memory accesses on SA (Stack Architecture), since everything requires a memory accesseverything requires a memory access
Three-Address architecture immune to commutativity constraint, Three-Address architecture immune to commutativity constraint, since operands may be placed in registers in either ordersince operands may be placed in registers in either order
No need for reverse-operation opcodes for Three-Address No need for reverse-operation opcodes for Three-Address architecturearchitecture
Decide in Three-Address architecture how to encode operand Decide in Three-Address architecture how to encode operand typestypes
28
Code 2 for Different Architectures
This time we This time we eliminate common subexpression (CSE)eliminate common subexpression (CSE)
Compiler handles left-to-right order for non-Compiler handles left-to-right order for non-commutative operators on SAAcommutative operators on SAA
Better: Better: d d ( a + 3 ) * b - ( a + 3 ) / c ( a + 3 ) * b - ( a + 3 ) / c
29
Code 2 for Different Architectures
No Single-Accumulator
Three-Address GPR dest op1 op op2
Stack Machine
1 ld a add r1, a, #3 push a 2 add #3 mult r2, r1, b pushlit #3 3 st temp1 div r1, r1, c add 4 div c sub d, r2, r1 dup 5 st temp2 push b 6 ld temp1 mult 7 mult b xch 8 sub temp2 push c
9 st d div 10 sub 11 pop d
30
Code 2 for Different Architectures
Single Accumulator Architecture (SAA) optimized still Single Accumulator Architecture (SAA) optimized still needs temporary storage; uses needs temporary storage; uses temp1 temp1 for common for common subexpression; has no other register!!subexpression; has no other register!!
SAA could use SAA could use negatenegate instruction or instruction or reverse subtractreverse subtract
Register-use optimized for Three-Address Register-use optimized for Three-Address architecture; but architecture; but dupdup and and xchxch are newly added are newly added instructionsinstructions
Common subexpresssion optimized on Stack Common subexpresssion optimized on Stack Machine by duplicating, exchanging, etc.Machine by duplicating, exchanging, etc.
20% reduced for Three-Address, 18% for SAA, only 20% reduced for Three-Address, 18% for SAA, only 8% for Stack Machine8% for Stack Machine
31
Code 3 for Different Architectures Analyze similar source expressions but with Analyze similar source expressions but with
reversed operator precedencereversed operator precedence
One operator sequence associates right-to-left, One operator sequence associates right-to-left, due to precedencedue to precedence
Compiler uses commutativityCompiler uses commutativity
The other left-to-right, due to explicit parenthesesThe other left-to-right, due to explicit parentheses
Use simple-minded code model: no cache, no Use simple-minded code model: no cache, no optimizationoptimization
Will there be advantages/disadvantages due to Will there be advantages/disadvantages due to architecture?architecture?
Expression 1 is : e Expression 1 is : e a + b * c ^ d a + b * c ^ d
32
Expression 1 is : e a + b * c ^ d
Code 3 for Different Architectures
No Single-Accumulator
Three-Address GPR dest op1 op op2
Stack Machine Implied Operands
1 ld c expo r1, c, d push a 2 expo d mult r1, b, r1 push b
3 mult b add e, a, r1 push c 4 add a push d 5 st e expo 6 mult 7 add 8 pop e
Expression 2 is : f ( ( g + h ) * i ) ^ j Here the operators associate left-to-right due to parentheses
• Expression 1 is : e Expression 1 is : e a + b * c ^ d a + b * c ^ d
33
Code 3 for Different Architectures
No Single-
Accumulator Three-Address GPR dest op1 op op2
Stack Machine Implied operands
1 ld g add r1, g, h push g 2 add h mult r1, i, r1 push h
3 mult i expo f, r1, j add 4 expo j push i 5 st f mult 6 push j 7 expo 8 pop f
Observations, Interaction of Precedence and Architecture Software eliminates constraints imposed by precedence: looking ahead Execution times identical for the 2 different expressions on the same
architecture --unless blurred by secondary effect; see cache example below Conclusion: all architectures handle arithmetic and logic operations well
• Expression 2 is : f Expression 2 is : f ( ( g + h ) * i ) ^ j ( ( g + h ) * i ) ^ j
34
Code For Stack Architecture Stack Machine with no register inherently slow: Memory Stack Machine with no register inherently slow: Memory
Accesses!!!Accesses!!!
Implement few top of stack elements via HW shadow Implement few top of stack elements via HW shadow registers registers Cache Cache
Measure equivalent code sequences with/without Measure equivalent code sequences with/without consideration for cacheconsideration for cache
Top-of-stack register tos points to last valid word on Top-of-stack register tos points to last valid word on physical stackphysical stack
Two shadow registers may hold 0, 1, or 2 true top wordsTwo shadow registers may hold 0, 1, or 2 true top words
Top of stack cache counter tcc specifies number of shadow Top of stack cache counter tcc specifies number of shadow registers in useregisters in use
Thus tos plus tcc jointly specify true top of stackThus tos plus tcc jointly specify true top of stack
35
Code For Stack Architecture
free free
0,1,20,1,2
tcc tcc
2 tos registers 2 tos registers
stack stack
tos tos
36
Code For Stack ArchitectureTimings for push, pushlit, add, pop operations depend on tccTimings for push, pushlit, add, pop operations depend on tcc
Operations in shadow registers fastest, typically 1 cycle, include Operations in shadow registers fastest, typically 1 cycle, include register access and the operation itselfregister access and the operation itself
Generally, further memory access adds 2 cyclesGenerally, further memory access adds 2 cycles
For stack changes use some defined policy, e.g. keep tcc 50% For stack changes use some defined policy, e.g. keep tcc 50% fullfull
Table below refines timings for stack with shadow registersTable below refines timings for stack with shadow registers
Note: push x into cache with free space requires 2 cycles: cache Note: push x into cache with free space requires 2 cycles: cache adjustment is done at the same time as memory fetchadjustment is done at the same time as memory fetch
37
Code For Stack Architecture
operation Cycles tcc before tcc after tos change comment add 1 tcc = 2 tcc = 1 no change add 1+2 tcc = 1 tcc = 1 tos-- underflow? add 1+2+2 tcc = 0 tcc = 1 tos -= 2 underflow? push x 2 tcc = 0,1 tcc++ no change tcc update
in parallel push x 2+2 tcc = 2 tcc = 2 tos++ overflow? pushlit #3 1 tcc = 0,1 tcc++ no change pushlit #3 1+2 tcc = 2 tcc = 2 tos++ overflow? pop y 2 tcc = 1,2 tcc-- no change pop y 2+2 tcc = 0 tcc = 0 tos-- underflow?
38
Code For Stack Architecture
Code emission for: a + b * c ^ ( d + e * f ^ g )Code emission for: a + b * c ^ ( d + e * f ^ g )
Let + and * be commutative, by language ruleLet + and * be commutative, by language rule
Architecture here has 2 shadow registers, compiler Architecture here has 2 shadow registers, compiler exploitsexploits this this
Assume initially empty 2-word cacheAssume initially empty 2-word cache
39
Code For Stack Architecture
# 1 Left - to - Right cycles 1 2 Exploit Cache cycles
2
1 push a 2 push f 2
2 push b 2 push g 2
3 push c 4 e xpo 1
4 push d 4 push e 2
5 push e 4 m ult 1
6 push f 4 push d 2
7 push g 4 a dd 1
8 expo 1 push c 2
9 mult 3 r_ e xpo = swap + expo 1
10 add 3 push b 2
11 expo 3 m ult 1
12 m ult 3 push a 2
13 a dd 3 a dd 1
40
Code For Stack ArchitectureBlind Blind code emission costs 40 cycles; i.e. not taking advantage of tcc code emission costs 40 cycles; i.e. not taking advantage of tcc
knowledge: costs performanceknowledge: costs performance
Code emission with shadow register consideration costs 20 cyclesCode emission with shadow register consideration costs 20 cycles
True penalty for memory access is worse in practiceTrue penalty for memory access is worse in practice
Tremendous speed-up always possible when fixing system with severe Tremendous speed-up always possible when fixing system with severe flawsflaws
Return of investment for 2 registers is twice the original performanceReturn of investment for 2 registers is twice the original performance
Such strong speedup is an indicator that the starting architecture was Such strong speedup is an indicator that the starting architecture was poorpoor
Stack Machine can be fast, if purity of top-of-stack access is sacrificed Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performancefor performance
Note that indexing, looping, indirection, call/return are not addressed Note that indexing, looping, indirection, call/return are not addressed herehere
41
Register Dependencies Inter-instruction dependenInter-instruction dependenciescies, in CS parlance , in CS parlance
also known as also known as dependendependencesces, arise between , arise between registers being defined and usedregisters being defined and used
One instruction computes a result into a register One instruction computes a result into a register (or memory), another instruction needs that result (or memory), another instruction needs that result from that same register (or that memory location)from that same register (or that memory location)
Or, one instruction uses a datum; and after such Or, one instruction uses a datum; and after such use the same item is reset, i.e. recomputeduse the same item is reset, i.e. recomputed
42
Register DependenciesTrue-DependenceTrue-Dependence, AKA Data Dependence: <- note synonym!, AKA Data Dependence: <- note synonym!
r3 ←r3 ← r1 op r2 r1 op r2r5 ← r5 ← r3r3 op r4 op r4 Read after Write, RAWRead after Write, RAW
Anti-Dependence,Anti-Dependence, not a true dependence not a true dependence
parallelize under right conditionparallelize under right condition
r3 ← r3 ← r1r1 op r2 op r2r1r1 ← r5 op r4 ← r5 op r4 Write after read, WARWrite after read, WAR
Output DependenceOutput Dependence
r3r3 ← r1 op r2 ← r1 op r2r5 ← r5 ← r3r3 op r4 op r4r3 r3 ← r6 op r7← r6 op r7 Write after Write, WAW, use in betweenWrite after Write, WAW, use in between
43
Register Dependencies
Control Dependence:Control Dependence:
if ( condition1 ) {if ( condition1 ) {
r3 = r1 op r2;r3 = r1 op r2;
}else{}else{ see the jump here? see the jump here?
r5 = r3 op r4;r5 = r3 op r4;
} // end if} // end if
write( r3 );write( r3 );
44
Register Renaming Only the data dependence is a Only the data dependence is a real dependence, real dependence,
hence called true-dependencehence called true-dependence
Other dependences are artifacts of Other dependences are artifacts of insufficient insufficient resourcesresources, generally not enough registers, generally not enough registers
This means: if additional registers were available, This means: if additional registers were available, then replacing some of these conflicting regs with then replacing some of these conflicting regs with new one regsiters could make conflict disappear?new one regsiters could make conflict disappear?
Anti-Anti- and and Output-Output-Dependences are indeed such Dependences are indeed such falsefalse dependences dependences
45
Register Renaming Original Dependences:Original Dependences: Renamed Situation, Dependences Gone:Renamed Situation, Dependences Gone:
L1:L1: r1 ← r2 op r3r1 ← r2 op r3 r10 ← r2 op r30 –- r30 has r3 copyr10 ← r2 op r30 –- r30 has r3 copy
L2:L2: r4 ← r1 op r5r4 ← r1 op r5 r4 ← r10 op r5r4 ← r10 op r5
L3:L3: r1 ← r3 op r6r1 ← r3 op r6 r1 ← r30 op r6r1 ← r30 op r6
L4:L4: r3 ← r1 op r7r3 ← r1 op r7 r3 ← r1 op r7r3 ← r1 op r7
The dependences before:The dependences before: after:after:
L1, L2 true-Dep with r1L1, L2 true-Dep with r1 L1, L2 true-Dep with r10L1, L2 true-Dep with r10
L1, L3 output-Dep with r1L1, L3 output-Dep with r1 L3, L4 true-Dep with r1L3, L4 true-Dep with r1
L1, L4 anti-Dep with r3L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3L3, L4 anti-Dep with r3
46
Register Renaming
With these additional or renamed regs, the new code With these additional or renamed regs, the new code could possibly run in half the time!could possibly run in half the time!
First : Compute into r10 instead of r1, but you need to First : Compute into r10 instead of r1, but you need to have the additional registerhave the additional register
Also: Compute into r30, no added copy operations, just Also: Compute into r30, no added copy operations, just more registers á-priorimore registers á-priori
Then regs are Then regs are livelive afterwards: r1, r3, r4 afterwards: r1, r3, r4
While r10 and r30 are While r10 and r30 are don’t caresdon’t cares
47
Score BoardScore-board is an array of HW programmable bits Score-board is an array of HW programmable bits sb[]sb[]
Manages other HW resources, specifically registersManages other HW resources, specifically registers
Single-bit HW array, every bit Single-bit HW array, every bit ii in in sb[i]sb[i] is is associated with one associated with one specific, dedicated register specific, dedicated register rrii
Association is by index, i.e. by name: Association is by index, i.e. by name: sb[i]sb[i] belongs to reg belongs to reg rrii
Only if Only if sb[i] = 0sb[i] = 0, does register , does register i i have have valid datavalid data
If If sb[i] = 0 sb[i] = 0 then register then register rrii is is NOT in process of being writtenNOT in process of being written
If bit If bit ii is set, i.e. if is set, i.e. if sb[i] = 1sb[i] = 1, then that register , then that register rrii has has stale datastale data
Initially all Initially all sb[*]sb[*] are stale, i.e. set to 1 are stale, i.e. set to 1
48
Score Board
Execution constraints:Execution constraints:
rrdd ← r ← rss op r op rtt
if if sb[s]sb[s] or if or if sb[t]sb[t] is set → RAW dependence, hence is set → RAW dependence, hence stall the computation; wait until both stall the computation; wait until both rrss and and rrtt are are availableavailable
if if sb[d]sb[d] is set→ WAW dependence, hence stall the is set→ WAW dependence, hence stall the write; wait until write; wait until rrdd has been used; SW can sometimes has been used; SW can sometimes
determine to use another register instead of determine to use another register instead of rrdd
else dispatch instruction immediatelyelse dispatch instruction immediately
49
Score Board
To allow To allow out of order (ooo) executionout of order (ooo) execution, upon , upon computing the value of rcomputing the value of rdd
Update Update rrdd, and clear , and clear sb[d]sb[d]
For uses (references), HW may use any register For uses (references), HW may use any register ii, , whose whose sb[i]sb[i] is 0 is 0
For definitions (assignments), HW may set any For definitions (assignments), HW may set any register j, whose register j, whose sb[j]sb[j] is 0 is 0
Independent of original order, in which source Independent of original order, in which source program was writtenprogram was written, i.e. possibly ooo
50
References1.1. The Humble Programmer: The Humble Programmer:
http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.htmlhttp://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html
2.2. Algorithm Definitions: Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizationshttp://en.wikipedia.org/wiki/Algorithm_characterizations
3.3. http://en.wikipedia.org/wiki/Moore's_lawhttp://en.wikipedia.org/wiki/Moore's_law
4.4. C. A. R. HoareC. A. R. Hoare’’s comment on readability: s comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdfhttp://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf
5.5. Gibbons, P. B, and Steven Muchnick [1986]. “Efficient Instruction Gibbons, P. B, and Steven Muchnick [1986]. “Efficient Instruction Scheduling for a Pipelined Architecture”, ACM Sigplan Notices, Scheduling for a Pipelined Architecture”, ACM Sigplan Notices, Proceeding of ’86 Symposium on Compiler Construction, Volume 21, Proceeding of ’86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp 11-16Number 7, July 1986, pp 11-16
6.6. Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/
7.7. Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htmLinux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm
8.8. Words of wisdom: http://www.cs.yale.edu/quotes.htmlWords of wisdom: http://www.cs.yale.edu/quotes.html
9.9. John von Neumann’s computer design: A.H. Taub (ed.), “Collected John von Neumann’s computer design: A.H. Taub (ed.), “Collected Works of John von Neumann”, vol 5, pp. 34-79, The MacMillan Co., Works of John von Neumann”, vol 5, pp. 34-79, The MacMillan Co., New York 1963New York 1963