Computer Architecture and Design – ECEN 350


Part 9
[Some slides adapted from M. Irwin, D. Patterson, D. Garcia and others]

Review: Major Components of a Computer
- Processor: control and datapath
- Memory: cache, main memory, secondary memory (disk)
- Devices: input and output

Notes: Any computer, no matter how primitive or advanced, can be divided into five parts:
1. The input devices bring data from the outside world into the computer.
2. The data are kept in the computer's memory until...
3. ...the datapath requests and processes them.
4. The operation of the datapath is controlled by the computer's controller.
5. All the work done by the computer will not do us any good unless we can get the results back to the outside world; getting the data back out is the job of the output devices.

The most common way to connect these five components together is a network of buses.

Workstation design target: 25% of cost on the processor, 25% of cost on memory (at the minimum memory size), and the rest on I/O devices, power supplies, and the box.

A Typical Memory Hierarchy
- On-chip components: register file, instruction cache, data cache, ITLB, DTLB (next to the control and datapath)
- Second-level cache (SRAM)
- Main memory (DRAM)
- Secondary memory (disk)

Level:        Registers   On-chip caches   L2 (SRAM)     Main memory (DRAM)   Disk
Speed (ns):   ~0.1        ~1               ~10           ~100                 ~1,000s
Size (bytes): 100s        KBs              10s of KBs    MBs                  GBs
Cost ($/bit): highest                                                         lowest

By taking advantage of the principle of locality, we can:
- present the user with as much memory as is available in the cheapest technology, and
- provide access at the speed offered by the fastest technology.

Notes: The memory system of a modern computer consists of a series of black boxes ranging from the fastest to the slowest. Besides varying in speed, these boxes also vary in size (smallest to biggest) and cost. What makes this kind of arrangement work is one of the most important principles in computer design: the principle of locality.

The design goal is to present the user with as much memory as is available in the cheapest technology (the disk), while, by taking advantage of the principle of locality, providing an average access speed very close to the speed offered by the fastest technology. (We will go over this slide in detail in the next lectures on caches.)

The iMac's PowerPC 970: all caches on-chip
- Registers (1K)
- L1: 64 KB instruction cache and 32 KB data cache
- L2: 512 KB

Memory Hierarchy Technologies
Random Access Memories (RAMs)
- "Random" is good: the access time is the same for all locations.
- DRAM (Dynamic Random Access Memory): high density (1-transistor cells), low power, cheap, slow. Dynamic: must be refreshed regularly (~every 4 ms).
- SRAM (Static Random Access Memory): low density (6-transistor cells), high power, expensive, fast. Static: content lasts "forever" (until the power is turned off).
- Size: DRAM/SRAM ratio of 4 to 8.
- Cost and cycle time: SRAM/DRAM ratio of 8 to 16.

Not-so-random access technology
- Access time varies from location to location and from time to time (e.g., disk, CD-ROM).

Notes: The technology used to build the memory hierarchy can be divided into two categories: random access and not-so-random access. Unlike most aspects of life, where the word "random" usually suggests something bad, random is good when it comes to memory access, for lack of a better word. Random access means you can access any location at any time, and the access time will be the same as for any other location. That is not the case for disks or tapes, where the access time for a given location at a given time can be quite different from that of some other location at some other time. As far as random-access technology is concerned, we will concentrate on two specific technologies: dynamic RAM and static RAM. The advantages of DRAM are high density, low cost, and low power, so we can have a lot of it without burning a hole in our budget or our desktop. The disadvantages are that DRAM is slow and will forget what you told it unless you remind it constantly (refresh). SRAM has one main redeeming feature: it is fast. Other than that, it has low density, is expensive, and burns a lot of power. Actually, SRAM has another redeeming feature: it will not forget what you tell it; it keeps whatever you write to it "forever." Well, forever is a long time, so let's just say it keeps your data as long as you don't pull the plug on your computer. In the next two lectures, we will focus on DRAMs and SRAMs.

RAM Memory Uses and Performance Metrics
- Caches use SRAM for speed.
- Main memory uses DRAM for density.

Memory performance metrics
- Latency: time to access one word.
  - Access time: the time between the request and when the word is read or written (read and write access times can be different).
- Bandwidth: how much data can be supplied per unit time = width of the data channel × the rate at which it can be used (see the worked sketch below).

The Memory Hierarchy
- Processor, L1$, L2$, main memory, secondary memory, with increasing distance from the processor in access time, and increasing size at each level.
- Transfer units grow with distance: 4-8 bytes (a word) between the processor and L1$, 8-32 bytes (a block) between L1$ and L2$, 1 to 4 blocks between L2$ and main memory, and 1,024+ bytes (a disk sector, i.e., a page) between main memory and secondary memory.
- Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.
- Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.

The Memory Hierarchy: Why Does It Work?
- Temporal locality (locality in time): keep the most recently accessed instructions/data closer to the processor.
- Spatial locality (locality in space): move blocks consisting of contiguous words to the upper levels.
- (Blocks Blk X and Blk Y move between the lower-level and upper-level memory as data flows to and from the processor.)

Notes: How does the memory hierarchy work? It is rather simple, at least in principle. To take advantage of temporal locality, the hierarchy keeps the most recently accessed data items closer to the processor, because chances are the processor will access them again soon. To take advantage of spatial locality, not only do we move the item that has just been accessed to the upper level, but we also move the data items adjacent to it.
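As a quick illustration of the bandwidth formula above (bandwidth = channel width × transfer rate), here is a minimal sketch; the 64-bit bus width and 200 MHz transfer rate are made-up numbers for illustration, not figures from the slides.

```python
# Peak memory bandwidth = width of the data channel * rate at which it can be used.
# Bus width and transfer rate below are illustrative assumptions, not course data.
bus_width_bytes = 8            # assumed 64-bit memory bus
transfers_per_second = 200e6   # assumed 200 MHz, one transfer per cycle

peak_bandwidth = bus_width_bytes * transfers_per_second
print(f"Peak bandwidth: {peak_bandwidth / 1e9:.1f} GB/s")   # 1.6 GB/s
```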

The Memory Hierarchy: Terminology
- Hit: the data is in some block in the upper level (Blk X).
  - Hit rate: the fraction of memory accesses found in the upper level.
  - Hit time: time to access the upper level, which consists of the upper-level access time + the time to determine hit/miss.

- Miss: the data is not in the upper level, so it must be retrieved from a block in the lower level (Blk Y).
  - Miss rate = 1 − hit rate.
  - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
- Hit time << miss penalty.
- For a multilevel cache, the local miss rate of the lower-level (L2) cache is greater than the global miss rate.

Notes:
- Global miss rate: the fraction of references that miss in all levels of a multilevel cache. The global miss rate dictates how often we must access the main memory.
- Local miss rate: the fraction of references to one level of a cache that miss in that level.

Key Cache Design Parameters

                              L1 typical         L2 typical
Total size (blocks)           250 to 2,000       4,000 to 250,000
Total size (KB)               16 to 64           500 to 8,000
Block size (B)                32 to 64           32 to 128
Miss penalty (clocks)         10 to 25           100 to 1,000
Miss rates (global for L2)    2% to 5%           0.1% to 2%

Two Machines' Cache Parameters

                      Intel P4                                   AMD Opteron
L1 organization       Split I$ and D$                            Split I$ and D$
L1 cache size         8 KB for D$, 96 KB for trace cache (~I$)   64 KB each for I$ and D$
L1 block size         64 bytes                                   64 bytes
L1 associativity      4-way set assoc.                           2-way set assoc.
L1 replacement        ~LRU                                       LRU
L1 write policy       write-through                              write-back
L2 organization       Unified                                    Unified
L2 cache size         512 KB                                     1,024 KB (1 MB)
L2 block size         128 bytes                                  64 bytes
L2 associativity      8-way set assoc.                           16-way set assoc.
L2 replacement        ~LRU                                       ~LRU
L2 write policy       write-back                                 write-back

Notes: A trace cache finds a dynamic sequence of instructions, including taken branches, to load into a cache block. The cache blocks thus contain dynamic traces of the executed instructions as determined by the CPU, rather than static sequences of instructions as determined by memory layout. It folds branch prediction into the cache.
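To make the hit/miss terminology and the local-versus-global miss rate distinction concrete, here is a small sketch; the specific hit times, miss penalties, and miss rates are illustrative assumptions, not the parameters from the tables above.

```python
# Average memory access time (AMAT) and global miss rate for a two-level cache.
# All numbers below are assumed for illustration only.
l1_hit_time  = 1      # cycles
l1_miss_rate = 0.05   # local miss rate of L1
l2_hit_time  = 10     # cycles to access L2
l2_miss_rate = 0.20   # local miss rate of L2 (fraction of L2 accesses that miss)
mem_penalty  = 100    # cycles to access main memory

# Global miss rate: fraction of all references that miss in every level.
global_miss_rate = l1_miss_rate * l2_miss_rate    # 0.01, smaller than the 0.20 local rate

# AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory penalty)
amat = l1_hit_time + l1_miss_rate * (l2_hit_time + l2_miss_rate * mem_penalty)
print(global_miss_rate, amat)    # 0.01  2.5 cycles
```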

Example Problem
How many total bits are required for a direct-mapped cache with 128 KB of data and a 1-word block size, assuming a 32-bit address?

- Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^15 blocks
- Cache entry size = block data bits + tag bits + valid bit = 32 + (32 − 15 − 2) + 1 = 48 bits
- Therefore, cache size = 2^15 × 48 bits = 2^15 × (1.5 × 32) bits = 1.5 × 2^20 bits = 1.5 Mbits
- Data bits in the cache = 128 KB × 8 = 1 Mbit
- Total cache size / actual cache data = 1.5
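A small sketch that redoes this bookkeeping in code; the function name and parameters are mine, chosen for illustration, and it assumes a 32-bit byte address and a direct-mapped cache with one valid bit per block, as in the example.

```python
def direct_mapped_cache_bits(data_kib, block_words, addr_bits=32, word_bytes=4):
    """Total storage bits (data + tag + valid) for a direct-mapped cache."""
    data_bits   = data_kib * 1024 * 8
    block_bits  = block_words * word_bytes * 8
    num_blocks  = data_bits // block_bits
    index_bits  = num_blocks.bit_length() - 1                   # log2(number of blocks)
    offset_bits = (block_words * word_bytes).bit_length() - 1   # byte offset within a block
    tag_bits    = addr_bits - index_bits - offset_bits
    entry_bits  = block_bits + tag_bits + 1                     # data + tag + valid
    return num_blocks * entry_bits

total = direct_mapped_cache_bits(128, 1)
print(total, total / 2**20)   # 1572864 bits = 1.5 Mbits
```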


[Figure: direct-mapped cache with one-word blocks. The 32-bit address splits into a 15-bit tag, a 15-bit index, and a 2-bit byte offset; each cache entry holds a valid bit, a tag, and one data word.]

Example Problem
How many total bits are required for a direct-mapped cache with 128 KB of data and a 4-word block size, assuming a 32-bit address?

- Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^13 blocks
- Cache entry size = block data bits + tag bits + valid bit = 128 + (32 − 13 − 2 − 2) + 1 = 144 bits
- Therefore, cache size = 2^13 × 144 bits = 2^13 × (1.125 × 128) bits = 1.125 × 2^20 bits = 1.125 Mbits
- Data bits in the cache = 128 KB × 8 = 1 Mbit
- Total cache size / actual cache data = 1.125
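Using the direct_mapped_cache_bits sketch defined after the previous example (still an illustrative helper, not course code), the 4-word-block case comes out the same way:

```python
total = direct_mapped_cache_bits(128, 4)
print(total, total / 2**20)   # 1179648 bits = 1.125 Mbits
```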


[Figure: direct-mapped cache with 4-word (128-bit) blocks. The address splits into tag, index, block offset, and byte offset fields; on a hit, a multiplexor selects the requested 32-bit word from the 128-bit block.]

Example Problems
Assume, for a given machine and program:
- instruction cache miss rate: 2%
- data cache miss rate: 4%
- miss penalty: always 40 cycles
- CPI of 2 without memory stalls
- frequency of loads/stores: 36% of instructions

1. How much faster is a machine with a perfect cache that never misses?
2. What happens if we speed up the machine by reducing its CPI to 1 without changing the clock rate?
3. What happens if we speed up the machine by doubling its clock rate, while the absolute time for a miss penalty remains the same?

Solution
1. Assume instruction count = I.
   - Instruction miss cycles = I × 2% × 40 = 0.8 I
   - Data miss cycles = I × 36% × 4% × 40 = 0.576 I
   - Total memory-stall cycles = 0.8 I + 0.576 I = 1.376 I, in other words, 1.376 stall cycles per instruction.
   - Therefore, CPI with memory stalls = 2 + 1.376 = 3.376.
   - Assuming the instruction count and clock rate remain the same for a perfect cache and a cache that misses: CPU time with stalls / CPU time with perfect cache = 3.376 / 2 = 1.688.
   - Performance with a perfect cache is better by a factor of 1.688.

Solution (cont.)
2. CPI without stalls = 1.
   - CPI with stalls = 1 + 1.376 = 2.376 (the clock has not changed, so the stall cycles per instruction remain the same).
   - CPU time with stalls / CPU time with perfect cache = CPI with stalls / CPI without stalls = 2.376.
   - Performance with a perfect cache is better by a factor of 2.376.
   - Conclusion: with a lower base CPI, cache misses hurt relatively more than with a higher base CPI.

Solution (cont.)
3. With a doubled clock rate, the miss penalty = 2 × 40 = 80 clock cycles.
   - Stall cycles per instruction = (I × 2% × 80) + (I × 36% × 4% × 80) = 2.752 I
   - So the faster machine with cache misses has CPI = 2 + 2.752 = 4.752.
   - CPU time with stalls / CPU time with perfect cache = CPI with stalls / CPI without stalls = 4.752 / 2 = 2.376.
   - Performance with a perfect cache is better by a factor of 2.376.
   - Conclusion: with a higher clock rate, cache misses hurt more than with a lower clock rate.
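A compact sketch of the same stall-cycle arithmetic; the helper name and its parameters are made up for illustration, but the inputs mirror the example above (2% I-cache and 4% D-cache miss rates, 36% loads/stores, 40-cycle miss penalty, base CPI of 2).

```python
def cpi_with_stalls(base_cpi, i_miss, d_miss, ls_freq, penalty):
    """Base CPI plus memory-stall cycles per instruction."""
    stalls = i_miss * penalty + ls_freq * d_miss * penalty
    return base_cpi + stalls

# 1. Base machine vs. a perfect cache
cpi1 = cpi_with_stalls(2, 0.02, 0.04, 0.36, 40)
print(cpi1, cpi1 / 2)          # 3.376, perfect cache is faster by 1.688

# 2. CPI reduced to 1, same clock rate
cpi2 = cpi_with_stalls(1, 0.02, 0.04, 0.36, 40)
print(cpi2, cpi2 / 1)          # 2.376, perfect cache is faster by 2.376

# 3. Clock rate doubled, so the 40-cycle penalty becomes 80 cycles
cpi3 = cpi_with_stalls(2, 0.02, 0.04, 0.36, 80)
print(cpi3, cpi3 / 2)          # 4.752, perfect cache is faster by 2.376
```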

4 Questions for the Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)

Q2: How is a block found if it is in the upper level? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)

Q1 & Q2: Where can a block be placed/found?

                    # of sets                                  Blocks per set
Direct mapped       # of blocks in the cache                   1
Set associative     (# of blocks in the cache) / associativity Associativity (typically 2 to 16)
Fully associative   1                                          # of blocks in the cache

                    Location method                            # of comparisons
Direct mapped       Index                                      1
Set associative     Index the set; compare the set's tags      Degree of associativity
Fully associative   Compare all blocks' tags                   # of blocks
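To show how the "index, then compare tags" lookup in the table above works, here is a minimal sketch of splitting a byte address into tag, index, and offset; the cache geometry (4 KB direct-mapped, 16-byte blocks, 32-bit addresses) is an illustrative assumption.

```python
# Split a 32-bit byte address into tag / index / block offset for a cache lookup.
# Cache geometry below is assumed for illustration (4 KB, direct mapped, 16 B blocks).
CACHE_BYTES = 4 * 1024
BLOCK_BYTES = 16
NUM_SETS    = CACHE_BYTES // BLOCK_BYTES      # 256 sets (1 block per set: direct mapped)
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1    # 4
INDEX_BITS  = NUM_SETS.bit_length() - 1       # 8

def split_address(addr):
    offset = addr & (BLOCK_BYTES - 1)
    index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))   # 0x12345 0x67 0x8 -> compare the tag stored at set 0x67
```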

Q3: Which block should be replaced on a miss?
- Easy for direct mapped: there is only one choice.
- Set associative or fully associative:
  - Random
  - LRU (Least Recently Used)
- For a 2-way set-associative cache, random replacement has a miss rate about 1.1 times higher than LRU.
- LRU is too costly to implement for high degrees of associativity (> 4-way), since tracking the usage information is expensive.

Q4: What happens on a write?
- Write-through: the information is written both to the block in the cache and to the block in the next lower level of the memory hierarchy.
  - Write-through is always combined with a write buffer, so waits for writes to lower-level memory can be eliminated (as long as the write buffer doesn't fill).
- Write-back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - Needs a dirty bit to keep track of whether the block is clean or dirty.
- Pros and cons of each?
  - Write-through: read misses don't result in writes (so caches are simpler and cheaper).
  - Write-back: repeated writes to a block require only one write to the lower level.
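A toy sketch contrasting the two write policies for a single cache block; the class and field names are invented for illustration and omit real details such as tags, write buffers, and allocation policy.

```python
# Toy model of write-back vs. write-through for one cache block.
# Names and structure are illustrative only.
class CacheBlock:
    def __init__(self, write_back=True):
        self.write_back = write_back
        self.data = None
        self.dirty = False              # only meaningful for write-back

    def write(self, value, memory, addr):
        self.data = value
        if self.write_back:
            self.dirty = True           # defer the memory update
        else:
            memory[addr] = value        # write-through: update memory immediately

    def evict(self, memory, addr):
        if self.write_back and self.dirty:
            memory[addr] = self.data    # write the block back only on replacement
            self.dirty = False

memory = {}
blk = CacheBlock(write_back=True)
blk.write(1, memory, 0x100)
blk.write(2, memory, 0x100)             # repeated writes touch only the cache
blk.evict(memory, 0x100)                # a single write to memory on eviction
print(memory)                           # {256: 2}
```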

Improving Cache Performance
0. Reduce the time to hit in the cache:
- smaller cache
- direct-mapped cache
- smaller blocks
- for writes:
  - no write allocate: no "hit" on the cache, just write to the write buffer
  - write allocate: to avoid two cycles (first check for a hit, then write), pipeline writes to the cache via a delayed write buffer

1. Reduce the miss rate:
- bigger cache
- more flexible placement (increase associativity)
- larger blocks (16 to 64 bytes typical)
- victim cache: a small buffer holding the most recently discarded blocks

2. Reduce the miss penalty:
- smaller blocks
- use a write buffer to hold dirty blocks being replaced, so a read miss doesn't have to wait for the write to complete
  - check the write buffer (and/or victim cache) on a read miss: you may get lucky
- for large blocks, fetch the critical word first
- use multiple cache levels: an L2 cache is not tied to the CPU clock rate
- faster backing store / improved memory bandwidth: wider buses, memory interleaving, page-mode DRAMs

Summary: The Cache Design Space
- Several interacting dimensions:
  - cache size
  - block size
  - associativity
  - replacement policy
  - write-through vs. write-back
  - write allocation
- The optimal choice is a compromise:
  - it depends on access characteristics (the workload, and the use as I-cache, D-cache, or TLB)
  - it depends on technology and cost
- Simplicity often wins.

[Figure: the cache design space shown as trade-off curves between "good" and "bad" along axes such as associativity, cache size, and block size.]

Notes: No fancy replacement policy is needed for a direct-mapped cache. As a matter of fact, that is what causes a direct-mapped cache trouble to begin with: there is only one place a block can go, which causes conflict misses.


Besides working at Sun, I also teach people how to fly whenever I have time. Statistics have shown that if a pilot crashes after an engine failure, he or she is more likely to be killed in a light multi-engine airplane than in a single-engine airplane. The joke among us flight instructors is: sure, when the engine quits in a single-engine airplane, you have one option: sooner or later you land, probably sooner. But in a multi-engine airplane with one engine out, you have a lot of options, and it is the need to make a decision that kills those people.

Other Ways to Reduce Cache Miss Rates
- Allow more flexible block placement:
  - In a direct-mapped cache, a memory block maps to exactly one cache block.
  - At the other extreme, a memory block could be mapped to any cache block: a fully associative cache.
  - A compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative).
- Use multiple levels of caches:
  - Add a second level of cache on chip, normally a unified L2 cache (i.e., it holds both instructions and data).
  - The L1 caches focus on minimizing hit time in support of a shorter clock cycle (so they are smaller, with smaller block sizes).
  - The L2 cache focuses on reducing the miss rate, to reduce the penalty of long main-memory access times (so it is larger, with larger block sizes).

Cache Summary
- The principle of locality: a program is likely to access a relatively small portion of the address space at any instant of time.
  - Temporal locality: locality in time.
  - Spatial locality: locality in space.
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life, e.g., cold-start misses.
  - Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
  - Capacity misses: increase cache size.
- The cache design space:
  - total size, block size, associativity (and replacement policy)
  - write-hit policy (write-through, write-back)
  - write-miss policy (write allocate, write buffers)

Notes: Let's summarize today's lecture. I know you have heard this many times and many ways, but it is still worth repeating. The memory hierarchy works because of the principle of locality, which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality, or locality in time, and spatial locality, or locality in space. So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to a cold start; you cannot avoid them, but if you are going to run billions of instructions anyway, compulsory misses usually don't bother you. Conflict misses are caused by multiple memory locations being mapped to the same cache location. The nightmare scenario is the ping-pong effect, when a block is read into the cache but, before we have a chance to use it, is immediately forced out by another conflict miss. You can reduce conflict misses by increasing the cache size, increasing the associativity, or both. Finally, capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program; you can reduce this miss rate by making the cache larger. There are two write policies as far as cache writes are concerned. Write-through requires a write buffer, and the nightmare scenario there is when stores occur so frequently that they saturate the write buffer. The second write policy is write-back: you only write to the cache, and only when the cache block is being replaced do you write it back to memory. No fancy replacement policy is needed for a direct-mapped cache.
In fact, that is what causes a direct-mapped cache trouble to begin with: there is only one place a block can go, which causes conflict misses. (A small sketch of this ping-pong effect follows at the end of this section.)

Improving Cache Performance (recap)
- Reduce the hit time: smaller cache; direct-mapped cache; smaller blocks; for writes, either no write allocate (just write to the write buffer) or write allocate (write to a delayed write buffer that then writes to the cache).
- Reduce the miss rate: bigger cache; associative cache; larger blocks (16 to 64 bytes); use a victim cache, a small buffer that holds the most recently discarded blocks.
- Reduce the miss penalty: smaller blocks; for large blocks, fetch the critical word first; use a write buffer and check it (and/or the victim cache) on a read miss, you may get lucky; use multiple cache levels (an L2 cache is not tied to the CPU clock rate); faster backing store / improved memory bandwidth (wider buses, SDRAMs).
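As a closing illustration of the ping-pong effect mentioned above, here is a toy sketch that counts misses when two addresses map to the same set of a direct-mapped cache versus a 2-way set-associative cache of the same total size; the cache geometry and access pattern are made-up illustrative values.

```python
# Toy miss counter: two addresses that collide in a direct-mapped cache
# "ping-pong" against each other, while a 2-way set-associative cache holds both.
# Geometry and the access pattern below are assumed for illustration.
def count_misses(addresses, num_sets, ways, block_bytes=16):
    sets = [[] for _ in range(num_sets)]           # each set holds up to `ways` tags (LRU order)
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index, tag = block % num_sets, block // num_sets
        s = sets[index]
        if tag in s:
            s.remove(tag)                          # hit: refresh the LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                           # evict the least recently used tag
        s.append(tag)
    return misses

# Two addresses 4 KB apart land in the same set of a 256-set direct-mapped cache.
pattern = [0x0000, 0x1000] * 8                      # alternate between the two blocks
print(count_misses(pattern, num_sets=256, ways=1))  # 16 misses: every access conflicts
print(count_misses(pattern, num_sets=128, ways=2))  # 2 misses: both blocks fit in one set
```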