Cache Data


    In computer engineering, a cache (/kæʃ/ KASH, or /keɪʃ/ KAYSH in Australian and New Zealand English) is a component

    that transparently stores data so that future requests for that data can be served faster. The data

    that is stored within a cache might be values that have been computed earlier or duplicates of

    original values that are stored elsewhere. If requested data is contained in the cache (cache hit),

    this request can be served by simply reading the cache, which is comparatively faster. Otherwise

    (cache miss), the data has to be recomputed or fetched from its original storage location, which is

    comparatively slower. Hence, the more requests that can be served from the cache, the faster the
    overall system performance.

    To be cost-efficient and to enable efficient use of data, caches are relatively small. Nevertheless,
    caches have proven themselves in many areas of computing because access patterns in
    typical computer applications have locality of reference. References exhibit temporal locality if data
    that has recently been requested is requested again. References exhibit spatial locality if
    data is requested that is physically stored close to data that has already been requested.

    Diagram of a CPU memory cache

    Contents


    1 Operation

    2 Applications

    2.1 CPU cache

    2.2 Disk cache

    2.3 Web cache

    2.4 Other caches

    2.5 The difference between buffer and cache

    3 See also

    4 Further reading

    5 References

    Operation


    Hardware implements cache as a block of memory for temporary storage of data likely to be used

    again. CPUs and hard drives frequently use a cache, as do web browsers and web servers.

    A cache is made up of a pool of entries. Each entry has a datum (a nugget of data) - a copy of the

    same datum in some backing store. Each entry also has a tag, which specifies the identity of the

    datum in the backing store of which the entry is a copy.

    When the cache client (a CPU, web browser, operating system) needs to access a datum

    presumed to exist in the backing store, it first checks the cache. If an entry can be found with a tag

    matching that of the desired datum, the datum in the entry is used instead. This situation is known

    as a cache hit. So, for example, a web browser program might check its local cache on disk to see

    if it has a local copy of the contents of a web page at a particular URL. In this example, the URL is

    the tag, and the contents of the web page is the datum. The percentage of accesses that result in

    cache hits is known as the hit rate or hit ratio of the cache.

    The alternative situation, when the cache is consulted and found not to contain a datum with the

    desired tag, has become known as a cache miss. The previously uncached datum fetched from

    the backing store during miss handling is usually copied into the cache, ready for the next access.
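
    As a rough sketch of this lookup-and-fill behavior (a hypothetical Python example, not any
    particular system's code, with backing_store standing in for the slower original storage):

        # Minimal sketch of cache-hit / cache-miss handling (hypothetical example).
        backing_store = {"/index.html": "<html>...</html>"}   # the slow original storage
        cache = {}                                             # tag -> datum
        hits = misses = 0

        def read(tag):
            """Return the datum for `tag`, filling the cache on a miss."""
            global hits, misses
            if tag in cache:           # cache hit: serve from the fast copy
                hits += 1
                return cache[tag]
            misses += 1                # cache miss: go to the backing store
            datum = backing_store[tag]
            cache[tag] = datum         # keep a copy, ready for the next access
            return datum

        read("/index.html")            # miss, fetched from the backing store
        read("/index.html")            # hit, served from the cache
        print("hit rate:", hits / (hits + misses))   # 0.5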

    During a cache miss, the CPU usually ejects some other entry in order to make room for the
    previously uncached datum. The heuristic used to select the entry to eject is known as
    the replacement policy. One popular replacement policy, "least recently used" (LRU), replaces the
    least recently used entry (see cache algorithms). More efficient caches compute use frequency
    against the size of the stored contents, as well as the latencies and throughputs of both the cache
    and the backing store. While this works well for larger amounts of data, long latencies and slow
    throughputs, such as those experienced with a hard drive and the Internet, it is not efficient for use with a
    CPU cache.
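
    A minimal sketch of the LRU policy, assuming a fixed capacity and using Python's OrderedDict
    purely for illustration (real hardware tracks recency very differently):

        from collections import OrderedDict

        class LRUCache:
            """Toy cache with a "least recently used" replacement policy."""
            def __init__(self, capacity, backing_store):
                self.capacity = capacity
                self.backing_store = backing_store
                self.entries = OrderedDict()          # tag -> datum, least recently used first

            def read(self, tag):
                if tag in self.entries:               # hit: refresh this entry's recency
                    self.entries.move_to_end(tag)
                    return self.entries[tag]
                datum = self.backing_store[tag]       # miss: fetch from the backing store
                if len(self.entries) >= self.capacity:
                    self.entries.popitem(last=False)  # evict the least recently used entry
                self.entries[tag] = datum
                return datum

        cache = LRUCache(2, {"a": 1, "b": 2, "c": 3})
        cache.read("a"); cache.read("b"); cache.read("a")
        cache.read("c")                               # evicts "b", the least recently used tag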

    When a system writes a datum to the cache, it must at some point write that datum to the backing

    store as well. The timing of this write is controlled by what is known as the write policy.

    In a write-through cache, every write to the cache causes a synchronous write to the backing

    store.

    Alternatively, in a write-back (or write-behind) cache, writes are not immediately mirrored to the

    store. Instead, the cache tracks which of its locations have been written over and marks these

    locations as dirty. The data in these locations are written back to the backing store when those

    data are evicted from the cache, an effect referred to as a lazy write. For this reason, a read miss

    in a write-back cache (which requires a block to be replaced by another) will often require two


    memory accesses to service: one to retrieve the needed datum, and one to write replaced data

    from the cache to the store.

    Other policies may also trigger data write-back. The client may make many changes to a datum in

    the cache, and then explicitly notify the cache to write back the datum.

    No-write allocation (a.k.a. write-no-allocate) is a cache policy which caches only processor reads,

    i.e. on a write-miss:

    Datum is written directly to memory,

    Datum at the missed-write location is not added to cache.

    This avoids the need for write-back or write-through when the old value of the datum was absent

    from the cache prior to the write.
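
    The write policies above can be contrasted in a short hypothetical sketch: write_through updates
    the backing store on every write, write_back only marks the entry dirty and defers the store update
    until eviction, and write_no_allocate bypasses the cache entirely on a write miss:

        # Hypothetical sketch of write policies (not any real system's code).
        store = {}                     # backing store
        cache = {}                     # tag -> (datum, dirty_flag)

        def write_through(tag, datum):
            cache[tag] = (datum, False)
            store[tag] = datum         # synchronous write to the backing store

        def write_back(tag, datum):
            cache[tag] = (datum, True) # mark dirty; the store is updated lazily

        def evict(tag):
            datum, dirty = cache.pop(tag)
            if dirty:                  # the lazy write happens only now
                store[tag] = datum

        def write_no_allocate(tag, datum):
            if tag in cache:           # write hit: update the cached copy
                cache[tag] = (datum, True)
            else:
                store[tag] = datum     # write miss: go straight to memory, don't cache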

    Entities other than the cache may change the data in the backing store, in which case the copy in

    the cache may become out-of-date or stale. Alternatively, when the client updates the data in the
    cache, copies of those data in other caches will become stale. Communication protocols between
    the cache managers which keep the data consistent are known as coherency protocols.

    Applications

    CPU cache

    Main article: CPU cache

    Small memories on or close to the CPU can operate faster than the much larger main memory.
    Most CPUs since the 1980s have used one or more caches, and modern high-end embedded,
    desktop and server microprocessors may have as many as half a dozen, each specialized for a

    specific function. Examples of caches with a specific function are the D-cache and I-cache (data

    cache and instruction cache).

    Disk cache

    Main article: Page cache

    While CPU caches are generally managed entirely by hardware, a variety of software manages

    other caches. The page cache in main memory, which is an example of disk cache, is managed by

    the operating system kernel.

    While the hard drive's hardware disk buffer is sometimes misleadingly referred to as "disk cache",

    its main functions are write sequencing and read prefetching. Repeated cache hits are relatively

    rare, due to the small size of the buffer in comparison to the drive's capacity. However, high-

    end disk controllers often have their own on-board cache of hard disk data blocks.


    Finally, a fast local hard disk can also cache information held on even slower data storage devices,
    such as remote servers (web cache) or local tape drives or optical jukeboxes. Such a scheme is the
    main concept of hierarchical storage management.

    Web cache

    Main article: Web cache

    Web browsers and web proxy servers employ web caches to store previous responses from web

    servers, such as web pages. Web caches reduce the amount of information that needs to be

    transmitted across the network, as information previously stored in the cache can often be re-used.

    This reduces bandwidth and processing requirements of the web server, and helps to

    improve responsiveness for users of the web.

    Web browsers employ a built-in web cache, but some internet service providers or organizations

    also use a caching proxy server, which is a web cache that is shared among all users of that

    network.

    Another form of cache is P2P caching, where the files most sought after by peer-to-peer applications

    are stored in an ISP cache to accelerate P2P transfers. Similarly, decentralised equivalents exist,

    which allow communities to perform the same task for P2P traffic, e.g. Corelli [1]

    Other caches

    The BIND DNS daemon caches a mapping of domain names to IP addresses, as does a resolver

    library.

    Write-through operation is common when operating over unreliable networks (like an Ethernet

    LAN), because of the enormous complexity of the coherency protocol required between multiple
    write-back caches when communication is unreliable. For instance, web page caches and client-
    side network file system caches (like those in NFS or SMB) are typically read-only or write-through

    specifically to keep the network protocol simple and reliable.

    Search engines also frequently make web pages they have indexed available from their cache. For
    example, Google provides a "Cached" link next to each search result. This can prove useful when
    web pages from a web server are temporarily or permanently inaccessible.

    Another type of caching is storing computed results that will likely be needed again,

    or memoization. ccache, a program that caches compiler output to speed up subsequent
    compilations, exemplifies this type.
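
    Memoization of this kind is available directly in Python's standard library; the snippet below
    caches the results of an assumed expensive function so that repeated calls with the same
    argument are served from memory:

        from functools import lru_cache

        @lru_cache(maxsize=128)        # keep up to 128 computed results
        def slow_square(n):
            print("computing", n)      # only printed on a cache miss
            return n * n

        slow_square(4)                 # computed
        slow_square(4)                 # served from the cache, nothing recomputed
        print(slow_square.cache_info())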

    Database caching can substantially improve the throughput of database applications, for example
    in the processing of indexes, data dictionaries, and frequently used subsets of data.


    Distributed caching[2] uses caches spread across different networked hosts, e.g. Corelli

    The difference between buffer and cache

    The terms "buffer" and "cache" are not mutually exclusive and the functions are frequently

    combined; however, there is a difference in intent.

    A buffer is a temporary memory location that is traditionally used because CPU instructions cannot
    directly address data stored in peripheral devices. Thus, addressable memory is used as an
    intermediate stage. Additionally, such a buffer may be feasible when a large block of data is
    assembled or disassembled (as required by a storage device), or when data may be delivered in a
    different order than that in which it is produced. Also, a whole buffer of data is usually transferred
    sequentially (for example to hard disk), so buffering itself sometimes increases transfer
    performance or reduces the variation or jitter of the transfer's latency, as opposed to caching, where
    the intent is to reduce the latency. These benefits are present even if the buffered data are written
    to the buffer once and read from the buffer once.

    A cache also increases transfer performance. A part of the increase similarly comes from the
    possibility that multiple small transfers will combine into one large block. But the main performance
    gain occurs because there is a good chance that the same datum will be read from cache multiple
    times, or that written data will soon be read. A cache's sole purpose is to reduce accesses to the
    underlying slower storage. A cache is also usually an abstraction layer that is designed to be invisible
    from the perspective of neighbouring layers.

    A CPU cache is a cache used by the central processing unit of a computer to reduce the average
    time to access memory. The cache is a smaller, faster memory which stores copies of the data
    from the most frequently used main memory locations. As long as most memory accesses are to
    cached memory locations, the average latency of memory accesses will be closer to the cache
    latency than to the latency of main memory.

    When the processor needs to read from or write to a location in main memory, it first checks

    whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to

    the cache, which is much faster than reading from or writing to main memory.

    Most modern desktop and server CPUs have at least three independent caches: an instruction
    cache to speed up executable instruction fetch, a data cache to speed up data fetch and store,


    and a translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation for
    both executable instructions and data. The data cache is usually organized as a hierarchy of more
    cache levels (L1, L2, etc.; see Multi-level caches).

    Details of operation

    This section describes a typical data cache and some instruction caches; a TLB may have more
    complexity and an instruction cache may be simpler. The diagram on the right shows two
    memories. Each location in each memory contains data (a cache line), which in different designs
    may range in size from 8 to 512 bytes. The size of the cache line is usually larger than the
    size of the usual access requested by a CPU instruction, which ranges from 1 to 16
    bytes (the largest addresses and data handled by current 32 bit and 64 bit architectures
    being 128 bits long, i.e. 16 bytes). Each location in each memory also has an index, which
    is a unique number used to refer to that location. The index for a location in main memory is called
    an address. Each location in the cache has a tag that contains the index of the datum in main
    memory that has been cached. In a CPU's data cache these entries are called cache lines or cache
    blocks.

    When the processor needs to read or write a location in main memory, it first checks whether that
    memory location is in the cache. This is accomplished by comparing the address of the memory
    location to all tags in the cache that might contain that address. If the processor finds that the
    memory location is in the cache, we say that a cache hit has occurred; otherwise, we speak of
    a cache miss. In the case of a cache hit, the processor immediately reads or writes the data in the
    cache line. The proportion of accesses that result in a cache hit is known as the hit rate, and is a
    measure of the effectiveness of the cache for a given program or algorithm.

    In the case of a miss, the cache allocates a new entry, which comprises the tag just missed and a
    copy of the data. The reference can then be applied to the new entry just as in the case of a hit.

    Read misses delay execution because they require data to be transferred from a much slower

    memory than the cache itself. Write misses may occur without such penalty since the data can be

    copied in the background. Instruction caches are similar to data caches but the CPU only performs

    read accesses (instruction fetch) to the instruction cache. Instruction and data caches can be


    separated for higher performance with Harvard CPUs, but they can also be combined to reduce the

    hardware overhead.

    In order to make room for the new entry on a cache miss, the cache has to evict one of the existing
    entries. The heuristic that it uses to choose the entry to evict is called the replacement policy. The
    fundamental problem with any replacement policy is that it must predict which existing cache entry
    is least likely to be used in the future. Predicting the future is difficult, especially for hardware
    caches that use simple rules amenable to implementation in circuitry, so there are a variety of
    replacement policies to choose from and no perfect way to decide among them. One popular
    replacement policy, LRU, replaces the least recently used entry. Defining some memory ranges as
    non-cacheable avoids caching information that is never or seldom re-used; cache misses are simply
    ignored for non-cacheable data. Cache entries may also be disabled or locked depending on the
    context.

    If data are written to the cache, they must at some point be written to main memory as well. The
    timing of this write is controlled by what is known as the write policy. In a write-through cache, every
    write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache,
    writes are not immediately mirrored to the main memory. Instead, the cache tracks which locations
    have been written over (these locations are marked dirty). The data in these locations are written
    back to the main memory when that data is evicted from the cache. For this reason, a miss in a
    write-back cache may sometimes require two memory accesses to service: one to first write the
    dirty location to memory and then another to read the new location from memory.

    There are intermediate policies as well. The cache may be write-through, but the writes may be

    held in a store data queue temporarily, usually so that multiple stores can be processed together

    (which can reduce bus turnarounds and so improve bus utilization).

    The data in main memory being cached may be changed by other entities (e.g. peripherals
    using direct memory access or a multi-core processor), in which case the copy in the cache may
    become out-of-date or stale. Alternatively, when the CPU in a multi-core processor updates the
    data in the cache, copies of data in caches associated with other cores will become stale.
    Communication protocols between the cache managers which keep the data consistent are known
    as cache coherence protocols. Another possibility is to share non-cacheable data.

    The time taken to fetch one datum from memory (read latency) matters because the CPU will run
    out of things to do while waiting for the datum. When a CPU reaches this state, it is called a stall.
    As CPUs become faster, stalls due to cache misses displace more potential computation; modern
    CPUs can execute hundreds of instructions in the time taken to fetch a single datum from the main
    memory. Various techniques have been employed to keep the CPU busy during this time.


    Out-of-order CPUs (the Pentium Pro and later Intel designs, for example) attempt to execute independent
    instructions after the instruction that is waiting for the cache miss data. Another technology, used by
    many processors, is simultaneous multithreading (SMT), or, in Intel's terminology, hyper-
    threading (HT), which allows an alternate thread to use the CPU core while a first thread waits for

    data to come from main memory.

    Cache entry structure

    Cache row entries usually have the following structure:

    tag | data blocks | valid bit

    The data blocks (cache line) contain the actual data fetched from the main memory. The valid bit
    denotes whether this particular entry contains valid data.

    An effective memory address is split (MSB to LSB) into the tag, the index and the displacement

    (offset),

    tag | index | displacement

    The index length is ceil(log2(number of cache rows)) bits and describes which row the data has been put
    in. The displacement length is ceil(log2(cache line size in bytes)) bits and specifies which block of the ones we
    have stored we need. The tag length
    is address_length - index_length - displacement_length and contains the most significant bits of
    the address, which are checked against the current row (the row has been retrieved by index) to
    see if it is the one we need or another, irrelevant memory location that happened to have the same
    index bits as the one we want.
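
    To make the split concrete, the following sketch carves an address into its three fields under
    assumed parameters (32-bit addresses, 256 rows, 64-byte lines); the field widths follow the
    formulas above:

        # Hypothetical parameters: 32-bit addresses, 256 rows, 64-byte cache lines.
        ADDRESS_LENGTH = 32
        INDEX_LENGTH = 8               # log2(256 rows)
        DISPLACEMENT_LENGTH = 6        # log2(64 bytes per line)
        TAG_LENGTH = ADDRESS_LENGTH - INDEX_LENGTH - DISPLACEMENT_LENGTH   # 18 bits

        def split_address(addr):
            displacement = addr & ((1 << DISPLACEMENT_LENGTH) - 1)
            index = (addr >> DISPLACEMENT_LENGTH) & ((1 << INDEX_LENGTH) - 1)
            tag = addr >> (DISPLACEMENT_LENGTH + INDEX_LENGTH)
            return tag, index, displacement

        tag, index, displacement = split_address(0x12345678)
        print(hex(tag), hex(index), hex(displacement))   # 0x48d1 0x59 0x38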

    Associativity

    Which memory locations can be cached by which cache locations


    Associativity is a trade-off. If there are ten places to which the replacement policy could have

    mapped a memory location, then to check if that location is in the cache, ten cache entries must be

    searched. Checking more places takes more power, chip area, and potentially time. On the other

    hand, caches with more associativity suffer fewer misses (see conflict misses, below), so that the

    CPU wastes less time reading from the slow main memory. The rule of thumb is that doubling the

    associativity, from direct mapped to 2-way, or from 2-way to 4-way, has about the same effect on

    hit rate as doubling the cache size. Associativity increases beyond 4-way have much less effect on

    the hit rate,[1] and are generally done for other reasons (see virtual aliasing, below).

    In order of increasing (worse) hit times and decreasing (better) miss rates:

    direct mapped cache: the best (fastest) hit times, and so the best tradeoff for "large" caches

    2-way set associative cache

    2-way skewed associative cache: "the best tradeoff for ... caches whose sizes are in the
    range 4K-8K bytes" (André Seznec)[2]

    4-way set associative cache

    fully associative cache: the best (lowest) miss rates, and so the best tradeoff when the
    miss penalty is very high

    2-way set associative cache

    If each location in main memory can be cached in either of two locations in the cache, one logical

    question is: which two? The simplest and most commonly used scheme, shown in the right-hand

    diagram above, is to use the least significant bits of the memory location's index as the index for the

    cache memory, and to have two entries for each index. One benefit of this scheme is that the tags

    stored in the cache do not have to include that part of the main memory address which is implied by

    the cache memory's index. Since the cache tags are fewer bits, they take less area on the

    microprocessor chip and can be read and compared faster.
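
    A hedged sketch of such a lookup: the low bits of the block address select a set of two entries,
    and only the remaining high bits are stored and compared as the tag (sizes here are illustrative,
    not taken from any particular CPU):

        NUM_SETS = 128                            # illustrative size
        WAYS = 2

        # sets[index] is a list of up to WAYS (tag, datum) pairs
        sets = [[] for _ in range(NUM_SETS)]

        def lookup(block_address):
            index = block_address % NUM_SETS      # low bits select the set
            tag = block_address // NUM_SETS       # only the high bits are stored as the tag
            for stored_tag, datum in sets[index]:
                if stored_tag == tag:             # compare the tags of both ways
                    return datum                  # hit
            return None                           # miss

        def fill(block_address, datum):
            index = block_address % NUM_SETS
            tag = block_address // NUM_SETS
            way = sets[index]
            if len(way) >= WAYS:                  # set full: evict one entry (here, the oldest)
                way.pop(0)
            way.append((tag, datum))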

    Speculative execution

    One of the advantages of a direct mapped cache is that it allows simple and fast speculation. Once

    the address has been computed, the one cache index which might have a copy of that datum is

    known. That cache entry can be read, and the processor can continue to work with that data before

    it finishes checking that the tag actually matches the requested address.

    The idea of having the processor use the cached data before the tag match completes can be

    applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just


    one of the possible cache entries mapping to the requested address. This datum can then be used

    in parallel with checking the full tag. The hint technique works best when used in the context of

    address translation, as explained below.

    2-way skewed associative cache

    Other schemes have been suggested, such as the skewed cache,[2] where the index for way 0 is
    direct, as above, but the index for way 1 is formed with a hash function. A good hash function has

    the property that addresses which conflict with the direct mapping tend not to conflict when mapped

    with the hash function, and so it is less likely that a program will suffer from an unexpectedly large

    number of conflict misses due to a pathological access pattern. The downside is extra latency from

    computing the hash function.[3] Additionally, when it comes time to load a new line and evict an old

    line, it may be difficult to determine which existing line was least recently used, because the new

    line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is

    usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages

    over conventional set-associative ones.[4]
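
    As a rough illustration, way 0 below is indexed directly by the low bits while way 1 is indexed by a
    simple XOR-fold of the block address; this stand-in hash is only meant to show how addresses
    that conflict in way 0 can land in different rows of way 1:

        NUM_SETS = 128

        def index_way0(block_address):
            return block_address % NUM_SETS                        # direct index

        def index_way1(block_address):
            # Stand-in hash: XOR-fold the high bits into the low bits so that
            # addresses colliding in way 0 tend to land in different rows of way 1.
            return (block_address ^ (block_address // NUM_SETS)) % NUM_SETS

        a, b = 0x1040, 0x2040                    # both map to the same way-0 row
        print(index_way0(a) == index_way0(b))    # True: they conflict in way 0
        print(index_way1(a) == index_way1(b))    # False here: the hash separates them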

    Pseudo-associative cache

    A true set-associative cache tests all the possible ways simultaneously, using something like

    a content addressable memory. A pseudo-associative cache tests each possible way one at a time.

    A hash-rehash cache is one kind of pseudo-associative cache.

    In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as

    a direct-mapped cache. But it has a much lower conflict miss rate than a direct-mapped cache,

    closer to the miss rate of a fully associative cache.[3]

    Cache misses

    A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in

    a main memory access with much longer latency. There are three kinds of cache misses:

    instruction read miss, data read miss, and data write miss.

    A cache read miss from an instruction cache generally causes the most delay, because the

    processor, or at least the thread of execution, has to wait (stall) until the instruction is fetched from

    main memory.

    A cache read miss from a data cache usually causes less delay, because instructions not

    dependent on the cache read can be issued and continue execution until the data is returned from

    main memory, and the dependent instructions can resume execution.


    A cache write miss to a data cache generally causes the least delay, because the write can be

    queued and there are few limitations on the execution of subsequent instructions. The processor

    can continue until the queue is full.

    In order to lower cache miss rate, a great deal of analysis has been done on cache behavior in an

    attempt to find the best combination of size, associativity, block size, and so on. Sequences of

    memory references performed by benchmark programs are saved as address traces. Subsequent
    analyses simulate many different possible cache designs on these long address traces. Making
    sense of how the many variables affect the cache hit rate can be quite confusing. One significant
    contribution to this analysis was made by Mark Hill, who separated misses into three categories

    (known as the Three Cs):

    Compulsory misses are those misses caused by the first reference to a datum. Cache size

    and associativity make no difference to the number of compulsory misses. Prefetching can help
    here, as can larger cache block sizes (which are a form of prefetching). Compulsory misses are

    sometimes referred to as cold misses.

    Capacity misses are those misses that occur regardless of associativity or block size,

    solely due to the finite size of the cache. The curve of capacity miss rate versus cache size

    gives some measure of the temporal locality of a particular reference stream. Note that there is

    no useful notion of a cache being "full" or "empty" or "near capacity": CPU caches almost

    always have nearly every line filled with a copy of some line in main memory, and nearly every

    allocation of a new line requires the eviction of an old line.

    Conflict misses are those misses that could have been avoided, had the cache not evicted

    an entry earlier. Conflict misses can be further broken down into mapping misses, which are
    unavoidable given a particular amount of associativity, and replacement misses, which are due

    to the particular victim choice of the replacement policy.


    Miss rate versus cache size on the Integer portion of SPEC CPU2000

    The graph to the right summarizes the cache performance seen on the Integer portion of the SPEC

    CPU2000 benchmarks, as collected by Hill and Cantin.[5] These benchmarks are intended to

    represent the kind of workload that an engineering workstation computer might see on any given

    day. The reader should keep in mind that finding benchmarks which are even usefully

    representative of many programs has been very difficult, and there will always be important

    programs with very different behavior than what is shown here.

    We can see the different effects of the three Cs in this graph.

    At the far right, with cache size labelled "Inf", we have the compulsory misses. If we wish to improve

    a machine's performance on SpecInt2000, increasing the cache size beyond 1 MB is essentially

    futile. That's the insight given by the compulsory misses.

    The fully associative cache miss rate here is almost representative of the capacity miss rate. The

    difference is that the data presented is from simulations assuming an LRU replacement policy.

    Showing the capacity miss rate would require a perfect replacement policy, i.e. an oracle that looks

    into the future to find a cache entry which is actually not going to be hit.

    Note that our approximation of the capacity miss rate falls steeply between 32 KB and 64 KB. This
    indicates that the benchmark has a working set of roughly 64 KB. A CPU cache designer examining

    this benchmark will have a strong incentive to set the cache size to 64 KB rather than 32 KB. Note

    that, on this benchmark, no amount of associativity can make a 32 KB cache perform as well as a

    64 KB 4-way, or even a direct-mapped 128 KB cache.


    Finally, note that between 64 KB and 1 MB there is a large difference between direct-mapped and

    fully associative caches. This difference is the conflict miss rate. The insight from looking at conflict

    miss rates is that secondary caches benefit a great deal from high associativity.

    This benefit was well known in the late 80s and early 90s, when CPU designers could not fit large

    caches on-chip, and could not get sufficient bandwidth to either the cache data memory or cache

    tag memory to implement high associativity in off-chip caches. Desperate hacks were attempted:

    the MIPS R8000 used expensive off-chip dedicated tag SRAMs, which had embedded tag
    comparators and large drivers on the match lines, in order to implement a 4 MB 4-way associative
    cache. The MIPS R10000 used ordinary SRAM chips for the tags. Tag access for both ways took

    two cycles. To reduce latency, the R10000 would guess which way of the cache would hit on each

    access.

    Address translation

    Main article: Translation lookaside buffer

    Most general purpose CPUs implement some form of virtual memory. To summarize, each program
    running on the machine sees its own simplified address space, which contains code and data for

    that program only. Each program uses this virtual address space without regard for where it exists

    in physical memory.

    Virtual memory requires the processor to translate virtual addresses generated by the program into

    physical addresses in main memory. The portion of the processor that does this translation is

    known as the memory management unit (MMU). The fast path through the MMU can perform those
    translations stored in the translation lookaside buffer (TLB), which is a cache of mappings from the

    operating system's page table.

    For the purposes of the present discussion, there are three important features of address

    translation:

    Latency: The physical address is available from the MMU some time, perhaps a few

    cycles, after the virtual address is available from the address generator.

    Aliasing: Multiple virtual addresses can map to a single physical address. Most processors

    guarantee that all updates to that single physical address will happen in program order. To

    deliver on that guarantee, the processor must ensure that only one copy of a physical address

    resides in the cache at any given time.


    Granularity: The virtual address space is broken up into pages. For instance, a 4 GB

    virtual address space might be cut up into 1048576 pages of 4 KB size, each of which can be

    independently mapped. There may be multiple page sizes supported; see virtual memory for

    elaboration.

    A historical note: some early virtual memory systems were very slow, because they required an

    access to the page table (held in main memory) before every programmed access to main memory.

    [NB 1] With no caches, this effectively cut the speed of the machine in half. The first hardware cache

    used in a computer system was not actually a data or instruction cache, but rather a TLB.

    Caches can be divided into 4 types, based on whether the index or tag correspond to physical or

    virtual addresses:

    Physically indexed, physically tagged (PIPT) caches use the physical address for both

    the index and the tag. While this is simple and avoids problems with aliasing, it is also slow, as

    the physical address must be looked up (which could involve a TLB miss and access to main

    memory) before that address can be looked up in the cache.

    Virtually indexed, virtually tagged (VIVT) caches use the virtual address for both the

    index and the tag. This caching scheme can result in much faster lookups, since the MMU

    doesn't need to be consulted first to determine the physical address for a given virtual address.

    However, VIVT suffers from aliasing problems, where several different virtual addresses may

    refer to the same physical address. The result is that such addresses would be cached

    separately despite referring to the same memory, causing coherency problems. Another

    problem is homonyms, where the same virtual address maps to several different physical

    addresses. It is not possible to distinguish these mappings by only looking at the virtual index,

    though potential solutions include: flushing the cache after acontext switch, forcing address

    spaces to be non-overlapping, tagging the virtual address with an address space ID (ASID), or

    using physical tags. Additionally, there is a problem that virtual-to-physical mappings can

    change, which would require flushing cache lines, as the VAs would no longer be valid.

    Virtually indexed, physically tagged (VIPT) caches use the virtual address for the index

    and the physical address in the tag. The advantage over PIPT is lower latency, as the cache

    line can be looked up in parallel with the TLB translation, however the tag can't be compared

    until the physical address is available. The advantage over VIVT is that since the tag has the

    physical address, the cache can detect homonyms. VIPT requires more tag bits, as the index

    bits no longer represent the same address.


    Physically indexed, virtually tagged caches are only theoretical as they would basically

    be useless.[8]

    The speed of this recurrence (the load latency) is crucial to CPU performance, and so most

    modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to

    proceed in parallel with fetching the data from the cache RAM.

    But virtual indexing is not the best choice for all cache levels. The cost of dealing with virtual aliases

    grows with cache size, and as a result most level-2 and larger caches are physically indexed.

    Caches have historically used both virtual and physical addresses for the cache tags, although

    virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then

    the physical address is available in time for tag compare, and there is no need for virtual tagging.

    Large caches, then, tend to be physically tagged, and only small, very low latency caches are

    virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by vhints,

    as described below.

    Virtual indexing and virtual aliases

    The usual way the processor guarantees that virtually aliased addresses act as a single storage

    location is to arrange that only one virtual alias can be in the cache at any given time.

    Whenever a new entry is added to a virtually indexed cache, the processor searches for any virtual

    aliases already resident and evicts them first. This special handling happens only during a cache

    miss. No special work is necessary during a cache hit, which helps keep the fast path fast.

    The most straightforward way to find aliases is to arrange for them all to map to the same location

    in the cache. This happens, for instance, if the TLB uses 4 KB pages and the cache is

    direct-mapped and 4 KB or smaller.

    Modern level-1 caches are much larger than 4 KB, but virtual memory pages have stayed that size.

    If the cache is, say, 16 KB and virtually indexed, then for any virtual address there are four cache

    locations that could hold the same physical location, aliased to different virtual addresses. If the

    cache misses, all four locations must be probed to see if their corresponding physical addresses

    match the physical address of the access that generated the miss.

    These probes are the same checks that a set associative cache uses to select a particular match.

    So if a 16 KB virtually indexed cache is 4-way set associative and used with 4 KB virtual memory

    pages, no special work is necessary to evict virtual aliases during cache misses because the

    checks have already happened while checking for a cache hit.


    Using the AMD Athlon as an example again, it has a 64 KB level-1 data cache, 4 KB pages, and 2-

    way set associativity. When the level-1 data cache suffers a miss, 2 of the 16 (= 64 KB / 4 KB)

    possible virtual aliases have already been checked, and seven more cycles through the tag check

    hardware are necessary to complete the check for virtual aliases.
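
    The number of alias locations follows directly from the cache size and page size, and the
    associativity determines how many are covered per tag-check cycle. A small back-of-the-envelope
    sketch reproducing the two examples above (the function and variable names are just for
    illustration):

        def alias_probe_cycles(cache_size, page_size, ways):
            # Locations in a virtually indexed cache that could hold the same
            # physical line under different virtual addresses.
            aliases = cache_size // page_size
            # One set ('ways' locations) is checked per cycle of the tag-check
            # hardware; the first cycle already happened while checking for a hit.
            return aliases // ways - 1

        print(alias_probe_cycles(16 * 1024, 4 * 1024, 4))   # 0: no extra work needed
        print(alias_probe_cycles(64 * 1024, 4 * 1024, 2))   # 7: the Athlon case above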

    Homonym and synonym problems

    A cache that relies on virtual indexing and tagging becomes inconsistent when the same

    virtual address is mapped to different physical addresses (homonyms). This can be solved by

    using the physical address for tagging, or by storing the address space identifier in the cache

    line. However, the latter approach does not help against the synonym problem, in which several

    cache lines end up storing data for the same physical address. Writing to such a location may

    update only one copy in the cache, leaving the others with inconsistent data. The problem can be

    solved by using non-overlapping memory layouts for different address spaces; otherwise, the

    cache (or part of it) must be flushed when the mapping changes.[9]
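
    As a toy illustration of the address-space-identifier approach (a simplified model, not any
    particular processor's design), a virtually tagged entry can carry the ASID alongside the virtual
    tag, so two processes using the same virtual address never hit each other's lines; as noted
    above, this does nothing for synonyms, since two different virtual addresses mapping to one
    physical address still occupy separate entries.

        # Toy model: a virtually tagged cache whose entries also carry an
        # address space ID (ASID), so identical virtual addresses from
        # different address spaces (homonyms) never match each other.
        class VirtuallyTaggedCache:
            def __init__(self):
                self.entries = {}                  # (asid, virtual_tag) -> data

            def lookup(self, asid, vtag):
                return self.entries.get((asid, vtag))   # miss -> None

            def fill(self, asid, vtag, data):
                self.entries[(asid, vtag)] = data

        c = VirtuallyTaggedCache()
        c.fill(asid=1, vtag=0x1000, data="process A's line")
        # Process B uses the same virtual address but a different ASID:
        assert c.lookup(asid=2, vtag=0x1000) is None     # no false hit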

    Virtual tags and vhints

    Virtual tagging is possible too. The great advantage of virtual tags is that, for associative caches,

    they allow the tag match to proceed before the virtual to physical translation is done. However,

    coherence probes and evictions present a physical address for action. The hardware must

    have some means of converting the physical addresses into a cache index, generally by storing

    physical tags as well as virtual tags. For comparison, a physically tagged cache does not need

    to keep virtual tags, which is simpler.

    When a virtual to physical mapping is deleted from the TLB, cache entries with those virtual

    addresses will have to be flushed somehow. Alternatively, if cache entries are allowed on

    pages not mapped by the TLB, then those entries will have to be flushed when the access

    rights on those pages are changed in the page table.

    It is also possible for the operating system to ensure that no virtual aliases are simultaneously

    resident in the cache. The operating system makes this guarantee by enforcing page coloring,

    which is described below. Some early RISC processors (SPARC, RS/6000) took this approach. It

    has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen

    and the software complexity and performance penalty of perfect page coloring has risen.

    It can be useful to distinguish the two functions of tags in an associative cache: they are used to

    determine which way of the entry set to select, and they are used to determine whether the cache hit or missed.


    A programmer attempting to make maximum use of the cache may arrange his program's access

    patterns so that only 1 MB of data need be cached at any given time, thus avoiding capacity

    misses. But he should also ensure that the access patterns do not have conflict misses. One way to

    think about this problem is to divide up the virtual pages the program uses and assign them virtual

    colors in the same way as physical colors were assigned to physical pages before. The

    programmer can then arrange the access patterns of his code so that no two pages with the same

    virtual color are in use at the same time. There is a wide literature on such optimizations (e.g. loop

    nest optimization), largely coming from the High Performance Computing (HPC) community.

    The snag is that while all the pages in use at any given moment may have different virtual colors,

    some may have the same physical colors. In fact, if the operating system assigns physical pages to

    virtual pages randomly and uniformly, it is extremely likely that some pages will have the same

    physical color, and then locations from those pages will collide in the cache (this is the birthday

    paradox).
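
    For a rough sense of how likely such collisions are, the usual birthday-paradox approximation
    applies: with C physical colors and n randomly assigned physical pages, the probability that at
    least two pages share a color is about 1 - exp(-n(n-1)/2C). A quick check with made-up numbers:

        import math

        def collision_probability(pages, colors):
            # Birthday-paradox approximation: probability that at least two
            # randomly assigned physical pages share a physical color.
            return 1.0 - math.exp(-pages * (pages - 1) / (2.0 * colors))

        # e.g. a 1 MB direct-mapped cache with 4 KB pages has 256 colors; a
        # working set of only 30 pages already collides more often than not.
        print(collision_probability(30, 256))    # ~0.82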

    The solution is to have the operating system attempt to assign different physical color pages to

    different virtual colors, a technique called page coloring. Although the actual mapping from virtual to

    physical color is irrelevant to system performance, odd mappings are difficult to keep track of and

    have little benefit, so most approaches to page coloring simply try to keep physical and virtual page

    colors the same.

    If the operating system can guarantee that each physical page maps to only one virtual color, then

    there are no virtual aliases, and the processor can use virtually indexed caches with no need for

    extra virtual alias probes during miss handling. Alternatively, the OS can flush a page from the

    cache whenever it changes from one virtual color to another. As mentioned above, this approach

    was used for some early SPARC and RS/6000 designs.
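
    A minimal sketch of what page coloring asks of the physical page allocator, assuming (as most
    approaches do) that the physical color is simply kept equal to the virtual color; the structure and
    names here are illustrative rather than taken from any real operating system.

        # Minimal page-coloring allocator sketch: free physical pages are kept
        # in per-color free lists, and a physical page is handed out only if
        # its color matches the virtual color of the mapping.
        from collections import defaultdict

        NUM_COLORS = 16          # cache_size / (ways * page_size), illustrative

        def color_of(page_number):
            return page_number % NUM_COLORS

        free_lists = defaultdict(list)
        for pfn in range(1024):                   # pretend physical page frames
            free_lists[color_of(pfn)].append(pfn)

        def allocate(virtual_page_number):
            want = color_of(virtual_page_number)  # keep virtual == physical color
            if free_lists[want]:
                return free_lists[want].pop()
            raise MemoryError("no free page of the required color")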

    Cache hierarchy in a modern processor

    Modern processors have multiple interacting caches on chip.

    Specialized caches

    Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-

    physical address translation, and data fetch (see classic RISC pipeline). The natural design is to

    use different physical caches for each of these points, so that no one physical resource has to be

    scheduled to service two points in the pipeline. Thus the pipeline naturally ends up with at least

    three separate caches (instruction, TLB, and data), each specialized to its particular role.

    Pipelines with separate instruction and data caches, now predominant, are said to have a Harvard

    architecture. Originally, this phrase referred to machines with separate instruction and data


    memories, which proved not at all popular. Most modern CPUs have a single-memory von

    Neumann architecture.

    Victim cache

    A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. The

    victim cache lies between the main cache and its refill path, and only holds blocks that were evicted

    from the main cache. The victim cache is usually fully associative, and is intended to reduce the

    number of conflict misses. Many commonly used programs do not require an associative mapping

    for all the accesses. In fact, only a small fraction of the memory accesses of the program require

    high associativity. The victim cache exploits this property by providing high associativity to only

    these accesses. It was introduced by Norman Jouppi in 1990.
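
    A sketch of where a victim cache sits in the miss path (a simplified model, not a description of
    any specific implementation): on a main-cache miss, the small fully associative victim buffer is
    checked, and lines evicted from the main cache are pushed into it. The main-cache object and
    the next-level fetch callable are assumed placeholders with the interfaces noted in the comments.

        from collections import OrderedDict

        class VictimCache:
            """Tiny fully associative buffer holding recently evicted lines."""
            def __init__(self, entries=8):
                self.entries = entries
                self.lines = OrderedDict()           # address -> data

            def insert(self, addr, data):
                self.lines[addr] = data
                if len(self.lines) > self.entries:
                    self.lines.popitem(last=False)   # drop the oldest victim

            def lookup(self, addr):
                return self.lines.pop(addr, None)    # a hit moves the line back out

        def read(main_cache, victim, next_level, addr):
            # Assumed interfaces: main_cache.lookup(addr) -> data or None,
            # main_cache.fill(addr, data) -> evicted (addr, data) or None,
            # next_level(addr) -> data.
            data = main_cache.lookup(addr)
            if data is not None:
                return data                          # ordinary hit
            data = victim.lookup(addr)               # conflict miss rescued?
            if data is None:
                data = next_level(addr)              # go on to L2 / memory
            evicted = main_cache.fill(addr, data)
            if evicted is not None:
                victim.insert(*evicted)              # keep the victim around
            return data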

    Trace cache

    One of the more extreme examples of cache specialization is the trace cache found in the Intel

    Pentium 4 microprocessors. A trace cache is a mechanism for increasing the instruction fetch

    bandwidth and decreasing power consumption (in the case of the Pentium 4) by storing traces

    of instructions that have already been fetched and decoded.

    The earliest widely acknowledged academic publication of trace cache was by Eric

    Rotenberg, Steve Bennett, and Jim Smith in their 1996 paper "Trace Cache: a Low Latency

    Approach to High Bandwidth Instruction Fetching."

    An earlier publication is US Patent 5,381,533, "Dynamic flow instruction cache memory organized

    around trace segments independent of virtual address line", by Alex Peleg and Uri Weiser of Intel

    Corp., patent filed March 30, 1994, a continuation of an application filed in 1992, later abandoned.

    A trace cache stores instructions either after they have been decoded, or as they are retired.

    Generally, instructions are added to trace caches in groups representing either individual basic

    blocks or dynamic instruction traces. A dynamic trace ("trace path") contains only instructions

    whose results are actually used, and eliminates instructions following taken branches (since they

    are not executed); a dynamic trace can be a concatenation of multiple basic blocks. This allows the

    instruction fetch unit of a processor to fetch several basic blocks, without having to worry about

    branches in the execution flow.

    Trace lines are stored in the trace cache based on the program counter of the first instruction in the

    trace and a set of branch predictions. This allows for storing different trace paths that start on the

    same address, each representing different branch outcomes. In the instruction fetch stage of

    a pipeline, the current program counter along with a set of branch predictions is checked in the

    trace cache for a hit. If there is a hit, a trace line is supplied to fetch which does not have to go to a


    regular cache or to memory for these instructions. The trace cache continues to feed the fetch unit

    until the trace line ends or until there is a misprediction in the pipeline. If there is a miss, a new

    trace starts to be built.
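
    A schematic way to picture the lookup just described (purely illustrative; real trace caches are
    considerably more involved): traces are keyed by the starting program counter together with the
    predicted outcomes of the branches inside the trace, so several traces can begin at the same
    address. The trace-building helper is an assumed placeholder.

        trace_cache = {}     # (start_pc, predictions) -> list of decoded ops

        def fetch(pc, predictions, build_trace):
            # build_trace is an assumed helper that decodes instructions from
            # pc onward, following the predicted branch directions.
            key = (pc, tuple(predictions))
            trace = trace_cache.get(key)
            if trace is None:                      # trace-cache miss
                trace = build_trace(pc, predictions)
                trace_cache[key] = trace           # start building a new trace
            return trace                           # hit: feed the fetch unit directly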

    Trace caches are also used in processors like the Intel Pentium 4 to store already decoded micro-

    operations, or translations of complex x86 instructions, so that the next time an instruction is

    needed, it does not have to be decoded again.

    See the full text of Smith, Rotenberg and Bennett's paper at Citeseer.

    Multi-level caches

    Another issue is the fundamental tradeoff between cache latency and hit rate. Larger caches have

    better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of

    cache, with small fast caches backed up by larger slower caches.

    Multi-level caches generally operate by checking the smallest Level 1 (L1) cache first; if it hits, the

    processor proceeds at high speed. If the smaller cache misses, the next larger cache (L2) is

    checked, and so on, before external memory is checked.
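
    In sketch form, the lookup order is simply the following (a hedged sketch; the cache and memory
    objects and their methods are placeholders, not a real implementation):

        def load(addr, l1, l2, memory):
            data = l1.lookup(addr)
            if data is not None:
                return data                 # L1 hit: fastest path
            data = l2.lookup(addr)
            if data is None:
                data = memory.read(addr)    # last resort: external memory
                l2.fill(addr, data)
            l1.fill(addr, data)             # refill the smaller cache on the way back
            return data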

    As the latency difference between main memory and the fastest cache has become larger, some

    processors have begun to utilize as many as three levels of on-chip cache. For example, the Alpha

    21164 (1995) had 1 to 64 MB of off-chip L3 cache; the IBM POWER4 (2001) had a 256 MB[citation needed]

    L3 cache off-chip, shared among several processors; the Itanium 2 (2003) had a 6 MB unified level

    3 (L3) cache on-die; the Itanium 2 (2003) MX 2 module incorporated two Itanium 2 processors along

    with a shared 64 MB L4 cache on an MCM that was pin-compatible with a Madison processor;

    Intel's Xeon MP product code-named "Tulsa" (2006) features 16 MB of on-die L3 cache shared

    between two processor cores; the AMD Phenom II (2008) has up to 6 MB of on-die unified L3 cache;

    and the Intel Core i7 (2008) has an 8 MB on-die unified L3 cache that is inclusive, shared by all

    cores. The benefits of an L3 cache depend on the application's access patterns.

    Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the

    smallest, fastest cache in the system, with the special characteristic that it is scheduled in software

    typically by a compiler, as it allocates registers to hold values retrieved from main memory. (See

    especially loop nest optimization.) Register files sometimes also have hierarchy: the Cray-1 (circa

    1976) had 8 address "A" and 8 scalar data "S" registers that were generally usable. There was also

    a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster

    than main memory. The "B" and "T" registers were provided because the Cray-1 did not have a

    data cache. (The Cray-1 did, however, have an instruction cache.)


    Exclusive versus inclusive

    Multi-level caches introduce new design decisions. For instance, in some processors, all data in the

    L1 cache must also be somewhere in the L2 cache. These caches are called strictly inclusive.

    Other processors (like the AMD Athlon) have exclusive caches: data is guaranteed to be in at

    most one of the L1 and L2 caches, never in both. Still other processors (like the Intel Pentium II, III,

    and 4) do not require that data in the L1 cache also reside in the L2 cache, although it may often

    do so. There is no universally accepted name for this intermediate policy, although the term mainly

    inclusive has been used.[citation needed]

    The advantage of exclusive caches is that they store more data. This advantage is larger when the

    exclusive L1 cache is comparable to the L2 cache, and diminishes if the L2 cache is many times

    larger than the L1 cache. When the L1 misses and the L2 hits on an access, the hitting cache line

    in the L2 is exchanged with a line in the L1. This exchange is quite a bit more work than just

    copying a line from L2 to L1, which is what an inclusive cache does.
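
    The difference in work between the two policies on an L1 miss that hits in the L2 can be sketched
    as follows (the cache objects and their methods are placeholders, not a real implementation):

        def l2_hit_inclusive(addr, l1, l2):
            data = l2.lookup(addr)
            l1.fill(addr, data)              # copy; the line stays in L2 as well

        def l2_hit_exclusive(addr, l1, l2):
            data = l2.remove(addr)           # the line leaves L2...
            victim = l1.fill(addr, data)     # ...and enters L1
            if victim is not None:
                l2.fill(*victim)             # the displaced L1 line moves to L2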

    One advantage of strictly inclusive caches is that when external devices or other processors in a

    multiprocessor system wish to remove a cache line from the processor, they need only have the

    processor check the L2 cache. In cache hierarchies which do not enforce inclusion, the L1 cache

    must be checked as well. As a drawback, there is a correlation between the associativities of L1

    and L2 caches: if the L2 cache does not have at least as many ways as all L1 caches together, the

    effective associativity of the L1 caches is restricted.

    Another advantage of inclusive caches is that the larger cache can use larger cache lines, which

    reduces the size of the secondary cache tags. (Exclusive caches require both caches to have the

    same size cache lines, so that cache lines can be swapped on an L1 miss, L2 hit.) If the secondary

    cache is an order of magnitude larger than the primary, and the cache data is an order of

    magnitude larger than the cache tags, this tag area saved can be comparable to the incremental

    area needed to store the L1 cache data in the L2.
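
    To see why the saved tag area can be comparable to the duplicated data, plug in representative
    numbers (these figures are purely illustrative, not measurements of any real design):

        L1_DATA      = 64 * 1024          # primary cache data
        L2_DATA      = 10 * L1_DATA       # secondary roughly 10x larger
        TAG_FRACTION = 0.10               # tags roughly 10x smaller than data
        LINE_RATIO   = 2                  # inclusive L2 uses lines twice as long

        # Inclusive: L2 must also hold a copy of everything in L1.
        extra_data_for_inclusion = L1_DATA                               # 64 KB

        # But longer L2 lines mean proportionally fewer tags to store.
        tag_area_saved = L2_DATA * TAG_FRACTION * (1 - 1 / LINE_RATIO)   # 32 KB

        # The two quantities come out to the same order of magnitude, which is
        # the comparison the text above is making.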

    Example: the K8

    To illustrate both specialization and multi-level caching, here is the cache hierarchy of the K8 core

    in the AMD Athlon 64 CPU.[10]


    Example of hierarchy, the K8

    The K8 has 4 specialized caches: an instruction cache, an instruction TLB, a data TLB, and a data

    cache. Each of these caches is specialized:

    The instruction cache keeps copies of 64-byte lines of memory, and fetches 16 bytes each

    cycle. Each byte in this cache is stored in ten bits rather than 8, with the extra bits marking the

    boundaries of instructions (this is an example of predecoding). The cache has

    only parity protection rather than ECC, because parity is smaller and any damaged data can be

    replaced by fresh data fetched from memory (which always has an up-to-date copy of

    instructions).

    The instruction TLB keeps copies of page table entries (PTEs). Each cycle's instruction

    fetch has its virtual address translated through this TLB into a physical address. Each entry is

    either 4 or 8 bytes in memory. Because the K8 has a variable page size, each of the TLBs is

    split into two sections, one to keep PTEs that map 4 KB pages, and one to keep PTEs that map

    4 MB or 2 MB pages. The split allows the fully associative match circuitry in each section to be

    simpler. The operating system maps different sections of the virtual address space with

    different size PTEs.


    The data TLB has two copies which keep identical entries. The two copies allow two data

    accesses per cycle to translate virtual addresses to physical addresses. Like the instruction

    TLB, this TLB is split into two kinds of entries.

    The data cache keeps copies of 64-byte lines of memory. It is split into 8 banks (each

    storing 8 KB of data), and can fetch two 8-byte values each cycle so long as those values are in

    different banks. There are two copies of the tags, because each 64-byte line is spread among

    all 8 banks. Each tag copy handles one of the two accesses per cycle.
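
    A small sketch of the bank-conflict condition described in the data-cache item above (the bit
    positions are a plausible reading of the description, not a verified K8 detail): with 8 banks of
    8-byte width, the bank is selected by address bits 3 through 5, and two accesses in the same
    cycle can both be serviced only if those bits differ.

        NUM_BANKS  = 8
        BANK_WIDTH = 8      # bytes; a 64-byte line is spread across all 8 banks

        def bank_of(addr):
            # With 8-byte-wide banks, address bits 3..5 pick the bank
            # (an assumed encoding, for illustration only).
            return (addr >> 3) % NUM_BANKS

        def can_dual_issue(addr_a, addr_b):
            # Two loads can complete in the same cycle only if they fall into
            # different banks.
            return bank_of(addr_a) != bank_of(addr_b)

        print(can_dual_issue(0x1000, 0x1008))   # True: banks 0 and 1
        print(can_dual_issue(0x1000, 0x2000))   # False: both map to bank 0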

    The K8 also has multiple-level caches. There are second-level instruction and data TLBs, which

    store only PTEs mapping 4 KB. Both instruction and data caches, and the various TLBs, can fill

    from the large unified L2 cache. This cache is exclusive to both the L1 instruction and data caches,

    which means that any 8-byte line can only be in one of the L1 instruction cache, the L1 data cache,

    or the L2 cache. It is, however, possible for a line in the data cache to have a PTE which is also in

    one of the TLBs; the operating system is responsible for keeping the TLBs coherent by flushing

    portions of them when the page tables in memory are updated.

    The K8 also caches information that is never stored in memory: prediction information. These

    caches are not shown in the above diagram. As is usual for this class of CPU, the K8 has fairly

    complex branch prediction, with tables that help predict whether branches are taken and other

    tables which predict the targets of branches and jumps. Some of this information is associated with

    instructions, in both the level 1 instruction cache and the unified secondary cache.

    The K8 uses an interesting trick to store prediction information with instructions in the secondary

    cache. Lines in the secondary cache are protected from accidental data corruption (e.g. by

    an alpha particle strike) by either ECC or parity, depending on whether those lines were evicted

    from the data or instruction primary caches. Since the parity code takes fewer bits than the ECC

    code, lines from the instruction cache have a few spare bits. These bits are used to cache branch

    prediction information associated with those instructions. The net result is that the branch predictor

    has a larger effective history table, and so has better accuracy.

    More hierarchies

    Other processors have other kinds of predictors (e.g. the store-to-load bypass predictor in

    the DEC Alpha 21264), and various specialized predictors are likely to flourish in future processors.

    These predictors are caches in that they store information that is costly to compute. Some of the

    terminology used when discussing predictors is the same as that for caches (one speaks of a hit in

    a branch predictor), but predictors are not generally thought of as part of the cache hierarchy.


    Read path for a 2-way associative cache

    The diagram to the right is intended to clarify the manner in which the various fields of the address

    are used. Address bit 31 is most significant, bit 0 is least significant. The diagram shows the

    SRAMs, indexing, and multiplexing for a 4 KB, 2-way