
A TRANSLATION LOOKASIDE BUFFER


A TLB is a table used in a virtual memory system that lists the physical page number associated with each virtual page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel (the translation is done "on the side"). If the requested address is not cached, the physical address is used to locate the data in main memory. The alternative would be to place the translation table between the cache and main memory so that it is activated only after a cache miss. The TLB is a CPU cache that the memory management hardware uses to improve virtual-address translation speed. Some key features of a TLB are:

It caches the most recent translations

Small, fully associative cache

Avoids retranslation

Figure shows how various parts of a multilevel memory management system typically realize the address-translation ideas just discussed. The input address Av is a virtual address consisting of a (virtual) base address Bv concatenated with a displacement D. Av contains an effective address computed in accordance with some program-defined addressing mode (direct, indirect, indexed, and so on) for the memory item being accessed. It also can contain system-specific control information, a segment address for example, as we will see later. The real address BR = f(Bv) assigned to Bv is stored in a memory map somewhere in the memory system; this map can be quite large. To speed up the mapping process, part (or occasionally all) of the memory map is placed in a small high-speed memory in the CPU called a translation look-aside buffer (TLB). The TLB's input is thus the base-address part Bv of Av; its output is the corresponding real base address BR. This address is then concatenated with the D part of Av to obtain the full physical address AR. If the virtual address Bv is not currently assigned to the TLB, then the part of the memory map that contains Bv is first transferred from the external memory into the TLB. Hence the TLB itself forms a cache-like level within a multilevel storage system for memory maps. For this reason, the TLB is sometimes referred to as an address cache.
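The sketch below is a minimal software model of the lookup just described, assuming an 8-bit displacement D and a dictionary standing in for the memory map; the field widths, map contents, and TLB capacity are invented for illustration only.

```python
# Minimal TLB model: a small dictionary of the most recent Bv -> BR translations,
# backed by a (much larger) memory map. Sizes and field widths are illustrative.

PAGE_BITS = 8                                        # displacement D occupies the low 8 bits (assumed)
MEMORY_MAP = {0x12: 0x7A, 0x13: 0x7B, 0x40: 0x05}    # full map: virtual base Bv -> real base BR
TLB_CAPACITY = 2

tlb = {}                                             # holds only the most recent translations

def translate(virtual_address):
    """Translate a virtual address Av = (Bv, D) into a real address AR = (BR, D)."""
    bv = virtual_address >> PAGE_BITS                # virtual base address Bv
    d = virtual_address & ((1 << PAGE_BITS) - 1)     # displacement D

    if bv in tlb:                                    # TLB hit: no memory-map access needed
        br = tlb[bv]
    else:                                            # TLB miss: fetch the entry from the memory map
        br = MEMORY_MAP[bv]                          # (a real system would treat a missing entry as a fault)
        if len(tlb) >= TLB_CAPACITY:                 # crude replacement: evict an arbitrary entry
            tlb.pop(next(iter(tlb)))
        tlb[bv] = br

    return (br << PAGE_BITS) | d                     # concatenate BR with D to form AR

print(hex(translate(0x1234)))                        # miss, then the translation is loaded into the TLB
print(hex(translate(0x1250)))                        # hit: same base 0x12
```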

Segmentation

The virtual address space is divided into logical, variable-length units, or segments. Physical memory isn't really divided or partitioned into anything. When a segment needs to be copied into physical memory, the operating system looks for a chunk of free memory large enough to store the entire segment. Each segment has a base address, indicating where it is located in memory, and a bounds limit, indicating its size. Each program, consisting of multiple segments, now has an associated segment table instead of a page table. This segment table is simply a collection of the base/bounds pairs for each segment.

Formally, a segment is a set of logically related, contiguous words. A word in a segment is referred to by specifying a base address (the segment address) and a displacement within the segment. A program and its data can be viewed as a collection of linked segments. The links arise from the fact that a program segment uses, or calls, other segments. Some computers have a memory management technique that allocates main memory M1 by segments alone. When a segment not currently resident in M1 is required, the entire segment is transferred from secondary memory M2. The physical addresses assigned to the segments are kept in a memory map called a segment table (which can itself be a relocatable segment).
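As a rough illustration of how such a segment table might be consulted, the sketch below keeps one (base, bounds) pair per segment and checks the displacement against the bounds before forming the physical address; the table contents and segment numbers are made up for the example.

```python
# Hypothetical segment table: segment number -> (base address, bounds/limit in words).
segment_table = {
    0: (1000, 300),   # code segment
    1: (5000, 120),   # data segment
    2: (8000, 64),    # stack segment
}

def segment_translate(segment, displacement):
    """Return the physical address for (segment, displacement), checking the bounds."""
    base, bounds = segment_table[segment]
    if displacement >= bounds:
        raise MemoryError("segment bounds violation")   # access falls outside the segment
    return base + displacement

print(segment_translate(1, 100))   # 5100
```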

The main advantages of segmentation are:

• Segment boundaries correspond to natural program and data boundaries. Consequently, information that is shared among different users is often organized into segments.

• Because of their logical independence, a program segment can be changed or recompiled at any time without affecting other segments.

• Certain properties of programs, such as the scope (range of definition) of a variable and access rights, are naturally specified by segment. These properties require that accesses to segments be checked to protect against unauthorized use; this protection is most easily implemented when the units of allocation are segments.

• Certain segment types, stacks and queues for instance, vary in length during program execution. Segmentation varies the region assigned to such a segment as it expands and contracts, thus efficiently using the available memory space.

The main disadvantage of segmentation

Because segments can be of different lengths, a relatively complex allocation method is required to avoid excessive fragmentation of main-memory space. This problem is alleviated by combining segmentation with paging, as discussed later.

Paging

The basic idea behind paging is quite simple: Allocate physical memory to processes in fixed size chunks (page frames) and keep track of where the various pages of the process reside by recording information in a page table. Every process has its own page table that typically resides in main memory, and the page table stores the physical location of each virtual page of the process. The page table has N rows, where N is the number of virtual pages in the process. If there are pages of the process currently not in main memory, the page table indicates this by setting a valid bit to 0; if the page is in main memory, the valid bit is set to 1. Therefore, each entry of the page table has two fields: a valid bit and a frame number.

Process memory is divided into these fixed-size pages, resulting in potential internal fragmentation when the last page is copied into memory. The process may not actually need the entire page frame, but no other process may use it. Therefore, the unused memory in this last frame is effectively wasted.

Now that you understand what paging is, we will discuss how it works. When a process generates a virtual address, the operating system must dynamically translate this virtual address into the physical address in memory at which the data actually resides. (For purposes of simplicity, let's assume we have no cache memory for the moment.) For example, from a program viewpoint, we see the final byte of a 10-byte program as address 9, assuming 1-byte instructions and 1-byte addresses, and a starting address of 0. However, when actually loaded into memory, the logical address 9 (perhaps a reference to the label X in an assembly language program) may actually reside in physical memory location 1239, implying the program was loaded starting at physical address 1230. There must be an easy way to convert the logical, or virtual, address 9 to the physical address 1239.

To accomplish this address translation, a virtual address is divided into two fields: a page field and an offset field; the offset represents the position within that page where the requested data is located. This address translation process is similar to the process we used when we divided main memory addresses into fields for the cache mapping algorithms. And similar to cache blocks, page sizes are usually powers of 2; this simplifies the extraction of page numbers and offsets from virtual addresses.

To access data at a given virtual address, the system performs the following steps:

1. Extract the page number from the virtual address.
2. Extract the offset from the virtual address.
3. Translate the page number into a physical page frame number by accessing the page table:
   A. Look up the page number in the page table (using the virtual page number as an index).
   B. Check the valid bit for that page.
      1. If the valid bit = 0, the system generates a page fault and the operating system must intervene to:
         a. Locate the desired page on disk.
         b. Copy the desired page into the free page frame in main memory.
         c. Update the page table.
         d. Resume execution of the process causing the page fault, continuing to Step B2.
      2. If the valid bit = 1, the page is in memory:
         a. Replace the virtual page number with the actual frame number.
         b. Access the data at the offset in the physical page frame by adding the offset to the frame number for the given virtual page.

Please note that if a process has free frames in main memory when a page fault occurs, the newly retrieved page can be placed in any of those free frames. However, if the memory allocated to the process is full, a victim page must be selected. The replacement algorithms used to select a victim are quite similar to those used in caches: FIFO, random, and LRU are all potential replacement algorithms for selecting a victim page.
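The following is a minimal sketch of these steps for a process with 4 virtual pages and an assumed page size of 256 bytes; the page-table contents, frame numbers, and "free frame" list are invented purely for illustration, and the disk transfer itself is left as a comment.

```python
# Sketch of the paging translation steps above. Page size, frame numbers, and the
# page-table contents are assumptions made for this example only.

PAGE_SIZE = 256

# Each page table entry: (valid bit, frame number). Pages 1 and 3 start out on "disk".
page_table = [(1, 5), (0, None), (1, 2), (0, None)]
free_frames = [7, 9]                      # frames currently free in main memory

def access(virtual_address):
    page = virtual_address // PAGE_SIZE   # step 1: extract the page number
    offset = virtual_address % PAGE_SIZE  # step 2: extract the offset

    valid, frame = page_table[page]       # step 3A: look up the page table entry
    if valid == 0:                        # step 3B1: page fault
        frame = free_frames.pop(0)        # place the page in a free frame (no victim needed here)
        # ... copy the page from disk into 'frame' ...
        page_table[page] = (1, frame)     # update the page table, then resume

    return frame * PAGE_SIZE + offset     # step 3B2: frame number combined with the offset

print(access(0x010))    # page 0, already resident in frame 5
print(access(0x1F0))    # page 1, causes a page fault and is loaded into frame 7
```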

Paging Combined with Segmentation


Paging is not the same as segmentation. Paging is based on a purely physical value: The program and main memory are divided up into the same physical size chunks. Segmentation, on the other hand, allows for logical portions of the program to be divided into variable-sized partitions. With segmentation, the user is aware of the segment sizes and boundaries; with paging, the user is unaware of the partitioning. Paging is easier to manage: allocation, freeing, swapping, and relocating are easy when everything's the same size. However, pages are typically smaller than segments, which means more overhead (in terms of resources to both track and transfer pages). Paging eliminates external fragmentation, whereas segmentation eliminates internal fragmentation. Segmentation has the ability to support sharing and protection, both of which are very difficult to do with paging.

Paging and segmentation both have their advantages; however, a system does not have to use one or the other-these two approaches can be combined, in an effort to get the best of both worlds. In a combined approach, the virtual address space is divided into segments of variable length, and the segments are divided into fixed-size pages. Main memory is divided into the same size frames.

When segmentation is used with paging, a virtual address has three components: a segment index S, a page index P, and a displacement (offset) D. The memory map then consists of one or more segment tables and page tables. For fast address translation, two TLBs can be used as shown in Figure, one for segment tables and one for page tables. As discussed earlier, the TLBs serve as fast caches for the memory maps. Every virtual address Av generated by a program goes through a two-stage translation process. First, the segment index S is used to read the current segment table to obtain the base address PB of the required page table. This base address is combined with the page index P (which is just a displacement within the page table) to produce a page address, which is then used to access the page table.


The result is a real page address, that is, a page frame number, which can be combined with the displacement part D of Av to give the final (real) address AR. This system, as depicted in Figure, is very flexible.

All the various memory maps can be treated as paged segments and can be relocated anywhere in the physical memory space.

Combined segmentation and paging is very advantageous because it allows for segmentation from the user's point of view and paging from the system's point of view.
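A small sketch of the two-stage translation just described appears below, assuming a 1024-word page size; the segment-table and page-table contents are invented for the example.

```python
# Two-stage translation sketch for combined segmentation and paging.
# The segment table maps a segment index S to that segment's page table;
# the page table then yields a page frame number. All table contents are invented.

PAGE_SIZE = 1024

segment_table = {0: "pt_code", 1: "pt_data"}          # S -> page table (named here for clarity)
page_tables = {
    "pt_code": [12, 13, 14],                          # P -> page frame number
    "pt_data": [40, 41],
}

def translate(S, P, D):
    """Virtual address (S, P, D) -> real address AR."""
    page_table = page_tables[segment_table[S]]        # stage 1: segment table lookup
    frame = page_table[P]                             # stage 2: page table lookup
    return frame * PAGE_SIZE + D                      # frame number combined with D

print(translate(1, 0, 37))    # 40 * 1024 + 37 = 40997
```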

Page size

The page size Sp has a big impact on both storage utilization and the effective memory data-transfer rate. Consider first the influence of Sp on the space-utilization factor u defined earlier. If Sp is too large, excessive internal fragmentation results; if it is too small, the page tables become very large and tend to reduce space utilization. A good value of Sp should achieve a balance between these two extremes. Let Ss denote the average segment size in words. If Ss >> Sp, the last page assigned to a segment contains about Sp/2 words, so roughly Sp/2 words of that page are wasted. The size of the page table associated with each segment is approximately Ss/Sp words, assuming each entry in the table is a word. Hence the memory space overhead associated with each segment is

S = Sp/2 + Ss/Sp

The space utilization u is

u = Ss / (Ss + S)

The optimum page size Sp(opt) can be defined as the value of Sp that maximizes u or, equivalently, that minimizes S. Differentiating S with respect to Sp, we obtain

dS/dSp = 1/2 - Ss/Sp^2

S is a minimum when dS/dSp = 0, from which it follows that

Sp(opt) = sqrt(2*Ss)


The optimum space utilization is

u(opt) = Ss / (Ss + sqrt(2*Ss)) = 1 / (1 + sqrt(2/Ss))
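A quick numerical check of these formulas, for an assumed average segment size of 8192 words, can be done as follows; the segment size is chosen arbitrarily for the example.

```python
import math

def overhead(Sp, Ss):
    """Per-segment space overhead S = Sp/2 + Ss/Sp (wasted half page + page table)."""
    return Sp / 2 + Ss / Sp

Ss = 8192                          # assumed average segment size in words
Sp_opt = math.sqrt(2 * Ss)         # optimum page size from dS/dSp = 0
u_opt = Ss / (Ss + overhead(Sp_opt, Ss))

print(Sp_opt)                      # 128.0 words
print(round(u_opt, 4))             # about 0.9846, i.e. 1 / (1 + sqrt(2/Ss))
```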

MEMORY ALLOCATION

The various levels of a memory system are divided into sets of contiguous locations, variously called regions, segments, or pages, which store blocks of data. Blocks are swapped automatically among the levels in order to minimize the access time seen by the processor. Swapping generally occurs in response to processor requests (demand swapping). However, to avoid making a processor wait while a requested item is being moved to the fastest level of memory M1, some kind of anticipatory swapping must be implemented, which implies transferring blocks to M1 in anticipation that they will be required soon. Good short-range prediction of access-request patterns is possible because of locality of reference.

The placement of blocks of information in a memory system is called memory allocation and is the topic of this section. The method of selecting the part of M1 in which an incoming block K is to be placed is the replacement policy. Simple replacement policies assign K to M1 only when an unoccupied or inactive region of sufficient size is available. More aggressive policies preempt occupied blocks to make room for K. In general, successful memory allocation methods result in a high hit ratio and a low average access time. If the hit ratio is low, an excessive amount of swapping between memory levels occurs, a phenomenon known as thrashing. Good memory allocation also minimizes the amount of unused or underused space in M1.

The information needed for allocation within a two-level hierarchy (M1, M2), which unless otherwise stated we will assume to be the main/secondary-memory hierarchy, can be held in a memory map that contains the following information:

• Occupied space list for M1. Each entry of this list specifies a block name, the (base) address of the region it occupies, and, if variable, the block size. In systems using preemptive allocation, additional information is associated with each block to determine when and how it can be preempted.

• Available space list for M1. Each entry of this list specifies the address of an unoccupied region and, if necessary, its size.


• Directory for M2. This list specifies the unit(s) that contain the directories for all the blocks associated with the current programs. These directories, in turn, define the regions of the M2 space to which each block is assigned.

When a block is transferred from M2 to M1, the memory management system makes an appropriate entry in the occupied space list. When the block is no longer required in M1, it is deallocated and the region it occupies is transferred from the occupied space list to the available space list. A block is deallocated when a program using it terminates execution or when the block is replaced to make room for one with higher priority. Many preemptive and nonpreemptive algorithms have been developed for dynamic memory allocation.

Nonpreemptive allocation. Suppose a block Ki of ni words is to be transferred from M2 to M1. If none of the blocks already occupying M1 can be preempted (overwritten or moved) by Ki, then it is necessary to find or create an "available" region of ni or more words to accommodate Ki; this process is termed nonpreemptive allocation. The problem is more easily solved in a paging system, where all blocks (pages) have size Sp words and M1 is divided into fixed Sp-word regions (page frames). The memory map (page table) is searched for an available page frame; if one is found, it is assigned to the incoming block Ki. This easy allocation method is the principal reason for the widespread use of paging. If memory space is divisible into regions of variable length, however, then it becomes more difficult to allocate incoming blocks efficiently. Two widely used algorithms for nonpreemptive allocation of variable-sized blocks (unpaged segments, for example) are first fit and best fit. The first-fit method scans the memory map sequentially until an available region Rj of nj >= ni words is found, where ni is the size of the incoming block Ki; it then allocates Ki to Rj. The best-fit approach requires searching the memory map completely and assigning Ki to an available region Rj of nj >= ni words such that nj - ni is minimized.
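The two nonpreemptive methods can be sketched over an available-space list of (address, size) pairs as follows; the free list and the request sizes are invented for the example, and sizes are in words.

```python
# First-fit and best-fit sketches over an available-space list of (address, size) pairs.

free_list = [(100, 30), (300, 80), (500, 45), (700, 200)]

def first_fit(n):
    """Return the address of the first available region with at least n words, or None."""
    for address, size in free_list:
        if size >= n:
            return address
    return None

def best_fit(n):
    """Return the address of the region with at least n words that minimizes the leftover space."""
    candidates = [(size - n, address) for address, size in free_list if size >= n]
    return min(candidates)[1] if candidates else None

print(first_fit(40))   # 300 (the first region of 40 or more words)
print(best_fit(40))    # 500 (leftover of 5 words, the tightest fit)
```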

Preemptive allocation. Nonpreemptive allocation cannot make efficient use of memory in all situations. Memory overflow, that is, rejection of a memory allocation request due to insufficient space, can be expected to occur while M1 is only partially full.


Much more efficient use of the available memory space is possible if the occupied space can be reallocated to make room for incoming blocks. Reallocation may be done in two ways:

• The blocks already in M1 can be relocated within M1 to create a gap large enough for the incoming block.

• One or more occupied regions can be made available by deallocating the blocks they contain. This method requires a rule (a replacement policy) for selecting blocks to be deallocated and replaced.

Deallocation requires that a distinction be made between "dirty" blocks, which have been modified since being loaded into M1, and "clean" blocks, which have not been modified. Blocks of instructions remain clean, whereas blocks of data become dirty. To replace a clean block, the memory management system can simply overwrite it with the new block and update its entry in the memory map. Before a dirty block is overwritten, however, it must be copied back to M2, which involves a slow block transfer.

Relocation of the blocks already occupying M1 can be done by a method called compaction. The blocks currently in M1 are compressed into a single contiguous group at one end of the memory. This creates an available region of maximum size. Once the memory is compacted, incoming blocks are assigned to contiguous regions at the unoccupied end.
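A very small compaction sketch follows; the block names, bases, sizes, and total memory size are assumptions made only for this illustration.

```python
# Compaction sketch: slide the occupied blocks to one end of memory so that all
# unused space becomes one contiguous region.

occupied = [("A", 0, 50), ("B", 120, 30), ("C", 400, 80)]   # (name, base, size)
MEMORY_SIZE = 600

def compact(blocks):
    """Relocate blocks to be contiguous from address 0; return the new list and the single free gap."""
    compacted, next_base = [], 0
    for name, _, size in blocks:
        compacted.append((name, next_base, size))
        next_base += size
    return compacted, (next_base, MEMORY_SIZE - next_base)   # one maximal free region

print(compact(occupied))
# ([('A', 0, 50), ('B', 50, 30), ('C', 80, 80)], (160, 440))
```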

REPLACEMENT POLICIES

CACHE MEMORY


These are small, fast memories placed between the processor and the main memory. Caches are faster than main memory. Small cache memories are intended to provide fast memory retrieval without sacrificing memory size. The cache contains a copy of certain portions of main memory. A memory read or write operation is first checked against the cache, and if the data for the desired location is available in the cache it is used by the CPU directly. Otherwise, a block of words is read from main memory into the cache and the word is supplied to the CPU from the cache. Since the cache has limited space, a portion of it, called a slot, must be vacated for this incoming block. The contents of the vacated block are written back to main memory at the position they belong to. The reason for bringing a whole block of words into the cache is, once again, locality of reference: we expect that the next few addresses will be close to the current address, and therefore a block of words is transferred from main memory to the cache. Thus, for a word that is not in the cache, the access time is slightly more than the access time of main memory without a cache; but, because of locality of reference, the next few words are likely to be in the cache, enhancing the overall speed of memory references. For example, if a memory read cycle takes 100 ns and a cache read cycle takes 20 ns, then for four consecutive references (the first brings the main-memory contents into the cache; the next three are served from the cache):

Time taken with cache = (100 + 20) + (20 x 3)
                      = 120 + 60
                      = 180 ns
(100 + 20 for the first read operation, which also loads the cache; 20 x 3 for the last three read operations, served from the cache)

Time taken without cache = 100 x 4 = 400 ns

CACHE ORGANIZATION

Figure below shows the principal components of a cache. Memory words are stored in a cache data memory and are grouped into small pages called cache blocks or lines. The contents of the cache's data memory are thus copies of a set of main-memory blocks. Each cache block is marked with its block address, referred to as a tag, so the cache knows to what part of the memory space the block belongs. The collection of tag addresses currently assigned to the cache, which can be noncontiguous, is stored in a special memory, the cache tag memory or directory.

There are two basic organizations of cache memory:


Look aside design

Look through design

LOOK ASIDE DESIGN

In the look-aside design, the cache and the main memory are directly connected to the system bus. In this design the CPU initiates a memory access by placing a (real) address Ai on the memory address bus at the start of a read (load) or write (store) cycle. The cache M1 immediately compares Ai to the tag addresses currently residing in its tag memory. If a match is found in M1, that is, a cache hit occurs, the access is completed by a read or write operation executed in the cache; main memory M2 is not involved. If no match with Ai is found, that is, a cache miss occurs, then the desired access is completed by a read or write operation directed to M2. In response to a cache miss, a block (line) Bj that includes the target address Ai is transferred from M2 to M1. This transfer is fast, taking advantage of the small block size and fast RAM access methods, which allow the cache block to be filled in a single short burst. The cache implements some replacement policy such as LRU to determine where to place an incoming block. When necessary, the cache block replaced by Bj in M1 is saved in M2. Note that cache misses, even though they are infrequent, result in block transfers between M1 and M2 that tie up the system bus, making it unavailable for other uses such as I/O operations.

LOOK THROUGH DESIGN

A faster, but more costly, organization called a look-through cache appears in Figure. The CPU communicates with the cache via a separate (local) bus that is isolated from the main system bus. The system bus is available for use by other units, such as I/O controllers, to communicate with main memory. Hence cache accesses and main-memory accesses not involving the CPU can proceed concurrently.


Unlike the look-aside case, with a look-through cache the CPU does not automatically send all memory requests to main memory; it does so only after a cache miss. A look-through cache allows the local bus linking M1 and M2 to be wider than the system bus, thus speeding up cache-main-memory transfers.

CACHE OPERATION

Read Policy

Figure shows the relationship between the data stored in the cache M1 and the data stored in main memory M2. Here a cache block (line) size of 4 bytes is assumed. Each memory address is 12 bits long, so the 10 high-order bits form the tag or block address, and the 2 low-order bits define a displacement address within the block. When a block is assigned to M1's data memory, its tag is also placed in M1's tag memory. Figure shows the contents of two blocks assigned to the cache data memory; note the locations of the same blocks in main memory. To read the shaded word, its address Ai = 101111000110 is sent to M1, which compares the tag part of Ai to its stored tags and finds a match (hit). The matching tag pinpoints the corresponding block in M1's data memory, and the 2-bit displacement is used to output the target word to the CPU.
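A small sketch of this read, using the same 12-bit address split (10-bit tag, 2-bit displacement); the block contents held in the toy tag/data memory are invented.

```python
# Splitting the 12-bit address from the example above into a 10-bit tag and a
# 2-bit displacement, then reading the word from a toy tag/data memory.

tag_memory = {0b1011110001: ["w0", "w1", "w2", "w3"]}   # tag -> 4-byte block (invented contents)

def cache_read(address):                  # address is 12 bits wide
    tag = address >> 2                    # 10 high-order bits: the block address
    displacement = address & 0b11         # 2 low-order bits: word within the block
    block = tag_memory.get(tag)
    if block is None:
        return None                       # cache miss
    return block[displacement]            # cache hit

print(cache_read(0b101111000110))         # hit: displacement 10 -> "w2"
```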

Write Policy: A cache write operation employs the same addressing technique. The data in the cache and in main memory can be written by processors or by Input/Output devices. The main problems faced in writing with cache memories are:

The contents of the cache and main memory can be altered by more than one device; for example, the CPU can write to the cache while an Input/Output module writes directly to main memory. This can result in inconsistencies between the values held in the cache and in main memory.

In the case of multiple CPUs with separate caches, a word altered in one cache may invalidate the copy of that word held in another cache.

The suggested techniques for writing in systems with caches are:

Write through: The data is written to the cache as well as to main memory. The other CPU-cache combinations (in a multiprocessor system) have to watch traffic to main memory and make suitable amendments to the contents of their caches. The disadvantage of this technique is that a bottleneck is created due to the large number of accesses to main memory by the various CPUs.

Write block (write back): In this method updates are made only in the cache, and a bit called the update bit is set. Only a block whose update bit is set is written back to main memory when it is replaced. But here all accesses to main memory, whether from other CPUs or from Input/Output modules, need to go through the cache, resulting in complex circuitry.
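The sketch below contrasts the two policies for a single cached location, with a dirty flag playing the role of the update bit; the addresses and values are invented for the example.

```python
# Sketch contrasting write through and write back for one cached block.

main_memory = {0x100: 7}
cache = {0x100: {"data": 7, "dirty": False}}

def cache_write(address, value, policy="write_through"):
    cache[address]["data"] = value
    if policy == "write_through":
        main_memory[address] = value          # every write also goes to main memory
    else:                                     # write back: only set the update (dirty) bit
        cache[address]["dirty"] = True

def evict(address):
    """On replacement, a write-back block is copied to main memory only if it is dirty."""
    if cache[address]["dirty"]:
        main_memory[address] = cache[address]["data"]
    del cache[address]

cache_write(0x100, 99, policy="write_back")
print(main_memory[0x100])    # still 7: main memory not yet updated
evict(0x100)
print(main_memory[0x100])    # 99: written back on replacement
```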

ADDRESS MAPPING TECHNIQUES

Address mapping is defined as the smallest unit of addressed data that can be mapped independently to an area of the virtual address space.

There are three common mapping techniques in cache memory:

Direct Mapping

Associative Mapping

Set Associative Mapping

Direct Mapping: In this mapping each block of main memory can be mapped only to one fixed slot of the cache. For example, if a cache has four slots, then main memory blocks 0, 4, 8, 12, 16, ... can be found only in slot 0; blocks 1, 5, 9, 13, 17, ... in slot 1; blocks 2, 6, 10, 14, 18, ... in slot 2; and blocks 3, 7, 11, 15, 19, ... in slot 3. This can be defined mathematically as

Cache slot number = (Block number of main memory) modulo (Total number of slots in cache)
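A quick check of this rule, for an assumed four-slot cache:

```python
# The direct-mapping rule above, for a cache with four slots.

CACHE_SLOTS = 4

def slot_for(block_number):
    return block_number % CACHE_SLOTS

print([slot_for(b) for b in (0, 4, 8, 12)])   # [0, 0, 0, 0] -> these blocks all compete for slot 0
print([slot_for(b) for b in (1, 5, 9, 13)])   # [1, 1, 1, 1]
```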

ADVANTAGE

In this technique, it can be easily determined whether a block is in cache or not.

This is a simple technique.

DISADVANTAGE

In this scheme, suppose two words that are referenced alternately and repeatedly happen to fall in the same slot; then the two blocks containing them will be swapped in and out of the cache repeatedly, reducing the efficiency of the cache.


Associative Mapping: In associative mapping any block of memory can be mapped to any location of the cache. But here the main difficulty is to determine whether a block is in the cache or not. This determination is normally carried out by comparing the incoming block address with all the cache slots simultaneously.

The main disadvantage of this mapping is the complex circuitry required to examine all the cache slots in parallel to determine the presence or absence of a block in cache.

Set Associative Mapping: This is a compromise between the above two types of mapping, obtaining the advantages of both the direct and the associative cache. The cache is divided into a number of sets, say A. The scheme is that direct mapping is used to map a main memory block to one of the A sets, and within that set the block may be assigned to any slot.
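A small placement sketch of this scheme follows, assuming A = 4 sets and 2 slots (ways) per set; the set count, way count, and replacement choice are assumptions for the example only.

```python
# Set-associative placement sketch: direct mapping chooses the set, and the block
# may occupy any slot (way) within that set.

SETS, WAYS = 4, 2
cache = [[None] * WAYS for _ in range(SETS)]   # cache[set][way] holds a block number

def place(block_number):
    s = block_number % SETS                    # direct mapping selects the set
    for way in range(WAYS):                    # any free slot within the set will do
        if cache[s][way] is None:
            cache[s][way] = block_number
            return s, way
    victim = 0                                 # set full: replace some slot (slot 0 here, arbitrarily)
    cache[s][victim] = block_number
    return s, victim

print(place(5))    # (1, 0): set 1, way 0
print(place(13))   # (1, 1): same set, different way
```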

Associative Memories

In associative memories any stored item can be accessed directly by using the contents of the item in question, such as the name of a person or an account number, as an address. Associative memories are also known as content addressable memories (CAMs). The entity chosen to address the memory is known as the key.

Figure aside shows the structure of a simple associative memory. The information is stored in a CAM as fixed-length words. Any field of the word may be chosen as the key field. The desired key is indicated by the mask register. The key is then compared simultaneously with all stored words. The words that match the key issue a match signal, which enters a select circuit.


The select circuit in turn enables access to the required data field. In case more than one entry matches the key, it is the responsibility of the select circuit to determine which data field is to be read; for example, the select circuit may read out all the matching entries in a predetermined order. Each word is provided with its own match circuit because all the words in the memory must compare their keys with the desired key simultaneously. The match and select circuits thus make associative memories much more complex and expensive than conventional memories. VLSI technology has made associative memories economically feasible, but even now cost considerations limit their application to relatively small amounts of information that need to be accessed very rapidly.

Associative memory cell

The logic circuit for a 1-bit associative memory cell appears in Figure below. The cell comprises a D flip-flop for data storage, a match circuit (the EXCLUSIVE-NOR gate) for comparing the flip-flop's contents to an external data bit D, and circuits for reading from and writing into the cell. The result of a comparison appears on the match output M, where M = 1 denotes a match and M = 0 denotes no match. The cell is selected or addressed for both read and write operations by setting the select line S to 1. New data is written into the cell by setting the write enable line WE to 1, which in turn enables the D flip-flop's clock input CK. The stored data is read out via the Q line. The mask control line MK is activated (MK = 1) to force the match line M to 0 independently of the data stored in the D flip-flop; MK also disables the input circuits of the flip-flop by forcing CK to 0. A cell like that of the figure can be realized with about 10 transistors, far more than the single transistor required for a dynamic RAM cell. This high hardware cost is the main reason that large associative memories are rarely used outside caches.
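A behavioural sketch of this cell, following the description above (it is a software model of the logic, not the circuit itself), is given below; the signal ordering and defaults are assumptions for the example.

```python
# Behavioural sketch of the 1-bit associative memory cell described above.
# Q is the stored bit; D is the external data bit; MK is the mask; S and WE gate writes.

def cam_cell(Q, D, MK, S=0, WE=0, new_data=0):
    """Return (Q_next, M): the updated stored bit and the match output."""
    if S and WE and not MK:          # write: cell selected and write-enabled, mask not asserted
        Q = new_data
    match = int(Q == D)              # EXCLUSIVE-NOR compare of the stored bit and the data bit
    M = 0 if MK else match           # MK = 1 forces the match line to 0
    return Q, M

print(cam_cell(Q=1, D=1, MK=0))      # (1, 1): stored bit matches the key bit
print(cam_cell(Q=1, D=0, MK=1))      # (1, 0): masked, so no match is reported
```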

Structure versus Performance


We next examine some additional aspects of cache design: the types of information to store in the cache, the cache's dimensions and control methods, and the impact of the cache's design on its performance.

CACHE TYPES

Caches are distinguished by the kinds of information they store. An instruction or I-cache stores instructions only, while a data or D-cache stores data only. Separating the stored data in this way recognizes the different access behaviour patterns of instructions and data. For example, programs tend to involve few write accesses, and they often exhibit more temporal and spatial locality than the data they process. A cache that stores both instructions and data is referred to as unified. A split cache, on the other hand, consists of two associated but largely independent units: an I-cache for instructions and a D-cache for data. While a unified cache is simpler, a split cache makes it possible to access programs and data concurrently. A split cache can also be designed to manage its I- and D-cache components differently.

PERFORMANCE

The cache is the fastest component in the memory hierarchy, so it is desirable to make the average memory access time tA seen by the CPU as close as possible to the access time tA1 of the cache. To achieve this goal, M1 should satisfy a very high percentage of all memory references; that is, the cache hit ratio H should be almost one. A high hit ratio is possible because of the locality-of-reference property discussed earlier. We have tA = tA1 + (1 - H)tB, where tB is the block-transfer time from M2 to M1. The block size is small enough that, with a sufficiently wide M2-to-M1 data bus, a block can be loaded into the cache in a single main-memory read operation, making tB = tA2, the main-memory access time. Hence we can roughly estimate cache performance with the equation

tA = tA1 + (1 - H)tA2
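Evaluating this estimate for some illustrative timings (the numbers below are chosen only to show the effect of the hit ratio):

```python
# Evaluating tA = tA1 + (1 - H) * tA2 for some illustrative numbers (times in ns).

def average_access_time(tA1, tA2, H):
    return tA1 + (1 - H) * tA2

print(average_access_time(tA1=20, tA2=100, H=0.95))   # 25.0 ns
print(average_access_time(tA1=20, tA2=100, H=0.99))   # 21.0 ns
```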

Consider a k-way set-associative cache M1 defined by the following parameters: the number of sets s1, the number of blocks (lines) per set k, and the number of bytes per block (also called the line size) p1. Recall that the cache is fully associative when s1 = 1 and is direct-mapped when k = 1. The number of bytes stored in the cache's data memory, usually referred to as the cache size S1, is given by the following formula:

S1 = k x s1 x p1


or, in words,

Cache size = number of blocks (lines) per set x number of sets x number of bytes per block
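For example, under the assumed configuration of a 4-way set-associative cache with 128 sets and 32-byte lines (an arbitrary illustration):

```python
# Cache size S1 = k x s1 x p1 for an illustrative configuration:
# a 4-way set-associative cache with 128 sets and 32-byte lines.

k, s1, p1 = 4, 128, 32
S1 = k * s1 * p1
print(S1)          # 16384 bytes, i.e. a 16 KB cache
```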
