Branch prediction: Look-ahead Pre-fetching [A detailed look into branch prediction leading to look-ahead prediction techniques with software and hardware compilation and an overall analytic contrast]

By: SALEEM, Muhammad Umair [25279], under the supervision of Prof. Dr. Andreas Siggelkow. Hochschule Ravensburg-Weingarten, Department of Electrical Engineering (Master of Engineering)

    Computer Architecture 4872

    1.0 - Abstract:

The discussion to follow in this compilation is based on observation, reading, assessment and conclusions concerning branch-style predictions and methodologies. As a starting point for a comprehensive understanding of this writing, it is assumed that the reader is well aware of the concept of branches and of how they can be useful in decreasing the overall instruction cycle count for a nominated super-scalar system. The basic idea of a pre-fetching scheme is to keep track of data access patterns in a prediction table organized like an instruction cache.

In the current technological age, the concepts of branch prediction and instruction-cycle reduction are nothing new. In this paper we discuss pre-fetching based on branch prediction: how pre-fetching is realized in terms of hardware utilization, its software counterpart, and a little analytical theory to complement their usage in reducing memory access latency and improving performance.

    1.1 - Introduction:

Instruction pre-fetching is an important technique for closing the gap between the speed of the microprocessor and its memory system. As current microprocessors become ever faster, this gap continues to increase and becomes a bottleneck, resulting in a loss of overall system performance. To close this gap, instruction prefetching speculatively brings the needed instructions close to the microprocessor ahead of time and, hence, reduces the transfer delay due to the relatively slow memory system. If instruction prefetching can predict future instructions accurately and bring them in advance, most of the delay due to the memory system can be eliminated. Branch predictors are built into current microprocessors to reduce the stall time due to instruction fetching and, in general, can achieve prediction accuracy as high as 95% for SPEC benchmarks. Prefetching based on branch prediction (BP-based prefetching) can achieve higher performance than a cache alone by speculatively running ahead of the execution unit at a rate close to one basic block per cycle. With the aid of advanced branch predictors and a small autonomous fetching unit, this type of prefetching can accurately select the most likely path and fetch the instructions on that path in advance. Therefore, most of the pre-fetches are useful and can fetch instructions before they are needed by the execution unit.

The paper follows a descriptive pattern that gives an overview of pre-fetch schemes in both hardware and software, along with a comparison of different pre-fetching algorithms. The techniques described here are compiled from academic research projects and are shown with proper mention of, and credit for, that work. These results eventually form an outline of preferred design schemes and implementations for the overall purpose of reducing memory access latency in the super-scalar pipeline structures used for pre-fetch logic.

2.0 - A guide to pre-fetch schemes:

Microprocessor performance has increased at a dramatic rate over the past decade. This trend has been sustained by continued architectural innovations and advances in microprocessor fabrication technology. In contrast, main memory (dynamic RAM) performance has increased at a much more leisurely rate. This expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the large latency of memory accesses.

Chief among the latency-reducing techniques is the use of cache memory hierarchies [1]. The static RAM (SRAM) memories used in caches have managed to keep pace with processor memory request rates but continue to be too expensive for a main store technology. Although the use of large cache hierarchies has proven to be effective in reducing the average memory access penalty for programs that show a high degree of locality in their addressing patterns, it is still not uncommon for data-intensive programs to spend more than half their run times stalled on memory requests [2]. The large, dense matrix operations that form the basis of many such applications typically exhibit little locality and therefore can defeat caching strategies.

2.1 - The On-Demand Fetch

This policy fetches data into the cache from main memory only after the processor has requested a word and found it absent from the cache. The situation is illustrated in Figure (a), where computation, including memory references satisfied within the cache hierarchy, is represented by the upper time line while main memory access time is represented by the lower time line. In this figure, the data blocks associated with memory references r1, r2, and r3 are not found in the cache hierarchy and must therefore be fetched from main memory. Assuming the referenced data word is needed immediately, the processor will be stalled while it waits for the corresponding cache block to be fetched. Once the data returns from main memory, it is cached and forwarded to the processor, where computation may again proceed.

Note that this fetch policy will always result in a cache miss for the first access to a cache block, since only previously accessed data are stored in the cache. Such cache misses are known as cold start or compulsory misses. Also, if the referenced data is part of a large array operation, it is likely that the data will be replaced after its use to make room for new array elements being streamed into the cache. When the same data block is needed later, the processor must again bring it in from main memory, incurring the full main memory access latency. This is called a capacity miss.

Many of these cache misses can be avoided if we augment the demand fetch policy of the cache with a data pre-fetch operation. Rather than waiting for a cache miss to perform a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. This pre-fetch proceeds in parallel with processor computation, allowing the memory system time to transfer the desired data from main memory to the cache. Ideally, the pre-fetch will complete just in time for the processor to access the needed data in the cache without stalling.

2.2 - The Explicit Fetch

At a minimum, this fetch specifies the address of a data word to be brought into the cache. When the fetch instruction is executed, this address is simply passed on to the memory system without forcing the processor to wait for a response. The cache responds to the fetch in a manner similar to an ordinary load instruction, with the exception that the referenced word is not forwarded to the processor after it has been cached. Figure (b) shows how pre-fetching can be used to improve the execution time of the demand fetch case given in Figure (a). Here, the latency of main memory accesses is hidden by overlapping computation with memory accesses, resulting in a reduction in overall run time. This figure represents the ideal case, in which pre-fetched data arrives just as it is requested by the processor.

Figure 1: An example of explicit fetching

A less optimistic situation is depicted in Figure (c). In this figure, the pre-fetches for references r1 and r2 are issued too late to avoid processor stalls, although the data for r2 is fetched early enough to realize some benefit. Note that the data for r3 arrives early enough to hide all of the memory latency but must be held in the processor cache for some period of time before it is used by the processor. During this time, the pre-fetched data are exposed to the cache replacement policy and may be evicted from the cache before use. When this occurs, the pre-fetch is said to be useless because no performance benefit is derived from fetching the block early.

    2.2.1 - Hazards associated with pre-fetching:

A prematurely pre-fetched block may also displace data in the cache that is currently in use by the processor, resulting in what is known as cache pollution. This is not the same as a normal cache replacement miss.

Figure 2: Overhead on fetching statements in a processor

A pre-fetch that causes a miss in the cache that would not have occurred had prefetching not been in use is defined as cache pollution. If, however, a pre-fetched block displaces a cache block which is referenced after the pre-fetched block has been used, this is an ordinary replacement miss, since the resulting cache miss would have occurred with or without prefetching. A more subtle side effect of prefetching occurs in the memory system. Note that in the earlier Figure (a) the three memory requests occur within the first 31 time units of program startup, whereas in Figure (b) these requests are compressed into a period of 19 time units. By removing processor stall cycles, prefetching effectively increases the frequency of memory requests issued by the processor. Memory systems must be designed to match this higher bandwidth to avoid becoming saturated and nullifying the benefits of prefetching. This can be particularly true for multiprocessors, where bus utilization is typically higher than in single-processor systems.

    3.0 - Software pre-fetch:

    Software prefetching [3] can achieve a reduction in run time despite adding instructions into the execution stream.

In the figure shown here, the memory effects from the previous Figures (a, b, c) in section 2 are ignored and only the computational components of the run time are shown. Here, it can be seen that the three pre-fetch instructions actually increase the amount of work done by the processor.

Although hardware prefetching incurs no instruction overhead, it often generates more unnecessary pre-fetches than software prefetching. Unnecessary pre-fetches are more common in hardware schemes because they speculate on future memory accesses without the benefit of compile-time information. Although unnecessary pre-fetches do not affect correct program behavior, they can result in cache pollution and will consume memory bandwidth. To be effective, data prefetching must be implemented in such a way that pre-fetches are timely, useful, and introduce little overhead.

    3.1 - Software Pre-fetch Methodologies:

The pre-fetch overhead can be reduced to a minimum if we can selectively pre-fetch only those references that will be misses. Various algorithms have been suggested to deal with this, but both Chen and Baer [1] and Tullsen and Eggers [4] feel that the algorithm described by Mowry and Gupta [5] is the best of these. Using Mowry and Gupta's algorithm, once a potential cache miss has been identified, the software scheme inserts a pre-fetch instruction. If accesses have spatial or group locality in the same cache line, only the first access to the line will result in a cache miss, and only one pre-fetch instruction should be issued.

Testing for this condition, however, can be expensive, and the compiler will generally perform loop splitting and loop unrolling. One consequence of this is that the code may expand significantly. An example of this can be seen in the code example provided in the table.

However, Mowry et al. [5] report that, for more than half of the thirteen benchmarks that they used (the specific benchmarks are not relevant here), the instruction overhead caused less than a 15% increase in instruction count, and that in the other cases the number of instructions increased by 25% to 50%.

Since the compiler does not have complete information about the dynamic behavior of the program, it will be unable to successfully cover all misses, and a miss covered by a pre-fetch may still stall the processor if the pre-fetch arrives late or if it is cancelled by some other activity. Furthermore, the compiler may also insert unnecessary pre-fetches for variables that generate hits in the original execution [6].

It is assumed that a cache line holds two array elements (so that prefetching &X[i] also gets &X[i+1]) and that the memory latency requires the pre-fetch to be scheduled four iterations ahead. After the original loop is split, the loops are unrolled by a factor of two.
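
To make the transformation concrete, the following is a minimal sketch of the split-and-unrolled loop under exactly these assumptions. The summation loop, the array name X, and the use of the GCC/Clang __builtin_prefetch intrinsic are illustrative choices, not code from the cited work.

#include <cstddef>

// Original loop:  for (std::size_t i = 0; i < N; i++) sum += X[i];
// Transformed under the stated assumptions: a cache line holds two elements
// of X, and the pre-fetch must be scheduled four iterations ahead of its use.
double prefetched_sum(const double* X, std::size_t N) {
    double sum = 0.0;
    std::size_t i = 0;
    if (N >= 6) {
        // Prolog: pre-fetch the data for the first four (original) iterations,
        // one pre-fetch per cache line of two elements.
        __builtin_prefetch(&X[0]);   // __builtin_prefetch is a GCC/Clang intrinsic (assumption)
        __builtin_prefetch(&X[2]);
        // Steady state: unrolled by two so that only one pre-fetch is issued
        // per cache line, four original iterations ahead of the access.
        for (; i + 5 < N; i += 2) {
            __builtin_prefetch(&X[i + 4]);
            sum += X[i];
            sum += X[i + 1];
        }
    }
    // Epilog: finish the remaining elements without pre-fetching.
    for (; i < N; ++i)
        sum += X[i];
    return sum;
}

The prolog, steady-state and epilog loops are exactly the kind of loop splitting referred to above, and they are the source of the code expansion that Mowry et al. measure.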

    4.0 - Description of Hardware Prefetching Schemes

    4.1 - Sequential prefetching (general scheme of One Block Look Ahead OBL):

Many prefetching schemes are designed to fetch data from main memory into the processor cache in units of cache blocks.

Figure 3: Code example of selective pre-fetching

By grouping consecutive memory words into single units, caches exploit the principle of spatial locality to implicitly pre-fetch data. The degree to which large cache blocks can be effective in prefetching data is limited by the ensuing cache pollution effects (mentioned before). Sequential prefetching can take advantage of spatial locality without introducing some of the problems associated with large cache blocks. The simplest sequential prefetching schemes are variations upon the one block look-ahead (OBL) approach, which initiates a pre-fetch for block b+1 when block b is accessed.

Figure 4: Three forms of sequential prefetching: a) pre-fetch on miss, b) tagged, and c) sequential pre-fetch with K = 2

4.1.1 - Types of OBL

OBL implementations differ depending on what type of access to block b initiates the pre-fetch of b+1. Smith [7] summarizes several of these approaches, of which the pre-fetch-on-miss and tagged pre-fetch algorithms are considered here. The pre-fetch-on-miss algorithm simply initiates a pre-fetch for block b+1 whenever an access for block b results in a cache miss. If b+1 is already cached, no memory access is initiated. The tagged pre-fetch algorithm associates a tag bit with every memory block. This bit is used to detect when a block is demand-fetched or when a pre-fetched block is referenced for the first time. In either of these cases, the next sequential block is fetched.

Smith [7] found that tagged prefetching reduced cache miss ratios in a unified (both instruction and data) cache by between 50% and 90%. Pre-fetch-on-miss was less than half as effective as tagged prefetching in reducing miss ratios. The reason pre-fetch-on-miss is less effective is illustrated in the figure, where the behavior of each algorithm when accessing three contiguous blocks is shown. Here, it can be seen that a strictly sequential access pattern will result in a cache miss for every other cache block when the pre-fetch-on-miss algorithm is used, but the same access pattern results in only one cache miss when employing a tagged pre-fetch algorithm.
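
To make the difference between the two trigger policies concrete, here is a small behavioural sketch (not taken from [7]). The cache is modelled as an unbounded set of block numbers with a per-block tag bit, which is enough to reproduce the miss counts described above for a strictly sequential access pattern.

#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <unordered_set>

// Simplified model used only to compare when the two OBL policies trigger a
// pre-fetch of block b+1.
struct OblCache {
    std::unordered_set<uint64_t> present;    // blocks currently cached
    std::unordered_map<uint64_t, bool> tag;  // "pre-fetched, not yet referenced" bit

    void access(uint64_t b, bool tagged_policy, int& misses) {
        const bool hit = present.count(b) != 0;
        if (!hit) { ++misses; present.insert(b); tag[b] = false; }
        bool trigger;
        if (tagged_policy)
            // Tagged: trigger on a demand miss or on the first reference to a
            // block that was brought in by a pre-fetch (its tag bit still set).
            trigger = !hit || tag[b];
        else
            // Pre-fetch-on-miss: trigger only on a demand miss.
            trigger = !hit;
        if (hit) tag[b] = false;             // a demand reference clears the tag
        if (trigger && present.count(b + 1) == 0) {
            present.insert(b + 1);           // pre-fetch the next sequential block
            tag[b + 1] = true;
        }
    }
};

int main() {
    int miss_on_miss = 0, miss_tagged = 0;
    OblCache a, b;
    for (uint64_t blk = 0; blk < 8; ++blk) {   // strictly sequential access pattern
        a.access(blk, /*tagged_policy=*/false, miss_on_miss);
        b.access(blk, /*tagged_policy=*/true, miss_tagged);
    }
    std::cout << "pre-fetch-on-miss: " << miss_on_miss << " misses, "
              << "tagged: " << miss_tagged << " miss(es)\n";
    return 0;
}

Run over eight sequential blocks, the on-miss policy misses on every other block while the tagged policy misses only on the first, matching the behaviour described above.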

4.1.2 - Sequential adaptive prefetching, an improvement:

One upgrade to the above logic was provided by Dahlgren and Stenström [8], who proposed an adaptive sequential prefetching policy that allows the value of K (the prefetching degree) to vary during program execution.

To do this, a pre-fetch efficiency metric is periodically calculated by the cache as an indication of the current spatial locality characteristics of the program. Pre-fetch efficiency is defined to be the ratio of useful pre-fetches to total pre-fetches, where a useful pre-fetch occurs whenever a pre-fetched block results in a cache hit. The value of K is initialized to one, incremented whenever the pre-fetch efficiency exceeds a predetermined upper threshold, and decremented whenever the efficiency drops below a lower threshold, as shown in the graphical figure. Note that if K is reduced to zero, prefetching is effectively disabled.

Figure 5: Adaptive stride for sequential fetching
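
A minimal sketch of the K-adjustment step follows; the threshold values and the idea of calling it once per measurement interval are illustrative assumptions, not parameters from [8].

#include <cstddef>

// Recompute pre-fetch efficiency once per measurement interval and adapt K.
std::size_t adapt_prefetch_degree(std::size_t k,
                                  std::size_t useful_prefetches,
                                  std::size_t total_prefetches) {
    const double upper_threshold = 0.75;   // assumed values, not from [8]
    const double lower_threshold = 0.40;
    if (total_prefetches == 0)
        return k;                          // nothing to learn from this interval
    const double efficiency =
        static_cast<double>(useful_prefetches) / static_cast<double>(total_prefetches);
    if (efficiency > upper_threshold)
        ++k;                               // high spatial locality: fetch further ahead
    else if (efficiency < lower_threshold && k > 0)
        --k;                               // low locality: back off; K == 0 disables prefetching
    return k;
}

K is initialized to one elsewhere; the function simply implements the increment/decrement rule described above.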

    4.2 - Prefetching with arbitrary strides:

Several techniques have been proposed which employ special logic to monitor the processor's address referencing pattern to detect constant-stride array references [1, 9, 10]. This is accomplished by comparing successive addresses used by load or store instructions. To illustrate the design of Chen and Baer's scheme [1], assume a memory instruction MI references addresses a1, a2 and a3 during three successive loop iterations. Prefetching for MI will be initiated if (a2 - a1) = D != 0, where D is assumed to be the stride of a series of array accesses. The first pre-fetch address will then be A3 = a2 + D, where A3 is the predicted value of the observed address a3. Prefetching continues in this way until the equality A_n = a_n no longer holds.

Figure 6: Register-level realization of arbitrary-stride pre-fetching, from Chen et al. [1]

Note that this approach requires the previous address used by a memory instruction to be stored along with the last detected stride, if any. Recording the reference histories of every memory instruction in the program is clearly impossible. Instead, a separate cache called the reference prediction table (RPT) holds this information for only the most recently used memory instructions. The organization of the RPT is shown in the figure above.

The first time a load instruction causes a miss, a table entry is reserved, possibly evicting the table entry for an older load instruction. The miss address is then recorded in the last-address field and the state is set to initial. The next time this instruction causes a miss, the last address is subtracted from the current miss address and the result is stored in the delta (stride) field. The last address is then updated with the new miss address. The entry is now in the training state. The third time the load instruction misses, a new delta is computed. If this delta matches the one stored in the entry, then there is a stride access pattern. The pre-fetcher then uses the delta to calculate which cache block(s) to pre-fetch.
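
The per-entry update can be sketched as follows; this is a simplified three-state version of the table described above (field and state names are illustrative), keyed by the PC of the load.

#include <cstdint>
#include <unordered_map>
#include <vector>

// One entry of the reference prediction table (RPT), keyed by the load's PC.
struct RptEntry {
    uint64_t last_addr = 0;
    int64_t  stride    = 0;
    enum State { INITIAL, TRAINING, STEADY } state = INITIAL;
};

// Returns the address(es) to pre-fetch; empty until a stable stride is seen.
std::vector<uint64_t> rpt_access(std::unordered_map<uint64_t, RptEntry>& rpt,
                                 uint64_t pc, uint64_t addr) {
    std::vector<uint64_t> prefetches;
    auto it = rpt.find(pc);
    if (it == rpt.end()) {                        // first miss: reserve an entry
        RptEntry e;
        e.last_addr = addr;                       // record the miss address, state = initial
        rpt[pc] = e;
        return prefetches;
    }
    RptEntry& e = it->second;
    const int64_t delta = static_cast<int64_t>(addr - e.last_addr);
    switch (e.state) {
    case RptEntry::INITIAL:                       // second miss: record the stride
        e.stride = delta;
        e.state  = RptEntry::TRAINING;
        break;
    case RptEntry::TRAINING:                      // later misses: confirm or retrain
    case RptEntry::STEADY:
        if (delta == e.stride && delta != 0) {
            e.state = RptEntry::STEADY;           // stable stride: predict the next address
            prefetches.push_back(addr + static_cast<uint64_t>(e.stride));
        } else {
            e.stride = delta;
            e.state  = RptEntry::TRAINING;
        }
        break;
    }
    e.last_addr = addr;
    return prefetches;
}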

4.3 - The look-ahead program counter scheme, with selective stride development

The RPT still limits the pre-fetch distance to one loop iteration. To remedy this shortcoming, a distance field may be added to the RPT which specifies the pre-fetch distance explicitly. Pre-fetched addresses would then be calculated as

effective address + (stride x distance)

Figure 7: Two-bit state machine representation

The addition of the distance field requires some method of establishing its value for a given RPT entry. To calculate an appropriate value, Chen and Baer [1] decouple the maintenance of the RPT from its use as a pre-fetch engine. The RPT entries are maintained under the direction of the program counter as described above, but pre-fetches are initiated separately by a pseudo program counter, called the look-ahead program counter (LA-PC), which is allowed to precede the PC.

Figure 8: A realization of the LA-PC, from Chen et al. [1]

This is basically a pseudo program counter that runs several cycles ahead of the regular program counter (PC). The LA-PC then looks up the reference prediction table to pre-fetch data in advance. The LA-PC scheme only advances one instruction per cycle and is restricted to be, at most, a fixed number of cycles ahead of the regular PC. The studies from Chen et al. [1] focused on data prefetching rather than instruction prefetching, and did not evaluate the effects of speculative execution, multiple instruction issue, or the presence of advanced branch prediction mechanisms.
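
A small sketch of the two pieces added by this scheme: the distance field in the pre-fetch address computation, and the bound that keeps the LA-PC a fixed number of cycles ahead of the PC. The field names and the bound of 64 cycles are illustrative assumptions.

#include <cstdint>

// RPT entry extended with an explicit distance field.
struct RptEntryWithDistance {
    uint64_t last_addr;
    int64_t  stride;
    uint32_t distance;   // pre-fetch distance in loop iterations
};

// Pre-fetched address = effective address + (stride x distance).
inline uint64_t prefetch_address(const RptEntryWithDistance& e, uint64_t effective_addr) {
    return effective_addr +
           static_cast<uint64_t>(e.stride * static_cast<int64_t>(e.distance));
}

// The LA-PC advances at most one instruction per cycle and may never run more
// than a fixed number of cycles ahead of the architectural PC.
inline bool la_pc_may_advance(uint64_t la_pc_cycle, uint64_t pc_cycle,
                              uint64_t max_ahead = 64) {   // 64 is an assumed bound
    return la_pc_cycle - pc_cycle < max_ahead;
}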

Similar data prefetching schemes can be seen in the work of Liu and Kaeli [2], as shown in the figure.

The prefetching degree is the number of cache blocks that are fetched in a single prefetching operation, while the prefetching distance is how far ahead prefetching starts. For example, a sequential pre-fetcher with a prefetching degree of 2 and a prefetching distance of 5 would fetch blocks X+5 and X+6 if there was a miss on block X.

Perez et al. [11] did a comparative survey in 2004 of many proposed prefetching heuristics and found that tagged sequential prefetching, reference prediction tables (RPT) and Program Counter/Delta Correlation prefetching (PC/DC) were the top performers.

    4.4 - PC/DC Prefetching:

This approach, presented by Nesbit and Smith [12], uses a Global History Buffer (GHB). The structure of the GHB is shown in the figure.

Each cache miss, or cache hit to a tagged (pre-fetched) cache block, is inserted into the GHB in FIFO order. The index table stores the address of the load instruction and a pointer into the GHB for the last miss issued by that instruction. Each entry in the GHB has a similar pointer, which points to the next miss issued by the same instruction.
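
A compact sketch of this structure follows; the unbounded FIFO, the use of indices rather than hardware pointers, and the direction of the per-load links (newest to oldest) are illustrative assumptions.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Global History Buffer: a FIFO of miss addresses in which each entry also
// links to the previous miss made by the same load, plus an index table that
// maps a load PC to that load's most recent GHB entry.
struct GlobalHistoryBuffer {
    struct Entry { uint64_t miss_addr; int prev_same_pc; };
    std::vector<Entry> fifo;                        // modelled as ever-growing for brevity
    std::unordered_map<uint64_t, int> index_table;  // PC -> newest entry for that PC

    void insert(uint64_t pc, uint64_t miss_addr) {
        auto it = index_table.find(pc);
        const int prev = (it == index_table.end()) ? -1 : it->second;
        fifo.push_back({miss_addr, prev});          // FIFO insertion of the new miss
        index_table[pc] = static_cast<int>(fifo.size()) - 1;
    }

    // Walk the per-PC chain (newest first) to recover that load's miss history.
    std::vector<uint64_t> history(uint64_t pc) const {
        std::vector<uint64_t> out;
        auto it = index_table.find(pc);
        for (int i = (it == index_table.end()) ? -1 : it->second; i >= 0;
             i = fifo[i].prev_same_pc)
            out.push_back(fifo[i].miss_addr);
        return out;
    }
};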

Figure 9: A Global History Buffer, based on FIFO order, for pre-fetching

PC/DC prefetching calculates the deltas between successive cache misses and stores them in a delta buffer. The history in the GHB yields the address stream and the corresponding delta stream shown in the figure. The last pair of deltas is (1, 9). By searching the delta stream (correlating), we find this same pair near the beginning. A pattern is found, and prefetching can begin. The deltas after the pair are then added to the current miss address, and pre-fetches are issued for the calculated addresses.

Figure 10: Delta stream buffer structure
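
The correlation step can be sketched as follows; the GHB linked-list walk is flattened here into a per-load vector of miss addresses (oldest first), and the limit of four pre-fetches per activation is an illustrative assumption rather than a parameter from [12].

#include <cstdint>
#include <vector>

// Given one load's miss-address history (oldest first), compute the delta
// stream, look for an earlier occurrence of the two most recent deltas, and
// return the addresses predicted by the deltas that followed that occurrence.
std::vector<uint64_t> pcdc_prefetches(const std::vector<uint64_t>& history,
                                      std::size_t max_prefetches = 4) {
    std::vector<uint64_t> out;
    if (history.size() < 4)
        return out;                                  // not enough history to correlate
    std::vector<int64_t> deltas;
    for (std::size_t i = 1; i < history.size(); ++i)
        deltas.push_back(static_cast<int64_t>(history[i] - history[i - 1]));
    // The correlation key is the most recent delta pair, e.g. (1, 9) in the text.
    const int64_t d1 = deltas[deltas.size() - 2];
    const int64_t d2 = deltas[deltas.size() - 1];
    for (std::size_t i = deltas.size() - 2; i >= 2; --i) {
        if (deltas[i - 2] == d1 && deltas[i - 1] == d2) {    // earlier occurrence found
            uint64_t addr = history.back();                  // current miss address
            // Replay the deltas that followed the matched pair as pre-fetches.
            for (std::size_t j = i; j < deltas.size() && out.size() < max_prefetches; ++j) {
                addr += static_cast<uint64_t>(deltas[j]);
                out.push_back(addr);
            }
            break;
        }
    }
    return out;
}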

    4.5 - Branch prediction-based prefetching

Conceptually, the instruction prefetching scheme proposed here [13] is similar to the look-ahead program counter, yet with much more aggressive prefetching policies. The pre-fetching unit is an autonomous state machine which speculatively runs down the instruction stream as fast as possible and brings in all the instructions encountered along the path. When a branch is encountered, the prefetching unit predicts the likely execution path using the branch predictor, records the prediction in a log, and continues. In the meantime, the execution unit of the microprocessor routinely checks the log as branches are resolved and resets the program counter of the prefetching unit if an error is found.

Figure 11: The organization of the BP-based prefetching scheme

Initially, the PC of the prefetching unit is set to be equal to the PC of the execution unit. Then the prefetching unit spends one cycle to fetch the desired cache line.

The prefetching unit examines an entire cache line as a unit and quickly finds the first branch (either conditional or unconditional) in that cache line using existing pre-decoded information or a few bits from the opcode. During the same cycle, the prefetching unit also predicts and computes the potential target for the branch in one of three ways. First, for a subroutine return branch, its target is predicted with a return address stack, which has high prediction accuracy [14]. The prefetching unit has its own separate return address stack.

Second, for a conditional branch, the direction is predicted with a two-level branch predictor and the target address is computed with a dedicated adder in the same cycle. A dedicated adder is used instead of a branch target buffer, because the first time the branch is encountered it will not yet be recorded in the target buffer. Also note that the two-level branch predictor used in the prefetching unit has its own small branch history register but shares the same expensive pattern history table with the execution unit. The prefetching unit only speculatively updates its own branch history register and does not update the pattern history table.

Third, for an unconditional branch, its direction is always taken and its target is calculated using the same adder used for conditional branches. However, for an indirect branch, the prefetching unit stalls and waits for the execution unit, because this type of branch can have multiple targets.

Figure 12: Logic flow for branch-predictor-directed fetching

The cache line pre-fetch depends on the predicted direction of a branch. When a branch is predicted to be taken, the cache line containing its target is pre-fetched; otherwise, the prefetching unit examines the next branch in the cache line. The prefetching unit continues to examine successive branches until the end of the current cache line is reached, at which point the next sequential cache line is pre-fetched. The entire process is then repeated for the newly pre-fetched cache line.

To verify the predictions made, when a branch is predicted, the predicted outcome is recorded in a log. This log is organized as a first-in-first-out (FIFO) buffer. When the execution unit resolves a branch, the actual outcome is compared with the one predicted by the prefetching unit. If the actual outcome matches the one predicted, the item is removed from the log. However, if the actual outcome differs from the one predicted, then the entire log is flushed and the PC of the prefetching unit is reset to the PC of the execution unit.
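
The verification log can be sketched as a simple FIFO of predicted outcomes; the structure and member names below are illustrative, but the flush-and-resynchronise behaviour follows the description above.

#include <cstdint>
#include <deque>

struct PredictionRecord {
    uint64_t branch_pc;
    bool     predicted_taken;
};

struct PrefetchUnit {
    uint64_t pc = 0;                      // PC of the pre-fetching unit
    std::deque<PredictionRecord> log;     // FIFO of outstanding predictions

    // Called by the pre-fetching unit each time it predicts a branch.
    void record_prediction(uint64_t branch_pc, bool taken) {
        log.push_back({branch_pc, taken});
    }

    // Called by the execution unit when a branch resolves. Returns true if the
    // pre-fetcher had to be resynchronised to the execution unit's PC.
    bool resolve(uint64_t branch_pc, bool actual_taken, uint64_t execution_pc) {
        if (!log.empty() && log.front().branch_pc == branch_pc &&
            log.front().predicted_taken == actual_taken) {
            log.pop_front();              // prediction verified, drop the entry
            return false;
        }
        log.clear();                      // mispredicted path: flush the entire log
        pc = execution_pc;                // reset the pre-fetcher's PC
        return true;
    }
};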

4.6 - Delta-Correlating Prediction Tables (DCPT) [15]

A combined approach drawing on Reference Prediction Tables and the delta correlation scheme

Figure 13: Structure of the DCPT

In DCPT we use a large table indexed by the address (PC) of the load. Each entry has the format shown in the figure below. The last-address field works in a similar manner as in RPT prefetching. The n delta fields act as a circular buffer, holding the last n deltas observed by this load instruction, and the delta pointer points to the head of this circular buffer.

To provide further insight into the operation of this scheme, pseudo code is presented courtesy of the development team mentioned before [15].

In this pseudo code, one mnemonic is used as the assignment operator and another as the insert-into-circular-buffer operator. For ease of description and display, the delta buffer and the in-flight buffer are presented as arrays; in reality, however, they are still circular and wrap around once they become full.

Initially, the PC is used to look up an entry in the table. In our implementation we have used a fully-associative table, but it is possible to use other organizations as well. If an entry with the corresponding PC is not found, then a replacement entry is initialized; this is shown in lines 4-8. If an entry is found, the delta between the current address and the previous address is computed. The buffer is only updated if the delta is non-zero.

Figure 14

The new delta is inserted into the delta buffer and the last-address field is updated. Each delta is stored as an n-bit value. If the value cannot be represented with only n bits, a 0 is stored in the delta buffer as an indicator of an overflow.

Figure 15

Delta correlation begins after updating the entry. The pseudo code for delta correlation is shown in the algorithm below.

The deltas are traversed in reverse order, looking for a match with the two most recently inserted deltas. If a match is found, the next stage begins. The first pre-fetch candidate is generated by adding the delta after the match to the value found in the last-address field. The next pre-fetch candidate is generated by adding the next delta to the previous pre-fetch candidate. This process is repeated for each of the deltas after the matched pair, including the newly inserted deltas.

Figure 16

The next step in the DCPT flow is pre-fetch filtering. The pseudo code for this step is shown in the algorithm below. If a pre-fetch candidate matches the value stored in the last-prefetch field, the contents of the pre-fetch candidate buffer up to this point are discarded. Every pre-fetch candidate is looked up in the cache to see if it is already present. If it is not present, it is checked against the miss status holding registers to see whether a demand request for the same block has already been issued.

This buffer can only hold 32 pre-fetches. If it is full, pre-fetches are discarded in FIFO order. Finally, the last-prefetch field is updated with the address of the issued pre-fetch.
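
Condensing the DCPT flow described above into one sketch: the entry layout, the non-zero-delta update, the reverse search for the last delta pair, and the last-prefetch filter follow the text, while the buffer size is illustrative and the cache/MSHR lookups are reduced to a comment.

#include <cstdint>
#include <vector>

struct DcptEntry {
    uint64_t last_address  = 0;
    uint64_t last_prefetch = 0;
    std::vector<int64_t> deltas;          // modelled circular buffer of the last n deltas
    std::size_t n = 16;                   // assumed buffer size

    void update(uint64_t addr) {
        const int64_t delta = static_cast<int64_t>(addr - last_address);
        if (delta != 0) {                 // the buffer is only updated on a non-zero delta
            if (deltas.size() == n) deltas.erase(deltas.begin());
            deltas.push_back(delta);      // overflow-to-zero encoding omitted for brevity
        }
        last_address = addr;
    }

    // Delta correlation on the two most recent deltas, followed by filtering
    // against last_prefetch; returns the addresses that should be issued.
    std::vector<uint64_t> candidates() {
        std::vector<uint64_t> out;
        if (deltas.size() < 3) return out;
        const int64_t d1 = deltas[deltas.size() - 2];
        const int64_t d2 = deltas.back();
        for (std::size_t i = deltas.size() - 2; i >= 2; --i) {
            if (deltas[i - 2] == d1 && deltas[i - 1] == d2) {
                uint64_t addr = last_address;
                for (std::size_t j = i; j < deltas.size(); ++j) {
                    addr += static_cast<uint64_t>(deltas[j]);
                    if (addr == last_prefetch) out.clear();   // pre-fetch filtering
                    else out.push_back(addr);
                }
                break;
            }
        }
        if (!out.empty()) last_prefetch = out.back();
        // A real pre-fetcher would also drop candidates already in the cache
        // or covered by an outstanding miss (MSHR) before issuing them.
        return out;
    }
};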

5.0 - Critical evaluation

5.1 - Increased cache interference

Pre-fetching may, however, lead to increased cache interference. For uniprocessors, there are two different ways in which prefetching can increase cache interference:

5.1.1 - A pre-fetched line can displace another cache line which would have been a hit under the original execution.

5.1.2 - A pre-fetched line can be removed from the cache by either an access or another pre-fetch before the processor has time to reference it. In the former case a pre-fetch generates another miss, while in the latter it cancels a pre-fetch.

For a multiprocessor, however, prefetching can also cause internode interference. This happens when invalidations generated by pre-fetches occurring at other nodes transform original local hits into misses, or cancel pre-fetched data before it can be referenced by the processor.

5.2 - Increased memory traffic

There are two reasons why this might occur in prefetching schemes. One is the prefetching of unnecessary data, and the other is the early displacement, and subsequent recall on demand, of the same useful data. These kinds of increases in memory traffic add to memory access latency. They may lead to performance degradation for processors that support bus-based multiprocessing; such systems do not support pre-fetching very well and suffer performance degradation, as shown and tested by Tullsen and Eggers [16, 17].

    6.0 - Comparisons and final thoughts

Starting with hardware-based prefetching, these schemes require some sort of hardware modification to the main processor. Their main advantage is that pre-fetches are handled dynamically at run time without compiler intervention. The drawbacks are that extra hardware resources are needed, that memory references for complex access patterns are difficult to predict, and that hardware prefetching tends to have a more negative effect on memory traffic.

In contrast, software-directed approaches require little to no hardware support. They rely on compiler technology to perform static program analysis and to selectively insert pre-fetch instructions. Because of this, they are less likely to pre-fetch unnecessary data and hence reduce cache pollution. The disadvantages are the overhead of the extra pre-fetch instructions and the fact that some useful pre-fetching opportunities, visible only at run time, cannot be uncovered by the compiler.

The conclusions provided above are based on the study by Chen and Baer [1]. Their results and observations were taken and considered as a baseline for the compilation of this paper.

7.0 - BIBLIOGRAPHY:

[1] Chen, T.-F. and Baer, J.-L. Effective hardware-based data prefetching for high-performance processors. IEEE Transactions on Computers, Vol. 44, No. 5, May 1995.

[2] Liu, Y. and Kaeli, D. R. Branch-directed and stride-based data cache prefetching. Proceedings of the International Conference on Computer Design, October 1996.

[3] A. K. Porterfield. Software methods for improvement of cache performance on supercomputer applications. Ph.D. Thesis, Rice University, 1989.

[4] D. M. Tullsen and S. J. Eggers. Effective Cache Prefetching on Bus-Based Multiprocessors. ACM Transactions on Computer Systems, 13, pp. 57-88, 1995.

[5] T. C. Mowry, M. S. Lam and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 1992.

[6] R. H. Saavedra, W. Mao and K. Hwang. Performance and Optimization of Data Prefetching Strategies in Scalable Multiprocessors. Journal of Parallel and Distributed Computing 22:3, pp. 427-448, 1994.

[7] Smith, A. J. Cache Memories. Computing Surveys, Vol. 14, No. 3, September 1982, pp. 473-530.

[8] Dahlgren, F., M. Dubois and P. Stenström. Fixed and Adaptive Sequential Prefetching in Shared-Memory Multiprocessors. Proc. International Conference on Parallel Processing, St. Charles, IL, August 1993, pp. I-56-63.

[9] Fu, J. W. C., J. H. Patel and B. L. Janssens. Stride Directed Prefetching in Scalar Processors. Proc. 25th International Symposium on Microarchitecture, Portland, OR, December 1992, pp. 102-110.

[10] Sklenar, I. Prefetch Unit for Vector Operations on Scalar Computers. Proc. 19th International Symposium on Computer Architecture, Gold Coast, Qld., Australia, May 1992.

[11] D. G. Perez, G. Mouchard, and O. Temam. Microlib: A case for the quantitative comparison of micro-architecture mechanisms. MICRO 37: Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, Washington, DC, USA, IEEE Computer Society, 2004.

[12] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. Proceedings of the International Symposium on High-Performance Computer Architecture, 2004.

[13] I-Cheng K. Chen, Chih-Chieh Lee, and Trevor N. Mudge. Instruction Prefetching Using Branch Prediction Information. EECS Department, University of Michigan, 1301 Beal Ave., Ann Arbor, Michigan 48109-2122.

[14] Kaeli, D. and Emma, P. G. Branch history table prediction of moving target branches due to subroutine returns. Proceedings of the 18th International Symposium on Computer Architecture, May 1991.

[15] Marius Grannaes, Magnus Jahre, and Lasse Natvig. Storage Efficient Hardware Prefetching using Delta-Correlating Prediction Tables. Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Saelandsvei 7-9, 7491 Trondheim, Norway.

[16] D. M. Tullsen and S. J. Eggers. Limitations of Cache Prefetching on a Bus-Based Multiprocessor. ACM Transactions on Computer Systems, 13, 1995.

[17] D. M. Tullsen and S. J. Eggers. Effective Cache Prefetching on Bus-Based Multiprocessors. ACM Transactions on Computer Systems, 13, 1995.
