ECSE 425 - Branch predictor for pipelined 8-bit MIPS machine - Report

ECSE 425 Computer Organisation and Architecture
Prof. Warren Gross
Group 7: Bénédicte Leonard-Cannon (260377592), Payom Meshgin (260431193)
Tuesday, April 16, 2013

FINAL REPORT: N-BIT LOCAL BRANCH PREDICTION

INTRODUCTION

The objective of this project was to study the effect of a dynamic, local n-bit branch predictor on the performance of a machine based on the MIPS architecture, compared with that of a static predict not-taken predictor. To simulate this machine, we used the EduMIPS64 software as a base. Because the simulator did not feature a dynamic predictor, we modified its source code to include our own. Two different prediction algorithms were implemented to determine the state of the dynamic predictor. We were fairly successful in implementing these features, although a few issues related to the original simulator were discovered during the testing stage, as described in the Post-Mortem section.

APPROACH

N-BIT PREDICTOR

The local n-bit predictor is the central modification to the simulator. The predictor selects a prediction scheme (i.e. predict taken or predict not taken) based on the outcome of previously encountered branches, unlike its static counterpart, which applies the same scheme every time a branch instruction is encountered. Initially, our local n-bit predictor predicts a not-taken branch, i.e., it anticipates that the condition under which the branch occurs will not be satisfied. The predictor can be set to one of two modes: a consecutive n-level counter or an n-bit saturating counter, where n represents the number of prediction bits.

In the first configuration, the prediction is left unchanged until n consecutive mispredictions occur, at which point the prediction scheme switches. The number of consecutive mispredictions is stored in a counter. When the counter reaches n, the predictor changes its scheme and resets the counter to 0; a correct prediction also resets the counter to 0. In other words, the predictor alternates between the two schemes after n successive mispredictions.

In the saturating counter mode, the predictor can be in one of 2^n states. There are two boundary states: strong predict taken (state 2^n - 1) and strong predict not-taken (state 0). In between these states lie a number of transitional states, which are traversed by a counter. For every taken branch the counter is incremented, while for every not-taken branch it is decremented. If the counter value is in the upper half of its possible range (i.e. its most significant bit is 1), the current branch is predicted taken. Conversely, if the value is in the lower half of the range, the branch is predicted not-taken.
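The two modes described above can be sketched in a few lines of Java. This is a minimal illustration rather than the project's actual OurPredictor.java: the class name NBitPredictorSketch, the Mode enum and the method names are hypothetical, and starting the saturating counter in state 0 is an assumption chosen to match the initial not-taken prediction.

```java
/** Hypothetical sketch of the two predictor modes; names are illustrative only. */
final class NBitPredictorSketch {
    enum Mode { CONSECUTIVE, SATURATING }

    private final Mode mode;
    private final int n;
    private boolean schemeTaken = false; // consecutive mode: current scheme, initially not-taken
    private int mispredictions = 0;      // consecutive mode: consecutive mispredictions so far
    private int counter = 0;             // saturating mode: 0 .. 2^n - 1 (assumed to start at 0)

    NBitPredictorSketch(Mode mode, int n) {
        if (n < 1 || n > 30) {
            throw new IllegalArgumentException("n must be between 1 and 30");
        }
        this.mode = mode;
        this.n = n;
    }

    /** Current prediction: true means "predict taken". */
    boolean predictTaken() {
        if (mode == Mode.CONSECUTIVE) {
            return schemeTaken;
        }
        // Upper half of the range, i.e. the most significant of the n bits is 1.
        return counter >= (1 << (n - 1));
    }

    /** Update the state with the actual outcome of the branch. */
    void update(boolean taken) {
        if (mode == Mode.CONSECUTIVE) {
            if (taken == schemeTaken) {
                mispredictions = 0;              // a correct prediction resets the counter
            } else if (++mispredictions == n) {
                schemeTaken = !schemeTaken;      // n consecutive mispredictions: switch scheme
                mispredictions = 0;
            }
        } else {
            int max = (1 << n) - 1;              // strong predict taken
            counter = taken ? Math.min(max, counter + 1)
                            : Math.max(0, counter - 1);
        }
    }
}
```

With n = 2, the consecutive mode needs two wrong predictions in a row before it switches scheme, while the saturating mode crosses from not-taken to taken once its counter moves from state 1 to state 2.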

Figure 1: n-level misprediction counter (left) and saturating counter (right) schemes for n = 2

PREDICT TAKEN

Since our predictor must choose whether or not a branch is predicted taken, the simulator must be able to behave correctly under each of these schemes. Unfortunately, the original simulator did not include a static predict taken branch predictor, so much of the work was focused on adding this feature. Simply put, under the predict taken scheme, the target instruction of the branch must be fed into the IF stage of the pipeline, as opposed to the fall-through instruction in the not-taken case.

GUI

USER INTERFACE

A few additional controls were added to the GUI of the original EduMIPS64 software. To facilitate switching between the default static predictor and our predictor, a checkbox was added to the main settings tab of the simulator. When it is unchecked, the default predictor is enabled; when it is checked, our predictor is enabled. A second checkbox selects between our n-level consecutive misprediction predictor (unchecked) and our n-bit saturating counter predictor (checked). Moreover, a text field was added to the same panel to set the number of prediction bits, permitting us to quickly change this value during the testing phase.

Note that clicking on the OK button is mandatory to activate the changes described above.

STATISTICS

To gather data on the performance of our simulator, extra statistics are displayed by the GUI. In particular, the number of branch not-taken stalls and the number of branches encountered were added to the list of stalls displayed in the statistics window of the simulator. The branch not-taken and branch taken stalls correspond to the number of stalls resulting from a misprediction under the not-taken and taken schemes respectively, while the number of misprediction stalls is the sum of the two.

IMPLEMENTATION

N-BIT PREDICTOR

Our predictor was implemented in a class of its own: OurPredictor.java. This class contains a method for updating the predictor status (updatePredictor(condition)), which is called in every subclass of Instruction.java corresponding to a branch instruction (BEQ.java, BNE.java, etc.). Based on the current mode of the predictor (n-level or saturating counter), the method updates the predictor's counter and changes the prediction scheme according to the boolean variable condition, which indicates whether the condition specified in the branch instruction was met.
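The call pattern described above, where each branch class reports the outcome of its condition to the predictor, might look roughly as follows. This is a sketch under assumptions: PredictorSketch is a hypothetical stand-in for OurPredictor.java, and BranchResolutionSketch models only the comparison a BEQ-like instruction would perform, not the real EduMIPS64 instruction classes.

```java
/** Hypothetical stand-in for the predictor; only the calls used below are modelled. */
interface PredictorSketch {
    boolean predictTaken();
    void updatePredictor(boolean condition);
}

/** Illustrative branch-resolution step, as a BEQ-like class might perform it. */
final class BranchResolutionSketch {
    int branches = 0;        // branches encountered so far
    int mispredictions = 0;  // branches whose prediction was wrong

    /**
     * Resolves a BEQ-style branch on two register values.
     * Returns true if the prediction was wrong, i.e. the IF stage must be restored.
     */
    boolean resolve(PredictorSketch predictor, long rs, long rt) {
        boolean condition = (rs == rt);               // BEQ: branch if equal
        boolean predicted = predictor.predictTaken(); // read the prediction before updating
        branches++;
        boolean mispredicted = (predicted != condition);
        if (mispredicted) {
            mispredictions++;
        }
        predictor.updatePredictor(condition);         // train on the actual outcome
        return mispredicted;
    }
}
```

Reading the prediction before calling updatePredictor matters: the comparison must use the prediction that was in effect when the branch was fetched.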

PREDICT TAKEN

The IF stage of CPU.java was modified to predict taken when advised to do so by our predictor. In that case, the offset of the current branch is fetched, then added to the program counter, so that the next instruction to enter the IF stage is the branch target. Additionally, the program counter value of the branch fall-through is stored so that, if the taken prediction turns out to be wrong, RestoreIF.java can feed the fall-through instruction back into IF (see next section for more details).

RESTORING THE IF STAGE

We created an additional class named RestoreIF.java, which is called when a misprediction is detected in one of the branch classes (BEQ.java, BNE.java, etc.). RestoreIF.java has two functions: SchemeTaken, which is called when a taken prediction was wrong, and SchemeNotTaken, which is called when a branch was erroneously predicted not taken. The former restores the fall-through address into the IF stage, while the latter feeds the branch target back into the IF stage.
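The fetch redirection and its recovery can be modelled with a few fields. The sketch below is illustrative only: FetchSketch is a hypothetical class, the 4-byte instruction size and the convention that the offset is a byte offset relative to the fall-through address are assumptions, and the methods schemeTaken and schemeNotTaken merely mirror the roles of the two RestoreIF.java functions.

```java
/** Hypothetical model of next-PC selection and misprediction recovery. */
final class FetchSketch {
    static final int INSTRUCTION_BYTES = 4; // assumption: fixed 4-byte instructions

    long pc;                // address of the next instruction to fetch
    long savedFallThrough;  // fall-through address of the last fetched branch
    long savedTarget;       // target address of the last fetched branch

    FetchSketch(long startPc) {
        this.pc = startPc;
    }

    /**
     * Called when a branch at address branchPc is fetched.
     * Assumption: offset is in bytes, relative to the fall-through address.
     */
    void fetchBranch(long branchPc, long offset, boolean predictTaken) {
        savedFallThrough = branchPc + INSTRUCTION_BYTES;
        savedTarget = savedFallThrough + offset;
        pc = predictTaken ? savedTarget : savedFallThrough;
    }

    /** A taken prediction was wrong: resume at the fall-through instruction. */
    void schemeTaken() {
        pc = savedFallThrough;
    }

    /** A not-taken prediction was wrong: resume at the branch target. */
    void schemeNotTaken() {
        pc = savedTarget;
    }
}
```

For example, a branch at address 100 with offset -20 fetches from 84 when predicted taken; if that prediction proves wrong, schemeTaken redirects fetch to 104.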

GUI

To implement the GUI changes described in the Approach section, three classes and one properties file had to be altered: Instruction.java, Config.java, GUIConfig.java and MessagesBundle_en.properties. In Instruction.java, boolean variables representing the GUI objects we implemented were added, as well as their respective getter and setter methods. In Config.java, simulation parameters were added for these objects. In GUIConfig.java, code for the checkboxes and text field was added. Finally, the properties file was modified to display the text corresponding to our GUI objects on the main settings tab.

STATISTICS

Methods to keep track of the statistics were implemented in the program (including the predicted taken stall and branch misprediction stall counters, which are part of the initial simulator but were not functional). To implement these values, we created a variable for each type of stall we were interested in, as well as getter and setter methods, in CPU.java. The number of branches encountered is incremented in every branch class (BEQ.java, BNE.java, etc.), while the numbers of taken and not-taken stalls are incremented in our RestoreIF.java class, which is called on every misprediction.
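As a rough illustration of these counters, the sketch below keeps one variable per statistic and derives the misprediction stalls as the sum of the taken and not-taken stalls. The class BranchStatsSketch and its method names are hypothetical (the real counters live in CPU.java), and passing the number of stall cycles as a parameter is an assumption made to keep the example independent of the pipeline's actual penalty.

```java
/** Hypothetical statistics holder; the real counters are fields of CPU.java. */
final class BranchStatsSketch {
    private int branches;        // branches encountered
    private int mispredictions;  // mispredicted branches
    private int takenStalls;     // stalls from wrong taken predictions
    private int notTakenStalls;  // stalls from wrong not-taken predictions

    /** Called by every branch instruction when it executes. */
    void branchEncountered() {
        branches++;
    }

    /** Called on a wrong taken prediction; stallCycles is supplied by the caller. */
    void mispredictedTaken(int stallCycles) {
        mispredictions++;
        takenStalls += stallCycles;
    }

    /** Called on a wrong not-taken prediction; stallCycles is supplied by the caller. */
    void mispredictedNotTaken(int stallCycles) {
        mispredictions++;
        notTakenStalls += stallCycles;
    }

    int getTakenStalls() { return takenStalls; }

    int getNotTakenStalls() { return notTakenStalls; }

    /** Misprediction stalls are the sum of the taken and not-taken stalls. */
    int getMispredictionStalls() { return takenStalls + notTakenStalls; }

    /** Fraction of encountered branches that were mispredicted (0 if none were seen). */
    double getMispredictionRate() {
        return branches == 0 ? 0.0 : (double) mispredictions / branches;
    }
}
```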

RESULTS

We initially ran tests on relatively complex programs taken from the EduMIPS64 samples (mySqrt, vet20parinum, etc.) to verify the correctness of the results obtained from the simulator (registers and data). All programs returned the same values on both the original and our modified simulators, confirming the correct operation of the modified version. Then, we created our own set of simple tests to verify that the pipeline was functioning correctly based on the expected number of clock cycles, branch mispredictions and other metrics we implemented. These tests included static for loops and nested for loops, whose behaviour can be easily determined. Finally, we modified the EduMIPS samples to extend the number of computations performed on these programs to obtain more global data on programs with complex branching behaviour.

IMPLEMENTED TESTS

We have used the following test programs to observe the performance of the different configurations of our predictors compared to the original static not-taken predictor:

1. Static for loop implemented using branch equal (BEQ);
2. Static for loop implemented using branch not equal (BNE);
3. Static nested for loops (2) implemented using BNE;
4. Static nested for loops (2) implemented using BEQ;
5. Extended EduMIPS sample: mins2 (finds the minimum of a vector);
6. Extended EduMIPS sample: isort (insertion sort of a vector);
7. Extended EduMIPS sample: mysqrt (identifies complete squares and computes their square root);
8. Extended EduMIPS sample: vet20parinum (squares or subtracts 1 from a number depending on whether or not the number is less than 20);
9. Extended EduMIPS sample: copyvet1_10 (copies a vector element into another vector for elements between 1 and 10);
10. Extended EduMIPS sample: copyvet50disp (copies the inverse of a vector into another and squares the entries that are more than or equal to 50 and even).

RESULTS OF TESTS

In all prediction configurations, all test programs stored correct data in the registers and memory of our simulated machine. As for performance benchmarks, statistics were recorded after every test to observe the performance of the simulator in different configurations. A summary of our test results is given below; all raw data is stored in an Excel spreadsheet included in the project submission.

First, we quantified the performance of the predictors using the CPI (clock cycles per instruction). As seen in Figure 2, the CPI reaches a minimum at n = 2 in both configurations, with average CPIs of 1.751 and 1.732, representing improvements of 5.7% and 6.9% respectively over the default static not-taken predictor. It is interesting to note that beyond 6 bits for the n-level consecutive predictor, or 8 bits for the saturating counter, the prediction algorithms start behaving similarly to the static predict not-taken predictor.

Next, we looked at the misprediction rate returned by the simulator after running each test. Averaging the misprediction rates of all programs running under the same configuration gives the plot in Figure 3. Once again, the optimal number of prediction bits is n = 2, which yields misprediction rates of 15.1% for the consecutive n-level mode and 15.4% for the n-bit saturating counter mode. Both configurations were more accurate than the static predictor, which had a misprediction rate of 39.5%.

Finally, the last metric of note, memory size, was not measured in the code, but it can be determined quite easily: as the number of prediction bits in a particular configuration of the predictor increases, so do the hardware requirements (bigger counters).
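The CPI figures quoted above can be related by a short calculation. The block below is a sketch that assumes the improvement percentages are measured relative to the static predictor's CPI; the baseline value is not stated in this report and is only inferred here from the quoted numbers.

```latex
\[
\mathrm{CPI} = \frac{\text{clock cycles}}{\text{instructions executed}},
\qquad
\text{improvement} = \frac{\mathrm{CPI}_{\text{static}} - \mathrm{CPI}_{n}}{\mathrm{CPI}_{\text{static}}}
\]
% Solving for the baseline from each quoted pair (assumed definition of improvement):
\[
\mathrm{CPI}_{\text{static}} \approx \frac{1.751}{1 - 0.057} \approx 1.857,
\qquad
\mathrm{CPI}_{\text{static}} \approx \frac{1.732}{1 - 0.069} \approx 1.860
\]
```

Both pairs imply a static CPI of roughly 1.86, so the two quoted improvements are mutually consistent under this definition.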

COMMENT ON THE TESTS

Based on our tests, we determined that our modifications did improve the performance of the simulator for small values of n. Moreover, on average, optimal results were observed for n=2, which confirms the theory discussed in class that 2-bit prediction delivers the most performance gain.

Figure 2: Average CPI of test programs over number of prediction bits (x-axis: number of prediction bits, 1 to 10; y-axis: average CPI; series: static predict not-taken, n-bit saturating counter, n-level consecutive)

Figure 3: Misprediction rate over number of prediction bits (x-axis: number of prediction bits, 1 to 10; y-axis: misprediction rate; series: static predict not-taken, n-bit saturating counter, n-level consecutive)

POST-MORTEM

SUCCESSFUL IMPLEMENTATION

We compared the data and register values obtained after running several samples on our simulator and on the original one, and found that our two predictors do not affect the final outputs. We also calculated the number of mispredictions (taken and not taken) expected for simple programs; these coincided with the values returned by the simulator, with two exceptions caused by simulator bugs (see Problems with EduMIPS below). We therefore consider our predictor to have been successfully implemented. As we expected, n = 2 gave the best predictor in most cases, compared both with other values of n and with predict not-taken. We also observed that, in general, the performance of our predictor decreased as n increased past two and converged towards that of the predict not-taken scheme. This result is consistent with the fact that as n increases, the probability of changing scheme decreases; the higher the value of n, the more the predictor behaves like a static one.

PROBLEMS WITH EDUMIPS

After running test3.s and test4.s, which contain two nested loops, we realized that our predictor was not returning the number of stalls that we were expecting based on our calculations. We then looked at the cycles displayed on the GUI and noticed very unusual program behaviour (superimposed IF and ID stages), as shown in Figure 4. To ensure that our modifications were not the cause of this bug, we ran the same two test files on the original simulator and obtained the same results. We therefore concluded that the EduMIPS64 simulator was unreliable in these cases and could cause errors in our test results.

FUTURE IMPROVEMENTS

For this project, we supplemented the EduMIPS64 simulator with an n-bit dynamic predictor. We ran various tests on our implementation and observed that optimal performance occurred with two prediction bits, consistent with the theory. In the middle of the project, we had contemplated modifying our predictor so that it would work with forwarding enabled. However, due to time constraints, as well as the abnormal behaviour shown in Figure 4, we decided against it. Moreover, it would not be difficult to implement more complex dynamic branch predictors such as a correlating predictor or a tournament predictor. The only required modification is that the program counter of the current branch instruction would have to be passed as a parameter to these two predictors. In the end, we enjoyed tinkering with the simulator and observing the effect of our individual modifications. The problems encountered due to EduMIPS64 notwithstanding, we gained a deeper familiarity with the MIPS 5-stage pipeline.

Figure 4: Abnormal behaviour caused by running test3.s on the original simulator
