Branch Predictor Design for AE64000
Lynn Choi
Department of Electronics and Computer Engineering
Korea University
Session: 5D Paper: 8
Motivation
Demand for high-performance embedded processors
– High-end embedded applications
– Many uses of embedded processors
Addition of a branch predictor
– To achieve higher performance
– The most cost-effective method
AE64000 Characteristics
IFU to minimize the performance decrease caused by LERI instructions
– Two additional pipeline stages (IFU1 + IFU2) to eliminate LERI instructions
– 3 line buffers to store 12 instructions
– PrePC in the IFU and PC in the pipeline core
Branch misprediction penalty: 3 cycles
Branch Predictor Design for AE64000
Issues in branch predictor design for AE64000
AE64000 has two additional stages (IFU1 and IFU2) in front of the 5-stage pipeline core.
At which pipeline stage should prediction be performed?
– IFU1 stage
Because of the line buffers in the IFU, predicted target addresses must be buffered as well so that branch prediction results can be verified.
– Buffers for predicted branch target addresses (PTAB) are needed
Since 4 instructions are fetched at a time, multiple branches can be fetched at once.
– Only the first taken branch is predicted.
– To make this possible, the TAC holds the precise target address.
Branch misprediction penalty
– Can be reduced from 3 to 2 cycles by adding a MUX in the IFU so that PPC is updated in the same cycle as PC
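The rule that only the first taken branch in a fetch group is predicted can be sketched as follows. This is a minimal illustration, not the AE64000 logic: the function name, the dictionary standing in for the TAC, the `predict_taken` callback, and the 2-byte instruction size are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional, Tuple

FETCH_WIDTH = 4   # from the slides: 4 instructions are fetched at a time
INSTR_BYTES = 2   # assumption: fixed instruction size, used only to form addresses


def predict_fetch_group(
    group_pc: int,
    tac: Dict[int, int],
    predict_taken: Callable[[int], bool],
) -> Tuple[int, Optional[int]]:
    """Return (next fetch address, slot of the predicted-taken branch or None).

    `tac` is a hypothetical stand-in for the target address cache:
    it maps a branch address to its stored target address.
    """
    for slot in range(FETCH_WIDTH):
        pc = group_pc + slot * INSTR_BYTES
        target = tac.get(pc)
        # Scan in program order and stop at the first branch predicted taken;
        # later branches in the same group are ignored.
        if target is not None and predict_taken(pc):
            return target, slot
    # No predicted-taken branch: continue with the next sequential group.
    return group_pc + FETCH_WIDTH * INSTR_BYTES, None
```

Scanning in program order matters here: a later branch in the group is only reached if every earlier branch falls through, so redirecting on the first predicted-taken branch is sufficient for one fetch group.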
Branch Predictor for AE64000
Separate BPT with TAC
PTAB to store predicted target address for instructions in the line buffer
Branch prediction verification in the ID stage
Predicted Target Address Buffer
Predicted Target Address Buffer (PTAB)
– For branch instructions in the line buffer
– When a branch instruction is sent to the pipeline core, the corresponding predicted target address is sent with it
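One way to picture the PTAB is as a FIFO that travels alongside the line buffers, with the check against the real outcome done later in the ID stage. The sketch below is hypothetical: the class and function names, the capacity of 12 (chosen to match the 12 buffered instructions), the use of `None` for "predicted not taken", and the 2-byte instruction size are assumptions, not details from the slides.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class PTAB:
    """FIFO of (branch pc, predicted target) pairs for branches in the line buffers."""

    def __init__(self, capacity: int = 12):  # assumption: one slot per buffered instruction
        self.capacity = capacity
        self.entries: Deque[Tuple[int, Optional[int]]] = deque()

    def push(self, pc: int, predicted_target: Optional[int]) -> None:
        """Record a prediction when the branch enters the line buffers."""
        if len(self.entries) >= self.capacity:
            raise OverflowError("PTAB is full")
        self.entries.append((pc, predicted_target))

    def pop_for(self, pc: int) -> Optional[int]:
        """Return the prediction for the branch being sent to the pipeline core."""
        entry_pc, target = self.entries.popleft()
        if entry_pc != pc:
            raise ValueError("PTAB out of sync with the line buffers")
        return target

    def flush(self) -> None:
        """Discard all pending predictions, e.g. after a misprediction."""
        self.entries.clear()


def verify_in_id(pc: int, predicted_target: Optional[int],
                 actual_taken: bool, actual_target: int,
                 instr_bytes: int = 2) -> bool:
    """Compare the predicted next address with the resolved one."""
    fallthrough = pc + instr_bytes
    predicted_next = predicted_target if predicted_target is not None else fallthrough
    actual_next = actual_target if actual_taken else fallthrough
    return predicted_next == actual_next
```

Comparing next addresses rather than only the taken/not-taken bit also catches the case where the direction was right but the stored target was stale.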
Simulation Environment
Developed a cycle-accurate AE64000 simulator
– Simulated 1 billion instructions: 30 minutes on a P4 1.6 GHz with 512 MB RAM
– Indirect branches are not predicted in the simulation
Input: AE64000 compiler binary, memory and predictor configuration parameters
Output: IPC, BPT/TAC hit ratios, etc.
Benchmarks
– SPECint95 (compress, go)
– Dhrystone
– Whetstone
Predictors tested
– Last-time predictor
– Bimodal predictor
– G-share predictor
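For readers unfamiliar with the three schemes, the following sketch shows textbook versions of each: a last-time predictor keeps one bit per entry, a bimodal predictor keeps a 2-bit saturating counter per entry, and g-share indexes the counters with the branch address XORed with a global history register. Table sizes, the index function, the initial counter values, and the `accuracy` helper are assumptions for illustration and do not reflect the configurations simulated for the AE64000.

```python
class LastTimePredictor:
    """One bit per entry: predict whatever the branch did last time."""

    def __init__(self, entries: int = 256):  # entries must be a power of two
        self.mask = entries - 1
        self.table = [False] * entries

    def _index(self, pc: int) -> int:
        return (pc >> 1) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)]

    def update(self, pc: int, taken: bool) -> None:
        self.table[self._index(pc)] = taken


class BimodalPredictor:
    """2-bit saturating counter per entry; values 2 and 3 predict taken."""

    def __init__(self, entries: int = 256):
        self.mask = entries - 1
        self.table = [1] * entries  # start at weakly not-taken

    def _index(self, pc: int) -> int:
        return (pc >> 1) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        counter = self.table[i]
        self.table[i] = min(3, counter + 1) if taken else max(0, counter - 1)


class GsharePredictor(BimodalPredictor):
    """Bimodal counters indexed by (pc XOR global branch history)."""

    def __init__(self, entries: int = 256, history_bits: int = 8):
        super().__init__(entries)
        self.history = 0
        self.history_mask = (1 << history_bits) - 1

    def _index(self, pc: int) -> int:
        return ((pc >> 1) ^ self.history) & self.mask

    def update(self, pc: int, taken: bool) -> None:
        super().update(pc, taken)  # uses the history seen at prediction time
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def accuracy(predictor, trace) -> float:
    """Fraction of correct predictions over a list of (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)
```

The cost difference is visible in the state each scheme keeps: one bit per entry for last-time, two bits per entry for bimodal, and two bits per entry plus a history register and XOR logic for g-share.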
Simulator Block Diagram
Simulation Results
Without branch predictor (IPC)
Classification   3-cycle misprediction penalty   2-cycle misprediction penalty
Compress         0.6787                          0.7429
Go               0.6905                          0.7322
Dhrystone        0.6500                          0.7200
Whetstone        0.5569                          0.6521
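The relative IPC gain from shortening the penalty can be computed directly from the table. The snippet below only restates that arithmetic; the variable and function names are made up for the example, and the unweighted mean across benchmarks is one possible summary, not necessarily the one used by the author.

```python
# IPC without a branch predictor: (3-cycle penalty, 2-cycle penalty), from the table
ipc = {
    "Compress": (0.6787, 0.7429),
    "Go": (0.6905, 0.7322),
    "Dhrystone": (0.6500, 0.7200),
    "Whetstone": (0.5569, 0.6521),
}


def relative_gain(before: float, after: float) -> float:
    """Percentage improvement of `after` over `before`."""
    return (after - before) / before * 100.0


gains = {name: relative_gain(before, after) for name, (before, after) in ipc.items()}
mean_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name:10s} {gain:5.1f}%")
print(f"{'Mean':10s} {mean_gain:5.1f}%")
```

The per-benchmark gains range from roughly 6% (Go) to roughly 17% (Whetstone), with an unweighted mean close to 11%.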
Simulation Results
Last-time branch predictor
Simulation Results (cont’d)
Bimodal Branch Predictor
Simulation Results (cont’d)
G-share Branch Predictor
Conclusion
Simulation result analysis
– Consider both performance and area
– The additional performance gains from the g-share and bimodal predictors are negligible compared to their size and complexity.
Final design
Last-time predictor with a 4-way set-associative, 8-entry TAC with LRU replacement
– IPC is improved by 10% by reducing the branch misprediction penalty from 3 to 2 cycles
– Additional 15% IPC improvement from the branch predictor
About 11,500 gates (about 2.64% of the area) in the Verilog HDL model
– Thus, the performance of AE64000 can be improved by 25% at less than 3% area cost
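A 4-way set-associative, 8-entry TAC has 8 / 4 = 2 sets, so one address bit selects the set and LRU picks the victim among the 4 ways. The sketch below models that organization in software. The class name, the choice of index bit, the use of the full branch address as the tag, and the hit counters are assumptions for illustration; the slides do not give these details.

```python
from collections import OrderedDict
from typing import List, Optional


class TAC:
    """Set-associative target address cache with LRU replacement (software model)."""

    def __init__(self, entries: int = 8, ways: int = 4):
        self.ways = ways
        self.num_sets = entries // ways  # 8 entries / 4 ways = 2 sets
        # Each set maps branch address -> target; insertion order tracks recency.
        self.sets: List[OrderedDict] = [OrderedDict() for _ in range(self.num_sets)]
        self.lookups = 0
        self.hits = 0

    def _set_for(self, pc: int) -> OrderedDict:
        # Assumption: drop the lowest address bit, then index with the next bits.
        return self.sets[(pc >> 1) % self.num_sets]

    def lookup(self, pc: int) -> Optional[int]:
        """Return the stored target for `pc`, or None on a miss."""
        self.lookups += 1
        entries = self._set_for(pc)
        if pc in entries:
            entries.move_to_end(pc)  # mark as most recently used
            self.hits += 1
            return entries[pc]
        return None

    def update(self, pc: int, target: int) -> None:
        """Insert or refresh the target for a taken branch."""
        entries = self._set_for(pc)
        if pc in entries:
            entries.move_to_end(pc)
        elif len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used way
        entries[pc] = target

    def hit_ratio(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0
```

With only two sets, most of the conflict handling falls to the 4 ways and the LRU policy, which is a plausible fit for the small loop-dominated branch working sets of embedded code.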