Simulation of Decode Filter Cache using SimpleScalar simulator
Presented by Fei Hong
Motivation & Goals
• Instruction fetches and decodes are the major on-chip power consumers
• Optimize the power consumption by reducing instruction fetches and decodes
• Simulate the DFC architecture using SimpleScalar
• Test the performance of the DFC
Prediction Mechanism
Each sector in the DFC has the following fields:
(tag, sector_valid, next_address)
• If A is not equal to C, a different control path will be taken:
tag(A) != tag(C) (1)
• A and B are consecutively accessed. If they belong to a small loop:
tag(A) == tag(B) (2)
• Based on (1) and (2), the prediction for the next fetch is:
tag(C) == tag(B) (3)
[Figure: DFC sector layout with Tag, Data, Valid bits, and Next Address fields. Sector X holds A and later C; sector Y holds B.]
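The sector fields above can be sketched in a few lines of Python. This is an illustrative sketch, not the SimpleScalar implementation: it assumes that next_address records the address fetched after the sector was last used, and that the next fetch is predicted to hit in the DFC when the sector indexed by next_address is valid and its tag matches. The helper names (Sector, index_of, tag_of, predict_dfc_hit) are hypothetical, and the sector geometry is taken from the memory hierarchy parameters used in the simulation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Geometry from the memory hierarchy parameters: 4B instructions,
# 4 decoded instructions per sector, 32 direct-mapped sectors.
INSTR_SIZE = 4
INSTRS_PER_SECTOR = 4
NUM_SECTORS = 32
SECTOR_BYTES = INSTR_SIZE * INSTRS_PER_SECTOR  # instruction bytes covered by one sector


@dataclass
class Sector:
    """One DFC sector: (tag, sector_valid, next_address)."""
    tag: int = 0
    sector_valid: bool = False
    next_address: Optional[int] = None  # assumed: address fetched after this sector last time


def index_of(addr: int) -> int:
    """Direct-mapped sector index for an instruction address (assumed layout)."""
    return (addr // SECTOR_BYTES) % NUM_SECTORS


def tag_of(addr: int) -> int:
    """Tag bits for an instruction address (assumed layout)."""
    return addr // (SECTOR_BYTES * NUM_SECTORS)


def predict_dfc_hit(dfc: List[Sector], addr: int) -> bool:
    """Predict whether the fetch that follows `addr` will hit in the DFC.

    Assumption: the prediction is "hit" only if the current sector is valid
    for `addr`, has a recorded next_address, and the sector that
    next_address maps to is valid with a matching tag.
    """
    cur = dfc[index_of(addr)]
    if not cur.sector_valid or cur.tag != tag_of(addr) or cur.next_address is None:
        return False
    nxt = dfc[index_of(cur.next_address)]
    return nxt.sector_valid and nxt.tag == tag_of(cur.next_address)


if __name__ == "__main__":
    dfc = [Sector() for _ in range(NUM_SECTORS)]
    a, b = 0x1000, 0x1010  # two consecutive sectors, as in a small loop
    dfc[index_of(a)] = Sector(tag_of(a), True, b)
    dfc[index_of(b)] = Sector(tag_of(b), True, None)
    print(predict_dfc_hit(dfc, a))
```

When the prediction is false, the next fetch would go to the I-cache instead of the DFC, which is the choice the working process makes on each fetch.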
Working Process
[Figure: working process diagram with steps numbered 1 to 3. Labels: fetch address, NFPT, last_table_entry, next_fetch_src, predict, update, "Fetch from DFC or I-cache".]
The Platform
• Host computer: ACPI x86-based PC
• Host computer operating system: Microsoft Windows Vista Ultimate
• Virtual machine: VMware Workstation version 6.03
• Linux operating system: Fedora Core 6
• Simulator: SimpleScalar version 3.0
Work done so far
• Set up the platform
• Read the SimpleScalar source code
• Applied my DFC structure and working process to SimpleScalar
• Found benchmarks and compiled them on the platform
• Ran simulations using the given memory hierarchy parameters
MiBench
• dijkstra: it constructs a large graph in an adjacency matrix representation and then calculates the shortest path between every pair of nodes using repeated applications of Dijkstra’s algorithm.
• stringsearch: it searches for given words in phrases using a case-insensitive comparison algorithm.
• rijndael encrypt/decrypt: Rijndael is the block cipher that was selected as the National Institute of Standards and Technology's Advanced Encryption Standard (AES).
• CRC32: This benchmark performs a 32-bit Cyclic Redundancy Check (CRC) on a file. CRC checks are often used to detect errors in data transmission.
Memory hierarchy parameters
Parameter | Value
Instr. size | 4B
DFC | direct-mapped, 32 sectors, 4 decoded instr. per sector, 8B per decoded instr.
L1 I-cache | 16KB, 2-way, 32B line, 1-cycle hit latency
L1 D-cache | 8KB, 2-way, 32B line, 1-cycle hit latency
Memory | 30-cycle latency
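The DFC parameters imply a very small structure. The short calculation below works out its size from the values in the table. The tag/index/offset split is an assumption (the conventional direct-mapped layout), and the function name split_address is hypothetical.

```python
# Values from the parameter table.
INSTR_SIZE = 4            # bytes per instruction
NUM_SECTORS = 32          # direct-mapped sectors in the DFC
INSTRS_PER_SECTOR = 4     # decoded instructions per sector
DECODED_INSTR_SIZE = 8    # bytes per decoded instruction
ICACHE_BYTES = 16 * 1024  # L1 I-cache capacity

# Storage for decoded instructions in the DFC.
decoded_bytes = NUM_SECTORS * INSTRS_PER_SECTOR * DECODED_INSTR_SIZE

# Amount of original (undecoded) code the DFC can hold at once.
sector_code_bytes = INSTRS_PER_SECTOR * INSTR_SIZE
covered_code_bytes = NUM_SECTORS * sector_code_bytes

# Assumed address split: low bits select the byte within a sector,
# the next bits select the sector, the rest form the tag.
offset_bits = sector_code_bytes.bit_length() - 1
index_bits = NUM_SECTORS.bit_length() - 1


def split_address(addr: int):
    """Return (tag, index, offset) for an instruction address (assumed layout)."""
    offset = addr & (sector_code_bytes - 1)
    index = (addr >> offset_bits) & (NUM_SECTORS - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    print("decoded storage (bytes):", decoded_bytes)
    print("code covered (bytes):", covered_code_bytes)
    print("I-cache / covered code:", ICACHE_BYTES // covered_code_bytes)
    print("offset bits, index bits:", offset_bits, index_bits)
    print("split of 0x1234:", split_address(0x1234))
```

With these parameters the DFC stores 1024 bytes of decoded instructions, covering 512 bytes of code, which is 1/32 of the 16KB L1 I-cache capacity.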
Simulation results
[Figure: bar chart of the % reduction in instruction fetches and decodes (y-axis 0 to 100) for dijkstra, stringsearch, rijndael, and CRC32.]
Simulation results
[Figure: bar chart of the prediction hit rate (y-axis 97 to 100) for dijkstra, stringsearch, rijndael, and CRC32.]
Simulation results
Statistic | dijkstra | stringsearch | rijndael | CRC32
sim_num_insn | 255620304 | 4437612 | 391487315 | 533385529
il1.accesses | 43508918 | 1605417 | 236160209 | 972328
il1.hits | 43399500 | 1568976 | 228694324 | 971600
il1.misses | 109418 | 36441 | 7465885 | 728
il1.miss_rate | 0.0025 | 0.0227 | 0.0316 | 0.0007
dfc.accesses | 215740165 | 3269067 | 232531480 | 532674172
dfc.hits | 212111386 | 2832195 | 155327106 | 532413201
dfc.misses | 3628779 | 436872 | 77204374 | 260971
dfc.miss_rate | 0.0168 | 0.1336 | 0.3320 | 0.0005
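The derived rows of the table can be recomputed from the raw counters. The sketch below does this in Python with the counter values copied from the table. It also reports dfc.hits divided by sim_num_insn for each benchmark; reading that ratio as the share of instructions served from the DFC is an interpretation of the counters, not a definition given in the slides. The names STATS, miss_rate, dfc_hit_share, and summarize are hypothetical.

```python
# Raw counters copied from the results table, one entry per benchmark.
STATS = {
    "dijkstra": dict(sim_num_insn=255620304, il1_accesses=43508918, il1_misses=109418,
                     dfc_accesses=215740165, dfc_hits=212111386, dfc_misses=3628779),
    "stringsearch": dict(sim_num_insn=4437612, il1_accesses=1605417, il1_misses=36441,
                         dfc_accesses=3269067, dfc_hits=2832195, dfc_misses=436872),
    "rijndael": dict(sim_num_insn=391487315, il1_accesses=236160209, il1_misses=7465885,
                     dfc_accesses=232531480, dfc_hits=155327106, dfc_misses=77204374),
    "CRC32": dict(sim_num_insn=533385529, il1_accesses=972328, il1_misses=728,
                  dfc_accesses=532674172, dfc_hits=532413201, dfc_misses=260971),
}


def miss_rate(misses: int, accesses: int) -> float:
    """Miss rate = misses / accesses."""
    return misses / accesses


def dfc_hit_share(s: dict) -> float:
    """DFC hits as a fraction of simulated instructions (interpretation, see lead-in)."""
    return s["dfc_hits"] / s["sim_num_insn"]


def summarize(stats: dict = STATS) -> dict:
    """Return {benchmark: (il1 miss rate, dfc miss rate, dfc hit share)}."""
    rows = {}
    for name, s in stats.items():
        rows[name] = (
            miss_rate(s["il1_misses"], s["il1_accesses"]),
            miss_rate(s["dfc_misses"], s["dfc_accesses"]),
            dfc_hit_share(s),
        )
    return rows


if __name__ == "__main__":
    for name, (il1_mr, dfc_mr, share) in summarize().items():
        print(f"{name:13s} il1.miss_rate={il1_mr:.4f} "
              f"dfc.miss_rate={dfc_mr:.4f} dfc.hits/insn={share:.4f}")
```

In all four columns the counters satisfy dfc.hits + il1.accesses = sim_num_insn, which is consistent with each instruction being supplied either by a DFC hit or by an I-cache access.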
Conclusion
• The DFC stores decoded instructions and can be very small and energy-efficient.
• Use of the DFC eliminates both the access to a much larger instruction cache and the entire decoding step.
• The simulation results show that most instruction fetches and decodes can be eliminated by using the DFC, which makes it an efficient way to reduce the power consumption of embedded processors.
Thank you!