Flexicache: Software-based Instruction Caching for Embedded Processors
Jason E. Miller and Anant Agarwal, Raw Group, MIT CSAIL
Hardware Instruction Caches
• Used in virtually all high-performance general-purpose processors
• Good performance
  – Decreases average memory access time
• Easy to use
  – Transparent operation
[Diagram: on-chip processor with an I-cache, connected to off-chip DRAM]
ICache-less Processors
• Embedded procs and DSPs
  – TMS470, ADSP-21xx, etc.
• Embedded multicore processors
  – IBM Cell SPE
[Diagram: on-chip processor with a plain SRAM instruction memory, connected to off-chip DRAM]
• No special-purpose hardware
  – Less design/verification time
  – Less area
  – Shorter cycle time
  – Less energy per access
  – Predictable behavior
• Much harder to program!
  – Must manually partition code and transfer pieces from DRAM
Software-based I-Caching
• Use a software system to virtualize instruction memory by recreating hardware cache functionality
• Automatic management of simple SRAM memory
  – Good performance with no extra programming effort
• Integrated into each individual application
  – Customized to the program’s needs
  – Can optimize for different goals
  – Real-time predictability
• Maintains low-cost, high-speed hardware
Flexicache System Overview
[Diagram: the programmer’s Original Binary passes through the Binary Rewriter to produce a Rewritten Binary; the Linker combines it with the Runtime Library to produce the Flexicache Binary, which runs out of the processor’s I-mem with backing DRAM]
Binary Rewriter
• Break up the user program into cache blocks
• Modify control flow that leaves the blocks
[Diagram: the Binary Rewriter transforms the binary and links it against the Flexicache runtime]
Rewriter: Details
• One basic block in each cache block, but…
  – Fixed size of 16 instructions
    • Simplifies bookkeeping
    • Requires padding of small blocks and splitting of large ones
• Control-flow instructions that leave a block are modified to jump to the runtime system
  – E.g. BEQ $2,$3,foo → JEQL $2,$3,runtime
  – Original destination addresses are stored in a table
  – Fall-through jumps are added at the end of blocks
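The fixed-size blocking above can be sketched as follows (a minimal illustration, not the actual rewriter; the instruction lists and NOP placeholder are hypothetical, and control-flow rewriting is elided):

```python
# Sketch of fixed-size cache-block formation: each basic block is split
# into 16-instruction cache blocks, and the last one is padded with NOPs.
BLOCK_SIZE = 16

def make_cache_blocks(basic_block):
    """Split a basic block (list of instructions) into padded cache blocks."""
    blocks = []
    for i in range(0, len(basic_block), BLOCK_SIZE):
        chunk = basic_block[i:i + BLOCK_SIZE]   # split large blocks
        chunk += ["nop"] * (BLOCK_SIZE - len(chunk))  # pad small ones
        blocks.append(chunk)
    return blocks

bb = [f"insn{n}" for n in range(20)]   # a 20-instruction basic block
blocks = make_cache_blocks(bb)
print(len(blocks))                     # 2 cache blocks
print(blocks[1].count("nop"))          # 12 NOPs of padding
```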
Runtime: Overview
• Stays resident in I-mem
• Receives requests from cache blocks
• Checks whether the requested block is resident
• Loads the new block from DRAM if necessary
  – Evicts blocks to make room
• Transfers control to the new block
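The dispatch path above can be sketched as follows (illustrative only; the block names, capacity, and data structures are hypothetical, and the real runtime works in I-mem, not Python dictionaries):

```python
# Sketch of the runtime's dispatch path: look up the requested block,
# load it from "DRAM" on a miss (evicting if I-mem is full), then
# "transfer control" by returning the loaded code.
I_MEM_CAPACITY = 4            # blocks that fit in I-mem (tiny for demo)

dram = {f"B{i}": f"<code {i}>" for i in range(10)}   # backing store
resident = {}                 # block id -> loaded code
load_order = []               # load order, used for eviction

def dispatch(block_id):
    if block_id not in resident:             # miss
        if len(resident) >= I_MEM_CAPACITY:  # evict to make room
            victim = load_order.pop(0)
            del resident[victim]
        resident[block_id] = dram[block_id]  # fetch from DRAM
        load_order.append(block_id)
    return resident[block_id]                # transfer control

for b in ["B0", "B1", "B2", "B3", "B4", "B0"]:
    dispatch(b)
print(sorted(resident))   # ['B0', 'B2', 'B3', 'B4']
```

Loading B4 evicts the oldest block (B0), which must then be re-fetched on its next request.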
Runtime Operation
Loaded Cache Blocks
Miss Handler
DRAM
Block 0
Block 1
Block 2
Block 3
…
request
replyBlock 2
branchfall-thru
JR
RuntimeSystem
Entry Point 1
Entry Point 2
Indirect EP
System Policies and Mechanisms
• Fully-associative cache block placement
• Replacement Policy: FIFO
  – Evict the oldest block in the cache
  – Matches sequential execution
• Pinned functions
  – Key feature for timing predictability
  – No cache overhead within the function
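FIFO eviction combined with pinning can be sketched as follows (a hypothetical illustration; the block names and pinned set are invented):

```python
# Sketch of FIFO victim selection with pinned functions: the oldest
# non-pinned block is evicted, so pinned code is never thrown out and
# incurs no cache overhead once loaded.
from collections import deque

pinned = {"isr_handler"}          # e.g. a time-critical routine
fifo = deque(["isr_handler", "A", "B", "C"])   # oldest first

def pick_victim(fifo, pinned):
    """Return (and remove) the oldest resident block that is not pinned."""
    for block in fifo:
        if block not in pinned:
            fifo.remove(block)
            return block
    raise RuntimeError("all resident blocks are pinned")

print(pick_victim(fifo, pinned))  # 'A' -- isr_handler is skipped
```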
Experimental Setup
• Implemented for a tile in the Raw multicore processor
  – Similar to many embedded processors
  – 32-bit single-issue in-order MIPS pipeline
  – 32 kB SRAM I-mem
• Raw simulator
  – Cycle-accurate
  – Idealized I/O model
  – SRAM I-mem or traditional hardware I-cache models
  – Uses Wattch to estimate energy consumption
• Mediabench benchmark suite
  – Multimedia applications for embedded processors
Baseline Performance
[Bar chart: Flexicache overhead, 0–900%, for adpcm, epic, g721, gsm, jpeg, mesa, mpeg2, pegwit, and rasta under the baseline system]
Overhead: number of additional cycles relative to a 32 kB, 2-way HW cache
Basic Chaining
• Problem: Hit case in runtime system takes about 40 cycles
[Diagram: without chaining, every transfer between blocks A–D goes through the runtime system; with chaining, blocks jump directly to one another after the first miss]
• Solution: Modify jump to runtime system so that it jumps directly to loaded code the next time
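The chaining idea can be sketched as follows (illustrative only; a jump site is modeled as a mutable cell rather than a patched instruction, and the loader is elided):

```python
# Sketch of basic chaining: the first execution of a jump traps to the
# runtime, which then patches the jump site to point directly at the
# destination block; later executions skip the runtime entirely.
runtime_calls = 0

def runtime_lookup(dest):
    global runtime_calls
    runtime_calls += 1
    return f"<loaded {dest}>"     # locate/load the block (elided)

class JumpSite:
    def __init__(self, dest):
        self.dest = dest
        self.target = None        # None => still points at the runtime

    def execute(self):
        if self.target is None:   # miss path: ask runtime, then chain
            self.target = runtime_lookup(self.dest)
        return self.target        # chained: direct jump, no runtime call

j = JumpSite("blockB")
for _ in range(3):
    j.execute()
print(runtime_calls)              # 1 -- only the first execution traps
```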
Basic Chaining Performance
[Bar chart: Flexicache overhead, 0–900%, per benchmark: Baseline vs. Basic Chaining]
Basic Chaining Performance
[Bar chart: Flexicache overhead rescaled to 0–150%, per benchmark, with Basic Chaining]
Function Call Chaining
• Problem: Function calls were not being chained
• Compound instructions (like jump-and-link) handle two virtual addresses
  – Load the return address into the link register
  – Jump to the destination address
• Solution:
  – Decompose them in the rewriter
  – The jump can then be chained normally at runtime
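The decomposition can be sketched as follows (a hypothetical encoding using simple tuples; the real rewriter emits machine instructions):

```python
# Sketch of decomposing a compound jump-and-link: "jal f" becomes an
# explicit load of the return address into the link register plus a
# plain jump, which the runtime can then chain like any other jump.
def decompose(instr, pc):
    if instr[0] == "jal":
        dest = instr[1]
        return [("li", "$ra", pc + 4),   # load return address
                ("j", dest)]             # chainable plain jump
    return [instr]                       # other instructions unchanged

print(decompose(("jal", "foo"), 0x8400))
```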
Function Call Chaining Performance
[Bar chart: Flexicache overhead, 0–150%, per benchmark: Basic Chaining vs. Chain +JAL]
Replacement Policy
• Problem: Too much bookkeeping
  – Chains must be backed out if the destination block is evicted
  – Idea 1: With a FIFO replacement policy, there is no need to record chains from old blocks to young ones
  – Idea 2: Limit the number of chains to each block
• Solution: Flush replacement policy
  – Evict everything and start fresh
  – No need to undo or track chains
  – Increased miss rate vs. FIFO
[Diagram: blocks A (older) through D (newer) with chains between them; an unchaining table records, for each block, the blocks that chain into it]
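The bookkeeping contrast can be sketched as follows (hypothetical structures and chain contents; the unchaining table here maps each block to the blocks that chain into it):

```python
# Sketch contrasting the two policies: FIFO eviction must walk an
# unchaining table and undo every chain into the victim, while a flush
# discards the whole cache and all chains with no per-chain work.
resident = {"A", "B", "C", "D"}
chains_into = {"A": ["D"], "B": [], "C": ["A"], "D": []}
patched = []                       # jump sites redirected back to runtime

def evict_fifo(victim):
    """Evict one block; every chain into it must be patched out."""
    for src in chains_into.pop(victim, []):
        patched.append(src)        # redirect src's jump to the runtime
    resident.discard(victim)

def evict_flush():
    """Evict everything: no per-chain undo work at all."""
    resident.clear()
    chains_into.clear()

evict_fifo("C")
print(patched)                     # ['A'] -- A's chain into C undone
evict_flush()
print(len(resident), len(chains_into))   # 0 0
```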
Flush Policy Performance
[Bar chart: Flexicache overhead, 0–150%, per benchmark: Basic Chaining vs. Chain +JAL vs. Flush]
Indirect Jump Chaining
• Problem: Different destination on each execution
• Solution: Pre-screen addresses and chain each one individually
• But…
  – Screening takes time
  – Which addresses should we chain?
[Diagram: a JR $31 through the runtime is replaced with a screening sequence:
  if $31 == A: JMP A
  if $31 == B: JMP B
  if $31 == C: JMP C]
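The screening idea can be sketched as follows (illustrative; the limit, addresses, and lookup structure are hypothetical stand-ins for the compare-and-jump sequence patched into the code):

```python
# Sketch of indirect-jump chaining: the JR site pre-screens the target
# against previously seen addresses and jumps directly on a match,
# falling back to the runtime for anything not yet chained.
MAX_CHAINS = 3                    # how many targets to chain per site

chained = {}                      # target address -> loaded block

def take_indirect_jump(addr):
    if addr in chained:           # screened: direct jump, no runtime
        return chained[addr]
    block = f"<loaded {addr:#x}>" # runtime locates/loads the block
    if len(chained) < MAX_CHAINS:
        chained[addr] = block     # chain this target for next time
    return block

take_indirect_jump(0x8400)
take_indirect_jump(0x9000)
print(take_indirect_jump(0x8400))   # second visit hits the chain
```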
Indirect Jump Chaining Performance
[Bar chart: Flexicache overhead, 0–150%, per benchmark: Basic Chaining vs. Chain +JAL vs. Flush vs. JR Chain]
Fixed-size Block Padding
• Padding for small blocks wastes more space than expected
  – The average basic block contains 5.5 instructions
  – The most common size is 3
  – 60–65% of storage space is wasted on NOPs
[Histogram: number of cache blocks vs. number of useful instructions per cache block (1–16); distribution of block sizes in rasta]

00008400 <L2B1>:
  8400: mfsr  $r9,28
  8404: rlm   $r9,$r9,0x4,0x0
  8408: jnel+ $r9,$0,_dispatch.entry1
  840c: jal   _dispatch.entry2
  8410: nop
  8414: nop
  8418: nop
  841c: nop
  …
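A back-of-the-envelope check of the waste figure, using only the average block size stated above (it ignores the splitting of large blocks, so it is an estimate, not the measured number):

```python
# Rough arithmetic behind the padding waste: with fixed 16-instruction
# cache blocks and an average basic block of ~5.5 instructions, most of
# each block is NOP padding.
BLOCK_SIZE = 16
avg_useful = 5.5                    # average basic block size, per slide

wasted = 1 - avg_useful / BLOCK_SIZE
print(f"{wasted:.0%}")              # ~66%, in line with the 60-65% figure
```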
8-word Cache Blocks
• Reduce cache block size to better fit basic blocks
  – Less padding → less wasted space → lower miss rate
  – Bookkeeping structures get bigger → higher miss rate
  – More block splits → higher miss rate and overhead
• Allow up to 4 consecutive blocks to be loaded together
  – Effectively creates 8-, 16-, 24- and 32-word blocks
  – Avoids splitting up large basic blocks
• Performance benefits
  – Amortizes the cost of a call into the runtime
  – Overlaps DRAM fetches
  – Eliminates the jumps used to split large blocks
  – The extra space is also used for runtime JR chaining
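The grouping rule can be sketched as follows (a minimal illustration of the size calculation only; loading and chaining are elided):

```python
# Sketch of variable-size blocks built from 8-word units: a basic block
# occupies up to 4 consecutive 8-word cache blocks (8-32 words), loaded
# together in a single runtime call; anything larger is still split.
UNIT = 8
MAX_UNITS = 4

def units_needed(n_instrs):
    """8-word cache blocks loaded together for one basic block."""
    units = -(-n_instrs // UNIT)      # ceiling division
    return min(units, MAX_UNITS)      # larger blocks must be split

print(units_needed(3))    # 1 -- only 5 words of padding, not 13
print(units_needed(20))   # 3 -- a 24-word block, no splitting jump
print(units_needed(40))   # 4 (capped) -- the remainder is split off
```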
8-word Blocks Performance
[Bar chart: Flexicache overhead, 0–100%, per benchmark: JR Chain vs. 8-word vs. 8-word,AL,pad]
Performance Summary
• Good performance on 6 of 9 benchmarks: 5–11% overhead
• G721 (24.2% overhead)
  – Indirect jumps
• Mesa (24.4% overhead)
  – Indirect jumps, high miss rate
• Rasta (93.6% overhead)
  – High miss rate, indirect jumps
• The majority of the remaining overhead is due to modifications to user code, not runtime calls
  – Fall-through jumps added by the rewriter
  – Indirect jump chain comparisons
Energy Analysis
• SRAM uses less energy than a cache for each access
  – No tags and no unused cache ways
  – Saves about 9% of total processor power
• Additional instructions for software management use extra energy
  – Total energy is roughly proportional to the number of cycles
• A software I-cache will therefore use less total energy if its instruction overhead is below 9%
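The break-even point follows from energy ≈ power × cycles; a quick check with the ~9% power saving from the slide (the normalized power numbers are the slide's figures, the overhead values are examples):

```python
# Back-of-the-envelope break-even: the SRAM I-mem saves ~9% of total
# processor power per cycle, but Flexicache runs extra cycles, so the
# software cache wins only while its cycle overhead stays below ~9%.
hw_power = 1.00             # normalized power with a HW I-cache
sram_power = 0.91           # ~9% lower with a plain SRAM I-mem

def relative_energy(overhead):
    """Total energy vs. HW cache for a given Flexicache cycle overhead."""
    return sram_power * (1 + overhead) / hw_power

print(relative_energy(0.05) < 1)   # True -- 5% overhead saves energy
print(relative_energy(0.20) < 1)   # False -- 20% overhead costs energy
```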
Energy Results
• Wattch used with CACTI models for the SRAM and the I-cache
  – 32 kB, 2-way set-associative HW cache; about 25% of total power
• Total energy to complete each benchmark was calculated
[Bar chart: Flexicache energy consumption relative to the HW I-cache, per benchmark — adpcm −0.3%, epic −1.0%, g721 +14.5%, gsm −3.1%, jpeg −3.8%, mesa +14.8%, mpeg2 +1.0%, pegwit −2.6%, rasta +82.9%]
Conclusions
• Software-based instruction caching can be a practical solution for embedded processors
• Provides the programming convenience of a HW cache
• Performance and energy similar to a HW cache
  – Overhead < 10% on several benchmarks
  – Energy savings of up to 3.8%
• Maintains the advantages of an I-cache-less architecture
  – Low-cost hardware
  – Real-time guarantees
http://cag.csail.mit.edu/raw