Flexicache: Software-based Instruction Caching for Embedded Processors Jason E Miller and Anant Agarwal Raw Group - MIT CSAIL




Flexicache: Software-based Instruction Caching for Embedded Processors

Jason E Miller and Anant Agarwal
Raw Group - MIT CSAIL

Outline

• Introduction

• Baseline Implementation

• Optimizations

• Energy

• Conclusions

Hardware Instruction Caches

• Used in virtually all high-performance general-purpose processors

• Good performance
  – Decreases average memory access time

• Easy to use
  – Transparent operation

[Figure: processor chip containing an I-Cache, backed by off-chip DRAM]

ICache-less Processors

• Embedded procs and DSPs
  – TMS470, ADSP-21xx, etc.

• Embedded multicore processors
  – IBM Cell SPE

• No special-purpose hardware
  – Less design/verification time
  – Less area
  – Shorter cycle time
  – Less energy per access
  – Predictable behavior

• Much harder to program!
  – Manually partition code and transfer pieces from DRAM

[Figure: processor chip containing plain SRAM, backed by off-chip DRAM]

Software-based I-Caching

• Use a software system to virtualize instruction memory by recreating hardware cache functionality

• Automatic management of simple SRAM memory
  – Good performance with no extra programming effort

• Integrated into each individual application
  – Customized to program's needs
  – Optimize for different goals
  – Real-time predictability

• Maintain low-cost, high-speed hardware

Outline

• Introduction

• Baseline Implementation

• Optimizations

• Energy

• Conclusions

Flexicache System Overview

[Figure: Original Binary → Binary Rewriter → Rewritten Binary; the Linker combines the Rewritten Binary with the Runtime library to produce the Flexicache Binary, which runs from DRAM with code cached in the processor's I-mem]

Binary Rewriter

• Break up user program into cache blocks
• Modify control flow that leaves the blocks

[Figure: Binary Rewriter transforms the user binary; the Flexicache runtime is linked in]

Rewriter: Details

• One basic block in each cache block, but…
  – Fixed size of 16 instructions
    • Simplifies bookkeeping
    • Requires padding of small blocks and splitting of large ones

• Control-flow instructions that leave a block are modified to jump to the runtime system
  – E.g. BEQ $2,$3,foo → JEQL $2,$3,runtime
  – Original destination addresses stored in a table
  – Fall-through jumps added at end of blocks

Runtime: Overview

• Stays resident in I-mem

• Receives requests from cache blocks
• Checks whether the requested block is resident
• Loads a new block from DRAM if necessary
  – Evicts blocks to make room

• Transfers control to the new block
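The lookup/evict/load sequence above can be sketched as a toy Python model. Everything here is invented for illustration (`FlexicacheSim`, the dictionary-based DRAM); the real runtime operates on I-mem addresses and fixed-size instruction blocks, not Python objects:

```python
class FlexicacheSim:
    """Toy model of the Flexicache miss handler: a fully-associative
    cache of fixed-size blocks with FIFO replacement."""

    def __init__(self, dram, capacity):
        self.dram = dram            # virtual address -> block contents
        self.capacity = capacity    # max number of resident blocks
        self.resident = {}          # virtual address -> loaded block
        self.fifo = []              # load order, oldest first

    def request(self, vaddr):
        # Hit: the requested block is already resident.
        if vaddr in self.resident:
            return self.resident[vaddr]
        # Miss: evict the oldest block if the cache is full (FIFO).
        if len(self.fifo) == self.capacity:
            victim = self.fifo.pop(0)
            del self.resident[victim]
        # Load the new block from DRAM; control transfers to it.
        block = self.dram[vaddr]
        self.resident[vaddr] = block
        self.fifo.append(vaddr)
        return block
```

With a capacity of two blocks, a third request evicts the oldest resident block, mirroring the FIFO policy described later in the talk.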

Runtime Operation

[Figure: runtime operation. Branches and fall-throughs from loaded cache blocks (Block 0-3) enter the runtime system at Entry Point 1, Entry Point 2, or the Indirect EP; the miss handler sends a request to DRAM, receives Block 2 in reply, and returns control to it with a JR]

System Policies and Mechanisms

• Fully-associative cache block placement

• Replacement policy: FIFO
  – Evict oldest block in cache
  – Matches sequential execution

• Pinned functions
  – Key feature for timing predictability
  – No cache overhead within function

Experimental Setup

• Implemented for a tile in the Raw multicore processor
  – Similar to many embedded processors
  – 32-bit single-issue in-order MIPS pipeline
  – 32 kB SRAM I-mem

• Raw simulator
  – Cycle-accurate
  – Idealized I/O model
  – SRAM I-mem or traditional hardware I-cache models
  – Uses Wattch to estimate energy consumption

• Mediabench benchmark suite
  – Multimedia applications for embedded processors

Baseline Performance

[Chart: baseline Flexicache overhead for adpcm, epic, g721, gsm, jpeg, mesa, mpeg2, pegwit, and rasta; y-axis 0%-900%]

Overhead: Number of additional cycles relative to 32 kB, 2-way HW cache

Outline

• Introduction

• Baseline Implementation

• Optimizations

• Energy

• Conclusions

Basic Chaining

• Problem: Hit case in runtime system takes about 40 cycles

[Figure: without chaining, every control transfer between blocks A-D goes through the runtime system; with chaining, blocks jump directly to one another]

• Solution: Modify jump to runtime system so that it jumps directly to loaded code the next time
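A minimal sketch of the chaining idea, modeling a call site as an object whose target is patched on first use (the names are invented; the real system rewrites the jump instruction in I-mem rather than updating a Python attribute):

```python
def make_chained_dispatch(runtime_lookup):
    """Toy model of chaining: a call site starts out pointing at the
    runtime dispatcher; the first dispatch patches it so that later
    executions jump straight to the resident block."""
    calls = {"runtime": 0}          # count trips into the runtime system

    class CallSite:
        def __init__(self, dest_vaddr):
            self.dest_vaddr = dest_vaddr
            self.target = None      # None = still points at the runtime

        def execute(self):
            if self.target is None:             # slow path: enter runtime
                calls["runtime"] += 1
                self.target = runtime_lookup(self.dest_vaddr)
            return self.target                  # fast path: chained jump

    return CallSite, calls
```

After the first execution the roughly 40-cycle trip through the runtime is skipped entirely, which is the whole benefit of chaining.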

Basic Chaining Performance

[Chart: Flexicache overhead per benchmark, Baseline vs. Basic Chaining; y-axis 0%-900%]

Basic Chaining Performance

[Chart: Flexicache overhead per benchmark with Basic Chaining; y-axis rescaled to 0%-150%]

Function Call Chaining

• Problem: Function calls were not being chained

• Compound instructions (like jump-and-link) handle two virtual addresses
  – Load return address into link register
  – Jump to destination address

• Solution:
  – Decompose them in the rewriter
  – Jump can then be chained normally at runtime

Function Call Chaining Performance

[Chart: Flexicache overhead per benchmark, Basic Chaining vs. Chain +JAL; y-axis 0%-150%]

Replacement Policy

• Problem: Too much bookkeeping
  – Chains must be backed out if destination block is evicted
  – Idea 1: With FIFO replacement policy, no need to record chains from old to young
  – Idea 2: Limit # of chains to each block

• Solution: Flush replacement policy
  – Evict everything and start fresh
  – No need to undo or track chains
  – Increased miss rate vs. FIFO

[Figure: FIFO eviction order (older to newer) over blocks A-D, with an unchaining table recording the chains into each block so they can be backed out on eviction]
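The flush policy can be sketched the same way (a toy model with invented names; the real system simply discards the I-mem contents wholesale). The key property is that chains live inside the cached copies of blocks, so throwing every block away removes every chain with it and nothing needs to be tracked:

```python
class FlushCache:
    """Toy model of the flush replacement policy: when the cache fills,
    discard all blocks at once. Chains are stored in the cached copies,
    so a flush removes them automatically."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = {}   # vaddr -> cached copy, with its chains

    def load(self, vaddr):
        if len(self.resident) == self.capacity:
            self.resident.clear()    # flush: blocks and chains go together
        self.resident[vaddr] = {"chains": []}
        return self.resident[vaddr]

    def chain(self, from_vaddr, to_vaddr):
        # Patch the resident copy of `from_vaddr` to jump straight to
        # the resident copy of `to_vaddr`.
        self.resident[from_vaddr]["chains"].append(to_vaddr)
```

The trade-off from the slide is visible here: the flush discards still-useful blocks along with the stale ones, raising the miss rate relative to FIFO.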

Flush Policy Performance

[Chart: Flexicache overhead per benchmark, Basic Chaining vs. Chain +JAL vs. Flush; y-axis 0%-150%]

Indirect Jump Chaining

• Problem: Different destination on each execution
• Solution: Pre-screen addresses and chain each individually

• But…
  – Screening takes time
  – Which addresses should we chain?

[Figure: a JR $31 that can reach blocks A, B, or C is replaced with a screening sequence:
  if $31==A: JMP A
  if $31==B: JMP B
  if $31==C: JMP C]
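The screening sequence can be sketched as below. The slide leaves the which-addresses question open, so this sketch assumes a first-come, first-served policy with a small fixed number of chain slots; the function names are invented:

```python
def make_jr_chain(runtime_lookup, max_chains=3):
    """Toy model of indirect-jump (JR) chaining: screen the register
    value against previously chained destinations and jump directly on
    a match; otherwise fall back to the runtime and, while slots
    remain, chain the new destination."""
    chains = {}               # virtual destination -> resident block
    stats = {"runtime": 0}

    def dispatch(dest_vaddr):
        if dest_vaddr in chains:          # screened hit: direct jump
            return chains[dest_vaddr]
        stats["runtime"] += 1             # no match: enter the runtime
        block = runtime_lookup(dest_vaddr)
        if len(chains) < max_chains:      # assumed policy: chain the
            chains[dest_vaddr] = block    # first few addresses seen
        return block

    return dispatch, stats
```

The screening cost shows up as the comparisons before the fallback, matching the "screening takes time" caveat above.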

Indirect Jump Chaining Performance

[Chart: Flexicache overhead per benchmark, Basic Chaining vs. Chain +JAL vs. Flush vs. JR Chain; y-axis 0%-150%]

Fixed-size Block Padding

• Padding for small blocks wastes more space than expected
  – Average basic block contains 5.5 instructions
  – Most common size is 3
  – 60-65% of storage space is wasted on NOPs

[Chart: distribution of block sizes in rasta; # of cache blocks (0-600) vs. # of useful instrs per cache block (1-16)]

00008400 <L2B1>:
  8400: mfsr  $r9,28
  8404: rlm   $r9,$r9,0x4,0x0
  8408: jnel+ $r9,$0, _dispatch.entry1
  840c: jal   _dispatch.entry2
  8410: nop
  8414: nop
  8418: nop
  841c: nop
  …
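The 60-65% figure is easy to sanity-check from the stated numbers; this back-of-the-envelope estimate lands slightly above the measured range because it ignores block splitting and the control-flow instructions the rewriter adds:

```python
# Rough check of the padding waste quoted above: with fixed
# 16-instruction cache blocks and an average basic block of 5.5
# useful instructions, most of each block is NOP padding.
BLOCK_SIZE = 16
AVG_USEFUL = 5.5

wasted_fraction = 1 - AVG_USEFUL / BLOCK_SIZE
print(f"{wasted_fraction:.0%} of storage wasted")  # ~66%, near the measured 60-65%
```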

8-word Cache Blocks

• Reduce cache block size to better fit basic blocks
  – Less padding → less wasted space → lower miss rate
  – Bookkeeping structures get bigger → higher miss rate
  – More block splits → higher miss rate, overhead

• Allow up to 4 consecutive blocks to be loaded together
  – Effectively creates 8, 16, 24 and 32 word blocks
  – Avoids splitting up large basic blocks

• Performance benefits
  – Amortize cost of a call into the runtime
  – Overlap DRAM fetches
  – Eliminate jumps used to split large blocks
  – Also used to add extra space for runtime JR chaining

8-word Blocks Performance

[Chart: Flexicache overhead per benchmark, JR Chain vs. 8-word vs. 8-word,AL,pad; y-axis 0%-100%]

Performance Summary

• Good performance on 6 of 9 benchmarks: 5-11% overhead

• G721 (24.2% overhead)
  – Indirect jumps

• Mesa (24.4% overhead)
  – Indirect jumps, high miss rate

• Rasta (93.6% overhead)
  – High miss rate, indirect jumps

• Majority of remaining overhead is due to modifications to user code, not runtime calls
  – Fall-through jumps added by rewriter
  – Indirect jump chain comparisons

Outline

• Introduction

• Baseline Implementation

• Optimizations

• Energy

• Conclusions

Energy Analysis

• SRAM uses less energy than a cache for each access
  – No tags and unused cache ways
  – Saves about 9% of total processor power

• Additional instructions for software management use extra energy
  – Total energy roughly proportional to number of cycles

• Software I-cache will use less total energy if instruction overhead is below 9%
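A rough break-even model for this claim, assuming per-cycle power drops by the 9% SRAM savings while cycle count grows by the software-caching overhead (total energy taken as power times cycles, as the slide does):

```python
# Toy break-even model: relative energy of the software I-cache vs. a
# hardware I-cache. Per-cycle power falls by ~9% with plain SRAM,
# while total cycles grow by the caching overhead.
SRAM_SAVINGS = 0.09

def relative_energy(overhead):
    """Software-cache energy relative to the HW-cache baseline."""
    return (1 - SRAM_SAVINGS) * (1 + overhead)

# The software cache saves energy while relative_energy(x) < 1,
# i.e. while overhead stays below savings / (1 - savings), close to
# the roughly-9% threshold stated above.
break_even = SRAM_SAVINGS / (1 - SRAM_SAVINGS)
```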

Energy Results

• Wattch used with CACTI models for SRAM and I-cache
  – 32 kB, 2-way set-associative HW cache, 25% of total power

• Total energy to complete each benchmark was calculated

[Chart: Flexicache energy consumption relative to HW I-cache: adpcm -0.3%, epic -1.0%, g721 +14.5%, gsm -3.1%, jpeg -3.8%, mesa +14.8%, mpeg2 +1.0%, pegwit -2.6%, rasta +82.9%]

Conclusions

• Software-based instruction caching can be a practical solution for embedded processors

• Provides the programming convenience of a HW cache

• Performance and energy similar to a HW cache
  – Overhead < 10% on several benchmarks
  – Energy savings of up to 3.8%

• Maintains advantages of an Icache-less architecture
  – Low-cost hardware
  – Real-time guarantees

http://cag.csail.mit.edu/raw

Questions?
