Automatic Tuning of Two-Level Caches to Embedded Applications

Ann Gordon-Ross and Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside

*Also with the Center for Embedded Computer Systems, UC Irvine

Nikil Dutt
Center for Embedded Computer Systems
School for Information and Computer Science
University of California, Irvine

This work was supported by the U.S. National Science Foundation, and by the Semiconductor Research Corporation

Introduction

• Memory access: 50% of embedded processor's system power
• Caches are power hungry
  • ARM920T (Segars 01)
  • M*CORE (Lee/Moyer/Arends 99)
• Thus, the cache is a good candidate for optimizations

[Figure: Processor, L1 Cache, L2 Cache, Main Mem; chart label: 53%]

Motivation

• Tuning cache parameters to an application can save energy: 60% on average
  • Balasubramonian'00, Zhang'03
• Each application has different cache requirements
  • One predetermined cache configuration can't be best for all applications
• Size
  – Excess fetch and static energy if too large
  – Excess thrashing energy if too small

[Figure: L1 Cache]

• Line size
  – Excess fetch energy if line size too large
  – Excess stall energy if line size too small
• Cache associativity
  – Excess fetch energy per access if too high
  – Excess miss energy if too low

Motivation

• By tuning these parameters, the cache can be customized to a particular application

[Figure: Microprocessor, L1 Cache, L2 Cache, and Main Memory, with tuning applied to both cache levels; plot of energy over possible cache configurations, labeled "Choose lowest energy configuration"]

Related Work

• Configurable caches
  • Soft cores (ARM, MIPS, Tensilica, etc.)
  • Even for hard processors (Motorola M*CORE - Malik ISLPED'00; Albonesi MICRO'00; Zhang ISCA'03)
• Configurable cache tuning
  • Mostly manually in practice
    – Sub-optimal, time-consuming
  • L1 automated methods
    – Platune (Givargis TCAD'02, Palesi CODES'02)
    – Zhang RSP'03
• Two-level caches becoming popular
  • More transistors on-chip available
  • Bigger gap between on-chip and off-chip accesses
  • Need automated tuning for L1+L2

[Figure: Microprocessor, L1 Cache, L2 Cache, and Main Memory, with tuning applied to both cache levels]

Challenge for Two-Level Cache Tuning

• One level: 10s of configurations
• Two levels: 100s/1000s of configurations
  • Need efficient heuristic
  • Especially if used with simulation-based search

[Figure: Level 1 (total size, line size, associativity), say 50 configs, times Level 2 (total size, line size, associativity), say 50 configs, gives 2500 configs]

Two-Level Cache Tuning Goal

• Develop fast, good-quality heuristic for tuning two-level caches to embedded applications for reduced energy consumption
• Presently focus on separate I and D cache in both levels

[Figure: Microprocessor, Level 1 Caches (I-cache, D-cache), Level 2 Caches (I-cache, D-cache), and Main Memory; labels: "Tune Instruction Cache Hierarchy", "Tune Data Cache Hierarchy"]

Configurable Cache Architecture

• Our target configurable cache architecture is based on Zhang/Vahid/Najjar's "Highly-Configurable Cache Architecture for Embedded Systems," ISCA 2003
• Base level one cache: 8 KB cache consisting of four 2 KB banks that can operate as 4 ways (2 KB | 2 KB | 2 KB | 2 KB)
• Way concatenation offers a 2-way (4 KB | 4 KB) or a direct-mapped (8 KB) variation
• Way shutdown offers a 2-way 4 KB cache (2 KB | 2 KB) or a direct-mapped 2 KB cache
• Way shutdown and way concatenation can be combined to offer a direct-mapped 4 KB cache

Configuration Space

• Cache parameters
  • Size - L1 cache: 2, 4, and 8 KBytes. L2 cache: 16, 32, and 64 KBytes
  • Line size (L1 or L2) - 16, 32, and 64 Bytes
    – 16 byte physical base line size
  • Associativity (L1 or L2) - Direct-mapped, 2-way, and 4-way
• 432 possible configurations
  • For two levels, with separate I and D
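
The slide does not say how the figure of 432 is reached. As a rough check, the sketch below enumerates the space under two labeled assumptions that are not stated on this slide: a cache cannot have more ways than active banks (one bank being the smallest size at that level), and the L2 line size is never smaller than the L1 line size. The names `size_assoc_pairs` and `hierarchy_configs` are hypothetical helpers for illustration only.

```python
from itertools import product

LINE_SIZES = (16, 32, 64)      # bytes, from the slide
L1_SIZES_KB = (2, 4, 8)        # from the slide
L2_SIZES_KB = (16, 32, 64)     # from the slide
WAYS = (1, 2, 4)               # direct-mapped, 2-way, 4-way


def size_assoc_pairs(sizes_kb):
    """Return valid (size, ways) pairs for one cache level.

    Assumption: one bank equals the smallest size, and a cache cannot
    have more ways than active banks.
    """
    bank_kb = min(sizes_kb)
    return [(s, w) for s in sizes_kb for w in WAYS if w <= s // bank_kb]


def hierarchy_configs():
    """Return all (L1, L2) configurations for one hierarchy (I or D)."""
    configs = []
    for (s1, w1), (s2, w2) in product(size_assoc_pairs(L1_SIZES_KB),
                                      size_assoc_pairs(L2_SIZES_KB)):
        for line1, line2 in product(LINE_SIZES, LINE_SIZES):
            # Assumption: the L2 line is never shorter than the L1 line.
            if line2 >= line1:
                configs.append(((s1, line1, w1), (s2, line2, w2)))
    return configs


per_hierarchy = len(hierarchy_configs())   # instruction side or data side
total = 2 * per_hierarchy                  # separate I and D hierarchies
print(per_hierarchy, total)
```

Under these assumptions the count is 216 per hierarchy and 432 for the instruction and data hierarchies together, which matches the slide; other constraint sets could give the same total.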

Experimental Environment

[Figure: experimental flow. MediaBench and EEMBC benchmarks feed SimpleScalar, which produces hit and miss ratios for each configuration. Energy models: cache energy - Cacti; main memory energy - Samsung memory; CPU stall energy - 0.18 micron MIPS uP. The cache exploration heuristic yields the chosen cache configuration. Exhaustive search (took days) is run for comparison purposes.]

First Heuristic: Tune Levels One-at-a-Time

• Tune L1, then L2
  • Initial L2: 64 KByte, 4-way, 64 byte line size
  • For best L1 found, tune L2 cache
• Tuned each cache using Zhang's heuristic for one-level cache tuning (RSP'03)

[Figure: Microprocessor, L1 Cache, L2 Cache, Main Memory]

First Heuristic: Tune Levels One-at-a-Time

• Zhang's heuristic: Search parameters in order of importance (RSP'03)
• First search size
  • Begin with a 2 KByte, direct-mapped cache with a 16 Byte line size
  • Increase size to 4 KB
  • If the size increase yields energy improvements, increase the cache size to 8 KB
• Next search line size
  • For the lowest energy cache size, increase the line size to 32 Bytes
  • If the increase in line size yields a decrease in energy, increase the line size to 64 Bytes
• Finally, search associativity
  • For the lowest energy line size, increase the associativity to 2
  • If increasing the associativity yields a decrease in energy, increase the associativity to 4
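
The one-level search order described above (size, then line size, then associativity, each stepped upward while energy keeps dropping) can be sketched as follows. This is a minimal illustration, not the RSP'03 implementation: `tune_parameter`, `tune_one_level`, and `toy_energy` are hypothetical names, the energy function is a made-up stand-in for a simulation run, and the bank restrictions on associativity are not modeled.

```python
def tune_parameter(config, key, values, energy):
    """Step `key` through `values` in increasing order and stop at the
    first value that does not lower energy. Returns the best config."""
    best = dict(config, **{key: values[0]})
    best_energy = energy(best)
    for value in values[1:]:
        candidate = dict(best, **{key: value})
        candidate_energy = energy(candidate)
        if candidate_energy >= best_energy:
            break  # no improvement: keep the previous value
        best, best_energy = candidate, candidate_energy
    return best


def tune_one_level(energy):
    """Search size, then line size, then associativity, in that order."""
    config = {"size_kb": 2, "line_b": 16, "ways": 1}  # starting point from the slide
    for key, values in (("size_kb", (2, 4, 8)),
                        ("line_b", (16, 32, 64)),
                        ("ways", (1, 2, 4))):
        config = tune_parameter(config, key, values, energy)
    return config


def toy_energy(c):
    """Hypothetical energy model whose minimum is 4 KB, 32 B lines, 1 way."""
    return abs(c["size_kb"] - 4) + abs(c["line_b"] - 32) / 16 + (c["ways"] - 1)


print(tune_one_level(toy_energy))
```

With this toy model the search settles on a 4 KB, direct-mapped cache with 32 byte lines after evaluating only a handful of configurations.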

Results of First Heuristic

• Base cache configuration
  • Level 1 - 8 KByte, 4-way, 32 byte line
  • Level 2 - 64 KByte, 4-way, 64 byte line

[Figure: bar chart, y-axis 0 to 1.2, comparing Base Cache, First Heuristic, and Optimal]

First Heuristic

• Did not find optimal in most cases
  • Sometimes 200% or 300% worse
• The two levels should not be explored separately
  • Too much interdependence among L1 and L2 cache parameters
  • E.g., high L1 associativity decreases misses and thus reduces need for large L2
  • Dozens of other such interdependencies

Improved Heuristic – Basic Interlacing

• To more fully explore the dependencies between the two levels, we interlaced the exploration of the level one and level two caches
  • Determine the best size of level one cache
  • Determine the best size of level two cache
  • Determine the best line size of level one cache
  • Determine the best line size of level two cache
  • Determine the best associativity of level one cache
  • Determine the best associativity of level two cache
• Basic interlacing performed better than the initial heuristic but there was still much room for improvement
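
The interlaced order can be sketched as a single pass over six parameters, alternating between the two levels. This is an illustrative sketch only: `step_up`, `tune_interlaced`, `ORDER`, `TARGET`, and `toy_energy` are hypothetical names, the energy function is invented, and starting every parameter at its smallest value is an assumption (the slides do not give the starting point for this heuristic).

```python
# Exploration order from the slides: size, line size, associativity,
# each tuned for L1 first and then for L2.
ORDER = (
    ("l1_size_kb", (2, 4, 8)), ("l2_size_kb", (16, 32, 64)),
    ("l1_line_b", (16, 32, 64)), ("l2_line_b", (16, 32, 64)),
    ("l1_ways", (1, 2, 4)), ("l2_ways", (1, 2, 4)),
)


def step_up(config, key, values, energy):
    """Raise `key` through `values` while energy keeps decreasing."""
    best = dict(config, **{key: values[0]})
    best_energy = energy(best)
    for value in values[1:]:
        candidate = dict(best, **{key: value})
        candidate_energy = energy(candidate)
        if candidate_energy >= best_energy:
            break
        best, best_energy = candidate, candidate_energy
    return best


def tune_interlaced(energy):
    """Tune one parameter at a time, alternating between L1 and L2."""
    # Assumption: every parameter starts at its smallest value.
    config = {key: values[0] for key, values in ORDER}
    for key, values in ORDER:
        config = step_up(config, key, values, energy)
    return config


# Hypothetical energy model: distance from an arbitrary target configuration.
TARGET = {"l1_size_kb": 4, "l2_size_kb": 32, "l1_line_b": 32,
          "l2_line_b": 64, "l1_ways": 2, "l2_ways": 1}


def toy_energy(c):
    return sum(abs(c[k] - TARGET[k]) for k in TARGET)


print(tune_interlaced(toy_energy))
```

The toy model has no coupling between parameters, so the greedy pass reaches its minimum; real workloads have the L1/L2 interdependencies noted earlier, which is why this basic version can still miss the optimum.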

Final Heuristic: Interlaced with Local Search

• Basic interlacing performed well, but some cases sub-optimal
  • Manually examined those cases
  • Determined small local search needed
• Final heuristic called: TCaT - The Two Level Cache Tuner
• Because of the bank arrangements, if a 16 KB cache is determined to be the best size, the only associativity option is direct-mapped
• However, the application may require the increased associativity. During the associativity search step, the cache size is allowed to increase so that larger associativities may be explored.

[Figure: level two cache shown as one, two, and four 16 KB banks]
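
The size adjustment during the associativity step can be illustrated with a small helper. This is a sketch under stated assumptions, not TCaT's actual local search: `associativity_candidates` is a hypothetical name, and the 16 KB bank size and four-bank limit are taken from the level two example on the slide.

```python
BANK_KB = 16     # L2 bank size, from the slide's 16 KB example
MAX_BANKS = 4    # assumption: four banks, i.e. at most 64 KB and 4 ways


def associativity_candidates(size_kb, ways_options=(1, 2, 4)):
    """List the (size_kb, ways) pairs to try in the associativity step.

    The size found earlier is kept when it already has enough banks for
    the requested ways; otherwise it grows to the smallest size that does.
    """
    candidates = []
    for ways in ways_options:
        needed_kb = max(size_kb, ways * BANK_KB)
        if needed_kb <= MAX_BANKS * BANK_KB:
            candidates.append((needed_kb, ways))
    return candidates


print(associativity_candidates(16))
print(associativity_candidates(32))
```

For a best size of 16 KB this yields a direct-mapped 16 KB cache, a 2-way 32 KB cache, and a 4-way 64 KB cache, so higher associativities are reachable even though the 16 KB size alone allows only direct-mapped operation.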

TCaT Results: Energy

• Energy consumption (normalized to the base cache configuration)
• 53% energy savings in cache/memory access sub-system vs. base cache

[Figure: bar chart of normalized energy, y-axis 0 to 1.2, comparing Base Cache, First Heuristic, TCaT, and Optimal]

TCaT Results: Performance

• Execution time for the TCaT cache configuration and the optimal cache configuration (normalized to the execution time of the benchmark running with the base cache configuration)
• TCaT finds near-optimal configuration, nearly 30% improvement over base cache

[Figure: bar chart of normalized execution time, y-axis 0 to 1, comparing Base Cache, TCaT, and Optimal]

TCaT Exploration Time Improvements

• Searches only 28 of 432 possible configurations
  • 6% of space
• Simulation-based approach
  • 500 MHz Sparc
  • 50 hrs vs. 3 hrs
• Hardware-based approach
  • 434 sec vs. 28 sec

[Figure: bar chart of configurations explored, y-axis 0 to 450: TCaT 28, Exhaustive 432]
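
The figures above are mutually consistent if each configuration costs roughly the same to evaluate. The short calculation below checks this; the constant-cost assumption is ours, and the variable names are illustrative.

```python
# Figures quoted on the slide.
EXPLORED, TOTAL = 28, 432
SIM_EXHAUSTIVE_HOURS = 50
HW_EXHAUSTIVE_SECONDS = 434

fraction_explored = EXPLORED / TOTAL
fraction_pruned = 1 - fraction_explored

# Assumption: every configuration costs about the same to evaluate.
sim_hours_per_config = SIM_EXHAUSTIVE_HOURS / TOTAL
hw_seconds_per_config = HW_EXHAUSTIVE_SECONDS / TOTAL

est_sim_hours = EXPLORED * sim_hours_per_config
est_hw_seconds = EXPLORED * hw_seconds_per_config

print(f"explored {fraction_explored:.1%}, pruned {fraction_pruned:.1%}")
print(f"simulation: about {est_sim_hours:.1f} h for TCaT")
print(f"hardware: about {est_hw_seconds:.1f} s for TCaT")
```

The estimates come out at about 3.2 hours and 28 seconds, in line with the reported 3 hrs and 28 sec, and the explored fraction of about 6.5% corresponds to the roughly 94% pruning quoted in the conclusions.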

TCaT in Presence of Hw/Sw Partitioning

• Hardware/software partitioning may become common in SOC platforms
  • On-chip FPGA
  • Program kernels moved to FPGA
  • Greatly reduces temporal and spatial locality of program
• Does TCaT still work well on programs with very low locality?

TCaT With Hardware/Software Partitioning

• Energy consumption (normalized to the base cache configuration)
• 55% energy savings in cache/memory access sub-system vs. base cache

[Figure: bar chart of normalized energy, y-axis 0 to 1.2, comparing Base Cache, First Heuristic, TCaT, and Optimal]

Conclusions

• TCaT is an effective heuristic for two-level cache tuning
  • Prunes 94% of search space for a given two-level configurable cache architecture
  • Near-optimal performance results, 30% improvement vs. base cache
  • Near-optimal energy results, 53% improvement vs. base cache
  • Robust in presence of hw/sw partitioning
• Future work
  • More cache parameters, unified 2L cache
    – Even larger search space
  • Dynamic in-system tuning
    – Must avoid cache flushes
