A One-Shot Configurable-Cache Tuner for Improved Energy and Performance

Ann Gordon-Ross 1, Pablo Viana 2, Frank Vahid 1, Walid Najjar 1, and Edna Barros 3

1 Dept of Computer Science & Engineering - University of California, Riverside, USA
2 Campus Arapiraca - Federal University of Alagoas, Brazil
3 Centro de Informática - Federal University of Pernambuco, Brazil

This work was supported by the U.S. National Science Foundation, and by the Semiconductor Research Corporation

Introduction

• Memory access: 50% of embedded processor’s system power
• Caches are power hungry
  • ARM920T (Segars 01)
  • M*CORE (Lee/Moyer/Arends 99)
• Thus, caches are a good candidate for optimizations

[Figure: block diagram of Processor, L1 I Cache, L1 D Cache, and Main Mem; the figure carries a 53% label]

Introduction

• Different applications have vastly different cache requirements
  • Total size, line size, and associativity
• Cache parameters that don’t match an application’s behavior can waste over 60% of energy (Gordon-Ross 05)
• Cache tuning is the process of determining the appropriate cache parameters for an application

[Figure: three example cache configurations: 4KB, 16 byte, 2-way; 2KB, 32 byte, direct-mapped; 8KB, 64 byte, 4-way]
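
As a rough illustration of how total size, line size, and associativity interact, the sketch below derives the number of lines and sets for the three example configurations on this slide (4KB/16-byte/2-way, 2KB/32-byte/direct-mapped, 8KB/64-byte/4-way). The class and field names are illustrative assumptions and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical container for the three tunable cache parameters."""
    size_bytes: int   # total cache size
    line_bytes: int   # line (block) size
    ways: int         # associativity; 1 means direct-mapped

    @property
    def num_lines(self) -> int:
        # Total number of cache lines the cache can hold.
        return self.size_bytes // self.line_bytes

    @property
    def num_sets(self) -> int:
        # Lines are grouped into sets of `ways` lines each.
        return self.num_lines // self.ways


# The three example configurations shown on the slide.
configs = [
    CacheConfig(size_bytes=4 * 1024, line_bytes=16, ways=2),
    CacheConfig(size_bytes=2 * 1024, line_bytes=32, ways=1),
    CacheConfig(size_bytes=8 * 1024, line_bytes=64, ways=4),
]

for c in configs:
    print(f"{c.size_bytes // 1024}KB, {c.line_bytes}-byte lines, {c.ways}-way: "
          f"{c.num_lines} lines in {c.num_sets} sets")
```

Even these three points differ in every derived quantity, which is one way to see why a configuration that suits one application may suit another poorly.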

Runtime Cache Tuning

• Best cache configuration can be determined by searching the design space during runtime
• Runtime cache tuning is transparent to the designer and end user, but incurs runtime overhead in terms of energy and performance

[Figure: energy plot. Labels: Download application, Executing in base configuration, Cache Tuning, Tunable cache (TC), Tuning hw]

Contribution

• We introduce specialized hardware for non-intrusive runtime cache evaluation
  • Temporary energy overhead and no performance overhead
• Single-pass multi-cache evaluation (SPCE)
  • Special hardware simultaneously evaluates all cache configurations
  • Enables switching to the best configuration in one shot

[Figure: energy plot. Labels: Download application, Executing in base configuration, SPCE causes an increase in energy but no performance overhead, Switch to best config in “one-shot”, Tunable cache (TC), SPCE]

SPCE Key Points

• Contributions compared to previous methods
  • Evaluates a highly configurable cache
    – Previous methods offer little configurability
  • Little hardware overhead
    – Simple data structures
    – Elementary operations

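SPCE

• Monitors address stream to extract cache hit information for all configurations

Fully-associative cache example (64-bit architecture)

Address stream: t0 = 0, t1 = 8, t2 = 16, t3 = 0, t4 = 8, t5 = 0, t6 = 16

Table (stored hit info): columns are line size b = 0, 1, 2 (2^b words per line); rows are number of lines d = 1 to 8; 24 different configs

• Number of conflicts determines cache sizes that would result in a hit
• For each line size …

>> 2^0*8 (1 word per line): t0 = 0, t1 = 1, t2 = 2, t3 = 0, t4 = 1, t5 = 0, t6 = 2
  • HIT at t3 with 3 lines, t4 with 3 lines, t5 with 2 lines, t6 with 3 lines
  • Table: 1 hit at 2 lines, 3 hits at 3 lines

>> 2^1*8 (2 words per line): t0 = 0, t1 = 0, t2 = 1, t3 = 0, t4 = 0, t5 = 0, t6 = 1
  • HIT at t1 with 1 line, t3 with 2 lines, t4 with 1 line, t5 with 1 line, t6 with 2 lines
  • Table: 3 hits at 1 line, 2 hits at 2 lines

>> 2^2*8 (4 words per line): t0 = 0, t1 = 0, t2 = 0, t3 = 0, t4 = 0, t5 = 0, t6 = 0
  • HIT at t1 through t6 with 1 line
  • Table: 6 hits at 1 line

Cache with 2 lines with 2^1 words per line (32 bytes) will have 5 hits and 7 - 5 = 2 misses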
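
The fully-associative example above can be reproduced in software. The following is a minimal sketch, not the SPCE hardware itself: it assumes LRU replacement and models the conflict count with a most-recently-used-first list, where the position of a line in the list equals the number of lines a cache would need for that access to hit. The function names `conflict_table` and `hits` are hypothetical.

```python
from collections import defaultdict

WORD_BYTES = 8  # 64-bit architecture, as stated on the slide


def conflict_table(addresses, line_size_exps=(0, 1, 2), max_lines=8):
    """Return table[b][d]: accesses that need exactly d lines to hit
    when each line holds 2**b words (assumes LRU replacement)."""
    table = {b: defaultdict(int) for b in line_size_exps}
    for b in line_size_exps:
        line_bytes = (2 ** b) * WORD_BYTES
        stack = []  # line addresses, most recently used first
        for addr in addresses:
            line = addr // line_bytes
            if line in stack:
                # Unique conflicting lines since the last access, plus one.
                lines_needed = stack.index(line) + 1
                if lines_needed <= max_lines:
                    table[b][lines_needed] += 1
                stack.remove(line)
            stack.insert(0, line)
    return table


def hits(table, b, num_lines):
    """Total hits for a cache of num_lines lines with 2**b words per line."""
    return sum(n for d, n in table[b].items() if d <= num_lines)


stream = [0, 8, 16, 0, 8, 0, 16]   # address stream from the slide
table = conflict_table(stream)
h = hits(table, b=1, num_lines=2)
print(f"2 lines, 2 words per line: {h} hits, {len(stream) - h} misses")
```

A single pass over the stream fills the table for every line size, and the hit count for any number of lines is then a sum over table entries, which is the property that lets all configurations be evaluated at once.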

SPCE

• SPCE determines hits for other set-associativities by counting the number of unique conflicts in the address trace

Tables (multiple layers): direct-mapped, 2-way, 4-way

Table (stored hit info): columns are line size b = 0, 1, 2 (2^b words per line); rows are number of sets s = 1 to 8
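
One way to picture the set-associative case is sketched below, under stated assumptions: LRU replacement within each set, set index taken as the line address modulo the number of sets, and conflicts counted only among lines that map to the same set. An access hits in a W-way cache when the line's position in its set's recency list is at most W. The function name `hits_by_ways` is hypothetical, and the address stream is reused from the fully-associative example.

```python
WORD_BYTES = 8  # 64-bit architecture, as in the fully-associative example


def hits_by_ways(addresses, line_words, num_sets, ways_options=(1, 2, 4)):
    """Count hits for each associativity in one pass over the trace.

    Assumes LRU replacement and set index = line address mod num_sets.
    """
    line_bytes = line_words * WORD_BYTES
    stacks = {}  # set index -> line addresses, most recently used first
    hit_counts = {w: 0 for w in ways_options}
    for addr in addresses:
        line = addr // line_bytes
        stack = stacks.setdefault(line % num_sets, [])
        if line in stack:
            # Unique conflicting lines in this set since last access, plus one.
            ways_needed = stack.index(line) + 1
            for w in ways_options:
                if ways_needed <= w:
                    hit_counts[w] += 1
            stack.remove(line)
        stack.insert(0, line)
    return hit_counts


stream = [0, 8, 16, 0, 8, 0, 16]
result = hits_by_ways(stream, line_words=1, num_sets=2)
print(result)
```

In this sketch each associativity corresponds to one layer of the table: the same pass over the trace updates the direct-mapped, 2-way, and 4-way counts together.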

SPCE - Hardware

• Designed and evaluated in synthesizable VHDL

[Figure: SPCE hardware diagram; one component is labeled "stack"]

Results - Energy Savings

• Energy savings compared to exploring the design space using a state-of-the-art intrusive heuristic (Zhang 03)
• Values less than 1 denote an energy increase

[Figure: bar chart of energy savings per benchmark; y-axis ticks 0, 4, 8, 12. Benchmarks: brev, epic, rawcaudio, pocsag, g721Decode, pegwitDecode, v42, ucbqsort, g3fax, bilv, binary, blit, matmul, mpegDecode, pegwitEncode, jpegDecode, ps-jpeg, fir, jpegEncode, bcnt, average. Three bars are labeled 0.99, 0.77, and 0.98. Average: 4.6x less energy expended]

Results - Tuning Speedup

• Tuning speedup obtained compared to a state-of-the-art intrusive heuristic

[Figure: bar chart of tuning speedup per benchmark; y-axis ticks 0, 4, 8, 12, 16. Benchmarks: brev, epic, rawcaudio, pocsag, g721Decode, pegwitDecode, v42, ucbqsort, g3fax, bilv, binary, blit, matmul, mpegDecode, pegwitEncode, jpegDecode, ps-jpeg, fir, jpegEncode, bcnt, average. Average: 7.7x faster]

Overheads

• Evaluated SPCE compared to the ARM920T
• Area
  • 12% area overhead
    – Due in large part to the TCAM stack structure
• Power
  • Temporary 2.2x increase in power during short tuning cycle
    – Application need only iterate 4 times for average power overhead to reduce to 1%

Conclusions

• SPCE is a specialized hardware structure to evaluate all cache configurations simultaneously
  • Enables non-intrusive runtime cache evaluation
  • Enables switching to best cache configuration in one shot
• Compared to a state-of-the-art intrusive cache tuning heuristic
  • 4.6x less energy expended
  • 7.7x speedup in tuning time
• 12% area overhead compared to ARM920T
• Temporary 2.2x increase in power during short tuning time
  – Only 4 application iterations to recoup power