
Power Savings in Embedded Processors through Decode Filter Cache

Weiyu Tang, Rajesh Gupta, Alex Nicolau
CECS, University of California, Irvine


Overview
• Introduction

• Related Work

• Decode Filter Cache

• Results and Conclusion


Introduction
• Instruction delivery is a major power consumer in embedded systems

• Instruction fetch
– 27% of processor power in StrongARM
• Instruction decode
– 18% of processor power in StrongARM

• Goal

• Reduce power in instruction delivery with minimal performance penalty
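(Taken together, these two figures suggest that instruction delivery accounts for roughly 27% + 18% = 45% of StrongARM processor power, which bounds the savings that fetch and decode optimizations can target.)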


Related Work
• Architectural approaches to reduce instruction fetch power
• Store instructions in small, power-efficient storage structures

• Examples:
– Line buffers

– Loop cache

– Filter cache


Related Work
• Architectural approaches to reduce instruction decode power
• Avoid unnecessary decoding by saving decoded instructions in a separate cache

• Trace cache
– Store decoded instructions in execution order

– Fixed cache access order

• Instruction cache is accessed on trace cache misses

– Targeted for high-performance processors

• Increase fetch bandwidth

• Require sophisticated branch prediction mechanisms

– Drawbacks

• Not power efficient as the cache size is large


Related Work
• Micro-op cache

– Store decoded instructions in program order

– Fixed cache access order

• Instruction cache and micro-op cache are accessed in parallel to minimize micro-op cache miss penalty

– Drawbacks

• Need extra stage in the pipeline, which increases misprediction penalty

• Require a branch predictor

• Per-access power is large
– Micro-op cache size is large

– Power consumption from both micro-op cache and instruction cache


Decode Filter Cache
• Targeted processors
– Single-issue, in-order execution
• Research goals
– Use a small (and power-efficient) cache to save decoded instructions
– Reduce instruction fetch power and decode power simultaneously
– Reduce power without sacrificing performance
• Problems to deal with
– What kind of cache organization to use
– Where to fetch instructions, as instructions can be provided from multiple sources
– How to minimize decode filter cache miss latency


Decode Filter Cache
[Pipeline diagram: fetch, decode, execute, mem, and writeback stages. Instructions can be supplied by the I-cache, the line buffer, or the decode filter cache; a predictor chooses the fetch source for each fetch address.]


Decode Filter Cache
• Decode filter cache organization
• Problems with a traditional cache organization
– The decoded instruction width varies
– Saving all decoded instructions would waste cache space

• Our approach
– Instruction classification

• Classify instructions into cacheable and uncacheable based on the instruction width distribution
• Use a "cacheable ratio" to balance cache utilization against the number of instructions that can be cached

– Sectored cache organization

• Each instruction can be cached independently of neighboring lines

• Neighboring lines share a tag to reduce the cost of the tag store (see the sketch below)
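To make the organization above concrete, below is a minimal C sketch of a sectored decode filter cache: several neighboring entries share one tag, each entry holds one decoded instruction with its own valid bit, and uncacheable instructions are simply never installed. The sizes, the fixed slot width, and the is_cacheable threshold are illustrative assumptions, not the parameters from the paper.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define SECTORS_PER_LINE 4    /* neighboring entries sharing one tag (assumed)      */
#define NUM_LINES        32   /* number of sector groups in the cache (assumed)     */

typedef struct {
    uint32_t tag;                        /* one tag for the whole sector group       */
    bool     valid[SECTORS_PER_LINE];    /* each decoded instruction cached alone    */
    uint64_t decoded[SECTORS_PER_LINE];  /* fixed-width slot for a decoded instr.    */
} dfc_line_t;

static dfc_line_t dfc[NUM_LINES];

/* Classification: only instructions whose decoded form fits the fixed slot are
 * cacheable; the cacheable ratio is tuned by how wide that slot is chosen.      */
static bool is_cacheable(unsigned decoded_width_bits) {
    return decoded_width_bits <= 64;     /* hypothetical threshold */
}

/* Look up one instruction (pc is an instruction index, i.e. address / 4).
 * A hit requires the shared tag to match AND the sector's own valid bit.        */
static bool dfc_lookup(uint32_t pc, uint64_t *out) {
    uint32_t sector = pc % SECTORS_PER_LINE;
    uint32_t index  = (pc / SECTORS_PER_LINE) % NUM_LINES;
    uint32_t tag    = pc / (SECTORS_PER_LINE * NUM_LINES);
    dfc_line_t *l = &dfc[index];
    if (l->tag == tag && l->valid[sector]) { *out = l->decoded[sector]; return true; }
    return false;
}

/* Install one decoded instruction; a tag change invalidates the whole group.    */
static void dfc_fill(uint32_t pc, uint64_t decoded, unsigned width_bits) {
    if (!is_cacheable(width_bits)) return;
    uint32_t sector = pc % SECTORS_PER_LINE;
    uint32_t index  = (pc / SECTORS_PER_LINE) % NUM_LINES;
    uint32_t tag    = pc / (SECTORS_PER_LINE * NUM_LINES);
    dfc_line_t *l = &dfc[index];
    if (l->tag != tag) { memset(l->valid, 0, sizeof l->valid); l->tag = tag; }
    l->decoded[sector] = decoded;
    l->valid[sector]   = true;
}
```

A cold or uncacheable instruction simply leaves its valid bit clear, so the only per-instruction overhead beyond the decoded slot is the shared tag and the sector's valid bit.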


Decode Filter Cache
• Where to fetch instructions

• Instructions can be provided from one of the following sources

– Line buffer

– Decode filter cache

– Instruction cache

• Predictive order for instruction fetch
– For power efficiency, either the decode filter cache or the line buffer is accessed first when an instruction is likely to hit there

– To minimize the decode filter cache miss penalty, the instruction cache is accessed directly when the decode filter cache is likely to miss (see the sketch below)
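A possible rendering of this fetch-source dispatch in C, where line_buffer_lookup, dfc_lookup, icache_fetch, and decode are hypothetical stand-ins for the hardware structures (only the decode filter cache holds already-decoded instructions):

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { SRC_LINE_BUFFER, SRC_DECODE_FILTER_CACHE, SRC_ICACHE } fetch_src_t;

/* Hypothetical interfaces standing in for the hardware structures. */
bool     line_buffer_lookup(uint32_t pc, uint32_t *raw_inst);
bool     dfc_lookup(uint32_t pc, uint64_t *decoded_inst);
void     icache_fetch(uint32_t pc, uint32_t *raw_inst);
uint64_t decode(uint32_t raw_inst);

/* Fetch one instruction from the predicted source only; on a miss, fall back
 * to the I-cache so that a wrong prediction costs little extra latency.      */
uint64_t fetch_one(uint32_t pc, fetch_src_t predicted) {
    uint32_t raw;
    uint64_t decoded;
    if (predicted == SRC_DECODE_FILTER_CACHE && dfc_lookup(pc, &decoded))
        return decoded;              /* both I-cache access and decode skipped      */
    if (predicted == SRC_LINE_BUFFER && line_buffer_lookup(pc, &raw))
        return decode(raw);          /* I-cache access skipped, decode still needed */
    icache_fetch(pc, &raw);          /* predicted I-cache, or one of the misses above */
    return decode(raw);
}
```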


Decode Filter Cache
• Prediction mechanism (see the sketch below)
• When the next fetch address and the current address map to the same cache line
– If the current fetch source is the line buffer, the next fetch source remains the same
– If the current fetch source is the decode filter cache and the corresponding instruction is valid, the next fetch source remains the same
– Otherwise, the next fetch source is the instruction cache
• When the next fetch address and the current address map to different cache lines
– Predict based on the next-fetch prediction table, which exploits control-flow predictability
– If the tag of the current fetch address and the tag of the predicted next fetch address are the same, the next fetch source is the decode filter cache
– Otherwise, the next fetch source is the instruction cache
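Roughly, those rules might be coded as follows, continuing the earlier sketch; the table size and the line/tag arithmetic are assumptions, and the tag comparison follows the heuristic stated on this slide:

```c
#include <stdint.h>
#include <stdbool.h>

/* Same enum as in the earlier sketch, repeated for self-containment. */
typedef enum { SRC_LINE_BUFFER, SRC_DECODE_FILTER_CACHE, SRC_ICACHE } fetch_src_t;

#define NFP_ENTRIES    64   /* assumed size of the next-fetch prediction table      */
#define INSTS_PER_LINE 4    /* assumed instructions per decode filter cache line    */
#define DFC_LINES      32   /* assumed number of lines in the decode filter cache   */

static uint32_t nfp_table[NFP_ENTRIES];  /* last observed successor address per line */

static uint32_t line_of(uint32_t pc) { return pc / INSTS_PER_LINE; }
static uint32_t tag_of(uint32_t pc)  { return pc / (INSTS_PER_LINE * DFC_LINES); }

/* Record where control actually went after leaving a line, so the table
 * captures control-flow predictability for the next visit.                */
void nfp_update(uint32_t cur_pc, uint32_t actual_next_pc) {
    nfp_table[line_of(cur_pc) % NFP_ENTRIES] = actual_next_pc;
}

fetch_src_t predict_next_source(uint32_t cur_pc, uint32_t next_pc,
                                fetch_src_t cur_src, bool dfc_next_valid) {
    if (line_of(next_pc) == line_of(cur_pc)) {
        /* Same cache line: keep streaming from the current low-power source. */
        if (cur_src == SRC_LINE_BUFFER) return SRC_LINE_BUFFER;
        if (cur_src == SRC_DECODE_FILTER_CACHE && dfc_next_valid)
            return SRC_DECODE_FILTER_CACHE;
        return SRC_ICACHE;
    }
    /* Crossing a line boundary: consult the next-fetch prediction table;
     * matching tags suggest the predicted target is resident in the decode
     * filter cache, so try it first.                                        */
    uint32_t predicted = nfp_table[line_of(cur_pc) % NFP_ENTRIES];
    if (tag_of(predicted) == tag_of(cur_pc)) return SRC_DECODE_FILTER_CACHE;
    return SRC_ICACHE;
}
```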


Results
• Simulation setup
– MediaBench benchmarks
– Cache sizes: 512B decode filter cache, 16KB instruction cache, 8KB data cache
• Configurations investigated

Config   Line buffer   Decode filter cache   Cacheable ratio   Instruction filter cache   Use predictor
DF_0.9   X             X                     0.9               -                          X
DF_0.8   X             X                     0.8               -                          X
DF_0.7   X             X                     0.7               -                          X
DF_0.6   X             X                     0.6               -                          X
DF_NO    X             X                     0.9               -                          -
IF       -             -                     -                 X                          -


Results: % reduction in I-cache fetches

[Bar chart comparing the IF, DF_NO, and DF_0.9 configurations; y-axis 0–100%.]


Results: % reduction in instruction decodes

[Bar chart: reduction per benchmark and on average for the DF_NO and DF_0.9 configurations; y-axis 0–100%.]


Results: normalized delay

[Bar chart: normalized delay per benchmark and on average for the IF, DF_NO, and DF_0.9 configurations; y-axis 0.90–1.15.]


Results: % reduction in processor power

[Bar chart: reduction per benchmark and on average for the IF, DF_0.6, DF_0.7, DF_0.8, and DF_0.9 configurations; y-axis 0–45%.]


Conclusion
• There is a basic tradeoff between
– the number of instructions cached (as in instruction caches), and
– greater power savings from reduced decode and fetch work (as in decode caches)
• We tip this balance in favor of the decode cache through the coordinated operation of
– instruction classification / selective decoding (into smaller widths)
– a sectored cache built around this classification

• The results show

• Average 34% reduction in processor power
– 50% more effective in power savings than an instruction filter cache
• Less than 1% performance degradation, due to the effective prediction mechanism