Architectural and Compiler Techniques for Energy Reduction in High-Performance Microprocessors Nikolaos Bellas, Ibrahim N. Hajj, Fellow, IEEE, Constantine D. Polychronopoulos, and George Stamoulis IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 3, JUNE 2000 Presenter: R.T.-Gu


Architectural and Compiler Techniques for Energy

Reduction in High-Performance Microprocessors

Nikolaos Bellas, Ibrahim N. Hajj, Fellow, IEEE, Constantine D. Polychronopoulos, and George Stamoulis

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 3, JUNE 2000

Presenter: R.T.-Gu


Abstract

In this paper, we focus on low-power design techniques for high-performance processors at the architectural and compiler levels. We focus mainly on developing methods for reducing the energy dissipated in the on-chip caches. Energy dissipated in caches represents a substantial portion in the energy budget of today’s processors. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip.

We propose a method that uses an additional mini-cache located between the I-Cache and the central processing unit (CPU) core and buffers instructions that are nested within loops and are continuously otherwise fetched from the I-Cache. This mechanism is combined with code modifications, through the compiler, that greatly simplify the required hardware, eliminate unnecessary instruction fetching, and consequently reduce signal switching activity and the dissipated energy.

We show that the additional cache, dubbed L-Cache, is much smaller and simpler than the I-Cache when the compiler assumes the role of allocating instructions to it. Through simulation, we show that for the SPECfp95 benchmarks, the I-Cache remains disabled most of the time, and the “cheaper” extra cache is used instead. We also propose different techniques that are better adapted to nonnumeric nonloop-intensive code.


Outline

• What’s the problem
• The main idea
• Software support
• Hardware support
• Energy Estimation
• Result & discuss
• Conclusion


What’s the problem?

The I-Cache subsystem is one of the main power consumers in most of today’s microprocessors.
• The on-chip L1 and L2 caches of the DEC Alpha chip dissipate 25% of the total power of the processor [1].
• The I-Cache of the StrongARM SA-110 dissipates about 27% of the total power [2].

The I-Cache subsystem is a very high power consumer because of:
• Very high clock rate (the same as the CPU)
• Very high switching activity

How can the switching activity on the cache be reduced?


The main idea

• Use an additional mini-cache (the L-Cache) located between the I-Cache and the CPU.
• During loop execution, the I-Cache unit repeats the same fetches over and over again.
• Reduce how often the IF unit fetches from the I-Cache by caching the I-Cache itself in the L-Cache.


Software support

We have to identify the most frequently executed basic blocks in loops.

The compiler lays out the target program so that the selected blocks can be placed in the L-Cache.


How to find out the most frequently executed basic blocks

The compiler places the basic blocks in the L-Cache by determining their nesting and using their execution profile.

The determination consists of two distinct phases:
• Function inlining: the compiler tries to expose as many basic blocks as possible in frequently executed routines.
• Block placement: the compiler selects basic blocks and places them so that the number of selected blocks in the extra cache is maximized.


How to select and then place

First Step: Nesting Computation

Second Step: LabelTree Construction

Third Step: Basic Block Selection and Placement

Fourth and Fifth Steps: Global Placement in the Memory


Nesting Computation

The control flow graph is built for each function

Find the loops and the nesting of every basic block B, and record the result as LabelSet(B).
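The nesting computation can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that LabelSet(B) is the set of loops containing block B, that each loop is labeled by its header block, and that loops are the natural loops found from back edges in the control flow graph. The function names `dominators` and `label_sets` and the example graph are hypothetical.

```python
from collections import defaultdict

def dominators(cfg, entry):
    """Iteratively compute dominator sets for a CFG given as {block: [successors]}.

    Assumes every block appears as a key of cfg.
    """
    nodes = set(cfg)
    preds = defaultdict(set)
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom, preds

def label_sets(cfg, entry):
    """Return {block: set of loop headers whose natural loop contains the block}."""
    dom, preds = dominators(cfg, entry)
    labels = {n: set() for n in cfg}
    for tail, succs in cfg.items():
        for head in succs:
            if head in dom[tail]:          # tail -> head is a back edge
                body, stack = {head}, [tail]
                while stack:               # walk predecessors back to the header
                    n = stack.pop()
                    if n not in body:
                        body.add(n)
                        stack.extend(preds[n])
                for n in body:
                    labels[n].add(head)
    return labels

# Hypothetical CFG: an outer loop headed by A with an inner loop headed by B.
cfg = {
    "entry": ["A"],
    "A": ["B"],
    "B": ["C"],
    "C": ["B", "D"],   # C -> B closes the inner loop
    "D": ["A", "exit"],  # D -> A closes the outer loop
    "exit": [],
}
labels = label_sets(cfg, "entry")
nesting_depth = {block: len(loops) for block, loops in labels.items()}
```

In this example, blocks B and C sit in both loops (depth 2), A and D only in the outer loop (depth 1), and the entry and exit blocks in none.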


LabelTree Construction


Placement algorithm


Global Placement in the Memory

The user can adjust the thresholds used to select basic blocks in the first stage, trading off performance degradation against power savings.


Hardware support

A 32-bit register holds the address of the first nonplaced block in the main memory layout. If the PC has a value less than that address, a comparator sets blocked_part=on; otherwise the signal is set to off.
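The comparator behaviour can be sketched in a few lines of Python. This is only a behavioural model under stated assumptions: the signal is modeled as a boolean, and the boundary address and PC trace are hypothetical values chosen for illustration. It does not model L-Cache hits or misses, only which fetches fall in the placed region.

```python
def blocked_part(pc: int, first_nonplaced_addr: int) -> bool:
    """Model of the comparator: True (on) when the PC lies below the boundary register."""
    return pc < first_nonplaced_addr

def count_fetch_regions(pc_trace, first_nonplaced_addr):
    """Count how many fetches in a PC trace see blocked_part on versus off."""
    on = sum(1 for pc in pc_trace if blocked_part(pc, first_nonplaced_addr))
    return {"blocked_part_on": on, "blocked_part_off": len(pc_trace) - on}

# Hypothetical example: placed blocks occupy addresses below 0x1000.
BOUNDARY = 0x1000
# A four-instruction loop body in the placed region, executed three times,
# followed by two instructions in the nonplaced region.
trace = [0x0100, 0x0104, 0x0108, 0x010C] * 3 + [0x2000, 0x2004]
counts = count_fetch_regions(trace, BOUNDARY)
```

Because the compiler lays the placed blocks out contiguously below one boundary, a single register and one comparison are enough to tell the two regions apart.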


Energy Estimation

The energy model is based on the work by Wilton and Jouppi [16], in which they propose a timing analysis model for SRAM-based caches.

This model uses run-time information on cache utilization (number of accesses, number of hits, misses, input statistics, etc.) gathered during simulation.

A 0.8-μm technology with a 3.3-V voltage supply is assumed.


Simulation results


Simulation results (cont.)


Discuss

A larger L-Cache saves more power
• A larger L-Cache is more successful in storing basic blocks and therefore in disabling the I-Cache for a longer period of time.

A 256-instruction L-Cache is enough
• In most cases, a 256-instruction L-Cache approximates the performance of an infinite-size L-Cache.

The performance is worse in integer benchmarks
• Most integer benchmarks do not have a large number of basic blocks that can be cached in the L-Cache (they are not nested).


Modified Scheme for Integer Benchmarks

Focus on a function (not a nested loop).

Select the function with the largest contribution to the execution time, and place its most important basic blocks permanently in the L-Cache.


Experimental Evaluation of the Modified Scheme

The results are very encouraging for benchmarks that have poor performance under the initial method.


Conclusion

They have developed techniques for hardware/software codesign in high-performance processors that result in energy/power reduction at the system level.

Major energy gains can be obtained if the compiler and the hardware are designed with low energy in mind.


The dynamic mix of instructions for SPECint95 benchmarks

An instruction belongs to one of the six following categories:
1) “P” if it has been selected by the algorithm to be positioned in the L-Cache;
2) “U” if it is in a basic block with a small execution frequency (unimportant);
3) “NN” if it is in a block with large execution frequency but not nested in a loop;
4) “SD” if it is in a nested block with large execution frequency but small execution density;
5) “SS” if it belongs to a nested block with large frequency and execution density but small size;
6) “L” if it satisfies all the above criteria but does not fit in the L-Cache.

The frequency threshold is (1/10 000) of the execution time of the program, the execution density threshold is five executions per function
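The six categories can be read as a chain of tests, sketched below in Python. This is an illustration, not the authors' algorithm. The frequency and density thresholds come from the slide, but the size threshold is not stated there, so `min_size` is a hypothetical parameter the caller must supply. The `BlockProfile` fields and the use of strict comparisons are also assumptions.

```python
from dataclasses import dataclass

FREQ_THRESHOLD = 1 / 10_000   # fraction of program execution time (from the slide)
DENSITY_THRESHOLD = 5         # execution density threshold (from the slide)

@dataclass
class BlockProfile:
    """Hypothetical per-block profile data; the field names are assumptions."""
    time_fraction: float   # share of total execution time spent in the block
    nested: bool           # True if the block is nested in a loop
    density: float         # execution density of the block
    size: int              # number of instructions in the block
    fits_in_lcache: bool   # True if room remains for it in the L-Cache

def classify(block: BlockProfile, min_size: int) -> str:
    """Return the category label for every instruction in the given block."""
    if block.time_fraction < FREQ_THRESHOLD:
        return "U"    # unimportant: small execution frequency
    if not block.nested:
        return "NN"   # frequent but not nested in a loop
    if block.density < DENSITY_THRESHOLD:
        return "SD"   # nested and frequent, but small execution density
    if block.size < min_size:   # min_size is an assumed, caller-chosen threshold
        return "SS"   # small size
    if not block.fits_in_lcache:
        return "L"    # passes all criteria but does not fit
    return "P"        # selected for placement in the L-Cache

hot_loop_block = BlockProfile(0.02, True, 40.0, 12, True)
cold_block = BlockProfile(0.00001, True, 40.0, 12, True)
straight_line_block = BlockProfile(0.02, False, 40.0, 12, True)
```

Each test only applies once the earlier ones have passed, which mirrors how categories 3) to 6) are each defined in terms of the criteria before them.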


The dynamic mix of instructions for SPECint95 benchmarks (2)