Block-Precise Processors Nagesh B Lakshminarayana, Hyesoon Kim

| Processors designed for low power
| Architectural state is correct at basic block granularity rather than instruction granularity


Outline

| Background

| B-Processor mechanisms

| Results

| Conclusion

Pipeline Designs

| Depending on when instructions read their source operands, two pipeline designs are possible
    Operand values are read before issue
    Operand values are read after issue

Issue: instruction sent to functional unit for execution
Dispatch: instruction inserted into instruction scheduler

Operand Values Are Read Before Issue

| Pipeline has a Data-Capture (DC) Scheduler
    DC Scheduler + ARF + ROB with Data – Intel Nehalem, Intel Core

[Figure: data-capture pipeline. Blocks: Fetch, Decode and Dispatch; ARF; ROB/Rename Buffer; Data-Capture Scheduler; Execution Units. Labeled paths: Read, Update, Bypass and Wake up.]

DC Scheduler + ARF + ROB with Data

| Results produced by instructions are copied twice
    First to ROB – on instruction completion
    Then to ARF – on instruction commit

| ROB + ARF consume a significant portion of the total core power
    > 10% [Brooks et al. ISCA 2000]

Goal

| Design mechanism(s) to reduce the power consumption of the ROB + ARF
    reduce the number of writes to these structures

Related Work

| Change the organization of these structures
    ports, hierarchical organization, banking [MICRO’92, MICRO’94]

| Reduce accesses to these structures
    Register File Caches [Yung et al, ICCD ’95]
    Reduce writes
        Target short-lived variables (mostly VLIW)

Observation

| Many instruction results within a basic block are not visible outside the basic block
    we call such values BB-Internal values

| Values visible outside a basic block are called BB-External values
    The last value written to a register within a basic block is a BB-External value

    …
    ADD R1, R2, R3
    SUB R4, R1, R6
    …
    MUL R1, R1, R4
    …
    JGZ R10

[Figure labels on the code above: Basic Block, Inst-M, Inst-N]

Dependency Distance

| Dependency Distance (Dep-Distance) – integer value defined for every instruction
    For instructions producing BB-Internal value(s) only, it is the distance of the last consumer from the instruction
    For instructions producing BB-External value(s), it is infinite
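
The definition can be sketched in a few lines of Python. This is only an illustration, not the compiler pass used in the work: the function name `dep_distances`, the (destination, sources) tuple format, the destination-first operand order, and the treatment of results that are overwritten without ever being read (kept as infinite, so they are always written) are all assumptions. The block is a hypothetical one modeled on the Observation example, with the elided instructions left out.

```python
import math

def dep_distances(block):
    """Return the dep-distance of each instruction in one basic block.

    block is a list of (dest, sources) pairs; dest is a register name
    or None for instructions that produce no register result.
    """
    result = []
    for i, (dest, _) in enumerate(block):
        if dest is None:
            # Assumption: no result is produced, so there is no write to skip.
            result.append(math.inf)
            continue
        last_consumer = None
        overwritten = False
        for j in range(i + 1, len(block)):
            later_dest, later_srcs = block[j]
            if dest in later_srcs:      # sources are read before the dest is written
                last_consumer = j
            if later_dest == dest:      # value is dead past this point
                overwritten = True
                break
        if overwritten and last_consumer is not None:
            # BB-Internal value: distance to its last consumer.
            result.append(last_consumer - i)
        else:
            # Last write to the register in the block (BB-External), or an
            # unread value (assumption: treated conservatively as infinite).
            result.append(math.inf)
    return result

# Hypothetical block: destination register first, then sources.
example_block = [
    ("R1", ("R2", "R3")),   # ADD R1, R2, R3
    ("R4", ("R1", "R6")),   # SUB R4, R1, R6
    ("R1", ("R1", "R4")),   # MUL R1, R1, R4
    (None, ("R10",)),       # JGZ R10
]
distances = dep_distances(example_block)
```

In this block only the ADD produces a BB-Internal value: its R1 is read by SUB and MUL and then overwritten by MUL, so its dep-distance is 2. The SUB and MUL results are the last writes to R4 and R1, so they are BB-External.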

Dependency Distance

| Many BB-Internal values become dead shortly after being produced
    i.e., all consumers of a BB-Internal value are found within a short distance of the instruction producing it

More than 22% of all instructions produce only BB-Internal values, and those values are consumed within 4 instructions of being produced

[Figure: per-benchmark breakdown of instructions (y-axis 0 to 100) into BB-External, Dep-Distance > 8, Dep-Distance = [5, 8], Dep-Distance = 4, Dep-Distance = 3, Dep-Distance = 2 and Dep-Distance = 1. Benchmarks: perlbench, gcc, gobmk, sjeng, h264ref, astar, gamess, zeusmp, cactusADM, namd, soplex, calculix, tonto, wrf.]

Mechanisms – Overview

| Instruction results are broadcast over the bypass network

| If we can guarantee that all instructions dependent on the BB-Internal values produced by an instruction have received those values from the bypass network, then we can skip writing the BB-Internal values to the operand store(s)

Mechanisms – Overview

| If the results of an instruction are not being written to operand stores (Mechanism #1), then we can stop the broadcast of results beyond the first stage of bypass

Eliminating writes to ROB and ARF

| Assistance of the Compiler

| Changes to ISA

| Changes to hardware

Compiler

| Analyze the lifetime of variables and identify the dep-distance of instructions in basic blocks

ISA Extensions

| Add 2 bits to the instruction encoding
    Compiler passes dep-distance of instructions via this encoding
    Bits can be encoded in several ways
    Example encoding using powers of 2

Encoding   Meaning
00         Dep-Distance is infinite
01         2^0 ≤ Dep-Distance < 2^1   [1]
10         2^1 ≤ Dep-Distance < 2^2   [2-3]
11         2^2 ≤ Dep-Distance < 2^3   [4-7]
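
A minimal sketch of this example encoding, assuming the compiler holds each dep-distance as a positive integer or as infinity. The function names are hypothetical, and mapping finite distances of 8 or more to 00 is an assumption: the table does not cover them, and 00 is the conservative choice because the result is then always written.

```python
import math

def encode_dep_distance(dd):
    """Map a dep-distance to the 2-bit field of the example encoding."""
    if dd == math.inf or dd > 7:
        # Infinite, or (assumption) too large for the table: always write.
        return 0b00
    if dd < 1:
        raise ValueError("dep-distance must be at least 1")
    # bit_length() is 1 for 1, 2 for 2-3 and 3 for 4-7, which matches the
    # power-of-2 ranges of encodings 01, 10 and 11.
    return dd.bit_length()

def decode_range(code):
    """Return the (low, high) dep-distance range for a code, or None for 00."""
    if code == 0b00:
        return None
    return (1 << (code - 1), (1 << code) - 1)

codes = [encode_dep_distance(d) for d in (1, 2, 3, 4, 7, 8, math.inf)]
```

Because one code covers a range of distances, the hardware cannot recover the exact distance from the field; it only learns the range given by `decode_range`.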

Changes to Scheduler

| Add a bit-mask (Presence Vector) to track the presence of instructions in Scheduler
    Bit-mask of same size as ROB
    Bit-mask has head and tail pointers
    First 0 (from tail) in mask is set when a new instruction is dispatched
    First 1 (from head) in mask is cleared when an instruction is retired

Changes to Scheduler

| When an instruction is issued, check if all dependent instructions have been dispatched
    If dep-distance is n, check if the nth bit from the bit for this instruction is set
    If set, then do not write to ROB and ARF

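[Figure: Scheduler entries Ia, Ib, Ic, Id, ... shown next to the Presence Vector (PV) bits 0 1 1 1 1 0, drawn twice. In the second copy an instruction with DD = 3 checks the corresponding PV bit and the check hits.]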
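
The presence-vector check can be sketched as below. The class and function names are hypothetical, and two details are assumptions rather than something the slides state: the check uses the upper bound of the encoded range (1, 3 or 7), and an explicit occupancy count guards against the index wrapping around onto an older instruction when the vector is nearly full.

```python
class PresenceVector:
    """Circular bit-mask with one bit per ROB entry."""

    def __init__(self, size):
        self.size = size
        self.bits = [0] * size
        self.head = 0    # oldest in-flight instruction
        self.tail = 0    # next free slot
        self.count = 0   # number of in-flight instructions

    def dispatch(self):
        """Set the first 0 from the tail and return the slot used."""
        if self.count == self.size:
            raise RuntimeError("presence vector is full")
        slot = self.tail
        self.bits[slot] = 1
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return slot

    def retire(self):
        """Clear the first 1 from the head."""
        if self.count == 0:
            raise RuntimeError("presence vector is empty")
        self.bits[self.head] = 0
        self.head = (self.head + 1) % self.size
        self.count -= 1

    def is_dispatched(self, slot, n):
        """True if the instruction n slots younger than slot is in flight."""
        age = (slot - self.head) % self.size
        if age + n >= self.count:
            return False  # the nth younger instruction is not dispatched yet
        return self.bits[(slot + n) % self.size] == 1

# Assumption: check the largest distance that each 2-bit code can stand for.
UPPER_BOUND = {0b01: 1, 0b10: 3, 0b11: 7}

def can_skip_write(pv, slot, code):
    """Decide at issue time whether the ROB and ARF writes can be skipped."""
    if code == 0b00:
        return False  # infinite dep-distance: must write to ROB and ARF
    return pv.is_dispatched(slot, UPPER_BOUND[code])

pv = PresenceVector(8)
slots = [pv.dispatch() for _ in range(5)]   # five instructions in flight
```

With five instructions dispatched, `can_skip_write(pv, 0, 0b10)` holds because the instruction three slots after slot 0 is already in flight, while the same code at slot 2 fails because its third successor has not been dispatched yet. Since dispatch is in order, a set bit at the upper bound implies that every closer instruction has been dispatched too.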

Changes to Scheduler

| d1 d0 – 2-bit encoding for the instruction
    b_x b_(x-1) … b_0 – Presence Vector
    d1 d0 = 00: must write to ROB and ARF
    d1 d0 = 01: dep-distance is 1
    d1 d0 = 10: dep-distance in [2,3]
    d1 d0 = 11: dep-distance in [4,7]

[Figure: selection logic driven by the encodings 01, 10 and 11 (Dep-Distance).]

Issues – Supporting Precise Exceptions

| Precise exceptions are not supported
    Many instructions will not update the architectural state as they are supposed to
    But at the end of a basic block, the architectural state matches the state obtained with regular execution

Solution: Check-point the RF at the end of each basic block; whenever there is an exception, roll back to the start of the basic block and execute in instruction-precise mode
    Use a light-weight RF check-pointing mechanism

Check-pointing Mechanism

| ARF → 2 ARF + 1 Dirty Mask + Several State Masks
    Each bit mask is equal in size to the ARF (one bit per register)
    # of state masks is equal to the maximum number of basic blocks supported by pipeline + 1

[Figure: the single ARF is replaced by 2 copies of the ARF (ARF-0 and ARF-1) plus the Dirty and State Masks.]

Check-pointing Mechanism

| Dirty mask
    Tracks which registers have been written by the current basic block

| State mask
    Holds current mapping of registers, i.e., whether the latest value of a register is in ARF-0 or in ARF-1

| First write to a register in a basic block flips the bit in the state mask
    register value at end of last basic block is untouched
    subsequent writes to same register use the current mapping
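
A simplified sketch of the dirty-mask and state-mask behavior described above. It assumes a single basic block in flight, so it keeps just one saved state mask instead of several; the class name `CheckpointedARF` and its methods are hypothetical.

```python
class CheckpointedARF:
    """Two ARF copies with a dirty mask and a state mask (simplified)."""

    def __init__(self, num_regs):
        self.arf = [[0] * num_regs, [0] * num_regs]  # ARF-0 and ARF-1
        self.state = [0] * num_regs   # which copy holds the latest value
        self.dirty = [0] * num_regs   # written by the current basic block?
        self.saved_state = list(self.state)  # mapping at the last block end

    def write(self, reg, value):
        if not self.dirty[reg]:
            # First write in this block: flip the mapping so the value from
            # the end of the last basic block stays untouched in the other copy.
            self.state[reg] ^= 1
            self.dirty[reg] = 1
        self.arf[self.state[reg]][reg] = value

    def read(self, reg):
        return self.arf[self.state[reg]][reg]

    def end_block(self):
        """Check-point: the current mapping becomes the committed one."""
        self.saved_state = list(self.state)
        self.dirty = [0] * len(self.dirty)

    def rollback(self):
        """Exception: restore the mapping from the start of the block."""
        self.state = list(self.saved_state)
        self.dirty = [0] * len(self.dirty)

rf = CheckpointedARF(4)
rf.write(1, 10)
rf.end_block()            # R1 = 10 is now the check-pointed value
rf.write(1, 20)           # first write in the new block flips the mapping
rf.write(1, 30)           # later writes reuse the current mapping
latest = rf.read(1)
rf.rollback()             # exception inside the block
restored = rf.read(1)
```

The check-point costs no register copies: taking it saves one bit mask, and rolling back restores that mask, which is what makes the mechanism light-weight in this sketch.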

Results

| MacSim Simulator with integrated McPAT-based tool for modeling power

| Nehalem-like core
    4-wide, 128-entry ROB, 36-entry scheduler, 16 IRegs, 32 FRegs
    22nm

Results

| Power savings for ROB + ARF
    15% over baseline, 7% over RFC-32
    FP benchmarks – B-Processor skips writing many results and the RFC mechanism writes a lot of live values to ROB

[Figure: total power consumption for ROB + ARFs and other data stores relative to Baseline (y-axis 0 to 1), for RFC-32 and B-Processor, per benchmark: perlbench, gcc, gobmk, sjeng, h264ref, astar, gamess, zeusmp, cactusADM, namd, soplex, calculix, tonto, wrf.]

Results

| Power savings for Bypass Network
    baseline has two levels of bypass
    10% savings on average

[Figure: % saving in power over Baseline for the Bypass Network (y-axis 0 to 40), for B-Processor-C, per benchmark: perlbench, gcc, gobmk, sjeng, h264, astar, bwaves, milc, gromacs, leslie3d, dealII, povray, GemsFDTD, lbm, sphinx3, GMean.]

Conclusion

| ROB + ARF contribute a significant fraction of total power
    we propose a mechanism to reduce their power consumption

| For BB-Internal values, if all dependent instructions read the value off the bypass network, then skip the writes to ROB and ARF and skip the broadcast beyond the first stage of bypass

| Mechanism results in correct architectural state at basic block granularity

| Mechanism reduces ROB + ARF power consumption by 15% and bypass power consumption by 10% relative to a conventional design


Thank You!