KILO-INSTRUCTION PROCESSORS

Arzucan Özgür

Department of Computer Engineering, Boğaziçi University

15.12.2005 Cmpe 511

Introduction

Memory Wall

Performance improvements of high-frequency microprocessors are seriously limited by main memory access latencies

[Figure: Processor-Memory Performance Gap. Performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU: 60%/yr. (“Moore’s Law”); RAM: 7%/yr.; the processor-memory performance gap grows 50%/yr.]

Reducing Memory Latency

Cache memory hierarchies

First-level (L1) cache built into the processor core: takes 1-3 processor clock cycles to access

On a miss in the L1 cache, the on-chip L2 cache is accessed in the order of 10 processor cycles

Accessing main memory takes at least in the order of 100 processor cycles

Prefetching data from memory to the cache

Prefetch addresses are hard to predict
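
The latencies listed for the cache hierarchy can be combined into an average cost per memory access. The sketch below is only illustrative: the function name, the default latencies (picked from within the ranges on this slide) and the hit rates in the example are assumptions, not measurements.

```python
def average_access_cycles(l1_hit_rate, l2_hit_rate,
                          l1_cycles=2, l2_cycles=10, mem_cycles=100):
    """Average cycles per memory access for a two-level cache hierarchy.

    l2_hit_rate is the local hit rate: the fraction of L1 misses that hit in L2.
    The default latencies are assumed values in the ranges quoted on the slide.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    # Every access pays the L1 latency; L1 misses also pay the L2 latency;
    # L2 misses additionally pay the main memory latency.
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * mem_cycles)


if __name__ == "__main__":
    # Hypothetical hit rates, chosen only to show the shape of the calculation.
    print(average_access_cycles(l1_hit_rate=0.95, l2_hit_rate=0.80))
```

With these assumed numbers the average comes out at about 3.5 cycles, but each access that actually reaches main memory still stalls its dependents for 100 cycles or more, which is the case the rest of the talk addresses.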

[Figure: processor pipeline with stages labeled Next IP, Fetch, Drive, Alloc., Rename, Queue, Schedule, Dispatch, Reg. Read, Execute, Flags, Br. chk, Drive, together with the L1 instruction cache, L1 data cache, L2 cache, and memory; a branch misprediction is marked along the pipeline.]

Out-of-order superscalar processors

Sequence of instructions containing data cache misses

Kilo-Instruction Processors

Definition

An out-of-order superscalar processor that supports thousands of “in-flight instructions”
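
A back-of-the-envelope calculation suggests why the window has to hold thousands of instructions. The sketch assumes a processor that can issue 4 instructions per cycle and a 128-entry window; both numbers, like the function names, are illustrative assumptions rather than figures from the slides. The 100-cycle latency is the main memory figure quoted earlier.

```python
def instructions_to_cover(latency_cycles, issue_width):
    """Instructions that must be in flight to keep issuing for latency_cycles."""
    return latency_cycles * issue_width


def cycles_until_window_full(window_entries, issue_width):
    """Cycles of useful work before the window fills behind a stalled load."""
    return window_entries // issue_width


if __name__ == "__main__":
    # Assumed 4-wide issue; 100 cycles is the lower bound quoted for main memory.
    print(instructions_to_cover(latency_cycles=100, issue_width=4))
    # A longer, assumed 500-cycle miss pushes the requirement into the thousands.
    print(instructions_to_cover(latency_cycles=500, issue_width=4))
    # An assumed 128-entry window fills long before a 100-cycle miss returns.
    print(cycles_until_window_full(window_entries=128, issue_width=4))
```

Under these assumptions a 128-entry window fills after 32 cycles, so the processor sits idle for most of each miss unless far more instructions can be kept in flight.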

Intelligent use of resources

Scalability

Thousands of in-flight instructions and in-order commit make designs impractical:

ROB: needs to maintain a copy of every in-flight instruction

IQs: instructions depending on long-latency instructions remain in these queues for a long time

LSQs: instructions remain in the queue until commit

Registers: a new physical register for each instruction producing a new value

We would like to get the IPC of thousands of instructions in-flight without drastically increasing resource requirements

Efficient Kilo-Instruction Processor Design

Multi-Checkpointing the ROB

Out-of-Order Commit

Early Release of Resources

Ephemeral Registers

Load Queues

Checkpointing

The ROB allows the restoration of the correct state at any instruction, which is more than is necessary

Checkpoint: a snapshot of the processor state taken at a specific instruction of the program being executed (processor state is checkpointed only for a subset of instructions)

With this snapshot the processor can restore state to that point in case of an exception or misprediction
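
The snapshot-and-restore idea can be sketched in software. This is a toy model under stated assumptions, not the hardware mechanism: `ProcessorState`, `Checkpoints`, and `demo` are hypothetical names, and the "state" is reduced to a dynamic instruction count plus register values.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ProcessorState:
    """Toy processor state: dynamic instruction count plus register values."""
    seq: int = 0
    regs: dict = field(default_factory=dict)

    def execute(self, dest, value):
        """Model one instruction that writes `value` into register `dest`."""
        self.regs[dest] = value
        self.seq += 1


class Checkpoints:
    """Snapshots kept in program order, oldest first."""

    def __init__(self):
        self._snapshots = []

    def take(self, state):
        self._snapshots.append(copy.deepcopy(state))

    def rollback(self, faulting_seq):
        """Return the youngest snapshot taken at or before `faulting_seq`.

        Younger snapshots are discarded because they describe squashed work.
        """
        while self._snapshots and self._snapshots[-1].seq > faulting_seq:
            self._snapshots.pop()
        if not self._snapshots:
            raise LookupError("no checkpoint old enough to recover from")
        return copy.deepcopy(self._snapshots[-1])


def demo():
    state = ProcessorState()
    checkpoints = Checkpoints()
    checkpoints.take(state)      # snapshot before instruction 0
    state.execute("r1", 10)      # instruction 0
    state.execute("r2", 20)      # instruction 1
    checkpoints.take(state)      # snapshot before instruction 2
    state.execute("r1", 99)      # instruction 2
    state.execute("r3", 30)      # instruction 3 raises an exception
    return checkpoints.rollback(faulting_seq=3)
```

In `demo`, recovery returns to the snapshot taken before instruction 2, so instruction 2 must be re-executed to reach the faulting instruction. That re-execution is the recovery penalty that shrinks as checkpoints are taken more often.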

Design Decisions

How many in-flight checkpoints should be maintained by the processor?

A large number of checkpoints reduces the penalty of the recovery process

A large number of checkpoints increases the implementation cost

What kind of instructions should be checkpointed?

A checkpoint can be taken at any instruction

Some instructions are better candidates (e.g., some current processors take checkpoints at branch instructions in order to minimize the branch misprediction penalty)

How much information should be kept by each checkpoint?

Multicheckpointing

Selective Checkpointing

Replace the ROB with a pseudo-ROB

The processor removes instructions that reach the pseudo-ROB’s head at a fixed rate

Processor state is recoverable for any instruction in the pseudo-ROB

A checkpoint is taken when an incomplete instruction leaves the pseudo-ROB
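
The pseudo-ROB rule above can be sketched as a small FIFO. The class and method names are hypothetical, and the "fixed rate" is modeled, as a simplifying assumption, by removing one head entry per dispatched instruction once the FIFO is full.

```python
from collections import deque


class PseudoROB:
    """Toy pseudo-ROB: a bounded FIFO that checkpoints on incomplete heads."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.fifo = deque()        # sequence numbers, oldest at the left
        self.incomplete = set()    # instructions in the FIFO still executing
        self.checkpoints = []      # sequence numbers where a checkpoint was taken

    def dispatch(self, seq):
        """Insert an instruction, removing the head first if the FIFO is full."""
        if len(self.fifo) == self.capacity:
            self._remove_head()
        self.fifo.append(seq)
        self.incomplete.add(seq)

    def complete(self, seq):
        """Mark an instruction as finished executing."""
        self.incomplete.discard(seq)

    def _remove_head(self):
        head = self.fifo.popleft()
        if head in self.incomplete:
            # The head leaves without having finished: take a checkpoint here.
            self.incomplete.discard(head)
            self.checkpoints.append(head)


def demo():
    rob = PseudoROB(capacity=4)
    long_latency = {1, 5}          # assumed cache-missing loads
    for seq in range(10):
        rob.dispatch(seq)
        if seq not in long_latency:
            rob.complete(seq)      # everything else finishes immediately
    return rob.checkpoints, list(rob.fifo)
```

In `demo`, only the two long-latency instructions are still incomplete when they leave the FIFO, so checkpoints are taken at exactly those two points and nowhere else.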

Instruction Queue Management

Bi-level Issue Queue

The processor detects instructions that will occupy an issue queue entry for a long time

It removes these instructions from the primary issue queue

It offloads them to a slow-lane instruction queue (larger, slower, less complex)

The same principle is applied to the load-store queue
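
A minimal sketch of the two-level idea follows. It assumes registers are already renamed (each written once), that an instruction is "slow" if any source depends, directly or transitively, on a cache-missing load, and that everything parked returns to the primary queue when the data arrives. The class name, method names, and wake-up policy are illustrative assumptions.

```python
class BiLevelIssueQueue:
    """Toy bi-level issue queue: a primary queue plus a slow lane."""

    def __init__(self):
        self.primary = []          # instructions in the small, fast queue
        self.slow_lane = []        # instructions parked in the large, slow queue
        self.delayed_regs = set()  # registers whose value waits on a long miss

    def mark_long_latency(self, reg):
        """Record that `reg` is the destination of a cache-missing load."""
        self.delayed_regs.add(reg)

    def insert(self, name, sources, dest):
        """Place an instruction in the queue that matches its expected wait."""
        if self.delayed_regs.intersection(sources):
            self.slow_lane.append(name)
            self.delayed_regs.add(dest)   # its own consumers will wait too
        else:
            self.primary.append(name)

    def data_arrived(self):
        """Assumed policy: move all parked instructions back, in order."""
        self.primary.extend(self.slow_lane)
        self.slow_lane.clear()
        self.delayed_regs.clear()


def demo():
    q = BiLevelIssueQueue()
    q.mark_long_latency("p1")               # load into p1 missed in the cache
    q.insert("add", ["p1", "p0"], "p2")     # depends on the miss
    q.insert("mul", ["p4", "p5"], "p3")     # independent of the miss
    q.insert("sub", ["p2", "p3"], "p6")     # depends on the miss through p2
    before = (list(q.primary), list(q.slow_lane))
    q.data_arrived()
    return before, list(q.primary)
```

While the miss is outstanding, only the independent `mul` occupies the primary queue; `add` and `sub` wait in the slow lane and rejoin the primary queue once the data is available.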

Physical Register File

Ephemeral Registers

A conventional superscalar processor assigns physical registers to architected registers when an instruction enters the issue queue

An instruction reserves a physical register for its entire flight time

The physical register is not written with a value until much later; until then its primary function is tracking data dependencies

Use virtual registers: late register allocation

Release a register if no other instruction reads the data: early release
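
Late allocation and early release can be sketched with a virtual tag that tracks dependencies until the value exists. All names are hypothetical. As an added assumption, a register is released only once its readers have finished and its architected register has been renamed again, so that no later reader can appear.

```python
class VirtualRegisterFile:
    """Toy register file with late allocation and early release."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))
        self.physical_of = {}     # virtual tag -> physical register
        self.values = {}          # physical register -> value
        self.readers = {}         # virtual tag -> consumers yet to read
        self.superseded = set()   # tags whose architected register was renamed again
        self._next_tag = 0

    def rename(self):
        """Hand out a virtual tag; no physical register is reserved yet."""
        tag = self._next_tag
        self._next_tag += 1
        self.readers[tag] = 0
        return tag

    def add_reader(self, tag):
        self.readers[tag] += 1

    def writeback(self, tag, value):
        """Late allocation: claim a physical register only when the value exists."""
        if not self.free:
            raise RuntimeError("no free physical register")
        phys = self.free.pop()
        self.physical_of[tag] = phys
        self.values[phys] = value
        self._maybe_release(tag)

    def read(self, tag):
        value = self.values[self.physical_of[tag]]
        self.readers[tag] -= 1
        self._maybe_release(tag)
        return value

    def supersede(self, tag):
        """A later instruction renamed the same architected register."""
        self.superseded.add(tag)
        self._maybe_release(tag)

    def _maybe_release(self, tag):
        # Early release: free the register once nothing can read it anymore.
        if (tag in self.superseded and self.readers[tag] == 0
                and tag in self.physical_of):
            phys = self.physical_of.pop(tag)
            del self.values[phys]
            self.free.append(phys)


def demo():
    rf = VirtualRegisterFile(num_physical=2)
    tag = rf.rename()                  # producer renamed, nothing reserved
    rf.add_reader(tag)                 # one consumer waits for the value
    free_after_rename = len(rf.free)
    rf.writeback(tag, 42)              # value produced: register claimed now
    free_after_writeback = len(rf.free)
    rf.supersede(tag)                  # architected register redefined
    value = rf.read(tag)               # last reader: register released
    return free_after_rename, free_after_writeback, value, len(rf.free)
```

In `demo`, the physical register is occupied only between writeback and the last read, rather than for the instruction's whole flight time.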

Performance Evaluation

Kilo-Instruction Multiprocessors

[Figure: IPC (axis from 0 to 3.5) for the FFT, RADIX, LU, MP3D, and WATER benchmarks on 16 processors with an ideal network, for ROB sizes of 64, 128, 512, 1024, and 2048 entries.]

References

Adrian Cristal, Oliverio J. Santana, Francisco Cazorla, Marco Galluzzi, Tanausu Ramirez, Miquel Pericas, Mateo Valero. "Kilo-Instruction Processors: Overcoming the Memory Wall," IEEE Micro, vol. 25, no. 3, pp. 48-57, May/June 2005.

A. Cristal, O. Santana, M. Valero, J.F. Martínez. "Toward Kilo-Instruction Processors," ACM Trans. on Architecture and Code Optimization, vol. 1, no. 4, Dec. 2004.

Marco Galluzzi, Valentin Puente, Adrián Cristal, Ramón Beivide, José-Ángel Gregorio, Mateo Valero. "A First Glance at Kilo-Instruction Based Multiprocessors," Conf. Computing Frontiers, pp. 212-221, 2004.

Thank you!
