KILO-INSTRUCTION PROCESSORS
Arzucan Özgür
Department of Computer Engineering, Boğaziçi University
15.12.2005 Cmpe 511
Introduction
Memory Wall
Performance improvements of high-frequency microprocessors are seriously limited by main memory access latencies
[Figure: Processor-memory performance gap, 1980-2000. Performance (log scale, 1 to 1000) versus time: CPU performance (“Moore’s Law”) improves 60%/yr, RAM performance improves 7%/yr, and the gap grows 50%/year.]
Reducing Memory Latency
Cache memory hierarchies
  First-level (L1) cache is built into the processor core and takes 1-3 processor clock cycles to access
  On an L1 miss, the on-chip L2 cache is accessed in the order of 10 processor cycles
  Accessing main memory takes at least in the order of 100 processor cycles
Prefetching data from memory to the cache
  Prefetch addresses are hard to predict
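The effect of these latencies can be sketched with the usual average-access-time arithmetic. This is a minimal illustration, not a model from the talk: the function name and the hit rates are hypothetical, and the cycle counts are just the orders of magnitude quoted above.

```python
def average_access_cycles(l1_hit_rate, l2_hit_rate,
                          l1_cycles=2, l2_cycles=10, mem_cycles=100):
    """Average memory access time, in processor cycles, for a two-level cache.

    The default latencies follow the orders of magnitude quoted above;
    the hit rates are hypothetical inputs, not measured values.
    """
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    # Every access pays the L1 latency; L1 misses also pay the L2 latency;
    # accesses that miss in both levels also pay the main memory latency.
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)


if __name__ == "__main__":
    # Assumed hit rates: 95% in L1, and 80% of the L1 misses hit in L2.
    print(average_access_cycles(0.95, 0.80))
```

Even with high hit rates, the main memory term dominates the miss penalty, which is why the remaining misses matter so much.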
[Figure: Processor pipeline stages (Next IP, Fetch, Drive, Alloc., Rename, Queue, Schedule, Dispatch, Reg. Read, Execute, Flags, Br. chk, Drive) with the L1 instruction cache, L1 data cache, L2, and memory; the branch misprediction path is marked.]
Out-of-order superscalar processors
Sequence of instructions containing data cache misses
Kilo-Instruction Processors
Definition
An out-of-order superscalar processor that supports thousands of “in-flight instructions”
Intelligent use of resources
Scalability
Thousands of in-flight instructions and in-order commit make designs impractical:
  ROB: needs to maintain a copy of every in-flight instruction
  IQs: instructions depending on long-latency instructions remain in these queues for a long time
  LSQs: instructions remain in the queue until commit
  Registers: a new physical register for each instruction producing a new value
We would like to get the IPC of thousands of instructions in-flight without drastically increasing resource requirements
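A rough back-of-the-envelope estimate of why thousands of in-flight instructions are wanted: to keep issuing while one memory access is outstanding, the window has to hold about (latency x issue width) instructions. This is only an illustrative sketch; the function name, the issue width of 4, and the 500-cycle latency are assumptions, not figures from the slides.

```python
def window_to_hide_miss(miss_latency_cycles, issue_width):
    """Instructions that must be in flight to keep issuing during one miss.

    Simplified estimate: it ignores dependences and assumes the processor
    could otherwise issue `issue_width` instructions every cycle.
    """
    return miss_latency_cycles * issue_width


if __name__ == "__main__":
    # Assumed 4-wide processor; latencies in processor cycles.
    for latency in (100, 500):
        print(latency, window_to_hide_miss(latency, issue_width=4))
```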
Efficient Kilo-Instruction Processor Design
Multi-checkpointing the ROB
  Out-of-order commit
Early release of resources
  Ephemeral registers
  Load queues
Checkpointing
The ROB allows the correct state to be restored at any instruction, which is not necessary
Checkpoint: a snapshot of the processor state taken at a specific instruction of the program being executed (processor state is checkpointed only for a subset of instructions)
With this snapshot the processor can restore state to that point in case of an exception or misprediction
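The snapshot-and-restore idea can be sketched as follows. This is a toy model under stated assumptions, not the hardware mechanism: the class names are hypothetical, and the "processor state" is reduced to a program counter and a register rename map.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Snapshot of the (simplified, hypothetical) processor state."""
    pc: int
    rename_map: dict


@dataclass
class Core:
    pc: int = 0
    rename_map: dict = field(default_factory=dict)   # architected -> physical
    checkpoints: list = field(default_factory=list)

    def take_checkpoint(self):
        # Copy the map so that later renames do not alter the snapshot.
        self.checkpoints.append(Checkpoint(self.pc, dict(self.rename_map)))

    def rename(self, arch_reg, phys_reg):
        # Stand-in for executing one instruction that writes arch_reg.
        self.rename_map[arch_reg] = phys_reg
        self.pc += 1

    def recover(self):
        """Roll back to the most recent checkpoint (exception or misprediction)."""
        cp = self.checkpoints.pop()
        self.pc = cp.pc
        self.rename_map = dict(cp.rename_map)
        return cp


if __name__ == "__main__":
    core = Core()
    core.rename("r1", "p7")
    core.take_checkpoint()
    core.rename("r1", "p9")     # work done after the checkpoint
    core.recover()              # this work is discarded
    print(core.pc, core.rename_map)
```

Everything executed after the checkpoint is thrown away on recovery, which is the cost of keeping state for only a subset of instructions.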
Design Decisions
How many in-flight checkpoints should be maintained by the processor?
  A large number of checkpoints reduces the penalty of the recovery process
  A large number of checkpoints increases the implementation cost
What kind of instructions should be checkpointed?
  A checkpoint could be taken at any instruction
  Some instructions are better candidates (e.g., some current processors take checkpoints at branch instructions in order to minimize the branch misprediction penalty)
How much information should be kept by each checkpoint?
Multicheckpointing
Selective Checkpointing
Replace the ROB with a pseudo-ROB
The processor removes instructions that reach the pseudo-ROB’s head at a fixed rate
Processor state is recoverable for any instruction in the pseudo-ROB
A checkpoint is taken when an incomplete instruction leaves the pseudo-ROB
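A minimal sketch of this selective checkpointing policy, assuming a simple FIFO model. The class and method names are hypothetical, and a checkpoint is represented only by the id of the instruction it was taken at.

```python
from collections import deque


class PseudoROB:
    """Toy FIFO model of a pseudo-ROB (illustrative, not the real design)."""

    def __init__(self):
        self.entries = deque()   # oldest instruction at the left (the head)
        self.checkpoints = []    # ids of instructions where a checkpoint was taken

    def insert(self, instr_id, completed):
        self.entries.append({"id": instr_id, "completed": completed})

    def tick(self, rate=1):
        """Remove up to `rate` instructions from the head, finished or not."""
        for _ in range(min(rate, len(self.entries))):
            head = self.entries.popleft()
            if not head["completed"]:
                # An incomplete instruction is leaving: checkpoint here so the
                # state at this instruction can still be restored later.
                self.checkpoints.append(head["id"])


if __name__ == "__main__":
    rob = PseudoROB()
    rob.insert(0, completed=True)
    rob.insert(1, completed=False)   # e.g. a load that missed in the cache
    rob.insert(2, completed=True)
    rob.tick(rate=2)
    print(rob.checkpoints, len(rob.entries))
```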
Instruction Queue Management
Bi-level Issue Queue
The processor detects instructions that will occupy an issue queue for a long time
It removes these instructions from the primary issue queue
It offloads them to a slow-lane instruction queue (larger, slower, less complex)
Same principle applied to load-store queue
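The two-level organization can be sketched as two lists with an offload step. This is a toy model with hypothetical names; how the processor detects long-waiting instructions is left to a caller-supplied predicate, which is an assumption of the sketch.

```python
class BiLevelIssueQueue:
    """Toy model of a primary issue queue plus a slow-lane queue."""

    def __init__(self):
        self.primary = []     # small, fast issue queue
        self.slow_lane = []   # larger, slower, less complex queue

    def dispatch(self, instr):
        self.primary.append(instr)

    def offload(self, is_long_wait):
        """Move instructions flagged by `is_long_wait` to the slow lane."""
        staying = []
        for instr in self.primary:
            if is_long_wait(instr):
                self.slow_lane.append(instr)
            else:
                staying.append(instr)
        self.primary = staying


if __name__ == "__main__":
    q = BiLevelIssueQueue()
    for name in ("add", "use_of_missed_load", "sub"):
        q.dispatch(name)
    # Hypothetical detector: this instruction depends on a cache-missing load.
    q.offload(lambda instr: instr == "use_of_missed_load")
    print(q.primary, q.slow_lane)
```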
Physical Register File
Ephemeral Registers
A conventional superscalar processor assigns physical registers to architected registers when an instruction enters the issue queue
An instruction reserves a physical register for its entire flight time
A physical register is not written with a value until much later; until then its primary function is tracking data dependencies
Use virtual registers: late register allocation
Release a register when no other instruction reads the data: early release
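Late allocation and early release can be sketched with virtual tags that are bound to a physical register only at write-back. This is a simplified illustration with hypothetical names. One added assumption: a register is released only once its architected register has been redefined and all known readers have read it, so no later reader can appear.

```python
class VirtualRegisterFile:
    """Toy model: virtual tags track dependences, physical registers hold values."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))
        self.physical = {}       # virtual tag -> physical register (set at write-back)
        self.readers = {}        # virtual tag -> consumers that have not read yet
        self.superseded = set()  # tags whose architected register was redefined
        self.next_tag = 0

    def rename(self):
        """At rename, hand out a virtual tag only; no physical register yet."""
        tag = self.next_tag
        self.next_tag += 1
        self.readers[tag] = 0
        return tag

    def add_reader(self, tag):
        self.readers[tag] += 1

    def write_back(self, tag):
        """Late allocation: bind a physical register when the value is produced.

        Raises IndexError if no physical register is free (not handled here).
        """
        self.physical[tag] = self.free.pop()
        self._try_release(tag)

    def read(self, tag):
        self.readers[tag] -= 1
        self._try_release(tag)

    def supersede(self, tag):
        self.superseded.add(tag)
        self._try_release(tag)

    def _try_release(self, tag):
        # Early release: free the register as soon as nobody can read it again.
        if tag in self.physical and tag in self.superseded and self.readers[tag] == 0:
            self.free.append(self.physical.pop(tag))


if __name__ == "__main__":
    rf = VirtualRegisterFile(num_physical=2)
    t = rf.rename()
    rf.add_reader(t)
    print(len(rf.free))   # still 2: nothing allocated before write-back
    rf.write_back(t)
    rf.supersede(t)
    rf.read(t)
    print(len(rf.free))   # back to 2: released after the last read
```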
Performance Evaluation
Kilo-Instruction Multiprocessors
[Figure: IPC (axis from 0 to 3.5) of a 16-processor system with an ideal network on the FFT, RADIX, LU, MP3D, and WATER benchmarks, for ROB sizes of 64, 128, 512, 1024, and 2048 entries.]
References
Adrian Cristal, Oliverio J. Santana, Francisco Cazorla, Marco Galluzzi, Tanausu Ramirez, Miquel Pericas, Mateo Valero. "Kilo-Instruction Processors: Overcoming the Memory Wall," IEEE Micro, vol. 25, no. 3, pp. 48-57, May/June 2005.
A. Cristal, O. Santana, M. Valero, J.F. Martínez. "Toward Kilo-Instruction Processors," ACM Trans. on Architecture and Code Optimization, vol. 1, no. 4, Dec. 2004.
Marco Galluzzi, Valentin Puente, Adrián Cristal, Ramón Beivide, José-Ángel Gregorio, Mateo Valero. "A First Glance at Kilo-Instruction Based Multiprocessors," Conf. Computing Frontiers, pp. 212-221, 2004.
Thank you!