ECE200 – Computer Organization
Chapter 6 – Enhancing Performance with
Pipelining
Homework 6
6.2, 6.3, 6.5, 6.9, 6.11, 6.19, 6.27, 6.30
Outline for Chapter 6 lectures
Pipeline motivation: increasing instruction throughput
MIPS 5-stage pipeline
Hazards
Handling exceptions
Superscalar execution
Dynamic scheduling (out-of-order execution)
Real pipeline designs
Pipeline motivation
Need both low CPI and high frequency for best performance
Want a multicycle implementation for high frequency, but need better CPI
Idea behind pipelining is to have a multicycle implementation that operates like a factory assembly line
Each “worker” in the pipeline performs a particular task, hands off to the next “worker”, while getting new work
Pipeline motivation
Tasks should take about the same time – if one “worker” is much slower than the rest, then other “workers” will stand idle
Once the assembly line is full, a new “product” (instruction) comes out of the back-end of the line each time period
In a computer assembly line (pipeline), each task is called a stage and the time period is one clock cycle
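The throughput gain of the assembly line can be sketched numerically. In this sketch the 200 ps stage delay and 1000 ps single-cycle clock are illustrative assumptions, not figures from the slides:

```python
def single_cycle_time(n_instr, cycle_ps):
    # Single-cycle machine: one long clock cycle per instruction.
    return n_instr * cycle_ps

def pipelined_time(n_instr, n_stages, stage_ps):
    # n_stages cycles to fill the line, then one instruction
    # completes per cycle.
    return (n_stages + n_instr - 1) * stage_ps

# Hypothetical numbers: 5 stages of 200 ps vs a 1000 ps single cycle.
print(pipelined_time(1000, 5, 200))   # 200800 ps for 1000 instructions
print(single_cycle_time(1000, 1000))  # 1000000 ps
```

Once the pipeline is full, throughput approaches one instruction per (short) cycle, roughly a 5x speedup here.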
MIPS 5-stage pipeline
Like single cycle datapath but with registers separating each stage
MIPS 5-stage pipeline
5 stages for each instruction
IF: instruction fetch
ID: instruction decode and register file read
EX: instruction execution or effective address calculation
MEM: memory access for load and store
WB: write back results to register file
Delays of all 5 stages are roughly equal
Staging registers are used to hold data and control as instructions pass between stages
All instructions pass through all 5 stages
As an instruction leaves a stage in a particular clock period, the next instruction enters it
Pipeline operation for lw
Stage 1: Instruction fetch
Pipeline operation for lw
Stage 2: Instruction decode and register file read
What happens to the instruction info in IF/ID?
Pipeline operation for lw
Stage 3: Effective address calculation
Pipeline operation for lw
Stage 4: Memory access
Pipeline operation for lw
Stage 5: Write back
Instruction info in IF/ID is gone – won’t work
Modified pipeline with write back fix
Write register bits from the instruction must be carried through the pipeline with the instruction
Pipeline operation for lw
Pipeline usage in each stage for lw
Pipeline operation for sw
Stage 3: Effective address calculation
Pipeline operation for sw
Stage 4: Memory access
Pipeline operation for sw
Stage 5: Write back (nothing)
Pipeline operation for lw, sub sequence
Graphical pipeline representation
Represent overlap of pipelined instructions as multiple pipelines skewed by a cycle
Another useful shorthand form
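This shorthand chart can be generated mechanically; a small sketch (function name and formatting are my own):

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")

def pipeline_diagram(names):
    # Each instruction enters IF one cycle after its predecessor,
    # so row i is shifted right by i cycles.
    return [(name, [""] * i + list(STAGES)) for i, name in enumerate(names)]

for name, row in pipeline_diagram(["lw", "sub", "and"]):
    print(f"{name:4}" + "".join(f"{s:>5}" for s in row))
```

Each printed row is one instruction; reading down a column shows which instruction occupies each stage in that cycle.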
Pipeline control
Basic pipeline control is similar to the single cycle implementation
Pipeline control
Control for an instruction is generated in ID and travels with the instruction and data through the pipeline
When an instruction enters a stage, its control signals set the operation of that stage
Pipeline control
Multiple instruction example
For the following code fragment
show the datapath and control usage as the instruction sequence travels down the pipeline
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
Multiple instruction example
How the MIPS ISA simplifies pipelining
Fixed-length instructions simplify
Fetch – just get the next 32 bits
Decode – single step; don’t have to decode the opcode before figuring out where to get the rest of the fields
Source register fields are always in the same location
Can read source registers during decode
Load/store architecture
ALU can be used for both arithmetic and EA calculation
Memory instructions require about the same amount of work as arithmetic ones, easing pipelining of the two together
Memory data must be aligned
Read or write accesses can be done in one cycle
Pipeline hazards
A hazard is a conflict, regarding data, control, or hardware resources
Data hazards are conflicts for register values
Control hazards occur due to the delay to execute branch and jump instructions
Structural hazards are conflicts for hardware resources, such as
A single memory for instructions and data
A multi-cycle, non-pipelined functional unit (such as a divider)
Data dependences
A read after write (RAW) dependence occurs when the register written by an instruction is a source register of a subsequent instruction
Also have write after read (WAR) and write after write (WAW) data dependences (later)
lw $10, 20($1)
sub $11, $10, $3
and $12, $4, $11
or $13, $11, $4
add $14, $13, $9
Pipelining and RAW dependences
RAW dependences that are close by may cause data hazards in the pipeline
Consider the following code sequence:
What are the RAW dependences?
sub $2, $1, $3
and $12, $2, $6
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Pipelining and RAW dependences
Data hazards with first three instructions
ok
ok
hazard
hazard
Forwarding
Most RAW hazards can be eliminated by forwarding results between pipe stages
at this point, result of sub is available
Forwarding datapaths
Bypass paths feed data from MEM and WB back to MUXes at the EX ALU inputs
Do we still have to write the register file in WB?
Detecting forwarding
Rd of the instruction in MEM or WB must match Rs and/or Rt of the instruction in EX
The instruction in MEM or WB must have RegWrite=1 (why?)
Rd must not be $0 (why?)
Detecting forwarding from MEM to EX
To the upper ALU input (ALUupper):
EX/MEM.RegWrite = 1
EX/MEM.RegisterRd not equal 0
EX/MEM.RegisterRd = ID/EX.RegisterRs
To the lower ALU input (ALUlower):
EX/MEM.RegWrite = 1
EX/MEM.RegisterRd not equal 0
EX/MEM.RegisterRd = ID/EX.RegisterRt
Detecting forwarding from WB to EX
To the upper ALU input:
MEM/WB.RegWrite = 1
MEM/WB.RegisterRd not equal 0
MEM/WB.RegisterRd = ID/EX.RegisterRs
The value is not being forwarded from MEM (why?)
To the lower ALU input:
MEM/WB.RegWrite = 1
MEM/WB.RegisterRd not equal 0
MEM/WB.RegisterRd = ID/EX.RegisterRt
The value is not being forwarded from MEM
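The conditions above translate directly into mux-select logic. A sketch for the upper ALU input (the dict encoding and return labels are my own; EX/MEM is checked first so the most recent result wins):

```python
def forward_upper(ex_mem, mem_wb, id_ex_rs):
    # Select the source for the upper ALU input (ALUupper).
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
        return "EX/MEM"   # MEM-to-EX forwarding takes priority
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
        return "MEM/WB"   # WB-to-EX forwarding
    return "REGFILE"      # no hazard: use the register file value

# sub $2,$1,$3 in MEM, and $12,$2,$6 in EX: forward $2 from EX/MEM
print(forward_upper({"RegWrite": 1, "Rd": 2},
                    {"RegWrite": 0, "Rd": 0}, 2))  # EX/MEM
```

The lower ALU input is the same check against ID/EX.RegisterRt instead of Rs.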
Forwarding control
Control is handled by the forwarding unit
Forwarding example
Show forwarding for the code sequence:
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
Forwarding example
sub produces result in EX
Forwarding example
sub forwards result from MEM to ALUupper
Forwarding example
sub forwards result from WB to ALUlower
and forwards result from MEM to ALUupper
Forwarding example
or forwards result from MEM to ALUupper
RAW hazards involving loads
Loads produce results in MEM – can’t forward to an immediately following R-type instruction
Called a load-use hazard
RAW hazards involving loads
Solution: stall the stages behind the load for one cycle, after which the result can be forwarded
Detecting load-use hazards
Instruction in EX is a load:
ID/EX.MemRead = 1
Instruction in ID has a source register that matches the load destination register:
ID/EX.RegisterRt = IF/ID.RegisterRs OR
ID/EX.RegisterRt = IF/ID.RegisterRt
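The detection condition can be sketched as a single predicate (the dict encoding is my own):

```python
def load_use_hazard(id_ex, if_id):
    # Stall when the instruction in EX is a load (MemRead = 1) whose
    # destination Rt matches either source register of the one in ID.
    return bool(id_ex["MemRead"]) and \
        id_ex["Rt"] in (if_id["Rs"], if_id["Rt"])

# lw $2,20($1) in EX, and $4,$2,$5 in ID: must stall one cycle
print(load_use_hazard({"MemRead": 1, "Rt": 2}, {"Rs": 2, "Rt": 5}))  # True
```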
Stalling the stages behind the load
Force a nop (“no operation”) instruction into the EX stage on the next clock cycle:
Force the ID/EX.MemWrite input to zero
Force the ID/EX.RegWrite input to zero
Hold instructions in the ID and IF stages for one clock cycle:
Hold the contents of the PC
Hold the contents of IF/ID
Control for load-use hazards
Control is handled by the hazard detection unit
Load-use stall example
Code sequence:
lw $2, 20($1)
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
Load-use stall example
lw enters ID
Load-use stall example
Load-use hazard detected
Load-use stall example
Force nop into EX and hold ID and IF stages
Load-use stall example
lw result in WB forwarded to and in EX
or reads operand $2 from register file
Load-use stall example
Pipeline advances normally
Control hazards
Taken branches and jumps change the PC to the target address from which the next instruction is to be fetched
In our pipeline, the PC is changed when the taken beq instruction is in the MEM stage
This creates a control hazard in which sequential instructions in earlier stages must be discarded
beq instruction that is taken, followed by instr i+1, instr i+2, instr i+3
instr i+1, instr i+2, instr i+3 must be discarded
beq $2,$3,7
beq instruction that is taken
In this example, the branch delay is three
Why is the branch immediate field a 7?
Reducing the branch delay
Reducing the branch delay reduces the number of instructions that have to be discarded on a taken branch
We can reduce the branch delay to one for beq by moving both the equality test and the branch target address calculation into ID
We need to insert a nop between the beq and the correctly fetched instruction
Reducing the branch delay
beq with one branch delay
Register equality test done in ID by exclusive-ORing the register values and NORing the result
Instruction in ID forced to nop by zeroing the IF/ID register
Next fetched instruction will be from PC+4 or branch target depending on the beq outcome
beq with one branch delay
beq in ID; next sequential instruction (and) in IF
beq with one branch delay
bubble in ID; lw (from taken address) in IF
Forwarding and stalling changes
Results in MEM and WB must be forwarded to ID for use as possible beq source operand values
beq may have to stall in ID to wait for source operand values to be produced
Examples:
addi $2, $2, -1
beq $2, $0, 20
Stall beq one cycle; forward $2 from MEM to upper equality input in ID

lw $8, 20($1)
beq $4, $8, 6
Stall beq two cycles; forward $8 from WB to lower equality input in ID
Forwarding from MEM to ID
Diagram: addi $2,$2,-1 followed by beq $2,$0,20, which stalls one cycle (one bubble)
How could we eliminate the bubble?
Forwarding from WB to ID
Diagram: lw $8,20($1) followed by beq $4,$8,6, which stalls two cycles (two bubbles)
Further reducing the branch delay
Insert a bubble only if the branch is taken
Allow the next sequential instruction to proceed if the branch is not taken
AND the IF.Flush signal with the result of the equality test
Still have a bubble for taken branches (~2/3 of all branches)
Delayed branching
Delayed branching
The ISA states that the instruction following the branch is always executed regardless of the branch outcome
Hardware must adhere to this rule!
The compiler finds an appropriate instruction to place after the branch (in the branch delay slot)
beq $4, $8, 6
sub $1, $2, $3   # branch delay slot (always executed after the branch)
Delayed branching
Three places compiler may find a delay slot instruction
Prior example without delayed branch
beq in ID; next sequential instruction (and) in IF
What do you notice about the sub instruction?
Prior example without delayed branch
bubble in ID; lw (from taken address) in IF
Prior example with delayed branch
beq in ID; delay slot instruction (sub) in IF
sub $10, $4, $8
Prior example with delayed branch
sub in ID; lw (from taken address) in IF
sub $10, $4, $8
What would happen if the branch was not taken?
Limitations of delayed branching
50% of the time the compiler can’t fill the delay slot with useful instructions while maintaining correctness (has to insert nops instead)
High performance pipelines may have >10 delay slots
Many cycles for instruction fetch and decode
Multiple instructions in each pipeline stage
Example
Pipeline: IF1-IF2-ID1-ID2
Branch calculation performed in ID2
Four instructions in each stage
12 delay slots
Solution: branch prediction (later)
Precise exceptions
Exceptions require a change of control to a special exception handler routine
The PC of the user program is saved in EPC and restored after the handler completes so that the user program can resume at that instruction
For the user program to work correctly after resuming:
All instructions before the excepting one must have written their results
All subsequent instructions must not have written their results
Exceptions handled this way are called precise
Pipelining and precise exceptions
There may be instructions from before the excepting one and from after it in the pipeline when the exception occurs
Exceptions may be detected out of program order
Which should be handled first?
Supporting precise exceptions
Each instruction in the pipeline has an exception field that travels with it
When an exception is detected, the type of exception is encoded in the exception field
The RegWrite and MemWrite control signals for the instruction are set to 0
At the end of MEM, the exception field is checked to see if an exception occurred
If so, the instructions in IF, ID, and EX are made into nops, and the address of the exception handler is loaded into the PC
Supporting precise exceptions
Superscalar pipelines
In a superscalar pipeline, each pipeline stage holds multiple instructions
4-6 instructions in modern high-performance microprocessors
Performance is increased because every clock period more than one instruction completes (increased parallelism)
Superscalar pipelines have a CPI less than 1
Simple 2-way superscalar MIPS
Simple 2-way superscalar MIPS
Two instructions fetched and decoded each cycle
Conditions for executing a pair of instructions:
First instruction an integer or branch, second a load or store
No RAW dependence from first to second
Otherwise, the second instruction is executed the cycle after the first
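The pairing rule can be sketched as a predicate (the dict encoding and field names are my own; only the two conditions from the slide are checked):

```python
def can_dual_issue(first, second):
    # First slot: integer ALU op or branch; second slot: load or store;
    # and the second must not read a register the first writes (RAW).
    if first["kind"] not in ("alu", "branch"):
        return False
    if second["kind"] not in ("load", "store"):
        return False
    return first.get("dest") not in second["srcs"]

# addu $t0,$t0,$s2 paired with sw $t0,0($s1): RAW on $t0, cannot pair
addu = {"kind": "alu", "dest": "$t0", "srcs": ["$t0", "$s2"]}
sw = {"kind": "store", "srcs": ["$t0", "$s1"]}
print(can_dual_issue(addu, sw))  # False
```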
Compiler code scheduling
The compiler can improve performance by changing the order of the instructions in the program (code scheduling)
Examples:
Fill branch delay slots
Move instructions between two dependent instructions to eliminate the stall cycles
Reorder instructions to increase the number executed in parallel
Scheduling example – before
Load-use stall
Stall after addi
First three instructions must execute serially due to dependences
Last two must also execute serially for the same reason
Have a branch delay slot to fill
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $zero, Loop
Scheduling example – after
All stall cycles are eliminated
Last two instructions can now execute in parallel on the 2-way superscalar MIPS
First two can also, but we would introduce a stall cycle before the addu (loop is too short – not enough instructions to schedule)
Loop: lw $t0, 0($s1)
addi $s1, $s1, -4 # moved into load delay slot
addu $t0, $t0, $s2
bne $s1, $zero, Loop
sw $t0, 4($s1) # moved into branch delay slot
Loop unrolling
Idea is to take multiple iterations of a loop (“unroll” it) and combine them into one bigger loop
Gives the compiler many instructions to move between dependent instructions and to increase parallel execution
Reduces the overhead of branching
Loop unrolling
Example of prior loop unrolled 4 times:
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
lw $t0, -4($s1)
addu $t0, $t0, $s2
sw $t0, -4($s1)
lw $t0, -8($s1)
addu $t0, $t0, $s2
sw $t0, -8($s1)
lw $t0, -12($s1)
addu $t0, $t0, $s2
sw $t0, -12($s1)
addi $s1, $s1, -16
bne $s1, $zero, Loop

Original code:
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $zero, Loop
Loop unrolling
Problem: reuse of $t0 constrains instruction order
Write after read (WAR) and write after write (WAW) hazards
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
lw $t0, -4($s1)
addu $t0, $t0, $s2
sw $t0, -4($s1)
lw $t0, -8($s1)
addu $t0, $t0, $s2
sw $t0, -8($s1)
lw $t0, -12($s1)
addu $t0, $t0, $s2
sw $t0, -12($s1)
addi $s1, $s1, -16
bne $s1, $zero, Loop
Loop unrolling
Solution: different registers for each computation
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
lw $t1, -4($s1)
addu $t1, $t1, $s2
sw $t1, -4($s1)
lw $t2, -8($s1)
addu $t2, $t2, $s2
sw $t2, -8($s1)
lw $t3, -12($s1)
addu $t3, $t3, $s2
sw $t3, -12($s1)
addi $s1, $s1, -16
bne $s1, $zero, Loop
Loop unrolling
Unrolled loop after scheduling:
New sw offsets due to moving the addi
Loop: lw $t0, 0($s1)
lw $t1, -4($s1)
lw $t2, -8($s1)
lw $t3, -12($s1)
addu $t0, $t0, $s2
addu $t1, $t1, $s2
addu $t2, $t2, $s2
addu $t3, $t3, $s2
addi $s1, $s1, -16
sw $t0, 16($s1)
sw $t1, 12($s1)
sw $t2, 8($s1)
bne $s1, $zero, Loop
sw $t3, 4($s1)
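The same transformation can be seen in a high-level setting. This sketch (my own example, not from the slides) unrolls a summation loop by four, with a tail loop for leftover elements, trading code size for fewer branch tests:

```python
def sum_unrolled4(a):
    # Four additions per loop test, mirroring the unrolled MIPS loop;
    # the tail loop handles elements left over when len(a) % 4 != 0.
    s, i = 0, 0
    n4 = len(a) - len(a) % 4
    while i < n4:
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
        i += 4
    for j in range(n4, len(a)):
        s += a[j]
    return s

print(sum_unrolled4(list(range(10))))  # 45, same as sum(range(10))
```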
Modern superscalar processors
Today’s superscalar processors attempt to issue (initiate the execution of) 4-6 instructions each clock cycle
Such processors have multiple integer ALUs, integer multipliers, and floating point units that operate in parallel on different instructions
Because most of these units are pipelined, there is the potential to have tens of instructions simultaneously executing
We must remove several barriers to achieve this
Modern processor challenges
Handling branches in a way that prevents instruction fetch from becoming a bottleneck
Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution
Removing register hazards due to the reuse of registers so that instructions can execute in parallel
Instruction fetch challenges
Branches comprise about 20% of the executed instructions in SPEC integer programs
The branch delay may be >10 instructions in a highly pipelined, superscalar processor
Delayed branches are useless with so many delay slots
Solution: dynamic branch prediction with speculative execution
Dynamic branch prediction
When fetching the branch, predict what the branch outcome and target will be
Fetch instructions from the predicted direction
After executing the branch, verify whether the prediction was correct
If so, continue without any performance penalty
If not, undo and fetch from the other direction
Bimodal branch predictor
Predicts the branch outcome
Works under the assumption that most branches are either taken most of the time or not taken most of the time
Prediction accuracy is ~85-95% with 2048 entries
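A back-of-the-envelope CPI model shows why this accuracy matters. The 20% branch frequency comes from the slides; the 3-cycle misprediction penalty and base CPI of 1 are illustrative assumptions:

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty):
    # Every mispredicted branch adds `penalty` stall cycles on average.
    return base_cpi + branch_freq * mispredict_rate * penalty

# 20% branches, 90% prediction accuracy, 3-cycle misprediction penalty
print(effective_cpi(1.0, 0.20, 0.10, 3))  # 1.06
```

Even at 90% accuracy, branches cost only ~6% extra CPI here; without prediction, every taken branch would pay the full penalty.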
Bimodal branch predictor
Consists of a small memory and a state machine
Each memory location has 2 bits
The address of the memory is the low-order log2(n) bits of the PC of a fetched branch instruction
Diagram: the low-order bits of the fetched branch’s PC address an n-entry branch predictor memory with 2 bits per entry
Bimodal branch predictor
When a branch is fetched, the 2-bit memory entry is retrieved
The prediction is based on the high-order bit:
1 = predict taken
0 = predict not taken
Bimodal branch predictor
Once the branch is executed, the state bits are updated and written back into the memory
In the 00 or 11 state, have to be wrong twice in a row to change the prediction
Diagram: 2-bit state machine with states 00, 01, 10, 11; transitions are driven by the actual branch outcome
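The 2-bit state machine behaves as a saturating counter. A sketch (the table size default, weakly-not-taken initial state, and simple modulo PC indexing are modeling assumptions):

```python
class BimodalPredictor:
    def __init__(self, entries=2048):
        self.entries = entries
        self.table = [1] * entries  # states 0..3; start weakly not-taken

    def predict(self, pc):
        # High-order bit of the 2-bit counter: states 2 and 3 predict taken.
        return self.table[pc % self.entries] >= 2

    def update(self, pc, taken):
        # Saturate at 0 and 3: in state 00 or 11 the predictor must be
        # wrong twice in a row before its prediction flips.
        i = pc % self.entries
        self.table[i] = min(3, self.table[i] + 1) if taken \
            else max(0, self.table[i] - 1)

bp = BimodalPredictor()
for outcome in (True, True, True):   # branch taken three times in a row
    bp.update(0x40, outcome)
bp.update(0x40, False)               # one not-taken outcome...
print(bp.predict(0x40))              # True: still predicts taken
```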
Branch target buffer
Predicts the branch target address
Is this as critical as predicting the branch outcome?
Small memory (typically 256-512 entries) addressed by the low-order branch PC bits
Each entry holds the last target address of the branch
When a branch is fetched, the BTB is accessed and the target address is used if the bimodal predictor predicts “taken”
Speculative execution
The execution of the branch, and verification of the prediction, may take many cycles due to RAW dependences with long-latency instructions
We cannot write the register file or data memory until we know the prediction is correct
Execution will eventually stall
lw $2,100($1) # can take >100 cycles
beq $2,$0,Label
Speculative execution
In speculative execution, results are first written to temporary buffers (NOT the register file or data memory)
The results are copied from the buffers to the register file or data memory if the branch prediction has been verified and is correct
If the prediction is incorrect, we discard the results
Speculative execution
Writeback now consists of two stages: instruction completion and instruction commit
Completion: execution is complete, write results to buffer
Commit: branch prediction is verified and correct, copy results from buffers to register file or data memory
Modern processors can speculate through 4-8 branches
Diagram: execute → completion (results written to buffers) → commit (buffers copied to register file), once the branch prediction is verified as correct
Modern processor challenges
Handling branches in a way that prevents instruction fetch from becoming a bottleneck
Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution
Removing register hazards due to the reuse of registers so that instructions can execute in parallel
Long latency operations
Long latency operations, especially loads that have to access main memory, may stall subsequent instructions
Solution: allow instructions to issue (start executing) out of their original program order but update registers/memory in program order
or $5,$6,$7    # completed
and $8,$6,$7   # completed
lw $2,100($1)  # data not found in on-chip memory, have to get from main memory
add $9,$2,$2   # waiting for lw
sub $10,$5,$8  # can’t execute even though its operands are available!
Out-of-order issue
Fetched and decoded instructions are placed in a special hardware queue called the issue queue
An instruction waits in the IQ until:
Its source operands are available
A suitable functional unit is available
The instruction can then issue
Diagram: IF → ID → issue queue → EX → completion (buffers) → commit (reg file)
Out-of-order issue
Every cycle, the destination register numbers (rd or rt) of issuing instructions are broadcast to all instructions in the IQ
A match with a source register number (rs or rt) of an instruction in the IQ indicates the operand will be available
or $5,$6,$7    # issued
and $8,$6,$7   # issued
lw $2,100($1)  # can take >100 cycles!
add $9,$2,$2
sub $10,$5,$8  # both operands become available
Diagram: destination register numbers of issuing instructions are broadcast from the pipeline back to the issue queue
Out-of-order issue
Instructions with available source operands can issue ahead of earlier instructions (out of original program order)
Issue queue contents (entries arrive from ID):
add $9,$2,$2   # waiting for lw
sub $10,$5,$8  # the or and and instructions were just issued => issue sub
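The wake-up/select step can be sketched as an oldest-first scan; the ready-register-set representation is my own simplification:

```python
def select_to_issue(queue, ready_regs):
    # Oldest-first scan: issue the first instruction whose source
    # registers are all available, even if older entries still wait.
    for i, instr in enumerate(queue):
        if all(src in ready_regs for src in instr["srcs"]):
            return i
    return None

# add waits on $2 (the lw result); sub's operands $5 and $8 are ready
iq = [{"op": "add", "srcs": [2]}, {"op": "sub", "srcs": [5, 8]}]
print(select_to_issue(iq, ready_regs={5, 8}))  # 1: sub issues out of order
```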
Out-of-order issue, in-order commit
Once instructions complete, they write results into the buffers used for speculative execution
However, instructions are written to the register file and data memory in original program order
Why do we need to do this?
Diagram: execute → completion (may be out-of-order; results go to buffers) → commit (must be in-order; buffers written to register file)
or $5,$6,$7    # commits first
and $8,$6,$7
lw $2,100($1)
add $9,$2,$2
sub $10,$5,$8  # completes first
Modern processor challenges
Handling branches in a way that prevents instruction fetch from becoming a bottleneck
Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution
Removing register hazards due to the reuse of registers so that instructions can execute in parallel
Register hazards
The reuse of registers creates WAW and WAR hazards that limit out-of-order issue and parallel execution
Example
Potential for multiple iterations to be executed in parallel
The branch could be predicted as taken with high accuracy
Problem: WAW and WAR hazards involving $t0 and $s1
Solution: register renaming
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $zero, Loop
Register renaming
Idea is for the hardware to reassign registers like the compiler does in loop unrolling
Requires implementing more registers than specified in the ISA (e.g., 128 integer registers rather than 32)
Allows every instruction in the pipeline to be given a unique destination register number to eliminate all WAR and WAW register conflicts
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
lw $t1, -4($s1)
addu $t1, $t1, $s2
sw $t1, -4($s1)
Register renaming
A register renaming stage is added between decode and the register file access
The original architectural destination register number is replaced by a unique physical register number that is not used by any other instruction
A lookup is done for each source register to find the corresponding physical register number
Diagram: decode → rename → reg file; architectural register numbers are used up to the rename stage, physical register numbers after it
Register renaming
Example: two iterations of the loop with branch predicted taken
WAR hazard involving $s1 is removed, allowing the addi to complete before the first iteration is completed
The WAW and WAR hazards involving $t0 are removed
Removing both of these restrictions allows the second iteration to proceed in parallel with the first
BEFORE:
lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
<bne predicted taken>
lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
<bne predicted taken>

AFTER:
lw $p1, 0($p3)
addu $p2, $p1, $p10
sw $p2, 0($p3)
addi $p4, $p3, -4
<bne predicted taken>
lw $p7, 0($p4)
addu $p23, $p7, $p10
sw $p23, 0($p4)
addi $p11, $p4, -4
<bne predicted taken>
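A minimal rename table can illustrate the BEFORE/AFTER mapping. In this sketch the free-list policy and register counts are assumptions, so the physical numbers differ from the slide’s $p names:

```python
class Renamer:
    def __init__(self, n_arch=32, n_phys=128):
        self.table = {r: r for r in range(n_arch)}  # arch -> phys mapping
        self.free = list(range(n_arch, n_phys))     # unused phys registers

    def rename(self, dest, srcs):
        # Sources read the current mapping; the destination gets a fresh
        # physical register, so reuses of `dest` no longer conflict.
        phys_srcs = [self.table[s] for s in srcs]
        phys_dest = self.free.pop(0)
        self.table[dest] = phys_dest
        return phys_dest, phys_srcs

r = Renamer()
d1, _ = r.rename(8, [9])   # first write of $t0 (register 8)
d2, s2 = r.rename(8, [8])  # second write of $t0 reads the first's result
print(d1, d2, s2)          # two distinct physical destinations
```

Because each write of $t0 gets its own physical register, the WAW and WAR conflicts between iterations disappear.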
The MIPS R12000 microprocessor
4-way superscalar
Five execution units:
2 integer
2 floating point
1 load/store for effective address calculation and data memory access
Dynamic branch prediction and speculative execution
ooo issue, in-order commit
Register renaming
R12000 pipeline (ALU operations)
Fetch stages 1 and 2
Fetch 4 instructions each cycle
Predict branches
Split into two stages to enable higher clock rates (R10K had 1)
Decode stage
Decode and rename 4 instructions each cycle
Put into issue queues
Issue stage
Check source operand availability
Read source operands from register file (or bypass paths) for issued instructions
Execute stage
Execute and complete
Write stage
Write results to physical registers
R12000 branch prediction
2048-entry bimodal predictor
32-entry branch target address cache
Speculation through four branches
R12000 ooo completion, in-order commit
Separate 16-entry issue queues for integer, floating point, and memory (load and store) instructions
Hardware tracks the program order and status (completed, caused exception, etc) of up to 48 instructions
R12000 register renaming
64 integer and 64 floating point physical registers
Hardware lookup table to correlate architectural registers with physical registers
Hardware maintains list of currently unused registers that can be assigned as destination registers
R10000 die photo
R12000 summary
R10000 was one of the first microprocessors to implement the “issue queue” approach to ooo superscalar execution
PowerPC processors use the “reservation station” approach discussed in the book
Clock rate was slow
R12000 provided a slight improvement with some redesign
Pentium and Alpha processors are ooo but with much faster clock rates
Very hard to get significant improvement beyond 4-6 way issue
Branch prediction accuracy needs to be extremely high
Finding parallel operations in many programs is difficult
Long latency of loads creates an operand supply problem
Keeping the clock rate high is tough
Questions?