Precomputation-based Prefetching
By James Schatz and Bashar Gharaibeh
Outline
Introduction
Implementations of Precomputation
Some Examples of Precomputation
Results of Precomputation Tests
Summary
Introduction
Why Precomputation?
Designed to improve single-thread performance on a multithreaded system
Utilizes idle hardware to improve cache hit rates
Useful in programs with unpredictable access patterns
Precomputation
Allows programs to run faster
What are the key causes of delay?
Waiting for input values
Waiting for memory
Poor speculation
Solving the Delay Issue
Since most programs are slowed by waiting for data, prefetching this data would speed execution
Problems with Prefetching
Small instruction window
Needs to predict branches
Has limited resources on hand
Does not solve the problem of pointer chains
Solution
Expand the instruction window!
Normally done by increasing instruction-level parallelism (ILP)
Increasing ILP means increasing the sizes of hardware structures, such as the register file, the issue queues, and the reorder buffer
This is not an ideal solution for people working with a fixed structure size, so another solution is necessary
Precomputation Solution
The instruction window can be effectively expanded by executing instructions in a separate thread of execution that assists the main thread, bringing data into the cache and resolving branches before the main thread reaches them
Since these instructions are executed before they normally would be, their results are said to be "precomputed"
Adding Precomputation to your CPU
How can precomputation be included?
Different methods for using multiple threads:
The secondary thread is run ahead of the main thread; this method is software controlled
When the main thread stalls on an instruction, the secondary thread executes; this method is hardware controlled
A mixture of hardware and software control
Each method has certain advantages
Implementation of the Design
Software-Controlled Precomputation
Hardware-Controlled Precomputation
Software-Controlled Precomputation
The Basics
Allows the compiler to insert helper threads into code that is likely to incur cache misses
Launches precomputation threads based on the programmer's knowledge, cache miss profiling, and compiler locality analysis
Running the Threads
When the code calls for a precomputation thread to be created, check for idle hardware
If no hardware is idle, drop the request
Otherwise, start a precomputation thread at the given PC
Applications of Software Precomputation
Analysis of programs with irregular access patterns that are typically difficult for prefetching
Usually involving pointers, hash tables, and indirect array references
Fixing pointer chains
A big problem with prefetching is that of pointer chains
A pointer chain is a structure in which the address of the next node is not known until the current load finishes
Single pointers can be resolved by using jump-pointer prefetching
Jump-pointer prefetching becomes too complex when multiple chains must be resolved
Running a helper thread for each chain allows multiple chains to be resolved quickly
Using Precomputation on a linked-list
A single helper thread is used because there are enough nodes in the list for precomputation to mask the memory latency
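
To make this concrete, here is a minimal C sketch of a helper routine that walks a singly linked list ahead of the main thread so that each node is already in the cache when the main thread reaches it; the node layout and function name are illustrative assumptions rather than anything from the original work.

    #include <stddef.h>

    struct node {
        struct node *next;
        int payload;
    };

    /* Illustrative helper-thread body: chase the pointer chain ahead of the
     * main thread.  The loads pull each node into the cache; the returned sum
     * exists only to keep the loads from being optimized away. */
    int precompute_list(const struct node *head)
    {
        int sink = 0;
        for (const struct node *p = head; p != NULL; p = p->next)
            sink += p->payload;
        return sink;
    }

Because the helper issues only loads and does none of the main thread's real work, it naturally runs ahead and hides the pointer-chasing misses.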
More complicated uses for precomputation
Hashing is the most difficult challenge to prefetching for two reasons:
Good hash algorithms are fairly random, so regular prefetching is hard
Good hash algorithms use short chains, so jump-pointer prefetching will not work
Precomputation allows N hash functions to run at the same time, reducing memory stalls
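
As one rough illustration of applying precomputation to hash-table lookups, the helper below hashes a batch of keys the main thread will probe soon and touches the head of each bucket; the table layout, hash function, and all names are assumptions made for this sketch, not details from the original work.

    #include <stddef.h>

    #define NBUCKETS 1024                      /* illustrative table size */

    struct entry { struct entry *next; int key; int value; };
    extern struct entry *table[NBUCKETS];      /* assumed chained hash table */

    static unsigned hash(int key) { return (unsigned)key % NBUCKETS; }

    /* Illustrative helper-thread body: for keys the main thread will look up
     * soon, compute the bucket index and touch the chain head.  Because a good
     * hash function keeps chains short, one touch per key covers most of the
     * miss cost. */
    int precompute_probes(const int *upcoming_keys, size_t n)
    {
        int sink = 0;
        for (size_t i = 0; i < n; i++) {
            const struct entry *e = table[hash(upcoming_keys[i])];
            if (e != NULL)
                sink += e->key;                /* load pulls the bucket into cache */
        }
        return sink;                           /* keeps the loads live */
    }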
Support for software-based precomputation
In order to utilize software-based precomputation, it is necessary to add a few new instructions to the existing processor
Thread_ID = PreExecute_Start(Start_PC, Max_Insts): Requests an idle context to start pre-execution at Start_PC and stop after Max_Insts instructions have been executed. Thread_ID holds either the identity of the pre-execution thread or -1 if there is no idle context. This instruction has effect only if it is executed by the main thread.
PreExecute_Stop(): The thread that executes this instruction terminates itself if it is a pre-execution thread; no effect otherwise.
PreExecute_Cancel(Thread_ID): Terminate the pre-execution thread with Thread_ID. This instruction has effect only if it is executed by the main thread.
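
Below is a hedged sketch of how compiler-generated code might use these primitives around a pointer-chasing loop. The instruction names come from the slide, but the C-level signatures, the 256-instruction budget, the hand-off through a shared variable, and the helpers do_work and precompute_list_body are assumptions made for illustration.

    #include <stddef.h>

    struct node { struct node *next; int payload; };

    /* Assumed C-level wrappers for the new instructions described above. */
    extern int  PreExecute_Start(void (*start_pc)(void), int max_insts);
    extern void PreExecute_Stop(void);
    extern void PreExecute_Cancel(int thread_id);

    extern void do_work(struct node *p);        /* the main thread's real computation */

    static struct node *shared_head;            /* hand-off of the traversal start */

    /* Helper body: issue the loads of the upcoming traversal, then self-terminate. */
    static void precompute_list_body(void)
    {
        int sink = 0;
        for (const struct node *p = shared_head; p != NULL; p = p->next)
            sink += p->payload;                 /* loads warm the cache; result unused */
        (void)sink;
        PreExecute_Stop();
    }

    void process_list(struct node *head)
    {
        shared_head = head;
        /* Request an idle context; -1 means none was free and the request is dropped. */
        int helper = PreExecute_Start(precompute_list_body, 256 /* assumed budget */);

        for (struct node *p = head; p != NULL; p = p->next)
            do_work(p);                         /* hits in cache where the helper has been */

        if (helper != -1)
            PreExecute_Cancel(helper);          /* helper is no longer useful once we finish */
    }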
Hardware-Controlled Precomputation
The Basics
Allocates a set portion of available registers to precomputation threads
Runs secondary helper thread when the primary thread is stalled
Integration in Hardware
In order to execute the secondary (future) thread, additional structures are needed within the hardware
These are the future IFQ, future rename table, and Preg status table
The processor must also have a PC for both threads, both initially being the same
Updating the Hardware at Runtime
The future IFQ is loaded with instructions fetched by the future thread
The future rename table receives a copy of each instruction that is mapped into the primary rename table
For each instruction dispatched by the future thread, an entry is added to the Preg status table, which keeps track of the registers assigned to the future thread
Other fields in the Preg table indicate whether or not the register is able to be reused by the future thread
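
One possible shape of a Preg status table entry, inferred from the description above, is sketched in C below; the field names and widths are assumptions, not the actual hardware layout.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative Preg status table entry: one per physical register
     * dispatched by the future thread. */
    struct preg_status_entry {
        uint16_t preg;        /* physical register assigned to the future thread */
        uint32_t seq_num;     /* sequence number of the producing instruction */
        bool     completed;   /* the future thread has produced the result */
        bool     reusable;    /* the future thread may reclaim this register */
        bool     valid;       /* entry has not yet been handed to the primary thread */
    };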
Importance of Register Reuse
By allowing the future thread to timeout and reuse registers, it is possible to run the future thread more efficiently
With the timeout protocols, it is also possible to allocate resources from the future thread to the primary thread to ensure the priority of the primary thread
Resuming Activity in the Primary Thread
It would be wasteful to run the same instructions twice if the data is still available, so many hardware-based precomputation schemes allow results to be passed from the Preg status table and future rename table to the primary thread
If an instruction exists in the tables and has the appropriate sequence number, its result is allocated to the primary thread and the entry is removed from the future-thread tables
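
The hand-off itself might look roughly like the sketch below: the primary thread checks the table for a completed instruction with a matching sequence number and, on a hit, adopts the mapping instead of re-executing it. The table walk reuses the illustrative entry layout from the previous sketch (re-declared so the snippet stands alone); everything here is an assumption about the mechanism, not the actual hardware.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct preg_status_entry {
        uint16_t preg;        /* physical register holding the precomputed result */
        uint32_t seq_num;     /* sequence number of the producing instruction */
        bool     completed;   /* the future thread has produced the result */
        bool     reusable;    /* the future thread may reclaim this register */
        bool     valid;       /* entry has not yet been handed to the primary thread */
    };

    /* Illustrative reuse check: if the future thread already executed the
     * instruction with this sequence number, hand its register to the primary
     * thread and retire the entry from the future-thread tables. */
    int try_reuse_future_result(struct preg_status_entry *table, size_t n,
                                uint32_t seq_num)
    {
        for (size_t i = 0; i < n; i++) {
            if (table[i].valid && table[i].completed && table[i].seq_num == seq_num) {
                table[i].valid = false;       /* remove from the future-thread tables */
                return table[i].preg;         /* primary thread adopts this mapping */
            }
        }
        return -1;                            /* not found: primary thread executes normally */
    }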
Recovering from Branch Mispredictions
Once the future thread begins execution, only the future thread accesses the branch predictor. Instead of its normal operation, the branch predictor gives its predictions to the future thread, which then conveys them to the primary thread via a FIFO queue.
These predictions are updated by the future thread as branches are resolved, so that the primary thread does not have to go down the mispredicted path.
On detecting a misprediction, the future thread rolls back to a checkpoint of the state at the mispredicted branch. The checkpoint rolls back to the sequence number of the mapping, not the mapping itself. Anything after this checkpoint can be overwritten, and is flagged as such.
Given the opportunistic nature of the future thread, its misprediction penalty does not play a major role in its performance.
Results of Precomputation Tests
Summary
Utilizes secondary threads to improve speed
Can run as hardware or software based
Generally runs programs more than 25% faster than normal execution