Data Speculation Support for a Chip Multiprocessor (Hydra CMP)


  • Data Speculation Support for a Chip Multiprocessor (Hydra CMP)
    Lance Hammond, Mark Willey and Kunle Olukotun

    Presented: May 7th, 2008, by Ankit Jain
    CS 258: Parallel Computer Architecture

    (Some slides have been adapted from Olukotun's talk to CS252 in 2000)

  • Outline
    The Hydra Approach
    Data Speculation
    Software Support for Speculation (Threads)
    Hardware Support for Speculation
    Results

  • The Hydra Approach

  • Exploiting Program Parallelism (HYDRA)

  • Hydra Approach
    A single-chip multiprocessor architecture composed of simple, fast processors
    Multiple threads of control: exploits parallelism at all levels
    Memory renaming and thread-level speculation: makes it easy to develop parallel programs
    Keep the design simple by taking advantage of the single-chip implementation

  • The Base Hydra Design
    Single-chip multiprocessor with four processors
    Separate primary caches; write-through data caches to maintain coherence
    Shared 2nd-level cache
    Low-latency interprocessor communication (10 cycles)
    Separate fully pipelined read and write buses to maintain single-cycle occupancy for all accesses

  • Data Speculation

  • Problem: Parallel Software
    Parallel software is limited to hand-parallelized and auto-parallelized applications
    Traditional auto-parallelization of C programs is very difficult:
    Threads have data dependencies, which require synchronization
    Pointer disambiguation is difficult and expensive
    Compile-time analysis is too conservative
    How can hardware help?
    Remove the need for pointer disambiguation
    Allow the compiler to be aggressive (see the sketch below)
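
    As a concrete illustration of the disambiguation problem, consider the hypothetical C loop below: unless the compiler can prove that dst and src never overlap, it must assume a loop-carried dependency and keep the loop sequential, even if the pointers never actually alias at run time.

        /* Hypothetical example: a compiler cannot safely parallelize
         * this loop, because if dst and src alias (e.g. dst == src + 1)
         * then iteration i+1 reads a value written by iteration i. */
        void scale(float *dst, const float *src, float k, int n)
        {
            for (int i = 0; i < n; i++)
                dst[i] = src[i] * k;    /* may or may not alias src */
        }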

  • Solution: Data Speculation
    Data speculation enables parallelization without regard for data dependencies:
    Loads and stores follow the original sequential semantics (committed in order using thread sequence numbers)
    Speculation hardware ensures correctness
    Synchronization is added only for performance
    Loop parallelization is now easily automated
    Other ways to parallelize code: break code into arbitrary threads, e.g. speculative subroutines (sketched below)
    Parallel execution with sequential commits
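
    A minimal sketch of the speculative-subroutine idea mentioned above; every helper here (predict_return, fork_speculative, restart_speculative) is a hypothetical stand-in for the hardware/runtime mechanisms, not an actual Hydra API.

        extern int  predict_return(void);                /* hypothetical  */
        extern void fork_speculative(void (*k)(int), int pred);
        extern void restart_speculative(void);
        extern int  f(int);

        /* Run the code after a call (the continuation) speculatively,
         * in parallel with the call itself, using a predicted return
         * value; a wrong prediction forces a squash and re-execution. */
        void call_site(int x, void (*continuation)(int))
        {
            int pred = predict_return();   /* e.g. repeat the last value */
            fork_speculative(continuation, pred);
            int r = f(x);                  /* executes in parallel       */
            if (r != pred)
                restart_speculative();     /* misprediction: squash/redo */
        }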

  • Data Speculation Requirements I
    Forward data between parallel threads
    Detect violations when reads occur too early

  • Data Speculation Requirements II
    Safely discard bad state after a violation
    Correctly retire speculative state
    Forward progress guarantee

  • Data Speculation Requirements Summary
    A method for detecting true memory dependencies, in order to determine when a dependency has been violated
    A method for backing up and re-executing speculative loads, and any instructions that may be dependent upon them, when the load causes a violation
    A method for buffering any data written during a speculative region of a program, so that it may be discarded when a violation occurs or permanently committed at the right time (see the toy model below)
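
    A toy C model tying the three requirements together, assuming four threads, word-granularity tracking, and a tiny flat memory; all names and data structures here are illustrative simplifications, not Hydra's actual buffer organization.

        #include <stdbool.h>
        #include <string.h>

        #define THREADS 4
        #define WORDS   64

        typedef struct {
            int  buf[WORDS];      /* speculative write buffer         */
            bool written[WORDS];  /* words this thread has written    */
            bool read[WORDS];     /* words this thread has read       */
            bool violated;        /* must be squashed and re-executed */
        } Thread;

        static int    memory[WORDS];
        static Thread t[THREADS];     /* t[0] = non-speculative head  */

        /* Requirement 1: loads forward data from the newest less
         * speculative writer (including this thread itself). */
        int spec_load(int id, int addr)
        {
            if (!t[id].written[addr])     /* locally written data     */
                t[id].read[addr] = true;  /* cannot be violated       */
            for (int s = id; s >= 0; s--)
                if (t[s].written[addr])
                    return t[s].buf[addr];
            return memory[addr];
        }

        /* Requirement 2: a store that hits a word already read by a
         * more speculative thread means that thread read too early. */
        void spec_store(int id, int addr, int val)
        {
            t[id].buf[addr] = val;
            t[id].written[addr] = true;
            for (int s = id + 1; s < THREADS; s++)
                if (t[s].read[addr])
                    t[s].violated = true;   /* violation: restart     */
        }

        /* Requirement 3: buffered writes are committed in sequential
         * order, or discarded wholesale after a violation. */
        void commit_or_squash(int id)
        {
            if (!t[id].violated)
                for (int a = 0; a < WORDS; a++)
                    if (t[id].written[a])
                        memory[a] = t[id].buf[a];
            memset(&t[id], 0, sizeof t[id]);  /* clear either way     */
        }

    Because t[0] is the non-speculative head, its commits can never be squashed, which is what guarantees forward progress.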

  • Software Support for Speculation (Threads + Register Passing Buffers)

  • Thread Fork and Return

  • Register Passing Buffers (RPBs)
    One RPB is allocated per thread
    Allocated once in memory at thread-creation time, so it can be loaded/re-loaded whenever the thread is started/restarted
    Speculative return values are set using a "repeat last return value" prediction mechanism
    When a new RPB is allocated, it is added to the active buffer list, from which free processors pick up the next-most-speculative thread (a sketch follows this slide)
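
    A minimal sketch of what an RPB might hold, based on this slide; the field names and layout are assumptions, not the actual Hydra structure. Since the buffer lives in memory, a restarted thread can simply reload it.

        typedef struct rpb {
            unsigned long regs[32];  /* register state handed to thread */
            unsigned long pred_ret;  /* "repeat last return value"      */
            int           seq;       /* sequence no. for in-order commit*/
            struct rpb   *next;      /* link in the active buffer list  */
        } RPB;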

  • E.g.: Speculatively Executed Loop
    A termination message is sent from the first processor that detects the end-of-loop condition; any speculative processors that executed iterations beyond the end of the loop are cancelled and freed
    Justifies the need for precise exceptions: an operating system call or exception can only be made from a point that would be encountered in the sequential execution, so the thread is stalled until it becomes the head processor

  • Miscellaneous Issues
    Thread size: limited buffer size, true dependencies, restart length, overhead
    Explicit synchronization: used to improve performance; not needed for correctness
    Ability to dynamically turn off speculation at runtime when there are truly parallel threads in the code
    Ability to share processors with the OS (speculative threads give up their processors)

  • Hardware Support for Speculation

  • Hydra Speculation Support
    The write bus and L2 buffers provide forwarding
    Read L1 tag bits detect violations
    Dirty L1 tag bits and write buffers provide backup
    Write buffers reorder and retire speculative state
    Separate L1 caches with pre-invalidation and smart L2 forwarding provide multiple views of memory
    Speculation coprocessors control the threads

  • Secondary Cache Write Buffers
    Data is forwarded to more speculative processors based on per-byte write masks
    Only the set bytes are drained to the L2 cache on commit (see the sketch below)
    There are more buffers than processors, to allow execution to continue while draining happens
    Each processor keeps tags of the lines it has written, to detect when its buffer would overflow; it then halts until it becomes the head processor
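
    A sketch of the per-byte write-mask drain described above, assuming 32-byte lines; WBLine is an illustrative structure, not the real buffer layout.

        #include <stdint.h>

        #define LINE 32   /* assumed 32-byte cache lines */

        typedef struct {
            uint8_t  data[LINE];
            uint32_t mask;   /* bit i set => byte i written speculatively */
        } WBLine;

        /* On commit, drain only the bytes named by the write mask into
         * the L2 line; unwritten bytes keep their existing L2 values. */
        void drain_to_l2(const WBLine *w, uint8_t *l2_line)
        {
            for (int i = 0; i < LINE; i++)
                if (w->mask & (1u << i))
                    l2_line[i] = w->data[i];
        }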

  • Speculative Loads (Reads)
    L1 hit: the read bits are set
    L1 miss: the L2 and the write buffers are checked in parallel
    The newest bytes written to a line are pulled in by priority encoders on each byte (priority 1-5), as sketched below
    Read and modified bits for the appropriate bytes are set in the L1
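
    Continuing the WBLine sketch above, the per-byte priority selection on a speculative L1 miss might look like the following: each byte defaults to the L2 copy and is overridden by the newest less-speculative write buffer that has written it.

        /* Reuses WBLine/LINE from the previous sketch. wb[0..id] are
         * the write buffers from the head thread up to this thread,
         * ordered least to most speculative (oldest to newest). */
        void fill_line(int id, const WBLine *wb, const uint8_t *l2_line,
                       uint8_t *out)
        {
            for (int i = 0; i < LINE; i++) {
                out[i] = l2_line[i];          /* lowest priority: L2    */
                for (int s = 0; s <= id; s++) /* newer overrides older  */
                    if (wb[s].mask & (1u << i))
                        out[i] = wb[s].data[i];
            }
        }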

  • Speculative Stores (Writes)
    A CPU writes to its L1 cache and write buffer
    Writes from earlier (less speculative) CPUs invalidate our L1 and trigger RAW hazard checks
    Writes from later (more speculative) CPUs just pre-invalidate our L1
    The non-speculative write buffer drains out into the L2 (a sketch follows)
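
    A sketch of the snoop decision this slide implies for a store seen on the write bus; Line, the sequence-number comparison, and raise_violation() are assumptions used for illustration.

        #include <stdbool.h>

        extern void raise_violation(void);   /* hypothetical trap       */

        typedef struct {
            bool valid, pre_invalidate, read_bit;  /* per-line tag bits */
        } Line;

        /* writer_seq and my_seq are thread sequence numbers
         * (lower = less speculative, i.e. earlier in program order). */
        void snoop_store(Line *ln, int writer_seq, int my_seq)
        {
            if (writer_seq < my_seq) {       /* earlier CPU wrote       */
                ln->valid = false;           /* invalidate stale copy   */
                if (ln->read_bit)
                    raise_violation();       /* RAW: we read too soon   */
            } else if (writer_seq > my_seq) {
                ln->pre_invalidate = true;   /* later CPU: flush lazily */
            }                                /* when this thread ends   */
        }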

  • Results

  • Results (1/3)

  • Results (2/3)
    [Chart residue; recoverable annotations: "27 cycles", "140 cycles", "4000 cycles", "occasional dependencies", "too many dependencies"]

  • Results (3/3)

  • Conclusion
    Speculative support is only able to improve performance when there is a substantial amount of medium-grained loop-level parallelism in the application. When the granularity of parallelism is too small, or there is little inherent parallelism in the application, the overhead of the software handlers overwhelms any potential performance benefits from speculative-thread parallelism.

  • Extra Slides: Tables and Charts

  • Quick Loops

  • Hydra Speculation Hardware
    Modified Bit
    Pre-invalidate Bit
    Read Bits
    Write Bits

    The goal of any high-performance architecture is to exploit program parallelism, and parallelism exists at multiple levels. At the lowest level is the parallelism between instructions in a basic block. Above that is the loop-level parallelism between loop iterations. Above this is the thread-level parallelism that comes from parallelizing a single application, either manually or automatically. At the highest level is the parallelism between separate processes in a multiprogramming environment. The granularity of parallelism typically increases as the level of parallelism increases. Superscalar architectures concentrate on exploiting ILP and, to some extent, LLP, but these techniques are complex and do not scale well in advanced semiconductor technologies. So this begs the question: is there a simpler architecture that can exploit parallelism at all of these levels?

    The approach we have taken in Hydra is a multiprocessor built out of simple processors that can be designed for high speed. The main advantage of this architecture is multiple threads of control, which enable parallelism to be exploited that is much more widely distributed in the program than is possible with a single hardware window. Secondly, if each of the processors is tightly integrated with its first-level cache, the highest-speed paths can be localized to reduce the impact of wire delay, making it possible to achieve very high clock frequencies. The disadvantage of this architecture is that it requires parallel programs to run on more than one processor; we use thread-level speculation to simplify the process of creating parallel programs. Our overall design approach in Hydra is to keep the basic MP architecture as simple as possible by taking advantage of the fact that we are on a single piece of silicon. We'll see how we do this.

    The problem is that the amount of parallel software is limited. Today, most parallel software is generated manually, which is typically a long and difficult process. Traditional parallelizing compilers can help, but they only work well for dense-matrix FORTRAN programs, so they are not a solution for parallelizing general-purpose programs, particularly C programs. The problem with C programs is figuring out the memory dependencies and placing synchronization to honor true data dependencies. This is especially a problem where pointers are used: typically, the pointer disambiguation algorithms are not powerful enough, and so they limit the amount of parallelism that can be found. Furthermore, the compiler must be conservative, so that if 99% of the iterations of a loop are parallel but 1% are not, the loop is declared not parallel.

    We would like to change this situation by adding hardware that allows the compiler to be aggressive rather than conservative. The technique we use is called data speculation. Data speculation allows sequential programs to execute in parallel, while ensuring that all loads and stores follow the sequential order so that program correctness is maintained. Synchronization between parallel threads is now only needed for performance, and loops can be parallelized almost obliviously.

    Loop iterations are not the only program construct that can be parallelized in this way; all that is required is a sequential execution order for the threads. Another construct that we have experimented with is procedures, where the code after the procedure call is executed in parallel with the procedure itself. So what sort of hardware support do you need for data speculation?

    The basic functionality we would like to provide is memory location renaming, which is analogous to register renaming in dynamically scheduled processors. Let's suppose we have two iterations of a loop (i and i+1). Naturally, a write to x in the earlier iteration will be read by a read of x in the later iteration. Suppose these iterations are executed in parallel. If the write occurs before the read, we would like the value of x to be forwarded to the later iteration; but if the read occurs before the write, we have a violation, and we want the memory system to detect this case and signal the violation. The simplest, although not the most efficient, way of recovering from a violation is to re-execute the loop iteration.
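
    In C, the i / i+1 situation from these notes might look like the loop below (illustrative code, not from the paper); f is an arbitrary function that makes the dependence loop-carried.

        extern int f(int, int);

        /* Iterations i and i+1 run on different CPUs; both touch *x.
         * If iteration i's store reaches *x before iteration i+1 loads
         * it, the value is forwarded; if i+1 loads first, the hardware
         * detects the violation and re-executes iteration i+1. */
        void speculative_loop(int *x, const int *a, int n)
        {
            for (int i = 0; i < n; i++)
                *x = f(*x, a[i]);  /* loop-carried read-modify-write */
        }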

    However, if this loop iteration is re-executed, the state of the memory system should be the same as when the loop first executed. In particular, all writes from the violated iteration must be discarded and must not be allowed to change the sequential, permanent state of the machine. On the other hand, if iteration i+1 finishes without violations, all of its writes must change the permanent state after the writes from iteration i. Note that writes to the same variable get renamed [memory location renaming]. Also note that iteration i can cause violations in iteration i+1 even if it finishes after i+1; in other words, all threads must commit in order even though they execute in parallel. Forward progress is always maintained because one thread (running on the head processor) always executes non-speculatively, and is therefore immune from violations.

    Limited buffer size: since we need to buffer state from a speculative region until it commits, threads need to be short enough to avoid filling up the buffer space allocated for data speculation too often. An occasional full buffer can be handled by simply stalling the thread that is producing too much state until it becomes the head processor, when it may continue to execute while writing directly to memory. However, if this occurs too often, performance will suffer.

    True dependencies: excessively large threads have a higher probability of dependencies with later threads, simply because they issue more loads and stores. With more true dependencies, more violations and restarts occur.

    Restart length: a late restart on a large thread will cause much more work to be discarded, since a checkpoint of the system state is only taken at the beginning of each thread. Shorter threads result in more frequent checkpoints and thus more efficient restarts.

    Overhead: very small threads are also inefficient, because there is inevitably some overhead incurred during thread creation and completion. Programs that are broken up into larger numbers of threads will waste more time on these overheads. (There are also more threads than processors.)

    Coprocessor support: a table of exception vectors for speculation events, handled by software routines as well as exception handlers triggered by hardware events. The coprocessor contains timers and prediction tables used to prevent speculation on non-parallel threads and to predict return values for speculative procedures.

    L1 cache info. Modified bit: this bit acts like a dirty bit in a writeback cache. If any changes are written to the line during speculation, this bit is set. These changes may come from stores by this processor, or because a line is read in that includes speculative data from active secondary cache buffers. If a thread needs to be restarted on this processor, then all lines with the modified bit set are gang-invalidated at once.

    Pre-invalidate bit: this optional bit is set whenever another processor writes to the line but is running a more speculative thread than this processor. Since writes are only propagated back to more speculative processors, we are able to safely delay invalidating the line until a different, more speculative thread is assigned to this processor. Thus, this bit acts as the opposite of the modified bit: it invalidates its cache line when the processor completes a thread. Again, all lines must be designed for gang-invalidation. If pre-invalidate bits are not included, writes from more speculative processors must invalidate the line immediately to ensure correct program execution.

    Read bits: these bits are set whenever the processor reads from a word within the cache line, unless that word's written bit is set. If a write from a less speculative thread, seen on the write bus, hits an address in a data cache with a set read bit, then a true dependence violation has occurred between the two processors. The data cache then notifies the processor's CP2, initiating a violation exception. Subsequent stores will not activate the written bit for this line, since the potential for a violation has been established.

    Write bits: to prevent unnecessary violations, this bit or set of bits may be added to allow renaming of memory addresses used by multiple threads in different ways. If a processor writes to an entire word, then the written bit is set, indicating that this thread now has a locally generated version of the address. Subsequent loads will not set any read bit(s) for this section of the cache line, and therefore cannot cause violations.
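
    The read/written-bit rules from these notes, restated as a small C sketch; LineTags and the word granularity (8 words per 32-byte line) are illustrative assumptions, not the actual tag SRAM layout.

        #include <stdbool.h>

        #define WORDS_PER_LINE 8  /* assumed: 32-byte line, 4-byte words */

        typedef struct {
            bool read_bit[WORDS_PER_LINE];     /* set by spec. loads    */
            bool written_bit[WORDS_PER_LINE];  /* set by full-word      */
        } LineTags;                            /* speculative stores    */

        /* Local load: set the read bit unless this thread already
         * wrote the whole word (its own data cannot be violated). */
        void on_local_load(LineTags *t, int word)
        {
            if (!t->written_bit[word])
                t->read_bit[word] = true;
        }

        /* Local full-word store: the word is renamed locally, so later
         * loads of it set no read bits and cannot cause violations. */
        void on_local_store(LineTags *t, int word)
        {
            t->written_bit[word] = true;
        }

        /* Store from a LESS speculative thread, seen on the write bus:
         * a set read bit means this thread loaded the word too early. */
        bool on_remote_store(const LineTags *t, int word)
        {
            return t->read_bit[word];  /* true => violation via CP2 */
        }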

    What happens on an L1 cache conflict? Normally, the line is flushed to the L2; but if its read bits are set, the processor must halt execution until it becomes the head processor, because otherwise there is no way to detect speculation violations in the L2. A victim cache can be used for this as well.

    There are 3 speculative CPUs and one non-speculative head CPU. Miss rates are larger during speculation because of interprocessor communication misses and because restarts reload state from the L2. The percentage increase in loads is so high because of superfluous speculative memory accesses that are then restarted. Compress() is an anomaly because the uniprocessor-optimized code had more register saves across function calls than the speculative loop body.