What is Simultaneous Multithreading?

Generally speaking there are two types of parallelism that can be exploited by modern computing machinery to achieve higher performance. The Instruction Level Parallelism (ILP) approach attempts to reduce program runtime by overlapping the execution time of as many instructions as possible, to as great a degree as possible. The EV8 will have higher performance than earlier Alpha designs through the enhanced exploitation of ILP made possible by its eight-instruction issue width. But gains from higher ILP come at a high and ever increasing price. Building wider machines runs into the problem of geometrically increasing complexity in control logic while data and control dependencies within the program code limit performance increases. John Hennessy of Stanford University has likened the difficulty of increasing exploitation of ILP for greater performance to the task of pushing a boulder up a mountain whose slopes grow ever steeper the further processor architects progress [1].

Figure 1. Multithreaded Execution with Increasing Levels of TLP Hardware Support

The second form of parallelism is called Thread Level Parallelism or TLP. This simply means the ability to execute independent programs, or independent parts of a single program, simultaneously using different flows of execution, called threads. The illusion of multiple thread execution is often achieved on a single conventional processor through the use of multitasking. Multitasking relies on the ability of an operating system (OS) to overlap the execution of multiple threads or programs on a single processor by running each thread successively for short intervals. This is shown in Figure 1A. This diagram illustrates program execution using rectangles repeated in the horizontal direction to represent consecutive clock cycles while squares placed vertically in each rectangle represent the per cycle utilization of instruction issue slots in a four way superscalar processor (unused slots are left as white squares).

Each thread runs for a short interval that ends when the program experiences an exception like a page fault, calls an operating system function, or is interrupted by an interval timer. When a thread is interrupted, a short segment of OS code (shown in Figure 1A as gray instructions in issue slots) is run which performs a context switch and switches execution to a new thread. Multitasking provides the illusion of simultaneous execution of multiple threads but does nothing to enhance the overall computational capability of the processor. In fact, excessive context switching causes processor cycles, which could have been used running user code, to be wasted in the OS.

The most basic type of TLP exploitation that can be incorporated into processor hardware is coarse grained multithreading (CMT), shown in Figure 1B. The processor incorporates two or more thread contexts (general purpose registers, program counter PC, process status word PSW etc.) in hardware. One thread context is active at a time and runs until an exception occurs or, more likely, a high latency operation such as a cache miss during a load instruction. When this occurs, the processor hardware automatically flushes and changes the thread context, and switches execution to a new thread.

For contemporary MPUs, a memory operation initiated in response to a cache miss can take over a hundred clock cycles, which represents the potential execution of hundreds of instructions. A conventional in-order processor will simply stall and forever lose those hundreds of potential instruction slots waiting for memory to respond with needed data. A conventional out-of-order execution processor has the potential to continue to execute other instructions that weren’t dependent on the missed load data. However, independent instructions tend to be quickly exhausted in most programs and the processor simply takes longer to stall.

But a coarse grained multithreaded processor has the opportunity to quickly switch to another thread after a cache miss and perform useful work while the first thread awaits its data from memory. Many programs spend considerable time waiting for memory operations and a coarse grained multithreaded processor has the opportunity to increase overall system throughput, compared to a conventional processor performing OS-based multitasking. The IBM PowerPC RS64, also known as Northstar, is rumored to incorporate two way coarse grained multithreading capability, although it is not utilized in some product lines.
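The switch-on-miss behavior described above can be sketched as a toy cycle-count model. This is only an illustration under stated assumptions: the function names, the two-cycle switch penalty, the 20-cycle miss latency in the test, and the trace format (a list of "compute" and "miss" operations) are all hypothetical and do not describe any real processor.

```python
def stalling_cycles(trace, miss_latency):
    """Cycles for a core that simply stalls for the full latency of every miss."""
    return sum(miss_latency if op == "miss" else 1 for op in trace)


def coarse_grained_cycles(traces, miss_latency, switch_penalty=2):
    """Cycles for a core that switches to another hardware thread on each miss.

    Assumptions (hypothetical): compute ops take one cycle, a miss returns its
    data miss_latency cycles after it is issued, and every hardware thread
    switch costs switch_penalty cycles.
    """
    n = len(traces)
    pc = [0] * n          # next operation index for each thread
    ready_at = [0] * n    # cycle at which each thread's pending miss returns
    cycle = 0
    current = None
    while any(pc[i] < len(traces[i]) for i in range(n)):
        unfinished = [i for i in range(n) if pc[i] < len(traces[i])]
        runnable = [i for i in unfinished if ready_at[i] <= cycle]
        if not runnable:
            # Every unfinished thread is waiting on memory: idle until one wakes.
            cycle = min(ready_at[i] for i in unfinished)
            continue
        if current not in runnable:
            if current is not None:
                cycle += switch_penalty
            current = runnable[0]
        op = traces[current][pc[current]]
        pc[current] += 1
        if op == "miss":
            ready_at[current] = cycle + miss_latency
        else:
            cycle += 1
    # A miss that ends a trace must still complete before the run is over.
    return max(cycle, max(ready_at))
```

With a single thread the model degenerates to the stalling case. With two threads, part of each miss latency is hidden behind the other thread's compute operations, so the combined run finishes sooner than running the two traces back to back on the stalling core.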

A more comprehensive way to exploit TLP in hardware is the fine grained multithreaded (FMT) processor. The operation of one variant of this class of machine is shown in Figure 1C. In this type of design there are N thread contexts in the processor and instructions from each thread are allocated every Nth processor clock cycle to advance through the processor’s execution pipeline by one stage. Figure 1C shows the operation of a four-way fine grained multithreaded processor, i.e. N = 4. At first glance it seems like each thread has only 1/Nth the performance potential of a conventional processor. It is actually much better than this simply because the execution pipeline can be made much shorter from the logical viewpoint of a single thread. This reduces instruction latencies, simplifies compiler code scheduling, and increases the instructions per clock (IPC) component of performance.

For example, a four-way fine grained multithreaded processor might provide single cycle latency floating point (FP) addition while conventional processors typically require three or four cycles of latency. That is possible because the FP adder has four physical processor clock cycles to advance a thread’s FP add instruction through what is one logical execution pipeline stage from the thread’s viewpoint. In a similar fashion, memory latency appears to be 1/Nth the number of processor clock cycles from the viewpoint of individual threads. The hardware cost of fine grained multithreading is relatively modest: N thread contexts, and control logic and multiplexors to cyclically commutate instructions and data from N different threads into and out of the execution units. The drawback of this approach is that its performance running any single thread is still appreciably less than for a conventional processor although the system throughput is increased. An example of a fine-grained multithreaded processor is the five-threaded MicroUnity MediaProcessor [2].
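The latency arithmetic in this example can be written out as a short sketch. The function names are hypothetical, and the strict rotation (thread t owns every cycle where cycle mod N equals t) is an assumption matching the one variant described above.

```python
import math


def thread_for_cycle(cycle, n_threads):
    """Thread that owns a given physical clock cycle under strict rotation."""
    return cycle % n_threads


def apparent_latency(physical_cycles, n_threads):
    """Latency as seen by one thread, counted in that thread's own cycles.

    A thread only advances once every n_threads physical cycles, so an
    operation that takes physical_cycles appears to take about 1/N as long.
    """
    return math.ceil(physical_cycles / n_threads)
```

For N = 4, a four-cycle FP add appears to a thread as a single-cycle operation, and a 100-cycle memory access appears as 25 thread cycles.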

The EV8 uses a more powerful mechanism than either coarse or fine grained multithreading to exploit TLP. Called Simultaneous Multithreading (SMT), it allows the instructions from two or more threads to be issued to execution units each cycle. This process is illustrated conceptually in Figure 1D. The advantage of SMT is that it permits TLP to be exploited all the way down to the most fundamental level of hardware operation - instruction issue slots in a given clock period. This allows instructions from alternate threads to take advantage of individual instruction execution opportunities presented by the normal ILP inefficiencies of single thread program execution. SMT can be thought of as equivalent to the airline practice of using standby passengers to fill seats that would have otherwise flown empty.

Consider a single thread executing on a superscalar processor. Conventional superscalar processors such as the Alpha EV6 fall well short of utilizing all the available instruction issue slots. This is caused by execution inefficiencies including data dependency stalls, cycle by cycle shortfall between thread ILP and the processor resources given limited re-ordering capability, and memory accesses that miss in cache. The big advantage of SMT over other approaches is its inherent flexibility in providing good performance over a wide spectrum of workloads. Programs that have a lot of extractable ILP can get nearly all the benefit of the wide issue capability of the processor. And programs with poor ILP can share with other threads instruction issue slots and execution resources that otherwise would have gone unused.
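The slot-filling idea can be illustrated with a small sketch. It assumes a four-wide machine as in Figure 1 and made-up per-cycle counts of ready instructions; the function names and the fixed priority order among threads are hypothetical simplifications.

```python
def issued_per_cycle(ready_counts, width):
    """Issue from threads in priority order until the issue slots run out.

    ready_counts holds, for one cycle, how many instructions each thread
    could issue. Returns how many instructions each thread actually issues.
    """
    issued = []
    free = width
    for ready in ready_counts:
        take = min(ready, free)
        issued.append(take)
        free -= take
    return issued


def slot_utilization(per_cycle_ready, width):
    """Fraction of issue slots filled over a sequence of cycles."""
    total = sum(sum(issued_per_cycle(r, width)) for r in per_cycle_ready)
    return total / (width * len(per_cycle_ready))
```

As a usage example, a thread with 2, 0, 1 and 4 ready instructions over four cycles fills 7 of 16 slots alone. Paired with a second thread that has 1, 3, 2 and 2 ready instructions, the same machine fills 13 of 16 slots.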

Hardware Requirements for SMT

Compared to a conventional out-of-order execution superscalar processor like the EV6, the following hardware changes are necessary to support SMT operation:

1. Multiple program counters (PCs), and the capacity to select one or more of them to direct instruction fetch each clock cycle.

2. Association of a thread identifier with each instruction fetched to distinguish different threads for the purpose of branch prediction, branch target buffering, and register renaming.

3. A per-thread capacity to retire, flush, and trap instructions.

4. A per-thread stack for prediction of subroutine return addresses.

One of the most remarkable aspects of SMT is that it takes relatively little extra logic to add the capability to the execution portion of an out-of-order execution superscalar processor that employs register renaming and issue queues. Register renaming is a scheme in which the logical registers in an instruction set architecture (ISA) are mapped to a subset of a larger pool of physical hardware registers. Each time an instruction is decoded the logical register specified to be overwritten with the instruction result (i.e. the destination register) is assigned a mapping to a new physical register, i.e. it is renamed. When the instruction completes execution and retires, its physical destination register becomes officially bound to the logical destination register within the processor state, i.e. the result is committed. Register renaming permits out-of-order execution of instructions to proceed even in the presence of false dependencies as shown in Figure 2.

Figure 2 Data Dependencies and Register Renaming

Register renaming is also done to permit speculative execution beyond conditional branches since it allows the results of speculated instructions to be discarded and earlier processor state restored if the branch turns out to be mispredicted. In this case it is only necessary to restore an older mapping of logical to physical registers.
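A minimal sketch of the renaming step may help make this concrete. It is a simplification under stated assumptions: instructions are hypothetical (dest, src1, src2) tuples of logical register numbers, physical registers are never freed at retirement, and the function name and register counts are illustrative rather than taken from any Alpha design.

```python
def rename(instructions, num_logical, num_physical):
    """Rename (dest, src1, src2) tuples from logical to physical registers."""
    # Logical register i starts out mapped to physical register i.
    map_table = list(range(num_logical))
    free_list = list(range(num_logical, num_physical))
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources are read through the current mapping, so true
        # (read-after-write) dependencies are preserved.
        p1, p2 = map_table[src1], map_table[src2]
        # Every result gets a fresh physical register, which removes the
        # false write-after-read and write-after-write dependencies.
        new_p = free_list.pop(0)
        map_table[dest] = new_p
        renamed.append((new_p, p1, p2))
    return renamed, map_table
```

Take the sequence r1 = r2 + r3, r2 = r1 + r1, r1 = r3 + r3. After renaming, the third instruction writes a different physical register than the first and reads nothing produced by the other two, so it is free to execute out of order. Recovering from a mispredicted branch then amounts to restoring an earlier copy of map_table.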

The beauty of register renaming is that it allows an SMT processor to contain multiple thread contexts without the need for multiple physical register sets or additional complicated tracking logic to ensure execution results from instructions from different threads are written to the appropriate thread context. For example, the Alpha EV6 has 80 physical integer registers (there are actually 160 integer registers in the EV6 device but these are really two duplicate sets of 80 for reasons I won’t go into) and 72 physical FP registers. At any given time, 31 of the 80 physical integer registers contain the contents of the 31 logical general purpose registers that appear to the programmer in the Alpha ISA (there are actually 32 logical integer registers but one of them always reads as zero, as is customary for RISC architectures). The remaining physical registers are available for renaming. The EV6 uses two separate twelve-port register mappers for integer and FP register renaming, and each can rename up to four instructions per clock [3]. Content addressable memory (CAM)-based tables are used to hold the register mapping state. The map tables are also buffered so that an older state can be saved and later restored, if necessary to recover from branch mispredictions and exceptions.

At first glance, implementing a four-way SMT like the EV8 would seem to require four separate and independent register mapping tables, one for each thread. This could be physically realized with a single map table if the size of logical register specifiers used by the mapper is expanded to 7 bits by appending a two-bit thread identifier associated with a fetched instruction to the 5 bit logical register specifiers extracted from the instruction itself. So thread context 0 would use mapper logical registers 0 through 31, thread 1 would use mapper logical registers 32 through 63 and so on. In this scheme each quadrant of the mapper CAM would have the capability to be independently backed up in buffers and restored as needed to maintain the illusion of serial, in-order execution of each thread.
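The 7-bit specifier described here is simply the thread identifier placed above the 5-bit register number. A small sketch, with hypothetical names, of how such an index could be formed:

```python
REG_BITS = 5     # 32 logical registers per thread
THREAD_BITS = 2  # four hardware thread contexts


def mapper_index(thread_id, logical_reg):
    """Form the 7-bit mapper specifier: thread id in the top two bits."""
    if not 0 <= thread_id < (1 << THREAD_BITS):
        raise ValueError("thread_id out of range")
    if not 0 <= logical_reg < (1 << REG_BITS):
        raise ValueError("logical_reg out of range")
    return (thread_id << REG_BITS) | logical_reg
```

Thread 0 therefore occupies mapper entries 0 through 31, thread 1 occupies 32 through 63, and thread 3 ends at entry 127.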

Early research into 8-issue wide superscalar out-of-order processors suggests that with a 64 entry dispatch queue at least 96, and preferably 128, physical registers are needed to limit the fraction of time the processor is out of free registers to 15% and 10% respectively [4]. It is known that the EV8 supports four thread contexts in hardware [5]. This suggests that the EV8 needs an additional 96 integer physical registers above and beyond a conventional 8-issue wide superscalar. That places the number of integer physical renaming registers in the EV8 in the range of 192 to 224 for optimal performance. It should be noted that this exceeds even the 128 logical/physical integer registers required in implementations of Intel/HP's IA-64 instruction set architecture. Such a large, highly ported register file has the potential to seriously limit EV8's clock rate even with the use of an advanced 0.13 um process. The best solution to this problem is to spread register read and write access across two pipe stages instead of one. This has the effect of lengthening the basic execution pipeline from EV6's seven stages to nine stages as shown in Figure 3. One study suggests the extra two pipeline stages in the hypothetical EV8 will degrade single thread performance by less than 2% [6].
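The register count estimate can be reproduced with a few lines of arithmetic. This is one reading of where the figure of 96 additional registers comes from, namely three extra thread contexts of 32 logical integer registers each; that derivation is an assumption, and the function name and defaults are hypothetical.

```python
def smt_integer_register_range(base_low=96, base_high=128,
                               thread_contexts=4, logical_regs=32):
    """Estimate the physical integer register range for an SMT design.

    base_low and base_high are the register counts suggested for a
    single-threaded 8-issue design; each additional thread context is
    assumed to need its own set of logical registers on top of that.
    """
    extra = (thread_contexts - 1) * logical_regs
    return base_low + extra, base_high + extra
```

With the defaults this gives the 192 to 224 range quoted above.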

Figure 3. Comparison of EV6 and Hypothetical EV8 Execution Pipeline

Instruction Selection Strategies For SMT

I have described how the execution engine portion of an out-of-order superscalar processor implementing register renaming can be modified to support SMT operation. The big design issue with SMT is the algorithm that chooses between threads for the fetch and issue of instructions to that execution engine. A number of different schemes associated with 8 issue wide SMT RISC processor designs have been investigated and reported in the literature [7]. Some of these schemes are listed in Table 1.

Scheme | Max. Active Threads per Cycle | Max. Instr. Fetched per Thread per Cycle | Description
RR.1.8 | 1 | 8 | Round-robin, 1 active thread, 1 x 8 fetch
RR.2.4 | 2 | 4 | Round-robin, 2 active threads, 2 x 4 fetch
RR.2.8 | 2 | 8 | Round-robin, 2 active threads, 2 x 8 fetch
BRCOUNT.1.8 | 1 | 8 | Choose thread with fewest unresolved branches, 1 active thread, 1 x 8 fetch
BRCOUNT.2.8 | 2 | 8 | Choose thread with fewest unresolved branches, 2 active threads, 2 x 8 fetch
MISSCOUNT.1.8 | 1 | 8 | Choose thread with fewest outstanding Dcache misses, 1 active thread, 1 x 8 fetch
MISSCOUNT.2.8 | 2 | 8 | Choose thread with fewest outstanding Dcache misses, 2 active threads, 2 x 8 fetch
ICOUNT.1.8 | 1 | 8 | Choose thread with fewest instructions in DEC/REN/QUE pipe stages, 1 active thread, 1 x 8 fetch
ICOUNT.2.8 | 2 | 8 | Choose thread with fewest instructions in DEC/REN/QUE pipe stages, 2 active threads, 2 x 8 fetch

The simplest scheme is termed RR.1.8, or round-robin, one active thread, up to 8 instructions fetched. Each clock, the processor selects one thread from those not currently experiencing an instruction cache (Icache) miss on a round robin basis and uses its PC value to fetch up to 8 instructions per cycle for decoding, renaming, and entry into the integer and/or FP instruction issue queues. The Icache design is essentially unchanged from that of a conventional single-threaded 8-issue wide superscalar processor. Variants include RR.2.4 and RR.2.8, which require a dual ported Icache to permit simultaneous access using two different thread PC values. In the latter case the Icache also needs to support 16 instructions/cycle bandwidth, or twice that of a single-threaded processor. The RR.2.8 scheme takes as many instructions as possible from the first thread, and fills in any gaps with instructions fetched from the second thread. The RR.1.8 scheme provides 12% better single thread performance than RR.2.4 but RR.2.4 outperforms RR.1.8 with four active threads. Unsurprisingly, the expensive RR.2.8 scheme outperforms both RR.1.8 and RR.2.4 for both single thread and four thread operation.
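The round-robin selection step can be sketched as follows. The function name and arguments are hypothetical; the only behavior taken from the description above is that threads are scanned in rotating order and any thread with an outstanding Icache miss is skipped.

```python
def rr_select(num_threads, last_selected, icache_miss, count=1):
    """Pick up to count threads in round-robin order after last_selected.

    icache_miss[t] is True when thread t is waiting on an Icache miss and
    so cannot supply instructions this cycle.
    """
    chosen = []
    for step in range(1, num_threads + 1):
        tid = (last_selected + step) % num_threads
        if not icache_miss[tid]:
            chosen.append(tid)
            if len(chosen) == count:
                break
    return chosen
```

Calling it with count=1 corresponds to the RR.1.x schemes and count=2 to the RR.2.x schemes. If every thread is waiting on an Icache miss the result is an empty list and nothing is fetched that cycle.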

More sophisticated schemes have been devised to help increase the throughput of the processor. The BRCOUNT scheme attempts to give priority to threads that are least likely to be wasting instruction slots performing speculative execution. It does this by counting branch instructions in the decode (DEC) pipe stage, rename (REN) pipe stage, and instruction queues (QUE). Priority is given to the thread(s) with the smallest branch count. In practice BRCOUNT.x.8 offers little performance advantage over RR.x.8. The MISSCOUNT scheme gives priority to the thread(s) with the fewest outstanding data cache (Dcache) misses. Like BRCOUNT, MISSCOUNT.x.8 offers little advantage over RR.x.8.

The ICOUNT scheme takes a more general approach to prevent the 'clogging' of the instruction execution queues. Priority is given to the thread(s) with the fewest instructions in the DEC, REN, and QUE pipe stages. ICOUNT has the effect of keeping one thread from filling the instruction queue and favors threads that are moving instructions through the issue queues most efficiently. It turns out the ICOUNT scheme is also highly effective at improving processor throughput. It outperforms the best round-robin scheme by 23% and increases throughput to as much as 5.3 IPC compared to 2.5 for a non-SMT superscalar with similar resources (in this study: 32 KB direct mapped Icache and Dcache, 256 KB 4-way L2 cache, 2 MB direct mapped off-chip cache). In fact, ICOUNT.1.8 consistently outperforms RR.2.8.
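The ICOUNT selection rule is simple enough to sketch in a few lines. The function name is hypothetical, and two details are assumptions rather than facts from the cited study: threads with an outstanding Icache miss are skipped (carried over from the round-robin description), and ties are broken in favor of the lower thread number.

```python
def icount_select(front_end_counts, icache_miss, count=1):
    """Pick the thread(s) with the fewest instructions in DEC, REN and QUE.

    front_end_counts[t] is the number of thread t's instructions currently
    in the decode and rename stages and the issue queues.
    """
    candidates = [t for t in range(len(front_end_counts))
                  if not icache_miss[t]]
    # sorted() is stable, so equal counts keep their thread-number order.
    return sorted(candidates, key=lambda t: front_end_counts[t])[:count]
```

A thread that is stalled and piling up instructions in the queues automatically drops in priority, while a thread that drains its instructions quickly keeps being chosen. Swapping the key for a per-thread count of unresolved branches or outstanding Dcache misses would give the BRCOUNT and MISSCOUNT rules under the same assumptions.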

The performance difference between ICOUNT.1.8 and ICOUNT.2.8 doesn’t appear to be significant. Given the choice between them, the EV8 designers would likely choose ICOUNT.1.8 to halve Icache fetch bandwidth requirements and reduce associated power consumption. Interestingly, in a more recent paper [6], Alpha architect Joel Emer and his collaborators seem to favor an ICOUNT.2.4 scheme (2 active threads, up to 4 instructions fetched per thread per cycle). At first glance this choice, to the extent that it foretells the actual EV8 fetch heuristic, seems contrary to previous claims by Compaq that the SMT capabilities of EV8 would not hurt its single thread performance compared to a single-threaded processor. One possible explanation for this apparent contradiction may be that the ICOUNT.2.4 scheme as hypothetically implemented in EV8 is capable of using a single thread PC value to access both Icache ports to permit 8 instruction wide fetch capability for a single thread when appropriate. The processor organization of this hypothetical ICOUNT.2.4 based EV8 design is shown in Figure 4.

Figure 4 Hypothetical EV8 CPU Organization

Compaq claims the overall impact of adding SMT capability will be to increase the die area of the processor portion of the EV8 device by less than 10% [8]. It is harder to gauge the extra burden SMT imposes on the already considerable design and verification effort for an eight issue wide superscalar processor, even one implementing a streamlined and prescient RISC architecture like the Alpha ISA. The potential for EV6-like schedule slips in the EV8 project seems ominously tangible if Compaq’s Alpha managers and engineers haven’t taken to heart the lessons of that unfortunate period.

Software Implications of SMT

An obvious question to ask is how does an SMT processor offer up its multithreading capabilities to software. In the case of the EV8, it is with an abstraction called a thread processing unit or TPU. A TPU is essentially a single-threaded virtual processor that is presented to the lowest level of the operating system hardware abstraction layer (HAL). The EV8’s four way SMT capabilities are represented with four separate TPUs as shown in Figure 5.

Figure 5. Software View of the EV8

Essentially the EV8 appears to software as consisting of four separate processors that share a single set of translation lookaside buffers (TLBs) and caches. The advantages of SMT over a real four-way chip level multiprocessor (CMP) are that there is only one physical processor occupying die area and that cache coherency occurs without extra logic or overhead.

Can the EV8 execute threads from different processes (i.e. threads with different address spaces) simultaneously? That hasn’t been disclosed but the simple answer is, it would probably be easy to permit but it wouldn’t be desirable in practice because it could thrash the TLBs. It is easy to permit with a mechanism called an address space number (ASN) or address space identifier (ASID). In conventional processors an ASN is a small hardware register (typically 6 to 8 bits in size) containing a unique value that is appended to virtual addresses prior to translation. The purpose of doing this is to speed up context switches in a multitasking operating system by avoiding flushing and reloading the TLB state, and flushing and/or invalidating the caches. By simply changing the value in the ASN register during a context switch, the OS can prevent a virtual address from one process from accidentally matching the same virtual address from a previous process in the TLB and/or cache. In the case of an SMT it would seem natural that a separate ASN register be provided within each thread hardware context.
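The effect of tagging translations with an ASN can be shown with a toy TLB model. The class and method names are hypothetical, and a dictionary stands in for the hardware lookup structure, with no capacity limit or replacement policy.

```python
class TaggedTLB:
    """Toy TLB whose entries are keyed by (ASN, virtual page number)."""

    def __init__(self):
        self.entries = {}

    def insert(self, asn, vpn, pfn):
        """Record that virtual page vpn maps to frame pfn in space asn."""
        self.entries[(asn, vpn)] = pfn

    def lookup(self, asn, vpn):
        """Return the physical frame number, or None on a TLB miss."""
        return self.entries.get((asn, vpn))
```

Two address spaces can hold translations for the same virtual page at once without colliding, so nothing has to be flushed when the active ASN changes. In an SMT with one ASN register per thread context, each thread's lookups would simply carry its own tag.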

Another important issue is software’s ability to synchronize threads. The Alpha uses a synchronization mechanism based on the load-locked/store-conditional model [9]. This scheme, commonly used by RISC architectures, uses a software based spin loop to set or wait on a semaphore. In a conventional single or multiprocessor system this works well. But on an SMT a spin loop is horrendously wasteful of processing resources. To solve this problem Compaq invented a spin loop quiescing feature that allows the TPU associated with a thread executing a spin loop to be put to sleep until the associated semaphore memory location is modified. While asleep the associated thread does not consume any processor resources. This feature adds relatively little extra logic to EV8 because it piggybacks on existing cache coherency mechanisms.
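The load-locked/store-conditional model can be illustrated with a toy memory simulation. This is a sketch of the general idea only: the class, its methods, and the reservation bookkeeping are hypothetical and do not reproduce the exact Alpha semantics.

```python
class LLSCMemory:
    """Toy memory with one load-locked reservation per CPU (or TPU)."""

    def __init__(self):
        self.cells = {}
        self.reservations = {}  # cpu id -> reserved address

    def load_locked(self, cpu, addr):
        """Read addr and remember that this cpu is watching it."""
        self.reservations[cpu] = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        """Write addr and cancel every reservation on that address."""
        self.cells[addr] = value
        self.reservations = {c: a for c, a in self.reservations.items()
                             if a != addr}

    def store_conditional(self, cpu, addr, value):
        """Write only if this cpu's reservation on addr is still intact."""
        if self.reservations.get(cpu) != addr:
            return False
        self.store(addr, value)
        return True


def try_acquire(mem, cpu, lock_addr):
    """One pass of the spin loop body: returns True if the lock was taken."""
    if mem.load_locked(cpu, lock_addr) != 0:
        return False  # semaphore already held; a real loop would retry
    return mem.store_conditional(cpu, lock_addr, 1)
```

A waiting thread would call try_acquire repeatedly until it succeeds, and on an SMT each of those retries occupies issue slots that other threads could have used. The quiescing feature described above avoids this by letting the waiter sleep until a store to the semaphore location occurs.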

Summary

Simultaneous Multithreading technology seems to be a well-matched complement to the modern out-of-order execution superscalar RISC processor. The difficult task of tracking computational results for instructions from separate threads issuing and executing simultaneously is a natural fit with register renaming schemes currently used to work around false register based data dependencies between instructions and support recovery from speculated instruction execution. The problem of selecting instructions from a group of active hardware threads for SMT issue and execution has a relatively simple heuristic solution that provides robust performance over a wide range of workloads with varying degrees of ILP and TLP.

Research to date suggests SMT can approximately double the throughput performance of an 8 instruction-issue wide processor like EV8 for a cost in extra processor complexity equivalent to less than 10% increased die area for the processor core. The multithreading capabilities of an SMT processor can be accessed by software through a virtual CMP model that uses abstracted TPUs in place of multiple physical CPUs. Existing thread synchronization mechanisms can be retained with little impact on SMT processor performance if appropriate measures are taken to ensure threads waiting for a semaphore do not consume a share of execution resources.

In the third and final part of this article I will examine how the performance characteristics of SMT potentially impact EV8’s competitive posture relative to alternative design approaches like EPIC and CMP and the implications for the future of MPU design.

Footnotes

[1] Hennessy, J., 'Processor Design and Other Challenges in the Post-PC Era', Proceedings of Microprocessor Forum 1999, October 5, 1999, Cahners MicroDesign Resources.

[2] Slater, M., 'MicroUnity Lifts Veil on MediaProcessor', Microprocessor Report, Vol. 9, No. 14, October 23, 1995, p. 11.

[3] Gieseke, B., 'A 600 MHz Superscalar RISC Microprocessor with Out-Of-Order Execution', Digest of Technical Papers, ISSCC 1997, February 7, 1997, p. 176.

[4] Farkas, K. et al, 'Register File Design Considerations in Dynamically Scheduled Processors', DECWRL Report, November 1995.

[5] Emer, J., 'Simultaneous Multithreading: Multiplying Alpha Performance', Proceedings of Microprocessor Forum 1999, October 5, 1999, Cahners MicroDesign Resources.

[6] Lo, J. et al, 'Converting Thread-Level Parallelism to Instruction Level Parallelism via Simultaneous Multithreading', ACM Transactions on Computer Systems, Vol. 15, No. 3, August 1997, p. 322.

[7] Tullsen, D. et al, 'Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor', Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.

[8] Diefendorff, K., 'Compaq Chooses SMT for Alpha', Microprocessor Report, Vol. 13, No. 16, December 6, 1999, p. 1.

[9] Sites, R., 'Alpha Architecture Reference Manual', Digital Press, 1992.

Fundamentals of Multithreading: http://www.slcentral.com/articles/01/6/multithreading/
