Thread Level Parallelism
• Since ILP has inherent limitations, can we exploit multithreading?
– here, a thread is defined as a separate process with its own instructions and data
• this is unlike the traditional (OS) definition of a thread, in which threads share instructions but each has its own stack and data (in that case, the threads are multiple versions of the same process)
– a thread may be a traditional thread, a separate process, or a single program executing in parallel
• the idea is that each thread offers different instructions and data, so that when the processor would otherwise stall, it can switch to another thread and continue execution, avoiding time-consuming stalls
– TLP exploits a different kind of parallelism than ILP


Unit III
Introduction to Multithreading

CS2354 Advanced Computer Architecture

Approaches to TLP
• We want to enhance our current processor
– a superscalar with dynamic scheduling
• Fine-grained multi-threading
– switches between threads at each clock cycle
• thus, threads are executed in an interleaved fashion
– as the processor switches from one thread to the next, a thread that is currently stalled is skipped over
– the CPU must be able to switch between threads at every clock cycle, so it needs extra hardware support
• Coarse-grained multi-threading
– switches between threads only when the current thread is likely to stall for some time (e.g., a level 2 cache miss)
– because switches are rare, the switching process can afford to be more time consuming, and it does not need extra hardware support
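The fine-grained policy above (rotate every cycle, skip stalled threads) can be sketched as a toy simulator. Everything here is illustrative: the thread traces, the two-letter instruction encoding ('A' = single-cycle ALU op, 'M' = memory op that stalls its thread), and the stall penalty are all invented for the example.

```python
from collections import deque

def fine_grained_schedule(threads, stall_penalty=2):
    """Toy fine-grained multithreading: issue one instruction per cycle,
    rotating round-robin over threads and skipping any thread that is
    currently stalled. threads maps a name to a list of ops: 'A' is a
    1-cycle ALU op, 'M' is a memory op that makes its thread unavailable
    for stall_penalty cycles after it issues. Returns the per-cycle
    trace of which thread issued (None = every thread was stalled)."""
    pc = {t: 0 for t in threads}              # next instruction per thread
    stalled_until = {t: 0 for t in threads}   # cycle at which thread wakes
    order = deque(threads)                    # round-robin order
    trace, cycle = [], 0
    while any(pc[t] < len(threads[t]) for t in threads):
        issued = None
        for _ in range(len(order)):
            t = order[0]
            order.rotate(-1)                  # advance the rotation
            if pc[t] < len(threads[t]) and cycle >= stalled_until[t]:
                op = threads[t][pc[t]]
                pc[t] += 1
                if op == 'M':                 # miss: thread sits out
                    stalled_until[t] = cycle + 1 + stall_penalty
                issued = t
                break
        trace.append(issued)
        cycle += 1
    return trace

trace = fine_grained_schedule({'T0': ['A', 'M', 'A'], 'T1': ['A', 'A', 'A']})
```

In this run, T0's miss is hidden entirely: while T0 sits out its stall, T1's instructions fill the cycles, so no cycle goes idle.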

Advantages/Disadvantages
• Fine-grained
– Adv: less susceptible to stalling situations
– Adv: throughput costs can be hidden because stalls often go unnoticed
– Disadv: slows down the execution of each individual thread
– Disadv: requires a switching process that costs no cycles; this comes at the expense of more hardware (at a minimum, a PC for every thread)
• Coarse-grained
– Adv: a more natural flow for any given thread
– Adv: an easier-to-implement switching process
– Adv: can be built on current processors, which fine-grained cannot
– Disadv: limited in its ability to overcome the throughput losses of short stalls, because the cost of starting the pipeline on a new thread is expensive (in comparison to fine-grained)

Simultaneous Multi-threading (SMT)
• SMT uses multiple issue and dynamic scheduling on our superscalar architecture but adds multi-threading
– (a) is the traditional approach, with idle slots caused by stalls and a lack of ILP
– (b) and (c) are fine-grained and coarse-grained MT respectively
– (d) shows the potential payoff for SMT
– (e) goes one step further to illustrate multiprocessing

Four Approaches
• Superscalar on a single thread (a)
– we are limited to ILP; alternatively, if we switch threads when one is about to stall, the switch is equivalent to a context switch, which takes many (dozens or hundreds of) cycles
• Superscalar + coarse-grained MT (c)
– fairly easy to implement and a performance increase over no MT support, but still contains empty issue slots due to short stalls (as opposed to the lengthier stalls associated with a cache miss)
• Superscalar + fine-grained MT (b)
– requires switching between threads at each cycle, which needs more complex and expensive hardware, but eliminates most stalls; the remaining problem is that a thread that lacks ILP, or cannot fill all of the issue slots, will not take full advantage of the hardware
• Superscalar + SMT (d)
– the most efficient use of hardware and multithreading, keeping as many functional units as possible occupied
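The (a)-(d) comparison can be pictured as issue-slot grids like the figure on the previous slide. In the sketch below, each string is one cycle of a 4-wide superscalar, a letter names the thread owning a slot, and '.' marks an idle slot; the occupancy patterns are invented purely to make the utilization ordering concrete.

```python
# Illustrative issue-slot grids (rows = cycles, columns = issue slots).
single = ["AA..", "A...", "....", "AA.."]  # (a) stalls plus limited ILP
fine   = ["AA..", "BB..", "AAA.", "BB.."]  # (b) one whole cycle per thread
coarse = ["AA..", "A...", "BB..", "B..."]  # (c) switch only on a long stall
smt    = ["AABB", "ABBB", "AAAB", "ABB."]  # (d) slots shared within a cycle

def utilization(grid):
    """Fraction of issue slots that hold an instruction."""
    slots = sum(len(row) for row in grid)
    busy = sum(ch != '.' for row in grid for ch in row)
    return busy / slots
```

Even in these made-up grids, only SMT can fill slots that a single thread's ILP leaves empty within a cycle, which is why its utilization comes out highest.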

Superscalar Limitations for SMT
• In spite of the performance increase from combining our superscalar hardware with SMT, there are still inherent limitations
– how many active threads can be considered at one time?
• we are limited by resources such as the number of PCs available to keep track of each thread, the bus width needed when multiple threads fetch instructions at the same time, how many threads can be stored in main memory, etc.
– finite limits on the buffers used to support the superscalar
• reorder buffer, instruction queue, issue buffer
– limits on the bandwidth between the CPU and cache/memory
– limits on the combination of instructions that can be issued at the same time
• consider four threads, each of which contains an abnormally large number of FP multiplies but no FP adds; the multiplier functional unit(s) will be very busy while the adder remains idle
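The instruction-mix limitation in the last bullet can be made concrete with a toy issue cycle. The unit counts and the pooled instruction mix are invented to match the four-thread example:

```python
# Instructions ready this cycle, pooled across four FP-multiply-heavy
# threads, versus the functional units available (numbers are invented).
ready = {'fp_mul': 4, 'fp_add': 0, 'int': 2}
units = {'fp_mul': 1, 'fp_add': 1, 'int': 2}

# each unit class can accept at most as many instructions as it has units
issued = {op: min(ready[op], units[op]) for op in units}
idle_units = {op: units[op] - issued[op] for op in units}
```

The multiplier saturates (three multiplies must wait) while the adder sits idle; adding more threads with the same mix would not help, because the bottleneck is the unit mix, not the thread count.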

SMT Design Challenges
• Superscalars perform best with lengthier pipelines
• We will implement SMT only on top of fine-grained MT, so we need
– a large register file to accommodate multiple threads
– a per-thread renaming table and more registers for renaming
– separate PCs for each thread
– the ability to commit instructions of multiple threads in the same cycle
– added logic that does not require an increase in clock cycle time
– cache and TLB setups that can handle simultaneous thread access without a degradation in their performance (miss rate, hit time)
• In spite of the design challenges, we will find
– performance on each individual thread decreases (this is natural, since every thread is interrupted as the CPU switches to other threads, cycle by cycle)
• One alternative strategy is to have a "preferred" thread whose instructions are issued every cycle when possible
– the issue slots it does not use are filled by alternate threads
– if the preferred thread reaches a substantial stall, other threads fill in until the stall ends
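The preferred-thread issue policy above can be sketched in a few lines. The function and the instruction names are hypothetical; each thread is modeled simply as a list of instructions that are ready this cycle.

```python
def issue_cycle(preferred, others, width=4):
    """One cycle of the preferred-thread policy: fill up to `width` issue
    slots from the preferred thread first, then let the other threads
    claim whatever slots remain, in order. If the preferred thread is
    stalled (empty list), the other threads supply everything."""
    slots = list(preferred[:width])
    for ready in others:
        if len(slots) >= width:
            break
        slots.extend(ready[:width - len(slots)])
    return slots
```

For example, a preferred thread with two ready instructions leaves two slots for the alternates, and a fully stalled preferred thread leaves all four, which is exactly how the policy hides its stalls.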

SMT Example Design
• The IBM Power5 was built on top of the Power4 pipeline
– but in this case, the Power5 implements SMT
• simple design choices were made wherever possible
• increase the associativity of the L1 instruction cache and the TLB, to offset the impact of multithreaded access to the cache and TLB
• add per-thread load/store queues
• increase the size of the L2 and L3 caches, to permit more threads to be represented in these caches
• add separate instruction prefetch and buffering hardware
• increase the number of virtual registers for renaming
• increase the size of the instruction issue queues
– the cost of these enhancements is not extreme (although they do take up more space on the chip); are the performance payoffs worthwhile?

Performance Improvement of SMT
• As it turns out, the gains of SMT over a single-threaded processor are only modest
– in part this is because multi-issue processors have not increased their issue width over the past few years; to best take advantage of SMT, issue width should increase from, say, 4 to 8 or more, but this is not practical
• The Pentium IV Extreme showed improvements of
– 1.01 and 1.07 on the SPEC int and SPEC FP benchmarks respectively over the Pentium IV (Extreme = Pentium IV + SMT support)
• When running 2 SPEC benchmarks at the same time in SMT mode, improvements ranged from
– 0.90 to 1.58, with an average improvement of 1.20
• Conclusions
– SMT has benefits, but the costs do not necessarily pay for the improvement
– another option: use multiple CPU cores on a single processor (see (e) in the figure on slide 4)
– another factor discussed in the text (but skipped here) is the increasing demand on power consumption as we continue to add support for ILP/TLP/SMT
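The improvement figures above are simple ratios. A sketch of the arithmetic, where the run times are invented and only the resulting 1.20 ratio matches the slide's reported average:

```python
def speedup(t_baseline, t_smt):
    """Speedup of SMT over the baseline for a fixed amount of work:
    the ratio of baseline run time to SMT run time. Values below 1.0
    (like the 0.90 low end on the previous slide) mean SMT mode was
    actually slower for that benchmark pair."""
    return t_baseline / t_smt

# e.g. a benchmark pair that takes 120s run back-to-back on one thread
# but finishes in 100s when co-scheduled in SMT mode
pair_speedup = speedup(120, 100)
```

The same ratio applied to the slide's range means the worst pairs ran about 10% slower under SMT while the best ran nearly 60% faster.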

Advanced Multi-Issue Processors
• Here, we wrap up chapter 3 with a brief comparison of multi-issue superscalar processors

Processor         | Architecture                                                | Fetch/Issue/Execute | Functional Units | Clock Rate (GHz)
Pentium 4 Extreme | speculative, dynamically scheduled, deeply pipelined, SMT   | 3/3/4               | 7 int, 1 FP      | 3.8
AMD Athlon 64     | speculative, dynamically scheduled                          | 3/3/4               | 6 int, 3 FP      | 2.8
IBM Power 5       | speculative, dynamically scheduled, SMT, 2 CPU cores/chip   | 8/4/8               | 6 int, 2 FP      | 1.9
Itanium 2         | EPIC style (see appendix G), primarily statically scheduled | 6/5/11              | 9 int, 2 FP      | 1.6

Comparison on Integer Benchmarks

Comparison on FP Benchmarks