Spring 2003 CSE 548P 1
Motivation for Multithreaded Architectures
Processors not executing code at their hardware potential
• late 70’s: performance lost to memory latency
• 90’s: performance not in line with the increasingly complex parallel hardware
• instruction issue bandwidth is increasing
• number of functional units is increasing
• instructions can execute out-of-order & complete in-order
• still, processor utilization was decreasing & instruction throughput was not increasing in proportion to the issue width
Major cause is the lack of instruction-level parallelism in a single executing thread
Therefore the solution has to be more general than building a smarter cache or a more accurate branch predictor
Multithreaded Processors
Multithreaded processors can increase the pool of independent instructions & consequently address multiple causes of processor stalling
• holds processor state for more than one thread of execution
• registers
• PC
• each thread’s state is a hardware context
• execute the instruction stream from multiple threads without software context switching
• utilize thread-level parallelism (TLP) to compensate for a lack in ILP
Multithreading
Traditional multithreaded processors hardware-switch to a different context to avoid processor stalls
3 styles
1. coarse-grain multithreading
• switch on a long-latency operation (e.g., L2 cache miss)
• another thread executes while the miss is handled
• modest increase in instruction throughput with potentially no slowdown to the thread with the miss
• doesn’t hide latency of short-latency operations
• doesn’t add much to an already long stall
• need to fill the pipeline on a switch
• no switch if no long-latency operations
• HEP, IBM RS64 III
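The switch-on-miss behavior above can be sketched as a toy simulation; the event encoding and function name are hypothetical, not from any real machine:

```python
# Coarse-grain multithreading sketch: run one thread until it hits a
# long-latency event (e.g., an L2 miss), then hardware-switch to the
# next context so another thread executes while the miss is handled.

def run_coarse_grain(threads, cycles):
    """threads: per-thread lists of 'op' / 'miss' events, one per cycle."""
    streams = [iter(t) for t in threads]
    current, executed = 0, 0
    for _ in range(cycles):
        try:
            event = next(streams[current])
        except StopIteration:
            continue                # current thread finished; idle cycle
        executed += 1
        if event == "miss":         # long-latency operation triggers a
            current = (current + 1) % len(streams)  # context switch
    return executed

# Thread 1 keeps the pipeline busy while thread 0's miss is outstanding.
done = run_coarse_grain([["op", "miss", "op"], ["op", "op", "op"]], cycles=6)
```

Note that this toy model only switches on a miss, matching the “no switch if no long-latency operations” point: a thread that never misses is never preempted.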
Traditional Multithreading
3 styles
2. fine-grain multithreading
• can switch to a different thread each cycle (usually round robin)
• hides latencies of all kinds
• larger increase in instruction throughput but slows down the execution of each thread
• Cray (Tera) MTA
Comparison of Issue Capabilities
Multithreading
3 styles
3. simultaneous multithreading (SMT)
• issues multiple instructions from multiple threads each cycle
• no hardware context switching
• huge boost in instruction throughput with less degradation to individual threads
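Per-cycle SMT issue can be sketched as follows (a toy model; the function name and round-robin slot filling are assumptions for illustration):

```python
# SMT issue sketch: in one cycle, up to `width` instructions issue,
# drawn from any thread's ready queue -- no context switch is needed.

def smt_issue_cycle(queues, width):
    """queues: per-thread lists of ready instructions (mutated in place)."""
    issued, tid = [], 0
    while len(issued) < width and any(queues):
        if queues[tid]:                        # take one instruction from
            issued.append(queues[tid].pop(0))  # whichever thread has work
        tid = (tid + 1) % len(queues)
    return issued

# A 4-wide machine fills all its issue slots from two threads at once.
ops = smt_issue_cycle([["a1", "a2"], ["b1", "b2", "b3"]], width=4)
```

When one thread lacks ILP, the other threads’ instructions fill the otherwise-empty slots, which is how TLP gets converted into issue bandwidth.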
Comparison of Issue Capabilities
Cray (Tera) MTA
Fine-grain multithreaded processor
• can switch to a different thread each cycle
• switches to ready threads only
• allows execution to remain with the current thread for a specific number of cycles (discussed under compiler support)
• up to 128 hardware contexts
• lots of latency to hide, mostly from the multi-hop interconnection network
• average instruction latency is 70 cycles (at one point) (i.e., 70 instruction streams needed to hide all latency, on average)
• processor state for all 128 contexts
• GPRs (total of 4K registers!)
• status registers (includes the PC)
• branch target registers
Cray (Tera) MTA
Interesting features
• no data caches
• to avoid having to keep caches coherent (topic of the next lecture)
• increases the latency for data accesses but reduces the variation
• L1 & L2 instruction caches
• instruction accesses are more predictable & have no coherency problem
• prefetch straight-line & target code
Cray (Tera) MTA
Interesting features
• trade-off between avoiding memory bank conflicts & exploiting spatial locality
• memory distributed among processors
• memory addresses are randomized to avoid conflicts
• want to fully utilize all memory bandwidth
• run-time system can confine consecutive virtual addresses to a single (close-by) memory unit
• reduces latency (used mainly for instructions)
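Address randomization of this kind can be sketched with a toy bank hash; the XOR-fold below is purely illustrative, not the MTA’s actual scrambling function:

```python
# Bank-randomization sketch: hash each address so that unit-stride and
# power-of-two-stride accesses spread across banks instead of piling
# onto one memory unit.

NUM_BANKS = 8

def bank_of(addr):
    # fold higher address bits into the bank index (illustrative hash)
    return (addr ^ (addr >> 3) ^ (addr >> 6)) % NUM_BANKS

# Eight consecutive word addresses map to eight distinct banks.
banks = [bank_of(a) for a in range(8)]
```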
Cray (Tera) MTA
Interesting features
• tagged memory for synchronization, pointer-chasing, trap handling
• indirectly set full/empty bits to prevent data races
• prevents a consumer/producer from loading/overwriting a value before a producer/consumer has written/read it
• set to empty when the producer instruction starts
• consumer instructions block if they try to read the producer’s value
• set to full when the producer value is written
• consumers can now read a valid value
• explicitly set full/empty bits for synchronization
• primarily used to synchronize threads that are accessing shared data (topic of the next lecture)
• lock: read memory location & set to empty
• other readers are blocked
• unlock: write & set to full
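The full/empty discipline can be sketched as a toy tagged memory word (class and method names are hypothetical; a real machine blocks the thread rather than raising an error):

```python
# Full/empty-bit sketch: every memory word carries a state bit.
# A synchronized read requires "full" and leaves the word empty;
# a synchronized write requires "empty" and sets it full.

class TaggedWord:
    def __init__(self):
        self.full, self.value = False, None

    def sync_write(self, v):
        if self.full:               # producer would block until empty
            raise BlockingIOError("word full: writer blocks")
        self.value, self.full = v, True

    def sync_read(self):
        if not self.full:           # consumer would block until full
            raise BlockingIOError("word empty: reader blocks")
        self.full = False           # read empties the word
        return self.value

w = TaggedWord()
w.sync_write(42)        # producer stores the value & sets full
v = w.sync_read()       # consumer gets 42; a second read would block
```

The lock/unlock idiom on the slide is exactly this pair: the synchronized read acquires (sets empty, blocking other readers) and the synchronized write releases (sets full).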
Cray (Tera) MTA
Interesting features
• virtual memory system
• no paging: want pages pinned down in memory
• page size is 256MB
• user-mode trap handlers
• fatal exceptions, normalizing floating point numbers
Cray (Tera) MTA
Compiler support
• each instruction is a VLIW instruction
• memory/arithmetic/branch
• load/store architecture
• need a good code scheduler
• explicit dependence look-ahead
• field in a memory instruction that specifies the number of independent LIW instructions that follow
• deviation from fine-grain multithreading
• 8 memory operations can be outstanding
• handling branches
• instruction to store a branch target in a register before the branch is executed
• can start prefetching the target code
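The explicit dependence look-ahead field can be sketched as follows (a toy single-outstanding-load model; the encoding and function name are hypothetical):

```python
# Look-ahead sketch: a memory instruction carries a small field saying
# how many following instructions are independent of it, so the hardware
# can keep issuing from the same stream while the access is in flight.

def issue_until_stall(stream):
    """stream: list of (op, lookahead) pairs; returns ops issued before
    the first memory operation forces a stall."""
    issued, budget = [], None
    for op, lookahead in stream:
        if budget is not None:
            if budget == 0:
                break               # dependent instruction: must wait
            budget -= 1
        issued.append(op)
        if op == "load":
            budget = lookahead      # compiler-guaranteed independent ops
    return issued

# A look-ahead of 2 lets two more instructions issue past the load.
ops = issue_until_stall([("load", 2), ("add", 0), ("mul", 0), ("sub", 0)])
```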
Cray (Tera) MTA
Run-time support
• thread manipulation
• protection domains: group of threads executing in the same virtual address space
• OS sets the maximum number of thread contexts (instruction streams) a domain is allowed
• domain can create & kill threads within that limit, depending on its need for them
• OS bases resource allocations on resource usage
SMT: The Executive Summary
Simultaneous multithreaded (SMT) processors combine designs from:
• superscalar processors
• traditional multithreaded processors
The combination:
• SMT issues & executes instructions from multiple threads simultaneously
=> converting TLP to ILP
• threads share almost all hardware resources
Performance Implications
Multiprogramming workload
• 2.5X on SPEC95, 4X on SPEC2000
Parallel programs
• ~.7 on SPLASH2
Commercial databases
• 2-3X on TPC-B; 1.5X on TPC-D
Web servers & OS
• 4X on Apache and Digital Unix
Does this Processor Sound Familiar?
Huge performance boost + straightforward implementation =>
• 2-context Intel Hyperthreading
• 2-context Sun UltraSPARC on 4-processor CMP
• 4-context Compaq 21464
• network processor & mobile device start-ups
• others in the wings
An SMT Architecture
Three primary goals for this architecture:
1. Achieve significant throughput gains with multiple threads
2. Minimize the performance impact on a single thread executing alone
3. Minimize the architectural impact on a conventional out-of-order superscalar design
Implementing SMT
Can use most hardware on current out-of-order processors as-is

Out-of-order renaming & instruction scheduling mechanisms
• physical register pool model
• renaming hardware eliminates false dependences both within a thread (just like a superscalar) & between threads
• map thread-specific architectural registers onto a pool of thread-independent physical registers
• operands are thereafter called by their physical names
• an instruction is issued when its operands become available & a functional unit is free
• instruction scheduler does not consider thread IDs when dispatching instructions to functional units (unless threads have different priorities)
From Superscalar to SMT
Extra pipeline stages for accessing thread-shared register files
• 8 threads * 32 registers + renaming registers
SMT instruction fetcher (ICOUNT)
• fetch from 2 threads each cycle
• count the number of instructions for each thread in the pre-execution stages
• pick the 2 threads with the lowest number
• in essence fetching from the two highest throughput threads
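ICOUNT’s thread choice can be sketched in a few lines (the function name is hypothetical):

```python
# ICOUNT sketch: each cycle, fetch from the threads with the fewest
# instructions in the pre-execution pipeline stages -- i.e., the
# threads that are moving instructions through the machine fastest.

def icount_pick(inflight_counts, n=2):
    """inflight_counts[tid] = instructions in pre-execution stages."""
    order = sorted(range(len(inflight_counts)),
                   key=lambda tid: inflight_counts[tid])
    return order[:n]            # n threads with the lowest counts

# Threads 1 and 3 have the fewest queued instructions, so they fetch.
chosen = icount_pick([9, 2, 7, 4])
```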
From Superscalar to SMT
Per-thread hardware
• small stuff
• all part of current out-of-order processors
• none endangers the cycle time
• other per-thread processor state, e.g.,
• program counters
• return stacks
• thread identifiers, e.g., with BTB entries, TLB entries
• per-thread bookkeeping for
• instruction retirement
• trapping
• instruction queue flush
This is why there is only a 10% increase in Alpha 21464 chip area.
Implementing SMT
Thread-shared hardware:
• fetch buffers
• branch prediction structures
• instruction queues
• functional units
• active list
• all caches & TLBs
• MSHRs
• store buffers
This is why there is little single-thread performance degradation (~1.5%).
Architecture Research
Concept & potential of Simultaneous Multithreading: ISCA ’95 & ISCA 25th Anniversary Anthology
Designing the microarchitecture: ISCA ’96
• straightforward extension of out-of-order superscalars
I-fetch thread chooser: ISCA ’96
• 40% faster than round-robin
The lockbox for cheap synchronization: HPCA ’98
• orders of magnitude faster
• can parallelize previously unparallelizable codes
Architecture Research
Software-directed register deallocation: TPDS ’99
• large register-file performance w. small register file
Mini-threads: HPCA ’03
• large SMT performance w. small SMTs
SMT instruction speculation: TOCS ‘03
• a good thread mix is the most important performance factor
Compiler Research
Tuning compiler optimizations for SMT: Micro ‘97 & IJPP ’99
• data decomposition: use cyclic iteration scheduling
• tiling: use cyclic tiling; no tile size sweet spot
• software speculation: generates too much overhead code on integer programs
Communicate last-use info to HW for early register deallocation: TPDS ’99
• now need ¼ the renaming registers
Compiling for fewer registers/thread: HPCA ’03
• surprisingly little additional spill code (avg. 3%)
OS Research
Analysis of OS behavior on SMT: ASPLOS ‘00
• Kernel-kernel conflicts in I$ & D$ & branch mispredictions ameliorated by SMT instruction issue + thread-sharing in HW
OS/runtime support for mini-threads: HPCA ’03
• dedicated server: recompile for fewer registers
• multiprogrammed environment: multiple versions
OS/runtime support for executing threaded programs: ISCA ’98 & PPoPP ’03
• dynamic memory allocation, synchronization, stack offsetting, page mapping
Others are Now Carrying the Ball
Fault detection & recovery
Thread-level speculation
Instruction & data prefetching
Instruction issue hardware design
Thread scheduling
Single-thread execution
Profiling executing threads
SMT-CMP hybrids
SMT Collaborators
UW
Hank Levy
Steve Gribble
Dean Tullsen (UCSD)
Jack Lo (Transmeta)
Sujay Parekh (IBM Yorktown)
Brian Dewey (Microsoft)
Manu Thambi (Microsoft)
Josh Redstone (interviewing)
Mike Swift
Luke McDowell
Steve Swanson
Aaron Eakin (HP)
Dimitriy Portnov (Google)
DEC/Compaq
Joel Emer (now Intel)
Rebecca Stamm
Luiz Barroso (now Google)
Kourosh Gharachorloo (now Google)
For more info on SMT:
http://www.cs.washington.edu/research/smt
SMT Performance

[Chart: IPC (0–5) vs. number of contexts (1, 2, 4, 8) for Apache]

SMT improves throughput by converting TLP from multiple contexts into ILP

Performance increases with more contexts
A Problem with Large SMTs
Register file will not scale in future technologies
• e.g., 4-context Alpha 21464:
• 3 cycles to access
• 4 times the size of the 64KB instruction cache
First SMT implementations will be small
• limited to 2 hardware contexts
• expect lower throughput gains
Goal of mini-threads: achieve performance gains of larger SMTs without the register cost
Mini-threads: the Concept
Execute more than one thread, called a mini-thread, within each context
Mini-threads in a context share the context’s architectural register set
SMT Processor Model
[Diagram: Context 1 and Context 2 each have their own PC and their own architectural register set, R0 – R31]
Mini-thread Processor Model
[Diagram: Context 1 and Context 2 each now have two PCs, one per mini-thread; both mini-threads in a context share its architectural register set, R0 – R31]
Mini-thread Processor Model
[Diagram: within each context, the architectural register set is partitioned among the mini-threads as R0 – R13, R14 – R17, and R18 – R31]
Mini-thread Processor Model
[Diagram: within each context, each of the two mini-threads gets half the architectural register set: R0 – R15 and R16 – R31]
Mini-threads Executing on SMT
The HW side: Add extra PCs to each hardware context
• mini-threads execute using their own PC
• mini-threads share the context’s architectural register set
Mini-threads Executing on SMT
The SW side: applications can choose to trade-off registers for TLP
• ignore the PCs and the machine is SMT
• use the extra PCs for higher thread-level parallelism
• an application is responsible for coordinating register usage across mini-threads within its context
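Since the hardware enforces nothing, the register split is purely a software contract; a toy validity check (hypothetical helper) makes the point:

```python
# Register-partition sketch: mini-threads sharing one context must agree
# on disjoint architectural-register ranges by convention; the hardware
# does not police overlapping usage.

def partition_ok(assignments, num_regs=32):
    """assignments: one set of register indices per mini-thread."""
    used = set()
    for regs in assignments:
        if used & regs:             # two mini-threads claim a register
            return False
        if any(r >= num_regs for r in regs):
            return False            # register index out of range
        used |= regs
    return True

even_split = [set(range(0, 16)), set(range(16, 32))]   # R0-R15 / R16-R31
clashing = [set(range(0, 20)), set(range(16, 32))]     # overlap: R16-R19
```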
The Register/TLP Trade-off
+ Throughput benefit of extra mini-threads (more TLP)
-- Instruction cost of fewer registers per mini-thread (spill code)
+/-- Spill code impact on throughput
+/-- Instruction impact of extra mini-threads
Two Mini-threads/Context
[Chart: % increase in IPC (-10 to 90) for Apache, Barnes, Fmm, Raytrace, and Water-spatial]

Workload Average:
• 1 Context: 63%
Two Mini-threads/Context
[Chart: % increase in IPC (-10 to 90) for Apache, Barnes, Fmm, Raytrace, and Water-spatial with 1, 2, 4 & 8 contexts]

Workload Average:
• 1 Context: 63%
• 2 Contexts: 40%
• 4 Contexts: 26%
• 8 Contexts: 9%
16 Registers per Mini-thread
[Chart: % change in dynamic instructions (-25 to 20) for Apache, Barnes, Fmm, Raytrace, and Water-spatial]

• Only 3% on average!
• IPC cost can be greater
16 Registers per Mini-thread
[Chart: % change in dynamic instructions (-25 to 20) for Apache, Barnes, Fmm, Raytrace, and Water-spatial]

• 7% savings in spill code
16 Registers per Mini-thread
[Chart: % change in dynamic instructions (-25 to 20) for Apache, Barnes, Fmm, Raytrace, and Water-spatial]

• Kernel: .8%
Bottom Line for 2 Mini-threads on SMT

[Chart: % speedup (-40 to 100) for Apache, Barnes, Fmm, Raytrace, and Water-spatial with 1, 2, 4 & 8 contexts]

Workload Average:
• 1 Context: 60%
• 2 Contexts: 38%
• 4 Contexts: 20%
• 8 Contexts: -2%
Bottom Line for 2 Mini-threads on SMT

[Chart: % speedup (-40 to 100) for Apache, Barnes, Fmm, Raytrace, and Water-spatial with 1, 2, 4 & 8 contexts]

Workload Average, if free to choose:
• 1 Context: 60%
• 2 Contexts: 38%
• 4 Contexts: 22%
• 8 Contexts: 6%
Bottom Line for 2 Mini-threads on SMT

[Chart: % speedup (-40 to 100) for Apache, Barnes, Fmm, Raytrace, and Water-spatial with 1, 2, 4 & 8 contexts; peak bars labeled 83, 66, 43 & 10]

Workload Average:
• 1 Context: 60%
• 2 Contexts: 38%
• 4 Contexts: 20%
• 8 Contexts: -2%
Operating System & Runtime Support
The issues:
• Runtime and/or OS support for managing mini-threads
• Compatible register usage between mini-threads & OS/runtime
Operating System & Runtime Support
1. Dedicated server, e.g., web server
• Goal: maximum OS performance
• Solution:
• OS/runtime recompiled for half the register set
• each mini-thread uses different registers
• hardware partition bit steers register access to high or low set
+ Both mini-threads in a context can execute in the OS at the same time
-- Only 1 partition allowed & no sharing of values
Operating System & Runtime Support
2. Multiprogrammed environment
• Goal: maximum flexibility to use registers
• Solution:
• standard OS compile, multiple versions of the runtime
• runtime schedules mini-threads
• hardware blocks 1 mini-thread when the other enters the OS
+ No restrictions on register usage and sharing among mini-threads
-- Lower throughput because mini-threads are blocked
Conclusion for 2 Mini-threads on SMT
Technique for obtaining large-scale SMT throughput with a small-scale SMT implementation
Mini-threads bypass the register scaling problem on SMT
• average 38% boost on 2-context SMT, and 20% on 4-context SMT, -2% on 8-context SMT
Performance increases when applications choose to use mini-threads when it pays (38%/22%/6%)
Reducing registers has low impact!
• 3% average increase in instructions
A Problem or an Opportunity?
Database workload
+ low throughput (0.8 IPC on an 8-wide superscalar)
+ naturally threaded (and widely used) application
- already high cache miss rates on a single-threaded machine
Key question:
Can SMT’s simultaneous multi-thread instruction issue hide the latencies from the additional misses in this already memory-intensive database workload?
Database Bottom Line
SMT gets good speedups on commercial database workloads
• 3X for debit/credit systems, 1.7X for big search problems

The reasons:
• Limits inter-thread data interference in the caches to superscalar levels by:
• a virtual-to-physical page mapping policy that exploits temporal locality in the physically-addressed L2 cache
• address offsetting for thread-private data in the L1 data cache
• Exploits SMT’s inter-thread instruction sharing (35% reduction in instruction cache miss rates)
• Hides latencies of all kinds with simultaneous, multi-thread instruction issue
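The address-offsetting idea can be sketched as follows; the cache geometry and base address are illustrative assumptions, not the studied machine’s parameters:

```python
# Offsetting sketch: give each thread's private data (e.g., its stack)
# a different starting offset, so identical layouts across threads map
# to different L1 cache sets instead of conflicting in the same sets.

CACHE_SETS, LINE_BYTES = 512, 64

def cache_set(addr):
    return (addr // LINE_BYTES) % CACHE_SETS

def stack_base(tid, stride=LINE_BYTES * 4):
    return 0x40000000 + tid * stride    # per-thread offset (hypothetical)

# Without offsetting, every thread's frame at the same stack depth would
# hit the same set; with it, the threads spread across sets.
sets = [cache_set(stack_base(tid)) for tid in range(4)]
```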
SMT Register Deallocation Strategies
(1) Compiler/architecture support to free registers after their last use
• small SMT register files achieve large register file performance

Use a small register file if the register access sets the cycle time.

(2) OS support to free registers in idle contexts:
• enables SMT registers to be shared by all contexts (traditional multithreaded processors have separate contexts per thread)
• increases performance with a small register file by:
• up to 50% with 8 threads
• 3 times with fewer threads

Design a good throughput CPU (i.e., SMT) without penalizing the performance of a single thread or a few threads
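Strategy (1) can be sketched as follows (the instruction encoding and names are hypothetical):

```python
# Free-on-last-use sketch: the compiler flags an operand's final read,
# and the renamer returns that physical register to the shared free pool
# immediately, instead of waiting for the architectural register to be
# redefined by a later instruction.

def rename_with_dealloc(phys_map, free_list, inst):
    """inst: (dest, srcs, last_use_srcs) -- a hypothetical encoding."""
    dest, srcs, last_uses = inst
    for s in srcs:
        if s in last_uses:                      # compiler-flagged last use
            free_list.append(phys_map.pop(s))   # free its physical register
    phys_map[dest] = free_list.pop(0)           # allocate the destination
    return phys_map[dest]

phys_map, free_list = {"r1": 7, "r2": 3}, [0, 1]
# add r3 <- r1, r2   (r2 flagged as a last use)
p = rename_with_dealloc(phys_map, free_list, ("r3", ["r1", "r2"], {"r2"}))
```

Early deallocation is what lets a small physical register file behave like a large one: registers spend less time allocated-but-dead.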
Deallocating Registers After Last Use
Putting This in Context