Spring 2003 CSE 548P 1

Motivation for Multithreaded Architectures

Processors not executing code at their hardware potential

• late 70’s: performance lost to memory latency

• 90’s: performance not in line with the increasingly complex parallel hardware

• instruction issue bandwidth is increasing

• number of functional units is increasing

• instructions can execute out-of-order & complete in-order

• still, processor utilization was decreasing & instruction throughput not increasing in proportion to the issue width

Major cause is the lack of instruction-level parallelism in a single executing thread

Therefore the solution has to be more general than building a smarter cache or a more accurate branch predictor

Multithreaded Processors

Multithreaded processors can increase the pool of independent instructions & consequently address multiple causes of processor stalling

• hold processor state for more than one thread of execution

• registers

• PC

• each thread’s state is a hardware context

• execute the instruction stream from multiple threads without software context switching

• utilize thread-level parallelism (TLP) to compensate for a lack of ILP

Multithreading

Traditional multithreaded processors switch in hardware to a different context to avoid processor stalls

3 styles

1. coarse-grain multithreading

• switch on a long-latency operation (e.g., L2 cache miss)

• another thread executes while the miss is handled

• modest increase in instruction throughput with potentially no slowdown to the thread with the miss

• doesn’t hide latency of short-latency operations

• doesn’t add much to an already long stall

• need to fill the pipeline on a switch

• no switch if no long-latency operations

• HEP, IBM RS64 III

Traditional Multithreading

3 styles

2. fine-grain multithreading

• can switch to a different thread each cycle (usually round robin)

• hides latencies of all kinds

• larger increase in instruction throughput but slows down the execution of each thread

• Cray (Tera) MTA

Comparison of Issue Capabilities

Multithreading

3 styles

3. simultaneous multithreading (SMT)

• issues multiple instructions from multiple threads each cycle

• no hardware context switching

• huge boost in instruction throughput with less degradation to individual threads

Comparison of Issue Capabilities

Cray (Tera) MTA

Fine-grain multithreaded processor

• can switch to a different thread each cycle

• switches to ready threads only

• allows execution to remain with the current thread for a specific number of cycles (discussed under compiler support)

• up to 128 hardware contexts

• lots of latency to hide, mostly from the multi-hop interconnection network

• average instruction latency is 70 cycles (at one point) (i.e., 70 instruction streams needed to hide all latency, on average)

• processor state for all 128 contexts

• GPRs (total of 4K registers!)

• status registers (includes the PC)

• branch target registers
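The arithmetic behind "70 instruction streams needed to hide all latency" can be sketched with a back-of-the-envelope model. This is an illustration only, not the MTA's actual issue logic: it assumes each stream has one instruction in flight at a time and that the processor issues at most one instruction per cycle. The function names are hypothetical.

```python
def utilization(ready_streams: int, avg_latency_cycles: int) -> float:
    """Fraction of cycles in which some stream can issue.

    Assumption: a stream issues one instruction, then waits
    avg_latency_cycles before it can issue its next one.
    """
    if ready_streams <= 0 or avg_latency_cycles <= 0:
        raise ValueError("both arguments must be positive")
    return min(1.0, ready_streams / avg_latency_cycles)


def gprs_per_context(total_gprs: int = 4096, contexts: int = 128) -> int:
    """Registers per hardware context implied by the slide's totals."""
    return total_gprs // contexts


if __name__ == "__main__":
    for streams in (1, 35, 70, 128):
        print(streams, "streams ->", utilization(streams, 70))
    print("GPRs per context:", gprs_per_context())
```

Under these assumptions, fewer than 70 streams leave issue slots empty, and contexts beyond 70 add headroom for latencies above the average.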

Cray (Tera) MTA

Interesting features

• no data caches

• to avoid having to keep caches coherent (topic of the next lecture)

• increases the latency for data accesses but reduces the variation

• L1 & L2 instruction caches

• instruction accesses are more predictable & have no coherency problem

• prefetch straight-line & target code

Cray (Tera) MTA

Interesting features

• trade-off between avoiding memory bank conflicts & exploiting spatial locality

• memory distributed among processors

• memory addresses are randomized to avoid conflicts

• want to fully utilize all memory bandwidth

• run-time system can confine consecutive virtual addresses to a single (close-by) memory unit

• reduces latency (used mainly for instructions)

Cray (Tera) MTA

Interesting features

• tagged memory for synchronization, pointer-chasing, trap handling

• indirectly set full/empty bits to prevent data races

• prevents a consumer/producer from loading/overwriting a value before a producer/consumer has written/read it

• set to empty when producer instruction starts

• consumer instructions block if they try to read the producer value

• set to full when producer value is written

• consumers can now read a valid value

• explicitly set full/empty bits for synchronization

• primarily used to synchronize threads that are accessing shared data (topic of the next lecture)

• lock: read memory location & set to empty

• other readers are blocked

• unlock: write & set to full
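The full/empty protocol above can be emulated in software to make the blocking behavior concrete. The following is a minimal sketch using a condition variable. The class and method names are hypothetical, and the real MTA implements this with a tag bit on each memory word rather than with a software lock.

```python
import threading


class FullEmptyWord:
    """One memory word with a full/empty tag bit (software emulation)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False      # word starts out empty
        self._value = None

    def write_full(self, value):
        """Producer write, or unlock: store the value and set the bit to full."""
        with self._cond:
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_full(self):
        """Consumer read: block until the bit is full, leave it full."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            return self._value

    def read_and_empty(self):
        """Lock: block until full, read the value, set the bit to empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            return self._value


if __name__ == "__main__":
    counter = FullEmptyWord()
    counter.write_full(0)

    def bump():
        for _ in range(1000):
            # read_and_empty blocks other readers until write_full refills the word
            counter.write_full(counter.read_and_empty() + 1)

    threads = [threading.Thread(target=bump) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter.read_full())
```

The lock/unlock pair needs no separate lock variable: the data word itself carries the synchronization state.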

Cray (Tera) MTA

Interesting features

• virtual memory system

• no paging: want pages pinned down in memory

• page size is 256MB

• user-mode trap handlers

• fatal exceptions, normalizing floating point numbers

Cray (Tera) MTA

Compiler support

• each instruction is a VLIW instruction

• memory/arithmetic/branch

• load/store architecture

• need a good code scheduler

• explicit dependence look-ahead

• field in a memory instruction that specifies the number of independent LIW instructions that follow

• deviation from fine-grain multithreading

• 8 memory operations can be outstanding

• handling branches

• instruction to store a branch target in a register before the branch is executed

• can start prefetching the target code
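How a compiler might fill in the look-ahead field can be sketched as follows. This is a simplified illustration, not the Tera compiler's algorithm: it treats each instruction as a hypothetical `(dest, sources)` pair, counts the following instructions that neither read nor overwrite the destination of the memory instruction, and applies an optional cap that stands in for the width of the field, which is not given here.

```python
def lookahead(instrs, i, cap=None):
    """Number of instructions after instrs[i] that are independent of it.

    instrs: list of (dest, [srcs]) tuples for a straight-line sequence.
    cap:    optional upper bound, standing in for the field width (assumption).
    """
    dest, _ = instrs[i]
    count = 0
    for later_dest, later_srcs in instrs[i + 1:]:
        if dest in later_srcs or later_dest == dest:
            break               # first instruction that depends on the load
        count += 1
    return count if cap is None else min(count, cap)


if __name__ == "__main__":
    block = [
        ("r1", ["r9"]),   # load r1 <- [r9]
        ("r2", ["r3"]),   # independent of r1
        ("r4", ["r5"]),   # independent of r1
        ("r6", ["r1"]),   # first use of r1
    ]
    print(lookahead(block, 0))
```

A larger look-ahead lets the same thread keep issuing while its memory operation is outstanding, which is the deviation from strict fine-grain switching noted above.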

Cray (Tera) MTA

Run-time support

• thread manipulation

• protection domains: group of threads executing in the same virtual address space

• OS sets the maximum number of thread contexts (instruction streams) a domain is allowed

• domain can create & kill threads within that limit, depending on its need for them

• OS bases resource allocations on resource usage

SMT: The Executive Summary

Simultaneous multithreaded (SMT) processors combine designs from:

• superscalar processors

• traditional multithreaded processors

The combination:

• SMT issues & executes instructions from multiple threads simultaneously

=> converting TLP to ILP

• threads share almost all hardware resources

Performance Implications

Multiprogramming workload

• 2.5X on SPEC95, 4X on SPEC2000

Parallel programs

• ~.7 on SPLASH2

Commercial databases

• 2-3X on TPC B; 1.5 on TPC D

Web servers & OS

• 4X on Apache and Digital Unix

Does this Processor Sound Familiar?

Huge performance boost + straightforward implementation =>

• 2-context Intel Hyperthreading

• 2-context Sun UltraSPARC on 4-processor CMP

• 4-context Compaq 21464

• network processor & mobile device start-ups

• others in the wings

An SMT Architecture

Three primary goals for this architecture:

1. Achieve significant throughput gains with multiple threads

2. Minimize the performance impact on a single thread executing alone

3. Minimize the architectural impact on a conventional out-of-order superscalar design

Implementing SMT

Most hardware on current out-of-order processors can be used as is

Out-of-order renaming & instruction scheduling mechanisms

• physical register pool model

• renaming hardware eliminates false dependences both within a thread (just like a superscalar) & between threads

• map thread-specific architectural registers onto a pool of thread-independent physical registers

• operands are thereafter called by their physical names

• an instruction is issued when its operands become available & a functional unit is free

• instruction scheduler does not consider thread IDs when dispatching instructions to functional units (unless threads have different priorities)
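The mapping of thread-specific architectural registers onto a shared physical pool can be sketched as below. This is a minimal model under stated assumptions, not a description of any real processor's rename logic: the class name, the free-list policy, and the point at which old mappings are released are all hypothetical.

```python
class RenameTable:
    """Maps (thread_id, architectural register) to a shared physical register."""

    def __init__(self, num_physical: int):
        self.free = list(range(num_physical))   # thread-independent pool
        self.map = {}                           # (thread_id, arch_reg) -> phys

    def rename_dest(self, thread_id: int, arch_reg: int):
        """Allocate a fresh physical register for a newly written value.

        Returns (new_phys, old_phys); old_phys is None on the first write.
        """
        if not self.free:
            raise RuntimeError("no free physical register: rename would stall")
        phys = self.free.pop(0)
        old = self.map.get((thread_id, arch_reg))
        self.map[(thread_id, arch_reg)] = phys
        return phys, old

    def rename_src(self, thread_id: int, arch_reg: int) -> int:
        """Look up the physical name of a source operand."""
        return self.map[(thread_id, arch_reg)]

    def release(self, phys: int) -> None:
        """Return a physical register to the pool (e.g., at retirement)."""
        self.free.append(phys)


if __name__ == "__main__":
    table = RenameTable(num_physical=8)
    p_t0, _ = table.rename_dest(thread_id=0, arch_reg=5)
    p_t1, _ = table.rename_dest(thread_id=1, arch_reg=5)
    print(p_t0, p_t1)   # same architectural name, different physical registers
```

Because two threads writing the same architectural register get different physical registers, instructions downstream of renaming can be scheduled without knowing which thread they belong to.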

From Superscalar to SMT

Extra pipeline stages for accessing thread-shared register files

• 8 threads * 32 registers + renaming registers

SMT instruction fetcher (ICOUNT)

• fetch from 2 threads each cycle

• count the number of instructions for each thread in the pre-execution stages

• pick the 2 threads with the lowest number

• in essence fetching from the two highest throughput threads
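The ICOUNT selection step described above amounts to choosing the threads with the fewest instructions in the pre-execution stages. A minimal sketch, with a hypothetical function name and a tie-break on thread ID that is an assumption rather than part of the policy:

```python
def icount_pick(front_end_counts, n=2):
    """Return the n thread IDs with the fewest pre-execution instructions.

    front_end_counts: dict mapping thread ID -> number of that thread's
    instructions currently in the pre-execution pipeline stages.
    Ties are broken by lower thread ID (assumption).
    """
    ranked = sorted(front_end_counts, key=lambda tid: (front_end_counts[tid], tid))
    return ranked[:n]


if __name__ == "__main__":
    counts = {0: 12, 1: 3, 2: 7, 3: 3}
    print(icount_pick(counts))   # threads 1 and 3 have the fewest instructions
```

Threads with few instructions waiting in the front end are the ones moving through the machine quickly, so favoring them keeps slow threads from clogging the instruction queues.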

From Superscalar to SMT

Per-thread hardware

• small stuff

• all part of current out-of-order processors

• none endangers the cycle time

• other per-thread processor state, e.g.,

• program counters

• return stacks

• thread identifiers, e.g., with BTB entries, TLB entries

• per-thread bookkeeping for

• instruction retirement

• trapping

• instruction queue flush

This is why there is only a 10% increase to Alpha 21464 chip area.

Implementing SMT

Thread-shared hardware:

• fetch buffers

• branch prediction structures

• instruction queues

• functional units

• active list

• all caches & TLBs

• MSHRs

• store buffers

This is why there is little single-thread performance degradation (~1.5%).

Architecture Research

Concept & potential of Simultaneous Multithreading: ISCA ’95 & ISCA 25th Anniversary Anthology

Designing the microarchitecture: ISCA ’96

• straightforward extension of out-of-order superscalars

I-fetch thread chooser: ISCA ’96

• 40% faster than round-robin

The lockbox for cheap synchronization: HPCA ’98

• orders of magnitude faster

• can parallelize previously unparallelizable codes

Architecture Research

Software-directed register deallocation: TPDS ’99

• large register-file performance w. small register file

Mini-threads: HPCA ’03

• large SMT performance w. small SMTs

SMT instruction speculation: TOCS ’03

• a good thread mix is the most important performance factor

Compiler Research

Tuning compiler optimizations for SMT: Micro ’97 & IJPP ’99

• data decomposition: use cyclic iteration scheduling

• tiling: use cyclic tiling; no tile size sweet spot

• software speculation: generates too much overhead code on integer programs

Communicate last-use info to HW for early register deallocation: TPDS ’99

• now need ¼ the renaming registers

Compiling for fewer registers/thread: HPCA ’03

• surprisingly little additional spill code (avg. 3%)

OS Research

Analysis of OS behavior on SMT: ASPLOS ’00

• Kernel-kernel conflicts in I$ & D$ & branch mispredictions ameliorated by SMT instruction issue + thread-sharing in HW

OS/runtime support for mini-threads: HPCA ’03

• dedicated server: recompile for fewer registers

• multiprogrammed environment: multiple versions

OS/runtime support for executing threaded programs: ISCA ’98 & PPoPP ’03

• dynamic memory allocation, synchronization, stack offsetting, page mapping

Others are Now Carrying the Ball

Fault detection & recovery

Thread-level speculation

Instruction & data prefetching

Instruction issue hardware design

Thread scheduling

Single-thread execution

Profiling executing threads

SMT-CMP hybrids

SMT Collaborators

UW

Hank Levy

Steve Gribble

Dean Tullsen (UCSD)

Jack Lo (Transmeta)

Sujay Parekh (IBM Yorktown)

Brian Dewey (Microsoft)

Manu Thambi (Microsoft)

Josh Redstone (interviewing)

Mike Swift

Luke McDowell

Steve Swanson

Aaron Eakin (HP)

Dimitriy Portnov (Google)

DEC/Compaq

Joel Emer (now Intel)

Rebecca Stamm

Luiz Barroso (now Google)

Kourosh Gharachorloo (now Google)

For more info on SMT:

http://www.cs.washington.edu/research/smt

SMT Performance

SMT improves throughput by converting TLP from multiple contexts into ILP

Performance increases with more contexts

[Figure: IPC (axis 0 to 5) versus number of contexts (1, 2, 4, 8) for Apache]

A Problem with Large SMTs

Register file will not scale in future technologies

• e.g., 4-context Alpha 21464:

• 3 cycles to access

• 4 times the size of the 64KB instruction cache

First SMT implementations will be small

• limited to 2 hardware contexts

• expect lower throughput gains

Goal of mini-threads: achieve performance gains of larger SMTs without the register cost

Mini-threads: the Concept

Execute more than one thread, called a mini-thread, within each context

Mini-threads in a context share the context’s architectural register set

SMT Processor Model

[Figure: two hardware contexts, Context 1 and Context 2, each with one PC and its own architectural registers R0…R31]

Mini-thread Processor Model

[Figure: two hardware contexts, Context 1 and Context 2, each with two PCs and one set of architectural registers R0…R31]

Mini-thread Processor Model

[Figure: two hardware contexts, each with two PCs; each context’s architectural registers are divided into R0 – R13, R14 – R17 and R18 – R31]

Mini-thread Processor Model

[Figure: two hardware contexts, each with two PCs; each context’s architectural registers are divided into R0 – R15 and R16 – R31]

Mini-threads Executing on SMT

The HW side: Add extra PCs to each hardware context

• mini-threads execute using their own PC

• mini-threads share the context’s architectural register set

Mini-threads Executing on SMT

The SW side: applications can choose to trade off registers for TLP

• ignore the extra PCs and the machine is SMT

• use the extra PCs for higher thread-level parallelism

• an application is responsible for coordinating register usage across mini-threads within its context

The Register/TLP Trade-off

+ Throughput benefit of extra mini-threads (more TLP)

-- Instruction cost of fewer registers per mini-thread (spill code)

+/-- Spill code impact on throughput

+/-- Instruction impact of extra mini-threads

Two Mini-threads/Context

[Figure: % increase in IPC (axis -10 to 90) for Apache, Barnes, Fmm, Raytrace, Water-spatial and the workload average, with 1, 2, 4 and 8 contexts]

• 1 Context: 63%

• 2 Contexts: 40%

• 4 Contexts: 26%

• 8 Contexts: 9%

16 Registers per Mini-thread

[Figure: % change in dynamic instructions (axis -25 to 20) for Apache, Barnes, Fmm, Raytrace and Water-spatial]

• Only 3% on average!

• IPC cost can be greater

• 7% savings in spill code

• Kernel: 0.8%

Bottom Line for 2 Mini-threads on SMT

[Figure: % speedup (axis -40 to 100) for Apache, Barnes, Fmm, Raytrace, Water-spatial and the workload average, with 1, 2, 4 and 8 contexts; the final version of the chart carries the bar labels 83, 66, 43 and 10]

• 1 Context: 60%

• 2 Contexts: 38%

• 4 Contexts: 20%

• 8 Contexts: -2%

• If free to choose:

– 1 Context: 60%

– 2 Contexts: 38%

– 4 Contexts: 22%

– 8 Contexts: 6%

Operating System & Runtime Support

The issues:

• Runtime and/or OS support for managing mini-threads

• Compatible register usage between mini-threads & OS/runtime

Operating System & Runtime Support

1. Dedicated server, e.g., web server

• Goal: maximum OS performance

• Solution:

• OS/runtime recompiled for half the register set

• each mini-thread uses different registers

• hardware partition bit steers register access to high or low set

+ Both mini-threads in a context can execute in the OS at the same time

-- Only 1 partition allowed & no sharing of values
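The effect of the partition bit can be sketched as a simple address computation. This is an illustration under stated assumptions: a 32-register context split into two equal halves, code compiled to name only registers 0 to 15, and bit value 0 selecting the low half. The constant and function names are hypothetical.

```python
REGS_PER_CONTEXT = 32
HALF = REGS_PER_CONTEXT // 2


def steer(partition_bit: int, reg: int) -> int:
    """Map a register name from half-register-set code to a context register.

    partition_bit: 0 selects the low set, 1 the high set (assumed encoding).
    reg:           register number as named by the mini-thread (0..15).
    """
    if partition_bit not in (0, 1):
        raise ValueError("partition bit must be 0 or 1")
    if not 0 <= reg < HALF:
        raise ValueError("code compiled for half the set may only name 0..15")
    return partition_bit * HALF + reg


if __name__ == "__main__":
    # The same instruction text touches disjoint registers in each mini-thread.
    print(steer(0, 3), steer(1, 3))
```

Because both mini-threads run identical OS code yet never touch the same physical half, they can be in the kernel simultaneously. The fixed split is also why only one partitioning is possible and values cannot be shared through registers.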

Operating System & Runtime Support

2. Multiprogrammed environment

• Goal: maximum flexibility to use registers

• Solution:

• standard OS compile, multiple versions of the runtime

• runtime schedules mini-threads

• hardware blocks 1 mini-thread when the other enters the OS

+ No restrictions on register usage and sharing among mini-threads

-- Lower throughput because mini-threads are blocked

Conclusion for 2 Mini-threads on SMT

Technique for obtaining large-scale SMT throughput with a small-scale SMT implementation

Mini-threads bypass the register scaling problem on SMT

• average boost of 38% on 2-context SMT, 20% on 4-context SMT, and -2% on 8-context SMT

Performance increases when applications choose to use mini-threads when it pays (38%/22%/6%)

Reducing registers has low impact!

• 3% average increase in instructions

A Problem or an Opportunity?

Database workload

+ low throughput (0.8 IPC on an 8-wide superscalar)

+ naturally threaded (and widely used) application

- already high cache miss rates on a single-threaded machine

Key question:

Can SMT’s simultaneous multi-thread instruction issue hide the latencies from additional misses in these already memory-intensive databases?

Database Bottom Line

SMT gets good speedups on commercial database workloads

• 3x for debit/credit systems, 1.7 for big search problems

The reasons:

• Limits inter-thread data interference in the caches to superscalar levels by:

• a virtual-to-physical page mapping policy that exploits temporal locality in the physically-addressed L2 cache

• address offsetting for thread-private data in the L1 data cache

• Exploits SMT’s inter-thread instruction sharing (35% reduction in instruction cache miss rates)

• Hides latencies of all kinds with simultaneous, multi-thread instruction issue

SMT Performance

SMT Register Deallocation Strategies

(1) Compiler/architecture support to free registers after their last use

• small SMT register files achieve large register file performance

Use a small register file if the register access sets the cycle time.

(2) OS support to free registers in idle contexts:

• enables SMT registers to be shared by all contexts (traditional multithreaded processors have separate contexts per thread)

• increases performance with a small register file by:

• up to 50% with 8 threads

• 3 times with fewer threads.

Design a good throughput CPU (i.e., SMT) & not penalize the performance of a single or few threads
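Strategy (1) depends on the compiler knowing where each register value is read for the last time. A minimal sketch of that analysis for a single straight-line block follows. It is an illustration only: it uses a hypothetical `(dest, sources)` instruction form and assumes no register is live after the block ends, where a real compiler would use full liveness analysis across the control-flow graph.

```python
def last_use_marks(instrs):
    """For each instruction, the source registers whose value dies there.

    instrs: list of (dest, [srcs]) tuples for a straight-line block.
    Assumption: nothing is live after the last instruction.
    """
    marks = [set() for _ in instrs]
    read_later = set()   # registers read later with no write in between
    for i in range(len(instrs) - 1, -1, -1):   # scan the block backward
        dest, srcs = instrs[i]
        read_later.discard(dest)   # this write starts a new value of dest
        for s in srcs:
            if s not in read_later:
                marks[i].add(s)    # no later reader: last use of this value
            read_later.add(s)
    return marks


if __name__ == "__main__":
    block = [
        ("r1", ["r2", "r3"]),
        ("r4", ["r1", "r2"]),
        ("r1", ["r4"]),
    ]
    for instr, dead in zip(block, last_use_marks(block)):
        print(instr, "frees", sorted(dead))
```

Each marked source tells the hardware that the physical register holding that value can be returned to the pool once the instruction has read it, rather than waiting for the architectural register to be redefined.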

Deallocating Registers After Last Use

Putting This in Context