Spring 2003 CSE 548P 1
Motivation for Multithreaded Architectures
Processors not executing code at their hardware potential
• late 70’s: performance lost to memory latency
• 90’s: performance not in line with the increasingly complex parallel hardware
• instruction issue bandwidth is increasing
• number of functional units is increasing
• instructions can execute out-of-order & complete in-order
• still, processor utilization was decreasing & instruction throughput was not increasing in proportion to the issue width
Major cause is the lack of instruction-level parallelism in a single executing thread
Therefore the solution has to be more general than building a smarter cache or a more accurate branch predictor
Multithreaded Processors
Multithreaded processors can increase the pool of independent instructions & consequently address multiple causes of processor stalling
• holds processor state for more than one thread of execution
• registers
• PC
• each thread’s state is a hardware context
• execute the instruction stream from multiple threads without software context switching
• utilize thread-level parallelism (TLP) to compensate for a lack in ILP
Multithreading
Traditional multithreaded processors hardware-switch to a different context to avoid processor stalls
3 styles
1. coarse-grain multithreading
• switch on a long-latency operation (e.g., L2 cache miss)
• another thread executes while the miss is handled
• modest increase in instruction throughput with potentially no slowdown to the thread with the miss
• doesn’t hide latency of short-latency operations
• doesn’t add much to an already long stall
• need to fill the pipeline on a switch
• no switch if no long-latency operations
• HEP, IBM RS64 III
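The switch-on-miss behavior above can be sketched as a toy simulation; the event encoding and function name are hypothetical, not from any real machine:

```python
# Coarse-grain multithreading sketch: run one thread until it hits a
# long-latency event (e.g., an L2 miss), then hardware-switch to the
# next context so another thread executes while the miss is handled.

def run_coarse_grain(threads, cycles):
    """threads: per-thread lists of 'op' / 'miss' events, one per cycle."""
    streams = [iter(t) for t in threads]
    current, executed = 0, 0
    for _ in range(cycles):
        try:
            event = next(streams[current])
        except StopIteration:
            continue                # current thread finished; idle cycle
        executed += 1
        if event == "miss":         # long-latency operation triggers a
            current = (current + 1) % len(streams)  # context switch
    return executed

# Thread 1 keeps the pipeline busy while thread 0's miss is outstanding.
done = run_coarse_grain([["op", "miss", "op"], ["op", "op", "op"]], cycles=6)
```

Note that this toy model only switches on a miss, matching the “no switch if no long-latency operations” point: a thread that never misses is never preempted.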
Traditional Multithreading
3 styles
2. fine-grain multithreading
• can switch to a different thread each cycle (usually round robin)
• hides latencies of all kinds
• larger increase in instruction throughput but slows down the execution of each thread
• Cray (Tera) MTA
Comparison of Issue Capabilities
Multithreading
3 styles
3. simultaneous multithreading (SMT)
• issues multiple instructions from multiple threads each cycle
• no hardware context switching
• huge boost in instruction throughput with less degradation to individual threads
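Per-cycle SMT issue can be sketched as follows (a toy model; the function name and round-robin slot filling are assumptions for illustration):

```python
# SMT issue sketch: in one cycle, up to `width` instructions issue,
# drawn from any thread's ready queue -- no context switch is needed.

def smt_issue_cycle(queues, width):
    """queues: per-thread lists of ready instructions (mutated in place)."""
    issued, tid = [], 0
    while len(issued) < width and any(queues):
        if queues[tid]:                        # take one instruction from
            issued.append(queues[tid].pop(0))  # whichever thread has work
        tid = (tid + 1) % len(queues)
    return issued

# A 4-wide machine fills all its issue slots from two threads at once.
ops = smt_issue_cycle([["a1", "a2"], ["b1", "b2", "b3"]], width=4)
```

When one thread lacks ILP, the other threads’ instructions fill the otherwise-empty slots, which is how TLP gets converted into issue bandwidth.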
Comparison of Issue Capabilities
Cray (Tera) MTA
Fine-grain multithreaded processor
• can switch to a different thread each cycle
• switches to ready threads only
• allows execution to remain with the current thread for a specific number of cycles (discussed under compiler support)
• up to 128 hardware contexts
• lots of latency to hide, mostly from the multi-hop interconnection network
• average instruction latency is 70 cycles (at one point) (i.e., 70 instruction streams needed to hide all latency, on average)
• processor state for all 128 contexts
• GPRs (total of 4K registers!)
• status registers (includes the PC)
• branch target registers
Cray (Tera) MTA
Interesting features
• no data caches
• to avoid having to keep caches coherent (topic of the next lecture)
• increases the latency for data accesses but reduces the variation
• L1 & L2 instruction caches
• instruction accesses are more predictable & have no coherency problem
• prefetch straight-line & target code
Cray (Tera) MTA
Interesting features
• trade-off between avoiding memory bank conflicts & exploiting spatial locality
• memory distributed among processors
• memory addresses are randomized to avoid conflicts
• want to fully utilize all memory bandwidth
• run-time system can confine consecutive virtual addresses to a single (close-by) memory unit
• reduces latency (used mainly for instructions)
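Address randomization of this kind can be sketched with a toy bank hash; the XOR-fold below is purely illustrative, not the MTA’s actual scrambling function:

```python
# Bank-randomization sketch: hash each address so that unit-stride and
# power-of-two-stride accesses spread across banks instead of piling
# onto one memory unit.

NUM_BANKS = 8

def bank_of(addr):
    # fold higher address bits into the bank index (illustrative hash)
    return (addr ^ (addr >> 3) ^ (addr >> 6)) % NUM_BANKS

# Eight consecutive word addresses map to eight distinct banks.
banks = [bank_of(a) for a in range(8)]
```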
Cray (Tera) MTA
Interesting features
• tagged memory for synchronization, pointer-chasing, trap handling
• indirectly set full/empty bits to prevent data races
• prevents a consumer/producer from loading/overwriting a value before a producer/consumer has written/read it
• set to empty when the producer instruction starts
• consumer instructions block if they try to read the producer’s value
• set to full when the producer value is written
• consumers can now read a valid value
• explicitly set full/empty bits for synchronization
• primarily used to synchronize threads that are accessing shared data (topic of the next lecture)
• lock: read memory location & set to empty
• other readers are blocked
• unlock: write & set to full
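The full/empty discipline can be sketched as a toy tagged memory word (class and method names are hypothetical; a real machine blocks the thread rather than raising an error):

```python
# Full/empty-bit sketch: every memory word carries a state bit.
# A synchronized read requires "full" and leaves the word empty;
# a synchronized write requires "empty" and sets it full.

class TaggedWord:
    def __init__(self):
        self.full, self.value = False, None

    def sync_write(self, v):
        if self.full:               # producer would block until empty
            raise BlockingIOError("word full: writer blocks")
        self.value, self.full = v, True

    def sync_read(self):
        if not self.full:           # consumer would block until full
            raise BlockingIOError("word empty: reader blocks")
        self.full = False           # read empties the word
        return self.value

w = TaggedWord()
w.sync_write(42)        # producer stores the value & sets full
v = w.sync_read()       # consumer gets 42; a second read would block
```

The lock/unlock idiom on the slide is exactly this pair: the synchronized read acquires (sets empty, blocking other readers) and the synchronized write releases (sets full).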
Cray (Tera) MTA
Interesting features
• virtual memory system
• no paging: want pages pinned down in memory
• page size is 256MB
• user-mode trap handlers
• fatal exceptions, normalizing floating point numbers
Cray (Tera) MTA
Compiler support
• each instruction is a VLIW instruction
• memory/arithmetic/branch
• load/store architecture
• need a good code scheduler
• explicit dependence look-ahead
• field in a memory instruction that specifies the number of independent LIW instructions that follow
• deviation from fine-grain multithreading
• 8 memory operations can be outstanding
• handling branches
• instruction to store a branch target in a register before the branch is executed
• can start prefetching the target code
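The explicit dependence look-ahead field can be sketched as follows (a toy single-outstanding-load model; the encoding and function name are hypothetical):

```python
# Look-ahead sketch: a memory instruction carries a small field saying
# how many following instructions are independent of it, so the hardware
# can keep issuing from the same stream while the access is in flight.

def issue_until_stall(stream):
    """stream: list of (op, lookahead) pairs; returns ops issued before
    the first memory operation forces a stall."""
    issued, budget = [], None
    for op, lookahead in stream:
        if budget is not None:
            if budget == 0:
                break               # dependent instruction: must wait
            budget -= 1
        issued.append(op)
        if op == "load":
            budget = lookahead      # compiler-guaranteed independent ops
    return issued

# A look-ahead of 2 lets two more instructions issue past the load.
ops = issue_until_stall([("load", 2), ("add", 0), ("mul", 0), ("sub", 0)])
```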
Cray (Tera) MTA
Run-time support
• thread manipulation
• protection domains: group of threads executing in the same virtual address space
• OS sets the maximum number of thread contexts (instruction streams) a domain is allowed
• domain can create & kill threads within that limit, depending on its need for them
• OS bases resource allocations on resource usage
SMT: The Executive Summary
Simultaneous multithreaded (SMT) processors combine designs from:
• superscalar processors
• traditional multithreaded processors
The combination:
• SMT issues & executes instructions from multiple threads simultaneously
=> converting TLP to ILP
• threads share almost all hardware resources
Performance Implications
Multiprogramming workload
• 2.5X on SPEC95, 4X on SPEC2000
Parallel programs
• ~.7 on SPLASH2
Commercial databases
• 2-3X on TPC-B; 1.5X on TPC-D
Web servers & OS
• 4X on Apache and Digital Unix
Does this Processor Sound Familiar?
Huge performance boost + straightforward implementation =>
• 2-context Intel Hyperthreading
• 2-context Sun UltraSPARC on 4-processor CMP
• 4-context Compaq 21464
• network processor & mobile device start-ups
• others in the wings
An SMT Architecture
Three primary goals for this architecture:
1. Achieve significant throughput gains with multiple threads
2. Minimize the performance impact on a single thread executing alone
3. Minimize the architectural impact on a conventional out-of-order superscalar design
Implementing SMT
Can use most hardware on current out-of-order processors as-is

Out-of-order renaming & instruction scheduling mechanisms
• physical register pool model
• renaming hardware eliminates false dependences both within a thread (just like a superscalar) & between threads
• map thread-specific architectural registers onto a pool of thread-independent physical registers
• operands are thereafter called by their physical names
• an instruction is issued when its operands become available & a functional unit is free
• instruction scheduler does not consider thread IDs when dispatching instructions to functional units (unless threads have different priorities)
From Superscalar to SMT
Extra pipeline stages for accessing thread-shared register files
• 8 threads * 32 registers + renaming registers
SMT instruction fetcher (ICOUNT)
• fetch from 2 threads each cycle
• count the number of instructions for each thread in the pre-execution stages
• pick the 2 threads with the lowest number
• in essence fetching from the two highest throughput threads
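ICOUNT’s thread choice can be sketched in a few lines (the function name is hypothetical):

```python
# ICOUNT sketch: each cycle, fetch from the threads with the fewest
# instructions in the pre-execution pipeline stages -- i.e., the
# threads that are moving instructions through the machine fastest.

def icount_pick(inflight_counts, n=2):
    """inflight_counts[tid] = instructions in pre-execution stages."""
    order = sorted(range(len(inflight_counts)),
                   key=lambda tid: inflight_counts[tid])
    return order[:n]            # n threads with the lowest counts

# Threads 1 and 3 have the fewest queued instructions, so they fetch.
chosen = icount_pick([9, 2, 7, 4])
```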
From Superscalar to SMT
Per-thread hardware
• small stuff
• all part of current out-of-order processors
• none endangers the cycle time
• other per-thread processor state, e.g.,
• program counters
• return stacks
• thread identifiers, e.g., with BTB entries, TLB entries
• per-thread bookkeeping for
• instruction retirement
• trapping
• instruction queue flush
This is why there is only a 10% increase in Alpha 21464 chip area.
Implementing SMT
Thread-shared hardware:
• fetch buffers
• branch prediction structures
• instruction queues
• functional units
• active list
• all caches & TLBs
• MSHRs
• store buffers
This is why there is little single-thread performance degradation (~1.5%).
Architecture Research
Concept & potential of Simultaneous Multithreading: ISCA ’95 & ISCA 25th Anniversary Anthology
Designing the microarchitecture: ISCA ’96
• straightforward extension of out-of-order superscalars
I-fetch thread chooser: ISCA ’96
• 40% faster than round-robin
The lockbox for cheap synchronization: HPCA ’98
• orders of magnitude faster
• can parallelize previously unparallelizable codes
Architecture Research
Software-directed register deallocation: TPDS ’99
• large register-file performance w. small register file
Mini-threads: HPCA ’03
• large SMT performance w. small SMTs
SMT instruction speculation: TOCS ‘03
• a good thread mix is the most important performance factor
Compiler Research
Tuning compiler optimizations for SMT: Micro ‘97 & IJPP ’99
• data decomposition: use cyclic iteration scheduling
• tiling: use cyclic tiling; no tile size sweet spot
• software speculation: generates too much overhead code on integer programs
Communicate last-use info to HW for early register deallocation: TPDS ’99
• now need ¼ the renaming registers
Compiling for fewer registers/thread: HPCA ’03
• surprisingly little additional spill code (avg. 3%)
OS Research
Analysis of OS behavior on SMT: ASPLOS ‘00
• Kernel-kernel conflicts in I$ & D$ & branch mispredictions ameliorated by SMT instruction issue + thread-sharing in HW
OS/runtime support for mini-threads: HPCA ’03
• dedicated server: recompile for fewer registers
• multiprogrammed environment: multiple versions
OS/runtime support for executing threaded programs: ISCA ’98 & PPoPP ’03
• dynamic memory allocation, synchronization, stack offsetting, page mapping
Others are Now Carrying the Ball
Fault detection & recovery
Thread-level speculation
Instruction & data prefetching
Instruction issue hardware design
Thread scheduling
Single-thread execution
Profiling executing threads
SMT-CMP hybrids
SMT Collaborators
UW
Hank Levy
Steve Gribble
Dean Tullsen (UCSD)
Jack Lo (Transmeta)
Sujay Parekh (IBM Yorktown)
Brian Dewey (Microsoft)
Manu Thambi (Microsoft)
Josh Redstone (interviewing)
Mike Swift
Luke McDowell
Steve Swanson
Aaron Eakin (HP)
Dimitriy Portnov (Google)
DEC/Compaq
Joel Emer (now Intel)
Rebecca Stamm
Luiz Barroso (now Google)
Kourosh Gharachorloo (now Google)
For more info on SMT:
http://www.cs.washington.edu/research/smt
SMT Performance

[Chart: IPC (0–5) vs. number of contexts (1, 2, 4, 8) for Apache]

SMT improves throughput by converting TLP from multiple contexts into ILP

Performance increases with more contexts
A Problem with Large SMTs
Register file will not scale in future technologies
• e.g., 4-context Alpha 21464:
• 3 cycles to access
• 4 times the size of the 64KB instruction cache
First SMT implementations will be small
• limited to 2 hardware contexts
• expect lower throughput gains
Goal of mini-threads: achieve performance gains of larger SMTs without the register cost
Mini-threads: the Concept
Execute more than one thread, called a mini-thread, within each context
Mini-threads in a context share the context’s architectural register set
SMT Processor Model
[Diagram: Context 1 and Context 2 each have their own PC and their own architectural register set, R0 – R31]
Mini-thread Processor Model
[Diagram: Context 1 and Context 2 each now have two PCs, one per mini-thread; both mini-threads in a context share its architectural register set, R0 – R31]
Mini-thread Processor Model
[Diagram: within each context, the architectural register set is partitioned among the mini-threads as R0 – R13, R14 – R17, and R18 – R31]
Mini-thread Processor Model
[Diagram: within each context, each of the two mini-threads gets half the architectural register set: R0 – R15 and R16 – R31]
Mini-threads Executing on SMT
The HW side: Add extra PCs to each hardware context
• mini-threads execute using their own PC
• mini-threads share the context’s architectural register set
Mini-threads Executing on SMT
The SW side: applications can choose to trade-off registers for TLP
• ignore the PCs and the machine is SMT
• use the extra PCs for higher thread-level parallelism
• an application is responsible for coordinating register usage across mini-threads within its context
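Since the hardware enforces nothing, the register split is purely a software contract; a toy validity check (hypothetical helper) makes the point:

```python
# Register-partition sketch: mini-threads sharing one context must agree
# on disjoint architectural-register ranges by convention; the hardware
# does not police overlapping usage.

def partition_ok(assignments, num_regs=32):
    """assignments: one set of register indices per mini-thread."""
    used = set()
    for regs in assignments:
        if used & regs:             # two mini-threads claim a register
            return False
        if any(r >= num_regs for r in regs):
            return False            # register index out of range
        used |= regs
    return True

even_split = [set(range(0, 16)), set(range(16, 32))]   # R0-R15 / R16-R31
clashing = [set(range(0, 20)), set(range(16, 32))]     # overlap: R16-R19
```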
The Register/TLP Trade-off
+ Throughput benefit of extra mini-threads (more TLP)
-- Instruction cost of fewer registers per mini-thread (spill code)
+/-- Spill code impact on throughput
+/-- Instruction impact of extra mini-threads
Two Mini-threads/Context
[Chart: % increase in IPC (-10 to 90) for Apache, Barnes, Fmm, Raytrace, and Water-spatial]

Workload Average:
• 1 Context: 63%
Two Mini-threads/Context
[Chart: % increase in IPC (-10 to 90) for Apache, Barnes, Fmm, Raytrace, and Water-spatial with 1, 2, 4 & 8 contexts]

Workload Average:
• 1 Context: 63%
• 2 Contexts: 40%
• 4 Contexts: 26%
• 8 Contexts: 9%
16 Registers per Mini-thread
[Chart: % change in dynamic instructions (-25 to 20) for Apache, Barnes, Fmm, Raytrace, and Water-spatial]

• Only 3% on average!
• IPC cost can be greater
16 Registers per Mini-thread
[Chart: % change in dynamic instructions (-25 to 20) for Apache, Barnes, Fmm, Raytrace, and Water-spatial]

• 7% savings in spill code
16 Registers per Mini-thread
[Chart: % change in dynamic instructions (-25 to 20) for Apache, Barnes, Fmm, Raytrace, and Water-spatial]

• Kernel: .8%
Bottom Line for 2 Mini-threads on SMT

[Chart: % speedup (-40 to 100) for Apache, Barnes, Fmm, Raytrace, and Water-spatial with 1, 2, 4 & 8 contexts]

Workload Average:
• 1 Context: 60%
• 2 Contexts: 38%
• 4 Contexts: 20%
• 8 Contexts: -2%
Bottom Line for 2 Mini-threads on SMT

[Chart: % speedup (-40 to 100) for Apache, Barnes, Fmm, Raytrace, and Water-spatial with 1, 2, 4 & 8 contexts]

Workload Average, if free to choose:
• 1 Context: 60%
• 2 Contexts: 38%
• 4 Contexts: 22%
• 8 Contexts: 6%
Bottom Line for 2 Mini-threads on SMT

[Chart: % speedup (-40 to 100) for Apache, Barnes, Fmm, Raytrace, and Water-spatial with 1, 2, 4 & 8 contexts; peak bars labeled 83, 66, 43 & 10]

Workload Average:
• 1 Context: 60%
• 2 Contexts: 38%
• 4 Contexts: 20%
• 8 Contexts: -2%
Operating System & Runtime Support
The issues:
• Runtime and/or OS support for managing mini-threads
• Compatible register usage between mini-threads & OS/runtime
Operating System & Runtime Support
1. Dedicated server, e.g., web server
• Goal: maximum OS performance
• Solution:
• OS/runtime recompiled for half the register set
• each mini-thread uses different registers
• hardware partition bit steers register access to high or low set
+ Both mini-threads in a context can execute in the OS at the same time
-- Only 1 partition allowed & no sharing of values
Operating System & Runtime Support
2. Multiprogrammed environment
• Goal: maximum flexibility to use registers
• Solution:
• standard OS compile, multiple versions of the runtime
• runtime schedules mini-threads
• hardware blocks 1 mini-thread when the other enters the OS
+ No restrictions on register usage and sharing among mini-threads
-- Lower throughput because mini-threads are blocked
Conclusion for 2 Mini-threads on SMT
Technique for obtaining large-scale SMT throughput with a small-scale SMT implementation
Mini-threads bypass the register scaling problem on SMT
• average 38% boost on 2-context SMT, and 20% on 4-context SMT, -2% on 8-context SMT
Performance increases when applications choose to use mini-threads when it pays (38%/22%/6%)
Reducing registers has low impact!
• 3% average increase in instructions
A Problem or an Opportunity?
Database workload
+ low throughput (0.8 IPC on an 8-wide superscalar)
+ naturally threaded (and widely used) application
- already high cache miss rates on a single-threaded machine
Key question:
Can SMT’s simultaneous multi-thread instruction issue hide the latencies from the additional misses in this already memory-intensive database workload?
Database Bottom Line
SMT gets good speedups on commercial database workloads
• 3X for debit/credit systems, 1.7X for big search problems

The reasons:
• Limits inter-thread data interference in the caches to superscalar levels by:
• a virtual-to-physical page mapping policy that exploits temporal locality in the physically-addressed L2 cache
• address offsetting for thread-private data in the L1 data cache
• Exploits SMT’s inter-thread instruction sharing (35% reduction in instruction cache miss rates)
• Hides latencies of all kinds with simultaneous, multi-thread instruction issue
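The address-offsetting idea can be sketched as follows; the cache geometry and base address are illustrative assumptions, not the studied machine’s parameters:

```python
# Offsetting sketch: give each thread's private data (e.g., its stack)
# a different starting offset, so identical layouts across threads map
# to different L1 cache sets instead of conflicting in the same sets.

CACHE_SETS, LINE_BYTES = 512, 64

def cache_set(addr):
    return (addr // LINE_BYTES) % CACHE_SETS

def stack_base(tid, stride=LINE_BYTES * 4):
    return 0x40000000 + tid * stride    # per-thread offset (hypothetical)

# Without offsetting, every thread's frame at the same stack depth would
# hit the same set; with it, the threads spread across sets.
sets = [cache_set(stack_base(tid)) for tid in range(4)]
```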
SMT Register Deallocation Strategies
(1) Compiler/architecture support to free registers after their last use
• small SMT register files achieve large register file performance

Use a small register file if the register access sets the cycle time.

(2) OS support to free registers in idle contexts:
• enables SMT registers to be shared by all contexts (traditional multithreaded processors have separate contexts per thread)
• increases performance with a small register file by:
• up to 50% with 8 threads
• 3 times with fewer threads

Design a good throughput CPU (i.e., SMT) without penalizing the performance of a single thread or a few threads
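Strategy (1) can be sketched as follows (the instruction encoding and names are hypothetical):

```python
# Free-on-last-use sketch: the compiler flags an operand's final read,
# and the renamer returns that physical register to the shared free pool
# immediately, instead of waiting for the architectural register to be
# redefined by a later instruction.

def rename_with_dealloc(phys_map, free_list, inst):
    """inst: (dest, srcs, last_use_srcs) -- a hypothetical encoding."""
    dest, srcs, last_uses = inst
    for s in srcs:
        if s in last_uses:                      # compiler-flagged last use
            free_list.append(phys_map.pop(s))   # free its physical register
    phys_map[dest] = free_list.pop(0)           # allocate the destination
    return phys_map[dest]

phys_map, free_list = {"r1": 7, "r2": 3}, [0, 1]
# add r3 <- r1, r2   (r2 flagged as a last use)
p = rename_with_dealloc(phys_map, free_list, ("r3", ["r1", "r2"], {"r2"}))
```

Early deallocation is what lets a small physical register file behave like a large one: registers spend less time allocated-but-dead.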
Deallocating Registers After Last Use
Putting This in Context