CS184b: Computer Architecture (Abstractions and Optimizations)
Day 11: April 30, 2003 -- Multithreading
Caltech CS184 Spring 2003 -- DeHon

Slide 2: Today
- Multitasking/multithreading model
- Fine-grained multithreading
- SMT (simultaneous multithreading)

Slide 3: Problem
- Long-latency operations waste processor cycles while stalled:
  - IO or page fault
  - Non-local memory fetch (main memory, L3, remote node in distributed memory)
  - Long-latency functional units (multiply, floating point)
- If the processor stalls until the result returns, the latency problem turns into a throughput (utilization) problem: the CPU sits idle.

Slide 4: Idea
- Run something else useful while stalled, in particular another process/thread (another PC).
- Use parallelism to tolerate latency.

Slide 5: Old Idea
- Share an expensive machine among multiple users (jobs).
- When one user task must wait on IO, run another one.
- Time-multiplex the machine among users.

Slide 6: Mandatory Concurrency
- Some tasks must run concurrently (interleaved) with user tasks:
  - DRAM refresh
  - IO: keyboard, network
  - Window system (xclock)
  - Autosave
  - Clippy

Slide 7: Other Useful Concurrency
- Print spooler
- Web browser: download images in parallel
- Instant Messenger / Zephyr (Gale)
- biff/xbiff/xfaces

Slide 8: Multitasking
- A single machine runs multiple tasks.
- The machine provides the same ISA/sequential semantics to each task.
- Each task believes it owns the machine, just as if the other tasks were running on different machines.
- Tasks are isolated from one another: they cannot affect each other functionally (though they may impact each other's performance).

Slide 9: Each Task/Process
- A process is a virtualization of the CPU.
- Each process has its own unique set of state:
  - PC
  - Registers
  - VM page table (hence memory image)

Slide 10: Sharing the CPU
- Save/restore PC, registers, and page table.
- Virtual memory provides isolation.
- Privileged system software; user/system mode execution.
- Functionally, a task does not notice that it gave up the CPU for a period of time.

Slide 11: Threads
- A thread has a separate PC but shares an address space.
- Has its own processor state: PC, registers.
- Shares memory (VM page table).
- A process may have multiple threads. (A minimal pthreads sketch appears at the end of these notes.)

Slide 12: Multitasking/Multithreading
- Gives us an initial model for parallelism.
- So far, parallelism of unrelated tasks; eventually, cooperating tasks.
- Have to address the concurrent memory model (next time).

Slide 13: Fine Grained

Slide 14: HEP/Unity/Tera
- Provide a number of contexts: separate PCs and register files.
- Number of contexts >= operation latency (pipeline depth, roundtrip time to main memory).
- Run the contexts in round-robin fashion.

Slide 15: HEP Pipeline
- [Figure: HEP pipeline; Arvind+Iannucci, DFVLR 1987]

Slide 16: Strict Interleaved Threading
- Uses parallelism to get throughput:
  - Avoids interlocking and bypass
  - Covers memory latency
- Essentially a C-slow transformation of the processor.
- Potentially poor single-threaded performance: increases the end-to-end latency of each thread.
- (A minimal sketch of the round-robin interleaving follows below.)
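To make the round-robin ("barrel") interleaving concrete, here is a minimal C sketch of the execution model, not taken from the lecture: each context keeps its own PC and registers, and each cycle the machine issues from the next context in turn, so a stalled context never blocks the others. The Context type, NUM_CONTEXTS, and step_context() are illustrative names, not anything from the HEP or Tera designs.

    /* Sketch of strict round-robin interleaving across hardware contexts. */
    #include <stdio.h>

    #define NUM_CONTEXTS 8          /* chosen >= worst-case operation latency in cycles */

    typedef struct {
        unsigned pc;                /* each context has its own PC ...   */
        int      regs[32];          /* ... and its own register file     */
    } Context;

    /* Pretend to issue one instruction from the given context.
     * In real hardware this is one pipeline slot; here it is a stub. */
    static void step_context(Context *c) {
        c->pc += 4;                 /* advance that context's PC */
    }

    int main(void) {
        Context ctx[NUM_CONTEXTS] = {0};

        /* Each cycle issues from the next context in round-robin order,
         * so a long-latency operation in one context never stalls the rest. */
        for (unsigned cycle = 0; cycle < 32; cycle++) {
            Context *cur = &ctx[cycle % NUM_CONTEXTS];
            step_context(cur);
            printf("cycle %2u: issue from context %u (pc=%u)\n",
                   cycle, cycle % NUM_CONTEXTS, cur->pc);
        }
        return 0;
    }

With NUM_CONTEXTS at least as large as the worst-case operation latency, a result is always ready by the time its context comes around again, which is why interlocks and bypassing can be dropped.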
Slide 17: SMT

Slide 18: Superscalar and Multithreading?
- Do both? Issue from multiple threads into the pipeline.
- No worse than a (super)scalar processor on a single thread.
- More throughput with multiple threads: fill in what would have been empty issue slots with instructions from different threads.

Slide 19: Superscalar Inefficiency
- [Figure: issue-slot diagram; legend: unused slot]
- Recall: limited scalar IPC.

Slide 20: SMT Promise
- Fill in the empty slots with other threads.

Slide 21: SMT Estimates (Ideal)
- [Figure: Tullsen et al., ISCA 1995]

Slide 22: SMT Estimates (Ideal)
- [Figure: Tullsen et al., ISCA 1995]

Slide 23: SMT uArch
- Observation: exploit register renaming.
- Get SMT with small modifications to an existing superscalar architecture.

Slide 24: SMT uArch
- N.B.: the remarkable thing is how similar the superscalar core remains. [Tullsen et al., ISCA 1996]

Slide 25: Alpha: Basic Out-of-Order Pipeline
- Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire.
- Structures: PC, Icache, Register Map, Regs, Dcache.
- Much of the pipeline is thread-blind. [Src: Tryggve Fossum, Compaq 2000]

Slide 26: Alpha SMT Pipeline
- [Figure: same stages (Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire), with PC, Icache, Register Map, Regs, Dcache; Src: Tryggve Fossum, Compaq]

Slide 27: SMT uArch Changes
- Multiple PCs, plus control to decide which to fetch from.
- Separate return stacks per thread.
- Per-thread reorder/commit/flush/trap.
- Thread id in the BTB.
- Larger register file; more things outstanding.

Slide 28: Performance
- [Figure: Tullsen et al., ISCA 1996]

Slide 29: Relative Performance (Alpha)
- [Chart: relative multithreaded performance with 1-T, 2-T, 3-T, and 4-T on Int95, FP95, Int95/FP95, and SQL programs; Src: Tryggve Fossum, Compaq]

Slide 30: Alpha SMT
- Cost-effective multiprocessing: increased throughput.
- 4x the architectural registers.
- 2x performance gain with little additional cost and complexity.

Slide 31: Optimizing: Fetch Freedom
- RR = round robin; RR.X.Y means X threads fetch in a cycle, Y instructions fetched per thread. [Tullsen et al., ISCA 1996]

Slide 32: Optimizing: Fetch Algorithm
- ICOUNT: priority to the thread with the fewest pending instructions (a sketch of this heuristic appears after the last slide).
- BRCOUNT (fewest unresolved branches), MISSCOUNT (fewest outstanding D-cache misses).
- IQPOSN: penalize threads with old instructions (at the front of the queues). [Tullsen et al., ISCA 1996]

Slide 33: Throughput Improvement
- An 8-issue superscalar achieves little over 2 instructions per cycle.
- Optimized SMT achieves 5.4 instructions per cycle on 8 threads.
- Roughly a 2.5x throughput increase.

Slide 34: Costs
- [Figure: Burns+Gaudiot, HPCA 2000]

Slide 35: Costs
- Single and double cache-line fetches. [Burns+Gaudiot, HPCA 2000]

Slide 36: Check on B5

Slide 37: Machines
- Pull machines off the grid? Matrix1? Matrix3739?
- If off for someone else, leave off and use if they finish in a timely fashion?
- Not necessary for a single-processor machine.
- Department grid cluster.

Slide 38: Big Ideas
- 0, 1, infinity: virtualize resources; processes virtualize the CPU.
- Latency hiding: processes, threads; find something else useful to do while waiting.
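As promised on Slide 32, here is a minimal sketch of the ICOUNT fetch heuristic described by Tullsen et al. (ISCA 1996): each cycle, fetch from the thread with the fewest instructions currently in the front end, so no single thread clogs the issue queues. The pending[] array and icount_pick() are illustrative names; real hardware would maintain these counts in the decode/rename/queue stages.

    /* Sketch of the ICOUNT fetch-priority heuristic. */
    #include <stdio.h>

    #define NUM_THREADS 4

    /* pending[t] = instructions thread t currently has in the
     * decode/rename/issue-queue stages (maintained by the pipeline). */
    static int pending[NUM_THREADS] = {12, 3, 7, 5};

    /* ICOUNT: pick the thread with the lowest pending count. */
    static int icount_pick(void) {
        int best = 0;
        for (int t = 1; t < NUM_THREADS; t++)
            if (pending[t] < pending[best])
                best = t;
        return best;
    }

    int main(void) {
        int t = icount_pick();
        printf("fetch from thread %d (pending=%d)\n", t, pending[t]);
        return 0;
    }

The same skeleton would cover BRCOUNT or MISSCOUNT by swapping in unresolved-branch or outstanding-miss counts per thread.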
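To ground the process/thread distinction from Slides 9-11 (and the "latency hiding with threads" big idea), here is a minimal POSIX threads sketch, again not taken from the lecture: two threads in one process share the global variable shared, while each has its own local variable on its own stack (its own PC, registers, and stack). Compile with -pthread.

    /* Two threads, one address space: shared is visible to both,
     * local is private to each thread's stack. */
    #include <pthread.h>
    #include <stdio.h>

    static int shared = 0;                     /* one copy, shared by all threads in the process */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        int local = 0;                         /* private: lives on this thread's own stack */
        for (int i = 0; i < 100000; i++) {
            local++;
            pthread_mutex_lock(&lock);         /* shared memory needs coordination */
            shared++;
            pthread_mutex_unlock(&lock);
        }
        printf("worker done: local = %d\n", local);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;                      /* two PCs/register sets, one page table */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared = %d\n", shared);       /* 200000: both threads updated the same memory */
        return 0;
    }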