Hyper-Threading, Chip multiprocessors and both Zoran Jovanovic


Page 1: Hyper-Threading , Chip multiprocessors and both

Hyper-Threading, Chip multiprocessors and both

Zoran Jovanovic

Page 2: Hyper-Threading , Chip multiprocessors and both

To Be Tackled in Multithreading

• Review of threading algorithms
• Hyper-Threading concepts
• Hyper-Threading architecture
• Advantages/disadvantages

Page 3: Hyper-Threading , Chip multiprocessors and both

Threading Algorithms

• Time-slicing: a processor switches between threads at fixed time intervals. High overhead, especially if one of the processes is in a wait state. (Fine grain)

• Switch-on-event: task switching in case of long pauses. While waiting for data coming from a relatively slow source, CPU resources are given to other processes. (Coarse grain)
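The time-slicing policy can be sketched with cooperative generator-based "threads" (a toy model, not an OS scheduler; all names are illustrative):

```python
# Toy round-robin time-slicing: preempt each "thread" after a fixed
# quantum of steps and requeue it, so threads interleave.
from collections import deque

def run_time_sliced(threads, quantum=2):
    """Run (name, generator) pairs round-robin, `quantum` steps at a time."""
    ready, trace = deque(threads), []
    while ready:
        name, gen = ready.popleft()
        for _ in range(quantum):
            try:
                trace.append((name, next(gen)))
            except StopIteration:
                break          # thread finished: do not requeue
        else:
            ready.append((name, gen))  # quantum expired: requeue
    return trace

def worker(name, steps):
    for i in range(steps):
        yield f"{name}-step{i}"

trace = run_time_sliced([("T1", worker("T1", 3)), ("T2", worker("T2", 3))])
# The two threads interleave in quantum-sized bursts.
```

Switch-on-event would differ only in the switch condition: instead of a fixed quantum, the running thread keeps the processor until it blocks on a long-latency event.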

Page 4: Hyper-Threading , Chip multiprocessors and both

Threading Algorithms (cont.)

• Multiprocessing: distribute the load over many processors. Adds extra cost.

• Simultaneous multi-threading: multiple threads execute on a single processor without switching. The basis of Intel’s Hyper-Threading technology.

Page 5: Hyper-Threading , Chip multiprocessors and both

Hyper-Threading Concept

• At any point in time, only part of the processor’s resources is used to execute a thread’s program code.

• Unused resources can be loaded as well, for example by executing another thread/application in parallel.

• Extremely useful in desktop and server applications where many threads are used.

Page 6: Hyper-Threading , Chip multiprocessors and both

Quick Recall: Many Resources IDLE!

From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.

For an 8-way superscalar.

Slide source: John Kubiatowicz

Page 7: Hyper-Threading , Chip multiprocessors and both


Page 8: Hyper-Threading , Chip multiprocessors and both


(a) A superscalar processor with no multithreading

(b) A superscalar processor with coarse-grain multithreading

(c) A superscalar processor with fine-grain multithreading

(d) A superscalar processor with simultaneous multithreading (SMT)


Page 9: Hyper-Threading , Chip multiprocessors and both

Simultaneous Multithreading (SMT)
Example: the new Pentium with “Hyperthreading”

Key idea: exploit ILP across multiple threads! I.e., convert thread-level parallelism into more ILP, exploiting the following features of modern processors:

• Multiple functional units: modern processors typically have more functional units available than a single thread can utilize.

• Register renaming and dynamic scheduling: multiple instructions from independent threads can co-exist and co-execute!
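As a back-of-the-envelope illustration of the key idea (the numbers are assumptions for illustration, not from the slides): a toy model where a wide core issues whatever ready instructions the running threads expose each cycle.

```python
# Toy SMT issue model: a `width`-wide core issues up to `width` ready
# instructions per cycle. A single thread with ILP of 3 leaves slots
# idle; two independent threads jointly expose twice as many.
def cycles_needed(ready_per_cycle, total_insts, width=8):
    issued, cycles = 0, 0
    while issued < total_insts:
        issued += min(width, ready_per_cycle)
        cycles += 1
    return cycles

one_thread = cycles_needed(ready_per_cycle=3, total_insts=120)   # ILP of 3
two_threads = cycles_needed(ready_per_cycle=6, total_insts=120)  # 2 threads x ILP 3
# Two threads finish the same work in half the cycles, because
# thread-level parallelism fills issue slots a lone thread would waste.
```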

Page 10: Hyper-Threading , Chip multiprocessors and both

Hyper-Threading Architecture

• First used in the Intel Xeon MP processor.
• Makes a single physical processor appear as multiple logical processors.
• Each logical processor has its own copy of the architecture state.
• Logical processors share a single set of physical execution resources.

Page 11: Hyper-Threading , Chip multiprocessors and both

Hyper-Threading Architecture (cont.)

• Operating systems and user programs can schedule processes or threads onto the logical processors as if they were physical processors in a multiprocessing system.

• From an architecture perspective, we have to worry about the logical processors sharing resources: caches, execution units, branch predictors, control logic, and buses.
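This has a software-visible side: the OS enumerates logical, not physical, processors. A minimal check using the standard-library call `os.cpu_count()` (whether the logical-to-physical ratio is 2 depends on the machine, so the sketch only assumes a count exists):

```python
# os.cpu_count() reports *logical* processors: with Hyper-Threading
# enabled this is typically twice the physical core count, but the
# ratio is machine-dependent.
import os

logical = os.cpu_count()
print(f"OS sees {logical} logical processors")
```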

Page 12: Hyper-Threading , Chip multiprocessors and both

Power5 dataflow ...

• Why only two threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to become a bottleneck.

• Cost: the Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.

Page 13: Hyper-Threading , Chip multiprocessors and both

Advantages

• The extra architecture state adds only about 5% to the total die area.
• No performance loss if only one thread is active.
• Increased performance with multiple threads.
• Better resource utilization.

Page 14: Hyper-Threading , Chip multiprocessors and both

Disadvantages

• To take advantage of hyper-threading performance, serial execution cannot be used.
• Threads are non-deterministic and involve extra design effort.
• Threads have increased overhead.
• Shared resource conflicts.

Page 15: Hyper-Threading , Chip multiprocessors and both

Multicore

Multiprocessors on a single chip


Page 16: Hyper-Threading , Chip multiprocessors and both

CS267 Lecture 6

Basic Shared Memory Architecture

• Processors are all connected to a large shared memory. Where are the caches?

• Now take a closer look at structure, costs, limits, programming.

[Figure: processors P1 … Pn connected through an interconnect to shared memory]

Page 17: Hyper-Threading , Chip multiprocessors and both

Slide source: John Kubiatowicz

What About Caching???

• Want high performance for shared memory: use caches! Each processor has its own cache (or multiple caches); data from memory is placed into the cache. With a write-back cache, not all writes are sent over the bus to memory.

• Caches reduce average latency: automatic replication closer to the processor. This is more important for a multiprocessor than a uniprocessor, since latencies are longer.

• Normal uniprocessor mechanisms access the data: loads and stores form a very low-overhead communication primitive.

• Problem: cache coherence!

[Figure: processors P1 … Pn, each with its own cache ($), connected by a bus to memory and I/O devices]

Page 18: Hyper-Threading , Chip multiprocessors and both

Example Cache Coherence Problem

[Figure: P1, P2, and P3, each with a cache ($), on a bus with memory and I/O devices. Memory initially holds u = 5. Event sequence: (1) P1 reads u (caches 5), (2) P3 reads u (caches 5), (3) P3 writes u = 7, (4) P1 reads u again: u = ?, (5) P2 reads u: u = ?]

Things to note:
• Processors could see different values for u after event 3.
• With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when.

How to fix with a bus: a coherence protocol.
• Use the bus to broadcast writes or invalidations; simple protocols rely on the presence of a broadcast medium.
• The bus is not scalable beyond about 64 processors (max), due to capacity and bandwidth limitations.

Slide source: John Kubiatowicz
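The incoherent behavior can be reproduced with a toy write-back cache model (a deliberate simplification with no coherence protocol; class and variable names are mine):

```python
# Toy write-back caches with NO coherence protocol: after P3 writes
# u = 7 into its own cache, neither memory nor the other caches are
# updated, so P1 and P2 keep seeing the old value 5.
class Cache:
    def __init__(self, memory):
        self.memory, self.lines = memory, {}
    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]
    def write(self, addr, value):
        self.lines[addr] = value              # write-back: memory untouched for now

memory = {"u": 5}
p1, p2, p3 = Cache(memory), Cache(memory), Cache(memory)

assert p1.read("u") == 5   # event 1: P1 reads u
assert p3.read("u") == 5   # event 2: P3 reads u
p3.write("u", 7)           # event 3: P3 writes u = 7
stale = p1.read("u")       # event 4: P1 hits its own stale line -> 5
fresh_miss = p2.read("u")  # event 5: P2 misses, fetches memory's old 5
```

A coherence protocol would fix this by invalidating (or updating) the other caches' copies of u at event 3, which is exactly what the bus broadcast described above provides.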

Page 19: Hyper-Threading , Chip multiprocessors and both


Limits of Bus-Based Shared Memory

[Figure: processors with caches on a shared bus with memory modules and I/O]

Assume:
• A 1 GHz processor w/o cache
  => 4 GB/s instruction BW per processor (32-bit)
  => 1.2 GB/s data BW at 30% load-store
• Suppose a 98% instruction hit rate and a 95% data hit rate
  => 80 MB/s instruction BW per processor
  => 60 MB/s data BW per processor, 140 MB/s combined BW
• Assuming 1 GB/s bus bandwidth
  => 8 processors will saturate the bus
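The slide's arithmetic checks out; restated as Python (all figures come from the slide itself):

```python
# Bus traffic is the miss traffic: hit rates keep most accesses on-chip.
inst_bw = 4e9 * (1 - 0.98)   # 4 GB/s inst stream, 2% miss -> 80 MB/s to bus
data_bw = 1.2e9 * (1 - 0.95) # 1.2 GB/s data stream, 5% miss -> 60 MB/s
per_proc = inst_bw + data_bw           # 140 MB/s per processor
procs_to_saturate = 1e9 / per_proc     # ~7.1, so 8 processors saturate a 1 GB/s bus
```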

Page 20: Hyper-Threading , Chip multiprocessors and both


Page 21: Hyper-Threading , Chip multiprocessors and both

Cache Organizations for Multi-cores

• L1 caches are always private to a core.

• L2 caches can be private or shared.

• Advantages of a shared L2 cache: efficient dynamic allocation of space to each core; data shared by multiple cores is not replicated; every block has a fixed “home”, hence it is easy to find the latest copy.

• Advantages of a private L2 cache: quick access to the private L2, good for small working sets; a private bus to the private L2 means less contention.

Page 22: Hyper-Threading , Chip multiprocessors and both

A Reminder: SMT (Simultaneous Multi Threading)

SMT vs. CMP

Page 23: Hyper-Threading , Chip multiprocessors and both

A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer ’97

• For the same area (a one-billion-transistor DRAM-process die):

Superscalar and SMT are very complex:
• Wide issue
• Advanced branch prediction
• Register renaming
• OOO instruction issue
• Non-blocking data caches

[Figure: floorplans comparing a superscalar (SS), an SMT, and a CMP design]

Page 24: Hyper-Threading , Chip multiprocessors and both

SS and SMT vs. CMP

CPU cores: three main hardware design problems of SS and SMT:

• Area increases quadratically with core complexity: the number of registers is O(instruction window size), and register ports are O(issue width). CMP solves this problem (area roughly linear in total issue width).

• Longer cycle times: long wires, many MUXes and crossbars; large buffers, queues, and register files. The alternatives are clustering (which decreases ILP) or deep pipelining (which raises branch-misprediction penalties). CMP allows a small cycle time with little effort: its cores are small and fast, but rely on software to schedule work (poor single-thread ILP).

• Complex design and verification.
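The quadratic-vs-linear area claim can be illustrated with a toy model (the cost function is an assumption for illustration, not Hammond et al.'s data):

```python
# If register-port count and window size both scale with issue width,
# core area grows roughly quadratically with width. A CMP reaches the
# same total issue width with several narrow cores, scaling linearly.
def ss_area(width):
    return width ** 2           # one core of issue width `width`

def cmp_area(total_width, core_width=2):
    cores = total_width // core_width
    return cores * ss_area(core_width)

wide = ss_area(12)        # one 12-issue superscalar core
chip = cmp_area(12)       # six 2-issue cores, same total issue width
# The CMP reaches 12 total issue slots with a small fraction of the area.
```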

Page 25: Hyper-Threading , Chip multiprocessors and both

SS and SMT vs. CMP

Memory:
• A 12-issue SS or SMT requires a multiported data cache (4-6 ports), e.g. 2 x 128 KB (2-cycle latency).
• CMP: 16 x 16 KB (single-cycle latency), but the secondary cache is slower (multiported).
• Shared memory: write-through caches.

[Figure: SMT vs. CMP memory hierarchies]

Page 26: Hyper-Threading , Chip multiprocessors and both

Performance comparison

• Compress (integer app): low ILP and no TLP.

• Mpeg-2 (multimedia app): high ILP and TLP, moderate memory requirements (parallelized by hand).
  + SMT utilizes core resources better
  + But CMP has 16 issue slots instead of 12

• Tomcatv (FP application): large loop-level parallelism and large memory bandwidth (TLP extracted by the compiler).
  + CMP has large memory bandwidth on the primary cache
  - SMT’s fundamental problem: a unified and slow cache

• Multiprogram: integer multiprogramming workload, all computation-intensive (low ILP, high PLP).

Page 27: Hyper-Threading , Chip multiprocessors and both

CMP Motivation

How to utilize the available silicon?
• Speculation (aggressive superscalar)
• Simultaneous multithreading (SMT, Hyper-Threading)
• Several processors on a single chip

What is a CMP (Chip MultiProcessor)?
• Several processors (several masters)
• Both shared- and distributed-memory architectures
• Both homogeneous and heterogeneous processor types

Why?
• Wire delays
• Diminishing returns of uniprocessors
• Very long design and verification times for modern processors

Page 28: Hyper-Threading , Chip multiprocessors and both

A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer ’97

• TLP and PLP will become widespread in future applications
• Various multimedia applications
• Compilers and the OS favour CMP

CMP:
• Better performance with simple hardware
• Higher clock rates, better memory bandwidth
• Shorter pipelines

SMT has better utilization, but CMP has more resources (no wide-issue logic needed).

Although CMP is bad when there is neither TLP nor ILP (compress), SMT and SS are not much better there.

Page 29: Hyper-Threading , Chip multiprocessors and both

A Reminder: SMT (Simultaneous Multi Threading)

SMT:
• Pool of execution units (a wide machine)
• Several logical processors
• A copy of the architecture state for each
• Multiple threads run concurrently
• Better utilization and latency tolerance

CMP:
• Simple cores
• A moderate amount of parallelism
• Threads run concurrently on different cores

Page 30: Hyper-Threading , Chip multiprocessors and both

SMT Dual-core: all four threads can run concurrently.

[Figure: two SMT cores on one chip. Each core has its own BTB and I-TLB, decoder, trace cache, uCode ROM, rename/alloc logic, uop queues, schedulers, integer and floating-point units, and L1 D-cache with D-TLB, together with an L2 cache, control logic, and bus interface. Two threads run on each core, so threads 1-4 execute concurrently.]