
ENGS 116 Lecture 16 1

Multiprocessors and Thread-Level

Parallelism Cont’d

Vincent Berk

November 2, 2005

Reading for Today: Sections 6.1 – 6.3

Reading for Friday: Sections 6.4 – 6.7

Reading for Wednesday: Sections 6.8 – 6.10, 6.14


ENGS 116 Lecture 16 2

An Example Snoopy Protocol

• Invalidation protocol, write-back cache

• Each block of memory is in one state:

– Clean in all caches and up-to-date in memory (Shared)

– OR Dirty in exactly one cache (Exclusive)

– OR Not in any caches

• Each cache block is in one state (track these):

– Shared: block can be read

– OR Exclusive: cache has only copy, it is writable and dirty

– OR Invalid: block contains no data

• Read misses cause all caches to snoop the bus

• Writes to a clean line are treated as misses (see the state-machine sketch below)
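As an illustration (my sketch, not from the lecture), the per-block state machine just described can be written down in C; the state and event names here are invented for clarity:

    /* Sketch of the three-state write-invalidate protocol described above.
       State and event names are illustrative, not from the text. */
    typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

    typedef enum {
        CPU_READ, CPU_WRITE,           /* events from the local processor */
        BUS_READ_MISS, BUS_WRITE_MISS  /* events snooped from the bus */
    } Event;

    /* Next state for one cache block; *writeback is set to 1 when a
       dirty block must be supplied to the requester / flushed to memory. */
    BlockState next_state(BlockState s, Event e, int *writeback) {
        *writeback = 0;
        switch (e) {
        case CPU_READ:                        /* read miss fetches a shared copy */
            return (s == INVALID) ? SHARED : s;
        case CPU_WRITE:                       /* write miss or upgrade: invalidate others */
            return EXCLUSIVE;
        case BUS_READ_MISS:                   /* another cache reads this block */
            if (s == EXCLUSIVE) *writeback = 1;
            return (s == INVALID) ? INVALID : SHARED;
        case BUS_WRITE_MISS:                  /* another cache writes this block */
            if (s == EXCLUSIVE) *writeback = 1;
            return INVALID;
        }
        return s;
    }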


ENGS 116 Lecture 16 3

Figure 6.11 A write-invalidate, cache-coherence protocol for a write-back cache showing the states and state transitions for each block in the cache.


ENGS 116 Lecture 16 4

Figure 6.12 Cache-coherence state diagram with the state transitions induced by the local processor shown in black and by the bus activities shown in gray.


ENGS 116 Lecture 16 5

Snooping Cache Variations

• Basic Protocol: Exclusive, Shared, Invalid

• Berkeley Protocol: Owned Exclusive, Owned Shared, Shared, Invalid
  – Owner can update via bus invalidate operation
  – Owner must write back when replaced in cache

• Illinois Protocol: Private Dirty, Private Clean, Shared, Invalid
  – If read is sourced from memory, then Private Clean
  – If read is sourced from another cache, then Shared
  – Can write in cache if held Private Clean or Private Dirty

• MESI Protocol: Modified (private, ≠ memory), Exclusive (private, = memory), Shared (shared, = memory), Invalid


ENGS 116 Lecture 16 6

Implementing Snooping Caches

• Multiple processors share the bus, with access to both addresses and data

• Add a few new bus commands to perform coherency, in addition to read and write

• Processors continuously snoop on the address bus
  – If an address matches a tag, either invalidate or update

• Since every bus transaction checks cache tags, snooping could interfere with the CPU:
  – Solution 1: duplicate the set of tags for the L1 caches, to allow snoop checks in parallel with the CPU
  – Solution 2: the L2 cache tags already form a duplicate set, provided L2 obeys inclusion with the L1 cache
    • Block size and associativity of L2 then affect L1


ENGS 116 Lecture 16 7

Implementing Snooping Caches

• The bus serializes writes: getting the bus ensures that no one else can perform a memory operation at the same time

• On a miss in a write-back cache, another cache may have the desired copy while it is dirty, so it must reply instead of memory

• Add an extra state bit to each cache block to record whether the block is shared

• Adding a 4th state gives the MESI protocol


ENGS 116 Lecture 16 8

Larger Multiprocessors

• Separate memory per processor

• Local or remote access via the memory controller

• One cache-coherency solution: non-cached pages

• Alternative: a directory that tracks the state of every block in every cache
  – Which caches have copies of the block, dirty vs. clean, ...

• Info per memory block vs. per cache block?
  – PLUS: an in-memory directory gives a simpler protocol (centralized / one location)
  – MINUS: an in-memory directory grows as ƒ(memory size) rather than ƒ(cache size)

• How to prevent the directory from becoming a bottleneck?
  – Distribute the directory entries with the memory, each node keeping track of which processors have copies of its blocks


ENGS 116 Lecture 16 9

Distributed-Directory Multiprocessors

[Figure: eight nodes, each containing a processor + cache, memory, I/O, and a directory, connected by an interconnection network.]


ENGS 116 Lecture 16 10

Directory Protocol

• Similar to Snoopy Protocol: Three states

– Shared: ≥ 1 processor has the data, and memory is up-to-date

– Uncached (no processor has it; not valid in any cache)

– Exclusive: 1 processor (owner) has data; memory out-of-date

• In addition to the cache state, must track which processors have the data when in the shared state (usually a bit vector: bit i is 1 if processor i has a copy; see the sketch below)

• Keep it simple:

– Writes to non-exclusive data ⇒ write miss

– Processor blocks until access completes

– Assume messages received and acted upon in order sent
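For concreteness, a minimal C sketch of such a directory entry (the struct layout, names, and 64-processor limit are my assumptions, not from the lecture):

    #include <stdint.h>

    /* Hypothetical directory entry for one memory block (up to 64 processors). */
    typedef enum { UNCACHED, SHARED_STATE, EXCLUSIVE_STATE } DirState;

    typedef struct {
        DirState state;
        uint64_t sharers;   /* bit i set => processor i has a copy */
    } DirEntry;

    /* Read miss from processor p: p joins the sharers.  (If the block was
       exclusive, a real protocol would first fetch the data from the owner.) */
    void dir_read_miss(DirEntry *d, int p) {
        d->state = SHARED_STATE;
        d->sharers |= (uint64_t)1 << p;
    }

    /* Write miss from processor p: p becomes the single (exclusive) owner.
       A real protocol would first send invalidates to every sharer bit set. */
    void dir_write_miss(DirEntry *d, int p) {
        d->state = EXCLUSIVE_STATE;
        d->sharers = (uint64_t)1 << p;
    }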


ENGS 116 Lecture 16 11

Figure 6.29 State transition diagram for an individual cache block in a directory-based system.


ENGS 116 Lecture 16 12

Figure 6.30 The state transition diagram for the directory has the same states and structure as the transition diagram for an individual cache.


ENGS 116 Lecture 16 13

Summary

• Caches contain all information on state of cached memory blocks

• Snooping and directory protocols are similar; a bus makes snooping easier because of broadcast (snooping ⇒ uniform memory access)

• Directory has extra data structure to keep track of state of all cache blocks

• Distributed directory ⇒ scalable shared-address multiprocessor ⇒ cache-coherent, non-uniform memory access


ENGS 116 Lecture 16 14

Synchronization

• Why synchronize? Need to know when it is safe for different processes to use shared data

• Issues for synchronization:

– Uninterruptible instruction to fetch and update memory (atomic operation)

– User-level synchronization operation using this primitive

– For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization


ENGS 116 Lecture 16 15

Uninterruptible Instruction to Fetch and Update Memory

• Atomic exchange: interchange a value in a register for a value in memory
  – 0 ⇒ synchronization variable is free
  – 1 ⇒ synchronization variable is locked and unavailable
  – Set register to 1 & swap
  – New value in register determines success in getting lock:
    • 0 if you succeeded in setting the lock (you were first)
    • 1 if another processor had already claimed access
  – Key is that the exchange operation is indivisible

• Test-and-set: tests a value and sets it if the value passes the test

• Fetch-and-increment: returns the value of a memory location and atomically increments it
  – 0 ⇒ synchronization variable is free
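For comparison (my mapping, not part of the lecture), C11's <stdatomic.h> provides these primitives directly; a minimal sketch:

    #include <stdatomic.h>

    atomic_int lock = 0;      /* 0 => free, 1 => locked */
    atomic_int counter = 0;

    /* Atomic exchange as a lock: the returned old value says who won. */
    void acquire(void) {
        while (atomic_exchange(&lock, 1) != 0)
            ;                 /* 1 => another processor already claimed it */
    }

    void release(void) {
        atomic_store(&lock, 0);
    }

    /* Fetch-and-increment: atomically returns the old value and bumps it. */
    int next_ticket(void) {
        return atomic_fetch_add(&counter, 1);
    }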


ENGS 116 Lecture 16 16

Uninterruptible Instruction to Fetch and Update Memory

• Hard to have read and write in 1 instruction: use 2 instead

• Load linked (or load locked) + store conditional
  – Load linked returns the initial value
  – Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise

• Example doing atomic swap with LL & SC:

    try:  mov   R3,R4      ; move exchange value
          ll    R2,0(R1)   ; load linked
          sc    R3,0(R1)   ; store conditional
          beqz  R3,try     ; branch if store fails (R3 = 0)
          mov   R4,R2      ; put load value in R4

• Example doing fetch & increment with LL & SC:

    try:  ll    R2,0(R1)   ; load linked
          addi  R2,R2,#1   ; increment (OK if reg–reg)
          sc    R2,0(R1)   ; store conditional
          beqz  R2,try     ; branch if store fails (R2 = 0)


ENGS 116 Lecture 16 17

User-Level Synchronization

• Spin locks: processor continuously tries to acquire, spinning around a loop trying to get the lock

            li    R2,#1
    lockit: exch  R2,0(R1)   ; atomic exchange
            bnez  R2,lockit  ; already locked?

• What about MP with cache coherency?

– Want to spin on cache copy to avoid full memory latency

– Likely to get cache hits for such variables

• Problem: exchange includes a write, which invalidates all other copies; this generates considerable bus traffic

• Solution: start by simply repeatedly reading the variable; when it changes, then try exchange (“test and test&set”):

    try:    li    R2,#1
    lockit: lw    R3,0(R1)   ; load lock variable
            bnez  R3,lockit  ; not free ⇒ spin
            exch  R2,0(R1)   ; atomic exchange
            bnez  R2,try     ; already locked?
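The same test-and-test&set loop, sketched in C11 atomics (assuming the lock encoding from the earlier sketch): spin on plain reads of the cached copy, and only attempt the invalidating exchange once the lock looks free:

    #include <stdatomic.h>

    void lock_tts(atomic_int *l) {
        for (;;) {
            while (atomic_load(l) != 0)
                ;                          /* spin on cached copy: no bus writes */
            if (atomic_exchange(l, 1) == 0)
                return;                    /* we set it first: lock acquired */
            /* lost the race: fall back to read-only spinning */
        }
    }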


ENGS 116 Lecture 16 18

Memory Consistency

• What is consistency? When must a processor see the new value?

    P1:  A = 0;              P2:  B = 0;
         .....                    .....
         A = 1;                   B = 1;
    L1:  if (B == 0) ...     L2:  if (A == 0) ...

• Is it impossible for both statements L1 & L2 to be true?
  – What if the write invalidate is delayed & the processor continues?

• Memory consistency models: what are the rules for such cases?

• Sequential consistency (SC): result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved

– delay completion of memory access until all invalidations complete

– delay memory access until previous one is complete
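A runnable translation of the example (my construction) into C11 threads and atomics, which are sequentially consistent by default: under SC, "L1 taken" and "L2 taken" can never both print.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int A = 0, B = 0;

    int p1(void *arg) {
        (void)arg;
        atomic_store(&A, 1);           /* A = 1 (seq_cst) */
        if (atomic_load(&B) == 0)      /* L1 */
            puts("L1 taken");
        return 0;
    }

    int p2(void *arg) {
        (void)arg;
        atomic_store(&B, 1);           /* B = 1 (seq_cst) */
        if (atomic_load(&A) == 0)      /* L2 */
            puts("L2 taken");
        return 0;
    }

    int main(void) {
        thrd_t t1, t2;
        thrd_create(&t1, p1, NULL);
        thrd_create(&t2, p2, NULL);
        thrd_join(t1, NULL);
        thrd_join(t2, NULL);
        return 0;
    }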


ENGS 116 Lecture 16 19

Memory Consistency Model

• More efficient approach is to assume that programs are synchronized

– All accesses to shared data are ordered by synchronization ops

    write(x)
    ...
    release(s)   // unlock
    ...
    acquire(s)   // lock
    ...
    read(x)

• Only programs willing to be nondeterministic are left unsynchronized; the outcome then depends on relative processor speed, and a data race occurs

• Several relaxed models for memory consistency exist, since most programs are synchronized; they are characterized by which orderings they relax among RAR, WAR, RAW, and WAW accesses to different addresses
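A lock gives exactly the write–release–acquire–read ordering shown above; a minimal C11 sketch of the pattern (variable names are mine, mtx_* is from <threads.h>):

    #include <threads.h>

    mtx_t s;   /* assume mtx_init(&s, mtx_plain) was called once at startup */
    int x;

    void writer(void) {
        mtx_lock(&s);        /* acquire(s) */
        x = 42;              /* write(x) */
        mtx_unlock(&s);      /* release(s): the write is ordered before this */
    }

    int reader(void) {
        mtx_lock(&s);        /* acquire(s): ordered after the writer's release */
        int v = x;           /* read(x) sees the value written before release */
        mtx_unlock(&s);
        return v;
    }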


ENGS 116 Lecture 16 20

Measuring Performance of Parallel Systems

• Wall-clock time is a good measure

• Often want to know how performance scales as processors are added to the system

– Unscaled speedup: processors added to fixed-size problem

– Scaled speedup: processors added to scaled-size problem

• For scaled speedup, how do we scale the application?

– Memory-constrained scaling

– Time-constrained scaling (CPU power)

– Bandwidth-constrained scaling (communication)

• How do we measure the uniprocessor application?
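In formula form (standard definitions, not spelled out on the slide): if T(1) is the uniprocessor time and T(p) the time on p processors, then

    Speedup(p) = T(1) / T(p)

For unscaled speedup, T(1) and T(p) use the same fixed-size problem; for scaled speedup the problem grows with p, so the uniprocessor time must be measured (or estimated) on the scaled problem.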


ENGS 116 Lecture 16 21

Simultaneous Multithreading

• Process:
  – Address space
  – Context
  – One or more execution threads

• Thread:
  – Shared heap (with other threads in the process)
  – Shared text
  – Exclusive stack (one per thread)
  – Exclusive CPU context
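A small C11 threads sketch of that split (illustrative, not from the slides): the global array is shared by every thread in the process, while each thread's locals live on its own stack:

    #include <stdio.h>
    #include <threads.h>

    int shared_data[2];                 /* shared: one address space for all threads */

    int worker(void *arg) {
        int idx = *(int *)arg;          /* idx lives on this thread's private stack */
        shared_data[idx] = idx + 1;     /* each thread writes its own slot: no race */
        return 0;
    }

    int main(void) {
        thrd_t t[2];
        int ids[2] = {0, 1};
        for (int i = 0; i < 2; i++)
            thrd_create(&t[i], worker, &ids[i]);
        for (int i = 0; i < 2; i++)
            thrd_join(t[i], NULL);
        printf("%d %d\n", shared_data[0], shared_data[1]);  /* prints "1 2" */
        return 0;
    }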


ENGS 116 Lecture 16 22

The Big Picture

Figure 6.44


ENGS 116 Lecture 16 23

Superscalar

• Each thread is treated as a separate process

• Address space is shared through shared pages

• A thread switch requires a full context switch

• Examples:
  – Linux 2.4


ENGS 116 Lecture 16 24

Real Multithreading

• Thread switches don’t require context switches

• Threads are switched on:
  – Timeout
  – Blocking for I/O

• CPU has to keep multiple states (contexts), one for each concurrent thread:

– Program Counter

– Register Array (usually done through renaming)

• Since memory is shared:
  – Threads operate in the same virtual address space
  – TLB and caches are specific to the process, the same for each thread


ENGS 116 Lecture 16 25

Coarse MT

• CPU hardware support is inexpensive:
  – The number of concurrent threads is low (< 4), equal to the number of hardware contexts

• The thread library prepares the operating system

• Threads are hard to share between CPUs

• Switching occurs on block boundaries:
  – Instruction cache block
  – Instruction fetch block

• Implemented in:
  – SPARC Solaris


ENGS 116 Lecture 16 26

Fine MT

• Switching between threads on each cycle

• Hardware support is extensive:
  – Has to support several concurrent contexts
  – Hazards in the write-back stage (RAW, WAW, WAR)

• IBM used both Fine MT and SMT in experimental CPUs


ENGS 116 Lecture 16 27

Simultaneous Multithreading (SMT)

• Coolest of all

• Interleaves instructions from multiple threads in each cycle

• Hardware Control Logic borders on the impossible

• Extensive context and renaming logic

• Very tight coupling with operating system

• IBM had great successes:
  – Bottleneck: memory BW

• Intel recently implemented “Hyperthreading” in the Xeon series:
  – A slimmed-down version of SMT
  – No OS support needed: the OS simply sees 2 ‘virtual’ CPUs
  – Just one more marketing term from our friends in Silicon Valley