
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)

Chapter 5, Appendix F, Appendix I

CPE731 - Dr. Iyad Jafar

OUTLINE

Introduction (5.1)

Multiprocessor Architecture

Challenges in Parallel Processing

Centralized Shared Memory Architectures (5.2)

Performance of SMP (5.3)


INTRODUCTION


[Figure annotations: RISC; Technology improvement, new architectures and organization; Power and ILP limitations?; Move to multiprocessors]


INTRODUCTION

Why multiprocessors?
- Increased costs of silicon and energy to exploit ILP
- Increasing performance of the desktop is less important
- Advantage of replication rather than unique design
- Improved understanding of how to use multiprocessors effectively, especially in servers!
  - Significant natural parallelism in large datasets, scientific codes, and independent requests
- Growing interest in high-end servers for cloud computing and SaaS
- A growth in data-intensive applications

INTRODUCTION

- Multiprocessor
  - Tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space
  - 2-32 processors
  - Single-chip system (multicore) or multiple multicore chips
- Multiprocessors exploit thread-level parallelism
  - Parallel programming: execute tightly coupled threads that collaborate on a single task
  - Request-level parallelism: execute multiple independent processes
    - Single program or multiple applications (multiprogramming)
- Multicomputers!

INTRODUCTION

- To maximize the advantage of a multiprocessor with n processors, we need at least n threads
- Independent threads are created by the programmer or the operating system
- TLP may exploit DLP
  - A thread may execute some iterations of a loop to exploit data-level parallelism
- Grain size must be sufficiently large to compensate for the thread overhead!
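
The grain-size point can be illustrated with a toy model. The sketch below splits loop iterations into contiguous chunks, one per thread, and estimates the speedup when every thread pays a fixed startup overhead. The function names, the cost model, and the numbers are illustrative assumptions, not measurements or part of the lecture.

```python
def partition_iterations(total_iterations, n_threads):
    """Split range(total_iterations) into n_threads contiguous, near-equal chunks."""
    base, extra = divmod(total_iterations, n_threads)
    chunks = []
    start = 0
    for t in range(n_threads):
        # The first `extra` threads take one additional iteration each.
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def estimated_speedup(total_iterations, time_per_iteration, n_threads, overhead_per_thread):
    """Assumed model: parallel time = work of the largest chunk + fixed thread overhead."""
    serial_time = total_iterations * time_per_iteration
    largest_chunk = max(len(c) for c in partition_iterations(total_iterations, n_threads))
    parallel_time = largest_chunk * time_per_iteration + overhead_per_thread
    return serial_time / parallel_time


# Large grain: 1000 iterations over 4 threads, overhead worth 50 iterations.
print(estimated_speedup(1000, 1.0, 4, 50.0))
# Small grain: 8 iterations over 4 threads, same overhead; the result is below 1.
print(estimated_speedup(8, 1.0, 4, 50.0))
```

Under this model the small-grain case runs slower than the serial loop, because the per-thread overhead dominates the two iterations each thread executes.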

MULTIPROCESSOR ARCHITECTURE

- Symmetric shared-memory multiprocessors (SMPs)
  - Centralized shared-memory multiprocessors
  - Small number of cores
  - Share a single memory with uniform latency (UMA)


- Distributed shared-memory multiprocessors (DSMs)
  - Larger number of processors
  - Memory distributed among processors
  - Non-uniform memory access/latency (NUMA)
  - Processors connected via direct (switched) and indirect (multi-hop) interconnection networks


- The term shared memory in both architectures implies that threads communicate with each other through the same address space
  - i.e. any processor can reference any memory location as long as it has access rights
- In DSM, the distributed memory adds communication complexity and overhead

CHALLENGES

Limited Parallelism in Programs
- Example. Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?
- How can this be addressed?
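
One way to work this example is with Amdahl's law, speedup = 1 / (s + (1 - s) / n), where s is the sequential fraction and n the processor count. The sketch below solves it for s, assuming the parallel part scales perfectly across all n processors. The function names are illustrative.

```python
def amdahl_speedup(sequential_fraction, n_processors):
    """Amdahl's law with a perfectly parallel remainder."""
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / n_processors)


def max_sequential_fraction(target_speedup, n_processors):
    """Solve 1/S = s + (1 - s)/n for s, giving s = (n/S - 1) / (n - 1)."""
    return (n_processors / target_speedup - 1.0) / (n_processors - 1.0)


s = max_sequential_fraction(80, 100)
print(f"sequential fraction: {s:.4%}")          # about 0.25%
print(f"check speedup: {amdahl_speedup(s, 100):.2f}")
```

For a speedup of 80 on 100 processors, only about 0.25% of the original computation can be sequential.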

CHALLENGES

Communication Overhead
- Example. Suppose we have an application running on a 32-processor multiprocessor, which takes 200 ns to handle a reference to a remote memory.
- For this application, assume that all the references except those involving communication hit in the local memory hierarchy, which is slightly optimistic.
- Processors are stalled on a remote request, and the processor clock rate is 3.3 GHz.
- If the base CPI (assuming that all references hit in the cache) is 0.5, how much faster is the multiprocessor if there is no communication versus if 0.2% of the instructions involve a remote communication reference?

CHALLENGES

Communication Overhead
- Example (solution). At 3.3 GHz the cycle time is about 0.3 ns.

  CPI_com = CPI_ideal + remote request rate × remote request penalty
          = 0.5 + 0.002 × (200 ns / 0.3 ns)
          ≈ 0.5 + 0.002 × 667
          ≈ 0.5 + 1.33
          = 1.83

  Speedup = 1.83 / 0.5 ≈ 3.7

  The multiprocessor with all local references is about 3.7 times faster.
- How to address this?
  - SW
  - HW
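
The calculation can be checked numerically. The sketch below evaluates CPI = base CPI + remote fraction × (remote latency / cycle time), using the rounded cycle time of 0.3 ns from the example and carrying full precision through the arithmetic. The function name and parameter names are illustrative.

```python
def cpi_with_communication(base_cpi, remote_fraction, remote_latency_ns, cycle_time_ns):
    """Effective CPI when a fraction of instructions stall on a remote reference."""
    penalty_cycles = remote_latency_ns / cycle_time_ns
    return base_cpi + remote_fraction * penalty_cycles


base_cpi = 0.5
cpi = cpi_with_communication(base_cpi, 0.002, 200.0, 0.3)
print(f"remote penalty: {200.0 / 0.3:.0f} cycles")
print(f"CPI with communication: {cpi:.2f}")
print(f"slowdown versus all-local references: {cpi / base_cpi:.2f}x")
```

With these inputs the remote penalty is about 667 cycles, so even 0.2% remote references more than triple the effective CPI.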

SMP ARCHITECTURES


[Figure: Intel Nehalem (Nov 2008)]

SMP ARCHITECTURES

- SMPs support caching of private and shared data
  - Reduces latency, bandwidth demand, and contention
- Caching private data is not a problem! It behaves as in a uniprocessor
- Caching shared data raises issues for memory system behavior
  - Coherence: what values can be returned by a read
  - Consistency: when a written value will be returned by a read

CACHE COHERENCE


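[Figure: the cache coherence problem. Processors P1, P2, and P3 with private caches share a memory holding X = 5. Cached copies hold X = 5, one copy is updated to X = 8, and the values seen by later reads on other processors are marked X = ?. Events are numbered 1 to 5.]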
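
The coherence problem can be reproduced with a deliberately broken model: private write-back caches with no coherence protocol at all. The sketch below is an assumption-laden illustration; the class name and the choice of which processor performs which access are hypothetical, not taken from the slide.

```python
class IncoherentCache:
    """A private write-back cache with NO coherence protocol (intentionally broken)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory: address -> value
        self.lines = {}        # private cached copies: address -> value

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]               # hit: may return a stale copy

    def write(self, addr, value):
        self.lines[addr] = value              # write-back: memory is not updated


memory = {"X": 5}
p1, p2, p3 = (IncoherentCache(memory) for _ in range(3))

p1.read("X")              # event 1: P1 caches X = 5
p3.read("X")              # event 2: P3 caches X = 5
p3.write("X", 8)          # event 3: P3 updates only its own copy
print(p1.read("X"))       # event 4: prints 5, a stale cached copy
print(p2.read("X"))       # event 5: prints 5, a stale value from memory
```

Both later reads miss the new value 8, which is the behavior a coherence protocol has to rule out.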

CACHE COHERENCE

A memory system is coherent if
- Program order is preserved: a read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
- Coherent view of memory: a read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
- Write serialization: two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1

BASIC SCHEMES FOR ENFORCING COHERENCE

- A program running on multiple processors will normally have copies of the same data in several caches
- In a coherent multiprocessor, caches provide migration and replication of shared data
- Migration
  - Move data to a local cache and use it transparently
  - Reduces latency and bandwidth demand
- Replication
  - Copy data to individual caches for simultaneous reads
  - Reduces latency and bus contention
- Use a hardware protocol to keep caches coherent instead of a software approach

BASIC SCHEMES FOR ENFORCING COHERENCE

- Directory-based protocols
  - The sharing status of a shared block is kept in one (or more) location, i.e. the directory
  - In SMP, a centralized directory
  - In DSM, distributed directories
- Snooping-based protocols
  - Every cache that has a copy of a shared block keeps track of its sharing status
  - In SMP:
    - Caches are accessible via some broadcast medium
    - Each cache monitors (snoops) the medium to determine whether it has a copy of the requested block
  - Can be used in a multichip multiprocessor on top of a directory protocol within each multicore

SNOOPING COHERENCE PROTOCOLS

- Write-update protocol (broadcast)
  - A write to a cached shared item updates all cached copies via the medium
  - Less popular; consumes bandwidth!
- Write-invalidate protocol
  - A write to a shared cached item invalidates all other cached copies (exclusive access)
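
The bandwidth argument against write-update can be made concrete with a small counting model. The sketch below counts broadcasts for a run of consecutive writes by one processor to a block that other caches hold, assuming no other processor touches the block in between. The function name and this access pattern are simplifying assumptions.

```python
def bus_broadcasts(protocol, consecutive_writes):
    """Broadcasts needed for consecutive writes by one processor to a shared block."""
    if consecutive_writes == 0:
        return 0
    if protocol == "write-update":
        # Every write must be broadcast so all cached copies stay current.
        return consecutive_writes
    if protocol == "write-invalidate":
        # The first write invalidates the other copies; later writes hit locally.
        return 1
    raise ValueError(f"unknown protocol: {protocol}")


for writes in (1, 10, 100):
    print(writes, bus_broadcasts("write-update", writes), bus_broadcasts("write-invalidate", writes))
```

The model ignores the extra misses that invalidation causes when other processors read the block again, so it shows only the write-side traffic.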

BASIC IMPLEMENTATION TECHNIQUES

- A bus or broadcast medium
  - Perform invalidates by acquiring the bus first, then broadcasting the address
  - Other processors snoop and check their caches for the broadcast address
  - Invalidations by different processors are serialized by bus arbitration
- Locating shared items on a miss
  - Simple with write-through!
  - Write-back is more difficult!
  - However, with write-back, caches can snoop for read requests as well and provide the data if they hold it in the dirty state
- Write buffers?

BASIC IMPLEMENTATION TECHNIQUES

- Tracking state
  - Use the cache tags and the valid and dirty bits to implement snooping
  - One extra bit tracks the sharing state of each block
    - Exclusive/Modified state: the processor has a modified copy of the block. There is no need to send invalidates on successive writes by the same processor
    - Shared: the block is (potentially) in more than one private cache
- Finite-state controller in each core
  - Responds to requests from the core and from the medium
  - Changes the state of a cached block: Invalid, Modified, or Shared

EXAMPLE PROTOCOL (INVALIDATE & WB)


Why write back?
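
A minimal sketch of a write-invalidate, write-back snooping protocol with the states Invalid, Modified, and Shared follows. It assumes one-word blocks, an atomic bus, and no evictions. The class and method names are hypothetical, and this is not a reconstruction of the slide's state diagram.

```python
class Bus:
    """Atomic broadcast medium connecting the caches to main memory."""

    def __init__(self, memory):
        self.memory = memory   # address -> value
        self.caches = []
        self.log = []          # (kind, address, sender name) for inspection

    def broadcast(self, kind, addr, sender):
        self.log.append((kind, addr, sender.name))
        for cache in self.caches:
            if cache is not sender:
                cache.snoop(kind, addr)


class MSICache:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.state = {}        # address -> "M" or "S"; absent means Invalid
        self.data = {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.state:                       # Invalid: read miss
            self.bus.broadcast("read_miss", addr, self)  # a Modified owner writes back first
            self.data[addr] = self.bus.memory[addr]
            self.state[addr] = "S"
        return self.data[addr]

    def write(self, addr, value):
        current = self.state.get(addr)
        if current is None:                              # Invalid -> Modified
            self.bus.broadcast("write_miss", addr, self)
        elif current == "S":                             # Shared -> Modified
            self.bus.broadcast("invalidate", addr, self)
        # Modified: write hit, no bus traffic needed
        self.state[addr] = "M"
        self.data[addr] = value

    def snoop(self, kind, addr):
        current = self.state.get(addr)
        if current is None:
            return
        if current == "M":
            self.bus.memory[addr] = self.data[addr]      # write back the dirty block
        if kind == "read_miss":
            self.state[addr] = "S"                       # keep a clean shared copy
        else:                                            # write_miss or invalidate
            del self.state[addr]
            del self.data[addr]


memory = {"X": 5}
bus = Bus(memory)
p1, p2, p3 = (MSICache(name, bus) for name in ("P1", "P2", "P3"))
p1.read("X")
p3.read("X")
p3.write("X", 8)        # invalidates P1's copy
p3.write("X", 9)        # hit in Modified: no bus transaction
print(p1.read("X"))     # prints 9: P3 writes back, then memory supplies the block
print(bus.log)
```

The write-back on a snooped read miss is what lets memory supply an up-to-date block to the requester; the second write by P3 generates no bus traffic because the block is already Modified.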



EXTENSIONS TO MSI PROTOCOL

- The previous protocol is called MSI
- Many extensions exist
  - Add states and/or transactions to improve performance
- MESI protocol (the Intel i7 uses a variant, MESIF)
  - The Exclusive state is added to indicate that a cache line is the same as main memory and is the only cached copy
  - When the state changes on a read miss, there is no need to write the block back to memory
- MOESI protocol (AMD Opteron)
  - MSI and MESI update memory whenever the state of a Modified block changes to Shared
  - MOESI adds the Owned state to indicate that a block is owned by that cache and is out of date in memory
  - In MOESI, a block can change from Modified to Owned without being written to memory
  - The owner must supply the block on a miss, since memory is not up to date, and must write it back to memory when the block is replaced

LIMITATIONS

- Centralized memory can become a bottleneck as the number of processors or their memory demands increase
- A high-bandwidth connection to the L3 cache has allowed 4 to 8 cores. However, it is not likely to scale!
  - Multiple buses and interconnection networks such as crossbars or small point-to-point networks
  - Banked memory or cache

LIMITATIONS

- Snooping bandwidth could become a problem: each processor must examine every miss
- Snooping may interfere with cache operation
  - Duplicate the cache tags
  - Centralized directory in the outermost cache
  - Does not eliminate the bottleneck at the bus

PERFORMANCE OF SMPS

- Performance is determined by
  - Traffic caused by the processors' cache misses
  - Traffic caused by communication
- Both are affected by processor count, cache size, and block size
- Coherence adds a fourth C to the 3Cs classification of misses
- Types of coherence misses
  - True sharing misses
  - False sharing misses
    - Single valid bit per block
    - Writing a word in a block invalidates the entire block
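
False sharing is only possible when two distinct words fall in the same cache block. The sketch below checks that condition for two byte addresses; the function names, the addresses, and the 64-byte block size are illustrative assumptions.

```python
def block_number(address, block_size_bytes):
    """Index of the cache block that contains the given byte address."""
    return address // block_size_bytes


def may_false_share(addr_a, addr_b, block_size_bytes):
    """True if two different addresses map to the same block, so a write to one
    invalidates cached copies that also hold the other."""
    return addr_a != addr_b and (
        block_number(addr_a, block_size_bytes) == block_number(addr_b, block_size_bytes)
    )


print(may_false_share(0x1000, 0x1008, 64))   # True: both words lie in one 64-byte block
print(may_false_share(0x1000, 0x1040, 64))   # False: the words are 64 bytes apart
```

Padding per-thread data to a block boundary, as in the second call, is a common way to avoid false sharing, at the cost of some wasted cache space.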

PERFORMANCE OF SMPS

Coherence Misses Example

Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit.

PERFORMANCE OF SMPS

1998 Study
- Processor
  - AlphaServer 4100 with four Alpha 21164 processors
  - Up to 4 instructions issued per clock at 300 MHz
- Workload
  - TPC-B: online transaction processing (OLTP)
  - TPC-D: decision support system (DSS)
  - AltaVista: Web index search

PERFORMANCE OF SMPS


OLTP has the poorest performance due to memory hierarchy problems

Consider evaluating OLTP while varying the L3 cache size, block size, and number of processors

PERFORMANCE OF SMPS

Biggest improvement when moving from 1 to 2 MB L3?

PERFORMANCE OF SMPS

Instruction and capacity misses drop, but true sharing, false sharing, and compulsory misses are unaffected!

PERFORMANCE OF SMPS


Increase of true sharing misses!

PERFORMANCE OF SMPS


Reduce true sharing misses!
