ECE 1747: Parallel Programming. Basics of Parallel Architectures: Shared-Memory Machines

Page 1: ECE 1747: Parallel Programming

ECE 1747: Parallel Programming

Basics of Parallel Architectures:

Shared-Memory Machines

Page 2: ECE 1747: Parallel Programming

Two Parallel Architectures

• Shared memory machines.

• Distributed memory machines.

Page 3: ECE 1747: Parallel Programming

Shared Memory: Logical View

[Figure: proc1, proc2, proc3, ..., procN all attached to one shared memory space]

Page 4: ECE 1747: Parallel Programming

Shared Memory Machines

• Small number of processors: shared memory with coherent caches (SMP).

• Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

Page 5: ECE 1747: Parallel Programming

SMPs

• 2- or 4-processor PCs are now commodity.

• Good price/performance ratio.

• Memory is sometimes a bottleneck (see later).

• Typical price (8-node): ~ $20-40k.

Page 6: ECE 1747: Parallel Programming

Physical Implementation

[Figure: proc1, proc2, proc3, ..., procN, each with its own cache (cache1 ... cacheN), connected by a bus to the shared memory]

Page 7: ECE 1747: Parallel Programming

Shared Memory Machines

• Small number of processors: shared memory with coherent caches (SMP).

• Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

Page 8: ECE 1747: Parallel Programming

CC-NUMA: Physical Implementation

[Figure: proc1, proc2, proc3, ..., procN, each with its own cache (cache1 ... cacheN) and its own memory (mem1 ... memN), joined by an inter-connect]

Page 9: ECE 1747: Parallel Programming

Caches in Multiprocessors

• Suffer from the coherence problem:
  – same line appears in two or more caches
  – one processor writes a word in the line
  – other processors can now read stale data

• Leads to the need for a coherence protocol
  – avoids coherence problems

• Many protocols exist; we will look at just a simple one.
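
The stale-read scenario in the first bullet can be replayed with a toy model. This is only a sketch under stated assumptions: one memory word, one private copy per processor, write-back caching, and no coherence protocol at all. The names `toy_cache`, `cached_read`, `cached_write`, and `stale_read_demo` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (an assumption for illustration): one memory word and one
 * private cached copy per processor, with NO coherence protocol. */
typedef struct {
    bool valid;   /* does this cache hold a copy of the word? */
    int  value;   /* the cached copy */
} toy_cache;

static int memory_word = 0;

/* Read through the cache: fetch from memory only on a miss. */
static int cached_read(toy_cache *c)
{
    assert(c != 0);
    if (!c->valid) {
        c->value = memory_word;
        c->valid = true;
    }
    return c->value;
}

/* Write-back style write: updates only the local copy. */
static void cached_write(toy_cache *c, int v)
{
    assert(c != 0);
    c->value = v;
    c->valid = true;
}

/* Replays the three steps from the slide and returns what P1 reads. */
static int stale_read_demo(void)
{
    toy_cache p0 = { false, 0 }, p1 = { false, 0 };
    memory_word = 0;

    (void)cached_read(&p0);   /* same line appears in two caches      */
    (void)cached_read(&p1);
    cached_write(&p0, 42);    /* one processor writes a word in it    */
    return cached_read(&p1);  /* the other processor reads its copy   */
}
```

In this model `stale_read_demo()` returns the old value 0 even though P0 has already written 42, which is exactly the situation a coherence protocol is meant to rule out.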

Page 10: ECE 1747: Parallel Programming

What is coherence?

• What does it mean to be shared?

• Intuitively, a read returns the last value written.

• Notion is not well-defined in a system without a global clock.

Page 11: ECE 1747: Parallel Programming

The Notion of “last written” in a Multi-processor System

[Figure: separate timelines for processors P0, P1, P2, P3, carrying two writes w(x) and two reads r(x) in total]

Page 12: ECE 1747: Parallel Programming

The Notion of “last written” in a Single-machine System

w(x) w(x) r(x) r(x)

Page 13: ECE 1747: Parallel Programming

Coherence: a Clean Definition

• Is achieved by referring back to the single machine case.

• Called sequential consistency.

Page 14: ECE 1747: Parallel Programming

Sequential Consistency (SC)

• Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.

Page 15: ECE 1747: Parallel Programming

Returning to our Example

[Figure: separate timelines for processors P0, P1, P2, P3, carrying two writes w(x) and two reads r(x) in total]

Page 16: ECE 1747: Parallel Programming

Another Way of Defining SC

• All memory references of a single process execute in program order.

• All writes are globally ordered.

Page 17: ECE 1747: Parallel Programming

SC: Example 1

w(x,1) w(y,1)

r(x) r(y)

Initial values of x,y are 0.

What are possible final values?
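
One way to answer the question is to enumerate every interleaving that sequential consistency allows. The sketch below assumes a particular reading of the slide, which the extracted text does not confirm: the top row is one process executing w(x,1) then w(y,1), and the bottom row is a second process executing r(x) then r(y). The function name `enumerate_sc` and the `reads_y_first` flag are hypothetical; the flag swaps the read order to r(y) then r(x), as in the next example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed reading of the slide: P0 runs w(x,1); w(y,1) and P1 runs two
 * reads. x and y start at 0. seen[rx][ry] is set when some sequentially
 * consistent interleaving lets P1 read the pair (rx, ry). */
static void enumerate_sc(bool reads_y_first, bool seen[2][2])
{
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            seen[a][b] = false;

    /* Each 4-bit mask with exactly two set bits is one interleaving:
     * bit t set means global step t belongs to P1, otherwise to P0.
     * Each process issues its own operations in program order. */
    for (unsigned mask = 0; mask < 16; mask++) {
        int ones = 0;
        for (int t = 0; t < 4; t++)
            ones += (int)((mask >> t) & 1u);
        if (ones != 2)
            continue;

        int x = 0, y = 0;        /* shared memory                 */
        int rx = 0, ry = 0;      /* values P1 reads               */
        int pc0 = 0, pc1 = 0;    /* per-process program counters  */

        for (int t = 0; t < 4; t++) {
            if ((mask >> t) & 1u) {                 /* P1's turn */
                bool read_y = (pc1 == 0) == reads_y_first;
                if (read_y) ry = y; else rx = x;
                pc1++;
            } else {                                /* P0's turn */
                if (pc0 == 0) x = 1; else y = 1;
                pc0++;
            }
        }
        assert(pc0 == 2 && pc1 == 2);
        seen[rx][ry] = true;
    }
}
```

Under this reading, `enumerate_sc(false, seen)` marks all four (x, y) pairs as reachable. With `reads_y_first` set, the pair x = 0, y = 1 is never produced: seeing the new y implies the earlier write to x has already happened.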

Page 18: ECE 1747: Parallel Programming

SC: Example 2

w(x,1) w(y,1)

r(y) r(x)

Page 19: ECE 1747: Parallel Programming

SC: Example 3

w(x,1)

w(y,1)

r(y) r(x)

Page 20: ECE 1747: Parallel Programming

SC: Example 4

w(x,1)

w(x,2)

r(x)

r(x)

Page 21: ECE 1747: Parallel Programming

Implementation

• There are many ways of implementing SC.

• In fact, implementations sometimes enforce stronger conditions.

• We will look at a simple one: the MSI protocol.

Page 22: ECE 1747: Parallel Programming

Physical Implementation

[Figure: proc1, proc2, proc3, ..., procN, each with its own cache (cache1 ... cacheN), connected by a bus to the shared memory]

Page 23: ECE 1747: Parallel Programming

Fundamental Assumption

• The bus is a reliable, ordered broadcast bus.
  – Every message sent by a processor is received by all other processors in the same order.

• Also called a snooping bus
  – Processors (or caches) snoop on the bus.

Page 24: ECE 1747: Parallel Programming

States of a Cache Line

• Invalid

• Shared
  – read-only, one of many cached copies

• Modified
  – read-write, sole valid copy

Page 25: ECE 1747: Parallel Programming

Processor Transactions

• processor read(x)

• processor write(x)

Page 26: ECE 1747: Parallel Programming

Bus Transactions

• bus read(x)
  – asks for copy with no intent to modify

• bus read-exclusive(x)
  – asks for copy with intent to modify

Page 27: ECE 1747: Parallel Programming

State Diagram: Steps 0 to 9 (one transition added per step)

States: I (Invalid), S (Shared), M (Modified). Each label reads "event observed / bus action taken"; "-" means no bus action. The arrow endpoints below follow the standard MSI diagram, since the extracted text keeps only the labels.

• Step 1: I → S on PrRd/BuRd

• Step 2: S → S on PrRd/-

• Steps 3 and 4: I → M and S → M, both on PrWr/BuRdX

• Step 5: M → M on PrWr/-

• Step 6: M → S on BuRd/Flush

• Step 7: S → S on BuRd/-

• Step 8: S → I on BuRdX/-

• Step 9: M → I on BuRdX/Flush
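
The finished diagram can be written down as a transition function for a single cache line. This is a sketch, assuming the standard MSI assignment of the slide's labels to arrows: PrRd/BuRd is I to S, PrWr/BuRdX is I to M and S to M, BuRd/Flush is M to S, BuRdX/- is S to I, and BuRdX/Flush is M to I. The type and function names are hypothetical. A processor read in state M (a hit with no bus action) does not appear among the slide's labels and is added here so the function is total.

```c
#include <assert.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } msi_event;
typedef enum { NO_ACTION, ISSUE_BUS_RD, ISSUE_BUS_RDX, FLUSH } msi_action;

typedef struct {
    line_state next;     /* state after the event            */
    msi_action action;   /* what this cache puts on the bus  */
} msi_result;

/* One step of the MSI protocol for one line in one cache.
 * PR_* events come from the local processor; BUS_* events are
 * transactions snooped from other caches. */
static msi_result msi_step(line_state s, msi_event e)
{
    assert(s == INVALID || s == SHARED || s == MODIFIED);
    switch (s) {
    case INVALID:
        if (e == PR_RD) return (msi_result){ SHARED,   ISSUE_BUS_RD  };
        if (e == PR_WR) return (msi_result){ MODIFIED, ISSUE_BUS_RDX };
        break;  /* snooped traffic for a line we do not hold: ignore */
    case SHARED:
        if (e == PR_RD)   return (msi_result){ SHARED,   NO_ACTION     };
        if (e == PR_WR)   return (msi_result){ MODIFIED, ISSUE_BUS_RDX };
        if (e == BUS_RD)  return (msi_result){ SHARED,   NO_ACTION     };
        if (e == BUS_RDX) return (msi_result){ INVALID,  NO_ACTION     };
        break;
    case MODIFIED:
        if (e == PR_RD)   return (msi_result){ MODIFIED, NO_ACTION }; /* added: not on slide */
        if (e == PR_WR)   return (msi_result){ MODIFIED, NO_ACTION };
        if (e == BUS_RD)  return (msi_result){ SHARED,   FLUSH     };
        if (e == BUS_RDX) return (msi_result){ INVALID,  FLUSH     };
        break;
    }
    return (msi_result){ s, NO_ACTION };
}
```

For example, `msi_step(MODIFIED, BUS_RD)` yields `SHARED` with a `FLUSH`: the sole valid copy is written back so that the requester reads current data.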

Page 37: ECE 1747: Parallel Programming

In Reality

• Most machines use a slightly more complicated protocol (4 states instead of 3).

• See architecture books (MESI protocol).

Page 38: ECE 1747: Parallel Programming

Problem: False Sharing

• Occurs when two or more processors access different data in same cache line, and at least one of them writes.

• Leads to a ping-pong effect: the line bounces back and forth between the caches.
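
A common remedy, not covered on the slide, is to pad or align per-processor data so that each item sits in its own cache line. The sketch below only checks data layout; it assumes a 64-byte line (`LINE_BYTES` is a guess to be checked against the target hardware), a C11 compiler for `_Alignas`, and objects that start on a line boundary. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64   /* assumed cache-line size; check your hardware */

/* Two per-processor counters packed next to each other. */
struct packed_counters {
    long c0;   /* written by processor 0 */
    long c1;   /* written by processor 1 */
};

/* The same counters, each aligned to the start of its own line. */
struct padded_counters {
    _Alignas(LINE_BYTES) long c0;
    _Alignas(LINE_BYTES) long c1;
};

/* Cache-line index of a byte offset, relative to an object that
 * itself starts on a line boundary. */
static size_t line_of(size_t byte_offset)
{
    return byte_offset / LINE_BYTES;
}

/* Nonzero when both packed counters fall in the same line. */
static int same_line_packed(void)
{
    return line_of(offsetof(struct packed_counters, c0)) ==
           line_of(offsetof(struct packed_counters, c1));
}

/* Nonzero when both padded counters fall in the same line. */
static int same_line_padded(void)
{
    return line_of(offsetof(struct padded_counters, c0)) ==
           line_of(offsetof(struct padded_counters, c1));
}
```

The packed layout puts both counters in one line, so writes by the two processors would falsely share it; the padded layout trades memory for separate lines.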

Page 39: ECE 1747: Parallel Programming

False Sharing: Example (1 of 3)

for (i = 0; i < n; i++)
    a[i] = b[i];

• Let’s assume we parallelize the code:
  – p = 2
  – element of a takes 4 words
  – cache line has 32 words
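
With these numbers, 32 / 4 = 8 elements of a share one cache line. The sketch below counts how many lines end up written by both processors under two ways of splitting the loop. The cyclic split (P0 takes even i, P1 takes odd i) matches the later slides of this example; the block split is added here for comparison and is not from the slides. It also assumes a[0] starts on a line boundary. All names are hypothetical.

```c
#include <assert.h>

enum { P = 2, WORDS_PER_ELEM = 4, WORDS_PER_LINE = 32 };
enum { ELEMS_PER_LINE = WORDS_PER_LINE / WORDS_PER_ELEM };   /* 8 */

/* Cache line holding a[i], assuming a[0] starts on a line boundary. */
static int line_of_elem(int i) { return i / ELEMS_PER_LINE; }

/* Cyclic assignment: P0 gets even i, P1 gets odd i. */
static int owner_cyclic(int i) { return i % P; }

/* Block assignment: P0 gets the first half, P1 the second half. */
static int owner_block(int i, int n) { return i < (n + 1) / 2 ? 0 : 1; }

/* Counts the cache lines of a[0..n-1] written by both processors. */
static int falsely_shared_lines(int n, int cyclic)
{
    assert(n >= 0);
    int shared = 0;
    int nlines = (n + ELEMS_PER_LINE - 1) / ELEMS_PER_LINE;

    for (int line = 0; line < nlines; line++) {
        int writers = 0;   /* bit k set when processor k writes this line */
        for (int i = line * ELEMS_PER_LINE;
             i < (line + 1) * ELEMS_PER_LINE && i < n; i++)
            writers |= 1 << (cyclic ? owner_cyclic(i) : owner_block(i, n));
        if (writers == 3)   /* both bit 0 and bit 1 are set */
            shared++;
    }
    return shared;
}
```

For n = 64, the cyclic split makes all 8 lines falsely shared, while the block split makes none; a block boundary that does not fall on a line boundary (for example n = 60) leaves a single shared line.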

Page 40: ECE 1747: Parallel Programming

False Sharing: Example (2 of 3)

[Figure: one cache line holding a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7]; some of these elements are marked "written by processor 0", the others "written by processor 1"]

Page 41: ECE 1747: Parallel Programming

False Sharing: Example (3 of 3)

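[Figure: timelines for P0 and P1. P0 writes a[0], a[2], a[4], ...; P1 writes a[1], a[3], a[5], ...; between consecutive writes, invalidation (inv) and data messages cross between the two processors]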
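
The cost of this ping-pong can be estimated by replaying the writes to one cache line under MSI and counting the bus read-exclusive transactions. This is a sketch under simplifying assumptions: two processors, a single line, writes only, and a strictly alternating order of writes in time, which the figure suggests but real hardware does not guarantee. The function name `count_bus_rdx` is hypothetical.

```c
#include <assert.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;

/* Replays writes to ONE cache line by two processors under MSI and
 * returns how many bus read-exclusive transactions were needed.
 * writer[k] is the processor (0 or 1) issuing the k-th write. */
static int count_bus_rdx(const int *writer, int nwrites)
{
    line_state st[2] = { INVALID, INVALID };
    int rdx = 0;

    for (int k = 0; k < nwrites; k++) {
        int me = writer[k];
        assert(me == 0 || me == 1);
        int other = 1 - me;

        if (st[me] != MODIFIED) {
            rdx++;                 /* PrWr in I or S: issue BuRdX       */
            st[other] = INVALID;   /* snooper drops its copy (an M copy
                                      is flushed first)                 */
            st[me] = MODIFIED;
        }
        /* PrWr in M is a hit: no bus traffic. */
    }
    return rdx;
}
```

Eight alternating writes (0, 1, 0, 1, ...) to the line cost eight bus transactions, one per write, whereas eight writes by a single processor cost only the initial one.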

Page 42: ECE 1747: Parallel Programming

Summary

• Sequential consistency.

• Bus-based coherence protocols.

• False sharing.