ECE 1747: Parallel Programming

Basics of Parallel Architectures: Shared-Memory Machines

Two Parallel Architectures

• Shared memory machines.
• Distributed memory machines.

Shared Memory: Logical View

[Diagram: processors proc1 … procN all accessing a single shared memory space]

Shared Memory Machines

• Small number of processors: shared memory with coherent caches (SMP).

• Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

SMPs

• 2- or 4-processor PCs are now commodity.
• Good price/performance ratio.
• Memory is sometimes the bottleneck (see later).
• Typical price (8-node): ~$20-40k.

Physical Implementation

[Diagram: processors proc1 … procN, each with a private cache (cache1 … cacheN), connected by a bus to the shared memory]


CC-NUMA: Physical Implementation

[Diagram: processors proc1 … procN, each with a local cache (cache1 … cacheN) and a local memory module (mem1 … memN), connected by an interconnect]

Caches in Multiprocessors

• Suffer from the coherence problem:
  – the same line appears in two or more caches
  – one processor writes a word in the line
  – other processors can now read stale data

• Leads to the need for a coherence protocol
  – avoids coherence problems

• Many exist; we will look at a simple one.

What is coherence?

• What does it mean for memory to be shared?
• Intuitively: a read returns the last value written.
• This notion is not well defined in a system without a global clock.

The Notion of “last written” in a Multi-processor System

[Timeline diagram: w(x) on P0, w(x) on P1, r(x) on P2, r(x) on P3, overlapping in time; there is no unique "last" write]

The Notion of “last written” in a Single-machine System

[Timeline diagram: a single total order of operations: w(x), w(x), r(x), r(x)]

Coherence: a Clean Definition

• Is achieved by referring back to the single machine case.

• Called sequential consistency.

Sequential Consistency (SC)

• Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.

Returning to our Example

[Timeline diagram repeated: w(x) on P0, w(x) on P1, r(x) on P2, r(x) on P3; under SC they must appear to execute in some single interleaved order]

Another Way of Defining SC

• All memory references of a single process execute in program order.

• All writes are globally ordered.

SC: Example 1

P0: w(x,1)  w(y,1)

P1: r(x)  r(y)

Initial values of x and y are 0.

What are the possible final values?

SC: Example 2

P0: w(x,1)  w(y,1)

P1: r(y)  r(x)
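
To make Example 2 concrete, here is a minimal C11 sketch (not from the slides) using sequentially consistent atomics and C11 threads; the function and variable names are illustrative. Under SC, the outcome r(y) = 1, r(x) = 0 is impossible: if P1 sees y = 1, the earlier write to x must also be visible.

#include <stdatomic.h>
#include <threads.h>
#include <stdio.h>

atomic_int x = 0, y = 0;

int p0(void *arg) {                       /* P0: w(x,1) then w(y,1) */
    (void)arg;
    atomic_store(&x, 1);
    atomic_store(&y, 1);
    return 0;
}

int p1(void *arg) {                       /* P1: r(y) then r(x) */
    (void)arg;
    int ry = atomic_load(&y);
    int rx = atomic_load(&x);
    printf("r(y)=%d r(x)=%d\n", ry, rx);  /* SC forbids the result (1,0) */
    return 0;
}

int main(void) {
    thrd_t t0, t1;
    thrd_create(&t0, p0, NULL);
    thrd_create(&t1, p1, NULL);
    thrd_join(t0, NULL);
    thrd_join(t1, NULL);
    return 0;
}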

SC: Example 3

P0: w(x,1)

P1: w(y,1)

P2: r(y)  r(x)

SC: Example 4

P0: w(x,1)

P1: w(x,2)

P2: r(x)

P3: r(x)

Implementation

• There are many ways of implementing SC.
• In fact, implementations sometimes provide stronger conditions.
• We will look at a simple one: the MSI protocol.

Physical Implementation

[Diagram (repeated): processors proc1 … procN, each with a private cache (cache1 … cacheN), connected by a bus to the shared memory]

Fundamental Assumption

• The bus is a reliable, ordered broadcast bus.
  – Every message sent by a processor is received by all other processors in the same order.
• Also called a snooping bus.
  – Processors (or caches) snoop on the bus.

States of a Cache Line

• Invalid
• Shared
  – read-only, one of many cached copies
• Modified
  – read-write, sole valid copy

Processor Transactions

• processor read(x)
• processor write(x)

Bus Transactions

• bus read(x)
  – asks for a copy with no intent to modify
• bus read-exclusive(x)
  – asks for a copy with intent to modify

State Diagram: Steps 0-9

[The slides build the MSI state diagram one transition at a time, starting from the three states I (Invalid), S (Shared), M (Modified). Each edge is labeled "observed event / resulting bus action". The complete diagram:]

• From I: PrRd / BuRd → S;  PrWr / BuRdX → M
• From S: PrRd / – (stay in S);  BuRd / – (stay in S);  PrWr / BuRdX → M;  BuRdX / – → I
• From M: PrWr / – (stay in M);  BuRd / Flush → S;  BuRdX / Flush → I
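
As an illustration (not part of the slides), the transitions above can be written as a next-state function; the enum and function names are my own, and a real cache controller would also move data and handle line replacement.

/* Minimal sketch of the MSI next-state logic built up in Steps 1-9. */
typedef enum { INVALID, SHARED, MODIFIED } line_state;
typedef enum { PR_RD, PR_WR, BU_RD, BU_RDX } mem_event;

line_state msi_next(line_state s, mem_event e) {
    switch (s) {
    case INVALID:
        if (e == PR_RD)  return SHARED;     /* PrRd / BuRd   */
        if (e == PR_WR)  return MODIFIED;   /* PrWr / BuRdX  */
        return INVALID;                     /* bus traffic for an invalid line is ignored */
    case SHARED:
        if (e == PR_WR)  return MODIFIED;   /* PrWr / BuRdX  */
        if (e == BU_RDX) return INVALID;    /* BuRdX / -     */
        return SHARED;                      /* PrRd / -, BuRd / - */
    case MODIFIED:
        if (e == BU_RD)  return SHARED;     /* BuRd / Flush  */
        if (e == BU_RDX) return INVALID;    /* BuRdX / Flush */
        return MODIFIED;                    /* PrRd / -, PrWr / - */
    }
    return INVALID;
}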

In Reality

• Most machines use a slightly more complicated protocol (4 states instead of 3).

• See architecture books (MESI protocol).

Problem: False Sharing

• Occurs when two or more processors access different data in the same cache line, and at least one of them writes.

• Leads to a ping-pong effect.

False Sharing: Example (1 of 3)

/* "cyclic" is not a standard OpenMP schedule kind; schedule(static,1)
   gives the round-robin (cyclic) distribution the slide intends. */
#pragma omp parallel for schedule(static,1)
for (i = 0; i < n; i++)
    a[i] = b[i];

• Let’s assume:
  – p = 2
  – an element of a takes 4 words
  – a cache line has 32 words
• So 8 consecutive elements of a share one cache line, and with a cyclic schedule both processors write to every line.

False Sharing: Example (2 of 3)

[Diagram: one cache line holding a[0] … a[7]; with the cyclic schedule, even-indexed elements are written by processor 0 and odd-indexed elements by processor 1]

False Sharing: Example (3 of 3)

[Timeline diagram: P0 and P1 alternately write a[0], a[1], a[2], a[3], a[4], a[5], …; each write invalidates the other processor's copy of the line and forces the data to be transferred back and forth (ping-pong)]
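
One common remedy, sketched here under the slide's assumptions (p = 2, 8 elements of a per cache line), is to give each processor a contiguous block of iterations so the two processors write to different cache lines; in OpenMP, schedule(static) without a chunk size does this.

/* Contiguous blocks per thread: each processor writes a different
   set of cache lines, so the ping-pong effect disappears. */
#pragma omp parallel for schedule(static)
for (i = 0; i < n; i++)
    a[i] = b[i];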

Summary

• Sequential consistency.
• Bus-based coherence protocols.
• False sharing.

Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

J.M. Mellor-Crummey, M.L. Scott (MCS Locks)

Introduction

• Busy-waiting techniques are heavily used for synchronization on shared-memory multiprocessors
• Two general categories: locks and barriers
  – Locks ensure mutual exclusion
  – Barriers provide phase separation in an application

Problem

• Busy-waiting synchronization constructs tend to:
  – have a significant impact on network traffic due to cache invalidations
  – suffer contention that leads to poor scalability
• Main cause: spinning on remote variables

The Proposed Solution

• Minimize access to remote variables
• Instead, spin on local variables
• Claim:
  – It can all be done in software (no need for fancy and costly hardware support)
  – Spinning on local variables will minimize contention, allow good scalability, and give good performance

Spin Lock 1: Test-and-Set Lock

• Repeatedly test-and-set a boolean flag indicating whether the lock is held

• Problem: contention for the flag (read-modify-write instructions are expensive)
  – Causes lots of network traffic, especially on cache-coherent architectures (because of cache invalidations)

• Variation: test-and-test-and-set – less traffic

Test-and-Set with Backoff Lock

• Pause between successive test-and-set attempts (“backoff”)
• T&S with backoff idea (see the C sketch below):

while test&set(L) fails {
    pause(delay);
    delay = delay * 2;    /* exponential backoff */
}
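
For concreteness, here is a hedged C11 sketch combining the test-and-test-and-set variation with exponential backoff; it is not the paper's code, and the pause is simulated with a simple delay loop.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_flag;               /* false = free, true = held */

static void backoff_pause(unsigned delay) {
    for (volatile unsigned i = 0; i < delay; i++)
        ;                                   /* crude stand-in for pause(delay) */
}

void lock_acquire(void) {
    unsigned delay = 1;
    for (;;) {
        while (atomic_load(&lock_flag))     /* "test": spin with reads only */
            ;
        if (!atomic_exchange(&lock_flag, true))
            return;                         /* test-and-set succeeded */
        backoff_pause(delay);               /* failed: back off */
        delay *= 2;                         /* exponential backoff */
    }
}

void lock_release(void) {
    atomic_store(&lock_flag, false);
}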

Spin Lock 2: The Ticket Lock

• Two counters (nr_requests and nr_releases)
• Lock acquire: fetch-and-increment on the nr_requests counter; wait until the returned “ticket” equals the value of the nr_releases counter
• Lock release: increment the nr_releases counter

Spin Lock 2: The Ticket Lock

• Advantage over T&S: polls with read operations only
• Still generates lots of traffic and contention
• Can further improve by using backoff
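
A hedged C11 sketch of the ticket lock (the counter names follow the slides; proportional backoff is omitted):

#include <stdatomic.h>

typedef struct {
    atomic_uint nr_requests;    /* next ticket to hand out */
    atomic_uint nr_releases;    /* ticket currently allowed into the critical section */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    unsigned ticket = atomic_fetch_add(&l->nr_requests, 1);   /* fetch-and-increment */
    while (atomic_load(&l->nr_releases) != ticket)
        ;                                                     /* polls with reads only */
}

void ticket_release(ticket_lock *l) {
    atomic_fetch_add(&l->nr_releases, 1);
}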

Array-Based Queueing Locks

• Each CPU spins on a different location, in a distinct cache line

• Each CPU clears the lock for its successor (sets it from must-wait to has-lock)

• Lock acquire: while (slots[my_place] == must-wait);
• Lock release: slots[(my_place + 1) % P] = has-lock;
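
A hedged C11 sketch of an array-based queueing lock; MAX_CPUS, the cache-line padding, and the slot bookkeeping are my assumptions, and the slots array must be sized for the maximum number of concurrent waiters.

#include <stdatomic.h>

#define MAX_CPUS 64
enum { MUST_WAIT, HAS_LOCK };

typedef struct {                    /* one slot per CPU, padded to its own cache line */
    atomic_int flag;
    char pad[64 - sizeof(atomic_int)];
} slot_t;

static slot_t slots[MAX_CPUS] = { [0] = { .flag = HAS_LOCK } };
static atomic_uint next_slot;

unsigned array_lock_acquire(void) {
    unsigned my_place = atomic_fetch_add(&next_slot, 1) % MAX_CPUS;
    while (atomic_load(&slots[my_place].flag) == MUST_WAIT)
        ;                                             /* each CPU spins in a distinct cache line */
    atomic_store(&slots[my_place].flag, MUST_WAIT);   /* re-arm this slot for later reuse */
    return my_place;
}

void array_lock_release(unsigned my_place) {
    atomic_store(&slots[(my_place + 1) % MAX_CPUS].flag, HAS_LOCK);
}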

List-Based Queueing Locks (MCS Locks)

• Spins on local flag variables only
• Requires a small constant amount of space per lock

List-Based Queueing Locks (MCS Locks)

• CPUs are all in a linked list: upon release by the current CPU, the lock is acquired by its successor
• Spinning is on a local flag
• Lock points at the tail of the queue (null if not held)
• Compare-and-swap allows a processor to detect whether it is the only one in the queue and to atomically remove itself from the queue

List-Based Queueing Locks (MCS Locks)

• The spin in acquire_lock waits for the lock to become free
• The spin in release_lock compensates for the time window between the fetch-and-store and the assignment to predecessor->next in acquire_lock
• Without compare-and-swap, release is more cumbersome (see the sketch below)
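
A hedged C11 sketch of the MCS acquire and release paths; the type and function names are my own, atomic_exchange plays the role of fetch-and-store, and memory-ordering tuning is omitted.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;    /* tail == NULL: lock is free */

void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node *pred = atomic_exchange(&l->tail, me);       /* fetch-and-store: join the queue */
    if (pred != NULL) {
        atomic_store(&pred->next, me);                    /* link behind the predecessor */
        while (atomic_load(&me->locked))
            ;                                             /* spin on our own local flag */
    }
}

void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                                       /* nobody waiting: lock is now free */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                             /* successor is between swap and link */
    }
    atomic_store(&succ->locked, false);                   /* hand the lock to the successor */
}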

The MCS Tree-Based Barrier

• Uses a pair of trees over the P processors (P = number of CPUs): an arrival tree and a wakeup tree
• Arrival tree: each node has 4 children
• Wakeup tree: binary tree
  – Fastest way to wake up all P processors

Hardware Description

• BBN Butterfly 1 – DSM multiprocessor
  – Supports up to 256 CPUs, 80 used in the experiments
  – Atomic primitives allow fetch_and_add, fetch_and_store (swap), test_and_set
• Sequent Symmetry Model B – cache-coherent, shared-bus multiprocessor
  – Supports up to 30 CPUs, 18 used in the experiments
  – Snooping cache-coherence protocol
• Neither supports compare-and-swap

Measurement Technique

• Results averaged over 10k (Butterfly) or 100k (Symmetry) acquisitions

• For 1 CPU, time represents latency between acquire and release of lock

• Otherwise, time represents time elapsed between successive acquisitions

[Performance graphs: spin locks on the Butterfly]

Spin Locks on Butterfly

• Anderson's lock fares poorly because the Butterfly lacks coherent caches, and CPUs may spin on statically unpredictable locations, which may not be local
• T&S with exponential backoff, the Ticket lock with proportional backoff, and MCS all scale very well, with slopes of 0.0025, 0.0021, and 0.00025 μs, respectively

[Performance graphs: spin locks on the Symmetry]

[Table: latency and impact of spin locks]

Latency and Impact of Spin Locks

• Latency results are poor on the Butterfly because:
  – atomic operations are inordinately expensive in comparison to non-atomic ones
  – 16-bit atomic primitives on the Butterfly cannot manipulate 24-bit pointers

[Performance graphs: barriers on the Butterfly]

[Performance graphs: barriers on the Symmetry]

Barriers on Symmetry

• Different results from the Butterfly because:
  – Multiple CPUs can spin on the same location (each has its own copy in its local cache)
  – Distributing writes across different memory modules yields no benefit, because the bus serializes all communication

Conclusions

• Criteria for evaluating spin locks:
  – Scalability and induced network load
  – Single-processor latency
  – Space requirements
  – Fairness
  – Implementability with available atomic operations

Conclusions

• MCS lock algorithm scales best, together with array-based queueing on cache-coherent machines

• T&S and Ticket Locks with proper backoffs also scale well, but incur more network load

• The array-based locks of Anderson and of Graunke & Thakkar (G&T) have prohibitive space requirements for large numbers of CPUs

Conclusions

• MCS, array-based, and Ticket Locks guarantee fairness (FIFO)

• MCS benefits significantly from existence of compare-and-swap

• MCS is best when contention is expected: excellent scaling, FIFO ordering, least interconnect contention, low space requirements
