
Multiprocessors and Multi-computers


Multiprocessors and Multi-computers

• Multi-computers
  – Distributed address space accessible by local processors
  – Requires message passing
  – Programming tends to be more difficult

• Multiprocessors
  – Single address space accessible by all processors
  – Simultaneous access to shared variables can produce inconsistent results
  – Generally programming is more convenient
  – Doesn't scale to more than about sixteen processors

Shared Memory Hardware

[Diagrams: processors connected to memory modules via a bus configuration and via a crossbar switch configuration]

Cache Coherence

• Cache Coherence Protocols
  – Write-Update: all caches are immediately updated with the altered data
  – Write-Invalidate: altered data is invalidated in all caches; updates take place only if subsequently referenced

• False Sharing: Cache updates take place because multiple processes access the same cache block but not the same locations

[Diagram: memory holds x and y in the same cache block; both Processor 1 and Processor 2 cache that block even though each accesses a different variable]

Note: False sharing is significant because each processor has a local cache, and the resulting coherence traffic can significantly impact performance.
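To illustrate, here is a minimal Pthreads sketch (not from the slides; the thread and iteration counts are arbitrary) in which two threads update adjacent array elements. They never touch the same location, yet the adjacent counters typically share a cache block, so every update by one thread invalidates the other thread's copy; padding each counter out to its own cache block removes the effect.

    #include <pthread.h>
    #include <stdio.h>

    /* The two counters are adjacent, so they typically fall in the same cache
       block; every update by one thread invalidates the other thread's cached
       copy (false sharing). Padding each counter to its own block avoids it. */
    static long counters[2];

    static void *worker(void *arg) {
        int id = *(int *)arg;
        for (long i = 0; i < 10000000; i++)
            counters[id]++;               /* each thread uses a different location */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int ids[2] = {0, 1};
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &ids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0], counters[1]);
        return 0;
    }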

Shared Memory Access

• Critical Section
  – A section of code that needs to be protected from simultaneous access

• Mutual Exclusion
  – The mechanism used to enforce a critical section
  – Locks
  – Semaphores
  – Monitors
  – Condition Variables

[Diagram: Process 1 and Process 2 simultaneously assign different values (=1 and =2) to the shared variable x]

Sequential Consistency

Formally defined by Lamport (1979):

• A multiprocessor result is sequentially consistent if:

  – The operations of each individual processor occur in the sequence specified by its program.

  – The overall result matches some sequential (interleaved) order of the operations of all the processors.

• Summary: The result is as if the instructions of all processors were interleaved into some single sequential order that preserves each processor's program order.
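To make the definition concrete, here is a hedged sketch (not from the slides) of the classic two-thread litmus test. Under sequential consistency no interleaving that respects each thread's program order can leave both r1 and r2 equal to 0, so observing that outcome shows the hardware is not behaving sequentially consistently.

    #include <pthread.h>
    #include <stdio.h>

    /* Both x and y start at 0. Under sequential consistency, at least one of
       r1, r2 must end up 1. Real hardware with relaxed memory ordering can
       occasionally produce r1 == 0 && r2 == 0. */
    volatile int x, y, r1, r2;

    static void *p1(void *arg) { x = 1; r1 = y; return NULL; }
    static void *p2(void *arg) { y = 1; r2 = x; return NULL; }

    int main(void) {
        for (int i = 0; i < 100000; i++) {
            x = y = r1 = r2 = 0;
            pthread_t a, b;
            pthread_create(&a, NULL, p1, NULL);
            pthread_create(&b, NULL, p2, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            if (r1 == 0 && r2 == 0)
                printf("non-sequentially-consistent outcome on iteration %d\n", i);
        }
        return 0;
    }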

Deadlock

• Necessary Conditions
  – Circular Wait
  – Limited Resource
  – Non-preemptive
  – Hold and Wait

Deadly Embrace

[Diagrams: a circular wait in which processes P1 … Pn each hold one of the resources R1 … Rn and request the next, and the two-process case ("Two Process Deadlock") with P1 and P2 holding R1 and R2]

Processes permanently blocked waiting for needed resources held by other blocked processes

Locks

• Single-bit variable: 1 = locked, 0 = unlocked ("enter the door and lock it at entry")

• Spin locks (busy-wait locks)

    while (lock == 1) spin();   // Normally involves hardware support
    lock = 1;
    // Critical section
    lock = 0;

• Advantages: simple and easy to understand
• Disadvantages
  – Poor use of the CPU if the process does not block while waiting
  – It's easy to skip the lock = 0 statement

• Examples: Pthreads and OpenMP provide OS abstractions

Locks are the simplest mutual exclusion mechanism. Normally, they are provided by operating system calls.

Note: The test in the while loop and the setting of lock must together be atomic
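A minimal sketch of how that atomicity is obtained in practice, using the C11 atomic_flag test-and-set (an illustration assumed here, not from the slides; critical_work is a placeholder name):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;    /* clear = unlocked */

    static void critical_work(void) {
        /* test-and-set reads the old value and sets the flag in one atomic step,
           so only one thread at a time can move the flag from clear to set */
        while (atomic_flag_test_and_set(&lock))
            ;                                      /* spin (busy wait) */

        puts("inside critical section");           /* protected code */

        atomic_flag_clear(&lock);                  /* lock = 0 (unlock) */
    }

    int main(void) {
        critical_work();
        return 0;
    }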

Semaphores

• Limits concurrent access

• An integer variable, s, controls the mechanism

• Operations
  – P operation (passeren, Dutch for "to pass"):

        s--;
        while (s < 0) wait();
        // Critical section code

  – V operation (vrijgeven, Dutch for "to release"):

        s++;
        if (s <= 0) unblock a waiting process;

  – Typical use: P(s); /* critical section */ V(s);

• Notes
  – Set s = 1 initially for s to be a binary semaphore, which acts like a lock
  – Set s = k > 1 initially if k simultaneous entries are possible
  – Set s = k <= 0 for consumer processes waiting to consume data produced

• Disadvantage: It's easy to skip the V operation

• Example: UNIX OS
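A minimal runnable sketch of the same pattern using the POSIX semaphore calls sem_init, sem_wait (P), and sem_post (V); the shared counter and thread count are illustrative assumptions, not from the slides.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t s;          /* semaphore guarding the critical section */
    static int counter;      /* shared variable (illustrative) */

    static void *worker(void *arg) {
        sem_wait(&s);        /* P operation: decrement, block if unavailable */
        counter++;           /* critical section */
        sem_post(&s);        /* V operation: increment, wake a waiting thread */
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        sem_init(&s, 0, 1);                 /* s = 1 -> binary semaphore (a lock) */
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %d\n", counter);  /* always 4 with the semaphore in place */
        sem_destroy(&s);
        return 0;
    }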

Monitors

• A class mechanism that limits access to a shared resource

    public class doIt {
        public doIt() { /* constructor logic */ }
        public synchronized void critMethod() throws InterruptedException {
            wait();     // wait until another thread signals
            notify();   // signal a waiting thread
        }
    }

• Advantage: the most natural mutual exclusion mechanism
• Disadvantage: requires a language that supports the construct
• Examples: Java, Ada, Modula-2

Condition Variables

• Advantages
  – Reduce the overhead of checking whether a global variable has reached some value
  – Avoid having to frequently "poll" the global variable

• Disadvantage: It's easy to skip the unlock operations

• Example: Pthreads
• Notes
  – wait() unlocks the mutex and re-locks it automatically when the thread resumes
  – Threads must already be waiting when the signal is sent; otherwise the signal is lost

Example

• Thread 1

    lock(mutex);
    while (c != VALUE) wait(condVar, mutex);
    // Critical section
    unlock(mutex);

• Thread 2

    if (c == VALUE) signal(condVar);

Mechanism to guarantee a global condition holds before critical-section entry
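The same pattern written with the Pthreads condition-variable calls pthread_cond_wait and pthread_cond_signal (a sketch; c and VALUE come from the slide's pseudocode, while the remaining names and the value 10 are assumptions):

    #include <pthread.h>
    #include <stdio.h>

    #define VALUE 10

    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  condVar = PTHREAD_COND_INITIALIZER;
    static int c;                       /* shared condition data */

    static void *thread1(void *arg) {   /* waits until c reaches VALUE */
        pthread_mutex_lock(&mutex);
        while (c != VALUE)
            pthread_cond_wait(&condVar, &mutex);  /* unlocks mutex, re-locks on wakeup */
        printf("critical section: c = %d\n", c);  /* critical section */
        pthread_mutex_unlock(&mutex);
        return NULL;
    }

    static void *thread2(void *arg) {   /* establishes the condition and signals */
        pthread_mutex_lock(&mutex);
        c = VALUE;
        pthread_cond_signal(&condVar);  /* wake a thread waiting on condVar */
        pthread_mutex_unlock(&mutex);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }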

Shared Memory Programming Alternatives

• Heavyweight processes

• Modified syntax of an existing language (High Performance Fortran)

• Programming language designed for parallel processing (Ada)

• Compiler extensions to specify parallel execution (OpenMP)

• Thread programming standard: Java Threads and pthreads

Threads

• Heavyweight processes (UNIX fork, wait, waitpid, shmat, shmdt)

– Disadvantage: time and memory expensive

– Advantage: A blocked process doesn’t block the other processes

• Lightweight threads (pthreads library)

– Each thread needs only its own stack space and instruction counter; the rest of the process is shared

– "Thread Safe" programming required to guarantee consistent results

• Pthreads
  – Threads can be spawned and started by other threads

– They can run independently (detached from their parent thread) or require joins for termination

– Formation of thread pools is possible

– Threads communicate through signals

– Processing order is indeterminate

Definition: Path of execution through a process

Forks and Joins

General thread flow of control

    pid = fork();
    if (pid == 0) { /* Do spawned (child) code */ }
    else          { /* Do spawning (parent) code */ }
    if (pid == 0) exit(0);
    else wait(0);

Note: Detached processes run independently of their parent, without joins
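The corresponding thread version of this flow of control uses pthread_create and pthread_join (a sketch assumed here; the function name spawned is a placeholder):

    #include <pthread.h>
    #include <stdio.h>

    static void *spawned(void *arg) {      /* code run by the spawned thread */
        puts("spawned thread code");
        return NULL;                       /* returning ends the thread (like exit(0)) */
    }

    int main(void) {
        pthread_t tid;
        pthread_create(&tid, NULL, spawned, NULL);  /* "fork" a thread */
        puts("spawning thread code");               /* parent continues concurrently */
        pthread_join(tid, NULL);                    /* "join": wait for the thread */
        return 0;
    }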

Processes and Threads

[Diagram: single-thread process with code, resources, listeners, and heap, plus one stack and one instruction pointer (IP)]

[Diagram: dual-thread process sharing the same code, resources, listeners, and heap, but with two stacks and two instruction pointers]

Notes:
• Threads can be three orders of magnitude faster than processes
• Thread-safe library routines can be used by multiple concurrent threads
• Synchronization uses shared variables

Example Program (summing numbers)

Heavyweight UNIX processes (Section 8.7.1): pseudocode

    Create semaphores
    Allocate shared memory and attach shared memory
    Load array with numbers
    Fork child processes
    IF parent THEN sum parent section ELSE sum child section
    P(semaphore); add to global sum; V(semaphore)
    IF child THEN terminate ELSE join
    Print results
    Release semaphores, detach and release shared memory

Note: The Java and pthread versions require about half the code
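For comparison, a condensed Pthreads sketch of the same summation (an illustration, not the textbook's code; the array size and number of threads are arbitrary choices):

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000
    #define NTHREADS 4

    static int a[N];
    static long global_sum;
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *partial_sum(void *arg) {
        int id = *(int *)arg;
        long local = 0;
        for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
            local += a[i];                 /* sum this thread's section */
        pthread_mutex_lock(&sum_lock);     /* protect the shared total */
        global_sum += local;
        pthread_mutex_unlock(&sum_lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < N; i++) a[i] = i + 1;     /* load array with numbers */
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&t[i], NULL, partial_sum, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);                 /* join all threads */
        printf("sum = %ld\n", global_sum);            /* 1 + 2 + ... + 1000 = 500500 */
        return 0;
    }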

Modify Existing Language Syntax

• Declaration of a shared memory variable

    shared int x;

• Specify statements to execute concurrently

    par { s1(); s2(); s3(); … sn(); }

• Iterations assigned to different processors

    forall (i = 0; i < n; i++) { /* code */ }

• Examples: High Performance Fortran and C

Example Constructs

Compiler Optimizations

• The following works because the statements are independent

forall (i = 0; i < P; i++) a[i] = 0;

• Bernstein's conditions
  – Outputs from one processor cannot be inputs to another
  – Outputs from the processors cannot overlap

• Example: a = x + y; b = x + z; are okay to execute simultaneously
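A short C illustration of checking these conditions (the variables beyond those on the slide are assumptions):

    int main(void) {
        int x = 1, y = 2, z = 3, a, b, c;

        /* Independent: neither statement writes a variable the other reads or
           writes, so Bernstein's conditions hold and they may run in parallel. */
        a = x + y;   /* reads x, y; writes a */
        b = x + z;   /* reads x, z; writes b */

        /* Dependent: the second statement reads a, which the first writes, so
           an output of one is an input to the other; they must run in order. */
        a = x + y;
        c = a + 1;   /* reads a; writes c */

        return b + c;   /* use the results so the compiler keeps them */
    }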

Java Threads

• Instantiate and run a thread

    ThreadClass t = new ThreadClass();
    t.start();

• Thread class

    class ThreadClass extends Thread {
        public ThreadClass() { /* constructor */ }
        public void run() {
            while (true) {
                // yield or sleep periodically
                // thread code executed here
            }
        }
    }

Pthreads

Advantages
• Industry-standardized interface that replaces vendor-proprietary APIs
• Thread creation, synchronization, and context switching are implemented in user space without kernel intervention, which is inherently more efficient than kernel-based thread operations

• User-level implementation provides the flexibility to choose a scheduler that best suits the application, independent of the kernel scheduler.

Drawbacks
• Poor locality limits performance when accessing shared data across processors
• The Pthreads scheduler hasn't proven suited to managing large numbers of threads
• Shared memory multithreaded programs typically follow the SPMD model
• Most parallel programs are still coarse-grained in design

IEEE POSIX 1003.1c 1995: UNIX-based C standardized API

Performance Comparisons Pthreads versus Kernel Threads

Real: wall clock time (actual elapsed time)
User: time spent in user mode
Sys: time spent in the kernel within the process

Compiler Extensions (OpenMP)

• Extensions for C/C++, Fortran, and Java (JOMP)
• Consists of compiler directives, library routines, and environment variables
• Recognized industry standard developed in the late 1990s
• Designed for shared memory programming
• Uses the fork-join model, but with threads
• Parallel sections of code are executed by "teams of threads"
• General syntax
  – C: #pragma omp <directive>
  – JOMP: //omp <directive>
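A minimal C sketch of the fork-join model with a team of threads (an illustration, assuming an OpenMP-capable compiler, e.g. built with -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int a[100];

        /* fork: a team of threads executes the loop, then joins at the end */
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            a[i] = i * i;                    /* iterations divided among the team */

        /* each member of the team executes the parallel region once */
        #pragma omp parallel
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());

        printf("a[99] = %d\n", a[99]);
        return 0;
    }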