PARALLEL PROCESSOR ORGANIZATIONS
Jehan-François
[email protected]
Chapter Organization
- Overview
- Writing parallel programs
- Multiprocessor organizations
- Hardware multithreading
- Alphabet soup (SISD, SIMD, MIMD, ...)
- Roofline performance model
OVERVIEW
The hardware side
- Many parallel processing solutions
- Multiprocessor architectures
  - Two or more microprocessor chips
  - Multiple architectures
- Multicore architectures
  - Several processors on a single chip
The software side
- Two ways for software to exploit the parallel processing capabilities of hardware
- Job-level parallelism
  - Several sequential processes run in parallel
  - Easy to implement (the OS does the job!)
- Process-level parallelism
  - A single program runs on several processors at the same time
WRITING PARALLEL PROGRAMS
Overview
- Some problems are embarrassingly parallel
  - Many computer graphics tasks
  - Brute-force searches in cryptography or password guessing
- Much more difficult for other applications
  - Communication overhead among subtasks
  - Amdahl's law
  - Balancing the load
Amdahl's Law
- Assume a sequential process takes
  - tp seconds to perform operations that could be performed in parallel
  - ts seconds to perform purely sequential operations
- The maximum speedup will be (tp + ts)/ts
Balancing the load
- Must ensure that the workload is equally divided among all the processors
- The worst case is when one of the processors does much more work than all the others
Example (I)
- Computation partitioned among n processors
- One of them does 1/m of the work, with m < n
- That processor becomes a bottleneck
- Maximum expected speedup: n
- Actual maximum speedup: m
Example (II)
- Computation partitioned among 64 processors
- One of them does 1/8 of the work
- Maximum expected speedup: 64
- Actual maximum speedup: 8
A last issue
- Humans like to address issues one after the other
  - We have meeting agendas
  - We do not like to be interrupted
  - We write sequential programs
René Descartes
- Seventeenth-century French philosopher
- Invented
  - Cartesian coordinates
  - Methodical doubt: "[To] never to accept anything for true which I did not clearly know to be such"
- Proposed a scientific method based on four precepts
Method's third rule
"The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence."
MULTIPROCESSOR ORGANIZATIONS
Shared memory multiprocessors
[Diagram: processors connected through an interconnection network to shared RAM and I/O]
Shared memory multiprocessor
- Can offer
  - Uniform memory access to all processors (UMA)
    - Easiest to program
  - Non-uniform memory access to all processors (NUMA)
    - Can scale up to larger sizes
    - Offers faster access to nearby memory
Computer clusters
[Diagram: complete computers linked by an interconnection network]
Computer clusters
- Very easy to assemble
- Can take advantage of high-speed LANs
  - Gigabit Ethernet, Myrinet, ...
- Data exchanges must be done through message passing
Message passing (I)
- If processor P wants to access data in the main memory of processor Q, it must
  - Send a request to Q
  - Wait for a reply
- For this to work, processor Q must have a thread
  - Waiting for messages from other processors
  - Sending them replies
Message passing (II)
- In a shared memory architecture, by contrast, each processor can directly access all data
A proposed solution
- Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data
- It still has performance issues
When things do not add up
- Memory capacity is very important for big computing applications
- If the data can fit into main memory, the computation will run much faster
A problem
- A company replaced a single shared memory computer with 32 GB of RAM by four clustered computers with 8 GB each
- Result: more I/O than ever
- What happened?
The explanation
- Assume the OS occupies one GB of RAM
- The old shared-memory computer had 31 GB of free RAM
- Each of the clustered computers has only 7 GB of free RAM
- The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
Grid computing
- The computers are distributed over a very large network
- Sometimes computer time is donated
  - Volunteer computing
  - SETI@home
- Works well with embarrassingly parallel workloads
  - Searches in an n-dimensional space
HARDWARE MULTITHREADING
General idea
- Let the processor switch to another thread of computation while the current one is stalled
Motivation
- The increased cost of cache misses
Implementation
- Entirely controlled by the hardware
  - Unlike multiprogramming
- Requires a processor capable of
  - Keeping track of the state of each thread
    - One set of registers, including the PC, for each concurrent thread
  - Quickly switching among concurrent threads
Approaches
- Fine-grained multithreading
  - Switches between threads after each instruction
  - Provides the highest throughput
  - Slows down the execution of individual threads
Approaches
- Coarse-grained multithreading
  - Switches between threads whenever a long stall is detected
  - Easier to implement
  - Cannot eliminate all stalls
Approaches
- Simultaneous multithreading (SMT)
  - Takes advantage of the ability of modern hardware to execute instructions from different threads in parallel
  - Best solution
ALPHABET SOUP
Overview
- Used to describe processor organizations where the same instructions can be applied to multiple data instances
- Encountered in
  - Vector processors in the past
  - Graphics processing units (GPUs)
  - x86 multimedia extensions
Classification
- SISD: single instruction, single data
  - Conventional uniprocessor architecture
- MIMD: multiple instructions, multiple data
  - Conventional multiprocessor architecture
Classification
- SIMD: single instruction, multiple data
  - Performs the same operation on a set of similar data
  - Think of adding two vectors:
for (i = 0; i < VECSIZE; i++)
    sum[i] = a[i] + b[i];
Vector computing
- A kind of SIMD architecture
  - Used by Cray computers
- Pipelines multiple executions of a single instruction with different data (vectors) through the ALU
- Requires
  - Vector registers able to store multiple values
  - Special vector instructions: say, lv, addv, ...
Benchmarking
- Two factors to consider
  - Memory bandwidth
    - Depends on the interconnection network
  - Floating-point performance
- The best-known benchmark is LINPACK
Roofline model
- Takes into account
  - Memory bandwidth
  - Floating-point performance
- Introduces arithmetic intensity
  - Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
  - Measured in FLOPS/byte
Roofline model
- Attainable GFLOPS/s = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
Roofline model
[Plot: attainable GFLOPS/s vs. arithmetic intensity; to the left of the ridge point, floating-point performance is limited by memory bandwidth, and to the right it flattens at the peak floating-point performance]
[Chart data: attainable GFLOPS/s vs. arithmetic intensity; the values are consistent with a peak memory bandwidth of 16 GB/s and a peak floating-point performance of 16 GFLOPS]

Arithmetic Intensity   GFLOPS
0.125                  2
0.25                   4
0.5                    8
1                      16
2                      16
4                      16
8                      16
16                     16