PARALLEL PROCESSOR ORGANIZATIONS
Jehan-François
[email protected]
Chapter Organization
- Overview
- Writing parallel programs
- Multiprocessor organizations
- Hardware multithreading
- Alphabet soup (SISD, SIMD, MIMD, ...)
- Roofline performance model
OVERVIEW
The hardware side
- Many parallel processing solutions
- Multiprocessor architectures
  - Two or more microprocessor chips
  - Multiple architectures
- Multicore architectures
  - Several processors on a single chip
The software side
- Two ways for software to exploit the parallel processing capabilities of hardware
- Job-level parallelism
  - Several sequential processes run in parallel
  - Easy to implement (the OS does the job!)
- Process-level parallelism
  - A single program runs on several processors at the same time
WRITING PARALLEL PROGRAMS
Overview
- Some problems are embarrassingly parallel
  - Many computer graphics tasks
  - Brute-force searches in cryptography or password guessing
- Much more difficult for other applications
  - Communication overhead among subtasks
  - Amdahl's law
  - Balancing the load
Amdahl's Law
- Assume a sequential process takes
  - tp seconds to perform operations that could be performed in parallel
  - ts seconds to perform purely sequential operations
- The maximum speedup will be (tp + ts)/ts
Balancing the load
- Must ensure that the workload is equally divided among all the processors
- The worst case is when one of the processors does much more work than all the others
Example (I)
- Computation partitioned among n processors
- One of them does 1/m of the work, with m < n
- That processor becomes a bottleneck
- Maximum expected speedup: n
- Actual maximum speedup: m
Example (II)
- Computation partitioned among 64 processors
- One of them does 1/8 of the work
- Maximum expected speedup: 64
- Actual maximum speedup: 8
A last issue
- Humans like to address issues one after the other
  - We have meeting agendas
  - We do not like to be interrupted
  - We write sequential programs
René Descartes
- Seventeenth-century French philosopher
- Invented
  - Cartesian coordinates
  - Methodical doubt: "[To] never to accept anything for true which I did not clearly know to be such"
- Proposed a scientific method based on four precepts
Method's third rule
"The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence."
MULTIPROCESSOR ORGANIZATIONS
Shared memory multiprocessors
[Diagram: processors connected through an interconnection network to shared RAM and I/O]
Shared memory multiprocessor
- Can offer
  - Uniform memory access to all processors (UMA)
    - Easiest to program
  - Non-uniform memory access to all processors (NUMA)
    - Can scale up to larger sizes
    - Offers faster access to nearby memory
Computer clusters
[Diagram: complete computers linked by an interconnection network]
Computer clusters
- Very easy to assemble
- Can take advantage of high-speed LANs
  - Gigabit Ethernet, Myrinet, ...
- Data exchanges must be done through message passing
Message passing (I)
- If processor P wants to access data in the main memory of processor Q, it must
  - Send a request to Q
  - Wait for a reply
- For this to work, processor Q must have a thread
  - Waiting for messages from other processors
  - Sending them replies
Message passing (II)
- In a shared memory architecture, by contrast, each processor can directly access all data
A proposed solution
- Distributed shared memory offers the users of a cluster the illusion of a single address space for their shared data
- It still has performance issues
When things do not add up
- Memory capacity is very important for big computing applications
- If the data can fit into main memory, the computation will run much faster
A problem
- A company replaced a single shared memory computer with 32 GB of RAM by four clustered computers with 8 GB each
- Result: more I/O than ever
- What happened?
The explanation
- Assume the OS occupies one GB of RAM
- The old shared-memory computer had 31 GB of free RAM
- Each of the clustered computers has only 7 GB of free RAM
- The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
Grid computing
- The computers are distributed over a very large network
- Sometimes computer time is donated
  - Volunteer computing
  - SETI@home
- Works well with embarrassingly parallel workloads
  - Searches in an n-dimensional space
HARDWARE MULTITHREADING
General idea
- Let the processor switch to another thread of computation while the current one is stalled
Motivation
- The increased cost of cache misses
Implementation
- Entirely controlled by the hardware
  - Unlike multiprogramming
- Requires a processor capable of
  - Keeping track of the state of each thread
    - One set of registers, including the PC, for each concurrent thread
  - Quickly switching among concurrent threads
Approaches
- Fine-grained multithreading
  - Switches between threads after each instruction
  - Provides the highest throughput
  - Slows down the execution of individual threads
Approaches
- Coarse-grained multithreading
  - Switches between threads whenever a long stall is detected
  - Easier to implement
  - Cannot eliminate all stalls
Approaches
- Simultaneous multithreading (SMT)
  - Takes advantage of the ability of modern hardware to execute instructions from different threads in parallel
  - Best solution
ALPHABET SOUP
Overview
- Used to describe processor organizations where the same instructions can be applied to multiple data instances
- Encountered in
  - Vector processors in the past
  - Graphics processing units (GPUs)
  - x86 multimedia extensions
Classification
- SISD: single instruction, single data
  - Conventional uniprocessor architecture
- MIMD: multiple instructions, multiple data
  - Conventional multiprocessor architecture
Classification
- SIMD: single instruction, multiple data
  - Performs the same operation on a set of similar data
  - Think of adding two vectors:
for (i = 0; i < VECSIZE; i++)
    sum[i] = a[i] + b[i];
Vector computing
- A kind of SIMD architecture
  - Used by Cray computers
- Pipelines multiple executions of a single instruction with different data (vectors) through the ALU
- Requires
  - Vector registers able to store multiple values
  - Special vector instructions: say, lv, addv, ...
Benchmarking
- Two factors to consider
  - Memory bandwidth
    - Depends on the interconnection network
  - Floating-point performance
- The best-known benchmark is LINPACK
Roofline model
- Takes into account
  - Memory bandwidth
  - Floating-point performance
- Introduces arithmetic intensity
  - Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
  - Measured in FLOPS/byte
Roofline model
- Attainable GFLOPS/s = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
Roofline model
[Plot: attainable GFLOPS/s vs. arithmetic intensity; to the left of the ridge point, floating-point performance is limited by memory bandwidth, and to the right it flattens at the peak floating-point performance]
[Chart data: attainable GFLOPS/s vs. arithmetic intensity; the values are consistent with a peak memory bandwidth of 16 GB/s and a peak floating-point performance of 16 GFLOPS]

Arithmetic Intensity   GFLOPS
0.125                  2
0.25                   4
0.5                    8
1                      16
2                      16
4                      16
8                      16
16                     16