View
39
Download
0
Category
Preview:
DESCRIPTION
PARALLEL PROCESSOR ORGANIZATIONS. Jehan-François Pâris jfparis@uh.edu. Chapter Organization. Overview Writing parallel programs Multiprocessor Organizations Hardware multithreading Alphabet soup (SISD, SIMD, MIMD, …) Roofline performance model. OVERVIEW. The hardware side. - PowerPoint PPT Presentation
Citation preview
PARALLEL PROCESSOR ORGANIZATIONS
Jehan-François Pârisjfparis@uh.edu
Chapter Organization
• Overview• Writing parallel programs• Multiprocessor Organizations• Hardware multithreading• Alphabet soup (SISD, SIMD, MIMD, …)• Roofline performance model
OVERVIEW
The hardware side
• Many parallel processing solutions– Multiprocessor architectures
• Two or more microprocessor chips• Multiple architectures
– Multicore architectures• Several processors on a single chip
The software side
• Two ways for software to exploit parallel processing capabilities of hardware– Job-level parallelism
• Several sequential processes run in parallel• Easy to implement (OS does the job!)
– Process-level parallelism• A single program runs on several processors
at the same time
WRITING PARALLEL PROGRAMS
Overview
• Some problems are embarrassingly parallel– Many computer graphics tasks– Brute force searches in cryptography or
password guessing• Much more difficult for other applications
– Communication overhead among sub-tasks– Amdahl's law– Balancing the load
Amdahl's Law
• Assume a sequential process takes
– tp seconds to perform operations that could be performed in parallel
– ts seconds to perform purely sequential operations
• The maximum speedup will be
(tp + ts )/ts
Balancing the load
• Must ensure that workload is equally divided among all the processors
• Worst case is when one of the processors does much more work than all others
Example (I)
• Computation partitioned among n processors• One of them does 1/m of the work with m < n
– That processor becomes a bottleneck
• Maximum expected speedup: n
• Actual maximum speedup: m
Example (II)
• Computation partitioned among 64 processors• One of them does 1/8 of the work
• Maximum expected speedup: 64
• Actual maximum speedup: 8
A last issue
• Humans likes to address issues one after the order– We have meeting agendas– We do not like to be interrupted– We write sequential programs
Rene Descartes
• Seventeenth-century French philosopher• Invented
– Cartesian coordinates – Methodical doubt
• [To] never to accept anything for true which I did not clearly know to be such
• Proposed a scientific method based on four precepts
Method's third rule
• The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.
MULTI PROCESSOR ORGANIZATIONS
Shared memory multiprocessors
PU
Cache
PU
Cache
PU
Cache
Interconnection network
RAM I/O
…
Shared memory multiprocessor
• Can offer– Uniform memory access to all processors
(UMA)• Easiest to program
– Non-uniform memory access to all processors(NUMA)• Can scale up to larger sizes• Offer faster access to nearby memory
Computer clusters
PU
Cache
RAM
PU
Cache
RAM
PU
Cache
RAM
Interconnection network
…
Computer clusters
• Very easy to assemble• Can take advantage of high-speed LANs
– Gigabit Ethernet, Myrinet, …• Data exchanges must be done through
message passing
Message passing (I)
• If processor P wants to access data in the main memory of processor Q it must– Send a request to Q– Wait for a reply
• For this to work, processor Q must have a thread– Waiting for message from other processors– Sending them replies
Message passing (II)
• In a shared memory architecture, each processor can directly access all data
• A proposed solution– Distributed shared memory offers to the
users of a cluster the illusion of a single address space for their shared data
– Still has performance issues
When things do not add up
• Memory capacity is very important for big computing applications– If the data can fit into main memory, the
computation will run much faster• A company replaced
– Single shared memory computer with 32GB of RAM
A problem
• A company replaced – Single shared memory computer with 32GB of
RAM– Four “clustered” computers with 8GB each
• More I/O than ever• What did happen?
The explanation
• Assume OS occupies one GB of RAM– The old shared-memory computer still had 31
GB of free RAM– Each of the clustered computer has 7 GB of
free RAM• The total RAM available to the program went
down from 31 GB to 47 = 28 GB!
Grid computing
• The computers are distributed over a very large network– Sometimes computer time is donated
• Volunteer computing• Seti@Home
– Works well with embarrassingly parallel workloads• Searches in a n-dimensional space
HARDWARE MULTITHREADING
General idea
• Let the processor switch to another thread of computation while them current one is stalled
• Motivation:– Increased cost of cache misses
Implementation
• Entirely controlled by the hardware– Unlike multiprogramming
• Requires a processor capable of– Keeping track of the state of each thread
• One set of registers—including PC– for each concurrent thread
– Quickly switching among concurrent threads
Approaches
• Fine-grained multithreading:– Switches between threads for each instruction– Provides highest throughputs– Slows down execution of individual threads
Approaches
• Coarse-grained multithreading– Switches between threads whenever a long
stall is detected– Easier to implement – Cannot eliminate all stalls
Approaches
• Simultaneous multi-threading:– Takes advantage of the possibility of modern
hardware to perform different tasks in parallel for instructions of different threads
– Best solution
ALPHABET SOUP
Overview
• Used to describe processor organizations where– Same instructions can be applied to– Multiple data instances
• Encountered in– Vector processors in the past– Graphic processing units (GPU)– x86 multimedia extension
Classification
• SISD:– Single instruction, single data– Conventional uniprocessor architecture
• MIMD:– Multiple instructions, multiple data– Conventional multiprocessor architecture
Classification
• SIMD:– Single instruction, multiple data– Perform same operations on a set of similar data
• Think of adding two vectors
for (i = 0; i++; i < VECSIZE)sum[i] = a[i] + b[i];
Vector computing
• Kind of SIMD architecture– Used by Cray computers
• Pipelines multiple executions of single instruction with different data (“vectors”) trough the ALU
• Requires– Vector registers able to store
multiple values– Special vector instructions: say lv, addv, …
Benchmarking
• Two factors to consider– Memory bandwidth
• Depends on interconnection network– Floating-point performance
• Best known benchmark is LINPACK
Roofline model
• Takes into account– Memory bandwidth– Floating-point performance
• Introduces arithmetic intensity– Total number of floating point operations in a
program divided by total number of bytes transferred to main memory
– Measured in FLOPS/byte
Roofline model
• Attainable GFLOPS/s =Min(Peak Memory BWArithmetic
Intensity, Peak Floating-Point Performance
Roofline model
Peak floating-point performance
Floating-point performance islimited by memory bandwidth
Recommended