PARALLEL PROCESSOR ORGANIZATIONS

Jehan-François Pâris
jfparis@uh.edu

Chapter Organization

• Overview
• Writing parallel programs
• Multiprocessor Organizations
• Hardware multithreading
• Alphabet soup (SISD, SIMD, MIMD, …)
• Roofline performance model

OVERVIEW

The hardware side

• Many parallel processing solutions
– Multiprocessor architectures
  • Two or more microprocessor chips
  • Multiple architectures
– Multicore architectures
  • Several processors on a single chip

The software side

• Two ways for software to exploit parallel processing capabilities of hardware
– Job-level parallelism
  • Several sequential processes run in parallel
  • Easy to implement (OS does the job!)
– Process-level parallelism
  • A single program runs on several processors at the same time

WRITING PARALLEL PROGRAMS

Overview

• Some problems are embarrassingly parallel
– Many computer graphics tasks
– Brute force searches in cryptography or password guessing
• Much more difficult for other applications
– Communication overhead among sub-tasks
– Amdahl's law
– Balancing the load

Amdahl's Law

• Assume a sequential process takes

– tp seconds to perform operations that could be performed in parallel

– ts seconds to perform purely sequential operations

• The maximum speedup will be

(tp + ts )/ts
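
As a worked illustration, the bound above is the limit of the speedup on n processors, (tp + ts)/(ts + tp/n), as n grows. The sketch below assumes the parallel part divides evenly among the processors with no communication overhead; the function names are illustrative, not part of the original slides.

```c
/* Speedup on n processors, assuming the parallel part (tp seconds)
 * divides evenly with no communication overhead (an idealization). */
double amdahl_speedup(double tp, double ts, int n)
{
    return (tp + ts) / (ts + tp / n);
}

/* Upper bound approached as n grows without limit: (tp + ts) / ts. */
double amdahl_max_speedup(double tp, double ts)
{
    return (tp + ts) / ts;
}
```

For example, with tp = 90 and ts = 10, amdahl_speedup(90, 10, 10) is 100/19, roughly 5.26, and the bound stays at 10 however many processors are added.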

Balancing the load

• Must ensure that workload is equally divided among all the processors

• Worst case is when one of the processors does much more work than all others

Example (I)

• Computation partitioned among n processors
• One of them does 1/m of the work with m < n

– That processor becomes a bottleneck

• Maximum expected speedup: n

• Actual maximum speedup: m, because the run cannot finish before that processor completes its 1/m share

Example (II)

• Computation partitioned among 64 processors
• One of them does 1/8 of the work

• Maximum expected speedup: 64

• Actual maximum speedup: 8
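
The two examples can be checked with a short calculation: the run ends only when the busiest processor finishes, so the speedup is the total work divided by the largest share. This is a minimal sketch that assumes identical processors and ignores communication; load_speedup is an illustrative name.

```c
#include <stddef.h>

/* Speedup over a single processor when processor i performs load[i]
 * units of work: total work divided by the largest share.
 * Assumes all processors run at the same speed and n > 0. */
double load_speedup(const double *load, size_t n)
{
    double total = 0.0, busiest = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += load[i];
        if (load[i] > busiest)
            busiest = load[i];
    }
    return total / busiest;
}
```

With 576 units of work on 64 processors, one processor doing 72 units (1/8 of the total) and the other 63 doing 8 units each, load_speedup returns 8 rather than 64.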

A last issue

• Humans like to address issues one after the other
– We have meeting agendas
– We do not like to be interrupted
– We write sequential programs

René Descartes

• Seventeenth-century French philosopher
• Invented
– Cartesian coordinates
– Methodical doubt
  • "never to accept anything for true which I did not clearly know to be such"

• Proposed a scientific method based on four precepts

Method's third rule

• The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.

MULTIPROCESSOR ORGANIZATIONS

Shared memory multiprocessors

[Figure: shared-memory multiprocessor; surviving labels: Interconnection network, RAM, I/O]

Shared memory multiprocessor

• Can offer
– Uniform memory access to all processors (UMA)
  • Easiest to program
– Non-uniform memory access to all processors (NUMA)
  • Can scale up to larger sizes
  • Offers faster access to nearby memory

Computer clusters

[Figure: computer cluster; surviving label: Interconnection network]

Computer clusters

• Very easy to assemble
• Can take advantage of high-speed LANs
– Gigabit Ethernet, Myrinet, …
• Data exchanges must be done through message passing

Message passing (I)

• If processor P wants to access data in the main memory of processor Q it must
– Send a request to Q
– Wait for a reply
• For this to work, processor Q must have a thread
– Waiting for messages from other processors
– Sending them replies
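
The request/reply exchange can be sketched in a single process as follows. This is only an illustration of the protocol: the struct and function names are hypothetical, and a plain function call stands in for the network round trip and for the server thread that would run on Q in a real cluster.

```c
#include <stddef.h>

#define MEM_WORDS 16

/* A node's private main memory; other nodes cannot touch it directly. */
struct node {
    int memory[MEM_WORDS];
};

struct request { size_t address; };      /* sent by P to Q */
struct reply   { int ok; int value; };   /* sent by Q to P */

/* Runs on Q: what the waiting thread does for one incoming request. */
struct reply serve_request(const struct node *q, struct request req)
{
    struct reply rep = { 0, 0 };
    if (req.address < MEM_WORDS) {
        rep.ok = 1;
        rep.value = q->memory[req.address];
    }
    return rep;
}

/* Runs on P: send a request to Q, then wait for the reply.
 * Returns 1 and stores the value in *out on success, 0 otherwise. */
int remote_read(const struct node *q, size_t address, int *out)
{
    struct request req = { address };
    struct reply rep = serve_request(q, req);   /* send + wait */
    if (rep.ok)
        *out = rep.value;
    return rep.ok;
}
```

P never dereferences Q's memory itself; it only sees what Q chooses to put in the reply, which is the essential difference from a shared-memory access.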

Message passing (II)

• In a shared memory architecture, each processor can directly access all data
• A proposed solution
– Distributed shared memory offers to the users of a cluster the illusion of a single address space for their shared data
– Still has performance issues

When things do not add up

• Memory capacity is very important for big computing applications
– If the data can fit into main memory, the computation will run much faster

A problem

• A company replaced
– A single shared-memory computer with 32 GB of RAM
– By four "clustered" computers with 8 GB each
• More I/O than ever
• What happened?

The explanation

• Assume OS occupies one GB of RAM
– The old shared-memory computer still had 31 GB of free RAM
– Each of the clustered computers has 7 GB of free RAM
• The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
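
The arithmetic can be written out as a small helper. It is a sketch under the slide's assumption that the OS reserves the same amount of RAM on every machine; the function names are illustrative.

```c
/* Free RAM (in GB) on one machine, assuming the OS reserves os_gb. */
int free_ram_per_machine(int ram_gb, int os_gb)
{
    return ram_gb - os_gb;
}

/* Total free RAM (in GB) across a cluster of identical machines. */
int free_ram_cluster(int machines, int ram_gb, int os_gb)
{
    return machines * free_ram_per_machine(ram_gb, os_gb);
}
```

free_ram_cluster(1, 32, 1) gives 31 for the old computer, while free_ram_cluster(4, 8, 1) gives 28 for the cluster.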

Grid computing

• The computers are distributed over a very large network
– Sometimes computer time is donated
  • Volunteer computing
  • SETI@home
– Works well with embarrassingly parallel workloads
  • Searches in an n-dimensional space

HARDWARE MULTITHREADING

General idea

• Let the processor switch to another thread of computation while the current one is stalled

• Motivation:
– Increased cost of cache misses

Implementation

• Entirely controlled by the hardware
– Unlike multiprogramming
• Requires a processor capable of
– Keeping track of the state of each thread
  • One set of registers (including the PC) for each concurrent thread
– Quickly switching among concurrent threads

Approaches

• Fine-grained multithreading:
– Switches between threads for each instruction
– Provides highest throughputs
– Slows down execution of individual threads
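
A toy cycle-level model can make the trade-off concrete. The sketch below assumes a core that issues one instruction per cycle, picks the next ready thread in round-robin order, and knows in advance how many stall cycles each instruction causes; the struct and function names are hypothetical and do not describe any real processor.

```c
#include <stddef.h>

struct thread {
    const int *stall;   /* stall[i] = stall cycles caused by instruction i */
    size_t len;         /* number of instructions in the thread */
    size_t pc;          /* next instruction to issue */
    long ready_at;      /* first cycle at which the thread may issue again */
};

/* Returns the cycle at which the last thread completes when the core
 * switches threads on every instruction, skipping stalled threads. */
long run_fine_grained(struct thread *t, size_t n)
{
    long cycle = 0, finish = 0;
    size_t remaining = 0, next = 0;

    for (size_t i = 0; i < n; i++) {
        t[i].pc = 0;
        t[i].ready_at = 0;
        if (t[i].len > 0)
            remaining++;
    }
    while (remaining > 0) {
        for (size_t k = 0; k < n; k++) {
            size_t i = (next + k) % n;
            if (t[i].pc < t[i].len && t[i].ready_at <= cycle) {
                /* Issue one instruction, then wait out its stall. */
                t[i].ready_at = cycle + 1 + t[i].stall[t[i].pc];
                t[i].pc++;
                if (t[i].ready_at > finish)
                    finish = t[i].ready_at;
                if (t[i].pc == t[i].len)
                    remaining--;
                next = (i + 1) % n;
                break;
            }
        }
        cycle++;   /* an instruction issued, or the core idled */
    }
    return finish;
}
```

With a three-instruction trace whose second instruction stalls for 3 cycles, one thread takes 6 cycles, two threads take 8 cycles (instead of 12 run back to back), and four threads take 12 cycles with no idle cycle. In the four-thread run the first thread completes at cycle 9 rather than 6, which is the slowdown of individual threads noted above.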

Approaches

• Coarse-grained multithreading:
– Switches between threads whenever a long stall is detected
– Easier to implement
– Cannot eliminate all stalls

Approaches

• Simultaneous multithreading:
– Takes advantage of the ability of modern hardware to perform different tasks in parallel for instructions of different threads
– Best solution

ALPHABET SOUP

Overview

• Used to describe processor organizations where
– The same instructions can be applied to multiple data instances

• Encountered in
– Vector processors in the past
– Graphics processing units (GPUs)
– x86 multimedia extensions

Classification

• SISD:
– Single instruction, single data
– Conventional uniprocessor architecture

• MIMD:
– Multiple instructions, multiple data
– Conventional multiprocessor architecture

Classification

• SIMD:
– Single instruction, multiple data
– Performs same operations on a set of similar data

• Think of adding two vectors

for (i = 0; i < VECSIZE; i++)
    sum[i] = a[i] + b[i];

Vector computing

• Kind of SIMD architecture
– Used by Cray computers
• Pipelines multiple executions of a single instruction with different data ("vectors") through the ALU
• Requires
– Vector registers able to store multiple values
– Special vector instructions: say lv, addv, …
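
The role of vector registers and vector instructions can be emulated in plain C. This is only a software sketch: the register length VLEN is an assumption, lv and addv mirror the instruction names mentioned above, and sv is a hypothetical name for the matching store; real vector hardware would execute each of these as one pipelined instruction.

```c
#include <stddef.h>

#define VLEN 4   /* elements per vector register (assumed for the sketch) */

struct vreg { float v[VLEN]; };

/* lv: load VLEN consecutive values from memory into a vector register. */
static struct vreg lv(const float *src)
{
    struct vreg r;
    for (size_t i = 0; i < VLEN; i++)
        r.v[i] = src[i];
    return r;
}

/* addv: element-wise addition of two vector registers. */
static struct vreg addv(struct vreg a, struct vreg b)
{
    struct vreg r;
    for (size_t i = 0; i < VLEN; i++)
        r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* sv (hypothetical name): store a vector register back to memory. */
static void sv(float *dst, struct vreg r)
{
    for (size_t i = 0; i < VLEN; i++)
        dst[i] = r.v[i];
}

/* sum[i] = a[i] + b[i], processing VLEN elements per "instruction". */
void vector_add(float *sum, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + VLEN <= n; i += VLEN)
        sv(sum + i, addv(lv(a + i), lv(b + i)));
    for (; i < n; i++)            /* leftover elements, one at a time */
        sum[i] = a[i] + b[i];
}
```

The scalar loop at the end handles lengths that are not a multiple of VLEN, so the function gives the same result as the element-by-element loop for any n.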

Benchmarking

• Two factors to consider
– Memory bandwidth
  • Depends on interconnection network
– Floating-point performance
  • Best known benchmark is LINPACK

Roofline model

• Takes into account
– Memory bandwidth
– Floating-point performance
• Introduces arithmetic intensity
– Total number of floating-point operations in a program divided by total number of bytes transferred to main memory
– Measured in FLOPs/byte
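
As an example of the definition, the arithmetic intensity of an element-wise vector addition sum[i] = a[i] + b[i] can be estimated as below. The sketch assumes every operand travels to or from main memory exactly once (no cache reuse); the function name is illustrative.

```c
#include <stddef.h>

/* Arithmetic intensity (FLOPs/byte) of sum[i] = a[i] + b[i] over n
 * elements, assuming each operand moves to or from memory once. */
double vector_add_intensity(size_t n, size_t elem_bytes)
{
    double flops = (double)n;                              /* one add per element */
    double bytes = 3.0 * (double)n * (double)elem_bytes;   /* 2 loads + 1 store   */
    return flops / bytes;
}
```

For 4-byte floats this gives 1/12, about 0.083 FLOPs/byte, whatever the vector length: a very low intensity, so such a loop is limited by memory bandwidth on most machines.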

Roofline model

• Attainable GFLOP/s = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
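
The formula translates directly into code. In this minimal sketch the parameter names and units (GB/s, FLOPs/byte, GFLOP/s) are assumptions chosen so that the product comes out in GFLOP/s.

```c
/* Attainable GFLOP/s under the roofline model.
 * peak_bw_gbs:  peak memory bandwidth in GB/s
 * intensity:    arithmetic intensity in FLOPs/byte
 * peak_gflops:  peak floating-point performance in GFLOP/s */
double roofline_gflops(double peak_bw_gbs, double intensity, double peak_gflops)
{
    double memory_bound = peak_bw_gbs * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

On a hypothetical machine with 16 GB/s of memory bandwidth and a 64 GFLOP/s peak, a kernel with intensity 0.5 attains at most 8 GFLOP/s (memory bound), while a kernel with intensity 8 reaches the 64 GFLOP/s ceiling (compute bound); the two limits meet at 4 FLOPs/byte.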

Roofline model

[Figure: roofline plot; surviving labels: "Peak floating-point performance" and "Floating-point performance is limited by memory bandwidth"]
