
What is Parallel Computing?

Traditionally, software has been written for serial computation:
- It is run on a single computer having a single Central Processing Unit (CPU).
- A problem is broken into a discrete series of instructions.
- Instructions are executed one after another.
- Only one instruction may execute at any moment in time.

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
- It is run using multiple CPUs.
- A problem is broken into discrete parts that can be solved concurrently.
- Each part is further broken down into a series of instructions.
- Instructions from each part execute simultaneously on different CPUs (see the sketch below).
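To make the decomposition concrete, here is a minimal sketch in C using POSIX threads: an array sum is broken into parts, each part is summed concurrently on its own thread, and the partial results are combined. The array size, thread count, and function names are illustrative, not from the original text.

/* Sketch: a problem (summing an array) broken into parts that
 * execute concurrently on different CPUs. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static int data[N];
static long long partial[NTHREADS];

static void *sum_part(void *arg) {
    int id = (int)(long)arg;
    long long s = 0;
    /* Each thread works on its own contiguous slice of the array. */
    for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < N; i++) data[i] = 1;
    /* Launch one thread per part; the parts run simultaneously. */
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, sum_part, (void *)(long)i);
    long long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);   /* wait for each part to finish */
        total += partial[i];        /* combine the partial results  */
    }
    printf("sum = %lld\n", total);
    return 0;
}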

What are the Resources for Parallel Computing?

The compute resources can include:
- A single computer with multiple processors;
- A single computer with (multiple) processor(s) and some specialized compute resources (GPU, FPGA);
- An arbitrary number of computers connected by a network;
- A combination of the above.

What are the Applications of Parallel Computing?

- Weather and climate
- Chemical and nuclear reactions
- Biology and the human genome
- Geology and seismic activity
- Mechanical devices, from prosthetics to spacecraft
- Electronic circuits
- Manufacturing processes

Flynn's Classification

Flynn's taxonomy classifies computer architectures by the number of simultaneous instruction streams and data streams: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data), and MIMD (multiple instruction, multiple data).

Shared Memory Multiprocessing

Shared memory systems form a major category of multiprocessors. In this category, all processors share a global memory.

Communication between tasks running on different processors is performed by writing to and reading from the global memory. All interprocessor coordination and synchronization is also accomplished via the global memory. The address space is identical in all processors: the memory does not know which CPU is asking for it, and each CPU executes as if the other CPUs did not exist. A shared memory system is relatively easy to program, since all processors share a single view of data and communication between processors can be as fast as memory accesses to the same location. Two main problems need to be addressed when designing a shared memory system:

1. Performance degradation due to contention. Performance degradation can happen when multiple processors try to access the shared memory simultaneously. A typical design uses caches to reduce this contention.

2. Coherence problems. Having multiple copies of data spread throughout the caches can lead to a coherence problem. The copies in the caches are coherent if they are all equal to the same value. However, if one of the processors writes over the value of one of the copies, that copy becomes inconsistent because it no longer equals the value of the other copies (see the sketch below).

Scalability remains the main drawback of a shared memory system.
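The coordination issues above have a directly observable software symptom: if two processors update the same shared location without synchronization, their read-modify-write sequences interleave and updates are lost. A minimal sketch with POSIX threads; the counter and iteration count are illustrative, not from the original text.

/* Two threads increment the same shared counter. Without the mutex,
 * the final value is usually less than 2*ITERS because concurrent
 * read-modify-write sequences interleave. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *bump(void *arg) {
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);   /* remove these two lines to */
        counter++;                   /* observe lost updates      */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expected %d)\n", counter, 2 * ITERS);
    return 0;
}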

The simplest shared memory system consists of one memory module (M) that can be accessed by two processors, P1 and P2:

1. Requests arrive at the memory module through its two ports. An arbitration unit within the memory module passes requests through to a memory controller.
2. If the memory module is not busy and a single request arrives, the arbitration unit passes that request to the memory controller and the request is satisfied.
3. The module is placed in the busy state while a request is being serviced. If a new request arrives while the memory is busy servicing a previous request, the memory module sends a wait signal, through the memory controller, to the processor making the new request.
4. In response, the requesting processor may hold its request on the line until the memory becomes free, or it may repeat its request some time later.
5. If the arbitration unit receives two requests, it selects one of them and passes it to the memory controller. Again, the denied request can either be held to be served next or be repeated some time later.

In computer software, shared memory is either a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at the same time (one process creates an area in RAM which other processes can access), or a method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of data to a single instance instead, using virtual memory mappings or with explicit support from the program in question. The latter is most often used for shared libraries.

Support on UNIX platforms: POSIX provides a standardized API for using shared memory, POSIX Shared Memory, which uses the function shm_open from sys/mman.h. POSIX interprocess communication (part of the POSIX:XSI Extension) includes the shared-memory functions shmat, shmctl, shmdt and shmget. UNIX System V provides an API for shared memory as well, using shmget from sys/shm.h. BSD systems provide "anonymous mapped memory" which can be used by several processes.
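A minimal sketch of the POSIX Shared Memory API named above (shm_open plus mmap): one process creates and maps a named segment and writes into it; any other process that opens the same name sees the same bytes. The segment name and size are illustrative, not from the original text.

/* POSIX shared memory: create a named segment, map it, write to it.
 * Another process opening "/demo_shm" would see the same data.
 * On some systems, link with -lrt. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/demo_shm";   /* illustrative segment name */
    const size_t size = 4096;

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600); /* create/open */
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

    /* Map the segment into this process's address space. */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from shared memory");  /* visible to other mappers */
    printf("%s\n", p);

    munmap(p, size);
    close(fd);
    shm_unlink(name);   /* remove the name once done */
    return 0;
}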

UMA (Uniform Memory Access)

In a UMA system, the shared memory is accessible to all processors through an interconnection network, in the same way a single processor accesses its memory. All processors have equal access time to any memory location. The interconnection network used in a UMA system can be a single bus, multiple buses, or a crossbar switch.

[Figure: CPUs connected to a shared Memory through an Interconnection Network]

Because access to shared memory is balanced, these systems are also called SMP (symmetric multiprocessor) systems: each processor has equal opportunity to read/write memory, including equal access speed. A typical bus-structured SMP computer attempts to reduce contention for the bus by fetching instructions and data directly from each individual cache as much as possible. In the extreme, bus contention might be reduced to zero after the cache memories are loaded from the global memory, because it is possible for all instructions and data to be completely contained within the caches. This memory organization is the most popular among shared memory systems. Examples of this architecture are Sun Starfire servers, HP V series, Compaq AlphaServer GS, and Silicon Graphics Inc. multiprocessor servers.

Nonuniform Memory Access (NUMA)

In a NUMA system, each processor has part of the shared memory attached to it. The memory has a single address space, so any processor can access any memory location directly using its real address. However, the access time to a module depends on its distance from the processor, which results in nonuniform memory access times. A processor can also have a built-in memory controller, as in Intel's QuickPath Interconnect (QPI) NUMA architecture. Unlike a distributed memory architecture, the memory of another processor is accessible, but the latency to access it is not the same; memory local to another processor is called remote memory or foreign memory. A number of architectures are used to interconnect processors to memory modules in a NUMA system, among them tree and hierarchical bus networks. Examples of NUMA architecture are BBN TC-2000, SGI Origin 3000, and Cray T3E.
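On Linux, the local-versus-remote distinction can be managed explicitly through libnuma. The following is a hedged sketch, assuming libnuma is installed (link with -lnuma); the allocation size and node number are illustrative, not from the original text.

/* Sketch of explicit NUMA placement with Linux's libnuma.
 * numa_alloc_onnode() requests memory local to a given node, so CPUs
 * on that node avoid the slower remote-memory path described above. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    size_t size = 1 << 20;   /* 1 MiB, illustrative */
    /* Allocate on node 0: local (fast) for CPUs on node 0,
     * remote (slower) for CPUs on other nodes. */
    double *buf = numa_alloc_onnode(size, 0);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }
    buf[0] = 42.0;
    printf("allocated on node 0, buf[0] = %f\n", buf[0]);
    numa_free(buf, size);
    return 0;
}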

Distributed Memory Multiprocessing

Distributed memory refers to a multiple-processor computer system in which each processor has its own private memory. Computational tasks can only operate on local data; if remote data is required, the computational task must communicate with one or more remote processors.
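Distributed-memory machines are conventionally programmed with message passing. Below is a minimal sketch using MPI: process 1 cannot read process 0's private memory, so the data must be received as an explicit message. The tag and payload values are illustrative, not from the original text.

/* Minimal distributed-memory communication with MPI.
 * Compile with mpicc, run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value;
    if (rank == 0) {
        value = 123;                       /* data local to process 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 cannot read process 0's memory directly;
         * it must receive the data as a message. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}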

There is typically a processor, a memory, and some form of interconnection that allows programs on each processor to interact with each other. If a CPU needs data held in another CPU's local memory, CPU-to-CPU communication takes place to fetch the data from the other memory through the corresponding CPU. The interconnection can be organised with point-to-point links, or separate hardware can provide a switching network. A wise organization keeps all the data a CPU needs in its local memory, so that the only traffic on the interconnection network is messages between CPUs. The network topology is a key factor in determining how the multiprocessor machine scales.

The key issue in programming distributed memory systems is how to distribute the data over the memories. Depending on the problem being solved, the data can be distributed statically, or it can be moved through the nodes: moved on demand, or pushed to new nodes in advance. Data can be kept statically in nodes if most computations happen locally and only changes on edges have to be reported to other nodes. An example is a simulation where data is modeled using a grid and each node simulates a small part of the larger grid; on every iteration, nodes inform all neighboring nodes of their new edge data.

The advantage of (distributed) shared memory is that it offers a unified address space in which all data can be found. The advantage of distributed memory is that it excludes race conditions and forces the programmer to think about data distribution. The advantage of distributed (shared) memory is that it is easier to design a machine that scales with the algorithm. Distributed shared memory hides the mechanism of communication, but it does not hide the latency of communication.

How is Parallelism Done in Sequential Machines?

1. Multiplicity of functional units
- Multiple processing elements are used under one controller.
- Many of the ALU functions can be distributed to multiple specialized units.
- These multiple functional units are independent of each other.

Example: the CDC 6600 had 10 functional execution units built into its CPU, allowing multiple instructions to be worked on at the same time. Today this is known as a superscalar design, while at the time it was simply "unique". The system read and decoded instructions from memory as fast as possible, generally faster than they could be completed, and fed them off to the units for processing. The units included floating point multiply (2 copies), floating point divide, and floating point add. The IBM 360/91 had 2 parallel execution units: fixed point arithmetic, and floating point arithmetic (itself 2 functional units: floating point add-subtract and floating point multiply-divide).

2. Parallelism and pipelining within the CPU
- Parallelism is provided by building parallel adders in almost all ALUs.
- Pipelining: each task is divided into subtasks which can be executed in parallel (a worked example of pipeline speedup follows).
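A standard worked example makes the pipelining benefit concrete: if a pipeline has k stages and processes n tasks, the first result appears after k stage-times and one more completes every stage-time thereafter, for a total of k + n - 1 stage-times instead of the n x k a purely serial unit would need. The speedup n*k / (k + n - 1) therefore approaches k as n grows; for instance, a 5-stage pipeline processing 100 instructions takes 104 stage-times instead of 500, a speedup of about 4.8.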