Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker


Page 1: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Parallel and Distributed Algorithms

Spring 2010

Johnnie W. Baker

Page 2: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Chapter 1

Introduction and General Concepts

Page 3: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

References

• Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997, Chapter 1. Updated online version available through the website. -- primary reference

• Selim Akl, The Design of Efficient Parallel Algorithms, Chapter 2 in “Handbook on Parallel and Distributed Processing” edited by J. Blazewicz, K. Ecker, B. Plateau, and D. Trystram, Springer Verlag, 2000.

• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley, 2003.

• Harry Jordan and Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003.

• Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004, Chapters 1-2. -- secondary reference

• Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994.

• Barry Wilkinson and Michael Allen, Parallel Programming, 2nd Ed., Prentice Hall, 2005.

Page 4: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Outline

• Need for Parallel & Distributed Computing
• Flynn's Taxonomy of Parallel Computers
  – Two Main Types of MIMD Computers
• Examples of Computational Models
• Data Parallel & Functional/Control/Job Parallel
  – Granularity
• Analysis of Parallel Algorithms
  – Elementary Steps: computational and routing steps
  – Running Time & Time Optimal
  – Parallel Speedup
  – Cost and Work
  – Efficiency
• Linear and Superlinear Speedup
• Speedup and Slowdown Folklore Theorems
• Amdahl's and Gustafson's Laws

Page 5: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Reasons to Study Parallel & Distributed Computing

• Sequential computers have severe limits on memory size
  – Significant slowdowns occur when accessing data that is stored in external devices.
• Sequential computation times for most large problems are unacceptable.
• Sequential computers cannot meet the deadlines for many real-time problems.
• Some problems are distributed in nature and natural for distributed computation.

Page 6: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Grand Challenge Problems: Computational Science Categories (1989)

• Quantum chemistry, statistical mechanics, and relativistic physics

• Cosmology and astrophysics

• Computational fluid dynamics and turbulence

• Materials design and superconductivity

• Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling

• Medicine, and modeling of human organs and bones

• Global weather and environmental modeling

Page 7: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Weather Prediction

• Atmosphere is divided into 3D cells
• Data includes temperature, pressure, humidity, wind speed and direction, etc.
  – Recorded at regular time intervals in each cell
• There are about 5×10⁸ cells of 1-mile cubes.
• Calculations would take a modern computer over 100 days to perform the calculations needed for a 10-day forecast.
• Details in Ian Foster's 1995 online textbook
  – Designing & Building Parallel Programs
  – See the author's online copy for further information

Page 8: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Flynn’s Taxonomy

• Best-known classification scheme for parallel computers.

• Depends on the parallelism a computer exhibits in its
  – Instruction stream
  – Data stream
• A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream)

• The instruction stream (I) and the data stream (D) can be either single (S) or multiple (M)

• Four combinations: SISD, SIMD, MISD, MIMD

Page 9: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

SISD

• Single Instruction, Single Data
• Single-CPU systems
  – i.e., uniprocessors
• Note: co-processors don't count
• Concurrent processing allowed
  – Instruction prefetching
  – Pipelined execution of instructions
• Primary Example: PCs

Page 10: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

SIMD

• Single instruction, multiple data

• One instruction stream is broadcast to all processors

• Each processor (also called a processing element or PE) is very simple and is essentially an ALU;

– PEs do not store a copy of the program nor have a program control unit.

• Individual processors can be inhibited from participating in an instruction (based on a data test).

Page 11: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

SIMD (cont.)

• All active processors execute the same instruction synchronously, but on different data.

• On a memory access, all active processors must access the same location in their local memory.

• The data items form an array and an instruction can act on the complete array in one cycle.

Page 12: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

SIMD (cont.)

• Quinn (2004) calls this architecture a processor array
  – Two examples are Thinking Machines'
    • Connection Machine CM2
    • Connection Machine CM200
  – Also, MasPar's MP1 and MP2 are examples
• Quinn also considers a pipelined vector processor to be a SIMD
  – Somewhat non-standard
  – An example is the Cray-1

Page 13: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

MISD

• Multiple instruction, single data

• Quinn (2004) argues that a systolic array is an example of a MISD structure (pg 55-57)

• Some authors include pipelined architecture in this category

• This category does not receive much attention from most authors

Page 14: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

MIMD

• Multiple instruction, multiple data

• Processors are asynchronous, since they can independently execute different programs on different data sets.

• Communication is handled either by use of message passing or through shared memory.

• Considered by most researchers to contain the most powerful, least restricted computers.

Page 15: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

MIMD (cont.)

• Have major communication costs that are usually ignored when
  – comparing to SIMDs
  – computing algorithmic complexity
• A common way to program MIMDs is for all processors to execute the same program.
  – Called the SPMD method (for single program, multiple data)
  – The usual method when the number of processors is large.

Page 16: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Multiprocessors (Shared Memory MIMDs)

• All processors have access to all memory locations.
• The processors access memory through some type of interconnection network.
  – This type of memory access is called uniform memory access (UMA).
• Most multiprocessors have hierarchical or distributed memory systems in order to provide fast access to all memory locations.
  – This type of memory access is called nonuniform memory access (NUMA).

Page 17: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Multiprocessors (cont.)

• Normally, fast cache is used with NUMA systems to reduce the problem of different memory access times for PEs.
  – This creates the problem of ensuring that all copies of the same data in different memory locations are identical.

• Symmetric Multiprocessors (SMPs) are a popular example of shared memory multiprocessors

Page 18: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Multicomputers (Message-Passing MIMDs)

• Processors are connected by an interconnection network

• Each processor has a local memory and can only access its own local memory.

• Data is passed between processors using messages, as dictated by the program.

• A common approach to programming multicomputers is to use a message-passing language (e.g., MPI, PVM)

• The problem is divided into processes that can be executed concurrently on individual processors. Each processor is normally assigned multiple processes.

Page 19: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Multicomputers (cont. 2/3)

• Programming disadvantages of message-passing
  – Programmers must make explicit message-passing calls in the code
  – This is low-level programming and is error prone.
  – Data is not shared but copied, which increases the total data size.
  – Data integrity: difficulty in maintaining correctness of multiple copies of a data item.

Page 20: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Multicomputers (cont. 3/3)

• Programming advantages of message-passing
  – No problem with simultaneous access to data.
  – Allows different PCs to operate on the same data independently.
  – Allows PCs on a network to be easily upgraded when faster processors become available.

Page 21: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Data Parallel Computation

• All tasks (or processors) apply the same set of operations to different data.

• Example (see also the C sketch below):

    for i ← 0 to 99 do
        a[i] ← b[i] + c[i]
    endfor

• SIMDs always use data parallel computation.
  – All active processors execute the operations synchronously.
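For concreteness, here is a minimal sketch of the same element-wise addition written as a data-parallel loop in C with OpenMP; the pragma, array names, and sizes are illustrative and not taken from the slides.

    #include <stdio.h>

    #define N 100

    int main(void) {
        double a[N], b[N], c[N];

        /* initialize the operand arrays */
        for (int i = 0; i < N; i++) {
            b[i] = i;
            c[i] = 2.0 * i;
        }

        /* every iteration applies the same operation to different data,
           so the iterations can be distributed across processors */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("a[99] = %f\n", a[99]);
        return 0;
    }

Roughly speaking, on a SIMD machine the addition is broadcast once and applied to the whole array in lockstep, while on an MIMD machine the iterations are simply divided among asynchronous processors.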

Page 22: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Data Parallel Computation for MIMDs

• A common way to support data parallel programming on MIMDs is to use SPMD (single program multiple data) programming, as follows (see the sketch after this list):
  – Processors (or tasks) execute the same block of instructions concurrently but asynchronously
  – No communication or synchronization occurs within these concurrent instruction blocks.
  – Each instruction block is normally followed by synchronization and communication steps
• If processors have identical tasks (with different data), this method allows them to be executed in a data parallel fashion.
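A minimal SPMD sketch in C with MPI, assuming the pattern described above: an asynchronous compute block over each process's share of the data, followed by a collective communication step. The data, loop, and variable names are illustrative only.

    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* every process runs this same program on its own slice of the data;
           this block involves no communication or synchronization */
        double local_sum = 0.0;
        for (long i = rank; i < N; i += size)
            local_sum += 1.0 / (i + 1);

        /* the compute block is followed by a communication/synchronization step */
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f (computed by %d processes)\n", global_sum, size);

        MPI_Finalize();
        return 0;
    }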

Page 23: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Functional/Control/Job Parallelism

• Involves executing different sets of operations on different data sets.
• Typical MIMD control parallelism
  – Problem is divided into different non-identical tasks
  – Tasks are distributed among the processors
• Tasks are usually divided between processors so that their workload is roughly balanced
  – Each processor follows some scheme for executing its tasks.
• Frequently, a task scheduling algorithm is used.
• If the task being executed has to wait for data, the task is suspended and another task is executed.

Page 24: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Grain Size

• Quinn defines grain size to be the average number of computations performed between communication or synchronization steps.
  – Quinn (2004), pg 411
• Data parallel programming usually results in smaller grain size computation
  – SIMD computation is considered to be fine-grain
  – MIMD data parallelism is usually considered to be medium-grain
• Generally, increasing the grain size increases the performance of MIMD programs.

Page 25: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Examples of Parallel Models

• Synchronous PRAM (Parallel RAM)
  – Each processor stores a copy of the program
  – All execute instructions simultaneously on different data
    • Generally, the active processors execute the same instruction
    • The full power of the PRAM allows processors to execute different instructions simultaneously.
  – Here, "simultaneous" means they execute using the same clock.
  – All processors have access to the same (shared) memory.
  – Data exchanges occur through the shared memory.
  – All processors have I/O capabilities.
  – See Akl, Figure 1.2

Page 26: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Examples of Parallel Models (cont)

• Binary Tree of Processors (Akl, Fig. 1.3)
  – No shared memory
  – Memory distributed equally between PEs
  – PEs communicate through the binary tree network.
  – Each PE stores a copy of the program.
  – PEs can execute different instructions
  – Normally synchronous, but asynchronous is also possible.
  – Only leaf and root PEs have I/O capabilities.

Page 27: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Examples of Parallel Models (cont)

• Symmetric Multiprocessors (SMPs)
  – Asynchronous processors of the same size.
  – Shared memory
  – Memory locations equally distant from each PE in smaller machines.
  – Formerly called "tightly coupled multiprocessors".

Page 28: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Analysis of Parallel Algorithms

• Elementary Steps:

– A computational step is a basic arithmetic or logic operation

– A routing step is used to move data of constant size between PEs using either shared memory or a link connecting the two PEs.

Page 29: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Communication Time Analysis

• Worst Case
  – Usually the default for synchronous computation
  – Difficult to use for asynchronous computation
• Average Case
  – Usual default for asynchronous computation
  – Only occasionally used with synchronous computation
• Important to distinguish which is being used
  – Often "worst case" synchronous timings are compared to "average case" asynchronous timings.
  – Default in Akl's text is "worst case"
  – Default in the Casanova et al. text is probably "average case"

Page 30: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Shared Memory Communication Time

• A data exchange requires two memory accesses
  – P(i) writes the datum to a memory cell.
  – P(j) reads the datum from that memory cell.
• Uniform Analysis
  – Charge constant time for a memory access
  – Unrealistic, as shown for the PRAM in Chapter 2 of Akl's book
• Non-Uniform Analysis
  – Charge O(lg M) cost for a memory access, where M is the number of memory cells
  – More realistic, based on how memory is accessed
• Traditional to use Uniform Analysis
  – Considering a second variable M makes the analysis complex
  – Most non-constant time algorithms have the same complexity using either the uniform or the non-uniform analysis.

Page 31: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Communication through links

• Assumes constant time to communicate along one link.
  – An earlier paper (mid-1990s?) estimates a communication step along a link to be on the order of 10-100 times slower than a computational step.
• When two processors are not connected, routing a datum between the two processors requires O(m) time units, where m is the number of links along the path used.

Page 32: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Running/Execution Time

• Running time for an algorithm
  – Time between when the first processor starts executing & when the last processor terminates.
  – Either worst case or average case can be used.
    • Often not stated which is used
    • Normally worst case for synchronous & average case for asynchronous
  – An upper bound is given by the number of steps of the fastest algorithm known for the problem.
  – The lower bound gives the fewest number of steps possible to solve a worst case of the problem.
• Time optimal algorithm:
  – Algorithm in which the number of steps for the algorithm is the same complexity as the lower bound for the problem.
  – The Casanova et al. book uses the term efficient instead.

Page 33: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Speedup

• A measure of the decrease in running time due to parallelism
• Let t1 denote the running time of the fastest (possible/known) sequential algorithm for the problem.
• If the parallel algorithm uses p PEs, let tp denote its running time.
• The speedup of the parallel algorithm using p processors is defined by S(1,p) = t1 / tp.
• The goal is to obtain the largest speedup possible.
• For worst (average) case speedup, both t1 and tp should be worst (average) case, respectively.

Page 34: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Optimal Speedup

• Theorem: The maximum possible speedup for parallel computers with n PEs for "traditional" problems is n.
• Proof (for "traditional" problems):
  – Let ts be the sequential running time.
  – Assume the computation is partitioned perfectly into n processes of equal duration.
  – Assume no overhead is incurred as a result of this partitioning of the computation (e.g., partitioning process, information passing, coordination of processes, etc.).
  – Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation.
  – The parallel running time is ts / n.
  – Then the parallel speedup of this computation is S(1,n) = ts / (ts / n) = n.

Page 35: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Linear Speedup

• The preceding slide argues that S(1,n) ≤ n, and optimally S(1,n) = n.
• This speedup is called linear since S(1,n) = Θ(n).
• The next slide formalizes this statement and provides an alternate proof.

Page 36: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Speedup Folklore Theorem

Theorem: For a traditional problem, the maximum speedup possible for a parallel algorithm using p processors is at most p. That is, S(1,p) ≤ p.

An Alternate Proof (for "traditional" problems)
• Let t1 and tp be as above and assume t1/tp > p.
• This parallel algorithm can be simulated by a sequential computer which executes each parallel step with p serial steps.
• The sequential simulation runs in p·tp time, which is faster than the fastest sequential algorithm since p·tp < t1.
• This contradiction establishes the "theorem".

Note: We will see that this result is not valid for some non-traditional problems.

Page 37: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Speedup Usually Suboptimal

• Unfortunately, the best speedup possible for most applications is much less than n, as
  – Assumptions in the first proof are usually invalid.
  – Usually some portions of the algorithm are sequential, with only one PE active.
  – Other portions of the algorithm may allow a non-constant number of PEs to be idle for more than a constant number of steps.
    • E.g., during parts of the execution, many PEs may be waiting to receive or to send data.

Page 38: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Speedup Example

• Example: Compute the sum of n numbers using a binary tree with n leaves (see the sketch below).
  – Takes tp = lg n steps to compute the sum in the root node.
  – Takes t1 = n − 1 steps to compute the sum sequentially.
  – The number of processors is p = 2n − 1, so for n > 1,

      S(1,p) ≈ n/(lg n) < n < 2n − 1 = p

  – Speedup is not optimal.
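A minimal sketch in C of the lg n-step pairwise (tree) summation the example describes, simulated sequentially so it runs anywhere; on a binary tree machine each round would be one parallel step on separate PEs. The array contents and size are illustrative.

    #include <stdio.h>

    #define N 8   /* number of leaves; assumed to be a power of two */

    int main(void) {
        double x[N] = {3, 1, 4, 1, 5, 9, 2, 6};
        int rounds = 0;

        /* lg N rounds; in each round, values at distance 'stride' are added.
           On a binary tree machine, all additions of a round happen in parallel. */
        for (int stride = 1; stride < N; stride *= 2) {
            for (int i = 0; i + stride < N; i += 2 * stride)
                x[i] += x[i + stride];
            rounds++;
        }

        printf("sum = %g computed in %d parallel steps (lg %d = %d)\n",
               x[0], rounds, N, rounds);
        return 0;
    }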

Page 39: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Optimal Speedup Example

• Example: Computing the sum of n numbers optimally using a binary tree.
  – Assume the binary tree has n/(lg n) leaves.
  – Each leaf is given lg n numbers and computes the sum of these numbers initially in Θ(lg n) time.
  – The overall sum of the values in the n/(lg n) leaves is found next, as in the preceding example, in lg(n/lg n) = Θ(lg n) time.
  – Then tp = Θ(lg n) and S(1,p) = Θ(n/lg n) = cp for some constant c, since p = 2(n/lg n) − 1.
  – The speedup of this parallel algorithm is optimal, as the Folklore Speedup Theorem is valid for traditional problems like this one, so speedup ≤ p.

Page 40: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Superlinear Speedup

• Superlinear speedup occurs when S(n) > n.
• Most texts other than Akl's argue that
  – Linear speedup is the maximum speedup obtainable.
  – The preceding "optimal speedup proof" is used to argue that superlinearity is impossible
  – Linear speedup is optimal for traditional problems
• Those stating that superlinearity is impossible claim that any 'superlinear behavior' occurs due to the following types of natural reasons:
  – the extra memory in the parallel system.
  – a sub-optimal sequential algorithm was used.
  – random luck, e.g., in the case of an algorithm that has a random aspect in its design (e.g., random selection)

Page 41: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Superlinearity (cont)

• Some problems cannot be solved without use of parallel computation.
  – It seems reasonable to consider parallel solutions to these to be "superlinear".
    • Speedup would be infinite.
  – Examples include many large software systems
    • Data too large to fit in the memory of a sequential computer
    • Problems contain deadlines that must be met, but the required computation is too great for a sequential computer to meet the deadlines.
  – Those not accepting superlinearity discount these, as indicated in the previous slide.

Page 42: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Superlinearity (cont.)

• The textbook by Casanova et al. discusses superlinearity primarily on pages 11-12 and in Exercise 4.3 on page 140.
  – Accepts that superlinearity is possible
  – Indicates natural causes (e.g., larger memory) as the primary (perhaps only) reason for superlinearity.
  – Treats cases of infinite speedup (where a sequential processor cannot do the job) as an anomaly.
    • E.g., two men can carry a piano upstairs, but one man cannot do this, even if given unlimited time.

Page 43: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Some Non-Traditional Problems

• Problems with any of the following characteristics are "non-traditional"
  – Input may arrive in real time
  – Input may vary over time
    • For example, find the convex hull of a set of points, where points can be added & removed over time.
  – Output may affect future input
  – Output may have to meet certain deadlines.
• See discussion on pages 16-17 of Akl's text
• See Chapter 12 of Akl's text for additional discussions.

Page 44: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Other Superlinear Examples

• Selim Akl has given a large number of examples of different types of superlinear algorithms.
  – Example 1.17 in Akl's online textbook (pg 17).
  – See the last chapter (Ch 12) of his online text.
  – See his 2007 book chapter titled "Evolving Computational Systems"
    • To be posted to the website shortly.
  – Some of his conference & journal publications

Page 45: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

A Superlinearity Example

• Example: Tracking
  – 6,000 streams of radar data arrive during each 0.5 second, and each corresponds to an aircraft.
  – Each must be checked against the "projected position box" for each of 4,000 tracked aircraft.
  – The position of each plane that is matched is updated to correspond with its radar location.
  – This process must be repeated every 0.5 seconds.
  – A sequential processor cannot process data this fast.

Page 46: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Slowdown Folklore Theorem: If a computation can be performed with p processors in time tp, with q processors in time tq, and q < p, then

    tp ≤ tq ≤ (1 + p/q) tp

Comments:
• The new run time tq is O(tp·p/q).
• This result puts a tight bound on the amount of extra time required to execute an algorithm with fewer PEs.
• Establishes a smooth degradation of running time as the number of processors decreases.
• For q = 1, this is essentially the speedup theorem, but with a less sharp constant.
• Similar to Proposition 1.1, pg 11, in the Casanova et al. textbook.

Page 47: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Proof: (for traditional problems)

• Let Wi denote the number of elementary steps performed collectively during the i-th time period of the algorithm (using p PEs).
• Then W = W1 + W2 + ... + Wk (where k = tp) is the total number of elementary steps performed collectively by the processors.
• Since all p processors are not necessarily busy all the time, W ≤ p·tp.
• With q processors and q < p, we can simulate the Wi elementary steps of the i-th time period by having each of the q processors do ⌈Wi/q⌉ elementary steps (see the numeric sketch below).
• The time taken for the simulation is

    tq ≤ ⌈W1/q⌉ + ⌈W2/q⌉ + ... + ⌈Wk/q⌉
       ≤ (W1/q + 1) + (W2/q + 1) + ... + (Wk/q + 1)
       ≤ tp + p·tp/q
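To make the simulation argument concrete, here is a small C illustration; the processor counts and per-period workloads Wi are made-up values that only need to satisfy Wi ≤ p.

    #include <stdio.h>

    int main(void) {
        /* hypothetical workload: W[i] elementary steps were done by the
           p processors in each of k = tp time periods */
        int p = 8, q = 3, k = 5;
        int W[] = {8, 6, 8, 4, 7};            /* each W[i] <= p */

        int tq = 0;
        for (int i = 0; i < k; i++)
            tq += (W[i] + q - 1) / q;         /* ceil(W[i]/q) steps to simulate period i */

        double bound = k + (double)p * k / q; /* theorem's bound: tp + p*tp/q */
        printf("simulated tq = %d, theorem bound = %.1f\n", tq, bound);
        return 0;
    }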

Page 48: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

A Non-Traditional Example for “Folklore Theorems”

• Problem Statement: (from Akl, Example 1.17)

– n sensors are each used to repeat the same set {s1, s2, ..., sn } of measurements n times at an environmental site.

– It takes each sensor n time units to make each measurement.

– Each si is measured in turn by another sensor, producing n distinct cyclic permutations of the sequence S = (s1, s2, ..., sn)

– The sequences produced by 3 sensors are

(s1, s2, s3), (s3, s1, s2), (s2, s3, s1)

Page 49: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Nontraditional Example (cont.)

– The stream of data from each sensor is sent to a different processor, with all corresponding si values in each stream arriving at the same time.
  • The data from a sensor is consistently sent to the same processor.
– Each si must be read and stored in unit time or the stream is terminated.
– Additionally, the minimum of the n new data values that arrive during each n steps must be computed, validated, and stored.
  • If the minimum is below a safe value, the process is stopped.
– The overall minimum of the sensor reports over all n steps is also computed and recorded.

Page 50: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Sequential Processor Solution

• A single processor can only monitor the values in one data stream.
  – After the first value of a selected stream is stored, it is too late to process the other n−1 items that arrived at the same time in other streams.
  – A stream remains active only if its preceding value was read.
  – Note this is the best we can do to solve the problem.
• We assume that a processor can read a value, compare it to the smallest previous value received, and update its "smallest value" in one unit of time.
• Sequentially, the computation takes n computational steps plus n(n−1) time units of delay between arrivals, or exactly n² time units, to process the n values in its stream.
• Summary: Calculating the first minimum takes n² time units.

Page 51: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Binary Tree Solution using n Leaves

• Assume a complete binary tree with n leaf processors and a total of 2n − 1 processors.
• Each leaf receives the input from a different stream.
• Inputs from each stream are sent up the tree.
• Each interior PE receives two inputs, computes the minimum, and sends the result to its parent.
• Takes lg n time to compute the minimum and to store the answer at the root.
  – The root also calculates an overall minimum.
• Running time of the parallel solution for the first round (i.e., first minimum) is lg n using 2n − 1 = Θ(n) processors.

Page 52: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

“Speedup Folklore Theorem” Invalid for Non-Traditional Problems

• A sequential solution to the previous example required n² steps.
• The parallel solution took lg n steps using a binary tree with p = 2n − 1 = Θ(n) processors.
• The speedup is S(1,p) = t1/tp = n²/lg n, which is asymptotically larger than p, the number of processors.

• Akl calls this speedup parallel synergy.

Page 53: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Read Next Two Slides

• The next two slides will be left for students to read.

• They establish that the “Slowdown Folklore Theorem” is also invalid for non-traditional problems.

• The technique used is similar to that used to show “Speedup Folklore Theorem” is invalid for non-traditional problems.

• Further details about both examples can be found in Akl's text.

Page 54: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Binary Tree Solution Using √n Leaves

• Assume a complete binary tree with √n leaf processors and a total of q = 2√n − 1 processors.
• The √n leaf processors monitor √n streams whose first √n values are distinct from each other.
  – Requires that the order of data arrival in the streams be known in advance.
  – Note that this is the best we can do to solve the original problem.
• Consecutive data in each stream are separated by n time units, so each PE takes (√n − 1)n time in the first phase to see the first √n values in its stream and to compute their minimum.
• Each leaf computes the minimum of its first √n values.
• The minimum of the √n leaf "minimum values" is computed by the binary tree in lg(√n) = ½(lg n) steps.
• Running time of the parallel solution using q = 2√n − 1 PEs for the first round is (√n − 1)n + ½(lg n).

Page 55: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

“Slowdown Folklore” Theorem Invalid for Non-Traditional Problems

• A parallel solution to the previous example with q = 2√n − 1 = Θ(√n) PEs had tq = (√n − 1)n + ½(lg n).
• The parallel solution using a binary tree with p = 2n − 1 = Θ(n) PEs took tp = lg n.
• The Slowdown Folklore Theorem states that tp ≤ tq ≤ (1 + p/q) tp.
• Clearly tq/tp is asymptotically greater than p/q, which contradicts the Slowdown "Folklore Theorem":
  – (p/q)·tp = ((2n − 1)/(2√n − 1))·lg n = Θ(√n lg n)
  – tq = (√n − 1)n + ½(lg n) = Θ(n^(3/2))
• Akl also calls this phenomenon parallel synergy.

Page 56: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Cost Optimal

• A parallel algorithm for a problem is cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem.
  – By proportional, we mean that

      cost = n·tp = k·ts

    for some constant k, where n is the number of processors used.
• Equivalently, a parallel algorithm with cost C is cost optimal if there is an optimal sequential algorithm with running time ts and Θ(C) = Θ(ts).
• If no optimal sequential algorithm is known, then the cost of a parallel algorithm is usually compared to the running time of the "fastest known" sequential algorithm instead.

Page 57: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Cost Optimal Example

• Recall the speedup example of a binary tree with n/(lg n) leaf PEs that computes the sum of n values in lg n time.
• C(n) = p(n) × t(n) = (2(n/lg n) − 1) × lg n = 2n − lg n = Θ(n)
• An optimal sequential algorithm requires n − 1 = Θ(n) steps.
• This shows that this algorithm is cost optimal.

Page 58: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Another Cost Optimal Example

• Sorting has a sequential lower bound of Ω(n lg n) steps
  – Assumes the algorithm is general purpose (comparison-based) and not tailored to a specific machine.
• A parallel sorting algorithm using Θ(n) PEs will require at least Ω(lg n) time.
• A parallel sorting algorithm requiring Θ(n) PEs and Θ(lg n) time is cost optimal.
• A parallel sorting algorithm using Θ(lg n) PEs will require at least Ω(n) time.

Page 59: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Work

• Defn: The work of a parallel algorithm is the sum of the individual steps executed by all the processors.
  – While inactive processor time is included in cost, only active processor time is included in work.
  – Work indicates the actual steps that a sequential computer would have to take in order to simulate the action of a parallel computer.
  – Observe that the cost is an upper bound for the work.
  – While this definition of work is fairly standard, a few authors use our definition of "cost" to define "work".

Page 60: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Algorithm Efficiency

• Efficiency is defined by

      E(1,n) = S(1,n) / n

• Observe that

      E(1,n) = ts / (n·tp) = ts / cost

  (see the sketch below for these measures on the tree-sum example).
• The sequential algorithm must be the fastest known for this problem.
• Efficiency gives the percentage of full utilization of the parallel processors on the computation.
• Note that for traditional problems:
  – Maximum speedup when using n processors is n.
  – Maximum efficiency is 1.
  – For traditional problems, efficiency ≤ 1.
• For non-traditional problems, algorithms may have speedup > n and efficiency > 1.
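A small C sketch, under the assumption of the earlier tree-sum example (t1 = n − 1, tp = lg n, p = 2n − 1 processors), showing how speedup, cost, and efficiency relate; the value of n is illustrative.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* tree-sum example from the earlier slides (illustrative values) */
        double n  = 1024.0;
        double t1 = n - 1.0;        /* fastest sequential time      */
        double tp = log2(n);        /* parallel time: lg n steps    */
        double p  = 2.0 * n - 1.0;  /* number of processors         */

        double speedup    = t1 / tp;    /* S(1,p) = t1 / tp             */
        double cost       = p * tp;     /* cost   = p * tp              */
        double efficiency = t1 / cost;  /* E(1,p) = S(1,p)/p = t1/cost  */

        printf("S(1,p) = %.1f, cost = %.0f, E(1,p) = %.3f\n",
               speedup, cost, efficiency);
        return 0;
    }

Note that this n-leaf version has low efficiency; the cost-optimal n/(lg n)-leaf version of the example raises the efficiency to a constant.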

Page 61: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Amdahl’s Law

Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup S(n) achievable by a parallel computer with n processors is

    S(n) ≤ 1 / (f + (1 − f)/n)

• Note: Amdahl's law holds only for traditional problems.
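A minimal C sketch (not from the slides) with a helper that evaluates Amdahl's bound, exercised with the 5% serial fraction used in the example later in the chapter.

    #include <stdio.h>

    /* Amdahl bound: maximum speedup with n processors when a fraction f
       of the computation must run sequentially */
    double amdahl_speedup(double f, double n) {
        return 1.0 / (f + (1.0 - f) / n);
    }

    int main(void) {
        double f = 0.05;   /* 5% serial fraction, as in the later example */

        for (int n = 10; n <= 100000; n *= 10)
            printf("n = %6d  ->  S(n) <= %.2f\n", n, amdahl_speedup(f, n));

        /* as n grows, the bound approaches 1/f = 20 */
        printf("limit 1/f = %.1f\n", 1.0 / f);
        return 0;
    }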

Page 62: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Proof: If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, the time to perform the computation with n processors is given by

    tp = f·ts + [(1 − f)·ts] / n

Page 63: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

A Symbolic Proof of Amdahl’s Law

• Using the preceding expression for tp,

      S(n) ≤ ts / tp = ts / ( f·ts + (1 − f)·ts/n ) = 1 / ( f + (1 − f)/n )

• The last expression is obtained by dividing the numerator and denominator by ts, which establishes Amdahl's law.
• Multiplying the numerator & denominator by n produces the following alternate version of this formula:

      S(n) ≤ n / ( nf + (1 − f) ) = n / ( 1 + (n − 1)f )

Page 64: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Comments on Amdahl’s Law

• The preceding proof assumes that speedup cannot be superlinear; i.e.,

      S(n) = ts / tp ≤ n

  – Question: Where was this assumption used?
• Amdahl's law is not valid for non-traditional problems where superlinearity can occur.
• Note that S(n) never exceeds 1/f, but approaches 1/f as n increases.
• The conclusion of Amdahl's law is sometimes stated as S(n) ≤ 1/f.

Page 65: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Limits on Parallelism

• Note that Amdahl’s law places a very strict limit on parallelism:

• Example: If 5% of the computation is serial, then its maximum speedup is at most 20, no matter how many processors are used, since

• Initially, Amdahl’s law was viewed as a fatal limit to any significant future for parallelism.

• It is not unusual to find computer professionals (outside parallel area) who still think of Amdahl’s law as severe limit to parallelism.

205

100

05.0

11)(

fnS

Page 66: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Applications of Amdahl’s Law

• The law shows that efforts to further reduce the fraction of the code that is sequential may pay off in large performance gains.
• Shows that hardware that achieves even a small decrease in the fraction of work executed sequentially may be considerably more efficient.
• Can sometimes use Amdahl's law to increase the efficiency of parallel algorithms
  – E.g., see Jordan & Alaghband's textbook

Page 67: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

Amdahl's & Gustafson's Laws

• A key flaw in past arguments that "Amdahl's law is a severe limit to the future of parallelism" is
  – Gustafson's Law: The proportion of the computations that are sequential normally decreases as the problem size increases.
  – Gustafson's law is a "rule of thumb" or general principle, but is not a theorem that has been proved.
• Other limitations in applying Amdahl's Law:
  – Its proof focuses on the steps in a particular algorithm, and does not consider that other algorithms with a higher percentage of parallelism may exist.
  – Amdahl's law applies only to 'traditional problems' where superlinearity cannot occur.

Page 68: Parallel and Distributed Algorithms Spring 2010 Johnnie W. Baker

End of Chapter One

Introduction and General Concepts