Parallel and Distributed Algorithms
Spring 2010
Johnnie W. Baker
Chapter 1
Introduction and General Concepts
References
• Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997, Chapter 1. Updated online version available through the website. -- primary reference
• Selim Akl, The Design of Efficient Parallel Algorithms, Chapter 2 in "Handbook on Parallel and Distributed Processing," edited by J. Blazewicz, K. Ecker, B. Plateau, and D. Trystram, Springer Verlag, 2000.
• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley, 2003.
• Harry Jordan and Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003.
• Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004, Chapters 1-2. -- secondary reference
• Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994.
• Barry Wilkinson and Michael Allen, Parallel Programming, 2nd Ed., Prentice Hall, 2005.
Outline
• Need for Parallel & Distributed Computing
• Flynn's Taxonomy of Parallel Computers
  – Two Main Types of MIMD Computers
• Examples of Computational Models
• Data Parallel & Functional/Control/Job Parallel
  – Granularity
• Analysis of Parallel Algorithms
  – Elementary Steps: computational and routing steps
  – Running Time & Time Optimal
  – Parallel Speedup
  – Cost and Work
  – Efficiency
• Linear and Superlinear Speedup
• Speedup and Slowdown Folklore Theorems
• Amdahl's and Gustafson's Laws
Reasons to Study Parallel & Distributed Computing
• Sequential computers have severe limits on memory size.
  – Significant slowdowns occur when accessing data that is stored in external devices.
• Sequential computation times for most large problems are unacceptable.
• Sequential computers cannot meet the deadlines of many real-time problems.
• Some problems are distributed in nature and are natural candidates for distributed computation.
Grand Challenge Problems Computational Science Categories (1989)
• Quantum chemistry, statistical mechanics, and relativistic physics
• Cosmology and astrophysics
• Computational fluid dynamics and turbulence
• Materials design and superconductivity
• Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
• Medicine and modeling of human organs and bones
• Global weather and environmental modeling
Weather Prediction
• Atmosphere is divided into 3D cells.
• Data includes temperature, pressure, humidity, wind speed and direction, etc.
  – Recorded at regular time intervals in each cell.
• There are about 5×10³ cells of 1 cubic mile each.
• Calculations would take a modern computer over 100 days to perform the calculations needed for a 10-day forecast.
• Details in Ian Foster's 1995 online textbook
  – Designing and Building Parallel Programs
  – See the author's online copy for further information.
Flynn’s Taxonomy
• Best-known classification scheme for parallel computers.
• Classifies a computer by the parallelism it exhibits in its
  – Instruction stream
  – Data stream
• A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream).
• The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M).
• Four combinations: SISD, SIMD, MISD, MIMD
SISD
• Single Instruction, Single Data
• Single-CPU systems
  – i.e., uniprocessors
• Note: co-processors don't count
• Concurrent processing allowed
  – Instruction prefetching
  – Pipelined execution of instructions
• Primary example: PCs
SIMD
• Single instruction, multiple data
• One instruction stream is broadcast to all processors
• Each processor (also called a processing element or PE) is very simplistic and is essentially an ALU;
– PEs do not store a copy of the program nor have a program control unit.
• Individual processors can be inhibited from participating in an instruction (based on a data test).
SIMD (cont.)
• All active processors execute the same instruction synchronously, but on different data.
• On a memory access, all active processors must access the same location in their local memory.
• The data items form an array and an instruction can act on the complete array in one cycle.
SIMD (cont.)
• Quinn (2004) calls this architecture a processor array.
  – Two examples are Thinking Machines'
    • Connection Machine CM2
    • Connection Machine CM200
  – Also, MasPar's MP1 and MP2 are examples.
• Quinn also considers a pipelined vector processor to be a SIMD.
  – Somewhat non-standard.
  – An example is the Cray-1.
MISD
• Multiple instruction, single data
• Quinn (2004) argues that a systolic array is an example of a MISD structure (pg 55-57)
• Some authors include pipelined architecture in this category
• This category does not receive much attention from most authors
MIMD
• Multiple instruction, multiple data
• Processors are asynchronous, since they can independently execute different programs on different data sets.
• Communication is handled either by the use of message passing or through shared memory.
• Considered by most researchers to contain the most powerful, least restricted computers.
MIMD (cont.)
• Have major communication costs that are usually ignored when
  – comparing to SIMDs
  – computing algorithmic complexity
• A common way to program MIMDs is for all processors to execute the same program.
  – Called the SPMD method (single program, multiple data)
  – Usual method when the number of processors is large.
Multiprocessors (Shared Memory MIMDs)
• All processors have access to all memory locations.
• The processors access memory through some type of interconnection network.
  – This type of memory access is called uniform memory access (UMA).
• Most multiprocessors have hierarchical or distributed memory systems in order to provide fast access to all memory locations.
  – This type of memory access is called nonuniform memory access (NUMA).
Multiprocessors (cont.)
• Normally, fast cache is used with NUMA systems to reduce the problem of different memory access times for PEs.
  – This creates the problem of ensuring that all copies of the same data in different memory locations are identical.
• Symmetric Multiprocessors (SMPs) are a popular example of shared memory multiprocessors
Multicomputers (Message-Passing MIMDs)
• Processors are connected by an interconnection network
• Each processor has a local memory and can only access its own local memory.
• Data is passed between processors using messages, as dictated by the program.
• A common approach to programming multicomputers is to use a message-passing language (e.g., MPI, PVM).
• The problem is divided into processes that can be executed concurrently on individual processors. Each processor is normally assigned multiple processes.
Multicomputers (cont. 2/3)
• Programming disadvantages of message-passing
  – Programmers must make explicit message-passing calls in the code.
  – This is low-level programming and is error prone.
  – Data is not shared but copied, which increases the total data size.
  – Data integrity: difficulty in maintaining correctness of multiple copies of a data item.
Multicomputers (cont. 3/3)
• Programming advantages of message-passing
  – No problem with simultaneous access to data.
  – Allows different PCs to operate on the same data independently.
  – Allows PCs on a network to be easily upgraded when faster processors become available.
Data Parallel Computation
• All tasks (or processors) apply the same set of operations to different data.
• Example:

    for i ← 0 to 99 do
        a[i] ← b[i] + c[i]
    endfor

• SIMDs always use data parallel computation.
  – All active processors execute the operations synchronously.
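A sequential Python rendering of this loop makes the key property explicit: every iteration is independent, so on a SIMD machine all 100 additions could execute in one synchronous step. The input arrays below are made up for illustration.

```python
# Made-up input data; on a SIMD machine each PE would hold one element.
b = list(range(100))                 # b[i] = i
c = [2 * i for i in range(100)]      # c[i] = 2i

# The data-parallel loop: no iteration depends on any other,
# so all 100 additions could happen simultaneously.
a = [b[i] + c[i] for i in range(100)]
```

Here the comprehension stands in for the lockstep execution; each `a[i]` is computed from `b[i]` and `c[i]` alone, with no loop-carried dependence.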
Data Parallel Computation for MIMDs
• A common way to support data parallel programming on MIMDs is to use SPMD (single program multiple data) programming, as follows:
  – Processors (or tasks) execute the same block of instructions concurrently but asynchronously.
  – No communication or synchronization occurs within these concurrent instruction blocks.
  – Each instruction block is normally followed by synchronization and communication steps.
• If processors have identical tasks (with different data), this method allows them to be executed in a data parallel fashion.
Functional/Control/Job Parallelism
• Involves executing different sets of operations on different data sets.
• Typical MIMD control parallelism:
  – Problem is divided into different, non-identical tasks.
  – Tasks are distributed among the processors.
    • Tasks are usually divided between processors so that their workload is roughly balanced.
  – Each processor follows some scheme for executing its tasks.
    • Frequently, a task scheduling algorithm is used.
    • If the task being executed has to wait for data, the task is suspended and another task is executed.
Grain Size
• Quinn defines grain size to be the average number of computations performed between communication or synchronization steps.
  – Quinn (2004), pg 411
• Data parallel programming usually results in smaller grain size computation.
  – SIMD computation is considered to be fine-grain.
  – MIMD data parallelism is usually considered to be medium-grain.
• Generally, increasing the grain size increases the performance of MIMD programs.
Examples of Parallel Models
• Synchronous PRAM (Parallel RAM)
  – Each processor stores a copy of the program.
  – All execute instructions simultaneously on different data.
    • Generally, the active processors execute the same instruction.
    • The full power of PRAM allows processors to execute different instructions simultaneously.
      – Here, "simultaneous" means they execute using the same clock.
  – All processors have access to the same (shared) memory.
  – Data exchanges occur through the shared memory.
  – All processors have I/O capabilities.
  – See Akl, Figure 1.2
Examples of Parallel Models (cont)
• Binary Tree of Processors (Akl, Fig. 1.3)
  – No shared memory.
  – Memory distributed equally between PEs.
  – PEs communicate through the binary tree network.
  – Each PE stores a copy of the program.
  – PEs may execute different instructions.
  – Normally synchronous, but asynchronous is also possible.
  – Only the leaf and root PEs have I/O capabilities.
Examples of Parallel Models (cont)
• Symmetric Multiprocessors (SMPs)
  – Asynchronous processors of the same size.
  – Shared memory.
  – Memory locations equally distant from each PE in smaller machines.
  – Formerly called "tightly coupled multiprocessors".
Analysis of Parallel Algorithms
• Elementary Steps:
  – A computational step is a basic arithmetic or logic operation.
  – A routing step is used to move data of constant size between PEs, using either shared memory or a link connecting the two PEs.
Communication Time Analysis
• Worst Case
  – Usually the default for synchronous computation.
  – Difficult to use for asynchronous computation.
• Average Case
  – Usual default for asynchronous computation.
  – Only occasionally used with synchronous computation.
• Important to distinguish which is being used.
  – Often "worst case" synchronous timings are compared to "average case" asynchronous timings.
  – Default in Akl's text is "worst case".
  – Default in the Casanova et al. text is probably "average case".
Shared Memory Communication Time
• Each communication involves two memory accesses:
  – P(i) writes the datum to a memory cell.
  – P(j) reads the datum from that memory cell.
• Uniform Analysis
  – Charge constant time for a memory access.
  – Unrealistic, as shown for PRAM in Chapter 2 of Akl's book.
• Non-Uniform Analysis
  – Charge O(lg M) cost for a memory access, where M is the number of memory cells.
  – More realistic, based on how memory is accessed.
• Traditional to use Uniform Analysis
  – Considering a second variable M makes the analysis complex.
  – Most non-constant time algorithms have the same complexity using either the uniform or the non-uniform analysis.
Communication through links
• Assumes constant time to communicate along one link.
  – An earlier paper (mid-1990s?) estimates a communication step along a link to be on the order of 10-100 times slower than a computational step.
• When two processors are not connected, routing a datum between the two processors requires O(m) time units, where m is the number of links along path used.
Running/Execution Time
• Running time for an algorithm:
  – Time between when the first processor starts executing and when the last processor terminates.
  – Either worst case or average case can be used.
    • Often not stated which is used.
    • Normally worst case for synchronous & average case for asynchronous.
  – An upper bound is given by the number of steps of the fastest algorithm known for the problem.
  – The lower bound gives the fewest number of steps possible to solve a worst case of the problem.
• Time optimal algorithm:
  – Algorithm in which the number of steps has the same complexity as the lower bound for the problem.
  – The Casanova et al. book uses the term efficient instead.
Speedup
• A measure of the decrease in running time due to parallelism.
• Let t1 denote the running time of the fastest (possible/known) sequential algorithm for the problem.
• If the parallel algorithm uses p PEs, let tp denote its running time.
• The speedup of the parallel algorithm using p processors is defined by S(1,p) = t1 / tp.
• The goal is to obtain the largest speedup possible.
• For worst (average) case speedup, both t1 and tp should be worst (average) case, respectively.
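The definition can be captured as a tiny helper; the timings in the usage line are hypothetical, chosen only to show a sub-linear case.

```python
def speedup(t1, tp):
    """S(1,p) = t1 / tp: ratio of the fastest sequential running time
    t1 to the parallel running time tp on p PEs."""
    return t1 / tp

# Hypothetical timings: 120 time units sequentially, 15 on 16 PEs.
s = speedup(120, 15)   # 8.0 -- sub-linear, since 8 < p = 16
```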
Optimal Speedup
• Theorem: The maximum possible speedup for parallel computers with n PEs for "traditional" problems is n.
• Proof (for "traditional problems"):
  – Assume a computation is partitioned perfectly into n processes of equal duration.
  – Assume no overhead is incurred as a result of this partitioning of the computation (e.g., partitioning process, information passing, coordination of processes, etc.).
  – Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation.
  – The parallel running time is ts / n.
  – Then the parallel speedup of this computation is S(n,1) = ts / (ts / n) = n.
Linear Speedup
• The preceding slide argues that S(n,1) ≤ n, and optimally S(n,1) = n.
• This speedup is called linear since S(n,1) = Θ(n).
• The next slide formalizes this statement and provides an alternate proof.
Speedup Folklore Theorem
Theorem: For a traditional problem, the maximum speedup possible for a parallel algorithm using p processors is at most p. That is, S(1,p) ≤ p.
An alternate proof (for "traditional problems"):
• Let t1 and tp be as above, and assume t1/tp > p.
• This parallel algorithm can be simulated by a sequential computer which executes each parallel step with p serial steps.
• The sequential simulation runs in p·tp time, which is faster than the fastest sequential algorithm, since p·tp < t1.
• This contradiction establishes the "theorem".
Note: We will see that this result is not valid for some non-traditional problems.
Speedup Usually Suboptimal
• Unfortunately, the best speedup possible for most applications is much less than n, as
  – Assumptions in the first proof are usually invalid.
  – Usually some portions of the algorithm are sequential, with only one PE active.
  – Other portions of the algorithm may allow a non-constant number of PEs to be idle for more than a constant number of steps.
    • E.g., during parts of the execution, many PEs may be waiting to receive or to send data.
Speedup Example
• Example: Compute the sum of n numbers using a binary tree with n leaves.
  – Takes tp = lg n steps to compute the sum in the root node.
  – Takes t1 = n − 1 steps to compute the sum sequentially.
  – The number of processors is p = 2n − 1, so for n > 1,
        S(1,p) ≈ n/(lg n) < n < 2n − 1 = p
  – Speedup is not optimal.
Optimal Speedup Example
• Example: Computing the sum of n numbers optimally using a binary tree.
  – Assume the binary tree has n/(lg n) leaves.
  – Each leaf is given lg n numbers and computes the sum of these numbers initially in Θ(lg n) time.
  – The overall sum of the values in the n/(lg n) leaves is found next, as in the preceding example, in lg(n/lg n) = Θ(lg n) time.
  – Then tp = Θ(lg n) and S(1,p) = Θ(n/lg n) = cp for some constant c, since p = 2(n/lg n) − 1.
  – The speedup of this parallel algorithm is optimal, as the Speedup Folklore Theorem is valid for traditional problems like this one, so speedup ≤ p.
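The two tree-sum variants can be compared numerically. This is a sketch that plugs the slides' formulas into Python (not a simulation of an actual machine); the values of n are arbitrary powers of two.

```python
import math

def naive_tree(n):
    """Tree with n leaves: t1 = n - 1, tp = lg n, p = 2n - 1.
    Returns (speedup, processor count)."""
    return (n - 1) / math.log2(n), 2 * n - 1

def optimal_tree(n):
    """Tree with n/lg n leaves: each leaf first sums lg n values
    locally, then the tree combines the partial sums, so
    tp = lg n + lg(n / lg n) = Theta(lg n) with p = 2(n/lg n) - 1."""
    lg = math.log2(n)
    tp = lg + math.log2(n / lg)
    return (n - 1) / tp, 2 * (n / lg) - 1

for n in (2**10, 2**20):
    s_naive, p_naive = naive_tree(n)
    s_opt, p_opt = optimal_tree(n)
    assert s_naive < p_naive        # naive speedup is far below p
    assert s_opt > 0.25 * p_opt     # optimal speedup is Theta(p)
```

The naive tree's speedup grows like n/lg n while its processor count grows like n, so the ratio S/p vanishes; the optimal variant keeps S within a constant fraction of p.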
Superlinear Speedup
• Superlinear speedup occurs when S(n) > n.
• Most texts other than Akl's argue that
  – Linear speedup is the maximum speedup obtainable.
  – The preceding "optimal speedup proof" is used to argue that superlinearity is impossible.
  – Linear speedup is optimal for traditional problems.
• Those stating that superlinearity is impossible claim that any 'superlinear behavior' occurs due to the following types of natural reasons:
  – the extra memory in the parallel system.
  – a sub-optimal sequential algorithm was used.
  – random luck, e.g., in the case of an algorithm that has a random aspect in its design (e.g., random selection).
Superlinearity (cont)
• Some problems cannot be solved without the use of parallel computation.
  – It seems reasonable to consider parallel solutions to these to be "superlinear".
    • Speedup would be infinite.
  – Examples include many large software systems:
    • Data too large to fit in the memory of a sequential computer.
    • Problem contains deadlines that must be met, but the required computation is too great for a sequential computer to meet the deadlines.
  – Those not accepting superlinearity discount these, as indicated in the previous slide.
Superlinearity (cont.)
• The textbook by Casanova et al. discusses superlinearity primarily on pgs 11-12 and in Exercise 4.3 on pg 140.
  – Accepts that superlinearity is possible.
  – Indicates natural causes (e.g., larger memory) as the primary (perhaps only) reason for superlinearity.
  – Treats cases of infinite speedup (where a sequential processor cannot do the job) as an anomaly.
    • E.g., two men can carry a piano upstairs, but one man cannot do this, even if given unlimited time.
Some Non-Traditional Problems
• Problems with any of the following characteristics are "non-traditional":
  – Input may arrive in real time.
  – Input may vary over time.
    • For example, find the convex hull of a set of points, where points can be added & removed over time.
  – Output may affect future input.
  – Output may have to meet certain deadlines.
• See the discussion on pages 16-17 of Akl's text.
• See Chapter 12 of Akl's text for additional discussion.
Other Superlinear Examples
• Selim Akl has given a large number of examples of different types of superlinear algorithms.
  – Example 1.17 in Akl's online textbook (pg 17).
  – See the last chapter (Ch 12) of his online text.
  – See his 2007 book chapter titled "Evolving Computational Systems".
    • To be posted to the website shortly.
  – Some of his conference & journal publications.
A Superlinearity Example
• Example: Tracking
  – 6,000 streams of radar data arrive during each 0.5 second, and each corresponds to an aircraft.
  – Each must be checked against the "projected position box" for each of 4,000 tracked aircraft.
  – The position of each plane that is matched is updated to correspond with its radar location.
  – This process must be repeated every 0.5 seconds.
  – A sequential processor cannot process data this fast.
Slowdown Folklore Theorem: If a computation can be performed with p processors in time tp, and with q processors in time tq, where q < p, then

    tp ≤ tq ≤ (1 + p/q)·tp

Comments:
• The new run time tq is O(tp·p/q).
• This result puts a tight bound on the amount of extra time required to execute an algorithm with fewer PEs.
• Establishes a smooth degradation of running time as the number of processors decreases.
• For q = 1, this is essentially the speedup theorem, but with a less sharp constant.
• Similar to Proposition 1.1, pg 11, in the Casanova et al. textbook.
Proof (for traditional problems):
• Let Wi denote the number of elementary steps performed collectively during the i-th time period of the algorithm (using p PEs).
• Then W = W1 + W2 + ... + Wk (where k = tp) is the total number of elementary steps performed collectively by the processors.
• Since all p processors are not necessarily busy all the time, W ≤ p·tp.
• With q processors and q < p, we can simulate the Wi elementary steps of the i-th time period by having each of the q processors do ⌈Wi/q⌉ elementary steps.
• The time taken for the simulation is

    tq ≤ ⌈W1/q⌉ + ⌈W2/q⌉ + ... + ⌈Wk/q⌉
       ≤ (W1/q + 1) + (W2/q + 1) + ... + (Wk/q + 1)
       ≤ tp + p·tp/q
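The simulation argument can be checked numerically. The work vector below is made up (8 PEs, 4 steps, not all PEs busy every step); `simulated_time` implements the per-step ceiling cost from the proof.

```python
import math

def simulated_time(work_per_step, q):
    """Time for q processors to simulate, step by step, a p-processor
    algorithm whose i-th step performs work_per_step[i] elementary
    steps: each step costs ceil(W_i / q) <= W_i/q + 1 time units."""
    return sum(math.ceil(w / q) for w in work_per_step)

# Hypothetical schedule: p = 8 PEs, tp = 4 steps, total work W = 25 <= p*tp.
W = [8, 6, 8, 3]
p, tp, q = 8, len(W), 3

tq = simulated_time(W, q)                # 3 + 2 + 3 + 1 = 9 time units
assert tq <= tp + p * tp / q             # the Slowdown Folklore bound
```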
A Non-Traditional Example for "Folklore Theorems"
• Problem Statement (from Akl, Example 1.17):
  – n sensors are each used to repeat the same set {s1, s2, ..., sn} of measurements n times at an environmental site.
  – It takes each sensor n time units to make each measurement.
  – Each si is measured in turn by another sensor, producing n distinct cyclic permutations of the sequence S = (s1, s2, ..., sn).
  – The sequences produced by 3 sensors are
        (s1, s2, s3), (s3, s1, s2), (s2, s3, s1)
Nontraditional Example (cont.)
  – The stream of data from each sensor is sent to a different processor, with all corresponding si values in each stream arriving at the same time.
    • The data from a sensor is consistently sent to the same processor.
  – Each si must be read and stored in unit time, or the stream is terminated.
  – Additionally, the minimum of the n new data values that arrive during each n steps must be computed, validated, and stored.
    • If the minimum is below a safe value, the process is stopped.
  – The overall minimum of the sensor reports over all n steps is also computed and recorded.
Sequential Processor Solution
• A single processor can only monitor the values in one data stream.
  – After the first value of a selected stream is stored, it is too late to process the other n−1 items that arrived at the same time in the other streams.
  – A stream remains active only if its preceding value was read.
  – Note this is the best we can do to solve the problem.
• We assume that a processor can read a value, compare it to the smallest previous value received, and update its "smallest value" in one unit of time.
• Sequentially, the computation takes n computational steps and n(n−1) time units of delay between arrivals, or exactly n² time units to process all n streams.
• Summary: Calculating the first minimum takes n² time units.
Binary Tree Solution using n Leaves
• Assume a complete binary tree with n leaf processors and a total of 2n−1 processors.
• Each leaf receives the input from a different stream.
• Inputs from each stream are sent up the tree.
• Each interior PE receives two inputs, computes the minimum, and sends the result to its parent.
• Takes lg n time to compute the minimum and to store the answer at the root.
  – The root also calculates an overall minimum.
• Running time of the parallel solution for the first round (i.e., first minimum) is lg n using 2n−1 = Θ(n) processors.
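Plugging the slide's quantities into Python shows the speedup exceeding the processor count. This is a numeric check of the formulas, not a simulation of the streams.

```python
import math

def tracking_speedup(n):
    """Non-traditional sensor example: sequential time t1 = n^2,
    binary-tree parallel time tp = lg n with p = 2n - 1 PEs."""
    t1, tp = n * n, math.log2(n)
    return t1 / tp

# Speedup exceeds p for every size tried -- "parallel synergy":
for n in (64, 1024, 2**16):
    assert tracking_speedup(n) > 2 * n - 1
```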
"Speedup Folklore Theorem" Invalid for Non-Traditional Problems
• A sequential solution to the previous example required n² steps.
• The parallel solution took lg n steps, using a binary tree with p = 2n−1 = Θ(n) processors.
• The speedup is S(1,p) = t1/tp = n²/lg n, which is asymptotically larger than p, the number of processors.
• Akl calls this speedup parallel synergy.
Read Next Two Slides
• The next two slides will be left for students to read.
• They establish that the “Slowdown Folklore Theorem” is also invalid for non-traditional problems.
• The technique used is similar to that used to show “Speedup Folklore Theorem” is invalid for non-traditional problems.
• Further details about both examples can be found in Akl's text.
Binary Tree Solution using √n Leaves
• Assume a complete binary tree with √n leaf processors and a total of q = 2√n − 1 processors.
• The √n leaf processors monitor √n streams whose first √n values are distinct from each other.
  – Requires the order of data arrival in the streams to be known in advance.
  – Note that this is the best we can do to solve the original problem.
• Consecutive data in each stream are separated by n time units, so each leaf PE takes (√n − 1)n time in the first phase to see the first √n values in its stream and to compute their minimum.
• Each leaf computes the minimum of its first √n values.
• The minimum of the leaf "minimum values" is computed by the binary tree in lg(√n) = ½ lg n steps.
• Running time of the parallel solution using q = 2√n − 1 PEs for the first round is (√n − 1)n + ½ lg n.
"Slowdown Folklore Theorem" Invalid for Non-Traditional Problems
• A parallel solution to the previous example with q = 2√n − 1 = Θ(√n) PEs had tq = (√n − 1)n + ½ lg n.
• The parallel solution using a binary tree with p = 2n − 1 = Θ(n) PEs took tp = lg n.
• The Slowdown Folklore Theorem states that tp ≤ tq ≤ (1 + p/q)·tp.
• Clearly tq/tp is asymptotically greater than p/q, which contradicts the Slowdown "Folklore Theorem":
  – (p/q)·tp = ((2n − 1)/(2√n − 1))·lg n = Θ(√n lg n)
  – tq = (√n − 1)n + ½ lg n = Θ(n^(3/2))
• Akl also calls this asymptotic speedup parallel synergy.
Cost Optimal
• A parallel algorithm for a problem is cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem.
  – By proportional, we mean that

        cost = tp × n = k × ts

    for some constant k.
• Equivalently, a parallel algorithm with cost C is cost optimal if there is an optimal sequential algorithm with running time ts and Θ(C) = Θ(ts).
• If no optimal sequential algorithm is known, then the cost of a parallel algorithm is usually compared to the running time of the "fastest known" sequential algorithm instead.
Cost Optimal Example
• Recall the speedup example of a binary tree with n/(lg n) leaf PEs that computes the sum of n values in Θ(lg n) time.
• C(n) = p(n) × t(n) = (2(n/lg n) − 1) × lg n = 2n − lg n = Θ(n)
• An optimal sequential algorithm requires n − 1 = Θ(n) steps.
• This shows that this algorithm is cost optimal.
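A quick numeric check of the cost formula above (the values of n are arbitrary powers of two):

```python
import math

def tree_sum_cost(n):
    """Cost of the optimal tree sum: p = 2(n/lg n) - 1 PEs running
    for lg n time, which simplifies to 2n - lg n = Theta(n)."""
    lg = math.log2(n)
    p = 2 * (n / lg) - 1
    return p * lg

# Cost stays within a constant factor of the sequential n - 1 steps:
for n in (2**10, 2**20):
    assert tree_sum_cost(n) / (n - 1) < 3
```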
Another Cost Optimal Example
• Sorting has a sequential lower bound of Ω(n lg n) steps.
  – Assumes the algorithm is general purpose and not tailored to a specific machine.
• A parallel sorting algorithm using Θ(n) PEs will require at least Ω(lg n) time.
• A parallel algorithm requiring Θ(n) PEs and Θ(lg n) time is cost optimal.
• A parallel algorithm using Θ(lg n) PEs will require at least Ω(n) time.
Work
• Defn: The work of a parallel algorithm is the sum of the individual steps executed by all the processors.
  – While inactive processor time is included in cost, only active processor time is included in work.
  – Work indicates the actual steps that a sequential computer would have to take in order to simulate the action of a parallel computer.
  – Observe that the cost is an upper bound for the work.
  – While this definition of work is fairly standard, a few authors use our definition of "cost" to define "work".
Algorithm Efficiency
• Efficiency is defined by

    E(1,n) = S(1,n)/n = ts/(n·tp)

• Observe that

    E(1,n) = ts/cost

• The sequential algorithm must be the fastest known for this problem.
• Efficiency gives the percentage of full utilization of the parallel processors on the computation.
• Note that for traditional problems:
  – Maximum speedup when using n processors is n.
  – Maximum efficiency is 1.
  – For traditional problems, efficiency ≤ 1.
• For non-traditional problems, algorithms may have speedup > n and efficiency > 1.
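The definition reduces to a one-line helper; the 120/20 timings in the example are hypothetical.

```python
def efficiency(t1, tp, n):
    """E(1,n) = S(1,n)/n = t1 / (n * tp) = t1 / cost, where t1 is the
    fastest sequential time and tp the parallel time on n PEs."""
    return t1 / (n * tp)

# Hypothetical run: 120 time units sequentially, 20 on 8 PEs.
# Speedup S = 6, so E = 6/8 = 0.75: the PEs are 75% utilized.
e = efficiency(120, 20, 8)
```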
Amdahl's Law
Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup S(n) achievable by a parallel computer with n processors is

    S(n) ≤ 1 / (f + (1 − f)/n)

• Note: Amdahl's law holds only for traditional problems.

Proof: If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, the time to perform the computation with n processors is given by tp = f·ts + [(1 − f)·ts]/n.
A Symbolic Proof of Amdahl's Law
• Using the preceding expression for tp:

    S(n) = ts/tp = ts / (f·ts + (1 − f)·ts/n) = 1 / (f + (1 − f)/n)

• The last expression is obtained by dividing the numerator and denominator by ts, which establishes Amdahl's law.
• Multiplying the numerator & denominator by n produces the following alternate version of this formula:

    S(n) = n / (n·f + (1 − f)) = n / (1 + (n − 1)·f)
Comments on Amdahl's Law
• The preceding proof assumes that speedup cannot be superlinear; i.e.,

    S(n) = ts/tp ≤ n

  – Question: Where was this assumption used???
• Amdahl's law is not valid for non-traditional problems where superlinearity can occur.
• Note that S(n) never exceeds 1/f, but approaches 1/f as n increases.
• The conclusion of Amdahl's law is sometimes stated as S(n) ≤ 1/f.
Limits on Parallelism
• Note that Amdahl's law places a very strict limit on parallelism.
• Example: If 5% of the computation is serial, then its maximum speedup is at most 20, no matter how many processors are used, since

    S(n) ≤ 1/f = 1/0.05 = 100/5 = 20

• Initially, Amdahl's law was viewed as a fatal limit to any significant future for parallelism.
• It is not unusual to find computer professionals (outside the parallel area) who still think of Amdahl's law as a severe limit to parallelism.
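The bound is easy to evaluate numerically; this sketch just encodes the formula from the preceding slides and reproduces the 5% example.

```python
def amdahl(f, n):
    """Amdahl's law: maximum speedup S(n) = 1 / (f + (1 - f)/n)
    for sequential fraction f on n processors."""
    return 1.0 / (f + (1.0 - f) / n)

# With f = 0.05 the speedup approaches, but never exceeds, 1/f = 20:
s_100 = amdahl(0.05, 100)        # about 16.8 with 100 PEs
s_big = amdahl(0.05, 10**6)      # approaches 20 as n grows
```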
Applications of Amdahl's Law
• The law shows that efforts to further reduce the fraction of the code that is sequential may pay off in large performance gains.
• Shows that hardware that achieves even a small decrease in the percentage of things executed sequentially may be considerably more efficient.
• Can sometimes use Amdahl's law to increase the efficiency of parallel algorithms.
  – E.g., see Jordan & Alaghband's textbook.
Amdahl's & Gustafson's Laws
• A key flaw in past arguments that "Amdahl's law is a severe limit to the future of parallelism" is
  – Gustafson's Law: The proportion of the computations that are sequential normally decreases as the problem size increases.
  – Gustafson's law is a "rule of thumb" or general principle, but is not a theorem that has been proved.
• Other limitations in applying Amdahl's Law:
  – Its proof focuses on the steps in a particular algorithm, and does not consider that other algorithms with a higher percentage of parallelism may exist.
  – Amdahl's law applies only to 'traditional problems' where superlinearity cannot occur.
End of Chapter One
Introduction and General Concepts