Lecture 6: Performance of Multiprocessor Systems
Speedup

$$\text{Speedup} = \frac{\text{Execution time on 1 processor}}{\text{Execution time on } p \text{ processors}} = \frac{T_1}{T_p}$$

$t_s$: time for the serial part of the algorithm
$t_p$: time for the parallelizable part of the algorithm

$$T_1 = t_s + t_p$$

$$T_p = t_s + \frac{t_p}{p}$$

$$\text{Speedup}(p) = \frac{t_s + t_p}{t_s + t_p/p}$$

[Figure: speedup versus p, compared with the ideal speedup]
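As a quick numerical check of the speedup formula, here is a minimal Python sketch. The function name `speedup` and the sample times (10 s serial, 90 s parallelizable) are illustrative assumptions, not values from the lecture.

```python
def speedup(ts: float, tp: float, p: int) -> float:
    """Speedup(p) = (ts + tp) / (ts + tp / p)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    t1 = ts + tp        # execution time on 1 processor
    t_p = ts + tp / p   # execution time on p processors
    return t1 / t_p


# Hypothetical workload: 10 s serial part, 90 s parallelizable part.
for p in (1, 2, 4, 8, 16):
    print(p, round(speedup(10.0, 90.0, p), 3))
```

With a nonzero serial part, the printed values grow more slowly than p itself, which is the gap between the measured and the ideal curve.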
Amdahl’s Law
If the sequential component of an algorithm is 1/s of the program’s execution time, then the maximum speedup that can be achieved on a parallel computer is s.

$$t_s = \frac{1}{s}\, T_1$$

$$t_p = \left(1 - \frac{1}{s}\right) T_1$$

$$\text{Speedup}(p) = \frac{T_1}{\dfrac{T_1}{s} + \dfrac{(1 - 1/s)\, T_1}{p}}$$

$$\lim_{p \to \infty} \text{Speedup}(p) = s$$
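The limit can be seen numerically. The sketch below is only an illustration: the function name `amdahl_speedup` and the choice s = 20 (a 5% serial fraction) are assumptions. Because T1 appears in every term of the formula, it cancels and does not need to be passed in.

```python
def amdahl_speedup(s: float, p: int) -> float:
    """Speedup(p) = 1 / (1/s + (1 - 1/s) / p), with serial fraction 1/s."""
    if s < 1 or p < 1:
        raise ValueError("s and p must both be at least 1")
    serial = 1.0 / s
    return 1.0 / (serial + (1.0 - serial) / p)


# Hypothetical program whose serial part is 1/20 of the run time.
for p in (1, 10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(20.0, p), 3))
```

However large p becomes, the printed speedup stays below s = 20.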
Superlinear speedup
Speedup(p) > p: superlinear speedup
Reasons: increased cache size, random algorithms, parallel algorithm
Speedup

$$\text{Speedup} = \frac{T_1}{T_p}$$

Relative speedup: $T_1$ is the single-processor execution time of the parallel algorithm.
Absolute speedup: $T_1$ is the execution time of the best serial algorithm on one processor.
Efficiency

$$\text{Efficiency}(p) = \frac{\text{Speedup}(p)}{p} = \frac{T_1}{p \times T_p} \le 1$$

[Figure: efficiency versus p, bounded above by 1]
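A small Python sketch of the efficiency formula follows. The function name `efficiency` and the timings are hypothetical, chosen to match a program with a 10 s serial part and a 90 s parallelizable part.

```python
def efficiency(t1: float, t_p: float, p: int) -> float:
    """Efficiency(p) = T1 / (p * Tp), i.e. Speedup(p) / p."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return t1 / (p * t_p)


# Hypothetical timings: T1 = 100 s, Tp = 10 + 90 / p seconds.
t1 = 100.0
for p in (1, 2, 4, 8, 16):
    t_p = 10.0 + 90.0 / p
    print(p, round(efficiency(t1, t_p, p), 3))
```

For this workload the efficiency starts at 1 for p = 1 and falls as processors are added, because the serial part is not shared among them.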
Gustafson’s Law
[Figure: work and time versus p (1 to 4) in two panels, Fixed size and Fixed time; bars labeled ws, wp, ts, tp, and tp/p]
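Gustafson’s Law: Scaled Speedup (Fixed-time Speedup)

$$T_p = t_s + t_p$$

$$T_1 = t_s + p \cdot t_p$$

If the sequential component of an algorithm is 1/s of the program’s execution time on p processors:

$$t_s = \frac{1}{s}\, T_p$$

$$t_p = \left(1 - \frac{1}{s}\right) T_p$$

$$\text{Speedup}(p) = \frac{T_1}{T_p} = \frac{1}{s} + \left(1 - \frac{1}{s}\right) p$$

$$\lim_{p \to \infty} \text{Speedup}(p) = \infty$$

[Figure: speedup versus p, compared with the ideal speedup]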
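To see how the scaled speedup grows without bound, here is a minimal Python sketch. The function name `gustafson_speedup` and the value s = 20 are assumptions made for illustration.

```python
def gustafson_speedup(s: float, p: int) -> float:
    """Scaled speedup: Speedup(p) = 1/s + (1 - 1/s) * p."""
    if s < 1 or p < 1:
        raise ValueError("s and p must both be at least 1")
    serial = 1.0 / s
    return serial + (1.0 - serial) * p


# Hypothetical program whose serial part is 1/20 of the run time on p processors.
for p in (1, 10, 100, 1000):
    print(p, round(gustafson_speedup(20.0, p), 3))
```

The values grow linearly in p with slope 1 - 1/s, since the parallel work is scaled up with the number of processors while the run time is held fixed.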
Sizeup

$$\text{Sizeup} = \frac{\text{Total work on } p \text{ processors}}{\text{Total work on 1 processor}}$$

$w_s$: serial work
$w_p$: parallelizable work
$w_p'$: scaled parallelizable work

$$\text{Sizeup} = \frac{w_s + w_p'}{w_s + w_p} = \frac{w_s + p \cdot w_p}{w_s + w_p}$$
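The sizeup ratio (ws + p.wp) / (ws + wp) can be evaluated directly. In this sketch the function name `sizeup` and the work units (10 serial, 90 parallelizable) are hypothetical.

```python
def sizeup(ws: float, wp: float, p: int) -> float:
    """Sizeup = (ws + p * wp) / (ws + wp), where p * wp is the scaled parallel work."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return (ws + p * wp) / (ws + wp)


# Hypothetical workload: 10 units of serial work, 90 units of parallelizable work.
for p in (1, 2, 4, 8):
    print(p, round(sizeup(10.0, 90.0, p), 3))
```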
Roofline Performance Model
Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory.

$$\text{Arithmetic intensity} = \frac{\text{floating-point operations}}{\text{number of data bytes}} \quad (\text{FLOPs/byte})$$
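As a worked example of arithmetic intensity, consider a vector triad, a[i] = b[i] + q * c[i], on double-precision arrays. The kernel choice, the function name `arithmetic_intensity`, and the byte counting (two loads and one store of 8 bytes per element, ignoring cache effects such as write-allocate traffic) are assumptions of this sketch.

```python
def arithmetic_intensity(flops: float, bytes_accessed: float) -> float:
    """Arithmetic intensity = floating-point operations / data bytes accessed."""
    if bytes_accessed <= 0:
        raise ValueError("bytes_accessed must be positive")
    return flops / bytes_accessed


# Vector triad a[i] = b[i] + q * c[i] over n doubles (assumed kernel).
n = 1_000_000
flops = 2 * n               # one multiply and one add per element
bytes_accessed = 3 * 8 * n  # load b[i], load c[i], store a[i]; 8 bytes each

print(arithmetic_intensity(flops, bytes_accessed))  # 2 / 24 FLOPs per byte
```

The result does not depend on n here, because both the operation count and the byte count scale linearly with the array length.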
Roofline Performance Model

$$\text{Attainable GFLOPs/second} = \min\left(\text{Peak memory bandwidth} \times \text{Arithmetic intensity},\ \text{Peak floating-point performance}\right)$$
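The min expression can be sketched in a few lines of Python. The function name `attainable_gflops` and the machine parameters (100 GFLOPs/second peak, 25 Gbytes/second memory bandwidth) are hypothetical and chosen only to show where the bound switches from memory to compute.

```python
def attainable_gflops(peak_gflops: float, peak_bw_gbytes: float, intensity: float) -> float:
    """Roofline bound: min(peak bandwidth * arithmetic intensity, peak FP performance)."""
    return min(peak_bw_gbytes * intensity, peak_gflops)


# Hypothetical machine: 100 GFLOPs/second peak, 25 Gbytes/second memory bandwidth.
PEAK_GFLOPS = 100.0
PEAK_BW = 25.0

for intensity in (0.25, 1.0, 4.0, 16.0):
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_BW, intensity)
    limit = "memory" if PEAK_BW * intensity < PEAK_GFLOPS else "compute"
    print(intensity, bound, limit)
```

For this assumed machine the two limits meet at an arithmetic intensity of 100 / 25 = 4 FLOPs/byte; kernels below that value are bounded by memory bandwidth, and kernels at or above it by peak floating-point performance.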
Roofline Performance Model
Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second).
For multicore chips, peak performance is the collective performance of all the cores on the chip. For a system with several chips, multiply the peak per chip by the number of chips.
Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second).
The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as
Peak memory bandwidth x Arithmetic intensity
(bytes/second) x (FLOPs/byte) ==> FLOPs/second
Roofline Performance Model
Roofline sets an upper bound on performance
Roofline of a computer does not vary by benchmark kernel
Stream Benchmark
A synthetic benchmark.
Measures the performance of long vector operations.
They have no temporal locality and they access arrays that are larger than the cache size.
http://www.cs.virginia.edu/stream/ref.html