Lecture 6: Performance of Multiprocessor Systems


Speedup

$$\text{Speedup} = \frac{\text{Execution time on 1 processor}}{\text{Execution time on } p \text{ processors}} = \frac{T_1}{T_p}$$

ts : time for the serial part of the algorithm

tp : time for the parallelizable part of the algorithm

$$T_1 = t_s + t_p$$

$$T_p = t_s + \frac{t_p}{p}$$

$$\text{Speedup}(p) = \frac{t_s + t_p}{t_s + t_p/p}$$

[Figure: speedup versus p, compared with the ideal speedup]
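The formula above can be checked numerically. The following is a minimal sketch, assuming hypothetical timings (2 s serial, 8 s parallelizable) that are not from the lecture; the function name `speedup` is chosen here for illustration only.

```python
def speedup(ts: float, tp: float, p: int) -> float:
    """Speedup(p) = (ts + tp) / (ts + tp / p)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    t1 = ts + tp        # execution time on 1 processor
    t_p = ts + tp / p   # execution time on p processors
    return t1 / t_p


# Hypothetical timings (assumption): 2 s serial part, 8 s parallelizable part.
for p in (1, 2, 4, 8):
    print(p, round(speedup(2.0, 8.0, p), 3))
```

With these numbers the speedup on 4 processors is 10 / 4 = 2.5 rather than the ideal 4, because the serial part is not divided by p.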

Amdahl’s Law

If the sequential component of an algorithm is 1/s of the program’s execution time, then the maximum speedup that can be achieved on a parallel computer is s.

$$t_s = \frac{1}{s}\,T_1$$

$$t_p = \left(1 - \frac{1}{s}\right) T_1$$

$$\text{Speedup}(p) = \frac{T_1}{\dfrac{T_1}{s} + \dfrac{(1 - 1/s)\,T_1}{p}}$$

$$\lim_{p \to \infty} \text{Speedup}(p) = s$$
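To see the limit at work, here is a short sketch that evaluates Amdahl's formula for growing p. The value s = 10 (serial part equal to one tenth of T1) and the name `amdahl_speedup` are assumptions made for illustration.

```python
def amdahl_speedup(s: float, p: int) -> float:
    """Speedup(p) = 1 / (1/s + (1 - 1/s) / p), where 1/s is the serial fraction of T1."""
    serial = 1.0 / s
    return 1.0 / (serial + (1.0 - serial) / p)


s = 10.0  # assumption: the serial part is 1/10 of T1
for p in (1, 10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(s, p), 3))
```

The printed values rise toward s = 10 but never reach it, however many processors are added.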


Superlinear speedup

Speedup(p) > p: superlinear speedup

Reasons: increased cache size, randomized algorithms, parallel algorithm

Speedup

$$\text{Speedup} = \frac{T_1}{T_p}$$

Relative speedup: T1 is the single-processor execution time of the parallel algorithm

Absolute speedup: T1 is the execution time of the best serial algorithm on one processor

Efficiency

$$\text{Efficiency}(p) = \frac{\text{Speedup}(p)}{p} = \frac{T_1}{p \times T_p} \le 1$$
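As a quick illustration of the efficiency definition, the sketch below uses hypothetical measurements (120 s on one processor, 20 s on 8 processors); neither the numbers nor the function name come from the lecture.

```python
def efficiency(t1: float, t_p: float, p: int) -> float:
    """Efficiency(p) = Speedup(p) / p = T1 / (p * Tp)."""
    speedup = t1 / t_p
    return speedup / p


# Hypothetical timings (assumption): T1 = 120 s, Tp = 20 s on p = 8 processors.
print(efficiency(120.0, 20.0, 8))  # 0.75
```

Here the speedup is 6 on 8 processors, so each processor is doing useful work 75% of the time on average.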

[Figure: efficiency versus p, with 1 marked on the efficiency axis]


Gustafson’s Law

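[Figure: two panels, "Fixed size" and "Fixed time", each showing work (ws, wp) and time (ts, tp) for p = 1, 2, 3, 4. In the fixed-size panel the work stays the same and the parallel time shrinks as tp/p. In the fixed-time panel the parallel work grows with p and the time stays the same.]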

Gustafson’s Law: Scaled Speedup (Fixed-time Speedup)

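Here ts and tp are the serial and parallel times measured on the p-processor run:

$$T_p = t_s + t_p$$

$$T_1 = t_s + p\,t_p$$

If the sequential component of an algorithm is 1/s of the program’s execution time on p processors, then:

$$t_s = \frac{1}{s}\,T_p$$

$$t_p = \left(1 - \frac{1}{s}\right) T_p$$

$$\text{Speedup}(p) = \frac{T_1}{T_p} = \frac{1}{s} + \left(1 - \frac{1}{s}\right) p$$

$$\lim_{p \to \infty} \text{Speedup}(p) = \infty$$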
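The scaled speedup grows linearly with p, which a few evaluations make visible. This is a sketch only, assuming s = 10 (serial part equal to one tenth of the p-processor run time); the name `scaled_speedup` is hypothetical.

```python
def scaled_speedup(s: float, p: int) -> float:
    """Speedup(p) = 1/s + (1 - 1/s) * p, where 1/s is the serial fraction of Tp."""
    serial = 1.0 / s
    return serial + (1.0 - serial) * p


s = 10.0  # assumption: the serial part is 1/10 of Tp
for p in (1, 10, 100, 1000):
    print(p, round(scaled_speedup(s, p), 1))
```

Unlike the fixed-size case, the values keep increasing with p because the parallel part of the workload is scaled up along with the machine.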

Sizeup

$$\text{Sizeup} = \frac{\text{Total work on } p \text{ processors}}{\text{Total work on 1 processor}}$$

ws: serial work

wp: parallelizable work

wp’: scaled parallelizable work

$$\text{Sizeup} = \frac{w_s + w_p'}{w_s + w_p} = \frac{w_s + p\,w_p}{w_s + w_p}$$
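A small numeric sketch of the sizeup ratio follows. The work units (1 unit serial, 9 units parallelizable) and the function name are assumptions for illustration, and the scaled parallel work is taken as wp' = p * wp as in the formula.

```python
def sizeup(ws: float, wp: float, p: int) -> float:
    """Sizeup = (ws + p * wp) / (ws + wp), with scaled parallel work wp' = p * wp."""
    scaled_work = ws + p * wp   # work completed when the parallel part is scaled by p
    base_work = ws + wp         # work of the original problem
    return scaled_work / base_work


# Hypothetical work units (assumption): 1 unit serial, 9 units parallelizable.
for p in (1, 2, 4, 8):
    print(p, round(sizeup(1.0, 9.0, p), 2))
```

For p = 4 this gives (1 + 36) / 10 = 3.7, meaning the scaled problem contains 3.7 times the work of the original one.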

Roofline Performance Model

Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory

$$\text{Arithmetic intensity} = \frac{\text{floating-point operations}}{\text{number of data bytes}} \quad (\text{FLOPs/byte})$$


$$\text{Attainable GFLOPs/second} = \min\left(\text{Peak memory bandwidth} \times \text{Arithmetic intensity},\ \text{Peak floating-point performance}\right)$$
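The roofline bound is a one-line computation once the two hardware peaks are known. The sketch below assumes a hypothetical machine with 16 GFLOPs/second peak floating-point performance and 8 GBytes/second peak memory bandwidth; these figures and the function name are illustrative, not taken from the lecture.

```python
def attainable_gflops(peak_gflops: float, peak_bw_gbytes: float, intensity: float) -> float:
    """Roofline: min(peak memory bandwidth * arithmetic intensity, peak FP performance)."""
    memory_bound = peak_bw_gbytes * intensity  # (GBytes/s) * (FLOPs/byte) = GFLOPs/s
    return min(memory_bound, peak_gflops)


# Hypothetical machine (assumption): 16 GFLOPs/s peak, 8 GBytes/s memory bandwidth.
PEAK_GFLOPS = 16.0
PEAK_BW = 8.0

for intensity in (0.25, 0.5, 1.0, 2.0, 4.0, 8.0):
    print(intensity, attainable_gflops(PEAK_GFLOPS, PEAK_BW, intensity))
```

On this assumed machine the two terms meet at 16 / 8 = 2 FLOPs/byte: kernels below that intensity are limited by memory bandwidth, and kernels above it are limited by the floating-point peak.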

Roofline Performance Model

Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second).

For multicore chips, peak performance is the collective performance of all the cores on the chip. For a system with several chips, multiply the peak per chip by the number of chips.

Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second).

The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as

Peak memory bandwidth x Arithmetic intensity

(bytes/second) x (FLOPs/byte) ==> FLOPs/second


The roofline sets an upper bound on performance

The roofline of a computer does not vary by benchmark kernel

Stream Benchmark

A synthetic benchmark

Measures the performance of long vector operations

These operations have no temporal locality and access arrays that are larger than the cache size

http://www.cs.virginia.edu/stream/ref.html
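Long vector operations of this kind have very low arithmetic intensity, which is why they stress memory bandwidth rather than the floating-point units. The sketch below estimates the intensity of four simple vector kernels in the style of STREAM. It assumes 8-byte double-precision elements and counts only one read or write per array element per iteration; the kernel forms and counts are written from memory of the usual STREAM conventions and should be checked against the reference page above.

```python
BYTES_PER_DOUBLE = 8  # assumption: double-precision array elements

# name: (loop body, arrays touched per iteration, FLOPs per iteration)
KERNELS = {
    "Copy":  ("a[i] = b[i]",            2, 0),
    "Scale": ("a[i] = q * b[i]",        2, 1),
    "Add":   ("a[i] = b[i] + c[i]",     3, 1),
    "Triad": ("a[i] = b[i] + q * c[i]", 3, 2),
}


def arithmetic_intensity(arrays_touched: int, flops: int) -> float:
    """FLOPs per byte moved to or from main memory in one loop iteration."""
    return flops / (arrays_touched * BYTES_PER_DOUBLE)


for name, (body, arrays, flops) in KERNELS.items():
    print(f"{name:6s} {body:24s} {arithmetic_intensity(arrays, flops):.4f} FLOPs/byte")
```

Under these assumptions even the most compute-heavy kernel performs 2 FLOPs per 24 bytes, about 0.083 FLOPs/byte, which places it far into the bandwidth-limited region of a roofline plot.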
