High Performance Computing: Concepts, Methods & Means
Performance I: Benchmarking
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 23rd, 2007
2
Topics
• Definitions, properties and applications
• Early benchmarks
• Everything you ever wanted to know about Linpack (but were afraid to ask)
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary
4
Basic Performance Metrics
• Time related:
  – Execution time [seconds]
    • wall clock time
    • system and user time
  – Latency
  – Response time
• Rate related:
  – Rate of computation
    • floating point operations per second [flops]
    • integer operations per second [ops]
  – Data transfer (I/O) rate [bytes/second]
• Effectiveness:
  – Efficiency [%]
  – Memory consumption [bytes]
  – Productivity [utility/($*second)]
• Modifiers:
  – Sustained
  – Peak
  – Theoretical peak
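As a rough illustration of how the time-related, rate-related and effectiveness metrics above relate to each other, the following Python sketch times a toy kernel and derives a sustained rate and an efficiency figure from it. The kernel, its operation count of 2 flops per element, and the `peak_flops` value are assumptions made for this example only; they are not measurements from the lecture.

```python
import time


def dot(x, y):
    """Toy kernel: dot product, counted as 2 flops (multiply + add) per element."""
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total


def measure(n=200_000, peak_flops=1.0e9):
    """Return wall time, CPU time, sustained flop rate and efficiency for dot().

    peak_flops is a hypothetical theoretical peak, used only for illustration.
    """
    x = [1.0] * n
    y = [2.0] * n
    wall0 = time.perf_counter()    # wall clock time
    cpu0 = time.process_time()     # user + system CPU time of this process
    result = dot(x, y)
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    flops = 2.0 * n                # assumed operation count of the kernel
    sustained = flops / wall       # sustained rate [flops]
    efficiency = 100.0 * sustained / peak_flops   # percent of the assumed peak
    return {"result": result, "wall_s": wall, "cpu_s": cpu,
            "sustained_flops": sustained, "efficiency_pct": efficiency}


if __name__ == "__main__":
    m = measure()
    print(f"wall {m['wall_s']:.4f} s, cpu {m['cpu_s']:.4f} s, "
          f"{m['sustained_flops'] / 1e6:.1f} Mflops, "
          f"{m['efficiency_pct']:.1f}% of assumed peak")
```

In an interpreted language the sustained rate is dominated by interpreter overhead, so the number mainly shows how the metrics are derived, not what the hardware can deliver.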
5
What Is a Benchmark?
Benchmark: a standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance) [Merriam-Webster]
• The term “benchmark” also commonly applies to specially designed programs used in benchmarking
• A benchmark should:
  – be domain specific (the more general the benchmark, the less useful it is for anything in particular)
  – be a distillation of the essential attributes of a workload
  – avoid using a single metric to express the overall performance
• Kinds of computational benchmarks:
  – synthetic: specially created programs that impose a load on a specific component of the system
  – application: derived from a real-world application program
6
Purpose of Benchmarking
• To define the playing field
• To provide a tool enabling quantitative comparisons
• To accelerate progress
  – enable better engineering by defining measurable and repeatable objectives
• To establish a performance agenda
  – measure release-to-release or version-to-version progress
  – set goals to meet
  – be understandable and useful also to people without expertise in the field (managers, etc.)
7
Properties of a Good Benchmark
• Relevance: meaningful within the target domain
• Understandability
• Good metric(s): linear, orthogonal, monotonic
• Scalability: applicable to a broad spectrum of hardware/architecture
• Coverage: does not over-constrain the typical environment
• Acceptance: embraced by users and vendors
• Has to enable comparative evaluation
• Limited lifetime: there is a point when additional code modifications or optimizations become counterproductive
Adapted from: Standard Benchmarks for Database Systems by Charles Levine, SIGMOD ‘97
9
Early Benchmarks
• Whetstone
  – Floating point intensive
• Dhrystone
  – Integer and character string oriented
• Livermore Fortran Kernels
  – “Livermore Loops”
  – Collection of short kernels
• NAS kernel
  – 7 Fortran test kernels for aerospace computation
The sources of the benchmarks listed above are available from: http://www.netlib.org/benchmark
15
Linpack Overview
• Introduced by Jack Dongarra in 1979
• Based on the LINPACK linear algebra package developed by J. Dongarra, J. Bunch, C. Moler and P. Stewart (now superseded by the LAPACK library)
• Solves a dense, regular system of linear equations, using matrices initialized with pseudo-random numbers
• Provides an estimate of the system’s effective floating-point performance
• Does not reflect the overall performance of the machine!
16
Linpack Benchmark Variants
• Linpack Fortran (single processor)
  – N=100
  – N=1000, TPP, best effort
• Linpack’s Highly Parallel Computing benchmark (HPL)
• Java Linpack
19
Linpack Fortran Performance on Different Platforms
Computer | N=100 [MFlops] | N=1000, TPP [MFlops] | Theoretical Peak [MFlops]
Intel Pentium Woodcrest (1 core, 3 GHz) | 3018 | 6542 | 12000
NEC SX-8/8 (8 proc., 2 GHz) | - | 75140 | 128000
NEC SX-8/8 (1 proc., 2 GHz) | 2177 | 14960 | 16000
HP ProLiant BL20p G3 (4 cores, 3.8 GHz Intel Xeon) | - | 8185 | 14800
HP ProLiant BL20p G3 (1 core, 3.8 GHz Intel Xeon) | 1852 | 4851 | 7400
IBM eServer p5-575 (8 POWER5 proc., 1.9 GHz) | - | 34570 | 60800
IBM eServer p5-575 (1 POWER5 proc., 1.9 GHz) | 1776 | 5872 | 7600
SGI Altix 3700 Bx2 (1 Itanium2 proc., 1.6 GHz) | 1765 | 5953 | 6400
HP ProLiant BL45p (4 cores AMD Opteron 854, 2.8 GHz) | - | 12860 | 22400
HP ProLiant BL45p (1 core AMD Opteron 854, 2.8 GHz) | 1717 | 4191 | 5600
Fujitsu VPP5000/1 (1 proc., 3.33ns) | 1156 | 8784 | 9600
Cray T932 (32 proc., 2.2ns) | 1129 (1 proc.) | 29360 | 57600
HP AlphaServer GS1280 7/1300 (8 Alpha proc., 1.3GHz) | - | 14260 | 20800
HP AlphaServer GS1280 7/1300 (1 Alpha proc., 1.3GHz) | 1122 | 2132 | 2600
HP 9000 rp8420-32 (8 PA-8800 proc., 1000MHz) | - | 14150 | 32000
HP 9000 rp8420-32 (1 PA-8800 proc., 1000MHz) | 843 | 2905 | 4000
Data excerpted from the 11-30-2006 LINPACK Benchmark Report at http://www.netlib.org/benchmark/performance.ps
20
Fortran Linpack Demo

> ./linpack
 Please send the results of this run to:

 Jack J. Dongarra
 Computer Science Department
 University of Tennessee
 Knoxville, Tennessee 37996-1300

 Fax: 865-974-8296

 Internet: dongarra@cs.utk.edu

 This is version 29.5.04.

     norm. resid       resid          machep           x(1)            x(n)
  1.25501937E+00  1.39332990E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

 times are reported for matrices of order 100
     dgefa      dgesl      total     mflops       unit      ratio       b(1)
 times for array with leading dimension of 201
  4.890E-04  2.003E-05  5.090E-04  1.349E+03  1.483E-03  9.090E-03 -9.159E-15
  4.860E-04  1.895E-05  5.050E-04  1.360E+03  1.471E-03  9.017E-03  1.000E+00
  4.850E-04  2.003E-05  5.050E-04  1.360E+03  1.471E-03  9.018E-03  1.000E+00
  4.856E-04  1.730E-05  5.029E-04  1.365E+03  1.465E-03  8.981E-03  5.298E+02

 times for array with leading dimension of 200
  4.210E-04  1.800E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03  1.000E+00
  4.200E-04  1.901E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03  1.000E+00
  4.200E-04  1.699E-05  4.370E-04  1.571E+03  1.273E-03  7.804E-03  1.000E+00
  4.288E-04  1.640E-05  4.452E-04  1.542E+03  1.297E-03  7.950E-03  5.298E+02
 end of tests -- this version dated 05/29/04

Reference: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html

Column legend:
• dgefa: time spent in matrix factorization routine
• dgesl: time spent in solver
• total: total time (dgefa+dgesl)
• mflops: sustained floating point rate
• unit: “timing” unit (obsolete)
• ratio: fraction of Cray-1S execution time (obsolete)
• b(1): first element of right hand side vector
Two different leading dimensions (201 and 200) are used to test the effect of array placement in memory.
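The mflops column can be recomputed from the reported times. The sketch below assumes the nominal operation count conventionally used for the Linpack benchmark, 2n³/3 + 2n², and applies it to the first row of the demo run above; the function names are made up for this example.

```python
def linpack_ops(n):
    """Nominal Linpack operation count for a system of order n: 2n^3/3 + 2n^2."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def linpack_mflops(n, total_seconds):
    """Sustained rate in Mflops for a solve of order n that took total_seconds."""
    return linpack_ops(n) / total_seconds / 1.0e6


if __name__ == "__main__":
    # dgefa and dgesl times from the first row of the demo run (N = 100)
    dgefa, dgesl = 4.890e-04, 2.003e-05
    print(f"{linpack_mflops(100, dgefa + dgesl):.0f} Mflops")
```

With these inputs the result agrees with the 1.349E+03 reported in the first row, to the precision shown in the output.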
21
Linpack’s Highly Parallel Computing Benchmark (HPL)
• Measures the performance of distributed memory machines
• Used in the “Linpack Benchmark Report” (Table 3) and to determine the order of machines on the Top500 list
• The portable version is written in C
• External dependencies:
  – MPI-1.1 functionality for inter-node communication
  – BLAS or VSIPL library for simple vector operations such as scaled vector addition (DAXPY: y = αx + y) and dot product (DDOT: a = Σ x_i y_i)
• Ground rules:
  – allows a complete user replacement of the LU factorization and solver steps (the accuracy must satisfy a given bound)
  – same matrix as in the driver program
  – no restrictions on problem size
24
HPL Linpack Metrics
• The HPL implementation of the benchmark is run for different problem sizes N on the entire machine
• For a certain problem size Nmax, the cumulative performance in Mflops (reflecting 64-bit addition and multiplication operations) reaches its maximum value, denoted Rmax
• Another metric that can be obtained from the benchmark is N1/2, the problem size at which half of the maximum performance (Rmax/2) is achieved
• The Rmax value is used to rank supercomputers in the Top500 list; listed along with this number are N1/2 and Rpeak, the theoretical peak double precision floating point performance of the machine
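A minimal sketch of how Rmax, Nmax and N1/2 could be extracted from a sweep over problem sizes is given below. The sweep data are invented for illustration, and estimating N1/2 by linear interpolation between the two bracketing runs is an assumption of this example, not a rule stated in the lecture.

```python
def rmax_and_nhalf(results):
    """results: list of (N, performance) pairs from runs at different sizes N.

    Returns (Nmax, Rmax, N_half). N_half is estimated by linear interpolation
    between the two measured sizes that bracket Rmax/2, or None if no pair
    of consecutive runs brackets that value.
    """
    results = sorted(results)
    n_max, r_max = max(results, key=lambda pair: pair[1])
    half = r_max / 2.0
    n_half = None
    for (n0, r0), (n1, r1) in zip(results, results[1:]):
        if r0 < half <= r1:
            n_half = n0 + (half - r0) * (n1 - n0) / (r1 - r0)
            break
    return n_max, r_max, n_half


if __name__ == "__main__":
    # Hypothetical sweep: (problem size N, measured Gflops)
    sweep = [(1000, 2.0), (2000, 5.0), (4000, 8.0), (8000, 10.0), (16000, 9.5)]
    n_max, r_max, n_half = rmax_and_nhalf(sweep)
    print(f"Nmax = {n_max}, Rmax = {r_max} Gflops, N1/2 ~ {n_half}")
```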
25
Machine Parameters Influencing Linpack Performance
Parameter | Linpack Fortran, N=100 | Linpack Fortran, N=1000, TPP | HPL
Processor speed | Yes | Yes | Yes
Memory capacity | No | No (modern system) | Yes (for Rmax)
Network latency/bandwidth | No | No | Yes
Compiler flags | Yes | Yes | Yes
26
Ten Fastest Supercomputers On Current Top500 List
# | Computer | Site | Processors | Rmax [Gflops] | Rpeak [Gflops]
1 | IBM Blue Gene/L | DoE/NNSA/LLNL (USA) | 131,072 | 280,600 | 367,000
2 | Cray Red Storm | Sandia (USA) | 26,544 | 101,400 | 127,411
3 | IBM BGW | IBM T. Watson Research Center (USA) | 40,960 | 91,290 | 114,688
4 | IBM ASC Purple | DoE/NNSA/LLNL (USA) | 12,208 | 75,760 | 92,781
5 | IBM Mare Nostrum | Barcelona Supercomputing Center (Spain) | 10,240 | 62,630 | 94,208
6 | Dell Thunderbird | NNSA/Sandia (USA) | 9,024 | 53,000 | 64,973
7 | Bull Tera-10 | Commissariat a l’Energie Atomique (France) | 9,968 | 52,840 | 63,795
8 | SGI Columbia | NASA/Ames Research Center (USA) | 10,160 | 51,870 | 60,960
9 | NEC/Sun Tsubame | GSIC Center, Tokyo Institute of Technology (Japan) | 11,088 | 47,380 | 82,125
10 | Cray Jaguar | Oak Ridge National Laboratory (USA) | 10,424 | 43,480 | 54,205
Source: http://www.top500.org/list/2006/11/100
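The ratio Rmax/Rpeak gives the fraction of theoretical peak that HPL sustains on each machine. The short sketch below computes it for the first three entries of the table above; the helper name `hpl_efficiency` is made up for this example.

```python
# (name, Rmax, Rpeak) copied from the first three rows of the table above
TOP3 = [
    ("IBM Blue Gene/L", 280600, 367000),
    ("Cray Red Storm", 101400, 127411),
    ("IBM BGW", 91290, 114688),
]


def hpl_efficiency(rmax, rpeak):
    """Percentage of theoretical peak achieved on HPL."""
    return 100.0 * rmax / rpeak


if __name__ == "__main__":
    for name, rmax, rpeak in TOP3:
        print(f"{name}: {hpl_efficiency(rmax, rpeak):.1f}% of peak")
```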
28
HPL Demo

> mpirun -np 4 xhpl
============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 5000
NB     : 32
PMAP   : Row-major process mapping
P      : 2 1 4
Q      : 2 4 1
PFACT  : Left
NBMIN  : 2
NDIV   : 2
RFACT  : Left
BCAST  : 1ringM
DEPTH  : 0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
   2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V           N    NB     P     Q      Time      Gflops
----------------------------------------------------------------------------
WR01L2L2   5000    32     2     2      7.14   1.168e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0400275 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0264242 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0051580 ...... PASSED
============================================================================
T/V           N    NB     P     Q      Time      Gflops
----------------------------------------------------------------------------
WR01L2L2   5000    32     1     4      7.00   1.192e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0335428 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0221433 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0043224 ...... PASSED
============================================================================
T/V           N    NB     P     Q      Time      Gflops
----------------------------------------------------------------------------
WR01L2L2   5000    32     4     1      7.00   1.191e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0426255 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0281393 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0054928 ...... PASSED
============================================================================

Finished 3 tests with the following results:
 3 tests completed and passed residual checks,
 0 tests completed and failed residual checks,
 0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================

For configuration issues, consult: http://www.netlib.org/benchmark/hpl/faqs.html
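To make the third scaled residual check from the HPL output concrete, the sketch below solves a small random dense system in pure Python and evaluates ||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo). It is not HPL's implementation: the solver is a plain Gaussian elimination with partial pivoting, the problem size and random seed are arbitrary, and only the formula, the eps value and the 16.0 threshold come from the output above.

```python
import random
import sys


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (inputs copied)."""
    n = len(b)
    a = [row[:] for row in a]
    x = b[:]
    for k in range(n):
        # Pivot on the row with the largest magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            x[i] -= m * x[k]
    for i in range(n - 1, -1, -1):          # back substitution
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / a[i][i]
    return x


def scaled_residual(a, x, b):
    """||Ax-b||_oo / (eps * ||A||_oo * ||x||_oo), with eps = 1.110223e-16."""
    eps = sys.float_info.epsilon / 2.0
    resid = max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
                for row, bi in zip(a, b))
    norm_a = max(sum(abs(v) for v in row) for row in a)
    norm_x = max(abs(v) for v in x)
    return resid / (eps * norm_a * norm_x)


if __name__ == "__main__":
    rng = random.Random(0)                  # arbitrary seed
    n = 60                                  # arbitrary small problem size
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    value = scaled_residual(a, solve(a, b), b)
    print(f"scaled residual = {value:.4f}",
          "PASSED" if value < 16.0 else "FAILED")
```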
30
Other Parallel Benchmarks
• High Performance Computing Challenge (HPCC) benchmarks
  – Devised and sponsored to enrich the benchmarking parameter set
• NAS Parallel Benchmarks (NPB)
  – Powerful set of metrics
  – Reflects computational fluid dynamics
• NPBIO-MPI
  – Stresses external I/O system
31
HPC Challenge Benchmark
Consists of 7 individual tests:
• HPL (Linpack TPP): floating point rate of execution of a solver of a linear system of equations
• DGEMM: floating point rate of execution of double precision matrix-matrix multiplication
• STREAM: sustainable memory bandwidth (GB/s) and the corresponding computation rate for a simple vector kernel
• PTRANS (parallel matrix transpose): total capacity of the network using pairwise communicating processes
• RandomAccess: the rate of integer random updates of memory (in GUPS: Giga-Updates Per Second)
• FFT: floating point rate of execution of double precision complex 1-D Discrete Fourier Transform
• b_eff (effective bandwidth benchmark): latency and bandwidth of a number of simultaneous communication patterns
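To give a feel for what a STREAM-style measurement looks like, here is a minimal sketch of a triad-like kernel, a[i] = b[i] + s*c[i], timed in Python. It is not the STREAM benchmark itself: the array size, the repeat count and the accounting of 24 bytes moved per element (two 8-byte reads and one 8-byte write) are assumptions of this example, and interpreter overhead means the reported figure is far below the real memory bandwidth.

```python
import time
from array import array


def triad_bandwidth(n=1_000_000, scalar=3.0, repeats=3):
    """Time a[i] = b[i] + scalar * c[i]; return (best GB/s, a[0]).

    Counts 3 arrays * 8 bytes per element as the data moved per pass.
    """
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    a = array("d", [0.0]) * n
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        for i in range(n):
            a[i] = b[i] + scalar * c[i]
        best = min(best, time.perf_counter() - t0)   # keep the fastest pass
    bytes_moved = 3 * 8 * n
    return bytes_moved / best / 1e9, a[0]


if __name__ == "__main__":
    gbs, check = triad_bandwidth()
    print(f"triad: {gbs:.3f} GB/s (check value {check})")
```

Taking the best of several passes follows common practice for bandwidth kernels, where the fastest run is least disturbed by other activity on the machine.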
32
Comparison of HPCC Results on Selected Supercomputers
[Figure: chart comparing HPCC results; axis: percentage of the maximum value, 0 to 100]
Metrics shown, with the maximum value among the compared systems:
• G-HPL (max=91 Tflops)
• G-PTRANS (max=4666 GB/s)
• G-RandomAccess (max=7.69 GUP/s)
• G-FFTE (max=1763 Gflops)
• EP-STREAM system (max=62890 GB/s)
• EP-DGEMM system (max=161885 Gflops)
• Random Ring Bandwidth (max=0.829 GB/s)
• Random Ring Latency (max=118.6 μs)
Systems compared:
• "Red Storm" Cray XT3, Sandia (Opteron/Cray custom 3D mesh)
• IBM p5-575, LLNL (Power5/IBM HPS)
• IBM Blue Gene/L, NNSA (PowerPC 440/IBM custom 3D torus & tree)
• Cray X1E, ORNL (X1E/Cray modified 2D torus)
• HP XC, Government (Itanium2/Quadrics Elan4)
• "Columbia" SGI, NASA (Itanium2/SGI NUMALINK)
• NEC SX-8, HLRS (SX-8/IXS crossbar)
• "Emerald" Rackable Systems, AMD (Opteron/Silverstorm Infiniband)
Notes:
• all metrics shown are “higher-better”, except for the Random Ring Latency
• machine labels include: machine name (optional), manufacturer and system name, affiliation and (in parentheses) processor/network fabric type
33
NAS Parallel Benchmarks
• Derived from computational fluid dynamics (CFD) applications
• Consist of five kernels and three pseudo-applications
• Exist in several flavors:
  – NPB 1: original paper-and-pencil specification
    • generally proprietary implementations by hardware vendors
  – NPB 2: MPI-based sources distributed by NAS
    • supplements NPB 1
    • can be run with little or no tuning
  – NPB 3: implementations in OpenMP, HPF and Java
    • derived from NPB-serial version with improved serial code
    • a set of multi-zone benchmarks was added
    • test implementation efficiency of multi-level and hybrid parallelization methods and tools (e.g. OpenMP with MPI)
  – GridNPB 3: new suite of benchmarks, designed to rate the performance of computational grids
    • includes only four benchmarks, derived from the original NPB
    • written in Fortran and Java
    • Globus as grid middleware
34
NPB 2 Overview
• Multiple problem classes (S, W, A, B, C, D)
• Tests written mainly in Fortran (IS in C):
  – BT (block tri-diagonal solver with 5x5 block size)
  – CG (conjugate gradient approximation to compute the smallest eigenvalue of a sparse, symmetric positive definite matrix)
  – EP (“embarrassingly parallel”; evaluates an integral by means of pseudorandom trials)
  – FT (3-D PDE solver using Fast Fourier Transforms)
  – IS (large integer sort; tests both integer computation speed and network performance)
  – LU (a regular-sparse, 5x5 block lower and upper triangular system solver)
  – MG (simplified multigrid kernel; tests both short and long distance data communication)
  – SP (solves multiple independent systems of non-diagonally dominant, scalar, pentadiagonal equations)
• Sources and reports available from: http://www.nas.nasa.gov/Resources/Software/npb.html
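The idea behind an “embarrassingly parallel” kernel such as EP can be sketched with a simple Monte Carlo integration. The example below is not the NPB EP kernel: it estimates π by sampling points in the unit square, and the function names, sample counts and seeds are chosen for illustration. Each chunk depends only on its own seed, so the chunks could be assigned to separate processes with no communication beyond the final sum; here they simply run one after another.

```python
import math
import random


def partial_hits(samples, seed):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)       # independent stream per chunk
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(total_samples=200_000, workers=4):
    """Split the trials into independent chunks and combine the partial counts."""
    per_worker = total_samples // workers
    hits = sum(partial_hits(per_worker, seed) for seed in range(workers))
    return 4.0 * hits / (per_worker * workers)


if __name__ == "__main__":
    estimate = estimate_pi()
    print(f"estimate = {estimate:.4f}, error = {abs(estimate - math.pi):.4f}")
```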
38
Benchmarking Organizations
• SPEC
  – Created to satisfy the need for realistic, fair and standardized performance tests
  – Motto: “An ounce of honest data is worth more than a pound of marketing hype”
• TPC
  – Formed primarily due to lack of reliable database benchmarks
Presentation of the Results
• Tables
• Graphs
  – Bar graphs (a)
  – Scatter plots (b)
  – Line plots (c)
  – Pie charts (d)
  – Gantt charts (e)
  – Kiviat graphs (f)
• Enhancements
  – Error bars, boxes or confidence intervals
  – Broken or offset scales (be careful!)
  – Multiple curves per graph (but avoid overloading)
  – Data labels, colors, etc.
[Figure: six example graph panels, (a) through (f), one per graph type; recoverable legend entries: G-HPL, G-PTRANS, G-FFTE, G-RanAcc, G-Stream, EPStream]
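Error bars and confidence intervals are derived from repeated measurements. The sketch below computes a mean and the half-width of a confidence interval for a set of timings. The timing values are hypothetical, and the interval uses a normal approximation, which is an assumption of this example; with only a handful of runs, a Student's t quantile would give a somewhat wider interval.

```python
import statistics


def mean_with_ci(samples, confidence=0.95):
    """Return (mean, half-width) of a normal-approximation confidence interval."""
    mean = statistics.fmean(samples)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean, z * sem


if __name__ == "__main__":
    # Hypothetical run times in seconds from five repetitions of one benchmark
    runs = [1.02, 0.98, 1.01, 0.99, 1.00]
    mean, half = mean_with_ci(runs)
    print(f"{mean:.3f} s +/- {half:.3f} s (95% CI, normal approximation)")
```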
Kiviat Graph Example
46
Source: http://www.cse.clrc.ac.uk/disco/DLAB_BENCH_WEB/hpcc/hpcc_kiviat.shtml
Mixed Graph Example
47
[Figure: Characterization of NSF/CCT parallel applications on POWER5 architecture (using data collected by IPM)]
Applications shown: WRF, OOCORE, MILC, PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC, PETSc_FUN3D
Quantities plotted: computation fraction, communication fraction, floating point operations, load/store operations, other operations
48
Graph Do’s and Don’ts
• Good graphs:
  – Require minimum effort from the reader
  – Maximize information
  – Maximize information-to-ink ratio
  – Use commonly accepted practices
  – Avoid ambiguity
• Poor graphs:
  – Have too many alternatives on a single chart
  – Display too many y-variables on a single chart
  – Use vague symbols in place of text
  – Show extraneous information
  – Select scale ranges improperly
  – Use a line chart instead of a bar graph
Reference: Raj Jain, The Art of Computer Systems Performance Analysis, Chapter 10
49
Common Mistakes in Benchmarking
• Only average behavior represented in test workload
• Skewness of device demands ignored
• Loading level controlled inappropriately
• Caching effects ignored
• Buffering sizes not appropriate
• Inaccuracies due to sampling ignored
• Ignoring monitoring overhead
• Not validating measurements
• Not ensuring same initial conditions
• Not measuring transient performance
• Using device utilizations for performance comparisons
• Collecting too much data but doing very little analysis
From Chapter 9 of The Art of Computer Systems Performance Analysis by Raj Jain
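Several of the mistakes above (ignored caching effects, differing initial conditions, unvalidated single measurements) can be reduced with a simple measurement harness. The sketch below runs a few untimed warm-up iterations and then collects repeated wall-clock timings. The `workload` function is a hypothetical stand-in for the code under test, and the warm-up and repeat counts are arbitrary.

```python
import statistics
import time


def benchmark(func, warmup=2, repeats=5):
    """Call func() warmup times untimed, then return a list of timed runs [s]."""
    for _ in range(warmup):
        func()                       # warm caches; results are discarded
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func()
        times.append(time.perf_counter() - t0)
    return times


def workload():
    # Hypothetical stand-in for the code being measured.
    return sum(i * i for i in range(50_000))


if __name__ == "__main__":
    t = benchmark(workload)
    print(f"min {min(t):.5f} s, median {statistics.median(t):.5f} s, "
          f"max {max(t):.5f} s")
```

Reporting the minimum, median and maximum together exposes run-to-run variation that a single average would hide.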
50
Misrepresentation of Performance Results on Parallel Computers
• Quote only 32-bit performance results, not 64-bit results
• Present performance for an inner kernel, representing it as the performance of the entire application
• Quietly employ assembly code and other low-level constructs
• Scale problem size with the number of processors, but omit any mention of this fact
• Quote performance results projected to the full system
• Compare your results with scalar, unoptimized code run on another platform
• When direct run time comparisons are required, compare with an old code on an obsolete system
• If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation
• Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar
• Mutilate the algorithm used in the parallel implementation to match the architecture
• Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment
• If all else fails, show pretty pictures and animated videos, and don't talk about performance
Reference: David Bailey, “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”, Supercomputing Review, Aug 1991, pp. 54-55, http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
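The choice of baseline alone can change a reported speedup considerably. The sketch below contrasts a speedup measured against the best sequential code with one measured against the parallel code run on a single processor. All timings are invented for illustration.

```python
def speedup(reference_seconds, parallel_seconds):
    """Speedup of a parallel run relative to a chosen reference run."""
    return reference_seconds / parallel_seconds


def parallel_efficiency(reference_seconds, parallel_seconds, processors):
    """Speedup divided by the number of processors used."""
    return speedup(reference_seconds, parallel_seconds) / processors


if __name__ == "__main__":
    # Hypothetical timings, invented for illustration only.
    best_sequential = 100.0   # best known sequential implementation [s]
    parallel_on_1 = 160.0     # parallel code run on one processor [s]
    parallel_on_16 = 12.5     # parallel code run on 16 processors [s]

    honest = speedup(best_sequential, parallel_on_16)
    flattering = speedup(parallel_on_1, parallel_on_16)
    print(f"vs best sequential code:  {honest:.1f}x "
          f"({parallel_efficiency(best_sequential, parallel_on_16, 16):.0%} efficiency)")
    print(f"vs parallel code on 1 CPU: {flattering:.1f}x "
          f"({parallel_efficiency(parallel_on_1, parallel_on_16, 16):.0%} efficiency)")
```

With these numbers the same 16-processor run appears as either an 8x or a 12.8x speedup, which is why the reference run should always be stated.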
52
Knowledge Factors & Skills
• Knowledge factors:
  – benchmarking and metrics
  – performance factors
  – Top500 list
• Skill set:
  – determine state of system resources and manipulate them
  – acquire, run and measure benchmark performance
  – launch user application codes
Material For Test
• Basic performance metrics (slide 4)
• Definition of benchmark in own words; purpose of benchmarking; properties of a good benchmark (slides 5, 6, 7)
• Linpack: what it is, what it measures, concepts and complexities (slides 15, 17, 18)
• HPL (slides 21 and 24)
• Linpack compare and contrast (slide 25)
• General knowledge about HPCC and NPB suites (slides 31 and 34)
• Benchmark result interpretation (slides 49, 50)
53