Panel Question
• "HPC Performance Metrics: Should We Drop FLOPS?"
• Even as computer architectures have advanced and made floating-point operations relatively inexpensive, it remains deeply ingrained in HPC to compare systems by their FLOPS rates, and to optimize algorithms to minimize their floating-point operation count even at the cost of increased memory bandwidth demands or programming effort.
• Should we drop FLOPS as a metric? If the answer is yes, is there a way to move gracefully (or perhaps suddenly) to a metric other than FLOPS that has the predictive value and ease of analysis that FLOPS had in the 1960s and 1970s?
• The panelists will give their positions on this question and their suggestions on the general issue of using simplistic measures like FLOPS as a performance metric.
Déjà Vu All Over Again
• Workshop on Performance Characterization, Modeling and Benchmarking for HPC Systems – Emeryville, CA, 5-7 May 2003
• Panel: Performance Metrics for HPCS: Holy Grail or Fata Morgana
1. Why don't we have a good metric for HPC performance?
2. Is there any chance to define a single metric for HPC performance? Does everybody need his/her own metric?
3. What are the requirements (theoretical, practical, and political) for such metrics?
4. What needs to be done to get new metrics accepted? Is it even possible?
5. Sustained vs. peak? Repercussions?
6. Implications of performance metrics on the political level
A Single "Good" Metric? (1 & 2)
• Parameters, i.e., characteristics
• Parameter Space§ (examples):
– Performance (theoretical peak capability)
– Peak processor performance
– # processors
– System memory
– Bisection bandwidth
– Clock rate
– Ops/cycle
– # nodes
– Processors per node
§Thanks to Thomas Sterling circa 2002
A Single "Good" Metric? (1 & 2)
• Parameters versus Metrics
• Parameter Space (examples):
– Performance (theoretical peak capability)
– Peak processor performance
– # processors
– System memory
– Bisection bandwidth
– Clock rate
– Ops/cycle
– # nodes
– Processors per node
• Metric Space (examples):
– User applications / representative benchmarks (HPC Challenge)
– Wall clock time
– Derived metrics: sustained Flop/s, sustained (memory) bandwidth, sustained GUPS, sustained Intop/s
A Single "Good" Metric? (1 & 2)
• Why don't we have a good metric for HPC performance?
– We do have good metrics — well, sort of…
• Wall clock time to solution for your application (or representative benchmark)
• Derived metrics of sustained performance for your application (or representative benchmark): Flop/s, Intop/s, GUPS (Giga Updates per Second), bandwidth
– Because there are multiple derived metrics, comparisons can be limited
• Maybe not…
• Is there any chance to define a single metric for HPC performance?
– No, everybody needs his/her own — applications and workflows vary too greatly for a single performance metric
• Scientific calculations — sustained Flop/s
• Sustained GUPS
• Sustained transactions
• etc.
– Yes, we must define a single metric
• The elevator sound-bite metric
• The metric for the Gordon Bell award
• As of 1 November 2007: 165 base runs, 21 optimized runs (http://icl.cs.utk.edu/hpcc/)
Requirements for Metrics (3)
• "Demonstrate (not claim) performance" (Kuck)
• "You get what you measure"
• Theoretical
– Stable
– Quantifiable
– Reasonable
• Practical
– K.I.S.S.
• Political
– Understandable by someone outside the community with limited time, limited knowledge of HPC, … , (who controls your funding)
Accepting New Metrics (4)
• What needs to be done?
– View it as a science
• Perform research and publish!
• Fund work in this area!
– Work toward consensus
• Is it possible?
– Most definitely!
Sustained or Peak? Repercussions (5)
• Sustained performance
– Honest
– Application dependent
• Peak performance
– Dishonest, misleading: "Processing in HPC will be free!"
– Peak "macho" Flop/s
– %Peak
Metrics — The Layer 9 Protocol: Implications (6)
• National-level politics and the media
– Metric(s) need to be understandable
• By individuals outside the community with limited time, limited knowledge of HPC, … , (who control your funding)
• At The New York Times level of in-depth media coverage
• As an elevator sound bite
– Need mainstream acceptance and understanding of HPC metrics
• Business
– Metric(s) need to be useful for procurement planning and decision-making by various organizations and business entities
HPC Performance Metrics: Should We Drop Flop/s?
• Clearly identify when you are referring to parameters versus metrics
• Drop Flop/s?
– Yes…
– No…
• A simple alternative for comparing systems?
– Utilized bandwidth…
HPC Challenge Benchmarks and Multiprocessor Architecture
• Each HPC Challenge benchmark focuses on a different part of the memory hierarchy:
– HPL: register and cache bandwidth
– STREAM: local memory bandwidth
– FFT: remote memory bandwidth (bisection bandwidth)
– RandomAccess: remote memory bandwidth and latency (bisection bandwidth)
• Consider comparing HPC Challenge benchmark performance to illustrate architecture characteristics:
– Normalize to "utilized bandwidth" (B/s)
– B/s = B/op × op/s, where B/s is bytes per second, B/op is the average bytes of memory access per operation, and op/s is the HPC Challenge metric for each benchmark
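The normalization above is simple arithmetic, sketched below. The benchmark rates and the bytes-per-operation values in this example are illustrative assumptions only (B/op depends on word width and each kernel's access pattern); they are not figures from the HPC Challenge results.

```python
# Sketch of the "utilized bandwidth" normalization: B/s = B/op x op/s.
# NOTE: all rates and bytes-per-op values below are hypothetical,
# chosen only to illustrate the arithmetic.

def utilized_bandwidth(op_rates, bytes_per_op):
    """Normalize each benchmark's op/s rate to bytes per second."""
    return {name: bytes_per_op[name] * rate for name, rate in op_rates.items()}

# Measured benchmark rates for an imaginary system (op/s)
op_rates = {
    "HPL": 1.0e12,           # flop/s
    "STREAM Triad": 5.0e10,  # flop/s
    "RandomAccess": 2.0e7,   # updates/s (0.02 GUPS)
}

# Assumed average bytes of memory traffic per operation
bytes_per_op = {
    "HPL": 0.1,              # cache-resident: few bytes per flop
    "STREAM Triad": 12.0,    # 24 bytes moved per 2 flops in a(i)=b(i)+s*c(i)
    "RandomAccess": 8.0,     # one 64-bit word per update
}

for name, bs in utilized_bandwidth(op_rates, bytes_per_op).items():
    print(f"{name}: {bs:.2e} B/s")
```

Putting all four benchmarks on the common B/s axis is what allows the cross-system comparisons shown later.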
Architecture Implications
Multiprocessor NUMA Memory Hierarchy and HPC Challenge Benchmarks

HPC Challenge Benchmark | Action                                                  | HPCS Goals Improvement
HPL                     | Solve Ax = b (A: matrix, b: vector)                     | 2 Petaflop/s (8x)
STREAM Triad            | a(i) = b(i) + s*c(i) (a, b, c: vectors)                 | 6.5 PetaB/s (40x)
FFT                     | 1D complex Fast Fourier Transform                       | 0.5 Petaflop/s (200x)
RandomAccess            | t(k) = t(k) XOR a(i) (t: global vector, a: data stream) | 64,000 GUPS (2000x)
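The STREAM Triad and RandomAccess kernels listed above can be sketched in plain Python for clarity. Real HPC Challenge implementations are C/Fortran/MPI codes over very large arrays, and the modulo indexing in `random_access` below is a stand-in assumption for the benchmark's actual index derivation.

```python
# Toy sketches of two HPC Challenge kernels; illustrative only.

def stream_triad(a, b, c, s):
    """STREAM Triad: a(i) = b(i) + s*c(i).
    Streams three vectors through memory, so it stresses
    local memory bandwidth."""
    for i in range(len(a)):
        a[i] = b[i] + s * c[i]

def random_access(t, stream):
    """RandomAccess: t(k) = t(k) XOR a(i).
    Each update hits an effectively random location in the global
    table t, so it stresses remote memory bandwidth and latency
    (measured in GUPS). Index derivation here is a simplification."""
    n = len(t)
    for a_i in stream:
        k = a_i % n  # assumed stand-in for the real index function
        t[k] ^= a_i

# Tiny usage example
a, b, c = [0.0] * 4, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
stream_triad(a, b, c, 2.0)
print(a)  # [21.0, 42.0, 63.0, 84.0]
```

Note the contrast: the triad touches memory sequentially and predictably, while the table update has essentially no locality, which is why the two kernels probe opposite ends of the memory hierarchy.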
[Figure: memory hierarchy diagram — Registers, L1 Cache, L2 Cache, L3 Cache, Local Memory, "Remote" Memory, "Spinning" Disk; bandwidth increases toward the registers, while latency and capacity increase toward the disk.]
HPC Challenge Performance System Comparisons

[Figure: "HPC Challenge Bandwidth Utilization Comparison" — measured memory hierarchy bandwidth on a log scale from 1.E-02 to 1.E+08 GB/s (Giga through Tera to Peta B/s) for HPL, STREAM Triad, G-FFT, and G-RA, comparing Clusters, Optimized MPPs, the Cray MTA2, and the DARPA HPCS Goals; annotations mark ratios of 4x, 16x, >10^4x, and >10^5x.]
• Cray MTA2 data has been added to illustrate architecture differences
• Select systems from the 3 January 2007 HPC Challenge benchmark results; all benchmarks have been normalized as bandwidth utilization within the memory hierarchy
• Data displayed are B/s = B/op × op/s
• The HPCS Goals flatten the memory hierarchy
• The HPCS Goals are similar to the Cray MTA2, but at 10^5x the scale
• Optimized G-RandomAccess results offer both scalable and improved performance (10–10^2x)
New HPC Metrics?
• A Knight of the Order of the Grail bemusedly calls upon seekers of the Holy Grail to "choose wisely." (Indiana Jones and the Last Crusade)
• Is utilized bandwidth the Grail?
• The true Grail was a simple wooden cup that stood out among the false golden, jewel-encrusted grails — remember, choose wisely!