Panel Question
• "HPC Performance Metrics: Should We Drop FLOPS?"
• Even as computer architectures have advanced and made floating-point operations relatively inexpensive, it remains deeply ingrained in HPC to compare systems by their FLOPS rates, and to optimize algorithms to minimize their floating-point operation count even at the cost of increased memory bandwidth demands or programming effort.
• Should we drop FLOPS as a metric? If the answer is yes, is there a way to move gracefully (or perhaps suddenly) to a metric other than FLOPS that has the predictive value and ease of analysis that FLOPS had in the 1960s and 1970s?
• The panelists will give their positions on this question and their suggestions on the general issue of using simplistic measures like FLOPS as a performance metric.
Déjà Vu All Over Again
• Workshop on Performance Characterization, Modeling and Benchmarking for HPC Systems – Emeryville, CA, 5-7 May 2003
• Panel: Performance Metrics for HPCS: Holy Grail or Fata Morgana
1. Why don't we have a good metric for HPC performance?
2. Is there any chance to define a single metric for HPC performance? Does everybody need his/her own metric?
3. What are the requirements (theoretical, practical, and political) for such metrics?
4. What needs to be done to get new metrics accepted? Is it even possible?
5. Sustained vs. peak? Repercussions?
6. Implications of performance metrics on the political level
A Single "Good" Metric? (1 & 2)
• Parameters, i.e., characteristics
• Parameter Space§ (examples):
– Performance (theoretical peak capability)
– Peak processor performance
– # processors
– System memory
– Bisection bandwidth
– Clock rate
– Ops/cycle
– # nodes
– Processors per node
§Thanks to Thomas Sterling circa 2002
A Single "Good" Metric? (1 & 2)
• Parameters versus Metrics
• Parameter Space (examples):
– Performance (theoretical peak capability)
– Peak processor performance
– # processors
– System memory
– Bisection bandwidth
– Clock rate
– Ops/cycle
– # nodes
– Processors per node
• Metric Space (examples):
– User applications / representative benchmarks (HPC Challenge)
– Wall clock time
– Derived metrics: sustained Flop/s, sustained (memory) bandwidth, sustained GUPS, sustained Intop/s
A Single "Good" Metric? (1 & 2)
• Why don't we have a good metric for HPC performance?
– We do have good metrics — well, sort of…
• Wall clock time to solution for your application (or representative benchmark)
• Derived metrics of sustained performance for your application (or representative benchmark): Flop/s, Intop/s, GUPS (Giga Updates per Second), bandwidth
– Because there are multiple derived metrics, comparisons can be limited
• Maybe not…
• Is there any chance to define a single metric for HPC performance?
– No, everybody needs his/her own — applications and workflows vary too greatly for a single performance metric
• Scientific calculations — sustained Flop/s
• Sustained GUPS
• Sustained transactions
• etc.
– Yes, we must define a single metric
• The elevator sound-bite metric
• The metric for the Gordon Bell award
• As of 1 November 2007: 165 base runs, 21 optimized runs (http://icl.cs.utk.edu/hpcc/)
Requirements for Metrics (3)
• "Demonstrate (not claim) performance" (Kuck)
• "You get what you measure"
• Theoretical
– Stable
– Quantifiable
– Reasonable
• Practical
– K.I.S.S.
• Political
– Understandable by someone outside the community with limited time, limited knowledge of HPC, … , (who controls your funding)
Accepting New Metrics (4)
• What needs to be done?
– View it as a science
• Perform research and publish!
• Fund work in this area!
– Work toward consensus
• Is it possible?
– Most definitely!
Sustained or Peak? Repercussions (5)
• Sustained performance
– Honest
– Application dependent
• Peak performance
– Dishonest, misleading: "Processing in HPC will be free!"
– Peak "macho" Flop/s
– %Peak
Metrics — The Layer 9 Protocol: Implications (6)
• National-level politics and the media
– Metric(s) need to be understandable
• By individuals outside the community with limited time, limited knowledge of HPC, … , (who control your funding)
• At The New York Times level of in-depth media coverage
• As an elevator sound bite
– Need mainstream acceptance and understanding of HPC metrics
• Business
– Metric(s) need to be useful for procurement planning and decision-making by various organizations and business entities
HPC Performance Metrics: Should We Drop Flop/s?
• Clearly identify when you are referring to parameters versus metrics
• Drop Flop/s?
– Yes…
– No…
• A simple alternative for comparing systems?
– Utilized bandwidth…
HPC Challenge Benchmarks and Multiprocessor Architecture
• Each HPC Challenge benchmark focuses on a different part of the memory hierarchy:
– HPL: register and cache bandwidth
– STREAM: local memory bandwidth
– FFT: remote memory bandwidth (bisection bandwidth)
– RandomAccess: remote memory bandwidth and latency (bisection bandwidth)
• Consider comparing HPC Challenge benchmark performance to illustrate architecture characteristics:
– Normalize to "utilized bandwidth" (B/s)
– B/s = B/op × op/s, where B/s is bytes per second, B/op is the average bytes of memory access per operation, and op/s is the HPC Challenge metric for each benchmark
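The normalization above is simple arithmetic, sketched below. The benchmark rates and the bytes-per-operation values in this example are illustrative assumptions only (B/op depends on word width and each kernel's access pattern); they are not figures from the HPC Challenge results.

```python
# Sketch of the "utilized bandwidth" normalization: B/s = B/op x op/s.
# NOTE: all rates and bytes-per-op values below are hypothetical,
# chosen only to illustrate the arithmetic.

def utilized_bandwidth(op_rates, bytes_per_op):
    """Normalize each benchmark's op/s rate to bytes per second."""
    return {name: bytes_per_op[name] * rate for name, rate in op_rates.items()}

# Measured benchmark rates for an imaginary system (op/s)
op_rates = {
    "HPL": 1.0e12,           # flop/s
    "STREAM Triad": 5.0e10,  # flop/s
    "RandomAccess": 2.0e7,   # updates/s (0.02 GUPS)
}

# Assumed average bytes of memory traffic per operation
bytes_per_op = {
    "HPL": 0.1,              # cache-resident: few bytes per flop
    "STREAM Triad": 12.0,    # 24 bytes moved per 2 flops in a(i)=b(i)+s*c(i)
    "RandomAccess": 8.0,     # one 64-bit word per update
}

for name, bs in utilized_bandwidth(op_rates, bytes_per_op).items():
    print(f"{name}: {bs:.2e} B/s")
```

Putting all four benchmarks on the common B/s axis is what allows the cross-system comparisons shown later.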
Architecture Implications
Multiprocessor NUMA Memory Hierarchy and HPC Challenge Benchmarks

HPC Challenge Benchmark | Action                                                  | HPCS Goals Improvement
HPL                     | Solve Ax = b (A: matrix, b: vector)                     | 2 Petaflop/s (8x)
STREAM Triad            | a(i) = b(i) + s*c(i) (a, b, c: vectors)                 | 6.5 PetaB/s (40x)
FFT                     | 1D complex Fast Fourier Transform                       | 0.5 Petaflop/s (200x)
RandomAccess            | t(k) = t(k) XOR a(i) (t: global vector, a: data stream) | 64,000 GUPS (2000x)
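The STREAM Triad and RandomAccess kernels listed above can be sketched in plain Python for clarity. Real HPC Challenge implementations are C/Fortran/MPI codes over very large arrays, and the modulo indexing in `random_access` below is a stand-in assumption for the benchmark's actual index derivation.

```python
# Toy sketches of two HPC Challenge kernels; illustrative only.

def stream_triad(a, b, c, s):
    """STREAM Triad: a(i) = b(i) + s*c(i).
    Streams three vectors through memory, so it stresses
    local memory bandwidth."""
    for i in range(len(a)):
        a[i] = b[i] + s * c[i]

def random_access(t, stream):
    """RandomAccess: t(k) = t(k) XOR a(i).
    Each update hits an effectively random location in the global
    table t, so it stresses remote memory bandwidth and latency
    (measured in GUPS). Index derivation here is a simplification."""
    n = len(t)
    for a_i in stream:
        k = a_i % n  # assumed stand-in for the real index function
        t[k] ^= a_i

# Tiny usage example
a, b, c = [0.0] * 4, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
stream_triad(a, b, c, 2.0)
print(a)  # [21.0, 42.0, 63.0, 84.0]
```

Note the contrast: the triad touches memory sequentially and predictably, while the table update has essentially no locality, which is why the two kernels probe opposite ends of the memory hierarchy.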
[Figure: memory hierarchy diagram — Registers, L1 Cache, L2 Cache, L3 Cache, Local Memory, "Remote" Memory, "Spinning" Disk; bandwidth increases toward the registers, while latency and capacity increase toward the disk.]
HPC Challenge Performance System Comparisons

[Figure: "HPC Challenge Bandwidth Utilization Comparison" — measured memory hierarchy bandwidth on a log scale from 1.E-02 to 1.E+08 GB/s (Giga through Tera to Peta B/s) for HPL, STREAM Triad, G-FFT, and G-RA, comparing Clusters, Optimized MPPs, the Cray MTA2, and the DARPA HPCS Goals; annotations mark ratios of 4x, 16x, >10^4x, and >10^5x.]
• Cray MTA2 data has been added to illustrate architecture differences
• Select systems from the 3 January 2007 HPC Challenge benchmark results; all benchmarks have been normalized as bandwidth utilization within the memory hierarchy
• Data displayed are B/s = B/op × op/s
• The HPCS Goals flatten the memory hierarchy
• The HPCS Goals are similar to the Cray MTA2, but at 10^5x the scale
• Optimized G-RandomAccess results offer both scalable and improved performance (10–10^2x)
New HPC Metrics?
• A Knight of the Order of the Grail bemusedly calls upon seekers of the Holy Grail to "choose wisely." (Indiana Jones and the Last Crusade)
• Is utilized bandwidth the Grail?
• The true Grail was a simple wooden cup that stood out among the false golden, jewel-encrusted grails — remember, choose wisely!