2
www.hpcchallenge.org FIND OUT MORE AT SPONSORED BY PROJECT GOALS • Provide performance bounds in locality space using real world computational kernels • Allow scaling of input data size and time to run according to the system capability • Verify the results using standard error analysis • Allow vendors and users to provide optimized code for superior performance • Make the benchmark information continuously available to the public in order to disseminate performance tuning knowledge and record technological progress over time • Ensure reproducibility of the results by detailed reporting of all aspects of benchmark runs FEATURE HIGHLIGHTS OF HPCC 1.4.2 RELEASED OCTOBER 2012 • Increased sizes of scratch vectors for local FFT tests to account for runs on systems with large main memory (reported by IBM, SGI and Intel). • Reduced vector size for local FFT tests due to larger scratch space needed. • Added a type cast to prevent overflow of a 32-bit integer vector size in FFT data generation routine (reported by IBM). • Fixed variable types to handle array sizes that overflow 32-bit integers in RandomAccess (reported by IBM and SGI). • Changed time-bound code to be used by default in Global RandomAccess and allowed for it to be switched off with a compile time flag if necessary. • Code cleanup to allow compilation without warnings of RandomAccess test. • Changed communication code in PTRANS to avoid large message sizes that caused problems in some MPI implementations. • Updated documentation in README.txt and README.html files. SUMMARY OF HPCC AWARDS CLASS 1: Best Performance • Best in G-HPL, EP-STREAM-Triad per system, G-RandomAccess, G-FFT • There will be 4 winners (one in each category) CLASS 2: Most Productivity • One or more winners • Judged by a panel at SC12 BOF • Stresses elegance and performance • Implementations in various (existing and new) languages are encouraged • Submissions may include up to two kernels not present in HPCC • Submission consists of: code, its description, performance numbers, and a presentation at the BOF DIGITAL SIGNAL PROCESSING RADAR CROSS SECTION LOW HIGH HIGH SPATIAL LOCALITY TEMPORAL LOCALITY APPLICATIONS COMPUTATIONAL FLUID DYNAMICS TRAVELING SALES PERSON STREAM (EP & SP) PTRANS (G) HPL Linpack (G) Matrix Mult (EP & SP) RandomAccess (G, EP, & SP) FFT (G, EP, & SP) SN-DGEMM RandomRing Bandwidth RandomRing Latency PP-HPL PP-PTRANS PP-RandomAccess PP-FFTE SN-STREAM Triad 1 0.8 0.6 0.4 0.2 Sun Cray Dalco Cray XD1 AMD Opteron 64 procs – 2.2 GHz 1 thread/MPI process (64) RapidArray Interconnect System 11-22-2004 Dalco Opteron/QsNet Linux Cluster AMD Opteron 64 procs – 2.2 GHz 1 thread/MPI process (64) QsNetII 11-04-2004 Sun Fire V20z Cluster AMD Opteron 64 procs – 2.2 GHz 1 thread/MPI process (64) Gigabit Ethernet, Cisco 6509 switch 03-06-2005 LOCALITY SPACE OF MEMORY ACCESS IN APPLICATIONS HPCC RESULTS' PAGE KIVIAT CHART WITH RESULTS FOR THREE DIFFERENT CLUSTERS

PROJECT GOALS FEATURE HIGHLIGHTS OF HPCC …icl.eecs.utk.edu/graphics/posters/files/SC12-HPCC.pdf · PROJECT GOALS • Provide performance bounds in locality space using real world

  • Upload
    buitruc

  • View
    220

  • Download
    0

Embed Size (px)

Citation preview

www.hpcchallenge.org

FIND OUT MORE AT

SPONSORED BY

PROJECT GOALS• Provide performance bounds in locality space using real world

computational kernels

• Allow scaling of input data size and time to run according to the system capability

• Verify the results using standard error analysis

• Allow vendors and users to provide optimized code for superior performance

• Make the benchmark information continuously available to the public in order to disseminate performance tuning knowledge and record technological progress over time

• Ensure reproducibility of the results by detailed reporting of all aspects of benchmark runs

FEATURE HIGHLIGHTS OF HPCC 1.4.2 RELEASED OCTOBER 2012• Increased sizes of scratch vectors for local FFT tests to account for runs on systems with

large main memory (reported by IBM, SGI and Intel).• Reduced vector size for local FFT tests due to larger scratch space needed.• Added a type cast to prevent overflow of a 32-bit integer vector size in FFT data

generation routine (reported by IBM).• Fixed variable types to handle array sizes that overflow 32-bit integers in RandomAccess

(reported by IBM and SGI).• Changed time-bound code to be used by default in Global RandomAccess and allowed

for it to be switched off with a compile time flag if necessary.• Code cleanup to allow compilation without warnings of RandomAccess test.• Changed communication code in PTRANS to avoid large message sizes that caused

problems in some MPI implementations.• Updated documentation in README.txt and README.html files.

SUMMARY OF HPCC AWARDSCLASS 1: Best Performance• Best in G-HPL, EP-STREAM-Triad per system, G-RandomAccess, G-FFT• There will be 4 winners (one in each category)

CLASS 2: Most Productivity• One or more winners• Judged by a panel at SC12 BOF• Stresses elegance and performance• Implementations in various (existing and new) languages are encouraged• Submissions may include up to two kernels not present in HPCC• Submission consists of: code, its description, performance numbers, and a

presentation at the BOF

DIGITAL SIGNALPROCESSING

RADAR CROSSSECTION

LOW HIGH

HIGH

SPA

TIA

L L

OC

ALIT

Y

TEMPORAL LOCALITY

APPLICATIONS

COMPUTATIONALFLUID DYNAMICS

TRAVELINGSALES PERSON

STREAM (EP & SP) PTRANS (G)

HPL Linpack (G)Matrix Mult (EP & SP)

RandomAccess (G, EP, & SP) FFT (G, EP, & SP) SN-DGEMM

RandomRing Bandwidth

RandomRing Latency

PP-HPL

PP-PTRANS

PP-RandomAccess

PP-FFTE

SN-STREAM Triad

1

0.8

0.6

0.4

0.2

SunCray

DalcoCray XD1 AMD Opteron 64 procs – 2.2 GHz1 thread/MPI process (64) RapidArray Interconnect System11-22-2004

Dalco Opteron/QsNet Linux Cluster AMD Opteron64 procs – 2.2 GHz1 thread/MPI process (64)QsNetII11-04-2004

Sun Fire V20z Cluster AMD Opteron64 procs – 2.2 GHz1 thread/MPI process (64) Gigabit Ethernet, Cisco 6509 switch03-06-2005

LOCALITY SPACE OF MEMORY ACCESS IN APPLICATIONS

HPCC RESULTS' PAGE

KIVIAT CHART WITH RESULTS FOR THREE DIFFERENT CLUSTERS

www.hpcchallenge.org

FIND OUT MORE AT

SPONSORED BY

2005 2006 2007 2008 2009 2010 2011

TB

/s

EP-STREAM-Triad

0

200

400

600

800

1000

IBMBlue Gene/LLIVERMORE

Cray XT5 Quad-core

OAK RIDGEJaguar

Cray XT5 Hex-coreOAK RIDGEJaguar

3rd

2nd

1st

GU

PS

2005 2006 2007 2008 2009 2010 2011

G-RandomAccess

0

30

60

90

120

150

1st2nd

3rd

IBMBlue Gene/LLIVERMORE

IBMBlue Gene/P

ARGONNEIntrepid

FujitsuSPARC64 VIIIfx

RIKEN AICSK computer

HPL

This is the widely used implementation of the Linpack TPP benchmark. It measures the sustained floating point rate of execution for solving a linear system of equations.

STREAM

A simple benchmark test that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for four vector kernel codes.

RandomAccess

Measures the rate of integer updates to random locations in large global memory array.

PTRANS

Implements parallel matrix transpose that exercises a large volume communication pattern whereby pairs of processes communicate with each other simultaneously.

FFT

Calculates a Discrete Fourier Transform (DFT) of very large one-dimensional complex data vector.

b_eff

Effective bandwidth benchmark is a set of MPI tests that measure the latency and bandwidth of a number of simultaneous communication patterns.

DGEMM

Measures the floating point rate of execution of double precision real matrix-matrix multiplication.

HPCC BENCHMARKS

HPCC AWARDS CLASS 1: PERFORMANCE

2005 2006 2007 2008 2009 2010 2011

Tflop/s

G-FFT

2nd

0

5

10

15

20

25

30

35

Cray XT5 Hex-coreOAK RIDGEJaguar

IBMBlue Gene/P

ARGONNEIntrepid

Cray XT3SANDIA

Red Storm

IBMBlue Gene/LLIVERMORE

NEC SX-9JAMSTEC

EarthSimulator

1st

3rd

Tflop/s

2005 2006 2007 2008 2009 2010 2011

G-HPL

1st

2nd

3rd

0

500

1000

1500

2000

2500

IBMBlue Gene/LLIVERMORE

Cray XT5 Quad-core

OAK RIDGEJaguar

Cray XT5 Hex-coreOAK RIDGEJaguar

IBMBlue Gene/P

LIVERMOREDawn

FujitsuSPARC64 VIIIfx

RIKEN AICSK computer

FujitsuSPARC64 VIIIfx

RIKEN AICSK computer

FujitsuSPARC64 VIIIfx

RIKEN AICSK computer