
Performance Monitoring on NERSC’s POWER 5 System
Richard Gerber
NERSC User Services
RAGerber@lbl.gov

ScicomP 14 Poughkeepsie, N.Y.

May 22, 2008


Prequel

•  Although this talk centers on identifying system problems, it also highlights the stability and excellent design of the p575 POWER 5 system.

•  Virtually all the problems we’ve had with Bassi can be attributed to software complexity.

•  Take-away point: Application testing and monitoring are necessary to ensure proper system function.


Outline

•  Why monitor performance?
•  What are we monitoring?
•  Procedure
•  Data and results
•  Discussion
•  Summary


Why monitor performance?

•  To provide stable, consistent high performance scientific computing resources.

•  To ensure that system performance and reliability never decrease over the machine’s lifetime.

•  To recognize when software, hardware, and configuration changes impact performance.


What are we monitoring?

•  Parallel application performance – looks at the system from a user perspective.

•  High-level “component” tests: IO, memory bandwidth, MPI latency and bandwidth.

•  Serial application performance on a single packed node.

•  The goal is to monitor, maintain, and improve the user experience.


Three parallel applications

•  GTC (Gyrokinetic Toroidal Code)
   ▪  Studies energy transfer via plasma microturbulence in fusion device plasmas. PIC code.

•  PARATEC (Parallel Total Energy Code)
   ▪  Ab-initio quantum total energy calculations via pseudo-potentials and a plane wave basis set. Self-consistent field conjugate gradient.

•  CAM (Community Atmospheric Model)
   ▪  Complicated multi-physics and chemistry.


Three standard parallel benchmarks

•  NPB 2.4 Class D: SP
•  NPB 2.4 Class D: MG
•  NPB 2.4 Class D: FT

•  These were chosen as models of memory-intensive kernels that benefit from good memory bandwidth.


Three “Component” Tests

•  MEMRATE ▪  Single-node memory bandwidth (TRIAD; see the sketch below).

•  MPITEST ▪  MPI latency ▪  MPI bandwidth

•  PIORAW ▪  Parallel IO performance
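For concreteness, below is a minimal sketch of the kind of TRIAD bandwidth measurement that MEMRATE reports. This is an illustrative Python/NumPy version, not NERSC’s MEMRATE code; the production test is a compiled, tuned kernel (as in STREAM) and will report higher rates than an interpreted sketch.

    import time
    import numpy as np

    # Array size chosen to be much larger than the L3 cache, so the loop is
    # limited by main-memory bandwidth rather than cache reuse.
    N = 20_000_000
    scalar = 3.0
    b = np.random.rand(N)
    c = np.random.rand(N)
    a = np.empty_like(b)

    best = 0.0
    for _ in range(5):                     # best of several trials, as STREAM does
        t0 = time.perf_counter()
        np.multiply(c, scalar, out=a)      # a = scalar * c
        np.add(a, b, out=a)                # a = b + scalar * c  (TRIAD)
        dt = time.perf_counter() - t0
        bytes_counted = 3 * 8 * N          # STREAM convention: read b, read c, write a
        best = max(best, bytes_counted / dt / 1e6)

    # The two NumPy passes actually move more data than a fused compiled loop,
    # so this understates the rate a tuned kernel would report.
    print(f"TRIAD bandwidth: {best:.0f} MB/s")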


Procedure

•  Goals
   ▪  Monitor the system in production mode
   ▪  Small impact on users
   ▪  Fast and flexible way to test SW and HW changes
   ▪  Automatically run the suite, gather data, and post results

•  Implementation
   ▪  Run the suite of codes approximately weekly
   ▪  Each code typically runs for a few minutes on 8 (of 111) nodes
   ▪  End-to-end workflow is handled by scripts that run the codes, parse the output, import results into a database, and display them on the web (see the sketch below)
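A minimal sketch of such a driver is below. The run commands, output patterns, and database layout are hypothetical placeholders (the actual NERSC scripts submit jobs through the batch system and feed the public web page noted below); it only illustrates the run / parse / import flow.

    #!/usr/bin/env python
    """Hypothetical monitoring driver: run each benchmark, parse its
    performance figure, and store one data point per run in a small database."""
    import datetime
    import re
    import sqlite3
    import subprocess

    # Placeholder commands and output patterns, not the real NERSC scripts.
    SUITE = {
        "GTC":     ("./run_gtc.sh",     r"MFlops/sec/task\s*=\s*([\d.]+)"),
        "PARATEC": ("./run_paratec.sh", r"MFlops/sec/task\s*=\s*([\d.]+)"),
        "MEMRATE": ("./run_memrate.sh", r"TRIAD\s+([\d.]+)\s*MB/s"),
    }

    db = sqlite3.connect("monitor.db")
    db.execute("CREATE TABLE IF NOT EXISTS results (date TEXT, code TEXT, rate REAL)")

    today = datetime.date.today().isoformat()
    for code, (cmd, pattern) in SUITE.items():
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
        match = re.search(pattern, out)
        if match:                          # store one data point per benchmark per run
            db.execute("INSERT INTO results VALUES (?, ?, ?)",
                       (today, code, float(match.group(1))))
    db.commit()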


Results Web Interface

•  The monitoring results are publicly available on the web
   ▪  http://www.nersc.gov/nusers/systems/bassi/monitor.php
•  This is useful for visually scanning for anomalies.
•  Data is quickly available to IBM management and technical staff.


Examples of Problems Revealed

•  PE upgrade problem
•  HPS firmware upgrade problem
•  AIX 5.3 upgrade problem
•  Password file indexing problem
•  System daemon problem
•  Compiler upgrade issues
•  Random hardware problems (e.g., L3 cache)


CAM 3.0 Results

[Chart: CAM 16x1 (no OpenMP), MFlops/sec/task vs. date, Sep-05 through Jun-08. Annotated points: Healthy, PE upgrade problem, Firmware upgrade problem, AIX 5.3 testing, Indexing, System Daemon, XLF 11.]


PARATEC Results

[Chart: PARATEC, 64 tasks, MFlops/sec/task vs. date, Sep-05 through Jun-08. Annotated points: Healthy, PE upgrade problem, Firmware upgrade problem, AIX 5.3 testing, Indexing, System Daemon, Bad Node.]


NPB MG Results

[Chart: NPB MG 2.4, 64 tasks, MFlops/sec/task vs. date, Sep-05 through Sep-08. Annotated points: Healthy, PE upgrade problem, Firmware upgrade problem, AIX 5.3 test, Bad Node, System Daemon, Indexing, XLF 11.]


MEMRATE Packed Node

[Chart: MEMRATE packed-node TRIAD, MB/sec vs. date, Sep-05 through Jun-08. Annotated points: Healthy, PE upgrade problem, Firmware upgrade problems, Bad Node, Mem Segments, Indexing, System Daemon, LL Upgrade/Config.]


GTC Results

[Chart: GTC, 64 tasks, MFlops/sec/task vs. date, Sep-05 through Jun-08. Annotated points: Healthy, PE upgrade, Firmware upgrade, AIX 5.3 testing, Bad Node, System daemon, Indexing, LL upgrade/config, XLF 11.]


Discussion

•  Most of the problems we uncovered involved software upgrades or configuration changes

•  None of these issues were known before monitoring revealed them

•  Only the bad L3 caches and the compiler changes were identified by “system” tests or had simple resolutions


Healthy Results Summary

Benchmark           Avg. MFlop/s   Std. Dev.   COV
CAM 16                   502           1.4     0.28%
PARATEC 64              4719          80       1.69%
GTC                      747           8.9     1.20%*
NPB MG D 64             1331          26       1.95%
MEMRATE (MB/sec)        7070          42       0.60%
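(COV is the coefficient of variation, the standard deviation divided by the mean; for CAM 16, for example, 1.4 / 502 ≈ 0.28%.)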


Discussion

•  When you run the suite and examine the results over time, you get a “feel” for when something is wrong

•  The measured variation on a healthy system lets you quickly evaluate the significance of an outlying result (see the sketch below)

•  Even minor variations in run time can indicate that a system is sick

•  The historical results provide quantitative evidence that a problem exists
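As an illustration of how the healthy-system statistics can be applied (a hypothetical check, not the actual monitoring code), a new result can be flagged when it falls more than a few standard deviations from the healthy mean:

    def flag_outlier(rate, healthy_mean, healthy_std, nsigma=3.0):
        """Flag a benchmark result that lies outside the band expected from
        the healthy-system history (mean +/- nsigma standard deviations)."""
        return abs(rate - healthy_mean) > nsigma * healthy_std

    # Example with the healthy CAM 16 statistics from the summary table:
    # 470 MFlops/sec/task is far below 502 - 3 * 1.4, so it is flagged.
    print(flag_outlier(470.0, healthy_mean=502.0, healthy_std=1.4))   # True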


Discussion

•  A healthy Bassi has remarkably consistent run time performance characteristics.

•  This is especially notable because the system is heavily loaded with a diverse workload.

•  We don’t have comparable long-term data for other systems at NERSC (yet!)

•  Aggregate measures such as the SSP are also useful, but they do not diagnose as many specific problems


Sustained System Performance (SSP)

The SSP is an aggregate metric derived from a (geometric) average of application benchmarks and standard parallel benchmarks.

[Chart: SSP over time, with annotated events including Indexing and System Daemon.]
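A sketch of the aggregation idea is below, using the healthy per-benchmark averages from the summary table purely for illustration; the full SSP definition includes additional factors (such as scaling by system size) that are omitted here.

    import math

    def geometric_mean(rates):
        """Geometric mean of per-benchmark rates, the core of an SSP-style
        aggregate metric (illustrative; not the full NERSC SSP formula)."""
        return math.exp(sum(math.log(r) for r in rates) / len(rates))

    # Healthy averages from the summary table, MFlops/sec/task.
    print(geometric_mean([502, 4719, 747, 1331]))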


Summary

•  Application testing and monitoring are necessary to ensure proper system function.

•  When configured properly, application performance on Bassi is remarkably consistent.

•  Most problems we’ve had with Bassi can be attributed to software complexity.
