Yellowstone: Capability and capacity for the atmospheric and related sciences
David Hart, [email protected]
NCAR/CISL User Services
SC13 — Denver, CO — Nov. 19-20, 2013
Outline
• Yellowstone environment
• Metrics for management and planning
  – Users and fields of science
  – Availability, utilization, and efficiency
  – Job mix: Capacity v. Capability
  – Wait times and user satisfaction
  – Disk and archive growth
Yellowstone environment
• Yellowstone: HPC resource, 1.50 PFLOPS peak; 4,536 nodes, each with 2 x 8-core Intel Sandy Bridge processors and 32 GB memory
• GLADE: central disk resource; 11 PB, 90 GB/s bandwidth
• Geyser: 16 nodes, each with 4 x 10-core Westmere processors and 1 TB memory
• Caldera: 16 nodes, each with 2 x 8-core Sandy Bridge (SNB) processors, 64 GB memory, and 2 NVIDIA GPUs
• NCAR HPSS archive: 100 PB capacity, growing ~15 PB/yr
• High-bandwidth, low-latency HPC and I/O networks: Mellanox FDR InfiniBand fat tree and 10 Gb Ethernet; 1 Gb/10 Gb Ethernet externally (40 Gb+ in the future)
• External connections: partner sites, XSEDE sites, data transfer services, science gateways (RDA, ESG), and remote visualization
Yellowstone users
[Figure: monthly counts of Yellowstone open users, Yellowstone active users, HPSS R/W users, HPSS storage users, and Geyser/Caldera active users; the most recent data labels read 2,866, 1,066, 440, 337, and 162 users.]
Time to start, duration of use
[Figure: histograms of user counts by days from account set-up to first job, and by days between first job and last job to date.]
Fields of science
[Figure: monthly share of use by field of science (0-100%): Climate, Large-Scale Dynamics; Fluid Dynamics and Turbulence; Geospace Sciences; Ocean Sciences; Weather Prediction; Earth Sciences; Atmospheric Chemistry; Mesoscale Meteorology; and all others.]
ASD v. Other use
[Figure: monthly core-hours used (millions), split between Accelerated Scientific Discovery (ASD) projects and all others.]
Availability, utilization & efficiency
Averages: Availability 97.0% ± 0.9%; Utilization 81.1% ± 1.6%; User 40.9% ± 1.1%; FP efficiency 1.65% ± 0.1%
[Figure: daily % availability, % utilization, % user, and % FP efficiency (plotted against % utilization), Dec. 15, 2012 through Nov. 10, 2013.]
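The deck does not spell out how these percentages are defined, so the following is a minimal sketch assuming the standard definitions; the function names and sample inputs are illustrative, not NCAR's accounting tooling.

# Minimal sketch, assuming the standard definitions behind the averages
# above; names and inputs are illustrative, not NCAR's accounting code.

def availability(node_up_hours, node_scheduled_hours):
    # Fraction of scheduled node-hours the system was actually up.
    return node_up_hours / node_scheduled_hours

def utilization(core_hours_used, core_hours_available):
    # Fraction of available core-hours consumed by user jobs.
    return core_hours_used / core_hours_available

def fp_efficiency(sustained_flops, peak_flops):
    # Sustained floating-point rate as a fraction of theoretical peak.
    return sustained_flops / peak_flops

# Example with made-up numbers for one day on 4,536 nodes x 16 cores:
cores = 4536 * 16                                 # 72,576 cores
print(f"{utilization(1.41e6, cores * 24):.1%}")   # ~81% of ~1.74M core-hours/day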
Daily utilization pattern
Analysis cluster availability & utilization
[Figure: daily availability and utilization (0-100%) for Geyser and Caldera, Dec. 20, 2012 through Nov. 5, 2013.]
Yellowstone job mix
[Figure: core-hours used and number of jobs (log scale) by job size in nodes, rounded to the nearest 100, from 0 to 4,400 nodes.]
David L. Hart. 2011. Measuring TeraGrid: workload characterization for a high-performance computing federation. Int. J. High Perform. Comput. Appl. 25, 4 (November 2011), 451-465. DOI=10.1177/1094342010394382
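A job-mix histogram like the one above can be assembled by rounding each job's node count to the nearest 100 and tallying jobs and core-hours per bin. The sketch below assumes a simple (nodes, core_hours) record per job; that field layout is hypothetical.

from collections import defaultdict

def job_mix(jobs):
    # jobs: iterable of (nodes, core_hours) per job -- assumed layout.
    bins = defaultdict(lambda: [0, 0.0])    # size bin -> [job count, core-hours]
    for nodes, core_hours in jobs:
        size = 100 * round(nodes / 100)     # nearest 100 nodes, as on the x-axis
        bins[size][0] += 1
        bins[size][1] += core_hours
    return dict(sorted(bins.items()))

# Hypothetical jobs: one single-node job, one mid-size job, one near-full-machine job.
for size, (n, ch) in job_mix([(1, 16.0), (64, 4096.0), (4360, 2.1e6)]).items():
    print(f"{size:>5} nodes: {n} job(s), {ch:,.0f} core-hours")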
Capability versus capacity
[Figure: share of projects, jobs, and core-hours (0-100%) by the maximum node count among each project's jobs, from 1 to 8,192 nodes; projects topping out at small node counts represent capacity use, those running the largest jobs represent capability use.]
David Hart. 2011. Deep and wide metrics for HPC resource capability and project usage. In State of the Practice Reports (SC '11). ACM, New York, NY, USA, Article 1, 7 pages. DOI=10.1145/2063348.2063350.
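The cited paper develops these "deep and wide" metrics in full; as a rough sketch of the curves above, each project can be classified by the largest job it has run, then the shares of projects, jobs, and core-hours accumulated by that maximum size. The per-job record layout is again an assumption for illustration.

def capability_profile(project_jobs):
    # project_jobs: {project: [(nodes, core_hours), ...]} -- assumed layout.
    rows = []
    for jobs in project_jobs.values():
        max_nodes = max(n for n, _ in jobs)
        rows.append((max_nodes, len(jobs), sum(ch for _, ch in jobs)))
    rows.sort()                                   # order by each project's max job size
    total_j = sum(r[1] for r in rows)
    total_ch = sum(r[2] for r in rows)
    cum_j = cum_ch = 0
    for i, (max_nodes, n_jobs, ch) in enumerate(rows, 1):
        cum_j += n_jobs
        cum_ch += ch
        yield max_nodes, i / len(rows), cum_j / total_j, cum_ch / total_ch

# Hypothetical mix: two capacity-style projects and one capability project.
sample = {"a": [(1, 10.0)], "b": [(8, 50.0), (512, 9e4)], "c": [(4096, 2e6)]}
for size, p, j, ch in capability_profile(sample):
    print(f"max <= {size:>4} nodes: {p:.0%} of projects, {j:.0%} of jobs, {ch:.1%} of core-hours")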
Analysis cluster activity
[Figure: monthly usage and job counts on the analysis clusters over the first year.]
Job wait times
[Figure: average wait time (hours) by job size in nodes, rounded to the nearest 100; and job counts (log scale) by wait time, 0 to 560 hours.]
User satisfaction metrics

Job attribute               Yellowstone avg.   XSEDE avg.
Unweighted
  Run time (hrs)                  0.6               2.0
  Wait time (hrs)                 0.6               4.0
  Response time (hrs)             1.2               6.8
  Slowdown                       22.7             320.3
Weighted
  Wtd run time (hrs)              6.7              19.6
  Wtd wait time (hrs)             8.0              16.8
  Wtd response time (hrs)        14.7              36.4
  Wtd slowdown                    3.1               2.6
Ernemann, C., Hamscher, V., and Yahyapour, R. (2004). Benefits of Global Grid Computing for Job Scheduling. In Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing (GRID'04). Washington, DC: IEEE Computer Society, pp. 374-379.
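For reference: response time is wait time plus run time, and slowdown is response time divided by run time (Ernemann et al. 2004). The "Wtd" rows presumably weight each job by the resources it consumed; the sketch below uses core-hour weights, which is an assumption about the slide's exact weighting. It shows how short jobs with short runs inflate the unweighted slowdown while the weighted figure stays modest, matching the 22.7-versus-3.1 pattern in the table.

def satisfaction_metrics(jobs):
    # jobs: iterable of (wait_hrs, run_hrs, cores) -- assumed layout.
    jobs = list(jobs)
    slowdown = lambda w, r: (w + r) / r           # response time / run time
    total_wt = sum(r * c for _, r, c in jobs)     # core-hours as weights (assumption)
    avg = lambda f: sum(f(w, r) for w, r, _ in jobs) / len(jobs)
    wavg = lambda f: sum(f(w, r) * r * c for w, r, c in jobs) / total_wt
    return {
        "wait": avg(lambda w, r: w),      "wtd wait": wavg(lambda w, r: w),
        "slowdown": avg(slowdown),        "wtd slowdown": wavg(slowdown),
    }

# A tiny short-run job dominates the unweighted slowdown; the big job
# dominates the core-hour-weighted figures.
print(satisfaction_metrics([(0.1, 0.05, 16), (2.0, 12.0, 1024)]))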
GLADE disk growth
[Figure: storage used in /glade/p/work, /glade/project, and /glade/scratch over the first year of operation.]
GLADE weekly growth
[Figure: weekly increase or decrease in GLADE holdings, roughly -50 to +350 TB per week, over the first year.]
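The weekly bars above are just first differences of cumulative GLADE usage snapshots; negative bars are weeks in which deletions outpaced new data. A minimal sketch with hypothetical values:

# Hypothetical weekly GLADE usage snapshots, in TB.
snapshots = [4800.0, 4950.0, 4930.0, 5210.0]
weekly_change = [b - a for a, b in zip(snapshots, snapshots[1:])]
print(weekly_change)   # [150.0, -20.0, 280.0] -- negatives are net deletions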
HPSS daily activity
[Figure: terabytes written daily and terabytes read daily over a one-year period.]
HPSS weekly growth
[Figure: weekly HPSS holdings, PB (unique) and PB (2-copy), April 2011 through October 2013.]
Multi-resource tracking
[Figure: cumulative HPC core-hours (log scale) plotted alongside GLADE GB and HPSS GB, relating HPC use to disk and tape holdings.]
Questions?

Additional information
• David L. Hart, Pam Gillman, and Erich Thanhardt. 2013. NCAR storage accounting and analysis possibilities. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment (XSEDE '13). ACM, New York, NY, USA, Article 55, 7 pages. DOI=10.1145/2484762.2484802
• David Hart. 2011. Deep and wide metrics for HPC resource capability and project usage. In State of the Practice Reports (SC '11). ACM, New York, NY, USA, Article 1, 7 pages. DOI=10.1145/2063348.2063350
• David L. Hart. 2011. Measuring TeraGrid: workload characterization for a high-performance computing federation. Int. J. High Perform. Comput. Appl. 25, 4 (November 2011), 451-465. DOI=10.1177/1094342010394382
• David L. Hart. 2012. Longitudinal user and usage patterns in the XSEDE user community. In Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment (XSEDE '12). ACM, New York, NY, USA, Article 53, 7 pages. DOI=10.1145/2335755.2335851