Yellowstone: Capability and capacity for the atmospheric and related sciences
David Hart, [email protected]
NCAR/CISL User Services
SC13 — Denver, CO — Nov. 19-20, 2013
Outline
• Yellowstone environment
• Metrics for management and planning
  – Users and fields of science
  – Availability, utilization, and efficiency
  – Job mix: Capacity v. Capability
  – Wait times and user satisfaction
  – Disk and archive growth
Yellowstone environment
• Yellowstone: HPC resource, 1.50 PFLOPS peak; 4,536 nodes, each with 2 x 8-core Intel Sandy Bridge processors and 32 GB memory
• GLADE: central disk resource; 11 PB, 90 GB/s bandwidth
• Geyser: 16 nodes, each with 4 x 10-core Westmere processors and 1 TB memory
• Caldera: 16 nodes, each with 2 x 8-core Sandy Bridge (SNB) processors, 64 GB memory, and 2 NVIDIA GPUs
• NCAR HPSS archive: 100 PB capacity, growing ~15 PB/yr
• High-bandwidth, low-latency HPC and I/O networks: Mellanox FDR InfiniBand fat tree and 10 Gb Ethernet; 1 Gb/10 Gb Ethernet externally (40 Gb+ in the future)
• External connections: partner sites, XSEDE sites, data transfer services, science gateways (RDA, ESG), and remote visualization
Yellowstone users
[Figure: monthly counts of Yellowstone open users, Yellowstone active users, HPSS R/W users, HPSS storage users, and Geyser/Caldera active users; the most recent data labels read 2,866, 1,066, 440, 337, and 162 users.]
Time to start, duration of use
[Figure: histograms of user counts by days from account set-up to first job, and by days between first job and last job to date.]
Fields of science
[Figure: monthly share of use by field of science (0-100%): Climate, Large-Scale Dynamics; Fluid Dynamics and Turbulence; Geospace Sciences; Ocean Sciences; Weather Prediction; Earth Sciences; Atmospheric Chemistry; Mesoscale Meteorology; and all others.]
ASD v. Other use
[Figure: monthly core-hours used (millions), split between Accelerated Scientific Discovery (ASD) projects and all others.]
Availability, utilization & efficiency
Averages: Availability 97.0% ± 0.9%; Utilization 81.1% ± 1.6%; User 40.9% ± 1.1%; FP efficiency 1.65% ± 0.1%
[Figure: daily % availability, % utilization, % user, and % FP efficiency (plotted against % utilization), Dec. 15, 2012 through Nov. 10, 2013.]
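The deck does not spell out how these percentages are defined, so the following is a minimal sketch assuming the standard definitions; the function names and sample inputs are illustrative, not NCAR's accounting tooling.

# Minimal sketch, assuming the standard definitions behind the averages
# above; names and inputs are illustrative, not NCAR's accounting code.

def availability(node_up_hours, node_scheduled_hours):
    # Fraction of scheduled node-hours the system was actually up.
    return node_up_hours / node_scheduled_hours

def utilization(core_hours_used, core_hours_available):
    # Fraction of available core-hours consumed by user jobs.
    return core_hours_used / core_hours_available

def fp_efficiency(sustained_flops, peak_flops):
    # Sustained floating-point rate as a fraction of theoretical peak.
    return sustained_flops / peak_flops

# Example with made-up numbers for one day on 4,536 nodes x 16 cores:
cores = 4536 * 16                                 # 72,576 cores
print(f"{utilization(1.41e6, cores * 24):.1%}")   # ~81% of ~1.74M core-hours/day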
Daily utilization pattern
Analysis cluster availability & utilization
[Figure: daily availability and utilization (0-100%) for Geyser and Caldera, Dec. 20, 2012 through Nov. 5, 2013.]
Yellowstone job mix
[Figure: core-hours used and number of jobs (log scale) by job size in nodes, rounded to the nearest 100, from 0 to 4,400 nodes.]
David L. Hart. 2011. Measuring TeraGrid: workload characterization for a high-performance computing federation. Int. J. High Perform. Comput. Appl. 25, 4 (November 2011), 451-465. DOI=10.1177/1094342010394382
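A job-mix histogram like the one above can be assembled by rounding each job's node count to the nearest 100 and tallying jobs and core-hours per bin. The sketch below assumes a simple (nodes, core_hours) record per job; that field layout is hypothetical.

from collections import defaultdict

def job_mix(jobs):
    # jobs: iterable of (nodes, core_hours) per job -- assumed layout.
    bins = defaultdict(lambda: [0, 0.0])    # size bin -> [job count, core-hours]
    for nodes, core_hours in jobs:
        size = 100 * round(nodes / 100)     # nearest 100 nodes, as on the x-axis
        bins[size][0] += 1
        bins[size][1] += core_hours
    return dict(sorted(bins.items()))

# Hypothetical jobs: one single-node job, one mid-size job, one near-full-machine job.
for size, (n, ch) in job_mix([(1, 16.0), (64, 4096.0), (4360, 2.1e6)]).items():
    print(f"{size:>5} nodes: {n} job(s), {ch:,.0f} core-hours")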
Capability versus capacity
[Figure: share of projects, jobs, and core-hours (0-100%) by the maximum node count among each project's jobs, from 1 to 8,192 nodes; projects topping out at small node counts represent capacity use, those running the largest jobs represent capability use.]
David Hart. 2011. Deep and wide metrics for HPC resource capability and project usage. In State of the Practice Reports (SC '11). ACM, New York, NY, USA, Article 1, 7 pages. DOI=10.1145/2063348.2063350.
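The cited paper develops these "deep and wide" metrics in full; as a rough sketch of the curves above, each project can be classified by the largest job it has run, then the shares of projects, jobs, and core-hours accumulated by that maximum size. The per-job record layout is again an assumption for illustration.

def capability_profile(project_jobs):
    # project_jobs: {project: [(nodes, core_hours), ...]} -- assumed layout.
    rows = []
    for jobs in project_jobs.values():
        max_nodes = max(n for n, _ in jobs)
        rows.append((max_nodes, len(jobs), sum(ch for _, ch in jobs)))
    rows.sort()                                   # order by each project's max job size
    total_j = sum(r[1] for r in rows)
    total_ch = sum(r[2] for r in rows)
    cum_j = cum_ch = 0
    for i, (max_nodes, n_jobs, ch) in enumerate(rows, 1):
        cum_j += n_jobs
        cum_ch += ch
        yield max_nodes, i / len(rows), cum_j / total_j, cum_ch / total_ch

# Hypothetical mix: two capacity-style projects and one capability project.
sample = {"a": [(1, 10.0)], "b": [(8, 50.0), (512, 9e4)], "c": [(4096, 2e6)]}
for size, p, j, ch in capability_profile(sample):
    print(f"max <= {size:>4} nodes: {p:.0%} of projects, {j:.0%} of jobs, {ch:.1%} of core-hours")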
Analysis cluster activity
[Figure: monthly usage and job counts on the analysis clusters over the first year.]
Job wait times
[Figure: average wait time (hours) by job size in nodes, rounded to the nearest 100; and job counts (log scale) by wait time, 0 to 560 hours.]
User satisfaction metrics

Job attribute               Yellowstone avg.   XSEDE avg.
Unweighted
  Run time (hrs)                  0.6               2.0
  Wait time (hrs)                 0.6               4.0
  Response time (hrs)             1.2               6.8
  Slowdown                       22.7             320.3
Weighted
  Wtd run time (hrs)              6.7              19.6
  Wtd wait time (hrs)             8.0              16.8
  Wtd response time (hrs)        14.7              36.4
  Wtd slowdown                    3.1               2.6
Ernemann, C., Hamscher, V., and Yahyapour, R. (2004). Benefits of Global Grid Computing for Job Scheduling. In Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing (GRID'04). Washington, DC: IEEE Computer Society, pp. 374-379.
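For reference: response time is wait time plus run time, and slowdown is response time divided by run time (Ernemann et al. 2004). The "Wtd" rows presumably weight each job by the resources it consumed; the sketch below uses core-hour weights, which is an assumption about the slide's exact weighting. It shows how short jobs with short runs inflate the unweighted slowdown while the weighted figure stays modest, matching the 22.7-versus-3.1 pattern in the table.

def satisfaction_metrics(jobs):
    # jobs: iterable of (wait_hrs, run_hrs, cores) -- assumed layout.
    jobs = list(jobs)
    slowdown = lambda w, r: (w + r) / r           # response time / run time
    total_wt = sum(r * c for _, r, c in jobs)     # core-hours as weights (assumption)
    avg = lambda f: sum(f(w, r) for w, r, _ in jobs) / len(jobs)
    wavg = lambda f: sum(f(w, r) * r * c for w, r, c in jobs) / total_wt
    return {
        "wait": avg(lambda w, r: w),      "wtd wait": wavg(lambda w, r: w),
        "slowdown": avg(slowdown),        "wtd slowdown": wavg(slowdown),
    }

# A tiny short-run job dominates the unweighted slowdown; the big job
# dominates the core-hour-weighted figures.
print(satisfaction_metrics([(0.1, 0.05, 16), (2.0, 12.0, 1024)]))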
GLADE disk growth
[Figure: storage used in /glade/p/work, /glade/project, and /glade/scratch over the first year of operation.]
GLADE weekly growth
[Figure: weekly increase or decrease in GLADE holdings, roughly -50 to +350 TB per week, over the first year.]
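The weekly bars above are just first differences of cumulative GLADE usage snapshots; negative bars are weeks in which deletions outpaced new data. A minimal sketch with hypothetical values:

# Hypothetical weekly GLADE usage snapshots, in TB.
snapshots = [4800.0, 4950.0, 4930.0, 5210.0]
weekly_change = [b - a for a, b in zip(snapshots, snapshots[1:])]
print(weekly_change)   # [150.0, -20.0, 280.0] -- negatives are net deletions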
HPSS daily activity
[Figure: terabytes written daily and terabytes read daily over a one-year period.]
HPSS weekly growth
[Figure: weekly HPSS holdings, PB (unique) and PB (2-copy), April 2011 through October 2013.]
Multi-resource tracking
[Figure: cumulative HPC core-hours (log scale) plotted alongside GLADE GB and HPSS GB, relating HPC use to disk and tape holdings.]
Questions?

Additional information
• David L. Hart, Pam Gillman, and Erich Thanhardt. 2013. NCAR storage accounting and analysis possibilities. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment (XSEDE '13). ACM, New York, NY, USA, Article 55, 7 pages. DOI=10.1145/2484762.2484802
• David Hart. 2011. Deep and wide metrics for HPC resource capability and project usage. In State of the Practice Reports (SC '11). ACM, New York, NY, USA, Article 1, 7 pages. DOI=10.1145/2063348.2063350
• David L. Hart. 2011. Measuring TeraGrid: workload characterization for a high-performance computing federation. Int. J. High Perform. Comput. Appl. 25, 4 (November 2011), 451-465. DOI=10.1177/1094342010394382
• David L. Hart. 2012. Longitudinal user and usage patterns in the XSEDE user community. In Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment (XSEDE '12). ACM, New York, NY, USA, Article 53, 7 pages. DOI=10.1145/2335755.2335851