SLURM Deployment Experiences on Stampede

Karl W. Schulz, Director, Scientific Applications
XSEDE/NSF Campus Champions • June 14, 2013
Acknowledgements
• Thanks/kudos to:
  – Sponsor: National Science Foundation
    • NSF Grant #OCI-1134872 Stampede Award, “Enabling, Enhancing, and Extending Petascale Computing for Science and Engineering”
    • NSF Grant #OCI-0926574, “Topology-Aware MPI Collectives and Scheduling”
  – Many, many Dell, Intel, and Mellanox engineers
  – Dr. Kent Milfeld, Dr. Tommy Minyard, Dr. Bill Barth
  – All my colleagues at TACC who saw way too much of each other during the last year
Outline
• Quick high-level system summary
• Motivation for SLURM on Stampede
• SLURM experiences
  – local configuration, new things we’ve done
  – features we use
  – things we like / pain points
Stampede - High Level Overview

• Base Cluster (Dell/Intel/Mellanox):
  – Intel Sandy Bridge processors
  – Dell dual-socket nodes w/32GB RAM (2GB/core)
  – 6,400 nodes
  – 56 Gb/s Mellanox FDR InfiniBand interconnect
  – More than 100,000 cores, 2.2 PF peak performance
• Co-Processors:
  – Intel Xeon Phi “MIC” Many Integrated Core processors
  – Special release of “Knights Corner” (61 cores)
  – All MIC cards are on site at TACC
    • more than 6,000 installed
    • final installation ongoing for formal summer acceptance
  – 7+ PF peak performance
• Max Total Concurrency:
  – approaching 500,000 cores
  – 1.6M threads
• Entered production operations on January 7, 2013
Additional Integrated Subsystems
• Stampede includes 16 1TB Sandy Bridge shared memory nodes with dual GPUs
• 128 of the compute nodes are also equipped with NVIDIA Kepler K20 GPUs (and MICs for performance bake-offs)
• 16 login, data mover and management servers (batch, subnet manager, provisioning, etc)
• Software included for high throughput computing, remote visualization
• Storage subsystem driven by Dell storage nodes:
  – Aggregate bandwidth greater than 150 GB/s
  – More than 14 PB of capacity
  – Disk space partitioned into multiple Lustre filesystems ($HOME, $WORK, and $SCRATCH), as on previous TACC systems
Stampede Footprint
Machine Room Expansion: added 6.5 MW of additional power

             Ranger     Stampede
Footprint    3000 ft2   8000 ft2
Peak perf.   0.6 PF     ~10 PF
Power        3 MW       6.5 MW
Stampede: a “Quick” Deployment
(construction photo timeline: Feb 20, 2012; March 22, 2012; May 16, 2012; Sep 10, 2012 — photos include the pumping utilities and chilling towers)
SLURM Motivation
• We have primarily been an SGE shop for the last 6 years
  – were able to get some desired features (e.g. advanced reservations) through the Sun vendor relationship on Ranger
  – ran LSF/PBS prior to that
  – originally proposed to run SGE on Stampede as well, but undertook a SLURM evaluation:
    • worries about SGE’s future
    • issues with the lack of a formal API for scheduling plugins, and with large-job starvation
• We need flexible support and scalability for both capacity and capability usage on our big systems
  – very diverse job mix (so large-job starvation is a problem with backfill enabled)
  – reasonable number of jobs and users:
    • ~900K jobs in first 5 months
    • ~1800 unique users
    • job sizes from 16 cores to 90K cores
• We tend to need functionality which is not 100% standard
  – e.g. accounting integration into the central XSEDE mechanism
  – need flexibility for total control of certain functionality, and src access to hack/modify to suit
  – this may not be the same at your institution
Stampede Queue Definitions
Queue Name    Max Runtime  Max Nodes/Procs  SU Charge Rate  Purpose
normal        24 hrs       256 / 4K         1               normal production
normal-mic    24 hrs       256 / 4K         1               normal MIC production
large         24 hrs       1024 / 16K       1               large core counts (access by request)
request       24 hrs       --               1               special requests
largemem      24 hrs       4 / 128          2               large memory: 32 cores/node, Teslas, no MICs
development   4 hrs        16 / 256         1               development nodes
serial        12 hrs       1 / 16           1               serial/shared-memory
gpu           24 hrs       32 / 512         1               GPU nodes
gpudev        4 hrs        4 / 64           1               GPU development nodes
vis           8 hrs        32 / 512         1               GPU nodes + VNC service
SLURM Task Request

• A common source of confusion for many-core systems is how to control process launching at the socket/core level:
  – if all MPI, generally want 1 MPI task/core
  – if all threading, may want 1 task/socket or host
  – users need the ability to do any sensible combination in between
• We tried to keep things relatively simple
  – we support “-n” and “-N” (total_tasks and total hosts)
    • user can just supply -n (total tasks) if using 16 tasks/node
    • user supplies both -N (node count) and -n (total tasks) if fewer than 16 tasks per node
  – we endeavor to make reasonable affinity settings for hybrid cases through numactl (tacc_affinity)
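As a toy illustration of the geometry described above (the helper name is ours, not a TACC or SLURM utility), the tasks-per-node implied by a -n/-N pair is just integer division:

```shell
#!/bin/sh
# Illustrative helper (not a TACC/SLURM tool): given the -n (total tasks)
# and -N (node count) values a user supplies, report the resulting
# tasks-per-node that the launcher will place on each host.
tasks_per_node() {
  total_tasks=$1
  nodes=$2
  echo $(( total_tasks / nodes ))
}

tasks_per_node 32 2    # pure-MPI on Stampede: 16 tasks/node
tasks_per_node 32 4    # hybrid case: 8 MPI tasks/node, threads fill the rest
```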
Example MPI Batch Job
staff$ cat /share/doc/slurm/job.mpi
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run MPI applications on
# TACC's Stampede system.
#
# $Id: job.mpi 1580 2013-01-08 04:10:50Z karl $
#----------------------------------------------------

#SBATCH -J myMPI           # Job name
#SBATCH -o myjob.%j.out    # Name of stdout output file (%j expands to jobId)
#SBATCH -p development     # Queue name
#SBATCH -N 2               # Total number of nodes requested (16 cores/node)
#SBATCH -n 32              # Total number of mpi tasks requested
#SBATCH -t 01:30:00        # Run time (hh:mm:ss) - 1.5 hours

#SBATCH -A A-yourproject   # <-- Allocation name to charge job against

# Launch the MPI executable named "a.out"

ibrun tacc_affinity ./a.out

Submit via “sbatch job”
Example Interactive/MIC Usage
• Interactive programming example
  – request an interactive job (srun)
  – compile on the compute node using the Intel compiler toolchain
  – here, we are building a simple hello world…
• First, compile for SNB and run on the host
  – note the __MIC__ macro can be used to isolate MIC-only execution; in this case no extra output is generated on the host
• Next, build again and add “-mmic” to ask the compiler to cross-compile a binary for native MIC execution
  – note that when we try to run the resulting binary on the host, it throws an error
  – ssh to the MIC (mic0) and run the executable out of the $HOME directory
  – this time, we see extra output from within the guarded __MIC__ macro
login1$ srun -p devel --pty /bin/bash -l
c401-102$ cat hello.c
#include <stdio.h>
int main()
{
    printf("Hook 'em Horns!\n");

#ifdef __MIC__
    printf("  --> Ditto from MIC\n");
#endif
}

c401-102$ icc hello.c
c401-102$ ./a.out
Hook 'em Horns!

c401-102$ icc -mmic hello.c
c401-102$ ./a.out
bash: ./a.out: cannot execute binary file

c401-102$ ssh mic0 ./a.out
Hook 'em Horns!
  --> Ditto from MIC
Interactive Hello World
Local Workflow Configuration Customization

(flow diagram, general user job workflow: job submit passes through the TACC filter_options SPANK plugin (requires some src code hackery), then the SLURM multi-factor scheduler plugin (priority, fair share, backfill); job execution is wrapped by the TACC prolog, TACC ibrun launch, and TACC epilog, with the TACC job_completion plugins producing the custom accounting records we integrate into the global system)
Local SLURM Site Customizations
Prolog Customizations
• Examples of the kinds of things we do in our prolog script:
  – set the "performance" option in /sys/.../cpufreq/scaling_governor
  – determine queue and accelerators
  – enable exclusive MIC access for the user if necessary
  – vis nodes: remove any socket locks and gsetsid for gdm binary execution; set up X
  – cache current user state in /tmp for debugging (last job and user)
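The governor step can be sketched as a small shell function. The cpufreq paths follow the standard sysfs layout, but the function name and the overridable root are our own, since TACC's actual prolog is not shown:

```shell
#!/bin/sh
# Sketch of the prolog's governor step. The sysfs root is a parameter so
# the logic can be exercised outside a real compute node; on a node you
# would call it as:  set_scaling_governor performance /sys
set_scaling_governor() {
  governor=$1
  root=${2:-/sys}
  for f in "$root"/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    # Skip unmatched globs / read-only files rather than erroring out
    [ -w "$f" ] && echo "$governor" > "$f"
  done
  return 0
}
```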
Epilog Customizations
• Examples of the kinds of things we do in our epilog script:
  – kill all user processes (avoiding root and system daemon jobs)
  – reset the "ondemand" scaling governor
  – kill all user processes again (a potential oddity with the RHEL 6.3 kernel)
  – reboot the MIC card (tacc_sanitize_mic)
    • about 3 minutes now (required some SLURM timeout tweaks/code changes)
    • reducing this time with the latest Intel MIC stack
  – drain the node if we still have zombie processes: “scontrol update state=drain”
  – clean /tmp
  – sync and drop caches
  – restore default ACL for MIC
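The drain-on-zombies decision might look like the following sketch. Function and variable names are illustrative (TACC's epilog is not shown), and the command is echoed rather than executed so the logic can be demonstrated without a live slurmctld:

```shell
#!/bin/sh
# Illustrative epilog fragment: if user processes survive the cleanup
# passes, emit the scontrol command that would drain the node rather
# than returning it to service.
drain_if_zombies() {
  node=$1
  leftover=$2   # number of user processes still alive post-cleanup
  if [ "$leftover" -gt 0 ]; then
    echo "scontrol update nodename=$node state=drain reason=zombie_processes"
  fi
}

drain_if_zombies c401-102 0   # healthy node: prints nothing, stays in service
```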
Accounting Filter Customizations
• SLURM provides a nice way to customize the raw accounting logging
  – SLURM just calls a shell script of your own creation to format as desired; ours is very simple, and we then use our own tools to ingest into a central accounting authority
  – we don’t use the SLURM DB to enforce accounting; we do this at job submission time (more on that in a bit)
• Our script is trivial:
#!/bin/bash

OUTFILE=/var/log/slurm/tacc_jobs_completed

echo "$JOBID:$UID:$ACCOUNT:$BATCH:$START:$END:$SUBMIT:$PARTITION:$LIMIT:$JOBNAME:$JOBSTATE:$NODECNT:$PROCS" >> $OUTFILE

exit 0
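Records in that colon-delimited format are easy to post-process. For example, a small awk sketch (ours, not TACC's actual ingest tooling) that totals node-seconds per account, assuming START/END are recorded as epoch seconds:

```shell
#!/bin/sh
# Summarize node-seconds per account from the colon-delimited records
# above. Per the echo line, field 3 is ACCOUNT, fields 5/6 are START/END,
# and field 12 is NODECNT. Assumes START/END are epoch seconds.
summarize_node_seconds() {
  awk -F: '{ usage[$3] += $12 * ($6 - $5) }
           END { for (a in usage) print a, usage[a] }' "$1"
}
```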
Job Launch Customizations
• We don’t use SLURM’s PMI library linkage for job launching (i.e., we don’t use srun)
• It was not really a viable option in our heterogeneous MIC environment, and we want total control of the launch process:
  – multiple MPI stacks
  – MPI between MICs
  – affinity settings
  – application logging
  – environment tuning
• We call native MPI launch mechanisms directly
  – abstracted within our standard ibrun script
  – hosts snarfed from SLURM: scontrol show hostname $SLURM_NODELIST
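A stripped-down sketch of the host-snarfing idea (illustrative only, not TACC's actual ibrun): expand the compact hostlist, then repeat each host tasks-per-node times to feed a native MPI launcher's machinefile.

```shell
#!/bin/sh
# Illustrative machinefile builder. On a real system the expanded host
# list would come from:  scontrol show hostname "$SLURM_NODELIST"
# Here it is passed as arguments so the function is self-contained.
make_machinefile() {
  tpn=$1; shift             # tasks per node
  for host in "$@"; do
    i=0
    while [ "$i" -lt "$tpn" ]; do
      echo "$host"
      i=$((i + 1))
    done
  done
}

make_machinefile 2 c401-101 c401-102
```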
Local Developments for SLURM
showq for SLURM
• We have always liked the form of the old Maui showq utility
  – we created a version for LSF
  – ported it to SGE when we switched over for Ranger
  – have ported it to SLURM for use on Stampede
• Uses the C API for SLURM
  – just summarizes running, waiting, and blocked jobs
# showq --help

Usage: slurm [OPTIONS]

OPTIONS:
  --help            generate help message and exit
  --version         output version information and exit
  -l [ --long ]     enable more verbose (long) listing
  -u [ --user ]     display jobs for current user only
  -U <username>     display jobs for username only
ACTIVE JOBS--------------------
JOBID   JOBNAME     USERNAME  STATE    CORE   REMAINING  STARTTIME
================================================================================
15574   HOMME       viennej   Running  1024   21:34:48   Thu Jan  3 07:49:23
17956   bash        bureddy   Running   368   11:44:46   Thu Jan  3 15:59:21
17967   bash        cazes     Running    32    3:56:44   Thu Jan  3 16:11:19

  53 active jobs : 1470 of 6416 hosts ( 22.91 %)

WAITING JOBS------------------------
JOBID   JOBNAME     USERNAME  STATE    CORE   WCLIMIT    QUEUETIME
================================================================================
14595   pal-14      tg804247  Waiting   768   12:00:00   Sun Dec 23 22:45:45
16336   di3         wuzhe     Waiting  1024   12:00:00   Fri Dec 28 20:51:59
17798   wrf_lrg_10  cazes     Waiting 10240    1:30:00   Thu Jan  3 11:20:22

BLOCKED JOBS--
JOBID   JOBNAME     USERNAME  STATE    CORE   WCLIMIT    QUEUETIME
================================================================================
14596   pal-15      tg804247  Waiting   768   12:00:00   Sun Dec 23 22:45:45
14597   pal-16      tg804247  Waiting   768   12:00:00   Sun Dec 23 22:45:45
17526   xyl.1.0     bernardi  Waiting   480   12:00:00   Wed Jan  2 11:24:48
17527   xyl.1.0     bernardi  Waiting   480   12:00:00   Wed Jan  2 11:24:48
17538   xyl.0.0     bernardi  Waiting   480   12:00:00   Wed Jan  2 11:26:51
17539   xyl.0.0     bernardi  Waiting   480   12:00:00   Wed Jan  2 11:26:51
17611   just_hel    wuzhe     Waiting  1024   12:00:00   Wed Jan  2 23:24:17

Total Jobs: 105   Active Jobs: 53   Idle Jobs: 21   Blocked Jobs: 31
Job Submission Filter
• The existing job-submission filter and accounting enforcement mechanism was non-ideal for our usage (so we created a new one)
• Leverages the SPANK plugin infrastructure within SLURM
• Requires some src code mods to be able to pass user job submission requirements to the plugin to make decisions
• Gives us a simple policy framework for managing queue limits, user exceptions, accounting, etc
• You will see evidence of the filter when submitting a job….
Example Job Submission Output
staff$ sbatch job

-----------------------------------------------------------------
          Welcome to the Stampede Supercomputer
-----------------------------------------------------------------

--> Verifying valid submit host (staff)...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/00161/karl)...OK
--> Verifying availability of your work dir (/work/00161/karl)...OK
--> Verifying availability of your scratch dir (/scratch/00161/karl)...OK
--> Verifying access to desired queue (development)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (A-ccsc)...OK

We can also use the filter to give user-specific disable messages (i.e., if they are doing bad things to the file system).
Job Submission Policy File Example
[tacc_filter/max_jobs]

# default for all users
total_user_jobs = 50

# special exemptions

total_user_jobs/karl = 250

[tacc_filter/queue_size_limits/normal]

max_cores = 4096
max_hours = 24

karl_expire = '2013-06-30'
karl_max_cores = 32
karl_max_hours = 72
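A minimal lookup over a file in this format might look like the sketch below. The parser and function name are ours (the filter's real implementation is not shown); the key names match the example above: prefer a per-user exemption, otherwise fall back to the section default.

```shell
#!/bin/sh
# Illustrative resolver for the per-user job limit: prefer the
# "total_user_jobs/<user>" exemption if present, otherwise fall back
# to the section-wide "total_user_jobs" default.
max_jobs_for_user() {
  file=$1
  user=$2
  val=$(awk -F' *= *' -v key="total_user_jobs/$user" \
        '$1 == key { print $2; exit }' "$file")
  if [ -n "$val" ]; then
    echo "$val"
  else
    awk -F' *= *' '$1 == "total_user_jobs" { print $2; exit }' "$file"
  fi
}
```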
Job Submission Policy File Example
[tacc_filter]

check_home = true
check_work = true
check_queue_acls = true
check_queue_runlimits = true
check_accounting = true
restrict_submission_hosts = true
restrict_max_user_jobs = true

queue_acl_denial_message = 'One or more of your previous batch jobs have been identified as potentially causing system problems. Please contact TACC support if you have not already been contacted directly in order to re-enable queue access.'
General SLURM Comments
• Positives:
  – src is easy to build, relatively easy to work with
  – easy to deploy (but make sure the config is identical on every host)
  – no large-job starvation; the scheduler works pretty well (fairshare/backfill)
  – reasonably responsive managing 1-2K jobs for 6K hosts
• Wish list/problem points:
  – missing a load monitor (quite common in other systems)
  – missing a flexible resource-management policy that is divorced from the SLURM DB
  – some issues with the topology-based scheduler (still working on this)
  – could use finer granularity for managing users and queues
  – the interactive option is really great when it works, but it doesn’t always work and looks to be getting worse with more load on our system (so we are installing our own interactive utility again, idev)
  – have to interact with another DB to do multifactor scheduling (and you have to let it know about every single user to do equal fairshare)
Thanks for your time! Questions?