27
SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request Experience using Slurm on ARIS HPC System Nikos Nikoloutsakos GRNET Greek Research and Technology Network, Greece hpc.gnret.gr 27 September 2016 Slurm User Group Meeting 2016 27 September 2016 1/27

Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Experience using Slurm on ARIS HPC System

Nikos Nikoloutsakos

GRNETGreek Research and Technology Network, Greece

hpc.gnret.gr

27 September 2016

Slurm User Group Meeting 2016 27 September 2016 1/27

Page 2: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Overview

1 IntroductionWho we areSystem Overview

2 SlurmConfigurationAdministration - Monitoring

3 Issues

4 Feature request

Slurm User Group Meeting 2016 27 September 2016 2/27

Page 3: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Who we areGreek Research Technology NetworkGRNET enables researchers from Greece to obtain access to the powerfulnational High Performance Computing system ARIS.

Advanced Research Information SystemARIS Infrastructure provides state-of-the-art supercomputing capabilitiesfor large-scale scientific applications.

GRNET provides services to:Greece - Greek Academic Community

Greek UniversitiesTechnological institutionsResearch centers

EuropePRACE (Tier 1 system)DECIother EU Projects (Vi-seem, Eudat, EGI,...)

Slurm User Group Meeting 2016 27 September 2016 3/27

Page 4: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Adiministrative and Application support

Support Team HPC provides:Management of InfrastructureUser Support

Comprehensive end-user supportUser support in operational problemsDocumentationEducational and Training Events

Application Support - Transfer and optimizing applicationPeer-Review support and coordination

Slurm User Group Meeting 2016 27 September 2016 4/27

Page 5: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Open Access

Peer-Review AccessThe criteria for the evaluation:

Scientific ExcellenceImpact of the proposed researchThe need for HPC resourcesMaturity and experience of the principal investigator andhis/her teamFeasibility of the project based on a technical evaluationand the availability of resources

Slurm User Group Meeting 2016 27 September 2016 5/27

Page 6: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Project TypesPreparatory-Development ProjectsExecution of scalability tests, performance tests, resolve issues.Code porting, development, optimization.

Review: Technical onlyCall: Always OpenAccess: 2-4 months

Production ProjectsProjects that have the technical expertise to take advantage ofavailable resources and are selected by the procedure of peerreview

Review: Technical-ScientificPeriodic Call - 2 per yearAccess: 1 year

Slurm User Group Meeting 2016 27 September 2016 6/27

Page 7: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

since June 2015

First pilot operational phase in June 2015150 projects400 Users24 Organizations300 software modules120.000 jobs submitted,46M core hours (1 year)25 scientific publications (up to now)https://hpc.grnet.gr/results-publications/

Slurm User Group Meeting 2016 27 September 2016 7/27

Page 8: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

ARIS

Compute Power: 180 TFlops (HPL) #465 Top500 - iterationJune 2015

Slurm User Group Meeting 2016 27 September 2016 8/27

Page 9: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

ARIS - Compute Nodes I

426 compute nodes: IBM NextScale n360 M48520 cores: 2x (Intel E5 [email protected] - 10 core) pernode27TB total memory: 64GB memory per node(8 RDIMMS, 1866 MHz)Half-width, 1U systems grouped in 6U enclosures(12 nodes per enclosure)

Slurm User Group Meeting 2016 27 September 2016 9/27

Page 10: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

ARIS - Compute Nodes II

6 Racks, 6 enclosures per rack.DisklessIBM 1PB GPFS, Tape Library ΙΒΜ TS3500 6PBMax nominal power consumption: 162 KW(154 KW on HPL). 183 KW with air-cooling.

Slurm User Group Meeting 2016 27 September 2016 10/27

Page 11: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

ARIS - Network

Mellanox SX6536 648-Port Infiniband Director SwitchFDR 56 Gbits / secFat tree non-blocking mode450 QSFP+Optical cables5 Km fabric cables

Slurm User Group Meeting 2016 27 September 2016 11/27

Page 12: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

ARIS - Expansion44 gpu nodes: “2 x NVIDIA Tesla k40m” acceleratednodes.

Dell Power Edge R7302 x Intel Xeon [email protected] GB RAM

18 phi nodes: “2 x INTEL Xeon Phi 7120p” acceleratednodes.

Dell Power Edge R7302 x Intel Xeon [email protected] GB RAM

44 fat nodesDell PowerEdge R8204x Intel Xeon [email protected] GB RAM

IBM 1PB GPFS,Tape Library ΙΒΜ TS3500 6PB

Slurm User Group Meeting 2016 27 September 2016 12/27

Page 13: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

ARIS - Managment

14 support nodes, NextScale x3650 M4 2 x E5-2640v22x Managment Nodes, 2x Login Nodes, 10x service nodesMonitoring software xCAT, Nagios, Ganglia, BMS(Business Management System) Dell OpenManage, MRTGScheduler SLURM 14.11.8XDMoD, UMGMT (User Managment Tool) in house

Slurm User Group Meeting 2016 27 September 2016 13/27

Page 14: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Partition Queues

ONE cluster ”ARIS”

Partition Description Nodescompute Thin nodes 426gpu GPU nodes 44phi PHI nodes 18fat FAT nodes 24taskp Serial queue 20

Default timelimit 2 days

Slurm User Group Meeting 2016 27 September 2016 14/27

Page 15: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Configurtion parameters I

Consumable ResourcesSelectTypeParameters= CR_CORE_MEMORYShared mode - unless user specifies --exclusive

Resource LimitsAccountingStorageEnforce = associations,limits,safe

Generic Resource (GRES) SchedulingGresTypes = gpu,micmic offload mode only

Slurm User Group Meeting 2016 27 September 2016 15/27

Page 16: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Configurtion parameters II

MpiDefault = pmi2Supports MPI implementation being used on system:Intelmpi,OpenMPI, mvapich2

The larger the job, the greater its job size priority.PriorityFavorSmall=NO

Accounting GatherAcctGatherEnergyType=acct_gather_energy/ipmiAcctGatherInfinibandType=acct_gather_infiniband/ofed

JobAcctGatherType = jobacct_gather/linux

Slurm User Group Meeting 2016 27 September 2016 16/27

Page 17: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Priority Flags I

Multifactor PriorityPriorityType= priority/multifactorPriorityWeightAge = 100PriorityWeightFairShare = 1000PriorityWeightJobSize = 1000PriorityWeightPartition = 0PriorityDecayHalfLife = 00:00:00PriorityUsageResetPeriod = WEEKLYPriorityMaxAge = 30-00:00:00

PriorityWeightQOS=0

Slurm User Group Meeting 2016 27 September 2016 17/27

Page 18: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Priority Flags II

Fair Tree FairsharePriorityFlags = FAIR_TREEPriorityCalcPeriod = 02:00:00

Backfill SchedulingSchedulerType= sched/backfill

Slurm User Group Meeting 2016 27 September 2016 18/27

Page 19: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

$mybudget

$myreport

Slurm User Group Meeting 2016 27 September 2016 19/27

Page 20: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

XdMod

Slurm User Group Meeting 2016 27 September 2016 20/27

Page 21: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

UMGMT I

Users Management Tool

Tool to manage project proposals and user access on thesystem.Associate project proposals to slurm accountinginformationKeep Track start end dates per project,Extensions: core hours-access periodProject status , send alert emails to usersStatistics consumed core-hours(%) per project

in development: Ruby on Rails

Slurm User Group Meeting 2016 27 September 2016 21/27

Page 22: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

UMGMT II

Slurm User Group Meeting 2016 27 September 2016 22/27

Page 23: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

SLURM - MRTG

Slurm User Group Meeting 2016 27 September 2016 23/27

Page 24: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Slurm Script TemplateHelps users prepare batch job scripts for Slurm at ARIS.

Acknowledgment BYU Job Script Generatorhttps://github.com/BYUHPC/BYUJobScriptGenerator

Slurm User Group Meeting 2016 27 September 2016 24/27

Page 25: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Issues

Problem:Reservation (daily) had 20 nodes , 15 where active , 5where active by same user but for other job1 node (from 15) died, unable to reschedule.

Slurm User Group Meeting 2016 27 September 2016 25/27

Page 26: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Feature request

More verbose error messages:Users could figure why a job is rejected.More information about which limit violatedMPI Task 0: may need more memoryAbility to specify less processes on first node.Allocation per GRES(gpu,mic) not only cpu ch

What’s Nextupgrade to version 16

Slurm User Group Meeting 2016 27 September 2016 26/27

Page 27: Experience using Slurm on ARIS HPC System · SLUG 2016 N. Nikoloutsakos Introduction Who we are System Overview Slurm Configuration Administration - Monitoring Issues Feature request

SLUG 2016

N. Nikoloutsakos

IntroductionWho we areSystem Overview

SlurmConfigurationAdministration -Monitoring

Issues

Featurerequest

Thank you !

Slurm User Group Meeting 2016 27 September 2016 27/27