SLURM BurstBuffer Integration

David Paul
Computational Systems Group
Lawrence Berkeley National Lab
September 26, 2016


Overview:

- Cori System / NERSC-8 / Cray XC40 / "Native" SLURM WLM
- Burst Buffer as Cray's Datawarp product
- Brief introduction to Datawarp concepts
- Datawarp usage examples and the SLURM interface to Datawarp
- Datawarp status, problem identification and error recovery
  • Note: Burst Buffer BoF @ SC16 (bof110s1), Tue Nov 15, 12:15pm-1:15pm, Room 255-D


Cori / NERSC-8 / XC40

- Phase 1 - Initial install through 19-Sept-2016
  • 144 DW servers (288 SSDs, two DW servers/blade)
  • 1,628 Haswell nodes
  • 27 PB Lustre parallel file system - $SCRATCH
  • Global GPFS - $HOMEs, $PROJECTs, S/W, modules, etc.
  • Burst Buffer of 900 TB @ 900 GB/sec, 12.5 M IOPS (measured)
- Phase 2 - installation underway
  • +144 DW servers (288 servers, 576 SSDs)
  • ~2,000 Haswell nodes
  • ~9,300 KNL nodes
  • Total Burst Buffer of 1.8 PB @ ~1.6 TB/sec, 12.5 M IOPS (estimated)


Datawarp Component Interaction


Datawarp Terms

- DWS - DataWarp Service - software for managing and configuring the SSD I/O installation
  • Pool - subset of DW-servers with a common allocation granularity (ex. 200 GB), ex. wlm_pool
  • Session - typically created by a DW-enabled WLM job (Token = JobID or Name)
  • Instance - an object representing a user's request for disk space (ex. 600 GB)
  • Fragment - a piece of an instance, as it exists on a DW-server (ex. 3 @ 200 GB)
  • Configuration - an object representing how the space is to be used (scratch, striped, private, etc.)
  • Namespace - represents the metadata (called tree) and data (called data), i.e. a File System (DWFS)
  • Registration - an object for linking together a session and a configuration
  • Activation - an object representing where a configuration is to be used (i.e. mounted)
  • Realm - group of DWFS mount points that cooperate, present what appear to be different File Systems
- DWFS - DataWarp File System - a Cray File System that supports staging of files and striping data across multiple DW-servers
- DVS - Data Virtualization Service - Cray's I/O forwarding software (enables GPFS & DWFS I/O on HSN)
- XFS - File System used to persist data flowing through DWFS
- DW-Server - DVS server with SSDs, DWS, DWFS, XFS, LVM and access to a PFS
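The relationship between a request, the pool granularity, and the number of granularity-sized pieces can be sketched with a little arithmetic. This is an illustration only, not DataWarp code: the function names are hypothetical, and it assumes a request is rounded up to a whole number of granules, which matches the 600 GB / 3 @ 200 GB example above. How DWS actually distributes those pieces across DW-servers is not covered here.

```python
def granules_needed(capacity_gb: int, granularity_gb: int) -> int:
    """Number of granularity-sized pieces covering a request (assumed round-up)."""
    if capacity_gb <= 0 or granularity_gb <= 0:
        raise ValueError("capacity and granularity must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (capacity_gb + granularity_gb - 1) // granularity_gb


def allocated_gb(capacity_gb: int, granularity_gb: int) -> int:
    """Space actually reserved once the request is rounded up to whole granules."""
    return granules_needed(capacity_gb, granularity_gb) * granularity_gb


# The example from the terms list: 600 GB at 200 GB granularity.
print(granules_needed(600, 200))
# A request that is not a multiple of the granularity is rounded up.
print(allocated_gb(700, 200))
```

Under this assumption, a request that is not a multiple of the granularity reserves more space than was asked for, which is worth keeping in mind when sizing requests against a quota.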


Datawarp Process Overview


SLURM configuration for Datawarp (very simple)

- slurm.conf: BurstBufferType=burst_buffer/cray
- burst_buffer.conf:
  • DefaultPool: name of the pool used by default for resource allocations
    - wlm_pool
  • AltPoolName: allows for different storage configurations (ex. granularity size)
  • DenyUsers: list of user names and/or IDs prevented from using burst buffers
  • Flags=EnablePersistent: allows users to create/destroy persistent burst buffers
  • Flags=TeardownFailure: remove DW allocation on job failure
- QoS/TRES - control user access, user quotas, usage and report them


DWS API Client Helper (one example)

dw_wlm_cli -f

  • pools: show pool information
  • paths: generate environment variables to be injected into running batch job
  • job_process: validate correctness of #DW directives
  • setup: create session, instance, configurations
  • data_in: perform stage-in activities
  • pre_run: prepare compute nodes for Datawarp
  • post_run: revoke compute node Datawarp access
  • data_out: perform stage-out activities
  • teardown: clean up all job-affiliated DW state


DWS' dwcli vs. SLURM (one session)

# dwcli -j ls session
"created": 1473889069,
"creator": "CLI",
"expiration": 0,
"expired": false,
"id": 9711,
"links": {"client_nodes": []},
"owner": 95448,
"state": {"actualized": true, "fuse_blown": false, "goal": "create", "mixed": false, "transitioning": false},
"token": "tractorD"

# scontrol show burst | grep dpaul
Name=tractorD CreateTime=2016-09-14T14:37:49 Pool=wlm_pool Size=7200G State=allocated UserID=dpaul(95448)
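Because dwcli can emit JSON, session records are easy to filter in a script. The following is a minimal sketch, assuming the output is a top-level object with a "sessions" list (the shape that appears in the slurmctld.log excerpt later in this deck); the sample text is rebuilt from the fields shown on this slide, and the helper names `find_sessions` and `fuse_blown` are hypothetical.

```python
import json

# Sample shaped like the session record on this slide, wrapped in a
# "sessions" list (an assumption about the full output format).
SAMPLE = """
{"sessions": [
  {"created": 1473889069, "creator": "CLI", "expiration": 0, "expired": false,
   "id": 9711, "links": {"client_nodes": []}, "owner": 95448,
   "state": {"actualized": true, "fuse_blown": false, "goal": "create",
             "mixed": false, "transitioning": false},
   "token": "tractorD"}
]}
"""


def find_sessions(json_text, owner=None, token=None):
    """Return session records matching the given owner UID and/or token."""
    sessions = json.loads(json_text).get("sessions", [])
    return [
        s for s in sessions
        if (owner is None or s.get("owner") == owner)
        and (token is None or s.get("token") == token)
    ]


def fuse_blown(session):
    """True when the record reports that its retry fuse has blown."""
    return bool(session.get("state", {}).get("fuse_blown"))


for s in find_sessions(SAMPLE, owner=95448):
    print(s["id"], s["token"], "fuse blown" if fuse_blown(s) else "ok")
```

In practice the JSON text would come from running the dwcli command rather than from an embedded string.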


SLURM - (one command to bind them all)

# scontrol show burst

Name=cray DefaultPool=wlm_pool Granularity=200G TotalSpace=765600G UsedSpace=50400G

AltPoolName[0]=tr_cache Granularity=16M TotalSpace=61047200M UsedSpace=6842000M

Flags=EnablePersistent,TeardownFailure

StageInTimeout=86400 StageOutTimeout=86400 ValidateTimeout=5 OtherTimeout=300

GetSysState=/opt/cray/dw_wlm/default/bin/dw_wlm_cli

Allocated Buffers:

Name=udabb CreateTime=2016-08-28T13:33:26 Pool=wlm_pool Size=10400G State=allocated UserID=dgh(93131)

Name=rfmip_modat CreateTime=2016-08-30T21:18:23 Pool=wlm_pool Size=12400G State=allocated UserID=dfeld(96837)

Name=dpaul_tr CreateTime=2016-08-22T12:38:59 Pool=tr_cache Size=800G State=allocated UserID=dpaul(95448)

JobID=0_0(2793398) CreateTime=2016-08-31T00:28:50 Pool=(null) Size=0 State=allocated UserID=dfeld(96837)

JobID=2971140 CreateTime=2016-09-09T14:10:26 Pool=wlm_pool Size=1200G State=teardown UserID=kim(97002)

Per User Buffer Use:

UserID=dgh(93131) Used=10400G

UserID=dfeld(96837) Used=12400G

UserID=dpaul(95448) Used=800G

UserID=kim(91002) Used=1200G
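The "Allocated Buffers" lines are plain Key=Value pairs, so the per-user totals can be recomputed from them. This is a rough sketch rather than a Slurm tool: the helper names are hypothetical, the sample lines are copied from the output above, and only sizes printed with a "G" suffix (or a bare 0) are handled.

```python
SAMPLE_LINES = [
    "Name=udabb CreateTime=2016-08-28T13:33:26 Pool=wlm_pool Size=10400G State=allocated UserID=dgh(93131)",
    "Name=rfmip_modat CreateTime=2016-08-30T21:18:23 Pool=wlm_pool Size=12400G State=allocated UserID=dfeld(96837)",
    "Name=dpaul_tr CreateTime=2016-08-22T12:38:59 Pool=tr_cache Size=800G State=allocated UserID=dpaul(95448)",
    "JobID=0_0(2793398) CreateTime=2016-08-31T00:28:50 Pool=(null) Size=0 State=allocated UserID=dfeld(96837)",
    "JobID=2971140 CreateTime=2016-09-09T14:10:26 Pool=wlm_pool Size=1200G State=teardown UserID=kim(97002)",
]


def parse_kv_line(line):
    """Split 'Key=Value Key=Value ...' into a dict; values stay strings."""
    record = {}
    for field in line.split():
        key, sep, value = field.partition("=")
        if sep:
            record[key] = value
    return record


def size_g(text):
    """Size in the 'G' units printed above; other formats are rejected."""
    if text.endswith("G"):
        return int(text[:-1])
    if text == "0":
        return 0
    raise ValueError(f"unhandled size format: {text!r}")


def usage_by_user(lines):
    """Sum buffer sizes per user name (the part of UserID before the UID)."""
    totals = {}
    for line in lines:
        rec = parse_kv_line(line)
        user = rec["UserID"].split("(")[0]
        totals[user] = totals.get(user, 0) + size_g(rec["Size"])
    return totals


print(usage_by_user(SAMPLE_LINES))
```

Keying on the user name rather than the numeric UID sidesteps the differing UIDs printed for the same user in the two sections of the output.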


DWS dwstat (administrator focused)

# dwstat most

pool     units quantity  free      gran
tr_cache bytes 5.82TiB   5.82TiB   16MiB
wlm_pool bytes 809.96TiB 627.34TiB 200GiB

sess state token    creator owner created             expiration nodes
9708 CA--- 2993022  SLURM   90891 2016-09-14T14:27:48 never      8
9710 CA--- tractorD CLI     95448 2016-09-14T14:31:43 never      0

inst state sess bytes    nodes created             expiration intact label    public confs
1943 CA--- 9708 27.73TiB 142   2016-09-14T14:27:48 never      true   I9708-0  false  1
1945 CA--- 9710 27.73TiB 142   2016-09-14T14:31:43 never      true   tractorD true   1


Using Datawarp without SLURM

$ dwcli create session --expiration 4000000000 --creator $(id -un) --token example-session --owner $(id -u) --hosts example-node
created session id 10
$ dwcli create instance --expiration 4000000000 --public --session 10 --pool example-poolname --capacity 1099511627776 --label example-instance --optimization bandwidth
created instance id 8
$ dwcli create configuration --type scratch --access-type stripe --root-permissions 0755 --instance 8 --group 513
created configuration id 7
$ dwcli create activation --mount /some/pfs/mount/directory --configuration 7 --session 10
created activation id 7


SLURM job script directives - #DW

#!/bin/bash
#SBATCH -n 32 -t 2
#DW jobdw type=scratch access_mode=striped capacity=1TiB
#DW stage_in type=directory source=/lustre/my_in_dir destination=$DW_JOB_STRIPED
#DW stage_out type=directory destination=/lustre/my_out_dir source=$DW_JOB_STRIPED
export JOBDIR=$DW_JOB_STRIPED
cd $DW_JOB_STRIPED
srun -n 32 a.out


User Library example - libdatawarp

// module load datawarp (to get access to the user library for building)
#include <datawarp.h>

// Get info on the DataWarp configuration:
int r = dw_get_stripe_configuration(fd, &stripe_size, &stripe_width, &stripe_index);

// Use the dw_stage_file_in function to move a file from PFS to DataWarp
r = dw_stage_file_in(dw_file, pfs_file);

// Use the dw_stage_file_out function to move a file from DataWarp to PFS
r = dw_stage_file_out(dw_file, pfs_file, DW_STAGE_IMMEDIATE);

// Use the dw_query_file_stage function to check stage in/out completion
r = dw_query_file_stage(dw_file, &complete, &pending, &deferred, &failed);


Create a Persistent Reservation/Allocation (PR)

#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -t 00:01:00

(Create a Persistent Reservation/Allocation (PR))
#BB create_persistent name=tractorD capacity=7TB access=striped type=scratch
exit
________________________________________________________________

(Specify PR for a subsequent job - #SBATCH omitted)
#DW persistentdw name=tractorD

(Copy in data for the job)
#DW stage_in source=/global/cscratch1/sd/dpaul/decam.tar destination=$DW_PERSISTENT_STRIPED_tractorD/job1/runit.sh type=file
#DW stage_in source=/global/cscratch1/sd/dpaul/src_dir destination=$DW_PERSISTENT_STRIPED_tractorD/job1/ type=directory
(continued)


PR Continued

(Run the job)
cd $DW_PERSISTENT_STRIPED_tractorD/job1/
srun runit.sh <src_dir> output_dir

(Save results at job completion)
#DW stage_out source=$DW_PERSISTENT_STRIPED_tractorD/job1/output_dir destination=/global/cscratch1/sd/dpaul/job1/ type=directory


Transparent Cache features

- Burst Buffer will be used as file system cache for all I/O to/from the PFS:

#DW jobdw pfs=/global/cscratch1/sd/dpaul/stage_out_all/ capacity=800GB type=cache access_mode=striped pool=wlm_pool


Problem Identification

- Output from "squeue -l"
- SLURM log:
  • slurmctld.log
- Datawarp logs:
  • Sent centrally to the SMW with LLM, consolidated by daemon name
  • /var/opt/cray/log/p#-<bootsession>/dws/
    - dwsd.yyyymmdd - scheduling daemon (typically on sdb node)
    - dwmd.yyyymmdd - DW-servers manager daemon
    - dwrest.yyyymmdd - dw gateway node(s)


slurmctld.log (under the covers)

2016-09-14T14:24:33.143826-07:00 c4-0c1s1n1 [2016-09-14T14:24:29.766] bb_p_job_validate: burst_buffer:#BB create_persistent name=tractorD capacity=7.0TB access=striped type=scratch
2016-09-14T14:24:33.143828-07:00 c4-0c1s1n1 [2016-09-14T14:24:29.769] Create Name:tractorD Pool:wlm_pool Size:214748364800 Access:striped Type:scratch State:pending
2016-09-14T14:24:43.157596-07:00 c4-0c1s1n1 [2016-09-14T14:24:37.740] dw_wlm_cli --function create_persistent -c CLI -t tractorD -u 95448 -C wlm_pool:214748364800 -a striped -T scratch
2016-09-14T14:24:43.157600-07:00 c4-0c1s1n1 [2016-09-14T14:24:37.740] create_persistent of tractorD ran for usec=7465189
2016-09-14T14:24:43.157614-07:00 c4-0c1s1n1 [2016-09-14T14:24:37.851] {"sessions":[{"created":1471622703,"creator":"CLI","expiration":0,"expired":false,"id":8001,"links":{"client_nodes":[]},"owner":94645,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false,"transitioning":false},"token":"tractorSmall"},{"created":1472690713,"creator":"CLI","expiration":0,"expired":false,"id":8692,"links":{"client_nodes":[]},"owner":93131,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false,"transitioning":false},{"client_nodes":[]},"owner":91349,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false},"creator":"CLI","expiration":0,"expired":false,"id":9653,"links":{"client_nodes":[]},"owner":95448,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false,"transitioning":false},"token":"dpaul1"},{"created":1473888270,"creator":"CLI","expiration":0,"expired":false,"id":9707,"links":{"client_nodes":[]},"owner":95448,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false,"transitioning":false},"token":"tractorD"}]}


squeue -l

JOBID    PARTITION  NAME    USER  STATE    TIME  TIME_LIMI  NODES  (REASON)
2772005  regular    Mdwarf  haus  PENDING  0:00  8:00:00    1407   (burst_buffer/cray: setup: dwpost - failed client status code 409, messages: Entity exists at destination, Session record with token 2772005 already exists Error creating session:
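When many jobs are pending, it can help to pull the useful fields out of the REASON text automatically. The sketch below is an assumption-laden illustration: the regular expressions are fitted only to the single message shown above, the function name `parse_bb_reason` is hypothetical, and other burst buffer failure messages may be worded differently.

```python
import re

# Reason text copied from the squeue output above.
REASON = (
    "burst_buffer/cray: setup: dwpost - failed client status code 409, "
    "messages: Entity exists at destination, Session record with token "
    "2772005 already exists"
)


def parse_bb_reason(reason):
    """Extract the failing stage, status code, and conflicting token, if present."""
    stage = re.search(r"burst_buffer/cray: (\w+):", reason)
    status = re.search(r"status code (\d+)", reason)
    token = re.search(r"token (\S+) already exists", reason)
    return {
        "stage": stage.group(1) if stage else None,
        "status": int(status.group(1)) if status else None,
        "token": token.group(1) if token else None,
    }


print(parse_bb_reason(REASON))
```

The extracted token can be compared with the token column of the dwstat session listing to locate the leftover session.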


Error/Failure Recovery

- Datawarp has matured to mostly automatic recovery
  • enable the SLURM TeardownFailure flag
  • setup errors result in JobHeldAdmin
- Datawarp will usually recover after a DW-server reboot
- Most failures are related to a "stuck in teardown" state:
  • "D" - Destroy
  • "T" - Transitioning
  • "F" - Fuse Blown (retries exceeded)
- Primary commands used
  • dwcli rm session --id=####
  • dwcli update registration --id=#### --no-wait (don't wait for teardown)
  • dwcli update registration --id=#### --replace-fuse (retry teardown functions)
- As a last resort - reboot the suspect DW-server


# dwstat most

sess state token     creator owner created             expiration nodes
9097 D---- atlasxaod CLI     91421 2016-09-08T00:01:16 never      0

inst state sess bytes    nodes created             expiration intact label  public confs
1817 D---M 9097 55.86TiB 143   2016-09-08T00:01:16 never      false  atlasx true   1

conf state inst type    activs
1823 D---M 1817 scratch 0

reg  state sess conf wait
9067 D--TM 9097 1823 true
_______________________
9067 D-F-M 9097 1823 true
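The five-character state column can be decoded mechanically. The position layout used below (goal, actualized, fuse, transitioning, mixed) is inferred from the examples in this deck and from the field names in the session JSON, not taken from Cray documentation, so treat it as an assumption; `decode_state` is a hypothetical helper.

```python
def decode_state(state):
    """Decode a dwstat state string such as 'D--TM' into named flags.

    Assumed layout: [goal][actualized][fuse blown][transitioning][mixed],
    with '-' meaning the flag is not set.
    """
    if len(state) != 5:
        raise ValueError(f"expected 5 characters, got {state!r}")
    goal = {"C": "create", "D": "destroy"}.get(state[0], "unknown")
    return {
        "goal": goal,
        "actualized": state[1] == "A",
        "fuse_blown": state[2] == "F",
        "transitioning": state[3] == "T",
        "mixed": state[4] == "M",
    }


# The two registration states shown above: first still transitioning,
# then with the fuse blown after retries were exceeded.
for text in ("D--TM", "D-F-M", "CA---"):
    print(text, decode_state(text))
```

Read this way, the registration first shows a teardown still in progress (T) and then one whose retries were exhausted (F), which is the case the --replace-fuse option on the recovery slide addresses.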


Vendor Responsiveness

- SchedMD has been extremely responsive to NERSC's needs
  • Functionality enhancements
  • Bug fixes and patches
  • NRE contract completed
- Cray Datawarp developers
  • NRE contract for functional enhancements (ex. transparent cache)
  • Phased delivery (currently @ ~Phase 2.5)
  • Bi-weekly conference calls
  • Troubleshooting/diagnosis
  • Fixes & patches


Come visit us!

San Francisco / Bay Area, California
  • Temperate climate
  • World-class city
  • Silicon Valley
  • Easy access to natural wonders

Points of Interest
  • U.C. Berkeley
  • Napa/Sonoma vineyards
  • Muir Woods
  • Yosemite Valley
  • Berkeley National Lab - Wang Hall