Upload
vuongnga
View
224
Download
1
Embed Size (px)
Citation preview
David Paul!Computational Systems Group!Lawrence Berkeley National [email protected]
SLURM BurstBuffer Integration
-1-
September26,2016
Overview:
Ø CoriSystem/NERSC-8/CrayXC40/“Na=ve”SLURMWLM
Ø BurstBufferasCray’sDatawarpProduct
Ø BriefIntroduc=ontoDatawarpconcepts
Ø DatawarpusageexamplesandtheSLURMinterfacetoDatawarp
Ø DatawarpStatus,ProblemIden=fica=onandErrorRecovery² Note:BusrtBufferBoF@SC16bof110s1-TueNov15Time:12:15pm-1:15pmRoom:255-D
Cori/NERSC-8/XC40
Ø Phase1–IniLalinstallthrough19-Sept-2016• 144DWservers(288SSDs,twoDWservers/blade)• 1,628Haswellnodes• 27PBLustreParallelFilesystem-$SCRATCH• GlobalGPFS-$HOMEs,$PROJECTs,S/W,Modules,etc.• BurstBufferof900TB@900GB/sec,12.5MIOPS(measured)
Ø Phase2–installaLonunderway
• +144DWservers(288servers,576SSDs)• ~2,000Haswellnodes• ~9,300KNLnodes• TotalBurstBufferof1.8PB@~1.6TB/sec,12.5MIOPS(esLmated)
DatawarpComponentInteracLon
DatawarpTerms
Ø DWS-DataWarpService-sodwareformanagingandconfiguringtheSSDI/OinstallaLon
² Pool-subsetofDW-serverswithacommonallocaLongranularity(ex.200GB),ex.wlm_pool
² Session-typicallycreatedbyaDW-enabledWLMjob(Token=JobIDorName)
² Instance-anobjectrepresenLngauser’srequestfordiskspace(ex.600GB)² Fragment-apieceofaninstance,asitexistsonaDW-server(ex.3@200GB)
² Configura=on-anobjectrepresenLnghowthespaceistobeused(scratch,striped,private,etc.)² Namespace-representsthemetadata(calledtree)anddata(calleddata),i.e.aFileSystem(DWFS)
² Registra=on-anobjectforlinkingtogetherasessionandaconfiguraLon² Ac=va=on-anobjectrepresenLngwhereaconfiguraLonistobeused(i.e.mounted)
² Realm-groupofDWFSmountpointsthatcooperate,presentwhatappeartobedifferentFileSystems
Ø DWFS-DataWarpFileSystem-aCrayFileSystemthatsupportsstagingoffilesandstripingdataacrossmulLpleDW-servers.
Ø DVS-DataVirtualizaLonService-Cray’sI/Oforwardingsodware(enablesGPFS&DWFSI/OonHSN)Ø XFS-FileSystemusedtopersistdataflowingthroughDWFSØ DW-Server–DVSserverwithSSDs,DWS,DWFS,XFS,LVMandaccesstoaPFS
DatawarpProcessoverview
SLURMconfiguraLonforDatawarp(verysimple)
Ø slurm.conf:BurstBufferType=burst_buffer/cray
Ø burst_buffer.conf:• DefaultPool:nameofthepoolusedbydefaultforresourceallocaLons
• wlm_pool
• AltPoolName:allowsfordifferentstorageconfiguraLons(ex.Granularitysize)
• DenyUsers:listofusernamesand/orIDspreventedfromusingburstbuffers
• FlagsEnablePersistent:allowsuserstocreate/destroypersistentburstbuffers
• FlagsTeardownFailure:removeDWallocaLononjobfailure
Ø QoS/TRES–controluseraccess,userquotas,usageandreportthem
DWSAPIClientHelper(oneexample)
dw_wlm_cli–f
• pools:showpoolinformaLon• paths:generateenvironmentvariablestobeinjectedintorunningbatchjob• job_process:validatecorrectnessof#DWdirecLves• setup:createsession,instance,configuraLons• data_in:performstage-inacLviLes• pre_run:preparecomputenodesforDatawarp• post_run:revokecomputenodeDatawarpaccess• data_out:performstage-outacLviLes• teardown:cleanupalljob-affiliatedDWstate
DWS’dwclivs.SLURM(onesession)#dwcli–jlssession"created":1473889069,"creator":"CLI","expiraLon":0,"expired":false,"id":9711,"links":{"client_nodes":[]"owner":95448,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false,"transiLoning":false"token":"tractorD”#scontrolshowburst|grepdpaulName=tractorDCreateTime=2016-09-14T14:37:49Pool=wlm_poolSize=7200GState=allocatedUserID=dpaul(95448)
SLURM–(onecommandtobindthemall)
# scontrol show burst
Name=cray DefaultPool=wlm_pool Granularity=200G TotalSpace=765600G UsedSpace=50400G
AltPoolName[0]=tr_cache Granularity=16M TotalSpace=61047200M UsedSpace=6842000M
Flags=EnablePersistent,TeardownFailure
StageInTimeout=86400 StageOutTimeout=86400 ValidateTimeout=5 OtherTimeout=300
GetSysState=/opt/cray/dw_wlm/default/bin/dw_wlm_cli
Allocated Buffers:
Name=udabb CreateTime=2016-08-28T13:33:26 Pool=wlm_pool Size=10400G State=allocated UserID=dgh(93131)
Name=rfmip_modat CreateTime=2016-08-30T21:18:23 Pool=wlm_pool Size=12400G State=allocated UserID=dfeld(96837)
Name=dpaul_tr CreateTime=2016-08-22T12:38:59 Pool=tr_cache Size=800G State=allocated UserID=dpaul(95448)
JobID=0_0(2793398) CreateTime=2016-08-31T00:28:50 Pool=(null) Size=0 State=allocated UserID=dfeld(96837)
JobID=2971140 CreateTime=2016-09-09T14:10:26 Pool=wlm_pool Size=1200G State=teardown UserID=kim(97002)
Per User Buffer Use:
UserID=dgh(93131) Used=10400G
UserID=dfeld(96837) Used=12400G
UserID=dpaul(95448) Used=800G
UserID=kim(91002) Used=1200G
DWSdwstat(administratorfocused)
# dwstat most
==============================================
pool units quantity free gran
tr_cache bytes 5.82TiB 5.82TiB 16MiB
wlm_pool bytes 809.96TiB 627.34TiB 200GiB
sess state token creator owner created expiration nodes
9708 CA--- 2993022 SLURM 90891 2016-09-14T14:27:48 never 8
9710 CA--- tractorD CLI 95448 2016-09-14T14:31:43 never 0
inst state sess bytes nodes created expiration intact label public confs
1943 CA--- 9708 27.73TiB 142 2016-09-14T14:27:48 never true I9708-0 false 1
1945 CA--- 9710 27.73TiB 142 2016-09-14T14:31:43 never true tractorD true 1
UsingDatawarpwithoutSLURM
$dwclicreatesession--expiraLon4000000000--creator$(id-un)--tokenexample-session--owner$(id-u)--hostsexample-nodecreatedsessionid10$dwclicreateinstance--expiraLon4000000000--public--session10--poolexample-poolname--capacity1099511627776--labelexample-instance--opLmizaLonbandwidthcreatedinstanceid8$dwclicreateconfiguraLon--typescratch--access-typestripe--root-permissions0755--instance8--group513createdconfiguraLonid7$createacLvaLon--mount/some/pfs/mount/directory--configuraLon7--session10createdacLvaLonid7
SLURMjobscriptdirecLves-#DW
#!/bin/bash
#SBATCH-n32-t2
#DWjobdwtype=scratchaccess_mode=stripedcapacity=1TiB
#DWstage_intype=directorysource=/lustre/my_in_dirdesLnaLon=$DW_JOB_STRIPED
#DWstage_outtype=directorydesLnaLon=/lustre/my_out_dirsource=$DW_JOB_STRIPED
exportJOBDIR=$DW_JOB_STRIPED
cd$DW_JOB_STRIPED
srun–n32a.out
UserLibraryexample-libdatawarp
//moduleloaddatawarp(togetaccesstotheuserlibraryforbuilding)#include<datawarp.h>//GetInfoonDataWarpConfiguraLon:intr=dw_get_stripe_configuraLon(fd,&stripe_size,&stripe_width,&stripe_index);//Usedw_stage_file_infuncLontomoveafilefromPFStoDataWarpintr=dw_stage_file_in(dw_file,pfs_file);//Usedw_stage_file_outfuncLontomoveafilefromDataWarptoPFSintr=dw_stage_file_out(dw_file,pfs_file,DW_STAGE_IMMEDIATE);//Usedw_query_file_stagefuncLontocheckstagein/outcompleLonintr=dw_query_file_stage(dw_file,&complete,&pending,&deferred,&failed);
Create a Persistent Reservation/Allocation (PR) #!/bin/bash#SBATCH-pdebug#SBATCH-N1#SBATCH-t00:01:00
(CreateaPersistentReservaLon/AllocaLon(PR))#BBcreate_persistentname=tractorDcapacity=7TBaccess=stripedtype=scratchexit________________________________________________________________
(SpecifyPRforasubsequentjob-#sbatchomi~ed)#DWpersistentdwname=tractorD
(Copyindatainforthejob)#DWstage_insource=/global/cscratch1/sd/dpaul/decam.tardesLnaLon=$DW_PERSISTENT_STRIPED_tractorD/job1/runit.shtype=file#DWstage_insource=/global/cscratch1/sd/dpaul/src_dirdesLnaLon=$DW_PERSISTENT_STRIPED_tractorD/job1/type=directory(conLnued)
PR Continued (Runthejob)
cd$DW_PERSISTENT_STRIPED_tractorD/job1/srunrunit.sh<src_dir>output_dir
(SaveresultsatjobcompleLon)#DWstage_outsource=$DW_PERSISTENT_STRIPED_tractorD/job1/output_dirdesLnaLon=/global/cscratch1/sd/dpaul/job1/type=directory
Transparent Cache features
Ø BurstBufferwillbeusedasfilesystemcacheforallI/Oto/fromthePFS:
#DWjobdwpfs=/global/cscratch1/sd/dpaul/stage_out_all/capacity=800GBtype=cache
access_mode=stripedpool=wlm_pool
ProblemIdenLficaLon
Ø Outputfrom“squeue–l”
Ø SLURMlog:• slurmctld.log
Ø Datawarplogs:
• CentrallytoSMWwithLLMconsolidatedbydaemonname
• /var/opt/cray/log/p#-<bootsession>/dws/
• dwsd.yyyymmdd–schedulingdaemon(typicallyonsdbnode)
• dwmd.yyyymmdd–DW-serversmanagerdaemon
• dwrest.yyyymmdd–dwgatewaynode(s)
slurmctld.log(underthecovers)2016-09-14T14:24:33.143826-07:00c4-0c1s1n1[2016-09-14T14:24:29.766]bb_p_job_validate:burst_buffer:#BBcreate_persistentname=tractorDcapacity=7.0TBaccess=stripedtype=scratch2016-09-14T14:24:33.143828-07:00c4-0c1s1n1[2016-09-14T14:24:29.769]CreateName:tractorDPool:wlm_poolSize:214748364800Access:stripedType:scratchState:pending2016-09-14T14:24:43.157596-07:00c4-0c1s1n1[2016-09-14T14:24:37.740]dw_wlm_cli--funcLoncreate_persistent-cCLI-ttractorD–u95448-Cwlm_pool:214748364800-astriped-Tscratch2016-09-14T14:24:43.157600-07:00c4-0c1s1n1[2016-09-14T14:24:37.740]create_persistentoftractorDranforusec=74651892016-09-14T14:24:43.157614-07:00c4-0c1s1n1[2016-09-14T14:24:37.851]{"sessions":[{"created":1471622703,"creator":"CLI","expiraLon":0,"expired":false,"id":8001,"links":{"client_nodes":[]},"owner":94645,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false,"transiLoning":false},"token":"tractorSmall"},{"created":1472690713,"creator":"CLI","expiraLon":0,"expired":false,"id":8692,"links":{"client_nodes":[]},"owner":93131,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false,"transiLoning":false},{"client_nodes":[]},"owner":91349,"state":{"actualized":true,"fuse_blown":false,"goal”:"create","mixed”:false},"creator":"CLI","expiraLon":0,"expired":false,"id":9653,"links":{"client_nodes":[]},"owner":95448,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false,"transiLoning":false},"token":"dpaul1"},{"created":1473888270,"creator":"CLI","expiraLon":0,"expired":false,"id":9707,"links":{"client_nodes":[]},"owner":95448,"state":{"actualized":true,"fuse_blown":false,"goal":"create","mixed":false,"transiLoning":false},"token":"tractorD"}]}
squeue–l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES (REASON)
2772005 regular Mdwarf haus PENDING 0:00 8:00:00 1407 (burst_
buffer/cray: setup: dwpost - failed client status code 409, messages:
Entity exists at destination, Session record with token 2772005 already
exists
Error creating session:
Error/FailureRecovery
Ø DatawarpmaturedtomostlyautomaLcrecovery• enableSLURMTeardownFailureflag• setuperrorsresultinJobHeldAdmin
Ø DatawarpwillusuallyrecoveraderDW-serverreboot
Ø Mostfailuresrelatedto“stuckinteardown”state:• “D”–Destroy• “T”–TransisiLoning• “F”–FuseBlown(retriesexceeded)
Ø Primarycommandsused
• dwclirmsession–id=####• dwcliupdateregistraLon–id=####--no-wait(don’twaitforteardown)• dwcliupdateregistraLon–id=####--replace-fuse(retryteardownfuncLons)
Ø Asalastresort–rebootthesuspectDW-server
#dwstatmost
sessstatetokencreatorownercreatedexpiraLonnodes
9097D----atlasxaodCLI914212016-09-08T00:01:16never0
inststatesessbytesnodescreatedexpiraLonintactlabelpublicconfs
1817D---M909755.86TiB1432016-09-08T00:01:16neverfalseatlasxtrue1
confstateinsttypeacLvs
1823D---M1817scratch0
regstatesessconfwait
9067D--TM90971823true
_______________________
9067D-F-M90971823true
VendorResponsiveness
Ø SchedMDhasbeenextremelyresponsivetoNERSC’sneeds
• FuncLonalityEnhancements
• Bugsfixesandpatches
• NREcontractcompleted
Ø CrayDatawarpDevelopers• NREcontractforfuncLonalenhancements(ex.transparentcache)
• Phaseddelivery(currently@~Phase2.5)
• Bi-Weeklyconferencecalls
• TroubleshooLng/Diagnosis
• Fixes&Patches
Come visit us! SanFrancisco/BayAreaCalifornia• Temperateclimate• WorldClassCity• SiliconValley• EasyAccesstoNatural
Wonders
PointsofInterest• U.C.Berkeley• Napa/SonomaVineyards• MuirWoods• YosemiteValley• BerkeleyNa=onalLab-
WangHall–[email protected]:[email protected]