Fully Hierarchical Scheduling: Paving the Way to Exascale Workloads

Stephen Herbein (1,2), Tapasya Patki (2), Dong H. Ahn (2), Don Lipari (2), Tamara Dahlgren (2), David Domyancic (2), Michela Taufer (1)
(1) University of Delaware, (2) Lawrence Livermore National Laboratory
LLNL-POST-735166

Motivation
• Emerging HPC workloads represent an order-of-magnitude increase in both scale and complexity, yet batch scheduling remains stuck in the decades-old, centralized scheduling model.
• HPC batch schedulers have several limitations with respect to these emerging workloads, which has led to a proliferation of workflow systems that provide specialized workarounds [2,3].
• The fully hierarchical scheduling model and its implementation, Flux, provide general solutions to these limitations.

“It is widely expected that rigorous uncertainty quantification over high-dimensional input spaces will play a crucial role in enabling extreme-scale science. Indeed, a thousand-fold increase in computing power would facilitate orders-of-magnitude more simulation realizations.”
From the Top Ten Exascale Research Challenges, DOE ASCAC Subcommittee Report, 2014.

Schedulers' Limitation                     | Workload System's Workaround             | Side Effect
Max number of jobs                         | Throttle submissions                     | Decreased job throughput
Limited job throughput                     | Aggregate jobs                           | Increased workload runtime
Lack of job/ensemble status & control API  | Track each job's status through files    | I/O bottleneck
Lack of programmable failure detection     | Inspect failures manually                | Unnecessary job resubmissions

Uncertainty Quantification Pipeline (UQP)
• Accounts for roughly 50.9 million CPU hours each year at LLNL.
• Simplifies performing uncertainty quantification studies.
• Requires running an ensemble of simulations containing anywhere between 1,000 and 100,000,000 jobs.
• Provides workarounds for existing HPC schedulers' limitations; these workarounds result in decreased job throughput and an I/O bottleneck.

Example UQP Workflow (diagram): training data, ensemble generation, ensemble execution, surrogate model construction, and sensitivity/UQ analysis, connected by data flows labeled "input parameters, run environments", "UQ configuration, simulation files", and "high-fidelity surrogate(s)".

Fully Hierarchical Scheduling Under Flux
• A new HPC scheduling model aimed at addressing next-generation scheduling challenges using one common resource and job management framework at both the system and application levels [1].
• Applies a divide-and-conquer approach to scheduling, allowing scheduling work to be distributed across an arbitrarily deep hierarchy of schedulers (see the sketch below).

Scheduling models compared (diagram):
• Centralized (Slurm): a single scheduler and job queue manage all nodes (low scalability).
• Limited hierarchical (Slurm + Moab): a global scheduler feeds a fixed, shallow layer of schedulers, each owning a subset of the nodes (medium scalability).
• Fully hierarchical (Flux): a global scheduler spawns child schedulers, which can spawn further children to arbitrary depth (high scalability).
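As an illustration of how a workload can exploit this hierarchy, the minimal sketch below submits a handful of "child" jobs to a parent Flux instance, where each child runs "flux start" to bring up its own nested Flux instance and schedule its slice of the ensemble locally. It assumes today's flux-core Python bindings (flux.Flux, flux.job.JobspecV1, flux.job.submit), which postdate this poster; the worker script run_ensemble.py and the resource sizes are hypothetical placeholders.

```python
# Minimal sketch of a two-level Flux hierarchy. Assumes modern flux-core
# Python bindings; "run_ensemble.py" and the sizes are illustrative.
import flux
from flux.job import JobspecV1, submit

parent = flux.Flux()  # handle to the enclosing (parent) Flux instance

# Submit a few "child" jobs. Each child runs "flux start", which bootstraps
# a nested Flux instance (one broker per node) and then runs the worker
# script inside it. Each child instance schedules its own slice of the
# ensemble, so the parent scheduler only ever sees four jobs.
for shard in range(4):
    child = JobspecV1.from_command(
        command=["flux", "start", "python", "run_ensemble.py", str(shard)],
        num_nodes=8,
        num_tasks=8,        # one broker per node
        cores_per_task=1,
    )
    print("submitted child instance:", submit(parent, child))
```

The same pattern recurses: inside a child instance, run_ensemble.py can open its own handle with flux.Flux() and submit, aggregate, or further subdivide its tasks, which is how the hierarchy reaches arbitrary depth.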
Schedulers' Limitations on Number of Jobs
Centralized model
• Exhausts local resources when handling large numbers of jobs [2].
Fully hierarchical model
• Distributes local resource requirements across multiple schedulers.

Case Study: Synthetic Stress Test
• Study configuration: all three scheduler models evaluated on a 32-node cluster with a synthetic workload of dummy jobs.
• [Figure: makespan in seconds (log scale, 1 to 10,000) versus total number of jobs (1 to 1,048,576) at node, allocation, and core granularity for each scheduling model; lower makespan is better, and a larger maximum job count is better.]
• Moving from the centralized to the fully hierarchical model increases the scheduler's job scalability by 133x.

Schedulers' Limitations on Job Throughput
Centralized model
• Tasks within an aggregated UQP job are run serially, increasing the workload's runtime.
Fully hierarchical model
• An aggregated job is managed by its own full-featured scheduler, allowing its tasks to run concurrently.

Case Study: UQP Workload
• Study configuration: UQP runs with Slurm and Flux evaluated on a 16-node cluster with a workload of a single-core Monte Carlo application [3].
• [Figure: workload runtime in seconds under each scheduling model; lower is better.]
• Moving from the centralized scheduler Slurm to the fully hierarchical scheduler Flux results in a 37% faster workload runtime.

Schedulers' Limited Job Ensemble Support
Centralized model
• Submit and track each job individually.
Fully hierarchical model
• Submit hierarchies of jobs and track them at variable levels of granularity through an API.

Runtime stages measured: UQP startup, job submission, file creation, file access, and non-I/O.
Slurm (centralized scheduler)
• Limited API for job status; the UQP tracks job states through files, creating an I/O bottleneck.
Flux (fully hierarchical scheduler)
• Provides a subscription-based job status API, eliminating the file I/O (as sketched below).

Flux and the fully hierarchical model simplify the submission and tracking of job ensembles and thus eliminate the need for the UQP's file-based I/O. The fully hierarchical scheduler handles all of the challenges encountered by the UQP.
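As a concrete illustration of that API, the sketch below tracks a batch of jobs through Flux's per-job eventlog rather than through files. It uses today's flux-core Python bindings (flux.job.event_watch), which postdate the poster-era UQP integration, and simulate.py stands in for a hypothetical single-core task.

```python
# Minimal sketch: track an ensemble through Flux's per-job eventlog instead
# of polling status files. Assumes modern flux-core Python bindings;
# "simulate.py" is a hypothetical placeholder task.
import flux
from flux.job import JobspecV1, submit, event_watch

h = flux.Flux()
spec = JobspecV1.from_command(["python", "simulate.py"], num_tasks=1)
jobids = [submit(h, spec) for _ in range(100)]

# State transitions (submit, alloc, start, finish, clean, ...) arrive as
# events on each job's eventlog, so no status files are written or scanned.
for jobid in jobids:
    for event in event_watch(h, jobid):
        if event.name == "finish":
            print(f"{jobid}: exit status {event.context.get('status')}")
```

A real workflow would watch the eventlogs asynchronously (for example with flux.job.event_watch_async) instead of one job at a time, but the point stands: job state arrives as a stream of events rather than through the filesystem.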
Future Work
• Integrate Flux's job/ensemble status & control API into the UQP to simplify the submission and tracking of job ensembles while also eliminating the I/O bottleneck.
• Develop a programmable failure detection mechanism within Flux to reduce unnecessary resubmissions and simplify error handling for users.

References and Acknowledgements
[1] D. Ahn et al. Flux: A Next-Generation Resource Management Framework for Large HPC Centers. ICPPW '14.
[2] J. Gyllenhaal et al. Enabling High Job Throughput for Uncertainty Quantification on BG/Q. ScicomP '14.
[3] T. Dahlgren et al. Scaling Uncertainty Quantification Studies to Millions of Jobs. SC '15.
• The authors acknowledge the advice and support from the Flux team and others at LLNL: Tom Scogland, Jim Garlick, Mark Grondona, Becky Springmeyer, Chris Morrone, and Al Chu.
• This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, and was supported in part by National Science Foundation grant CCF #1318445 and by the Exascale Computing Project (17-SC-20-SC).

