©2003 Dror Feitelson
Parallel Computing Systems
Part III: Job Scheduling
Dror Feitelson
Hebrew University
Types of Scheduling
• Task scheduling
  – Application is partitioned into tasks
  – Tasks have precedence constraints
  – Need to map tasks to processors
  – Need to consider communications too
  – Part of creating an application
• Job scheduling
  – Scheduling competing jobs belonging to different users
  – Part of the operating system
Dimensions of Scheduling
• Space slicing
  – Partition the machine into disjoint parts
  – Jobs get exclusive use of a partition
• Time slicing
  – Multitasking on each processor
  – Similar to conventional systems
• Use both together
• Use none – batch scheduling on a dedicated machine
Feitelson, IBM Research Report RC 19790, 1997
Space Slicing
• Fixed: predefined partitions
  – Used on CM-5
• Variable: carve out the number requested
  – Used on most systems: Paragon, SP, …
  – Some restrictions may apply, e.g. a torus
• Adaptive: modify the request size according to system considerations
  – Fewer nodes if more jobs are present
• Dynamic: modify the size at runtime too
Time Slicing
• Uncoordinated: each PE schedules on its own
  – Local queue: processes allocated to PEs; requires load balancing
  – Global queue: provides automatic load sharing, but the queue may become a bottleneck
• Coordinated across multiple PEs
  – Explicit gang scheduling
  – Implicit co-scheduling
[Chart: classification of systems by time slicing (run to completion, local queue, global queue, gang scheduling) versus space slicing (full machine; fixed or powers of 2; flexible; 2-level/bottom). Example systems include Illiac IV, MPP, GF11, StarOS, Tera, NYU Ultracomputer, Dynix, Alliant FX/8 (2-level/top), hypercubes (CM-2, iPSC/2, nCUBE), CM-5, Cedar, IBM SP, transputers, Chrysalis, Mach, ParPar, and the LLNL gang scheduler]
Scheduling Framework
[Diagram: arriving jobs queue for allocation to the machine; terminating jobs free their processors]
Partitioning with run-to-completion
  – Order of taking jobs from the queue
  – Re-definition of job size
Scheduling Framework
[Diagram: arriving jobs queue for the machine; running jobs may be preempted back into the queue; terminating jobs depart]
Time slicing with preemption
  – Setting time quanta and priorities
  – May jobs migrate/change size when preempted?
Memory Considerations
• The processes of a parallel application typically communicate
• To make good progress, they should all run simultaneously
• A process that suffers a page fault is unavailable for communication
• Paging should therefore be avoided
Scheduling Framework
[Diagram: two scheduling stages – memory allocation and dispatching]
Two stages of scheduling
  – Or three stages, with swapping
Batch Systems
• Define system of queues with different combinations of resource bounds
• Schedule FCFS from these queues
• Different queues active at prime vs. non-prime time
• Sophisticated/complex services provided
  – Accounting and limits on users/groups
  – Staging of data in and out of the machine
  – Political prioritization as needed
Example – SDSC Paragon
Queues are organized by maximum runtime (rows) and node count (columns: 1, 4, 8, 16, 32, 64, 128, 256, *); lowercase q denotes 16MB nodes, uppercase Q denotes 32MB nodes:

time  queues
15m   q4t
1h    q4s, q8s, Q8s, q16s, Q16s, q32s, Q32s, q64s
4h    q32m, Q32m, q64m, Q64m, q128m, Q128m, q256m, Q256m
12h   q1l, q32l, Q32l, q64l, Q64l, q128l, Q128l, Q256l
*     qstb, Qstb (low-priority standby)

Wan et al., JSSPP 1996
The Problem
• Fragmentation
  – If the first queued job needs more processors than are available, it must wait for more to be freed
  – Available processors remain idle during the wait
• FCFS (first come first serve)
  – Short jobs may be stuck behind long jobs in the queue
The Solution
• Out-of-order scheduling
  – Allows for better packing of jobs
  – Allows for prioritization according to desired considerations
Backfilling
• Allow jobs from the back of the queue to jump over previous jobs
• Make reservations for jobs at the head of the queue to prevent starvation
• Requires estimates of job runtimes
Lifka, JSSPP 1995
Parameters
• Order for going over the queue
  – FCFS
  – Some prioritized order (Maui)
• How many reservations to make
  – Only one (EASY)
  – For all skipped jobs (Conservative)
  – According to need
• Lookahead
  – Consider one job at a time
  – Look deeper into the queue
EASY Backfilling
Extensible Argonne Scheduling sYstem (first large IBM SP installation)
• Definitions:
  – Shadow time: time at which the first queued job can run
  – Extra processors: processors left over when the first job runs
• Backfill if
  – The job will terminate by the shadow time, or
  – The job needs less than the extra processors
Lifka, JSSPP 1995
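The two definitions above can be sketched in code. This is a minimal sketch, not Lifka's implementation; the dictionary fields (`size`, `end`, `runtime`) and function names are assumptions:

```python
def shadow_and_extra(running, first_queued, total_procs, now=0):
    """Shadow time: when the first queued job can start.
    Extra processors: those left over at that moment."""
    free = total_procs - sum(j["size"] for j in running)
    shadow = now
    for j in sorted(running, key=lambda j: j["end"]):
        if free >= first_queued["size"]:
            break
        free += j["size"]      # j terminates, releasing its processors
        shadow = j["end"]
    return shadow, free - first_queued["size"]

def can_backfill(job, shadow, extra, free_now, now=0):
    """EASY test: the job must fit now, and must either terminate by
    the shadow time or need no more than the extra processors."""
    if job["size"] > free_now:
        return False
    return now + job["runtime"] <= shadow or job["size"] <= extra
```

For example, on a 128-node machine with one running 64-node job ending at t=10 and a queued 128-node job, the shadow time is 10 and there are no extra processors, so only jobs that finish within 10 time units may backfill.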
Properties
• Unbounded delay
  – Backfill jobs will not delay the first queued job
  – But they may delay other queued jobs…
• No starvation
  – Delay of the first queued job is bounded by the runtimes of current jobs
  – When it runs, the second queued job becomes first
  – It is then immune to further delays
Mu’alem & Feitelson, IEEE TPDS 2001
User Runtime Estimates
• Small estimates allow a job to backfill and skip the queue
• Estimates that are too short risk the job being killed for exceeding its time limit
• So estimates may be expected to be accurate
Surprising Consequence
Performance is actually better if runtime estimates are inaccurate!
Experiment: replace user estimates by up to f times the actual runtime (data for KTH)

f      resp. time  slowdown
0      15001       67.6
1      14717       67.0
3      14645       62.7
10     14880       63.7
30     15028       64.7
100    15110       64.9
users  15568       84.0
Exercise
Understand why this happens
• Run simulations of EASY backfilling with real workloads
• Insert instrumentation to record detailed behavior
• Try to find why f=10 is better than f=1
• Try to find why user estimates are so bad
Hint
• It may be beneficial to look at different job classes
• Example: EASY vs. Conservative
  – EASY favors small long jobs: they can backfill despite delaying non-first jobs
  – This comes at the expense of larger short jobs
  – Happens more with user estimates than with accurate estimates
[2×2 diagram: jobs classified as short/long × small/large]
Another Surprise
Possible to improve performance by multiplying user estimates by 2! (table shows reduction in %)

Bounded slowdown:
       EASY    Conserv.
KTH    -4.8%   -23.0%
CTC    -7.9%   -18.0%
SDSC   +4.6%   -14.2%

Response time:
       EASY    Conserv.
KTH    -3.3%   -7.0%
CTC    -0.9%   -1.6%
SDSC   -1.6%   -10.9%
The MAUI Scheduler
Queue order depends on
• Waiting time in queue
  – Promote equitable service
• Fair-share status
• Political priority
• Job parameters
  – Favor small/large jobs etc.
• Number of times skipped by backfill
  – Prevent starvation
• Problem: conflicts are possible; it is hard to figure out what will happen
Jackson et al., JSSPP 2001
Fair Share
• Actually unfair: strive for a specific share
• Based on comparison with historical data
• Parameters:
  – How long to keep information
  – How to decay old information
  – Specifying shares per user or group
  – Shares are an upper bound, a lower bound, or both
• Handle multiple resources by the maximal "PE equivalent" (usage out of the total available)
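One way "decaying old information" might look, as a sketch (the decay factor and the per-interval usage records are assumptions, not Maui's actual parameters):

```python
def decayed_usage(history, decay=0.9):
    """Combine per-interval usage records (oldest first) into a single
    fair-share figure, geometrically decaying older intervals so that
    recent usage dominates the comparison against the target share."""
    usage = 0.0
    for u in history:
        usage = decay * usage + u
    return usage
```

The scheduler would then compare this figure against the configured share and raise or lower the job's priority accordingly.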
Lookahead
• EASY uses a greedy algorithm and considers jobs in one given order
• The alternative is to consider a set of jobs at once and try to derive an optimal packing
Dynamic Programming
• Outer loop: number of jobs that are being considered
• Inner loop: number of processors that are available

         p=1   p=2   p=3   p=4
no job    0     0     0     0
job 1
job 2               u[2,3]
job 3

u[2,3] = achievable utilization on 3 processors using only the first 2 jobs

Edi Shmueli, IBM Haifa
Cell Update
• If j.size > p, the job is too big to consider:
  u[j,p] = u[j-1,p]   (j is not selected)
• Else, consider adding job j:
  u' = u[j-1, p-j.size] + j.size
  – if u' > u[j-1,p] then u[j,p] = u'   (j is selected)
  – else u[j,p] = u[j-1,p]   (j is not selected)
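The recurrence above is a 0/1 knapsack in which a job's value equals its size. A sketch in Python (the job representation as a list of sizes and the return format are assumptions):

```python
def lookahead_pack(jobs, procs):
    """Choose a subset of queued jobs (given as processor counts) that
    maximizes utilization of `procs` free processors, using the
    dynamic-programming cell update described above."""
    n = len(jobs)
    u = [[0] * (procs + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        size = jobs[j - 1]
        for p in range(1, procs + 1):
            if size > p:                       # job too big to consider
                u[j][p] = u[j - 1][p]
            else:                              # try adding job j
                u[j][p] = max(u[j - 1][p - size] + size, u[j - 1][p])
    # Trace back the selected jobs from the bottom-right cell.
    selected, p = [], procs
    for j in range(n, 0, -1):
        if u[j][p] != u[j - 1][p]:             # j was selected
            selected.append(j - 1)
            p -= jobs[j - 1]
    return u[n][procs], sorted(selected)
```

For example, with jobs of sizes 2, 3 and 4 and 5 free processors, the best packing uses the first two jobs for a utilization of 5.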
Preventing Starvation
• Option I: only use jobs that will terminate by the shadow time
• Option II: make a reservation for the first queued job (as in EASY)
Requires a 3D data structure:
1. Jobs being considered
2. Processors being used now
3. Extra processors used at the shadow time
Dynamic Programming
• In the end the bottom-right cell contains the maximal achievable utilization
• The set of jobs to schedule is obtained by the path of selected jobs
Performance
• Backfilling leads to significant performance gains relative to FCFS
• More reservations reduce performance somewhat (EASY better than conservative)
• Lookahead improves performance somewhat
Two-Level Scheduling
• Bottom level – processor allocation
  – Done by the system
  – Balance requests with availability
  – Can change at runtime
• Top level – process scheduling
  – Done by the application
  – Use knowledge about priorities, holding locks, etc.
Programming Model
• Applications are required to handle arbitrary changes in allocated processors
• Workpile model
  – Easy to change the number of worker threads
• Scheduler activations
  – Any change causes an upcall into the application, which can reconsider what to run
Equipartitioning
• Strive to give all applications equal numbers of processors
  – When a job arrives, take some processors from each running job
  – When it terminates, give some to each other job
• Fair and similar to processor sharing
• Caveats
  – Applications may have a maximal number of processors they can use efficiently
  – Applications may need a minimal number of processors due to memory constraints
  – Reconfigurations require many process migrations (not an issue for shared memory)
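A sketch of the allocation itself, honoring the first caveat (a per-job maximum). The `id`/`max` field names are assumptions for illustration:

```python
def equipartition(jobs, procs):
    """Divide `procs` processors as evenly as possible among jobs,
    capping each job at its maximal usable size. Processors that a
    saturated job cannot use are redistributed to the others."""
    alloc = {j["id"]: 0 for j in jobs}
    pending = list(jobs)
    remaining = procs
    while pending and remaining:
        share = max(1, remaining // len(pending))
        nxt = []
        for j in pending:
            give = min(share, j["max"] - alloc[j["id"]], remaining)
            alloc[j["id"]] += give
            remaining -= give
            if alloc[j["id"]] < j["max"] and remaining:
                nxt.append(j)       # still unsaturated: may get more
        pending = nxt
    return alloc
```

With 8 processors and jobs that can use at most 2 and 10 processors respectively, the first job is capped at 2 and the second receives the remaining 6.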
Folding
• Reduce processor preemptions by selecting a partition and dividing it in half
• All partition sizes are powers of 2
• Easier for applications: when halved, multitask two processes on each processor
McCann & Zahorjan, SIGMETRICS 1994
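A sketch of one folding decision. Which partition to halve is a policy choice; halving the largest is an assumption here, not necessarily McCann & Zahorjan's rule:

```python
def fold_largest(partitions):
    """Halve the largest partition (all sizes are powers of 2) to free
    processors, e.g. for a new arrival. The folded job then multitasks
    two processes on each of its remaining processors.
    Returns the folded job's id and the number of processors freed."""
    victim = max(partitions, key=partitions.get)
    freed = partitions[victim] // 2
    partitions[victim] -= freed
    return victim, freed
```

Unfolding is the mirror operation: doubling a partition back when processors become available.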
The Bottom Line
• Places restrictions on the programming model
  – OK for workpile, Cray autotasking
  – Not suitable for MPI
• Very efficient at the system level
  – No fragmentation
  – Load leads to smaller partitions and reduced overheads for parallelism
• Of academic interest only, in shared-memory architectures
Definition
• Processes are mapped one-to-one onto processors
• Time slicing is used for multiprogramming
• Context switching is coordinated across processors
  – All processes are switched at the same time
  – Either all run or none do
• This applies to gangs, typically all processes in a job
CoScheduling
• Variant in which an attempt is made to schedule all the processes, but subsets may also be scheduled
• Assumes a "process working set" that should run together to make progress
• Does this make sense?
  – All processes are active entities
  – Are some more important than others?
Ousterhout, ICDCS 1982
Advantages
• Compensate for lack of knowledge
  – If runtimes are not known in advance, preemption prevents short jobs from being stuck
  – Same as in conventional systems
• Retain the dedicated-machine model
  – The application doesn't need to handle interference by the system (as in dynamic partitioning)
  – Allows use of hardware support for fine-grain communication and synchronization
• Improve utilization
Utilization
• Assume a 128-node machine
• A 64-node job is running
• A 32-node job and a 128-node job are queued
• Should the 32-node job be started?
Feitelson & Jette, JSSPP 1997
Best Case
Start the 32-node job, leading to 75% utilization
[Diagram: the 64-node and 32-node jobs run side by side over time; 32 nodes are left idle]
Worst Case
Start the 32-node job, but then the 64-node job terminates, leading to 25% utilization
[Diagram: after the 64-node job ends, only the 32-node job runs; 96 nodes are left idle]
With Gang Scheduling
Start the 32-node job in the slot with the 64-node job, and the 128-node job in another slot. Utilization is 87.5% (or 62.5% if the 64-node job terminates)
[Diagram: time slot 1 runs the 32- and 64-node jobs; time slot 2 runs the 128-node job]
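The utilization figures in this 128-node example can be checked directly:

```python
N = 128  # machine size

best = (64 + 32) / N                 # 32-node job fills in alongside the 64-node job
worst = 32 / N                       # 64-node job terminates just after the 32-node job starts
# Gang scheduling: two time slots, utilization averaged over the cycle.
gang = ((64 + 32) + 128) / (2 * N)   # slot 1: 32+64 nodes busy; slot 2: all 128
gang_after = (32 + 128) / (2 * N)    # same, after the 64-node job terminates

print(best, worst, gang, gang_after)  # 0.75 0.25 0.875 0.625
```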
Disadvantages
• Overhead for context switching
• Overhead for coordinating context switching across many processors
• Reduced cache efficiency
• Memory pressure – more jobs need to be memory resident
Implementation
• Pack jobs in processor–time space
  – Assign processes to processors
  – Is migration allowed?
• Perform coordinated context switching
  – Decide on time slices: are all equal?
Packing
• Ousterhout matrix
  – Rows represent time slices
  – Columns represent processors
• Each job is mapped to a single row with enough space
• Optimizations
  – Slot unification when occupancy is complementary
  – Alternate scheduling
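First-fit packing into the matrix might look like this (a sketch; slot unification and alternate scheduling are left out, and the matrix-as-list-of-rows representation is an assumption):

```python
def first_fit(matrix, job_id, size):
    """Place all of a job's processes in one row (time slice) of an
    Ousterhout matrix, whose columns are processors. If no row has
    enough free columns, open a new row (i.e. another time slice)."""
    for row in matrix:
        free = [c for c, owner in enumerate(row) if owner is None]
        if len(free) >= size:
            for c in free[:size]:
                row[c] = job_id
            return
    ncols = len(matrix[0])
    matrix.append([job_id] * size + [None] * (ncols - size))
```

The `None` cells left in a row are exactly the fragmentation discussed on the next slide.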
Fragmentation
There can be unused space in each slot
(Unused slots are not a problem)
Alternate Scheduling
Effect can be reduced by running jobs in additional slots
This depends on good packing
Ousterhout, ICDCS 1982
Buddy System
• Allocate processors according to a buddy system
  – Successive partitioning into halves as needed
• Used and unused processors in different slots will tend to be aligned
• This facilitates alternate scheduling

Feitelson & Rudolph, Computer 1990
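A toy buddy allocation over processors, as a sketch (not the paper's algorithm; release and coalescing of buddies are omitted):

```python
import math

class BuddyAllocator:
    """Allocate power-of-2 processor blocks by repeatedly halving
    larger free blocks. Because every block of size s starts at a
    multiple of s, allocations in different slots tend to align."""
    def __init__(self, total):
        self.total = total                # must be a power of 2
        self.free = {total: [0]}          # block size -> start offsets

    def alloc(self, request):
        size = 1 << max(0, math.ceil(math.log2(request)))
        s = size
        while s <= self.total and not self.free.get(s):
            s *= 2                        # look for a larger free block
        if s > self.total:
            return None                   # no room
        start = self.free[s].pop()
        while s > size:                   # split down to the right size
            s //= 2
            self.free.setdefault(s, []).append(start + s)
        return (start, size)
```

For example, on 8 processors, requests for 3, 2 and 2 processors yield aligned blocks at offsets 0, 4 and 6; a fourth request fails.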
Coordinated Context Switching
• Typically coordinated by a central manager
• Executed by local daemons
• Use SIGSTOP/SIGCONT to leave only one runnable process
STORM
• Base job execution and scheduling on 3 primitives
  – xfer-&-signal (broadcast)
  – test-event (optional block)
  – comp-&-write (on global variables)
• Implemented efficiently using NIC and network support
Frachtenberg et al., SC 2002
Flexibility
Idea: reduce constraints of strict gang scheduling
• DCS: demand-based coscheduling
• ICS: implicit coscheduling
• FCS: flexible coscheduling
• Paired gang scheduling
The common factor: involve local scheduling
DCS
• Coscheduling is good for threads that communicate or synchronize
• Prioritizing threads that communicate will cause them to be coscheduled on different nodes
Sobalvarro & Weihl, JSSPP 1995
Algorithmic Details
• Switch to a thread that receives a message
• But only if that thread has received less than its fair share of the CPU
  – To prevent jobs with more messages from monopolizing the system
• And only if the arriving message does not belong to a previous epoch
  – A new epoch is started when some node switches spontaneously
  – Prevents thrashing and allows a job to gain control of the full machine
ICS
• Priority-based schedulers automatically give priority to threads waiting on communication
• But need to ensure that the sending thread stays scheduled until a reply arrives
• Do this with two-phase blocking, and wait for about 5 context-switch durations
Dusseau & Culler, SIGMETRICS 1996
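Two-phase blocking in sketch form. The five-context-switch figure comes from the slide; the spin budget here is a plain parameter, and the event-based signaling is an assumption standing in for message arrival:

```python
import threading
import time

def two_phase_wait(reply, spin_budget):
    """Phase 1: busy-wait, hoping the reply arrives while we are still
    scheduled (so the peer finds us responsive). Phase 2: if the spin
    budget (roughly 5 context-switch durations) expires, block."""
    deadline = time.monotonic() + spin_budget
    while time.monotonic() < deadline:
        if reply.is_set():
            return "got reply while spinning"
    reply.wait()
    return "blocked for reply"
```

A quick reply is caught in the spin phase at no scheduling cost; a slow one only costs the bounded spin before yielding the CPU.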
FCS
• Retain coordinated context switching across the machine
• But allow local scheduler to override global scheduling instructions according to local data
• Use local classification of processes into those that require gang scheduling and those that don’t
Frachtenberg et al., IPDPS 2003
Implementation Details
• Instrument MPI library to measure compute time between calls (granularity) and wait time for communication to complete
• Use this to classify processes
• Data is aged exponentially
• Upon a coordinated context switch, decide whether to abide or use local scheduler
Process Classification
[Chart: processes plotted by computation time per iteration (x-axis) vs. communication wait time per iteration (y-axis), divided into classes CS, F, and DC]
• Fine grain and low wait (CS): coscheduling is effective
• Fine grain but long wait (F): the process is frustrated; don't coschedule, but give priority
• Other processes (DC) don't care; use as filler
Paired Gang Scheduling
• Monitor CPU activity of processes as they run
• Low CPU implies heavy I/O or communication
• Schedule pairs of complementary gangs together
Wiseman & Feitelson, IEEE TPDS 2003
Memory Pressure
• With partitioning: memory requirements may set a minimal partition size
  – May lead to underutilization of processors
• With gang scheduling: memory requirements may lead to excessive paging
  – No local context switching: processors remain idle waiting for pages
  – However, no thrashing
Paging
• Paging is asynchronous
• If two nodes communicate, and one suffers a page fault, the other will have to wait
• The whole application makes progress at the rate of the slowest process at each instant
• Effect is worse for finer grained applications
Solutions
• Admission control
  – Allow jobs into the system only as long as memory suffices (variant: allow only 3 jobs…)
  – Jobs that do not fit may wait a long time
• Swapping
  – Perform long-range scheduling to give queued jobs a chance to run
Problems
• Assessing memory requirements
  – Data size as noted in the executable (good for static Fortran code)
  – Historical data from previous runs
  – Problem of dynamic allocations
• Blocking of short jobs
Batat & Feitelson, IPPS/SPDP 1999
Performance
Prevention of paging compensates for the blocking of short jobs

Batat & Feitelson, IPPS/SPDP 1999

[Graph: average response time (0–6000) vs. system load (0.3–0.9), with curves for locality parameters 1.25, 1.5, 1.75, and 1.99, plus a "1.5+" curve that adds memory considerations; annotations mark "Add memory considerations" and "Higher locality"]
Swapping Overhead
• m – memory image size
• r – swapping bandwidth
• Swapping time is then 2m/r

Alverson et al., JSSPP 1995

[Diagram: thread A runs, is swapped out, then thread B is swapped in and runs; the two transfers of m bytes each take 2m/r in total]
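The 2m/r figure, as arithmetic (the sizes in the example call are made up for illustration):

```python
def swap_time(m_bytes, r_bytes_per_sec):
    """Swapping A out (m/r) and then B in (m/r) costs 2*m/r in total,
    since both memory images cross the same channel of bandwidth r."""
    return 2 * m_bytes / r_bytes_per_sec

print(swap_time(256e6, 100e6))  # a 256 MB image at 100 MB/s: 5.12 seconds
```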