Local Resource Management System & State Estimation
Local resource management systems: Condor, Maui, LSF, PBS
Prediction techniques: an example (NWS), improving resource selection
Condor - Introduction
Batch job system that allows usage of both dedicated and non-dedicated systems.
Provides users with extra computing power
Introduces complexities
  removing jobs before they are finished (preemption)
  running on a wide array of machines (matchmaking)
Condor – Preemptive Resume Scheduling
Advantages
  use resources that are only available occasionally, through checkpoints, preemption and allocation
  no backfilling (backfilling takes advantage of holes in the schedule to run more jobs, and thereby increases efficiency)
  fair sharing among jobs and users
  compute on demand (low vs high priority)
Condor – Scheduling
Submit jobs to the local computer's queue
Interact with the matchmaker to run a job (1 CPU per job)
Run the appropriate (ClassAd) job by claiming it
Triumvirate
User agent – make sure job finishes, on failure resubmit, etc.
Owner agent – ensure owner's policy of how computer is used, responsible for running submitted jobs
Matchmaker – find matches between user and owner agent and implement system-wide policies
Triumvirate (2)
Condor – Matchmaking & Claiming
User submits job to queue (unique identification)
User agent sends a ClassAd (every 5 min) as long as there are jobs that are not running
Owner agent sends a ClassAd (every 5 min) to describe the computer it is responsible for
Matchmaker accepts ClassAds and attempts to find matches (negotiation)
On a match, user and owner agent work out the details independently of the matchmaker (up-to-date information)
User agent sends the job to the owner agent, and it runs
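The matchmaking step above can be sketched as a symmetric match between two advertisements. This is a minimal illustration in Python, not Condor's ClassAd language: the `Ad` class, the `match` helper and the attribute names (`memory_mb`, `arch`, `image_mb`, `owner`) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Ad:
    """Simplified stand-in for a ClassAd: attributes plus a requirements predicate."""
    name: str
    attrs: dict
    requirements: Callable[[dict], bool]  # evaluated against the other party's attributes

def match(job_ads, machine_ads):
    """Pair each job with the first unclaimed machine where BOTH sides' requirements hold."""
    claimed = set()
    matches = []
    for job in job_ads:
        for machine in machine_ads:
            if machine.name in claimed:
                continue
            if job.requirements(machine.attrs) and machine.requirements(job.attrs):
                matches.append((job.name, machine.name))
                claimed.add(machine.name)
                break
    return matches

# Hypothetical advertisements from two user agents and two owner agents.
jobs = [
    Ad("job-1", {"owner": "alice", "image_mb": 200},
       lambda m: m["arch"] == "x86_64" and m["memory_mb"] >= 512),
    Ad("job-2", {"owner": "bob", "image_mb": 4000},
       lambda m: m["memory_mb"] >= 8192),
]
machines = [
    Ad("ws-a", {"arch": "x86_64", "memory_mb": 1024},
       lambda j: j["image_mb"] <= 1024),
    Ad("ws-b", {"arch": "x86_64", "memory_mb": 16384},
       lambda j: j["owner"] != "bob"),   # owner policy: refuse bob's jobs
]

result = match(jobs, machines)
print(result)
```

Here job-2 stays unmatched: the only machine large enough belongs to an owner whose policy rejects it, which is the kind of two-sided constraint the matchmaker has to respect.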
Condor – Matchmaking & Claiming (2)
On problems outside the process, redo matchmaking; on a program error, record the problem and inform the user
When program starts, another process (shadow) is started on user agent that is responsible for Condor’s remote I/O capabilities
Running jobs continue even if matchmaker fails
Condor - preemption
Preemption is necessary to respect interests of all parties
Key to success is checkpoint creation
  when preempted from a machine
  manual checkpoint creation
  periodic checkpoint creation to safeguard against failures
Crashes/disruptions happen frequently in grids
Checkpointing and reacting to preemptions is an essential part of Condor's approach to reliability.
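The idea of checkpointing on preemption and at periodic intervals can be sketched as follows. This is an application-level toy in Python, not Condor's checkpointing mechanism: the function `run_job`, its parameters and the JSON checkpoint file are assumptions for illustration.

```python
import json
import os
import tempfile

def run_job(total_steps, checkpoint_path, preempt_at=None, checkpoint_every=10):
    """Sum 0..total_steps-1, saving progress so a preempted run can resume."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(checkpoint_path):           # resume from the last checkpoint
        with open(checkpoint_path) as f:
            state = json.load(f)

    def save():
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)

    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            save()                                # checkpoint when preempted
            return None                           # the job vacates the machine
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save()                                # periodic checkpoint against crashes
    return state["acc"]

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "job.ckpt")
    first = run_job(100, ckpt, preempt_at=37)     # preempted on the first machine
    second = run_job(100, ckpt)                   # resumed from the checkpoint

print(first, second)
```

The resumed run continues from step 37 instead of starting over, so the work done before preemption is not lost.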
Condor – user preemption
Manual preemption
Automation of the above process (e.g. based on running time)
Preemption on behalf of Condor
  e.g. check if the job can run on a better machine
  not supported in the current version of Condor
  needs consideration of issues such as 'thrashing' (always looking for a better computer, never getting any job done)
Condor – owner / matchmaker preemption
Owner removes job running on his machine
  automated by Condor (e.g. check keyboard inactivity)
  manually by running a command
Matchmaker can enforce administrator policies to increase efficiency
  e.g. run a better job on a machine already running one
  however, Condor strongly prefers not to preempt jobs if they can be run on an idle machine.
Condor - conclusion
Condor can balance the desires of all stakeholders
Condor can both take advantage of sporadically available resources and react to problems such as failures
This flexibility and robustness is its key to success
Maui Scheduler - Introduction
High performance scheduler for local clusters
Includes resource reservation, availability estimation and allocation management
External manager, extends and enhances the capabilities and performance of existing scheduler
Maui – Allocation properties
Concept of reservation to maintain resource allocations
  most important feature is future allocations
  set aside a block of resources for various purposes such as cluster maintenance or a guaranteed job start time
  resource expression: resource quantity and type conditions that resources must meet to be included
  access control list (ACL): which consumers may utilize the reserved resources
  timeframe: time period over which the reservation actually blocks resources
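The three parts of a reservation listed above (resource expression, ACL, timeframe) can be pictured as a small record with an admission check. This is a hypothetical Python model, not Maui's configuration syntax or API; the class `Reservation` and its field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Reservation:
    """Hypothetical model of a reservation: resources, ACL, and timeframe."""
    nodes: int                               # resource expression: quantity
    node_type: str                           # resource expression: type condition
    acl: set = field(default_factory=set)    # consumers allowed to use the reservation
    start: int = 0                           # timeframe start (seconds)
    end: int = 0                             # timeframe end (seconds)

    def admits(self, user, nodes, node_type, t_start, t_end):
        """True if the request fits the ACL, the resources, and the timeframe."""
        return (user in self.acl
                and nodes <= self.nodes
                and node_type == self.node_type
                and self.start <= t_start and t_end <= self.end)

    def blocks(self, t):
        """True if the reservation is holding resources at time t."""
        return self.start <= t < self.end

# A one-hour maintenance window on 16 compute nodes, usable only by "admin".
maintenance = Reservation(nodes=16, node_type="compute", acl={"admin"},
                          start=3600, end=7200)
print(maintenance.admits("admin", 8, "compute", 3600, 5400))
print(maintenance.admits("alice", 8, "compute", 3600, 5400))
```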
Maui – Allocation properties (2)
Revocation of allocation
  support for revocable and irrevocable reservations
  e.g. strict time constraints on data availability or job completion
  default is irrevocable; reservations are maintained until the timeframe has expired or they are explicitly removed
Guaranteed completion time of allocations
  locked to an exact time, guaranteed to complete before a certain time, or guaranteed to start after a given time
  scheduler regularly tries to optimize
Maui – Allocation properties (3)
Guaranteed number of attempts to complete a job
  don't attempt to start job until all prerequisites are met
  using the defer mechanism, Maui can specify how many times to locate resources for a job before giving up or putting it on hold
Allocation run-to-completion
  configure to disable all or a subset of preemptions, thus guaranteeing a job to complete without interference
Exclusive allocations
  request dedicated resources to guarantee exclusive access
Maui – Allocation properties (4)
Malleable Allocations
  all aspects can be dynamically modified
  if a job consumes excessive resources, Maui can preempt or even cancel the job depending on the resource utilization policy
Maui - Access to available scheduling info
Access to the tentative scheduler
  provide information on all possible availability times
  scheduler can request a single estimated start time for a job
Exclusive control
  Maui maintains exclusive control over the execution
Event notification
  generalized event management interface; respond immediately to changes in the environment
Maui – Requesting resources
Allocation offers
  full contextual information regarding the request and if and how Maui can satisfy this request
Allocation cost or objective information
  interface with allocation management systems that help assign costs to resource consumption
Advance reservation
  allows peers full control over the scheduling of jobs through time
Requirement for providing maximum allocation time in advance
  credential-based walltime limits can be configured based on various criteria
Maui – Requesting resources (2)
Deallocation policy
  support for single-step resource allocation requests; create a resource allocation valid until job completion
  two-phase courtesy reservation; after the courtesy reservation is sent, a reservation commit must be received, otherwise the job is removed
Remote co-scheduling
  stage remote jobs to a local cluster
Consideration of job dependencies
  basic job dependency support to block certain job steps until specific prerequisites are met
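The two-phase courtesy reservation described above can be sketched as a small state holder: a reservation is made first, and it only survives if a commit arrives. This is a hypothetical Python illustration, not Maui's interface; in particular the `commit_window` timeout is an assumption, since the slides do not say how long Maui waits for the commit.

```python
class CourtesyReservations:
    """Toy two-phase reservation table: reserve first, then commit or lose it."""

    def __init__(self, commit_window):
        self.commit_window = commit_window   # assumed time allowed between the two phases
        self.pending = {}                    # job id -> time the courtesy reservation was made
        self.committed = set()

    def reserve(self, job_id, now):
        """Phase 1: place a courtesy reservation."""
        self.pending[job_id] = now

    def commit(self, job_id, now):
        """Phase 2: confirm the reservation; fails if unknown or too late."""
        made_at = self.pending.pop(job_id, None)
        if made_at is None or now - made_at > self.commit_window:
            return False
        self.committed.add(job_id)
        return True

    def expire(self, now):
        """Remove jobs whose commit never arrived; return their ids."""
        stale = [j for j, t in self.pending.items() if now - t > self.commit_window]
        for j in stale:
            del self.pending[j]
        return stale

table = CourtesyReservations(commit_window=60)
table.reserve("job-a", now=0)
table.reserve("job-b", now=0)
committed_a = table.commit("job-a", now=30)   # commit arrives in time
removed = table.expire(now=100)               # job-b never committed
print(committed_a, removed)
```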
Maui – Manipulating the allocation execution
Preemption
  suspend operations are supported as far as that capability is available in the underlying manager
Checkpointing
  'checkpoint and terminate' and 'checkpoint and continue' are supported
Migration
  support for intra-domain job migration, but no support for QoS, load balancing, or other optimization
Restart
  checkpoints used if available
LSF - Introduction
Load Sharing Facility
As a low-level scheduler
LSF – Available-information attributes
Access to the tentative scheduler
  often impractical in real-world applications, no support
Exclusive control
  LSF executes in user space, so its control is not exclusive; it can only provide the necessary measures
Event notification
  supplies an event-notification service for high-level schedulers
LSF – Requesting resources
Allocation offers
  doesn't expose potential resource allocations
Allocation cost or objective information
  unsupported
Advance reservation
  provides built-in and Maui-integrated capabilities
Requirement for providing maximum allocation time in advance
  high regard
LSF – Requesting resources (2)
Deallocation policy
  automatic
Remote co-scheduling
  supported by higher-order scheduling instances
Consideration of job dependencies
  built-in support for job dependencies through logical expressions based on 15 dependency conditions
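A dependency written as a logical expression over per-job conditions can be pictured as composable predicates. The sketch below is plain Python and not LSF's dependency syntax: the condition helpers (`done`, `exited`, `ended`), the combinators and the job state names are assumptions chosen for illustration, and only three conditions are modelled rather than the 15 mentioned above.

```python
# Each condition is a function from {job name: state} to bool.
# Assumed job states for this sketch: "PEND", "RUN", "DONE", "EXIT".

def done(job):
    return lambda states: states.get(job) == "DONE"

def exited(job):
    return lambda states: states.get(job) == "EXIT"

def ended(job):
    return lambda states: states.get(job) in ("DONE", "EXIT")

def all_of(*conds):
    return lambda states: all(c(states) for c in conds)

def any_of(*conds):
    return lambda states: any(c(states) for c in conds)

# "Run the report once prep succeeded and fetch has finished either way,
#  or if the fallback job succeeded."
may_start = any_of(all_of(done("prep"), ended("fetch")), done("fallback"))

print(may_start({"prep": "DONE", "fetch": "EXIT"}))   # prerequisites met
print(may_start({"prep": "RUN", "fetch": "DONE"}))    # still blocked
```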
LSF – Allocation properties
Revocation of allocation not needed because of resource shortness, etc.
Guaranteed completion time of allocations
LSF – Allocation properties (2)
Guaranteed number of attempts to complete a job
  distinguishes between attempts that are execution pre-condition and execution condition, with complete flexibility
Allocation run-to-completion
  with implicit assumptions, for example that allocations don't exceed resource limits
Exclusive allocations
  can dispatch jobs to hosts where no other LSF job is running
LSF – Allocation properties (3)
Malleable Allocations
  built-in mechanisms allow allocations to decay consumption over time on a per-resource basis
LSF – Manipulating the allocation execution
Preemption
  supported since 1995; preempted workloads retain resources
Checkpointing
  assuming the application supports it, LSF provides an interface
Migration
  provides a mechanism; the migration itself is to be done by a high-level scheduler
Restart
  provides an interface
LSF - Conclusion
Supports most attributes of a low-level scheduler that can be exploited by a high-level scheduler
PBS – Introduction
Portable Batch System
Flexible workload management and batch job scheduling system
Covers the entire Grid computing space: security, information, compute and data
Middleware technology that sits between compute-intensive or data-intensive applications and the network, hardware and OS
All jobs go to a single virtual pool, which is scheduled and distributed on the grid
PBS – Security
Fundamental capabilities are secure authentication and authorization
Internally it makes use of user-name based authentication
Support for X.509 Grid standard identification
  certificate lifetime (expire/renew)
Identity mapping between sites is handled by a mapping function
PBS - Information
Information management with access to the state of the infrastructure
Collect real-time data on state with job executor daemon process (MOMs)
Easy integration with larger Grid information databases
PBS - Compute
Advance reservation support
  check for conflicts
  e.g. reserve resources for a car-crash test, including computer cycles, network, database, facility
Cycle harvesting
  expand available computing resources by using idle workstations
Peer scheduling
  enable a site or sites with different PBS installations to automatically run jobs from each other
  no job will be moved if it cannot run immediately
PBS - Data
Most basic capability of a data Grid: file staging
  automatic handling of copying files onto execution nodes (stage-in) prior to running the job
  copying files off execution nodes (stage-out) after the job completes
  PBS will not run jobs until stage-in is fully done
Support for Globus Toolkit, scp, GridFTP, etc.
PBS – Available-information attributes
Access basic information by typing qstat
Email notification
PBS – Requesting resources
Single resource solution to a job request
Estimated completion time is configurable
  absence of this information, however, hampers performance (it is needed by backfilling, for example)
Job dependencies
Co-scheduling by simply configuring the queues of the system
PBS – Allocation properties
Revoke any allocation, both while the job is queued and while it is running
Preemption by the scheduler is also possible; choice of suspension, checkpointing, requeuing, termination
Configurable job completion attempts
Configurable exclusive allocation, etc.
No support for malleable allocation (i.e. addition or revocation of resources during runtime)
PBS - Manipulating the allocation execution
Support for requeue, restart
On preemption: checkpoint generation and migration
Prediction techniques
The problems of scheduling and resource allocation are central to Grid performance
Applications must balance performance against the communication overhead that parallelism produces
Grid resources differ widely in performance
A resource allocator must choose the right combination of resources from a pool that is constantly changing
Prediction techniques (2)
Categorization into static and dynamic performance characteristics based on speed of change
  static: clock speed (CPU), for example
  dynamic: CPU load, network throughput
Grid resource performance prediction
For a grid scheduler, two characteristics can be exploited to overcome the complexities introduced by the dynamics of Grid performance response
Observable Forecast Accuracy
  predictions of future performance measurements can be evaluated by recording their accuracy once the measurements are actually gathered
Near-term Forecasting Epochs
  scheduler can make decisions dynamically, just before execution begins. Since accuracy usually degrades further into the future, make the decision at the last possible moment
Prediction – an example (NWS)
Provide 3 fundamental functionalities
  Monitoring, Forecasting, Reporting
NWS – Network Weather Service
  grid monitoring and forecasting tool designed to support dynamic resource allocation and scheduling
  sensor control subsystem
  historical data for future performance prediction
  multiple reporting interfaces
  convenient methodology for replication and caching
Prediction – an example (NWS) (2)
Performance monitoring and forecasting system must be able to execute on all platforms available to the user
  written in C; highest portability with standard libs
Two types of monitors (CPU probe)
  passive: read a measurement gathered through some other means (e.g. local OS)
    e.g. UNIX load average
    non-intrusive
    inaccurate?
  active: load own resource and observe performance response
    know exact performance
    intrusive
Prediction – an example (NWS) (3)
Intrusiveness vs Scalability (Network probe)
  probe the network by timing packet travel duration
  with more hosts, probe collisions will occur, resulting in loss of bandwidth
  NWS uses a token-passing method to prevent such problems
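The effect of token passing on probe scheduling can be sketched with a simple ring: only the current token holder probes, then hands the token on. This is a toy Python model, not the NWS protocol itself; the function `probe_schedule` and the host names are assumptions.

```python
from collections import deque

def probe_schedule(hosts, rounds):
    """Return (holder, targets) pairs in the order probes would run.

    Only the token holder probes, so two probes never overlap on the network.
    """
    ring = deque(hosts)
    schedule = []
    for _ in range(rounds * len(hosts)):
        holder = ring[0]                          # current token holder
        targets = [h for h in hosts if h != holder]
        schedule.append((holder, targets))        # holder probes alone
        ring.rotate(-1)                           # pass the token to the next host
    return schedule

schedule = probe_schedule(["a", "b", "c"], rounds=2)
holders = [holder for holder, _ in schedule]
print(holders)
```

The price of avoiding collisions is that each host probes less often as the ring grows, which is the intrusiveness versus scalability trade-off named above.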
Prediction – an example (NWS) (4)
Forecasting
  an inherent problem of prediction: assumptions are made about what resources will be like when the job runs
  in Grid settings, available resource performance can fluctuate dynamically
NWS uses statistical methods to attempt to mechanize and automate forecasting based on historical data
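One way to mechanize forecasting from historical data is to run several simple predictors over the measurement history, score each by its past error, and use whichever has done best so far. The Python sketch below illustrates that idea only; it is not NWS code, and the predictor set (`last`, `mean`, `median_5`) and the sample load values are assumptions.

```python
from statistics import mean, median

# Candidate predictors: each maps a non-empty history to a forecast of the next value.
PREDICTORS = {
    "last": lambda h: h[-1],
    "mean": lambda h: mean(h),
    "median_5": lambda h: median(h[-5:]),
}

def forecast(series):
    """Replay the series, accumulate each predictor's absolute error,
    and return (best predictor name, its forecast for the next value, errors)."""
    errors = {name: 0.0 for name in PREDICTORS}
    for i in range(1, len(series)):
        history, actual = series[:i], series[i]
        for name, predict in PREDICTORS.items():
            errors[name] += abs(predict(history) - actual)
    best = min(errors, key=errors.get)
    return best, PREDICTORS[best](series), errors

# Invented CPU load samples: steady load with one spike.
load = [0.5, 0.5, 0.5, 3.0, 0.5, 0.5, 0.5]
best, next_value, errors = forecast(load)
print(best, next_value)
```

Because the forecast errors are observable after the fact, the choice of predictor can be revised every time a new measurement arrives.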
Prediction - Conclusions
Effective resource allocation and scheduling are critical to performance
Immediate performance history data is used to make implicit prediction
To be truly effective the performance gathering system must be robust, portable and non-intrusive
Overhead introduced by the performance-gathering system must be carefully controlled
Using fast, robust techniques it is possible to improve accuracy of performance predictions
Improve resource selection with prediction
Run time predictions
  statistical analysis of applications that have already run
  automatic code analysis or instrumentation
Explanation of two techniques, both using statistical data with information provided to scheduler upon run
Categorization prediction technique
Derive run time predictions from historical information based on previous similar runs
  many ways to look at similar applications: application name, user, arguments, submission time, etc.
  use of a genetic algorithm to identify good templates (e.g. user+time) for a given workload
  use a mean prediction type
  results: an average error of 39%
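The categorization technique above can be sketched as grouping past runs by a template and predicting the mean run time of the matching category. This minimal Python version fixes the template by hand instead of searching for it with a genetic algorithm; the field names and the history records are invented for the example.

```python
from collections import defaultdict
from statistics import mean

def build_categories(history, template):
    """Group past run times by the values of the template's fields."""
    categories = defaultdict(list)
    for job in history:
        key = tuple(job[f] for f in template)
        categories[key].append(job["runtime"])
    return categories

def predict(job, categories, template):
    """Mean run time of the job's category, or None if no similar run exists."""
    key = tuple(job[f] for f in template)
    runtimes = categories.get(key)
    return mean(runtimes) if runtimes else None

# Invented history of completed jobs (run times in seconds).
history = [
    {"user": "alice", "app": "sim", "runtime": 100},
    {"user": "alice", "app": "sim", "runtime": 140},
    {"user": "bob", "app": "sim", "runtime": 900},
]
template = ("user", "app")
categories = build_categories(history, template)

prediction = predict({"user": "alice", "app": "sim"}, categories, template)
unknown = predict({"user": "carol", "app": "sim"}, categories, template)
print(prediction, unknown)
```

A coarser template such as `("app",)` would give carol a prediction at the cost of mixing alice's and bob's very different run times, which is why the choice of template matters.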
Instance-based learning approach
Also called locally-weighted learning techniques
A database of experiences is maintained and used for predictions
  each entry consists of input and output features
  input is the condition under which the experience was observed
  output describes what happened under those conditions
Use genetic algorithm to find values that minimize prediction error
Error rate of 49%
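A locally-weighted prediction over an experience base can be sketched as a kernel-weighted average: experiences close to the query count more than distant ones. This is a generic Python illustration of the technique, not the authors' implementation; the Gaussian kernel, the fixed `bandwidth` (the kind of value a genetic algorithm could tune) and the sample experiences are assumptions.

```python
import math

def predict_runtime(query, experiences, bandwidth=1.0):
    """Kernel-weighted average of recorded outputs.

    experiences: list of (input_features, output) pairs, where input_features
    is a tuple of numbers with the same length as query.
    """
    weighted_sum = total_weight = 0.0
    for inputs, output in experiences:
        distance = math.dist(query, inputs)
        weight = math.exp(-(distance / bandwidth) ** 2)   # nearer means heavier
        weighted_sum += weight * output
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else None

# Invented experience base: (nodes requested, input size in GB) -> run time in seconds.
experiences = [
    ((4, 1.0), 100.0),
    ((4, 1.1), 110.0),
    ((64, 8.0), 900.0),
]
estimate = predict_runtime((4, 1.05), experiences)
print(round(estimate, 3))
```

The distant 64-node experience contributes essentially nothing here, so the estimate sits between the two nearby runs.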
Queue wait time predictors
Request to execute a job is not serviced immediately but put on a queue
Predictions of wait times are useful for such systems
  guide user to select appropriate queue
  submit multiple requests so they receive resources simultaneously
  plan other activities in supercomputer environments
Scheduling algorithms
Two methods are examined
Predict execution time for each application in the system and use this to drive a simulation of the scheduling algorithm
  potential to provide very accurate run time predictors
  if queue items depend on items not yet submitted to the queue, accuracy drops
  requires detailed knowledge of the scheduling system used
Predict wait time based on wait times of applications that were in a similar scheduler state
  e.g. how long will it take if I have 3 before me and 4 after?
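The first method, driving a simulation of the scheduler with predicted run times, can be sketched for a plain FCFS queue. This is a simplified Python model under stated assumptions: a single pool of identical nodes, no later arrivals, and exact knowledge of the scheduling rule; the function name and the sample workload are invented.

```python
import heapq

def fcfs_wait_times(running, queue, total_nodes, now=0.0):
    """Predict wait times by simulating FCFS with predicted run times.

    running: list of (predicted_end_time, nodes) for jobs already executing
    queue:   list of (name, nodes, predicted_runtime) in arrival order
    """
    free = total_nodes - sum(nodes for _, nodes in running)
    ends = list(running)
    heapq.heapify(ends)                      # earliest predicted completion first
    clock = now
    waits = {}
    for name, nodes, runtime in queue:
        if nodes > total_nodes:
            raise ValueError(f"{name} can never run on {total_nodes} nodes")
        while free < nodes:                  # wait for running jobs to finish
            end, released = heapq.heappop(ends)
            clock = max(clock, end)
            free += released
        waits[name] = clock - now            # strict FCFS: never before earlier jobs
        free -= nodes
        heapq.heappush(ends, (clock + runtime, nodes))
    return waits

waits = fcfs_wait_times(
    running=[(50.0, 6)],
    queue=[("a", 4, 100.0), ("b", 2, 10.0), ("c", 8, 5.0)],
    total_nodes=8,
)
print(waits)
```

Any error in the predicted run times, or any job submitted after the simulation is run, shifts these estimates, which is the weakness noted for this method.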
Scheduling Algorithms (2)
FCFS, LWF and conservative backfill
  First Come First Serve (FCFS): jobs run in order of arrival
  Least Work First (LWF): like FCFS, but jobs are ordered by estimated amount of work
  Conservative backfill is a variant of FCFS that allows a job to run earlier than it otherwise would, as long as it doesn't delay jobs waiting ahead of it in the queue
Results show that FCFS is most accurately predicted followed by backfill and LWF.
Both methods are affected by not knowing what applications will be submitted in the near future
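The difference between the FCFS and LWF orderings can be shown on a small queue. In this Python sketch, "work" is taken to be nodes multiplied by estimated run time, which is an assumption rather than something the slides define; the job records are invented, and conservative backfill is not modelled.

```python
def fcfs_order(queue):
    """First Come First Serve: order by arrival time."""
    return [job["name"] for job in sorted(queue, key=lambda j: j["arrival"])]

def lwf_order(queue):
    """Least Work First: order by estimated work, ties broken by arrival.

    Work is assumed here to be nodes * estimated run time.
    """
    return [job["name"] for job in
            sorted(queue, key=lambda j: (j["nodes"] * j["est_runtime"], j["arrival"]))]

queue = [
    {"name": "big", "arrival": 0, "nodes": 64, "est_runtime": 3600},
    {"name": "small", "arrival": 1, "nodes": 2, "est_runtime": 60},
    {"name": "medium", "arrival": 2, "nodes": 8, "est_runtime": 600},
]
print(fcfs_order(queue))
print(lwf_order(queue))
```

LWF depends directly on the run time estimates, so poor estimates reorder the queue, whereas FCFS order is fixed by arrival alone.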
Scheduling
Scheduling using run time predictions
  use application execution times for scheduling
  measure utilization and wait time
  improves backfill and LWF; minimal impact, but decreases mean wait time by 25%
Scheduling with advance reservations
  some applications want resources from multiple parallel computers to execute
  non-restartable applications are forced to use maximum wait times as predictions when scheduling
  even without reservations, performance can be increased with more accurate run time predictions