Fault Detection
Sathish S. Vadhiyar
Source/Credits: From Referenced Papers
Introduction
  Fine Grain Cycle Sharing (FGCS)
    Host computers allow guest jobs to utilize CPU cycles
  Availability of host computers varies
    Guest jobs may incur resource failures
  Need to predict availability of host computers
    A scheduling system can allocate guest jobs based on the availability of host computers
Kinds of Non Availabilities
  FRC (Failures Caused by Resource Contention)
    A guest job may significantly impact host processes
    Hence a guest job can be removed
  FRR (Failures Caused by Resource Revocation)
    A machine owner suspends resource contribution without notice
    Hardware-software failures occur
Resource Failure Prediction
  A multi-state failure model and application of a semi-Markov Process (SMP) to predict the temporal reliability
  Predicting probability that no resource failure will occur on a machine in a future time window
  Observing host resource usage values in a time window; calculating parameters of SMP based on host resource usage values
Multi-state resource failure model
  FRR – 2 states
    A machine is either available or unavailable
  FRC
    Failures when host processes incur noticeable slowdown due to contention from guest processes
    A host processor can first decrease the priority of guest processes; if this does not help, the guest process is terminated
    Measured host resource usage as indicators of noticeable slowdown
Initial Experiments
  To study relations between host resource usage and FRC, experiments were conducted to simulate resource contention between a guest process and host processes
  Host-group – an aggregated set of host processes with various resource usages
  Slowdown of host group – reduction of its CPU utilization due to a contending guest process
  Host programs are run with their isolated CPU usage between 10% and 100%
  Guest process – a CPU bound program
Experiments on CPU contention
  Also measured reduction rate of host CPU usage for a host-group
  Experiments repeated with different host groups with host priority 0, and guest priority 0 and 19 (renice)
  Measured reduction rate plotted as function of isolated host CPU usage, L_H
  Found 2 thresholds for L_H
    Th1 – highest value of L_H when guest process needs to be reniced to keep reduction rate below 5%
    Th2 – highest value of L_H when guest process needs to be suspended to keep reduction rate below 5%
State model for FRC
  3 states
    S1 - when L_H < Th1; ignore resource contention due to guest processes; slowdown already less than 5%
    S2 - when Th1 < L_H < Th2; renice guest processes for slowdown to be < 5%
    S3 - when L_H > Th2; terminate guest process
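The three-state rule above can be sketched as a small classifier. This is only an illustration: the threshold values `TH1` and `TH2` are hypothetical (the slides do not give the measured numbers), and the handling of the exact boundary values is an assumption.

```python
from enum import Enum


class HostState(Enum):
    S1 = 1  # contention ignored, guest keeps default priority
    S2 = 2  # guest must be reniced
    S3 = 3  # guest must be terminated


# Hypothetical thresholds, expressed as fractions of CPU usage.
TH1 = 0.20
TH2 = 0.60


def classify_cpu_state(l_h, th1=TH1, th2=TH2):
    """Map isolated host CPU usage l_h (fraction in [0, 1]) to S1, S2 or S3."""
    if not 0.0 <= l_h <= 1.0:
        raise ValueError("l_h must be a fraction between 0 and 1")
    if l_h < th1:
        return HostState.S1
    # Assumption: l_h == th1 falls into S2 and l_h == th2 falls into S3;
    # the slides only state the strict inequalities.
    if l_h < th2:
        return HostState.S2
    return HostState.S3
```

For example, with these placeholder thresholds a host group using 35% CPU in isolation would be classified as S2, so the guest would be reniced rather than killed.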
Experiments on CPU and Memory Contention
  When memory thrashing occurs
    Total memory of guest and host processes exceeds available memory size
  Experiments were conducted to verify memory thrashing does not depend on guest priority
  S4 – for failure due to memory thrashing
Multi-State Failure Model
  Proposed prediction algorithm is to predict the probability that a machine will never transfer to S3, S4, or S5 within a future time window
  Transitions
    Between S1, S2, S3 – decided by measured host CPU usage
    To S4 – when memory is limited
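A rough sketch of how a monitor might turn a trace of measurements into states and find the first failure in it. The function names, the sample layout, and the threshold defaults are all hypothetical. The memory check is done first on the assumption that thrashing is independent of guest priority, and S5 is never produced here because the slides do not state what triggers it.

```python
FAILURE_STATES = {"S3", "S4", "S5"}


def observe_state(l_h, host_mem_mb, guest_mem_mb, available_mem_mb,
                  th1=0.20, th2=0.60):
    """Return the model state for one sample (thresholds are placeholders)."""
    # S4: total memory demand of host and guest exceeds available memory.
    if host_mem_mb + guest_mem_mb > available_mem_mb:
        return "S4"
    # S1..S3: decided by measured host CPU usage only.
    if l_h < th1:
        return "S1"
    if l_h < th2:
        return "S2"
    return "S3"


def first_failure(samples):
    """Return (index, state) of the first failure state in a trace, or None."""
    for index, sample in enumerate(samples):
        state = observe_state(*sample)
        if state in FAILURE_STATES:
            return index, state
    return None
```

A trace with no entry in a failure state returns `None`, which corresponds to the event whose probability the prediction algorithm estimates.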
Semi-Markov Process Model (SMP)
  Applicable when next transition depends only on
    Current state
    How long the system has been in the current state
  Transition probabilities depend on amount of time elapsed since last change in state
  SMP is defined by a 3-tuple
    S – finite set of states
    Q – state transition matrix
    H – holding time mass function matrix
SMP (Contd…)
  The most important statistics of SMP - Interval transition probabilities, P
  To calculate P
    Continuous time SMP is expensive
    Hence the work develops a discrete time SMP model
SMP for Resource Availability
  TR – probability of never transferring to S3, S4 or S5 within an arbitrary time window, W
  S_init – initial system state
  W – W_init + T
  Q and H calculated based on statistics from history logs due to monitoring host resource usage
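One plausible way to obtain Q and H from a history log is to count transitions and holding times. The sketch below assumes the log is a list with one state label per discretization interval; this format and the function name `estimate_q_h` are assumptions, not something the slides specify. It also treats the first run in the log as a complete sojourn, which is a simplification.

```python
from collections import Counter, defaultdict
from itertools import groupby


def estimate_q_h(state_log):
    """Estimate Q[(i, k)] and H[(i, k)][l] from a per-interval state log."""
    # Collapse the log into runs: (state, number of intervals spent in it).
    runs = [(state, sum(1 for _ in group)) for state, group in groupby(state_log)]

    transitions = Counter()          # how often i -> k was observed
    holding = defaultdict(Counter)   # holding times seen before i -> k
    for (i, length), (k, _) in zip(runs, runs[1:]):
        transitions[(i, k)] += 1
        holding[(i, k)][length] += 1

    leaving = Counter()              # total transitions out of each state
    for (i, _k), count in transitions.items():
        leaving[i] += count

    q = {ik: count / leaving[ik[0]] for ik, count in transitions.items()}
    h = {ik: {l: c / transitions[ik] for l, c in lengths.items()}
         for ik, lengths in holding.items()}
    return q, h
```

The final run of the log has no observed successor, so it contributes nothing to either estimate.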
SMP for Resource Availability
  P_{i,j}(m) = P_{i,j}(W_init, W_init + m)
  P^1_{i,k}(l) – interval transition probabilities for a one-step transition
  d – time unit of a discretization interval
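As an illustration of the discrete-time computation, the sketch below evaluates TR with a textbook first-passage recurrence for a semi-Markov process with absorbing failure states. It is not taken from the referenced paper. The assumptions are that Q and H are stored as dictionaries keyed by `(i, k)`, that the system has just entered S_init at W_init, and that `steps` stands for T/d.

```python
from functools import lru_cache


def temporal_reliability(q, h, init, steps, failure_states):
    """Probability of not entering a failure state within `steps` intervals."""
    failure_states = frozenset(failure_states)

    @lru_cache(maxsize=None)
    def fail_within(i, m):
        # Probability of reaching a failure state within m intervals,
        # given that state i has just been entered.
        if i in failure_states:
            return 1.0
        if m <= 0:
            return 0.0
        total = 0.0
        for (src, k), q_ik in q.items():
            if src != i:
                continue
            for l, h_ikl in h[(src, k)].items():
                if l <= m:  # the transition i -> k completes inside the window
                    reached = 1.0 if k in failure_states else fail_within(k, m - l)
                    total += q_ik * h_ikl * reached
        return total

    return 1.0 - fail_within(init, steps)
```

Failure states have no outgoing entries in `q`, so the recursion stops as soon as one is reached. Memoization keeps the number of evaluated `(state, m)` pairs bounded by the number of states times `steps`.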
System Design and Implementation
  Client requests job submission
  Client's job scheduler queries the gateways on available machines for temporal availabilities
  Chooses a machine and spawns a guest job
  During job execution, monitor detects state transition and notifies gateway
  Gateway renices or kills the guest processes accordingly
  Resource monitor uses simple cpu commands like 'top' to calculate cpu usages
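The slides mention `top` for measuring CPU usage. As a self-contained alternative, the sketch below computes usage from two samples of the aggregate `cpu` line of Linux `/proc/stat`. This swaps in a different data source from the one described, and the function names are hypothetical.

```python
def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from a '/proc/stat' aggregate cpu line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user, nice, system, idle, iowait, irq, softirq, steal; later guest
    # fields are skipped because guest time is already counted in user time.
    ticks = [int(value) for value in fields[1:9]]
    idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
    return idle, sum(ticks)


def cpu_usage(previous_line, current_line):
    """Fraction of non-idle CPU time between two samples."""
    prev_idle, prev_total = parse_cpu_line(previous_line)
    cur_idle, cur_total = parse_cpu_line(current_line)
    elapsed = cur_total - prev_total
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    return 1.0 - (cur_idle - prev_idle) / elapsed
```

In a monitor, the two lines would be read from `/proc/stat` a fixed interval apart; the resulting fraction could then be compared against Th1 and Th2.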
Computation in Solving SMP
  Matrix sparsity in SMP is exploited to reduce computations
  The sparse matrix is constructed based on 2 facts:
    It takes a finite amount of time to transition from one state to another
    S3, S4, S5 are unrecoverable failure states