


IEEE TRANSACTIONS ON RELIABILITY, VOL. 39, NO. 1, 1990 APRIL

Optimal Strategies for Scheduling Checkpoints and Preventive Maintenance

E. G. Coffman Jr, Fellow IEEE
AT&T Bell Laboratories, Murray Hill

E. N. Gilbert, Fellow IEEE
AT&T Bell Laboratories, Murray Hill

Key Words - Preventive maintenance, Scheduling, Optimization

Reader Aids -
Purpose: Present and analyze several mathematical models
Special math needed for derivations: Elementary probability
Special math needed to use results: Same
Results useful to: Statisticians, system analysts

Abstract - At checkpoints during the operation of a computer, the state of the system is saved. Whenever a machine fails, it is repaired and then reset to the state saved at the latest checkpoint. In this paper, save times are known constants and repair times are random variables; failures are the epochs of a given renewal process. In scheduling the checkpoints, the cost of saves must be traded off against the cost of work lost when the computer fails. We show how to schedule checkpoints to minimize the mean total time to finish a given job. We obtain similar optimization results for the tails of the distribution of the finishing time, and certain variants of the basic model.

The large literature on checkpointing is discussed extensively by: Trivedi [7]; Tantawi & Ruschitzka [5]; Koren, Koren, Su [3]; Goyal et al. [2]; and Kulkarni, Nicola, Trivedi [4]. Most previous research concerns queueing models, with jobs of random sizes arriving at random times. By contrast, this paper considers one job of given running time, and designs checkpointing strategies to optimize various performance criteria. The work of Toueg & Babaoglu [6] is closely related; they derived a dynamic programming algorithm that calculates the optimal strategy for minimizing mean completion times in very general situations. The problems considered herein are more specialized but they usually have optimal strategies that we can exhibit explicitly.

Section 2 defines two mathematical models and their scheduling strategies. Section 3 analyzes model 1. Sections 3.1 & 3.2 characterize strategies for minimizing the mean time to finish a job of given size. Section 3.3 deals similarly with the tails of the distribution of the finishing time. Section 3.4 assumes failures are of two types: repairable and permanent; see Geist, Reynolds, Westall [1]. An optimal strategy is found for maximizing the probability that a given job finishes before the first permanent failure. Section 3 concludes with a discussion of various extensions of the results. Applications to the M/G/1 queueing system are also touched on. Section 4 concerns model 2 and the problem of maximizing the mean amount of useful work saved between consecutive failures, assuming that the computer is in constant operation.

1. INTRODUCTION

Some failures obliterate all work a computer has accomplished. When the computer resumes its service, jobs in progress when the computer failed must begin all over again. To protect against these losses, partially completed work can be saved on permanent storage (disk or tape) that is not affected by failures. Then only work-not-saved needs to be repeated.

It might be advisable to save work many times during long periods of computer operation. Although these saves prevent needless repetition of work, each save consumes time. To finish the entire job in a short time, some strategy must decide when to save. The best strategy depends on the criteria for optimization. We consider several such criteria.

In the literature, save times are often called checkpoints. At such times, preventive maintenance as well as saves can be scheduled. In the most common computer application, the operating system implements checkpoints and, at the same time, repairs certain types of failures. In other applications, computer users themselves can schedule checkpoints. For example, a user who is writing a long program at a terminal can save the input buffer occasionally to minimize the rewriting caused by failures that affect the buffer.

(Editors' note: Our standard notation & nomenclature is given in “Information for Readers & Authors” at the rear of each issue. This paper has been changed somewhat to conform to that standard.)

2. DEFINITIONS

2.1 Models

The size of the job can be specified by its required service time t. That is, a job of size t could be done in time t if no time were wasted on saves or on repeating lost work. In most of what follows, t is assumed given and the problem is to find a strategy that completes a job of size t while optimizing some speed criterion.

The time required to perform a save is taken as a fixed quantity, one time unit long. The repair time is also a fixed constant, R. In most problems repair times can be generalized to i.i.d. random variables with mean R. When the repair is finished, the time x to the next failure is a random variable with given Cdf, B(x), and Sf, B̄(x). Successive times x between failures are s-independent, ie, failures arrive by a renewal process. Failures can occur during the time required to perform a save.

0018-9529/90/1000-0009$01.00 © 1990 IEEE


Work is then lost, even though a save is in progress. As shown below, the possibility of losing work during a save must be considered in designing optimal strategies. Two models are considered.

Model 1: During each save the computer also receives preventive maintenance. The computer is in the same good-as-new condition after each save and after each failure has been repaired. B(x) is then the Cdf of time to the next failure at either of these times.

Model 2: No maintenance is performed during a save. The time to failure after a save does not necessarily have the Cdf, B(x). If the save is completed at time y after the last repair, the conditional Cdf for x is:

B(x|y) = [B(y + x) - B(y)]/B̄(y).

The two models are the same in the important case of exponentially distributed times to failure, ie, when

B̄(x) = exp(-λx), x > 0.

For then, regardless of the time y since the latest maintenance, the conditional Cdf for the time x to the next failure remains

B(x|y) = B(x).

Introducing a general B(x) instead of the exponential distribution adds only a slight complication to model 1. However, model 2 with a general B(x) is much harder to analyze. Most of what follows applies to model 2 only when B(x) is the exponential distribution.

In either model, work begins initially when the computer is like new, ie, the computer is new, or maintenance has just been performed. Again, this assumption is not needed if x is exponentially distributed.
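The conditional Cdf can be checked numerically. The sketch below (the parameter values λ and X are illustrative choices, not from the paper) confirms that B(x|y) = B(x) for an exponential distribution, while for a uniform distribution B(x|y) depends on y:

```python
import math

def cond_cdf(B, Bbar, x, y):
    # B(x|y) = [B(y + x) - B(y)] / B̄(y): Cdf of the time to the next
    # failure given that the machine has already survived y since repair.
    return (B(y + x) - B(y)) / Bbar(y)

lam = 0.5
B_exp = lambda x: 1.0 - math.exp(-lam * x)
Bbar_exp = lambda x: math.exp(-lam * x)

# Exponential case: B(x|y) = B(x) for every y, so models 1 and 2 coincide.
for y in (0.0, 1.0, 5.0):
    assert abs(cond_cdf(B_exp, Bbar_exp, 2.0, y) - B_exp(2.0)) < 1e-12

X = 10.0
B_unif = lambda x: min(x / X, 1.0)
Bbar_unif = lambda x: max(1.0 - x / X, 0.0)

# Uniform case: surviving to y makes failure more imminent, B(2|5) > B(2).
assert cond_cdf(B_unif, Bbar_unif, 2.0, 5.0) > B_unif(2.0)
```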

2.2 Strategies

A strategy is a rule that determines when work is to be saved. Here, strategies aim to finish jobs quickly and so only special kinds of strategies have any interest. For example, strategies must not begin a save when the remaining work can be completed in less than unit time. For even if the save succeeds (no failure occurs in the next time unit), the job could already be finished by not saving.

Several quantities are important in deciding whether or not to schedule a checkpoint. One is the size S of the unsaved (but not lost) work that has been completed. Another is the size of the work that still remains. If the work that remained after the last save had size t′, the work that remains now has size t′ - S. The time y since the latest maintenance is also important. However, for model 1, y = S, so y is not an extra consideration.

Because the computer is in new condition after each maintenance, additional information about failures and checkpoints before the latest ones has no bearing on future failures. Then strategies need not consider these earlier events. Our strategies use only t′, S, and (in model 2) y to determine the next checkpoint.

For example, a simple uniform strategy chooses k checkpoints beforehand and breaks the job of size t into k+1 consecutive parts, of size t/(k+1). Then a new checkpoint occurs whenever the size of the unsaved completed work reaches the level S = t/(k+1).

The most general strategy for model 1 is only slightly more complicated. Starting with a job of size t′ = t, work continues until the unsaved completed work S reaches a level S_1 (depending on t) at which the strategy dictates a save. If the save fails, the work must start over. Eventually, work of size S_1 is saved and the job remaining has size t′ = t - S_1. The next save is attempted when the unsaved completed work S reaches a new level S_2 that depends on t - S_1 (and hence only on t). Continuing in this way the strategy determines k checkpoints and the sizes S_1, …, S_k of the unsaved completed work at each checkpoint. Then, with t given, a strategy for model 1 is equivalent to a choice of constants: k, S_1, …, S_k. Because good strategies never save when less than one time-unit of work remains, the condition -

S_1 + … + S_k ≤ t - 1   (1)

is assumed to hold for all strategies. For example, if t ≤ 1 the only strategy has k = 0.
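As an illustration of these definitions, here is a minimal Monte Carlo sketch of model 1, assuming exponentially distributed failures; the strategy constants, λ, and R are illustrative choices. The empirical mean completion time is compared against the closed-form mean that section 3.1 derives for the exponential case:

```python
import math
import random

def run_job(t, S, lam, R, rng):
    """One simulation of model 1. S = [S1, ..., Sk]; part i completes S_i
    work plus a unit-time save (size s_i = S_i + 1); the last part is the
    remaining work t - S1 - ... - Sk with no save. Returns the total time."""
    parts = [Si + 1.0 for Si in S] + [t - sum(S)]
    total = 0.0
    for size in parts:
        while True:
            x = rng.expovariate(lam)      # time to next failure (good as new)
            if x >= size:                 # the part finishes before a failure
                total += size
                break
            total += x + R                # work lost; repair, restart the part
    return total

rng = random.Random(1)
t, lam, R = 10.0, 0.2, 2.0
runs = 20000
est = sum(run_job(t, [3.0, 3.0], lam, R, rng) for _ in range(runs)) / runs

# For exponential failures, section 3.1 gives the mean in closed form:
# m(t) = 3 ψ(4) with ψ(s) = (R + 1/λ)(exp(λs) - 1); all 3 parts have size 4.
exact = 3 * (R + 1.0 / lam) * (math.exp(lam * 4.0) - 1.0)
assert abs(est - exact) / exact < 0.05
```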

Strategies for model 2 are more complicated. Each checkpoint occurs when S reaches a level that depends on y as well as on t′. Then sizes of saved work, and even the number of checkpoints, are random variables that depend on when failures occur.

3. MODEL 1

3.1 Minimizing the mean time.

This subsection considers the strategy that minimizes m(t), the mean time required to finish a job of size t. The repair time can have an arbitrary distribution, with mean R. In theorem 1, strategies are specified by the constants k, S_1, …, S_k of section 2.2.

Theorem 1. The strategy without saves (k = 0) has -

m(t) = ψ(t) ≡ [R B(t) + ∫_0^t B̄(x) dx]/B̄(t).   (2)

A strategy with k ≥ 1 saves has -

m(t) = ψ(s_1) + … + ψ(s_{k+1}),   (3)

s_i = S_i + 1, i = 1, …, k,

s_{k+1} = t + k - s_1 - … - s_k = t - S_1 - … - S_k.   (4)

If t > 1, all strategies have -

m(t) ≥ ρ(t - 1),   (5)

ρ ≡ g.l.b. (greatest lower bound) of ψ(s)/(s - 1) in 1 < s ≤ t.


COFFMAN/GILBERT: OPTIMAL STRATEGIES FOR SCHEDULING CHECKPOINTS AND PREVENTIVE MAINTENANCE


Proof: The strategy with k = 0 continues work without checkpoints until the job is finished. A simple equation for m(t) follows by considering the time x to the first failure. If t ≤ x, an event of probability B̄(t), then the job is finished without loss and the time required is t. If x < t, the initial work for time x is lost. A repair of mean time R and a job of size t still remain. Then -

m(t) = t B̄(t) + ∫_0^t {x + R + m(t)} dB(x).

This equation can be solved directly for m(t) and the result (2) follows after an integration by parts. Eq (2) and the corresponding transform in (17) also appear in [6] with the same simple proofs; however, the results are used quite differently here.

If k ≥ 1 the strategy breaks the job into k+1 parts. Part i (i ≤ k) completes time S_i of the original job and also finishes a save, a job of unit size. Then the entire part i is a job of size S_i + 1 = s_i. The argument for (2) shows that it requires mean time ψ(s_i). Likewise, the last part completes the remaining work without a checkpoint, a job of size s_{k+1}. Eq (3) merely states that m(t) is the sum of the mean times of the separate parts.

From (4), s_i > 1 for i = 1, …, k. Also, s_{k+1} ≥ 1 follows from (1) and (4) if t > 1. Then ψ(s_i) ≥ ρ(s_i - 1), and (5) follows from (3). □
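The formula (2) holds for an arbitrary failure distribution. A sketch, using an arbitrarily chosen Weibull law (not from the paper), evaluates ψ(t) by quadrature and checks it against direct simulation of the no-save strategy:

```python
import math
import random

R = 1.0
# An arbitrarily chosen Weibull failure law (illustrative, not from the paper):
Bbar = lambda x: math.exp(-((x / 2.0) ** 1.5))   # Sf
B = lambda x: 1.0 - Bbar(x)                      # Cdf

def psi(t, n=4000):
    # ψ(t) of eq (2): [R B(t) + ∫_0^t B̄(x) dx] / B̄(t), trapezoidal rule
    h = t / n
    integ = sum(0.5 * h * (Bbar(i * h) + Bbar((i + 1) * h)) for i in range(n))
    return (R * B(t) + integ) / Bbar(t)

def mc_mean_time(t, runs, rng):
    # direct simulation of the no-save strategy
    total = 0.0
    for _ in range(runs):
        while True:
            u = 1.0 - rng.random()                       # u in (0, 1]
            x = 2.0 * (-math.log(u)) ** (1.0 / 1.5)      # Weibull sample
            if x >= t:
                total += t
                break
            total += x + R        # lost work x, then a repair of length R
    return total / runs

rng = random.Random(7)
t = 3.0
assert abs(mc_mean_time(t, 40000, rng) - psi(t)) / psi(t) < 0.03
```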

If k, S_1, …, S_k can be chosen to make all of ψ(s_i)/(s_i - 1), i = 1, …, k+1, close to ρ then m(t) is close to the lower bound ρ(t-1) and the strategy is close to optimal. For example, consider x exponentially distributed with mean 1/λ. Then (2) shows:

ψ(s) = (R + 1/λ)(exp(λs) - 1).   (6)

In 1 < s < ∞, ψ(s)/(s-1) of eq (6) has a unique minimum at s = s*, the root of

exp(λs)[1 - λ(s-1)] = 1.   (7)

ψ(s)/(s-1) is decreasing for s < s* and increasing for s* < s. Then ρ = ψ(t)/(t-1) if t ≤ s*, and ρ = ψ(s*)/(s*-1) otherwise.
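The root s* of (7) has no closed form but is easy to find by bisection. A sketch for λ = .1 (matching one row of table 1) also confirms that s* minimizes ψ(s)/(s - 1):

```python
import math

lam = 0.1

# f(s) = exp(λs)[1 − λ(s−1)] − 1; eq (7) says f(s*) = 0
f = lambda s: math.exp(lam * s) * (1.0 - lam * (s - 1.0)) - 1.0

lo, hi = 1.0 + 1e-9, 1.0 + 1.0 / lam   # f > 0 near s = 1, f = −1 at s = 1 + 1/λ
for _ in range(80):                     # bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
s_star = 0.5 * (lo + hi)

# s* minimizes ψ(s)/(s−1), which is proportional to (exp(λs) − 1)/(s − 1)
ratio = lambda s: (math.exp(lam * s) - 1.0) / (s - 1.0)
grid_min = min(ratio(1.0 + 0.001 * i) for i in range(1, 100000))
assert abs(ratio(s_star) - grid_min) < 1e-6
assert abs(s_star - 4.832) < 0.01   # the λ = .1 row of table 1: s* − 1 = 3.832
```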

If t ≤ s*, (5) becomes m(t) ≥ ψ(t), and (2) shows that the strategy without saves (k = 0) is optimal.

If s* < t, then a strategy that makes s_i = s* approximately, for all i = 1, …, k+1, has ψ(s_i) near ρ(s* - 1) and m(t) near ρ(t - 1). Then (5) shows that the strategy is close to optimal. Since the s_i satisfy (4), they cannot all be near s* unless k is chosen to make -

(t + k)/(k + 1) = (t - 1)/(k + 1) + 1 = s*   (8)

hold approximately. Although S_1, …, S_k are then nearly equal, the strategy differs from the uniform strategy of section 2.2 in leaving a larger job, of size s_{k+1} = t - S_1 - … - S_k = s*, to finish after the last checkpoint.

Table 1 gives s* - 1, the desired size S_i (i = 1, …, k) of the saved pieces, as a function of λ. The table also gives (exp(λs*) - 1)/[λ(s* - 1)], which appears in (6) and equals ρ/(1 + λR) when s* < t. Note that R enters ψ(s) in (6), and hence m(t) in (3), only through the factor (R + 1/λ). Then the optimal strategy is the same for all R. The next section continues this example.

TABLE 1
Parameters of the Minimum-Mean-Time Strategy for Exponentially Distributed x

λ        s* - 1     (exp(λs*) - 1)/[λ(s* - 1)]
.001     44.06      1.0461
.002     30.96      1.0660
.005     19.34      1.1070
.01      13.48      1.1558
.02       9.345     1.2299
.05       5.676     1.3963
.1        3.832     1.6212
.2        2.534     2.0274
.5        1.397     3.3144
1.         .8414    6.3054
2.         .4738    19.0589
5.         .1995    402.4
∞         0         ∞
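The integer k nearest to satisfying (8) is, in practice, found by direct search. A sketch with illustrative values of t, λ, and R compares the k minimizing (k+1)ψ((t+k)/(k+1)) with the noninteger k solving (8):

```python
import math

lam, R, t = 0.1, 1.0, 50.0    # illustrative values
psi = lambda s: (R + 1.0 / lam) * (math.exp(lam * s) - 1.0)   # eq (6)

# mean time of the equal-s_i strategy with k saves
m = lambda k: (k + 1) * psi((t + k) / (k + 1))
best_k = min(range(40), key=m)

s_star = 4.832                              # root of (7) for λ = .1 (table 1)
k_approx = (t - s_star) / (s_star - 1.0)    # the k that solves (8) exactly
assert abs(best_k - k_approx) <= 1.0
```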

More generally, similar arguments apply to any distribution B(x) for which ψ(s)/(s-1) has a unique minimum at some finite value s* in 1 < s < ∞. Again k = 0 is best if t ≤ s*. Otherwise, good strategies have k and S_i determined to make (8) and s_i = s* hold approximately. For another example of this kind, take x uniformly distributed over 0 ≤ x ≤ X, ie, B(x) = x/X, and -

ψ(s) = s(R + X - s/2)/(X - s),   (s < X).

Here, the condition (7) to determine s* is replaced by a quadratic equation,

(2R + X - 1)s² + 2Xs - 2X(R + X) = 0.

Now R appears in the equation for s* and affects the best strategy. The value R = 1/2 gives an especially simple result:

s* = √(2(X + 1)) - 1.
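The quadratic and its R = 1/2 special case can be verified directly; X here is an illustrative value:

```python
import math

def s_star_uniform(R, X):
    # positive root of (2R + X − 1) s² + 2X s − 2X(R + X) = 0
    a, b, c = 2.0 * R + X - 1.0, 2.0 * X, -2.0 * X * (R + X)
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

X = 3.0
# R = 1/2 collapses the quadratic to s² + 2s − (2X + 1) = 0, whose
# positive root is the closed form √(2(X+1)) − 1 quoted above.
assert abs(s_star_uniform(0.5, X) - (math.sqrt(2.0 * (X + 1.0)) - 1.0)) < 1e-12
```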

3.2 Convex ψ(s)

In general, (8) does not hold for an integer k. Then section 3.1 gives only an approximate guide to the strategy that minimizes m(t). For a given k, the numbers s_1, …, s_{k+1} cannot all equal s*. These numbers must be adjusted, subject to (4), or

s_1 + … + s_{k+1} = t + k,   (9)


to achieve a truly minimum m(t). The minimization is particularly easy if ψ(s) is a convex function of s. For instance, if x is exponentially distributed, (6) shows ψ(s) to be a multiple of exp(λs) - 1, and hence convex (ψ″(s) > 0). Likewise ψ(s) is convex if x is uniformly distributed.

Every convex function ψ(s) satisfies:

ψ(s_1) + … + ψ(s_{k+1}) ≥ (k+1) ψ([s_1 + … + s_{k+1}]/(k+1)).

Then (3) and (9) yield:

m(t) ≥ (k+1) ψ((t+k)/(k+1)).   (10)

But the bound (10) is achieved by taking s_1, …, s_{k+1} to have the common value:

s(k) = (t+k)/(k+1),   (11)

ie,

S_1 = … = S_k = (t-1)/(k+1).   (12)

Again, the work remaining after the last checkpoint has a larger size:

t - S_1 - … - S_k = s(k) = 1 + (t-1)/(k+1).   (13)

The extra time unit in (13) prevents saves from beginning when the job is within unit time of completion (cf. section 2.2).

For a given k, (12) achieves the minimum:

m(t) = (k+1) ψ((t+k)/(k+1)) = (t-1) ψ(s(k))/(s(k) - 1).   (14)

The best choice of k minimizes (14) but, in general, the minimum is not as small as ρ(t-1) in (5).

For another example, take a Cdf B(x) for which (2) gives a ψ(s) involving constants T and α along with R. If T > (1-α)R, then ψ(s) is convex and the optimal strategy is (12) for some k. If T < (1-α)R, ψ(s) is concave and (12) and (14) no longer apply. In fact, for any concave ψ(s), the following argument shows that the best strategy is never to save.

Note first, for any p in 0 ≤ p ≤ 1, that ψ(ps) ≥ pψ(s). For, since ps is an average ps = (1-p)·0 + p·s of 0 and s, ψ(0) = 0 and the concavity of ψ(·) imply:

ψ(ps) ≥ (1-p)ψ(0) + pψ(s) = pψ(s).   (15)

Now consider any strategy with k ≥ 1. From (15), with p = s_i/(s_1 + s_2),

ψ(s_i) ≥ [s_i/(s_1 + s_2)] ψ(s_1 + s_2)

for i = 1, 2. The first two terms of (3) are then:

ψ(s_1) + ψ(s_2) ≥ ψ(s_1 + s_2).   (16)

Now consider a new strategy that omits the first save. The two terms on the l.h.s. of (16) are then replaced by a single term. The new term is:

ψ(s_1 + s_2 - 1) ≤ ψ(s_1 + s_2), k ≥ 2,

ψ(t) ≤ ψ(t + 1) = ψ(s_1 + s_2), k = 1

(note that (2) defines ψ(s) to be increasing). In either case, removing a save gave a better strategy.

R enters the previous example in a critical way if α < 1. For then ψ(t) changes from convex to concave at R = T/(1-α). If R is large enough, failures waste long repair times that are best avoided by never spending extra time trying to save.

3.3 The Cdf.

Consider a strategy k, S_1, …, S_k for a job of size t. The time to complete the job, including times of all saves, repairs, and repeats of lost work, is a random variable τ with E{τ} = m(t). The probability distribution of τ, or equivalently its Laplace transform h(p) = E{exp(-pτ)}, is now obtained. To simplify the formulas, the repair time is now a constant R. Generalizing the repair time to a random variable would be easy to do and would require only that terms exp(-pR) be replaced by the Laplace transform of the pdf of R.

Theorem 2. The strategy with no saves has -

h(p) = ω(t,p) = B̄(t) exp(-pt)/[1 - exp(-pR) ∫_0^t exp(-px) dB(x)].   (17)

A strategy with k ≥ 1 saves has -

h(p) = ω(s_1,p) ω(s_2,p) ⋯ ω(s_{k+1},p),   (18)


where s_1, …, s_{k+1} are defined in (4).

Proof. With k=O, h (p) is found by considering the time x to the first failure. The reasoning used in theorem 1 now shows -

h ( p ) = B ( t ) exp( - p t ) + E{exp( - p ( x + R + ~ ) } d B ( x ) sb and (17) follows. Eq. (18), like (3), follows because T is the sum of the (s-independent) times required to complete the k + 1

Eq. (2) and (3) could have been derived from theorem 2 becausem(t) = -h’(O) and $(s) = - a w ( s , p ) / a p atp=O.

In principle, the Cdf for 7 is obtainable as an inverse Laplace transform of (1 8). Here only the tail of the distribu- tion, for large 7, is considered. The rate at which this tail decreases can be inferred from the singularities of h (p) . Because it is a mean, E{exp( - p 7 ) } , h ( p ) exists for O s p < 0 0 . If it exists even for - q < p < 00, then h( -4) = E{exp(q~)} < 00 and P r { 7 > ~ ~ } < h ( -q)exp( - q T 0 ) . Conversely, if Pr {T > 70) = 0 (exp ( - qT0) ) , then h (p) can be shown to ex- ist for -q<p<co.

Since h ( p ) is a product in (18), its singularities are those of its factors w ( si ,p) . The denominator -

parts into which the checkpoints break the job.

Di(p) = 1 -exp( -pR) exp( -px ) d B ( x ) (19)

of factor i is an increasing function o f p with ~ ( 0 ) = B ( s i ) . Suppose that B ( s i ) < 1 (ie, suppose the strategy does not deliberately wait so long for save i that a failure is certain). Then Di(0) > O and factor i remains finite over some range -qi<p<oo with -q i<O. Write q = min{ql ,..., q k + l } . Then, for any E > O , h{7>70} is 0 (exp[- ( q - e ) ~ ~ ] ) but not 0 (exP[- (q+e)701).

To obtain a τ distributed with a tail decreasing as fast as possible one must make q as large as possible. Suppose now that x has pdf, b(x) = dB/dx. Then (19) defines D_i(p) as a continuous function of s_i so that q_i is a solution of D_i(-q_i) = 0. Differentiating the equation D_i(-q_i) = 0 with respect to s_i shows that q_i is a decreasing function of s_i. Then each s_i should be small to make q large. But the s_i satisfy (9) and so the best choice again makes s_i = (t+k)/(k+1), ie, strategy (12) gives the τ with the fastest decreasing tail. Convex ψ(t) is not needed here.

Although a strategy with equal s_i is obtained again, the best k is now different from the one in section 3.1. Since s_i = (t-1)/(k+1) + 1, increasing k decreases s_i and that was shown above to increase q_i. Thus, there is no optimum k. Increasing k does decrease Pr{τ > τ_0} for all sufficiently large τ_0. However, as k increases, it may be necessary to look further out on the tail (larger τ_0) to see the improvement. The problem of minimizing Pr{τ > τ_0} for a fixed τ_0 requires more than just an asymptotic argument.
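The decay rates q_i can be computed by solving D_i(-q_i) = 0 numerically. A sketch for exponentially distributed x (pdf λ exp(-λx); λ and R are illustrative choices) exhibits q_i as a decreasing function of s_i:

```python
import math

lam, R = 0.5, 1.0    # illustrative values

def D(p, s):
    # denominator (19) for exponential failures with pdf λ exp(−λx)
    if abs(p + lam) < 1e-12:
        integ = lam * s
    else:
        integ = lam * (1.0 - math.exp(-(p + lam) * s)) / (p + lam)
    return 1.0 - math.exp(-p * R) * integ

def q(s):
    # root of D(−q, s) = 0; D(0, s) = B̄(s) > 0, D(−q, s) decreases in q,
    # and for these s, D(−λ, s) = 1 − exp(λR)·λs < 0, so the root is in (0, λ)
    lo, hi = 0.0, lam
    for _ in range(200):       # bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if D(-mid, s) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# q_i decreases as the piece size s_i grows: smaller pieces give faster tails
assert q(2.0) > q(4.0) > q(8.0)
```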

Each factor ω(s_i,p) in (18) has a pole at p = -q_i, the real root of D_i(p) = 0. When all s_i have the same value s = (t+k)/(k+1), all q_i have a common value q and h(p) in (18) has a pole of order k+1 at p = -q. Near -q,

h(p) = c(p + q)^{-(k+1)} + …,   (20)

for some constant c, where the missing terms are all of lower order. The term in (20) transforms into -

[c/k!] τ^k exp(-qτ) + … .   (21)

Ordinarily the other singularities of h(p) all lie in a half-plane to the left of -q and hence the missing terms in (21) all die away faster than exp(-qτ). Then (21) shows that the tail of the distribution for τ is O(τ^k exp(-qτ)).

3.4 Maximizing the completion probability.

This subsection modifies model 1 to allow two kinds of failures. There is an additional permanent failure that cannot be repaired. Times x between failures again have Cdf, B(x), but each new failure now has probability 1 - α > 0 of being permanent. When a permanent failure occurs, any job still in progress is never completed. Under these conditions m(t) ordinarily fails to exist. One cannot find minimum-mean-time strategies. However, one can try to maximize the probability that a job is ever finished.

This problem, studied originally in [1], can also be interpreted entirely in terms of ordinary (repairable) failures. Suppose a fraction 1 - α > 0 of the failures are just randomly designated as “special”. Then the strategy sought has the highest probability of completing a job before a special failure occurs. Such a strategy also tends to finish a job quickly, although it optimizes something different from m(t). The strategy has the advantage of depending on an adjustable parameter α. Small values of α lead to strategies that maximize the (perhaps slim) chance of doing the job in a very short time or with only few failures. These strategies might substitute for more difficult strategies that maximize the probability of finishing the job in a prescribed time.

Write f(t) for the probability that the strategy k, S_1, …, S_k completes a job of size t before a permanent failure occurs.

Theorem 3. The strategy with no saves has -

f(t) = φ(t) ≡ B̄(t)/(1 - αB(t)).

If k ≥ 1, then -

f(t) = φ(s_1) φ(s_2) ⋯ φ(s_{k+1}),   (22)

where s_1, …, s_{k+1} are given by (4). If t > 1, all strategies have -

f(t) ≤ exp(-ρ(t - 1)),   (23)

where ρ is the g.l.b. of -[log φ(s)]/(s - 1) in 1 < s ≤ t.

The proof of (22) is omitted, for it closely parallels the proofs of (3) and (18). The argument for (5) may be repeated for (23), with -log φ(s) and -log f(t) substituted for ψ(s) and m(t). □

Now good strategies have small values of -[log φ(s_i)]/(s_i - 1). If -log φ(s) is convex, the argument of section 3.2 shows that the best strategy has all s_i equal to s(k) in (11), but now with k determined to minimize -[log φ(s(k))]/(s(k) - 1). This solution applies when x is exponentially or uniformly distributed because both distributions have -log φ(s) convex.

A curious distribution -

B̄(x) = (1 - α)/(exp(β(x - 1)) - α)

on 1 ≤ x < ∞, with β a positive parameter, has -log φ(s) = β(s - 1). Then the g.l.b. ρ of theorem 3 has the value β and is achieved for every s > 1. All strategies have f(t) = exp(-β(t - 1)) and are therefore optimal.
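Assuming the curious distribution has the form B̄(x) = (1 - α)/(exp(β(x - 1)) - α) on x ≥ 1 (a reconstruction consistent with -log φ(s) = β(s - 1) when φ(s) = B̄(s)/(1 - αB(s))), a sketch confirms that every strategy yields the same completion probability:

```python
import math

alpha, beta = 0.7, 0.3    # illustrative values

# assumed form of the curious distribution (support x ≥ 1)
Bbar = lambda x: (1.0 - alpha) / (math.exp(beta * (x - 1.0)) - alpha)
# completion probability of a part of size s, per theorem 3
phi = lambda s: Bbar(s) / (1.0 - alpha * (1.0 - Bbar(s)))

def f(t, S):
    # eq (22): f(t) is the product of φ(s_i) over the k+1 parts
    sizes = [Si + 1.0 for Si in S] + [t - sum(S)]
    out = 1.0
    for s in sizes:
        out *= phi(s)
    return out

t = 6.0
expected = math.exp(-beta * (t - 1.0))   # −log φ(s) = β(s−1) ⇒ f(t) = e^{−β(t−1)}
assert abs(f(t, []) - expected) < 1e-12
assert abs(f(t, [2.0, 1.5]) - expected) < 1e-12   # any strategy: same f(t)
```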

The distribution B(x) = θx/(1 + θx) provides an example where -log φ(s) is concave. Then, as in section 3.2, the optimal strategy never saves.

3.5 Modifications.

The strategies of section 2.2 finish a job of size t but save only the first S_1 + … + S_k time units of the work. If the entire job must be saved as well as finished, the final save can be considered part of the job and t in theorems 1-3 should be replaced by t + 1. Now s_{k+1} is like s_1, …, s_k in including a unit time for saving. Whenever an optimal strategy has s_1 = … = s_{k+1}, the checkpoints break the job of size t into k+1 parts of size t/(k+1).

Sections 3.1-3.4 require the job to begin with the computer in new condition. Most problems begin at some time y > 0 since the computer last received maintenance. Then the time x to the first failure has a different Cdf B_1(x). This distribution determines new functions ψ_1(s_1), ω_1(s_1,p), and φ_1(s_1) that should replace ψ(s_1), ω(s_1,p), φ(s_1) in theorems 1-3. If y is known, B_1(x) is just B(x|y), given in section 2.1. If the job begins at a random time, chosen when the computer is not being repaired, then -

B_1(x) = (1/X̄) ∫_0^x B̄(u) du,

where X̄, the mean time-between-failures, is assumed to exist. Although the first checkpoint is then different from the rest, the job remaining after the first checkpoint begins with the computer in new condition. Then theorems 1-3 apply to the remaining checkpoints. These theorems will often determine S_2, …, S_k

to be all equal; then only an optimization on S_1 remains.

Instead of assuming that each save requires the same time, one might prefer a time ξ(S) depending on the size S of the unsaved completed work to be saved. The sizes of the first k pieces (including the saves) into which a strategy breaks a job are then s_i = S_i + ξ(S_i) for i = 1, …, k. If the job is to end with a save, the last piece also has size s_{k+1} = S_{k+1} + ξ(S_{k+1}) where S_{k+1} denotes t - S_1 - … - S_k. Eq (3), (18), (22) again hold, but with the new s_i replacing (4).

Results like theorems 1-3 also hold in the special case that ξ(S) is a linear function:

ξ(S) = A + BS.

For then -

s_1 + … + s_{k+1} = (1 + B)t + (k + 1)A.

If ψ(s) is convex, the strategy with k+1 saves (including the last one) that minimizes m(t) has all of s_1, …, s_{k+1} equal to A + (1 + B)t/(k + 1). In general, (5) becomes m(t) ≥ M(1 + B)t where M is the g.l.b. of ψ(s)/(s - A) in A < s ≤ t.

If A in ξ(S) = A + BS is small, there is an incentive to make frequent small saves. For example, with exponentially distributed x and with A = 0, ψ(s)/s is a minimum at s = 0. Taking k large and the S_i all small produces strategies with m(t) near (1 + B)(1 + λR)t.
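The large-k behavior with a linear save time can be checked numerically; λ, R, B, and t below are illustrative choices:

```python
import math

lam, R, B, t = 0.1, 2.0, 0.1, 10.0   # illustrative; ξ(S) = B·S, ie A = 0
psi = lambda s: (R + 1.0 / lam) * (math.exp(lam * s) - 1.0)   # eq (6)

def m(k):
    # k+1 equal pieces, each of S = t/(k+1) work plus a save of ξ(S) = B·S,
    # so every piece has size s = (1+B) t/(k+1)
    s = (1.0 + B) * t / (k + 1)
    return (k + 1) * psi(s)

limit = (1.0 + B) * (1.0 + lam * R) * t    # the claimed large-k limit
assert m(10) > m(100) > m(1000)            # frequent small saves help when A = 0
assert abs(m(100000) - limit) / limit < 1e-3
```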

For a final modification, consider jobs that arrive at random in a Poisson stream and form an M/G/1 queue awaiting service from the computer. The job sizes t can be chosen independently at random from some common distribution. This distribution, and the checkpoint strategy, determine the service-time distribution of the queue. If, for each given t, the checkpoint strategy is one that minimizes m(t), then the M/G/1 queue has a service-time distribution with minimum mean. That minimizes the traffic intensity ρ of the queue and ensures that a random arrival has maximum probability 1 - ρ of finding the server idle. It also maximizes the l.u.b. of arrival rates for which the queue remains stable. However, it may not minimize the mean time a job waits for service because that mean time depends on the standard deviation of the service-time distribution.

4. MODEL 2

4.1 The problem.

Model 2 is difficult because the distribution of the time x to the next failure depends on the time since the last failure. The mean time-between-failures is assumed to exist, although the first checkpoint is then different from the rest.


The problem to be considered has a different objective from the ones in section 3. Now t is not prescribed but is assumed to be so large that many failures occur before the job is completed. The objective is to maximize the mean amount of work saved between failures.

Consider a save that is completed at a time y after the last failure was repaired. The strategy uses y to decide on the time S = S(y) to wait before the next checkpoint. The strategy determines a function g(y), the mean size of additional work that is saved before the next failure. The next save succeeds with probability

B̄(S + 1 | y) = B̄(y + S + 1)/B̄(y)

(B̄ denotes the survival function of the time between failures) and so

g(y) = [S + g(y + S + 1)] B̄(y + S + 1)/B̄(y).   (24)

The best strategy chooses S to maximize g(y) in (24).

4.2 Solutions.

With x distributed exponentially, future failures are unrelated to the past and there should be a solution with g(y) = g and S(y) = S, independent of y. Eq. (24) then shows that

g = S / (exp[λ(S + 1)] − 1)

and that the maximizing S is a root of (1 − λS) exp[λ(S + 1)] = 1. But with s denoting S + 1, the s that maximizes g satisfies (7) again. The optimal strategy saves in the same increments S = s* − 1 that table 1 gave for the problem of minimizing m(t).
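The optimality condition (1 − λS) exp[λ(S + 1)] = 1 has no closed-form root, but it is easy to solve numerically. The following sketch (an illustration, not part of the original analysis; the function names are ours) finds the root by bisection and evaluates the resulting g:

```python
import math

def optimal_step(lam):
    """Root S of (1 - lam*S) * exp(lam*(S + 1)) = 1 on (0, 1/lam).

    The left side exceeds 1 at S = 0 and equals 0 - 1 < 1 at S = 1/lam,
    so a sign change is bracketed and bisection applies."""
    def f(S):
        return (1.0 - lam * S) * math.exp(lam * (S + 1.0)) - 1.0

    lo, hi = 0.0, 1.0 / lam
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mean_saved(lam):
    """g = S / (exp(lam*(S + 1)) - 1) evaluated at the optimal S."""
    S = optimal_step(lam)
    return S / (math.exp(lam * (S + 1.0)) - 1.0)
```

For 1/λ = 1 this gives g ≈ 0.1586, in agreement with the optimal column of table 3.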

More generally, (24) is a recurrence relating g(y) to values of g(·) at larger values of the argument. If times x between failures are distributed over only a finite interval 0 ≤ x ≤ X, then g(y) = 0 in X − 1 < y ≤ X: a failure is sure to occur before more work can be saved. Knowing g(y) = 0 in X − 1 < y ≤ X, one can use (24) to work backward to smaller values of y and ultimately find g(0), which is the mean size-of-work saved between failures. A special example with uniformly distributed x,

B(x) = min{x/X, 1}, x ≥ 0,

illustrates the procedure. A change of variable

η = X − y, G(η) = g(X − η)

is convenient. Then

G(η) = 0, in 0 ≤ η < 1,

and G(η) for larger η is found from (24) in the form:

ηG(η) = max_S {(η − S − 1)[S + G(η − S − 1)]}.   (25)

In (25), S is to be chosen, depending on η, so as to maximize G(η). The quantity of main interest is g(0) = G(X). But X does not appear in (25) and so X does not enter into the form of G(·) nor into the choice of interval size S that maximizes G(η) at each iteration.

To begin the iteration suppose that 0 ≤ η − S − 1 < 1, which is surely so if 1 ≤ η ≤ 2. Then (25) shows

ηG(η) = max_S {(η − S − 1)S} = (η − 1)²/4,   (26)

with the maximum occurring at:

S = (η − 1)/2.   (27)

In fact, even if η is in the larger range I_1 = [1,3], the choice (27) puts η − S − 1 in I_0 = [0,1] and produces an ηG(η) as large as (26). It is not obvious, when 2 < η ≤ 3, that a smaller choice of S (putting η − S − 1 in [1,2]) might not make ηG(η) even larger. But, with η − S − 1 in [1,2], (26) shows that (η − S − 1)G(η − S − 1) = (η − S − 2)²/4, and then (25) gives:

ηG(η) − (η − 1)²/4 = (η − S − 1)S + [(η − S − 2)² − (η − 1)²]/4   (28)

= −(1 − S){3(η − S − 1) − η}/4.

With 1 ≤ η − S − 1 ≤ 2 and 2 ≤ η ≤ 3, then

3(η − S − 1) − η ≥ 3·1 − 3 = 0,

1 − S = (η − S − 1) − (η − 2) ≥ 1 − 1 = 0.

Then (28) shows ηG(η) ≤ (η − 1)²/4; the solution (26) and (27) holds even in the larger range I_1 = [1,3].
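The factorization used in (28) is elementary algebra, and can be spot-checked numerically; the sketch below is an illustration (ours, not part of the original text):

```python
import random

def quadratic_form(eta, S):
    # (eta - S - 1)*S + [(eta - S - 2)**2 - (eta - 1)**2] / 4
    return (eta - S - 1.0) * S + ((eta - S - 2.0) ** 2 - (eta - 1.0) ** 2) / 4.0

def factored_form(eta, S):
    # -(1 - S) * (3*(eta - S - 1) - eta) / 4
    return -(1.0 - S) * (3.0 * (eta - S - 1.0) - eta) / 4.0

random.seed(1)
for _ in range(1000):
    eta = random.uniform(2.0, 3.0)
    S = random.uniform(0.0, eta - 2.0)   # keeps eta - S - 1 in [1, 2]
    assert abs(quadratic_form(eta, S) - factored_form(eta, S)) < 1e-12
```

On the stated range both factors of the factored form are nonnegative, so the difference is nonpositive, which is exactly the inequality ηG(η) ≤ (η − 1)²/4.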

Continuing in this way one finds that G(η) and S have different analytic formulas in different ranges I_0, I_1, I_2, … of η.

Theorem 4. When times between failures are uniformly distributed over 0 ≤ x ≤ X, the strategy that maximizes the mean work-saved-between-failures is obtained as follows. When a save has been completed at time y after the last failure, let η = X − y; η is in one of the intervals I_0, I_1, …, where:

I_k = [½k(k+1), ½(k+1)(k+2)].   (29)

If η ∈ I_k, the maximum expected size of future work that can be saved is:

G_k(η) = kη/(2(k+1)) − k/2 + k(k+1)(k+2)/(24η).   (30)

If k ≥ 1, the maximizing strategy waits for a time


S_k(η) + 1 = [η + ½k(k+1)]/(k + 1)   (31)

before starting the next save. If no failure occurs before the save is finished, the next value of η is η' = η − S_k(η) − 1, which satisfies

η' ∈ I_{k−1}.   (32)
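The closed forms (29)-(31) can be checked against each other and against table 2 numerically; the sketch below is an illustration (function names are ours, not the paper's):

```python
def interval_index(eta):
    """Return k with eta in I_k = [k(k+1)/2, (k+1)(k+2)/2]."""
    k = 0
    while eta > (k + 1) * (k + 2) / 2.0:
        k += 1
    return k

def G_k(k, eta):
    """Eq. (30): maximum mean future work saved when eta lies in I_k."""
    if k == 0:
        return 0.0
    return k * eta / (2.0 * (k + 1)) - k / 2.0 + k * (k + 1) * (k + 2) / (24.0 * eta)

def S_k(k, eta):
    """Eq. (31): optimal waiting time before the next save."""
    return (eta + k * (k + 1) / 2.0) / (k + 1) - 1.0
```

For X = 10 (so η = X lies in I_3) this gives G_3(10) = 2.5, reproducing the "maximum" entry of table 2, and η' = η − S_3(η) − 1 = 6 falls in I_2 as (32) requires.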

4.3 Comments on Theorem 4.

The next section proves theorem 4. Some comments are given first.

Just after a failure has been repaired, y = 0 and η = X. If X ∈ I_k, the maximizing strategy will save an expected work G_k(X). It first waits for time S_k(X) before the next checkpoint. If the next save succeeds, the next η is an η' in I_{k−1} and the strategy next waits time S_{k−1}(η') before saving again. Checkpoints continue with values of η in I_{k−2}, I_{k−3}, … until a failure occurs.

The optimal strategy is very different from the ones found for model 1. Intervals S_k(η) between saves now grow shorter and shorter with time because there is a growing chance that a failure will occur soon.

If X is large, (29) and (30) show that work of mean size near X/2 can be saved. One could not hope for more because X/2 is the mean time-between-failures.

A simpler fixed-step strategy, saving work in steps of fixed size S, makes an interesting comparison with the maximizing strategy of theorem 4. The strategy has two parameters, S and K. Each new checkpoint occurs after waiting time S since the completion of the previous save or repair of a failure. If as many as K saves succeed, the strategy schedules no more checkpoints and waits for the next failure to be repaired before starting over. Since time S + 1 is spent between saves, the K and S are constrained by:

K(S + 1) ≤ X.

For k = 1, …, K, save k is attempted; it succeeds if no failure occurs for a time k(S + 1). The mean work saved at checkpoint k is S[1 − k(S + 1)/X]. The total mean work saved is obtained by summing on k:

E{work saved} = SK[1 − (K + 1)(S + 1)/(2X)].

A good choice of parameters is:

K = ⌊√X⌋,   S = X/(K + 1) − ½.

Note that, for this choice,

½K(K + 1) ≤ X,

so that K(S + 1) ≤ X holds for all X ≥ 1. Again E{work saved} is near X/2 for large X. Moreover, as table 2 shows, this fixed-step strategy is nearly optimal for all values of X. The strategy is sometimes improved by picking a smaller K (the choice K = 1 would have saved work of mean size 0.5625 when X = 4, and minor improvements are obtainable when X = 10, 50, 100, 200, and 400). However, table 2 already shows that the mean work saved is not a sensitive function of strategy.
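The fixed-step formulas are simple enough to evaluate directly; this sketch (ours, for illustration) reproduces the fixed-step column of table 2 from the suggested parameter choice:

```python
import math

def fixed_step_mean_saved(X):
    """E{work saved} = S*K*(1 - (K + 1)*(S + 1)/(2*X)) with the suggested
    parameters K = floor(sqrt(X)) and S = X/(K + 1) - 1/2
    (failure times uniform on [0, X])."""
    K = int(math.sqrt(X))
    S = X / (K + 1) - 0.5
    return S * K * (1.0 - (K + 1) * (S + 1.0) / (2.0 * X))
```

For example, fixed_step_mean_saved(10) gives 2.4, matching the table entry for X = 10.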

A greedy strategy maximizes the mean size of the work saved on the next attempt, without regard to future saves. It continues in this way until a failure occurs. The mean size of saved work from future saves is again g(y) in (24), but each save chooses S only to maximize the first term:

S B̄(y + S + 1)/B̄(y).

TABLE 2
Mean Work-Saved-Between-Failures for Three Strategies

Times between failures are uniformly distributed on [0,X].

   X      maximum   fixed step     greedy
   2        .1250       .1250       .1250
   4        .5833       .5208       .5781
   6       1.1667      1.1250      1.1354
   8       1.8125      1.7604      1.7285
  10       2.5000      2.4000      2.3453
  15       4.3333      4.2250      3.9333
  20       6.2708      6.1250      5.5525
  30      10.3250     10.1250      8.8319
  50      18.8250     18.5150     15.4468
  70      27.6048     27.2397     22.0877
 100      41.0661     40.5920     32.0666
 200      87.1625     86.4646     65.3716
 400     181.6409    180.6074    132.0216
1000     470.6839    468.9990    332.0100
2000     958.3350    955.9015    665.3388

If x is exponentially distributed the greedy strategy maximizes S exp[−λ(S + 1)], independent of y, by taking S = 1/λ. By contrast, the optimal strategy has been shown to take S = s* − 1, given in table 1. If 1/λ is large, s* is approximately √(2/λ), and the greedy strategy waits much too long between saves, trying to save work in big pieces. Table 3 compares the mean work that these strategies save between failures against the mean time 1/λ between failures. The optimal strategy saves a mean fraction exp(−λs*) of the available time 1/λ. If 1/λ is large this fraction is near 1; the greedy strategy has a smaller fraction, near 1/(e − 1) = 0.5820. If 1/λ is small there is rarely enough time for more than one save; then the greedy strategy is nearly optimal.
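The greedy numbers in table 3 follow directly from the closed form for the exponential case; a quick sketch (ours, for illustration):

```python
import math

def greedy_mean_saved(lam):
    """Greedy strategy for exponential failures: wait S = 1/lam, so the mean
    work saved between failures is S / (exp(lam*(S + 1)) - 1)."""
    S = 1.0 / lam
    return S / (math.exp(lam * (S + 1.0)) - 1.0)
```

As λ → 0 the saved fraction λ·greedy_mean_saved(λ) approaches 1/(e − 1) ≈ 0.582, the limiting fraction quoted above.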


TABLE 3
Mean Size-of-Work-Saved-Between-Failures for Two Strategies

Exponential distribution of x: B̄(x) = exp(−λx).

  1/λ     optimal     greedy
 1000     955.94      581.06
  500     469.04      290.07
  200     180.66      115.48
  100      86.52       57.29
   50      40.66       28.20
   20      14.32       10.77
   10       6.168       4.990
    5       2.466       2.155
    2        .6034       .5744
    1        .1586       .1565
   .5        .02623      .02620
   .2        .000497     .000497

If x is uniformly distributed on [0,X], the greedy strategy picks S to maximize:

S(X − y − S − 1)/(X − y).

Then S = (X − y − 1)/2. Again introduce η ≡ X − y and G(η) ≡ g(X − η). The greedy strategy puts (24) into the form

ηG(η) = (η − 1)²/4 + ½(η − 1) G(½(η − 1)).

Now ηG(η) has different quadratic expressions in intervals J_0, J_1, … with J_k = [2^k − 1, 2^{k+1} − 1]. In particular,

ηG(η) = (η − 1)²/4 + (η − 3)²/16, η ∈ J_2.

The results for the greedy strategy appear in table 2. When X ≤ 3, the greedy and optimal strategies agree because both strategies attempt only one save. The greedy strategy resembles the optimal strategy, in general, because it waits for shorter times to make its later saves. Indeed, for 3 < X < 6.4 the greedy strategy is still slightly better than the fixed-step strategy. However, when X is large the first steps of the greedy strategy are much too large and the greedy strategy is poor.

4.4 Proof of Theorem 4.

The proof is by induction on k. The theorem has already been verified for k = 0 and 1. Suppose it is true for 0, 1, …, k − 1, and consider now any η ∈ I_k. A waiting time S must be chosen to maximize G(η) in (25).

Again different arguments apply, depending on the size of S. If S is large enough so that

η − S − 1 ≤ ½k(k+1),   (33)

then η − S − 1 ∈ I_K for some K < k. Then (25) shows that

ηG(η) ≤ (η − S − 1){S + G_K(η − S − 1)}

ηG(η) ≤ (η − S − 1){S + G_{k−1}(η − S − 1)},   (34)

the last inequality following from an identity that can be derived from (30). A straightforward differentiation shows that the r.h.s. of (34) achieves the maximum value ηG_k(η) at S = S_k(η). Moreover, with η ∈ I_k, (29) and (31) show that

η' = η − S_k(η) − 1 = kη/(k+1) − k/2,

which satisfies

½k(k−1) ≤ η' ≤ ½k(k+1)

because ½k(k+1) ≤ η ≤ ½(k+1)(k+2). Then η' ∈ I_{k−1}, as required by (32), and so equality in (34) is achievable by taking S = S_k(η).

Although (33) and G(η) = G_k(η) surely hold if

½k(k+1) ≤ η < ½k(k+1) + 1,

the other values of η in I_k admit values of S for which η − S − 1 ∈ I_k. Perhaps G(η) can be made larger than G_k(η) by choosing S so small that η − S − 1 ∈ I_k. A proof by contradiction will now show that choosing S to make η − S − 1 ∈ I_k is not a good strategy. Let η_0 denote any value in I_k such that the following holds:

G(η) = G_k(η) for all η in I_k with η ≤ η_0.

For instance, values less than ½k(k+1) + 1 in I_k can serve for η_0. It suffices to prove that G(η) = G_k(η) also holds for all values of η in I_k that satisfy η_0 < η ≤ η_0 + 1.

Suppose η ≤ η_0 + 1. If η − S − 1 ≤ ½k(k+1), ie, (33), then the maximum G(η) has already been found to be G_k(η). Otherwise, η − S − 1 ∈ I_k; but also η − S − 1 ≤ η_0. Then G(η − S − 1) = G_k(η − S − 1) and (25) is

ηG(η) = (η − S − 1){S + G_k(η − S − 1)}.


From (30), the r.h.s. is a quadratic in S with maximum ηG_{k+1}(η) at S = S_{k+1}(η). But this maximum is not achievable because η − S_{k+1}(η) − 1 is too small to lie in I_k. The largest ηG(η) is obtained if η − S − 1 = ½k(k+1), the left endpoint of I_k:

ηG(η) = ½k(k+1)[η − 1 − ½k(k+1) + G_k(½k(k+1))]

= ½k(k+1)[η − (3k² + 11k + 10)/12].

But then (30) yields

ηG_k(η) − ½k(k+1)[η − (3k² + 11k + 10)/12] = [k/(2(k+1))][η − ½(k+1)(k+2)]² ≥ 0;

ie, the best strategy must settle for G_k(η). □

ACKNOWLEDGMENT

We are grateful to Dr. Robert Geist for identifying and correcting several flaws in the original typescript.

REFERENCES

[1] R. Geist, R. Reynolds, J. Westall, "Selection of a checkpoint interval in a critical task environment", IEEE Trans. Reliability, vol 37, 1988, pp 395-400.
[2] A. Goyal, V. Nicola, A. N. Tantawi, K. S. Trivedi, "Reliability of systems with limited repairs", IEEE Trans. Reliability (Special Issue on Fault Tolerant Computing), vol R-36, 1987, pp 202-207.
[3] I. Koren, Z. Koren, S. Su, "Analysis of a class of recovery procedures", IEEE Trans. Computers, vol C-35, 1986, pp 703-712.
[4] V. G. Kulkarni, V. F. Nicola, K. S. Trivedi, "Effects of checkpointing and queueing on program performance", Research Rep. RC 13283, IBM, Yorktown Heights, NY, USA.
[5] A. N. Tantawi, M. Ruschitzka, "Performance analysis of checkpointing strategies", ACM Trans. Computer Systems, vol 2, 1984, pp 123-144.
[6] S. Toueg, Ö. Babaoğlu, "On the optimum checkpoint selection problem", SIAM J. Comput., vol 13, 1984, pp 630-649.
[7] K. S. Trivedi, "Reliability evaluation for fault tolerant systems", in Mathematical Computer Performance and Reliability, North-Holland, 1983, pp 403-414.

AUTHORS

Dr. E. G. Coffman Jr.; Rm 2D-155; AT&T Bell Laboratories; Murray Hill, New Jersey 07974-2070 USA. Edward G. Coffman Jr. (M ’61, SM ’83, F ’84) began work as a computer

scientist in 1958 at the System Development Corporation, where he was a system programmer until 1965. At the conclusion of this period his graduate studies at UCLA culminated in the PhD degree in engineering. From 1966 to 1979 he served on the Computer Science Faculties at Princeton University, Princeton, The Pennsylvania State University, University Park, Columbia University, New York, and the University of California, Santa Barbara. One year appointments were held at the University of Newcastle upon Tyne (1969) and at the Institut de Recherche d'Informatique et d'Automatique (1975) in France. Since 1979 he has been a Member of the Technical Staff at Bell Laboratories, Murray Hill. His research has concentrated on the mathematical modeling and analysis of system performance.

Dr. Edgar N. Gilbert; Rm 2C-381; AT&T Bell Laboratories; Murray Hill, New Jersey 07974-2070 USA. Edgar N. Gilbert (M '67, SM '73, F '74) was born in Woodhaven, New York,

in 1923. He received the BS degree in physics from Queens College, Flushing, and the PhD degree in mathematics from the Massachusetts Institute of Technology (MIT), Cambridge in 1943 and 1948, respectively. In 1943 he taught physics at the University of Illinois, Urbana. During 1944-1946 he designed radar antennas as a member of the staff of the MIT Radiation Laboratory. In 1948 he joined the Mathematics Research Center at Bell Laboratories, Murray Hill, where he has been engaged in mathematical studies related to communication theory.

Manuscript TR88-152 received 1988 August 22; revised 1989 August 11.

IEEE Log Number 31571.
