
2015 NCSL International Workshop & Symposium

Instrument Adjustment Policies

Speaker/Author: Paul Reese

Baxter Healthcare Corporation

25212 West Illinois Route 120

Mail Stop: WG2-2S

Round Lake, IL 60073

Phone: (224) 270-4547 Fax: (224) 270-2491

E-mail: [email protected]

Abstract

Instrument adjustment policies play a key role in the reliability of calibrated instruments to

maintain their accuracy over a specified time interval. Periodic review and adjustment of

assigned calibration intervals is required by national standard ANSI/NCSL Z540.3 and is

employed to manage the End of Period Reliability (EOPR) to acceptable levels. Instrument

adjustment policies may also be implemented with various guardband strategies to manage false

accept risk. However, policies and guidance addressing the routine adjustment of in-tolerance

instruments are not so well established. National and international calibration standards

ANSI/NCSL Z540.3 and ISO/IEC-17025 do not mandate any particular adjustment policy with

regard to in-tolerance equipment. Evidence has been previously presented where routine

adjustment of in-tolerance items may even degrade performance. Yet, this important part of the

overall calibration process is often left to the discretion of the calibrating technician based on

heuristic assessment. Astute adjustment decisions require knowledge of the random vs.

systematic nature of instrument error. Instruments dominated by systematic effects, such as drift,

benefit from adjustment, while those displaying more random behavior may not. Monte Carlo

methods are used here to investigate the effect of various adjustment thresholds on in-tolerance

instruments.

1. Background

Instrument adjustment policies during calibration vary among different organizations. Such

policies can generally be classified into one of three categories:

1) Adjust always

2) Adjust only if Out-Of-Tolerance (OOT)

3) Adjust with discretion when In-Tolerance and always when OOT

While the first two polices are essentially self-explanatory, the third category deserves further

attention. Herein, a discretionary adjustment is one in which the calibration technician (or

software) makes a decision to adjust an instrument, which is observed to be in-tolerance, based

on consideration of additional factors. Discretionary adjustment may sometimes be performed in

conjunction with guardbanding strategies to mitigate false-accept-risk. Guardbanding techniques

often require discretionary adjustments to be made where low Test Uncertainty Ratio (TUR)

and/or End Of Period Reliability (EOPR) is encountered. Significant literature exists on this subject

[23-23].


However, this paper endeavors to provide an investigation into discretionary adjustments of in-

tolerance instruments which are made, not to mitigate false accept risk, but as a preemptive

measure in an attempt to reduce the potential for future out-of-tolerance (OOT) conditions. A

reduction in OOT probability can translate into improved EOPR reliability. Such adjustments are

often made “on the bench” at the discretion of the calibration technician when the observed error

is deemed too close to the tolerance limits. Organizations may have a blanket policy

or threshold in place that defines, in a broad sense, what “too close” is. This adjustment

threshold may be 70 % of specification, 80 % of specification, or any other arbitrary value. The

intent of this policy may be to improve accuracy and mitigate future OOT conditions, improving

EOPR. The objective of this paper is to investigate whether such adjustments can, in fact,

provide an increase in accuracy and a reduction in OOT probability (increased EOPR) and, if so,

by how much and under what conditions. The possibility of calibration adjustments unwittingly

degrading performance is also investigated.

There are no national or international standards which dictate or require adjustment during

calibration, unless an instrument is found OOT or the observed error fails to meet guardband

criteria. ANSI/NCSL Z540.3-2006 and ISO/IEC-17025:2005 do not mandate discretionary

adjustment of in-tolerance items [1 - 3]. The International Vocabulary of Metrology (VIM)

clearly defines calibration, verification, and adjustment as separate actions [4]. Adjustment is not

a de facto aspect of calibration. As defined by the VIM:

Calibration: Operation that, under specified conditions, in a first step, establishes a relation between the quantity values with measurement uncertainties provided by measurement standards and corresponding indications with associated measurement uncertainties and, in a second step, uses this information to establish a relation for obtaining a measurement result from an indication… NOTE 2: Calibration should not be confused with adjustment of a measuring system, often mistakenly called “self-calibration”, nor with verification of calibration.

Adjustment of a measuring system: Set of operations carried out on a measuring system so that it provides prescribed indications corresponding to given values of a quantity to be measured… NOTE 2: Adjustment of a measuring system should not be confused with calibration, which is a prerequisite for adjustment.

Verification: Provision of objective evidence that a given item fulfils specified requirements… EXAMPLE 2: Confirmation that performance properties or legal requirements of a measuring system are achieved… NOTE 3: The specified requirements may be, e.g. that a manufacturer's specifications are met… NOTE 5: Verification should not be confused with calibration.

Despite these established definitions, there have been recent accounts where entities regulated by

the Food and Drug Administration (FDA) have received Form-483 Investigational Observations

and Warning Letters arising from the failure to always adjust in-tolerance instruments (i.e. all

instruments) during calibration [5]. These incidents may be attributable to a nebulous distinction

between the definitions of calibration, verification, and adjustment. References to similar events

in regulated industries have also been published [6 - 8] where calibration requirements have

been inferred to mandate adjustment during calibration.


Consistent with the VIM definitions, a calibration, where a pass/fail conformance decision is

made, also satisfies the definition of a verification. However, the converse is not true; not all

verifications are calibrations. This distinction is important because, for example, not all

calibrations result in a pass/fail conformance decision being issued. Such is the case for most

calibrations performed by National Metrology Institutes (NMI) and some reference standards

laboratories where calibrations are routinely performed and no pass/fail conformance decision is

made. The definition of calibration requires no such conformance decision be rendered. In these

cases, calibration consists of the measurement data reported along with the measurement

uncertainty. Such operations still adhere to the VIM definition of calibration, but they are not

verifications, since no statement of conformance to metrological specifications is given.

However, calibrations that do result in a statement of conformance (i.e. pass/fail) with respect

to an established metrological specification are also verifications. In such scenarios, the

definitions of calibration and verification are both applicable. However, the absence of

“adjustment of a measuring system” during calibration in no way negates or disqualifies the

proper usage of the term calibration. Many instruments do not lend themselves to adjustment

and are not designed to be physically or electronically adjusted to periodically nominalize their

performance for the purpose of reducing measurement errors; yet, such instruments are still quite

capable of being calibrated. The distinction is readily apparent as indicated by ANSI/NCSL

Z540.3-2006 section 5.3a and 5.3b shown below [1, 2].

5.3 Calibration of Measuring and Test Equipment

a) Where calibrations provide for reporting measured values, the measurement uncertainty shall be acceptable to the customer and shall be documented.

b) Where calibrations provide for verification that measurement quantities are within specified

tolerances, the probability that incorrect acceptance decisions (false accept) will result from

calibration tests shall not exceed 2 % and shall be documented. Where it is not practicable

to estimate this probability, the test uncertainty ratio shall be equal to or greater than 4:1.

2. NCSLI RP-1: Establishment and Adjustment of Calibration Intervals

As stated, discretionary adjustments of in-tolerance instruments are often left to the judgment of

the calibration technician, or governed by organizational policy. When deferred to the discretion

of the technician, such adjustments are optimally based on professional evaluation by qualified

personnel with experience and training in the metrological disciplines for which they are

responsible. Heuristic assessment of instrument adjustment requirements, combined with

empirical data and epistemological knowledge gathered over multiple calibration operations may

provide a somewhat intuitive qualitative notion of when adjustment might be beneficial.

However, there is little formal quantitative guidance on this subject. The most authoritative

reference on such discretionary adjustments is found in NCSLI Recommended Practice RP-1,

“Establishment and Adjustment of Calibration Intervals”, henceforth referred to as NCSLI RP-1

[9]. Appendix G of NCSLI RP-1 refers to three adjustment policies as

1) Renew-always

2) Renew-if-failed

3) Renew-as-needed


NCSLI RP-1 employs the term renew to convey an adjustment action. Herein, the renew-as-

needed policy is synonymous with discretionary adjustment. As stated in RP-1 [9],

“At present, no inexpensive systematic tools exist for deciding on the optimal renewal policy for

a given MTE. While it can be argued that one policy over another should be implemented on an

organizational level, there is a paucity of rigorously demonstrable tests that lead to a clear-cut

decision as to what that policy should be. The implementation of reliability models, such as the

drift model, that yield information on the relative contributions of random and systematic effects,

seems to be a step in the right direction.”

The objective of this paper is to provide some additional discourse regarding the random and

systematic drift effects associated with some instruments and to provide insight as to the impact

of these effects on EOPR reliability under various discretionary adjustment thresholds. As

provided in NCSL RP-1 [9], discretionary adjustments may be influenced by one or more of the

following criteria, where this paper focuses specifically on questions #4, #5, #6, & #7:

1) Does parameter adjustment disturb the equilibrium of a parameter, thereby hastening the

occurrence of an out-of-tolerance condition?

2) Do parameter adjustments stress functioning components, thereby shortening the life of

the MTE?

3) During calibration, the mechanism is established to optimize or “center-spec”

parameters. The technician is there, the equipment is set up, the references are in-place.

If it is desired to have parameters performing at their nominal values, is this not the best

time to adjust?

4) By placing parameter values as far from the tolerance limits as possible, does adjustment

to nominal extend the time required for re-calibration?

5) Do random effects dominate parameter value changes to the extent that adjustment is

merely a futile attempt to control random fluctuations?

6) Do systematic effects dominate parameter value changes to the extent that adjustment is

beneficial?

7) Is parameter drift information available that would lead us to believe that not adjusting

to nominal would, in certain instances, actually extend the time required for re-

calibration?

8) Is parameter adjustment prohibitively expensive?

9) If adjustment to nominal is not done at every calibration, are equipment users being

short-changed?

10) What renewal practice is likely to be followed by calibrating personnel, irrespective of

policy?

11) Which renewal policy is most consistent with a cost-effective interval analysis

methodology?


Weiss [10] addressed the issue of calibration adjustment in some detail in 1991 in a paper

entitled, “Does Calibration Adjustment Optimize Measurement Integrity?”. Weiss showed that in

the presence of purely random errors associated with a normal probability density function,

where no statistical difference in the mean value of the distributions exists from one calibration

to the next, calibration adjustment can degrade instrument performance. Weiss and several

other authors [10 – 14, 56 – 60] have drawn upon the popular Deming funnel experiment to

illustrate how “tampering” with or adjusting a calibrated system in a state of statistical control

can introduce additional unwanted variation into a process rather than reduce existing variation.¹

As Weiss demonstrates, if the process exhibits purely random error represented by a normal

probability density function, the effect of this tampering is to increase the variance σ² by a factor

of 2. This is equivalent to increasing the standard deviation σ to √2·σ, or ~1.414σ. If

the specification limits were originally set to achieve 95 % confidence (±1.96σ), then this

increased variation from tampering results in an in-tolerance probability (EOPR) of only 83.4 %.

This value becomes important for the interpretations of the results later in this paper in Section 6.
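
The effect is easy to reproduce numerically. The following is a minimal Monte Carlo sketch (illustrative only, not code from Weiss or Deming) in which a purely random process with ±1.96σ tolerance limits is left alone under rule #1 and adjusted by the negative of each observed error under rule #2; the simulated EOPR drops from roughly 95 % to roughly 83.4 %, matching the figure above.

import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                       # common-cause (random) standard deviation
spec = 1.96 * sigma               # tolerance limits set for ~95 % containment
n = 200_000                       # simulated calibration events
eps = rng.normal(0.0, sigma, n)   # purely random error at each event

# Rule #1: never adjust; the observed error is just the random error.
eopr_rule1 = np.mean(np.abs(eps) <= spec)

# Rule #2: adjust by the negative of the previous observed error (tampering).
# The induced bias at event k is -eps[k-1], so the observed error becomes
# eps[k] - eps[k-1], whose variance is 2*sigma**2 (std of ~1.414*sigma).
obs_rule2 = eps[1:] - eps[:-1]
eopr_rule2 = np.mean(np.abs(obs_rule2) <= spec)

print(f"EOPR, never adjust (rule #1):  {eopr_rule1:.1%}")   # ~95.0 %
print(f"EOPR, adjust always (rule #2): {eopr_rule2:.1%}")   # ~83.4 %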

Shah [11] likewise comments in 2007 stating, “Calibration… has nothing to do with adjustment.

When a measurement system is… adjusted to measure the nominal value whether it is within

tolerance or not... Is this advisable or is it causing more harm than good?... Some adjustments

are justified. Others are not.… A calibration technician has to make an instant decision on a

measurement taken... Making a bad decision can lead to quality problems… It is shown that a

stable process with its inherent natural (random) variation should be left on its own.”

Abell [13] also touched on this issue in 2003 noting that, “…one might be inclined to readjust

points to the center of the specification. The temptation to optimize all points… by adjusting to

the exact center between the specifications causes two problems. The first is that it might not be

possible to adjust the instrument on a re-calibration to an optimal center value, even with an

expensive repair. Second, a stable instrument that is unlikely to drift will be made worse by

attempts to optimize its performance.”

Payne [14] in 2005 makes similar comments. “There are two reasons adjustment is

not part of the formal definition of calibration: (1) The historical calibration data on

an instrument can be useful when describing the normal variation of the instrument or a

population of substantially identical instruments... (2) …a single measurement from that process

is a random sample from the probability density function that describes it. Without other

knowledge, there is no way to know if the sample is within the normal variation limits. The

history gives us that information. If the measurement is within the normal variation and not

outside the specification limits, there is no reason to adjust it. In fact, making an adjustment

could just as likely make it worse as it could make it better. W. Edwards Deming discusses the

problem of overadjustment in chapter 11 of Out of the Crisis.”

1 In the Deming experiment, a stationary funnel is fixed a short distance directly above the center of a target and marbles are

dropped through the funnel onto the target; the resting spot of each marble is marked. Repeated cycles of this will display resting

spots in a random pattern with a natural fixed common-cause variation (σ) around the target’s center, following so-called “rule

#1” of never adjusting the position of the funnel. Alternatively, if the operator follows “rule #2” and futilely attempts to adjust the

position of the funnel after each drop (equal and opposite to the last observed error), the variation of the resting spots increases.


ISO/TS 16949:2009 [57], which supersedes QS9000 quality management requirements for the

automotive industry, also refers to the phenomenon of “over-adjustment” in Section 8.1.2 by

requiring, “Basic statistical concepts, such as variation, control (stability), process capability

and over-adjustment, shall be understood throughout the organization.”

The MSA Reference Manual [56] also describes “over-adjustment”, stating:

“…the decision to adjust a manufacturing process is now commonly based on measurement

data. The data, or some statistic calculated from them, are compared with statistical control

limits for the process, and if the comparison indicates that the process is out of statistical

control, then an adjustment of some kind is made. Otherwise, the process is allowed to run

without adjustment… [However] Often manufacturing operations use a single part at the

beginning of the day to verify that the process is targeted. If the part measured is off target, the

process is then adjusted. Later, in some cases another part is measured and again the process

may be adjusted. Dr. Deming referred to this type of measurement and decision-making as

‘tampering’… Over-adjustment of the process has added variation and will continue to do so...

The measurement error just compounds the problem... Other examples of the funnel experiment

are (1) Recalibration of gages based on arbitrary limits – i.e., limits not reflecting the

measurement system’s variability (Rule 3). (2) Autocompensation adjusts the process based on

the last part produced (Rule 2).”

Nolan and Provost [58] in 1990 also provide the following, “Decisions are made to adjust

equipment… to calibrate a measurement device… etc. All these decisions must consider the

variation in the appropriate measurements or quality characteristics of the process… The aim of

the adjustment is to bring the quality characteristic closer to the target in the future. ...there are

circumstances in which the adjustments will improve the performance of the process, and there

are circumstances in which the adjustment will result in worse performance than if no

adjustment is made... Continual adjustment of a stable process, that is, one whose output is

dominated by common causes, will increase variation and usually make the performance of the

process worse.”

Bucher, in The Quality Calibration Handbook [59], and The Metrology Handbook [60], states

“With regard to adjusting IM&TE, there are several schools of thought on the issue. On one end

of the spectrum, some (particularly government regulatory agencies) require that an instrument

be adjusted at every calibration, whether or not it is actually required. At the other end of the

spectrum, some hold that any adjustment is tampering with the natural system (from Deming)

and what should be done is simply to record the values and make corrections to measurements.

An intermediate position is to adjust the instrument only if (a) the measurement is outside the

specification limits, (b) the measurement is inside but near the specifications limits, where near

is defined by the uncertainty of the calibration standards, or (c) a documented history of the

values of the measured parameter shows that the measurement trend is likely to take it out of

specification before the next calibration due date”.

The Weiss and Deming models [10] assume purely random variation, for which adjustment is not

only futile, but actually detrimental. In such cases, adjustment or tampering results in an increase

to the standard deviation (σ) of the process by a factor of 1.414, or about 41 %. However, if the

behavior is not purely random, the results can differ. As noted in NCSL RP-1 Appendix G [9],


“However, if a systematic mean value change mechanism, such as monotonic drift, is introduced

into the model, the result can be quite different. For discussion purposes, modifications of the

model that provide for systematic change mechanisms will be referred to as Weiss-Castrup

models (unpublished)… By experimenting with different combinations of values for drift rate

and extent of attribute fluctuations in a Weiss-Castrup model, it becomes apparent that the

decision to adjust or not adjust depends on whether changes in attribute values are

predominately random or systematic.”

Appendix D of NCSL RP-1 describes ten Measurement Reliability Models with #9 being

“systematic attribute drift superimposed over random fluctuations (drift model)” [9]:

1) Constant out-of-tolerance rate (exponential model).

2) Constant-operating-period out-of-tolerance rate with a superimposed burn-in or wear out

period (Weibull model).

3) System out-of-tolerances resulting from the failure of one or more components, each

characterized by a constant failure rate (mixed exponential model).

4) Out-of-tolerances due to random fluctuations in the MTE attribute (random walk model).

5) Out-of-tolerances due to random attribute fluctuations confined to a restricted domain

around the nominal or design value of the attribute (restricted random-walk model).

6) Out-of-tolerances resulting from an accumulation of stresses occurring at a constant

average rate (modified gamma model).

7) Monotonically increasing or decreasing out-of-tolerance rate (mortality drift model).

8) Out-of-tolerances occurring after a specific interval (warranty model).

9) Systematic attribute drift superimposed over random fluctuations (drift model).

10) Out-of-tolerances occurring on a logarithmic time scale (lognormal model).

This paper investigates behavioral characteristics of instruments that are described by the #9

reliability model above, “systematic attribute drift superimposed over random fluctuations (drift

model)”.

Background information provided in Appendix D of NCSLI RP-1 is highly enlightening with

respect to the Weiss-Castrup Drift model and the decision to adjust or not. Additional

information is also provided by Castrup [54].

A section from Appendix D of NCSLI RP-1 is provided here to facilitate an understanding of the

relationship between systematic and random components of behavior and their influence on both

interval and instrument adjustment decisions, where Φ denotes the normal distribution function:

Φ(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Where: x = random variable, σ = standard deviation, μ = mean


Appendix D of NCSL RP-1 [9]

Drift Model: R(t, θ) = Φ(θ₁ + θ₃t) + Φ(θ₂ − θ₃t)

∂R/∂θ₁ = (1/√(2π)) · e^(−(θ̂₁ + θ̂₃t)²/2)

∂R/∂θ₂ = (1/√(2π)) · e^(−(θ̂₂ − θ̂₃t)²/2)

∂R/∂θ₃ = (t/√(2π)) · [e^(−(θ̂₁ + θ̂₃t)²/2) − e^(−(θ̂₂ − θ̂₃t)²/2)]

Figure D-11. Drift Measurement Reliability Model (θ₁ = 2.5, θ₂ = 0.5, and θ₃ = 0.5)

Renewal Policy and the Drift Model:

In the drift model, if the conditions |θ̂₃t| ≪ |θ̂₁| and |θ̂₃t| ≪ |θ̂₂| hold, then the measurement reliability

of the attribute of interest is not sensitive to time elapsed since calibration. This is equivalent to saying

that, if the coefficient θ̂₃ is small enough, the attribute can essentially be left alone, i.e., not periodically adjusted.

Interestingly, the coefficient θ̂₃ is the rate of attribute value drift divided by the attribute value standard

deviation: θ̂₃ = m/σ, where m = attribute drift rate and σ = attribute standard deviation. From this

expression, we see that the coefficient θ̂₃ is the ratio of the systematic and random components of the

mechanism by which attribute values vary with time. If the systematic component dominates, then θ̂₃ will

be large. If, on the other hand, the random component dominates, then θ̂₃ will be small. Putting this observation together with the foregoing remarks concerning attribute adjustment leads to the following axiom:

If random fluctuation is the dominating mechanism for attribute value changes over time, then the benefit of periodic adjustment is minimal.

As a corollary, it might also be stated that

If drift or other systematic change is the dominating mechanism for attribute value changes over time, then the benefit of periodic adjustment is high.

Obviously, use of the drift model can assist in determining which adjustment practice to employ for a given attribute. By fitting the drift model to an observed out-of-tolerance time series and evaluating the

coefficient θ̂₃, it can be determined whether the dominant mechanism for attribute value change is

systematic or random. If θ̂₃ is small, then random changes dominate and a renew-if-failed-only practice

should be considered. If θ̂₃ is large, then a renew-always practice should perhaps be implemented.

Copyright© 2010 NCSLI. All Rights Reserved. NCSLI Information Manual. Reprinted here under the provisions of the

“Permission to Reproduce” clause of NCSLI RP-1.
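
As a rough illustration of evaluating that ratio in practice, the sketch below (not from RP-1; it fits a drift line to a hypothetical as-found error history rather than to an out-of-tolerance time series as RP-1 describes) estimates θ̂₃ = m/σ from the fitted slope and the residual scatter.

import numpy as np

# Hypothetical history: time in calibration intervals, as-found error as a fraction of spec.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
err = np.array([0.05, 0.22, 0.35, 0.61, 0.72, 0.90])

m, b = np.polyfit(t, err, 1)                 # m = drift rate per interval, b = intercept
sigma = np.std(err - (m * t + b), ddof=2)    # random scatter about the drift line
theta3 = m / sigma                           # ratio of systematic to random change

print(f"m = {m:.3f} spec/interval, sigma = {sigma:.3f}, theta3 = {theta3:.1f}")
# Large theta3: drift dominates and periodic adjustment (renewal) is beneficial.
# Small theta3: random fluctuation dominates and renew-if-failed may be preferable.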


The Weiss-Castrup Drift model described in NCSL RP-1 was primarily intended for the

determination, adjustment, and optimization of calibration intervals in association with Methods

S2 & S3, also called the Binomial Method and the Renewal Time Method, respectively [9].

The Weiss-Castrup drift model is investigated here with a focus on instrument adjustment

thresholds, rather than interval adjustment actions. That is, for a given fixed calibration interval,

how do various discretionary adjustment thresholds (0 % to 100 % of specification), in the

presence of both drift and random variation, affect EOPR reliability? Clearly, if the behavior is

purely random, as in the Weiss and Deming models, an adjust-always policy (0 % adjust

threshold) is detrimental to the instrument performance resulting in decreased EOPR.

However, if the behavior has any element of monotonic drift, as in the Weiss-Castrup Drift

model, an adjustment will be necessary at some point to prevent an eventual OOT condition

resulting from a true attribute bias due to drift. The difficulty manifests during calibration when

attempting to discriminate between attribute bias and a random error. Thus, investigating optimal

adjustment thresholds to maximize EOPR in the presence of random and systematic errors seems

a worthy endeavor. It is also prudent to consider that, even if an optimum adjustment threshold

is determined, there may be other administrative and managerial factors as described in NCSL

RP-1 Appendix G [9] that should be considered when formulating adjustment policies.

The policy of some U.S. Department of Defense military programs and third party OEM

accredited calibration laboratories has been to not routinely, by default, adjust most equipment

unless found out-of-tolerance. For example, “The U.S. Navy has the policy of not adjusting test

equipment that are in tolerance.” [15].

However, even under some programs which typically employ an “adjust-only-if-OOT” policy,

discretionary adjustments are still performed for select equipment types. For example, it is not

uncommon to always assign new calibration factors to microwave power sensors, or sensitivity

values to accelerometers, or coefficients to temperature sensors (e.g. RTDs, PRTs, etc.),

regardless of the as-found condition of the device. In these cases, rather than judge in-tolerance

or out-of-tolerance based on published specifications, these decisions are often rendered based

on the previously assigned uncertainty, applicable to the assigned value. In these applications,

uncertainties must include a reproducibility component in the uncertainty budget that is

applicable over the calibration interval for stated conditions. Such estimates can be attained by

evaluation of historical performance.
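
As one hedged illustration (an assumed approach, not a prescribed method), such a reproducibility component can be estimated from the scatter of successive assigned values in the calibration history:

import numpy as np

# Hypothetical history of assigned values (e.g. a power-sensor calibration factor).
assigned = np.array([0.9981, 0.9984, 0.9979, 0.9986, 0.9982, 0.9985])

# Interval-to-interval changes in the assigned value.
deltas = np.diff(assigned)

# Standard-deviation estimate usable as a reproducibility component in the budget.
u_repro = np.std(deltas, ddof=1)
print(f"reproducibility component: {u_repro:.1e}")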

3. Empirical Examples: Systematic Drift Superimposed Over Random Fluctuations

The idea that attribute bias can grow or drift over time is ubiquitous; indeed much of the history

of metrology and the impetus for calibration are predicated on this possibility. Examples of such

behavior are often encountered. The distinction between attribute bias arising from drift (or

otherwise), and a random error, is sometimes only discernable from the analysis of historical

data. Monotonic drift can be estimated using linear regression models. Such is the case with 10 V

DC zener voltage references. Calibration of these devices must be performed via comparison to

other characterized zeners, standard cells, or, in the most accurate cases, Josephson voltage

measurement systems. Due to the inherently low drift characteristics of commercial zener

references, it would not be possible to adequately detect or resolve drift without a measurement

system exhibiting high resolution, low noise, and zero (or well-characterized/compensated) drift.


The data represented in Figure 1 was acquired with a Josephson voltage measurement system.

The “noise” or variation observed in the data is primarily due to the zener under test and not to

the measurement standard, while all of the observed drift is attributable to the zener and none to

the measurement standard [16, 17]. It may be noted that the fluctuations about the predicted drift

line are not purely “random” in nature; they are pseudo-random.

Figure 1. Zener drift and pseudo-random variation

Short term variation is also significantly lower than long term variation about the predicted line

(better repeatability than reproducibility). Long term variation is attributable to uncorrected

seasonal pressure and/or humidity dependencies, 1/f noise, white noise, etc. In the presence of

this long-term variation, significant calibration history is necessary in order to confidently

characterize the drift of such instruments. Moreover, some of the apparently random common-

cause variation might indeed be “correctable”. One example is by application of pressure

coefficients to correct for ambient changes in barometric pressure. In many applications, with

enough effort and the availability of measurement systems with ultra-high resolution and

accuracy, some apparently common-cause variation can be revealed as special-cause. All

metrology systems, to include the UUT, will ultimately contain a finite amount of common-

cause variation or uncertainty, even after all corrections have been applied.

The R² value (or coefficient of determination) from the regression is a figure of merit for the

linear drift model and other models, as it compares the amount of variation around the prediction

to the variation resulting from a constant (no-drift) model. R² is an indicator of the amount of

variation that is explained by linear monotonic drift. Normality tests and visual analysis of the

regression residuals is also beneficial and can reveal secondary non-linear effects.
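
A minimal sketch of those figures of merit, using invented zener-style data for illustration, is shown here; the Shapiro-Wilk test is one common normality check on the residuals.

import numpy as np
from scipy import stats

# Hypothetical drift history: time in years, deviation from nominal in microvolts.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
v = np.array([1.2, 0.6, 0.3, -0.4, -0.9, -1.1, -1.8, -2.4, -2.7])

slope, intercept = np.polyfit(t, v, 1)
resid = v - (slope * t + intercept)

r_squared = 1.0 - np.sum(resid**2) / np.sum((v - v.mean())**2)   # variation explained by drift
w_stat, p_value = stats.shapiro(resid)                           # normality of residuals

print(f"R^2 = {r_squared:.3f}, Shapiro-Wilk p = {p_value:.2f}")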


It is interesting to visually consider an attempt to characterize the zener drift in Figure 1 over a

relatively short period of time. Certain instances of data analysis over such time periods might

produce significantly different predictions of drift. This is evident, even via visual examination,

by observing only the data from Jan 2001 to Jan 2002, which would result in a positive drift slope.

This illustrates the benefit of long calibration histories when attempting to predict drift in the

presence of random or pseudo-random variation, especially where the periodicity of these

variations is long.

However, a subjective decision must be made when determining how much historical data to

include in the regression. At some point, it may be reasonable to conclude that future behavior,

especially in the short term, is not significantly dependent on data from 10+ years ago. In

general, short-term predictions are better made by assessment of more recent history only, while

long-term predictions might be more accurate using the full comprehensive history. Special

cause variation, such as a loss of power, can justify excluding data prior to the event. This

is a subjective process and heuristic judgment based on experience and knowledge of zener

behavior is helpful in determining how much data to include in the regression.
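
The trade-off can be seen in a small sketch (hypothetical data; the drift and seasonal terms are invented) that compares the slope fitted to the full history against the slope fitted to only the most recent points:

import numpy as np

years = np.arange(0.0, 12.0)                       # hypothetical: 12 years of calibrations
dev = -0.8 * years + 1.5 * np.sin(years / 2.0)     # drift plus slow pseudo-random variation

slope_full, _ = np.polyfit(years, dev, 1)               # long-term prediction: use everything
slope_recent, _ = np.polyfit(years[-4:], dev[-4:], 1)   # short-term prediction: recent data only

print(f"full-history slope:  {slope_full:.2f} per year")
print(f"recent-window slope: {slope_recent:.2f} per year")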

Zener references are not typically declared in-tolerance or out-of-tolerance by assessment against

a published accuracy specification, but rather to their predicted value and its assigned uncertainty

at a given time during the calibration interval. Zener references are also not typically adjusted,

although provision for electrical adjustment does exist. In lieu of physical/electrical adjustment,

the assigned/predicted value is mathematically adjusted or reassigned over time during

calibration. Algebraic corrections never interfere with the stability of a device nor are they

limited by the resolution of the adjusting mechanism. They do require the manual use of “charted

values” and uncertainties via reference to a Report of Test or calibration certificate.

By sheer numbers, the majority of items calibrated throughout the world are not predominately

high-level reference standards, but are of the more general variety of Test, Measurement, and

Diagnostic Equipment (TM&DE). Oftentimes, the calibration history of such TM&DE contains

adjustment actions of both in-tolerance and out-of-tolerance instruments. The data shown in

Figure 2 represents actual data from the 50 V DC test point of a 4½ digit handheld multimeter

(UUT). On the third calibration event, the UUT was found out-of-tolerance and was adjusted

back to nominal, resulting in zero observed error.

It is visually intuitive that this particular test point displays a high degree of monotonic drift with

very little random variation. In order to perform regression analysis, the magnitude and direction

of the adjustment action must be mathematically removed from the raw calibration data. The

resulting regression analysis is shown in Figure 3.


Figure 2. Calibration history representing as-found and as-left data

Figure 3. Regression of calibration data with adjustments mathematically removed (R² = 0.96)
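
One way to perform that removal (a sketch using assumed numbers, not the actual Figure 2 data) is to add the cumulative as-left corrections back onto the subsequent as-found readings before regressing:

import numpy as np

# Hypothetical as-found / as-left errors in % of spec; the unit was adjusted to
# zero error at the third and fifth calibrations.
as_found = np.array([10.0, 55.0, 105.0, 48.0, 95.0])
as_left = np.array([10.0, 55.0, 0.0, 48.0, 0.0])

adjustment = as_found - as_left                                 # correction applied at each event
offset = np.concatenate(([0.0], np.cumsum(adjustment[:-1])))    # corrections accumulated so far

unadjusted = as_found + offset       # history as if no adjustments had been made
print(unadjusted)                    # [ 10.  55. 105. 153. 200.] -> regress this for drift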

However, in many cases of general purpose TM&DE, monotonic drift may be more

difficult to resolve due to domination by more random behavior or even special-cause variation

where instruments apparently “step” out-of-tolerance, rather than drift in a predictable manner.

Such an example, along with the regression analysis is shown in Figures 4 and 5. In such cases, a

model with random fluctuation superimposed on monotonic drift may not be the best model. One

of the other models proposed in NCSLI RP-1 may be more appropriate for such an instrument.


Figure 4. Calibration history representing predominately non-monotonic drift behavior

Figure 5. Regression w/ relatively low R², indicating significant behavior not explained by drift


4. Assumptions of the Drift Model

The Weiss-Castrup Drift model is investigated in this paper with the following assumptions:

1) The only two change mechanisms for instrument error are (a) linear monotonic drift and

(b) normally distributed random errors. No spontaneous special-cause step transitions or

other variation/behavior is accommodated.

2) The periodicity and magnitude of the random fluctuations during measurement

(repeatability) is negligibly small compared to the periodicity and magnitude of the

random fluctuations over the calibration interval (reproducibility). Here, a single

measurement is simulated.

3) Tolerance specifications for the UUT are intended to represent approximately 95 %

containment probability. Drift from 0 % to 100 % of specification is modeled as attribute

bias. The normally distributed random component is selected as the remainder of the

specification, less the allotted drift bias µ, with σ chosen to yield 95 % containment probability

for the remaining random error. The higher the allotted drift (µ), the lower the variation (σ).

4) The drift is constrained between 0 % and ~100 % of the stated specification.

5) The measurement uncertainty at the time of calibration is negligibly small, i.e., high Test

Uncertainty Ratio (TUR) or, equivalently, Measurement Capability Index (Cm).

Laboratory standards do not contribute significantly to the measurement uncertainty.

6) High precision physical and/or electrical adjustment provisions for the UUT are provided

which are capable of rendering an observed error of zero (eOBS = 0) after adjustment. This

may be a poor assumption for multi-range, multifunction instruments with many test

points. Algebraic (manually applied mathematical) corrections are equivalent.

7) Physical or electrical adjustments do not induce any secondary instabilities or otherwise

disturb the equilibrium or stress components of the instrument. No interaction between

adjustment controls for various test points, ranges, or functions is assumed.

8) Observed Out-Of-Tolerance conditions (>100 % of specification) require mandatory

adjustment. The adjustment threshold is constrained between 0 % and 100 % of

specification. However, adjustment thresholds >100 % are briefly investigated.

9) An adjustment action will always negate any previous attribute bias present at the end of

the previous period, but will also (insidiously) result in a present attribute bias equal to

the negative of the previous random error. No quantitative a-priori drift information is

assumed at the time of adjustment. Adjustment will overcompensate by the amount of

previous random error, as in the Deming funnel rule #2.

10) The adjustment threshold is always adhered to. If eOBS > adjustment threshold, an

adjustment will always be performed. If eOBS < adjustment threshold, no adjustment is

performed. Human behavioral/procedural error in adhering to the adjustment threshold is

not accommodated.

11) Symmetry is assumed and only positive drift is simulated with equal implications and

conclusions applicable to negative drift.
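
The following is a minimal Monte Carlo sketch of one possible implementation consistent with the assumptions above (it is not the paper's actual simulation code, and the drift value and thresholds used are arbitrary examples). Drift, random error, observed error, and the adjustment threshold are all expressed as fractions of the specification.

import numpy as np

rng = np.random.default_rng(42)

def simulate_eopr(threshold, drift, n_periods=100_000):
    # drift: attribute bias added per calibration interval (fraction of spec)
    # threshold: adjust if |observed error| meets or exceeds this value; OOT is always adjusted
    sigma = (1.0 - drift) / 1.96     # random part sized so drift + random give ~95 % containment
    bias = 0.0
    in_tol = 0
    for _ in range(n_periods):
        bias += drift                        # systematic drift over the interval
        eps = rng.normal(0.0, sigma)         # random (reproducibility) error at calibration
        observed = bias + eps
        if abs(observed) <= 1.0:
            in_tol += 1                      # within spec at end of period
        if abs(observed) >= threshold or abs(observed) > 1.0:
            bias = -eps                      # adjustment negates bias but chases the random error
    return in_tol / n_periods

for thr in (0.0, 0.5, 0.7, 0.8, 1.0):
    print(f"threshold {thr:>4.0%}:  EOPR = {simulate_eopr(thr, drift=0.3):.1%}")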

Assumptions #2, #3, and #9 above require further comment.


4.1 Assumption (#2): Periodicity and Magnitude of Variation

The Weiss paper and the Deming funnel addressed the periodicity by restricting adjustment

decisions to a “single reading” or observation. In the Weiss example, a single meter reading and

adjustment was performed every hour. Rather than decrease any attribute bias, the adjustments

resulted in increased random variation. Weiss concludes, “The presence and size of the bias

cannot be determined by a single reading; multiple data points are required… One must

observe enough data to characterize the variability of the meter readings to know which is the

correct strategy [adjust or not]”.

Likewise, the model herein assumes that a single measurement is made during calibration or, if

repeated measurements are made and averaged, that the variability during calibration is

negligible with respect to the larger variation that occurs over the calibration interval. That is,

the random fluctuations occurring during the relatively short observation period of calibration

(repeatability) are not representative of, or do not capture, the full extent of the variation

exhibited over the longer calibration interval (reproducibility). This is somewhat akin to the long

term dependency of 1/f noise. On the contrary, if the periodicity and magnitude of fluctuations

are similar, then random fluctuations over the calibration interval are represented by those

encountered during the shorter measurement process. Such variations can then be largely negated

with averaging techniques during the measurement process, which should then be capable of

discerning actual attribute bias in the presence of random fluctuations. Under these

circumstances, adjustment could be warranted resulting in a genuine improvement in accuracy.

Figure 6. Measurement variation during calibration, compared with variation over cal interval.


Like the Weiss example and Deming funnel (rule #2), the model presented in this paper will

incorrectly assume the observed UUT error of +60 % shown in Figure 6 is attribute bias, even

under purely random behavior. Such an erroneous assumption will result in a calibration

adjustment magnitude of -60 % in a futile effort to correct for the observed random error. Like

Weiss and Deming, the correct assumption under purely random behavior is that the +60 % error

is common-cause and, if left undisturbed, will soon fluctuate and take on some other random

error represented by the UUT distribution. If this assumption is valid, the correct action would be

to do nothing and not adjust. The model presented here attempts to replicate the actions of the

calibrating technician, who does not have knowledge of the magnitudes of the individual

systematic attribute bias vs. random behavior; adjustments are made based only on the observed

error at the time of calibration, which comprises both bias from drift and random error.

But this decision can only confidently be made with a priori knowledge of the UUT error

distribution over the course of the calibration interval. In many cases, this distribution

is not readily available and discretionary calibration adjustments are made with the assumption

that all of the observed error is an actual attribute bias which will remain (or possibly grow)

unless an adjustment is performed. In an ideal case, the calibration technician would be able to

discern a short-term random error from an actual long-term attribute bias through examination

of historical data. At the time of calibration however, the two types of errors are often

inextricably combined into the “observed error”, whether obtained from a single reading or

several averaged measurements over a short period of time. The attribute bias is somewhat

hidden in the presence of random error. This is the behavior that is modeled herein.

4.2 Assumption (#3): 95 % Containment Specifications; Selection of Drift vs. Random

This is perhaps the most significant and sweeping assumption used in the model presented here.

The rationale used herein assumes that specifications are generally intended to adequately

accommodate or contain the majority of errors that an instrument might exhibit, with relatively

high confidence (e.g. 95 %). As such, the magnitudes of drift and random variability are selected

as complementary to one another and modeled under this assumption. This greatly restricts the

domain of possible instrument behavior investigated here. Instruments with drift and random

variation, which are both far better (lower) than their specifications might imply, are not modeled

here. Rationale for the assumption and selection of the particular domain of instrument behavior

investigated in this paper is provided here.

As stated in Section 5.4 of NASA HDBK-8739.19-2 [18], “In general, manufacturer

specifications are intended to convey tolerance limits that are expected to contain a given

performance parameter or attribute with some level of confidence under baseline conditions…

Performance parameters and attributes such as nonlinearity, repeatability, hysteresis,

resolution, noise, thermal stability and zero shift are considered to be random variables that

follow probability distributions that relate the frequency of occurrence of values to the values

themselves. Therefore, the establishment of tolerance limits should be tied directly to the

probability that a performance parameter or attribute will lie within these limits…

The selection of applicable probability distributions depends on the individual performance

parameter or attribute and are often determined from test data obtained for a sample of articles


or items selected from the production population. The sample statistics are used to infer

information about the underlying parameter population distribution for the produced items. This

population distribution represents the item to item variation of the given parameter. The

performance parameter or attribute of an individual item may vary from the population mean.

However, the majority of the produced items should have parameter mean values that are very

close to the population mean. Accordingly, a central tendency exists that can be described by the

normal distribution…

Baseline performance specifications are often established from data obtained from the testing of

a sample of items selected from the production population. Since the test results are applied to

the entire population of produced items, the tolerance limits should be established to ensure that

a large percentage of the items within the population will perform as specified… performance

parameter distributions are established by testing a selected sample of the production

population. Since the test results are applied to the entire population of a given parameter, limits

are developed to ensure that a large percentage of the population will perform as specified.

Consequently, the parameter specifications are confidence limits with associated confidence

levels.”

Accuracy specifications are of little benefit if they cannot be relied upon with reasonably high

confidence. Manufacturers sometimes publish specifications at both 95 % and 99 % confidence

levels [19]. After many calibration cycles, EOPR is then an empirical estimate of that

confidence; i.e. EOPR provides a measure or assessment of the probability for an instrument to

comply with its specifications at the end of its calibration interval.

However, the intent and conditions of specifications and any assumed confidence are subject to a

certain amount of interpretation and inference. Is the confidence level specifically stated or is it

implied? Does the confidence level of the specification apply to a single test point, or to a single

instrument, or to a population of similar instruments?

For example, the published absolute uncertainty specification at a 95 % confidence level for a

Fluke 8508A DMM, at 20 VDC, is ±3.2 ppm [19]. The same 20 VDC point has a published

uncertainty of ±4.25 ppm expressed at a 99 % confidence level. As manufactured and if properly

used, it might be reasonable for the end-user of this DMM to apply the stated specification at this

particular 20 VDC test point and assume the stated confidence level applies.

However, it can be argued that for multifunction instruments with multiple test points, the actual

confidence level of any individual test point must be much greater than 95 % or even 99 %

confidence if the instrument as-a-whole is expected to meet its specifications with the stated

confidence.

As Deaver has noted [20]“…each Fluke Model 5520A Multiproduct Calibrator is tested at 552

points on the production line prior to shipment. If each of the points has a 95% probability of

being found in tolerance, there would only be a 0.95^552 = 0.000000000051 % chance of finding

all the points within the specification limits if the points are independent! Even if we estimate

100 independent points (about 2 per range for each function), we would still have only a 0.95^100

= 0.6 % chance of being able to ship the product.”
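
A quick numeric check of the arithmetic quoted above (illustration only):

print(f"{0.95**552:.1e}")    # ~5.1e-13, i.e. roughly 0.000000000051 % for 552 independent points
print(f"{0.95**100:.3f}")    # ~0.006, i.e. roughly 0.6 % for 100 independent points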


Similar statements have been published by Dobbert [21, 21a]. “A common assumption is

that product specifications describe 95% of the population of product items [emphasis added].

From the mean, µ, and standard deviation, σ, an interval of [µ - 2σ, µ + 2σ] contains

approximately 95% of the population. However, when manufacturers set product specifications,

the test line limit is often set wider than 2σ from the population mean...

For choosing the tolerance interval probability, a generally accepted minimum value is 95%.

However, manufacturers may choose a probability other than 95% for different reasons.

Consider again a multi-parameter product. Manufacturers wish to have high yields for the entire

product so that the yield considering all parameters meets the respective test line limits. If the

product parameters are statistically independent, the overall yield, in this case, is the product of

the probability for each parameter. For a product with just three independent parameters, each

with a test limit intended to give 95% probability, the product would only have a (0.95)^3, or

85.7 %, chance of meeting all test line limits, which is perhaps unacceptable to the manufacturer.

For this reason, manufacturers select tolerance interval probabilities greater than 95% so that

the overall probability is acceptable.”

When discussing drift, Dobbert also notes, “Stress due to environmental change, as well as

everyday use, transport, aging and other factors may induce small changes in performance that

accumulate over time. In other words, products drift. The effect of drift is that from the time of

manufacture to the end of the initial calibration interval, it is likely that performance has shifted.

…a population of product items also experiences a shift in the mean, a change in the standard

deviation, or both, due to the mechanisms associated with drift…

To ensure products meet specification over the initial calibration interval, manufacturers may

include an additional guard band between the test line limit and the specification… In the

simplest case, the total guard band between the test line limit and the specifications is the sum of

the individual guard band components for environmental factors, drift, measurement uncertainty

and any other required component. For example, spec = tll + Δenv + Δdrift + Δunc…

gives what is often the initial specification for a product. For the final specification,

manufacturers must consider manufacturing costs, market demands and competing product

performance.”
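
As a purely hypothetical illustration of that summation (the numbers below are not from Dobbert or any product datasheet), a test line limit of ±2.0 ppm combined with guard band allowances of Δenv = 0.3 ppm, Δdrift = 0.5 ppm, and Δunc = 0.2 ppm would yield an initial specification of spec = 2.0 + 0.3 + 0.5 + 0.2 = ±3.0 ppm.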

When discussing manufacturer’s specifications propagated into uncertainty analyses, Dobbert

additionally notes, “The GUM provides guidance for evaluation of standard uncertainty and

specifically includes manufacturer’s specifications as a source of information for Type-B

estimate… To evaluate a Type-B uncertainty, the GUM gives specific advice when an

uncertainty is quoted at a given level of confidence. In this instance, an assumption can be made

that a Gaussian distribution was used to determine the quoted uncertainty. The standard

uncertainty can then be determined by dividing by the appropriate factor given the stated level of

confidence. Various manufacturers state a level of confidence for product specifications and

applying this GUM advice to product specifications quoted at a level of confidence is common

and accepted by various accreditation bodies.”


The assumption, used in the model investigated by this paper, is that a specification represents

95 % containment probability of errors for a given test point; thus, the magnitude and proportion

of drift and random components are modeled accordingly (see Section 5). This may be a

significant assumption and highly conservative, especially where actual instrument performance

at a given test point exhibits systematic drift (bias) and random error components much lower

than represented by the specifications. For example, the domain of performance for instruments

displaying drift (µ) of only 10 % of specification per interval and, at the same time, a random

component (σ) of only 20 % of specification is not modeled here.

However, a great many instruments may be well capable of performing at such levels, i.e.

considerably better than their specifications would imply. This is especially true if one assumes

that the manufacturer has built significant margins or guardbands into the specifications and/or

that the confidence level of specifications is intended to represent an entire population of

instruments, or one instrument as-a-whole, rather than a single test point. Investigations of such

domains of behavior, and the effect on EOPR of various adjustment thresholds under such

“improved” instrument performance, may be highly insightful and are deferred to future

explorations². Moreover, models where random variation (σ) itself increases with time (such as

random-walk models) would be useful, with or without a drift component. Such a model, even in

the absence of monotonic drift, exhibits a time-dependent mechanism for transitioning to OOT³.

4.3 Assumption (#9): Mandatory Adjustment of OOT Conditions is Required

In practice, calibration laboratories, which are charged with verification as part of the calibration

process, are required to perform an adjustment to the UUT if it exceeds the allowable tolerance(s)

(>100 %) defined by the agreed-upon specifications. It is not generally acceptable to return an

item to the end-user as calibrated, while exhibiting an observed OOT condition.

However, under a Weiss or Deming model where fluctuations are purely random, leaving the item unadjusted would appear to be

the correct course of action. The OOT condition, like the in-tolerance condition, should not be

adjusted; it should be allowed to remain with the assumption that it will soon decrease and take-

on some other random value which will likely be contained within the specification limits. In this

regard, there is nothing “special” about the OOT condition. It is simply part of the normal

common-cause random variation that will inevitably, albeit rather infrequently (e.g. 5 %), fall

outside of specification limits which are intended to represent 95 % confidence or other

containment probability. Appendix G of NCSLI RP-1 perhaps best describes this as a logical

predicament when discussing non-adjustment of items as follows:

“If we can convince ourselves that adjustment of in-tolerance attributes should not be made, how

then to convince ourselves that adjustment of out-of-tolerance attributes is somehow beneficial?

For instance, if we conclude that attribute fluctuations are random, what is the point of adjusting

attributes at all? What is special about attribute values that cross over a completely arbitrary

line called a tolerance limit? Does traversing this line transform them into variables that can be

controlled systematically? Obviously not.”

More on the topic of non-adjustment of OOT conditions is presented later in Section 6.

² The author thanks Jonathan Harben of Keysight Technologies for these astute suggestions.

³ The author thanks Dr. Howard Castrup of Integrated Sciences Group for this valuable observation.


The model presented herein concedes to the conventional industry practice which mandates

adjustment of items which are observed to be out-of-tolerance. Where the observed error is

predominately a “long-term” attribute bias, resulting from systematic monotonic drift or

otherwise, adjustment is a beneficial action. Such attribute bias is likely to remain or possibly

grow larger if left unadjusted. However, where the observed error resulted predominately from a

short-term random event, adjustment will be the incorrect decision. Like the calibration

technician, this model assumes (correctly or incorrectly) that all observed as-received errors

represent systematic attribute bias; adjustment actions will be implemented according to the

adjustment threshold parameter set for the model (0 % to 100 % of specification). In this sense,

the model feigns ignorance of the proportion of random error to attribute bias during

adjustment actions but, in actuality, is privy to the amount of attribute bias at all times in the

simulation. For investigational purposes, adjustment thresholds >100 % of specification are

briefly discussed, although they are believed unlikely to find application in most calibration

laboratories.

5. Modeling and Selection of Magnitude for Drift and Random Variation.

The illustration in Figure 7 represents the general concept of monotonic drift superimposed on

constant random variation.

Figure 7. Monotonic drift superimposed on constant random variation

LEFT: The random variation has been superimposed on no drift at all (0 %) and the

specification adequately contains 95 % of the random errors.

CENTER: The random variation has been superimposed on drift in the amount of 50 % of

specification. The mean µ of this distribution at the end of the calibration interval is not zero,

but is equal to the amount of drift accumulated over the calibration interval (50 %). Thus, a

significant portion (16.4 %) of errors will exceed the upper specification limit and only 0.2 %

will exceed the lower specification limit. Only 83.5 % will be in-tolerance (EOPR).

RIGHT: The random variation is superimposed on drift in the amount of 100 % of

specification. The mean µ of the distribution is shifted to 100 % of specification, resulting in

50 % of the errors exceeding the upper specification limit when received for calibration. This

is generally an unacceptable situation, as End Of Period Reliability of 50 % is below most

industry accepted reliability targets. See Section 7 for examples of EOPR objectives.


Figure 7 represented random variation as a normal probability distribution with constant width (σ

= constant). However, if the specification limits are intended to provide a containment

probability of 95 % as discussed in Section 4.2, then any allowable drift must result in a

commensurate reduction in the amount of allowable random variation in order to still provide a

95 % confidence. In the model used herein, the amount of drift is first selected as a percentage

(0 % to 100 %) of the allowable specification over one interval. This will result in a systematic

drift-induced attribute bias at the end of one interval equal to the amount of specified drift. OOT

incidents will tend towards the direction of drift; e.g. for a positive drift allowance, OOT

conditions will predominately be found exceeding the upper specification limit in only one tail of

the distribution. The resulting drift, after one interval, forms the mean (µ) of the normally

distributed random component.

Since the intent of the accuracy specification is assumed to represent a 95 % containment

probability for the error, the remaining portion of the specification is then modeled as a normally

distributed random component with a standard deviation of (σ) selected to still provide 95 %

containment (see Table 1 and Figure 8). This complementary aspect of these two components is

necessary to provide the desired containment probability. As discussed in Section 4.2,

specifications are often, directly or implied, provided by the OEM with an allowance for drift

designed into them and provided at a relatively high confidence level. This is the basis for the

choice of magnitudes for the model used here. As the drift component dominates and approaches

the 100 % specification limit, the random component approaches zero. That is, as the systematic

drift (µ) increases, the random variation (σ) decreases, as shown in Figure 8.

Figure 8. Positive drift superimposed on complementary random variation

To maintain 95 % EOPR, a “perfect” adjustment would need to be made at the end of each

calibration interval (in-tolerance or not). This is necessary to reduce the attribute bias (due to

drift or otherwise) to zero. Only if this ideal adjustment always occurs at the end of each

calibration interval would 95 % EOPR be achievable in this model. However, such adjustment

will not be possible in this model, due to the nature of the random variation precluding an ideal

adjustment. Thus, EOPR will be less than 95 % for adjustment thresholds between 0 % and 100

% of specification.
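The complementary (µ, σ) pairs listed in Table 1 can be reproduced numerically. The following is a minimal sketch (written in Python with SciPy purely for illustration; the computations in this paper were performed in Excel) that solves for the σ giving 95 % containment within ±100 % of specification for a chosen drift µ:

from scipy.optimize import brentq
from scipy.stats import norm

def sigma_for_containment(mu, spec=100.0, p=0.95):
    # Solve P(-spec <= mu + e <= spec) = p for the standard deviation of e ~ N(0, sigma)
    def gap(sigma):
        return norm.cdf((spec - mu) / sigma) - norm.cdf((-spec - mu) / sigma) - p
    return brentq(gap, 1e-9, spec)    # sigma is bracketed between ~0 and the specification

for mu in (0, 10, 50, 90):            # compare with the corresponding rows of Table 1
    print(f"drift {mu:3d} %  ->  sigma = {sigma_for_containment(mu):7.3f} % of spec")

For µ = 0 % this returns σ ≈ 51.02 % of specification (i.e. spec/1.96), and for µ = 10 % it returns σ ≈ 50.04 %, in agreement with Table 1.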


Table 1. Magnitude of drift (µ) and random (σ) components, modeled to maintain 95 % in-tolerance confidence.

(σ) Given as percentage of the specification.

(µ) Given as a percentage of the specification per interval.

Each data row below contains four column groups, one per drift level. The columns within each group are: Drift Mean (µ) | Random S.D. (σ) | Ratio (µ/σ) | Left-Tail OOT ΣProb. | Right-Tail OOT ΣProb. The two right-most column groups omit the tail-probability columns.

0 % 51.021 % 0.000 2.500 % 2.500 % 26 % 44.386 % 0.586 0.226 % 4.774 % 51 % 29.790 % 1.71 76 % 14.591 % 5.21

1 % 51.011 % 0.020 2.385 % 2.614 % 27 % 43.881 % 0.615 0.190 % 4.810 % 52 % 29.181 % 1.78 77 % 13.983 % 5.51

2 % 50.981 % 0.039 2.271 % 2.729 % 28 % 43.363 % 0.646 0.158 % 4.842 % 53 % 28.573 % 1.85 78 % 13.375 % 5.83

3 % 50.933 % 0.059 2.158 % 2.843 % 29 % 42.834 % 0.677 0.130 % 4.870 % 54 % 27.966 % 1.93 79 % 12.767 % 6.19

4 % 50.864 % 0.079 2.044 % 2.956 % 30 % 42.291 % 0.709 0.106 % 4.894 % 55 % 27.358 % 2.01 80 % 12.159 % 6.58

5 % 50.776 % 0.098 1.933 % 3.068 % 31 % 41.739 % 0.743 0.085 % 4.915 % 56 % 26.749 % 2.09 81 % 11.551 % 7.01

6 % 50.668 % 0.118 1.822 % 3.178 % 32 % 41.177 % 0.777 0.067 % 4.933 % 57 % 26.142 % 2.18 82 % 10.943 % 7.49

7 % 50.539 % 0.139 1.712 % 3.287 % 33 % 40.606 % 0.813 0.053 % 4.947 % 58 % 25.534 % 2.27 83 % 10.335 % 8.03

8 % 50.392 % 0.159 1.605 % 3.395 % 34 % 40.029 % 0.849 0.041 % 4.960 % 59 % 24.926 % 2.37 84 % 9.727 % 8.64

9 % 50.224 % 0.179 1.499 % 3.500 % 35 % 39.444 % 0.887 0.031 % 4.969 % 60 % 24.318 % 2.47 85 % 9.119 % 9.32

10 % 50.038 % 0.200 1.396 % 3.604 % 36 % 38.856 % 0.926 0.023 % 4.977 % 61 % 23.710 % 2.57 86 % 8.511 % 10.1

11 % 49.831 % 0.221 1.296 % 3.705 % 37 % 38.263 % 0.967 0.017 % 4.983 % 62 % 23.102 % 2.68 87 % 7.904 % 11.0

12 % 49.603 % 0.242 1.198 % 3.803 % 38 % 37.666 % 1.01 0.012 % 4.988 % 63 % 22.494 % 2.80 88 % 7.296 % 12.1

13 % 49.356 % 0.263 1.103 % 3.898 % 39 % 37.066 % 1.05 0.009 % 4.991 % 64 % 21.886 % 2.92 89 % 6.687 % 13.3

14 % 49.088 % 0.285 1.011 % 3.989 % 40 % 36.464 % 1.10 0.006 % 4.994 % 65 % 21.278 % 3.05 90 % 6.080 % 14.8

15 % 48.801 % 0.307 0.922 % 4.078 % 41 % 35.861 % 1.14 0.004 % 4.996 % 66 % 20.670 % 3.19 91 % 5.472 % 16.6

16 % 48.495 % 0.330 0.838 % 4.163 % 42 % 35.256 % 1.19 0.003 % 4.998 % 67 % 20.062 % 3.34 92 % 4.864 % 18.9

17 % 48.168 % 0.353 0.757 % 4.243 % 43 % 34.650 % 1.24 0.002 % 4.998 % 68 % 19.454 % 3.50 93 % 4.256 % 21.9

18 % 47.821 % 0.376 0.680 % 4.320 % 44 % 34.043 % 1.29 0.001 % 4.999 % 69 % 18.846 % 3.66 94 % 3.648 % 25.8

19 % 47.456 % 0.400 0.608 % 4.393 % 45 % 33.435 % 1.35 0.001 % 4.999 % 70 % 18.238 % 3.84 95 % 3.040 % 31.3

20 % 47.071 % 0.425 0.540 % 4.461 % 46 % 32.828 % 1.40 0.000 % 4.999 % 71 % 17.630 % 4.03 96 % 2.432 % 39.5

21 % 46.666 % 0.450 0.476 % 4.524 % 47 % 32.220 % 1.46 0.000 % 4.999 % 72 % 17.022 % 4.23 97 % 1.824 % 53.2

22 % 46.244 % 0.476 0.417 % 4.583 % 48 % 31.613 % 1.52 0.000 % 5.000 % 73 % 16.415 % 4.45 98 % 1.216 % 80.6

23 % 45.805 % 0.502 0.362 % 4.638 % 49 % 31.005 % 1.58 0.000 % 5.000 % 74 % 15.807 % 4.68 99 % 0.608 % 163

24 % 45.347 % 0.529 0.312 % 4.687 % 50 % 30.398 % 1.64 0.000 % 5.000 % 75 % 15.199 % 4.93 100 % 0.000 % N/A

25 % 44.874 % 0.557 0.267 % 4.733 %


Figure 9. Monte Carlo simulation model for one calibration interval (stages: AS-RECEIVED, AS-LEFT, DURING CAL INTERVAL, END OF PERIOD)

eOBS = Error Observed for the UUT, as-received. It is equal to the End-Of-Period error for the

previous calibration interval (eEOP (i-1) ). Only a portion of eOBS is due to systematic error (eBIAS (i-1) +

eDRIFT (i-1)). However, any adjustments are performed equal-and-opposite to the whole of eOBS,

which includes random error (eRAND (i-1)) in addition to systematic error (eBIAS (i-1) + eDRIFT (i-1)).

eBIAS = UUT attribute bias, as-left. If no adjustment has been made, eBIAS remains the same as the

sum of the systematic errors at the end of the previous calibration interval (eBIAS (i-1) + eDRIFT (i-1)). If

an adjustment is made, eBIAS is equal to the negative of the previous random error (-eRAND (i-1)). After

adjustment, eBIAS is zero only if the random error during the previous cal interval (eRAND (i-1)) was

zero (unlikely). Adjustment actions will always negate previously accumulated attribute bias,

but will also result in attribute bias of their own, due to an overcompensated adjustment.

eDRIFT = Error of UUT attributable to monotonic drift. If no adjustment is made, this systematic

drift error carries over or accumulates from one calibration interval to the next. For the model,

eDRIFT is specified as a percentage of the allowable tolerance or accuracy specification. The

remainder of the specification is then allocated to eRAND as (100 % - Drift %).

eRAND = Error of UUT attributable to random behavior. A random number generator is used to

select eRAND from a normal Gaussian distribution. Ideally, no adjustment should be made to

compensate for this component. This is common-cause variation with an assumed period

significantly longer than the observation period during calibration. If all variation is random,

adjusting is equivalent to “tampering” with a system which may otherwise be in a state of

statistical control. It is analogous to moving the funnel in the Deming experiment.

eEOP = Error of UUT at End of Period (includes attribute bias, plus drift, plus random error).

[Figure 9 block labels: the as-received error eOBS is tested against the tolerance ("In Tol? eOBS < Tolerance?") and against the adjustment threshold ("Adjust? eOBS > Adjustment Threshold?"). If an adjustment is performed, it is made equal-and-opposite to eOBS (result: eOBS = 0) and the as-left bias becomes µ = -eRAND (i-1); otherwise the previous cumulative systematic bias Σ[eBIAS (i-1) + eDRIFT (i-1)] carries forward. Systematic monotonic drift eDRIFT and a normally distributed random component eRAND then accumulate over the interval, giving eEOP = eBIAS + eDRIFT + eRAND.]
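The per-interval logic of Figure 9 can be summarized in the following sketch (shown in Python purely for illustration; the function and variable names are hypothetical, all quantities are expressed in % of specification, and this is not the Excel implementation used for the paper):

import random

def one_interval(e_obs, e_sys, drift, sigma, tol=100.0, threshold=80.0):
    # e_obs : error observed as-received (eOBS, equal to the previous eEOP)
    # e_sys : systematic part of e_obs (accumulated bias plus drift); the
    #         remainder of e_obs is the previous interval's random error
    oot = abs(e_obs) > tol
    if oot or abs(e_obs) > threshold:
        # Adjustment negates the whole of e_obs; only the systematic part
        # "deserved" correction, so the as-left bias becomes -(random part).
        e_bias = -(e_obs - e_sys)
    else:
        # No adjustment: the accumulated systematic error carries forward.
        e_bias = e_sys
    # During the new interval another drift step and a fresh random error accrue.
    e_rand = random.gauss(0.0, sigma)
    e_sys = e_bias + drift
    e_obs = e_sys + e_rand            # End-Of-Period error, observed at the next calibration
    return e_obs, e_sys, oot

Starting from e_obs = e_sys = 0 and iterating with drift = 10, sigma = 50.038 and threshold = 80 reproduces the structure of the sequence walked through step-by-step in Appendix A (the specific random draws will, of course, differ).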


6. Results

The results in Figure 10A and 10B were rendered via the Monte Carlo method to visually

investigate aspects of the Weiss-Castrup drift model with regard to adjustment thresholds. The

model in Figure 9 is repeated for 100 000 iterations and the number of Out-Of-Tolerance

instances for eOBS is tallied over the 10^5 cycles. The End-of-Period Reliability is then computed as EOPR = (10^5 − OOTs) / 10^5. This process is repeated ten times, with the average taken to arrive at

a final simulated EOPR output, applicable to a specifically chosen pair of values in the model,

i.e. (1) the amount of monotonic drift and (2) the adjustment threshold. A 101 x 101 matrix of

EOPR values is then generated by looping the process in +1 % increments from 0 % to 100 %

for both the monotonic drift variable and the adjustment threshold variable. In total, ~10^10 Monte Carlo iterations are used in the generation of the matrix. This requires considerable

computational brute force and consumed approximately 43 hours of CPU time running under MS Windows® 7 in Excel® 2010 using an Intel® Core™ i5 4300 CPU clocked at 2.6 GHz. See

Appendix B for a discussion of using Excel for Monte Carlo methods.
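The computation just described can be outlined as follows (a Python sketch under the stated assumptions, reusing the hypothetical one_interval() and sigma_for_containment() functions sketched earlier; the matrix for this paper was actually generated in Excel):

def simulate_eopr(drift, sigma, threshold, n=100_000, repeats=10):
    # Average EOPR over `repeats` runs of n simulated calibration intervals
    eoprs = []
    for _ in range(repeats):
        e_obs, e_sys, oot_count = 0.0, 0.0, 0
        for _ in range(n):
            e_obs, e_sys, oot = one_interval(e_obs, e_sys, drift, sigma,
                                             threshold=threshold)
            oot_count += oot                 # True counts as 1
        eoprs.append((n - oot_count) / n)    # EOPR = (10^5 - OOTs) / 10^5
    return sum(eoprs) / repeats

# Sweep drift and adjustment threshold in 1 % steps to build the EOPR surface.
# The drift = 100 % column (sigma = 0) is the degenerate limit discussed below.
eopr_matrix = [[simulate_eopr(d, sigma_for_containment(d), t) for t in range(101)]
               for d in range(100)]

A full sweep of this kind is computationally heavy, in line with the ~43 hours of CPU time reported above for the Excel implementation.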

The resulting multivariate matrix can then be plotted as a three dimensional surface plot (Figures

10A & 10B), with the EOPR values displayed on the vertical z-axis. The x-axis represents the

monotonic drift rate and the y-axis represents the adjustment threshold, from 0 % to ~100 %

each. This provides insight into the effects that these variables impart to EOPR, which is

arguably the most important quality metric for many calibration and metrology organizations.

Other important quality metrics, such as Test Uncertainty Ratio (TUR) and the Probability of

False Accept (PFA), are inextricably linked to the observed EOPR [22, 23].

Figure 10A. 3D surface plot of EOPR as a function of adjustment threshold and drift


Figure 10B. 3D surface plot of EOPR as a function of adjustment threshold and drift

It is important to bear in mind the nature of the x-axis, representing drift in Figures 10A and 10B.

As the amount of drift µ increases, the random behavior σ decreases, as assumed by this

particular model (see Table 1). Other modeling can be performed with different parametric

assumptions, e.g. where the random variation is held constant (or grows larger) in the presence of

increasing drift. Still other assumptions, such as zero drift and increasing random variation, e.g.

random-walk models, could be modeled. Such investigations would provide additional insight.

It should also be noted that here, the x-axis merely approaches 100 % drift (zero random error).

When drift is exactly 100 % of specification with zero random error, all adjustment thresholds

≤100 % result in 100 % EOPR. In that case, adjustments are always performed and they are

always “perfect” due to the absence of random error (assuming infinite TUR; see assumption #5

in Section 4).

Many implications exist from the resulting model in Figure 10A and 10B for the stated

assumptions. Perhaps the most significant commonality in all instances is that, as the calibration

adjustment threshold increases from 0 % to 100 % of specification, the EOPR remains constant

or decreases in all cases; it never increases. This is further illustrated in Figure 11.


Figure 11. EOPR as a function of adjustment threshold for various levels of drift

In Figure 11, note that for the case of purely random variation with zero drift (green line), the

EOPR is constant at 83.4 %, just as the Weiss and Deming model would predict when

adjustments are always made (i.e. adjustment threshold of 0 % of specification). However, it is

interesting to note that this 83.4 % EOPR does not improve as the adjustment threshold is

increased from 0 % (always adjust) towards 100 % of specification (adjust less frequently).

Why does an increase in EOPR (reduction in variability) not result, in this purely random case,

as the adjustment threshold increases from 0 % to 100 % (i.e. less frequent adjustments)? The

answer to this question can be elucidated if the scale of the adjustment threshold and y-axis are

extended beyond the 100 % of specification limit (OOT point). With the model constrained to a

maximum of 100 % adjustment threshold in the purely random case, adjustments will still be

made for all observed OOT conditions. Even though these adjustments occur less frequently than

the always-adjust scenario (0 % adjustment threshold), the magnitude of these less frequent

adjustments or tampering is always quite large. For purely random systems, these large but less-

frequent adjustments for observed OOT conditions ultimately result in the same outcome as the

Weiss and Deming models predict; i.e. they lead to the same increased variability (variance 2σ², standard deviation √2·σ)

and resulting lower EOPR (83.4 %), just as if adjustment or tampering was performed every

time.
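A brief sketch of why this occurs (assuming, as in this model, that the random errors of successive intervals are independent): when every observed error is adjusted out, the as-left bias is -eRAND (i-1), so the next observed error is eEOP (i) = eRAND (i) - eRAND (i-1). Its variance is σ² + σ² = 2σ² (standard deviation √2·σ), and specification limits placed at ±1.96σ then contain only about 83.4 % of observations, consistent with the simulated EOPR.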

If the adjustment threshold is increased to 500 % of specification (or more), and the simulation is

run again, a decrease in variability (from √2·σ to σ) and resulting increase in EOPR (from 83.4 %

to 95 %) is indeed observed. However, the transition region where this phenomenon occurs is not

well-behaved (see Figure 12). That is, as the adjustment threshold is raised above 100 % of

specification, fewer and fewer adjustments are ever made. The probability of adjustment


becomes exceedingly low. However, when one of these very rare events does occur, triggering

an adjustment (after many thousands of iterations of the Monte Carlo simulation), the effect is

quite significant. Since it was presumed to be a random event, no adjustment should have been

made (even at 150 %, 200 %, 300 % of specification, or more). Adjusting such a large random

error imparts an equally large attribute bias of opposite sign.

Figure 12. Monte Carlo modeled behavior for random errors w/ adjust thresholds >100 % of spec

If the Monte Carlo simulations are extended to include adjustment thresholds far above 100 % of

specification (>OOT), the EOPR behavior becomes somewhat erratic between 150 % and 270 %

of specification. It ultimately settles at the 95 % EOPR, just as if no adjustments were ever made,

because essentially no adjustments are ever made when the adjustment threshold is so large. The repeatability of the Monte Carlo process is also poor in this transition region (even with 10^6

iterations) because the results of the simulation are highly sensitive to very improbable events.

After the adjustment threshold extends beyond ~270 % of specification (~5.5σ), adjustment

actions become so rare as to approach the “never adjust” scenario of the Deming funnel (rule #1)

where the variation is lowest. Under these circumstances, the EOPR settles at the original 95 %

containment probability of the purely random variation with respect to the ±1.96σ specification

limits.

This scenario will likely find little application in calibration laboratories. One would have to be

willing to not adjust instruments with observed errors >>100 % of specification (highly OOT).

The rationale for such decision would be to attribute all errors (regardless of how large) as purely

random events that would not remain if simply left alone and not adjusted. In reality, such large

errors may be much more likely to be true attribute bias resulting from “special-cause” variation

such as misuse, over-ranging, rough handling, etc. Analysis of historical data is of great benefit

when attempting to characterize such errors.


7. EOPR Reliability Targets

The use of EOPR as a quality metric for calibrated equipment is of great importance. EOPR

targets are analogous to an Acceptable Quality Level (AQL) in manufacturing environments.

Both metrics speak to the percentage of items that comply with their stated specifications,

although AQLs are expressed as the complement of this (i.e. tolerable percent defective, not to

be confused with LTPD). Calibration intervals are often adjusted in an effort to achieve these

goals. Target EOPR levels are often proprietary for commercial and private industry. However,

it is insightful to review some EOPR objectives for calibrated equipment in military and

aerospace organizations. A summary of such targets is provided here.

TARGET EOPR LEVELS

NASA – Kennedy Space Center (KNPR 8730.1, Rev. Basic-1; 2003 to 2009, Obsolete)

“At KSC, calibration intervals are adjusted to achieve an EOPR range of 0.85 to 0.95.” [24]

U.S. Navy (OPNAV 3960.16A; 2005) “CNO policy requires USN/USMC to:… (o) Establish an objective end of period reliability goal for TMDE equal to or greater than 85 percent, with the threshold reliability in no case to be lower than 72 percent.” [25]

U.S. Navy (Albright, J. Thesis; 1997) “…intervals are based on End-Of-Period (EOP) operational reliability targets of 72% for non-critical General Purpose Test Equipment (GPTE) and 85% for critical Special Purpose Test Equipment (SPTE).” [26]

U.S. Air Force (TO 00-20-14; 2011) “The Air Force calibration interval… is the period of time over which the equipment shall perform its mission or function with a statistically derived end-of-period reliability (shall be within tolerance) of 85% or better.” [27]

U.S. Army (GAO B-160682, LCD-77-427; 1977, Obsolete) “…the Army decided to follow the Air Force's and Navy's lead in establishing an 85-percent end-of-period reliability requirement. However, the Army has adopted a new statistical model and changed its policy to require 75-percent end-of-period reliability.” [28]

U.S. Army (AR 750-43; 2014, Current) “On average, 90 percent of items will be in tolerance over the calibration interval, and 81 percent will be in tolerance at the end of the interval.” [29]

The NCSL International Benchmarking Survey (LM-5) provides additional information on

EOPR targets, termed “Average % In-Tolerance Target” [55]. In the survey, statistics were

aggregated from 357 national and international respondents polled in 2007. Demographics

included aerospace, military & defense, automotive, biomedical/pharmaceutical,

chemical/process, electronics, government, healthcare, M&TE manufacturers, medical

equipment, military, nuclear/energy, service industry, universities and R&D, and “other”. This

NCSLI survey found:

4 % of respondents employ EOPR targets <85 %

19 % of respondents employ EOPR targets between 85 % and 90 %

25 % of respondents employ EOPR targets between 91 % and 95 %

52 % of respondents employ EOPR targets >95 %


8. Non-Adjustable Instruments

It should be noted that, in the presence of any amount of monotonic drift regardless of how

small, an adjustment will eventually have to be made or the attribute bias will ultimately exceed

the allowable specification. Indeed, the very practice of shortening an interval to increase EOPR

is somewhat predicated on some form of time-dependent mechanism increasing the magnitude of

possible errors, along with the ability to adjust (reduce) the attribute bias to or near zero.

For non-adjustable instruments, EOPR cannot generally be increased by shortening a calibration

interval via the same mechanism applicable to adjustable instruments. However, shortening the

calibration interval for non-adjustable instruments can still be beneficial in two ways.

1) An increase in EOPR can still result from shortening the calibration interval for non-

adjustable instruments which exhibit a relatively small time-dependent mechanism for

transitioning to an OOT condition (e.g. low drift). This is true because more in-tolerance

calibrations will be performed prior to the occurrence of an OOT condition. Once a non-

adjustable instrument incurs its first OOT condition, it cannot be adjusted back into

tolerance and has effectively reached the end of its service life at which point EOPR =

(#Calibrations - 1) / (#Calibrations). The shorter the interval, the more in-tolerance

calibrations will have been performed and the higher the EOPR will be. After the first OOT

event, the instrument must then be retired from service or the allowable tolerance must be

increased with consent from the end-user or “charted values” must be manually employed

via a Report of Test or Calibration Certificate. Such action should only be taken if no impact

will result to the application or process for which the instrument is employed.

2) Organizational benefits, other than increased EOPR, can also be realized through shortening

of calibration intervals for non-adjustable instruments. These benefits do not manifest as an

increase in EOPR, but rather in a reduction of the exposure to possible consequences

associated with an out-of-tolerance condition. For example, a working-standard resistor

(calibrated to a tolerance) may not be adjustable. An out-of-tolerance condition may

eventually arise from drift or even special-cause variation (over-power/voltage, mechanical

shock/damage, etc.) Shortening the calibration interval will provide no direct benefit to

EOPR via a reduction in errors through adjustment. However, since any OOT condition will

result in an impact assessment (reverse traceability) for all instruments calibrated by this

OOT resistor, a shorter calibration interval will reduce the number of possible impact assessments and the risk exposure to product or process, providing benefits of a different nature.

9. Conclusions

Discretionary adjustment during calibration of in-tolerance equipment is not mandated by

national and international calibration standards ANSI/Z540.3 and ISO-17025, nor is adjustment

contained within the VIM definition of calibration. A model has been used here in an attempt to

describe the effect of various discretionary adjustment thresholds on in-tolerance instruments,

assuming a specific behavioral mode called the Weiss-Castrup drift model and under very

specific assumptions. These assumptions may not hold for many items of TM&DE. Other

alternative assumptions, where the domain of drift and random behavior simultaneously

comprise only a small percentage of the associated specification, may yield significantly

different results and are worthy of further investigation.


Using Monte Carlo methods, the effect of various discretionary adjustment thresholds on End Of

Period Reliability (EOPR) has been investigated for in-tolerance instruments under these specific

conditions. For the model and assumptions stated, it is shown that discretionary adjustments of

in-tolerance instruments can be beneficial in the presence of monotonic drift superimposed on

random variation. Under these conditions, the non-adjustment benefits of reduced variation

(increased EOPR), posed by the Weiss model and Deming funnel model, do not appear to

manifest between the 0 % and 100 % of specification adjustment thresholds. As the calibration

adjustment threshold increases from 0 % to 100 % of specification, the EOPR remains constant

or decreases in all cases; it never increases. Only after the adjustment threshold far exceeds

100 % of specification and effectively approaches the never-adjust scenario, are these benefits

realized for purely random behavior. Never adjusting items with any significant amount of

monotonic drift is not a viable option, as these instruments will rather quickly transition to an

OOT condition resulting from a true attribute bias due to drift.

The assumptions of the model may be idealized and unrealistic in the empirical world. Moreover,

it may be unlikely that the behavior of any instrument would be entirely restricted to only the

two change mechanisms accommodated by this model or the domain of magnitudes and/or

proportions of drift and random behavior restricted to the values modeled here. Many general

purpose TM&DE instruments may perform considerably better than their specifications would

imply. They may also be impacted by other behavioral characteristics and special cause events,

hindering the use of this model and of linear regression as a prediction technique.

Random walk behavior, where the magnitude of the random variation (σ) itself increases with

time, may be more realistic in many cases. Under such random-walk models, the probability of

OOT events increases with time, even in the absence of monotonic drift. Much opportunity for

continued investigations and research exists in this regard. However, the assumptions stated

herein, when combined with the Weiss-Castrup drift model, provide a rudimentary working

construct with which to glean useful insight into the effect of various adjustment thresholds for

in-tolerance instruments under a variety of systematic and random errors.

Many programmatic factors must be considered when implementing instrument adjustment

policies or thresholds, above and beyond the exclusive consideration of maximizing EOPR.

Instrument adjustment can increase expense to a company or calibration laboratory in that “As-

Received” data must be acquired prior to adjustment, and “As-Left” data must be taken after the

adjustment. The model presented here strives to encourage additional investigation while

providing program managers and metrology professionals with a tool to assist in the

establishment of instrument adjustment policies and to guide possible decision processes. Astute

policy makers will likely use a variety of tools, models, assumptions, and empirical data,

balancing many options and objectives, to achieve the most prudent adjustment policy for a

particular organization.


10. Acknowledgements and Disclosures

The author wishes to thank Tom Waltrich, Nancy Mescher, and Jerry Phillips of Baxter

Healthcare Corporation for many fruitful discussions of the material presented here relating to

instrument drift. Much gratitude is extended to Dr. Howard Castrup of Integrated Sciences

Group and Jonathan Harben of Keysight Technologies for insightful comments, critiques, and

suggestions during review of this paper. Material presented here was adapted from, or inspired

by, NCSLI RP-1 with regard to renewal/adjustment policies. The writing efforts of the NCSL

International Calibration Interval Committee are gratefully acknowledged and appreciated. No

endorsement of the work presented here, by the aforementioned parties, is implied. At the time of

publication, the results and conclusions of modeling presented in this paper are considered

preliminary and have not had the benefit of vetting by other independent sources. Such review is

highly desired and encouraged.


11. Bibliography

1. ANSI/NCSL Z540.3:2006. Requirements for the Calibration of Measuring and Test

Equipment. American National Standards Institute / NCSL International, 2006.

http://www.ncsli.org/I/i/p/z3/c/a/p/NCSL_International_Z540.3_Standard.aspx?hkey=7de83171-

16ff-416c-9182-94c8447fb300

2. NCSL Z540.3:2006 Handbook, Handbook for the Application of ANSI/NCSL Z540.3-2006 –

Requirements for the Calibration of Measuring and Test Equipment. American National

Standards Institute / NCSL International. 2006.

http://www.ncsli.org/I/i/p/zHB/c/a/p/Zhb1.aspx?hkey=572363f0-59e9-4817-8b65-ae6ba5d8ff24

3. ISO/IEC 17025:2005(E). General Requirements for the Competence of Testing and

Calibration Laboratories. International Organization for Standardization / International

Electrotechnical Commission. 2005.

http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=39883

4. JCGM 200:2012 (ISO/IEC Guide 99-12:2007). International Vocabulary of Metrology —

Basic and General Concepts and Associated Terms (VIM). Joint Committee for Guides in

Metrology - Working Group 2, 3rd Edition. 2008.

http://www.bipm.org/utils/common/documents/jcgm/JCGM_200_2012.pdf

5. U.S. Department of Health and Human Services, Food and Drug Administration, Form 483

Observation #11 – Control of Inspection, Measuring, and Test Equipment, Commander S.

Creighton, Consumer Safety Officer. Issued to St. Jude Medical IESD, Sylmar CA. October 17,

2012.

http://www.fda.gov/downloads/aboutfda/centersoffices/officeofglobalregulatoryoperationsandpol

icy/ora/oraelectronicreadingroom/ucm328488.pdf

Response (November 7, 2012) – Observation #11

http://www.fda.gov/downloads/AboutFDA/CentersOffices/OfficeofGlobalRegulatoryOpe

rationsandPolicy/ORA/ORAElectronicReadingRoom/UCM334747.pdf

Response (March 13, 2013) – Observation #11a

http://www.fda.gov/downloads/AboutFDA/CentersOffices/OfficeofGlobalRegulatoryOpe

rationsandPolicy/ORA/ORAElectronicReadingRoom/UCM346876.pdf

6. J. Bucher, Measure for Measure - Out of Sync. American Society for Quality (ASQ)

Measurement Quality Division, The Standard, Vol. 27 No. 2, PDF pp 21-22. June 2013.

http://rube.asq.org/measure/2013/05/the-standard-june-2013.pdf

6a. The preceding paper was also published in ASQ Quality Progress, pp 52-53. March 2010.

http://asq.org/quality-progress/2010/03/measure-for-measure/out-of-sync.html

7. J. Bucher, Debunking The Two Great Myths About Calibration: Traceability to NIST: If You

Cannot Adjust, You Cannot Calibrate. Proceedings of the NCSL International Workshop and

Symposium, National Harbor MD. Aug 2011. https://www.ncsli.org/c/f/p11/48.299.pdf

8. J. Bucher., Where Does It Say That? Clearing Up the FDA’s Calibration Requirements.

American Society for Quality, Measurement Quality Division, The Standard, Vol. 27 No. 2, PDF

pp 31-32. June 2013. http://rube.asq.org/measure/2013/05/the-standard-june-2013.pdf


8a. The preceding paper was also published in ASQ Quality Progress, pp 50-51. November 2010.

http://asq.org/quality-progress/2010/11/measure-for-measure/where-does-it-say-that.html

9. NCSL RP-1:2010, Recommended Practice: Establishment and Adjustment of Calibration

Intervals, NCSL International, Boulder CO. 2010.

http://www.ncsli.org/I/i/Store/rp/iMIS/Store/rp.aspx?hkey=bf3e3957-f502-484d-9842-

fa5ef6325073

10. B. Weiss, Does Calibration Adjustment Optimize Measurement Integrity?. Proceedings of

the National Conference of Standards Laboratories Workshop and Symposium, Albuquerque

NM. August 1991. http://legacy.library.ucsf.edu/tid/jlw43b00/pdf

11. D. Shah, Deming Funnel Experiment and Calibration Over Adjustment: New Innovation?

American Society for Quality, ASQ World Conference on Quality and Improvement

Proceedings, Vol. 61, Orlando FL. April 2007. http://asq.org/qic/display-item/?item=21074

12. S. Prevette, Dr. Deming’s Funnel Experiment, Symphony Technologies Pvt Ltd, Rule 2

Example – Periodic Calibrations, Pune India.

www.symphonytech.com/articles/pdfs/spfunnel.pdf

13. D. Abell, Do You Really Need a 17025 Accredited Calibration?. Proceedings of the NCSL

International Workshop and Symposium, Tampa FL. August 2003.

14. G Payne, Measure for Measure: Calibration: What Is It? ASQ Quality Progress, American

Society for Quality. pp 72-76. May 2005.

http://asq.org/quality-progress/2005/05/measure-for-measure/calibration-what-is-it.html

15. D. Jackson, Calibration Intervals – New Models and Techniques, Naval Surface Warfare

Center Corona Division, Proceedings of the Measurement Science Conference, Anaheim CA.

January 2002.

16. C. Hamilton, Y. Tang, Evaluating the Uncertainty of Josephson Voltage Standards.

Metrologia Vol. 36 No. 1, pp 53-58. February 1999. https://www.researchgate.net/profile/Y_Tang2/publication/231103850_Evaluating_the_uncertainty_of_Jo

sephson_voltage_standards/links/54abe6cf0cf25c4c472fb877.pdf

17. C. Hamilton, L. Tarr. Projecting Zener DC Reference Performance Between Calibrations.

IEEE Transactions on Instrumentation and Measurement. Vol. 52 No. 2, pp 454-456. April 2003.

http://vmetrix.home.comcast.net/~vmetrix/ZenerP.pdf

18. NASA-HDBK-8739.19-2. Measuring and Test Equipment Specifications, NASA

Measurement Quality Assurance Handbook – ANNEX 2. National Aeronautics and Space

Administration. July 2010. https://standards.nasa.gov/documents/viewdoc/3315777/3315777

19. Fluke 8508A Digital Multimeter, Extended Accuracy Specifications, Publication 1887212 D-

ENG-N Rev C, DS263. Fluke Corporation. October 2002.

http://media.fluke.com/documents/8508A_Extended_Specs_Rev_C.pdf

20. D. Deaver. Having Confidence in Specifications. Proceedings of the Measurement Science

Conference. Newport Beach CA. 2004. http://assets.fluke.com/appnotes/calibration/msc04.pdf

21. M. Dobbert. Setting and Using Specifications – An Overview. Proceedings of the 2010

NCSL International Workshop and Symposium. Providence RI. July 2010.


21a. A version of the preceding paper was also published in NCSLI Measure – The Journal of

Measurement Science. Vol. 5 No. 3, pp 68-73. September 2010.

http://www.keysight.com/upload/cmc_upload/All/Setting_Using_Specifications.pdf

22. P. Reese, J. Harben. Implementing Strategies for Risk Mitigation in the Modern Calibration

Laboratory, Proceedings of the NCSL International Workshop and Symposium, National Harbor

MD. August 2011. https://www.researchgate.net/profile/Paul_Reese2/publication/258311599_Implementing_Strategies_for_

Risk_Mitigation_In_the_Modern_Calibration_Laboratory/file/e0b49527c29c941b2e.pdf

23. P. Reese, J. Harben. Risk Mitigation Strategies for Compliance Testing, Measure – The

Journal of Measurement Science, NCSL International Vol.7, No.1 pp 38-49. March 2012 https://www.researchgate.net/profile/Paul_Reese2/publication/258311819_Risk_Mitigation_Strategies_fo

r_Compliance_Testing/file/e0b49527c24ec50664.pdf

24. NASA KNPR 8730.1 (Rev. Basic-1), Kennedy NASA Procedural Requirements, Section 3.3

– Calibration Intervals, pp 11. National Aeronautics and Space Administration. March 2003.

25. Navy OPNAV 3960.16A, Navy Test, Measurement, and Diagnostic Equipment (TMDE),

Automatic Test Systems (ATS), and Metrology and Calibration (METCAL), Section 6 – Policy,

U.S. Navy, paragraph (o), pp 6. August 2005. http://doni.daps.dla.mil/Directives/03000%20Naval%20Operations%20and%20Readiness/03-

900%20Research,%20Development,%20Test%20and%20Evaluation%20Services/3960.16A.pdf

26. Albright J., Thesis: Reliability Enhancement of the Navy Metrology and Calibration

Program, Naval Postgraduate School, Monterey CA. December 1997.

https://calhoun.nps.edu/bitstream/handle/10945/8906/reliabilityenhan00albr.pdf

27. USAF TO 00-20-14, Technical Manual – Air Force Metrology and Calibration Program.

Section 3.4 – Calibration Intervals, pp 3-8. Secretary of the United States Air Force, September

2011. www.wpafb.af.mil/shared/media/document/AFD-120724-063.pdf

28. GAO LCD-77-427, B-160682, A Central Manager is Needed to Coordinate the Military

Diagnostic and Calibration Program. Appendix I – Different Criteria Used To Establish

Calibration Intervals at Metrology Centers, U.S. General Accounting Office, pp 1-2, May 1977. http://gao.justia.com/national-aeronautics-and-space-administration/1977/5/a-central-manager-is-needed-

to-coordinate-the-military-diagnostic-and-calibration-program-lcd-77-427/LCD-77-427-full-report.pdf

29. AR-750-43, Maintenance of Supplies and Equipment – Army Test, Measurement, and

Diagnostic Equipment, Chapter 6, Section I – Program Objectives and Administration,

Paragraph 6-1a Program Objectives, pp 24. Department of Defense, U.S. Army. Jan 2014.

http://www.apd.army.mil/pdffiles/r750_43.pdf

30. B. McCullough, B. Wilson. On the Accuracy of Statistical Procedures in Microsoft Excel

97. Computational Statistics & Data Analysis. Vol. 31 No. 1, pp 27-37. July 1999.

http://users.df.uba.ar/cobelli/LaboratoriosBasicos/excel97.pdf

31. L. Knüsel. On the accuracy of the statistical distributions in Microsoft Excel 97.

Computational Statistics & Data Analysis. Vol. 26 No. 3, pp 375-377. January 1998.

http://www.sciencedirect.com/science/article/pii/S0167947397817562


32. B. McCullough, B. Wilson. On the Accuracy of Statistical Procedures in Microsoft Excel

2000 and Excel XP. Computational Statistics & Data Analysis. Vol.40 No. 4, pp 713-721.

October 2002. https://www.researchgate.net/publication/222672996_On_the_accuracy_of_statistical_procedures_in_Mi

crosoft_Excel_2000_and_Excel_XP/links/00b4951c314aac4702000000.pdf

33. B. McCullough, B. Wilson. On the Accuracy of Statistical Procedures in Microsoft Excel

2003. Computational Statistics & Data Analysis. Vol.49. No. 4, pp 1244-1252. June 2005.

http://www.pucrs.br/famat/viali/tic_literatura/artigos/planilhas/msexcel.pdf

34. L. Knüsel. On the accuracy of statistical distributions in Microsoft Excel 2003.

Computational Statistics & Data Analysis, Vol. 48, No. 3, pp 445-449. March 2005.

http://www.sciencedirect.com/science/article/pii/S0167947304000337

35. B. McCullough, D. Heiser. On the Accuracy of Statistical Procedures in Microsoft Excel

2007. Computational Statistics & Data Analysis. Vol.52. No. 10, pp 4570-4578. June 2008.

http://users.df.uba.ar/mricci/F1ByG2013/excel2007.pdf

36. A. Yalta. The Accuracy of Statistical Distributions in Microsoft® Excel 2007. Computational Statistics & Data Analysis. Vol. 52 No. 10, pp 4579-4586. June 2008.

http://www.sciencedirect.com/science/article/pii/S0167947308001618

37. B. McCullough. Microsoft Excel’s ‘Not The Wichmann-Hill’ Random Number Generators.

Computational Statistics and Data Analysis. Vol.52. No. 10, pp 4587-4593. June 2008.

http://www.sciencedirect.com/science/article/pii/S016794730800162X

38. G. Melard. On the Accuracy of Statistical Procedures in Microsoft Excel 2010.

Computational Statistics. Vol.29 No. 5, pp 1095-1128. October 2014.

http://homepages.ulb.ac.be/~gmelard/rech/gmelard_csda23.pdf

39. L. Knüsel. On the Accuracy of Statistical Distributions in Microsoft Excel 2010. Dept. of

Stats. - University of Munich, Germany. http://www.csdassn.org/software_reports/excel2011.pdf

40. M. Foley. About That 1 Billion Microsoft Office Figure…. All About Microsoft. ZDNet

June 2010. http://www.zdnet.com/article/about-that-1-billion-microsoft-office-figure/

41. NIST Statistical Reference Database (StRD). National Institute of Standards and

Technology. Information Technology Laboratory - Statistical Engineering Div. November 2003.

http://www.itl.nist.gov/div898/strd/

42. P. L'Ecuyer, R. Simard, TestU01: A C Library for Empirical Testing of Random Number

Generators. ACM Transactions on Mathematical Software. Vol. 33 No. 4, article 22, pp 22:1 –

22:40. August 2007. http://www.iro.umontreal.ca/~lecuyer/myftp/papers/testu01.pdf

http://simul.iro.umontreal.ca/testu01/tu01.html (current version 1.2.3, 18 August 2009).

43. G. Marsaglia. The Marsaglia Random Number CDROM Including the Diehard Battery of

Tests of Randomness. Florida State University - Department of Statistics and Supercomputer

Computations Research Institute. 1995. http://www.stat.fsu.edu/pub/diehard/

44. M. Matsumoto, T. Nishimura. Mersenne Twister: A 623-Dimensionally Equidistributed

Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer

Simulation. Vol.8 No. 1, pp 3-30. January 1998.

http://www.math.sci.hiroshima-u.ac.jp/~%20m-mat/MT/ARTICLES/mt.pdf


45. B. McCullough. A Review of TestU01. Journal of Applied Econometrics. Vol. 21 No. 5, pp

677-682. July/August 2006. http://www.pages.drexel.edu/~bdm25/testu01.pdf

46. B. Wichmann, I. Hill. Algorithm AS 183: An Efficient and Portable Pseudo-Random

Number Generator. Applied Statistics. Vol. 31 No. 2, pp 188-190. June 1982. https://www.researchgate.net/publication/243774153_Algorithm_AS_183_An_efficient_and_portable_ps

eudo-random_number_generator

47. B. Wichmann, I. Hill. Generating Good Pseudo-Random Numbers. Computational Statistics

& Data Analysis. Vol.51 No. 3, pp 1614-1622. December 2006. https://www.researchgate.net/publication/220055967_Generating_good_pseudo-random_numbers.

47a. A “long” version of the preceding paper (w/software and results from BigCrush TestU01) is

available from NPL with margin notes and additional appendices regarding implementation of

the enhanced 4-cycle Wichmann-Hill PRNG. http://www.npl.co.uk/science-technology/mathematics-modelling-and-simulation/mathematics-and-

modelling-for-metrology/mmm-software-downloads

48. T. Symul, S. Assad, P. Lam. Real Time Demonstration of High Bitrate Quantum Random

Number Generation with Coherent Laser Light. Applied Physics Letters. Vol. 98 No. 23. June

2011. http://arxiv.org/pdf/1107.4438.pdf http://photonics.anu.edu.au/qoptics/Research/qrng.php

49. A. Yee, S. Kondo. 12.1 Trillion Digits of Pi, And We’re Out of Disk Space... December 2013.

http://www.numberworld.org/misc_runs/pi-12t/

50. F. Panneton, P. L’Ecuyer, M. Matsumoto. Improved Long-Period Generators Based on

Linear Recurrences Modulo 2. ACM Transactions on Mathematical Software. Vol. 32 No. 1, pp

1-16. March 2006. http://www.iro.umontreal.ca/~lecuyer/myftp/papers/wellrng.pdf

51. JCGM:101 2008. Evaluation of Measurement Data – Supplement 1 to the “Guide to the

Expression of Uncertainty in Measurement” - Propagation of Distributions Using a Monte Carlo

Method. Joint Committee for Guides in Metrology. Working-Group 1. First Edition, 2008.

http://www.bipm.org/utils/common/documents/jcgm/JCGM_101_2008_E.pdf

52. A. Steele, R. Douglas. Simplifications from Simulations: Monte Carlo Methods for

Uncertainties. NCSLI Measure – The Journal of Measurement Science. Vol. 1 No. 2, pp 56-68.

June 2006. http://www.ncsli.org/I/mj/dfiles/NCSLI_Measure_2006_June.pdf

52a. A version of the preceding paper was also published in the 2005 Proceedings of the NCSL

International Workshop & Symposium, Washington D.C. August 2005.

53. P. Reese. Personal communications with P. L’Ecuyer & R. Simard via email. April 2015.

54. H. Castrup. Calibration Requirements Analysis System. Proceedings of the 1989 NCSL

Workshop and Symposium, Denver CO. July 1989.

http://www.isgmax.com/articles_papers/ncsl89.pdf

55. NCSLI LM-5. Laboratory Management Publication: Benchmark Survey - 2007. Sponsored

by Boeing Company. NCSL International 182 Benchmarking Programs Committee. Boulder

CO. 2007. http://www.ncsli.org

56. AIAG MSA-4. Measurement Systems Analysis Reference Manual. Automotive Industry

Action Group (AIAG) MSA Work Group. Chrysler Group LLC, Ford Motor Company, General

Motors Corporation. ISBN 978-1-60-534211-5. Fourth Edition. June 2010. http://www.aiag.org/source/Orders/prodDetail.cfm?productDetail=MSA-4


57. ISO/TS 16949:2009. Quality management systems -- Particular Requirements for the

Application of ISO 9001:2008 for Automotive Production and Relevant Service Part

Organizations. International Organization for Standardization. 2009.

http://www.ts16949.com/a55aeb/ts16949.nsf/layoutB/Home+Page?OpenDocument

58. T. Nolan, P. Provost. Understanding Variation. ASQ Quality Progress. American Society for

Quality. Vol. 23 No. 5. May 1990. http://www.apiweb.org/UnderstandingVariation.pdf

59. J. Bucher. The Quality Calibration Handbook: Developing and Managing a Calibration

Program. American Society for Quality, Quality Press. ISBN-13: 978-0-87389-704-1. 2007.

http://asq.org/quality-press/display-item/?item=H1293

60. J. Bucher. The Metrology Handbook. American Society for Quality, Measurement Quality

Division. ASQ Quality Press. ISBN 0-87389-620-3. 2004.

http://asq.org/quality-press/display-item/?item=H1428


APPENDIX A

Monte Carlo Modeling of EOPR

For Various Adjustment Thresholds Under Drift and Random Variation

Figure A1. Example of first ten iterations of the Monte Carlo simulation

Conditions:

Drift set to 10 % of Specification (Random: σ = 50.038 % of Spec).

Adjustment Threshold set to 80 % of Spec.

To facilitate a step-by-step understanding of the model, the first 10 iterations are shown in Figure

A1, the first five of which are described in detail below.

Iteration #1.

Initial Conditions are set to 0 % observed error and 0 % attribute bias. Since the observed error

(0 %) is less than 100 % of spec, the UUT is declared In-Tolerance. Since the observed error (0

%) is also less than the adjustment threshold of 80 %, no discretionary adjustment is made. The

as-left attribute bias is 0 % of specification. The UUT is returned to the customer. During the

course of this calibration interval, a random error associated with a normal distribution manifests

(+81.2 % of Spec). Additionally, a systematic drift error also manifests (+10 % of spec). These

two errors are additive, resulting in a net error of +91.2 % of spec. At the end of the calibration

interval (End-of-Period), the error observed for the UUT is +91.2 % of spec. Note that

only 10 % of this error is due to systematic drift. Thus, the true “bias” error of the UUT is only

+10 %. The additional +81.2 % error arose from a random error. If a proper adjustment was to be

made at the end of this interval, only a -10 % adjustment should be made to correct only the

systematic attribute bias due to the drift over this calibration interval.


Iteration #2. The UUT is received with an observed error of +91.2 % of spec. It is not known to

the calibration technician how much of the observed +91.2 % error is due to drift (bias) and how

much is due to random behavior. Since this error is less than 100 % of spec, the UUT is declared

In-Tolerance. However, since the observed error of +91.2 % is also greater than the adjustment

threshold of ±80 %, a discretionary adjustment is made. The technician makes an adjustment of -

91.2 % in an attempt to correct for the observed error. A proper adjustment would have only

been -10 % to compensate only for the cumulative systematic drift over the first interval. But this

is not possible since the only information available at the time of adjustment is the observed error

of +91.2 %. Thus, the adjustment overcompensates by -81.2 % and the UUT is returned to the

customer with an actual attribute bias of -81.2 %. During the course of this calibration interval, a

random error associated with a normal distribution manifests (+7.5 % of spec). Additionally, a

systematic drift error also manifests (+10 % of spec). These two errors are additive, resulting in a

net error of +17.5 % of spec. However, the previous calibration adjustment left the UUT with a -

81.2 % systematic bias. Therefore, this pre-existing -81.2 % attribute bias is also added to the

+17.5 % error, resulting in a net observed error at the End-Of-Period of -63.7 %. Note that only a

-71.2 % error is due to systematic effects (-81.2 % from the overcompensated adjustment and

+10 % drift from the second interval). If a proper adjustment was to be made at the end of this

interval, only a +71.2 % adjustment should be made to correct exclusively for the systemic

attribute bias due to the overcompensated adjustment and the drift during this second interval.

Iteration #3. The UUT is received with an observed error of -63.7 % of spec. It is not known to

the calibration technician how much of the observed -63.7 % error is due to systematic effects

(bias) and how much is due to random behavior. Since the observed error is less than 100 % of

spec, the UUT is declared In-Tolerance. Moreover, the observed error of -63.7 % is less than the

adjustment threshold of ±80 %; therefore, no discretionary adjustment is made. The UUT is

returned to the customer with an actual attribute bias of -71.2 %. During the course of this

calibration interval, a random error associated with a normal distribution manifests (+44.8 % of

spec). Additionally, a systematic drift error also manifests (+10 % of spec). These two errors are

additive, resulting in a net error of +54.8 % of spec. However, the previous calibration left the

UUT with a -71.2 % systematic bias. Therefore, this pre-existing -71.2 % attribute bias is added

to the +54.8 % error, resulting in a net observed error at the End-Of-Period of -16.5 %. Note that

only a -61.2 % error is due to systematic effects (-71.2 % and an additional +10 % drift from this

third interval). If a proper adjustment was to be made at the end of this interval, only a +61.2 %

adjustment should be made to correct exclusively for the systematic attribute bias.

Iteration #4. The UUT is received with an observed error of -16.5 % of spec. It is not known to

the calibration technician how much of the observed -16.5 % error is due to systematic effects

(bias) and how much is due to random behavior. Since the observed error is less than 100 % of

spec, the UUT is declared In-Tolerance. Moreover, since the observed error of -16.5 % is less

than the adjustment threshold of ±80 %, a discretionary adjustment is not performed. The UUT is

returned to the customer with an actual attribute bias of -61.2 %. During the course of this

calibration interval, a random error associated with a normal distribution manifests (-39.7 % of

spec). Additionally, a systematic drift error also manifests (+10 % of spec). These two errors are

additive, resulting in a net error of -29.7 % of spec. However, the calibration adjustment left the

UUT with a -61.2 % systematic bias. Therefore, this pre-existing -61.2 % attribute bias is added

to the -29.7 % error, resulting in a net observed error at the End-Of-Period of -90.9 %. Note that

the attribute bias is only -51.2 % due to systematic effects (-61.2 % and an additional +10 %


drift from this fourth interval). If a proper adjustment was to be made at the end of this interval, only a +51.2 % adjustment should be made to correct exclusively for the systematic attribute bias.

Iteration #5. The UUT is received with an observed error of -90.9 % of spec. It is not known to

the calibration technician how much of the observed -90.9 % error is due to systematic effects

(bias) and how much is due to random behavior. Since the observed error is less than 100 % of

spec, the UUT is declared In-Tolerance. However, since the observed error of -90.9 % is greater

than the adjustment threshold of ±80 %, a discretionary adjustment is made. The technician

makes an adjustment of +90.9 % in an attempt to correct for the observed error. A proper

adjustment would have only been +51.2 % to compensate only for the systematic attribute bias.

But this is not possible since the only information available at the time of adjustment is the

observed error of -90.9 %. Thus, the adjustment overcompensates by +39.7 % and the UUT is

returned to the customer with an actual attribute bias of +39.7 %. During the course of this

calibration interval, a random error associated with a normal distribution manifests (+115.8 % of

spec). Additionally, a systematic drift error also manifests (+10 % of spec). These two errors are

additive, resulting in a net error of +125.8 % of spec. However, the previous calibration left the

UUT with a +39.7 % systematic bias from the previous adjustment. Therefore, this pre-existing

+39.7 % attribute bias is added to the +125.8 % error, resulting in a net observed error at the

End-Of-Period of +165.5 %. Note that the attribute bias is only +49.7 % due to systematic

effects (+39.7 % bias from the previous adjustment, and another +10 % drift from this fifth

interval). If a proper adjustment was to be made at the end of this interval, only a -49.7 %

adjustment should be made to correct exclusively for the systematic attribute bias due to the

overcompensated adjustment and the drift over this fifth calibration interval. The UUT will arrive in the calibration lab at the beginning of the 6th iteration with an actual attribute bias of +49.7 %, but with an observed error of +165.5 %.
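
For reference, the bookkeeping in each of the iterations above can be summarized compactly. The symbols below are introduced here only for illustration: let b_(k-1) denote the as-left systematic bias from the previous calibration, d_k the drift accumulated over the k-th interval, and r_k the random error realized at the k-th calibration. Then

e_k = b_(k-1) + d_k + r_k                      (observed error at the k-th calibration)
b_k = b_(k-1) + d_k                            (as-left bias if no adjustment is made)
b_k = (b_(k-1) + d_k) - e_k = -r_k             (as-left bias if the full observed error is adjusted out)

The adjusted case shows why adjusting on the basis of the observed error alone converts the random error of the current calibration into the systematic bias carried into the next interval.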

The cyclic process described above is repeated for 100 000 iterations and the EOPR is computed.

The 100 000 iteration cycle is repeated 9 more times and the average of the ten EOPR values is

taken as the final estimate of one EOPR value for use in the 101 x 101 matrix. This entire

process is then repeated 10 200 times (1.02 × 10¹⁰ total iterations) to complete the matrix, shown in Figures 10A and 10B.
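
The per-interval bookkeeping described above can also be expressed as a short VBA sketch. This is an illustrative reconstruction only: the function and argument names (SimulateEOPR, DriftPerInterval, RandomSigma, AdjustThreshold) are hypothetical, WorksheetFunction.Norm_Inv is assumed for the normal deviates, and this is not the actual code used to generate Figures 10A and 10B.

' Illustrative sketch only: simulate one instrument's calibration history and
' estimate its End-Of-Period Reliability (EOPR). All errors are in % of spec.
Function SimulateEOPR(ByVal nIntervals As Long, ByVal DriftPerInterval As Double, _
                      ByVal RandomSigma As Double, ByVal AdjustThreshold As Double) As Double
    Dim bias As Double        ' true systematic attribute bias carried between intervals
    Dim observed As Double    ' error observed at the calibration event
    Dim u As Double
    Dim inTolCount As Long
    Dim i As Long
    Randomize                 ' seed VBA's Rnd from the system timer
    bias = 0
    For i = 1 To nIntervals
        bias = bias + DriftPerInterval          ' systematic drift accumulates over the interval
        Do
            u = Rnd()                           ' uniform deviate in (0,1); zero is rejected
        Loop While u = 0#
        observed = bias + Application.WorksheetFunction.Norm_Inv(u, 0, RandomSigma)
        If Abs(observed) <= 100 Then inTolCount = inTolCount + 1   ' as-found in-tolerance?
        ' Discretionary adjustment: only the observed error can be corrected, so the
        ' random component is removed along with the true bias (overcompensation).
        If Abs(observed) > AdjustThreshold Then bias = bias - observed
    Next i
    SimulateEOPR = inTolCount / nIntervals      ' fraction found in-tolerance = EOPR estimate
End Function

Averaging several such runs, as described above, would then yield one cell of the EOPR matrix.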


APPENDIX B

On the Use of Microsoft Excel® for Monte Carlo Methods

The use of Microsoft Excel as a serious scientific platform for statistical analysis has many

detractors as well as a long history of critique by the statistical community [30-39]. However, it

remains one of the most widely used of all software tools in current use. As of 2010, an

estimated 750 million copies have been installed as part of the MS Office suite [40].

Excel may arguably be described by the principle of Maslow’s Hammer, often stated as, “If all

you have is a hammer, everything looks like a nail.” Excel is undoubtedly utilized in many

situations where a more appropriate or efficient tool exists. Yet, this observation alone does not

preclude Excel’s utility in a wide array of diverse applications. The flexibility and ubiquitous

nature of Excel may be more analogous to a Swiss Army Knife than a hammer. It may not be the

best tool for any job, but it can be an acceptable tool for many jobs, especially when

precautionary measures are taken to ensure acceptable performance. Given the Visual Basic for

Applications (VBA) programming environment in Excel, it can be a powerful option.

Like any software, it should be confirmed via objective evidence that Excel will provide

trustworthy, accurate results with an acceptable degree of confidence. This gives rise to

validation requirements in some critical applications to ensure that computations are being

performed correctly with an acceptable degree of accuracy. In this regard, Excel is no different

than any other software package. Its built-in functions, user-defined functions, logic, equations,

etc. should be validated to the extent necessary to satisfy applicable requirements. NIST provides the Statistical Reference Datasets (StRD) to aid in such evaluations [41].
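
As one simple illustration of such a spot check (the subroutine name below is hypothetical), the standard normal quantile at a probability of 0.975 has a well-established reference value of approximately 1.95996398454005, against which Excel's built-in function can be compared:

' Illustrative spot check of a built-in function against a known reference value.
Sub SpotCheckNormSInv()
    Debug.Print Application.WorksheetFunction.Norm_S_Inv(0.975)   ' expect ~1.95996398454005
End Sub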

Excel 2010

Many, but not all, of the historical criticisms regarding Excel’s suitability for statistical analysis

have been addressed and largely rectified with Excel 2010 [38-39]. Melard [38] has evaluated

Excel 2010’s Pseudo Random Number Generator (PRNG), implemented as the RAND()

function. The RAND() function is designed to return values uniformly distributed over the range

of [0,1). Melard has shown the RAND() function in Excel 2010 to pass most modern statistical

tests for randomness, specifically a modified version of the Crush test suite in the TestU01

library by L’Ecuyer and Simard [42]. TestU01 has essentially superseded older series of RNG

tests, e.g. the Diehard tests of Marsaglia [43], and offers a challenging battery of tests for any PRNG.

Had Melard chosen to invoke the most rigorous test suite of the TestU01 library for testing the

RAND() function in Excel 2010, called BigCrush, it would have required a very large test file of

random numbers from Excel, roughly 3 TB in size; thus, BigCrush was not performed. As it was,

the smaller 412 GB Crush test-file (~2³⁵ numbers) took two weeks to generate and 36 hours of

CPU time to run the actual Crush tests. He concludes, “All tests are passed except ‘Periods in

Strings’ with r = 15 and s = 15 for which the p-value is 8.10⁻⁷”. Melard attributes these

anomalies to his specific approach in generating the test file to manage its size. Additionally,

Melard references a “semi-official” indication that the Mersenne Twister algorithm known as

MT19937 has been implemented for the Excel 2010 RAND() function and is assumed to be

responsible for the improved performance, compared with previous versions of Excel.


The Mersenne Twister (MT19937) is a somewhat modern pseudo random number generator

published in 1998 by Matsumoto and Nishimura [44]. It is now available in many mathematical

programming packages, e.g. MATLAB, Maple, R, GAUSS, SAS, SPSS, Ruby, Python, Julia,

Visual C++, etc. and presumably in Excel 2010. However, MT19937 has been shown to fail two

tests for linear complexity (r = 1 and r = 29, where p < 10⁻¹⁵) in the extensive BigCrush suite of

tests using version 1.0 of TestU01 [38, 42]. Conflictingly, other authors report that MT19937

passes all tests in BigCrush using version 1.1 [38, 45] and version 0.6.0⁴ [47] of TestU01.

However, L’Ecuyer and Simard (the authors of TestU01) have confirmed that MT19937 does

indeed fail recent TestU01 tests for linear complexity and that it is well-understood why this

occurs [53]. L’Ecuyer et al have published that MT19937, “...successfully passed all the

statistical tests included in… BigCrush of TestU01, except those that look for linear

dependencies in a long sequence of bits, such as… the linear complexity tests… This is in fact a

limitation of all F₂-linear generators, including the Mersenne Twister… Because of their linear

nature, the sequences produced by these generators just cannot have the linear complexity of a

truly random sequence. This is definitely unacceptable in cryptology… but is quite acceptable

for the vast majority of simulation applications…”[50].

Good PRNG’s are essential in order for Monte Carlo methods to yield accurate results. The same

is true for the probability distributions used in the simulations, e.g. the normal distribution function and its inverse. In the aforementioned paper by Melard [38], the accuracy of the Excel 2010 NORM.INV function, along with many other probability distributions, was also tested with

positive results. Improvements over Excel 2003 and 2007 were noted and its accuracy was on par

with other statistical applications. Melard states, “On the basis of these results, Microsoft Excel

2010 appears as good as OpenOffice.org Calc 3.3”. He continues, “To conclude, most of the

problems of Excel raised by Yalta (2008) were corrected in the 2010 version.” Regarding the

NORM.S.DIST and NORM.DIST functions of Excel 2010, Knüsel [39] additionally notes, “No

errors were found with these two functions⁵” and states “Most of the errors in Microsoft Excel 97

and Excel 2003 pointed out in my previous papers have been eliminated in Excel 2010.”
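
As a concrete illustration of how these two components work together, a normally distributed random error can be produced by inverse-transform sampling, i.e. by passing a uniform deviate from the PRNG through the inverse normal distribution function. The worksheet equivalent is =NORM.INV(RAND(), 0, 1); the VBA sketch below is illustrative only (the function name RandNormal is hypothetical):

' Illustrative sketch: draw one normally distributed value by inverse-transform sampling.
Function RandNormal(ByVal mean As Double, ByVal sigma As Double) As Double
    Dim u As Double
    Do
        u = Rnd()     ' uniform deviate in (0,1); zero is rejected so NORM.INV stays defined
    Loop While u = 0#
    RandNormal = Application.WorksheetFunction.Norm_Inv(u, mean, sigma)
End Function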

The preceding evidence suggests that two critical aspects of Monte Carlo simulations are

satisfied by Excel 2010, i.e. accurate statistical probability density functions and a robust random

number generator. As such, Excel 2010 may be a viable tool for investigations such as those

presented in this paper. In addition, looping constructs in VBA can be used to readily process

Monte Carlo simulations in Excel. When doing so, it is sometimes helpful to change the formula

calculation option for workbooks from its default setting of “automatic” to “manual”, and then to

embed the VBA code to perform calculations (e.g. “calculate”) into the loop itself. Also, turning

off screen updating “Application.ScreenUpdating = False” can greatly reduce the time required

to perform long sequences of Monte Carlo iterations in Excel.
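
A minimal driver routine illustrating these settings is sketched below; the subroutine name, the iteration count, and the placeholder loop body are assumptions for illustration, not the code used for the simulations in this paper.

' Illustrative sketch: run a long Monte Carlo loop with manual calculation and
' screen updating suspended, forcing a recalculation of the workbook on each trial.
Sub RunMonteCarloTrials()
    Dim i As Long
    Application.Calculation = xlCalculationManual
    Application.ScreenUpdating = False
    For i = 1 To 100000
        Application.Calculate       ' recalculate volatile RAND() cells for this trial
        ' ... read the trial result from the worksheet and accumulate statistics here ...
    Next i
    Application.ScreenUpdating = True
    Application.Calculation = xlCalculationAutomatic
End Sub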

⁴ Wichmann & Hill in 2006 [47] report MT19937 passes all BigCrush tests in Version 6.0 of TestU01 (dated Jan 14

2005). However, Simard has confirmed this must have actually been version 0.6.0 (“pre-official-release”) where the

Linear Complexity tests used all of the first 30 bits; MT19937 would indeed pass this [53]. Later official versions of

TestU01 have two linear complexity tests that use the 1st and 30th bits of each random number, which MT19937 fails

[42, 50, 53] using version 1.0 and later. It is unknown why passing results for all BigCrush tests in Ver 1.1 of

TestU01 were reported by McCullough [45]. The current version of TestU01 is 1.2.3, dated 18 August 2009 [42].

⁵ Knüsel does report [negligible] errors in Excel 2010’s NORM.S.INV and NORM.INV functions at extremely small probabilities (p-values) < 2.2251 × 10⁻³⁰⁸.


Previous versions of Excel

Prior to Excel 2010, the RAND() function in Excel was not generally considered suitable for such

methods [30-39], most notably prior to Excel 2003. The release of Excel 2003 saw an improved

PRNG, with RAND() reportedly implementing the popular 3-cycle Wichmann-Hill PRNG from

1982, also known as algorithm AS 183 [46]. However, a bug in the original Excel 2003 RAND()

function caused negative numbers to be occasionally generated and Microsoft soon issued a

patch to correct this error [33, 37]. It should be noted that, even the patched 2003 version of

RAND(), as well as the version implemented in Excel 2007, was tested and found not to be a

robust implementation of the AS 183 Wichmann-Hill PRNG algorithm from 1982 [37]. Issues

with the accuracy of some probability distributions prior to 2010 have also been reported [30-36]. Thus, versions of Excel prior to 2010 should be carefully evaluated when used for Monte

Carlo simulations or other statistically intensive computations, especially in critical applications

where the risk or consequences of inaccurate results are significant.

Pseudo Random Number Generators (PRNG’s)

With improvements in computing power and increased abilities to test PRNG’s, even a proper

implementation of the “3-cycle” Wichmann-Hill AS 183 algorithm from 1982 [46] can present

limitations in modern applications such as intensive Monte Carlo modeling. A new breed of

PRNG’s has evolved that addresses these issues, such as the aforementioned Mersenne Twister.

However, even the best PRNG’s are deterministic. That is, given the same set of initial

conditions, called the seed, a given PRNG will produce exactly the same output stream of

numbers each time it is run. If the algorithm is known, and the seed is known, the sequence of

output numbers can be exactly predicted. This is desirable in some instances (such as auditing)

but not for others (e.g. encryption). While predictability and randomness may seem mutually

exclusive, they are not necessarily so.

For example, the digits of pi (now known to more than 12 trillion digits [49]) have been

postulated to be random. Although no formal proof of pi’s randomness has been found to date,

neither has any regular pattern. The apparently random digits of pi are nevertheless predictable.

Such predictability does not necessarily negate the randomness inherent to the sequence of

numbers. With PRNG’s, the predictability of the output stream can be somewhat (but not totally)

inhibited by introducing entropy and/or secrecy into the seed, because it will be unknown exactly

where the sequence of numbers “started”. The intractability of prediction and retrodiction is a requirement for cryptographically secure pseudo random number generators (CSPRNG’s).

Such generators must preclude prediction of the random numbers even when the algorithm is known, when long samples of output numbers are available for inspection, and even when their internal state has been revealed. In the most extreme cases, truly random numbers may be

generated from quantum phenomena [48]. Monte Carlo methods do not require such generators –

only that the PRNG used is robust and passes most modern statistical tests (e.g. TestU01).

Although evidence indicates that the RAND() function in Excel 2010 should be adequately

robust, it is seeded by an undocumented method, which is generally believed to be associated

with the real-time clock of the host computer. There is no direct user-control over the seeding

process. It is a volatile function, returning a different random number each time a “calculation”

is performed. This does not necessarily reduce the performance of the RAND() function, but it

will not provide for reproducibility, which may be important for independent auditing.


In addition to the deterministic nature of pseudo random number generators, PRNG’s do not

have infinite periods. At some point, the stream of output numbers will begin to repeat itself; a

replicate pattern will eventually emerge. Short periods are undesirable. The original “3-cycle”

Wichmann-Hill PRNG (1982) has a period of ~10¹³ [46, 47]. This is relatively small by today’s

standards and this older PRNG also fails some BigCrush tests in TestU01. A revised/enhanced

“4-cycle” Wichmann-Hill PRNG (2006) has a period of ~10³⁶ [47], adequate for most any

application imaginable. Moreover, the enhanced 4-cycle Wichmann-Hill PRNG has been shown

to pass all BigCrush tests in version 0.6.0⁶ of TestU01 [47, 47a]. It has many other desirable

properties as well and requires only 26 lines of code to implement in ‘C’; it could also be

implemented in Excel using VBA. It is regarded as a highly robust PRNG and is referenced in

Annex C of JCGM 101:2008⁷ (GUM Supplement 1) for computing measurement uncertainty via

Monte Carlo methods [51]. Although the aforementioned MT19937 algorithm fails two

BigCrush tests in more recent versions of TestU01, it has an extremely long period of ~2¹⁹⁹³⁷ [44], or ~10⁶⁰⁰¹. To fully appreciate the length of these periods, consideration of the following

large numbers is insightful:

The age of the universe is ~4.4 × 10¹⁷ seconds (13.8 billion years).

The fastest supercomputers approach ~3 × 10¹⁶ floating point operations per second.

The number of atoms in the observable universe is ~10⁸⁰.

The number of Planck volumes in the observable universe is ~10¹⁸⁵.

A paper in 2006 by Steele and Douglas⁸ [52, 52a] also provides a wealth of practical information

and useful insights for performing Monte Carlo simulations in Excel. While focused on

computing measurement uncertainties in Excel, the paper illustrates the usefulness of the VBA

programming environment for implementing alternative (custom) pseudo random number

generators. The authors provide VBA code for the 1982 Wichmann-Hill PRNG and include

step-by-step instructions for writing custom user-defined VBA functions. Advanced users are

referred to external Dynamic Link Libraries (DLL’s) to facilitate faster execution of compiled

code in ‘C’ (such as PRNG’s) within Excel. Also identified in the paper are the limitations of

the 1982 Wichmann-Hill generator along with reference to a PRNG called RANLUX which

offers higher dynamic range as well as other beneficial characteristics (the authors offer to

provide VBA code for RANLUX as well as additional helpful resources). It should be noted that

the Steele and Douglas paper was written prior to the publication of the enhanced Wichmann-Hill

4-cycle PRNG (2006), prior to the final release of JCGM 101:2008 (GUM Supplement One), and

prior to the advent of Excel 2010. Nevertheless, the paper remains an excellent resource for the

researcher wishing to investigate Monte Carlo methods in Excel.
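
For illustration, a user-defined VBA function implementing the 1982 three-cycle Wichmann-Hill (AS 183) recurrences might be sketched as follows. The procedure names and module-level seed variables are hypothetical, and this is not the listing provided by Steele and Douglas; the explicit seeding is what makes such a generator reproducible for auditing, in contrast to the worksheet RAND() function.

' Illustrative sketch of the 1982 "3-cycle" Wichmann-Hill (AS 183) generator.
Private s1 As Long, s2 As Long, s3 As Long    ' generator state; seeds should be in 1..30000

Public Sub WHSeed(ByVal a As Long, ByVal b As Long, ByVal c As Long)
    s1 = a: s2 = b: s3 = c                    ' explicit seeding makes runs reproducible
End Sub

Public Function WHRand() As Double
    Dim r As Double
    s1 = (171 * s1) Mod 30269
    s2 = (172 * s2) Mod 30307
    s3 = (170 * s3) Mod 30323
    r = s1 / 30269# + s2 / 30307# + s3 / 30323#
    WHRand = r - Int(r)                       ' fractional part: uniform on [0, 1)
End Function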

⁶ Reported by Wichmann & Hill [47] as version 6.0 of TestU01. See preceding footnote 4 regarding version 0.6.0.

⁷ JCGM 101:2008 does not exclusively recommend any particular PRNG over others. It states, “Generators other than those given in this annex can be used. Their statistical quality should be tested before use.” [51]

⁸ This paper was also presented by Dr. Alan Steele on August 9th at the 2005 NCSL International Workshop and Symposium in Washington D.C., for which it won Best Paper award in Theoretical Metrology [52a].