Beyond MARLAP: New Statistical Tests for Method Validation

NAREL – ORIA – US EPA

Laboratory Incident Response Workshop at the 53rd Annual RRMC


Outline

The method validation problem

MARLAP’s test, and its peculiar features

New approach: testing mean squared error (MSE)

Two possible tests of MSE: the chi-squared test and the likelihood ratio test

Power comparisons

Recommendations and implications for MARLAP

The Problem

We’ve prepared spiked samples at one or more activity levels

A lab has performed one or more analyses of the samples at each level

Our task: Evaluate the results to see whether the lab and method can achieve the required uncertainty (uReq) at each level

MARLAP’s Test

In 2003 the MARLAP work group developed a simple test for MARLAP Chapter 6

The group chose a very simple criterion

The original criterion was whether every result was within ±3uReq of the target

It was modified slightly to keep the false rejection rate ≤ 5 % in all cases

Equations

The acceptance range is TV ± k·uReq, where TV is the target value (true value), uReq is the required uncertainty at TV, and

$$k = z_{0.5 + 0.5(1-\alpha)^{1/n}}$$

E.g., for n = 21 measurements (7 replicates at each of 3 levels), with α = 0.05, we get k = z0.99878 = 3.03

For smaller n we get a slightly smaller k
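As a quick numerical check, the multiplier k = z_p with p = 0.5 + 0.5(1 − α)^(1/n) can be evaluated with the Python standard library. This is only a sketch; the function name `marlap_k` is ours and does not come from MARLAP.

```python
from statistics import NormalDist

def marlap_k(n, alpha=0.05):
    """Multiplier k = z_p, where p = 0.5 + 0.5 * (1 - alpha)**(1/n).

    n is the total number of measurements; alpha is the allowed
    overall false rejection rate. (Hypothetical helper name.)
    """
    p = 0.5 + 0.5 * (1.0 - alpha) ** (1.0 / n)
    return NormalDist().inv_cdf(p)

k21 = marlap_k(21)  # 7 replicates at each of 3 levels
k7 = marlap_k(7)    # fewer measurements give a smaller k
print(round(k21, 2), round(k7, 2))
```

For n = 21 this gives the value of about 3.03 quoted on the slide.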

Required Uncertainty

The required uncertainty, uReq, is a function of the target value:

$$u_{\mathrm{Req}}(TV) = \begin{cases} u_{\mathrm{MR}}, & \text{if } TV \le \mathrm{UBGR} \\ TV \times \varphi_{\mathrm{MR}}, & \text{if } TV > \mathrm{UBGR} \end{cases}$$

Where uMR is the required method uncertainty at the upper bound of the gray region (UBGR)

φMR is the corresponding relative method uncertainty
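The piecewise rule for the required uncertainty (uMR at or below the UBGR, TV × φMR above it) is simple enough to express as a small function. The sketch below assumes φMR = uMR / UBGR, as in the level D example later in the talk; the function and argument names are illustrative only.

```python
def required_uncertainty(tv, ubgr, u_mr):
    """Required uncertainty u_Req at target value tv.

    Assumes phi_MR = u_MR / UBGR, so the two branches agree at tv == ubgr.
    All three arguments must be in the same activity units.
    """
    if tv <= ubgr:
        return u_mr              # absolute requirement below the UBGR
    return tv * u_mr / ubgr      # relative requirement above the UBGR

# Levels used in the worked example: 50, 100, and 300 pCi/L
levels = [50.0, 100.0, 300.0]
u_req = [required_uncertainty(tv, ubgr=100.0, u_mr=10.0) for tv in levels]
print(u_req)
```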

Alternatives

We considered a chi-squared (χ²) test as an alternative in 2003

It accounted for the uncertainty of target values using “effective degrees of freedom”

It was rejected at the time because of its complexity and a lack of evidence for its performance

We kept the simple test that now appears in MARLAP Chapter 6

But we didn’t forget about the χ² test


Peculiarity of MARLAP’s Test

Power to reject a biased but precise method decreases with number of analyses performed (n)

Because we adjusted the acceptance limits to keep false rejection rates low

Acceptance range gets wider as n gets larger

Biased but Precise

[Image] This graphic image was borrowed and edited for the RRMC workshop presentation. Please view the original at despair.com: http://www.despair.com/consistency.html

Best Use of Data?

It isn’t just about bias

MARLAP’s test uses data inefficiently, even to evaluate precision alone (its original purpose)

The statistic, in effect, is just the worst normalized deviation from the target value:

$$M = \max_{1 \le j \le n} |Z_j| \quad \text{where} \quad Z_j = \frac{X_j - TV}{u_{\mathrm{Req}}}$$

This wastes a lot of useful information

Example: The MARLAP Test

Suppose we perform a level D method validation experiment

UBGR = AL = 100 pCi/L

uMR = 10 pCi/L

φMR = 10/100 = 0.10, or 10 %

Three activity levels (L = 3): 50 pCi/L, 100 pCi/L, and 300 pCi/L

Seven replicates per level (N = 7)

Allow 5 % false rejections (α = 0.05)

Example (continued)

For 21 measurements, calculate

$$k = z_{0.5 + 0.5(1 - 0.05)^{1/21}} = z_{0.99878} = 3.03 \approx 3.0$$

When evaluating measurement results for target value TV, require for each result Xj:

$$|Z_j| = \frac{|X_j - TV|}{u_{\mathrm{Req}}} \le 3.0$$

Equivalently, require

$$M = \max_{1 \le j \le 21} |Z_j| \le 3.0$$

Example (continued)

We’ll work through calculations at just one target value

Say TV = 300 pCi/L

This value is greater than UBGR (100 pCi/L)

So the required uncertainty is 10 % of 300 pCi/L: uReq = 30 pCi/L

Example (continued)

Suppose the lab produces the 7 results Xj shown below

For each result, calculate the “Z score”

$$Z_j = \frac{X_j - TV}{u_{\mathrm{Req}}} = \frac{X_j - 300\ \text{pCi/L}}{30\ \text{pCi/L}}$$

We require |Zj| ≤ 3.0 for each j

j    Xj (pCi/L)
1    256.1
2    235.2
3    249.0
4    258.5
5    265.2
6    255.7
7    254.5

Example (continued)

Every |Zj| is smaller than 3.0

The method is obviously biased (~15 % low)

But it passes the MARLAP test

Scores and Evaluation

Xj (pCi/L)    Zj        |Zj| ≤ 3.0?
256.1         −1.463    Yes
235.2         −2.160    Yes
249.0         −1.700    Yes
258.5         −1.383    Yes
265.2         −1.160    Yes
255.7         −1.477    Yes
254.5         −1.517    Yes

$$M = \max_{1 \le j \le 7} |Z_j| = 2.160$$
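The evaluation in this example can be reproduced in a few lines. The sketch below uses the seven results from the slide with TV = 300 pCi/L, uReq = 30 pCi/L, and the rounded multiplier k = 3.0; the variable names are ours.

```python
# Seven replicate results at TV = 300 pCi/L (values from the slide)
x = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]
tv, u_req, k = 300.0, 30.0, 3.0

# Z score of each result, then the MARLAP statistic M = max |Z_j|
z = [(xj - tv) / u_req for xj in x]
m = max(abs(zj) for zj in z)
passes = m <= k

# Mean relative bias, to show how far off the method is on average
rel_bias = (sum(x) / len(x) - tv) / tv

print(f"M = {m:.3f}, passes = {passes}, relative bias = {rel_bias:.1%}")
```

The method passes because only the single worst result (Z = −2.160) enters the statistic, even though every result is low by roughly the same amount.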


2007

In early 2007 we were developing the new method validation guide

Applying MARLAP guidance, including the simple test of Chapter 6

Someone suggested presenting power curves in the context of bias

Time had come to reconsider MARLAP’s simple test

Bias and Imprecision

Which is worse: bias or imprecision?

Either leads to inaccuracy

Both are tolerable if not too large

When we talk about uncertainty (à la GUM), we don’t distinguish between the two

Mean Squared Error

When characterizing a method, we often consider bias and imprecision separately

Uncertainty estimates combine them

There is a concept in statistics that also combines them: mean squared error

Definition of MSE

If X is an estimator for a parameter θ, the mean squared error of X is MSE(X) = E((X − θ)²) by definition

It also equals MSE(X) = V(X) + Bias(X)² = σ² + δ², where σ² is the variance of X and δ is its bias

If X is unbiased, MSE(X) = V(X) = σ²

We tend to think in terms of the root MSE, which is the square root of MSE

New Approach

For the method validation guide we chose a new conceptual approach

A method is adequate if its root MSE at each activity level does not exceed the required uncertainty at that level

We don’t care whether the MSE is dominated by bias or imprecision


Root MSE v. Standard Uncertainty

Are root MSE and standard uncertainty really the same thing?

Not exactly, but one can interpret the GUM’s treatment of uncertainty in such a way that the two are closely related

We think our approach – testing uncertainty by testing MSE – is reasonable

Chi-squared Test Revisited

For the new method validation document we simplified the χ² test proposed (and rejected) in 2003

Ignore uncertainties of target values, which should be small

Just use a straightforward χ² test

Presented as an alternative in App. E

But the document still uses MARLAP’s simple test

The Two Hypotheses

We’re now explicitly testing the MSE

Null hypothesis (H0): $\mathrm{MSE} \le u_{\mathrm{Req}}^2$

Alternative hypothesis (H1): $\mathrm{MSE} > u_{\mathrm{Req}}^2$

In MARLAP the 2 hypotheses were not clearly stated

Assumed any bias (δ) would be small

We were mainly testing variance (σ²)

A χ² Test for Variance

Imagine we really tested variance only

H0: $\sigma^2 \le u_{\mathrm{Req}}^2$

H1: $\sigma^2 > u_{\mathrm{Req}}^2$

We could calculate a χ² statistic

$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - \bar{X})^2$$

Chi-squared with N − 1 degrees of freedom

Presumes there may be bias but doesn’t test for it

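MLE for Variance

The maximum-likelihood estimator (MLE) for σ² when the mean is unknown is:

$$\hat{\sigma}_X^2 = \frac{1}{N} \sum_{j=1}^{N} (X_j - \bar{X})^2$$

Notice the similarity to χ² from the preceding slide:

$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - \bar{X})^2$$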

Another χ² Test for Variance

We could calculate a different χ² statistic

$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - TV)^2$$

N degrees of freedom

Can be used to test variance if there is no bias

Any bias increases the rejection rate

MLE for MSE

The MLE for the MSE is:

$$\hat{\sigma}_X^2 + (\bar{X} - TV)^2 = \frac{1}{N} \sum_{j=1}^{N} (X_j - TV)^2$$

Notice the similarity to χ² from the preceding slide:

$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - TV)^2$$

In the context of biased measurements, χ² seems to assess MSE rather than variance
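The MLE of the MSE can be checked numerically: the MLE of the variance plus the squared estimated bias equals the mean squared deviation from the target value. The sketch below verifies this on the seven results from the earlier MARLAP example (TV = 300 pCi/L); the variable names are illustrative.

```python
x = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]
tv = 300.0
n = len(x)

mean = sum(x) / n
var_mle = sum((xj - mean) ** 2 for xj in x) / n   # MLE of sigma^2 (divides by N)
bias_est = mean - tv                              # estimated bias
mse_mle = sum((xj - tv) ** 2 for xj in x) / n     # MLE of the MSE

# The decomposition MSE = variance + bias^2 holds exactly for these estimates
gap = abs(var_mle + bias_est ** 2 - mse_mle)
root_mse = mse_mle ** 0.5

print(f"variance = {var_mle:.1f}, bias = {bias_est:.1f}, root MSE = {root_mse:.1f}")
```

For these data the estimated root MSE is dominated by the bias term and is well above uReq = 30 pCi/L.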

Our Proposed χ² Test for MSE

For a given activity level (TV), calculate a χ² statistic W:

$$W = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - TV)^2 = \sum_{j=1}^{N} Z_j^2$$

Calculate the critical value of W as follows:

$$w_C = \chi^2_{1-\alpha}(N)$$

N = number of replicate measurements

α = maximum false rejection rate at this level

Multiple Activity Levels

When testing at more than one activity level, calculate the critical value as follows:

$$w_C = \chi^2_{(1-\alpha)^{1/L}}(N)$$

Where L is the number of levels and N is the number of measurements at each level

Now α is the maximum overall false rejection rate

Evaluation Criteria

To perform the test, calculate Wi at each activity level TVi

Compare each Wi to wC

If Wi > wC for any i, reject the method

The method must pass the test at each spike activity level

Don’t allow bad performance at one level just because of good performance at another

Lesson Learned

Don’t test at too many levels

Otherwise you must choose: a high false acceptance rate at each level, a high overall false rejection rate, or complicated evaluation criteria

Prefer to keep error rates low

Need a low level and a high level

But probably not more than three levels (L = 3)


Better Use of Same Data

The χ2 test makes better use of the measurement data than the MARLAP test

The statistic W is calculated from all the data at a given level – not just the most extreme value

Caveat

The distribution of W is not completely determined by the MSE

It depends on how the MSE is partitioned into variance and bias components

Our test looks like a test of variance

As if we know δ = 0 and we’re testing σ² only

But we’re actually using it to test MSE

False Rejections

If wC < N, the maximum false rejection rate (100 %) occurs when δ = ±uReq and σ = 0

But you’ll never have this situation in practice

If wC ≥ N + 2, the maximum false rejection rate occurs when σ = uReq and δ = 0

This is the usual situation

It is why we can assume the null distribution is χ²

Otherwise the maximum false rejection rate occurs when both δ and σ are nonzero

This situation is unlikely in practice

To Avoid High Rejection Rates

We must have wC ≥ N + 2

This will always be true if α < 0.08, even if L = N = 1

Ensures the maximum false rejection rate occurs when δ = 0 and the MSE is just σ²

Not stated explicitly in App. E, because we didn’t have a proof at the time, and it is not an issue if you follow the procedure

Now we have a proof

Example: Critical Value

Suppose L = 3 and N = 7

Let α = 0.05

Then the critical value for W is

$$w_C = \chi^2_{(1-\alpha)^{1/L}}(N) = \chi^2_{0.95^{1/3}}(7) = \chi^2_{0.98305}(7) = 17.1$$

Since wC ≥ N + 2 = 9, we won’t have unexpectedly high false rejection rates

Since α < 0.08, we didn’t really have to check
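The critical value wC, the percentile of order (1 − α)^(1/L) of the χ² distribution with N degrees of freedom, is normally read from tables or a statistics package. For readers without one, the sketch below computes it from first principles using the series for the regularized incomplete gamma function and bisection. All function names are ours, and the series is only intended for moderate arguments like those in this example.

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_cdf(x, dof):
    """cdf of the (central) chi-squared distribution."""
    return gamma_p(dof / 2.0, x / 2.0)

def chi2_quantile(p, dof):
    """Percentile of the chi-squared distribution, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, dof) < p:   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_value(n, levels=1, alpha=0.05):
    """w_C: chi-squared percentile of order (1 - alpha)**(1/levels), n dof."""
    return chi2_quantile((1.0 - alpha) ** (1.0 / levels), n)

w_c = critical_value(n=7, levels=3, alpha=0.05)
print(round(w_c, 1), w_c >= 7 + 2)
```

The same function can be used to confirm the condition wC ≥ N + 2 for other choices of N, L, and α.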

Some Facts about the Power

The power always increases with |δ|

The power increases with σ if […] or if […]

For a given bias δ with […], there is a positive value of σ that minimizes the power

If […], even this minimum power exceeds 50 %

Power increases with N

The bracketed conditions compare δ and σ with the thresholds $u_{\mathrm{Req}}\sqrt{w_C/N}$ and $u_{\mathrm{Req}}^2\, w_C/(N+2)$; their exact form is not legible in this copy

Power Comparisons

We compared the tests for power: power to reject a biased method, and power to reject an imprecise method

The χ2 test outperforms the simple MARLAP test on both counts

Results of comparisons at end of this presentation

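False Rejection Rates

[Figure: false rejection rates in the (δ, σ) plane when wC ≥ N + 2. H0 is the region δ² + σ² ≤ uReq² and H1 is the region beyond it; the bias axis is marked at 0 and ±uReq. Labels on the figure: rejection rate = α, rejection rate < α, rejection rate = 0.]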

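Region of Low Power

[Figure: region of low power for the χ² test in the (δ, σ) plane when wC ≥ N + 2, showing H0, H1, and the label “rejection rate = α”. The bias axis is marked at 0, ±uReq, and ±uReq·√(wC/N).]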

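Region of Low Power (MARLAP)

[Figure: region of low power for MARLAP’s test in the (δ, σ) plane, showing H0, H1, and the label “rejection rate = α”. The bias axis is marked at 0, ±uReq, and ±k·uReq.]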

Example: Applying the χ² Test

Return to the scenario used earlier for the MARLAP example

Three levels (L = 3)

Seven measurements per level (N = 7)

5 % overall false rejection rate (α = 0.05)

Consider results at just one level, TV = 300 pCi/L, where uReq = 30 pCi/L

$$w_C = \chi^2_{(1-\alpha)^{1/L}}(N) = 17.1$$

Example (continued)

Reuse the data from our earlier example

Calculate the χ² statistic

$$W = \sum_{j=1}^{N} Z_j^2 = 17.4$$

Since W > wC (17.4 > 17.1), the method is rejected

We’re using all the data now, not just the worst result

j    Xj (pCi/L)    Zj
1    256.1         −1.463
2    235.2         −2.160
3    249.0         −1.700
4    258.5         −1.383
5    265.2         −1.160
6    255.7         −1.477
7    254.5         −1.517
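The χ² statistic for this example is just the sum of the squared Z scores. The sketch below recomputes it from the seven results and compares it with the critical value wC = 17.1 quoted on the slide (taken here as given rather than recomputed); the variable names are ours.

```python
x = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]
tv, u_req = 300.0, 30.0
w_c = 17.1  # critical value from the slide (L = 3, N = 7, alpha = 0.05)

# W = sum of squared normalized deviations from the target value
w = sum(((xj - tv) / u_req) ** 2 for xj in x)
rejected = w > w_c

print(f"W = {w:.1f}, rejected = {rejected}")
```

Unlike the maximum |Zj|, W accumulates the consistent low bias across all seven results, which is why the same data now fail.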


Likelihood Ratio Test for MSE

We also discovered a statistical test published in 1999, which directly addressed MSE for analytical methods

By Danish authors Erik Holst and Poul Thyregod

It’s a “likelihood ratio” test, which is a common, well accepted approach to hypothesis testing

Likelihood Ratio Tests

To test a hypothesis about a parameter θ, such as the MSE

First find a likelihood function L(θ), which tells how “likely” a value of θ is, given the observed experimental data

Based on the probability mass function or probability density function for the data

Test Statistic

Maximize L(θ) on all possible values of θ and again on all values of θ that satisfy the null hypothesis H0

Can use the ratio of these two maxima as a test statistic

$$\Lambda = \frac{\max_{\theta \in H_0} L(\theta)}{\max_{\theta \in H_0 \cup H_1} L(\theta)}$$

The authors actually use λ = −2 ln(Λ) as the statistic for testing MSE


Critical Values

It isn’t simple to derive equations for λ, or to calculate percentiles of its distribution, but Holst and Thyregod did both

They used numerical integration to approximate percentiles of λ, which serve as critical values

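Equations

For the two-sided test statistic, λ:

$$\bar{Z} = \frac{1}{N} \sum_{j=1}^{N} Z_j \qquad \hat{\sigma}_Z^2 = \frac{1}{N} \sum_{j=1}^{N} (Z_j - \bar{Z})^2$$

$$\lambda = N \left[ \frac{\hat{\sigma}_Z^2 + (\bar{Z} - \tilde{\xi})^2}{1 - \tilde{\xi}^2} - 1 - \ln \frac{\hat{\sigma}_Z^2}{1 - \tilde{\xi}^2} \right]$$

Where $\tilde{\xi}$ is the unique real root of the cubic polynomial

$$\xi^3 - \bar{Z}\,\xi^2 + (\hat{\sigma}_Z^2 + \bar{Z}^2)\,\xi - \bar{Z}$$

See Holst & Thyregod for details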

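One-Sided Test

We actually need the one-sided test statistic:

$$\lambda^* = \begin{cases} \lambda, & \text{if } \hat{\sigma}_Z^2 + \bar{Z}^2 > 1 \\ 0, & \text{otherwise} \end{cases}$$

This is equivalent to:

$$\lambda^* = \begin{cases} \lambda, & \text{if } \hat{\sigma}_X^2 + (\bar{X} - TV)^2 > u_{\mathrm{Req}}^2 \\ 0, & \text{otherwise} \end{cases}$$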
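To make the likelihood ratio statistic concrete, the sketch below evaluates λ and the one-sided λ* for the seven results of the running example. It rests on our reading of the slide: λ is taken to be N[(σ̂² + (Z̄ − ξ̃)²)/(1 − ξ̃²) − 1 − ln(σ̂²/(1 − ξ̃²))], where ξ̃ is the real root of ξ³ − Z̄ξ² + (σ̂² + Z̄²)ξ − Z̄ and σ̂², Z̄ are the MLE variance and mean of the Z scores. Holst & Thyregod remain the authoritative source for the exact form. The root is found by bisection on (−1, 1), where the cubic changes sign; the function name is hypothetical.

```python
import math

def lr_statistics(x, tv, u_req):
    """Return (root, lam, lam_star) for results x at target value tv.

    Assumes the results are not all identical, so the MLE variance is positive.
    """
    n = len(x)
    z = [(xj - tv) / u_req for xj in x]
    zbar = sum(z) / n
    s2 = sum((zj - zbar) ** 2 for zj in z) / n   # MLE variance of the Z scores

    def cubic(t):
        return t ** 3 - zbar * t ** 2 + (s2 + zbar ** 2) * t - zbar

    # The cubic is increasing, negative at -1 and positive at +1
    lo, hi = -1.0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cubic(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    root = 0.5 * (lo + hi)

    denom = 1.0 - root ** 2
    lam = n * ((s2 + (zbar - root) ** 2) / denom - 1.0 - math.log(s2 / denom))
    # One-sided version: only an estimated MSE above u_req^2 counts as evidence
    lam_star = lam if s2 + zbar ** 2 > 1.0 else 0.0
    return root, lam, lam_star

x = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]
root, lam, lam_star = lr_statistics(x, tv=300.0, u_req=30.0)
print(f"root = {root:.4f}, lambda = {lam:.2f}, lambda* = {lam_star:.2f}")
```

The resulting λ* would still have to be compared with a critical value, which this sketch does not compute.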

Issues

The distribution of either λ or λ* is not completely determined by the MSE

Under H0, with $\delta^2 + \sigma^2 \le u_{\mathrm{Req}}^2$, the percentiles $\lambda_{1-\alpha}$ and $\lambda^*_{1-\alpha}$ are maximized when σ → 0 and |δ| → uReq

To ensure the false rejection rate never exceeds α, use the maximum value of the percentile as the critical value

Apparently we improved on the authors’ method of calculating this maximum

Distribution Function for λ*

To calculate max values of the percentiles, use the following “cdf” for λ*:

[Equation: $F_{\lambda^*}(x; N)$, an infinite series over k = 0, 1, 2, … with exponential terms in x and N; not legible in this copy]

From this equation, obtain percentiles of the null distribution by iteration

Select a percentile (e.g., 95th) as a critical value

The Downside

More complicated to implement

Critical values are not readily available (unlike percentiles of χ²), unless you can program the equation from the preceding slide in software

Power of the Likelihood Ratio Test

More powerful than either the χ² test or MARLAP’s test for rejecting biased methods

Sometimes much more powerful

Slightly less powerful at rejecting unbiased but imprecise methods

Not so much worse that we wouldn’t consider it a reasonable alternative

Power Comparisons

Same scenario as before:

Level D method validation

3 activity levels: AL/2, AL, 3×AL

7 replicate measurements per level

φMR is 0.10, or 10 %

Constant relative bias at all levels

Assume the ratio σ/uReq is constant at all levels

Power Curves

[Figure: power curves for RSD = 5 % at AL. Power (P, 0 to 1) versus relative bias (0 to 20 %) for the MARLAP, chi-squared, and likelihood ratio tests.]

Power Curves

[Figure: power curves for RSD = 7.5 % at AL. Power (P, 0 to 1) versus relative bias (0 to 20 %) for the MARLAP, chi-squared, and likelihood ratio tests.]

Power Curves

[Figure: power curves for RSD = 10 % at AL. Power (P, 0 to 1) versus relative bias (0 to 20 %) for the MARLAP, chi-squared, and likelihood ratio tests.]

Power Curves

[Figure: power curves for RSD = 12.5 % at AL. Power (P, 0 to 1) versus relative bias (0 to 20 %) for the MARLAP, chi-squared, and likelihood ratio tests.]

Power Contours

Same assumptions as before (Level D method validation, etc.)

Contour plots show power as a function of both δ and σ

Horizontal coordinate is the bias (δ) at the action level

Vertical coordinate is the standard deviation (σ)

Power is shown as color

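Power of MARLAP’s test

[Figure: contour plot of power versus bias δ (horizontal) and standard deviation σ (vertical); bias axis marked at −uReq, uReq, and 2uReq.]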

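Power of the chi-squared test

[Figure: contour plot of power versus bias δ (horizontal) and standard deviation σ (vertical); bias axis marked at −uReq, uReq, 2uReq, and uReq·√(wC/N).]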

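Power of the likelihood ratio test

[Figure: contour plot of power versus bias δ (horizontal) and standard deviation σ (vertical); bias axis marked at −uReq, uReq, and 2uReq.]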

Recommendations

You can still use the MARLAP test

We prefer the χ² test of App. E

It’s simple

Critical values are widely available (percentiles of χ²)

It outperforms the MARLAP test

The likelihood ratio test is a possibility, but it is somewhat complicated, and our guide doesn’t give you enough information to implement it

Implications for MARLAP

The χ² test for MSE will likely be included in revision 1 of MARLAP

So will the likelihood ratio test, or maybe a variant of it

One or both of these will probably become the recommended test for evaluating a candidate method

Questions

Power Calculations – MARLAP

For MARLAP’s test, the probability of rejecting a method at a given activity level is

$$\Pr[\text{reject}] = 1 - \left[ \Phi\!\left( \frac{k\,u_{\mathrm{Req}} - \delta}{\sigma} \right) - \Phi\!\left( \frac{-k\,u_{\mathrm{Req}} - \delta}{\sigma} \right) \right]^N$$

Where σ is the method’s standard deviation at that level, δ is the method’s bias, and k is the multiplier calculated earlier (k ≈ 3)

Φ(z) is the cdf for the standard normal distribution
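The rejection probability for MARLAP’s test follows from requiring all N results at a level to fall within TV ± k·uReq when each result is normal with mean TV + δ and standard deviation σ: Pr[reject] = 1 − [Φ((k·uReq − δ)/σ) − Φ((−k·uReq − δ)/σ)]^N. A minimal sketch with the standard library, using illustrative function names, is:

```python
from statistics import NormalDist

def marlap_k(n_total, alpha=0.05):
    """Multiplier k for n_total measurements overall."""
    return NormalDist().inv_cdf(0.5 + 0.5 * (1.0 - alpha) ** (1.0 / n_total))

def marlap_power(delta, sigma, u_req, k, n):
    """Probability that at least one of n results falls outside TV +/- k*u_req."""
    phi = NormalDist().cdf
    p_inside = phi((k * u_req - delta) / sigma) - phi((-k * u_req - delta) / sigma)
    return 1.0 - p_inside ** n

k = marlap_k(21)  # 3 levels x 7 replicates

# Unbiased method exactly at the required uncertainty: this is the false
# rejection rate at one level, 1 - 0.95**(1/3)
p_null = marlap_power(delta=0.0, sigma=30.0, u_req=30.0, k=k, n=7)

# A method that is 15 % low at TV = 300 pCi/L with a 10 pCi/L standard deviation
p_biased = marlap_power(delta=-45.0, sigma=10.0, u_req=30.0, k=k, n=7)
print(p_null, p_biased)
```

The σ and δ values in the second call are made up for illustration and are not taken from the presentation.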

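Power Calculations (continued)

The same probability is calculated by the following equation

$$\Pr[\text{reject}] = 1 - \left[ F_{\chi'^2}\!\left( \frac{k^2 u_{\mathrm{Req}}^2}{\sigma^2};\ 1,\ \frac{\delta^2}{\sigma^2} \right) \right]^N$$

$F_{\chi'^2}(x;\ \nu, \lambda)$ is the cdf for the non-central χ² distribution

In this case, with ν = 1 degree of freedom and non-centrality parameter λ = δ² / σ²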

Power Calculations – χ²

For the new χ² test, the probability of rejecting a method is

$$\Pr[\text{reject}] = 1 - F_{\chi'^2}\!\left( \frac{w_C\, u_{\mathrm{Req}}^2}{\sigma^2};\ N,\ \frac{N \delta^2}{\sigma^2} \right)$$

Where again σ is the method’s standard deviation at that level and δ is the method’s bias

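Non-Central χ² CDF

The cdf for the non-central χ² distribution is given by

$$F_{\chi'^2}(x;\ \nu, \lambda) = \sum_{j=0}^{\infty} \frac{e^{-\lambda/2} (\lambda/2)^j}{j!}\, P\!\left( \frac{\nu}{2} + j,\ \frac{x}{2} \right)$$

Where P(∙,∙) denotes the (regularized) incomplete gamma function

You can find algorithms for P(∙,∙), e.g., in books like Numerical Recipes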
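The series for the non-central χ² cdf, a Poisson-weighted sum of incomplete gamma functions P(ν/2 + j, x/2) with Poisson mean λ/2, translates directly into code. The sketch below pairs it with the power formula for the χ² test, Pr[reject] = 1 − F(wC·uReq²/σ²; N, N·δ²/σ²). The function names are ours, and the simple series for P is only suitable for moderate arguments (very small σ would need a more careful algorithm).

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def ncx2_cdf(x, dof, nc, max_terms=2000):
    """Non-central chi-squared cdf as a Poisson mixture of central cdfs."""
    if x <= 0.0:
        return 0.0
    half = nc / 2.0
    total = 0.0
    for j in range(max_terms):
        if half > 0.0:
            # Poisson weight, computed in log space to avoid underflow
            weight = math.exp(-half + j * math.log(half) - math.lgamma(j + 1))
        else:
            weight = 1.0 if j == 0 else 0.0
        total += weight * gamma_p(dof / 2.0 + j, x / 2.0)
        if j > half and weight < 1e-16:   # past the mode, weights negligible
            break
    return total

def chi2_test_power(delta, sigma, u_req, w_c, n):
    """Probability that W exceeds w_c for bias delta and standard deviation sigma."""
    x = w_c * u_req ** 2 / sigma ** 2
    nc = n * delta ** 2 / sigma ** 2
    return 1.0 - ncx2_cdf(x, dof=n, nc=nc)

# Unbiased method at the required uncertainty, then a biased but precise one
p_null = chi2_test_power(delta=0.0, sigma=30.0, u_req=30.0, w_c=17.1, n=7)
p_biased = chi2_test_power(delta=-45.0, sigma=10.0, u_req=30.0, w_c=17.1, n=7)
print(p_null, p_biased)
```

The δ and σ values in the last call are invented for illustration; they are not results from the presentation.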

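Solving the Cubic

To solve the cubic equation $\xi^3 - \bar{Z}\,\xi^2 + (\hat{\sigma}_Z^2 + \bar{Z}^2)\,\xi - \bar{Z} = 0$ for its real root $\tilde{\xi}$:

$$Q = \frac{3\hat{\sigma}_Z^2 + 2\bar{Z}^2}{9}$$

$$R = \frac{\bar{Z}\,(27 - 9\hat{\sigma}_Z^2 - 7\bar{Z}^2)}{54}$$

$$T = \sqrt{Q^3 + R^2}$$

$$\tilde{\xi} = (R + T)^{1/3} + (R - T)^{1/3} + \frac{\bar{Z}}{3}$$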
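The closed-form (Cardano) solution can be checked against the cubic ξ³ − Z̄ξ² + (σ̂² + Z̄²)ξ − Z̄ directly. The sketch below uses our reconstruction of the intermediate quantities, Q = (3σ̂² + 2Z̄²)/9, R = Z̄(27 − 9σ̂² − 7Z̄²)/54, and T = √(Q³ + R²); the slide's own sign conventions may differ while giving the same root. The input values are the mean and MLE variance of the Z scores from the running example.

```python
import math

def signed_cbrt(v):
    """Real cube root that also handles negative arguments."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

def cubic_root(zbar, s2):
    """Real root of t^3 - zbar*t^2 + (s2 + zbar^2)*t - zbar = 0 (Cardano)."""
    q = (3.0 * s2 + 2.0 * zbar ** 2) / 9.0
    r = zbar * (27.0 - 9.0 * s2 - 7.0 * zbar ** 2) / 54.0
    t = math.sqrt(q ** 3 + r ** 2)   # q > 0, so the square root is always real
    return signed_cbrt(r + t) + signed_cbrt(r - t) + zbar / 3.0

# Mean and MLE variance of the Z scores in the running example
zbar, s2 = -1.5514286, 0.0840661
root = cubic_root(zbar, s2)

# Substituting the root back into the cubic should give (nearly) zero
residual = root ** 3 - zbar * root ** 2 + (s2 + zbar ** 2) * root - zbar
print(root, residual)
```

Because Q is positive whenever the data are not all identical, the cubic has exactly one real root and no trigonometric branch is needed.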

Variations

Another possibility: use Holst & Thyregod’s methodology to derive a likelihood ratio test for H0: $\delta^2 + \sigma^2 \le k\,u_{\mathrm{Req}}^2$ versus H1: $\delta^2 + \sigma^2 > k\,u_{\mathrm{Req}}^2$

There are a couple of new issues to deal with when k > 1

But the same approach mostly works

Only recently considered; not fully explored yet
