Beyond MARLAP: New Statistical Tests for Method Validation
NAREL – ORIA – US EPA
Laboratory Incident Response Workshop at the 53rd Annual RRMC
2
Outline
The method validation problem
MARLAP’s test
  And its peculiar features
New approach – testing mean squared error (MSE)
Two possible tests of MSE
  Chi-squared test
  Likelihood ratio test
Power comparisons
Recommendations and implications for MARLAP
3
The Problem
We’ve prepared spiked samples at one or more activity levels
A lab has performed one or more analyses of the samples at each level
Our task: Evaluate the results to see whether the lab and method can achieve the required uncertainty (uReq) at each level
4
MARLAP’s Test
In 2003 the MARLAP work group developed a simple test for MARLAP Chapter 6
Chose a very simple criterion
  Original criterion was whether every result was within ±3uReq of the target
  Modified slightly to keep false rejection rate ≤ 5 % in all cases
5
Equations
Acceptance range is TV ± k uReq, where
  TV = target value (true value)
  uReq = required uncertainty at TV, and
$$k = z_{0.5 + 0.5(1-\alpha)^{1/n}}$$
E.g., for n = 21 measurements (7 reps at each of 3 levels), with α = 0.05, we get $k = z_{0.99878} = 3.03$
For smaller n we get slightly smaller k
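The multiplier k can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `marlap_k` is an illustrative choice, not something defined in MARLAP.

```python
from statistics import NormalDist

def marlap_k(n: int, alpha: float = 0.05) -> float:
    """Multiplier k = z_p with p = 0.5 + 0.5 * (1 - alpha)**(1/n)."""
    p = 0.5 + 0.5 * (1.0 - alpha) ** (1.0 / n)
    return NormalDist().inv_cdf(p)  # standard normal quantile z_p

k21 = marlap_k(21)  # 7 replicates at each of 3 levels
k7 = marlap_k(7)    # fewer measurements give a smaller k
print(round(k21, 2), round(k7, 2))
```

Because k is chosen so that all n results fall inside the acceptance range with probability 1 − α, the range necessarily widens as n grows.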
6
Required Uncertainty
The required uncertainty, uReq, is a function of the target value
$$u_{\mathrm{Req}}(TV) = \begin{cases} u_{\mathrm{MR}}, & \text{if } TV \le \mathrm{UBGR} \\ \varphi_{\mathrm{MR}} \cdot TV, & \text{if } TV > \mathrm{UBGR} \end{cases}$$
Where uMR is the required method uncertainty at the upper bound of the gray region (UBGR)
φMR is the corresponding relative method uncertainty
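As a small illustration, the piecewise rule can be written as a function. This is a sketch only; the name `required_uncertainty` and its argument names are hypothetical, and φMR is taken to be uMR / UBGR as in the example later in the talk.

```python
def required_uncertainty(tv: float, ubgr: float, u_mr: float) -> float:
    """Required uncertainty at target value tv.

    Absolute (u_mr) at or below the UBGR, relative (phi_mr * tv) above it,
    with phi_mr = u_mr / ubgr.
    """
    if tv <= ubgr:
        return u_mr
    return u_mr * tv / ubgr  # same as phi_mr * tv

# Level D example: UBGR = 100 pCi/L, uMR = 10 pCi/L
for tv in (50.0, 100.0, 300.0):
    print(tv, required_uncertainty(tv, ubgr=100.0, u_mr=10.0))
```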
7
Alternatives
We considered a chi-squared (χ²) test as an alternative in 2003
  Accounted for uncertainty of target values using “effective degrees of freedom”
  Rejected at the time because of complexity and lack of evidence for performance
Kept the simple test that now appears in MARLAP Chapter 6
But we didn’t forget about the χ² test
8
Peculiarity of MARLAP’s Test
Power to reject a biased but precise method decreases with number of analyses performed (n)
Because we adjusted the acceptance limits to keep false rejection rates low
Acceptance range gets wider as n gets larger
Biased but Precise
This graphic image was borrowed and edited for the RRMC workshop presentation. Please view the original now at despair.com.
http://www.despair.com/consistency.html
10
Best Use of Data?
It isn’t just about bias
MARLAP’s test uses data inefficiently – even to evaluate precision alone (its original purpose)
The statistic – in effect – is just the worst normalized deviation from the target value
$$M = \max_{1 \le j \le n} |Z_j| \quad \text{where} \quad Z_j = \frac{X_j - TV}{u_{\mathrm{Req}}}$$
Wastes a lot of useful information
11
Example: The MARLAP Test
Suppose we perform a level D method validation experiment
  UBGR = AL = 100 pCi/L
  uMR = 10 pCi/L
  φMR = 10/100 = 0.10, or 10 %
Three activity levels (L = 3)
  50 pCi/L, 100 pCi/L, and 300 pCi/L
Seven replicates per level (N = 7)
Allow 5 % false rejections (α = 0.05)
12
Example (continued)
For 21 measurements, calculate
$$k = z_{0.5 + 0.5(1-0.05)^{1/21}} = z_{0.99878} = 3.03 \approx 3.0$$
When evaluating measurement results for target value TV, require for each result Xj:
$$|Z_j| = \frac{|X_j - TV|}{u_{\mathrm{Req}}} \le 3.0$$
Equivalently, require
$$M = \max_{1 \le j \le 21} |Z_j| \le 3.0$$
13
Example (continued)
We’ll work through calculations at just one target value
Say TV = 300 pCi/L
This value is greater than UBGR (100 pCi/L)
So, the required uncertainty is 10 % of 300 pCi/L
  uReq = 30 pCi/L
14
Example (continued)
Suppose the lab produces the 7 results Xj shown below
For each result, calculate the “Z score”
$$Z_j = \frac{X_j - TV}{u_{\mathrm{Req}}} = \frac{X_j - 300\ \mathrm{pCi/L}}{30\ \mathrm{pCi/L}}$$
We require |Zj| ≤ 3.0 for each j

j   Xj (pCi/L)
1   256.1
2   235.2
3   249.0
4   258.5
5   265.2
6   255.7
7   254.5
15
Example (continued)
Every |Zj| is smaller than 3.0
The method is obviously biased (~15 % low)
But it passes the MARLAP test

Scores and Evaluation
Xj      Zj       |Zj| ≤ 3.0?
256.1   −1.463   Yes
235.2   −2.160   Yes
249.0   −1.700   Yes
258.5   −1.383   Yes
265.2   −1.160   Yes
255.7   −1.477   Yes
254.5   −1.517   Yes
$$M = \max_{1 \le j \le 7} |Z_j| = 2.160$$
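The evaluation above can be reproduced in a few lines. This is a sketch of the arithmetic only, using the seven results and the rounded multiplier k = 3.0 from the example; the variable names are illustrative.

```python
results = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]  # pCi/L
tv, u_req, k = 300.0, 30.0, 3.0  # target value, required uncertainty, multiplier

z_scores = [(x - tv) / u_req for x in results]
m = max(abs(z) for z in z_scores)   # worst normalized deviation
passes = m <= k                     # MARLAP criterion: every |Zj| <= k

mean_result = sum(results) / len(results)
relative_bias = (mean_result - tv) / tv  # about -0.155, i.e. ~15 % low

print(f"M = {m:.3f}, passes = {passes}, relative bias = {relative_bias:.1%}")
```

Only the single worst score enters the decision, which is why a consistent 15 % low bias goes undetected here.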
16
2007
In early 2007 we were developing the new method validation guide
Applying MARLAP guidance, including the simple test of Chapter 6
Someone suggested presenting power curves in the context of bias
Time had come to reconsider MARLAP’s simple test
17
Bias and Imprecision
Which is worse: bias or imprecision?
  Either leads to inaccuracy
  Both are tolerable if not too large
When we talk about uncertainty (à la GUM), we don’t distinguish between the two
18
Mean Squared Error
When characterizing a method, we often consider bias and imprecision separately
Uncertainty estimates combine them
There is a concept in statistics that also combines them: mean squared error
19
Definition of MSE
If X is an estimator for a parameter θ, the mean squared error of X is MSE(X) = E((X − θ)²) by definition
It also equals MSE(X) = V(X) + Bias(X)² = σ² + δ²
If X is unbiased, MSE(X) = V(X) = σ²
We tend to think in terms of the root MSE, which is the square root of MSE
20
New Approach
For the method validation guide we chose a new conceptual approach
A method is adequate if its root MSE at each activity level does not exceed the required uncertainty at that level
We don’t care whether the MSE is dominated by bias or imprecision
21
Root MSE v. Standard Uncertainty
Are root MSE and standard uncertainty really the same thing?
Not exactly, but one can interpret the GUM’s treatment of uncertainty in such a way that the two are closely related
We think our approach – testing uncertainty by testing MSE – is reasonable
22
Chi-squared Test Revisited
For the new method validation document we simplified the χ² test proposed (and rejected) in 2003
  Ignore uncertainties of target values, which should be small
  Just use a straightforward χ² test
Presented as an alternative in App. E
But the document still uses MARLAP’s simple test
23
The Two Hypotheses
We’re now explicitly testing the MSE
  Null hypothesis (H0): $\mathrm{MSE} \le u_{\mathrm{Req}}^2$
  Alternative hypothesis (H1): $\mathrm{MSE} > u_{\mathrm{Req}}^2$
In MARLAP the 2 hypotheses were not clearly stated
  Assumed any bias (δ) would be small
  We were mainly testing variance (σ²)
24
A χ² Test for Variance
Imagine we really tested variance only
  H0: $\sigma^2 \le u_{\mathrm{Req}}^2$
  H1: $\sigma^2 > u_{\mathrm{Req}}^2$
We could calculate a χ² statistic
$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - \bar{X})^2$$
Chi-squared with N − 1 degrees of freedom
Presumes there may be bias but doesn’t test for it
25
MLE for Variance
The maximum-likelihood estimator (MLE) for σ² when the mean is unknown is:
$$\hat{\sigma}_X^2 = \frac{1}{N} \sum_{j=1}^{N} (X_j - \bar{X})^2$$
Notice similarity to χ² from preceding slide
$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - \bar{X})^2$$
26
Another χ² Test for Variance
We could calculate a different χ² statistic
$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - TV)^2$$
N degrees of freedom
Can be used to test variance if there is no bias
Any bias increases the rejection rate
27
MLE for MSE
The MLE for the MSE is:
$$\hat{\sigma}_X^2 + (\bar{X} - TV)^2 = \frac{1}{N} \sum_{j=1}^{N} (X_j - TV)^2$$
Notice similarity to χ² from preceding slide
$$\chi^2 = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - TV)^2$$
In the context of biased measurements, χ² seems to assess MSE rather than variance
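The decomposition of the MSE estimate into a variance part and a squared-bias part can be verified numerically. The sketch below reuses the seven results from the MARLAP example (TV = 300 pCi/L); the variable names are illustrative.

```python
import math

results = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]  # pCi/L
tv = 300.0
n = len(results)

mean = sum(results) / n
var_mle = sum((x - mean) ** 2 for x in results) / n   # MLE of variance (divides by N)
bias = mean - tv                                       # estimated bias
mse_mle = sum((x - tv) ** 2 for x in results) / n      # MLE of the MSE

# The two sides of the identity agree up to rounding error.
print(mse_mle, var_mle + bias ** 2)
print("root MSE estimate:", math.sqrt(mse_mle), "pCi/L")
```

For these data the root MSE estimate is well above uReq = 30 pCi/L, and almost all of it comes from the bias term.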
28
Our Proposed χ² Test for MSE
For a given activity level (TV), calculate a χ² statistic W:
$$W = \frac{1}{u_{\mathrm{Req}}^2} \sum_{j=1}^{N} (X_j - TV)^2 = \sum_{j=1}^{N} Z_j^2$$
Calculate the critical value of W as follows:
$$w_C = \chi^2_{1-\alpha}(N)$$
N = number of replicate measurements
α = max false rejection rate at this level
29
Multiple Activity Levels
When testing at more than one activity level, calculate the critical value as follows:
$$w_C = \chi^2_{(1-\alpha)^{1/L}}(N)$$
Where L is the number of levels and N is the number of measurements at each level
Now α is the maximum overall false rejection rate
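Critical values are ordinary χ² percentiles, so any statistics package will supply them. For readers without one, here is a dependency-free sketch that computes the χ² cdf from the series for the regularized incomplete gamma function and inverts it by bisection. The helper names (`gamma_p`, `chi2_quantile`, `critical_value`) are illustrative assumptions, not part of the guide.

```python
import math

def gamma_p(a: float, x: float) -> float:
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total and n < 100000:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_cdf(x: float, dof: float) -> float:
    return gamma_p(dof / 2.0, x / 2.0)

def chi2_quantile(p: float, dof: float) -> float:
    """Invert the chi-squared cdf by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, dof) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_value(n_reps: int, n_levels: int = 1, alpha: float = 0.05) -> float:
    """w_C = chi-squared percentile at (1 - alpha)**(1 / L) with N degrees of freedom."""
    return chi2_quantile((1.0 - alpha) ** (1.0 / n_levels), n_reps)

w_c = critical_value(n_reps=7, n_levels=3, alpha=0.05)
print(round(w_c, 1))
```

Raising 1 − α to the power 1/L keeps the overall false rejection rate at α when the L levels are tested independently.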
30
Evaluation Criteria
To perform the test, calculate Wi at each activity level TVi
Compare each Wi to wC
If Wi > wC for any i, reject the method
The method must pass the test at each spike activity level
  Don’t allow bad performance at one level just because of good performance at another
31
Lesson Learned
Don’t test at too many levels
Otherwise you must choose:
  High false acceptance rate at each level,
  High overall false rejection rate, or
  Complicated evaluation criteria
Prefer to keep error rates low
  Need a low level and a high level
  But probably not more than three levels (L = 3)
32
Better Use of Same Data
The χ² test makes better use of the measurement data than the MARLAP test
The statistic W is calculated from all the data at a given level – not just the most extreme value
33
Caveat
The distribution of W is not completely determined by the MSE
Depends on how MSE is partitioned into variance and bias components
Our test looks like a test of variance
  As if we know δ = 0 and we’re testing σ² only
But we’re actually using it to test MSE
34
False Rejections
If wC < N, the maximum false rejection rate (100 %) occurs when δ = ±uReq and σ = 0
  But you’ll never have this situation in practice
If wC ≥ N + 2, the maximum false rejection rate occurs when σ = uReq and δ = 0
  This is the usual situation
  Why we can assume the null distribution is χ²
Otherwise the maximum false rejection rate occurs when both δ and σ are nonzero
  This situation is unlikely in practice
35
To Avoid High Rejection Rates
We must have wC ≥ N + 2
  This will always be true if α < 0.08, even if L = N = 1
  Ensures the maximum false rejection rate occurs when δ = 0 and the MSE is just σ²
Not stated explicitly in App. E, because:
  We didn’t have a proof at the time
  Not an issue if you follow the procedure
Now we have a proof
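The α < 0.08 remark can be spot-checked for the L = N = 1 case. With one degree of freedom the χ² percentile is the square of a normal quantile, so the standard library suffices. This sketch checks only that one case and is not a substitute for the proof mentioned above; the function name is illustrative.

```python
import math
from statistics import NormalDist

def w_c_single(alpha: float) -> float:
    """Critical value for L = N = 1: chi-squared(1) percentile = z_{1 - alpha/2} squared."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2

w_c_at_008 = w_c_single(0.08)   # should be at least N + 2 = 3

# Largest alpha for which w_C >= 3 still holds when L = N = 1
alpha_limit = 2.0 * (1.0 - NormalDist().cdf(math.sqrt(3.0)))

print(round(w_c_at_008, 3), round(alpha_limit, 4))
```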
36
Example: Critical Value
Suppose L = 3 and N = 7
Let α = 0.05
Then the critical value for W is
$$w_C = \chi^2_{(1-\alpha)^{1/L}}(N) = \chi^2_{0.95^{1/3}}(7) = \chi^2_{0.98305}(7) = 17.1$$
Since wC ≥ N + 2 = 9, we won’t have unexpectedly high false rejection rates
Since α < 0.08, we didn’t really have to check
37
Some Facts about the Power
The power always increases with |δ|
The power increases with σ if … or if …
For a given bias δ with …, there is a positive value of σ that minimizes the power
If …, even this minimum power exceeds 50 %
Power increases with N
The conditions compare |δ| with $u_{\mathrm{Req}}\sqrt{w_C/N}$ and σ² + δ² with $u_{\mathrm{Req}}^2\, w_C/(N+2)$
38
Power Comparisons
We compared the tests for power
  Power to reject a biased method
  Power to reject an imprecise method
The χ² test outperforms the simple MARLAP test on both counts
Results of comparisons at end of this presentation
39
False Rejection Rates
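[Figure: H0 and H1 regions separated by the curve σ² + δ² = u²Req, annotated with a condition relating wC to N + 2. Labels: rejection rate = α, rejection rate < α, rejection rate = 0. Axis marks at 0 and uReq.]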
40
Region of Low Power
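[Figure: region of low power for the χ² test, showing the H0 and H1 regions, annotated with a condition relating wC to N + 2. Axis marks at 0, uReq, and uReq√(wC/N). Label: rejection rate = α.]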
41
Region of Low Power (MARLAP)
[Figure: region of low power for MARLAP’s test, showing the H0 and H1 regions. Axis marks at 0, uReq, and k uReq. Label: rejection rate = α.]
42
Example: Applying the χ² Test
Return to the scenario used earlier for the MARLAP example
  Three levels (L = 3)
  Seven measurements per level (N = 7)
  5 % overall false rejection rate (α = 0.05)
Consider results at just one level, TV = 300 pCi/L, where uReq = 30 pCi/L
$$w_C = \chi^2_{(1-\alpha)^{1/L}}(N) = 17.1$$
43
Example (continued)
Reuse the data from our earlier example
Calculate the χ² statistic
$$W = \sum_{j=1}^{N} Z_j^2 = 17.4$$
Since W > wC (17.4 > 17.1), the method is rejected
We’re using all the data now – not just the worst result

j   Xj      Zj
1   256.1   −1.463
2   235.2   −2.160
3   249.0   −1.700
4   258.5   −1.383
5   265.2   −1.160
6   255.7   −1.477
7   254.5   −1.517
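The statistic W for this example can be reproduced directly. In this sketch the critical value 17.1 is simply copied from the example rather than recomputed, and the variable names are illustrative.

```python
results = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]  # pCi/L
tv, u_req = 300.0, 30.0
w_c = 17.1  # critical value quoted in the example (L = 3, N = 7, alpha = 0.05)

# W is the sum of squared Z scores, so every result contributes.
w = sum(((x - tv) / u_req) ** 2 for x in results)
rejected = w > w_c

print(f"W = {w:.1f}, rejected = {rejected}")
```

The same seven results that passed the MARLAP test (M = 2.160 ≤ 3.0) are rejected here because the consistent low bias accumulates in the sum.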
44
Likelihood Ratio Test for MSE
We also discovered a statistical test published in 1999, which directly addressed MSE for analytical methods
By Danish authors Erik Holst and Poul Thyregod
It’s a “likelihood ratio” test, which is a common, well accepted approach to hypothesis testing
45
Likelihood Ratio Tests
To test a hypothesis about a parameter θ, such as the MSE
First find a likelihood function L(θ), which tells how “likely” a value of θ is, given the observed experimental data
  Based on the probability mass function or probability density function for the data
46
Test Statistic
Maximize L(θ) on all possible values of θ and again on all values of θ that satisfy the null hypothesis H0
Can use the ratio of these two maxima as a test statistic
$$\Lambda = \frac{\max_{\theta \in H_0} L(\theta)}{\max_{\theta \in H_0 \cup H_1} L(\theta)}$$
The authors actually use λ = −2 ln(Λ) as the statistic for testing MSE
47
Critical Values
It isn’t simple to derive equations for λ, or to calculate percentiles of its distribution, but Holst and Thyregod did both
They used numerical integration to approximate percentiles of λ, which serve as critical values
48
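Equations
For the two-sided test statistic, λ:
$$\bar{Z} = \frac{1}{N} \sum_{j=1}^{N} Z_j \qquad \hat{\sigma}_Z^2 = \frac{1}{N} \sum_{j=1}^{N} (Z_j - \bar{Z})^2$$
$$\lambda = N \left[ \frac{\hat{\sigma}_Z^2 + (\bar{Z} - \tilde{\delta})^2}{1 - \tilde{\delta}^2} - \ln \frac{\hat{\sigma}_Z^2}{1 - \tilde{\delta}^2} - 1 \right]$$
Where $\tilde{\delta}$ is the unique real root of the cubic polynomial
$$\tilde{\delta}^3 - \bar{Z}\,\tilde{\delta}^2 + (\hat{\sigma}_Z^2 + \bar{Z}^2)\,\tilde{\delta} - \bar{Z}$$
See Holst & Thyregod for details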
49
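One-Sided Test
We actually need the one-sided test statistic:
$$\lambda^* = \begin{cases} \lambda, & \text{if } \hat{\sigma}_Z^2 + \bar{Z}^2 > 1 \\ 0, & \text{otherwise} \end{cases}$$
This is equivalent to:
$$\lambda^* = \begin{cases} \lambda, & \text{if } \hat{\sigma}_X^2 + (\bar{X} - TV)^2 > u_{\mathrm{Req}}^2 \\ 0, & \text{otherwise} \end{cases}$$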
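A sketch of how λ and λ* might be computed follows. It rests on assumptions that should be checked against Holst & Thyregod: the data are normal, the restricted maximum is taken on the boundary σ² + δ² = 1 in Z units, and the cubic δ³ − Z̄δ² + (σ̂² + Z̄²)δ − Z̄ = 0 (derived here as the stationarity condition of that restricted likelihood) is solved with Cardano's formula. The function names are illustrative.

```python
import math

def real_cbrt(x: float) -> float:
    """Real cube root that also works for negative x."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def lr_statistics(results, tv, u_req):
    n = len(results)
    z = [(x - tv) / u_req for x in results]
    z_bar = sum(z) / n
    s2 = sum((zj - z_bar) ** 2 for zj in z) / n   # MLE of variance in Z units

    # Cardano's formula for the single real root of the cubic
    q = (3.0 * s2 + 2.0 * z_bar ** 2) / 9.0
    r = z_bar * (27.0 - 9.0 * s2 - 7.0 * z_bar ** 2) / 54.0
    t = math.sqrt(q ** 3 + r ** 2)
    delta = real_cbrt(r + t) + real_cbrt(r - t) + z_bar / 3.0

    one_minus = 1.0 - delta ** 2                   # restricted variance estimate
    lam = n * ((s2 + (z_bar - delta) ** 2) / one_minus
               - math.log(s2 / one_minus) - 1.0)
    lam_star = lam if s2 + z_bar ** 2 > 1.0 else 0.0
    return z_bar, s2, delta, lam, lam_star

results = [256.1, 235.2, 249.0, 258.5, 265.2, 255.7, 254.5]  # pCi/L
z_bar, s2, delta, lam, lam_star = lr_statistics(results, tv=300.0, u_req=30.0)
residual = delta ** 3 - z_bar * delta ** 2 + (s2 + z_bar ** 2) * delta - z_bar
print(delta, lam_star, residual)
```

The statistic still has to be compared with a critical value from the null distribution of λ*, which this sketch does not compute.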
50
Issues
The distribution of either λ or λ* is not completely determined by the MSE
Under H0 with σ² + δ² = u²Req, the percentiles λ1−α and λ*1−α are maximized when σ → 0 and |δ| → uReq
To ensure the false rejection rate never exceeds α, use the maximum value of the percentile as the critical value
Apparently we improved on the authors’ method of calculating this maximum
52
Distribution Function for λ*
To calculate max values of the percentiles, use the following “cdf” for λ*:
[Equation: the cdf F_{λ*}(x; N), given as an infinite series over k ≥ 0]
From this equation, obtain percentiles of the null distribution by iteration
Select a percentile (e.g., 95th) as a critical value
53
The Downside
More complicated to implement
Critical values are not readily available (unlike percentiles of χ²)
  Unless you can program the equation from the preceding slide in software
54
Power of the Likelihood Ratio Test
More powerful than either the χ² test or MARLAP’s test for rejecting biased methods
  Sometimes much more powerful
Slightly less powerful at rejecting unbiased but imprecise methods
  Not so much worse that we wouldn’t consider it a reasonable alternative
55
Power Comparisons
Same scenario as before:
  Level D method validation
  3 activity levels: AL/2, AL, 3×AL
  7 replicate measurements per level
  φMR is 0.10, or 10 %
Constant relative bias at all levels
Assume ratio σ/uReq constant at all levels
56
Power Curves
[Figure: power P (0 to 1) versus relative bias (0 % to 20 %) for the MARLAP, chi-squared, and likelihood ratio tests, with RSD = 5 % at AL]
57
Power Curves
[Figure: power P (0 to 1) versus relative bias (0 % to 20 %) for the MARLAP, chi-squared, and likelihood ratio tests, with RSD = 7.5 % at AL]
58
Power Curves
[Figure: power P (0 to 1) versus relative bias (0 % to 20 %) for the MARLAP, chi-squared, and likelihood ratio tests, with RSD = 10 % at AL]
59
Power Curves
[Figure: power P (0 to 1) versus relative bias (0 % to 20 %) for the MARLAP, chi-squared, and likelihood ratio tests, with RSD = 12.5 % at AL]
60
Power Contours
Same assumptions as before (Level D method validation, etc.)
Contour plots show power as a function of both δ and σ
  Horizontal coordinate is bias (δ) at the action level
  Vertical coordinate is the standard deviation (σ)
  Power is shown as color
61
Power of MARLAP’s test
[Figure: contour plot of power; axis marks at −uReq, uReq, and 2uReq]
62
Power of the chi-squared test
[Figure: contour plot of power; axis marks at −uReq, uReq, 2uReq, and uReq√(wC/N)]
63
Power of the likelihood ratio test
[Figure: contour plot of power; axis marks at −uReq, uReq, and 2uReq]
64
Recommendations
You can still use the MARLAP test
We prefer the χ² test of App. E.
  It’s simple
  Critical values are widely available (percentiles of χ²)
  It outperforms the MARLAP test
The likelihood ratio test is a possibility, but
  It is somewhat complicated
  Our guide doesn’t give you enough information to implement it
65
Implications for MARLAP
The χ² test for MSE will likely be included in revision 1 of MARLAP
So will the likelihood ratio test
  Or maybe a variant of it
One or both of these will probably become the recommended test for evaluating a candidate method
Questions
67
Power Calculations – MARLAP
For MARLAP’s test, probability of rejecting a method at a given activity level is
$$\Pr[\text{reject}] = 1 - \left[ \Phi\!\left( \frac{k\,u_{\mathrm{Req}} - \delta}{\sigma} \right) - \Phi\!\left( \frac{-k\,u_{\mathrm{Req}} - \delta}{\sigma} \right) \right]^N$$
Where σ is the method’s standard deviation at that level, δ is the method’s bias, and k is the multiplier calculated earlier (k ≈ 3)
Φ(z) is the cdf for the standard normal distribution
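The rejection probability for MARLAP's test needs only the normal cdf. The sketch below assumes each result is normal with mean TV + δ and standard deviation σ, and that a method is rejected when any of the N results falls outside TV ± k uReq; the function and variable names are illustrative.

```python
from statistics import NormalDist

def marlap_reject_probability(delta, sigma, u_req, k, n):
    """Probability that at least one of n results falls outside TV +/- k * u_req."""
    phi = NormalDist().cdf
    accept_one = phi((k * u_req - delta) / sigma) - phi((-k * u_req - delta) / sigma)
    return 1.0 - accept_one ** n

# Multiplier for 21 measurements at alpha = 0.05 (about 3.03)
k = NormalDist().inv_cdf(0.5 + 0.5 * (1.0 - 0.05) ** (1.0 / 21))

# Unbiased method with sigma = uReq, all 21 measurements: rejection rate is alpha
p_null = marlap_reject_probability(0.0, 30.0, 30.0, k, 21)

# Biased but precise method at TV = 300 pCi/L: 15.5 % low, sigma = 10 pCi/L, N = 7
p_biased = marlap_reject_probability(-46.5, 10.0, 30.0, k, 7)

print(p_null, p_biased)
```

The second case illustrates the low power against a biased but precise method: the results cluster well inside the acceptance range even though the root MSE exceeds uReq.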
68
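Power Calculations (continued)
Same probability is calculated by the following equation
$$\Pr[\text{reject}] = 1 - \left[ F_{\chi^2}\!\left( \frac{k^2 u_{\mathrm{Req}}^2}{\sigma^2};\; 1,\; \frac{\delta^2}{\sigma^2} \right) \right]^N$$
$F_{\chi^2}(x; \nu, \lambda)$ is the cdf for the non-central χ² distribution
  In this case, with ν = 1 degree of freedom and non-centrality parameter λ = δ² / σ²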
Power Calculations – χ²
For the new χ² test, the probability of rejecting a method is
$$\Pr[\text{reject}] = 1 - F_{\chi^2}\!\left( \frac{w_C\,u_{\mathrm{Req}}^2}{\sigma^2};\; N,\; \frac{N \delta^2}{\sigma^2} \right)$$
Where again σ is the method’s standard deviation at that level and δ is the method’s bias
70
Non-Central χ² CDF
The cdf for the non-central χ² distribution is given by
$$F_{\chi^2}(x; \nu, \lambda) = \sum_{j=0}^{\infty} \frac{e^{-\lambda/2} (\lambda/2)^j}{j!}\, P\!\left( \frac{\nu}{2} + j,\; \frac{x}{2} \right)$$
Where P(∙,∙) denotes the incomplete gamma function
You can find algorithms for P(∙,∙), e.g., in books like Numerical Recipes
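The series above translates fairly directly into code. The following is a standard-library sketch, not a production routine: it uses the power series for P(a, x), truncates the Poisson-weighted sum at a heuristic upper index, and then evaluates the χ² test's rejection probability with non-centrality Nδ²/σ². All function names are illustrative.

```python
import math

def gamma_p(a: float, x: float) -> float:
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total and n < 100000:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def ncx2_cdf(x: float, dof: float, nc: float) -> float:
    """Non-central chi-squared cdf as a Poisson-weighted sum of P(dof/2 + j, x/2)."""
    if x <= 0.0:
        return 0.0
    half = nc / 2.0
    if half == 0.0:
        return gamma_p(dof / 2.0, x / 2.0)
    j_max = int(half + 10.0 * math.sqrt(half) + 50.0)  # heuristic truncation point
    total = 0.0
    for j in range(j_max + 1):
        log_weight = -half + j * math.log(half) - math.lgamma(j + 1)
        total += math.exp(log_weight) * gamma_p(dof / 2.0 + j, x / 2.0)
    return total

def chi2_test_power(delta, sigma, u_req, w_c, n):
    """Pr[W > w_c] for bias delta and standard deviation sigma at one level."""
    x = w_c * u_req ** 2 / sigma ** 2
    nc = n * delta ** 2 / sigma ** 2
    return 1.0 - ncx2_cdf(x, n, nc)

power_null = chi2_test_power(0.0, 30.0, 30.0, 17.1, 7)      # sigma = uReq, no bias
power_biased = chi2_test_power(-46.5, 10.0, 30.0, 17.1, 7)  # biased but precise
print(power_null, power_biased)
```

With δ = 0 and σ = uReq the result is the per-level false rejection rate, 1 − 0.95^(1/3), or about 1.7 %.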
71
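Solving the Cubic
To solve the cubic equation for $\tilde{\delta}$
$$Q = \frac{3\hat{\sigma}_Z^2 + 2\bar{Z}^2}{9}$$
$$R = \frac{\bar{Z}\,(27 - 9\hat{\sigma}_Z^2 - 7\bar{Z}^2)}{54}$$
$$T = \sqrt{Q^3 + R^2}$$
$$\tilde{\delta} = (R + T)^{1/3} + (R - T)^{1/3} + \frac{\bar{Z}}{3}$$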
72
Variations
Another possibility: Use Holst & Thyregod’s methodology to derive a likelihood ratio test for H0: $\sigma^2 + \delta^2 \le k^2 u_{\mathrm{Req}}^2$ versus H1: $\sigma^2 + \delta^2 > k^2 u_{\mathrm{Req}}^2$
There are a couple of new issues to deal with when k > 1
But the same approach mostly works
Only recently considered – not fully explored yet