
Interval estimation of population parameters based on environmental data with detection limits


ENVIRONMETRICS
Environmetrics 2010; 21: 645–658
Published online 2 September 2009 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/env.1024


Cheng Peng∗,†

Department of Mathematics and Statistics, University of Southern Maine, Portland, Maine 04104, U.S.A.

SUMMARY

In this paper, we focus on constructing confidence intervals of normal population parameters based on multiple left-censored samples (singly and doubly censored samples are special cases). In addition to the asymptotic confidence interval, we propose a bootstrap procedure based on type I left-censored samples with multiple censoring limits and use it to construct bootstrap confidence intervals for the population parameters of interest. Guidelines for selecting either asymptotic or bootstrap confidence intervals are provided for practitioners. We also provide some illustrative examples using real-life data. Copyright © 2009 John Wiley & Sons, Ltd.

key words: asymptotic confidence interval; bootstrap confidence interval; censoring level; detection limit; type I (multiply) left censoring

1. INTRODUCTION

Type I left-censored data are common in many areas, such as the environmental sciences, in which some of the measurements of interest are below the detection limits due to the precision of instruments. Many estimation methods, such as simple replacement, nonparametric and parametric procedures, have been proposed to handle such data in analysis. Gilliom and Helsel (1986) and Helsel and Gilliom (1986) provide a good review of commonly used existing methods. The recent monograph by Helsel (2005) gives a detailed discussion of various methods available for analyzing data with detection limits.

Likelihood methods are still the major statistical tools for analyzing type I censored data. Since the likelihood function based on censored samples involves the cumulative distribution function (CDF) of the underlying parametric distribution, maximizing the likelihood function becomes demanding and depends heavily on numerical procedures. To avoid numerical iterations for obtaining the maximum likelihood estimator (MLE), given the lack of computing power at the time, Gupta (1952) and Cohen (1957) maximized the likelihood function using tabular and graphic methods for the singly left-censored normal sample case. The asymptotic variance–covariance matrix of the MLE was also given in Gupta (1952) and Cohen (1959). Cohen (1991) systematically studied the estimation of population parameters of interest using a variety of life distributions based on censored and truncated samples using the classical likelihood approach. Two other rigorous likelihood procedures available in the literature for censored samples are the restricted maximum likelihood estimator (RMLE) due to Persson and Rootzen (1977) and the modified maximum likelihood estimator (MMLE) due to Tiku (1967). Since both the restricted and the modified likelihood functions were modified from the original likelihood function so that they do not contain the CDF, they are computationally easier to maximize than the classical likelihood function.

∗Correspondence to: C. Peng, Department of Mathematics and Statistics, University of Southern Maine, Portland, Maine 04104, U.S.A.
†E-mail: [email protected]

Received 12 November 2008; Accepted 27 July 2009

In contrast, finding the MLE using the EM (Expectation and Maximization) algorithm proposed by Dempster et al. (1977) is inherently different from the aforementioned approaches. To find the MLE of the parameters, the EM algorithm replaces the difficult maximization with a sequence of easier maximizations whose estimates ultimately converge to the original MLE. The important fact about the EM algorithm is that, if it works, the MLE will ultimately be obtained. An interesting observation described in El-Shaarawi and Esterby (1992) is that EM shares the idea used in replacement methods of substituting certain values for the censored observations. The EM algorithm uses a sequence of updated values based on the information in the complete data to identify the MLE, while the standard replacement methods use a one-time replacement by a constant to produce a descriptive statistic, which makes them inferior to other statistically and mathematically rigorous procedures.
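As a hedged illustration of the iteration just described, the following Python sketch applies EM to a normal sample with left-censored values. It is not the R program of this paper: the function names (`norm_pdf`, `norm_cdf`, `em_censored_normal`), the starting values and the stopping rule are assumptions made for the example. The E-step replaces each censored value by the conditional mean and variance of a normal variable known to lie below its detection limit, and the M-step recomputes the mean and standard deviation from the completed data.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    # erfc keeps precision in the lower tail, where detection limits often sit
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def em_censored_normal(observed, censored, tol=1e-9, max_iter=5000):
    """EM estimate of (mu, sigma) for a type I left-censored normal sample.

    observed: list of fully observed values
    censored: list of (detection_limit, count) pairs
    """
    n = len(observed) + sum(m for _, m in censored)
    # Assumed starting values: mean and sd of the complete observations
    mu = sum(observed) / len(observed)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in observed) / len(observed))
    for _ in range(max_iter):
        # E-step: conditional mean and variance of X given X < d
        total = sum(observed)
        filled = []
        for d, m in censored:
            xi = (d - mu) / sigma
            lam = norm_pdf(xi) / norm_cdf(xi)
            cond_mean = mu - sigma * lam
            cond_var = sigma ** 2 * (1.0 - xi * lam - lam ** 2)
            filled.append((m, cond_mean, cond_var))
            total += m * cond_mean
        # M-step: complete-data normal MLE
        mu_new = total / n
        ss = sum((x - mu_new) ** 2 for x in observed)
        ss += sum(m * (v + (e - mu_new) ** 2) for m, e, v in filled)
        sigma_new = math.sqrt(ss / n)
        converged = abs(mu_new - mu) + abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma

# Sulfate data of Example 5.1: 21 observed values, 3 values below 1450
sulfate = [1800, 1840, 1820, 1860, 1780, 1760, 1800, 1900, 1770, 1790, 1780,
           1850, 1760, 1710, 1575, 1475, 1780, 1790, 1780, 1790, 1800]
mu_hat, sigma_hat = em_censored_normal(sulfate, [(1450.0, 3)])
print(mu_hat, sigma_hat)  # expected to land close to the MLE reported in Table 1
```

A fixed point of these updates satisfies the same score equations as the direct maximization, which is why the two routes should agree up to numerical tolerance.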

In addition to the likelihood based methods, some parametric and nonparametric regression related approaches are also discussed by researchers such as Helsel (1990), Gilliom and Helsel (1986) and El-Shaarawi (1989). Recently, Singh and Nocerino (2002) gave a comprehensive review of several estimators such as MLE, MMLE, RMLE, EM-based MLE, regression based estimators, etc., based on a variety of rigorous statistical approaches and conducted a comparison study on the performance of these methods by numerical examples and simulation.

Instead of discussing singly, doubly and multiply left censoring problems separately, in this paper we first obtain the MLE of the parameters directly from the likelihood function based on left-censored samples from a normal population with multiple detection limits, so that the singly and doubly left-censored problems are only special cases of the proposed procedure. An R program that includes the quasi-Newton method “BFGS” (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970), the gradient search method “CG” (Fletcher and Reeves, 1964), and the non-gradient maximizer of Nelder and Mead (1965) is provided. Practitioners can select an algorithm from the R code according to their own preferences. Secondly, using the relationship between the parameters in normal and log-normal distributions and the delta method, we develop the asymptotic confidence interval for log-normal parameters directly from the asymptotic results of the MLE based on normal population parameters. Finally, we propose a nonparametric bootstrap procedure based on the sample with multiple detection limits, for the first time in this area, to check the normality assumption made on the sampling distribution of the MLE and to construct two bootstrap confidence intervals.

The rest of the paper is organized as follows. In Section 2, we briefly present the maximum likelihood estimation based on type I multiply left-censored samples from normal populations and the corresponding asymptotic confidence intervals. The asymptotic confidence intervals of lognormal parameters using the delta method are given in Section 3. The bootstrap procedure based on multiply left-censored samples is discussed in Section 4. Several illustrative examples using real-life environmental datasets and some guidelines for selecting appropriate confidence intervals are presented in Section 5. Discussions and concluding remarks are given in Section 6. A sample output of the R program is provided in the Appendix.


2. THE MLE AND ASYMPTOTIC CONFIDENCE INTERVALS

Let $\{y_1, \cdots, y_n\}$ be the type I left-censored sample taken from a normal population $N(\mu, \sigma^2)$ with $y_i = (x_i, c_i)$, where $c_i$ is the censoring variable taking value 1 if $x_i$ is censored and 0 if $x_i$ is completely observed. The likelihood function based on the data is given by

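$$L(\mu, \sigma) = \prod_{i=1}^{n} \left[\Phi\left(\frac{x_i - \mu}{\sigma}\right)\right]^{c_i} \left[\frac{1}{\sigma}\,\phi\left(\frac{x_i - \mu}{\sigma}\right)\right]^{1 - c_i} \tag{1}$$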

where $\Phi(\cdot)$ and $\phi(\cdot)$ are the CDF and density function of the standard normal distribution, and $\mu$ and $\sigma$ are the underlying normal population mean and standard deviation. Since type I left-censored samples have fixed censoring limits, to simplify the notation we assume that $\{x_1, x_2, \cdots, x_{m_0}\}$ is the set of complete observations and that the fixed multiple detection limits are $(d_1, d_2, \cdots, d_k)$ with corresponding censoring counts $m_1, m_2, \cdots, m_k$. The corresponding log-likelihood kernel based on Equation (1) is given by

$$l(\mu, \sigma) = \sum_{j=1}^{k} m_j \ln \Phi\left(\frac{d_j - \mu}{\sigma}\right) - m_0 \ln \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{m_0} (x_i - \mu)^2 \tag{2}$$
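For concreteness, the log-likelihood kernel in Equation (2) can be evaluated with a few lines of Python. This is a minimal sketch rather than the paper's R code; the names `norm_cdf` and `loglik_kernel`, and the representation of the censored part as a list of `(d_j, m_j)` pairs, are assumptions of the example.

```python
import math

def norm_cdf(z):
    # Standard normal CDF via erfc, which stays accurate in the lower tail
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def loglik_kernel(mu, sigma, observed, censored):
    """Equation (2). observed: complete values; censored: (d_j, m_j) pairs."""
    if sigma <= 0:
        return float("-inf")  # outside the parameter space
    cens = sum(m * math.log(norm_cdf((d - mu) / sigma)) for d, m in censored)
    ss = sum((x - mu) ** 2 for x in observed)
    return cens - len(observed) * math.log(sigma) - ss / (2.0 * sigma ** 2)
```

Any general-purpose maximizer can then be applied to `loglik_kernel` as a function of `(mu, sigma)`.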

The maximum likelihood estimator of $(\mu, \sigma)$, denoted by $(\hat\mu, \hat\sigma)$, is the solution to the following system of nonlinear score equations

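$$\frac{\partial l(\mu, \sigma)}{\partial \mu} = -\sum_{j=1}^{k} \frac{m_j}{\sigma} \frac{\phi(\xi_j)}{\Phi(\xi_j)} + \frac{1}{\sigma^2} \sum_{i=1}^{m_0} (x_i - \mu) = 0 \tag{3}$$

$$\frac{\partial l(\mu, \sigma)}{\partial \sigma} = -\sum_{j=1}^{k} \frac{m_j \xi_j}{\sigma} \frac{\phi(\xi_j)}{\Phi(\xi_j)} - \frac{m_0}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{m_0} (x_i - \mu)^2 = 0 \tag{4}$$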

where $\xi_j = (d_j - \mu)/\sigma$. Since the solutions to Equations (3) and (4) have no closed form, numerical methods are used to solve the score equations. The following Hessian matrix is used in some maximization procedures in the R program

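$$H = \begin{pmatrix} \dfrac{\partial^2 l(\mu,\sigma)}{\partial \mu^2} & \dfrac{\partial^2 l(\mu,\sigma)}{\partial \mu\,\partial \sigma} \\[2ex] \dfrac{\partial^2 l(\mu,\sigma)}{\partial \sigma\,\partial \mu} & \dfrac{\partial^2 l(\mu,\sigma)}{\partial \sigma^2} \end{pmatrix} \equiv \begin{pmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{pmatrix} \tag{5}$$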

Note that $E(X_i - \mu) = 0$ and $E(X_i - \mu)^2 = \sigma^2$. After some algebra, we have

$$\psi_{11} = -E(h_{11}) = \sum_{j=1}^{k} \frac{m_j}{\sigma^2} \frac{\phi(\xi_j)}{\Phi(\xi_j)} \left[\xi_j + \frac{\phi(\xi_j)}{\Phi(\xi_j)}\right] + \frac{m_0}{\sigma^2},$$

$$\psi_{12} = \psi_{21} = -E(h_{12}) = \sum_{j=1}^{k} \frac{m_j}{\sigma^2} \frac{\phi(\xi_j)}{\Phi(\xi_j)} \left[\xi_j^2 - 1 + \frac{\xi_j \phi(\xi_j)}{\Phi(\xi_j)}\right],$$

$$\psi_{22} = -E(h_{22}) = \sum_{j=1}^{k} \frac{m_j}{\sigma^2} \frac{\xi_j \phi(\xi_j)}{\Phi(\xi_j)} \left[-2 + \xi_j^2 + \frac{\xi_j \phi(\xi_j)}{\Phi(\xi_j)}\right] + \frac{2 m_0}{\sigma^2}$$

The covariance matrix of $(\hat\mu, \hat\sigma)$ is explicitly given by

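$$V \equiv \mathrm{cov}\begin{pmatrix} \hat\mu \\ \hat\sigma \end{pmatrix} = \frac{1}{\psi_{11}\psi_{22} - \psi_{12}^2} \begin{pmatrix} \psi_{22} & -\psi_{12} \\ -\psi_{21} & \psi_{11} \end{pmatrix} \tag{6}$$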

The standard large sample theory yields the following asymptotic bivariate normal distribution

$$(\hat\mu - \mu_0, \hat\sigma - \sigma_0)^T \sim N(0, V) \tag{7}$$

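where $\mu_0$ and $\sigma_0$ are the true values of the population mean and the standard deviation. The variances of the MLEs $\hat\mu$ and $\hat\sigma$ can be estimated by

$$\widehat{\mathrm{var}}(\hat\mu) = \frac{\hat\psi_{22}}{\hat\psi_{11}\hat\psi_{22} - \hat\psi_{12}^2}, \qquad \widehat{\mathrm{var}}(\hat\sigma) = \frac{\hat\psi_{11}}{\hat\psi_{11}\hat\psi_{22} - \hat\psi_{12}^2} \tag{8}$$

where $\hat\psi_{ij} = \psi_{ij}(\hat\mu, \hat\sigma)$, for $i = 1, 2$, $j = 1, 2$. Finally, the Wald type asymptotic confidence limits for $\mu$ and $\sigma$ can be easily constructed using Equations (7) and (8) and are reported in the R program.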

Since the standard error of the MLE for the population parameters can be obtained from Equation (8), we can easily perform a two sample asymptotic hypothesis test regarding two population parameters (data collected from two different sites). Millard and Deverel (1988) discussed the problem using various nonparametric rank based methods.
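The estimation steps of this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's R program: it solves the score equations (3) and (4) by plain Newton–Raphson iterations (in place of the BFGS, CG or Nelder–Mead options mentioned in Section 1), starts from the mean and standard deviation of the complete observations, and estimates the covariance matrix by inverting the negative observed Hessian at the solution instead of plugging the estimates into the expected-value expressions $\psi_{ij}$. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def _lam(z):
    """phi(z) / Phi(z); erfc keeps the lower tail of Phi accurate."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return pdf / (0.5 * math.erfc(-z / math.sqrt(2.0)))

def score_and_hessian(mu, sigma, observed, censored):
    """Score (Equations 3-4) and observed Hessian of the kernel in Equation (2)."""
    m0 = len(observed)
    s1 = sum(x - mu for x in observed)
    s2 = sum((x - mu) ** 2 for x in observed)
    g_mu = s1 / sigma ** 2
    g_sg = -m0 / sigma + s2 / sigma ** 3
    h11 = -m0 / sigma ** 2
    h12 = -2.0 * s1 / sigma ** 3
    h22 = m0 / sigma ** 2 - 3.0 * s2 / sigma ** 4
    for d, m in censored:  # (detection limit, censoring count) pairs
        xi = (d - mu) / sigma
        lam = _lam(xi)
        g_mu -= m * lam / sigma
        g_sg -= m * xi * lam / sigma
        h11 -= m * lam * (xi + lam) / sigma ** 2
        h12 += m * lam * (1.0 - xi ** 2 - xi * lam) / sigma ** 2
        h22 += m * xi * lam * (2.0 - xi ** 2 - xi * lam) / sigma ** 2
    return (g_mu, g_sg), (h11, h12, h22)

def newton_mle(observed, censored, tol=1e-10, max_iter=100):
    """Return ((mu_hat, sigma_hat), cov) with cov = inverse of minus the Hessian."""
    m0 = len(observed)
    mu = sum(observed) / m0
    sigma = math.sqrt(sum((x - mu) ** 2 for x in observed) / m0)
    for _ in range(max_iter):
        (g1, g2), (h11, h12, h22) = score_and_hessian(mu, sigma, observed, censored)
        det = h11 * h22 - h12 ** 2
        # Newton step: solve H * step = -score for the 2x2 system
        step_mu = -(h22 * g1 - h12 * g2) / det
        step_sg = -(h11 * g2 - h12 * g1) / det
        while sigma + step_sg <= 0.0:  # halve the step to keep sigma positive
            step_mu, step_sg = step_mu / 2.0, step_sg / 2.0
        mu, sigma = mu + step_mu, sigma + step_sg
        if abs(step_mu) + abs(step_sg) < tol:
            break
    _, (h11, h12, h22) = score_and_hessian(mu, sigma, observed, censored)
    det = h11 * h22 - h12 ** 2
    cov = ((-h22 / det, h12 / det), (h12 / det, -h11 / det))
    return (mu, sigma), cov

def wald_intervals(estimates, cov, level=0.95):
    """Wald-type limits: estimate +/- z * standard error, for mu and sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    ses = (math.sqrt(cov[0][0]), math.sqrt(cov[1][1]))
    return [(t - z * se, t + z * se) for t, se in zip(estimates, ses)]
```

The iteration has no safeguard beyond keeping $\sigma$ positive, so a poor starting point may require a more robust optimizer.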

3. CONFIDENCE INTERVAL FOR LOG-NORMAL PARAMETERS

It is common in environmental and related areas that the measurements of interest are asymmetrically distributed. The log transformation is commonly used in the hope of obtaining a normal distribution. Cohen (1991) obtained the asymptotic confidence intervals directly from the log-normal likelihood procedure. In this section, we consider the two parameter log-normal distribution and use the delta method to derive the MLE and confidence intervals based on the MLEs obtained using the procedure for the normal distribution case discussed in Section 2. Let $\mu_{\mathrm{ln}}$ and $\sigma_{\mathrm{ln}}$ be the mean and standard deviation of the two parameter log-normal population. The transformed normal population mean and standard deviation are denoted by $\mu$ and $\sigma$ as usual. The relationship between $(\mu_{\mathrm{ln}}, \sigma_{\mathrm{ln}})$ and $(\mu, \sigma)$ is given by (Yuan, 1933)

$$\mu_{\mathrm{ln}} = \exp(\mu + \sigma^2/2), \qquad \sigma_{\mathrm{ln}} = \exp(\mu)\sqrt{\exp(\sigma^2)[\exp(\sigma^2) - 1]} \tag{9}$$


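Since the MLE $(\hat\mu, \hat\sigma)$ of $(\mu, \sigma)$ is obtained in Section 2, the MLE of the log-normal parameters $(\mu_{\mathrm{ln}}, \sigma_{\mathrm{ln}})$ can be easily obtained by replacing $(\mu, \sigma)$ in Equation (9) with $(\hat\mu, \hat\sigma)$. Note that

$$\frac{\partial \mu_{\mathrm{ln}}}{\partial \mu} = \exp(\mu + \sigma^2/2) = \mu_{\mathrm{ln}}, \qquad \frac{\partial \mu_{\mathrm{ln}}}{\partial \sigma} = \exp(\mu + \sigma^2/2)\,\sigma = \sigma \mu_{\mathrm{ln}},$$

$$\frac{\partial \sigma_{\mathrm{ln}}}{\partial \mu} = \exp(\mu)\sqrt{\exp(\sigma^2)[\exp(\sigma^2) - 1]} = \sigma_{\mathrm{ln}}, \qquad \frac{\partial \sigma_{\mathrm{ln}}}{\partial \sigma} = \sigma\left(\sigma_{\mathrm{ln}} + \frac{\exp(\mu + 2\sigma^2)}{\sqrt{\exp(2\sigma^2) - \exp(\sigma^2)}}\right)$$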

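Using the first order Taylor expansion of the MLE $(\hat\mu_{\mathrm{ln}}, \hat\sigma_{\mathrm{ln}})$ at $(\mu_0, \sigma_0)$, the true value of $(\mu, \sigma)$, we have

$$\begin{pmatrix} \hat\mu_{\mathrm{ln}} - \mu^0_{\mathrm{ln}} \\ \hat\sigma_{\mathrm{ln}} - \sigma^0_{\mathrm{ln}} \end{pmatrix} = \begin{pmatrix} \mu^0_{\mathrm{ln}} & \sigma_0 \mu^0_{\mathrm{ln}} \\ \sigma^0_{\mathrm{ln}} & \sigma_0 E \end{pmatrix} \begin{pmatrix} \hat\mu - \mu_0 \\ \hat\sigma - \sigma_0 \end{pmatrix} + o_P(n^{-1/2}) \tag{10}$$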

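where

$$E = \sigma^0_{\mathrm{ln}} + \frac{\exp(\mu_0 + 2\sigma_0^2)}{\sqrt{\exp(2\sigma_0^2) - \exp(\sigma_0^2)}}$$

and $(\mu^0_{\mathrm{ln}}, \sigma^0_{\mathrm{ln}})$ is $(\mu_{\mathrm{ln}}, \sigma_{\mathrm{ln}})$ evaluated at $(\mu_0, \sigma_0)$. Therefore, employing Slutsky’s Theorem and the Multivariate Central Limit Theorem, we can see that $(\hat\mu_{\mathrm{ln}}, \hat\sigma_{\mathrm{ln}})$ is also asymptotically normally distributed with variance–covariance matrix

$$V_{\mathrm{ln}} \approx \mathrm{var}\begin{pmatrix} \hat\mu_{\mathrm{ln}} \\ \hat\sigma_{\mathrm{ln}} \end{pmatrix} = \begin{pmatrix} \mu^0_{\mathrm{ln}} & \sigma_0 \mu^0_{\mathrm{ln}} \\ \sigma^0_{\mathrm{ln}} & \sigma_0 E \end{pmatrix} V \begin{pmatrix} \mu^0_{\mathrm{ln}} & \sigma_0 \mu^0_{\mathrm{ln}} \\ \sigma^0_{\mathrm{ln}} & \sigma_0 E \end{pmatrix}^T \equiv \begin{pmatrix} v^{0,11}_{\mathrm{ln}} & v^{0,12}_{\mathrm{ln}} \\ v^{0,21}_{\mathrm{ln}} & v^{0,22}_{\mathrm{ln}} \end{pmatrix} \tag{11}$$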

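where $V$ is the covariance matrix of $(\hat\mu, \hat\sigma)^T$ specified in Equation (6). Since $V_{\mathrm{ln}}$ contains the unknown population parameters $(\mu_0, \sigma_0)$, the MLE of $V_{\mathrm{ln}}(\mu_0, \sigma_0)$, denoted by $\hat V_{\mathrm{ln}}(\mu_0, \sigma_0) = V_{\mathrm{ln}}(\hat\mu, \hat\sigma)$, is used to estimate the covariance matrix of $(\hat\mu_{\mathrm{ln}}, \hat\sigma_{\mathrm{ln}})$. The main diagonal elements of $\hat V_{\mathrm{ln}}$, denoted by $\hat v^{11}_{\mathrm{ln}}$ and $\hat v^{22}_{\mathrm{ln}}$, are the (approximate) asymptotic variances of $\hat\mu_{\mathrm{ln}}$ and $\hat\sigma_{\mathrm{ln}}$, respectively. Therefore, the confidence intervals for the log-normal parameters $(\mu_{\mathrm{ln}}, \sigma_{\mathrm{ln}})$ can be easily constructed. Numerical examples will be used in Section 5 to implement the procedure developed in this section.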
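The delta-method calculation of this section amounts to a change of variables followed by one matrix product. The Python sketch below illustrates it; the function names and the nested-tuple representation of the 2×2 matrices are assumptions of the example, and the covariance matrix `v` of the normal-scale MLE is taken as given (for instance from Equation (6)).

```python
import math

def lognormal_from_normal(mu, sigma):
    """Equation (9): log-normal mean and standard deviation."""
    mu_ln = math.exp(mu + sigma ** 2 / 2.0)
    sigma_ln = math.exp(mu) * math.sqrt(
        math.exp(sigma ** 2) * (math.exp(sigma ** 2) - 1.0))
    return mu_ln, sigma_ln

def jacobian(mu, sigma):
    """Partial derivatives of (mu_ln, sigma_ln) with respect to (mu, sigma)."""
    mu_ln, sigma_ln = lognormal_from_normal(mu, sigma)
    e = sigma_ln + math.exp(mu + 2.0 * sigma ** 2) / math.sqrt(
        math.exp(2.0 * sigma ** 2) - math.exp(sigma ** 2))
    return ((mu_ln, sigma * mu_ln), (sigma_ln, sigma * e))

def delta_method_cov(mu, sigma, v):
    """Equation (11): J V J^T, where v is the 2x2 covariance of the normal MLE."""
    j = jacobian(mu, sigma)
    return tuple(
        tuple(sum(j[r][a] * v[a][b] * j[c][b] for a in range(2) for b in range(2))
              for c in range(2))
        for r in range(2))
```

Standard errors for the log-normal estimates are the square roots of the diagonal of the returned matrix, evaluated at the MLE.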

Since the asymptotic results depend on the assumption that the MLE is approximately normally distributed, which is affected by the sample size and the censoring proportions, it is necessary to check the assumption of a normal sampling distribution before using the asymptotic confidence limits discussed in this section and in the previous section. In the next section, we introduce a nonparametric bootstrap procedure which requires only a reasonable sample size to get a valid estimate of the empirical distribution.


4. NONPARAMETRIC BOOTSTRAP CONFIDENCE INTERVALS

The bootstrap procedure has been successfully applied in many quantitative areas since it was formally proposed by Efron (1979, 1981b, 1985). A very comprehensive discussion of bootstrap methodology and applications can be found in Efron and Tibshirani (1993) and Davison and Hinkley (1997). Its application to incomplete censored samples has focused on survival data analysis with random right censored observations (Efron, 1981a). It turns out that Shumway et al. (1989) were among the first to use the bootstrap method in environmental data with a single detection limit. In this section, we propose a nonparametric bootstrap procedure based on sampling on cases (Shumway et al., 1989; Davison and Hinkley, 1997) for type I left-censored samples with multiple detection limits (single and double detection limits are special cases under this framework), and then bootstrap percentile and bias corrected and accelerated (BCa) bootstrap confidence intervals are constructed. Meanwhile, we also use the bootstrap sampling distribution to assess the normality assumption on the sampling distribution of the MLE and use it to select the appropriate confidence intervals.

Let $\{y_1, \cdots, y_n\}$ be the type I left-censored sample taken from a normal population $N(\mu, \sigma^2)$ with $y_i = (x_i, c_i)$, where $c_i$ is the censoring variable taking value 1 if $x_i$ is censored (below the detection limit) and 0 if $x_i$ is completely observed (above the detection limit). A bootstrap sample, denoted by $\{y_1^*, \cdots, y_n^*\}$, is selected from the original random sample $\{y_1, \cdots, y_n\}$ with replacement as described in Davison and Hinkley (1997). We repeat the process a large number of times, say, $B$. Then for the $b$th bootstrap sample, we perform the procedure in Section 2 to obtain the bootstrap estimates of $(\mu, \sigma)$, denoted by $(\hat\mu_b^*, \hat\sigma_b^*)$. The bootstrap percentile and BCa confidence intervals for $(\mu, \sigma)$ will be constructed based on the set of $B$ ordered pairs $(\hat\mu_b^*, \hat\sigma_b^*)$ (for $b = 1, \cdots, B$). Note that, with this sampling-cases procedure, the observed censoring proportion in each bootstrap sample varies and is in general different from that in the original real dataset. For ease of presentation and without loss of generality, we next present only the bootstrap confidence intervals for $\mu$. The same discussion applies to $\sigma$.
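The sampling-cases procedure above can be sketched in Python as follows. Several choices here are assumptions of the sketch and not part of the paper: each case is stored as a pair `(x, c)` with `x` equal to the detection limit when `c = 1`; the estimator applied to each resample is a compact EM iteration (swapped in for the direct maximization of Section 2 because it needs no step-size safeguards); and a resample with fewer than two distinct complete observations is redrawn, since the estimate is not defined for it.

```python
import math
import random

def _lam(z):
    """phi(z) / Phi(z); erfc keeps the lower tail of Phi accurate."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return pdf / (0.5 * math.erfc(-z / math.sqrt(2.0)))

def fit_em(cases, tol=1e-8, max_iter=2000):
    """MLE of (mu, sigma) by EM; cases are (x, c), c = 1 if censored at limit x."""
    obs = [x for x, c in cases if c == 0]
    cen = [x for x, c in cases if c == 1]
    n = len(cases)
    mu = sum(obs) / len(obs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in obs) / len(obs))
    for _ in range(max_iter):
        # E-step: conditional mean and variance for every censored case
        fills = []
        for d in cen:
            xi = (d - mu) / sigma
            lam = _lam(xi)
            fills.append((mu - sigma * lam,
                          sigma ** 2 * (1.0 - xi * lam - lam ** 2)))
        # M-step: complete-data normal MLE
        mu_new = (sum(obs) + sum(e for e, _ in fills)) / n
        ss = sum((x - mu_new) ** 2 for x in obs)
        ss += sum(v + (e - mu_new) ** 2 for e, v in fills)
        sigma_new = math.sqrt(ss / n)
        converged = abs(mu_new - mu) + abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma

def bootstrap_cases(cases, n_boot=1000, seed=0, fit=fit_em):
    """Resample whole cases (x, c) with replacement and refit each resample."""
    rng = random.Random(seed)
    estimates = []
    while len(estimates) < n_boot:
        sample = rng.choices(cases, k=len(cases))
        # Assumption of this sketch: redraw degenerate resamples
        if len({x for x, c in sample if c == 0}) < 2:
            continue
        estimates.append(fit(sample))
    return estimates
```

Because whole cases are resampled, the number of censored values differs from one resample to the next, as noted in the text.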

4.1. Bootstrap percentile confidence interval

To obtain a $100(1 - \alpha)\%$ bootstrap percentile confidence interval of $\mu$, we simply find the $\alpha/2$ and $1 - \alpha/2$ percentiles, denoted by $\hat\mu^{*(\alpha/2)}$ and $\hat\mu^{*(1-\alpha/2)}$, based on the set of bootstrap estimates of $\mu$. The simple $100(1 - \alpha)\%$ bootstrap percentile confidence interval is defined to be $(\hat\mu^{*(\alpha/2)}, \hat\mu^{*(1-\alpha/2)})$.
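A minimal Python sketch of the percentile interval follows. The paper does not say which empirical quantile definition is used, so the order-statistic rule below (the smallest sorted value whose rank is at least $Bp$) is an assumption, as are the function names.

```python
import math

def _order_stat(sorted_vals, p):
    """Smallest order statistic with rank >= B*p (ranks clipped to 1..B)."""
    b = len(sorted_vals)
    # round() guards against floating-point noise such as 25.000000000000004
    rank = math.ceil(round(b * p, 9))
    return sorted_vals[min(b, max(1, rank)) - 1]

def percentile_interval(boot_estimates, alpha=0.05):
    """Bootstrap percentile interval from a list of bootstrap estimates."""
    s = sorted(boot_estimates)
    return _order_stat(s, alpha / 2.0), _order_stat(s, 1.0 - alpha / 2.0)
```

With $B = 1000$ and $\alpha = 0.05$ this picks the 25th and 975th ordered bootstrap estimates.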

4.2. Bias corrected and accelerated (BCa) bootstrap confidence interval

The BCa bootstrap confidence interval is essentially a modified percentile confidence interval. Here we adopt Efron’s BCa confidence interval (Efron and Tibshirani, 1993, pp. 184–188). A $100(1 - \alpha)\%$ BCa confidence interval of $\mu$ is given by $(\hat\mu^{*(\nu_1)}, \hat\mu^{*(\nu_2)})$, where

$$\nu_1 = \Phi\left(z_0 + \frac{z_0 + z^{(\alpha/2)}}{1 - a(z_0 + z^{(\alpha/2)})}\right) \quad \text{and} \quad \nu_2 = \Phi\left(z_0 + \frac{z_0 + z^{(1-\alpha/2)}}{1 - a(z_0 + z^{(1-\alpha/2)})}\right)$$

As usual, here $\Phi(\cdot)$ is the standard normal CDF and $z^{(\alpha)}$ is the $100\alpha$th percentile point of a standard normal distribution. The bias-correction $z_0$ can be calculated from the bootstrap estimates and the MLE $\hat\mu$ obtained from the original sample data as

$$z_0 = \Phi^{-1}\left(\frac{\#\{\hat\mu_b^* < \hat\mu\}}{B}\right)$$

where $\Phi^{-1}(\cdot)$ is the inverse of the CDF of the standard normal random variable. For the acceleration $a$, we choose to use the following jackknife estimator (Efron and Tibshirani, 1993, pp. 141–149)

$$a = \frac{\sum_{i=1}^{n} (\hat\mu_{(\cdot)} - \hat\mu_{(i)})^3}{6\left\{\sum_{i=1}^{n} (\hat\mu_{(\cdot)} - \hat\mu_{(i)})^2\right\}^{3/2}} \tag{12}$$

where $\hat\mu_{(i)}$ is the MLE of $\mu$ based on the original data with the $i$th observation deleted and $\hat\mu_{(\cdot)} = \sum_{i=1}^{n} \hat\mu_{(i)}/n$.

The bootstrap percentile confidence interval is easy to implement. The BCa bootstrap confidence interval is seemingly complex, but it is easily implemented using a few lines of computer code. The above bootstrap confidence intervals will be reported in illustrative examples in the next section.
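The BCa computation can be sketched in Python as below. The sketch assumes that the bootstrap estimates, the original MLE and the $n$ leave-one-out (jackknife) estimates have already been computed; the function name, the argument names and the order-statistic rule used to read off the percentiles are hypothetical choices for the example.

```python
import math
from statistics import NormalDist

def bca_interval(boot, theta_hat, jack, alpha=0.05):
    """BCa interval from bootstrap estimates, the MLE and jackknife estimates."""
    nd = NormalDist()
    b = len(boot)
    s = sorted(boot)
    # Bias correction z0; inv_cdf raises an error if the proportion is 0 or 1
    z0 = nd.inv_cdf(sum(1 for t in boot if t < theta_hat) / b)
    # Acceleration from the jackknife estimates, Equation (12)
    centre = sum(jack) / len(jack)
    num = sum((centre - t) ** 3 for t in jack)
    den = 6.0 * sum((centre - t) ** 2 for t in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted_level(p):
        z = nd.inv_cdf(p)
        return nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))

    def order_stat(p):
        # round() guards against floating-point noise in b * p
        rank = math.ceil(round(b * p, 9))
        return s[min(b, max(1, rank)) - 1]

    nu1 = adjusted_level(alpha / 2.0)
    nu2 = adjusted_level(1.0 - alpha / 2.0)
    return order_stat(nu1), order_stat(nu2)
```

When $z_0 = 0$ and $a = 0$ the adjusted levels reduce to $\alpha/2$ and $1 - \alpha/2$, so the result coincides with the percentile interval.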

All bootstrap sampling distributions will be represented graphically by histograms. The shapes of the histograms provide rough information on whether the asymptotic confidence interval is appropriate.

Finally, it should be pointed out that, for the confidence interval of log-normal population parameters, the bootstrap confidence intervals are constructed based on the bootstrap estimates $(\hat\mu^*_{\mathrm{ln}}, \hat\sigma^*_{\mathrm{ln}})$.

5. ILLUSTRATIVE EXAMPLES USING ENVIRONMENTAL DATA

In this section, we work on four examples based on real environmental data containing single and multiple detection limits. We will report the MLE of the parameters and various confidence intervals (asymptotic, bootstrap percentile and BCa). The precision used in the numeric procedure is $10^{-8}$ (the absolute difference between the true maximum likelihood and the likelihood evaluated at the estimated parameters is less than or equal to $10^{-8}$). We use the gradient search method to optimize the likelihood kernel.

At the end of this section, we provide contour plots for the surfaces of the kernel of the log-likelihood functions defined based on the four numerical examples to show that the point estimates of the parameters actually maximize the likelihood.

As a guideline for selecting from the available confidence intervals, we recommend reporting bootstrap confidence intervals if the sampling distribution of the MLE is not normally distributed. If the bootstrap sampling distribution of the MLE is approximately normal, the asymptotic and bootstrap confidence intervals are similar to each other. The BCa confidence interval uses the normal distribution, but the inclusion of the bias correction coefficient rectifies the normal assumption. Both bootstrap percentile and BCa confidence intervals are distribution free.

5.1. Left singly censored sample case

Example 5.1. The following sulfate concentration dataset is taken from the U.S. EPA (1992) RCRA Guidance Document. The data values are <1450, <1450, <1450, 1800, 1840, 1820, 1860, 1780, 1760, 1800, 1900, 1770, 1790, 1780, 1850, 1760, 1710, 1575, 1475, 1780, 1790, 1780, 1790, 1800. Three concentration levels were below the detection limit 1450. The dataset has been analyzed using different methods (Singh and Nocerino, 2002). Assume that the original data are normally distributed. The summary statistics are given in Table 1. The score functions evaluated at the estimated parameters are $-1.402435 \times 10^{-6}$ and $1.729616 \times 10^{-7}$, and the two eigenvalues of the corresponding Hessian matrix are both negative, with values −0.000990529 and −0.001646700; therefore, the likelihood was maximized at (1723.997, 153.6449). It took less than 10 seconds to obtain the MLE using a standard desktop computer (Pentium 4 with CPU 1 GHz and 1 GB of RAM).

Table 1. 95% Confidence intervals for population mean and standard deviation based on singly left censored sulfate concentration data

| | µ | σ |
|---|---|---|
| MLE (standard error) | 1723.997 (31.67845) | 153.6449 (24.76515) |
| Asymptotic C.I. | (1661.908, 1786.085) | (105.1061, 202.1837) |
| Bootstrap percentile C.I. | (1542.058, 1771.367) | (99.2592, 668.4425) |
| Bootstrap BCa C.I. | (1669.196, 1803.077) | (25.12218, 203.0218) |

Figure 1. Bootstrap sampling distribution of mean and standard deviation of mean sulfate concentrations

Since the bootstrap sampling distributions of µ and σ are asymmetrically distributed (Figure 1), we would recommend reporting the bootstrap BCa confidence intervals.

5.2. Left doubly censored sample case

Example 5.2 (Ground Water Data). Deverel and Millard (1988) studied the distribution of trace element concentrations in shallow ground-waters from the Alluvial Fan Zone and the Basin-Trough Zone underneath the San Joaquin Valley, California. The dataset contains multiple detection limits and a few missing values as well. Millard and Deverel (1988) re-analyzed copper and zinc concentrations using nonparametric rank based two sample location tests to compare the copper and zinc concentrations in the two geologic zones. In this example, we use the 68 zinc concentrations (including one missing value, denoted by ms) given in micrograms per liter: <10, 9, ms, 5, 18, <10, 12, 10, 11, 11, 19, 8, <3, <10, <10, 10, 10, 10, 10, <10, 10, <10, 10, <10, 10, <10, 10, 10, 20, 20, <10, 20, 20, 20, <10, 10, 20, 620, 40, 50, 33, 10, 20, 10, 10, 10, 30, 20, 10, 20, 20, 20, <10, 20, 23, 17, 10, <10, 10, 20, 29, 20, <10, 10, <10, 10, 7, <10. In this analysis and thereafter, we will ignore the missing measurements (treating them as missing completely at random, MCAR). Since the original measurements are not normally distributed, we take the natural log transformation of the original data, fit the transformed data with a normal distribution with mean µ and standard deviation σ, and convert to the original scale to obtain the confidence intervals for µln and σln.

Figure 2. Bootstrap sampling distribution of µln and σln based on zinc concentrations in Alluvial Fan Zone in San Joaquin Valley, CA

When calculating bootstrap confidence intervals, we need to use the relation in Equation (9) to get the bootstrap estimates for µln and σln based on the bootstrap estimates of µ and σ. The correction coefficient a in Equation (12) needs to be re-calculated based on µln and σln from the original data accordingly, using the jackknife procedure as well. Figure 2 shows that the bootstrap sampling distributions of both µln and σln have two peaks. After we remove the outlier concentration (620) from the data, the bootstrap sampling distribution of µln becomes unimodal, while the bootstrap sampling distribution of σln remains bimodal. It is noted that the bootstrap sampling distribution reflects the actual sampling distribution of the MLE for large samples.

Various confidence intervals for the mean zinc concentration and its standard deviation are summarized in Table 2. Since the bootstrap sampling distributions of µln and σln are not symmetric, it is inappropriate to use the asymptotic confidence intervals. We would recommend reporting the bootstrap percentile or BCa interval for µln and the bootstrap percentile interval for σln.

5.3. Left multiply censored sample case

Table 2. 95% Confidence intervals for population mean and standard deviation based on the doubly left censored zinc concentration data collected from Alluvial Fan Zone in San Joaquin Valley, CA

| | µ | σ |
|---|---|---|
| MLE (standard error) | 2.474561 (0.1031153) | 0.8019212 (0.0819199) |
| | µln | σln |
| MLE (standard error) | 16.38063 (1.851450) | 15.5601 (3.32169) |
| Asymptotic C.I. | (12.75185, 20.0094) | (9.049707, 22.07049) |
| Bootstrap percentile C.I. | (10.44129, 18.42754) | (6.626945, 43.07005) |
| Bootstrap BCa C.I. | (12.65597, 19.18215) | (6.048323, 26.97774) |

Example 5.3. We still use the Ground Water Data described in Example 5.2. The measurements of interest we consider here are the 50 copper concentration levels (in micrograms per liter) taken from the Basin Trough Zone: 2, 2, 12, 2, 1, <10, <10, 4, <10, <1, 1, <2, <2, 1, 2, <10, 3, <1, 1, 1, 3, <5, ms, 17, 23, 9, 9, 3, 3, <15, <5, 4, <5, <5, <5, 4, 8, 1, 15, 3, 3, 1, 6, 3, 6, 3, 4, 5, 14, 4. Again the copper concentration levels are NOT normally distributed. We take the log transformation first to use the normal distribution based procedure developed in Section 2 and then convert it to the original scale. There are 5 detection limits in this dataset. We report statistics similar to those reported in the previous examples.

Table 3. 95% Confidence intervals for population mean and standard deviation based on the multiply left censored copper concentration data (Basin Trough Zone)

| | µ | σ |
|---|---|---|
| MLE (standard error) | 1.033080 (0.1469766) | 0.9355252 (0.1105054) |
| | µln | σln |
| MLE (standard error) | 4.352212 (0.718086) | 5.148453 (1.515124) |
| Asymptotic C.I. | (2.944789, 5.759635) | (2.178864, 8.118042) |
| Bootstrap percentile C.I. | (2.923054, 4.992643) | (2.927852, 7.664359) |
| Bootstrap BCa C.I. | (3.76134, 5.500981) | (3.154738, 8.163464) |

Figure 3. Bootstrap sampling distribution of µln and σln for copper concentrations taken from Basin Trough Zone in the San Joaquin Valley, CA

We can see from Figure 3 that the bootstrap sampling distribution for µln is close to a normal distribution, while the bootstrap sampling distribution for σln is skewed. We would recommend reporting bootstrap percentile intervals since they are more robust (no normality assumption is implicitly made). In fact, the percentile and BCa intervals are quite similar to each other (see Table 3).

Figure 4. Bootstrap sampling distribution of µln and σln of copper concentrations taken from Alluvial Fan Zone in the San Joaquin Valley, CA

Table 4. 95% Confidence intervals for population mean and standard deviation based on multiply left censored copper concentration data (Alluvial Fan Zone)

| | µ | σ |
|---|---|---|
| MLE (standard error) | 0.944206 (0.1086561) | 0.8005244 (0.0807726) |
| | µln | σln |
| MLE (standard error) | 3.541767 (0.4208793) | 3.356416 (0.7239804) |
| Asymptotic C.I. | (2.716858, 4.366675) | (1.937441, 4.775392) |
| Bootstrap percentile C.I. | (2.711325, 3.963123) | (2.207391, 4.695401) |
| Bootstrap BCa C.I. | (3.088806, 4.36353) | (2.242605, 4.777407) |

Example 5.4. The measurements of interest we consider here are the 68 copper concentration levels (including missing values) from the Alluvial Fan Zone in the Ground Water Data described in Example 5.2: <1, <1, 3, 3, 5, 1, 4, 4, 2, 2, 1, 2, <5, 11, <1, 2, 2, 2, 2, <20, 2, 2, 3, 3, ms, <20, <10, 7, 5, 2, 2, <10, 7, 12, <1, 20, ms, ms, 16, <5, 1, 2, <5, 3, 2, 8, 7, 5, <5, 2, <10, <5, <5, 2, 10, 2, 4, <5, 2, 3, 9, <5, 2, 2, 2, 2, 1, 1. Again the copper concentration levels are NOT normally distributed. The log transformation of the copper concentrations is performed before the estimation. The bootstrap sampling distributions are shown in Figure 4, and the MLE and various confidence intervals are given in Table 4.

Figure 5. Contour plots for the log-likelihood kernel surface for the worked examples. The MLE and the initial values (the mean and standard deviation calculated from the complete sample data values), labeled by @, are shown in the plots. Note that the MLEs in Examples 5.2–5.4 are based on normally distributed log-concentrations of zinc or copper

Figure 4 indicates that the bootstrap sampling distributions for both µln and σln are close to a normal distribution. We can report either asymptotic or bootstrap confidence intervals for µln and σln. This fact is also reflected in Table 4, in which all confidence intervals are quite close to each other.
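As a minimal sketch of the data preparation used in Example 5.4, the following Python snippet parses the listed copper values into detected and left-censored measurements, drops the missing entries, and takes logs. The record layout (pairs of log value and censoring flag), the function name `parse`, and the reading of `ms` as a missing value to be dropped are assumptions made for illustration; the author's R program is not shown in the paper.

```python
import math
from collections import Counter

# Copper concentrations from Example 5.4; "<d" is a nondetect with
# detection limit d, "ms" is read here as a missing value (assumption).
RAW = """
<1, <1, 3, 3, 5, 1, 4, 4, 2, 2, 1, 2, <5, 11, <1, 2, 2, 2, 2, <20, 2, 2, 3, 3,
ms, <20, <10, 7, 5, 2, 2, <10, 7, 12, <1, 20, ms, ms, 16, <5, 1, 2, <5, 3, 2,
8, 7, 5, <5, 2, <10, <5, <5, 2, 10, 2, 4, <5, 2, 3, 9, <5, 2, 2, 2, 2, 1, 1
"""

def parse(raw):
    """Return (tokens, records); each record is (log_value, is_censored)."""
    tokens = [t.strip() for t in raw.replace("\n", " ").split(",")]
    records = []
    for tok in tokens:
        if tok == "ms":                 # missing value: dropped
            continue
        censored = tok.startswith("<")  # nondetect: value is the limit
        value = float(tok.lstrip("<"))
        records.append((math.log(value), censored))
    return tokens, records

tokens, records = parse(RAW)
observed = [y for y, c in records if not c]
limits = Counter(round(math.exp(y)) for y, c in records if c)

print("entries:", len(tokens))
print("usable records:", len(records))
print("detected values:", len(observed))
print("nondetects per detection limit:", dict(limits))
```

Under this reading, the 68 entries contain 3 missing values, 48 detected values, and 17 nondetects spread over four detection limits (1, 5, 10, and 20), which is what makes the sample multiply left censored.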

To conclude this section, we provide contour plots for the surfaces of the likelihood kernel (normal based) in the four worked examples. An interesting observation from the contour plots is that the surface turns out to be concave upward in the neighborhood of the point whose coordinates are calculated from the complete data values. This allows us to find easy and efficient initial values for the iterative numerical procedure. This observation also implies that any standard likelihood method, such as the EM algorithm or the interpolation method of Cohen (1959), should produce the same results. The contour plots (Figure 5) also indicate that the proposed methods maximize the likelihood in the neighborhood of the initial value (µ0, σ0) based on the complete data values in the samples in all four worked examples.
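The remark that any standard likelihood method should give the same result can be illustrated with a minimal EM sketch in Python. This is not the author's R program: the function names, the tallied encoding of the Example 5.4 data (counts per detected value and per detection limit, assuming the `ms` entries are dropped), and the stopping rule are assumptions made for illustration. The starting point is the mean and standard deviation of the detected log-concentrations, as the contour plots suggest.

```python
import math
from statistics import NormalDist

STD = NormalDist()

# Example 5.4 tallied from the listing (assumed encoding): {value: count}.
DETECTED = {1: 5, 2: 21, 3: 6, 4: 3, 5: 3, 7: 3, 8: 1, 9: 1,
            10: 1, 11: 1, 12: 1, 16: 1, 20: 1}
NONDETECT = {1: 4, 5: 8, 10: 3, 20: 2}   # {detection limit: count}

obs = [math.log(v) for v, k in DETECTED.items() for _ in range(k)]
lim = [math.log(d) for d, k in NONDETECT.items() for _ in range(k)]

def loglik(mu, sigma):
    """Log-likelihood kernel of a multiply left-censored normal sample."""
    ll = sum(-math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2 for y in obs)
    return ll + sum(math.log(STD.cdf((c - mu) / sigma)) for c in lim)

def initial_values():
    """Mean and standard deviation of the detected (complete) values."""
    m = sum(obs) / len(obs)
    s = math.sqrt(sum((y - m) ** 2 for y in obs) / (len(obs) - 1))
    return m, s

def em(mu, sigma, tol=1e-10, max_iter=5000):
    """EM iterations; returns (mu, sigma, number of iterations used)."""
    n = len(obs) + len(lim)
    for it in range(1, max_iter + 1):
        s1, s2 = sum(obs), sum(y * y for y in obs)
        for c in lim:                   # E-step: truncated-normal moments
            a = (c - mu) / sigma
            lam = STD.pdf(a) / STD.cdf(a)
            s1 += mu - sigma * lam
            s2 += mu * mu + sigma * sigma - sigma * lam * (c + mu)
        new_mu = s1 / n                 # M-step: complete-data normal MLE
        new_sigma = math.sqrt(s2 / n - new_mu ** 2)
        done = max(abs(new_mu - mu), abs(new_sigma - sigma)) < tol
        mu, sigma = new_mu, new_sigma
        if done:
            return mu, sigma, it
    return mu, sigma, max_iter

mu0, sigma0 = initial_values()
mu_hat, sigma_hat, n_iter = em(mu0, sigma0)
print("initial values:", (mu0, sigma0))
print("EM estimate:", (mu_hat, sigma_hat), "after", n_iter, "iterations")
```

Under these assumptions the EM iterates should settle near the normal-parameter MLE reported in Table 4. The estimated mean falls below the starting value because the detected values alone overstate the mean of a left-censored sample.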

6. CONCLUDING REMARKS

We have systematically established both asymptotic and bootstrap confidence intervals for the population parameters based on random samples with multiple detection limits. Three goals have been achieved. (1) The asymptotic confidence interval is based on general random samples with multiple detection limits, which include singly and doubly left-censored problems as special cases. The optimization is performed directly on the likelihood function using a fast R program that offers practitioners a choice of several maximization algorithms. (2) Using the delta method and the relationship between the parameters of the lognormal and normal distributions, we developed asymptotic confidence intervals for the lognormal population parameters based on the asymptotic results for the MLEs obtained from normal samples with multiple detection limits. This approach differs from that of Cohen (1991), which was developed directly from the lognormal distribution. (3) Realizing that the bootstrap procedure available in the literature on analyzing environmental data is limited to the singly left-censored sample case, we proposed a bootstrap procedure for the general case of samples with multiple detection limits. In addition to constructing bootstrap confidence intervals, we also proposed using the bootstrap sampling distribution as a guideline for selecting appropriate confidence intervals.
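The delta-method step in goal (2) can be sketched in Python under the usual lognormal moment relationships, mean = exp(µ + σ²/2) and SD = exp(µ + σ²/2)·sqrt(exp(σ²) − 1). The function name `lognormal_summary` is hypothetical, and the covariance between the two normal-parameter estimators is an assumed, illustrative value because the paper reports only their standard errors.

```python
import math

def lognormal_summary(mu, sigma, cov):
    """Delta-method estimates for the lognormal mean and SD.

    mu, sigma : MLEs of the normal parameters of the log-data
    cov       : 2x2 covariance matrix of (mu_hat, sigma_hat)
    Returns ((mean_ln, se_mean_ln), (sd_ln, se_sd_ln)).
    """
    w = math.exp(sigma ** 2)
    mean_ln = math.exp(mu + 0.5 * sigma ** 2)
    sd_ln = mean_ln * math.sqrt(w - 1.0)

    # Gradients of the two transformations with respect to (mu, sigma).
    grad_mean = (mean_ln, sigma * mean_ln)
    grad_sd = (sd_ln, sd_ln * sigma * (1.0 + w / (w - 1.0)))

    def se(grad):
        # Quadratic form grad' * cov * grad, then square root.
        var = sum(grad[i] * cov[i][j] * grad[j]
                  for i in range(2) for j in range(2))
        return math.sqrt(var)

    return (mean_ln, se(grad_mean)), (sd_ln, se(grad_sd))

# MLEs and standard errors of the normal parameters from Table 4.
mu_hat, sigma_hat = 0.944206, 0.8005244
se_mu, se_sigma = 0.1086561, 0.0807726
# ASSUMPTION: the covariance is not reported in the paper; this value is
# illustrative only.
cov_mu_sigma = -0.00117
cov = [[se_mu ** 2, cov_mu_sigma], [cov_mu_sigma, se_sigma ** 2]]

(mean_ln, se_mean), (sd_ln, se_sd) = lognormal_summary(mu_hat, sigma_hat, cov)
print("lognormal mean:", mean_ln, "SE:", se_mean)
print("lognormal SD:  ", sd_ln, "SE:", se_sd)
```

The point estimates depend only on the reported MLEs and agree with the µln and σln row of Table 4. The standard errors additionally depend on the assumed covariance, so they should be read as indicative only.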

The R program used in the numerical examples is available from the author upon request. In addition to the MLE and the asymptotic confidence intervals, the R program reports the CPU time used to obtain the MLE, the number of iterations required to achieve the convergence criterion, and the value of the score function at the MLE. The eigenvalues of the Hessian matrix are also reported to confirm that the actual MLE is obtained. A sample output of the program is given in the Appendix. Numerical examples show that the MLE of the (log) normal population parameters can be obtained easily and quickly using the R program. In each of the four worked examples, it took only a few seconds to get the MLE. Owing to the special feature of left-censored data, the mean and standard deviation of the complete measurements are good initial values to start the iteration. The numerical examples show that fewer than 100 iterations are needed to achieve the precision limit (10⁻⁸ in our worked examples).
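The diagnostics described above (elapsed time, convergence status, iteration count, score at the MLE, and Hessian eigenvalues) can be sketched in Python as follows. Newton-Raphson is used here for brevity, whereas the paper's R program offers several maximization algorithms, so the iteration count need not match the Appendix. The tallied encoding of the Example 5.4 data, the function names, and the finite-difference Hessian are assumptions made for illustration.

```python
import math
import time
from statistics import NormalDist

STD = NormalDist()

# Example 5.4 tallied as {value: count}; an assumed encoding of the listing.
DETECTED = {1: 5, 2: 21, 3: 6, 4: 3, 5: 3, 7: 3, 8: 1, 9: 1,
            10: 1, 11: 1, 12: 1, 16: 1, 20: 1}
NONDETECT = {1: 4, 5: 8, 10: 3, 20: 2}   # {detection limit: count}
obs = [math.log(v) for v, k in DETECTED.items() for _ in range(k)]
lim = [math.log(d) for d, k in NONDETECT.items() for _ in range(k)]

def score(mu, sigma):
    """Gradient of the left-censored normal log-likelihood in (mu, sigma)."""
    g_mu = g_sigma = 0.0
    for y in obs:                      # detected values: normal density terms
        z = (y - mu) / sigma
        g_mu += z / sigma
        g_sigma += (z * z - 1.0) / sigma
    for c in lim:                      # nondetects: log Phi((c - mu) / sigma)
        a = (c - mu) / sigma
        lam = STD.pdf(a) / STD.cdf(a)
        g_mu -= lam / sigma
        g_sigma -= a * lam / sigma
    return g_mu, g_sigma

def hessian(mu, sigma, h=1e-5):
    """Central differences of the score, symmetrised off the diagonal."""
    gp, gm = score(mu + h, sigma), score(mu - h, sigma)
    hp, hm = score(mu, sigma + h), score(mu, sigma - h)
    h11 = (gp[0] - gm[0]) / (2 * h)
    h22 = (hp[1] - hm[1]) / (2 * h)
    h12 = ((gp[1] - gm[1]) + (hp[0] - hm[0])) / (4 * h)
    return h11, h12, h22

def newton(mu, sigma, tol=1e-8, max_iter=100):
    """Return (mu, sigma, iterations, status); status 0 means converged."""
    for it in range(1, max_iter + 1):
        g1, g2 = score(mu, sigma)
        h11, h12, h22 = hessian(mu, sigma)
        det = h11 * h22 - h12 * h12
        d_mu = -(h22 * g1 - h12 * g2) / det
        d_sigma = -(h11 * g2 - h12 * g1) / det
        while sigma + d_sigma <= 0:    # halve the step to keep sigma positive
            d_mu, d_sigma = d_mu / 2, d_sigma / 2
        mu, sigma = mu + d_mu, sigma + d_sigma
        if max(abs(d_mu), abs(d_sigma)) < tol:
            return mu, sigma, it, 0
    return mu, sigma, max_iter, 1

# Initial values: mean and SD of the detected log-concentrations.
mu0 = sum(obs) / len(obs)
sigma0 = math.sqrt(sum((y - mu0) ** 2 for y in obs) / (len(obs) - 1))

start = time.perf_counter()
mu_hat, sigma_hat, n_iter, status = newton(mu0, sigma0)
elapsed = time.perf_counter() - start

h11, h12, h22 = hessian(mu_hat, sigma_hat)
centre = 0.5 * (h11 + h22)
radius = math.hypot(0.5 * (h11 - h22), h12)
eigenvalues = (centre + radius, centre - radius)
det = h11 * h22 - h12 * h12
# Standard errors from the inverse of the negative Hessian.
se_mu, se_sigma = math.sqrt(-h22 / det), math.sqrt(-h11 / det)

print(f"time {elapsed:.4f}s, status {status}, iterations {n_iter}")
print("score at MLE:", score(mu_hat, sigma_hat))
print("Hessian eigenvalues:", eigenvalues)
print("MLE:", (mu_hat, sigma_hat), "SE:", (se_mu, se_sigma))
```

With these assumptions, the estimates and standard errors should agree closely with the normal-parameter row of Table 4. Both Hessian eigenvalues should be negative, which is the condition the reported eigenvalues are meant to confirm.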


ACKNOWLEDGEMENT

The author thanks two anonymous reviewers and the editor-in-chief for their constructive comments which improved the presentation.

REFERENCES

Broyden CG. 1970. The convergence of a class of double-rank minimization algorithms. I: General considerations. Journal of Applied Mathematics 6: 76–90.

Cohen AC. 1957. On the solution of estimating equations for truncated and censored samples from normal populations. Biometrika 44: 225–236.

Cohen AC. 1959. Simplified estimators for the normal distribution when samples are singly censored or truncated. Technometrics 1: 217–237.

Cohen AC. 1991. Truncated and Censored Samples: Theory and Applications. Vol. 119 of Statistics: Textbooks and Monographs. Marcel Dekker Inc.: New York.

Davison AC, Hinkley DV. 1997. Bootstrap Methods and Their Application. Vol. 1 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press: Cambridge.

Dempster AP, Laird NM, Rubin DB. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1): 1–38. With discussion.

Deverel SJ, Millard SP. 1988. Distribution and mobility of selenium and other trace elements in shallow ground water of the western San Joaquin Valley, California. Environmental Science & Technology 22: 697–702.

Efron B. 1979. Bootstrap methods: another look at the jackknife. Annals of Statistics 7(1): 1–26.

Efron B. 1981a. Censored data and the bootstrap. Journal of the American Statistical Association 76(374): 312–319.

Efron B. 1981b. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika 68(3): 589–599.

Efron B. 1985. Bootstrap confidence intervals for a class of parametric problems. Biometrika 72(1): 45–58.

Efron B, Tibshirani RJ. 1993. An Introduction to the Bootstrap. Vol. 57 of Monographs on Statistics and Applied Probability. Chapman and Hall: New York.

El-Shaarawi AH. 1989. Inferences about the mean from censored water quality data. Water Resources Research 25(4): 685–690.

El-Shaarawi AH, Esterby SR. 1992. Replacement of censored observations by a constant: an evaluation. Water Research 26(6): 835–844.

Fletcher R. 1970. A new approach to variable metric algorithms. The Computer Journal 13(3): 317–322.

Fletcher R, Reeves CM. 1964. Function minimization by conjugate gradients. The Computer Journal 7(2): 149–154.

Gilliom R, Helsel D. 1986. Estimation of distributional parameters for censored trace level water quality data 1. Estimation techniques. Water Resources Research 22(2): 135–146.

Goldfarb D. 1970. A family of variable-metric methods derived by variational means. Mathematics of Computation 24: 23–26.

Gupta AK. 1952. Estimation of the mean and standard deviation of a normal population from a censored sample. Biometrika 39: 260–273.

Helsel D. 1990. Less than obvious: statistical treatment of data below the detection limit. Environmental Science and Technology 24(12): 1767–1774.

Helsel D. 2005. Nondetects and Data Analysis: Statistics for Censored Environmental Data. Wiley: New York.

Helsel D, Gilliom R. 1986. Estimation of distributional parameters for censored trace level water quality data 2. Verification and applications. Water Resources Research 22(2): 147–155.

Millard SP, Deverel SJ. 1988. Nonparametric statistical methods for comparing two sites based on data with multiple nondetect limits. Water Resources Research 24(12): 2087–2098.

Nelder JA, Mead R. 1965. A simplex method for function minimization. The Computer Journal 7(4): 308–313.

Persson T, Rootzen H. 1977. Simple and highly efficient estimators for a type I censored normal sample. Biometrika 64: 123–128.

Shanno DF. 1970. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation 24: 647–656.

Shumway RH, Azari AS, Johnson P. 1989. Estimating mean concentration under transformation for environmental data with detection limits. Technometrics 31: 347–356.

Singh A, Nocerino J. 2002. Robust estimation of mean and variance using environmental data sets with below detection limit observations. Chemometrics and Intelligent Laboratory Systems 60(1): 69–86.

Tiku ML. 1967. Estimating the mean and standard deviation from a censored normal sample. Biometrika 54: 155–165.

U.S. EPA. 1992. Statistical Analysis of Ground Water Monitoring Data at RCRA Facilities, Addendum to Interim Final Guidance, Office of Solid Waste, Waste Management Division, U.S. EPA.

Yuan PT. 1933. On the logarithmic frequency distribution and semi-logarithmic correlation surface. Annals of Mathematical Statistics 4: 30–74.


APPENDIX

A sample output of the R program

> Asymp.CI(ex5.4.datamatrix, 0.05, logNorm=TRUE)
===== This is a multiply censoring problem =====

Real CPU time used to get MLE: 4.92 seconds
Convergence status (0 indicates convergence): 0
The number iterations: 48
The left hand of score equations at MLEs: -1.106093e-06 -3.621999e-07
Eigenvalues of the Hessian at MLE: -82.97483 -159.2731

=== Summarized Statistics for Normal Parameters ==
The MLE of (mu, stdev) is: (0.944206, 0.8005244)
The SE of the MLEs are (0.1086561, 0.0807726);
95% C.I. of mean: (0.7312439, 1.157168);
95% C.I. of SD: (0.642213, 0.9588357);

===== Summarized Statistics for Log Normal Parameters =====
The MLE of (mu.ln, stdev.ln) is: (3.541767, 3.356416)
The SE the MLE.ln: (0.4208793, 0.7239804)
95% C.I. of log mean: (2.716858, 4.366675);
95% C.I. of log SD: (1.937441, 4.775392);
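The intervals in the sample output are consistent with the usual asymptotic (Wald) form, estimate ± z × SE, with z the 0.975 standard normal quantile. As a hedged check, the Python sketch below recomputes them from the MLEs and standard errors printed above; `wald_interval` is a hypothetical helper and not part of the author's program.

```python
from statistics import NormalDist

def wald_interval(estimate, se, alpha=0.05):
    """Two-sided asymptotic interval: estimate -/+ z_{1 - alpha/2} * se."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return estimate - z * se, estimate + z * se

# (MLE, standard error) pairs copied from the sample output.
reported = {
    "mean": (0.944206, 0.1086561),
    "SD": (0.8005244, 0.0807726),
    "log mean": (3.541767, 0.4208793),
    "log SD": (3.356416, 0.7239804),
}

intervals = {k: wald_interval(est, se) for k, (est, se) in reported.items()}
for name, (lo, hi) in intervals.items():
    print(f"95% C.I. of {name}: ({lo:.6f}, {hi:.6f})")
```

Because the inputs are the rounded values from the printout, the recomputed limits can differ from the reported ones in the last displayed digit.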
