
Memo Regarding ACS Confidence Interval With Response
Andrew A. Beveridge, Ph.D.


IMPACT OF MISCOMPUTED STANDARD ERRORS AND FLAWED DATA RELEASE POLICIES ON THE AMERICAN COMMUNITY SURVEY

TO: ROBERT GROVES, DIRECTOR

FROM: ANDREW A. BEVERIDGE, PROFESSOR OF SOCIOLOGY, QUEENS COLLEGE AND GRADUATE CENTER CUNY AND CONSULTANT TO THE NEW YORK TIMES FOR CENSUS AND OTHER DEMOGRAPHIC ANALYSES*

SUBJECT: IMPACT OF MISCOMPUTED STANDARD ERRORS AND FLAWED DATA RELEASE POLICIES ON THE AMERICAN COMMUNITY SURVEY

DATE: 1/24/2011

CC: RODERICK LITTLE, ASSOCIATE DIRECTOR; CATHY MCCULLY, CHIEF, U.S. CENSUS BUREAU REDISTRICTING DATA OFFICE

This memo is to alert the Bureau to two problems with the American Community Survey as released that degrade the utility of the data through confusion and filtering (not releasing all data available for a specific geography). The resultant errors could reflect poorly on the Bureau’s ability to conduct the American Community Survey and could potentially undercut support for that effort. The problems are as follows:

1. The Standard Errors, and the Margins of Error (MOE) based upon them, reported for data in the American Community Survey are miscomputed. In situations where the usual approximation used to compute standard errors is inappropriate (generally where a given cell in a table makes up only a small part or a very large portion of the total), it is used anyway. As a result, the MOEs in the standard reports imply "negative" numbers of people for many cells.

2. The rules for not releasing data about specific fields or variables in the one-year and three-year files (so-called "filtering") in a given table seriously undercut the value of the one-year and three-year ACS for many communities and areas throughout the United States by depriving them of valuable data. (The filtering is also based in part on the miscomputed MOEs.)

This memo discusses the issues underlying each problem. Attached to this memo are a number of items to support and illustrate my points.

A. MISCOMPUTED STANDARD ERRORS

Background

The standard errors and MOEs reported with the American Community Survey are miscomputed in certain situations and lead to erroneous results. When a very small or a very large proportion of a given table falls in a specific category, the standard errors are computed improperly. Indeed, there are a massive number of instances where the Margins of Error (MOEs) include zero or negative counts of persons with given characteristics. This happens even though individuals with the specific characteristic (e.g., Non-Hispanic Asian) were found in the sample. Obviously, if one finds any individuals of a given category in a given area, the MOE should never be negative or include zero for that category. If it does, then the method by which it was computed should be considered incorrect. For instance, in the case of non-Hispanic Asians in counties, in some 742 counties where the point estimate of the number of non-Hispanic Asians is greater than zero, the published MOE implies that the actual number in those counties could be negative, and in 40 additional cases the MOE includes zero. Similarly, for the 419 counties where the point estimate is zero, the published MOE implies that there could be "negative" or "minus" non-Hispanic Asians in the county. (See Appendix 1 for these and other results. Here I have only shared results from Table B03002. Many other tables exhibit similar issues.) This problem permeates the data released from the 2005-2009 ACS and has contributed to filtering issues in the three-year and one-year files, which are discussed later in this memo.

* I write as a friend of the Census Bureau. Since 1993 I have served as consultant to the New York Times with regard to the Census and other demographic matters. I use Census and ACS data in my research and have testified using Census data numerous times in court.

Issues with the ACS Approach to Computing Standard Error

This problem appears to be directly related to the method by which standard errors and MOEs are computed for the American Community Survey, especially for situations where a given data field constitutes a very large or very small proportion of a specific table. Chapter 12 of the ACS Design and Methodology describes in some detail the methods used to compute standard errors in the ACS (see Appendix 2). It appears that the problems with using the normal (Wald) approximation in situations with large or small proportions are simply ignored. This is especially surprising given that the problems of using a normal approximation to the binomial distribution under these conditions are well known. I have attached an article that surveys some of these issues. (See Appendix 3, Lawrence D. Brown, T. Tony Cai and Anirban DasGupta, "Interval Estimation for a Binomial Proportion," Statistical Science 16:2 (2001): pp. 101-133.) The article also includes responses from several researchers who have tackled this problem. Indeed, even Wikipedia has references to this problem in its entry regarding the binomial proportion. Much of the literature makes the point that even in conditions where the issue is not a large or small proportion, the normal approximation may result in inaccuracies. Some standard statistical packages (e.g., SAS, SUDAAN) have implemented several of the suggested remedies. Indeed, some software programs even warn users when the Wald or normal approximation should not be used.

It is true that on page 12-6 of ACS Design and Methodology the following statement seems to indicate that the Census Bureau recognizes the problem (similar statements are in other documents regarding the ACS):

Users are cautioned to consider ‘‘logical’’ boundaries when creating confidence bounds from the margins of error. For example, a small population estimate may have a calculated lower bound less than zero. A negative number of people does not make sense, so the lower bound should be set to zero instead. Likewise, bounds for percents should not go below zero percent or above 100 percent.

In fact, this means that it was obvious to the Bureau that the method for computing standard errors and MOEs resulted in errors in certain situations, but nothing, so far, has been done to remedy the problem. Furthermore, in American FactFinder, MOEs that imply negative numbers of people abound in areas where respondents with the characteristic were found. (See Appendix 4 for an example.) It is equally problematic to have the MOE include zero when the sample found individuals or households with a given characteristic. I should note that exact confidence intervals for binomial proportions, and appropriate approximations under these conditions, are asymmetric and never include zero or become negative. Therefore, the ACS's articulated (but not implemented) MOE remedy of setting the lower bound to zero would not be a satisfactory solution. Rather, the method of computation must be changed to reflect the accepted statistical literature and the practice of major statistical packages, and to guard against producing data that are illogical and thus likely to be misunderstood and criticized.
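To make the contrast concrete, the short calculation below compares a Wald (normal-approximation) interval with a Wilson (score) interval for a small estimated proportion. The sample counts and area population are invented, simple random sampling is assumed, and the ACS design effect and replicate-weight machinery are ignored; the point is only to show why a symmetric Wald-style MOE can dip below zero while a score-type interval stays positive and asymmetric.

    import math

    def wald_interval(x, n, z=1.645):
        """Normal-approximation (Wald) interval for a binomial proportion.
        z = 1.645 corresponds to the 90 percent level used for ACS MOEs."""
        p = x / n
        se = math.sqrt(p * (1 - p) / n)
        return p - z * se, p + z * se

    def wilson_interval(x, n, z=1.645):
        """Wilson (score) interval: asymmetric, never below 0 or above 1."""
        p = x / n
        denom = 1 + z ** 2 / n
        center = (p + z ** 2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
        return center - half, center + half

    # Hypothetical small area: 2 of 1,000 sampled persons have the characteristic,
    # and the area holds roughly 20,000 people (also hypothetical).
    x, n, population = 2, 1000, 20000
    for name, func in [("Wald", wald_interval), ("Wilson", wilson_interval)]:
        lo, hi = func(x, n)
        print(f"{name:6s} proportion CI ({lo:+.5f}, {hi:+.5f}); "
              f"implied count CI ({lo * population:+.0f}, {hi * population:+.0f})")

In this hypothetical case the Wald lower bound is negative, so the implied count interval includes "minus people," while the Wilson interval stays above zero.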


Consequences of the Miscomputation

There are serious consequences stemming from this miscomputation:

1. The usefulness, accuracy and reliability of ACS data may be thrown into question. With the advent of the ACS 2005-2009 data, many of the former uses of the Census long form will now depend upon the ACS five-year file, while others could be based upon the one-year and three-year files. One of these uses is in the contentious process of redistricting Congress, state legislatures and local legislative bodies. In any area where the Voting Rights Act provisions are used to assess potential districting plans, Citizen Voting Age Population (CVAP) numbers need to be computed for various cognizable groups, including Hispanics, Asian and Pacific Islanders, non-Hispanic whites and non-Hispanic blacks, etc. Since citizenship data is not collected in the decennial census, such data, which used to come from the long form, must now be produced from the 2005-2009 ACS. In 2000 the Census Bureau created a special tabulation (STP76) based upon the long form to report CVAP by group, to help in the computation of the so-called "effective majority." I understand that the Department of Justice has requested a similar file based upon the 2005-2009 ACS. Unless the standard error for the ACS is correctly computed, I envision that during the very contentious and litigious process of redistricting, someone could allege that the ACS is completely without value for estimating CVAP due to its flawed standard errors and MOEs, and therefore is not useful for redistricting. I am sure that somewhere a redistricting expert would testify to that position if the current standard error and MOE computation procedure were maintained. This could lead to a serious public relations and political problem, not only for the ACS, but for census data generally.

When a redistricting expert suggested this exact scenario to me, I became convinced that I should formally bring these issues to the Bureau's attention. Given the fraught nature of the politics over the ACS and census data generally, I think it is especially important that the massive number of MOEs that include "minus and zero people" be addressed immediately by choosing an appropriate method to compute MOEs and implementing it in the circumstances discussed above.

I should note that in addition to redistricting, the ACS data will be used for the following:

1. Equal employment monitoring using an Equal Employment Opportunity Commission File, which includes education and occupation data, so it must be based upon the ACS.

2. Housing availability and affordability studies using a file created for the Department of Housing and Urban Development (HUD) called the "Comprehensive Housing Affordability Strategy" (CHAS) file. This file also requires ACS data since it includes several housing characteristics and cost items.

3. Language assistance determinations for voting by the Department of Justice, which rely on another data file that can only be created from the ACS because it is based upon the reported language spoken at home.

Because the ACS provides richer data on a wider variety of topics than the decennial census, there are a multitude of other uses for the ACS in planning transportation, school enrollment, districting and more where the MOEs would be a significant issue. In short, the miscomputed MOEs are likely to cause local and state government officials, private researchers, and courts of law to use the ACS less effectively and less extensively, or to stop using it altogether.


B. FLAWED RELEASE POLICIES FOR DATA TABLES IN THE ONE-YEAR AND THREE-YEAR ACS FILES

Background

Having one-year and three-year data available makes the ACS potentially much more valuable than the infrequent Census long form. However, much of that added value has been undercut by the miscomputation of MOEs coupled with the disclosure rules for data in the one-year and three-year ACS files. This has meant that important data for many communities and areas have been "filtered," that is, not released. The general rule, of course, is that data are released for areas with a population of 65,000 or greater for the one-year file and for areas with a population of 20,000 or greater for the three-year files. However, the implementation of the so-called "filtering rule," used to prevent unreliable statistical data from being released, has had the practical effect of blocking the release of much important data.

As Chapter 13 of the ACS Design and Methodology states:

The main data release rule for the ACS tables works as follows. Every detailed table consists of a series of estimates. Each estimate is subject to sampling variability that can be summarized by its standard error. If more than half of the estimates in the table are not statistically different from 0 (at a 90 percent confidence level), then the table fails. Dividing the standard error by the estimate yields the coefficient of variation (CV) for each estimate. (If the estimate is 0, a CV of 100 percent is assigned.) To implement this requirement for each table at a given geographic area, CVs are calculated for each table’s estimates, and the median CV value is determined. If the median CV value for the table is less than or equal to 61 percent, the table passes for that geographic area and is published; if it is greater than 61 percent, the table fails and is not published. (Chapter 13 of the ACS Design and Methodology, p. 13-7, see Appendix 5.)
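To trace the mechanics, the sketch below applies the rule quoted above to a single table for a single geographic area: compute a CV for each estimate (assigning 100 percent when the estimate is zero), take the median, and publish the table only if the median CV is at most 61 percent. The estimates and standard errors in the example are invented for illustration.

    import statistics

    def table_passes_filter(estimates_and_ses, cv_cutoff=0.61):
        """Apply the median-CV release ('filtering') rule quoted above to one
        table for one geographic area."""
        cvs = []
        for estimate, se in estimates_and_ses:
            if estimate == 0:
                cvs.append(1.0)            # a CV of 100 percent is assigned to zero estimates
            else:
                cvs.append(se / estimate)  # coefficient of variation
        return statistics.median(cvs) <= cv_cutoff

    # Invented example: a fairly homogeneous area, where several cells are zero or tiny.
    table = [(52000, 900), (180, 150), (0, 120), (0, 120), (35, 40), (0, 120), (60, 55)]
    print(table_passes_filter(table))      # False: the whole table would be withheld

Note that the large, precisely estimated cell does not save the table; the zero and near-zero cells dominate the median CV, which is exactly the pattern criticized in the next section.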

Negative Consequences of the Filtering Rules

These rules on their face are flawed, since they make the release of data about non-Hispanic blacks or non-Hispanic whites contingent on the presence of members of other groups living in the same area, such as Hispanic Native Americans or Alaskan Natives, Hispanic Asians, or Hispanic Hawaiian Natives or other Pacific Islanders. Therefore, for areas populated by just a few groups, the Bureau will not release the data about any group for that area. Thus, many cities, counties and other areas will not have important data about their population composition reported because they do not have enough different population groups living within the specified area. Furthermore, using a CV based upon the miscomputed standard error (especially where few or no individuals in the area fall in a given category) increases the likelihood of reporting a high CV, so even more areas will not have some data released. (As noted in the section discussing the miscomputed standard errors, the computation of the standard error, and thus of the CV, is flawed.)

To demonstrate the impact of this rule, I assessed the "filtering" for table B03002 (Hispanic status by race) in the 2009 one-year release, using Public Use Microdata Areas (PUMAs). PUMAs are areas that were designed to be used with the Public Use Microdata files and are required to include at least 100,000 people. Due to the Bureau's release rules, the only geographic levels released for the one-year and three-year files that provide complete coverage of the United States are the nation, the states, and PUMAs. Using PUMAs, it is easy to understand the impact of the "filtering" rules, since every PUMA received data in 2009.

For the Race and Hispanic status table (B03002), the table elements include the total population; the total non-Hispanic population; the non-Hispanic population for the following groups: White; Black; American Indian or Alaska Native; Asian; Native Hawaiian or other Pacific Islander; some other race; two or more races; two races including some other race; and two races excluding some other race, and three or more races; and the total Hispanic population, followed by the population for each of the same groups listed above but for those who are Hispanic, such as Hispanic Asian, Hispanic Native Hawaiian and other Pacific Islander, as well as Hispanic White and Hispanic Black. The table contains some 21 items, of which 3 are subtotals of other items. (See Appendix 2 for an example.)

Plainly, many of the cells in this table are likely to be zero or near zero for many geographies within the United States. For that reason, it is not surprising that 1,274 of 2,101 PUMAs (well more than half) were “filtered” for the 2009 one-year file. Those not filtered are likely to be urban areas with a substantial degree of population diversity, while those filtered are often the opposite. The map presented on the final page of this memo shows graphically which PUMAs had table B03002 filtered. This particular example shows where the proportion of the population that was non-Hispanic white was not revealed in this table.

Looking at the map, the problem becomes clear. Vast portions of the United States are filtered, as represented by the green cross-hatch pattern. Parts of North Dakota and Maine, for example, do not have a report on the number of non-Hispanic whites. In a similar vein, there is at least one PUMA in New York City whose population is so heavily concentrated among non-Hispanic African Americans that B03002 is "filtered."

Conclusion and Recommendations

Based upon this analysis, I would make three recommendations to the Bureau:

1) That a User Note immediately be drafted indicating that there is a problem with the MOEs (and the standard errors used to compute them) in the ACS for certain situations, and that the Bureau will recompute them for the ACS 2005-2009 file on an expedited basis. We are basically one month from the start of redistricting, with Virginia and New Jersey required to redraw their lines by early summer. Both states have substantial numbers of foreign-born individuals, many of whom are Hispanic, so the Citizen Voting Age Population (CVAP) by group calculations are very important and depend on credible ACS data.

2) That the MOEs for the 2005-2009 ACS (as well as all of the one-year and three-year files already issued) be recomputed using a method that takes into account the issues surrounding the "binomial proportion," and that the ACS 2005-2009 data be re-released with these new MOE files. This should be done almost immediately for the more recent files and as soon as possible for the others.

3) That the Bureau adopt a new release policy based not upon whole tables (the specifications of which are in any event arbitrary), but rather upon specific variables or table cells. In this way, the release of important data would not be subject to miscomputed standard errors, or to the size or estimated variability of other variables or cells. It also may make sense to dispense with "filtering" altogether. This policy should then be applied to previously released data, and the data should be re-released.

The problems identified in this memo are causing serious issues regarding the ACS’s use in a wide variety of settings, and could seriously threaten the viability of this irreplaceable and vital data source if they became widely known. As a Census and ACS user, I truly hope that steps can be taken to remedy these problems swiftly.


MAP OF THE LOWER 48 STATES SHOWING NON-RELEASE (“FILTERING”) OF DATA FOR RACE AND HISPANIC STATUS BY PUMAS. GREEN CROSS-HATCHES INDICATE NON-RELEASE


Andrew A. Beveridge <[email protected]>  Tue, Feb 8, 2011 at 8:44 AM
To: [email protected], David Rindskopf <[email protected]>
Cc: [email protected], [email protected], "sharon.m.stern" <[email protected]>
Bcc: [email protected], Matthew Ericson <[email protected]>

Dear Rod:

It was good to talk to you yesterday along with David Rindskopf regarding the issues raised in my memo regarding MOE and filtering in the ACS data.

We understand the fact that recomputing and re-releasing the confidence intervals would be a very time consuming process. Nonetheless, I do believe that some sort of statement regarding the MOE's is necessary, particularly around negative and zero values for groups that were found in the sample, as well as the intervals for situations where no subjects were found.

I attach to this memo an example table from the CVAP data released on Friday. This is for a block group in Brooklyn, and as you can see the point estimates for seven of the 11 groups are zero, and the MOE's are 123 in every case. These groups include several that have very low frequencies in Brooklyn (e.g., American Indian or Alaskan Native). In the litigation context, it would be very easy to make such numbers look absurd. I still remain concerned that this could easily harm the Bureau's credibility, which would be very damaging given the context of support for Bureau activities.

I also agree that doing some research into what would be an appropriate way to view and compute confidence intervals for the ACS, possibly including reference to the Decennial Census results and also contemplating the importance of various spatial distributions, would ultimately be very helpful. At the same time, it is nonetheless the fact that at least two of the major statistical package vendors have implemented procedures which take into account the issue of high or low proportions in estimating confidence intervals from survey data.


I attach the write-up on confidence intervals for SAS's SURVEYFREQ, one of a set of procedures that SAS has developed to handle the analysis of data drawn from complex samples. They have the following option for the computation of confidence limits: "PROC SURVEYFREQ also provides the PSMALL option, which uses the alternative confidence limit type for extreme (small or large) proportions and uses the Wald confidence limits for all other proportions (not extreme)." (See Attachment)

At the same time, later yesterday afternoon I was speaking with a very well known lawyer who works on redistricting issues. He had looked at the release of the CVAP data at the block-group level and basically concluded that it was effectively worthless. Part of this, I think, is due to the fact that the data, such as that in the attachment, do not look reliable on their face.

If nothing is done, it will be very difficult to defend these data in court or use them to help in drawing districts. I remain convinced that there also could be serious collateral damage to the Census Bureau in general. I for one hope that this does not occur. I look forward to your response to the issues I raised.

Very truly yours,

Andy

--
Andrew A. Beveridge
Prof of Sociology Queens College and Grad Ctr CUNY
Chair Queens College Sociology Dept
Office: 718-997-2848
Email: [email protected]
Powdermaker Hall
65-30 Kissena Blvd
Flushing, NY 11367-1597
www.socialexplorer.com

Attachment_2_8_2011.doc (196K)


Attachment to Email to Rod Little, 2/8/2011

Block Group 2, Census Tract 505, Kings County, New York

LNTITLE                                                          LNNUMBER  CVAP_EST  CVAP_MOE
Total                                                                   1       625       241
Not Hispanic or Latino                                                  2       195       124
American Indian or Alaska Native Alone                                  3         0       123
Asian Alone                                                             4        35        55
Black or African American Alone                                         5        45        51
Native Hawaiian or Other Pacific Islander Alone                         6         0       123
White Alone                                                             7       120        80
American Indian or Alaska Native and White                              8         0       123
Asian and White                                                         9         0       123
Black or African American and White                                    10         0       123
American Indian or Alaska Native and Black or African American         11         0       123
Remainder of Two or More Race Responses                                12         0       123
Hispanic or Latino                                                     13       430       197

The SURVEYFREQ Procedure

Confidence Limits for Proportions

http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_surveyfreq_a0000000221.htm

If you specify the CL option in the TABLES statement, PROC SURVEYFREQ computes confidence limits for the proportions in the frequency and crosstabulation tables.

By default, PROC SURVEYFREQ computes Wald ("linear") confidence limits if you do not specify an alternative confidence limit type with the TYPE= option. In addition to Wald confidence limits, the following types of design-based confidence limits are available for proportions: modified Clopper-Pearson (exact), modified Wilson (score), and logit confidence limits.

PROC SURVEYFREQ also provides the PSMALL option, which uses the alternative confidence limit type for extreme (small or large) proportions and uses the Wald confidence limits for all other proportions (not extreme). For the default PSMALL= value of 0.25, the procedure computes Wald confidence limits for proportions between 0.25 and 0.75 and computes the alternative confidence limit type for proportions that are outside of this range. See Curtin et al. (2006).
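As a rough illustration of the switching behavior described above, and in Python rather than SAS, the sketch below uses Wald limits when the estimated proportion lies between 0.25 and 0.75 and falls back to an alternative interval type (here a Wilson score interval) for extreme proportions. The 0.25 threshold mirrors the default PSMALL= value; everything else is a simplification that ignores design effects and design degrees of freedom.

    import math

    def wilson_interval(p, n, z=1.645):
        """Score interval; behaves sensibly for proportions near 0 or 1."""
        denom = 1 + z ** 2 / n
        center = (p + z ** 2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
        return center - half, center + half

    def psmall_style_interval(p, n, z=1.645, psmall=0.25):
        """Wald limits for 'non-extreme' proportions, an alternative type otherwise."""
        if psmall <= p <= 1 - psmall:
            se = math.sqrt(p * (1 - p) / n)
            return p - z * se, p + z * se    # Wald ("linear") limits
        return wilson_interval(p, n, z)      # alternative limits for extreme proportions

    print(psmall_style_interval(0.50, 400))  # Wald branch
    print(psmall_style_interval(0.01, 400))  # alternative branch for an extreme proportion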


See Korn and Graubard (1999), Korn and Graubard (1998), Curtin et al. (2006), and Sukasih and Jang (2005) for details about confidence limits for proportions based on complex survey data, including comparisons of their performance. See also Brown, Cai, and DasGupta (2001), Agresti and Coull (1998) and the other references cited in the following sections for information about binomial confidence limits.

For each table request, PROC SURVEYFREQ produces a nondisplayed ODS table, "Table Summary," which contains the number of observations, strata, and clusters that are included in the analysis of the requested table. When you request confidence limits, the "Table Summary" data set also contains the degrees of freedom df and the value of the t percentile that is used to compute the confidence limits. See Example 84.3 for more information about this output data set.

Wald Confidence Limits

PROC SURVEYFREQ computes standard Wald ("linear") confidence limits for proportions by default. These confidence limits use the variance estimates that are based on the sample design. For the proportion in a table cell, the Wald confidence limits are computed as

$\hat{p} \pm t_{\alpha/2,\,df} \times SE(\hat{p})$

where $\hat{p}$ is the estimate of the proportion in the table cell, $SE(\hat{p})$ is the standard error of the estimate, and $t_{\alpha/2,\,df}$ is the 100(1 − α/2)th percentile of the t distribution with df degrees of freedom calculated as described in the section Degrees of Freedom. The confidence level is determined by the value of the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits.

The confidence limits for row proportions and column proportions are computed similarly to the confidence limits for table cell proportions.

Modified Confidence Limits

PROC SURVEYFREQ uses the modification described in Korn and Graubard (1998) to compute design-based Clopper-Pearson (exact) and Wilson (score) confidence limits. This modification substitutes the degrees-of-freedom adjusted effective sample size for the original sample size in the confidence limit computations.

The effective sample size $n_e$ is computed as

$n_e = n / \text{deff}$

where n is the original sample size (unweighted frequency) that corresponds to the total domain of the proportion estimate, and deff is the design effect.


If the proportion is computed for a table cell of a two-way table, then the domain is the two-way table, and the sample size is the frequency of the two-way table. If the proportion is a row proportion, which is based on a two-way table row, then the domain is the row, and the sample size is the frequency of the row.

The design effect for an estimate is the ratio of the actual variance (estimated based on the sample design) to the variance of a simple random sample with the same number of observations. See the section Design Effect for details about how PROC SURVEYFREQ computes the design effect.

If you do not specify the ADJUST=NO option, the procedure applies a degrees-of-freedom adjustment to the effective sample size to compute the modified sample size. If you specify ADJUST=NO, the procedure does not apply the adjustment and uses the effective sample size in the confidence limit computations.

The modified sample size is computed by applying a degrees-of-freedom adjustment to the effective sample size as

where df is the degrees of freedom and $t_{\alpha/2,\,df}$ is the 100(1 − α/2)th percentile of the t distribution with df degrees of freedom. The section Degrees of Freedom describes the computation of the degrees of freedom df, which is based on the variance estimation method and the sample design. The confidence level is determined by the value of the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits.

The design effect is usually greater than 1 for complex survey designs, and in that case the effective sample size is less than the actual sample size. If the adjusted effective sample size is greater than the actual sample size n, then the procedure truncates the adjusted effective sample size to n, as recommended by Korn and Graubard (1998). If you specify the TRUNCATE=NO option, the procedure does not truncate the value.

Modified Clopper-Pearson Confidence Limits

Clopper-Pearson (exact) confidence limits for the binomial proportion are constructed by inverting the equal-tailed test based on the binomial distribution. This method is attributed to Clopper and Pearson (1934). See Leemis and Trivedi (1996) for a derivation of the distribution expression for the confidence limits.

PROC SURVEYFREQ computes modified Clopper-Pearson confidence limits according to the approach of Korn and Graubard (1998). The degrees-of-freedom adjusted effective sample size is substituted for the sample size in the Clopper-Pearson computation, and the adjusted effective sample size times the proportion estimate is substituted for the number of positive responses. (Or if you specify the ADJUST=NO option, the procedure uses the unadjusted effective sample size instead.)

The modified Clopper-Pearson confidence limits for a proportion ( and ) are computed as

where the limits are expressed in terms of percentiles of the F distribution with the appropriate degrees of freedom, the adjusted effective sample size, and the proportion estimate.
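The sketch below shows the general Korn and Graubard idea in Python rather than SAS: estimate a design effect, replace the nominal sample size with an effective sample size, and feed the result into a Clopper-Pearson (beta-quantile) interval. The degrees-of-freedom adjustment and the exact PROC SURVEYFREQ formulas are omitted, so treat it as an approximation of the approach rather than a reproduction of the procedure.

    from scipy.stats import beta

    def modified_clopper_pearson(p_hat, se_design, n, alpha=0.10):
        """Clopper-Pearson interval using a Korn-Graubard-style effective sample size.

        p_hat     : estimated proportion from the survey
        se_design : design-based standard error of p_hat (e.g., from replicate weights)
        n         : unweighted sample size of the domain
        alpha     : 0.10 gives 90 percent limits, matching the ACS convention
        """
        # Design effect: design-based variance over the simple-random-sampling variance.
        srs_var = p_hat * (1 - p_hat) / n
        deff = (se_design ** 2) / srs_var if srs_var > 0 else 1.0
        n_eff = n / deff                   # effective sample size
        x_eff = n_eff * p_hat              # "effective" number of positive responses
        lower = 0.0 if x_eff <= 0 else beta.ppf(alpha / 2, x_eff, n_eff - x_eff + 1)
        upper = 1.0 if x_eff >= n_eff else beta.ppf(1 - alpha / 2, x_eff + 1, n_eff - x_eff)
        return lower, upper

    print(modified_clopper_pearson(p_hat=0.004, se_design=0.003, n=1200))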

Modified Wilson Confidence Limits

Wilson confidence limits for the binomial proportion are also known as score confidence limits and are attributed to Wilson (1927). The confidence limits are based on inverting the normal test that uses the null proportion in the variance (the score test). See Newcombe (1998) and Korn and Graubard (1999) for details.

PROC SURVEYFREQ computes modified Wilson confidence limits by substituting the degrees-of-freedom adjusted effective sample size for the original sample size in the standard Wilson computation. (Or if you specify the ADJUST=NO option, the procedure substitutes the unadjusted effective sample size.)

The modified Wilson confidence limits for a proportion are computed from the standard Wilson (score) formula, with the adjusted effective sample size and the estimate of the proportion substituted for the sample size and sample proportion; the percentile used in the computation depends on whether the degrees-of-freedom adjusted effective sample size or the unadjusted effective sample size (requested with the ADJUST=NO option) is used. See Curtin et al. (2006) for details.

Logit Confidence Limits

If you specify the TYPE=LOGIT option, PROC SURVEYFREQ computes logit confidence limits for proportions. See Agresti (2002) and Korn and Graubard (1998) for more information.

Logit confidence limits for proportions are based on the logit transformation $y = \log\big(\hat{p} / (1 - \hat{p})\big)$. The lower and upper logit confidence limits for the proportion are obtained by applying the inverse transformation $e^{y}/(1 + e^{y})$ to

$y_{L,U} = y \mp t_{\alpha/2,\,df} \times \frac{SE(\hat{p})}{\hat{p}(1 - \hat{p})}$

where $\hat{p}$ is the estimate of the proportion, $SE(\hat{p})$ is the standard error of the estimate, and $t_{\alpha/2,\,df}$ is the 100(1 − α/2)th percentile of the t distribution with df degrees of freedom. The degrees of freedom are calculated as described in the section Degrees of Freedom. The confidence level is determined by the value of the ALPHA= option, which by default equals 0.05 and produces 95% confidence limits.


[email protected] <[email protected]>  Tue, Feb 8, 2011 at 9:22 AM
To: "Andrew A. Beveridge" <[email protected]>
Cc: [email protected], [email protected], David Rindskopf <[email protected]>, [email protected], "sharon.m.stern" <[email protected]>

Andy and David, thanks for this. The SAS appendix looks interesting and seems to be a good start on the problem. I think I have a better idea of the main concerns from our conversation, and will convey this to the ACS team as we consider next steps. Best, Rod

From: "Andrew A. Beveridge" <[email protected]>To: [email protected], David Rindskopf <[email protected]>Cc: [email protected], [email protected], "sharon.m.stern" <[email protected]>Date: 02/08/2011 08:45 AMSubject: Confidence Limits for Small and Large Proportion in ACS DataSent by: [email protected]


[attachment "Attachment_2_8_2011.doc" deleted by Roderick Little/DIR/HQ/BOC]


13.8 IMPORTANT NOTES ON MULTIYEAR ESTIMATES

While the types of data products for the multiyear estimates are almost entirely identical to those used for the 1-year estimates, there are several distinctive features of the multiyear estimates that data users must bear in mind.

First, the geographic boundaries that are used for multiyear estimates are always the boundary as of January 1 of the final year of the period. Therefore, if a geographic area has gained or lost territory during the multiyear period, this practice can have a bearing on the user's interpretation of the estimates for that geographic area.

Secondly, for multiyear period estimates based on monetary characteristics (for example, median earnings), inflation factors are applied to the data to create estimates that reflect the dollar values in the final year of the multiyear period.

Finally, although the Census Bureau tries to minimize the changes to the ACS questionnaire, these changes will occur from time to time. Changes to a question can result in the inability to build certain estimates for a multiyear period containing the year in which the question was changed. In addition, if a new question is introduced during the multiyear period, it may be impossible to make estimates of characteristics related to the new question for the multiyear period.

13.9 CUSTOM DATA PRODUCTS

The Census Bureau offers a wide variety of general-purpose data products from the ACS designed to meet the needs of the majority of data users. They contain predefined sets of data for standard census geographic areas. For users whose data needs are not met by the general-purpose products, the Census Bureau offers customized special tabulations on a cost-reimbursable basis through the ACS custom tabulation program. Custom tabulations are created by tabulating data from ACS edited and weighted data files. These projects vary in size, complexity, and cost, depending on the needs of the sponsoring client.

Each custom tabulation request is reviewed in advance by the DRB to ensure that confidentiality is protected. The requestor may be required to modify the original request to meet disclosure avoidance requirements. For more detailed information on the ACS Custom Tabulations program, go to <http://www.census.gov/acs/www/Products/spec_tabs/index.htm>.


Appendix 1. Tabulation of MOEs, Including Those That Include Zero or Negative Cases for Table B03002 (Race by Hispanic Status)

Columns: Frequency, Percent, Cumulative Frequency, Cumulative Percent

B03002_1_neg -- Neg MOE vs EST: Total population
  MOE Not Reported             3,098   96.18   3,098    96.18
  MOE Less than Estimate         123    3.82   3,221   100.00

B03002_2_neg -- Neg MOE vs EST: Not Hispanic or Latino
  MOE Not Reported             2,503   77.71   2,503    77.71
  MOE Greater than Estimate        2    0.06   2,505    77.77
  MOE Less than Estimate         715   22.20   3,220    99.97
  Est Zero, MOE Positive           1    0.03   3,221   100.00

B03002_3_neg -- Neg MOE vs EST: Not Hispanic or Latino: White alone
  MOE Greater than Estimate        4    0.12       4     0.12
  MOE Less than Estimate       3,215   99.81   3,219    99.94
  Est Zero, MOE Positive           2    0.06   3,221   100.00

B03002_4_neg -- Neg MOE vs EST: Not Hispanic or Latino: Black or African American alone
  MOE Equals Estimate             24    0.75      24     0.75
  MOE Greater than Estimate      429   13.32     453    14.06
  MOE Less than Estimate       2,473   76.78   2,926    90.84
  Est Zero, MOE Positive         295    9.16   3,221   100.00

B03002_5_neg -- Neg MOE vs EST: Not Hispanic or Latino: American Indian and Alaska Native alone
  MOE Equals Estimate             35    1.09      35     1.09
  MOE Greater than Estimate      665   20.65     700    21.73
  MOE Less than Estimate       2,133   66.22   2,833    87.95
  Est Zero, MOE Positive         388   12.05   3,221   100.00

B03002_6_neg -- Neg MOE vs EST: Not Hispanic or Latino: Asian alone
  MOE Equals Estimate             40    1.24      40     1.24
  MOE Greater than Estimate      742   23.04     782    24.28
  MOE Less than Estimate       2,020   62.71   2,802    86.99
  Est Zero, MOE Positive         419   13.01   3,221   100.00

B03002_7_neg -- Neg MOE vs EST: Not Hispanic or Latino: Native Hawaiian and Other Pacific Islander alone
  MOE Equals Estimate             21    0.65      21     0.65
  MOE Greater than Estimate      893   27.72     914    28.38
  MOE Less than Estimate         419   13.01   1,333    41.38
  Est Zero, MOE Positive       1,888   58.62   3,221   100.00

B03002_8_neg -- Neg MOE vs EST: Not Hispanic or Latino: Some other race alone
  MOE Equals Estimate             28    0.87      28     0.87
  MOE Greater than Estimate    1,082   33.59   1,110    34.46
  MOE Less than Estimate         817   25.36   1,927    59.83
  Est Zero, MOE Positive       1,294   40.17   3,221   100.00

B03002_9_neg -- Neg MOE vs EST: Not Hispanic or Latino: Two or more races
  MOE Equals Estimate             34    1.06      34     1.06
  MOE Greater than Estimate      339   10.52     373    11.58
  MOE Less than Estimate       2,706   84.01   3,079    95.59
  Est Zero, MOE Positive         142    4.41   3,221   100.00

B03002_10_neg -- Neg MOE vs EST: Not Hispanic or Latino: Two or more races: Two races including Some other race
  MOE Equals Estimate             25    0.78      25     0.78
  MOE Greater than Estimate    1,019   31.64   1,044    32.41
  MOE Less than Estimate         505   15.68   1,549    48.09
  Est Zero, MOE Positive       1,672   51.91   3,221   100.00

B03002_11_neg -- Neg MOE vs EST: Not Hispanic or Latino: Two or more races: Two races excluding Some other race, and three or more races
  MOE Equals Estimate             32    0.99      32     0.99
  MOE Greater than Estimate      354   10.99     386    11.98
  MOE Less than Estimate       2,684   83.33   3,070    95.31
  Est Zero, MOE Positive         151    4.69   3,221   100.00

B03002_12_neg -- Neg MOE vs EST: Hispanic or Latino
  MOE Not Reported             2,503   77.71   2,503    77.71
  MOE Equals Estimate             12    0.37   2,515    78.08
  MOE Greater than Estimate      187    5.81   2,702    83.89
  MOE Less than Estimate         476   14.78   3,178    98.67
  Est Zero, MOE Positive          43    1.33   3,221   100.00

B03002_13_neg -- Neg MOE vs EST: Hispanic or Latino: White alone
  MOE Equals Estimate             21    0.65      21     0.65
  MOE Greater than Estimate      342   10.62     363    11.27
  MOE Less than Estimate       2,757   85.59   3,120    96.86
  Est Zero, MOE Positive         101    3.14   3,221   100.00

B03002_14_neg -- Neg MOE vs EST: Hispanic or Latino: Black or African American alone
  MOE Equals Estimate             23    0.71      23     0.71
  MOE Greater than Estimate      883   27.41     906    28.13
  MOE Less than Estimate         788   24.46   1,694    52.59
  Est Zero, MOE Positive       1,527   47.41   3,221   100.00

B03002_15_neg -- Neg MOE vs EST: Hispanic or Latino: American Indian and Alaska Native alone
  MOE Equals Estimate             34    1.06      34     1.06
  MOE Greater than Estimate    1,062   32.97   1,096    34.03
  MOE Less than Estimate         646   20.06   1,742    54.08
  Est Zero, MOE Positive       1,479   45.92   3,221   100.00

B03002_16_neg -- Neg MOE vs EST: Hispanic or Latino: Asian alone
  MOE Equals Estimate             11    0.34      11     0.34
  MOE Greater than Estimate      557   17.29     568    17.63
  MOE Less than Estimate         274    8.51     842    26.14
  Est Zero, MOE Positive       2,379   73.86   3,221   100.00

B03002_17_neg -- Neg MOE vs EST: Hispanic or Latino: Native Hawaiian and Other Pacific Islander alone
  MOE Equals Estimate              6    0.19       6     0.19
  MOE Greater than Estimate      363   11.27     369    11.46
  MOE Less than Estimate          62    1.92     431    13.38
  Est Zero, MOE Positive       2,790   86.62   3,221   100.00

B03002_18_neg -- Neg MOE vs EST: Hispanic or Latino: Some other race alone
  MOE Equals Estimate             24    0.75      24     0.75
  MOE Greater than Estimate      581   18.04     605    18.78
  MOE Less than Estimate       2,320   72.03   2,925    90.81
  Est Zero, MOE Positive         296    9.19   3,221   100.00

B03002_19_neg -- Neg MOE vs EST: Hispanic or Latino: Two or more races
  MOE Equals Estimate             40    1.24      40     1.24
  MOE Greater than Estimate      886   27.51     926    28.75
  MOE Less than Estimate       1,573   48.84   2,499    77.58
  Est Zero, MOE Positive         722   22.42   3,221   100.00

B03002_20_neg -- Neg MOE vs EST: Hispanic or Latino: Two or more races: Two races including Some other race
  MOE Equals Estimate             36    1.12      36     1.12
  MOE Greater than Estimate      975   30.27   1,011    31.39
  MOE Less than Estimate       1,263   39.21   2,274    70.60
  Est Zero, MOE Positive         947   29.40   3,221   100.00

B03002_21_neg -- Neg MOE vs EST: Hispanic or Latino: Two or more races: Two races excluding Some other race, and three or more races
  MOE Equals Estimate             32    0.99      32     0.99
  MOE Greater than Estimate    1,023   31.76   1,055    32.75
  MOE Less than Estimate         906   28.13   1,961    60.88
  Est Zero, MOE Positive       1,260   39.12   3,221   100.00


Appendix 2.


ACS Design and Methodology (Ch. 12 Revised 12/2010), U.S. Census Bureau

Chapter 12. Variance Estimation

12.1 OVERVIEW

Sampling error is the uncertainty associated with an estimate that is based on data gathered from a sample of the population rather than the full population. Note that sample-based estimates will vary depending on the particular sample selected from the population. Measures of the magnitude of sampling error, such as the variance and the standard error (the square root of the variance), reflect the variation in the estimates over all possible samples that could have been selected from the population using the same sampling methodology.

The American Community Survey (ACS) is committed to providing its users with measures of sampling error along with each published estimate. To accomplish this, all published ACS estimates are accompanied either by 90 percent margins of error or confidence intervals, both based on ACS direct variance estimates. Due to the complexity of the sampling design and the weighting adjustments performed on the ACS sample, unbiased design-based variance estimators do not exist. As a consequence, the direct variance estimates are computed using a replication method that repeats the estimation procedures independently several times. The variance of the full sample is then estimated by using the variability across the resulting replicate estimates. Although the variance estimates calculated using this procedure are not completely unbiased, the current method produces variances that are accurate enough for analysis of the ACS data.

For Public Use Microdata Sample (PUMS) data users, replicate weights are provided to approximate standard errors for the PUMS-tabulated estimates. Design factors are also provided with the PUMS data, so PUMS data users can compute standard errors of their statistics using either the replication method or the design factor method.

12.2 VARIANCE ESTIMATION FOR ACS HOUSING UNIT AND PERSON ESTIMATES

Unbiased estimates of variances for ACS estimates do not exist because of the systematic sample design, as well as the ratio adjustments used in estimation. As an alternative, ACS implements a replication method for variance estimation. An advantage of this method is that the variance estimates can be computed without consideration of the form of the statistics or the complexity of the sampling or weighting procedures, such as those being used by the ACS.

The ACS employs the Successive Differences Replication (SDR) method (Wolter, 1984; Fay & Train, 1995; Judkins, 1990) to produce variance estimates. It has been the method used to calculate ACS estimates of variances since the start of the survey. The SDR was designed to be used with systematic samples for which the sort order of the sample is informative, as in the case of the ACS’s geographic sort. Applications of this method were developed to produce estimates of variances for the Current Population Survey (U.S. Census Bureau, 2006) and Census 2000 Long Form estimates (Gbur & Fairchild, 2002).

In the SDR method, the first step in creating a variance estimate is constructing the replicate factors. Replicate base weights are then calculated by multiplying the base weight for each housing unit (HU) by the factors. The weighting process then is rerun, using each set of replicate base weights in turn, to create final replicate weights. Replicate estimates are created by using the same estimation method as the original estimate, but applying each set of replicate weights instead of the original weights. Finally, the replicate and original estimates are used to compute the variance estimate based on the variability between the replicate estimates and the full sample estimate.


The following steps produce the ACS direct variance estimates:

1. Compute replicate factors.

2. Compute replicate weights.

3. Compute variance estimates.

Replicate Factors

Computation of replicate factors begins with the selection of a Hadamard matrix of order R (a multiple of 4), where R is the number of replicates. A Hadamard matrix H is a k-by-k matrix with all entries either 1 or −1, such that H'H = kI (that is, the columns are orthogonal). For ACS, the number of replicates is 80 (R = 80). Each of the 80 columns represents one replicate.

Next, a pair of rows in the Hadamard matrix is assigned to each record (HU or group quarters (GQ) person). An algorithm is used to assign two rows of an 80×80 Hadamard matrix to each HU. The ACS uses a repeating sequence of 780 pairs of rows in the Hadamard matrix to assign rows to each record, in sort order (Navarro, 2001a). The assignment of Hadamard matrix rows repeats every 780 records until all records receive a pair of rows from the Hadamard matrix. The first row of the matrix, in which every cell is always equal to one, is not used.

The replicate factor for each record then is determined from these two rows of the 80×80 Hadamard matrix. For record i (i = 1,…,n, where n is sample size) and replicate r (r = 1,…,80), the replicate factor is computed as:

$f_{i,r} = 1 + 2^{-3/2} a_{R1_i,r} - 2^{-3/2} a_{R2_i,r}$

where $R1_i$ and $R2_i$ are respectively the first and second row of the Hadamard matrix assigned to the i-th HU, and $a_{R1_i,r}$ and $a_{R2_i,r}$ are respectively the matrix elements (either 1 or −1) from the Hadamard matrix in rows $R1_i$ and $R2_i$ and column r. Note that the formula for $f_{i,r}$ yields replicate factors that can take one of three approximate values: 1.7, 1.0, or 0.3. That is:

• If $a_{R1_i,r} = +1$ and $a_{R2_i,r} = +1$, the replicate factor is 1.
• If $a_{R1_i,r} = -1$ and $a_{R2_i,r} = -1$, the replicate factor is 1.
• If $a_{R1_i,r} = +1$ and $a_{R2_i,r} = -1$, the replicate factor is approximately 1.7.
• If $a_{R1_i,r} = -1$ and $a_{R2_i,r} = +1$, the replicate factor is approximately 0.3.

The expectation is that 50 percent of replicate factors will be 1, and the other 50 percent will be evenly split between 1.7 and 0.3 (Gunlicks, 1996).

The following example demonstrates the computation of replicate factors for a sample of size five, using a Hadamard matrix of order four:

Table 12.1 presents an example of a two-row assignment developed from this matrix, and the values of replicate factors for each sample unit.


Table 12.1 Example of Two-Row Assignment, Hadamard Matrix Elements, and Replicate Factors

Case #  Rows         Hadamard matrix elements (a_{R1_i,r}, a_{R2_i,r})      Approximate replicate factors
 (i)    R1_i  R2_i   Rep 1      Rep 2      Rep 3      Rep 4                 f_{i,1}  f_{i,2}  f_{i,3}  f_{i,4}
  1      2     3     (-1, +1)   (+1, +1)   (-1, -1)   (+1, -1)                0.3      1        1        1.7
  2      3     4     (+1, -1)   (+1, +1)   (-1, +1)   (-1, -1)                1.7      1        0.3      1
  3      4     2     (-1, -1)   (+1, +1)   (+1, +1)   (-1, +1)                1        1        1        0.3
  4      2     3     (-1, +1)   (+1, +1)   (-1, -1)   (+1, -1)                0.3      1        1        1.7
  5      3     4     (+1, -1)   (+1, +1)   (-1, +1)   (-1, -1)                1.7      1        0.3      1

Note that row 1 is not used. For the third case (i = 3), rows four and two of the Hadamard matrix are used to calculate the replicate factors. For the second replicate (r = 2), the replicate factor is computed using the values in the second column of rows four (+1) and two (+1) as follows: $f_{3,2} = 1 + 2^{-3/2}(+1) - 2^{-3/2}(+1) = 1$.
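A minimal sketch of the replicate-factor rule, assuming the formula as reconstructed above; the two Hadamard rows are stand-ins for the rows assigned to one housing unit, and the point is simply that the factors come out near 1.7, 1.0, or 0.3 depending on the sign pattern (here reproducing case 1 of Table 12.1).

    def replicate_factor(a_r1, a_r2):
        """SDR replicate factor for one record and one replicate, given the two
        assigned Hadamard matrix elements (each +1 or -1)."""
        return 1 + 2 ** -1.5 * a_r1 - 2 ** -1.5 * a_r2

    # Hadamard rows assigned to one housing unit (four replicates shown).
    row_r1 = [-1, +1, -1, +1]
    row_r2 = [+1, +1, -1, -1]
    factors = [round(replicate_factor(a1, a2), 1) for a1, a2 in zip(row_r1, row_r2)]
    print(factors)   # [0.3, 1.0, 1.0, 1.7], matching case 1 of Table 12.1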

Replicate Weights

Replicate weights are produced in a way similar to that used to produce full sample final weights. All of the weighting adjustment processes performed on the full sample final survey weights (such as applying noninterview adjustments and population controls) also are carried out for each replicate weight. However, collapsing patterns are retained from the full sample weighting and are not determined again for each set of replicate weights.

Before applying the weighting steps explained in Chapter 11, the replicate base weight (RBW) for replicate r is computed by multiplying the full sample base weight (BW; see Chapter 11 for the computation of this weight) by the replicate factor $f_{i,r}$; that is, $RBW_{i,r} = BW_i \times f_{i,r}$, where $RBW_{i,r}$ is the replicate base weight for the i-th HU and the r-th replicate (r = 1, …, 80).

One can elaborate on the previous example of the replicate construction using five cases and four replicates: suppose the full sample BW values are given under the second column of the following table (Table 12.2). Then, the replicate base weight values are given in columns 7−10.

Table 12.2 Example of Computation of Replicate Base Weight Factor (RBW)

Case #   BW    Approximate replicate factor            Replicate base weight
               f_{i,1}  f_{i,2}  f_{i,3}  f_{i,4}      RBW_{i,1}  RBW_{i,2}  RBW_{i,3}  RBW_{i,4}
  1      100     0.3      1        1        1.7            29        100        100        171
  2      120     1.7      1        0.3      1             205        120         35        120
  3       80     1        1        1        0.3            80         80         80         23
  4      120     0.3      1        1        1.7            35        120        120        205
  5      110     1.7      1        0.3      1             188        110         32        110

The rest of the weighting process (Chapter 11) is then applied to each replicate weight $RBW_{i,r}$, starting from the adjustment for CAPI subsampling and proceeding to the population control adjustment (raking). Basically, the weighting adjustment process is repeated independently 80 times, with $RBW_{i,r}$ used in place of $BW_i$ (as in Chapter 11). By the end of this process, 80 final replicate weights for each HU and person record are produced.

Variance Estimates

Given the replicate weights, the computation of variance for any ACS estimate is straightforward. Suppose that $\hat{\theta}$ is an ACS estimate of any type of statistic, such as a mean, total, or proportion. Let $\hat{\theta}_0$ denote the estimate computed based on the full sample weight, and $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_{80}$ denote the estimates computed based on the replicate weights. The variance of $\hat{\theta}$, $v(\hat{\theta})$, is estimated from the sum of squared differences between each replicate estimate $\hat{\theta}_r$ (r = 1, …, 80) and the full sample estimate $\hat{\theta}_0$. The formula is as follows:1

$v(\hat{\theta}) = \frac{4}{80} \sum_{r=1}^{80} (\hat{\theta}_r - \hat{\theta}_0)^2$

This equation holds for count estimates as well as any other types of estimates, including percents, ratios, and medians.
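To illustrate the formula just given, the sketch below computes an SDR variance estimate from a full-sample estimate and 80 replicate estimates; the replicate values are randomly generated placeholders rather than actual ACS replicate estimates.

    import random

    def sdr_variance(full_estimate, replicate_estimates):
        """Successive Differences Replication variance: (4/R) times the sum of
        squared differences between each replicate estimate and the full-sample
        estimate."""
        r = len(replicate_estimates)
        return (4 / r) * sum((rep - full_estimate) ** 2 for rep in replicate_estimates)

    random.seed(1)
    full = 1250.0                                            # placeholder full-sample estimate
    reps = [full + random.gauss(0, 60) for _ in range(80)]   # placeholder replicate estimates

    se = sdr_variance(full, reps) ** 0.5
    print(f"SE = {se:.1f}, 90 percent MOE = {1.645 * se:.1f}")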

There are certain cases, however, where this formula does not apply. The first and most important cases are estimates that are “controlled” to population totals and have their standard errors set to zero. These are estimates that are forced to equal intercensal estimates during the weighting process’s raking step—for example, total population and collapsed age, sex, and Hispanic origin estimates for weighting areas. Although race is included in the raking procedure, race group estimates are not controlled; the categories used in the weighting process (see Chapter 11) do not match the published tabulation groups because of multiple race responses and the “Some Other Race” category. Information on the final collapsing of the person post-stratification cells is passed from the weighting to the variance estimation process in order to identify estimates that are controlled. This identification is done independently for all weighting areas and then is applied to the geographic areas used for tabulation. Standard errors for those estimates are set to zero, and published margins of error are set to “*****” (with an appropriate accompanying footnote).

Another special case deals with zero-estimated counts of people, households, or HUs. A direct application of the replicate variance formula leads to a zero standard error for a zero-estimated count. However, there may be people, households, or HUs with that characteristic in that area that were not selected to be in the ACS sample, but a different sample might have selected them, so a zero standard error is not appropriate. For these cases, the following model-based estimation of standard error was implemented.

For ACS data in a census year, the ACS zero-estimated counts (for characteristics included in the 100 percent census ("short form") count) can be checked against the corresponding census counts. At least 90 percent of the census counts for the ACS zero-estimated counts should be within a 90 percent confidence interval based on the modeled standard error.²

Then, the 90 percent upper bound for the zero estimate is set equal to the census count:

1.645 × SE(0) = census count

Let the variance of the estimate be modeled as some multiple (K) of the average final weight (for a state or the nation); that is:

Var(0) = K × (average final weight)

Solving for K yields:

K = (census count / 1.645)² / (average final weight)

K was computed for all ACS zero-estimated counts from 2000 that matched Census 2000 100 percent counts, and the 90th percentile of those Ks was determined. Based on the Census 2000 data, a value of K = 400 is used (Navarro, 2001b). Because this modeling method requires census counts, the value of 400 can next be updated using the 2010 Census and 2010 ACS data.

For publication, the standard error (SE) of a zero-estimated count is computed as:

SE(0) = √(K × average final weight) = √(400 × average final weight)
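A small sketch of this calibration and of the published standard error for a zero-estimated count, assuming only the relationships written out above (90 percent multiplier 1.645, variance modeled as K times the average final weight, K fixed at 400); the example weight is hypothetical:

```python
import math

Z90 = 1.645  # 90 percent confidence multiplier used by the ACS

def k_for_zero_estimate(census_count, avg_final_weight):
    """Solve Var(0) = K * avg_final_weight, given that the 90 percent upper
    bound of the zero estimate (1.645 * SE) is set equal to the census count."""
    se = census_count / Z90
    return se ** 2 / avg_final_weight

def se_zero_count(avg_final_weight, k=400.0):
    """Published SE of a zero-estimated count with the calibrated K
    (400, based on Census 2000; Navarro, 2001b)."""
    return math.sqrt(k * avg_final_weight)

# Hypothetical example: an average final weight of 60 for a state.
print(round(se_zero_count(60.0), 1))  # ~154.9
```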

1 A general replication-based variance formula can be expressed as Var(θ̂) = Σ_{r=1}^{R} c_r (θ̂_r − θ̂_0)², where c_r is the multiplier for the r-th replicate determined by the replication method. For the SDR method, the value of c_r is 4 / R, where R is the number of replicates (Fay & Train, 1995).

2 This modeling was done only once, in 2001, prior to the publication of the 2000 ACS data.


The average weights (the maximum of the average housing unit and average person final weights) are calculated at the state and national level for each ACS single-year or multiyear data release. Estimates for geographic areas within a state use that state’s average weight, and estimates for geographic areas that cross state boundaries use the national average weight.

Finally, a similar method is used to produce an approximate standard error for both ACS zero and 100 percent estimates. We do not produce approximate standard errors for other zero estimates, such as ratios or medians.

Variance Estimation for Multiyear ACS Estimates – Finite Population Correction Factor

Through the 2008 and 2006-2008 data products, the same variance estimation methodology described above was implemented for both the 1-year and 3-year products. No changes to the methodology were necessary to accommodate multiple years of sample data. However, beginning with the 2007-2009 and 2005-2009 data products, the ACS incorporated a finite population correction (FPC) factor into the 3-year and 5-year variance estimation procedures.

The Census 2000 long form, as noted above, used the same SDR variance estimation methodology as the ACS currently does. The long form methodology also included an FPC factor in its calculation. One-year ACS samples are not large enough for an FPC to have much impact on variances. However, with 5-year ACS estimates, up to 50 percent of housing units in certain blocks may have been in sample over the 5-year period. Applying an FPC factor to multi-year ACS replicate estimates will enable a more accurate estimate of the variance, particularly for small areas. It was decided to apply the FPC adjustment to 3-year and 5-year ACS products, but not to 1-year products.

The ACS FPC factor is applied in the creation of the replicate factors. Generically, the FPC depends on n, the unweighted sample size, and N, the unweighted universe size. The ACS uses two separate FPC factors: one for HUs responding by mail or telephone, and a second for HUs responding via personal visit follow-up.

The FPC is typically applied as a multiplicative factor “outside” the variance formula. However, under certain simplifying assumptions, the variance using the replicate factors after applying the FPC factor is equal to the original variance multiplied by the FPC factor. This method allows a direct application of the FPC to each housing unit’s or person’s set of replicate weights, and a seamless incorporation into the ACS’s current variance production methodology, rather than having to keep track of multiplicative factors when tabulating across areas of different sampling rates.
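The exact adjustment formula is not reproduced here. The sketch below assumes the common convention FPC = 1 − n/N, applied to the replicate-factor deviations as f* = 1 + √(1 − n/N) × (f − 1), which is consistent with the variance-multiplication property described above but should be checked against the chapter's formula; the example sampling counts are made up.

```python
import numpy as np

def apply_fpc_to_replicate_factors(f, n_sample, n_universe):
    """Scale replicate-factor deviations from 1 by sqrt(1 - n/N), so that the
    resulting replicate variance equals the original variance times (1 - n/N).
    (Assumed form; see the note above.)"""
    fpc = 1.0 - n_sample / n_universe
    return 1.0 + np.sqrt(fpc) * (np.asarray(f, dtype=float) - 1.0)

# Example: a tract where 40 of 400 addresses were in sample (n/N = 0.10).
f = np.array([1 - 2 ** -0.5, 1.0, 1 + 2 ** -0.5])
print(apply_fpc_to_replicate_factors(f, 40, 400))
```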

The adjusted replicate factors are used to create replicate base weights, and ultimately final replicate weights. It is expected that the improvement in the variance estimate will carry through the weighting and will be seen when the final weights are used.

The ACS FPC factor could be applied at any geographic level. Since the ACS sampling rates are determined at the small area level (mainly census tracts and governmental units), a low level of geography was desirable. At higher levels, the high sampling rates in specific blocks would likely be masked by the lower rates in surrounding blocks. For that reason, the factors are applied at the census tract level.

Group quarters persons do not have an FPC factor applied to their replicate factors.

12.3 MARGIN OF ERROR AND CONFIDENCE INTERVAL

Once the standard errors have been computed, margins of error and/or confidence bounds are produced for each estimate. These are the measures of overall sampling error presented along with each published ACS estimate. All published ACS margins of error and the lower and upper


bounds of confidence intervals presented in the ACS data products are based on a 90 percent confidence level, which is the Census Bureau's standard (U.S. Census Bureau, 2010b). A margin of error contains two components: the standard error of the estimate and a multiplication factor based on a chosen confidence level. For the 90 percent confidence level, the multiplication factor used by the ACS is 1.645. The margin of error of an estimate θ̂ can be computed as:

MOE(θ̂) = 1.645 × SE(θ̂)

where SE(θ̂) is the standard error of the estimate θ̂. Given this margin of error, the 90 percent confidence interval can be computed as:

( θ̂ − MOE(θ̂), θ̂ + MOE(θ̂) )

That is, the lower bound of the confidence interval is [θ̂ − MOE(θ̂)], and the upper bound is [θ̂ + MOE(θ̂)]. Roughly speaking, this interval is a range that will contain the "full population value" of the estimated characteristic with a known probability.

Users are cautioned to consider ‘‘logical’’ boundaries when creating confidence bounds from the margins of error. For example, a small population estimate may have a calculated lower bound less than zero. A negative number of people does not make sense, so the lower bound should be set to zero instead. Likewise, bounds for percents should not go below zero percent or above 100 percent. For other characteristics, like income, negative values may be legitimate.

Given the confidence bounds, a margin of error can be computed as the difference between an estimate and its upper or lower confidence bound:

MOE(θ̂) = upper bound − θ̂ = θ̂ − lower bound

Using the margin of error (as published or calculated from the bounds), the standard error is obtained as follows:

SE(θ̂) = MOE(θ̂) / 1.645
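A minimal sketch of these conversions between standard errors, margins of error, and 90 percent bounds (the example values are made up):

```python
Z90 = 1.645  # 90 percent multiplier used for published ACS margins of error

def margin_of_error(se, z=Z90):
    return z * se

def confidence_interval(estimate, se, z=Z90, lower_limit=None):
    """90 percent CI; optionally clip the lower bound at a logical limit
    (e.g., 0 for counts), as recommended above."""
    moe = margin_of_error(se, z)
    lo, hi = estimate - moe, estimate + moe
    if lower_limit is not None:
        lo = max(lo, lower_limit)
    return lo, hi

def se_from_moe(moe, z=Z90):
    return moe / z

# Example: an estimate of 150 people with a published MOE of 90.
print(confidence_interval(150, se_from_moe(90), lower_limit=0))  # (60.0, 240.0)
```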

For ranking tables and comparison profiles, the ACS provides an indicator of whether two estimates, Est_1 and Est_2, are statistically significantly different at the 90 percent confidence level. That determination is made by initially calculating:

Z = (Est_1 − Est_2) / √( SE(Est_1)² + SE(Est_2)² )

If Z < −1.645 or Z > 1.645, the difference between the estimates is significant at the 90 percent level. Determinations of statistical significance are made using unrounded values of the standard errors, so users may not be able to reproduce the same result using standard errors derived from the rounded estimates and margins of error as published. Only pairwise tests are used to determine significance in the ranking tables; no multiple comparison methods are used.
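A sketch of that pairwise test, written as the usual two-sample Z statistic implied above (the example estimates and standard errors are made up):

```python
import math

def significantly_different(est1, se1, est2, se2, z=1.645):
    """Pairwise 90 percent significance test between two ACS estimates."""
    zstat = (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return abs(zstat) > z, zstat

print(significantly_different(1200, 150, 900, 140))  # (False, ~1.46)
```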

12.4 VARIANCE ESTIMATION FOR THE PUMS

The Census Bureau cannot possibly predict all combinations of estimates and geography that may be of interest to data users. Data users can download PUMS files and tabulate the data to create estimates of their own choosing. Because the ACS PUMS contains only a subset of the full ACS sample, estimates from the ACS PUMS file will often be different from the published ACS estimates that are based on the full ACS sample.

Users of the ACS PUMS files can compute the estimated variances of their statistics using one of two options: (1) the replication method using replicate weights released with the PUMS data, and (2) the design factor method.


PUMS Replicate Variances

For the replicate method, direct variance estimates based on the SDR formula described in Section 12.2 can be computed. Users simply tabulate 80 replicate estimates in addition to their desired estimate by using the provided 80 replicate weights, and then apply the variance formula:

Var(θ̂) = (4/80) × Σ_{r=1}^{80} (θ̂_r − θ̂_0)²

PUMS Design Factor Variances

Similar to methods used to calculate standard errors for PUMS data from Census 2000, the ACS PUMS provides tables of design factors for various topics, such as age for persons or tenure for HUs. For example, the 2009 ACS PUMS design factors are published at the national and state levels (U.S. Census Bureau, 2010a) and were calculated using 2009 ACS data. PUMS design factors are updated periodically, but not necessarily on an annual basis. The design factor approach was developed from a model that uses the standard error from a simple random sample as the base and then inflates it to account for the increase in variance caused by the complex sample design. Standard errors for almost all counts and proportions of persons, households, and HUs are approximated using design factors. For 1-year ACS PUMS files beginning with 2005, use:

SE(Ŷ) = DF × √( 99 × Ŷ × (1 − Ŷ/N) )

for a total, and

SE(p̂) = DF × √( (99/B) × p̂ × (100 − p̂) )

for a percent, where:

Ŷ = the estimate of a total or count.

p̂ = the estimate of a percent.

DF = the appropriate design factor based on the topic of the estimate.

N = the total for the geographic area of interest (if the estimate is of HUs, the number of HUs is used; if the estimate is of families or households, the number of households is used; otherwise the number of persons is used as N).

B = the denominator (base) of the percent.

The value 99 in the formulas is the 1-year PUMS FPC factor, which is computed as (100 − ƒ) / ƒ, where ƒ (given as a percent) is the sampling rate for the PUMS data. Since the PUMS is approximately a 1 percent sample of HUs, (100 − ƒ) / ƒ = (100 − 1)/1 = 99.

For 3-year PUMS files beginning with 2005−2007, the 3 years’ worth of data represent approximately a 3 percent sample of HUs. Hence, the 3-year PUMS FPC factor is (100 − ƒ) / ƒ = (100 − 3) / 3 = 97 / 3. To calculate standard errors from 3-year PUMS data, substitute 97 / 3 for 99 in the above formulas.

Similarly, 5-year PUMS files, beginning with 2005-2009, represent approximately a 5 percent sample of HUs. So, the 5-year PUMS FPC is 95 / 5 = 19, which can be substituted for 99 in the above formulas.
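A sketch of the design-factor approximation, assuming the simple-random-sampling forms shown above; substitute 97/3 or 19 for the FPC constant for the 3-year and 5-year files. The example counts and design factor are hypothetical.

```python
import math

def se_total(y_hat, n_total, df, fpc=99):
    """Approximate SE of a PUMS total: DF * sqrt(FPC * Y * (1 - Y/N))."""
    return df * math.sqrt(fpc * y_hat * (1 - y_hat / n_total))

def se_percent(p_hat, base, df, fpc=99):
    """Approximate SE of a PUMS percent (p_hat in percent, B = base of the percent)."""
    return df * math.sqrt((fpc / base) * p_hat * (100 - p_hat))

# Hypothetical example: 25,000 persons with a characteristic out of N = 500,000
# persons in the area, with a topic design factor of 1.3.
print(round(se_total(25_000, 500_000, df=1.3)))
print(round(se_percent(5.0, base=500_000, df=1.3), 2))
```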

The design factor (DF) is defined as the ratio of the standard error of an estimated parameter (computed under the replication method described in Section 12.2) to the standard error based on a simple random sample of the same size. The DF reflects the effect of the actual sample design and estimation procedures used for the ACS. The DF for each topic was computed by modeling


the relationship between the standard error under the replication method (RSE) and the standard error based on a simple random sample (SRSSE); that is, RSE = DF × SRSSE, where, for a total Ŷ, the SRSSE is computed as:

SRSSE = √( 39 × Ŷ × (1 − Ŷ/N) )

The value 39 in the formula above is the FPC factor based on an approximate sampling fraction of 2.5 percent in the ACS; that is, (100 − 2.5) / 2.5 = 97.5 / 2.5 = 39.

The value of DF is obtained by fitting the no-intercept regression model RSE = DF × SRSSE using pairs of standard errors (RSE, SRSSE) for various published table estimates at the national and state levels. The values of DFs by topic can be obtained from the "PUMS Accuracy of the Data" statement that is published with each PUMS file; for example, the 2009 1-year PUMS DFs can be found in U.S. Census Bureau (2010a). The documentation also provides examples of how to use the design factors to compute standard errors for estimates of totals, means, medians, proportions or percentages, ratios, sums, and differences.
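A small sketch of the no-intercept fit described above: the DF is the least-squares slope of RSE on SRSSE through the origin. The example standard errors are made up.

```python
import numpy as np

def fit_design_factor(rse, srsse):
    """No-intercept regression RSE = DF * SRSSE:
    DF = sum(RSE * SRSSE) / sum(SRSSE**2)."""
    rse, srsse = np.asarray(rse, float), np.asarray(srsse, float)
    return np.sum(rse * srsse) / np.sum(srsse ** 2)

# Hypothetical replicate-based and SRS-based SEs for several table estimates.
rse = [420.0, 1310.0, 260.0, 980.0]
srsse = [300.0, 1000.0, 200.0, 750.0]
print(round(fit_design_factor(rse, srsse), 2))  # ~1.31
```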

The topics for the 2009 PUMS design factors are, for the most part, the same ones that were available for the Census 2000 PUMS. When using the design factor approach, we recommend that users apply the largest DF among the characteristics whenever an estimate combines two or more characteristics. The only exception is items crossed with race or Hispanic origin; for these items, the largest DF is used after removing the race and Hispanic origin DFs from consideration.

12.5 REFERENCES

Fay, R., & Train, G. (1995). Aspects of Survey and Model Based Postcensal Estimation of Income and Poverty Characteristics for States and Counties. Joint Statistical Meetings: Proceedings of the Section on Government Statistics (pp. 154-159). Alexandria, VA: American Statistical Association: http://www.census.gov/did/www/saipe/publications/files/FayTrain95.pdf

Gbur, P., & Fairchild, L. (2002). Overview of the U.S. Census 2000 Long Form Direct Variance Estimation. Joint Statistical Meetings: Proceedings of the Section on Survey Research Methods (pp. 1139-1144). Alexandria, VA: American Statistical Association.

Gunlicks, C. (1996). 1990 Replicate Variance System (VAR90-20). Washington, DC: U.S. Census Bureau.

Judkins, D. R. (1990). Fay's Method for Variance Estimation. Journal of Official Statistics, 6(3), 223-239.

Navarro, A. (2001a). 2000 American Community Survey Comparison County Replicate Factors. American Community Survey Variance Memorandum Series #ACS-V-01. Washington, DC: U.S. Census Bureau.

Navarro, A. (2001b). Estimating Standard Errors of Zero Estimates. Washington, DC: U.S. Census Bureau.

U.S. Census Bureau. (2006). Current Population Survey: Technical Paper 66—Design and Methodology. Retrieved from U.S. Census Bureau: http://www.census.gov/prod/2006pubs/tp-66.pdf

U.S. Census Bureau. (2010a). PUMS Accuracy of the Data (2009). Retrieved from U.S. Census Bureau: http://www.census.gov/acs/www/Downloads/data_documentation/pums/Accuracy/2009AccuracyPUMS.pdf

U.S. Census Bureau. (2010b). Statistical Quality Standard E2: Reporting Results. Washington, DC: U.S. Census Bureau: http://www.census.gov/quality/standards/standarde2.html


Wolter, K. M. (1984). An Investigation of Some Estimators of Variance for Systematic Sampling. Journal of the American Statistical Association, 79, 781-790.


Appendix 3.


Statistical Science, 2001, Vol. 16, No. 2, 101-133

Interval Estimation for a Binomial Proportion

Lawrence D. Brown, T. Tony Cai and Anirban DasGupta

Lawrence D. Brown is Professor of Statistics, The Wharton School, University of Pennsylvania, 3000 Steinberg Hall-Dietrich Hall, 3620 Locust Walk, Philadelphia, Pennsylvania 19104-6302. T. Tony Cai is Assistant Professor of Statistics, The Wharton School, University of Pennsylvania, 3000 Steinberg Hall-Dietrich Hall, 3620 Locust Walk, Philadelphia, Pennsylvania 19104-6302. Anirban DasGupta is Professor, Department of Statistics, Purdue University, 1399 Mathematical Science Bldg., West Lafayette, Indiana 47907-1399.

Abstract. We revisit the problem of interval estimation of a binomial proportion. The erratic behavior of the coverage probability of the standard Wald confidence interval has previously been remarked on in the literature (Blyth and Still, Agresti and Coull, Santner and others). We begin by showing that the chaotic coverage properties of the Wald interval are far more persistent than is appreciated. Furthermore, common textbook prescriptions regarding its safety are misleading and defective in several respects and cannot be trusted.

This leads us to consideration of alternative intervals. A number of natural alternatives are presented, each with its motivation and context. Each interval is examined for its coverage probability and its length. Based on this analysis, we recommend the Wilson interval or the equal-tailed Jeffreys prior interval for small n and the interval suggested in Agresti and Coull for larger n. We also provide an additional frequentist justification for use of the Jeffreys interval.

Key words and phrases: Bayes, binomial distribution, confidence intervals, coverage probability, Edgeworth expansion, expected length, Jeffreys prior, normal approximation, posterior.

1. INTRODUCTION

This article revisits one of the most basic and methodologically important problems in statistical practice, namely, interval estimation of the probability of success in a binomial distribution. There is a textbook confidence interval for this problem that has acquired nearly universal acceptance in practice. The interval, of course, is p̂ ± z_{α/2} n^{−1/2} (p̂(1 − p̂))^{1/2}, where p̂ = X/n is the sample proportion of successes, and z_{α/2} is the 100(1 − α/2)th percentile of the standard normal distribution. The interval is easy to present and motivate and easy to compute. With the exceptions of the t test, linear regression, and ANOVA, its popularity in everyday practical statistics is virtually unmatched. The standard interval is known as the Wald interval, as it comes from the Wald large-sample test for the binomial case.

So at first glance, one may think that the problem is too simple and has a clear and present solution. In fact, the problem is a difficult one, with unanticipated complexities. It is widely recognized that the actual coverage probability of the standard interval is poor for p near 0 or 1. Even at the level of introductory statistics texts, the standard interval is often presented with the caveat that it should be used only when n · min(p, 1 − p) is at least 5 (or 10). Examination of the popular texts reveals that the qualifications with which the standard interval is presented are varied, but they all reflect the concern about poor coverage when p is near the boundaries.

In a series of interesting recent articles, it has also been pointed out that the coverage properties of the standard interval can be erratically poor even if p is not near the boundaries; see, for instance, Vollset (1993), Santner (1998), Agresti and Coull (1998), and Newcombe (1998). Slightly older literature includes Ghosh (1979), Cressie (1980) and Blyth and Still (1983). Agresti and Coull (1998)


particularly consider the nominal 95% case and show the erratic and poor behavior of the standard interval's coverage probability for small n even when p is not near the boundaries. See their Figure 4 for the cases n = 5 and 10.

We will show in this article that the eccentric behavior of the standard interval's coverage probability is far deeper than has been explained or is appreciated by statisticians at large. We will show that the popular prescriptions the standard interval comes with are defective in several respects and are not to be trusted. In addition, we will motivate, present and analyze several alternatives to the standard interval for a general confidence level. We will ultimately make recommendations about choosing a specific interval for practical use, separately for different intervals of values of n. It will be seen that for small n (40 or less), our recommendation differs from the recommendation Agresti and Coull (1998) made for the nominal 95% case. To facilitate greater appreciation of the seriousness of the problem, we have kept the technical content of this article at a minimal level. The companion article, Brown, Cai and DasGupta (1999), presents the associated theoretical calculations on Edgeworth expansions of the various intervals' coverage probabilities and asymptotic expansions for their expected lengths.

In Section 2, we first present a series of examples on the degree of severity of the chaotic behavior of the standard interval's coverage probability. The chaotic behavior does not go away even when n is quite large and p is not near the boundaries. For instance, when n is 100, the actual coverage probability of the nominal 95% standard interval is 0.952 if p is 0.106, but only 0.911 if p is 0.107. The behavior of the coverage probability can be even more erratic as a function of n. If the true p is 0.5, the actual coverage of the nominal 95% interval is 0.953 at the rather small sample size n = 17, but falls to 0.919 at the much larger sample size n = 40.

This eccentric behavior can get downright extreme in certain practically important problems. For instance, consider defective proportions in industrial quality control problems. There it would be quite common to have a true p that is small. If the true p is 0.005, then the coverage probability of the nominal 95% interval increases monotonically in n all the way up to n = 591 to the level 0.945, only to drop to 0.792 if n is 592. This unlucky spell continues for a while, and then the coverage bounces back to 0.948 when n is 953, but dramatically falls to 0.852 when n is 954. Subsequent unlucky spells start off at n = 1279, 1583 and on and on. It should be widely known that the coverage of the standard interval can be significantly

lower at quite large sample sizes, and this happens in an unpredictable and rather random way.

Continuing, also in Section 2 we list a set of common prescriptions that standard texts present while discussing the standard interval. We show what the deficiencies are in some of these prescriptions. Proposition 1 and the subsequent Table 3 illustrate the defects of these common prescriptions.

In Sections 3 and 4, we present our alternative intervals. For the purpose of a sharper focus we present these alternative intervals in two categories. First we present in Section 3 a selected set of three intervals that clearly stand out in our subsequent analysis; we present them as our "recommended intervals." Separately, we present several other intervals in Section 4 that arise as clear candidates for consideration as a part of a comprehensive examination, but do not stand out in the actual analysis.

The short list of recommended intervals contains the score interval, an interval recently suggested in Agresti and Coull (1998), and the equal-tailed interval resulting from the natural noninformative Jeffreys prior for a binomial proportion. The score interval for the binomial case seems to have been introduced in Wilson (1927); so we call it the Wilson interval. Agresti and Coull (1998) suggested, for the special nominal 95% case, the interval p̃ ± z_{0.025} ñ^{−1/2} (p̃(1 − p̃))^{1/2}, where ñ = n + 4 and p̃ = (X + 2)/(n + 4); this is an adjusted Wald interval that formally adds two successes and two failures to the observed counts and then uses the standard method. Our second interval is the appropriate version of this interval for a general confidence level; we call it the Agresti–Coull interval. By a slight abuse of terminology, we call our third interval, namely the equal-tailed interval corresponding to the Jeffreys prior, the Jeffreys interval.

In Section 3, we also present our findings on the performances of our "recommended" intervals. As always, two key considerations are their coverage properties and parsimony as measured by expected length. Simplicity of presentation is also sometimes an issue, for example, in the context of classroom presentation at an elementary level. On consideration of these factors, we came to the conclusion that for small n (40 or less), we recommend that either the Wilson or the Jeffreys prior interval should be used. They are very similar, and either may be used depending on taste. The Wilson interval has a closed-form formula. The Jeffreys interval does not. One can expect that there would be resistance to using the Jeffreys interval solely due to this reason. We therefore provide a table simply listing the


limits of the Jeffreys interval for n up to 30 and in addition also give closed-form and very accurate approximations to the limits. These approximations do not need any additional software.

For larger n (n > 40), the Wilson, the Jeffreys and the Agresti–Coull intervals are all very similar, and so for such n, due to its simplest form, we come to the conclusion that the Agresti–Coull interval should be recommended. Even for smaller sample sizes, the Agresti–Coull interval is strongly preferable to the standard one and so might be the choice where simplicity is a paramount objective.

The additional intervals we considered are two slight modifications of the Wilson and the Jeffreys intervals, the Clopper–Pearson "exact" interval, the arcsine interval, the logit interval, the actual Jeffreys HPD interval and the likelihood ratio interval. The modified versions of the Wilson and the Jeffreys intervals correct disturbing downward spikes in the coverages of the original intervals very close to the two boundaries. The other alternative intervals have earned some prominence in the literature for one reason or another. We had to apply a certain amount of discretion in choosing these additional intervals as part of our investigation. Since we wish to direct the main part of our conversation to the three "recommended" intervals, only a brief summary of the performances of these additional intervals is presented along with the introduction of each interval. As part of these quick summaries, we indicate why we decided against including them among the recommended intervals.

We strongly recommend that introductory texts in statistics present one or more of these recommended alternative intervals, in preference to the standard one. The slight sacrifice in simplicity would be more than worthwhile. The conclusions we make are given additional theoretical support by the results in Brown, Cai and DasGupta (1999). Analogous results for other one-parameter discrete families are presented in Brown, Cai and DasGupta (2000).

2. THE STANDARD INTERVAL

When constructing a confidence interval we usually wish the actual coverage probability to be close to the nominal confidence level. Because of the discrete nature of the binomial distribution we cannot always achieve the exact nominal confidence level unless a randomized procedure is used. Thus our objective is to construct nonrandomized confidence intervals for p such that the coverage probability P_p(p ∈ CI) ≈ 1 − α, where α is some prespecified value between 0 and 1. We will use the notation C(p, n) = P_p(p ∈ CI), 0 < p < 1, for the coverage probability.

A standard confidence interval for p based on normal approximation has gained universal recommendation in the introductory statistics textbooks and in statistical practice. The interval is known to guarantee that for any fixed p ∈ (0, 1), C(p, n) → 1 − α as n → ∞.

Let φ(z) and Φ(z) be the standard normal density and distribution functions, respectively. Throughout the paper we denote κ ≡ z_{α/2} = Φ^{−1}(1 − α/2), p̂ = X/n and q̂ = 1 − p̂. The standard normal approximation confidence interval CI_s is given by

CI_s = p̂ ± κ n^{−1/2} (p̂ q̂)^{1/2}.  (1)

This interval is obtained by inverting the acceptance region of the well-known Wald large-sample normal test for a general problem:

| (θ̂ − θ) / ŝe(θ̂) | ≤ κ,  (2)

where θ is a generic parameter, θ̂ is the maximum likelihood estimate of θ and ŝe(θ̂) is the estimated standard error of θ̂. In the binomial case, we have θ = p, θ̂ = X/n and ŝe(θ̂) = (p̂ q̂)^{1/2} n^{−1/2}.

The standard interval is easy to calculate and is heuristically appealing. In introductory statistics texts and courses, the confidence interval CI_s is usually presented along with some heuristic justification based on the central limit theorem. Most students and users no doubt believe that the larger the number n, the better the normal approximation, and thus the closer the actual coverage would be to the nominal level 1 − α. Further, they would believe that the coverage probabilities of this method are close to the nominal value, except possibly when n is "small" or p is "near" 0 or 1. We will show how completely both of these beliefs are false. Let us take a close look at how the standard interval CI_s really performs.

2.1 Lucky n, Lucky p

An interesting phenomenon for the standard interval is that the actual coverage probability of the confidence interval contains nonnegligible oscillation as both p and n vary. There exist some "lucky" pairs (p, n) such that the actual coverage probability C(p, n) is very close to or larger than the nominal level. On the other hand, there also exist "unlucky" pairs (p, n) such that the corresponding C(p, n) is much smaller than the nominal level. The phenomenon of oscillation is both in n, for fixed p, and in p, for fixed n. Furthermore, drastic changes in coverage occur in nearby p for fixed n and in nearby n for fixed p. Let us look at five simple but instructive examples.


Fig. 1. Standard interval; oscillation phenomenon for fixed p = 0.2 and variable n = 25 to 100.

The probabilities reported in the following plots and tables, as well as those appearing later in this paper, are the result of direct probability calculations produced in S-PLUS. In all cases their numerical accuracy considerably exceeds the number of significant figures reported and/or the accuracy visually obtainable from the plots. (Plots for variable p are the probabilities for a fine grid of values of p, e.g., 2000 equally spaced values of p for the plots in Figure 5.)

Example 1. Figure 1 plots the coverage probability of the nominal 95% standard interval for p = 0.2. The number of trials n varies from 25 to 100. It is clear from the plot that the oscillation is significant and the coverage probability does not steadily get closer to the nominal confidence level as n increases. For instance, C(0.2, 30) = 0.946 and C(0.2, 98) = 0.928. So, as hard as it is to believe, the coverage probability is significantly closer to 0.95 when n = 30 than when n = 98. We see that the true coverage probability behaves contrary to conventional wisdom in a very significant way.
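The coverage numbers quoted in these examples can be checked by direct binomial calculation. A sketch (in Python rather than the authors' S-PLUS) of the exact coverage C(p, n) of the nominal 95% standard interval:

```python
import numpy as np
from scipy.stats import binom, norm

def wald_coverage(p, n, conf=0.95):
    """Exact coverage probability of the standard (Wald) interval:
    sum the Bin(n, p) pmf over the x whose interval covers p."""
    kappa = norm.ppf(1 - (1 - conf) / 2)
    x = np.arange(n + 1)
    phat = x / n
    half = kappa * np.sqrt(phat * (1 - phat) / n)
    covers = (phat - half <= p) & (p <= phat + half)
    return binom.pmf(x[covers], n, p).sum()

print(round(wald_coverage(0.2, 30), 3))   # ~0.946, as in Example 1
print(round(wald_coverage(0.2, 98), 3))   # ~0.928
```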

Example 2. Now consider the case of p = 0.5. Since p = 0.5, conventional wisdom might suggest to an unsuspecting user that all will be well if n is about 20. We evaluate the exact coverage probability of the 95% standard interval for 10 ≤ n ≤ 50. In Table 1, we list the values of "lucky" n [defined as C(p, n) ≥ 0.95] and the values of "unlucky" n [defined for specificity as C(p, n) ≤ 0.92]. The conclusions presented in Table 1 are surprising. We note that when n = 17 the coverage probability is 0.951, but the coverage probability equals 0.904 when n = 18. Indeed, the unlucky values of n arise suddenly. Although p is 0.5, the coverage is still only 0.919 at n = 40. This illustrates the inconsistency, unpredictability and poor performance of the standard interval.

Table 1. Standard interval; lucky n and unlucky n for 10 ≤ n ≤ 50 and p = 0.5

Lucky n       17      20      25      30      35      37      42      44      49
C(0.5, n)     0.951   0.959   0.957   0.957   0.959   0.953   0.956   0.951   0.956

Unlucky n     10      12      13      15      18      23      28      33      40
C(0.5, n)     0.891   0.854   0.908   0.882   0.904   0.907   0.913   0.920   0.919

Example 3. Now let us move p really close to the boundary, say p = 0.005. We mentioned in the introduction that such p are relevant in certain practical applications. Since p is so small, now one may fully expect that the coverage probability of the standard interval is poor. Figure 2 and Table 2 show that there are still surprises and indeed we now begin to see a whole new kind of erratic behavior. The oscillation of the coverage probability does not show until rather large n. Indeed, the coverage probability makes a slow ascent all the way until n = 591, and then dramatically drops to 0.792 when n = 592. Figure 2 shows that thereafter the oscillation manifests in full force, in contrast to Examples 1 and 2, where the oscillation started early on. Subsequent "unlucky" values of n again arise in the same unpredictable way, as one can see from Table 2.

2.2 Inadequate Coverage

The results in Examples 1 to 3 already show that the standard interval can have coverage noticeably smaller than its nominal value even for values of n and of np(1 − p) that are not small. This subsection contains two more examples that display further instances of the inadequacy of the standard interval.

Table 2. Standard interval; late arrival of unlucky n for small p

Unlucky n        592     954     1279    1583    1876
C(0.005, n)      0.792   0.852   0.875   0.889   0.898

Example 4. Figure 3 plots the coverage probability of the nominal 95% standard interval with fixed n = 100 and variable p. It can be seen from Figure 3 that in spite of the "large" sample size, significant change in coverage probability occurs in nearby p. The magnitude of oscillation increases significantly as p moves toward 0 or 1. Except for values of p quite near p = 0.5, the general trend of this plot is noticeably below the nominal coverage value of 0.95.

Example 5. Figure 4 shows the coverage probability of the nominal 99% standard interval with n = 20 and variable p from 0 to 1. Besides the oscillation phenomenon similar to Figure 3, a striking fact in this case is that the coverage never reaches the nominal level. The coverage probability is always smaller than 0.99, and in fact on the average the coverage is only 0.883. Our evaluations show that for all n ≤ 45, the coverage of the 99% standard interval is strictly smaller than the nominal level for all 0 < p < 1.

It is evident from the preceding presentation that the actual coverage probability of the standard interval can differ significantly from the nominal confidence level for moderate and even large sample sizes. We will later demonstrate that there are other confidence intervals that perform much better in this regard. See Figure 5 for such a comparison. The error in coverage comes from two sources: discreteness and skewness in the underlying binomial distribution. For a two-sided interval, the rounding error due to discreteness is dominant, and the error due to skewness is somewhat secondary, but still important for even moderately large n. (See Brown, Cai and DasGupta, 1999, for more details.) Note that the situation is different for one-sided intervals. There, the error caused by the skewness can be larger than the rounding error. See Hall (1982) for a detailed discussion on one-sided confidence intervals.

Fig. 2. Standard interval; oscillation in coverage for small p.

The oscillation in the coverage probability is caused by the discreteness of the binomial distribution, more precisely, the lattice structure of the binomial distribution. The noticeable oscillations are unavoidable for any nonrandomized procedure, although some of the competing procedures in Section 3 can be seen to have somewhat smaller oscillations than the standard procedure. See the text of Casella and Berger (1990) for introductory discussion of the oscillation in such a context.

The erratic and unsatisfactory coverage properties of the standard interval have often been remarked on, but curiously still do not seem to be widely appreciated among statisticians. See, for example, Ghosh (1979), Blyth and Still (1983) and Agresti and Coull (1998). Blyth and Still (1983) also show that the continuity-corrected version still has the same disadvantages.

2.3 Textbook Qualifications

The normal approximation used to justify the standard confidence interval for p can be significantly in error. The error is most evident when the true p is close to 0 or 1. See Lehmann (1999). In fact, it is easy to show that, for any fixed n, the confidence coefficient C(p, n) → 0 as p → 0 or 1. Therefore, most major problems arise as regards coverage probability when p is near the boundaries.

Fig. 3. Standard interval; oscillation phenomenon for fixed n = 100 and variable p.

Poor coverage probabilities for p near 0 or 1 are widely remarked on, and generally, in the popular texts, a brief sentence is added qualifying when to use the standard confidence interval for p. It is interesting to see what these qualifications are. A sample of 11 popular texts gives the following qualifications. The confidence interval may be used if:

1. np, n(1 − p) are ≥ 5 (or 10);
2. np(1 − p) ≥ 5 (or 10);
3. np̂, n(1 − p̂) are ≥ 5 (or 10);
4. p̂ ± 3 √(p̂(1 − p̂)/n) does not contain 0 or 1;
5. n quite large;
6. n ≥ 50 unless p is very small.

It seems clear that the authors are attempting to say that the standard interval may be used if the central limit approximation is accurate. These prescriptions are defective in several respects. In the estimation problem, (1) and (2) are not verifiable. Even when these conditions are satisfied, we see, for instance, from Table 1 in the previous section, that there is no guarantee that the true coverage probability is close to the nominal confidence level. For example, when n = 40 and p = 0.5, one has np = n(1 − p) = 20 and np(1 − p) = 10, so clearly either of the conditions (1) and (2) is satisfied. However, from Table 1, the true coverage probability in this case equals 0.919, which is certainly unsatisfactory for a confidence interval at nominal level 0.95.

Fig. 4. Coverage of the nominal 99% standard interval for fixed n = 20 and variable p.

The qualification (5) is useless and (6) is patently misleading; (3) and (4) are certainly verifiable, but they are also useless because in the context of frequentist coverage probabilities, a data-based prescription does not have a meaning. The point is that the standard interval clearly has serious problems and the influential texts caution the readers about that. However, the caution does not appear to serve its purpose, for a variety of reasons.

Here is a result that shows that sometimes the qualifications are not correct even in the limit as n → ∞.

Proposition 1. Let γ > 0. For the standard confidence interval,

lim_{n→∞} inf_{p: np, n(1−p) ≥ γ} C(p, n) ≤ P( a_γ < Poisson(γ) ≤ b_γ ),  (3)

where a_γ and b_γ are the integer parts of

( κ² + 2γ ± κ √(κ² + 4γ) ) / 2,

where the − sign goes with a_γ and the + sign with b_γ.

The proposition follows from the fact that the sequence of Bin(n, γ/n) distributions converges weakly to the Poisson(γ) distribution, and so the limit of the infimum is at most the Poisson probability in the proposition by an easy calculation.

Let us use Proposition 1 to investigate the validity of qualifications (1) and (2) in the list above. The nominal confidence level in Table 3 below is 0.95.

Table 3. Standard interval; bound (3) on limiting minimum coverage when np, n(1 − p) ≥ γ

γ                                                 5       7       10
lim_{n→∞} inf_{p: np, n(1−p) ≥ γ} C(p, n)         0.875   0.913   0.926

Fig. 5. Coverage probability for n = 50.

Table 4. Values of λ_x for the modified lower bound for the Wilson interval (see Section 4.1.1)

1 − α     x = 1    x = 2    x = 3
0.90      0.105    0.532    1.102
0.95      0.051    0.355    0.818
0.99      0.010    0.149    0.436

It is clear that qualification (1) does not work at all and (2) is marginal. There are similar problems with qualifications (3) and (4).
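A short numerical check of the Poisson bound in Proposition 1 for the γ values used in Table 3. Exact agreement with the table depends on the convention used for a_γ and b_γ, so treat this as illustrative.

```python
import math
from scipy.stats import poisson, norm

def limiting_coverage_bound(gamma, conf=0.95):
    """Bound (3): P(a < Poisson(gamma) <= b), with a and b the integer parts of
    (kappa^2 + 2*gamma -/+ kappa*sqrt(kappa^2 + 4*gamma)) / 2."""
    kappa = norm.ppf(1 - (1 - conf) / 2)
    root = kappa * math.sqrt(kappa ** 2 + 4 * gamma)
    a = int((kappa ** 2 + 2 * gamma - root) / 2)
    b = int((kappa ** 2 + 2 * gamma + root) / 2)
    return poisson.cdf(b, gamma) - poisson.cdf(a, gamma)

for g in (5, 7, 10):
    print(g, round(limiting_coverage_bound(g), 3))  # cf. Table 3
```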

3. RECOMMENDED ALTERNATIVE INTERVALS

From the evidence gathered in Section 2, it seems clear that the standard interval is just too risky. This brings us to the consideration of alternative intervals. We now analyze several such alternatives, each with its motivation. A few other intervals are also mentioned for their theoretical importance. Among these intervals we feel three stand out in their comparative performance. These are labeled separately as the "recommended intervals."

3.1 Recommended Intervals

3.1.1 The Wilson interval. An alternative to the standard interval is the confidence interval based on inverting the test in equation (2) that uses the null standard error (pq)^{1/2} n^{−1/2} instead of the estimated standard error (p̂q̂)^{1/2} n^{−1/2}. This confidence interval has the form

CI_W = (X + κ²/2)/(n + κ²) ± ( κ n^{1/2} / (n + κ²) ) × ( p̂q̂ + κ²/(4n) )^{1/2}.  (4)

This interval was apparently introduced by Wilson (1927) and we will call this interval the Wilson interval.

The Wilson interval has theoretical appeal. The interval is the inversion of the CLT approximation to the family of equal-tail tests of H_0: p = p_0. Hence, one accepts H_0 based on the CLT approximation if and only if p_0 is in this interval. As Wilson showed, the argument involves the solution of a quadratic equation; or see Tamhane and Dunlop (2000, Exercise 9.39).
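A sketch of the Wilson interval as given in (4), using a hypothetical X = 2 successes in n = 30 trials:

```python
import math
from scipy.stats import norm

def wilson_interval(x, n, conf=0.95):
    """Wilson (score) interval, equation (4): center (X + k^2/2)/(n + k^2),
    half-width (k*sqrt(n)/(n + k^2)) * sqrt(phat*qhat + k^2/(4n))."""
    k = norm.ppf(1 - (1 - conf) / 2)
    phat = x / n
    center = (x + k ** 2 / 2) / (n + k ** 2)
    half = (k * math.sqrt(n) / (n + k ** 2)) * math.sqrt(phat * (1 - phat) + k ** 2 / (4 * n))
    return center - half, center + half

print(wilson_interval(2, 30))
```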

3.1.2 The Agresti–Coull interval. The standard interval CI_s is simple and easy to remember. For the purposes of classroom presentation and use in texts, it may be nice to have an alternative that has the familiar form p̂ ± z √(p̂(1 − p̂)/n), with a better and new choice of p̂ rather than p̂ = X/n. This can be accomplished by using the center of the Wilson region in place of p̂. Denote X̃ = X + κ²/2 and ñ = n + κ². Let p̃ = X̃/ñ and q̃ = 1 − p̃. Define the confidence interval CI_AC for p by

CI_AC = p̃ ± κ (p̃q̃)^{1/2} ñ^{−1/2}.  (5)

Both the Agresti–Coull and the Wilson intervals are centered on the same value, p̃. It is easy to check that the Agresti–Coull intervals are never shorter than the Wilson intervals. For the case when α = 0.05, if we use the value 2 instead of 1.96 for κ, this interval is the "add 2 successes and 2 failures" interval in Agresti and Coull (1998). For this reason, we call it the Agresti–Coull interval. To the best of our knowledge, Samuels and Witmer (1999) is the first introductory statistics textbook that recommends the use of this interval. See Figure 5 for the coverage of this interval. See also Figure 6 for its average coverage probability.
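A sketch of the Agresti–Coull interval (5), reusing the Wilson center and the same hypothetical data as above:

```python
import math
from scipy.stats import norm

def agresti_coull_interval(x, n, conf=0.95):
    """Agresti-Coull interval, equation (5): Wald form applied to
    p_tilde = (X + k^2/2)/(n + k^2) with n_tilde = n + k^2."""
    k = norm.ppf(1 - (1 - conf) / 2)
    n_t = n + k ** 2
    p_t = (x + k ** 2 / 2) / n_t
    half = k * math.sqrt(p_t * (1 - p_t) / n_t)
    return p_t - half, p_t + half

print(agresti_coull_interval(2, 30))
```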

3.1.3 Jeffreys interval. Beta distributions are the standard conjugate priors for binomial distributions and it is quite common to use beta priors for inference on p (see Berger, 1985).

Suppose X ∼ Bin(n, p) and suppose p has a prior distribution Beta(a_1, a_2); then the posterior distribution of p is Beta(X + a_1, n − X + a_2). Thus a 100(1 − α)% equal-tailed Bayesian interval is given by

[ B(α/2; X + a_1, n − X + a_2), B(1 − α/2; X + a_1, n − X + a_2) ],

where B(α; m_1, m_2) denotes the α quantile of a Beta(m_1, m_2) distribution.

The well-known Jeffreys prior and the uniform prior are each a beta distribution. The noninformative Jeffreys prior is of particular interest to us. Historically, Bayes procedures under noninformative priors have a track record of good frequentist properties; see Wasserman (1991). In this problem the Jeffreys prior is Beta(1/2, 1/2), which has the density function

f(p) = π^{−1} p^{−1/2} (1 − p)^{−1/2}.

The 100(1 − α)% equal-tailed Jeffreys prior interval is defined as

CI_J = [ L_J(x), U_J(x) ],  (6)

where L_J(0) = 0, U_J(n) = 1 and otherwise

L_J(x) = B(α/2; X + 1/2, n − X + 1/2),  (7)
U_J(x) = B(1 − α/2; X + 1/2, n − X + 1/2).  (8)

The interval is formed by taking the central 1 − α posterior probability interval. This leaves α/2 posterior probability in each omitted tail. The exception is for x = 0, n, where the lower (upper) limits are modified to avoid the undesirable result that the coverage probability C(p, n) → 0 as p → 0 or 1.

The actual endpoints of the interval need to be numerically computed. This is very easy to do using software such as Minitab, S-PLUS or Mathematica. In Table 5 we have provided the limits for the case of the Jeffreys prior for 7 ≤ n ≤ 30.

The endpoints of the Jeffreys prior interval are the α/2 and 1 − α/2 quantiles of the Beta(x + 1/2, n − x + 1/2) distribution. The psychological resistance among some to using the interval is because of the inability to compute the endpoints at ease without software.

We provide two avenues to resolving this problem. One is Table 5 at the end of the paper. The second is a computable approximation to the limits of the Jeffreys prior interval, one that is computable with just a normal table. This approximation is obtained after some algebra from the general approximation to a Beta quantile given on page 945 of Abramowitz and Stegun (1970).

The lower limit of the 100(1 − α)% Jeffreys prior interval is approximately

(x + 1/2) / ( n + 1 + (n − x + 1/2)(e^{2ω} − 1) ),  (9)

where

ω = [ κ √( 4p̂q̂/n + (κ² − 3)/(6n²) ) ] / (4p̂q̂) + [ (1/2 − p̂)( p̂q̂(κ² + 2) − 1/n ) ] / ( 6n (p̂q̂)² ).

The upper limit may be approximated by the same expression with κ replaced by −κ in ω. The simple approximation given above is remarkably accurate. Berry (1996, page 222) suggests using a simpler normal approximation, but this will not be sufficiently accurate unless np̂(1 − p̂) is rather large.
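A sketch of the equal-tailed Jeffreys interval from (6)-(8), using exact beta quantiles rather than the closed-form approximation (9):

```python
from scipy.stats import beta

def jeffreys_interval(x, n, conf=0.95):
    """Equal-tailed Jeffreys prior interval: alpha/2 and 1 - alpha/2 quantiles
    of Beta(x + 1/2, n - x + 1/2), with L(0) = 0 and U(n) = 1."""
    a = 1 - conf
    lower = 0.0 if x == 0 else beta.ppf(a / 2, x + 0.5, n - x + 0.5)
    upper = 1.0 if x == n else beta.ppf(1 - a / 2, x + 0.5, n - x + 0.5)
    return lower, upper

print(jeffreys_interval(2, 10))   # compare with the x = 2, n = 10 entry of Table 5
```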


Table 5. 95% limits of the Jeffreys prior interval

x n = 7 n = 8 n = 9 n = 10 n = 11 n = 12

0 0 0.292 0 0.262 0 0.238 0 0.217 0 0.200 0 0.1851 0.016 0.501 0.014 0.454 0.012 0.414 0.011 0.381 0.010 0.353 0.009 0.3282 0.065 0.648 0.056 0.592 0.049 0.544 0.044 0.503 0.040 0.467 0.036 0.4363 0.139 0.766 0.119 0.705 0.104 0.652 0.093 0.606 0.084 0.565 0.076 0.5294 0.234 0.861 0.199 0.801 0.173 0.746 0.153 0.696 0.137 0.652 0.124 0.6125 0.254 0.827 0.224 0.776 0.200 0.730 0.180 0.6886 0.270 0.800 0.243 0.757

x n = 13 n = 14 n = 15 n = 16 n = 17 n = 18

0 0 0.173 0 0.162 0 0.152 0 0.143 0 0.136 0 0.1291 0.008 0.307 0.008 0.288 0.007 0.272 0.007 0.257 0.006 0.244 0.006 0.2322 0.033 0.409 0.031 0.385 0.029 0.363 0.027 0.344 0.025 0.327 0.024 0.3113 0.070 0.497 0.064 0.469 0.060 0.444 0.056 0.421 0.052 0.400 0.049 0.3814 0.114 0.577 0.105 0.545 0.097 0.517 0.091 0.491 0.085 0.467 0.080 0.4465 0.165 0.650 0.152 0.616 0.140 0.584 0.131 0.556 0.122 0.530 0.115 0.5066 0.221 0.717 0.203 0.681 0.188 0.647 0.174 0.617 0.163 0.589 0.153 0.5637 0.283 0.779 0.259 0.741 0.239 0.706 0.222 0.674 0.207 0.644 0.194 0.6178 0.294 0.761 0.272 0.728 0.254 0.697 0.237 0.6689 0.303 0.746 0.284 0.716

x n = 19 n = 20 n = 21 n = 22 n = 23 n = 24

0 0 0.122 0 0.117 0 0.112 0 0.107 0 0.102 0 0.0981 0.006 0.221 0.005 0.211 0.005 0.202 0.005 0.193 0.005 0.186 0.004 0.1792 0.022 0.297 0.021 0.284 0.020 0.272 0.019 0.261 0.018 0.251 0.018 0.2413 0.047 0.364 0.044 0.349 0.042 0.334 0.040 0.321 0.038 0.309 0.036 0.2974 0.076 0.426 0.072 0.408 0.068 0.392 0.065 0.376 0.062 0.362 0.059 0.3495 0.108 0.484 0.102 0.464 0.097 0.446 0.092 0.429 0.088 0.413 0.084 0.3986 0.144 0.539 0.136 0.517 0.129 0.497 0.123 0.478 0.117 0.461 0.112 0.4447 0.182 0.591 0.172 0.568 0.163 0.546 0.155 0.526 0.148 0.507 0.141 0.4898 0.223 0.641 0.211 0.616 0.199 0.593 0.189 0.571 0.180 0.551 0.172 0.5329 0.266 0.688 0.251 0.662 0.237 0.638 0.225 0.615 0.214 0.594 0.204 0.57410 0.312 0.734 0.293 0.707 0.277 0.681 0.263 0.657 0.250 0.635 0.238 0.61411 0.319 0.723 0.302 0.698 0.287 0.675 0.273 0.65312 0.325 0.713 0.310 0.690

x n = 25 n = 26 n = 27 n = 28 n = 29 n = 30

0 0 0.095 0 0.091 0 0.088 0 0.085 0 0.082 0 0.0801 0.004 0.172 0.004 0.166 0.004 0.160 0.004 0.155 0.004 0.150 0.004 0.1452 0.017 0.233 0.016 0.225 0.016 0.217 0.015 0.210 0.015 0.203 0.014 0.1973 0.035 0.287 0.034 0.277 0.032 0.268 0.031 0.259 0.030 0.251 0.029 0.2434 0.056 0.337 0.054 0.325 0.052 0.315 0.050 0.305 0.048 0.295 0.047 0.2865 0.081 0.384 0.077 0.371 0.074 0.359 0.072 0.348 0.069 0.337 0.067 0.3276 0.107 0.429 0.102 0.415 0.098 0.402 0.095 0.389 0.091 0.378 0.088 0.3677 0.135 0.473 0.129 0.457 0.124 0.443 0.119 0.429 0.115 0.416 0.111 0.4048 0.164 0.515 0.158 0.498 0.151 0.482 0.145 0.468 0.140 0.454 0.135 0.4419 0.195 0.555 0.187 0.537 0.180 0.521 0.172 0.505 0.166 0.490 0.160 0.47610 0.228 0.594 0.218 0.576 0.209 0.558 0.201 0.542 0.193 0.526 0.186 0.51111 0.261 0.632 0.250 0.613 0.239 0.594 0.230 0.577 0.221 0.560 0.213 0.54512 0.295 0.669 0.282 0.649 0.271 0.630 0.260 0.611 0.250 0.594 0.240 0.57813 0.331 0.705 0.316 0.684 0.303 0.664 0.291 0.645 0.279 0.627 0.269 0.61014 0.336 0.697 0.322 0.678 0.310 0.659 0.298 0.64115 0.341 0.690 0.328 0.672


Fig. 6. Comparison of the average coverage probabilities. From top to bottom: the Agresti–Coull interval CI_AC, the Wilson interval CI_W, the Jeffreys prior interval CI_J and the standard interval CI_s. The nominal confidence level is 0.95.

In Figure 5 we plot the coverage probability of the standard interval, the Wilson interval, the Agresti–Coull interval and the Jeffreys interval for n = 50 and α = 0.05.

3.2 Coverage Probability

In this and the next subsections, we compare the performance of the standard interval and the three recommended intervals in terms of their coverage probability and length.

Coverage of the Wilson interval fluctuates acceptably near 1 − α, except for p very near 0 or 1. It might be helpful to consult Figure 5 again. It can be shown that, when 1 − α = 0.95,

lim_{n→∞} inf_{γ≥1} C(γ/n, n) = 0.92,
lim_{n→∞} inf_{γ≥5} C(γ/n, n) = 0.936,
and
lim_{n→∞} inf_{γ≥10} C(γ/n, n) = 0.938

for the Wilson interval. In comparison, these three values for the standard interval are 0.860, 0.870, and 0.905, respectively, obviously considerably smaller.

The modification CI_M−W presented in Section 4.1.1 removes the first few deep downward spikes of the coverage function for CI_W. The resulting coverage function is overall somewhat conservative for p very near 0 or 1. Both CI_W and CI_M−W have the same coverage functions away from 0 or 1.

The Agresti–Coull interval has good minimum coverage probability. The coverage probability of the interval is quite conservative for p very close to 0 or 1. In comparison to the Wilson interval it is more conservative, especially for small n. This is not surprising because, as we have noted, CI_AC always contains CI_W as a proper subinterval.

The coverage of the Jeffreys interval is qualitatively similar to that of CI_W over most of the parameter space (0, 1). In addition, as we will see in Section 4.3, CI_J has an appealing connection to the mid-P corrected version of the Clopper–Pearson "exact" intervals. These are very similar to CI_J over most of the range, and have similar appealing properties. CI_J is a serious and credible candidate for practical use. The coverage has an unfortunate fairly deep spike near p = 0 and, symmetrically, another near p = 1. However, the simple modification of CI_J presented in Section 4.1.2 removes these two deep downward spikes. The modified Jeffreys interval CI_M−J performs well.

Let us also evaluate the intervals in terms of their average coverage probability, the average being over p. Figure 6 demonstrates the striking difference in the average coverage probability among four intervals: the Agresti–Coull interval, the Wilson interval, the Jeffreys prior interval and the standard interval. The standard interval performs poorly. The interval CI_AC is slightly conservative in terms of average coverage probability. Both the Wilson interval and the Jeffreys prior interval have excellent performance in terms of the average coverage probability; that of the Jeffreys prior interval is, if anything, slightly superior. The average coverage of the Jeffreys interval is really very close to the nominal level even for quite small n. This is quite impressive.

Figure 7 displays the mean absolute errors, ∫₀¹ |C(p, n) − (1 − α)| dp, for n = 10 to 25, and n = 26 to 40. It is clear from the plots that among the four intervals, CI_W, CI_AC and CI_J are comparable, but the mean absolute errors of CI_s are significantly larger.

3.3 Expected Length

Besides coverage, length is also very important in evaluation of a confidence interval. We compare both the expected length and the average expected length of the intervals. By definition,

Expected length = E_{n,p}[ length(CI) ] = Σ_{x=0}^{n} [ U(x, n) − L(x, n) ] (n choose x) p^x (1 − p)^{n−x},

where U and L are the upper and lower limits of the confidence interval CI, respectively. The average expected length is just the integral ∫₀¹ E_{n,p}[ length(CI) ] dp.

Fig. 7. The mean absolute errors of the coverage of the standard (solid), the Agresti–Coull (dashed), the Jeffreys (+) and the Wilson (dotted) intervals for n = 10 to 25 and n = 26 to 40.

We plot in Figure 8 the expected lengths of the four intervals for n = 25 and α = 0.05. In this case, CI_W is the shortest when 0.210 ≤ p ≤ 0.790, CI_J is the shortest when 0.133 ≤ p ≤ 0.210 or 0.790 ≤ p ≤ 0.867, and CI_s is the shortest when p ≤ 0.133 or p ≥ 0.867. It is no surprise that the standard interval is the shortest when p is near the boundaries. CI_s is not really in contention as a credible choice for such values of p because of its poor coverage properties in that region. Similar qualitative phenomena hold for other values of n.

Fig. 8. The expected lengths of the standard (solid), the Wilson (dotted), the Agresti–Coull (dashed) and the Jeffreys (+) intervals for n = 25 and α = 0.05.

Figure 9 shows the average expected lengths of the four intervals for n = 10 to 25 and n = 26 to 40. Interestingly, the comparison is clear and consistent as n changes. Always, the standard interval and the Wilson interval CI_W have almost identical average expected length; the Jeffreys interval CI_J is comparable to the Wilson interval, and in fact CI_J is slightly more parsimonious. But the difference is not of practical relevance. However, especially when n is small, the average expected length of CI_AC is noticeably larger than that of CI_J and CI_W. In fact, for n till about 20, the average expected length of CI_AC is larger than that of CI_J by 0.04 to 0.02, and this difference can be of definite practical relevance. The difference starts to wear off when n is larger than 30 or so.
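A sketch of the expected-length computation defined above, applied to any interval function; the standard (Wald) interval is used here only as the example interval, and the n and p values are illustrative.

```python
import numpy as np
from scipy.stats import binom, norm

def expected_length(interval_fn, n, p, conf=0.95):
    """E_{n,p}[length(CI)] = sum_x [U(x,n) - L(x,n)] * C(n,x) p^x (1-p)^(n-x)."""
    lengths = np.array([np.diff(interval_fn(x, n, conf))[0] for x in range(n + 1)])
    return float(np.sum(lengths * binom.pmf(np.arange(n + 1), n, p)))

def wald_interval(x, n, conf=0.95):
    """Standard interval (1), used as the example interval function."""
    k = norm.ppf(1 - (1 - conf) / 2)
    phat = x / n
    half = k * np.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

print(round(expected_length(wald_interval, n=25, p=0.3), 3))
```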

4. OTHER ALTERNATIVE INTERVALS

Several other intervals deserve consideration, either due to their historical value or their theoretical properties. In the interest of space, we had to exercise some personal judgment in deciding which additional intervals should be presented.

4.1 Boundary modification

The coverage probabilities of the Wilson interval and the Jeffreys interval fluctuate acceptably near 1 − α for p not very close to 0 or 1. Simple modifications can be made to remove a few deep downward spikes of their coverage near the boundaries; see Figure 5.

4.1.1 Modified Wilson interval. The lower bound of the Wilson interval is formed by inverting a CLT approximation. The coverage has downward spikes when p is very near 0 or 1. These spikes exist for all n and α. For example, it can be shown that, when 1 − α = 0.95 and p = 0.1765/n,

lim_{n→∞} P_p(p ∈ CI_W) = 0.838,

and when 1 − α = 0.99 and p = 0.1174/n, lim_{n→∞} P_p(p ∈ CI_W) = 0.889. The particular numerical values (0.1174, 0.1765) are relevant only to the extent that, divided by n, they approximate the location of these deep downward spikes.

Fig. 10. Coverage probability for n = 50 and p ∈ (0, 0.15). The plots are symmetric about p = 0.5 and the coverage of the modified intervals (solid line) is the same as that of the corresponding interval without modification (dashed line) for p ∈ (0.15, 0.85).

Fig. 11. Coverage probability of other alternative intervals for n = 50.

The spikes can be removed by using a one-sided Poisson approximation for x close to 0 or n. Suppose we modify the lower bound for x = 1, ..., x*. For a fixed 1 ≤ x ≤ x*, the lower bound of CI_W should be replaced by a lower bound of λ_x/n, where λ_x solves

e^{−λ}(λ^0/0! + λ^1/1! + ··· + λ^{x−1}/(x − 1)!) = 1 − α.  (10)

A symmetric prescription needs to be followed to modify the upper bound for x very near n. The value of x* should be small. Values which work reasonably well for 1 − α = 0.95 are x* = 2 for n < 50 and x* = 3 for 51 ≤ n ≤ 100.

Using the relationship between the Poisson and χ² distributions,

P(Y ≤ x) = P(χ²_{2(1+x)} ≥ 2λ),

where Y ∼ Poisson(λ), one can also formally express λ_x in (10) in terms of the χ² quantiles: λ_x = (1/2)χ²_{2x, α}, where χ²_{2x, α} denotes the 100αth percentile of the χ² distribution with 2x degrees of freedom. Table 4 gives the values of λ_x for selected values of x and α.

For example, consider the case 1 − α = 0.95 and x = 2. The lower bound of CI_W is ≈ 0.548/(n + 4). The modified Wilson interval replaces this by a lower bound of λ/n, where λ = (1/2)χ²_{4, 0.05}. Thus, from a χ² table, for x = 2 the new lower bound is 0.355/n. We denote this modified Wilson interval by CI_{M−W}. See Figure 10 for its coverage.
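A minimal Python sketch of this boundary modification, assuming SciPy and using our own function names: the ordinary Wilson limits are kept, except that for x = 1, ..., x* the lower limit becomes λ_x/n with λ_x = (1/2)χ²_{2x, α}, and symmetrically for x near n.

import math
from scipy.stats import norm, chi2


def wilson_limits(x, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    phat = x / n
    center = (x + z ** 2 / 2) / (n + z ** 2)
    half = (z * math.sqrt(n) / (n + z ** 2)) * math.sqrt(phat * (1 - phat) + z ** 2 / (4 * n))
    return center - half, center + half


def modified_wilson_limits(x, n, conf=0.95):
    alpha = 1 - conf
    lo, hi = wilson_limits(x, n, conf)
    x_star = 2 if n < 50 else 3           # cutoffs suggested in the text for 1 - alpha = 0.95, n up to about 100
    if 1 <= x <= x_star:                   # one-sided Poisson lower bound near 0
        lo = 0.5 * chi2.ppf(alpha, 2 * x) / n
    if 1 <= n - x <= x_star:               # symmetric modification near n
        hi = 1 - 0.5 * chi2.ppf(alpha, 2 * (n - x)) / n
    return lo, hi


# For x = 2 the modified lower bound is roughly 0.355 / n, as in the example above.
print(modified_wilson_limits(2, 30))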

4.1.2 Modified Jeffreys interval. Evidently, CI_J has an appealing Bayesian interpretation, and its coverage properties are appealing again except for a very narrow downward coverage spike fairly near 0 and 1 (see Figure 5). The unfortunate downward spikes in the coverage function result because U_J(0) is too small and, symmetrically, L_J(n) is too large. To remedy this, one may revise these two specific limits as

U_{M−J}(0) = p_l  and  L_{M−J}(n) = 1 − p_l,

where p_l satisfies (1 − p_l)^n = α/2, or equivalently p_l = 1 − (α/2)^{1/n}.

We also made a slight, ad hoc alteration of L_J(1) and set

L_{M−J}(1) = 0  and  U_{M−J}(n − 1) = 1.

In all other cases, L_{M−J} = L_J and U_{M−J} = U_J. We denote the modified Jeffreys interval by CI_{M−J}. This modification removes the two steep downward spikes and the performance of the interval is improved. See Figure 10.
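A minimal sketch of the modified Jeffreys interval along these lines (our code, not BCD's): the equal-tailed Jeffreys limits are Beta(x + 1/2, n − x + 1/2) quantiles, with the boundary revisions above applied afterwards.

from scipy.stats import beta


def modified_jeffreys_limits(x, n, conf=0.95):
    alpha = 1 - conf
    # equal-tailed Jeffreys limits, with L(0) = 0 and U(n) = 1 by convention
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, x + 0.5, n - x + 0.5)
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5)
    # boundary modifications: p_l solves (1 - p_l)^n = alpha / 2
    p_l = 1 - (alpha / 2) ** (1 / n)
    if x == 0:
        hi = p_l
    if x == n:
        lo = 1 - p_l
    if x == 1:
        lo = 0.0
    if x == n - 1:
        hi = 1.0
    return lo, hi


# compare with the x = 0, n = 12 entry of Table A.1 in the Appendix (0, 0.265)
print(modified_jeffreys_limits(0, 12))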

4.2 Other intervals

4.2.1 The Clopper–Pearson interval. The Clopper–Pearson interval is the inversion of the equal-tailed binomial test rather than its normal approximation. Some authors refer to this as the "exact" procedure because of its derivation from the binomial distribution. If X = x is observed, then the Clopper–Pearson (1934) interval is defined by CI_CP = [L_CP(x), U_CP(x)], where L_CP(x) and U_CP(x) are, respectively, the solutions in p to the equations

P_p(X ≥ x) = α/2  and  P_p(X ≤ x) = α/2.

It is easy to show that the lower endpoint is the α/2 quantile of a Beta(x, n − x + 1) distribution, and the upper endpoint is the 1 − α/2 quantile of a Beta(x + 1, n − x) distribution. The Clopper–Pearson interval guarantees that the actual coverage probability is always equal to or above the nominal confidence level. However, for any fixed p, the actual coverage probability can be much larger than 1 − α unless n is quite large, and thus the confidence interval is rather inaccurate in this sense. See Figure 11. The Clopper–Pearson interval is wastefully conservative and is not a good choice for practical use, unless strict adherence to the prescription C(p, n) ≥ 1 − α is demanded. Even then, better exact methods are available; see, for instance, Blyth and Still (1983) and Casella (1986).
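Using the Beta-quantile representation just given, the Clopper–Pearson limits can be computed in a few lines; the following sketch (ours) assumes SciPy.

from scipy.stats import beta


def clopper_pearson_limits(x, n, conf=0.95):
    alpha = 1 - conf
    # lower endpoint: alpha/2 quantile of Beta(x, n - x + 1); 0 when x = 0
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    # upper endpoint: 1 - alpha/2 quantile of Beta(x + 1, n - x); 1 when x = n
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lo, hi


print(clopper_pearson_limits(2, 25))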


4.2.2 The arcsine interval. Another interval is based on a widely used variance-stabilizing transformation for the binomial distribution [see, e.g., Bickel and Doksum, 1977]: T(p) = arcsin(p^{1/2}). This variance stabilization is based on the delta method and is, of course, only an asymptotic one. Anscombe (1948) showed that replacing p̂ by p̃ = (X + 3/8)/(n + 3/4) gives better variance stabilization; furthermore

2n^{1/2}(arcsin(p̃^{1/2}) − arcsin(p^{1/2})) → N(0, 1)  as n → ∞.

This leads to an approximate 100(1 − α)% confidence interval for p,

CI_Arc = [sin²(arcsin(p̃^{1/2}) − (1/2)κn^{−1/2}), sin²(arcsin(p̃^{1/2}) + (1/2)κn^{−1/2})].  (11)

See Figure 11 for the coverage probability of this interval for n = 50. This interval performs reasonably well for p not too close to 0 or 1. The coverage has steep downward spikes near the two edges; in fact it is easy to see that the coverage drops to zero when p is sufficiently close to the boundary (see Figure 11). The mean absolute error of the coverage of CI_Arc is significantly larger than those of CI_W, CI_AC and CI_J. We note that our evaluations show that the performance of the arcsine interval with the standard p̂ in place of p̃ in (11) is much worse than that of CI_Arc.
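A minimal sketch of interval (11), with p̃ = (X + 3/8)/(n + 3/4) and κ = z_{α/2}; the transformed endpoints are clipped to [0, π/2] so the back-transformed limits stay in [0, 1].

import math
from scipy.stats import norm


def arcsine_limits(x, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    p_til = (x + 3 / 8) / (n + 3 / 4)          # Anscombe's adjusted estimate
    center = math.asin(math.sqrt(p_til))
    half = 0.5 * z / math.sqrt(n)
    lo = math.sin(max(center - half, 0.0)) ** 2
    hi = math.sin(min(center + half, math.pi / 2)) ** 2
    return lo, hi


print(arcsine_limits(2, 50))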

4.2.3 The logit interval. The logit interval is obtained by inverting a Wald-type interval for the log odds λ = log(p/(1 − p)) (see Stone, 1995). The MLE of λ (for 0 < X < n) is

λ̂ = log(p̂/(1 − p̂)) = log(X/(n − X)),

which is the so-called empirical logit transform. The variance of λ̂, by an application of the delta theorem, can be estimated by

V̂ = n / (X(n − X)).

This leads to an approximate 100(1 − α)% confidence interval for λ,

CI(λ) = (λ_l, λ_u) = (λ̂ − κV̂^{1/2}, λ̂ + κV̂^{1/2}).  (12)

The logit interval for p is obtained by inverting the interval (12),

CI_Logit = [e^{λ_l}/(1 + e^{λ_l}), e^{λ_u}/(1 + e^{λ_u})].  (13)

The interval (13) has been suggested, for example, in Stone (1995, page 667). Figure 11 plots the coverage of the logit interval for n = 50. This interval performs quite well in terms of coverage for p away from 0 or 1. But the interval is unnecessarily long; in fact its expected length is larger than that of the Clopper–Pearson exact interval.
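A minimal sketch of the logit interval (12)–(13); as noted above, it is defined only for 0 < X < n.

import math
from scipy.stats import norm


def logit_limits(x, n, conf=0.95):
    if not 0 < x < n:
        raise ValueError("the logit interval requires 0 < x < n")
    z = norm.ppf(1 - (1 - conf) / 2)
    lam_hat = math.log(x / (n - x))            # empirical logit transform
    v_hat = n / (x * (n - x))                  # delta-method variance estimate
    lam_lo = lam_hat - z * math.sqrt(v_hat)
    lam_hi = lam_hat + z * math.sqrt(v_hat)
    # back-transform to the p scale
    return math.exp(lam_lo) / (1 + math.exp(lam_lo)), math.exp(lam_hi) / (1 + math.exp(lam_hi))


print(logit_limits(2, 50))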

Remark. Anscombe (1956) suggested that λ̂ = log((X + 1/2)/(n − X + 1/2)) is a better estimate of λ; see also Cox and Snell (1989) and Santner and Duffy (1989). The variance of Anscombe's λ̂ may be estimated by

V̂ = (n + 1)(n + 2) / (n(X + 1)(n − X + 1)).

A new logit interval can be constructed using the new estimates λ̂ and V̂. Our evaluations show that the new logit interval is overall shorter than CI_Logit in (13). But the coverage of the new interval is not satisfactory.

4.2.4 The Bayesian HPD interval. An exact Bayesian solution would involve using the HPD intervals instead of our equal-tails proposal. However, HPD intervals are much harder to compute and do not do as well in terms of coverage probability. See Figure 11 and compare to the Jeffreys equal-tailed interval in Figure 5.

4.2.5 The likelihood ratio interval. Along with the Wald and the Rao score intervals, the likelihood ratio method is one of the most used methods for construction of confidence intervals. It is constructed by inversion of the likelihood ratio test which accepts the null hypothesis H₀: p = p₀ if −2 log(Λ_n) ≤ κ², where Λ_n is the likelihood ratio

Λ_n = L(p₀) / sup_p L(p) = p₀^X (1 − p₀)^{n−X} / [(X/n)^X (1 − X/n)^{n−X}],

L being the likelihood function. See Rao (1973). Brown, Cai and DasGupta (1999) show by analytical calculations that this interval has nice properties. However, it is slightly harder to compute. For the purpose of the present article, which we view as primarily directed toward practice, we do not further analyze the likelihood ratio interval.
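Although the likelihood ratio interval has no closed form, it is straightforward to compute numerically: the endpoints are the two roots of −2 log Λ_n = κ² on either side of p̂. The following sketch (ours) uses simple root finding and assigns the obvious limits 0 and 1 in the edge cases x = 0 and x = n.

import math
from scipy.optimize import brentq
from scipy.stats import norm


def _loglik(p, x, n):
    # binomial log-likelihood with the convention 0 * log(0) = 0
    term1 = x * math.log(p) if x > 0 else 0.0
    term2 = (n - x) * math.log(1 - p) if x < n else 0.0
    return term1 + term2


def likelihood_ratio_limits(x, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    phat = x / n
    # g(p0) = -2 log Lambda_n - kappa^2; the interval endpoints are its roots around phat
    g = lambda p0: 2 * (_loglik(phat, x, n) - _loglik(p0, x, n)) - z ** 2
    eps = 1e-12
    lo = 0.0 if x == 0 else brentq(g, eps, phat)
    hi = 1.0 if x == n else brentq(g, phat, 1 - eps)
    return lo, hi


print(likelihood_ratio_limits(2, 50))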

4.3 Connections between Jeffreys Intervals and Mid-P Intervals

The equal-tailed Jeffreys prior interval has some interesting connections to the Clopper–Pearson interval. As we mentioned earlier, the Clopper–Pearson interval CI_CP can be written as

CI_CP = [B(α/2; X, n − X + 1), B(1 − α/2; X + 1, n − X)],

where B(α; m₁, m₂) denotes the α quantile of a Beta(m₁, m₂) distribution. It therefore follows immediately that CI_J is always contained in CI_CP. Thus CI_J corrects the conservativeness of CI_CP.

It turns out that the Jeffreys prior interval, although Bayesianly constructed, has a clear and convincing frequentist motivation. It is thus no surprise that it does well from a frequentist perspective. As we now explain, the Jeffreys prior interval CI_J can be regarded as a continuity-corrected version of the Clopper–Pearson interval CI_CP.

The interval CI_CP inverts the inequality P_p(X ≤ L(p)) ≤ α/2 to obtain the lower limit and similarly for the upper limit. Thus, for fixed x, the upper limit of the interval for p, U_CP(x), satisfies

P_{U_CP(x)}(X ≤ x) ≤ α/2,  (14)

and symmetrically for the lower limit.

This interval is very conservative; undesirably so for most practical purposes. A familiar proposal to eliminate this over-conservativeness is to instead invert

P_p(X ≤ L(p) − 1) + (1/2)P_p(X = L(p)) = α/2.  (15)

This amounts to solving

(1/2)[P_{U_CP(x)}(X ≤ x − 1) + P_{U_CP(x)}(X ≤ x)] = α/2,  (16)

which is the same as

U_mid-P(x) = (1/2)B(1 − α/2; x, n − x + 1) + (1/2)B(1 − α/2; x + 1, n − x),  (17)

and symmetrically for the lower endpoint. These are the "mid-P Clopper–Pearson" intervals. They are known to have good coverage and length performance. U_mid-P given in (17) is a weighted average of two incomplete Beta functions. The incomplete Beta function of interest, B(1 − α/2; x, n − x + 1), is continuous and monotone in x if we formally treat x as a continuous argument. Hence the average of the two functions defining U_mid-P is approximately the same as the value at the halfway point, x + 1/2. Thus

U_mid-P(x) ≈ B(1 − α/2; x + 1/2, n − x + 1/2) = U_J(x),

exactly the upper limit for the equal-tailed Jeffreys interval. Similarly, the corresponding approximate lower endpoint is the Jeffreys lower limit.

Another frequentist way to interpret the Jeffreys prior interval is to say that U_J(x) is the upper limit for the Clopper–Pearson rule with x − 1/2 successes and L_J(x) is the lower limit for the Clopper–Pearson rule with x + 1/2 successes. Strawderman and Wells (1998) contains a valuable discussion of mid-P intervals and suggests some variations based on asymptotic expansions.
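The approximation U_mid-P(x) ≈ U_J(x) is easy to verify numerically; the following sketch (ours) compares the weighted average of Beta quantiles in (17) with the Jeffreys upper limit.

from scipy.stats import beta

alpha, n = 0.05, 25
for x in range(1, n):
    u_midp = 0.5 * beta.ppf(1 - alpha / 2, x, n - x + 1) + \
             0.5 * beta.ppf(1 - alpha / 2, x + 1, n - x)
    u_jeff = beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5)
    print(x, round(u_midp, 4), round(u_jeff, 4))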

5. CONCLUDING REMARKS

Interval estimation of a binomial proportion is a very basic problem in practical statistics. The standard Wald interval is in nearly universal use. We first show that the performance of this standard interval is persistently chaotic and unacceptably poor. Indeed its coverage properties defy conventional wisdom. The performance is so erratic and the qualifications given in the influential texts are so defective that the standard interval should not be used. We provide a fairly comprehensive evaluation of many natural alternative intervals. Based on this analysis, we recommend the Wilson or the equal-tailed Jeffreys prior interval for small n (n ≤ 40). These two intervals are comparable in both absolute error and length for n ≤ 40, and we believe that either could be used, depending on taste.

For larger n, the Wilson, the Jeffreys and the Agresti–Coull intervals are all comparable, and the Agresti–Coull interval is the simplest to present. It is generally true in statistical practice that only those methods that are easy to describe, remember and compute are widely used. Keeping this in mind, we recommend the Agresti–Coull interval for practical use when n ≥ 40. Even for small sample sizes, the easy-to-present Agresti–Coull interval is much preferable to the standard one.

We would be satisfied if this article contributes to a greater appreciation of the severe flaws of the popular standard interval and an agreement that it deserves not to be used at all. We also hope that the recommendations for alternative intervals will provide some closure as to what may be used in preference to the standard method.

Finally, we note that the specific choices of the values of n, p and α in the examples and figures are artifacts. The theoretical results in Brown, Cai and DasGupta (1999) show that qualitatively similar phenomena as regarding coverage and length hold for general n and p and common values of the coverage. (Those results are asymptotic as n → ∞, but they are also sufficiently accurate for realistically moderate n.)
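Read as a recipe, the recommendation amounts to something like the following sketch (ours, and only one reasonable reading of it): the Wilson interval for small n and the Agresti–Coull interval otherwise.

import math
from scipy.stats import norm


def recommended_ci(x, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    center = (x + z ** 2 / 2) / (n + z ** 2)
    if n < 40:   # Wilson interval for small n
        phat = x / n
        half = (z * math.sqrt(n) / (n + z ** 2)) * math.sqrt(phat * (1 - phat) + z ** 2 / (4 * n))
    else:        # Agresti-Coull: same midpoint, variance evaluated at the midpoint
        half = z * math.sqrt(center * (1 - center) / (n + z ** 2))
    return center - half, center + half


print(recommended_ci(3, 20), recommended_ci(30, 100))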


APPENDIX

Table A.1. 95% limits of the modified Jeffreys prior interval

x n = 7 n = 8 n = 9 n = 10 n = 11 n = 12

0   0 0.410   0 0.369   0 0.336   0 0.308   0 0.285   0 0.265
1   0 0.501   0 0.454   0 0.414   0 0.381   0 0.353   0 0.328
2   0.065 0.648   0.056 0.592   0.049 0.544   0.044 0.503   0.040 0.467   0.036 0.436
3   0.139 0.766   0.119 0.705   0.104 0.652   0.093 0.606   0.084 0.565   0.076 0.529
4   0.234 0.861   0.199 0.801   0.173 0.746   0.153 0.696   0.137 0.652   0.124 0.612
5   – –   – –   0.254 0.827   0.224 0.776   0.200 0.730   0.180 0.688
6   – –   – –   – –   – –   0.270 0.800   0.243 0.757

x n = 13 n = 14 n = 15 n = 16 n = 17 n = 18

0   0 0.247   0 0.232   0 0.218   0 0.206   0 0.195   0 0.185
1   0 0.307   0 0.288   0 0.272   0 0.257   0 0.244   0 0.232
2   0.033 0.409   0.031 0.385   0.029 0.363   0.027 0.344   0.025 0.327   0.024 0.311
3   0.070 0.497   0.064 0.469   0.060 0.444   0.056 0.421   0.052 0.400   0.049 0.381
4   0.114 0.577   0.105 0.545   0.097 0.517   0.091 0.491   0.085 0.467   0.080 0.446
5   0.165 0.650   0.152 0.616   0.140 0.584   0.131 0.556   0.122 0.530   0.115 0.506
6   0.221 0.717   0.203 0.681   0.188 0.647   0.174 0.617   0.163 0.589   0.153 0.563
7   0.283 0.779   0.259 0.741   0.239 0.706   0.222 0.674   0.207 0.644   0.194 0.617
8   – –   – –   0.294 0.761   0.272 0.728   0.254 0.697   0.237 0.668
9   – –   – –   – –   – –   0.303 0.746   0.284 0.716

x n = 19 n = 20 n = 21 n = 22 n = 23 n = 24

0   0 0.176   0 0.168   0 0.161   0 0.154   0 0.148   0 0.142
1   0 0.221   0 0.211   0 0.202   0 0.193   0 0.186   0 0.179
2   0.022 0.297   0.021 0.284   0.020 0.272   0.019 0.261   0.018 0.251   0.018 0.241
3   0.047 0.364   0.044 0.349   0.042 0.334   0.040 0.321   0.038 0.309   0.036 0.297
4   0.076 0.426   0.072 0.408   0.068 0.392   0.065 0.376   0.062 0.362   0.059 0.349
5   0.108 0.484   0.102 0.464   0.097 0.446   0.092 0.429   0.088 0.413   0.084 0.398
6   0.144 0.539   0.136 0.517   0.129 0.497   0.123 0.478   0.117 0.461   0.112 0.444
7   0.182 0.591   0.172 0.568   0.163 0.546   0.155 0.526   0.148 0.507   0.141 0.489
8   0.223 0.641   0.211 0.616   0.199 0.593   0.189 0.571   0.180 0.551   0.172 0.532
9   0.266 0.688   0.251 0.662   0.237 0.638   0.225 0.615   0.214 0.594   0.204 0.574
10   0.312 0.734   0.293 0.707   0.277 0.681   0.263 0.657   0.250 0.635   0.238 0.614
11   – –   – –   0.319 0.723   0.302 0.698   0.287 0.675   0.273 0.653
12   – –   – –   – –   – –   0.325 0.713   0.310 0.690

x n = 25 n = 26 n = 27 n = 28 n = 29 n = 30

0   0 0.137   0 0.132   0 0.128   0 0.123   0 0.119   0 0.116
1   0 0.172   0 0.166   0 0.160   0 0.155   0 0.150   0 0.145
2   0.017 0.233   0.016 0.225   0.016 0.217   0.015 0.210   0.015 0.203   0.014 0.197
3   0.035 0.287   0.034 0.277   0.032 0.268   0.031 0.259   0.030 0.251   0.029 0.243
4   0.056 0.337   0.054 0.325   0.052 0.315   0.050 0.305   0.048 0.295   0.047 0.286
5   0.081 0.384   0.077 0.371   0.074 0.359   0.072 0.348   0.069 0.337   0.067 0.327
6   0.107 0.429   0.102 0.415   0.098 0.402   0.095 0.389   0.091 0.378   0.088 0.367
7   0.135 0.473   0.129 0.457   0.124 0.443   0.119 0.429   0.115 0.416   0.111 0.404
8   0.164 0.515   0.158 0.498   0.151 0.482   0.145 0.468   0.140 0.454   0.135 0.441
9   0.195 0.555   0.187 0.537   0.180 0.521   0.172 0.505   0.166 0.490   0.160 0.476
10   0.228 0.594   0.218 0.576   0.209 0.558   0.201 0.542   0.193 0.526   0.186 0.511
11   0.261 0.632   0.250 0.613   0.239 0.594   0.230 0.577   0.221 0.560   0.213 0.545
12   0.295 0.669   0.282 0.649   0.271 0.630   0.260 0.611   0.250 0.594   0.240 0.578
13   0.331 0.705   0.316 0.684   0.303 0.664   0.291 0.645   0.279 0.627   0.269 0.610
14   – –   – –   0.336 0.697   0.322 0.678   0.310 0.659   0.298 0.641
15   – –   – –   – –   – –   0.341 0.690   0.328 0.672


ACKNOWLEDGMENTS

We thank Xuefeng Li for performing some helpful computations and Jim Berger, David Moore, Steve Samuels, Bill Studden and Ron Thisted for useful conversations. We also thank the Editors and two anonymous referees for their thorough and constructive comments. Supported by grants from the National Science Foundation and the National Security Agency.

REFERENCES

Abramowitz, M. and Stegun, I. A. (1970). Handbook of Mathematical Functions. Dover, New York.
Agresti, A. and Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. Amer. Statist. 52 119–126.
Anscombe, F. J. (1948). The transformation of Poisson, binomial and negative binomial data. Biometrika 35 246–254.
Anscombe, F. J. (1956). On estimating binomial response relations. Biometrika 43 461–464.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.
Berry, D. A. (1996). Statistics: A Bayesian Perspective. Wadsworth, Belmont, CA.
Bickel, P. and Doksum, K. (1977). Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ.
Blyth, C. R. and Still, H. A. (1983). Binomial confidence intervals. J. Amer. Statist. Assoc. 78 108–116.
Brown, L. D., Cai, T. and DasGupta, A. (1999). Confidence intervals for a binomial proportion and asymptotic expansions. Ann. Statist., to appear.
Brown, L. D., Cai, T. and DasGupta, A. (2000). Interval estimation in discrete exponential family. Technical report, Dept. Statistics, Univ. Pennsylvania.
Casella, G. (1986). Refining binomial confidence intervals. Canad. J. Statist. 14 113–129.
Casella, G. and Berger, R. L. (1990). Statistical Inference. Wadsworth & Brooks/Cole, Belmont, CA.
Clopper, C. J. and Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26 404–413.
Cox, D. R. and Snell, E. J. (1989). Analysis of Binary Data, 2nd ed. Chapman and Hall, London.
Cressie, N. (1980). A finely tuned continuity correction. Ann. Inst. Statist. Math. 30 435–442.
Ghosh, B. K. (1979). A comparison of some approximate confidence intervals for the binomial parameter. J. Amer. Statist. Assoc. 74 894–900.
Hall, P. (1982). Improving the normal approximation when constructing one-sided confidence intervals for binomial or Poisson parameters. Biometrika 69 647–652.
Lehmann, E. L. (1999). Elements of Large-Sample Theory. Springer, New York.
Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion; comparison of several methods. Statistics in Medicine 17 857–872.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
Samuels, M. L. and Witmer, J. W. (1999). Statistics for the Life Sciences, 2nd ed. Prentice Hall, Englewood Cliffs, NJ.
Santner, T. J. (1998). A note on teaching binomial confidence intervals. Teaching Statistics 20 20–23.
Santner, T. J. and Duffy, D. E. (1989). The Statistical Analysis of Discrete Data. Springer, Berlin.
Stone, C. J. (1995). A Course in Probability and Statistics. Duxbury, Belmont, CA.
Strawderman, R. L. and Wells, M. T. (1998). Approximately exact inference for the common odds ratio in several 2 × 2 tables (with discussion). J. Amer. Statist. Assoc. 93 1294–1320.
Tamhane, A. C. and Dunlop, D. D. (2000). Statistics and Data Analysis from Elementary to Intermediate. Prentice Hall, Englewood Cliffs, NJ.
Vollset, S. E. (1993). Confidence intervals for a binomial proportion. Statistics in Medicine 12 809–824.
Wasserman, L. (1991). An inferential interpretation of default priors. Technical report, Carnegie-Mellon Univ.
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Amer. Statist. Assoc. 22 209–212.

Comment
Alan Agresti and Brent A. Coull

Alan Agresti is Distinguished Professor, Department of Statistics, University of Florida, Gainesville, Florida 32611-8545 (e-mail: [email protected]). Brent A. Coull is Assistant Professor, Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115 (e-mail: [email protected]).

In this very interesting article, Professors Brown, Cai and DasGupta (BCD) have shown that discreteness can cause havoc for much larger sample sizes than one would expect. The popular (Wald) confidence interval for a binomial parameter p has been known for some time to behave poorly, but readers will surely be surprised that this can happen for such large n values.

Interval estimation of a binomial parameter is deceptively simple, as there are not even any nuisance parameters. The gold standard would seem to be a method such as the Clopper–Pearson, based on inverting an "exact" test using the binomial distribution rather than an approximate test using the normal. Because of discreteness, however, this method is too conservative. A more practical, nearly gold standard for this and other discrete problems seems to be based on inverting a two-sided test using the exact distribution but with the mid-P value. Similarly, with large-sample methods it is better not to use a continuity correction, as otherwise it approximates exact inference based on an ordinary P-value, resulting in conservative behavior. Interestingly, BCD note that the Jeffreys interval (CI_J) approximates the mid-P value correction of the Clopper–Pearson interval. See Gart (1966) for related remarks about the use of 1/2 additions to numbers of successes and failures before using frequentist methods.

Fig. 1. A comparison of mean expected lengths for the nominal 95% Jeffreys (J), Wilson (W), Modified Jeffreys (M-J), Modified Wilson (M-W), and Agresti–Coull (AC) intervals for n = 5, 6, 7, 8, 9.

1. METHODS FOR ELEMENTARY STATISTICS COURSES

It's unfortunate that the Wald interval for p is so seriously deficient, because in addition to being the simplest interval it is the obvious one to teach in elementary statistics courses. By contrast, the Wilson interval (CI_W) performs surprisingly well even for small n. Since it is too complex for many such courses, however, our motivation for the "Agresti–Coull interval" (CI_AC) was to provide a simple approximation for CI_W. Formula (4) in BCD shows that the midpoint p̃ for CI_W is a weighted average of p̂ and 1/2 that equals the sample proportion after adding z²_{α/2} pseudo observations, half of each type; the square of the coefficient of z_{α/2} is the same weighted average of the variance of a sample proportion when p = p̂ and when p = 1/2, using ñ = n + z²_{α/2} in place of n. The CI_AC uses the CI_W midpoint, but its squared coefficient of z_{α/2} is the variance p̃q̃/ñ at the weighted average p̃ rather than the weighted average of the variances. The resulting interval p̃ ± z_{α/2}(p̃q̃/ñ)^{1/2} is wider than CI_W (by Jensen's inequality), in particular being conservative for p near 0 and 1 where CI_W can suffer poor coverage probabilities.

Regarding textbook qualifications on sample size for using the Wald interval, skewness considerations and the Edgeworth expansion suggest that guidelines for n should depend on p through (1 − 2p)²/(p(1 − p)). See, for instance, Boos and Hughes-Oliver (2000). But this does not account for the effects of discreteness, and as BCD point out, guidelines in terms of p are not verifiable. For elementary course teaching there is no obvious alternative (such as t methods) for smaller n, so we think it is sensible to teach a single method that behaves reasonably well for all n, as do the Wilson, Jeffreys and Agresti–Coull intervals.

2. IMPROVED PERFORMANCE WITH BOUNDARY MODIFICATIONS

BCD showed that one can improve the behavior of the Wilson and Jeffreys intervals for p near 0 and 1 by modifying the endpoints for CI_W when x = 1, 2, n − 2, n − 1 (and x = 3 and n − 3 for n > 50) and for CI_J when x = 0, 1, n − 1, n. Once one permits the modification of methods near the sample space boundary, other methods may perform decently besides the three recommended in this article.

For instance, Newcombe (1998) showed that when 0 < x < n the Wilson interval CI_W and the Wald logit interval have the same midpoint on the logit scale. In fact, Newcombe has shown (personal communication, 1999) that the logit interval necessarily contains CI_W. The logit interval is the uninformative one [0, 1] when x = 0 or x = n, but substituting the Clopper–Pearson limits in those cases yields coverage probability functions that resemble those for CI_W and CI_AC, although considerably more conservative for small n. Rubin and Schenker (1987) recommended the logit interval after 1/2 additions to numbers of successes and failures, motivating it as a normal approximation to the posterior distribution of the logit parameter after using the Jeffreys prior. However, this modification has coverage probabilities that are unacceptably small for p near 0 and 1 (see Vollset, 1993). Presumably some other boundary modification will result in a happy medium. In a letter to the editor about Agresti and Coull (1998), Rindskopf (2000) argued in favor of the logit interval partly because of its connection with logit modeling. We have not used this method for teaching in elementary courses, since logit intervals do not extend to intervals for the difference of proportions and (like CI_W and CI_J) they are rather complex for that level.

Fig. 2. A comparison of expected lengths for the nominal 95% Jeffreys (J), Wilson (W), Modified Jeffreys (M-J), Modified Wilson (M-W), and Agresti–Coull (AC) intervals for n = 5.

For practical use and for teaching in more advanced courses, some statisticians may prefer the likelihood ratio interval, since conceptually it is simple and the method also applies in a general model-building framework. An advantage compared to the Wald approach is its invariance to the choice of scale, resulting, for instance, both from the original scale and the logit. BCD do not say much about this interval, since it is harder to compute. However, it is easy to obtain with standard statistical software (e.g., in SAS, using the LRCI option in PROC GENMOD for a model containing only an intercept term and assuming a binomial response with logit or identity link function). Graphs in Vollset (1993) suggest that the boundary-modified likelihood ratio interval also behaves reasonably well, although conservative for p near 0 and 1.

For elementary course teaching, a disadvantage of all such intervals using boundary modifications is that making exceptions from a general, simple recipe distracts students from the simple concept of taking the estimate plus and minus a normal score multiple of a standard error. (Of course, this concept is not sufficient for serious statistical work, but some oversimplification and compromise is necessary at that level.) Even with CI_AC, instructors may find it preferable to give a recipe with the same number of added pseudo observations for all α, instead of z²_{α/2}. Reasonably good performance seems to result, especially for small α, from the value 4 ≈ z²_{0.025} used in the 95% CI_AC interval (i.e., the "add two successes and two failures" interval). Agresti and Caffo (2000) discussed this and showed that adding four pseudo observations also dramatically improves the Wald two-sample interval for comparing proportions, although again at the cost of rather severe conservativeness when both parameters are near 0 or near 1.
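For reference, the Agresti–Coull interval described above, and the 95% "add two successes and two failures" version of it, can be written as in the following sketch (ours).

import math
from scipy.stats import norm


def agresti_coull(x, n, conf=0.95):
    # Wilson midpoint p~ = (X + z^2/2)/(n + z^2), half-width z * sqrt(p~ q~ / n~)
    z = norm.ppf(1 - (1 - conf) / 2)
    n_til = n + z ** 2
    p_til = (x + z ** 2 / 2) / n_til
    half = z * math.sqrt(p_til * (1 - p_til) / n_til)
    return p_til - half, p_til + half


def add_two_successes_two_failures(x, n):
    # the classroom version: add 2 successes and 2 failures, then use the Wald formula
    p_til = (x + 2) / (n + 4)
    half = 1.96 * math.sqrt(p_til * (1 - p_til) / (n + 4))
    return p_til - half, p_til + half


print(agresti_coull(3, 20))
print(add_two_successes_two_failures(3, 20))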

3. ALTERNATIVE WIDTH COMPARISON

In comparing the expected lengths of the three recommended intervals, BCD note that the comparison is clear and consistent as n changes, with the average expected length being noticeably larger for CI_AC than CI_J and CI_W. Thus, in their concluding remarks, they recommend CI_J and CI_W for small n. However, since BCD recommend modifying CI_J and CI_W to eliminate severe downward spikes of coverage probabilities, we believe that a more fair comparison of expected lengths uses the modified versions CI_{M−J} and CI_{M−W}. We checked this but must admit that figures analogous to the BCD Figures 8 and 9 show that CI_{M−J} and CI_{M−W} maintain their expected length advantage over CI_AC, although it is reduced somewhat.

However, when n decreases below 10, the results change, with CI_{M−J} having greater expected width than CI_AC and CI_{M−W}. Our Figure 1 extends the BCD Figure 9 to values of n < 10, showing how the comparison differs between the ordinary intervals and the modified ones. Our Figure 2 has the format of the BCD Figure 8, but for n = 5 instead of 25. Admittedly, n = 5 is a rather extreme case, one for which the Jeffreys interval is modified unless x = 2 or 3 and the Wilson interval is modified unless x = 0 or 5, and for it CI_AC has coverage probabilities that can dip below 0.90. Thus, overall, the BCD recommendations about choice of method seem reasonable to us. Our own preference is to use the Wilson interval for statistical practice and CI_AC for teaching in elementary statistics courses.

4. EXTENSIONS

Other than near-boundary modifications, another type of fine-tuning that may help is to invert a test permitting unequal tail probabilities. This occurs naturally in exact inference that inverts a single two-tailed test, which can perform better than inverting two separate one-tailed tests (e.g., Sterne, 1954; Blyth and Still, 1983).

Finally, we are curious about the implications of the BCD results in a more general setting. How much does their message about the effects of discreteness and basing interval estimation on the Jeffreys prior or the score test rather than the Wald test extend to parameters in other discrete distributions and to two-sample comparisons? We have seen that interval estimation of the Poisson parameter benefits from inverting the score test rather than the Wald test on the count scale (Agresti and Coull, 1998).

One would not think there could be anything new to say about the Wald confidence interval for a proportion, an inferential method that must be one of the most frequently used since Laplace (1812, page 283). Likewise, the confidence interval for a proportion based on the Jeffreys prior has received attention in various forms for some time. For instance, R. A. Fisher (1956, pages 63–70) showed the similarity of a Bayesian analysis with Jeffreys prior to his fiducial approach, in a discussion that was generally critical of the confidence interval method but grudgingly admitted of limits obtained by a test inversion such as the Clopper–Pearson method, "though they fall short in logical content of the limits found by the fiducial argument, and with which they have often been confused, they do fulfil some of the desiderata of statistical inferences." Congratulations to the authors for brilliantly casting new light on the performance of these old and established methods.

Comment
George Casella

1. INTRODUCTION

George Casella is Arun Varma Commemorative Term Professor and Chair, Department of Statistics, University of Florida, Gainesville, Florida 32611-8545 (e-mail: [email protected]).

Professors Brown, Cai and DasGupta (BCD) are to be congratulated for their clear and imaginative look at a seemingly timeless problem. The chaotic behavior of coverage probabilities of discrete confidence sets has always been an annoyance, resulting in intervals whose coverage probability can be vastly different from their nominal confidence level. What we now see is that for the Wald interval, an approximate interval, the chaotic behavior is relentless, as this interval will not maintain 1 − α coverage for any value of n. Although fixes relying on ad hoc rules abound, they do not solve this fundamental defect of the Wald interval and, surprisingly, the usual safety net of asymptotics is also shown not to exist. So, as the song goes, "Bye-bye, so long, farewell" to the Wald interval.

Now that the Wald interval is out, what is in? There are probably two answers here, depending on whether one is in the classroom or the consulting room.


Fig. 1. Coverage probabilities of the Blyth–Still interval (upper) and Agresti–Coull interval (lower) for n = 100 and 1 − α = 0.95.

2. WHEN YOU SAY 95%

In the classroom it is (still) valuable to have a formula for a confidence interval, and I typically present the Wilson/score interval, starting from the test statistic formulation. Although this doesn't have the pleasing form p̂ ± something, most students can understand the logic of test inversion. Moreover, the fact that the interval does not have a symmetric form is a valuable lesson in itself; the statistical world is not always symmetric.

However, one thing still bothers me about this interval. It is clearly not a 1 − α interval; that is, it does not maintain its nominal coverage probability. This is a defect, and one that should not be compromised. I am uncomfortable in presenting a confidence interval that does not maintain its stated confidence; when you say 95% you should mean 95%!

But the fix here is rather simple: apply the "continuity correction" to the score interval (a technique that seems to be out of favor for reasons I do not understand). The continuity correction is easy to justify in the classroom using pictures of the normal density overlaid on the binomial mass function, and the resulting interval will now maintain its nominal level. (This last statement is not based on analytic proof, but on numerical studies.) Anyone reading Blyth (1986) cannot help being convinced that this is an excellent approximation, coming at only a slightly increased effort.

One other point that Blyth makes, which BCD do not mention, is that it is easy to get exact confidence limits at the endpoints. That is, for X = 0 the lower bound is 0 and for X = 1 the lower bound is 1 − (1 − α)^{1/n} [the solution to P(X = 0) = 1 − α].

3. USE YOUR TOOLS

The essential message that I take away from the work of BCD is that an approximate/formula-based approach to constructing a binomial confidence interval is bound to have essential flaws. However, this is a situation where brute force computing will do the trick. The construction of a 1 − α binomial confidence interval is a discrete optimization problem that is easily programmed. So why not use the tools that we have available? If the problem will yield to brute force computation, then we should use that solution.

Blyth and Still (1983) showed how to compute exact intervals through numerical inversion of tests, and Casella (1986) showed how to compute exact intervals by refining conservative intervals. So for any value of n and α, we can compute an exact, shortest 1 − α confidence interval that will not display any of the pathological behavior illustrated by BCD. As an example, Figure 1 shows the Agresti–Coull interval along with the Blyth–Still interval for n = 100 and 1 − α = 0.95. While the Agresti–Coull interval fails to maintain 0.95 coverage in the middle p region, the Blyth–Still interval always maintains 0.95 coverage. What is more surprising, however, is that the Blyth–Still interval displays much less variation in its coverage probability, especially near the endpoints. Thus, the simplistic numerical algorithm produces an excellent interval, one that both maintains its guaranteed coverage and reduces oscillation in the coverage probabilities.
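The brute-force evaluation advocated here is indeed simple to program: the exact coverage of any system of limits [L(x), U(x)] at a given p is C(p, n) = Σ_x 1{L(x) ≤ p ≤ U(x)} P_p(X = x), computable directly from the binomial pmf. A minimal sketch (ours), illustrated with the Agresti–Coull limits:

import numpy as np
from scipy.stats import binom, norm


def agresti_coull_limits(x, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    n_til = n + z ** 2
    p_til = (x + z ** 2 / 2) / n_til
    half = z * np.sqrt(p_til * (1 - p_til) / n_til)
    return p_til - half, p_til + half


def coverage(p, n, limits=agresti_coull_limits, conf=0.95):
    # exact coverage probability at p: sum the binomial pmf over the x whose interval contains p
    x = np.arange(n + 1)
    lo, hi = limits(x, n, conf)
    inside = (lo <= p) & (p <= hi)
    return np.sum(binom.pmf(x[inside], n, p))


# minimum coverage over a grid of p values for n = 100
print(min(coverage(p, 100) for p in np.linspace(0.01, 0.99, 981)))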

ACKNOWLEDGMENT

Supported by NSF Grant DMS-99-71586.

Comment
Chris Corcoran and Cyrus Mehta

Chris Corcoran is Assistant Professor, Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, Utah 84322-3900 (e-mail: [email protected]). Cyrus Mehta is Professor, Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, Massachusetts 02115 and is with Cytel Software Corporation, 675 Massachusetts Avenue, Cambridge, Massachusetts 02319.

We thank the authors for a very accessible and thorough discussion of this practical problem. With the availability of modern computational tools, we have an unprecedented opportunity to carefully evaluate standard statistical procedures in this manner. The results of such work are invaluable to teachers and practitioners of statistics everywhere. We particularly appreciate the attention paid by the authors to the generally oversimplified and inadequate recommendations made by statistical texts regarding when to use normal approximations in analyzing binary data. As their work has plainly shown, even in the simple case of a single binomial proportion, the discreteness of the data makes the use of some asymptotic procedures tenuous, even when the underlying probability lies away from the boundary or when the sample size is relatively large.

The authors have evaluated various confidence intervals with respect to their coverage properties and average lengths. Implicit in their evaluation is the premise that overcoverage is just as bad as undercoverage. We disagree with the authors on this fundamental issue. If, because of the discreteness of the test statistic, the desired confidence level cannot be attained, one would ordinarily prefer overcoverage to undercoverage. Wouldn't you prefer to hire a fortune teller whose track record exceeds expectations to one whose track record is unable to live up to its claim of accuracy? With the exception of the Clopper–Pearson interval, none of the intervals discussed by the authors lives up to its claim of 95% accuracy throughout the range of p. Yet the authors dismiss this interval on the grounds that it is "wastefully conservative." Perhaps so, but they do not address the issue of how the wastefulness is manifested.

What penalty do we incur for furnishing confidence intervals that are more truthful than was required of them? Presumably we pay for the conservatism by an increase in the length of the confidence interval. We thought it would be a useful exercise to actually investigate the magnitude of this penalty for two confidence interval procedures that are guaranteed to provide the desired coverage but are not as conservative as Clopper–Pearson. Figure 1 displays the true coverage probabilities for the nominal 95% Blyth–Still–Casella (see Blyth and Still, 1983; Casella, 1984) confidence interval (BSC interval) and the 95% confidence interval obtained by inverting the exact likelihood ratio test (LR interval; the inversion follows that shown by Aitken, Anderson, Francis and Hinde, 1989, pages 112–118).

Fig. 1. Actual coverage probabilities for BSC and LR intervals as a function of p (n = 50). Compare to authors' Figures 5, 10 and 11.

There is no value of p for which the coverage of the BSC and LR intervals falls below 95%. Their coverage probabilities are, however, much closer to 95% than would be obtained by the Clopper–Pearson procedure, as is evident from the authors' Figure 11. Thus one could say that these two intervals are uniformly better than the Clopper–Pearson interval.

We next investigate the penalty to be paid for the guaranteed coverage in terms of increased length of the BSC and LR intervals relative to the Wilson, Agresti–Coull, or Jeffreys intervals recommended by the authors. This is shown by Figure 2.

In fact the BSC and LR intervals are actually shorter than Agresti–Coull for p < 0.2 or p > 0.8, and shorter than the Wilson interval for p < 0.1 and p > 0.9. The only interval that is uniformly shorter than BSC and LR is the Jeffreys interval. Most of the time the difference in lengths is negligible, and in the worst case (at p = 0.5) the Jeffreys interval is only shorter by 0.025 units. Of the three asymptotic methods recommended by the authors, the Jeffreys interval yields the lowest average probability of coverage, with significantly greater potential relative undercoverage in the (0.05, 0.20) and (0.80, 0.95) regions of the parameter space. Considering this, one must question the rationale for preferring Jeffreys to either BSC or LR.

The authors argue for simplicity and ease of computation. This argument is valid for the teaching of statistics, where the instructor must balance simplicity with accuracy. As the authors point out, it is customary to teach the standard interval in introductory courses because the formula is straightforward and the central limit theorem provides a good heuristic for motivating the normal approximation. However, the evidence shows that the standard method is woefully inadequate. Teaching statistical novices about a Clopper–Pearson type interval is conceptually difficult, particularly because exact intervals are impossible to compute by hand. As the Agresti–Coull interval preserves the confidence level most successfully among the three recommended alternative intervals, we believe that this feature, when coupled with its straightforward computation (particularly when α = 0.05), makes this approach ideal for the classroom.

Simplicity and ease of computation have no role to play in statistical practice. With the advent of powerful microcomputers, researchers no longer resort to hand calculations when analyzing data. While the need for simplicity applies to the classroom, in applications we primarily desire reliable, accurate solutions, as there is no significant difference in the computational overhead required by the authors' recommended intervals when compared to the BSC and LR methods. From this perspective, the BSC and LR intervals have a substantial advantage relative to the various asymptotic intervals presented by the authors. They guarantee coverage at a relatively low cost in increased length. In fact, the BSC interval is already implemented in StatXact (1998) and is therefore readily accessible to practitioners.


Fig. 2. Expected lengths of BSC and LR intervals as a function of p compared, respectively, to Wilson, Agresti–Coull and Jeffreys intervals (n = 25). Compare to authors' Figure 8.

Comment
Malay Ghosh

This is indeed a very valuable article which brings out very clearly some of the inherent difficulties associated with confidence intervals for parameters of interest in discrete distributions. Professors Brown, Cai and DasGupta (henceforth BCD) are to be complimented for their comprehensive and thought-provoking discussion about the "chaotic" behavior of the Wald interval for the binomial proportion and an appraisal of some of the alternatives that have been proposed.

Malay Ghosh is Distinguished Professor, Department of Statistics, University of Florida, Gainesville, Florida 32611-8545 (e-mail: [email protected]).

My remarks will primarily be confined to the discussion of Bayesian methods introduced in this paper. BCD have demonstrated very clearly that the modified Jeffreys equal-tailed interval works well in this problem and recommend it as a possible contender to the Wilson interval for n ≤ 40.

There is a deep-rooted optimality associated with Jeffreys prior as the unique first-order probability matching prior for a real-valued parameter of interest with no nuisance parameter. Roughly speaking, a probability matching prior for a real-valued parameter is one for which the coverage probability of a one-sided Bayesian credible interval is asymptotically equal to its frequentist counterpart. Before giving a formal definition of such priors, we provide an intuitive explanation of why Jeffreys prior is a matching prior. To this end, we begin with the fact that if X_1, ..., X_n are iid N(θ, 1), then X̄_n = Σ_{i=1}^n X_i/n is the MLE of θ. With the uniform prior π(θ) ∝ c (a constant), the posterior of θ is N(X̄_n, 1/n). Accordingly, writing z_α for the upper 100α% point of the N(0, 1) distribution,

P(θ ≤ X̄_n + z_α n^{−1/2} | X̄_n) = 1 − α = P(θ ≤ X̄_n + z_α n^{−1/2} | θ),

and this is an example of perfect matching. Now if θ̂_n is the MLE of θ, under suitable regularity conditions, θ̂_n | θ is asymptotically (as n → ∞) N(θ, I^{−1}(θ)), where I(θ) is the Fisher information number. With the transformation g(θ) = ∫^θ I^{1/2}(t) dt, by the delta method, g(θ̂_n) is asymptotically N(g(θ), 1). Now, intuitively one expects the uniform prior π(θ) ∝ c as the asymptotic matching prior for g(θ). Transforming back to the original parameter, Jeffreys prior is a probability matching prior for θ. Of course, this requires an invariance of probability matching priors, a fact which is rigorously established in Datta and Ghosh (1996). Thus a uniform prior for arcsin(θ^{1/2}), where θ is the binomial proportion, leads to the Jeffreys Beta(1/2, 1/2) prior for θ. When θ is the Poisson parameter, the uniform prior for θ^{1/2} leads to the Jeffreys prior θ^{−1/2} for θ.

In a more formal setup, let X_1, ..., X_n be iid conditional on some real-valued θ. Let θ^{1−α}_π(X_1, ..., X_n) denote a posterior (1 − α)th quantile for θ under the prior π. Then π is said to be a first-order probability matching prior if

P(θ ≤ θ^{1−α}_π(X_1, ..., X_n) | θ) = 1 − α + o(n^{−1/2}).  (1)

This definition is due to Welch and Peers (1963), who showed by solving a differential equation that Jeffreys prior is the unique first-order probability matching prior in this case. Strictly speaking, Welch and Peers proved this result only for continuous distributions. Ghosh (1994) pointed out a suitable modification of criterion (1) which would lead to the same conclusion for discrete distributions. Also, for small and moderate samples, due to discreteness, one needs some modifications of the Jeffreys interval as done so successfully by BCD.

This idea of probability matching can be extended even in the presence of nuisance parameters. Suppose that θ = (θ_1, ..., θ_p)^T, where θ_1 is the parameter of interest, while (θ_2, ..., θ_p)^T is the nuisance parameter. Writing I(θ) = ((I_{jk})) for the Fisher information matrix, if θ_1 is orthogonal to (θ_2, ..., θ_p)^T in the sense of Cox and Reid (1987), that is, I_{1k} = 0 for all k = 2, ..., p, then extending the previous intuitive argument, π(θ) ∝ I_{11}^{1/2}(θ) is a probability matching prior. Indeed, this prior belongs to the general class of first-order probability matching priors

π(θ) ∝ I_{11}^{1/2}(θ) h(θ_2, ..., θ_p)

as derived in Tibshirani (1989). Here h(·) is an arbitrary function differentiable in its arguments.

In general, matching priors have a long success story in providing frequentist confidence intervals, especially in complex problems, for example, the Behrens–Fisher or the common mean estimation problems where frequentist methods run into difficulty. Though asymptotic, the matching property seems to hold for small and moderate sample sizes as well for many important statistical problems. One such example is Garvan and Ghosh (1997), where such priors were found for general dispersion models as given in Jorgensen (1997). It may be worthwhile developing these priors in the presence of nuisance parameters for other discrete cases as well, for example when the parameter of interest is the difference of two binomial proportions, or the log-odds ratio in a 2 × 2 contingency table.

Having argued so strongly in favor of matching priors, I wonder, though, whether there is any special need for such priors in this particular problem of binomial proportions. It appears that any Beta(a, a) prior will do well in this case. As noted in this paper, by shrinking the MLE X/n toward the prior mean 1/2, one achieves a better centering for the construction of confidence intervals. The two diametrically opposite priors Beta(2, 2) (symmetric concave with maximum at 1/2, which provides the Agresti–Coull interval) and the Jeffreys prior Beta(1/2, 1/2) (symmetric convex with minimum at 1/2) seem to be equally good for recentering. Indeed, I wonder whether any Beta(α, β) prior which shrinks the MLE toward the prior mean α/(α + β) becomes appropriate for recentering.

The problem of construction of confidence intervals for binomial proportions occurs in first courses in statistics as well as in day-to-day consulting. While I am strongly in favor of replacing Wald intervals by the new ones for the latter, I am not quite sure how easy it will be to motivate these new intervals for the former. The notion of shrinking can be explained adequately only to a few strong students in introductory statistics courses. One possible solution for the classroom may be to bring in the notion of continuity correction and somewhat heuristically ask students to work with (X + 1/2, n − X + 1/2) instead of (X, n − X). In this way, one centers around (X + 1/2)/(n + 1), à la Jeffreys prior.
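The point about general Beta(a, a) priors is easy to experiment with: the posterior under a Beta(a, a) prior is Beta(X + a, n − X + a), so the equal-tailed credible interval recenters the estimate at (X + a)/(n + 2a); a = 1/2 recovers the equal-tailed Jeffreys interval. A minimal sketch (ours):

from scipy.stats import beta


def beta_prior_interval(x, n, a=0.5, conf=0.95):
    # equal-tailed posterior interval under a Beta(a, a) prior
    alpha = 1 - conf
    lo = beta.ppf(alpha / 2, x + a, n - x + a)
    hi = beta.ppf(1 - alpha / 2, x + a, n - x + a)
    return lo, hi


for a in (0.5, 2.0):      # Jeffreys prior versus the Beta(2, 2) prior mentioned above
    print(a, beta_prior_interval(7, 20, a=a))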


Comment
Thomas J. Santner

I thank the authors for their detailed look at a well-studied problem. For the Wald binomial p interval, there has not been an appreciation of the long persistence (in n) of p locations having substantially deficient achieved coverage compared with the nominal coverage. Figure 1 is indeed a picture that says a thousand words. Similarly, the asymptotic lower limit in Theorem 1 for the minimum coverage of the Wald interval is an extremely useful analytic tool to explain this phenomenon, although other authors have given fixed-p approximations of the coverage probability of the Wald interval (e.g., Theorem 1 of Ghosh, 1979).

Thomas J. Santner is Professor, Ohio State University, 404 Cockins Hall, 1958 Neil Avenue, Columbus, Ohio 43210 (e-mail: [email protected]).

My first set of comments concern the specific binomial problem that the authors address and then the implications of their work for other important discrete data confidence interval problems.

The results in Ghosh (1979) complement the calculations of Brown, Cai and DasGupta (BCD) by pointing out that the Wald interval is "too long" in addition to being centered at the "wrong" value (the MLE as opposed to a Bayesian point estimate such as is used by the Agresti–Coull interval). His Table 3 lists the probability that the Wald interval is longer than the Wilson interval for a central set of p values (from 0.20 to 0.80) and a range of sample sizes n from 20 to 200. Perhaps surprisingly, in view of its inferior coverage characteristics, the Wald interval tends to be longer than the Wilson interval with very high probability. Hence the Wald interval is both too long and centered at the wrong place. This is a dramatic effect of the skewness that BCD mention.

When discussing any system of intervals, one is concerned with the consistency of the answers given by the interval across multiple uses by a single researcher or by groups of users. Formally, this is the reason why various symmetry properties are required of confidence intervals. For example, in the present case, requiring that the p interval [L(X), U(X)] satisfy the symmetry property

[L(x), U(x)] = [1 − U(n − x), 1 − L(n − x)]  (1)

for x ∈ {0, ..., n} shows that investigators who reverse their definitions of success and failure will be consistent in their assessment of the likely values for p. Symmetry (1) is the minimal requirement of a binomial confidence interval. The Wilson and equal-tailed Jeffreys intervals advocated by BCD satisfy the symmetry property (1) and have coverage that is centered (when coverage is plotted versus true p) about the nominal value. They are also straightforward to motivate, even for elementary students, and simple to compute for the outcome of interest.

However, regarding p confidence intervals as the inversion of a family of acceptance regions corresponding to size α tests of H₀: p = p₀ versus H_A: p ≠ p₀ for 0 < p₀ < 1 has some substantial advantages. Indeed, Brown et al. mention this inversion technique when they remark on the desirable properties of intervals formed by inverting likelihood ratio test acceptance regions of H₀ versus H_A. In the binomial case, the acceptance region of any reasonable test of H₀: p = p₀ is of the form [L_{p₀}, U_{p₀}]. These acceptance regions invert to intervals if and only if L_{p₀} and U_{p₀} are nondecreasing in p₀ (otherwise the inverted p confidence set can be a union of intervals). Of course, there are many families of size α tests that meet this nondecreasing criterion for inversion, including the very conservative test used by Clopper and Pearson (1934). For the binomial problem, Blyth and Still (1983) constructed a set of confidence intervals by selecting among size α acceptance regions those that possessed additional symmetry properties and were "small" (leading to short confidence intervals). For example, they desired that the interval should "move to the right" as x increases when n is fixed and should "move to the left" as n increases when x is fixed. They also asked that their system of intervals increase monotonically in the coverage probability for fixed x and n in the sense that the higher nominal coverage interval contain the lower nominal coverage interval.

In addition to being less intuitive to unsophisticated statistical consumers, systems of confidence intervals formed by inversion of acceptance regions also have two other handicaps that have hindered their rise in popularity. First, they typically require that the confidence interval (essentially) be constructed for all possible outcomes, rather than merely the response of interest. Second, their rather brute force character means that a specialized computer program must be written to produce the acceptance sets and their inversion (the intervals).


Fig. 1. Coverage of nominal 95% symmetric Duffy–Santner p intervals for n = 20 (bottom panel) and n = 50 (top panel).

However, the benefits of having reasonably short and suitably symmetric confidence intervals are sufficient that such intervals have been constructed for several frequently occurring problems of biostatistics. For example, Jennison and Turnbull (1983) and Duffy and Santner (1987) present acceptance set–inversion confidence intervals (both with available FORTRAN programs to implement their methods) for a binomial p based on data from a multistage clinical trial; Coe and Tamhane (1989) describe a more sophisticated set of repeated confidence intervals for p₁ − p₂ also based on multistage clinical trial data (and give a SAS macro to produce the intervals). Yamagami and Santner (1990) present an acceptance set–inversion confidence interval and FORTRAN program for p₁ − p₂ in the two-sample binomial problem. There are other examples.

To contrast with the intervals whose coverages are displayed in BCD's Figure 5 for n = 20 and n = 50, I formed the multistage intervals of Duffy and Santner that strictly attain the nominal confidence level for all p. The computation was done naively in the sense that the multistage FORTRAN program by Duffy that implements this method was applied using one stage with stopping boundaries arbitrarily set at (a, b) = (0, 1) in the notation of Duffy and Santner, and a small adjustment was made to insure symmetry property (1). (The nonsymmetrical multiple stage stopping boundaries that produce the data considered in Duffy and Santner do not impose symmetry.) The coverages of these systems are shown in Figure 1. To give an idea of computing time, the n = 50 intervals required less than two seconds to compute on my 400 MHz PC. To further facilitate comparison with the intervals whose coverage is displayed in Figure 5 of BCD, I computed the Duffy and Santner intervals for a slightly lower level of coverage, 93.5%, so that the average coverage was about the desired 95% nominal level; the coverage of this system is displayed in Figure 2 on the same vertical scale and compares favorably. It is possible to call the FORTRAN program that makes these intervals within SPLUS, which makes for convenient data analysis.

I wish to mention that there are a number of other small sample interval estimation problems of continuing interest to biostatisticians that may well have very reasonable small sample solutions based on analogs of the methods that BCD recommend.


Fig. 2. Coverage of nominal 93.5% symmetric Duffy–Santner p intervals for n = 50.

Most of these would be extremely difficult to handle by the more brute force method of inverting acceptance sets. The first of these is the problem of computing simultaneous confidence intervals for p₀ − p_i, 1 ≤ i ≤ T, that arises in comparing a control binomial distribution with T treatment ones. The second concerns forming simultaneous confidence intervals for p_i − p_j, the cell probabilities of a multinomial distribution. In particular, the equal-tailed Jeffreys prior approach recommended by the authors has strong appeal for both of these problems.

Finally, I note that the Wilson intervals seem to have received some recommendation as the method of choice in other elementary texts. In his introductory texts, Larson (1974) introduces the Wilson interval as the method of choice, although he makes the vague, and indeed false, statement, as BCD show, that the user can use the Wald interval if "n is large enough." One reviewer of Santner (1998), an article that showed the coverage virtues of the Wilson interval compared with Wald-like intervals advocated by another author in the magazine Teaching Statistics (written for high school teachers), commented that the Wilson method was the "standard" method taught in the U.K.

Rejoinder
Lawrence D. Brown, T. Tony Cai and Anirban DasGupta

We deeply appreciate the many thoughtful and constructive remarks and suggestions made by the discussants of this paper. The discussion suggests that we were able to make a convincing case that the often-used Wald interval is far more problematic than previously believed. We are happy to see a consensus that the Wald interval deserves to be discarded, as we have recommended. It is not surprising to us to see disagreement over the specific alternative(s) to be recommended in place of this interval. We hope the continuing debate will add to a greater understanding of the problem, and we welcome the chance to contribute to this debate.

A. It seems that the primary source of disagreement is based on differences in interpretation of the coverage goals for confidence intervals. We will begin by presenting our point of view on this fundamental issue. We will then turn to a number of other issues, as summarized in the following list:

B. Simplicity is important.
C. Expected length is also important.
D. Santner's proposal.
E. Should a continuity correction be used?
F. The Wald interval also performs poorly in other problems.
G. The two-sample binomial problem.
H. Probability-matching procedures.
I. Results from asymptotic theory.

A. Professors Casella, Corcoran and Mehta come out in favor of making coverage errors always fall only on the conservative side. This is a traditional point of view. However, we have taken a different perspective in our treatment. It seems more consistent with contemporary statistical practice to expect that a γ% confidence interval should cover the true value approximately γ% of the time. The approximation should be built on sound, relevant statistical calculations, and it should be as accurate as the situation allows.

We note in this regard that most statistical models are only felt to be approximately valid as representations of the true situation. Hence the resulting coverage properties from those models are at best only approximately accurate. Furthermore, a broad range of modern procedures is supported only by asymptotic or Monte Carlo calculations, and so again coverage can at best only be approximately the nominal value. As statisticians we do the best within these constraints to produce procedures whose coverage comes close to the nominal value. In these contexts, when we claim γ% coverage we clearly intend to convey that the coverage is close to γ%, rather than to guarantee it is at least γ%.

We grant that the binomial model has a somewhat special character relative to this general discussion. There are practical contexts where one can feel confident this model holds with very high precision. Furthermore, asymptotics are not required in order to construct practical procedures or evaluate their properties, although asymptotic calculations can be useful in both regards. But the discreteness of the problem introduces a related barrier to the construction of satisfactory procedures. This forces one to again decide whether γ% should mean "approximately γ%," as it does in most other contemporary applications, or "at least γ%," as can be obtained with the Blyth–Still procedure or the Clopper–Pearson procedure. An obvious price of the latter approach is in its decreased precision, as measured by the increased expected length of the intervals.

B. All the discussants agree that elementary motivation and simplicity of computation are important attributes in the classroom context. We of course agree. If these considerations are paramount, then the Agresti–Coull procedure is ideal. If the need for simplicity can be relaxed even a little, then we prefer the Wilson procedure: it is only slightly harder to compute, its coverage is clearly closer to the nominal value across a wider range of values of p, and it can be easier to motivate since its derivation is totally consistent with Neyman–Pearson theory. Other procedures such as Jeffreys or the mid-P Clopper–Pearson interval become plausible competitors whenever computer software can be substituted for the possibility of hand derivation and computation.
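To make the comparison concrete, here is a minimal Python sketch of the three closed-form procedures named above; it is an editorial illustration rather than code from BCD or the discussants, and the sample values (x = 2 successes in n = 25 trials) are hypothetical. Note that the Wald lower endpoint can fall below zero for small x, the same pathology behind the negative-count margins of error documented elsewhere in this memo.

```python
import math

def wald_interval(x, n, z=1.959964):
    """Standard Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(x, n, z=1.959964):
    """Wilson (score) interval, obtained by inverting the score test."""
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

def agresti_coull_interval(x, n, z=1.959964):
    """Agresti-Coull interval: add z^2/2 successes and z^2/2 failures,
    then apply the Wald form to the adjusted counts."""
    n_tilde = n + z**2
    p_tilde = (x + z**2 / 2) / n_tilde
    half = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half

if __name__ == "__main__":
    x, n = 2, 25   # hypothetical small sample
    for name, fn in [("Wald", wald_interval), ("Wilson", wilson_interval),
                     ("Agresti-Coull", agresti_coull_interval)]:
        lo, hi = fn(x, n)
        print(f"{name:14s} [{lo:.3f}, {hi:.3f}]")
```

For this hypothetical sample the Wald lower endpoint is negative, while the Wilson lower endpoint stays above zero.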

Corcoran and Mehta take a rather extreme position when they write, "Simplicity and ease of computation have no role to play in statistical practice [italics ours]." We agree that the ability to perform computations by hand should be of little, if any, relevance in practice. But conceptual simplicity, parsimony and consistency with general theory remain important secondary conditions to choose among procedures with acceptable coverage and precision.

sion of Santner’s Blyth–Still proposal. They alsoleave us feeling somewhat ambivalent about theboundary-modified procedures we have presented inour Section 4.1. Agresti and Coull correctly implythat other boundary corrections could have beentried and that our choice is thus somewhat ad hoc.(The correction to Wilson can perhaps be defendedon the principle of substituting a Poisson approx-imation for a Gaussian one where the former isclearly more accurate; but we see no such funda-mental motivation for our correction to the Jeffreysinterval.)C. Several discussants commented on the pre-

C. Several discussants commented on the precision of various proposals in terms of the expected length of the resulting intervals. We strongly concur that precision is the important balancing criterion vis-a-vis coverage. We wish only to note that there exist other measures of precision than interval expected length. In particular, one may investigate the probability of covering wrong values.


In a charming identity worth noting, Pratt (1961) shows the connection of this approach to that of expected length. Calculations on coverage of wrong values of p in the binomial case will be presented in DasGupta (2001). This article also discusses a number of additional issues and presents further analytical calculations, including a Pearson tilting similar to the chi-square tilts advised in Hall (1983).

Corcoran and Mehta's Figure 2 compares the average length of three of our proposals with Blyth–Still and with their likelihood ratio procedure. We note first that their LR procedure is not the same as ours. Theirs is based on numerically computed exact percentiles of the fixed-sample likelihood ratio statistic. We suspect this is roughly equivalent to adjustment of the chi-squared percentile by a Bartlett correction. Ours is based on the traditional asymptotic chi-squared formula for the distribution of the likelihood ratio statistic. Consequently, their procedure has conservative coverage, whereas ours has coverage fluctuating around the nominal value. They assert that the difference in expected length is "negligible." How much difference qualifies as negligible is an arguable, subjective evaluation. But we note that in their plot their intervals can be on average about 8% or 10% longer than the Jeffreys or Wilson intervals, respectively. This seems to us a nonnegligible difference. Actually, we suspect their preference for their LR and BSC intervals rests primarily on their overriding preference for conservativity in coverage whereas, as we have discussed above, our intervals are designed to attain approximately the desired nominal value.

D. Santner proposes an interesting variant of the

original Blyth–Still proposal. As we understand it,

he suggests producing nominal γ% intervals by constructing the γ*% Blyth–Still intervals, with γ*% chosen so that the average coverage of the resulting intervals is approximately the nominal value, γ%. The coverage plot for this procedure compares well with that for Wilson or Jeffreys in our Figure 5. Perhaps the expected interval length for this procedure also compares well, although Santner does not say so. However, we still do not favor his proposal. It is conceptually more complicated and requires a specially designed computer program, particularly if one wishes to compute γ*% with any degree of accuracy. It thus fails with respect to the criterion of scientific parsimony in relation to other proposals that appear to have at least competitive performance characteristics.

E. Casella suggests the possibility of performing a continuity correction on the score statistic prior to constructing a confidence interval. We do not agree with this proposal from any perspective. These "continuity-corrected Wilson" intervals have extremely conservative coverage properties, though they may not in principle be guaranteed to be everywhere conservative. But even if one's goal, unlike ours, is to produce conservative intervals, these intervals will be very inefficient at their nominal level relative to Blyth–Still or even Clopper–Pearson. In Figure 1 below, we plot the coverage of the Wilson interval with and without a continuity correction for n = 25 and α = 0.05, and the corresponding expected lengths. It seems clear that the loss in precision more than neutralizes the improvements in coverage and that the nominal coverage of 95% is misleading from any perspective.

Fig. 1. Comparison of the coverage probabilities and expected lengths of the Wilson (dotted) and continuity-corrected Wilson (solid) intervals for n = 25 and α = 0.05.
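The coverage and expected-length calculations behind a plot like Figure 1 require no asymptotics: for a fixed true p they are finite sums over the n + 1 possible outcomes. The following sketch is an editorial illustration (it repeats the wilson_interval definition from the earlier snippet so it is self-contained) and works for any interval procedure passed in as a function.

```python
import math

def wilson_interval(x, n, z=1.959964):
    # Wilson (score) interval, as in the earlier sketch.
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

def binom_pmf(x, n, p):
    """Binomial probability mass function, computed directly."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def coverage_and_expected_length(interval_fn, n, p):
    """Exact coverage probability and expected length at a fixed true p,
    obtained by summing over all n + 1 possible outcomes x."""
    coverage, expected_length = 0.0, 0.0
    for x in range(n + 1):
        lo, hi = interval_fn(x, n)
        weight = binom_pmf(x, n, p)
        if lo <= p <= hi:
            coverage += weight
        expected_length += weight * (hi - lo)
    return coverage, expected_length

# Example: evaluate the Wilson interval at n = 25 over a small grid of p values.
for p in (0.05, 0.1, 0.3, 0.5):
    cov, length = coverage_and_expected_length(wilson_interval, 25, p)
    print(f"p = {p:.2f}  coverage = {cov:.3f}  expected length = {length:.3f}")
```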


Fig. 2. Comparison of the systematic coverage biases. The y-axis is nS_n(p). From top to bottom: the systematic coverage biases of the Agresti–Coull, Wilson, Jeffreys, likelihood ratio and Wald intervals, with n = 50 and α = 0.05.

F. Agresti and Coull ask if the dismal performance of the Wald interval manifests itself in other problems, including nondiscrete cases. Indeed it does. In other lattice cases such as the Poisson and negative binomial, both the considerable negative coverage bias and inefficiency in length persist. These features also show up in some continuous exponential family cases. See Brown, Cai and DasGupta (2000b) for details.

In the three important discrete cases, the binomial, Poisson and negative binomial, there is in fact some conformity in regard to which methods work well in general. Both the likelihood ratio interval (using the asymptotic chi-squared limits) and the equal-tailed Jeffreys interval perform admirably in all of these problems with regard to coverage and expected length. Perhaps there is an underlying theoretical reason for the parallel behavior of these two intervals constructed from very different foundational principles, and this seems worth further study.
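For reference, the equal-tailed Jeffreys interval discussed here reduces to two Beta quantiles. The sketch below is an editorial illustration that assumes SciPy is available for the quantile function and follows the common convention of pinning the lower limit to 0 when x = 0 and the upper limit to 1 when x = n.

```python
from scipy.stats import beta

def jeffreys_interval(x, n, alpha=0.05):
    """Equal-tailed Jeffreys interval: the alpha/2 and 1 - alpha/2 quantiles
    of the Beta(x + 1/2, n - x + 1/2) posterior under the Jeffreys prior."""
    lower = 0.0 if x == 0 else beta.ppf(alpha / 2, x + 0.5, n - x + 0.5)
    upper = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5)
    return lower, upper

print(jeffreys_interval(2, 25))
```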

G. Some discussants very logically inquire about the situation in the two-sample binomial problem. Curiously, in a way, the Wald interval in the two-sample case for the difference of proportions is less problematic than in the one-sample case. It can nevertheless be somewhat improved. Agresti and Caffo (2000) present a proposal for this problem, and Brown and Li (2001) discuss some others.
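A sketch of the two-sample case follows: the unadjusted Wald interval for p_1 − p_2 alongside an adjusted version in the spirit of Agresti and Caffo (2000), which, as commonly described, adds one success and one failure to each sample before applying the Wald form. This is an editorial illustration with made-up counts; consult the cited paper for the authors' exact recommendation.

```python
import math

def wald_diff_interval(x1, n1, x2, n2, z=1.959964):
    """Unadjusted Wald interval for the difference of proportions p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    half = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - half, (p1 - p2) + half

def adjusted_diff_interval(x1, n1, x2, n2, z=1.959964):
    """Adjusted interval: add one success and one failure to each sample,
    then apply the Wald form to the adjusted proportions."""
    m1, m2 = n1 + 2, n2 + 2
    p1, p2 = (x1 + 1) / m1, (x2 + 1) / m2
    half = z * math.sqrt(p1 * (1 - p1) / m1 + p2 * (1 - p2) / m2)
    return (p1 - p2) - half, (p1 - p2) + half

print(wald_diff_interval(1, 20, 4, 22))
print(adjusted_diff_interval(1, 20, 4, 22))
```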

H. The discussion by Ghosh raises several interesting issues. The definition of "first-order probability matching" extends in the obvious way to any set of upper confidence limits, not just those corresponding to Bayesian intervals. There is also an obvious extension to lower confidence limits. This probability matching is a one-sided criterion. Thus a family of two-sided intervals (L_n, U_n) will be first-order probability matching if

Pr_p(p ≤ L_n) = α/2 + o(n^(-1/2)) = Pr_p(p ≥ U_n).

As Ghosh notes, this definition cannot usefully be literally applied to the binomial problem here, because the asymptotic expansions always have a discrete oscillation term that is O(n^(-1/2)). However, one can correct the definition.

One way to do so involves writing asymptotic expressions for the probabilities of interest that can be divided into a "smooth" part, S, and an "oscillating" part, Osc, that averages to O(n^(-3/2)) with respect to any smooth density supported within (0, 1). Readers could consult BCD (2000a) for more details about such expansions. Thus, in much generality one could write

Pr_p(p ≤ L_n) = α/2 + S_{L_n}(p) + Osc_{L_n}(p) + O(n^(-1)),    (1)

where S_{L_n}(p) = O(n^(-1/2)) and Osc_{L_n}(p) has the property informally described above. We would then say that the procedure is first-order probability matching if S_{L_n}(p) = o(n^(-1/2)), with an analogous expression for the upper limit, U_n.

In this sense the equal-tail Jeffreys procedure is probability matching. We believe that the mid-P Clopper–Pearson intervals also have this asymptotic property. But several of the other proposals, including the Wald, the Wilson and the likelihood ratio intervals, are not first-order probability matching. See Cai (2001) for exact and asymptotic calculations on one-sided confidence intervals and hypothesis testing in the discrete distributions.


The failure of this one-sided, first-order property, however, has no obvious bearing on the coverage properties of the two-sided procedures considered in the paper. That is because, for any of our procedures,

S_{L_n}(p) + S_{U_n}(p) = 0 + O(n^(-1)),    (2)

even when the individual terms on the left are only O(n^(-1/2)). All the procedures thus make compensating one-sided errors, to O(n^(-1)), even when they are not accurate to this degree as one-sided procedures.

This situation raises the question as to whether it is desirable to add, as a secondary criterion for two-sided procedures, that they also provide accurate one-sided statements, at least to the probability-matching O(n^(-1/2)). While Ghosh argues strongly for the probability matching property, his argument does not seem to take into account the cancellation inherent in (2). We have heard some others argue in favor of such a requirement and some argue against it. We do not wish to take a strong position on this issue now. Perhaps it depends somewhat on the practical context: if in that context the confidence bounds may be interpreted and used in a one-sided fashion as well as the two-sided one, then perhaps probability matching is called for.

I. Ghosh's comments are a reminder that asymptotic theory is useful for this problem, even though exact calculations here are entirely feasible and convenient. But, as Ghosh notes, asymptotic expressions can be startlingly accurate for moderate sample sizes. Asymptotics can thus provide valid insights that are not easily drawn from a series of exact calculations. For example, the two-sided intervals also obey an expression analogous to (1),

Pr_p(L_n ≤ p ≤ U_n) = 1 − α + S_n(p) + Osc_n(p) + O(n^(-3/2)).    (3)

The term S_n(p) is O(n^(-1)) and provides a useful expression for the smooth center of the oscillatory coverage plot. (See Theorem 6 of BCD (2000a) for a precise justification.) The preceding plot for n = 50 compares S_n(p) for five confidence procedures. It shows how the Wilson, Jeffreys and chi-squared likelihood ratio procedures all have coverage that well approximates the nominal value, with Wilson being slightly more conservative than the other two.

As we see it, our article articulated three primary goals: to demonstrate unambiguously that the Wald interval performs extremely poorly; to point out that none of the common prescriptions on when the interval is satisfactory are correct; and to put forward some recommendations on what is to be used in its place. On the basis of the discussion we feel gratified that we have satisfactorily met the first two of these goals. As Professor Casella notes, the debate about alternatives in this timeless problem will linger on, as it should. We thank the discussants again for a lucid and engaging discussion of a number of relevant issues. We are grateful for the opportunity to have learned so much from these distinguished colleagues.

ADDITIONAL REFERENCES

Agresti, A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. Amer. Statist. 54. To appear.
Aitkin, M., Anderson, D., Francis, B. and Hinde, J. (1989). Statistical Modelling in GLIM. Oxford Univ. Press.
Boos, D. D. and Hughes-Oliver, J. M. (2000). How large does n have to be for Z and t intervals? Amer. Statist. 54 121–128.
Brown, L. D., Cai, T. and DasGupta, A. (2000a). Confidence intervals for a binomial proportion and asymptotic expansions. Ann. Statist. To appear.
Brown, L. D., Cai, T. and DasGupta, A. (2000b). Interval estimation in exponential families. Technical report, Dept. Statistics, Univ. Pennsylvania.
Brown, L. D. and Li, X. (2001). Confidence intervals for the difference of two binomial proportions. Unpublished manuscript.
Cai, T. (2001). One-sided confidence intervals and hypothesis testing in discrete distributions. Preprint.
Coe, P. R. and Tamhane, A. C. (1993). Exact repeated confidence intervals for Bernoulli parameters in a group sequential clinical trial. Controlled Clinical Trials 14 19–29.
Cox, D. R. and Reid, N. (1987). Orthogonal parameters and approximate conditional inference (with discussion). J. Roy. Statist. Soc. Ser. B 49 113–147.
DasGupta, A. (2001). Some further results in the binomial interval estimation problem. Preprint.
Datta, G. S. and Ghosh, M. (1996). On the invariance of noninformative priors. Ann. Statist. 24 141–159.
Duffy, D. and Santner, T. J. (1987). Confidence intervals for a binomial parameter based on multistage tests. Biometrics 43 81–94.
Fisher, R. A. (1956). Statistical Methods for Scientific Inference. Oliver and Boyd, Edinburgh.
Gart, J. J. (1966). Alternative analyses of contingency tables. J. Roy. Statist. Soc. Ser. B 28 164–179.
Garvan, C. W. and Ghosh, M. (1997). Noninformative priors for dispersion models. Biometrika 84 976–982.
Ghosh, J. K. (1994). Higher Order Asymptotics. IMS, Hayward, CA.
Hall, P. (1983). Chi-squared approximations to the distribution of a sum of independent random variables. Ann. Statist. 11 1028–1036.
Jennison, C. and Turnbull, B. W. (1983). Confidence intervals for a binomial parameter following a multistage test with application to MIL-STD 105D and medical trials. Technometrics 25 49–58.
Jorgensen, B. (1997). The Theory of Dispersion Models. CRC Chapman and Hall, London.
Laplace, P. S. (1812). Theorie Analytique des Probabilites. Courcier, Paris.
Larson, H. J. (1974). Introduction to Probability Theory and Statistical Inference, 2nd ed. Wiley, New York.


Pratt, J. W. (1961). Length of confidence intervals. J. Amer. Statist. Assoc. 56 549–567.
Rindskopf, D. (2000). Letter to the editor. Amer. Statist. 54 88.
Rubin, D. B. and Schenker, N. (1987). Logit-based interval estimation for binomial data using the Jeffreys prior. Sociological Methodology 17 131–144.
Sterne, T. E. (1954). Some remarks on confidence or fiducial limits. Biometrika 41 275–278.
Tibshirani, R. (1989). Noninformative priors for one parameter of many. Biometrika 76 604–608.
Welch, B. L. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. Ser. B 25 318–329.
Yamagami, S. and Santner, T. J. (1993). Invariant small sample confidence intervals for the difference of two success probabilities. Comm. Statist. Simul. Comput. 22 33–59.


Appendix 4.


B03002. HISPANIC OR LATINO ORIGIN BY RACE - Universe: TOTAL POPULATION
Data Set: 2005-2009 American Community Survey 5-Year Estimates
Survey: American Community Survey

NOTE. Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns and estimates of housing units for states and counties.

For information on confidentiality protection, sampling error, nonsampling error, and definitions, see Survey Methodology.

Autauga County, Alabama                                              Estimate    Margin of Error
Total:                                                                 49,584    *****
Not Hispanic or Latino:                                                48,572    *****
  White alone                                                          38,636    +/-43
  Black or African American alone                                       8,827    +/-83
  American Indian and Alaska Native alone                                 183    +/-89
  Asian alone                                                             309    +/-94
  Native Hawaiian and Other Pacific Islander alone                          0    +/-119
  Some other race alone                                                    56    +/-45
  Two or more races:                                                      561    +/-157
    Two races including Some other race                                     0    +/-119
    Two races excluding Some other race, and three or more races          561    +/-157
Hispanic or Latino:                                                     1,012    *****
  White alone                                                             579    +/-115
  Black or African American alone                                          27    +/-30
  American Indian and Alaska Native alone                                  10    +/-17
  Asian alone                                                               0    +/-119
  Native Hawaiian and Other Pacific Islander alone                          0    +/-119
  Some other race alone                                                   280    +/-109
  Two or more races:                                                      116    +/-87
    Two races including Some other race                                    56    +/-58
    Two races excluding Some other race, and three or more races           60    +/-68

Source: U.S. Census Bureau, 2005-2009 American Community Survey

Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted roughly as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.
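The arithmetic in that paragraph can be made explicit with a short sketch (an editorial illustration, not Census Bureau code): dividing a published 90 percent margin of error by 1.645 gives the approximate standard error, and the estimate minus and plus the margin of error gives the published confidence bounds. Applied to the "American Indian and Alaska Native alone" row under "Hispanic or Latino" in the table above (estimate 10, MOE +/-17), the lower bound is a negative count of people, which is precisely the problem this memo raises.

```python
def moe_to_standard_error(moe, z90=1.645):
    """Approximate standard error implied by a published 90 percent MOE."""
    return moe / z90

def confidence_bounds(estimate, moe):
    """Lower and upper 90 percent confidence bounds as described above."""
    return estimate - moe, estimate + moe

estimate, moe = 10, 17   # from the Autauga County table above
print(moe_to_standard_error(moe))        # roughly 10.3
print(confidence_bounds(estimate, moe))  # (-7, 27): a negative count of people
```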

While the 2005-2009 American Community Survey (ACS) data generally reflect the November 2008 Office of Management and Budget (OMB) definitions of metropolitan and micropolitan statistical areas, in certain instances the names, codes, and boundaries of the principal cities shown in ACS tables may differ from the OMB definitions due to differences in the effective dates of the geographic entities.

Estimates of urban and rural population, housing units, and characteristics reflect boundaries of urban areas defined based on Census 2000 data. Boundaries for urban areas have not been updated since Census 2000. As a result, data for urban and rural areas from the ACS do not necessarily reflect the results of ongoing urbanization.

Explanation of Symbols:
1. An '**' entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
2. An '-' entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
3. An '-' following a median estimate means the median falls in the lowest interval of an open-ended distribution.
4. An '+' following a median estimate means the median falls in the upper interval of an open-ended distribution.
5. An '***' entry in the margin of error column indicates that the median falls in the lowest interval or upper interval of an open-ended distribution. A statistical test is not appropriate.
6. An '*****' entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate.

Standard Error/Variance documentation for this dataset: Accuracy of the Data


Appendix 5


Chapter 13. Preparation and Review of Data Products

13.1 OVERVIEW

This chapter discusses the data products derived from the American Community Survey (ACS). ACS data products include the tables, reports, and files that contain estimates of population and housing characteristics. These products cover geographic areas within the United States and Puerto Rico. Tools such as the Public Use Microdata Sample (PUMS) files, which enable data users to create their own estimates, also are data products.

ACS data products will continue to meet the traditional needs of those who used the decennial census long-form sample estimates. However, as described in Chapter 14, Section 3, the ACS will provide more current data products than those available from the census long form, an especially important advantage toward the end of a decade.

Most surveys of the population provide sufficient samples to support the release of data products only for the nation, the states, and, possibly, a few substate areas. Because the ACS is a very large survey that collects data continuously in every county, products can be released for many types of geographic areas, including many smaller geographic areas such as counties, townships, and census tracts. For this reason, geography is an important topic for all ACS data products.

The first step in the preparation of a data product is defining the topics and characteristics it will cover. Once the initial characteristics are determined, they must be reviewed by the Census Bureau Disclosure Review Board (DRB) to ensure that individual responses will be kept confidential. Based on this review, the specifications of the products may be revised. The DRB also may require that the microdata files be altered in certain ways, and may restrict the population size of the geographic areas for which these estimates are published. These activities are collectively referred to as disclosure avoidance.

The actual processing of the data products cannot begin until all response records for a given year or years are edited and imputed in the data preparation and processing phases, the final weights are determined, and disclosure avoidance techniques are applied. Using the weights, the sample data are tabulated for a wide variety of characteristics according to the predetermined content. These tabulations are done for the geographic areas that have a sample size sufficient to support statistically reliable estimates, with the exception of 5-year period estimates, which will be available for small geographic areas down to the census tract and block group levels. The PUMS data files are created by different processes because the data are a subset of the full sample data.

After the estimates are produced and verified for correctness, Census Bureau subject matter analysts review them. When the estimates have passed the final review, they are released to the public. A similar process of review and public release is followed for PUMS data.

While the 2005 ACS sample was limited to the housing unit (HU) population for the United States and Puerto Rico, starting in sample year 2006, the ACS was expanded to include the group quarters (GQ) population. Therefore, the ACS sample is representative of the entire resident population in the United States and Puerto Rico. In 2007, 1-year period estimates for the total population and subgroups of the total population in both the United States and Puerto Rico were released for sample year 2006. Similarly, in 2008, 1-year period estimates were released for sample year 2007.

In 2008, the Census Bureau will, for the first time, release products based on 3 years of ACS sample, 2005 through 2007. In 2010, the Census Bureau plans to release the first products based on 5 years of consecutive ACS samples, 2005 through 2009. Since several years of samples form the basis of these multiyear products, reliable estimates can be released for much smaller geographic areas than is possible for products based on single-year data.


In addition to data products regularly released to the public, other data products may be requested by government agencies, private organizations and businesses, or individuals. To accommodate such requests, the Census Bureau operates a custom tabulations program for the ACS on a fee basis. These tabulation requests are reviewed by the DRB to assure protection of confidentiality before release.

Chapter 14 describes the dissemination of the data products discussed in this chapter, including display of products on the Census Bureau's Web site and topics related to data file formatting.

13.2 GEOGRAPHY

The Census Bureau strives to provide products for the geographic areas that are most useful to users of those data. For example, ACS data products are already disseminated for many of the nation's legal and administrative entities, including states, American Indian and Alaska Native (AIAN) areas, counties, minor civil divisions (MCDs), incorporated places, and congressional districts, as well as for a variety of other geographic entities. In cooperation with state and local agencies, the Census Bureau identifies and delineates geographic entities referred to as "statistical areas." These include regions, divisions, urban areas (UAs), census county divisions (CCDs), census designated places (CDPs), census tracts, and block groups. Data users then can select the geographic entity or set of entities that most closely represent their geographic areas of interest and needs.

"Geographic summary level" is a term used by the Census Bureau to designate the different geographic levels or types of geographic areas for which data are summarized. Examples include the entities described above, such as states, counties, and places (the Census Bureau's term for entities such as cities and towns, including unincorporated areas). Information on the types of geographic areas for which the Census Bureau publishes data is available at <http://www.census.gov/geo/www/garm.html>.

Single-year period estimates of ACS data are published annually for recognized legal, administrative, or statistical areas with populations of 65,000 or more (based on the latest Census Bureau population estimates). Three-year period estimates based on 3 successive years of ACS samples are published for areas of 20,000 or more. If a geographic area met the 1-year or 3-year threshold for a previous period but dropped below it for the current period, it will continue to be published as long as the population does not drop more than 5 percent below the threshold. Plans are to publish 5-year period estimates based on 5 successive years of ACS samples starting in 2010 with the 2005−2009 data. Multiyear period estimates based on 5 successive years of ACS samples will be published for all legal, administrative, and statistical areas down to the block-group level, regardless of population size. However, there are rules from the Census Bureau's DRB that must be applied.
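The publication thresholds just described can be summarized in a short sketch; this is an editorial restatement of the stated rules, not Census Bureau code.

```python
THRESHOLDS = {"1-year": 65_000, "3-year": 20_000}

def is_published(product, population, previously_published=False):
    """Apply the stated thresholds: 65,000 for 1-year products, 20,000 for
    3-year products, and continued publication for a previously qualifying
    area that drops no more than 5 percent below its threshold."""
    threshold = THRESHOLDS[product]
    if population >= threshold:
        return True
    return previously_published and population >= 0.95 * threshold

print(is_published("1-year", 63_000, previously_published=True))   # True
print(is_published("3-year", 18_000))                               # False
```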

The Puerto Rico Community Survey (PRCS) also provides estimates for legal, administrative, and statistical areas in Puerto Rico. The same rules as described above for the 1-year, 3-year, and 5-year period estimates for the U.S. resident population apply for the PRCS as well.

The ACS publishes annual estimates for hundreds of substate areas, many of which will undergo boundary changes due to annexations, detachments, or mergers with other areas.1 Each year, the Census Bureau's Geography Division, working with state and local governments, updates its files to reflect these boundary changes. Minor corrections to the location of boundaries also can occur as a result of the Census Bureau's ongoing Master Address File (MAF)/Topologically Integrated Geographic Encoding and Referencing (TIGER®) Enhancement Project.

1 The Census Bureau conducts the Boundary and Annexation Survey (BAS) each year. This survey collects information on a voluntary basis from local governments and federally recognized American Indian areas. The information collected includes the correct legal place names, type of government, legal actions that resulted in boundary changes, and up-to-date boundary information. The BAS uses a fixed reference date of January 1 of the BAS year. In years ending in 8, 9, and 0, all incorporated places, all minor civil divisions, and all federally recognized tribal governments are included in the survey. In other years, only governments at or above various population thresholds are contacted. More detailed information on the BAS can be found at <http://www.census.gov/geo/www/bas/bashome.html>.


The ACS estimates must reflect these legal boundary changes, so all estimates are based on Geography Division files that show the geographic boundaries as they existed on January 1 of the sample year or, in the case of multiyear data products, at the beginning of the final year of data collection.

13.3 DEFINING THE DATA PRODUCTS

For the 1999 through 2002 sample years, the ACS detailed tables were designed to be comparable with Census 2000 Summary File 3 to allow comparisons between data from Census 2000 and the ACS. However, when Census 2000 data users indicated certain changes they wanted in many tables, ACS managers saw the years 2003 and 2004 as opportunities to define ACS products based on users' advice.

Once a preliminary version of the revised suite of products had been developed, the Census Bureau asked for feedback on the planned changes from data users (including other federal agencies) via a Federal Register Notice (Fed. Reg. #3510-07-P). The notice requested comments on current and proposed new products, particularly on the basic concept of the product and its usefulness to the data users. Data users provided a wide variety of comments, leading to modifications of planned products.

ACS managers determined the exact form of the new products in time for their use in 2005 for the ACS data release of sample year 2004. This schedule allowed users sufficient time to become familiar with the new products and to provide comments well in advance of the data release for the 2005 sample.

Similarly, a Federal Register Notice issued in August 2007 shared with the public plans for the data release schedule and products that would be available beginning in 2008. This notice was the first that described products for multiyear estimates. Improvements will continue when multiyear period estimates are available.

13.4 DESCRIPTION OF AGGREGATED DATA PRODUCTS

ACS data products can be divided into two broad categories: aggregated data products, and the PUMS, which is described in Section 13.5 ("Public Use Microdata Sample").

Data for the ACS are collected from a sample of housing units (HUs), as well as the GQ population, and are used to produce estimates of the actual figures that would have been obtained by interviewing the entire population. The aggregated data products contain the estimates from the survey responses. Each estimate is created using the sample weights from respondent records that meet certain criteria. For example, the 2007 ACS estimate of people under the age of 18 in Chicago is calculated by adding the weights from all respondent records from interviews completed in 2007 in Chicago with residents under 18 years old.
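A minimal sketch of that weighting logic follows, using a hypothetical record layout and illustrative weights rather than the actual ACS microdata format.

```python
# Each record: (geography, age, final_weight) -- illustrative values only.
records = [
    ("Chicago", 12, 95.2),
    ("Chicago", 45, 101.7),
    ("Chicago", 17, 88.4),
    ("Evanston", 16, 90.1),
]

# Estimate of people under 18 in Chicago: sum the weights of the records
# that satisfy the tabulation criteria.
estimate = sum(w for geo, age, w in records if geo == "Chicago" and age < 18)
print(round(estimate, 1))   # 183.6
```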

This section provides a description of each aggregated product. Each product described is available as single-year period estimates; unless otherwise indicated, they will be available as 3-year estimates and are planned for the 5-year estimates. Chapter 14 provides more detail on the actual appearance and content of each product.

These data products contain all estimates planned for release each year, including those from multiple years of data, such as the 2005−2007 products. Data release rules will prevent certain single- and 3-year period estimates from being released if they do not meet ACS requirements for statistical reliability.

Detailed Tables

The detailed tables provide basic distributions of characteristics. They are the foundation upon which other data products are built. These tables display estimates and the associated lower and upper bounds of the 90 percent confidence interval. They include demographic, social, economic, and housing characteristics, and provide 1-, 3-, or 5-year period estimates for the nation and the states, as well as for counties, towns, and other small geographic entities, such as census tracts and block groups.


The Census Bureau’s goal is to maintain a high degree of comparability between ACS detailedtables and Census 2000 sample-based data products. In addition, characteristics not measured inthe Census 2000 tables will be included in the new ACS base tables. The 2007 detailed table prod-ucts include more than almost 600 tables that cover a wide variety of characteristics, and another380 race and Hispanic-origin iterations that cover 40 key characteristics. In addition to the tableson characteristics, approximately 80 tables summarize allocation rates from the data edits formany of the characteristics. These provide measures of data quality by showing the extent towhich responses to various questionnaire items were complete. Altogether, over 1,300 separatedetailed tables are provided.

Data Profiles

Data profiles are high-level reports containing estimates for demographic, social, economic, and housing characteristics. For a given geographic area, the data profiles include distributions for such characteristics as sex, age, type of household, race and Hispanic origin, school enrollment, educational attainment, disability status, veteran status, language spoken at home, ancestry, income, poverty, physical housing characteristics, occupancy and owner/renter status, and housing value. The data profiles include a 90 percent margin of error for each estimate. Beginning with the 2007 ACS, a comparison profile that compares the 2007 sample year's estimates with those of the 2006 ACS also will be published. These profile reports include the results of a statistical significance test for each previous year's estimate, compared to the current year. This test result indicates whether the previous year's estimate is significantly different (at a 90 percent confidence level) from that of the current year.

Narrative Profiles

Narrative profiles cover the current sample year only. These are easy-to-read, computer-produced profiles that describe main topics from the data profiles for the general-purpose user. These are the only ACS products with no standard errors accompanying the estimates.

Subject Tables

These tables are similar to the Census 2000 quick tables and, like them, are derived from detailed tables. Both quick tables and subject tables are predefined, covering frequently requested information on a single topic for a single geographic area. However, subject tables contain more detail than the Census 2000 quick tables or the ACS data profiles. In general, a subject table contains distributions for a few key universes, such as the race groups and people in various age groups, which are relevant to the topic of the table. The estimates for these universes are displayed as whole numbers. The distribution that follows is displayed in percentages. For example, subject table S1501 on educational attainment provides the estimates for two different age groups (18 to 24 years old, and 25 years and older) as whole numbers. For each age group, these estimates are followed by the percentages of people in different educational attainment categories (high school graduate, college undergraduate degree, etc.). Subject tables also contain other measures, such as medians, and they include the imputation rates for relevant characteristics. More than 40 topic-specific subject tables are released each year.

Ranking Products

Ranking products contain ranked results of many important measures across states. They are produced as 1-year products only, based on the current sample year. The ranked results among the states for each measure are displayed in three ways: charts, tables, and tabular displays that allow for testing statistical significance.

The rankings show approximately 80 selected measures. The data used in ranking products are pulled directly from a detailed table or a data profile for each state.

Geographic Comparison Tables (GCTs)

GCTs contain the same measures that appear in the ranking products. They are produced as both 1-year and multiyear products. GCTs are produced for states as well as for substate entities, such as congressional districts. The results among the geographic entities for each measure are displayed as tables and thematic maps (see next).


Thematic Maps

Thematic maps are similar to ranking tables. They show mapped values for geographic areas at a given geographic summary level. They have the added advantage of visually displaying the geographic variation of key characteristics (referred to as themes). An example of a thematic map would be a map showing the percentage of a population 65 years and older by state.

Selected Population Profiles (SPPs)

SPPs provide certain characteristics from the data profiles for a specific race or ethnic group (e.g., Alaska Natives) or some other selected population group (e.g., people aged 60 years and older). SPPs are provided every year for many of the Census 2000 Summary File 4 iteration groups. SPPs were introduced on a limited basis in the fall of 2005, using the 2004 sample. In 2008 (sample year 2007), this product was significantly expanded. The earlier SPP requirement was that a substate geographic area must have a population of at least 1,000,000 people. This threshold was reduced to 500,000, and congressional districts were added to the list of geographic types that can receive SPPs. Another change to SPPs in 2008 is the addition of many country-of-birth groups.

Groups too small to warrant an SPP for a geographic area based on 1 year of sample data may appear in an SPP based on the 3- or 5-year accumulations of sample data. More details on these profiles can be found in Hillmer (2005), which includes a list of selected race, Hispanic origin, and ancestry populations.

13.5 PUBLIC USE MICRODATA SAMPLE

Microdata are the individual records that contain information collected about each person and HU. PUMS files are extracts from the confidential microdata that avoid disclosure of information about households or individuals. These extracts cover all of the same characteristics contained in the full microdata sample files. Chapter 14 provides information on data and file organization for the PUMS.

The only geography other than state shown on a PUMS file is the Public Use Microdata Area (PUMA). PUMAs are special nonoverlapping areas that partition a state, each containing a population of about 100,000. State governments drew the PUMA boundaries at the time of Census 2000. They were used for the Census 2000 sample PUMS files and are known as the "5 percent PUMAs." (For more information on these geographic areas, go to <http://www.census.gov/prod/cen2000/doc/pums.pdf>.)

The Census Bureau has released a 1-year PUMS file from the ACS since the survey's inception. In addition to the 1-year ACS PUMS file, the Census Bureau plans to create multiyear PUMS files from the ACS sample, starting with the 2005−2007 3-year PUMS file. The multiyear PUMS files combine annual PUMS files to create larger samples in each PUMA, covering a longer period of time. This will allow users to create estimates that are more statistically reliable.

13.6 GENERATION OF DATA PRODUCTS

Following conversations with users of census data, the subject matter analysts in the Census Bureau's Housing and Household Economic Statistics Division and Population Division specify the organization of the ACS data products. These specifications include the logic used to calculate every estimate in each data product and the exact textual description associated with each estimate. Starting with the 2006 ACS data release, only limited changes to these specifications have occurred. Changes to the data product specifications must preserve the ability to compare estimates from one year to another and must be operationally feasible. Changes must be made no later than late winter of each year to ensure that the revised specifications are finalized by the spring of that year and ready for the data releases beginning in the late summer of the year.

After the edited data with the final weights are available (see Chapters 10 and 11), generation of the data products begins with the creation of the detailed tables data products with the 1-year period estimates. The programming teams of the American Community Survey Office (ACSO) generate these estimates. Another staff within ACSO verifies that the estimates comply with the specifications from subject matter analysts. Both the generation and the verification activities are automated.


The 1-year data products are released on a phased schedule starting in the summer. Currently, the Census Bureau plans to release the multiyear data products late each year, after the release of the 1-year products.

One distinguishing feature of the ACS data products system is that standard errors are calculated for all estimates and are released along with the estimates in the tables. Subject matter analysts also use the standard errors in their internal reviews of estimates.

Disclosure Avoidance

Once plans are finalized for the ACS data products, the DRB reviews them to assure that the confidentiality of respondents has been protected.

Title 13 of the United States Code (U.S.C.) is the basis for the Census Bureau's policies on disclosure avoidance. Title 13 says, "Neither the Secretary, nor any other officer or employee of the Department of Commerce may make any publication whereby the data furnished by any particular establishment or individual under this title can be identified . . ." The DRB reviews all data products planned for public release to ensure adherence to Title 13 requirements, and may insist on applying disclosure avoidance rules that could result in the suppression of certain measures for small geographic areas. (More information about the DRB and its policies can be found at <http://www.factfinder.census.gov/jsp/saff/SAFFInfo.jsp?_pageId=su5_confidentiality>.)

To satisfy Title 13 U.S.C., the Census Bureau uses several statistical methodologies during tabulation and data review to ensure that individually identifiable data will not be released.

Swapping. The main procedure used for protecting Census 2000 tabulations was data swapping. It was applied to both short-form (100 percent) and long-form (sample) data independently. Currently, it also is used to protect ACS tabulations. In each case, a small percentage of household records is swapped. Pairs of households in different geographic regions are swapped. The selection process for deciding which households should be swapped is highly targeted to affect the records with the most disclosure risk. Pairs of households that are swapped match on a minimal set of demographic variables. All data products (tables and microdata) are created from the swapped data files.

For PUMS data the following techniques are employed in addition to swapping:

Top-coding is a method of disclosure avoidance in which all cases in or above a certain percentage of the distribution are placed into a single category (a short illustrative sketch follows this list).

Geographic population thresholds prohibit the disclosure of data for individuals or HUs for geographic units with population counts below a specified level.

Age perturbation (modifying the age of household members) is required for large households containing 10 people or more due to concerns about confidentiality.

Detail for categorical variables is collapsed if the number of occurrences in each category does not meet a specified national minimum threshold.
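The sketch below illustrates the top-coding idea from the first item in the list; it is an editorial example with an illustrative cutoff and made-up values, not the Bureau's actual thresholds, and it collapses every value at or above the chosen percentile to the cutoff value.

```python
def top_code(values, percentile=0.975):
    """Collapse all values at or above the chosen percentile into a single
    top category, here represented by the cutoff value itself."""
    ordered = sorted(values)
    cutoff = ordered[int(percentile * (len(ordered) - 1))]
    return [min(v, cutoff) for v in values]

incomes = [18_000, 32_000, 45_000, 61_000, 75_000, 92_000, 110_000, 640_000]
print(top_code(incomes))   # the extreme value 640,000 is replaced by the cutoff
```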

For more information on disclosure avoidance techniques, see Section 5, "Current disclosure avoidance practices," at <http://www.census.gov/srd/papers/pdf/rrs2005-06.pdf>.

The DRB also may determine that certain tables are so detailed that other restrictions are required to ensure that there is sufficient sample to avoid revealing information on individual respondents. In such instances, a restriction may be placed on the size of the geographic area for which the table can be published. Current DRB rules require that detailed tables containing more than 100 detailed cells may not be released below the census tract level.

The data products released in the summer of 2006 for the 2005 sample covered the HU population of the United States and Puerto Rico only. In January 2006, data collection began for the population living in GQ facilities.


Thus, the data products released in summer 2007 (and each year thereafter) covered the entire resident population of the United States and Puerto Rico. Most estimates for person characteristics covered in the data products were affected by this expansion. For the most part, the actual characteristics remained the same, and only the description of the population group changed from HU to resident population.

Data Release Rules

Even with the population size thresholds described earlier, in certain geographic areas some very detailed tables might include estimates with unacceptable reliability. Data release rules, based on the statistical reliability of the survey estimates, were first applied in the 2005 ACS. These release rules apply only to the 1- and 3-year data products.

The main data release rule for the ACS tables works as follows. Every detailed table consists of a series of estimates. Each estimate is subject to sampling variability that can be summarized by its standard error. If more than half of the estimates in the table are not statistically different from 0 (at a 90 percent confidence level), then the table fails. Dividing the standard error by the estimate yields the coefficient of variation (CV) for each estimate. (If the estimate is 0, a CV of 100 percent is assigned.) To implement this requirement for each table at a given geographic area, CVs are calculated for each table's estimates, and the median CV value is determined. If the median CV value for the table is less than or equal to 61 percent, the table passes for that geographic area and is published; if it is greater than 61 percent, the table fails and is not published.
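A sketch of that median-CV test follows (an editorial illustration with made-up cell values, not Census Bureau code); published 90 percent MOEs can be converted to the standard errors it expects by dividing by 1.645.

```python
import statistics

def coefficient_of_variation(estimate, standard_error):
    """CV as described above: SE divided by the estimate, with a CV of
    100 percent assigned when the estimate is 0."""
    return 1.0 if estimate == 0 else standard_error / estimate

def table_passes(cells, cutoff=0.61):
    """cells: (estimate, standard_error) pairs for one table and geography.
    The table is published when the median CV is at or below 61 percent."""
    cvs = [coefficient_of_variation(est, se) for est, se in cells]
    return statistics.median(cvs) <= cutoff

cells = [(8827, 50.5), (561, 95.4), (56, 27.4), (10, 10.3), (0, 72.3)]
print(table_passes(cells))   # True: the median CV here is about 49 percent
```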

Whenever a table fails, a simpler table that collapses some of the detailed lines together can be substituted for the original. If the simpler table passes, it is released. If it fails, none of the estimates for that table and geographic area are released. These release rules are applied to single- and multiyear period estimates based on 3 years of sample data. Current plans are not to apply data release rules to the estimates based on 5 years of sample data.

13.7 DATA REVIEW AND ACCEPTANCE

After the editing, imputation, data products generation, disclosure avoidance, and application of the release rules have been completed, subject matter analysts perform a final review of the ACS data and estimates before release. This final data review and acceptance process helps to ensure that there are no missing values, obvious errors, or other data anomalies.

Each year, the ACS staff and subject matter analysts generate, review, and provide clearance of all ACS estimates. At a minimum, the analysts subject their data to a specific multistep review process before they are cleared and released to the public. Because of the short time available to review such a large amount of data, an automated review tool (ART) has been developed to facilitate the process.

ART is a computer application that enables subject matter analysts to detect statistically significant differences in estimates from one year to the next using several statistical tests. The initial version of ART was used to review 2003 and 2004 data. It featured predesigned reports as well as ad hoc, user-defined queries for hundreds of estimates and for 350 geographic areas. An ART workgroup defined a new version of ART to address several issues that emerged. The improved version has been used by the analysts since June 2005; it is designed to work on much larger data sets, offers a wider range of capabilities, and responds faster to user commands. A team of programmers, analysts, and statisticians then developed an automated tool to assist analysts in their review of the multiyear estimates. This tool was used in 2008 for the review of the 2005−2007 estimates.

The ACSO staff, together with the subject matter analysts, also have developed two other automated tools to facilitate documentation and clearance for required data review process steps: the edit management and messaging application (EMMA), and the PUMS management and messaging application (PMMA). Both are used to track the progress of analysts' review activities, and both enable analysts and managers to see the current status of files under review and determine which review steps can be initiated.


13.8 IMPORTANT NOTES ON MULTIYEAR ESTIMATES

While the types of data products for the multiyear estimates are almost entirely identical to those used for the 1-year estimates, there are several distinctive features of the multiyear estimates that data users must bear in mind.

First, the geographic boundaries that are used for multiyear estimates are always the boundaries as of January 1 of the final year of the period. Therefore, if a geographic area has gained or lost territory during the multiyear period, this practice can have a bearing on the user's interpretation of the estimates for that geographic area.

Secondly, for multiyear period estimates based on monetary characteristics (for example, median earnings), inflation factors are applied to the data to create estimates that reflect the dollar values in the final year of the multiyear period.
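A sketch of that adjustment follows, with hypothetical index values standing in for whatever inflation series the Bureau actually applies; amounts collected in earlier years of the period are re-expressed in final-year dollars.

```python
# Illustrative annual price index values; not the Bureau's official factors.
price_index = {2005: 195.3, 2006: 201.6, 2007: 207.3}

def to_final_year_dollars(amount, collection_year, final_year=2007):
    """Re-express a reported dollar amount in final-year dollars."""
    return amount * price_index[final_year] / price_index[collection_year]

print(round(to_final_year_dollars(30_000, 2005)))   # 30,000 in 2005 dollars, in 2007 dollars
```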

Finally, although the Census Bureau tries to minimize the changes to the ACS questionnaire, these changes will occur from time to time. Changes to a question can result in the inability to build certain estimates for a multiyear period containing the year in which the question was changed. In addition, if a new question is introduced during the multiyear period, it may be impossible to make estimates of characteristics related to the new question for the multiyear period.

13.9 CUSTOM DATA PRODUCTS

The Census Bureau offers a wide variety of general-purpose data products from the ACS designed to meet the needs of the majority of data users. They contain predefined sets of data for standard census geographic areas. For users whose data needs are not met by the general-purpose products, the Census Bureau offers customized special tabulations on a cost-reimbursable basis through the ACS custom tabulation program. Custom tabulations are created by tabulating data from ACS edited and weighted data files. These projects vary in size, complexity, and cost, depending on the needs of the sponsoring client.

Each custom tabulation request is reviewed in advance by the DRB to ensure that confidentiality is protected. The requestor may be required to modify the original request to meet disclosure avoidance requirements. For more detailed information on the ACS Custom Tabulations program, go to <http://www.census.gov/acs/www/Products/spec_tabs/index.htm>.
