Research Quarterly for Exercise and Sport
© 1997 by the American Alliance for Health, Physical Education, Recreation and Dance
Vol. 68, No. 1, pp. 44-55
Making Bootstrap Statistical Inferences: A Tutorial
Weimo Zhu
Bootstrapping is a computer-intensive statistical technique in which extensive computational procedures are heavily dependent on modern high-speed digital computers. The payoff for such intensive computations is freedom from two major limiting factors that have dominated classical statistical theory since its beginning: the assumption that the data conform to a bell-shaped curve, and the need to focus on statistical measures whose theoretical properties can be analyzed mathematically. The name "bootstrap" was derived from an old saying about pulling oneself up by one's own bootstraps. In this case, bootstrapping means redrawing samples randomly from the original sample with replacement. The key idea, computations, advantages, limitations, and application potential of bootstrapping in the field of physical education and exercise science are introduced and illustrated using a set of national physical fitness testing data. Finally, an example of a bootstrapping application is provided. Through a step-by-step approach, the development and implementation of the bootstrap statistical inference are illustrated.
Key words: methodology, computer-intensive statistics, nonparametric statistics
How many sit-ups can an average American 10-year-old girl do in 1 min? To what degree do females' body mass indexes correlate with their percentage of body fat? Researchers try to answer these kinds of population-related questions in their professional practices. In real-life situations, however, they rarely try to answer these questions by studying the population directly, due to the population size (e.g., all 10-year-old American girls) and the cost of such a study. Instead, researchers draw samples from the population and try to make inferences about the population based on sample statistics, such as means and standard deviations. The use of sample statistics to infer the value of a population parameter is called "point estimation."
In reality, however, the value of a sample statistic will usually not be exactly the same as the true population value because of sampling error. Therefore, it is necessary to qualify an estimation to indicate the general magnitude of this error, or the accuracy, of the estimation.
Often, this is done through constructing an "interval estimation," which describes a range of values within which a population parameter is likely to lie. The 95% confidence interval, in which the interval estimate is constructed by using ±1.96 standard errors from the statistic employed, is perhaps the most commonly used interval estimate. There is, however, an important assumption underlying this common practice (i.e., the distribution of the sample statistic, known also as the sampling distribution, must be "normal," or bell shaped). The assumption, therefore, is known also as the assumption of normality. In fact, most of the currently used "classical" statistics (e.g., Pearson's product-moment correlation coefficient, the t test, and analysis of variance [ANOVA]), which were developed at the end of the last century and the early part of this century when computing power was slow and expensive, are based on certain, often unverifiable, assumptions. In statistics, those based on such assumptions are referred to as parametric statistics.
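To make the ±1.96 standard-error construction concrete, here is a minimal sketch in Python (an illustration only; it borrows the first 10 sit-up scores that appear later in Table 1):

```python
import numpy as np

# First 10 sit-up scores from Table 1, reused purely for illustration.
scores = np.array([35, 28, 18, 32, 30, 5, 57, 41, 50, 49], dtype=float)

mean = scores.mean()
se = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean

# 95% interval estimate under the normality assumption: mean +/- 1.96 SE.
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"point estimate = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```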
Data collected in practice, however, often do not meet the assumptions. For example, the data usually are not distributed normally when the sample size is small or the original population itself is not distributed normally. In this situation, applying parametric statistics depends on the central limit theorem (i.e., if one repeats the drawing of samples from the population, the sampling distribution of a statistic approaches a bell-shaped curve).
In reality, however, it is almost impossible for practitioners or researchers to draw many repeated samples from a population. Therefore, parameters of sampling distributions must be estimated. This means
that traditional parametric statistical inference requires both a distribution assumption about a sampling distribution and a readily available formula to calculate the parameters of the distribution. In fact, that is one of the major reasons why statisticians in the past devoted most of their time to developing new formulas. Still, only limited statistics are available, because formulas for other statistical procedures have not been developed yet or are too difficult to derive mathematically. For example, researchers often use only the mean, median, and mode to estimate central tendency, although other statistics, such as the trimmed mean, may be more appropriate in certain circumstances. Furthermore, studies (DiCiccio & Romano, 1989; Efron, 1987) have shown that only in rare cases do parametric assumptions allow for an exact estimation of sampling distributions, especially when the sample size is small (conventionally n ≤ 30). There are three common alternatives when parametric statistics are not appropriate.
The first is to make statistical inferences using parametric statistics, disregarding the fact that the assumptions of the statistics have been violated but believing the results will be robust; in other words, that accurate inferences may be achieved even if the assumptions have been violated. The commonly used t test and ANOVA, for example, are quite robust with respect to violation of the normality assumption. The major drawback of this alternative is that the degree of assumption violation is often difficult to measure, although various robust indexes (see Hill & Dixon, 1982) have been developed.
The second alternative is to apply a transformation in which scores are changed systematically (e.g., taking the square root of each of the scores) so that the assumptions can be met. The major shortcomings are either that the transformation is not necessary (e.g., the use of variance-stabilizing transformations has little effect on the significance level and power of the F test [Budescu & Appelbaum, 1981]) or that the transformation may not be helpful at all (e.g., no transformation will make the data more suitable for ANOVA if the means of treatment levels are approximately equal but the variances of the error effects are heterogeneous [Kirk, 1982]). Furthermore, the transformation is sometimes limited by its own assumptions, and results based on the new scale are sometimes difficult to interpret.
The third alternative is to make inferences using classical nonparametric statistics. The sign test, the matched-pair Wilcoxon test, and the Kruskal-Wallis test are a few examples in this category. The major drawback of using nonparametric statistics is that much information about the data is lost in the computing process (i.e., only the ordering of observations is taken into account, not their values), and therefore the inference derived from the nonparametric statistic may sometimes lack the statistical power
needed. Recently developed bootstrapping statistics provide a new alternative.
Bootstrapping
Bootstrapping and Jackknife
Bootstrapping, invented by Efron (1979), was based on the idea of the jackknife technique (Quenouille, 1949; Tukey, 1958), which is used to reduce the bias of an estimator. The name "bootstrap" was derived from an old saying about pulling oneself up by one's own bootstraps, reflecting the fact that the one available sample gives rise to many others. Statistics in both techniques, known also as resampling techniques (Efron, 1982), are developed based on subsamples from a sample originally drawn from the population. However, jackknife resampling is conducted in a more systematic way: Each observation is omitted, in turn, to generate subsamples. If, for example, there are 20 observations in a sample, to generate the first jackknife sample statistic (e.g., the mean), omit the first observation from the sample and compute the mean based on the remaining observations; to generate the second jackknife mean, bring back the first observation, but omit the second one. Repeat this for all the observations until a total of 20 jackknife means are computed. Jackknife estimations then can be determined based on these jackknife means. Bootstrap resampling, in contrast, is conducted randomly, with replacement, and with many more repetitions, as described in detail later in this study. Although jackknife resampling does not place a heavy burden on computation, quite often the jackknife will not work as well as bootstrapping in practice (Efron & Tibshirani, 1993). For greater detail on jackknifing and its relationship to bootstrapping, refer to Miller (1974) and Efron and Tibshirani (1993).
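The systematic, leave-one-out character of the jackknife is easy to see in code. A minimal Python sketch of the procedure just described, using the 20 sit-up scores that appear later in Table 1:

```python
import numpy as np

def jackknife_means(sample):
    """Omit each observation in turn and compute the mean
    of the remaining n - 1 values (leave-one-out jackknife)."""
    sample = np.asarray(sample, dtype=float)
    return np.array([np.delete(sample, i).mean() for i in range(len(sample))])

# The 20 sit-up scores from Table 1.
sample = [35, 28, 18, 32, 30, 5, 57, 41, 50, 49,
          53, 44, 35, 39, 25, 40, 23, 11, 12, 30]
jk_means = jackknife_means(sample)
print(len(jk_means))  # -> 20, one jackknife mean per omitted observation
```

A bootstrap sample, by contrast, would be drawn randomly with replacement (e.g., `np.random.default_rng().choice(sample, size=20)`), and the number of repetitions is typically far larger than n.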
Bootstrapping and Computer-Intensive Statistics
The development of bootstrapping cannot be separated from the development of modern digital computers, because it is impossible to implement the bootstrapping idea, which requires many resamplings, without the assistance of modern computer power. In fact, that is the reason that bootstrapping, together with randomization or permutation and Monte Carlo tests, is referred to as computer-intensive statistics in the literature (Diaconis & Efron, 1983; Noreen, 1989). Computer-intensive statistics are free from two major limiting factors that have dominated classical statistical theory since its beginning: the assumption that the data conform to a bell-shaped curve, and the need to focus on statistical measures whose theoretical properties can be analyzed mathematically (Diaconis & Efron, 1983).
Bootstrapping Samples and Sampling Distribution
As with commonly used parametric inferential statistics, bootstrapping also is based on a sampling distribution. The bootstrap sampling distribution, however, is not developed from many repeated samples from the population or from any analytic formulas. More specifically, to develop a bootstrap sampling distribution, first draw a sample randomly from a population. Instead of drawing succeeding samples repeatedly from the population or computing parameters of the sampling distribution based on existing analytic formulas, redraw many subsamples, called bootstrap samples, randomly with replacement from the sample. Thus, bootstrapping here actually means redrawing samples randomly from the original sample, with replacement. Sample statistics, such as the mean and median, are then computed for each bootstrap sample and called bootstrap statistics. The distributions of the bootstrap sample statistics are the bootstrap sampling distributions of the statistics. Thus, both parametric statistics and bootstrap statistics estimate the parameters of a sampling distribution from a sample: Parametric statistics estimate the parameters based on available analytic formulas, which are restricted by related distribution assumptions, while bootstrap statistics estimate the parameters based on the bootstrap samples, which are free from these assumptions. The different steps taken to develop the sampling distribution and the bootstrap sampling distribution are illustrated in Figure 1.
[Figure 1. Differences in developing sampling distributions. Three routes are contrasted: many original samples from the population yield the sampling distribution of the sample statistic; one original sample from the population yields an estimated sampling distribution via parametric estimation; and one original sample yields a bootstrap sampling distribution via many bootstrap samples.]
How Bootstrapping Works
The idea behind bootstrapping is quite simple. Although it is impossible to get many samples from a population (F), it is possible to get repeated samples from an estimate (F̂) whose distribution approximates the population. Given a sample drawn randomly from the population, the cumulative distribution of the sample, known also as the "empirical distribution function," is the optimal estimator of the population. Let X = (x1, x2, ..., xn) denote a random sample of size n from a population. If F denotes the cumulative distribution function of the population, then F(x) = P(X ≤ x), where P(X ≤ x) denotes the probability of yielding a value less than or equal to x. This allows the cumulative distribution of the sample, denoted by F̂(x(i)), to become the maximum likelihood estimate of the population distribution function, F(x):
F̂(x(i)) = i/n = (number of x1, x2, ..., xn ≤ x(i)) / n,  (1)
where i is, in fact, the cumulative frequency and n is the sample size (see Rohatgi, 1984, pp. 234-236, for more details).
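Equation 1 amounts to sorting the sample and attaching the proportion i/n to the ith ordered value; a minimal sketch (the sit-up data of Table 1 are used for the check):

```python
import numpy as np

def empirical_cdf(sample):
    """Equation 1: order the sample and attach F_hat(x_(i)) = i / n
    to the ith ordered observation."""
    x = np.sort(np.asarray(sample, dtype=float))
    f_hat = np.arange(1, len(x) + 1) / len(x)
    return x, f_hat

x, f_hat = empirical_cdf([35, 28, 18, 32, 30, 5, 57, 41, 50, 49,
                          53, 44, 35, 39, 25, 40, 23, 11, 12, 30])
print(f_hat[x == 11])  # -> [0.1], i.e., F_hat(11) = 2/20, as in Table 1
```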
The following example illustrates how F̂(x(i)) is developed and implemented in determining the central tendency of a population. A small sample of sit-up data was drawn from the National Children and Youth Fitness Study (Ross & Gilbert, 1985), which was defined as the population¹ studied. There were 20 observations in this sample (Table 1). The maximum value was 57 and the minimum value was 5; M = 32.85, SD = 14.40. Arranging the data in order of increasing magnitude yields x(1) ≤ x(2) ≤ ... ≤ x(n); the cumulative distribution of the sample, F̂(x(i)), was then computed (Table 1) and plotted (Figure 2).
Table 1. A sit-up sample: data, cumulative distribution function, and related bootstrap samples

Raw data (n = 20): 35 28 18 32 30 5 57 41 50 49 53 44 35 39 25 40 23 11 12 30

Ordered observations, frequencies (f), cumulative frequencies (cf), and F̂(x(i)):

x(i):   5     11    12    18    23    25    28    30    32
f:      1      1     1     1     1     1     1     2     1
cf:     1      2     3     4     5     6     7     9    10
F̂:    0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.45  0.50

x(i):   35    39    40    41    44    49    50    53    57
f:       2     1     1     1     1     1     1     1     1
cf:     12    13    14    15    16    17    18    19    20
F̂:    0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00

Statistics of 10 bootstrap samples (each of size 20, drawn with replacement):

Bootstrap sample     Mean    Median   20% trimmed mean
 1                   34.90   35.00    36.08
 2                   28.95   30.00    30.33
 3                   29.25   31.00    28.58
 4                   31.85   37.00    32.75
 5                   35.30   40.00    36.83
 6                   33.60   32.50    34.33
 7                   36.45   39.00    37.08
 8                   28.25   24.00    25.50
 9                   31.95   35.00    33.08
10                   27.70   32.50    27.17

Bootstrap estimate   31.82   33.60    32.17
Standard error        3.17    4.71     3.91
For example, the cumulative frequency of x(2), that is, of the value 11, was 2, and the corresponding F̂(x(2)) was 0.10 (2/20), which meant that for this population the probability of yielding a value less than or equal to 11 was 0.10.

[Figure 2. Cumulative distribution function of the sample; the horizontal axis is sit-ups (number in 60 sec.), and the vertical axis is the cumulative proportion, from 0 to 1.]

A nice feature of the cumulative distribution function is that its value has a range from 0 to 1, and each x value has its own corresponding cumulative proportion (CP). For example, because F̂(x(1)) was 0.05 and F̂(x(2)) was 0.10, the cumulative proportion of the value 11, CP(11), was in the range 0.05 < CP(11) ≤ 0.10 (see Table 1). Based on the cumulative distribution function of the sample and its corresponding cumulative proportions, one can begin the bootstrap sampling process. First, generate a uniform random number between 0 and 1, using a random number generator. Then, determine the corresponding value of x according to the cumulative distribution function. Say, for example, the first uniform random number generated was 0.6812, which, according to the cumulative distribution function (Table 1), fell into the CP range of the value 40 (0.65 < CP(40) ≤ 0.70) (see Figure 2). The value 40, therefore, was selected and became the first observation in the first bootstrap sample. This process was repeated 20 times (each bootstrap sample was the same size as the original sample), and the first bootstrap sample, with 20 observations, was generated. Ten such bootstrap samples are illustrated in Table 1, including statistics of the resampled observations in these samples.

Based on these samples, bootstrap sample statistics were calculated. For example, for the first bootstrap sample, M = 34.90. The bootstrap estimates can then be computed based on the bootstrap sample statistics. The bootstrap estimate (θ̂_B) of the population mean, θ, can be defined as follows²:

θ̂_B = (1/B) Σ_{i=1}^{B} θ̂*_i,  (2)

where B is the number of bootstrap samples (B = 10 in this case) and θ̂*_i is the ith bootstrap sample statistic (here, the bootstrap sample mean). For the bootstrap estimate of the mean, which is the average of the bootstrap sample means, M = 31.82, SD = 3.17 (see Table 1). Because these bootstrap sample means really function the same way as real sample means, the mean of the bootstrap sampling distribution can also be expressed as θ̂_B = E(X̄_bootstrap) = μ, where μ is the population mean. In other words, the mean of the sampling distribution of bootstrap means, as in a parametric sampling distribution (Hays, 1981), is the same as the population mean. Similarly, the standard deviation of the bootstrap sampling distribution is the standard error or, more accurately, the bootstrap estimated standard error of the mean.
Two other bootstrap estimates, the median and the 20% trimmed mean, also are summarized in Table 1. By using the same bootstrap samples, other useful statistics (some of which cannot be computed using parametric statistics) can be computed at the same time, which is an appealing feature of bootstrapping.
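The whole resampling-and-estimation process just described fits in a short script. The following is a sketch, assuming the sit-up data of Table 1: uniform random numbers are mapped through the cumulative distribution function to select observations (the inverse-CDF method described above), B = 10 bootstrap samples are drawn, and bootstrap estimates and standard errors are computed per Equation 2. Because the resampling is random, the printed values will differ from run to run and will not exactly reproduce Table 1.

```python
import numpy as np

rng = np.random.default_rng(1997)  # arbitrary seed, for reproducibility only

data = np.array([35, 28, 18, 32, 30, 5, 57, 41, 50, 49,
                 53, 44, 35, 39, 25, 40, 23, 11, 12, 30], dtype=float)

def draw_bootstrap_sample(sample, rng):
    """Inverse-CDF resampling: for each draw, generate a uniform number
    and select the smallest ordered value whose cumulative proportion
    F_hat(x_(i)) = i/n covers it."""
    x = np.sort(sample)
    cp = np.arange(1, len(x) + 1) / len(x)   # cumulative proportions
    u = rng.uniform(size=len(x))             # one uniform number per draw
    return x[np.searchsorted(cp, u)]         # first index with cp >= u

def trimmed_mean(x, cut=0.20):
    """20% trimmed mean: drop the lowest and highest 20% of ordered values."""
    x = np.sort(x)
    k = int(len(x) * cut)
    return x[k:len(x) - k].mean()

B = 10
samples = [draw_bootstrap_sample(data, rng) for _ in range(B)]

# The same B samples serve every statistic at once.
for name, stat in [("mean", np.mean), ("median", np.median),
                   ("20% trimmed mean", trimmed_mean)]:
    values = [stat(s) for s in samples]
    # Bootstrap estimate (Equation 2) and bootstrap estimated standard error.
    print(f"{name}: estimate = {np.mean(values):.2f}, SE = {np.std(values, ddof=1):.2f}")
```

Note that drawing n uniform numbers through the inverse CDF is equivalent to drawing n observations from the original sample with replacement.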
Although a bootstrap sampling distribution, in theory, resembles a parametric sampling distribution, bootstrap statistics are rarely employed in point estimation and hypothesis testing.³ Rather, they are usually used for interval estimations, such as constructing standard errors and confidence intervals, and for other applications related to statistical accuracy, such as correcting bias and determining prediction errors (Hjorth, 1994). For point estimation and hypothesis testing, other computer-intensive statistical methods, such as randomization or permutation tests and Monte Carlo tests, often work better (Noreen, 1989).
Number of Bootstrap Samples
To apply bootstrap statistics, more than 10 bootstrap samples should be generated. Generally, 50 to 200 bootstrap samples are adequate to develop a bootstrap estimate of the standard error (Efron & Tibshirani, 1986). To build bootstrap confidence intervals, however, at least 1,000 bootstrap samples should be generated (Efron, 1988). To examine the effect of increasing the number of bootstrap samples (B) on bootstrap statistics, bootstrap statistics were computed when B = 50, 200, 500, 1,000, 2,000, and 5,000, using the same data set shown in Table 1. The results of the bootstrapping, as well as the "population" parameters and parametric statistics measuring central tendency, are summarized in Table 2. Overall, there was a trend that the more bootstrap samples, the better the estimation. The improvement in estimation accuracy, however, was limited, considering the degree of improvement from the estimates when B = 10 to those when B = 5,000. For the sit-up data employed, the parametric statistics provided similarly accurate estimations of the true parameters. In fact, as long as a sample is reasonably large (n > 30) and the population distribution is close to a bell-shaped curve, there is little need to replace parametric statistics with bootstrapping.
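A quick way to see the effect of B, in the spirit of Table 2, is to repeat the estimation at increasing numbers of bootstrap samples. A sketch follows; direct resampling with replacement stands in for the inverse-CDF mechanics, to which it is equivalent, and the values will differ slightly from Table 2 because the draws are random:

```python
import numpy as np

rng = np.random.default_rng(68)  # arbitrary seed
data = np.array([35, 28, 18, 32, 30, 5, 57, 41, 50, 49,
                 53, 44, 35, 39, 25, 40, 23, 11, 12, 30], dtype=float)

for B in (10, 50, 200, 500, 1000, 2000, 5000):
    means = [rng.choice(data, size=len(data), replace=True).mean()
             for _ in range(B)]
    print(f"B = {B:5d}: bootstrap estimate = {np.mean(means):6.2f}, "
          f"SE = {np.std(means, ddof=1):5.2f}")
```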
Original Sample Size and Statistical Power
Bootstrapping is particularly suited to small data sets, say n ≤ 30, because parametric assumptions are often not crucial in large data sets; but bootstrapping also has been applied widely to larger and more complex data sets for which parametric analytic formulas are either not available or not appropriate. It has been reported that in one-sample problems, bootstrapping will function well as long as n ≥ 8 (Efron, 1995). Generally,
the statistical power of a bootstrap inference, like that of classical inferential statistics, increases as the sample size increases. However, accurate statistical inference based on bootstrapping depends more on how the bootstrap samples are drawn from an original sample.
Table 2. The effect of increasing the number of bootstrap samples (B)

                            Mean     Median   20% trimmed mean
True parameters
  θ                         33.92    34.00    33.88
Parametric statistics
  θ̂                         32.85    33.50    33.50
  σ̂(statistic)              3.22a    4.33b    -c
Bootstrap estimates
  B = 10:     θ̂_B           31.82    33.60    32.17
              σ̂_B            3.18     4.71     3.91
  B = 50:     θ̂_B           32.43    32.46    32.83
              σ̂_B            3.78     4.78     4.42
  B = 200:    θ̂_B           32.64    33.52    33.28
              σ̂_B            3.07     3.76     3.52
  B = 500:    θ̂_B           32.77    33.21    33.25
              σ̂_B            3.27     4.01     3.83
  B = 1,000:  θ̂_B           32.71    33.33    33.26
              σ̂_B            3.19     3.88     3.60
  B = 2,000:  θ̂_B           32.82    33.39    33.38
              σ̂_B            3.18     3.78     3.65
  B = 5,000:  θ̂_B           32.90    33.49    33.45
              σ̂_B            3.13     3.81     3.57

a The standard error was estimated by the formula σ_mean = s/n^(1/2), where s is the standard deviation of the sample and n is the sample size.
b The standard error was estimated by the formula σ_median = (a - b)/3.4641, where a is the value of the (n/2 + (3n)^(1/2)/2)th observation and b is the value of the (n/2 - (3n)^(1/2)/2)th observation.
c No existing analytic formula was available.
The bootstrap samples should be drawn from the original sample in the same way the original sample was drawn from the unknown population. In addition, the desired statistical power, as well as the sample size, of a bootstrap statistical inference can be computed based on a preliminary bootstrapping data analysis. For more details, refer to the related chapter by Efron and Tibshirani (1993) and the related descriptions by Hall and Wilson (1991).
Bootstrap Computing
Bootstrap computing is very straightforward: (a) generate a bootstrap sample randomly with replacement from the original sample; (b) compute bootstrap sample statistics and save them; (c) repeat steps (a) and (b) B times; and (d) compute bootstrap estimates. The amount of computing required by most bootstrap applications, although considerable, is well within the capabilities of today's personal computers. Computing-language code for commonly used statistics has been included in several introductory texts (see Mooney & Duval, 1993, and Noreen, 1989), and most statistical software packages, such as SAS, SPSS-X, and S-Plus⁴, have the capacity to conduct bootstrapping with some programming. Also, several computer programs specializing in bootstrapping have been developed (see Boomsma, 1991, and Lunneborg, 1987).
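Steps (a) through (d) map directly onto a short general-purpose helper; a sketch (the function and its interface are illustrative, not taken from any named package):

```python
import numpy as np

def bootstrap(sample, statistic, B=1000, seed=None):
    """(a) resample with replacement, (b) compute and save the bootstrap
    sample statistic, (c) repeat B times, and (d) return the bootstrap
    estimate and its estimated standard error."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    thetas = np.array([statistic(rng.choice(sample, size=len(sample)))
                       for _ in range(B)])
    return thetas.mean(), thetas.std(ddof=1)

# Example: a bootstrap estimate and standard error of the median,
# using the sit-up data of Table 1.
est, se = bootstrap([35, 28, 18, 32, 30, 5, 57, 41, 50, 49,
                     53, 44, 35, 39, 25, 40, 23, 11, 12, 30],
                    np.median, B=2000, seed=1)
print(f"median: estimate = {est:.2f}, SE = {se:.2f}")
```

Any statistic can be passed in, including ones with no analytic standard error formula, such as the trimmed mean discussed below.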
Advantages of Bootstrapping
The major advantage of bootstrapping over traditional parametric statistics is that it sometimes may be better to draw conclusions about the parameters of a population strictly from the sample at hand (e.g., a small sample) than to make perhaps unrealistic assumptions about the population. Another advantage of bootstrapping is that it provides a useful alternative when formulas for parametric statistics are not available. For example, the x% trimmed mean is a very useful measure of central tendency when sample data come from a long-tailed population (Huber, 1981). This robust estimator is obtained by "trimming" the data to exclude values far from the others, producing a smaller measurement error. In the past, however, there were no analytic parametric formulas for computing the standard error of this statistic. Thus, the accuracy of the estimation could not be determined, and the application of the statistic was limited. By using bootstrapping, the problem can be solved easily: Bootstrap sampling distributions were generated for the x% trimmed means (i.e., the 20% trimmed mean in this study), and the standard deviations of the distributions were used as the estimated standard errors of the statistic (see Table 2).
Bootstrapping has been applied to a variety of fields: medicine (Koehler & McGovern, 1990), psychophysiology (Wasserman & Bockenholt, 1989), social science (Hand, 1990), and education (Maguire, 1986), as well as to a variety of statistics, such as factor analysis (Borrello & Thompson, 1989), repeated measures analysis of variance (Lunneborg & Tousignant, 1985), and covariance structure analysis (Bollen & Stine, 1990). Bootstrapping has great application potential in the field of physical education (PE) and exercise science because many of the data distributions are skewed (e.g., pull-up data in fitness testing), and many researchers use small samples in their studies. For more information about bootstrapping, see the introductory texts by Efron and Tibshirani (1993) and Mooney and Duval (1993), and a more technical reference by Efron (1982).
Limitations of Bootstrapping
No statistical procedure always yields the correct answer. This is true also for bootstrapping, especially if the resampling scheme does not parallel the structure of the actual sampling mechanism. Recall that a bootstrap sampling distribution is derived from one "original" sample from the population. If this original sample is not a representative sample of the population, bootstrapping may give misleading results. Efron and Tibshirani (1993) also reported that bootstrapping may fail when the population distribution is extremely skewed. Furthermore, a biased random number generator could lead to the failure of bootstrapping; therefore, every effort should be made to control and monitor the function of the random number generator (see Ripley, 1987, for related tests).
An Example
To demonstrate that bootstrapping is a useful alternative and complement to parametric statistics, the author illustrates how bootstrapping can assist in developing a confidence interval and making a bootstrap statistical inference when a parametric analytic formula is not appropriate. An existing data set served as the population being studied, so that bootstrap estimates could be compared with the "true parameters."
Correlation is one of the most commonly used statistical procedures in the field of PE and exercise science. To validate a field test, for example, the scores from a field test often are correlated with the scores from a criterion measure. In reality, correlations vary across studies, even when the same measures are applied, because of sampling error. Statisticians, therefore, have developed indexes, such as the standard error and confidence intervals, to determine the accuracy of sample statistics. Following is an example of how bootstrapping can help develop a confidence interval of a population correlation.
An existing body composition data set was used (Jackson & Pollock, 1985), in which the percentage of body fat of 678 participants was determined by hydrostatic, or underwater, weighing. A number of anthropometric variables, such as height, weight, and skinfolds, were also
measured. The data for the female participants (N = 279) were employed as the example population, and the body mass index [BMI = weight (kg)/height² (m²)] was used as the field test to predict participants' body fat. The scores of BMI and percentage of body fat are plotted in Figure 3 (top). To validate the BMI, the author computed its correlation (i.e., Pearson's product-moment correlation coefficient, ρ) with percentage of body fat, the criterion measure; ρ = 0.707. In real life, however, researchers often do not know the population parameter but try to make an inference from a sample statistic. To generate a sample statistic, a small sample (n = 15 pairs) was drawn (see Table 3) and plotted in Figure 3 (bottom). The correlation coefficient of this sample (r) was 0.733, which was the point estimate of the population parameter (ρ). It provided a reasonably accurate estimate of the population parameter.
Table 3. Sample, Fisher's z transformation, and confidence intervals

Sample
Pair:        1      2      3      4      5      6      7      8
BMI:        19.11  20.03  18.79  19.74  25.24  19.63  20.46  22.51
Body fat %: 25.04  22.32  20.08  19.19  37.68  12.61  19.28  25.23

Pair:        9     10     11     12     13     14     15
BMI:        20.08  21.18  20.64  24.09  25.62  20.74  20.42
Body fat %: 17.51  17.06  22.05  42.04  28.72  29.65  24.59

Sample statistics
r = 0.733

Parametric estimation of the 95% confidence interval
SE(r) = (1 - r²)/(n - 1)^(1/2) = (1 - 0.733²)/(15 - 1)^(1/2) = 0.124
[r - 1.96 × SE(r) ≤ ρ ≤ r + 1.96 × SE(r)] = [0.733 - 1.96 × 0.124 ≤ ρ ≤ 0.733 + 1.96 × 0.124] = [0.490 ≤ ρ ≤ 0.976]

Fisher's z transformation of the 95% confidence interval
z_r = 1/2[ln(1 + r) - ln(1 - r)] = 1/2[ln(1 + 0.733) - ln(1 - 0.733)] = 0.935
SE(z_r) = 1/(n - 3)^(1/2) = 1/(15 - 3)^(1/2) = 0.289
[z_r - 1.96 × SE(z_r), z_r + 1.96 × SE(z_r)] = [0.935 - 1.96 × 0.289, 0.935 + 1.96 × 0.289] = [0.369, 1.501] (on the z scale)
Transformed back: tanh(0.369) = 0.353; tanh(1.501) = 0.905

Summary of 95% confidence intervals
Real samples ("true"):      [0.335 ≤ ρ ≤ 0.887]
Parametric formula:         [0.490 ≤ ρ ≤ 0.976]
Fisher's z transformation:  [0.353 ≤ ρ ≤ 0.905]
Bootstrap percentile:       [0.385 ≤ ρ ≤ 0.897]

Note. BMI = body mass index.

It was known, however, that if another sample of the same size were drawn randomly from the population, the new r might be different from the previous one, due to sampling error. Therefore, it was necessary to develop an interval estimate of the population parameter. The commonly used 95% confidence interval was employed as the interval estimate.

Because the author had the "population" data, a "real" sampling distribution of the correlation coefficients could be developed according to the theoretical approach illustrated in Figure 1. Based on this sampling distribution, a "true" 95% confidence interval could be developed to compare with the intervals developed by other approaches. The author randomly drew another 1,999 samples (each with n = 15 pairs) with replacement, computed correlation coefficients for each sample, and plotted them (N = 2,000) in a histogram (see the top of Figure 4). The sampling distribution, as expected, was
negatively skewed, M = 0.691, SD = 0.145. Because the "real" sample statistics from the population were known, the confidence interval could be computed differently; that is, the confidence interval was no longer based on the standard error, as usual, but was developed directly from the sample statistics. Because the α level was set at 0.05, the corresponding percentiles for locating the lower and upper endpoints of the confidence interval should be the 2.5th (0.05/2) percentile and the 97.5th (1 - 0.05/2) percentile. Therefore, for the 95% confidence interval, the author simply sorted all the sample correlation coefficients and counted up from the smallest sorted value to the 50th smallest value, which was 0.335. Similarly, the author counted down from the highest value to the 50th highest value, which was 0.887. Thus, the "true" 95% confidence interval for the population correlation was 0.335 ≤ ρ ≤ 0.887 (see Figure 4).

[Figure 3. The top panel is a scatterplot of the "population" data (N = 279), body mass index versus percentage of body fat; circles indicate the sample data that are plotted in the bottom panel.]

[Figure 4. Sampling distributions and the 95% confidence intervals of correlation coefficients based on real and bootstrap samples.]

In reality, however, the population data did not exist for drawing many samples. Therefore, the confidence interval was developed from a single sample using the existing parametric formula. The parametric analytical formula for the correlation coefficient is as follows:

P[r - 1.96 × SE(r) ≤ ρ ≤ r + 1.96 × SE(r)] = 0.95,  (3)

where SE(r) is the standard error of r: SE(r) = (1 - r²)/(n - 1)^(1/2). The formula, however, is rarely used in practice, because when the population correlation (ρ) differs from zero and the sample is not extremely large, the sampling distribution of correlation coefficients is usually skewed. This is demonstrated by the "true" sampling distribution in Figure 4. Therefore, the confidence interval based on this formula is often incorrect. To illustrate, the author computed an interval estimate using Formula 3 with the first sample, and the resulting 95%
confidence interval was 0.490 ≤ ρ ≤ 0.976 (see Table 3). As expected, it was an inaccurate estimation when compared with the "true" confidence interval.
One alternative, as described earlier, is to apply a transformation when assumptions of parametric statistics have been violated. For correlation coefficients, fortunately, a transformation procedure, known as Fisher's z transformation, was developed. More specifically, the sample r was first transformed to z_r through Fisher's z transformation (see Table 3). The 95% confidence interval of z_r, based on a standard error, was then developed. The standard error of z_r is a function of sample size: SE(z_r) = 1/(n - 3)^(1/2). After converting z_r back to r, the 95% confidence interval based upon the transformation was developed (0.353 ≤ ρ ≤ 0.905). By comparing this interval with the "true" confidence interval, it was clear that the transformation was successful.
Transformation procedures, however, are not without their limitations. First, few work as well as Fisher's z transformation. Second, the transformations may be limited by the assumptions on which they are based. For example, Fisher's z transformation assumes that the correlated variables follow a bivariate normal distribution. While the procedure is often robust to violation of this assumption, the estimation could be misleading when the population correlation is high and the sampling distribution of z_r is markedly skewed. Finally, a violation of the assumptions related to the transformation may not be easy to detect. For example, it is difficult to detect the degree of violation of the "bivariate normal distribution" assumption when little is known about the population.
Bootstrapping can help in developing a confidence interval free from the limitations of often unrealistic and unverifiable assumptions. Bootstrap statistical inference, as illustrated in Figure 1, is developed in a paradigm similar to the theoretical approach. The only difference is that in the theoretical approach, many samples are redrawn from the population, while in bootstrapping, many samples are redrawn from a single sample. The steps to developing a bootstrap confidence interval for rho (ρ) follow:
Step 1: Draw a Random Sample. To develop a bootstrap confidence interval of a parameter (the population correlation between BMI and percentage of body fat, in this case), a sample was drawn from the population described earlier (see Figure 3 and Table 3).
Step 2: Redraw Bootstrap Samples. Next, many bootstrap samples were redrawn, randomly and with replacement, from the sample to develop a bootstrap sampling distribution. A total of 2,000 bootstrap samples (each with n = 15 pairs) were drawn. These procedures were completed by using the BOJA bootstrap computer software (Boomsma, 1991).
Step 3: Compute and Sort Bootstrap Sample Statistics. To develop a bootstrap sampling distribution, the bootstrap sample statistic, in this case the correlation between the
BMI and percentage of body fat, was computed. For example, the correlation coefficient was 0.689 for the first bootstrap sample and 0.730 for the second. Similar computations were repeated for all 2,000 bootstrap samples. All correlation coefficients were then sorted, which constituted the bootstrap sampling distribution (see the bottom of Figure 4); M = 0.724, SD = 0.132, which provided an accurate estimate of the "true" sampling distribution. Furthermore, the bootstrap sampling distribution, similar to the "true" sampling distribution, was negatively skewed.
Step 4: Determine the Confidence Interval. The confidence interval was computed using the percentile method. As with the "real" samples, all bootstrap correlation coefficients were sorted. Then, the author counted up from the smallest sorted value to the 50th smallest value, which was 0.385, and counted down from the highest value to the 50th highest value, which was 0.897. Thus, the 95% percentile bootstrap confidence interval of the population correlation was 0.385 ≤ ρ ≤ 0.897. By comparing it with the "true" confidence interval, the bootstrapping also provided an accurate estimation (see Table 3 and Figure 4). Generally, the choice of bootstrap confidence interval method depends on the practical research situation a researcher faces. For more information about these methods, refer to Mooney and Duval (1993) for a general introduction and to Efron (1982) and Efron and Tibshirani (1993) for technical details.
Step 5: Make Statistical Inference. Based on the bootstrap confidence interval developed by the percentile method, the author concluded, with 95% confidence, that the interval 0.385-0.897 covered the population correlation (ρ) between participants' body mass indexes and their percentage of body fat.
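Steps 1 through 5 can be reproduced without specialized software; a sketch using the n = 15 pairs of Table 3 follows. Pairs are resampled together so each bootstrap sample preserves the BMI-body fat pairing, and the endpoints depend on the random draws, so they will not exactly match the BOJA run reported above ([0.385, 0.897]).

```python
import numpy as np

rng = np.random.default_rng(279)  # arbitrary seed

# The n = 15 (BMI, % body fat) pairs from Table 3.
bmi = np.array([19.11, 20.03, 18.79, 19.74, 25.24, 19.63, 20.46, 22.51,
                20.08, 21.18, 20.64, 24.09, 25.62, 20.74, 20.42])
fat = np.array([25.04, 22.32, 20.08, 19.19, 37.68, 12.61, 19.28, 25.23,
                17.51, 17.06, 22.05, 42.04, 28.72, 29.65, 24.59])

B, n = 2000, len(bmi)
r_stars = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)              # resample pairs with replacement
    r_stars[b] = np.corrcoef(bmi[idx], fat[idx])[0, 1]

# Percentile method: the 2.5th and 97.5th percentiles of the sorted r*'s
# (the 50th smallest and 50th largest of the 2,000 values).
lo, hi = np.percentile(r_stars, [2.5, 97.5])
print(f"95% percentile bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```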
One may argue that a similarly accurate estimation was achieved by Fisher's z transformation. While it is true that Fisher's z transformation worked well in this case, bootstrapping demonstrated three advantages over conventional procedures in developing the confidence interval of ρ. First, bootstrapping, which is free from distribution assumptions, provided a more accurate estimation of the "true" confidence interval than the parametric approach, whose assumption had been violated. Second, bootstrapping provided an estimation of the "true" confidence interval similar to that of the well developed Fisher's z transformation, which sometimes may be limited by its own assumption. Indeed, there is no need to apply Fisher's z transformation when bootstrapping is employed, because bootstrapping automatically accomplishes much of what Fisher's z transformation does (see Stine, 1990, for a more detailed explanation). Furthermore, few transformations as well developed as Fisher's z are available for practical purposes. Finally, it is much simpler and faster to construct a confidence interval using bootstrapping than by
using conventional methods. The computations of Fisher's z transformation illustrated in Table 3, although not too complex, can confuse the user. In contrast, by using a 486 personal computer and the BOJA software, the author was able to construct the bootstrap confidence intervals and compute other useful information (e.g., descriptive statistics) in less than 1 min. It is expected that bootstrapping applications will become even more convenient as computer speed increases and commonly used statistical software packages integrate bootstrapping into their analytical routines.
However, bootstrapping may not work all the time or under all conditions. For example, Rasmussen (1987) found that bootstrapping resulted in overly restricted confidence intervals and overly liberal Type I error rates for correlation coefficients. Others (Efron, 1988; Strube, 1988), however, reported that bootstrap intervals of the correlation coefficient performed quite well. At any rate, bootstrap statistical inference is a more appropriate choice when an analytic formula is not available, when the assumption of a parametric analytic formula is violated, or when the assumption of a transformation is suspect.
Conclusion
The development of bootstrapping, a major computer-intensive statistical method, has provided not only a nonparametric⁵ approach to analyzing data for which parametric statistics are not appropriate but also a new way to explore statistics that traditionally could not be applied because of a violation of statistical assumptions or a lack of analytic formulas. The key idea, computations, and advantages of bootstrapping were illustrated in this study using a set of real data from the field. An example of how to apply bootstrapping to statistical inference was also provided. The major advantage of bootstrapping over traditional statistics is that sometimes it may be better to draw conclusions about the parameters of a population strictly from the sample at hand rather than making perhaps unrealistic assumptions about the population. Finally, bootstrapping should not be considered preferable to parametric statistics but rather an alternative, or supplement, when parametric statistics are neither appropriate nor available.
References
Bollen, K. A., & Stine, R. A. (1990). Direct and indirect effects: Classical and bootstrap estimates of variability. In C. C. Clogg (Ed.), Sociological methodology 1990 (pp. 115-140). Oxford, England: Blackwell.
Boomsma, A. (1991). BOJA: A program for bootstrap and jackknife analysis [Computer program]. Groningen, The Netherlands: iec ProGAMMA.
Borrello, G. M., & Thompson, B. (1989). A replication bootstrap analysis of the structure underlying perceptions of stereotypic love. The Journal of General Psychology, 116, 317-327.
Budescu, D. V., & Appelbaum, M. I. (1981). Variance stabilizing transformations and the power of the F test. Journal of Educational Statistics, 6, 55-74.
Diaconis, P., & Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 248(5), 116-130.
DiCiccio, T. J., & Romano, J. P. (1989). The automatic percentile method: Accurate confidence limits in parametric models. Canadian Journal of Statistics, 17, 155-169.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7, 1-26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82, 171-185.
Efron, B. (1988). Bootstrap confidence intervals: Good or bad? Psychological Bulletin, 104, 293-296.
Efron, B. (Speaker). (1995). Computers, bootstraps, and statistics (Cassette Recording No. RA5-31.03). Washington, DC: American Educational Research Association.
Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54-77.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Hall, P., & Wilson, S. R. (1991). Two guidelines for bootstrap hypothesis testing. Biometrics, 47, 757-762.
Hand, M. L. (1990). A resampling analysis of federal family assistance program quality control data: An application of the bootstrap. Evaluation Review, 14, 391-410.
Hays, W. (1981). Statistics (3rd ed.). New York: CBS College Publishing.
Hill, M., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data. Biometrics, 38, 377-396.
Hjorth, U. (1994). Computer intensive statistical methods: Validation, model selection and bootstrap. New York: Chapman & Hall.
Huber, P. (1981). Robust statistics. New York: John Wiley & Sons.
Jackson, A. S., & Pollock, M. L. (1985). Practical assessment of body composition. The Physician and Sportsmedicine, 13, 76-90.
Kirk, R. E. (1982). Experimental design for the behavioral sciences (2nd ed.). Belmont, CA: Brooks/Cole.
Koehler, K. J., & McGovern, P. G. (1990). An application of the LEP survival model to smoking cessation data. Statistics in Medicine, 9, 409-421.
Lunneborg, C. E. (1987). Bootstrap applications for the behavioral sciences. Educational and Psychological Measurement, 47, 627-629.
Lunneborg, C. E., & Tousignant, J. P. (1985). Efron's bootstrap with application to the repeated measures design. Multivariate Behavioral Research, 20, 161-178.
Maguire, T. O. (1986). Applications of new directions in statistics to educational research. The Alberta Journal of Educational Research, 2, 154-171.
Miller, R. G. (1974). The jackknife: A review. Biometrika, 61, 1-15.
Mooney, C. Z., & Duval, R. D. (1993). Bootstrapping: A nonparametric approach to statistical inference. Newbury Park, CA: Sage.
Noreen, E. W. (1989). Computer-intensive methods for testing hypotheses. New York: John Wiley & Sons.
Quenouille, M. H. (1949). Approximate tests of correlation in time-series. Journal of the Royal Statistical Society, Series B, 11, 68-84.
Rasmussen, J. L. (1987). Estimating correlation coefficients: Bootstrap and parametric approaches. Psychological Bulletin, 101, 136-139.
Ripley, B. D. (1987). Stochastic simulation. New York: John Wiley & Sons.
Rohatgi, V. K. (1984). Statistical inference. New York: John Wiley & Sons.
Ross, J. G., & Gilbert, G. G. (1985). The National Children and Youth Fitness Study: A summary of findings. Journal of Physical Education, Recreation, and Dance, 56, 45-50.
Stine, R. (1990). An introduction to bootstrap methods: Examples and ideas. Sociological Methods and Research, 18, 243-291.
Strube, M. J. (1988). Bootstrap Type I error rates for the correlation coefficient: An examination of alternate procedures. Psychological Bulletin, 104, 290-292.
Tukey, J. W. (1958). Bias and confidence in not quite large samples (Abstract). The Annals of Mathematical Statistics, 29, 614.
Wasserman, S., & Bockenholt, U. (1989). Bootstrapping: Applications to psychophysiology. Psychophysiology, 26, 208-221.
Notes
1. The population is usually unknown in studies. For the purpose of illustration, this national sample is defined as the "population" so that bootstrap estimates could be compared with known parameters.
2. Throughout, the author used θ to denote a true parameter of a population and θ̂ to denote an estimate of θ based on sample data.
3. Bootstrapping, although less accurate than permutation tests, can also be used in hypothesis testing, and, often, it is more widely applicable than permutation tests. For a more thorough treatment of this topic, refer to Efron and Tibshirani (1993).
4. S-Plus, developed originally at AT&T's Bell Laboratories, is a data analysis tool published by Statistical Sciences, Inc.
5. An idea of a "parametric bootstrap" has also been proposed; refer to Efron and Tibshirani (1993, Chapter 21) for more detail.
Author's Notes
Weimo Zhu is with the Division of Health, Physical Education, and Recreation at Wayne State University. The author thanks Andrew S. Jackson of the University of Houston for allowing the use of his data and Robert W. Schutz of the University of British Columbia and Marilyn A. Looney of Northern Illinois University for their helpful comments and suggestions on the manuscript. Please direct all correspondence regarding this article to Weimo Zhu, Division of Health, Physical Education, and Recreation, Wayne State University, 257 Matthaei Building, Detroit, MI 48202. Submitted: January 7, 1994. Accepted: July 29, 1996.
E-mail: [email protected]